Message ID | 20230302062741.483079-1-jstultz@google.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:5915:0:0:0:0:0 with SMTP id v21csp4080851wrd; Wed, 1 Mar 2023 22:45:05 -0800 (PST) X-Google-Smtp-Source: AK7set8JsGMVwBghJys9a31lb4JmdpZNqiE4u1oaDLh4QMF44NrQZFBQ+q0TpB7ZyUzQPgkOKOja X-Received: by 2002:a17:906:ac6:b0:8af:5403:992d with SMTP id z6-20020a1709060ac600b008af5403992dmr8870981ejf.28.1677739504930; Wed, 01 Mar 2023 22:45:04 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1677739504; cv=none; d=google.com; s=arc-20160816; b=d2xueKzCZRA6oHZQdPbzZRDJ4Wfpl+GJXy6NlhLT39Z9n83NYwo9qX87TKYep0kApw Q6i2m7B7YuTEfACVCnIQpn0PhFx2DBv1waju/KTOnIIqqv1NZA2xmnvDr2ttxrxXHc3m rZJhpfnwSIIkIT2F5fp8xa+zJ6hV5WjQfZCDMh29hB8f5mb+tyfGuKp+MavrbHFZ8/45 TtV8wSAt6Pk2+z83jNwFkqN8CewT7YVbsEGTf2l2Sku8uf5u+seurkJTF6vSel+W7bUs r+4YI3v8640jNH/PU7MMBNiVbwVJmXvrlNtfbvedAw0XIiw4d1buqFVg4Rq+BsBFToK7 samQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:from:subject :message-id:mime-version:date:dkim-signature; bh=1s9VZXOhJ7DsOR8N1/cvuhFDdJDJYtoiVm83KN71Avo=; b=dA2h4gC4CvhNPkM6z0OS0Fh29tZsoIvNjP6/S8wbpSPvnEDnDVuNjrSoNnmBDdjAfC Qco19gzz71orHNJ0aX09NXI5Bgwl1JtUb1aNFV8obieS3xw6QuxCaotfn+Dlfom5ml0V CRfzsfkoaLectsYMpIwtU4pPS5be3nfe2B5WBfUA3fbiLleEtgMvyAGDaWMhAAt5fbs4 QahRB3AVxeiULbpAzMEZDhYKdirNapSHwjSsRkPw/mvPZCnHpdy4Pdtb7IXGdlcyrkZN zn56DK5tYRkppB32NWUnPifw8KRMnq39xJdv0l2pKLHBz4c3FU4DWBu7fR668vKhBvFx 3OJg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="sS4d3/dl"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. 
[2620:137:e000::1:20]) by mx.google.com with ESMTP id g14-20020a170906594e00b008c64d11f496si6535205ejr.809.2023.03.01.22.44.41; Wed, 01 Mar 2023 22:45:04 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="sS4d3/dl"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229962AbjCBG1r (ORCPT <rfc822;davidbtadokoro@gmail.com> + 99 others); Thu, 2 Mar 2023 01:27:47 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60354 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229501AbjCBG1q (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 2 Mar 2023 01:27:46 -0500 Received: from mail-pj1-x1049.google.com (mail-pj1-x1049.google.com [IPv6:2607:f8b0:4864:20::1049]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A52591BF9 for <linux-kernel@vger.kernel.org>; Wed, 1 Mar 2023 22:27:44 -0800 (PST) Received: by mail-pj1-x1049.google.com with SMTP id cl18-20020a17090af69200b0023470d96ae6so1624396pjb.1 for <linux-kernel@vger.kernel.org>; Wed, 01 Mar 2023 22:27:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:message-id :mime-version:date:from:to:cc:subject:date:message-id:reply-to; bh=1s9VZXOhJ7DsOR8N1/cvuhFDdJDJYtoiVm83KN71Avo=; b=sS4d3/dlStG8D73OoSC5Gh4GlPDzm5bigEQTEoTH5Rsdt+9eAchZ+PIbrl6RGoJB1u ZpHwXibPE5F3+gnZcTnrR0AeAn9Zp3++9pAD7s4Fskd0cnPzN/t9OPd7B4q03tXobnEA 6qF5DCbrXoy0UmUi9fBP/s5bjV1eMUaf6Mmz2mWLvnkjLsY/gDM2YVo9pBNbOuHtMrQN jf3gI1XceQ0y8HsBqz5HG+hHXHfHR6t7ApqIkK9cDO9fJYn8LYMJ9+9CQduvYiUnKxXF LYV5ZsCzmlnOrHmPrn1+oMbYR/RP3bgbZGjcHPX1rWlZd2tvqzQxcjizAXMdg8cq86RQ Xs0Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:message-id :mime-version:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=1s9VZXOhJ7DsOR8N1/cvuhFDdJDJYtoiVm83KN71Avo=; b=Vx+TYq0hOmybsKYED5VvcZi3ftJylSj+H8J/bOLdsQDGSCk5JMqeQ36ESvxfqk8sZo N4GNDhhggkAJC8U7Qip4hLbL97NPxx+0t7wNfFc9UvSl7QBWcFXcG2cdJIdXuluqFQv+ d3/+jqVmRKPtmWYRpqvPd83E2fK3G136OZPvQUTlt0lt+ZlJurWvO+J32jp/YErLfzdD y0LOFWxaXaTZ6DyzaI+G5lvdXBP1DF3RHfgpYf5R9PM3CIV1iv1vFPcqnmvj3RGEaJYe X+NcGn28TmAGhDCZmoruAypV1kolD7TGM8Nsl7Q37xCm5QBB/vNwQCXGEuJbCM3ITdGM yQFQ== X-Gm-Message-State: AO0yUKUnd4m9pJ7leUPYC9hN0looJ2LX3c3iR2sC6wZ9D58kpfWW1Elg sfiewmhcpzzgnS98bAnbjppfC9w7qe1o8QuWFcBAx2Ku1aYikUjR0QdhF0Bmb2SCwvDe+9C3Tso Kc4YflAtA949HO0DYsLt6I9WQXThJ9ULmOOxkse3qBPrDF7Eeze0+JigjbO3AcB/5dTyetbw= X-Received: from jstultz-noogler2.c.googlers.com ([fda3:e722:ac3:cc00:24:72f4:c0a8:600]) (user=jstultz job=sendgmr) by 2002:a05:6a00:be1:b0:5aa:310c:e65b with SMTP id x33-20020a056a000be100b005aa310ce65bmr3419431pfu.2.1677738463969; Wed, 01 Mar 2023 22:27:43 -0800 (PST) Date: Thu, 2 Mar 2023 06:27:41 +0000 Mime-Version: 1.0 X-Mailer: git-send-email 2.39.2.722.g9855ee24e9-goog Message-ID: <20230302062741.483079-1-jstultz@google.com> Subject: [PATCH] pstore: Revert pmsg_lock back to a normal mutex From: John Stultz <jstultz@google.com> To: LKML <linux-kernel@vger.kernel.org> Cc: 
John Stultz <jstultz@google.com>, Wei Wang <wvw@google.com>, Midas Chien <midaschieh@google.com>, Steven Rostedt <rostedt@goodmis.org>, Kees Cook <keescook@chromium.org>, Anton Vorontsov <anton@enomsg.org>, "Guilherme G. Piccoli" <gpiccoli@igalia.com>, Tony Luck <tony.luck@intel.com>, kernel-team@android.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1759237379141527337?= X-GMAIL-MSGID: =?utf-8?q?1759237379141527337?= |
Series | pstore: Revert pmsg_lock back to a normal mutex |
Commit Message
John Stultz
March 2, 2023, 6:27 a.m. UTC
This reverts commit 76d62f24db07f22ccf9bc18ca793c27d4ebef721.
So while priority inversion on the pmsg_lock is an occasional
problem that an rt_mutex would help with, in workloads where
many threads write to pmsg heavily, the pmsg_lock can be
heavily contended.
Normal mutexes can do adaptive spinning, which keeps the
contention overhead fairly low, maybe adding on the order of 10s
of microseconds of delay waiting, but the slowpath with
rt_mutexes makes the blocked tasks sleep and wake. This makes
matters worse when there is heavy contention, as it just allows
additional threads to run and line up to try to take the lock.
It devolves to a worst-case scenario where the lock acquisition
and scheduling overhead dominate, and each thread waits on the
order of ~ms to do ~us of work.
Obviously, having tons of threads all contending on a single
lock for logging is non-optimal, so the proper fix is probably
reworking pstore pmsg to have per-cpu buffers so we don't have
contention.
But in the short term, let's revert the change to the rt_mutex
and go back to normal mutexes to avoid a potentially major
performance regression.
Cc: Wei Wang <wvw@google.com>
Cc: Midas Chien <midaschieh@google.com>
Cc: "Chunhui Li (李春辉)" <chunhui.li@mediatek.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: kernel-team@android.com
Fixes: 76d62f24db07 ("pstore: Switch pmsg_lock to an rt_mutex to avoid priority inversion")
Reported-by: "Chunhui Li (李春辉)" <chunhui.li@mediatek.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
fs/pstore/pmsg.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
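For context, the adaptive spinning the commit message refers to works
roughly like the following paraphrase of the mainline mutex
optimistic-spin loop (a simplified sketch, not the exact
kernel/locking/mutex.c code):

	/*
	 * Simplified sketch of mutex adaptive spinning: keep spinning as
	 * long as the current owner is still running on a CPU, and fall
	 * back to the sleeping slowpath if the owner schedules out or we
	 * are asked to reschedule.
	 */
	while (__mutex_owner(lock) == owner) {
		if (!owner_on_cpu(owner) || need_resched())
			break;		/* give up spinning and block */
		cpu_relax();
	}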
Comments
On Thu, 2 Mar 2023 06:27:41 +0000 John Stultz <jstultz@google.com> wrote:

> Obviously, having tons of threads all contending on a single
> lock for logging is non-optimal, so the proper fix is probably
> reworking pstore pmsg to have per-cpu buffers so we don't have
> contention.

Or perhaps we should convert rt_mutex to have adaptive spinning too. This
will likely be needed for PREEMPT_RT anyway. IIRC, in the PREEMPT_RT patch,
only the spinlock-converted rt_mutexes used adaptive spinning, and the
argument against giving the mutex-converted rt_mutexes adaptive spinning
was that the normal mutex (at that time) did not have it, and we wanted to
keep it the same as mainline. But it appears that that reason is no longer
the case, and perhaps the real answer is to have all mutexes have adaptive
spinning?

-- Steve
On Thu, Mar 2, 2023 at 5:24 AM Steven Rostedt <rostedt@goodmis.org> wrote:

> Or perhaps we should convert rt_mutex to have adaptive spinning too. This
> will likely be needed for PREEMPT_RT anyway. [...] perhaps the real answer
> is to have all mutexes have adaptive spinning?

This sounds like something to try as well (though I'd still want to push
this revert in the meantime to avoid regressions).

Do you have a more specific pointer to this adaptive spinning in the
rt_mutexes used for spinlocks? Looking at the current PREEMPT_RT patch I'm
not seeing any changes to locking.

thanks
-john
On Thu, 2 Mar 2023 11:39:12 -0800 John Stultz <jstultz@google.com> wrote:

> This sounds like something to try as well (though I'd still want to
> push this revert in the meantime to avoid regressions).
>
> Do you have a more specific pointer to this adaptive spinning
> rt_mutexes for spinlocks? Looking at the current PREEMPT_RT patch I'm
> not seeing any changes to locking.

Actually, it's in mainline too.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/rtmutex.c#n1557

rtmutex_spin_on_owner() is the code. But currently, only the highest
priority waiter will spin. If you have two waiters, the second one will
sleep.

With today's computers, where it's not uncommon to have more than 16 CPUs,
it is very likely (and probably what happened in your case) that there's
more than one waiter in contention.

I'm thinking all waiters should spin if the owner and the top waiter are
also running, but only the top waiter should be allowed to grab the lock.
There's no harm in spinning, as the task can still be preempted, and
there's no issue of priority inversion, because the spinners will not be
on the same CPU as the owner and the top waiter, if they only spin while
those two tasks are also running on a CPU.

I could possibly add a patch, and see if that also works.

-- Steve
On Thu, 2 Mar 2023 15:21:03 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> I could possibly add a patch, and see if that also works.
Can you try this patch to see if it improves the situation.
A few things about this patch. It is lightly tested. It can be optimized
to cache the top waiter so it does not need to grab the spin lock and
disable interrupts on every loop, but right now I want to see if this
improves the situation, as we may need this when PREEMPT_RT becomes more
mainline.
Another thing I noticed is I think there's a bug in the existing code.
  CPU1					CPU2
  ----					----
 rt_mutex_slowlock_block() {
    raw_spin_lock_irq(wait_lock);
    owner = rt_mutex_owner();
    raw_spin_unlock_irq(wait_lock);

    rtmutex_spin_on_owner(owner) {
       owner = rt_mutex_owner();

       [ task preempted! (could also be a long interrupt) ]

					owner releases lock and exits
					owner is freed

       [ task resumes ]

       if (!owner_on_cpu(owner))
          READ_ONCE(owner->on_cpu)
          *** BOOM invalid pointer dereference ***
I think we need a get_task_struct() somewhere there.
Anyway, that's another issue. Could you try this patch? I even added a
trace_printk() in there to see if it gets hit.
Thanks!
-- Steve
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 010cf4e6d0b8..6c602775bb23 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1399,6 +1399,7 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock,
struct rt_mutex_waiter *waiter,
struct task_struct *owner)
{
+ struct rt_mutex_waiter *top_waiter;
bool res = true;
rcu_read_lock();
@@ -1421,11 +1422,25 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock,
* for CONFIG_PREEMPT_RCU=y)
* - the VCPU on which owner runs is preempted
*/
- if (!owner_on_cpu(owner) || need_resched() ||
- !rt_mutex_waiter_is_top_waiter(lock, waiter)) {
+ if (!owner_on_cpu(owner) || need_resched()) {
res = false;
break;
}
+ top_waiter = rt_mutex_top_waiter(lock);
+ if (top_waiter != waiter) {
+ raw_spin_lock_irq(&lock->wait_lock);
+ top_waiter = rt_mutex_top_waiter(lock);
+ if (top_waiter && top_waiter != waiter) {
+ trace_printk("spin on waiter! %s:%d\n",
+ top_waiter->task->comm,
+ top_waiter->task->pid);
+ if (!owner_on_cpu(top_waiter->task))
+ res = false;
+ }
+ raw_spin_unlock_irq(&lock->wait_lock);
+ if (!res)
+ break;
+ }
cpu_relax();
}
rcu_read_unlock();
On Thu, 2 Mar 2023 16:32:53 -0500 Steven Rostedt <rostedt@goodmis.org> wrote:

>     rtmutex_spin_on_owner(owner) {

I just noticed there's an rcu_read_lock() here around the loop. I'm
guessing that's to keep this race from happening. Would have been nice to
have a comment there stating such.

-- Steve
On Thu, 2 Mar 2023 16:36:03 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > I just noticed there's an rcu_read_lock() here around the loop. > I'm guessing that's to keep this race from happening. > Would have been nice to have a comment there stating such. Knowing that rcu_read_lock() keeps the tasks safe, I made the optimization to only grab the spinlock (and disable interrupts) once, or whenever the top waiter changes. -- Steve diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 010cf4e6d0b8..37820331857b 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -1399,8 +1399,12 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock, struct rt_mutex_waiter *waiter, struct task_struct *owner) { + struct rt_mutex_waiter *top_waiter; + struct rt_mutex_waiter *last_waiter = NULL; + struct task_struct *top_task = NULL; bool res = true; + /* rcu_read_lock keeps task_structs around */ rcu_read_lock(); for (;;) { /* If owner changed, trylock again. */ @@ -1421,11 +1425,27 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock, * for CONFIG_PREEMPT_RCU=y) * - the VCPU on which owner runs is preempted */ - if (!owner_on_cpu(owner) || need_resched() || - !rt_mutex_waiter_is_top_waiter(lock, waiter)) { + if (!owner_on_cpu(owner) || need_resched()) { res = false; break; } + top_waiter = rt_mutex_top_waiter(lock); + if (top_waiter != waiter) { + if (top_waiter != last_waiter) { + raw_spin_lock_irq(&lock->wait_lock); + last_waiter = rt_mutex_top_waiter(lock); + top_task = last_waiter->task; + raw_spin_unlock_irq(&lock->wait_lock); + } + trace_printk("spin on owner! %s:%d\n", + top_task->comm, + top_task->pid); + + if (!owner_on_cpu(top_task)) { + res = false; + break; + } + } cpu_relax(); } rcu_read_unlock();
From: Steven Rostedt
> Sent: 02 March 2023 20:21
...
> There's no harm in spinning, as the task can still be preempted, and
> there's no issue of priority inversion, because the spinners will not be on
> the same CPU as the owner and the top waiter, if they only spin if those
> two tasks are also running on a CPU.

ISTM that a spinlock should spin - even on an RT kernel. If it might spin
for longer than it takes to do a process switch, it shouldn't be a spinlock
at all. (I bet there are quite a few that do spin for ages...)

Adaptive spinning for a sleep lock (as an optimisation) is entirely
different. I changed some very old driver code that used sema4 (which
always sleeps) to mutex (which quite often spins) and got a massive
performance gain.

I've also had terrible problems trying to get a multithreaded user program
to work well [1]. Because you don't have spinlocks (userspace can always be
preempted) you can't bound the time mutexes are held for. So any vaguely
'hot' lock (maybe just used to remove an item from a list) can get
interrupted by a hardware interrupt. The only way to make it work is to use
atomic operations instead of mutexes.

I can't help feeling that the RT kernel suffers from the same problems if
the system is under any kind of load. You might get slightly better RT
response, but the overall amount of 'work' a system can actually do will be
lower.

[1] Test was 36 threads on a 40 cpu system that need to spend about 90% of
the time processing RTP (UDP) audio. This also involves 500k ethernet
packets/sec (tx and rx). It is all possible, but there were a lot of
pitfalls.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
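The "atomic operations instead of a mutex" point above can be illustrated
with a minimal lock-free push onto a singly-linked stack; this is a
hypothetical C11 sketch, not code from this thread or from the program
David describes:

	#include <stdatomic.h>
	#include <stddef.h>

	struct node {
		struct node *next;
		int val;
	};

	/*
	 * Lock-free push: bounded, never sleeps, and safe against being
	 * preempted at any instruction. (Pop is harder - it has to deal
	 * with the ABA problem - which is part of the "pitfalls".)
	 */
	static void stack_push(_Atomic(struct node *) *head, struct node *n)
	{
		struct node *old = atomic_load(head);

		do {
			n->next = old;
			/* on failure the CAS below reloads 'old' for us */
		} while (!atomic_compare_exchange_weak(head, &old, n));
	}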
On Thu, 2 Mar 2023 22:41:36 +0000 David Laight <David.Laight@ACULAB.COM> wrote:

> I can't help feeling that the RT kernel suffers from the same problems
> if the system is under any kind of load. You might get slightly better
> RT response, but the overall amount of 'work' a system can actually do
> will be lower.

That basically is the definition of an RTOS. But it's not "slightly better
RT responses"; what you get is a hell of a lot better responses, and no
unbounded priority inversion.

On some workloads I can still get millisecond latency cases on vanilla
Linux, whereas with the PREEMPT_RT patch the same workload is still under
100 microseconds.

-- Steve
On Thu, 2 Mar 2023 16:56:13 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > Knowing that rcu_read_lock() keeps the tasks safe, I made the optimization > to only grab the spinlock (and disable interrupts) once, or whenever the > top waiter changes. v3 as I found that there were too places to test for top waiter that had to be removed: (I took out the trace_printk() here). -- Steve diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 010cf4e6d0b8..283dd8e654ef 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -1399,8 +1399,12 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock, struct rt_mutex_waiter *waiter, struct task_struct *owner) { + struct rt_mutex_waiter *top_waiter; + struct rt_mutex_waiter *last_waiter = NULL; + struct task_struct *top_task = NULL; bool res = true; + /* rcu_read_lock keeps task_structs around */ rcu_read_lock(); for (;;) { /* If owner changed, trylock again. */ @@ -1421,11 +1425,23 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock, * for CONFIG_PREEMPT_RCU=y) * - the VCPU on which owner runs is preempted */ - if (!owner_on_cpu(owner) || need_resched() || - !rt_mutex_waiter_is_top_waiter(lock, waiter)) { + if (!owner_on_cpu(owner) || need_resched()) { res = false; break; } + top_waiter = rt_mutex_top_waiter(lock); + if (top_waiter != waiter) { + if (top_waiter != last_waiter) { + raw_spin_lock_irq(&lock->wait_lock); + last_waiter = rt_mutex_top_waiter(lock); + top_task = last_waiter->task; + raw_spin_unlock_irq(&lock->wait_lock); + } + if (!owner_on_cpu(top_task)) { + res = false; + break; + } + } cpu_relax(); } rcu_read_unlock(); @@ -1547,10 +1563,7 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock, break; } - if (waiter == rt_mutex_top_waiter(lock)) - owner = rt_mutex_owner(lock); - else - owner = NULL; + owner = rt_mutex_owner(lock); raw_spin_unlock_irq(&lock->wait_lock); if (!owner || !rtmutex_spin_on_owner(lock, waiter, owner)) @@ -1736,10 +1749,7 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock) if (try_to_take_rt_mutex(lock, current, &waiter)) break; - if (&waiter == rt_mutex_top_waiter(lock)) - owner = rt_mutex_owner(lock); - else - owner = NULL; + owner = rt_mutex_owner(lock); raw_spin_unlock_irq(&lock->wait_lock); if (!owner || !rtmutex_spin_on_owner(lock, &waiter, owner))
Hey Steve, On Thu, Mar 02, 2023 at 08:01:36PM -0500, Steven Rostedt wrote: > On Thu, 2 Mar 2023 16:56:13 -0500 > Steven Rostedt <rostedt@goodmis.org> wrote: > > > Knowing that rcu_read_lock() keeps the tasks safe, I made the optimization > > to only grab the spinlock (and disable interrupts) once, or whenever the > > top waiter changes. > > v3 as I found that there were too places to test for top waiter that had to > be removed: Nice patch!!! One question below: > (I took out the trace_printk() here). > > -- Steve > > diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c > index 010cf4e6d0b8..283dd8e654ef 100644 > --- a/kernel/locking/rtmutex.c > +++ b/kernel/locking/rtmutex.c > @@ -1399,8 +1399,12 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock, > struct rt_mutex_waiter *waiter, > struct task_struct *owner) > { > + struct rt_mutex_waiter *top_waiter; > + struct rt_mutex_waiter *last_waiter = NULL; > + struct task_struct *top_task = NULL; > bool res = true; > > + /* rcu_read_lock keeps task_structs around */ > rcu_read_lock(); > for (;;) { > /* If owner changed, trylock again. */ > @@ -1421,11 +1425,23 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock, > * for CONFIG_PREEMPT_RCU=y) > * - the VCPU on which owner runs is preempted > */ > - if (!owner_on_cpu(owner) || need_resched() || > - !rt_mutex_waiter_is_top_waiter(lock, waiter)) { > + if (!owner_on_cpu(owner) || need_resched()) { > res = false; > break; > } > + top_waiter = rt_mutex_top_waiter(lock); > + if (top_waiter != waiter) { > + if (top_waiter != last_waiter) { > + raw_spin_lock_irq(&lock->wait_lock); > + last_waiter = rt_mutex_top_waiter(lock); > + top_task = last_waiter->task; > + raw_spin_unlock_irq(&lock->wait_lock); > + } > + if (!owner_on_cpu(top_task)) { > + res = false; > + break; > + } > + } In the normal mutex's adaptive spinning, there is no check for if there is a change in waiter AFAICS (ignoring ww mutex stuff for a second). I can see one may want to do that waiter-check, as spinning indefinitely if the lock owner is on the CPU for too long may result in excessing power burn. But normal mutex does not seem to do that. What makes the rtmutex spin logic different from normal mutex in this scenario, so that rtmutex wants to do that but normal ones dont? Another thought is, I am wondering if all of them spinning indefinitely might be Ok for rtmutex as well, since as you mentioned, preemption is enabled. So adding the if (top_waiter != last_waiter) {...} might be unnecessary? In fact may be even harmful as you are disabling interrupts in the process. Either way, I think a comment should go on top of the "if (top_waiter != waiter)" check IMO. thanks, - Joel > cpu_relax(); > } > rcu_read_unlock(); > @@ -1547,10 +1563,7 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock, > break; > } > > - if (waiter == rt_mutex_top_waiter(lock)) > - owner = rt_mutex_owner(lock); > - else > - owner = NULL; > + owner = rt_mutex_owner(lock); > raw_spin_unlock_irq(&lock->wait_lock); > > if (!owner || !rtmutex_spin_on_owner(lock, waiter, owner)) > @@ -1736,10 +1749,7 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock) > if (try_to_take_rt_mutex(lock, current, &waiter)) > break; > > - if (&waiter == rt_mutex_top_waiter(lock)) > - owner = rt_mutex_owner(lock); > - else > - owner = NULL; > + owner = rt_mutex_owner(lock); > raw_spin_unlock_irq(&lock->wait_lock); > > if (!owner || !rtmutex_spin_on_owner(lock, &waiter, owner))
On Fri, 3 Mar 2023 18:11:34 +0000 Joel Fernandes <joel@joelfernandes.org> wrote: > In the normal mutex's adaptive spinning, there is no check for if there is a > change in waiter AFAICS (ignoring ww mutex stuff for a second). > > I can see one may want to do that waiter-check, as spinning > indefinitely if the lock owner is on the CPU for too long may result in > excessing power burn. But normal mutex does not seem to do that. > > What makes the rtmutex spin logic different from normal mutex in this > scenario, so that rtmutex wants to do that but normal ones dont? Well, the point of the patch is that I don't think they should be different ;-) > > Another thought is, I am wondering if all of them spinning indefinitely might > be Ok for rtmutex as well, since as you mentioned, preemption is enabled. So > adding the if (top_waiter != last_waiter) {...} might be unnecessary? In fact > may be even harmful as you are disabling interrupts in the process. The last patch only does the interrupt disabling if the top waiter changes. Which in practice is seldom. But, I don't think a non top waiter should spin if the top waiter is not running. The top waiter is the one that will get the lock next. If the owner releases the lock and gives it to the top waiter, then it has to go through the wake up of the top waiter. I don't see why a task should spin to save a wake up if a wake up has to happen anyway. > > Either way, I think a comment should go on top of the "if (top_waiter != > waiter)" check IMO. What type of comment? -- Steve
On Fri, Mar 3, 2023 at 1:37 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Fri, 3 Mar 2023 18:11:34 +0000 > Joel Fernandes <joel@joelfernandes.org> wrote: > > > In the normal mutex's adaptive spinning, there is no check for if there is a > > change in waiter AFAICS (ignoring ww mutex stuff for a second). > > > > I can see one may want to do that waiter-check, as spinning > > indefinitely if the lock owner is on the CPU for too long may result in > > excessing power burn. But normal mutex does not seem to do that. > > > > What makes the rtmutex spin logic different from normal mutex in this > > scenario, so that rtmutex wants to do that but normal ones dont? > > Well, the point of the patch is that I don't think they should be different > ;-) But there's no "waiter change" thing for mutex_spin_on_owner right. Then, should mutex_spin_on_owner() also add a call to __mutex_waiter_is_first() ? > > Another thought is, I am wondering if all of them spinning indefinitely might > > be Ok for rtmutex as well, since as you mentioned, preemption is enabled. So > > adding the if (top_waiter != last_waiter) {...} might be unnecessary? In fact > > may be even harmful as you are disabling interrupts in the process. > > The last patch only does the interrupt disabling if the top waiter changes. > Which in practice is seldom. > > But, I don't think a non top waiter should spin if the top waiter is not > running. The top waiter is the one that will get the lock next. If the > owner releases the lock and gives it to the top waiter, then it has to go > through the wake up of the top waiter. Correct me if I'm wrong, but I don't think it will go through schedule() after spinning, which is what adds the overhead for John. > I don't see why a task should spin > to save a wake up if a wake up has to happen anyway. What about regular mutexes, happens there too or no? > > Either way, I think a comment should go on top of the "if (top_waiter != > > waiter)" check IMO. > > What type of comment? Comment explaining why "if (top_waiter != waiter)" is essential :-). thanks, -Joel
On Fri, 3 Mar 2023 14:25:23 -0500 Joel Fernandes <joel@joelfernandes.org> wrote: > On Fri, Mar 3, 2023 at 1:37 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > On Fri, 3 Mar 2023 18:11:34 +0000 > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > In the normal mutex's adaptive spinning, there is no check for if there is a > > > change in waiter AFAICS (ignoring ww mutex stuff for a second). > > > > > > I can see one may want to do that waiter-check, as spinning > > > indefinitely if the lock owner is on the CPU for too long may result in > > > excessing power burn. But normal mutex does not seem to do that. > > > > > > What makes the rtmutex spin logic different from normal mutex in this > > > scenario, so that rtmutex wants to do that but normal ones dont? > > > > Well, the point of the patch is that I don't think they should be different > > ;-) > > But there's no "waiter change" thing for mutex_spin_on_owner right. > > Then, should mutex_spin_on_owner() also add a call to > __mutex_waiter_is_first() ? Ah interesting, I missed the __mutex_waiter_is_first() in the mutex code, where it looks to do basically the same thing as rt_mutex (but slightly different). From looking at this, it appears that mutex() has FIFO fair logic, where the second waiter will sleep. Would be interesting to see why John sees such a huge difference between normal mutex and rtmutex if they are doing the same thing. One thing is perhaps the priority logic is causing the issue, where this will not improve anything. I wonder if we add spinning to normal mutex for the other waiters if that would improve things or make them worse? > > > > Another thought is, I am wondering if all of them spinning indefinitely might > > > be Ok for rtmutex as well, since as you mentioned, preemption is enabled. So > > > adding the if (top_waiter != last_waiter) {...} might be unnecessary? In fact > > > may be even harmful as you are disabling interrupts in the process. > > > > The last patch only does the interrupt disabling if the top waiter changes. > > Which in practice is seldom. > > > > But, I don't think a non top waiter should spin if the top waiter is not > > running. The top waiter is the one that will get the lock next. If the > > owner releases the lock and gives it to the top waiter, then it has to go > > through the wake up of the top waiter. > > Correct me if I'm wrong, but I don't think it will go through > schedule() after spinning, which is what adds the overhead for John. Only if the lock becomes free. > > > I don't see why a task should spin > > to save a wake up if a wake up has to happen anyway. > > What about regular mutexes, happens there too or no? Yes, but in a FIFO order, where in rt_mutex, a second, higher priority task can make the first ones sleep. So maybe it's just the priority logic that is causing the issues. > > > > Either way, I think a comment should go on top of the "if (top_waiter != > > > waiter)" check IMO. > > > > What type of comment? > > Comment explaining why "if (top_waiter != waiter)" is essential :-). You mean, "Don't spin if the next owner is not on any CPU"? -- Steve
On 03/03/23 14:38, Steven Rostedt wrote: > On Fri, 3 Mar 2023 14:25:23 -0500 > Joel Fernandes <joel@joelfernandes.org> wrote: > > > On Fri, Mar 3, 2023 at 1:37 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > On Fri, 3 Mar 2023 18:11:34 +0000 > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > In the normal mutex's adaptive spinning, there is no check for if there is a > > > > change in waiter AFAICS (ignoring ww mutex stuff for a second). > > > > > > > > I can see one may want to do that waiter-check, as spinning > > > > indefinitely if the lock owner is on the CPU for too long may result in > > > > excessing power burn. But normal mutex does not seem to do that. > > > > > > > > What makes the rtmutex spin logic different from normal mutex in this > > > > scenario, so that rtmutex wants to do that but normal ones dont? > > > > > > Well, the point of the patch is that I don't think they should be different > > > ;-) > > > > But there's no "waiter change" thing for mutex_spin_on_owner right. > > > > Then, should mutex_spin_on_owner() also add a call to > > __mutex_waiter_is_first() ? > > Ah interesting, I missed the __mutex_waiter_is_first() in the mutex code, > where it looks to do basically the same thing as rt_mutex (but slightly > different). From looking at this, it appears that mutex() has FIFO fair > logic, where the second waiter will sleep. > > Would be interesting to see why John sees such a huge difference between > normal mutex and rtmutex if they are doing the same thing. One thing is > perhaps the priority logic is causing the issue, where this will not > improve anything. I think that can be a good suspect. If the waiters are RT tasks the root cause might be starvation issue due to bad priority setup and moving to FIFO just happens to hide it. For same priority RT tasks, we should behave as FIFO too AFAICS. If there are a mix of RT vs CFS; RT will always win of course. > > I wonder if we add spinning to normal mutex for the other waiters if that > would improve things or make them worse? I see a potential risk depending on how long the worst case scenario for this optimistic spinning. RT tasks can prevent all lower priority RT and CFS from running. CFS tasks will lose some precious bandwidth from their sched_slice() as this will be accounted for them as RUNNING time even if they were effectively waiting. Cheers -- Qais Yousef > > > > > > > Another thought is, I am wondering if all of them spinning indefinitely might > > > > be Ok for rtmutex as well, since as you mentioned, preemption is enabled. So > > > > adding the if (top_waiter != last_waiter) {...} might be unnecessary? In fact > > > > may be even harmful as you are disabling interrupts in the process. > > > > > > The last patch only does the interrupt disabling if the top waiter changes. > > > Which in practice is seldom. > > > > > > But, I don't think a non top waiter should spin if the top waiter is not > > > running. The top waiter is the one that will get the lock next. If the > > > owner releases the lock and gives it to the top waiter, then it has to go > > > through the wake up of the top waiter. > > > > Correct me if I'm wrong, but I don't think it will go through > > schedule() after spinning, which is what adds the overhead for John. > > Only if the lock becomes free. > > > > > > I don't see why a task should spin > > > to save a wake up if a wake up has to happen anyway. > > > > What about regular mutexes, happens there too or no? 
> > Yes, but in a FIFO order, where in rt_mutex, a second, higher priority task > can make the first ones sleep. So maybe it's just the priority logic that > is causing the issues. > > > > > > > Either way, I think a comment should go on top of the "if (top_waiter != > > > > waiter)" check IMO. > > > > > > What type of comment? > > > > Comment explaining why "if (top_waiter != waiter)" is essential :-). > > You mean, "Don't spin if the next owner is not on any CPU"? > > -- Steve
Hey Steve, On Fri, Mar 03, 2023 at 02:38:22PM -0500, Steven Rostedt wrote: > On Fri, 3 Mar 2023 14:25:23 -0500 > Joel Fernandes <joel@joelfernandes.org> wrote: > > > On Fri, Mar 3, 2023 at 1:37 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > On Fri, 3 Mar 2023 18:11:34 +0000 > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > In the normal mutex's adaptive spinning, there is no check for if there is a > > > > change in waiter AFAICS (ignoring ww mutex stuff for a second). > > > > > > > > I can see one may want to do that waiter-check, as spinning > > > > indefinitely if the lock owner is on the CPU for too long may result in > > > > excessing power burn. But normal mutex does not seem to do that. > > > > > > > > What makes the rtmutex spin logic different from normal mutex in this > > > > scenario, so that rtmutex wants to do that but normal ones dont? > > > > > > Well, the point of the patch is that I don't think they should be different > > > ;-) > > > > But there's no "waiter change" thing for mutex_spin_on_owner right. > > > > Then, should mutex_spin_on_owner() also add a call to > > __mutex_waiter_is_first() ? > > Ah interesting, I missed the __mutex_waiter_is_first() in the mutex code, > where it looks to do basically the same thing as rt_mutex (but slightly > different). True, I concur! > From looking at this, it appears that mutex() has FIFO fair > logic, where the second waiter will sleep. > > Would be interesting to see why John sees such a huge difference between > normal mutex and rtmutex if they are doing the same thing. One thing is > perhaps the priority logic is causing the issue, where this will not > improve anything. > I wonder if we add spinning to normal mutex for the other waiters if that > would improve things or make them worse? Yeah it could improve things (or make them worse ;-). In that approach, I guess those other waiters will not be spinning for too long once the first waiter gets the lock, as mutex_spin_on_owner() it will break out of the spin: while (__mutex_owner(lock) == owner) { ... } But yeah it sounds like a plausible idea. > > > > Another thought is, I am wondering if all of them spinning indefinitely might > > > > be Ok for rtmutex as well, since as you mentioned, preemption is enabled. So > > > > adding the if (top_waiter != last_waiter) {...} might be unnecessary? In fact > > > > may be even harmful as you are disabling interrupts in the process. > > > > > > The last patch only does the interrupt disabling if the top waiter changes. > > > Which in practice is seldom. > > > > > > But, I don't think a non top waiter should spin if the top waiter is not > > > running. The top waiter is the one that will get the lock next. If the > > > owner releases the lock and gives it to the top waiter, then it has to go > > > through the wake up of the top waiter. Ah ok. I see what you're doing now! I guess that check will help whenever the top-waiter keeps changing. In that case you want only the latest top-waiter as the spinner, all while the lock owner is not changing. > > Correct me if I'm wrong, but I don't think it will go through > > schedule() after spinning, which is what adds the overhead for John. > > Only if the lock becomes free. Ah yeah, true! > > > I don't see why a task should spin > > > to save a wake up if a wake up has to happen anyway. > > > > What about regular mutexes, happens there too or no? > > Yes, but in a FIFO order, where in rt_mutex, a second, higher priority task > can make the first ones sleep. 
So maybe it's just the priority logic that > is causing the issues. True! So in the FIFO thing, there's no way a high prio task can get ahead in line of the first waiter, makes sense. > > > > Either way, I think a comment should go on top of the "if (top_waiter != > > > > waiter)" check IMO. > > > > > > What type of comment? > > > > Comment explaining why "if (top_waiter != waiter)" is essential :-). > Maybe "/* Only the top waiter needs to spin. If we are no longer the top-waiter, no point in spinning, as we do not get the lock next anyway. */" ? thanks, - Joel
On Fri, Mar 03, 2023 at 08:36:45PM +0000, Qais Yousef wrote: > On 03/03/23 14:38, Steven Rostedt wrote: > > On Fri, 3 Mar 2023 14:25:23 -0500 > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > On Fri, Mar 3, 2023 at 1:37 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > > > On Fri, 3 Mar 2023 18:11:34 +0000 > > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > In the normal mutex's adaptive spinning, there is no check for if there is a > > > > > change in waiter AFAICS (ignoring ww mutex stuff for a second). > > > > > > > > > > I can see one may want to do that waiter-check, as spinning > > > > > indefinitely if the lock owner is on the CPU for too long may result in > > > > > excessing power burn. But normal mutex does not seem to do that. > > > > > > > > > > What makes the rtmutex spin logic different from normal mutex in this > > > > > scenario, so that rtmutex wants to do that but normal ones dont? > > > > > > > > Well, the point of the patch is that I don't think they should be different > > > > ;-) > > > > > > But there's no "waiter change" thing for mutex_spin_on_owner right. > > > > > > Then, should mutex_spin_on_owner() also add a call to > > > __mutex_waiter_is_first() ? > > > > Ah interesting, I missed the __mutex_waiter_is_first() in the mutex code, > > where it looks to do basically the same thing as rt_mutex (but slightly > > different). From looking at this, it appears that mutex() has FIFO fair > > logic, where the second waiter will sleep. > > > > Would be interesting to see why John sees such a huge difference between > > normal mutex and rtmutex if they are doing the same thing. One thing is > > perhaps the priority logic is causing the issue, where this will not > > improve anything. > > I think that can be a good suspect. If the waiters are RT tasks the root cause > might be starvation issue due to bad priority setup and moving to FIFO just > happens to hide it. I wonder if mutex should actually prioritize giving the lock to RT tasks instead of FIFO, since that's higher priority work. It sounds that's more 'fair'. But that's likely to make John's issue worse. > For same priority RT tasks, we should behave as FIFO too AFAICS. > > If there are a mix of RT vs CFS; RT will always win of course. > > > > > I wonder if we add spinning to normal mutex for the other waiters if that > > would improve things or make them worse? > > I see a potential risk depending on how long the worst case scenario for this > optimistic spinning. > > RT tasks can prevent all lower priority RT and CFS from running. Agree, I was kind of hoping need_resched() in mutex_spin_on_owner() would come to the rescue in such a scenario, but obviously not. Modifications to check_preempt_curr_rt() could obviously aid there but... > CFS tasks will lose some precious bandwidth from their sched_slice() as this > will be accounted for them as RUNNING time even if they were effectively > waiting. True, but maybe the CFS task is happy to lose some bandwidth and get back to CPU quickly, than blocking and not getting any work done. ;-) thanks, - Joel > > > Cheers > > -- > Qais Yousef > > > > > > > > > > > Another thought is, I am wondering if all of them spinning indefinitely might > > > > > be Ok for rtmutex as well, since as you mentioned, preemption is enabled. So > > > > > adding the if (top_waiter != last_waiter) {...} might be unnecessary? In fact > > > > > may be even harmful as you are disabling interrupts in the process. 
> > > > > > > > The last patch only does the interrupt disabling if the top waiter changes. > > > > Which in practice is seldom. > > > > > > > > But, I don't think a non top waiter should spin if the top waiter is not > > > > running. The top waiter is the one that will get the lock next. If the > > > > owner releases the lock and gives it to the top waiter, then it has to go > > > > through the wake up of the top waiter. > > > > > > Correct me if I'm wrong, but I don't think it will go through > > > schedule() after spinning, which is what adds the overhead for John. > > > > Only if the lock becomes free. > > > > > > > > > I don't see why a task should spin > > > > to save a wake up if a wake up has to happen anyway. > > > > > > What about regular mutexes, happens there too or no? > > > > Yes, but in a FIFO order, where in rt_mutex, a second, higher priority task > > can make the first ones sleep. So maybe it's just the priority logic that > > is causing the issues. > > > > > > > > > > Either way, I think a comment should go on top of the "if (top_waiter != > > > > > waiter)" check IMO. > > > > > > > > What type of comment? > > > > > > Comment explaining why "if (top_waiter != waiter)" is essential :-). > > > > You mean, "Don't spin if the next owner is not on any CPU"? > > > > -- Steve
On Sat, Mar 04, 2023 at 03:01:30AM +0000, Joel Fernandes wrote: [...] > > > > > Either way, I think a comment should go on top of the "if (top_waiter != > > > > > waiter)" check IMO. > > > > > > > > What type of comment? > > > > > > Comment explaining why "if (top_waiter != waiter)" is essential :-). > > > > Maybe "/* Only the top waiter needs to spin. If we are no longer the > top-waiter, no point in spinning, as we do not get the lock next anyway. */" > > ? And it could be added to that comment that, we want to continue spinning as long as the top-waiter is still on the CPU (even if we are no longer the top-waiter). thanks, - Joel
On Thu, Mar 2, 2023 at 5:01 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> v3 as I found that there were two places to test for top waiter that had
> to be removed:

Hey Steven,

Unfortunately previous versions didn't improve the situation for the
reporter, and this version is causing crashes
(BUG at kernel/locking/rtmutex_common.h:118!) for them.

I'm going to spend some time today to get a local reproducer so I can
tinker a bit here as well.

thanks
-john
On 03/04/23 03:21, Joel Fernandes wrote: > On Fri, Mar 03, 2023 at 08:36:45PM +0000, Qais Yousef wrote: > > On 03/03/23 14:38, Steven Rostedt wrote: > > > On Fri, 3 Mar 2023 14:25:23 -0500 > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > On Fri, Mar 3, 2023 at 1:37 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > > > > > On Fri, 3 Mar 2023 18:11:34 +0000 > > > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > In the normal mutex's adaptive spinning, there is no check for if there is a > > > > > > change in waiter AFAICS (ignoring ww mutex stuff for a second). > > > > > > > > > > > > I can see one may want to do that waiter-check, as spinning > > > > > > indefinitely if the lock owner is on the CPU for too long may result in > > > > > > excessing power burn. But normal mutex does not seem to do that. > > > > > > > > > > > > What makes the rtmutex spin logic different from normal mutex in this > > > > > > scenario, so that rtmutex wants to do that but normal ones dont? > > > > > > > > > > Well, the point of the patch is that I don't think they should be different > > > > > ;-) > > > > > > > > But there's no "waiter change" thing for mutex_spin_on_owner right. > > > > > > > > Then, should mutex_spin_on_owner() also add a call to > > > > __mutex_waiter_is_first() ? > > > > > > Ah interesting, I missed the __mutex_waiter_is_first() in the mutex code, > > > where it looks to do basically the same thing as rt_mutex (but slightly > > > different). From looking at this, it appears that mutex() has FIFO fair > > > logic, where the second waiter will sleep. > > > > > > Would be interesting to see why John sees such a huge difference between > > > normal mutex and rtmutex if they are doing the same thing. One thing is > > > perhaps the priority logic is causing the issue, where this will not > > > improve anything. > > > > I think that can be a good suspect. If the waiters are RT tasks the root cause > > might be starvation issue due to bad priority setup and moving to FIFO just > > happens to hide it. > > I wonder if mutex should actually prioritize giving the lock to RT tasks > instead of FIFO, since that's higher priority work. It sounds that's more > 'fair'. But that's likely to make John's issue worse. It is the right thing to do IMHO, but I guess the implications are just too hard to tell to enforce it by default yet. Which is I guess why it's all protected by PREEMPT_RT still. (I'm not sure but I assumed that logically PREEMPT_RT would convert all mutex to rt_mutexes by default) > > > For same priority RT tasks, we should behave as FIFO too AFAICS. > > > > If there are a mix of RT vs CFS; RT will always win of course. > > > > > > > > I wonder if we add spinning to normal mutex for the other waiters if that > > > would improve things or make them worse? > > > > I see a potential risk depending on how long the worst case scenario for this > > optimistic spinning. > > > > RT tasks can prevent all lower priority RT and CFS from running. > > Agree, I was kind of hoping need_resched() in mutex_spin_on_owner() would > come to the rescue in such a scenario, but obviously not. Modifications to > check_preempt_curr_rt() could obviously aid there but... > > > CFS tasks will lose some precious bandwidth from their sched_slice() as this > > will be accounted for them as RUNNING time even if they were effectively > > waiting. 
> > True, but maybe the CFS task is happy to lose some bandwidth and get back to > CPU quickly, than blocking and not getting any work done. ;-) It depends on the worst case scenario of spinning. If we can ensure it's bounded to something small, then yeah I don't see an issue :-) Cheers -- Qais Yousef
On Fri, Mar 03, 2023 at 06:11:34PM +0000, Joel Fernandes wrote:

> What makes the rtmutex spin logic different from normal mutex in this
> scenario, so that rtmutex wants to do that but normal ones dont?

Regular mutex uses the osq 'lock' to serialize waiters, and only the top
spinner gets to spin on the mutex itself; this greatly reduces the
contention on the mutex.

OSQ is FIFO, which is not what RT-mutex needs.
On Tue, Mar 7, 2023 at 9:09 AM Peter Zijlstra <peterz@infradead.org> wrote:

> Regular mutex uses the osq 'lock' to serialize waiters, and only the top
> spinner gets to spin on the mutex itself; this greatly reduces the
> contention on the mutex.
>
> OSQ is FIFO, which is not what RT-mutex needs.

Got it, so basically OSQ ensures fairness as it's FIFO, and also reduces
lock contention because, I am guessing, the OSQ spinners spin on a
per-spinner MCS node (that per-CPU optimistic_spin_node, it appears).

This makes perfect sense now, thank you Peter!!!

- Joel
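The "each spinner waits on its own node" idea behind OSQ can be illustrated
with a minimal MCS-style queue lock; this is a purely educational C11
sketch, not the kernel's osq_lock()/optimistic_spin_node implementation:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stddef.h>

	struct mcs_node {
		_Atomic(struct mcs_node *) next;
		atomic_bool locked;	/* true while this waiter must spin */
	};

	struct mcs_lock {
		_Atomic(struct mcs_node *) tail;
	};

	static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *node)
	{
		struct mcs_node *prev;

		atomic_store(&node->next, NULL);
		atomic_store(&node->locked, true);

		/* Join the tail of the queue: FIFO ordering. */
		prev = atomic_exchange(&lock->tail, node);
		if (!prev)
			return;		/* queue was empty, lock acquired */

		atomic_store(&prev->next, node);

		/* Each waiter spins on its own node, not the shared lock word. */
		while (atomic_load(&node->locked))
			;
	}

	static void mcs_release(struct mcs_lock *lock, struct mcs_node *node)
	{
		struct mcs_node *next = atomic_load(&node->next);

		if (!next) {
			struct mcs_node *expected = node;

			/* No known successor: try to mark the queue empty. */
			if (atomic_compare_exchange_strong(&lock->tail,
							   &expected, NULL))
				return;

			/* A successor is linking itself in; wait for it. */
			while (!(next = atomic_load(&node->next)))
				;
		}
		/* Hand the lock directly to the next waiter in line. */
		atomic_store(&next->locked, false);
	}

One visible consequence of this design, as Peter notes, is strict FIFO
hand-off, which is exactly what an rt_mutex cannot use, since it must hand
the lock to the highest-priority waiter rather than the oldest one.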
On Thu, 2 Mar 2023 20:01:36 -0500 Steven Rostedt <rostedt@goodmis.org> wrote: > @@ -1421,11 +1425,23 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock, > * for CONFIG_PREEMPT_RCU=y) > * - the VCPU on which owner runs is preempted > */ > - if (!owner_on_cpu(owner) || need_resched() || > - !rt_mutex_waiter_is_top_waiter(lock, waiter)) { > + if (!owner_on_cpu(owner) || need_resched()) { > res = false; > break; > } > + top_waiter = rt_mutex_top_waiter(lock); rt_mutex_top_waiter() can not be called outside the wait_lock, as it may trigger that BUG_ON() you saw. New patch below. > + if (top_waiter != waiter) { > + if (top_waiter != last_waiter) { > + raw_spin_lock_irq(&lock->wait_lock); > + last_waiter = rt_mutex_top_waiter(lock); > + top_task = last_waiter->task; > + raw_spin_unlock_irq(&lock->wait_lock); > + } > + if (!owner_on_cpu(top_task)) { > + res = false; > + break; > + } > + } -- Steve diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 7779ee8abc2a..f7b0cc3be20e 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -1362,8 +1362,11 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock, struct rt_mutex_waiter *waiter, struct task_struct *owner) { + struct rt_mutex_waiter *top_waiter = NULL; + struct task_struct *top_task = NULL; bool res = true; + /* rcu_read_lock keeps task_structs around */ rcu_read_lock(); for (;;) { /* If owner changed, trylock again. */ @@ -1384,11 +1387,22 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock, * for CONFIG_PREEMPT_RCU=y) * - the VCPU on which owner runs is preempted */ - if (!owner_on_cpu(owner) || need_resched() || - !rt_mutex_waiter_is_top_waiter(lock, waiter)) { + if (!owner_on_cpu(owner) || need_resched()) { res = false; break; } + if (!rt_mutex_waiter_is_top_waiter(lock, waiter)) { + if (!rt_mutex_waiter_is_top_waiter(lock, top_waiter)) { + raw_spin_lock_irq(&lock->wait_lock); + top_waiter = rt_mutex_top_waiter(lock); + top_task = top_waiter->task; + raw_spin_unlock_irq(&lock->wait_lock); + } + if (!owner_on_cpu(top_task)) { + res = false; + break; + } + } cpu_relax(); } rcu_read_unlock(); @@ -1510,10 +1524,7 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock, break; } - if (waiter == rt_mutex_top_waiter(lock)) - owner = rt_mutex_owner(lock); - else - owner = NULL; + owner = rt_mutex_owner(lock); raw_spin_unlock_irq(&lock->wait_lock); if (!owner || !rtmutex_spin_on_owner(lock, waiter, owner)) @@ -1699,10 +1710,7 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock) if (try_to_take_rt_mutex(lock, current, &waiter)) break; - if (&waiter == rt_mutex_top_waiter(lock)) - owner = rt_mutex_owner(lock); - else - owner = NULL; + owner = rt_mutex_owner(lock); raw_spin_unlock_irq(&lock->wait_lock); if (!owner || !rtmutex_spin_on_owner(lock, &waiter, owner))
On Tue, Mar 7, 2023 at 5:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> rt_mutex_top_waiter() can not be called outside the wait_lock, as it may
> trigger that BUG_ON() you saw.
>
> New patch below.

Hey Steven!
Thanks for the new version! It avoids the crash issue.

However, with my self-created reproducer, I was still seeing a similar
regression going from the mutex to the rtmutex.

I'm interested in continuing to see if we can further tweak it, but I've
got some other work I need to focus on, so I think I'm going to advocate
for the revert in the short term and look at finer-grained locking (along
with the rtmutex to address the priority inversion issue) in the longer
term.

thanks
-john
On Wed, 8 Mar 2023 12:04:23 -0800 John Stultz <jstultz@google.com> wrote:

> I'm interested in continuing to see if we can further tweak it, but
> I've got some other work I need to focus on, so I think I'm going to
> advocate for the revert in the short term and look at finer-grained
> locking (along with the rtmutex to address the priority inversion issue)
> in the longer term.

Yeah, I would suggest the same. I would still like to see what the
difference is, because I believe this will also be an issue for
PREEMPT_RT.

-- Steve
diff --git a/fs/pstore/pmsg.c b/fs/pstore/pmsg.c
index ab82e5f05346..b31c9c72d90b 100644
--- a/fs/pstore/pmsg.c
+++ b/fs/pstore/pmsg.c
@@ -7,10 +7,9 @@
 #include <linux/device.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
-#include <linux/rtmutex.h>
 #include "internal.h"
 
-static DEFINE_RT_MUTEX(pmsg_lock);
+static DEFINE_MUTEX(pmsg_lock);
 
 static ssize_t write_pmsg(struct file *file, const char __user *buf,
 			  size_t count, loff_t *ppos)
@@ -29,9 +28,9 @@ static ssize_t write_pmsg(struct file *file, const char __user *buf,
 	if (!access_ok(buf, count))
 		return -EFAULT;
 
-	rt_mutex_lock(&pmsg_lock);
+	mutex_lock(&pmsg_lock);
 	ret = psinfo->write_user(&record, buf);
-	rt_mutex_unlock(&pmsg_lock);
+	mutex_unlock(&pmsg_lock);
 	return ret ? ret : count;
 }
 