Message ID: 20230427111937.2745231-2-bigeasy@linutronix.de
State: New
Headers:
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: linux-kernel@vger.kernel.org
Cc: Ben Segall <bsegall@google.com>, Boqun Feng <boqun.feng@gmail.com>,
    Crystal Wood <swood@redhat.com>, Daniel Bristot de Oliveira <bristot@redhat.com>,
    Dietmar Eggemann <dietmar.eggemann@arm.com>, Ingo Molnar <mingo@redhat.com>,
    John Stultz <jstultz@google.com>, Juri Lelli <juri.lelli@redhat.com>,
    Mel Gorman <mgorman@suse.de>, Peter Zijlstra <peterz@infradead.org>,
    Steven Rostedt <rostedt@goodmis.org>, Thomas Gleixner <tglx@linutronix.de>,
    Valentin Schneider <vschneid@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>,
    Waiman Long <longman@redhat.com>, Will Deacon <will@kernel.org>,
    Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers
Date: Thu, 27 Apr 2023 13:19:34 +0200
Message-Id: <20230427111937.2745231-2-bigeasy@linutronix.de>
In-Reply-To: <20230427111937.2745231-1-bigeasy@linutronix.de>
References: <20230427111937.2745231-1-bigeasy@linutronix.de>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Series: locking/rtmutex: Avoid overwriting pi_blocked_on while invoking blk_flush_plug()
Commit Message
Sebastian Andrzej Siewior
April 27, 2023, 11:19 a.m. UTC
From: Thomas Gleixner <tglx@linutronix.de>

schedule() invokes sched_submit_work() before scheduling and
sched_update_worker() afterwards to ensure that queued block requests are
flushed and the (IO)worker machineries can instantiate new workers if
required. This avoids deadlocks and starvation.

With rt_mutexes this can lead to a subtle problem:

When an rtmutex blocks, current::pi_blocked_on points to the rtmutex it
blocks on. When one of the functions in sched_submit/resume_work()
contends on an rtmutex based lock then that would corrupt
current::pi_blocked_on.

Make it possible to let rtmutex issue the calls outside of the slowpath,
i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:

- Exposing sched_submit_work() and moving the task_is_running() condition
  into schedule()

- Renaming sched_update_worker() to sched_resume_work() and exposing it
  too.

- Providing sched_rtmutex() which just does the inner loop of scheduling
  until need_resched() is no longer set. Split out the loop so this does
  not create yet another copy.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/sched.h |  5 +++++
 kernel/sched/core.c   | 40 ++++++++++++++++++++++------------------
 2 files changed, 27 insertions(+), 18 deletions(-)
Comments
On Thu, Apr 27, 2023 at 01:19:34PM +0200, Sebastian Andrzej Siewior wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> schedule() invokes sched_submit_work() before scheduling and
> sched_update_worker() afterwards to ensure that queued block requests are
> flushed and the (IO)worker machineries can instantiate new workers if
> required. This avoids deadlocks and starvation.
>
> With rt_mutexes this can lead to a subtle problem:
>
> When an rtmutex blocks, current::pi_blocked_on points to the rtmutex it
> blocks on. When one of the functions in sched_submit/resume_work()
> contends on an rtmutex based lock then that would corrupt
> current::pi_blocked_on.
>
> Make it possible to let rtmutex issue the calls outside of the slowpath,
> i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:
>
> - Exposing sched_submit_work() and moving the task_is_running() condition
>   into schedule()
>
> - Renaming sched_update_worker() to sched_resume_work() and exposing it
>   too.
>
> - Providing sched_rtmutex() which just does the inner loop of scheduling
>   until need_resched() is no longer set. Split out the loop so this does
>   not create yet another copy.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Urgh, so I really don't like this.

The end result is something like:

	rt_mutex_lock()
	  sched_submit_work();
	    // a nested rt_mutex_lock() here will not clobber
	    // ->pi_blocked_on because it's not set yet.

	  task_blocks_on_rt_mutex();
	    tsk->pi_blocked_on = waiter;
	    rt_mutex_enqueue(lock, waiter); <-- the real problem

	  rt_mutex_slowlock_block();
	    schedule_rtmutex();

	  sched_resume_work();

And all of this is not just because of tsk->pi_blocked_on, but mostly
because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
enqueue thing is what makes the 'simple' solution of saving/restoring
tsk->pi_blocked_on not work.

Basically the pi_blocked_on corruption is a side effect, not the
fundamental issue. One task having two waiters registered is the bigger
issue.

Now, sched_submit_work() could also use a (regular) mutex -- after all
it's a fully preemptible context. And then we're subject to the 'same'
problem but with tsk->blocked_on (DEBUG_MUTEXES=y).

This means that strictly speaking we should litter mutex with the same
thing :/

This all feels fragile to me. Too many things spread out in too many
places.

An alternative is something like:

	void __sched schedule_pi(void)
	{
		struct task_struct *tsk = current;
		void *waiter = tsk->pi_blocked_on;

		sched_submit_work(tsk);
		do {
			preempt_disable();
			if (rt_mutex_blocks(tsk, waiter))
				schedule();
			sched_preempt_enable_no_resched();
		} while (need_resched());
		sched_update_worker(tsk);
	}

And then rt_mutex_blocks() will do the enqueue/boost/optimistic_spin
thing. However, this is going to be a massive reorg of the rt_mutex code
and I'm not entirely sure the end result will actually be better... it
might just make a mess elsewhere :/

> +void sched_submit_work(void)
>  {
> -	unsigned int task_flags;
> +	struct task_struct *tsk = current;
> +	unsigned int task_flags = tsk->flags;
>
> -	if (task_is_running(tsk))
> -		return;
> -
> -	task_flags = tsk->flags;
>  	/*
>  	 * If a worker goes to sleep, notify and ask workqueue whether it
>  	 * wants to wake up a task to maintain concurrency.
> @@ -6723,8 +6720,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
>  		blk_flush_plug(tsk->plug, true);
>  }
>
> +asmlinkage __visible void __sched schedule(void)
> +{
> +	if (!task_is_running(current))
> +		sched_submit_work();
> +	schedule_loop(SM_NONE);
> +	sched_resume_work();
>  }
>  EXPORT_SYMBOL(schedule);

pulling out task_is_running() like this is going to get into conflict
with TJ's patches here:

  https://lkml.kernel.org/r/20230418205159.724789-1-tj@kernel.org

That makes sched_submit_work() do things even if task_is_running().
On Wed, 2023-05-03 at 15:20 +0200, Peter Zijlstra wrote:
> On Thu, Apr 27, 2023 at 01:19:34PM +0200, Sebastian Andrzej Siewior wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> > [...]
>
> Urgh, so I really don't like this.
>
> The end result is something like:
>
> 	rt_mutex_lock()
> 	  sched_submit_work();
> 	    // a nested rt_mutex_lock() here will not clobber
> 	    // ->pi_blocked_on because it's not set yet.
>
> 	  task_blocks_on_rt_mutex();
> 	    tsk->pi_blocked_on = waiter;
> 	    rt_mutex_enqueue(lock, waiter); <-- the real problem
>
> 	  rt_mutex_slowlock_block();
> 	    schedule_rtmutex();
>
> 	  sched_resume_work();
>
> And all of this is not just because of tsk->pi_blocked_on, but mostly
> because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
> enqueue thing is what makes the 'simple' solution of saving/restoring
> tsk->pi_blocked_on not work.
>
> Basically the pi_blocked_on corruption is a side effect, not the
> fundamental issue. One task having two waiters registered is the bigger
> issue.

Where do you see pi_blocked_on being saved/restored? The whole point of
this patchset is to deal with sched_submit_work() before anything has
been done on the "outer" lock acquisition (not just pi_blocked_on, but
also enqueuing) other than failing the fast path.

> Now, sched_submit_work() could also use a (regular) mutex -- after all
> it's a fully preemptible context. And then we're subject to the 'same'
> problem but with tsk->blocked_on (DEBUG_MUTEXES=y).

It's fully preemptible but it still shouldn't be doing things that would
block on non-RT. That'd already be broken for a number of reasons (task
state corruption, infinite recursion if current->plug isn't cleared
before doing whatever causes another standard schedule(), etc).

-Crystal
On 2023-05-03 15:20:51 [+0200], Peter Zijlstra wrote:
> Urgh, so I really don't like this.
>
> The end result is something like:
> [...]
> And all of this is not just because of tsk->pi_blocked_on, but mostly
> because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
> enqueue thing is what makes the 'simple' solution of saving/restoring
> tsk->pi_blocked_on not work.
>
> Basically the pi_blocked_on corruption is a side effect, not the
> fundamental issue. One task having two waiters registered is the bigger
> issue.

Yes, one task blocks on two locks, but the lock in sched_submit_work()
needs to be solved first.

> Now, sched_submit_work() could also use a (regular) mutex -- after all
> it's a fully preemptible context. And then we're subject to the 'same'
> problem but with tsk->blocked_on (DEBUG_MUTEXES=y).

Not sure I follow. We only invoke sched_submit_work() if we block on a
lock which is sleeping on !RT (mutex_t, not spinlock_t). I browsed a
few of the sched_submit_work() callbacks and they all use non-sleeping
locks (on !RT).
If a sched_submit_work() would use a mutex_t lock then we would
recursively call blk_flush_plug() before setting tsk->blocked_on and
perform the same callback and block on the very same lock (again).
This isn't different compared to !RT, therefore you must not use a
sleeping lock (mutex_t) in the callback.

> This means that strictly speaking we should litter mutex with the same
> thing :/

No need, see above logic.

> This all feels fragile to me. Too many things spread out in too many
> places.
>
> An alternative is something like:
>
> 	void __sched schedule_pi(void)
> 	{
> 		struct task_struct *tsk = current;
> 		void *waiter = tsk->pi_blocked_on;
>
> 		sched_submit_work(tsk);
> 		do {
> 			preempt_disable();
> 			if (rt_mutex_blocks(tsk, waiter))
> 				schedule();
> 			sched_preempt_enable_no_resched();
> 		} while (need_resched());
> 		sched_update_worker(tsk);
> 	}
>
> And then rt_mutex_blocks() will do the enqueue/boost/optimistic_spin
> thing. However, this is going to be a massive reorg of the rt_mutex code
> and I'm not entirely sure the end result will actually be better... it
> might just make a mess elsewhere :/

It might be not needed…

> > @@ -6723,8 +6720,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
> >  		blk_flush_plug(tsk->plug, true);
> >  }
> >
> > +asmlinkage __visible void __sched schedule(void)
> > +{
> > +	if (!task_is_running(current))
> > +		sched_submit_work();
> > +	schedule_loop(SM_NONE);
> > +	sched_resume_work();
> >  }
> >  EXPORT_SYMBOL(schedule);
>
> pulling out task_is_running() like this is going to get into conflict
> with TJ's patches here:
>
>   https://lkml.kernel.org/r/20230418205159.724789-1-tj@kernel.org
>
> That makes sched_submit_work() do things even if task_is_running().

Do I rebase my stuff on top of his then and we good?

Sebastian
On Wed, May 10, 2023 at 05:04:15PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-05-03 15:20:51 [+0200], Peter Zijlstra wrote:
> > Urgh, so I really don't like this.
> > [...]
> > Basically the pi_blocked_on corruption is a side effect, not the
> > fundamental issue. One task having two waiters registered is the bigger
> > issue.
>
> Yes, one task blocks on two locks, but the lock in sched_submit_work()
> needs to be solved first.

Sure.. just saying that the focus on ->pi_blocked_on is a wee bit under
representing the issue, because there's a 'simple' solution for
->pi_blocked_on. There is not for the whole waiter situation.

> > Now, sched_submit_work() could also use a (regular) mutex -- after all
> > it's a fully preemptible context. And then we're subject to the 'same'
> > problem but with tsk->blocked_on (DEBUG_MUTEXES=y).
>
> Not sure I follow. We only invoke sched_submit_work() if we block on a
> lock which is sleeping on !RT (mutex_t, not spinlock_t). I browsed a
> few of the sched_submit_work() callbacks and they all use non-sleeping
> locks (on !RT).

Right, but this is not enforced, they could change this and we'd never
know.

It used to be run with IRQs disabled so anybody trying to do 'funny'
things would get yelled at, but commit 9c40cef2b799 ("sched: Move
blk_schedule_flush_plug() out of __schedule()") broke that.

> If a sched_submit_work() would use a mutex_t lock then we would
> recursively call blk_flush_plug() before setting tsk->blocked_on and

I'm not following, mutex code sets tsk->blocked_on before it calls
schedule(), getting into the very same problem you have with rt_mutex.

> perform the same callback and block on the very same lock (again).
> This isn't different compared to !RT, therefore you must not use a
> sleeping lock (mutex_t) in the callback.

See the enforcement thing; today nothing stops the code from using a
mutex or other blocking primitives here.

> > pulling out task_is_running() like this is going to get into conflict
> > with TJ's patches here:
> >
> >   https://lkml.kernel.org/r/20230418205159.724789-1-tj@kernel.org
> >
> > That makes sched_submit_work() do things even if task_is_running().
>
> Do I rebase my stuff on top of his then and we good?

I just suggested he try something else:

  https://lkml.kernel.org/r/20230510150946.GO4253@hirez.programming.kicks-ass.net

if that works out this worry goes away.

If we get PROVE_RAW_LOCK_NESTING usable, something like the below might
help out with the validation part...

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 944c3ae39861..8df1aa333dee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6694,11 +6694,18 @@ void __noreturn do_task_dead(void)
 
 static inline void sched_submit_work(struct task_struct *tsk)
 {
+	static DEFINE_WAIT_OVERRIDE_MAP(sched_map, LD_WAIT_CONFIG);
 	unsigned int task_flags;
 
 	if (task_is_running(tsk))
 		return;
 
+	/*
+	 * Establish LD_WAIT_CONFIG context to ensure none of the code called
+	 * will use a blocking primitive.
+	 */
+	lock_map_acquire_try(&sched_map);
+
 	task_flags = tsk->flags;
 	/*
 	 * If a worker goes to sleep, notify and ask workqueue whether it
@@ -6723,6 +6730,8 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 * make sure to submit it to avoid deadlocks.
 	 */
 	blk_flush_plug(tsk->plug, true);
+
+	lock_map_release(&sched_map);
 }
 
 static void sched_update_worker(struct task_struct *tsk)
On Tue, May 09, 2023 at 05:14:38PM -0500, Crystal Wood wrote:
> On Wed, 2023-05-03 at 15:20 +0200, Peter Zijlstra wrote:
> > Urgh, so I really don't like this.
> > [...]
> > Basically the pi_blocked_on corruption is a side effect, not the
> > fundamental issue. One task having two waiters registered is the bigger
> > issue.
>
> Where do you see pi_blocked_on being saved/restored?

I'm not, I'm saying that *IF* ->pi_blocked_on corruption were the real
problem that would be a sufficient solution. But ->pi_blocked_on
corruption is just a wee side effect of the real problem, which is that
the waiter is already enqueued and we've already done PI and can't very
well back out of all that in a hurry.

> The whole point of
> this patchset is to deal with sched_submit_work() before anything has
> been done on the "outer" lock acquisition (not just pi_blocked_on, but
> also enqueuing) other than failing the fast path.

It's also terribly fragile, sprinkling stuff all over that shouldn't be
sprinkled. And it's sprinkled far wider than it needs to be -- per the
below argument it really only should be applied to rtlock, not to
rt_mutex or ww_rt_mutex or any of the others that normally block and
shouldn't be used anyway.

> > Now, sched_submit_work() could also use a (regular) mutex -- after all
> > it's a fully preemptible context. And then we're subject to the 'same'
> > problem but with tsk->blocked_on (DEBUG_MUTEXES=y).
>
> It's fully preemptible but it still shouldn't be doing things that would
> block on non-RT. That'd already be broken for a number of reasons (task
> state corruption, infinite recursion if current->plug isn't cleared
> before doing whatever causes another standard schedule(), etc).

task->state is fairly immune to corruption normally -- the typical case
is that the nested block resolves and resets it to RUNNING, at which
point the outer loop 'fails' to schedule() and re-tries. All that is
mostly harmless.

But yes, all that code *SHOULD* not block, but nothing is really
enforcing that.
On 2023-05-11 15:43:08 [+0200], Peter Zijlstra wrote:
> > If a sched_submit_work() would use a mutex_t lock then we would
> > recursively call blk_flush_plug() before setting tsk->blocked_on and
>
> I'm not following, mutex code sets tsk->blocked_on before it calls
> schedule(), getting into the very same problem you have with rt_mutex.
>
> > perform the same callback and block on the very same lock (again).
> > This isn't different compared to !RT, therefore you must not use a
> > sleeping lock (mutex_t) in the callback.
>
> See the enforcement thing; today nothing stops the code from using a
> mutex or other blocking primitives here.

I tried to explain that if blk_flush_plug() blocks on a mutex_t then it
will invoke schedule() -> blk_flush_plug() -> schedule() ->
blk_flush_plug() -> … until it runs out of stack. So it is broken
regardless of RT, but yes, we don't enforce it, and yes, people might use
it and it would work as long as the lock is not contended.

> > Do I rebase my stuff on top of his then and we good?
>
> I just suggested he try something else:
>
>   https://lkml.kernel.org/r/20230510150946.GO4253@hirez.programming.kicks-ass.net
>
> if that works out this worry goes away.
>
> If we get PROVE_RAW_LOCK_NESTING usable, something like the below might
> help out with the validation part...

Okay. So if I don't collide with workqueue do you buy this or do you
ask for something else? I'm not sure…

Regarding PROVE_RAW_LOCK_NESTING: If I boot -rc3 with `quiet' then I
don't see any complaints. Otherwise it is printk during boot (the caller
is holding a raw_spinlock_t and then printk() calls into the serial
driver with a spinlock_t). From time to time people send "fixes" for
PROVE_RAW_LOCK_NESTING splats, so I would guess they boot with `quiet'
and there isn't much else. So we are getting close here I guess.

Do you want me to test the suggested validation map somewhere? Because
if it works, it could be queued.

Sebastian
On 2023-05-25 17:25:07 [+0200], To Peter Zijlstra wrote:
> > if that works out this worry goes away.
> >
> > If we get PROVE_RAW_LOCK_NESTING usable, something like the below might
> > help out with the validation part...
>
> Okay. So if I don't collide with workqueue do you buy this or do you
> ask for something else? I'm not sure…
> [...]
> Do you want me to test the suggested validation map somewhere? Because
> if it works, it could be queued.

Do you want just the validation map or is there something I'm missing?

Sebastian
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 675298d6eb362..ff1ce66d8b6e3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -304,6 +304,11 @@ extern long schedule_timeout_idle(long timeout);
 asmlinkage void schedule(void);
 extern void schedule_preempt_disabled(void);
 asmlinkage void preempt_schedule_irq(void);
+
+extern void sched_submit_work(void);
+extern void sched_resume_work(void);
+extern void schedule_rtmutex(void);
+
 #ifdef CONFIG_PREEMPT_RT
 extern void schedule_rtlock(void);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c415418b0b847..7c5cfae086c78 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6690,14 +6690,11 @@ void __noreturn do_task_dead(void)
 		cpu_relax();
 }
 
-static inline void sched_submit_work(struct task_struct *tsk)
+void sched_submit_work(void)
 {
-	unsigned int task_flags;
+	struct task_struct *tsk = current;
+	unsigned int task_flags = tsk->flags;
 
-	if (task_is_running(tsk))
-		return;
-
-	task_flags = tsk->flags;
 	/*
 	 * If a worker goes to sleep, notify and ask workqueue whether it
 	 * wants to wake up a task to maintain concurrency.
@@ -6723,8 +6720,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	blk_flush_plug(tsk->plug, true);
 }
 
-static void sched_update_worker(struct task_struct *tsk)
+void sched_resume_work(void)
 {
+	struct task_struct *tsk = current;
+
 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
@@ -6733,20 +6732,29 @@ static void sched_update_worker(struct task_struct *tsk)
 	}
 }
 
-asmlinkage __visible void __sched schedule(void)
+static void schedule_loop(unsigned int sched_mode)
 {
-	struct task_struct *tsk = current;
-
-	sched_submit_work(tsk);
 	do {
 		preempt_disable();
-		__schedule(SM_NONE);
+		__schedule(sched_mode);
 		sched_preempt_enable_no_resched();
 	} while (need_resched());
-	sched_update_worker(tsk);
+}
+
+asmlinkage __visible void __sched schedule(void)
+{
+	if (!task_is_running(current))
+		sched_submit_work();
+	schedule_loop(SM_NONE);
+	sched_resume_work();
 }
 EXPORT_SYMBOL(schedule);
 
+void schedule_rtmutex(void)
+{
+	schedule_loop(SM_NONE);
+}
+
 /*
  * synchronize_rcu_tasks() makes sure that no task is stuck in preempted
  * state (have scheduled out non-voluntarily) by making sure that all
@@ -6806,11 +6814,7 @@ void __sched schedule_preempt_disabled(void)
 #ifdef CONFIG_PREEMPT_RT
 void __sched notrace schedule_rtlock(void)
 {
-	do {
-		preempt_disable();
-		__schedule(SM_RTLOCK_WAIT);
-		sched_preempt_enable_no_resched();
-	} while (need_resched());
+	schedule_loop(SM_RTLOCK_WAIT);
 }
 NOKPROBE_SYMBOL(schedule_rtlock);
 #endif