[v2,1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers

Message ID 20230427111937.2745231-2-bigeasy@linutronix.de
State New
Series locking/rtmutex: Avoid overwriting pi_blocked_on while invoking blk_flush_plug().

Commit Message

Sebastian Andrzej Siewior April 27, 2023, 11:19 a.m. UTC
  From: Thomas Gleixner <tglx@linutronix.de>

schedule() invokes sched_submit_work() before scheduling and
sched_update_worker() afterwards to ensure that queued block requests are
flushed and the (IO) worker machinery can instantiate new workers if
required. This avoids deadlocks and starvation.

With rt_mutexes this can lead to a subtle problem:

  When a task blocks on an rtmutex, current::pi_blocked_on points to the
  rtmutex it blocks on. If one of the functions in sched_submit/resume_work()
  then contends on an rtmutex based lock, that would corrupt
  current::pi_blocked_on.

Make it possible to let rtmutex issue the calls outside of the slowpath,
i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:

  - Exposing sched_submit_work() and moving the task_is_running() condition
    into schedule()

  - Renaming sched_update_worker() to sched_resume_work() and exposing it
    too.

  - Providing sched_rtmutex() which just does the inner loop of scheduling
    until need_resched() is no longer set. The loop is split out so this
    does not create yet another copy.
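
As an editorial illustration (not part of this patch), a later patch in the
series could use the exposed helpers from the rtmutex slow path roughly as
sketched below; __rt_mutex_slowlock_locked() stands in for the existing
slow-path body, and locking, ww_mutex handling and error handling are
omitted:

	/*
	 * Sketch only: current::pi_blocked_on is still NULL when
	 * sched_submit_work() runs, so the workqueue/IO-worker callbacks
	 * and blk_flush_plug() may take rtmutex based locks safely.
	 * The slow path itself then blocks via schedule_rtmutex()
	 * instead of schedule().
	 */
	static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
					     unsigned int state)
	{
		int ret;

		sched_submit_work();
		ret = __rt_mutex_slowlock_locked(lock, state);	/* placeholder */
		sched_resume_work();

		return ret;
	}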

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/sched.h |  5 +++++
 kernel/sched/core.c   | 40 ++++++++++++++++++++++------------------
 2 files changed, 27 insertions(+), 18 deletions(-)
  

Comments

Peter Zijlstra May 3, 2023, 1:20 p.m. UTC | #1
On Thu, Apr 27, 2023 at 01:19:34PM +0200, Sebastian Andrzej Siewior wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> schedule() invokes sched_submit_work() before scheduling and
> sched_update_worker() afterwards to ensure that queued block requests are
> flushed and the (IO)worker machineries can instantiate new workers if
> required. This avoids deadlocks and starvation.
> 
> With rt_mutexes this can lead to subtle problem:
> 
>   When rtmutex blocks current::pi_blocked_on points to the rtmutex it
>   blocks on. When one of the functions in sched_submit/resume_work()
>   contends on a rtmutex based lock then that would corrupt
>   current::pi_blocked_on.
> 
> Make it possible to let rtmutex issue the calls outside of the slowpath,
> i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:
> 
>   - Exposing sched_submit_work() and moving the task_running() condition
>     into schedule()
> 
>   - Renamimg sched_update_worker() to sched_resume_work() and exposing it
>     too.
> 
>   - Providing sched_rtmutex() which just does the inner loop of scheduling
>     until need_resched() is not longer set. Split out the loop so this does
>     not create yet another copy.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Urgh, so I really don't like this.

The end result is something like:

	rt_mutex_lock()
	  sched_submit_work();
	    // a nested rt_mutex_lock() here will not clobber
	    // ->pi_blocked_on because it's not set yet.

	  task_blocks_on_rt_mutex();
	    tsk->pi_blocked_on = waiter;
	    rt_mutex_enqueue(lock, waiter); <-- the real problem

	  rt_mutex_slowlock_block();
	    schedule_rtmutex();

	  sched_resume_work();

And all of this is not just because of tsk->pi_blocked_on, but mostly
because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
enqueue thing is what makes the 'simple' solution of saving/restoring
tsk->pi_blocked_on not work.

Basically the pi_blocked_on corruption is a side effect, not the
fundamental issue. One task having two waiters registered is the bigger
issue.
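
[Editorial illustration, not from this thread: the 'simple' save/restore
idea dismissed above would look roughly like the sketch below. It only
shuffles ->pi_blocked_on around and does nothing about the waiter that
task_blocks_on_rt_mutex() already enqueued, which is exactly the point.]

	static void sched_submit_work_save_restore(struct task_struct *tsk)
	{
		struct rt_mutex_waiter *saved = tsk->pi_blocked_on;

		tsk->pi_blocked_on = NULL;
		sched_submit_work(tsk);		/* may block on another rtmutex */
		tsk->pi_blocked_on = saved;	/* the enqueued waiter remains */
	}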

Now, sched_submit_work() could also use a (regular) mutex -- after all
it's a fully preemptible context. And then we're subject to the 'same'
problem but with tsk->blocked_on (DEBUG_MUTEXES=y).

This means that strictly speaking we should litter mutex with the same
thing :/

This all feels fragile to me. Too many things spread out in too many
places. An alternative is something like:

void __sched schedule_pi(void)
{
	struct task_struct *tsk = current;
	void *waiter = tsk->pi_blocked_on;

	sched_submit_work(tsk);
	do {
		preempt_disable();
		if (rt_mutex_blocks(tsk, waiter))
			schedule();
		sched_preempt_enable_no_resched();
	} while (need_resched());
	sched_update_worker(tsk);
}

And then rt_mutex_blocks() will do the enqueue/boost/optimistic_spin
thing. However, this is going to be a massive reorg of the rt_mutex code
and I'm not entirely sure the end result will actually be better... it
might just make a mess elsewhere :/

> +void sched_submit_work(void)
>  {
> -	unsigned int task_flags;
> +	struct task_struct *tsk = current;
> +	unsigned int task_flags = tsk->flags;
>  
> -	if (task_is_running(tsk))
> -		return;
> -
> -	task_flags = tsk->flags;
>  	/*
>  	 * If a worker goes to sleep, notify and ask workqueue whether it
>  	 * wants to wake up a task to maintain concurrency.
> @@ -6723,8 +6720,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
>  	blk_flush_plug(tsk->plug, true);
>  }

> +asmlinkage __visible void __sched schedule(void)
> +{
> +	if (!task_is_running(current))
> +		sched_submit_work();
> +	schedule_loop(SM_NONE);
> +	sched_resume_work();
>  }
>  EXPORT_SYMBOL(schedule);

pulling out task_is_running() like this is going to get into conflict
with TJ's patches here:

  https://lkml.kernel.org/r/20230418205159.724789-1-tj@kernel.org

That makes sched_submit_work() do things even if task_is_running().
  
Crystal Wood May 9, 2023, 10:14 p.m. UTC | #2
On Wed, 2023-05-03 at 15:20 +0200, Peter Zijlstra wrote:
> On Thu, Apr 27, 2023 at 01:19:34PM +0200, Sebastian Andrzej Siewior wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> > 
> > schedule() invokes sched_submit_work() before scheduling and
> > sched_update_worker() afterwards to ensure that queued block requests
> > are
> > flushed and the (IO)worker machineries can instantiate new workers if
> > required. This avoids deadlocks and starvation.
> > 
> > With rt_mutexes this can lead to subtle problem:
> > 
> >   When rtmutex blocks current::pi_blocked_on points to the rtmutex it
> >   blocks on. When one of the functions in sched_submit/resume_work()
> >   contends on a rtmutex based lock then that would corrupt
> >   current::pi_blocked_on.
> > 
> > Make it possible to let rtmutex issue the calls outside of the slowpath,
> > i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:
> > 
> >   - Exposing sched_submit_work() and moving the task_running() condition
> >     into schedule()
> > 
> >   - Renamimg sched_update_worker() to sched_resume_work() and exposing
> > it
> >     too.
> > 
> >   - Providing sched_rtmutex() which just does the inner loop of
> > scheduling
> >     until need_resched() is not longer set. Split out the loop so this
> > does
> >     not create yet another copy.
> > 
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> Urgh, so I really don't like this.
> 
> The end result is something like:
> 
>         rt_mutex_lock()
>           sched_submit_work();
>             // a nested rt_mutex_lock() here will not clobber
>             // ->pi_blocked_on because it's not set yet.
> 
>           task_blocks_on_rt_mutex();
>             tsk->pi_blocked_on = waiter;
>             rt_mutex_enqueue(lock, waiter); <-- the real problem
> 
>           rt_mutex_slowlock_block();
>             schedule_rtmutex();
> 
>           sched_resume_work();
> 
> And all of this it not just because tsk->pi_blocked_on, but mostly
> because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
> enqueue thing is what makes the 'simple' solution of saving/restoring
> tsk->pi_blocked_on not work.
> 
> Basically the pi_blocked_on curruption is a side effect, not the
> fundamental issue. One task having two waiters registered is the bigger
> issue.

Where do you see pi_blocked_on being saved/restored?  The whole point of
this patchset is to deal with sched_submit_work() before anything has
been done on the "outer" lock acquisition (not just pi_blocked_on, but
also enqueuing) other than failing the fast path.

> Now, sched_submit_work() could also use (regular) mutex -- after all
> it's a fully preemptible context. And then we're subject to the 'same'
> problem but with tsk->blocked_on (DEBUG_MUTEXES=y).

It's fully preemptible but it still shouldn't be doing things that would
block on non-RT.  That'd already be broken for a number of reasons (task
state corruption, infinite recursion if current->plug isn't cleared
before doing whatever causes another standard schedule(), etc).

-Crystal
  
Sebastian Andrzej Siewior May 10, 2023, 3:04 p.m. UTC | #3
On 2023-05-03 15:20:51 [+0200], Peter Zijlstra wrote:
> Urgh, so I really don't like this.
> 
> The end result is something like:
> 
> 	rt_mutex_lock()
> 	  sched_submit_work();
> 	    // a nested rt_mutex_lock() here will not clobber
> 	    // ->pi_blocked_on because it's not set yet.
> 
> 	  task_blocks_on_rt_mutex();
> 	    tsk->pi_blocked_on = waiter;
> 	    rt_mutex_enqueue(lock, waiter); <-- the real problem
> 
> 	  rt_mutex_slowlock_block();
> 	    schedule_rtmutex();
> 
> 	  sched_resume_work();
> 
> And all of this it not just because tsk->pi_blocked_on, but mostly
> because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
> enqueue thing is what makes the 'simple' solution of saving/restoring
> tsk->pi_blocked_on not work.
> 
> Basically the pi_blocked_on curruption is a side effect, not the
> fundamental issue. One task having two waiters registered is the bigger
> issue.

Yes, one task blocks on two locks, but the lock in sched_submit_work()
needs to be dealt with first.

> Now, sched_submit_work() could also use (regular) mutex -- after all
> it's a fully preemptible context. And then we're subject to the 'same'
> problem but with tsk->blocked_on (DEBUG_MUTEXES=y).

Not sure I follow. We only invoke sched_submit_work() if we block on a
lock which is also sleeping on !RT (mutex_t, not spinlock_t). I browsed a
few of the sched_submit_work() callbacks and they all use non-sleeping
locks (on !RT).
If a sched_submit_work() callback were to use a mutex_t lock then we would
recursively call blk_flush_plug() before setting tsk->blocked_on, perform
the same callback and block on the very same lock (again).
This is no different compared to !RT, therefore you must not use a
sleeping lock (mutex_t) in the callback.

> This means that strictly speaking we should litter mutex with the same
> thing :/

No need, see above logic.

> This all feels fragile to me. Too many things spread out in too many
> places. An alternative is something like:
> 
> void __sched schedule_pi(void)
> {
> 	struct task_struct *tsk = current;
> 	void *waiter = tsk->pi_blocked_on;
> 
> 	sched_submit_work(tsk);
> 	do {
> 		preempt_disable();
> 		if (rt_mutex_blocks(tsk, waiter))
> 			schedule();
> 		sched_preempt_enable_no_resched();
> 	} while (need_resched());
> 	sched_update_worker(tsk);
> }
> 
> And then rt_mutex_blocks() will do the enqueue/boost/optimistic_spin
> thing. However, this is going to be a massive reorg of the rt_mutex code
> and I'm not entirely sure the end result will actually be better... it
> might just make a mess elsewhere :/

It might not be needed…

> > @@ -6723,8 +6720,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
> >  	blk_flush_plug(tsk->plug, true);
> >  }
> 
> > +asmlinkage __visible void __sched schedule(void)
> > +{
> > +	if (!task_is_running(current))
> > +		sched_submit_work();
> > +	schedule_loop(SM_NONE);
> > +	sched_resume_work();
> >  }
> >  EXPORT_SYMBOL(schedule);
> 
> pulling out task_is_running() like this is going to get into conflict
> with TJs patches here:
> 
>   https://lkml.kernel.org/r/20230418205159.724789-1-tj@kernel.org
> 
> That makes sched_submit_work() do things even if task_is_running().

Do I rebase my stuff on top of his then and we good?

Sebastian
  
Peter Zijlstra May 11, 2023, 1:43 p.m. UTC | #4
On Wed, May 10, 2023 at 05:04:15PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-05-03 15:20:51 [+0200], Peter Zijlstra wrote:
> > Urgh, so I really don't like this.
> > 
> > The end result is something like:
> > 
> > 	rt_mutex_lock()
> > 	  sched_submit_work();
> > 	    // a nested rt_mutex_lock() here will not clobber
> > 	    // ->pi_blocked_on because it's not set yet.
> > 
> > 	  task_blocks_on_rt_mutex();
> > 	    tsk->pi_blocked_on = waiter;
> > 	    rt_mutex_enqueue(lock, waiter); <-- the real problem
> > 
> > 	  rt_mutex_slowlock_block();
> > 	    schedule_rtmutex();
> > 
> > 	  sched_resume_work();
> > 
> > And all of this it not just because tsk->pi_blocked_on, but mostly
> > because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
> > enqueue thing is what makes the 'simple' solution of saving/restoring
> > tsk->pi_blocked_on not work.
> > 
> > Basically the pi_blocked_on curruption is a side effect, not the
> > fundamental issue. One task having two waiters registered is the bigger
> > issue.
> 
> Yes, one task blocks on two locks but the lock in sched_submit_work()
> needs to be solved first.

Sure.. just saying that the focus on ->pi_blocked_on is a wee bit
under-representing the issue, because there's a 'simple' solution for
->pi_blocked_on. There is none for the whole waiter situation.

> > Now, sched_submit_work() could also use (regular) mutex -- after all
> > it's a fully preemptible context. And then we're subject to the 'same'
> > problem but with tsk->blocked_on (DEBUG_MUTEXES=y).
> 
> Not sure I follow. We only invoke sched_submit_work() if we block on a
> lock which is sleeping on !RT (mutex_t, not spinlock_t). I browsed of
> few of the sched_submit_work() callbacks and they all use non-sleeping
> locks (on !RT).

Right, but this is not enforced, they could change this and we'd never
know.

It used to be run with IRQs disabled so anybody trying to do 'funny'
things would get yelled at, but commit 9c40cef2b799 ("sched: Move
blk_schedule_flush_plug() out of __schedule()") broke that.

> If a sched_submit_work() would use a mutex_t lock then we would
> recursively call blk_flush_plug() before setting tsk->blocked_on and

I'm not following, mutex code sets tsk->blocked_on before it calls
schedule(), getting into the very same problem you have with rt_mutex.

> perform the same callback and block on the very same lock (again).
> This isn't different compared to !RT therefore you must not use a
> sleeping lock (mutex_t) in the callback.

See the enforcement thing; today nothing stops the code from using a
mutex or other blocking primitives here.


> > pulling out task_is_running() like this is going to get into conflict
> > with TJs patches here:
> > 
> >   https://lkml.kernel.org/r/20230418205159.724789-1-tj@kernel.org
> > 
> > That makes sched_submit_work() do things even if task_is_running().
> 
> Do I rebase my stuff on top of his then and we good?

I just suggested he try something else:

  https://lkml.kernel.org/r/20230510150946.GO4253@hirez.programming.kicks-ass.net

If that works out, this worry goes away.

If we get PROVE_RAW_LOCK_NESTING usable, something like the below might
help out with the validation part...

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 944c3ae39861..8df1aa333dee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6694,11 +6694,18 @@ void __noreturn do_task_dead(void)
 
 static inline void sched_submit_work(struct task_struct *tsk)
 {
+	static DEFINE_WAIT_OVERRIDE_MAP(sched_map, LD_WAIT_CONFIG);
 	unsigned int task_flags;
 
 	if (task_is_running(tsk))
 		return;
 
+	/*
+	 * Establish LD_WAIT_CONFIG context to ensure none of the code called
+	 * will use a blocking primitive.
+	 */
+	lock_map_acquire_try(&sched_map);
+
 	task_flags = tsk->flags;
 	/*
 	 * If a worker goes to sleep, notify and ask workqueue whether it
@@ -6723,6 +6730,8 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 * make sure to submit it to avoid deadlocks.
 	 */
 	blk_flush_plug(tsk->plug, true);
+
+	lock_map_release(&sched_map);
 }
 
 static void sched_update_worker(struct task_struct *tsk)
  
Peter Zijlstra May 11, 2023, 1:53 p.m. UTC | #5
On Tue, May 09, 2023 at 05:14:38PM -0500, Crystal Wood wrote:
> On Wed, 2023-05-03 at 15:20 +0200, Peter Zijlstra wrote:

> > Urgh, so I really don't like this.
> > 
> > The end result is something like:
> > 
> >         rt_mutex_lock()
> >           sched_submit_work();
> >             // a nested rt_mutex_lock() here will not clobber
> >             // ->pi_blocked_on because it's not set yet.
> > 
> >           task_blocks_on_rt_mutex();
> >             tsk->pi_blocked_on = waiter;
> >             rt_mutex_enqueue(lock, waiter); <-- the real problem
> > 
> >           rt_mutex_slowlock_block();
> >             schedule_rtmutex();
> > 
> >           sched_resume_work();
> > 
> > And all of this it not just because tsk->pi_blocked_on, but mostly
> > because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
> > enqueue thing is what makes the 'simple' solution of saving/restoring
> > tsk->pi_blocked_on not work.
> > 
> > Basically the pi_blocked_on curruption is a side effect, not the
> > fundamental issue. One task having two waiters registered is the bigger
> > issue.
> 
> Where do you see pi_blocked_on being saved/restored? 

I'm not; I'm saying that *IF* ->pi_blocked_on corruption were the real
problem, that would be a sufficient solution.

But ->pi_blocked_on corruption is just a wee side effect of the real
problem, which is that the waiter is already enqueued and we've already
done PI and can't very well back out of all that in a hurry.

> The whole point of
> this patchset is to deal with sched_submit_work() before anything has
> been done on the "outer" lock acquisition (not just pi_blocked_on, but
> also enqueuing) other than failing the fast path.

It's also terribly fragile, sprinkling stuff all over that shouldn't be
sprinkled.

And it's sprinkled far wider than it needs to be -- per the below
argument it really only should be applied to rtlock, not to rt_mutex or
ww_rt_mutex or any of the others that normally block and shouldn't be
used anyway.

> > Now, sched_submit_work() could also use (regular) mutex -- after all
> > it's a fully preemptible context. And then we're subject to the 'same'
> > problem but with tsk->blocked_on (DEBUG_MUTEXES=y).
> 
> It's fully preemptible but it still shouldn't be doing things that would
> block on non-RT.  That'd already be broken for a number of reasons (task
> state corruption, infinite recursion if current->plug isn't cleared
> before doing whatever causes another standard schedule(), etc).

task->state is fairly immune to corruption normally -- the typical
case is that the nested block resolves and resets it to RUNNING, at
which point the outer loop 'fails' to schedule() and re-tries. All that
is mostly harmless.

But yes, all that code *SHOULD* not block, but nothing is really
enforcing that.
  
Sebastian Andrzej Siewior May 25, 2023, 3:25 p.m. UTC | #6
On 2023-05-11 15:43:08 [+0200], Peter Zijlstra wrote:
> > If a sched_submit_work() would use a mutex_t lock then we would
> > recursively call blk_flush_plug() before setting tsk->blocked_on and
> 
> I'm not following, mutex code sets tsk->blocked_on before it calls
> schedule(), getting into the very same problem you have with rt_mutex.
> 
> > perform the same callback and block on the very same lock (again).
> > This isn't different compared to !RT therefore you must not use a
> > sleeping lock (mutex_t) in the callback.
> 
> See the enforcement thing; today nothing stops the code from using a
> mutex or other blocking primitives here.

I tried to explain that if blk_flush_plug() blocks on a mutex_t then it
will invoke schedule() -> blk_flush_plug() -> schedule() ->
blk_flush_plug() -> … until it runs out of stack.

So it is broken regardless of RT, but yes, we don't enforce it, and yes,
people might use it and it would work as long as the lock is not
contended.

> > Do I rebase my stuff on top of his then and we good?
> 
> I just suggested he try something else:
> 
>   https://lkml.kernel.org/r/20230510150946.GO4253@hirez.programming.kicks-ass.net
> 
> if that works out this worry goes away.
> 
> If we get PROVE_RAW_LOCK_NESTING usable, something like the below might
> help out with the validation part...

Okay. So if I don't collide with the workqueue changes, do you buy this
or are you asking for something else? I'm not sure…

Regarding PROVE_RAW_LOCK_NESTING: If I boot -rc3 with `quiet' then I
don't see any complaints.
Otherwise it is printk() during boot (the caller is holding a raw_spinlock_t
and then printk() calls into the serial driver with a spinlock_t).
From time to time people send "fixes" for PROVE_RAW_LOCK_NESTING splats so
I would guess they boot with `quiet' and there isn't much else. So we
are getting close here I guess.

Do you want me to test the suggested validation map somewhere? Because
if it works, it could be queued.

Sebastian
  
Sebastian Andrzej Siewior June 16, 2023, 12:37 p.m. UTC | #7
On 2023-05-25 17:25:07 [+0200], To Peter Zijlstra wrote:
> > if that works out this worry goes away.
> > 
> > If we get PROVE_RAW_LOCK_NESTING usable, something like the below might
> > help out with the validation part...
> 
> Okay. So if I don't collide with workqueue do you buy this or do you
> ask for something else. I'm not sure…
> 
> Regarding PROVE_RAW_LOCK_NESTING: If I boot -rc3 with `quiet' then I
> don't see any complains.
> Otherwise it is printk during boot (caller is holding raw_spinlock_t and
> then printk() calls to serial driver with spinlock_t).
> From time to time ppl send "fixes" for PROVE_RAW_LOCK_NESTING splats so
> I would guess they boot with `quiet' and there isn't much else. So we
> are getting close here I guess.
> 
> Do you want me to test the suggested validation map somewhere? Because
> if it works, it could be queued.

Do you want just the validation map or is there something I'm missing?

Sebastian
  

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 675298d6eb362..ff1ce66d8b6e3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -304,6 +304,11 @@  extern long schedule_timeout_idle(long timeout);
 asmlinkage void schedule(void);
 extern void schedule_preempt_disabled(void);
 asmlinkage void preempt_schedule_irq(void);
+
+extern void sched_submit_work(void);
+extern void sched_resume_work(void);
+extern void schedule_rtmutex(void);
+
 #ifdef CONFIG_PREEMPT_RT
  extern void schedule_rtlock(void);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c415418b0b847..7c5cfae086c78 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6690,14 +6690,11 @@  void __noreturn do_task_dead(void)
 		cpu_relax();
 }
 
-static inline void sched_submit_work(struct task_struct *tsk)
+void sched_submit_work(void)
 {
-	unsigned int task_flags;
+	struct task_struct *tsk = current;
+	unsigned int task_flags = tsk->flags;
 
-	if (task_is_running(tsk))
-		return;
-
-	task_flags = tsk->flags;
 	/*
 	 * If a worker goes to sleep, notify and ask workqueue whether it
 	 * wants to wake up a task to maintain concurrency.
@@ -6723,8 +6720,10 @@  static inline void sched_submit_work(struct task_struct *tsk)
 	blk_flush_plug(tsk->plug, true);
 }
 
-static void sched_update_worker(struct task_struct *tsk)
+void sched_resume_work(void)
 {
+	struct task_struct *tsk = current;
+
 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
@@ -6733,20 +6732,29 @@  static void sched_update_worker(struct task_struct *tsk)
 	}
 }
 
-asmlinkage __visible void __sched schedule(void)
+static void schedule_loop(unsigned int sched_mode)
 {
-	struct task_struct *tsk = current;
-
-	sched_submit_work(tsk);
 	do {
 		preempt_disable();
-		__schedule(SM_NONE);
+		__schedule(sched_mode);
 		sched_preempt_enable_no_resched();
 	} while (need_resched());
-	sched_update_worker(tsk);
+}
+
+asmlinkage __visible void __sched schedule(void)
+{
+	if (!task_is_running(current))
+		sched_submit_work();
+	schedule_loop(SM_NONE);
+	sched_resume_work();
 }
 EXPORT_SYMBOL(schedule);
 
+void schedule_rtmutex(void)
+{
+	schedule_loop(SM_NONE);
+}
+
 /*
  * synchronize_rcu_tasks() makes sure that no task is stuck in preempted
  * state (have scheduled out non-voluntarily) by making sure that all
@@ -6806,11 +6814,7 @@  void __sched schedule_preempt_disabled(void)
 #ifdef CONFIG_PREEMPT_RT
 void __sched notrace schedule_rtlock(void)
 {
-	do {
-		preempt_disable();
-		__schedule(SM_RTLOCK_WAIT);
-		sched_preempt_enable_no_resched();
-	} while (need_resched());
+	schedule_loop(SM_RTLOCK_WAIT);
 }
 NOKPROBE_SYMBOL(schedule_rtlock);
 #endif