[RFC,0/6] softirq: Start pushing down the big softirq lock

Message ID 20230801132441.559222-1-frederic@kernel.org
Headers
Series softirq: Start pushing down the big softirq lock |

Message

Frederic Weisbecker Aug. 1, 2023, 1:24 p.m. UTC
  Networking softirqs can take time, holding the execution of block,
tasklets, timers and RCU callbacks.

People fight hard through this big softirq lock, proposing more and
more hacks over the years to deal with the resulting fundamental
unfairness that is not only a problem for RT users.

Here is a proposal for an entrypoint to dealing with that issue in the
long term. The purpose is to adopt a similar journey to the one we took
with the BKL push-down but with timers. Most timers are unrelated to
other softirq vectors, those can simply be tagged with the new
TIMER_SOFTINTERRUPTIBLE flag that makes a callback soft-interruptible.
The others can carry the TIMER_SOFTINTERRUPTIBLE after they get converted
to use appropriate synchronization against other vectors callbacks
(using spin_lock_bh() for example).

Once all timers are dealt with after a few years (famous last words),
they can be handled separately from the softirq infrastructure.

RCU could follow a similar treatment, if we manage to find room for a
flag somewhere...

(Only -RT and x86 are supported for now)

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	timers/softirq-preemptible

HEAD: c233aee141ddb78d07b2f7311be38cfc286de654

Thanks,
	Frederic
---

Frederic Weisbecker (6):
      softirq: Turn set_softirq_pending() to reset_softirq_pending()
      softirq: Make softirq handling entry/exit generally available
      softirq: Introduce softirq disabled mask
      x86/softirq: Support softirq disabled mask
      timers: Introduce soft-interruptible timers
      timers: Make process_timeout() soft-interruptible


 arch/Kconfig                   |  3 +++
 arch/x86/Kconfig               |  1 +
 arch/x86/include/asm/current.h |  1 +
 arch/x86/include/asm/hardirq.h |  1 +
 include/linux/bottom_half.h    |  9 +++++++++
 include/linux/interrupt.h      | 15 ++++++++++++---
 include/linux/timer.h          |  5 +++--
 kernel/softirq.c               | 40 ++++++++++++++++++++++++++++++++--------
 kernel/time/timer.c            | 20 +++++++++++++++++++-
 9 files changed, 81 insertions(+), 14 deletions(-)
  

Comments

Sebastian Andrzej Siewior Aug. 7, 2023, 12:50 p.m. UTC | #1
On 2023-08-01 15:24:35 [+0200], Frederic Weisbecker wrote:
> Networking softirqs can take time, holding the execution of block,
> tasklets, timers and RCU callbacks.
> 
> People fight hard through this big softirq lock, proposing more and
> more hacks over the years to deal with the resulting fundamental
> unfairness that is not only a problem for RT users.
> 
> Here is a proposal for an entrypoint to dealing with that issue in the
> long term. The purpose is to adopt a similar journey to the one we took
> with the BKL push-down but with timers. Most timers are unrelated to
> other softirq vectors, those can simply be tagged with the new
> TIMER_SOFTINTERRUPTIBLE flag that makes a callback soft-interruptible.
> The others can carry the TIMER_SOFTINTERRUPTIBLE after they get converted
> to use appropriate synchronization against other vectors callbacks
> (using spin_lock_bh() for example).

This doesn't work as proposed because of lock ordering:
|======================================================
|WARNING: possible circular locking dependency detected
|6.5.0-rc4-rt1+ #220 Not tainted
|------------------------------------------------------
|ktimers/0/15 is trying to acquire lock:
|ffff88817b41b6d8 ((softirq_ctrl.lock)){+.+.}-{2:2}, at: __local_bh_disable_ip+0xb7/0x1a0
|
|but task is already holding lock:
|ffff88817b41c820 (&base->expiry_lock){+.+.}-{2:2}, at: run_timer_softirq+0x61/0x3f0
|
|which lock already depends on the new lock.
|
|
|the existing dependency chain (in reverse order) is:
|
|-> #1 (&base->expiry_lock){+.+.}-{2:2}:
|       lock_acquire+0xd4/0x2f0
|       rt_spin_lock+0x21/0xf0
|       run_timer_softirq+0x61/0x3f0
|       __do_softirq+0x19b/0x4cb
|       run_timersd+0x92/0xf0
|       smpboot_thread_fn+0x211/0x330
|       kthread+0x110/0x130
|       ret_from_fork+0x2b/0x40
|       ret_from_fork_asm+0x1b/0x30
|
|-> #0 ((softirq_ctrl.lock)){+.+.}-{2:2}:
|       check_prev_add+0xe2/0xd60
|       __lock_acquire+0x132d/0x1700
|       lock_acquire+0xd4/0x2f0
|       rt_spin_lock+0x21/0xf0
|       __local_bh_disable_ip+0xb7/0x1a0
|       call_timer_fn+0x172/0x310
|       run_timer_softirq+0x331/0x3f0
|       __do_softirq+0x19b/0x4cb
|       run_timersd+0x92/0xf0
|       smpboot_thread_fn+0x211/0x330
|       kthread+0x110/0x130
|       ret_from_fork+0x2b/0x40
|       ret_from_fork_asm+0x1b/0x30
|
|other info that might help us debug this:
|
| Possible unsafe locking scenario:
|
|       CPU0                    CPU1
|       ----                    ----
|  lock(&base->expiry_lock);
|                               lock((softirq_ctrl.lock));
|                               lock(&base->expiry_lock);
|  lock((softirq_ctrl.lock));
|
| *** DEADLOCK ***
|
|2 locks held by ktimers/0/15:
| #0: ffffffff826e9ce0 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x5d/0xf0
| #1: ffff88817b41c820 (&base->expiry_lock){+.+.}-{2:2}, at: run_timer_softirq+0x61/0x3f0

I posted a different series last week where I drop the lock for other
reasons at a different spot where it is safe to do so. It needs adopting
other level softirq handler (besides TIMER_SOFTIRQ) but there is no
need to deal with the individual callback.

However, how do you continue here? Assuming all timers are marked
TIMER_SOFTINTERRUPTIBLE then you could avoid the BH-lock at the
timer-softirq.
But when is a timer considered safe? Would the lack of the _bh suffix be
that and you would simply add it during your push down? 
Then you continue the same thing for the remaining softirqs. And once
you are done you would remove that RT lock within local_bh_disable()?
This isn't something a !RT user would benefit, right?

The other idea I have (besides the preemption point in each softirq
handler (mentioned earlier)) is to simple drop the BH-lock on RT. Unlike
mainline, RT wouldn't deadlock then. The only that would be missing is
synchronisation against local_bh_disable() only locking for variables.
From what I remember from the various BH-models we have in RT in the
past, that was the only thing that exploded.

Sebastian
  
Frederic Weisbecker Aug. 7, 2023, 3:22 p.m. UTC | #2
On Mon, Aug 07, 2023 at 02:50:20PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-08-01 15:24:35 [+0200], Frederic Weisbecker wrote:
> > Networking softirqs can take time, holding the execution of block,
> > tasklets, timers and RCU callbacks.
> > 
> > People fight hard through this big softirq lock, proposing more and
> > more hacks over the years to deal with the resulting fundamental
> > unfairness that is not only a problem for RT users.
> > 
> > Here is a proposal for an entrypoint to dealing with that issue in the
> > long term. The purpose is to adopt a similar journey to the one we took
> > with the BKL push-down but with timers. Most timers are unrelated to
> > other softirq vectors, those can simply be tagged with the new
> > TIMER_SOFTINTERRUPTIBLE flag that makes a callback soft-interruptible.
> > The others can carry the TIMER_SOFTINTERRUPTIBLE after they get converted
> > to use appropriate synchronization against other vectors callbacks
> > (using spin_lock_bh() for example).
> 
> This doesn't work as proposed because of lock ordering:
> |======================================================
> |WARNING: possible circular locking dependency detected
> |6.5.0-rc4-rt1+ #220 Not tainted
> |------------------------------------------------------
> |ktimers/0/15 is trying to acquire lock:
> |ffff88817b41b6d8 ((softirq_ctrl.lock)){+.+.}-{2:2}, at: __local_bh_disable_ip+0xb7/0x1a0
> |
> |but task is already holding lock:
> |ffff88817b41c820 (&base->expiry_lock){+.+.}-{2:2}, at: run_timer_softirq+0x61/0x3f0
> |
> |which lock already depends on the new lock.
> |
> |
> |the existing dependency chain (in reverse order) is:
> |
> |-> #1 (&base->expiry_lock){+.+.}-{2:2}:
> |       lock_acquire+0xd4/0x2f0
> |       rt_spin_lock+0x21/0xf0
> |       run_timer_softirq+0x61/0x3f0
> |       __do_softirq+0x19b/0x4cb
> |       run_timersd+0x92/0xf0
> |       smpboot_thread_fn+0x211/0x330
> |       kthread+0x110/0x130
> |       ret_from_fork+0x2b/0x40
> |       ret_from_fork_asm+0x1b/0x30
> |
> |-> #0 ((softirq_ctrl.lock)){+.+.}-{2:2}:
> |       check_prev_add+0xe2/0xd60
> |       __lock_acquire+0x132d/0x1700
> |       lock_acquire+0xd4/0x2f0
> |       rt_spin_lock+0x21/0xf0
> |       __local_bh_disable_ip+0xb7/0x1a0
> |       call_timer_fn+0x172/0x310
> |       run_timer_softirq+0x331/0x3f0
> |       __do_softirq+0x19b/0x4cb
> |       run_timersd+0x92/0xf0
> |       smpboot_thread_fn+0x211/0x330
> |       kthread+0x110/0x130
> |       ret_from_fork+0x2b/0x40
> |       ret_from_fork_asm+0x1b/0x30
> |
> |other info that might help us debug this:
> |
> | Possible unsafe locking scenario:
> |
> |       CPU0                    CPU1
> |       ----                    ----
> |  lock(&base->expiry_lock);
> |                               lock((softirq_ctrl.lock));
> |                               lock(&base->expiry_lock);
> |  lock((softirq_ctrl.lock));
> |
> | *** DEADLOCK ***
> |
> |2 locks held by ktimers/0/15:
> | #0: ffffffff826e9ce0 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x5d/0xf0
> | #1: ffff88817b41c820 (&base->expiry_lock){+.+.}-{2:2}, at:
> |run_timer_softirq+0x61/0x3f0

Right, need to pull that to the caller before it releases the lock.

> 
> I posted a different series last week where I drop the lock for other
> reasons at a different spot where it is safe to do so. It needs adopting
> other level softirq handler (besides TIMER_SOFTIRQ) but there is no
> need to deal with the individual callback.

I've seen that yes, I have yet to review deeper but it looks like a good idea
in any case to have a preemption point between timer callbacks.

RCU already does a similar thing from its rcu boost processing with
cond_resched_tasks_rcu_qs().

> 
> However, how do you continue here? Assuming all timers are marked
> TIMER_SOFTINTERRUPTIBLE then you could avoid the BH-lock at the
> timer-softirq.
> But when is a timer considered safe? Would the lack of the _bh suffix be
> that and you would simply add it during your push down?

Yeah that requires manual inspection. A timer that obviously doesn't mess
up with other softirqs, as is the case most of the time, can simply get the flag.

Other timers can be dealt with individually with local_bh_disable() or
spin_lock_bh() or critical section.

> Then you continue the same thing for the remaining softirqs. And once
> you are done you would remove that RT lock within local_bh_disable()?
> This isn't something a !RT user would benefit, right?

Why not? A long lasting ksoftirqd handling lots of NET_RX could be
interrupted by a timer/rcu softirq in !RT for example. Further, there
could even be more than one ksoftirqd if necessary, though I doubt it.

> The other idea I have (besides the preemption point in each softirq
> handler (mentioned earlier)) is to simple drop the BH-lock on RT. Unlike
> mainline, RT wouldn't deadlock then. The only that would be missing is
> synchronisation against local_bh_disable() only locking for variables.
> From what I remember from the various BH-models we have in RT in the
> past, that was the only thing that exploded.

I thought the issue was about timers fiddling with per-cpu state assuming
they wouldn't be disturbed by other vectors and thus they lack
local_bh_disable() on appropriate places. Or perhaps I misunderstood?

Otherwise all timers can carry TIMER_SOFTINTERRUPTIBLE right away, right?

Thanks.
  
Sebastian Andrzej Siewior Aug. 8, 2023, 7:15 a.m. UTC | #3
On 2023-08-07 17:22:10 [+0200], Frederic Weisbecker wrote:
> > However, how do you continue here? Assuming all timers are marked
> > TIMER_SOFTINTERRUPTIBLE then you could avoid the BH-lock at the
> > timer-softirq.
> > But when is a timer considered safe? Would the lack of the _bh suffix be
> > that and you would simply add it during your push down?
> 
> Yeah that requires manual inspection. A timer that obviously doesn't mess
> up with other softirqs, as is the case most of the time, can simply get the flag.
> 
> Other timers can be dealt with individually with local_bh_disable() or
> spin_lock_bh() or critical section.

The question is what keeps _bh() doing if you intend to get rid of the
lock.

> > Then you continue the same thing for the remaining softirqs. And once
> > you are done you would remove that RT lock within local_bh_disable()?
> > This isn't something a !RT user would benefit, right?
> 
> Why not? A long lasting ksoftirqd handling lots of NET_RX could be
> interrupted by a timer/rcu softirq in !RT for example. Further, there
> could even be more than one ksoftirqd if necessary, though I doubt it.

That would require more thinking on how to accomplish this. It is
generally assumed that a softirq callback makes always progress if you
look at it from a different CPU. Therefore a loop like in
__timer_delete_sync() is "okay" because the callback will continue and
finish. If you drop the BH bits from the preemption pointer and allow
preemption then the other CPU could spin until that timer gets back and
continues to work. There is more of this kind of logic…

But I don't think that you want ksoftirqd handling NET_RX in the first
place. However having a long running NET_RX is a problem in RT because
it blocks all other force-threaded interrupts. Mainline doesn't have
this problem unless it uses threaded interrupts.
The other problematic thing is that everything is mixed up. So you can't
really distinguish between TASKLET as in USB vs SCSI. Not to mention
NET_RX vs TASKLET. If you don't have luxury to move them to different
CPU I *think* assigning them a priority would be a good thing.

> > The other idea I have (besides the preemption point in each softirq
> > handler (mentioned earlier)) is to simple drop the BH-lock on RT. Unlike
> > mainline, RT wouldn't deadlock then. The only that would be missing is
> > synchronisation against local_bh_disable() only locking for variables.
> > From what I remember from the various BH-models we have in RT in the
> > past, that was the only thing that exploded.
> 
> I thought the issue was about timers fiddling with per-cpu state assuming
> they wouldn't be disturbed by other vectors and thus they lack
> local_bh_disable() on appropriate places. Or perhaps I misunderstood?
> 
> Otherwise all timers can carry TIMER_SOFTINTERRUPTIBLE right away, right?

The last time I have been looking at it is just the usage of per-CPU
variables. For instance assume a timer invokes napi_consume_skb().

> Thanks.

Sebastian