[RFC,2/2] softirq: Drop the warning from do_softirq_post_smp_call_flush().

Message ID 20230814093528.117342-3-bigeasy@linutronix.de
State New
Series net: Use SMP threads for backlog NAPI.

Commit Message

Sebastian Andrzej Siewior Aug. 14, 2023, 9:35 a.m. UTC
Once ksoftirqd becomes active, all softirqs which were raised are no
longer processed immediately but are deferred to ksoftirqd. On PREEMPT_RT
this means that softirqs which were raised in a threaded interrupt (at
elevated process priority) are not served after the interrupt handler
completes its work but have to wait until ksoftirqd (normal priority)
gets to run on the CPU. On a busy system with plenty of RT tasks
this can be delayed for quite some time and leads to problems in
general.
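
For reference, this is roughly the deferral gate which commit
4cd13c21b207 ("softirq: Let ksoftirqd do its job") introduced and which
the revert below removes. A simplified sketch only; the real check also
special-cased certain softirqs and kthread parking over the years:

	/* Once the per-CPU ksoftirqd thread is runnable, leave freshly
	 * raised softirqs to it instead of flushing them inline. */
	static bool ksoftirqd_running(void)
	{
		struct task_struct *tsk = __this_cpu_read(ksoftirqd);

		return tsk && task_is_running(tsk);
	}

	static inline void invoke_softirq(void)
	{
		if (ksoftirqd_running())
			return;		/* defer to ksoftirqd */
		__do_softirq();		/* otherwise flush inline */
	}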

This is an undesired situation, and there have been attempts to avoid
ksoftirqd becoming scheduled in the first place. This changed with
commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
and now a threaded interrupt handler will handle soft interrupts at its
end even if ksoftirqd is pending. That means they will be processed
in the context in which they were raised.
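
The mechanism behind this is that a forced-threaded handler runs its
thread_fn within a bh-disabled section, so the closing local_bh_enable()
flushes whatever softirqs the handler raised, in the handler thread's own
context. A condensed sketch of irq_forced_thread_fn() from
kernel/irq/manage.c (irq-disable handling omitted):

	static irqreturn_t irq_forced_thread_fn(struct irq_desc *desc,
						struct irqaction *action)
	{
		irqreturn_t ret;

		local_bh_disable();
		ret = action->thread_fn(action->irq, action->dev_id);
		if (ret == IRQ_HANDLED)
			atomic_inc(&desc->threads_handled);
		irq_finalize_oneshot(desc, action);
		local_bh_enable();	/* flushes softirqs raised by thread_fn */
		return ret;
	}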

Unfortunately this means it also has to process all other soft interrupts
which were raised (or enqueued) earlier and are not yet handled. This
happens when a higher-priority thread raises a softirq and has to catch
up on the backlog. This isn't a new problem: the new high-priority thread
will PI-boost the current softirq owner, or start from scratch if
ksoftirqd wasn't running yet.

Since a pending ksoftirqd no longer blocks other interrupt threads from
handling soft interrupts, I believe the warning can be removed. The
pending softirq work has to be solved differently.

Remove the warning and update the comment.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/interrupt.h |  4 ++--
 kernel/smp.c              |  4 +---
 kernel/softirq.c          | 12 +++++-------
 3 files changed, 8 insertions(+), 12 deletions(-)
  

Comments

Jesper Dangaard Brouer Aug. 15, 2023, 12:08 p.m. UTC | #1
On 14/08/2023 11.35, Sebastian Andrzej Siewior wrote:
> This is an undesired situation, and there have been attempts to avoid
> ksoftirqd becoming scheduled in the first place. This changed with
> commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
> and now a threaded interrupt handler will handle soft interrupts at its
> end even if ksoftirqd is pending. That means they will be processed
> in the context in which they were raised.

$ git describe --contains d15121be74856
v6.5-rc1~232^2~4

That revert basically removes the "overload" protection that was added
to cope with DDoS situations in Aug 2016 (Cc. Cloudflare).  As described
in https://git.kernel.org/torvalds/c/4cd13c21b207 ("softirq: Let
ksoftirqd do its job"), in UDP overload situations where the UDP socket
receiver runs on the same CPU as ksoftirqd, it "falls-off-an-edge" and
almost stops processing packets (because softirq steals CPU/sched time
from the UDP pid).  I'm warning Cloudflare (Cc'ed) as this might affect
their production use-cases, and I recommend they get involved to evaluate
the effect of these changes.

I do realize/acknowledge that the reverted patch caused other latency
issues, given it was a "big-hammer" approach affecting other softirq
processing (as can be seen in e.g. the watchdog fix patches).
Thus, the revert makes sense, but how do we regain the "overload"
protection such that RX networking cannot starve processes reading from
the socket? (Is this what Sebastian's patchset does?)

--Jesper

Thread link for people Cc'ed: 
https://lore.kernel.org/all/20230814093528.117342-1-bigeasy@linutronix.de/#r
  
Jesper Dangaard Brouer Aug. 16, 2023, 2:48 p.m. UTC | #2
On 15/08/2023 14.08, Jesper Dangaard Brouer wrote:
> 
> 
> On 14/08/2023 11.35, Sebastian Andrzej Siewior wrote:
>> This is an undesired situation, and there have been attempts to avoid
>> ksoftirqd becoming scheduled in the first place. This changed with
>> commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
>> and now a threaded interrupt handler will handle soft interrupts at its
>> end even if ksoftirqd is pending. That means they will be processed
>> in the context in which they were raised.
> 
> $ git describe --contains d15121be74856
> v6.5-rc1~232^2~4
> 
> That revert basically removes the "overload" protection that was added
> to cope with DDoS situations in Aug 2016 (Cc. Cloudflare).  As described
> in https://git.kernel.org/torvalds/c/4cd13c21b207 ("softirq: Let
> ksoftirqd do its job"), in UDP overload situations where the UDP socket
> receiver runs on the same CPU as ksoftirqd, it "falls-off-an-edge" and
> almost stops processing packets (because softirq steals CPU/sched time
> from the UDP pid).  I'm warning Cloudflare (Cc'ed) as this might affect
> their production use-cases, and I recommend they get involved to evaluate
> the effect of these changes.
> 

I did some testing on net-next (with commit d15121be74856 ("Revert
"softirq: Let ksoftirqd do its job"")) using UDP pktgen + udp_sink.

And I observe the old overload issue occurring again, where the userspace
process (udp_sink) processes very few packets when running on the *same*
CPU as the NAPI-RX/IRQ processing.  The perf report "comm" clearly shows
that NAPI runs in the context of the "udp_sink" process, stealing its
sched time. (Same CPU around 3Kpps and diff CPU 1722Kpps, see details
below).
What happens is that NAPI takes 64 packets and queues them to the
udp_sink process' *socket*; the udp_sink process *wakes up*, processes 1
packet from the socket queue, and on exit (__local_bh_enable_ip) runs the
softirq that starts NAPI again (to process another 64 packets... repeat).
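
A condensed sketch of the exit path involved, from __local_bh_enable_ip()
in kernel/softirq.c (!PREEMPT_RT variant, lockdep and resched handling
omitted); this is where NAPI gets restarted in udp_sink's context:

	void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
	{
		/* Keep bottom halves disabled while flushing them: */
		__preempt_count_sub(cnt - 1);

		if (unlikely(!in_interrupt() && local_softirq_pending()))
			do_softirq();	/* runs the NAPI poll loop inline */

		__preempt_count_dec();
	}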


> I do realize/acknowledge that the reverted patch caused other latency
> issues, given it was a "big-hammer" approach affecting other softirq
> processing (as can be seen in e.g. the watchdog fix patches).
> Thus, the revert makes sense, but how do we regain the "overload"
> protection such that RX networking cannot starve processes reading from
> the socket? (Is this what Sebastian's patchset does?)
> 

I'm no expert in the sched / softirq area of the kernel, but I'm willing
to help out testing different solutions that can regain the "overload"
protection, e.g. avoid packet processing "falling off an edge" (and thus
leaving the kernel open to being easily DDoS'ed).
Is this what Sebastian's patchset does?


> 
> Thread link for people Cc'ed: 
> https://lore.kernel.org/all/20230814093528.117342-1-bigeasy@linutronix.de/#r

--Jesper
(some testlab results below)

[udp_sink] 
https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c


When udp_sink runs on the same CPU as NAPI/softirq:
  - UdpInDatagrams: 2,948 packets/sec

$ nstat -n && sleep 1 && nstat
#kernel
IpInReceives                    2831056            0.0
IpInDelivers                    2831053            0.0
UdpInDatagrams                  2948               0.0
UdpInErrors                     2828118            0.0
UdpRcvbufErrors                 2828118            0.0
IpExtInOctets                   130206496          0.0
IpExtInNoECTPkts                2830576            0.0

When udp_sink runs on a different CPU than NAPI-RX:
  - UdpInDatagrams: 1,722,307 packets/sec

$ nstat -n && sleep 1 && nstat
#kernel
IpInReceives                    2318560            0.0
IpInDelivers                    2318562            0.0
UdpInDatagrams                  1722307            0.0
UdpInErrors                     596280             0.0
UdpRcvbufErrors                 596280             0.0
IpExtInOctets                   106634256          0.0
IpExtInNoECTPkts                2318136            0.0
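
Reading the counters: in the same-CPU case UdpRcvbufErrors is essentially
equal to IpInReceives, i.e. nearly every arriving packet is dropped on a
full socket receive buffer because udp_sink gets almost no CPU time; on a
different CPU the sink consumes roughly 75% of the ~2.3Mpps offered load.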
  

Patch

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a92bce40b04b3..5143ae0ea9356 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -590,9 +590,9 @@  asmlinkage void do_softirq(void);
 asmlinkage void __do_softirq(void);
 
 #ifdef CONFIG_PREEMPT_RT
-extern void do_softirq_post_smp_call_flush(unsigned int was_pending);
+extern void do_softirq_post_smp_call_flush(void);
 #else
-static inline void do_softirq_post_smp_call_flush(unsigned int unused)
+static inline void do_softirq_post_smp_call_flush(void)
 {
 	do_softirq();
 }
diff --git a/kernel/smp.c b/kernel/smp.c
index 385179dae360e..cd7db5ffe95ab 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -554,7 +554,6 @@  static void __flush_smp_call_function_queue(bool warn_cpu_offline)
  */
 void flush_smp_call_function_queue(void)
 {
-	unsigned int was_pending;
 	unsigned long flags;
 
 	if (llist_empty(this_cpu_ptr(&call_single_queue)))
@@ -562,10 +561,9 @@  void flush_smp_call_function_queue(void)
 
 	local_irq_save(flags);
 	/* Get the already pending soft interrupts for RT enabled kernels */
-	was_pending = local_softirq_pending();
 	__flush_smp_call_function_queue(true);
 	if (local_softirq_pending())
-		do_softirq_post_smp_call_flush(was_pending);
+		do_softirq_post_smp_call_flush();
 
 	local_irq_restore(flags);
 }
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 807b34ccd7973..aa299cb3ff47b 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -281,15 +281,13 @@  static inline void invoke_softirq(void)
 
 /*
  * flush_smp_call_function_queue() can raise a soft interrupt in a function
- * call. On RT kernels this is undesired and the only known functionality
- * in the block layer which does this is disabled on RT. If soft interrupts
- * get raised which haven't been raised before the flush, warn so it can be
- * investigated.
+ * call. On RT kernels this is undesired because the work is no longer processed
+ * in the context where it originated. It is not especially harmful but best to
+ * be avoided.
  */
-void do_softirq_post_smp_call_flush(unsigned int was_pending)
+void do_softirq_post_smp_call_flush(void)
 {
-	if (WARN_ON_ONCE(was_pending != local_softirq_pending()))
-		invoke_softirq();
+	invoke_softirq();
 }
 
 #else /* CONFIG_PREEMPT_RT */