[net-next] net/core: add optional threading for backlog processing

Message ID 20230324171314.73537-1-nbd@nbd.name
State New
Series [net-next] net/core: add optional threading for backlog processing

Commit Message

Felix Fietkau March 24, 2023, 5:13 p.m. UTC
  When dealing with few flows or an imbalance on CPU utilization, static RPS
CPU assignment can be too inflexible. Add support for enabling threaded NAPI
for backlog processing in order to allow the scheduler to better balance
processing. This helps better spread the load across idle CPUs.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
---
PATCH:
 - add missing process_queue_empty initialization
 - fix kthread leak
 - add documentation
RFC v3:
 - make patch more generic, applies to backlog processing in general
 - fix process queue access on flush
RFC v2:
 - fix rebase error in rps locking
 Documentation/admin-guide/sysctl/net.rst |  9 +++
 Documentation/networking/scaling.rst     | 20 ++++++
 include/linux/netdevice.h                |  2 +
 net/core/dev.c                           | 82 ++++++++++++++++++++++--
 net/core/sysctl_net_core.c               | 27 ++++++++
 5 files changed, 135 insertions(+), 5 deletions(-)
  

Comments

Felix Fietkau March 24, 2023, 5:35 p.m. UTC | #1
On 24.03.23 18:20, Jakub Kicinski wrote:
> On Fri, 24 Mar 2023 18:13:14 +0100 Felix Fietkau wrote:
>> When dealing with few flows or an imbalance on CPU utilization, static RPS
>> CPU assignment can be too inflexible. Add support for enabling threaded NAPI
>> for backlog processing in order to allow the scheduler to better balance
>> processing. This helps better spread the load across idle CPUs.
> 
> Can you explain the use case a little bit more?

I'm primarily testing this on routers with 2 or 4 CPUs and limited 
processing power, handling routing/NAT. RPS is typically needed to 
properly distribute the load across all available CPUs. When there is 
only a small number of flows that are pushing a lot of traffic, a static 
RPS assignment often leaves some CPUs idle, whereas others become a 
bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but 
CPUs can become bottlenecked and fully loaded by a NAPI thread alone.

Making backlog processing threaded helps split up the processing work 
even more and distribute it onto remaining idle CPUs.

It can basically be used to make RPS a bit more dynamic and 
configurable, because you can assign multiple backlog threads to a set 
of CPUs and selectively steer packets from specific devices / rx queues 
to them and allow the scheduler to take care of the rest.
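
As a rough illustration, a minimal userspace sketch (assuming a 4-CPU
system and a hypothetical single-queue device eth0; the mask value is
just an example) could look like this:

  /* Enable threaded NAPI and threaded backlog processing, then steer
   * eth0's RPS traffic to backlog threads 2 and 3 (mask 0xc). With
   * backlog_threaded=1 the rps_cpus bits select napi/backlog-<n>
   * threads rather than fixed CPUs.
   */
  #include <stdio.h>

  static int write_str(const char *path, const char *val)
  {
          FILE *f = fopen(path, "w");

          if (!f) {
                  perror(path);
                  return -1;
          }
          fputs(val, f);
          fclose(f);
          return 0;
  }

  int main(void)
  {
          write_str("/sys/class/net/eth0/threaded", "1");
          write_str("/proc/sys/net/core/backlog_threaded", "1");
          write_str("/sys/class/net/eth0/queues/rx-0/rps_cpus", "c");
          return 0;
  }

The scheduler (or an explicit CPU affinity on the napi/backlog-<n>
threads) then decides where that work actually runs.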

- Felix
  
Jakub Kicinski March 24, 2023, 5:47 p.m. UTC | #2
On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
> I'm primarily testing this on routers with 2 or 4 CPUs and limited 
> processing power, handling routing/NAT. RPS is typically needed to 
> properly distribute the load across all available CPUs. When there is 
> only a small number of flows that are pushing a lot of traffic, a static 
> RPS assignment often leaves some CPUs idle, whereas others become a 
> bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but 
> CPUs can become bottlenecked and fully loaded by a NAPI thread alone.

The NAPI thread becomes a bottleneck with RPS enabled?

> Making backlog processing threaded helps split up the processing work 
> even more and distribute it onto remaining idle CPUs.

You'd want to have both threaded NAPI and threaded backlog enabled?

> It can basically be used to make RPS a bit more dynamic and 
> configurable, because you can assign multiple backlog threads to a set 
> of CPUs and selectively steer packets from specific devices / rx queues 

Can you give an example?

With the 4 CPU example, in case 2 queues are very busy - you're trying
to make sure that the RPS does not end up landing on the same CPU as
the other busy queue?

> to them and allow the scheduler to take care of the rest.

You trust the scheduler much more than I do, I think :)
  
Felix Fietkau March 24, 2023, 5:57 p.m. UTC | #3
On 24.03.23 18:47, Jakub Kicinski wrote:
> On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
>> I'm primarily testing this on routers with 2 or 4 CPUs and limited 
>> processing power, handling routing/NAT. RPS is typically needed to 
>> properly distribute the load across all available CPUs. When there is 
>> only a small number of flows that are pushing a lot of traffic, a static 
>> RPS assignment often leaves some CPUs idle, whereas others become a 
>> bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but 
>> CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
> 
> The NAPI thread becomes a bottleneck with RPS enabled?

The devices that I work with often only have a single rx queue. That can
easily become a bottleneck.

>> Making backlog processing threaded helps split up the processing work 
>> even more and distribute it onto remaining idle CPUs.
> 
> You'd want to have both threaded NAPI and threaded backlog enabled?

Yes

>> It can basically be used to make RPS a bit more dynamic and 
>> configurable, because you can assign multiple backlog threads to a set 
>> of CPUs and selectively steer packets from specific devices / rx queues 
> 
> Can you give an example?
> 
> With the 4 CPU example, in case 2 queues are very busy - you're trying
> to make sure that the RPS does not end up landing on the same CPU as
> the other busy queue?

In this part I'm thinking about bigger systems where you want to have a
group of CPUs dedicated to dealing with network traffic without
assigning a fixed function (e.g. NAPI processing or RPS target) to each
one, allowing for more dynamic processing.
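
To sketch that idea (hypothetical example: one napi/backlog-<n> kthread
is restricted to a dedicated network CPU group 4-7, with its pid looked
up by thread name beforehand):

  /* Pin a napi/backlog-<n> kthread, given its pid, to CPUs 4-7 so a
   * dedicated group of cores handles network processing while the
   * scheduler balances the threads within that group.
   */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
          cpu_set_t set;
          int cpu;

          if (argc != 2) {
                  fprintf(stderr, "usage: %s <backlog thread pid>\n", argv[0]);
                  return 1;
          }

          CPU_ZERO(&set);
          for (cpu = 4; cpu <= 7; cpu++)
                  CPU_SET(cpu, &set);

          if (sched_setaffinity(atoi(argv[1]), sizeof(set), &set)) {
                  perror("sched_setaffinity");
                  return 1;
          }
          return 0;
  }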

>> to them and allow the scheduler to take care of the rest.
> 
> You trust the scheduler much more than I do, I think :)

In my tests it brings down latency (both avg and p99) considerably in
some cases. I posted some numbers here:
https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/

- Felix
  
Jakub Kicinski March 25, 2023, 3:19 a.m. UTC | #4
On Fri, 24 Mar 2023 18:57:03 +0100 Felix Fietkau wrote:
> >> It can basically be used to make RPS a bit more dynamic and 
> >> configurable, because you can assign multiple backlog threads to a set 
> >> of CPUs and selectively steer packets from specific devices / rx queues   
> > 
> > Can you give an example?
> > 
> > With the 4 CPU example, in case 2 queues are very busy - you're trying
> > to make sure that the RPS does not end up landing on the same CPU as
> > the other busy queue?  
> 
> In this part I'm thinking about bigger systems where you want to have a
> group of CPUs dedicated to dealing with network traffic without
> assigning a fixed function (e.g. NAPI processing or RPS target) to each
> one, allowing for more dynamic processing.

I tried the threaded NAPI on larger systems and helped others try,
and so far it's not been beneficial :( Even the load balancing
improvements are not significant enough to use it, and there
is a large risk of the scheduler making the wrong decision.

Hence my questioning - I'm trying to understand what you're doing
differently.

> >> to them and allow the scheduler to take care of the rest.  
> > 
> > You trust the scheduler much more than I do, I think :)  
> 
> In my tests it brings down latency (both avg and p99) considerably in
> some cases. I posted some numbers here:
> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/

Could you provide the full configuration for this test?
In non-threaded mode the RPS is enabled to spread over remaining 
3 cores?
  
Felix Fietkau March 25, 2023, 5:42 a.m. UTC | #5
On 25.03.23 04:19, Jakub Kicinski wrote:
> On Fri, 24 Mar 2023 18:57:03 +0100 Felix Fietkau wrote:
>> >> It can basically be used to make RPS a bit more dynamic and 
>> >> configurable, because you can assign multiple backlog threads to a set 
>> >> of CPUs and selectively steer packets from specific devices / rx queues   
>> > 
>> > Can you give an example?
>> > 
>> > With the 4 CPU example, in case 2 queues are very busy - you're trying
>> > to make sure that the RPS does not end up landing on the same CPU as
>> > the other busy queue?  
>> 
>> In this part I'm thinking about bigger systems where you want to have a
>> group of CPUs dedicated to dealing with network traffic without
>> assigning a fixed function (e.g. NAPI processing or RPS target) to each
>> one, allowing for more dynamic processing.
> 
> I tried the threaded NAPI on larger systems and helped others try,
> and so far it's not been beneficial :( Even the load balancing
> improvements are not significant enough to use it, and there
> is a large risk of scheduler making the wrong decision.
> 
> Hence my questioning - I'm trying to understand what you're doing
> differently.
I didn't actually run any tests on bigger systems myself, so I don't 
know how to tune it for those.

>> >> to them and allow the scheduler to take care of the rest.  
>> > 
>> > You trust the scheduler much more than I do, I think :)  
>> 
>> In my tests it brings down latency (both avg and p99) considerably in
>> some cases. I posted some numbers here:
>> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/
> 
> Could you provide the full configuration for this test?
> In non-threaded mode the RPS is enabled to spread over remaining
> 3 cores?
In this test I'm using threaded NAPI and backlog_threaded without any 
fixed core assignment.

- Felix
  
Jakub Kicinski March 28, 2023, 2:06 a.m. UTC | #6
On Sat, 25 Mar 2023 06:42:43 +0100 Felix Fietkau wrote:
> >> In my tests it brings down latency (both avg and p99) considerably in
> >> some cases. I posted some numbers here:
> >> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/  
> > 
> > Could you provide the full configuration for this test?
> > In non-threaded mode the RPS is enabled to spread over remaining
> > 3 cores? 
> 
> In this test I'm using threaded NAPI and backlog_threaded without any 
> fixed core assignment.

I was asking about the rps_threaded=0 side of the comparison.
So you're saying on that side you were using threaded NAPI with 
no pinning and RPS across all cores?
  
Paolo Abeni March 28, 2023, 9:29 a.m. UTC | #7
On Fri, 2023-03-24 at 18:57 +0100, Felix Fietkau wrote:
> On 24.03.23 18:47, Jakub Kicinski wrote:
> > On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
> > > I'm primarily testing this on routers with 2 or 4 CPUs and limited 
> > > processing power, handling routing/NAT. RPS is typically needed to 
> > > properly distribute the load across all available CPUs. When there is 
> > > only a small number of flows that are pushing a lot of traffic, a static 
> > > RPS assignment often leaves some CPUs idle, whereas others become a 
> > > bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but 
> > > CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
> > 
> > The NAPI thread becomes a bottleneck with RPS enabled?
> 
> The devices that I work with often only have a single rx queue. That can
> easily become a bottleneck.
> 
> > > Making backlog processing threaded helps split up the processing work 
> > > even more and distribute it onto remaining idle CPUs.
> > 
> > You'd want to have both threaded NAPI and threaded backlog enabled?
> 
> Yes
> 
> > > It can basically be used to make RPS a bit more dynamic and 
> > > configurable, because you can assign multiple backlog threads to a set 
> > > of CPUs and selectively steer packets from specific devices / rx queues 
> > 
> > Can you give an example?
> > 
> > With the 4 CPU example, in case 2 queues are very busy - you're trying
> > to make sure that the RPS does not end up landing on the same CPU as
> > the other busy queue?
> 
> In this part I'm thinking about bigger systems where you want to have a
> group of CPUs dedicated to dealing with network traffic without
> assigning a fixed function (e.g. NAPI processing or RPS target) to each
> one, allowing for more dynamic processing.
> 
> > > to them and allow the scheduler to take care of the rest.
> > 
> > You trust the scheduler much more than I do, I think :)
> 
> In my tests it brings down latency (both avg and p99) considerably in
> some cases. I posted some numbers here:
> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/

It's still not 110% clear to me why/how this additional thread could
reduce latency. What/which threads are competing for the busy CPU[s]? I
suspect it could be easier/cleaner to move the other (non-RPS)
threads away.

Cheers,

Paolo
  
Felix Fietkau March 28, 2023, 9:45 a.m. UTC | #8
On 28.03.23 11:29, Paolo Abeni wrote:
> On Fri, 2023-03-24 at 18:57 +0100, Felix Fietkau wrote:
>> On 24.03.23 18:47, Jakub Kicinski wrote:
>> > On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
>> > > I'm primarily testing this on routers with 2 or 4 CPUs and limited 
>> > > processing power, handling routing/NAT. RPS is typically needed to 
>> > > properly distribute the load across all available CPUs. When there is 
>> > > only a small number of flows that are pushing a lot of traffic, a static 
>> > > RPS assignment often leaves some CPUs idle, whereas others become a 
>> > > bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but 
>> > > CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
>> > 
>> > The NAPI thread becomes a bottleneck with RPS enabled?
>> 
>> The devices that I work with often only have a single rx queue. That can
>> easily become a bottleneck.
>> 
>> > > Making backlog processing threaded helps split up the processing work 
>> > > even more and distribute it onto remaining idle CPUs.
>> > 
>> > You'd want to have both threaded NAPI and threaded backlog enabled?
>> 
>> Yes
>> 
>> > > It can basically be used to make RPS a bit more dynamic and 
>> > > configurable, because you can assign multiple backlog threads to a set 
>> > > of CPUs and selectively steer packets from specific devices / rx queues 
>> > 
>> > Can you give an example?
>> > 
>> > With the 4 CPU example, in case 2 queues are very busy - you're trying
>> > to make sure that the RPS does not end up landing on the same CPU as
>> > the other busy queue?
>> 
>> In this part I'm thinking about bigger systems where you want to have a
>> group of CPUs dedicated to dealing with network traffic without
>> assigning a fixed function (e.g. NAPI processing or RPS target) to each
>> one, allowing for more dynamic processing.
>> 
>> > > to them and allow the scheduler to take care of the rest.
>> > 
>> > You trust the scheduler much more than I do, I think :)
>> 
>> In my tests it brings down latency (both avg and p99) considerably in
>> some cases. I posted some numbers here:
>> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/
> 
> It's still not 110% clear to me why/how this additional thread could
> reduce latency. What/which threads are competing for the busy CPU[s]? I
> suspect it could be easier/cleaner move away the others (non RPS)
> threads.
In the tests that I'm doing, network processing load from routing/NAT is 
enough to occupy all available CPUs.
If I dedicate the NAPI thread to one core and use RPS to steer packet 
processing to the other cores, the core taking care of NAPI has some 
idle cycles that go to waste, while the other cores are busy.
If I include the core in the RPS mask, it can take too much away from 
the NAPI thread.
Also, with a smaller number of flows and depending on the hash 
distribution of those flows, there can be an imbalance between how much 
processing each RPS instance gets, making this even more difficult to 
get right.
With the extra threading it works out better in my tests, because the 
scheduler can deal with these kinds of imbalances.

- Felix
  
Felix Fietkau March 28, 2023, 9:46 a.m. UTC | #9
On 28.03.23 04:06, Jakub Kicinski wrote:
> On Sat, 25 Mar 2023 06:42:43 +0100 Felix Fietkau wrote:
>> >> In my tests it brings down latency (both avg and p99) considerably in
>> >> some cases. I posted some numbers here:
>> >> https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/  
>> > 
>> > Could you provide the full configuration for this test?
>> > In non-threaded mode the RPS is enabled to spread over remaining
>> > 3 cores? 
>> 
>> In this test I'm using threaded NAPI and backlog_threaded without any 
>> fixed core assignment.
> 
> I was asking about the rps_threaded=0 side of the comparison.
> So you're saying on that side you were using threaded NAPI with
> no pinning and RPS across all cores?
Yes.

- Felix
  
Paolo Abeni March 28, 2023, 3:13 p.m. UTC | #10
On Tue, 2023-03-28 at 11:45 +0200, Felix Fietkau wrote:
> On 28.03.23 11:29, Paolo Abeni wrote:
> > On Fri, 2023-03-24 at 18:57 +0100, Felix Fietkau wrote:
> > > On 24.03.23 18:47, Jakub Kicinski wrote:
> > > > On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
> > > > > I'm primarily testing this on routers with 2 or 4 CPUs and limited 
> > > > > processing power, handling routing/NAT. RPS is typically needed to 
> > > > > properly distribute the load across all available CPUs. When there is 
> > > > > only a small number of flows that are pushing a lot of traffic, a static 
> > > > > RPS assignment often leaves some CPUs idle, whereas others become a 
> > > > > bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but 
> > > > > CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
> > > > 
> > > > The NAPI thread becomes a bottleneck with RPS enabled?
> > > 
> > > The devices that I work with often only have a single rx queue. That can
> > > easily become a bottleneck.
> > > 
> > > > > Making backlog processing threaded helps split up the processing work 
> > > > > even more and distribute it onto remaining idle CPUs.
> > > > 
> > > > You'd want to have both threaded NAPI and threaded backlog enabled?
> > > 
> > > Yes
> > > 
> > > > > It can basically be used to make RPS a bit more dynamic and 
> > > > > configurable, because you can assign multiple backlog threads to a set 
> > > > > of CPUs and selectively steer packets from specific devices / rx queues 
> > > > 
> > > > Can you give an example?
> > > > 
> > > > With the 4 CPU example, in case 2 queues are very busy - you're trying
> > > > to make sure that the RPS does not end up landing on the same CPU as
> > > > the other busy queue?
> > > 
> > > In this part I'm thinking about bigger systems where you want to have a
> > > group of CPUs dedicated to dealing with network traffic without
> > > assigning a fixed function (e.g. NAPI processing or RPS target) to each
> > > one, allowing for more dynamic processing.
> > > 
> > > > > to them and allow the scheduler to take care of the rest.
> > > > 
> > > > You trust the scheduler much more than I do, I think :)
> > > 
> > > In my tests it brings down latency (both avg and p99) considerably in
> > > some cases. I posted some numbers here:
> > > https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/
> > 
> > It's still not 110% clear to me why/how this additional thread could
> > reduce latency. What/which threads are competing for the busy CPU[s]? I
> > suspect it could be easier/cleaner move away the others (non RPS)
> > threads.
> In the tests that I'm doing, network processing load from routing/NAT is 
> enough to occupy all available CPUs.
> If I dedicate the NAPI thread to one core and use RPS to steer packet 
> processing to the other cores, the core taking care of NAPI has some 
> idle cycles that go to waste, while the other cores are busy.
> If I include the core in the RPS mask, it can take too much away from 
> the NAPI thread.

I feel like I'm missing some relevant points.

If RPS keeps the target CPU fully busy, moving RPS processing to a
separate thread still will not allow using more CPU time.

Which NIC driver are you using?

thanks!

Paolo
  
Felix Fietkau March 28, 2023, 3:21 p.m. UTC | #11
On 28.03.23 17:13, Paolo Abeni wrote:
> On Tue, 2023-03-28 at 11:45 +0200, Felix Fietkau wrote:
>> On 28.03.23 11:29, Paolo Abeni wrote:
>> > On Fri, 2023-03-24 at 18:57 +0100, Felix Fietkau wrote:
>> > > On 24.03.23 18:47, Jakub Kicinski wrote:
>> > > > On Fri, 24 Mar 2023 18:35:00 +0100 Felix Fietkau wrote:
>> > > > > I'm primarily testing this on routers with 2 or 4 CPUs and limited 
>> > > > > processing power, handling routing/NAT. RPS is typically needed to 
>> > > > > properly distribute the load across all available CPUs. When there is 
>> > > > > only a small number of flows that are pushing a lot of traffic, a static 
>> > > > > RPS assignment often leaves some CPUs idle, whereas others become a 
>> > > > > bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but 
>> > > > > CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
>> > > > 
>> > > > The NAPI thread becomes a bottleneck with RPS enabled?
>> > > 
>> > > The devices that I work with often only have a single rx queue. That can
>> > > easily become a bottleneck.
>> > > 
>> > > > > Making backlog processing threaded helps split up the processing work 
>> > > > > even more and distribute it onto remaining idle CPUs.
>> > > > 
>> > > > You'd want to have both threaded NAPI and threaded backlog enabled?
>> > > 
>> > > Yes
>> > > 
>> > > > > It can basically be used to make RPS a bit more dynamic and 
>> > > > > configurable, because you can assign multiple backlog threads to a set 
>> > > > > of CPUs and selectively steer packets from specific devices / rx queues 
>> > > > 
>> > > > Can you give an example?
>> > > > 
>> > > > With the 4 CPU example, in case 2 queues are very busy - you're trying
>> > > > to make sure that the RPS does not end up landing on the same CPU as
>> > > > the other busy queue?
>> > > 
>> > > In this part I'm thinking about bigger systems where you want to have a
>> > > group of CPUs dedicated to dealing with network traffic without
>> > > assigning a fixed function (e.g. NAPI processing or RPS target) to each
>> > > one, allowing for more dynamic processing.
>> > > 
>> > > > > to them and allow the scheduler to take care of the rest.
>> > > > 
>> > > > You trust the scheduler much more than I do, I think :)
>> > > 
>> > > In my tests it brings down latency (both avg and p99) considerably in
>> > > some cases. I posted some numbers here:
>> > > https://lore.kernel.org/netdev/e317d5bc-cc26-8b1b-ca4b-66b5328683c4@nbd.name/
>> > 
>> > It's still not 110% clear to me why/how this additional thread could
>> > reduce latency. What/which threads are competing for the busy CPU[s]? I
>> > suspect it could be easier/cleaner move away the others (non RPS)
>> > threads.
>> In the tests that I'm doing, network processing load from routing/NAT is 
>> enough to occupy all available CPUs.
>> If I dedicate the NAPI thread to one core and use RPS to steer packet 
>> processing to the other cores, the core taking care of NAPI has some 
>> idle cycles that go to waste, while the other cores are busy.
>> If I include the core in the RPS mask, it can take too much away from 
>> the NAPI thread.
> 
> I feel like I'm missing some relevant points.
> 
> If RPS keeps the target CPU fully busy, moving RPS processing in a
> separate thread still will not allow using more CPU time.
RPS doesn't always keep the target CPU fully busy. The combination of 
NAPI thread + RPS threads is enough to keep the system busy. The number 
of flows is often small enough that some (but not all) RPS instances 
could be keeping their target CPU busy.

With my patch, one CPU could be busy with a NAPI thread + remaining 
cycles allocated to a bit of extra RPS work, while the other CPUs handle 
the rest of the RPS load. In reality it bounces around CPUs a bit more 
instead of sticking to this assigned setup, but it allows cores to be 
more fully utilized by the network processing load, and the resulting 
throughput/latency improves because of that.

> Which NIC driver are you using?
I've been testing on multiple platforms. Mainly mtk_eth_soc, but also bgmac.

- Felix
  
Jesper Dangaard Brouer March 29, 2023, 4:14 p.m. UTC | #12
On 24/03/2023 18.35, Felix Fietkau wrote:
> On 24.03.23 18:20, Jakub Kicinski wrote:
>> On Fri, 24 Mar 2023 18:13:14 +0100 Felix Fietkau wrote:
>>> When dealing with few flows or an imbalance on CPU utilization, static RPS
>>> CPU assignment can be too inflexible. Add support for enabling threaded NAPI
>>> for backlog processing in order to allow the scheduler to better balance
>>> processing. This helps better spread the load across idle CPUs.
>>
>> Can you explain the use case a little bit more?
> 
> I'm primarily testing this on routers with 2 or 4 CPUs and limited 
> processing power, handling routing/NAT. RPS is typically needed to 
> properly distribute the load across all available CPUs. When there is 
> only a small number of flows that are pushing a lot of traffic, a static 
> RPS assignment often leaves some CPUs idle, whereas others become a 
> bottleneck by being fully loaded. Threaded NAPI reduces this a bit, but 
> CPUs can become bottlenecked and fully loaded by a NAPI thread alone.
> 
> Making backlog processing threaded helps split up the processing work 
> even more and distribute it onto remaining idle CPUs.
> 
> It can basically be used to make RPS a bit more dynamic and 
> configurable, because you can assign multiple backlog threads to a set 
> of CPUs and selectively steer packets from specific devices / rx queues 
> to them and allow the scheduler to take care of the rest.
> 

My experience with RPS was that it was too slow on the RX-CPU.  Meaning
that it doesn't really scale, because the RX-CPU becomes the scaling
bottleneck. (My data is old and it might scale differently on your ARM
boards).

This is why I/we created the XDP "cpumap".  It also creates a kernel
threaded model via a kthread on "map-configured" CPUs.  It scales
significantly better than RPS, but it doesn't handle flows and packet
Out-of-Order (OoO) situations automatically like RPS.  That is left up
to the BPF-programmer.  The kernel sample samples/bpf/xdp_redirect_cpu[0]
has code that shows strategies for load-balancing flows.
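
A heavily simplified sketch of that model (not the actual sample code;
the per-flow hashing is omitted and the cpumap entries still have to be
populated from userspace with the target CPUs and queue sizes):

  /* Redirect received packets to kthreads on cpumap-configured CPUs.
   * A real program would hash the flow (e.g. on the source IP) so that
   * packets of one flow always land on the same CPU and stay in order.
   */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_CPUMAP);
          __uint(max_entries, 64);
          __type(key, __u32);
          __type(value, struct bpf_cpumap_val);
  } cpu_map SEC(".maps");

  SEC("xdp")
  int xdp_redirect_cpu_sketch(struct xdp_md *ctx)
  {
          __u32 cpu_idx = ctx->rx_queue_index % 2; /* placeholder "hash" */

          /* fall back to XDP_PASS if the cpumap entry is not populated */
          return bpf_redirect_map(&cpu_map, cpu_idx, XDP_PASS);
  }

  char _license[] SEC("license") = "GPL";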

The project xdp-cpumap-tc[1] runs in production (3 ISPs using this) and
works in concert with netstack Traffic Control (TC) for scaling
bandwidth shaping at the ISPs.  OoO is solved by redirecting all of a
customer's IPs to the same TX/egress CPU. As the README[1] describes, it
is recommended to reduce the number of RX-CPUs processing packets, and
have more TX-CPUs that basically run netstack/TC.  One ISP with
2x25Gbit/s is using 2 CPUs for RX and 6 CPUs for TX.

--Jesper

[0] 
https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_redirect_cpu.bpf.c
[1] https://github.com/xdp-project/xdp-cpumap-tc
  

Patch

diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 466c560b0c30..6d037633a52f 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -47,6 +47,15 @@  Table : Subdirectories in /proc/sys/net
 1. /proc/sys/net/core - Network core options
 ============================================
 
+backlog_threaded
+----------------
+
+This offloads processing of backlog (input packets steered by RPS, or
+queued because the kernel is receiving more than it can handle on the
+incoming CPU) to threads (one for each CPU) instead of processing them
+in softirq context. This can improve load balancing by allowing the
+scheduler to better spread the load across idle CPUs.
+
 bpf_jit_enable
 --------------
 
diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
index 3d435caa3ef2..ded6fc713304 100644
--- a/Documentation/networking/scaling.rst
+++ b/Documentation/networking/scaling.rst
@@ -244,6 +244,26 @@  Setting net.core.netdev_max_backlog to either 1000 or 10000
 performed well in experiments.
 
 
+Threaded Backlog
+~~~~~~~~~~~~~~~~
+
+When dealing with few flows or an imbalance on CPU utilization, static
+RPS CPU assignment can be too inflexible. Making backlog processing
+threaded can improve load balancing by allowing the scheduler to spread
+the load across idle CPUs.
+
+
+Suggested Configuration
+~~~~~~~~~~~~~~~~~~~~~~~
+
+If you have CPUs fully utilized with network processing, you can enable
+threaded backlog processing by setting /proc/sys/net/core/backlog_threaded
+to 1. Afterwards, RPS CPU configuration bits no longer refer to CPU
+numbers, but to backlog threads named napi/backlog-<n>.
+If necessary, you can change the CPU affinity of these threads to limit
+them to specific CPU cores.
+
+
 RFS: Receive Flow Steering
 ==========================
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 674ee5daa7b1..1f67a8c1349e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -524,6 +524,7 @@  static inline bool napi_complete(struct napi_struct *n)
 }
 
 int dev_set_threaded(struct net_device *dev, bool threaded);
+int backlog_set_threaded(bool threaded);
 
 /**
  *	napi_disable - prevent NAPI from scheduling
@@ -3214,6 +3215,7 @@  struct softnet_data {
 	unsigned int		cpu;
 	unsigned int		input_queue_tail;
 #endif
+	unsigned int		process_queue_empty;
 	unsigned int		received_rps;
 	unsigned int		dropped;
 	struct sk_buff_head	input_pkt_queue;
diff --git a/net/core/dev.c b/net/core/dev.c
index 7172334a418f..b029a374b5f2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4591,7 +4591,7 @@  static int napi_schedule_rps(struct softnet_data *sd)
 	struct softnet_data *mysd = this_cpu_ptr(&softnet_data);
 
 #ifdef CONFIG_RPS
-	if (sd != mysd) {
+	if (sd != mysd && !test_bit(NAPI_STATE_THREADED, &sd->backlog.state)) {
 		sd->rps_ipi_next = mysd->rps_ipi_list;
 		mysd->rps_ipi_list = sd;
 
@@ -5772,6 +5772,8 @@  static DEFINE_PER_CPU(struct work_struct, flush_works);
 /* Network device is going away, flush any packets still pending */
 static void flush_backlog(struct work_struct *work)
 {
+	unsigned int process_queue_empty;
+	bool threaded, flush_processq;
 	struct sk_buff *skb, *tmp;
 	struct softnet_data *sd;
 
@@ -5786,8 +5788,17 @@  static void flush_backlog(struct work_struct *work)
 			input_queue_head_incr(sd);
 		}
 	}
+
+	threaded = test_bit(NAPI_STATE_THREADED, &sd->backlog.state);
+	flush_processq = threaded &&
+			 !skb_queue_empty_lockless(&sd->process_queue);
+	if (flush_processq)
+		process_queue_empty = sd->process_queue_empty;
 	rps_unlock_irq_enable(sd);
 
+	if (threaded)
+		goto out;
+
 	skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
 		if (skb->dev->reg_state == NETREG_UNREGISTERING) {
 			__skb_unlink(skb, &sd->process_queue);
@@ -5795,7 +5806,16 @@  static void flush_backlog(struct work_struct *work)
 			input_queue_head_incr(sd);
 		}
 	}
+
+out:
 	local_bh_enable();
+
+	while (flush_processq) {
+		msleep(1);
+		rps_lock_irq_disable(sd);
+		flush_processq = process_queue_empty == sd->process_queue_empty;
+		rps_unlock_irq_enable(sd);
+	}
 }
 
 static bool flush_required(int cpu)
@@ -5927,16 +5947,16 @@  static int process_backlog(struct napi_struct *napi, int quota)
 		}
 
 		rps_lock_irq_disable(sd);
+		sd->process_queue_empty++;
 		if (skb_queue_empty(&sd->input_pkt_queue)) {
 			/*
 			 * Inline a custom version of __napi_complete().
-			 * only current cpu owns and manipulates this napi,
-			 * and NAPI_STATE_SCHED is the only possible flag set
-			 * on backlog.
+			 * only current cpu owns and manipulates this napi.
 			 * We can use a plain write instead of clear_bit(),
 			 * and we dont need an smp_mb() memory barrier.
 			 */
-			napi->state = 0;
+			napi->state &= ~(NAPIF_STATE_SCHED |
+					 NAPIF_STATE_SCHED_THREADED);
 			again = false;
 		} else {
 			skb_queue_splice_tail_init(&sd->input_pkt_queue,
@@ -6350,6 +6370,55 @@  int dev_set_threaded(struct net_device *dev, bool threaded)
 }
 EXPORT_SYMBOL(dev_set_threaded);
 
+int backlog_set_threaded(bool threaded)
+{
+	static bool backlog_threaded;
+	int err = 0;
+	int i;
+
+	if (backlog_threaded == threaded)
+		return 0;
+
+	for_each_possible_cpu(i) {
+		struct softnet_data *sd = &per_cpu(softnet_data, i);
+		struct napi_struct *n = &sd->backlog;
+
+		if (n->thread)
+			continue;
+		n->thread = kthread_run(napi_threaded_poll, n, "napi/backlog-%d", i);
+		if (IS_ERR(n->thread)) {
+			err = PTR_ERR(n->thread);
+			pr_err("kthread_run failed with err %d\n", err);
+			n->thread = NULL;
+			threaded = false;
+			break;
+		}
+
+	}
+
+	backlog_threaded = threaded;
+
+	/* Make sure kthread is created before THREADED bit
+	 * is set.
+	 */
+	smp_mb__before_atomic();
+
+	for_each_possible_cpu(i) {
+		struct softnet_data *sd = &per_cpu(softnet_data, i);
+		struct napi_struct *n = &sd->backlog;
+		unsigned long flags;
+
+		rps_lock_irqsave(sd, &flags);
+		if (threaded)
+			n->state |= NAPIF_STATE_THREADED;
+		else
+			n->state &= ~NAPIF_STATE_THREADED;
+		rps_unlock_irq_restore(sd, &flags);
+	}
+
+	return err;
+}
+
 void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 			   int (*poll)(struct napi_struct *, int), int weight)
 {
@@ -11108,6 +11177,9 @@  static int dev_cpu_dead(unsigned int oldcpu)
 	raise_softirq_irqoff(NET_TX_SOFTIRQ);
 	local_irq_enable();
 
+	if (test_bit(NAPI_STATE_THREADED, &oldsd->backlog.state))
+		return 0;
+
 #ifdef CONFIG_RPS
 	remsd = oldsd->rps_ipi_list;
 	oldsd->rps_ipi_list = NULL;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 74842b453407..77114cd0b021 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -30,6 +30,7 @@  static int int_3600 = 3600;
 static int min_sndbuf = SOCK_MIN_SNDBUF;
 static int min_rcvbuf = SOCK_MIN_RCVBUF;
 static int max_skb_frags = MAX_SKB_FRAGS;
+static int backlog_threaded;
 
 static int net_msg_warn;	/* Unused, but still a sysctl */
 
@@ -188,6 +189,23 @@  static int rps_sock_flow_sysctl(struct ctl_table *table, int write,
 }
 #endif /* CONFIG_RPS */
 
+static int backlog_threaded_sysctl(struct ctl_table *table, int write,
+			       void *buffer, size_t *lenp, loff_t *ppos)
+{
+	static DEFINE_MUTEX(backlog_threaded_mutex);
+	int ret;
+
+	mutex_lock(&backlog_threaded_mutex);
+
+	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (write && !ret)
+		ret = backlog_set_threaded(backlog_threaded);
+
+	mutex_unlock(&backlog_threaded_mutex);
+
+	return ret;
+}
+
 #ifdef CONFIG_NET_FLOW_LIMIT
 static DEFINE_MUTEX(flow_limit_update_mutex);
 
@@ -532,6 +550,15 @@  static struct ctl_table net_core_table[] = {
 		.proc_handler	= rps_sock_flow_sysctl
 	},
 #endif
+	{
+		.procname	= "backlog_threaded",
+		.data		= &backlog_threaded,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= backlog_threaded_sysctl,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE
+	},
 #ifdef CONFIG_NET_FLOW_LIMIT
 	{
 		.procname	= "flow_limit_cpu_bitmap",