[net-next,00/24] locking: Introduce nested-BH locking.

Message ID 20231215171020.687342-1-bigeasy@linutronix.de
Series locking: Introduce nested-BH locking.

Message

Sebastian Andrzej Siewior Dec. 15, 2023, 5:07 p.m. UTC
  Hi,

Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within
a local_bh_disable() section remains preemptible. As a result, high-priority
tasks (or threaded interrupts) will be blocked by long-running lower-priority
tasks (or threaded interrupts), which includes softirq sections.

The proposed way out is to introduce explicit per-CPU locks for
resources which are protected by local_bh_disable() and use those only
on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

The series introduces the infrastructure and converts large parts of
networking, which is the largest stakeholder here. Once this is done, the
per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.
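
For example, per-CPU data that today relies on local_bh_disable() gains
an explicit local_lock_t which becomes a real lock only on PREEMPT_RT.
A rough sketch of the intended pattern; the struct, variable and function
names are made up for illustration, the nested-BH lock API is what this
series adds:

| struct bh_data {
| 	local_lock_t	bh_lock;
| 	unsigned int	counter;
| };
| static DEFINE_PER_CPU(struct bh_data, bh_data) = {
| 	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
| };
| 
| static void update_bh_data(void)
| {
| 	local_bh_disable();
| 	/* lockdep annotation only on !PREEMPT_RT, a per-CPU lock on RT */
| 	local_lock_nested_bh(&bh_data.bh_lock);
| 	__this_cpu_inc(bh_data.counter);
| 	local_unlock_nested_bh(&bh_data.bh_lock);
| 	local_bh_enable();
| }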

Sebastian
  

Comments

Jakub Kicinski Dec. 15, 2023, 10:50 p.m. UTC | #1
On Fri, 15 Dec 2023 18:07:19 +0100 Sebastian Andrzej Siewior wrote:
> The proposed way out is to introduce explicit per-CPU locks for
> resources which are protected by local_bh_disable() and use those only
> on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

As I said at LPC, complicating drivers with odd locking constructs
is a no go for me.
  
Sebastian Andrzej Siewior Dec. 18, 2023, 5:23 p.m. UTC | #2
On 2023-12-15 14:50:59 [-0800], Jakub Kicinski wrote:
> On Fri, 15 Dec 2023 18:07:19 +0100 Sebastian Andrzej Siewior wrote:
> > The proposed way out is to introduce explicit per-CPU locks for
> > resources which are protected by local_bh_disable() and use those only
> > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
> 
> As I said at LPC, complicating drivers with odd locking constructs
> is a no go for me.

I misunderstood it then; I assumed you wanted to ease the work while I
went through every driver after (hopefully) understanding what is
possible/needed and what is not. We are talking about 15+ drivers here?

Now. The pattern is usually
|	act = bpf_prog_run_xdp(xdp_prog, &xdp);
|	switch (act) {
|	case XDP_REDIRECT:
|		ret = xdp_do_redirect(netdev, &xdp, xdp_prog);
|		if (ret)
|			goto XDP_ABORTED;
|		xdp_redir++ or so;

so we might be able to turn this into something that covers both and
returns either XDP_REDIRECT or XDP_ABORTED. So this could be merged
into

| u32 bpf_prog_run_xdp_and_redirect(struct net_device *dev, const struct
| 				  bpf_prog *prog, struct xdp_buff *xdp)
| {
| 	u32 act;
| 	int ret;
| 
| 	act = bpf_prog_run_xdp(prog, xdp);
| 	if (act == XDP_REDIRECT) {
| 		ret = xdp_do_redirect(dev, xdp, prog);
| 		if (ret < 0)
| 			act = XDP_ABORTED;
| 	}
| 	return act;
| }

so the lock can be put inside the function and all drivers use this
function.
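
i.e. roughly (the per-CPU lock name below is only a placeholder for
whatever lock ends up protecting the redirect state):

| u32 bpf_prog_run_xdp_and_redirect(struct net_device *dev,
| 				  const struct bpf_prog *prog,
| 				  struct xdp_buff *xdp)
| {
| 	u32 act;
| 
| 	/* placeholder name for the per-CPU lock guarding the redirect state */
| 	local_lock_nested_bh(&bpf_run_lock.redirect_lock);
| 	act = bpf_prog_run_xdp(prog, xdp);
| 	if (act == XDP_REDIRECT) {
| 		if (xdp_do_redirect(dev, xdp, prog) < 0)
| 			act = XDP_ABORTED;
| 	}
| 	local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
| 
| 	return act;
| }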

From looking through drivers/net/ethernet/, this should work for most
drivers:
- amazon/ena
- aquantia/atlantic
- engleder/tsnep
- freescale/enetc
- freescale/fec
- intel/igb
- intel/igc
- marvell/mvneta
- marvell/mvpp2
- marvell/octeontx2
- mediatek/mtk
- mellanox/mlx5
- microchip/lan966x
- microsoft/mana
- netronome/nfp (two call paths without XDP_REDIRECT support)
- sfc/rx
- sfc/siena (that offset pointer can be moved)
- socionext/netsec
- stmicro/stmmac

A few do something custom/additional between bpf_prog_run_xdp() and
xdp_do_redirect():

- broadcom/bnxt
  calculates length, offset, data pointer. DMA unmaps + memory
  allocations before redirect.

- freescale/dpaa2
- freescale/dpaa
  sets xdp.data_hard_start + frame_sz, unmaps DMA.

- fungible/funeth
  conditional redirect.

- google/gve
  Allocates a new packet for redirect.

- intel/ixgbe
- intel/i40e
- intel/ice
  Failure in the ZC case is handled differently from XDP_ABORTED,
  depending on the error from xdp_do_redirect().

- mellanox/mlx4/
  calculates page_offset.

- qlogic/qede
  DMA unmap and buffer alloc.

- ti/cpsw_priv
  recalculates length (pointer).

and a few more don't support XDP_REDIRECT:

- cavium/thunder
  does not support XDP_REDIRECT, calculates length, offset.

- intel/ixgbevf
  does not support XDP_REDIRECT

I don't understand why some drivers need to recalculate data_hard_start,
length and so on while others don't. This might only be needed for the
XDP_TX case, or not be needed at all…
Also, I'm not sure about the DMA unmaps and skb allocations. The new skb
allocation can probably be handled before running the BPF program, but
then in the XDP_PASS case it is a waste…
As for the DMA unmaps: only a few drivers seem to need them. Maybe they
can be done before running the BPF program; after all, the BPF program
may look into the skb.


If that is a no-go, then the only thing that comes to mind is (as you
mentioned at LPC) to acquire the lock in bpf_prog_run_xdp() and drop it
in xdp_do_redirect(). This would require that every driver invokes
xdp_do_redirect() even if it does not support it (by setting netdev to
NULL or so).
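
i.e. a sketch of the asymmetric variant (bpf_run_lock.redirect_lock and
do_the_actual_redirect() are placeholders here):

| u32 bpf_prog_run_xdp(const struct bpf_prog *prog, struct xdp_buff *xdp)
| {
| 	local_lock_nested_bh(&bpf_run_lock.redirect_lock);
| 	return bpf_prog_run(prog, xdp);		/* lock remains held */
| }
| 
| int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
| 		    struct bpf_prog *prog)
| {
| 	int ret = -EOPNOTSUPP;
| 
| 	if (dev)
| 		ret = do_the_actual_redirect(dev, xdp, prog);
| 	/* released on every verdict path, even without XDP_REDIRECT support */
| 	local_unlock_nested_bh(&bpf_run_lock.redirect_lock);
| 	return ret;
| }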

Sebastian
  
Jakub Kicinski Dec. 19, 2023, 12:41 a.m. UTC | #3
On Mon, 18 Dec 2023 18:23:31 +0100 Sebastian Andrzej Siewior wrote:
> On 2023-12-15 14:50:59 [-0800], Jakub Kicinski wrote:
> > On Fri, 15 Dec 2023 18:07:19 +0100 Sebastian Andrzej Siewior wrote:  
> > > The proposed way out is to introduce explicit per-CPU locks for
> > > resources which are protected by local_bh_disable() and use those only
> > > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.  
> > 
> > As I said at LPC, complicating drivers with odd locking constructs
> > is a no go for me.  
> 
> I misunderstood it then; I assumed you wanted to ease the work while I
> went through every driver after (hopefully) understanding what is
> possible/needed and what is not. We are talking about 15+ drivers here?

My main concern is that it takes the complexity of writing network
device drivers to the next level. It's already hard enough to implement
XDP correctly. "local lock" and "guard"? Too complicated :(
Or "unmaintainable" as in "too much maintainer's time will be spent
reviewing code that gets this wrong".

> Now. The pattern is usually
> |	act = bpf_prog_run_xdp(xdp_prog, &xdp);
> |	switch (act) {
> |	case XDP_REDIRECT:
> |		ret = xdp_do_redirect(netdev, &xdp, xdp_prog);
> |		if (ret)
> |			goto XDP_ABORTED;
> |		xdp_redir++ or so;
> 
> so we might be able to turn this into something that covers both and
> returns either XDP_REDIRECT or XDP_ABORTED. So this could be merged
> into
> 
> | u32 bpf_prog_run_xdp_and_redirect(struct net_device *dev, const struct
> | 				  bpf_prog *prog, struct xdp_buff *xdp)
> | {
> | 	u32 act;
> | 	int ret;
> | 
> | 	act = bpf_prog_run_xdp(prog, xdp);
> | 	if (act == XDP_REDIRECT) {
> | 		ret = xdp_do_redirect(dev, xdp, prog);
> | 		if (ret < 0)
> | 			act = XDP_ABORTED;
> | 	}
> | 	return act;
> | }

If we could fold the DROP case into this -- even better!

> so the lock can be put inside the function and all drivers use this
> function.
> 
> From looking through drivers/net/ethernet/, this should work for most
> drivers:
> - amazon/ena
> - aquantia/atlantic
> - engleder/tsnep
> - freescale/enetc
> - freescale/fec
> - intel/igb
> - intel/igc
> - marvell/mvneta
> - marvell/mvpp2
> - marvell/octeontx2
> - mediatek/mtk
> - mellanox/mlx5
> - microchip/lan966x
> - microsoft/mana
> - netronome/nfp (two call paths without XDP_REDIRECT support)
> - sfc/rx
> - sfc/siena (that offset pointer can be moved)
> - socionext/netsec
> - stmicro/stmmac
> 
> A few do something custom/additional between bpf_prog_run_xdp() and
> xdp_do_redirect():
> 
> - broadcom/bnxt
>   calculates length, offset, data pointer. DMA unmaps + memory
>   allocations before redirect.

Just looked at this one. The recalculation is probably for the PASS /
TX cases, REDIRECT / DROP shouldn't care. The DMA unmap looks like 
a bug (hi, Michael!)

> - freescale/dpaa2
> - freescale/dpaa
>   sets xdp.data_hard_start + frame_sz, unmaps DMA.
> 
> - fungible/funeth
>   conditional redirect.
> 
> - google/gve
>   Allocates a new packet for redirect.
> 
> - intel/ixgbe
> - intel/i40e
> - intel/ice
>   Failure in the ZC case is handled differently from XDP_ABORTED,
>   depending on the error from xdp_do_redirect().
> 
> - mellanox/mlx4/
>   calculates page_offset.
> 
> - qlogic/qede
>   DMA unmap and buffer alloc.
> 
> - ti/cpsw_priv
>   recalculates length (pointer).
> 
> and a few more don't support XDP_REDIRECT:
> 
> - cavium/thunder
>   does not support XDP_REDIRECT, calculates length, offset.
> 
> - intel/ixgbevf
>   does not support XDP_REDIRECT
> 
> I don't understand why some drivers need to recalculate data_hard_start,
> length and so on while others don't. This might only be needed for the
> XDP_TX case, or not be needed at all…
> Also, I'm not sure about the DMA unmaps and skb allocations. The new skb
> allocation can probably be handled before running the BPF program, but
> then in the XDP_PASS case it is a waste…
> As for the DMA unmaps: only a few drivers seem to need them. Maybe they
> can be done before running the BPF program; after all, the BPF program
> may look into the skb.
> 
> 
> If that is a no-go, then the only thing that comes to mind is (as you
> mentioned at LPC) to acquire the lock in bpf_prog_run_xdp() and drop it
> in xdp_do_redirect(). This would require that every driver invokes
> xdp_do_redirect() even if it does not support it (by setting netdev to
> NULL or so).

To make progress on other parts of the stack we could also take 
the local lock around all of napi->poll() for now..
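
i.e. roughly, in __napi_poll() (sketch; the lock name is just a
placeholder):

| 	local_lock_nested_bh(&bpf_run_lock.redirect_lock);
| 	work = n->poll(n, weight);
| 	local_unlock_nested_bh(&bpf_run_lock.redirect_lock);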
  
Sebastian Andrzej Siewior Dec. 21, 2023, 8:46 p.m. UTC | #4
On 2023-12-18 16:41:42 [-0800], Jakub Kicinski wrote:
> 
> To make progress on other parts of the stack we could also take 
> the local lock around all of napi->poll() for now..

Okay. Thank you. I will look into this again next year.

Sebastian