Message ID | 20231215171020.687342-1-bigeasy@linutronix.de |
---|---|
Headers |
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>, Boqun Feng <boqun.feng@gmail.com>, Daniel Borkmann <daniel@iogearbox.net>, Eric Dumazet <edumazet@google.com>, Frederic Weisbecker <frederic@kernel.org>, Ingo Molnar <mingo@redhat.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Thomas Gleixner <tglx@linutronix.de>, Waiman Long <longman@redhat.com>, Will Deacon <will@kernel.org>
Subject: [PATCH net-next 00/24] locking: Introduce nested-BH locking.
Date: Fri, 15 Dec 2023 18:07:19 +0100
Message-ID: <20231215171020.687342-1-bigeasy@linutronix.de> |
Series |
locking: Introduce nested-BH locking.
|
|
Message
Sebastian Andrzej Siewior
Dec. 15, 2023, 5:07 p.m. UTC
Hi,

Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within a
local_bh_disable() section remains preemptible. As a result, high-priority
tasks (or threaded interrupts) can be blocked by long-running lower-priority
tasks (or threaded interrupts), and that includes softirq sections.

The proposed way out is to introduce explicit per-CPU locks for resources
which are protected by local_bh_disable() and to use those only on
PREEMPT_RT, so there is no additional overhead for !PREEMPT_RT builds.

The series introduces the infrastructure and converts large parts of
networking, which is the largest stakeholder here. Once this is done, the
per-CPU lock in local_bh_disable() on PREEMPT_RT can be lifted.

Sebastian
Comments
On Fri, 15 Dec 2023 18:07:19 +0100 Sebastian Andrzej Siewior wrote:
> The proposed way out is to introduce explicit per-CPU locks for
> resources which are protected by local_bh_disable() and use those only
> on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

As I said at LPC, complicating drivers with odd locking constructs is a
no go for me.
On 2023-12-15 14:50:59 [-0800], Jakub Kicinski wrote:
> On Fri, 15 Dec 2023 18:07:19 +0100 Sebastian Andrzej Siewior wrote:
> > The proposed way out is to introduce explicit per-CPU locks for
> > resources which are protected by local_bh_disable() and use those only
> > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
>
> As I said at LPC, complicating drivers with odd locking constructs
> is a no go for me.

I misunderstood you then: I assumed you wanted to ease the work, while I
was going to handle every driver after (hopefully) understanding what is
possible/needed and what is not. We are speaking here about 15++ drivers.

Now. The pattern is usually

| act = bpf_prog_run_xdp(xdp_prog, &xdp);
| switch (act) {
| case XDP_REDIRECT:
|	ret = xdp_do_redirect(netdev, &xdp, xdp_prog);
|	if (ret)
|		goto XDP_ABORTED;
|	xdp_redir++ or so;

so we might be able to turn this into something that covers both and
returns either XDP_REDIRECT or XDP_ABORTED. So this could be merged into

| u32 bpf_prog_run_xdp_and_redirect(struct net_device *dev,
|				    const struct bpf_prog *prog,
|				    struct xdp_buff *xdp)
| {
|	u32 act;
|	int ret;
|
|	act = bpf_prog_run_xdp(prog, xdp);
|	if (act == XDP_REDIRECT) {
|		ret = xdp_do_redirect(dev, xdp, prog);
|		if (ret < 0)
|			act = XDP_ABORTED;
|	}
|	return act;
| }

so the lock can be put inside the function and all drivers use this
function.

From looking through drivers/net/ethernet/, this should work for most
drivers:
- amazon/ena
- aquantia/atlantic
- engleder/tsnep
- freescale/enetc
- freescale/fec
- intel/igb
- intel/igc
- marvell/mvneta
- marvell/mvpp2
- marvell/octeontx2
- mediatek/mtk
- mellanox/mlx5
- microchip/lan966x
- microsoft/mana
- netronome/nfp (two call paths with no XDP_REDIRECT support)
- sfc/rx
- sfc/siena (that offset pointer can be moved)
- socionext/netsec
- stmicro/stmmac

A few do something custom/additional between bpf_prog_run_xdp() and
xdp_do_redirect():

- broadcom/bnxt
  calculates length, offset, data pointer. DMA unmaps + memory
  allocations before redirect.

- freescale/dpaa2
- freescale/dpaa
  sets xdp.data_hard_start + frame_sz, unmaps DMA.

- fungible/funeth
  conditional redirect.

- google/gve
  allocates a new packet for redirect.

- intel/ixgbe
- intel/i40e
- intel/ice
  failure in the ZC case is different from XDP_ABORTED, depends on the
  error from xdp_do_redirect().

- mellanox/mlx4
  calculates page_offset.

- qlogic/qede
  DMA unmap and buffer alloc.

- ti/cpsw_priv
  recalculates length (pointer).

and a few more don't support XDP_REDIRECT:

- cavium/thunder
  does not support XDP_REDIRECT, calculates length, offset.

- intel/ixgbevf
  does not support XDP_REDIRECT.

I don't understand why some drivers need to recalculate data_hard_start,
length and so on while others don't. This might be needed only for the
XDP_TX case, or not needed at all…
Also I'm not sure about the DMA unmaps and skb allocations. The new skb
allocation can probably be handled before running the bpf prog, but then
in the XDP_PASS case it is a waste…
And the DMA unmaps: only a few seem to need them. Maybe they can be done
before running the BPF program. After all, the bpf program may look into
the skb.

If that is a no go, then the only thing that comes to mind is (as you
mentioned at LPC) to acquire the lock in bpf_prog_run_xdp() and drop it
in xdp_do_redirect(). This would require that every driver invoke
xdp_do_redirect() even if it does not support it (by setting netdev to
NULL or so).

Sebastian
On Mon, 18 Dec 2023 18:23:31 +0100 Sebastian Andrzej Siewior wrote:
> On 2023-12-15 14:50:59 [-0800], Jakub Kicinski wrote:
> > On Fri, 15 Dec 2023 18:07:19 +0100 Sebastian Andrzej Siewior wrote:
> > > The proposed way out is to introduce explicit per-CPU locks for
> > > resources which are protected by local_bh_disable() and use those only
> > > on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.
> >
> > As I said at LPC, complicating drivers with odd locking constructs
> > is a no go for me.
>
> I misunderstood you then: I assumed you wanted to ease the work, while I
> was going to handle every driver after (hopefully) understanding what is
> possible/needed and what is not. We are speaking here about 15++ drivers.

My main concern is that it takes the complexity of writing network
device drivers to the next level. It's already hard enough to implement
XDP correctly. "local lock" and "guard"? Too complicated :( Or
"unmaintainable" as in "too much maintainer's time will be spent
reviewing code that gets this wrong".

> Now. The pattern is usually
>
> | act = bpf_prog_run_xdp(xdp_prog, &xdp);
> | switch (act) {
> | case XDP_REDIRECT:
> |	ret = xdp_do_redirect(netdev, &xdp, xdp_prog);
> |	if (ret)
> |		goto XDP_ABORTED;
> |	xdp_redir++ or so;
>
> so we might be able to turn this into something that covers both and
> returns either XDP_REDIRECT or XDP_ABORTED. So this could be merged into
>
> | u32 bpf_prog_run_xdp_and_redirect(struct net_device *dev,
> |				    const struct bpf_prog *prog,
> |				    struct xdp_buff *xdp)
> | {
> |	u32 act;
> |	int ret;
> |
> |	act = bpf_prog_run_xdp(prog, xdp);
> |	if (act == XDP_REDIRECT) {
> |		ret = xdp_do_redirect(dev, xdp, prog);
> |		if (ret < 0)
> |			act = XDP_ABORTED;
> |	}
> |	return act;
> | }

If we could fold the DROP case into this -- even better!

> so the lock can be put inside the function and all drivers use this
> function.
>
> From looking through drivers/net/ethernet/, this should work for most
> drivers:
> - amazon/ena
> - aquantia/atlantic
> - engleder/tsnep
> - freescale/enetc
> - freescale/fec
> - intel/igb
> - intel/igc
> - marvell/mvneta
> - marvell/mvpp2
> - marvell/octeontx2
> - mediatek/mtk
> - mellanox/mlx5
> - microchip/lan966x
> - microsoft/mana
> - netronome/nfp (two call paths with no XDP_REDIRECT support)
> - sfc/rx
> - sfc/siena (that offset pointer can be moved)
> - socionext/netsec
> - stmicro/stmmac
>
> A few do something custom/additional between bpf_prog_run_xdp() and
> xdp_do_redirect():
>
> - broadcom/bnxt
>   calculates length, offset, data pointer. DMA unmaps + memory
>   allocations before redirect.

Just looked at this one. The recalculation is probably for the PASS /
TX cases, REDIRECT / DROP shouldn't care. The DMA unmap looks like a
bug (hi, Michael!)

> - freescale/dpaa2
> - freescale/dpaa
>   sets xdp.data_hard_start + frame_sz, unmaps DMA.
>
> - fungible/funeth
>   conditional redirect.
>
> - google/gve
>   allocates a new packet for redirect.
>
> - intel/ixgbe
> - intel/i40e
> - intel/ice
>   failure in the ZC case is different from XDP_ABORTED, depends on the
>   error from xdp_do_redirect().
>
> - mellanox/mlx4
>   calculates page_offset.
>
> - qlogic/qede
>   DMA unmap and buffer alloc.
>
> - ti/cpsw_priv
>   recalculates length (pointer).
>
> and a few more don't support XDP_REDIRECT:
>
> - cavium/thunder
>   does not support XDP_REDIRECT, calculates length, offset.
>
> - intel/ixgbevf
>   does not support XDP_REDIRECT.
>
> I don't understand why some drivers need to recalculate data_hard_start,
> length and so on while others don't. This might be needed only for the
> XDP_TX case, or not needed at all…
> Also I'm not sure about the DMA unmaps and skb allocations. The new skb
> allocation can probably be handled before running the bpf prog, but then
> in the XDP_PASS case it is a waste…
> And the DMA unmaps: only a few seem to need them. Maybe they can be done
> before running the BPF program. After all, the bpf program may look into
> the skb.
>
> If that is a no go, then the only thing that comes to mind is (as you
> mentioned at LPC) to acquire the lock in bpf_prog_run_xdp() and drop it
> in xdp_do_redirect(). This would require that every driver invoke
> xdp_do_redirect() even if it does not support it (by setting netdev to
> NULL or so).

To make progress on other parts of the stack we could also take the
local lock around all of napi->poll() for now..
On 2023-12-18 16:41:42 [-0800], Jakub Kicinski wrote:
> To make progress on other parts of the stack we could also take
> the local lock around all of napi->poll() for now..

Okay. Thank you. I will look into this next year again.

Sebastian