[bpf-next,v4,2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
Message ID | d3b0ff95c58356192ea3b50824f8cdbf02c354e3.1689203090.git.dxu@dxuuu.xyz |
---|---|
State | New |
Headers |
From: Daniel Xu <dxu@dxuuu.xyz>
To: andrii@kernel.org, ast@kernel.org, fw@strlen.de, davem@davemloft.net, pablo@netfilter.org, pabeni@redhat.com, daniel@iogearbox.net, edumazet@google.com, kuba@kernel.org, kadlec@netfilter.org, alexei.starovoitov@gmail.com
Cc: martin.lau@linux.dev, song@kernel.org, yhs@fb.com, john.fastabend@gmail.com, kpsingh@kernel.org, sdf@google.com, haoluo@google.com, jolsa@kernel.org, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, netfilter-devel@vger.kernel.org, coreteam@netfilter.org, netdev@vger.kernel.org, dsahern@kernel.org
Subject: [PATCH bpf-next v4 2/6] netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
Date: Wed, 12 Jul 2023 17:43:57 -0600
Message-ID: <d3b0ff95c58356192ea3b50824f8cdbf02c354e3.1689203090.git.dxu@dxuuu.xyz>
X-Mailer: git-send-email 2.41.0
In-Reply-To: <cover.1689203090.git.dxu@dxuuu.xyz>
References: <cover.1689203090.git.dxu@dxuuu.xyz>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
Support defragmenting IPv(4|6) packets in BPF
|
|
Commit Message
Daniel Xu
July 12, 2023, 11:43 p.m. UTC
This commit adds support for enabling IP defrag using pre-existing netfilter defrag support. Basically all the flag does is bump a refcnt while the link is active. Checks are also added to ensure the prog requesting defrag support is run _after_ netfilter defrag hooks.

Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 include/uapi/linux/bpf.h       |   5 ++
 net/netfilter/nf_bpf_link.c    | 129 ++++++++++++++++++++++++++++++---
 tools/include/uapi/linux/bpf.h |   5 ++
 3 files changed, 128 insertions(+), 11 deletions(-)
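The ordering check mentioned in the commit message can be sketched roughly as follows. This is a user-space illustration, not the code from the patch: `check_defrag_priority`, `DEFRAG_FLAG`, and `DEFRAG_HOOK_PRIORITY` are hypothetical names, the value -400 is assumed to mirror the kernel's `NF_IP_PRI_CONNTRACK_DEFRAG`, and `-ERANGE` is chosen here only for illustration. The idea is that netfilter runs hooks in ascending priority order, so a program that wants reassembled packets has to sort strictly after the defrag hook.

```c
#include <assert.h> /* only needed by callers that test the helper */
#include <errno.h>
#include <stdint.h>

/* Assumed to mirror NF_IP_PRI_CONNTRACK_DEFRAG from the kernel headers. */
#define DEFRAG_HOOK_PRIORITY (-400)

/* Hypothetical stand-in for BPF_F_NETFILTER_IP_DEFRAG. */
#define DEFRAG_FLAG (1U << 0)

/*
 * Hooks are invoked in ascending priority order, so a program that asks
 * for defragmented packets must have a priority strictly greater than
 * the defrag hook's. Returns 0 if acceptable, -ERANGE otherwise.
 */
static int check_defrag_priority(uint32_t flags, int32_t priority)
{
	if ((flags & DEFRAG_FLAG) && priority <= DEFRAG_HOOK_PRIORITY)
		return -ERANGE;
	return 0;
}
```

A program that does not set the flag is unaffected by the check, whatever its priority.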
Comments
On Wed, Jul 12, 2023 at 4:44 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> +#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
> +	case NFPROTO_IPV6:
> +		rcu_read_lock();
> +		v6_hook = rcu_dereference(nf_defrag_v6_hook);
> +		if (!v6_hook) {
> +			rcu_read_unlock();
> +			err = request_module("nf_defrag_ipv6");
> +			if (err)
> +				return err < 0 ? err : -EINVAL;
> +
> +			rcu_read_lock();
> +			v6_hook = rcu_dereference(nf_defrag_v6_hook);
> +			if (!v6_hook) {
> +				WARN_ONCE(1, "nf_defrag_ipv6_hooks bad registration");
> +				err = -ENOENT;
> +				goto out_v6;
> +			}
> +		}
> +
> +		err = v6_hook->enable(link->net);

I was about to apply, but luckily caught this issue in my local test:

[ 18.462448] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:283
[ 18.463238] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 2042, name: test_progs
[ 18.463927] preempt_count: 0, expected: 0
[ 18.464249] RCU nest depth: 1, expected: 0
[ 18.464631] CPU: 15 PID: 2042 Comm: test_progs Tainted: G O 6.4.0-04319-g6f6ec4fa00dc #4896
[ 18.465480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[ 18.466531] Call Trace:
[ 18.466767] <TASK>
[ 18.466975] dump_stack_lvl+0x32/0x40
[ 18.467325] __might_resched+0x129/0x180
[ 18.467691] mutex_lock+0x1a/0x40
[ 18.468057] nf_defrag_ipv4_enable+0x16/0x70
[ 18.468467] bpf_nf_link_attach+0x141/0x300
[ 18.468856] __sys_bpf+0x133e/0x26d0

You cannot call mutex under rcu_read_lock.

Please make sure you have all kernel debug flags on in your testing.
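The warning above can be reproduced in miniature with a toy model. The sketch below is user-space C, not kernel code: `toy_rcu_read_lock`, `toy_might_sleep`, `toy_mutex_lock`, and the two `enable_*` functions are hypothetical stand-ins that only track the nesting depth the splat reports ("RCU nest depth: 1, expected: 0"). It is meant to show why taking a mutex inside a read-side critical section trips the check, nothing more.

```c
#include <assert.h>

/* Toy counter standing in for the per-task RCU read-side nesting depth. */
static int toy_rcu_depth;

static void toy_rcu_read_lock(void)
{
	toy_rcu_depth++;
}

static void toy_rcu_read_unlock(void)
{
	assert(toy_rcu_depth > 0);
	toy_rcu_depth--;
}

/* Models the debug check: sleeping is only legal at nesting depth 0. */
static int toy_might_sleep(void)
{
	return toy_rcu_depth == 0 ? 0 : -1;
}

/* A mutex may sleep, so acquiring one runs the might-sleep check first. */
static int toy_mutex_lock(void)
{
	return toy_might_sleep();
}

/* Shape of the v4 code: the mutex is taken inside the read-side section. */
static int enable_under_rcu(void)
{
	int err;

	toy_rcu_read_lock();
	err = toy_mutex_lock(); /* depth is 1 here, so the check fails */
	toy_rcu_read_unlock();
	return err;
}

/*
 * The mutex is taken after leaving the section. In real code this is
 * only safe if the looked-up object was pinned (for example by a module
 * reference) before the unlock; the toy does not model that part.
 */
static int enable_outside_rcu(void)
{
	toy_rcu_read_lock();
	/* the hook lookup would happen here */
	toy_rcu_read_unlock();
	return toy_mutex_lock();
}
```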
Hi Alexei,

On Wed, Jul 12, 2023 at 05:43:49PM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 12, 2023 at 4:44 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > [...]
> > +		err = v6_hook->enable(link->net);
>
> I was about to apply, but luckily caught this issue in my local test:
>
> [ 18.462448] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:283
> [...]
>
> You cannot call mutex under rcu_read_lock.

Whoops, my bad. I think this patch should fix it:

```
From 7e8927c44452db07ddd7cf0e30bb49215fc044ed Mon Sep 17 00:00:00 2001
Message-ID: <7e8927c44452db07ddd7cf0e30bb49215fc044ed.1689211250.git.dxu@dxuuu.xyz>
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Wed, 12 Jul 2023 19:17:35 -0600
Subject: [PATCH] netfilter: bpf: Don't hold rcu_read_lock during enable/disable

->enable()/->disable() takes a mutex which can sleep. You can't sleep
during RCU read side critical section.

Our refcnt on the module will protect us from ->enable()/->disable()
from going away while we call it.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 net/netfilter/nf_bpf_link.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_bpf_link.c b/net/netfilter/nf_bpf_link.c
index 77ffbf26ba3d..79704cc596aa 100644
--- a/net/netfilter/nf_bpf_link.c
+++ b/net/netfilter/nf_bpf_link.c
@@ -60,9 +60,12 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
 			goto out_v4;
 		}
 
+		rcu_read_unlock();
 		err = v4_hook->enable(link->net);
 		if (err)
 			module_put(v4_hook->owner);
+
+		return err;
 out_v4:
 		rcu_read_unlock();
 		return err;
@@ -92,9 +95,12 @@ static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
 			goto out_v6;
 		}
 
+		rcu_read_unlock();
 		err = v6_hook->enable(link->net);
 		if (err)
 			module_put(v6_hook->owner);
+
+		return err;
 out_v6:
 		rcu_read_unlock();
 		return err;
@@ -114,11 +120,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
 	case NFPROTO_IPV4:
 		rcu_read_lock();
 		v4_hook = rcu_dereference(nf_defrag_v4_hook);
+		rcu_read_unlock();
 		if (v4_hook) {
 			v4_hook->disable(link->net);
 			module_put(v4_hook->owner);
 		}
-		rcu_read_unlock();
 
 		break;
 #endif
@@ -126,11 +132,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
 	case NFPROTO_IPV6:
 		rcu_read_lock();
 		v6_hook = rcu_dereference(nf_defrag_v6_hook);
+		rcu_read_unlock();
 		if (v6_hook) {
 			v6_hook->disable(link->net);
 			module_put(v6_hook->owner);
 		}
-		rcu_read_unlock();
 
 		break;
 	}
-- 
2.41.0
```

I'll send out a v5 tomorrow morning unless you feel like applying the series + this patch today.

> Please make sure you have all kernel debug flags on in your testing.

Ack. Will make sure lockdep is on.

Thanks,
Daniel
On Wed, Jul 12, 2023 at 6:22 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> [...]
> @@ -126,11 +132,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
>  	case NFPROTO_IPV6:
>  		rcu_read_lock();
>  		v6_hook = rcu_dereference(nf_defrag_v6_hook);
> +		rcu_read_unlock();

No. v6_hook is gone as soon as you unlock it.
On Wed, Jul 12, 2023 at 06:26:13PM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 12, 2023 at 6:22 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > [...]
> > @@ -126,11 +132,11 @@ static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
> >  	case NFPROTO_IPV6:
> >  		rcu_read_lock();
> >  		v6_hook = rcu_dereference(nf_defrag_v6_hook);
> > +		rcu_read_unlock();
>
> No. v6_hook is gone as soon as you unlock it.

I think we're protected here by the try_module_get() on the enable path. And we only disable defrag if enabling succeeds. The module shouldn't be able to deregister its hooks until we call the module_put() later.

I think READ_ONCE() would've been more appropriate but I wasn't sure if that was ok given nf_defrag_v(4|6)_hook is written to by rcu_assign_pointer() and I was assuming symmetry is necessary.

Does that sound right?

Thanks,
Daniel
On Wed, Jul 12, 2023 at 9:33 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> On Wed, Jul 12, 2023 at 06:26:13PM -0700, Alexei Starovoitov wrote:
> > [...]
> > No. v6_hook is gone as soon as you unlock it.
>
> I think we're protected here by the try_module_get() on the enable path.
> And we only disable defrag if enabling succeeds. The module shouldn't
> be able to deregister its hooks until we call the module_put() later.
>
> I think READ_ONCE() would've been more appropriate but I wasn't sure if
> that was ok given nf_defrag_v(4|6)_hook is written to by
> rcu_assign_pointer() and I was assuming symmetry is necessary.

Why is rcu_assign_pointer() used?
If it's not RCU protected, what is the point of rcu_*() accessors and rcu_read_lock() ?

In general, the pattern:
rcu_read_lock();
ptr = rcu_dereference(...);
rcu_read_unlock();
ptr->..
is a bug. 100%.
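The disagreement here comes down to what keeps the pointer valid once the read-side section ends. The toy below is a user-space sketch under stated assumptions, not kernel code: `toy_module`, `toy_try_get`, `toy_unload`, `lookup_unpinned`, and `lookup_pinned` are all hypothetical names, the RCU calls are reduced to comments, and the objects are static so that inspecting a "stale" pointer in the model stays well defined. It contrasts a lookup that leaves nothing pinning the hook with one that takes a module reference before leaving the section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy module: 'live' models whether its code and data are still loaded. */
struct toy_module {
	int refcnt;
	bool live;
};

struct toy_hook {
	struct toy_module *owner;
};

static struct toy_module toy_mod = { .refcnt = 0, .live = true };
static struct toy_hook toy_hook_obj = { .owner = &toy_mod };

/* Published pointer; the kernel would publish it with rcu_assign_pointer(). */
static struct toy_hook *toy_global_hook = &toy_hook_obj;

/* Models try_module_get(): fails once the module is going away. */
static bool toy_try_get(struct toy_module *m)
{
	if (!m->live)
		return false;
	m->refcnt++;
	return true;
}

/* Models module_put(). */
static void toy_put(struct toy_module *m)
{
	assert(m->refcnt > 0);
	m->refcnt--;
}

/* Models module unload: refused while references are held. */
static bool toy_unload(struct toy_module *m)
{
	if (m->refcnt > 0)
		return false;
	toy_global_hook = NULL;
	m->live = false;
	return true;
}

/* Questioned shape: nothing pins the hook past the read-side section. */
static struct toy_hook *lookup_unpinned(void)
{
	/* toy rcu_read_lock() */
	struct toy_hook *h = toy_global_hook;
	/* toy rcu_read_unlock() */
	return h;
}

/* Safer shape: take a module reference before leaving the section. */
static struct toy_hook *lookup_pinned(void)
{
	/* toy rcu_read_lock() */
	struct toy_hook *h = toy_global_hook;

	if (h && !toy_try_get(h->owner))
		h = NULL;
	/* toy rcu_read_unlock() */
	return h;
}
```

In this model the unpinned pointer can outlive an unload, while the pinned lookup makes the unload fail until the reference is dropped.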
On Thu, Jul 13, 2023 at 04:10:03PM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 12, 2023 at 9:33 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > [...]
> > I think READ_ONCE() would've been more appropriate but I wasn't sure if
> > that was ok given nf_defrag_v(4|6)_hook is written to by
> > rcu_assign_pointer() and I was assuming symmetry is necessary.
>
> Why is rcu_assign_pointer() used?
> If it's not RCU protected, what is the point of rcu_*() accessors
> and rcu_read_lock() ?
>
> In general, the pattern:
> rcu_read_lock();
> ptr = rcu_dereference(...);
> rcu_read_unlock();
> ptr->..
> is a bug. 100%.

The reason I left it like this is because otherwise I think there is a race between module unload and taking a refcnt. For example:

ptr = READ_ONCE(global_var)
<module unload on other cpu>
// ptr invalid
try_module_get(ptr->owner)

Based on my reading, I think the synchronize_rcu() call in kernel/module/main.c:free_module() protects against that race.

Maybe the ->enable() path can store a copy of the hook ptr in struct bpf_nf_link to get rid of the odd rcu_dereference()?

Open to other ideas too -- would appreciate any hints.

Thanks,
Daniel
Daniel Xu <dxu@dxuuu.xyz> wrote:
> On Thu, Jul 13, 2023 at 04:10:03PM -0700, Alexei Starovoitov wrote:
> > Why is rcu_assign_pointer() used?
> > If it's not RCU protected, what is the point of rcu_*() accessors
> > and rcu_read_lock() ?
> >
> > In general, the pattern:
> > rcu_read_lock();
> > ptr = rcu_dereference(...);
> > rcu_read_unlock();
> > ptr->..
> > is a bug. 100%.

FWIW, I agree with Alexei, it does look... dodgy.

> The reason I left it like this is b/c otherwise I think there is a race
> with module unload and taking a refcnt. For example:
>
> ptr = READ_ONCE(global_var)
> <module unload on other cpu>
> // ptr invalid
> try_module_get(ptr->owner)
>

Yes, I agree.

> I think the synchronize_rcu() call in
> kernel/module/main.c:free_module() protects against that race based on
> my reading.
>
> Maybe the ->enable() path can store a copy of the hook ptr in
> struct bpf_nf_link to get rid of the odd rcu_dereference()?
>
> Open to other ideas too -- would appreciate any hints.

I would suggest the following:

- Switch ordering of patches 2 and 3.
  What is currently patch 3 would add the .owner fields only.

Then, what is currently patch #2 would document the rcu/modref
interaction like this (omitting error checking for brevity):

	rcu_read_lock();
	v6_hook = rcu_dereference(nf_defrag_v6_hook);
	if (!v6_hook) {
		rcu_read_unlock();
		err = request_module("nf_defrag_ipv6");
		if (err)
			return err < 0 ? err : -EINVAL;
		rcu_read_lock();
		v6_hook = rcu_dereference(nf_defrag_v6_hook);
	}

	if (v6_hook && try_module_get(v6_hook->owner))
		v6_hook = rcu_pointer_handoff(v6_hook);
	else
		v6_hook = NULL;

	rcu_read_unlock();

	if (!v6_hook)
		err();
	v6_hook->enable();

I'd store the v4/6_hook pointer in the nf bpf link struct, it's probably more
self-explanatory for the disable side in that we did pick up a module reference
that we still own at delete time, without need for any rcu involvement.

Because above handoff is repetitive for ipv4 and ipv6,
I suggest to add an agnostic helper for this.

I know you added distinct structures for ipv4 and ipv6 but if they would use
the same one you could add

static const struct nf_defrag_hook *get_proto_frag_hook(const struct nf_defrag_hook __rcu *hook,
							 const char *modulename);

And then use it like:

	v4_hook = get_proto_frag_hook(nf_defrag_v4_hook, "nf_defrag_ipv4");

Without a need to copy the modprobe and handoff part.

What do you think?
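The "look up the hook, pin its owner, hand the pointer to the caller" shape discussed above can be modeled in plain userspace C. This is only a rough sketch under stated assumptions: `model_module`, `model_hook`, `model_try_get()`, `model_put()` and `model_get_hook()` are hypothetical stand-ins and not kernel APIs; the RCU read-side section and the `request_module()` retry are left out, so it shows only why the returned pointer stays usable after the lookup.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; none of these names exist in the kernel. */
struct model_module {
	int refcnt;
	bool going;	/* set once unload has started */
};

struct model_hook {
	struct model_module *owner;
};

/* Models try_module_get(): refuse a new reference once unload began. */
static bool model_try_get(struct model_module *m)
{
	if (m->going)
		return false;
	m->refcnt++;
	return true;
}

/* Models module_put(): drop a reference taken by model_try_get(). */
static void model_put(struct model_module *m)
{
	m->refcnt--;
}

/*
 * Hypothetical analogue of the proposed helper: read the registered
 * hook from its slot and pin the owning module. In the kernel the
 * read would sit inside rcu_read_lock()/rcu_read_unlock() and an
 * empty slot would first trigger request_module(); both are omitted.
 * Returns NULL if nothing is registered or the owner is going away.
 */
static struct model_hook *model_get_hook(struct model_hook **slot)
{
	struct model_hook *hook = *slot;

	if (!hook || !model_try_get(hook->owner))
		return NULL;

	return hook;	/* caller now owns one reference on hook->owner */
}
```

A caller would keep the returned pointer (for example in its link structure) and call `model_put(hook->owner)` exactly once when it is done, which mirrors the idea that the disable side needs no further lookup.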
Hi Florian,

On Fri, Jul 14, 2023 at 11:47:41AM +0200, Florian Westphal wrote:
> Daniel Xu <dxu@dxuuu.xyz> wrote:
> > On Thu, Jul 13, 2023 at 04:10:03PM -0700, Alexei Starovoitov wrote:
> > > Why is rcu_assign_pointer() used?
> > > If it's not RCU protected, what is the point of rcu_*() accessors
> > > and rcu_read_lock() ?
> > >
> > > In general, the pattern:
> > > rcu_read_lock();
> > > ptr = rcu_dereference(...);
> > > rcu_read_unlock();
> > > ptr->..
> > > is a bug. 100%.
>
> FWIW, I agree with Alexei, it does look... dodgy.
>
> > The reason I left it like this is b/c otherwise I think there is a race
> > with module unload and taking a refcnt. For example:
> >
> > ptr = READ_ONCE(global_var)
> > <module unload on other cpu>
> > // ptr invalid
> > try_module_get(ptr->owner)
> >
>
> Yes, I agree.
>
> > I think the synchronize_rcu() call in
> > kernel/module/main.c:free_module() protects against that race based on
> > my reading.
> >
> > Maybe the ->enable() path can store a copy of the hook ptr in
> > struct bpf_nf_link to get rid of the odd rcu_dereference()?
> >
> > Open to other ideas too -- would appreciate any hints.
>
> I would suggest the following:
>
> - Switch ordering of patches 2 and 3.
>   What is currently patch 3 would add the .owner fields only.
>
> Then, what is currently patch #2 would document the rcu/modref
> interaction like this (omitting error checking for brevity):
>
> 	rcu_read_lock();
> 	v6_hook = rcu_dereference(nf_defrag_v6_hook);
> 	if (!v6_hook) {
> 		rcu_read_unlock();
> 		err = request_module("nf_defrag_ipv6");
> 		if (err)
> 			return err < 0 ? err : -EINVAL;
> 		rcu_read_lock();
> 		v6_hook = rcu_dereference(nf_defrag_v6_hook);
> 	}
>
> 	if (v6_hook && try_module_get(v6_hook->owner))
> 		v6_hook = rcu_pointer_handoff(v6_hook);
> 	else
> 		v6_hook = NULL;
>
> 	rcu_read_unlock();
>
> 	if (!v6_hook)
> 		err();
> 	v6_hook->enable();
>
>
> I'd store the v4/6_hook pointer in the nf bpf link struct, it's probably more
> self-explanatory for the disable side in that we did pick up a module reference
> that we still own at delete time, without need for any rcu involvement.
>
> Because above handoff is repetitive for ipv4 and ipv6,
> I suggest to add an agnostic helper for this.
>
> I know you added distinct structures for ipv4 and ipv6 but if they would use
> the same one you could add
>
> static const struct nf_defrag_hook *get_proto_frag_hook(const struct nf_defrag_hook __rcu *hook,
> 							 const char *modulename);
>
> And then use it like:
>
> 	v4_hook = get_proto_frag_hook(nf_defrag_v4_hook, "nf_defrag_ipv4");
>
> Without a need to copy the modprobe and handoff part.
>
> What do you think?

That sounds reasonable to me. I'll give it a shot.

Thanks for the input!

Daniel
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 600d0caebbd8..c820076c38db 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1180,6 +1180,11 @@ enum bpf_perf_event_type {
  */
 #define BPF_F_KPROBE_MULTI_RETURN (1U << 0)
 
+/* link_create.netfilter.flags used in LINK_CREATE command for
+ * BPF_PROG_TYPE_NETFILTER to enable IP packet defragmentation.
+ */
+#define BPF_F_NETFILTER_IP_DEFRAG (1U << 0)
+
 /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
  * the following extensions:
  *
diff --git a/net/netfilter/nf_bpf_link.c b/net/netfilter/nf_bpf_link.c
index c36da56d756f..5b72aa246577 100644
--- a/net/netfilter/nf_bpf_link.c
+++ b/net/netfilter/nf_bpf_link.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/kmod.h>
 #include <linux/netfilter.h>
 
 #include <net/netfilter/nf_bpf_link.h>
@@ -23,8 +24,98 @@ struct bpf_nf_link {
 	struct nf_hook_ops hook_ops;
 	struct net *net;
 	u32 dead;
+	bool defrag;
 };
 
+static int bpf_nf_enable_defrag(struct bpf_nf_link *link)
+{
+	const struct nf_defrag_v4_hook __maybe_unused *v4_hook;
+	const struct nf_defrag_v6_hook __maybe_unused *v6_hook;
+	int err;
+
+	switch (link->hook_ops.pf) {
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
+	case NFPROTO_IPV4:
+		rcu_read_lock();
+		v4_hook = rcu_dereference(nf_defrag_v4_hook);
+		if (!v4_hook) {
+			rcu_read_unlock();
+			err = request_module("nf_defrag_ipv4");
+			if (err)
+				return err < 0 ? err : -EINVAL;
+
+			rcu_read_lock();
+			v4_hook = rcu_dereference(nf_defrag_v4_hook);
+			if (!v4_hook) {
+				WARN_ONCE(1, "nf_defrag_ipv4 bad registration");
+				err = -ENOENT;
+				goto out_v4;
+			}
+		}
+
+		err = v4_hook->enable(link->net);
+out_v4:
+		rcu_read_unlock();
+		return err;
+#endif
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
+	case NFPROTO_IPV6:
+		rcu_read_lock();
+		v6_hook = rcu_dereference(nf_defrag_v6_hook);
+		if (!v6_hook) {
+			rcu_read_unlock();
+			err = request_module("nf_defrag_ipv6");
+			if (err)
+				return err < 0 ? err : -EINVAL;
+
+			rcu_read_lock();
+			v6_hook = rcu_dereference(nf_defrag_v6_hook);
+			if (!v6_hook) {
+				WARN_ONCE(1, "nf_defrag_ipv6_hooks bad registration");
+				err = -ENOENT;
+				goto out_v6;
+			}
+		}
+
+		err = v6_hook->enable(link->net);
+out_v6:
+		rcu_read_unlock();
+		return err;
+#endif
+	default:
+		return -EAFNOSUPPORT;
+	}
+}
+
+static void bpf_nf_disable_defrag(struct bpf_nf_link *link)
+{
+	const struct nf_defrag_v4_hook __maybe_unused *v4_hook;
+	const struct nf_defrag_v6_hook __maybe_unused *v6_hook;
+
+	switch (link->hook_ops.pf) {
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
+	case NFPROTO_IPV4:
+		rcu_read_lock();
+		v4_hook = rcu_dereference(nf_defrag_v4_hook);
+		if (v4_hook)
+			v4_hook->disable(link->net);
+		rcu_read_unlock();
+
+		break;
+#endif
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
+	case NFPROTO_IPV6:
+		rcu_read_lock();
+		v6_hook = rcu_dereference(nf_defrag_v6_hook);
+		if (v6_hook)
+			v6_hook->disable(link->net);
+		rcu_read_unlock();
+
+		break;
+	}
+#endif
+}
+
 static void bpf_nf_link_release(struct bpf_link *link)
 {
 	struct bpf_nf_link *nf_link = container_of(link, struct bpf_nf_link, link);
@@ -37,6 +128,9 @@ static void bpf_nf_link_release(struct bpf_link *link)
 	 */
 	if (!cmpxchg(&nf_link->dead, 0, 1))
 		nf_unregister_net_hook(nf_link->net, &nf_link->hook_ops);
+
+	if (nf_link->defrag)
+		bpf_nf_disable_defrag(nf_link);
 }
 
 static void bpf_nf_link_dealloc(struct bpf_link *link)
@@ -92,6 +186,8 @@ static const struct bpf_link_ops bpf_nf_link_lops = {
 
 static int bpf_nf_check_pf_and_hooks(const union bpf_attr *attr)
 {
+	int prio;
+
 	switch (attr->link_create.netfilter.pf) {
 	case NFPROTO_IPV4:
 	case NFPROTO_IPV6:
@@ -102,19 +198,18 @@ static int bpf_nf_check_pf_and_hooks(const union bpf_attr *attr)
 		return -EAFNOSUPPORT;
 	}
 
-	if (attr->link_create.netfilter.flags)
+	if (attr->link_create.netfilter.flags & ~BPF_F_NETFILTER_IP_DEFRAG)
 		return -EOPNOTSUPP;
 
-	/* make sure conntrack confirm is always last.
-	 *
-	 * In the future, if userspace can e.g. request defrag, then
-	 * "defrag_requested && prio before NF_IP_PRI_CONNTRACK_DEFRAG"
-	 * should fail.
-	 */
-	switch (attr->link_create.netfilter.priority) {
-	case NF_IP_PRI_FIRST: return -ERANGE; /* sabotage_in and other warts */
-	case NF_IP_PRI_LAST: return -ERANGE; /* e.g. conntrack confirm */
-	}
+	/* make sure conntrack confirm is always last */
+	prio = attr->link_create.netfilter.priority;
+	if (prio == NF_IP_PRI_FIRST)
+		return -ERANGE;  /* sabotage_in and other warts */
+	else if (prio == NF_IP_PRI_LAST)
+		return -ERANGE;  /* e.g. conntrack confirm */
+	else if ((attr->link_create.netfilter.flags & BPF_F_NETFILTER_IP_DEFRAG) &&
+		 prio <= NF_IP_PRI_CONNTRACK_DEFRAG)
+		return -ERANGE;  /* cannot use defrag if prog runs before nf_defrag */
 
 	return 0;
 }
@@ -156,6 +251,18 @@ int bpf_nf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
 		return err;
 	}
 
+	if (attr->link_create.netfilter.flags & BPF_F_NETFILTER_IP_DEFRAG) {
+		err = bpf_nf_enable_defrag(link);
+		if (err) {
+			bpf_link_cleanup(&link_primer);
+			return err;
+		}
+		/* only mark defrag enabled if enabling succeeds so cleanup path
+		 * doesn't disable without a corresponding enable
+		 */
+		link->defrag = true;
+	}
+
 	err = nf_register_net_hook(net, &link->hook_ops);
 	if (err) {
 		bpf_link_cleanup(&link_primer);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 600d0caebbd8..c820076c38db 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1180,6 +1180,11 @@ enum bpf_perf_event_type {
  */
 #define BPF_F_KPROBE_MULTI_RETURN (1U << 0)
 
+/* link_create.netfilter.flags used in LINK_CREATE command for
+ * BPF_PROG_TYPE_NETFILTER to enable IP packet defragmentation.
+ */
+#define BPF_F_NETFILTER_IP_DEFRAG (1U << 0)
+
 /* When BPF ldimm64's insn[0].src_reg != 0 then this can have
  * the following extensions:
  *
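
To make the new flag and priority rules in bpf_nf_check_pf_and_hooks() easier to try out, here is a small userspace sketch of just that part of the check. It is an illustration under assumptions, not the kernel function: `check_flags_and_prio()` is a hypothetical name, the protocol-family switch is omitted, and the constants are local stand-ins whose values are assumed to match include/uapi/linux/netfilter_ipv4.h and the flag added by this patch.

```c
#include <errno.h>
#include <limits.h>

/* Local stand-ins so the sketch builds outside the kernel tree. */
#define NF_IP_PRI_FIRST            INT_MIN
#define NF_IP_PRI_CONNTRACK_DEFRAG (-400)
#define NF_IP_PRI_LAST             INT_MAX
#define BPF_F_NETFILTER_IP_DEFRAG  (1U << 0)

/*
 * Hypothetical helper mirroring only the flag and priority checks
 * from the patch; returns 0 or a negative errno value.
 */
static int check_flags_and_prio(unsigned int flags, int prio)
{
	/* any flag other than the defrag flag is rejected */
	if (flags & ~BPF_F_NETFILTER_IP_DEFRAG)
		return -EOPNOTSUPP;

	/* the first and last priorities stay reserved */
	if (prio == NF_IP_PRI_FIRST || prio == NF_IP_PRI_LAST)
		return -ERANGE;

	/* a prog that wants defragmented packets must run after nf_defrag */
	if ((flags & BPF_F_NETFILTER_IP_DEFRAG) &&
	    prio <= NF_IP_PRI_CONNTRACK_DEFRAG)
		return -ERANGE;

	return 0;
}
```

Because the comparison is `<=`, a program that requests defrag at priority -400 is rejected with -ERANGE while -399 is accepted; without the flag, -400 remains a valid priority.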