Message ID: <cover.1677526810.git.dxu@dxuuu.xyz>
From: Daniel Xu <dxu@dxuuu.xyz>
To: bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, netdev@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH bpf-next v2 0/8] Support defragmenting IPv(4|6) packets in BPF
Date: Mon, 27 Feb 2023 12:51:02 -0700
Series: Support defragmenting IPv(4|6) packets in BPF
Message
Daniel Xu
Feb. 27, 2023, 7:51 p.m. UTC
=== Context ===

In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment, which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which is very nice:

1. Enforce policy on the first fragment and accept all subsequent
   fragments. This works but may let in certain attacks or allow data
   exfiltration.

2. Enforce policy on the first fragment and drop all subsequent
   fragments. This does not really work because some protocols may rely
   on fragmentation. For example, DNS may rely on oversized UDP packets
   for large responses.

So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:

    Middleboxes [...] should process IP fragments in a manner that is
    consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
    must maintain state in order to achieve this goal.

=== BPF related bits ===

However, when policy is enforced through BPF, the prog is run before the
kernel reassembles fragmented packets. This leaves BPF developers in an
awkward place: implement reassembly (possibly poorly) or use a stateless
method as described above.

Fortunately, the kernel has robust support for fragmented IP packets.
This patchset wraps the existing defragmentation facilities in kfuncs so
that BPF progs running on middleboxes can reassemble fragmented packets
before applying policy.

=== Patchset details ===

This patchset is (hopefully) relatively straightforward from a BPF
perspective. One thing I'd like to call out is the skb_copy()ing of the
prog skb. I did this to maintain the invariant that the ctx remains
valid after the prog has run. This is relevant because ip_defrag() and
ip_check_defrag() may consume the skb if the skb is a fragment.

Originally I did play around with teaching the verifier about kfuncs
that may consume the ctx and disallowing ctx accesses in ret != 0
branches. It worked OK, but it seemed too complex to modify the
surrounding assumptions about ctx validity.
[0]: https://datatracker.ietf.org/doc/html/rfc8900

=== Changes from v1 ===

* Add support for ipv6 defragmentation

Daniel Xu (8):
  ip: frags: Return actual error codes from ip_check_defrag()
  bpf: verifier: Support KF_CHANGES_PKT flag
  bpf, net, frags: Add bpf_ip_check_defrag() kfunc
  net: ipv6: Factor ipv6_frag_rcv() to take netns and user
  bpf: net: ipv6: Add bpf_ipv6_frag_rcv() kfunc
  bpf: selftests: Support not connecting client socket
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Add defrag selftests

 Documentation/bpf/kfuncs.rst                  |   7 +
 drivers/net/macvlan.c                         |   2 +-
 include/linux/btf.h                           |   1 +
 include/net/ip.h                              |  11 +
 include/net/ipv6.h                            |   1 +
 include/net/ipv6_frag.h                       |   1 +
 include/net/transp_v6.h                       |   1 +
 kernel/bpf/verifier.c                         |   8 +
 net/ipv4/Makefile                             |   1 +
 net/ipv4/ip_fragment.c                        |  15 +-
 net/ipv4/ip_fragment_bpf.c                    |  98 ++++++
 net/ipv6/Makefile                             |   1 +
 net/ipv6/af_inet6.c                           |   4 +
 net/ipv6/reassembly.c                         |  16 +-
 net/ipv6/reassembly_bpf.c                     | 143 ++++++++
 net/packet/af_packet.c                        |   2 +-
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../selftests/bpf/generate_udp_fragments.py   |  90 +++++
 .../selftests/bpf/ip_check_defrag_frags.h     |  57 +++
 tools/testing/selftests/bpf/network_helpers.c |  26 +-
 tools/testing/selftests/bpf/network_helpers.h |   3 +
 .../bpf/prog_tests/ip_check_defrag.c          | 327 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_tracing_net.h     |   1 +
 .../selftests/bpf/progs/ip_check_defrag.c     | 133 +++++++
 24 files changed, 931 insertions(+), 21 deletions(-)
 create mode 100644 net/ipv4/ip_fragment_bpf.c
 create mode 100644 net/ipv6/reassembly_bpf.c
 create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
 create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
 create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
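To make the intended usage concrete, below is a minimal sketch of a TC
program calling the proposed defrag kfunc. The kfunc signature and return
convention here are assumptions based on the cover letter; the real
interface lives in patch 3 ("bpf, net, frags: Add bpf_ip_check_defrag()
kfunc") and may differ:

// SPDX-License-Identifier: GPL-2.0
/* Sketch only: the kfunc signature and return semantics are assumed. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Assumed declaration of the kfunc added by this series. */
extern int bpf_ip_check_defrag(struct __sk_buff *skb) __ksym;

SEC("tc")
int defrag_then_filter(struct __sk_buff *skb)
{
	/* Per the cover letter, the kernel defrags an skb_copy() of the
	 * prog skb, so the ctx stays valid even if this fragment is
	 * absorbed into a reassembly queue. */
	if (bpf_ip_check_defrag(skb) != 0)
		return TC_ACT_SHOT; /* fragment queued (or error): nothing to police yet */

	/* skb now describes a fully reassembled datagram; the complete
	 * 5-tuple and L4 payload are available for policy decisions. */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";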
Comments
On 27/02/2023 19:51, Daniel Xu wrote:
> However, when policy is enforced through BPF, the prog is run before the
> kernel reassembles fragmented packets. This leaves BPF developers in an
> awkward place: implement reassembly (possibly poorly) or use a stateless
> method as described above.

Just out of curiosity - what stops BPF progs using the middle ground of
stateful validation? I'm thinking of something like:

First-frag: run the usual checks on L4 headers etc, if we PASS then save
IPID and maybe expected next frag-offset into a map. But don't try to
stash the packet contents anywhere for later reassembly, just PASS it.

Subsequent frags: look up the IPID in the map. If we find it, validate
and update the frag-offset in the map; if this is the last fragment then
delete the map entry. If the frag-offset was bogus or the IPID wasn't
found in the map, DROP; otherwise PASS.

(If re-ordering is prevalent then use something more sophisticated than
just expected next frag-offset, but the principle is the same. And of
course you might want to put in timers for expiry etc.)

So this avoids the need to stash the packet data and modify/consume SKBs,
because you're not actually doing reassembly; the down-side is that the
BPF program can't so easily make decisions about the application-layer
contents of the fragmented datagram, but for the common case (we just
care about the 5-tuple) it's simple enough.

But I haven't actually tried it, so maybe there's some obvious reason why
it can't work this way.

-ed
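As a rough illustration of the validator Ed describes, here is a sketch,
IPv4-only and assuming in-order fragments. The map sizing, the LRU choice,
and keying on the full (saddr, daddr, protocol, id) tuple rather than the
bare IPID are all assumptions; expiry timers and reordering handling are
omitted, as Ed notes they would be needed:

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Not exposed by the UAPI headers, so defined locally. */
#define IP_MF     0x2000
#define IP_OFFSET 0x1fff

struct frag_key {
	__be32 saddr;
	__be32 daddr;
	__be16 id;
	__u8   protocol;
};

struct {
	__uint(type, BPF_MAP_TYPE_LRU_HASH);
	__uint(max_entries, 4096);
	__type(key, struct frag_key);
	__type(value, __u16); /* expected next offset, in 8-byte units */
} frag_state SEC(".maps");

SEC("tc")
int frag_validate(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;
	struct frag_key key = {};
	struct iphdr *iph;
	__u16 frag, offset, blocks, *expected;

	if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;
	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return TC_ACT_SHOT;

	frag = bpf_ntohs(iph->frag_off);
	if (!(frag & (IP_MF | IP_OFFSET)))
		return TC_ACT_OK; /* not a fragment */

	key.saddr = iph->saddr;
	key.daddr = iph->daddr;
	key.id = iph->id;
	key.protocol = iph->protocol;
	offset = frag & IP_OFFSET;
	/* this fragment's payload length, in 8-byte units */
	blocks = (bpf_ntohs(iph->tot_len) - iph->ihl * 4) / 8;

	if (offset == 0) {
		/* first fragment: the usual L4 policy checks go here */
		__u16 next = blocks;

		bpf_map_update_elem(&frag_state, &key, &next, BPF_ANY);
		return TC_ACT_OK;
	}

	expected = bpf_map_lookup_elem(&frag_state, &key);
	if (!expected || *expected != offset)
		return TC_ACT_SHOT; /* unknown flow or bogus offset */

	if (!(frag & IP_MF))
		bpf_map_delete_elem(&frag_state, &key); /* last fragment */
	else
		*expected = offset + blocks;
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";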
Hi Ed,

Thanks for giving this a look.

On Mon, Feb 27, 2023 at 08:38:41PM +0000, Edward Cree wrote:
> Just out of curiosity - what stops BPF progs using the middle ground of
> stateful validation? I'm thinking of something like:
> First-frag: run the usual checks on L4 headers etc, if we PASS then save
> IPID and maybe expected next frag-offset into a map.
[...]

I don't believe full L4 headers are required in the first fragment.
Sufficiently sneaky attackers can, I think, send a byte at a time to
subvert your proposed algorithm. Storing skb data seems inevitable here.
Someone can correct me if I'm wrong here.

Reordering like you mentioned is another attack vector. Perhaps there are
more sophisticated semi-stateful algorithms that can solve the problem,
but it leads me to my next point.

A semi-stateful method like you are proposing is concerning to me from a
reliability and correctness standpoint. Such a method can suffer from
impedance mismatches with the rest of the system. For example, whatever
map sizes you choose should probably be aligned with sysfs conntrack
values, otherwise you may get some very interesting and unexpected pkt
drops. I think Cilium had a talk a while ago about debugging a related
conntrack issue in the same vein. Furthermore, the debugging and
troubleshooting facilities will be different (counters, logs, etc.).
Unless someone has had lots of experience writing an IP stack from the
ground up, I suspect there are quite a few more unknown-unknowns here.

What I find valuable about this patch series is that we can leverage the
well understood and battle hardened kernel facilities. So avoid all the
correctness and security issues that the kernel has spent 20+ years
fixing. And make it trivial for the next person that comes along to do
the right thing.

Hopefully this all makes sense.

Thanks,
Daniel
On 27/02/2023 22:04, Daniel Xu wrote:
> I don't believe full L4 headers are required in the first fragment.
> Sufficiently sneaky attackers can, I think, send a byte at a time to
> subvert your proposed algorithm. Storing skb data seems inevitable here.
> Someone can correct me if I'm wrong here.

My thinking was that legitimate traffic would never do this and thus if
your first fragment doesn't have enough data to make a determination
then you just DROP the packet.

> What I find valuable about this patch series is that we can leverage the
> well understood and battle hardened kernel facilities. So avoid all the
> correctness and security issues that the kernel has spent 20+ years
> fixing.

I can certainly see the argument here. I guess it's a question of: are
you more worried about the DoS from tricking the validator into thinking
good fragments are bad (the reverse is irrelevant, because if you can
trick a validator into thinking your bad fragment belongs to a previously
seen good packet, then you can equally trick a reassembler into stitching
your bad fragment into that packet), or are you more worried about the
DoS from tying lots of memory down in the reassembly cache?

Even with reordering handling, a data structure to record which ranges of
a packet have been seen takes much less memory than storing the complete
fragment bodies. (Just a simple bitmap of 8-byte blocks — the resolution
of iph->frag_off — reduces size by a factor of 64, not counting all the
overhead of a struct sk_buff for each fragment in the queue. Or you could
re-use the rbtree-based code from the reassembler, just with a freshly
allocated node containing only offset & length, instead of the whole
SKB.)

And having a BPF helper effectively consume the skb is awkward, as you
noted; someone is likely to decide that skb_copy() is too slow, try to
add ctx invalidation, and thereby create a whole new swathe of potential
correctness and security issues.

Plus, imagine trying to support this in a hardware-offload XDP device.
They'd have to reimplement the entire frag cache, which is a much bigger
attack surface than just a frag validator, and they couldn't leverage the
battle-hardened kernel implementation.

> And make it trivial for the next person that comes along to do the
> right thing.

Fwiw the validator approach could *also* be a helper, it doesn't have to
be something the BPF developer writes for themselves.

But if after thinking about the possibility you still prefer your way, I
won't try to stop you — I just wanted to ensure it had been considered.

-ed
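To put numbers on the bitmap suggestion: at one bit per 8-byte block (the
resolution of iph->frag_off), even a maximal 65535-byte IPv4 datagram
needs under 8192 bits, so 1 KiB of bitmap versus up to 64 KiB of queued
fragment payload plus per-fragment sk_buff overhead. A sketch of such a
record, with the struct layout assumed for illustration:

#include <linux/types.h>

struct frag_seen {
	/* one bit per 8-byte block, i.e. 1/64 of the data size; 8192
	 * blocks covers the 65535-byte IPv4 maximum, so 1 KiB total */
	__u64 bitmap[8192 / 64];
	__u16 total_blocks; /* known once the MF=0 fragment arrives */
};

/* Mark blocks [start, start + nblocks) as seen. (In an actual BPF prog
 * a loop like this would need an explicit bound for the verifier.) */
static void mark_range(struct frag_seen *s, __u16 start, __u16 nblocks)
{
	for (__u32 i = start; i < (__u32)start + nblocks && i < 8192; i++)
		s->bitmap[i / 64] |= 1ULL << (i % 64);
}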
On Mon, Feb 27, 2023 at 12:51:02PM -0700, Daniel Xu wrote:
[...]
> === Patchset details ===
>
> This patchset is (hopefully) relatively straightforward from a BPF
> perspective. One thing I'd like to call out is the skb_copy()ing of the
> prog skb. I did this to maintain the invariant that the ctx remains
> valid after the prog has run. This is relevant because ip_defrag() and
> ip_check_defrag() may consume the skb if the skb is a fragment.

Instead of doing all that with an extra skb copy, can you hook the bpf
prog after the networking stack has already handled ip defrag?
What kind of middlebox is this? Why does it have to run at the TC layer?
On Mon, Feb 27, 2023 at 5:57 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> Hi Alexei,
>
> On Mon, Feb 27, 2023 at 03:03:38PM -0800, Alexei Starovoitov wrote:
> > Instead of doing all that with an extra skb copy, can you hook the bpf
> > prog after the networking stack has already handled ip defrag?
> > What kind of middlebox is this? Why does it have to run at the TC
> > layer?
>
> Unless I'm missing something, the only other relevant hooks would be
> socket hooks, right?
>
> Unfortunately I don't think my use case can do that. We are running the
> kernel as a router, so no sockets are involved.

Are you using bpf_fib_lookup and populating the kernel routing table and
doing everything on your own, including neigh?

Have you considered skb redirect to another netdev that does ip defrag?
Like macvlan does it under some conditions. This can be generalized.

Recently Florian proposed to allow calling bpf progs from all existing
netfilter hooks. You can pretend to local deliver and hook in
NF_INET_LOCAL_IN?

I feel it would be so much cleaner if the stack does ip_defrag normally.
The general issue of skb ownership between the bpf prog and the defrag
logic isn't really solved with skb_copy. It's still an issue.
On 2/28/23 5:56 AM, Alexei Starovoitov wrote:
> On Mon, Feb 27, 2023 at 5:57 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
[...]
> Have you considered skb redirect to another netdev that does ip defrag?
> Like macvlan does it under some conditions. This can be generalized.
>
> Recently Florian proposed to allow calling bpf progs from all existing
> netfilter hooks. You can pretend to local deliver and hook in
> NF_INET_LOCAL_IN?
>
> I feel it would be so much cleaner if the stack does ip_defrag normally.
> The general issue of skb ownership between the bpf prog and the defrag
> logic isn't really solved with skb_copy. It's still an issue.

I do like this series and we would also use it for the Cilium case, so
+1 on the tc BPF integration.

Today we have in Cilium what Ed [0] hinted at in his earlier mail, where
we extract information from the first fragment and store the metadata in
a BPF map for subsequent packets based on ipid [1], but limitations
apply, e.g. service load-balancing won't work. Redirecting to a
different device or moving higher up the stack is cumbersome since we
then need to recirculate back into the tc BPF layer where all the
business logic is located, which also handles the regular
(non-fragmented) path.

Wrt skb ownership, can you elaborate on what the concrete issue is,
exactly? Anything that comes to mind with this approach that could crash
the kernel?

[0] https://lore.kernel.org/bpf/cf49a091-9b14-05b8-6a79-00e56f3019e1@gmail.com/
[1] https://github.com/cilium/cilium/pull/10264
Hi Alexei,

On Mon, Feb 27, 2023 at 08:56:38PM -0800, Alexei Starovoitov wrote:
> On Mon, Feb 27, 2023 at 5:57 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
[...]
> > Unfortunately I don't think my use case can do that. We are running the
> > kernel as a router, so no sockets are involved.
>
> Are you using bpf_fib_lookup and populating the kernel routing table and
> doing everything on your own, including neigh?

We're currently not doing any routing things in BPF yet. All the routing
manipulation has been done in iptables / netfilter so far.

I'm not super familiar with routing stuff, but from what I understand
there is some relatively complicated stuff going on with BGP and ipsec
tunnels at the moment. Not sure if that answers your question.

> Have you considered skb redirect to another netdev that does ip defrag?
> Like macvlan does it under some conditions. This can be generalized.

I had not considered that yet. Are you suggesting adding a new
passthrough netdev that defrags? I looked at the macvlan driver and it
looks like it defrags to handle some multicast corner case.

> Recently Florian proposed to allow calling bpf progs from all existing
> netfilter hooks. You can pretend to local deliver and hook in
> NF_INET_LOCAL_IN?

Does that work for forwarding cases? I'm reading through [0] and it
seems to suggest that it'll only defrag for locally destined packets:

    If the destination IP address matches a local NIC's IP address, the
    dst_input() function will bring the packets into ip_local_deliver(),
    which will defrag the packet and pass it to the NF_IP_LOCAL_IN hook

Faking local delivery seems kinda ugly -- maybe I don't know any clean
ways.

[...]

[0]: https://kernelnewbies.org/Networking?action=AttachFile&do=get&target=hacking_the_wholism_of_linux_net.txt

Thanks,
Daniel
Hi Ed,

Had some trouble with email yesterday (forgot to renew my domain
registration) and this reply might not have made it out. Apologies if
it's a repost.

On Mon, Feb 27, 2023 at 10:58:47PM +0000, Edward Cree wrote:
> My thinking was that legitimate traffic would never do this and thus if
> your first fragment doesn't have enough data to make a determination
> then you just DROP the packet.

Right, that would be practical. I had some discussion with coworkers and
the other option on the table is to drop all fragments. At least for us
in the cloud, fragments are heavily frowned upon (where are they not..)
anyways.

> I can certainly see the argument here. I guess it's a question of: are
> you more worried about the DoS from tricking the validator into thinking
> good fragments are bad [...], or are you more worried about the DoS from
> tying lots of memory down in the reassembly cache?

Equal balance of concerns on my side. Ideally there is no dropping of
valid packets and DoS is very hard to achieve.

> Even with reordering handling, a data structure to record which ranges
> of a packet have been seen takes much less memory than storing the
> complete fragment bodies.
[...]

Yeah, now that you say that, it doesn't sound too bad on the space side.
But I do wonder -- how much code and complexity is that going to be? For
example, I think ipv6 frags have a 60s reassembly timeout, which adds
more stuff to consider. And probably even more I've already forgotten.
Because at least on the kernel side, this series is 80% code for tests.
And the kfunc wrappers are not very invasive at all. Plus it's wrapping
infra that hasn't changed much for decades.

> And having a BPF helper effectively consume the skb is awkward, as you
> noted; someone is likely to decide that skb_copy() is too slow, try to
> add ctx invalidation, and thereby create a whole new swathe of potential
> correctness and security issues.

Yep. I did try that. While the verifier bits weren't too tricky, there
are a lot of infra concerns to solve:

* https://github.com/danobi/linux/commit/35a66af8d54cca647b0adfc7c1da7105d2603dde
* https://github.com/danobi/linux/commit/e8c86ea75e2ca8f0631632d54ef763381308711e
* https://github.com/danobi/linux/commit/972bcf769f41fbfa7f84ce00faf06b5b57bc6f7a

But FWIW, fragmented packets are kinda a corner case anyways. I don't
think it would be reasonable to expect high perf when fragments are in
play.

> Plus, imagine trying to support this in a hardware-offload XDP device.
> They'd have to reimplement the entire frag cache, which is a much bigger
> attack surface than just a frag validator, and they couldn't leverage
> the battle-hardened kernel implementation.

Hmm, well this helper is restricted to TC progs for now. I don't quite
see a path to enabling it for XDP, as there would have to be at a minimum
quite a few allocations to handle frags. So not sure XDP is a factor at
the moment.

> Fwiw the validator approach could *also* be a helper, it doesn't have to
> be something the BPF developer writes for themselves.
>
> But if after thinking about the possibility you still prefer your way, I
> won't try to stop you — I just wanted to ensure it had been considered.

Thank you for the discussion. The thought had come to mind originally,
but I shied away after seeing some of the reassembly details. Would be
interested in hearing more from other folks.

Thanks,
Daniel
On Tue, Feb 28, 2023 at 3:17 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> > Have you considered skb redirect to another netdev that does ip defrag?
> > Like macvlan does it under some conditions. This can be generalized.
>
> I had not considered that yet. Are you suggesting adding a new
> passthrough netdev that defrags? I looked at the macvlan driver and it
> looks like it defrags to handle some multicast corner case.

Something like that. A netdev that a bpf prog can redirect to. It will
consume ip frags and eventually will produce a reassembled skb.

The kernel ip_defrag logic has timeouts, counters, rhashtable with
thresholds, etc. All of them are per netns. Just another ip_defrag_user
will still share the rhashtable with its limits. The kernel can even do
icmp_send(). ip_defrag is not a kfunc. It's a big block with plenty of
kernel-wide side effects. I really don't think we can alloc_skb,
copy_skb, and ip_defrag it. It messes with the stack too much.

It's also not clear to me when the skb is reassembled and how bpf sees
it. "Redirect into a reassembling netdev" and attaching a bpf prog to
consume that skb is much cleaner imo. Maybe there are other ways to use
ip_defrag, but certainly not as a synchronous api helper.
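For illustration, the sending side of this proposal might look like the
sketch below. The "defrag0" device and its ifindex are hypothetical (no
such reassembling netdev exists in the kernel); only the bpf_redirect()
helper itself is real:

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define IP_MF     0x2000
#define IP_OFFSET 0x1fff

/* True if this is an IPv4 fragment (same parse as the validator sketch
 * earlier in the thread). */
static __always_inline int ipv4_fragment(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph;

	if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return 0;
	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return 0;
	return bpf_ntohs(iph->frag_off) & (IP_MF | IP_OFFSET);
}

SEC("tc")
int punt_frags(struct __sk_buff *skb)
{
	/* Hypothetical: ifindex of an imaginary reassembling "defrag0"
	 * device. A second prog attached there would see the
	 * reassembled skb. */
	const int defrag_ifindex = 42;

	if (ipv4_fragment(skb))
		return bpf_redirect(defrag_ifindex, BPF_F_INGRESS);
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";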
Hi Alexei,

(cc netfilter maintainers)

On Mon, Mar 06, 2023 at 08:17:20PM -0800, Alexei Starovoitov wrote:
> Something like that. A netdev that a bpf prog can redirect to. It will
> consume ip frags and eventually will produce a reassembled skb.
[...]
> Maybe there are other ways to use ip_defrag, but certainly not as a
> synchronous api helper.

I was giving the virtual netdev idea some thought this morning, and I
thought I'd give the netfilter approach a deeper look.

From my reading (I'll run some tests later) it looks like netfilter will
defrag all ipv4/ipv6 packets in any netns with conntrack enabled. It
appears to do so in NF_INET_PRE_ROUTING. Unfortunately that does run
after tc hooks. But fortunately, with the new BPF netfilter hooks I
think we can make defrag work outside of BPF kfuncs like you want. And
the NF_IP_FORWARD hook works well for my router use case.

One thing we would need, though, are (probably kfunc) wrappers around
nf_defrag_ipv4_enable() and nf_defrag_ipv6_enable() to ensure BPF progs
are not transitively depending on defrag support from other netfilter
modules.

The exact mechanism would probably need some thinking, as the above
functions kinda rely on module_init() and module_exit() semantics. We
cannot make the prog bump the refcnt every time it runs -- it would
overflow. And it would be nice to automatically drop the refcnt when the
prog is unloaded. Once the netfilter prog type series lands I can get
that discussion started.

Unless Daniel feels strongly that we should continue with the approach
in this patchset, I am leaning towards dropping it in favor of the
netfilter approach.

Thanks,
Daniel
Daniel Xu <dxu@dxuuu.xyz> wrote:
> From my reading (I'll run some tests later) it looks like netfilter will
> defrag all ipv4/ipv6 packets in any netns with conntrack enabled. It
> appears to do so in NF_INET_PRE_ROUTING.

Yes, and output.

> One thing we would need, though, are (probably kfunc) wrappers around
> nf_defrag_ipv4_enable() and nf_defrag_ipv6_enable() to ensure BPF progs
> are not transitively depending on defrag support from other netfilter
> modules.
>
> The exact mechanism would probably need some thinking, as the above
> functions kinda rely on module_init() and module_exit() semantics. We
> cannot make the prog bump the refcnt every time it runs -- it would
> overflow. And it would be nice to automatically drop the refcnt when the
> prog is unloaded.

Probably add a flag attribute that is evaluated at BPF_LINK time, so
progs can say they need defrag enabled. The same could be used to
request conntrack enablement.

Will need some glue on the netfilter side to handle DEFRAG=m, but we
already have plenty of those.
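In userspace terms, the flag idea might look roughly like the sketch
below. Note that the netfilter prog type series had not landed at this
point, so the BPF_NETFILTER attach type, the .netfilter link options,
and the BPF_F_NF_DEFRAG flag name are all assumptions here:

// SPDX-License-Identifier: GPL-2.0
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include <linux/netfilter.h>

/* Hypothetical: a link-create flag asking the kernel to enable defrag
 * for the hook's netns. No such flag existed when this was written. */
#define BPF_F_NF_DEFRAG (1U << 0)

int attach_fwd_prog(int prog_fd)
{
	/* The .netfilter opts and BPF_NETFILTER attach type are part of
	 * the then-unmerged netfilter prog type series. */
	LIBBPF_OPTS(bpf_link_create_opts, opts,
		.netfilter.pf = NFPROTO_IPV4,
		.netfilter.hooknum = NF_INET_FORWARD,
		.netfilter.priority = -100,
		.netfilter.flags = BPF_F_NF_DEFRAG, /* assumed name */
	);

	return bpf_link_create(prog_fd, 0, BPF_NETFILTER, &opts);
}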
On Tue, Mar 7, 2023 at 12:11 PM Florian Westphal <fw@strlen.de> wrote:
>
> Probably add a flag attribute that is evaluated at BPF_LINK time, so
> progs can say they need defrag enabled. The same could be used to
> request conntrack enablement.
>
> Will need some glue on the netfilter side to handle DEFRAG=m, but we
> already have plenty of those.

All makes perfect sense to me. It's cleaner than a special netdevice.

ipv4_conntrack_defrag() is pretty neat. I didn't know about it. If we
can reuse it as-is, that would be ideal. Conceptually it fits perfectly.
If we cannot reuse it (for whatever unlikely reason), I would argue that
the TC hook should gain similar functionality.