[bpf-next,3/6] bpf, net, frags: Add bpf_ip_check_defrag() kfunc
Commit Message
This kfunc is used to defragment IPv4 packets. The idea is that if you
see a fragmented packet, you call this kfunc. If the kfunc returns 0,
then the skb has been updated to contain the entire reassembled packet.
If the kfunc returns an error (most likely -EINPROGRESS), then it means
the skb is part of a yet-incomplete original packet. A reasonable
response to -EINPROGRESS is to drop the packet, as the ip defrag
infrastructure is already hanging onto the frag for future reassembly.
Care has been taken to ensure the prog skb remains valid no matter what
the underlying ip_check_defrag() call does. This is in contrast to
ip_defrag(), which may consume the skb if the skb is part of a
yet-incomplete original packet.
So far this kfunc is only callable from TC clsact progs.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
include/net/ip.h | 11 +++++
net/ipv4/Makefile | 1 +
net/ipv4/ip_fragment.c | 2 +
net/ipv4/ip_fragment_bpf.c | 98 ++++++++++++++++++++++++++++++++++++++
4 files changed, 112 insertions(+)
create mode 100644 net/ipv4/ip_fragment_bpf.c
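
For illustration, the calling convention described in the commit message can be modeled in plain userspace C. This is only a sketch, not code from the series: the sketch_* names and the verdict enum are hypothetical stand-ins (a real TC program would return TC_ACT_* values and call the kfunc itself), and dropping on every non-zero return, not only -EINPROGRESS, is an assumption.

```c
#include <arpa/inet.h>   /* htons() */
#include <errno.h>       /* EINPROGRESS, for callers of the sketch */
#include <stdbool.h>
#include <stdint.h>

/* IPv4 flags/fragment-offset field layout (RFC 791). */
#define SKETCH_IP_MF      0x2000  /* "more fragments" flag */
#define SKETCH_IP_OFFSET  0x1FFF  /* fragment offset mask */

/* Hypothetical verdicts standing in for the TC pass/drop actions. */
enum sketch_verdict { SKETCH_PASS, SKETCH_DROP };

/* A packet is a fragment if MF is set or the fragment offset is non-zero.
 * frag_off is in network byte order, as it appears in the IPv4 header.
 */
static bool sketch_is_fragment(uint16_t frag_off)
{
	return (frag_off & htons(SKETCH_IP_MF | SKETCH_IP_OFFSET)) != 0;
}

/* Map the defrag return value to a verdict:
 *   not a fragment -> nothing to do, pass
 *   0              -> packet is now fully reassembled, pass
 *   anything else  -> drop; for -EINPROGRESS the defrag queue keeps
 *                     its own copy of the fragment (assumption: other
 *                     errors are dropped as well)
 */
static enum sketch_verdict sketch_verdict_for(uint16_t frag_off, int defrag_ret)
{
	if (!sketch_is_fragment(frag_off))
		return SKETCH_PASS;
	if (defrag_ret == 0)
		return SKETCH_PASS;
	return SKETCH_DROP;
}
```

For example, sketch_verdict_for(htons(0x2000), -EINPROGRESS) yields SKETCH_DROP, while a packet with only the DF bit set (htons(0x4000)) is not treated as a fragment at all.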
Comments
Hi Daniel,
Thanks for working on this!
On 12/15/22 12:25 AM, Daniel Xu wrote:
[...]
> +#include <linux/bpf.h>
> +#include <linux/btf_ids.h>
> +#include <linux/ip.h>
> +#include <linux/filter.h>
> +#include <linux/netdevice.h>
> +#include <net/ip.h>
> +#include <net/sock.h>
> +
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> + "Global functions as their definitions will be in ip_fragment BTF");
> +
> +/* bpf_ip_check_defrag - Defragment an ipv4 packet
> + *
> + * This helper takes an skb as input. If this skb successfully reassembles
> + * the original packet, the skb is updated to contain the original, reassembled
> + * packet.
> + *
> + * Otherwise (on error or incomplete reassembly), the input skb remains
> + * unmodified.
> + *
> + * Parameters:
> + * @ctx - Pointer to program context (skb)
> + * @netns - Child network namespace id. If value is a negative signed
> + * 32-bit integer, the netns of the device in the skb is used.
> + *
> + * Return:
> + * 0 on successful reassembly or non-fragmented packet. Negative value on
> + * error or incomplete reassembly.
> + */
> +int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)
small nit: for sk lookup helper we've used u32 netns_id, would be nice to have
this consistent here as well.
> +{
> + struct sk_buff *skb = (struct sk_buff *)ctx;
> + struct sk_buff *skb_cpy, *skb_out;
> + struct net *caller_net;
> + struct net *net;
> + int mac_len;
> + void *mac;
> +
> + if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
> + return -EINVAL;
> +
> + caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> + if ((s32)netns < 0) {
> + net = caller_net;
> + } else {
> + net = get_net_ns_by_id(caller_net, netns);
> + if (unlikely(!net))
> + return -EINVAL;
> + }
> +
> + mac_len = skb->mac_len;
> + skb_cpy = skb_copy(skb, GFP_ATOMIC);
> + if (!skb_cpy)
> + return -ENOMEM;
Given this is the slow path, the copy is expensive but okay. Maybe in the future it
could be lifted, though that might be a bigger effort, since the verifier would need
to be taught that the input ctx cannot be accessed anymore. But then frags are very
much discouraged either way, and bpf_ip_check_defrag() might only apply in corner-case
situations (like DNS, etc).
> + skb_out = ip_check_defrag(net, skb_cpy, IP_DEFRAG_BPF);
> + if (IS_ERR(skb_out))
> + return PTR_ERR(skb_out);
Looks like ip_check_defrag() can gracefully handle an IPv6 packet: it will just return
the skb_cpy pointer in that case. However, this brings me to my main complaint. I don't
think we should merge anything IPv4-related without also having equivalent IPv6 support,
otherwise we're building up tech debt, so please also add support for the latter.
> + skb_morph(skb, skb_out);
> + kfree_skb(skb_out);
> +
> + /* ip_check_defrag() does not maintain mac header, so push empty header
> + * in so prog sees the correct layout. The empty mac header will be
> + * later pulled from cls_bpf.
> + */
> + mac = skb_push(skb, mac_len);
> + memset(mac, 0, mac_len);
> + bpf_compute_data_pointers(skb);
> +
> + return 0;
> +}
> +
Thanks,
Daniel
Hi Daniel,
Thanks for taking a look!
On Thu, Dec 15, 2022 at 11:31:52PM +0100, Daniel Borkmann wrote:
> Hi Daniel,
>
> Thanks for working on this!
>
> On 12/15/22 12:25 AM, Daniel Xu wrote:
> [...]
> > +#include <linux/bpf.h>
> > +#include <linux/btf_ids.h>
> > +#include <linux/ip.h>
> > +#include <linux/filter.h>
> > +#include <linux/netdevice.h>
> > +#include <net/ip.h>
> > +#include <net/sock.h>
> > +
> > +__diag_push();
> > +__diag_ignore_all("-Wmissing-prototypes",
> > + "Global functions as their definitions will be in ip_fragment BTF");
> > +
> > +/* bpf_ip_check_defrag - Defragment an ipv4 packet
> > + *
> > + * This helper takes an skb as input. If this skb successfully reassembles
> > + * the original packet, the skb is updated to contain the original, reassembled
> > + * packet.
> > + *
> > + * Otherwise (on error or incomplete reassembly), the input skb remains
> > + * unmodified.
> > + *
> > + * Parameters:
> > + * @ctx - Pointer to program context (skb)
> > + * @netns - Child network namespace id. If value is a negative signed
> > + * 32-bit integer, the netns of the device in the skb is used.
> > + *
> > + * Return:
> > + * 0 on successful reassembly or non-fragmented packet. Negative value on
> > + * error or incomplete reassembly.
> > + */
> > +int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)
>
> small nit: for sk lookup helper we've used u32 netns_id, would be nice to have
> this consistent here as well.
Hmm, when I grep I see the sk lookup helpers using a u64 as well:
$ rg "u64 netns" ./include/uapi/linux/bpf.h
3394: * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
3431: * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
3629: * struct bpf_sock *bpf_skc_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
6238: __u64 netns_dev;
6239: __u64 netns_ino;
6274: __u64 netns_dev;
6275: __u64 netns_ino;
$ rg "u32 netns" ./include/uapi/linux/bpf.h
6335: __u32 netns_ino;
Am I looking at the right place?
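
To make concrete what the u64 netns argument accepts in this patch, here is a userspace sketch of the range check at the top of bpf_ip_check_defrag(). It is illustrative only: the enum and the sketch_* names are hypothetical, and the (int32_t) cast assumes the usual two's-complement truncation that GCC and Clang implement for the kernel's (s32) cast.

```c
#include <stdint.h>

/* Hypothetical classification of the kfunc's u64 netns argument. */
enum sketch_netns_kind {
	SKETCH_NETNS_CURRENT,  /* low 32 bits negative: use the caller's netns */
	SKETCH_NETNS_LOOKUP,   /* 0..S32_MAX: look the id up */
	SKETCH_NETNS_INVALID,  /* anything else: the kfunc returns -EINVAL */
};

static enum sketch_netns_kind sketch_classify_netns(uint64_t netns)
{
	/* Same truncation as (s32)netns in the patch (assumes modular
	 * conversion, which is what GCC and Clang do).
	 */
	int32_t low = (int32_t)netns;

	/* Mirrors: if (unlikely(!((s32)netns < 0 || netns <= S32_MAX))) */
	if (!(low < 0 || netns <= INT32_MAX))
		return SKETCH_NETNS_INVALID;

	if (low < 0)
		return SKETCH_NETNS_CURRENT;

	return SKETCH_NETNS_LOOKUP;
}
```

With this predicate, 5 is looked up as an id, UINT64_MAX (all bits set) selects the caller's netns, and 0x100000000 is rejected because its low 32 bits are non-negative while the full value exceeds S32_MAX.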
>
> > +{
> > + struct sk_buff *skb = (struct sk_buff *)ctx;
> > + struct sk_buff *skb_cpy, *skb_out;
> > + struct net *caller_net;
> > + struct net *net;
> > + int mac_len;
> > + void *mac;
> > +
> > + if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
> > + return -EINVAL;
> > +
> > + caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> > + if ((s32)netns < 0) {
> > + net = caller_net;
> > + } else {
> > + net = get_net_ns_by_id(caller_net, netns);
> > + if (unlikely(!net))
> > + return -EINVAL;
> > + }
> > +
> > + mac_len = skb->mac_len;
> > + skb_cpy = skb_copy(skb, GFP_ATOMIC);
> > + if (!skb_cpy)
> > + return -ENOMEM;
>
> Given this is the slow path, the copy is expensive but okay. Maybe in the future it
> could be lifted, though that might be a bigger effort, since the verifier would need
> to be taught that the input ctx cannot be accessed anymore. But then frags are very
> much discouraged either way, and bpf_ip_check_defrag() might only apply in corner-case
> situations (like DNS, etc).
Originally I did go the route of teaching the verifier:
* https://github.com/danobi/linux/commit/35a66af8d54cca647b0adfc7c1da7105d2603dde
* https://github.com/danobi/linux/commit/e8c86ea75e2ca8f0631632d54ef763381308711e
* https://github.com/danobi/linux/commit/972bcf769f41fbfa7f84ce00faf06b5b57bc6f7a
It didn't seem too bad on the verifier front (or maybe I got it wrong),
but it seemed a bit unwieldy to wire ctx validity information back out
of the program given ip_check_defrag() may kfree() the skb (so stuffing
data inside skb->cb wouldn't work).
And yeah, given frags are highly discouraged, didn't seem like too bad
of a tradeoff.
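
The tradeoff discussed here (copy first, let the consuming call work on the copy, and only touch the program's buffer on success) can be sketched in userspace C. Everything below is a hypothetical model rather than kernel code: struct sketch_buf stands in for an skb, sketch_consume() stands in for a call that may free its input, and the 0xAA fill merely marks a "reassembled" result.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an skb: an owned byte buffer. */
struct sketch_buf {
	unsigned char *data;
	size_t len;
};

static struct sketch_buf *sketch_buf_copy(const struct sketch_buf *src)
{
	struct sketch_buf *dst = malloc(sizeof(*dst));

	if (!dst)
		return NULL;
	dst->data = malloc(src->len ? src->len : 1);
	if (!dst->data) {
		free(dst);
		return NULL;
	}
	memcpy(dst->data, src->data, src->len);
	dst->len = src->len;
	return dst;
}

static void sketch_buf_free(struct sketch_buf *b)
{
	free(b->data);
	free(b);
}

/* Stand-in for the reassembly step: frees its input when it cannot
 * produce a result, otherwise hands the buffer back through *out.
 */
static int sketch_consume(struct sketch_buf *in, int should_fail,
			  struct sketch_buf **out)
{
	if (should_fail) {
		sketch_buf_free(in);
		return -EINPROGRESS;
	}
	memset(in->data, 0xAA, in->len);	/* pretend: reassembled payload */
	*out = in;
	return 0;
}

/* Copy, consume the copy, commit to the original only on success. */
static int sketch_safe_defrag(struct sketch_buf *orig, int should_fail)
{
	struct sketch_buf *copy, *res;
	int err;

	copy = sketch_buf_copy(orig);
	if (!copy)
		return -ENOMEM;

	err = sketch_consume(copy, should_fail, &res);
	if (err)
		return err;	/* copy is already freed; orig is untouched */

	/* Commit: move the result's storage into the caller's buffer. */
	free(orig->data);
	orig->data = res->data;
	orig->len = res->len;
	free(res);
	return 0;
}
```

The price is one full copy per call, which is the cost being weighed above against teaching the verifier that the context may become invalid.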
>
> > + skb_out = ip_check_defrag(net, skb_cpy, IP_DEFRAG_BPF);
> > + if (IS_ERR(skb_out))
> > + return PTR_ERR(skb_out);
>
> Looks like ip_check_defrag() can gracefully handle an IPv6 packet: it will just return
> the skb_cpy pointer in that case. However, this brings me to my main complaint. I don't
> think we should merge anything IPv4-related without also having equivalent IPv6 support,
> otherwise we're building up tech debt, so please also add support for the latter.
Ok, I'll take a crack at ipv6 support too. Most likely in the form of
another kfunc, depending on how the ipv6 frag infra is set up.
I'll be in and out during the holidays so v2 will most likely come some
time in the new year.
[...]
Thanks,
Daniel
diff --git a/include/net/ip.h b/include/net/ip.h
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -679,6 +679,7 @@ enum ip_defrag_users {
 	IP_DEFRAG_VS_FWD,
 	IP_DEFRAG_AF_PACKET,
 	IP_DEFRAG_MACVLAN,
+	IP_DEFRAG_BPF,
 };
 
 /* Return true if the value of 'user' is between 'lower_bond'
@@ -692,6 +693,16 @@ static inline bool ip_defrag_user_in_between(u32 user,
 }
 
 int ip_defrag(struct net *net, struct sk_buff *skb, u32 user);
+
+#ifdef CONFIG_DEBUG_INFO_BTF
+int register_ip_frag_bpf(void);
+#else
+static inline int register_ip_frag_bpf(void)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_INET
 struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user);
 #else
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -64,6 +64,7 @@ obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
 obj-$(CONFIG_NET_SOCK_MSG) += tcp_bpf.o
 obj-$(CONFIG_BPF_SYSCALL) += udp_bpf.o
 obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
+obj-$(CONFIG_DEBUG_INFO_BTF) += ip_fragment_bpf.o
 
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
 		      xfrm4_output.o xfrm4_protocol.o
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -757,5 +757,7 @@ void __init ipfrag_init(void)
 	if (inet_frags_init(&ip4_frags))
 		panic("IP: failed to allocate ip4_frags cache\n");
 	ip4_frags_ctl_register();
+	if (register_ip_frag_bpf())
+		panic("IP: bpf: failed to register ip_frag_bpf\n");
 	register_pernet_subsys(&ip4_frags_ops);
 }
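diff --git a/net/ipv4/ip_fragment_bpf.c b/net/ipv4/ip_fragment_bpf.c
new file mode 100644
--- /dev/null
+++ b/net/ipv4/ip_fragment_bpf.c
@@ -0,0 +1,98 @@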
+// SPDX-License-Identifier: GPL-2.0-only
+/* Unstable ipv4 fragmentation helpers for TC-BPF hook
+ *
+ * These are called from SCHED_CLS BPF programs. Note that it is allowed to
+ * break compatibility for these functions since the interface they are exposed
+ * through to BPF programs is explicitly unstable.
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/ip.h>
+#include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <net/ip.h>
+#include <net/sock.h>
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in ip_fragment BTF");
+
+/* bpf_ip_check_defrag - Defragment an ipv4 packet
+ *
+ * This helper takes an skb as input. If this skb successfully reassembles
+ * the original packet, the skb is updated to contain the original, reassembled
+ * packet.
+ *
+ * Otherwise (on error or incomplete reassembly), the input skb remains
+ * unmodified.
+ *
+ * Parameters:
+ * @ctx - Pointer to program context (skb)
+ * @netns - Child network namespace id. If value is a negative signed
+ * 32-bit integer, the netns of the device in the skb is used.
+ *
+ * Return:
+ * 0 on successful reassembly or non-fragmented packet. Negative value on
+ * error or incomplete reassembly.
+ */
+int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns)
+{
+	struct sk_buff *skb = (struct sk_buff *)ctx;
+	struct sk_buff *skb_cpy, *skb_out;
+	struct net *caller_net;
+	struct net *net;
+	int mac_len;
+	void *mac;
+
+	if (unlikely(!((s32)netns < 0 || netns <= S32_MAX)))
+		return -EINVAL;
+
+	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+	if ((s32)netns < 0) {
+		net = caller_net;
+	} else {
+		net = get_net_ns_by_id(caller_net, netns);
+		if (unlikely(!net))
+			return -EINVAL;
+	}
+
+	mac_len = skb->mac_len;
+	skb_cpy = skb_copy(skb, GFP_ATOMIC);
+	if (!skb_cpy)
+		return -ENOMEM;
+
+	skb_out = ip_check_defrag(net, skb_cpy, IP_DEFRAG_BPF);
+	if (IS_ERR(skb_out))
+		return PTR_ERR(skb_out);
+
+	skb_morph(skb, skb_out);
+	kfree_skb(skb_out);
+
+	/* ip_check_defrag() does not maintain mac header, so push empty header
+	 * in so prog sees the correct layout. The empty mac header will be
+	 * later pulled from cls_bpf.
+	 */
+	mac = skb_push(skb, mac_len);
+	memset(mac, 0, mac_len);
+	bpf_compute_data_pointers(skb);
+
+	return 0;
+}
+
+__diag_pop();
+
+BTF_SET8_START(ip_frag_kfunc_set)
+BTF_ID_FLAGS(func, bpf_ip_check_defrag, KF_CHANGES_PKT)
+BTF_SET8_END(ip_frag_kfunc_set)
+
+static const struct btf_kfunc_id_set ip_frag_bpf_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &ip_frag_kfunc_set,
+};
+
+int register_ip_frag_bpf(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS,
+					 &ip_frag_bpf_kfunc_set);
+}