Message ID | ZS1/qtr0dZJ35VII@debian.debian |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:612c:2908:b0:403:3b70:6f57 with SMTP id ib8csp3646317vqb; Mon, 16 Oct 2023 11:24:53 -0700 (PDT) X-Google-Smtp-Source: AGHT+IEwuausEfkLowNWeEamDvVXhtIbrpl/z3EAv3unWnQgikmKOJDg0vZaofNxBmQ+NlPX8JNw X-Received: by 2002:a05:6a00:98b:b0:6be:a1e:952 with SMTP id u11-20020a056a00098b00b006be0a1e0952mr4405756pfg.2.1697480692865; Mon, 16 Oct 2023 11:24:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1697480692; cv=none; d=google.com; s=arc-20160816; b=hCt/vDsDCkTkx5BE5hSZjzb2mYey+m54P8qFV6RrNbKGJeZwFnosLI8eTXhqFie0Rs ml1585C8/dukCbxdunIrC16L3lfRBz0qSXheKhWHA6WkvmH9fZczje0QwCGJgHBvMEWo eOEIZsqjpCcQac8G1yzOVDyTryyCs/NlLrjl1vqfJGBHub01EKFUpQP+BYrkQlJq5O7o AfUwS6QStpWhGZEWHWX4yqM1HXZxmzig2U/7swOuoNngyvm0kgpNjU11bzM+k3YyebEO jmjU6HJWtvSIKSzYXUoTxdZv/ets8jiPk6oBsDww5Y/zdHLKHOCqYpCMYoccikyHke1W HaIg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-disposition:mime-version:message-id :subject:cc:to:from:date:dkim-signature; bh=slLQdIKd076Wd/jcmKbWiuJysNgl1PI1tXfrw1I/2ns=; fh=X0+pxaxwBSJD+SPTMXk1Xgk/FD2iNpMukhQNhYv68GM=; b=TF+ds+PmAyA2fQxc4QOgMAS36UNJeRNZaVpUOqSViv/2/cAZGK/f5A8avw5A2ZWK6J YnpUaFEnW+/1nNH4t11TZAsZ4zC0W7sKjKCqpUWRucELnOjYjBhRinOvQPIaoU+aMmLm ijsQFVYwOOsUaOTz0a4oCJPZkypIif5iRTbuyVttZG7sCRx4hajIog6538dFEPhpIYzA BBo6qcHtJW8U6A+xjbMXeuV3wr7DBXt8XAcvz6rfUlbn+vY37trLOleUIPqILUaStGeo PkI+VdTf+5TgqDqhiHG8LOUa7pUgmXnOQoMTAqmQOn0QYCayFcs/27pgPxWvQwKSq1kQ VYdA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@cloudflare.com header.s=google09082023 header.b=RFypY+mw; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=cloudflare.com Received: from howler.vger.email (howler.vger.email. [2620:137:e000::3:4]) by mx.google.com with ESMTPS id k9-20020aa790c9000000b006be04dec663si296148pfk.311.2023.10.16.11.24.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 16 Oct 2023 11:24:52 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) client-ip=2620:137:e000::3:4; Authentication-Results: mx.google.com; dkim=pass header.i=@cloudflare.com header.s=google09082023 header.b=RFypY+mw; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=cloudflare.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by howler.vger.email (Postfix) with ESMTP id ED1F2804910E; Mon, 16 Oct 2023 11:24:49 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at howler.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233956AbjJPSXp (ORCPT <rfc822;hjfbswb@gmail.com> + 18 others); Mon, 16 Oct 2023 14:23:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36764 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233724AbjJPSXn (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 16 Oct 2023 14:23:43 -0400 Received: from mail-qt1-x82d.google.com (mail-qt1-x82d.google.com [IPv6:2607:f8b0:4864:20::82d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8FA14E8 for <linux-kernel@vger.kernel.org>; Mon, 16 Oct 2023 11:23:41 -0700 (PDT) Received: by mail-qt1-x82d.google.com with SMTP id d75a77b69052e-417f872fb94so33985891cf.0 for <linux-kernel@vger.kernel.org>; Mon, 16 Oct 2023 11:23:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cloudflare.com; s=google09082023; t=1697480620; x=1698085420; darn=vger.kernel.org; h=content-disposition:mime-version:message-id:subject:cc:to:from:date :from:to:cc:subject:date:message-id:reply-to; bh=slLQdIKd076Wd/jcmKbWiuJysNgl1PI1tXfrw1I/2ns=; b=RFypY+mwC6t9sbo9JHt5O+rjkefCfBzGS49N5fFZj8PUbzoS3S6Aj6MwA2jbzru8lR DLNX0alfgU7cg9eQW0ro2+bduCFveSQnesQXwTo6TqNx2LxdyIRvEnQuvijKJNSlQRHk y/afqiecl0K5CmfBjUpNmHQSASw0J1+8XluEByDF52F+Ykdi838LVuTgMB8MxwFMzBm0 NyN8e8PiyeYg+/sZqV+ggJITbWRUDz0zVBlSkSxMImTO8ehOworSAxAN3acOFEhJvr4F 9o2C/6cnlNKgMW+mpapWsizs39HaygxZL7uZZfsRJAzkJNd4D1Ttyz7Xxp+p98L5oZjA xI5A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1697480620; x=1698085420; h=content-disposition:mime-version:message-id:subject:cc:to:from:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=slLQdIKd076Wd/jcmKbWiuJysNgl1PI1tXfrw1I/2ns=; b=FIksqEDlsZXr9JQXKDu+FQsISx2ALUIvSGqlhiFqIiDd0177UOPg4ulll67X0O48QO goxzt6Y/bMSxr3JGU5hpsmuvjTnFbjUYAPkJkQFKhTudnMzdkfMqODMtIjexjBXCuZkE Wz9K/4MSXVQ9DU5dWInBQaj4lxRFlH9fY/sw7HezWRFtUnxUvLdt6c9PPV04EqWuJHVW UK5roB4WfFJ5XhiI0/gI7fAlE6UqekwepQ/PX5piJJFURMo5ofE6sugpRjfuR3Nd0ifP /ozmgEGJc3gKM76TlVABLHegz8vkLGl3Z2irKQSVpyXCc8ZLiHz/TWfshFQseTVhFEbv wjYw== X-Gm-Message-State: AOJu0YzCUYfAknfCYsZdzn6Sc51zjs3cput6S0/oaC3H3t/OkY9NNscE uQ1mNzJch/w0hAHH3AOFPrwhiA== X-Received: by 2002:ac8:5916:0:b0:418:163b:c5d7 with SMTP id 22-20020ac85916000000b00418163bc5d7mr55277qty.58.1697480620673; Mon, 16 Oct 2023 11:23:40 -0700 (PDT) Received: from debian.debian ([140.141.197.139]) by smtp.gmail.com with ESMTPSA id kq19-20020ac86193000000b00405502aaf76sm3218093qtb.57.2023.10.16.11.23.39 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 16 Oct 2023 11:23:40 -0700 (PDT) Date: Mon, 16 Oct 2023 11:23:38 -0700 From: Yan Zhai <yan@cloudflare.com> To: netdev@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net>, David Ahern <dsahern@kernel.org>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Aya Levin <ayal@nvidia.com>, Tariq Toukan <tariqt@nvidia.com>, linux-kernel@vger.kernel.org, kernel-team@cloudflare.com, Florian Westphal <fw@strlen.de>, Willem de Bruijn <willemdebruijn.kernel@gmail.com> Subject: [PATCH v2 net-next] ipv6: avoid atomic fragment on GSO packets Message-ID: <ZS1/qtr0dZJ35VII@debian.debian> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on howler.vger.email Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (howler.vger.email [0.0.0.0]); Mon, 16 Oct 2023 11:24:50 -0700 (PDT) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1779937515226384172 X-GMAIL-MSGID: 1779937515226384172 |
Series |
[v2,net-next] ipv6: avoid atomic fragment on GSO packets
|
|
Commit Message
Yan Zhai
Oct. 16, 2023, 6:23 p.m. UTC
GSO packets can contain a trailing segment that is smaller than
gso_size. When examining the dst MTU for such packet, if its gso_size is
too large, then all segments would be fragmented. However, there is a
good chance the trailing segment has smaller actual size than both
gso_size as well as the MTU, which leads to an "atomic fragment". It is
considered harmful in RFC-8021. An Existing report from APNIC also shows
that atomic fragments are more likely to be dropped even it is
equivalent to a no-op [1].
Refactor __ip6_finish_output code to separate GSO and non-GSO packet
processing. It mirrors __ip_finish_output logic now. Add an extra check
in GSO handling to avoid atomic fragments. Lastly, drop dst_allfrag
check, which is no longer true since commit 9d289715eb5c ("ipv6: stop
sending PTB packets for MTU < 1280").
Link: https://www.potaroo.net/presentations/2022-03-01-ipv6-frag.pdf [1]
Fixes: b210de4f8c97 ("net: ipv6: Validate GSO SKB before finish IPv6 processing")
Suggested-by: Florian Westphal <fw@strlen.de>
Reported-by: David Wragg <dwragg@cloudflare.com>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
---
net/ipv6/ip6_output.c | 33 +++++++++++++++++++++++----------
1 file changed, 23 insertions(+), 10 deletions(-)
Comments
On Mon, Oct 16, 2023 at 1:23 PM Yan Zhai <yan@cloudflare.com> wrote: > > GSO packets can contain a trailing segment that is smaller than > gso_size. When examining the dst MTU for such packet, if its gso_size is > too large, then all segments would be fragmented. However, there is a > good chance the trailing segment has smaller actual size than both > gso_size as well as the MTU, which leads to an "atomic fragment". It is > considered harmful in RFC-8021. An Existing report from APNIC also shows > that atomic fragments are more likely to be dropped even it is > equivalent to a no-op [1]. > > Refactor __ip6_finish_output code to separate GSO and non-GSO packet > processing. It mirrors __ip_finish_output logic now. Add an extra check > in GSO handling to avoid atomic fragments. Lastly, drop dst_allfrag > check, which is no longer true since commit 9d289715eb5c ("ipv6: stop > sending PTB packets for MTU < 1280"). > > Link: https://www.potaroo.net/presentations/2022-03-01-ipv6-frag.pdf [1] > Fixes: b210de4f8c97 ("net: ipv6: Validate GSO SKB before finish IPv6 processing") > Suggested-by: Florian Westphal <fw@strlen.de> > Reported-by: David Wragg <dwragg@cloudflare.com> > Signed-off-by: Yan Zhai <yan@cloudflare.com> > --- Forgot to add v1 thread: https://lore.kernel.org/lkml/20231002171146.GB9274@breakpoint.cc/. It was wrongly implemented though without considering max_frag_size for non-GSO packets though, so not really useful in fact. > net/ipv6/ip6_output.c | 33 +++++++++++++++++++++++---------- > 1 file changed, 23 insertions(+), 10 deletions(-) > > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c > index a471c7e91761..1de6f3c11655 100644 > --- a/net/ipv6/ip6_output.c > +++ b/net/ipv6/ip6_output.c > @@ -162,7 +162,14 @@ ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk, > int err; > > skb_mark_not_on_list(segs); > - err = ip6_fragment(net, sk, segs, ip6_finish_output2); > + /* Last gso segment might be smaller than actual MTU. Adding > + * a fragment header to it would produce an "atomic fragment", > + * which is considered harmful (RFC-8021) > + */ > + err = segs->len > mtu ? > + ip6_fragment(net, sk, segs, ip6_finish_output2) : > + ip6_finish_output2(net, sk, segs); > + > if (err && ret == 0) > ret = err; > } > @@ -170,10 +177,19 @@ ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk, > return ret; > } > > +static int ip6_finish_output_gso(struct net *net, struct sock *sk, > + struct sk_buff *skb, unsigned int mtu) > +{ > + if (!(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) && > + !skb_gso_validate_network_len(skb, mtu)) > + return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu); > + > + return ip6_finish_output2(net, sk, skb); > +} > + > static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb) > { > unsigned int mtu; > - > #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM) > /* Policy lookup after SNAT yielded a new policy */ > if (skb_dst(skb)->xfrm) { > @@ -183,17 +199,14 @@ static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff > #endif > > mtu = ip6_skb_dst_mtu(skb); > - if (skb_is_gso(skb) && > - !(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) && > - !skb_gso_validate_network_len(skb, mtu)) > - return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu); > + if (skb_is_gso(skb)) > + return ip6_finish_output_gso(net, sk, skb, mtu); > > - if ((skb->len > mtu && !skb_is_gso(skb)) || > - dst_allfrag(skb_dst(skb)) || > + if (skb->len > mtu || > (IP6CB(skb)->frag_max_size && skb->len > IP6CB(skb)->frag_max_size)) > return ip6_fragment(net, sk, skb, ip6_finish_output2); > - else > - return ip6_finish_output2(net, sk, skb); > + > + return ip6_finish_output2(net, sk, skb); > } > > static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb) > -- > 2.30.2 >
On Mon, 2023-10-16 at 11:23 -0700, Yan Zhai wrote: > GSO packets can contain a trailing segment that is smaller than > gso_size. When examining the dst MTU for such packet, if its gso_size is > too large, then all segments would be fragmented. However, there is a > good chance the trailing segment has smaller actual size than both > gso_size as well as the MTU, which leads to an "atomic fragment". It is > considered harmful in RFC-8021. An Existing report from APNIC also shows > that atomic fragments are more likely to be dropped even it is > equivalent to a no-op [1]. > > Refactor __ip6_finish_output code to separate GSO and non-GSO packet > processing. It mirrors __ip_finish_output logic now. Add an extra check > in GSO handling to avoid atomic fragments. Lastly, drop dst_allfrag > check, which is no longer true since commit 9d289715eb5c ("ipv6: stop > sending PTB packets for MTU < 1280"). > > Link: https://www.potaroo.net/presentations/2022-03-01-ipv6-frag.pdf [1] > Fixes: b210de4f8c97 ("net: ipv6: Validate GSO SKB before finish IPv6 processing") > Suggested-by: Florian Westphal <fw@strlen.de> > Reported-by: David Wragg <dwragg@cloudflare.com> > Signed-off-by: Yan Zhai <yan@cloudflare.com> > --- > net/ipv6/ip6_output.c | 33 +++++++++++++++++++++++---------- > 1 file changed, 23 insertions(+), 10 deletions(-) > > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c > index a471c7e91761..1de6f3c11655 100644 > --- a/net/ipv6/ip6_output.c > +++ b/net/ipv6/ip6_output.c > @@ -162,7 +162,14 @@ ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk, > int err; > > skb_mark_not_on_list(segs); > - err = ip6_fragment(net, sk, segs, ip6_finish_output2); > + /* Last gso segment might be smaller than actual MTU. Adding > + * a fragment header to it would produce an "atomic fragment", > + * which is considered harmful (RFC-8021) > + */ > + err = segs->len > mtu ? > + ip6_fragment(net, sk, segs, ip6_finish_output2) : > + ip6_finish_output2(net, sk, segs); > + > if (err && ret == 0) > ret = err; > } > @@ -170,10 +177,19 @@ ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk, > return ret; > } > > +static int ip6_finish_output_gso(struct net *net, struct sock *sk, > + struct sk_buff *skb, unsigned int mtu) > +{ > + if (!(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) && > + !skb_gso_validate_network_len(skb, mtu)) > + return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu); If we are sending fakejumbo or have a frame that doesn't pass the muster it is just going immediately to ip6_finish_output. I think the checks that you removed are needed to keep the socket from getting stuck sending frames that will probably be discarded. > + > + return ip6_finish_output2(net, sk, skb); > +} > + > static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb) > { > unsigned int mtu; > - This blank line can probably be left there to separate variable declarations from code. > #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM) > /* Policy lookup after SNAT yielded a new policy */ > if (skb_dst(skb)->xfrm) { > @@ -183,17 +199,14 @@ static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff > #endif > > mtu = ip6_skb_dst_mtu(skb); > - if (skb_is_gso(skb) && > - !(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) && > - !skb_gso_validate_network_len(skb, mtu)) > - return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu); > + if (skb_is_gso(skb)) > + return ip6_finish_output_gso(net, sk, skb, mtu); > > - if ((skb->len > mtu && !skb_is_gso(skb)) || > - dst_allfrag(skb_dst(skb)) || > + if (skb->len > mtu || This change looks a bit too aggressive to me. Basically if the frame is gso you now bypass the ip6_fragment entirely and are ignoring the dst_allfrag and frag_max_size case below. See the fail_toobig code in ip6_fragment. > (IP6CB(skb)->frag_max_size && skb->len > IP6CB(skb)->frag_max_size)) > return ip6_fragment(net, sk, skb, ip6_finish_output2); > - else > - return ip6_finish_output2(net, sk, skb); > + > + return ip6_finish_output2(net, sk, skb); > } > > static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
On Mon, Oct 16, 2023 at 4:00 PM Alexander H Duyck <alexander.duyck@gmail.com> wrote: > > On Mon, 2023-10-16 at 11:23 -0700, Yan Zhai wrote: > > GSO packets can contain a trailing segment that is smaller than > > gso_size. When examining the dst MTU for such packet, if its gso_size is > > too large, then all segments would be fragmented. However, there is a > > good chance the trailing segment has smaller actual size than both > > gso_size as well as the MTU, which leads to an "atomic fragment". It is > > considered harmful in RFC-8021. An Existing report from APNIC also shows > > that atomic fragments are more likely to be dropped even it is > > equivalent to a no-op [1]. > > > > Refactor __ip6_finish_output code to separate GSO and non-GSO packet > > processing. It mirrors __ip_finish_output logic now. Add an extra check > > in GSO handling to avoid atomic fragments. Lastly, drop dst_allfrag > > check, which is no longer true since commit 9d289715eb5c ("ipv6: stop > > sending PTB packets for MTU < 1280"). > > > > Link: https://www.potaroo.net/presentations/2022-03-01-ipv6-frag.pdf [1] > > Fixes: b210de4f8c97 ("net: ipv6: Validate GSO SKB before finish IPv6 processing") > > Suggested-by: Florian Westphal <fw@strlen.de> > > Reported-by: David Wragg <dwragg@cloudflare.com> > > Signed-off-by: Yan Zhai <yan@cloudflare.com> > > --- > > net/ipv6/ip6_output.c | 33 +++++++++++++++++++++++---------- > > 1 file changed, 23 insertions(+), 10 deletions(-) > > > > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c > > index a471c7e91761..1de6f3c11655 100644 > > --- a/net/ipv6/ip6_output.c > > +++ b/net/ipv6/ip6_output.c > > @@ -162,7 +162,14 @@ ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk, > > int err; > > > > skb_mark_not_on_list(segs); > > - err = ip6_fragment(net, sk, segs, ip6_finish_output2); > > + /* Last gso segment might be smaller than actual MTU. Adding > > + * a fragment header to it would produce an "atomic fragment", > > + * which is considered harmful (RFC-8021) > > + */ > > + err = segs->len > mtu ? > > + ip6_fragment(net, sk, segs, ip6_finish_output2) : > > + ip6_finish_output2(net, sk, segs); > > + > > if (err && ret == 0) > > ret = err; > > } > > @@ -170,10 +177,19 @@ ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk, > > return ret; > > } > > > > +static int ip6_finish_output_gso(struct net *net, struct sock *sk, > > + struct sk_buff *skb, unsigned int mtu) > > +{ > > + if (!(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) && > > + !skb_gso_validate_network_len(skb, mtu)) > > + return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu); > > If we are sending fakejumbo or have a frame that doesn't pass the > muster it is just going immediately to ip6_finish_output. I think the > checks that you removed are needed to keep the socket from getting > stuck sending frames that will probably be discarded. > Hi Alexander, Thanks for the feedback! But I am not sure I follow the situation you mentioned here. If it is a fake jumbo but non GSO packet, it won't enter ip6_finish_output_gso. What I am really skipping are the dst_allfrag and frag_max_size checks on GSO packets, and dst_allfrag on non-GSO packets. As to dst_allfrag, I looked back at the case when this was added: https://www.mail-archive.com/bk-commits-head@vger.kernel.org/msg03399.html The actual feature was set only when a PMTU message carries a value smaller than 1280 byte. But the main line kernel just drops such messages now since the commit I pointed to in the change log (which makes sense because the feature was set based on old RFC-2460 guidelines, and those have been deprecated in RFC-8200). Iproute2 also doesn't expose this option as well. Is there any case that I am not aware of here that still relies on it? For frag_max_size, I might be wrong but to my best knowledge it only applies when netfilter defrags packets. However, when dealing with fragments, both local output and GRO code won't produce GSO packets in the first place. Similarly, if we look at IPv4 implementation, it also does not consider frag_max_size in GSO handling. So I intentionally skip this for GSO packets in the change. WDYT? > > + > > + return ip6_finish_output2(net, sk, skb); > > +} > > + > > static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb) > > { > > unsigned int mtu; > > - > > This blank line can probably be left there to separate variable > declarations from code. > my bad, should not have included it. I'll revise this. thanks Yan > > #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM) > > /* Policy lookup after SNAT yielded a new policy */ > > if (skb_dst(skb)->xfrm) { > > @@ -183,17 +199,14 @@ static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff > > #endif > > > > mtu = ip6_skb_dst_mtu(skb); > > - if (skb_is_gso(skb) && > > - !(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) && > > - !skb_gso_validate_network_len(skb, mtu)) > > - return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu); > > + if (skb_is_gso(skb)) > > + return ip6_finish_output_gso(net, sk, skb, mtu); > > > > - if ((skb->len > mtu && !skb_is_gso(skb)) || > > - dst_allfrag(skb_dst(skb)) || > > + if (skb->len > mtu || > > This change looks a bit too aggressive to me. Basically if the frame is > gso you now bypass the ip6_fragment entirely and are ignoring the > dst_allfrag and frag_max_size case below. See the fail_toobig code in > ip6_fragment. > > > (IP6CB(skb)->frag_max_size && skb->len > IP6CB(skb)->frag_max_size)) > > return ip6_fragment(net, sk, skb, ip6_finish_output2); > > - else > > - return ip6_finish_output2(net, sk, skb); > > + > > + return ip6_finish_output2(net, sk, skb); > > } > > > > static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb) >
On Mon, Oct 16, 2023 at 2:51 PM Yan Zhai <yan@cloudflare.com> wrote: > > On Mon, Oct 16, 2023 at 4:00 PM Alexander H Duyck > <alexander.duyck@gmail.com> wrote: > > > > On Mon, 2023-10-16 at 11:23 -0700, Yan Zhai wrote: > > > GSO packets can contain a trailing segment that is smaller than > > > gso_size. When examining the dst MTU for such packet, if its gso_size is > > > too large, then all segments would be fragmented. However, there is a > > > good chance the trailing segment has smaller actual size than both > > > gso_size as well as the MTU, which leads to an "atomic fragment". It is > > > considered harmful in RFC-8021. An Existing report from APNIC also shows > > > that atomic fragments are more likely to be dropped even it is > > > equivalent to a no-op [1]. > > > > > > Refactor __ip6_finish_output code to separate GSO and non-GSO packet > > > processing. It mirrors __ip_finish_output logic now. Add an extra check > > > in GSO handling to avoid atomic fragments. Lastly, drop dst_allfrag > > > check, which is no longer true since commit 9d289715eb5c ("ipv6: stop > > > sending PTB packets for MTU < 1280"). > > > > > > Link: https://www.potaroo.net/presentations/2022-03-01-ipv6-frag.pdf [1] > > > Fixes: b210de4f8c97 ("net: ipv6: Validate GSO SKB before finish IPv6 processing") > > > Suggested-by: Florian Westphal <fw@strlen.de> > > > Reported-by: David Wragg <dwragg@cloudflare.com> > > > Signed-off-by: Yan Zhai <yan@cloudflare.com> > > > --- > > > net/ipv6/ip6_output.c | 33 +++++++++++++++++++++++---------- > > > 1 file changed, 23 insertions(+), 10 deletions(-) > > > > > > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c > > > index a471c7e91761..1de6f3c11655 100644 > > > --- a/net/ipv6/ip6_output.c > > > +++ b/net/ipv6/ip6_output.c > > > @@ -162,7 +162,14 @@ ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk, > > > int err; > > > > > > skb_mark_not_on_list(segs); > > > - err = ip6_fragment(net, sk, segs, ip6_finish_output2); > > > + /* Last gso segment might be smaller than actual MTU. Adding > > > + * a fragment header to it would produce an "atomic fragment", > > > + * which is considered harmful (RFC-8021) > > > + */ > > > + err = segs->len > mtu ? > > > + ip6_fragment(net, sk, segs, ip6_finish_output2) : > > > + ip6_finish_output2(net, sk, segs); > > > + > > > if (err && ret == 0) > > > ret = err; > > > } > > > @@ -170,10 +177,19 @@ ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk, > > > return ret; > > > } > > > > > > +static int ip6_finish_output_gso(struct net *net, struct sock *sk, > > > + struct sk_buff *skb, unsigned int mtu) > > > +{ > > > + if (!(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) && > > > + !skb_gso_validate_network_len(skb, mtu)) > > > + return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu); > > > > If we are sending fakejumbo or have a frame that doesn't pass the > > muster it is just going immediately to ip6_finish_output. I think the > > checks that you removed are needed to keep the socket from getting > > stuck sending frames that will probably be discarded. > > > > Hi Alexander, > > Thanks for the feedback! But I am not sure I follow the situation you > mentioned here. If it is a fake jumbo but non GSO packet, it won't > enter ip6_finish_output_gso. What I am really skipping are the > dst_allfrag and frag_max_size checks on GSO packets, and dst_allfrag > on non-GSO packets. > > As to dst_allfrag, I looked back at the case when this was added: > > https://www.mail-archive.com/bk-commits-head@vger.kernel.org/msg03399.html > > The actual feature was set only when a PMTU message carries a value > smaller than 1280 byte. But the main line kernel just drops such > messages now since the commit I pointed to in the change log (which > makes sense because the feature was set based on old RFC-2460 > guidelines, and those have been deprecated in RFC-8200). Iproute2 also > doesn't expose this option as well. Is there any case that I am not > aware of here that still relies on it? > > For frag_max_size, I might be wrong but to my best knowledge it only > applies when netfilter defrags packets. However, when dealing with > fragments, both local output and GRO code won't produce GSO packets in > the first place. Similarly, if we look at IPv4 implementation, it also > does not consider frag_max_size in GSO handling. So I intentionally > skip this for GSO packets in the change. WDYT? I am not certain. Just looking at the code it seems like there were a number of corner cases handled that this is getting rid of the code for. Specifically my main concern is GSO being enabled for a path where the MTU is incorrect due to something such as a tunnel being between the system and the endpoint. In such a case it would normally send back an ICMP message triggering a path MTU update which would then have to ripple through. I'm not an IPv6 expert though so perhaps I will leave that for somebody else to provide feedback on.
Yan Zhai <yan@cloudflare.com> wrote: > Refactor __ip6_finish_output code to separate GSO and non-GSO packet > processing. It mirrors __ip_finish_output logic now. Add an extra check > in GSO handling to avoid atomic fragments. Lastly, drop dst_allfrag > check, which is no longer true since commit 9d289715eb5c ("ipv6: stop > sending PTB packets for MTU < 1280"). > - if ((skb->len > mtu && !skb_is_gso(skb)) || > - dst_allfrag(skb_dst(skb)) || My preference is to first remove dst_allfrag, i.e. do this in a separate change.
On Tue, Oct 17, 2023 at 3:02 PM Florian Westphal <fw@strlen.de> wrote: > > Yan Zhai <yan@cloudflare.com> wrote: > > Refactor __ip6_finish_output code to separate GSO and non-GSO packet > > processing. It mirrors __ip_finish_output logic now. Add an extra check > > in GSO handling to avoid atomic fragments. Lastly, drop dst_allfrag > > check, which is no longer true since commit 9d289715eb5c ("ipv6: stop > > sending PTB packets for MTU < 1280"). > > > > - if ((skb->len > mtu && !skb_is_gso(skb)) || > > - dst_allfrag(skb_dst(skb)) || > > My preference is to first remove dst_allfrag, i.e. do this in > a separate change. You mean completely removing all dst_allfrag references and related stuff such like IP cork flags/socket flags? I was debating, it might be cleaner that way but it does not fit so well with the subject of this patch. I can open a new patchset to clean that up separately. For this one, I guess I can keep dst_allfrag for now and come back with a V3. Does that sound good to you? Yan
On Tue, Oct 17, 2023 at 9:42 PM Yan Zhai <yan@cloudflare.com> wrote: > > On Tue, Oct 17, 2023 at 3:02 PM Florian Westphal <fw@strlen.de> wrote: > > > > Yan Zhai <yan@cloudflare.com> wrote: > > > Refactor __ip6_finish_output code to separate GSO and non-GSO packet > > > processing. It mirrors __ip_finish_output logic now. Add an extra check > > > in GSO handling to avoid atomic fragments. Lastly, drop dst_allfrag > > > check, which is no longer true since commit 9d289715eb5c ("ipv6: stop > > > sending PTB packets for MTU < 1280"). > > > > > > > - if ((skb->len > mtu && !skb_is_gso(skb)) || > > > - dst_allfrag(skb_dst(skb)) || > > > > My preference is to first remove dst_allfrag, i.e. do this in > > a separate change. > > You mean completely removing all dst_allfrag references and related > stuff such like IP cork flags/socket flags? I was debating, it might > be cleaner that way but it does not fit so well with the subject of > this patch. I can open a new patchset to clean that up separately. For > this one, I guess I can keep dst_allfrag for now and come back with a > V3. Does that sound good to you? The second paragraph in the commit message really makes clear that this combines three changes in one patch. Of which the largest one in terms of code churn is supposed to be a NOOP. Separating into three patches will make all three more clear. They can be pushed as one series, conceivably.
On Tue, Oct 17, 2023 at 8:58 PM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > > The second paragraph in the commit message really makes > clear that this combines three changes in one patch. Of which > the largest one in terms of code churn is supposed to be a > NOOP. > > Separating into three patches will make all three more clear. > They can be pushed as one series, conceivably. Thanks for clarifying. In that case I am just gonna remove dst_allfrag in ip6_finish_output rather than everywhere for the series. Remaining cleanup can come later then. In fact there were some past considerations already on this: https://lkml.kernel.org/netdev/1335348157.3274.30.camel@edumazet-glaptop/ Could be a good base to work on later. Yan
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index a471c7e91761..1de6f3c11655 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -162,7 +162,14 @@ ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk, int err; skb_mark_not_on_list(segs); - err = ip6_fragment(net, sk, segs, ip6_finish_output2); + /* Last gso segment might be smaller than actual MTU. Adding + * a fragment header to it would produce an "atomic fragment", + * which is considered harmful (RFC-8021) + */ + err = segs->len > mtu ? + ip6_fragment(net, sk, segs, ip6_finish_output2) : + ip6_finish_output2(net, sk, segs); + if (err && ret == 0) ret = err; } @@ -170,10 +177,19 @@ ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk, return ret; } +static int ip6_finish_output_gso(struct net *net, struct sock *sk, + struct sk_buff *skb, unsigned int mtu) +{ + if (!(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) && + !skb_gso_validate_network_len(skb, mtu)) + return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu); + + return ip6_finish_output2(net, sk, skb); +} + static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb) { unsigned int mtu; - #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM) /* Policy lookup after SNAT yielded a new policy */ if (skb_dst(skb)->xfrm) { @@ -183,17 +199,14 @@ static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff #endif mtu = ip6_skb_dst_mtu(skb); - if (skb_is_gso(skb) && - !(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) && - !skb_gso_validate_network_len(skb, mtu)) - return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu); + if (skb_is_gso(skb)) + return ip6_finish_output_gso(net, sk, skb, mtu); - if ((skb->len > mtu && !skb_is_gso(skb)) || - dst_allfrag(skb_dst(skb)) || + if (skb->len > mtu || (IP6CB(skb)->frag_max_size && skb->len > IP6CB(skb)->frag_max_size)) return ip6_fragment(net, sk, skb, ip6_finish_output2); - else - return ip6_finish_output2(net, sk, skb); + + return ip6_finish_output2(net, sk, skb); } static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)