Message ID | 20230711043453.64095-1-ivan@cloudflare.com |
---|---|
State | New |
Headers |
From: Ivan Babrou <ivan@cloudflare.com>
To: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, kernel-team@cloudflare.com, Ivan Babrou <ivan@cloudflare.com>, Eric Dumazet <edumazet@google.com>, "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Steven Rostedt <rostedt@goodmis.org>, Masami Hiramatsu <mhiramat@kernel.org>, David Ahern <dsahern@kernel.org>
Subject: [RFC PATCH net-next] tcp: add a tracepoint for tcp_listen_queue_drop
Date: Mon, 10 Jul 2023 21:34:52 -0700
Message-ID: <20230711043453.64095-1-ivan@cloudflare.com>
X-Mailer: git-send-email 2.41.0
MIME-Version: 1.0 |
Series | [RFC,net-next] tcp: add a tracepoint for tcp_listen_queue_drop |
Commit Message
Ivan Babrou
July 11, 2023, 4:34 a.m. UTC
There's already a way to count the overall numbers of queue overflows:
$ sudo netstat -s | grep 'listen queue'
4 times the listen queue of a socket overflowed
However, it's too coarse for monitoring and alerting when a user wants to
track errors per socket and route alerts to people responsible for those
sockets directly. For UDP there's udp_fail_queue_rcv_skb, which fills
a similar need for UDP sockets. This patch adds a TCP equivalent.
--
The goal is to use this new tracepoint with ebpf_exporter:
* https://github.com/cloudflare/ebpf_exporter
There's an example configuration for UDP drops there that we use:
* https://github.com/cloudflare/ebpf_exporter/blob/master/examples/udp-drops.bpf.c
* https://github.com/cloudflare/ebpf_exporter/blob/master/examples/udp-drops.yaml
Paolo Abeni asked whether we need the UDP tracepoint given that kfree_skb
and MIB counters already exist, and I covered that part in my reply here:
* https://lore.kernel.org/netdev/CABWYdi3DVex0wq2kM72QTZkhNzkh_Vjb4+T8mj8U7t06Na=5kA@mail.gmail.com/
I added a TCP example utilizing this patch here:
* https://github.com/cloudflare/ebpf_exporter/pull/221
Not so long ago we hit a bug in one of our services that broke its accept
loop, which resulted in the listen queue overflowing. With this new
tracepoint we can have a metric for this and alert the service owners
directly, cutting out the SRE middleman and improving alert fidelity.
We don't really need a tracepoint for this, just a place to hook a kprobe
or an fprobe to. The existing tcp_listendrop is great for this, except
it's a short inlined function, so there's no way to attach a probe to it.
There are a few ways to approach this:
* Un-inline tcp_listendrop to allow probe attachment
* Un-inline tcp_listendrop and add a stable tracepoint
* Keep tcp_listendrop inlined, but add an out-of-line tracepoint wrapper for it to call into
There is no option to keep tcp_listendrop inlined and call into the tracepoint
directly from it (it does not compile, and it wouldn't be nice if it did):
* https://docs.kernel.org/trace/tracepoints.html
Therefore I went with the third option, which this patch implements.
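In outline, the chosen approach looks like this (condensed from the patch at the end of this thread; the static-key check stays in the inlined helper and the actual tracepoint call moves into a .c file):

/* include/net/tcp.h: only a lightweight static-key check in the header */
DECLARE_TRACEPOINT(tcp_listen_queue_drop);
void do_trace_tcp_listen_queue_drop_wrapper(const struct sock *sk);

static inline void tcp_listendrop(const struct sock *sk)
{
        atomic_inc(&((struct sock *)sk)->sk_drops);
        __NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENDROPS);
        if (tracepoint_enabled(tcp_listen_queue_drop))
                do_trace_tcp_listen_queue_drop_wrapper(sk);
}

/* net/ipv4/tcp.c: the wrapper can include <trace/events/tcp.h> and call
 * the tracepoint directly, which the inline header function cannot do */
void do_trace_tcp_listen_queue_drop_wrapper(const struct sock *sk)
{
        trace_tcp_listen_queue_drop(sk);
}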
Example output from perf:
$ sudo perf trace -a -e tcp:tcp_listen_queue_drop
0.000 sockfull/5459 tcp:tcp_listen_queue_drop(skaddr: 0xffffff90d7a25580, sport: 12345, saddr: 0x7faa1aed26, saddr_v6: 0x7faa1aed2a, sk_max_ack_backlog: 128, sk_ack_backlog: 129)
Example extracting the local port with bpftrace:
$ sudo ~/projects/bpftrace/src/bpftrace -e 'rawtracepoint:tcp_listen_queue_drop { $sk = (struct sock *) arg0; $lport = $sk->__sk_common.skc_num; printf("drop on lport = %d\n", $lport); }'
Attaching 1 probe...
drop on lport = 12345
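Beyond one-off tools like bpftrace, a small libbpf-style program can aggregate drops per local port entirely in the kernel. The sketch below is illustrative only and is not the ebpf_exporter code linked above: the map, program and section names are made up, and it assumes a BTF-generated vmlinux.h:

// SPDX-License-Identifier: GPL-2.0
/* Count listen queue drops per local port (illustrative sketch). */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, u16);   /* local port, host byte order */
        __type(value, u64); /* number of drops */
} listen_drops SEC(".maps");

SEC("raw_tracepoint/tcp_listen_queue_drop")
int count_listen_drops(struct bpf_raw_tracepoint_args *ctx)
{
        /* arg0 of the tracepoint is the listening socket */
        struct sock *sk = (struct sock *)ctx->args[0];
        /* skc_num is already stored in host byte order */
        u16 lport = BPF_CORE_READ(sk, __sk_common.skc_num);
        u64 *val, one = 1;

        val = bpf_map_lookup_elem(&listen_drops, &lport);
        if (val)
                __sync_fetch_and_add(val, 1);
        else
                bpf_map_update_elem(&listen_drops, &lport, &one, BPF_NOEXIST);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";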
Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
---
include/net/tcp.h | 7 ++++++
include/trace/events/tcp.h | 46 ++++++++++++++++++++++++++++++++++++++
net/ipv4/tcp.c | 7 ++++++
3 files changed, 60 insertions(+)
Comments
On Mon, 10 Jul 2023 21:34:52 -0700 Ivan Babrou wrote:
> There's already a way to count the overall numbers of queue overflows:
>
>   $ sudo netstat -s | grep 'listen queue'
>       4 times the listen queue of a socket overflowed
>
> However, it's too coarse for monitoring and alerting when a user wants to
> track errors per socket and route alerts to people responsible for those
> sockets directly. For UDP there's udp_fail_queue_rcv_skb, which fills
> a similar need for UDP sockets. This patch adds a TCP equivalent.

Makes me want to revert your recent UDP tracepoint to be honest :(
We can play whack a mole like this. You said that kfree_skb fires
too often, why is that? Maybe it's an issue of someone using
kfree_skb() when they should be using consume_skb() ?
On Tue, Jul 11, 2023 at 9:36 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 10 Jul 2023 21:34:52 -0700 Ivan Babrou wrote:
> > There's already a way to count the overall numbers of queue overflows:
> >
> >   $ sudo netstat -s | grep 'listen queue'
> >       4 times the listen queue of a socket overflowed
> >
> > However, it's too coarse for monitoring and alerting when a user wants to
> > track errors per socket and route alerts to people responsible for those
> > sockets directly. For UDP there's udp_fail_queue_rcv_skb, which fills
> > a similar need for UDP sockets. This patch adds a TCP equivalent.
>
> Makes me want to revert your recent UDP tracepoint to be honest :(
> We can play whack a mole like this. You said that kfree_skb fires
> too often, why is that? Maybe it's an issue of someone using
> kfree_skb() when they should be using consume_skb() ?

Hi Jakub,

The issue with kfree_skb is not that it fires too frequently (not in
the 6.x kernel now). Rather, it is unable to locate the socket info
when a SYN is dropped due to the accept queue being full. The sk is
stolen upon inet lookup, e.g. in tcp_v4_rcv. This makes it unable to
tell in kfree_skb which socket a SYN skb is targeting (when TPROXY or
socket lookup are used). A tracepoint with sk information will be more
useful to monitor accurately which service/socket is involved.

--
Yan
On Wed, 12 Jul 2023 11:42:26 -0500 Yan Zhai wrote:
> The issue with kfree_skb is not that it fires too frequently (not in
> the 6.x kernel now). Rather, it is unable to locate the socket info
> when a SYN is dropped due to the accept queue being full. The sk is
> stolen upon inet lookup, e.g. in tcp_v4_rcv. This makes it unable to
> tell in kfree_skb which socket a SYN skb is targeting (when TPROXY or
> socket lookup are used). A tracepoint with sk information will be more
> useful to monitor accurately which service/socket is involved.

No doubt that kfree_skb isn't going to solve all our needs, but I'd
really like you to clean up the unnecessary callers on your systems
first, before adding further tracepoints. That way we'll have a clear
picture of which points can be solved by kfree_skb and where we need
further work.
On Wed, Jul 12, 2023 at 12:42 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 12 Jul 2023 11:42:26 -0500 Yan Zhai wrote:
> > The issue with kfree_skb is not that it fires too frequently (not in
> > the 6.x kernel now). Rather, it is unable to locate the socket info
> > when a SYN is dropped due to the accept queue being full. The sk is
> > stolen upon inet lookup, e.g. in tcp_v4_rcv. This makes it unable to
> > tell in kfree_skb which socket a SYN skb is targeting (when TPROXY or
> > socket lookup are used). A tracepoint with sk information will be more
> > useful to monitor accurately which service/socket is involved.
>
> No doubt that kfree_skb isn't going to solve all our needs, but I'd
> really like you to clean up the unnecessary callers on your systems
> first, before adding further tracepoints. That way we'll have a clear
> picture of which points can be solved by kfree_skb and where we need
> further work.

Those are not unnecessary calls, e.g. a lot of those kfree_skb come
from iptables drops, tcp validation, ttl expires, etc. On a moderately
loaded server, it is called at a rate of ~10k/sec, which isn't
terribly awful given that we absorb millions of attack packets at each
data center. We used to have many consume skb noises at this trace
point with older versions of kernels, but those have gone ever since
the better separation between consume and drop. That said, accessing
sk information is still the critical piece to address our use case.
Is there any other possible way that we can get this information at
the accept queue overflow time?

--
Yan
On Wed, 12 Jul 2023 21:43:32 -0500 Yan Zhai wrote:
> Those are not unnecessary calls, e.g. a lot of those kfree_skb come
> from iptables drops, tcp validation, ttl expires, etc. On a moderately
> loaded server, it is called at a rate of ~10k/sec, which isn't
> terribly awful given that we absorb millions of attack packets at each
> data center. We used to have many consume skb noises at this trace
> point with older versions of kernels, but those have gone ever since
> the better separation between consume and drop.

I was hoping you can break them down by category. Specifically what
I'm wondering is whether we should also have a separation between
policy / "firewall drops" and error / exception drops. Within the skb
drop reason codes, I mean.
On Wed, Jul 12, 2023 at 10:42 AM Jakub Kicinski <kuba@kernel.org> wrote: > > On Wed, 12 Jul 2023 11:42:26 -0500 Yan Zhai wrote: > > The issue with kfree_skb is not that it fires too frequently (not in > > the 6.x kernel now). Rather, it is unable to locate the socket info > > when a SYN is dropped due to the accept queue being full. The sk is > > stolen upon inet lookup, e.g. in tcp_v4_rcv. This makes it unable to > > tell in kfree_skb which socket a SYN skb is targeting (when TPROXY or > > socket lookup are used). A tracepoint with sk information will be more > > useful to monitor accurately which service/socket is involved. > > No doubt that kfree_skb isn't going to solve all our needs, but I'd > really like you to clean up the unnecessary callers on your systems > first, before adding further tracepoints. That way we'll have a clear > picture of which points can be solved by kfree_skb and where we need > further work. The existing UDP tracepoint was there for 12 years and it's a part of what kernel exposes to userspace, so I don't think it's fair to remove this and break its consumers. I think "do not break userspace" applies here. The proposed TCP tracepoint mostly mirrors it, so I think it's fair to have it. I don't know why kfree_skb is called so much. I also don't agree with Yan that it's not actually too much, because it's a lot (especially compared with near zero for my proposed tracepoint). I can easily see 300-500k calls per second into it: $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10 # time counts unit events 1.000520165 10,108 skb:kfree_skb 2.010494526 11,178 skb:kfree_skb 3.075503743 10,770 skb:kfree_skb 4.122814843 11,334 skb:kfree_skb 5.128518432 12,020 skb:kfree_skb 6.176504094 11,117 skb:kfree_skb 7.201504214 12,753 skb:kfree_skb 8.229523643 10,566 skb:kfree_skb 9.326499044 365,239 skb:kfree_skb 10.002106098 313,105 skb:kfree_skb $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10 # time counts unit events 1.000767744 52,240 skb:kfree_skb 2.069762695 508,310 skb:kfree_skb 3.102763492 417,895 skb:kfree_skb 4.142757608 385,981 skb:kfree_skb 5.190759795 430,154 skb:kfree_skb 6.243765384 405,707 skb:kfree_skb 7.290818228 362,934 skb:kfree_skb 8.297764298 336,702 skb:kfree_skb 9.314287243 353,039 skb:kfree_skb 10.002288423 251,414 skb:kfree_skb Most of it is NOT_SPECIFIED (1s data from one CPU during a spike): $ perf script | sed 's/.*skbaddr=//' | awk '{ print $NF }' | sort | uniq -c | sort -n | tail 1 TCP_CLOSE 2 NO_SOCKET 4 TCP_INVALID_SEQUENCE 4 TCP_RESET 13 TCP_OLD_DATA 14 NETFILTER_DROP 4594 NOT_SPECIFIED We can start a separate discussion to break it down by category if it would help. Let me know what kind of information you would like us to provide to help with that. I assume you're interested in kernel stacks leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's something else. Even if I was only interested in one specific reason, I would still have to arm the whole tracepoint and route a ton of skbs I'm not interested in into my bpf code. This seems like a lot of overhead, especially if I'm dropping some attack packets. Perhaps a lot of extra NOT_SPECIFIED stuff can be fixed and removed from kfree_skb. It's not something I can personally do as it requires much deeper network code understanding than I possess. 
For TCP we'll also have to add some extra reasons for kfree_skb, because currently it's all NOT_SPECIFIED (no reason set in the accept path): * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_input.c#L6499 * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_ipv4.c#L1749 For UDP we already have SKB_DROP_REASON_SOCKET_RCVBUFF, so I tried my best to implement what I wanted based on that. It's not very approachable, as you'd have to extract the destination port yourself from the raw skb. As Yan said, for TCP people often rely on skb->sk, which is just not present when the incoming SYN is dropped. I failed to find a good example of extracting a destination port that I could replicate. So far I have just a per-reason breakdown working: * https://github.com/cloudflare/ebpf_exporter/pull/233 If you have an ebpf example that would help me extract the destination port from an skb in kfree_skb, I'd be interested in taking a look and trying to make it work. The need to extract the protocol level information in ebpf is only making kfree_skb more expensive for the needs of catching rare cases when we run out of buffer space (UDP) or listen queue (TCP). These two cases are very common failure scenarios that people are interested in catching with straightforward tracepoints that can give them the needed information easily and cheaply. I sympathize with the desire to keep the number of tracepoints in check, but I also feel like UDP buffer drops and TCP listen drops tracepoints are very much justified to exist.
On Thu, 13 Jul 2023 16:17:31 -0700 Ivan Babrou wrote: > On Wed, Jul 12, 2023 at 10:42 AM Jakub Kicinski <kuba@kernel.org> wrote: > > On Wed, 12 Jul 2023 11:42:26 -0500 Yan Zhai wrote: > > > The issue with kfree_skb is not that it fires too frequently (not in > > > the 6.x kernel now). Rather, it is unable to locate the socket info > > > when a SYN is dropped due to the accept queue being full. The sk is > > > stolen upon inet lookup, e.g. in tcp_v4_rcv. This makes it unable to > > > tell in kfree_skb which socket a SYN skb is targeting (when TPROXY or > > > socket lookup are used). A tracepoint with sk information will be more > > > useful to monitor accurately which service/socket is involved. > > > > No doubt that kfree_skb isn't going to solve all our needs, but I'd > > really like you to clean up the unnecessary callers on your systems > > first, before adding further tracepoints. That way we'll have a clear > > picture of which points can be solved by kfree_skb and where we need > > further work. > > The existing UDP tracepoint was there for 12 years and it's a part of > what kernel exposes to userspace, so I don't think it's fair to remove > this and break its consumers. I think "do not break userspace" applies > here. The proposed TCP tracepoint mostly mirrors it, so I think it's > fair to have it. > > I don't know why kfree_skb is called so much. I also don't agree with > Yan that it's not actually too much, because it's a lot (especially > compared with near zero for my proposed tracepoint). I can easily see > 300-500k calls per second into it: > > $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10 > # time counts unit events > 1.000520165 10,108 skb:kfree_skb > 2.010494526 11,178 skb:kfree_skb > 3.075503743 10,770 skb:kfree_skb > 4.122814843 11,334 skb:kfree_skb > 5.128518432 12,020 skb:kfree_skb > 6.176504094 11,117 skb:kfree_skb > 7.201504214 12,753 skb:kfree_skb > 8.229523643 10,566 skb:kfree_skb > 9.326499044 365,239 skb:kfree_skb > 10.002106098 313,105 skb:kfree_skb > $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10 > # time counts unit events > 1.000767744 52,240 skb:kfree_skb > 2.069762695 508,310 skb:kfree_skb > 3.102763492 417,895 skb:kfree_skb > 4.142757608 385,981 skb:kfree_skb > 5.190759795 430,154 skb:kfree_skb > 6.243765384 405,707 skb:kfree_skb > 7.290818228 362,934 skb:kfree_skb > 8.297764298 336,702 skb:kfree_skb > 9.314287243 353,039 skb:kfree_skb > 10.002288423 251,414 skb:kfree_skb > > Most of it is NOT_SPECIFIED (1s data from one CPU during a spike): > > $ perf script | sed 's/.*skbaddr=//' | awk '{ print $NF }' | sort | > uniq -c | sort -n | tail > 1 TCP_CLOSE > 2 NO_SOCKET > 4 TCP_INVALID_SEQUENCE > 4 TCP_RESET > 13 TCP_OLD_DATA > 14 NETFILTER_DROP > 4594 NOT_SPECIFIED > > We can start a separate discussion to break it down by category if it > would help. Let me know what kind of information you would like us to > provide to help with that. I assume you're interested in kernel stacks > leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's > something else. Just the stacks. > Even if I was only interested in one specific reason, I would still > have to arm the whole tracepoint and route a ton of skbs I'm not > interested in into my bpf code. This seems like a lot of overhead, > especially if I'm dropping some attack packets. That's what I meant with my drop vs exception comment. We already have two tracepoints on the skb free path (free and consume), adding another shouldn't rise too many eyebrows. 
> Perhaps a lot of extra NOT_SPECIFIED stuff can be fixed and removed > from kfree_skb. It's not something I can personally do as it requires > much deeper network code understanding than I possess. For TCP we'll > also have to add some extra reasons for kfree_skb, because currently > it's all NOT_SPECIFIED (no reason set in the accept path): > > * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_input.c#L6499 > * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_ipv4.c#L1749 > > For UDP we already have SKB_DROP_REASON_SOCKET_RCVBUFF, so I tried my > best to implement what I wanted based on that. It's not very > approachable, as you'd have to extract the destination port yourself > from the raw skb. As Yan said, for TCP people often rely on skb->sk, > which is just not present when the incoming SYN is dropped. I failed > to find a good example of extracting a destination port that I could > replicate. So far I have just a per-reason breakdown working: > > * https://github.com/cloudflare/ebpf_exporter/pull/233 > > If you have an ebpf example that would help me extract the destination > port from an skb in kfree_skb, I'd be interested in taking a look and > trying to make it work. > > The need to extract the protocol level information in ebpf is only > making kfree_skb more expensive for the needs of catching rare cases > when we run out of buffer space (UDP) or listen queue (TCP). These two > cases are very common failure scenarios that people are interested in > catching with straightforward tracepoints that can give them the > needed information easily and cheaply. > > I sympathize with the desire to keep the number of tracepoints in > check, but I also feel like UDP buffer drops and TCP listen drops > tracepoints are very much justified to exist. I'm not completely opposed to the tracepoints where needed. It's more of trying to make sure we do due diligence on the existing solutions. Or maybe not even due diligence as much as pay off some technical debt.
On 7/13/23 5:17 PM, Ivan Babrou wrote: > On Wed, Jul 12, 2023 at 10:42 AM Jakub Kicinski <kuba@kernel.org> wrote: >> >> On Wed, 12 Jul 2023 11:42:26 -0500 Yan Zhai wrote: >>> The issue with kfree_skb is not that it fires too frequently (not in >>> the 6.x kernel now). Rather, it is unable to locate the socket info >>> when a SYN is dropped due to the accept queue being full. The sk is >>> stolen upon inet lookup, e.g. in tcp_v4_rcv. This makes it unable to >>> tell in kfree_skb which socket a SYN skb is targeting (when TPROXY or >>> socket lookup are used). A tracepoint with sk information will be more >>> useful to monitor accurately which service/socket is involved. >> >> No doubt that kfree_skb isn't going to solve all our needs, but I'd >> really like you to clean up the unnecessary callers on your systems >> first, before adding further tracepoints. That way we'll have a clear >> picture of which points can be solved by kfree_skb and where we need >> further work. > > The existing UDP tracepoint was there for 12 years and it's a part of > what kernel exposes to userspace, so I don't think it's fair to remove > this and break its consumers. I think "do not break userspace" applies > here. The proposed TCP tracepoint mostly mirrors it, so I think it's > fair to have it. > > I don't know why kfree_skb is called so much. I also don't agree with > Yan that it's not actually too much, because it's a lot (especially > compared with near zero for my proposed tracepoint). I can easily see > 300-500k calls per second into it: > > $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10 > # time counts unit events > 1.000520165 10,108 skb:kfree_skb > 2.010494526 11,178 skb:kfree_skb > 3.075503743 10,770 skb:kfree_skb > 4.122814843 11,334 skb:kfree_skb > 5.128518432 12,020 skb:kfree_skb > 6.176504094 11,117 skb:kfree_skb > 7.201504214 12,753 skb:kfree_skb > 8.229523643 10,566 skb:kfree_skb > 9.326499044 365,239 skb:kfree_skb > 10.002106098 313,105 skb:kfree_skb > $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10 > # time counts unit events > 1.000767744 52,240 skb:kfree_skb > 2.069762695 508,310 skb:kfree_skb > 3.102763492 417,895 skb:kfree_skb > 4.142757608 385,981 skb:kfree_skb > 5.190759795 430,154 skb:kfree_skb > 6.243765384 405,707 skb:kfree_skb > 7.290818228 362,934 skb:kfree_skb > 8.297764298 336,702 skb:kfree_skb > 9.314287243 353,039 skb:kfree_skb > 10.002288423 251,414 skb:kfree_skb > > Most of it is NOT_SPECIFIED (1s data from one CPU during a spike): > > $ perf script | sed 's/.*skbaddr=//' | awk '{ print $NF }' | sort | > uniq -c | sort -n | tail > 1 TCP_CLOSE > 2 NO_SOCKET > 4 TCP_INVALID_SEQUENCE > 4 TCP_RESET > 13 TCP_OLD_DATA > 14 NETFILTER_DROP > 4594 NOT_SPECIFIED > > We can start a separate discussion to break it down by category if it > would help. Let me know what kind of information you would like us to > provide to help with that. I assume you're interested in kernel stacks > leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's > something else. stack traces would be helpful. > > Even if I was only interested in one specific reason, I would still > have to arm the whole tracepoint and route a ton of skbs I'm not > interested in into my bpf code. This seems like a lot of overhead, > especially if I'm dropping some attack packets. you can add a filter on the tracepoint event to limit what is passed (although I have not tried the filter with an ebpf program - e.g., reason != NOT_SPECIFIED). 
> > Perhaps a lot of extra NOT_SPECIFIED stuff can be fixed and removed > from kfree_skb. It's not something I can personally do as it requires > much deeper network code understanding than I possess. For TCP we'll > also have to add some extra reasons for kfree_skb, because currently > it's all NOT_SPECIFIED (no reason set in the accept path): > > * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_input.c#L6499 > * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_ipv4.c#L1749 > > For UDP we already have SKB_DROP_REASON_SOCKET_RCVBUFF, so I tried my > best to implement what I wanted based on that. It's not very > approachable, as you'd have to extract the destination port yourself > from the raw skb. As Yan said, for TCP people often rely on skb->sk, > which is just not present when the incoming SYN is dropped. I failed > to find a good example of extracting a destination port that I could > replicate. So far I have just a per-reason breakdown working: > > * https://github.com/cloudflare/ebpf_exporter/pull/233 > > If you have an ebpf example that would help me extract the destination > port from an skb in kfree_skb, I'd be interested in taking a look and > trying to make it work. This is from 2020 and I forget which kernel version (pre-BTF), but it worked at that time and allowed userspace to summarize drop reasons by various network data (mac, L3 address, n-tuple, etc): https://github.com/dsahern/bpf-progs/blob/master/ksrc/pktdrop.c > > The need to extract the protocol level information in ebpf is only > making kfree_skb more expensive for the needs of catching rare cases > when we run out of buffer space (UDP) or listen queue (TCP). These two > cases are very common failure scenarios that people are interested in > catching with straightforward tracepoints that can give them the > needed information easily and cheaply. > > I sympathize with the desire to keep the number of tracepoints in > check, but I also feel like UDP buffer drops and TCP listen drops > tracepoints are very much justified to exist. sure, kfree_skb is like the raw_syscall tracepoint - it can be more than what you need for a specific problem, but it is also give you way more than you are thinking about today.
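For reference, here is a rough sketch of the skb-parsing approach described above, attached to the existing kfree_skb tracepoint. It is IPv4-only, ignores VLAN and tunnels, assumes the three-argument (skb, location, reason) tracepoint signature of the kernels discussed here, and uses NOT_SPECIFIED as a stand-in reason until dedicated listen-overflow reasons exist:

// SPDX-License-Identifier: GPL-2.0
/* Parse the TCP destination port straight out of a dropped skb, since
 * skb->sk may be NULL when a SYN is dropped. Illustrative sketch only. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_tracing.h>

SEC("tp_btf/kfree_skb")
int BPF_PROG(report_drop, struct sk_buff *skb, void *location,
             enum skb_drop_reason reason)
{
        unsigned char *head = BPF_CORE_READ(skb, head);
        u16 net_off = BPF_CORE_READ(skb, network_header);
        struct iphdr iph;
        struct tcphdr th;

        /* Cheap filter first: skip every reason we don't care about */
        if (reason != SKB_DROP_REASON_NOT_SPECIFIED)
                return 0;

        /* network_header is an offset from skb->head */
        if (bpf_probe_read_kernel(&iph, sizeof(iph), head + net_off))
                return 0;
        if (iph.version != 4 || iph.protocol != 6 /* IPPROTO_TCP */)
                return 0;
        if (bpf_probe_read_kernel(&th, sizeof(th), head + net_off + iph.ihl * 4))
                return 0;

        bpf_printk("tcp drop dport=%d reason=%d", bpf_ntohs(th.dest), reason);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";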
On Thu, Jul 13, 2023 at 8:14 PM Jakub Kicinski <kuba@kernel.org> wrote: > > I don't know why kfree_skb is called so much. I also don't agree with > > Yan that it's not actually too much, because it's a lot (especially > > compared with near zero for my proposed tracepoint). I can easily see > > 300-500k calls per second into it: > > > > $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10 > > # time counts unit events > > 1.000520165 10,108 skb:kfree_skb > > 2.010494526 11,178 skb:kfree_skb > > 3.075503743 10,770 skb:kfree_skb > > 4.122814843 11,334 skb:kfree_skb > > 5.128518432 12,020 skb:kfree_skb > > 6.176504094 11,117 skb:kfree_skb > > 7.201504214 12,753 skb:kfree_skb > > 8.229523643 10,566 skb:kfree_skb > > 9.326499044 365,239 skb:kfree_skb > > 10.002106098 313,105 skb:kfree_skb > > $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10 > > # time counts unit events > > 1.000767744 52,240 skb:kfree_skb > > 2.069762695 508,310 skb:kfree_skb > > 3.102763492 417,895 skb:kfree_skb > > 4.142757608 385,981 skb:kfree_skb > > 5.190759795 430,154 skb:kfree_skb > > 6.243765384 405,707 skb:kfree_skb > > 7.290818228 362,934 skb:kfree_skb > > 8.297764298 336,702 skb:kfree_skb > > 9.314287243 353,039 skb:kfree_skb > > 10.002288423 251,414 skb:kfree_skb > > > > Most of it is NOT_SPECIFIED (1s data from one CPU during a spike): > > > > $ perf script | sed 's/.*skbaddr=//' | awk '{ print $NF }' | sort | > > uniq -c | sort -n | tail > > 1 TCP_CLOSE > > 2 NO_SOCKET > > 4 TCP_INVALID_SEQUENCE > > 4 TCP_RESET > > 13 TCP_OLD_DATA > > 14 NETFILTER_DROP > > 4594 NOT_SPECIFIED > > > > We can start a separate discussion to break it down by category if it > > would help. Let me know what kind of information you would like us to > > provide to help with that. I assume you're interested in kernel stacks > > leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's > > something else. > > Just the stacks. Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/ > > Even if I was only interested in one specific reason, I would still > > have to arm the whole tracepoint and route a ton of skbs I'm not > > interested in into my bpf code. This seems like a lot of overhead, > > especially if I'm dropping some attack packets. > > That's what I meant with my drop vs exception comment. We already have > two tracepoints on the skb free path (free and consume), adding another > shouldn't rise too many eyebrows. I'm a bit confused. Previously you said: > Specifically what I'm wondering is whether we should also have > a separation between policy / "firewall drops" and error / exception > drops. Within the skb drop reason codes, I mean. My understanding was that you proposed adding more SKB_DROP_REASON_*, but now you seem to imply that we might want to add another tracepoint. Could you clarify which path you have in mind? We can add a few reasons that would satisfy my need by covering whatever results into tcp_listendrop() calls today. The problem is: unless we remove some other reasons from kfree_skb, adding more reasons for firewall drops / exceptions wouldn't change the cost at all. We'd still have the same number of calls into the tracepoint and the condition to find "interesting" reasons would be the same: if (reason == SKB_DROP_REASON_TCP_OVERFLOW_OR_SOMETHING) It still seems very expensive to consume a firehose of kfree_skb just to find some rare nuggets. > > Perhaps a lot of extra NOT_SPECIFIED stuff can be fixed and removed > > from kfree_skb. 
It's not something I can personally do as it requires > > much deeper network code understanding than I possess. For TCP we'll > > also have to add some extra reasons for kfree_skb, because currently > > it's all NOT_SPECIFIED (no reason set in the accept path): > > > > * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_input.c#L6499 > > * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_ipv4.c#L1749 > > > > For UDP we already have SKB_DROP_REASON_SOCKET_RCVBUFF, so I tried my > > best to implement what I wanted based on that. It's not very > > approachable, as you'd have to extract the destination port yourself > > from the raw skb. As Yan said, for TCP people often rely on skb->sk, > > which is just not present when the incoming SYN is dropped. I failed > > to find a good example of extracting a destination port that I could > > replicate. So far I have just a per-reason breakdown working: > > > > * https://github.com/cloudflare/ebpf_exporter/pull/233 > > > > If you have an ebpf example that would help me extract the destination > > port from an skb in kfree_skb, I'd be interested in taking a look and > > trying to make it work. > > > > The need to extract the protocol level information in ebpf is only > > making kfree_skb more expensive for the needs of catching rare cases > > when we run out of buffer space (UDP) or listen queue (TCP). These two > > cases are very common failure scenarios that people are interested in > > catching with straightforward tracepoints that can give them the > > needed information easily and cheaply. > > > > I sympathize with the desire to keep the number of tracepoints in > > check, but I also feel like UDP buffer drops and TCP listen drops > > tracepoints are very much justified to exist. > > I'm not completely opposed to the tracepoints where needed. It's more > of trying to make sure we do due diligence on the existing solutions. > Or maybe not even due diligence as much as pay off some technical debt. The arguments for a new tracepoint from my due diligence: 1. There are too many calls into kfree_skb. Maybe that's fixable to a degree, but there would still be a big discrepancy in the number of calls between my proposed targeted tracepoint and skb:kfree_skb. For some setups this might not be feasible. 2. It's hard to extract the necessary information from skb without an sk attached. Both from usability perspective (you need to parse the network header yourself with a non-trivial amount of code) and from cost perspective (parsing is not zero cost). The latter doesn't matter as much if we target a specific reason.
On Fri, Jul 14, 2023 at 8:09 AM David Ahern <dsahern@kernel.org> wrote: > > We can start a separate discussion to break it down by category if it > > would help. Let me know what kind of information you would like us to > > provide to help with that. I assume you're interested in kernel stacks > > leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's > > something else. > > stack traces would be helpful. Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/ > > Even if I was only interested in one specific reason, I would still > > have to arm the whole tracepoint and route a ton of skbs I'm not > > interested in into my bpf code. This seems like a lot of overhead, > > especially if I'm dropping some attack packets. > > you can add a filter on the tracepoint event to limit what is passed > (although I have not tried the filter with an ebpf program - e.g., > reason != NOT_SPECIFIED). Absolutely, but isn't there overhead to even do just that for every freed skb? > > If you have an ebpf example that would help me extract the destination > > port from an skb in kfree_skb, I'd be interested in taking a look and > > trying to make it work. > > This is from 2020 and I forget which kernel version (pre-BTF), but it > worked at that time and allowed userspace to summarize drop reasons by > various network data (mac, L3 address, n-tuple, etc): > > https://github.com/dsahern/bpf-progs/blob/master/ksrc/pktdrop.c It doesn't seem to extract the L4 metadata (local port specifically), which is what I'm after. > > The need to extract the protocol level information in ebpf is only > > making kfree_skb more expensive for the needs of catching rare cases > > when we run out of buffer space (UDP) or listen queue (TCP). These two > > cases are very common failure scenarios that people are interested in > > catching with straightforward tracepoints that can give them the > > needed information easily and cheaply. > > > > I sympathize with the desire to keep the number of tracepoints in > > check, but I also feel like UDP buffer drops and TCP listen drops > > tracepoints are very much justified to exist. > > sure, kfree_skb is like the raw_syscall tracepoint - it can be more than > what you need for a specific problem, but it is also give you way more > than you are thinking about today. I really like the comparison to raw_syscall tracepoint. There are two flavors: 1. Catch-all: raw_syscalls:sys_enter, which is similar to skb:kfree_skb. 2. Specific tracepoints: syscalls:sys_enter_* for every syscall. If you are interested in one rare syscall, you wouldn't attach to a catch-all firehose and the filter for id in post. Understandably, we probably can't have a separate skb:kfree_skb for every reason. However, some of them are more useful than others and I believe that tcp listen drops fall into that category. We went through a similar exercise with audit subsystem, which in fact always arms all syscalls even if you audit one of them: * https://lore.kernel.org/audit/20230523181624.19932-1-ivan@cloudflare.com/T/#u With pictures, if you're interested: * https://mastodon.ivan.computer/@mastodon/110426498281603668 Nobody audits futex(), but if you audit execve(), all the rules run for both. Some rules will run faster, but all of them will run. It's a lot of overhead with millions of CPUs, which I'm trying to avoid (the planet is hot as it is). 
Ultimately my arguments for a separate tracepoint for tcp listen drops are at the bottom of my reply to Jakub: * https://lore.kernel.org/netdev/CABWYdi2BGi=iRCfLhmQCqO=1eaQ1WaCG7F9WsJrz-7==ocZidg@mail.gmail.com/
On 7/14/23 5:38 PM, Ivan Babrou wrote: > On Fri, Jul 14, 2023 at 8:09 AM David Ahern <dsahern@kernel.org> wrote: >>> We can start a separate discussion to break it down by category if it >>> would help. Let me know what kind of information you would like us to >>> provide to help with that. I assume you're interested in kernel stacks >>> leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's >>> something else. >> >> stack traces would be helpful. > > Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/ > >>> Even if I was only interested in one specific reason, I would still >>> have to arm the whole tracepoint and route a ton of skbs I'm not >>> interested in into my bpf code. This seems like a lot of overhead, >>> especially if I'm dropping some attack packets. >> >> you can add a filter on the tracepoint event to limit what is passed >> (although I have not tried the filter with an ebpf program - e.g., >> reason != NOT_SPECIFIED). > > Absolutely, but isn't there overhead to even do just that for every freed skb? There is some amount of overhead. If filters can be used with ebpf programs, then the differential cost is just the cycles for the filter which in this case is an integer compare. Should be low - maybe Steven has some data on the overhead? > >>> If you have an ebpf example that would help me extract the destination >>> port from an skb in kfree_skb, I'd be interested in taking a look and >>> trying to make it work. >> >> This is from 2020 and I forget which kernel version (pre-BTF), but it >> worked at that time and allowed userspace to summarize drop reasons by >> various network data (mac, L3 address, n-tuple, etc): >> >> https://github.com/dsahern/bpf-progs/blob/master/ksrc/pktdrop.c > > It doesn't seem to extract the L4 metadata (local port specifically), > which is what I'm after. This program takes the path of copy headers to userspace and does the parsing there (there is a netmon program that uses that ebpf program which shows drops for varying perspectives). You can just as easily do the parsing in ebpf. Once you have the start of packet data, walk the protocols of interest -- e.g., see parse_pkt in flow.h. > >>> The need to extract the protocol level information in ebpf is only >>> making kfree_skb more expensive for the needs of catching rare cases >>> when we run out of buffer space (UDP) or listen queue (TCP). These two >>> cases are very common failure scenarios that people are interested in >>> catching with straightforward tracepoints that can give them the >>> needed information easily and cheaply. >>> >>> I sympathize with the desire to keep the number of tracepoints in >>> check, but I also feel like UDP buffer drops and TCP listen drops >>> tracepoints are very much justified to exist. >> >> sure, kfree_skb is like the raw_syscall tracepoint - it can be more than >> what you need for a specific problem, but it is also give you way more >> than you are thinking about today. > > I really like the comparison to raw_syscall tracepoint. There are two flavors: > > 1. Catch-all: raw_syscalls:sys_enter, which is similar to skb:kfree_skb. > 2. Specific tracepoints: syscalls:sys_enter_* for every syscall. > > If you are interested in one rare syscall, you wouldn't attach to a > catch-all firehose and the filter for id in post. Understandably, we > probably can't have a separate skb:kfree_skb for every reason. 
yea, I pushed for the 'reason' design rather than having a tracepoint per drop location as it is way more scalable. There is a balance and I believe that is what Kuba is pushing at here. > However, some of them are more useful than others and I believe that > tcp listen drops fall into that category. > > We went through a similar exercise with audit subsystem, which in fact > always arms all syscalls even if you audit one of them: > > * https://lore.kernel.org/audit/20230523181624.19932-1-ivan@cloudflare.com/T/#u > > With pictures, if you're interested: > > * https://mastodon.ivan.computer/@mastodon/110426498281603668 > > Nobody audits futex(), but if you audit execve(), all the rules run > for both. Some rules will run faster, but all of them will run. It's a > lot of overhead with millions of CPUs, which I'm trying to avoid (the > planet is hot as it is). > > Ultimately my arguments for a separate tracepoint for tcp listen drops > are at the bottom of my reply to Jakub: > > * https://lore.kernel.org/netdev/CABWYdi2BGi=iRCfLhmQCqO=1eaQ1WaCG7F9WsJrz-7==ocZidg@mail.gmail.com/
On Fri, 14 Jul 2023 16:21:08 -0700 Ivan Babrou wrote: > > Just the stacks. > > Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/ Thanks! I'll follow the discussion there. Just the one remaining clarification here: > > > Even if I was only interested in one specific reason, I would still > > > have to arm the whole tracepoint and route a ton of skbs I'm not > > > interested in into my bpf code. This seems like a lot of overhead, > > > especially if I'm dropping some attack packets. > > > > That's what I meant with my drop vs exception comment. We already have > > two tracepoints on the skb free path (free and consume), adding another > > shouldn't rise too many eyebrows. > > I'm a bit confused. Previously you said: > > > Specifically what I'm wondering is whether we should also have > > a separation between policy / "firewall drops" and error / exception > > drops. Within the skb drop reason codes, I mean. > > My understanding was that you proposed adding more SKB_DROP_REASON_*, > but now you seem to imply that we might want to add another > tracepoint. Could you clarify which path you have in mind? What I had in mind was sorting the drop reasons to be able to easily distinguish policy drops from error drops. > We can add a few reasons that would satisfy my need by covering > whatever results into tcp_listendrop() calls today. The problem is: > unless we remove some other reasons from kfree_skb, adding more > reasons for firewall drops / exceptions wouldn't change the cost at > all. We'd still have the same number of calls into the tracepoint and > the condition to find "interesting" reasons would be the same: > > if (reason == SKB_DROP_REASON_TCP_OVERFLOW_OR_SOMETHING) > > It still seems very expensive to consume a firehose of kfree_skb just > to find some rare nuggets. Let me show you a quick mock-up of a diff: diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h index a2b953b57689..86ee70fcf540 100644 --- a/include/net/dropreason-core.h +++ b/include/net/dropreason-core.h @@ -5,12 +5,18 @@ #define DEFINE_DROP_REASON(FN, FNe) \ FN(NOT_SPECIFIED) \ + /* Policy-driven/intentional drops: */ \ + FN(NETFILTER_DROP) \ + FN(BPF_CGROUP_EGRESS) \ + FN(TC_INGRESS) \ + FN(TC_EGRESS) \ + FN(XDP) \ + /* Errors: */ \ FN(NO_SOCKET) \ FN(PKT_TOO_SMALL) \ FN(TCP_CSUM) \ FN(SOCKET_FILTER) \ FN(UDP_CSUM) \ - FN(NETFILTER_DROP) \ FN(OTHERHOST) \ FN(IP_CSUM) \ FN(IP_INHDR) \ @@ -41,17 +47,13 @@ FN(TCP_OFO_QUEUE_PRUNE) \ FN(TCP_OFO_DROP) \ FN(IP_OUTNOROUTES) \ - FN(BPF_CGROUP_EGRESS) \ FN(IPV6DISABLED) \ FN(NEIGH_CREATEFAIL) \ FN(NEIGH_FAILED) \ FN(NEIGH_QUEUEFULL) \ FN(NEIGH_DEAD) \ - FN(TC_EGRESS) \ FN(QDISC_DROP) \ FN(CPU_BACKLOG) \ - FN(XDP) \ - FN(TC_INGRESS) \ FN(UNHANDLED_PROTO) \ FN(SKB_CSUM) \ FN(SKB_GSO_SEG) \ @@ -80,6 +82,8 @@ FN(IPV6_NDISC_NS_OTHERHOST) \ FNe(MAX) +#define __SKB_POLICY_DROP_END SKB_DROP_REASON_NO_SOCKET + /** * enum skb_drop_reason - the reasons of skb drops * diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 6c5915efbc17..a36c498eb693 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1031,6 +1031,8 @@ bool __kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason) if (reason == SKB_CONSUMED) trace_consume_skb(skb, __builtin_return_address(0)); + else if (reason < __SKB_POLICY_DROP_END) + trace_drop_skb(skb, __builtin_return_address(0), reason); else trace_kfree_skb(skb, __builtin_return_address(0), reason); return true;
On Fri, Jul 14, 2023 at 6:30 PM David Ahern <dsahern@kernel.org> wrote: > > On 7/14/23 5:38 PM, Ivan Babrou wrote: > > On Fri, Jul 14, 2023 at 8:09 AM David Ahern <dsahern@kernel.org> wrote: > >>> We can start a separate discussion to break it down by category if it > >>> would help. Let me know what kind of information you would like us to > >>> provide to help with that. I assume you're interested in kernel stacks > >>> leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's > >>> something else. > >> > >> stack traces would be helpful. > > > > Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/ > > > >>> Even if I was only interested in one specific reason, I would still > >>> have to arm the whole tracepoint and route a ton of skbs I'm not > >>> interested in into my bpf code. This seems like a lot of overhead, > >>> especially if I'm dropping some attack packets. > >> > >> you can add a filter on the tracepoint event to limit what is passed > >> (although I have not tried the filter with an ebpf program - e.g., > >> reason != NOT_SPECIFIED). > > > > Absolutely, but isn't there overhead to even do just that for every freed skb? > > There is some amount of overhead. If filters can be used with ebpf > programs, then the differential cost is just the cycles for the filter > which in this case is an integer compare. Should be low - maybe Steven > has some data on the overhead? I updated my benchmarks and added two dimensions: * Empty probe that just returns immediately (simple and complex map increments were already there) * Tracepoint probe (fprobe and kprobe were already there) The results are here: * https://github.com/cloudflare/ebpf_exporter/tree/master/benchmark It looks like we can expect an empty tracepoint probe to finish in ~15ns. At least that's what I see on my M1 laptop in a VM running v6.5-rc1. 15ns x 400k calls/s = 6ms/s or 0.6% of a single CPU if all you do is nothing (which is the likely outcome) for kfree_skb tracepoint. I guess it's not as terrible as I expected, which is good news. > >>> If you have an ebpf example that would help me extract the destination > >>> port from an skb in kfree_skb, I'd be interested in taking a look and > >>> trying to make it work. > >> > >> This is from 2020 and I forget which kernel version (pre-BTF), but it > >> worked at that time and allowed userspace to summarize drop reasons by > >> various network data (mac, L3 address, n-tuple, etc): > >> > >> https://github.com/dsahern/bpf-progs/blob/master/ksrc/pktdrop.c > > > > It doesn't seem to extract the L4 metadata (local port specifically), > > which is what I'm after. > > This program takes the path of copy headers to userspace and does the > parsing there (there is a netmon program that uses that ebpf program > which shows drops for varying perspectives). You can just as easily do > the parsing in ebpf. Once you have the start of packet data, walk the > protocols of interest -- e.g., see parse_pkt in flow.h. I see, thanks. I want to do all the aggregation in the kernel, so I took a stab at that. With a lot of trial and error I was able to come up with the following: * https://github.com/cloudflare/ebpf_exporter/pull/235 Some points from my experience doing that: * It was a lot harder to get it working than the tracepoint I proposed. There are few examples of parsing skb in the kernel in bpf and none do what I wanted to do. * It is unclear whether this would work with vlan or other encapsulation. 
Your code has special handling for vlan. As a user I just want the L4 port. * There's a lot more boilerplate code to get to L4 info. Accessing sk is a lot easier. * It's not very useful without adding the reasons that would correspond to listen drops in TCP. I'm not sure if I'm up to the task, but I can give it a shot. * It's unclear how to detect which end of the socket is bound locally. I want to know which ports that are listened on locally are experiencing issues, ignoring sockets that connect elsewhere. E.g. I care about my http server dropping packets, but not as much about curl doing the same. * UDP drops seem to work okay in my local testing, I can see SKB_DROP_REASON_SOCKET_RCVBUFF by port. As a reminder, the code for my tracepoint is here: * https://github.com/cloudflare/ebpf_exporter/pull/221 It's a lot simpler. I still feel that it's justified to exist. Hope this helps.
On Tue, Jul 18, 2023 at 2:57 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Fri, 14 Jul 2023 16:21:08 -0700 Ivan Babrou wrote: > > > Just the stacks. > > > > Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/ > > Thanks! I'll follow the discussion there. Just the one remaining > clarification here: > > > > > Even if I was only interested in one specific reason, I would still > > > > have to arm the whole tracepoint and route a ton of skbs I'm not > > > > interested in into my bpf code. This seems like a lot of overhead, > > > > especially if I'm dropping some attack packets. > > > > > > That's what I meant with my drop vs exception comment. We already have > > > two tracepoints on the skb free path (free and consume), adding another > > > shouldn't rise too many eyebrows. > > > > I'm a bit confused. Previously you said: > > > > > Specifically what I'm wondering is whether we should also have > > > a separation between policy / "firewall drops" and error / exception > > > drops. Within the skb drop reason codes, I mean. > > > > My understanding was that you proposed adding more SKB_DROP_REASON_*, > > but now you seem to imply that we might want to add another > > tracepoint. Could you clarify which path you have in mind? > > What I had in mind was sorting the drop reasons to be able to easily > distinguish policy drops from error drops. > > > We can add a few reasons that would satisfy my need by covering > > whatever results into tcp_listendrop() calls today. The problem is: > > unless we remove some other reasons from kfree_skb, adding more > > reasons for firewall drops / exceptions wouldn't change the cost at > > all. We'd still have the same number of calls into the tracepoint and > > the condition to find "interesting" reasons would be the same: > > > > if (reason == SKB_DROP_REASON_TCP_OVERFLOW_OR_SOMETHING) > > > > It still seems very expensive to consume a firehose of kfree_skb just > > to find some rare nuggets. 
> > Let me show you a quick mock-up of a diff: > > diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h > index a2b953b57689..86ee70fcf540 100644 > --- a/include/net/dropreason-core.h > +++ b/include/net/dropreason-core.h > @@ -5,12 +5,18 @@ > > #define DEFINE_DROP_REASON(FN, FNe) \ > FN(NOT_SPECIFIED) \ > + /* Policy-driven/intentional drops: */ \ > + FN(NETFILTER_DROP) \ > + FN(BPF_CGROUP_EGRESS) \ > + FN(TC_INGRESS) \ > + FN(TC_EGRESS) \ > + FN(XDP) \ > + /* Errors: */ \ > FN(NO_SOCKET) \ > FN(PKT_TOO_SMALL) \ > FN(TCP_CSUM) \ > FN(SOCKET_FILTER) \ > FN(UDP_CSUM) \ > - FN(NETFILTER_DROP) \ > FN(OTHERHOST) \ > FN(IP_CSUM) \ > FN(IP_INHDR) \ > @@ -41,17 +47,13 @@ > FN(TCP_OFO_QUEUE_PRUNE) \ > FN(TCP_OFO_DROP) \ > FN(IP_OUTNOROUTES) \ > - FN(BPF_CGROUP_EGRESS) \ > FN(IPV6DISABLED) \ > FN(NEIGH_CREATEFAIL) \ > FN(NEIGH_FAILED) \ > FN(NEIGH_QUEUEFULL) \ > FN(NEIGH_DEAD) \ > - FN(TC_EGRESS) \ > FN(QDISC_DROP) \ > FN(CPU_BACKLOG) \ > - FN(XDP) \ > - FN(TC_INGRESS) \ > FN(UNHANDLED_PROTO) \ > FN(SKB_CSUM) \ > FN(SKB_GSO_SEG) \ > @@ -80,6 +82,8 @@ > FN(IPV6_NDISC_NS_OTHERHOST) \ > FNe(MAX) > > +#define __SKB_POLICY_DROP_END SKB_DROP_REASON_NO_SOCKET > + > /** > * enum skb_drop_reason - the reasons of skb drops > * > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 6c5915efbc17..a36c498eb693 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -1031,6 +1031,8 @@ bool __kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason) > > if (reason == SKB_CONSUMED) > trace_consume_skb(skb, __builtin_return_address(0)); > + else if (reason < __SKB_POLICY_DROP_END) > + trace_drop_skb(skb, __builtin_return_address(0), reason); > else > trace_kfree_skb(skb, __builtin_return_address(0), reason); > return true; I see what you mean now. This makes sense.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 226bce6d1e8c..810ad606641f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -46,6 +46,7 @@
 #include <linux/bpf-cgroup.h>
 #include <linux/siphash.h>
 #include <linux/net_mm.h>
+#include <linux/tracepoint-defs.h>
 
 extern struct inet_hashinfo tcp_hashinfo;
 
@@ -2259,6 +2260,10 @@ static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
 	WRITE_ONCE(tp->data_segs_in, tp->data_segs_in + segs_in);
 }
 
+DECLARE_TRACEPOINT(tcp_listen_queue_drop);
+
+void do_trace_tcp_listen_queue_drop_wrapper(const struct sock *sk);
+
 /*
  * TCP listen path runs lockless.
  * We forced "struct sock" to be const qualified to make sure
@@ -2270,6 +2275,8 @@ static inline void tcp_listendrop(const struct sock *sk)
 {
 	atomic_inc(&((struct sock *)sk)->sk_drops);
 	__NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENDROPS);
+	if (tracepoint_enabled(tcp_listen_queue_drop))
+		do_trace_tcp_listen_queue_drop_wrapper(sk);
 }
 
 enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer);
diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index bf06db8d2046..646ad0bbd378 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -416,6 +416,52 @@ TRACE_EVENT(tcp_cong_state_set,
 		  __entry->cong_state)
 );
 
+TRACE_EVENT(tcp_listen_queue_drop,
+
+	TP_PROTO(const struct sock *sk),
+
+	TP_ARGS(sk),
+
+	TP_STRUCT__entry(
+		__field(const void *, skaddr)
+		__field(__u16, sport)
+		__array(__u8, saddr, 4)
+		__array(__u8, saddr_v6, 16)
+		__field(__u32, sk_max_ack_backlog)
+		__field(__u32, sk_ack_backlog)
+	),
+
+	TP_fast_assign(
+		const struct inet_sock *inet = inet_sk(sk);
+		struct in6_addr *pin6;
+		__be32 *p32;
+
+		__entry->skaddr = sk;
+
+		__entry->sport = ntohs(inet->inet_sport);
+
+		p32 = (__be32 *) __entry->saddr;
+		*p32 = inet->inet_saddr;
+
+		pin6 = (struct in6_addr *)__entry->saddr_v6;
+#if IS_ENABLED(CONFIG_IPV6)
+		if (sk->sk_family == AF_INET6)
+			*pin6 = sk->sk_v6_rcv_saddr;
+		else
+			ipv6_addr_set_v4mapped(inet->inet_saddr, pin6);
+#else
+		ipv6_addr_set_v4mapped(inet->inet_saddr, pin6);
+#endif
+
+		__entry->sk_max_ack_backlog = READ_ONCE(sk->sk_max_ack_backlog);
+		__entry->sk_ack_backlog = READ_ONCE(sk->sk_ack_backlog);
+	),
+
+	TP_printk("sport=%hu saddr=%pI4 saddrv6=%pI6c sk_max_ack_backlog=%d sk_ack_backlog=%d",
+		  __entry->sport, __entry->saddr, __entry->saddr_v6,
+		  __entry->sk_max_ack_backlog, __entry->sk_ack_backlog)
+);
+
 #endif /* _TRACE_TCP_H */
 
 /* This part must be outside protection */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e03e08745308..29ecbc5248c3 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -276,6 +276,8 @@
 #include <net/ip.h>
 #include <net/sock.h>
 
+#include <trace/events/tcp.h>
+
 #include <linux/uaccess.h>
 #include <asm/ioctls.h>
 #include <net/busy_poll.h>
@@ -1697,6 +1699,11 @@ int tcp_peek_len(struct socket *sock)
 }
 EXPORT_SYMBOL(tcp_peek_len);
 
+void do_trace_tcp_listen_queue_drop_wrapper(const struct sock *sk)
+{
+	trace_tcp_listen_queue_drop(sk);
+}
+
 /* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
 int tcp_set_rcvlowat(struct sock *sk, int val)
 {