[RFC,net-next] tcp: add a tracepoint for tcp_listen_queue_drop

Message ID 20230711043453.64095-1-ivan@cloudflare.com
State New
Headers
Series [RFC,net-next] tcp: add a tracepoint for tcp_listen_queue_drop |

Commit Message

Ivan Babrou July 11, 2023, 4:34 a.m. UTC
  There's already a way to count the overall numbers of queue overflows:

    $ sudo netstat -s | grep 'listen queue'
    4 times the listen queue of a socket overflowed

However, it's too coarse for monitoring and alerting when a user wants to
track errors per socket and route alerts to people responsible for those
sockets directly. For UDP there's udp_fail_queue_rcv_skb, which fills
a similar need for UDP sockets. This patch adds a TCP equivalent.

--

The goal is to use this new tracepoint with ebpf_exporter:

* https://github.com/cloudflare/ebpf_exporter

There's an example configuration for UDP drops there that we use:

* https://github.com/cloudflare/ebpf_exporter/blob/master/examples/udp-drops.bpf.c
* https://github.com/cloudflare/ebpf_exporter/blob/master/examples/udp-drops.yaml

Paolo Abeni asked whether we need the UDP tracepoint given that kfree_skb
and MIB counters already exist, and I covered that part in my reply here:

* https://lore.kernel.org/netdev/CABWYdi3DVex0wq2kM72QTZkhNzkh_Vjb4+T8mj8U7t06Na=5kA@mail.gmail.com/

I added a TCP example utilizing this patch here:

* https://github.com/cloudflare/ebpf_exporter/pull/221

Not so long ago we hit a bug in one of our services that broke its accept
loop, which in resulted in the listen queue overflow. With this new
tracepoint we can have a metric for this and alert the service owners
directly, cutting the middleman SRE and improving the alert fidelity.

We don't really need a tracepoint for this, just a place to hook a kprobe
or an fprobe to. The existing tcp_listendrop is great for this, except
it's a short inlined function, so there's no way to attach a probe to it.

There are a few ways to approach this:

* Un-inline tcp_listendrop to allow probe attachment
* Un-inline tcp_listendrop and add a stable tracepoint
* Keep tcp_listendrop inlined, but add a tracepoint wrapper to call into

There is no option to keep tcp_listendrop inlined and call into tracepoint
directly from it (it does not compile and it wouldn't be nice if it did):

* https://docs.kernel.org/trace/tracepoints.html

Therefore I went with the third option, which this patch implements.

Example output from perf:

    $ sudo perf trace -a -e tcp:tcp_listen_queue_drop
    0.000 sockfull/5459 tcp:tcp_listen_queue_drop(skaddr: 0xffffff90d7a25580, sport: 12345, saddr: 0x7faa1aed26, saddr_v6: 0x7faa1aed2a, sk_max_ack_backlog: 128, sk_ack_backlog: 129)

Example extracting the local port with bpftrace:

    $ sudo ~/projects/bpftrace/src/bpftrace -e 'rawtracepoint:tcp_listen_queue_drop { $sk = (struct sock *) arg0; $lport = $sk->__sk_common.skc_num; printf("drop on lport = %d\n", $lport); }'
    Attaching 1 probe...
    drop on lport = 12345

Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
---
 include/net/tcp.h          |  7 ++++++
 include/trace/events/tcp.h | 46 ++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp.c             |  7 ++++++
 3 files changed, 60 insertions(+)
  

Comments

Jakub Kicinski July 12, 2023, 2:36 a.m. UTC | #1
On Mon, 10 Jul 2023 21:34:52 -0700 Ivan Babrou wrote:
> There's already a way to count the overall numbers of queue overflows:
> 
>     $ sudo netstat -s | grep 'listen queue'
>     4 times the listen queue of a socket overflowed
> 
> However, it's too coarse for monitoring and alerting when a user wants to
> track errors per socket and route alerts to people responsible for those
> sockets directly. For UDP there's udp_fail_queue_rcv_skb, which fills
> a similar need for UDP sockets. This patch adds a TCP equivalent.

Makes me want to revert your recent UDP tracepoint to be honest :(
We can play whack a mole like this. You said that kfree_skb fires
too often, why is that? Maybe it's an issue of someone using
kfree_skb() when they should be using consume_skb() ?
  
Yan Zhai July 12, 2023, 4:42 p.m. UTC | #2
On Tue, Jul 11, 2023 at 9:36 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 10 Jul 2023 21:34:52 -0700 Ivan Babrou wrote:
> > There's already a way to count the overall numbers of queue overflows:
> >
> >     $ sudo netstat -s | grep 'listen queue'
> >     4 times the listen queue of a socket overflowed
> >
> > However, it's too coarse for monitoring and alerting when a user wants to
> > track errors per socket and route alerts to people responsible for those
> > sockets directly. For UDP there's udp_fail_queue_rcv_skb, which fills
> > a similar need for UDP sockets. This patch adds a TCP equivalent.
>
> Makes me want to revert your recent UDP tracepoint to be honest :(
> We can play whack a mole like this. You said that kfree_skb fires
> too often, why is that? Maybe it's an issue of someone using
> kfree_skb() when they should be using consume_skb() ?

Hi Jakub,

  The issue with kfree_skb is not that it fires too frequently (not in
the 6.x kernel now). Rather, it is unable to locate the socket info
when a SYN is dropped due to the accept queue being full. The sk is
stolen upon inet lookup, e.g. in tcp_v4_rcv. This makes it unable to
tell in kfree_skb which socket a SYN skb is targeting (when TPROXY or
socket lookup are used). A tracepoint with sk information will be more
useful to monitor accurately which service/socket is involved.

--
Yan
  
Jakub Kicinski July 12, 2023, 5:42 p.m. UTC | #3
On Wed, 12 Jul 2023 11:42:26 -0500 Yan Zhai wrote:
>   The issue with kfree_skb is not that it fires too frequently (not in
> the 6.x kernel now). Rather, it is unable to locate the socket info
> when a SYN is dropped due to the accept queue being full. The sk is
> stolen upon inet lookup, e.g. in tcp_v4_rcv. This makes it unable to
> tell in kfree_skb which socket a SYN skb is targeting (when TPROXY or
> socket lookup are used). A tracepoint with sk information will be more
> useful to monitor accurately which service/socket is involved.

No doubt that kfree_skb isn't going to solve all our needs, but I'd
really like you to clean up the unnecessary callers on your systems
first, before adding further tracepoints. That way we'll have a clear
picture of which points can be solved by kfree_skb and where we need
further work.
  
Yan Zhai July 13, 2023, 2:43 a.m. UTC | #4
On Wed, Jul 12, 2023 at 12:42 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 12 Jul 2023 11:42:26 -0500 Yan Zhai wrote:
> >   The issue with kfree_skb is not that it fires too frequently (not in
> > the 6.x kernel now). Rather, it is unable to locate the socket info
> > when a SYN is dropped due to the accept queue being full. The sk is
> > stolen upon inet lookup, e.g. in tcp_v4_rcv. This makes it unable to
> > tell in kfree_skb which socket a SYN skb is targeting (when TPROXY or
> > socket lookup are used). A tracepoint with sk information will be more
> > useful to monitor accurately which service/socket is involved.
>
> No doubt that kfree_skb isn't going to solve all our needs, but I'd
> really like you to clean up the unnecessary callers on your systems
> first, before adding further tracepoints. That way we'll have a clear
> picture of which points can be solved by kfree_skb and where we need
> further work.

Those are not unnecessary calls, e.g. a lot of those kfree_skb come
from iptables drops, tcp validation, ttl expires, etc. On a moderately
loaded server, it is called at a rate of ~10k/sec, which isn't
terribly awful given that we absorb millions of attack packets at each
data center. We used to have many consume skb noises at this trace
point with older versions of kernels, but those have gone ever since
the better separation between consume and drop.

That said, accessing sk information is still the critical piece to
address our use case. Is there any other possible way that we can get
this information at the accept queue overflow time?

--
Yan
  
Jakub Kicinski July 13, 2023, 4:57 p.m. UTC | #5
On Wed, 12 Jul 2023 21:43:32 -0500 Yan Zhai wrote:
> Those are not unnecessary calls, e.g. a lot of those kfree_skb come
> from iptables drops, tcp validation, ttl expires, etc. On a moderately
> loaded server, it is called at a rate of ~10k/sec, which isn't
> terribly awful given that we absorb millions of attack packets at each
> data center. We used to have many consume skb noises at this trace
> point with older versions of kernels, but those have gone ever since
> the better separation between consume and drop.

I was hoping you can break them down by category.

Specifically what I'm wondering is whether we should also have 
a separation between policy / "firewall drops" and error / exception
drops. Within the skb drop reason codes, I mean.
  
Ivan Babrou July 13, 2023, 11:17 p.m. UTC | #6
On Wed, Jul 12, 2023 at 10:42 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 12 Jul 2023 11:42:26 -0500 Yan Zhai wrote:
> >   The issue with kfree_skb is not that it fires too frequently (not in
> > the 6.x kernel now). Rather, it is unable to locate the socket info
> > when a SYN is dropped due to the accept queue being full. The sk is
> > stolen upon inet lookup, e.g. in tcp_v4_rcv. This makes it unable to
> > tell in kfree_skb which socket a SYN skb is targeting (when TPROXY or
> > socket lookup are used). A tracepoint with sk information will be more
> > useful to monitor accurately which service/socket is involved.
>
> No doubt that kfree_skb isn't going to solve all our needs, but I'd
> really like you to clean up the unnecessary callers on your systems
> first, before adding further tracepoints. That way we'll have a clear
> picture of which points can be solved by kfree_skb and where we need
> further work.

The existing UDP tracepoint was there for 12 years and it's a part of
what kernel exposes to userspace, so I don't think it's fair to remove
this and break its consumers. I think "do not break userspace" applies
here. The proposed TCP tracepoint mostly mirrors it, so I think it's
fair to have it.

I don't know why kfree_skb is called so much. I also don't agree with
Yan that it's not actually too much, because it's a lot (especially
compared with near zero for my proposed tracepoint). I can easily see
300-500k calls per second into it:

$ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10
#           time             counts unit events
     1.000520165             10,108      skb:kfree_skb
     2.010494526             11,178      skb:kfree_skb
     3.075503743             10,770      skb:kfree_skb
     4.122814843             11,334      skb:kfree_skb
     5.128518432             12,020      skb:kfree_skb
     6.176504094             11,117      skb:kfree_skb
     7.201504214             12,753      skb:kfree_skb
     8.229523643             10,566      skb:kfree_skb
     9.326499044            365,239      skb:kfree_skb
    10.002106098            313,105      skb:kfree_skb
$ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10
#           time             counts unit events
     1.000767744             52,240      skb:kfree_skb
     2.069762695            508,310      skb:kfree_skb
     3.102763492            417,895      skb:kfree_skb
     4.142757608            385,981      skb:kfree_skb
     5.190759795            430,154      skb:kfree_skb
     6.243765384            405,707      skb:kfree_skb
     7.290818228            362,934      skb:kfree_skb
     8.297764298            336,702      skb:kfree_skb
     9.314287243            353,039      skb:kfree_skb
    10.002288423            251,414      skb:kfree_skb

Most of it is NOT_SPECIFIED (1s data from one CPU during a spike):

$ perf script | sed 's/.*skbaddr=//' | awk '{ print $NF }' | sort |
uniq -c | sort -n | tail
      1 TCP_CLOSE
      2 NO_SOCKET
      4 TCP_INVALID_SEQUENCE
      4 TCP_RESET
     13 TCP_OLD_DATA
     14 NETFILTER_DROP
   4594 NOT_SPECIFIED

We can start a separate discussion to break it down by category if it
would help. Let me know what kind of information you would like us to
provide to help with that. I assume you're interested in kernel stacks
leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's
something else.

Even if I was only interested in one specific reason, I would still
have to arm the whole tracepoint and route a ton of skbs I'm not
interested in into my bpf code. This seems like a lot of overhead,
especially if I'm dropping some attack packets.

Perhaps a lot of extra NOT_SPECIFIED stuff can be fixed and removed
from kfree_skb. It's not something I can personally do as it requires
much deeper network code understanding than I possess. For TCP we'll
also have to add some extra reasons for kfree_skb, because currently
it's all NOT_SPECIFIED (no reason set in the accept path):

* https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_input.c#L6499
* https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_ipv4.c#L1749

For UDP we already have SKB_DROP_REASON_SOCKET_RCVBUFF, so I tried my
best to implement what I wanted based on that. It's not very
approachable, as you'd have to extract the destination port yourself
from the raw skb. As Yan said, for TCP people often rely on skb->sk,
which is just not present when the incoming SYN is dropped. I failed
to find a good example of extracting a destination port that I could
replicate. So far I have just a per-reason breakdown working:

* https://github.com/cloudflare/ebpf_exporter/pull/233

If you have an ebpf example that would help me extract the destination
port from an skb in kfree_skb, I'd be interested in taking a look and
trying to make it work.

The need to extract the protocol level information in ebpf is only
making kfree_skb more expensive for the needs of catching rare cases
when we run out of buffer space (UDP) or listen queue (TCP). These two
cases are very common failure scenarios that people are interested in
catching with straightforward tracepoints that can give them the
needed information easily and cheaply.

I sympathize with the desire to keep the number of tracepoints in
check, but I also feel like UDP buffer drops and TCP listen drops
tracepoints are very much justified to exist.
  
Jakub Kicinski July 14, 2023, 3:14 a.m. UTC | #7
On Thu, 13 Jul 2023 16:17:31 -0700 Ivan Babrou wrote:
> On Wed, Jul 12, 2023 at 10:42 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Wed, 12 Jul 2023 11:42:26 -0500 Yan Zhai wrote:  
> > >   The issue with kfree_skb is not that it fires too frequently (not in
> > > the 6.x kernel now). Rather, it is unable to locate the socket info
> > > when a SYN is dropped due to the accept queue being full. The sk is
> > > stolen upon inet lookup, e.g. in tcp_v4_rcv. This makes it unable to
> > > tell in kfree_skb which socket a SYN skb is targeting (when TPROXY or
> > > socket lookup are used). A tracepoint with sk information will be more
> > > useful to monitor accurately which service/socket is involved.  
> >
> > No doubt that kfree_skb isn't going to solve all our needs, but I'd
> > really like you to clean up the unnecessary callers on your systems
> > first, before adding further tracepoints. That way we'll have a clear
> > picture of which points can be solved by kfree_skb and where we need
> > further work.  
> 
> The existing UDP tracepoint was there for 12 years and it's a part of
> what kernel exposes to userspace, so I don't think it's fair to remove
> this and break its consumers. I think "do not break userspace" applies
> here. The proposed TCP tracepoint mostly mirrors it, so I think it's
> fair to have it.
> 
> I don't know why kfree_skb is called so much. I also don't agree with
> Yan that it's not actually too much, because it's a lot (especially
> compared with near zero for my proposed tracepoint). I can easily see
> 300-500k calls per second into it:
> 
> $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10
> #           time             counts unit events
>      1.000520165             10,108      skb:kfree_skb
>      2.010494526             11,178      skb:kfree_skb
>      3.075503743             10,770      skb:kfree_skb
>      4.122814843             11,334      skb:kfree_skb
>      5.128518432             12,020      skb:kfree_skb
>      6.176504094             11,117      skb:kfree_skb
>      7.201504214             12,753      skb:kfree_skb
>      8.229523643             10,566      skb:kfree_skb
>      9.326499044            365,239      skb:kfree_skb
>     10.002106098            313,105      skb:kfree_skb
> $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10
> #           time             counts unit events
>      1.000767744             52,240      skb:kfree_skb
>      2.069762695            508,310      skb:kfree_skb
>      3.102763492            417,895      skb:kfree_skb
>      4.142757608            385,981      skb:kfree_skb
>      5.190759795            430,154      skb:kfree_skb
>      6.243765384            405,707      skb:kfree_skb
>      7.290818228            362,934      skb:kfree_skb
>      8.297764298            336,702      skb:kfree_skb
>      9.314287243            353,039      skb:kfree_skb
>     10.002288423            251,414      skb:kfree_skb
> 
> Most of it is NOT_SPECIFIED (1s data from one CPU during a spike):
> 
> $ perf script | sed 's/.*skbaddr=//' | awk '{ print $NF }' | sort |
> uniq -c | sort -n | tail
>       1 TCP_CLOSE
>       2 NO_SOCKET
>       4 TCP_INVALID_SEQUENCE
>       4 TCP_RESET
>      13 TCP_OLD_DATA
>      14 NETFILTER_DROP
>    4594 NOT_SPECIFIED
> 
> We can start a separate discussion to break it down by category if it
> would help. Let me know what kind of information you would like us to
> provide to help with that. I assume you're interested in kernel stacks
> leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's
> something else.

Just the stacks.

> Even if I was only interested in one specific reason, I would still
> have to arm the whole tracepoint and route a ton of skbs I'm not
> interested in into my bpf code. This seems like a lot of overhead,
> especially if I'm dropping some attack packets.

That's what I meant with my drop vs exception comment. We already have
two tracepoints on the skb free path (free and consume), adding another
shouldn't rise too many eyebrows.

> Perhaps a lot of extra NOT_SPECIFIED stuff can be fixed and removed
> from kfree_skb. It's not something I can personally do as it requires
> much deeper network code understanding than I possess. For TCP we'll
> also have to add some extra reasons for kfree_skb, because currently
> it's all NOT_SPECIFIED (no reason set in the accept path):
> 
> * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_input.c#L6499
> * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_ipv4.c#L1749
> 
> For UDP we already have SKB_DROP_REASON_SOCKET_RCVBUFF, so I tried my
> best to implement what I wanted based on that. It's not very
> approachable, as you'd have to extract the destination port yourself
> from the raw skb. As Yan said, for TCP people often rely on skb->sk,
> which is just not present when the incoming SYN is dropped. I failed
> to find a good example of extracting a destination port that I could
> replicate. So far I have just a per-reason breakdown working:
> 
> * https://github.com/cloudflare/ebpf_exporter/pull/233
> 
> If you have an ebpf example that would help me extract the destination
> port from an skb in kfree_skb, I'd be interested in taking a look and
> trying to make it work.
> 
> The need to extract the protocol level information in ebpf is only
> making kfree_skb more expensive for the needs of catching rare cases
> when we run out of buffer space (UDP) or listen queue (TCP). These two
> cases are very common failure scenarios that people are interested in
> catching with straightforward tracepoints that can give them the
> needed information easily and cheaply.
> 
> I sympathize with the desire to keep the number of tracepoints in
> check, but I also feel like UDP buffer drops and TCP listen drops
> tracepoints are very much justified to exist.

I'm not completely opposed to the tracepoints where needed. It's more 
of trying to make sure we do due diligence on the existing solutions.
Or maybe not even due diligence as much as pay off some technical debt.
  
David Ahern July 14, 2023, 3:09 p.m. UTC | #8
On 7/13/23 5:17 PM, Ivan Babrou wrote:
> On Wed, Jul 12, 2023 at 10:42 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Wed, 12 Jul 2023 11:42:26 -0500 Yan Zhai wrote:
>>>   The issue with kfree_skb is not that it fires too frequently (not in
>>> the 6.x kernel now). Rather, it is unable to locate the socket info
>>> when a SYN is dropped due to the accept queue being full. The sk is
>>> stolen upon inet lookup, e.g. in tcp_v4_rcv. This makes it unable to
>>> tell in kfree_skb which socket a SYN skb is targeting (when TPROXY or
>>> socket lookup are used). A tracepoint with sk information will be more
>>> useful to monitor accurately which service/socket is involved.
>>
>> No doubt that kfree_skb isn't going to solve all our needs, but I'd
>> really like you to clean up the unnecessary callers on your systems
>> first, before adding further tracepoints. That way we'll have a clear
>> picture of which points can be solved by kfree_skb and where we need
>> further work.
> 
> The existing UDP tracepoint was there for 12 years and it's a part of
> what kernel exposes to userspace, so I don't think it's fair to remove
> this and break its consumers. I think "do not break userspace" applies
> here. The proposed TCP tracepoint mostly mirrors it, so I think it's
> fair to have it.
> 
> I don't know why kfree_skb is called so much. I also don't agree with
> Yan that it's not actually too much, because it's a lot (especially
> compared with near zero for my proposed tracepoint). I can easily see
> 300-500k calls per second into it:
> 
> $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10
> #           time             counts unit events
>      1.000520165             10,108      skb:kfree_skb
>      2.010494526             11,178      skb:kfree_skb
>      3.075503743             10,770      skb:kfree_skb
>      4.122814843             11,334      skb:kfree_skb
>      5.128518432             12,020      skb:kfree_skb
>      6.176504094             11,117      skb:kfree_skb
>      7.201504214             12,753      skb:kfree_skb
>      8.229523643             10,566      skb:kfree_skb
>      9.326499044            365,239      skb:kfree_skb
>     10.002106098            313,105      skb:kfree_skb
> $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10
> #           time             counts unit events
>      1.000767744             52,240      skb:kfree_skb
>      2.069762695            508,310      skb:kfree_skb
>      3.102763492            417,895      skb:kfree_skb
>      4.142757608            385,981      skb:kfree_skb
>      5.190759795            430,154      skb:kfree_skb
>      6.243765384            405,707      skb:kfree_skb
>      7.290818228            362,934      skb:kfree_skb
>      8.297764298            336,702      skb:kfree_skb
>      9.314287243            353,039      skb:kfree_skb
>     10.002288423            251,414      skb:kfree_skb
> 
> Most of it is NOT_SPECIFIED (1s data from one CPU during a spike):
> 
> $ perf script | sed 's/.*skbaddr=//' | awk '{ print $NF }' | sort |
> uniq -c | sort -n | tail
>       1 TCP_CLOSE
>       2 NO_SOCKET
>       4 TCP_INVALID_SEQUENCE
>       4 TCP_RESET
>      13 TCP_OLD_DATA
>      14 NETFILTER_DROP
>    4594 NOT_SPECIFIED
> 
> We can start a separate discussion to break it down by category if it
> would help. Let me know what kind of information you would like us to
> provide to help with that. I assume you're interested in kernel stacks
> leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's
> something else.

stack traces would be helpful.

> 
> Even if I was only interested in one specific reason, I would still
> have to arm the whole tracepoint and route a ton of skbs I'm not
> interested in into my bpf code. This seems like a lot of overhead,
> especially if I'm dropping some attack packets.

you can add a filter on the tracepoint event to limit what is passed
(although I have not tried the filter with an ebpf program - e.g.,
reason != NOT_SPECIFIED).

> 
> Perhaps a lot of extra NOT_SPECIFIED stuff can be fixed and removed
> from kfree_skb. It's not something I can personally do as it requires
> much deeper network code understanding than I possess. For TCP we'll
> also have to add some extra reasons for kfree_skb, because currently
> it's all NOT_SPECIFIED (no reason set in the accept path):
> 
> * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_input.c#L6499
> * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_ipv4.c#L1749
> 
> For UDP we already have SKB_DROP_REASON_SOCKET_RCVBUFF, so I tried my
> best to implement what I wanted based on that. It's not very
> approachable, as you'd have to extract the destination port yourself
> from the raw skb. As Yan said, for TCP people often rely on skb->sk,
> which is just not present when the incoming SYN is dropped. I failed
> to find a good example of extracting a destination port that I could
> replicate. So far I have just a per-reason breakdown working:
> 
> * https://github.com/cloudflare/ebpf_exporter/pull/233
> 
> If you have an ebpf example that would help me extract the destination
> port from an skb in kfree_skb, I'd be interested in taking a look and
> trying to make it work.

This is from 2020 and I forget which kernel version (pre-BTF), but it
worked at that time and allowed userspace to summarize drop reasons by
various network data (mac, L3 address, n-tuple, etc):

https://github.com/dsahern/bpf-progs/blob/master/ksrc/pktdrop.c

> 
> The need to extract the protocol level information in ebpf is only
> making kfree_skb more expensive for the needs of catching rare cases
> when we run out of buffer space (UDP) or listen queue (TCP). These two
> cases are very common failure scenarios that people are interested in
> catching with straightforward tracepoints that can give them the
> needed information easily and cheaply.
> 
> I sympathize with the desire to keep the number of tracepoints in
> check, but I also feel like UDP buffer drops and TCP listen drops
> tracepoints are very much justified to exist.

sure, kfree_skb is like the raw_syscall tracepoint - it can be more than
what you need for a specific problem, but it is also give you way more
than you are thinking about today.
  
Ivan Babrou July 14, 2023, 11:21 p.m. UTC | #9
On Thu, Jul 13, 2023 at 8:14 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > I don't know why kfree_skb is called so much. I also don't agree with
> > Yan that it's not actually too much, because it's a lot (especially
> > compared with near zero for my proposed tracepoint). I can easily see
> > 300-500k calls per second into it:
> >
> > $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10
> > #           time             counts unit events
> >      1.000520165             10,108      skb:kfree_skb
> >      2.010494526             11,178      skb:kfree_skb
> >      3.075503743             10,770      skb:kfree_skb
> >      4.122814843             11,334      skb:kfree_skb
> >      5.128518432             12,020      skb:kfree_skb
> >      6.176504094             11,117      skb:kfree_skb
> >      7.201504214             12,753      skb:kfree_skb
> >      8.229523643             10,566      skb:kfree_skb
> >      9.326499044            365,239      skb:kfree_skb
> >     10.002106098            313,105      skb:kfree_skb
> > $ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10
> > #           time             counts unit events
> >      1.000767744             52,240      skb:kfree_skb
> >      2.069762695            508,310      skb:kfree_skb
> >      3.102763492            417,895      skb:kfree_skb
> >      4.142757608            385,981      skb:kfree_skb
> >      5.190759795            430,154      skb:kfree_skb
> >      6.243765384            405,707      skb:kfree_skb
> >      7.290818228            362,934      skb:kfree_skb
> >      8.297764298            336,702      skb:kfree_skb
> >      9.314287243            353,039      skb:kfree_skb
> >     10.002288423            251,414      skb:kfree_skb
> >
> > Most of it is NOT_SPECIFIED (1s data from one CPU during a spike):
> >
> > $ perf script | sed 's/.*skbaddr=//' | awk '{ print $NF }' | sort |
> > uniq -c | sort -n | tail
> >       1 TCP_CLOSE
> >       2 NO_SOCKET
> >       4 TCP_INVALID_SEQUENCE
> >       4 TCP_RESET
> >      13 TCP_OLD_DATA
> >      14 NETFILTER_DROP
> >    4594 NOT_SPECIFIED
> >
> > We can start a separate discussion to break it down by category if it
> > would help. Let me know what kind of information you would like us to
> > provide to help with that. I assume you're interested in kernel stacks
> > leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's
> > something else.
>
> Just the stacks.

Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/

> > Even if I was only interested in one specific reason, I would still
> > have to arm the whole tracepoint and route a ton of skbs I'm not
> > interested in into my bpf code. This seems like a lot of overhead,
> > especially if I'm dropping some attack packets.
>
> That's what I meant with my drop vs exception comment. We already have
> two tracepoints on the skb free path (free and consume), adding another
> shouldn't rise too many eyebrows.

I'm a bit confused. Previously you said:

> Specifically what I'm wondering is whether we should also have
> a separation between policy / "firewall drops" and error / exception
> drops. Within the skb drop reason codes, I mean.

My understanding was that you proposed adding more SKB_DROP_REASON_*,
but now you seem to imply that we might want to add another
tracepoint. Could you clarify which path you have in mind?

We can add a few reasons that would satisfy my need by covering
whatever results into tcp_listendrop() calls today. The problem is:
unless we remove some other reasons from kfree_skb, adding more
reasons for firewall drops / exceptions wouldn't change the cost at
all. We'd still have the same number of calls into the tracepoint and
the condition to find "interesting" reasons would be the same:

if (reason == SKB_DROP_REASON_TCP_OVERFLOW_OR_SOMETHING)

It still seems very expensive to consume a firehose of kfree_skb just
to find some rare nuggets.

> > Perhaps a lot of extra NOT_SPECIFIED stuff can be fixed and removed
> > from kfree_skb. It's not something I can personally do as it requires
> > much deeper network code understanding than I possess. For TCP we'll
> > also have to add some extra reasons for kfree_skb, because currently
> > it's all NOT_SPECIFIED (no reason set in the accept path):
> >
> > * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_input.c#L6499
> > * https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_ipv4.c#L1749
> >
> > For UDP we already have SKB_DROP_REASON_SOCKET_RCVBUFF, so I tried my
> > best to implement what I wanted based on that. It's not very
> > approachable, as you'd have to extract the destination port yourself
> > from the raw skb. As Yan said, for TCP people often rely on skb->sk,
> > which is just not present when the incoming SYN is dropped. I failed
> > to find a good example of extracting a destination port that I could
> > replicate. So far I have just a per-reason breakdown working:
> >
> > * https://github.com/cloudflare/ebpf_exporter/pull/233
> >
> > If you have an ebpf example that would help me extract the destination
> > port from an skb in kfree_skb, I'd be interested in taking a look and
> > trying to make it work.
> >
> > The need to extract the protocol level information in ebpf is only
> > making kfree_skb more expensive for the needs of catching rare cases
> > when we run out of buffer space (UDP) or listen queue (TCP). These two
> > cases are very common failure scenarios that people are interested in
> > catching with straightforward tracepoints that can give them the
> > needed information easily and cheaply.
> >
> > I sympathize with the desire to keep the number of tracepoints in
> > check, but I also feel like UDP buffer drops and TCP listen drops
> > tracepoints are very much justified to exist.
>
> I'm not completely opposed to the tracepoints where needed. It's more
> of trying to make sure we do due diligence on the existing solutions.
> Or maybe not even due diligence as much as pay off some technical debt.

The arguments for a new tracepoint from my due diligence:

1. There are too many calls into kfree_skb. Maybe that's fixable to a
degree, but there would still be a big discrepancy in the number of
calls between my proposed targeted tracepoint and skb:kfree_skb. For
some setups this might not be feasible.
2. It's hard to extract the necessary information from skb without an
sk attached. Both from usability perspective (you need to parse the
network header yourself with a non-trivial amount of code) and from
cost perspective (parsing is not zero cost). The latter doesn't matter
as much if we target a specific reason.
  
Ivan Babrou July 14, 2023, 11:38 p.m. UTC | #10
On Fri, Jul 14, 2023 at 8:09 AM David Ahern <dsahern@kernel.org> wrote:
> > We can start a separate discussion to break it down by category if it
> > would help. Let me know what kind of information you would like us to
> > provide to help with that. I assume you're interested in kernel stacks
> > leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's
> > something else.
>
> stack traces would be helpful.

Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/

> > Even if I was only interested in one specific reason, I would still
> > have to arm the whole tracepoint and route a ton of skbs I'm not
> > interested in into my bpf code. This seems like a lot of overhead,
> > especially if I'm dropping some attack packets.
>
> you can add a filter on the tracepoint event to limit what is passed
> (although I have not tried the filter with an ebpf program - e.g.,
> reason != NOT_SPECIFIED).

Absolutely, but isn't there overhead to even do just that for every freed skb?

> > If you have an ebpf example that would help me extract the destination
> > port from an skb in kfree_skb, I'd be interested in taking a look and
> > trying to make it work.
>
> This is from 2020 and I forget which kernel version (pre-BTF), but it
> worked at that time and allowed userspace to summarize drop reasons by
> various network data (mac, L3 address, n-tuple, etc):
>
> https://github.com/dsahern/bpf-progs/blob/master/ksrc/pktdrop.c

It doesn't seem to extract the L4 metadata (local port specifically),
which is what I'm after.

> > The need to extract the protocol level information in ebpf is only
> > making kfree_skb more expensive for the needs of catching rare cases
> > when we run out of buffer space (UDP) or listen queue (TCP). These two
> > cases are very common failure scenarios that people are interested in
> > catching with straightforward tracepoints that can give them the
> > needed information easily and cheaply.
> >
> > I sympathize with the desire to keep the number of tracepoints in
> > check, but I also feel like UDP buffer drops and TCP listen drops
> > tracepoints are very much justified to exist.
>
> sure, kfree_skb is like the raw_syscall tracepoint - it can be more than
> what you need for a specific problem, but it is also give you way more
> than you are thinking about today.

I really like the comparison to raw_syscall tracepoint. There are two flavors:

1. Catch-all: raw_syscalls:sys_enter, which is similar to skb:kfree_skb.
2. Specific tracepoints: syscalls:sys_enter_* for every syscall.

If you are interested in one rare syscall, you wouldn't attach to a
catch-all firehose and the filter for id in post. Understandably, we
probably can't have a separate skb:kfree_skb for every reason.
However, some of them are more useful than others and I believe that
tcp listen drops fall into that category.

We went through a similar exercise with audit subsystem, which in fact
always arms all syscalls even if you audit one of them:

* https://lore.kernel.org/audit/20230523181624.19932-1-ivan@cloudflare.com/T/#u

With pictures, if you're interested:

* https://mastodon.ivan.computer/@mastodon/110426498281603668

Nobody audits futex(), but if you audit execve(), all the rules run
for both. Some rules will run faster, but all of them will run. It's a
lot of overhead with millions of CPUs, which I'm trying to avoid (the
planet is hot as it is).

Ultimately my arguments for a separate tracepoint for tcp listen drops
are at the bottom of my reply to Jakub:

* https://lore.kernel.org/netdev/CABWYdi2BGi=iRCfLhmQCqO=1eaQ1WaCG7F9WsJrz-7==ocZidg@mail.gmail.com/
  
David Ahern July 15, 2023, 1:30 a.m. UTC | #11
On 7/14/23 5:38 PM, Ivan Babrou wrote:
> On Fri, Jul 14, 2023 at 8:09 AM David Ahern <dsahern@kernel.org> wrote:
>>> We can start a separate discussion to break it down by category if it
>>> would help. Let me know what kind of information you would like us to
>>> provide to help with that. I assume you're interested in kernel stacks
>>> leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's
>>> something else.
>>
>> stack traces would be helpful.
> 
> Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/
> 
>>> Even if I was only interested in one specific reason, I would still
>>> have to arm the whole tracepoint and route a ton of skbs I'm not
>>> interested in into my bpf code. This seems like a lot of overhead,
>>> especially if I'm dropping some attack packets.
>>
>> you can add a filter on the tracepoint event to limit what is passed
>> (although I have not tried the filter with an ebpf program - e.g.,
>> reason != NOT_SPECIFIED).
> 
> Absolutely, but isn't there overhead to even do just that for every freed skb?

There is some amount of overhead. If filters can be used with ebpf
programs, then the differential cost is just the cycles for the filter
which in this case is an integer compare. Should be low - maybe Steven
has some data on the overhead?

> 
>>> If you have an ebpf example that would help me extract the destination
>>> port from an skb in kfree_skb, I'd be interested in taking a look and
>>> trying to make it work.
>>
>> This is from 2020 and I forget which kernel version (pre-BTF), but it
>> worked at that time and allowed userspace to summarize drop reasons by
>> various network data (mac, L3 address, n-tuple, etc):
>>
>> https://github.com/dsahern/bpf-progs/blob/master/ksrc/pktdrop.c
> 
> It doesn't seem to extract the L4 metadata (local port specifically),
> which is what I'm after.

This program takes the path of copy headers to userspace and does the
parsing there (there is a netmon program that uses that ebpf program
which shows drops for varying perspectives). You can just as easily do
the parsing in ebpf. Once you have the start of packet data, walk the
protocols of interest -- e.g., see parse_pkt in flow.h.

> 
>>> The need to extract the protocol level information in ebpf is only
>>> making kfree_skb more expensive for the needs of catching rare cases
>>> when we run out of buffer space (UDP) or listen queue (TCP). These two
>>> cases are very common failure scenarios that people are interested in
>>> catching with straightforward tracepoints that can give them the
>>> needed information easily and cheaply.
>>>
>>> I sympathize with the desire to keep the number of tracepoints in
>>> check, but I also feel like UDP buffer drops and TCP listen drops
>>> tracepoints are very much justified to exist.
>>
>> sure, kfree_skb is like the raw_syscall tracepoint - it can be more than
>> what you need for a specific problem, but it is also give you way more
>> than you are thinking about today.
> 
> I really like the comparison to raw_syscall tracepoint. There are two flavors:
> 
> 1. Catch-all: raw_syscalls:sys_enter, which is similar to skb:kfree_skb.
> 2. Specific tracepoints: syscalls:sys_enter_* for every syscall.
> 
> If you are interested in one rare syscall, you wouldn't attach to a
> catch-all firehose and the filter for id in post. Understandably, we
> probably can't have a separate skb:kfree_skb for every reason.

yea, I pushed for the 'reason' design rather than having a tracepoint
per drop location as it is way more scalable. There is a balance and I
believe that is what Kuba is pushing at here.

> However, some of them are more useful than others and I believe that
> tcp listen drops fall into that category.
> 
> We went through a similar exercise with audit subsystem, which in fact
> always arms all syscalls even if you audit one of them:
> 
> * https://lore.kernel.org/audit/20230523181624.19932-1-ivan@cloudflare.com/T/#u
> 
> With pictures, if you're interested:
> 
> * https://mastodon.ivan.computer/@mastodon/110426498281603668
> 
> Nobody audits futex(), but if you audit execve(), all the rules run
> for both. Some rules will run faster, but all of them will run. It's a
> lot of overhead with millions of CPUs, which I'm trying to avoid (the
> planet is hot as it is).
> 
> Ultimately my arguments for a separate tracepoint for tcp listen drops
> are at the bottom of my reply to Jakub:
> 
> * https://lore.kernel.org/netdev/CABWYdi2BGi=iRCfLhmQCqO=1eaQ1WaCG7F9WsJrz-7==ocZidg@mail.gmail.com/
  
Jakub Kicinski July 18, 2023, 9:57 p.m. UTC | #12
On Fri, 14 Jul 2023 16:21:08 -0700 Ivan Babrou wrote:
> > Just the stacks.  
> 
> Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/

Thanks! I'll follow the discussion there. Just the one remaining
clarification here:

> > > Even if I was only interested in one specific reason, I would still
> > > have to arm the whole tracepoint and route a ton of skbs I'm not
> > > interested in into my bpf code. This seems like a lot of overhead,
> > > especially if I'm dropping some attack packets.  
> >
> > That's what I meant with my drop vs exception comment. We already have
> > two tracepoints on the skb free path (free and consume), adding another
> > shouldn't rise too many eyebrows.  
> 
> I'm a bit confused. Previously you said:
> 
> > Specifically what I'm wondering is whether we should also have
> > a separation between policy / "firewall drops" and error / exception
> > drops. Within the skb drop reason codes, I mean.  
> 
> My understanding was that you proposed adding more SKB_DROP_REASON_*,
> but now you seem to imply that we might want to add another
> tracepoint. Could you clarify which path you have in mind?

What I had in mind was sorting the drop reasons to be able to easily
distinguish policy drops from error drops.

> We can add a few reasons that would satisfy my need by covering
> whatever results into tcp_listendrop() calls today. The problem is:
> unless we remove some other reasons from kfree_skb, adding more
> reasons for firewall drops / exceptions wouldn't change the cost at
> all. We'd still have the same number of calls into the tracepoint and
> the condition to find "interesting" reasons would be the same:
> 
> if (reason == SKB_DROP_REASON_TCP_OVERFLOW_OR_SOMETHING)
> 
> It still seems very expensive to consume a firehose of kfree_skb just
> to find some rare nuggets.

Let me show you a quick mock-up of a diff:

diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
index a2b953b57689..86ee70fcf540 100644
--- a/include/net/dropreason-core.h
+++ b/include/net/dropreason-core.h
@@ -5,12 +5,18 @@
 
 #define DEFINE_DROP_REASON(FN, FNe)	\
 	FN(NOT_SPECIFIED)		\
+	/* Policy-driven/intentional drops: */	\
+	FN(NETFILTER_DROP)		\
+	FN(BPF_CGROUP_EGRESS)		\
+	FN(TC_INGRESS)			\
+	FN(TC_EGRESS)			\
+	FN(XDP)				\
+	/* Errors: */			\
 	FN(NO_SOCKET)			\
 	FN(PKT_TOO_SMALL)		\
 	FN(TCP_CSUM)			\
 	FN(SOCKET_FILTER)		\
 	FN(UDP_CSUM)			\
-	FN(NETFILTER_DROP)		\
 	FN(OTHERHOST)			\
 	FN(IP_CSUM)			\
 	FN(IP_INHDR)			\
@@ -41,17 +47,13 @@
 	FN(TCP_OFO_QUEUE_PRUNE)		\
 	FN(TCP_OFO_DROP)		\
 	FN(IP_OUTNOROUTES)		\
-	FN(BPF_CGROUP_EGRESS)		\
 	FN(IPV6DISABLED)		\
 	FN(NEIGH_CREATEFAIL)		\
 	FN(NEIGH_FAILED)		\
 	FN(NEIGH_QUEUEFULL)		\
 	FN(NEIGH_DEAD)			\
-	FN(TC_EGRESS)			\
 	FN(QDISC_DROP)			\
 	FN(CPU_BACKLOG)			\
-	FN(XDP)				\
-	FN(TC_INGRESS)			\
 	FN(UNHANDLED_PROTO)		\
 	FN(SKB_CSUM)			\
 	FN(SKB_GSO_SEG)			\
@@ -80,6 +82,8 @@
 	FN(IPV6_NDISC_NS_OTHERHOST)	\
 	FNe(MAX)
 
+#define	__SKB_POLICY_DROP_END	SKB_DROP_REASON_NO_SOCKET
+
 /**
  * enum skb_drop_reason - the reasons of skb drops
  *
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6c5915efbc17..a36c498eb693 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1031,6 +1031,8 @@ bool __kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
 
 	if (reason == SKB_CONSUMED)
 		trace_consume_skb(skb, __builtin_return_address(0));
+	else if (reason < __SKB_POLICY_DROP_END)
+		trace_drop_skb(skb, __builtin_return_address(0), reason);
 	else
 		trace_kfree_skb(skb, __builtin_return_address(0), reason);
 	return true;
  
Ivan Babrou July 18, 2023, 10:10 p.m. UTC | #13
On Fri, Jul 14, 2023 at 6:30 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 7/14/23 5:38 PM, Ivan Babrou wrote:
> > On Fri, Jul 14, 2023 at 8:09 AM David Ahern <dsahern@kernel.org> wrote:
> >>> We can start a separate discussion to break it down by category if it
> >>> would help. Let me know what kind of information you would like us to
> >>> provide to help with that. I assume you're interested in kernel stacks
> >>> leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's
> >>> something else.
> >>
> >> stack traces would be helpful.
> >
> > Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/
> >
> >>> Even if I was only interested in one specific reason, I would still
> >>> have to arm the whole tracepoint and route a ton of skbs I'm not
> >>> interested in into my bpf code. This seems like a lot of overhead,
> >>> especially if I'm dropping some attack packets.
> >>
> >> you can add a filter on the tracepoint event to limit what is passed
> >> (although I have not tried the filter with an ebpf program - e.g.,
> >> reason != NOT_SPECIFIED).
> >
> > Absolutely, but isn't there overhead to even do just that for every freed skb?
>
> There is some amount of overhead. If filters can be used with ebpf
> programs, then the differential cost is just the cycles for the filter
> which in this case is an integer compare. Should be low - maybe Steven
> has some data on the overhead?

I updated my benchmarks and added two dimensions:

* Empty probe that just returns immediately (simple and complex map
increments were already there)
* Tracepoint probe (fprobe and kprobe were already there)

The results are here:

* https://github.com/cloudflare/ebpf_exporter/tree/master/benchmark

It looks like we can expect an empty tracepoint probe to finish in
~15ns. At least that's what I see on my M1 laptop in a VM running
v6.5-rc1.

15ns x 400k calls/s = 6ms/s or 0.6% of a single CPU if all you do is
nothing (which is the likely outcome) for kfree_skb tracepoint.

I guess it's not as terrible as I expected, which is good news.

> >>> If you have an ebpf example that would help me extract the destination
> >>> port from an skb in kfree_skb, I'd be interested in taking a look and
> >>> trying to make it work.
> >>
> >> This is from 2020 and I forget which kernel version (pre-BTF), but it
> >> worked at that time and allowed userspace to summarize drop reasons by
> >> various network data (mac, L3 address, n-tuple, etc):
> >>
> >> https://github.com/dsahern/bpf-progs/blob/master/ksrc/pktdrop.c
> >
> > It doesn't seem to extract the L4 metadata (local port specifically),
> > which is what I'm after.
>
> This program takes the path of copy headers to userspace and does the
> parsing there (there is a netmon program that uses that ebpf program
> which shows drops for varying perspectives). You can just as easily do
> the parsing in ebpf. Once you have the start of packet data, walk the
> protocols of interest -- e.g., see parse_pkt in flow.h.

I see, thanks. I want to do all the aggregation in the kernel, so I
took a stab at that. With a lot of trial and error I was able to come
up with the following:

* https://github.com/cloudflare/ebpf_exporter/pull/235

Some points from my experience doing that:

* It was a lot harder to get it working than the tracepoint I
proposed. There are few examples of parsing skb in the kernel in bpf
and none do what I wanted to do.
* It is unclear whether this would work with vlan or other
encapsulation. Your code has special handling for vlan. As a user I
just want the L4 port.
* There's a lot more boilerplate code to get to L4 info. Accessing sk
is a lot easier.
* It's not very useful without adding the reasons that would
correspond to listen drops in TCP. I'm not sure if I'm up to the task,
but I can give it a shot.
* It's unclear how to detect which end of the socket is bound locally.
I want to know which ports that are listened on locally are
experiencing issues, ignoring sockets that connect elsewhere. E.g. I
care about my http server dropping packets, but not as much about curl
doing the same.
* UDP drops seem to work okay in my local testing, I can see
SKB_DROP_REASON_SOCKET_RCVBUFF by port.

As a reminder, the code for my tracepoint is here:

* https://github.com/cloudflare/ebpf_exporter/pull/221

It's a lot simpler. I still feel that it's justified to exist.

Hope this helps.
  
Ivan Babrou July 18, 2023, 10:11 p.m. UTC | #14
On Tue, Jul 18, 2023 at 2:57 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 14 Jul 2023 16:21:08 -0700 Ivan Babrou wrote:
> > > Just the stacks.
> >
> > Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/
>
> Thanks! I'll follow the discussion there. Just the one remaining
> clarification here:
>
> > > > Even if I was only interested in one specific reason, I would still
> > > > have to arm the whole tracepoint and route a ton of skbs I'm not
> > > > interested in into my bpf code. This seems like a lot of overhead,
> > > > especially if I'm dropping some attack packets.
> > >
> > > That's what I meant with my drop vs exception comment. We already have
> > > two tracepoints on the skb free path (free and consume), adding another
> > > shouldn't rise too many eyebrows.
> >
> > I'm a bit confused. Previously you said:
> >
> > > Specifically what I'm wondering is whether we should also have
> > > a separation between policy / "firewall drops" and error / exception
> > > drops. Within the skb drop reason codes, I mean.
> >
> > My understanding was that you proposed adding more SKB_DROP_REASON_*,
> > but now you seem to imply that we might want to add another
> > tracepoint. Could you clarify which path you have in mind?
>
> What I had in mind was sorting the drop reasons to be able to easily
> distinguish policy drops from error drops.
>
> > We can add a few reasons that would satisfy my need by covering
> > whatever results into tcp_listendrop() calls today. The problem is:
> > unless we remove some other reasons from kfree_skb, adding more
> > reasons for firewall drops / exceptions wouldn't change the cost at
> > all. We'd still have the same number of calls into the tracepoint and
> > the condition to find "interesting" reasons would be the same:
> >
> > if (reason == SKB_DROP_REASON_TCP_OVERFLOW_OR_SOMETHING)
> >
> > It still seems very expensive to consume a firehose of kfree_skb just
> > to find some rare nuggets.
>
> Let me show you a quick mock-up of a diff:
>
> diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> index a2b953b57689..86ee70fcf540 100644
> --- a/include/net/dropreason-core.h
> +++ b/include/net/dropreason-core.h
> @@ -5,12 +5,18 @@
>
>  #define DEFINE_DROP_REASON(FN, FNe)    \
>         FN(NOT_SPECIFIED)               \
> +       /* Policy-driven/intentional drops: */  \
> +       FN(NETFILTER_DROP)              \
> +       FN(BPF_CGROUP_EGRESS)           \
> +       FN(TC_INGRESS)                  \
> +       FN(TC_EGRESS)                   \
> +       FN(XDP)                         \
> +       /* Errors: */                   \
>         FN(NO_SOCKET)                   \
>         FN(PKT_TOO_SMALL)               \
>         FN(TCP_CSUM)                    \
>         FN(SOCKET_FILTER)               \
>         FN(UDP_CSUM)                    \
> -       FN(NETFILTER_DROP)              \
>         FN(OTHERHOST)                   \
>         FN(IP_CSUM)                     \
>         FN(IP_INHDR)                    \
> @@ -41,17 +47,13 @@
>         FN(TCP_OFO_QUEUE_PRUNE)         \
>         FN(TCP_OFO_DROP)                \
>         FN(IP_OUTNOROUTES)              \
> -       FN(BPF_CGROUP_EGRESS)           \
>         FN(IPV6DISABLED)                \
>         FN(NEIGH_CREATEFAIL)            \
>         FN(NEIGH_FAILED)                \
>         FN(NEIGH_QUEUEFULL)             \
>         FN(NEIGH_DEAD)                  \
> -       FN(TC_EGRESS)                   \
>         FN(QDISC_DROP)                  \
>         FN(CPU_BACKLOG)                 \
> -       FN(XDP)                         \
> -       FN(TC_INGRESS)                  \
>         FN(UNHANDLED_PROTO)             \
>         FN(SKB_CSUM)                    \
>         FN(SKB_GSO_SEG)                 \
> @@ -80,6 +82,8 @@
>         FN(IPV6_NDISC_NS_OTHERHOST)     \
>         FNe(MAX)
>
> +#define        __SKB_POLICY_DROP_END   SKB_DROP_REASON_NO_SOCKET
> +
>  /**
>   * enum skb_drop_reason - the reasons of skb drops
>   *
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 6c5915efbc17..a36c498eb693 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1031,6 +1031,8 @@ bool __kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
>
>         if (reason == SKB_CONSUMED)
>                 trace_consume_skb(skb, __builtin_return_address(0));
> +       else if (reason < __SKB_POLICY_DROP_END)
> +               trace_drop_skb(skb, __builtin_return_address(0), reason);
>         else
>                 trace_kfree_skb(skb, __builtin_return_address(0), reason);
>         return true;

I see what you mean now. This makes sense.
  

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 226bce6d1e8c..810ad606641f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -46,6 +46,7 @@ 
 #include <linux/bpf-cgroup.h>
 #include <linux/siphash.h>
 #include <linux/net_mm.h>
+#include <linux/tracepoint-defs.h>
 
 extern struct inet_hashinfo tcp_hashinfo;
 
@@ -2259,6 +2260,10 @@  static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
 		WRITE_ONCE(tp->data_segs_in, tp->data_segs_in + segs_in);
 }
 
+DECLARE_TRACEPOINT(tcp_listen_queue_drop);
+
+void do_trace_tcp_listen_queue_drop_wrapper(const struct sock *sk);
+
 /*
  * TCP listen path runs lockless.
  * We forced "struct sock" to be const qualified to make sure
@@ -2270,6 +2275,8 @@  static inline void tcp_listendrop(const struct sock *sk)
 {
 	atomic_inc(&((struct sock *)sk)->sk_drops);
 	__NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENDROPS);
+	if (tracepoint_enabled(tcp_listen_queue_drop))
+		do_trace_tcp_listen_queue_drop_wrapper(sk);
 }
 
 enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer);
diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index bf06db8d2046..646ad0bbd378 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -416,6 +416,52 @@  TRACE_EVENT(tcp_cong_state_set,
 		  __entry->cong_state)
 );
 
+TRACE_EVENT(tcp_listen_queue_drop,
+
+	TP_PROTO(const struct sock *sk),
+
+	TP_ARGS(sk),
+
+	TP_STRUCT__entry(
+		__field(const void *, skaddr)
+		__field(__u16, sport)
+		__array(__u8, saddr, 4)
+		__array(__u8, saddr_v6, 16)
+		__field(__u32, sk_max_ack_backlog)
+		__field(__u32, sk_ack_backlog)
+	),
+
+	TP_fast_assign(
+		const struct inet_sock *inet = inet_sk(sk);
+		struct in6_addr *pin6;
+		__be32 *p32;
+
+		__entry->skaddr = sk;
+
+		__entry->sport = ntohs(inet->inet_sport);
+
+		p32 = (__be32 *) __entry->saddr;
+		*p32 = inet->inet_saddr;
+
+		pin6 = (struct in6_addr *)__entry->saddr_v6;
+#if IS_ENABLED(CONFIG_IPV6)
+		if (sk->sk_family == AF_INET6)
+			*pin6 = sk->sk_v6_rcv_saddr;
+		else
+			ipv6_addr_set_v4mapped(inet->inet_saddr, pin6);
+#else
+		ipv6_addr_set_v4mapped(inet->inet_saddr, pin6);
+#endif
+
+		__entry->sk_max_ack_backlog = READ_ONCE(sk->sk_max_ack_backlog);
+		__entry->sk_ack_backlog = READ_ONCE(sk->sk_ack_backlog);
+	),
+
+	TP_printk("sport=%hu saddr=%pI4 saddrv6=%pI6c sk_max_ack_backlog=%d sk_ack_backlog=%d",
+		__entry->sport, __entry->saddr, __entry->saddr_v6,
+		__entry->sk_max_ack_backlog, __entry->sk_ack_backlog)
+);
+
 #endif /* _TRACE_TCP_H */
 
 /* This part must be outside protection */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e03e08745308..29ecbc5248c3 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -276,6 +276,8 @@ 
 #include <net/ip.h>
 #include <net/sock.h>
 
+#include <trace/events/tcp.h>
+
 #include <linux/uaccess.h>
 #include <asm/ioctls.h>
 #include <net/busy_poll.h>
@@ -1697,6 +1699,11 @@  int tcp_peek_len(struct socket *sock)
 }
 EXPORT_SYMBOL(tcp_peek_len);
 
+void do_trace_tcp_listen_queue_drop_wrapper(const struct sock *sk)
+{
+	trace_tcp_listen_queue_drop(sk);
+}
+
 /* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
 int tcp_set_rcvlowat(struct sock *sk, int val)
 {