[net-next] ipv6: fix routing cache overflow for raw sockets

Message ID 20221218234801.579114-1-jmaxwell37@gmail.com
State New
Series [net-next] ipv6: fix routing cache overflow for raw sockets

Commit Message

Jon Maxwell Dec. 18, 2022, 11:48 p.m. UTC
Sending IPv6 packets in a loop via a raw socket triggers an issue where a
route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
consumes the IPv6 max_size threshold, which defaults to 4096, resulting in
these warnings:

[1]   99.187805] dst_alloc: 7728 callbacks suppressed
[2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
.
.
[300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.

When this happens the packet is dropped and sendto() fails with a
"network is unreachable" error (errno 101, ENETUNREACH):

# ./a.out -s 

remaining pkt 200557 errno 101
remaining pkt 196462 errno 101
.
.
remaining pkt 126821 errno 101

Fix this by adding a flag that prevents the cloning of routes for raw
sockets. This makes the IPv6 routing code use per-cpu routes instead, which
prevents packet drops due to max_size overflow.

IPv4 is not affected because it has a very large default max_size.

Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
---
 include/net/flow.h | 1 +
 net/ipv6/raw.c     | 2 +-
 net/ipv6/route.c   | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)
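
A minimal reproducer along the lines of the a.out program referenced
above might look like this (hypothetical sketch reconstructed from the
thread; the real test program is not included here):

#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Opens an IPv6 raw socket and sends packets in a loop; each send takes
 * the ip6_rt_cache_alloc() path and clones a route. With IPPROTO_RAW the
 * kernel expects the payload to be a full IPv6 packet (hdrincl); building
 * that header is elided here. The destination address is a placeholder.
 */
int main(void)
{
	struct sockaddr_in6 dst = { .sin6_family = AF_INET6 };
	char pkt[64] = { 0 };	/* would be a well-formed IPv6 packet */
	long remaining = 5000000;
	int fd = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	inet_pton(AF_INET6, "2001:db8::1", &dst.sin6_addr);

	while (remaining-- > 0) {
		if (sendto(fd, pkt, sizeof(pkt), 0,
			   (struct sockaddr *)&dst, sizeof(dst)) < 0)
			fprintf(stderr, "remaining pkt %ld errno %d\n",
				remaining, errno);
	}
	close(fd);
	return 0;
}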
  

Comments

Paolo Abeni Dec. 20, 2022, 12:35 p.m. UTC | #1
On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> Sending Ipv6 packets in a loop via a raw socket triggers an issue where a 
> route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly 
> consumes the Ipv6 max_size threshold which defaults to 4096 resulting in 
> these warnings:
> 
> [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> .
> .
> [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.

If I read correctly, the maximum number of dst that the raw socket can
use this way is limited by the number of packets it allows via the
sndbuf limit, right?

Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
ipvs, seg6?

@DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?

Thanks,

Paolo
  
David Ahern Dec. 20, 2022, 3:10 p.m. UTC | #2
On 12/20/22 5:35 AM, Paolo Abeni wrote:
> On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
>> Sending Ipv6 packets in a loop via a raw socket triggers an issue where a 
>> route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly 
>> consumes the Ipv6 max_size threshold which defaults to 4096 resulting in 
>> these warnings:
>>
>> [1]   99.187805] dst_alloc: 7728 callbacks suppressed
>> [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
>> .
>> .
>> [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> 
> If I read correctly, the maximum number of dst that the raw socket can
> use this way is limited by the number of packets it allows via the
> sndbuf limit, right?
> 
> Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> ipvs, seg6?
> 
> @DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?
> 
> Thanks,
> 
> Paolo
> 

If I recall the details correctly: that sysctl limit was added back when
ipv6 routes were managed as dst_entries and there was a desire to allow
an admin to limit the memory consumed. At this point in time, IPv6 is
more in line with IPv4 - fib entries use a separate struct from dst
entries. That "Route cache is full" message is now out of date, since
these are dst_entries, which have a gc mechanism.

IPv4 does not limit the number of dst_entries that can be allocated
(ip_rt_max_size is the sysctl variable behind the ipv4 version of
max_size and it is a no-op). IPv6 can probably do the same here?

I do not believe the suggested flag is the right change.
  
Julian Anastasov Dec. 20, 2022, 3:17 p.m. UTC | #3
Hello,

On Tue, 20 Dec 2022, Paolo Abeni wrote:

> On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > Sending Ipv6 packets in a loop via a raw socket triggers an issue where a 
> > route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly 
> > consumes the Ipv6 max_size threshold which defaults to 4096 resulting in 
> > these warnings:
> > 
> > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > .
> > .
> > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> 
> If I read correctly, the maximum number of dst that the raw socket can
> use this way is limited by the number of packets it allows via the
> sndbuf limit, right?
> 
> Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> ipvs, seg6?

	For IPVS there is no sndbuf limit. IPVS uses this flag
when receiving packets from the world (destined to some local Virtual
IP) and then diverts/redirects the packets (without changing daddr)
to one of its backend servers on the LAN (no RTF_GATEWAY on such routes).
So, for each packet IPVS requests an output route with the
FLOWI_FLAG_KNOWN_NH flag and then sends the packet to the backend server
(nexthop) using this route attached to the skb. The packet rate is usually
high. The goal is for the nexthop to be used from the route, not from the
IP header. KNOWN_NH means "nexthop is provided in the route, not in daddr".
As for the implementation details in ipv6, I cannot comment. But all users
that set the flag want this: to send packets where daddr can be != nexthop.
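
	For illustration only (not actual IPVS code), a minimal sketch of
such an output route request, assuming the usual flowi6 setup with a
hypothetical nexthop address and egress device:

	/* Ask for an output route whose nexthop (daddr here) is the
	 * backend server, which may differ from the packet's own
	 * destination address; FLOWI_FLAG_KNOWN_NH tells the routing
	 * code that the nexthop comes from this route.
	 */
	struct flowi6 fl6 = {
		.flowi6_oif	= dev->ifindex,	/* hypothetical egress device */
		.daddr		= *nexthop,	/* backend server address */
		.flowi6_flags	= FLOWI_FLAG_KNOWN_NH,
	};
	struct dst_entry *dst = ip6_route_output(net, NULL, &fl6);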

> @DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?

Regards

--
Julian Anastasov <ja@ssi.bg>
  
Julian Anastasov Dec. 20, 2022, 3:41 p.m. UTC | #4
Hello,

On Tue, 20 Dec 2022, Paolo Abeni wrote:

> Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> ipvs, seg6?

	I forgot to mention one thing: IPVS can cache such routes in
its own storage, one per backend server, and it still calls dst->ops->check
for them. So, such a route can live for a long time, which is why they were
created as uncached. IPVS requests one route, remembers it and then
can attach it to multiple packets for this backend server with
skb_dst_set_noref. So, IPVS would have to use 4096 backend servers to
hit this limit.

	It does not look correct in this patch to invalidate the
FLOWI_FLAG_KNOWN_NH flag with a FLOWI_FLAG_SKIP_RAW flag. That would be
the same as not setting FLOWI_FLAG_KNOWN_NH at all, which is wrong for
the hdrincl case.
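
	For context, a simplified sketch of the hdrincl handling in
rawv6_sendmsg() (net/ipv6/raw.c), which is where raw sockets pick up
this flag:

	/* With IPV6_HDRINCL the caller supplies the IPv6 header, so
	 * the nexthop must be taken from the route, not from daddr.
	 */
	if (hdrincl)
		fl6.flowi6_flags |= FLOWI_FLAG_KNOWN_NH;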

Regards

--
Julian Anastasov <ja@ssi.bg>
  
Jon Maxwell Dec. 20, 2022, 9:48 p.m. UTC | #5
On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> > route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> > these warnings:
> >
> > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > .
> > .
> > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
>
> If I read correctly, the maximum number of dst that the raw socket can
> use this way is limited by the number of packets it allows via the
> sndbuf limit, right?
>

Yes, but in my test the sndbuf limit is never hit, so it clones a route
for every packet.

e.g:

output from C program sending 5000000 packets via a raw socket.

ip raw: total num pkts 5000000

# bpftrace -e 'kprobe:dst_alloc {@count[comm] = count()}'
Attaching 1 probe...

@count[a.out]: 5000009

> Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> ipvs, seg6?
>

Any call into ip6_pol_route() where res.nh->fib_nh_gw_family is 0 can do it.
But we have only seen this for raw sockets so far.

Regards

Jon

> @DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?
>
> Thanks,
>
> Paolo
>
  
Jon Maxwell Dec. 20, 2022, 9:55 p.m. UTC | #6
On Wed, Dec 21, 2022 at 2:10 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 12/20/22 5:35 AM, Paolo Abeni wrote:
> > On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> >> Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> >> route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> >> consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> >> these warnings:
> >>
> >> [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> >> [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> >> .
> >> .
> >> [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> >
> > If I read correctly, the maximum number of dst that the raw socket can
> > use this way is limited by the number of packets it allows via the
> > sndbuf limit, right?
> >
> > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > ipvs, seg6?
> >
> > @DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?
> >
> > Thanks,
> >
> > Paolo
> >
>
> If I recall the details correctly: that sysctl limit was added back when
> ipv6 routes were managed as dst_entries and there was a desire to allow
> an admin to limit the memory consumed. At this point in time, IPv6 is
> more inline with IPv4 - a separate struct for fib entries from dst
> entries. That "Route cache is full" message is now out of date since
> this is dst_entries which have a gc mechanism.
>
> IPv4 does not limit the number of dst_entries that can be allocated
> (ip_rt_max_size is the sysctl variable behind the ipv4 version of
> max_size and it is a no-op). IPv6 can probably do the same here?
>

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index dbc224023977..701aba7feaf5 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6470,7 +6470,7 @@ static int __net_init ip6_route_net_init(struct net *net)
 #endif

        net->ipv6.sysctl.flush_delay = 0;
-       net->ipv6.sysctl.ip6_rt_max_size = 4096;
+       net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
        net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
        net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
        net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;

The above patch resolved it for the IPv6 reproducer.

Would that be sufficient?

> I do not believe the suggested flag is the right change.

Regards

Jon
  
Jon Maxwell Dec. 21, 2022, 4:31 a.m. UTC | #7
On Wed, Dec 21, 2022 at 8:55 AM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
>
> On Wed, Dec 21, 2022 at 2:10 AM David Ahern <dsahern@kernel.org> wrote:
> >
> > On 12/20/22 5:35 AM, Paolo Abeni wrote:
> > > On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > >> Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> > >> route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > >> consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> > >> these warnings:
> > >>
> > >> [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > >> [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > >> .
> > >> .
> > >> [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > >
> > > If I read correctly, the maximum number of dst that the raw socket can
> > > use this way is limited by the number of packets it allows via the
> > > sndbuf limit, right?
> > >
> > > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > > ipvs, seg6?
> > >
> > > @DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?
> > >
> > > Thanks,
> > >
> > > Paolo
> > >
> >
> > If I recall the details correctly: that sysctl limit was added back when
> > ipv6 routes were managed as dst_entries and there was a desire to allow
> > an admin to limit the memory consumed. At this point in time, IPv6 is
> > more inline with IPv4 - a separate struct for fib entries from dst
> > entries. That "Route cache is full" message is now out of date since
> > this is dst_entries which have a gc mechanism.
> >
> > IPv4 does not limit the number of dst_entries that can be allocated
> > (ip_rt_max_size is the sysctl variable behind the ipv4 version of
> > max_size and it is a no-op). IPv6 can probably do the same here?
> >
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index dbc224023977..701aba7feaf5 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -6470,7 +6470,7 @@ static int __net_init ip6_route_net_init(struct net *net)
>  #endif
>
>         net->ipv6.sysctl.flush_delay = 0;
> -       net->ipv6.sysctl.ip6_rt_max_size = 4096;
> +       net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
>         net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
>         net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
>         net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
>
> The above patch resolved it for the Ipv6 reproducer.
>
> Would that be sufficient?
>

Otherwise, if you prefer to make IPv6 behaviour similar to IPv4 rather
than upping max_size:

Here is a prototype patch that removes the max_size check for IPv6:

diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
index 88ff7bb2bb9b..632086b2f644 100644
--- a/include/net/dst_ops.h
+++ b/include/net/dst_ops.h
@@ -16,7 +16,7 @@ struct dst_ops {
        unsigned short          family;
        unsigned int            gc_thresh;

-       int                     (*gc)(struct dst_ops *ops);
+       void                    (*gc)(struct dst_ops *ops);
        struct dst_entry *      (*check)(struct dst_entry *, __u32 cookie);
        unsigned int            (*default_advmss)(const struct dst_entry *);
        unsigned int            (*mtu)(const struct dst_entry *);
diff --git a/net/core/dst.c b/net/core/dst.c
index 497ef9b3fc6a..dcb85267bc4c 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,

        if (ops->gc &&
            !(flags & DST_NOCOUNT) &&
-           dst_entries_get_fast(ops) > ops->gc_thresh) {
-               if (ops->gc(ops)) {
-               pr_notice_ratelimited("Route cache is full: consider increasing sysctl net.ipv6.route.max_size.\n");
-                       return NULL;
-               }
-       }
+           dst_entries_get_fast(ops) > ops->gc_thresh)
+               ops->gc(ops);

        dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
        if (!dst)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index dbc224023977..8db7c5436da4 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct dst_entry *);
 static void            ip6_dst_destroy(struct dst_entry *);
 static void            ip6_dst_ifdown(struct dst_entry *,
                                       struct net_device *dev, int how);
-static int              ip6_dst_gc(struct dst_ops *ops);
+static void             ip6_dst_gc(struct dst_ops *ops);

 static int             ip6_pkt_discard(struct sk_buff *skb);
 static int             ip6_pkt_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb);
@@ -3295,32 +3295,21 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
        return dst;
 }

-static int ip6_dst_gc(struct dst_ops *ops)
+static void ip6_dst_gc(struct dst_ops *ops)
 {
        struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
-       int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
-       int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
        int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
        int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
-       unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
        int entries;

        entries = dst_entries_get_fast(ops);
-       if (entries > rt_max_size)
-               entries = dst_entries_get_slow(ops);
-
-       if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
-           entries <= rt_max_size)
-               goto out;

        net->ipv6.ip6_rt_gc_expire++;
        fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net, true);
        entries = dst_entries_get_slow(ops);
        if (entries < ops->gc_thresh)
                net->ipv6.ip6_rt_gc_expire = rt_gc_timeout>>1;
-out:
        net->ipv6.ip6_rt_gc_expire -= net->ipv6.ip6_rt_gc_expire>>rt_elasticity;
-       return entries > rt_max_size;
 }

 static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,

> > I do not believe the suggested flag is the right change.
>
> Regards
>
> Jon
  
Jon Maxwell Dec. 22, 2022, 5:39 a.m. UTC | #8
On Wed, Dec 21, 2022 at 3:31 PM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
>
> On Wed, Dec 21, 2022 at 8:55 AM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
> >
> > On Wed, Dec 21, 2022 at 2:10 AM David Ahern <dsahern@kernel.org> wrote:
> > >
> > > On 12/20/22 5:35 AM, Paolo Abeni wrote:
> > > > On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > > >> Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> > > >> route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > > >> consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> > > >> these warnings:
> > > >>
> > > >> [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > >> [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > >> .
> > > >> .
> > > >> [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > >
> > > > If I read correctly, the maximum number of dst that the raw socket can
> > > > use this way is limited by the number of packets it allows via the
> > > > sndbuf limit, right?
> > > >
> > > > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > > > ipvs, seg6?
> > > >
> > > > @DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?
> > > >
> > > > Thanks,
> > > >
> > > > Paolo
> > > >
> > >
> > > If I recall the details correctly: that sysctl limit was added back when
> > > ipv6 routes were managed as dst_entries and there was a desire to allow
> > > an admin to limit the memory consumed. At this point in time, IPv6 is
> > > more inline with IPv4 - a separate struct for fib entries from dst
> > > entries. That "Route cache is full" message is now out of date since
> > > this is dst_entries which have a gc mechanism.
> > >
> > > IPv4 does not limit the number of dst_entries that can be allocated
> > > (ip_rt_max_size is the sysctl variable behind the ipv4 version of
> > > max_size and it is a no-op). IPv6 can probably do the same here?
> > >
> >
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index dbc224023977..701aba7feaf5 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -6470,7 +6470,7 @@ static int __net_init ip6_route_net_init(struct net *net)
> >  #endif
> >
> >         net->ipv6.sysctl.flush_delay = 0;
> > -       net->ipv6.sysctl.ip6_rt_max_size = 4096;
> > +       net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
> >         net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
> >         net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
> >         net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
> >
> > The above patch resolved it for the Ipv6 reproducer.
> >
> > Would that be sufficient?
> >
>
> Otherwise if you prefer to make Ipv6 behaviour similar to IPv4.
> Rather than upping max_size.
>
> Here is prototype patch that removes the max_size check for Ipv6:
>

There are some mistakes in this prototype patch. Let me come up with a
better one. I'll submit a new patch in the new year for review. Thanks for
the suggestion, DavidA.

Regards

Jon

> diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
> index 88ff7bb2bb9b..632086b2f644 100644
> --- a/include/net/dst_ops.h
> +++ b/include/net/dst_ops.h
> @@ -16,7 +16,7 @@ struct dst_ops {
>         unsigned short          family;
>         unsigned int            gc_thresh;
>
> -       int                     (*gc)(struct dst_ops *ops);
> +       void                    (*gc)(struct dst_ops *ops);
>         struct dst_entry *      (*check)(struct dst_entry *, __u32 cookie);
>         unsigned int            (*default_advmss)(const struct dst_entry *);
>         unsigned int            (*mtu)(const struct dst_entry *);
> diff --git a/net/core/dst.c b/net/core/dst.c
> index 497ef9b3fc6a..dcb85267bc4c 100644
> --- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
>
>         if (ops->gc &&
>             !(flags & DST_NOCOUNT) &&
> -           dst_entries_get_fast(ops) > ops->gc_thresh) {
> -               if (ops->gc(ops)) {
> -                       pr_notice_ratelimited("Route cache is full:
> consider increasing sysctl net.ipv6.route.max_size.\n");
> -                       return NULL;
> -               }
> -       }
> +           dst_entries_get_fast(ops) > ops->gc_thresh)
> +               ops->gc(ops);
>
>         dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
>         if (!dst)
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index dbc224023977..8db7c5436da4 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct
> dst_entry *);
>  static void            ip6_dst_destroy(struct dst_entry *);
>  static void            ip6_dst_ifdown(struct dst_entry *,
>                                        struct net_device *dev, int how);
> -static int              ip6_dst_gc(struct dst_ops *ops);
> +static void             ip6_dst_gc(struct dst_ops *ops);
>
>  static int             ip6_pkt_discard(struct sk_buff *skb);
>  static int             ip6_pkt_discard_out(struct net *net, struct
> sock *sk, struct sk_buff *skb);
> @@ -3295,32 +3295,21 @@ struct dst_entry *icmp6_dst_alloc(struct
> net_device *dev,
>         return dst;
>  }
>
> -static int ip6_dst_gc(struct dst_ops *ops)
> +static void ip6_dst_gc(struct dst_ops *ops)
>  {
>         struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
> -       int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
> -       int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
>         int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
>         int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
> -       unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
>         int entries;
>
>         entries = dst_entries_get_fast(ops);
> -       if (entries > rt_max_size)
> -               entries = dst_entries_get_slow(ops);
> -
> -       if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
> -           entries <= rt_max_size)
> -               goto out;
>
>         net->ipv6.ip6_rt_gc_expire++;
>         fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net, true);
>         entries = dst_entries_get_slow(ops);
>         if (entries < ops->gc_thresh)
>                 net->ipv6.ip6_rt_gc_expire = rt_gc_timeout>>1;
> -out:
>         net->ipv6.ip6_rt_gc_expire -= net->ipv6.ip6_rt_gc_expire>>rt_elasticity;
> -       return entries > rt_max_size;
>  }
>
>  static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,
>
> > > I do not believe the suggested flag is the right change.
> >
> > Regards
> >
> > Jon
  
David Ahern Dec. 22, 2022, 4:17 p.m. UTC | #9
On 12/21/22 10:39 PM, Jonathan Maxwell wrote:
> There are some mistakes in this prototype patch. Let me come up with a
> better one. I'll submit a new patch in the new year for review. Thanks for
> the suggestion DavidA.

you are welcome. When you submit the next one, it would be helpful to
show change in memory consumption and a comparison to IPv4 for a similar
raw socket program.
  
Jon Maxwell Dec. 22, 2022, 10:36 p.m. UTC | #10
On Fri, Dec 23, 2022 at 3:17 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 12/21/22 10:39 PM, Jonathan Maxwell wrote:
> > There are some mistakes in this prototype patch. Let me come up with a
> > better one. I'll submit a new patch in the new year for review. Thanks for
> > the suggestion DavidA.
>
> you are welcome. When you submit the next one, it would be helpful to
> show change in memory consumption and a comparison to IPv4 for a similar
> raw socket program.

Sure, will do. I have an IPv4 version of the test program and a better patch.
Initial results are quite similar for resource usage. The IPv6 GC does a good
job of managing memory resources with the max_size threshold removed. I'll
include some comparison stats after I test this extensively, when I submit
the new patch.
  
Andrea Mayer Dec. 23, 2022, 8:28 p.m. UTC | #11
Hi Jon,
please see below, thanks.

On Wed, 21 Dec 2022 08:48:11 +1100
Jonathan Maxwell <jmaxwell37@gmail.com> wrote:

> On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > > Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> > > route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > > consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> > > these warnings:
> > >
> > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > .
> > > .
> > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> >
> > If I read correctly, the maximum number of dst that the raw socket can
> > use this way is limited by the number of packets it allows via the
> > sndbuf limit, right?
> >
> 
> Yes, but in my test sndbuf limit is never hit so it clones a route for
> every packet.
> 
> e.g:
> 
> output from C program sending 5000000 packets via a raw socket.
> 
> ip raw: total num pkts 5000000
> 
> # bpftrace -e 'kprobe:dst_alloc {@count[comm] = count()}'
> Attaching 1 probe...
> 
> @count[a.out]: 5000009
> 
> > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > ipvs, seg6?
> >
> 
> Any call to ip6_pol_route(s) where no res.nh->fib_nh_gw_family is 0 can do it.
> But we have only seen this for raw sockets so far.
> 

In the SRv6 subsystem, seg6_lookup_nexthop() is used by some
cross-connecting behaviors such as End.X and End.DX6 to forward traffic to a
specified nexthop. SRv6 End.X/DX6 can specify an IPv6 DA (i.e., a nexthop)
different from the one carried by the IPv6 header. For this purpose,
seg6_lookup_nexthop() sets the FLOWI_FLAG_KNOWN_NH flag.
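
Roughly, the nexthop resolution is (simplified sketch of the
seg6_lookup_nexthop() path in net/ipv6/seg6_local.c; details elided):

	/* When the behavior carries its own nexthop (End.X/DX6), look
	 * up that address and mark the nexthop as known; otherwise
	 * fall back to the DA in the IPv6 header.
	 */
	if (nhaddr) {
		fl6.flowi6_flags = FLOWI_FLAG_KNOWN_NH;
		fl6.daddr = *nhaddr;
	} else {
		fl6.daddr = hdr->daddr;
	}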

> > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > .
> > > .
> > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.

I can reproduce the same warning messages you reported by instantiating an
End.X behavior whose nexthop is handled by a route for which there is no
"via". In this configuration, ip6_pol_route() (called by
seg6_lookup_nexthop()) triggers ip6_rt_cache_alloc() because i)
FLOWI_FLAG_KNOWN_NH is present and ii) res.nh->fib_nh_gw_family is 0 (as
already pointed out).
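
The branch in question is roughly (simplified from ip6_pol_route() in
net/ipv6/route.c):

	/* A KNOWN_NH lookup against a route with no gateway
	 * (fib_nh_gw_family == 0) allocates a per-packet RTF_CACHE
	 * clone instead of using the per-cpu cached dst; each clone
	 * is a new dst_entry counted against max_size.
	 */
	if (unlikely((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) &&
		     !res.nh->fib_nh_gw_family))
		rt = ip6_rt_cache_alloc(&res, &fl6->daddr, NULL);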

> Regards
> 
> Jon

Ciao,
Andrea
  
Jon Maxwell Dec. 24, 2022, 7:38 a.m. UTC | #12
On Sat, Dec 24, 2022 at 7:28 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
>
> Hi Jon,
> please see below, thanks.
>
> On Wed, 21 Dec 2022 08:48:11 +1100
> Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
>
> > On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > > > Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> > > > route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > > > consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> > > > these warnings:
> > > >
> > > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > .
> > > > .
> > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > >
> > > If I read correctly, the maximum number of dst that the raw socket can
> > > use this way is limited by the number of packets it allows via the
> > > sndbuf limit, right?
> > >
> >
> > Yes, but in my test sndbuf limit is never hit so it clones a route for
> > every packet.
> >
> > e.g:
> >
> > output from C program sending 5000000 packets via a raw socket.
> >
> > ip raw: total num pkts 5000000
> >
> > # bpftrace -e 'kprobe:dst_alloc {@count[comm] = count()}'
> > Attaching 1 probe...
> >
> > @count[a.out]: 5000009
> >
> > > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > > ipvs, seg6?
> > >
> >
> > Any call to ip6_pol_route(s) where no res.nh->fib_nh_gw_family is 0 can do it.
> > But we have only seen this for raw sockets so far.
> >
>
> In the SRv6 subsystem, the seg6_lookup_nexthop() is used by some
> cross-connecting behaviors such as End.X and End.DX6 to forward traffic to a
> specified nexthop. SRv6 End.X/DX6 can specify an IPv6 DA (i.e., a nexthop)
> different from the one carried by the IPv6 header. For this purpose,
> seg6_lookup_nexthop() sets the FLOWI_FLAG_KNOWN_NH.
>
Hi Andrea,

Thanks for pointing that datapath out. The more generic approach we are
taking, bringing IPv6 closer to IPv4 in this regard, should fix all
instances of this.

> > > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > .
> > > > .
> > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
>
> I can reproduce the same warning messages reported by you, by instantiating an
> End.X behavior whose nexthop is handled by a route for which there is no "via".
> In this configuration, the ip6_pol_route() (called by seg6_lookup_nexthop())
> triggers ip6_rt_cache_alloc() because i) the FLOWI_FLAG_KNOWN_NH is present ii)
> and the res.nh->fib_nh_gw_family is 0 (as already pointed out).
>

Nice, when I get back after the holiday break I'll submit the next patch. It
would be great if you could test the new patch and let me know how it works in
your tests at that juncture. I'll keep you posted.

Regards

Jon

> > Regards
> >
> > Jon
>
> Ciao,
> Andrea
  
Jon Maxwell Jan. 2, 2023, 11:59 p.m. UTC | #13
Hi Andrea,

Happy New Year.

Any chance you could test this patch based on the latest net-next
kernel and let me know the result?

diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
index 88ff7bb2bb9b..632086b2f644 100644
--- a/include/net/dst_ops.h
+++ b/include/net/dst_ops.h
@@ -16,7 +16,7 @@ struct dst_ops {
        unsigned short          family;
        unsigned int            gc_thresh;

-       int                     (*gc)(struct dst_ops *ops);
+       void                    (*gc)(struct dst_ops *ops);
        struct dst_entry *      (*check)(struct dst_entry *, __u32 cookie);
        unsigned int            (*default_advmss)(const struct dst_entry *);
        unsigned int            (*mtu)(const struct dst_entry *);
diff --git a/net/core/dst.c b/net/core/dst.c
index 6d2dd03dafa8..31c08a3386d3 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,

        if (ops->gc &&
            !(flags & DST_NOCOUNT) &&
-           dst_entries_get_fast(ops) > ops->gc_thresh) {
-               if (ops->gc(ops)) {
-               pr_notice_ratelimited("Route cache is full: consider increasing sysctl net.ipv6.route.max_size.\n");
-                       return NULL;
-               }
-       }
+           dst_entries_get_fast(ops) > ops->gc_thresh)
+               ops->gc(ops);

        dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
        if (!dst)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index e74e0361fd92..b643dda68d31 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct dst_entry *);
 static void            ip6_dst_destroy(struct dst_entry *);
 static void            ip6_dst_ifdown(struct dst_entry *,
                                       struct net_device *dev, int how);
-static int              ip6_dst_gc(struct dst_ops *ops);
+static void             ip6_dst_gc(struct dst_ops *ops);

 static int             ip6_pkt_discard(struct sk_buff *skb);
 static int             ip6_pkt_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb);
@@ -3284,11 +3284,10 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
        return dst;
 }

-static int ip6_dst_gc(struct dst_ops *ops)
+static void ip6_dst_gc(struct dst_ops *ops)
 {
        struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
        int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
-       int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
        int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
        int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
        unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
@@ -3296,11 +3295,10 @@ static int ip6_dst_gc(struct dst_ops *ops)
        int entries;

        entries = dst_entries_get_fast(ops);
-       if (entries > rt_max_size)
+       if (entries > ops->gc_thresh)
                entries = dst_entries_get_slow(ops);

-       if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
-           entries <= rt_max_size)
+       if (time_after(rt_last_gc + rt_min_interval, jiffies))
                goto out;

        fib6_run_gc(atomic_inc_return(&net->ipv6.ip6_rt_gc_expire), net, true);
@@ -3310,7 +3308,6 @@ static int ip6_dst_gc(struct dst_ops *ops)
 out:
        val = atomic_read(&net->ipv6.ip6_rt_gc_expire);
        atomic_set(&net->ipv6.ip6_rt_gc_expire, val - (val >> rt_elasticity));
-       return entries > rt_max_size;
 }

 static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,
@@ -6512,7 +6509,7 @@ static int __net_init ip6_route_net_init(struct net *net)
 #endif

        net->ipv6.sysctl.flush_delay = 0;
-       net->ipv6.sysctl.ip6_rt_max_size = 4096;
+       net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
        net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
        net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
        net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;

On Sat, Dec 24, 2022 at 6:38 PM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
>
> On Sat, Dec 24, 2022 at 7:28 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
> >
> > Hi Jon,
> > please see below, thanks.
> >
> > On Wed, 21 Dec 2022 08:48:11 +1100
> > Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
> >
> > > On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > >
> > > > On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > > > > Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> > > > > route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > > > > consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> > > > > these warnings:
> > > > >
> > > > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > .
> > > > > .
> > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > >
> > > > If I read correctly, the maximum number of dst that the raw socket can
> > > > use this way is limited by the number of packets it allows via the
> > > > sndbuf limit, right?
> > > >
> > >
> > > Yes, but in my test sndbuf limit is never hit so it clones a route for
> > > every packet.
> > >
> > > e.g:
> > >
> > > output from C program sending 5000000 packets via a raw socket.
> > >
> > > ip raw: total num pkts 5000000
> > >
> > > # bpftrace -e 'kprobe:dst_alloc {@count[comm] = count()}'
> > > Attaching 1 probe...
> > >
> > > @count[a.out]: 5000009
> > >
> > > > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > > > ipvs, seg6?
> > > >
> > >
> > > Any call to ip6_pol_route(s) where no res.nh->fib_nh_gw_family is 0 can do it.
> > > But we have only seen this for raw sockets so far.
> > >
> >
> > In the SRv6 subsystem, the seg6_lookup_nexthop() is used by some
> > cross-connecting behaviors such as End.X and End.DX6 to forward traffic to a
> > specified nexthop. SRv6 End.X/DX6 can specify an IPv6 DA (i.e., a nexthop)
> > different from the one carried by the IPv6 header. For this purpose,
> > seg6_lookup_nexthop() sets the FLOWI_FLAG_KNOWN_NH.
> >
> Hi Andrea,
>
> Thanks for pointing that datapath out. The more generic approach we are
> taking bringing Ipv6 closer to Ipv4 in this regard should fix all instances
> of this.
>
> > > > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > .
> > > > > .
> > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> >
> > I can reproduce the same warning messages reported by you, by instantiating an
> > End.X behavior whose nexthop is handled by a route for which there is no "via".
> > In this configuration, the ip6_pol_route() (called by seg6_lookup_nexthop())
> > triggers ip6_rt_cache_alloc() because i) the FLOWI_FLAG_KNOWN_NH is present ii)
> > and the res.nh->fib_nh_gw_family is 0 (as already pointed out).
> >
>
> Nice, when I get back after the holiday break I'll submit the next patch. It
> would be great if you could test the new patch and let me know how it works in
> your tests at that juncture. I'll keep you posted.
>
> Regards
>
> Jon
>
> > > Regards
> > >
> > > Jon
> >
> > Ciao,
> > Andrea
  
Andrea Mayer Jan. 3, 2023, 4:07 p.m. UTC | #14
Hi Jon,
please see below, thanks.

On Tue, 3 Jan 2023 10:59:50 +1100
Jonathan Maxwell <jmaxwell37@gmail.com> wrote:

> Hi Andrea,
> 
> Happy New Year.
> 

Thank you, Happy New Year to you too and everybody on the mailing list as well.

> Any chance you could test this patch based on the latest net-next
> kernel and let me know the result?
> 
> diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
> index 88ff7bb2bb9b..632086b2f644 100644
> --- a/include/net/dst_ops.h
> +++ b/include/net/dst_ops.h
> @@ -16,7 +16,7 @@ struct dst_ops {
>         unsigned short          family;
>         unsigned int            gc_thresh;
> 
> -       int                     (*gc)(struct dst_ops *ops);
> +       void                    (*gc)(struct dst_ops *ops);
>         struct dst_entry *      (*check)(struct dst_entry *, __u32 cookie);
>         unsigned int            (*default_advmss)(const struct dst_entry *);
>         unsigned int            (*mtu)(const struct dst_entry *);
> diff --git a/net/core/dst.c b/net/core/dst.c
> index 6d2dd03dafa8..31c08a3386d3 100644
> --- a/net/core/dst.c
> +++ b/net/core/dst.c
> @@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
> 
>         if (ops->gc &&
>             !(flags & DST_NOCOUNT) &&
> -           dst_entries_get_fast(ops) > ops->gc_thresh) {
> -               if (ops->gc(ops)) {
> -                       pr_notice_ratelimited("Route cache is full:
> consider increasing sysctl net.ipv6.route.max_size.\n");
> -                       return NULL;
> -               }
> -       }
> +           dst_entries_get_fast(ops) > ops->gc_thresh)
> +               ops->gc(ops);
> 
>         dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
>         if (!dst)
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index e74e0361fd92..b643dda68d31 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct
> dst_entry *);
>  static void            ip6_dst_destroy(struct dst_entry *);
>  static void            ip6_dst_ifdown(struct dst_entry *,
>                                        struct net_device *dev, int how);
> -static int              ip6_dst_gc(struct dst_ops *ops);
> +static void             ip6_dst_gc(struct dst_ops *ops);
> 
>  static int             ip6_pkt_discard(struct sk_buff *skb);
>  static int             ip6_pkt_discard_out(struct net *net, struct
> sock *sk, struct sk_buff *skb);
> @@ -3284,11 +3284,10 @@ struct dst_entry *icmp6_dst_alloc(struct
> net_device *dev,
>         return dst;
>  }
> 
> -static int ip6_dst_gc(struct dst_ops *ops)
> +static void ip6_dst_gc(struct dst_ops *ops)
>  {
>         struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
>         int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
> -       int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
>         int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
>         int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
>         unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
> @@ -3296,11 +3295,10 @@ static int ip6_dst_gc(struct dst_ops *ops)
>         int entries;
> 
>         entries = dst_entries_get_fast(ops);
> -       if (entries > rt_max_size)
> +       if (entries > ops->gc_thresh)
>                 entries = dst_entries_get_slow(ops);
> 
> -       if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
> -           entries <= rt_max_size)
> +       if (time_after(rt_last_gc + rt_min_interval, jiffies))
>                 goto out;
> 
>         fib6_run_gc(atomic_inc_return(&net->ipv6.ip6_rt_gc_expire), net, true);
> @@ -3310,7 +3308,6 @@ static int ip6_dst_gc(struct dst_ops *ops)
>  out:
>         val = atomic_read(&net->ipv6.ip6_rt_gc_expire);
>         atomic_set(&net->ipv6.ip6_rt_gc_expire, val - (val >> rt_elasticity));
> -       return entries > rt_max_size;
>  }
> 
>  static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,
> @@ -6512,7 +6509,7 @@ static int __net_init ip6_route_net_init(struct net *net)
>  #endif
> 
>         net->ipv6.sysctl.flush_delay = 0;
> -       net->ipv6.sysctl.ip6_rt_max_size = 4096;
> +       net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
>         net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
>         net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
>         net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
> 

Yes, I will apply this patch in the next days and check how it deals with the
seg6 subsystem. I will keep you posted.

Ciao,
Andrea

> On Sat, Dec 24, 2022 at 6:38 PM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
> >
> > On Sat, Dec 24, 2022 at 7:28 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
> > >
> > > Hi Jon,
> > > please see below, thanks.
> > >
> > > On Wed, 21 Dec 2022 08:48:11 +1100
> > > Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
> > >
> > > > On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > >
> > > > > On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > > > > > Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> > > > > > route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > > > > > consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> > > > > > these warnings:
> > > > > >
> > > > > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > .
> > > > > > .
> > > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > >
> > > > > If I read correctly, the maximum number of dst that the raw socket can
> > > > > use this way is limited by the number of packets it allows via the
> > > > > sndbuf limit, right?
> > > > >
> > > >
> > > > Yes, but in my test sndbuf limit is never hit so it clones a route for
> > > > every packet.
> > > >
> > > > e.g:
> > > >
> > > > output from C program sending 5000000 packets via a raw socket.
> > > >
> > > > ip raw: total num pkts 5000000
> > > >
> > > > # bpftrace -e 'kprobe:dst_alloc {@count[comm] = count()}'
> > > > Attaching 1 probe...
> > > >
> > > > @count[a.out]: 5000009
> > > >
> > > > > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > > > > ipvs, seg6?
> > > > >
> > > >
> > > > Any call to ip6_pol_route(s) where no res.nh->fib_nh_gw_family is 0 can do it.
> > > > But we have only seen this for raw sockets so far.
> > > >
> > >
> > > In the SRv6 subsystem, the seg6_lookup_nexthop() is used by some
> > > cross-connecting behaviors such as End.X and End.DX6 to forward traffic to a
> > > specified nexthop. SRv6 End.X/DX6 can specify an IPv6 DA (i.e., a nexthop)
> > > different from the one carried by the IPv6 header. For this purpose,
> > > seg6_lookup_nexthop() sets the FLOWI_FLAG_KNOWN_NH.
> > >
> > Hi Andrea,
> >
> > Thanks for pointing that datapath out. The more generic approach we are
> > taking bringing Ipv6 closer to Ipv4 in this regard should fix all instances
> > of this.
> >
> > > > > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > .
> > > > > > .
> > > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > >
> > > I can reproduce the same warning messages reported by you, by instantiating an
> > > End.X behavior whose nexthop is handled by a route for which there is no "via".
> > > In this configuration, the ip6_pol_route() (called by seg6_lookup_nexthop())
> > > triggers ip6_rt_cache_alloc() because i) the FLOWI_FLAG_KNOWN_NH is present ii)
> > > and the res.nh->fib_nh_gw_family is 0 (as already pointed out).
> > >
> >
> > Nice, when I get back after the holiday break I'll submit the next patch. It
> > would be great if you could test the new patch and let me know how it works in
> > your tests at that juncture. I'll keep you posted.
> >
> > Regards
> >
> > Jon
> >
> > > > Regards
> > > >
> > > > Jon
> > >
> > > Ciao,
> > > Andrea
  
Andrea Mayer Jan. 6, 2023, 11:26 p.m. UTC | #15
Hi Jon,
please see after, thanks.

> 
> > Any chance you could test this patch based on the latest net-next
> > kernel and let me know the result?
> > 
> > diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
> > index 88ff7bb2bb9b..632086b2f644 100644
> > --- a/include/net/dst_ops.h
> > +++ b/include/net/dst_ops.h
> > @@ -16,7 +16,7 @@ struct dst_ops {
> >         unsigned short          family;
> >         unsigned int            gc_thresh;
> > 
> > -       int                     (*gc)(struct dst_ops *ops);
> > +       void                    (*gc)(struct dst_ops *ops);
> >         struct dst_entry *      (*check)(struct dst_entry *, __u32 cookie);
> >         unsigned int            (*default_advmss)(const struct dst_entry *);
> >         unsigned int            (*mtu)(const struct dst_entry *);
> > diff --git a/net/core/dst.c b/net/core/dst.c
> > index 6d2dd03dafa8..31c08a3386d3 100644
> > --- a/net/core/dst.c
> > +++ b/net/core/dst.c
> > @@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
> > 
> >         if (ops->gc &&
> >             !(flags & DST_NOCOUNT) &&
> > -           dst_entries_get_fast(ops) > ops->gc_thresh) {
> > -               if (ops->gc(ops)) {
> > -                       pr_notice_ratelimited("Route cache is full:
> > consider increasing sysctl net.ipv6.route.max_size.\n");
> > -                       return NULL;
> > -               }
> > -       }
> > +           dst_entries_get_fast(ops) > ops->gc_thresh)
> > +               ops->gc(ops);
> > 
> >         dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
> >         if (!dst)
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index e74e0361fd92..b643dda68d31 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct
> > dst_entry *);
> >  static void            ip6_dst_destroy(struct dst_entry *);
> >  static void            ip6_dst_ifdown(struct dst_entry *,
> >                                        struct net_device *dev, int how);
> > -static int              ip6_dst_gc(struct dst_ops *ops);
> > +static void             ip6_dst_gc(struct dst_ops *ops);
> > 
> >  static int             ip6_pkt_discard(struct sk_buff *skb);
> >  static int             ip6_pkt_discard_out(struct net *net, struct
> > sock *sk, struct sk_buff *skb);
> > @@ -3284,11 +3284,10 @@ struct dst_entry *icmp6_dst_alloc(struct
> > net_device *dev,
> >         return dst;
> >  }
> > 
> > -static int ip6_dst_gc(struct dst_ops *ops)
> > +static void ip6_dst_gc(struct dst_ops *ops)
> >  {
> >         struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
> >         int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
> > -       int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
> >         int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
> >         int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
> >         unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
> > @@ -3296,11 +3295,10 @@ static int ip6_dst_gc(struct dst_ops *ops)
> >         int entries;
> > 
> >         entries = dst_entries_get_fast(ops);
> > -       if (entries > rt_max_size)
> > +       if (entries > ops->gc_thresh)
> >                 entries = dst_entries_get_slow(ops);
> > 
> > -       if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
> > -           entries <= rt_max_size)
> > +       if (time_after(rt_last_gc + rt_min_interval, jiffies))
> >                 goto out;
> > 
> >         fib6_run_gc(atomic_inc_return(&net->ipv6.ip6_rt_gc_expire), net, true);
> > @@ -3310,7 +3308,6 @@ static int ip6_dst_gc(struct dst_ops *ops)
> >  out:
> >         val = atomic_read(&net->ipv6.ip6_rt_gc_expire);
> >         atomic_set(&net->ipv6.ip6_rt_gc_expire, val - (val >> rt_elasticity));
> > -       return entries > rt_max_size;
> >  }
> > 
> >  static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,
> > @@ -6512,7 +6509,7 @@ static int __net_init ip6_route_net_init(struct net *net)
> >  #endif
> > 
> >         net->ipv6.sysctl.flush_delay = 0;
> > -       net->ipv6.sysctl.ip6_rt_max_size = 4096;
> > +       net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
> >         net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
> >         net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
> >         net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
> > 
> 
> Yes, I will apply this patch in the next days and check how it deals with the
> seg6 subsystem. I will keep you posted.
> 

I applied the patch* to the net-next (HEAD 6bd4755c7c49) and did some tests on
the seg6 subsystem, specifically running the End.X/DX6 behaviors. They seem to
work fine.

(*) I had to slightly edit the patch because of the code formatting, e.g.
    some incorrect line breaks, spaces, etc.

Ciao,
Andrea

> 
> > On Sat, Dec 24, 2022 at 6:38 PM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
> > >
> > > On Sat, Dec 24, 2022 at 7:28 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
> > > >
> > > > Hi Jon,
> > > > please see below, thanks.
> > > >
> > > > On Wed, 21 Dec 2022 08:48:11 +1100
> > > > Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
> > > >
> > > > > On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > > >
> > > > > > On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > > > > > > Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> > > > > > > route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > > > > > > consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> > > > > > > these warnings:
> > > > > > >
> > > > > > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > > .
> > > > > > > .
> > > > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > >
> > > > > > If I read correctly, the maximum number of dst that the raw socket can
> > > > > > use this way is limited by the number of packets it allows via the
> > > > > > sndbuf limit, right?
> > > > > >
> > > > >
> > > > > Yes, but in my test sndbuf limit is never hit so it clones a route for
> > > > > every packet.
> > > > >
> > > > > e.g:
> > > > >
> > > > > output from C program sending 5000000 packets via a raw socket.
> > > > >
> > > > > ip raw: total num pkts 5000000
> > > > >
> > > > > # bpftrace -e 'kprobe:dst_alloc {@count[comm] = count()}'
> > > > > Attaching 1 probe...
> > > > >
> > > > > @count[a.out]: 5000009
> > > > >
> > > > > > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > > > > > ipvs, seg6?
> > > > > >
> > > > >
> > > > > Any call to ip6_pol_route(s) where no res.nh->fib_nh_gw_family is 0 can do it.
> > > > > But we have only seen this for raw sockets so far.
> > > > >
> > > >
> > > > In the SRv6 subsystem, the seg6_lookup_nexthop() is used by some
> > > > cross-connecting behaviors such as End.X and End.DX6 to forward traffic to a
> > > > specified nexthop. SRv6 End.X/DX6 can specify an IPv6 DA (i.e., a nexthop)
> > > > different from the one carried by the IPv6 header. For this purpose,
> > > > seg6_lookup_nexthop() sets the FLOWI_FLAG_KNOWN_NH.
> > > >
> > > Hi Andrea,
> > >
> > > Thanks for pointing that datapath out. The more generic approach we are
> > > taking bringing Ipv6 closer to Ipv4 in this regard should fix all instances
> > > of this.
> > >
> > > > > > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > > .
> > > > > > > .
> > > > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > >
> > > > I can reproduce the same warning messages reported by you, by instantiating an
> > > > End.X behavior whose nexthop is handled by a route for which there is no "via".
> > > > In this configuration, the ip6_pol_route() (called by seg6_lookup_nexthop())
> > > > triggers ip6_rt_cache_alloc() because i) the FLOWI_FLAG_KNOWN_NH is present ii)
> > > > and the res.nh->fib_nh_gw_family is 0 (as already pointed out).
> > > >
> > >
> > > Nice, when I get back after the holiday break I'll submit the next patch. It
> > > would be great if you could test the new patch and let me know how it works in
> > > your tests at that juncture. I'll keep you posted.
> > >
> > > Regards
> > >
> > > Jon
> > >
> > > > > Regards
> > > > >
> > > > > Jon
> > > >
> > > > Ciao,
> > > > Andrea
> 
> 
> -- 
> Andrea Mayer <andrea.mayer@uniroma2.it>
  
Jon Maxwell Jan. 7, 2023, 11:46 p.m. UTC | #16
On Sat, Jan 7, 2023 at 10:27 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
>
> Hi Jon,
> please see after, thanks.
>
> >
> > > Any chance you could test this patch based on the latest net-next
> > > kernel and let me know the result?
> > >
> > > diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
> > > index 88ff7bb2bb9b..632086b2f644 100644
> > > --- a/include/net/dst_ops.h
> > > +++ b/include/net/dst_ops.h
> > > @@ -16,7 +16,7 @@ struct dst_ops {
> > >         unsigned short          family;
> > >         unsigned int            gc_thresh;
> > >
> > > -       int                     (*gc)(struct dst_ops *ops);
> > > +       void                    (*gc)(struct dst_ops *ops);
> > >         struct dst_entry *      (*check)(struct dst_entry *, __u32 cookie);
> > >         unsigned int            (*default_advmss)(const struct dst_entry *);
> > >         unsigned int            (*mtu)(const struct dst_entry *);
> > > diff --git a/net/core/dst.c b/net/core/dst.c
> > > index 6d2dd03dafa8..31c08a3386d3 100644
> > > --- a/net/core/dst.c
> > > +++ b/net/core/dst.c
> > > @@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
> > >
> > >         if (ops->gc &&
> > >             !(flags & DST_NOCOUNT) &&
> > > -           dst_entries_get_fast(ops) > ops->gc_thresh) {
> > > -               if (ops->gc(ops)) {
> > > -                       pr_notice_ratelimited("Route cache is full:
> > > consider increasing sysctl net.ipv6.route.max_size.\n");
> > > -                       return NULL;
> > > -               }
> > > -       }
> > > +           dst_entries_get_fast(ops) > ops->gc_thresh)
> > > +               ops->gc(ops);
> > >
> > >         dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
> > >         if (!dst)
> > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > > index e74e0361fd92..b643dda68d31 100644
> > > --- a/net/ipv6/route.c
> > > +++ b/net/ipv6/route.c
> > > @@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct
> > > dst_entry *);
> > >  static void            ip6_dst_destroy(struct dst_entry *);
> > >  static void            ip6_dst_ifdown(struct dst_entry *,
> > >                                        struct net_device *dev, int how);
> > > -static int              ip6_dst_gc(struct dst_ops *ops);
> > > +static void             ip6_dst_gc(struct dst_ops *ops);
> > >
> > >  static int             ip6_pkt_discard(struct sk_buff *skb);
> > >  static int             ip6_pkt_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb);
> > > @@ -3284,11 +3284,10 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
> > >         return dst;
> > >  }
> > >
> > > -static int ip6_dst_gc(struct dst_ops *ops)
> > > +static void ip6_dst_gc(struct dst_ops *ops)
> > >  {
> > >         struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
> > >         int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
> > > -       int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
> > >         int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
> > >         int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
> > >         unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
> > > @@ -3296,11 +3295,10 @@ static int ip6_dst_gc(struct dst_ops *ops)
> > >         int entries;
> > >
> > >         entries = dst_entries_get_fast(ops);
> > > -       if (entries > rt_max_size)
> > > +       if (entries > ops->gc_thresh)
> > >                 entries = dst_entries_get_slow(ops);
> > >
> > > -       if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
> > > -           entries <= rt_max_size)
> > > +       if (time_after(rt_last_gc + rt_min_interval, jiffies))
> > >                 goto out;
> > >
> > >         fib6_run_gc(atomic_inc_return(&net->ipv6.ip6_rt_gc_expire), net, true);
> > > @@ -3310,7 +3308,6 @@ static int ip6_dst_gc(struct dst_ops *ops)
> > >  out:
> > >         val = atomic_read(&net->ipv6.ip6_rt_gc_expire);
> > >         atomic_set(&net->ipv6.ip6_rt_gc_expire, val - (val >> rt_elasticity));
> > > -       return entries > rt_max_size;
> > >  }
> > >
> > >  static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,
> > > @@ -6512,7 +6509,7 @@ static int __net_init ip6_route_net_init(struct net *net)
> > >  #endif
> > >
> > >         net->ipv6.sysctl.flush_delay = 0;
> > > -       net->ipv6.sysctl.ip6_rt_max_size = 4096;
> > > +       net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
> > >         net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
> > >         net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
> > >         net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
> > >
> >
> > Yes, I will apply this patch in the next days and check how it deals with the
> > seg6 subsystem. I will keep you posted.
> >
>
> I applied the patch* to the net-next (HEAD 6bd4755c7c49) and did some tests on
> the seg6 subsystem, specifically running the End.X/DX6 behaviors. They seem to
> work fine.

Thanks Andrea, much appreciated. It worked fine in my raw socket tests as well.
I'll look at submitting it soon.

>
> (*) I had to slightly edit the patch because of the code formatting, e.g.
>     some incorrect line breaks, spaces, etc.
>

Sorry about that; I should have sent it straight from git to avoid the mangled formatting.

Regards

Jon
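
For reference, here is a minimal reproducer in the spirit of the raw-socket test used in this thread. The original a.out source was not posted, so the addresses, payload size, and packet count below are illustrative assumptions; it needs CAP_NET_RAW and an existing route to the destination.

/* Raw-socket sender exercising the per-packet route clone.
 * IPPROTO_RAW implies hdrincl, which is what makes rawv6_sendmsg()
 * set FLOWI_FLAG_KNOWN_NH.  Pre-patch, sendto() starts failing with
 * ENETUNREACH (errno 101) once the dst count passes
 * net.ipv6.route.max_size.
 */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/ip6.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	unsigned char pkt[sizeof(struct ip6_hdr) + 8] = { 0 };
	struct ip6_hdr *ip6 = (struct ip6_hdr *)pkt;
	struct sockaddr_in6 dst = { .sin6_family = AF_INET6 };
	long i, errs = 0;
	int fd;

	fd = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	inet_pton(AF_INET6, "2001:db8::2", &dst.sin6_addr);

	ip6->ip6_vfc  = 6 << 4;         /* IP version 6 */
	ip6->ip6_plen = htons(8);       /* payload length */
	ip6->ip6_nxt  = IPPROTO_UDP;
	ip6->ip6_hlim = 64;
	inet_pton(AF_INET6, "2001:db8::1", &ip6->ip6_src);
	ip6->ip6_dst = dst.sin6_addr;

	for (i = 0; i < 5000000; i++)
		if (sendto(fd, pkt, sizeof(pkt), 0,
			   (struct sockaddr *)&dst, sizeof(dst)) < 0)
			errs++;

	printf("send errors %ld, last errno %d\n", errs, errno);
	return 0;
}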

> Ciao,
> Andrea
>
> >
> > > On Sat, Dec 24, 2022 at 6:38 PM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
> > > >
> > > > On Sat, Dec 24, 2022 at 7:28 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
> > > > >
> > > > > Hi Jon,
> > > > > please see below, thanks.
> > > > >
> > > > > On Wed, 21 Dec 2022 08:48:11 +1100
> > > > > Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
> > > > >
> > > > > > On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > > > >
> > > > > > > On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > > > > > > > Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> > > > > > > > route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > > > > > > > consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> > > > > > > > these warnings:
> > > > > > > >
> > > > > > > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > > > .
> > > > > > > > .
> > > > > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > >
> > > > > > > If I read correctly, the maximum number of dst that the raw socket can
> > > > > > > use this way is limited by the number of packets it allows via the
> > > > > > > sndbuf limit, right?
> > > > > > >
> > > > > >
> > > > > > Yes, but in my test the sndbuf limit is never hit so it clones a route for
> > > > > > every packet.
> > > > > >
> > > > > > e.g:
> > > > > >
> > > > > > output from C program sending 5000000 packets via a raw socket.
> > > > > >
> > > > > > ip raw: total num pkts 5000000
> > > > > >
> > > > > > # bpftrace -e 'kprobe:dst_alloc {@count[comm] = count()}'
> > > > > > Attaching 1 probe...
> > > > > >
> > > > > > @count[a.out]: 5000009
> > > > > >
> > > > > > > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > > > > > > ipvs, seg6?
> > > > > > >
> > > > > >
> > > > > > Any call to ip6_pol_route(s) where res.nh->fib_nh_gw_family is 0 can do it.
> > > > > > But we have only seen this for raw sockets so far.
> > > > > >
> > > > >
> > > > > In the SRv6 subsystem, seg6_lookup_nexthop() is used by some
> > > > > cross-connecting behaviors such as End.X and End.DX6 to forward traffic to a
> > > > > specified nexthop. SRv6 End.X/DX6 can specify an IPv6 DA (i.e., a nexthop)
> > > > > different from the one carried by the IPv6 header. For this purpose,
> > > > > seg6_lookup_nexthop() sets the FLOWI_FLAG_KNOWN_NH.
> > > > >
> > > > Hi Andrea,
> > > >
> > > > Thanks for pointing that datapath out. The more generic approach we are
> > > > taking, bringing IPv6 closer to IPv4 in this regard, should fix all
> > > > instances of this.
> > > >
> > > > > > > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > > > .
> > > > > > > > .
> > > > > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > >
> > > > > I can reproduce the same warning messages reported by you, by instantiating an
> > > > > End.X behavior whose nexthop is handled by a route for which there is no "via".
> > > > > In this configuration, the ip6_pol_route() (called by seg6_lookup_nexthop())
> > > > > triggers ip6_rt_cache_alloc() because i) the FLOWI_FLAG_KNOWN_NH is present ii)
> > > > > and the res.nh->fib_nh_gw_family is 0 (as already pointed out).
> > > > >
> > > >
> > > > Nice, when I get back after the holiday break I'll submit the next patch. It
> > > > would be great if you could test the new patch and let me know how it works in
> > > > your tests at that juncture. I'll keep you posted.
> > > >
> > > > Regards
> > > >
> > > > Jon
> > > >
> > > > > > Regards
> > > > > >
> > > > > > Jon
> > > > >
> > > > > Ciao,
> > > > > Andrea
> >
> >
> > --
> > Andrea Mayer <andrea.mayer@uniroma2.it>
>
>
> --
> Andrea Mayer <andrea.mayer@uniroma2.it>
  
Andrea Mayer Jan. 8, 2023, 5:34 p.m. UTC | #17
On Sun, 8 Jan 2023 10:46:09 +1100
Jonathan Maxwell <jmaxwell37@gmail.com> wrote:

> On Sat, Jan 7, 2023 at 10:27 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
> >
> > Hi Jon,
> > please see after, thanks.
> >
> > >
> > > > Any chance you could test this patch based on the latest net-next
> > > > kernel and let me know the result?
> > > >
> > > > diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
> > > > index 88ff7bb2bb9b..632086b2f644 100644
> > > > --- a/include/net/dst_ops.h
> > > > +++ b/include/net/dst_ops.h
> > > > @@ -16,7 +16,7 @@ struct dst_ops {
> > > >         unsigned short          family;
> > > >         unsigned int            gc_thresh;
> > > >
> > > > -       int                     (*gc)(struct dst_ops *ops);
> > > > +       void                    (*gc)(struct dst_ops *ops);
> > > >         struct dst_entry *      (*check)(struct dst_entry *, __u32 cookie);
> > > >         unsigned int            (*default_advmss)(const struct dst_entry *);
> > > >         unsigned int            (*mtu)(const struct dst_entry *);
> > > > diff --git a/net/core/dst.c b/net/core/dst.c
> > > > index 6d2dd03dafa8..31c08a3386d3 100644
> > > > --- a/net/core/dst.c
> > > > +++ b/net/core/dst.c
> > > > @@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
> > > >
> > > >         if (ops->gc &&
> > > >             !(flags & DST_NOCOUNT) &&
> > > > -           dst_entries_get_fast(ops) > ops->gc_thresh) {
> > > > -               if (ops->gc(ops)) {
> > > > -                       pr_notice_ratelimited("Route cache is full: consider increasing sysctl net.ipv6.route.max_size.\n");
> > > > -                       return NULL;
> > > > -               }
> > > > -       }
> > > > +           dst_entries_get_fast(ops) > ops->gc_thresh)
> > > > +               ops->gc(ops);
> > > >
> > > >         dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
> > > >         if (!dst)
> > > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > > > index e74e0361fd92..b643dda68d31 100644
> > > > --- a/net/ipv6/route.c
> > > > +++ b/net/ipv6/route.c
> > > > @@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct dst_entry *);
> > > >  static void            ip6_dst_destroy(struct dst_entry *);
> > > >  static void            ip6_dst_ifdown(struct dst_entry *,
> > > >                                        struct net_device *dev, int how);
> > > > -static int              ip6_dst_gc(struct dst_ops *ops);
> > > > +static void             ip6_dst_gc(struct dst_ops *ops);
> > > >
> > > >  static int             ip6_pkt_discard(struct sk_buff *skb);
> > > >  static int             ip6_pkt_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb);
> > > > @@ -3284,11 +3284,10 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
> > > >         return dst;
> > > >  }
> > > >
> > > > -static int ip6_dst_gc(struct dst_ops *ops)
> > > > +static void ip6_dst_gc(struct dst_ops *ops)
> > > >  {
> > > >         struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
> > > >         int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
> > > > -       int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
> > > >         int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
> > > >         int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
> > > >         unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
> > > > @@ -3296,11 +3295,10 @@ static int ip6_dst_gc(struct dst_ops *ops)
> > > >         int entries;
> > > >
> > > >         entries = dst_entries_get_fast(ops);
> > > > -       if (entries > rt_max_size)
> > > > +       if (entries > ops->gc_thresh)
> > > >                 entries = dst_entries_get_slow(ops);
> > > >
> > > > -       if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
> > > > -           entries <= rt_max_size)
> > > > +       if (time_after(rt_last_gc + rt_min_interval, jiffies))
> > > >                 goto out;
> > > >
> > > >         fib6_run_gc(atomic_inc_return(&net->ipv6.ip6_rt_gc_expire), net, true);
> > > > @@ -3310,7 +3308,6 @@ static int ip6_dst_gc(struct dst_ops *ops)
> > > >  out:
> > > >         val = atomic_read(&net->ipv6.ip6_rt_gc_expire);
> > > >         atomic_set(&net->ipv6.ip6_rt_gc_expire, val - (val >> rt_elasticity));
> > > > -       return entries > rt_max_size;
> > > >  }
> > > >
> > > >  static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,
> > > > @@ -6512,7 +6509,7 @@ static int __net_init ip6_route_net_init(struct net *net)
> > > >  #endif
> > > >
> > > >         net->ipv6.sysctl.flush_delay = 0;
> > > > -       net->ipv6.sysctl.ip6_rt_max_size = 4096;
> > > > +       net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
> > > >         net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
> > > >         net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
> > > >         net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
> > > >
> > >
> > > Yes, I will apply this patch in the next days and check how it deals with the
> > > seg6 subsystem. I will keep you posted.
> > >
> >
> > I applied the patch* to the net-next (HEAD 6bd4755c7c49) and did some tests on
> > the seg6 subsystem, specifically running the End.X/DX6 behaviors. They seem to
> > work fine.
> 
> Thanks Andrea much appreciated. It worked fine in my raw socket tests as well.

You're welcome. Good!

> I'll look at submitting it soon.

Please let me know (keep me in cc).

> 
> >
> > (*) I had to slightly edit the patch because of the code formatting, e.g.
> >     some incorrect line breaks, spaces, etc.
> >
> 
> Sorry about that, I should have sent it from git to avoid that.
> 

Don't worry.

> Regards
> 
> Jon
> 

Ciao,
Andrea

> > >
> > > > On Sat, Dec 24, 2022 at 6:38 PM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
> > > > >
> > > > > On Sat, Dec 24, 2022 at 7:28 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
> > > > > >
> > > > > > Hi Jon,
> > > > > > please see below, thanks.
> > > > > >
> > > > > > On Wed, 21 Dec 2022 08:48:11 +1100
> > > > > > Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
> > > > > >
> > > > > > > On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > > > > > > > > Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> > > > > > > > > route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > > > > > > > > consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> > > > > > > > > these warnings:
> > > > > > > > >
> > > > > > > > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > > > > .
> > > > > > > > > .
> > > > > > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > > >
> > > > > > > > If I read correctly, the maximum number of dst that the raw socket can
> > > > > > > > use this way is limited by the number of packets it allows via the
> > > > > > > > sndbuf limit, right?
> > > > > > > >
> > > > > > >
> > > > > > > Yes, but in my test the sndbuf limit is never hit so it clones a route for
> > > > > > > every packet.
> > > > > > >
> > > > > > > e.g:
> > > > > > >
> > > > > > > output from C program sending 5000000 packets via a raw socket.
> > > > > > >
> > > > > > > ip raw: total num pkts 5000000
> > > > > > >
> > > > > > > # bpftrace -e 'kprobe:dst_alloc {@count[comm] = count()}'
> > > > > > > Attaching 1 probe...
> > > > > > >
> > > > > > > @count[a.out]: 5000009
> > > > > > >
> > > > > > > > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > > > > > > > ipvs, seg6?
> > > > > > > >
> > > > > > >
> > > > > > > Any call to ip6_pol_route(s) where res.nh->fib_nh_gw_family is 0 can do it.
> > > > > > > But we have only seen this for raw sockets so far.
> > > > > > >
> > > > > >
> > > > > > In the SRv6 subsystem, seg6_lookup_nexthop() is used by some
> > > > > > cross-connecting behaviors such as End.X and End.DX6 to forward traffic to a
> > > > > > specified nexthop. SRv6 End.X/DX6 can specify an IPv6 DA (i.e., a nexthop)
> > > > > > different from the one carried by the IPv6 header. For this purpose,
> > > > > > seg6_lookup_nexthop() sets the FLOWI_FLAG_KNOWN_NH.
> > > > > >
> > > > > Hi Andrea,
> > > > >
> > > > > Thanks for pointing that datapath out. The more generic approach we are
> > > > > taking, bringing IPv6 closer to IPv4 in this regard, should fix all
> > > > > instances of this.
> > > > >
> > > > > > > > > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > > > > .
> > > > > > > > > .
> > > > > > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > >
> > > > > > I can reproduce the same warning messages reported by you, by instantiating an
> > > > > > End.X behavior whose nexthop is handled by a route for which there is no "via".
> > > > > > In this configuration, the ip6_pol_route() (called by seg6_lookup_nexthop())
> > > > > > triggers ip6_rt_cache_alloc() because i) the FLOWI_FLAG_KNOWN_NH is present ii)
> > > > > > and the res.nh->fib_nh_gw_family is 0 (as already pointed out).
> > > > > >
> > > > >
> > > > > Nice, when I get back after the holiday break I'll submit the next patch. It
> > > > > would be great if you could test the new patch and let me know how it works in
> > > > > your tests at that juncture. I'll keep you posted.
> > > > >
> > > > > Regards
> > > > >
> > > > > Jon
> > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Jon
> > > > > >
> > > > > > Ciao,
> > > > > > Andrea
> > >
> > >
> > > --
> > > Andrea Mayer <andrea.mayer@uniroma2.it>
> >
> >
> > --
> > Andrea Mayer <andrea.mayer@uniroma2.it>
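
Both knobs this thread revolves around are ordinary sysctls: net.ipv6.route.max_size, which the tested follow-up patch raises to INT_MAX, and net.ipv6.route.gc_thresh, which the reworked ip6_dst_gc() keys off instead. A trivial way to inspect them from C, using the standard procfs paths:

/* Print the IPv6 route-cache knobs discussed in this thread.
 * Standard procfs locations; no privileges are needed to read them.
 */
#include <stdio.h>

static void show(const char *path)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("%s = %s", path, buf);	/* value includes '\n' */
	if (f)
		fclose(f);
}

int main(void)
{
	show("/proc/sys/net/ipv6/route/max_size");
	show("/proc/sys/net/ipv6/route/gc_thresh");
	return 0;
}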
  

Patch

diff --git a/include/net/flow.h b/include/net/flow.h
index 2f0da4f0318b..30b8973ffb4b 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -37,6 +37,7 @@  struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
+#define FLOWI_FLAG_SKIP_RAW		0x04
 	__u32	flowic_secid;
 	kuid_t  flowic_uid;
 	struct flowi_tunnel flowic_tun_key;
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index a06a9f847db5..0b89a7e66d09 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -884,7 +884,7 @@  static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	security_sk_classify_flow(sk, flowi6_to_flowi_common(&fl6));
 
 	if (hdrincl)
-		fl6.flowi6_flags |= FLOWI_FLAG_KNOWN_NH;
+		fl6.flowi6_flags |= FLOWI_FLAG_KNOWN_NH | FLOWI_FLAG_SKIP_RAW;
 
 	if (ipc6.tclass < 0)
 		ipc6.tclass = np->tclass;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index e74e0361fd92..beae0bd61738 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2226,6 +2226,7 @@  struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
 	if (rt) {
 		goto out;
 	} else if (unlikely((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) &&
+			    !(fl6->flowi6_flags & FLOWI_FLAG_SKIP_RAW) &&
 			    !res.nh->fib_nh_gw_family)) {
 		/* Create a RTF_CACHE clone which will not be
 		 * owned by the fib6 tree.  It is for the special case where