Message ID | 20221218234801.579114-1-jmaxwell37@gmail.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:e747:0:0:0:0:0 with SMTP id c7csp2115552wrn; Sun, 18 Dec 2022 16:10:30 -0800 (PST) X-Google-Smtp-Source: AA0mqf4icJbcHy4j3n01yBXl8GStxDMnw/QEb+eYjYCEhL6OlFNYaDcWeLo68gYrV487Am3TDXtd X-Received: by 2002:a05:6a20:a8a4:b0:af:7336:f468 with SMTP id ca36-20020a056a20a8a400b000af7336f468mr19326678pzb.20.1671408630021; Sun, 18 Dec 2022 16:10:30 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1671408630; cv=none; d=google.com; s=arc-20160816; b=fKY7ZBt9hzo9EBWRxhiYDOfmfb9pkUzSr/ZBj9iiv3sLeboDQ+oxSuWIpd3AuEzHf3 92uJrhOWdC5i3Z4lZ+D70aomH8LLgqor06MskJpULob46aVp4yzQhXHDe3zz8dvbR5DR 2Cyi+Ls4e4tdEyLXJlAVPyk5MIYIqDGiT0muONwSE4jg/4Yx0K6fShi8xol9quNhvKn0 NVr6h9/xk0nHCSMPtnVgmUKTktNgrwc/paumv2Z+Nt7c5Pj9MOPKDt/dDO+ehYuUaqJ0 NGZ7wCGMcaapnYlQ/2VCXLPduJko2HKRkHjhdLr1UvlGjFCJWNUEOG+QdYEvPGAad0e2 qqpQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:ironport-hdrordr:ironport-data :ironport-phdr; bh=5u+ixdiXdqP+yZl/MMQrwhE/kj9lvij5eGy7zGh5dzE=; b=VJbioyb52sEanE/LPL74HOKfixmkLYv+U3dOPTrNxSlKLQY4C7TV+WroibppVOuAzG zhRdUWvnBs3en5Tvzrs0zqCaiQUOfUu2Y4ocASQMkfN81YlrwHW4GV/6aUmvGgYDATTx h7FosmhKGizKBlHR5HbN4+F0ipWF63N8ell7+jJ1Jwqv3HZQqDRMNOk3NtRhyZkrduuq nv2/vbga7gKkqL2ciEwd/4XarWMMKFqKmY+lrPEAQYGbhvMozlWefJkpFpULMaxR5BBe rfGr4Grf2dNyZnCpTU39oOFG0AqCjd6+yaySlxpGFjuyYxhkZ79wTl0B6O0vXI3gABCP wJfQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: from out1.vger.email (out1.vger.email. 
[2620:137:e000::1:20]) by mx.google.com with ESMTP id e3-20020a656883000000b004620564706dsi9370239pgt.730.2022.12.18.16.10.13; Sun, 18 Dec 2022 16:10:29 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231197AbiLRXwe (ORCPT <rfc822;peekingduck44@gmail.com> + 99 others); Sun, 18 Dec 2022 18:52:34 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59200 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230231AbiLRXwb (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Sun, 18 Dec 2022 18:52:31 -0500 X-Greylist: delayed 123 seconds by postgrey-1.37 at lindbergh.monkeyblade.net; Sun, 18 Dec 2022 15:52:30 PST Received: from rpt-cro-asav1.external.tpg.com.au (rpt-cro-asav1.external.tpg.com.au [60.241.0.37]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 5D2706573 for <linux-kernel@vger.kernel.org>; Sun, 18 Dec 2022 15:52:30 -0800 (PST) X-Ironport-Abuse: host=202-168-35-241.tpgi.com.au, ip=202.168.35.241, date=12/19/22 10:49:21 X-SMTP-MATCH: 0 X-IPAS-Result: A2EaCgA9pp9j//EjqMpaHgEBCxIMSYZSlW6DFogNMAKTZA8BAwEBAQEBTQQBAT4BhEaFESY4EwEBAQQBAQEBAQIFAQEBAQEBAwEBAQUBAgEBAQQFAQEBAoEZhS9GgjgihAkrCwEpHSZcAk2CfoJuAQMxuUoWBQIWgQGeFwoZKA1oA4FkgUCEVVCCE4QoiBOEfYIggWuBBoVxBIEJlnABAQMCAgMCAgMGBAICAgUDAwIBAwQCCwUEDgMBAQICAQECBAgCAgMDAgIIDxUDBwIBBgUBAwECBgQCBAELAgIFAgEKAQIEAQICAgEFCQECAwEDAQwCAgYCAgMFBgQCBgQGAgIFAgEBAwICAg0DAgMCBAEFBQEBAhACBgQJAQYECwIFAQQEAQIFBwEDCQMCAgICCAQHBAUKGQMDAiADCQMHBUkCCQMjDwMLCQgHDAEWKAYDAQoHDCgEBAwoAQoNCAUBAgIBBwMDBQUCBw8DBAIBAwMCBQ8DAQYFAQIBAgICBAIIAgQFAgUDAgQCAwICCAMCAwECAQcEAwQBBAIEAw0EAwQCAwICBQICAgICBQICAwECAgICAgIFAgMCAQUBAgIBAgICBAECAgcEAgMBAwQOBAMCAgcBAgIBBgIHAwECAQQDAQEEAgQBAgUCBAEDBgIDAQMKAgMCAQECAwMFAwICCAgCAwUCBAEBAgQDBAICCwEGAgcCAgMCAgQEBAEBAgEEBQIDAQIDAwkCAgMCBAICCgEBAQECAQcCBAcGAgUCAgIDAQICAQMCAQICChEBAQIDAwMEBgUDAwMCARUFAgEBAgIDAwIGAgECCAIEAQQFAgECAQECAgQBCAICAQEBAgECAgMDAgECAgIEAwMBAgECAgMCAgIDAgIBDQIGBgECAgICAgICAgIGAQIBAgMBAgcCBAMCAQICBQICAgMBAQYCBAsBAwICAgIBCAEBAgUBAgICAwEBAwMEAwMFBgMCDAgBBQEDAR8DAgIIAgcCAQYDAgEPAwICAwICAQQKAgMFAgQCAQQIBwIEAQIJAwIGAgYFGAECAgcEDAoBAgIFBgQBAQIDAQIBAQIDAwIDAgQFAQUCAQIEAgICAQECBQ0BAQMEAgQCBwICAgMBBAIBAgEDAwIDAQEBAwYGAgQEAgMDBgICAgMCAQICAwQNAQUCAgYDBAENBQYFBAMCCAECAQEHAgQCBwkOAgECBAEFAgIDAgIBAwICAQIEAwECAgICBQcFAwQBBAMKCwMBAQQDAgECAQIDAgMHAwIEAgMBAgMEBgYBCQQGBAENAwQCAgECAQEDBAQEAgIBAgIDAQQCAgEBAwMDAgICAwQCAwMLBAoHAwMCAQULAgICAwIBAQMHBAUEAgIGAQIEAgICAgICAgMBAQMKBAIBAwICBgMGAgECAQkFAgEJAwECAQMEAQMJAQICBAkCAwcFCgICAgIIAgIOAwMCAQEEAgIEAwIJAQIHAgUBAQMFBwICAQICAQQDAQkEAQIDAgEBAxIDAwEEAgIFAwMNCQYCAgEDAgENAwECAQIDAQUFFwMIBxQDBQICBAQBCAICAwMDAgECCQYBAwEFAg4DAgIDAwYBAgEBAgMQAgMBAQEBFwEDBAIDAQQDAQECAQIDAg4EAQQFDAMBAhEMAgQBBgIIAgICAgMBAwMFAQIDBAIBCAYEAgICAgoCCgMCAwEDBQEDAgkDAQUBAgcCBgEBAQICCAIIAgMLAQMCAwYCAQICAQUCAQICBQMFAgICAgQNAgUCAgIGAQIHBAICAgQBAgIGAgUBAgcHAgUCAgIDAwoEBAIKBAEDAQEFAQIBAwQBAgQBAgECBQMGAgICAgECAgECAQEIAgICAgICAgMEAgUDnBMHexRLgSNwgRoBAZFWgxqqXIFFRCEJAQYCW4FWfBopmlmFbRoyqSSXQpEvkX2FBoFEgX9NI4EBbYFJUhkPnQRhOwIHCwEBAwmJSoJZAQE IronPort-PHdr: A9a23:MxaPLBHELoWxWUr71Y2mOJ1Gf0RKhN3EVzX9CrIZgr5DOp6u447ld 
BSGo6k30RmVB86CsbptsKn/jePJYSQ4+5GPsXQPItRndiQuroEopTEmG9OPEkbhLfTnPGQQF cVGU0J5rTngaRAGUMnxaEfPrXKs8DUcBgvwNRZvJuTyB4Xek9m72/q99pHNYwhEnjWwba19I Bmrswnaq9Ubj5ZlJqstxRTFpWdFdf5Lzm1yP1KTmBj85sa0/JF99ilbpuws+c1dX6jkZqo0V bNXAigoPGAz/83rqALMTRCT6XsGU2UZiQRHDg7Y5xznRJjxsy/6tu1g2CmGOMD9UL45VSi+4 6ptVRTljjoMOTwk/2HNksF+jLxVrQy8qRJxwIDaZ46aOvVlc6/Bft4XX3ZNU9xNWyBdBI63c osBD/AGPeZdt4Tzo1wOrR2jDgerHuzuxTFJiWHy3a0+zu8sFgPG3Ak6ENMBvnXbstH1NKMcX O2316TIwjDDYOlX2Tf58oTHbhchofSVUL92bMHexlUhGRnfgVWMtYzqISmV1uIVvmWb7+RtW /+ihm0ppQ9xrTWixNkgh4bUi44L1F3J8SV0zZgpKNC4R0N1bsOoHpRRui2GKod7TMwsTW5pt Sg1ybALv4OwcisSyJk/2RLTd+KLf5KV7h/iV+udOzl1iXJ/dL6hiBu+7E6twfDmWMauylZFt C9Fn8HJtnAKyhPc9NCKSuB4/ke9wTaP0B3T6v1cLUA0i6XbL5khz6YylpoWq0vCESH3l1vyj K+SbEkr5u+o6+H/brXnoJ+TKZN0hxngPqgyhMCzG/k0PwkNUmSB9+mx1Kfv8VP2TblXlvE2l 7PWsJHeJcQVvK65BApV354h6xa6FTin39oZkmcDLFJBdh+KjZPkO17LIP/iDPe/h06gnytsx /DDJrHhBI7CIWDZkLj9ZbZ991JcyA0rwN1b/55UEK0OIOrvWk/ts9zVFgI2PBaqw+n5DdVwz Z4RVniRAqCHNaPStViI5uwzI+WWYo8apir9J+A/5/HylX85hUMdfa6x0JQJdX+4A/FmLF+YY HXyntcMCmgKvg05TOzljF2NTyRfaGq1X6I5/j07Ep6pDZ/fRoCxh7yMxCS7HoBNaW9cEV2ME mnnd5+CW/gSbCKeOMhhkiYLVbS5UY8uyQmutBPmy7pgNufU4jcXuon929hz5u3ejgsy+iJpA MSdyW6NU3t4kX8PRz8zxKx/u1Byyk+f0ahkhPxVDdxS5/RSUgc6O57c0u56C9HpVwLFf9eJT kumQ9q/DTEwVtIx3d4Db1x6G9W4gRDPxzCqDKMNl7yXGJw09brR0GXqJ8lny3bJyrMhj189T 8tMK2KmnKh/+BbXB4LTlEWZjamqebwG3CHR7GeD0XaOvEZAXQ52T6rFQW0QaVXIrdni+EPCQ KGhCa49PgtC18GCMK1KZcPtjVlcQ/fjItveb3qrm2isHRaI2q+MbI3ydmUZ3SXdDlUEkg8K8 XaFKwc+HCGhrHzaDDF1C1LvbF3j8fNkpHO4UEA01QeKYFNl17av/R4Vn/OcR+sJ3r0YoCcht yl0HFGl0tLUDtqPvQVgfatCbtM55FdK22DUuhdyPpylNa9ih1oefx5rsEPp0hUkQrlHxMQjs n4v5AZ7N6+d1FRPa3We0IyjFKfQLzzQ+xbnTqfGxVza1J7C+KIG+Os1r1G44ymmE0Mj9zNs1 NwDgCjU3YnDEAdHCcG5aU0w7RUv/9nn IronPort-Data: A9a23:N27Pqq6Cc9TkRZ5fs5nKfAxRtMvGchMFZxGqfqrLsTDasY5as4F+v jcbCz3QOffbN2ShfN10b4u190MB6JLRydMwHQc4qCphEysa+MHILOrCIxarNUt+DCFioGGLT Sk6QoOdRCzhZiaE/n9BCpC48T8mk/jgqoPUUIbsIjp2SRJvVBAvgBdin/9RqoNziLBVOSvU0 T/Ji5CZaQHNNwJcaDpOsfvZ8UM35pwehRtB1rAATaAT1LPhvyRNZH4vDfnZB2f1RIBSAtm7S 47rpF1u1jqEl/uFIorNfofTKiXmcJaLVeS9oiM+t5yZv/R3jndaPpDXlhYrQRw/Zz2hx7idw f0R7sboEV9B0qfkwIzxWDEAe81y0DEvFBYq7hFTvOTKp3AqfUcAzN1uDmwLGJM5yNpXKnNs9 sY0bzQWbBmM0rfeLLKTEoGAh+whKcD7I44bvjdryjSx4fQOG8iZBfyUtZkDgXFq2pkm8fX2P qL1bRJtaR3QfBBLPgxIIJ07leaswHL4dlW0rXrP+PpqvzKKnVMZPL7FEPveVM6IS/hssk+cp SGbx339OB5KO4nKodaC2jf27gPVpgv3UZwfEZW0/+BnhVmUyHBVDhAKPXO2reS8g1yzR/pQL Esb/idopq83nGSoU9P0dx61uniJulgbQdU4O/Uz4gyLy4LO7gqZD3RCRTlEAPQ3s9Q2SyEo1 3eNntX0FXluqKPLD3WH+d+8oSi7OSUPK0cBaDUCQA9D5MPsyKk1gw7DQ8hLDqG4lJv2FCv2z jTMqzIx750XjMgWx+C48ErBjjaEuJfEVEg26x/RU2bj6Rl2DKanYoW49lXf6a0fBImcR1iF+ nMDnqCjAPsmV8nX0XXTEKBWQfTzu6/DLCXTgBhkGJxn/inFF2OfQL28KQpWfC9BWvvosxe3P Sc/ZSs5CFRv0LdGoEO5j09dyyjn8EQ4KenYaw== IronPort-HdrOrdr: A9a23:7aP4/an013jIMiWkJJdxObPrImTpDfI13DAbv31ZSRFFG/FwWf rCoB19726WtN9/Yh4dcLy7U5VoIkm9yXcK2+cs1N6ZNWHbUQCTQL2Kg7GJ/9SZIUzDytI= X-IronPort-Anti-Spam-Filtered: true X-IronPort-AV: E=Sophos;i="5.96,254,1665406800"; d="scan'208";a="258542197" Received: from 202-168-35-241.tpgi.com.au (HELO jmaxwell.com) ([202.168.35.241]) by rpt-cro-asav1.external.tpg.com.au with ESMTP; 19 Dec 2022 10:49:03 +1100 From: Jon Maxwell <jmaxwell37@gmail.com> To: davem@davemloft.net Cc: edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, yoshfuji@linux-ipv6.org, dsahern@kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Jon Maxwell <jmaxwell37@gmail.com> Subject: [net-next] ipv6: fix routing cache overflow for raw sockets Date: Mon, 19 Dec 2022 10:48:01 +1100 Message-Id: <20221218234801.579114-1-jmaxwell37@gmail.com> X-Mailer: 
git-send-email 2.31.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=0.9 required=5.0 tests=BAYES_00,DKIM_ADSP_CUSTOM_MED, FORGED_GMAIL_RCVD,FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM, NML_ADSP_CUSTOM_MED,SPF_HELO_NONE,SPF_SOFTFAIL,SPOOFED_FREEMAIL, SPOOF_GMAIL_MID autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1752598975457702968?= X-GMAIL-MSGID: =?utf-8?q?1752598975457702968?= |
Series | [net-next] ipv6: fix routing cache overflow for raw sockets
Commit Message
Jon Maxwell
Dec. 18, 2022, 11:48 p.m. UTC
Sending IPv6 packets in a loop via a raw socket triggers an issue where a
route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
exhausts the IPv6 max_size threshold, which defaults to 4096, resulting in
these warnings:
[1] 99.187805] dst_alloc: 7728 callbacks suppressed
[2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
.
.
[300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
When this happens the packet is dropped and sendto() returns a "Network is
unreachable" error (errno 101, ENETUNREACH):
# ./a.out -s
remaining pkt 200557 errno 101
remaining pkt 196462 errno 101
.
.
remaining pkt 126821 errno 101
Fix this by adding a flag that prevents the cloning of routes for raw
sockets. This makes the IPv6 routing code use per-cpu routes instead, which
prevents packet drops due to max_size overflow.
IPv4 is not affected because it has a very large default max_size.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
---
include/net/flow.h | 1 +
net/ipv6/raw.c | 2 +-
net/ipv6/route.c | 1 +
3 files changed, 3 insertions(+), 1 deletion(-)
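A minimal reproducer along the lines of the a.out program referenced above
might look like the sketch below. The original source was not posted in this
thread, so the addresses, packet count, and hdrincl-style header construction
here are assumptions; it needs CAP_NET_RAW to run:

/* Sketch of a raw-socket flood reproducer (illustrative, not the
 * original a.out). Each sendto() on an IPPROTO_RAW (hdrincl) IPv6
 * socket takes the route-lookup path that clones a dst per packet. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/ip6.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in6 dst = { .sin6_family = AF_INET6 };
	struct ip6_hdr hdr;
	long pkts = 5000000;
	int fd = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW); /* hdrincl */

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	memset(&hdr, 0, sizeof(hdr));
	hdr.ip6_vfc = 6 << 4;       /* IP version 6 */
	hdr.ip6_nxt = IPPROTO_NONE; /* no upper-layer payload */
	hdr.ip6_hlim = 64;
	inet_pton(AF_INET6, "2001:db8::1", &hdr.ip6_src); /* assumed addrs */
	inet_pton(AF_INET6, "2001:db8::2", &hdr.ip6_dst);
	dst.sin6_addr = hdr.ip6_dst;

	while (pkts-- > 0) {
		if (sendto(fd, &hdr, sizeof(hdr), 0,
			   (struct sockaddr *)&dst, sizeof(dst)) < 0)
			/* with the overflow, errno is 101 (ENETUNREACH) */
			printf("remaining pkt %ld errno %d\n", pkts, errno);
	}
	close(fd);
	return 0;
}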
Comments
On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> these warnings:
>
> [1] 99.187805] dst_alloc: 7728 callbacks suppressed
> [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> .
> .
> [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.

If I read correctly, the maximum number of dst that the raw socket can
use this way is limited by the number of packets it allows via the
sndbuf limit, right?

Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
ipvs, seg6?

@DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?

Thanks,

Paolo
On 12/20/22 5:35 AM, Paolo Abeni wrote:
> On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
>> Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
>> route is cloned by ip6_rt_cache_alloc() for each packet sent. [snip]
>
> If I read correctly, the maximum number of dst that the raw socket can
> use this way is limited by the number of packets it allows via the
> sndbuf limit, right?
>
> Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> ipvs, seg6?
>
> @DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?

If I recall the details correctly: that sysctl limit was added back when
ipv6 routes were managed as dst_entries and there was a desire to allow
an admin to limit the memory consumed. At this point in time, IPv6 is
more in line with IPv4 - a separate struct for fib entries from dst
entries. That "Route cache is full" message is now out of date since
this is dst_entries which have a gc mechanism.

IPv4 does not limit the number of dst_entries that can be allocated
(ip_rt_max_size is the sysctl variable behind the ipv4 version of
max_size and it is a no-op). IPv6 can probably do the same here?

I do not believe the suggested flag is the right change.
Hello,

On Tue, 20 Dec 2022, Paolo Abeni wrote:

> Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> ipvs, seg6?

For IPVS there is no sndbuf limit. IPVS uses this flag when receiving
packets from the world (destined to some local Virtual IP) and then
diverts/redirects the packets (without changing daddr) to one of its
backend servers on the LAN (no RTF_GATEWAY on such routes). So, for each
packet IPVS requests an output route with the FLOWI_FLAG_KNOWN_NH flag and
then sends the packet to the backend server (nexthop) using this route
attached to the skb. Packet rate is usually high. The goal is for the
nexthop to be taken from the route, not from the IP header. KNOWN_NH means
"nexthop is provided in route, not in daddr". As for the implementation
details in ipv6, I can not comment. But all users that set the flag want
this: to send packets where daddr can be != nexthop.

> @DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?

Regards

--
Julian Anastasov <ja@ssi.bg>
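To illustrate the pattern Julian describes, a kernel-side output-route
lookup that pins the nexthop looks roughly like the sketch below. This is a
simplified illustration, not actual IPVS code; net, skb, dev, and
nexthop_addr are assumed context, and error handling is minimal:

/* Sketch (needs <net/ip6_route.h> and <net/flow.h>): ask the IPv6
 * routing code for an output route whose nexthop is taken from the
 * route itself rather than from the packet's daddr. */
struct flowi6 fl6 = {
	.flowi6_oif   = dev->ifindex,
	.daddr        = *nexthop_addr,       /* backend/nexthop address */
	.flowi6_flags = FLOWI_FLAG_KNOWN_NH, /* nexthop comes from the route */
};
struct dst_entry *dst;

dst = ip6_route_output(net, NULL, &fl6);
if (dst->error) {
	dst_release(dst);
	return -EHOSTUNREACH;
}
/* attach the route; the skb's daddr may differ from the nexthop */
skb_dst_set(skb, dst);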
Hello,

On Tue, 20 Dec 2022, Paolo Abeni wrote:

> Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> ipvs, seg6?

I forgot to mention one thing: IPVS can cache such routes in its own
storage, one per backend server, and it still calls dst->ops->check for
them. Such a route can live for a long time, which is why these routes
were created as uncached. So, IPVS requests one route, remembers it, and
can then attach it to multiple packets for this backend server with
skb_dst_set_noref. IPVS would therefore have to use 4096 backend servers
to hit this limit.

It does not look correct in this patch to invalidate the
FLOWI_FLAG_KNOWN_NH flag with a FLOWI_FLAG_SKIP_RAW flag. That would be
the same as not setting FLOWI_FLAG_KNOWN_NH at all, which is wrong for
the hdrincl case.

Regards

--
Julian Anastasov <ja@ssi.bg>
On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> If I read correctly, the maximum number of dst that the raw socket can
> use this way is limited by the number of packets it allows via the
> sndbuf limit, right?

Yes, but in my test the sndbuf limit is never hit, so it clones a route
for every packet.

e.g.: output from a C program sending 5000000 packets via a raw socket:

ip raw: total num pkts 5000000

# bpftrace -e 'kprobe:dst_alloc {@count[comm] = count()}'
Attaching 1 probe...

@count[a.out]: 5000009

> Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> ipvs, seg6?

Any call to ip6_pol_route() where res.nh->fib_nh_gw_family is 0 can do
it. But we have only seen this for raw sockets so far.

Regards

Jon
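For readers following along in the source, the branch Jon refers to in
net/ipv6/route.c looks approximately like the excerpt below (paraphrased
from ip6_pol_route(); exact lines and comments vary by kernel version):

	} else if (unlikely((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) &&
			    !res.nh->fib_nh_gw_family)) {
		/* Create a RTF_CACHE clone which will not be owned by
		 * the fib6 tree: the daddr used for the neighbour
		 * lookup can differ from fl6->daddr, so a per-flow dst
		 * is allocated for every such packet, which is what
		 * exhausts max_size here.
		 */
		rt = ip6_rt_cache_alloc(&res, &fl6->daddr, NULL);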
On Wed, Dec 21, 2022 at 2:10 AM David Ahern <dsahern@kernel.org> wrote:
>
> If I recall the details correctly: that sysctl limit was added back when
> ipv6 routes were managed as dst_entries and there was a desire to allow
> an admin to limit the memory consumed. At this point in time, IPv6 is
> more in line with IPv4 - a separate struct for fib entries from dst
> entries. That "Route cache is full" message is now out of date since
> this is dst_entries which have a gc mechanism.
>
> IPv4 does not limit the number of dst_entries that can be allocated
> (ip_rt_max_size is the sysctl variable behind the ipv4 version of
> max_size and it is a no-op). IPv6 can probably do the same here?

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index dbc224023977..701aba7feaf5 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6470,7 +6470,7 @@ static int __net_init ip6_route_net_init(struct net *net)
 #endif
 
 	net->ipv6.sysctl.flush_delay = 0;
-	net->ipv6.sysctl.ip6_rt_max_size = 4096;
+	net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
 	net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
 	net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
 	net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;

The above patch resolved it for the Ipv6 reproducer.

Would that be sufficient?

> I do not believe the suggested flag is the right change.

Regards

Jon
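As a side note (not from the thread itself), the limit and the GC threshold
being discussed are visible at runtime via standard sysctls, which makes it
easy to confirm what a given kernel ships with:

# defaults before this change: gc_thresh 1024, max_size 4096
sysctl net.ipv6.route.gc_thresh
sysctl net.ipv6.route.max_size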
On Wed, Dec 21, 2022 at 8:55 AM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
>
> The above patch resolved it for the Ipv6 reproducer.
>
> Would that be sufficient?

Otherwise, if you prefer to make Ipv6 behaviour similar to IPv4 rather
than upping max_size, here is a prototype patch that removes the
max_size check for Ipv6:

diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
index 88ff7bb2bb9b..632086b2f644 100644
--- a/include/net/dst_ops.h
+++ b/include/net/dst_ops.h
@@ -16,7 +16,7 @@ struct dst_ops {
 	unsigned short family;
 	unsigned int gc_thresh;
 
-	int (*gc)(struct dst_ops *ops);
+	void (*gc)(struct dst_ops *ops);
 	struct dst_entry *(*check)(struct dst_entry *, __u32 cookie);
 	unsigned int (*default_advmss)(const struct dst_entry *);
 	unsigned int (*mtu)(const struct dst_entry *);
diff --git a/net/core/dst.c b/net/core/dst.c
index 497ef9b3fc6a..dcb85267bc4c 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
 
 	if (ops->gc &&
 	    !(flags & DST_NOCOUNT) &&
-	    dst_entries_get_fast(ops) > ops->gc_thresh) {
-		if (ops->gc(ops)) {
-			pr_notice_ratelimited("Route cache is full: consider increasing sysctl net.ipv6.route.max_size.\n");
-			return NULL;
-		}
-	}
+	    dst_entries_get_fast(ops) > ops->gc_thresh)
+		ops->gc(ops);
 
 	dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
 	if (!dst)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index dbc224023977..8db7c5436da4 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct dst_entry *);
 static void		ip6_dst_destroy(struct dst_entry *);
 static void		ip6_dst_ifdown(struct dst_entry *,
 				       struct net_device *dev, int how);
-static int		ip6_dst_gc(struct dst_ops *ops);
+static void		ip6_dst_gc(struct dst_ops *ops);
 
 static int		ip6_pkt_discard(struct sk_buff *skb);
 static int		ip6_pkt_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb);
@@ -3295,32 +3295,21 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
 	return dst;
 }
 
-static int ip6_dst_gc(struct dst_ops *ops)
+static void ip6_dst_gc(struct dst_ops *ops)
 {
 	struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
-	int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
-	int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
 	int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
 	int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
-	unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
 	int entries;
 
 	entries = dst_entries_get_fast(ops);
-	if (entries > rt_max_size)
-		entries = dst_entries_get_slow(ops);
-
-	if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
-	    entries <= rt_max_size)
-		goto out;
 
 	net->ipv6.ip6_rt_gc_expire++;
 	fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net, true);
 	entries = dst_entries_get_slow(ops);
 	if (entries < ops->gc_thresh)
 		net->ipv6.ip6_rt_gc_expire = rt_gc_timeout>>1;
-out:
 	net->ipv6.ip6_rt_gc_expire -= net->ipv6.ip6_rt_gc_expire>>rt_elasticity;
-	return entries > rt_max_size;
 }
 
 static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,

> > I do not believe the suggested flag is the right change.

Regards

Jon
On Wed, Dec 21, 2022 at 3:31 PM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
>
> Otherwise, if you prefer to make Ipv6 behaviour similar to IPv4 rather
> than upping max_size, here is a prototype patch that removes the
> max_size check for Ipv6:
>
> [snip prototype patch]

There are some mistakes in this prototype patch. Let me come up with a
better one. I'll submit a new patch in the new year for review. Thanks
for the suggestion DavidA.

Regards

Jon
On 12/21/22 10:39 PM, Jonathan Maxwell wrote:
> There are some mistakes in this prototype patch. Let me come up with a
> better one. I'll submit a new patch in the new year for review. Thanks
> for the suggestion DavidA.

You are welcome. When you submit the next one, it would be helpful to
show the change in memory consumption and a comparison to IPv4 for a
similar raw socket program.
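One way to gather such numbers (a suggestion, not something specified in the
thread) is to watch the IPv6 dst slab and the dst allocation/free rates
while the reproducer runs:

# live object count for the IPv6 dst slab
grep ip6_dst_cache /proc/slabinfo

# allocation vs. free counts during a test run
bpftrace -e 'kprobe:dst_alloc { @alloc[comm] = count(); }
             kprobe:dst_destroy { @free[comm] = count(); }'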
On Fri, Dec 23, 2022 at 3:17 AM David Ahern <dsahern@kernel.org> wrote:
>
> You are welcome. When you submit the next one, it would be helpful to
> show the change in memory consumption and a comparison to IPv4 for a
> similar raw socket program.

Sure, will do. I have an ipv4 version of the test program and a better
patch. Initial results are quite similar for resource usage. The ipv6 GC
does a good job at managing memory resources with the max_size threshold
removed. I'll include some comparison stats after I test this extensively
when I submit the new patch.
Hi Jon,
please see below, thanks.

On Wed, 21 Dec 2022 08:48:11 +1100
Jonathan Maxwell <jmaxwell37@gmail.com> wrote:

> Any call to ip6_pol_route() where res.nh->fib_nh_gw_family is 0 can do
> it. But we have only seen this for raw sockets so far.

In the SRv6 subsystem, seg6_lookup_nexthop() is used by some
cross-connecting behaviors such as End.X and End.DX6 to forward traffic
to a specified nexthop. SRv6 End.X/DX6 can specify an IPv6 DA (i.e., a
nexthop) different from the one carried by the IPv6 header. For this
purpose, seg6_lookup_nexthop() sets the FLOWI_FLAG_KNOWN_NH.

I can reproduce the same warning messages reported by you, by
instantiating an End.X behavior whose nexthop is handled by a route for
which there is no "via". In this configuration, ip6_pol_route() (called
by seg6_lookup_nexthop()) triggers ip6_rt_cache_alloc() because i) the
FLOWI_FLAG_KNOWN_NH is present and ii) the res.nh->fib_nh_gw_family is 0
(as already pointed out).

Ciao,
Andrea
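For reference, an End.X behavior of the kind Andrea describes can be
instantiated with iproute2 roughly as follows; the SID, nexthop address,
and device below are illustrative:

# bind a SID to SRv6 End.X with an explicit IPv6 nexthop; if the route
# to the nexthop has no "via", each packet previously triggered
# ip6_rt_cache_alloc()
ip -6 route add 2001:db8:100::1/128 \
        encap seg6local action End.X nh6 2001:db8:1::2 dev eth0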
On Sat, Dec 24, 2022 at 7:28 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
>
> In the SRv6 subsystem, seg6_lookup_nexthop() is used by some
> cross-connecting behaviors such as End.X and End.DX6 to forward traffic
> to a specified nexthop. SRv6 End.X/DX6 can specify an IPv6 DA (i.e., a
> nexthop) different from the one carried by the IPv6 header. For this
> purpose, seg6_lookup_nexthop() sets the FLOWI_FLAG_KNOWN_NH.

Hi Andrea,

Thanks for pointing that datapath out. The more generic approach we are
taking, bringing Ipv6 closer to Ipv4 in this regard, should fix all
instances of this.

> I can reproduce the same warning messages reported by you, by
> instantiating an End.X behavior whose nexthop is handled by a route for
> which there is no "via".

Nice. When I get back after the holiday break I'll submit the next
patch. It would be great if you could test the new patch and let me know
how it works in your tests at that juncture. I'll keep you posted.

Regards

Jon
Hi Andrea,

Happy New Year.

Any chance you could test this patch based on the latest net-next
kernel and let me know the result?

diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
index 88ff7bb2bb9b..632086b2f644 100644
--- a/include/net/dst_ops.h
+++ b/include/net/dst_ops.h
@@ -16,7 +16,7 @@ struct dst_ops {
 	unsigned short family;
 	unsigned int gc_thresh;
 
-	int (*gc)(struct dst_ops *ops);
+	void (*gc)(struct dst_ops *ops);
 	struct dst_entry *(*check)(struct dst_entry *, __u32 cookie);
 	unsigned int (*default_advmss)(const struct dst_entry *);
 	unsigned int (*mtu)(const struct dst_entry *);
diff --git a/net/core/dst.c b/net/core/dst.c
index 6d2dd03dafa8..31c08a3386d3 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
 
 	if (ops->gc &&
 	    !(flags & DST_NOCOUNT) &&
-	    dst_entries_get_fast(ops) > ops->gc_thresh) {
-		if (ops->gc(ops)) {
-			pr_notice_ratelimited("Route cache is full: consider increasing sysctl net.ipv6.route.max_size.\n");
-			return NULL;
-		}
-	}
+	    dst_entries_get_fast(ops) > ops->gc_thresh)
+		ops->gc(ops);
 
 	dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
 	if (!dst)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index e74e0361fd92..b643dda68d31 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct dst_entry *);
 static void		ip6_dst_destroy(struct dst_entry *);
 static void		ip6_dst_ifdown(struct dst_entry *,
 				       struct net_device *dev, int how);
-static int		ip6_dst_gc(struct dst_ops *ops);
+static void		ip6_dst_gc(struct dst_ops *ops);
 
 static int		ip6_pkt_discard(struct sk_buff *skb);
 static int		ip6_pkt_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb);
@@ -3284,11 +3284,10 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
 	return dst;
 }
 
-static int ip6_dst_gc(struct dst_ops *ops)
+static void ip6_dst_gc(struct dst_ops *ops)
 {
 	struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
 	int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
-	int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
 	int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
 	int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
 	unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
@@ -3296,11 +3295,10 @@ static int ip6_dst_gc(struct dst_ops *ops)
 	int entries;
 
 	entries = dst_entries_get_fast(ops);
-	if (entries > rt_max_size)
+	if (entries > ops->gc_thresh)
 		entries = dst_entries_get_slow(ops);
 
-	if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
-	    entries <= rt_max_size)
+	if (time_after(rt_last_gc + rt_min_interval, jiffies))
 		goto out;
 
 	fib6_run_gc(atomic_inc_return(&net->ipv6.ip6_rt_gc_expire), net, true);
@@ -3310,7 +3308,6 @@ static int ip6_dst_gc(struct dst_ops *ops)
 out:
 	val = atomic_read(&net->ipv6.ip6_rt_gc_expire);
 	atomic_set(&net->ipv6.ip6_rt_gc_expire, val - (val >> rt_elasticity));
-	return entries > rt_max_size;
 }
 
 static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,
@@ -6512,7 +6509,7 @@ static int __net_init ip6_route_net_init(struct net *net)
 #endif
 
 	net->ipv6.sysctl.flush_delay = 0;
-	net->ipv6.sysctl.ip6_rt_max_size = 4096;
+	net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
 	net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
 	net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
 	net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;

On Sat, Dec 24, 2022 at 6:38 PM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
>
> [snip earlier thread]
Hi Jon,
please see below, thanks.

On Tue, 3 Jan 2023 10:59:50 +1100
Jonathan Maxwell <jmaxwell37@gmail.com> wrote:

> Hi Andrea,
>
> Happy New Year.

Thank you, Happy New Year to you too and everybody on the mailing list
as well.

> Any chance you could test this patch based on the latest net-next
> kernel and let me know the result?
>
> [snip patch]

Yes, I will apply this patch in the next days and check how it deals
with the seg6 subsystem. I will keep you posted.

Ciao,
Andrea
Hi Jon,
please see below, thanks.

> > Any chance you could test this patch based on the latest net-next
> > kernel and let me know the result?
> >
> > diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
> > index 88ff7bb2bb9b..632086b2f644 100644
> > --- a/include/net/dst_ops.h
> > +++ b/include/net/dst_ops.h
> > @@ -16,7 +16,7 @@ struct dst_ops {
> >  	unsigned short family;
> >  	unsigned int gc_thresh;
> >
> > -	int (*gc)(struct dst_ops *ops);
> > +	void (*gc)(struct dst_ops *ops);
> >  	struct dst_entry * (*check)(struct dst_entry *, __u32 cookie);
> >  	unsigned int (*default_advmss)(const struct dst_entry *);
> >  	unsigned int (*mtu)(const struct dst_entry *);
> > diff --git a/net/core/dst.c b/net/core/dst.c
> > index 6d2dd03dafa8..31c08a3386d3 100644
> > --- a/net/core/dst.c
> > +++ b/net/core/dst.c
> > @@ -82,12 +82,8 @@ void *dst_alloc(struct dst_ops *ops, struct net_device *dev,
> >
> >  	if (ops->gc &&
> >  	    !(flags & DST_NOCOUNT) &&
> > -	    dst_entries_get_fast(ops) > ops->gc_thresh) {
> > -		if (ops->gc(ops)) {
> > -			pr_notice_ratelimited("Route cache is full: consider increasing sysctl net.ipv6.route.max_size.\n");
> > -			return NULL;
> > -		}
> > -	}
> > +	    dst_entries_get_fast(ops) > ops->gc_thresh)
> > +		ops->gc(ops);
> >
> >  	dst = kmem_cache_alloc(ops->kmem_cachep, GFP_ATOMIC);
> >  	if (!dst)
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index e74e0361fd92..b643dda68d31 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -91,7 +91,7 @@ static struct dst_entry *ip6_negative_advice(struct dst_entry *);
> >  static void ip6_dst_destroy(struct dst_entry *);
> >  static void ip6_dst_ifdown(struct dst_entry *,
> >  			   struct net_device *dev, int how);
> > -static int ip6_dst_gc(struct dst_ops *ops);
> > +static void ip6_dst_gc(struct dst_ops *ops);
> >
> >  static int ip6_pkt_discard(struct sk_buff *skb);
> >  static int ip6_pkt_discard_out(struct net *net, struct sock *sk, struct sk_buff *skb);
> > @@ -3284,11 +3284,10 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
> >  	return dst;
> >  }
> >
> > -static int ip6_dst_gc(struct dst_ops *ops)
> > +static void ip6_dst_gc(struct dst_ops *ops)
> >  {
> >  	struct net *net = container_of(ops, struct net, ipv6.ip6_dst_ops);
> >  	int rt_min_interval = net->ipv6.sysctl.ip6_rt_gc_min_interval;
> > -	int rt_max_size = net->ipv6.sysctl.ip6_rt_max_size;
> >  	int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
> >  	int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
> >  	unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
> > @@ -3296,11 +3295,10 @@ static int ip6_dst_gc(struct dst_ops *ops)
> >  	int entries;
> >
> >  	entries = dst_entries_get_fast(ops);
> > -	if (entries > rt_max_size)
> > +	if (entries > ops->gc_thresh)
> >  		entries = dst_entries_get_slow(ops);
> >
> > -	if (time_after(rt_last_gc + rt_min_interval, jiffies) &&
> > -	    entries <= rt_max_size)
> > +	if (time_after(rt_last_gc + rt_min_interval, jiffies))
> >  		goto out;
> >
> >  	fib6_run_gc(atomic_inc_return(&net->ipv6.ip6_rt_gc_expire), net, true);
> > @@ -3310,7 +3308,6 @@ static int ip6_dst_gc(struct dst_ops *ops)
> >  out:
> >  	val = atomic_read(&net->ipv6.ip6_rt_gc_expire);
> >  	atomic_set(&net->ipv6.ip6_rt_gc_expire, val - (val >> rt_elasticity));
> > -	return entries > rt_max_size;
> >  }
> >
> >  static int ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg,
> > @@ -6512,7 +6509,7 @@ static int __net_init ip6_route_net_init(struct net *net)
> >  #endif
> >
> >  	net->ipv6.sysctl.flush_delay = 0;
> > -	net->ipv6.sysctl.ip6_rt_max_size = 4096;
> > +	net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
> >  	net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
> >  	net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
> >  	net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
> >
>
> Yes, I will apply this patch in the next days and check how it deals with the
> seg6 subsystem. I will keep you posted.
>

I applied the patch* to net-next (HEAD 6bd4755c7c49) and did some tests on
the seg6 subsystem, specifically running the End.X/DX6 behaviors. They seem to
work fine.

(*) I had to slightly edit the patch because of the code formatting, e.g.
some incorrect line breaks, spaces, etc.

Ciao,
Andrea

>
> > On Sat, Dec 24, 2022 at 6:38 PM Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
> > >
> > > On Sat, Dec 24, 2022 at 7:28 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
> > > >
> > > > Hi Jon,
> > > > please see below, thanks.
> > > >
> > > > On Wed, 21 Dec 2022 08:48:11 +1100
> > > > Jonathan Maxwell <jmaxwell37@gmail.com> wrote:
> > > >
> > > > > On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > > >
> > > > > > On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > > > > > > Sending IPv6 packets in a loop via a raw socket triggers an issue where a
> > > > > > > route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > > > > > > consumes the IPv6 max_size threshold, which defaults to 4096, resulting in
> > > > > > > these warnings:
> > > > > > >
> > > > > > > [1] 99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > > .
> > > > > > > .
> > > > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > >
> > > > > > If I read correctly, the maximum number of dst that the raw socket can
> > > > > > use this way is limited by the number of packets it allows via the
> > > > > > sndbuf limit, right?
> > > > > >
> > > > >
> > > > > Yes, but in my test the sndbuf limit is never hit, so it clones a route for
> > > > > every packet.
> > > > >
> > > > > e.g.:
> > > > >
> > > > > output from a C program sending 5000000 packets via a raw socket:
> > > > >
> > > > > ip raw: total num pkts 5000000
> > > > >
> > > > > # bpftrace -e 'kprobe:dst_alloc {@count[comm] = count()}'
> > > > > Attaching 1 probe...
> > > > >
> > > > > @count[a.out]: 5000009
> > > > >
> > > > > > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > > > > > ipvs, seg6?
> > > > > >
> > > > >
> > > > > Any call to ip6_pol_route(s) where res.nh->fib_nh_gw_family is 0 can do it.
> > > > > But we have only seen this for raw sockets so far.
> > > > >
> > > >
> > > > In the SRv6 subsystem, seg6_lookup_nexthop() is used by some
> > > > cross-connecting behaviors such as End.X and End.DX6 to forward traffic to a
> > > > specified nexthop. SRv6 End.X/DX6 can specify an IPv6 DA (i.e., a nexthop)
> > > > different from the one carried by the IPv6 header. For this purpose,
> > > > seg6_lookup_nexthop() sets the FLOWI_FLAG_KNOWN_NH.
> > > >
> > > Hi Andrea,
> > >
> > > Thanks for pointing that datapath out. The more generic approach we are
> > > taking, bringing IPv6 closer to IPv4 in this regard, should fix all instances
> > > of this.
> > >
> > > > > > > [1] 99.187805] dst_alloc: 7728 callbacks suppressed
> > > > > > > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > > > > > .
> > > > > > > .
> > > > > > > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > > >
> > > > I can reproduce the same warning messages reported by you, by instantiating an
> > > > End.X behavior whose nexthop is handled by a route for which there is no "via".
> > > > In this configuration, ip6_pol_route() (called by seg6_lookup_nexthop())
> > > > triggers ip6_rt_cache_alloc() because i) the FLOWI_FLAG_KNOWN_NH is present and
> > > > ii) the res.nh->fib_nh_gw_family is 0 (as already pointed out).
> > > >
> > >
> > > Nice, when I get back after the holiday break I'll submit the next patch. It
> > > would be great if you could test the new patch and let me know how it works in
> > > your tests at that juncture. I'll keep you posted.
> > >
> > > Regards
> > >
> > > Jon
> > >
> > > > > Regards
> > > > >
> > > > > Jon
> > > >
> > > > Ciao,
> > > > Andrea
>
> --
> Andrea Mayer <andrea.mayer@uniroma2.it>
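The raw-socket reproducer is only described in the thread, never posted. The following is a minimal, hypothetical sketch of such a sender: the documentation-range addresses, the empty payload, and the minimal error handling are all invented here, not taken from Jon's program. The one load-bearing choice is opening the socket with IPPROTO_RAW, which implies header-inclusion (hdrincl) mode; rawv6_sendmsg() then sets FLOWI_FLAG_KNOWN_NH on the flow (see the rawv6_sendmsg() hunk in the follow-up diff at the end of this page), and each sendto() can clone a route via ip6_rt_cache_alloc().

/* raw6_flood.c - hypothetical sketch of the raw-socket sender described
 * in the thread. Needs CAP_NET_RAW; addresses are 2001:db8::/32
 * documentation placeholders.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip6.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in6 dst = { .sin6_family = AF_INET6 };
	struct ip6_hdr hdr;
	long i, npkts = 5000000;
	int fd;

	/* IPPROTO_RAW implies hdrincl: we supply the IPv6 header ourselves,
	 * and rawv6_sendmsg() sets FLOWI_FLAG_KNOWN_NH for such sockets. */
	fd = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	inet_pton(AF_INET6, "2001:db8::2", &dst.sin6_addr);

	/* Minimal IPv6 header: version 6, no payload, no next header. */
	memset(&hdr, 0, sizeof(hdr));
	hdr.ip6_vfc = 6 << 4;
	hdr.ip6_plen = htons(0);
	hdr.ip6_nxt = IPPROTO_NONE;
	hdr.ip6_hlim = 64;
	inet_pton(AF_INET6, "2001:db8::1", &hdr.ip6_src);
	hdr.ip6_dst = dst.sin6_addr;

	for (i = 0; i < npkts; i++)
		(void)sendto(fd, &hdr, sizeof(hdr), 0,
			     (struct sockaddr *)&dst, sizeof(dst));

	printf("ip raw: total num pkts %ld\n", npkts);
	return 0;
}

Running a sender like this under the bpftrace one-liner quoted above should show dst_alloc() calls climbing roughly one-for-one with the send loop.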
On Sat, Jan 7, 2023 at 10:27 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
>
> Hi Jon,
> please see below, thanks.
>
> > > Any chance you could test this patch based on the latest net-next
> > > kernel and let me know the result?
> > >
> > > [...]
> >
> > Yes, I will apply this patch in the next days and check how it deals with the
> > seg6 subsystem. I will keep you posted.
> >
>
> I applied the patch* to net-next (HEAD 6bd4755c7c49) and did some tests on
> the seg6 subsystem, specifically running the End.X/DX6 behaviors. They seem to
> work fine.

Thanks Andrea, much appreciated. It worked fine in my raw socket tests as well.
I'll look at submitting it soon.

>
> (*) I had to slightly edit the patch because of the code formatting, e.g.
> some incorrect line breaks, spaces, etc.
>

Sorry about that, I should have sent it from git to avoid that.

Regards

Jon

> Ciao,
> Andrea
>
> > > [...]
>
> --
> Andrea Mayer <andrea.mayer@uniroma2.it>
On Sun, 8 Jan 2023 10:46:09 +1100
Jonathan Maxwell <jmaxwell37@gmail.com> wrote:

> On Sat, Jan 7, 2023 at 10:27 AM Andrea Mayer <andrea.mayer@uniroma2.it> wrote:
> >
> > Hi Jon,
> > please see below, thanks.
> >
> > > > Any chance you could test this patch based on the latest net-next
> > > > kernel and let me know the result?
> > > >
> > > > [...]
> > >
> > > Yes, I will apply this patch in the next days and check how it deals with the
> > > seg6 subsystem. I will keep you posted.
> > >
> >
> > I applied the patch* to net-next (HEAD 6bd4755c7c49) and did some tests on
> > the seg6 subsystem, specifically running the End.X/DX6 behaviors. They seem to
> > work fine.
>
> Thanks Andrea, much appreciated. It worked fine in my raw socket tests as well.

You're welcome. Good!

> I'll look at submitting it soon.

Please let me know (keep me in cc).

> >
> > (*) I had to slightly edit the patch because of the code formatting, e.g.
> > some incorrect line breaks, spaces, etc.
> >
>
> Sorry about that, I should have sent it from git to avoid that.
>

Don't worry.

> Regards
>
> Jon

Ciao,
Andrea

> > > > [...]
>
> > > --
> > > Andrea Mayer <andrea.mayer@uniroma2.it>
> >
> > --
> > Andrea Mayer <andrea.mayer@uniroma2.it>
diff --git a/include/net/flow.h b/include/net/flow.h
index 2f0da4f0318b..30b8973ffb4b 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -37,6 +37,7 @@ struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
+#define FLOWI_FLAG_SKIP_RAW		0x04
 	__u32	flowic_secid;
 	kuid_t	flowic_uid;
 	struct flowi_tunnel flowic_tun_key;
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index a06a9f847db5..0b89a7e66d09 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -884,7 +884,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	security_sk_classify_flow(sk, flowi6_to_flowi_common(&fl6));
 
 	if (hdrincl)
-		fl6.flowi6_flags |= FLOWI_FLAG_KNOWN_NH;
+		fl6.flowi6_flags |= FLOWI_FLAG_KNOWN_NH | FLOWI_FLAG_SKIP_RAW;
 
 	if (ipc6.tclass < 0)
 		ipc6.tclass = np->tclass;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index e74e0361fd92..beae0bd61738 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2226,6 +2226,7 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
 	if (rt) {
 		goto out;
 	} else if (unlikely((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) &&
+			    !(fl6->flowi6_flags & FLOWI_FLAG_SKIP_RAW) &&
 			    !res.nh->fib_nh_gw_family)) {
 		/* Create a RTF_CACHE clone which will not be
 		 * owned by the fib6 tree.  It is for the special case where
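Unlike the garbage-collection rework tested above, this second diff leaves gc alone: it tags raw-socket hdrincl flows with a new FLOWI_FLAG_SKIP_RAW flag so that ip6_pol_route() declines to create the per-packet RTF_CACHE clone for exactly that traffic. To make the affected path concrete, here is a small illustrative userspace sketch of the two ways a socket ends up on the hdrincl branch of rawv6_sendmsg(); the claim that IPV6_HDRINCL has been available since Linux 4.12 is my recollection, not something stated in the thread, and the snippet sends no packets.

/* hdrincl_paths.c - illustrative only: shows which sockets would carry
 * FLOWI_FLAG_SKIP_RAW under the diff above.
 */
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	int on = 1;
	int fd1, fd2;

	/* Path 1: IPPROTO_RAW at creation time implies header inclusion. */
	fd1 = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
	if (fd1 < 0)
		perror("socket(IPPROTO_RAW)");

	/* Path 2: any raw protocol plus the IPV6_HDRINCL socket option
	 * (believed available since Linux 4.12). */
	fd2 = socket(AF_INET6, SOCK_RAW, IPPROTO_UDP);
	if (fd2 >= 0 &&
	    setsockopt(fd2, IPPROTO_IPV6, IPV6_HDRINCL, &on, sizeof(on)) < 0)
		perror("setsockopt(IPV6_HDRINCL)");

	return 0;
}

Traffic sent through either socket hits the if (hdrincl) branch in rawv6_sendmsg() and, with this patch, carries both KNOWN_NH and SKIP_RAW, so the else-if in ip6_pol_route() no longer matches and no per-packet clone is allocated.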