Message ID | 20240215160458.1727237-1-ast@fiberby.net |
---|---|
Headers |
From: Asbjørn Sloth Tønnesen <ast@fiberby.net> To: Jamal Hadi Salim <jhs@mojatatu.com>, Cong Wang <xiyou.wangcong@gmail.com>, Jiri Pirko <jiri@resnulli.us> Cc: Asbjørn Sloth Tønnesen <ast@fiberby.net>, Daniel Borkmann <daniel@iogearbox.net>, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, llu@fiberby.dk Subject: [PATCH net-next 0/3] make skip_sw actually skip software Date: Thu, 15 Feb 2024 16:04:41 +0000 Message-ID: <20240215160458.1727237-1-ast@fiberby.net> X-Mailer: git-send-email 2.43.0 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit |
Series |
make skip_sw actually skip software
|
|
Message
Asbjørn Sloth Tønnesen
Feb. 15, 2024, 4:04 p.m. UTC
Hi,

During development of flower-route[1], which I recently
presented at FOSDEM[2], I noticed that CPU usage would
increase the more rules I installed into the hardware for
IP forwarding offloading.

Since we use TC flower offload for the hottest
prefixes, and leave the long tail to Linux / the CPU,
we therefore need both the hardware and software
datapath to perform well.

I found that skip_sw rules are quite expensive
in the kernel datapath, since they must be evaluated
and matched upon before the kernel checks the
skip_sw flag.

This patchset optimizes the case where all rules
are skip_sw.

[1] flower-route
    https://github.com/fiberby-dk/flower-route

[2] FOSDEM talk
    https://fosdem.org/2024/schedule/event/fosdem-2024-3337-flying-higher-hardware-offloading-with-bird/

Asbjørn Sloth Tønnesen (3):
  net: sched: cls_api: add skip_sw counter
  net: sched: cls_api: add filter counter
  net: sched: make skip_sw actually skip software

 include/net/pkt_cls.h     |  5 +++++
 include/net/sch_generic.h |  3 +++
 net/core/dev.c            |  3 +++
 net/sched/cls_api.c       | 24 ++++++++++++++++++++++++
 4 files changed, 35 insertions(+)
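To illustrate the cost described in the cover letter, here is a minimal userspace sketch in Python. It is not the kernel code from this series: `Filter`, `Block`, `classify_naive` and `classify_with_bypass` are hypothetical names, and matching is reduced to a string comparison. It only assumes what the patch titles suggest, namely that a filter counter and a skip_sw counter are kept, and that software classification can be bypassed when the two are equal.

```python
from dataclasses import dataclass, field


@dataclass
class Filter:
    prefix: str    # hypothetical match key, stands in for a flower match
    skip_sw: bool  # True if the rule is meant for hardware only


@dataclass
class Block:
    filters: list = field(default_factory=list)
    filter_count: int = 0   # all filters in the block
    skip_sw_count: int = 0  # filters flagged skip_sw

    def add(self, f: Filter) -> None:
        self.filters.append(f)
        self.filter_count += 1
        if f.skip_sw:
            self.skip_sw_count += 1

    def all_skip_sw(self) -> bool:
        # True when no filter in the block can ever match in software.
        return self.filter_count == self.skip_sw_count


def classify_naive(block: Block, dst: str):
    """Match first, check skip_sw afterwards. Returns (verdict, filters walked)."""
    walked = 0
    for f in block.filters:
        walked += 1
        if f.prefix == dst and not f.skip_sw:
            return "matched", walked
    return "no match", walked


def classify_with_bypass(block: Block, dst: str):
    """Skip the walk entirely when every filter is skip_sw."""
    if block.all_skip_sw():
        return "no match", 0
    return classify_naive(block, dst)
```

With 200 skip_sw filters installed, `classify_naive` walks all 200 for every packet that reaches software, while `classify_with_bypass` walks none. As soon as one non-skip_sw filter is added, the bypass no longer applies and both functions behave the same.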
Comments
Hi,

On Thu, Feb 15, 2024 at 04:04:41PM +0000, Asbjørn Sloth Tønnesen wrote:
...
> Since we use TC flower offload for the hottest
> prefixes, and leave the long tail to Linux / the CPU,
> we therefore need both the hardware and software
> datapath to perform well.
>
> I found that skip_sw rules are quite expensive
> in the kernel datapath, since they must be evaluated
> and matched upon before the kernel checks the
> skip_sw flag.
>
> This patchset optimizes the case where all rules
> are skip_sw.

The talk is interesting. Yet, I don't get how it is set up.
How do you use a dedicated block for skip_sw, and then have a
catch-all on sw again please?

I'm missing which traffic is being matched against the sw datapath. In
theory, you have all the heavy duty filters offloaded, so the sw
datapath should be seeing only a few packets, right?

  Marcelo
On Thu 15 Feb 2024 at 10:00, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
> Hi,
>
> On Thu, Feb 15, 2024 at 04:04:41PM +0000, Asbjørn Sloth Tønnesen wrote:
> ...
>> Since we use TC flower offload for the hottest
>> prefixes, and leave the long tail to Linux / the CPU,
>> we therefore need both the hardware and software
>> datapath to perform well.
>>
>> I found that skip_sw rules are quite expensive
>> in the kernel datapath, since they must be evaluated
>> and matched upon before the kernel checks the
>> skip_sw flag.
>>
>> This patchset optimizes the case where all rules
>> are skip_sw.
>
> The talk is interesting. Yet, I don't get how it is set up.
> How do you use a dedicated block for skip_sw, and then have a
> catch-all on sw again please?
>
> I'm missing which traffic is being matched against the sw datapath. In
> theory, you have all the heavy duty filters offloaded, so the sw
> datapath should be seeing only a few packets, right?

Yeah, I also didn't get the idea here. The cited paragraphs seem to
contradict each other.
Hi Marcelo,

On 2/15/24 18:00, Marcelo Ricardo Leitner wrote:
> On Thu, Feb 15, 2024 at 04:04:41PM +0000, Asbjørn Sloth Tønnesen wrote:
> ...
>> Since we use TC flower offload for the hottest
>> prefixes, and leave the long tail to Linux / the CPU,
>> we therefore need both the hardware and software
>> datapath to perform well.
>>
>> I found that skip_sw rules are quite expensive
>> in the kernel datapath, since they must be evaluated
>> and matched upon before the kernel checks the
>> skip_sw flag.
>>
>> This patchset optimizes the case where all rules
>> are skip_sw.
>
> The talk is interesting. Yet, I don't get how it is set up.
> How do you use a dedicated block for skip_sw, and then have a
> catch-all on sw again please?

Bird installs the DFZ Internet routing table into the main kernel table
for the software datapath.

Bird also installs a subset of the routing table into an aux. kernel table.

flower-route then picks up the routes from the aux. kernel table, and
installs them as TC skip_sw filters.

On these machines we don't have any non-skip_sw TC filters.

Since 2021, we have statically offloaded all inbound traffic, since the
nexthop for our IP space is always the switch next to it, which does
interior L3 routing. Thereby we could offload ~50% of the packets.

I have put an example of the static script here:
https://files.fiberby.net/ast/2024/tc_skip_sw/mlx5_static_offload.sh

And `tc filter show dev enp5s0f0np0 ingress` after running the script:
https://files.fiberby.net/ast/2024/tc_skip_sw/mlx_offload_demo_tc_dump.txt

> I'm missing which traffic is being matched against the sw datapath. In
> theory, you have all the heavy duty filters offloaded, so the sw
> datapath should be seeing only a few packets, right?

We are a residential ISP, our traffic is therefore residential Internet
traffic. We run the BGP routers as a router on a stick; the filters therefore
see both inbound and outbound traffic.

~50% of packets are inbound traffic, our own prefixes are therefore the
hottest prefixes. Most streaming traffic is handled internally, and is
therefore not seen on our core routers. We regularly have 5%-10% of all
outbound traffic going towards the same prefix, and have 50% of outbound
traffic distributed across just a few prefixes.

We currently only offload our own prefixes, and a select few other known
high-traffic prefixes.

The goal is to offload the majority of the traffic, but it is still early
days for flower-route, and I need to implement some smarter chain layout
first and dynamic filter placement based on hardware counters.

Even when I get flower-route to offload almost all traffic, there will still
be a long tail of prefixes not in hardware, so the kernel still needs
to not be pulled down by the offloaded filters.
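The skewed traffic distribution described above is what makes offloading a small set of prefixes worthwhile. As a rough illustration only, and not flower-route's actual placement logic (which the message says is still to be written), the sketch below picks the smallest set of prefixes that covers a target share of packets. The function name `hottest_prefixes`, the counter values and the documentation prefixes are all made up for the example.

```python
def hottest_prefixes(packet_counts, target_share):
    """Pick prefixes in descending packet count until target_share is covered.

    packet_counts: dict mapping prefix -> packets seen (hypothetical counters)
    target_share:  fraction of total packets to cover, e.g. 0.5
    Returns (chosen prefixes, share of packets they cover).
    """
    total = sum(packet_counts.values())
    if total == 0:
        return [], 0.0

    chosen, covered = [], 0
    ranked = sorted(packet_counts.items(), key=lambda kv: kv[1], reverse=True)
    for prefix, count in ranked:
        if covered / total >= target_share:
            break
        chosen.append(prefix)
        covered += count
    return chosen, covered / total


# Made-up counters: one prefix dominates, followed by a long tail.
counts = {
    "198.51.100.0/24": 50,
    "203.0.113.0/24": 30,
    "192.0.2.0/24": 15,
    "192.0.2.128/25": 5,
}
prefixes, share = hottest_prefixes(counts, 0.75)
```

With these counters, two prefixes are enough to cover 80% of the packets. Everything else stays in the long tail that the software datapath has to handle.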
On Fri, Feb 16, 2024 at 12:17:28PM +0000, Asbjørn Sloth Tønnesen wrote:
> Hi Marcelo,
>
> On 2/15/24 18:00, Marcelo Ricardo Leitner wrote:
> > On Thu, Feb 15, 2024 at 04:04:41PM +0000, Asbjørn Sloth Tønnesen wrote:
> > ...
> > > Since we use TC flower offload for the hottest
> > > prefixes, and leave the long tail to Linux / the CPU,
> > > we therefore need both the hardware and software
> > > datapath to perform well.
> > >
> > > I found that skip_sw rules are quite expensive
> > > in the kernel datapath, since they must be evaluated
> > > and matched upon before the kernel checks the
> > > skip_sw flag.
> > >
> > > This patchset optimizes the case where all rules
> > > are skip_sw.
> >
> > The talk is interesting. Yet, I don't get how it is set up.
> > How do you use a dedicated block for skip_sw, and then have a
> > catch-all on sw again please?
>
> Bird installs the DFZ Internet routing table into the main kernel table
> for the software datapath.
>
> Bird also installs a subset of the routing table into an aux. kernel table.
>
> flower-route then picks up the routes from the aux. kernel table, and
> installs them as TC skip_sw filters.
>
> On these machines we don't have any non-skip_sw TC filters.
>
> Since 2021, we have statically offloaded all inbound traffic, since the
> nexthop for our IP space is always the switch next to it, which does
> interior L3 routing. Thereby we could offload ~50% of the packets.
>
> I have put an example of the static script here:
> https://files.fiberby.net/ast/2024/tc_skip_sw/mlx5_static_offload.sh
>
> And `tc filter show dev enp5s0f0np0 ingress` after running the script:
> https://files.fiberby.net/ast/2024/tc_skip_sw/mlx_offload_demo_tc_dump.txt

Ahh ok. So from the tc/flower perspective, you actually offload
everything. :-) The part that was confusing to me is that what you need
done in sw, you don't do in tc sw, but rather with the IP stack itself.
So you actually offload a flower filter with these, let's say,
exceptions.

It seems to me a better fix for this is to have action trap to "resume
to sw" to itself. Then even if you have traffic that triggers a miss in
hw, you could add a catch-all filter to trigger the trap.

With the catch-all idea, instead of using trap directly, you may also
use a goto chain X. I just don't remember if you need to have a flow in
chain X that is not offloaded, or if a nonexistent chain is enough.

These ideas are rooted in the fact that the offloading can now resume
processing at a given chain, or even at the given action that triggered
the miss. With this, it should skip all the filtering that is
unnecessary in your case. IOW, instead of trying to make the filtering
smarter, which in the current proposal would be limited pretty much to
this use case (instead of using a dedicated list for skip_sw), it
resumes the processing at a better spot, and with what we already have.

One caveat with this approach is that it will cause an skb_extension to
be allocated for all the traffic that is handled in sw. There's a small
performance penalty for that.

WDYT? Or maybe I missed something?

>
> > I'm missing which traffic is being matched against the sw datapath. In
> > theory, you have all the heavy duty filters offloaded, so the sw
> > datapath should be seeing only a few packets, right?
>
> We are a residential ISP, our traffic is therefore residential Internet
> traffic. We run the BGP routers as a router on a stick; the filters therefore
> see both inbound and outbound traffic.
>
> ~50% of packets are inbound traffic, our own prefixes are therefore the
> hottest prefixes. Most streaming traffic is handled internally, and is
> therefore not seen on our core routers. We regularly have 5%-10% of all
> outbound traffic going towards the same prefix, and have 50% of outbound
> traffic distributed across just a few prefixes.
>
> We currently only offload our own prefixes, and a select few other known
> high-traffic prefixes.
>
> The goal is to offload the majority of the traffic, but it is still early
> days for flower-route, and I need to implement some smarter chain layout
> first and dynamic filter placement based on hardware counters.

Cool. Btw, be aware that after a few chain jumps, performance may drop
considerably even if offloaded.

>
> Even when I get flower-route to offload almost all traffic, there will still
> be a long tail of prefixes not in hardware, so the kernel still needs
> to not be pulled down by the offloaded filters.
>
> --
> Best regards
> Asbjørn Sloth Tønnesen
> Network Engineer
> Fiberby - AS42541
>