Message ID: <20230808031913.46965-1-huangjie.albert@bytedance.com>
Headers
From: Albert Huang <huangjie.albert@bytedance.com>
To: davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com
Cc: Albert Huang <huangjie.albert@bytedance.com>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, Jesper Dangaard Brouer <hawk@kernel.org>, John Fastabend <john.fastabend@gmail.com>, Björn Töpel <bjorn@kernel.org>, Magnus Karlsson <magnus.karlsson@intel.com>, Maciej Fijalkowski <maciej.fijalkowski@intel.com>, Jonathan Lemon <jonathan.lemon@gmail.com>, Pavel Begunkov <asml.silence@gmail.com>, Yunsheng Lin <linyunsheng@huawei.com>, Kees Cook <keescook@chromium.org>, Richard Gobert <richardbgobert@gmail.com>, "open list:NETWORKING DRIVERS" <netdev@vger.kernel.org>, open list <linux-kernel@vger.kernel.org>, "open list:XDP (eXpress Data Path)" <bpf@vger.kernel.org>
Subject: [RFC v3 Optimizing veth xsk performance 0/9]
Date: Tue, 8 Aug 2023 11:19:04 +0800
Message-Id: <20230808031913.46965-1-huangjie.albert@bytedance.com>
X-Mailer: git-send-email 2.37.1 (Apple Git-137.1)
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Series
[RFC v3 Optimizing veth xsk performance 1/9] veth: Implement ethtool's get_ringparam() callback
Message
黄杰
Aug. 8, 2023, 3:19 a.m. UTC
AF_XDP is a kernel bypass technology that can greatly improve performance. However, for virtual devices like veth, even with the use of AF_XDP sockets there are still many additional software paths that consume CPU resources. This patch series focuses on optimizing the performance of AF_XDP sockets for veth virtual devices. Patches 1 to 4 mainly involve preparatory work. Patch 5 introduces a tx queue and tx napi for packet transmission, patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9 adds support for the AF_XDP tx need_wakup feature. These optimizations significantly shorten the software path and support checksum offload.

I tested these features with the typical topology shown below:

client(send):                      server:(recv)
veth<-->veth-peer                  veth1-peer<--->veth1
  1 |                                       | 7
    |2                                     6|
    |                                       |
bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
    3               4               5
(machine1)                         (machine2)

An AF_XDP socket is attached to veth and veth1, and packets are sent out through the physical NIC (eth0).

veth:(172.17.0.2/24)   bridge:(172.17.0.1/24)   eth0:(192.168.156.66/24)
eth1(172.17.0.2/24)    bridge1:(172.17.0.1/24)  eth0:(192.168.156.88/24)

After setting up the default route, SNAT, and DNAT, we can run the tests to get the performance results. Packets are sent from veth to veth1.

af_xdp test tool (libxudp): https://github.com/cclinuxer/libxudp
send (veth):  ./objs/xudpperf send --dst 192.168.156.88:6002 -l 1300
recv (veth1): ./objs/xudpperf recv --src 172.17.0.2:6002

udp test tool: iperf3
send (veth):  iperf3 -c 192.168.156.88 -p 6002 -l 1300 -b 0 -u
recv (veth1): iperf3 -s -p 6002

Performance (tested with the libxudp library):
UDP                           : 320 Kpps (with 100% cpu)
AF_XDP no zerocopy + no batch : 480 Kpps (with ksoftirqd 100% cpu)
AF_XDP with batch + zerocopy  : 1.5 Mpps (with ksoftirqd 15% cpu)

With af_xdp batching, the libxudp user-space program becomes the bottleneck, so the softirq does not reach its limit.

This is just an RFC patch series, and some code details still need further consideration. Please review this proposal.
v2->v3:
- fix build error found by kernel test robot.

v1->v2:
- all the patches pass the checkpatch.pl test. Suggested by Simon Horman.
- iperf3 tested with -b 0; updated the test results. Suggested by Paolo Abeni.
- refactor code to make the code structure clearer.
- delete some useless code logic in the veth_xsk_tx_xmit function.
- add support for the AF_XDP tx need_wakup feature.

Albert Huang (9):
  veth: Implement ethtool's get_ringparam() callback
  xsk: add dma_check_skip for skipping dma check
  veth: add support for send queue
  xsk: add xsk_tx_completed_addr function
  veth: use send queue tx napi to xmit xsk tx desc
  veth: add ndo_xsk_wakeup callback for veth
  sk_buff: add destructor_arg_xsk_pool for zero copy
  veth: af_xdp tx batch support for ipv4 udp
  veth: add support for AF_XDP tx need_wakup feature

 drivers/net/veth.c          | 679 +++++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h      |   2 +
 include/net/xdp_sock_drv.h  |   5 +
 include/net/xsk_buff_pool.h |   1 +
 net/xdp/xsk.c               |   6 +
 net/xdp/xsk_buff_pool.c     |   3 +-
 net/xdp/xsk_queue.h         |  10 +
 7 files changed, 704 insertions(+), 2 deletions(-)
Comments
Albert Huang <huangjie.albert@bytedance.com> writes:

> AF_XDP is a kernel bypass technology that can greatly improve performance.
> However,for virtual devices like veth,even with the use of AF_XDP sockets,
> there are still many additional software paths that consume CPU resources.
> This patch series focuses on optimizing the performance of AF_XDP sockets
> for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
> Patch 5 introduces tx queue and tx napi for packet transmission, while
> patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9
> add support for AF_XDP tx need_wakup feature. These optimizations significantly
> reduce the software path and support checksum offload.
>
> I tested those feature with
> A typical topology is shown below:
> client(send):                      server:(recv)
> veth<-->veth-peer                  veth1-peer<--->veth1
>   1 |                                       | 7
>     |2                                     6|
>     |                                       |
> bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
>     3               4               5
> (machine1)                         (machine2)

I definitely applaud the effort to improve the performance of af_xdp
over veth, this is something we have flagged as in need of improvement
as well.

However, looking through your patch series, I am less sure that the
approach you're taking here is the right one.

AFAIU (speaking about the TX side here), the main difference between
AF_XDP ZC and the regular transmit mode is that in the regular TX mode
the stack will allocate an skb to hold the frame and push that down the
stack. Whereas in ZC mode, there's a driver NDO that gets called
directly, bypassing the skb allocation entirely.

In this series, you're implementing the ZC mode for veth, but the driver
code ends up allocating an skb anyway. Which seems to be a bit of a
weird midpoint between the two modes, and adds a lot of complexity to
the driver that (at least conceptually) is mostly just a
reimplementation of what the stack does in non-ZC mode (allocate an skb
and push it through the stack).

So my question is, why not optimise the non-zc path in the stack instead
of implementing the zc logic for veth? It seems to me that it would be
quite feasible to apply the same optimisations (bulking, and even GRO)
to that path and achieve the same benefits, without having to add all
this complexity to the veth driver?

-Toke
Toke Høiland-Jørgensen <toke@redhat.com> wrote on Tue, 8 Aug 2023 at 20:01:
>
> Albert Huang <huangjie.albert@bytedance.com> writes:
>
> > AF_XDP is a kernel bypass technology that can greatly improve performance.
> > However,for virtual devices like veth,even with the use of AF_XDP sockets,
> > there are still many additional software paths that consume CPU resources.
> > This patch series focuses on optimizing the performance of AF_XDP sockets
> > for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
> > Patch 5 introduces tx queue and tx napi for packet transmission, while
> > patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9
> > add support for AF_XDP tx need_wakup feature. These optimizations significantly
> > reduce the software path and support checksum offload.
> >
> > I tested those feature with
> > A typical topology is shown below:
> > client(send):                      server:(recv)
> > veth<-->veth-peer                  veth1-peer<--->veth1
> >   1 |                                       | 7
> >     |2                                     6|
> >     |                                       |
> > bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
> >     3               4               5
> > (machine1)                         (machine2)
>
> I definitely applaud the effort to improve the performance of af_xdp
> over veth, this is something we have flagged as in need of improvement
> as well.
>
> However, looking through your patch series, I am less sure that the
> approach you're taking here is the right one.
>
> AFAIU (speaking about the TX side here), the main difference between
> AF_XDP ZC and the regular transmit mode is that in the regular TX mode
> the stack will allocate an skb to hold the frame and push that down the
> stack. Whereas in ZC mode, there's a driver NDO that gets called
> directly, bypassing the skb allocation entirely.
>
> In this series, you're implementing the ZC mode for veth, but the driver
> code ends up allocating an skb anyway. Which seems to be a bit of a
> weird midpoint between the two modes, and adds a lot of complexity to
> the driver that (at least conceptually) is mostly just a
> reimplementation of what the stack does in non-ZC mode (allocate an skb
> and push it through the stack).
>
> So my question is, why not optimise the non-zc path in the stack instead
> of implementing the zc logic for veth? It seems to me that it would be
> quite feasible to apply the same optimisations (bulking, and even GRO)
> to that path and achieve the same benefits, without having to add all
> this complexity to the veth driver?
>
> -Toke
>
Thanks! This idea is really good indeed. You've reminded me, and that's
something I overlooked. I will now consider implementing the solution
you've proposed and test the performance enhancement.

Albert.
黄杰 <huangjie.albert@bytedance.com> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> wrote on Tue, 8 Aug 2023 at 20:01:
>>
>> Albert Huang <huangjie.albert@bytedance.com> writes:
>>
>> > AF_XDP is a kernel bypass technology that can greatly improve performance.
>> > However,for virtual devices like veth,even with the use of AF_XDP sockets,
>> > there are still many additional software paths that consume CPU resources.
>> > This patch series focuses on optimizing the performance of AF_XDP sockets
>> > for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
>> > Patch 5 introduces tx queue and tx napi for packet transmission, while
>> > patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9
>> > add support for AF_XDP tx need_wakup feature. These optimizations significantly
>> > reduce the software path and support checksum offload.
>> >
>> > I tested those feature with
>> > A typical topology is shown below:
>> > client(send):                      server:(recv)
>> > veth<-->veth-peer                  veth1-peer<--->veth1
>> >   1 |                                       | 7
>> >     |2                                     6|
>> >     |                                       |
>> > bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
>> >     3               4               5
>> > (machine1)                         (machine2)
>>
>> I definitely applaud the effort to improve the performance of af_xdp
>> over veth, this is something we have flagged as in need of improvement
>> as well.
>>
>> However, looking through your patch series, I am less sure that the
>> approach you're taking here is the right one.
>>
>> AFAIU (speaking about the TX side here), the main difference between
>> AF_XDP ZC and the regular transmit mode is that in the regular TX mode
>> the stack will allocate an skb to hold the frame and push that down the
>> stack. Whereas in ZC mode, there's a driver NDO that gets called
>> directly, bypassing the skb allocation entirely.
>>
>> In this series, you're implementing the ZC mode for veth, but the driver
>> code ends up allocating an skb anyway. Which seems to be a bit of a
>> weird midpoint between the two modes, and adds a lot of complexity to
>> the driver that (at least conceptually) is mostly just a
>> reimplementation of what the stack does in non-ZC mode (allocate an skb
>> and push it through the stack).
>>
>> So my question is, why not optimise the non-zc path in the stack instead
>> of implementing the zc logic for veth? It seems to me that it would be
>> quite feasible to apply the same optimisations (bulking, and even GRO)
>> to that path and achieve the same benefits, without having to add all
>> this complexity to the veth driver?
>>
>> -Toke
>>
> thanks! This idea is really good indeed. You've reminded me, and that's
> something I overlooked. I will now consider implementing the solution
> you've proposed and test the performance enhancement.

Sounds good, thanks! :)

-Toke
On 09/08/2023 11.06, Toke Høiland-Jørgensen wrote:
> 黄杰 <huangjie.albert@bytedance.com> writes:
>
>> Toke Høiland-Jørgensen <toke@redhat.com> wrote on Tue, 8 Aug 2023 at 20:01:
>>>
>>> Albert Huang <huangjie.albert@bytedance.com> writes:
>>>
>>>> AF_XDP is a kernel bypass technology that can greatly improve performance.
>>>> However,for virtual devices like veth,even with the use of AF_XDP sockets,
>>>> there are still many additional software paths that consume CPU resources.
>>>> This patch series focuses on optimizing the performance of AF_XDP sockets
>>>> for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
>>>> Patch 5 introduces tx queue and tx napi for packet transmission, while
>>>> patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9
>>>> add support for AF_XDP tx need_wakup feature. These optimizations significantly
>>>> reduce the software path and support checksum offload.
>>>>
>>>> I tested those feature with
>>>> A typical topology is shown below:
>>>> client(send):                      server:(recv)
>>>> veth<-->veth-peer                  veth1-peer<--->veth1
>>>>   1 |                                       | 7
>>>>     |2                                     6|
>>>>     |                                       |
>>>> bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
>>>>     3               4               5
>>>> (machine1)                         (machine2)
>>>
>>> I definitely applaud the effort to improve the performance of af_xdp
>>> over veth, this is something we have flagged as in need of improvement
>>> as well.
>>>
>>> However, looking through your patch series, I am less sure that the
>>> approach you're taking here is the right one.
>>>
>>> AFAIU (speaking about the TX side here), the main difference between
>>> AF_XDP ZC and the regular transmit mode is that in the regular TX mode
>>> the stack will allocate an skb to hold the frame and push that down the
>>> stack. Whereas in ZC mode, there's a driver NDO that gets called
>>> directly, bypassing the skb allocation entirely.
>>>
>>> In this series, you're implementing the ZC mode for veth, but the driver
>>> code ends up allocating an skb anyway. Which seems to be a bit of a
>>> weird midpoint between the two modes, and adds a lot of complexity to
>>> the driver that (at least conceptually) is mostly just a
>>> reimplementation of what the stack does in non-ZC mode (allocate an skb
>>> and push it through the stack).
>>>
>>> So my question is, why not optimise the non-zc path in the stack instead
>>> of implementing the zc logic for veth? It seems to me that it would be
>>> quite feasible to apply the same optimisations (bulking, and even GRO)
>>> to that path and achieve the same benefits, without having to add all
>>> this complexity to the veth driver?
>>>
>>> -Toke
>>>
>> thanks! This idea is really good indeed. You've reminded me, and that's
>> something I overlooked. I will now consider implementing the solution
>> you've proposed and test the performance enhancement.
>
> Sounds good, thanks! :)

Good to hear that you want to optimize the non-zc TX path of AF_XDP, as Toke suggests.

There are a number of performance issues for AF_XDP non-zc TX that I've talked/complained to Magnus and Bjørn about over the years. I've recently started to work on fixing these myself, in collaboration with Maryam (cc).

The most obvious is that non-zc TX uses socket memory accounting for the SKBs that get allocated. (ZC TX obviously doesn't.) IMHO this doesn't make sense, as the AF_XDP concept is to pre-allocate memory, thus AF_XDP memory limits are already bounded at setup time. Furthermore, __xsk_generic_xmit() already has a backpressure mechanism based on available room in the CQ (Completion Queue). Hint: the sock_alloc_send_skb() call is what does the socket memory accounting.

When AF_XDP gets combined with veth (or other layered software devices), the problem gets worse, because:

(1) the SKB that gets allocated by xsk_build_skb() doesn't have enough headroom to satisfy the XDP requirement XDP_PACKET_HEADROOM.

(2) the backing memory type from sock_alloc_send_skb() is not compatible with generic/veth XDP.

Both of these issues mean that when the peer veth device receives the (AF_XDP) TX packet, it has to reallocate memory+SKB and copy the data *again*.

I'm currently[1] looking into how to fix this and have some PoC patches to estimate the performance benefit from avoiding the realloc when entering veth. With packet size 512, the numbers start at 828 Kpps and increase to 1002 Kpps (an increase of 20%, or 208 nanosec per packet).

[1] https://github.com/xdp-project/xdp-project/blob/veth-benchmark01/areas/core/veth_benchmark03.org

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer