Message ID: 20230710223304.1174642-1-almasrymina@google.com
Subject: [RFC PATCH 00/10] Device Memory TCP
To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org
Cc: Mina Almasry <almasrymina@google.com>, Sumit Semwal <sumit.semwal@linaro.org>, Christian König <christian.koenig@amd.com>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Jesper Dangaard Brouer <hawk@kernel.org>, Ilias Apalodimas <ilias.apalodimas@linaro.org>, Arnd Bergmann <arnd@arndb.de>, David Ahern <dsahern@kernel.org>, Willem de Bruijn <willemdebruijn.kernel@gmail.com>, Shuah Khan <shuah@kernel.org>, jgg@ziepe.ca
Series: Device Memory TCP
Message
Mina Almasry
July 10, 2023, 10:32 p.m. UTC
* TL;DR:

Device memory TCP (devmem TCP) is a proposal for transferring data to and/or from device memory efficiently, without bouncing the data to a host memory buffer.

* Problem:

A large amount of data transfers have device memory as the source and/or destination. Accelerators drastically increased the volume of such transfers. Some examples include:

- ML accelerators transferring large amounts of training data from storage into GPU/TPU memory. In some cases ML training setup time can be as long as 50% of TPU compute time; improving data transfer throughput & efficiency can help improve GPU/TPU utilization.

- Distributed training, where ML accelerators, such as GPUs on different hosts, exchange data among them.

- Distributed raw block storage applications transfer large amounts of data with remote SSDs; much of this data does not require host processing.

Today, the majority of the Device-to-Device data transfers over the network are implemented as the following low-level operations: Device-to-Host copy, Host-to-Host network transfer, and Host-to-Device copy.

This implementation is suboptimal, especially for bulk data transfers, and can put significant strain on system resources, such as host memory bandwidth, PCIe bandwidth, etc. One important reason behind the current state is the kernel's lack of semantics to express device-to-network transfers.

* Proposal:

In this patch series we attempt to optimize this use case by implementing socket APIs that enable the user to:

1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.

Packet _payloads_ go directly from the NIC to device memory for receive and from device memory to the NIC for transmit. Packet _headers_ go to/from host memory and are processed by the TCP/IP stack normally. The NIC _must_ support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing network-transfer + device-copy semantics.

- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest level of the PCIe tree, compared to the traditional path which sends data through the root complex.

With this proposal we're able to reach ~96.6% line rate speeds with data sent and received directly from/to device memory.

* Patch overview:

** Part 1: struct paged device memory

Currently the standard for device memory sharing is DMABUF, which doesn't generate struct pages. On the other hand, the networking stack (skbs, drivers, and page pool) operates on pages. We have 2 options:

1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to understand a new memory type.

This proposal implements option #1. We implement a small framework to generate struct pages for an sg_table returned from dma_buf_map_attachment(). The support added here should be generic and easily extended to other use cases interested in struct paged device memory. We use this framework to generate pages that can be used in the networking stack.

** Part 2: recvmsg() & sendmsg() APIs

We define user APIs for the user to send and receive these dmabuf pages. (A rough userspace sketch of the receive side follows this overview.)

** Part 3: support for unreadable skb frags

Dmabuf pages are not accessible by the host; we implement changes throughout the networking stack to correctly handle skbs with unreadable frags.

** Part 4: page pool support

We piggyback on Jakub's page pool memory providers idea:
https://github.com/kuba-moo/linux/tree/pp-providers

It allows the page pool to define a memory provider that provides the page allocation and freeing. It helps abstract most of the device memory TCP changes from the driver. This is not strictly necessary; the driver can choose to allocate dmabuf pages and use them directly without going through the page pool (if acceptable to their maintainers).
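To make the Part 2 APIs concrete, here is a rough userspace sketch of the receive side. MSG_SOCK_DEVMEM and SO_DEVMEM_DONTNEED come from this series' patches (so this only builds against the patched uapi headers), while the cmsg handling and the devmem_frag layout below are illustrative assumptions rather than the series' actual definitions; the ncdevmem selftest in the series shows the real usage.

/* Sketch of a devmem TCP receive loop; assumptions are marked below. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/types.h>

/* Hypothetical layout of the per-fragment info returned via cmsg. */
struct devmem_frag {
	__u64 offset;	/* where the payload landed inside the dma-buf */
	__u32 len;
	__u32 token;	/* handed back to the kernel to recycle the buffer */
};

static void devmem_recv_loop(int fd)
{
	char ctrl[CMSG_SPACE(sizeof(struct devmem_frag)) * 16];
	char hdr_scratch[128];	/* headers/linear data stay in host memory */
	struct iovec iov = { .iov_base = hdr_scratch, .iov_len = sizeof(hdr_scratch) };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl,
		.msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cm;

	/* MSG_SOCK_DEVMEM opts in to devmem receives (added by this series). */
	while (recvmsg(fd, &msg, MSG_SOCK_DEVMEM) > 0) {
		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
			struct devmem_frag frag;

			memcpy(&frag, CMSG_DATA(cm), sizeof(frag));
			/* ... hand frag.offset/frag.len to the device-side consumer ... */

			/* Return the buffer to the kernel once consumed;
			 * the exact argument format here is an assumption. */
			setsockopt(fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
				   &frag.token, sizeof(frag.token));
		}
		msg.msg_controllen = sizeof(ctrl);	/* reset for the next call */
	}
}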
Not included with this RFC is the GVE devmem TCP support, just to simplify the review. Code available here if desired:
https://github.com/mina/linux/tree/tcpdevmem

This RFC is built on top of v6.4-rc7 with Jakub's pp-providers changes cherry-picked.

* NIC dependencies:

1. (strict) Devmem TCP requires the NIC to support header split, i.e. the capability to split incoming packets into a header + payload and to put each into a separate buffer. Devmem TCP works by using dmabuf pages for the packet payload, and host memory for the packet headers.

2. (optional) Devmem TCP works better with flow steering support & RSS support, i.e. the NIC's ability to steer flows into certain rx queues. This allows the sysadmin to enable devmem TCP on a subset of the rx queues, and steer devmem TCP traffic onto these queues and non-devmem TCP traffic elsewhere (a rough flow-steering example follows at the end of this message).

The NIC I have access to with these properties is the GVE with DQO support running in Google Cloud, but any NIC that supports these features would suffice. I may be able to help reviewers bring up devmem TCP on their NICs.

* Testing:

The series includes a udmabuf kselftest that shows a simple use case of devmem TCP and validates the entire data path end to end without a dependency on a specific dmabuf provider.

Not included in this series is our devmem TCP benchmark, which transfers data to/from GPU dmabufs directly. With this implementation & benchmark we're able to reach ~96.6% line rate speeds with 4 GPU/NIC pairs running bidirectional traffic, with all the packet payloads going straight to the GPU memory (no host buffer bounce).

** Test Setup

Kernel: v6.4-rc7, with this RFC and Jakub's memory provider API cherry-picked locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Benchmark: custom devmem TCP benchmark not yet open sourced.

Mina Almasry (10):
  dma-buf: add support for paged attachment mappings
  dma-buf: add support for NET_RX pages
  dma-buf: add support for NET_TX pages
  net: add support for skbs with unreadable frags
  tcp: implement recvmsg() RX path for devmem TCP
  net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages
  tcp: implement sendmsg() TX path for for devmem tcp
  selftests: add ncdevmem, netcat for devmem TCP
  memory-provider: updates core provider API for devmem TCP
  memory-provider: add dmabuf devmem provider

 drivers/dma-buf/dma-buf.c              | 444 ++++++++++++++++
 include/linux/dma-buf.h                | 142 +++++
 include/linux/netdevice.h              |   1 +
 include/linux/skbuff.h                 |  34 +-
 include/linux/socket.h                 |   1 +
 include/net/page_pool.h                |  21 +
 include/net/sock.h                     |   4 +
 include/net/tcp.h                      |   6 +-
 include/uapi/asm-generic/socket.h      |   6 +
 include/uapi/linux/dma-buf.h           |  12 +
 include/uapi/linux/uio.h               |  10 +
 net/core/datagram.c                    |   3 +
 net/core/page_pool.c                   | 111 +++-
 net/core/skbuff.c                      |  81 ++-
 net/core/sock.c                        |  47 ++
 net/ipv4/tcp.c                         | 262 +++++++++-
 net/ipv4/tcp_input.c                   |  13 +-
 net/ipv4/tcp_ipv4.c                    |   8 +
 net/ipv4/tcp_output.c                  |   5 +-
 net/packet/af_packet.c                 |   4 +-
 tools/testing/selftests/net/.gitignore |   1 +
 tools/testing/selftests/net/Makefile   |   1 +
 tools/testing/selftests/net/ncdevmem.c | 693 +++++++++++++++++++++++++
 23 files changed, 1868 insertions(+), 42 deletions(-)
 create mode 100644 tools/testing/selftests/net/ncdevmem.c
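For NIC dependency #2, the steering itself uses existing kernel interfaces; devmem TCP adds nothing new there. As a rough illustration, the standard ethtool ntuple ioctl below pins TCP traffic for one destination port to a chosen rx queue, equivalent to `ethtool -N eth1 flow-type tcp4 dst-port 5201 action 15`; the interface name, port, and queue number are arbitrary examples.

/* Steer tcp4 dst-port traffic to a specific rx queue via ETHTOOL_SRXCLSRLINS. */
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int steer_flow_to_queue(const char *ifname, __u16 dst_port, __u32 rx_queue)
{
	struct ethtool_rxnfc nfc;
	struct ifreq ifr;
	int ret, fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	memset(&nfc, 0, sizeof(nfc));
	nfc.cmd = ETHTOOL_SRXCLSRLINS;			/* insert an ntuple rule */
	nfc.fs.flow_type = TCP_V4_FLOW;
	nfc.fs.h_u.tcp_ip4_spec.pdst = htons(dst_port);
	nfc.fs.m_u.tcp_ip4_spec.pdst = 0xffff;		/* match the full dst port */
	nfc.fs.ring_cookie = rx_queue;			/* target rx queue */
	nfc.fs.location = RX_CLS_LOC_ANY;		/* let the driver pick a slot,
							 * if it supports that */

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&nfc;

	ret = ioctl(fd, SIOCETHTOOL, &ifr);
	if (ret < 0)
		perror("ETHTOOL_SRXCLSRLINS");
	close(fd);
	return ret;
}

/* e.g. steer_flow_to_queue("eth1", 5201, 15); */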
Comments
On 7/10/23 15:32, Mina Almasry wrote:
> * TL;DR:
>
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> from device memory efficiently, without bouncing the data to a host memory
> buffer.

(I'm writing this as someone who might plausibly use this mechanism, but I don't think I'm very likely to end up working on the kernel side, unless I somehow feel extremely inspired to implement it for i40e.)

I looked at these patches and the GVE tree, and I'm trying to wrap my head around the data path. As I understand it, for RX:

1. The GVE driver notices that the queue is programmed to use devmem, and it programs the NIC to copy packet payloads to the devmem that has been programmed.
2. The NIC receives the packet and copies the header to kernel memory and the payload to dma-buf memory.
3. The kernel tells userspace where in the dma-buf the data is.
4. Userspace does something with the data.
5. Userspace does DONTNEED to recycle the memory and make it available for new received packets.

Did I get this right?

This seems a bit awkward if there's any chance that packets not intended for the target device end up in the rxq.

I'm wondering if a more capable if somewhat higher latency model could work where the NIC stores received packets in its own device memory. Then userspace (or the kernel or a driver or whatever) could initiate a separate DMA from the NIC to the final target *after* reading the headers. Can the hardware support this?

Another way of putting this is: steering received data to a specific device based on the *receive queue* forces the logic selecting a destination device to be the same as the logic selecting the queue. RX steering logic is pretty limited on most hardware (as far as I know -- certainly I've never had much luck doing anything especially intelligent with RX flow steering, and I've tried on a couple of different brands of supposedly fancy NICs). But Linux has very nice capabilities to direct packets, in software, to where they are supposed to go, and it would be nice if all that logic could just work, scalably, with device memory.

If Linux could examine headers *before* the payload gets DMAed to wherever it goes, I think this could plausibly work quite nicely. One could even have an easy-to-use interface in which one directs a *socket* to a PCIe device. I expect, although I've never looked at the datasheets, that the kernel could even efficiently make rx decisions based on data in device memory on upcoming CXL NICs where device memory could participate in the host cache hierarchy.

My real ulterior motive is that I think it would be great to use an ability like this for DPDK-like uses. Wouldn't it be nifty if I could open a normal TCP socket, then, after it's open, ask the kernel to kindly DMA the results directly to my application memory (via udmabuf, perhaps)? Or have a whole VLAN or macvlan get directed to a userspace queue, etc?

It also seems a bit odd to me that the binding from rxq to dma-buf is established by programming the dma-buf. This makes the security model (and the mental model) awkward -- this binding is a setting on the *queue*, not the dma-buf, and in a containerized or privilege-separated system, a process could have enough privilege to make a dma-buf somewhere but not have any privileges on the NIC. (And may not even have the NIC present in its network namespace!)

--Andy
On Sun, 16 Jul 2023 19:41:28 -0700 Andy Lutomirski wrote:
> I'm wondering if a more capable if somewhat higher latency model could
> work where the NIC stores received packets in its own device memory.
> Then userspace (or the kernel or a driver or whatever) could initiate a
> separate DMA from the NIC to the final target *after* reading the
> headers. Can the hardware support this?

No, no, that's impossible. SW response times are in 100s of usec (at best) which at 200Gbps already means megabytes of data _per-queue_. Way more than the amount of buffer NICs will have.

The Rx application can bind to an IP addr + Port and then install a one-sided-3-tuple (dst IP+proto+port) rule in the HW. Worst case a full 5-tuple per flow.

Most NICs support OvS offloads with 100s of thousands of flows. The steering should be bread and butter.

It does require splitting flows into separate data and control channels, but it's the right trade-off - complexity should be on the SW side.
On Sun, Jul 16, 2023 at 7:41 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On 7/10/23 15:32, Mina Almasry wrote:
> > * TL;DR:
> >
> > Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> > from device memory efficiently, without bouncing the data to a host memory
> > buffer.
>
> (I'm writing this as someone who might plausibly use this mechanism, but
> I don't think I'm very likely to end up working on the kernel side,
> unless I somehow feel extremely inspired to implement it for i40e.)
>
> I looked at these patches and the GVE tree, and I'm trying to wrap my
> head around the data path. As I understand it, for RX:
>
> 1. The GVE driver notices that the queue is programmed to use devmem,
> and it programs the NIC to copy packet payloads to the devmem that has
> been programmed.
> 2. The NIC receives the packet and copies the header to kernel memory
> and the payload to dma-buf memory.
> 3. The kernel tells userspace where in the dma-buf the data is.
> 4. Userspace does something with the data.
> 5. Userspace does DONTNEED to recycle the memory and make it available
> for new received packets.
>
> Did I get this right?
>

Sorry for the late reply. I'm a bit buried working on the follow-up to this proposal: exploring using dma-bufs without pages.

Yes, this is completely correct.

> This seems a bit awkward if there's any chance that packets not intended
> for the target device end up in the rxq.
>

It does a bit. What happens in practice is that we use RSS to steer general traffic away from the devmem queues, and we use flow steering to steer specific flows to devmem queues.

In the case where the RSS/flow steering configuration is done incorrectly, the user would call recvmsg() on a devmem skb and if they haven't specified the MSG_SOCK_DEVMEM flag they'd get an error.

> I'm wondering if a more capable if somewhat higher latency model could
> work where the NIC stores received packets in its own device memory.
> Then userspace (or the kernel or a driver or whatever) could initiate a
> separate DMA from the NIC to the final target *after* reading the
> headers. Can the hardware support this?
>

Not that I know of. I guess Jakub also responded with the same.

> Another way of putting this is: steering received data to a specific
> device based on the *receive queue* forces the logic selecting a
> destination device to be the same as the logic selecting the queue. RX
> steering logic is pretty limited on most hardware (as far as I know --
> certainly I've never had much luck doing anything especially intelligent
> with RX flow steering, and I've tried on a couple of different brands of
> supposedly fancy NICs). But Linux has very nice capabilities to direct
> packets, in software, to where they are supposed to go, and it would be
> nice if all that logic could just work, scalably, with device memory.
> If Linux could examine headers *before* the payload gets DMAed to
> wherever it goes, I think this could plausibly work quite nicely. One
> could even have an easy-to-use interface in which one directs a *socket*
> to a PCIe device. I expect, although I've never looked at the
> datasheets, that the kernel could even efficiently make rx decisions
> based on data in device memory on upcoming CXL NICs where device memory
> could participate in the host cache hierarchy.
>
> My real ulterior motive is that I think it would be great to use an
> ability like this for DPDK-like uses. Wouldn't it be nifty if I could
> open a normal TCP socket, then, after it's open, ask the kernel to
> kindly DMA the results directly to my application memory (via udmabuf,
> perhaps)? Or have a whole VLAN or macvlan get directed to a userspace
> queue, etc?
>
> It also seems a bit odd to me that the binding from rxq to dma-buf is
> established by programming the dma-buf.

That is specific to this proposal, and will likely be very different in future ones. I thought the dma-buf pages approach was extensible and the uapi belonged somewhere in dma-buf. Clearly not. The next proposal, I think, will program the rxq via some net uapi and will take the dma-buf as input. Probably some netlink api (not sure if ethtool family or otherwise). I'm working out details of this non-paged networking first.

> This makes the security model
> (and the mental model) awkward -- this binding is a setting on the
> *queue*, not the dma-buf, and in a containerized or privilege-separated
> system, a process could have enough privilege to make a dma-buf
> somewhere but not have any privileges on the NIC. (And may not even
> have the NIC present in its network namespace!)
>
> --Andy

--
Thanks,
Mina
On Tue, Jul 18, 2023 at 10:36:52AM -0700, Mina Almasry wrote:
> That is specific to this proposal, and will likely be very different
> in future ones. I thought the dma-buf pages approach was extensible
> and the uapi belonged somewhere in dma-buf. Clearly not. The next
> proposal, I think, will program the rxq via some net uapi and will
> take the dma-buf as input. Probably some netlink api (not sure if
> ethtool family or otherwise). I'm working out details of this
> non-paged networking first.

In practice you want the application to start up, get itself some 3/5 tuples and then request the kernel to set up the flow steering and provision the NIC queues. This is the right moment for the application to provide the backing for the rx queue memory via a DMABUF handle.

Ideally this would all be accessible to non-priv applications as well, so I think you'd want some kind of system call that sets all this up and takes in a FD for the 3/5-tuple socket (to prove ownership over the steering) and the DMABUF FD.

The queues and steering should exist only as long as the application is still running (whatever that means). Otherwise you have a big mess to clean up whenever anything crashes.

netlink feels like a weird API choice for that, in particular it would be really wrong to somehow bind the lifecycle of a netlink object to a process.

Further, if you are going to all the trouble of doing this, it seems to me you should make it work with any kind of memory, including CPU memory. Get a consistent approach to zero-copy TCP RX. So also allow a memfd or similar to be passed in as the backing storage.

Jason
On Tue, 18 Jul 2023 15:06:29 -0300 Jason Gunthorpe wrote:
> netlink feels like a weird API choice for that, in particular it would
> be really wrong to somehow bind the lifecycle of a netlink object to a
> process.

Netlink is the right API, life cycle of objects can be easily tied to a netlink socket.
On 7/18/23 12:15 PM, Jakub Kicinski wrote:
> On Tue, 18 Jul 2023 15:06:29 -0300 Jason Gunthorpe wrote:
>> netlink feels like a weird API choice for that, in particular it would
>> be really wrong to somehow bind the lifecycle of a netlink object to a
>> process.
>
> Netlink is the right API, life cycle of objects can be easily tied to
> a netlink socket.

That is an unintuitive connection -- memory references, h/w queues, flow steering should be tied to the datapath socket, not a control plane socket.
On Tue, 18 Jul 2023 12:20:59 -0600 David Ahern wrote:
> On 7/18/23 12:15 PM, Jakub Kicinski wrote:
> > On Tue, 18 Jul 2023 15:06:29 -0300 Jason Gunthorpe wrote:
> >> netlink feels like a weird API choice for that, in particular it would
> >> be really wrong to somehow bind the lifecycle of a netlink object to a
> >> process.
> >
> > Netlink is the right API, life cycle of objects can be easily tied to
> > a netlink socket.
>
> That is an unintuitive connection -- memory references, h/w queues, flow
> steering should be tied to the datapath socket, not a control plane socket.

There's one RSS context for many datapath sockets. Plus a lot of the APIs already exist, and it's more of a question of packaging them up at the user space level. For things which do not have an API, however, netlink, please.
On 7/18/23 12:29 PM, Jakub Kicinski wrote:
> On Tue, 18 Jul 2023 12:20:59 -0600 David Ahern wrote:
>> On 7/18/23 12:15 PM, Jakub Kicinski wrote:
>>> On Tue, 18 Jul 2023 15:06:29 -0300 Jason Gunthorpe wrote:
>>>> netlink feels like a weird API choice for that, in particular it would
>>>> be really wrong to somehow bind the lifecycle of a netlink object to a
>>>> process.
>>>
>>> Netlink is the right API, life cycle of objects can be easily tied to
>>> a netlink socket.
>>
>> That is an unintuitive connection -- memory references, h/w queues, flow
>> steering should be tied to the datapath socket, not a control plane socket.
>
> There's one RSS context for many datapath sockets. Plus a lot of the
> APIs already exist, and it's more of a question of packaging them up
> at the user space level. For things which do not have an API, however,
> netlink, please.

I do not see how 1 RSS context (or more specifically a h/w Rx queue) can be used properly with memory from different processes (or dma-buf references). When the process dies, that memory needs to be flushed from the H/W queues. Queues with interlaced submissions make that more complicated.

I guess the devil is in the details; I look forward to the evolution of the patches.
On Tue, 18 Jul 2023 16:35:17 -0600 David Ahern wrote:
> I do not see how 1 RSS context (or more specifically a h/w Rx queue) can
> be used properly with memory from different processes (or dma-buf
> references). When the process dies, that memory needs to be flushed from
> the H/W queues. Queues with interlaced submissions make that more
> complicated.

Agreed, one process, one control path socket.

FWIW the rtnetlink use of netlink is very basic. genetlink already has some infra which allows associating state with a user socket and cleaning it up when the socket gets closed. This needs some improvements. A bit of a chicken and egg problem, I can't make the improvements until there are families making use of it, and nobody will make use of it until it's in tree... But the basics are already in place and I can help with building it out.

> I guess the devil is in the details; I look forward to the evolution of
> the patches.

+1
On Tue, Jul 18, 2023 at 3:45 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 18 Jul 2023 16:35:17 -0600 David Ahern wrote:
> > I do not see how 1 RSS context (or more specifically a h/w Rx queue) can
> > be used properly with memory from different processes (or dma-buf
> > references).

Right, my experience with dma-bufs from GPUs is that they're allocated from userspace and owned by the process that allocated the backing GPU memory and generated the dma-buf from it. I.e., we're limited to 1 dma-buf per RX queue. If we enable binding multiple dma-bufs to the same RX queue, we have a problem, because AFAIU the NIC can't decide which dma-buf to put the packet into (it hasn't parsed the packet's destination yet).

> > When the process dies, that memory needs to be flushed from
> > the H/W queues. Queues with interlaced submissions make that more
> > complicated.

When the process dies, do we really want to flush the memory from the hardware queues? The drivers I looked at don't seem to have a function to flush the rx queues alone; they usually do an entire driver reset to achieve that. Not sure if that's just convenience or there is some technical limitation there. Do we really want to trigger a driver reset in the event a userspace process crashes?

> Agreed, one process, one control path socket.
>
> FWIW the rtnetlink use of netlink is very basic. genetlink already has
> some infra which allows associating state with a user socket and cleaning
> it up when the socket gets closed. This needs some improvements. A bit
> of a chicken and egg problem, I can't make the improvements until there
> are families making use of it, and nobody will make use of it until
> it's in tree... But the basics are already in place and I can help with
> building it out.

I had this approach in mind (which doesn't need netlink improvements) for the next POC. It's mostly inspired by the comments from the cover letter of Jakub's memory-provider RFC, if I understood it correctly. I'm sure there's going to be some iteration, but roughly:

1. A netlink CAP_NET_ADMIN API which binds the dma-buf to any number of rx queues on 1 NIC. It will do the dma_buf_attach() and dma_buf_map_attachment() and leave some indicator in the struct net_device to tell the NIC that it's bound to a dma-buf. The actual binding doesn't take effect until the next driver reset. The API, I guess, can cause a driver reset (or just a refill of the rx queues, if you think that's feasible) as well to streamline things a bit. The API returns a file handle to the user representing that binding.

2. On the driver reset, the driver notices that its struct net_device is bound to a dma-buf, and sets up the dma-buf memory-provider instead of the basic one which provides host memory.

3. The user can close the file handle received in #1 to unbind the dma-buf from the rx queues. Or if the user crashes, the kernel closes the handle for us. The unbind doesn't take effect until the next flushing of rx queues, or the next driver reset (not sure the former is feasible).

4. The dma-buf memory provider keeps the dma-buf mapping alive until the next driver reset, where all the dma-buf slices are freed, and the dma-buf attachment mapping can be unmapped.

I'm thinking the user sets up RSS and flow steering outside this API using existing ethtool APIs, but things can be streamlined a bit by doing some of these RSS/flow steering steps in cohesion with the dma-buf binding/unbinding. The complication with setting up flow steering in cohesion with dma-buf bind/unbind is that the application may start more connections after the bind, and it will need to install flow steering rules for those too, and use the ethtool api for that. May as well use the ethtool apis for all of it...?

From Jakub and David's comments it sounds (if I understood correctly) like you'd like to tie the dma-buf bind/unbind functions to the lifetime of a netlink socket, rather than a struct file like I was thinking. That does sound cleaner, but I'm not sure how. Can you link me to any existing code examples? Or rough pointers to any existing code?

> > I guess the devil is in the details; I look forward to the evolution of
> > the patches.
>
> +1
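For concreteness, step 1 of the flow Mina sketches above might look roughly like this from userspace. The genetlink family, command, and attribute names are invented purely for illustration (nothing like them exists in this series or upstream); only the libnl calls themselves are real, and the point is that the binding's lifetime would follow the netlink socket sk.

/* Hypothetical bind of a dma-buf to an rx queue over generic netlink. */
#include <unistd.h>
#include <netlink/netlink.h>
#include <netlink/genl/genl.h>
#include <netlink/genl/ctrl.h>

/* All of these identifiers are made up for illustration. */
enum { DEVMEM_A_IFINDEX = 1, DEVMEM_A_RX_QUEUE, DEVMEM_A_DMABUF_FD };
#define DEVMEM_CMD_BIND		1
#define DEVMEM_FAMILY_NAME	"netdev-devmem"

static int devmem_bind(struct nl_sock *sk, int family, unsigned int ifindex,
		       unsigned int rx_queue, int dmabuf_fd)
{
	struct nl_msg *msg = nlmsg_alloc();
	int err;

	if (!msg)
		return -1;
	genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family, 0, 0,
		    DEVMEM_CMD_BIND, 1);
	nla_put_u32(msg, DEVMEM_A_IFINDEX, ifindex);
	nla_put_u32(msg, DEVMEM_A_RX_QUEUE, rx_queue);
	/* The kernel-side handler would look up this fd and do the
	 * dma_buf_attach()/dma_buf_map_attachment() from step 1 above. */
	nla_put_u32(msg, DEVMEM_A_DMABUF_FD, dmabuf_fd);
	err = nl_send_auto(sk, msg);
	nlmsg_free(msg);
	return err < 0 ? -1 : 0;
}

int main(void)
{
	struct nl_sock *sk = nl_socket_alloc();

	genl_connect(sk);
	devmem_bind(sk, genl_ctrl_resolve(sk, DEVMEM_FAMILY_NAME),
		    2 /* ifindex */, 4 /* rx queue */, 3 /* dmabuf fd */);
	/* Keeping sk open keeps the binding; closing it (or crashing)
	 * would be the unbind/cleanup trigger discussed in this thread. */
	pause();
	return 0;
}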
On Wed, 19 Jul 2023 08:10:58 -0700 Mina Almasry <almasrymina@google.com> wrote:
> On Tue, Jul 18, 2023 at 3:45 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Tue, 18 Jul 2023 16:35:17 -0600 David Ahern wrote:
> > > I do not see how 1 RSS context (or more specifically a h/w Rx queue) can
> > > be used properly with memory from different processes (or dma-buf
> > > references).
>
> Right, my experience with dma-bufs from GPUs is that they're allocated
> from userspace and owned by the process that allocated the backing GPU
> memory and generated the dma-buf from it. I.e., we're limited to
> 1 dma-buf per RX queue. If we enable binding multiple dma-bufs to the
> same RX queue, we have a problem, because AFAIU the NIC can't decide
> which dma-buf to put the packet into (it hasn't parsed the packet's
> destination yet).
>
> > > When the process dies, that memory needs to be flushed from
> > > the H/W queues. Queues with interlaced submissions make that more
> > > complicated.
> >
>
> When the process dies, do we really want to flush the memory from the
> hardware queues? The drivers I looked at don't seem to have a function
> to flush the rx queues alone; they usually do an entire driver reset
> to achieve that. Not sure if that's just convenience or there is some
> technical limitation there. Do we really want to trigger a driver
> reset in the event a userspace process crashes?

Naive idea.

Would it be possible for the process to use mmap() on the GPU memory and then do zero-copy TCP receive somehow? Or is this what is being proposed?
On Wed, 19 Jul 2023 08:10:58 -0700 Mina Almasry wrote:
> From Jakub and David's comments it sounds (if I understood correctly)
> like you'd like to tie the dma-buf bind/unbind functions to the lifetime
> of a netlink socket, rather than a struct file like I was thinking. That
> does sound cleaner, but I'm not sure how. Can you link me to any
> existing code examples? Or rough pointers to any existing code?

I don't have a strong preference whether the lifetime is bound to the socket or not. My main point was that if we're binding lifetimes to processes, it should be done via netlink sockets, not special-purpose FDs. Inevitably more commands and info will be needed and we'll start reinventing the uAPI wheel which is Netlink.

Currently adding state to netlink sockets is a bit raw. You can create an Xarray which stores the per socket state using the socket's portid (genl_info->snd_portid) and use netlink_register_notifier() to get notifications when sockets are closed.
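A rough kernel-side sketch of what Jakub describes: the xarray keyed by genl_info->snd_portid and the netlink_register_notifier() hook are real kernel APIs, while struct devmem_binding and its release function are hypothetical placeholders for whatever per-socket state the eventual uAPI would track.

/* Track per-netlink-socket state and clean it up when the socket closes. */
#include <linux/notifier.h>
#include <linux/netlink.h>
#include <linux/xarray.h>
#include <net/genetlink.h>

struct devmem_binding;				/* hypothetical bound-queue state */

static DEFINE_XARRAY(devmem_bindings);		/* netlink portid -> state */

static void devmem_binding_release(struct devmem_binding *b)
{
	/* hypothetical: unmap the dma-buf, restore the queues, etc. */
}

/* Called from the (hypothetical) genetlink "bind" doit handler. */
static int devmem_track(struct genl_info *info, struct devmem_binding *b)
{
	return xa_err(xa_store(&devmem_bindings, info->snd_portid, b,
			       GFP_KERNEL));
}

/* Drop the state when the owning netlink socket is released. */
static int devmem_netlink_notify(struct notifier_block *nb,
				 unsigned long state, void *_notify)
{
	struct netlink_notify *notify = _notify;
	struct devmem_binding *b;

	if (state != NETLINK_URELEASE || notify->protocol != NETLINK_GENERIC)
		return NOTIFY_DONE;

	b = xa_erase(&devmem_bindings, notify->portid);
	if (b)
		devmem_binding_release(b);
	return NOTIFY_DONE;
}

static struct notifier_block devmem_netlink_nb = {
	.notifier_call = devmem_netlink_notify,
};

/* module init would call: netlink_register_notifier(&devmem_netlink_nb); */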
On Wed, Jul 19, 2023 at 10:57:11AM -0700, Stephen Hemminger wrote:
> Naive idea.
> Would it be possible for the process to use mmap() on the GPU memory and then
> do zero-copy TCP receive somehow? Or is this what is being proposed?

It could be possible, but currently there is no API to recover the underlying dmabuf from the VMA backing the mmap.

Also you can't just take arbitrary struct pages from any old VMA and make them "netmem".

Jason
On 20.07.23 at 01:24, Jason Gunthorpe wrote:
> On Wed, Jul 19, 2023 at 10:57:11AM -0700, Stephen Hemminger wrote:
>
>> Naive idea.
>> Would it be possible for the process to use mmap() on the GPU memory and then
>> do zero-copy TCP receive somehow? Or is this what is being proposed?
>
> It could be possible, but currently there is no API to recover the
> underlying dmabuf from the VMA backing the mmap.

Sorry for being a bit late, have been on vacation.

Well actually this was discussed before to work around problems with Windows applications through wine/proton.

Not 100% sure what the outcome of that was, but if I'm not completely mistaken getting the fd behind a VMA should be possible. It might just not be the DMA-buf fd, because we use mmap() re-routing to be able to work around problems with the reverse tracking of mappings.

Christian.

> Also you can't just take arbitrary struct pages from any old VMA and
> make them "netmem"
>
> Jason