Message ID: 20231106024413.2801438-6-almasrymina@google.com
State: New
Headers:
From: Mina Almasry <almasrymina@google.com>
Date: Sun, 5 Nov 2023 18:44:04 -0800
Subject: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator
Message-ID: <20231106024413.2801438-6-almasrymina@google.com>
In-Reply-To: <20231106024413.2801438-1-almasrymina@google.com>
References: <20231106024413.2801438-1-almasrymina@google.com>
To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org
Cc: Mina Almasry <almasrymina@google.com>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Jesper Dangaard Brouer <hawk@kernel.org>, Ilias Apalodimas <ilias.apalodimas@linaro.org>, Arnd Bergmann <arnd@arndb.de>, David Ahern <dsahern@kernel.org>, Willem de Bruijn <willemdebruijn.kernel@gmail.com>, Shuah Khan <shuah@kernel.org>, Sumit Semwal <sumit.semwal@linaro.org>, Christian König <christian.koenig@amd.com>, Shakeel Butt <shakeelb@google.com>, Jeroen de Borst <jeroendb@google.com>, Praveen Kaligineedi <pkaligineedi@google.com>, Willem de Bruijn <willemb@google.com>, Kaiyuan Zhang <kaiyuanz@google.com>
Content-Type: text/plain; charset="UTF-8"
Series: Device Memory TCP
Commit Message
Mina Almasry
Nov. 6, 2023, 2:44 a.m. UTC
Implement netdev devmem allocator. The allocator takes a given struct netdev_dmabuf_binding as input and allocates page_pool_iov from that binding.

The allocation simply delegates to the binding's genpool for the allocation logic and wraps the returned memory region in a page_pool_iov struct. page_pool_iov are refcounted and are freed back to the binding when the refcount drops to 0.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>

---
 include/linux/netdevice.h       | 13 ++++++++++++
 include/net/page_pool/helpers.h | 28 +++++++++++++++++++++++++
 net/core/dev.c                  | 37 +++++++++++++++++++++++++++++++++
 3 files changed, 78 insertions(+)
Comments
On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eeeda849115c..1c351c138a5b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>  };
>
>  #ifdef CONFIG_DMA_SHARED_BUFFER
> +struct page_pool_iov *
> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
> +void netdev_free_devmem(struct page_pool_iov *ppiov);

netdev_{alloc,free}_dmabuf?

I say that because a dmabuf can be host memory; at least I am not aware of a restriction that a dmabuf is device memory.
On 11/7/23 3:10 PM, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 3:44 PM David Ahern <dsahern@kernel.org> wrote:
[...]
>> netdev_{alloc,free}_dmabuf?
>
> Can do.
>
>> I say that because a dmabuf can be host memory, at least I am not aware
>> of a restriction that a dmabuf is device memory.
>
> In my limited experience dma-buf is generally device memory, and
> that's really its use case. CONFIG_UDMABUF is a driver that mocks
> dma-buf with a memfd which I think is used for testing. But I can do
> the rename, it's more clear anyway, I think.

config UDMABUF
	bool "userspace dmabuf misc driver"
	default n
	depends on DMA_SHARED_BUFFER
	depends on MEMFD_CREATE || COMPILE_TEST
	help
	  A driver to let userspace turn memfd regions into dma-bufs.
	  Qemu can use this to create host dmabufs for guest framebuffers.

Qemu is just a userspace process; it is in no way a special one.

Treating host memory as a dmabuf should radically simplify the io_uring extension of this set. That the io_uring set needs to dive into page_pools is just wrong - complicating the design and code and pushing io_uring into a realm it does not need to be involved in.

Most (all?) of this patch set can work with any memory; only device memory is unreadable.
On 2023-11-07 14:55, David Ahern wrote:
> On 11/7/23 3:10 PM, Mina Almasry wrote:
[...]
>> In my limited experience dma-buf is generally device memory, and
>> that's really its use case. CONFIG_UDMABUF is a driver that mocks
>> dma-buf with a memfd which I think is used for testing. But I can do
>> the rename, it's more clear anyway, I think.
>
> Qemu is just a userspace process; it is in no way a special one.
>
> Treating host memory as a dmabuf should radically simplify the io_uring
> extension of this set. That the io_uring set needs to dive into
> page_pools is just wrong - complicating the design and code and pushing
> io_uring into a realm it does not need to be involved in.

I think our io_uring proposal will already be vastly simplified once we rebase onto Kuba's page pool memory provider API. Using udmabuf means depending on a driver designed for testing, vs io_uring's registered buffers API that's been tried and tested.

I don't have an intuitive understanding of the trade-offs yet, and would need to try out udmabuf and compare vs say using our own page pool memory provider.

> Most (all?) of this patch set can work with any memory; only device
> memory is unreadable.
On 2023-11-07 15:03, Mina Almasry wrote:
> On Tue, Nov 7, 2023 at 2:55 PM David Ahern <dsahern@kernel.org> wrote:
[...]
>> Treating host memory as a dmabuf should radically simplify the io_uring
>> extension of this set.
>
> I agree actually, and I was about to make that comment to David Wei's
> series once I have the time.
>
> David, your io_uring RX zerocopy proposal actually works with devmem
> TCP, if you're inclined to do that instead. What you'd do roughly is
> (I think):
>
> - Allocate a memfd,
> - Use CONFIG_UDMABUF to create a dma-buf out of that memfd.
> - Bind the dma-buf to the NIC using the netlink API in this RFC.
> - Your io_uring extensions and io_uring uapi should work as-is almost
>   on top of this series, I think.
>
> If you do this the incoming packets should land into your memfd, which
> may or may not work for you. In the future, if you feel inclined to use
> device memory, this approach would be more extensible to device memory,
> because you'd already be using dma-bufs for your user memory; you'd
> just replace one kind of dma-buf (UDMABUF) with another.

How would TCP devmem change if we no longer assume that dmabuf is device memory? Pavel will know more on the perf side, but I wouldn't want to put any if/else on the hot path if we can avoid it. I could be wrong, but right now in my mind using different memory providers solves this neatly and the driver/networking stack doesn't need to care.

Mina, I believe you said at NetDev conf that you already had an udmabuf implementation for testing. I would like to see this (you can send privately) to see how TCP devmem would handle both user memory and device memory.

>> That the io_uring set needs to dive into
>> page_pools is just wrong - complicating the design and code and pushing
>> io_uring into a realm it does not need to be involved in.
>>
>> Most (all?) of this patch set can work with any memory; only device
>> memory is unreadable.
> On Mon, Nov 6, 2023 at 11:45 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>> On 2023/11/6 10:44, Mina Almasry wrote:
>>> +void netdev_free_devmem(struct page_pool_iov *ppiov)
>>> +{
>>> +	struct netdev_dmabuf_binding *binding = page_pool_iov_binding(ppiov);
>>> +
>>> +	refcount_set(&ppiov->refcount, 1);
>>> +
>>> +	if (gen_pool_has_addr(binding->chunk_pool,
>>> +			      page_pool_iov_dma_addr(ppiov), PAGE_SIZE))
>>
>> When gen_pool_has_addr() returns false, does it mean something has gone
>> really wrong here?
>
> Yes, good eye. gen_pool_has_addr() should never return false, but then
> again, gen_pool_free() BUG_ON()s if it doesn't find the address, which
> is an extremely severe reaction to what can be a minor bug in the
> accounting. I prefer to leak rather than crash the machine. It's a bit
> of defensive programming that is normally frowned upon, but I feel like
> in this case it's maybe warranted due to the very severe reaction
> (BUG_ON).

I would argue: why is that defensive programming not done in the gen_pool core? :)

gen_pool is really not that new, and removing the BUG_ONs must have been proposed before and rejected. I'll try to do some research and maybe suggest downgrading the BUG_ON to WARN_ON, but my guess is there is some reason the maintainer wants it to be a BUG_ON.
On Wed, Nov 8, 2023 at 5:00 PM David Wei <dw@davidwei.uk> wrote:
[...]
> I think our io_uring proposal will already be vastly simplified once we
> rebase onto Kuba's page pool memory provider API. Using udmabuf means
> depending on a driver designed for testing, vs io_uring's registered
> buffers API that's been tried and tested.

FWIW I also get an impression that udmabuf is mostly targeting testing, but I'm not aware of any deficiency that makes it concretely unsuitable for you. You be the judge.

The only quirk of udmabuf I'm aware of is that it seems to cap the max dma-buf size to 16000 pages. Not sure if that's due to a genuine technical limitation or just convenience.

> I don't have an intuitive understanding of the trade-offs yet, and would
> need to try out udmabuf and compare vs say using our own page pool
> memory provider.

On Wed, Nov 8, 2023 at 5:15 PM David Wei <dw@davidwei.uk> wrote:
> How would TCP devmem change if we no longer assume that dmabuf is device
> memory?

It wouldn't. The code already never assumes that dmabuf is device memory. Any dma-buf should work, as far as I can tell. I'm also quite confident udmabuf works; I use it for testing.

(Jason Gunthorpe is much more of an expert and may chime in to say 'some dma-buf will not work'. My primitive understanding is that we're using dma-bufs without any quirks and any dma-buf should work. I of course haven't tested all dma-bufs :D)

> Pavel will know more on the perf side, but I wouldn't want to
> put any if/else on the hot path if we can avoid it. I could be wrong,
> but right now in my mind using different memory providers solves this
> neatly and the driver/networking stack doesn't need to care.
>
> Mina, I believe you said at NetDev conf that you already had an udmabuf
> implementation for testing. I would like to see this (you can send
> privately) to see how TCP devmem would handle both user memory and
> device memory.

There is nothing to send privately. The patch series you're looking at supports udmabuf as-is, and the kselftest included with the series demonstrates devmem TCP working with udmabuf.

The only thing missing from this series is the driver support. You can see the GVE driver support for devmem TCP here:

https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-v3

You may need to implement devmem TCP for your driver before you can reproduce udmabuf working for yourself, though.
On Sun, 2023-11-05 at 18:44 -0800, Mina Almasry wrote:
[...]
> +void netdev_free_devmem(struct page_pool_iov *ppiov)
> +{
> +	struct netdev_dmabuf_binding *binding = page_pool_iov_binding(ppiov);
> +
> +	refcount_set(&ppiov->refcount, 1);
> +
> +	if (gen_pool_has_addr(binding->chunk_pool,
> +			      page_pool_iov_dma_addr(ppiov), PAGE_SIZE))
> +		gen_pool_free(binding->chunk_pool,
> +			      page_pool_iov_dma_addr(ppiov), PAGE_SIZE);

Minor nit: what about caching the dma_addr value to make the above more readable?

Cheers,

Paolo
On 11/10/23 7:26 AM, Pavel Begunkov wrote:
> On 11/7/23 23:03, Mina Almasry wrote:
[...]
>> David, your io_uring RX zerocopy proposal actually works with devmem
>> TCP, if you're inclined to do that instead, what you'd do roughly is
>> (I think):
>
> That would be a Frankenstein's monster api with no good reason for it.

It brings a consistent API from a networking perspective.

io_uring should not need to be in the page pool and memory management business. Have you or David coded up the re-use of the socket APIs with dmabuf to see how much smaller it makes the io_uring change - or even walked through it from a theoretical perspective?
On 11/11/23 17:19, David Ahern wrote:
> On 11/10/23 7:26 AM, Pavel Begunkov wrote:
[...]
>> That would be a Frankenstein's monster api with no good reason for it.
>
> It brings a consistent API from a networking perspective.
>
> io_uring should not need to be in the page pool and memory management
> business. Have you or David coded up the re-use of the socket APIs with
> dmabuf to see how much smaller it makes the io_uring change - or even
> walked through it from a theoretical perspective?

Yes, we did the mental exercise, which is why we're converting to pp. I don't see many opportunities for reuse for the main data path, potentially apart from using the iov format instead of pages.

If the goal is to minimise the amount of code, it can mimic the tcp devmem api with netlink and ioctl-ish buffer return, but that'd be a pretty bad api for io_uring, overly complicated and limiting optimisation options. If not, then we have to do some buffer management in io_uring, and I don't see anything wrong with that. It shouldn't be a burden for networking if all that extra code is contained in io_uring and only exposed via pp ops and following the rules.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index eeeda849115c..1c351c138a5b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
 };
 
 #ifdef CONFIG_DMA_SHARED_BUFFER
+struct page_pool_iov *
+netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
+void netdev_free_devmem(struct page_pool_iov *ppiov);
 void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding);
 int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 		       struct netdev_dmabuf_binding **out);
@@ -850,6 +853,16 @@ void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding);
 int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 				struct netdev_dmabuf_binding *binding);
 #else
+static inline struct page_pool_iov *
+netdev_alloc_devmem(struct netdev_dmabuf_binding *binding)
+{
+	return NULL;
+}
+
+static inline void netdev_free_devmem(struct page_pool_iov *ppiov)
+{
+}
+
 static inline void
 __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
 {
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 4ebd544ae977..78cbb040af94 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -83,6 +83,34 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
 }
 #endif
 
+/* page_pool_iov support */
+
+static inline struct dmabuf_genpool_chunk_owner *
+page_pool_iov_owner(const struct page_pool_iov *ppiov)
+{
+	return ppiov->owner;
+}
+
+static inline unsigned int page_pool_iov_idx(const struct page_pool_iov *ppiov)
+{
+	return ppiov - page_pool_iov_owner(ppiov)->ppiovs;
+}
+
+static inline dma_addr_t
+page_pool_iov_dma_addr(const struct page_pool_iov *ppiov)
+{
+	struct dmabuf_genpool_chunk_owner *owner = page_pool_iov_owner(ppiov);
+
+	return owner->base_dma_addr +
+	       ((dma_addr_t)page_pool_iov_idx(ppiov) << PAGE_SHIFT);
+}
+
+static inline struct netdev_dmabuf_binding *
+page_pool_iov_binding(const struct page_pool_iov *ppiov)
+{
+	return page_pool_iov_owner(ppiov)->binding;
+}
+
 /**
  * page_pool_dev_alloc_pages() - allocate a page.
  * @pool:	pool from which to allocate
diff --git a/net/core/dev.c b/net/core/dev.c
index c8c3709d42c8..2315bbc03ec8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -156,6 +156,7 @@
 #include <linux/genalloc.h>
 #include <linux/dma-buf.h>
 #include <net/page_pool/types.h>
+#include <net/page_pool/helpers.h>
 
 #include "dev.h"
 #include "net-sysfs.h"
@@ -2077,6 +2078,42 @@ void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
 	kfree(binding);
 }
 
+struct page_pool_iov *netdev_alloc_devmem(struct netdev_dmabuf_binding *binding)
+{
+	struct dmabuf_genpool_chunk_owner *owner;
+	struct page_pool_iov *ppiov;
+	unsigned long dma_addr;
+	ssize_t offset;
+	ssize_t index;
+
+	dma_addr = gen_pool_alloc_owner(binding->chunk_pool, PAGE_SIZE,
+					(void **)&owner);
+	if (!dma_addr)
+		return NULL;
+
+	offset = dma_addr - owner->base_dma_addr;
+	index = offset / PAGE_SIZE;
+	ppiov = &owner->ppiovs[index];
+
+	netdev_devmem_binding_get(binding);
+
+	return ppiov;
+}
+
+void netdev_free_devmem(struct page_pool_iov *ppiov)
+{
+	struct netdev_dmabuf_binding *binding = page_pool_iov_binding(ppiov);
+
+	refcount_set(&ppiov->refcount, 1);
+
+	if (gen_pool_has_addr(binding->chunk_pool,
+			      page_pool_iov_dma_addr(ppiov), PAGE_SIZE))
+		gen_pool_free(binding->chunk_pool,
+			      page_pool_iov_dma_addr(ppiov), PAGE_SIZE);
+
+	netdev_devmem_binding_put(binding);
+}
+
 void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding)
 {
 	struct netdev_rx_queue *rxq;