Message ID | 20231113130041.58124-1-linyunsheng@huawei.com |
---|---|
Headers |
From: Yunsheng Lin <linyunsheng@huawei.com>
To: <davem@davemloft.net>, <kuba@kernel.org>, <pabeni@redhat.com>
CC: <netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>, Yunsheng Lin <linyunsheng@huawei.com>
Subject: [PATCH RFC 0/8] A possible proposal for intergating dmabuf to page pool
Date: Mon, 13 Nov 2023 21:00:32 +0800
Message-ID: <20231113130041.58124-1-linyunsheng@huawei.com>
X-Mailer: git-send-email 2.33.0 |
Series |
A possible proposal for intergating dmabuf to page pool
Message
Yunsheng Lin
Nov. 13, 2023, 1 p.m. UTC
This patchset is based on [1] and [2]. It is similar in essence to what patch [2] is doing; the main differences are:
1. It reuses 'struct page' to have more unified handling of normal pages and devmem pages in the net stack.
2. It relies on page->pp_frag_count to do reference counting instead of page->_refcount, in order to decouple the devmem page from the mm subsystem.

As this patchset uses normal memory pages as devmem pages for prototyping, it was tested using simple iperf runs with some hacks in the hns3 driver and in __skb_datagram_iter().

1. https://lkml.kernel.org/netdev/20230105214631.3939268-2-willy@infradead.org/
2. https://lore.kernel.org/all/20231106024413.2801438-1-almasrymina@google.com/

Jakub Kicinski (2):
  net: page_pool: factor out releasing DMA from releasing the page
  net: page_pool: create hooks for custom page providers

Mina Almasry (1):
  memory-provider: dmabuf devmem memory provider

Yunsheng Lin (5):
  skbuff: explicitize the semantics of skb_frag_fill_page_desc()
  skbuff: remove compound_head() related function calling
  skbuff: always try to do page pool frag reference counting
  net: hns3: temp hack for hns3 to use dmabuf memory provider
  net: temp hack for dmabuf page in __skb_datagram_iter()

 .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |   2 +-
 .../net/ethernet/hisilicon/hns3/hns3_enet.c   |   9 +-
 drivers/net/ethernet/sun/cassini.c            |   4 +-
 drivers/net/veth.c                            |   2 +-
 include/linux/skbuff.h                        |  26 ++-
 include/net/page_pool/types.h                 |  60 ++++++
 net/core/datagram.c                           |  10 +-
 net/core/page_pool.c                          | 197 ++++++++++++++++--
 net/core/skbuff.c                             |  10 +-
 net/tls/tls_device_fallback.c                 |   2 +-
 10 files changed, 282 insertions(+), 40 deletions(-)
Comments
On Mon, 13 Nov 2023 05:42:16 -0800 Mina Almasry wrote:
> You're doing exactly what I think you're doing, and what was nacked in RFC v1.
>
> You've converted 'struct page_pool_iov' to essentially become a
> duplicate of 'struct page'. Then, you're casting page_pool_iov* into
> struct page* in mp_dmabuf_devmem_alloc_pages(), then, you're calling
> mm APIs like page_ref_*() on the page_pool_iov* because you've fooled
> the mm stack into thinking dma-buf memory is a struct page.
>
> RFC v1 was almost exactly the same, except instead of creating a
> duplicate definition of struct page, it just allocated 'struct page'
> instead of allocating another struct that is identical to struct page
> and casting it into struct page.
>
> I don't think what you're doing here reverses the nacks I got in RFC
> v1. You also did not CC any dma-buf or mm people on this proposal that
> would bring up these concerns again.

Right, but the mirror struct has some appeal to a non-mm person like
myself. The problem IIUC is that this patch is the wrong way around, we
should be converting everyone who can deal with non-host mem to struct
page_pool_iov. Using page_address() on ppiov which hns3 seems to do in
this series does not compute for me.

Then we can turn the existing non-iov helpers to be a thin wrapper with
just a cast from struct page to struct page_pool_iov, and a call of the
iov helper. Again - never cast the other way around.

Also I think this conversion can be done completely separately from the
mem provider changes. Just add struct page_pool_iov and start using it.

Does that make more sense?
+cc Christian, Jason and Willy

On 2023/11/14 7:05, Jakub Kicinski wrote:
> On Mon, 13 Nov 2023 05:42:16 -0800 Mina Almasry wrote:
>> You're doing exactly what I think you're doing, and what was nacked in RFC v1.
>>
>> You've converted 'struct page_pool_iov' to essentially become a
>> duplicate of 'struct page'. Then, you're casting page_pool_iov* into
>> struct page* in mp_dmabuf_devmem_alloc_pages(), then, you're calling
>> mm APIs like page_ref_*() on the page_pool_iov* because you've fooled
>> the mm stack into thinking dma-buf memory is a struct page.

Yes, something like the above, but I am not sure about the 'fooled the mm
stack into thinking dma-buf memory is a struct page' part, because:
1. We never let the 'struct page' for devmem leak out of the net stack,
   through the 'not kmap()able and not readable' checking in your patchset.
2. We initialize page->_refcount for devmem to one and it remains one; we
   never call page_ref_inc()/page_ref_dec()/get_page()/put_page(). Instead,
   we use the page pool's pp_frag_count to do reference counting for the
   devmem page in patch 6.

>>
>> RFC v1 was almost exactly the same, except instead of creating a
>> duplicate definition of struct page, it just allocated 'struct page'
>> instead of allocating another struct that is identical to struct page
>> and casting it into struct page.

Perhaps it is more accurate to say this is something between RFC v1 and
RFC v3, in order to decouple the 'struct page' for devmem from the mm
subsystem, while still having mostly unified handling for both normal
memory and devmem in the page pool and net stack.

The main difference between this patchset and RFC v1:
1. The mm subsystem is not supposed to see the 'struct page' for devmem
   in this patchset; I guess we could say it is decoupled from the mm
   subsystem even though we still call PageTail()/page_ref_count()/
   page_is_pfmemalloc() on the 'struct page' for devmem.

The main difference between this patchset and RFC v3:
1.
It reuses the 'struct page' to have more unified handling between
   normal pages and devmem pages in the net stack.
2. It relies on page->pp_frag_count to do reference counting.

>>
>> I don't think what you're doing here reverses the nacks I got in RFC
>> v1. You also did not CC any dma-buf or mm people on this proposal that
>> would bring up these concerns again.
>
> Right, but the mirror struct has some appeal to a non-mm person like
> myself. The problem IIUC is that this patch is the wrong way around, we
> should be converting everyone who can deal with non-host mem to struct
> page_pool_iov. Using page_address() on ppiov which hns3 seems to do in
> this series does not compute for me.

The hacky use of ppiov in hns3 is only there to do some prototype
testing, so ignore it.

>
> Then we can turn the existing non-iov helpers to be a thin wrapper with
> just a cast from struct page to struct page_pool_iov, and a call of the
> iov helper. Again - never cast the other way around.

Agreed: a cast from struct page to struct page_pool_iov is allowed, but
a cast from struct page_pool_iov to struct page is not, if I am
understanding you correctly.

Until we can also completely decouple the 'struct page' allocated
directly from the buddy allocator from the mm subsystem in the net
stack, below is what I have in mind in order to support different memory
providers:

                              +--------------+
                              |   Netstack   |
                              |'struct page' |
                              +--------------+
                                      ^
                                      |
                                      v
+----------------------+      +---------------+      +---------------+
|      devmem MP       |<---->|   Page pool   |----->|    **** MP    |
|'struct page_pool_iov'|      | 'struct page' |      |'struct **_iov'|
+----------------------+      +---------------+      +---------------+
                                      ^
                                      |
                                      v
                              +---------------+
                              |    Driver     |
                              | 'struct page' |
                              +---------------+

I would expect the net stack, page pool, and driver to still see
'struct page'; only the memory provider sees the struct specific to
itself. For the above, the devmem memory provider sees the
'struct page_pool_iov'.
The reason I still expect the driver to see 'struct page' is that the
driver will still need to support normal memory besides devmem.

>
> Also I think this conversion can be done completely separately from the
> mem provider changes. Just add struct page_pool_iov and start using it.

I am not sure I understand what "Just add struct page_pool_iov and
start using it" means yet.

>
> Does that make more sense?
On Tue, Nov 14, 2023 at 12:23 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: > > +cc Christian, Jason and Willy > > On 2023/11/14 7:05, Jakub Kicinski wrote: > > On Mon, 13 Nov 2023 05:42:16 -0800 Mina Almasry wrote: > >> You're doing exactly what I think you're doing, and what was nacked in RFC v1. > >> > >> You've converted 'struct page_pool_iov' to essentially become a > >> duplicate of 'struct page'. Then, you're casting page_pool_iov* into > >> struct page* in mp_dmabuf_devmem_alloc_pages(), then, you're calling > >> mm APIs like page_ref_*() on the page_pool_iov* because you've fooled > >> the mm stack into thinking dma-buf memory is a struct page. > > Yes, something like above, but I am not sure about the 'fooled the mm > stack into thinking dma-buf memory is a struct page' part, because: > 1. We never let the 'struct page' for devmem leaking out of net stacking > through the 'not kmap()able and not readable' checking in your patchset. RFC never used dma-buf pages outside the net stack, so that is the same. You are not able to get rid of the 'net kmap()able and not readable' checking with this approach, because dma-buf memory is fundamentally unkmapable and unreadable. This approach would still need skb_frags_not_readable checks in net stack, so that is also the same. > 2. We inititiate page->_refcount for devmem to one and it remains as one, > we will never call page_ref_inc()/page_ref_dec()/get_page()/put_page(), > instead, we use page pool's pp_frag_count to do reference counting for > devmem page in patch 6. > I'm not sure that moves the needle in terms of allowing dma-buf memory to look like struct pages. > >> > >> RFC v1 was almost exactly the same, except instead of creating a > >> duplicate definition of struct page, it just allocated 'struct page' > >> instead of allocating another struct that is identical to struct page > >> and casting it into struct page. 
>
> Perhaps it is more accurate to say this is something between RFC v1 and
> RFC v3, in order to decouple 'struct page' for devmem from mm subsystem,
> but still have most unified handling for both normal memory and devmem
> in page pool and net stack.
>
> The main difference between this patchset and RFC v1:
> 1. The mm subsystem is not supposed to see the 'struct page' for devmem
> in this patchset, I guess we could say it is decoupled from the mm
> subsystem even though we still call PageTail()/page_ref_count()/
> page_is_pfmemalloc() on 'struct page' for devmem.
>

In this patchset you pretty much allocate a duplicate of struct page
for your dma-buf memory, and then cast it into a struct page, so all
the mm calls in page_pool.c are seeing a struct page when it's really
dma-buf memory.

'even though we still call
PageTail()/page_ref_count()/page_is_pfmemalloc() on 'struct page' for
devmem' is basically making dma-buf memory look like struct pages.

Actually because you put the 'struct page for devmem' in
skb->bv_frag, the net stack will grab the 'struct page' for devmem
using skb_frag_page() then call things like page_address(), kmap,
get_page, put_page, etc, etc, etc.

> The main difference between this patchset and RFC v3:
> 1. It reuses the 'struct page' to have more unified handling between
> normal page and devmem page for net stack.

This is what was nacked in RFC v1.

> 2. It relies on the page->pp_frag_count to do reference counting.
>

I don't see you change any of the page_ref_* calls in page_pool.c, for
example this one:

https://elixir.bootlin.com/linux/latest/source/net/core/page_pool.c#L601

So the reference the page_pool is seeing is actually page->_refcount,
not page->pp_frag_count? I'm confused here. Is this a bug in the
patchset?

> >>
> >> I don't think what you're doing here reverses the nacks I got in RFC
> >> v1. You also did not CC any dma-buf or mm people on this proposal that
> >> would bring up these concerns again.
> > > > Right, but the mirror struct has some appeal to a non-mm person like > > myself. The problem IIUC is that this patch is the wrong way around, we > > should be converting everyone who can deal with non-host mem to struct > > page_pool_iov. Using page_address() on ppiov which hns3 seems to do in > > this series does not compute for me. > > The hacking use of ppiov in hns3 is only used to do the some prototype > testing, so ignore it. > > > > > Then we can turn the existing non-iov helpers to be a thin wrapper with > > just a cast from struct page to struct page_pool_iov, and a call of the > > iov helper. Again - never cast the other way around. > > I am agreed that a cast from struct page to struct page_pool_iov is allowed, > but a cast from struct page_pool_iov to struct page is not allowed if I am > understanding you correctly. > > Before we can also completely decouple 'struct page' allocated using buddy > allocator directly from mm subsystem in netstack, below is what I have in > mind in order to support different memory provider. > > +--------------+ > | Netstack | > |'struct page' | > +--------------+ > ^ > | > | > v > +---------------------+ > +----------------------+ | | +---------------+ > | devmem MP |<---->| Page pool |----->| **** MP | > |'struct page_pool_iov'| | 'struct page' | |'struct **_iov'| > +----------------------+ | | +---------------+ > +---------------------+ > ^ > | > | > v > +---------------+ > | Driver | > | 'struct page' | > +---------------+ > > I would expect net stack, page pool, driver still see the 'struct page', > only memory provider see the specific struct for itself, for the above, > devmem memory provider sees the 'struct page_pool_iov'. > > The reason I still expect driver to see the 'struct page' is that driver > will still need to support normal memory besides devmem. > > > > > Also I think this conversion can be done completely separately from the > > mem provider changes. 
Just add struct page_pool_iov and start using it. > > I am not sure I understand what does "Just add struct page_pool_iov and > start using it" mean yet. > > > > > Does that make more sense? > > > > . > > -- Thanks, Mina
On 2023/11/14 20:21, Mina Almasry wrote: > On Tue, Nov 14, 2023 at 12:23 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: >> >> +cc Christian, Jason and Willy >> >> On 2023/11/14 7:05, Jakub Kicinski wrote: >>> On Mon, 13 Nov 2023 05:42:16 -0800 Mina Almasry wrote: >>>> You're doing exactly what I think you're doing, and what was nacked in RFC v1. >>>> >>>> You've converted 'struct page_pool_iov' to essentially become a >>>> duplicate of 'struct page'. Then, you're casting page_pool_iov* into >>>> struct page* in mp_dmabuf_devmem_alloc_pages(), then, you're calling >>>> mm APIs like page_ref_*() on the page_pool_iov* because you've fooled >>>> the mm stack into thinking dma-buf memory is a struct page. >> >> Yes, something like above, but I am not sure about the 'fooled the mm >> stack into thinking dma-buf memory is a struct page' part, because: >> 1. We never let the 'struct page' for devmem leaking out of net stacking >> through the 'not kmap()able and not readable' checking in your patchset. > > RFC never used dma-buf pages outside the net stack, so that is the same. > > You are not able to get rid of the 'net kmap()able and not readable' > checking with this approach, because dma-buf memory is fundamentally > unkmapable and unreadable. This approach would still need > skb_frags_not_readable checks in net stack, so that is also the same. Yes, I am agreed that checking is still needed whatever the proposal is. > >> 2. We inititiate page->_refcount for devmem to one and it remains as one, >> we will never call page_ref_inc()/page_ref_dec()/get_page()/put_page(), >> instead, we use page pool's pp_frag_count to do reference counting for >> devmem page in patch 6. >> > > I'm not sure that moves the needle in terms of allowing dma-buf > memory to look like struct pages. 
> >>>> >>>> RFC v1 was almost exactly the same, except instead of creating a >>>> duplicate definition of struct page, it just allocated 'struct page' >>>> instead of allocating another struct that is identical to struct page >>>> and casting it into struct page. >> >> Perhaps it is more accurate to say this is something between RFC v1 and >> RFC v3, in order to decouple 'struct page' for devmem from mm subsystem, >> but still have most unified handling for both normal memory and devmem >> in page pool and net stack. >> >> The main difference between this patchset and RFC v1: >> 1. The mm subsystem is not supposed to see the 'struct page' for devmem >> in this patchset, I guess we could say it is decoupled from the mm >> subsystem even though we still call PageTail()/page_ref_count()/ >> page_is_pfmemalloc() on 'struct page' for devmem. >> > > In this patchset you pretty much allocate a struct page for your > dma-buf memory, and then cast it into a struct page, so all the mm > calls in page_pool.c are seeing a struct page when it's really dma-buf > memory. > > 'even though we still call > PageTail()/page_ref_count()/page_is_pfmemalloc() on 'struct page' for > devmem' is basically making dma-buf memory look like struct pages. > > Actually because you put the 'strtuct page for devmem' in > skb->bv_frag, the net stack will grab the 'struct page' for devmem > using skb_frag_page() then call things like page_address(), kmap, > get_page, put_page, etc, etc, etc. Yes, as above, skb_frags_not_readable() checking is still needed for kmap() and page_address(). get_page, put_page related calling is avoided in page_pool_frag_ref() and napi_pp_put_page() for devmem page as the above checking is true for devmem page: (pp_iov->pp_magic & ~0x3UL) == PP_SIGNATURE > >> The main difference between this patchset and RFC v3: >> 1. It reuses the 'struct page' to have more unified handling between >> normal page and devmem page for net stack. > > This is what was nacked in RFC v1. 
> >> 2. It relies on the page->pp_frag_count to do reference counting. >> > > I don't see you change any of the page_ref_* calls in page_pool.c, for > example this one: > > https://elixir.bootlin.com/linux/latest/source/net/core/page_pool.c#L601 > > So the reference the page_pool is seeing is actually page->_refcount, > not page->pp_frag_count? I'm confused here. Is this a bug in the > patchset? page->_refcount is the same as page_pool_iov->_refcount for devmem, which is ensured by the 'PAGE_POOL_MATCH(_refcount, _refcount);', and page_pool_iov->_refcount is set to one in mp_dmabuf_devmem_alloc_pages() by calling 'refcount_set(&ppiov->_refcount, 1)' and always remains as one. So the 'page_ref_count(page) == 1' checking is always true for devmem page.
On Tue, Nov 14, 2023 at 4:49 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: > > On 2023/11/14 20:21, Mina Almasry wrote: > > On Tue, Nov 14, 2023 at 12:23 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: > >> > >> +cc Christian, Jason and Willy > >> > >> On 2023/11/14 7:05, Jakub Kicinski wrote: > >>> On Mon, 13 Nov 2023 05:42:16 -0800 Mina Almasry wrote: > >>>> You're doing exactly what I think you're doing, and what was nacked in RFC v1. > >>>> > >>>> You've converted 'struct page_pool_iov' to essentially become a > >>>> duplicate of 'struct page'. Then, you're casting page_pool_iov* into > >>>> struct page* in mp_dmabuf_devmem_alloc_pages(), then, you're calling > >>>> mm APIs like page_ref_*() on the page_pool_iov* because you've fooled > >>>> the mm stack into thinking dma-buf memory is a struct page. > >> > >> Yes, something like above, but I am not sure about the 'fooled the mm > >> stack into thinking dma-buf memory is a struct page' part, because: > >> 1. We never let the 'struct page' for devmem leaking out of net stacking > >> through the 'not kmap()able and not readable' checking in your patchset. > > > > RFC never used dma-buf pages outside the net stack, so that is the same. > > > > You are not able to get rid of the 'net kmap()able and not readable' > > checking with this approach, because dma-buf memory is fundamentally > > unkmapable and unreadable. This approach would still need > > skb_frags_not_readable checks in net stack, so that is also the same. > > Yes, I am agreed that checking is still needed whatever the proposal is. > > > > >> 2. We inititiate page->_refcount for devmem to one and it remains as one, > >> we will never call page_ref_inc()/page_ref_dec()/get_page()/put_page(), > >> instead, we use page pool's pp_frag_count to do reference counting for > >> devmem page in patch 6. > >> > > > > I'm not sure that moves the needle in terms of allowing dma-buf > > memory to look like struct pages. 
> > > >>>> > >>>> RFC v1 was almost exactly the same, except instead of creating a > >>>> duplicate definition of struct page, it just allocated 'struct page' > >>>> instead of allocating another struct that is identical to struct page > >>>> and casting it into struct page. > >> > >> Perhaps it is more accurate to say this is something between RFC v1 and > >> RFC v3, in order to decouple 'struct page' for devmem from mm subsystem, > >> but still have most unified handling for both normal memory and devmem > >> in page pool and net stack. > >> > >> The main difference between this patchset and RFC v1: > >> 1. The mm subsystem is not supposed to see the 'struct page' for devmem > >> in this patchset, I guess we could say it is decoupled from the mm > >> subsystem even though we still call PageTail()/page_ref_count()/ > >> page_is_pfmemalloc() on 'struct page' for devmem. > >> > > > > In this patchset you pretty much allocate a struct page for your > > dma-buf memory, and then cast it into a struct page, so all the mm > > calls in page_pool.c are seeing a struct page when it's really dma-buf > > memory. > > > > 'even though we still call > > PageTail()/page_ref_count()/page_is_pfmemalloc() on 'struct page' for > > devmem' is basically making dma-buf memory look like struct pages. > > > > Actually because you put the 'strtuct page for devmem' in > > skb->bv_frag, the net stack will grab the 'struct page' for devmem > > using skb_frag_page() then call things like page_address(), kmap, > > get_page, put_page, etc, etc, etc. > > Yes, as above, skb_frags_not_readable() checking is still needed for > kmap() and page_address(). 
> > get_page, put_page related calling is avoided in page_pool_frag_ref() > and napi_pp_put_page() for devmem page as the above checking is true > for devmem page: > (pp_iov->pp_magic & ~0x3UL) == PP_SIGNATURE > So, devmem needs special handling with if statement for refcounting, even after using struct pages for devmem, which is not allowed (IIUC the dma-buf maintainer). > > > >> The main difference between this patchset and RFC v3: > >> 1. It reuses the 'struct page' to have more unified handling between > >> normal page and devmem page for net stack. > > > > This is what was nacked in RFC v1. > > > >> 2. It relies on the page->pp_frag_count to do reference counting. > >> > > > > I don't see you change any of the page_ref_* calls in page_pool.c, for > > example this one: > > > > https://elixir.bootlin.com/linux/latest/source/net/core/page_pool.c#L601 > > > > So the reference the page_pool is seeing is actually page->_refcount, > > not page->pp_frag_count? I'm confused here. Is this a bug in the > > patchset? > > page->_refcount is the same as page_pool_iov->_refcount for devmem, which > is ensured by the 'PAGE_POOL_MATCH(_refcount, _refcount);', and > page_pool_iov->_refcount is set to one in mp_dmabuf_devmem_alloc_pages() > by calling 'refcount_set(&ppiov->_refcount, 1)' and always remains as one. > > So the 'page_ref_count(page) == 1' checking is always true for devmem page. Which, of course, is a bug in the patchset, and it only works because it's a POC for you. devmem pages (which shouldn't exist according to the dma-buf maintainer, IIUC) can't be recycled all the time. See SO_DEVMEM_DONTNEED patch in my RFC and refcounting needed for devmem.
On Tue, Nov 14, 2023 at 04:21:26AM -0800, Mina Almasry wrote:
> Actually because you put the 'strtuct page for devmem' in
> skb->bv_frag, the net stack will grab the 'struct page' for devmem
> using skb_frag_page() then call things like page_address(), kmap,
> get_page, put_page, etc, etc, etc.

Yikes, please no. If net has its own struct page look alike it has to
stay entirely inside net. A non-mm owned struct page should not be
passed into mm calls. It is just way too hacky to be seriously
considered :(

> > I would expect net stack, page pool, driver still see the 'struct page',
> > only memory provider see the specific struct for itself, for the above,
> > devmem memory provider sees the 'struct page_pool_iov'.
> >
> > The reason I still expect driver to see the 'struct page' is that driver
> > will still need to support normal memory besides devmem.

I wouldn't say this approach is unreasonable, but it does have to be
done carefully to isolate the mm. Keeping the struct page in the API is
going to make this very hard.

Jason
On 2023/11/14 20:58, Mina Almasry wrote: >> >> Yes, as above, skb_frags_not_readable() checking is still needed for >> kmap() and page_address(). >> >> get_page, put_page related calling is avoided in page_pool_frag_ref() >> and napi_pp_put_page() for devmem page as the above checking is true >> for devmem page: >> (pp_iov->pp_magic & ~0x3UL) == PP_SIGNATURE >> > > So, devmem needs special handling with if statement for refcounting, > even after using struct pages for devmem, which is not allowed (IIUC > the dma-buf maintainer). It reuses the already existing checking or optimization, that is the point of 'mirror struct'. > >>> >>>> The main difference between this patchset and RFC v3: >>>> 1. It reuses the 'struct page' to have more unified handling between >>>> normal page and devmem page for net stack. >>> >>> This is what was nacked in RFC v1. >>> >>>> 2. It relies on the page->pp_frag_count to do reference counting. >>>> >>> >>> I don't see you change any of the page_ref_* calls in page_pool.c, for >>> example this one: >>> >>> https://elixir.bootlin.com/linux/latest/source/net/core/page_pool.c#L601 >>> >>> So the reference the page_pool is seeing is actually page->_refcount, >>> not page->pp_frag_count? I'm confused here. Is this a bug in the >>> patchset? >> >> page->_refcount is the same as page_pool_iov->_refcount for devmem, which >> is ensured by the 'PAGE_POOL_MATCH(_refcount, _refcount);', and >> page_pool_iov->_refcount is set to one in mp_dmabuf_devmem_alloc_pages() >> by calling 'refcount_set(&ppiov->_refcount, 1)' and always remains as one. >> >> So the 'page_ref_count(page) == 1' checking is always true for devmem page. > > Which, of course, is a bug in the patchset, and it only works because > it's a POC for you. devmem pages (which shouldn't exist according to > the dma-buf maintainer, IIUC) can't be recycled all the time. See > SO_DEVMEM_DONTNEED patch in my RFC and refcounting needed for devmem. 
I am not sure the dma-buf maintainer's concern is still there with this
patchset.

Whatever name you call the struct, and however you arrange each field in
it, some metadata is always needed for dmabuf to integrate into the page
pool.

If the above is true, why not utilize the 'struct page' to have more
unified handling?
Yunsheng Lin wrote: > On 2023/11/14 20:58, Mina Almasry wrote: > > >> > >> Yes, as above, skb_frags_not_readable() checking is still needed for > >> kmap() and page_address(). > >> > >> get_page, put_page related calling is avoided in page_pool_frag_ref() > >> and napi_pp_put_page() for devmem page as the above checking is true > >> for devmem page: > >> (pp_iov->pp_magic & ~0x3UL) == PP_SIGNATURE > >> > > > > So, devmem needs special handling with if statement for refcounting, > > even after using struct pages for devmem, which is not allowed (IIUC > > the dma-buf maintainer). > > It reuses the already existing checking or optimization, that is the point > of 'mirror struct'. > > > > >>> > >>>> The main difference between this patchset and RFC v3: > >>>> 1. It reuses the 'struct page' to have more unified handling between > >>>> normal page and devmem page for net stack. > >>> > >>> This is what was nacked in RFC v1. > >>> > >>>> 2. It relies on the page->pp_frag_count to do reference counting. > >>>> > >>> > >>> I don't see you change any of the page_ref_* calls in page_pool.c, for > >>> example this one: > >>> > >>> https://elixir.bootlin.com/linux/latest/source/net/core/page_pool.c#L601 > >>> > >>> So the reference the page_pool is seeing is actually page->_refcount, > >>> not page->pp_frag_count? I'm confused here. Is this a bug in the > >>> patchset? > >> > >> page->_refcount is the same as page_pool_iov->_refcount for devmem, which > >> is ensured by the 'PAGE_POOL_MATCH(_refcount, _refcount);', and > >> page_pool_iov->_refcount is set to one in mp_dmabuf_devmem_alloc_pages() > >> by calling 'refcount_set(&ppiov->_refcount, 1)' and always remains as one. > >> > >> So the 'page_ref_count(page) == 1' checking is always true for devmem page. > > > > Which, of course, is a bug in the patchset, and it only works because > > it's a POC for you. devmem pages (which shouldn't exist according to > > the dma-buf maintainer, IIUC) can't be recycled all the time. 
See > > SO_DEVMEM_DONTNEED patch in my RFC and refcounting needed for devmem. > > I am not sure dma-buf maintainer's concern is still there with this patchset. > > Whatever name you calling it for the struct, however you arrange each field > in the struct, some metadata is always needed for dmabuf to intergrate into > page pool. > > If the above is true, why not utilize the 'struct page' to have more unified > handling? My understanding is that there is a general preference to simplify struct page, and at the least not move in the other direction by overloading the struct in new ways. If using struct page for something that is not memory, there is ZONE_DEVICE. But using that correctly is non-trivial: https://lore.kernel.org/all/ZKyZBbKEpmkFkpWV@ziepe.ca/ Since all we need is a handle that does not leave the network stack, a network specific struct like page_pool_iov entirely avoids this issue. RFC v3 seems like a good simplification over RFC v1 in that regard to me. I was also pleasantly surprised how minimal the change to the users of skb_frag_t actually proved to be.
On Tue, 14 Nov 2023 16:23:29 +0800 Yunsheng Lin wrote: > I would expect net stack, page pool, driver still see the 'struct page', > only memory provider see the specific struct for itself, for the above, > devmem memory provider sees the 'struct page_pool_iov'. You can't lie to the driver that an _iov is a page either. The driver must explicitly "opt-in" to using the _iov variant, by calling the _iov set of APIs. Only drivers which can support header-data split can reasonably use the _iov API, for data pages.
Am 14.11.23 um 14:16 schrieb Jason Gunthorpe: > A non-mm owned struct page should not be > passed into mm calls. It is just way too hacky to be seriously > considered :( Can we please print this sentence on T-Shirts? Or framed on the wall of meeting rooms? I don't know how often I had to repeat that to people. Regards, Christian.
On 2023/11/14 21:16, Jason Gunthorpe wrote: > On Tue, Nov 14, 2023 at 04:21:26AM -0800, Mina Almasry wrote: > >> Actually because you put the 'strtuct page for devmem' in >> skb->bv_frag, the net stack will grab the 'struct page' for devmem >> using skb_frag_page() then call things like page_address(), kmap, >> get_page, put_page, etc, etc, etc. > > Yikes, please no. If net has its own struct page look alike it has to > stay entirely inside net. A non-mm owned struct page should not be > passed into mm calls. It is just way too hacky to be seriously > considered :( Yes, that is something this patchset is trying to do: defining its own struct page look-alike for the page pool to support devmem. The struct page for devmem will not be passed into the mm subsystem, so most of the mm calls are avoided by calling into the devmem memory provider's ops instead. As far as I can see for now, only page_ref_count(), page_is_pfmemalloc() and PageTail() are called for a devmem page, so it should be easy to ensure that those calls for a devmem page are consistent with a struct page owned by mm. I am not sure if we can use some kind of compile/runtime checking to ensure that kind of consistency? > >>> I would expect net stack, page pool, driver still see the 'struct page', >>> only memory provider see the specific struct for itself, for the above, >>> devmem memory provider sees the 'struct page_pool_iov'. >>> >>> The reason I still expect driver to see the 'struct page' is that driver >>> will still need to support normal memory besides devmem. > > I wouldn't say this approach is unreasonable, but it does have to be > done carefully to isolate the mm. Keeping the struct page in the API > is going to make this very hard. I would expect that most of the isolation is done in the page pool, as far as I can see: 1. For the control part: the driver may need to tell the page pool which memory provider it wants to use. 
Or the administrator specifies which memory provider to use via some netlink-based cmd. 2. For the data part: I am thinking that the driver should only call page_pool_alloc(), page_pool_free() and the page_pool_get_dma_addr related functions. Of course the driver may need to be aware of whether it can call kmap() or page_address() on the page returned from page_pool_alloc(), and maybe tell the net stack that those pages are not kmap()'able or page_address()'able. > > Jason > . >
On 2023/11/15 6:25, Jakub Kicinski wrote: > On Tue, 14 Nov 2023 16:23:29 +0800 Yunsheng Lin wrote: >> I would expect net stack, page pool, driver still see the 'struct page', >> only memory provider see the specific struct for itself, for the above, >> devmem memory provider sees the 'struct page_pool_iov'. > > You can't lie to the driver that an _iov is a page either. Yes, agreed about that. As a matter of fact, the driver should be aware of what kind of memory provider it is using when it calls page_pool_create() during the init process. > The driver must explicitly "opt-in" to using the _iov variant, > by calling the _iov set of APIs. > > Only drivers which can support header-data split can reasonably > use the _iov API, for data pages. But those drivers can still allocate normal memory, right? Sometimes for both the data and header parts, and sometimes only for the header part. Do those drivers need to support two sets of APIs: the one with _iov for devmem, and the one without _iov for normal memory? It seems somewhat unnecessary from the driver's point of view to support two sets of APIs. The driver seems to know which type of page it is expecting when calling page_pool_alloc() with a specific page_pool instance. Or do we use the _iov API to allocate both devmem and normal memory in a new driver supporting devmem pages? If that is the case, does it really matter whether the API has _iov or not? > . >
On Wed, Nov 15, 2023 at 05:21:02PM +0800, Yunsheng Lin wrote: > >>> I would expect net stack, page pool, driver still see the 'struct page', > >>> only memory provider see the specific struct for itself, for the above, > >>> devmem memory provider sees the 'struct page_pool_iov'. > >>> > >>> The reason I still expect driver to see the 'struct page' is that driver > >>> will still need to support normal memory besides devmem. > > > > I wouldn't say this approach is unreasonable, but it does have to be > > done carefully to isolate the mm. Keeping the struct page in the API > > is going to make this very hard. > > I would expect that most of the isolation is done in page pool, as far as > I can see: It is the sort of thing that is important enough it should have compiler help via types to prove that it is being done properly. Otherwise it will be full of mistakes over time. Jason
On Wed, Nov 15, 2023 at 1:21 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: > > On 2023/11/14 21:16, Jason Gunthorpe wrote: > > On Tue, Nov 14, 2023 at 04:21:26AM -0800, Mina Almasry wrote: > > > >> Actually because you put the 'strtuct page for devmem' in > >> skb->bv_frag, the net stack will grab the 'struct page' for devmem > >> using skb_frag_page() then call things like page_address(), kmap, > >> get_page, put_page, etc, etc, etc. > > > > Yikes, please no. If net has its own struct page look alike it has to > > stay entirely inside net. A non-mm owned struct page should not be > > passed into mm calls. It is just way too hacky to be seriously > > considered :( > > Yes, that is something this patchset is trying to do, defining its own > struct page look alike for page pool to support devmem. > > struct page for devmem will not be called into the mm subsystem, so most > of the mm calls is avoided by calling into the devmem memory provider' > ops instead of calling mm calls. > > As far as I see for now, only page_ref_count(), page_is_pfmemalloc() and > PageTail() is called for devmem page, which should be easy to ensure that > those call for devmem page is consistent with the struct page owned by mm. I'm not sure this is true. These 3 calls are just the calls you're aware of. In your proposal you're casting mirror pages into page* and releasing them into the net stack. You need to scrub the entire net stack for mm calls, i.e. all driver code and all skb_frag_page() call sites. Of the top of my head, the driver is probably calling page_address() and illegal_highdma() is calling PageHighMem(). TCP zerocopy receive is calling vm_insert_pages(). > I am not sure if we can use some kind of compile/runtime checking to ensure > those kinds of consistency? 
> > > > >>> I would expect net stack, page pool, driver still see the 'struct page', > >>> only memory provider see the specific struct for itself, for the above, > >>> devmem memory provider sees the 'struct page_pool_iov'. > >>> > >>> The reason I still expect driver to see the 'struct page' is that driver > >>> will still need to support normal memory besides devmem. > > > > I wouldn't say this approach is unreasonable, but it does have to be > > done carefully to isolate the mm. Keeping the struct page in the API > > is going to make this very hard. > > I would expect that most of the isolation is done in page pool, as far as > I can see: > > 1. For control part: the driver may need to tell the page pool which memory > provider it want to use. Or the administrator specifies > which memory provider to use by some netlink-based cmd. > > 2. For data part: I am thinking that driver should only call page_pool_alloc(), > page_pool_free() and page_pool_get_dma_addr related function. > > Of course the driver may need to be aware of that if it can call kmap() or > page_address() on the page returned from page_pool_alloc(), and maybe tell > net stack that those pages is not kmap()'able and page_address()'able. > > > > > Jason > > . > >
On 11/15/23 2:21 AM, Yunsheng Lin wrote: > On 2023/11/14 21:16, Jason Gunthorpe wrote: >> On Tue, Nov 14, 2023 at 04:21:26AM -0800, Mina Almasry wrote: >> >>> Actually because you put the 'strtuct page for devmem' in >>> skb->bv_frag, the net stack will grab the 'struct page' for devmem >>> using skb_frag_page() then call things like page_address(), kmap, >>> get_page, put_page, etc, etc, etc. >> >> Yikes, please no. If net has its own struct page look alike it has to >> stay entirely inside net. A non-mm owned struct page should not be >> passed into mm calls. It is just way too hacky to be seriously >> considered :( > > Yes, that is something this patchset is trying to do, defining its own > struct page look alike for page pool to support devmem. > Networking needs to be able to move away from struct page references. The devmem and host memory for Rx use cases do not need to be page based.
On Wed, Nov 15, 2023 at 1:29 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: > > On 2023/11/14 23:41, Willem de Bruijn wrote: > >> > >> I am not sure dma-buf maintainer's concern is still there with this patchset. > >> > >> Whatever name you calling it for the struct, however you arrange each field > >> in the struct, some metadata is always needed for dmabuf to intergrate into > >> page pool. > >> > >> If the above is true, why not utilize the 'struct page' to have more unified > >> handling? > > > > My understanding is that there is a general preference to simplify struct > > page, and at the least not move in the other direction by overloading the > > struct in new ways. > > As my understanding, the new struct is just mirroring the struct page pool > is already using, see: > https://elixir.free-electrons.com/linux/v6.7-rc1/source/include/linux/mm_types.h#L119 > > If there is simplifying to the struct page_pool is using, I think the new > stuct the devmem memory provider is using can adjust accordingly. > > As a matter of fact, I think the way 'struct page' for devmem is decoupled > from mm subsystem may provide a way to simplify or decoupled the already > existing 'struct page' used in netstack from mm subsystem, before this > patchset, it seems we have the below types of 'struct page': > 1. page allocated in the netstack using page pool. > 2. page allocated in the netstack using buddy allocator. > 3. page allocated in other subsystem and passed to the netstack, such as > zcopy or spliced page? > > If we can decouple 'struct page' for devmem from mm subsystem, we may be able > to decouple the above 'struct page' from mm subsystem one by one. > > > > > If using struct page for something that is not memory, there is ZONE_DEVICE. 
> > But using that correctly is non-trivial: > > > > https://lore.kernel.org/all/ZKyZBbKEpmkFkpWV@ziepe.ca/ > > > > Since all we need is a handle that does not leave the network stack, > > a network specific struct like page_pool_iov entirely avoids this issue. > > Yes, I am agree about the network specific struct. > I am wondering if we can make the struct more generic if we want to > intergrate it into page_pool and use it in net stack. > > > RFC v3 seems like a good simplification over RFC v1 in that regard to me. > > I was also pleasantly surprised how minimal the change to the users of > > skb_frag_t actually proved to be. > > Yes, I am agreed about that too. Maybe we can make it simpler by using > a more abstract struct as page_pool, and utilize some features of > page_pool too. > > For example, from the page_pool doc, page_pool have fast cache and > ptr-ring cache as below, but if napi_frag_unref() call > page_pool_page_put_many() and return the dmabuf chunk directly to > gen_pool in the memory provider, then it seems we are bypassing the > below caches in the page_pool. > I think you're just misunderstanding the code. The page recycling works with my patchset. napi_frag_unref() calls napi_pp_put_page() if recycle == true, and that works the same with devmem as with regular pages. If recycle == false, we call page_pool_page_put_many() which will call put_page() for regular pages and page_pool_iov_put_many() for devmem pages. So, the memory recycling works exactly the same as before with devmem as with regular pages. In my tests I do see the devmem being recycled correctly. We are not bypassing any caches. 
> +------------------+ > | Driver | > +------------------+ > ^ > | > | > | > v > +--------------------------------------------+ > | request memory | > +--------------------------------------------+ > ^ ^ > | | > | Pool empty | Pool has entries > | | > v v > +-----------------------+ +------------------------+ > | alloc (and map) pages | | get page from cache | > +-----------------------+ +------------------------+ > ^ ^ > | | > | cache available | No entries, refill > | | from ptr-ring > | | > v v > +-----------------+ +------------------+ > | Fast cache | | ptr-ring cache | > +-----------------+ +------------------+ > > > > > > . > >
On Wed, Nov 15, 2023 at 10:07 AM Mina Almasry <almasrymina@google.com> wrote: > > On Wed, Nov 15, 2023 at 1:29 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: > > > > On 2023/11/14 23:41, Willem de Bruijn wrote: > > >> > > >> I am not sure dma-buf maintainer's concern is still there with this patchset. > > >> > > >> Whatever name you calling it for the struct, however you arrange each field > > >> in the struct, some metadata is always needed for dmabuf to intergrate into > > >> page pool. > > >> > > >> If the above is true, why not utilize the 'struct page' to have more unified > > >> handling? > > > > > > My understanding is that there is a general preference to simplify struct > > > page, and at the least not move in the other direction by overloading the > > > struct in new ways. > > > > As my understanding, the new struct is just mirroring the struct page pool > > is already using, see: > > https://elixir.free-electrons.com/linux/v6.7-rc1/source/include/linux/mm_types.h#L119 > > > > If there is simplifying to the struct page_pool is using, I think the new > > stuct the devmem memory provider is using can adjust accordingly. > > > > As a matter of fact, I think the way 'struct page' for devmem is decoupled > > from mm subsystem may provide a way to simplify or decoupled the already > > existing 'struct page' used in netstack from mm subsystem, before this > > patchset, it seems we have the below types of 'struct page': > > 1. page allocated in the netstack using page pool. > > 2. page allocated in the netstack using buddy allocator. > > 3. page allocated in other subsystem and passed to the netstack, such as > > zcopy or spliced page? > > > > If we can decouple 'struct page' for devmem from mm subsystem, we may be able > > to decouple the above 'struct page' from mm subsystem one by one. > > > > > > > > If using struct page for something that is not memory, there is ZONE_DEVICE. 
> > > But using that correctly is non-trivial: > > > > > > https://lore.kernel.org/all/ZKyZBbKEpmkFkpWV@ziepe.ca/ > > > > > > Since all we need is a handle that does not leave the network stack, > > > a network specific struct like page_pool_iov entirely avoids this issue. > > > > Yes, I am agree about the network specific struct. > > I am wondering if we can make the struct more generic if we want to > > intergrate it into page_pool and use it in net stack. > > > > > RFC v3 seems like a good simplification over RFC v1 in that regard to me. > > > I was also pleasantly surprised how minimal the change to the users of > > > skb_frag_t actually proved to be. > > > > Yes, I am agreed about that too. Maybe we can make it simpler by using > > a more abstract struct as page_pool, and utilize some features of > > page_pool too. > > > > For example, from the page_pool doc, page_pool have fast cache and > > ptr-ring cache as below, but if napi_frag_unref() call > > page_pool_page_put_many() and return the dmabuf chunk directly to > > gen_pool in the memory provider, then it seems we are bypassing the > > below caches in the page_pool. > > > > I think you're just misunderstanding the code. The page recycling > works with my patchset. napi_frag_unref() calls napi_pp_put_page() if > recycle == true, and that works the same with devmem as with regular > pages. > > If recycle == false, we call page_pool_page_put_many() which will call > put_page() for regular pages and page_pool_iov_put_many() for devmem > pages. So, the memory recycling works exactly the same as before with > devmem as with regular pages. In my tests I do see the devmem being > recycled correctly. We are not bypassing any caches. > > Ah, taking a closer look here, the devmem recycling works for me but I think that's a side effect to the fact that the page_pool support I implemented with GVE is unusual. I currently allocate pages from the page_pool but do not set skb_mark_for_recycle(). 
The page recycling still happens when GVE is done with the page and calls page_pool_put_full_page(), as that eventually checks the refcount on the devmem and recycles it. I will fix up GVE to call skb_mark_for_recycle() and ensure the napi_pp_put_page() path recycles the devmem or page correctly in the next version. -- Thanks, Mina
On 2023/11/16 1:44, Mina Almasry wrote: > On Wed, Nov 15, 2023 at 1:21 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: >> >> On 2023/11/14 21:16, Jason Gunthorpe wrote: >>> On Tue, Nov 14, 2023 at 04:21:26AM -0800, Mina Almasry wrote: >>> >>>> Actually because you put the 'strtuct page for devmem' in >>>> skb->bv_frag, the net stack will grab the 'struct page' for devmem >>>> using skb_frag_page() then call things like page_address(), kmap, >>>> get_page, put_page, etc, etc, etc. >>> >>> Yikes, please no. If net has its own struct page look alike it has to >>> stay entirely inside net. A non-mm owned struct page should not be >>> passed into mm calls. It is just way too hacky to be seriously >>> considered :( >> >> Yes, that is something this patchset is trying to do, defining its own >> struct page look alike for page pool to support devmem. >> >> struct page for devmem will not be called into the mm subsystem, so most >> of the mm calls is avoided by calling into the devmem memory provider' >> ops instead of calling mm calls. >> >> As far as I see for now, only page_ref_count(), page_is_pfmemalloc() and >> PageTail() is called for devmem page, which should be easy to ensure that >> those call for devmem page is consistent with the struct page owned by mm. > > I'm not sure this is true. These 3 calls are just the calls you're > aware of. In your proposal you're casting mirror pages into page* and > releasing them into the net stack. You need to scrub the entire net > stack for mm calls, i.e. all driver code and all skb_frag_page() call > sites. Of the top of my head, the driver is probably calling > page_address() and illegal_highdma() is calling PageHighMem(). TCP > zerocopy receive is calling vm_insert_pages(). For the net stack part, I believe your patch below handles avoiding those mm calls? I did not include it in this patchset as I thought it was obvious that, whatever the proposal is, we always need that checking. 
Maybe we should have included it to avoid this kind of confusion. https://lore.kernel.org/all/20231106024413.2801438-10-almasrymina@google.com/ For the driver part, I was thinking that if the driver supports devmem, it should check whether it can make page_address() related calls on a specific 'struct page', or maybe we should introduce a new helper to make it obvious?
On 2023/11/16 1:57, David Ahern wrote: > On 11/15/23 2:21 AM, Yunsheng Lin wrote: >> On 2023/11/14 21:16, Jason Gunthorpe wrote: >>> On Tue, Nov 14, 2023 at 04:21:26AM -0800, Mina Almasry wrote: >>> >>>> Actually because you put the 'strtuct page for devmem' in >>>> skb->bv_frag, the net stack will grab the 'struct page' for devmem >>>> using skb_frag_page() then call things like page_address(), kmap, >>>> get_page, put_page, etc, etc, etc. >>> >>> Yikes, please no. If net has its own struct page look alike it has to >>> stay entirely inside net. A non-mm owned struct page should not be >>> passed into mm calls. It is just way too hacky to be seriously >>> considered :( >> >> Yes, that is something this patchset is trying to do, defining its own >> struct page look alike for page pool to support devmem. >> > > Networking needs to be able to move away from struct page references. > The devmem and host memory for Rx use cases do not need to be page based. Yes, I agree the ultimate goal is to move away from struct page references. But I am not sure if we can do it right away, as there are still different types of existing 'struct page' in the netstack, see: https://lore.kernel.org/all/8b7d25eb-1f10-3e37-8753-92b42da3fb34@huawei.com/ > > > . >
On Thu, Nov 16, 2023 at 3:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: > > On 2023/11/16 3:05, Mina Almasry wrote: > > On Wed, Nov 15, 2023 at 10:07 AM Mina Almasry <almasrymina@google.com> wrote: > >> > >> On Wed, Nov 15, 2023 at 1:29 AM Yunsheng Lin <linyunsheng@huawei.com> wrote: > >>> > >>> On 2023/11/14 23:41, Willem de Bruijn wrote: > >>>>> > >>>>> I am not sure dma-buf maintainer's concern is still there with this patchset. > >>>>> > >>>>> Whatever name you calling it for the struct, however you arrange each field > >>>>> in the struct, some metadata is always needed for dmabuf to intergrate into > >>>>> page pool. > >>>>> > >>>>> If the above is true, why not utilize the 'struct page' to have more unified > >>>>> handling? > >>>> > >>>> My understanding is that there is a general preference to simplify struct > >>>> page, and at the least not move in the other direction by overloading the > >>>> struct in new ways. > >>> > >>> As my understanding, the new struct is just mirroring the struct page pool > >>> is already using, see: > >>> https://elixir.free-electrons.com/linux/v6.7-rc1/source/include/linux/mm_types.h#L119 > >>> > >>> If there is simplifying to the struct page_pool is using, I think the new > >>> stuct the devmem memory provider is using can adjust accordingly. > >>> > >>> As a matter of fact, I think the way 'struct page' for devmem is decoupled > >>> from mm subsystem may provide a way to simplify or decoupled the already > >>> existing 'struct page' used in netstack from mm subsystem, before this > >>> patchset, it seems we have the below types of 'struct page': > >>> 1. page allocated in the netstack using page pool. > >>> 2. page allocated in the netstack using buddy allocator. > >>> 3. page allocated in other subsystem and passed to the netstack, such as > >>> zcopy or spliced page? 
> >>> > >>> If we can decouple 'struct page' for devmem from mm subsystem, we may be able > >>> to decouple the above 'struct page' from mm subsystem one by one. > >>> > >>>> > >>>> If using struct page for something that is not memory, there is ZONE_DEVICE. > >>>> But using that correctly is non-trivial: > >>>> > >>>> https://lore.kernel.org/all/ZKyZBbKEpmkFkpWV@ziepe.ca/ > >>>> > >>>> Since all we need is a handle that does not leave the network stack, > >>>> a network specific struct like page_pool_iov entirely avoids this issue. > >>> > >>> Yes, I am agree about the network specific struct. > >>> I am wondering if we can make the struct more generic if we want to > >>> intergrate it into page_pool and use it in net stack. > >>> > >>>> RFC v3 seems like a good simplification over RFC v1 in that regard to me. > >>>> I was also pleasantly surprised how minimal the change to the users of > >>>> skb_frag_t actually proved to be. > >>> > >>> Yes, I am agreed about that too. Maybe we can make it simpler by using > >>> a more abstract struct as page_pool, and utilize some features of > >>> page_pool too. > >>> > >>> For example, from the page_pool doc, page_pool have fast cache and > >>> ptr-ring cache as below, but if napi_frag_unref() call > >>> page_pool_page_put_many() and return the dmabuf chunk directly to > >>> gen_pool in the memory provider, then it seems we are bypassing the > >>> below caches in the page_pool. > >>> > >> > >> I think you're just misunderstanding the code. The page recycling > >> works with my patchset. napi_frag_unref() calls napi_pp_put_page() if > >> recycle == true, and that works the same with devmem as with regular > >> pages. > >> > >> If recycle == false, we call page_pool_page_put_many() which will call > >> put_page() for regular pages and page_pool_iov_put_many() for devmem > >> pages. So, the memory recycling works exactly the same as before with > >> devmem as with regular pages. 
In my tests I do see the devmem being > >> recycled correctly. We are not bypassing any caches. > >> > >> > > > > Ah, taking a closer look here, the devmem recycling works for me but I > > think that's a side effect to the fact that the page_pool support I > > implemented with GVE is unusual. I currently allocate pages from the > > page_pool but do not set skb_mark_for_recycle(). The page recycling > > still happens when GVE is done with the page and calls > > page_pool_put_full_page(), as that eventually checks the refcount on > > the devmem and recycles it. > > > > I will fix up the GVE to call skb_mark_for_recycle() and ensure the > > napi_pp_put_page() path recycles the devmem or page correctly in the > > next version. > > What about other features? Like dma mapping optimization and frag support > in the page pool. > PP_FLAG_DMA_MAP will be supported and required in the next version per Jakub's request. Frag support is something I disabled in the initial versions of the patchset, but only out of convenience and to simplify the initial implementation. At Google we typically use a page-aligned MSS, and the frag support isn't really that useful for us. I don't see an issue extending frag support to devmem and iovs in the future if needed. We'd probably add the pp_frag_count field to page_pool_iov and handle it similarly to how it's handled for pages. > I understand that you use some trick in the gen_pool to avoid the per chunk > dma_addr storage in the 'struct page_pool_iov' and do not need frag support > for now. 
> > But for other memory provider, if they need those supports, we probably need > to extend 'struct page_pool_iov' to support those features, then we may have > a 'struct page_pool_iov' to be looking alike to the sepcific union for page_pool > in 'struct page', see: > > https://elixir.free-electrons.com/linux/v6.7-rc1/source/include/linux/mm_types.h#L119 > > We currently don't have a way to decouple the existing 'struct page' from mm > subsystem yet as my understanding, if we don't mirror the above union in the > 'struct page', we may have more 'if' checking added in the future.
On Thu, Nov 16, 2023 at 07:10:01PM +0800, Yunsheng Lin wrote: > On 2023/11/15 21:38, Jason Gunthorpe wrote: > > On Wed, Nov 15, 2023 at 05:21:02PM +0800, Yunsheng Lin wrote: > > > >>>>> I would expect net stack, page pool, driver still see the 'struct page', > >>>>> only memory provider see the specific struct for itself, for the above, > >>>>> devmem memory provider sees the 'struct page_pool_iov'. > >>>>> > >>>>> The reason I still expect driver to see the 'struct page' is that driver > >>>>> will still need to support normal memory besides devmem. > >>> > >>> I wouldn't say this approach is unreasonable, but it does have to be > >>> done carefully to isolate the mm. Keeping the struct page in the API > >>> is going to make this very hard. > >> > >> I would expect that most of the isolation is done in page pool, as far as > >> I can see: > > > > It is the sort of thing that is important enough it should have > > compiler help via types to prove that it is being done > > properly. Otherwise it will be full of mistakes over time. > > Yes, agreed. > > I have done something similar as willy has done for some of > folio conversion as below: That is not at all what I mean, I mean you should not use struct page * types at all in code that flows from the _iov version except via limited accessors that can be audited and have appropriate assertions. Just releasing struct page * that is not a struct page * everywhere without type safety will never be correct long term. Jason
On 11/16/23 4:12 AM, Yunsheng Lin wrote: > On 2023/11/16 1:57, David Ahern wrote: >> On 11/15/23 2:21 AM, Yunsheng Lin wrote: >>> On 2023/11/14 21:16, Jason Gunthorpe wrote: >>>> On Tue, Nov 14, 2023 at 04:21:26AM -0800, Mina Almasry wrote: >>>> >>>>> Actually because you put the 'strtuct page for devmem' in >>>>> skb->bv_frag, the net stack will grab the 'struct page' for devmem >>>>> using skb_frag_page() then call things like page_address(), kmap, >>>>> get_page, put_page, etc, etc, etc. >>>> >>>> Yikes, please no. If net has its own struct page look alike it has to >>>> stay entirely inside net. A non-mm owned struct page should not be >>>> passed into mm calls. It is just way too hacky to be seriously >>>> considered :( >>> >>> Yes, that is something this patchset is trying to do, defining its own >>> struct page look alike for page pool to support devmem. >>> >> >> Networking needs to be able to move away from struct page references. >> The devmem and host memory for Rx use cases do not need to be page based. > > Yes, I am agreed the ultimate goal is to move away from struct page > references. But I am not sure if we can do it right away as there > still are different types of existing 'struct page' in the netstack, > see: > > https://lore.kernel.org/all/8b7d25eb-1f10-3e37-8753-92b42da3fb34@huawei.com/ yes, that is the point of a blended approach -- pages and buffers (or iov) -- leveraging the LSB of the address. That proposal is the right direction to be moving for co-existence. Adding fake struct page instances is the wrong direction.