[RFC,v3,06/12] memory-provider: dmabuf devmem memory provider
Commit Message
Implement a memory provider that allocates dmabuf devmem page_pool_iovs.
Support of PP_FLAG_DMA_MAP and PP_FLAG_DMA_SYNC_DEV is omitted for
simplicity.
The provider receives a reference to the struct netdev_dmabuf_binding
via the pool->mp_priv pointer. The driver needs to set this pointer for
the provider in the page_pool_params.
The provider obtains a reference on the netdev_dmabuf_binding which
guarantees the binding and the underlying mapping remains alive until
the provider is destroyed.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
---
include/net/page_pool/helpers.h | 40 +++++++++++++++++
include/net/page_pool/types.h | 10 +++++
net/core/page_pool.c | 76 +++++++++++++++++++++++++++++++++
3 files changed, 126 insertions(+)
Comments
On 11/05, Mina Almasry wrote:
> Implement a memory provider that allocates dmabuf devmem page_pool_iovs.
>
> Support of PP_FLAG_DMA_MAP and PP_FLAG_DMA_SYNC_DEV is omitted for
> simplicity.
>
> The provider receives a reference to the struct netdev_dmabuf_binding
> via the pool->mp_priv pointer. The driver needs to set this pointer for
> the provider in the page_pool_params.
>
> The provider obtains a reference on the netdev_dmabuf_binding which
> guarantees the binding and the underlying mapping remains alive until
> the provider is destroyed.
>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>
> ---
> include/net/page_pool/helpers.h | 40 +++++++++++++++++
> include/net/page_pool/types.h | 10 +++++
> net/core/page_pool.c | 76 +++++++++++++++++++++++++++++++++
> 3 files changed, 126 insertions(+)
>
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index 78cbb040af94..b93243c2a640 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -53,6 +53,7 @@
> #define _NET_PAGE_POOL_HELPERS_H
>
> #include <net/page_pool/types.h>
> +#include <net/net_debug.h>
>
> #ifdef CONFIG_PAGE_POOL_STATS
> int page_pool_ethtool_stats_get_count(void);
> @@ -111,6 +112,45 @@ page_pool_iov_binding(const struct page_pool_iov *ppiov)
> return page_pool_iov_owner(ppiov)->binding;
> }
>
> +static inline int page_pool_iov_refcount(const struct page_pool_iov *ppiov)
> +{
> + return refcount_read(&ppiov->refcount);
> +}
> +
> +static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov,
> + unsigned int count)
> +{
> + refcount_add(count, &ppiov->refcount);
> +}
> +
> +void __page_pool_iov_free(struct page_pool_iov *ppiov);
> +
> +static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov,
> + unsigned int count)
> +{
> + if (!refcount_sub_and_test(count, &ppiov->refcount))
> + return;
> +
> + __page_pool_iov_free(ppiov);
> +}
> +
> +/* page pool mm helpers */
> +
> +static inline bool page_is_page_pool_iov(const struct page *page)
> +{
> + return (unsigned long)page & PP_DEVMEM;
> +}
Speaking of bpf: one thing that might be problematic with this PP_DEVMEM
bit is that it will make debugging with bpftrace a bit (more)
complicated. If somebody were trying to get to that page_pool_iov from
the frags, they will have to do the equivalent of page_is_page_pool_iov,
but probably not a big deal? (thinking out loud)
On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index 78cbb040af94..b93243c2a640 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -111,6 +112,45 @@ page_pool_iov_binding(const struct page_pool_iov *ppiov)
> return page_pool_iov_owner(ppiov)->binding;
> }
>
> +static inline int page_pool_iov_refcount(const struct page_pool_iov *ppiov)
> +{
> + return refcount_read(&ppiov->refcount);
> +}
> +
> +static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov,
> + unsigned int count)
> +{
> + refcount_add(count, &ppiov->refcount);
> +}
> +
> +void __page_pool_iov_free(struct page_pool_iov *ppiov);
> +
> +static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov,
> + unsigned int count)
> +{
> + if (!refcount_sub_and_test(count, &ppiov->refcount))
> + return;
> +
> + __page_pool_iov_free(ppiov);
> +}
> +
> +/* page pool mm helpers */
> +
> +static inline bool page_is_page_pool_iov(const struct page *page)
> +{
> + return (unsigned long)page & PP_DEVMEM;
This is another one where the code can be more generic to not force a
lot changes later. e.g., PP_CUSTOM or PP_NO_PAGE. Then io_uring use
case with host memory can leverage the iov pool in a similar manner.
That does mean skb->devmem needs to be a flag on the page pool and not
just assume iov == device memory.
On Mon, Nov 6, 2023 at 1:02 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 11/05, Mina Almasry wrote:
> > +static inline bool page_is_page_pool_iov(const struct page *page)
> > +{
> > + return (unsigned long)page & PP_DEVMEM;
> > +}
>
> Speaking of bpf: one thing that might be problematic with this PP_DEVMEM
> bit is that it will make debugging with bpftrace a bit (more)
> complicated. If somebody were trying to get to that page_pool_iov from
> the frags, they will have to do the equivalent of page_is_page_pool_iov,
> but probably not a big deal? (thinking out loud)
Good point, but that doesn't only apply to bpf I think. I'm guessing
even debugger drgn access to the bv_page in the frag will have trouble
if it's actually accessing an iov with LSB set.
But this is not specific to this use for LSB pointer trick. I think
all code that currently uses LSB pointer trick will have similar
troubles. In this context my humble vote is that we get such big
upside from reducing code churn that it's reasonable to tolerate such
side effects.
I could alleviate some of the issues by teaching drgn to do the right
thing for devmem/iovs... time permitting.
On Mon, Nov 6, 2023 at 3:49 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 11/5/23 7:44 PM, Mina Almasry wrote:
> > diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> > index 78cbb040af94..b93243c2a640 100644
> > --- a/include/net/page_pool/helpers.h
> > +++ b/include/net/page_pool/helpers.h
> > @@ -111,6 +112,45 @@ page_pool_iov_binding(const struct page_pool_iov *ppiov)
> > return page_pool_iov_owner(ppiov)->binding;
> > }
> >
> > +static inline int page_pool_iov_refcount(const struct page_pool_iov *ppiov)
> > +{
> > + return refcount_read(&ppiov->refcount);
> > +}
> > +
> > +static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov,
> > + unsigned int count)
> > +{
> > + refcount_add(count, &ppiov->refcount);
> > +}
> > +
> > +void __page_pool_iov_free(struct page_pool_iov *ppiov);
> > +
> > +static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov,
> > + unsigned int count)
> > +{
> > + if (!refcount_sub_and_test(count, &ppiov->refcount))
> > + return;
> > +
> > + __page_pool_iov_free(ppiov);
> > +}
> > +
> > +/* page pool mm helpers */
> > +
> > +static inline bool page_is_page_pool_iov(const struct page *page)
> > +{
> > + return (unsigned long)page & PP_DEVMEM;
>
> This is another one where the code can be more generic to not force a
> lot changes later. e.g., PP_CUSTOM or PP_NO_PAGE. Then io_uring use
> case with host memory can leverage the iov pool in a similar manner.
>
> That does mean skb->devmem needs to be a flag on the page pool and not
> just assume iov == device memory.
>
>
>
On 11/7/23 5:02 PM, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 1:02 PM Stanislav Fomichev <sdf@google.com> wrote:
>>
>> On 11/05, Mina Almasry wrote:
>>> +static inline bool page_is_page_pool_iov(const struct page *page)
>>> +{
>>> + return (unsigned long)page & PP_DEVMEM;
>>> +}
>>
>> Speaking of bpf: one thing that might be problematic with this PP_DEVMEM
>> bit is that it will make debugging with bpftrace a bit (more)
>> complicated. If somebody were trying to get to that page_pool_iov from
>> the frags, they will have to do the equivalent of page_is_page_pool_iov,
>> but probably not a big deal? (thinking out loud)
>
> Good point, but that doesn't only apply to bpf I think. I'm guessing
> even debugger drgn access to the bv_page in the frag will have trouble
> if it's actually accessing an iov with LSB set.
>
> But this is not specific to this use for LSB pointer trick. I think
> all code that currently uses LSB pointer trick will have similar
> troubles. In this context my humble vote is that we get such big
> upside from reducing code churn that it's reasonable to tolerate such
> side effects.
+1
>
> I could alleviate some of the issues by teaching drgn to do the right
> thing for devmem/iovs... time permitting.
>
Tools like drgn and crash have to know when the LSB trick is used -
e.g., dst_entry - and handle it when dereferencing pointers.
On Sun, 5 Nov 2023 18:44:05 -0800 Mina Almasry wrote:
> +static int mp_dmabuf_devmem_init(struct page_pool *pool)
> +{
> + struct netdev_dmabuf_binding *binding = pool->mp_priv;
> +
> + if (!binding)
> + return -EINVAL;
> +
> + if (pool->p.flags & PP_FLAG_DMA_MAP ||
> + pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> + return -EOPNOTSUPP;
This looks backwards, we should _force_ the driver to use the dma
mapping built into the page pool APIs, to isolate the driver from
how the DMA addr actually gets obtained. Right?
Maybe seeing driver patches would illuminate.
On Fri, Nov 10, 2023 at 3:16 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sun, 5 Nov 2023 18:44:05 -0800 Mina Almasry wrote:
> > +static int mp_dmabuf_devmem_init(struct page_pool *pool)
> > +{
> > + struct netdev_dmabuf_binding *binding = pool->mp_priv;
> > +
> > + if (!binding)
> > + return -EINVAL;
> > +
> > + if (pool->p.flags & PP_FLAG_DMA_MAP ||
> > + pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> > + return -EOPNOTSUPP;
>
> This looks backwards, we should _force_ the driver to use the dma
> mapping built into the page pool APIs, to isolate the driver from
> how the DMA addr actually gets obtained. Right?
>
> Maybe seeing driver patches would illuminate.
The full tree with driver patches is here:
https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-v3
This is probably the most relevant patch, it implements POC page-pool
support into GVE + devmem support:
https://github.com/torvalds/linux/commit/3c27aa21eb3374f2f1677ece6258f046da234443
But, to answer your question, yes, this is a mistake. devmem doesn't
need to be mapped, which is why I disabled the flag. Actually what
should happen is like you said, we should enforce that PP_FLAG_DMA_MAP
is on, and have it be a no-op, so the driver doesn't try to map the
devmem on its own.
@@ -53,6 +53,7 @@
#define _NET_PAGE_POOL_HELPERS_H
#include <net/page_pool/types.h>
+#include <net/net_debug.h>
#ifdef CONFIG_PAGE_POOL_STATS
int page_pool_ethtool_stats_get_count(void);
@@ -111,6 +112,45 @@ page_pool_iov_binding(const struct page_pool_iov *ppiov)
return page_pool_iov_owner(ppiov)->binding;
}
+static inline int page_pool_iov_refcount(const struct page_pool_iov *ppiov)
+{
+ return refcount_read(&ppiov->refcount);
+}
+
+static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov,
+ unsigned int count)
+{
+ refcount_add(count, &ppiov->refcount);
+}
+
+void __page_pool_iov_free(struct page_pool_iov *ppiov);
+
+static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov,
+ unsigned int count)
+{
+ if (!refcount_sub_and_test(count, &ppiov->refcount))
+ return;
+
+ __page_pool_iov_free(ppiov);
+}
+
+/* page pool mm helpers */
+
+static inline bool page_is_page_pool_iov(const struct page *page)
+{
+ return (unsigned long)page & PP_DEVMEM;
+}
+
+static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
+{
+ if (page_is_page_pool_iov(page))
+ return (struct page_pool_iov *)((unsigned long)page &
+ ~PP_DEVMEM);
+
+ DEBUG_NET_WARN_ON_ONCE(true);
+ return NULL;
+}
+
/**
* page_pool_dev_alloc_pages() - allocate a page.
* @pool: pool from which to allocate
@@ -124,6 +124,7 @@ struct mem_provider;
enum pp_memory_provider_type {
__PP_MP_NONE, /* Use system allocator directly */
+ PP_MP_DMABUF_DEVMEM, /* dmabuf devmem provider */
};
struct pp_memory_provider_ops {
@@ -133,8 +134,15 @@ struct pp_memory_provider_ops {
bool (*release_page)(struct page_pool *pool, struct page *page);
};
+extern const struct pp_memory_provider_ops dmabuf_devmem_ops;
+
/* page_pool_iov support */
+/* We overload the LSB of the struct page pointer to indicate whether it's
+ * a page or page_pool_iov.
+ */
+#define PP_DEVMEM 0x01UL
+
/* Owner of the dma-buf chunks inserted into the gen pool. Each scatterlist
* entry from the dmabuf is inserted into the genpool as a chunk, and needs
* this owner struct to keep track of some metadata necessary to create
@@ -158,6 +166,8 @@ struct page_pool_iov {
struct dmabuf_genpool_chunk_owner *owner;
refcount_t refcount;
+
+ struct page_pool *pp;
};
struct page_pool {
@@ -20,6 +20,7 @@
#include <linux/poison.h>
#include <linux/ethtool.h>
#include <linux/netdevice.h>
+#include <linux/genalloc.h>
#include <trace/events/page_pool.h>
@@ -231,6 +232,9 @@ static int page_pool_init(struct page_pool *pool,
switch (pool->p.memory_provider) {
case __PP_MP_NONE:
break;
+ case PP_MP_DMABUF_DEVMEM:
+ pool->mp_ops = &dmabuf_devmem_ops;
+ break;
default:
err = -EINVAL;
goto free_ptr_ring;
@@ -996,3 +1000,75 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
}
}
EXPORT_SYMBOL(page_pool_update_nid);
+
+void __page_pool_iov_free(struct page_pool_iov *ppiov)
+{
+ if (ppiov->pp->mp_ops != &dmabuf_devmem_ops)
+ return;
+
+ netdev_free_devmem(ppiov);
+}
+EXPORT_SYMBOL_GPL(__page_pool_iov_free);
+
+/*** "Dmabuf devmem memory provider" ***/
+
+static int mp_dmabuf_devmem_init(struct page_pool *pool)
+{
+ struct netdev_dmabuf_binding *binding = pool->mp_priv;
+
+ if (!binding)
+ return -EINVAL;
+
+ if (pool->p.flags & PP_FLAG_DMA_MAP ||
+ pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+ return -EOPNOTSUPP;
+
+ netdev_devmem_binding_get(binding);
+ return 0;
+}
+
+static struct page *mp_dmabuf_devmem_alloc_pages(struct page_pool *pool,
+ gfp_t gfp)
+{
+ struct netdev_dmabuf_binding *binding = pool->mp_priv;
+ struct page_pool_iov *ppiov;
+
+ ppiov = netdev_alloc_devmem(binding);
+ if (!ppiov)
+ return NULL;
+
+ ppiov->pp = pool;
+ pool->pages_state_hold_cnt++;
+ trace_page_pool_state_hold(pool, (struct page *)ppiov,
+ pool->pages_state_hold_cnt);
+ return (struct page *)((unsigned long)ppiov | PP_DEVMEM);
+}
+
+static void mp_dmabuf_devmem_destroy(struct page_pool *pool)
+{
+ struct netdev_dmabuf_binding *binding = pool->mp_priv;
+
+ netdev_devmem_binding_put(binding);
+}
+
+static bool mp_dmabuf_devmem_release_page(struct page_pool *pool,
+ struct page *page)
+{
+ struct page_pool_iov *ppiov;
+
+ if (WARN_ON_ONCE(!page_is_page_pool_iov(page)))
+ return false;
+
+ ppiov = page_to_page_pool_iov(page);
+ page_pool_iov_put_many(ppiov, 1);
+ /* We don't want the page pool put_page()ing our page_pool_iovs. */
+ return false;
+}
+
+const struct pp_memory_provider_ops dmabuf_devmem_ops = {
+ .init = mp_dmabuf_devmem_init,
+ .destroy = mp_dmabuf_devmem_destroy,
+ .alloc_pages = mp_dmabuf_devmem_alloc_pages,
+ .release_page = mp_dmabuf_devmem_release_page,
+};
+EXPORT_SYMBOL(dmabuf_devmem_ops);