[net-next,v1,09/16] page_pool: device memory support
Commit Message
Overload the LSB of struct page* to indicate that it's a page_pool_iov.
Refactor mm calls on struct page* into helpers, and add page_pool_iov
handling on those helpers. Modify callers of these mm APIs with calls to
these helpers instead.
In areas where struct page* is dereferenced, add a check for special
handling of page_pool_iov.
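The LSB-overloading scheme described above can be sketched in plain userspace C. The struct definitions and helper names below are simplified stand-ins for illustration (the real helpers live in include/net/page_pool/helpers.h and differ in detail): since struct page is always at least word-aligned, bit 0 of a page pointer is free to act as a "this is really a page_pool_iov" tag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's struct page / struct page_pool_iov. */
struct page { int dummy; };
struct page_pool_iov { int dummy; };

/* struct page is at least 4-byte aligned, so bit 0 of a valid page
 * pointer is always clear and can be overloaded as the ppiov tag. */
#define PP_IOV_TAG 0x1UL

static inline struct page *page_pool_iov_to_page(struct page_pool_iov *iov)
{
	return (struct page *)((uintptr_t)iov | PP_IOV_TAG);
}

static inline int page_is_page_pool_iov(const struct page *page)
{
	return ((uintptr_t)page & PP_IOV_TAG) != 0;
}

static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
{
	return (struct page_pool_iov *)((uintptr_t)page & ~PP_IOV_TAG);
}
```

Callers that dereference struct page* first test the tag and divert to the ppiov path; the tag is masked off before the pointer is ever dereferenced.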
Signed-off-by: Mina Almasry <almasrymina@google.com>
---
v1:
- Disable fragmentation support for iov properly.
- fix napi_pp_put_page() path (Yunsheng).
---
 include/net/page_pool/helpers.h | 78 ++++++++++++++++++++++++++++++++-
 net/core/page_pool.c            | 67 ++++++++++++++++++++--------
 net/core/skbuff.c               | 28 +++++++-----
 3 files changed, 141 insertions(+), 32 deletions(-)
Comments
On Fri, Dec 8, 2023 at 1:30 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
>
> As mentioned before, it seems we need to have the above checking every
> time we need to do some per-page handling in page_pool core, is there
> a plan in your mind how to remove those kind of checking in the future?
>
I see 2 ways to remove the checking, both infeasible:
1. Allocate a wrapper struct that pulls out all the fields the page pool needs:
struct netmem {
	/* common fields */
	refcount_t refcount;
	bool is_pfmemalloc;
	int nid;
	...
	union {
		struct dmabuf_genpool_chunk_owner *owner;
		struct page *page;
	};
};
The page pool then need not care whether the underlying memory is an iov
or a page. However, this introduces significant memory bloat, as the
struct needs to be allocated for each page or ppiov, which I imagine is
not acceptable for the upside of removing a few static_branch'd if
statements that have no performance cost.
2. Create a unified struct for page and dmabuf memory, which the mm
folks have repeatedly nacked, and I imagine will repeatedly nack in
the future.
So I imagine the special handling of ppiov in some form is critical
and the checking may not be removable.
> Even though a static_branch check is added in page_is_page_pool_iov(), it
> does not make much sense that a core has two different 'struct's for its
> most basic data.
>
> IMHO, the ppiov for dmabuf is forced to fit into page_pool without much
> design consideration at this point.
>
...
>
> For now, the above may work for the rx part as it seems that you are
> only enabling rx for dmabuf for now.
>
> What is the plan to enable tx for dmabuf? Is it also integrated into
> page_pool? There was an attempt to enable page_pool for tx; Eric seemed to
> have some comment about this:
> https://lkml.kernel.org/netdev/2cf4b672-d7dc-db3d-ce90-15b4e91c4005@huawei.com/T/#mb6ab62dc22f38ec621d516259c56dd66353e24a2
>
> If tx is not integrated into page_pool, do we need to create a new layer for
> the tx dmabuf?
>
I imagine the TX path will reuse page_pool_iov, page_pool_iov_*()
helpers, and page_pool_page_*() helpers, but will not need any core
page_pool changes. This is because the TX path will have to piggyback
on MSG_ZEROCOPY (devmem is not copyable), so no memory allocation from
the page_pool (or otherwise) is needed or possible. RFCv1 had a TX
implementation based on dmabuf pages without page_pool involvement, I
imagine I'll do something similar.
On Sun, Dec 10, 2023 at 6:04 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/12/9 0:05, Mina Almasry wrote:
> > On Fri, Dec 8, 2023 at 1:30 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >>
> >> As mentioned before, it seems we need to have the above checking every
> >> time we need to do some per-page handling in page_pool core, is there
> >> a plan in your mind how to remove those kind of checking in the future?
> >>
> >
> > I see 2 ways to remove the checking, both infeasible:
> >
> > 1. Allocate a wrapper struct that pulls out all the fields the page pool needs:
> >
> > struct netmem {
> > /* common fields */
> > refcount_t refcount;
> > bool is_pfmemalloc;
> > int nid;
> > ...
> > union {
> > struct dmabuf_genpool_chunk_owner *owner;
> > struct page * page;
> > };
> > };
> >
> > The page pool can then not care if the underlying memory is iov or
> > page. However this introduces significant memory bloat as this struct
> > needs to be allocated for each page or ppiov, which I imagine is not
> > acceptable for the upside of removing a few static_branch'd if
> > statements with no performance cost.
> >
> > 2. Create a unified struct for page and dmabuf memory, which the mm
> > folks have repeatedly nacked, and I imagine will repeatedly nack in
> > the future.
> >
> > So I imagine the special handling of ppiov in some form is critical
> > and the checking may not be removable.
>
> If the above is true, perhaps devmem is not really supposed to be integrated
> into page_pool.
>
> Adding a check for every per-page handling in page_pool core is just too
> hacky to be really considered a longterm solution.
>
The only other option is to implement another page_pool for ppiov and
have the driver create page_pool or ppiov_pool depending on the state
of the netdev_rx_queue (or some helper in the net stack to do that for
the driver). This introduces some code duplication. The ppiov_pool &
page_pool would look similar in implementation.
But this was all discussed in detail in RFC v2 and the last response I
heard from Jesper was in favor of this approach, if I understand
correctly:
https://lore.kernel.org/netdev/7aedc5d5-0daf-63be-21bc-3b724cc1cab9@redhat.com/
Would love to have the maintainer weigh in here.
> It is somewhat ironic that devmem is using static_branch to alleviate the
> performance impact for normal memory at the possible cost of performance
> degradation for devmem; does it not defeat some purpose of integrating devmem
> into page_pool?
>
I don't see the issue. The static branch sets the non-ppiov path as
default if no memory providers are in use, and flips it when they are,
making the default branch prediction ideal in both cases.
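The static-branch argument above can be illustrated with a userspace sketch. A plain bool stands in for the kernel's static key here (in the kernel, static_branch_unlikely() patches the branch out of the instruction stream entirely, so the flag read below costs nothing on the fast path); the names are illustrative, not the patch's exact API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's static key: false (the compiled-in fast
 * path) while no memory providers are installed, flipped to true only
 * when a devmem provider is in use. */
static bool page_pool_mem_providers;

struct page { int dummy; };

#define PP_IOV_TAG 0x1UL

static inline bool page_is_page_pool_iov(const struct page *page)
{
	/* With the key off, the tag check is never reached, so regular
	 * page_pool users pay nothing for devmem support. */
	return page_pool_mem_providers && ((uintptr_t)page & PP_IOV_TAG);
}
```

The default (providers off) branch is what ordinary traffic predicts and takes; flipping the key makes the ppiov path the predicted one instead.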
> >
> >> Even though a static_branch check is added in page_is_page_pool_iov(), it
> >> does not make much sense that a core has tow different 'struct' for its
> >> most basic data.
> >>
> >> IMHO, the ppiov for dmabuf is forced fitting into page_pool without much
> >> design consideration at this point.
> >>
> > ...
> >>
> >> For now, the above may work for the the rx part as it seems that you are
> >> only enabling rx for dmabuf for now.
> >>
> >> What is the plan to enable tx for dmabuf? If it is also intergrated into
> >> page_pool? There was a attempt to enable page_pool for tx, Eric seemed to
> >> have some comment about this:
> >> https://lkml.kernel.org/netdev/2cf4b672-d7dc-db3d-ce90-15b4e91c4005@huawei.com/T/#mb6ab62dc22f38ec621d516259c56dd66353e24a2
> >>
> >> If tx is not intergrated into page_pool, do we need to create a new layer for
> >> the tx dmabuf?
> >>
> >
> > I imagine the TX path will reuse page_pool_iov, page_pool_iov_*()
> > helpers, and page_pool_page_*() helpers, but will not need any core
> > page_pool changes. This is because the TX path will have to piggyback
>
> We may need another bit/flags checking to demux between page_pool owned
> devmem and non-page_pool owned devmem.
>
The way I'm imagining the support, I don't see the need for such
flags. We'd be re-using generic helpers like
page_pool_iov_get_dma_address() and what not that don't need that
checking.
> Also calling page_pool_*() on non-page_pool owned devmem is confusing
> enough that we may need a thin layer handling non-page_pool owned devmem
> in the end.
>
The page_pool_page* & page_pool_iov* functions can be renamed if
confusing. I would think that's no issue (note that the page_pool_*
functions need not be called for TX path).
> > on MSG_ZEROCOPY (devmem is not copyable), so no memory allocation from
> > the page_pool (or otherwise) is needed or possible. RFCv1 had a TX
> > implementation based on dmabuf pages without page_pool involvement, I
> > imagine I'll do something similar.
> It would be good to have a tx implementation for the next version, so
> that we can have a whole picture of devmem.
>
> >
On Sun, Dec 10, 2023 at 6:26 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Sun, Dec 10, 2023 at 6:04 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >
> > On 2023/12/9 0:05, Mina Almasry wrote:
> > > On Fri, Dec 8, 2023 at 1:30 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> > >>
> > >>
> > >> As mentioned before, it seems we need to have the above checking every
> > >> time we need to do some per-page handling in page_pool core, is there
> > >> a plan in your mind how to remove those kind of checking in the future?
> > >>
> > >
> > > I see 2 ways to remove the checking, both infeasible:
> > >
> > > 1. Allocate a wrapper struct that pulls out all the fields the page pool needs:
> > >
> > > struct netmem {
> > > /* common fields */
> > > refcount_t refcount;
> > > bool is_pfmemalloc;
> > > int nid;
> > > ...
> > > union {
> > > struct dmabuf_genpool_chunk_owner *owner;
> > > struct page * page;
> > > };
> > > };
> > >
> > > The page pool can then not care if the underlying memory is iov or
> > > page. However this introduces significant memory bloat as this struct
> > > needs to be allocated for each page or ppiov, which I imagine is not
> > > acceptable for the upside of removing a few static_branch'd if
> > > statements with no performance cost.
> > >
> > > 2. Create a unified struct for page and dmabuf memory, which the mm
> > > folks have repeatedly nacked, and I imagine will repeatedly nack in
> > > the future.
> > >
> > > So I imagine the special handling of ppiov in some form is critical
> > > and the checking may not be removable.
> >
> > If the above is true, perhaps devmem is not really supposed to be intergated
> > into page_pool.
> >
> > Adding a checking for every per-page handling in page_pool core is just too
> > hacky to be really considerred a longterm solution.
> >
>
> The only other option is to implement another page_pool for ppiov and
> have the driver create page_pool or ppiov_pool depending on the state
> of the netdev_rx_queue (or some helper in the net stack to do that for
> the driver). This introduces some code duplication. The ppiov_pool &
> page_pool would look similar in implementation.
>
> But this was all discussed in detail in RFC v2 and the last response I
> heard from Jesper was in favor if this approach, if I understand
> correctly:
>
> https://lore.kernel.org/netdev/7aedc5d5-0daf-63be-21bc-3b724cc1cab9@redhat.com/
>
> Would love to have the maintainer weigh in here.
>
I should note we may be able to remove some of the checking, but maybe not all.
- Checks that disable page fragging for ppiov can be removed once
ppiov has frag support (in this series or follow up).
- If we use page->pp_frag_count (or page->pp_ref_count) for
refcounting ppiov, we can remove the if checking in the refcounting.
- We may be able to store the dma_addr of the ppiov in page->dma_addr,
but I'm unsure if that actually works, because the dma_buf dma addr is
dma_addr_t (u32 or u64), but page->dma_addr is unsigned long (4 bytes
on 32-bit systems). But if it works for pages I may be able to make it
work for ppiov as well.
- Checks that obtain the page->pp can work with ppiov if we align the
offset of page->pp and ppiov->pp.
- Checks around page->pp_magic can be removed if we also have offset
aligned ppiov->pp_magic.
Sadly I don't see us removing the checking for these other cases:
- page_is_pfmemalloc(): I'm not allowed to pass a non-struct page into
that helper.
- page_to_nid(): I'm not allowed to pass a non-struct page into that helper.
- page_pool_free_va(): ppiov have no va.
- page_pool_sync_for_dev/page_pool_dma_map: ppiov backed by dma-buf
fundamentally can't get mapped again.
Are the removal (or future removal) of these checks enough to resolve this?
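The offset-alignment idea from the list above can be sketched as follows. The layouts and names here are hypothetical simplifications (the real struct page layout is far richer): if ppiov->pp and ppiov->pp_magic sit at the same byte offsets as page->pp and page->pp_magic, a common accessor can mask off the tag bit and read through either type without an is-ppiov branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct page_pool;	/* opaque here */

/* Simplified layouts with field order chosen so that pp_magic and pp
 * land at identical offsets in both structs. */
struct page {
	unsigned long flags;
	unsigned long pp_magic;
	struct page_pool *pp;
	unsigned long dma_addr;
};

struct page_pool_iov {
	unsigned long pad;	/* stand-in for a ppiov-specific leading field */
	unsigned long pp_magic;
	struct page_pool *pp;
	unsigned long dma_addr;
};

/* With aligned offsets, the accessor needs no type check: it masks the
 * tag bit and reads through the common layout. */
static inline struct page_pool *page_pool_get_pp(void *page_or_iov)
{
	struct page *p = (struct page *)((uintptr_t)page_or_iov & ~0x1UL);
	return p->pp;
}
```

A _Static_assert on the offsetof() pairs would keep the two layouts from silently drifting apart.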
> > It is somewhat ironical that devmem is using static_branch to alliviate the
> > performance impact for normal memory at the possible cost of performance
> > degradation for devmem, does it not defeat some purpose of intergating devmem
> > to page_pool?
> >
>
> I don't see the issue. The static branch sets the non-ppiov path as
> default if no memory providers are in use, and flips it when they are,
> making the default branch prediction ideal in both cases.
>
> > >
> > >> Even though a static_branch check is added in page_is_page_pool_iov(), it
> > >> does not make much sense that a core has tow different 'struct' for its
> > >> most basic data.
> > >>
> > >> IMHO, the ppiov for dmabuf is forced fitting into page_pool without much
> > >> design consideration at this point.
> > >>
> > > ...
> > >>
> > >> For now, the above may work for the the rx part as it seems that you are
> > >> only enabling rx for dmabuf for now.
> > >>
> > >> What is the plan to enable tx for dmabuf? If it is also intergrated into
> > >> page_pool? There was a attempt to enable page_pool for tx, Eric seemed to
> > >> have some comment about this:
> > >> https://lkml.kernel.org/netdev/2cf4b672-d7dc-db3d-ce90-15b4e91c4005@huawei.com/T/#mb6ab62dc22f38ec621d516259c56dd66353e24a2
> > >>
> > >> If tx is not intergrated into page_pool, do we need to create a new layer for
> > >> the tx dmabuf?
> > >>
> > >
> > > I imagine the TX path will reuse page_pool_iov, page_pool_iov_*()
> > > helpers, and page_pool_page_*() helpers, but will not need any core
> > > page_pool changes. This is because the TX path will have to piggyback
> >
> > We may need another bit/flags checking to demux between page_pool owned
> > devmem and non-page_pool owned devmem.
> >
>
> The way I'm imagining the support, I don't see the need for such
> flags. We'd be re-using generic helpers like
> page_pool_iov_get_dma_address() and what not that don't need that
> checking.
>
> > Also calling page_pool_*() on non-page_pool owned devmem is confusing
> > enough that we may need a thin layer handling non-page_pool owned devmem
> > in the end.
> >
>
> The page_pool_page* & page_pool_iov* functions can be renamed if
> confusing. I would think that's no issue (note that the page_pool_*
> functions need not be called for TX path).
>
> > > on MSG_ZEROCOPY (devmem is not copyable), so no memory allocation from
> > > the page_pool (or otherwise) is needed or possible. RFCv1 had a TX
> > > implementation based on dmabuf pages without page_pool involvement, I
> > > imagine I'll do something similar.
> > It would be good to have a tx implementation for the next version, so
> > that we can have a whole picture of devmem.
> >
> > >
>
>
>
> --
> Thanks,
> Mina
--
Thanks,
Mina
On Mon, Dec 11, 2023 at 3:51 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/12/11 12:04, Mina Almasry wrote:
> > On Sun, Dec 10, 2023 at 6:26 PM Mina Almasry <almasrymina@google.com> wrote:
> >>
> >> On Sun, Dec 10, 2023 at 6:04 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>>
> >>> On 2023/12/9 0:05, Mina Almasry wrote:
> >>>> On Fri, Dec 8, 2023 at 1:30 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>>>>
> >>>>>
> >>>>> As mentioned before, it seems we need to have the above checking every
> >>>>> time we need to do some per-page handling in page_pool core, is there
> >>>>> a plan in your mind how to remove those kind of checking in the future?
> >>>>>
> >>>>
> >>>> I see 2 ways to remove the checking, both infeasible:
> >>>>
> >>>> 1. Allocate a wrapper struct that pulls out all the fields the page pool needs:
> >>>>
> >>>> struct netmem {
> >>>> /* common fields */
> >>>> refcount_t refcount;
> >>>> bool is_pfmemalloc;
> >>>> int nid;
> >>>> ...
> >>>> union {
> >>>> struct dmabuf_genpool_chunk_owner *owner;
> >>>> struct page * page;
> >>>> };
> >>>> };
> >>>>
> >>>> The page pool can then not care if the underlying memory is iov or
> >>>> page. However this introduces significant memory bloat as this struct
> >>>> needs to be allocated for each page or ppiov, which I imagine is not
> >>>> acceptable for the upside of removing a few static_branch'd if
> >>>> statements with no performance cost.
> >>>>
> >>>> 2. Create a unified struct for page and dmabuf memory, which the mm
> >>>> folks have repeatedly nacked, and I imagine will repeatedly nack in
> >>>> the future.
> >>>>
> >>>> So I imagine the special handling of ppiov in some form is critical
> >>>> and the checking may not be removable.
> >>>
> >>> If the above is true, perhaps devmem is not really supposed to be intergated
> >>> into page_pool.
> >>>
> >>> Adding a checking for every per-page handling in page_pool core is just too
> >>> hacky to be really considerred a longterm solution.
> >>>
> >>
> >> The only other option is to implement another page_pool for ppiov and
> >> have the driver create page_pool or ppiov_pool depending on the state
> >> of the netdev_rx_queue (or some helper in the net stack to do that for
> >> the driver). This introduces some code duplication. The ppiov_pool &
> >> page_pool would look similar in implementation.
>
> I think there is a design pattern already to deal with this kind of problem:
> refactoring common code used by both page_pool and ppiov into a library to
> avoid code duplication if most of them have a similar implementation.
>
Code can be refactored if it's identical, not if it is similar. I
suspect the page_pools will be only similar, and if you're not willing
to take devmem handling into the page pool then refactoring page_pool
code into helpers that do devmem handling may also not be an option.
> >>
> >> But this was all discussed in detail in RFC v2 and the last response I
> >> heard from Jesper was in favor if this approach, if I understand
> >> correctly:
> >>
> >> https://lore.kernel.org/netdev/7aedc5d5-0daf-63be-21bc-3b724cc1cab9@redhat.com/
> >>
> >> Would love to have the maintainer weigh in here.
> >>
> >
> > I should note we may be able to remove some of the checking, but maybe not all.
> >
> > - Checks that disable page fragging for ppiov can be removed once
> > ppiov has frag support (in this series or follow up).
> >
> > - If we use page->pp_frag_count (or page->pp_ref_count) for
> > refcounting ppiov, we can remove the if checking in the refcounting.
> >
I'm not sure this is actually possible in the short term. The
page_pool uses both page->_refcount and page->pp_frag_count for
refcounting, and I will not be able to remove the special handling
around page->_refcount, as I'm not allowed to call page_ref_*() APIs on
a non-struct page.
> > - We may be able to store the dma_addr of the ppiov in page->dma_addr,
> > but I'm unsure if that actually works, because the dma_buf dmaddr is
> > dma_addr_t (u32 or u64), but page->dma_addr is unsigned long (4 bytes
> > I think). But if it works for pages I may be able to make it work for
> > ppiov as well.
> >
> > - Checks that obtain the page->pp can work with ppiov if we align the
> > offset of page->pp and ppiov->pp.
> >
> > - Checks around page->pp_magic can be removed if we also have offset
> > aligned ppiov->pp_magic.
> >
> > Sadly I don't see us removing the checking for these other cases:
> >
> > - page_is_pfmemalloc(): I'm not allowed to pass a non-struct page into
> > that helper.
>
> We can do a similar trick as above, as bit 1 of page->pp_magic is used to
> indicate whether it is a pfmemalloc page.
>
Likely yes.
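The trick being agreed to here could look roughly like this. The bit position and helper names are hypothetical (the suggestion above says bit 1 of pp_magic, so that is what the sketch encodes): the pfmemalloc state lives in pp_magic, which both pages and ppiovs would carry, so a common helper never has to touch struct-page-only fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding: bit 1 of pp_magic records pfmemalloc state. */
#define PP_MAGIC_PFMEMALLOC_BIT	(1UL << 1)

struct page_pool_iov {
	unsigned long pp_magic;
};

static inline void ppiov_set_pfmemalloc(struct page_pool_iov *iov, bool v)
{
	if (v)
		iov->pp_magic |= PP_MAGIC_PFMEMALLOC_BIT;
	else
		iov->pp_magic &= ~PP_MAGIC_PFMEMALLOC_BIT;
}

static inline bool ppiov_is_pfmemalloc(const struct page_pool_iov *iov)
{
	return iov->pp_magic & PP_MAGIC_PFMEMALLOC_BIT;
}
```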
> >
> > - page_to_nid(): I'm not allowed to pass a non-struct page into that helper.
>
> Yes, this one need special case.
>
> >
> > - page_pool_free_va(): ppiov have no va.
>
> Doesn't the skb_frags_readable() check protect page_pool_free_va()
> from being called on devmem?
>
This function seems to be only called from veth which doesn't support
devmem. I can remove the handling there.
> >
> > - page_pool_sync_for_dev/page_pool_dma_map: ppiov backed by dma-buf
> > fundamentally can't get mapped again.
>
> Can we just fail the page_pool creation with PP_FLAG_DMA_MAP and
> DMA_ATTR_SKIP_CPU_SYNC flags for devmem provider?
>
Jakub says PP_FLAG_DMA_MAP must be enabled for devmem, such that the
page_pool handles the dma mapping of the devmem and the driver doesn't
use it on its own.
We could perhaps fail page pool creation when PP_FLAG_DMA_SYNC_DEV is
set for devmem, and then remove the checking from
page_pool_sync_for_dev().
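A sketch of that creation-time validation, assuming hypothetical flag values shaped like page_pool's PP_FLAG_* bits: the pool must own the mapping (so PP_FLAG_DMA_MAP is required per Jakub's point above), but a dma-buf cannot be CPU-synced per packet, so PP_FLAG_DMA_SYNC_DEV is rejected up front instead of being checked on every buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag values mirroring page_pool's PP_FLAG_* bits. */
#define PP_FLAG_DMA_MAP		(1 << 0)
#define PP_FLAG_DMA_SYNC_DEV	(1 << 1)

/* Validate creation flags for a pool backed by a devmem provider:
 * the pool must own the dma mapping, and per-packet CPU sync is
 * impossible for dma-buf memory. */
static int page_pool_check_devmem_flags(unsigned int flags, bool is_devmem)
{
	if (!is_devmem)
		return 0;
	if (!(flags & PP_FLAG_DMA_MAP))
		return -EINVAL;
	if (flags & PP_FLAG_DMA_SYNC_DEV)
		return -EINVAL;
	return 0;
}
```

Rejecting the combination at creation moves the cost from the per-packet path to a one-time setup check.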
> >
> > Are the removal (or future removal) of these checks enough to resolve this?
>
> Yes, that is somewhat similar to my proposal, the biggest objection seems to
> be that we need to have a safe type checking for it to work correctly.
>
> >
> >>> It is somewhat ironical that devmem is using static_branch to alliviate the
> >>> performance impact for normal memory at the possible cost of performance
> >>> degradation for devmem, does it not defeat some purpose of intergating devmem
> >>> to page_pool?
> >>>
> >>
> >> I don't see the issue. The static branch sets the non-ppiov path as
> >> default if no memory providers are in use, and flips it when they are,
> >> making the default branch prediction ideal in both cases.
>
> You are assuming that we are not using the page pool for both normal memory
> and devmem at the same time. But a generic solution should not make that
> assumption, as I understand it.
>
> >>
> >>>>
> >>>>> Even though a static_branch check is added in page_is_page_pool_iov(), it
> >>>>> does not make much sense that a core has tow different 'struct' for its
> >>>>> most basic data.
> >>>>>
> >>>>> IMHO, the ppiov for dmabuf is forced fitting into page_pool without much
> >>>>> design consideration at this point.
> >>>>>
> >>>> ...
> >>>>>
> >>>>> For now, the above may work for the the rx part as it seems that you are
> >>>>> only enabling rx for dmabuf for now.
> >>>>>
> >>>>> What is the plan to enable tx for dmabuf? If it is also intergrated into
> >>>>> page_pool? There was a attempt to enable page_pool for tx, Eric seemed to
> >>>>> have some comment about this:
> >>>>> https://lkml.kernel.org/netdev/2cf4b672-d7dc-db3d-ce90-15b4e91c4005@huawei.com/T/#mb6ab62dc22f38ec621d516259c56dd66353e24a2
> >>>>>
> >>>>> If tx is not intergrated into page_pool, do we need to create a new layer for
> >>>>> the tx dmabuf?
> >>>>>
> >>>>
> >>>> I imagine the TX path will reuse page_pool_iov, page_pool_iov_*()
> >>>> helpers, and page_pool_page_*() helpers, but will not need any core
> >>>> page_pool changes. This is because the TX path will have to piggyback
> >>>
> >>> We may need another bit/flags checking to demux between page_pool owned
> >>> devmem and non-page_pool owned devmem.
> >>>
> >>
> >> The way I'm imagining the support, I don't see the need for such
> >> flags. We'd be re-using generic helpers like
> >> page_pool_iov_get_dma_address() and what not that don't need that
> >> checking.
> >>
> >>> Also calling page_pool_*() on non-page_pool owned devmem is confusing
> >>> enough that we may need a thin layer handling non-page_pool owned devmem
> >>> in the end.
> >>>
> >>
> >> The page_pool_page* & page_pool_iov* functions can be renamed if
> >> confusing. I would think that's no issue (note that the page_pool_*
>
> When you rename those functions, you will have a thin layer automatically.
>
> >> functions need not be called for TX path).
> >>
> >>>> on MSG_ZEROCOPY (devmem is not copyable), so no memory allocation from
> >>>> the page_pool (or otherwise) is needed or possible. RFCv1 had a TX
> >>>> implementation based on dmabuf pages without page_pool involvement, I
> >>>> imagine I'll do something similar.
> >>> It would be good to have a tx implementation for the next version, so
> >>> that we can have a whole picture of devmem.
On Tue, Dec 12, 2023 at 3:17 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/12/12 2:14, Mina Almasry wrote:
> > On Mon, Dec 11, 2023 at 3:51 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2023/12/11 12:04, Mina Almasry wrote:
> >>> On Sun, Dec 10, 2023 at 6:26 PM Mina Almasry <almasrymina@google.com> wrote:
> >>>>
> >>>> On Sun, Dec 10, 2023 at 6:04 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>>>>
> >>>>> On 2023/12/9 0:05, Mina Almasry wrote:
> >>>>>> On Fri, Dec 8, 2023 at 1:30 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> As mentioned before, it seems we need to have the above checking every
> >>>>>>> time we need to do some per-page handling in page_pool core, is there
> >>>>>>> a plan in your mind how to remove those kind of checking in the future?
> >>>>>>>
> >>>>>>
> >>>>>> I see 2 ways to remove the checking, both infeasible:
> >>>>>>
> >>>>>> 1. Allocate a wrapper struct that pulls out all the fields the page pool needs:
> >>>>>>
> >>>>>> struct netmem {
> >>>>>> /* common fields */
> >>>>>> refcount_t refcount;
> >>>>>> bool is_pfmemalloc;
> >>>>>> int nid;
> >>>>>> ...
> >>>>>> union {
> >>>>>> struct dmabuf_genpool_chunk_owner *owner;
> >>>>>> struct page * page;
> >>>>>> };
> >>>>>> };
> >>>>>>
> >>>>>> The page pool can then not care if the underlying memory is iov or
> >>>>>> page. However this introduces significant memory bloat as this struct
> >>>>>> needs to be allocated for each page or ppiov, which I imagine is not
> >>>>>> acceptable for the upside of removing a few static_branch'd if
> >>>>>> statements with no performance cost.
> >>>>>>
> >>>>>> 2. Create a unified struct for page and dmabuf memory, which the mm
> >>>>>> folks have repeatedly nacked, and I imagine will repeatedly nack in
> >>>>>> the future.
> >>>>>>
> >>>>>> So I imagine the special handling of ppiov in some form is critical
> >>>>>> and the checking may not be removable.
> >>>>>
> >>>>> If the above is true, perhaps devmem is not really supposed to be intergated
> >>>>> into page_pool.
> >>>>>
> >>>>> Adding a checking for every per-page handling in page_pool core is just too
> >>>>> hacky to be really considerred a longterm solution.
> >>>>>
> >>>>
> >>>> The only other option is to implement another page_pool for ppiov and
> >>>> have the driver create page_pool or ppiov_pool depending on the state
> >>>> of the netdev_rx_queue (or some helper in the net stack to do that for
> >>>> the driver). This introduces some code duplication. The ppiov_pool &
> >>>> page_pool would look similar in implementation.
> >>
> >> I think there is a design pattern already to deal with this kind of problem,
> >> refactoring common code used by both page_pool and ppiov into a library to
> >> aovid code duplication if most of them have similar implementation.
> >>
> >
> > Code can be refactored if it's identical, not if it is similar. I
>
> Similarity indicates an opportunity to refactor out the common
> code, like the page_frag case below:
> https://patchwork.kernel.org/project/netdevbpf/cover/20231205113444.63015-1-linyunsheng@huawei.com/
>
> But until we do a proof-of-concept implementation, it is hard to tell if
> it is feasible or not.
>
> > suspect the page_pools will be only similar, and if you're not willing
> > to take devmem handling into the page pool then refactoring page_pool
> > code into helpers that do devmem handling may also not be an option.
> >
> >>>>
> >>>> But this was all discussed in detail in RFC v2 and the last response I
> >>>> heard from Jesper was in favor if this approach, if I understand
> >>>> correctly:
> >>>>
> >>>> https://lore.kernel.org/netdev/7aedc5d5-0daf-63be-21bc-3b724cc1cab9@redhat.com/
> >>>>
> >>>> Would love to have the maintainer weigh in here.
> >>>>
> >>>
> >>> I should note we may be able to remove some of the checking, but maybe not all.
> >>>
> >>> - Checks that disable page fragging for ppiov can be removed once
> >>> ppiov has frag support (in this series or follow up).
> >>>
> >>> - If we use page->pp_frag_count (or page->pp_ref_count) for
> >>> refcounting ppiov, we can remove the if checking in the refcounting.
> >>>
> >
> > I'm not sure this is actually possible in the short term. The
> > page_pool uses both page->_refcount and page->pp_frag_count for
> > refcounting, and I will not be able to remove the special handling
> > around page->_refcount as i'm not allowed to call page_ref_*() APIs on
> > a non-struct page.
>
> the page_ref_*() API may be avoided using the below patch:
> https://patchwork.kernel.org/project/netdevbpf/patch/20231113130041.58124-7-linyunsheng@huawei.com/
>
Even after the patch above, you're still calling page_ref_count() in
the page_pool to check for recycling, so after that patch you're still
using page->_refcount.
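The recycling check being referred to is, in rough shape, the test below. This is a simplified userspace stand-in (the real page_pool also checks pfmemalloc and in-softirq conditions): a page may be reused in place only when the pool holds the sole reference, and since a ppiov has no page->_refcount, this is exactly the check that keeps the special case alive.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-ins: just enough to show the shape of the recycle test. */
struct page { int _refcount; };

static inline int page_ref_count(const struct page *page)
{
	return page->_refcount;
}

/* The pool may recycle the page directly only if it holds the last
 * reference; otherwise the page is still in flight elsewhere. */
static inline bool page_pool_can_recycle(const struct page *page)
{
	return page_ref_count(page) == 1;
}
```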
> But I am not sure how to do that for the tx part if devmem for tx is not
> integrated into page_pool; that is why I suggest having a tx implementation
> for the next version, so that we can have a whole picture of devmem.
>
I strongly prefer to keep the TX implementation in a separate series.
This series is complicated to implement and review as it is, and is
hitting the 15 patch limit anyway.
> >
> >>> - We may be able to store the dma_addr of the ppiov in page->dma_addr,
> >>> but I'm unsure if that actually works, because the dma_buf dmaddr is
> >>> dma_addr_t (u32 or u64), but page->dma_addr is unsigned long (4 bytes
> >>> I think). But if it works for pages I may be able to make it work for
> >>> ppiov as well.
> >>>
> >>> - Checks that obtain the page->pp can work with ppiov if we align the
> >>> offset of page->pp and ppiov->pp.
> >>>
> >>> - Checks around page->pp_magic can be removed if we also have offset
> >>> aligned ppiov->pp_magic.
> >>>
> >>> Sadly I don't see us removing the checking for these other cases:
> >>>
> >>> - page_is_pfmemalloc(): I'm not allowed to pass a non-struct page into
> >>> that helper.
> >>
> >> We can do similar trick like above as bit 1 of page->pp_magic is used to
> >> indicate that if it is a pfmemalloc page.
> >>
> >
> > Likely yes.
> >
> >>>
> >>> - page_to_nid(): I'm not allowed to pass a non-struct page into that helper.
> >>
> >> Yes, this one need special case.
> >>
> >>>
> >>> - page_pool_free_va(): ppiov have no va.
> >>
> >> Doesn't the skb_frags_readable() checking will protect the page_pool_free_va()
> >> from being called on devmem?
> >>
> >
> > This function seems to be only called from veth which doesn't support
> > devmem. I can remove the handling there.
> >
> >>>
> >>> - page_pool_sync_for_dev/page_pool_dma_map: ppiov backed by dma-buf
> >>> fundamentally can't get mapped again.
> >>
> >> Can we just fail the page_pool creation with PP_FLAG_DMA_MAP and
> >> DMA_ATTR_SKIP_CPU_SYNC flags for devmem provider?
> >>
> >
> > Jakub says PP_FLAG_DMA_MAP must be enabled for devmem, such that the
> > page_pool handles the dma mapping of the devmem and the driver doesn't
> > use it on its own.
>
> I am not sure what benefit enabling DMA_MAP brings for devmem,
> as devmem seems to call dma_buf_map_attachment() in netdev_bind_dmabuf(); it
> does not really need PP_FLAG_DMA_MAP enabled to get the dma addr for the
> devmem chunk.
On Sun, Dec 10, 2023 at 8:04 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Sun, Dec 10, 2023 at 6:26 PM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Sun, Dec 10, 2023 at 6:04 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> > >
> > > On 2023/12/9 0:05, Mina Almasry wrote:
> > > > On Fri, Dec 8, 2023 at 1:30 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> > > >>
> > > >>
> > > >> As mentioned before, it seems we need to have the above checking every
> > > >> time we need to do some per-page handling in page_pool core, is there
> > > >> a plan in your mind how to remove those kind of checking in the future?
> > > >>
> > > >
> > > > I see 2 ways to remove the checking, both infeasible:
> > > >
> > > > 1. Allocate a wrapper struct that pulls out all the fields the page pool needs:
> > > >
> > > > struct netmem {
> > > > /* common fields */
> > > > refcount_t refcount;
> > > > bool is_pfmemalloc;
> > > > int nid;
> > > > ...
> > > > union {
> > > > struct dmabuf_genpool_chunk_owner *owner;
> > > > struct page * page;
> > > > };
> > > > };
> > > >
> > > > The page pool can then not care if the underlying memory is iov or
> > > > page. However this introduces significant memory bloat as this struct
> > > > needs to be allocated for each page or ppiov, which I imagine is not
> > > > acceptable for the upside of removing a few static_branch'd if
> > > > statements with no performance cost.
> > > >
> > > > 2. Create a unified struct for page and dmabuf memory, which the mm
> > > > folks have repeatedly nacked, and I imagine will repeatedly nack in
> > > > the future.
> > > >
> > > > So I imagine the special handling of ppiov in some form is critical
> > > > and the checking may not be removable.
> > >
> > > If the above is true, perhaps devmem is not really supposed to be integrated
> > > into page_pool.
> > >
> > > Adding a check for every per-page handling in page_pool core is just too
> > > hacky to be really considered a long-term solution.
> > >
> >
> > The only other option is to implement another page_pool for ppiov and
> > have the driver create page_pool or ppiov_pool depending on the state
> > of the netdev_rx_queue (or some helper in the net stack to do that for
> > the driver). This introduces some code duplication. The ppiov_pool &
> > page_pool would look similar in implementation.
> >
> > But this was all discussed in detail in RFC v2, and the last response I
> > heard from Jesper was in favor of this approach, if I understand
> > correctly:
> >
> > https://lore.kernel.org/netdev/7aedc5d5-0daf-63be-21bc-3b724cc1cab9@redhat.com/
> >
> > Would love to have the maintainer weigh in here.
> >
>
> I should note we may be able to remove some of the checking, but maybe not all.
>
> - Checks that disable page fragging for ppiov can be removed once
> ppiov has frag support (in this series or follow up).
>
> - If we use page->pp_frag_count (or page->pp_ref_count) for
> refcounting ppiov, we can remove the if checking in the refcounting.
>
> - We may be able to store the dma_addr of the ppiov in page->dma_addr,
> but I'm unsure if that actually works, because the dma_buf dma addr is
> dma_addr_t (u32 or u64), but page->dma_addr is unsigned long (4 bytes
> on 32-bit arches, I think). But if it works for pages, I may be able
> to make it work for ppiov as well.
>
> - Checks that obtain the page->pp can work with ppiov if we align the
> offset of page->pp and ppiov->pp.
>
> - Checks around page->pp_magic can be removed if we also have offset
> aligned ppiov->pp_magic.
>
> Sadly I don't see us removing the checking for these other cases:
>
> - page_is_pfmemalloc(): I'm not allowed to pass a non-struct page into
> that helper.
>
> - page_to_nid(): I'm not allowed to pass a non-struct page into that helper.
>
> - page_pool_free_va(): ppiov have no va.
>
> - page_pool_sync_for_dev/page_pool_dma_map: ppiov backed by dma-buf
> fundamentally can't get mapped again.
>
> Are the removal (or future removal) of these checks enough to resolve this?
>
I took a deeper look here, and with some effort I'm able to remove
almost all the custom checks for ppiov. The only remaining checks for
devmem are the checks around these mm calls:
page_is_pfmemalloc()
page_to_nid()
page_ref_count()
compound_head()
page_is_pfmemalloc() checks can potentially be removed by using a bit
in page->pp_magic to indicate pfmemalloc().
The other 3, I'm not sure I can remove. They rely on the page flags or
other fields not specific to page_pool pages. The next version should
come with the most minimal amount of devmem checks for the page_pool.
> > > It is somewhat ironic that devmem is using static_branch to alleviate the
> > > performance impact for normal memory at the possible cost of performance
> > > degradation for devmem; does that not defeat some of the purpose of
> > > integrating devmem into page_pool?
> > >
> >
> > I don't see the issue. The static branch sets the non-ppiov path as
> > default if no memory providers are in use, and flips it when they are,
> > making the default branch prediction ideal in both cases.
> >
> > > >
> > > >> Even though a static_branch check is added in page_is_page_pool_iov(), it
> > > >> does not make much sense that a core has two different 'struct's for its
> > > >> most basic data.
> > > >>
> > > >> IMHO, the ppiov for dmabuf is force-fitted into page_pool without much
> > > >> design consideration at this point.
> > > >>
> > > > ...
> > > >>
> > > >> For now, the above may work for the rx part, as it seems that you are
> > > >> only enabling rx for dmabuf for now.
> > > >>
> > > >> What is the plan to enable tx for dmabuf? Is it also integrated into
> > > >> page_pool? There was an attempt to enable page_pool for tx; Eric seemed to
> > > >> have some comment about this:
> > > >> https://lkml.kernel.org/netdev/2cf4b672-d7dc-db3d-ce90-15b4e91c4005@huawei.com/T/#mb6ab62dc22f38ec621d516259c56dd66353e24a2
> > > >>
> > > >> If tx is not integrated into page_pool, do we need to create a new layer for
> > > >> the tx dmabuf?
> > > >>
> > > >
> > > > I imagine the TX path will reuse page_pool_iov, page_pool_iov_*()
> > > > helpers, and page_pool_page_*() helpers, but will not need any core
> > > > page_pool changes. This is because the TX path will have to piggyback
> > >
> > > We may need another bit/flag check to demux between page_pool-owned
> > > devmem and non-page_pool-owned devmem.
> > >
> >
> > The way I'm imagining the support, I don't see the need for such
> > flags. We'd be re-using generic helpers like
> > page_pool_iov_get_dma_address() and what not that don't need that
> > checking.
> >
> > > Also calling page_pool_*() on non-page_pool owned devmem is confusing
> > > enough that we may need a thin layer handling non-page_pool owned devmem
> > > in the end.
> > >
> >
> > The page_pool_page* & page_pool_iov* functions can be renamed if
> > confusing. I would think that's no issue (note that the page_pool_*
> > functions need not be called for TX path).
> >
> > > > on MSG_ZEROCOPY (devmem is not copyable), so no memory allocation from
> > > > the page_pool (or otherwise) is needed or possible. RFCv1 had a TX
> > > > implementation based on dmabuf pages without page_pool involvement, I
> > > > imagine I'll do something similar.
> > > It would be good to have a tx implementation for the next version, so
> > > that we can have a whole picture of devmem.
> > >
> > > >
> >
> >
> >
> > --
> > Thanks,
> > Mina
>
>
>
> --
> Thanks,
> Mina
@@ -154,6 +154,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
return NULL;
}
+static inline int page_pool_page_ref_count(struct page *page)
+{
+ if (page_is_page_pool_iov(page))
+ return page_pool_iov_refcount(page_to_page_pool_iov(page));
+
+ return page_ref_count(page);
+}
+
+static inline void page_pool_page_get_many(struct page *page,
+ unsigned int count)
+{
+ if (page_is_page_pool_iov(page))
+ return page_pool_iov_get_many(page_to_page_pool_iov(page),
+ count);
+
+ return page_ref_add(page, count);
+}
+
+static inline void page_pool_page_put_many(struct page *page,
+ unsigned int count)
+{
+ if (page_is_page_pool_iov(page))
+ return page_pool_iov_put_many(page_to_page_pool_iov(page),
+ count);
+
+ if (count > 1)
+ page_ref_sub(page, count - 1);
+
+ put_page(page);
+}
+
+static inline bool page_pool_page_is_pfmemalloc(struct page *page)
+{
+ if (page_is_page_pool_iov(page))
+ return false;
+
+ return page_is_pfmemalloc(page);
+}
+
+static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
+{
+ /* Assume page_pool_iov are on the preferred node without actually
+ * checking...
+ *
+ * This check is only used to check for recycling memory in the page
+ * pool's fast paths. Currently the only implementation of page_pool_iov
+ * is dmabuf device memory. It's a deliberate decision by the user to
+ * bind a certain dmabuf to a certain netdev, and the netdev rx queue
+ * would not be able to reallocate memory from another dmabuf that
+ * exists on the preferred node, so, this check doesn't make much sense
+ * in this case. Assume all page_pool_iovs can be recycled for now.
+ */
+ if (page_is_page_pool_iov(page))
+ return true;
+
+ return page_to_nid(page) == pref_nid;
+}
+
/**
* page_pool_dev_alloc_pages() - allocate a page.
* @pool: pool from which to allocate
@@ -304,6 +362,10 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
{
long ret;
+ /* fragmentation support hasn't been added to ppiov yet */
+ if (WARN_ON_ONCE(page_is_page_pool_iov(page)))
+ return 0;
+
/* If nr == pp_frag_count then we have cleared all remaining
* references to the page:
* 1. 'n == 1': no need to actually overwrite it.
@@ -347,7 +409,8 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
static inline bool page_pool_is_last_frag(struct page *page)
{
/* If page_pool_defrag_page() returns 0, we were the last user */
- return page_pool_defrag_page(page, 1) == 0;
+ return page_is_page_pool_iov(page) ||
+ page_pool_defrag_page(page, 1) == 0;
}
/**
@@ -434,7 +497,12 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va,
*/
static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
{
- dma_addr_t ret = page->dma_addr;
+ dma_addr_t ret;
+
+ if (page_is_page_pool_iov(page))
+ return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
+
+ ret = page->dma_addr;
if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA)
ret <<= PAGE_SHIFT;
@@ -444,6 +512,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
{
+ /* page_pool_iovs are mapped and their dma-addr can't be modified. */
+ if (page_is_page_pool_iov(page)) {
+ DEBUG_NET_WARN_ON_ONCE(true);
+ return false;
+ }
+
if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) {
page->dma_addr = addr >> PAGE_SHIFT;
@@ -346,7 +346,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
if (unlikely(!page))
break;
- if (likely(page_to_nid(page) == pref_nid)) {
+ if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
pool->alloc.cache[pool->alloc.count++] = page;
} else {
/* NUMA mismatch;
@@ -391,7 +391,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
struct page *page,
unsigned int dma_sync_size)
{
- dma_addr_t dma_addr = page_pool_get_dma_addr(page);
+ dma_addr_t dma_addr;
+
+ /* page_pool_iov memory providers do not support PP_FLAG_DMA_SYNC_DEV */
+ if (page_is_page_pool_iov(page)) {
+ DEBUG_NET_WARN_ON_ONCE(true);
+ return;
+ }
+
+ dma_addr = page_pool_get_dma_addr(page);
dma_sync_size = min(dma_sync_size, pool->p.max_len);
dma_sync_single_range_for_device(pool->p.dev, dma_addr,
@@ -403,6 +411,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
{
dma_addr_t dma;
+ if (page_is_page_pool_iov(page)) {
+ /* page_pool_iovs are already mapped */
+ DEBUG_NET_WARN_ON_ONCE(true);
+ return true;
+ }
+
/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
* since dma_addr_t can be either 32 or 64 bits and does not always fit
* into page private data (i.e 32bit cpu with 64bit DMA caps)
@@ -434,22 +448,33 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
static void page_pool_set_pp_info(struct page_pool *pool,
struct page *page)
{
- page->pp = pool;
- page->pp_magic |= PP_SIGNATURE;
-
- /* Ensuring all pages have been split into one fragment initially:
- * page_pool_set_pp_info() is only called once for every page when it
- * is allocated from the page allocator and page_pool_fragment_page()
- * is dirtying the same cache line as the page->pp_magic above, so
- * the overhead is negligible.
- */
- page_pool_fragment_page(page, 1);
+ if (!page_is_page_pool_iov(page)) {
+ page->pp = pool;
+ page->pp_magic |= PP_SIGNATURE;
+
+ /* Ensuring all pages have been split into one fragment
+ * initially:
+ * page_pool_set_pp_info() is only called once for every page
+ * when it is allocated from the page allocator and
+ * page_pool_fragment_page() is dirtying the same cache line as
+ * the page->pp_magic above, so the overhead is negligible.
+ */
+ page_pool_fragment_page(page, 1);
+ } else {
+ page_to_page_pool_iov(page)->pp = pool;
+ }
+
if (pool->has_init_callback)
pool->slow.init_callback(page, pool->slow.init_arg);
}
static void page_pool_clear_pp_info(struct page *page)
{
+ if (page_is_page_pool_iov(page)) {
+ page_to_page_pool_iov(page)->pp = NULL;
+ return;
+ }
+
page->pp_magic = 0;
page->pp = NULL;
}
@@ -664,7 +689,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
return false;
}
- /* Caller MUST have verified/know (page_ref_count(page) == 1) */
+ /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
pool->alloc.cache[pool->alloc.count++] = page;
recycle_stat_inc(pool, cached);
return true;
@@ -689,9 +714,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
* refcnt == 1 means page_pool owns page, and can recycle it.
*
* page is NOT reusable when allocated when system is under
- * some pressure. (page_is_pfmemalloc)
+ * some pressure. (page_pool_page_is_pfmemalloc)
*/
- if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
+ if (likely(page_pool_page_ref_count(page) == 1 &&
+ !page_pool_page_is_pfmemalloc(page))) {
/* Read barrier done in page_ref_count / READ_ONCE */
if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
@@ -806,7 +832,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
if (likely(page_pool_defrag_page(page, drain_count)))
return NULL;
- if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
+ if (page_pool_page_ref_count(page) == 1 &&
+ !page_pool_page_is_pfmemalloc(page)) {
if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
page_pool_dma_sync_for_device(pool, page, -1);
@@ -840,6 +867,10 @@ struct page *page_pool_alloc_frag(struct page_pool *pool,
if (WARN_ON(size > max_size))
return NULL;
+ /* page_pool_iov's don't currently support fragmentation */
+ if (WARN_ON_ONCE(pool->mp_ops == &dmabuf_devmem_ops))
+ return NULL;
+
size = ALIGN(size, dma_get_cache_alignment());
*offset = pool->frag_offset;
@@ -882,9 +913,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
/* Empty recycle ring */
while ((page = ptr_ring_consume_bh(&pool->ring))) {
/* Verify the refcnt invariant of cached pages */
- if (!(page_ref_count(page) == 1))
+ if (!(page_pool_page_ref_count(page) == 1))
pr_crit("%s() page_pool refcnt %d violation\n",
- __func__, page_ref_count(page));
+ __func__, page_pool_page_ref_count(page));
page_pool_return_page(pool, page);
}
@@ -896,19 +896,23 @@ bool napi_pp_put_page(struct page *page, bool napi_safe)
bool allow_direct = false;
struct page_pool *pp;
- page = compound_head(page);
-
- /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
- * in order to preserve any existing bits, such as bit 0 for the
- * head page of compound page and bit 1 for pfmemalloc page, so
- * mask those bits for freeing side when doing below checking,
- * and page_is_pfmemalloc() is checked in __page_pool_put_page()
- * to avoid recycling the pfmemalloc page.
- */
- if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
- return false;
+ if (!page_is_page_pool_iov(page)) {
+ page = compound_head(page);
+
+ /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+ * in order to preserve any existing bits, such as bit 0 for the
+ * head page of compound page and bit 1 for pfmemalloc page, so
+ * mask those bits for freeing side when doing below checking,
+ * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+ * to avoid recycling the pfmemalloc page.
+ */
+ if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
+ return false;
- pp = page->pp;
+ pp = page->pp;
+ } else {
+ pp = page_to_page_pool_iov(page)->pp;
+ }
/* Allow direct recycle if we have reasons to believe that we are
* in the same context as the consumer would run, so there's