[net-next,v6,08/12] libie: add Rx buffer management (via Page Pool)
Commit Message
Add a couple of intuitive helpers to hide Rx buffer implementation
details in the library and avoid duplicating them between drivers. The
settings are optimized for Intel hardware, but nothing here is really
HW-specific.
Use the new page_pool_dev_alloc() to dynamically switch between
split-page and full-page modes depending on the MTU, page size, required
headroom, etc. For example, on x86_64 with the default driver settings,
each page is shared between 2 buffers. Turning on XDP (not in this
series) increases the headroom requirement and pushes truesize past the
2048-byte boundary, so each buffer then gets a full page.
The "ceiling" limit is %PAGE_SIZE, as only order-0 pages are used to
avoid compound overhead. For the above architecture, this means a
maximum linear frame size of 3712 w/o XDP.
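As a quick illustration (not part of the patch), the arithmetic behind
those numbers, assuming x86_64 defaults: 4096-byte pages, NET_SKB_PAD =
64, NET_IP_ALIGN = 0 and a 320-byte
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)):

	LIBIE_SKB_HEADROOM    = NET_SKB_PAD + NET_IP_ALIGN = 64 + 0       = 64
	LIBIE_RX_BUF_LEN(64)  = ALIGN_DOWN(4096 - 64 - 320, 128)          = 3712
	rx_buf_len @ MTU 1500 = clamp(ALIGN(1500 + 26, 128), 1024, 3712)  = 1536
	truesize              = roundup_pow_of_two(SKB_HEAD_ALIGN(64 + 1536))
	                      = roundup_pow_of_two(1920) = 2048 -> 2 buffers per page
	same with 256 bytes of XDP headroom:
	                      = roundup_pow_of_two(2112) = 4096 -> 1 buffer per page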
Note that &libie_buf_queue is not a complete queue/ring structure for
now, but rather a shim; eventually the libie-enabled drivers will move
to it, with iavf being the first one.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
drivers/net/ethernet/intel/libie/Kconfig | 1 +
drivers/net/ethernet/intel/libie/rx.c | 69 ++++++++++++
include/linux/net/intel/libie/rx.h | 133 ++++++++++++++++++++++-
3 files changed, 202 insertions(+), 1 deletion(-)
Comments
On 2023/12/8 1:20, Alexander Lobakin wrote:
...
> +
> +/**
> + * libie_rx_page_pool_create - create a PP with the default libie settings
> + * @bq: buffer queue struct to fill
> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +int libie_rx_page_pool_create(struct libie_buf_queue *bq,
> + struct napi_struct *napi)
> +{
> + struct page_pool_params pp = {
> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
> + .order = LIBIE_RX_PAGE_ORDER,
> + .pool_size = bq->count,
> + .nid = NUMA_NO_NODE,
Is there a reason NUMA_NO_NODE is used here instead of
dev_to_node(napi->dev->dev.parent)?
> + .dev = napi->dev->dev.parent,
> + .netdev = napi->dev,
> + .napi = napi,
> + .dma_dir = DMA_FROM_DEVICE,
> + .offset = LIBIE_SKB_HEADROOM,
> + };
> +
> + /* HW-writeable / syncable length per one page */
> + pp.max_len = LIBIE_RX_BUF_LEN(pp.offset);
> +
> + /* HW-writeable length per buffer */
> + bq->rx_buf_len = libie_rx_hw_len(&pp);
> + /* Buffer size to allocate */
> + bq->truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset +
> + bq->rx_buf_len));
> +
> + bq->pp = page_pool_create(&pp);
> +
> + return PTR_ERR_OR_ZERO(bq->pp);
> +}
> +EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
> +
...
> +/**
> + * libie_rx_sync_for_cpu - synchronize or recycle buffer post DMA
> + * @buf: buffer to process
> + * @len: frame length from the descriptor
> + *
> + * Process the buffer after it's written by HW. The regular path is to
> + * synchronize DMA for CPU, but in case of no data it will be immediately
> + * recycled back to its PP.
> + *
> + * Return: true when there's data to process, false otherwise.
> + */
> +static inline bool libie_rx_sync_for_cpu(const struct libie_rx_buffer *buf,
> + u32 len)
> +{
> + struct page *page = buf->page;
> +
> + /* Very rare, but possible case. The most common reason:
> + * the last fragment contained FCS only, which was then
> + * stripped by the HW.
> + */
> + if (unlikely(!len)) {
> + page_pool_recycle_direct(page->pp, page);
> + return false;
> + }
> +
> + page_pool_dma_sync_for_cpu(page->pp, page, buf->offset, len);
Is there a reason why page_pool_dma_sync_for_cpu() is still used when
page_pool_create() is called with the PP_FLAG_DMA_SYNC_DEV flag? Isn't
syncing already handled in the page_pool core when the
PP_FLAG_DMA_SYNC_DEV flag is set?
> +
> + return true;
> +}
>
> /* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
> * bitfield struct.
>
On 2023/12/8 17:28, Yunsheng Lin wrote:
>> +
>> + page_pool_dma_sync_for_cpu(page->pp, page, buf->offset, len);
>
> Is there a reason why page_pool_dma_sync_for_cpu() is still used when
> page_pool_create() is called with the PP_FLAG_DMA_SYNC_DEV flag? Isn't
> syncing already handled in the page_pool core when the
> PP_FLAG_DMA_SYNC_DEV flag is set?
Ah, it is a sync_for_cpu.
Ignore this one.
From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Fri, 8 Dec 2023 17:28:21 +0800
> On 2023/12/8 1:20, Alexander Lobakin wrote:
> ...
>
>> +
>> +/**
>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>> + * @bq: buffer queue struct to fill
>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>> + *
>> + * Return: 0 on success, -errno on failure.
>> + */
>> +int libie_rx_page_pool_create(struct libie_buf_queue *bq,
>> + struct napi_struct *napi)
>> +{
>> + struct page_pool_params pp = {
>> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>> + .order = LIBIE_RX_PAGE_ORDER,
>> + .pool_size = bq->count,
>> + .nid = NUMA_NO_NODE,
>
> Is there a reason NUMA_NO_NODE is used here instead of
> dev_to_node(napi->dev->dev.parent)?
NUMA_NO_NODE creates a "dynamic" page_pool and makes sure the pages are
local to the CPU where PP allocation functions are called. Setting ::nid
to a "static" value pins the PP to a particular node.
But the main reason is that Rx queues can be distributed across several
nodes, and in that case NUMA_NO_NODE makes sure each page_pool is local
to the queue it's running on. dev_to_node() would return the same node
for all of them, forcing some PPs to allocate remote pages.
Ideally, I'd like to pass the ID of the CPU this queue will run on and
use cpu_to_node(), but currently there are no NUMA-aware allocations in
the Intel drivers and Rx queues don't get the corresponding CPU ID when
being configured. I may revisit this later, but for now NUMA_NO_NODE is
the best solution here.
[...]
Thanks,
Olek
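For reference, a rough sketch of the cpu_to_node() variant mentioned
above. It is hypothetical and not part of this series: it assumes the
driver can pass the CPU its queue is pinned to, and it would live next
to libie_rx_page_pool_create() in rx.c:

	/* Hypothetical, not in this series: same defaults as
	 * libie_rx_page_pool_create(), but with the pool pinned to the
	 * node of the CPU expected to service the queue.
	 */
	int libie_rx_page_pool_create_on(struct libie_buf_queue *bq,
					 struct napi_struct *napi, int cpu)
	{
		struct page_pool_params pp = {
			.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
			.order		= LIBIE_RX_PAGE_ORDER,
			.pool_size	= bq->count,
			.nid		= cpu_to_node(cpu), /* was NUMA_NO_NODE */
			.dev		= napi->dev->dev.parent,
			.netdev		= napi->dev,
			.napi		= napi,
			.dma_dir	= DMA_FROM_DEVICE,
			.offset		= LIBIE_SKB_HEADROOM,
		};

		/* HW-writeable / syncable length per one page */
		pp.max_len = LIBIE_RX_BUF_LEN(pp.offset);

		/* HW-writeable length and truesize per buffer, as in the patch */
		bq->rx_buf_len = libie_rx_hw_len(&pp);
		bq->truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset +
								 bq->rx_buf_len));

		bq->pp = page_pool_create(&pp);

		return PTR_ERR_OR_ZERO(bq->pp);
	}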
On Mon, 11 Dec 2023 11:16:20 +0100 Alexander Lobakin wrote:
> Ideally, I'd like to pass the ID of the CPU this queue will run on and
> use cpu_to_node(), but currently there are no NUMA-aware allocations in
> the Intel drivers and Rx queues don't get the corresponding CPU ID when
> being configured. I may revisit this later, but for now NUMA_NO_NODE is
> the best solution here.
Hm, I've been wondering about persistent IRQ mappings. Drivers
resetting IRQ mapping on reconfiguration is a major PITA in production
clusters. You change the RSS hash and some NICs suddenly forget
affinitization 🤯️
The connection with memory allocations changes the math on that a bit.
The question is really whether we add CPU <> NAPI config as a netdev
Netlink API or build around the generic IRQ affinity API. The latter
is definitely better from "don't duplicate uAPI" perspective.
But we need to reset the queues and reallocate their state when
the mapping is changed. And shutting down queues on
echo $cpu > /../smp_affinity_list
seems moderately insane. Perhaps some middle-ground exists.
Anyway, if you do find cycles to tackle this - pls try to do it
generically not just for Intel? :)
From: Jakub Kicinski <kuba@kernel.org>
Date: Mon, 11 Dec 2023 11:23:32 -0800
> On Mon, 11 Dec 2023 11:16:20 +0100 Alexander Lobakin wrote:
>> Ideally, I'd like to pass the ID of the CPU this queue will run on and
>> use cpu_to_node(), but currently there are no NUMA-aware allocations in
>> the Intel drivers and Rx queues don't get the corresponding CPU ID when
>> being configured. I may revisit this later, but for now NUMA_NO_NODE is
>> the best solution here.
>
> Hm, I've been wondering about persistent IRQ mappings. Drivers
> resetting IRQ mapping on reconfiguration is a major PITA in production
> clusters. You change the RSS hash and some NICs suddenly forget
> affinitization 🤯️
>
> The connection with memory allocations changes the math on that a bit.
>
> The question is really whether we add CPU <> NAPI config as a netdev
> Netlink API or build around the generic IRQ affinity API. The latter
> is definitely better from "don't duplicate uAPI" perspective.
> But we need to reset the queues and reallocate their state when
> the mapping is changed. And shutting down queues on
>
> echo $cpu > /../smp_affinity_list
>
> seems moderately insane. Perhaps some middle-ground exists.
>
> Anyway, if you do find cycles to tackle this - pls try to do it
> generically not just for Intel? :)
Sounds good, adding to my fathomless backlog :>
Thanks,
Olek
@@ -3,6 +3,7 @@
config LIBIE
tristate
+ select PAGE_POOL
help
libie (Intel Ethernet library) is a common library containing
routines shared between several Intel Ethernet drivers.
@@ -3,6 +3,75 @@
#include <linux/net/intel/libie/rx.h>
+/* Rx buffer management */
+
+/**
+ * libie_rx_hw_len - get the actual buffer size to be passed to HW
+ * @pp: &page_pool_params of the netdev to calculate the size for
+ *
+ * Return: HW-writeable length per buffer to pass to the HW, accounting for
+ * the MTU of @pp->netdev, the HW-required alignment, the minimum and maximum
+ * allowed values, and the system's page size.
+ */
+static u32 libie_rx_hw_len(const struct page_pool_params *pp)
+{
+ u32 len;
+
+ len = READ_ONCE(pp->netdev->mtu) + LIBIE_RX_LL_LEN;
+ len = ALIGN(len, LIBIE_RX_BUF_LEN_ALIGN);
+ len = clamp(len, LIBIE_MIN_RX_BUF_LEN, pp->max_len);
+
+ return len;
+}
+
+/**
+ * libie_rx_page_pool_create - create a PP with the default libie settings
+ * @bq: buffer queue struct to fill
+ * @napi: &napi_struct covering this PP (no usage outside its poll loops)
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int libie_rx_page_pool_create(struct libie_buf_queue *bq,
+ struct napi_struct *napi)
+{
+ struct page_pool_params pp = {
+ .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
+ .order = LIBIE_RX_PAGE_ORDER,
+ .pool_size = bq->count,
+ .nid = NUMA_NO_NODE,
+ .dev = napi->dev->dev.parent,
+ .netdev = napi->dev,
+ .napi = napi,
+ .dma_dir = DMA_FROM_DEVICE,
+ .offset = LIBIE_SKB_HEADROOM,
+ };
+
+ /* HW-writeable / syncable length per one page */
+ pp.max_len = LIBIE_RX_BUF_LEN(pp.offset);
+
+ /* HW-writeable length per buffer */
+ bq->rx_buf_len = libie_rx_hw_len(&pp);
+ /* Buffer size to allocate */
+ bq->truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset +
+ bq->rx_buf_len));
+
+ bq->pp = page_pool_create(&pp);
+
+ return PTR_ERR_OR_ZERO(bq->pp);
+}
+EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
+
+/**
+ * libie_rx_page_pool_destroy - destroy a &page_pool created by libie
+ * @bq: buffer queue to process
+ */
+void libie_rx_page_pool_destroy(struct libie_buf_queue *bq)
+{
+ page_pool_destroy(bq->pp);
+ bq->pp = NULL;
+}
+EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_destroy, LIBIE);
+
/* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
* bitfield struct.
*/
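As a usage illustration, here is a setup/teardown sketch showing how a
driver could consume the two exports above. The my_rx_queue structure,
its fields and the surrounding error handling are made up for the
example and are not part of this series:

	#include <linux/net/intel/libie/rx.h>
	#include <linux/slab.h>

	/* Hypothetical driver glue, for illustration only */
	struct my_rx_queue {
		struct libie_buf_queue	bq;
		struct napi_struct	*napi;
		u32			desc_count;
	};

	static int my_rxq_init(struct my_rx_queue *q)
	{
		int err;

		q->bq.count = q->desc_count;

		/* rx_buf_len and truesize are filled in by libie */
		err = libie_rx_page_pool_create(&q->bq, q->napi);
		if (err)
			return err;

		q->bq.rx_bi = kcalloc(q->bq.count, sizeof(*q->bq.rx_bi),
				      GFP_KERNEL);
		if (!q->bq.rx_bi) {
			libie_rx_page_pool_destroy(&q->bq);
			return -ENOMEM;
		}

		return 0;
	}

	/* Assumes all Rx buffers were already returned to the pool */
	static void my_rxq_deinit(struct my_rx_queue *q)
	{
		kfree(q->bq.rx_bi);
		q->bq.rx_bi = NULL;
		libie_rx_page_pool_destroy(&q->bq);
	}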
@@ -4,7 +4,138 @@
#ifndef __LIBIE_RX_H
#define __LIBIE_RX_H
-#include <linux/netdevice.h>
+#include <linux/if_vlan.h>
+#include <net/page_pool/helpers.h>
+
+/* Rx MTU/buffer/truesize helpers. Mostly pure software-side; HW-defined values
+ * are valid for all Intel HW.
+ */
+
+/* Space reserved in front of each frame */
+#define LIBIE_SKB_HEADROOM (NET_SKB_PAD + NET_IP_ALIGN)
+/* Maximum headroom to calculate max MTU below */
+#define LIBIE_MAX_HEADROOM LIBIE_SKB_HEADROOM
+/* Link layer / L2 overhead: Ethernet, 2 VLAN tags (C + S), FCS */
+#define LIBIE_RX_LL_LEN (ETH_HLEN + 2 * VLAN_HLEN + ETH_FCS_LEN)
+
+/* Always use order-0 pages */
+#define LIBIE_RX_PAGE_ORDER 0
+/* Rx buffer size config is a multiple of 128 */
+#define LIBIE_RX_BUF_LEN_ALIGN 128
+/* HW-writeable space in one buffer: truesize - headroom/tailroom,
+ * HW-aligned
+ */
+#define __LIBIE_RX_BUF_LEN(hr) \
+ ALIGN_DOWN(SKB_MAX_ORDER(hr, LIBIE_RX_PAGE_ORDER), \
+ LIBIE_RX_BUF_LEN_ALIGN)
+/* The smallest and largest size for a single descriptor as per HW */
+#define LIBIE_MIN_RX_BUF_LEN 1024U
+#define LIBIE_MAX_RX_BUF_LEN 9728U
+/* "True" HW-writeable space: minimum from SW and HW values */
+#define LIBIE_RX_BUF_LEN(hr) min_t(u32, __LIBIE_RX_BUF_LEN(hr), \
+ LIBIE_MAX_RX_BUF_LEN)
+
+/* The maximum frame size as per HW (S/G) */
+#define __LIBIE_MAX_RX_FRM_LEN 16382U
+/* ATST, HW can chain up to 5 Rx descriptors */
+#define LIBIE_MAX_RX_FRM_LEN(hr) \
+ min_t(u32, __LIBIE_MAX_RX_FRM_LEN, LIBIE_RX_BUF_LEN(hr) * 5)
+/* Maximum frame size minus LL overhead */
+#define LIBIE_MAX_MTU \
+ (LIBIE_MAX_RX_FRM_LEN(LIBIE_MAX_HEADROOM) - LIBIE_RX_LL_LEN)
+
+/* Rx buffer management */
+
+/**
+ * struct libie_rx_buffer - structure representing an Rx buffer
+ * @page: page holding the buffer
+ * @offset: offset from the page start (to the headroom)
+ * @truesize: total space occupied by the buffer (w/ headroom and tailroom)
+ *
+ * Depending on the MTU, API switches between one-page-per-frame and shared
+ * page model (to conserve memory on bigger-page platforms). In case of the
+ * former, @offset is always 0 and @truesize is always ``PAGE_SIZE``.
+ */
+struct libie_rx_buffer {
+ struct page *page;
+ u32 offset;
+ u32 truesize;
+};
+
+/**
+ * struct libie_buf_queue - structure representing a buffer queue
+ * @pp: &page_pool for buffer management
+ * @rx_bi: array of Rx buffers
+ * @truesize: size to allocate per buffer, w/overhead
+ * @count: number of descriptors/buffers the queue has
+ * @rx_buf_len: HW-writeable length per each buffer
+ */
+struct libie_buf_queue {
+ struct page_pool *pp;
+ struct libie_rx_buffer *rx_bi;
+
+ u32 truesize;
+ u32 count;
+
+ /* Cold fields */
+ u32 rx_buf_len;
+};
+
+int libie_rx_page_pool_create(struct libie_buf_queue *bq,
+ struct napi_struct *napi);
+void libie_rx_page_pool_destroy(struct libie_buf_queue *bq);
+
+/**
+ * libie_rx_alloc - allocate a new Rx buffer
+ * @bq: buffer queue to allocate for
+ * @i: index of the buffer within the queue
+ *
+ * Return: DMA address to be passed to HW for Rx on successful allocation,
+ * ``DMA_MAPPING_ERROR`` otherwise.
+ */
+static inline dma_addr_t libie_rx_alloc(const struct libie_buf_queue *bq,
+ u32 i)
+{
+ struct libie_rx_buffer *buf = &bq->rx_bi[i];
+
+ buf->truesize = bq->truesize;
+ buf->page = page_pool_dev_alloc(bq->pp, &buf->offset, &buf->truesize);
+ if (unlikely(!buf->page))
+ return DMA_MAPPING_ERROR;
+
+ return page_pool_get_dma_addr(buf->page) + buf->offset +
+ bq->pp->p.offset;
+}
+
+/**
+ * libie_rx_sync_for_cpu - synchronize or recycle buffer post DMA
+ * @buf: buffer to process
+ * @len: frame length from the descriptor
+ *
+ * Process the buffer after it's written by HW. The regular path is to
+ * synchronize DMA for CPU, but in case of no data it will be immediately
+ * recycled back to its PP.
+ *
+ * Return: true when there's data to process, false otherwise.
+ */
+static inline bool libie_rx_sync_for_cpu(const struct libie_rx_buffer *buf,
+ u32 len)
+{
+ struct page *page = buf->page;
+
+ /* Very rare, but possible case. The most common reason:
+ * the last fragment contained FCS only, which was then
+ * stripped by the HW.
+ */
+ if (unlikely(!len)) {
+ page_pool_recycle_direct(page->pp, page);
+ return false;
+ }
+
+ page_pool_dma_sync_for_cpu(page->pp, page, buf->offset, len);
+
+ return true;
+}
/* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
* bitfield struct.
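Finally, a hedged sketch of how the hot-path helpers above could be used
from a driver's refill and Rx clean routines. The function names, the
flat 0..n indexing and the skb construction details are illustrative
only (real drivers track next_to_use/next_to_clean and their own
descriptor formats); only the libie_* calls come from this patch:

	#include <linux/net/intel/libie/rx.h>
	#include <linux/skbuff.h>

	/* Post @to_refill buffers; writing the DMA address into the actual
	 * Rx descriptor is left out.
	 */
	static bool my_rxq_refill(struct libie_buf_queue *bq, u32 to_refill)
	{
		u32 i;

		for (i = 0; i < to_refill; i++) {
			dma_addr_t dma;

			dma = libie_rx_alloc(bq, i);
			if (dma == DMA_MAPPING_ERROR)
				return false;

			/* write @dma to the i-th Rx descriptor here */
		}

		return true;
	}

	/* Build an skb around the i-th buffer once HW wrote @pkt_len bytes */
	static struct sk_buff *my_rx_build_skb(struct libie_buf_queue *bq,
					       u32 i, u32 pkt_len)
	{
		struct libie_rx_buffer *buf = &bq->rx_bi[i];
		struct sk_buff *skb;
		void *va;

		/* Sync for CPU; a zero-length buffer gets recycled here */
		if (!libie_rx_sync_for_cpu(buf, pkt_len))
			return NULL;

		va = page_address(buf->page) + buf->offset;
		skb = napi_build_skb(va, buf->truesize);
		if (unlikely(!skb)) {
			page_pool_put_full_page(buf->page->pp, buf->page,
						true);
			return NULL;
		}

		skb_mark_for_recycle(skb);
		skb_reserve(skb, bq->pp->p.offset);
		skb_put(skb, pkt_len);

		return skb;
	}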