[net-next,v6,08/12] libie: add Rx buffer management (via Page Pool)

Message ID 20231207172010.1441468-9-aleksander.lobakin@intel.com
State New
Series net: intel: start The Great Code Dedup + Page Pool for iavf

Commit Message

Alexander Lobakin Dec. 7, 2023, 5:20 p.m. UTC
  Add a couple of intuitive helpers to hide the Rx buffer implementation
details in the library and avoid duplicating them across drivers. The
settings are optimized for Intel hardware, but nothing here is really
HW-specific.
Use the new page_pool_dev_alloc() to dynamically switch between
split-page and full-page modes depending on MTU, page size, required
headroom, etc. For example, on x86_64 with the default driver settings,
each page is shared between 2 buffers. Turning on XDP (not in this
series) increases the headroom requirement and pushes truesize past the
2048-byte boundary, so each buffer then gets a full page.
The "ceiling" limit is %PAGE_SIZE, as only order-0 pages are used to
avoid compound overhead. For the above architecture, this means maximum
linear frame size of 3712 w/o XDP.
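(Rough arithmetic behind the 3712 figure, assuming the typical x86_64
values of 64 bytes of NET_SKB_PAD + NET_IP_ALIGN headroom and a 320-byte
cacheline-aligned skb_shared_info: 4096 - 64 - 320 = 3712, which is
already a multiple of the 128-byte HW length alignment.)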
Note that &libie_buf_queue is not a complete queue/ring structure for
now, rather a shim, but eventually the libie-enabled drivers will move
to it, with iavf being the first one.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 drivers/net/ethernet/intel/libie/Kconfig |   1 +
 drivers/net/ethernet/intel/libie/rx.c    |  69 ++++++++++++
 include/linux/net/intel/libie/rx.h       | 133 ++++++++++++++++++++++-
 3 files changed, 202 insertions(+), 1 deletion(-)
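
For reference, below is a minimal, untested sketch (not part of the
patch) of how a libie-enabled driver could wire the new helpers up,
based on the signatures added here. drv_setup_rx_bufs() and the
descriptor write are hypothetical, and bq->rx_bi / bq->count are assumed
to have been set up by the driver beforehand:

#include <linux/net/intel/libie/rx.h>

static int drv_setup_rx_bufs(struct libie_buf_queue *bq,
			     struct napi_struct *napi)
{
	u32 i;
	int err;

	/* Creates the page_pool and fills bq->rx_buf_len / bq->truesize */
	err = libie_rx_page_pool_create(bq, napi);
	if (err)
		return err;

	for (i = 0; i < bq->count; i++) {
		dma_addr_t dma = libie_rx_alloc(bq, i);

		if (dma == DMA_MAPPING_ERROR)
			goto err_destroy;

		/* write @dma into the HW Rx descriptor at index @i here */
	}

	return 0;

err_destroy:
	/* give the already-allocated pages back to the pool first, then */
	libie_rx_page_pool_destroy(bq);

	return -ENOMEM;
}

On the completion path, libie_rx_sync_for_cpu() then either syncs the
buffer for the CPU or, for zero-length descriptors, recycles it straight
back to its page_pool.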
  

Comments

Yunsheng Lin Dec. 8, 2023, 9:28 a.m. UTC | #1
On 2023/12/8 1:20, Alexander Lobakin wrote:
...

> +
> +/**
> + * libie_rx_page_pool_create - create a PP with the default libie settings
> + * @bq: buffer queue struct to fill
> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +int libie_rx_page_pool_create(struct libie_buf_queue *bq,
> +			      struct napi_struct *napi)
> +{
> +	struct page_pool_params pp = {
> +		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
> +		.order		= LIBIE_RX_PAGE_ORDER,
> +		.pool_size	= bq->count,
> +		.nid		= NUMA_NO_NODE,

Is there a reason the NUMA_NO_NODE is used here instead of
dev_to_node(napi->dev->dev.parent)?

> +		.dev		= napi->dev->dev.parent,
> +		.netdev		= napi->dev,
> +		.napi		= napi,
> +		.dma_dir	= DMA_FROM_DEVICE,
> +		.offset		= LIBIE_SKB_HEADROOM,
> +	};
> +
> +	/* HW-writeable / syncable length per one page */
> +	pp.max_len = LIBIE_RX_BUF_LEN(pp.offset);
> +
> +	/* HW-writeable length per buffer */
> +	bq->rx_buf_len = libie_rx_hw_len(&pp);
> +	/* Buffer size to allocate */
> +	bq->truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset +
> +							 bq->rx_buf_len));
> +
> +	bq->pp = page_pool_create(&pp);
> +
> +	return PTR_ERR_OR_ZERO(bq->pp);
> +}
> +EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
> +

...

> +/**
> + * libie_rx_sync_for_cpu - synchronize or recycle buffer post DMA
> + * @buf: buffer to process
> + * @len: frame length from the descriptor
> + *
> + * Process the buffer after it's written by HW. The regular path is to
> + * synchronize DMA for CPU, but in case of no data it will be immediately
> + * recycled back to its PP.
> + *
> + * Return: true when there's data to process, false otherwise.
> + */
> +static inline bool libie_rx_sync_for_cpu(const struct libie_rx_buffer *buf,
> +					 u32 len)
> +{
> +	struct page *page = buf->page;
> +
> +	/* Very rare, but possible case. The most common reason:
> +	 * the last fragment contained FCS only, which was then
> +	 * stripped by the HW.
> +	 */
> +	if (unlikely(!len)) {
> +		page_pool_recycle_direct(page->pp, page);
> +		return false;
> +	}
> +
> +	page_pool_dma_sync_for_cpu(page->pp, page, buf->offset, len);

Is there a reason why page_pool_dma_sync_for_cpu() is still used when
page_pool_create() is called with PP_FLAG_DMA_SYNC_DEV flag? Isn't syncing
already handled in page_pool core when PP_FLAG_DMA_SYNC_DEV flag is
set?

> +
> +	return true;
> +}
>  
>  /* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
>   * bitfield struct.
>
  
Yunsheng Lin Dec. 8, 2023, 11:28 a.m. UTC | #2
On 2023/12/8 17:28, Yunsheng Lin wrote:

>> +
>> +	page_pool_dma_sync_for_cpu(page->pp, page, buf->offset, len);
> 
> Is there a reason why page_pool_dma_sync_for_cpu() is still used when
> page_pool_create() is called with PP_FLAG_DMA_SYNC_DEV flag? Isn't syncing
> already handled in page_pool core when PP_FLAG_DMA_SYNC_DEV flag is
> set?

Ah, it is a sync_for_cpu.
Ignore this one.
  
Alexander Lobakin Dec. 11, 2023, 10:16 a.m. UTC | #3
From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Fri, 8 Dec 2023 17:28:21 +0800

> On 2023/12/8 1:20, Alexander Lobakin wrote:
> ...
> 
>> +
>> +/**
>> + * libie_rx_page_pool_create - create a PP with the default libie settings
>> + * @bq: buffer queue struct to fill
>> + * @napi: &napi_struct covering this PP (no usage outside its poll loops)
>> + *
>> + * Return: 0 on success, -errno on failure.
>> + */
>> +int libie_rx_page_pool_create(struct libie_buf_queue *bq,
>> +			      struct napi_struct *napi)
>> +{
>> +	struct page_pool_params pp = {
>> +		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>> +		.order		= LIBIE_RX_PAGE_ORDER,
>> +		.pool_size	= bq->count,
>> +		.nid		= NUMA_NO_NODE,
> 
> Is there a reason the NUMA_NO_NODE is used here instead of
> dev_to_node(napi->dev->dev.parent)?

NUMA_NO_NODE creates a "dynamic" page_pool and makes sure the pages are
local to the CPU where PP allocation functions are called. Setting ::nid
to a "static" value pins the PP to a particular node.
But the main reason is that Rx queues can be distributed across several
nodes and in that case NUMA_NO_NODE will make sure each page_pool is
local to the queue it's running on. dev_to_node() will return the same
value, thus forcing some PPs to allocate remote pages.

Ideally, I'd like to pass a CPU ID this queue will be run on and use
cpu_to_node(), but currently there's no NUMA-aware allocations in the
Intel drivers and Rx queues don't get the corresponding CPU ID when
configuring. I may revisit this later, but for now NUMA_NO_NODE is the
most optimal solution here.
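
For illustration only (nothing in this series passes the queue's CPU
down, so bq->rx_cpu below is hypothetical), the NUMA-aware variant would
amount to replacing the .nid initializer with something like:

	/* hypothetical: the driver would have to pass the queue's CPU in */
	.nid		= cpu_to_node(bq->rx_cpu),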

[...]

Thanks,
Olek
  
Jakub Kicinski Dec. 11, 2023, 7:23 p.m. UTC | #4
On Mon, 11 Dec 2023 11:16:20 +0100 Alexander Lobakin wrote:
> Ideally, I'd like to pass a CPU ID this queue will be run on and use
> cpu_to_node(), but currently there's no NUMA-aware allocations in the
> Intel drivers and Rx queues don't get the corresponding CPU ID when
> configuring. I may revisit this later, but for now NUMA_NO_NODE is the
> most optimal solution here.

Hm, I've been wondering about persistent IRQ mappings. Drivers
resetting IRQ mapping on reconfiguration is a major PITA in production
clusters. You change the RSS hash and some NICs suddenly forget
affinitization 🤯️

The connection with memory allocations changes the math on that a bit.

The question is really whether we add CPU <> NAPI config as a netdev
Netlink API or build around the generic IRQ affinity API. The latter 
is definitely better from "don't duplicate uAPI" perspective.
But we need to reset the queues and reallocate their state when 
the mapping is changed. And shutting down queues on 

  echo $cpu > /../smp_affinity_list

seems moderately insane. Perhaps some middle-ground exists.

Anyway, if you do find cycles to tackle this - pls try to do it
generically not just for Intel? :)
  
Alexander Lobakin Dec. 13, 2023, 11:23 a.m. UTC | #5
From: Jakub Kicinski <kuba@kernel.org>
Date: Mon, 11 Dec 2023 11:23:32 -0800

> On Mon, 11 Dec 2023 11:16:20 +0100 Alexander Lobakin wrote:
>> Ideally, I'd like to pass a CPU ID this queue will be run on and use
>> cpu_to_node(), but currently there's no NUMA-aware allocations in the
>> Intel drivers and Rx queues don't get the corresponding CPU ID when
>> configuring. I may revisit this later, but for now NUMA_NO_NODE is the
>> most optimal solution here.
> 
> Hm, I've been wondering about persistent IRQ mappings. Drivers
> resetting IRQ mapping on reconfiguration is a major PITA in production
> clusters. You change the RSS hash and some NICs suddenly forget
> affinitization 🤯️
> 
> The connection with memory allocations changes the math on that a bit.
> 
> The question is really whether we add CPU <> NAPI config as a netdev
> Netlink API or build around the generic IRQ affinity API. The latter 
> is definitely better from "don't duplicate uAPI" perspective.
> But we need to reset the queues and reallocate their state when 
> the mapping is changed. And shutting down queues on 
> 
>   echo $cpu > /../smp_affinity_list
> 
> seems moderately insane. Perhaps some middle-ground exists.
> 
> Anyway, if you do find cycles to tackle this - pls try to do it
> generically not just for Intel? :)

Sounds good, adding to my fathomless backlog :>

Thanks,
Olek
  

Patch

diff --git a/drivers/net/ethernet/intel/libie/Kconfig b/drivers/net/ethernet/intel/libie/Kconfig
index 1eda4a5faa5a..6e0162fb94d2 100644
--- a/drivers/net/ethernet/intel/libie/Kconfig
+++ b/drivers/net/ethernet/intel/libie/Kconfig
@@ -3,6 +3,7 @@ 
 
 config LIBIE
 	tristate
+	select PAGE_POOL
 	help
 	  libie (Intel Ethernet library) is a common library containing
 	  routines shared between several Intel Ethernet drivers.
diff --git a/drivers/net/ethernet/intel/libie/rx.c b/drivers/net/ethernet/intel/libie/rx.c
index f503476d8eef..359867714a1b 100644
--- a/drivers/net/ethernet/intel/libie/rx.c
+++ b/drivers/net/ethernet/intel/libie/rx.c
@@ -3,6 +3,75 @@ 
 
 #include <linux/net/intel/libie/rx.h>
 
+/* Rx buffer management */
+
+/**
+ * libie_rx_hw_len - get the actual buffer size to be passed to HW
+ * @pp: &page_pool_params of the netdev to calculate the size for
+ *
+ * Return: HW-writeable length per one buffer to pass it to the HW accounting:
+ * MTU the @dev has, HW required alignment, minimum and maximum allowed values,
+ * and system's page size.
+ */
+static u32 libie_rx_hw_len(const struct page_pool_params *pp)
+{
+	u32 len;
+
+	len = READ_ONCE(pp->netdev->mtu) + LIBIE_RX_LL_LEN;
+	len = ALIGN(len, LIBIE_RX_BUF_LEN_ALIGN);
+	len = clamp(len, LIBIE_MIN_RX_BUF_LEN, pp->max_len);
+
+	return len;
+}
+
+/**
+ * libie_rx_page_pool_create - create a PP with the default libie settings
+ * @bq: buffer queue struct to fill
+ * @napi: &napi_struct covering this PP (no usage outside its poll loops)
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int libie_rx_page_pool_create(struct libie_buf_queue *bq,
+			      struct napi_struct *napi)
+{
+	struct page_pool_params pp = {
+		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
+		.order		= LIBIE_RX_PAGE_ORDER,
+		.pool_size	= bq->count,
+		.nid		= NUMA_NO_NODE,
+		.dev		= napi->dev->dev.parent,
+		.netdev		= napi->dev,
+		.napi		= napi,
+		.dma_dir	= DMA_FROM_DEVICE,
+		.offset		= LIBIE_SKB_HEADROOM,
+	};
+
+	/* HW-writeable / syncable length per one page */
+	pp.max_len = LIBIE_RX_BUF_LEN(pp.offset);
+
+	/* HW-writeable length per buffer */
+	bq->rx_buf_len = libie_rx_hw_len(&pp);
+	/* Buffer size to allocate */
+	bq->truesize = roundup_pow_of_two(SKB_HEAD_ALIGN(pp.offset +
+							 bq->rx_buf_len));
+
+	bq->pp = page_pool_create(&pp);
+
+	return PTR_ERR_OR_ZERO(bq->pp);
+}
+EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_create, LIBIE);
+
+/**
+ * libie_rx_page_pool_destroy - destroy a &page_pool created by libie
+ * @bq: buffer queue to process
+ */
+void libie_rx_page_pool_destroy(struct libie_buf_queue *bq)
+{
+	page_pool_destroy(bq->pp);
+	bq->pp = NULL;
+}
+EXPORT_SYMBOL_NS_GPL(libie_rx_page_pool_destroy, LIBIE);
+
 /* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
  * bitfield struct.
  */
diff --git a/include/linux/net/intel/libie/rx.h b/include/linux/net/intel/libie/rx.h
index 55263930aa99..71bc9a1a9856 100644
--- a/include/linux/net/intel/libie/rx.h
+++ b/include/linux/net/intel/libie/rx.h
@@ -4,7 +4,138 @@ 
 #ifndef __LIBIE_RX_H
 #define __LIBIE_RX_H
 
-#include <linux/netdevice.h>
+#include <linux/if_vlan.h>
+#include <net/page_pool/helpers.h>
+
+/* Rx MTU/buffer/truesize helpers. Mostly pure software-side; HW-defined values
+ * are valid for all Intel HW.
+ */
+
+/* Space reserved in front of each frame */
+#define LIBIE_SKB_HEADROOM	(NET_SKB_PAD + NET_IP_ALIGN)
+/* Maximum headroom to calculate max MTU below */
+#define LIBIE_MAX_HEADROOM	LIBIE_SKB_HEADROOM
+/* Link layer / L2 overhead: Ethernet, 2 VLAN tags (C + S), FCS */
+#define LIBIE_RX_LL_LEN		(ETH_HLEN + 2 * VLAN_HLEN + ETH_FCS_LEN)
+
+/* Always use order-0 pages */
+#define LIBIE_RX_PAGE_ORDER	0
+/* Rx buffer size config is a multiple of 128 */
+#define LIBIE_RX_BUF_LEN_ALIGN	128
+/* HW-writeable space in one buffer: truesize - headroom/tailroom,
+ * HW-aligned
+ */
+#define __LIBIE_RX_BUF_LEN(hr)						\
+	ALIGN_DOWN(SKB_MAX_ORDER(hr, LIBIE_RX_PAGE_ORDER),		\
+		   LIBIE_RX_BUF_LEN_ALIGN)
+/* The smallest and largest size for a single descriptor as per HW */
+#define LIBIE_MIN_RX_BUF_LEN	1024U
+#define LIBIE_MAX_RX_BUF_LEN	9728U
+/* "True" HW-writeable space: minimum from SW and HW values */
+#define LIBIE_RX_BUF_LEN(hr)	min_t(u32, __LIBIE_RX_BUF_LEN(hr),	\
+				      LIBIE_MAX_RX_BUF_LEN)
+
+/* The maximum frame size as per HW (S/G) */
+#define __LIBIE_MAX_RX_FRM_LEN	16382U
+/* ATST, HW can chain up to 5 Rx descriptors */
+#define LIBIE_MAX_RX_FRM_LEN(hr)					\
+	min_t(u32, __LIBIE_MAX_RX_FRM_LEN, LIBIE_RX_BUF_LEN(hr) * 5)
+/* Maximum frame size minus LL overhead */
+#define LIBIE_MAX_MTU							\
+	(LIBIE_MAX_RX_FRM_LEN(LIBIE_MAX_HEADROOM) - LIBIE_RX_LL_LEN)
+
+/* Rx buffer management */
+
+/**
+ * struct libie_rx_buffer - structure representing an Rx buffer
+ * @page: page holding the buffer
+ * @offset: offset from the page start (to the headroom)
+ * @truesize: total space occupied by the buffer (w/ headroom and tailroom)
+ *
+ * Depending on the MTU, API switches between one-page-per-frame and shared
+ * page model (to conserve memory on bigger-page platforms). In case of the
+ * former, @offset is always 0 and @truesize is always ```PAGE_SIZE```.
+ */
+struct libie_rx_buffer {
+	struct page		*page;
+	u32			offset;
+	u32			truesize;
+};
+
+/**
+ * struct libie_buf_queue - structure representing a buffer queue
+ * @pp: &page_pool for buffer management
+ * @rx_bi: array of Rx buffers
+ * @truesize: size to allocate per buffer, w/overhead
+ * @count: number of descriptors/buffers the queue has
+ * @rx_buf_len: HW-writeable length per each buffer
+ */
+struct libie_buf_queue {
+	struct page_pool	*pp;
+	struct libie_rx_buffer	*rx_bi;
+
+	u32			truesize;
+	u32			count;
+
+	/* Cold fields */
+	u32			rx_buf_len;
+};
+
+int libie_rx_page_pool_create(struct libie_buf_queue *bq,
+			      struct napi_struct *napi);
+void libie_rx_page_pool_destroy(struct libie_buf_queue *bq);
+
+/**
+ * libie_rx_alloc - allocate a new Rx buffer
+ * @bq: buffer queue to allocate for
+ * @i: index of the buffer within the queue
+ *
+ * Return: DMA address to be passed to HW for Rx on successful allocation,
+ * ```DMA_MAPPING_ERROR``` otherwise.
+ */
+static inline dma_addr_t libie_rx_alloc(const struct libie_buf_queue *bq,
+					u32 i)
+{
+	struct libie_rx_buffer *buf = &bq->rx_bi[i];
+
+	buf->truesize = bq->truesize;
+	buf->page = page_pool_dev_alloc(bq->pp, &buf->offset, &buf->truesize);
+	if (unlikely(!buf->page))
+		return DMA_MAPPING_ERROR;
+
+	return page_pool_get_dma_addr(buf->page) + buf->offset +
+	       bq->pp->p.offset;
+}
+
+/**
+ * libie_rx_sync_for_cpu - synchronize or recycle buffer post DMA
+ * @buf: buffer to process
+ * @len: frame length from the descriptor
+ *
+ * Process the buffer after it's written by HW. The regular path is to
+ * synchronize DMA for CPU, but in case of no data it will be immediately
+ * recycled back to its PP.
+ *
+ * Return: true when there's data to process, false otherwise.
+ */
+static inline bool libie_rx_sync_for_cpu(const struct libie_rx_buffer *buf,
+					 u32 len)
+{
+	struct page *page = buf->page;
+
+	/* Very rare, but possible case. The most common reason:
+	 * the last fragment contained FCS only, which was then
+	 * stripped by the HW.
+	 */
+	if (unlikely(!len)) {
+		page_pool_recycle_direct(page->pp, page);
+		return false;
+	}
+
+	page_pool_dma_sync_for_cpu(page->pp, page, buf->offset, len);
+
+	return true;
+}
 
 /* O(1) converting i40e/ice/iavf's 8/10-bit hardware packet type to a parsed
  * bitfield struct.