net: ethernet: ti: am65-cpsw: Add minimal XDP support

Message ID 20240223-am65-cpsw-xdp-basic-v1-1-9f0b6cbda310@baylibre.com
State New
Headers
Series net: ethernet: ti: am65-cpsw: Add minimal XDP support |

Commit Message

Julien Panis Feb. 23, 2024, 11:01 a.m. UTC
  This patch adds XDP (eXpress Data Path) support to TI AM65 CPSW
Ethernet driver. The following features are implemented:
- NETDEV_XDP_ACT_BASIC (XDP_PASS, XDP_TX, XDP_DROP, XDP_ABORTED)
- NETDEV_XDP_ACT_REDIRECT (XDP_REDIRECT)
- NETDEV_XDP_ACT_NDO_XMIT (ndo_xdp_xmit callback)

Signed-off-by: Julien Panis <jpanis@baylibre.com>
---
This patch adds XDP support to TI AM65 CPSW Ethernet driver.

The following features are implemented: NETDEV_XDP_ACT_BASIC,
NETDEV_XDP_ACT_REDIRECT, and NETDEV_XDP_ACT_NDO_XMIT.

Zero-copy and non-linear XDP buffer supports are NOT implemented.
---
 drivers/net/ethernet/ti/am65-cpsw-nuss.c | 366 +++++++++++++++++++++++++++++--
 drivers/net/ethernet/ti/am65-cpsw-nuss.h |   9 +
 2 files changed, 351 insertions(+), 24 deletions(-)


---
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
change-id: 20240223-am65-cpsw-xdp-basic-4db828508b48

Best regards,
  

Comments

Simon Horman Feb. 26, 2024, 5:25 p.m. UTC | #1
On Fri, Feb 23, 2024 at 12:01:37PM +0100, Julien Panis wrote:
> This patch adds XDP (eXpress Data Path) support to TI AM65 CPSW
> Ethernet driver. The following features are implemented:
> - NETDEV_XDP_ACT_BASIC (XDP_PASS, XDP_TX, XDP_DROP, XDP_ABORTED)
> - NETDEV_XDP_ACT_REDIRECT (XDP_REDIRECT)
> - NETDEV_XDP_ACT_NDO_XMIT (ndo_xdp_xmit callback)
> 
> Signed-off-by: Julien Panis <jpanis@baylibre.com>

..

> @@ -440,6 +476,27 @@ static void am65_cpsw_nuss_tx_cleanup(void *data, dma_addr_t desc_dma)
>  	dev_kfree_skb_any(skb);
>  }
>  
> +static struct sk_buff *am65_cpsw_alloc_skb(struct net_device *ndev, unsigned int len)
> +{
> +	struct page *page;
> +	struct sk_buff *skb;

nit: please arrange local variables in reverse xmas tree order,
     from longest line to shortest in new code.

     This tool can be useful: https://github.com/ecree-solarflare/xmastree

> +
> +	page = dev_alloc_pages(0);

nit: Maybe dev_alloc_page() is appropriate here?

> +	if (unlikely(!page))
> +		return NULL;
> +
> +	len += AM65_CPSW_HEADROOM;
> +
> +	skb = build_skb(page_address(page), len);
> +	if (unlikely(!skb))

Does page need to be freed here?

> +		return NULL;
> +
> +	skb_reserve(skb, AM65_CPSW_HEADROOM + NET_IP_ALIGN);
> +	skb->dev = ndev;
> +
> +	return skb;
> +}

..
  
Julien Panis Feb. 26, 2024, 5:48 p.m. UTC | #2
Hello Simon,

Thank you for the review.

On 2/26/24 18:25, Simon Horman wrote:
> On Fri, Feb 23, 2024 at 12:01:37PM +0100, Julien Panis wrote:
>> This patch adds XDP (eXpress Data Path) support to TI AM65 CPSW
>> Ethernet driver. The following features are implemented:
>> - NETDEV_XDP_ACT_BASIC (XDP_PASS, XDP_TX, XDP_DROP, XDP_ABORTED)
>> - NETDEV_XDP_ACT_REDIRECT (XDP_REDIRECT)
>> - NETDEV_XDP_ACT_NDO_XMIT (ndo_xdp_xmit callback)
>>
>> Signed-off-by: Julien Panis <jpanis@baylibre.com>
> ...
>
>> @@ -440,6 +476,27 @@ static void am65_cpsw_nuss_tx_cleanup(void *data, dma_addr_t desc_dma)
>>   	dev_kfree_skb_any(skb);
>>   }
>>   
>> +static struct sk_buff *am65_cpsw_alloc_skb(struct net_device *ndev, unsigned int len)
>> +{
>> +	struct page *page;
>> +	struct sk_buff *skb;
> nit: please arrange local variables in reverse xmas tree order,
>       from longest line to shortest in new code.
>
>       This tool can be useful: https://github.com/ecree-solarflare/xmastree

You mean, for the new functions introduced in this patch only ?

>
>> +
>> +	page = dev_alloc_pages(0);
> nit: Maybe dev_alloc_page() is appropriate here?

Absolutely.

>
>> +	if (unlikely(!page))
>> +		return NULL;
>> +
>> +	len += AM65_CPSW_HEADROOM;
>> +
>> +	skb = build_skb(page_address(page), len);
>> +	if (unlikely(!skb))
> Does page need to be freed here?

Of course it does ! This will be fixed in the next version.

>
>> +		return NULL;
>> +
>> +	skb_reserve(skb, AM65_CPSW_HEADROOM + NET_IP_ALIGN);
>> +	skb->dev = ndev;
>> +
>> +	return skb;
>> +}
> ...
  
Andrew Lunn Feb. 26, 2024, 11:18 p.m. UTC | #3
> +static struct sk_buff *am65_cpsw_alloc_skb(struct net_device *ndev, unsigned int len)
> +{
> +	struct page *page;
> +	struct sk_buff *skb;
> +
> +	page = dev_alloc_pages(0);

You are likely to get better performance if you use the page_pool.

When FEC added XDP support, the first set of changes was to make use
of page_pool. That improved the drivers performance. Then XDP was
added on top. Maybe you can follow that pattern.

      Andrew
  
Simon Horman Feb. 27, 2024, 5:30 p.m. UTC | #4
On Mon, Feb 26, 2024 at 06:48:25PM +0100, Julien Panis wrote:
> Hello Simon,
> 
> Thank you for the review.
> 
> On 2/26/24 18:25, Simon Horman wrote:
> > On Fri, Feb 23, 2024 at 12:01:37PM +0100, Julien Panis wrote:
> > > This patch adds XDP (eXpress Data Path) support to TI AM65 CPSW
> > > Ethernet driver. The following features are implemented:
> > > - NETDEV_XDP_ACT_BASIC (XDP_PASS, XDP_TX, XDP_DROP, XDP_ABORTED)
> > > - NETDEV_XDP_ACT_REDIRECT (XDP_REDIRECT)
> > > - NETDEV_XDP_ACT_NDO_XMIT (ndo_xdp_xmit callback)
> > > 
> > > Signed-off-by: Julien Panis <jpanis@baylibre.com>
> > ...
> > 
> > > @@ -440,6 +476,27 @@ static void am65_cpsw_nuss_tx_cleanup(void *data, dma_addr_t desc_dma)
> > >   	dev_kfree_skb_any(skb);
> > >   }
> > > +static struct sk_buff *am65_cpsw_alloc_skb(struct net_device *ndev, unsigned int len)
> > > +{
> > > +	struct page *page;
> > > +	struct sk_buff *skb;
> > nit: please arrange local variables in reverse xmas tree order,
> >       from longest line to shortest in new code.
> > 
> >       This tool can be useful: https://github.com/ecree-solarflare/xmastree
> 
> You mean, for the new functions introduced in this patch only ?

It's a bit loose. But generally the idea would be to move towards this
practice. So for new functions: yes. And when modifying old ones old ones:
if possible.

> > > +
> > > +	page = dev_alloc_pages(0);
> > nit: Maybe dev_alloc_page() is appropriate here?
> 
> Absolutely.
> 
> > 
> > > +	if (unlikely(!page))
> > > +		return NULL;
> > > +
> > > +	len += AM65_CPSW_HEADROOM;
> > > +
> > > +	skb = build_skb(page_address(page), len);
> > > +	if (unlikely(!skb))
> > Does page need to be freed here?
> 
> Of course it does ! This will be fixed in the next version.

Thanks!

> 
> > 
> > > +		return NULL;
> > > +
> > > +	skb_reserve(skb, AM65_CPSW_HEADROOM + NET_IP_ALIGN);
> > > +	skb->dev = ndev;
> > > +
> > > +	return skb;
> > > +}
> > ...
>
  
Julien Panis Feb. 29, 2024, 4:19 p.m. UTC | #5
On 2/27/24 00:18, Andrew Lunn wrote:
>> +static struct sk_buff *am65_cpsw_alloc_skb(struct net_device *ndev, unsigned int len)
>> +{
>> +	struct page *page;
>> +	struct sk_buff *skb;
>> +
>> +	page = dev_alloc_pages(0);
> You are likely to get better performance if you use the page_pool.
>
> When FEC added XDP support, the first set of changes was to make use
> of page_pool. That improved the drivers performance. Then XDP was
> added on top. Maybe you can follow that pattern.
>
>        Andrew

Hello Andrew,

Thanks for this suggestion. I've been working on it over the last days.
I encountered issues and I begin to wonder if that's a good option for
this driver. Let me explain...

This device has 3 ports:
- Port0 is the host port (internal to the subsystem and referred as CPPI
in the driver).
- Port1 and 2 are the external ethernet ports.
Each port's RX FIFO has 1 queue.

As mentioned in page_pool documentation:
https://docs.kernel.org/networking/page_pool.html
"The number of pools created MUST match the number of hardware
queues unless hardware restrictions make that impossible. This would
otherwise beat the purpose of page pool, which is allocate pages fast
from cache without locking."

So, for this driver I should use 2 page pools (for port1 and 2) if possible.

But, if I I replace any alloc_skb() with page_pool_alloc() in the original
driver, I will have to create only 1 page_pool.
This is visible in am65_cpsw_nuss_common_open(), which starts with:
"if (common->usage_count) return 0;" to ensure that subsequent code
will be executed only once even if the 2 interfaces are ndo_open'd.
IOW, skbs will be allocated for only 1 RX channel. I checked that behavior,
and that's the way it works.
This is because the host port (CPPI) has only 1 RX queue. This single
queue is used to get all the packets, from both Ports and 2 (port ID is
stored in each descriptor).

So, because of this constraint, I ended up working on the "single
page pool" option.

Questions:
1) Is the behavior described above usual for ethernet switch devices ?
2) Given that I can't use a page pool for each HW queue, is it worth
using the page pool memory model ?
3) Currently I use 2 xdp_rxq (one for port1 and another one for port2).
If an XDP program is attached to port1, my page pool dma parameter
will need to be DMA_BIDIRECTIONAL (because of XDP_TX). As a result,
even if there is no XDP program attached to port2, the setting for page
pool will need to be DMA_BIDIRECTIONAL instead of DMA_FROM_DEVICE.
In such situation, isn't it a problem for port2 ?
4) Because of 1) and 2), will page pool performance really be better for
this driver ?

I'm not familiar with network devices, so it's possible that I misunderstand
some stuffs and I might have written things that are not correct or not
consistent. If so, do not hesitate to enlighten me. :)

Julien
  
Andrew Lunn Feb. 29, 2024, 4:46 p.m. UTC | #6
On Thu, Feb 29, 2024 at 05:19:44PM +0100, Julien Panis wrote:
> On 2/27/24 00:18, Andrew Lunn wrote:
> > > +static struct sk_buff *am65_cpsw_alloc_skb(struct net_device *ndev, unsigned int len)
> > > +{
> > > +	struct page *page;
> > > +	struct sk_buff *skb;
> > > +
> > > +	page = dev_alloc_pages(0);
> > You are likely to get better performance if you use the page_pool.
> > 
> > When FEC added XDP support, the first set of changes was to make use
> > of page_pool. That improved the drivers performance. Then XDP was
> > added on top. Maybe you can follow that pattern.
> > 
> >        Andrew
> 
> Hello Andrew,
> 
> Thanks for this suggestion. I've been working on it over the last days.
> I encountered issues and I begin to wonder if that's a good option for
> this driver. Let me explain...

I'm not a page pool expert, so hopefully those that are will jump in
and help.

> This device has 3 ports:
> - Port0 is the host port (internal to the subsystem and referred as CPPI
> in the driver).
> - Port1 and 2 are the external ethernet ports.
> Each port's RX FIFO has 1 queue.
> 
> As mentioned in page_pool documentation:
> https://docs.kernel.org/networking/page_pool.html
> "The number of pools created MUST match the number of hardware
> queues unless hardware restrictions make that impossible. This would
> otherwise beat the purpose of page pool, which is allocate pages fast
> from cache without locking."

My guess is, this last bit is the important part. Locking. Do ports 1
and port 2 rx and tx run in parallel on different CPUs? Hence do you
need locking?

> So, for this driver I should use 2 page pools (for port1 and 2) if possible.

Maybe, maybe not. It is not really the number of front panel interface
which matters here. It is the number of entities which need buffers.

> But, if I I replace any alloc_skb() with page_pool_alloc() in the original
> driver, I will have to create only 1 page_pool.
> This is visible in am65_cpsw_nuss_common_open(), which starts with:
> "if (common->usage_count) return 0;" to ensure that subsequent code
> will be executed only once even if the 2 interfaces are ndo_open'd.
> IOW, skbs will be allocated for only 1 RX channel. I checked that behavior,
> and that's the way it works.

> This is because the host port (CPPI) has only 1 RX queue. This single
> queue is used to get all the packets, from both Ports and 2 (port ID is
> stored in each descriptor).

So you have one entity which needs buffers. CPPI. It seems like Port1
and Port2 do not need buffers? So to me, you need one page pool.

> So, because of this constraint, I ended up working on the "single
> page pool" option.
> 
> Questions:
> 1) Is the behavior described above usual for ethernet switch devices ?

This might be the first time page pool has been used by a switch? I
would check the melanox and sparx5 driver and see if they use page
pool.

> 2) Given that I can't use a page pool for each HW queue, is it worth
> using the page pool memory model ?

> 3) Currently I use 2 xdp_rxq (one for port1 and another one for port2).
> If an XDP program is attached to port1, my page pool dma parameter
> will need to be DMA_BIDIRECTIONAL (because of XDP_TX). As a result,
> even if there is no XDP program attached to port2, the setting for page
> pool will need to be DMA_BIDIRECTIONAL instead of DMA_FROM_DEVICE.
> In such situation, isn't it a problem for port2 ?
> 4) Because of 1) and 2), will page pool performance really be better for
> this driver ?

You probably need to explain the TX architecture a bit. How are
packets passed to the hardware? Is it again via a single CPPI entity?
Or are there two entities?

DMA_BIDIRECTIONAL and DMA_FROM_DEVICE is about cache flushing and
invalidation. Before you pass a buffer to the hardware for it to
receive into, you need to invalidate the cache. That means when the
hardware gives the buffer back with a packet in it, there is a cache
miss and it fetches the new data from memory. If that packet gets
turned into an XDP_TX, you need to flush the cache to force any
changes out of the cache and into memory. The DMA from memory then
gets the up to date packet contents.

My guess would be, always using DMA_BIDIRECTIONAL is fine, so long as
your cache operations are correct. Turn on DMA_API_DEBUG and make sure
it is happy.

     Andrew
  
Julien Panis Feb. 29, 2024, 5:40 p.m. UTC | #7
On 2/29/24 17:46, Andrew Lunn wrote:
> On Thu, Feb 29, 2024 at 05:19:44PM +0100, Julien Panis wrote:
>> On 2/27/24 00:18, Andrew Lunn wrote:
>>>> +static struct sk_buff *am65_cpsw_alloc_skb(struct net_device *ndev, unsigned int len)
>>>> +{
>>>> +	struct page *page;
>>>> +	struct sk_buff *skb;
>>>> +
>>>> +	page = dev_alloc_pages(0);
>>> You are likely to get better performance if you use the page_pool.
>>>
>>> When FEC added XDP support, the first set of changes was to make use
>>> of page_pool. That improved the drivers performance. Then XDP was
>>> added on top. Maybe you can follow that pattern.
>>>
>>>         Andrew
>> Hello Andrew,
>>
>> Thanks for this suggestion. I've been working on it over the last days.
>> I encountered issues and I begin to wonder if that's a good option for
>> this driver. Let me explain...
> I'm not a page pool expert, so hopefully those that are will jump in
> and help.
>
>> This device has 3 ports:
>> - Port0 is the host port (internal to the subsystem and referred as CPPI
>> in the driver).
>> - Port1 and 2 are the external ethernet ports.
>> Each port's RX FIFO has 1 queue.
>>
>> As mentioned in page_pool documentation:
>> https://docs.kernel.org/networking/page_pool.html
>> "The number of pools created MUST match the number of hardware
>> queues unless hardware restrictions make that impossible. This would
>> otherwise beat the purpose of page pool, which is allocate pages fast
>> from cache without locking."
> My guess is, this last bit is the important part. Locking. Do ports 1
> and port 2 rx and tx run in parallel on different CPUs? Hence do you
> need locking?

No.

>
>> So, for this driver I should use 2 page pools (for port1 and 2) if possible.
> Maybe, maybe not. It is not really the number of front panel interface
> which matters here. It is the number of entities which need buffers.
>
>> But, if I I replace any alloc_skb() with page_pool_alloc() in the original
>> driver, I will have to create only 1 page_pool.
>> This is visible in am65_cpsw_nuss_common_open(), which starts with:
>> "if (common->usage_count) return 0;" to ensure that subsequent code
>> will be executed only once even if the 2 interfaces are ndo_open'd.
>> IOW, skbs will be allocated for only 1 RX channel. I checked that behavior,
>> and that's the way it works.
>> This is because the host port (CPPI) has only 1 RX queue. This single
>> queue is used to get all the packets, from both Ports and 2 (port ID is
>> stored in each descriptor).
> So you have one entity which needs buffers. CPPI. It seems like Port1
> and Port2 do not need buffers? So to me, you need one page pool.

Yes, only one entity (CPPI) needs buffers.

>
>> So, because of this constraint, I ended up working on the "single
>> page pool" option.
>>
>> Questions:
>> 1) Is the behavior described above usual for ethernet switch devices ?
> This might be the first time page pool has been used by a switch? I
> would check the melanox and sparx5 driver and see if they use page
> pool.

It seems that sparx5 does not use page pools, mellanox does.

>
>> 2) Given that I can't use a page pool for each HW queue, is it worth
>> using the page pool memory model ?
>> 3) Currently I use 2 xdp_rxq (one for port1 and another one for port2).
>> If an XDP program is attached to port1, my page pool dma parameter
>> will need to be DMA_BIDIRECTIONAL (because of XDP_TX). As a result,
>> even if there is no XDP program attached to port2, the setting for page
>> pool will need to be DMA_BIDIRECTIONAL instead of DMA_FROM_DEVICE.
>> In such situation, isn't it a problem for port2 ?
>> 4) Because of 1) and 2), will page pool performance really be better for
>> this driver ?
> You probably need to explain the TX architecture a bit. How are
> packets passed to the hardware? Is it again via a single CPPI entity?
> Or are there two entities?

Yes, packets are passed to the hardware via a single CPPI entity.

> DMA_BIDIRECTIONAL and DMA_FROM_DEVICE is about cache flushing and
> invalidation. Before you pass a buffer to the hardware for it to
> receive into, you need to invalidate the cache. That means when the
> hardware gives the buffer back with a packet in it, there is a cache
> miss and it fetches the new data from memory. If that packet gets
> turned into an XDP_TX, you need to flush the cache to force any
> changes out of the cache and into memory. The DMA from memory then
> gets the up to date packet contents.
>
> My guess would be, always using DMA_BIDIRECTIONAL is fine, so long as
> your cache operations are correct. Turn on DMA_API_DEBUG and make sure
> it is happy.
>
>       Andrew

Thank you for all these explanations.
I'll carry on working on this single page pool option, so.

Julien
  

Patch

diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
index 9d2f4ac783e4..080910f45629 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
+++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
@@ -5,6 +5,7 @@ 
  *
  */
 
+#include <linux/bpf_trace.h>
 #include <linux/clk.h>
 #include <linux/etherdevice.h>
 #include <linux/if_vlan.h>
@@ -138,6 +139,17 @@ 
 
 #define AM65_CPSW_DEFAULT_TX_CHNS	8
 
+/* CPPI streaming packet interface */
+#define AM65_CPSW_CPPI_TX_FLOW_ID  0x3FFF
+#define AM65_CPSW_CPPI_TX_PKT_TYPE 0x7
+
+/* XDP */
+#define AM65_CPSW_XDP_CONSUMED	1
+#define AM65_CPSW_XDP_PASS	0
+
+/* Include headroom compatible with both skb and xdpf */
+#define AM65_CPSW_HEADROOM max(NET_SKB_PAD, XDP_PACKET_HEADROOM)
+
 static void am65_cpsw_port_set_sl_mac(struct am65_cpsw_port *slave,
 				      const u8 *dev_addr)
 {
@@ -369,6 +381,30 @@  static void am65_cpsw_init_host_port_emac(struct am65_cpsw_common *common);
 static void am65_cpsw_init_port_switch_ale(struct am65_cpsw_port *port);
 static void am65_cpsw_init_port_emac_ale(struct am65_cpsw_port *port);
 
+static void am65_cpsw_destroy_xdp_rxq(struct am65_cpsw_port *port)
+{
+	struct xdp_rxq_info *rxq = &port->xdp_rxq;
+
+	if (xdp_rxq_info_is_reg(rxq))
+		xdp_rxq_info_unreg(rxq);
+}
+
+static int am65_cpsw_create_xdp_rxq(struct am65_cpsw_port *port)
+{
+	struct xdp_rxq_info *rxq = &port->xdp_rxq;
+	int ret;
+
+	ret = xdp_rxq_info_reg(rxq, port->ndev, port->port_id - 1, 0);
+	if (ret)
+		return ret;
+
+	ret = xdp_rxq_info_reg_mem_model(rxq, MEM_TYPE_PAGE_ORDER0, NULL);
+	if (ret)
+		xdp_rxq_info_unreg(rxq);
+
+	return ret;
+}
+
 static void am65_cpsw_nuss_rx_cleanup(void *data, dma_addr_t desc_dma)
 {
 	struct am65_cpsw_rx_chn *rx_chn = data;
@@ -440,6 +476,27 @@  static void am65_cpsw_nuss_tx_cleanup(void *data, dma_addr_t desc_dma)
 	dev_kfree_skb_any(skb);
 }
 
+static struct sk_buff *am65_cpsw_alloc_skb(struct net_device *ndev, unsigned int len)
+{
+	struct page *page;
+	struct sk_buff *skb;
+
+	page = dev_alloc_pages(0);
+	if (unlikely(!page))
+		return NULL;
+
+	len += AM65_CPSW_HEADROOM;
+
+	skb = build_skb(page_address(page), len);
+	if (unlikely(!skb))
+		return NULL;
+
+	skb_reserve(skb, AM65_CPSW_HEADROOM + NET_IP_ALIGN);
+	skb->dev = ndev;
+
+	return skb;
+}
+
 static int am65_cpsw_nuss_common_open(struct am65_cpsw_common *common)
 {
 	struct am65_cpsw_host *host_p = am65_common_get_host(common);
@@ -506,9 +563,7 @@  static int am65_cpsw_nuss_common_open(struct am65_cpsw_common *common)
 	am65_cpsw_qos_tx_p0_rate_init(common);
 
 	for (i = 0; i < common->rx_chns.descs_num; i++) {
-		skb = __netdev_alloc_skb_ip_align(NULL,
-						  AM65_CPSW_MAX_PACKET_SIZE,
-						  GFP_KERNEL);
+		skb = am65_cpsw_alloc_skb(NULL, AM65_CPSW_MAX_PACKET_SIZE);
 		if (!skb) {
 			ret = -ENOMEM;
 			dev_err(common->dev, "cannot allocate skb\n");
@@ -648,6 +703,8 @@  static int am65_cpsw_nuss_ndo_slave_stop(struct net_device *ndev)
 
 	phylink_disconnect_phy(port->slave.phylink);
 
+	am65_cpsw_destroy_xdp_rxq(port);
+
 	ret = am65_cpsw_nuss_common_stop(common);
 	if (ret)
 		return ret;
@@ -719,6 +776,10 @@  static int am65_cpsw_nuss_ndo_slave_open(struct net_device *ndev)
 
 	common->usage_count++;
 
+	ret = am65_cpsw_create_xdp_rxq(port);
+	if (ret)
+		goto error_cleanup;
+
 	am65_cpsw_port_set_sl_mac(port, ndev->dev_addr);
 
 	if (common->is_emac_mode)
@@ -749,6 +810,138 @@  static int am65_cpsw_nuss_ndo_slave_open(struct net_device *ndev)
 	return ret;
 }
 
+static int am65_cpsw_xdp_tx_frame(struct net_device *ndev,
+				  struct am65_cpsw_tx_chn *tx_chn,
+				  struct xdp_frame *xdpf)
+{
+	struct am65_cpsw_common *common = am65_ndev_to_common(ndev);
+	struct am65_cpsw_port *port = am65_ndev_to_port(ndev);
+	struct netdev_queue *netif_txq;
+	struct cppi5_host_desc_t *host_desc;
+	dma_addr_t dma_desc, dma_buf;
+	u32 pkt_len = xdpf->len;
+	void **swdata;
+	int ret;
+
+	host_desc = k3_cppi_desc_pool_alloc(tx_chn->desc_pool);
+	if (unlikely(!host_desc)) {
+		ndev->stats.tx_dropped++;
+		return -ENOMEM;
+	}
+
+	dma_buf = dma_map_single(tx_chn->dma_dev, xdpf->data, pkt_len, DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(tx_chn->dma_dev, dma_buf))) {
+		ndev->stats.tx_dropped++;
+		ret = -ENOMEM;
+		goto pool_free;
+	}
+
+	cppi5_hdesc_init(host_desc, CPPI5_INFO0_HDESC_EPIB_PRESENT, AM65_CPSW_NAV_PS_DATA_SIZE);
+	cppi5_hdesc_set_pkttype(host_desc, AM65_CPSW_CPPI_TX_PKT_TYPE);
+	cppi5_hdesc_set_pktlen(host_desc, pkt_len);
+	cppi5_desc_set_pktids(&host_desc->hdr, 0, AM65_CPSW_CPPI_TX_FLOW_ID);
+	cppi5_desc_set_tags_ids(&host_desc->hdr, 0, port->port_id);
+
+	k3_udma_glue_tx_dma_to_cppi5_addr(tx_chn->tx_chn, &dma_buf);
+	cppi5_hdesc_attach_buf(host_desc, dma_buf, pkt_len, dma_buf, pkt_len);
+
+	swdata = cppi5_hdesc_get_swdata(host_desc);
+	*(swdata) = xdpf;
+
+	/* Report BQL before sending the packet */
+	netif_txq = netdev_get_tx_queue(ndev, tx_chn->id);
+	netdev_tx_sent_queue(netif_txq, pkt_len);
+
+	dma_desc = k3_cppi_desc_pool_virt2dma(tx_chn->desc_pool, host_desc);
+	if (AM65_CPSW_IS_CPSW2G(common)) {
+		ret = k3_udma_glue_push_tx_chn(tx_chn->tx_chn, host_desc, dma_desc);
+	} else {
+		spin_lock_bh(&tx_chn->lock);
+		ret = k3_udma_glue_push_tx_chn(tx_chn->tx_chn, host_desc, dma_desc);
+		spin_unlock_bh(&tx_chn->lock);
+	}
+	if (ret) {
+		/* Inform BQL */
+		netdev_tx_completed_queue(netif_txq, 1, pkt_len);
+		ndev->stats.tx_errors++;
+		goto dma_unmap;
+	}
+
+	return 0;
+
+dma_unmap:
+	k3_udma_glue_tx_cppi5_to_dma_addr(tx_chn->tx_chn, &dma_buf);
+	dma_unmap_single(tx_chn->dma_dev, dma_buf, pkt_len, DMA_TO_DEVICE);
+pool_free:
+	k3_cppi_desc_pool_free(tx_chn->desc_pool, host_desc);
+	return ret;
+}
+
+static int am65_cpsw_run_xdp(struct am65_cpsw_port *port, struct xdp_buff *xdp, int cpu, int *len)
+{
+	struct net_device *ndev = port->ndev;
+	struct am65_cpsw_tx_chn *tx_chn;
+	struct netdev_queue *netif_txq;
+	struct bpf_prog *prog;
+	struct xdp_frame *xdpf;
+	struct page *page;
+	u32 act;
+	int ret = AM65_CPSW_XDP_CONSUMED;
+
+	prog = READ_ONCE(port->xdp_prog);
+	if (!prog)
+		return AM65_CPSW_XDP_PASS;
+
+	act = bpf_prog_run_xdp(prog, xdp);
+	/* XDP prog might have changed packet data and boundaries */
+	*len = xdp->data_end - xdp->data;
+
+	switch (act) {
+	case XDP_PASS:
+		ret = AM65_CPSW_XDP_PASS;
+		goto out;
+	case XDP_TX:
+		tx_chn = &am65_ndev_to_common(ndev)->tx_chns[cpu % AM65_CPSW_MAX_TX_QUEUES];
+		netif_txq = netdev_get_tx_queue(ndev, tx_chn->id);
+
+		xdpf = xdp_convert_buff_to_frame(xdp);
+		if (unlikely(!xdpf))
+			break;
+
+		__netif_tx_lock(netif_txq, cpu);
+		tx_chn->buf_type = AM65_CPSW_TX_BUF_TYPE_XDP;
+		ret = am65_cpsw_xdp_tx_frame(ndev, tx_chn, xdpf);
+		__netif_tx_unlock(netif_txq);
+		if (ret)
+			break;
+
+		ndev->stats.rx_bytes += *len;
+		ndev->stats.rx_packets++;
+		ret = AM65_CPSW_XDP_CONSUMED;
+		goto out;
+	case XDP_REDIRECT:
+		if (unlikely(xdp_do_redirect(ndev, xdp, prog)))
+			break;
+
+		ndev->stats.rx_bytes += *len;
+		ndev->stats.rx_packets++;
+		goto out;
+	default:
+		bpf_warn_invalid_xdp_action(ndev, prog, act);
+		fallthrough;
+	case XDP_ABORTED:
+		trace_xdp_exception(ndev, prog, act);
+		fallthrough;
+	case XDP_DROP:
+		ndev->stats.rx_dropped++;
+	}
+
+	page = virt_to_page(xdp->data);
+	put_page(page);
+out:
+	return ret;
+}
+
 static void am65_cpsw_nuss_rx_ts(struct sk_buff *skb, u32 *psdata)
 {
 	struct skb_shared_hwtstamps *ssh;
@@ -795,7 +988,7 @@  static void am65_cpsw_nuss_rx_csum(struct sk_buff *skb, u32 csum_info)
 }
 
 static int am65_cpsw_nuss_rx_packets(struct am65_cpsw_common *common,
-				     u32 flow_idx)
+				     u32 flow_idx, int cpu)
 {
 	struct am65_cpsw_rx_chn *rx_chn = &common->rx_chns;
 	u32 buf_dma_len, pkt_len, port_id = 0, csum_info;
@@ -804,11 +997,14 @@  static int am65_cpsw_nuss_rx_packets(struct am65_cpsw_common *common,
 	struct cppi5_host_desc_t *desc_rx;
 	struct device *dev = common->dev;
 	struct sk_buff *skb, *new_skb;
+	struct xdp_buff	xdp;
 	dma_addr_t desc_dma, buf_dma;
 	struct am65_cpsw_port *port;
 	struct net_device *ndev;
+	struct page *page;
 	void **swdata;
 	u32 *psdata;
+	int headroom;
 	int ret = 0;
 
 	ret = k3_udma_glue_pop_rx_chn(rx_chn->rx_chn, flow_idx, &desc_dma);
@@ -851,7 +1047,23 @@  static int am65_cpsw_nuss_rx_packets(struct am65_cpsw_common *common,
 
 	k3_cppi_desc_pool_free(rx_chn->desc_pool, desc_rx);
 
-	new_skb = netdev_alloc_skb_ip_align(ndev, AM65_CPSW_MAX_PACKET_SIZE);
+	if (port->xdp_prog) {
+		xdp_init_buff(&xdp, AM65_CPSW_MAX_PACKET_SIZE, &port->xdp_rxq);
+
+		page = virt_to_page(skb->data);
+		xdp_prepare_buff(&xdp, page_address(page), skb_headroom(skb), pkt_len, false);
+
+		ret = am65_cpsw_run_xdp(port, &xdp, cpu, &pkt_len);
+		if (ret != AM65_CPSW_XDP_PASS)
+			return ret;
+
+		/* Compute additional headroom to be reserved */
+		headroom = (xdp.data - xdp.data_hard_start) - skb_headroom(skb);
+		skb_reserve(skb, headroom);
+	}
+
+	/* Pass skb to netstack if no XDP prog or returned XDP_PASS */
+	new_skb = am65_cpsw_alloc_skb(ndev, AM65_CPSW_MAX_PACKET_SIZE);
 	if (new_skb) {
 		ndev_priv = netdev_priv(ndev);
 		am65_cpsw_nuss_set_offload_fwd_mark(skb, ndev_priv->offload_fwd_mark);
@@ -901,6 +1113,7 @@  static int am65_cpsw_nuss_rx_poll(struct napi_struct *napi_rx, int budget)
 {
 	struct am65_cpsw_common *common = am65_cpsw_napi_to_common(napi_rx);
 	int flow = AM65_CPSW_MAX_RX_FLOWS;
+	int cpu = smp_processor_id();
 	int cur_budget, ret;
 	int num_rx = 0;
 
@@ -909,7 +1122,7 @@  static int am65_cpsw_nuss_rx_poll(struct napi_struct *napi_rx, int budget)
 		cur_budget = budget - num_rx;
 
 		while (cur_budget--) {
-			ret = am65_cpsw_nuss_rx_packets(common, flow);
+			ret = am65_cpsw_nuss_rx_packets(common, flow, cpu);
 			if (ret)
 				break;
 			num_rx++;
@@ -938,8 +1151,8 @@  static int am65_cpsw_nuss_rx_poll(struct napi_struct *napi_rx, int budget)
 }
 
 static struct sk_buff *
-am65_cpsw_nuss_tx_compl_packet(struct am65_cpsw_tx_chn *tx_chn,
-			       dma_addr_t desc_dma)
+am65_cpsw_nuss_tx_compl_packet_skb(struct am65_cpsw_tx_chn *tx_chn,
+				   dma_addr_t desc_dma)
 {
 	struct am65_cpsw_ndev_priv *ndev_priv;
 	struct am65_cpsw_ndev_stats *stats;
@@ -968,6 +1181,39 @@  am65_cpsw_nuss_tx_compl_packet(struct am65_cpsw_tx_chn *tx_chn,
 	return skb;
 }
 
+static struct xdp_frame *
+am65_cpsw_nuss_tx_compl_packet_xdp(struct am65_cpsw_common *common,
+				   struct am65_cpsw_tx_chn *tx_chn,
+				   dma_addr_t desc_dma,
+				   struct net_device **ndev)
+{
+	struct am65_cpsw_port *port;
+	struct am65_cpsw_ndev_priv *ndev_priv;
+	struct am65_cpsw_ndev_stats *stats;
+	struct cppi5_host_desc_t *desc_tx;
+	struct xdp_frame *xdpf;
+	void **swdata;
+	u32 port_id = 0;
+
+	desc_tx = k3_cppi_desc_pool_dma2virt(tx_chn->desc_pool, desc_dma);
+	cppi5_desc_get_tags_ids(&desc_tx->hdr, NULL, &port_id);
+	swdata = cppi5_hdesc_get_swdata(desc_tx);
+	xdpf = *(swdata);
+	am65_cpsw_nuss_xmit_free(tx_chn, desc_tx);
+
+	port = am65_common_get_port(common, port_id);
+	*ndev = port->ndev;
+
+	ndev_priv = netdev_priv(*ndev);
+	stats = this_cpu_ptr(ndev_priv->stats);
+	u64_stats_update_begin(&stats->syncp);
+	stats->tx_packets++;
+	stats->tx_bytes += xdpf->len;
+	u64_stats_update_end(&stats->syncp);
+
+	return xdpf;
+}
+
 static void am65_cpsw_nuss_tx_wake(struct am65_cpsw_tx_chn *tx_chn, struct net_device *ndev,
 				   struct netdev_queue *netif_txq)
 {
@@ -994,6 +1240,7 @@  static int am65_cpsw_nuss_tx_compl_packets(struct am65_cpsw_common *common,
 	unsigned int total_bytes = 0;
 	struct net_device *ndev;
 	struct sk_buff *skb;
+	struct xdp_frame *xdpf;
 	dma_addr_t desc_dma;
 	int res, num_tx = 0;
 
@@ -1013,10 +1260,16 @@  static int am65_cpsw_nuss_tx_compl_packets(struct am65_cpsw_common *common,
 			break;
 		}
 
-		skb = am65_cpsw_nuss_tx_compl_packet(tx_chn, desc_dma);
-		total_bytes = skb->len;
-		ndev = skb->dev;
-		napi_consume_skb(skb, budget);
+		if (tx_chn->buf_type == AM65_CPSW_TX_BUF_TYPE_SKB) {
+			skb = am65_cpsw_nuss_tx_compl_packet_skb(tx_chn, desc_dma);
+			ndev = skb->dev;
+			total_bytes = skb->len;
+			napi_consume_skb(skb, budget);
+		} else {
+			xdpf = am65_cpsw_nuss_tx_compl_packet_xdp(common, tx_chn, desc_dma, &ndev);
+			total_bytes = xdpf->len;
+			xdp_return_frame(xdpf);
+		}
 		num_tx++;
 
 		netif_txq = netdev_get_tx_queue(ndev, chn);
@@ -1040,6 +1293,7 @@  static int am65_cpsw_nuss_tx_compl_packets_2g(struct am65_cpsw_common *common,
 	unsigned int total_bytes = 0;
 	struct net_device *ndev;
 	struct sk_buff *skb;
+	struct xdp_frame *xdpf;
 	dma_addr_t desc_dma;
 	int res, num_tx = 0;
 
@@ -1057,11 +1311,16 @@  static int am65_cpsw_nuss_tx_compl_packets_2g(struct am65_cpsw_common *common,
 			break;
 		}
 
-		skb = am65_cpsw_nuss_tx_compl_packet(tx_chn, desc_dma);
-
-		ndev = skb->dev;
-		total_bytes += skb->len;
-		napi_consume_skb(skb, budget);
+		if (tx_chn->buf_type == AM65_CPSW_TX_BUF_TYPE_SKB) {
+			skb = am65_cpsw_nuss_tx_compl_packet_skb(tx_chn, desc_dma);
+			ndev = skb->dev;
+			total_bytes += skb->len;
+			napi_consume_skb(skb, budget);
+		} else {
+			xdpf = am65_cpsw_nuss_tx_compl_packet_xdp(common, tx_chn, desc_dma, &ndev);
+			total_bytes += xdpf->len;
+			xdp_return_frame(xdpf);
+		}
 		num_tx++;
 	}
 
@@ -1166,6 +1425,8 @@  static netdev_tx_t am65_cpsw_nuss_ndo_slave_xmit(struct sk_buff *skb,
 	tx_chn = &common->tx_chns[q_idx];
 	netif_txq = netdev_get_tx_queue(ndev, q_idx);
 
+	tx_chn->buf_type = AM65_CPSW_TX_BUF_TYPE_SKB;
+
 	/* Map the linear buffer */
 	buf_dma = dma_map_single(tx_chn->dma_dev, skb->data, pkt_len,
 				 DMA_TO_DEVICE);
@@ -1185,8 +1446,8 @@  static netdev_tx_t am65_cpsw_nuss_ndo_slave_xmit(struct sk_buff *skb,
 
 	cppi5_hdesc_init(first_desc, CPPI5_INFO0_HDESC_EPIB_PRESENT,
 			 AM65_CPSW_NAV_PS_DATA_SIZE);
-	cppi5_desc_set_pktids(&first_desc->hdr, 0, 0x3FFF);
-	cppi5_hdesc_set_pkttype(first_desc, 0x7);
+	cppi5_desc_set_pktids(&first_desc->hdr, 0, AM65_CPSW_CPPI_TX_FLOW_ID);
+	cppi5_hdesc_set_pkttype(first_desc, AM65_CPSW_CPPI_TX_PKT_TYPE);
 	cppi5_desc_set_tags_ids(&first_desc->hdr, 0, port->port_id);
 
 	k3_udma_glue_tx_dma_to_cppi5_addr(tx_chn->tx_chn, &buf_dma);
@@ -1488,6 +1749,58 @@  static void am65_cpsw_nuss_ndo_get_stats(struct net_device *dev,
 	stats->tx_dropped	= dev->stats.tx_dropped;
 }
 
+static int am65_cpsw_xdp_prog_setup(struct net_device *ndev, struct bpf_prog *prog)
+{
+	struct am65_cpsw_port *port = am65_ndev_to_port(ndev);
+	struct bpf_prog *old_prog;
+	bool running = netif_running(ndev);
+
+	if (running)
+		am65_cpsw_nuss_ndo_slave_stop(ndev);
+
+	old_prog = xchg(&port->xdp_prog, prog);
+	if (old_prog)
+		bpf_prog_put(old_prog);
+
+	if (running)
+		return am65_cpsw_nuss_ndo_slave_open(ndev);
+
+	return 0;
+}
+
+static int am65_cpsw_ndo_bpf(struct net_device *ndev, struct netdev_bpf *bpf)
+{
+	switch (bpf->command) {
+	case XDP_SETUP_PROG:
+		return am65_cpsw_xdp_prog_setup(ndev, bpf->prog);
+	default:
+		return -EINVAL;
+	}
+}
+
+static int am65_cpsw_ndo_xdp_xmit(struct net_device *ndev, int n,
+				  struct xdp_frame **frames, u32 flags)
+{
+	struct am65_cpsw_tx_chn *tx_chn;
+	struct netdev_queue *netif_txq;
+	int cpu = smp_processor_id();
+	int i, nxmit = 0;
+
+	tx_chn = &am65_ndev_to_common(ndev)->tx_chns[cpu % AM65_CPSW_MAX_TX_QUEUES];
+	netif_txq = netdev_get_tx_queue(ndev, tx_chn->id);
+
+	__netif_tx_lock(netif_txq, cpu);
+	tx_chn->buf_type = AM65_CPSW_TX_BUF_TYPE_XDP;
+	for (i = 0; i < n; i++) {
+		if (am65_cpsw_xdp_tx_frame(ndev, tx_chn, frames[i]))
+			break;
+		nxmit++;
+	}
+	__netif_tx_unlock(netif_txq);
+
+	return nxmit;
+}
+
 static const struct net_device_ops am65_cpsw_nuss_netdev_ops = {
 	.ndo_open		= am65_cpsw_nuss_ndo_slave_open,
 	.ndo_stop		= am65_cpsw_nuss_ndo_slave_stop,
@@ -1502,6 +1815,8 @@  static const struct net_device_ops am65_cpsw_nuss_netdev_ops = {
 	.ndo_eth_ioctl		= am65_cpsw_nuss_ndo_slave_ioctl,
 	.ndo_setup_tc           = am65_cpsw_qos_ndo_setup_tc,
 	.ndo_set_tx_maxrate	= am65_cpsw_qos_ndo_tx_p0_set_maxrate,
+	.ndo_bpf		= am65_cpsw_ndo_bpf,
+	.ndo_xdp_xmit		= am65_cpsw_ndo_xdp_xmit,
 };
 
 static void am65_cpsw_disable_phy(struct phy *phy)
@@ -1816,6 +2131,8 @@  static int am65_cpsw_nuss_init_tx_chns(struct am65_cpsw_common *common)
 			goto err;
 		}
 
+		tx_chn->buf_type = AM65_CPSW_TX_BUF_TYPE_SKB;
+
 		tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);
 		if (tx_chn->irq < 0) {
 			dev_err(dev, "Failed to get tx dma irq %d\n",
@@ -1873,11 +2190,7 @@  static void am65_cpsw_nuss_remove_rx_chns(void *data)
 
 	netif_napi_del(&common->napi_rx);
 
-	if (!IS_ERR_OR_NULL(rx_chn->desc_pool))
-		k3_cppi_desc_pool_destroy(rx_chn->desc_pool);
-
-	if (!IS_ERR_OR_NULL(rx_chn->rx_chn))
-		k3_udma_glue_release_rx_chn(rx_chn->rx_chn);
+	am65_cpsw_nuss_free_rx_chns(common);
 
 	common->rx_flow_id_base = -1;
 }
@@ -2252,6 +2565,9 @@  am65_cpsw_nuss_init_port_ndev(struct am65_cpsw_common *common, u32 port_idx)
 				  NETIF_F_HW_TC;
 	port->ndev->features = port->ndev->hw_features |
 			       NETIF_F_HW_VLAN_CTAG_FILTER;
+	port->ndev->xdp_features = NETDEV_XDP_ACT_BASIC |
+				   NETDEV_XDP_ACT_REDIRECT |
+				   NETDEV_XDP_ACT_NDO_XMIT;
 	port->ndev->vlan_features |=  NETIF_F_SG;
 	port->ndev->netdev_ops = &am65_cpsw_nuss_netdev_ops;
 	port->ndev->ethtool_ops = &am65_cpsw_ethtool_ops_slave;
@@ -2315,6 +2631,8 @@  am65_cpsw_nuss_init_port_ndev(struct am65_cpsw_common *common, u32 port_idx)
 	if (ret)
 		dev_err(dev, "failed to add percpu stat free action %d\n", ret);
 
+	port->xdp_prog = NULL;
+
 	if (!common->dma_ndev)
 		common->dma_ndev = port->ndev;
 
diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.h b/drivers/net/ethernet/ti/am65-cpsw-nuss.h
index 7da0492dc091..6fbb975427df 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-nuss.h
+++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.h
@@ -14,6 +14,7 @@ 
 #include <linux/platform_device.h>
 #include <linux/soc/ti/k3-ringacc.h>
 #include <net/devlink.h>
+#include <net/xdp.h>
 #include "am65-cpsw-qos.h"
 
 struct am65_cpts;
@@ -56,10 +57,17 @@  struct am65_cpsw_port {
 	bool				rx_ts_enabled;
 	struct am65_cpsw_qos		qos;
 	struct devlink_port		devlink_port;
+	struct bpf_prog			*xdp_prog;
+	struct xdp_rxq_info		xdp_rxq;
 	/* Only for suspend resume context */
 	u32				vid_context;
 };
 
+enum am65_cpsw_tx_buf_type {
+	AM65_CPSW_TX_BUF_TYPE_SKB,
+	AM65_CPSW_TX_BUF_TYPE_XDP,
+};
+
 struct am65_cpsw_host {
 	struct am65_cpsw_common		*common;
 	void __iomem			*port_base;
@@ -74,6 +82,7 @@  struct am65_cpsw_tx_chn {
 	struct am65_cpsw_common	*common;
 	struct k3_cppi_desc_pool *desc_pool;
 	struct k3_udma_glue_tx_channel *tx_chn;
+	enum am65_cpsw_tx_buf_type buf_type;
 	spinlock_t lock; /* protect TX rings in multi-port mode */
 	struct hrtimer tx_hrtimer;
 	unsigned long tx_pace_timeout;