diff mbox series

[V2,net-next] net: fec: add CBS offload support

Message ID	20230213092912.2314029-1-wei.fang@nxp.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: wei.fang@nxp.com To: shenwei.wang@nxp.com, xiaoning.wang@nxp.com, davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org, simon.horman@corigine.com, andrew@lunn.ch Cc: linux-imx@nxp.com, linux-kernel@vger.kernel.org Subject: [PATCH V2 net-next] net: fec: add CBS offload support Date: Mon, 13 Feb 2023 17:29:12 +0800 Message-Id: <20230213092912.2314029-1-wei.fang@nxp.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain MIME-Version: 1.0 Precedence: bulk
Series	[V2,net-next] net: fec: add CBS offload support \| [V2,net-next] net: fec: add CBS offload support

Commit Message

Wei Fang Feb. 13, 2023, 9:29 a.m. UTC

  From: Wei Fang <wei.fang@nxp.com>

The FEC hardware supports the Credit-based shaper (CBS) which control
the bandwidth distribution between normal traffic and time-sensitive
traffic with respect to the total link bandwidth available.
But notice that the bandwidth allocation of hardware is restricted to
certain values. Below is the equation which is used to calculate the
BW (bandwidth) fraction for per class:
	BW fraction = 1 / (1 + 512 / idle_slope)

For values of idle_slope less than 128, idle_slope = 2 ^ n, when n =
0,1,2,...,6. For values equal to or greater than 128, idle_slope =
128 * m, where m = 1,2,3,...,12.
Example 1. idle_slope = 64, therefore BW fraction = 0.111.
Example 2. idle_slope = 128, therefore BW fraction = 0.200.

Here is an example command to set 200Mbps bandwidth on 1000Mbps port
for TC 2 and 111Mbps for TC 3.
tc qdisc add dev eth0 parent root handle 100 mqprio num_tc 3 map \
0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 queues 1@0 1@1 1@2 hw 0
tc qdisc replace dev eth0 parent 100:2 cbs idleslope 200000 \
sendslope -800000 hicredit 153 locredit -1389 offload 1
tc qdisc replace dev eth0 parent 100:3 cbs idleslope 111000 \
sendslope -889000 hicredit 90 locredit -892 offload 1

Signed-off-by: Wei Fang <wei.fang@nxp.com>
---
V2 changes:
1. Based on Simon's suggestion, modified the description in
   fec_enet_get_idle_slope to make it more clear.
2. Adopted Simon's suggestion to use macro DIV_ROUND_CLOSEST to calculate
   idle_slope. And also amended some nits.
3. According to Andrew's comments, the speed may be equal to 0 when the
   link is not up, so added a check to see if speed is equal to 0. In
   addtion, the change in link speed also need to be taken into account.
   Considering that the change of link speed has invalidated the original
   configuration, so we just fall back to the default setting.
4. Considering that some events will cause the MAC reset and clear the CBS
   registers (such as link status change, transmit timeout, checksum offload
   feature change and so on), so reconfigure the CBS in fec_restart.
5. Added more checks for parameters passed in from user space.
---
 drivers/net/ethernet/freescale/fec.h      |  13 ++
 drivers/net/ethernet/freescale/fec_main.c | 179 ++++++++++++++++++++++
 2 files changed, 192 insertions(+)

Comments

Andrew Lunn Feb. 13, 2023, 1:32 p.m. UTC | #1

> 3. According to Andrew's comments, the speed may be equal to 0 when the
>    link is not up, so added a check to see if speed is equal to 0. In
>    addtion, the change in link speed also need to be taken into account.
>    Considering that the change of link speed has invalidated the original
>    configuration, so we just fall back to the default setting.

I don't think that is what you have actually implemented. A link
status change causes a fec_restart. And in fac_restart, you now
reprogram the hardware. So if the link speed is sufficient to support
the request, the hardware should be setup to support it.

What are the real uses cases here? VoIP? Video streaming? So 128kbps,
2Mbps. Both of those are fine over a 10Half limk. So i think you
should try to configure the hardware whenever possible, after link
change or any other condition which causes a reset of the hardware.

What i have not seen addresses here is my comment/question about what
tc shows when it is not possible to perform the request after a link
change? Did you look at how other drivers handle this? Maybe you need
to ask Jamal?

> +static int fec_enet_setup_tc_cbs(struct net_device *ndev, void *type_data)
> +{
> +	struct fec_enet_private *fep = netdev_priv(ndev);
> +	struct tc_cbs_qopt_offload *cbs = type_data;
> +	int queue = cbs->queue;
> +	int speed = fep->speed;
> +	int queue2;
> +
> +	if (!(fep->quirks & FEC_QUIRK_HAS_AVB))
> +		return -EOPNOTSUPP;
> +
> +	/* Queue 1 for Class A, Queue 2 for Class B, so the ENET must
> +	 * have three queues.
> +	 */
> +	if (fep->num_tx_queues != FEC_ENET_MAX_TX_QS)
> +		return -EOPNOTSUPP;
> +
> +	if (!speed) {
> +		netdev_err(ndev, "Link speed is 0!\n");
> +		return -ECANCELED;

ECANCLED? First time i've seen that one used. I had to go look it up
to see what it means. It does not really give the user any idea why it
failed. How about -ENETDOWN?

	Andrew

Alexander Lobakin Feb. 13, 2023, 4:04 p.m. UTC | #2

From: Wei Fang <wei.fang@nxp.com>
Date: Mon, 13 Feb 2023 17:29:12 +0800

> From: Wei Fang <wei.fang@nxp.com>
> 
> The FEC hardware supports the Credit-based shaper (CBS) which control
> the bandwidth distribution between normal traffic and time-sensitive
> traffic with respect to the total link bandwidth available.
> But notice that the bandwidth allocation of hardware is restricted to
> certain values. Below is the equation which is used to calculate the
> BW (bandwidth) fraction for per class:
> 	BW fraction = 1 / (1 + 512 / idle_slope)

[...]

> @@ -571,6 +575,12 @@ struct fec_stop_mode_gpr {
>  	u8 bit;
>  };
>  
> +struct fec_cbs_params {
> +	bool enable[FEC_ENET_MAX_TX_QS];

Maybe smth like

	DECLARE_BITMAP(enabled, FEC_ENET_MAX_TX_QS);

?

> +	int idleslope[FEC_ENET_MAX_TX_QS];
> +	int sendslope[FEC_ENET_MAX_TX_QS];

Can they actually be negative? (probably I'll see it below)

> +};
> +
>  /* The FEC buffer descriptors track the ring buffers.  The rx_bd_base and
>   * tx_bd_base always point to the base of the buffer descriptors.  The
>   * cur_rx and cur_tx point to the currently available buffer.
> @@ -679,6 +689,9 @@ struct fec_enet_private {
>  	/* XDP BPF Program */
>  	struct bpf_prog *xdp_prog;
>  
> +	/* CBS parameters */
> +	struct fec_cbs_params cbs;
> +
>  	u64 ethtool_stats[];
>  };
>  
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index c73e25f8995e..91394ad05121 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -66,6 +66,7 @@
>  #include <linux/mfd/syscon.h>
>  #include <linux/regmap.h>
>  #include <soc/imx/cpuidle.h>
> +#include <net/pkt_sched.h>

Some alphabetic order? At least for new files :D

>  #include <linux/filter.h>
>  #include <linux/bpf.h>
>  
> @@ -1023,6 +1024,174 @@ static void fec_enet_reset_skb(struct net_device *ndev)
>  	}
>  }
>  
> +static u32 fec_enet_get_idle_slope(u8 bw)

Just use `u32`, usually there's no reason to use types shorter than
integer on the stack.

> +{
> +	int msb, power;
> +	u32 idle_slope;
> +
> +	if (bw >= 100)
> +		return 0;
> +
> +	/* Convert bw to hardware idle slope */
> +	idle_slope = (512 * bw) / (100 - bw);
> +

Redundant newline. Also first pair of braces is optional and up to you.

> +	if (idle_slope >= 128) {
> +		/* For values greater than or equal to 128, idle_slope
> +		 * rounded to the nearest multiple of 128.
> +		 */

But you can just do

	idle_slope = min(idle_slope, 128U);

and still use the "standard" path below, without the conditional return?
Or even combine it with the code above:

	idle_slope = min((512 * bw) / (100 - bw), 128U);

> +		idle_slope = DIV_ROUND_CLOSEST(idle_slope, 128U) * 128U;
> +
> +		return idle_slope;
> +	}
> +
> +	/* For values less than 128, idle_slope is rounded to
> +	 * nearst power of 2.

Typo, 'nearest'.

> +	 */
> +	if (idle_slope <= 1)
> +		return 1;
> +
> +	msb = __fls(idle_slope);

I think `fls() - 1` is preferred over `__fls()` since it may go UB. And
some checks wouldn't hurt.

> +	power = BIT(msb);

Oh okay. Then rounddown_pow_of_two() is what you're looking for.

	power = rounddown_pow_of_two(idle_slope);

Or even just use one variable, @idle_slope.

> +	idle_slope = DIV_ROUND_CLOSEST(idle_slope, power) * power;
> +
> +	return idle_slope;

You can return DIV_ROUND_ ... right away, without assignning first.
Also, I'm thinking of that this might be a generic helper. We have
roundup() and rounddown(), this could be something like "round_closest()"?

> +}
> +
> +static void fec_enet_set_cbs_idle_slope(struct fec_enet_private *fep)
> +{
> +	u32 bw, val, idle_slope;
> +	int speed = fep->speed;
> +	int idle_slope_sum = 0;
> +	int i;

Can any of them be negative?

> +
> +	if (!speed)
> +		return;
> +
> +	for (i = 1; i < FEC_ENET_MAX_TX_QS; i++) {

So you don't use the zeroth array elements at all? Why having them then?

> +		int port_tx_rate;

(same for type)

> +
> +		/* As defined in IEEE 802.1Q-2014 Section 8.6.8.2 item:
> +		 *       sendslope = idleslope -  port_tx_rate
> +		 * So we need to check whether port_tx_rate is equal to
> +		 * the current link rate.

[...]

> +	for (i = 1; i < FEC_ENET_MAX_TX_QS; i++) {
> +		bw = fep->cbs.idleslope[i] / (speed * 10);
> +		idle_slope = fec_enet_get_idle_slope(bw);
> +
> +		val = readl(fep->hwp + FEC_DMA_CFG(i));
> +		val &= ~IDLE_SLOPE_MASK;
> +		val |= idle_slope & IDLE_SLOPE_MASK;

u32_replace_bits() will do it for you.

> +		writel(val, fep->hwp + FEC_DMA_CFG(i));
> +	}
> +
> +	/* Enable Credit-based shaper. */
> +	val = readl(fep->hwp + FEC_QOS_SCHEME);
> +	val &= ~FEC_QOS_TX_SHEME_MASK;
> +	val |= CREDIT_BASED_SCHEME;

(same)

> +	writel(val, fep->hwp + FEC_QOS_SCHEME);
> +}
> +
> +static int fec_enet_setup_tc_cbs(struct net_device *ndev, void *type_data)
> +{
> +	struct fec_enet_private *fep = netdev_priv(ndev);
> +	struct tc_cbs_qopt_offload *cbs = type_data;
> +	int queue = cbs->queue;
> +	int speed = fep->speed;
> +	int queue2;

(types)

> +
> +	if (!(fep->quirks & FEC_QUIRK_HAS_AVB))
> +		return -EOPNOTSUPP;
> +
> +	/* Queue 1 for Class A, Queue 2 for Class B, so the ENET must
> +	 * have three queues.
> +	 */
> +	if (fep->num_tx_queues != FEC_ENET_MAX_TX_QS)
> +		return -EOPNOTSUPP;
> +
> +	if (!speed) {
> +		netdev_err(ndev, "Link speed is 0!\n");

??? Is this possible? If so, why is it checked only here and why can it
be possible?

> +		return -ECANCELED;

(already mentioned in other review)

> +	}
> +
> +	/* Queue 0 is not AVB capable */
> +	if (queue <= 0 || queue >= fep->num_tx_queues) {

Is `< 0` possible? I realize it's `s32`, just curious.

> +		netdev_err(ndev, "The queue: %d is invalid!\n", queue);

Maybe less emotions? There's no point in having `!` at the end of every
error.

> +		return -EINVAL;
> +	}
> +
> +	if (!cbs->enable) {
> +		u32 val;
> +
> +		val = readl(fep->hwp + FEC_QOS_SCHEME);
> +		val &= ~FEC_QOS_TX_SHEME_MASK;
> +		val |= ROUND_ROBIN_SCHEME;

(u32_replace_bits())

> +		writel(val, fep->hwp + FEC_QOS_SCHEME);
> +
> +		memset(&fep->cbs, 0, sizeof(fep->cbs));
> +
> +		return 0;
> +	}
> +
> +	if (cbs->idleslope - cbs->sendslope != speed * 1000 ||
> +	    cbs->idleslope <= 0 || cbs->sendslope >= 0)

So you check slopes here, why check them above then?

> +		return -EINVAL;
> +
> +	/* Another AVB queue */
> +	queue2 = (queue == 1) ? 2 : 1;

Braces are unneeded.

> +	if (cbs->idleslope + fep->cbs.idleslope[queue2] > speed * 1000) {
> +		netdev_err(ndev,
> +			   "The sum of all idle slope can't exceed link speed!\n");
> +		return -EINVAL;
> +	}
[...]

Thanks,
Olek

Andrew Lunn Feb. 13, 2023, 4:21 p.m. UTC | #3

> > +	if (!speed) {
> > +		netdev_err(ndev, "Link speed is 0!\n");
> 
> ??? Is this possible? If so, why is it checked only here and why can it
> be possible?

The obvious way this happens is that there is no link partner, so
auto-neg has not completed yet. The link speed is unknown.

     Andrew

Alexander Lobakin Feb. 13, 2023, 5:44 p.m. UTC | #4

From: Andrew Lunn <andrew@lunn.ch>
Date: Mon, 13 Feb 2023 17:21:43 +0100

>>> +	if (!speed) {
>>> +		netdev_err(ndev, "Link speed is 0!\n");
>>
>> ??? Is this possible? If so, why is it checked only here and why can it
>> be possible?
> 
> The obvious way this happens is that there is no link partner, so
> auto-neg has not completed yet. The link speed is unknown.

Sure, but why treat it an error path then?

> 
>      Andrew

Thanks,
Olek

Andrew Lunn Feb. 13, 2023, 6:07 p.m. UTC | #5

On Mon, Feb 13, 2023 at 06:44:05PM +0100, Alexander Lobakin wrote:
> From: Andrew Lunn <andrew@lunn.ch>
> Date: Mon, 13 Feb 2023 17:21:43 +0100
> 
> >>> +	if (!speed) {
> >>> +		netdev_err(ndev, "Link speed is 0!\n");
> >>
> >> ??? Is this possible? If so, why is it checked only here and why can it
> >> be possible?
> > 
> > The obvious way this happens is that there is no link partner, so
> > auto-neg has not completed yet. The link speed is unknown.
> 
> Sure, but why treat it an error path then?

You need to treat is somehow. I would actually disagree with
netdev_err(), netdev_dbg() seems more appropriate. But if you don't
know the link speed, you cannot program the scheduler.

This also comes back to my question about what should happen with a TC
configuration which works fine for 1000BaseT, but will not work for
10BaseT. Should the driver accept it only if the current link speed is
sufficient? Should it always accept it, and not program it into the
hardware if the current link speed does not support it?

Since we are talking about hardware acceleration here, what does the
pure software version do? Ideally we want the accelerated version to
do the same as the software version.

Wei, please disable all clever stuff in the hardware, setup a pure
software qdisc and a 10BaseT link. Oversubscribe the link and see what
happens. Does other traffic get starved?

	Andrew

Wei Fang Feb. 14, 2023, 9:34 a.m. UTC | #6

> -----Original Message-----
> From: Alexander Lobakin <alexandr.lobakin@intel.com>
> Sent: 2023年2月14日 0:05
> To: Wei Fang <wei.fang@nxp.com>
> Cc: Shenwei Wang <shenwei.wang@nxp.com>; Clark Wang
> <xiaoning.wang@nxp.com>; davem@davemloft.net; edumazet@google.com;
> kuba@kernel.org; pabeni@redhat.com; simon.horman@corigine.com;
> andrew@lunn.ch; netdev@vger.kernel.org; dl-linux-imx <linux-imx@nxp.com>;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH V2 net-next] net: fec: add CBS offload support
> 
> From: Wei Fang <wei.fang@nxp.com>
> Date: Mon, 13 Feb 2023 17:29:12 +0800
> 
> > From: Wei Fang <wei.fang@nxp.com>
> >
> > The FEC hardware supports the Credit-based shaper (CBS) which control
> > the bandwidth distribution between normal traffic and time-sensitive
> > traffic with respect to the total link bandwidth available.
> > But notice that the bandwidth allocation of hardware is restricted to
> > certain values. Below is the equation which is used to calculate the
> > BW (bandwidth) fraction for per class:
> > 	BW fraction = 1 / (1 + 512 / idle_slope)
> 
> [...]
> 
> > @@ -571,6 +575,12 @@ struct fec_stop_mode_gpr {
> >  	u8 bit;
> >  };
> >
> > +struct fec_cbs_params {
> > +	bool enable[FEC_ENET_MAX_TX_QS];
> 
> Maybe smth like
> 
> 	DECLARE_BITMAP(enabled, FEC_ENET_MAX_TX_QS);
> 
> ?
I think it's a matter of personal habit.

> 
> > +	int idleslope[FEC_ENET_MAX_TX_QS];
> > +	int sendslope[FEC_ENET_MAX_TX_QS];
> 
> Can they actually be negative? (probably I'll see it below)
> 
idleslope and sendslope are used to save the parameters passed in from user space.
And the sendslope should always be negative.

> > +};
> > +
> >  /* The FEC buffer descriptors track the ring buffers.  The rx_bd_base and
> >   * tx_bd_base always point to the base of the buffer descriptors.  The
> >   * cur_rx and cur_tx point to the currently available buffer.
> > @@ -679,6 +689,9 @@ struct fec_enet_private {
> >  	/* XDP BPF Program */
> >  	struct bpf_prog *xdp_prog;
> >
> > +	/* CBS parameters */
> > +	struct fec_cbs_params cbs;
> > +
> >  	u64 ethtool_stats[];
> >  };
> >
> > diff --git a/drivers/net/ethernet/freescale/fec_main.c
> > b/drivers/net/ethernet/freescale/fec_main.c
> > index c73e25f8995e..91394ad05121 100644
> > --- a/drivers/net/ethernet/freescale/fec_main.c
> > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > @@ -66,6 +66,7 @@
> >  #include <linux/mfd/syscon.h>
> >  #include <linux/regmap.h>
> >  #include <soc/imx/cpuidle.h>
> > +#include <net/pkt_sched.h>
> 
> Some alphabetic order? At least for new files :D
> 
I just want to keep the reverse Christmas tree style, although the whole #include
order is already out of the style.

> >  #include <linux/filter.h>
> >  #include <linux/bpf.h>
> >
> > @@ -1023,6 +1024,174 @@ static void fec_enet_reset_skb(struct
> net_device *ndev)
> >  	}
> >  }
> >
> > +static u32 fec_enet_get_idle_slope(u8 bw)
> 
> Just use `u32`, usually there's no reason to use types shorter than integer on
> the stack.
> 
> > +{
> > +	int msb, power;
> > +	u32 idle_slope;
> > +
> > +	if (bw >= 100)
> > +		return 0;
> > +
> > +	/* Convert bw to hardware idle slope */
> > +	idle_slope = (512 * bw) / (100 - bw);
> > +
> 
> Redundant newline. Also first pair of braces is optional and up to you.
> 
I will fix this nit, thanks!

> > +	if (idle_slope >= 128) {
> > +		/* For values greater than or equal to 128, idle_slope
> > +		 * rounded to the nearest multiple of 128.
> > +		 */
> 
> But you can just do
> 
> 	idle_slope = min(idle_slope, 128U);
> 
> and still use the "standard" path below, without the conditional return?
> Or even combine it with the code above:
> 
> 	idle_slope = min((512 * bw) / (100 - bw), 128U);
> 
If idles_slope is greater than or equal to 128, idle_slope should be rounded to the nearest
multiple of 128. For example, if idle_slope = 255, it should be set to 256. However 
min(idle_slope, 128U) is not as expected.

> > +		idle_slope = DIV_ROUND_CLOSEST(idle_slope, 128U) * 128U;
> > +
> > +		return idle_slope;
> > +	}
> > +
> > +	/* For values less than 128, idle_slope is rounded to
> > +	 * nearst power of 2.
> 
> Typo, 'nearest'.
> 
Yes, I'll fix it.

> > +	 */
> > +	if (idle_slope <= 1)
> > +		return 1;
> > +
> > +	msb = __fls(idle_slope);
> 
> I think `fls() - 1` is preferred over `__fls()` since it may go UB. And some checks
> wouldn't hurt.
> 
I used fls() at first, but later changed to __fls. Now that you pointed out its possible
risks, I'll change it back. Thanks.

> > +	power = BIT(msb);
> 
> Oh okay. Then rounddown_pow_of_two() is what you're looking for.
> 
> 	power = rounddown_pow_of_two(idle_slope);
> 
> Or even just use one variable, @idle_slope.
> 
Thanks for the reminder, I think I should use roundup_pow_of_two().

> > +	idle_slope = DIV_ROUND_CLOSEST(idle_slope, power) * power;
> > +
> > +	return idle_slope;
> 
> You can return DIV_ROUND_ ... right away, without assignning first.
> Also, I'm thinking of that this might be a generic helper. We have
> roundup() and rounddown(), this could be something like "round_closest()"?
> 
> > +}
> > +
> > +static void fec_enet_set_cbs_idle_slope(struct fec_enet_private *fep)
> > +{
> > +	u32 bw, val, idle_slope;
> > +	int speed = fep->speed;
> > +	int idle_slope_sum = 0;
> > +	int i;
> 
> Can any of them be negative?
> 
No.

> > +
> > +	if (!speed)
> > +		return;
> > +
> > +	for (i = 1; i < FEC_ENET_MAX_TX_QS; i++) {
> 
> So you don't use the zeroth array elements at all? Why having them then?
> 
Yes, the zeroth indicates queue 0, and queue 0 only support Non-AVB traffic.
Why having them then?
Firstly I think it more clear that the i indicates the index of queue. Secondly,
it is for more convenience. If you think it is inappropriate, I will amend it
later.

> > +		int port_tx_rate;
> 
> (same for type)
> 
> > +
> > +		/* As defined in IEEE 802.1Q-2014 Section 8.6.8.2 item:
> > +		 *       sendslope = idleslope -  port_tx_rate
> > +		 * So we need to check whether port_tx_rate is equal to
> > +		 * the current link rate.
> 
> [...]
> 
> > +	for (i = 1; i < FEC_ENET_MAX_TX_QS; i++) {
> > +		bw = fep->cbs.idleslope[i] / (speed * 10);
> > +		idle_slope = fec_enet_get_idle_slope(bw);
> > +
> > +		val = readl(fep->hwp + FEC_DMA_CFG(i));
> > +		val &= ~IDLE_SLOPE_MASK;
> > +		val |= idle_slope & IDLE_SLOPE_MASK;
> 
> u32_replace_bits() will do it for you.
> 
Sorry, I can't find the definition of u32_replace_bits().

> > +		writel(val, fep->hwp + FEC_DMA_CFG(i));
> > +	}
> > +
> > +	/* Enable Credit-based shaper. */
> > +	val = readl(fep->hwp + FEC_QOS_SCHEME);
> > +	val &= ~FEC_QOS_TX_SHEME_MASK;
> > +	val |= CREDIT_BASED_SCHEME;
> 
> (same)
> 
> > +	writel(val, fep->hwp + FEC_QOS_SCHEME); }
> > +
> > +static int fec_enet_setup_tc_cbs(struct net_device *ndev, void
> > +*type_data) {
> > +	struct fec_enet_private *fep = netdev_priv(ndev);
> > +	struct tc_cbs_qopt_offload *cbs = type_data;
> > +	int queue = cbs->queue;
> > +	int speed = fep->speed;
> > +	int queue2;
> 
> (types)
> 
> > +
> > +	if (!(fep->quirks & FEC_QUIRK_HAS_AVB))
> > +		return -EOPNOTSUPP;
> > +
> > +	/* Queue 1 for Class A, Queue 2 for Class B, so the ENET must
> > +	 * have three queues.
> > +	 */
> > +	if (fep->num_tx_queues != FEC_ENET_MAX_TX_QS)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (!speed) {
> > +		netdev_err(ndev, "Link speed is 0!\n");
> 
> ??? Is this possible? If so, why is it checked only here and why can it be
> possible?
> 
It's possible if the board bootup without cable.

> > +		return -ECANCELED;
> 
> (already mentioned in other review)
> 
> > +	}
> > +
> > +	/* Queue 0 is not AVB capable */
> > +	if (queue <= 0 || queue >= fep->num_tx_queues) {
> 
> Is `< 0` possible? I realize it's `s32`, just curious.
> 
If the wrong parameter is entered and the app does not check the value,
It might be 0. I'm not sure, just in case.

> > +		netdev_err(ndev, "The queue: %d is invalid!\n", queue);
> 
> Maybe less emotions? There's no point in having `!` at the end of every error.
> 
OK, it fine to remove '!'.

> > +		return -EINVAL;
> > +	}
> > +
> > +	if (!cbs->enable) {
> > +		u32 val;
> > +
> > +		val = readl(fep->hwp + FEC_QOS_SCHEME);
> > +		val &= ~FEC_QOS_TX_SHEME_MASK;
> > +		val |= ROUND_ROBIN_SCHEME;
> 
> (u32_replace_bits())
> 
> > +		writel(val, fep->hwp + FEC_QOS_SCHEME);
> > +
> > +		memset(&fep->cbs, 0, sizeof(fep->cbs));
> > +
> > +		return 0;
> > +	}
> > +
> > +	if (cbs->idleslope - cbs->sendslope != speed * 1000 ||
> > +	    cbs->idleslope <= 0 || cbs->sendslope >= 0)
> 
> So you check slopes here, why check them above then?
> 
Because this function will be invoked due to some events in
fec_restart too, so I added these checks here. If you think it
is redundant, I will move these checks to fec_restart. Thanks.


> > +		return -EINVAL;
> > +
> > +	/* Another AVB queue */
> > +	queue2 = (queue == 1) ? 2 : 1;
> 
> Braces are unneeded.
> 
> > +	if (cbs->idleslope + fep->cbs.idleslope[queue2] > speed * 1000) {
> > +		netdev_err(ndev,
> > +			   "The sum of all idle slope can't exceed link speed!\n");
> > +		return -EINVAL;
> > +	}
> [...]
> 
> Thanks,
> Olek

Wei Fang Feb. 14, 2023, 1:22 p.m. UTC | #7

> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: 2023年2月14日 2:07
> To: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Wei Fang <wei.fang@nxp.com>; Shenwei Wang <shenwei.wang@nxp.com>;
> Clark Wang <xiaoning.wang@nxp.com>; davem@davemloft.net;
> edumazet@google.com; kuba@kernel.org; pabeni@redhat.com;
> simon.horman@corigine.com; netdev@vger.kernel.org; dl-linux-imx
> <linux-imx@nxp.com>; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH V2 net-next] net: fec: add CBS offload support
> 
> On Mon, Feb 13, 2023 at 06:44:05PM +0100, Alexander Lobakin wrote:
> > From: Andrew Lunn <andrew@lunn.ch>
> > Date: Mon, 13 Feb 2023 17:21:43 +0100
> >
> > >>> +	if (!speed) {
> > >>> +		netdev_err(ndev, "Link speed is 0!\n");
> > >>
> > >> ??? Is this possible? If so, why is it checked only here and why
> > >> can it be possible?
> > >
> > > The obvious way this happens is that there is no link partner, so
> > > auto-neg has not completed yet. The link speed is unknown.
> >
> > Sure, but why treat it an error path then?
> 
> You need to treat is somehow. I would actually disagree with netdev_err(),
> netdev_dbg() seems more appropriate. But if you don't know the link speed,
> you cannot program the scheduler.
> 
Yes, netdev_dbg() seems more appropriate. And as I replied before, I think
!fep->link maybe more appropriate than !speed. What do you think?

> This also comes back to my question about what should happen with a TC
> configuration which works fine for 1000BaseT, but will not work for 10BaseT.
> Should the driver accept it only if the current link speed is sufficient? Should it
> always accept it, and not program it into the hardware if the current link speed
> does not support it?
> 
Please see the previous reply.

> Since we are talking about hardware acceleration here, what does the pure
> software version do? Ideally we want the accelerated version to do the same
> as the software version.
> 
> Wei, please disable all clever stuff in the hardware, setup a pure software qdisc
> and a 10BaseT link. Oversubscribe the link and see what happens. Does other
> traffic get starved?
> 

Sorry, I'm not very familiar with the configuration of pure software implementation
of CBS. I tried to configure the CBS like the following. The bandwidth of queue 1 was
set to 30Mbps. And the queue 2 is set to 20Mbps. Then one stream were sent the
queue 1 and the rate was 50Mbps, the link speed was 1Gbps. But the result seemed that
the CBS did not take effective. And the linux packet generator was used to send the
stream. Was there something wrong with my configuration?

tc qdisc add dev eth0 parent root handle 100 mqprio num_tc 3 map 0 0 2 1 0 0 0 0 0 \
0 0 0 0 0 0 0 queues 1@0 1@1 1@2 hw 0
tc qdisc replace dev eth0 parent 100:2 cbs idleslope 30000 sendslope -970000 hicredit \
12 locredit -113 offload 0
tc qdisc replace dev eth0 parent 100:3 cbs idleslope 20000 sendslope -980000 hicredit \
12 locredit -113 offload 0

Andrew Lunn Feb. 14, 2023, 1:49 p.m. UTC | #8

> > > --- a/drivers/net/ethernet/freescale/fec_main.c
> > > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > > @@ -66,6 +66,7 @@
> > >  #include <linux/mfd/syscon.h>
> > >  #include <linux/regmap.h>
> > >  #include <soc/imx/cpuidle.h>
> > > +#include <net/pkt_sched.h>
> > 
> > Some alphabetic order? At least for new files :D
> > 
> I just want to keep the reverse Christmas tree style, although the whole #include
> order is already out of the style.

Reverse Christmass tree does not apply to includes, just local
variables. And the FEC driver has never been very good at it.

	   Andrew

Andrew Lunn Feb. 14, 2023, 2:28 p.m. UTC | #9

> Sorry, I'm not very familiar with the configuration of pure software implementation
> of CBS. I tried to configure the CBS like the following. The bandwidth of queue 1 was
> set to 30Mbps. And the queue 2 is set to 20Mbps. Then one stream were sent the
> queue 1 and the rate was 50Mbps, the link speed was 1Gbps. But the result seemed that
> the CBS did not take effective.

I'm not that familiar with CBS, but that is what i would expect. You
are over subscribing the queue by 20Mbps, so that 20Mbps gets
relegated to best effort. And since you have a 1G link, you have
plenty of best effort bandwidth.

As with most QoS queuing, it only really makes a different to packet
loss when you oversubscribe the link as a whole.

So with your 30Mbps + 20Mbps + BE configuration on a 1G link, send
50Mbps + 0Mbps + 1Gbps. 30Mbps of your 50Mbps stream should be
guaranteed to arrive at the destination. The remaining 20Mbps needs to
share the remaining 970Mbps of link capacity with the 1G of BE
traffic. So you would expect to see a few extra Kbps of queue #1
traffic arriving and around 969Mbps of best effort traffic.

However, that is not really the case i'm interested in. This
discussion started from the point that autoneg has resulted in a much
smaller link capacity. The link is now over subscribed by the CBS
configuration. Should the hardware just give up and go back to default
behaviour, or should it continue to do some CBS?

Set lets start with a 7Mbps queue 1 and 5Mbps queue 2, on a link which
auto negs to 100Mbps. Generate traffic of 8Mbps, 6Mpbs and 100Mbps
BE. You would expect ~7Mbps, ~5Mbps and 88Mbps to arrive at the link
peer. Your two CBS flows get there reserved bandwidth, plus a little
of the BE. BE gets whats remains of the link. Test that and make sure
that is what actually happens with software CBS, and with your TC
offload to hardware.

Now force the link down to 10Mbps. The CBS queues then over subscribe
the link. Keep with the traffic generator producing 8Mbps, 6Mpbs and
100Mbps BE. What i guess the software CBS will do is 7Mbps, 3Mbps and
0 BE. You should confirm this with testing.

What does this mean for TC offload? You should be aiming for the same
behaviour. So even when the link is over subscribed, you should still
be programming the hardware.

   Andrew

Alexander Lobakin Feb. 14, 2023, 4:49 p.m. UTC | #10

From: Wei Fang <wei.fang@nxp.com>
Date: Tue, 14 Feb 2023 09:34:09 +0000

>> -----Original Message-----
>> From: Alexander Lobakin <alexandr.lobakin@intel.com>
>> Sent: 2023年2月14日 0:05
>> To: Wei Fang <wei.fang@nxp.com>
>> Cc: Shenwei Wang <shenwei.wang@nxp.com>; Clark Wang
>> <xiaoning.wang@nxp.com>; davem@davemloft.net; edumazet@google.com;
>> kuba@kernel.org; pabeni@redhat.com; simon.horman@corigine.com;
>> andrew@lunn.ch; netdev@vger.kernel.org; dl-linux-imx <linux-imx@nxp.com>;
>> linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH V2 net-next] net: fec: add CBS offload support
>>
>> From: Wei Fang <wei.fang@nxp.com>
>> Date: Mon, 13 Feb 2023 17:29:12 +0800
>>
>>> From: Wei Fang <wei.fang@nxp.com>
>>>
>>> The FEC hardware supports the Credit-based shaper (CBS) which control
>>> the bandwidth distribution between normal traffic and time-sensitive
>>> traffic with respect to the total link bandwidth available.
>>> But notice that the bandwidth allocation of hardware is restricted to
>>> certain values. Below is the equation which is used to calculate the
>>> BW (bandwidth) fraction for per class:
>>> 	BW fraction = 1 / (1 + 512 / idle_slope)
>>
>> [...]
>>
>>> @@ -571,6 +575,12 @@ struct fec_stop_mode_gpr {
>>>  	u8 bit;
>>>  };
>>>
>>> +struct fec_cbs_params {
>>> +	bool enable[FEC_ENET_MAX_TX_QS];
>>
>> Maybe smth like
>>
>> 	DECLARE_BITMAP(enabled, FEC_ENET_MAX_TX_QS);
>>
>> ?
> I think it's a matter of personal habit.

It's a matter that you can fit 32 bits if you declare it as u32 or 64
bits if as a bitmap vs you waste 1 byte per 1 true/false flag as you do
in this version :D

> 
>>
>>> +	int idleslope[FEC_ENET_MAX_TX_QS];
>>> +	int sendslope[FEC_ENET_MAX_TX_QS];
>>
>> Can they actually be negative? (probably I'll see it below)
>>
> idleslope and sendslope are used to save the parameters passed in from user space.
> And the sendslope should always be negative.

Parameters coming from userspace must be validated before saving them
anywhere.
Also, if sendslope is always negative as you say, then just negate it
when you read it from userspace and store as u32.

> 
>>> +};
>>> +
>>>  /* The FEC buffer descriptors track the ring buffers.  The rx_bd_base and
>>>   * tx_bd_base always point to the base of the buffer descriptors.  The
>>>   * cur_rx and cur_tx point to the currently available buffer.
>>> @@ -679,6 +689,9 @@ struct fec_enet_private {
>>>  	/* XDP BPF Program */
>>>  	struct bpf_prog *xdp_prog;
>>>
>>> +	/* CBS parameters */
>>> +	struct fec_cbs_params cbs;
>>> +
>>>  	u64 ethtool_stats[];
>>>  };
>>>
>>> diff --git a/drivers/net/ethernet/freescale/fec_main.c
>>> b/drivers/net/ethernet/freescale/fec_main.c
>>> index c73e25f8995e..91394ad05121 100644
>>> --- a/drivers/net/ethernet/freescale/fec_main.c
>>> +++ b/drivers/net/ethernet/freescale/fec_main.c
>>> @@ -66,6 +66,7 @@
>>>  #include <linux/mfd/syscon.h>
>>>  #include <linux/regmap.h>
>>>  #include <soc/imx/cpuidle.h>
>>> +#include <net/pkt_sched.h>
>>
>> Some alphabetic order? At least for new files :D
>>
> I just want to keep the reverse Christmas tree style, although the whole #include
> order is already out of the style.

RCT is applied to variable declaration inside functions, not to header
include block :D

> 
>>>  #include <linux/filter.h>
>>>  #include <linux/bpf.h>
>>>
>>> @@ -1023,6 +1024,174 @@ static void fec_enet_reset_skb(struct
>> net_device *ndev)
>>>  	}
>>>  }
>>>
>>> +static u32 fec_enet_get_idle_slope(u8 bw)
>>
>> Just use `u32`, usually there's no reason to use types shorter than integer on
>> the stack.
>>
>>> +{
>>> +	int msb, power;
>>> +	u32 idle_slope;
>>> +
>>> +	if (bw >= 100)
>>> +		return 0;
>>> +
>>> +	/* Convert bw to hardware idle slope */
>>> +	idle_slope = (512 * bw) / (100 - bw);
>>> +
>>
>> Redundant newline. Also first pair of braces is optional and up to you.
>>
> I will fix this nit, thanks!
> 
>>> +	if (idle_slope >= 128) {
>>> +		/* For values greater than or equal to 128, idle_slope
>>> +		 * rounded to the nearest multiple of 128.
>>> +		 */
>>
>> But you can just do
>>
>> 	idle_slope = min(idle_slope, 128U);
>>
>> and still use the "standard" path below, without the conditional return?
>> Or even combine it with the code above:
>>
>> 	idle_slope = min((512 * bw) / (100 - bw), 128U);
>>
> If idles_slope is greater than or equal to 128, idle_slope should be rounded to the nearest
> multiple of 128. For example, if idle_slope = 255, it should be set to 256. However 
> min(idle_slope, 128U) is not as expected.

So you say that for for >= 128 it must be a multiple of 128 and for <
128 it must be pow-2? Then I did misread your code a bit, sorry.
But then my propo regarding adding round_closest() applies here as well.

> 
>>> +		idle_slope = DIV_ROUND_CLOSEST(idle_slope, 128U) * 128U;
>>> +
>>> +		return idle_slope;
>>> +	}
>>> +
>>> +	/* For values less than 128, idle_slope is rounded to
>>> +	 * nearst power of 2.
>>
>> Typo, 'nearest'.
>>
> Yes, I'll fix it.
> 
>>> +	 */
>>> +	if (idle_slope <= 1)
>>> +		return 1;
>>> +
>>> +	msb = __fls(idle_slope);
>>
>> I think `fls() - 1` is preferred over `__fls()` since it may go UB. And some checks
>> wouldn't hurt.
>>
> I used fls() at first, but later changed to __fls. Now that you pointed out its possible
> risks, I'll change it back. Thanks.
> 
>>> +	power = BIT(msb);
>>
>> Oh okay. Then rounddown_pow_of_two() is what you're looking for.
>>
>> 	power = rounddown_pow_of_two(idle_slope);
>>
>> Or even just use one variable, @idle_slope.
>>
> Thanks for the reminder, I think I should use roundup_pow_of_two().

But your code does what rounddown_pow_of_two() does, not roundup.
Imagine that you have 0b1111, then your code will turn it into
0b1000, not 0b10000. Or am I missing something?

> 
>>> +	idle_slope = DIV_ROUND_CLOSEST(idle_slope, power) * power;
>>> +
>>> +	return idle_slope;
>>
>> You can return DIV_ROUND_ ... right away, without assignning first.
>> Also, I'm thinking of that this might be a generic helper. We have
>> roundup() and rounddown(), this could be something like "round_closest()"?
>>
>>> +}
>>> +
>>> +static void fec_enet_set_cbs_idle_slope(struct fec_enet_private *fep)
>>> +{
>>> +	u32 bw, val, idle_slope;
>>> +	int speed = fep->speed;
>>> +	int idle_slope_sum = 0;
>>> +	int i;
>>
>> Can any of them be negative?
>>
> No.

So u32 for them.

> 
>>> +
>>> +	if (!speed)
>>> +		return;
>>> +
>>> +	for (i = 1; i < FEC_ENET_MAX_TX_QS; i++) {
>>
>> So you don't use the zeroth array elements at all? Why having them then?
>>
> Yes, the zeroth indicates queue 0, and queue 0 only support Non-AVB traffic.
> Why having them then?
> Firstly I think it more clear that the i indicates the index of queue. Secondly,
> it is for more convenience. If you think it is inappropriate, I will amend it
> later.

Well, it's not recommended to allocate some space you will never use.

> 
>>> +		int port_tx_rate;
>>
>> (same for type)
>>
>>> +
>>> +		/* As defined in IEEE 802.1Q-2014 Section 8.6.8.2 item:
>>> +		 *       sendslope = idleslope -  port_tx_rate
>>> +		 * So we need to check whether port_tx_rate is equal to
>>> +		 * the current link rate.
>>
>> [...]
>>
>>> +	for (i = 1; i < FEC_ENET_MAX_TX_QS; i++) {
>>> +		bw = fep->cbs.idleslope[i] / (speed * 10);
>>> +		idle_slope = fec_enet_get_idle_slope(bw);
>>> +
>>> +		val = readl(fep->hwp + FEC_DMA_CFG(i));
>>> +		val &= ~IDLE_SLOPE_MASK;
>>> +		val |= idle_slope & IDLE_SLOPE_MASK;
>>
>> u32_replace_bits() will do it for you.
>>
> Sorry, I can't find the definition of u32_replace_bits().

Because Elixir doesn't index functions generated via macros. See the end
of <linux/bitfield.h>, this is where the whole family gets defined.

> 
>>> +		writel(val, fep->hwp + FEC_DMA_CFG(i));
>>> +	}
>>> +
>>> +	/* Enable Credit-based shaper. */
>>> +	val = readl(fep->hwp + FEC_QOS_SCHEME);
>>> +	val &= ~FEC_QOS_TX_SHEME_MASK;
>>> +	val |= CREDIT_BASED_SCHEME;
>>
>> (same)
>>
>>> +	writel(val, fep->hwp + FEC_QOS_SCHEME); }
>>> +
>>> +static int fec_enet_setup_tc_cbs(struct net_device *ndev, void
>>> +*type_data) {
>>> +	struct fec_enet_private *fep = netdev_priv(ndev);
>>> +	struct tc_cbs_qopt_offload *cbs = type_data;
>>> +	int queue = cbs->queue;
>>> +	int speed = fep->speed;
>>> +	int queue2;
>>
>> (types)
>>
>>> +
>>> +	if (!(fep->quirks & FEC_QUIRK_HAS_AVB))
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	/* Queue 1 for Class A, Queue 2 for Class B, so the ENET must
>>> +	 * have three queues.
>>> +	 */
>>> +	if (fep->num_tx_queues != FEC_ENET_MAX_TX_QS)
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	if (!speed) {
>>> +		netdev_err(ndev, "Link speed is 0!\n");
>>
>> ??? Is this possible? If so, why is it checked only here and why can it be
>> possible?
>>
> It's possible if the board bootup without cable.

Then it shouldn't be an error -- no link partner is a regular flow, not
something horrendous.

> 
>>> +		return -ECANCELED;
>>
>> (already mentioned in other review)
>>
>>> +	}
>>> +
>>> +	/* Queue 0 is not AVB capable */
>>> +	if (queue <= 0 || queue >= fep->num_tx_queues) {
>>
>> Is `< 0` possible? I realize it's `s32`, just curious.
>>
> If the wrong parameter is entered and the app does not check the value,
> It might be 0. I'm not sure, just in case.

Ah okay, so it's a check for userspace sanity. Then it should stay.

> 
>>> +		netdev_err(ndev, "The queue: %d is invalid!\n", queue);
>>
>> Maybe less emotions? There's no point in having `!` at the end of every error.
>>
> OK, it fine to remove '!'.
> 
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	if (!cbs->enable) {
>>> +		u32 val;
>>> +
>>> +		val = readl(fep->hwp + FEC_QOS_SCHEME);
>>> +		val &= ~FEC_QOS_TX_SHEME_MASK;
>>> +		val |= ROUND_ROBIN_SCHEME;
>>
>> (u32_replace_bits())
>>
>>> +		writel(val, fep->hwp + FEC_QOS_SCHEME);
>>> +
>>> +		memset(&fep->cbs, 0, sizeof(fep->cbs));
>>> +
>>> +		return 0;
>>> +	}
>>> +
>>> +	if (cbs->idleslope - cbs->sendslope != speed * 1000 ||
>>> +	    cbs->idleslope <= 0 || cbs->sendslope >= 0)
>>
>> So you check slopes here, why check them above then?
>>
> Because this function will be invoked due to some events in
> fec_restart too, so I added these checks here. If you think it
> is redundant, I will move these checks to fec_restart. Thanks.

Maybe you could make a small static inline and use it where appropriate
instead of open-coding?

> 
> 
>>> +		return -EINVAL;
>>> +
>>> +	/* Another AVB queue */
>>> +	queue2 = (queue == 1) ? 2 : 1;
>>
>> Braces are unneeded.
>>
>>> +	if (cbs->idleslope + fep->cbs.idleslope[queue2] > speed * 1000) {
>>> +		netdev_err(ndev,
>>> +			   "The sum of all idle slope can't exceed link speed!\n");
>>> +		return -EINVAL;
>>> +	}
>> [...]
>>
>> Thanks,
>> Olek
Thanks,
Olek

Wei Fang Feb. 16, 2023, 12:43 p.m. UTC | #11

> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: 2023年2月14日 22:29
> To: Wei Fang <wei.fang@nxp.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>; Shenwei Wang
> <shenwei.wang@nxp.com>; Clark Wang <xiaoning.wang@nxp.com>;
> davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com; simon.horman@corigine.com; netdev@vger.kernel.org;
> dl-linux-imx <linux-imx@nxp.com>; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH V2 net-next] net: fec: add CBS offload support
> 
> > Sorry, I'm not very familiar with the configuration of pure software
> > implementation of CBS. I tried to configure the CBS like the
> > following. The bandwidth of queue 1 was set to 30Mbps. And the queue 2
> > is set to 20Mbps. Then one stream were sent the queue 1 and the rate
> > was 50Mbps, the link speed was 1Gbps. But the result seemed that the CBS
> did not take effective in my previous test.
> 
> I'm not that familiar with CBS, but that is what i would expect. You are over
> subscribing the queue by 20Mbps, so that 20Mbps gets relegated to best effort.
> And since you have a 1G link, you have plenty of best effort bandwidth.
> 
That is not the behavior of CBS, if the bandwidth of the queue is set to 20Mbps, the
maximum transmit rate of the queue shouldn't exceed 20Mbps. I have found the
root cause why the CBS does not take effective in my previous test, because the
default setting of pktgen will bypass the Qdisc.

> As with most QoS queuing, it only really makes a different to packet loss when
> you oversubscribe the link as a whole.
> 
> So with your 30Mbps + 20Mbps + BE configuration on a 1G link, send 50Mbps
> + 0Mbps + 1Gbps. 30Mbps of your 50Mbps stream should be guaranteed to
> arrive at the destination. The remaining 20Mbps needs to share the remaining
> 970Mbps of link capacity with the 1G of BE traffic. So you would expect to see
> a few extra Kbps of queue #1 traffic arriving and around 969Mbps of best
> effort traffic.
> 
> However, that is not really the case i'm interested in. This discussion started
> from the point that autoneg has resulted in a much smaller link capacity. The
> link is now over subscribed by the CBS configuration. Should the hardware just
> give up and go back to default behaviour, or should it continue to do some
> CBS?
> 
See test results and responses below.

> Set lets start with a 7Mbps queue 1 and 5Mbps queue 2, on a link which auto
> negs to 100Mbps. Generate traffic of 8Mbps, 6Mpbs and 100Mbps BE. You
> would expect ~7Mbps, ~5Mbps and 88Mbps to arrive at the link peer. Your two
> CBS flows get there reserved bandwidth, plus a little of the BE. BE gets whats
> remains of the link. Test that and make sure that is what actually happens with
> software CBS, and with your TC offload to hardware.
> 
> Now force the link down to 10Mbps. The CBS queues then over subscribe the
> link. Keep with the traffic generator producing 8Mbps, 6Mpbs and 100Mbps
> BE. What i guess the software CBS will do is 7Mbps, 3Mbps and
> 0 BE. You should confirm this with testing.
> 
I have tested the pure software CBS today. And below are the test steps and results.
Link speed 100Mbps.
Queue 0: Non-CBS queue, 100Mbps traffic.
Queue 1: CBS queue, 7Mbps bandwidth and 8Mbps traffic.
Queue 2: CBS queue, 5Mbps bandwidth and 6Mbps traffic.
Results: queue 0 egress rate is 86Mbps, queue 1 egress rate is 6Mbps, and queue 2
egress rate is 4Mbps.
Then change the link speed to 10Mbps, queue 0 egress rate is 4Mbps, queue 1 egress
rate is 4Mbps, and queue 2 egress rate is 3Mbps.

> What does this mean for TC offload? You should be aiming for the same
> behaviour. So even when the link is over subscribed, you should still be
> programming the hardware.
> 
Beside the test results, I also checked the CBS codes. Unlike hardware implementation,
the pure software method is more flexible, it has four parameters: idleslope, sendslope,
locredit and hicredit. And it can detect the change of link speed and do some adjust.
However, for hardware we only use the idleslope parameter. It's hard for us to make
the hardware behave as the pure software when the link speed changes.
So for the question: Should the hardware just give up and go back to default behaviour,
or should it continue to do some CBS?
I think that we can refer to the behaviors of stmmac and enetc drivers, just keep the
bandwidth ratio constant when the link rate changes. In addition, the link speed change
is a corner case, there is no need to spend any more effort to discuss this matter.

Wei Fang Feb. 16, 2023, 1:03 p.m. UTC | #12

> -----Original Message-----
> From: Alexander Lobakin <alexandr.lobakin@intel.com>
> Sent: 2023年2月15日 0:49
> To: Wei Fang <wei.fang@nxp.com>
> Cc: Shenwei Wang <shenwei.wang@nxp.com>; Clark Wang
> <xiaoning.wang@nxp.com>; davem@davemloft.net; edumazet@google.com;
> kuba@kernel.org; pabeni@redhat.com; simon.horman@corigine.com;
> andrew@lunn.ch; netdev@vger.kernel.org; dl-linux-imx <linux-imx@nxp.com>;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH V2 net-next] net: fec: add CBS offload support
> 
> From: Wei Fang <wei.fang@nxp.com>
> Date: Tue, 14 Feb 2023 09:34:09 +0000
> 
> >> -----Original Message-----
> >> From: Alexander Lobakin <alexandr.lobakin@intel.com>
> >> Sent: 2023年2月14日 0:05
> >> To: Wei Fang <wei.fang@nxp.com>
> >> Cc: Shenwei Wang <shenwei.wang@nxp.com>; Clark Wang
> >> <xiaoning.wang@nxp.com>; davem@davemloft.net;
> edumazet@google.com;
> >> kuba@kernel.org; pabeni@redhat.com; simon.horman@corigine.com;
> >> andrew@lunn.ch; netdev@vger.kernel.org; dl-linux-imx
> >> <linux-imx@nxp.com>; linux-kernel@vger.kernel.org
> >> Subject: Re: [PATCH V2 net-next] net: fec: add CBS offload support
> >>
> >> From: Wei Fang <wei.fang@nxp.com>
> >> Date: Mon, 13 Feb 2023 17:29:12 +0800
> >>
> >>> From: Wei Fang <wei.fang@nxp.com>
> >>>
> >>
> >>> @@ -571,6 +575,12 @@ struct fec_stop_mode_gpr {
> >>>  	u8 bit;
> >>>  };
> >>>
> >>> +struct fec_cbs_params {
> >>> +	bool enable[FEC_ENET_MAX_TX_QS];
> >>
> >> Maybe smth like
> >>
> >> 	DECLARE_BITMAP(enabled, FEC_ENET_MAX_TX_QS);
> >>
> >> ?
> > I think it's a matter of personal habit.
> 
> It's a matter that you can fit 32 bits if you declare it as u32 or 64 bits if as a
> bitmap vs you waste 1 byte per 1 true/false flag as you do in this version :D
> 
Okay, I'll optimize this part.

> >
> >>
> >>> +	int idleslope[FEC_ENET_MAX_TX_QS];
> >>> +	int sendslope[FEC_ENET_MAX_TX_QS];
> >>
> >> Can they actually be negative? (probably I'll see it below)
> >>
> > idleslope and sendslope are used to save the parameters passed in from user
> space.
> > And the sendslope should always be negative.
> 
> Parameters coming from userspace must be validated before saving them
> anywhere.
> Also, if sendslope is always negative as you say, then just negate it when you
> read it from userspace and store as u32.
Sorry, I still don't understand why u32 is necessary to store parameters from user
space? In addition, people who understand 802.1Qav may be confused when they
see that sendslope is a u32 type.

> 
> >
> >>> +};
> >>> +
> >>>  /* The FEC buffer descriptors track the ring buffers.  The rx_bd_base and
> >>>   * tx_bd_base always point to the base of the buffer descriptors.  The
> >>>   * cur_rx and cur_tx point to the currently available buffer.
> >>> @@ -679,6 +689,9 @@ struct fec_enet_private {
> >>>  	/* XDP BPF Program */
> >>>  	struct bpf_prog *xdp_prog;
> >>>
> >>> +	/* CBS parameters */
> >>> +	struct fec_cbs_params cbs;
> >>> +
> >>>  	u64 ethtool_stats[];
> >>>  };
> >>>
> >>> @@ -66,6 +66,7 @@
> >>>  #include <linux/mfd/syscon.h>
> >>>  #include <linux/regmap.h>
> >>>  #include <soc/imx/cpuidle.h>
> >>> +#include <net/pkt_sched.h>
> >>
> >> Some alphabetic order? At least for new files :D
> >>
> > I just want to keep the reverse Christmas tree style, although the
> > whole #include order is already out of the style.
> 
> RCT is applied to variable declaration inside functions, not to header include
> block :D
It seems that I misunderstood before, I'll fix it.

> 
> >
> >>>  #include <linux/filter.h>
> >>>  #include <linux/bpf.h>
> >>>
> >>> @@ -1023,6 +1024,174 @@ static void fec_enet_reset_skb(struct
> >> net_device *ndev)
> >>>  	}
> >>>  }
> >>>
> >>> +static u32 fec_enet_get_idle_slope(u8 bw)
> >>
> >> Just use `u32`, usually there's no reason to use types shorter than
> >> integer on the stack.
> >>
> >>> +{
> >>> +	int msb, power;
> >>> +	u32 idle_slope;
> >>> +
> >>> +	if (bw >= 100)
> >>> +		return 0;
> >>> +
> >>> +	/* Convert bw to hardware idle slope */
> >>> +	idle_slope = (512 * bw) / (100 - bw);
> >>> +
> >>
> >> Redundant newline. Also first pair of braces is optional and up to you.
> >>
> > I will fix this nit, thanks!
> >
> >>> +	if (idle_slope >= 128) {
> >>> +		/* For values greater than or equal to 128, idle_slope
> >>> +		 * rounded to the nearest multiple of 128.
> >>> +		 */
> >>
> >> But you can just do
> >>
> >> 	idle_slope = min(idle_slope, 128U);
> >>
> >> and still use the "standard" path below, without the conditional return?
> >> Or even combine it with the code above:
> >>
> >> 	idle_slope = min((512 * bw) / (100 - bw), 128U);
> >>
> > If idles_slope is greater than or equal to 128, idle_slope should be
> > rounded to the nearest multiple of 128. For example, if idle_slope =
> > 255, it should be set to 256. However min(idle_slope, 128U) is not as
> expected.
> 
> So you say that for for >= 128 it must be a multiple of 128 and for <
> 128 it must be pow-2? Then I did misread your code a bit, sorry.
> But then my propo regarding adding round_closest() applies here as well.
> 
yes.

> >> I think `fls() - 1` is preferred over `__fls()` since it may go UB.
> >> And some checks wouldn't hurt.
> >>
> > I used fls() at first, but later changed to __fls. Now that you
> > pointed out its possible risks, I'll change it back. Thanks.
> >
> >>> +	power = BIT(msb);
> >>
> >> Oh okay. Then rounddown_pow_of_two() is what you're looking for.
> >>
> >> 	power = rounddown_pow_of_two(idle_slope);
> >>
> >> Or even just use one variable, @idle_slope.
> >>
> > Thanks for the reminder, I think I should use roundup_pow_of_two().
> 
> But your code does what rounddown_pow_of_two() does, not roundup.
> Imagine that you have 0b1111, then your code will turn it into 0b1000, not
> 0b10000. Or am I missing something?
> 
0b1111 is nearest to 0b10000, so it should be turned into 0x10000.

> >
> >>> +	idle_slope = DIV_ROUND_CLOSEST(idle_slope, power) * power;
> >>> +
> >>> +	return idle_slope;
> >>
> >> You can return DIV_ROUND_ ... right away, without assignning first.
> >> Also, I'm thinking of that this might be a generic helper. We have
> >> roundup() and rounddown(), this could be something like "round_closest()"?
> >>
> >>> +}
> >>> +
> >>> +static void fec_enet_set_cbs_idle_slope(struct fec_enet_private
> >>> +*fep) {
> >>> +	u32 bw, val, idle_slope;
> >>> +	int speed = fep->speed;
> >>> +	int idle_slope_sum = 0;
> >>> +	int i;
> >>
> >> Can any of them be negative?
> >>
> > No.
> 
> So u32 for them.
> 
> >
> >>> +
> >>> +	if (!speed)
> >>> +		return;
> >>> +
> >>> +	for (i = 1; i < FEC_ENET_MAX_TX_QS; i++) {
> >>
> >> So you don't use the zeroth array elements at all? Why having them then?
> >>
> > Yes, the zeroth indicates queue 0, and queue 0 only support Non-AVB traffic.
> > Why having them then?
> > Firstly I think it more clear that the i indicates the index of queue.
> > Secondly, it is for more convenience. If you think it is
> > inappropriate, I will amend it later.
> 
> Well, it's not recommended to allocate some space you will never use.
> 
Okay, I'll fix it, thanks.
> >
> >>> +		int port_tx_rate;
> >>
> >> (same for type)
> >>
> >>> +
> >>> +		/* As defined in IEEE 802.1Q-2014 Section 8.6.8.2 item:
> >>> +		 *       sendslope = idleslope -  port_tx_rate
> >>> +		 * So we need to check whether port_tx_rate is equal to
> >>> +		 * the current link rate.
> >>
> >> [...]
> >>
> >>> +	for (i = 1; i < FEC_ENET_MAX_TX_QS; i++) {
> >>> +		bw = fep->cbs.idleslope[i] / (speed * 10);
> >>> +		idle_slope = fec_enet_get_idle_slope(bw);
> >>> +
> >>> +		val = readl(fep->hwp + FEC_DMA_CFG(i));
> >>> +		val &= ~IDLE_SLOPE_MASK;
> >>> +		val |= idle_slope & IDLE_SLOPE_MASK;
> >>
> >> u32_replace_bits() will do it for you.
> >>
> > Sorry, I can't find the definition of u32_replace_bits().
> 
> Because Elixir doesn't index functions generated via macros. See the end of
> <linux/bitfield.h>, this is where the whole family gets defined.
> 
Thanks, I got it.

> >
> >>> +		writel(val, fep->hwp + FEC_DMA_CFG(i));
> >>> +	}
> >>> +
> >>> +	/* Enable Credit-based shaper. */
> >>> +	val = readl(fep->hwp + FEC_QOS_SCHEME);
> >>> +	val &= ~FEC_QOS_TX_SHEME_MASK;
> >>> +	val |= CREDIT_BASED_SCHEME;
> >>
> >> (same)
> >>
> >>> +	writel(val, fep->hwp + FEC_QOS_SCHEME); }
> >>> +
> >>> +static int fec_enet_setup_tc_cbs(struct net_device *ndev, void
> >>> +*type_data) {
> >>> +	struct fec_enet_private *fep = netdev_priv(ndev);
> >>> +	struct tc_cbs_qopt_offload *cbs = type_data;
> >>> +	int queue = cbs->queue;
> >>> +	int speed = fep->speed;
> >>> +	int queue2;
> >>
> >> (types)
> >>
> >>> +
> >>> +	if (!(fep->quirks & FEC_QUIRK_HAS_AVB))
> >>> +		return -EOPNOTSUPP;
> >>> +
> >>> +	/* Queue 1 for Class A, Queue 2 for Class B, so the ENET must
> >>> +	 * have three queues.
> >>> +	 */
> >>> +	if (fep->num_tx_queues != FEC_ENET_MAX_TX_QS)
> >>> +		return -EOPNOTSUPP;
> >>> +
> >>> +	if (!speed) {
> >>> +		netdev_err(ndev, "Link speed is 0!\n");
> >>
> >> ??? Is this possible? If so, why is it checked only here and why can
> >> it be possible?
> >>
> > It's possible if the board bootup without cable.
> 
> Then it shouldn't be an error -- no link partner is a regular flow, not something
> horrendous.
yes, I'll fix it.
> 
> >
> >>> +		return -ECANCELED;
> >>
> >> (already mentioned in other review)
> >>
> >>> +	}
> >>> +
> >>> +	/* Queue 0 is not AVB capable */
> >>> +	if (queue <= 0 || queue >= fep->num_tx_queues) {
> >>
> >> Is `< 0` possible? I realize it's `s32`, just curious.
> >>
> > If the wrong parameter is entered and the app does not check the
> > value, It might be 0. I'm not sure, just in case.
> 
> Ah okay, so it's a check for userspace sanity. Then it should stay.
> 
> >
> >>> +		netdev_err(ndev, "The queue: %d is invalid!\n", queue);
> >>
> >> Maybe less emotions? There's no point in having `!` at the end of every
> error.
> >>
> > OK, it fine to remove '!'.
> >
> >>> +		return -EINVAL;
> >>> +	}
> >>> +
> >>> +	if (!cbs->enable) {
> >>> +		u32 val;
> >>> +
> >>> +		val = readl(fep->hwp + FEC_QOS_SCHEME);
> >>> +		val &= ~FEC_QOS_TX_SHEME_MASK;
> >>> +		val |= ROUND_ROBIN_SCHEME;
> >>
> >> (u32_replace_bits())
> >>
> >>> +		writel(val, fep->hwp + FEC_QOS_SCHEME);
> >>> +
> >>> +		memset(&fep->cbs, 0, sizeof(fep->cbs));
> >>> +
> >>> +		return 0;
> >>> +	}
> >>> +
> >>> +	if (cbs->idleslope - cbs->sendslope != speed * 1000 ||
> >>> +	    cbs->idleslope <= 0 || cbs->sendslope >= 0)
> >>
> >> So you check slopes here, why check them above then?
> >>
> > Because this function will be invoked due to some events in
> > fec_restart too, so I added these checks here. If you think it is
> > redundant, I will move these checks to fec_restart. Thanks.
> 
> Maybe you could make a small static inline and use it where appropriate
> instead of open-coding?
> 
That is a good suggestion, thanks.
> >
> >
> >>> +		return -EINVAL;
> >>> +
> >>> +	/* Another AVB queue */
> >>> +	queue2 = (queue == 1) ? 2 : 1;
> >>
> >> Braces are unneeded.
> >>
> >>> +	if (cbs->idleslope + fep->cbs.idleslope[queue2] > speed * 1000) {
> >>> +		netdev_err(ndev,
> >>> +			   "The sum of all idle slope can't exceed link speed!\n");
> >>> +		return -EINVAL;
> >>> +	}
> >> [...]
> >>
> >> Thanks,
> >> Olek
> Thanks,
> Olek

Alexander Lobakin Feb. 16, 2023, 3:28 p.m. UTC | #13

From: Wei Fang <wei.fang@nxp.com>
Date: Thu, 16 Feb 2023 13:03:37 +0000

> 
>> -----Original Message-----
>> From: Alexander Lobakin <alexandr.lobakin@intel.com>
>> Sent: 2023年2月15日 0:49
>> To: Wei Fang <wei.fang@nxp.com>
>> Cc: Shenwei Wang <shenwei.wang@nxp.com>; Clark Wang
>> <xiaoning.wang@nxp.com>; davem@davemloft.net; edumazet@google.com;
>> kuba@kernel.org; pabeni@redhat.com; simon.horman@corigine.com;
>> andrew@lunn.ch; netdev@vger.kernel.org; dl-linux-imx <linux-imx@nxp.com>;
>> linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH V2 net-next] net: fec: add CBS offload support
>>
>> From: Wei Fang <wei.fang@nxp.com>
>> Date: Tue, 14 Feb 2023 09:34:09 +0000

[...]

>>>>> +	int idleslope[FEC_ENET_MAX_TX_QS];
>>>>> +	int sendslope[FEC_ENET_MAX_TX_QS];
>>>>
>>>> Can they actually be negative? (probably I'll see it below)
>>>>
>>> idleslope and sendslope are used to save the parameters passed in from user
>> space.
>>> And the sendslope should always be negative.
>>
>> Parameters coming from userspace must be validated before saving them
>> anywhere.
>> Also, if sendslope is always negative as you say, then just negate it when you
>> read it from userspace and store as u32.
> Sorry, I still don't understand why u32 is necessary to store parameters from user
> space? In addition, people who understand 802.1Qav may be confused when they
> see that sendslope is a u32 type.

I didn't say you need to store any userspace param as unsigned, I only
say that there's no point in using signed types when the value range
belongs to only either positive or negative space.
I'm not insisting in this particular case, I guess you pick what should
look more intuitive here.

> 
>>
>>>
>>>>> +};
>>>>> +
>>>>>  /* The FEC buffer descriptors track the ring buffers.  The rx_bd_base and

[...]

>>>> Oh okay. Then rounddown_pow_of_two() is what you're looking for.
>>>>
>>>> 	power = rounddown_pow_of_two(idle_slope);
>>>>
>>>> Or even just use one variable, @idle_slope.
>>>>
>>> Thanks for the reminder, I think I should use roundup_pow_of_two().
>>
>> But your code does what rounddown_pow_of_two() does, not roundup.
>> Imagine that you have 0b1111, then your code will turn it into 0b1000, not
>> 0b10000. Or am I missing something?
>>
> 0b1111 is nearest to 0b10000, so it should be turned into 0x10000.

fls() + BIT() won't give you the *nearest* pow-2, have you checked what
your code does return? Check with 0xff and then 0x101 and you'll be
surprised, it doesn't work that way.

I'd highly suggest you introducing not only round_closest(), but also
round_closest_pow_of_two(), as your driver might not be the sole user of
such generic functionality.

> 
>>>
>>>>> +	idle_slope = DIV_ROUND_CLOSEST(idle_slope, power) * power;
>>>>> +
>>>>> +	return idle_slope;
>>>>
>>>> You can return DIV_ROUND_ ... right away, without assignning first.
>>>> Also, I'm thinking of that this might be a generic helper. We have
>>>> roundup() and rounddown(), this could be something like "round_closest()"?
[...]

Thanks,
Olek

Wei Fang Feb. 17, 2023, 2:18 a.m. UTC | #14

> -----Original Message-----
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Sent: 2023年2月16日 23:28
> To: Wei Fang <wei.fang@nxp.com>
> Cc: Shenwei Wang <shenwei.wang@nxp.com>; Clark Wang
> <xiaoning.wang@nxp.com>; davem@davemloft.net; edumazet@google.com;
> kuba@kernel.org; pabeni@redhat.com; simon.horman@corigine.com;
> andrew@lunn.ch; netdev@vger.kernel.org; dl-linux-imx <linux-imx@nxp.com>;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH V2 net-next] net: fec: add CBS offload support
> 
> From: Wei Fang <wei.fang@nxp.com>
> Date: Thu, 16 Feb 2023 13:03:37 +0000
> 
> >
> >> -----Original Message-----
> >> From: Alexander Lobakin <alexandr.lobakin@intel.com>
> >> Sent: 2023年2月15日 0:49
> >> To: Wei Fang <wei.fang@nxp.com>
> >> Cc: Shenwei Wang <shenwei.wang@nxp.com>; Clark Wang
> >> <xiaoning.wang@nxp.com>; davem@davemloft.net;
> edumazet@google.com;
> >> kuba@kernel.org; pabeni@redhat.com; simon.horman@corigine.com;
> >> andrew@lunn.ch; netdev@vger.kernel.org; dl-linux-imx
> >> <linux-imx@nxp.com>; linux-kernel@vger.kernel.org
> >> Subject: Re: [PATCH V2 net-next] net: fec: add CBS offload support
> >>
> >> From: Wei Fang <wei.fang@nxp.com>
> >> Date: Tue, 14 Feb 2023 09:34:09 +0000
> 
> [...]
> >>>> Oh okay. Then rounddown_pow_of_two() is what you're looking for.
> >>>>
> >>>> 	power = rounddown_pow_of_two(idle_slope);
> >>>>
> >>>> Or even just use one variable, @idle_slope.
> >>>>
> >>> Thanks for the reminder, I think I should use roundup_pow_of_two().
> >>
> >> But your code does what rounddown_pow_of_two() does, not roundup.
> >> Imagine that you have 0b1111, then your code will turn it into
> >> 0b1000, not 0b10000. Or am I missing something?
> >>
> > 0b1111 is nearest to 0b10000, so it should be turned into 0x10000.
> 
> fls() + BIT() won't give you the *nearest* pow-2, have you checked what your
> code does return? Check with 0xff and then 0x101 and you'll be surprised, it
> doesn't work that way.
> 
My real intention is, if x >= 1.5 * 2 ^ (fls(x) - 1), then x = 2 ^ fls(x), otherwise,
x = 2 ^ (fls(x) - 1). Anyway, I'll check the final implementation can meet my
expectation.

> I'd highly suggest you introducing not only round_closest(), but also
> round_closest_pow_of_two(), as your driver might not be the sole user of such
> generic functionality.
> 
Yes, I agree with you, it's better to use the generic functionality.

> >
> >>>
> >>>>> +	idle_slope = DIV_ROUND_CLOSEST(idle_slope, power) * power;
> >>>>> +
> >>>>> +	return idle_slope;
> >>>>
> >>>> You can return DIV_ROUND_ ... right away, without assignning first.
> >>>> Also, I'm thinking of that this might be a generic helper. We have
> >>>> roundup() and rounddown(), this could be something like
> "round_closest()"?
> [...]
> 
> Thanks,
> Olek

Andrew Lunn Feb. 18, 2023, 1:15 a.m. UTC | #15

> I have tested the pure software CBS today. And below are the test steps and results.
> Link speed 100Mbps.
> Queue 0: Non-CBS queue, 100Mbps traffic.
> Queue 1: CBS queue, 7Mbps bandwidth and 8Mbps traffic.
> Queue 2: CBS queue, 5Mbps bandwidth and 6Mbps traffic.
> Results: queue 0 egress rate is 86Mbps, queue 1 egress rate is 6Mbps, and queue 2
> egress rate is 4Mbps.
> Then change the link speed to 10Mbps, queue 0 egress rate is 4Mbps, queue 1 egress
> rate is 4Mbps, and queue 2 egress rate is 3Mbps.
>
> Beside the test results, I also checked the CBS codes. Unlike hardware implementation,
> the pure software method is more flexible, it has four parameters: idleslope, sendslope,
> locredit and hicredit. And it can detect the change of link speed and do some adjust.
> However, for hardware we only use the idleslope parameter. It's hard for us to make
> the hardware behave as the pure software when the link speed changes.
> So for the question: Should the hardware just give up and go back to default behaviour,
> or should it continue to do some CBS?

If you give up on hardware CBS, does the software version take over?

The idea of hardware offload is that the user should not care, nor
really notice. You want the software and hardware behaviour to be
similar.

> I think that we can refer to the behaviors of stmmac and enetc drivers, just keep the
> bandwidth ratio constant when the link rate changes. In addition, the link speed change
> is a corner case, there is no need to spend any more effort to discuss this matter.

It is a corner case, but it is an important one. You need it to do
something sensible. Giving up all together is not sensible. Falling
back to software CBS would be sensible, or supporting something
similar to the software CBS.

	Andrew

Wei Fang Feb. 18, 2023, 1:59 a.m. UTC | #16

> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: 2023年2月18日 9:16
> To: Wei Fang <wei.fang@nxp.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>; Shenwei Wang
> <shenwei.wang@nxp.com>; Clark Wang <xiaoning.wang@nxp.com>;
> davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com; simon.horman@corigine.com; netdev@vger.kernel.org;
> dl-linux-imx <linux-imx@nxp.com>; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH V2 net-next] net: fec: add CBS offload support
> 
> > I have tested the pure software CBS today. And below are the test steps and
> results.
> > Link speed 100Mbps.
> > Queue 0: Non-CBS queue, 100Mbps traffic.
> > Queue 1: CBS queue, 7Mbps bandwidth and 8Mbps traffic.
> > Queue 2: CBS queue, 5Mbps bandwidth and 6Mbps traffic.
> > Results: queue 0 egress rate is 86Mbps, queue 1 egress rate is 6Mbps,
> > and queue 2 egress rate is 4Mbps.
> > Then change the link speed to 10Mbps, queue 0 egress rate is 4Mbps,
> > queue 1 egress rate is 4Mbps, and queue 2 egress rate is 3Mbps.
> >
> > Beside the test results, I also checked the CBS codes. Unlike hardware
> > implementation, the pure software method is more flexible, it has four
> > parameters: idleslope, sendslope, locredit and hicredit. And it can detect the
> change of link speed and do some adjust.
> > However, for hardware we only use the idleslope parameter. It's hard
> > for us to make the hardware behave as the pure software when the link
> speed changes.
> > So for the question: Should the hardware just give up and go back to
> > default behaviour, or should it continue to do some CBS?
> 
> If you give up on hardware CBS, does the software version take over?
> 
No, because the cbs offload flag is enabled in CBS driver, unless configure cbs offload
disable with tc command.

> The idea of hardware offload is that the user should not care, nor really notice.
> You want the software and hardware behaviour to be similar.
> 
> > I think that we can refer to the behaviors of stmmac and enetc
> > drivers, just keep the bandwidth ratio constant when the link rate
> > changes. In addition, the link speed change is a corner case, there is no need
> to spend any more effort to discuss this matter.
> 
> It is a corner case, but it is an important one. You need it to do something
> sensible. Giving up all together is not sensible. Falling back to software CBS
> would be sensible, or supporting something similar to the software CBS.
> 
Unfortunately, FEC IP itself not follows the standard 802.1Qav spec completely.
We only can program DMAnCFG[IDLE_SLOPE] field to calculate bandwidth fraction.
And IDLE_SLOPE is restricted to certain values. It's far away from CBS QDisc implemented
in Linux TC framework. It is more difficult to guarantee similar software behavior when
the link speed changes.
If the method of keeping the bandwidth ratio unchanged is not sensible, I can only give up
CBS offload and use pure software CBS. :(

Andrew Lunn Feb. 21, 2023, 2:25 a.m. UTC | #17

> Unfortunately, FEC IP itself not follows the standard 802.1Qav spec completely.
> We only can program DMAnCFG[IDLE_SLOPE] field to calculate bandwidth fraction.
> And IDLE_SLOPE is restricted to certain values. It's far away from CBS QDisc implemented
> in Linux TC framework. It is more difficult to guarantee similar software behavior when
> the link speed changes.
> If the method of keeping the bandwidth ratio unchanged is not sensible, I can only give up
> CBS offload and use pure software CBS. :(

I know the hardware supports less parameters. But if you restrict the
CBS configuration such that the parameters you don't support are 0,
can you accurately implement what you do support?

If the user configures TC such that it matches the hardware
capabilities, you can accept the offload. If not return -EOPNOTSUPP,
and the software CBS should take over. And then you need to clearly
document what parameters can be set for hardware offload to work.

	 Andrew

diff mbox series

Patch

diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 5ba1e0d71c68..5383681ac273 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -340,6 +340,10 @@  struct bufdesc_ex {
 #define RCMR_CMP(X)		(((X) == 1) ? RCMR_CMP_1 : RCMR_CMP_2)
 #define FEC_TX_BD_FTYPE(X)	(((X) & 0xf) << 20)
 
+#define FEC_QOS_TX_SHEME_MASK	GENMASK(2, 0)
+#define CREDIT_BASED_SCHEME	0
+#define ROUND_ROBIN_SCHEME	1
+
 /* The number of Tx and Rx buffers.  These are allocated from the page
  * pool.  The code may assume these are power of two, so it it best
  * to keep them that size.
@@ -571,6 +575,12 @@  struct fec_stop_mode_gpr {
 	u8 bit;
 };
 
+struct fec_cbs_params {
+	bool enable[FEC_ENET_MAX_TX_QS];
+	int idleslope[FEC_ENET_MAX_TX_QS];
+	int sendslope[FEC_ENET_MAX_TX_QS];
+};
+
 /* The FEC buffer descriptors track the ring buffers.  The rx_bd_base and
  * tx_bd_base always point to the base of the buffer descriptors.  The
  * cur_rx and cur_tx point to the currently available buffer.
@@ -679,6 +689,9 @@  struct fec_enet_private {
 	/* XDP BPF Program */
 	struct bpf_prog *xdp_prog;
 
+	/* CBS parameters */
+	struct fec_cbs_params cbs;
+
 	u64 ethtool_stats[];
 };
 
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index c73e25f8995e..91394ad05121 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -66,6 +66,7 @@ 
 #include <linux/mfd/syscon.h>
 #include <linux/regmap.h>
 #include <soc/imx/cpuidle.h>
+#include <net/pkt_sched.h>
 #include <linux/filter.h>
 #include <linux/bpf.h>
 
@@ -1023,6 +1024,174 @@  static void fec_enet_reset_skb(struct net_device *ndev)
 	}
 }
 
+static u32 fec_enet_get_idle_slope(u8 bw)
+{
+	int msb, power;
+	u32 idle_slope;
+
+	if (bw >= 100)
+		return 0;
+
+	/* Convert bw to hardware idle slope */
+	idle_slope = (512 * bw) / (100 - bw);
+
+	if (idle_slope >= 128) {
+		/* For values greater than or equal to 128, idle_slope
+		 * rounded to the nearest multiple of 128.
+		 */
+		idle_slope = DIV_ROUND_CLOSEST(idle_slope, 128U) * 128U;
+
+		return idle_slope;
+	}
+
+	/* For values less than 128, idle_slope is rounded to
+	 * nearst power of 2.
+	 */
+	if (idle_slope <= 1)
+		return 1;
+
+	msb = __fls(idle_slope);
+	power = BIT(msb);
+	idle_slope = DIV_ROUND_CLOSEST(idle_slope, power) * power;
+
+	return idle_slope;
+}
+
+static void fec_enet_set_cbs_idle_slope(struct fec_enet_private *fep)
+{
+	u32 bw, val, idle_slope;
+	int speed = fep->speed;
+	int idle_slope_sum = 0;
+	int i;
+
+	if (!speed)
+		return;
+
+	for (i = 1; i < FEC_ENET_MAX_TX_QS; i++) {
+		int port_tx_rate;
+
+		/* As defined in IEEE 802.1Q-2014 Section 8.6.8.2 item:
+		 *       sendslope = idleslope -  port_tx_rate
+		 * So we need to check whether port_tx_rate is equal to
+		 * the current link rate.
+		 */
+		port_tx_rate = fep->cbs.idleslope[i] - fep->cbs.sendslope[i];
+		if (port_tx_rate != speed * 1000)
+			return;
+
+		idle_slope_sum += fep->cbs.idleslope[i];
+	}
+
+	/* The all bandwidth of Queue 1 and Queue 2 can't greater than
+	 * the link rate.
+	 */
+	if (idle_slope_sum > speed * 1000)
+		return;
+
+	/* idleslope is in kilobits per second.
+	 * speed is the port rate in megabits per second.
+	 * So bandwidth the ratio, bw, is idleslope / (speed * 1000) * 100,
+	 * the unit of bw is percentage.
+	 */
+	for (i = 1; i < FEC_ENET_MAX_TX_QS; i++) {
+		bw = fep->cbs.idleslope[i] / (speed * 10);
+		idle_slope = fec_enet_get_idle_slope(bw);
+
+		val = readl(fep->hwp + FEC_DMA_CFG(i));
+		val &= ~IDLE_SLOPE_MASK;
+		val |= idle_slope & IDLE_SLOPE_MASK;
+		writel(val, fep->hwp + FEC_DMA_CFG(i));
+	}
+
+	/* Enable Credit-based shaper. */
+	val = readl(fep->hwp + FEC_QOS_SCHEME);
+	val &= ~FEC_QOS_TX_SHEME_MASK;
+	val |= CREDIT_BASED_SCHEME;
+	writel(val, fep->hwp + FEC_QOS_SCHEME);
+}
+
+static int fec_enet_setup_tc_cbs(struct net_device *ndev, void *type_data)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	struct tc_cbs_qopt_offload *cbs = type_data;
+	int queue = cbs->queue;
+	int speed = fep->speed;
+	int queue2;
+
+	if (!(fep->quirks & FEC_QUIRK_HAS_AVB))
+		return -EOPNOTSUPP;
+
+	/* Queue 1 for Class A, Queue 2 for Class B, so the ENET must
+	 * have three queues.
+	 */
+	if (fep->num_tx_queues != FEC_ENET_MAX_TX_QS)
+		return -EOPNOTSUPP;
+
+	if (!speed) {
+		netdev_err(ndev, "Link speed is 0!\n");
+		return -ECANCELED;
+	}
+
+	/* Queue 0 is not AVB capable */
+	if (queue <= 0 || queue >= fep->num_tx_queues) {
+		netdev_err(ndev, "The queue: %d is invalid!\n", queue);
+		return -EINVAL;
+	}
+
+	if (!cbs->enable) {
+		u32 val;
+
+		val = readl(fep->hwp + FEC_QOS_SCHEME);
+		val &= ~FEC_QOS_TX_SHEME_MASK;
+		val |= ROUND_ROBIN_SCHEME;
+		writel(val, fep->hwp + FEC_QOS_SCHEME);
+
+		memset(&fep->cbs, 0, sizeof(fep->cbs));
+
+		return 0;
+	}
+
+	if (cbs->idleslope - cbs->sendslope != speed * 1000 ||
+	    cbs->idleslope <= 0 || cbs->sendslope >= 0)
+		return -EINVAL;
+
+	/* Another AVB queue */
+	queue2 = (queue == 1) ? 2 : 1;
+	if (cbs->idleslope + fep->cbs.idleslope[queue2] > speed * 1000) {
+		netdev_err(ndev,
+			   "The sum of all idle slope can't exceed link speed!\n");
+		return -EINVAL;
+	}
+
+	fep->cbs.enable[queue] = true;
+	fep->cbs.idleslope[queue] = cbs->idleslope;
+	fep->cbs.sendslope[queue] = cbs->sendslope;
+	/* We need to configure the credit-based shaper of hardware after
+	 * the CBS parameters of queue 1 and queue 2 are both configured.
+	 * Avoid parameter conflicts between queue 1 and queue 2, causing
+	 * one of the queues to fail to be configured. Additionally, once
+	 * the FEC_QOS_SCHEME field is set to credit-based scheme, queue 1
+	 * and queue 2 are taking effective as AVB queues immediately. So
+	 * it's better to set credit-based shaper after both queues are
+	 * configured.
+	 */
+	if (fep->cbs.enable[queue2])
+		fec_enet_set_cbs_idle_slope(fep);
+
+	return 0;
+}
+
+static int fec_enet_setup_tc(struct net_device *ndev, enum tc_setup_type type,
+			     void *type_data)
+{
+	switch (type) {
+	case TC_SETUP_QDISC_CBS:
+		return fec_enet_setup_tc_cbs(ndev, type_data);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
 /*
  * This function is called to start or restart the FEC during a link
  * change, transmit timeout, or to reconfigure the FEC.  The network
@@ -1173,6 +1342,15 @@  fec_restart(struct net_device *ndev)
 
 	writel(rcntl, fep->hwp + FEC_R_CNTRL);
 
+	/* We need to reconfigure the CBS due to some events will cause
+	 * the MAC reset such as link change, transmit timeout, checksum
+	 * feature change and so on.
+	 */
+	if (fep->quirks & FEC_QUIRK_HAS_AVB &&
+	    fep->num_tx_queues == FEC_ENET_MAX_TX_QS &&
+	    fep->cbs.enable[1] && fep->cbs.enable[2])
+		fec_enet_set_cbs_idle_slope(fep);
+
 	/* Setup multicast filter. */
 	set_multicast_list(ndev);
 #ifndef CONFIG_M5272
@@ -3882,6 +4060,7 @@  static const struct net_device_ops fec_netdev_ops = {
 	.ndo_tx_timeout		= fec_timeout,
 	.ndo_set_mac_address	= fec_set_mac_address,
 	.ndo_eth_ioctl		= fec_enet_ioctl,
+	.ndo_setup_tc	= fec_enet_setup_tc,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller	= fec_poll_controller,
 #endif