[net-next,3/3] net: stmmac: increase TX coalesce timer to 5ms

Message ID 20230922111247.497-3-ansuelsmth@gmail.com
State New
Series [net-next,1/3] net: introduce napi_is_scheduled helper

Commit Message

Christian Marangi Sept. 22, 2023, 11:12 a.m. UTC
  Commit 8fce33317023 ("net: stmmac: Rework coalesce timer and fix
multi-queue races") decreased the TX coalesce timer from 40ms to 1ms.

This caused a performance regression on some targets (reported at least
on ipq806x) on the order of 600mbps, dropping from gigabit handling to
only 200mbps.

The problem was identified as the TX timer being armed too often. While
this was fixed and improved in another commit, performance can be
improved even further by increasing the timer delay a bit, moving from
1ms to 5ms.

The value is a good balance between saving power, by preventing too
many interrupts from being generated, and permitting good performance
for internet-oriented devices.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
---
 drivers/net/ethernet/stmicro/stmmac/common.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Andrew Lunn Sept. 22, 2023, 12:28 p.m. UTC | #1
On Fri, Sep 22, 2023 at 01:12:47PM +0200, Christian Marangi wrote:
> Commit 8fce33317023 ("net: stmmac: Rework coalesce timer and fix
> multi-queue races") decreased the TX coalesce timer from 40ms to 1ms.
> 
> This caused a performance regression on some targets (reported at least
> on ipq806x) on the order of 600mbps, dropping from gigabit handling to
> only 200mbps.
> 
> The problem was identified as the TX timer being armed too often. While
> this was fixed and improved in another commit, performance can be
> improved even further by increasing the timer delay a bit, moving from
> 1ms to 5ms.
> 
> The value is a good balance between saving power, by preventing too
> many interrupts from being generated, and permitting good performance
> for internet-oriented devices.

ethtool has settings you can use for this:

      ethtool -C|--coalesce devname [adaptive-rx on|off] [adaptive-tx on|off]
              [rx-usecs N] [rx-frames N] [rx-usecs-irq N] [rx-frames-irq N]
              [tx-usecs N] [tx-frames N] [tx-usecs-irq N] [tx-frames-irq N]
              [stats-block-usecs N] [pkt-rate-low N] [rx-usecs-low N]
              [rx-frames-low N] [tx-usecs-low N] [tx-frames-low N]
              [pkt-rate-high N] [rx-usecs-high N] [rx-frames-high N]
              [tx-usecs-high N] [tx-frames-high N] [sample-interval N]
              [cqe-mode-rx on|off] [cqe-mode-tx on|off] [tx-aggr-max-bytes N]
              [tx-aggr-max-frames N] [tx-aggr-time-usecs N]

If this is not implemented, I suggest you add support for it.

Changing the default might cause regressions. Say there is a VoIP
application that wants this low latency. It would be safer to allow
user space to configure it as wanted.
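
For example (purely illustrative, assuming the interface is named eth0),
the equivalent of this patch could be applied from user space with:

      ethtool -C eth0 tx-usecs 5000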

     Andrew
  
Christian Marangi Sept. 22, 2023, 12:39 p.m. UTC | #2
On Fri, Sep 22, 2023 at 02:28:06PM +0200, Andrew Lunn wrote:
> On Fri, Sep 22, 2023 at 01:12:47PM +0200, Christian Marangi wrote:
> > Commit 8fce33317023 ("net: stmmac: Rework coalesce timer and fix
> > multi-queue races") decreased the TX coalesce timer from 40ms to 1ms.
> > 
> > This caused a performance regression on some targets (reported at least
> > on ipq806x) on the order of 600mbps, dropping from gigabit handling to
> > only 200mbps.
> > 
> > The problem was identified as the TX timer being armed too often. While
> > this was fixed and improved in another commit, performance can be
> > improved even further by increasing the timer delay a bit, moving from
> > 1ms to 5ms.
> > 
> > The value is a good balance between saving power, by preventing too
> > many interrupts from being generated, and permitting good performance
> > for internet-oriented devices.
> 
> ethtool has settings you can use for this:
> 
>       ethtool -C|--coalesce devname [adaptive-rx on|off] [adaptive-tx on|off]
>               [rx-usecs N] [rx-frames N] [rx-usecs-irq N] [rx-frames-irq N]
>               [tx-usecs N] [tx-frames N] [tx-usecs-irq N] [tx-frames-irq N]
>               [stats-block-usecs N] [pkt-rate-low N] [rx-usecs-low N]
>               [rx-frames-low N] [tx-usecs-low N] [tx-frames-low N]
>               [pkt-rate-high N] [rx-usecs-high N] [rx-frames-high N]
>               [tx-usecs-high N] [tx-frames-high N] [sample-interval N]
>               [cqe-mode-rx on|off] [cqe-mode-tx on|off] [tx-aggr-max-bytes N]
>               [tx-aggr-max-frames N] [tx-aggr-time-usecs N]
> 
> If this is not implemented, I suggest you add support for it.
> 
> Changing the default might cause regressions. Say there is a VoIP
> application that wants this low latency. It would be safer to allow
> user space to configure it as wanted.
>
>

Yep, stmmac already supports it. The idea here was to avoid falling
back to ethtool and to find a good default value.

Just for reference, before that commit the value was set to 40ms and
nobody ever pointed out a regression with VoIP applications. With some
testing I found 5ms to be a small increase that restores the original
perf and should not cause any regression.

(for reference, keeping this at 1ms causes a loss of about 100-200mbps)
(also, the tx timer implementation was created before any napi poll logic
and before dma interrupt handling was a thing; with the latter change I
expect this timer to be very little used in VoIP scenarios or similar
continuous-traffic cases, as napi will take care of handling packets)

Aside from these reasons, I totally get the concern and am totally OK
with this not getting applied; it was just an idea to push for a common
value.

I just preferred to handle this here instead of via a userspace script :(
(the important part is the previous patch)
  
Dave Taht Sept. 22, 2023, 8:02 p.m. UTC | #3
On Fri, Sep 22, 2023 at 5:39 AM Christian Marangi <ansuelsmth@gmail.com> wrote:
>
> On Fri, Sep 22, 2023 at 02:28:06PM +0200, Andrew Lunn wrote:
> > On Fri, Sep 22, 2023 at 01:12:47PM +0200, Christian Marangi wrote:
> > > Commit 8fce33317023 ("net: stmmac: Rework coalesce timer and fix
> > > multi-queue races") decreased the TX coalesce timer from 40ms to 1ms.
> > >
> > > This caused a performance regression on some targets (reported at least
> > > on ipq806x) on the order of 600mbps, dropping from gigabit handling to
> > > only 200mbps.
> > >
> > > The problem was identified as the TX timer being armed too often. While
> > > this was fixed and improved in another commit, performance can be
> > > improved even further by increasing the timer delay a bit, moving from
> > > 1ms to 5ms.

I am always looking for ways to improve interrupt service time, rather
than papering over the problem by increasing batchiness.

http://www.taht.net/~d/broadcom_aug9_2018.pdf

But I am also looking for hard data, particularly on observed power
savings. How much power does upping this number save?

I have tried to question other assumptions more modern kernels are
making. In particular, I wish more folk would experiment with decreasing
the overlarge (IMHO) NAPI default of 64 packets to, say, 8 in the mq
case, benefiting the many arm cores still equipped with limited cache,
and would look at the impact of TLB flushes. Other deferred multi-core
processing that looks good on a modern xeon, but might not be so good
on a more limited arm, also worries me.
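
(A sketch of what I mean, not a patch: a driver wanting to try a smaller
budget could register its NAPI instance with a non-default weight; the
dev/napi/poll names here are illustrative.)

	/* Poll at most 8 packets per NAPI run instead of the default 64. */
	netif_napi_add_weight(dev, &priv->napi, my_poll, 8);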

Over here there was an enormous test series recently run against a
bunch of older arm64s, which appears to indicate that memory bandwidth
is a source of problems:

https://docs.google.com/document/d/1HxIU_TEBI6xG9jRHlr8rzyyxFEN43zMcJXUFlRuhiUI/edit

We are looking to add more devices to that testbed.

> > >
> > > The value is a good balance between saving power, by preventing too
> > > many interrupts from being generated, and permitting good performance
> > > for internet-oriented devices.
> >
> > ethtool has settings you can use for this:
> >
> >       ethtool -C|--coalesce devname [adaptive-rx on|off] [adaptive-tx on|off]
> >               [rx-usecs N] [rx-frames N] [rx-usecs-irq N] [rx-frames-irq N]
> >               [tx-usecs N] [tx-frames N] [tx-usecs-irq N] [tx-frames-irq N]
> >               [stats-block-usecs N] [pkt-rate-low N] [rx-usecs-low N]
> >               [rx-frames-low N] [tx-usecs-low N] [tx-frames-low N]
> >               [pkt-rate-high N] [rx-usecs-high N] [rx-frames-high N]
> >               [tx-usecs-high N] [tx-frames-high N] [sample-interval N]
> >               [cqe-mode-rx on|off] [cqe-mode-tx on|off] [tx-aggr-max-bytes N]
> >               [tx-aggr-max-frames N] [tx-aggr-time-usecs N]
> >
> > If this is not implemented, I suggest you add support for it.
> >
> > Changing the default might cause regressions. Say there is a VoIP
> > application that wants this low latency. It would be safer to allow
> > user space to configure it as wanted.
> >
>
> Yep, stmmac already supports it. The idea here was to avoid falling
> back to ethtool and to find a good default value.
>
> Just for reference, before that commit the value was set to 40ms and
> nobody ever pointed out a regression with VoIP applications. With some
> testing I found 5ms to be a small increase that restores the original
> perf and should not cause any regression.

Does this driver have BQL?
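
(For reference, BQL support boils down to a pair of accounting hooks in
the TX path; a rough sketch, with illustrative variable names:)

	/* At xmit time: report to BQL how many bytes were queued ... */
	netdev_tx_sent_queue(netdev_get_tx_queue(dev, queue), skb->len);

	/* ... and from the TX clean path: report what completed, so the
	 * stack can size the in-flight byte limit dynamically. */
	netdev_tx_completed_queue(netdev_get_tx_queue(dev, queue),
				  pkts_compl, bytes_compl);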

> (for reference, keeping this at 1ms causes a loss of about 100-200mbps)
> (also, the tx timer implementation was created before any napi poll logic
> and before dma interrupt handling was a thing; with the latter change I
> expect this timer to be very little used in VoIP scenarios or similar
> continuous-traffic cases, as napi will take care of handling packets)

I would be pretty interested in a kernel flame graph of the before vs the after.

> Aside from these reasons, I totally get the concern and am totally OK
> with this not getting applied; it was just an idea to push for a common
> value.

I try to get people to run much longer and more complicated tests, such
as the flent rrul test, to see what kind of damage bigger buffers do to
latency, as well as how other problems might show up. Really notable in
the above test series was how badly various devices behaved over time
on that workload; extremely notable was how badly the jetson performed:

https://github.com/randomizedcoder/cake/blob/2023_09_02/pfifo_fast/jetson.png

And the nanopi was weird.

https://github.com/randomizedcoder/cake/blob/2023_09_02/pfifo_fast/nanopi-neo3.png

> I just preferred to handle this here instead of via a userspace script :(
> (the important part is the previous patch)
>
> --
>         Ansuel
>
--
Oct 30: https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html
Dave Täht CSO, LibreQos
  

Patch

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index 403cb397d4d3..2d9f895c2193 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -290,7 +290,7 @@  struct stmmac_safety_stats {
 #define MIN_DMA_RIWT		0x10
 #define DEF_DMA_RIWT		0xa0
 /* Tx coalesce parameters */
-#define STMMAC_COAL_TX_TIMER	1000
+#define STMMAC_COAL_TX_TIMER	5000
 #define STMMAC_MAX_COAL_TX_TICK	100000
 #define STMMAC_TX_MAX_FRAMES	256
 #define STMMAC_TX_FRAMES	25
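
For context (not part of the patch), a sketch of how this default is
consumed, assuming the usual stmmac pattern of seeding the per-queue
coalesce timer in microseconds and arming it as an hrtimer; field and
variable names here are illustrative, not the exact in-tree code:

	/* The default is applied at driver init time, in microseconds. */
	priv->tx_coal_timer = STMMAC_COAL_TX_TIMER;

	/* It is converted to ktime when (re)arming the TX coalesce timer,
	 * so 5000 here means completions may be deferred by up to 5ms. */
	hrtimer_start(&tx_q->txtimer,
		      ns_to_ktime(priv->tx_coal_timer * NSEC_PER_USEC),
		      HRTIMER_MODE_REL);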