net: sched: print jiffies when transmit queue time out

Message ID 20230419115632.738730-1-yajun.deng@linux.dev
State New
Headers
Series net: sched: print jiffies when transmit queue time out |

Commit Message

Yajun Deng April 19, 2023, 11:56 a.m. UTC
  Although there is watchdog_timeo to let users know when the transmit queue
begin stall, but dev_watchdog() is called with an interval. The jiffies
will always be greater than watchdog_timeo.

To let users know the exact time the stall started, print jiffies when
the transmit queue time out.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
---
 net/sched/sch_generic.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
  

Comments

Eric Dumazet April 19, 2023, 12:02 p.m. UTC | #1
On Wed, Apr 19, 2023 at 1:56 PM Yajun Deng <yajun.deng@linux.dev> wrote:
>
> Although there is watchdog_timeo to let users know when the transmit queue
> begin stall, but dev_watchdog() is called with an interval. The jiffies
> will always be greater than watchdog_timeo.
>
> To let users know the exact time the stall started, print jiffies when
> the transmit queue time out.
>
> Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
> ---


>                                         atomic_long_inc(&txq->trans_timeout);
>                                         break;
>                                 }
> @@ -522,8 +522,9 @@ static void dev_watchdog(struct timer_list *t)
>
>                         if (unlikely(some_queue_timedout)) {
>                                 trace_net_dev_xmit_timeout(dev, i);
> -                               WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
> -                                      dev->name, netdev_drivername(dev), i);
> +                               WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): \
> +                                         transmit queue %u timed out %lu jiffies\n",
> +                                         dev->name, netdev_drivername(dev), i, some_queue_timedout);

If we really want this, I suggest we export a time in ms units, using
jiffies_to_msecs()
  
Jakub Kicinski April 20, 2023, 1:27 a.m. UTC | #2
On Wed, 19 Apr 2023 19:56:32 +0800 Yajun Deng wrote:
> Although there is watchdog_timeo to let users know when the transmit queue
> begin stall, but dev_watchdog() is called with an interval. The jiffies
> will always be greater than watchdog_timeo.
> 
> To let users know the exact time the stall started, print jiffies when
> the transmit queue time out.

Please add an explanation of how this information is useful in practice.
  
Yajun Deng April 20, 2023, 2:17 a.m. UTC | #3
April 20, 2023 9:27 AM, "Jakub Kicinski" <kuba@kernel.org> wrote:

> On Wed, 19 Apr 2023 19:56:32 +0800 Yajun Deng wrote:
> 
>> Although there is watchdog_timeo to let users know when the transmit queue
>> begin stall, but dev_watchdog() is called with an interval. The jiffies
>> will always be greater than watchdog_timeo.
>> 
>> To let users know the exact time the stall started, print jiffies when
>> the transmit queue time out.
> 
> Please add an explanation of how this information is useful in practice.

We found some cases with several warnings. We want to confirm which happened first. 

First warning:
16:37:57 kernel: [ 7100.097547] ------------[ cut here ]------------
16:37:57 kernel: [ 7100.097550] NETDEV WATCHDOG: eno2 (i40e): transmit queue 8 timed out
16:37:57 kernel: [ 7100.097571] WARNING: CPU: 8 PID: 0 at net/sched/sch_generic.c:467 dev_watchdog+0x260/0x270
...

Second warning:
16:38:44 kernel: [ 7147.756952] rcu: INFO: rcu_preempt self-detected stall on CPU
16:38:44 kernel: [ 7147.756958] rcu:   24-....: (59999 ticks this GP) idle=546/1/0x4000000000000000 softirq=367      3137/3673146 fqs=13844
16:38:44 kernel: [ 7147.756960]        (t=60001 jiffies g=4322709 q=133381)
16:38:44 kernel: [ 7147.756962] NMI backtrace for cpu 24
...

As we can see, the transmit queue start stall should be before 16:37:52, the rcu start stall is 16:37:44.
These two times are closer, we want to confirm which happened first.
  
Yajun Deng April 20, 2023, 2:49 a.m. UTC | #4
April 19, 2023 8:02 PM, "Eric Dumazet" <edumazet@google.com> wrote:

> On Wed, Apr 19, 2023 at 1:56 PM Yajun Deng <yajun.deng@linux.dev> wrote:
> 
>> Although there is watchdog_timeo to let users know when the transmit queue
>> begin stall, but dev_watchdog() is called with an interval. The jiffies
>> will always be greater than watchdog_timeo.
>> 
>> To let users know the exact time the stall started, print jiffies when
>> the transmit queue time out.
>> 
>> Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
>> ---
>> 
>> atomic_long_inc(&txq->trans_timeout);
>> break;
>> }
>> @@ -522,8 +522,9 @@ static void dev_watchdog(struct timer_list *t)
>> 
>> if (unlikely(some_queue_timedout)) {
>> trace_net_dev_xmit_timeout(dev, i);
>> - WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
>> - dev->name, netdev_drivername(dev), i);
>> + WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): \
>> + transmit queue %u timed out %lu jiffies\n",
>> + dev->name, netdev_drivername(dev), i, some_queue_timedout);
> 
> If we really want this, I suggest we export a time in ms units, using
> jiffies_to_msecs()

OK.
  

Patch

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index a9aadc4e6858..67b78e260d28 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -502,7 +502,7 @@  static void dev_watchdog(struct timer_list *t)
 		if (netif_device_present(dev) &&
 		    netif_running(dev) &&
 		    netif_carrier_ok(dev)) {
-			int some_queue_timedout = 0;
+			unsigned long some_queue_timedout = 0;
 			unsigned int i;
 			unsigned long trans_start;
 
@@ -514,7 +514,7 @@  static void dev_watchdog(struct timer_list *t)
 				if (netif_xmit_stopped(txq) &&
 				    time_after(jiffies, (trans_start +
 							 dev->watchdog_timeo))) {
-					some_queue_timedout = 1;
+					some_queue_timedout = jiffies - trans_start;
 					atomic_long_inc(&txq->trans_timeout);
 					break;
 				}
@@ -522,8 +522,9 @@  static void dev_watchdog(struct timer_list *t)
 
 			if (unlikely(some_queue_timedout)) {
 				trace_net_dev_xmit_timeout(dev, i);
-				WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
-				       dev->name, netdev_drivername(dev), i);
+				WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): \
+					  transmit queue %u timed out %lu jiffies\n",
+					  dev->name, netdev_drivername(dev), i, some_queue_timedout);
 				netif_freeze_queues(dev);
 				dev->netdev_ops->ndo_tx_timeout(dev, i);
 				netif_unfreeze_queues(dev);