ice/ptp: fix the PTP worker retrying indefinitely if the link went down

Message ID 20230117181533.2350335-1-neelx@redhat.com
State New
Headers
Series ice/ptp: fix the PTP worker retrying indefinitely if the link went down |

Commit Message

Daniel Vacek Jan. 17, 2023, 6:15 p.m. UTC
  When the link goes down the ice_ptp_tx_tstamp_work() may loop forever trying
to process the packets. In this case it makes sense to just drop them.

Signed-off-by: Daniel Vacek <neelx@redhat.com>
---
 drivers/net/ethernet/intel/ice/ice_ptp.c | 9 +++++++++
 1 file changed, 9 insertions(+)
  

Comments

Jacob Keller Jan. 17, 2023, 6:46 p.m. UTC | #1
On 1/17/2023 10:15 AM, Daniel Vacek wrote:
> When the link goes down the ice_ptp_tx_tstamp_work() may loop forever trying
> to process the packets. In this case it makes sense to just drop them.
> 

Hi,

Do you have some more details on what situation caused this state? Or is
this just based on code review?

It won't loop forever because if link is down for more than 2 seconds
we'll discard the old timestamps which we assume are not going to arrive.

The trouble is that if a timestamp *does* arrive late, we need to ensure
that we never assign the captured time to the wrong packet, and that for
E822 devices we always read the correct number of timestamps (otherwise
we can get the logic for timestamp interrupt generation broken).

Consider for example this flow for e810:

1) a tx packet with a timestamp request is scheduled to hw
2) the packet begins being transmitted
3) link goes down
4) interrupt triggers, ice_ptp_tx_tstamp is run
5) link is down, so we skip reading this timestamp. Since we don't read
the timestamp, we just discard the skb and we don't update the cached tx
timestamp value
6) link goes up
7) 2 tx packets with a timestamp request are sent and one of them is on
the same index as the packet in (1)
8) the other tx packet completes and we get an interrupt
9) the loop reads both timestamps. Since the tx packet in slot (1)
doesn't match its cached value it looks "new" so the function reports
the old timestamp to the wrong packet.

Consider this flow for e822:

1) 2 tx packets with a timestamp request are scheduled to hw
2) the packets begin being transmitted
3) link goes down
4) an interrupt for the Tx timestamp is triggered, but we don't read the
timestamps because we have link down and we skipped the ts_read.
5) the internal e822 hardware counter is not decremented due to no reads
6) no more timestamp interrupts will be triggered by hardware until we
read the appropriate number of timestamps

I am not sure if link going up will properly reset and re-initialize the
Tx timestamp block but I suspect it does not. @Karol, @Siddaraju,
@Michal, do you recall more details on this?

I understand the desire to avoid polling when unnecessary, but I am
worried because the hardware and firmware interactions here are pretty
complex and its easy to get this wrong. (see: all of the previous
patches and bug fixes we've been working on... we got this wrong a LOT
already...)

Without a more concrete explanation of what this fixes I'm worried about
this change :(

At a minimum I think I would only set drop_ts but not not goto skip_ts_read.

That way if we happen to have a ready timestamp (for E822) we'll still
read it and avoid the miscounting from not reading a completed timestamp.

This also ensures that on e810 the cached timestamp value is updated and
that we avoid the other situation.

I'd still prefer if you have a bug report or more details on the failure
case. I believe even if we poll it should be no more than 2 seconds for
an old timestamp that never got sent to be discarded.

> Signed-off-by: Daniel Vacek <neelx@redhat.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_ptp.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> index d63161d73eb16..c313177ba6676 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> @@ -680,6 +680,7 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>  	struct ice_pf *pf;
>  	struct ice_hw *hw;
>  	u64 tstamp_ready;
> +	bool link_up;
>  	int err;
>  	u8 idx;
>  
> @@ -695,6 +696,8 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>  	if (err)
>  		return false;
>  
> +	link_up = hw->port_info->phy.link_info.link_info & ICE_AQ_LINK_UP;
> +
>  	for_each_set_bit(idx, tx->in_use, tx->len) {
>  		struct skb_shared_hwtstamps shhwtstamps = {};
>  		u8 phy_idx = idx + tx->offset;
> @@ -702,6 +705,12 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>  		bool drop_ts = false;
>  		struct sk_buff *skb;
>  
> +		/* Drop packets if the link went down */
> +		if (!link_up) {
> +			drop_ts = true;
> +			goto skip_ts_read;
> +		}
> +
>  		/* Drop packets which have waited for more than 2 seconds */
>  		if (time_is_before_jiffies(tx->tstamps[idx].start + 2 * HZ)) {
>  			drop_ts = true;
  
Daniel Vacek Jan. 18, 2023, 3:14 p.m. UTC | #2
On Tue, Jan 17, 2023 at 7:47 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> On 1/17/2023 10:15 AM, Daniel Vacek wrote:
> > When the link goes down the ice_ptp_tx_tstamp_work() may loop forever trying
> > to process the packets. In this case it makes sense to just drop them.
> >
>
> Hi,
>
> Do you have some more details on what situation caused this state? Or is
> this just based on code review?
>
> It won't loop forever because if link is down for more than 2 seconds
> we'll discard the old timestamps which we assume are not going to arrive.

Oh, my bad. I should have improved the commit message. This
comes from RHEL 8.4 where we do not have commit 0dd9286263923
("ice: handle discarding old Tx requests in ice_ptp_tx_tstamp")
yet.

Though, I still consider 2 seconds being 'forever like'.

What really happened is that after the link went down, the
kthread worker 'ice-ptp-0000:10' being a high prio RT class
task kept looping on CPU 33 for almost three minutes starving
other tasks with lower priority. This resulted in RCU stall
and finally in panic due to hung tasks.

The machine is using the E810 NIC.

> [1193024.082447] ice 0000:10:00.1 ens1f1: NIC Link is Down
> [1193024.087861] iavf 0000:10:09.1 net1: NIC Link is Down
> [1193024.093077] iavf 0000:10:09.2 llsm1: NIC Link is Down
> [1193024.099324] iavf 0000:10:09.3 llsm1: NIC Link is Down
> [1193060.630500] ice 0000:10:00.2 ens1f2: NIC Link is up 25 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Contr>
> [1193060.647921] iavf 0000:10:11.1 llsm2: NIC Link is Up Speed is 25 Gbps Full Duplex
> [1193060.655575] iavf 0000:10:11.2 net2: NIC Link is Up Speed is 25 Gbps Full Duplex
> [1193060.663140] iavf 0000:10:11.3 llsm2: NIC Link is Up Speed is 25 Gbps Full Duplex
> [1193084.383798] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [1193084.383799] rcu:   Tasks blocked on level-1 rcu_node (CPUs 32-47): P183189
> [1193084.383804]        (detected by 0, t=60002 jiffies, g=379331397, q=294595)
> [1193084.383807] runc            R  running task        0 183189 183175 0x00084080
> [1193088.721408] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P161097 P183189 } 63616 jiffies s: 1074129 root: 0x4/.
> [1193089.074856] rcu: blocking rcu_node structures: l=1:32-47:0x0/T
>
> [1193199.313106] INFO: task xfsaild/sda4:1428 blocked for more than 120 seconds.
>
> [1193201.399021] NMI backtrace for cpu 33
> [1193201.399021] CPU: 33 PID: 1354 Comm: ice-ptp-0000:10 Kdump: loaded Tainted: G        W        --------- -  - 4.18.0-305.49.1.rt7.121.el8_4.x86_64 #1
>
> [1193201.399050] Kernel panic - not syncing: hung_task: blocked tasks
> [1193201.399052] CPU: 63 PID: 559 Comm: khungtaskd Kdump: loaded Tainted: G        W        --------- -  - 4.18.0-305.49.1.rt7.121.el8_4.x86_64 #1

> crash> bc 33
> PID: 1354     TASK: ff48df6f2c5c8000  CPU: 33   COMMAND: "ice-ptp-0000:10"
>  #0 [fffffe000069ce38] crash_nmi_callback+0x33 at ffffffffb664cb63
>  #1 [fffffe000069ce40] nmi_handle+0x5b at ffffffffb661c68b
>  #2 [fffffe000069ce98] default_do_nmi+0x72 at ffffffffb6f4a562
>  #3 [fffffe000069cec0] do_nmi+0x18d at ffffffffb661cd1d
>  #4 [fffffe000069cef0] end_repeat_nmi+0x16 at ffffffffb7001544
>     [exception RIP: delay_tsc+0x6d]
>     RIP: ffffffffb6f4794d  RSP: ff7daf7a4aa2fd48  RFLAGS: 00000246
>     RAX: 000000008cf66cff  RBX: 0006582e8cf66caf  RCX: 0000000000000000
>     RDX: 000000000006582e  RSI: ffffffffb76eefe3  RDI: ffffffffb76d1b60
>     RBP: 0006582e8cf5ef89   R8: 000000000000000c   R9: 0000000000000000
>     R10: 0000000000000000  R11: 0000000000000000  R12: 00000000000249f2
>     R13: 0000000000000021  R14: ff48df63ec9aa3e0  R15: ff48df57541d0300
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #5 [ff7daf7a4aa2fd48] delay_tsc+0x6d at ffffffffb6f4794d
>  ## [  endcall jmp 68] __const_udelay+0x37 at ffffffffb6f477f7
>  #6 [ff7daf7a4aa2fd68] ice_sq_send_cmd+0x356 at ffffffffc0612ee6 [ice]
>  ## [  inlined     c0] ice_sbq_send_cmd at drivers/net/ethernet/intel/ice/ice_common.c:1407
>  #7 [ff7daf7a4aa2fdc0] ice_sbq_rw_reg+0xc4 at ffffffffc06147e4 [ice]
>  #8 [ff7daf7a4aa2fe18] ice_read_phy_reg_e810+0x4d at ffffffffc0655b0d [ice]
>  ## [  inlined     50] ice_read_phy_tstamp_e810 at drivers/net/ethernet/intel/ice/ice_ptp_hw.c:2608
>  #9 [ff7daf7a4aa2fe50] ice_read_phy_tstamp+0x5a at ffffffffc06584ca [ice]
>
> #10 [ff7daf7a4aa2fe90] ice_ptp_tx_tstamp_work+0x95 at ffffffffc0654185 [ice]
> #11 [ff7daf7a4aa2fee0] kthread_worker_fn+0xab at ffffffffb6701c4b
> #12 [ff7daf7a4aa2ff10] kthread+0x15d at ffffffffb6701b7d
>
> crash> delay_fn
> delay_fn = (void (*)(u64)) 0xffffffffb6f478e0 <delay_tsc>
>
> crash> bc 33 | grep -o 'R[BDA1][PX2]: 0[^ ]*'
> RAX: 000000008cf66cff
> RBX: 0006582e8cf66caf
> RDX: 000000000006582e
> RBP: 0006582e8cf5ef89
> R12: 00000000000249f2
>
>         249f2 r12     cycles (150002)
> 6582e8cf5ef89 rbp     bclock
> 6582e8cf66cff rdx:rax now
> 6582e8cf66caf rbx     last
>
>  962 ice_sq_send_cmd(struct ice_hw *hw, struct ice_ctl_q_info *cq,
> --
>  965 {
> --
> 1063         do {
> 1064                 if (ice_sq_done(hw, cq))
> 1065                         break;
> 1066
> 1067                 udelay(ICE_CTL_Q_SQ_CMD_USEC);
> 1068                 total_delay++;
> 1069         } while (total_delay < cq->sq_cmd_timeout);
>
>  19 /* 0x10c7 is 2**32 / 1000000 (rounded up) */
>  20 #define udelay(n)                                                       \
> --
>  26                                 __const_udelay((n) * 0x10c7ul);         \
>
> crash> pd loops_per_jiffy
> loops_per_jiffy = 1500000
>
> #define ICE_CTL_Q_SQ_CMD_USEC           100   /* Check every 100usec */
>
> crash> pd (100*0x10c7*4*250*loops_per_jiffy>>32)+1
> 150002
>
> crash> p 100*0x10c7ul
> 0x68dbc
>
> crash> dis ice_sq_send_cmd+0x346 13
> 0xffffffffc0612ed6 <ice_sq_send_cmd+0x346>: jmp    0xffffffffc0612eef <ice_sq_send_cmd+0x35f>
>
> 0xffffffffc0612ed8 <ice_sq_send_cmd+0x348>: mov    $0x68dbc,%edi
> 0xffffffffc0612edd <ice_sq_send_cmd+0x34d>: add    $0x1,%r13d
> 0xffffffffc0612ee1 <ice_sq_send_cmd+0x351>: call   0xffffffffb6f477c0 <__const_udelay>
> 0xffffffffc0612ee6 <ice_sq_send_cmd+0x356>: cmp    %r13d,0xc8(%rbp)
> 0xffffffffc0612eed <ice_sq_send_cmd+0x35d>: jbe    0xffffffffc0612f05 <ice_sq_send_cmd+0x375>
>
> 0xffffffffc0612eef <ice_sq_send_cmd+0x35f>: mov    0xa0(%rbp),%eax
> 0xffffffffc0612ef5 <ice_sq_send_cmd+0x365>: add    (%rbx),%rax
> 0xffffffffc0612ef8 <ice_sq_send_cmd+0x368>: mov    (%rax),%edx
> 0xffffffffc0612efa <ice_sq_send_cmd+0x36a>: movzwl 0x9a(%rbp),%eax
> 0xffffffffc0612f01 <ice_sq_send_cmd+0x371>: cmp    %eax,%edx
> 0xffffffffc0612f03 <ice_sq_send_cmd+0x373>: jne    0xffffffffc0612ed8 <ice_sq_send_cmd+0x348>
>
> 0xffffffffc0612f05 <ice_sq_send_cmd+0x375>: mov    0xa0(%rbp),%eax
>
> crash> dis delay_tsc 2
> 0xffffffffb6f478e0 <delay_tsc>: nopl   0x0(%rax,%rax,1) [FTRACE NOP]
> 0xffffffffb6f478e5 <delay_tsc+0x5>: push   %r13
>
> crash> rx ff7daf7a4aa2fd60
> ff7daf7a4aa2fd60:  0000000000000001    // total_delay

It's the first time the loop in function ice_sq_send_cmd() is being iterated.
The ice_sq_send_cmd() has to be called all over again.

> crash> dis ice_sq_send_cmd+0x2d
> 0xffffffffc0612bbd <ice_sq_send_cmd+0x2d>: mov    %rsi,%rbp
>
> crash> dis delay_tsc 6
> 0xffffffffb6f478e0 <delay_tsc>: nopl   0x0(%rax,%rax,1) [FTRACE NOP]
> 0xffffffffb6f478e5 <delay_tsc+0x5>: push   %r13
> 0xffffffffb6f478e7 <delay_tsc+0x7>: push   %r12
> 0xffffffffb6f478e9 <delay_tsc+0x9>: mov    %rdi,%r12
> 0xffffffffb6f478ec <delay_tsc+0xc>: mov    $0x1,%edi
> 0xffffffffb6f478f1 <delay_tsc+0x11>: push   %rbp
>
> crash> rx ff7daf7a4aa2fd50
> ff7daf7a4aa2fd50:  ff48df6f05dd6730
>
> crash> ice_ctl_q_info.sq_cmd_timeout,sq.head,sq.next_to_use ff48df6f05dd6730
>   sq_cmd_timeout = 0x2710 = 10000,
>   sq.head = 0x80300,
>   sq.next_to_use = 0x20,


> crash> runqc 33
> CPU 33 RUNQUEUE: ff48df6f3f86c380
>   CURRENT: PID: 1354     TASK: ff48df6f2c5c8000  COMMAND: "ice-ptp-0000:10"
>   RT PRIO_ARRAY: ff48df6f3f86c5c0
>      [ 89] PID: 1354     TASK: ff48df6f2c5c8000  COMMAND: "ice-ptp-0000:10"
>      [ 98] PID: 161097   TASK: ff48df634a111ec0  COMMAND: "handler111"
>   CFS RB_ROOT: ff48df6f3f86c430
>      [100] PID: 583      TASK: ff48df6f37845c40  COMMAND: "kworker/33:1H"
>      [120] PID: 4113728  TASK: ff48df5055db9ec0  COMMAND: "kworker/33:6"
>      [120] PID: 183189   TASK: ff48df5d24aa8000  COMMAND: "runc"
>
> crash> ps -m 1354 161097 583 4113728 183189
> [ 0 00:02:56.990] [RU]  PID: 1354     TASK: ff48df6f2c5c8000  CPU: 33   COMMAND: "ice-ptp-0000:10"
> [ 0 00:02:56.990] [RU]  PID: 161097   TASK: ff48df634a111ec0  CPU: 33   COMMAND: "handler111"
> [ 0 00:02:56.991] [RU]  PID: 4113728  TASK: ff48df5055db9ec0  CPU: 33   COMMAND: "kworker/33:6"
> [ 0 00:02:57.015] [RU]  PID: 183189   TASK: ff48df5d24aa8000  CPU: 33   COMMAND: "runc"
> [ 0 00:02:57.030] [RU]  PID: 583      TASK: ff48df6f37845c40  CPU: 33   COMMAND: "kworker/33:1H"

These runnable tasks have lower prio and hence they are being starved.

> crash> ps -mC33 | head -13
> CPU: 33
> [ 0 00:02:56.990] [RU]  PID: 1354     TASK: ff48df6f2c5c8000  CPU: 33   COMMAND: "ice-ptp-0000:10"
> [ 0 00:02:56.990] [RU]  PID: 161097   TASK: ff48df634a111ec0  CPU: 33   COMMAND: "handler111"
> [ 0 00:02:56.991] [RU]  PID: 4113728  TASK: ff48df5055db9ec0  CPU: 33   COMMAND: "kworker/33:6"
> [ 0 00:02:56.991] [IN]  PID: 278      TASK: ff48df504fd08000  CPU: 33   COMMAND: "ksoftirqd/33"
> [ 0 00:02:57.014] [IN]  PID: 277      TASK: ff48df504fcebd80  CPU: 33   COMMAND: "rcuc/33"
> [ 0 00:02:57.015] [RU]  PID: 183189   TASK: ff48df5d24aa8000  CPU: 33   COMMAND: "runc"
> [ 0 00:02:57.016] [IN]  PID: 146      TASK: ff48df5040268000  CPU: 33   COMMAND: "rcuop/16"
> [ 0 00:02:57.020] [IN]  PID: 276      TASK: ff48df504fce8000  CPU: 33   COMMAND: "migration/33"
> [ 0 00:02:57.024] [IN]  PID: 413      TASK: ff48df504f971ec0  CPU: 33   COMMAND: "rcuop/49"
> [ 0 00:02:57.024] [IN]  PID: 494      TASK: ff48df504f43bd80  CPU: 33   COMMAND: "rcuop/59"
> [ 0 00:02:57.029] [UN]  PID: 183233   TASK: ff48df5193618000  CPU: 33   COMMAND: "exe"
> [ 0 00:02:57.030] [RU]  PID: 583      TASK: ff48df6f37845c40  CPU: 33   COMMAND: "kworker/33:1H"

> crash> ice_pf.pdev,int_name ff48df6f05dd4180
>   pdev = 0xff48df6f37fe6000,
>   int_name = "ice-0000:10:00.1:misc",
>
> crash> ice_hw.mac_type ff48df6f05dd4dc0
>   mac_type = ICE_MAC_E810,

> crash> ice_ptp_tx.work.func,len,in_use,quad,quad_offset ff48df6f05dd44e0
>   work.func = 0xffffffffc06540f0 <ice_ptp_tx_tstamp_work>,
>   len = 0x40,
>   in_use = 0xff48df63d0354cc0,
>   quad = 0x1,
>   quad_offset = 0x0,
>
> crash> rx 0xff48df63d0354cc0
> ff48df63d0354cc0:  0000000000000001

> crash> ice_ptp_tx.tstamps ff48df6f05dd44e0
>   tstamps = 0xff48df62d6e69800,
>
> crash> ice_tx_tstamp.start,cached_tstamp,skb 0xff48df62d6e69800
>   start = 0x14717935c,
>   cached_tstamp = 0xbe9c63f31b
>   skb = 0xff48df5e20552f00,
>
> crash> pd jiffies-0x14717935c
> 177354

The packet is waiting for almost three minutes. The mentioned
commit 0dd9286263923 would definitely skip it and clear it's
bit in the tx->in_use mask. But still only two seconds too late
which needs to be addressed. And since the link went down
I guess just dropping the packets is a reasonable approach.

Checking the latest upstream code in v6.2-rc4 I *guess* this
can still happen as described above. Just not looping forever
but rather only for 2 seconds. But doing nothing for two
seconds is just as bad really. Hence the patch I'm suggesting.

> The trouble is that if a timestamp *does* arrive late, we need to ensure
> that we never assign the captured time to the wrong packet, and that for
> E822 devices we always read the correct number of timestamps (otherwise
> we can get the logic for timestamp interrupt generation broken).
>
> Consider for example this flow for e810:
>
> 1) a tx packet with a timestamp request is scheduled to hw
> 2) the packet begins being transmitted
> 3) link goes down
> 4) interrupt triggers, ice_ptp_tx_tstamp is run
> 5) link is down, so we skip reading this timestamp. Since we don't read
> the timestamp, we just discard the skb and we don't update the cached tx
> timestamp value
> 6) link goes up
> 7) 2 tx packets with a timestamp request are sent and one of them is on
> the same index as the packet in (1)
> 8) the other tx packet completes and we get an interrupt
> 9) the loop reads both timestamps. Since the tx packet in slot (1)
> doesn't match its cached value it looks "new" so the function reports
> the old timestamp to the wrong packet.

I don't really fully understand the HW nor the driver to be honest.
For sure I can see it kept looping as the work is being re-queued
due to tx->in_use bitmap not being cleared. And that can only
happen when ice_read_phy_tstamp() returns with error or the
returned &raw_tstamp is not valid or it matches the cached value
(considering our version of ice_ptp_tx_tstamp() looks like the
one in v5.15 stable tree). And that said the v5.15 stable needs
backport of this fix as well I guess.

> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/net/ethernet/intel/ice/ice_ptp.c?h=v5.15.88#n1170

> Consider this flow for e822:
>
> 1) 2 tx packets with a timestamp request are scheduled to hw
> 2) the packets begin being transmitted
> 3) link goes down
> 4) an interrupt for the Tx timestamp is triggered, but we don't read the
> timestamps because we have link down and we skipped the ts_read.
> 5) the internal e822 hardware counter is not decremented due to no reads
> 6) no more timestamp interrupts will be triggered by hardware until we
> read the appropriate number of timestamps
>
> I am not sure if link going up will properly reset and re-initialize the
> Tx timestamp block but I suspect it does not. @Karol, @Siddaraju,
> @Michal, do you recall more details on this?
>
> I understand the desire to avoid polling when unnecessary, but I am
> worried because the hardware and firmware interactions here are pretty
> complex and its easy to get this wrong. (see: all of the previous
> patches and bug fixes we've been working on... we got this wrong a LOT
> already...)

I see. Here my favorite quote applies: "Everybody falls the first time."

I totally understand and I cannot imagine developing and maintaining this
without a solid test suite. On the other hand, some things (like especially
races) could be tricky to reproduce.

> Without a more concrete explanation of what this fixes I'm worried about
> this change :(

Hopefully I already managed to shed a bit more light here. Again, I'm sorry
about the brief commit message.

> At a minimum I think I would only set drop_ts but not not goto skip_ts_read.

IIUC, that would still fail to clear the tx->in_use bit in case
ice_read_phy_tstamp()
returns with error. It would only work for the other case where no error is
returned but rather the returned &raw_tstamp is invalid. I'll send a v2 of
this fix trying to address the goto concern.

> crash> ice_sbq_msg_input ff7daf7a4aa2fe2c
> struct ice_sbq_msg_input {
>   dest_dev = 0x2,
>   opcode = 0x0,
>   msg_addr_low = 0x1000,
>   msg_addr_high = 0x309,
>   data = 0x0
> }
> crash> ice_sbq_msg_req ff7daf7a4aa2fdc8
> struct ice_sbq_msg_req {
>   dest_dev = 0x2,
>   src_dev = 0x0,
>   opcode = 0x0,
>   flags = 0x40,
>   sbe_fbe = 0xf,
>   func_id = 0x0,
>   msg_addr_low = 0x1000,
>   msg_addr_high = 0x309,
>   data = 0x0
> }
> crash> ice_sbq_cmd_desc ff7daf7a4aa2fdd8
> struct ice_sbq_cmd_desc {
>   flags = 0x1400,
>   opcode = 0xc00,
>   datalen = 0x0,
>   cmd_retval = 0x0,
>   cookie_high = 0x0,
>   cookie_low = 0x0,
>   param0 = {
>     cmd_len = 0xc,
>     cmpl_len = 0xc
>   },
>   reserved = "",
>   addr_high = 0x0,
>   addr_low = 0x0
> }

I see commit 1229b33973c7b ("ice: Add low latency Tx timestamp read") got
rid of that kthread_work in favor of threaded irq. But the
ice_misc_intr_thread_fn()
keeps waking itself in the same manner as the work did keep re-queuing itself.
So the loop still needs to be broken (i.e. the tx->in_use bitmap cleared). Even
for v6.2-rc4.

> That way if we happen to have a ready timestamp (for E822) we'll still
> read it and avoid the miscounting from not reading a completed timestamp.
>
> This also ensures that on e810 the cached timestamp value is updated and
> that we avoid the other situation.
>
> I'd still prefer if you have a bug report or more details on the failure
> case. I believe even if we poll it should be no more than 2 seconds for
> an old timestamp that never got sent to be discarded.

Unfortunately I only have this one report with a dump from customer's testing.
The uptime was almost 14 days, so I'm not sure how easy it is to reproduce.

Again, hopefully the above covers enough details. I'm happy to share more
if I missed some. But looping in a softirq (or worker thread) for two seconds
is not really acceptable. That's just a wrong design. Right?

--nX

> > Signed-off-by: Daniel Vacek <neelx@redhat.com>
> > ---
> >  drivers/net/ethernet/intel/ice/ice_ptp.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> > index d63161d73eb16..c313177ba6676 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> > @@ -680,6 +680,7 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
> >       struct ice_pf *pf;
> >       struct ice_hw *hw;
> >       u64 tstamp_ready;
> > +     bool link_up;
> >       int err;
> >       u8 idx;
> >
> > @@ -695,6 +696,8 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
> >       if (err)
> >               return false;
> >
> > +     link_up = hw->port_info->phy.link_info.link_info & ICE_AQ_LINK_UP;
> > +
> >       for_each_set_bit(idx, tx->in_use, tx->len) {
> >               struct skb_shared_hwtstamps shhwtstamps = {};
> >               u8 phy_idx = idx + tx->offset;
> > @@ -702,6 +705,12 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
> >               bool drop_ts = false;
> >               struct sk_buff *skb;
> >
> > +             /* Drop packets if the link went down */
> > +             if (!link_up) {
> > +                     drop_ts = true;
> > +                     goto skip_ts_read;
> > +             }
> > +
> >               /* Drop packets which have waited for more than 2 seconds */
> >               if (time_is_before_jiffies(tx->tstamps[idx].start + 2 * HZ)) {
> >                       drop_ts = true;
>
  
Jacob Keller Jan. 18, 2023, 8:59 p.m. UTC | #3
On 1/18/2023 7:14 AM, Daniel Vacek wrote:
> On Tue, Jan 17, 2023 at 7:47 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
>> On 1/17/2023 10:15 AM, Daniel Vacek wrote:
>>> When the link goes down the ice_ptp_tx_tstamp_work() may loop forever trying
>>> to process the packets. In this case it makes sense to just drop them.
>>>
>>
>> Hi,
>>
>> Do you have some more details on what situation caused this state? Or is
>> this just based on code review?
>>
>> It won't loop forever because if link is down for more than 2 seconds
>> we'll discard the old timestamps which we assume are not going to arrive.
> 
> Oh, my bad. I should have improved the commit message. This
> comes from RHEL 8.4 where we do not have commit 0dd9286263923
> ("ice: handle discarding old Tx requests in ice_ptp_tx_tstamp")
> yet.
> 
> Though, I still consider 2 seconds being 'forever like'.
> 

It is somewhat forever like in CPU time, yes.

> What really happened is that after the link went down, the
> kthread worker 'ice-ptp-0000:10' being a high prio RT class
> task kept looping on CPU 33 for almost three minutes starving
> other tasks with lower priority. This resulted in RCU stall
> and finally in panic due to hung tasks.
> 

Right, because we essentially used to starve the task that was supposed
to clean this up. That is fixed with the commit you mentioned above.

> The machine is using the E810 NIC.
> 
>> [1193024.082447] ice 0000:10:00.1 ens1f1: NIC Link is Down
>> [1193024.087861] iavf 0000:10:09.1 net1: NIC Link is Down
>> [1193024.093077] iavf 0000:10:09.2 llsm1: NIC Link is Down
>> [1193024.099324] iavf 0000:10:09.3 llsm1: NIC Link is Down
>> [1193060.630500] ice 0000:10:00.2 ens1f2: NIC Link is up 25 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Contr>
>> [1193060.647921] iavf 0000:10:11.1 llsm2: NIC Link is Up Speed is 25 Gbps Full Duplex
>> [1193060.655575] iavf 0000:10:11.2 net2: NIC Link is Up Speed is 25 Gbps Full Duplex
>> [1193060.663140] iavf 0000:10:11.3 llsm2: NIC Link is Up Speed is 25 Gbps Full Duplex
>> [1193084.383798] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [1193084.383799] rcu:   Tasks blocked on level-1 rcu_node (CPUs 32-47): P183189
>> [1193084.383804]        (detected by 0, t=60002 jiffies, g=379331397, q=294595)
>> [1193084.383807] runc            R  running task        0 183189 183175 0x00084080
>> [1193088.721408] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P161097 P183189 } 63616 jiffies s: 1074129 root: 0x4/.
>> [1193089.074856] rcu: blocking rcu_node structures: l=1:32-47:0x0/T
>>
>> [1193199.313106] INFO: task xfsaild/sda4:1428 blocked for more than 120 seconds.
>>
>> [1193201.399021] NMI backtrace for cpu 33
>> [1193201.399021] CPU: 33 PID: 1354 Comm: ice-ptp-0000:10 Kdump: loaded Tainted: G        W        --------- -  - 4.18.0-305.49.1.rt7.121.el8_4.x86_64 #1
>>
>> [1193201.399050] Kernel panic - not syncing: hung_task: blocked tasks
>> [1193201.399052] CPU: 63 PID: 559 Comm: khungtaskd Kdump: loaded Tainted: G        W        --------- -  - 4.18.0-305.49.1.rt7.121.el8_4.x86_64 #1
> 
>> crash> bc 33
>> PID: 1354     TASK: ff48df6f2c5c8000  CPU: 33   COMMAND: "ice-ptp-0000:10"
>>  #0 [fffffe000069ce38] crash_nmi_callback+0x33 at ffffffffb664cb63
>>  #1 [fffffe000069ce40] nmi_handle+0x5b at ffffffffb661c68b
>>  #2 [fffffe000069ce98] default_do_nmi+0x72 at ffffffffb6f4a562
>>  #3 [fffffe000069cec0] do_nmi+0x18d at ffffffffb661cd1d
>>  #4 [fffffe000069cef0] end_repeat_nmi+0x16 at ffffffffb7001544
>>     [exception RIP: delay_tsc+0x6d]
>>     RIP: ffffffffb6f4794d  RSP: ff7daf7a4aa2fd48  RFLAGS: 00000246
>>     RAX: 000000008cf66cff  RBX: 0006582e8cf66caf  RCX: 0000000000000000
>>     RDX: 000000000006582e  RSI: ffffffffb76eefe3  RDI: ffffffffb76d1b60
>>     RBP: 0006582e8cf5ef89   R8: 000000000000000c   R9: 0000000000000000
>>     R10: 0000000000000000  R11: 0000000000000000  R12: 00000000000249f2
>>     R13: 0000000000000021  R14: ff48df63ec9aa3e0  R15: ff48df57541d0300
>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>> --- <NMI exception stack> ---
>>  #5 [ff7daf7a4aa2fd48] delay_tsc+0x6d at ffffffffb6f4794d
>>  ## [  endcall jmp 68] __const_udelay+0x37 at ffffffffb6f477f7
>>  #6 [ff7daf7a4aa2fd68] ice_sq_send_cmd+0x356 at ffffffffc0612ee6 [ice]
>>  ## [  inlined     c0] ice_sbq_send_cmd at drivers/net/ethernet/intel/ice/ice_common.c:1407
>>  #7 [ff7daf7a4aa2fdc0] ice_sbq_rw_reg+0xc4 at ffffffffc06147e4 [ice]
>>  #8 [ff7daf7a4aa2fe18] ice_read_phy_reg_e810+0x4d at ffffffffc0655b0d [ice]
>>  ## [  inlined     50] ice_read_phy_tstamp_e810 at drivers/net/ethernet/intel/ice/ice_ptp_hw.c:2608
>>  #9 [ff7daf7a4aa2fe50] ice_read_phy_tstamp+0x5a at ffffffffc06584ca [ice]
>>
>> #10 [ff7daf7a4aa2fe90] ice_ptp_tx_tstamp_work+0x95 at ffffffffc0654185 [ice]
>> #11 [ff7daf7a4aa2fee0] kthread_worker_fn+0xab at ffffffffb6701c4b
>> #12 [ff7daf7a4aa2ff10] kthread+0x15d at ffffffffb6701b7d
>>
>> crash> delay_fn
>> delay_fn = (void (*)(u64)) 0xffffffffb6f478e0 <delay_tsc>
>>
>> crash> bc 33 | grep -o 'R[BDA1][PX2]: 0[^ ]*'
>> RAX: 000000008cf66cff
>> RBX: 0006582e8cf66caf
>> RDX: 000000000006582e
>> RBP: 0006582e8cf5ef89
>> R12: 00000000000249f2
>>
>>         249f2 r12     cycles (150002)
>> 6582e8cf5ef89 rbp     bclock
>> 6582e8cf66cff rdx:rax now
>> 6582e8cf66caf rbx     last
>>
>>  962 ice_sq_send_cmd(struct ice_hw *hw, struct ice_ctl_q_info *cq,
>> --
>>  965 {
>> --
>> 1063         do {
>> 1064                 if (ice_sq_done(hw, cq))
>> 1065                         break;
>> 1066
>> 1067                 udelay(ICE_CTL_Q_SQ_CMD_USEC);
>> 1068                 total_delay++;
>> 1069         } while (total_delay < cq->sq_cmd_timeout);
>>
>>  19 /* 0x10c7 is 2**32 / 1000000 (rounded up) */
>>  20 #define udelay(n)                                                       \
>> --
>>  26                                 __const_udelay((n) * 0x10c7ul);         \
>>
>> crash> pd loops_per_jiffy
>> loops_per_jiffy = 1500000
>>
>> #define ICE_CTL_Q_SQ_CMD_USEC           100   /* Check every 100usec */
>>
>> crash> pd (100*0x10c7*4*250*loops_per_jiffy>>32)+1
>> 150002
>>
>> crash> p 100*0x10c7ul
>> 0x68dbc
>>
>> crash> dis ice_sq_send_cmd+0x346 13
>> 0xffffffffc0612ed6 <ice_sq_send_cmd+0x346>: jmp    0xffffffffc0612eef <ice_sq_send_cmd+0x35f>
>>
>> 0xffffffffc0612ed8 <ice_sq_send_cmd+0x348>: mov    $0x68dbc,%edi
>> 0xffffffffc0612edd <ice_sq_send_cmd+0x34d>: add    $0x1,%r13d
>> 0xffffffffc0612ee1 <ice_sq_send_cmd+0x351>: call   0xffffffffb6f477c0 <__const_udelay>
>> 0xffffffffc0612ee6 <ice_sq_send_cmd+0x356>: cmp    %r13d,0xc8(%rbp)
>> 0xffffffffc0612eed <ice_sq_send_cmd+0x35d>: jbe    0xffffffffc0612f05 <ice_sq_send_cmd+0x375>
>>
>> 0xffffffffc0612eef <ice_sq_send_cmd+0x35f>: mov    0xa0(%rbp),%eax
>> 0xffffffffc0612ef5 <ice_sq_send_cmd+0x365>: add    (%rbx),%rax
>> 0xffffffffc0612ef8 <ice_sq_send_cmd+0x368>: mov    (%rax),%edx
>> 0xffffffffc0612efa <ice_sq_send_cmd+0x36a>: movzwl 0x9a(%rbp),%eax
>> 0xffffffffc0612f01 <ice_sq_send_cmd+0x371>: cmp    %eax,%edx
>> 0xffffffffc0612f03 <ice_sq_send_cmd+0x373>: jne    0xffffffffc0612ed8 <ice_sq_send_cmd+0x348>
>>
>> 0xffffffffc0612f05 <ice_sq_send_cmd+0x375>: mov    0xa0(%rbp),%eax
>>
>> crash> dis delay_tsc 2
>> 0xffffffffb6f478e0 <delay_tsc>: nopl   0x0(%rax,%rax,1) [FTRACE NOP]
>> 0xffffffffb6f478e5 <delay_tsc+0x5>: push   %r13
>>
>> crash> rx ff7daf7a4aa2fd60
>> ff7daf7a4aa2fd60:  0000000000000001    // total_delay
> 
> It's the first time the loop in function ice_sq_send_cmd() is being iterated.
> The ice_sq_send_cmd() has to be called all over again.
> 
>> crash> dis ice_sq_send_cmd+0x2d
>> 0xffffffffc0612bbd <ice_sq_send_cmd+0x2d>: mov    %rsi,%rbp
>>
>> crash> dis delay_tsc 6
>> 0xffffffffb6f478e0 <delay_tsc>: nopl   0x0(%rax,%rax,1) [FTRACE NOP]
>> 0xffffffffb6f478e5 <delay_tsc+0x5>: push   %r13
>> 0xffffffffb6f478e7 <delay_tsc+0x7>: push   %r12
>> 0xffffffffb6f478e9 <delay_tsc+0x9>: mov    %rdi,%r12
>> 0xffffffffb6f478ec <delay_tsc+0xc>: mov    $0x1,%edi
>> 0xffffffffb6f478f1 <delay_tsc+0x11>: push   %rbp
>>
>> crash> rx ff7daf7a4aa2fd50
>> ff7daf7a4aa2fd50:  ff48df6f05dd6730
>>
>> crash> ice_ctl_q_info.sq_cmd_timeout,sq.head,sq.next_to_use ff48df6f05dd6730
>>   sq_cmd_timeout = 0x2710 = 10000,
>>   sq.head = 0x80300,
>>   sq.next_to_use = 0x20,
> 
> 
>> crash> runqc 33
>> CPU 33 RUNQUEUE: ff48df6f3f86c380
>>   CURRENT: PID: 1354     TASK: ff48df6f2c5c8000  COMMAND: "ice-ptp-0000:10"
>>   RT PRIO_ARRAY: ff48df6f3f86c5c0
>>      [ 89] PID: 1354     TASK: ff48df6f2c5c8000  COMMAND: "ice-ptp-0000:10"
>>      [ 98] PID: 161097   TASK: ff48df634a111ec0  COMMAND: "handler111"
>>   CFS RB_ROOT: ff48df6f3f86c430
>>      [100] PID: 583      TASK: ff48df6f37845c40  COMMAND: "kworker/33:1H"
>>      [120] PID: 4113728  TASK: ff48df5055db9ec0  COMMAND: "kworker/33:6"
>>      [120] PID: 183189   TASK: ff48df5d24aa8000  COMMAND: "runc"
>>
>> crash> ps -m 1354 161097 583 4113728 183189
>> [ 0 00:02:56.990] [RU]  PID: 1354     TASK: ff48df6f2c5c8000  CPU: 33   COMMAND: "ice-ptp-0000:10"
>> [ 0 00:02:56.990] [RU]  PID: 161097   TASK: ff48df634a111ec0  CPU: 33   COMMAND: "handler111"
>> [ 0 00:02:56.991] [RU]  PID: 4113728  TASK: ff48df5055db9ec0  CPU: 33   COMMAND: "kworker/33:6"
>> [ 0 00:02:57.015] [RU]  PID: 183189   TASK: ff48df5d24aa8000  CPU: 33   COMMAND: "runc"
>> [ 0 00:02:57.030] [RU]  PID: 583      TASK: ff48df6f37845c40  CPU: 33   COMMAND: "kworker/33:1H"
> 
> These runnable tasks have lower prio and hence they are being starved.
> 
>> crash> ps -mC33 | head -13
>> CPU: 33
>> [ 0 00:02:56.990] [RU]  PID: 1354     TASK: ff48df6f2c5c8000  CPU: 33   COMMAND: "ice-ptp-0000:10"
>> [ 0 00:02:56.990] [RU]  PID: 161097   TASK: ff48df634a111ec0  CPU: 33   COMMAND: "handler111"
>> [ 0 00:02:56.991] [RU]  PID: 4113728  TASK: ff48df5055db9ec0  CPU: 33   COMMAND: "kworker/33:6"
>> [ 0 00:02:56.991] [IN]  PID: 278      TASK: ff48df504fd08000  CPU: 33   COMMAND: "ksoftirqd/33"
>> [ 0 00:02:57.014] [IN]  PID: 277      TASK: ff48df504fcebd80  CPU: 33   COMMAND: "rcuc/33"
>> [ 0 00:02:57.015] [RU]  PID: 183189   TASK: ff48df5d24aa8000  CPU: 33   COMMAND: "runc"
>> [ 0 00:02:57.016] [IN]  PID: 146      TASK: ff48df5040268000  CPU: 33   COMMAND: "rcuop/16"
>> [ 0 00:02:57.020] [IN]  PID: 276      TASK: ff48df504fce8000  CPU: 33   COMMAND: "migration/33"
>> [ 0 00:02:57.024] [IN]  PID: 413      TASK: ff48df504f971ec0  CPU: 33   COMMAND: "rcuop/49"
>> [ 0 00:02:57.024] [IN]  PID: 494      TASK: ff48df504f43bd80  CPU: 33   COMMAND: "rcuop/59"
>> [ 0 00:02:57.029] [UN]  PID: 183233   TASK: ff48df5193618000  CPU: 33   COMMAND: "exe"
>> [ 0 00:02:57.030] [RU]  PID: 583      TASK: ff48df6f37845c40  CPU: 33   COMMAND: "kworker/33:1H"
> 
>> crash> ice_pf.pdev,int_name ff48df6f05dd4180
>>   pdev = 0xff48df6f37fe6000,
>>   int_name = "ice-0000:10:00.1:misc",
>>
>> crash> ice_hw.mac_type ff48df6f05dd4dc0
>>   mac_type = ICE_MAC_E810,
> 
>> crash> ice_ptp_tx.work.func,len,in_use,quad,quad_offset ff48df6f05dd44e0
>>   work.func = 0xffffffffc06540f0 <ice_ptp_tx_tstamp_work>,
>>   len = 0x40,
>>   in_use = 0xff48df63d0354cc0,
>>   quad = 0x1,
>>   quad_offset = 0x0,
>>
>> crash> rx 0xff48df63d0354cc0
>> ff48df63d0354cc0:  0000000000000001
> 
>> crash> ice_ptp_tx.tstamps ff48df6f05dd44e0
>>   tstamps = 0xff48df62d6e69800,
>>
>> crash> ice_tx_tstamp.start,cached_tstamp,skb 0xff48df62d6e69800
>>   start = 0x14717935c,
>>   cached_tstamp = 0xbe9c63f31b
>>   skb = 0xff48df5e20552f00,
>>
>> crash> pd jiffies-0x14717935c
>> 177354
> 
> The packet is waiting for almost three minutes. The mentioned
> commit 0dd9286263923 would definitely skip it and clear it's
> bit in the tx->in_use mask. But still only two seconds too late
> which needs to be addressed. And since the link went down
> I guess just dropping the packets is a reasonable approach.
> 
Yes, but we have to be careful of corner cases that could result in
completely stopping Tx timestamp interrupts until a device reset.

> Checking the latest upstream code in v6.2-rc4 I *guess* this
> can still happen as described above. Just not looping forever
> but rather only for 2 seconds. But doing nothing for two
> seconds is just as bad really. Hence the patch I'm suggesting.
> 
>> The trouble is that if a timestamp *does* arrive late, we need to ensure
>> that we never assign the captured time to the wrong packet, and that for
>> E822 devices we always read the correct number of timestamps (otherwise
>> we can get the logic for timestamp interrupt generation broken).
>>
>> Consider for example this flow for e810:
>>
>> 1) a tx packet with a timestamp request is scheduled to hw
>> 2) the packet begins being transmitted
>> 3) link goes down
>> 4) interrupt triggers, ice_ptp_tx_tstamp is run
>> 5) link is down, so we skip reading this timestamp. Since we don't read
>> the timestamp, we just discard the skb and we don't update the cached tx
>> timestamp value
>> 6) link goes up
>> 7) 2 tx packets with a timestamp request are sent and one of them is on
>> the same index as the packet in (1)
>> 8) the other tx packet completes and we get an interrupt
>> 9) the loop reads both timestamps. Since the tx packet in slot (1)
>> doesn't match its cached value it looks "new" so the function reports
>> the old timestamp to the wrong packet.
> 
> I don't really fully understand the HW nor the driver to be honest.
> For sure I can see it kept looping as the work is being re-queued
> due to tx->in_use bitmap not being cleared. And that can only
> happen when ice_read_phy_tstamp() returns with error or the
> returned &raw_tstamp is not valid or it matches the cached value
> (considering our version of ice_ptp_tx_tstamp() looks like the
> one in v5.15 stable tree). And that said the v5.15 stable needs
> backport of this fix as well I guess.
> 

Right.

>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/net/ethernet/intel/ice/ice_ptp.c?h=v5.15.88#n1170
> 
>> Consider this flow for e822:
>>
>> 1) 2 tx packets with a timestamp request are scheduled to hw
>> 2) the packets begin being transmitted
>> 3) link goes down
>> 4) an interrupt for the Tx timestamp is triggered, but we don't read the
>> timestamps because we have link down and we skipped the ts_read.
>> 5) the internal e822 hardware counter is not decremented due to no reads
>> 6) no more timestamp interrupts will be triggered by hardware until we
>> read the appropriate number of timestamps
>>
>> I am not sure if link going up will properly reset and re-initialize the
>> Tx timestamp block but I suspect it does not. @Karol, @Siddaraju,
>> @Michal, do you recall more details on this?
>>
>> I understand the desire to avoid polling when unnecessary, but I am
>> worried because the hardware and firmware interactions here are pretty
>> complex and its easy to get this wrong. (see: all of the previous
>> patches and bug fixes we've been working on... we got this wrong a LOT
>> already...)
> 
> I see. Here my favorite quote applies: "Everybody falls the first time."
> 
> I totally understand and I cannot imagine developing and maintaining this
> without a solid test suite. On the other hand, some things (like especially
> races) could be tricky to reproduce.
>
Yea..

>> Without a more concrete explanation of what this fixes I'm worried about
>> this change :(
> 
> Hopefully I already managed to shed a bit more light here. Again, I'm sorry
> about the brief commit message.
> 

That's ok, this has provided a lot more detail thanks!

>> At a minimum I think I would only set drop_ts but not not goto skip_ts_read.
> 
> IIUC, that would still fail to clear the tx->in_use bit in case
> ice_read_phy_tstamp()
> returns with error. It would only work for the other case where no error is
> returned but rather the returned &raw_tstamp is invalid. I'll send a v2 of
> this fix trying to address the goto concern.
>

If drop_ts is set we should always be dropping the timestamp after
checking, even if we get an error. If there's a flow where that doesn't
happen we should fix that.

>> crash> ice_sbq_msg_input ff7daf7a4aa2fe2c
>> struct ice_sbq_msg_input {
>>   dest_dev = 0x2,
>>   opcode = 0x0,
>>   msg_addr_low = 0x1000,
>>   msg_addr_high = 0x309,
>>   data = 0x0
>> }
>> crash> ice_sbq_msg_req ff7daf7a4aa2fdc8
>> struct ice_sbq_msg_req {
>>   dest_dev = 0x2,
>>   src_dev = 0x0,
>>   opcode = 0x0,
>>   flags = 0x40,
>>   sbe_fbe = 0xf,
>>   func_id = 0x0,
>>   msg_addr_low = 0x1000,
>>   msg_addr_high = 0x309,
>>   data = 0x0
>> }
>> crash> ice_sbq_cmd_desc ff7daf7a4aa2fdd8
>> struct ice_sbq_cmd_desc {
>>   flags = 0x1400,
>>   opcode = 0xc00,
>>   datalen = 0x0,
>>   cmd_retval = 0x0,
>>   cookie_high = 0x0,
>>   cookie_low = 0x0,
>>   param0 = {
>>     cmd_len = 0xc,
>>     cmpl_len = 0xc
>>   },
>>   reserved = "",
>>   addr_high = 0x0,
>>   addr_low = 0x0
>> }
> 
> I see commit 1229b33973c7b ("ice: Add low latency Tx timestamp read") got
> rid of that kthread_work in favor of threaded irq. But the
> ice_misc_intr_thread_fn()
> keeps waking itself in the same manner as the work did keep re-queuing itself.
> So the loop still needs to be broken (i.e. the tx->in_use bitmap cleared). Even
> for v6.2-rc4.
> 


Yes.

>> That way if we happen to have a ready timestamp (for E822) we'll still
>> read it and avoid the miscounting from not reading a completed timestamp.
>>
>> This also ensures that on e810 the cached timestamp value is updated and
>> that we avoid the other situation.
>>
>> I'd still prefer if you have a bug report or more details on the failure
>> case. I believe even if we poll it should be no more than 2 seconds for
>> an old timestamp that never got sent to be discarded.
> 
> Unfortunately I only have this one report with a dump from customer's testing.
> The uptime was almost 14 days, so I'm not sure how easy it is to reproduce.
> 
> Again, hopefully the above covers enough details. I'm happy to share more
> if I missed some. But looping in a softirq (or worker thread) for two seconds
> is not really acceptable. That's just a wrong design. Right?
> 

Please re-send the patch with the goto concern addressed. (only set
drop_ts, and still read the timestamp just in case and make sure that if
the read function fails while drop_ts is set we still drop the ts).

I believe that alleviates my concerns regarding the potential link down
race with a completed timestamp above.

We also should already be preventing requesting new timestamps while
link is down so that should be sufficient to combine with this and cover
the three flows:

1) request tx timestamp
2) link down
3) timestamp never occurs

1) request tx timestamp
2) timestamp occurs
3) link goes down while processing

1) link down
2) request tx timestamp rejected

Thanks!

-Jake

> --nX
> 
>>> Signed-off-by: Daniel Vacek <neelx@redhat.com>
>>> ---
>>>  drivers/net/ethernet/intel/ice/ice_ptp.c | 9 +++++++++
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
>>> index d63161d73eb16..c313177ba6676 100644
>>> --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
>>> +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
>>> @@ -680,6 +680,7 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>>>       struct ice_pf *pf;
>>>       struct ice_hw *hw;
>>>       u64 tstamp_ready;
>>> +     bool link_up;
>>>       int err;
>>>       u8 idx;
>>>
>>> @@ -695,6 +696,8 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>>>       if (err)
>>>               return false;
>>>
>>> +     link_up = hw->port_info->phy.link_info.link_info & ICE_AQ_LINK_UP;
>>> +
>>>       for_each_set_bit(idx, tx->in_use, tx->len) {
>>>               struct skb_shared_hwtstamps shhwtstamps = {};
>>>               u8 phy_idx = idx + tx->offset;
>>> @@ -702,6 +705,12 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>>>               bool drop_ts = false;
>>>               struct sk_buff *skb;
>>>
>>> +             /* Drop packets if the link went down */
>>> +             if (!link_up) {
>>> +                     drop_ts = true;
>>> +                     goto skip_ts_read;
>>> +             }
>>> +
>>>               /* Drop packets which have waited for more than 2 seconds */
>>>               if (time_is_before_jiffies(tx->tstamps[idx].start + 2 * HZ)) {
>>>                       drop_ts = true;
>>
>
  
Daniel Vacek Jan. 18, 2023, 10:11 p.m. UTC | #4
On Wed, Jan 18, 2023 at 9:59 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> On 1/18/2023 7:14 AM, Daniel Vacek wrote:
> > On Tue, Jan 17, 2023 at 7:47 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> >>
> >> At a minimum I think I would only set drop_ts but not not goto skip_ts_read.
> >
> > IIUC, that would still fail to clear the tx->in_use bit in case ice_read_phy_tstamp()
> > returns with error. It would only work for the other case where no error is
> > returned but rather the returned &raw_tstamp is invalid. I'll send a v2 of
> > this fix trying to address the goto concern.
> >
>
> Please re-send the patch with the goto concern addressed. (only set
> drop_ts, and still read the timestamp just in case and make sure that if
> the read function fails while drop_ts is set we still drop the ts).

Just in case it got lost I did send the patch here:

https://lore.kernel.org/intel-wired-lan/20230118161727.2485457-1-neelx@redhat.com/

But unfortunately I misplaced the --in-reply-to header with --reply-to one :-(
Hopefully it arrived to you safely.

> I believe that alleviates my concerns regarding the potential link down
> race with a completed timestamp above.
>
> We also should already be preventing requesting new timestamps while
> link is down so that should be sufficient to combine with this and cover
> the three flows:
>
> 1) request tx timestamp
> 2) link down
> 3) timestamp never occurs
>
> 1) request tx timestamp
> 2) timestamp occurs
> 3) link goes down while processing

I was thinking this is the case we got reported. But then again, I'm
not really experienced in this field.

--nX

> 1) link down
> 2) request tx timestamp rejected
>
> Thanks!
>
> -Jake
  
Jacob Keller Jan. 18, 2023, 10:22 p.m. UTC | #5
On 1/18/2023 2:11 PM, Daniel Vacek wrote:
> On Wed, Jan 18, 2023 at 9:59 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
>> On 1/18/2023 7:14 AM, Daniel Vacek wrote:
>> 1) request tx timestamp
>> 2) timestamp occurs
>> 3) link goes down while processing
> 
> I was thinking this is the case we got reported. But then again, I'm
> not really experienced in this field.
> 

I think it might be, or at least something similar to this.

I think that can be fixed with the link check you added. I think we
actually have a copy of the current link status in the ice_ptp or
ice_ptp_tx structure which could be used instead of having to check back
to the other structure.

I'm just hoping not to re-introduce bugs related to the hardware
interrupt counter that we had which results in preventing all future
timestamp interrupts.

> --nX
> 
>> 1) link down
>> 2) request tx timestamp rejected
>>
>> Thanks!
>>
>> -Jake
>
  
Daniel Vacek Jan. 19, 2023, 9:38 a.m. UTC | #6
On Wed, Jan 18, 2023 at 11:22 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> On 1/18/2023 2:11 PM, Daniel Vacek wrote:
> > On Wed, Jan 18, 2023 at 9:59 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> >> On 1/18/2023 7:14 AM, Daniel Vacek wrote:
> >> 1) request tx timestamp
> >> 2) timestamp occurs
> >> 3) link goes down while processing
> >
> > I was thinking this is the case we got reported. But then again, I'm
> > not really experienced in this field.
> >
>
> I think it might be, or at least something similar to this.
>
> I think that can be fixed with the link check you added. I think we
> actually have a copy of the current link status in the ice_ptp or
> ice_ptp_tx structure which could be used instead of having to check back
> to the other structure.

If you're talking about ptp_port->link_up that one is always false no
matter the actual NIC link status. First I wanted to use it but
checking all the 8 devices available in the dump data it just does not
match the net_dev->state or the port_info->phy.link_info.link_info

crash> net_device.name,state 0xff48df6f0c553000
  name = "ens1f1",
  state = 0x7,    // DOWN
crash> ice_port_info.phy.link_info.link_info 0xff48df6f05dca018
  phy.link_info.link_info = 0xc0,    // DOWN
crash> ice_ptp_port.port_num,link_up 0xff48df6f05dd44e0
  port_num = 0x1
  link_up = 0x0,    // False

crash> net_device.name,state 0xff48df6f25e3f000
  name = "ens1f0",
  state = 0x3,    // UP
crash> ice_port_info.phy.link_info.link_info 0xff48df6f070a3018
  phy.link_info.link_info = 0xe1,    // UP
crash> ice_ptp_port.port_num,link_up 0xff48df6f063184e0
  port_num = 0x0
  link_up = 0x0,    // False

crash> ice_ptp_port.port_num,link_up 0xff48df6f25b844e0
  port_num = 0x2
  link_up = 0x0,    // False even this device is UP
crash> ice_ptp_port.port_num,link_up 0xff48df6f140384e0
  port_num = 0x3
  link_up = 0x0,    // False even this device is UP
crash> ice_ptp_port.port_num,link_up 0xff48df6f055044e0
  port_num = 0x0
  link_up = 0x0,    // False even this device is UP
crash> ice_ptp_port.port_num,link_up 0xff48df6f251cc4e0
  port_num = 0x1
  link_up = 0x0,
crash> ice_ptp_port.port_num,link_up 0xff48df6f33a9c4e0
  port_num = 0x2
  link_up = 0x0,
crash> ice_ptp_port.port_num,link_up 0xff48df6f3bb7c4e0
  port_num = 0x3
  link_up = 0x0,

In other words, the ice_ptp_port.link_up is always false and cannot be
used. That's why I had to fall back to
hw->port_info->phy.link_info.link_info

--nX

> I'm just hoping not to re-introduce bugs related to the hardware
> interrupt counter that we had which results in preventing all future
> timestamp interrupts.
>
> > --nX
> >
> >> 1) link down
> >> 2) request tx timestamp rejected
> >>
> >> Thanks!
> >>
> >> -Jake
> >
>
  
Jacob Keller Jan. 19, 2023, 7:24 p.m. UTC | #7
On 1/19/2023 1:38 AM, Daniel Vacek wrote:
> On Wed, Jan 18, 2023 at 11:22 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
>> On 1/18/2023 2:11 PM, Daniel Vacek wrote:
>>> On Wed, Jan 18, 2023 at 9:59 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
>>>> On 1/18/2023 7:14 AM, Daniel Vacek wrote:
>>>> 1) request tx timestamp
>>>> 2) timestamp occurs
>>>> 3) link goes down while processing
>>>
>>> I was thinking this is the case we got reported. But then again, I'm
>>> not really experienced in this field.
>>>
>>
>> I think it might be, or at least something similar to this.
>>
>> I think that can be fixed with the link check you added. I think we
>> actually have a copy of the current link status in the ice_ptp or
>> ice_ptp_tx structure which could be used instead of having to check back
>> to the other structure.
> 
> If you're talking about ptp_port->link_up that one is always false no
> matter the actual NIC link status. First I wanted to use it but
> checking all the 8 devices available in the dump data it just does not
> match the net_dev->state or the port_info->phy.link_info.link_info
> 
> crash> net_device.name,state 0xff48df6f0c553000
>   name = "ens1f1",
>   state = 0x7,    // DOWN
> crash> ice_port_info.phy.link_info.link_info 0xff48df6f05dca018
>   phy.link_info.link_info = 0xc0,    // DOWN
> crash> ice_ptp_port.port_num,link_up 0xff48df6f05dd44e0
>   port_num = 0x1
>   link_up = 0x0,    // False
> 
> crash> net_device.name,state 0xff48df6f25e3f000
>   name = "ens1f0",
>   state = 0x3,    // UP
> crash> ice_port_info.phy.link_info.link_info 0xff48df6f070a3018
>   phy.link_info.link_info = 0xe1,    // UP
> crash> ice_ptp_port.port_num,link_up 0xff48df6f063184e0
>   port_num = 0x0
>   link_up = 0x0,    // False
> 
> crash> ice_ptp_port.port_num,link_up 0xff48df6f25b844e0
>   port_num = 0x2
>   link_up = 0x0,    // False even this device is UP
> crash> ice_ptp_port.port_num,link_up 0xff48df6f140384e0
>   port_num = 0x3
>   link_up = 0x0,    // False even this device is UP
> crash> ice_ptp_port.port_num,link_up 0xff48df6f055044e0
>   port_num = 0x0
>   link_up = 0x0,    // False even this device is UP
> crash> ice_ptp_port.port_num,link_up 0xff48df6f251cc4e0
>   port_num = 0x1
>   link_up = 0x0,
> crash> ice_ptp_port.port_num,link_up 0xff48df6f33a9c4e0
>   port_num = 0x2
>   link_up = 0x0,
> crash> ice_ptp_port.port_num,link_up 0xff48df6f3bb7c4e0
>   port_num = 0x3
>   link_up = 0x0,
> 
> In other words, the ice_ptp_port.link_up is always false and cannot be
> used. That's why I had to fall back to
> hw->port_info->phy.link_info.link_info
> 

Hmm. We call ice_ptp_link_change in ice_link_event which is called from
ice_handle_link_event...

In ice_link_event, a local link_up field is set based on
phy_info->link_info.link_info & ICE_AQ_LINK_UP

What kernel are you testing on? Does it include 6b1ff5d39228 ("ice:
always call ice_ptp_link_change and make it void")?

Prior to this commit the field was only valid for E822 devices, but I
fixed that as it was used for other checks as well.

I am guessing that the Red Hat kernel you are using lacks several of
these clean ups and fixes.

For the current code in the net-next kernel I believe we can safely use
the ptp_port->link_up field.

Thanks,
Jake
  
Daniel Vacek Jan. 19, 2023, 8:08 p.m. UTC | #8
On Thu, Jan 19, 2023 at 8:25 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> On 1/19/2023 1:38 AM, Daniel Vacek wrote:
> > On Wed, Jan 18, 2023 at 11:22 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> >> On 1/18/2023 2:11 PM, Daniel Vacek wrote:
> >>> On Wed, Jan 18, 2023 at 9:59 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> >>>> On 1/18/2023 7:14 AM, Daniel Vacek wrote:
> >>>> 1) request tx timestamp
> >>>> 2) timestamp occurs
> >>>> 3) link goes down while processing
> >>>
> >>> I was thinking this is the case we got reported. But then again, I'm
> >>> not really experienced in this field.
> >>>
> >>
> >> I think it might be, or at least something similar to this.
> >>
> >> I think that can be fixed with the link check you added. I think we
> >> actually have a copy of the current link status in the ice_ptp or
> >> ice_ptp_tx structure which could be used instead of having to check back
> >> to the other structure.
> >
> > If you're talking about ptp_port->link_up that one is always false no
> > matter the actual NIC link status. First I wanted to use it but
> > checking all the 8 devices available in the dump data it just does not
> > match the net_dev->state or the port_info->phy.link_info.link_info
> >
> > crash> net_device.name,state 0xff48df6f0c553000
> >   name = "ens1f1",
> >   state = 0x7,    // DOWN
> > crash> ice_port_info.phy.link_info.link_info 0xff48df6f05dca018
> >   phy.link_info.link_info = 0xc0,    // DOWN
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f05dd44e0
> >   port_num = 0x1
> >   link_up = 0x0,    // False
> >
> > crash> net_device.name,state 0xff48df6f25e3f000
> >   name = "ens1f0",
> >   state = 0x3,    // UP
> > crash> ice_port_info.phy.link_info.link_info 0xff48df6f070a3018
> >   phy.link_info.link_info = 0xe1,    // UP
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f063184e0
> >   port_num = 0x0
> >   link_up = 0x0,    // False
> >
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f25b844e0
> >   port_num = 0x2
> >   link_up = 0x0,    // False even this device is UP
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f140384e0
> >   port_num = 0x3
> >   link_up = 0x0,    // False even this device is UP
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f055044e0
> >   port_num = 0x0
> >   link_up = 0x0,    // False even this device is UP
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f251cc4e0
> >   port_num = 0x1
> >   link_up = 0x0,
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f33a9c4e0
> >   port_num = 0x2
> >   link_up = 0x0,
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f3bb7c4e0
> >   port_num = 0x3
> >   link_up = 0x0,
> >
> > In other words, the ice_ptp_port.link_up is always false and cannot be
> > used. That's why I had to fall back to
> > hw->port_info->phy.link_info.link_info
> >
>
> Hmm. We call ice_ptp_link_change in ice_link_event which is called from
> ice_handle_link_event...
>
> In ice_link_event, a local link_up field is set based on
> phy_info->link_info.link_info & ICE_AQ_LINK_UP
>
> What kernel are you testing on? Does it include 6b1ff5d39228 ("ice:
> always call ice_ptp_link_change and make it void")?
>
> Prior to this commit the field was only valid for E822 devices, but I
> fixed that as it was used for other checks as well.
>
> I am guessing that the Red Hat kernel you are using lacks several of
> these clean ups and fixes.

Yeah, makes perfect sense. We don't have that commit in 8.4. All the data
I have and present here are from 4.18.0-305.49.1.rt7.121.el8_4.x86_64

> For the current code in the net-next kernel I believe we can safely use
> the ptp_port->link_up field.

I'll fix that up and drop you a v3. Thank you for the review.

--nX

> Thanks,
> Jake
  
D H, Siddaraju Jan. 22, 2023, 5:44 p.m. UTC | #9
Yes Jake, ICE driver reinitializes PTP timers on link UP (especially for E82x).
ice_ptp.c-> ice_ptp_link_change-> ice_ptp_port_phy_restart-> ice_start_phy_timer_e822

-Siddaraju D H

-----Original Message-----
From: Keller, Jacob E <jacob.e.keller@intel.com> 
Sent: Wednesday, January 18, 2023 12:16 AM
To: Daniel Vacek <neelx@redhat.com>; Brandeburg, Jesse <jesse.brandeburg@intel.com>; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; David S. Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>; Richard Cochran <richardcochran@gmail.com>; Kolacinski, Karol <karol.kolacinski@intel.com>; D H, Siddaraju <siddaraju.dh@intel.com>; Michalik, Michal <michal.michalik@intel.com>
Cc: netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org; linux-kernel@vger.kernel.org
Subject: Re: [Intel-wired-lan] [PATCH] ice/ptp: fix the PTP worker retrying indefinitely if the link went down



On 1/17/2023 10:15 AM, Daniel Vacek wrote:
> When the link goes down the ice_ptp_tx_tstamp_work() may loop forever 
> trying to process the packets. In this case it makes sense to just drop them.
> 

Hi,

Do you have some more details on what situation caused this state? Or is this just based on code review?

It won't loop forever because if link is down for more than 2 seconds we'll discard the old timestamps which we assume are not going to arrive.

The trouble is that if a timestamp *does* arrive late, we need to ensure that we never assign the captured time to the wrong packet, and that for
E822 devices we always read the correct number of timestamps (otherwise we can get the logic for timestamp interrupt generation broken).

Consider for example this flow for e810:

1) a tx packet with a timestamp request is scheduled to hw
2) the packet begins being transmitted
3) link goes down
4) interrupt triggers, ice_ptp_tx_tstamp is run
5) link is down, so we skip reading this timestamp. Since we don't read the timestamp, we just discard the skb and we don't update the cached tx timestamp value
6) link goes up
7) 2 tx packets with a timestamp request are sent and one of them is on the same index as the packet in (1)
8) the other tx packet completes and we get an interrupt
9) the loop reads both timestamps. Since the tx packet in slot (1) doesn't match its cached value it looks "new" so the function reports the old timestamp to the wrong packet.

Consider this flow for e822:

1) 2 tx packets with a timestamp request are scheduled to hw
2) the packets begin being transmitted
3) link goes down
4) an interrupt for the Tx timestamp is triggered, but we don't read the timestamps because we have link down and we skipped the ts_read.
5) the internal e822 hardware counter is not decremented due to no reads
6) no more timestamp interrupts will be triggered by hardware until we read the appropriate number of timestamps

I am not sure if link going up will properly reset and re-initialize the Tx timestamp block but I suspect it does not. @Karol, @Siddaraju, @Michal, do you recall more details on this?

I understand the desire to avoid polling when unnecessary, but I am worried because the hardware and firmware interactions here are pretty complex and its easy to get this wrong. (see: all of the previous patches and bug fixes we've been working on... we got this wrong a LOT
already...)

Without a more concrete explanation of what this fixes I'm worried about this change :(

At a minimum I think I would only set drop_ts but not not goto skip_ts_read.

That way if we happen to have a ready timestamp (for E822) we'll still read it and avoid the miscounting from not reading a completed timestamp.

This also ensures that on e810 the cached timestamp value is updated and that we avoid the other situation.

I'd still prefer if you have a bug report or more details on the failure case. I believe even if we poll it should be no more than 2 seconds for an old timestamp that never got sent to be discarded.

> Signed-off-by: Daniel Vacek <neelx@redhat.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_ptp.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c 
> b/drivers/net/ethernet/intel/ice/ice_ptp.c
> index d63161d73eb16..c313177ba6676 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> @@ -680,6 +680,7 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>  	struct ice_pf *pf;
>  	struct ice_hw *hw;
>  	u64 tstamp_ready;
> +	bool link_up;
>  	int err;
>  	u8 idx;
>  
> @@ -695,6 +696,8 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>  	if (err)
>  		return false;
>  
> +	link_up = hw->port_info->phy.link_info.link_info & ICE_AQ_LINK_UP;
> +
>  	for_each_set_bit(idx, tx->in_use, tx->len) {
>  		struct skb_shared_hwtstamps shhwtstamps = {};
>  		u8 phy_idx = idx + tx->offset;
> @@ -702,6 +705,12 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>  		bool drop_ts = false;
>  		struct sk_buff *skb;
>  
> +		/* Drop packets if the link went down */
> +		if (!link_up) {
> +			drop_ts = true;
> +			goto skip_ts_read;
> +		}
> +
>  		/* Drop packets which have waited for more than 2 seconds */
>  		if (time_is_before_jiffies(tx->tstamps[idx].start + 2 * HZ)) {
>  			drop_ts = true;
  

Patch

diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
index d63161d73eb16..c313177ba6676 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
@@ -680,6 +680,7 @@  static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
 	struct ice_pf *pf;
 	struct ice_hw *hw;
 	u64 tstamp_ready;
+	bool link_up;
 	int err;
 	u8 idx;
 
@@ -695,6 +696,8 @@  static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
 	if (err)
 		return false;
 
+	link_up = hw->port_info->phy.link_info.link_info & ICE_AQ_LINK_UP;
+
 	for_each_set_bit(idx, tx->in_use, tx->len) {
 		struct skb_shared_hwtstamps shhwtstamps = {};
 		u8 phy_idx = idx + tx->offset;
@@ -702,6 +705,12 @@  static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
 		bool drop_ts = false;
 		struct sk_buff *skb;
 
+		/* Drop packets if the link went down */
+		if (!link_up) {
+			drop_ts = true;
+			goto skip_ts_read;
+		}
+
 		/* Drop packets which have waited for more than 2 seconds */
 		if (time_is_before_jiffies(tx->tstamps[idx].start + 2 * HZ)) {
 			drop_ts = true;