Message ID | 20230117181533.2350335-1-neelx@redhat.com
---|---
State | New |
Headers |
From: Daniel Vacek <neelx@redhat.com>
To: Jesse Brandeburg <jesse.brandeburg@intel.com>, Tony Nguyen <anthony.l.nguyen@intel.com>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Richard Cochran <richardcochran@gmail.com>
Cc: Daniel Vacek <neelx@redhat.com>, intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH] ice/ptp: fix the PTP worker retrying indefinitely if the link went down
Date: Tue, 17 Jan 2023 19:15:32 +0100
Message-Id: <20230117181533.2350335-1-neelx@redhat.com>
Series | ice/ptp: fix the PTP worker retrying indefinitely if the link went down
Commit Message
Daniel Vacek
Jan. 17, 2023, 6:15 p.m. UTC
When the link goes down, ice_ptp_tx_tstamp_work() may loop forever trying
to process the packets. In this case it makes sense to just drop them.
Signed-off-by: Daniel Vacek <neelx@redhat.com>
---
drivers/net/ethernet/intel/ice/ice_ptp.c | 9 +++++++++
1 file changed, 9 insertions(+)
Comments
On 1/17/2023 10:15 AM, Daniel Vacek wrote:
> When the link goes down the ice_ptp_tx_tstamp_work() may loop forever trying
> to process the packets. In this case it makes sense to just drop them.
>

Hi,

Do you have some more details on what situation caused this state? Or is this just based on code review?

It won't loop forever, because if link is down for more than 2 seconds we'll discard the old timestamps which we assume are not going to arrive.

The trouble is that if a timestamp *does* arrive late, we need to ensure that we never assign the captured time to the wrong packet, and that for E822 devices we always read the correct number of timestamps (otherwise we can break the logic for timestamp interrupt generation).

Consider for example this flow for e810:

1) a tx packet with a timestamp request is scheduled to hw
2) the packet begins being transmitted
3) link goes down
4) interrupt triggers, ice_ptp_tx_tstamp is run
5) link is down, so we skip reading this timestamp. Since we don't read the timestamp, we just discard the skb and we don't update the cached tx timestamp value
6) link goes up
7) 2 tx packets with a timestamp request are sent and one of them is on the same index as the packet in (1)
8) the other tx packet completes and we get an interrupt
9) the loop reads both timestamps. Since the tx packet in slot (1) doesn't match its cached value it looks "new", so the function reports the old timestamp for the wrong packet.

Consider this flow for e822:

1) 2 tx packets with a timestamp request are scheduled to hw
2) the packets begin being transmitted
3) link goes down
4) an interrupt for the Tx timestamp is triggered, but we don't read the timestamps because we have link down and we skipped the ts_read
5) the internal e822 hardware counter is not decremented due to no reads
6) no more timestamp interrupts will be triggered by hardware until we read the appropriate number of timestamps

I am not sure if link going up will properly reset and re-initialize the Tx timestamp block, but I suspect it does not. @Karol, @Siddaraju, @Michal, do you recall more details on this?

I understand the desire to avoid polling when unnecessary, but I am worried because the hardware and firmware interactions here are pretty complex and it's easy to get this wrong. (See: all of the previous patches and bug fixes we've been working on... we got this wrong a LOT already...)

Without a more concrete explanation of what this fixes I'm worried about this change :(

At a minimum I think I would only set drop_ts but not goto skip_ts_read. That way, if we happen to have a ready timestamp (for E822), we'll still read it and avoid the miscounting from not reading a completed timestamp. This also ensures that on e810 the cached timestamp value is updated and that we avoid the other situation.

I'd still prefer if you have a bug report or more details on the failure case. I believe even if we poll it should be no more than 2 seconds for an old timestamp that never got sent to be discarded.
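For illustration only, the e810 misattribution flow described above can be modeled as a toy simulation. This is Python, not driver code, and every name in it is invented; the only thing it mirrors is the rule that a raw timestamp readout is treated as "new" when it differs from the slot's cached value.

```python
# Toy model of the e810 race described above -- not driver code.
# A Tx slot caches the last timestamp value read from the PHY; a raw
# readout is treated as "new" only if it differs from the cached value.

class Slot:
    def __init__(self):
        self.cached_tstamp = 0
        self.skb = None

def read_slot(slot, raw_tstamp):
    """Return the skb the timestamp gets attributed to, or None."""
    if raw_tstamp == slot.cached_tstamp:
        return None            # looks stale, ignored
    slot.cached_tstamp = raw_tstamp
    return slot.skb            # looks "new" -> attributed to current skb

slot = Slot()

# Steps (1)-(5): packet A requests a timestamp, link drops, the handler
# skips the read and just discards the skb. The cache is NOT updated
# with A's latched timestamp.
slot.skb = "packet A"
ts_A = 1000                    # latched in hw, never read back
slot.skb = None                # skb discarded on link down

# Steps (6)-(8): link comes back up, packet B reuses the same slot index.
slot.skb = "packet B"

# Step (9): the loop finally reads the slot. A's old timestamp differs
# from the cached value, so it "looks new" and is reported for packet B.
victim = read_slot(slot, ts_A)
print(victim)
```

Running the scenario attributes `ts_A` to `"packet B"`, which is the misattribution the flow above warns about.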
> Signed-off-by: Daniel Vacek <neelx@redhat.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_ptp.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> index d63161d73eb16..c313177ba6676 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> @@ -680,6 +680,7 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>  	struct ice_pf *pf;
>  	struct ice_hw *hw;
>  	u64 tstamp_ready;
> +	bool link_up;
>  	int err;
>  	u8 idx;
>
> @@ -695,6 +696,8 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>  	if (err)
>  		return false;
>
> +	link_up = hw->port_info->phy.link_info.link_info & ICE_AQ_LINK_UP;
> +
>  	for_each_set_bit(idx, tx->in_use, tx->len) {
>  		struct skb_shared_hwtstamps shhwtstamps = {};
>  		u8 phy_idx = idx + tx->offset;
> @@ -702,6 +705,12 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>  		bool drop_ts = false;
>  		struct sk_buff *skb;
>
> +		/* Drop packets if the link went down */
> +		if (!link_up) {
> +			drop_ts = true;
> +			goto skip_ts_read;
> +		}
> +
>  		/* Drop packets which have waited for more than 2 seconds */
>  		if (time_is_before_jiffies(tx->tstamps[idx].start + 2 * HZ)) {
>  			drop_ts = true;
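The E822 concern raised in the review, skipping the read leaves the hardware's outstanding-timestamp count undrained, can be contrasted with the "set drop_ts but still read" alternative in a toy counter model. This is illustrative Python only, not driver code, and the function and parameter names are invented for the sketch.

```python
# Toy model contrasting the two link-down policies in this thread.
# On E822 the hardware tracks how many latched timestamps are still
# unread and only raises further interrupts once they are read back.

def process(slots_ready, link_up, skip_read_on_link_down):
    """Return how many hw timestamp 'credits' remain unread."""
    outstanding = slots_ready
    for _ in range(slots_ready):
        drop_ts = not link_up
        if drop_ts and skip_read_on_link_down:
            continue        # 'goto skip_ts_read': skb dropped, no read
        outstanding -= 1    # timestamp read back, hw counter drained
    return outstanding

# Patch as posted: link down + goto leaves the hw counter undrained,
# so no further timestamp interrupts would fire.
print(process(2, link_up=False, skip_read_on_link_down=True))

# Reviewer's alternative: set drop_ts but still read, counter drains.
print(process(2, link_up=False, skip_read_on_link_down=False))
```

In the model, the goto policy leaves 2 credits outstanding after a two-packet link-down interrupt, while the read-anyway policy leaves 0, which is the miscounting difference the review describes.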
On Tue, Jan 17, 2023 at 7:47 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> On 1/17/2023 10:15 AM, Daniel Vacek wrote:
> > When the link goes down the ice_ptp_tx_tstamp_work() may loop forever trying
> > to process the packets. In this case it makes sense to just drop them.
> >
>
> Hi,
>
> Do you have some more details on what situation caused this state? Or is
> this just based on code review?
>
> It won't loop forever because if link is down for more than 2 seconds
> we'll discard the old timestamps which we assume are not going to arrive.

Oh, my bad. I should have improved the commit message. This comes from RHEL 8.4, where we do not have commit 0dd9286263923 ("ice: handle discarding old Tx requests in ice_ptp_tx_tstamp") yet.

Though, I still consider 2 seconds being 'forever-like'.

What really happened is that after the link went down, the kthread worker 'ice-ptp-0000:10', being a high-prio RT class task, kept looping on CPU 33 for almost three minutes, starving other tasks with lower priority. This resulted in an RCU stall and finally in a panic due to hung tasks.

The machine is using the E810 NIC.
> [1193024.082447] ice 0000:10:00.1 ens1f1: NIC Link is Down
> [1193024.087861] iavf 0000:10:09.1 net1: NIC Link is Down
> [1193024.093077] iavf 0000:10:09.2 llsm1: NIC Link is Down
> [1193024.099324] iavf 0000:10:09.3 llsm1: NIC Link is Down
> [1193060.630500] ice 0000:10:00.2 ens1f2: NIC Link is up 25 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Contr>
> [1193060.647921] iavf 0000:10:11.1 llsm2: NIC Link is Up Speed is 25 Gbps Full Duplex
> [1193060.655575] iavf 0000:10:11.2 net2: NIC Link is Up Speed is 25 Gbps Full Duplex
> [1193060.663140] iavf 0000:10:11.3 llsm2: NIC Link is Up Speed is 25 Gbps Full Duplex
> [1193084.383798] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [1193084.383799] rcu: Tasks blocked on level-1 rcu_node (CPUs 32-47): P183189
> [1193084.383804] (detected by 0, t=60002 jiffies, g=379331397, q=294595)
> [1193084.383807] runc R running task 0 183189 183175 0x00084080
> [1193088.721408] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P161097 P183189 } 63616 jiffies s: 1074129 root: 0x4/.
> [1193089.074856] rcu: blocking rcu_node structures: l=1:32-47:0x0/T
>
> [1193199.313106] INFO: task xfsaild/sda4:1428 blocked for more than 120 seconds.
>
> [1193201.399021] NMI backtrace for cpu 33
> [1193201.399021] CPU: 33 PID: 1354 Comm: ice-ptp-0000:10 Kdump: loaded Tainted: G W --------- - - 4.18.0-305.49.1.rt7.121.el8_4.x86_64 #1
>
> [1193201.399050] Kernel panic - not syncing: hung_task: blocked tasks
> [1193201.399052] CPU: 63 PID: 559 Comm: khungtaskd Kdump: loaded Tainted: G W --------- - - 4.18.0-305.49.1.rt7.121.el8_4.x86_64 #1

> crash> bc 33
> PID: 1354 TASK: ff48df6f2c5c8000 CPU: 33 COMMAND: "ice-ptp-0000:10"
> #0 [fffffe000069ce38] crash_nmi_callback+0x33 at ffffffffb664cb63
> #1 [fffffe000069ce40] nmi_handle+0x5b at ffffffffb661c68b
> #2 [fffffe000069ce98] default_do_nmi+0x72 at ffffffffb6f4a562
> #3 [fffffe000069cec0] do_nmi+0x18d at ffffffffb661cd1d
> #4 [fffffe000069cef0] end_repeat_nmi+0x16 at ffffffffb7001544
>     [exception RIP: delay_tsc+0x6d]
>     RIP: ffffffffb6f4794d RSP: ff7daf7a4aa2fd48 RFLAGS: 00000246
>     RAX: 000000008cf66cff RBX: 0006582e8cf66caf RCX: 0000000000000000
>     RDX: 000000000006582e RSI: ffffffffb76eefe3 RDI: ffffffffb76d1b60
>     RBP: 0006582e8cf5ef89 R8: 000000000000000c R9: 0000000000000000
>     R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000249f2
>     R13: 0000000000000021 R14: ff48df63ec9aa3e0 R15: ff48df57541d0300
>     ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #5 [ff7daf7a4aa2fd48] delay_tsc+0x6d at ffffffffb6f4794d
> ## [ endcall jmp 68] __const_udelay+0x37 at ffffffffb6f477f7
> #6 [ff7daf7a4aa2fd68] ice_sq_send_cmd+0x356 at ffffffffc0612ee6 [ice]
> ## [ inlined c0] ice_sbq_send_cmd at drivers/net/ethernet/intel/ice/ice_common.c:1407
> #7 [ff7daf7a4aa2fdc0] ice_sbq_rw_reg+0xc4 at ffffffffc06147e4 [ice]
> #8 [ff7daf7a4aa2fe18] ice_read_phy_reg_e810+0x4d at ffffffffc0655b0d [ice]
> ## [ inlined 50] ice_read_phy_tstamp_e810 at drivers/net/ethernet/intel/ice/ice_ptp_hw.c:2608
> #9 [ff7daf7a4aa2fe50] ice_read_phy_tstamp+0x5a at ffffffffc06584ca [ice]
>
> #10 [ff7daf7a4aa2fe90] ice_ptp_tx_tstamp_work+0x95 at ffffffffc0654185 [ice]
> #11 [ff7daf7a4aa2fee0] kthread_worker_fn+0xab at ffffffffb6701c4b
> #12 [ff7daf7a4aa2ff10] kthread+0x15d at ffffffffb6701b7d
>
> crash> delay_fn
> delay_fn = (void (*)(u64)) 0xffffffffb6f478e0 <delay_tsc>
>
> crash> bc 33 | grep -o 'R[BDA1][PX2]: 0[^ ]*'
> RAX: 000000008cf66cff
> RBX: 0006582e8cf66caf
> RDX: 000000000006582e
> RBP: 0006582e8cf5ef89
> R12: 00000000000249f2
>
> 249f2           r12     cycles (150002)
> 6582e8cf5ef89   rbp     bclock
> 6582e8cf66cff   rdx:rax now
> 6582e8cf66caf   rbx     last
>
> 962 ice_sq_send_cmd(struct ice_hw *hw, struct ice_ctl_q_info *cq,
> --
> 965 {
> --
> 1063         do {
> 1064                 if (ice_sq_done(hw, cq))
> 1065                         break;
> 1066
> 1067                 udelay(ICE_CTL_Q_SQ_CMD_USEC);
> 1068                 total_delay++;
> 1069         } while (total_delay < cq->sq_cmd_timeout);
>
> 19 /* 0x10c7 is 2**32 / 1000000 (rounded up) */
> 20 #define udelay(n) \
> --
> 26         __const_udelay((n) * 0x10c7ul); \
>
> crash> pd loops_per_jiffy
> loops_per_jiffy = 1500000
>
> #define ICE_CTL_Q_SQ_CMD_USEC 100 /* Check every 100usec */
>
> crash> pd (100*0x10c7*4*250*loops_per_jiffy>>32)+1
> 150002
>
> crash> p 100*0x10c7ul
> 0x68dbc
>
> crash> dis ice_sq_send_cmd+0x346 13
> 0xffffffffc0612ed6 <ice_sq_send_cmd+0x346>: jmp 0xffffffffc0612eef <ice_sq_send_cmd+0x35f>
>
> 0xffffffffc0612ed8 <ice_sq_send_cmd+0x348>: mov $0x68dbc,%edi
> 0xffffffffc0612edd <ice_sq_send_cmd+0x34d>: add $0x1,%r13d
> 0xffffffffc0612ee1 <ice_sq_send_cmd+0x351>: call 0xffffffffb6f477c0 <__const_udelay>
> 0xffffffffc0612ee6 <ice_sq_send_cmd+0x356>: cmp %r13d,0xc8(%rbp)
> 0xffffffffc0612eed <ice_sq_send_cmd+0x35d>: jbe 0xffffffffc0612f05 <ice_sq_send_cmd+0x375>
>
> 0xffffffffc0612eef <ice_sq_send_cmd+0x35f>: mov 0xa0(%rbp),%eax
> 0xffffffffc0612ef5 <ice_sq_send_cmd+0x365>: add (%rbx),%rax
> 0xffffffffc0612ef8 <ice_sq_send_cmd+0x368>: mov (%rax),%edx
> 0xffffffffc0612efa <ice_sq_send_cmd+0x36a>: movzwl 0x9a(%rbp),%eax
> 0xffffffffc0612f01 <ice_sq_send_cmd+0x371>: cmp %eax,%edx
> 0xffffffffc0612f03 <ice_sq_send_cmd+0x373>: jne 0xffffffffc0612ed8 <ice_sq_send_cmd+0x348>
>
> 0xffffffffc0612f05 <ice_sq_send_cmd+0x375>: mov 0xa0(%rbp),%eax
>
> crash> dis delay_tsc 2
> 0xffffffffb6f478e0 <delay_tsc>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
> 0xffffffffb6f478e5 <delay_tsc+0x5>: push %r13
>
> crash> rx ff7daf7a4aa2fd60
> ff7daf7a4aa2fd60: 0000000000000001 // total_delay

It's the first time the loop in function ice_sq_send_cmd() is being iterated. The ice_sq_send_cmd() has to be called all over again.

> crash> dis ice_sq_send_cmd+0x2d
> 0xffffffffc0612bbd <ice_sq_send_cmd+0x2d>: mov %rsi,%rbp
>
> crash> dis delay_tsc 6
> 0xffffffffb6f478e0 <delay_tsc>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
> 0xffffffffb6f478e5 <delay_tsc+0x5>: push %r13
> 0xffffffffb6f478e7 <delay_tsc+0x7>: push %r12
> 0xffffffffb6f478e9 <delay_tsc+0x9>: mov %rdi,%r12
> 0xffffffffb6f478ec <delay_tsc+0xc>: mov $0x1,%edi
> 0xffffffffb6f478f1 <delay_tsc+0x11>: push %rbp
>
> crash> rx ff7daf7a4aa2fd50
> ff7daf7a4aa2fd50: ff48df6f05dd6730
>
> crash> ice_ctl_q_info.sq_cmd_timeout,sq.head,sq.next_to_use ff48df6f05dd6730
>   sq_cmd_timeout = 0x2710 = 10000,
>   sq.head = 0x80300,
>   sq.next_to_use = 0x20,

> crash> runqc 33
> CPU 33 RUNQUEUE: ff48df6f3f86c380
>   CURRENT: PID: 1354 TASK: ff48df6f2c5c8000 COMMAND: "ice-ptp-0000:10"
>   RT PRIO_ARRAY: ff48df6f3f86c5c0
>      [ 89] PID: 1354 TASK: ff48df6f2c5c8000 COMMAND: "ice-ptp-0000:10"
>      [ 98] PID: 161097 TASK: ff48df634a111ec0 COMMAND: "handler111"
>   CFS RB_ROOT: ff48df6f3f86c430
>      [100] PID: 583 TASK: ff48df6f37845c40 COMMAND: "kworker/33:1H"
>      [120] PID: 4113728 TASK: ff48df5055db9ec0 COMMAND: "kworker/33:6"
>      [120] PID: 183189 TASK: ff48df5d24aa8000 COMMAND: "runc"
>
> crash> ps -m 1354 161097 583 4113728 183189
> [ 0 00:02:56.990] [RU] PID: 1354 TASK: ff48df6f2c5c8000 CPU: 33 COMMAND: "ice-ptp-0000:10"
> [ 0 00:02:56.990] [RU] PID: 161097 TASK: ff48df634a111ec0 CPU: 33 COMMAND: "handler111"
> [ 0 00:02:56.991] [RU] PID: 4113728 TASK: ff48df5055db9ec0 CPU: 33 COMMAND: "kworker/33:6"
> [ 0 00:02:57.015] [RU] PID: 183189 TASK: ff48df5d24aa8000 CPU: 33 COMMAND: "runc"
> [ 0 00:02:57.030] [RU] PID: 583 TASK: ff48df6f37845c40 CPU: 33 COMMAND: "kworker/33:1H"

These runnable tasks have lower priority and hence they are being starved.

> crash> ps -mC33 | head -13
> CPU: 33
> [ 0 00:02:56.990] [RU] PID: 1354 TASK: ff48df6f2c5c8000 CPU: 33 COMMAND: "ice-ptp-0000:10"
> [ 0 00:02:56.990] [RU] PID: 161097 TASK: ff48df634a111ec0 CPU: 33 COMMAND: "handler111"
> [ 0 00:02:56.991] [RU] PID: 4113728 TASK: ff48df5055db9ec0 CPU: 33 COMMAND: "kworker/33:6"
> [ 0 00:02:56.991] [IN] PID: 278 TASK: ff48df504fd08000 CPU: 33 COMMAND: "ksoftirqd/33"
> [ 0 00:02:57.014] [IN] PID: 277 TASK: ff48df504fcebd80 CPU: 33 COMMAND: "rcuc/33"
> [ 0 00:02:57.015] [RU] PID: 183189 TASK: ff48df5d24aa8000 CPU: 33 COMMAND: "runc"
> [ 0 00:02:57.016] [IN] PID: 146 TASK: ff48df5040268000 CPU: 33 COMMAND: "rcuop/16"
> [ 0 00:02:57.020] [IN] PID: 276 TASK: ff48df504fce8000 CPU: 33 COMMAND: "migration/33"
> [ 0 00:02:57.024] [IN] PID: 413 TASK: ff48df504f971ec0 CPU: 33 COMMAND: "rcuop/49"
> [ 0 00:02:57.024] [IN] PID: 494 TASK: ff48df504f43bd80 CPU: 33 COMMAND: "rcuop/59"
> [ 0 00:02:57.029] [UN] PID: 183233 TASK: ff48df5193618000 CPU: 33 COMMAND: "exe"
> [ 0 00:02:57.030] [RU] PID: 583 TASK: ff48df6f37845c40 CPU: 33 COMMAND: "kworker/33:1H"

> crash> ice_pf.pdev,int_name ff48df6f05dd4180
>   pdev = 0xff48df6f37fe6000,
>   int_name = "ice-0000:10:00.1:misc",
>
> crash> ice_hw.mac_type ff48df6f05dd4dc0
>   mac_type = ICE_MAC_E810,
> crash> ice_ptp_tx.work.func,len,in_use,quad,quad_offset ff48df6f05dd44e0
>   work.func = 0xffffffffc06540f0 <ice_ptp_tx_tstamp_work>,
>   len = 0x40,
>   in_use = 0xff48df63d0354cc0,
>   quad = 0x1,
>   quad_offset = 0x0,
>
> crash> rx 0xff48df63d0354cc0
> ff48df63d0354cc0: 0000000000000001
> crash> ice_ptp_tx.tstamps ff48df6f05dd44e0
>   tstamps = 0xff48df62d6e69800,
>
> crash> ice_tx_tstamp.start,cached_tstamp,skb 0xff48df62d6e69800
>   start = 0x14717935c,
>   cached_tstamp = 0xbe9c63f31b
>   skb = 0xff48df5e20552f00,
>
> crash> pd jiffies-0x14717935c
> 177354

The packet is waiting for almost three minutes. The mentioned commit 0dd9286263923 would definitely skip it and clear its bit in the tx->in_use mask, but still only two seconds too late, which needs to be addressed. And since the link went down, I guess just dropping the packets is a reasonable approach.

Checking the latest upstream code in v6.2-rc4, I *guess* this can still happen as described above. Just not looping forever, but rather only for 2 seconds. But doing nothing for two seconds is just as bad, really. Hence the patch I'm suggesting.

> The trouble is that if a timestamp *does* arrive late, we need to ensure
> that we never assign the captured time to the wrong packet, and that for
> E822 devices we always read the correct number of timestamps (otherwise
> we can get the logic for timestamp interrupt generation broken).
>
> Consider for example this flow for e810:
>
> 1) a tx packet with a timestamp request is scheduled to hw
> 2) the packet begins being transmitted
> 3) link goes down
> 4) interrupt triggers, ice_ptp_tx_tstamp is run
> 5) link is down, so we skip reading this timestamp. Since we don't read
> the timestamp, we just discard the skb and we don't update the cached tx
> timestamp value
> 6) link goes up
> 7) 2 tx packets with a timestamp request are sent and one of them is on
> the same index as the packet in (1)
> 8) the other tx packet completes and we get an interrupt
> 9) the loop reads both timestamps. Since the tx packet in slot (1)
> doesn't match its cached value it looks "new" so the function reports
> the old timestamp to the wrong packet.

I don't really fully understand the HW nor the driver, to be honest. For sure, I can see it kept looping as the work is being re-queued due to the tx->in_use bitmap not being cleared.
And that can only happen when ice_read_phy_tstamp() returns with an error, or the returned &raw_tstamp is not valid, or it matches the cached value (considering our version of ice_ptp_tx_tstamp() looks like the one in the v5.15 stable tree). And that said, the v5.15 stable tree needs a backport of this fix as well, I guess:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/net/ethernet/intel/ice/ice_ptp.c?h=v5.15.88#n1170

> Consider this flow for e822:
>
> 1) 2 tx packets with a timestamp request are scheduled to hw
> 2) the packets begin being transmitted
> 3) link goes down
> 4) an interrupt for the Tx timestamp is triggered, but we don't read the
> timestamps because we have link down and we skipped the ts_read.
> 5) the internal e822 hardware counter is not decremented due to no reads
> 6) no more timestamp interrupts will be triggered by hardware until we
> read the appropriate number of timestamps
>
> I am not sure if link going up will properly reset and re-initialize the
> Tx timestamp block but I suspect it does not. @Karol, @Siddaraju,
> @Michal, do you recall more details on this?
>
> I understand the desire to avoid polling when unnecessary, but I am
> worried because the hardware and firmware interactions here are pretty
> complex and it's easy to get this wrong. (see: all of the previous
> patches and bug fixes we've been working on... we got this wrong a LOT
> already...)

I see. Here my favorite quote applies: "Everybody falls the first time." I totally understand, and I cannot imagine developing and maintaining this without a solid test suite. On the other hand, some things (especially races) could be tricky to reproduce.

> Without a more concrete explanation of what this fixes I'm worried about
> this change :(

Hopefully I already managed to shed a bit more light here. Again, I'm sorry about the brief commit message.

> At a minimum I think I would only set drop_ts but not goto skip_ts_read.
IIUC, that would still fail to clear the tx->in_use bit in case ice_read_phy_tstamp() returns with an error. It would only work for the other case, where no error is returned but the returned &raw_tstamp is invalid. I'll send a v2 of this fix trying to address the goto concern.

> crash> ice_sbq_msg_input ff7daf7a4aa2fe2c
> struct ice_sbq_msg_input {
>   dest_dev = 0x2,
>   opcode = 0x0,
>   msg_addr_low = 0x1000,
>   msg_addr_high = 0x309,
>   data = 0x0
> }
> crash> ice_sbq_msg_req ff7daf7a4aa2fdc8
> struct ice_sbq_msg_req {
>   dest_dev = 0x2,
>   src_dev = 0x0,
>   opcode = 0x0,
>   flags = 0x40,
>   sbe_fbe = 0xf,
>   func_id = 0x0,
>   msg_addr_low = 0x1000,
>   msg_addr_high = 0x309,
>   data = 0x0
> }
> crash> ice_sbq_cmd_desc ff7daf7a4aa2fdd8
> struct ice_sbq_cmd_desc {
>   flags = 0x1400,
>   opcode = 0xc00,
>   datalen = 0x0,
>   cmd_retval = 0x0,
>   cookie_high = 0x0,
>   cookie_low = 0x0,
>   param0 = {
>     cmd_len = 0xc,
>     cmpl_len = 0xc
>   },
>   reserved = "",
>   addr_high = 0x0,
>   addr_low = 0x0
> }

I see commit 1229b33973c7b ("ice: Add low latency Tx timestamp read") got rid of that kthread_work in favor of a threaded irq. But ice_misc_intr_thread_fn() keeps waking itself in the same manner as the work did keep re-queuing itself. So the loop still needs to be broken (i.e. the tx->in_use bitmap cleared). Even for v6.2-rc4.

> That way if we happen to have a ready timestamp (for E822) we'll still
> read it and avoid the miscounting from not reading a completed timestamp.
>
> This also ensures that on e810 the cached timestamp value is updated and
> that we avoid the other situation.
>
> I'd still prefer if you have a bug report or more details on the failure
> case. I believe even if we poll it should be no more than 2 seconds for
> an old timestamp that never got sent to be discarded.

Unfortunately, I only have this one report with a dump from the customer's testing. The uptime was almost 14 days, so I'm not sure how easy it is to reproduce. Again, hopefully the above covers enough details.
I'm happy to share more if I missed some. But looping in a softirq (or worker thread) for two seconds is not really acceptable. That's just a wrong design. Right?

--nX

> > Signed-off-by: Daniel Vacek <neelx@redhat.com>
> > ---
> >  drivers/net/ethernet/intel/ice/ice_ptp.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> > index d63161d73eb16..c313177ba6676 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> > @@ -680,6 +680,7 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
> >  	struct ice_pf *pf;
> >  	struct ice_hw *hw;
> >  	u64 tstamp_ready;
> > +	bool link_up;
> >  	int err;
> >  	u8 idx;
> >
> > @@ -695,6 +696,8 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
> >  	if (err)
> >  		return false;
> >
> > +	link_up = hw->port_info->phy.link_info.link_info & ICE_AQ_LINK_UP;
> > +
> >  	for_each_set_bit(idx, tx->in_use, tx->len) {
> >  		struct skb_shared_hwtstamps shhwtstamps = {};
> >  		u8 phy_idx = idx + tx->offset;
> > @@ -702,6 +705,12 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
> >  		bool drop_ts = false;
> >  		struct sk_buff *skb;
> >
> > +		/* Drop packets if the link went down */
> > +		if (!link_up) {
> > +			drop_ts = true;
> > +			goto skip_ts_read;
> > +		}
> > +
> >  		/* Drop packets which have waited for more than 2 seconds */
> >  		if (time_is_before_jiffies(tx->tstamps[idx].start + 2 * HZ)) {
> >  			drop_ts = true;
>
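As a sanity check, the arithmetic from the crash session quoted above can be re-verified. Python is used here purely as a calculator; HZ = 1000 is an assumption for this RHEL RT kernel (it is consistent with the 4*250 factor used in the crash session's own formula and with the ~00:02:57 run times shown by `ps -m`).

```python
# Cross-checking the numbers in the crash analysis above.

loops_per_jiffy = 1500000  # from 'crash> pd loops_per_jiffy'

# udelay(ICE_CTL_Q_SQ_CMD_USEC) expands to __const_udelay(100 * 0x10c7),
# matching the 'mov $0x68dbc,%edi' seen in the disassembly:
assert 100 * 0x10c7 == 0x68dbc

# R12 held the TSC loop count for one udelay(100); the crash session's
# back-of-envelope formula (with 4*250 = HZ = 1000) reproduces it:
assert (100 * 0x10c7 * 4 * 250 * loops_per_jiffy >> 32) + 1 == 150002
assert 0x249f2 == 150002

# sq_cmd_timeout = 0x2710 polls of 100us each, i.e. up to ~1 second of
# busy-waiting per ice_sq_send_cmd() call:
assert 0x2710 == 10000
assert 0x2710 * 100 / 1_000_000 == 1.0

# The stuck slot was 177354 jiffies old; at HZ=1000 that is just under
# three minutes, matching the ~00:02:57 runqueue times:
minutes = 177354 / 1000 / 60
assert 2.9 < minutes < 3.0
```

This supports the report's reading: each ice_sq_send_cmd() call can busy-wait up to a second, and the slot had been retried for close to three minutes.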
On 1/18/2023 7:14 AM, Daniel Vacek wrote:
> On Tue, Jan 17, 2023 at 7:47 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
>> On 1/17/2023 10:15 AM, Daniel Vacek wrote:
>>> When the link goes down the ice_ptp_tx_tstamp_work() may loop forever trying
>>> to process the packets. In this case it makes sense to just drop them.
>>>
>>
>> Hi,
>>
>> Do you have some more details on what situation caused this state? Or is
>> this just based on code review?
>>
>> It won't loop forever because if link is down for more than 2 seconds
>> we'll discard the old timestamps which we assume are not going to arrive.
>
> Oh, my bad. I should have improved the commit message. This
> comes from RHEL 8.4 where we do not have commit 0dd9286263923
> ("ice: handle discarding old Tx requests in ice_ptp_tx_tstamp")
> yet.
>
> Though, I still consider 2 seconds being 'forever like'.
>

It is somewhat forever-like in CPU time, yes.

> What really happened is that after the link went down, the
> kthread worker 'ice-ptp-0000:10' being a high prio RT class
> task kept looping on CPU 33 for almost three minutes starving
> other tasks with lower priority. This resulted in RCU stall
> and finally in panic due to hung tasks.
>

Right, because we essentially used to starve the task that was supposed to clean this up. That is fixed with the commit you mentioned above.

> The machine is using the E810 NIC.
> >> [1193024.082447] ice 0000:10:00.1 ens1f1: NIC Link is Down >> [1193024.087861] iavf 0000:10:09.1 net1: NIC Link is Down >> [1193024.093077] iavf 0000:10:09.2 llsm1: NIC Link is Down >> [1193024.099324] iavf 0000:10:09.3 llsm1: NIC Link is Down >> [1193060.630500] ice 0000:10:00.2 ens1f2: NIC Link is up 25 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Contr> >> [1193060.647921] iavf 0000:10:11.1 llsm2: NIC Link is Up Speed is 25 Gbps Full Duplex >> [1193060.655575] iavf 0000:10:11.2 net2: NIC Link is Up Speed is 25 Gbps Full Duplex >> [1193060.663140] iavf 0000:10:11.3 llsm2: NIC Link is Up Speed is 25 Gbps Full Duplex >> [1193084.383798] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: >> [1193084.383799] rcu: Tasks blocked on level-1 rcu_node (CPUs 32-47): P183189 >> [1193084.383804] (detected by 0, t=60002 jiffies, g=379331397, q=294595) >> [1193084.383807] runc R running task 0 183189 183175 0x00084080 >> [1193088.721408] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P161097 P183189 } 63616 jiffies s: 1074129 root: 0x4/. >> [1193089.074856] rcu: blocking rcu_node structures: l=1:32-47:0x0/T >> >> [1193199.313106] INFO: task xfsaild/sda4:1428 blocked for more than 120 seconds. 
>>
>> [1193201.399021] NMI backtrace for cpu 33
>> [1193201.399021] CPU: 33 PID: 1354 Comm: ice-ptp-0000:10 Kdump: loaded Tainted: G W --------- - - 4.18.0-305.49.1.rt7.121.el8_4.x86_64 #1
>>
>> [1193201.399050] Kernel panic - not syncing: hung_task: blocked tasks
>> [1193201.399052] CPU: 63 PID: 559 Comm: khungtaskd Kdump: loaded Tainted: G W --------- - - 4.18.0-305.49.1.rt7.121.el8_4.x86_64 #1
>
>> crash> bc 33
>> PID: 1354 TASK: ff48df6f2c5c8000 CPU: 33 COMMAND: "ice-ptp-0000:10"
>> #0 [fffffe000069ce38] crash_nmi_callback+0x33 at ffffffffb664cb63
>> #1 [fffffe000069ce40] nmi_handle+0x5b at ffffffffb661c68b
>> #2 [fffffe000069ce98] default_do_nmi+0x72 at ffffffffb6f4a562
>> #3 [fffffe000069cec0] do_nmi+0x18d at ffffffffb661cd1d
>> #4 [fffffe000069cef0] end_repeat_nmi+0x16 at ffffffffb7001544
>> [exception RIP: delay_tsc+0x6d]
>> RIP: ffffffffb6f4794d RSP: ff7daf7a4aa2fd48 RFLAGS: 00000246
>> RAX: 000000008cf66cff RBX: 0006582e8cf66caf RCX: 0000000000000000
>> RDX: 000000000006582e RSI: ffffffffb76eefe3 RDI: ffffffffb76d1b60
>> RBP: 0006582e8cf5ef89 R8: 000000000000000c R9: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000249f2
>> R13: 0000000000000021 R14: ff48df63ec9aa3e0 R15: ff48df57541d0300
>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
>> --- <NMI exception stack> ---
>> #5 [ff7daf7a4aa2fd48] delay_tsc+0x6d at ffffffffb6f4794d
>> ## [ endcall jmp 68] __const_udelay+0x37 at ffffffffb6f477f7
>> #6 [ff7daf7a4aa2fd68] ice_sq_send_cmd+0x356 at ffffffffc0612ee6 [ice]
>> ## [ inlined c0] ice_sbq_send_cmd at drivers/net/ethernet/intel/ice/ice_common.c:1407
>> #7 [ff7daf7a4aa2fdc0] ice_sbq_rw_reg+0xc4 at ffffffffc06147e4 [ice]
>> #8 [ff7daf7a4aa2fe18] ice_read_phy_reg_e810+0x4d at ffffffffc0655b0d [ice]
>> ## [ inlined 50] ice_read_phy_tstamp_e810 at drivers/net/ethernet/intel/ice/ice_ptp_hw.c:2608
>> #9 [ff7daf7a4aa2fe50] ice_read_phy_tstamp+0x5a at ffffffffc06584ca [ice]
>>
>> #10 [ff7daf7a4aa2fe90] ice_ptp_tx_tstamp_work+0x95 at ffffffffc0654185 [ice]
>> #11 [ff7daf7a4aa2fee0] kthread_worker_fn+0xab at ffffffffb6701c4b
>> #12 [ff7daf7a4aa2ff10] kthread+0x15d at ffffffffb6701b7d
>>
>> crash> delay_fn
>> delay_fn = (void (*)(u64)) 0xffffffffb6f478e0 <delay_tsc>
>>
>> crash> bc 33 | grep -o 'R[BDA1][PX2]: 0[^ ]*'
>> RAX: 000000008cf66cff
>> RBX: 0006582e8cf66caf
>> RDX: 000000000006582e
>> RBP: 0006582e8cf5ef89
>> R12: 00000000000249f2
>>
>> 249f2         r12     cycles (150002)
>> 6582e8cf5ef89 rbp     bclock
>> 6582e8cf66cff rdx:rax now
>> 6582e8cf66caf rbx     last
>>
>> 962 ice_sq_send_cmd(struct ice_hw *hw, struct ice_ctl_q_info *cq,
>> --
>> 965 {
>> --
>> 1063         do {
>> 1064                 if (ice_sq_done(hw, cq))
>> 1065                         break;
>> 1066
>> 1067                 udelay(ICE_CTL_Q_SQ_CMD_USEC);
>> 1068                 total_delay++;
>> 1069         } while (total_delay < cq->sq_cmd_timeout);
>>
>> 19 /* 0x10c7 is 2**32 / 1000000 (rounded up) */
>> 20 #define udelay(n) \
>> --
>> 26         __const_udelay((n) * 0x10c7ul); \
>>
>> crash> pd loops_per_jiffy
>> loops_per_jiffy = 1500000
>>
>> #define ICE_CTL_Q_SQ_CMD_USEC 100 /* Check every 100usec */
>>
>> crash> pd (100*0x10c7*4*250*loops_per_jiffy>>32)+1
>> 150002
>>
>> crash> p 100*0x10c7ul
>> 0x68dbc
>>
>> crash> dis ice_sq_send_cmd+0x346 13
>> 0xffffffffc0612ed6 <ice_sq_send_cmd+0x346>: jmp 0xffffffffc0612eef <ice_sq_send_cmd+0x35f>
>>
>> 0xffffffffc0612ed8 <ice_sq_send_cmd+0x348>: mov $0x68dbc,%edi
>> 0xffffffffc0612edd <ice_sq_send_cmd+0x34d>: add $0x1,%r13d
>> 0xffffffffc0612ee1 <ice_sq_send_cmd+0x351>: call 0xffffffffb6f477c0 <__const_udelay>
>> 0xffffffffc0612ee6 <ice_sq_send_cmd+0x356>: cmp %r13d,0xc8(%rbp)
>> 0xffffffffc0612eed <ice_sq_send_cmd+0x35d>: jbe 0xffffffffc0612f05 <ice_sq_send_cmd+0x375>
>>
>> 0xffffffffc0612eef <ice_sq_send_cmd+0x35f>: mov 0xa0(%rbp),%eax
>> 0xffffffffc0612ef5 <ice_sq_send_cmd+0x365>: add (%rbx),%rax
>> 0xffffffffc0612ef8 <ice_sq_send_cmd+0x368>: mov (%rax),%edx
>> 0xffffffffc0612efa <ice_sq_send_cmd+0x36a>: movzwl 0x9a(%rbp),%eax
>> 0xffffffffc0612f01 <ice_sq_send_cmd+0x371>: cmp %eax,%edx
>> 0xffffffffc0612f03 <ice_sq_send_cmd+0x373>: jne 0xffffffffc0612ed8 <ice_sq_send_cmd+0x348>
>>
>> 0xffffffffc0612f05 <ice_sq_send_cmd+0x375>: mov 0xa0(%rbp),%eax
>>
>> crash> dis delay_tsc 2
>> 0xffffffffb6f478e0 <delay_tsc>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
>> 0xffffffffb6f478e5 <delay_tsc+0x5>: push %r13
>>
>> crash> rx ff7daf7a4aa2fd60
>> ff7daf7a4aa2fd60: 0000000000000001 // total_delay
>
> It's the first time the loop in function ice_sq_send_cmd() is being iterated.
> The ice_sq_send_cmd() has to be called all over again.
>
>> crash> dis ice_sq_send_cmd+0x2d
>> 0xffffffffc0612bbd <ice_sq_send_cmd+0x2d>: mov %rsi,%rbp
>>
>> crash> dis delay_tsc 6
>> 0xffffffffb6f478e0 <delay_tsc>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
>> 0xffffffffb6f478e5 <delay_tsc+0x5>: push %r13
>> 0xffffffffb6f478e7 <delay_tsc+0x7>: push %r12
>> 0xffffffffb6f478e9 <delay_tsc+0x9>: mov %rdi,%r12
>> 0xffffffffb6f478ec <delay_tsc+0xc>: mov $0x1,%edi
>> 0xffffffffb6f478f1 <delay_tsc+0x11>: push %rbp
>>
>> crash> rx ff7daf7a4aa2fd50
>> ff7daf7a4aa2fd50: ff48df6f05dd6730
>>
>> crash> ice_ctl_q_info.sq_cmd_timeout,sq.head,sq.next_to_use ff48df6f05dd6730
>> sq_cmd_timeout = 0x2710 = 10000,
>> sq.head = 0x80300,
>> sq.next_to_use = 0x20,
>
>
>> crash> runqc 33
>> CPU 33 RUNQUEUE: ff48df6f3f86c380
>> CURRENT: PID: 1354 TASK: ff48df6f2c5c8000 COMMAND: "ice-ptp-0000:10"
>> RT PRIO_ARRAY: ff48df6f3f86c5c0
>> [ 89] PID: 1354 TASK: ff48df6f2c5c8000 COMMAND: "ice-ptp-0000:10"
>> [ 98] PID: 161097 TASK: ff48df634a111ec0 COMMAND: "handler111"
>> CFS RB_ROOT: ff48df6f3f86c430
>> [100] PID: 583 TASK: ff48df6f37845c40 COMMAND: "kworker/33:1H"
>> [120] PID: 4113728 TASK: ff48df5055db9ec0 COMMAND: "kworker/33:6"
>> [120] PID: 183189 TASK: ff48df5d24aa8000 COMMAND: "runc"
>>
>> crash> ps -m 1354 161097 583 4113728 183189
>> [ 0 00:02:56.990] [RU] PID: 1354 TASK: ff48df6f2c5c8000 CPU: 33 COMMAND: "ice-ptp-0000:10"
>> [ 0 00:02:56.990] [RU] PID: 161097 TASK: ff48df634a111ec0 CPU: 33 COMMAND: "handler111"
>> [ 0 00:02:56.991] [RU] PID: 4113728 TASK: ff48df5055db9ec0 CPU: 33 COMMAND: "kworker/33:6"
>> [ 0 00:02:57.015] [RU] PID: 183189 TASK: ff48df5d24aa8000 CPU: 33 COMMAND: "runc"
>> [ 0 00:02:57.030] [RU] PID: 583 TASK: ff48df6f37845c40 CPU: 33 COMMAND: "kworker/33:1H"
>
> These runnable tasks have a lower priority and hence they are being starved.
>
>> crash> ps -mC33 | head -13
>> CPU: 33
>> [ 0 00:02:56.990] [RU] PID: 1354 TASK: ff48df6f2c5c8000 CPU: 33 COMMAND: "ice-ptp-0000:10"
>> [ 0 00:02:56.990] [RU] PID: 161097 TASK: ff48df634a111ec0 CPU: 33 COMMAND: "handler111"
>> [ 0 00:02:56.991] [RU] PID: 4113728 TASK: ff48df5055db9ec0 CPU: 33 COMMAND: "kworker/33:6"
>> [ 0 00:02:56.991] [IN] PID: 278 TASK: ff48df504fd08000 CPU: 33 COMMAND: "ksoftirqd/33"
>> [ 0 00:02:57.014] [IN] PID: 277 TASK: ff48df504fcebd80 CPU: 33 COMMAND: "rcuc/33"
>> [ 0 00:02:57.015] [RU] PID: 183189 TASK: ff48df5d24aa8000 CPU: 33 COMMAND: "runc"
>> [ 0 00:02:57.016] [IN] PID: 146 TASK: ff48df5040268000 CPU: 33 COMMAND: "rcuop/16"
>> [ 0 00:02:57.020] [IN] PID: 276 TASK: ff48df504fce8000 CPU: 33 COMMAND: "migration/33"
>> [ 0 00:02:57.024] [IN] PID: 413 TASK: ff48df504f971ec0 CPU: 33 COMMAND: "rcuop/49"
>> [ 0 00:02:57.024] [IN] PID: 494 TASK: ff48df504f43bd80 CPU: 33 COMMAND: "rcuop/59"
>> [ 0 00:02:57.029] [UN] PID: 183233 TASK: ff48df5193618000 CPU: 33 COMMAND: "exe"
>> [ 0 00:02:57.030] [RU] PID: 583 TASK: ff48df6f37845c40 CPU: 33 COMMAND: "kworker/33:1H"
>
>> crash> ice_pf.pdev,int_name ff48df6f05dd4180
>> pdev = 0xff48df6f37fe6000,
>> int_name = "ice-0000:10:00.1:misc",
>>
>> crash> ice_hw.mac_type ff48df6f05dd4dc0
>> mac_type = ICE_MAC_E810,
>
>> crash> ice_ptp_tx.work.func,len,in_use,quad,quad_offset ff48df6f05dd44e0
>> work.func = 0xffffffffc06540f0 <ice_ptp_tx_tstamp_work>,
>> len = 0x40,
>> in_use = 0xff48df63d0354cc0,
>> quad = 0x1,
>> quad_offset = 0x0,
>>
>> crash> rx 0xff48df63d0354cc0
>> ff48df63d0354cc0: 0000000000000001
>
>> crash> ice_ptp_tx.tstamps ff48df6f05dd44e0
>> tstamps = 0xff48df62d6e69800,
>>
>> crash> ice_tx_tstamp.start,cached_tstamp,skb 0xff48df62d6e69800
>> start = 0x14717935c,
>> cached_tstamp = 0xbe9c63f31b
>> skb = 0xff48df5e20552f00,
>>
>> crash> pd jiffies-0x14717935c
>> 177354
>
> The packet has been waiting for almost three minutes. The mentioned
> commit 0dd9286263923 would definitely skip it and clear its
> bit in the tx->in_use mask. But that is still only two seconds too late,
> which needs to be addressed. And since the link went down
> I guess just dropping the packets is a reasonable approach.
>

Yes, but we have to be careful of corner cases that could result in
completely stopping Tx timestamp interrupts until a device reset.

> Checking the latest upstream code in v6.2-rc4 I *guess* this
> can still happen as described above. Just not looping forever
> but rather only for 2 seconds. But doing nothing for two
> seconds is just as bad, really. Hence the patch I'm suggesting.
>
>> The trouble is that if a timestamp *does* arrive late, we need to ensure
>> that we never assign the captured time to the wrong packet, and that for
>> E822 devices we always read the correct number of timestamps (otherwise
>> we can get the logic for timestamp interrupt generation broken).
>>
>> Consider for example this flow for e810:
>>
>> 1) a tx packet with a timestamp request is scheduled to hw
>> 2) the packet begins being transmitted
>> 3) link goes down
>> 4) interrupt triggers, ice_ptp_tx_tstamp is run
>> 5) link is down, so we skip reading this timestamp. Since we don't read
>> the timestamp, we just discard the skb and we don't update the cached tx
>> timestamp value
>> 6) link goes up
>> 7) 2 tx packets with a timestamp request are sent and one of them is on
>> the same index as the packet in (1)
>> 8) the other tx packet completes and we get an interrupt
>> 9) the loop reads both timestamps. Since the tx packet in slot (1)
>> doesn't match its cached value it looks "new" so the function reports
>> the old timestamp to the wrong packet.
>
> I don't really fully understand the HW nor the driver, to be honest.
> For sure I can see it kept looping, as the work is being re-queued
> due to the tx->in_use bitmap not being cleared. And that can only
> happen when ice_read_phy_tstamp() returns with an error, or the
> returned &raw_tstamp is not valid, or it matches the cached value
> (considering our version of ice_ptp_tx_tstamp() looks like the
> one in the v5.15 stable tree). That said, the v5.15 stable tree needs
> a backport of this fix as well, I guess.
>

Right.

>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/net/ethernet/intel/ice/ice_ptp.c?h=v5.15.88#n1170
>
>> Consider this flow for e822:
>>
>> 1) 2 tx packets with a timestamp request are scheduled to hw
>> 2) the packets begin being transmitted
>> 3) link goes down
>> 4) an interrupt for the Tx timestamp is triggered, but we don't read the
>> timestamps because we have link down and we skipped the ts_read.
>> 5) the internal e822 hardware counter is not decremented due to no reads
>> 6) no more timestamp interrupts will be triggered by hardware until we
>> read the appropriate number of timestamps
>>
>> I am not sure if link going up will properly reset and re-initialize the
>> Tx timestamp block but I suspect it does not. @Karol, @Siddaraju,
>> @Michal, do you recall more details on this?
>>
>> I understand the desire to avoid polling when unnecessary, but I am
>> worried because the hardware and firmware interactions here are pretty
>> complex and it's easy to get this wrong. (see: all of the previous
>> patches and bug fixes we've been working on... we got this wrong a LOT
>> already...)
>
> I see. Here my favorite quote applies: "Everybody falls the first time."
>
> I totally understand and I cannot imagine developing and maintaining this
> without a solid test suite. On the other hand, some things (especially
> races) could be tricky to reproduce.
>

Yea..
>> Without a more concrete explanation of what this fixes I'm worried about
>> this change :(
>
> Hopefully I already managed to shed a bit more light here. Again, I'm sorry
> about the brief commit message.
>

That's ok, this has provided a lot more detail, thanks!

>> At a minimum I think I would only set drop_ts but not goto skip_ts_read.
>
> IIUC, that would still fail to clear the tx->in_use bit in case
> ice_read_phy_tstamp() returns with an error. It would only work for the
> other case where no error is returned but the returned &raw_tstamp is
> invalid. I'll send a v2 of this fix trying to address the goto concern.
>

If drop_ts is set we should always be dropping the timestamp after
checking, even if we get an error. If there's a flow where that doesn't
happen we should fix that.

>> crash> ice_sbq_msg_input ff7daf7a4aa2fe2c
>> struct ice_sbq_msg_input {
>>   dest_dev = 0x2,
>>   opcode = 0x0,
>>   msg_addr_low = 0x1000,
>>   msg_addr_high = 0x309,
>>   data = 0x0
>> }
>> crash> ice_sbq_msg_req ff7daf7a4aa2fdc8
>> struct ice_sbq_msg_req {
>>   dest_dev = 0x2,
>>   src_dev = 0x0,
>>   opcode = 0x0,
>>   flags = 0x40,
>>   sbe_fbe = 0xf,
>>   func_id = 0x0,
>>   msg_addr_low = 0x1000,
>>   msg_addr_high = 0x309,
>>   data = 0x0
>> }
>> crash> ice_sbq_cmd_desc ff7daf7a4aa2fdd8
>> struct ice_sbq_cmd_desc {
>>   flags = 0x1400,
>>   opcode = 0xc00,
>>   datalen = 0x0,
>>   cmd_retval = 0x0,
>>   cookie_high = 0x0,
>>   cookie_low = 0x0,
>>   param0 = {
>>     cmd_len = 0xc,
>>     cmpl_len = 0xc
>>   },
>>   reserved = "",
>>   addr_high = 0x0,
>>   addr_low = 0x0
>> }
>
> I see commit 1229b33973c7b ("ice: Add low latency Tx timestamp read") got
> rid of that kthread_work in favor of a threaded irq. But
> ice_misc_intr_thread_fn() keeps waking itself in the same manner the work
> kept re-queuing itself. So the loop still needs to be broken (i.e. the
> tx->in_use bitmap cleared). Even for v6.2-rc4.
>

Yes.
>> That way if we happen to have a ready timestamp (for E822) we'll still
>> read it and avoid the miscounting from not reading a completed timestamp.
>>
>> This also ensures that on e810 the cached timestamp value is updated and
>> that we avoid the other situation.
>>
>> I'd still prefer if you have a bug report or more details on the failure
>> case. I believe even if we poll it should be no more than 2 seconds for
>> an old timestamp that never got sent to be discarded.
>
> Unfortunately I only have this one report with a dump from the customer's
> testing. The uptime was almost 14 days, so I'm not sure how easy it is to
> reproduce.
>
> Again, hopefully the above covers enough details. I'm happy to share more
> if I missed some. But looping in a softirq (or worker thread) for two
> seconds is not really acceptable. That's just a wrong design. Right?
>

Please re-send the patch with the goto concern addressed (only set
drop_ts, still read the timestamp just in case, and make sure that if the
read function fails while drop_ts is set we still drop the ts).

I believe that alleviates my concerns regarding the potential link down
race with a completed timestamp above.

We also should already be preventing requesting new timestamps while link
is down, so that should be sufficient to combine with this and cover the
three flows:

1) request tx timestamp
2) link down
3) timestamp never occurs

1) request tx timestamp
2) timestamp occurs
3) link goes down while processing

1) link down
2) request tx timestamp rejected

Thanks!
-Jake

> --nX
>
>>> Signed-off-by: Daniel Vacek <neelx@redhat.com>
>>> ---
>>>  drivers/net/ethernet/intel/ice/ice_ptp.c | 9 +++++++++
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
>>> index d63161d73eb16..c313177ba6676 100644
>>> --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
>>> +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
>>> @@ -680,6 +680,7 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>>>  	struct ice_pf *pf;
>>>  	struct ice_hw *hw;
>>>  	u64 tstamp_ready;
>>> +	bool link_up;
>>>  	int err;
>>>  	u8 idx;
>>>
>>> @@ -695,6 +696,8 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>>>  	if (err)
>>>  		return false;
>>>
>>> +	link_up = hw->port_info->phy.link_info.link_info & ICE_AQ_LINK_UP;
>>> +
>>>  	for_each_set_bit(idx, tx->in_use, tx->len) {
>>>  		struct skb_shared_hwtstamps shhwtstamps = {};
>>>  		u8 phy_idx = idx + tx->offset;
>>> @@ -702,6 +705,12 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>>>  		bool drop_ts = false;
>>>  		struct sk_buff *skb;
>>>
>>> +		/* Drop packets if the link went down */
>>> +		if (!link_up) {
>>> +			drop_ts = true;
>>> +			goto skip_ts_read;
>>> +		}
>>> +
>>>  		/* Drop packets which have waited for more than 2 seconds */
>>>  		if (time_is_before_jiffies(tx->tstamps[idx].start + 2 * HZ)) {
>>>  			drop_ts = true;
>>
On Wed, Jan 18, 2023 at 9:59 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> On 1/18/2023 7:14 AM, Daniel Vacek wrote:
> > On Tue, Jan 17, 2023 at 7:47 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> >>
> >> At a minimum I think I would only set drop_ts but not goto skip_ts_read.
> >
> > IIUC, that would still fail to clear the tx->in_use bit in case
> > ice_read_phy_tstamp() returns with an error. It would only work for the
> > other case where no error is returned but the returned &raw_tstamp is
> > invalid. I'll send a v2 of this fix trying to address the goto concern.
> >
>
> Please re-send the patch with the goto concern addressed. (only set
> drop_ts, and still read the timestamp just in case and make sure that if
> the read function fails while drop_ts is set we still drop the ts).

Just in case it got lost, I did send the patch here:

https://lore.kernel.org/intel-wired-lan/20230118161727.2485457-1-neelx@redhat.com/

But unfortunately I mixed up the --in-reply-to header with the --reply-to
one :-( Hopefully it arrived to you safely.

> I believe that alleviates my concerns regarding the potential link down
> race with a completed timestamp above.
>
> We also should already be preventing requesting new timestamps while
> link is down so that should be sufficient to combine with this and cover
> the three flows:
>
> 1) request tx timestamp
> 2) link down
> 3) timestamp never occurs
>
> 1) request tx timestamp
> 2) timestamp occurs
> 3) link goes down while processing

I was thinking this is the case we got reported. But then again, I'm
not really experienced in this field.

--nX

> 1) link down
> 2) request tx timestamp rejected
>
> Thanks!
>
> -Jake
On 1/18/2023 2:11 PM, Daniel Vacek wrote:
> On Wed, Jan 18, 2023 at 9:59 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
>> On 1/18/2023 7:14 AM, Daniel Vacek wrote:
>> 1) request tx timestamp
>> 2) timestamp occurs
>> 3) link goes down while processing
>
> I was thinking this is the case we got reported. But then again, I'm
> not really experienced in this field.
>

I think it might be, or at least something similar to this.

I think that can be fixed with the link check you added. I think we
actually have a copy of the current link status in the ice_ptp or
ice_ptp_tx structure which could be used instead of having to check back
to the other structure.

I'm just hoping not to re-introduce bugs related to the hardware
interrupt counter that we had which results in preventing all future
timestamp interrupts.

> --nX
>
>> 1) link down
>> 2) request tx timestamp rejected
>>
>> Thanks!
>>
>> -Jake
>
On Wed, Jan 18, 2023 at 11:22 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> On 1/18/2023 2:11 PM, Daniel Vacek wrote:
> > On Wed, Jan 18, 2023 at 9:59 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> >> On 1/18/2023 7:14 AM, Daniel Vacek wrote:
> >> 1) request tx timestamp
> >> 2) timestamp occurs
> >> 3) link goes down while processing
> >
> > I was thinking this is the case we got reported. But then again, I'm
> > not really experienced in this field.
> >
>
> I think it might be, or at least something similar to this.
>
> I think that can be fixed with the link check you added. I think we
> actually have a copy of the current link status in the ice_ptp or
> ice_ptp_tx structure which could be used instead of having to check back
> to the other structure.

If you're talking about ptp_port->link_up, that one is always false no
matter the actual NIC link status. First I wanted to use it, but
checking all the 8 devices available in the dump data it just does not
match the net_dev->state or the port_info->phy.link_info.link_info:

crash> net_device.name,state 0xff48df6f0c553000
  name = "ens1f1",
  state = 0x7, // DOWN
crash> ice_port_info.phy.link_info.link_info 0xff48df6f05dca018
  phy.link_info.link_info = 0xc0, // DOWN
crash> ice_ptp_port.port_num,link_up 0xff48df6f05dd44e0
  port_num = 0x1
  link_up = 0x0, // False

crash> net_device.name,state 0xff48df6f25e3f000
  name = "ens1f0",
  state = 0x3, // UP
crash> ice_port_info.phy.link_info.link_info 0xff48df6f070a3018
  phy.link_info.link_info = 0xe1, // UP
crash> ice_ptp_port.port_num,link_up 0xff48df6f063184e0
  port_num = 0x0
  link_up = 0x0, // False

crash> ice_ptp_port.port_num,link_up 0xff48df6f25b844e0
  port_num = 0x2
  link_up = 0x0, // False even this device is UP
crash> ice_ptp_port.port_num,link_up 0xff48df6f140384e0
  port_num = 0x3
  link_up = 0x0, // False even this device is UP
crash> ice_ptp_port.port_num,link_up 0xff48df6f055044e0
  port_num = 0x0
  link_up = 0x0, // False even this device is UP
crash> ice_ptp_port.port_num,link_up 0xff48df6f251cc4e0
  port_num = 0x1
  link_up = 0x0,
crash> ice_ptp_port.port_num,link_up 0xff48df6f33a9c4e0
  port_num = 0x2
  link_up = 0x0,
crash> ice_ptp_port.port_num,link_up 0xff48df6f3bb7c4e0
  port_num = 0x3
  link_up = 0x0,

In other words, the ice_ptp_port.link_up is always false and cannot be
used. That's why I had to fall back to
hw->port_info->phy.link_info.link_info

--nX

> I'm just hoping not to re-introduce bugs related to the hardware
> interrupt counter that we had which results in preventing all future
> timestamp interrupts.
>
> > --nX
> >
> >> 1) link down
> >> 2) request tx timestamp rejected
> >>
> >> Thanks!
> >>
> >> -Jake
> >
>
On 1/19/2023 1:38 AM, Daniel Vacek wrote: > On Wed, Jan 18, 2023 at 11:22 PM Jacob Keller <jacob.e.keller@intel.com> wrote: >> On 1/18/2023 2:11 PM, Daniel Vacek wrote: >>> On Wed, Jan 18, 2023 at 9:59 PM Jacob Keller <jacob.e.keller@intel.com> wrote: >>>> On 1/18/2023 7:14 AM, Daniel Vacek wrote: >>>> 1) request tx timestamp >>>> 2) timestamp occurs >>>> 3) link goes down while processing >>> >>> I was thinking this is the case we got reported. But then again, I'm >>> not really experienced in this field. >>> >> >> I think it might be, or at least something similar to this. >> >> I think that can be fixed with the link check you added. I think we >> actually have a copy of the current link status in the ice_ptp or >> ice_ptp_tx structure which could be used instead of having to check back >> to the other structure. > > If you're talking about ptp_port->link_up that one is always false no > matter the actual NIC link status. First I wanted to use it but > checking all the 8 devices available in the dump data it just does not > match the net_dev->state or the port_info->phy.link_info.link_info > > crash> net_device.name,state 0xff48df6f0c553000 > name = "ens1f1", > state = 0x7, // DOWN > crash> ice_port_info.phy.link_info.link_info 0xff48df6f05dca018 > phy.link_info.link_info = 0xc0, // DOWN > crash> ice_ptp_port.port_num,link_up 0xff48df6f05dd44e0 > port_num = 0x1 > link_up = 0x0, // False > > crash> net_device.name,state 0xff48df6f25e3f000 > name = "ens1f0", > state = 0x3, // UP > crash> ice_port_info.phy.link_info.link_info 0xff48df6f070a3018 > phy.link_info.link_info = 0xe1, // UP > crash> ice_ptp_port.port_num,link_up 0xff48df6f063184e0 > port_num = 0x0 > link_up = 0x0, // False > > crash> ice_ptp_port.port_num,link_up 0xff48df6f25b844e0 > port_num = 0x2 > link_up = 0x0, // False even this device is UP > crash> ice_ptp_port.port_num,link_up 0xff48df6f140384e0 > port_num = 0x3 > link_up = 0x0, // False even this device is UP > crash> 
ice_ptp_port.port_num,link_up 0xff48df6f055044e0 > port_num = 0x0 > link_up = 0x0, // False even this device is UP > crash> ice_ptp_port.port_num,link_up 0xff48df6f251cc4e0 > port_num = 0x1 > link_up = 0x0, > crash> ice_ptp_port.port_num,link_up 0xff48df6f33a9c4e0 > port_num = 0x2 > link_up = 0x0, > crash> ice_ptp_port.port_num,link_up 0xff48df6f3bb7c4e0 > port_num = 0x3 > link_up = 0x0, > > In other words, the ice_ptp_port.link_up is always false and cannot be > used. That's why I had to fall back to > hw->port_info->phy.link_info.link_info > Hmm. We call ice_ptp_link_change in ice_link_event which is called from ice_handle_link_event... In ice_link_event, a local link_up field is set based on phy_info->link_info.link_info & ICE_AQ_LINK_UP What kernel are you testing on? Does it include 6b1ff5d39228 ("ice: always call ice_ptp_link_change and make it void")? Prior to this commit the field was only valid for E822 devices, but I fixed that as it was used for other checks as well. I am guessing that the Red Hat kernel you are using lacks several of these clean ups and fixes. For the current code in the net-next kernel I believe we can safely use the ptp_port->link_up field. Thanks, Jake
On Thu, Jan 19, 2023 at 8:25 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> On 1/19/2023 1:38 AM, Daniel Vacek wrote:
> > On Wed, Jan 18, 2023 at 11:22 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> >> On 1/18/2023 2:11 PM, Daniel Vacek wrote:
> >>> On Wed, Jan 18, 2023 at 9:59 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> >>>> On 1/18/2023 7:14 AM, Daniel Vacek wrote:
> >>>> 1) request tx timestamp
> >>>> 2) timestamp occurs
> >>>> 3) link goes down while processing
> >>>
> >>> I was thinking this is the case we got reported. But then again, I'm
> >>> not really experienced in this field.
> >>>
> >>
> >> I think it might be, or at least something similar to this.
> >>
> >> I think that can be fixed with the link check you added. I think we
> >> actually have a copy of the current link status in the ice_ptp or
> >> ice_ptp_tx structure which could be used instead of having to check back
> >> to the other structure.
> >
> > If you're talking about ptp_port->link_up that one is always false no
> > matter the actual NIC link status. First I wanted to use it but
> > checking all the 8 devices available in the dump data it just does not
> > match the net_dev->state or the port_info->phy.link_info.link_info
> >
> > crash> net_device.name,state 0xff48df6f0c553000
> >   name = "ens1f1",
> >   state = 0x7, // DOWN
> > crash> ice_port_info.phy.link_info.link_info 0xff48df6f05dca018
> >   phy.link_info.link_info = 0xc0, // DOWN
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f05dd44e0
> >   port_num = 0x1
> >   link_up = 0x0, // False
> >
> > crash> net_device.name,state 0xff48df6f25e3f000
> >   name = "ens1f0",
> >   state = 0x3, // UP
> > crash> ice_port_info.phy.link_info.link_info 0xff48df6f070a3018
> >   phy.link_info.link_info = 0xe1, // UP
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f063184e0
> >   port_num = 0x0
> >   link_up = 0x0, // False
> >
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f25b844e0
> >   port_num = 0x2
> >   link_up = 0x0, // False even this device is UP
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f140384e0
> >   port_num = 0x3
> >   link_up = 0x0, // False even this device is UP
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f055044e0
> >   port_num = 0x0
> >   link_up = 0x0, // False even this device is UP
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f251cc4e0
> >   port_num = 0x1
> >   link_up = 0x0,
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f33a9c4e0
> >   port_num = 0x2
> >   link_up = 0x0,
> > crash> ice_ptp_port.port_num,link_up 0xff48df6f3bb7c4e0
> >   port_num = 0x3
> >   link_up = 0x0,
> >
> > In other words, the ice_ptp_port.link_up is always false and cannot be
> > used. That's why I had to fall back to
> > hw->port_info->phy.link_info.link_info
> >
>
> Hmm. We call ice_ptp_link_change in ice_link_event which is called from
> ice_handle_link_event...
>
> In ice_link_event, a local link_up field is set based on
> phy_info->link_info.link_info & ICE_AQ_LINK_UP
>
> What kernel are you testing on? Does it include 6b1ff5d39228 ("ice:
> always call ice_ptp_link_change and make it void")?
>
> Prior to this commit the field was only valid for E822 devices, but I
> fixed that as it was used for other checks as well.
>
> I am guessing that the Red Hat kernel you are using lacks several of
> these clean ups and fixes.

Yeah, makes perfect sense. We don't have that commit in 8.4. All the
data I have and present here are from
4.18.0-305.49.1.rt7.121.el8_4.x86_64.

> For the current code in the net-next kernel I believe we can safely use
> the ptp_port->link_up field.

I'll fix that up and drop you a v3. Thank you for the review.

--nX

> Thanks,
> Jake
Yes Jake, the ICE driver reinitializes PTP timers on link UP (especially
for E82x):

ice_ptp.c -> ice_ptp_link_change -> ice_ptp_port_phy_restart -> ice_start_phy_timer_e822

-Siddaraju D H

-----Original Message-----
From: Keller, Jacob E <jacob.e.keller@intel.com>
Sent: Wednesday, January 18, 2023 12:16 AM
To: Daniel Vacek <neelx@redhat.com>; Brandeburg, Jesse <jesse.brandeburg@intel.com>; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; David S. Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>; Richard Cochran <richardcochran@gmail.com>; Kolacinski, Karol <karol.kolacinski@intel.com>; D H, Siddaraju <siddaraju.dh@intel.com>; Michalik, Michal <michal.michalik@intel.com>
Cc: netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org; linux-kernel@vger.kernel.org
Subject: Re: [Intel-wired-lan] [PATCH] ice/ptp: fix the PTP worker retrying indefinitely if the link went down

On 1/17/2023 10:15 AM, Daniel Vacek wrote:
> When the link goes down the ice_ptp_tx_tstamp_work() may loop forever
> trying to process the packets. In this case it makes sense to just drop them.
>

Hi,

Do you have some more details on what situation caused this state? Or is
this just based on code review?

It won't loop forever, because if link is down for more than 2 seconds
we'll discard the old timestamps which we assume are not going to arrive.

The trouble is that if a timestamp *does* arrive late, we need to ensure
that we never assign the captured time to the wrong packet, and that for
E822 devices we always read the correct number of timestamps (otherwise
we can get the logic for timestamp interrupt generation broken).

Consider for example this flow for e810:

1) a tx packet with a timestamp request is scheduled to hw
2) the packet begins being transmitted
3) link goes down
4) interrupt triggers, ice_ptp_tx_tstamp is run
5) link is down, so we skip reading this timestamp. Since we don't read
the timestamp, we just discard the skb and we don't update the cached tx
timestamp value
6) link goes up
7) 2 tx packets with a timestamp request are sent and one of them is on
the same index as the packet in (1)
8) the other tx packet completes and we get an interrupt
9) the loop reads both timestamps. Since the tx packet in slot (1)
doesn't match its cached value it looks "new" so the function reports
the old timestamp to the wrong packet.

Consider this flow for e822:

1) 2 tx packets with a timestamp request are scheduled to hw
2) the packets begin being transmitted
3) link goes down
4) an interrupt for the Tx timestamp is triggered, but we don't read the
timestamps because we have link down and we skipped the ts_read.
5) the internal e822 hardware counter is not decremented due to no reads
6) no more timestamp interrupts will be triggered by hardware until we
read the appropriate number of timestamps

I am not sure if link going up will properly reset and re-initialize the
Tx timestamp block but I suspect it does not. @Karol, @Siddaraju,
@Michal, do you recall more details on this?

I understand the desire to avoid polling when unnecessary, but I am
worried because the hardware and firmware interactions here are pretty
complex and it's easy to get this wrong. (see: all of the previous
patches and bug fixes we've been working on... we got this wrong a LOT
already...)

Without a more concrete explanation of what this fixes I'm worried about
this change :(

At a minimum I think I would only set drop_ts but not goto skip_ts_read.

That way if we happen to have a ready timestamp (for E822) we'll still
read it and avoid the miscounting from not reading a completed timestamp.

This also ensures that on e810 the cached timestamp value is updated and
that we avoid the other situation.

I'd still prefer if you have a bug report or more details on the failure
case. I believe even if we poll it should be no more than 2 seconds for
an old timestamp that never got sent to be discarded.

> Signed-off-by: Daniel Vacek <neelx@redhat.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_ptp.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c
> b/drivers/net/ethernet/intel/ice/ice_ptp.c
> index d63161d73eb16..c313177ba6676 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> @@ -680,6 +680,7 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>  	struct ice_pf *pf;
>  	struct ice_hw *hw;
>  	u64 tstamp_ready;
> +	bool link_up;
>  	int err;
>  	u8 idx;
>
> @@ -695,6 +696,8 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>  	if (err)
>  		return false;
>
> +	link_up = hw->port_info->phy.link_info.link_info & ICE_AQ_LINK_UP;
> +
>  	for_each_set_bit(idx, tx->in_use, tx->len) {
>  		struct skb_shared_hwtstamps shhwtstamps = {};
>  		u8 phy_idx = idx + tx->offset;
> @@ -702,6 +705,12 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
>  		bool drop_ts = false;
>  		struct sk_buff *skb;
>
> +		/* Drop packets if the link went down */
> +		if (!link_up) {
> +			drop_ts = true;
> +			goto skip_ts_read;
> +		}
> +
>  		/* Drop packets which have waited for more than 2 seconds */
>  		if (time_is_before_jiffies(tx->tstamps[idx].start + 2 * HZ)) {
>  			drop_ts = true;
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
index d63161d73eb16..c313177ba6676 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
@@ -680,6 +680,7 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
 	struct ice_pf *pf;
 	struct ice_hw *hw;
 	u64 tstamp_ready;
+	bool link_up;
 	int err;
 	u8 idx;
 
@@ -695,6 +696,8 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
 	if (err)
 		return false;
 
+	link_up = hw->port_info->phy.link_info.link_info & ICE_AQ_LINK_UP;
+
 	for_each_set_bit(idx, tx->in_use, tx->len) {
 		struct skb_shared_hwtstamps shhwtstamps = {};
 		u8 phy_idx = idx + tx->offset;
@@ -702,6 +705,12 @@ static bool ice_ptp_tx_tstamp(struct ice_ptp_tx *tx)
 		bool drop_ts = false;
 		struct sk_buff *skb;
 
+		/* Drop packets if the link went down */
+		if (!link_up) {
+			drop_ts = true;
+			goto skip_ts_read;
+		}
+
 		/* Drop packets which have waited for more than 2 seconds */
 		if (time_is_before_jiffies(tx->tstamps[idx].start + 2 * HZ)) {
 			drop_ts = true;