Message ID | 1688374171-10534-1-git-send-email-schakrabarti@linux.microsoft.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:9f45:0:b0:3ea:f831:8777 with SMTP id v5csp379581vqx; Mon, 3 Jul 2023 02:00:44 -0700 (PDT) X-Google-Smtp-Source: APBJJlEBmaeokYgMo+sky7JBY/vmA2y9aVaxuRMcZgo/eYcVtSg82IviUgt1mRMDYjT77wbtm9Gx X-Received: by 2002:a17:902:e294:b0:1af:adc2:ab5b with SMTP id o20-20020a170902e29400b001afadc2ab5bmr10574125plc.0.1688374844189; Mon, 03 Jul 2023 02:00:44 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1688374844; cv=none; d=google.com; s=arc-20160816; b=IFFTO3IEXL6VCQFTXw0ZKFtpZoIw+iA4Sozo7QqBIfZ2ImgSO+f96SsiQ9Qp8QCgKO mHL8Wsjxoaex3xdaBm2laAJz2jv7rQZLXFyLlENkSF3yls8RknqW8hmTVHQYZSUGibdk w9lhL1Ccw2VmA3iqKlZfGHKLdhqFIfHsT6zleeeESTCfJ3CmxpS0ljv5qutbcf7UZBbo V1+s3bHaq683G7TJtv/PvMzHCA2NXVdkbBA44SfvJ1jE0iCpum+gpBwnhGVvwxRI7RyH BsYjrvZwVoOFY9QhfeSxiRtkv4lMokEWRAhjdVLhJntMn9551gKmzG25BknQqNjkhNsB Ut1Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:message-id:date:subject:cc:to:from :dkim-signature:dkim-filter; bh=wLvOYsL0xTLNog9Esl1IJpg/jgFjcy1xWtdfH3ZESbc=; fh=qKGHJM/OaNHTo96+1HG2BqACTjy4q9C0dbufp/+3Ef4=; b=BAq31cPL6rmiEfvoOrpxbed3siq8qGvVmMWbO4tujtZertetN35Eg/oxhRZdTdbolg we/vmZpDm2lgPQxgVE/ROcKXujm8LF+d/oY85TjR6sRVw/EfIORp5OW0BvhuN3TOzWsc fAjYTcmP7oPyRlWWmL7NIFWrkVqWLK6PPJLX/CmHy6OzHFwLGINpmqDzSYsm2RDgdPNu toP9NpWSRbpvfHn3tcYIVs7CSPX3p1OKpBd85x7bGdU/p2isEw3cBaJvgcioLnYQVweK zNUwAse2G/sfC91H0+3wu+kwt+XA9CUA2gkgZf30e3rCspWgUG0ShmCHfkDDtaJ6r7xO PsTQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linux.microsoft.com header.s=default header.b=qJlSF5nC; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.microsoft.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id jf5-20020a170903268500b001b18ec6ddfcsi15829450plb.294.2023.07.03.02.00.29; Mon, 03 Jul 2023 02:00:44 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@linux.microsoft.com header.s=default header.b=qJlSF5nC; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.microsoft.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230310AbjGCIuH (ORCPT <rfc822;ivan.orlov0322@gmail.com> + 99 others); Mon, 3 Jul 2023 04:50:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44816 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229888AbjGCIuC (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 3 Jul 2023 04:50:02 -0400 Received: from linux.microsoft.com (linux.microsoft.com [13.77.154.182]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 949DB10C7; Mon, 3 Jul 2023 01:49:41 -0700 (PDT) Received: by linux.microsoft.com (Postfix, from userid 1099) id 7CEF720AECAD; Mon, 3 Jul 2023 01:49:34 -0700 (PDT) DKIM-Filter: OpenDKIM Filter v2.11.0 linux.microsoft.com 7CEF720AECAD DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.microsoft.com; s=default; t=1688374174; bh=wLvOYsL0xTLNog9Esl1IJpg/jgFjcy1xWtdfH3ZESbc=; h=From:To:Cc:Subject:Date:From; b=qJlSF5nCdYP0m5NBCth77epMbBD9PUok8pnsnPduqP6HFaKbBSJC0KSBX05iB8Sb0 xxdrowkaHthO5XOKYwfC4/e6I2tLXTWZE+qRanRUbKzGXXCrGW02hIh6e8wh2ahiCL MmC/+4wcXVueI45GKAIoAjb2/nCQW2dMTVqawiEU= From: souradeep chakrabarti <schakrabarti@linux.microsoft.com> To: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, longli@microsoft.com, sharmaajay@microsoft.com, leon@kernel.org, cai.huoqing@linux.dev, ssengar@linux.microsoft.com, vkuznets@redhat.com, tglx@linutronix.de, linux-hyperv@vger.kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org Cc: stable@vger.kernel.org, schakrabarti@microsoft.com, Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> Subject: [PATCH V4 net] net: mana: Fix MANA VF unload when host is unresponsive Date: Mon, 3 Jul 2023 01:49:31 -0700 Message-Id: <1688374171-10534-1-git-send-email-schakrabarti@linux.microsoft.com> X-Mailer: git-send-email 1.8.3.1 X-Spam-Status: No, score=-19.8 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_MED, SPF_HELO_PASS,SPF_PASS,T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL, USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1770389340630374011?= X-GMAIL-MSGID: =?utf-8?q?1770389340630374011?= |
Series |
[V4,net] net: mana: Fix MANA VF unload when host is unresponsive
|
|
Commit Message
Souradeep Chakrabarti
July 3, 2023, 8:49 a.m. UTC
From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> When unloading the MANA driver, mana_dealloc_queues() waits for the MANA hardware to complete any inflight packets and set the pending send count to zero. But if the hardware has failed, mana_dealloc_queues() could wait forever. Fix this by adding a timeout to the wait. Set the timeout to 120 seconds, which is a somewhat arbitrary value that is more than long enough for functional hardware to complete any sends. Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> --- V3 -> V4: * Fixed the commit message to describe the context. * Removed the vf_unload_timeout, as it is not required. --- drivers/net/ethernet/microsoft/mana/mana_en.c | 26 ++++++++++++++++--- 1 file changed, 23 insertions(+), 3 deletions(-)
Comments
> -----Original Message----- > From: souradeep chakrabarti <schakrabarti@linux.microsoft.com> > Sent: Monday, July 3, 2023 4:50 AM > To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang > <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui > <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; > kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay > Sharma <sharmaajay@microsoft.com>; leon@kernel.org; > cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; > tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; > linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org > Cc: stable@vger.kernel.org; Souradeep Chakrabarti > <schakrabarti@microsoft.com>; Souradeep Chakrabarti > <schakrabarti@linux.microsoft.com> > Subject: [PATCH V4 net] net: mana: Fix MANA VF unload when host is > unresponsive > > From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> > > When unloading the MANA driver, mana_dealloc_queues() waits for the MANA > hardware to complete any inflight packets and set the pending send count > to zero. But if the hardware has failed, mana_dealloc_queues() > could wait forever. > > Fix this by adding a timeout to the wait. Set the timeout to 120 seconds, > which is a somewhat arbitrary value that is more than long enough for > functional hardware to complete any sends. > > Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> > --- > V3 -> V4: > * Fixed the commit message to describe the context. > * Removed the vf_unload_timeout, as it is not required. > --- > drivers/net/ethernet/microsoft/mana/mana_en.c | 26 ++++++++++++++++--- > 1 file changed, 23 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c > b/drivers/net/ethernet/microsoft/mana/mana_en.c > index a499e460594b..d26f1da70411 100644 > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c > @@ -2346,7 +2346,10 @@ static int mana_dealloc_queues(struct net_device > *ndev) > { > struct mana_port_context *apc = netdev_priv(ndev); > struct gdma_dev *gd = apc->ac->gdma_dev; > + unsigned long timeout; > struct mana_txq *txq; > + struct sk_buff *skb; > + struct mana_cq *cq; > int i, err; > > if (apc->port_is_up) > @@ -2363,15 +2366,32 @@ static int mana_dealloc_queues(struct net_device > *ndev) > * to false, but it doesn't matter since mana_start_xmit() drops any > * new packets due to apc->port_is_up being false. > * > - * Drain all the in-flight TX packets > + * Drain all the in-flight TX packets. > + * A timeout of 120 seconds for all the queues is used. > + * This will break the while loop when h/w is not responding. > + * This value of 120 has been decided here considering max > + * number of queues. > */ > + > + timeout = jiffies + 120 * HZ; > for (i = 0; i < apc->num_queues; i++) { > txq = &apc->tx_qp[i].txq; > - > - while (atomic_read(&txq->pending_sends) > 0) > + while (atomic_read(&txq->pending_sends) > 0 && > + time_before(jiffies, timeout)) { > usleep_range(1000, 2000); > + } > } > > + for (i = 0; i < apc->num_queues; i++) { > + txq = &apc->tx_qp[i].txq; > + cq = &apc->tx_qp[i].tx_cq; > + while (atomic_read(&txq->pending_sends)) { > + skb = skb_dequeue(&txq->pending_skbs); > + mana_unmap_skb(skb, apc); > + napi_consume_skb(skb, cq->budget); This is not in NAPI context, so it should be dev_consume_skb_any() Thanks, - Haiyang
From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> Date: Mon, 3 Jul 2023 01:49:31 -0700 > From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> Please sync your Git name and Git mail account settings, so that your own patches won't have "From:" when sending. From what I see, you need to correct first letters of name and surname to capital in the Git email settings block. > > When unloading the MANA driver, mana_dealloc_queues() waits for the MANA > hardware to complete any inflight packets and set the pending send count > to zero. But if the hardware has failed, mana_dealloc_queues() > could wait forever. > > Fix this by adding a timeout to the wait. Set the timeout to 120 seconds, > which is a somewhat arbitrary value that is more than long enough for > functional hardware to complete any sends. > > Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> Where's "Fixes:" tagging the blamed commit? > --- > V3 -> V4: > * Fixed the commit message to describe the context. > * Removed the vf_unload_timeout, as it is not required. > --- > drivers/net/ethernet/microsoft/mana/mana_en.c | 26 ++++++++++++++++--- > 1 file changed, 23 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c > index a499e460594b..d26f1da70411 100644 > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c > @@ -2346,7 +2346,10 @@ static int mana_dealloc_queues(struct net_device *ndev) > { > struct mana_port_context *apc = netdev_priv(ndev); > struct gdma_dev *gd = apc->ac->gdma_dev; > + unsigned long timeout; > struct mana_txq *txq; > + struct sk_buff *skb; > + struct mana_cq *cq; > int i, err; > > if (apc->port_is_up) > @@ -2363,15 +2366,32 @@ static int mana_dealloc_queues(struct net_device *ndev) > * to false, but it doesn't matter since mana_start_xmit() drops any > * new packets due to apc->port_is_up being false. > * > - * Drain all the in-flight TX packets > + * Drain all the in-flight TX packets. > + * A timeout of 120 seconds for all the queues is used. > + * This will break the while loop when h/w is not responding. > + * This value of 120 has been decided here considering max > + * number of queues. > */ > + > + timeout = jiffies + 120 * HZ; Why not initialize it right when declaring? > for (i = 0; i < apc->num_queues; i++) { > txq = &apc->tx_qp[i].txq; > - > - while (atomic_read(&txq->pending_sends) > 0) > + while (atomic_read(&txq->pending_sends) > 0 && > + time_before(jiffies, timeout)) { > usleep_range(1000, 2000);> + } > } 120 seconds by 2 msec step is 60000 iterations, by 1 msec is 120000 iterations. I know usleep_range() often is much less precise, but still. Do you really need that much time? Has this been measured during the tests that it can take up to 120 seconds or is it just some random value that "should be enough"? If you really need 120 seconds, I'd suggest using a timer / delayed work instead of wasting resources. > > + for (i = 0; i < apc->num_queues; i++) { > + txq = &apc->tx_qp[i].txq; > + cq = &apc->tx_qp[i].tx_cq; cq can be just &txq->tx_cq. > + while (atomic_read(&txq->pending_sends)) { > + skb = skb_dequeue(&txq->pending_skbs); > + mana_unmap_skb(skb, apc); > + napi_consume_skb(skb, cq->budget); (you already have comment about this one) > + atomic_sub(1, &txq->pending_sends); > + } > + } > /* We're 100% sure the queues can no longer be woken up, because > * we're sure now mana_poll_tx_cq() can't be running. > */ Thanks, Olek
>-----Original Message----- >From: Alexander Lobakin <aleksander.lobakin@intel.com> >Sent: Monday, July 3, 2023 10:18 PM >To: souradeep chakrabarti <schakrabarti@linux.microsoft.com> >Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang ><haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui ><decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; >kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay >Sharma <sharmaajay@microsoft.com>; leon@kernel.org; >cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; >tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; >linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org; >stable@vger.kernel.org; Souradeep Chakrabarti <schakrabarti@microsoft.com> >Subject: [EXTERNAL] Re: [PATCH V4 net] net: mana: Fix MANA VF unload when >host is unresponsive > >From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> >Date: Mon, 3 Jul 2023 01:49:31 -0700 > >> From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> > >Please sync your Git name and Git mail account settings, so that your own >patches won't have "From:" when sending. From what I see, you need to >correct first letters of name and surname to capital in the Git email settings >block. Thank you for pointing, I will fix it. > >> >> When unloading the MANA driver, mana_dealloc_queues() waits for the >> MANA hardware to complete any inflight packets and set the pending >> send count to zero. But if the hardware has failed, >> mana_dealloc_queues() could wait forever. >> >> Fix this by adding a timeout to the wait. Set the timeout to 120 >> seconds, which is a somewhat arbitrary value that is more than long >> enough for functional hardware to complete any sends. >> >> Signed-off-by: Souradeep Chakrabarti >> <schakrabarti@linux.microsoft.com> > >Where's "Fixes:" tagging the blamed commit? This is present from the day zero of the mana driver code. It has not been introduced in the code by any commit. > >> --- >> V3 -> V4: >> * Fixed the commit message to describe the context. >> * Removed the vf_unload_timeout, as it is not required. >> --- >> drivers/net/ethernet/microsoft/mana/mana_en.c | 26 >> ++++++++++++++++--- >> 1 file changed, 23 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c >> b/drivers/net/ethernet/microsoft/mana/mana_en.c >> index a499e460594b..d26f1da70411 100644 >> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c >> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c >> @@ -2346,7 +2346,10 @@ static int mana_dealloc_queues(struct >> net_device *ndev) { >> struct mana_port_context *apc = netdev_priv(ndev); >> struct gdma_dev *gd = apc->ac->gdma_dev; >> + unsigned long timeout; >> struct mana_txq *txq; >> + struct sk_buff *skb; >> + struct mana_cq *cq; >> int i, err; >> >> if (apc->port_is_up) >> @@ -2363,15 +2366,32 @@ static int mana_dealloc_queues(struct >net_device *ndev) >> * to false, but it doesn't matter since mana_start_xmit() drops any >> * new packets due to apc->port_is_up being false. >> * >> - * Drain all the in-flight TX packets >> + * Drain all the in-flight TX packets. >> + * A timeout of 120 seconds for all the queues is used. >> + * This will break the while loop when h/w is not responding. >> + * This value of 120 has been decided here considering max >> + * number of queues. >> */ >> + >> + timeout = jiffies + 120 * HZ; > >Why not initialize it right when declaring? I will fix it in next version. > >> for (i = 0; i < apc->num_queues; i++) { >> txq = &apc->tx_qp[i].txq; >> - >> - while (atomic_read(&txq->pending_sends) > 0) >> + while (atomic_read(&txq->pending_sends) > 0 && >> + time_before(jiffies, timeout)) { >> usleep_range(1000, 2000);> + } >> } > >120 seconds by 2 msec step is 60000 iterations, by 1 msec is 120000 >iterations. I know usleep_range() often is much less precise, but still. >Do you really need that much time? Has this been measured during the tests >that it can take up to 120 seconds or is it just some random value that "should >be enough"? >If you really need 120 seconds, I'd suggest using a timer / delayed work instead >of wasting resources. Here the intent is not waiting for 120 seconds, rather than avoid continue checking the pending_sends of each tx queues for an indefinite time, before freeing sk_buffs. The pending_sends can only get decreased for a tx queue, if mana_poll_tx_cq() gets called for a completion notification and increased by xmit. In this particular bug, apc->port_is_up is not set to false, causing xmit to keep increasing the pending_sends for the queue and mana_poll_tx_cq() not getting called for the queue. If we see the comment in the function mana_dealloc_queues(), it mentions it : 2346 /* No packet can be transmitted now since apc->port_is_up is false. 2347 * There is still a tiny chance that mana_poll_tx_cq() can re-enable 2348 * a txq because it may not timely see apc->port_is_up being cleared 2349 * to false, but it doesn't matter since mana_start_xmit() drops any 2350 * new packets due to apc->port_is_up being false. The value 120 seconds has been decided here based on maximum number of queues are allowed in this specific hardware, it is a safe assumption. > >> >> + for (i = 0; i < apc->num_queues; i++) { >> + txq = &apc->tx_qp[i].txq; >> + cq = &apc->tx_qp[i].tx_cq; > >cq can be just &txq->tx_cq. mana_txq structure does not have a pointer to mana_cq. > >> + while (atomic_read(&txq->pending_sends)) { >> + skb = skb_dequeue(&txq->pending_skbs); >> + mana_unmap_skb(skb, apc); >> + napi_consume_skb(skb, cq->budget); > >(you already have comment about this one) > >> + atomic_sub(1, &txq->pending_sends); >> + } >> + } >> /* We're 100% sure the queues can no longer be woken up, because >> * we're sure now mana_poll_tx_cq() can't be running. >> */ > >Thanks, >Olek
On Mon, 2023-07-03 at 19:55 +0000, Souradeep Chakrabarti wrote: > > -----Original Message----- > > From: Alexander Lobakin <aleksander.lobakin@intel.com> > > Sent: Monday, July 3, 2023 10:18 PM > > To: souradeep chakrabarti <schakrabarti@linux.microsoft.com> > > Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang > > <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui > > <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; > > kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay > > Sharma <sharmaajay@microsoft.com>; leon@kernel.org; > > cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; > > tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; > > linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org; > > stable@vger.kernel.org; Souradeep Chakrabarti <schakrabarti@microsoft.com> > > Subject: [EXTERNAL] Re: [PATCH V4 net] net: mana: Fix MANA VF unload when > > host is unresponsive > > > > From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> > > Date: Mon, 3 Jul 2023 01:49:31 -0700 > > > > > From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> > > > > Please sync your Git name and Git mail account settings, so that your own > > patches won't have "From:" when sending. From what I see, you need to > > correct first letters of name and surname to capital in the Git email settings > > block. > Thank you for pointing, I will fix it. > > > > > > > > When unloading the MANA driver, mana_dealloc_queues() waits for the > > > MANA hardware to complete any inflight packets and set the pending > > > send count to zero. But if the hardware has failed, > > > mana_dealloc_queues() could wait forever. > > > > > > Fix this by adding a timeout to the wait. Set the timeout to 120 > > > seconds, which is a somewhat arbitrary value that is more than long > > > enough for functional hardware to complete any sends. > > > > > > Signed-off-by: Souradeep Chakrabarti > > > <schakrabarti@linux.microsoft.com> > > > > Where's "Fixes:" tagging the blamed commit? > This is present from the day zero of the mana driver code. > It has not been introduced in the code by any commit. > Then the fixes tag should be: Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Cheers, Paolo
> -----Original Message----- > From: Souradeep Chakrabarti <schakrabarti@microsoft.com> > Sent: Monday, July 3, 2023 3:55 PM > To: Alexander Lobakin <aleksander.lobakin@intel.com>; souradeep chakrabarti > <schakrabarti@linux.microsoft.com> > Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang > <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui > <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; > kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay > Sharma <sharmaajay@microsoft.com>; leon@kernel.org; > cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; > tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; > linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org; > stable@vger.kernel.org > Subject: RE: [EXTERNAL] Re: [PATCH V4 net] net: mana: Fix MANA VF unload > when host is unresponsive > > > > >-----Original Message----- > >From: Alexander Lobakin <aleksander.lobakin@intel.com> > >Sent: Monday, July 3, 2023 10:18 PM > >To: souradeep chakrabarti <schakrabarti@linux.microsoft.com> > >Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang > ><haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui > ><decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; > >kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay > >Sharma <sharmaajay@microsoft.com>; leon@kernel.org; > >cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; > >tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; > >linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org; > >stable@vger.kernel.org; Souradeep Chakrabarti > <schakrabarti@microsoft.com> > >Subject: [EXTERNAL] Re: [PATCH V4 net] net: mana: Fix MANA VF unload when > >host is unresponsive > > > >From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> > >Date: Mon, 3 Jul 2023 01:49:31 -0700 > > > >> From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> > > > >Please sync your Git name and Git mail account settings, so that your own > >patches won't have "From:" when sending. From what I see, you need to > >correct first letters of name and surname to capital in the Git email settings > >block. > Thank you for pointing, I will fix it. > > > >> > >> When unloading the MANA driver, mana_dealloc_queues() waits for the > >> MANA hardware to complete any inflight packets and set the pending > >> send count to zero. But if the hardware has failed, > >> mana_dealloc_queues() could wait forever. > >> > >> Fix this by adding a timeout to the wait. Set the timeout to 120 > >> seconds, which is a somewhat arbitrary value that is more than long > >> enough for functional hardware to complete any sends. > >> > >> Signed-off-by: Souradeep Chakrabarti > >> <schakrabarti@linux.microsoft.com> > > > >Where's "Fixes:" tagging the blamed commit? > This is present from the day zero of the mana driver code. > It has not been introduced in the code by any commit. > > > >> --- > >> V3 -> V4: > >> * Fixed the commit message to describe the context. > >> * Removed the vf_unload_timeout, as it is not required. > >> --- > >> drivers/net/ethernet/microsoft/mana/mana_en.c | 26 > >> ++++++++++++++++--- > >> 1 file changed, 23 insertions(+), 3 deletions(-) > >> > >> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c > >> b/drivers/net/ethernet/microsoft/mana/mana_en.c > >> index a499e460594b..d26f1da70411 100644 > >> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c > >> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c > >> @@ -2346,7 +2346,10 @@ static int mana_dealloc_queues(struct > >> net_device *ndev) { > >> struct mana_port_context *apc = netdev_priv(ndev); > >> struct gdma_dev *gd = apc->ac->gdma_dev; > >> + unsigned long timeout; > >> struct mana_txq *txq; > >> + struct sk_buff *skb; > >> + struct mana_cq *cq; > >> int i, err; > >> > >> if (apc->port_is_up) > >> @@ -2363,15 +2366,32 @@ static int mana_dealloc_queues(struct > >net_device *ndev) > >> * to false, but it doesn't matter since mana_start_xmit() drops any > >> * new packets due to apc->port_is_up being false. > >> * > >> - * Drain all the in-flight TX packets > >> + * Drain all the in-flight TX packets. > >> + * A timeout of 120 seconds for all the queues is used. > >> + * This will break the while loop when h/w is not responding. > >> + * This value of 120 has been decided here considering max > >> + * number of queues. > >> */ > >> + > >> + timeout = jiffies + 120 * HZ; > > > >Why not initialize it right when declaring? > I will fix it in next version. > > > >> for (i = 0; i < apc->num_queues; i++) { > >> txq = &apc->tx_qp[i].txq; > >> - > >> - while (atomic_read(&txq->pending_sends) > 0) > >> + while (atomic_read(&txq->pending_sends) > 0 && > >> + time_before(jiffies, timeout)) { > >> usleep_range(1000, 2000);> + } > >> } > > > >120 seconds by 2 msec step is 60000 iterations, by 1 msec is 120000 > >iterations. I know usleep_range() often is much less precise, but still. > >Do you really need that much time? Has this been measured during the tests > >that it can take up to 120 seconds or is it just some random value that "should > >be enough"? > >If you really need 120 seconds, I'd suggest using a timer / delayed work > instead > >of wasting resources. > Here the intent is not waiting for 120 seconds, rather than avoid continue > checking the > pending_sends of each tx queues for an indefinite time, before freeing > sk_buffs. > The pending_sends can only get decreased for a tx queue, if mana_poll_tx_cq() > gets called for a completion notification and increased by xmit. > > In this particular bug, apc->port_is_up is not set to false, causing > xmit to keep increasing the pending_sends for the queue and > mana_poll_tx_cq() > not getting called for the queue. > > If we see the comment in the function mana_dealloc_queues(), it mentions it : > > 2346 /* No packet can be transmitted now since apc->port_is_up is false. > 2347 * There is still a tiny chance that mana_poll_tx_cq() can re-enable > 2348 * a txq because it may not timely see apc->port_is_up being cleared > 2349 * to false, but it doesn't matter since mana_start_xmit() drops any > 2350 * new packets due to apc->port_is_up being false. > > The value 120 seconds has been decided here based on maximum number of > queues > are allowed in this specific hardware, it is a safe assumption. I agree. Also, this waiting time is usually much shorter than 120 sec. The long wait only happens in rare and unexpected NIC HW non-responding cases. To further reduce the resource consumption, we can double the usleep_range() time in every iteration. So, the number of iterations will be greatly reduced before reaching 120 sec. Thanks, - Haiyang
From: Souradeep Chakrabarti <schakrabarti@microsoft.com> Date: Mon, 3 Jul 2023 19:55:06 +0000 > > >> -----Original Message----- >> From: Alexander Lobakin <aleksander.lobakin@intel.com> >> Sent: Monday, July 3, 2023 10:18 PM [...] >>> for (i = 0; i < apc->num_queues; i++) { >>> txq = &apc->tx_qp[i].txq; >>> - >>> - while (atomic_read(&txq->pending_sends) > 0) >>> + while (atomic_read(&txq->pending_sends) > 0 && >>> + time_before(jiffies, timeout)) { >>> usleep_range(1000, 2000);> + } >>> } >> >> 120 seconds by 2 msec step is 60000 iterations, by 1 msec is 120000 >> iterations. I know usleep_range() often is much less precise, but still. >> Do you really need that much time? Has this been measured during the tests >> that it can take up to 120 seconds or is it just some random value that "should >> be enough"? >> If you really need 120 seconds, I'd suggest using a timer / delayed work instead >> of wasting resources. > Here the intent is not waiting for 120 seconds, rather than avoid continue checking the > pending_sends of each tx queues for an indefinite time, before freeing sk_buffs. > The pending_sends can only get decreased for a tx queue, if mana_poll_tx_cq() > gets called for a completion notification and increased by xmit. > > In this particular bug, apc->port_is_up is not set to false, causing > xmit to keep increasing the pending_sends for the queue and mana_poll_tx_cq() > not getting called for the queue. > > If we see the comment in the function mana_dealloc_queues(), it mentions it : > > 2346 /* No packet can be transmitted now since apc->port_is_up is false. > 2347 * There is still a tiny chance that mana_poll_tx_cq() can re-enable > 2348 * a txq because it may not timely see apc->port_is_up being cleared > 2349 * to false, but it doesn't matter since mana_start_xmit() drops any > 2350 * new packets due to apc->port_is_up being false. > > The value 120 seconds has been decided here based on maximum number of queues This is quite opposite to what you're saying above. How should I connect these two: Here the intent is not waiting for 120 seconds + The value 120 seconds has been decided here based on maximum number of queues ? Can cleaning the Tx queues really last for 120 seconds? My understanding is that timeouts need to be sensible and not go to the outer space. What is the medium value you got during the tests? > are allowed in this specific hardware, it is a safe assumption. >> >>> >>> + for (i = 0; i < apc->num_queues; i++) { >>> + txq = &apc->tx_qp[i].txq; >>> + cq = &apc->tx_qp[i].tx_cq; >> >> cq can be just &txq->tx_cq. > mana_txq structure does not have a pointer to mana_cq. Sorry, misread, my bad. >> >>> + while (atomic_read(&txq->pending_sends)) { >>> + skb = skb_dequeue(&txq->pending_skbs); >>> + mana_unmap_skb(skb, apc); >>> + napi_consume_skb(skb, cq->budget); >> >> (you already have comment about this one) >> >>> + atomic_sub(1, &txq->pending_sends); >>> + } >>> + } >>> /* We're 100% sure the queues can no longer be woken up, because >>> * we're sure now mana_poll_tx_cq() can't be running. >>> */ >> >> Thanks, >> Olek Thanks, Olek
>-----Original Message----- >From: Alexander Lobakin <aleksander.lobakin@intel.com> >Sent: Wednesday, July 5, 2023 8:06 PM >To: Souradeep Chakrabarti <schakrabarti@microsoft.com>; souradeep >chakrabarti <schakrabarti@linux.microsoft.com> >Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang ><haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui ><decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; >kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay >Sharma <sharmaajay@microsoft.com>; leon@kernel.org; >cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; >tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; >linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org; >stable@vger.kernel.org >Subject: Re: [EXTERNAL] Re: [PATCH V4 net] net: mana: Fix MANA VF unload >when host is unresponsive > >[You don't often get email from aleksander.lobakin@intel.com. Learn why this is >important at https://aka.ms/LearnAboutSenderIdentification ] > >From: Souradeep Chakrabarti <schakrabarti@microsoft.com> >Date: Mon, 3 Jul 2023 19:55:06 +0000 > >> >> >>> -----Original Message----- >>> From: Alexander Lobakin <aleksander.lobakin@intel.com> >>> Sent: Monday, July 3, 2023 10:18 PM > >[...] > >>>> for (i = 0; i < apc->num_queues; i++) { >>>> txq = &apc->tx_qp[i].txq; >>>> - >>>> - while (atomic_read(&txq->pending_sends) > 0) >>>> + while (atomic_read(&txq->pending_sends) > 0 && >>>> + time_before(jiffies, timeout)) { >>>> usleep_range(1000, 2000);> + } >>>> } >>> >>> 120 seconds by 2 msec step is 60000 iterations, by 1 msec is 120000 >>> iterations. I know usleep_range() often is much less precise, but still. >>> Do you really need that much time? Has this been measured during the >>> tests that it can take up to 120 seconds or is it just some random >>> value that "should be enough"? >>> If you really need 120 seconds, I'd suggest using a timer / delayed >>> work instead of wasting resources. >> Here the intent is not waiting for 120 seconds, rather than avoid >> continue checking the pending_sends of each tx queues for an indefinite time, >before freeing sk_buffs. >> The pending_sends can only get decreased for a tx queue, if >> mana_poll_tx_cq() gets called for a completion notification and increased by >xmit. >> >> In this particular bug, apc->port_is_up is not set to false, causing >> xmit to keep increasing the pending_sends for the queue and >> mana_poll_tx_cq() not getting called for the queue. >> >> If we see the comment in the function mana_dealloc_queues(), it mentions it : >> >> 2346 /* No packet can be transmitted now since apc->port_is_up is false. >> 2347 * There is still a tiny chance that mana_poll_tx_cq() can re-enable >> 2348 * a txq because it may not timely see apc->port_is_up being cleared >> 2349 * to false, but it doesn't matter since mana_start_xmit() drops any >> 2350 * new packets due to apc->port_is_up being false. >> >> The value 120 seconds has been decided here based on maximum number of >> queues > >This is quite opposite to what you're saying above. How should I connect these >two: > >Here the intent is not waiting for 120 seconds > >+ > >The value 120 seconds has been decided here based on maximum number of >queues > >? >Can cleaning the Tx queues really last for 120 seconds? >My understanding is that timeouts need to be sensible and not go to the outer >space. What is the medium value you got during the tests? > For each queue each takes few milli second, in a normal condition. So based on maximum number of allowed queues for our h/w it won't go beyond a sec. The 120s only happens rarely during some NIC HW issue -unexpected. So this timeout will only trigger in a very rare scenario. >> are allowed in this specific hardware, it is a safe assumption. >>> >>>> >>>> + for (i = 0; i < apc->num_queues; i++) { >>>> + txq = &apc->tx_qp[i].txq; >>>> + cq = &apc->tx_qp[i].tx_cq; >>> >>> cq can be just &txq->tx_cq. >> mana_txq structure does not have a pointer to mana_cq. > >Sorry, misread, my bad. > >>> >>>> + while (atomic_read(&txq->pending_sends)) { >>>> + skb = skb_dequeue(&txq->pending_skbs); >>>> + mana_unmap_skb(skb, apc); >>>> + napi_consume_skb(skb, cq->budget); >>> >>> (you already have comment about this one) >>> >>>> + atomic_sub(1, &txq->pending_sends); >>>> + } >>>> + } >>>> /* We're 100% sure the queues can no longer be woken up, because >>>> * we're sure now mana_poll_tx_cq() can't be running. >>>> */ >>> >>> Thanks, >>> Olek >Thanks, >Olek
From: Souradeep Chakrabarti <schakrabarti@microsoft.com> Date: Thu, 6 Jul 2023 10:41:03 +0000 > > >> -----Original Message----- >> From: Alexander Lobakin <aleksander.lobakin@intel.com> >> Sent: Wednesday, July 5, 2023 8:06 PM [...] >>>> 120 seconds by 2 msec step is 60000 iterations, by 1 msec is 120000 >>>> iterations. I know usleep_range() often is much less precise, but still. >>>> Do you really need that much time? Has this been measured during the >>>> tests that it can take up to 120 seconds or is it just some random >>>> value that "should be enough"? >>>> If you really need 120 seconds, I'd suggest using a timer / delayed >>>> work instead of wasting resources. >>> Here the intent is not waiting for 120 seconds, rather than avoid >>> continue checking the pending_sends of each tx queues for an indefinite time, >> before freeing sk_buffs. >>> The pending_sends can only get decreased for a tx queue, if >>> mana_poll_tx_cq() gets called for a completion notification and increased by >> xmit. >>> >>> In this particular bug, apc->port_is_up is not set to false, causing >>> xmit to keep increasing the pending_sends for the queue and >>> mana_poll_tx_cq() not getting called for the queue. >>> >>> If we see the comment in the function mana_dealloc_queues(), it mentions it : >>> >>> 2346 /* No packet can be transmitted now since apc->port_is_up is false. >>> 2347 * There is still a tiny chance that mana_poll_tx_cq() can re-enable >>> 2348 * a txq because it may not timely see apc->port_is_up being cleared >>> 2349 * to false, but it doesn't matter since mana_start_xmit() drops any >>> 2350 * new packets due to apc->port_is_up being false. >>> >>> The value 120 seconds has been decided here based on maximum number of >>> queues >> >> This is quite opposite to what you're saying above. How should I connect these >> two: >> >> Here the intent is not waiting for 120 seconds >> >> + >> >> The value 120 seconds has been decided here based on maximum number of >> queues >> >> ? >> Can cleaning the Tx queues really last for 120 seconds? >> My understanding is that timeouts need to be sensible and not go to the outer >> space. What is the medium value you got during the tests? >> > For each queue each takes few milli second, in a normal condition. So > based on maximum number of allowed queues for our h/w it won't > go beyond a sec. > The 120s only happens rarely during some NIC HW issue -unexpected. > So this timeout will only trigger in a very rare scenario. So set the timeout to 2 seconds if it makes no difference? >>> are allowed in this specific hardware, it is a safe assumption. >>>> >>>>> >>>>> + for (i = 0; i < apc->num_queues; i++) { >>>>> + txq = &apc->tx_qp[i].txq; >>>>> + cq = &apc->tx_qp[i].tx_cq; [...] Thanks, Olek
>-----Original Message----- >From: Alexander Lobakin <aleksander.lobakin@intel.com> >Sent: Thursday, July 6, 2023 5:09 PM >To: Souradeep Chakrabarti <schakrabarti@microsoft.com>; souradeep >chakrabarti <schakrabarti@linux.microsoft.com> >Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang ><haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui ><decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; >kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay >Sharma <sharmaajay@microsoft.com>; leon@kernel.org; >cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; >tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; >linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org; >stable@vger.kernel.org >Subject: Re: [EXTERNAL] Re: [PATCH V4 net] net: mana: Fix MANA VF unload >when host is unresponsive > >From: Souradeep Chakrabarti <schakrabarti@microsoft.com> >Date: Thu, 6 Jul 2023 10:41:03 +0000 > >> >> >>> -----Original Message----- >>> From: Alexander Lobakin <aleksander.lobakin@intel.com> >>> Sent: Wednesday, July 5, 2023 8:06 PM > >[...] > >>>>> 120 seconds by 2 msec step is 60000 iterations, by 1 msec is 120000 >>>>> iterations. I know usleep_range() often is much less precise, but still. >>>>> Do you really need that much time? Has this been measured during >>>>> the tests that it can take up to 120 seconds or is it just some >>>>> random value that "should be enough"? >>>>> If you really need 120 seconds, I'd suggest using a timer / delayed >>>>> work instead of wasting resources. >>>> Here the intent is not waiting for 120 seconds, rather than avoid >>>> continue checking the pending_sends of each tx queues for an >>>> indefinite time, >>> before freeing sk_buffs. >>>> The pending_sends can only get decreased for a tx queue, if >>>> mana_poll_tx_cq() gets called for a completion notification and >>>> increased by >>> xmit. >>>> >>>> In this particular bug, apc->port_is_up is not set to false, causing >>>> xmit to keep increasing the pending_sends for the queue and >>>> mana_poll_tx_cq() not getting called for the queue. >>>> >>>> If we see the comment in the function mana_dealloc_queues(), it mentions >it : >>>> >>>> 2346 /* No packet can be transmitted now since apc->port_is_up is false. >>>> 2347 * There is still a tiny chance that mana_poll_tx_cq() can re-enable >>>> 2348 * a txq because it may not timely see apc->port_is_up being cleared >>>> 2349 * to false, but it doesn't matter since mana_start_xmit() drops any >>>> 2350 * new packets due to apc->port_is_up being false. >>>> >>>> The value 120 seconds has been decided here based on maximum number >>>> of queues >>> >>> This is quite opposite to what you're saying above. How should I >>> connect these >>> two: >>> >>> Here the intent is not waiting for 120 seconds >>> >>> + >>> >>> The value 120 seconds has been decided here based on maximum number >>> of queues >>> >>> ? >>> Can cleaning the Tx queues really last for 120 seconds? >>> My understanding is that timeouts need to be sensible and not go to >>> the outer space. What is the medium value you got during the tests? >>> >> For each queue each takes few milli second, in a normal condition. So >> based on maximum number of allowed queues for our h/w it won't go >> beyond a sec. >> The 120s only happens rarely during some NIC HW issue -unexpected. >> So this timeout will only trigger in a very rare scenario. > >So set the timeout to 2 seconds if it makes no difference? It can go near 120 seconds in a very rare MANA h/w scenario. That normally won't happen. But during that scenario, we may need 120 seconds. > >>>> are allowed in this specific hardware, it is a safe assumption. >>>>> >>>>>> >>>>>> + for (i = 0; i < apc->num_queues; i++) { >>>>>> + txq = &apc->tx_qp[i].txq; >>>>>> + cq = &apc->tx_qp[i].tx_cq; >[...] > >Thanks, >Olek
From: Souradeep Chakrabarti <schakrabarti@microsoft.com> Date: Thu, 6 Jul 2023 11:43:58 +0000 > > >> -----Original Message----- >> From: Alexander Lobakin <aleksander.lobakin@intel.com> >> Sent: Thursday, July 6, 2023 5:09 PM >> To: Souradeep Chakrabarti <schakrabarti@microsoft.com>; souradeep >> chakrabarti <schakrabarti@linux.microsoft.com> >> Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang >> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui >> <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; >> kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay >> Sharma <sharmaajay@microsoft.com>; leon@kernel.org; >> cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; >> tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; >> linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org; >> stable@vger.kernel.org >> Subject: Re: [EXTERNAL] Re: [PATCH V4 net] net: mana: Fix MANA VF unload >> when host is unresponsive >> >> From: Souradeep Chakrabarti <schakrabarti@microsoft.com> >> Date: Thu, 6 Jul 2023 10:41:03 +0000 >> >>> >>> >>>> -----Original Message----- >>>> From: Alexander Lobakin <aleksander.lobakin@intel.com> >>>> Sent: Wednesday, July 5, 2023 8:06 PM >> >> [...] >> >>>>>> 120 seconds by 2 msec step is 60000 iterations, by 1 msec is 120000 >>>>>> iterations. I know usleep_range() often is much less precise, but still. >>>>>> Do you really need that much time? Has this been measured during >>>>>> the tests that it can take up to 120 seconds or is it just some >>>>>> random value that "should be enough"? >>>>>> If you really need 120 seconds, I'd suggest using a timer / delayed >>>>>> work instead of wasting resources. >>>>> Here the intent is not waiting for 120 seconds, rather than avoid >>>>> continue checking the pending_sends of each tx queues for an >>>>> indefinite time, >>>> before freeing sk_buffs. >>>>> The pending_sends can only get decreased for a tx queue, if >>>>> mana_poll_tx_cq() gets called for a completion notification and >>>>> increased by >>>> xmit. >>>>> >>>>> In this particular bug, apc->port_is_up is not set to false, causing >>>>> xmit to keep increasing the pending_sends for the queue and >>>>> mana_poll_tx_cq() not getting called for the queue. >>>>> >>>>> If we see the comment in the function mana_dealloc_queues(), it mentions >> it : >>>>> >>>>> 2346 /* No packet can be transmitted now since apc->port_is_up is false. >>>>> 2347 * There is still a tiny chance that mana_poll_tx_cq() can re-enable >>>>> 2348 * a txq because it may not timely see apc->port_is_up being cleared >>>>> 2349 * to false, but it doesn't matter since mana_start_xmit() drops any >>>>> 2350 * new packets due to apc->port_is_up being false. >>>>> >>>>> The value 120 seconds has been decided here based on maximum number >>>>> of queues >>>> >>>> This is quite opposite to what you're saying above. How should I >>>> connect these >>>> two: >>>> >>>> Here the intent is not waiting for 120 seconds >>>> >>>> + >>>> >>>> The value 120 seconds has been decided here based on maximum number >>>> of queues >>>> >>>> ? >>>> Can cleaning the Tx queues really last for 120 seconds? >>>> My understanding is that timeouts need to be sensible and not go to >>>> the outer space. What is the medium value you got during the tests? >>>> >>> For each queue each takes few milli second, in a normal condition. So >>> based on maximum number of allowed queues for our h/w it won't go >>> beyond a sec. >>> The 120s only happens rarely during some NIC HW issue -unexpected. >>> So this timeout will only trigger in a very rare scenario. >> >> So set the timeout to 2 seconds if it makes no difference? > It can go near 120 seconds in a very rare MANA h/w scenario. That normally won't happen. > But during that scenario, we may need 120 seconds. This waiting loop is needed to let the pending Tx packets be sent. If they weren't sent in 1 second, it most likely makes no sense already whether they will be sent at all or not -- the destination host won't wait for them for so long. You say that it may happen only in case of HW issue. If so, I assume you need to fix it some way, e.g. do a HW reset or so? If so, why bother waiting for Tx completions if Tx is hung? You free all skbs later either way, so there are no leaks. >> >>>>> are allowed in this specific hardware, it is a safe assumption. >>>>>> >>>>>>> >>>>>>> + for (i = 0; i < apc->num_queues; i++) { >>>>>>> + txq = &apc->tx_qp[i].txq; >>>>>>> + cq = &apc->tx_qp[i].tx_cq; >> [...] >> >> Thanks, >> Olek Thanks, Olek
> -----Original Message----- > From: Alexander Lobakin <aleksander.lobakin@intel.com> > Sent: Thursday, July 6, 2023 7:49 AM > To: Souradeep Chakrabarti <schakrabarti@microsoft.com>; souradeep > chakrabarti <schakrabarti@linux.microsoft.com> > Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang > <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui > <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; > kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay > Sharma <sharmaajay@microsoft.com>; leon@kernel.org; > cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; > tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; > linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org; > stable@vger.kernel.org > Subject: Re: [EXTERNAL] Re: [PATCH V4 net] net: mana: Fix MANA VF unload > when host is unresponsive > > From: Souradeep Chakrabarti <schakrabarti@microsoft.com> > Date: Thu, 6 Jul 2023 11:43:58 +0000 > > > > > > >> -----Original Message----- > >> From: Alexander Lobakin <aleksander.lobakin@intel.com> > >> Sent: Thursday, July 6, 2023 5:09 PM > >> To: Souradeep Chakrabarti <schakrabarti@microsoft.com>; souradeep > >> chakrabarti <schakrabarti@linux.microsoft.com> > >> Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang > >> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui > >> <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; > >> kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; > Ajay > >> Sharma <sharmaajay@microsoft.com>; leon@kernel.org; > >> cai.huoqing@linux.dev; ssengar@linux.microsoft.com; > vkuznets@redhat.com; > >> tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; > >> linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org; > >> stable@vger.kernel.org > >> Subject: Re: [EXTERNAL] Re: [PATCH V4 net] net: mana: Fix MANA VF unload > >> when host is unresponsive > >> > >> From: Souradeep Chakrabarti <schakrabarti@microsoft.com> > >> Date: Thu, 6 Jul 2023 10:41:03 +0000 > >> > >>> > >>> > >>>> -----Original Message----- > >>>> From: Alexander Lobakin <aleksander.lobakin@intel.com> > >>>> Sent: Wednesday, July 5, 2023 8:06 PM > >> > >> [...] > >> > >>>>>> 120 seconds by 2 msec step is 60000 iterations, by 1 msec is 120000 > >>>>>> iterations. I know usleep_range() often is much less precise, but still. > >>>>>> Do you really need that much time? Has this been measured during > >>>>>> the tests that it can take up to 120 seconds or is it just some > >>>>>> random value that "should be enough"? > >>>>>> If you really need 120 seconds, I'd suggest using a timer / delayed > >>>>>> work instead of wasting resources. > >>>>> Here the intent is not waiting for 120 seconds, rather than avoid > >>>>> continue checking the pending_sends of each tx queues for an > >>>>> indefinite time, > >>>> before freeing sk_buffs. > >>>>> The pending_sends can only get decreased for a tx queue, if > >>>>> mana_poll_tx_cq() gets called for a completion notification and > >>>>> increased by > >>>> xmit. > >>>>> > >>>>> In this particular bug, apc->port_is_up is not set to false, causing > >>>>> xmit to keep increasing the pending_sends for the queue and > >>>>> mana_poll_tx_cq() not getting called for the queue. > >>>>> > >>>>> If we see the comment in the function mana_dealloc_queues(), it > mentions > >> it : > >>>>> > >>>>> 2346 /* No packet can be transmitted now since apc->port_is_up is > false. > >>>>> 2347 * There is still a tiny chance that mana_poll_tx_cq() can re- > enable > >>>>> 2348 * a txq because it may not timely see apc->port_is_up being > cleared > >>>>> 2349 * to false, but it doesn't matter since mana_start_xmit() drops > any > >>>>> 2350 * new packets due to apc->port_is_up being false. > >>>>> > >>>>> The value 120 seconds has been decided here based on maximum > number > >>>>> of queues > >>>> > >>>> This is quite opposite to what you're saying above. How should I > >>>> connect these > >>>> two: > >>>> > >>>> Here the intent is not waiting for 120 seconds > >>>> > >>>> + > >>>> > >>>> The value 120 seconds has been decided here based on maximum number > >>>> of queues > >>>> > >>>> ? > >>>> Can cleaning the Tx queues really last for 120 seconds? > >>>> My understanding is that timeouts need to be sensible and not go to > >>>> the outer space. What is the medium value you got during the tests? > >>>> > >>> For each queue each takes few milli second, in a normal condition. So > >>> based on maximum number of allowed queues for our h/w it won't go > >>> beyond a sec. > >>> The 120s only happens rarely during some NIC HW issue -unexpected. > >>> So this timeout will only trigger in a very rare scenario. > >> > >> So set the timeout to 2 seconds if it makes no difference? > > It can go near 120 seconds in a very rare MANA h/w scenario. That normally > won't happen. > > But during that scenario, we may need 120 seconds. > > This waiting loop is needed to let the pending Tx packets be sent. If > they weren't sent in 1 second, it most likely makes no sense already > whether they will be sent at all or not -- the destination host won't > wait for them for so long. > You say that it may happen only in case of HW issue. If so, I assume you > need to fix it some way, e.g. do a HW reset or so? If so, why bother > waiting for Tx completions if Tx is hung? You free all skbs later either > way, so there are no leaks. At that point, we don't actually care if the pending packets are sent or not. But if we free the queues too soon, and the HW is slow for unexpected reasons, a delayed completion notice will DMA into the freed memory and cause corruption. That's why we have a longer waiting time. Souradeep, you may double check with hostnet team to see what's the best waiting time to ensure no more HW activities. Thanks, - Haiyang
On Thu, Jul 06, 2023 at 01:54:35PM +0000, Haiyang Zhang wrote: > > This waiting loop is needed to let the pending Tx packets be sent. If > > they weren't sent in 1 second, it most likely makes no sense already > > whether they will be sent at all or not -- the destination host won't > > wait for them for so long. > > You say that it may happen only in case of HW issue. If so, I assume you > > need to fix it some way, e.g. do a HW reset or so? If so, why bother > > waiting for Tx completions if Tx is hung? You free all skbs later either > > way, so there are no leaks. > > At that point, we don't actually care if the pending packets are sent or not. > But if we free the queues too soon, and the HW is slow for unexpected > reasons, a delayed completion notice will DMA into the freed memory and > cause corruption. That's why we have a longer waiting time. Aieiiie that is a horrible HW design to not have a strong fence of DMA. "just wait and hope the HW doesn't UAF the kernel with DMA" is really awful. Jason
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c index a499e460594b..d26f1da70411 100644 --- a/drivers/net/ethernet/microsoft/mana/mana_en.c +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c @@ -2346,7 +2346,10 @@ static int mana_dealloc_queues(struct net_device *ndev) { struct mana_port_context *apc = netdev_priv(ndev); struct gdma_dev *gd = apc->ac->gdma_dev; + unsigned long timeout; struct mana_txq *txq; + struct sk_buff *skb; + struct mana_cq *cq; int i, err; if (apc->port_is_up) @@ -2363,15 +2366,32 @@ static int mana_dealloc_queues(struct net_device *ndev) * to false, but it doesn't matter since mana_start_xmit() drops any * new packets due to apc->port_is_up being false. * - * Drain all the in-flight TX packets + * Drain all the in-flight TX packets. + * A timeout of 120 seconds for all the queues is used. + * This will break the while loop when h/w is not responding. + * This value of 120 has been decided here considering max + * number of queues. */ + + timeout = jiffies + 120 * HZ; for (i = 0; i < apc->num_queues; i++) { txq = &apc->tx_qp[i].txq; - - while (atomic_read(&txq->pending_sends) > 0) + while (atomic_read(&txq->pending_sends) > 0 && + time_before(jiffies, timeout)) { usleep_range(1000, 2000); + } } + for (i = 0; i < apc->num_queues; i++) { + txq = &apc->tx_qp[i].txq; + cq = &apc->tx_qp[i].tx_cq; + while (atomic_read(&txq->pending_sends)) { + skb = skb_dequeue(&txq->pending_skbs); + mana_unmap_skb(skb, apc); + napi_consume_skb(skb, cq->budget); + atomic_sub(1, &txq->pending_sends); + } + } /* We're 100% sure the queues can no longer be woken up, because * we're sure now mana_poll_tx_cq() can't be running. */