Message ID | 1687771224-27162-1-git-send-email-schakrabarti@linux.microsoft.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:994d:0:b0:3d9:f83d:47d9 with SMTP id k13csp7358302vqr; Mon, 26 Jun 2023 02:44:27 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ5N/O2RtaRhnlZbp5G85RDAd5b5WexjM4YWbsEqmo9f2B7vAHn7iKthH2wLfII51duUTvaD X-Received: by 2002:a50:e604:0:b0:518:9174:9b5f with SMTP id y4-20020a50e604000000b0051891749b5fmr24059594edm.1.1687772667346; Mon, 26 Jun 2023 02:44:27 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1687772667; cv=none; d=google.com; s=arc-20160816; b=WivpDytHJX7lZCHYH32xzRk4Go/5AuPc/GTpgZekV3rM/MqDG0meWmy10CPeZ7HYIY D0Qi2DiM35UKlwmFX/yB6h1EQg3NWfjhnza9xl9cgUculsKSeTqLgJ38W2FDLbzKYAaY 2O0crXjXgj2ickEtp1B5voTVyNN9vI97QMrHhBqPaJQGNHwkmNbw//lN6+EohM2k51bV cj3WxEMG/pMAm0uU0NBRsMIrEiSuQgi7RAGsl0M8GZs301aILbAvw0+xaZ5/MdOJIzsH RIQvzW4SdF0ELLJpvaJEHbVv3wgdQztNciTTTDLm25Rme06H79WEant17B/gdpcivEFB OmHA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:references:in-reply-to:message-id:date:subject :cc:to:from:dkim-signature:dkim-filter; bh=VB4MexHF8wUx/3UJxoMDpyL6ZRDMzCiGHwtXMUapkGk=; fh=qKGHJM/OaNHTo96+1HG2BqACTjy4q9C0dbufp/+3Ef4=; b=tR4rUGA/mHxg0ehR09t+KUiBOxGUFXvmGmw7MJWMVQXR1H56QzlH7ld8NU75euUZE1 YrcMd01lb1kNbPcSIUMgJ2PDzRsM3kuX4awW1pUl7kPDb99c4aeWLMGK5ezucqwSoHj6 UK1rGBX3R9UPX0A9LXpaFdISV0F4kWJJHCjm9uM53yEZiSEgAm8iS4/kuFAx7/SZPuag +o0cMZQmFIE6lfnk6/aOSTsSqKNueAV0wMuhNuWBdX0pz7v47HG5OYSORzeefGspjWA/ 8/iLsgnr2m1PHqyiUBPCq9ZRzvUKMZHiOQx5WomWnQbkqfHkT26krftYcE+XKW8EMTlf pf4g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linux.microsoft.com header.s=default header.b=aGUviutx; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.microsoft.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id r18-20020aa7cb92000000b0051be93a4395si2477078edt.52.2023.06.26.02.44.03; Mon, 26 Jun 2023 02:44:27 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@linux.microsoft.com header.s=default header.b=aGUviutx; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.microsoft.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230318AbjFZJW7 (ORCPT <rfc822;filip.gregor98@gmail.com> + 99 others); Mon, 26 Jun 2023 05:22:59 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35344 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229848AbjFZJWe (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 26 Jun 2023 05:22:34 -0400 Received: from linux.microsoft.com (linux.microsoft.com [13.77.154.182]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id B868A30ED; Mon, 26 Jun 2023 02:20:26 -0700 (PDT) Received: by linux.microsoft.com (Postfix, from userid 1099) id 0B90D21C3F28; Mon, 26 Jun 2023 02:20:26 -0700 (PDT) DKIM-Filter: OpenDKIM Filter v2.11.0 linux.microsoft.com 0B90D21C3F28 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.microsoft.com; s=default; t=1687771226; bh=VB4MexHF8wUx/3UJxoMDpyL6ZRDMzCiGHwtXMUapkGk=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=aGUviutx7TQIUQCUbspcEJ5RfWrsdiEnMmLa1VIOTU4KcMiVu/eSDvRmLIB9Vj/1G G/otWpA7LWm+Md5BEoCdwirBQt3hTRoNAk1fVkN1Rg2IRz/+rDW8c/Z+aI/IcAREk9 97u7ti3KNh+swoUgKVfWtKHnmTNT/RCY/rt3gCQY= From: souradeep chakrabarti <schakrabarti@linux.microsoft.com> To: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, longli@microsoft.com, sharmaajay@microsoft.com, leon@kernel.org, cai.huoqing@linux.dev, ssengar@linux.microsoft.com, vkuznets@redhat.com, tglx@linutronix.de, linux-hyperv@vger.kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org Cc: stable@vger.kernel.org, schakrabarti@microsoft.com, Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> Subject: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is unresponsive Date: Mon, 26 Jun 2023 02:20:24 -0700 Message-Id: <1687771224-27162-1-git-send-email-schakrabarti@linux.microsoft.com> X-Mailer: git-send-email 1.8.3.1 In-Reply-To: <1687771098-26775-1-git-send-email-schakrabarti@linux.microsoft.com> References: <1687771098-26775-1-git-send-email-schakrabarti@linux.microsoft.com> X-Spam-Status: No, score=-19.8 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_MED, SPF_HELO_PASS,SPF_PASS,T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL, USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1769757565421688496?= X-GMAIL-MSGID: =?utf-8?q?1769757912026093527?= |
Series |
[1/2,V3,net] net: mana: Fix MANA VF unload when host is unresponsive
|
|
Commit Message
Souradeep Chakrabarti
June 26, 2023, 9:20 a.m. UTC
From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> This is the second part of the fix. Also this patch adds a new attribute in mana_context, which gets set when mana_hwc_send_request() hits a timeout because of host unresponsiveness. This flag then helps to avoid the timeouts in successive calls. Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a driver for Microsoft Azure Network Adapter) Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> --- V2 -> V3: * Removed the initialization of vf_unload_timeout * Splitted the patch in two. * Fixed extra space from the commit message. --- drivers/net/ethernet/microsoft/mana/gdma_main.c | 4 +++- drivers/net/ethernet/microsoft/mana/hw_channel.c | 12 +++++++++++- include/net/mana/mana.h | 2 ++ 3 files changed, 16 insertions(+), 2 deletions(-)
Comments
On 6/26/2023 2:50 PM, souradeep chakrabarti wrote: > From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> > > This is the second part of the fix. > > Also this patch adds a new attribute in mana_context, which gets set when > mana_hwc_send_request() hits a timeout because of host unresponsiveness. > This flag then helps to avoid the timeouts in successive calls. > > Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a driver for > Microsoft Azure Network Adapter) > Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> > --- > V2 -> V3: > * Removed the initialization of vf_unload_timeout > * Splitted the patch in two. > * Fixed extra space from the commit message. > --- > drivers/net/ethernet/microsoft/mana/gdma_main.c | 4 +++- > drivers/net/ethernet/microsoft/mana/hw_channel.c | 12 +++++++++++- > include/net/mana/mana.h | 2 ++ > 3 files changed, 16 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c > index 8f3f78b68592..6411f01be0d9 100644 > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c > @@ -946,10 +946,12 @@ int mana_gd_deregister_device(struct gdma_dev *gd) > struct gdma_context *gc = gd->gdma_context; > struct gdma_general_resp resp = {}; > struct gdma_general_req req = {}; > + struct mana_context *ac; > int err; > > if (gd->pdid == INVALID_PDID) > return -EINVAL; > + ac = gd->driver_data; > > mana_gd_init_req_hdr(&req.hdr, GDMA_DEREGISTER_DEVICE, sizeof(req), > sizeof(resp)); > @@ -957,7 +959,7 @@ int mana_gd_deregister_device(struct gdma_dev *gd) > req.hdr.dev_id = gd->dev_id; > > err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp); > - if (err || resp.hdr.status) { > + if ((err || resp.hdr.status) && !ac->vf_unload_timeout) { > dev_err(gc->dev, "Failed to deregister device: %d, 0x%x\n", > err, resp.hdr.status); With !ac->vf_unload_timeout option, this message may not be correctly showing err, status. Probably you want to add explicit information during timeouts so that it give right information ? Or have the err, status field properly updated. > if (!err) > diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c > index 9d1507eba5b9..492cb2c6e2cb 100644 > --- a/drivers/net/ethernet/microsoft/mana/hw_channel.c > +++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c > @@ -1,8 +1,10 @@ > // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause > /* Copyright (c) 2021, Microsoft Corporation. */ > > +#include "asm-generic/errno.h" > #include <net/mana/gdma.h> > #include <net/mana/hw_channel.h> > +#include <net/mana/mana.h> > > static int mana_hwc_get_msg_index(struct hw_channel_context *hwc, u16 *msg_id) > { > @@ -786,12 +788,19 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len, > struct hwc_wq *txq = hwc->txq; > struct gdma_req_hdr *req_msg; > struct hwc_caller_ctx *ctx; > + struct mana_context *ac; > u32 dest_vrcq = 0; > u32 dest_vrq = 0; > u16 msg_id; > int err; > > mana_hwc_get_msg_index(hwc, &msg_id); > + ac = hwc->gdma_dev->driver_data; Is there a case where gdma_dev be invalid here ? If so, lets check the state and then proceed further ? > + if (ac->vf_unload_timeout) { > + dev_err(hwc->dev, "HWC: vport is already unloaded.\n"); > + err = -ETIMEDOUT; > + goto out; > + } > > tx_wr = &txq->msg_buf->reqs[msg_id]; > > @@ -825,9 +834,10 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len, > goto out; > } > > - if (!wait_for_completion_timeout(&ctx->comp_event, 30 * HZ)) { > + if (!wait_for_completion_timeout(&ctx->comp_event, 5 * HZ)) { IMHO we should have macros instead of magic numbers (5 , 30 or so). But would like others to comment here. > dev_err(hwc->dev, "HWC: Request timed out!\n"); > err = -ETIMEDOUT; > + ac->vf_unload_timeout = true; > goto out; > } > > diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h > index 9eef19972845..5f5affdca1eb 100644 > --- a/include/net/mana/mana.h > +++ b/include/net/mana/mana.h > @@ -358,6 +358,8 @@ struct mana_context { > > u16 num_ports; > > + bool vf_unload_timeout; > + > struct mana_eq *eqs; > > struct net_device *ports[MAX_PORTS_IN_MANA_DEV];
> -----Original Message----- > From: souradeep chakrabarti <schakrabarti@linux.microsoft.com> > Sent: Monday, June 26, 2023 5:20 AM > To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang > <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui > <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; > kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay > Sharma <sharmaajay@microsoft.com>; leon@kernel.org; > cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; > tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; > linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org > Cc: stable@vger.kernel.org; Souradeep Chakrabarti > <schakrabarti@microsoft.com>; Souradeep Chakrabarti > <schakrabarti@linux.microsoft.com> > Subject: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is > unresponsive In general, two patches shouldn't have the same Subject. For this patch set, the two patches are two steps to fix the unloading issue, and they are not very long. IMHO, they should be in one patch. Thanks, - Haiyang
> -----Original Message----- > From: Haiyang Zhang <haiyangz@microsoft.com> > Sent: Monday, June 26, 2023 11:54 AM > To: souradeep chakrabarti <schakrabarti@linux.microsoft.com>; KY Srinivasan > <kys@microsoft.com>; wei.liu@kernel.org; Dexuan Cui > <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; > kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay > Sharma <sharmaajay@microsoft.com>; leon@kernel.org; > cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; > tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; > linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org > Cc: stable@vger.kernel.org; Souradeep Chakrabarti > <schakrabarti@microsoft.com> > Subject: RE: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is > unresponsive > > > > > -----Original Message----- > > From: souradeep chakrabarti <schakrabarti@linux.microsoft.com> > > Sent: Monday, June 26, 2023 5:20 AM > > To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang > > <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui > > <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; > > kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay > > Sharma <sharmaajay@microsoft.com>; leon@kernel.org; > > cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; > > tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; > > linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org > > Cc: stable@vger.kernel.org; Souradeep Chakrabarti > > <schakrabarti@microsoft.com>; Souradeep Chakrabarti > > <schakrabarti@linux.microsoft.com> > > Subject: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is > > unresponsive > > In general, two patches shouldn't have the same Subject. > If two patches are preferred, please use more descriptive subjects like these: 1/2: Fix the infinite waiting on pending_sends during host unresponsiveness 2/2: Avoid extra waits if host not responding in earlier steps Thanks, - Haiyang
> -----Original Message----- > From: Praveen Kumar <kumarpraveen@linux.microsoft.com> > Sent: Monday, June 26, 2023 10:13 AM > To: souradeep chakrabarti <schakrabarti@linux.microsoft.com>; KY Srinivasan > <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; > wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>; > davem@davemloft.net; edumazet@google.com; kuba@kernel.org; > pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay Sharma > <sharmaajay@microsoft.com>; leon@kernel.org; cai.huoqing@linux.dev; > ssengar@linux.microsoft.com; vkuznets@redhat.com; tglx@linutronix.de; linux- > hyperv@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; > linux-rdma@vger.kernel.org > Cc: stable@vger.kernel.org; Souradeep Chakrabarti > <schakrabarti@microsoft.com> > Subject: Re: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is > unresponsive > > On 6/26/2023 2:50 PM, souradeep chakrabarti wrote: > > From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> > > > > This is the second part of the fix. > > > > Also this patch adds a new attribute in mana_context, which gets set when > > mana_hwc_send_request() hits a timeout because of host unresponsiveness. > > This flag then helps to avoid the timeouts in successive calls. > > > > Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a > driver for > > Microsoft Azure Network Adapter) > > Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> > > --- > > V2 -> V3: > > * Removed the initialization of vf_unload_timeout > > * Splitted the patch in two. > > * Fixed extra space from the commit message. > > --- > > drivers/net/ethernet/microsoft/mana/gdma_main.c | 4 +++- > > drivers/net/ethernet/microsoft/mana/hw_channel.c | 12 +++++++++++- > > include/net/mana/mana.h | 2 ++ > > 3 files changed, 16 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c > b/drivers/net/ethernet/microsoft/mana/gdma_main.c > > index 8f3f78b68592..6411f01be0d9 100644 > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c > > @@ -946,10 +946,12 @@ int mana_gd_deregister_device(struct gdma_dev > *gd) > > struct gdma_context *gc = gd->gdma_context; > > struct gdma_general_resp resp = {}; > > struct gdma_general_req req = {}; > > + struct mana_context *ac; > > int err; > > > > if (gd->pdid == INVALID_PDID) > > return -EINVAL; > > + ac = gd->driver_data; > > > > mana_gd_init_req_hdr(&req.hdr, GDMA_DEREGISTER_DEVICE, > sizeof(req), > > sizeof(resp)); > > @@ -957,7 +959,7 @@ int mana_gd_deregister_device(struct gdma_dev *gd) > > req.hdr.dev_id = gd->dev_id; > > > > err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp); > > - if (err || resp.hdr.status) { > > + if ((err || resp.hdr.status) && !ac->vf_unload_timeout) { > > dev_err(gc->dev, "Failed to deregister device: %d, 0x%x\n", > > err, resp.hdr.status); > > With !ac->vf_unload_timeout option, this message may not be correctly > showing err, status. Probably you want to add explicit information during > timeouts so that it give right information ? Or have the err, status field properly > updated. > > > if (!err) > > diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c > b/drivers/net/ethernet/microsoft/mana/hw_channel.c > > index 9d1507eba5b9..492cb2c6e2cb 100644 > > --- a/drivers/net/ethernet/microsoft/mana/hw_channel.c > > +++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c > > @@ -1,8 +1,10 @@ > > // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause > > /* Copyright (c) 2021, Microsoft Corporation. */ > > > > +#include "asm-generic/errno.h" > > #include <net/mana/gdma.h> > > #include <net/mana/hw_channel.h> > > +#include <net/mana/mana.h> > > > > static int mana_hwc_get_msg_index(struct hw_channel_context *hwc, u16 > *msg_id) > > { > > @@ -786,12 +788,19 @@ int mana_hwc_send_request(struct > hw_channel_context *hwc, u32 req_len, > > struct hwc_wq *txq = hwc->txq; > > struct gdma_req_hdr *req_msg; > > struct hwc_caller_ctx *ctx; > > + struct mana_context *ac; > > u32 dest_vrcq = 0; > > u32 dest_vrq = 0; > > u16 msg_id; > > int err; > > > > mana_hwc_get_msg_index(hwc, &msg_id); > > + ac = hwc->gdma_dev->driver_data; > > Is there a case where gdma_dev be invalid here ? If so, lets check the state and > then proceed further ? Yes, hwc->gdma_dev is assigned shortly after it's allocated - see the code below. So it's valid. But hwc->gdma_dev->driver_data is hwc, not "mana_context *ac". There are two gdma_dev in gdma_context: hwc & mana. You can get ac from: hwc->gdma_dev->gdma_context->mana.driver_data Or, to avoid too many pointer deference, I suggest to put the vf_unload_timeout into gdma_context. int mana_hwc_create_channel(struct gdma_context *gc) { hwc = kzalloc(sizeof(*hwc), GFP_KERNEL); ... gd->gdma_context = gc; gd->driver_data = hwc; hwc->gdma_dev = gd; hwc->dev = gc->dev; Also, mana_gd_send_request/mana_hwc_send_request() is used in many places, not just unloading. Should you use timeout value 5 sec, and the vf_unload_timeout flag in unloading path only, and avoid touching other code paths? Please check with hostnet team for suggestions. If we decide to let the vf_unload_timeout flag affect all code paths, not just unloading, then it should be renamed to hwc_timeout, and submit the second patch separately. If just use it for unloading, since mana_gd_deregister_device() is used by PF too, name it like: unload_hwc_timeout. Thanks, -Haiyang
On Mon, Jun 26, 2023 at 07:43:07PM +0530, Praveen Kumar wrote: > On 6/26/2023 2:50 PM, souradeep chakrabarti wrote: > > From: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> > > > > This is the second part of the fix. > > > > Also this patch adds a new attribute in mana_context, which gets set when > > mana_hwc_send_request() hits a timeout because of host unresponsiveness. > > This flag then helps to avoid the timeouts in successive calls. > > > > Fixes: ca9c54d2d6a5ab2430c4eda364c77125d62e5e0f (net: mana: Add a driver for > > Microsoft Azure Network Adapter) > > Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> > > --- > > V2 -> V3: > > * Removed the initialization of vf_unload_timeout > > * Splitted the patch in two. > > * Fixed extra space from the commit message. > > --- > > drivers/net/ethernet/microsoft/mana/gdma_main.c | 4 +++- > > drivers/net/ethernet/microsoft/mana/hw_channel.c | 12 +++++++++++- > > include/net/mana/mana.h | 2 ++ > > 3 files changed, 16 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c > > index 8f3f78b68592..6411f01be0d9 100644 > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c > > @@ -946,10 +946,12 @@ int mana_gd_deregister_device(struct gdma_dev *gd) > > struct gdma_context *gc = gd->gdma_context; > > struct gdma_general_resp resp = {}; > > struct gdma_general_req req = {}; > > + struct mana_context *ac; > > int err; > > > > if (gd->pdid == INVALID_PDID) > > return -EINVAL; > > + ac = gd->driver_data; > > > > mana_gd_init_req_hdr(&req.hdr, GDMA_DEREGISTER_DEVICE, sizeof(req), > > sizeof(resp)); > > @@ -957,7 +959,7 @@ int mana_gd_deregister_device(struct gdma_dev *gd) > > req.hdr.dev_id = gd->dev_id; > > > > err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp); > > - if (err || resp.hdr.status) { > > + if ((err || resp.hdr.status) && !ac->vf_unload_timeout) { > > dev_err(gc->dev, "Failed to deregister device: %d, 0x%x\n", > > err, resp.hdr.status); > > With !ac->vf_unload_timeout option, this message may not be correctly showing err, status. Probably you want to add explicit information during timeouts so that it give right information ? Or have the err, status field properly updated. This check !ac->vf_unload_timeout here means if ac->vf_unload_timeout is not yet set, then only consider the err path, else continue the remaining operation. > > > if (!err) > > diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c > > index 9d1507eba5b9..492cb2c6e2cb 100644 > > --- a/drivers/net/ethernet/microsoft/mana/hw_channel.c > > +++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c > > @@ -1,8 +1,10 @@ > > // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause > > /* Copyright (c) 2021, Microsoft Corporation. */ > > > > +#include "asm-generic/errno.h" > > #include <net/mana/gdma.h> > > #include <net/mana/hw_channel.h> > > +#include <net/mana/mana.h> > > > > static int mana_hwc_get_msg_index(struct hw_channel_context *hwc, u16 *msg_id) > > { > > @@ -786,12 +788,19 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len, > > struct hwc_wq *txq = hwc->txq; > > struct gdma_req_hdr *req_msg; > > struct hwc_caller_ctx *ctx; > > + struct mana_context *ac; > > u32 dest_vrcq = 0; > > u32 dest_vrq = 0; > > u16 msg_id; > > int err; > > > > mana_hwc_get_msg_index(hwc, &msg_id); > > + ac = hwc->gdma_dev->driver_data; > > Is there a case where gdma_dev be invalid here ? If so, lets check the state and then proceed further ? I can see Haiyang has already in his comment, responded on the same. hwc->gdma_dev will be valid here, but as Haiyang pointed we need to use hwc->gdma_dev->gdma_context->mana.driver_data, or better to relocate the attribute in gdma_context. > > > + if (ac->vf_unload_timeout) { > > + dev_err(hwc->dev, "HWC: vport is already unloaded.\n"); > > + err = -ETIMEDOUT; > > + goto out; > > + } > > > > tx_wr = &txq->msg_buf->reqs[msg_id]; > > > > @@ -825,9 +834,10 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len, > > goto out; > > } > > > > - if (!wait_for_completion_timeout(&ctx->comp_event, 30 * HZ)) { > > + if (!wait_for_completion_timeout(&ctx->comp_event, 5 * HZ)) { > > IMHO we should have macros instead of magic numbers (5 , 30 or so). But would like others to comment here. > > > dev_err(hwc->dev, "HWC: Request timed out!\n"); > > err = -ETIMEDOUT; > > + ac->vf_unload_timeout = true; > > goto out; > > } > > > > diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h > > index 9eef19972845..5f5affdca1eb 100644 > > --- a/include/net/mana/mana.h > > +++ b/include/net/mana/mana.h > > @@ -358,6 +358,8 @@ struct mana_context { > > > > u16 num_ports; > > > > + bool vf_unload_timeout; > > + > > struct mana_eq *eqs; > > > > struct net_device *ports[MAX_PORTS_IN_MANA_DEV];
On Mon, Jun 26, 2023 at 03:53:50PM +0000, Haiyang Zhang wrote: > > > > -----Original Message----- > > From: souradeep chakrabarti <schakrabarti@linux.microsoft.com> > > Sent: Monday, June 26, 2023 5:20 AM > > To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang > > <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui > > <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com; > > kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay > > Sharma <sharmaajay@microsoft.com>; leon@kernel.org; > > cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com; > > tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; > > linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org > > Cc: stable@vger.kernel.org; Souradeep Chakrabarti > > <schakrabarti@microsoft.com>; Souradeep Chakrabarti > > <schakrabarti@linux.microsoft.com> > > Subject: [PATCH 2/2 V3 net] net: mana: Fix MANA VF unload when host is > > unresponsive > > In general, two patches shouldn't have the same Subject. > > For this patch set, the two patches are two steps to fix the unloading issue, and > they are not very long. IMHO, they should be in one patch. > > Thanks, > - Haiyang I will create a single patch in next version. As this fixes the unloading issue.
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c index 8f3f78b68592..6411f01be0d9 100644 --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c @@ -946,10 +946,12 @@ int mana_gd_deregister_device(struct gdma_dev *gd) struct gdma_context *gc = gd->gdma_context; struct gdma_general_resp resp = {}; struct gdma_general_req req = {}; + struct mana_context *ac; int err; if (gd->pdid == INVALID_PDID) return -EINVAL; + ac = gd->driver_data; mana_gd_init_req_hdr(&req.hdr, GDMA_DEREGISTER_DEVICE, sizeof(req), sizeof(resp)); @@ -957,7 +959,7 @@ int mana_gd_deregister_device(struct gdma_dev *gd) req.hdr.dev_id = gd->dev_id; err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp); - if (err || resp.hdr.status) { + if ((err || resp.hdr.status) && !ac->vf_unload_timeout) { dev_err(gc->dev, "Failed to deregister device: %d, 0x%x\n", err, resp.hdr.status); if (!err) diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c index 9d1507eba5b9..492cb2c6e2cb 100644 --- a/drivers/net/ethernet/microsoft/mana/hw_channel.c +++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c @@ -1,8 +1,10 @@ // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause /* Copyright (c) 2021, Microsoft Corporation. */ +#include "asm-generic/errno.h" #include <net/mana/gdma.h> #include <net/mana/hw_channel.h> +#include <net/mana/mana.h> static int mana_hwc_get_msg_index(struct hw_channel_context *hwc, u16 *msg_id) { @@ -786,12 +788,19 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len, struct hwc_wq *txq = hwc->txq; struct gdma_req_hdr *req_msg; struct hwc_caller_ctx *ctx; + struct mana_context *ac; u32 dest_vrcq = 0; u32 dest_vrq = 0; u16 msg_id; int err; mana_hwc_get_msg_index(hwc, &msg_id); + ac = hwc->gdma_dev->driver_data; + if (ac->vf_unload_timeout) { + dev_err(hwc->dev, "HWC: vport is already unloaded.\n"); + err = -ETIMEDOUT; + goto out; + } tx_wr = &txq->msg_buf->reqs[msg_id]; @@ -825,9 +834,10 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len, goto out; } - if (!wait_for_completion_timeout(&ctx->comp_event, 30 * HZ)) { + if (!wait_for_completion_timeout(&ctx->comp_event, 5 * HZ)) { dev_err(hwc->dev, "HWC: Request timed out!\n"); err = -ETIMEDOUT; + ac->vf_unload_timeout = true; goto out; } diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h index 9eef19972845..5f5affdca1eb 100644 --- a/include/net/mana/mana.h +++ b/include/net/mana/mana.h @@ -358,6 +358,8 @@ struct mana_context { u16 num_ports; + bool vf_unload_timeout; + struct mana_eq *eqs; struct net_device *ports[MAX_PORTS_IN_MANA_DEV];