Message ID | 20230420070914.1383918-1-korantwork@gmail.com |
---|---|
State | New |
Headers |
From: korantwork@gmail.com
To: helgaas@kernel.org, nirmal.patel@linux.intel.com, kbusch@kernel.org, jonathan.derrick@linux.dev, lpieralisi@kernel.org
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Xinghui Li <korantli@tencent.com>
Subject: [PATCH v5] PCI: vmd: Add the module param to adjust MSI mode
Date: Thu, 20 Apr 2023 15:09:14 +0800
Message-Id: <20230420070914.1383918-1-korantwork@gmail.com>
List-ID: <linux-kernel.vger.kernel.org> |
Series | [v5] PCI: vmd: Add the module param to adjust MSI mode |
Commit Message
Xinghui Li
April 20, 2023, 7:09 a.m. UTC
From: Xinghui Li <korantli@tencent.com>

Until now, the VMD MSI mode could only be adjusted by editing the vmd_ids table. This patch adds a second way to adjust MSI mode through a module parameter, which lets users tune VMD for their I/O scenario without rebuilding the driver.

- "disable_msi_bypass=0" (or any value other than 1): Under normal circumstances, we recommend enabling the VMD MSI-X bypass feature, which improves interrupt handling performance by avoiding the VMD MSI-X domain interrupt handler.

- "disable_msi_bypass=1": Use this when multiple NVMe devices are attached to the same PCIe node and carry a high volume of 4K random I/O. It mitigates excessive pressure on the PCIe node caused by the many interrupts from the NVMe drives, resulting in improved I/O performance. For example, in an FIO 4K random test with 4 NVMe (Gen4) drives on the same PCIe port:

  - Bypass enabled:  read: IOPS=562k, BW=2197MiB/s, io=644GiB
  - Bypass disabled: read: IOPS=1144k, BW=4470MiB/s, io=1310GiB

As not all devices support VMD MSI-X bypass, this parameter only applies to devices that support the bypass function and have it enabled, such as VMD_28C0. It does not affect the MSI-X working mode in a guest.

Signed-off-by: Xinghui Li <korantli@tencent.com>
---
 drivers/pci/controller/vmd.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)
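For readers who have not looked at the driver: the "vmd_ids table" referred to above carries the MSI-X bypass capability as a per-device-ID feature flag in .driver_data, roughly as in the abridged sketch below (illustrative only; exact macros and entries may differ from the current tree).

/*
 * Abridged, illustrative sketch of the existing vmd_ids[] table; the MSI-X
 * bypass capability is a per-device-ID feature flag in .driver_data.
 */
static const struct pci_device_id vmd_ids[] = {
	{PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_VMD_28C0),
		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW |
			       VMD_FEAT_HAS_BUS_RESTRICTIONS |
			       VMD_FEAT_CAN_BYPASS_MSI_REMAP,},
	/* ... further device IDs, without VMD_FEAT_CAN_BYPASS_MSI_REMAP ... */
	{0,}
};
MODULE_DEVICE_TABLE(pci, vmd_ids);

Changing MSI mode previously meant editing these flags and rebuilding the module, which is the inconvenience the new parameter is meant to remove.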
Comments
On Thu, Apr 20, 2023 at 03:09:14PM +0800, korantwork@gmail.com wrote: > From: Xinghui Li <korantli@tencent.com> > > In the past, the vmd MSI mode can only be adjusted by configuring > vmd_ids table. This patch adds another way to adjust MSI mode by > adjusting module parameter, which allows users easier to adjust the vmd > according to the I/O scenario without rebuilding driver. We're making good progress here, but I still have a hard time understanding what's going on, partly because some of the naming is confusing to me. The "VMCONFIG_MSI_REMAP" name suggests that setting this bit enables MSI remapping, but I think it actually *disables* MSI remapping. IMO this should be named something like "VMCONFIG_MSI_REMAP_DISABLE". (This would be a separate patch if we did anything like this.) > - "disable_msi_bypass=0 or other values": > Under normal circumstances, we recommend enable the VMD MSI-X bypass > feature, which improves interrupt handling performance by avoiding > the VMD MSI-X domain interrupt handler. The "disable_msi_bypass" parameter name also leads to some complicated logic. IIUC, "disable_msi_bypass=0" means "do not disable MSI remap bypassing" or, in other words, "do not remap MSIs." This is only supported by some VMDs. Using "disable_msi_bypass=0" to *enable* the bypass feature is confusing. And I guess "disable_msi_bypass=1" means "remap MSIs" (which is supported by all VMD versions)? What if you made boolean parameters like these: no_msi_remap If the VMD supports it, disable VMD MSI-X remapping. This improves interrupt performance because child device interrupts avoid the VMD MSI-X domain interrupt handler. msi_remap Remap child MSI-X interrupts into VMD MSI-X interrupts. This limits the number of MSI-X vectors available to the whole child device domain to the number of VMD MSI-X interrupts. Is it also the case that "msi_remap" may be required for some virtualization scenarios when the vmd driver can't work that out itself via vmd_get_phys_offsets()? > - "disable_msi_bypass=1": > Use this when multiple NVMe devices are mounted on the same PCIe > node with a high volume of 4K random I/O. It mitigates excessive > pressure on the PCIe node caused by numerous interrupts from NVMe > drives, resulting in improved I/O performance. Such as: > > In FIO 4K random test when 4 NVME(Gen4) mounted on the same PCIE port: > - Enable bypass: read: IOPS=562k, BW=2197MiB/s, io=644GiB > - Disable bypass: read: IOPS=1144k, BW=4470MiB/s, io=1310GiB I still don't understand what causes the performance problem here. I guess you see higher performance when the VMD remaps child MSIs? So adding the VMD MSI-X domain interrupt handler and squashing all the child MSI vectors into the VMD MSI vector space makes things better? That seems backwards. Is this because of cache effects or something? What does "excessive pressure on the PCIe node" mean? I assume the PCIe node means the VMD? It receives the same number of child interrupts in either case. > As not all devices support VMD MSI-X bypass, this parameter is > only applicable to devices that support the bypass function and > have already enabled it, such as VMD_28C0. If you made two boolean parameters, "msi_remap" would work for all devices, and "no_msi_remap" would work only on certain VMDs, right? > Besides, this parameter does not affect the MSI-X working mode in > guest. I don't understand what you're saying here. 
From the patch, I think that "disable_msi_bypass=1", i.e., "always remap child MSIs", means we pretend this VMD doesn't support the VMCONFIG_MSI_REMAP bit. In that case MSI remapping always happens. If the user may need to use "disable_msi_bypass=1" (or "msi_remap") in some virtualization scenarios, we should mention that and maybe give a hint about what happens *without* that parameter. > Signed-off-by: Xinghui Li <korantli@tencent.com> > --- > drivers/pci/controller/vmd.c | 29 +++++++++++++++++++++++++++++ > 1 file changed, 29 insertions(+) > > diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c > index 990630ec57c6..8ee673810cbf 100644 > --- a/drivers/pci/controller/vmd.c > +++ b/drivers/pci/controller/vmd.c > @@ -34,6 +34,20 @@ > #define MB2_SHADOW_OFFSET 0x2000 > #define MB2_SHADOW_SIZE 16 > > +/* > + * The VMD disable_msi_bypass module parameter provides the alternative > + * way to adjust MSI mode when loading vmd.ko. This parameter is only applicable > + * to devices that both support and have enabled bypass, such as VMD_28C0. > + * Besides, it does not affect MSI-X mode in the guest. > + * > + * 1: disable MSI-X bypass > + * other values: not disable MSI-X bypass > + */ > +static int disable_msi_bypass; > +module_param(disable_msi_bypass, int, 0444); > +MODULE_PARM_DESC(disable_msi_bypass, "Whether to disable MSI-X bypass function.\n" > + "\t\t Only effective on the device supporting bypass, such as 28C0."); > + > enum vmd_features { > /* > * Device may contain registers which hint the physical location of the > @@ -875,6 +889,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features) > return ret; > > vmd_set_msi_remapping(vmd, true); > + dev_info(&vmd->dev->dev, "init vmd with remapping MSI-X\n"); > > ret = vmd_create_irq_domain(vmd); > if (ret) > @@ -887,6 +902,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features) > irq_domain_update_bus_token(vmd->irq_domain, DOMAIN_BUS_VMD_MSI); > } else { > vmd_set_msi_remapping(vmd, false); > + dev_info(&vmd->dev->dev, "init vmd with bypass MSI-X\n"); > } > > pci_add_resource(&resources, &vmd->resources[0]); > @@ -955,6 +971,17 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features) > return 0; > } > > +static void vmd_config_msi_bypass_param(unsigned long *features) > +{ > + /* > + * Not every VMD device supports and enables bypass MSI-X. > + * Make sure current device has the bypass flag set. > + */ > + if (disable_msi_bypass == 1 && > + *features & VMD_FEAT_CAN_BYPASS_MSI_REMAP) > + *features &= ~(VMD_FEAT_CAN_BYPASS_MSI_REMAP); > +} > + > static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id) > { > unsigned long features = (unsigned long) id->driver_data; > @@ -984,6 +1011,8 @@ static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id) > if (err < 0) > goto out_release_instance; > > + vmd_config_msi_bypass_param(&features); > + > vmd->cfgbar = pcim_iomap(dev, VMD_CFGBAR, 0); > if (!vmd->cfgbar) { > err = -ENOMEM; > -- > 2.31.1 >
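Editorial note on the VMCONFIG_MSI_REMAP point raised at the top of this review: the driver's helper makes the inversion visible, since *setting* the bit is what disables remapping. A sketch paraphrased from the mainline vmd.c (treat details as approximate, not a verbatim quote):

static void vmd_set_msi_remapping(struct vmd_dev *vmd, bool enable)
{
	u16 reg;

	pci_read_config_word(vmd->dev, PCI_REG_VMCONFIG, &reg);
	reg = enable ? (reg & ~VMCONFIG_MSI_REMAP) :	/* clear bit: remap child MSI-X */
		       (reg | VMCONFIG_MSI_REMAP);	/* set bit: bypass remapping */
	pci_write_config_word(vmd->dev, PCI_REG_VMCONFIG, reg);
}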
On Fri, Apr 28, 2023 at 01:40:36PM -0500, Bjorn Helgaas wrote:
> On Thu, Apr 20, 2023 at 03:09:14PM +0800, korantwork@gmail.com wrote:
> > From: Xinghui Li <korantli@tencent.com>
>
> What if you made boolean parameters like these:
>
>   no_msi_remap
>
>     If the VMD supports it, disable VMD MSI-X remapping. This
>     improves interrupt performance because child device interrupts
>     avoid the VMD MSI-X domain interrupt handler.
>
>   msi_remap
>
>     Remap child MSI-X interrupts into VMD MSI-X interrupts. This
>     limits the number of MSI-X vectors available to the whole child
>     device domain to the number of VMD MSI-X interrupts.

I guess having two parameters that affect the same feature is also
confusing. Maybe just "msi_remap=0" or "msi_remap=1" or something?

I think what makes "disable_msi_bypass=0" hard is that "MSI bypass" by
itself is a negative feature (the positive activity is MSI remapping),
and disabling bypass gets us back to the positive "MSI remapping"
situation, and "disable_msi_bypass=0" negates that again, so we're
back to ... uh ... let's see ... we are not disabling the bypass of
MSI remapping, so I guess MSI remapping would be *enabled*? Is that
right?

Bjorn
On Sat, Apr 29, 2023 at 3:58 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Apr 28, 2023 at 01:40:36PM -0500, Bjorn Helgaas wrote: > > On Thu, Apr 20, 2023 at 03:09:14PM +0800, korantwork@gmail.com wrote: > > > From: Xinghui Li <korantli@tencent.com> > > > What if you made boolean parameters like these: > > > > no_msi_remap > > > > If the VMD supports it, disable VMD MSI-X remapping. This > > improves interrupt performance because child device interrupts > > avoid the VMD MSI-X domain interrupt handler. > > > > msi_remap > > > > Remap child MSI-X interrupts into VMD MSI-X interrupts. This > > limits the number of MSI-X vectors available to the whole child > > device domain to the number of VMD MSI-X interrupts. > > I guess having two parameters that affect the same feature is also > confusing. Maybe just "msi_remap=0" or "msi_remap=1" or something? > > I think what makes "disable_msi_bypass=0" hard is that "MSI bypass" by > itself is a negative feature (the positive activity is MSI remapping), > and disabling bypass gets us back to the positive "MSI remapping" > situation, and "disable_msi_bypass=0" negates that again, so we're > back to ... uh ... let's see ... we are not disabling the bypass of > MSI remapping, so I guess MSI remapping would be *enabled*? Is that > right? > > Bjorn I am fine with these two ways naming of the param. Adjusting from enable_msi_remaping to disable_msi_bypass was aimed to trying address your comment about dealing with the device not supporting bypass. And in vmd drivers, the vmd bypass feature is enabled by adding the flag "VMD_FEAT_CAN_BYPASS_MSI_REMAP". Therefore, I think disabling bypass seems more appropriate. This patch aims to provide a convenient way to remove that flag in a specific case. On Sat, Apr 29, 2023 at 2:40 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > The "disable_msi_bypass" parameter name also leads to some complicated > logic. IIUC, "disable_msi_bypass=0" means "do not disable MSI remap > bypassing" or, in other words, "do not remap MSIs." This is only > supported by some VMDs. Using "disable_msi_bypass=0" to *enable* the > bypass feature is confusing. However, as you said, it does lead to some complicated logic. So, I also believe that these two approaches have their own pros and cons. > I still don't understand what causes the performance problem here. I > guess you see higher performance when the VMD remaps child MSIs? So > adding the VMD MSI-X domain interrupt handler and squashing all the > child MSI vectors into the VMD MSI vector space makes things better? > That seems backwards. Is this because of cache effects or something? > What does "excessive pressure on the PCIe node" mean? I assume the > PCIe node means the VMD? It receives the same number of child > interrupts in either case. What I mean is that there will be performance issues when a PCIe domain is fully loaded with 4 Gen4 NVMe devices. like this: +-[10002:00]-+-01.0-[01]----00.0 device0 | +-03.0-[02]----00.0 device1 | +-05.0-[03]----00.0 device2 | \-07.0-[04]----00.0 device3 According to the perf/irqtop tool, we found the os does get the same counts of interrupts in different modes. However, when the above situation occurs, we observed a significant increase in CPU idle time. Besides, the data and performance when using the bypass VMD feature were identical to when VMD was disabled. And after the devices mounted on a domain are reduced, the IOPS of individual devices will rebound somewhat. 
Therefore, we speculate that VMD can play a role in balancing and
buffering interrupt loads. So, in this situation, we believe VMD ought
not to be bypassed for handling MSI.

> > Besides, this parameter does not affect the MSI-X working mode in
> > guest.
>
> I don't understand what you're saying here. From the patch, I think
> that "disable_msi_bypass=1", i.e., "always remap child MSIs", means we
> pretend this VMD doesn't support the VMCONFIG_MSI_REMAP bit. In that
> case MSI remapping always happens.
>
> If the user may need to use "disable_msi_bypass=1" (or "msi_remap") in
> some virtualization scenarios, we should mention that and maybe give a
> hint about what happens *without* that parameter.

I apologize for the confusion; I think I missed the keyword "passthrough
mode". In virtualization scenarios, VMD does not support bypassing MSI-X
when it is set to passthrough mode. From "PCI: vmd: Disable MSI-X
remapping when possible" (commit ee81ee8):

+	/*
+	 * Currently MSI remapping must be enabled in guest passthrough mode
+	 * due to some missing interrupt remapping plumbing. This is probably
+	 * acceptable because the guest is usually CPU-limited and MSI
+	 * remapping doesn't become a performance bottleneck.
+	 */

This patch does not change that point; I just wanted to mention it
again.

Thanks for your review. I hope this reply addresses your points.
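To make the passthrough point concrete: as I read the driver, the bypass branch in vmd_enable_domain() is only taken when the device advertises VMD_FEAT_CAN_BYPASS_MSI_REMAP and both shadow membar offsets are zero; with non-zero offsets (the guest passthrough case the quoted comment refers to) remapping is forced regardless of any parameter. A condensed, paraphrased sketch of that decision:

	/* Condensed sketch of the existing decision in vmd_enable_domain(). */
	if (!(features & VMD_FEAT_CAN_BYPASS_MSI_REMAP) ||
	    offset[0] || offset[1]) {
		/* Remap: child MSI-X funnels through the VMD MSI-X domain. */
		ret = vmd_alloc_irqs(vmd);
		if (ret)
			return ret;

		vmd_set_msi_remapping(vmd, true);

		ret = vmd_create_irq_domain(vmd);
		if (ret)
			return ret;

		irq_domain_update_bus_token(vmd->irq_domain, DOMAIN_BUS_VMD_MSI);
	} else {
		/* Bypass: child devices signal MSI-X directly. */
		vmd_set_msi_remapping(vmd, false);
	}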
On Fri, May 05, 2023 at 05:31:44PM +0800, Xinghui Li wrote: > On Sat, Apr 29, 2023 at 3:58 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Apr 28, 2023 at 01:40:36PM -0500, Bjorn Helgaas wrote: > > > On Thu, Apr 20, 2023 at 03:09:14PM +0800, korantwork@gmail.com wrote: > > > > From: Xinghui Li <korantli@tencent.com> > > > > > What if you made boolean parameters like these: > > > > > > no_msi_remap > > > > > > If the VMD supports it, disable VMD MSI-X remapping. This > > > improves interrupt performance because child device interrupts > > > avoid the VMD MSI-X domain interrupt handler. > > > > > > msi_remap > > > > > > Remap child MSI-X interrupts into VMD MSI-X interrupts. This > > > limits the number of MSI-X vectors available to the whole child > > > device domain to the number of VMD MSI-X interrupts. > > > > I guess having two parameters that affect the same feature is also > > confusing. Maybe just "msi_remap=0" or "msi_remap=1" or something? > > > > I think what makes "disable_msi_bypass=0" hard is that "MSI bypass" by > > itself is a negative feature (the positive activity is MSI remapping), > > and disabling bypass gets us back to the positive "MSI remapping" > > situation, and "disable_msi_bypass=0" negates that again, so we're > > back to ... uh ... let's see ... we are not disabling the bypass of > > MSI remapping, so I guess MSI remapping would be *enabled*? Is that > > right? > > I am fine with these two ways naming of the param. Adjusting from > enable_msi_remaping to disable_msi_bypass was aimed to trying address > your comment about dealing with the device not supporting bypass. > And in vmd drivers, the vmd bypass feature is enabled by adding the flag > "VMD_FEAT_CAN_BYPASS_MSI_REMAP". Therefore, I think disabling > bypass seems more appropriate. This patch aims to provide a convenient > way to remove that flag in a specific case. Users don't care about the name of VMD_FEAT_CAN_BYPASS_MSI_REMAP. I don't think that's a very good name either (in my opinion "VMD_FEAT_MSI_REMAP_DISABLE" would be more descriptive, and VMCONFIG_MSI_REMAP is even worse since setting it *disables* MSI remapping), but in any case these are internal to the driver. > On Sat, Apr 29, 2023 at 2:40 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > > The "disable_msi_bypass" parameter name also leads to some complicated > > logic. IIUC, "disable_msi_bypass=0" means "do not disable MSI remap > > bypassing" or, in other words, "do not remap MSIs." This is only > > supported by some VMDs. Using "disable_msi_bypass=0" to *enable* the > > bypass feature is confusing. > > However, as you said, it does lead to some complicated logic. So, I > also believe that these two approaches have their own pros and cons. > > > I still don't understand what causes the performance problem here. I > > guess you see higher performance when the VMD remaps child MSIs? So > > adding the VMD MSI-X domain interrupt handler and squashing all the > > child MSI vectors into the VMD MSI vector space makes things better? > > That seems backwards. Is this because of cache effects or something? > > > What does "excessive pressure on the PCIe node" mean? I assume the > > PCIe node means the VMD? It receives the same number of child > > interrupts in either case. > > What I mean is that there will be performance issues when a PCIe domain > is fully loaded with 4 Gen4 NVMe devices. 
like this: > +-[10002:00]-+-01.0-[01]----00.0 device0 > | +-03.0-[02]----00.0 device1 > | +-05.0-[03]----00.0 device2 > | \-07.0-[04]----00.0 device3 > > According to the perf/irqtop tool, we found the os does get the same > counts of interrupts in different modes. However, when the above > situation occurs, we observed a significant increase in CPU idle > time. Besides, the data and performance when using the bypass VMD > feature were identical to when VMD was disabled. And after the > devices mounted on a domain are reduced, the IOPS of individual > devices will rebound somewhat. Therefore, we speculate that VMD can > play a role in balancing and buffering interrupt loads. Therefore, > in this situation, we believe that VMD ought to not be bypassed to > handle MSI. The proposed text was: Use this when multiple NVMe devices are mounted on the same PCIe node with a high volume of 4K random I/O. It mitigates excessive pressure on the PCIe node caused by numerous interrupts from NVMe drives, resulting in improved I/O performance. Such as: The NVMe configuration and workload you mentioned works better with MSI-X remapping. But I don't know *why*, and I don't know that NVMe is the only device affected. So it's hard to write useful guidance for users, other than "sometimes it helps." Straw man proposal: msi_remap=0 Disable VMD MSI-X remapping, which improves interrupt performance because child device interrupts avoid the VMD MSI-X domain interrupt handler. Not all VMD devices support this, but for those that do, this is the default. msi_remap=1 Remap child MSI-X interrupts into VMD MSI-X interrupts. This limits the number of MSI-X vectors available to the whole child device domain to the number of VMD MSI-X interrupts. This may be required in some virtualization scenarios. This may improve performance in some I/O topologies, e.g., several NVMe devices doing lots of random I/O, but we don't know why. I hate the idea of "we don't know why." If you *do* know why, it would be much better to outline the mechanism because that would help users know when to use this. But if we don't know why, admitting that straight out is better than hand-waving about excessive pressure, etc. The same applies to the virtualization caveat. The text above is not actionable -- how do users know whether their particular virtualization scenario requires this option? Bjorn
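Purely as an illustration of the straw man above (hypothetical code, not from any posted patch; the helper name vmd_apply_msi_remap_param and the -1 "per-device default" convention are mine), a single integer parameter might be wired up like this:

static int msi_remap = -1;	/* -1: per-device default, 0: bypass, 1: remap */
module_param(msi_remap, int, 0444);
MODULE_PARM_DESC(msi_remap,
	"0: bypass VMD MSI-X remapping (only on VMDs that support it); "
	"1: remap child MSI-X into VMD MSI-X vectors; default: per-device");

static void vmd_apply_msi_remap_param(struct pci_dev *dev, unsigned long *features)
{
	switch (msi_remap) {
	case 0:
		if (!(*features & VMD_FEAT_CAN_BYPASS_MSI_REMAP))
			pci_warn(dev, "msi_remap=0 ignored: device cannot bypass MSI-X remapping\n");
		break;
	case 1:
		/* Force remapping by hiding the bypass capability. */
		*features &= ~VMD_FEAT_CAN_BYPASS_MSI_REMAP;
		break;
	default:
		/* Keep the per-device default from vmd_ids[].driver_data. */
		break;
	}
}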
On Sat, May 6, 2023 at 12:08 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > I am fine with these two ways naming of the param. Adjusting from > > enable_msi_remaping to disable_msi_bypass was aimed to trying address > > your comment about dealing with the device not supporting bypass. > > And in vmd drivers, the vmd bypass feature is enabled by adding the flag > > "VMD_FEAT_CAN_BYPASS_MSI_REMAP". Therefore, I think disabling > > bypass seems more appropriate. This patch aims to provide a convenient > > way to remove that flag in a specific case. > > Users don't care about the name of VMD_FEAT_CAN_BYPASS_MSI_REMAP. I > don't think that's a very good name either (in my opinion > "VMD_FEAT_MSI_REMAP_DISABLE" would be more descriptive, and > VMCONFIG_MSI_REMAP is even worse since setting it *disables* MSI > remapping), but in any case these are internal to the driver. > > > On Sat, Apr 29, 2023 at 2:40 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > The "disable_msi_bypass" parameter name also leads to some complicated > > > logic. IIUC, "disable_msi_bypass=0" means "do not disable MSI remap > > > bypassing" or, in other words, "do not remap MSIs." This is only > > > supported by some VMDs. Using "disable_msi_bypass=0" to *enable* the > > > bypass feature is confusing. > > > > However, as you said, it does lead to some complicated logic. So, I > > also believe that these two approaches have their own pros and cons. > > > > > I still don't understand what causes the performance problem here. I > > > guess you see higher performance when the VMD remaps child MSIs? So > > > adding the VMD MSI-X domain interrupt handler and squashing all the > > > child MSI vectors into the VMD MSI vector space makes things better? > > > That seems backwards. Is this because of cache effects or something? > > > > > What does "excessive pressure on the PCIe node" mean? I assume the > > > PCIe node means the VMD? It receives the same number of child > > > interrupts in either case. > > > > What I mean is that there will be performance issues when a PCIe domain > > is fully loaded with 4 Gen4 NVMe devices. like this: > > +-[10002:00]-+-01.0-[01]----00.0 device0 > > | +-03.0-[02]----00.0 device1 > > | +-05.0-[03]----00.0 device2 > > | \-07.0-[04]----00.0 device3 > > > > According to the perf/irqtop tool, we found the os does get the same > > counts of interrupts in different modes. However, when the above > > situation occurs, we observed a significant increase in CPU idle > > time. Besides, the data and performance when using the bypass VMD > > feature were identical to when VMD was disabled. And after the > > devices mounted on a domain are reduced, the IOPS of individual > > devices will rebound somewhat. Therefore, we speculate that VMD can > > play a role in balancing and buffering interrupt loads. Therefore, > > in this situation, we believe that VMD ought to not be bypassed to > > handle MSI. > > The proposed text was: > > Use this when multiple NVMe devices are mounted on the same PCIe > node with a high volume of 4K random I/O. It mitigates excessive > pressure on the PCIe node caused by numerous interrupts from NVMe > drives, resulting in improved I/O performance. Such as: > > The NVMe configuration and workload you mentioned works better with > MSI-X remapping. But I don't know *why*, and I don't know that NVMe > is the only device affected. So it's hard to write useful guidance > for users, other than "sometimes it helps." 
>
> Straw man proposal:
>
>   msi_remap=0
>
>     Disable VMD MSI-X remapping, which improves interrupt performance
>     because child device interrupts avoid the VMD MSI-X domain
>     interrupt handler. Not all VMD devices support this, but for
>     those that do, this is the default.
>
>   msi_remap=1
>
>     Remap child MSI-X interrupts into VMD MSI-X interrupts. This
>     limits the number of MSI-X vectors available to the whole child
>     device domain to the number of VMD MSI-X interrupts.
>
>     This may be required in some virtualization scenarios.
>
>     This may improve performance in some I/O topologies, e.g., several
>     NVMe devices doing lots of random I/O, but we don't know why.
>
> I hate the idea of "we don't know why." If you *do* know why, it
> would be much better to outline the mechanism because that would help
> users know when to use this. But if we don't know why, admitting that
> straight out is better than hand-waving about excessive pressure, etc.

I completely agree with you. I discovered this issue using the bisect
method. Based on the observed data and experiments, I drew the above
conclusions; we deduced the cause from the observed results. However, I
have not yet been able to determine the precise cause of this issue.

> The same applies to the virtualization caveat. The text above is not
> actionable -- how do users know whether their particular
> virtualization scenario requires this option?

I will add a note to make it clear that bypass mode will not work in
guest passthrough mode; only remapping mode can be used there.

Thanks
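One possible wording for the note Xinghui agrees to add, folded into the parameter description (hypothetical phrasing, not from any posted revision):

/* Hypothetical follow-up wording; the behaviour described comes from the discussion above. */
static int disable_msi_bypass;
module_param(disable_msi_bypass, int, 0444);
MODULE_PARM_DESC(disable_msi_bypass,
		 "1: always remap child MSI-X into VMD MSI-X interrupts.\n"
		 "\t\tOther values: allow MSI-X bypass on devices that support it, e.g. 28C0.\n"
		 "\t\tNote: in guest passthrough mode the driver always remaps; bypass cannot be used there.");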
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 990630ec57c6..8ee673810cbf 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -34,6 +34,20 @@
 #define MB2_SHADOW_OFFSET	0x2000
 #define MB2_SHADOW_SIZE		16
 
+/*
+ * The VMD disable_msi_bypass module parameter provides the alternative
+ * way to adjust MSI mode when loading vmd.ko. This parameter is only applicable
+ * to devices that both support and have enabled bypass, such as VMD_28C0.
+ * Besides, it does not affect MSI-X mode in the guest.
+ *
+ * 1: disable MSI-X bypass
+ * other values: not disable MSI-X bypass
+ */
+static int disable_msi_bypass;
+module_param(disable_msi_bypass, int, 0444);
+MODULE_PARM_DESC(disable_msi_bypass, "Whether to disable MSI-X bypass function.\n"
+		 "\t\t Only effective on the device supporting bypass, such as 28C0.");
+
 enum vmd_features {
 	/*
 	 * Device may contain registers which hint the physical location of the
@@ -875,6 +889,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
 			return ret;
 
 		vmd_set_msi_remapping(vmd, true);
+		dev_info(&vmd->dev->dev, "init vmd with remapping MSI-X\n");
 
 		ret = vmd_create_irq_domain(vmd);
 		if (ret)
@@ -887,6 +902,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
 		irq_domain_update_bus_token(vmd->irq_domain, DOMAIN_BUS_VMD_MSI);
 	} else {
 		vmd_set_msi_remapping(vmd, false);
+		dev_info(&vmd->dev->dev, "init vmd with bypass MSI-X\n");
 	}
 
 	pci_add_resource(&resources, &vmd->resources[0]);
@@ -955,6 +971,17 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
 	return 0;
 }
 
+static void vmd_config_msi_bypass_param(unsigned long *features)
+{
+	/*
+	 * Not every VMD device supports and enables bypass MSI-X.
+	 * Make sure current device has the bypass flag set.
+	 */
+	if (disable_msi_bypass == 1 &&
+	    *features & VMD_FEAT_CAN_BYPASS_MSI_REMAP)
+		*features &= ~(VMD_FEAT_CAN_BYPASS_MSI_REMAP);
+}
+
 static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 {
 	unsigned long features = (unsigned long) id->driver_data;
@@ -984,6 +1011,8 @@ static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	if (err < 0)
 		goto out_release_instance;
 
+	vmd_config_msi_bypass_param(&features);
+
 	vmd->cfgbar = pcim_iomap(dev, VMD_CFGBAR, 0);
 	if (!vmd->cfgbar) {
 		err = -ENOMEM;