[v5] PCI: vmd: Add the module param to adjust MSI mode

Message ID 20230420070914.1383918-1-korantwork@gmail.com
State New
Headers
Series [v5] PCI: vmd: Add the module param to adjust MSI mode |

Commit Message

Xinghui Li April 20, 2023, 7:09 a.m. UTC
  From: Xinghui Li <korantli@tencent.com>

In the past, the vmd MSI mode can only be adjusted by configuring
vmd_ids table. This patch adds another way to adjust MSI mode by
adjusting module parameter, which allows users easier to adjust the vmd
according to the I/O scenario without rebuilding driver.

- "disable_msi_bypass=0 or other values":
  Under normal circumstances, we recommend enable the VMD MSI-X bypass
  feature, which improves interrupt handling performance by avoiding
  the VMD MSI-X domain interrupt handler.

- "disable_msi_bypass=1":
  Use this when multiple NVMe devices are mounted on the same PCIe
  node with a high volume of 4K random I/O. It mitigates excessive
  pressure on the PCIe node caused by numerous interrupts from NVMe
  drives, resulting in improved I/O performance. Such as:

  In FIO 4K random test when 4 NVME(Gen4) mounted on the same PCIE port:
    - Enable bypass: read: IOPS=562k, BW=2197MiB/s, io=644GiB
    - Disable bypass: read: IOPS=1144k, BW=4470MiB/s, io=1310GiB

As not all devices support VMD MSI-X bypass, this parameter is
only applicable to devices that support the bypass function and
have already enabled it, such as VMD_28C0. Besides, this parameter
does not affect the MSI-X working mode in guest.

Signed-off-by: Xinghui Li <korantli@tencent.com>
---
 drivers/pci/controller/vmd.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)
  

Comments

Bjorn Helgaas April 28, 2023, 6:40 p.m. UTC | #1
On Thu, Apr 20, 2023 at 03:09:14PM +0800, korantwork@gmail.com wrote:
> From: Xinghui Li <korantli@tencent.com>
> 
> In the past, the vmd MSI mode can only be adjusted by configuring
> vmd_ids table. This patch adds another way to adjust MSI mode by
> adjusting module parameter, which allows users easier to adjust the vmd
> according to the I/O scenario without rebuilding driver.

We're making good progress here, but I still have a hard time
understanding what's going on, partly because some of the naming is
confusing to me.

The "VMCONFIG_MSI_REMAP" name suggests that setting this bit enables
MSI remapping, but I think it actually *disables* MSI remapping.  IMO
this should be named something like "VMCONFIG_MSI_REMAP_DISABLE".
(This would be a separate patch if we did anything like this.)

> - "disable_msi_bypass=0 or other values":
>   Under normal circumstances, we recommend enable the VMD MSI-X bypass
>   feature, which improves interrupt handling performance by avoiding
>   the VMD MSI-X domain interrupt handler.

The "disable_msi_bypass" parameter name also leads to some complicated
logic.  IIUC, "disable_msi_bypass=0" means "do not disable MSI remap
bypassing" or, in other words, "do not remap MSIs."  This is only
supported by some VMDs.  Using "disable_msi_bypass=0" to *enable* the
bypass feature is confusing.

And I guess "disable_msi_bypass=1" means "remap MSIs" (which is
supported by all VMD versions)?

What if you made boolean parameters like these:

  no_msi_remap

    If the VMD supports it, disable VMD MSI-X remapping.  This
    improves interrupt performance because child device interrupts
    avoid the VMD MSI-X domain interrupt handler.

  msi_remap

    Remap child MSI-X interrupts into VMD MSI-X interrupts.  This
    limits the number of MSI-X vectors available to the whole child
    device domain to the number of VMD MSI-X interrupts.

Is it also the case that "msi_remap" may be required for some
virtualization scenarios when the vmd driver can't work that out
itself via vmd_get_phys_offsets()?

> - "disable_msi_bypass=1":
>   Use this when multiple NVMe devices are mounted on the same PCIe
>   node with a high volume of 4K random I/O. It mitigates excessive
>   pressure on the PCIe node caused by numerous interrupts from NVMe
>   drives, resulting in improved I/O performance. Such as:
>
>   In FIO 4K random test when 4 NVME(Gen4) mounted on the same PCIE port:
>     - Enable bypass: read: IOPS=562k, BW=2197MiB/s, io=644GiB
>     - Disable bypass: read: IOPS=1144k, BW=4470MiB/s, io=1310GiB

I still don't understand what causes the performance problem here.  I
guess you see higher performance when the VMD remaps child MSIs?  So
adding the VMD MSI-X domain interrupt handler and squashing all the
child MSI vectors into the VMD MSI vector space makes things better?
That seems backwards.  Is this because of cache effects or something?

What does "excessive pressure on the PCIe node" mean?  I assume the
PCIe node means the VMD?  It receives the same number of child
interrupts in either case.

> As not all devices support VMD MSI-X bypass, this parameter is
> only applicable to devices that support the bypass function and
> have already enabled it, such as VMD_28C0.

If you made two boolean parameters, "msi_remap" would work for all
devices, and "no_msi_remap" would work only on certain VMDs, right?

> Besides, this parameter does not affect the MSI-X working mode in
> guest.

I don't understand what you're saying here.  From the patch, I think
that "disable_msi_bypass=1", i.e., "always remap child MSIs", means we
pretend this VMD doesn't support the VMCONFIG_MSI_REMAP bit.  In that
case MSI remapping always happens.

If the user may need to use "disable_msi_bypass=1" (or "msi_remap") in
some virtualization scenarios, we should mention that and maybe give a
hint about what happens *without* that parameter.

> Signed-off-by: Xinghui Li <korantli@tencent.com>
> ---
>  drivers/pci/controller/vmd.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index 990630ec57c6..8ee673810cbf 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
> @@ -34,6 +34,20 @@
>  #define MB2_SHADOW_OFFSET	0x2000
>  #define MB2_SHADOW_SIZE		16
>  
> +/*
> + * The VMD disable_msi_bypass module parameter provides the alternative
> + * way to adjust MSI mode when loading vmd.ko. This parameter is only applicable
> + * to devices that both support and have enabled bypass, such as VMD_28C0.
> + * Besides, it does not affect MSI-X mode in the guest.
> + *
> + * 1: disable MSI-X bypass
> + * other values: not disable MSI-X bypass
> + */
> +static int disable_msi_bypass;
> +module_param(disable_msi_bypass, int, 0444);
> +MODULE_PARM_DESC(disable_msi_bypass, "Whether to disable MSI-X bypass function.\n"
> +	"\t\t  Only effective on the device supporting bypass, such as 28C0.");
> +
>  enum vmd_features {
>  	/*
>  	 * Device may contain registers which hint the physical location of the
> @@ -875,6 +889,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
>  			return ret;
>  
>  		vmd_set_msi_remapping(vmd, true);
> +		dev_info(&vmd->dev->dev, "init vmd with remapping MSI-X\n");
>  
>  		ret = vmd_create_irq_domain(vmd);
>  		if (ret)
> @@ -887,6 +902,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
>  		irq_domain_update_bus_token(vmd->irq_domain, DOMAIN_BUS_VMD_MSI);
>  	} else {
>  		vmd_set_msi_remapping(vmd, false);
> +		dev_info(&vmd->dev->dev, "init vmd with bypass MSI-X\n");
>  	}
>  
>  	pci_add_resource(&resources, &vmd->resources[0]);
> @@ -955,6 +971,17 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
>  	return 0;
>  }
>  
> +static void vmd_config_msi_bypass_param(unsigned long *features)
> +{
> +	/*
> +	 * Not every VMD device supports and enables bypass MSI-X.
> +	 * Make sure current device has the bypass flag set.
> +	 */
> +	if (disable_msi_bypass == 1 &&
> +	  *features & VMD_FEAT_CAN_BYPASS_MSI_REMAP)
> +		*features &= ~(VMD_FEAT_CAN_BYPASS_MSI_REMAP);
> +}
> +
>  static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
>  {
>  	unsigned long features = (unsigned long) id->driver_data;
> @@ -984,6 +1011,8 @@ static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
>  	if (err < 0)
>  		goto out_release_instance;
>  
> +	vmd_config_msi_bypass_param(&features);
> +
>  	vmd->cfgbar = pcim_iomap(dev, VMD_CFGBAR, 0);
>  	if (!vmd->cfgbar) {
>  		err = -ENOMEM;
> -- 
> 2.31.1
>
  
Bjorn Helgaas April 28, 2023, 7:58 p.m. UTC | #2
On Fri, Apr 28, 2023 at 01:40:36PM -0500, Bjorn Helgaas wrote:
> On Thu, Apr 20, 2023 at 03:09:14PM +0800, korantwork@gmail.com wrote:
> > From: Xinghui Li <korantli@tencent.com>

> What if you made boolean parameters like these:
> 
>   no_msi_remap
> 
>     If the VMD supports it, disable VMD MSI-X remapping.  This
>     improves interrupt performance because child device interrupts
>     avoid the VMD MSI-X domain interrupt handler.
> 
>   msi_remap
> 
>     Remap child MSI-X interrupts into VMD MSI-X interrupts.  This
>     limits the number of MSI-X vectors available to the whole child
>     device domain to the number of VMD MSI-X interrupts.

I guess having two parameters that affect the same feature is also
confusing.  Maybe just "msi_remap=0" or "msi_remap=1" or something?

I think what makes "disable_msi_bypass=0" hard is that "MSI bypass" by
itself is a negative feature (the positive activity is MSI remapping),
and disabling bypass gets us back to the positive "MSI remapping"
situation, and "disable_msi_bypass=0" negates that again, so we're
back to ... uh ... let's see ... we are not disabling the bypass of
MSI remapping, so I guess MSI remapping would be *enabled*?  Is that
right?

Bjorn
  
Xinghui Li May 5, 2023, 9:31 a.m. UTC | #3
On Sat, Apr 29, 2023 at 3:58 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Fri, Apr 28, 2023 at 01:40:36PM -0500, Bjorn Helgaas wrote:
> > On Thu, Apr 20, 2023 at 03:09:14PM +0800, korantwork@gmail.com wrote:
> > > From: Xinghui Li <korantli@tencent.com>
>
> > What if you made boolean parameters like these:
> >
> >   no_msi_remap
> >
> >     If the VMD supports it, disable VMD MSI-X remapping.  This
> >     improves interrupt performance because child device interrupts
> >     avoid the VMD MSI-X domain interrupt handler.
> >
> >   msi_remap
> >
> >     Remap child MSI-X interrupts into VMD MSI-X interrupts.  This
> >     limits the number of MSI-X vectors available to the whole child
> >     device domain to the number of VMD MSI-X interrupts.
>
> I guess having two parameters that affect the same feature is also
> confusing.  Maybe just "msi_remap=0" or "msi_remap=1" or something?
>
> I think what makes "disable_msi_bypass=0" hard is that "MSI bypass" by
> itself is a negative feature (the positive activity is MSI remapping),
> and disabling bypass gets us back to the positive "MSI remapping"
> situation, and "disable_msi_bypass=0" negates that again, so we're
> back to ... uh ... let's see ... we are not disabling the bypass of
> MSI remapping, so I guess MSI remapping would be *enabled*?  Is that
> right?
>
> Bjorn
I am fine with these two ways naming of the param. Adjusting from
enable_msi_remaping to disable_msi_bypass was aimed to trying address
your comment about dealing with the device not supporting bypass.
And in vmd drivers, the vmd bypass feature is enabled by adding the flag
"VMD_FEAT_CAN_BYPASS_MSI_REMAP".  Therefore, I think disabling
bypass seems more appropriate. This patch aims to provide a convenient
way to remove that flag in a specific case.

On Sat, Apr 29, 2023 at 2:40 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> The "disable_msi_bypass" parameter name also leads to some complicated
> logic.  IIUC, "disable_msi_bypass=0" means "do not disable MSI remap
> bypassing" or, in other words, "do not remap MSIs."  This is only
> supported by some VMDs.  Using "disable_msi_bypass=0" to *enable* the
> bypass feature is confusing.
However, as you said, it does lead to some complicated logic.  So, I
also believe
that these two approaches have their own pros and cons.

> I still don't understand what causes the performance problem here.  I
> guess you see higher performance when the VMD remaps child MSIs?  So
> adding the VMD MSI-X domain interrupt handler and squashing all the
> child MSI vectors into the VMD MSI vector space makes things better?
> That seems backwards.  Is this because of cache effects or something?

> What does "excessive pressure on the PCIe node" mean?  I assume the
> PCIe node means the VMD?  It receives the same number of child
> interrupts in either case.
What I mean is that there will be performance issues when a PCIe domain
is fully loaded with 4 Gen4 NVMe devices.  like this:
 +-[10002:00]-+-01.0-[01]----00.0  device0
 |                     +-03.0-[02]----00.0  device1
 |                     +-05.0-[03]----00.0  device2
 |                      \-07.0-[04]----00.0  device3

According to the perf/irqtop tool, we found the os does get the same counts
of interrupts in different modes. However, when the above situation occurs,
we observed a significant increase in CPU idle time. Besides, the data and
performance when using the bypass VMD feature were identical to when VMD
was disabled. And after the devices mounted on a domain are reduced, the
IOPS of individual devices will rebound somewhat. Therefore, we speculate
that VMD can play a role in balancing and buffering interrupt loads. Therefore,
in this situation, we believe that VMD  ought to not be bypassed to handle MSI.

> > Besides, this parameter does not affect the MSI-X working mode in
> > guest.

> I don't understand what you're saying here.  From the patch, I think
> that "disable_msi_bypass=1", i.e., "always remap child MSIs", means we
> pretend this VMD doesn't support the VMCONFIG_MSI_REMAP bit.  In that
> case MSI remapping always happens.

> If the user may need to use "disable_msi_bypass=1" (or "msi_remap") in
> some virtualization scenarios, we should mention that and maybe give a
> hint about what happens *without* that parameter.
I apologize for the confusion, I think I missed the keyword 'passthrough mode'.
In the virtualization scenarios, VMD doesn't support bypassing MSI-X when it is
set to passthrough mode.
  PCI: vmd: Disable MSI-X remapping when possible(commit ee81ee8):
+  /*
+  * Currently MSI remapping must be enabled in guest passthrough mode
+  * due to some missing interrupt remapping plumbing. This is probably
+  * acceptable because the guest is usually CPU-limited and MSI
+  * remapping doesn't become a performance bottleneck.
+  */
This patch will not change this point, I just wanted to mention it again~

Thanks for your reviewing. I hope this reply can address your points.
  
Bjorn Helgaas May 5, 2023, 4:07 p.m. UTC | #4
On Fri, May 05, 2023 at 05:31:44PM +0800, Xinghui Li wrote:
> On Sat, Apr 29, 2023 at 3:58 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Fri, Apr 28, 2023 at 01:40:36PM -0500, Bjorn Helgaas wrote:
> > > On Thu, Apr 20, 2023 at 03:09:14PM +0800, korantwork@gmail.com wrote:
> > > > From: Xinghui Li <korantli@tencent.com>
> >
> > > What if you made boolean parameters like these:
> > >
> > >   no_msi_remap
> > >
> > >     If the VMD supports it, disable VMD MSI-X remapping.  This
> > >     improves interrupt performance because child device interrupts
> > >     avoid the VMD MSI-X domain interrupt handler.
> > >
> > >   msi_remap
> > >
> > >     Remap child MSI-X interrupts into VMD MSI-X interrupts.  This
> > >     limits the number of MSI-X vectors available to the whole child
> > >     device domain to the number of VMD MSI-X interrupts.
> >
> > I guess having two parameters that affect the same feature is also
> > confusing.  Maybe just "msi_remap=0" or "msi_remap=1" or something?
> >
> > I think what makes "disable_msi_bypass=0" hard is that "MSI bypass" by
> > itself is a negative feature (the positive activity is MSI remapping),
> > and disabling bypass gets us back to the positive "MSI remapping"
> > situation, and "disable_msi_bypass=0" negates that again, so we're
> > back to ... uh ... let's see ... we are not disabling the bypass of
> > MSI remapping, so I guess MSI remapping would be *enabled*?  Is that
> > right?
>
> I am fine with these two ways naming of the param. Adjusting from
> enable_msi_remaping to disable_msi_bypass was aimed to trying address
> your comment about dealing with the device not supporting bypass.
> And in vmd drivers, the vmd bypass feature is enabled by adding the flag
> "VMD_FEAT_CAN_BYPASS_MSI_REMAP".  Therefore, I think disabling
> bypass seems more appropriate. This patch aims to provide a convenient
> way to remove that flag in a specific case.

Users don't care about the name of VMD_FEAT_CAN_BYPASS_MSI_REMAP.  I
don't think that's a very good name either (in my opinion
"VMD_FEAT_MSI_REMAP_DISABLE" would be more descriptive, and
VMCONFIG_MSI_REMAP is even worse since setting it *disables* MSI
remapping), but in any case these are internal to the driver.

> On Sat, Apr 29, 2023 at 2:40 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > The "disable_msi_bypass" parameter name also leads to some complicated
> > logic.  IIUC, "disable_msi_bypass=0" means "do not disable MSI remap
> > bypassing" or, in other words, "do not remap MSIs."  This is only
> > supported by some VMDs.  Using "disable_msi_bypass=0" to *enable* the
> > bypass feature is confusing.
>
> However, as you said, it does lead to some complicated logic.  So, I
> also believe that these two approaches have their own pros and cons.
> 
> > I still don't understand what causes the performance problem here.  I
> > guess you see higher performance when the VMD remaps child MSIs?  So
> > adding the VMD MSI-X domain interrupt handler and squashing all the
> > child MSI vectors into the VMD MSI vector space makes things better?
> > That seems backwards.  Is this because of cache effects or something?
> 
> > What does "excessive pressure on the PCIe node" mean?  I assume the
> > PCIe node means the VMD?  It receives the same number of child
> > interrupts in either case.
>
> What I mean is that there will be performance issues when a PCIe domain
> is fully loaded with 4 Gen4 NVMe devices.  like this:
>  +-[10002:00]-+-01.0-[01]----00.0  device0
>  |                     +-03.0-[02]----00.0  device1
>  |                     +-05.0-[03]----00.0  device2
>  |                      \-07.0-[04]----00.0  device3
> 
> According to the perf/irqtop tool, we found the os does get the same
> counts of interrupts in different modes. However, when the above
> situation occurs, we observed a significant increase in CPU idle
> time. Besides, the data and performance when using the bypass VMD
> feature were identical to when VMD was disabled. And after the
> devices mounted on a domain are reduced, the IOPS of individual
> devices will rebound somewhat. Therefore, we speculate that VMD can
> play a role in balancing and buffering interrupt loads. Therefore,
> in this situation, we believe that VMD ought to not be bypassed to
> handle MSI.

The proposed text was:

  Use this when multiple NVMe devices are mounted on the same PCIe
  node with a high volume of 4K random I/O. It mitigates excessive
  pressure on the PCIe node caused by numerous interrupts from NVMe
  drives, resulting in improved I/O performance. Such as:

The NVMe configuration and workload you mentioned works better with
MSI-X remapping.  But I don't know *why*, and I don't know that NVMe
is the only device affected.  So it's hard to write useful guidance
for users, other than "sometimes it helps."

Straw man proposal:

  msi_remap=0

    Disable VMD MSI-X remapping, which improves interrupt performance
    because child device interrupts avoid the VMD MSI-X domain
    interrupt handler.  Not all VMD devices support this, but for
    those that do, this is the default.

  msi_remap=1

    Remap child MSI-X interrupts into VMD MSI-X interrupts.  This
    limits the number of MSI-X vectors available to the whole child
    device domain to the number of VMD MSI-X interrupts.

    This may be required in some virtualization scenarios.

    This may improve performance in some I/O topologies, e.g., several
    NVMe devices doing lots of random I/O, but we don't know why.

I hate the idea of "we don't know why."  If you *do* know why, it
would be much better to outline the mechanism because that would help
users know when to use this.  But if we don't know why, admitting that
straight out is better than hand-waving about excessive pressure, etc.

The same applies to the virtualization caveat.  The text above is not
actionable -- how do users know whether their particular
virtualization scenario requires this option?

Bjorn
  
Xinghui Li May 8, 2023, 1:01 p.m. UTC | #5
On Sat, May 6, 2023 at 12:08 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> > I am fine with these two ways naming of the param. Adjusting from
> > enable_msi_remaping to disable_msi_bypass was aimed to trying address
> > your comment about dealing with the device not supporting bypass.
> > And in vmd drivers, the vmd bypass feature is enabled by adding the flag
> > "VMD_FEAT_CAN_BYPASS_MSI_REMAP".  Therefore, I think disabling
> > bypass seems more appropriate. This patch aims to provide a convenient
> > way to remove that flag in a specific case.
>
> Users don't care about the name of VMD_FEAT_CAN_BYPASS_MSI_REMAP.  I
> don't think that's a very good name either (in my opinion
> "VMD_FEAT_MSI_REMAP_DISABLE" would be more descriptive, and
> VMCONFIG_MSI_REMAP is even worse since setting it *disables* MSI
> remapping), but in any case these are internal to the driver.
>
> > On Sat, Apr 29, 2023 at 2:40 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > The "disable_msi_bypass" parameter name also leads to some complicated
> > > logic.  IIUC, "disable_msi_bypass=0" means "do not disable MSI remap
> > > bypassing" or, in other words, "do not remap MSIs."  This is only
> > > supported by some VMDs.  Using "disable_msi_bypass=0" to *enable* the
> > > bypass feature is confusing.
> >
> > However, as you said, it does lead to some complicated logic.  So, I
> > also believe that these two approaches have their own pros and cons.
> >
> > > I still don't understand what causes the performance problem here.  I
> > > guess you see higher performance when the VMD remaps child MSIs?  So
> > > adding the VMD MSI-X domain interrupt handler and squashing all the
> > > child MSI vectors into the VMD MSI vector space makes things better?
> > > That seems backwards.  Is this because of cache effects or something?
> >
> > > What does "excessive pressure on the PCIe node" mean?  I assume the
> > > PCIe node means the VMD?  It receives the same number of child
> > > interrupts in either case.
> >
> > What I mean is that there will be performance issues when a PCIe domain
> > is fully loaded with 4 Gen4 NVMe devices.  like this:
> >  +-[10002:00]-+-01.0-[01]----00.0  device0
> >  |                     +-03.0-[02]----00.0  device1
> >  |                     +-05.0-[03]----00.0  device2
> >  |                      \-07.0-[04]----00.0  device3
> >
> > According to the perf/irqtop tool, we found the os does get the same
> > counts of interrupts in different modes. However, when the above
> > situation occurs, we observed a significant increase in CPU idle
> > time. Besides, the data and performance when using the bypass VMD
> > feature were identical to when VMD was disabled. And after the
> > devices mounted on a domain are reduced, the IOPS of individual
> > devices will rebound somewhat. Therefore, we speculate that VMD can
> > play a role in balancing and buffering interrupt loads. Therefore,
> > in this situation, we believe that VMD ought to not be bypassed to
> > handle MSI.
>
> The proposed text was:
>
>   Use this when multiple NVMe devices are mounted on the same PCIe
>   node with a high volume of 4K random I/O. It mitigates excessive
>   pressure on the PCIe node caused by numerous interrupts from NVMe
>   drives, resulting in improved I/O performance. Such as:
>
> The NVMe configuration and workload you mentioned works better with
> MSI-X remapping.  But I don't know *why*, and I don't know that NVMe
> is the only device affected.  So it's hard to write useful guidance
> for users, other than "sometimes it helps."
>
> Straw man proposal:
>
>   msi_remap=0
>
>     Disable VMD MSI-X remapping, which improves interrupt performance
>     because child device interrupts avoid the VMD MSI-X domain
>     interrupt handler.  Not all VMD devices support this, but for
>     those that do, this is the default.
>
>   msi_remap=1
>
>     Remap child MSI-X interrupts into VMD MSI-X interrupts.  This
>     limits the number of MSI-X vectors available to the whole child
>     device domain to the number of VMD MSI-X interrupts.
>
>     This may be required in some virtualization scenarios.
>
>     This may improve performance in some I/O topologies, e.g., several
>     NVMe devices doing lots of random I/O, but we don't know why.
>
> I hate the idea of "we don't know why."  If you *do* know why, it
> would be much better to outline the mechanism because that would help
> users know when to use this.  But if we don't know why, admitting that
> straight out is better than hand-waving about excessive pressure, etc.
>
I completely agree with you. I discovered this issue using the bisect method.
Based on the observed data and experiments, I drew the above conclusions.
We deduced the cause from the observed results. However, I have not yet
been able to determine the precise cause of this issue.

> The same applies to the virtualization caveat.  The text above is not
> actionable -- how do users know whether their particular
> virtualization scenario requires this option?
>
I will add a note to make it clear that bypass mode will not work in guest
passthrought mode, only remapping mode can be used.

Thanks
  

Patch

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 990630ec57c6..8ee673810cbf 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -34,6 +34,20 @@ 
 #define MB2_SHADOW_OFFSET	0x2000
 #define MB2_SHADOW_SIZE		16
 
+/*
+ * The VMD disable_msi_bypass module parameter provides the alternative
+ * way to adjust MSI mode when loading vmd.ko. This parameter is only applicable
+ * to devices that both support and have enabled bypass, such as VMD_28C0.
+ * Besides, it does not affect MSI-X mode in the guest.
+ *
+ * 1: disable MSI-X bypass
+ * other values: not disable MSI-X bypass
+ */
+static int disable_msi_bypass;
+module_param(disable_msi_bypass, int, 0444);
+MODULE_PARM_DESC(disable_msi_bypass, "Whether to disable MSI-X bypass function.\n"
+	"\t\t  Only effective on the device supporting bypass, such as 28C0.");
+
 enum vmd_features {
 	/*
 	 * Device may contain registers which hint the physical location of the
@@ -875,6 +889,7 @@  static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
 			return ret;
 
 		vmd_set_msi_remapping(vmd, true);
+		dev_info(&vmd->dev->dev, "init vmd with remapping MSI-X\n");
 
 		ret = vmd_create_irq_domain(vmd);
 		if (ret)
@@ -887,6 +902,7 @@  static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
 		irq_domain_update_bus_token(vmd->irq_domain, DOMAIN_BUS_VMD_MSI);
 	} else {
 		vmd_set_msi_remapping(vmd, false);
+		dev_info(&vmd->dev->dev, "init vmd with bypass MSI-X\n");
 	}
 
 	pci_add_resource(&resources, &vmd->resources[0]);
@@ -955,6 +971,17 @@  static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
 	return 0;
 }
 
+static void vmd_config_msi_bypass_param(unsigned long *features)
+{
+	/*
+	 * Not every VMD device supports and enables bypass MSI-X.
+	 * Make sure current device has the bypass flag set.
+	 */
+	if (disable_msi_bypass == 1 &&
+	  *features & VMD_FEAT_CAN_BYPASS_MSI_REMAP)
+		*features &= ~(VMD_FEAT_CAN_BYPASS_MSI_REMAP);
+}
+
 static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 {
 	unsigned long features = (unsigned long) id->driver_data;
@@ -984,6 +1011,8 @@  static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	if (err < 0)
 		goto out_release_instance;
 
+	vmd_config_msi_bypass_param(&features);
+
 	vmd->cfgbar = pcim_iomap(dev, VMD_CFGBAR, 0);
 	if (!vmd->cfgbar) {
 		err = -ENOMEM;