PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

Message ID 20221222072603.1175248-1-korantwork@gmail.com
State New
Series PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

Commit Message

Xinghui Li Dec. 22, 2022, 7:26 a.m. UTC
  From: Xinghui Li <korantli@tencent.com>

Commit ee81ee84f873 ("PCI: vmd: Disable MSI-X remapping when possible")
disabled VMD MSI-X remapping to optimize PCI performance. However,
this feature severely degrades performance in multi-disk
situations.

In an FIO 4K random test, we tested 1 disk on 1 CPU.

With MSI-X remapping disabled:
read: IOPS=1183k, BW=4622MiB/s (4847MB/s)(1354GiB/300001msec)
READ: bw=4622MiB/s (4847MB/s), 4622MiB/s-4622MiB/s (4847MB/s-4847MB/s),
io=1354GiB (1454GB), run=300001-300001msec

With MSI-X remapping enabled:
read: IOPS=1171k, BW=4572MiB/s (4795MB/s)(1340GiB/300001msec)
READ: bw=4572MiB/s (4795MB/s), 4572MiB/s-4572MiB/s (4795MB/s-4795MB/s),
io=1340GiB (1438GB), run=300001-300001msec

However, bypass mode can increase the interrupt cost on the CPU.
We tested 12 disks on 6 CPUs.

With MSI-X remapping disabled:
read: IOPS=562k, BW=2197MiB/s (2304MB/s)(644GiB/300001msec)
READ: bw=2197MiB/s (2304MB/s), 2197MiB/s-2197MiB/s (2304MB/s-2304MB/s),
io=644GiB (691GB), run=300001-300001msec

With MSI-X remapping enabled:
read: IOPS=1144k, BW=4470MiB/s (4687MB/s)(1310GiB/300005msec)
READ: bw=4470MiB/s (4687MB/s), 4470MiB/s-4470MiB/s (4687MB/s-4687MB/s),
io=1310GiB (1406GB), run=300005-300005msec

Signed-off-by: Xinghui Li <korantli@tencent.com>
---
 drivers/pci/controller/vmd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
  

Comments

Jonathan Derrick Dec. 22, 2022, 9:15 a.m. UTC | #1
On 12/22/22 12:26 AM, korantwork@gmail.com wrote:
> From: Xinghui Li <korantli@tencent.com>
> 
> Commit ee81ee84f873("PCI: vmd: Disable MSI-X remapping when possible")
> disable the vmd MSI-X remapping for optimizing pci performance.However,
> this feature severely negatively optimized performance in multi-disk
> situations.
> 
> In FIO 4K random test, we test 1 disk in the 1 CPU
> 
> when disable MSI-X remapping:
> read: IOPS=1183k, BW=4622MiB/s (4847MB/s)(1354GiB/300001msec)
> READ: bw=4622MiB/s (4847MB/s), 4622MiB/s-4622MiB/s (4847MB/s-4847MB/s),
> io=1354GiB (1454GB), run=300001-300001msec
> 
> When not disable MSI-X remapping:
> read: IOPS=1171k, BW=4572MiB/s (4795MB/s)(1340GiB/300001msec)
> READ: bw=4572MiB/s (4795MB/s), 4572MiB/s-4572MiB/s (4795MB/s-4795MB/s),
> io=1340GiB (1438GB), run=300001-300001msec
> 
> However, the bypass mode could increase the interrupts costs in CPU.
> We test 12 disks in the 6 CPU,
Well the bypass mode was made to improve performance where you have >4 
drives so this is pretty surprising. With bypass mode disabled, VMD will 
intercept and forward interrupts, increasing costs.

I think Nirmal would want to understand if there's some other factor 
going on here.

> 
> When disable MSI-X remapping:
> read: IOPS=562k, BW=2197MiB/s (2304MB/s)(644GiB/300001msec)
> READ: bw=2197MiB/s (2304MB/s), 2197MiB/s-2197MiB/s (2304MB/s-2304MB/s),
> io=644GiB (691GB), run=300001-300001msec
> 
> When not disable MSI-X remapping:
> read: IOPS=1144k, BW=4470MiB/s (4687MB/s)(1310GiB/300005msec)
> READ: bw=4470MiB/s (4687MB/s), 4470MiB/s-4470MiB/s (4687MB/s-4687MB/s),
> io=1310GiB (1406GB), run=300005-300005msec
> 
> Signed-off-by: Xinghui Li <korantli@tencent.com>
> ---
>   drivers/pci/controller/vmd.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index e06e9f4fc50f..9f6e9324d67d 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
> @@ -998,8 +998,7 @@ static const struct pci_device_id vmd_ids[] = {
>   		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP,},
>   	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_VMD_28C0),
>   		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW |
> -				VMD_FEAT_HAS_BUS_RESTRICTIONS |
> -				VMD_FEAT_CAN_BYPASS_MSI_REMAP,},
> +				VMD_FEAT_HAS_BUS_RESTRICTIONS,},
>   	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x467f),
>   		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP |
>   				VMD_FEAT_HAS_BUS_RESTRICTIONS |
  
Keith Busch Dec. 22, 2022, 9:56 p.m. UTC | #2
On Thu, Dec 22, 2022 at 02:15:20AM -0700, Jonathan Derrick wrote:
> On 12/22/22 12:26 AM, korantwork@gmail.com wrote:
> > 
> > However, the bypass mode could increase the interrupts costs in CPU.
> > We test 12 disks in the 6 CPU,
>
> Well the bypass mode was made to improve performance where you have >4
> drives so this is pretty surprising. With bypass mode disabled, VMD will
> intercept and forward interrupts, increasing costs.
> 
> I think Nirmal would want to to understand if there's some other factor
> going on here.

With 12 drives and only 6 CPUs, the bypass mode is going to get more irq
context switching. Sounds like the non-bypass mode is aggregating and
spreading interrupts across the cores better, but there's probably some
cpu:drive count tipping point where performance favors the other way.

The fio jobs could also probably set their cpus_allowed differently to
get better performance in the bypass mode.
  
Xinghui Li Dec. 23, 2022, 7:53 a.m. UTC | #3
Jonathan Derrick <jonathan.derrick@linux.dev> 于2022年12月22日周四 17:15写道:
>
>
>
> On 12/22/22 12:26 AM, korantwork@gmail.com wrote:
> > From: Xinghui Li <korantli@tencent.com>
> >
> > Commit ee81ee84f873("PCI: vmd: Disable MSI-X remapping when possible")
> > disable the vmd MSI-X remapping for optimizing pci performance.However,
> > this feature severely negatively optimized performance in multi-disk
> > situations.
> >
> > In FIO 4K random test, we test 1 disk in the 1 CPU
> >
> > when disable MSI-X remapping:
> > read: IOPS=1183k, BW=4622MiB/s (4847MB/s)(1354GiB/300001msec)
> > READ: bw=4622MiB/s (4847MB/s), 4622MiB/s-4622MiB/s (4847MB/s-4847MB/s),
> > io=1354GiB (1454GB), run=300001-300001msec
> >
> > When not disable MSI-X remapping:
> > read: IOPS=1171k, BW=4572MiB/s (4795MB/s)(1340GiB/300001msec)
> > READ: bw=4572MiB/s (4795MB/s), 4572MiB/s-4572MiB/s (4795MB/s-4795MB/s),
> > io=1340GiB (1438GB), run=300001-300001msec
> >
> > However, the bypass mode could increase the interrupts costs in CPU.
> > We test 12 disks in the 6 CPU,
> Well the bypass mode was made to improve performance where you have >4
> drives so this is pretty surprising. With bypass mode disabled, VMD will
> intercept and forward interrupts, increasing costs.

We also find that the more drives we tested, the more severe the
performance degradation. When we tested 8 drives on 6 CPUs, there was
about a 30% drop.

> I think Nirmal would want to to understand if there's some other factor
> going on here.

I also agree with this. The tested server uses the none I/O scheduler.
We tested on the same server; the tested drives are Samsung Gen-4 NVMe.
Is there anything else you are worried about affecting the test results?
  
Xinghui Li Dec. 23, 2022, 8:02 a.m. UTC | #4
Keith Busch <kbusch@kernel.org> 于2022年12月23日周五 05:56写道:
>
> With 12 drives and only 6 CPUs, the bypass mode is going to get more irq
> context switching. Sounds like the non-bypass mode is aggregating and
> spreading interrupts across the cores better, but there's probably some
> cpu:drive count tipping point where performance favors the other way.

We found that tuning the interrupt aggregation can also bring the
drive performance back to normal.

> The fio jobs could also probably set their cpus_allowed differently to
> get better performance in the bypass mode.

We used cpus_allowed in FIO to pin the 12 drives to 6 different CPUs.

By the way, sorry for emailing twice; the last one had a formatting problem.
  
Jonathan Derrick Dec. 27, 2022, 10:32 p.m. UTC | #5
On 12/23/2022 2:02 AM, Xinghui Li wrote:
> Keith Busch <kbusch@kernel.org> 于2022年12月23日周五 05:56写道:
>>
>> With 12 drives and only 6 CPUs, the bypass mode is going to get more irq
>> context switching. Sounds like the non-bypass mode is aggregating and
>> spreading interrupts across the cores better, but there's probably some
>> cpu:drive count tipping point where performance favors the other way.
> 
> We found that tunning the interrupt aggregation can also bring the
> drive performance back to normal.
> 
>> The fio jobs could also probably set their cpus_allowed differently to
>> get better performance in the bypass mode.
> 
> We used the cpus_allowed in FIO to fix 12 dirves in 6 different CPU.
> 
> By the way, sorry for emailing twice, the last one had the format problem.

The bypass mode should help in the cases where drive irqs (e.g. nproc) exceed
VMD I/O irqs. VMD I/O irqs for 28c0 should be min(63, nproc). You have
very few cpus for a Skylake system with that many drives, unless you mean you
are explicitly restricting the 12 drives to only 6 cpus. Either way, bypass mode
is effectively VMD-disabled, which points to other issues. Though I have also seen
much smaller interrupt aggregation benefits.
  
Xinghui Li Dec. 28, 2022, 2:19 a.m. UTC | #6
Jonathan Derrick <jonathan.derrick@linux.dev> 于2022年12月28日周三 06:32写道:
>
> The bypass mode should help in the cases where drives irqs (eg nproc) exceed
> VMD I/O irqs. VMD I/O irqs for 28c0 should be min(63, nproc). You have
> very few cpus for a Skylake system with that many drives, unless you mean you
> are explicitly restricting the 12 drives to only 6 cpus. Either way, bypass mode
> is effectively VMD-disabled, which points to other issues. Though I have also seen
> much smaller interrupt aggregation benefits.

Firstly, I am sorry for my wording misleading you. We tested 12 drives
in total, and each drive ran on 6 CPU cores with 8 jobs.

Secondly, I tried testing the drives with VMD disabled and found the
results to be largely consistent with bypass mode. I suppose bypass mode
just "bypasses" the VMD controller.

Lastly, we found that in bypass mode the CPU idle is 91%, but in
remapping mode it is 78%, and bypass mode has far fewer context switches
than remapping mode. It seems the system is waiting for something in
bypass mode.
  
Jonathan Derrick Jan. 9, 2023, 9 p.m. UTC | #7
As the bypass mode seems to affect performance greatly depending on the specific configuration,
it may make sense to use a module parameter to control it.

I'd vote for it being in VMD mode (non-bypass) by default.
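
For illustration, a minimal sketch of what such a module parameter could
look like in drivers/pci/controller/vmd.c, assuming the 28C0 entry keeps
VMD_FEAT_CAN_BYPASS_MSI_REMAP in its driver_data; the parameter name and
wiring below are assumptions for discussion, not part of the posted patch:

static bool msix_remap_bypass; /* default off: stay in VMD (non-bypass) mode */
module_param(msix_remap_bypass, bool, 0444);
MODULE_PARM_DESC(msix_remap_bypass,
                 "Allow MSI-X remapping bypass on capable VMD devices (default: off)");

/* In vmd_enable_domain() (or vmd_probe()), before the remapping decision: */
        if (!msix_remap_bypass)
                features &= ~VMD_FEAT_CAN_BYPASS_MSI_REMAP;

With something like that, users who want bypass could still opt in with
vmd.msix_remap_bypass=1 while the default stays non-bypass.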

On 12/27/2022 7:19 PM, Xinghui Li wrote:
> Jonathan Derrick <jonathan.derrick@linux.dev> 于2022年12月28日周三 06:32写道:
>>
>> The bypass mode should help in the cases where drives irqs (eg nproc) exceed
>> VMD I/O irqs. VMD I/O irqs for 28c0 should be min(63, nproc). You have
>> very few cpus for a Skylake system with that many drives, unless you mean you
>> are explicitly restricting the 12 drives to only 6 cpus. Either way, bypass mode
>> is effectively VMD-disabled, which points to other issues. Though I have also seen
>> much smaller interrupt aggregation benefits.
> 
> Firstly,I am sorry for my words misleading you. We totally tested 12 drives.
> And each drive run in 6 CPU cores with 8 jobs.
> 
> Secondly, I try to test the drives with VMD disabled,I found the results to
> be largely consistent with bypass mode. I suppose the bypass mode just
> "bypass" the VMD controller.
> 
> The last one,we found in bypass mode the CPU idle is 91%. But in remapping mode
> the CPU idle is 78%. And the bypass's context-switchs is much fewer
> than the remapping
> mode's. It seems the system is watiing for something in bypass mode.
  
Xinghui Li Jan. 10, 2023, 12:28 p.m. UTC | #8
Jonathan Derrick <jonathan.derrick@linux.dev> 于2023年1月10日周二 05:00写道:
>
> As the bypass mode seems to affect performance greatly depending on the specific configuration,
> it may make sense to use a moduleparam to control it
>
We found that each PCIe port can mount four drives. If we only test 1
or 2 drives on one PCIe port, the drive performance is normal. Also, we
observed the interrupts in the different modes.
bypass:
.....
2022-12-28-11-39-14: 1224       181665   IR-PCI-MSI 201850948-edge      nvme0q68
2022-12-28-11-39-14: 1179       180115   IR-PCI-MSI 201850945-edge      nvme0q65
2022-12-28-11-39-14:  RES        26743   Rescheduling interrupts
2022-12-28-11-39-17: irqtop - IRQ : 3029, TOTAL : 2100315228, CPU :
192, ACTIVE CPU : 192
disable:
......
2022-12-28-12-05-56: 1714       169797   IR-PCI-MSI 14155850-edge      nvme1q74
2022-12-28-12-05-56: 1701       168753   IR-PCI-MSI 14155849-edge      nvme1q73
2022-12-28-12-05-56:  LOC       163697   Local timer interrupts
2022-12-28-12-05-56:  TLB         5465   TLB shootdowns
2022-12-28-12-06-00: irqtop - IRQ : 3029, TOTAL : 2179022106, CPU :
192, ACTIVE CPU : 192
remapping:
2022-12-28-11-25-38:  283       325568   IR-PCI-MSI 24651790-edge      vmd3
2022-12-28-11-25-38:  140       267899   IR-PCI-MSI 13117447-edge      vmd1
2022-12-28-11-25-38:  183       265978   IR-PCI-MSI 13117490-edge      vmd1
......
2022-12-28-11-25-42: irqtop - IRQ : 2109, TOTAL : 2377172002, CPU :
192, ACTIVE CPU : 192

From the results it is not difficult to see that in remapping mode the
interrupts come from the VMD, while in the other modes the interrupts
come from the NVMe devices. Besides, we found that the total interrupt
count of a port mounting 4 drives is much lower than that of the port
mounting 1 or 2 drives. NVMe 8 and 9 are mounted on one port; the other
ports each mount 4 drives.

2022-12-28-11-39-14: 2582       494635   IR-PCI-MSI 470810698-edge      nvme9q74
2022-12-28-11-39-14: 2579       489972   IR-PCI-MSI 470810697-edge      nvme9q73
2022-12-28-11-39-14: 2573       480024   IR-PCI-MSI 470810695-edge      nvme9q71
2022-12-28-11-39-14: 2544       312967   IR-PCI-MSI 470286401-edge      nvme8q65
2022-12-28-11-39-14: 2556       312229   IR-PCI-MSI 470286405-edge      nvme8q69
2022-12-28-11-39-14: 2547       310013   IR-PCI-MSI 470286402-edge      nvme8q66
2022-12-28-11-39-14: 2550       308993   IR-PCI-MSI 470286403-edge      nvme8q67
2022-12-28-11-39-14: 2559       308794   IR-PCI-MSI 470286406-edge      nvme8q70
......
2022-12-28-11-39-14: 1296       185773   IR-PCI-MSI 202375243-edge      nvme1q75
2022-12-28-11-39-14: 1209       185646   IR-PCI-MSI 201850947-edge      nvme0q67
2022-12-28-11-39-14: 1831       184151   IR-PCI-MSI 203423828-edge      nvme3q84
2022-12-28-11-39-14: 1254       182313   IR-PCI-MSI 201850950-edge      nvme0q70
2022-12-28-11-39-14: 1224       181665   IR-PCI-MSI 201850948-edge      nvme0q68
2022-12-28-11-39-14: 1179       180115   IR-PCI-MSI 201850945-edge      nvme0q65
> I'd vote for it being in VMD mode (non-bypass) by default.
I speculate that the VMD controller equalizes the interrupt load and
acts like a buffer, which improves NVMe performance. I am not sure
about my analysis, so I'd like to discuss it with the community.
  
Xinghui Li Feb. 6, 2023, 12:45 p.m. UTC | #9
Friendly ping~

Xinghui Li <korantwork@gmail.com> 于2023年1月10日周二 20:28写道:
>
> Jonathan Derrick <jonathan.derrick@linux.dev> 于2023年1月10日周二 05:00写道:
> >
> > As the bypass mode seems to affect performance greatly depending on the specific configuration,
> > it may make sense to use a moduleparam to control it
> >
> We found that each pcie port can mount four drives. If we only test 2
> or 1 dirve of one pcie port,
> the performance of the drive performance will be normal. Also, we
> observed the interruptions in different modes.
> bypass:
> .....
> 2022-12-28-11-39-14: 1224       181665   IR-PCI-MSI 201850948-edge      nvme0q68
> 2022-12-28-11-39-14: 1179       180115   IR-PCI-MSI 201850945-edge      nvme0q65
> 2022-12-28-11-39-14:  RES        26743   Rescheduling interrupts
> 2022-12-28-11-39-17: irqtop - IRQ : 3029, TOTAL : 2100315228, CPU :
> 192, ACTIVE CPU : 192
> disable:
> ......
> 2022-12-28-12-05-56: 1714       169797   IR-PCI-MSI 14155850-edge      nvme1q74
> 2022-12-28-12-05-56: 1701       168753   IR-PCI-MSI 14155849-edge      nvme1q73
> 2022-12-28-12-05-56:  LOC       163697   Local timer interrupts
> 2022-12-28-12-05-56:  TLB         5465   TLB shootdowns
> 2022-12-28-12-06-00: irqtop - IRQ : 3029, TOTAL : 2179022106, CPU :
> 192, ACTIVE CPU : 192
> remapping:
> 022-12-28-11-25-38:  283       325568   IR-PCI-MSI 24651790-edge      vmd3
> 2022-12-28-11-25-38:  140       267899   IR-PCI-MSI 13117447-edge      vmd1
> 2022-12-28-11-25-38:  183       265978   IR-PCI-MSI 13117490-edge      vmd1
> ......
> 2022-12-28-11-25-42: irqtop - IRQ : 2109, TOTAL : 2377172002, CPU :
> 192, ACTIVE CPU : 192
>
> From the result it is not difficult to find, in remapping mode the
> interruptions come from vmd.
> While in other modes, interrupts come from nvme devices. Besides, we
> found the port mounting
> 4 dirves total interruptions is much fewer than the port mounting 2 or 1 drive.
> NVME 8 and 9 mount in one port, other port mount 4 dirves.
>
> 2022-12-28-11-39-14: 2582       494635   IR-PCI-MSI 470810698-edge      nvme9q74
> 2022-12-28-11-39-14: 2579       489972   IR-PCI-MSI 470810697-edge      nvme9q73
> 2022-12-28-11-39-14: 2573       480024   IR-PCI-MSI 470810695-edge      nvme9q71
> 2022-12-28-11-39-14: 2544       312967   IR-PCI-MSI 470286401-edge      nvme8q65
> 2022-12-28-11-39-14: 2556       312229   IR-PCI-MSI 470286405-edge      nvme8q69
> 2022-12-28-11-39-14: 2547       310013   IR-PCI-MSI 470286402-edge      nvme8q66
> 2022-12-28-11-39-14: 2550       308993   IR-PCI-MSI 470286403-edge      nvme8q67
> 2022-12-28-11-39-14: 2559       308794   IR-PCI-MSI 470286406-edge      nvme8q70
> ......
> 2022-12-28-11-39-14: 1296       185773   IR-PCI-MSI 202375243-edge      nvme1q75
> 2022-12-28-11-39-14: 1209       185646   IR-PCI-MSI 201850947-edge      nvme0q67
> 2022-12-28-11-39-14: 1831       184151   IR-PCI-MSI 203423828-edge      nvme3q84
> 2022-12-28-11-39-14: 1254       182313   IR-PCI-MSI 201850950-edge      nvme0q70
> 2022-12-28-11-39-14: 1224       181665   IR-PCI-MSI 201850948-edge      nvme0q68
> 2022-12-28-11-39-14: 1179       180115   IR-PCI-MSI 201850945-edge      nvme0q65
> > I'd vote for it being in VMD mode (non-bypass) by default.
> I speculate that the vmd controller equalizes the interrupt load and
> acts like a buffer,
> which improves the performance of nvme. I am not sure about my
> analysis. So, I'd like
> to discuss it with the community.
  
Patel, Nirmal Feb. 6, 2023, 6:11 p.m. UTC | #10
On 2/6/2023 5:45 AM, Xinghui Li wrote:
> Friendly ping~
>
> Xinghui Li <korantwork@gmail.com> 于2023年1月10日周二 20:28写道:
>> Jonathan Derrick <jonathan.derrick@linux.dev> 于2023年1月10日周二 05:00写道:
>>> As the bypass mode seems to affect performance greatly depending on the specific configuration,
>>> it may make sense to use a moduleparam to control it
>>>
>> We found that each pcie port can mount four drives. If we only test 2
>> or 1 dirve of one pcie port,
>> the performance of the drive performance will be normal. Also, we
>> observed the interruptions in different modes.
>> bypass:
>> .....
>> 2022-12-28-11-39-14: 1224       181665   IR-PCI-MSI 201850948-edge      nvme0q68
>> 2022-12-28-11-39-14: 1179       180115   IR-PCI-MSI 201850945-edge      nvme0q65
>> 2022-12-28-11-39-14:  RES        26743   Rescheduling interrupts
>> 2022-12-28-11-39-17: irqtop - IRQ : 3029, TOTAL : 2100315228, CPU :
>> 192, ACTIVE CPU : 192
>> disable:
>> ......
>> 2022-12-28-12-05-56: 1714       169797   IR-PCI-MSI 14155850-edge      nvme1q74
>> 2022-12-28-12-05-56: 1701       168753   IR-PCI-MSI 14155849-edge      nvme1q73
>> 2022-12-28-12-05-56:  LOC       163697   Local timer interrupts
>> 2022-12-28-12-05-56:  TLB         5465   TLB shootdowns
>> 2022-12-28-12-06-00: irqtop - IRQ : 3029, TOTAL : 2179022106, CPU :
>> 192, ACTIVE CPU : 192
>> remapping:
>> 022-12-28-11-25-38:  283       325568   IR-PCI-MSI 24651790-edge      vmd3
>> 2022-12-28-11-25-38:  140       267899   IR-PCI-MSI 13117447-edge      vmd1
>> 2022-12-28-11-25-38:  183       265978   IR-PCI-MSI 13117490-edge      vmd1
>> ......
>> 2022-12-28-11-25-42: irqtop - IRQ : 2109, TOTAL : 2377172002, CPU :
>> 192, ACTIVE CPU : 192
>>
>> From the result it is not difficult to find, in remapping mode the
>> interruptions come from vmd.
>> While in other modes, interrupts come from nvme devices. Besides, we
>> found the port mounting
>> 4 dirves total interruptions is much fewer than the port mounting 2 or 1 drive.
>> NVME 8 and 9 mount in one port, other port mount 4 dirves.
>>
>> 2022-12-28-11-39-14: 2582       494635   IR-PCI-MSI 470810698-edge      nvme9q74
>> 2022-12-28-11-39-14: 2579       489972   IR-PCI-MSI 470810697-edge      nvme9q73
>> 2022-12-28-11-39-14: 2573       480024   IR-PCI-MSI 470810695-edge      nvme9q71
>> 2022-12-28-11-39-14: 2544       312967   IR-PCI-MSI 470286401-edge      nvme8q65
>> 2022-12-28-11-39-14: 2556       312229   IR-PCI-MSI 470286405-edge      nvme8q69
>> 2022-12-28-11-39-14: 2547       310013   IR-PCI-MSI 470286402-edge      nvme8q66
>> 2022-12-28-11-39-14: 2550       308993   IR-PCI-MSI 470286403-edge      nvme8q67
>> 2022-12-28-11-39-14: 2559       308794   IR-PCI-MSI 470286406-edge      nvme8q70
>> ......
>> 2022-12-28-11-39-14: 1296       185773   IR-PCI-MSI 202375243-edge      nvme1q75
>> 2022-12-28-11-39-14: 1209       185646   IR-PCI-MSI 201850947-edge      nvme0q67
>> 2022-12-28-11-39-14: 1831       184151   IR-PCI-MSI 203423828-edge      nvme3q84
>> 2022-12-28-11-39-14: 1254       182313   IR-PCI-MSI 201850950-edge      nvme0q70
>> 2022-12-28-11-39-14: 1224       181665   IR-PCI-MSI 201850948-edge      nvme0q68
>> 2022-12-28-11-39-14: 1179       180115   IR-PCI-MSI 201850945-edge      nvme0q65
>>> I'd vote for it being in VMD mode (non-bypass) by default.
>> I speculate that the vmd controller equalizes the interrupt load and
>> acts like a buffer,
>> which improves the performance of nvme. I am not sure about my
>> analysis. So, I'd like
>> to discuss it with the community.

I like the idea of a module parameter to allow switching between the
modes, but keep MSI remapping enabled (non-bypass) by default.
  
Keith Busch Feb. 6, 2023, 6:28 p.m. UTC | #11
On Mon, Feb 06, 2023 at 11:11:36AM -0700, Patel, Nirmal wrote:
> I like the idea of module parameter to allow switching between the modes
> but keep MSI remapping enabled (non-bypass) by default.

Isn't there a more programmatic way to go about selecting the best option at
runtime? I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
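
As a rough sketch of that heuristic (not the posted patch), assuming it
would sit where vmd_enable_domain() currently tests
VMD_FEAT_CAN_BYPASS_MSI_REMAP; vmd_want_msix_bypass() is a hypothetical
helper, while num_active_cpus(), pci_msix_vec_count(),
vmd_set_msi_remapping() and vmd_alloc_irqs() already exist in the tree:

/* Hypothetical helper, for discussion only. */
static bool vmd_want_msix_bypass(struct vmd_dev *vmd, unsigned long features)
{
        if (!(features & VMD_FEAT_CAN_BYPASS_MSI_REMAP))
                return false;

        /*
         * With more active CPUs than VMD I/O vectors, muxing would
         * funnel interrupts into too few CPUs, so bypass should win.
         */
        return num_active_cpus() > pci_msix_vec_count(vmd->dev);
}

/* In vmd_enable_domain(), roughly: */
        if (vmd_want_msix_bypass(vmd, features)) {
                vmd_set_msi_remapping(vmd, false);
        } else {
                ret = vmd_alloc_irqs(vmd);
                if (ret)
                        return ret;
                vmd_set_msi_remapping(vmd, true);
        }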
  
Xinghui Li Feb. 7, 2023, 3:18 a.m. UTC | #12
Keith Busch <kbusch@kernel.org> 于2023年2月7日周二 02:28写道:
>
> On Mon, Feb 06, 2023 at 11:11:36AM -0700, Patel, Nirmal wrote:
> > I like the idea of module parameter to allow switching between the modes
> > but keep MSI remapping enabled (non-bypass) by default.
>
> Isn't there a more programatic way to go about selecting the best option at
> runtime?
Do you mean that the operating mode is automatically selected by
detecting the number of devices and CPUs instead of being set
manually?
>I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
For this situation, my speculation is that the PCIe ports are
over-mounted, and it is not just because of the CPU-to-drive ratio.
We considered designing an online node, because we were concerned that
I/O of different chunk sizes would suit different MSI-X modes.
I personally think that it may be logically complicated if programmatic
judgments are made.
  
Patel, Nirmal Feb. 7, 2023, 8:32 p.m. UTC | #13
On 2/6/2023 8:18 PM, Xinghui Li wrote:
> Keith Busch <kbusch@kernel.org> 于2023年2月7日周二 02:28写道:
>> On Mon, Feb 06, 2023 at 11:11:36AM -0700, Patel, Nirmal wrote:
>>> I like the idea of module parameter to allow switching between the modes
>>> but keep MSI remapping enabled (non-bypass) by default.
>> Isn't there a more programatic way to go about selecting the best option at
>> runtime?
> Do you mean that the operating mode is automatically selected by
> detecting the number of devices and CPUs instead of being set
> manually?
>> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
> For this situation, My speculation is that the PCIE nodes are
> over-mounted and not just because of the CPU to Drive ratio.
> We considered designing online nodes, because we were concerned that
> the IO of different chunk sizes would adapt to different MSI-X modes.
> I privately think that it may be logically complicated if programmatic
> judgments are made.

Also, newer CPUs have more MSI-X vectors (128), which means we can still
have better performance without bypass. It would be better if users
could choose the module parameter based on their requirements. Thanks.
  
Xinghui Li Feb. 9, 2023, 12:05 p.m. UTC | #14
Patel, Nirmal <nirmal.patel@linux.intel.com> 于2023年2月8日周三 04:32写道:
>
> Also newer CPUs have more MSIx (128) which means we can still have
> better performance without bypass. It would be better if user have
> can chose module parameter based on their requirements. Thanks.
>
All right, I will respin the patch as V2 with the online-node version later.

Thanks
  
Keith Busch Feb. 9, 2023, 11:05 p.m. UTC | #15
On Tue, Feb 07, 2023 at 01:32:20PM -0700, Patel, Nirmal wrote:
> On 2/6/2023 8:18 PM, Xinghui Li wrote:
> > Keith Busch <kbusch@kernel.org> 于2023年2月7日周二 02:28写道:
> >> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
> > For this situation, My speculation is that the PCIE nodes are
> > over-mounted and not just because of the CPU to Drive ratio.
> > We considered designing online nodes, because we were concerned that
> > the IO of different chunk sizes would adapt to different MSI-X modes.
> > I privately think that it may be logically complicated if programmatic
> > judgments are made.
> 
> Also newer CPUs have more MSIx (128) which means we can still have
> better performance without bypass. It would be better if user have
> can chose module parameter based on their requirements. Thanks.

So what? More vectors just pushes the threshold to when bypass becomes
relevant, which is exactly why I suggested it. There has to be an empirical
answer to when bypass beats muxing. Why do you want a user tunable if there's a
verifiable and automated better choice?
  
Patel, Nirmal Feb. 9, 2023, 11:57 p.m. UTC | #16
On 2/9/2023 4:05 PM, Keith Busch wrote:
> On Tue, Feb 07, 2023 at 01:32:20PM -0700, Patel, Nirmal wrote:
>> On 2/6/2023 8:18 PM, Xinghui Li wrote:
>>> Keith Busch <kbusch@kernel.org> 于2023年2月7日周二 02:28写道:
>>>> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
>>> For this situation, My speculation is that the PCIE nodes are
>>> over-mounted and not just because of the CPU to Drive ratio.
>>> We considered designing online nodes, because we were concerned that
>>> the IO of different chunk sizes would adapt to different MSI-X modes.
>>> I privately think that it may be logically complicated if programmatic
>>> judgments are made.
>> Also newer CPUs have more MSIx (128) which means we can still have
>> better performance without bypass. It would be better if user have
>> can chose module parameter based on their requirements. Thanks.
> So what? More vectors just pushes the threshold to when bypass becomes
> relevant, which is exactly why I suggested it. There has to be an empirical
> answer to when bypass beats muxing. Why do you want a user tunable if there's a
> verifiable and automated better choice?

That makes sense about the automated choice. I am not sure what the
exact tipping point is. The commit message includes only two cases:
one with 1 drive and 1 CPU, and a second with 12 drives and 6 CPUs.
Also, performance gets worse going from 8 drives to 12 drives.
One of the previous comments also mentioned something about FIO changing
cpus_allowed; will there be an issue when the VMD driver decides to bypass
the remapping during boot-up, but the FIO job changes cpus_allowed?
  
Keith Busch Feb. 10, 2023, 12:47 a.m. UTC | #17
On Thu, Feb 09, 2023 at 04:57:59PM -0700, Patel, Nirmal wrote:
> On 2/9/2023 4:05 PM, Keith Busch wrote:
> > On Tue, Feb 07, 2023 at 01:32:20PM -0700, Patel, Nirmal wrote:
> >> On 2/6/2023 8:18 PM, Xinghui Li wrote:
> >>> Keith Busch <kbusch@kernel.org> 于2023年2月7日周二 02:28写道:
> >>>> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
> >>> For this situation, My speculation is that the PCIE nodes are
> >>> over-mounted and not just because of the CPU to Drive ratio.
> >>> We considered designing online nodes, because we were concerned that
> >>> the IO of different chunk sizes would adapt to different MSI-X modes.
> >>> I privately think that it may be logically complicated if programmatic
> >>> judgments are made.
> >> Also newer CPUs have more MSIx (128) which means we can still have
> >> better performance without bypass. It would be better if user have
> >> can chose module parameter based on their requirements. Thanks.
> > So what? More vectors just pushes the threshold to when bypass becomes
> > relevant, which is exactly why I suggested it. There has to be an empirical
> > answer to when bypass beats muxing. Why do you want a user tunable if there's a
> > verifiable and automated better choice?
> 
> Make sense about the automated choice. I am not sure what is the exact
> tipping point. The commit message includes only two cases. one 1 drive
> 1 CPU and second 12 drives 6 CPU. Also performance gets worse from 8
> drives to 12 drives.

That configuration's storage performance overwhelms the CPU with interrupt
context switching. That problem probably inverts when your active CPU count
exceeds your VMD vectors because you'll be funnelling more interrupts into
fewer CPUs, leaving other CPUs idle.

> One the previous comments also mentioned something about FIO changing
> cpus_allowed; will there be an issue when VMD driver decides to bypass
> the remapping during the boot up, but FIO job changes the cpu_allowed?

No. Bypass mode uses managed interrupts for your nvme child devices, which sets
the best possible affinity.
  

Patch

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index e06e9f4fc50f..9f6e9324d67d 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -998,8 +998,7 @@  static const struct pci_device_id vmd_ids[] = {
 		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP,},
 	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_VMD_28C0),
 		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW |
-				VMD_FEAT_HAS_BUS_RESTRICTIONS |
-				VMD_FEAT_CAN_BYPASS_MSI_REMAP,},
+				VMD_FEAT_HAS_BUS_RESTRICTIONS,},
 	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x467f),
 		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP |
 				VMD_FEAT_HAS_BUS_RESTRICTIONS |