Message ID: 20221222072603.1175248-1-korantwork@gmail.com
State: New
Headers:
From: korantwork@gmail.com
To: nirmal.patel@linux.intel.com, jonathan.derrick@linux.dev, lpieralisi@kernel.org
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Xinghui Li <korantli@tencent.com>
Subject: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
Date: Thu, 22 Dec 2022 15:26:03 +0800
Message-Id: <20221222072603.1175248-1-korantwork@gmail.com>
Series: PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller
Commit Message
Xinghui Li
Dec. 22, 2022, 7:26 a.m. UTC
From: Xinghui Li <korantli@tencent.com>

Commit ee81ee84f873 ("PCI: vmd: Disable MSI-X remapping when possible")
disabled VMD MSI-X remapping to improve PCI performance. However, this
feature severely degrades performance in multi-disk situations.

In an FIO 4K random-read test with 1 disk on 1 CPU:

With MSI-X remapping disabled (bypass):
  read: IOPS=1183k, BW=4622MiB/s (4847MB/s)(1354GiB/300001msec)
  READ: bw=4622MiB/s (4847MB/s), 4622MiB/s-4622MiB/s (4847MB/s-4847MB/s),
  io=1354GiB (1454GB), run=300001-300001msec

With MSI-X remapping enabled:
  read: IOPS=1171k, BW=4572MiB/s (4795MB/s)(1340GiB/300001msec)
  READ: bw=4572MiB/s (4795MB/s), 4572MiB/s-4572MiB/s (4795MB/s-4795MB/s),
  io=1340GiB (1438GB), run=300001-300001msec

However, bypass mode can increase the interrupt cost on the CPUs.
With 12 disks on 6 CPUs:

With MSI-X remapping disabled (bypass):
  read: IOPS=562k, BW=2197MiB/s (2304MB/s)(644GiB/300001msec)
  READ: bw=2197MiB/s (2304MB/s), 2197MiB/s-2197MiB/s (2304MB/s-2304MB/s),
  io=644GiB (691GB), run=300001-300001msec

With MSI-X remapping enabled:
  read: IOPS=1144k, BW=4470MiB/s (4687MB/s)(1310GiB/300005msec)
  READ: bw=4470MiB/s (4687MB/s), 4470MiB/s-4470MiB/s (4687MB/s-4687MB/s),
  io=1310GiB (1406GB), run=300005-300005msec

Signed-off-by: Xinghui Li <korantli@tencent.com>
---
 drivers/pci/controller/vmd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
Comments
On 12/22/22 12:26 AM, korantwork@gmail.com wrote:
> From: Xinghui Li <korantli@tencent.com>
>
> Commit ee81ee84f873 ("PCI: vmd: Disable MSI-X remapping when possible")
> disabled VMD MSI-X remapping to improve PCI performance. However, this
> feature severely degrades performance in multi-disk situations.
>
> In an FIO 4K random-read test with 1 disk on 1 CPU:
>
> With MSI-X remapping disabled (bypass):
>   read: IOPS=1183k, BW=4622MiB/s (4847MB/s)(1354GiB/300001msec)
>   READ: bw=4622MiB/s (4847MB/s), 4622MiB/s-4622MiB/s (4847MB/s-4847MB/s),
>   io=1354GiB (1454GB), run=300001-300001msec
>
> With MSI-X remapping enabled:
>   read: IOPS=1171k, BW=4572MiB/s (4795MB/s)(1340GiB/300001msec)
>   READ: bw=4572MiB/s (4795MB/s), 4572MiB/s-4572MiB/s (4795MB/s-4795MB/s),
>   io=1340GiB (1438GB), run=300001-300001msec
>
> However, bypass mode can increase the interrupt cost on the CPUs.
> With 12 disks on 6 CPUs,

Well, the bypass mode was made to improve performance where you have >4
drives, so this is pretty surprising. With bypass mode disabled, VMD will
intercept and forward interrupts, increasing costs.

I think Nirmal would want to understand if there's some other factor
going on here.

> With MSI-X remapping disabled (bypass):
>   read: IOPS=562k, BW=2197MiB/s (2304MB/s)(644GiB/300001msec)
>   READ: bw=2197MiB/s (2304MB/s), 2197MiB/s-2197MiB/s (2304MB/s-2304MB/s),
>   io=644GiB (691GB), run=300001-300001msec
>
> With MSI-X remapping enabled:
>   read: IOPS=1144k, BW=4470MiB/s (4687MB/s)(1310GiB/300005msec)
>   READ: bw=4470MiB/s (4687MB/s), 4470MiB/s-4470MiB/s (4687MB/s-4687MB/s),
>   io=1310GiB (1406GB), run=300005-300005msec
>
> Signed-off-by: Xinghui Li <korantli@tencent.com>
> ---
>  drivers/pci/controller/vmd.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index e06e9f4fc50f..9f6e9324d67d 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
> @@ -998,8 +998,7 @@ static const struct pci_device_id vmd_ids[] = {
>  		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP,},
>  	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_VMD_28C0),
>  		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW |
> -				VMD_FEAT_HAS_BUS_RESTRICTIONS |
> -				VMD_FEAT_CAN_BYPASS_MSI_REMAP,},
> +				VMD_FEAT_HAS_BUS_RESTRICTIONS,},
>  	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x467f),
>  		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP |
>  				VMD_FEAT_HAS_BUS_RESTRICTIONS |
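For context on the intercept-versus-bypass distinction described above, the feature bit the patch removes gates roughly the following decision inside drivers/pci/controller/vmd.c. This is a simplified sketch, not the upstream code: struct vmd_dev and VMD_FEAT_CAN_BYPASS_MSI_REMAP are real driver definitions, but the two helper names are illustrative stand-ins.

```c
/* Simplified sketch of what VMD_FEAT_CAN_BYPASS_MSI_REMAP controls.
 * vmd_setup_bypass() and vmd_setup_remapping() are hypothetical names.
 */
static int vmd_pick_irq_mode(struct vmd_dev *vmd, unsigned long features)
{
	if (features & VMD_FEAT_CAN_BYPASS_MSI_REMAP)
		/* Child NVMe MSI-X interrupts go straight to the CPUs;
		 * VMD does not allocate or forward its own vectors.
		 */
		return vmd_setup_bypass(vmd);

	/* VMD allocates its own MSI-X vectors and forwards (remaps)
	 * child device interrupts through them.
	 */
	return vmd_setup_remapping(vmd);
}
```

Dropping the flag from the 28C0 id entry, as the patch does, forces the remapping branch unconditionally on that device.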
On Thu, Dec 22, 2022 at 02:15:20AM -0700, Jonathan Derrick wrote:
> On 12/22/22 12:26 AM, korantwork@gmail.com wrote:
> >
> > However, bypass mode can increase the interrupt cost on the CPUs.
> > With 12 disks on 6 CPUs,
>
> Well, the bypass mode was made to improve performance where you have >4
> drives, so this is pretty surprising. With bypass mode disabled, VMD will
> intercept and forward interrupts, increasing costs.
>
> I think Nirmal would want to understand if there's some other factor
> going on here.

With 12 drives and only 6 CPUs, the bypass mode is going to get more irq
context switching. Sounds like the non-bypass mode is aggregating and
spreading interrupts across the cores better, but there's probably some
cpu:drive count tipping point where performance favors the other way.

The fio jobs could also probably set their cpus_allowed differently to
get better performance in the bypass mode.
Jonathan Derrick <jonathan.derrick@linux.dev> wrote on Thu, Dec 22, 2022 at 17:15:
>
> On 12/22/22 12:26 AM, korantwork@gmail.com wrote:
> > From: Xinghui Li <korantli@tencent.com>
> >
> > Commit ee81ee84f873 ("PCI: vmd: Disable MSI-X remapping when possible")
> > disabled VMD MSI-X remapping to improve PCI performance. However, this
> > feature severely degrades performance in multi-disk situations.
> >
> > In an FIO 4K random-read test with 1 disk on 1 CPU:
> >
> > With MSI-X remapping disabled (bypass):
> >   read: IOPS=1183k, BW=4622MiB/s (4847MB/s)(1354GiB/300001msec)
> >   READ: bw=4622MiB/s (4847MB/s), 4622MiB/s-4622MiB/s (4847MB/s-4847MB/s),
> >   io=1354GiB (1454GB), run=300001-300001msec
> >
> > With MSI-X remapping enabled:
> >   read: IOPS=1171k, BW=4572MiB/s (4795MB/s)(1340GiB/300001msec)
> >   READ: bw=4572MiB/s (4795MB/s), 4572MiB/s-4572MiB/s (4795MB/s-4795MB/s),
> >   io=1340GiB (1438GB), run=300001-300001msec
> >
> > However, bypass mode can increase the interrupt cost on the CPUs.
> > With 12 disks on 6 CPUs,
>
> Well, the bypass mode was made to improve performance where you have >4
> drives, so this is pretty surprising. With bypass mode disabled, VMD will
> intercept and forward interrupts, increasing costs.

We also found that the more drives we tested, the more severe the
performance degradation. When we tested 8 drives on 6 CPUs, there was
about a 30% drop.

> I think Nirmal would want to understand if there's some other factor
> going on here.

I also agree with this. The tested server uses no I/O scheduler (none),
we tested on the same server, and the tested drives are Samsung Gen-4
NVMe. Is there anything else you think could be affecting the test
results?
Keith Busch <kbusch@kernel.org> wrote on Fri, Dec 23, 2022 at 05:56:
>
> With 12 drives and only 6 CPUs, the bypass mode is going to get more irq
> context switching. Sounds like the non-bypass mode is aggregating and
> spreading interrupts across the cores better, but there's probably some
> cpu:drive count tipping point where performance favors the other way.

We found that tuning the interrupt aggregation can also bring the drive
performance back to normal.

> The fio jobs could also probably set their cpus_allowed differently to
> get better performance in the bypass mode.

We used cpus_allowed in FIO to pin the 12 drives to 6 different CPUs.

By the way, sorry for emailing twice; the last one had a format problem.
On 12/23/2022 2:02 AM, Xinghui Li wrote:
> Keith Busch <kbusch@kernel.org> wrote on Fri, Dec 23, 2022 at 05:56:
>>
>> With 12 drives and only 6 CPUs, the bypass mode is going to get more irq
>> context switching. Sounds like the non-bypass mode is aggregating and
>> spreading interrupts across the cores better, but there's probably some
>> cpu:drive count tipping point where performance favors the other way.
>
> We found that tuning the interrupt aggregation can also bring the
> drive performance back to normal.
>
>> The fio jobs could also probably set their cpus_allowed differently to
>> get better performance in the bypass mode.
>
> We used cpus_allowed in FIO to pin the 12 drives to 6 different CPUs.
>
> By the way, sorry for emailing twice; the last one had a format problem.

The bypass mode should help in the cases where drive irqs (eg nproc) exceed
VMD I/O irqs. VMD I/O irqs for 28c0 should be min(63, nproc). You have
very few cpus for a Skylake system with that many drives, unless you mean
you are explicitly restricting the 12 drives to only 6 cpus. Either way,
bypass mode is effectively VMD-disabled, which points to other issues.
Though I have also seen much smaller interrupt aggregation benefits.
Jonathan Derrick <jonathan.derrick@linux.dev> wrote on Wed, Dec 28, 2022 at 06:32:
>
> The bypass mode should help in the cases where drive irqs (eg nproc) exceed
> VMD I/O irqs. VMD I/O irqs for 28c0 should be min(63, nproc). You have
> very few cpus for a Skylake system with that many drives, unless you mean
> you are explicitly restricting the 12 drives to only 6 cpus. Either way,
> bypass mode is effectively VMD-disabled, which points to other issues.
> Though I have also seen much smaller interrupt aggregation benefits.

Firstly, I am sorry for my words misleading you. We tested 12 drives in
total, and each drive ran on 6 CPU cores with 8 jobs.

Secondly, I tried testing the drives with VMD disabled, and I found the
results to be largely consistent with bypass mode. I suppose the bypass
mode just "bypasses" the VMD controller.

Lastly, we found that in bypass mode the CPU idle is 91%, while in
remapping mode the CPU idle is 78%. And the bypass mode's context
switches are much fewer than the remapping mode's. It seems the system
is waiting for something in bypass mode.
As the bypass mode seems to affect performance greatly depending on the
specific configuration, it may make sense to use a moduleparam to control it.
I'd vote for it being in VMD mode (non-bypass) by default.

On 12/27/2022 7:19 PM, Xinghui Li wrote:
> Jonathan Derrick <jonathan.derrick@linux.dev> wrote on Wed, Dec 28, 2022 at 06:32:
>>
>> The bypass mode should help in the cases where drive irqs (eg nproc) exceed
>> VMD I/O irqs. VMD I/O irqs for 28c0 should be min(63, nproc). You have
>> very few cpus for a Skylake system with that many drives, unless you mean
>> you are explicitly restricting the 12 drives to only 6 cpus. Either way,
>> bypass mode is effectively VMD-disabled, which points to other issues.
>> Though I have also seen much smaller interrupt aggregation benefits.
>
> Firstly, I am sorry for my words misleading you. We tested 12 drives in
> total, and each drive ran on 6 CPU cores with 8 jobs.
>
> Secondly, I tried testing the drives with VMD disabled, and I found the
> results to be largely consistent with bypass mode. I suppose the bypass
> mode just "bypasses" the VMD controller.
>
> Lastly, we found that in bypass mode the CPU idle is 91%, while in
> remapping mode the CPU idle is 78%. And the bypass mode's context
> switches are much fewer than the remapping mode's. It seems the system
> is waiting for something in bypass mode.
Jonathan Derrick <jonathan.derrick@linux.dev> wrote on Tue, Jan 10, 2023 at 05:00:
>
> As the bypass mode seems to affect performance greatly depending on the
> specific configuration, it may make sense to use a moduleparam to control it.
>
We found that each PCIe port can mount four drives. If we only test 2 or
1 drive of one PCIe port, the drive performance will be normal. Also, we
observed the interrupts in the different modes.

bypass:
.....
2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65
2022-12-28-11-39-14: RES 26743 Rescheduling interrupts
2022-12-28-11-39-17: irqtop - IRQ : 3029, TOTAL : 2100315228, CPU : 192, ACTIVE CPU : 192

disable:
......
2022-12-28-12-05-56: 1714 169797 IR-PCI-MSI 14155850-edge nvme1q74
2022-12-28-12-05-56: 1701 168753 IR-PCI-MSI 14155849-edge nvme1q73
2022-12-28-12-05-56: LOC 163697 Local timer interrupts
2022-12-28-12-05-56: TLB 5465 TLB shootdowns
2022-12-28-12-06-00: irqtop - IRQ : 3029, TOTAL : 2179022106, CPU : 192, ACTIVE CPU : 192

remapping:
2022-12-28-11-25-38: 283 325568 IR-PCI-MSI 24651790-edge vmd3
2022-12-28-11-25-38: 140 267899 IR-PCI-MSI 13117447-edge vmd1
2022-12-28-11-25-38: 183 265978 IR-PCI-MSI 13117490-edge vmd1
......
2022-12-28-11-25-42: irqtop - IRQ : 2109, TOTAL : 2377172002, CPU : 192, ACTIVE CPU : 192

From these results it is not difficult to see that in remapping mode the
interrupts come from the VMD, while in the other modes the interrupts
come from the NVMe devices. Besides, we found that a port mounting 4
drives has far fewer total interrupts than a port mounting 2 or 1 drive.
NVMe 8 and 9 are mounted on one port; the other ports mount 4 drives each.

2022-12-28-11-39-14: 2582 494635 IR-PCI-MSI 470810698-edge nvme9q74
2022-12-28-11-39-14: 2579 489972 IR-PCI-MSI 470810697-edge nvme9q73
2022-12-28-11-39-14: 2573 480024 IR-PCI-MSI 470810695-edge nvme9q71
2022-12-28-11-39-14: 2544 312967 IR-PCI-MSI 470286401-edge nvme8q65
2022-12-28-11-39-14: 2556 312229 IR-PCI-MSI 470286405-edge nvme8q69
2022-12-28-11-39-14: 2547 310013 IR-PCI-MSI 470286402-edge nvme8q66
2022-12-28-11-39-14: 2550 308993 IR-PCI-MSI 470286403-edge nvme8q67
2022-12-28-11-39-14: 2559 308794 IR-PCI-MSI 470286406-edge nvme8q70
......
2022-12-28-11-39-14: 1296 185773 IR-PCI-MSI 202375243-edge nvme1q75
2022-12-28-11-39-14: 1209 185646 IR-PCI-MSI 201850947-edge nvme0q67
2022-12-28-11-39-14: 1831 184151 IR-PCI-MSI 203423828-edge nvme3q84
2022-12-28-11-39-14: 1254 182313 IR-PCI-MSI 201850950-edge nvme0q70
2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65

> I'd vote for it being in VMD mode (non-bypass) by default.

I speculate that the VMD controller equalizes the interrupt load and acts
like a buffer, which improves the performance of the NVMe drives. I am
not sure about my analysis, so I'd like to discuss it with the community.
Friendly ping~ Xinghui Li <korantwork@gmail.com> 于2023年1月10日周二 20:28写道: > > Jonathan Derrick <jonathan.derrick@linux.dev> 于2023年1月10日周二 05:00写道: > > > > As the bypass mode seems to affect performance greatly depending on the specific configuration, > > it may make sense to use a moduleparam to control it > > > We found that each pcie port can mount four drives. If we only test 2 > or 1 dirve of one pcie port, > the performance of the drive performance will be normal. Also, we > observed the interruptions in different modes. > bypass: > ..... > 2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68 > 2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65 > 2022-12-28-11-39-14: RES 26743 Rescheduling interrupts > 2022-12-28-11-39-17: irqtop - IRQ : 3029, TOTAL : 2100315228, CPU : > 192, ACTIVE CPU : 192 > disable: > ...... > 2022-12-28-12-05-56: 1714 169797 IR-PCI-MSI 14155850-edge nvme1q74 > 2022-12-28-12-05-56: 1701 168753 IR-PCI-MSI 14155849-edge nvme1q73 > 2022-12-28-12-05-56: LOC 163697 Local timer interrupts > 2022-12-28-12-05-56: TLB 5465 TLB shootdowns > 2022-12-28-12-06-00: irqtop - IRQ : 3029, TOTAL : 2179022106, CPU : > 192, ACTIVE CPU : 192 > remapping: > 022-12-28-11-25-38: 283 325568 IR-PCI-MSI 24651790-edge vmd3 > 2022-12-28-11-25-38: 140 267899 IR-PCI-MSI 13117447-edge vmd1 > 2022-12-28-11-25-38: 183 265978 IR-PCI-MSI 13117490-edge vmd1 > ...... > 2022-12-28-11-25-42: irqtop - IRQ : 2109, TOTAL : 2377172002, CPU : > 192, ACTIVE CPU : 192 > > From the result it is not difficult to find, in remapping mode the > interruptions come from vmd. > While in other modes, interrupts come from nvme devices. Besides, we > found the port mounting > 4 dirves total interruptions is much fewer than the port mounting 2 or 1 drive. > NVME 8 and 9 mount in one port, other port mount 4 dirves. > > 2022-12-28-11-39-14: 2582 494635 IR-PCI-MSI 470810698-edge nvme9q74 > 2022-12-28-11-39-14: 2579 489972 IR-PCI-MSI 470810697-edge nvme9q73 > 2022-12-28-11-39-14: 2573 480024 IR-PCI-MSI 470810695-edge nvme9q71 > 2022-12-28-11-39-14: 2544 312967 IR-PCI-MSI 470286401-edge nvme8q65 > 2022-12-28-11-39-14: 2556 312229 IR-PCI-MSI 470286405-edge nvme8q69 > 2022-12-28-11-39-14: 2547 310013 IR-PCI-MSI 470286402-edge nvme8q66 > 2022-12-28-11-39-14: 2550 308993 IR-PCI-MSI 470286403-edge nvme8q67 > 2022-12-28-11-39-14: 2559 308794 IR-PCI-MSI 470286406-edge nvme8q70 > ...... > 2022-12-28-11-39-14: 1296 185773 IR-PCI-MSI 202375243-edge nvme1q75 > 2022-12-28-11-39-14: 1209 185646 IR-PCI-MSI 201850947-edge nvme0q67 > 2022-12-28-11-39-14: 1831 184151 IR-PCI-MSI 203423828-edge nvme3q84 > 2022-12-28-11-39-14: 1254 182313 IR-PCI-MSI 201850950-edge nvme0q70 > 2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68 > 2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65 > > I'd vote for it being in VMD mode (non-bypass) by default. > I speculate that the vmd controller equalizes the interrupt load and > acts like a buffer, > which improves the performance of nvme. I am not sure about my > analysis. So, I'd like > to discuss it with the community.
On 2/6/2023 5:45 AM, Xinghui Li wrote: > Friendly ping~ > > Xinghui Li <korantwork@gmail.com> 于2023年1月10日周二 20:28写道: >> Jonathan Derrick <jonathan.derrick@linux.dev> 于2023年1月10日周二 05:00写道: >>> As the bypass mode seems to affect performance greatly depending on the specific configuration, >>> it may make sense to use a moduleparam to control it >>> >> We found that each pcie port can mount four drives. If we only test 2 >> or 1 dirve of one pcie port, >> the performance of the drive performance will be normal. Also, we >> observed the interruptions in different modes. >> bypass: >> ..... >> 2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68 >> 2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65 >> 2022-12-28-11-39-14: RES 26743 Rescheduling interrupts >> 2022-12-28-11-39-17: irqtop - IRQ : 3029, TOTAL : 2100315228, CPU : >> 192, ACTIVE CPU : 192 >> disable: >> ...... >> 2022-12-28-12-05-56: 1714 169797 IR-PCI-MSI 14155850-edge nvme1q74 >> 2022-12-28-12-05-56: 1701 168753 IR-PCI-MSI 14155849-edge nvme1q73 >> 2022-12-28-12-05-56: LOC 163697 Local timer interrupts >> 2022-12-28-12-05-56: TLB 5465 TLB shootdowns >> 2022-12-28-12-06-00: irqtop - IRQ : 3029, TOTAL : 2179022106, CPU : >> 192, ACTIVE CPU : 192 >> remapping: >> 022-12-28-11-25-38: 283 325568 IR-PCI-MSI 24651790-edge vmd3 >> 2022-12-28-11-25-38: 140 267899 IR-PCI-MSI 13117447-edge vmd1 >> 2022-12-28-11-25-38: 183 265978 IR-PCI-MSI 13117490-edge vmd1 >> ...... >> 2022-12-28-11-25-42: irqtop - IRQ : 2109, TOTAL : 2377172002, CPU : >> 192, ACTIVE CPU : 192 >> >> From the result it is not difficult to find, in remapping mode the >> interruptions come from vmd. >> While in other modes, interrupts come from nvme devices. Besides, we >> found the port mounting >> 4 dirves total interruptions is much fewer than the port mounting 2 or 1 drive. >> NVME 8 and 9 mount in one port, other port mount 4 dirves. >> >> 2022-12-28-11-39-14: 2582 494635 IR-PCI-MSI 470810698-edge nvme9q74 >> 2022-12-28-11-39-14: 2579 489972 IR-PCI-MSI 470810697-edge nvme9q73 >> 2022-12-28-11-39-14: 2573 480024 IR-PCI-MSI 470810695-edge nvme9q71 >> 2022-12-28-11-39-14: 2544 312967 IR-PCI-MSI 470286401-edge nvme8q65 >> 2022-12-28-11-39-14: 2556 312229 IR-PCI-MSI 470286405-edge nvme8q69 >> 2022-12-28-11-39-14: 2547 310013 IR-PCI-MSI 470286402-edge nvme8q66 >> 2022-12-28-11-39-14: 2550 308993 IR-PCI-MSI 470286403-edge nvme8q67 >> 2022-12-28-11-39-14: 2559 308794 IR-PCI-MSI 470286406-edge nvme8q70 >> ...... >> 2022-12-28-11-39-14: 1296 185773 IR-PCI-MSI 202375243-edge nvme1q75 >> 2022-12-28-11-39-14: 1209 185646 IR-PCI-MSI 201850947-edge nvme0q67 >> 2022-12-28-11-39-14: 1831 184151 IR-PCI-MSI 203423828-edge nvme3q84 >> 2022-12-28-11-39-14: 1254 182313 IR-PCI-MSI 201850950-edge nvme0q70 >> 2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68 >> 2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65 >>> I'd vote for it being in VMD mode (non-bypass) by default. >> I speculate that the vmd controller equalizes the interrupt load and >> acts like a buffer, >> which improves the performance of nvme. I am not sure about my >> analysis. So, I'd like >> to discuss it with the community. I like the idea of module parameter to allow switching between the modes but keep MSI remapping enabled (non-bypass) by default.
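A minimal sketch of the module-parameter idea being discussed, assuming it would live inside drivers/pci/controller/vmd.c where VMD_FEAT_CAN_BYPASS_MSI_REMAP is defined. The parameter name and the vmd_filter_features() helper are illustrative, not an agreed interface; the default keeps remapping enabled (non-bypass), matching the preference stated above.

```c
#include <linux/module.h>
#include <linux/moduleparam.h>

/* Illustrative knob: MSI-X bypass stays off unless explicitly requested. */
static bool msix_remap_bypass;
module_param(msix_remap_bypass, bool, 0444);
MODULE_PARM_DESC(msix_remap_bypass,
		 "Allow VMD to bypass MSI-X remapping on capable devices (default: N)");

/* Applied to id->driver_data before the feature bits are acted on. */
static unsigned long vmd_filter_features(unsigned long features)
{
	if (!msix_remap_bypass)
		features &= ~VMD_FEAT_CAN_BYPASS_MSI_REMAP;
	return features;
}
```

With something along these lines, the 28C0 id entry could keep its bypass capability in vmd_ids[] while configurations that benefit from bypass opt back in at module load time.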
On Mon, Feb 06, 2023 at 11:11:36AM -0700, Patel, Nirmal wrote:
> I like the idea of module parameter to allow switching between the modes
> but keep MSI remapping enabled (non-bypass) by default.

Isn't there a more programmatic way to go about selecting the best option
at runtime? I suspect bypass is the better choice if "num_active_cpus() >
pci_msix_vec_count(vmd->dev)".
Keith Busch <kbusch@kernel.org> wrote on Tue, Feb 7, 2023 at 02:28:
>
> On Mon, Feb 06, 2023 at 11:11:36AM -0700, Patel, Nirmal wrote:
> > I like the idea of module parameter to allow switching between the modes
> > but keep MSI remapping enabled (non-bypass) by default.
>
> Isn't there a more programmatic way to go about selecting the best option
> at runtime?

Do you mean that the operating mode should be selected automatically by
detecting the number of devices and CPUs instead of being set manually?

> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".

For this situation, my speculation is that the PCIe nodes are
over-mounted, and not just because of the CPU-to-drive ratio. We
considered designing online nodes because we were concerned that I/O of
different chunk sizes would suit different MSI-X modes. Privately, I
think it may become logically complicated if the choice is made
programmatically.
On 2/6/2023 8:18 PM, Xinghui Li wrote:
> Keith Busch <kbusch@kernel.org> wrote on Tue, Feb 7, 2023 at 02:28:
>> On Mon, Feb 06, 2023 at 11:11:36AM -0700, Patel, Nirmal wrote:
>>> I like the idea of module parameter to allow switching between the modes
>>> but keep MSI remapping enabled (non-bypass) by default.
>> Isn't there a more programmatic way to go about selecting the best option
>> at runtime?
> Do you mean that the operating mode should be selected automatically by
> detecting the number of devices and CPUs instead of being set manually?
>> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
> For this situation, my speculation is that the PCIe nodes are
> over-mounted, and not just because of the CPU-to-drive ratio. We
> considered designing online nodes because we were concerned that I/O of
> different chunk sizes would suit different MSI-X modes. Privately, I
> think it may become logically complicated if the choice is made
> programmatically.

Also, newer CPUs have more MSI-X vectors (128), which means we can still
have better performance without bypass. It would be better if users can
choose a module parameter based on their requirements. Thanks.
Patel, Nirmal <nirmal.patel@linux.intel.com> wrote on Wed, Feb 8, 2023 at 04:32:
>
> Also, newer CPUs have more MSI-X vectors (128), which means we can still
> have better performance without bypass. It would be better if users can
> choose a module parameter based on their requirements. Thanks.
>
All right, I will send the v2 patch with the online-node version later.
Thanks
On Tue, Feb 07, 2023 at 01:32:20PM -0700, Patel, Nirmal wrote: > On 2/6/2023 8:18 PM, Xinghui Li wrote: > > Keith Busch <kbusch@kernel.org> 于2023年2月7日周二 02:28写道: > >> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)". > > For this situation, My speculation is that the PCIE nodes are > > over-mounted and not just because of the CPU to Drive ratio. > > We considered designing online nodes, because we were concerned that > > the IO of different chunk sizes would adapt to different MSI-X modes. > > I privately think that it may be logically complicated if programmatic > > judgments are made. > > Also newer CPUs have more MSIx (128) which means we can still have > better performance without bypass. It would be better if user have > can chose module parameter based on their requirements. Thanks. So what? More vectors just pushes the threshold to when bypass becomes relevant, which is exactly why I suggested it. There has to be an empirical answer to when bypass beats muxing. Why do you want a user tunable if there's a verifiable and automated better choice?
On 2/9/2023 4:05 PM, Keith Busch wrote:
> On Tue, Feb 07, 2023 at 01:32:20PM -0700, Patel, Nirmal wrote:
>> On 2/6/2023 8:18 PM, Xinghui Li wrote:
>>> Keith Busch <kbusch@kernel.org> wrote on Tue, Feb 7, 2023 at 02:28:
>>>> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
>>> For this situation, my speculation is that the PCIe nodes are
>>> over-mounted, and not just because of the CPU-to-drive ratio. We
>>> considered designing online nodes because we were concerned that I/O of
>>> different chunk sizes would suit different MSI-X modes. Privately, I
>>> think it may become logically complicated if the choice is made
>>> programmatically.
>> Also, newer CPUs have more MSI-X vectors (128), which means we can still
>> have better performance without bypass. It would be better if users can
>> choose a module parameter based on their requirements. Thanks.
> So what? More vectors just pushes the threshold to when bypass becomes
> relevant, which is exactly why I suggested it. There has to be an empirical
> answer to when bypass beats muxing. Why do you want a user tunable if there's
> a verifiable and automated better choice?

The automated choice makes sense, but I am not sure what the exact
tipping point is. The commit message includes only two cases: one with 1
drive on 1 CPU and a second with 12 drives on 6 CPUs. Also, performance
gets worse going from 8 drives to 12 drives.

One of the previous comments also mentioned something about FIO changing
cpus_allowed; will there be an issue when the VMD driver decides to
bypass the remapping during boot-up, but the FIO job changes cpus_allowed?
On Thu, Feb 09, 2023 at 04:57:59PM -0700, Patel, Nirmal wrote: > On 2/9/2023 4:05 PM, Keith Busch wrote: > > On Tue, Feb 07, 2023 at 01:32:20PM -0700, Patel, Nirmal wrote: > >> On 2/6/2023 8:18 PM, Xinghui Li wrote: > >>> Keith Busch <kbusch@kernel.org> 于2023年2月7日周二 02:28写道: > >>>> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)". > >>> For this situation, My speculation is that the PCIE nodes are > >>> over-mounted and not just because of the CPU to Drive ratio. > >>> We considered designing online nodes, because we were concerned that > >>> the IO of different chunk sizes would adapt to different MSI-X modes. > >>> I privately think that it may be logically complicated if programmatic > >>> judgments are made. > >> Also newer CPUs have more MSIx (128) which means we can still have > >> better performance without bypass. It would be better if user have > >> can chose module parameter based on their requirements. Thanks. > > So what? More vectors just pushes the threshold to when bypass becomes > > relevant, which is exactly why I suggested it. There has to be an empirical > > answer to when bypass beats muxing. Why do you want a user tunable if there's a > > verifiable and automated better choice? > > Make sense about the automated choice. I am not sure what is the exact > tipping point. The commit message includes only two cases. one 1 drive > 1 CPU and second 12 drives 6 CPU. Also performance gets worse from 8 > drives to 12 drives. That configuration's storage performance overwhelms the CPU with interrupt context switching. That problem probably inverts when your active CPU count exceeds your VMD vectors because you'll be funnelling more interrupts into fewer CPUs, leaving other CPUs idle. > One the previous comments also mentioned something about FIO changing > cpus_allowed; will there be an issue when VMD driver decides to bypass > the remapping during the boot up, but FIO job changes the cpu_allowed? No. Bypass mode uses managed interrupts for your nvme child devices, which sets the best possible affinity.
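A sketch of the runtime heuristic Keith suggests, under the assumption that it would run wherever the feature mask is still mutable during probe. num_active_cpus() and pci_msix_vec_count() are existing kernel helpers (and appear in the discussion above); vmd_auto_select_irq_mode() is an illustrative name, not a proposed function.

```c
#include <linux/cpumask.h>
#include <linux/pci.h>

/* Keep bypass only when active CPUs outnumber the VMD endpoint's own
 * MSI-X vectors, i.e. when muxing through VMD would funnel child
 * interrupts onto too few CPUs and leave the rest idle.
 */
static unsigned long vmd_auto_select_irq_mode(struct pci_dev *vmd_dev,
					      unsigned long features)
{
	int vmd_vectors = pci_msix_vec_count(vmd_dev);

	if (vmd_vectors > 0 && num_active_cpus() <= vmd_vectors)
		features &= ~VMD_FEAT_CAN_BYPASS_MSI_REMAP;

	return features;
}
```

The exact threshold is exactly what the thread leaves open; the benchmark data here only covers the 1-drive/1-CPU and 12-drive/6-CPU points.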
diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index e06e9f4fc50f..9f6e9324d67d 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -998,8 +998,7 @@ static const struct pci_device_id vmd_ids[] = {
 		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP,},
 	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_VMD_28C0),
 		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW |
-				VMD_FEAT_HAS_BUS_RESTRICTIONS |
-				VMD_FEAT_CAN_BYPASS_MSI_REMAP,},
+				VMD_FEAT_HAS_BUS_RESTRICTIONS,},
 	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x467f),
 		.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP |
 				VMD_FEAT_HAS_BUS_RESTRICTIONS |