Message ID | 20221111142722.1172-1-longpeng2@huawei.com |
---|---|
Headers |
From: "Longpeng(Mike)" <longpeng2@huawei.com>
To: bhelgaas@google.com
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, jianjay.zhou@huawei.com, zhuangshengen@huawei.com, arei.gonglei@huawei.com, yechuan@huawei.com, huangzhichao@huawei.com, xiehong@huawei.com, Longpeng <longpeng2@huawei.com>
Subject: [RFC 0/4] pci/sriov: support VFs dynamic addition
Date: Fri, 11 Nov 2022 22:27:18 +0800
Message-ID: <20221111142722.1172-1-longpeng2@huawei.com> |
Series | pci/sriov: support VFs dynamic addition | |
Message
Longpeng(Mike)
Nov. 11, 2022, 2:27 p.m. UTC
From: Longpeng <longpeng2@huawei.com>
We can enable SR-IOV and add VFs through /sys/bus/pci/devices/..../sriov_numvfs, but
this operation takes a lot of time when there is a large number of VFs.
For example, if the machine has 10 PFs and 250 VFs per PF, enabling all the VFs
concurrently costs about 200-250ms. However, most of them do not need to be
used at the moment, so we can enable SR-IOV first and add VFs on demand.
This series introduces two interfaces:
1. sriov_numvfs_no_scan: enable SR-IOV without adding the VFs.
2. sriov_scan_vf_id: add a specific VF.
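For illustration, a minimal sketch of the proposed flow next to the existing one, assuming the semantics described above (the exact write format of sriov_scan_vf_id, one VF index per write, and the $PF placeholder path are assumptions for the example, not confirmed by the patches themselves):

    # Existing flow: enabling SR-IOV creates and scans all VFs in one step.
    echo 250 > /sys/bus/pci/devices/$PF/sriov_numvfs

    # Proposed flow: enable SR-IOV without scanning the VFs, then add
    # individual VFs on demand.
    echo 250 > /sys/bus/pci/devices/$PF/sriov_numvfs_no_scan
    echo 0   > /sys/bus/pci/devices/$PF/sriov_scan_vf_id
    echo 7   > /sys/bus/pci/devices/$PF/sriov_scan_vf_id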
Longpeng (4):
pci/sriov: extract sriov_numvfs common helper
pci/sriov: add vf_bitmap to mark the vf id allocation
pci/sriov: add sriov_numfs_no_scan interface
pci/sriov: add sriov_scan_vf_id interface
drivers/pci/iov.c | 162 +++++++++++++++++++++++++++++++++++++++++-----
drivers/pci/pci.h | 1 +
2 files changed, 148 insertions(+), 15 deletions(-)
Comments
On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
> From: Longpeng <longpeng2@huawei.com>
>
> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
> this operation needs to spend lots of time if there has a large amount of VFs.
>
> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
> concurrently would cost about 200-250ms. However most of them are not need to be
> used at the moment, so we can enable SRIOV first but add VFs on demand.

It is unclear what took 200-250ms, is it physical VF creation or bind of
the driver to these VFs?

If the latter, you can try with sriov_drivers_autoprobe set to true. This is
how enablement of large SR-IOV systems is done.

Also PCI spec declares "VF enable" bit, which is applicable to all VFs
associated to that PF, see section "9.3.3.3.1 VF Enable".

Thanks

> This series introduces two interfaces:
> 1. sriov_numvfs_no_scan: enable SRIOV without add the VFs.
> 2. sriov_scan_vf_id: add a specific VF.
>
> Longpeng (4):
>   pci/sriov: extract sriov_numvfs common helper
>   pci/sriov: add vf_bitmap to mark the vf id allocation
>   pci/sriov: add sriov_numfs_no_scan interface
>   pci/sriov: add sriov_scan_vf_id interface
>
>  drivers/pci/iov.c | 162 +++++++++++++++++++++++++++++++++++++++++-----
>  drivers/pci/pci.h |   1 +
>  2 files changed, 148 insertions(+), 15 deletions(-)
>
> --
> 2.23.0
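For reference, sriov_drivers_autoprobe is an existing per-PF sysfs attribute documented in Documentation/ABI/testing/sysfs-bus-pci; a rough sketch of how it is typically used to create VFs without binding drivers to them (the BDF is a placeholder):

    # With autoprobe disabled on the PF, writing sriov_numvfs creates the
    # VF devices but does not probe drivers for them.
    PF=/sys/bus/pci/devices/0000:03:00.0   # placeholder PF BDF
    echo 0   > $PF/sriov_drivers_autoprobe
    echo 250 > $PF/sriov_numvfs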
On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
> From: Longpeng <longpeng2@huawei.com>
>
> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
> this operation needs to spend lots of time if there has a large amount of VFs.
>
> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
> concurrently would cost about 200-250ms. However most of them are not need to be
> used at the moment, so we can enable SRIOV first but add VFs on demand.
>
> This series introduces two interfaces:
> 1. sriov_numvfs_no_scan: enable SRIOV without add the VFs.
> 2. sriov_scan_vf_id: add a specific VF.
>
> Longpeng (4):
>   pci/sriov: extract sriov_numvfs common helper
>   pci/sriov: add vf_bitmap to mark the vf id allocation
>   pci/sriov: add sriov_numfs_no_scan interface
>   pci/sriov: add sriov_scan_vf_id interface

When you respond to Leon's questions, please also spend a little time
to look at the iov.c history and coding style and follow it, so we
don't have to waste time commenting on or fixing trivial things.

For example, there are no previous subject lines for that file that
start with "pci/sriov". Don't make up a new prefix; use what's been
done in the past. And follow the style of capitalizing the first word
after the prefix.

There are a few comments that look like they are more than 80 columns.
Again unlike everything else in the file.

The PCIe spec spells it "SR-IOV", not "sriov". Do the same in your
commit logs and comments.

Capitalize "VF" consistently. I see "VF" and "vf" used randomly.

Sysfs changes require documentation updates, e.g., in
Documentation/ABI/testing/sysfs-bus-pci.
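A quick way to check the existing subject-line conventions Bjorn refers to, sketched as a plain git query run from a kernel tree:

    # Show recent commit subjects touching drivers/pci/iov.c to see the
    # established prefix ("PCI/IOV: ...") and capitalization style.
    git log --oneline --no-merges -20 -- drivers/pci/iov.c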
Hi leon, 在 2022/11/12 0:39, Leon Romanovsky 写道: > On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote: >> From: Longpeng <longpeng2@huawei.com> >> >> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but >> this operation needs to spend lots of time if there has a large amount of VFs. >> >> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs >> concurrently would cost about 200-250ms. However most of them are not need to be >> used at the moment, so we can enable SRIOV first but add VFs on demand. > > It is unclear what took 200-250ms, is it physical VF creation or bind of > the driver to these VFs? > It is neither. In our test, we already created physical VFs before, so we skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only probes PF, it just returns an error if the function is VF. The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a long procedure. Each step costs only a little, but the total cost is not acceptable in some time-sensitive cases. What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can mostly support 10 concurrent calls if there has 10 PFs. > If the latter, you can try with sriov_drivers_autoprobe set to true. This is how > ennoblement of large SR-IOV systems is done. > > Also PCI spec declares "VF enable" bit, which is applicable to all VFs > associated to that PF, see section "9.3.3.3.1 VF Enable". > > Thanks > >> >> This series introduces two interfaces: >> 1. sriov_numvfs_no_scan: enable SRIOV without add the VFs. >> 2. sriov_scan_vf_id: add a specific VF. >> >> Longpeng (4): >> pci/sriov: extract sriov_numvfs common helper >> pci/sriov: add vf_bitmap to mark the vf id allocation >> pci/sriov: add sriov_numfs_no_scan interface >> pci/sriov: add sriov_scan_vf_id interface >> >> drivers/pci/iov.c | 162 +++++++++++++++++++++++++++++++++++++++++----- >> drivers/pci/pci.h | 1 + >> 2 files changed, 148 insertions(+), 15 deletions(-) >> >> -- >> 2.23.0 >> > .
在 2022/11/12 7:07, Bjorn Helgaas 写道: > On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote: >> From: Longpeng <longpeng2@huawei.com> >> >> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but >> this operation needs to spend lots of time if there has a large amount of VFs. >> >> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs >> concurrently would cost about 200-250ms. However most of them are not need to be >> used at the moment, so we can enable SRIOV first but add VFs on demand. >> >> This series introduces two interfaces: >> 1. sriov_numvfs_no_scan: enable SRIOV without add the VFs. >> 2. sriov_scan_vf_id: add a specific VF. >> >> Longpeng (4): >> pci/sriov: extract sriov_numvfs common helper >> pci/sriov: add vf_bitmap to mark the vf id allocation >> pci/sriov: add sriov_numfs_no_scan interface >> pci/sriov: add sriov_scan_vf_id interface > > When you respond to Leon's questions, please also spend a little time > to look at the iov.c history and coding style and follow it, so we > don't have to waste time commenting on or fixing trivial things. > Okay, thanks for your valuable suggestions. > For example, there are no previous subject lines for that file that > start with "pci/sriov". Don't make up a new prefix; use what's been > done in the past. And follow the style of capitalizing the first word > after the prefix. > > There are a few comments that look like they are more than 80 columns. > Again unlike everything else in the file. > > The PCIe spec spells it "SR-IOV", not "sriov". Do the same in your > commit logs and comments. > > Capitalize "VF" consistently. I see "VF" and "vf" used randomly. > > Sysfs changes require documention updates, e.g., in > Documentation/ABI/testing/sysfs-bus-pci. > .
On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: > Hi leon, > > 在 2022/11/12 0:39, Leon Romanovsky 写道: > > On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote: > > > From: Longpeng <longpeng2@huawei.com> > > > > > > We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but > > > this operation needs to spend lots of time if there has a large amount of VFs. > > > For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs > > > concurrently would cost about 200-250ms. However most of them are not need to be > > > used at the moment, so we can enable SRIOV first but add VFs on demand. > > > > It is unclear what took 200-250ms, is it physical VF creation or bind of > > the driver to these VFs? > > > It is neither. In our test, we already created physical VFs before, so we > skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only > probes PF, it just returns an error if the function is VF. It means that you didn't try sriov_drivers_autoprobe. Once it is set to true, It won't even try to probe VFs. > > The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a > long procedure. Each step costs only a little, but the total cost is not > acceptable in some time-sensitive cases. This is also cryptic to me. In standard SR-IOV deployment, all VFs are created and configured while operator booted the machine with sriov_drivers_autoprobe set to false. Once this machine is ready, VFs are assigned to relevant VMs/users through orchestration SW (IMHO, it is supported by all orchestration SW). And only last part (assigning to users) is time-sensitive operation. > > What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can > mostly support 10 concurrent calls if there has 10 PFs. I wondered, are you using real HW? or QEMU SR-IOV? What is your server that supports such large number of VFs? BTW, Your change will probably break all SR-IOV devices in the market as they rely on PCI subsystem to have VFs ready and configured. Thanks
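A sketch of the assignment step Leon describes, using existing sysfs interfaces (the per-device driver_override file and the bus-level drivers_probe file); the VF address is a placeholder:

    # Hand an already-created VF to a VM by binding it to vfio-pci on demand.
    VF=0000:03:00.2                           # placeholder VF BDF
    echo vfio-pci > /sys/bus/pci/devices/$VF/driver_override
    echo $VF      > /sys/bus/pci/drivers_probe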
在 2022/11/14 15:04, Leon Romanovsky 写道: > On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: >> Hi leon, >> >> 在 2022/11/12 0:39, Leon Romanovsky 写道: >>> On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote: >>>> From: Longpeng <longpeng2@huawei.com> >>>> >>>> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but >>>> this operation needs to spend lots of time if there has a large amount of VFs. >>>> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs >>>> concurrently would cost about 200-250ms. However most of them are not need to be >>>> used at the moment, so we can enable SRIOV first but add VFs on demand. >>> >>> It is unclear what took 200-250ms, is it physical VF creation or bind of >>> the driver to these VFs? >>> >> It is neither. In our test, we already created physical VFs before, so we >> skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only >> probes PF, it just returns an error if the function is VF. > > It means that you didn't try sriov_drivers_autoprobe. Once it is set to > true, It won't even try to probe VFs. > >> >> The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a >> long procedure. Each step costs only a little, but the total cost is not >> acceptable in some time-sensitive cases. > > This is also cryptic to me. In standard SR-IOV deployment, all VFs are > created and configured while operator booted the machine with sriov_drivers_autoprobe > set to false. Once this machine is ready, VFs are assigned to relevant VMs/users > through orchestration SW (IMHO, it is supported by all orchestration SW). > > And only last part (assigning to users) is time-sensitive operation. > The VF creation and configuration are also time-sensitive in some cases, for example, the hypervisor live update case (such as [1]): save VMs -> kexec -> restore VMs After the new kernel starts, the VFs must be added into the system, and then assign the original VFs to the QEMU. This means we must enable all 2K+ VFs at once and increase the downtime. If we can enable the VFs that are used by existing VMs then restore the VMs and enable other unused VFs at last, the downtime would be significantly reduced. [1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf >> >> What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can >> mostly support 10 concurrent calls if there has 10 PFs. > > I wondered, are you using real HW? or QEMU SR-IOV? What is your server > that supports such large number of VFs? > Physical device. Some devices in the market support the large number of VFs, especially in the hardware offloading area, e.g DPU/IPU. I think the SR-IOV software should keep pace with times too. > BTW, Your change will probably break all SR-IOV devices in the market as > they rely on PCI subsystem to have VFs ready and configured. > I see, but maybe this change could be a choice for some users. > Thanks > .
On Mon, Nov 14, 2022 at 08:38:42PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: > > > 在 2022/11/14 15:04, Leon Romanovsky 写道: > > On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: > > > Hi leon, > > > > > > 在 2022/11/12 0:39, Leon Romanovsky 写道: > > > > On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote: > > > > > From: Longpeng <longpeng2@huawei.com> > > > > > > > > > > We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but > > > > > this operation needs to spend lots of time if there has a large amount of VFs. > > > > > For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs > > > > > concurrently would cost about 200-250ms. However most of them are not need to be > > > > > used at the moment, so we can enable SRIOV first but add VFs on demand. > > > > > > > > It is unclear what took 200-250ms, is it physical VF creation or bind of > > > > the driver to these VFs? > > > > > > > It is neither. In our test, we already created physical VFs before, so we > > > skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only > > > probes PF, it just returns an error if the function is VF. > > > > It means that you didn't try sriov_drivers_autoprobe. Once it is set to > > true, It won't even try to probe VFs. > > > > > > > > The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a > > > long procedure. Each step costs only a little, but the total cost is not > > > acceptable in some time-sensitive cases. > > > > This is also cryptic to me. In standard SR-IOV deployment, all VFs are > > created and configured while operator booted the machine with sriov_drivers_autoprobe > > set to false. Once this machine is ready, VFs are assigned to relevant VMs/users > > through orchestration SW (IMHO, it is supported by all orchestration SW). > > > > And only last part (assigning to users) is time-sensitive operation. > > > The VF creation and configuration are also time-sensitive in some cases, for > example, the hypervisor live update case (such as [1]): > save VMs -> kexec -> restore VMs > > After the new kernel starts, the VFs must be added into the system, and then > assign the original VFs to the QEMU. This means we must enable all 2K+ VFs > at once and increase the downtime. > > If we can enable the VFs that are used by existing VMs then restore the VMs > and enable other unused VFs at last, the downtime would be significantly > reduced. > > [1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf Like it is written in presentation, the standard way of doing it is done by VFIO live migration feature, where 2K+ VMs are migrated to another server at the time first server is scheduled for maintenance. However, even in live update case mentioned in the presentation, you should disable ALL PFs/VFs and enable ALL PFs/VFs at the same time, so you don't need per-VF id enable knob. > > > > > > > What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can > > > mostly support 10 concurrent calls if there has 10 PFs. > > > > I wondered, are you using real HW? or QEMU SR-IOV? What is your server > > that supports such large number of VFs? > > > Physical device. Some devices in the market support the large number of VFs, > especially in the hardware offloading area, e.g DPU/IPU. I think the SR-IOV > software should keep pace with times too. 
Our devices (and Intel too) support many VFs too. The thing is that servers are unlikely to be able to support 10 physical devices with 2K+ VFs. There are many limitations that will make such is not usable. Like, global MSI-X pool and PCI bandwidth to support all these devices. > > > BTW, Your change will probably break all SR-IOV devices in the market as > > they rely on PCI subsystem to have VFs ready and configured. > > > I see, but maybe this change could be a choice for some users. It should come with relevant driver changes and very strong justification why such functionality is needed now and can't be achieved by anything else except user-facing sysfs. I don't see anything in this presentation and discussion that supports need of such UAPI. Thanks > > > Thanks > > .
在 2022/11/14 21:09, Leon Romanovsky 写道: > On Mon, Nov 14, 2022 at 08:38:42PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: >> >> >> 在 2022/11/14 15:04, Leon Romanovsky 写道: >>> On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: >>>> Hi leon, >>>> >>>> 在 2022/11/12 0:39, Leon Romanovsky 写道: >>>>> On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote: >>>>>> From: Longpeng <longpeng2@huawei.com> >>>>>> >>>>>> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but >>>>>> this operation needs to spend lots of time if there has a large amount of VFs. >>>>>> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs >>>>>> concurrently would cost about 200-250ms. However most of them are not need to be >>>>>> used at the moment, so we can enable SRIOV first but add VFs on demand. >>>>> >>>>> It is unclear what took 200-250ms, is it physical VF creation or bind of >>>>> the driver to these VFs? >>>>> >>>> It is neither. In our test, we already created physical VFs before, so we >>>> skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only >>>> probes PF, it just returns an error if the function is VF. >>> >>> It means that you didn't try sriov_drivers_autoprobe. Once it is set to >>> true, It won't even try to probe VFs. >>> >>>> >>>> The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a >>>> long procedure. Each step costs only a little, but the total cost is not >>>> acceptable in some time-sensitive cases. >>> >>> This is also cryptic to me. In standard SR-IOV deployment, all VFs are >>> created and configured while operator booted the machine with sriov_drivers_autoprobe >>> set to false. Once this machine is ready, VFs are assigned to relevant VMs/users >>> through orchestration SW (IMHO, it is supported by all orchestration SW). >>> >>> And only last part (assigning to users) is time-sensitive operation. >>> >> The VF creation and configuration are also time-sensitive in some cases, for >> example, the hypervisor live update case (such as [1]): >> save VMs -> kexec -> restore VMs >> >> After the new kernel starts, the VFs must be added into the system, and then >> assign the original VFs to the QEMU. This means we must enable all 2K+ VFs >> at once and increase the downtime. >> >> If we can enable the VFs that are used by existing VMs then restore the VMs >> and enable other unused VFs at last, the downtime would be significantly >> reduced. >> >> [1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf > > Like it is written in presentation, the standard way of doing it is done > by VFIO live migration feature, where 2K+ VMs are migrated to another server > at the time first server is scheduled for maintenance. > Live migration is not the best choice in production environment, it's too heavy. Some cloud providers prefer to using hypervisor live update in their system, such as AWS's nitro hypervisor. > However, even in live update case mentioned in the presentation, you > should disable ALL PFs/VFs and enable ALL PFs/VFs at the same time, > so you don't need per-VF id enable knob. > The presentation is just a reference, some points could be optimized including disable PFs/VFs and enable PFs/VFs. Hypervisor live update can finish in less than 1 second, so the cost of disabling PFs/VFs and enabling PFs/VFs (~200-250ms or even worst) is too high. 
>> >>>> >>>> What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can >>>> mostly support 10 concurrent calls if there has 10 PFs. >>> >>> I wondered, are you using real HW? or QEMU SR-IOV? What is your server >>> that supports such large number of VFs? >>> >> Physical device. Some devices in the market support the large number of VFs, >> especially in the hardware offloading area, e.g DPU/IPU. I think the SR-IOV >> software should keep pace with times too. > > Our devices (and Intel too) support many VFs too. The thing is that > servers are unlikely to be able to support 10 physical devices with 2K+ > VFs. There are many limitations that will make such is not usable. > Like, global MSI-X pool and PCI bandwidth to support all these devices. > >> >>> BTW, Your change will probably break all SR-IOV devices in the market as >>> they rely on PCI subsystem to have VFs ready and configured. >>> >> I see, but maybe this change could be a choice for some users. > > It should come with relevant driver changes and very strong justification why > such functionality is needed now and can't be achieved by anything else > except user-facing sysfs. > Adding 2K+ VFs to the sysfs need too much time. Look at the bottomhalf of the hypervisor live update: kexec --> add 2K VFs --> restore VMs The downtime can be reduced if the sequence is: kexec --> add 100 VFs(the VMs used) --> resotre VMs --> add 1.9K VFs > I don't see anything in this presentation and discussion that supports > need of such UAPI. > > Thanks > >> >>> Thanks >>> . > .
On Mon, Nov 14, 2022 at 10:06:49PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: > > > 在 2022/11/14 21:09, Leon Romanovsky 写道: > > On Mon, Nov 14, 2022 at 08:38:42PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: > > > > > > > > > 在 2022/11/14 15:04, Leon Romanovsky 写道: > > > > On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: > > > > > Hi leon, > > > > > > > > > > 在 2022/11/12 0:39, Leon Romanovsky 写道: > > > > > > On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote: > > > > > > > From: Longpeng <longpeng2@huawei.com> > > > > > > > > > > > > > > We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but > > > > > > > this operation needs to spend lots of time if there has a large amount of VFs. > > > > > > > For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs > > > > > > > concurrently would cost about 200-250ms. However most of them are not need to be > > > > > > > used at the moment, so we can enable SRIOV first but add VFs on demand. > > > > > > > > > > > > It is unclear what took 200-250ms, is it physical VF creation or bind of > > > > > > the driver to these VFs? > > > > > > > > > > > It is neither. In our test, we already created physical VFs before, so we > > > > > skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only > > > > > probes PF, it just returns an error if the function is VF. > > > > > > > > It means that you didn't try sriov_drivers_autoprobe. Once it is set to > > > > true, It won't even try to probe VFs. > > > > > > > > > > > > > > The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a > > > > > long procedure. Each step costs only a little, but the total cost is not > > > > > acceptable in some time-sensitive cases. > > > > > > > > This is also cryptic to me. In standard SR-IOV deployment, all VFs are > > > > created and configured while operator booted the machine with sriov_drivers_autoprobe > > > > set to false. Once this machine is ready, VFs are assigned to relevant VMs/users > > > > through orchestration SW (IMHO, it is supported by all orchestration SW). > > > > > > > > And only last part (assigning to users) is time-sensitive operation. > > > > > > > The VF creation and configuration are also time-sensitive in some cases, for > > > example, the hypervisor live update case (such as [1]): > > > save VMs -> kexec -> restore VMs > > > > > > After the new kernel starts, the VFs must be added into the system, and then > > > assign the original VFs to the QEMU. This means we must enable all 2K+ VFs > > > at once and increase the downtime. > > > > > > If we can enable the VFs that are used by existing VMs then restore the VMs > > > and enable other unused VFs at last, the downtime would be significantly > > > reduced. > > > > > > [1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf > > > > Like it is written in presentation, the standard way of doing it is done > > by VFIO live migration feature, where 2K+ VMs are migrated to another server > > at the time first server is scheduled for maintenance. > > > Live migration is not the best choice in production environment, it's too > heavy. Some cloud providers prefer to using hypervisor live update in their > system, such as AWS's nitro hypervisor. How is AWS nitro relevant to our discussion about adding sysfs file to Linux? 
Can you please point us to the source code of that hypervisor? Does it even run on Linux? Anyway, I'm aware of big cloud providers who are pretty happy with live migration in production. > > > However, even in live update case mentioned in the presentation, you > > should disable ALL PFs/VFs and enable ALL PFs/VFs at the same time, > > so you don't need per-VF id enable knob. > > > The presentation is just a reference, some points could be optimized > including disable PFs/VFs and enable PFs/VFs. > > Hypervisor live update can finish in less than 1 second, so the cost of > disabling PFs/VFs and enabling PFs/VFs (~200-250ms or even worst) is too > high. > > > > > > > > > > > > > > What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can > > > > > mostly support 10 concurrent calls if there has 10 PFs. > > > > > > > > I wondered, are you using real HW? or QEMU SR-IOV? What is your server > > > > that supports such large number of VFs? > > > > > > > Physical device. Some devices in the market support the large number of VFs, > > > especially in the hardware offloading area, e.g DPU/IPU. I think the SR-IOV > > > software should keep pace with times too. > > > > Our devices (and Intel too) support many VFs too. The thing is that > > servers are unlikely to be able to support 10 physical devices with 2K+ > > VFs. There are many limitations that will make such is not usable. > > Like, global MSI-X pool and PCI bandwidth to support all these devices. > > > > > > > > > BTW, Your change will probably break all SR-IOV devices in the market as > > > > they rely on PCI subsystem to have VFs ready and configured. > > > > > > > I see, but maybe this change could be a choice for some users. > > > > It should come with relevant driver changes and very strong justification why > > such functionality is needed now and can't be achieved by anything else > > except user-facing sysfs. > > > Adding 2K+ VFs to the sysfs need too much time. > > Look at the bottomhalf of the hypervisor live update: > kexec --> add 2K VFs --> restore VMs > > The downtime can be reduced if the sequence is: > kexec --> add 100 VFs(the VMs used) --> resotre VMs --> add 1.9K VFs Addition of VFs is serial operation, you can fire your VMs once you counted 100 VFs in sysfs directory. > > > > I don't see anything in this presentation and discussion that supports > > need of such UAPI. > > > Thanks > > > > > > > > > Thanks > > > > . > > .
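A rough sketch of what Leon suggests: the PCI core creates a virtfn<N> link under the PF as each VF is added, so a script can start restoring VMs as soon as enough links exist (the path and the count of 100 are placeholders):

    # Start restoring VMs once the first 100 VFs have been added; the
    # remaining VFs keep appearing in the background.
    PF=/sys/bus/pci/devices/0000:03:00.0      # placeholder PF BDF
    until [ "$(ls -d $PF/virtfn* 2>/dev/null | wc -l)" -ge 100 ]; do
            sleep 0.01
    done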
在 2022/11/14 22:20, Leon Romanovsky 写道: > On Mon, Nov 14, 2022 at 10:06:49PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: >> >> >> 在 2022/11/14 21:09, Leon Romanovsky 写道: >>> On Mon, Nov 14, 2022 at 08:38:42PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: >>>> >>>> >>>> 在 2022/11/14 15:04, Leon Romanovsky 写道: >>>>> On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: >>>>>> Hi leon, >>>>>> >>>>>> 在 2022/11/12 0:39, Leon Romanovsky 写道: >>>>>>> On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote: >>>>>>>> From: Longpeng <longpeng2@huawei.com> >>>>>>>> >>>>>>>> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but >>>>>>>> this operation needs to spend lots of time if there has a large amount of VFs. >>>>>>>> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs >>>>>>>> concurrently would cost about 200-250ms. However most of them are not need to be >>>>>>>> used at the moment, so we can enable SRIOV first but add VFs on demand. >>>>>>> >>>>>>> It is unclear what took 200-250ms, is it physical VF creation or bind of >>>>>>> the driver to these VFs? >>>>>>> >>>>>> It is neither. In our test, we already created physical VFs before, so we >>>>>> skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only >>>>>> probes PF, it just returns an error if the function is VF. >>>>> >>>>> It means that you didn't try sriov_drivers_autoprobe. Once it is set to >>>>> true, It won't even try to probe VFs. >>>>> >>>>>> >>>>>> The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a >>>>>> long procedure. Each step costs only a little, but the total cost is not >>>>>> acceptable in some time-sensitive cases. >>>>> >>>>> This is also cryptic to me. In standard SR-IOV deployment, all VFs are >>>>> created and configured while operator booted the machine with sriov_drivers_autoprobe >>>>> set to false. Once this machine is ready, VFs are assigned to relevant VMs/users >>>>> through orchestration SW (IMHO, it is supported by all orchestration SW). >>>>> >>>>> And only last part (assigning to users) is time-sensitive operation. >>>>> >>>> The VF creation and configuration are also time-sensitive in some cases, for >>>> example, the hypervisor live update case (such as [1]): >>>> save VMs -> kexec -> restore VMs >>>> >>>> After the new kernel starts, the VFs must be added into the system, and then >>>> assign the original VFs to the QEMU. This means we must enable all 2K+ VFs >>>> at once and increase the downtime. >>>> >>>> If we can enable the VFs that are used by existing VMs then restore the VMs >>>> and enable other unused VFs at last, the downtime would be significantly >>>> reduced. >>>> >>>> [1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf >>> >>> Like it is written in presentation, the standard way of doing it is done >>> by VFIO live migration feature, where 2K+ VMs are migrated to another server >>> at the time first server is scheduled for maintenance. >>> >> Live migration is not the best choice in production environment, it's too >> heavy. Some cloud providers prefer to using hypervisor live update in their >> system, such as AWS's nitro hypervisor. > > How is AWS nitro relevant to our discussion about adding sysfs file to Linux? > Can you please point us to the source code of that hypervisor? 
Does it even > run on Linux? > Um...You can google for more information about the AWS nitro system. Yes, it's digressive, so let's back to the discussion about adding sysfs file. > Anyway, I'm aware of big cloud providers who are pretty happy with live > migration in production. > We're having trouble coming to an agreement on this point, but it does't matter. Please see below. >> >>> However, even in live update case mentioned in the presentation, you >>> should disable ALL PFs/VFs and enable ALL PFs/VFs at the same time, >>> so you don't need per-VF id enable knob. >>> >> The presentation is just a reference, some points could be optimized >> including disable PFs/VFs and enable PFs/VFs. >> >> Hypervisor live update can finish in less than 1 second, so the cost of >> disabling PFs/VFs and enabling PFs/VFs (~200-250ms or even worst) is too >> high. >> >>>> >>>>>> >>>>>> What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can >>>>>> mostly support 10 concurrent calls if there has 10 PFs. >>>>> >>>>> I wondered, are you using real HW? or QEMU SR-IOV? What is your server >>>>> that supports such large number of VFs? >>>>> >>>> Physical device. Some devices in the market support the large number of VFs, >>>> especially in the hardware offloading area, e.g DPU/IPU. I think the SR-IOV >>>> software should keep pace with times too. >>> >>> Our devices (and Intel too) support many VFs too. The thing is that >>> servers are unlikely to be able to support 10 physical devices with 2K+ >>> VFs. There are many limitations that will make such is not usable. >>> Like, global MSI-X pool and PCI bandwidth to support all these devices. >>> >>>> >>>>> BTW, Your change will probably break all SR-IOV devices in the market as >>>>> they rely on PCI subsystem to have VFs ready and configured. >>>>> >>>> I see, but maybe this change could be a choice for some users. >>> >>> It should come with relevant driver changes and very strong justification why >>> such functionality is needed now and can't be achieved by anything else >>> except user-facing sysfs. >>> >> Adding 2K+ VFs to the sysfs need too much time. >> >> Look at the bottomhalf of the hypervisor live update: >> kexec --> add 2K VFs --> restore VMs >> >> The downtime can be reduced if the sequence is: >> kexec --> add 100 VFs(the VMs used) --> resotre VMs --> add 1.9K VFs > > Addition of VFs is serial operation, you can fire your VMs once you > counted 100 VFs in sysfs directory. > According to the current implementation, the addition of VFs must be in order, so this can not properly work. For example, the VM uses VF200, VF202, VF204, but the sriov_add_vfs can only add VFs in the order VF0, VF1, VF2 ... The limitation is introduced by the software, not the PCI spec. >> >> >>> I don't see anything in this presentation and discussion that supports >>> need of such UAPI. >>> > Thanks >>> >>>> >>>>> Thanks >>>>> . >>> . > .
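To make the ordering point concrete, a sketch of what the proposed interface would allow, assuming sriov_scan_vf_id accepts one VF index per write as the cover letter implies (paths and VF numbers are placeholders):

    # Add only the out-of-order VFs that the restored VMs actually use,
    # then restore the VMs; the remaining VFs can be added later.
    PF=/sys/bus/pci/devices/0000:03:00.0      # placeholder PF BDF
    for vf in 200 202 204; do
            echo $vf > $PF/sriov_scan_vf_id
    done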
On Tue, Nov 15, 2022 at 1:27 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> *snip*
>
> Anyway, I'm aware of big cloud providers who are pretty happy with live
> migration in production.

I could see someone sufficiently cloudbrained deciding that rebooting
the hypervisor is fine provided the downtime doesn't violate any
customer uptime SLAs. Personally I'd only be brave enough to do that
for a HV hosting internal services which I know are behind a load
balancer, but apparently there are people at Huawei far braver than I.

> *snip*
>
> > Adding 2K+ VFs to the sysfs need too much time.
> >
> > Look at the bottomhalf of the hypervisor live update:
> > kexec --> add 2K VFs --> restore VMs
> >
> > The downtime can be reduced if the sequence is:
> > kexec --> add 100 VFs(the VMs used) --> resotre VMs --> add 1.9K VFs
>
> Addition of VFs is serial operation, you can fire your VMs once you
> counted 100 VFs in sysfs directory.

I don't know if making that kind of assumption about the behaviour of
sysfs is better or worse than just adding another knob. If at some
point in the future the initialisation of VF pci_devs was moved to a
workqueue or something we'd be violating that assumption without
breaking any of the documented ABI. I guess you could argue that VFs
being added sequentially is "ABI", but userspace has always been told
not to make assumptions about when sysfs attributes (or nodes, I
guess) appear since doing so is prone to races.
On Tue, Nov 15, 2022 at 1:08 AM Longpeng (Mike, Cloud Infrastructure
Service Product Dept.) <longpeng2@huawei.com> wrote:
>
> *snip*
>
> Adding 2K+ VFs to the sysfs need too much time.
>
> Look at the bottomhalf of the hypervisor live update:
> kexec --> add 2K VFs --> restore VMs
>
> The downtime can be reduced if the sequence is:
> kexec --> add 100 VFs(the VMs used) --> resotre VMs --> add 1.9K VFs

Right, so you want to add VFs in batches rather than all at once.
Personally I think the bitmap approach is error prone since it renders
the meaning of pf_dev->sriov->num_VFs unclear and there's some hairy
code in arch/powerpc/ that approach will likely break.

A better approach would be to add an attribute to control the number of
VFs enabled in hardware and allowing sriov_numvfs to accept any number
between the current value and sriov_hw_numvfs. e.g. your HV setup would
look something like:

echo 2048 > sriov_hw_numvfs
echo 100 > sriov_numvfs

# time passes
echo 2048 > sriov_numvfs

This would be fairly simple to implement and you can make it backwards
compatible by having writes to sriov_numvfs retain their current
semantics if sriov_hw_numvfs is zero.
On Tue, Nov 15, 2022 at 12:50:34PM +1100, Oliver O'Halloran wrote:
> On Tue, Nov 15, 2022 at 1:27 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > *snip*
> >
> > Anyway, I'm aware of big cloud providers who are pretty happy with live
> > migration in production.
>
> I could see someone sufficiently cloudbrained deciding that rebooting
> the hypervisor is fine provided the downtime doesn't violate any
> customer uptime SLAs. Personally I'd only be brave enough to do that
> for a HV hosting internal services which I know are behind a load
> balancer, but apparently there are people at Huawei far braver than I.

My main point in this discussion is that the Huawei team doesn't actually
provide any meaningful justification for why it is a great idea to add a
new sysfs file. They use HPC as an argument, but in that world, you won't
see many VMs on one server, as it is important to provide separate MSI-X
vectors and CPUs to each VM.

They ask from us an optimization (do not add device hierarchy for existing
HW) that doesn't exist in the kernel.

I would say that they are trying to meld the SIOV architecture of
subfunctions (SFs) into the PCI and SR-IOV world.

Thanks
On Tue, Nov 15, 2022 at 05:36:38PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote: > > > 在 2022/11/15 16:32, Leon Romanovsky 写道: > > On Tue, Nov 15, 2022 at 12:50:34PM +1100, Oliver O'Halloran wrote: > > > On Tue, Nov 15, 2022 at 1:27 AM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > *snip* > > > > > > > > Anyway, I'm aware of big cloud providers who are pretty happy with live > > > > migration in production. > > > > > > I could see someone sufficiently cloudbrained deciding that rebooting > > > the hypervisor is fine provided the downtime doesn't violate any > > > customer uptime SLAs. Personally I'd only be brave enough to do that > > > for a HV hosting internal services which I know are behind a load > > > balancer, but apparently there are people at Huawei far braver than I. > > > > My main point in this discussion that Huawei team doesn't actually > > provide any meaningful justification why it is great idea to add new > > sysfs file. They use HPC as an argument, but in that world, you won't > > see many VMs on one server, as it is important to provide separate MSI-X > > vectors and CPUs to each VM. > > > > They ask from us optimization (do not add device hierarchy for existing HW) > > that doesn't exist in the kernel. > > > > I would say that they are trying to meld SIOV architecture of subfunctions > > (SFs) into PCI and SR-IOV world. > > > I may not agree with you on this point. The bright side of open source that you don't need to agree with everyone. If you success to convince others, it will be merged. > The sriov_numvfs interface mixes the > operation of enabling hardware VFs and the addition of VFs. I just want to > split these two operations and also can selectively add some VFs earlier > than others. I think there's no violation of PCI spec. Right, it just doesn't fit into Linux kernel device initialization model. Thanks > > > Thanks > > .
On Tue, Nov 15, 2022 at 7:32 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Nov 15, 2022 at 12:50:34PM +1100, Oliver O'Halloran wrote:
> > On Tue, Nov 15, 2022 at 1:27 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > *snip*
> > >
> > > Anyway, I'm aware of big cloud providers who are pretty happy with live
> > > migration in production.
> >
> > I could see someone sufficiently cloudbrained deciding that rebooting
> > the hypervisor is fine provided the downtime doesn't violate any
> > customer uptime SLAs. Personally I'd only be brave enough to do that
> > for a HV hosting internal services which I know are behind a load
> > balancer, but apparently there are people at Huawei far braver than I.
>
> My main point in this discussion that Huawei team doesn't actually
> provide any meaningful justification why it is great idea to add new
> sysfs file.

All their arguments seem to be based on trying to reduce the time-to-VMs
when a hypervisor is kexec()ed, which is a pretty reasonable justification
IMO. I do have some reservations about the numbers they're claiming since
250ms for initializing struct pci_dev's for the VFs seems excessive.
Unfortunately, I don't have any hardware that supports 2048 VFs on hand so
I can't verify that claim.

> They use HPC as an argument but in that world, you won't
> see many VMs on one server, as it is important to provide separate MSI-X
> vectors and CPUs to each VM.

I don't think HPC has come up in this thread, but assuming it has: In the
cloud "HPC" usually means "it has timesliced access to GPUs". Having 2k
VMs sharing one or more GPUs on a single system isn't necessarily
advisable, but if we assume only a subset of those VMs will actually need
access to a GPU at any given time it's sort of reasonable.

> They ask from us optimization (do not add device hierarchy for existing HW)
> that doesn't exist in the kernel.
>
> I would say that they are trying to meld SIOV architecture of subfunctions
> (SFs) into PCI and SR-IOV world.

I don't know what asks you're referring to, but they're not present in
this thread. I'm going to give Longpeng the benefit of the doubt and
assume that this series is an attempt to fix a problem he's facing with
actual hardware that exists today. To say they should have implemented the
device with SIOV (proprietary to Intel until March this year) rather than
SR-IOV (standardised by PCI-SIG over a decade ago) is not terribly helpful
to anyone.

Additionally, SIOV exists largely to solve a problem that's an issue
because Intel decided that all PCI devices should exist within a single
PCI domain. If you don't have that problem SIOV is a lot less compelling.