[RFC,0/4] pci/sriov: support VFs dynamic addition

Message ID	20221111142722.1172-1-longpeng2@huawei.com
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: "Longpeng(Mike)" <longpeng2@huawei.com> To: <bhelgaas@google.com> CC: <linux-pci@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <jianjay.zhou@huawei.com>, <zhuangshengen@huawei.com>, <arei.gonglei@huawei.com>, <yechuan@huawei.com>, <huangzhichao@huawei.com>, <xiehong@huawei.com>, Longpeng <longpeng2@huawei.com> Subject: [RFC 0/4] pci/sriov: support VFs dynamic addition Date: Fri, 11 Nov 2022 22:27:18 +0800 Message-ID: <20221111142722.1172-1-longpeng2@huawei.com> MIME-Version: 1.0 Content-Transfer-Encoding: 7BIT Content-Type: text/plain; charset=US-ASCII Precedence: bulk
Series	pci/sriov: support VFs dynamic addition \| [RFC,0/4] pci/sriov: support VFs dynamic addition [RFC,1/4] pci/sriov: extract sriov_numvfs common helper [RFC,2/4] pci/sriov: add vf_bitmap to mark the vf id allocation [RFC,3/4] pci/sriov: add sriov_numfs_no_scan interface [RFC,4/4] pci/sriov: add sriov_scan_vf_id interface

Message ID

20221111142722.1172-1-longpeng2@huawei.com

Headers

Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
From: "Longpeng(Mike)" <longpeng2@huawei.com>
To: <bhelgaas@google.com>
CC: <linux-pci@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
        <jianjay.zhou@huawei.com>, <zhuangshengen@huawei.com>,
        <arei.gonglei@huawei.com>, <yechuan@huawei.com>,
        <huangzhichao@huawei.com>, <xiehong@huawei.com>,
        Longpeng <longpeng2@huawei.com>
Subject: [RFC 0/4] pci/sriov: support VFs dynamic addition
Date: Fri, 11 Nov 2022 22:27:18 +0800
Message-ID: <20221111142722.1172-1-longpeng2@huawei.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 7BIT
Content-Type: text/plain; charset=US-ASCII
Precedence: bulk

Series

pci/sriov: support VFs dynamic addition |

Message

Longpeng(Mike) Nov. 11, 2022, 2:27 p.m. UTC

  From: Longpeng <longpeng2@huawei.com>

We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
this operation needs to spend lots of time if there has a large amount of VFs.      
                                                            
For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
concurrently would cost about 200-250ms. However most of them are not need to be
used at the moment, so we can enable SRIOV first but add VFs on demand.

This series introduces two interfaces:
1. sriov_numvfs_no_scan: enable SRIOV without add the VFs.
2. sriov_scan_vf_id: add a specific VF.

Longpeng (4):
  pci/sriov: extract sriov_numvfs common helper
  pci/sriov: add vf_bitmap to mark the vf id allocation
  pci/sriov: add sriov_numfs_no_scan interface
  pci/sriov: add sriov_scan_vf_id interface

 drivers/pci/iov.c | 162 +++++++++++++++++++++++++++++++++++++++++-----
 drivers/pci/pci.h |   1 +
 2 files changed, 148 insertions(+), 15 deletions(-)

Comments

Leon Romanovsky Nov. 11, 2022, 4:39 p.m. UTC | #1

On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
> From: Longpeng <longpeng2@huawei.com>
> 
> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
> this operation needs to spend lots of time if there has a large amount of VFs.      
>                                                             
> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
> concurrently would cost about 200-250ms. However most of them are not need to be
> used at the moment, so we can enable SRIOV first but add VFs on demand.

It is unclear what took 200-250ms, is it physical VF creation or bind of
the driver to these VFs?

If the latter, you can try with sriov_drivers_autoprobe set to true. This is how
ennoblement of large SR-IOV systems is done.

Also PCI spec declares "VF enable" bit, which is applicable to all VFs
associated to that PF, see section "9.3.3.3.1 VF Enable".

Thanks

> 
> This series introduces two interfaces:
> 1. sriov_numvfs_no_scan: enable SRIOV without add the VFs.
> 2. sriov_scan_vf_id: add a specific VF.
> 
> Longpeng (4):
>   pci/sriov: extract sriov_numvfs common helper
>   pci/sriov: add vf_bitmap to mark the vf id allocation
>   pci/sriov: add sriov_numfs_no_scan interface
>   pci/sriov: add sriov_scan_vf_id interface
> 
>  drivers/pci/iov.c | 162 +++++++++++++++++++++++++++++++++++++++++-----
>  drivers/pci/pci.h |   1 +
>  2 files changed, 148 insertions(+), 15 deletions(-)
> 
> -- 
> 2.23.0
>

Bjorn Helgaas Nov. 11, 2022, 11:07 p.m. UTC | #2

On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
> From: Longpeng <longpeng2@huawei.com>
> 
> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
> this operation needs to spend lots of time if there has a large amount of VFs.      
>                                                             
> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
> concurrently would cost about 200-250ms. However most of them are not need to be
> used at the moment, so we can enable SRIOV first but add VFs on demand.
> 
> This series introduces two interfaces:
> 1. sriov_numvfs_no_scan: enable SRIOV without add the VFs.
> 2. sriov_scan_vf_id: add a specific VF.
> 
> Longpeng (4):
>   pci/sriov: extract sriov_numvfs common helper
>   pci/sriov: add vf_bitmap to mark the vf id allocation
>   pci/sriov: add sriov_numfs_no_scan interface
>   pci/sriov: add sriov_scan_vf_id interface

When you respond to Leon's questions, please also spend a little time
to look at the iov.c history and coding style and follow it, so we
don't have to waste time commenting on or fixing trivial things.

For example, there are no previous subject lines for that file that
start with "pci/sriov".  Don't make up a new prefix; use what's been
done in the past.  And follow the style of capitalizing the first word
after the prefix.

There are a few comments that look like they are more than 80 columns.
Again unlike everything else in the file.

The PCIe spec spells it "SR-IOV", not "sriov".  Do the same in your
commit logs and comments.

Capitalize "VF" consistently.  I see "VF" and "vf" used randomly.

Sysfs changes require documention updates, e.g., in
Documentation/ABI/testing/sysfs-bus-pci.

Longpeng(Mike) Nov. 13, 2022, 1:47 p.m. UTC | #3

Hi leon,

在 2022/11/12 0:39, Leon Romanovsky 写道:
> On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
>> From: Longpeng <longpeng2@huawei.com>
>>
>> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
>> this operation needs to spend lots of time if there has a large amount of VFs.
>>                                                              
>> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
>> concurrently would cost about 200-250ms. However most of them are not need to be
>> used at the moment, so we can enable SRIOV first but add VFs on demand.
> 
> It is unclear what took 200-250ms, is it physical VF creation or bind of
> the driver to these VFs?
>
It is neither. In our test, we already created physical VFs before, so 
we skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver 
only probes PF, it just returns an error if the function is VF.

The hotspot is the sriov_add_vfs (but no driver probe in fact) which is 
a long procedure. Each step costs only a little, but the total cost is 
not acceptable in some time-sensitive cases.

What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we 
can mostly support 10 concurrent calls if there has 10 PFs.

> If the latter, you can try with sriov_drivers_autoprobe set to true. This is how
> ennoblement of large SR-IOV systems is done.
>  > Also PCI spec declares "VF enable" bit, which is applicable to all VFs
> associated to that PF, see section "9.3.3.3.1 VF Enable".
> 
> Thanks
> 
>>
>> This series introduces two interfaces:
>> 1. sriov_numvfs_no_scan: enable SRIOV without add the VFs.
>> 2. sriov_scan_vf_id: add a specific VF.
>>
>> Longpeng (4):
>>    pci/sriov: extract sriov_numvfs common helper
>>    pci/sriov: add vf_bitmap to mark the vf id allocation
>>    pci/sriov: add sriov_numfs_no_scan interface
>>    pci/sriov: add sriov_scan_vf_id interface
>>
>>   drivers/pci/iov.c | 162 +++++++++++++++++++++++++++++++++++++++++-----
>>   drivers/pci/pci.h |   1 +
>>   2 files changed, 148 insertions(+), 15 deletions(-)
>>
>> -- 
>> 2.23.0
>>
> .

Longpeng(Mike) Nov. 13, 2022, 1:49 p.m. UTC | #4

在 2022/11/12 7:07, Bjorn Helgaas 写道:
> On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
>> From: Longpeng <longpeng2@huawei.com>
>>
>> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
>> this operation needs to spend lots of time if there has a large amount of VFs.
>>                                                              
>> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
>> concurrently would cost about 200-250ms. However most of them are not need to be
>> used at the moment, so we can enable SRIOV first but add VFs on demand.
>>
>> This series introduces two interfaces:
>> 1. sriov_numvfs_no_scan: enable SRIOV without add the VFs.
>> 2. sriov_scan_vf_id: add a specific VF.
>>
>> Longpeng (4):
>>    pci/sriov: extract sriov_numvfs common helper
>>    pci/sriov: add vf_bitmap to mark the vf id allocation
>>    pci/sriov: add sriov_numfs_no_scan interface
>>    pci/sriov: add sriov_scan_vf_id interface
> 
> When you respond to Leon's questions, please also spend a little time
> to look at the iov.c history and coding style and follow it, so we
> don't have to waste time commenting on or fixing trivial things.
> 
Okay, thanks for your valuable suggestions.

> For example, there are no previous subject lines for that file that
> start with "pci/sriov".  Don't make up a new prefix; use what's been
> done in the past.  And follow the style of capitalizing the first word
> after the prefix.
> 
> There are a few comments that look like they are more than 80 columns.
> Again unlike everything else in the file.
> 
> The PCIe spec spells it "SR-IOV", not "sriov".  Do the same in your
> commit logs and comments.
> 
> Capitalize "VF" consistently.  I see "VF" and "vf" used randomly.
> 
> Sysfs changes require documention updates, e.g., in
> Documentation/ABI/testing/sysfs-bus-pci.
> .

Leon Romanovsky Nov. 14, 2022, 7:04 a.m. UTC | #5

On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> Hi leon,
> 
> 在 2022/11/12 0:39, Leon Romanovsky 写道:
> > On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
> > > From: Longpeng <longpeng2@huawei.com>
> > > 
> > > We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
> > > this operation needs to spend lots of time if there has a large amount of VFs.
> > > For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
> > > concurrently would cost about 200-250ms. However most of them are not need to be
> > > used at the moment, so we can enable SRIOV first but add VFs on demand.
> > 
> > It is unclear what took 200-250ms, is it physical VF creation or bind of
> > the driver to these VFs?
> > 
> It is neither. In our test, we already created physical VFs before, so we
> skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only
> probes PF, it just returns an error if the function is VF.

It means that you didn't try sriov_drivers_autoprobe. Once it is set to
true, It won't even try to probe VFs.

> 
> The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a
> long procedure. Each step costs only a little, but the total cost is not
> acceptable in some time-sensitive cases.

This is also cryptic to me. In standard SR-IOV deployment, all VFs are
created and configured while operator booted the machine with sriov_drivers_autoprobe
set to false. Once this machine is ready, VFs are assigned to relevant VMs/users
through orchestration SW (IMHO, it is supported by all orchestration SW). 

And only last part (assigning to users) is time-sensitive operation.

> 
> What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can
> mostly support 10 concurrent calls if there has 10 PFs.

I wondered, are you using real HW? or QEMU SR-IOV? What is your server
that supports such large number of VFs?

BTW, Your change will probably break all SR-IOV devices in the market as
they rely on PCI subsystem to have VFs ready and configured.

Thanks

Longpeng(Mike) Nov. 14, 2022, 12:38 p.m. UTC | #6

在 2022/11/14 15:04, Leon Romanovsky 写道:
> On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>> Hi leon,
>>
>> 在 2022/11/12 0:39, Leon Romanovsky 写道:
>>> On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
>>>> From: Longpeng <longpeng2@huawei.com>
>>>>
>>>> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
>>>> this operation needs to spend lots of time if there has a large amount of VFs.
>>>> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
>>>> concurrently would cost about 200-250ms. However most of them are not need to be
>>>> used at the moment, so we can enable SRIOV first but add VFs on demand.
>>>
>>> It is unclear what took 200-250ms, is it physical VF creation or bind of
>>> the driver to these VFs?
>>>
>> It is neither. In our test, we already created physical VFs before, so we
>> skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only
>> probes PF, it just returns an error if the function is VF.
> 
> It means that you didn't try sriov_drivers_autoprobe. Once it is set to
> true, It won't even try to probe VFs.
> 
>>
>> The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a
>> long procedure. Each step costs only a little, but the total cost is not
>> acceptable in some time-sensitive cases.
> 
> This is also cryptic to me. In standard SR-IOV deployment, all VFs are
> created and configured while operator booted the machine with sriov_drivers_autoprobe
> set to false. Once this machine is ready, VFs are assigned to relevant VMs/users
> through orchestration SW (IMHO, it is supported by all orchestration SW).
> 
> And only last part (assigning to users) is time-sensitive operation.
> 
The VF creation and configuration are also time-sensitive in some cases, 
for example, the hypervisor live update case (such as [1]):
  save VMs -> kexec -> restore VMs

After the new kernel starts, the VFs must be added into the system, and 
then assign the original VFs to the QEMU. This means we must enable all 
2K+ VFs at once and increase the downtime.

If we can enable the VFs that are used by existing VMs then restore the 
VMs and enable other unused VFs at last, the downtime would be 
significantly reduced.

[1] 
https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf

>>
>> What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can
>> mostly support 10 concurrent calls if there has 10 PFs.
> 
> I wondered, are you using real HW? or QEMU SR-IOV? What is your server
> that supports such large number of VFs?
> 
Physical device. Some devices in the market support the large number of 
VFs, especially in the hardware offloading area, e.g DPU/IPU. I think 
the SR-IOV software should keep pace with times too.

> BTW, Your change will probably break all SR-IOV devices in the market as
> they rely on PCI subsystem to have VFs ready and configured.
> 
I see, but maybe this change could be a choice for some users.

> Thanks
> .

Leon Romanovsky Nov. 14, 2022, 1:09 p.m. UTC | #7

On Mon, Nov 14, 2022 at 08:38:42PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> 
> 
> 在 2022/11/14 15:04, Leon Romanovsky 写道:
> > On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> > > Hi leon,
> > > 
> > > 在 2022/11/12 0:39, Leon Romanovsky 写道:
> > > > On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
> > > > > From: Longpeng <longpeng2@huawei.com>
> > > > > 
> > > > > We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
> > > > > this operation needs to spend lots of time if there has a large amount of VFs.
> > > > > For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
> > > > > concurrently would cost about 200-250ms. However most of them are not need to be
> > > > > used at the moment, so we can enable SRIOV first but add VFs on demand.
> > > > 
> > > > It is unclear what took 200-250ms, is it physical VF creation or bind of
> > > > the driver to these VFs?
> > > > 
> > > It is neither. In our test, we already created physical VFs before, so we
> > > skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only
> > > probes PF, it just returns an error if the function is VF.
> > 
> > It means that you didn't try sriov_drivers_autoprobe. Once it is set to
> > true, It won't even try to probe VFs.
> > 
> > > 
> > > The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a
> > > long procedure. Each step costs only a little, but the total cost is not
> > > acceptable in some time-sensitive cases.
> > 
> > This is also cryptic to me. In standard SR-IOV deployment, all VFs are
> > created and configured while operator booted the machine with sriov_drivers_autoprobe
> > set to false. Once this machine is ready, VFs are assigned to relevant VMs/users
> > through orchestration SW (IMHO, it is supported by all orchestration SW).
> > 
> > And only last part (assigning to users) is time-sensitive operation.
> > 
> The VF creation and configuration are also time-sensitive in some cases, for
> example, the hypervisor live update case (such as [1]):
>  save VMs -> kexec -> restore VMs
> 
> After the new kernel starts, the VFs must be added into the system, and then
> assign the original VFs to the QEMU. This means we must enable all 2K+ VFs
> at once and increase the downtime.
> 
> If we can enable the VFs that are used by existing VMs then restore the VMs
> and enable other unused VFs at last, the downtime would be significantly
> reduced.
> 
> [1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf

Like it is written in presentation, the standard way of doing it is done
by VFIO live migration feature, where 2K+ VMs are migrated to another server
at the time first server is scheduled for maintenance.

However, even in live update case mentioned in the presentation, you
should disable ALL PFs/VFs and enable ALL PFs/VFs at the same time,
so you don't need per-VF id enable knob.

> 
> > > 
> > > What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can
> > > mostly support 10 concurrent calls if there has 10 PFs.
> > 
> > I wondered, are you using real HW? or QEMU SR-IOV? What is your server
> > that supports such large number of VFs?
> > 
> Physical device. Some devices in the market support the large number of VFs,
> especially in the hardware offloading area, e.g DPU/IPU. I think the SR-IOV
> software should keep pace with times too.

Our devices (and Intel too) support many VFs too. The thing is that
servers are unlikely to be able to support 10 physical devices with 2K+
VFs. There are many limitations that will make such is not usable.
Like, global MSI-X pool and PCI bandwidth to support all these devices.

> 
> > BTW, Your change will probably break all SR-IOV devices in the market as
> > they rely on PCI subsystem to have VFs ready and configured.
> > 
> I see, but maybe this change could be a choice for some users.

It should come with relevant driver changes and very strong justification why
such functionality is needed now and can't be achieved by anything else
except user-facing sysfs.

I don't see anything in this presentation and discussion that supports
need of such UAPI.

Thanks

> 
> > Thanks
> > .

Longpeng(Mike) Nov. 14, 2022, 2:06 p.m. UTC | #8

在 2022/11/14 21:09, Leon Romanovsky 写道:
> On Mon, Nov 14, 2022 at 08:38:42PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>
>>
>> 在 2022/11/14 15:04, Leon Romanovsky 写道:
>>> On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>>> Hi leon,
>>>>
>>>> 在 2022/11/12 0:39, Leon Romanovsky 写道:
>>>>> On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
>>>>>> From: Longpeng <longpeng2@huawei.com>
>>>>>>
>>>>>> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
>>>>>> this operation needs to spend lots of time if there has a large amount of VFs.
>>>>>> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
>>>>>> concurrently would cost about 200-250ms. However most of them are not need to be
>>>>>> used at the moment, so we can enable SRIOV first but add VFs on demand.
>>>>>
>>>>> It is unclear what took 200-250ms, is it physical VF creation or bind of
>>>>> the driver to these VFs?
>>>>>
>>>> It is neither. In our test, we already created physical VFs before, so we
>>>> skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only
>>>> probes PF, it just returns an error if the function is VF.
>>>
>>> It means that you didn't try sriov_drivers_autoprobe. Once it is set to
>>> true, It won't even try to probe VFs.
>>>
>>>>
>>>> The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a
>>>> long procedure. Each step costs only a little, but the total cost is not
>>>> acceptable in some time-sensitive cases.
>>>
>>> This is also cryptic to me. In standard SR-IOV deployment, all VFs are
>>> created and configured while operator booted the machine with sriov_drivers_autoprobe
>>> set to false. Once this machine is ready, VFs are assigned to relevant VMs/users
>>> through orchestration SW (IMHO, it is supported by all orchestration SW).
>>>
>>> And only last part (assigning to users) is time-sensitive operation.
>>>
>> The VF creation and configuration are also time-sensitive in some cases, for
>> example, the hypervisor live update case (such as [1]):
>>   save VMs -> kexec -> restore VMs
>>
>> After the new kernel starts, the VFs must be added into the system, and then
>> assign the original VFs to the QEMU. This means we must enable all 2K+ VFs
>> at once and increase the downtime.
>>
>> If we can enable the VFs that are used by existing VMs then restore the VMs
>> and enable other unused VFs at last, the downtime would be significantly
>> reduced.
>>
>> [1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf
> 
> Like it is written in presentation, the standard way of doing it is done
> by VFIO live migration feature, where 2K+ VMs are migrated to another server
> at the time first server is scheduled for maintenance.
> 
Live migration is not the best choice in production environment, it's 
too heavy. Some cloud providers prefer to using hypervisor live update 
in their system, such as AWS's nitro hypervisor.

> However, even in live update case mentioned in the presentation, you
> should disable ALL PFs/VFs and enable ALL PFs/VFs at the same time,
> so you don't need per-VF id enable knob.
> 
The presentation is just a reference, some points could be optimized 
including disable PFs/VFs and enable PFs/VFs.

Hypervisor live update can finish in less than 1 second, so the cost of 
disabling PFs/VFs and enabling PFs/VFs (~200-250ms or even worst) is too 
high.

>>
>>>>
>>>> What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can
>>>> mostly support 10 concurrent calls if there has 10 PFs.
>>>
>>> I wondered, are you using real HW? or QEMU SR-IOV? What is your server
>>> that supports such large number of VFs?
>>>
>> Physical device. Some devices in the market support the large number of VFs,
>> especially in the hardware offloading area, e.g DPU/IPU. I think the SR-IOV
>> software should keep pace with times too.
> 
> Our devices (and Intel too) support many VFs too. The thing is that
> servers are unlikely to be able to support 10 physical devices with 2K+
> VFs. There are many limitations that will make such is not usable.
> Like, global MSI-X pool and PCI bandwidth to support all these devices.
> 
>>
>>> BTW, Your change will probably break all SR-IOV devices in the market as
>>> they rely on PCI subsystem to have VFs ready and configured.
>>>
>> I see, but maybe this change could be a choice for some users.
> 
> It should come with relevant driver changes and very strong justification why
> such functionality is needed now and can't be achieved by anything else
> except user-facing sysfs.
> 
Adding 2K+ VFs to the sysfs need too much time.

Look at the bottomhalf of the hypervisor live update:
kexec --> add 2K VFs --> restore VMs

The downtime can be reduced if the sequence is:
kexec --> add 100 VFs（the VMs used） --> resotre VMs --> add 1.9K VFs


> I don't see anything in this presentation and discussion that supports
> need of such UAPI.
>  > Thanks
> 
>>
>>> Thanks
>>> .
> .

Leon Romanovsky Nov. 14, 2022, 2:20 p.m. UTC | #9

On Mon, Nov 14, 2022 at 10:06:49PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> 
> 
> 在 2022/11/14 21:09, Leon Romanovsky 写道:
> > On Mon, Nov 14, 2022 at 08:38:42PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> > > 
> > > 
> > > 在 2022/11/14 15:04, Leon Romanovsky 写道:
> > > > On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> > > > > Hi leon,
> > > > > 
> > > > > 在 2022/11/12 0:39, Leon Romanovsky 写道:
> > > > > > On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
> > > > > > > From: Longpeng <longpeng2@huawei.com>
> > > > > > > 
> > > > > > > We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
> > > > > > > this operation needs to spend lots of time if there has a large amount of VFs.
> > > > > > > For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
> > > > > > > concurrently would cost about 200-250ms. However most of them are not need to be
> > > > > > > used at the moment, so we can enable SRIOV first but add VFs on demand.
> > > > > > 
> > > > > > It is unclear what took 200-250ms, is it physical VF creation or bind of
> > > > > > the driver to these VFs?
> > > > > > 
> > > > > It is neither. In our test, we already created physical VFs before, so we
> > > > > skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only
> > > > > probes PF, it just returns an error if the function is VF.
> > > > 
> > > > It means that you didn't try sriov_drivers_autoprobe. Once it is set to
> > > > true, It won't even try to probe VFs.
> > > > 
> > > > > 
> > > > > The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a
> > > > > long procedure. Each step costs only a little, but the total cost is not
> > > > > acceptable in some time-sensitive cases.
> > > > 
> > > > This is also cryptic to me. In standard SR-IOV deployment, all VFs are
> > > > created and configured while operator booted the machine with sriov_drivers_autoprobe
> > > > set to false. Once this machine is ready, VFs are assigned to relevant VMs/users
> > > > through orchestration SW (IMHO, it is supported by all orchestration SW).
> > > > 
> > > > And only last part (assigning to users) is time-sensitive operation.
> > > > 
> > > The VF creation and configuration are also time-sensitive in some cases, for
> > > example, the hypervisor live update case (such as [1]):
> > >   save VMs -> kexec -> restore VMs
> > > 
> > > After the new kernel starts, the VFs must be added into the system, and then
> > > assign the original VFs to the QEMU. This means we must enable all 2K+ VFs
> > > at once and increase the downtime.
> > > 
> > > If we can enable the VFs that are used by existing VMs then restore the VMs
> > > and enable other unused VFs at last, the downtime would be significantly
> > > reduced.
> > > 
> > > [1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf
> > 
> > Like it is written in presentation, the standard way of doing it is done
> > by VFIO live migration feature, where 2K+ VMs are migrated to another server
> > at the time first server is scheduled for maintenance.
> > 
> Live migration is not the best choice in production environment, it's too
> heavy. Some cloud providers prefer to using hypervisor live update in their
> system, such as AWS's nitro hypervisor.

How is AWS nitro relevant to our discussion about adding sysfs file to Linux?
Can you please point us to the source code of that hypervisor? Does it even
run on Linux?

Anyway, I'm aware of big cloud providers who are pretty happy with live
migration in production.

> 
> > However, even in live update case mentioned in the presentation, you
> > should disable ALL PFs/VFs and enable ALL PFs/VFs at the same time,
> > so you don't need per-VF id enable knob.
> > 
> The presentation is just a reference, some points could be optimized
> including disable PFs/VFs and enable PFs/VFs.
> 
> Hypervisor live update can finish in less than 1 second, so the cost of
> disabling PFs/VFs and enabling PFs/VFs (~200-250ms or even worst) is too
> high.
> 
> > > 
> > > > > 
> > > > > What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can
> > > > > mostly support 10 concurrent calls if there has 10 PFs.
> > > > 
> > > > I wondered, are you using real HW? or QEMU SR-IOV? What is your server
> > > > that supports such large number of VFs?
> > > > 
> > > Physical device. Some devices in the market support the large number of VFs,
> > > especially in the hardware offloading area, e.g DPU/IPU. I think the SR-IOV
> > > software should keep pace with times too.
> > 
> > Our devices (and Intel too) support many VFs too. The thing is that
> > servers are unlikely to be able to support 10 physical devices with 2K+
> > VFs. There are many limitations that will make such is not usable.
> > Like, global MSI-X pool and PCI bandwidth to support all these devices.
> > 
> > > 
> > > > BTW, Your change will probably break all SR-IOV devices in the market as
> > > > they rely on PCI subsystem to have VFs ready and configured.
> > > > 
> > > I see, but maybe this change could be a choice for some users.
> > 
> > It should come with relevant driver changes and very strong justification why
> > such functionality is needed now and can't be achieved by anything else
> > except user-facing sysfs.
> > 
> Adding 2K+ VFs to the sysfs need too much time.
> 
> Look at the bottomhalf of the hypervisor live update:
> kexec --> add 2K VFs --> restore VMs
> 
> The downtime can be reduced if the sequence is:
> kexec --> add 100 VFs（the VMs used） --> resotre VMs --> add 1.9K VFs

Addition of VFs is serial operation, you can fire your VMs once you
counted 100 VFs in sysfs directory.

> 
> 
> > I don't see anything in this presentation and discussion that supports
> > need of such UAPI.
> >  > Thanks
> > 
> > > 
> > > > Thanks
> > > > .
> > .

Longpeng(Mike) Nov. 15, 2022, 1:38 a.m. UTC | #10

在 2022/11/14 22:20, Leon Romanovsky 写道:
> On Mon, Nov 14, 2022 at 10:06:49PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>
>>
>> 在 2022/11/14 21:09, Leon Romanovsky 写道:
>>> On Mon, Nov 14, 2022 at 08:38:42PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>>>
>>>>
>>>> 在 2022/11/14 15:04, Leon Romanovsky 写道:
>>>>> On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>>>>> Hi leon,
>>>>>>
>>>>>> 在 2022/11/12 0:39, Leon Romanovsky 写道:
>>>>>>> On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
>>>>>>>> From: Longpeng <longpeng2@huawei.com>
>>>>>>>>
>>>>>>>> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
>>>>>>>> this operation needs to spend lots of time if there has a large amount of VFs.
>>>>>>>> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
>>>>>>>> concurrently would cost about 200-250ms. However most of them are not need to be
>>>>>>>> used at the moment, so we can enable SRIOV first but add VFs on demand.
>>>>>>>
>>>>>>> It is unclear what took 200-250ms, is it physical VF creation or bind of
>>>>>>> the driver to these VFs?
>>>>>>>
>>>>>> It is neither. In our test, we already created physical VFs before, so we
>>>>>> skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only
>>>>>> probes PF, it just returns an error if the function is VF.
>>>>>
>>>>> It means that you didn't try sriov_drivers_autoprobe. Once it is set to
>>>>> true, It won't even try to probe VFs.
>>>>>
>>>>>>
>>>>>> The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a
>>>>>> long procedure. Each step costs only a little, but the total cost is not
>>>>>> acceptable in some time-sensitive cases.
>>>>>
>>>>> This is also cryptic to me. In standard SR-IOV deployment, all VFs are
>>>>> created and configured while operator booted the machine with sriov_drivers_autoprobe
>>>>> set to false. Once this machine is ready, VFs are assigned to relevant VMs/users
>>>>> through orchestration SW (IMHO, it is supported by all orchestration SW).
>>>>>
>>>>> And only last part (assigning to users) is time-sensitive operation.
>>>>>
>>>> The VF creation and configuration are also time-sensitive in some cases, for
>>>> example, the hypervisor live update case (such as [1]):
>>>>    save VMs -> kexec -> restore VMs
>>>>
>>>> After the new kernel starts, the VFs must be added into the system, and then
>>>> assign the original VFs to the QEMU. This means we must enable all 2K+ VFs
>>>> at once and increase the downtime.
>>>>
>>>> If we can enable the VFs that are used by existing VMs then restore the VMs
>>>> and enable other unused VFs at last, the downtime would be significantly
>>>> reduced.
>>>>
>>>> [1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf
>>>
>>> Like it is written in presentation, the standard way of doing it is done
>>> by VFIO live migration feature, where 2K+ VMs are migrated to another server
>>> at the time first server is scheduled for maintenance.
>>>
>> Live migration is not the best choice in production environment, it's too
>> heavy. Some cloud providers prefer to using hypervisor live update in their
>> system, such as AWS's nitro hypervisor.
> 
> How is AWS nitro relevant to our discussion about adding sysfs file to Linux?
> Can you please point us to the source code of that hypervisor? Does it even
> run on Linux?
> 
Um...You can google for more information about the AWS nitro system.

Yes, it's digressive, so let's back to the discussion about adding sysfs 
file.

> Anyway, I'm aware of big cloud providers who are pretty happy with live
> migration in production.
> 
We're having trouble coming to an agreement on this point, but it does't 
matter. Please see below.

>>
>>> However, even in live update case mentioned in the presentation, you
>>> should disable ALL PFs/VFs and enable ALL PFs/VFs at the same time,
>>> so you don't need per-VF id enable knob.
>>>
>> The presentation is just a reference, some points could be optimized
>> including disable PFs/VFs and enable PFs/VFs.
>>
>> Hypervisor live update can finish in less than 1 second, so the cost of
>> disabling PFs/VFs and enabling PFs/VFs (~200-250ms or even worst) is too
>> high.
>>
>>>>
>>>>>>
>>>>>> What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can
>>>>>> mostly support 10 concurrent calls if there has 10 PFs.
>>>>>
>>>>> I wondered, are you using real HW? or QEMU SR-IOV? What is your server
>>>>> that supports such large number of VFs?
>>>>>
>>>> Physical device. Some devices in the market support the large number of VFs,
>>>> especially in the hardware offloading area, e.g DPU/IPU. I think the SR-IOV
>>>> software should keep pace with times too.
>>>
>>> Our devices (and Intel too) support many VFs too. The thing is that
>>> servers are unlikely to be able to support 10 physical devices with 2K+
>>> VFs. There are many limitations that will make such is not usable.
>>> Like, global MSI-X pool and PCI bandwidth to support all these devices.
>>>
>>>>
>>>>> BTW, Your change will probably break all SR-IOV devices in the market as
>>>>> they rely on PCI subsystem to have VFs ready and configured.
>>>>>
>>>> I see, but maybe this change could be a choice for some users.
>>>
>>> It should come with relevant driver changes and very strong justification why
>>> such functionality is needed now and can't be achieved by anything else
>>> except user-facing sysfs.
>>>
>> Adding 2K+ VFs to the sysfs need too much time.
>>
>> Look at the bottomhalf of the hypervisor live update:
>> kexec --> add 2K VFs --> restore VMs
>>
>> The downtime can be reduced if the sequence is:
>> kexec --> add 100 VFs（the VMs used） --> resotre VMs --> add 1.9K VFs
> 
> Addition of VFs is serial operation, you can fire your VMs once you
> counted 100 VFs in sysfs directory.
> 
According to the current implementation, the addition of VFs must be in 
order, so this can not properly work.

For example, the VM uses VF200, VF202, VF204, but the sriov_add_vfs can 
only add VFs in the order VF0, VF1, VF2 ... The limitation is introduced 
by the software, not the PCI spec.

>>
>>
>>> I don't see anything in this presentation and discussion that supports
>>> need of such UAPI.
>>>   > Thanks
>>>
>>>>
>>>>> Thanks
>>>>> .
>>> .
> .

Oliver O'Halloran Nov. 15, 2022, 1:50 a.m. UTC | #11

On Tue, Nov 15, 2022 at 1:27 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> *snip*
>
> Anyway, I'm aware of big cloud providers who are pretty happy with live
> migration in production.

I could see someone sufficiently cloudbrained deciding that rebooting
the hypervisor is fine provided the downtime doesn't violate any
customer uptime SLAs. Personally I'd only be brave enough to do that
for a HV hosting internal services which I know are behind a load
balancer, but apparently there are people at Huawei far braver than I.

> *snip*
>
> > Adding 2K+ VFs to the sysfs need too much time.
> >
> > Look at the bottomhalf of the hypervisor live update:
> > kexec --> add 2K VFs --> restore VMs
> >
> > The downtime can be reduced if the sequence is:
> > kexec --> add 100 VFs（the VMs used） --> resotre VMs --> add 1.9K VFs
>
> Addition of VFs is serial operation, you can fire your VMs once you
> counted 100 VFs in sysfs directory.

I don't know if making that kind of assumption about the behaviour of
sysfs is better or worse than just adding another knob. If at some
point in the future the initialisation of VF pci_devs was moved to a
workqueue or something we'd be violating that assumption without
breaking any of the documented ABI. I guess you could argue that VFs
being added sequentially is "ABI", but userspace has always been told
not to make assumptions about when sysfs attributes (or nodes, I
guess) appear since doing so is prone to races.

Oliver O'Halloran Nov. 15, 2022, 2:06 a.m. UTC | #12

On Tue, Nov 15, 2022 at 1:08 AM Longpeng (Mike, Cloud Infrastructure
Service Product Dept.) <longpeng2@huawei.com> wrote:
>
> *snip*
>
> Adding 2K+ VFs to the sysfs need too much time.
>
> Look at the bottomhalf of the hypervisor live update:
> kexec --> add 2K VFs --> restore VMs
>
> The downtime can be reduced if the sequence is:
> kexec --> add 100 VFs（the VMs used） --> resotre VMs --> add 1.9K VFs

Right, so you want to add VFs in batches rather than all at once.
Personally I think the bitmap approach is error prone since it renders
the meaning of pf_dev->sriov->num_VFs unclear and there's some hairy
code in arch/powerpc/ that approach will likely break. A better
approach would be to add an attribute to control the number of VFs
enabled in hardware and allowing sriov_numvfs to accept any number
between the current value and sriov_hw_numvfs. e.g. your HV setup
would look something like:

echo 2048 > sriov_hw_numvfs
echo 100 > sriov_numvfs

# time passes
echo 2048 > sriov_numvfs

This would be fairly simple to implement and you can make it backwards
compatible by having writes to sriov_numvfs retain their current
semantics if sriov_hw_numvfs is zero.

Leon Romanovsky Nov. 15, 2022, 8:32 a.m. UTC | #13

On Tue, Nov 15, 2022 at 12:50:34PM +1100, Oliver O'Halloran wrote:
> On Tue, Nov 15, 2022 at 1:27 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > *snip*
> >
> > Anyway, I'm aware of big cloud providers who are pretty happy with live
> > migration in production.
> 
> I could see someone sufficiently cloudbrained deciding that rebooting
> the hypervisor is fine provided the downtime doesn't violate any
> customer uptime SLAs. Personally I'd only be brave enough to do that
> for a HV hosting internal services which I know are behind a load
> balancer, but apparently there are people at Huawei far braver than I.

My main point in this discussion that Huawei team doesn't actually
provide any meaningful justification why it is great idea to add new
sysfs file. They use HPC as an argument, but in that world, you won't
see many VMs on one server, as it is important to provide separate MSI-X
vectors and CPUs to each VM.

They ask from us optimization (do not add device hierarchy for existing HW)
that doesn't exist in the kernel.

I would say that they are trying to meld SIOV architecture of subfunctions
(SFs) into PCI and SR-IOV world.

Thanks

Leon Romanovsky Nov. 15, 2022, 10:02 a.m. UTC | #14

On Tue, Nov 15, 2022 at 05:36:38PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> 
> 
> 在 2022/11/15 16:32, Leon Romanovsky 写道:
> > On Tue, Nov 15, 2022 at 12:50:34PM +1100, Oliver O'Halloran wrote:
> > > On Tue, Nov 15, 2022 at 1:27 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > 
> > > > *snip*
> > > > 
> > > > Anyway, I'm aware of big cloud providers who are pretty happy with live
> > > > migration in production.
> > > 
> > > I could see someone sufficiently cloudbrained deciding that rebooting
> > > the hypervisor is fine provided the downtime doesn't violate any
> > > customer uptime SLAs. Personally I'd only be brave enough to do that
> > > for a HV hosting internal services which I know are behind a load
> > > balancer, but apparently there are people at Huawei far braver than I.
> > 
> > My main point in this discussion that Huawei team doesn't actually
> > provide any meaningful justification why it is great idea to add new
> > sysfs file. They use HPC as an argument, but in that world, you won't
> > see many VMs on one server, as it is important to provide separate MSI-X
> > vectors and CPUs to each VM.
> > 
> > They ask from us optimization (do not add device hierarchy for existing HW)
> > that doesn't exist in the kernel.
> > 
> > I would say that they are trying to meld SIOV architecture of subfunctions
> > (SFs) into PCI and SR-IOV world.
> > 
> I may not agree with you on this point. 

The bright side of open source that you don't need to agree with everyone.
If you success to convince others, it will be merged.

> The sriov_numvfs interface mixes the
> operation of enabling hardware VFs and the addition of VFs. I just want to
> split these two operations and also can selectively add some VFs earlier
> than others. I think there's no violation of PCI spec.

Right, it just doesn't fit into Linux kernel device initialization model.

Thanks

> 
> > Thanks
> > .

Oliver O'Halloran Nov. 15, 2022, 12:49 p.m. UTC | #15

On Tue, Nov 15, 2022 at 7:32 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Nov 15, 2022 at 12:50:34PM +1100, Oliver O'Halloran wrote:
> > On Tue, Nov 15, 2022 at 1:27 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > *snip*
> > >
> > > Anyway, I'm aware of big cloud providers who are pretty happy with live
> > > migration in production.
> >
> > I could see someone sufficiently cloudbrained deciding that rebooting
> > the hypervisor is fine provided the downtime doesn't violate any
> > customer uptime SLAs. Personally I'd only be brave enough to do that
> > for a HV hosting internal services which I know are behind a load
> > balancer, but apparently there are people at Huawei far braver than I.
>
> My main point in this discussion that Huawei team doesn't actually
> provide any meaningful justification why it is great idea to add new
> sysfs file.

All their arguments seem to be based on trying to reduce the
time-to-VMs when a hypervisor is kexec()ed, which is a pretty
reasonable justification IMO.  I do have some reservations about the
numbers they're claiming since 250ms for initializing struct pci_dev's
for the VFs seems excessive. Unfortunately, I don't have any hardware
that supports 2048 VFs on hand so I can't verify that claim.

> They use HPC as an argument but in that world, you won't
> see many VMs on one server, as it is important to provide separate MSI-X
> vectors and CPUs to each VM.

I don't think HPC has come up in this thread, but assuming it has: In
the cloud "HPC" usually means "it has timesliced access to GPUs".
Having 2k VMs sharing one or more GPUs on a single system isn't
necessarily advisable, but if we assume only a subset of those VMs
will actually need access to a GPU at any given time it's sort of
reasonable.

> They ask from us optimization (do not add device hierarchy for existing HW)
> that doesn't exist in the kernel.
>
> I would say that they are trying to meld SIOV architecture of subfunctions
> (SFs) into PCI and SR-IOV world.

I don't know what asks you're referring to, but they're not present in
this thread. I'm going to give Longpeng the benefit of the doubt and
assume that this series is an attempt to fix a problem he's facing
with actual hardware that exists today. To say they should have
implemented the device with SIOV (proprietary to Intel until March
this year) rather than SR-IOV (standardised by PCI-SIG over a decade
ago) is not terribly helpful to anyone. Additionally, SIOV exists
largely to solve a problem that's an issue because Intel decided that
all PCI devices should exist within a single PCI domain. If you don't
have that problem SIOV is a lot less compelling.