[vfio] vfio/pci: remove msi domain on msi disable

Message ID 20230914191406.54656-1-shannon.nelson@amd.com
State New
Headers
Series [vfio] vfio/pci: remove msi domain on msi disable |

Commit Message

Nelson, Shannon Sept. 14, 2023, 7:14 p.m. UTC
  The new MSI dynamic allocation machinery is great for making the irq
management more flexible.  It includes caching information about the
MSI domain which gets reused on each new open of a VFIO fd.  However,
this causes an issue when the underlying hardware has flexible MSI-x
configurations, as a changed configuration doesn't get seen between
new opens, and is only refreshed between PCI unbind/bind cycles.

In our device we can change the per-VF MSI-x resource allocation
without the need for rebooting or function reset.  For example,

  1. Initial power up and kernel boot:
	# lspci -s 2e:00.1 -vv | grep MSI-X
	        Capabilities: [a0] MSI-X: Enable+ Count=8 Masked-

  2. Device VF configuration change happens with no reset

  3. New MSI-x count value seen:
	# lspci -s 2e:00.1 -vv | grep MSI-X
		Capabilities: [a0] MSI-X: Enable- Count=64 Masked-

This allows for per-VF dynamic reconfiguration of interrupt resources
for the VMs using the VFIO devices supported by our hardware.

The problem comes where the dynamic IRQ management created the MSI
domain when the VFIO device creates the first IRQ in the first ioctl()
VFIO_DEVICE_SET_IRQS request.  The current MSI-x count (hwsize) is read
when setting up the irq vectors under pci_alloc_irq_vectors_affinity(),
and the MSI domain information is set up, which includes the hwsize.

When the VFIO fd is closed, the IRQs are removed, but the MSI domain
information is kept for later use since we're only closing the current
VFIO fd, not unbinding the PCI device connection.  When a new VFIO fd
open happens and a new VFIO_DEVICE_SET_IRQS request comes down, the cycle
starts again, reusing the existing MSI domain with the previous hwsize.
This is fine until this new QEMU instance has read the new larger MSI-x
count from PCI config space (QEMU:vfio_msi_enable()) and tries to create
more IRQs than were available before.  We fail in msi_insert_desc()
because the MSI domain still is set up for the earlier hwsize and has no
room for the n+1 IRQ.

This can be easily fixed by simply adding msi_remove_device_irq_domain()
into vfio_msi_disable() which is called when the VFIO IRQs are removed
either by an ioctl() call from QEMU or from the VFIO fd close.  This forces
the MSI domain to be recreated with the new MSI-x count on the next
VFIO_DEVICE_SET_IRQS request.

Link: https://lore.kernel.org/all/cover.1683740667.git.reinette.chatre@intel.com/
Link: https://lore.kernel.org/r/20221124232325.798556374@linutronix.de
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
---
 drivers/vfio/pci/vfio_pci_intrs.c | 1 +
 1 file changed, 1 insertion(+)
  

Comments

Jason Gunthorpe Sept. 18, 2023, 2:17 p.m. UTC | #1
On Thu, Sep 14, 2023 at 12:14:06PM -0700, Shannon Nelson wrote:
> The new MSI dynamic allocation machinery is great for making the irq
> management more flexible.  It includes caching information about the
> MSI domain which gets reused on each new open of a VFIO fd.  However,
> this causes an issue when the underlying hardware has flexible MSI-x
> configurations, as a changed configuration doesn't get seen between
> new opens, and is only refreshed between PCI unbind/bind cycles.
> 
> In our device we can change the per-VF MSI-x resource allocation
> without the need for rebooting or function reset.  For example,
> 
>   1. Initial power up and kernel boot:
> 	# lspci -s 2e:00.1 -vv | grep MSI-X
> 	        Capabilities: [a0] MSI-X: Enable+ Count=8 Masked-
> 
>   2. Device VF configuration change happens with no reset

Is this an out of tree driver problem?

The intree way to alter the MSI configuration is via
sriov_set_msix_vec_count, and there is only one in-tree driver that
uses it right now.

If something is going wrong here it should be fixed in the
sriov_set_msix_vec_count() machinery, possibly in the pci core to
synchronize the msi_domain view of the world.

Jason
  
Nelson, Shannon Sept. 18, 2023, 5:48 p.m. UTC | #2
On 9/18/2023 7:17 AM, Jason Gunthorpe wrote:
> 
> On Thu, Sep 14, 2023 at 12:14:06PM -0700, Shannon Nelson wrote:
>> The new MSI dynamic allocation machinery is great for making the irq
>> management more flexible.  It includes caching information about the
>> MSI domain which gets reused on each new open of a VFIO fd.  However,
>> this causes an issue when the underlying hardware has flexible MSI-x
>> configurations, as a changed configuration doesn't get seen between
>> new opens, and is only refreshed between PCI unbind/bind cycles.
>>
>> In our device we can change the per-VF MSI-x resource allocation
>> without the need for rebooting or function reset.  For example,
>>
>>    1. Initial power up and kernel boot:
>>        # lspci -s 2e:00.1 -vv | grep MSI-X
>>                Capabilities: [a0] MSI-X: Enable+ Count=8 Masked-
>>
>>    2. Device VF configuration change happens with no reset
> 
> Is this an out of tree driver problem?

No, not an out-of-tree driver, this is the vfio pci core.

> 
> The intree way to alter the MSI configuration is via
> sriov_set_msix_vec_count, and there is only one in-tree driver that
> uses it right now.
> 
> If something is going wrong here it should be fixed in the
> sriov_set_msix_vec_count() machinery, possibly in the pci core to
> synchronize the msi_domain view of the world.
> 
> Jason

The sriov_set_msix_vec_count method assumes
  (a) the unbind/bind cycle on the VF, and
  (b) VF MSIx count change configured from the host
neither of which are the case in our situation.

In our case, the VF device's msix count value found in PCI config space 
is changed by device configuration management outside of the baremetal 
host and read by the QEMU instance when it starts up, and then read by 
the vfio PCI core when QEMU requests the first IRQ.  The core code 
enables the msix range on first IRQ request, and disables it when the 
vfio fd is closed.  It creates the msi_domain on the first call if it 
doesn't exist, but it does not remove the msi_domain when the irqs are 
disabled.

The IRQ request call trace looks like:
QEMU: vfio_msix_vector_do_use->ioctl()
   (driver.vfio_device_ops.ioctl = vfio_pci_core_ioctl)
   vfio_pci_core_ioctl
     vfio_pci_ioctl_set_irqs
       vfio_pci_set_irqs_ioctl
         vfio_pci_set_msi_trigger
           vfio_msi_enable
             pci_alloc_irq_vectors
               pci_alloc_irq_vectors_affinity
                 __pci_enable_msix_range
                   pci_setup_msix_device_domain
                         return if msi_domain exists
                     pci_create_device_domain
                       msi_create_device_irq_domain
                         __msi_create_irq_domain - sets info->hwsize
		  msi_capability_init
		    msi_setup_msi_desc
		      msi_insert_msi_desc
                         msi_domain_insert_msi_desc
                           msi_insert_desc
			    fail if index >= hwsize

On close of the vfio fd, the trace is:
QEMU: close()
   driver.vfio_device_ops.close
     vfio_pci_core_close_device
       vfio_pci_core_disable
         vfio_pci_set_irqs_ioctl(vdev, VFIO_IRQ_SET_DATA_NONE |
				VFIO_IRQ_SET_ACTION_TRIGGER,
				vdev->irq_type, 0, 0, NULL);
           vfio_pci_set_msi_trigger
             vfio_msi_disable
               pci_free_irq_vectors

The msix vectors are freed, but the msi_domain is not, and the 
msi_domain holds the MSIx count that it read when it was created.  If 
the device's MSIx count is increased, the next QEMU session will see the 
new number in PCI config space and try to use that new larger number, 
but the msi_domain is still using the smaller hwsize and the QEMU IRQ 
setup fails in msi_insert_desc().

This patch adds a msi_remove_device_irq_domain() call when the irqs are 
disabled in order to force a new read on the next IRQ allocation cycle. 
This is limited to only the vfio use of the msi_domain.

I suppose we could add this to the trailing end of callbacks in our own 
driver, but this looks more like a generic vfio/msi issue than a driver 
specific thing.

The other possibility is to force the user to always do a bind cycle 
between QEMU sessions using the VF.  This seems to be unnecessary 
overhead and was not necessary when using the v6.1 kernel.  To the user, 
this looks like a regression - this is how it was reported to me.

sln
  
Thomas Gleixner Sept. 18, 2023, 6:43 p.m. UTC | #3
On Mon, Sep 18 2023 at 11:17, Jason Gunthorpe wrote:
> On Thu, Sep 14, 2023 at 12:14:06PM -0700, Shannon Nelson wrote:
>> The new MSI dynamic allocation machinery is great for making the irq
>> management more flexible.  It includes caching information about the
>> MSI domain which gets reused on each new open of a VFIO fd.  However,
>> this causes an issue when the underlying hardware has flexible MSI-x
>> configurations, as a changed configuration doesn't get seen between
>> new opens, and is only refreshed between PCI unbind/bind cycles.
>> 
>> In our device we can change the per-VF MSI-x resource allocation
>> without the need for rebooting or function reset.  For example,
>> 
>>   1. Initial power up and kernel boot:
>> 	# lspci -s 2e:00.1 -vv | grep MSI-X
>> 	        Capabilities: [a0] MSI-X: Enable+ Count=8 Masked-
>> 
>>   2. Device VF configuration change happens with no reset
>
> Is this an out of tree driver problem?
>
> The intree way to alter the MSI configuration is via
> sriov_set_msix_vec_count, and there is only one in-tree driver that
> uses it right now.

Right, but that only addresses the driver specific issues.

> If something is going wrong here it should be fixed in the
> sriov_set_msix_vec_count() machinery, possibly in the pci core to
> synchronize the msi_domain view of the world.

Right, we should definitely not do that on a per driver basis.

Thanks,

        tglx
  
Jason Gunthorpe Sept. 18, 2023, 11:32 p.m. UTC | #4
On Mon, Sep 18, 2023 at 10:48:54AM -0700, Nelson, Shannon wrote:

> In our case, the VF device's msix count value found in PCI config space is
> changed by device configuration management outside of the baremetal host and
> read by the QEMU instance when it starts up, and then read by the vfio PCI
> core when QEMU requests the first IRQ.  

Oh, you definitely can't do that!

PCI config space is not allowed to change outside the OS's view and we
added sriov_set_msix_vec_count() specifically as a way to provide the
necessary synchronization between all the parts.

Randomly changing, what should be immutable, parts of the config space
from under a running OS is just non-compliant PCI behavior.

> The msix vectors are freed, but the msi_domain is not, and the msi_domain
> holds the MSIx count that it read when it was created.  If the device's MSIx
> count is increased, the next QEMU session will see the new number in PCI
> config space and try to use that new larger number, but the msi_domain is
> still using the smaller hwsize and the QEMU IRQ setup fails in
> msi_insert_desc().

Correct, devices are not allowed to change these parameters
autonomously, so there is no reason to accommodate this.

> This patch adds a msi_remove_device_irq_domain() call when the irqs are
> disabled in order to force a new read on the next IRQ allocation cycle. This
> is limited to only the vfio use of the msi_domain.

Definately no.
 
> I suppose we could add this to the trailing end of callbacks in our own
> driver, but this looks more like a generic vfio/msi issue than a driver
> specific thing.

Certainly not.
 
> The other possibility is to force the user to always do a bind cycle between
> QEMU sessions using the VF.  This seems to be unnecessary overhead and was
> not necessary when using the v6.1 kernel.  To the user, this looks like a
> regression - this is how it was reported to me.

You need to use sriov_set_msix_vec_count() and only
sriov_set_msix_vec_count() to change this parameter or I expect you
will constantly experience problems.

Jason
  
Jason Gunthorpe Sept. 18, 2023, 11:37 p.m. UTC | #5
On Mon, Sep 18, 2023 at 08:43:21PM +0200, Thomas Gleixner wrote:
> On Mon, Sep 18 2023 at 11:17, Jason Gunthorpe wrote:
> > On Thu, Sep 14, 2023 at 12:14:06PM -0700, Shannon Nelson wrote:
> >> The new MSI dynamic allocation machinery is great for making the irq
> >> management more flexible.  It includes caching information about the
> >> MSI domain which gets reused on each new open of a VFIO fd.  However,
> >> this causes an issue when the underlying hardware has flexible MSI-x
> >> configurations, as a changed configuration doesn't get seen between
> >> new opens, and is only refreshed between PCI unbind/bind cycles.
> >> 
> >> In our device we can change the per-VF MSI-x resource allocation
> >> without the need for rebooting or function reset.  For example,
> >> 
> >>   1. Initial power up and kernel boot:
> >> 	# lspci -s 2e:00.1 -vv | grep MSI-X
> >> 	        Capabilities: [a0] MSI-X: Enable+ Count=8 Masked-
> >> 
> >>   2. Device VF configuration change happens with no reset
> >
> > Is this an out of tree driver problem?
> >
> > The intree way to alter the MSI configuration is via
> > sriov_set_msix_vec_count, and there is only one in-tree driver that
> > uses it right now.
> 
> Right, but that only addresses the driver specific issues.

Sort of.. sriov_vf_msix_count_store() is intended to be the entry
point for this and if the kernel grows places that cache the value or
something then this function should flush those caches too.

I suppose flushing happens implicitly because Shannon reports that
things work fine if the driver is rebound. Since
sriov_vf_msix_count_store() ensures there is no driver bound before
proceeding it probe/unprobe must be flushing out everything?

Jason
  
Thomas Gleixner Sept. 18, 2023, 11:47 p.m. UTC | #6
On Mon, Sep 18 2023 at 20:37, Jason Gunthorpe wrote:
> On Mon, Sep 18, 2023 at 08:43:21PM +0200, Thomas Gleixner wrote:
>> On Mon, Sep 18 2023 at 11:17, Jason Gunthorpe wrote:
>> > On Thu, Sep 14, 2023 at 12:14:06PM -0700, Shannon Nelson wrote:
>> >> The new MSI dynamic allocation machinery is great for making the irq
>> >> management more flexible.  It includes caching information about the
>> >> MSI domain which gets reused on each new open of a VFIO fd.  However,
>> >> this causes an issue when the underlying hardware has flexible MSI-x
>> >> configurations, as a changed configuration doesn't get seen between
>> >> new opens, and is only refreshed between PCI unbind/bind cycles.
>> >> 
>> >> In our device we can change the per-VF MSI-x resource allocation
>> >> without the need for rebooting or function reset.  For example,
>> >> 
>> >>   1. Initial power up and kernel boot:
>> >> 	# lspci -s 2e:00.1 -vv | grep MSI-X
>> >> 	        Capabilities: [a0] MSI-X: Enable+ Count=8 Masked-
>> >> 
>> >>   2. Device VF configuration change happens with no reset
>> >
>> > Is this an out of tree driver problem?
>> >
>> > The intree way to alter the MSI configuration is via
>> > sriov_set_msix_vec_count, and there is only one in-tree driver that
>> > uses it right now.
>> 
>> Right, but that only addresses the driver specific issues.
>
> Sort of.. sriov_vf_msix_count_store() is intended to be the entry
> point for this and if the kernel grows places that cache the value or
> something then this function should flush those caches too.

Sorry. What I wanted to say is that the driver callback is not the right
place to reload the MSI domains after the change.

> I suppose flushing happens implicitly because Shannon reports that
> things work fine if the driver is rebound. Since
> sriov_vf_msix_count_store() ensures there is no driver bound before
> proceeding it probe/unprobe must be flushing out everything?

Correct. So sriov_set_msix_vec_count() could just do:

	ret = pdev->driver->sriov_set_msix_vec_count(vf_dev, val);
        if (!ret)
        	teardown_msi_domain(pdev);

Right?

Thanks,

        tglx
  
Jason Gunthorpe Sept. 19, 2023, 12:02 a.m. UTC | #7
On Tue, Sep 19, 2023 at 01:47:37AM +0200, Thomas Gleixner wrote:
> On Mon, Sep 18 2023 at 20:37, Jason Gunthorpe wrote:
> > On Mon, Sep 18, 2023 at 08:43:21PM +0200, Thomas Gleixner wrote:
> >> On Mon, Sep 18 2023 at 11:17, Jason Gunthorpe wrote:
> >> > On Thu, Sep 14, 2023 at 12:14:06PM -0700, Shannon Nelson wrote:
> >> >> The new MSI dynamic allocation machinery is great for making the irq
> >> >> management more flexible.  It includes caching information about the
> >> >> MSI domain which gets reused on each new open of a VFIO fd.  However,
> >> >> this causes an issue when the underlying hardware has flexible MSI-x
> >> >> configurations, as a changed configuration doesn't get seen between
> >> >> new opens, and is only refreshed between PCI unbind/bind cycles.
> >> >> 
> >> >> In our device we can change the per-VF MSI-x resource allocation
> >> >> without the need for rebooting or function reset.  For example,
> >> >> 
> >> >>   1. Initial power up and kernel boot:
> >> >> 	# lspci -s 2e:00.1 -vv | grep MSI-X
> >> >> 	        Capabilities: [a0] MSI-X: Enable+ Count=8 Masked-
> >> >> 
> >> >>   2. Device VF configuration change happens with no reset
> >> >
> >> > Is this an out of tree driver problem?
> >> >
> >> > The intree way to alter the MSI configuration is via
> >> > sriov_set_msix_vec_count, and there is only one in-tree driver that
> >> > uses it right now.
> >> 
> >> Right, but that only addresses the driver specific issues.
> >
> > Sort of.. sriov_vf_msix_count_store() is intended to be the entry
> > point for this and if the kernel grows places that cache the value or
> > something then this function should flush those caches too.
> 
> Sorry. What I wanted to say is that the driver callback is not the right
> place to reload the MSI domains after the change.

Oh, that isn't even what Shannon's patch does, it patched VFIO's main
PCI driver - not a sriov_set_msix_vec_count() callback :( Shannon's
scenario doesn't even use sriov_vf_msix_count_store() at all - the AMD
device just randomly changes its MSI count whenever it likes.

> > I suppose flushing happens implicitly because Shannon reports that
> > things work fine if the driver is rebound. Since
> > sriov_vf_msix_count_store() ensures there is no driver bound before
> > proceeding it probe/unprobe must be flushing out everything?
> 
> Correct. So sriov_set_msix_vec_count() could just do:
> 
> 	ret = pdev->driver->sriov_set_msix_vec_count(vf_dev, val);
>         if (!ret)
>         	teardown_msi_domain(pdev);
>
> Right?

It subtly isn't needed, sriov_vf_msix_count_store() already requires
no driver is associated with the device and this:

int msi_setup_device_data(struct device *dev)
{
	struct msi_device_data *md;
	int ret, i;

	if (dev->msi.data)
		return 0;

	md = devres_alloc(msi_device_data_release, sizeof(*md), GFP_KERNEL);
	if (!md)
		return -ENOMEM;

Already ensured that msi_remove_device_irq_domain() was called via
msi_device_data_release() triggering as part of the devm shutdown of
the bound driver.

So, the intree mechanism to change the MSI vector size works. The
crazy mechanism where the device just changes its value without
synchronizing to the OS does not.

I don't think we need to try and fix that..

Jason
  
Nelson, Shannon Sept. 19, 2023, 12:13 a.m. UTC | #8
On 9/18/2023 4:32 PM, Jason Gunthorpe wrote:
> 
> On Mon, Sep 18, 2023 at 10:48:54AM -0700, Nelson, Shannon wrote:
> 
>> In our case, the VF device's msix count value found in PCI config space is
>> changed by device configuration management outside of the baremetal host and
>> read by the QEMU instance when it starts up, and then read by the vfio PCI
>> core when QEMU requests the first IRQ.
> 
> Oh, you definitely can't do that!
> 
> PCI config space is not allowed to change outside the OS's view and we
> added sriov_set_msix_vec_count() specifically as a way to provide the
> necessary synchronization between all the parts.
> 
> Randomly changing, what should be immutable, parts of the config space
> from under a running OS is just non-compliant PCI behavior.

Hmmm... I guess I need to have a little chat with my friendly HW/FW folks.

Thanks,
sln
  
Thomas Gleixner Sept. 19, 2023, 12:25 a.m. UTC | #9
On Mon, Sep 18 2023 at 21:02, Jason Gunthorpe wrote:
> On Tue, Sep 19, 2023 at 01:47:37AM +0200, Thomas Gleixner wrote:
>> >> > The intree way to alter the MSI configuration is via
>> >> > sriov_set_msix_vec_count, and there is only one in-tree driver that
>> >> > uses it right now.
>> >> 
>> >> Right, but that only addresses the driver specific issues.
>> >
>> > Sort of.. sriov_vf_msix_count_store() is intended to be the entry
>> > point for this and if the kernel grows places that cache the value or
>> > something then this function should flush those caches too.
>> 
>> Sorry. What I wanted to say is that the driver callback is not the right
>> place to reload the MSI domains after the change.
>
> Oh, that isn't even what Shannon's patch does, it patched VFIO's main
> PCI driver - not a sriov_set_msix_vec_count() callback :( Shannon's
> scenario doesn't even use sriov_vf_msix_count_store() at all - the AMD
> device just randomly changes its MSI count whenever it likes.

Ooops. When real hardware changes things behind the kernels back we
consider it a hardware bug. The same applies to virtualization muck.

So all we should do is add some code which yells when the "hardware"
plays silly buggers.

>> > I suppose flushing happens implicitly because Shannon reports that
>> > things work fine if the driver is rebound. Since
>> > sriov_vf_msix_count_store() ensures there is no driver bound before
>> > proceeding it probe/unprobe must be flushing out everything?
>> 
>> Correct. So sriov_set_msix_vec_count() could just do:
>> 
>> 	ret = pdev->driver->sriov_set_msix_vec_count(vf_dev, val);
>>         if (!ret)
>>         	teardown_msi_domain(pdev);
>>
>> Right?
>
> It subtly isn't needed, sriov_vf_msix_count_store() already requires
> no driver is associated with the device and this:
>
> int msi_setup_device_data(struct device *dev)
> {
> 	struct msi_device_data *md;
> 	int ret, i;
>
> 	if (dev->msi.data)
> 		return 0;
>
> 	md = devres_alloc(msi_device_data_release, sizeof(*md), GFP_KERNEL);
> 	if (!md)
> 		return -ENOMEM;
>
> Already ensured that msi_remove_device_irq_domain() was called via
> msi_device_data_release() triggering as part of the devm shutdown of
> the bound driver.

Indeed.

> So, the intree mechanism to change the MSI vector size works. The
> crazy mechanism where the device just changes its value without
> synchronizing to the OS does not.
>
> I don't think we need to try and fix that..

We might want to detect it and yell about it, right?

Thanks,

        tglx
  
Jason Gunthorpe Sept. 19, 2023, 12:32 a.m. UTC | #10
On Tue, Sep 19, 2023 at 02:25:32AM +0200, Thomas Gleixner wrote:

> > I don't think we need to try and fix that..
> 
> We might want to detect it and yell about it, right?

It strikes me as a good idea, yes. If it doesn't cost anything.

Jason
  
Thomas Gleixner Sept. 19, 2023, 12:57 a.m. UTC | #11
On Mon, Sep 18 2023 at 21:32, Jason Gunthorpe wrote:
> On Tue, Sep 19, 2023 at 02:25:32AM +0200, Thomas Gleixner wrote:
>
>> > I don't think we need to try and fix that..
>> 
>> We might want to detect it and yell about it, right?
>
> It strikes me as a good idea, yes. If it doesn't cost anything.

It should not be expensive in the interrupt allocation/deallocation
path, where hardware needs to be accessed anyway. So one extra read is
not the end of the world.

Thanks,

        tglx
  

Patch

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index cbb4bcbfbf83..f66d5e7e078b 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -538,6 +538,7 @@  static void vfio_msi_disable(struct vfio_pci_core_device *vdev, bool msix)
 	cmd = vfio_pci_memory_lock_and_enable(vdev);
 	pci_free_irq_vectors(pdev);
 	vfio_pci_memory_unlock_and_restore(vdev, cmd);
+	msi_remove_device_irq_domain(&pdev->dev, MSI_DEFAULT_DOMAIN);
 
 	/*
 	 * Both disable paths above use pci_intx_for_msi() to clear DisINTx