diff mbox series

[V4,10/11] vfio/pci: Support dynamic MSI-X

Message ID	c4c582970fbeaf4b6000845c400aa4c6b7bb2f13.1682615447.git.reinette.chatre@intel.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: Reinette Chatre <reinette.chatre@intel.com> To: jgg@nvidia.com, yishaih@nvidia.com, shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com, alex.williamson@redhat.com Cc: tglx@linutronix.de, darwi@linutronix.de, kvm@vger.kernel.org, dave.jiang@intel.com, jing2.liu@intel.com, ashok.raj@intel.com, fenghua.yu@intel.com, tom.zanussi@linux.intel.com, reinette.chatre@intel.com, linux-kernel@vger.kernel.org Subject: [PATCH V4 10/11] vfio/pci: Support dynamic MSI-X Date: Thu, 27 Apr 2023 10:36:07 -0700 Message-Id: <c4c582970fbeaf4b6000845c400aa4c6b7bb2f13.1682615447.git.reinette.chatre@intel.com> In-Reply-To: <cover.1682615447.git.reinette.chatre@intel.com> References: <cover.1682615447.git.reinette.chatre@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	vfio/pci: Support dynamic allocation of MSI-X interrupts \| [V4,00/11] vfio/pci: Support dynamic allocation of MSI-X interrupts [V4,01/11] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable [V4,02/11] vfio/pci: Remove negative check on unsigned vector [V4,03/11] vfio/pci: Prepare for dynamic interrupt context storage [V4,04/11] vfio/pci: Move to single error path [V4,05/11] vfio/pci: Use xarray for interrupt context storage [V4,06/11] vfio/pci: Remove interrupt context counter [V4,07/11] vfio/pci: Update stale comment [V4,08/11] vfio/pci: Use bitfield for struct vfio_pci_core_device flags [V4,09/11] vfio/pci: Probe and store ability to support dynamic MSI-X [V4,10/11] vfio/pci: Support dynamic MSI-X [V4,11/11] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X

Commit Message

Reinette Chatre April 27, 2023, 5:36 p.m. UTC

  pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be
allocated after MSI-X enabling.

Use dynamic MSI-X (if supported by the device) to allocate an interrupt
after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
the time a valid eventfd is assigned. This is different behavior from
a range provided during MSI-X enabling where interrupts are allocated
for the entire range whether a valid eventfd is provided for each
interrupt or not.

The PCI-MSIX API requires that some number of irqs are allocated for
an initial set of vectors when enabling MSI-X on the device. When
dynamic MSIX allocation is not supported, the vector table, and thus
the allocated irq set can only be resized by disabling and re-enabling
MSI-X with a different range. In that case the irq allocation is
essentially a cache for configuring vectors within the previously
allocated vector range. When dynamic MSI-X allocation is supported,
the API still requires some initial set of irqs to be allocated, but
also supports allocating and freeing specific irq vectors both
within and beyond the initially allocated range.

For consistency between modes, as well as to reduce latency and improve
reliability of allocations, and also simplicity, this implementation
only releases irqs via pci_free_irq_vectors() when either the interrupt
mode changes or the device is released.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/lkml/20230403211841.0e206b67.alex.williamson@redhat.com/
---
Changes since V3:
- Remove vfio_msi_free_irq(). (Alex)
- Rework changelog. (Alex)

Changes since V2:
- Move vfio_irq_ctx_free() to earlier in series to support
  earlier usage. (Alex)
- Use consistent terms in changelog: MSI-x changed to MSI-X.
- Make dynamic interrupt context creation generic across all
  MSI/MSI-X interrupts. This resulted in code moving to earlier
  in series as part of xarray introduction patch. (Alex)
- Remove the local allow_dyn_alloc and direct calling of
  pci_msix_can_alloc_dyn(), use the new vdev->has_dyn_msix
  introduced earlier instead. (Alex)
- Stop tracking new allocations (remove "new_ctx"). (Alex)
- Introduce new wrapper that returns Linux interrupt number or
  dynamically allocate a new interrupt. Wrapper can be used for
  all interrupt cases. (Alex)
- Only free dynamic MSI-X interrupts on MSI-X teardown. (Alex)

Changes since RFC V1:
- Add pointer to interrupt context as function parameter to
  vfio_irq_ctx_free(). (Alex)
- Initialize new_ctx to false. (Dan Carpenter)
- Only support dynamic allocation if device supports it. (Alex)

 drivers/vfio/pci/vfio_pci_intrs.c | 47 +++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 6 deletions(-)

Comments

Tian, Kevin April 28, 2023, 6:50 a.m. UTC | #1

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, April 28, 2023 1:36 AM
> 
> pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be
> allocated after MSI-X enabling.
> 
> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
> the time a valid eventfd is assigned. This is different behavior from
> a range provided during MSI-X enabling where interrupts are allocated
> for the entire range whether a valid eventfd is provided for each
> interrupt or not.
> 
> The PCI-MSIX API requires that some number of irqs are allocated for
> an initial set of vectors when enabling MSI-X on the device. When
> dynamic MSIX allocation is not supported, the vector table, and thus
> the allocated irq set can only be resized by disabling and re-enabling
> MSI-X with a different range. In that case the irq allocation is
> essentially a cache for configuring vectors within the previously
> allocated vector range. When dynamic MSI-X allocation is supported,
> the API still requires some initial set of irqs to be allocated, but
> also supports allocating and freeing specific irq vectors both
> within and beyond the initially allocated range.
> 
> For consistency between modes, as well as to reduce latency and improve
> reliability of allocations, and also simplicity, this implementation
> only releases irqs via pci_free_irq_vectors() when either the interrupt
> mode changes or the device is released.

It improves the reliability of allocations from the calling device p.o.v.

But system-wide this is not efficient use of irqs and not releasing them
timely may affect the reliability of allocations for other devices.

Should this behavior be something configurable?

> 
> +/*
> + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
> + * If a Linux IRQ number is not available then a new interrupt will be
> + * allocated if dynamic MSI-X is supported.
> + */
> +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
> +			      unsigned int vector, bool msix)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct msi_map map;
> +	int irq;
> +	u16 cmd;
> +
> +	irq = pci_irq_vector(pdev, vector);
> +	if (irq > 0 || !msix || !vdev->has_dyn_msix)
> +		return irq;

if (irq >= 0 || ...)

> +
> +/*
> + * Where is vfio_msi_free_irq() ?
> + *
> + * Allocated interrupts are maintained, essentially forming a cache that
> + * subsequent allocations can draw from. Interrupts are freed using
> + * pci_free_irq_vectors() when MSI/MSI-X is disabled.
> + */

Probably merge it with the comment of vfio_msi_alloc_irq()?

> @@ -401,6 +430,12 @@ static int vfio_msi_set_vector_signal(struct
> vfio_pci_core_device *vdev,
>  	if (fd < 0)
>  		return 0;
> 
> +	if (irq == -EINVAL) {
> +		irq = vfio_msi_alloc_irq(vdev, vector, msix);
> +		if (irq < 0)
> +			return irq;
> +	}
> +
>  	ctx = vfio_irq_ctx_alloc(vdev, vector);
>  	if (!ctx)
>  		return -ENOMEM;

This doesn't read clean that an irq is allocated but not released
in the error unwind.

Reinette Chatre April 28, 2023, 6:35 p.m. UTC | #2

Hi Kevin,

On 4/27/2023 11:50 PM, Tian, Kevin wrote:
>> From: Chatre, Reinette <reinette.chatre@intel.com>
>> Sent: Friday, April 28, 2023 1:36 AM
>>
>> pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be
>> allocated after MSI-X enabling.
>>
>> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
>> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
>> the time a valid eventfd is assigned. This is different behavior from
>> a range provided during MSI-X enabling where interrupts are allocated
>> for the entire range whether a valid eventfd is provided for each
>> interrupt or not.
>>
>> The PCI-MSIX API requires that some number of irqs are allocated for
>> an initial set of vectors when enabling MSI-X on the device. When
>> dynamic MSIX allocation is not supported, the vector table, and thus
>> the allocated irq set can only be resized by disabling and re-enabling
>> MSI-X with a different range. In that case the irq allocation is
>> essentially a cache for configuring vectors within the previously
>> allocated vector range. When dynamic MSI-X allocation is supported,
>> the API still requires some initial set of irqs to be allocated, but
>> also supports allocating and freeing specific irq vectors both
>> within and beyond the initially allocated range.
>>
>> For consistency between modes, as well as to reduce latency and improve
>> reliability of allocations, and also simplicity, this implementation
>> only releases irqs via pci_free_irq_vectors() when either the interrupt
>> mode changes or the device is released.
> 
> It improves the reliability of allocations from the calling device p.o.v.
> 
> But system-wide this is not efficient use of irqs and not releasing them
> timely may affect the reliability of allocations for other devices.

Could you please elaborate how other devices may be impacted?

> Should this behavior be something configurable?

This is not clear to me and I look to you for guidance here. From practical
side it looks like configuration via module parameters is supported but
whether it should be done is not clear to me.

When considering this we need to think about what the user may expect when
turning on/off the configuration. For example, MSI-X continues to allocate a
range of interrupts during enabling. These have always been treated as a
"cache" (interrupts remain allocated, whether they have an associated trigger
or not). If there is new configurable behavior, do you expect that the
driver needs to distinguish between the original "cache" that the user is
used to and the new dynamic allocations? That is, should a dynamic MSI-X
capable device always free interrupts when user space removes an eventfd
or should only interrupts that were allocated dynamically be freed dynamically?

>> +/*
>> + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
>> + * If a Linux IRQ number is not available then a new interrupt will be
>> + * allocated if dynamic MSI-X is supported.
>> + */
>> +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
>> +			      unsigned int vector, bool msix)
>> +{
>> +	struct pci_dev *pdev = vdev->pdev;
>> +	struct msi_map map;
>> +	int irq;
>> +	u16 cmd;
>> +
>> +	irq = pci_irq_vector(pdev, vector);
>> +	if (irq > 0 || !msix || !vdev->has_dyn_msix)
>> +		return irq;
> 
> if (irq >= 0 || ...)
> 

I am not sure about this request because pci_irq_vector() cannot return 0.
The Linux interrupt number will be > 0 on success. 0 means "not found"
(see msi_get_virq()), which is translated to -EINVAL by pci_irq_vector().

>> +
>> +/*
>> + * Where is vfio_msi_free_irq() ?
>> + *
>> + * Allocated interrupts are maintained, essentially forming a cache that
>> + * subsequent allocations can draw from. Interrupts are freed using
>> + * pci_free_irq_vectors() when MSI/MSI-X is disabled.
>> + */
> 
> Probably merge it with the comment of vfio_msi_alloc_irq()?

Sure, will do.

> 
>> @@ -401,6 +430,12 @@ static int vfio_msi_set_vector_signal(struct
>> vfio_pci_core_device *vdev,
>>  	if (fd < 0)
>>  		return 0;
>>
>> +	if (irq == -EINVAL) {
>> +		irq = vfio_msi_alloc_irq(vdev, vector, msix);
>> +		if (irq < 0)
>> +			return irq;
>> +	}
>> +
>>  	ctx = vfio_irq_ctx_alloc(vdev, vector);
>>  	if (!ctx)
>>  		return -ENOMEM;
> 
> This doesn't read clean that an irq is allocated but not released
> in the error unwind.

I can add a comment similar to the location where the trigger is released:
	/* Interrupt stays allocated, will be freed at MSI-X disable. */

Reinette

Tian, Kevin May 5, 2023, 8:10 a.m. UTC | #3

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Saturday, April 29, 2023 2:35 AM
> 
> Hi Kevin,
> 
> On 4/27/2023 11:50 PM, Tian, Kevin wrote:
> >> From: Chatre, Reinette <reinette.chatre@intel.com>
> >> Sent: Friday, April 28, 2023 1:36 AM
> >>
> >> pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be
> >> allocated after MSI-X enabling.
> >>
> >> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
> >> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
> >> the time a valid eventfd is assigned. This is different behavior from
> >> a range provided during MSI-X enabling where interrupts are allocated
> >> for the entire range whether a valid eventfd is provided for each
> >> interrupt or not.
> >>
> >> The PCI-MSIX API requires that some number of irqs are allocated for
> >> an initial set of vectors when enabling MSI-X on the device. When
> >> dynamic MSIX allocation is not supported, the vector table, and thus
> >> the allocated irq set can only be resized by disabling and re-enabling
> >> MSI-X with a different range. In that case the irq allocation is
> >> essentially a cache for configuring vectors within the previously
> >> allocated vector range. When dynamic MSI-X allocation is supported,
> >> the API still requires some initial set of irqs to be allocated, but
> >> also supports allocating and freeing specific irq vectors both
> >> within and beyond the initially allocated range.
> >>
> >> For consistency between modes, as well as to reduce latency and improve
> >> reliability of allocations, and also simplicity, this implementation
> >> only releases irqs via pci_free_irq_vectors() when either the interrupt
> >> mode changes or the device is released.
> >
> > It improves the reliability of allocations from the calling device p.o.v.
> >
> > But system-wide this is not efficient use of irqs and not releasing them
> > timely may affect the reliability of allocations for other devices.
> 
> Could you please elaborate how other devices may be impacted?

the more this devices reserves the less remains for others, e.g. irte entries.

> 
> > Should this behavior be something configurable?
> 
> This is not clear to me and I look to you for guidance here. From practical
> side it looks like configuration via module parameters is supported but
> whether it should be done is not clear to me.
> 
> When considering this we need to think about what the user may expect
> when
> turning on/off the configuration. For example, MSI-X continues to allocate a
> range of interrupts during enabling. These have always been treated as a
> "cache" (interrupts remain allocated, whether they have an associated
> trigger
> or not). If there is new configurable behavior, do you expect that the
> driver needs to distinguish between the original "cache" that the user is
> used to and the new dynamic allocations? That is, should a dynamic MSI-X
> capable device always free interrupts when user space removes an eventfd
> or should only interrupts that were allocated dynamically be freed
> dynamically?

That looks tricky. Probably that is why Alex suggested doing this simple
scheme and it is on par with the old logic anyway. So I'll withdraw this
comment.

> 
> >> +/*
> >> + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
> >> + * If a Linux IRQ number is not available then a new interrupt will be
> >> + * allocated if dynamic MSI-X is supported.
> >> + */
> >> +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
> >> +			      unsigned int vector, bool msix)
> >> +{
> >> +	struct pci_dev *pdev = vdev->pdev;
> >> +	struct msi_map map;
> >> +	int irq;
> >> +	u16 cmd;
> >> +
> >> +	irq = pci_irq_vector(pdev, vector);
> >> +	if (irq > 0 || !msix || !vdev->has_dyn_msix)
> >> +		return irq;
> >
> > if (irq >= 0 || ...)
> >
> 
> I am not sure about this request because pci_irq_vector() cannot return 0.
> The Linux interrupt number will be > 0 on success. 0 means "not found"
> (see msi_get_virq()), which is translated to -EINVAL by pci_irq_vector().
> 

There is a subtle difference between the description and the code of
pci_irq_vector().

/**
 * pci_irq_vector() - Get Linux IRQ number of a device interrupt vector
 * @dev: the PCI device to operate on
 * @nr:  device-relative interrupt vector index (0-based); has different
 *       meanings, depending on interrupt mode:
 *
 *         * MSI-X     the index in the MSI-X vector table
 *         * MSI       the index of the enabled MSI vectors
 *         * INTx      must be 0
 *
 * Return: the Linux IRQ number, or -EINVAL if @nr is out of range
 */

From above '0' is a valid irq number.

then in following code:

	irq = msi_get_virq(&dev->dev, nr);
	return irq ? irq : -EINVAL;

'0' is obviously invalid for msi.

I didn't realize the msi part when reading the patch. It left me in
confusion that '0' is unhandled as here we only check ">0" while in
other places "-EINVAL" is checked.

Not big matter but it sounds slightly clearer to me to follow the
description of pci_irq_vector() instead of its internal detail.

Alex Williamson May 5, 2023, 3:28 p.m. UTC | #4

On Fri, 5 May 2023 08:10:33 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Chatre, Reinette <reinette.chatre@intel.com>
> > Sent: Saturday, April 29, 2023 2:35 AM
> > 
> > Hi Kevin,
> > 
> > On 4/27/2023 11:50 PM, Tian, Kevin wrote:  
> > >> From: Chatre, Reinette <reinette.chatre@intel.com>
> > >> Sent: Friday, April 28, 2023 1:36 AM
> > >>
> > >> pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be
> > >> allocated after MSI-X enabling.
> > >>
> > >> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
> > >> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
> > >> the time a valid eventfd is assigned. This is different behavior from
> > >> a range provided during MSI-X enabling where interrupts are allocated
> > >> for the entire range whether a valid eventfd is provided for each
> > >> interrupt or not.
> > >>
> > >> The PCI-MSIX API requires that some number of irqs are allocated for
> > >> an initial set of vectors when enabling MSI-X on the device. When
> > >> dynamic MSIX allocation is not supported, the vector table, and thus
> > >> the allocated irq set can only be resized by disabling and re-enabling
> > >> MSI-X with a different range. In that case the irq allocation is
> > >> essentially a cache for configuring vectors within the previously
> > >> allocated vector range. When dynamic MSI-X allocation is supported,
> > >> the API still requires some initial set of irqs to be allocated, but
> > >> also supports allocating and freeing specific irq vectors both
> > >> within and beyond the initially allocated range.
> > >>
> > >> For consistency between modes, as well as to reduce latency and improve
> > >> reliability of allocations, and also simplicity, this implementation
> > >> only releases irqs via pci_free_irq_vectors() when either the interrupt
> > >> mode changes or the device is released.  
> > >
> > > It improves the reliability of allocations from the calling device p.o.v.
> > >
> > > But system-wide this is not efficient use of irqs and not releasing them
> > > timely may affect the reliability of allocations for other devices.  
> > 
> > Could you please elaborate how other devices may be impacted?  
> 
> the more this devices reserves the less remains for others, e.g. irte entries.
> 
> >   
> > > Should this behavior be something configurable?  
> > 
> > This is not clear to me and I look to you for guidance here. From practical
> > side it looks like configuration via module parameters is supported but
> > whether it should be done is not clear to me.
> > 
> > When considering this we need to think about what the user may expect
> > when
> > turning on/off the configuration. For example, MSI-X continues to allocate a
> > range of interrupts during enabling. These have always been treated as a
> > "cache" (interrupts remain allocated, whether they have an associated
> > trigger
> > or not). If there is new configurable behavior, do you expect that the
> > driver needs to distinguish between the original "cache" that the user is
> > used to and the new dynamic allocations? That is, should a dynamic MSI-X
> > capable device always free interrupts when user space removes an eventfd
> > or should only interrupts that were allocated dynamically be freed
> > dynamically?  
> 
> That looks tricky. Probably that is why Alex suggested doing this simple
> scheme and it is on par with the old logic anyway. So I'll withdraw this
> comment.

Don't forget we're also releasing the irq reservations when the guest
changes interrupt mode, ex. reboot, so the "caching" is really only
within a session of the guest/userspace driver where it would be
unusual to have an unused reservation for an extended period.  Thanks,

Alex

Reinette Chatre May 5, 2023, 5:21 p.m. UTC | #5

Hi Kevin,

On 5/5/2023 1:10 AM, Tian, Kevin wrote:
>> From: Chatre, Reinette <reinette.chatre@intel.com>
>> Sent: Saturday, April 29, 2023 2:35 AM
>> On 4/27/2023 11:50 PM, Tian, Kevin wrote:
>>>> From: Chatre, Reinette <reinette.chatre@intel.com>
>>>> Sent: Friday, April 28, 2023 1:36 AM

...

>>>> +/*
>>>> + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
>>>> + * If a Linux IRQ number is not available then a new interrupt will be
>>>> + * allocated if dynamic MSI-X is supported.
>>>> + */
>>>> +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
>>>> +			      unsigned int vector, bool msix)
>>>> +{
>>>> +	struct pci_dev *pdev = vdev->pdev;
>>>> +	struct msi_map map;
>>>> +	int irq;
>>>> +	u16 cmd;
>>>> +
>>>> +	irq = pci_irq_vector(pdev, vector);
>>>> +	if (irq > 0 || !msix || !vdev->has_dyn_msix)
>>>> +		return irq;
>>>
>>> if (irq >= 0 || ...)
>>>
>>
>> I am not sure about this request because pci_irq_vector() cannot return 0.
>> The Linux interrupt number will be > 0 on success. 0 means "not found"
>> (see msi_get_virq()), which is translated to -EINVAL by pci_irq_vector().
>>
> 
> There is a subtle difference between the description and the code of
> pci_irq_vector().
> 
> /**
>  * pci_irq_vector() - Get Linux IRQ number of a device interrupt vector
>  * @dev: the PCI device to operate on
>  * @nr:  device-relative interrupt vector index (0-based); has different
>  *       meanings, depending on interrupt mode:
>  *
>  *         * MSI-X     the index in the MSI-X vector table
>  *         * MSI       the index of the enabled MSI vectors
>  *         * INTx      must be 0
>  *
>  * Return: the Linux IRQ number, or -EINVAL if @nr is out of range
>  */
> 
> From above '0' is a valid irq number.
> 
> then in following code:
> 
> 	irq = msi_get_virq(&dev->dev, nr);
> 	return irq ? irq : -EINVAL;
> 
> '0' is obviously invalid for msi.
> 
> I didn't realize the msi part when reading the patch. It left me in
> confusion that '0' is unhandled as here we only check ">0" while in
> other places "-EINVAL" is checked.
> 
> Not big matter but it sounds slightly clearer to me to follow the
> description of pci_irq_vector() instead of its internal detail. 

I can add an explicit check for '0' and, as you confirmed, this is
invalid for MSI and thus I think it should be treated as an error.
This is perhaps another candidate for a WARN considering that
pci_irq_vector() returning a '0' for MSI indicates a kernel problem .

I now consider taking guidance from pci_irq_get_affinity(). Note that
pci_irq_get_affinity() contains:

const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr)
{
	int idx, irq = pci_irq_vector(dev, nr);
	...
	if (WARN_ON_ONCE(irq <= 0))
		return NULL;
	...
}	


Would you be ok with something like below?

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index b549f5c97cb8..a8e96254f953 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -393,6 +393,8 @@ static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
 	u16 cmd;
 
 	irq = pci_irq_vector(pdev, vector);
+	if (WARN_ON_ONCE(irq == 0))
+		return -EINVAL;
 	if (irq > 0 || !msix || !vdev->has_dyn_msix)
 		return irq;

I would prefer that vfio_msi_alloc_irq() returns negative errors. This enables
callers to in turn just return the error code on failure (note that dynamic
allocation can return different error codes), not needing to translate 0 into
an error.

Reinette

Tian, Kevin May 6, 2023, 8:13 a.m. UTC | #6

> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Saturday, May 6, 2023 1:21 AM
> 
> Would you be ok with something like below?
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index b549f5c97cb8..a8e96254f953 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -393,6 +393,8 @@ static int vfio_msi_alloc_irq(struct
> vfio_pci_core_device *vdev,
>  	u16 cmd;
> 
>  	irq = pci_irq_vector(pdev, vector);
> +	if (WARN_ON_ONCE(irq == 0))
> +		return -EINVAL;
>  	if (irq > 0 || !msix || !vdev->has_dyn_msix)
>  		return irq;
> 
> I would prefer that vfio_msi_alloc_irq() returns negative errors. This enables
> callers to in turn just return the error code on failure (note that dynamic
> allocation can return different error codes), not needing to translate 0 into
> an error.
> 

This looks good to me. Thanks.

Tian, Kevin May 6, 2023, 8:15 a.m. UTC | #7

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, May 5, 2023 11:28 PM
> 
> On Fri, 5 May 2023 08:10:33 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Chatre, Reinette <reinette.chatre@intel.com>
> > > Sent: Saturday, April 29, 2023 2:35 AM
> > >
> > > Hi Kevin,
> > >
> > > On 4/27/2023 11:50 PM, Tian, Kevin wrote:
> > >
> > > > Should this behavior be something configurable?
> > >
> > > This is not clear to me and I look to you for guidance here. From practical
> > > side it looks like configuration via module parameters is supported but
> > > whether it should be done is not clear to me.
> > >
> > > When considering this we need to think about what the user may expect
> > > when
> > > turning on/off the configuration. For example, MSI-X continues to
> allocate a
> > > range of interrupts during enabling. These have always been treated as a
> > > "cache" (interrupts remain allocated, whether they have an associated
> > > trigger
> > > or not). If there is new configurable behavior, do you expect that the
> > > driver needs to distinguish between the original "cache" that the user is
> > > used to and the new dynamic allocations? That is, should a dynamic MSI-
> X
> > > capable device always free interrupts when user space removes an
> eventfd
> > > or should only interrupts that were allocated dynamically be freed
> > > dynamically?
> >
> > That looks tricky. Probably that is why Alex suggested doing this simple
> > scheme and it is on par with the old logic anyway. So I'll withdraw this
> > comment.
> 
> Don't forget we're also releasing the irq reservations when the guest
> changes interrupt mode, ex. reboot, so the "caching" is really only
> within a session of the guest/userspace driver where it would be
> unusual to have an unused reservation for an extended period.  Thanks,
> 

Yeah, that makes sense.

diff mbox series

Patch

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index bdda7f46c2be..8340135b09fa 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -372,27 +372,56 @@  static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 	return 0;
 }
 
+/*
+ * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
+ * If a Linux IRQ number is not available then a new interrupt will be
+ * allocated if dynamic MSI-X is supported.
+ */
+static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
+			      unsigned int vector, bool msix)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct msi_map map;
+	int irq;
+	u16 cmd;
+
+	irq = pci_irq_vector(pdev, vector);
+	if (irq > 0 || !msix || !vdev->has_dyn_msix)
+		return irq;
+
+	cmd = vfio_pci_memory_lock_and_enable(vdev);
+	map = pci_msix_alloc_irq_at(pdev, vector, NULL);
+	vfio_pci_memory_unlock_and_restore(vdev, cmd);
+
+	return map.index < 0 ? map.index : map.virq;
+}
+
+/*
+ * Where is vfio_msi_free_irq() ?
+ *
+ * Allocated interrupts are maintained, essentially forming a cache that
+ * subsequent allocations can draw from. Interrupts are freed using
+ * pci_free_irq_vectors() when MSI/MSI-X is disabled.
+ */
+
 static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 				      unsigned int vector, int fd, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_irq_ctx *ctx;
 	struct eventfd_ctx *trigger;
-	int irq, ret;
+	int irq = -EINVAL, ret;
 	u16 cmd;
 
-	irq = pci_irq_vector(pdev, vector);
-	if (irq < 0)
-		return -EINVAL;
-
 	ctx = vfio_irq_ctx_get(vdev, vector);
 
 	if (ctx) {
 		irq_bypass_unregister_producer(&ctx->producer);
-
+		irq = pci_irq_vector(pdev, vector);
 		cmd = vfio_pci_memory_lock_and_enable(vdev);
 		free_irq(irq, ctx->trigger);
 		vfio_pci_memory_unlock_and_restore(vdev, cmd);
+		/* Interrupt stays allocated, will be freed at MSI-X disable. */
 		kfree(ctx->name);
 		eventfd_ctx_put(ctx->trigger);
 		vfio_irq_ctx_free(vdev, ctx, vector);
@@ -401,6 +430,12 @@  static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 	if (fd < 0)
 		return 0;
 
+	if (irq == -EINVAL) {
+		irq = vfio_msi_alloc_irq(vdev, vector, msix);
+		if (irq < 0)
+			return irq;
+	}
+
 	ctx = vfio_irq_ctx_alloc(vdev, vector);
 	if (!ctx)
 		return -ENOMEM;