
[V3,09/10] vfio/pci: Support dynamic MSI-X

Message ID	86cda5cf2742feff3b14954284fb509863355050.1681837892.git.reinette.chatre@intel.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
	From: Reinette Chatre <reinette.chatre@intel.com>
	To: jgg@nvidia.com, yishaih@nvidia.com, shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com, alex.williamson@redhat.com
	Cc: tglx@linutronix.de, darwi@linutronix.de, kvm@vger.kernel.org, dave.jiang@intel.com, jing2.liu@intel.com, ashok.raj@intel.com, fenghua.yu@intel.com, tom.zanussi@linux.intel.com, reinette.chatre@intel.com, linux-kernel@vger.kernel.org
	Subject: [PATCH V3 09/10] vfio/pci: Support dynamic MSI-X
	Date: Tue, 18 Apr 2023 10:29:20 -0700
	Message-Id: <86cda5cf2742feff3b14954284fb509863355050.1681837892.git.reinette.chatre@intel.com>
	In-Reply-To: <cover.1681837892.git.reinette.chatre@intel.com>
	References: <cover.1681837892.git.reinette.chatre@intel.com>
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
	Precedence: bulk
Series	vfio/pci: Support dynamic allocation of MSI-X interrupts
	[V3,00/10] vfio/pci: Support dynamic allocation of MSI-X interrupts
	[V3,01/10] vfio/pci: Consolidate irq cleanup on MSI/MSI-X disable
	[V3,02/10] vfio/pci: Remove negative check on unsigned vector
	[V3,03/10] vfio/pci: Prepare for dynamic interrupt context storage
	[V3,04/10] vfio/pci: Move to single error path
	[V3,05/10] vfio/pci: Use xarray for interrupt context storage
	[V3,06/10] vfio/pci: Remove interrupt context counter
	[V3,07/10] vfio/pci: Update stale comment
	[V3,08/10] vfio/pci: Probe and store ability to support dynamic MSI-X
	[V3,09/10] vfio/pci: Support dynamic MSI-X
	[V3,10/10] vfio/pci: Clear VFIO_IRQ_INFO_NORESIZE for MSI-X

Commit Message

Reinette Chatre April 18, 2023, 5:29 p.m. UTC

Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
enable an individual MSI-X interrupt to be allocated and freed after
MSI-X enabling.

Use dynamic MSI-X (if supported by the device) to allocate an interrupt
after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
the time a valid eventfd is assigned. This is different behavior from
a range provided during MSI-X enabling where interrupts are allocated
for the entire range whether a valid eventfd is provided for each
interrupt or not.

Do not dynamically free interrupts; leave that to when MSI-X is
disabled.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/lkml/20230403211841.0e206b67.alex.williamson@redhat.com/
---
The get_cached_msi_msg()/pci_write_msi_msg() behavior is kept
similar to the scenario when MSI-X is enabled with triggers
provided for new interrupts. get_cached_msi_msg()/pci_write_msi_msg()
follows for interrupts recently allocated with pci_msix_alloc_irq_at()
just like get_cached_msi_msg()/pci_write_msi_msg() is done for
interrupts recently allocated with pci_alloc_irq_vectors().

Changes since V2:
- Move vfio_irq_ctx_free() to earlier in series to support
  earlier usage. (Alex)
- Use consistent terms in changelog: MSI-x changed to MSI-X.
- Make dynamic interrupt context creation generic across all
  MSI/MSI-X interrupts. This resulted in code moving to earlier
  in series as part of xarray introduction patch. (Alex)
- Remove the local allow_dyn_alloc and direct calling of
  pci_msix_can_alloc_dyn(), use the new vdev->has_dyn_msix
  introduced earlier instead. (Alex)
- Stop tracking new allocations (remove "new_ctx"). (Alex)
- Introduce new wrapper that returns the Linux interrupt number or
  dynamically allocates a new interrupt. Wrapper can be used for
  all interrupt cases. (Alex)
- Only free dynamic MSI-X interrupts on MSI-X teardown. (Alex)

Changes since RFC V1:
- Add pointer to interrupt context as function parameter to
  vfio_irq_ctx_free(). (Alex)
- Initialize new_ctx to false. (Dan Carpenter)
- Only support dynamic allocation if device supports it. (Alex)

 drivers/vfio/pci/vfio_pci_intrs.c | 73 +++++++++++++++++++++++++++----
 1 file changed, 65 insertions(+), 8 deletions(-)

Comments

Alex Williamson April 18, 2023, 10:38 p.m. UTC | #1

On Tue, 18 Apr 2023 10:29:20 -0700
Reinette Chatre <reinette.chatre@intel.com> wrote:

> Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
> enables an individual MSI-X interrupt to be allocated and freed after
> MSI-X enabling.
> 
> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
> the time a valid eventfd is assigned. This is different behavior from
> a range provided during MSI-X enabling where interrupts are allocated
> for the entire range whether a valid eventfd is provided for each
> interrupt or not.
> 
> Do not dynamically free interrupts, leave that to when MSI-X is
> disabled.

But we do, sometimes, even if essentially only on the error path.  Is
that worthwhile?  It seems like we could entirely remove
vfio_msi_free_irq() and rely only on pci_free_irq_vectors() on MSI/X
teardown.

I'd probably also add a comment in the commit log about the theory
behind not dynamically freeing irqs, ie. latency, reliability, and
whatever else we used to justify it.  Thanks,

Alex

> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> Link: https://lore.kernel.org/lkml/20230403211841.0e206b67.alex.williamson@redhat.com/
> ---
> The get_cached_msi_msg()/pci_write_msi_msg() behavior is kept
> similar to the scenario when MSI-X is enabled with triggers
> provided for new interrupts. get_cached_msi_msg()/pci_write_msi_msg()
> follows for interrupts recently allocated with pci_msix_alloc_irq_at()
> just like get_cached_msi_msg()/pci_write_msi_msg() is done for
> interrupts recently allocated with pci_alloc_irq_vectors().
> 
> Changes since V2:
> - Move vfio_irq_ctx_free() to earlier in series to support
>   earlier usage. (Alex)
> - Use consistent terms in changelog: MSI-x changed to MSI-X.
> - Make dynamic interrupt context creation generic across all
>   MSI/MSI-X interrupts. This resulted in code moving to earlier
>   in series as part of xarray introduction patch. (Alex)
> - Remove the local allow_dyn_alloc and direct calling of
>   pci_msix_can_alloc_dyn(), use the new vdev->has_dyn_msix
>   introduced earlier instead. (Alex)
> - Stop tracking new allocations (remove "new_ctx"). (Alex)
> - Introduce new wrapper that returns Linux interrupt number or
>   dynamically allocate a new interrupt. Wrapper can be used for
>   all interrupt cases. (Alex)
> - Only free dynamic MSI-X interrupts on MSI-X teardown. (Alex)
> 
> Changes since RFC V1:
> - Add pointer to interrupt context as function parameter to
>   vfio_irq_ctx_free(). (Alex)
> - Initialize new_ctx to false. (Dan Carpenter)
> - Only support dynamic allocation if device supports it. (Alex)
> 
>  drivers/vfio/pci/vfio_pci_intrs.c | 73 +++++++++++++++++++++++++++----
>  1 file changed, 65 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index bdda7f46c2be..c1a3e224c867 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -372,27 +372,74 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
>  	return 0;
>  }
>  
> +/*
> + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
> + * If a Linux IRQ number is not available then a new interrupt will be
> + * allocated if dynamic MSI-X is supported.
> + */
> +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
> +			      unsigned int vector, bool msix)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct msi_map map;
> +	int irq;
> +	u16 cmd;
> +
> +	irq = pci_irq_vector(pdev, vector);
> +	if (irq > 0 || !msix || !vdev->has_dyn_msix)
> +		return irq;
> +
> +	cmd = vfio_pci_memory_lock_and_enable(vdev);
> +	map = pci_msix_alloc_irq_at(pdev, vector, NULL);
> +	vfio_pci_memory_unlock_and_restore(vdev, cmd);
> +
> +	return map.index < 0 ? map.index : map.virq;
> +}
> +
> +/*
> + * Free interrupt if it can be re-allocated dynamically (while MSI-X
> + * is enabled).
> + */
> +static void vfio_msi_free_irq(struct vfio_pci_core_device *vdev,
> +			      unsigned int vector, bool msix)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct msi_map map;
> +	int irq;
> +	u16 cmd;
> +
> +	if (!msix || !vdev->has_dyn_msix)
> +		return;
> +
> +	irq = pci_irq_vector(pdev, vector);
> +	map = (struct msi_map) { .index = vector, .virq = irq };
> +
> +	if (WARN_ON(irq < 0))
> +		return;
> +
> +	cmd = vfio_pci_memory_lock_and_enable(vdev);
> +	pci_msix_free_irq(pdev, map);
> +	vfio_pci_memory_unlock_and_restore(vdev, cmd);
> +}
> +
>  static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
>  				      unsigned int vector, int fd, bool msix)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
>  	struct vfio_pci_irq_ctx *ctx;
>  	struct eventfd_ctx *trigger;
> -	int irq, ret;
> +	int irq = -EINVAL, ret;
>  	u16 cmd;
>  
> -	irq = pci_irq_vector(pdev, vector);
> -	if (irq < 0)
> -		return -EINVAL;
> -
>  	ctx = vfio_irq_ctx_get(vdev, vector);
>  
>  	if (ctx) {
>  		irq_bypass_unregister_producer(&ctx->producer);
> -
> +		irq = pci_irq_vector(pdev, vector);
>  		cmd = vfio_pci_memory_lock_and_enable(vdev);
>  		free_irq(irq, ctx->trigger);
>  		vfio_pci_memory_unlock_and_restore(vdev, cmd);
> +		/* Interrupt stays allocated, will be freed at MSI-X disable. */
>  		kfree(ctx->name);
>  		eventfd_ctx_put(ctx->trigger);
>  		vfio_irq_ctx_free(vdev, ctx, vector);
> @@ -401,9 +448,17 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
>  	if (fd < 0)
>  		return 0;
>  
> +	if (irq == -EINVAL) {
> +		irq = vfio_msi_alloc_irq(vdev, vector, msix);
> +		if (irq < 0)
> +			return irq;
> +	}
> +
>  	ctx = vfio_irq_ctx_alloc(vdev, vector);
> -	if (!ctx)
> -		return -ENOMEM;
> +	if (!ctx) {
> +		ret = -ENOMEM;
> +		goto out_free_irq;
> +	}
>  
>  	ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-msi%s[%d](%s)",
>  			      msix ? "x" : "", vector, pci_name(pdev));
> @@ -456,6 +511,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
>  	kfree(ctx->name);
>  out_free_ctx:
>  	vfio_irq_ctx_free(vdev, ctx, vector);
> +out_free_irq:
> +	vfio_msi_free_irq(vdev, vector, msix);
>  	return ret;
>  }
>

Reinette Chatre April 19, 2023, 6:13 p.m. UTC | #2

Hi Alex,

On 4/18/2023 3:38 PM, Alex Williamson wrote:
> On Tue, 18 Apr 2023 10:29:20 -0700
> Reinette Chatre <reinette.chatre@intel.com> wrote:
> 
>> Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
>> enables an individual MSI-X interrupt to be allocated and freed after
>> MSI-X enabling.
>>
>> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
>> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
>> the time a valid eventfd is assigned. This is different behavior from
>> a range provided during MSI-X enabling where interrupts are allocated
>> for the entire range whether a valid eventfd is provided for each
>> interrupt or not.
>>
>> Do not dynamically free interrupts, leave that to when MSI-X is
>> disabled.
> 
> But we do, sometimes, even if essentially only on the error path.  Is
> that worthwhile?  It seems like we could entirely remove
> vfio_msi_free_irq() and rely only on pci_free_irq_vectors() on MSI/X
> teardown.

Yes, it is only on the error path where dynamic MSI-X interrupts are
removed. I do not know how to determine if it is worthwhile. On the
kernel side failure seems unlikely since it would mean memory cannot
be allocated or request_irq() failed. In these cases it may not be
worthwhile since user space may try again and having the interrupt
already allocated would be helpful. The remaining error seems to be
if user space provided an invalid eventfd. An allocation in response
to wrong user input is a concern to me. Should we consider
buggy/malicious users? I am uncertain here so would defer to your
guidance.

> I'd probably also add a comment in the commit log about the theory
> behind not dynamically freeing irqs, ie. latency, reliability, and
> whatever else we used to justify it.  Thanks,

Sure. How about something like below to replace the final sentence of
the changelog:

"When a guest disables an interrupt, user space (Qemu) does not
disable the interrupt but instead assigns it a different trigger. A
common flow is thus for the VFIO_DEVICE_SET_IRQS ioctl() to be 
used to assign a new eventfd to an already enabled interrupt. Freeing
and re-allocating an interrupt in this scenario would add unnecessary
latency as well as uncertainty since the re-allocation may fail. Do
not dynamically free interrupts when an interrupt is disabled, instead
support a subsequent re-enable to draw from the initial allocation when
possible. Leave freeing of interrupts to when MSI-X is disabled."

Reinette
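
For readers less familiar with the user space side of the flow described in the proposed changelog text above, the following is a minimal sketch of how one eventfd might be assigned to (or removed from) a single MSI-X vector with the VFIO_DEVICE_SET_IRQS ioctl(). The helper names build_msix_trigger() and set_msix_trigger() are hypothetical and only serve this illustration; the structure, flags, and index come from <linux/vfio.h>. It is not taken from QEMU or from this series.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Build a VFIO_DEVICE_SET_IRQS argument that assigns one eventfd to one
 * MSI-X vector. Passing fd == -1 de-assigns the trigger for that vector.
 * The caller frees the returned buffer.
 * build_msix_trigger() is a hypothetical helper name.
 */
static struct vfio_irq_set *build_msix_trigger(uint32_t vector, int32_t fd)
{
	size_t argsz = sizeof(struct vfio_irq_set) + sizeof(int32_t);
	struct vfio_irq_set *set = calloc(1, argsz);

	if (!set)
		return NULL;

	set->argsz = (uint32_t)argsz;
	set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	set->index = VFIO_PCI_MSIX_IRQ_INDEX;
	set->start = vector;	/* first vector covered by this call */
	set->count = 1;		/* exactly one eventfd follows in data[] */
	memcpy(set->data, &fd, sizeof(fd));
	return set;
}

/*
 * Hypothetical wrapper: device_fd is assumed to be an open VFIO device
 * file descriptor. Returns the ioctl() result, or -1 on allocation failure.
 */
static int set_msix_trigger(int device_fd, uint32_t vector, int32_t fd)
{
	struct vfio_irq_set *set = build_msix_trigger(vector, fd);
	int ret;

	if (!set)
		return -1;
	ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
	free(set);
	return ret;
}
```

Swapping the trigger of an already enabled vector would then be a second set_msix_trigger() call with a different eventfd, which is the path where freeing and re-allocating the interrupt would add latency and a new failure point.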

Alex Williamson April 19, 2023, 9:38 p.m. UTC | #3

On Wed, 19 Apr 2023 11:13:29 -0700
Reinette Chatre <reinette.chatre@intel.com> wrote:

> Hi Alex,
> 
> On 4/18/2023 3:38 PM, Alex Williamson wrote:
> > On Tue, 18 Apr 2023 10:29:20 -0700
> > Reinette Chatre <reinette.chatre@intel.com> wrote:
> >   
> >> Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
> >> enables an individual MSI-X interrupt to be allocated and freed after
> >> MSI-X enabling.
> >>
> >> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
> >> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
> >> the time a valid eventfd is assigned. This is different behavior from
> >> a range provided during MSI-X enabling where interrupts are allocated
> >> for the entire range whether a valid eventfd is provided for each
> >> interrupt or not.
> >>
> >> Do not dynamically free interrupts, leave that to when MSI-X is
> >> disabled.  
> > 
> > But we do, sometimes, even if essentially only on the error path.  Is
> > that worthwhile?  It seems like we could entirely remove
> > vfio_msi_free_irq() and rely only on pci_free_irq_vectors() on MSI/X
> > teardown.  
> 
> Yes, it is only on the error path where dynamic MSI-X interrupts are
> removed. I do not know how to determine if it is worthwhile. On the
> kernel side failure seems unlikely since it would mean memory cannot
> be allocated or request_irq() failed. In these cases it may not be
> worthwhile since user space may try again and having the interrupt
> already allocated would be helpful. The remaining error seems to be
> if user space provided an invalid eventfd. An allocation in response
> to wrong user input is a concern to me. Should we consider
> buggy/malicious users? I am uncertain here so would defer to your
> guidance.

I don't really see that a malicious user can exploit anything here,
their irq allocation is bound by the device support and they're
entitled to make use of the full vector set of the device by virtue of
having ownership of the device.  All the MSI-X allocated irqs are freed
when the interrupt mode is changed or the device is released regardless.

The end result is also no different than if the user had not configured
the vector when enabling MSI-X or configured it and later de-configured
with a -1 eventfd.  The irq is allocated but not attached to a ctx.
We're intentionally using this as a cache.

Also, as implemented here in v3, we might be freeing from the original
allocation rather than a new, dynamically allocated irq.

My thinking is that if we think there's a benefit to caching any
allocated irqs, we should do so consistently.  We don't currently know
if the irq was allocated now or previously.  Tracking that would add
complexity for little benefit.  The user can get to the same end result
of an allocated, unused irq in numerous ways, the state itself is not
erroneous, and is actually in support of caching irq allocations.
Removing the de-allocation of a single vector further simplifies the
code as there exists only one path where irqs are freed, ie.
pci_free_irq_vectors().

So I'd lean towards removing vfio_msi_free_irq().

> > I'd probably also add a comment in the commit log about the theory
> > behind not dynamically freeing irqs, ie. latency, reliability, and
> > whatever else we used to justify it.  Thanks,  
> 
> Sure. How about something like below to replace the final sentence of
> the changelog:
> 
> "When a guest disables an interrupt, user space (Qemu) does not
> disable the interrupt but instead assigns it a different trigger. A
> common flow is thus for the VFIO_DEVICE_SET_IRQS ioctl() to be 
> used to assign a new eventfd to an already enabled interrupt. Freeing
> and re-allocating an interrupt in this scenario would add unnecessary
> latency as well as uncertainty since the re-allocation may fail. Do
> not dynamically free interrupts when an interrupt is disabled, instead
> support a subsequent re-enable to draw from the initial allocation when
> possible. Leave freeing of interrupts to when MSI-X is disabled."

There are other means besides caching irqs that could achieve the above,
for instance if a trigger is simply swapped from one eventfd to another,
that all happens within vfio_msi_set_vector_signal() where we could
hold the irq for the transition.

I think I might justify it as:

	The PCI MSI-X API requires that some number of irqs are
	allocated for an initial set of vectors when enabling MSI-X on
	the device.  When dynamic MSI-X allocation is not supported, the
	vector table, and thus the allocated irq set, can only be resized
	by disabling and re-enabling MSI-X with a different range.  In
	that case the irq allocation is essentially a cache for
	configuring vectors within the previously allocated vector
	range.  When dynamic MSI-X allocation is supported, the API
	still requires some initial set of irqs to be allocated, but
	also supports allocating and freeing specific irq vectors both
	within and beyond the initially allocated range.

	For consistency between modes, as well as to reduce latency,
	improve reliability of allocations, and keep the code simple,
	this implementation only releases irqs via
	pci_free_irq_vectors() when either the interrupt mode changes
	or the device is released.

Does that cover the key points for someone that might want to revisit
this decision later?  Thanks,

Alex
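
The caching argument in the comment above can be illustrated with a small user space model. This is only a sketch under simplifying assumptions: struct model_state, model_set_vector_signal(), and model_msix_disable() are hypothetical names, the fail_after_alloc flag stands in for any later failure (for example an invalid eventfd), and no real interrupt is involved. It shows that a failed assignment and a de-assignment with a -1 eventfd both end in the same state, an allocated irq without a ctx, and that only the disable path frees anything.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_NVEC 4	/* hypothetical number of MSI-X vectors */

struct model_state {
	bool allocated[MODEL_NVEC];	/* an irq exists for this vector */
	bool has_ctx[MODEL_NVEC];	/* a trigger (eventfd) is attached */
};

/*
 * Model of assigning a trigger: fd < 0 detaches, fd >= 0 attaches.
 * fail_after_alloc simulates a failure after the irq was obtained.
 * The irq is never released here, neither on success nor on error.
 */
static int model_set_vector_signal(struct model_state *s, unsigned int vector,
				   int fd, bool fail_after_alloc)
{
	assert(vector < MODEL_NVEC);

	s->has_ctx[vector] = false;	/* drop any existing trigger */
	if (fd < 0)
		return 0;		/* irq stays cached */

	s->allocated[vector] = true;	/* reuse the cached irq or allocate */
	if (fail_after_alloc)
		return -EINVAL;		/* error path: irq stays cached */

	s->has_ctx[vector] = true;
	return 0;
}

/* Model of MSI-X disable: the single place where irqs are freed. */
static void model_msix_disable(struct model_state *s)
{
	memset(s, 0, sizeof(*s));
}
```

In this model the only function that clears allocated[] is model_msix_disable(), which mirrors the suggestion to keep pci_free_irq_vectors() as the one path where irqs are freed.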

Reinette Chatre April 19, 2023, 10:03 p.m. UTC | #4

Hi Alex,

On 4/19/2023 2:38 PM, Alex Williamson wrote:
> On Wed, 19 Apr 2023 11:13:29 -0700
> Reinette Chatre <reinette.chatre@intel.com> wrote:
>> On 4/18/2023 3:38 PM, Alex Williamson wrote:
>>> On Tue, 18 Apr 2023 10:29:20 -0700
>>> Reinette Chatre <reinette.chatre@intel.com> wrote:
>>>   
>>>> Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
>>>> enables an individual MSI-X interrupt to be allocated and freed after
>>>> MSI-X enabling.
>>>>
>>>> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
>>>> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
>>>> the time a valid eventfd is assigned. This is different behavior from
>>>> a range provided during MSI-X enabling where interrupts are allocated
>>>> for the entire range whether a valid eventfd is provided for each
>>>> interrupt or not.
>>>>
>>>> Do not dynamically free interrupts, leave that to when MSI-X is
>>>> disabled.  
>>>
>>> But we do, sometimes, even if essentially only on the error path.  Is
>>> that worthwhile?  It seems like we could entirely remove
>>> vfio_msi_free_irq() and rely only on pci_free_irq_vectors() on MSI/X
>>> teardown.  
>>
>> Yes, it is only on the error path where dynamic MSI-X interrupts are
>> removed. I do not know how to determine if it is worthwhile. On the
>> kernel side failure seems unlikely since it would mean memory cannot
>> be allocated or request_irq() failed. In these cases it may not be
>> worthwhile since user space may try again and having the interrupt
>> already allocated would be helpful. The remaining error seems to be
>> if user space provided an invalid eventfd. An allocation in response
>> to wrong user input is a concern to me. Should we consider
>> buggy/malicious users? I am uncertain here so would defer to your
>> guidance.
> 
> I don't really see that a malicious user can exploit anything here,
> their irq allocation is bound by the device support and they're
> entitled to make use of the full vector set of the device by virtue of
> having ownership of the device.  All the MSI-X allocated irqs are freed
> when the interrupt mode is changed or the device is released regardless.
> 
> The end result is also no different than if the user had not configured
> the vector when enabling MSI-X or configured it and later de-configured
> with a -1 eventfd.  The irq is allocated but not attached to a ctx.
> We're intentionally using this as a cache.
> 
> Also, as implemented here in v3, we might be freeing from the original
> allocation rather than a new, dynamically allocated irq.

Great point.

> 
> My thinking is that if we think there's a benefit to caching any
> allocated irqs, we should do so consistently.  We don't currently know
> if the irq was allocated now or previously.  Tracking that would add
> complexity for little benefit.  The user can get to the same end result
> of an allocated, unused irq in numerous way, the state itself is not
> erroneous, and is actually in support of caching irq allocations.
> Removing the de-allocation of a single vector further simplifies the
> code as there exists only one path where irqs are freed, ie.
> pci_free_irq_vectors().
> 
> So I'd lean towards removing vfio_msi_free_irq().

Thank you for your detailed analysis. I understand and agree.
I will remove vfio_msi_free_irq().

>  
>>> I'd probably also add a comment in the commit log about the theory
>>> behind not dynamically freeing irqs, ie. latency, reliability, and
>>> whatever else we used to justify it.  Thanks,  
>>
>> Sure. How about something like below to replace the final sentence of
>> the changelog:
>>
>> "When a guest disables an interrupt, user space (Qemu) does not
>> disable the interrupt but instead assigns it a different trigger. A
>> common flow is thus for the VFIO_DEVICE_SET_IRQS ioctl() to be 
>> used to assign a new eventfd to an already enabled interrupt. Freeing
>> and re-allocating an interrupt in this scenario would add unnecessary
>> latency as well as uncertainty since the re-allocation may fail. Do
>> not dynamically free interrupts when an interrupt is disabled, instead
>> support a subsequent re-enable to draw from the initial allocation when
>> possible. Leave freeing of interrupts to when MSI-X is disabled."
> 
> There are other means besides caching irqs that could achieve the above,
> for instance if a trigger is simply swapped from one eventfd to another,
> that all happens within vfio_msi_set_vector_signal() where we could
> hold the irq for the transition.
> 
> I think I might justify it as:
> 
> 	The PCI-MSIX API requires that some number of irqs are
> 	allocated for an initial set of vectors when enabling MSI-X on
> 	the device.  When dynamic MSIX allocation is not supported, the
> 	vector table, and thus the allocated irq set can only be resized
> 	by disabling and re-enabling MSIX with a different range.  In
> 	that case the irq allocation is essentially a cache for
> 	configuring vectors within the previously allocated vector
> 	range.  When dynamic MSIX allocation is supported, the API
> 	still requires some initial set of irqs to be allocated, but
> 	also supports allocating and freeing specific irq vectors both
> 	within and beyond the initially allocated range.
> 
> 	For consistency between modes, as well as to reduce latency and
> 	improve reliability of allocations, and also simplicity, this
> 	implementation only releases irqs via pci_free_irq_vectors()
> 	when either the interrupt mode changes or the device is
> 	released.
> 
> Does that cover the key points for someone that might want to revisit
> this decision later?  Thanks,

It does so clearly, yes. Thank you so much for taking the time to
write this. I will include it in the changelog.

Reinette


Patch

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index bdda7f46c2be..c1a3e224c867 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -372,27 +372,74 @@  static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 	return 0;
 }
 
+/*
+ * Return Linux IRQ number of an MSI or MSI-X device interrupt vector.
+ * If a Linux IRQ number is not available then a new interrupt will be
+ * allocated if dynamic MSI-X is supported.
+ */
+static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev,
+			      unsigned int vector, bool msix)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct msi_map map;
+	int irq;
+	u16 cmd;
+
+	irq = pci_irq_vector(pdev, vector);
+	if (irq > 0 || !msix || !vdev->has_dyn_msix)
+		return irq;
+
+	cmd = vfio_pci_memory_lock_and_enable(vdev);
+	map = pci_msix_alloc_irq_at(pdev, vector, NULL);
+	vfio_pci_memory_unlock_and_restore(vdev, cmd);
+
+	return map.index < 0 ? map.index : map.virq;
+}
+
+/*
+ * Free interrupt if it can be re-allocated dynamically (while MSI-X
+ * is enabled).
+ */
+static void vfio_msi_free_irq(struct vfio_pci_core_device *vdev,
+			      unsigned int vector, bool msix)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct msi_map map;
+	int irq;
+	u16 cmd;
+
+	if (!msix || !vdev->has_dyn_msix)
+		return;
+
+	irq = pci_irq_vector(pdev, vector);
+	map = (struct msi_map) { .index = vector, .virq = irq };
+
+	if (WARN_ON(irq < 0))
+		return;
+
+	cmd = vfio_pci_memory_lock_and_enable(vdev);
+	pci_msix_free_irq(pdev, map);
+	vfio_pci_memory_unlock_and_restore(vdev, cmd);
+}
+
 static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 				      unsigned int vector, int fd, bool msix)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_irq_ctx *ctx;
 	struct eventfd_ctx *trigger;
-	int irq, ret;
+	int irq = -EINVAL, ret;
 	u16 cmd;
 
-	irq = pci_irq_vector(pdev, vector);
-	if (irq < 0)
-		return -EINVAL;
-
 	ctx = vfio_irq_ctx_get(vdev, vector);
 
 	if (ctx) {
 		irq_bypass_unregister_producer(&ctx->producer);
-
+		irq = pci_irq_vector(pdev, vector);
 		cmd = vfio_pci_memory_lock_and_enable(vdev);
 		free_irq(irq, ctx->trigger);
 		vfio_pci_memory_unlock_and_restore(vdev, cmd);
+		/* Interrupt stays allocated, will be freed at MSI-X disable. */
 		kfree(ctx->name);
 		eventfd_ctx_put(ctx->trigger);
 		vfio_irq_ctx_free(vdev, ctx, vector);
@@ -401,9 +448,17 @@  static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 	if (fd < 0)
 		return 0;
 
+	if (irq == -EINVAL) {
+		irq = vfio_msi_alloc_irq(vdev, vector, msix);
+		if (irq < 0)
+			return irq;
+	}
+
 	ctx = vfio_irq_ctx_alloc(vdev, vector);
-	if (!ctx)
-		return -ENOMEM;
+	if (!ctx) {
+		ret = -ENOMEM;
+		goto out_free_irq;
+	}
 
 	ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-msi%s[%d](%s)",
 			      msix ? "x" : "", vector, pci_name(pdev));
@@ -456,6 +511,8 @@  static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
 	kfree(ctx->name);
 out_free_ctx:
 	vfio_irq_ctx_free(vdev, ctx, vector);
+out_free_irq:
+	vfio_msi_free_irq(vdev, vector, msix);
 	return ret;
 }
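
The decision order in vfio_msi_alloc_irq() above can be exercised outside the kernel with a small model. This is a sketch only: struct model_dev, model_irq_vector(), and model_alloc_at() are hypothetical stand-ins for vdev->has_dyn_msix, pci_irq_vector(), and pci_msix_alloc_irq_at(), and the IRQ numbers are made up. The memory lock/enable step of the real function is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NVEC	8	/* hypothetical MSI-X table size */
#define MODEL_IRQ_BASE	100	/* hypothetical first Linux IRQ number */

/* Hypothetical stand-in for the device state the patch consults. */
struct model_dev {
	bool has_dyn_msix;	/* mirrors vdev->has_dyn_msix */
	int irq[MODEL_NVEC];	/* 0 means the vector has no Linux IRQ yet */
};

/* Stand-in for pci_irq_vector(): IRQ number, or -EINVAL if none. */
static int model_irq_vector(const struct model_dev *d, unsigned int vector)
{
	assert(d != NULL);
	if (vector >= MODEL_NVEC || d->irq[vector] == 0)
		return -EINVAL;
	return d->irq[vector];
}

/* Stand-in for pci_msix_alloc_irq_at(): allocate an IRQ for one vector. */
static int model_alloc_at(struct model_dev *d, unsigned int vector)
{
	if (vector >= MODEL_NVEC)
		return -EINVAL;
	d->irq[vector] = MODEL_IRQ_BASE + (int)vector;
	return d->irq[vector];
}

/* Same decision order as vfio_msi_alloc_irq() in the patch. */
static int model_msi_alloc_irq(struct model_dev *d, unsigned int vector,
			       bool msix)
{
	int irq = model_irq_vector(d, vector);

	/* Existing IRQ, plain MSI, or no dynamic MSI-X: nothing to allocate. */
	if (irq > 0 || !msix || !d->has_dyn_msix)
		return irq;

	return model_alloc_at(d, vector);
}
```

A vector that already has an IRQ is returned as-is, so a repeated call for the same vector does not allocate again; a missing IRQ is only allocated when both msix and has_dyn_msix are true.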