Message ID | c4c582970fbeaf4b6000845c400aa4c6b7bb2f13.1682615447.git.reinette.chatre@intel.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp429866vqo; Thu, 27 Apr 2023 10:38:46 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ58JcLVPuiSFySdkZJHYnBwVRBZLf7e+cEFeii4EmfFFUFyPG8Zt9rmDgYD/g1hOBCRrN5X X-Received: by 2002:a05:6a20:6a07:b0:ef:cb4c:c23e with SMTP id p7-20020a056a206a0700b000efcb4cc23emr3035553pzk.29.1682617125904; Thu, 27 Apr 2023 10:38:45 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1682617125; cv=none; d=google.com; s=arc-20160816; b=wga76dugRf0hx+YgXYg0VNlpVgC/rsxfYU4fszwAvIZDPkaKQr2R7JSFbEuuTFCQIW DFWv85ma7XIkCB/smwv2nZqLCzKdViCE9XA6KFAXb+grGU5Nqg6deNfXJCkKCqL+Xe5E HC4DSkHA4Wa8A9fCVlwiVpuhs0utBLLOrf/qiFYkDVYclehUoTriX7o9pkDJ9ND7oW4E 4aYCqxEpD+OKzcEFZRuOyq/MRUWIfuV5KPpgtmGeUy6vtQZBA//DO4Y+BGgiBDeMlJQ+ Qc4PKvc5vdaz0khy11gpbdGVJbAEa/QvTe6oBRWQ9ET3/AOir+OodraBG2aoMpwynCnH 5vDA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=Wc4cy6/FnULjJBm50CezazWXPL+MCJQ518e9PPqOypo=; b=XyiL5WTDXiZtxyMIE6NeUO6MzHWp/FYL7Gp4HRuMoKJBHxcX0yz1go2COhmnmNiVR9 AHgYf/SdFhOLD7PjVkSw7w8MtT2kiXIXAVT02cE6INyMebHdtT+RAGa49egMlKaozy9o Jr3/rFcAt7OCZmNNdIwiUIECVzE4t0c99Dx6QdfYZbJMnL44cA1Vkt0EpGfYZa2AyW+w S9iZRx0pJ77dUnwqYRboGFNxmXc6T0km5ycmne/I84MMBJOcbMFOvskj8Lu08kd5emTT FAjanePbTZ444DFxJY/dhD9PVe8ElNSPdv8WK+VqiMymtt80LV7GtjpdRhZJ4P7M+Jcs VUDA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=c43hC4sS; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id gt18-20020a17090af2d200b002473f9e8a19si21568356pjb.14.2023.04.27.10.38.32; Thu, 27 Apr 2023 10:38:45 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=c43hC4sS; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S244572AbjD0RhB (ORCPT <rfc822;zxc52fgh@gmail.com> + 99 others); Thu, 27 Apr 2023 13:37:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41108 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S244443AbjD0Rg2 (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 27 Apr 2023 13:36:28 -0400 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 751393A91; Thu, 27 Apr 2023 10:36:27 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1682616987; x=1714152987; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Sngpy7QSRRwSxWcsLL2pb0sag7VSeHhyYvrzzFIvniE=; b=c43hC4sSIXLnF+iIj3AQkVtVJVcdXEn5QXXgisGepKmg9OIz/S+oqKwz /KxR0CtVLT2TudIouddtoLsyi++eMJlB6ug0y1uv7FKSGxw4z2V/zmXd9 6XcB5lFR6EBkNvSGD+xr39DYBpu5ypQUA0NeavqY6ay0620S2bzrQn7Cm q7BDun374/geJXs5bOTZsv/63fEJiWg7sRf3wlm0dCLpRka952ycdPF8m welnTj+wPEbBaxHRZtk+lbz8+fXoCNq+Y7OVSYphHAkBLWpopCA33MC7P UrJeP6XQDxzt6DktFoBzdpOQmbMWgMAbZiup5YFJfScSPLD9wiZg1YuUi w==; X-IronPort-AV: E=McAfee;i="6600,9927,10693"; a="349496937" X-IronPort-AV: E=Sophos;i="5.99,232,1677571200"; d="scan'208";a="349496937" Received: from fmsmga007.fm.intel.com ([10.253.24.52]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 27 Apr 2023 10:36:22 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10693"; a="697172998" X-IronPort-AV: E=Sophos;i="5.99,232,1677571200"; d="scan'208";a="697172998" Received: from rchatre-ws.ostc.intel.com ([10.54.69.144]) by fmsmga007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 27 Apr 2023 10:36:22 -0700 From: Reinette Chatre <reinette.chatre@intel.com> To: jgg@nvidia.com, yishaih@nvidia.com, shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com, alex.williamson@redhat.com Cc: tglx@linutronix.de, darwi@linutronix.de, kvm@vger.kernel.org, dave.jiang@intel.com, jing2.liu@intel.com, ashok.raj@intel.com, fenghua.yu@intel.com, tom.zanussi@linux.intel.com, reinette.chatre@intel.com, linux-kernel@vger.kernel.org Subject: [PATCH V4 10/11] vfio/pci: Support dynamic MSI-X Date: Thu, 27 Apr 2023 10:36:07 -0700 Message-Id: <c4c582970fbeaf4b6000845c400aa4c6b7bb2f13.1682615447.git.reinette.chatre@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <cover.1682615447.git.reinette.chatre@intel.com> References: <cover.1682615447.git.reinette.chatre@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-4.6 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED, RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1764351934928351479?= X-GMAIL-MSGID: =?utf-8?q?1764351934928351479?= |
Series |
vfio/pci: Support dynamic allocation of MSI-X interrupts
|
|
Commit Message
Reinette Chatre
April 27, 2023, 5:36 p.m. UTC
pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be
allocated after MSI-X enabling.
Use dynamic MSI-X (if supported by the device) to allocate an interrupt
after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
the time a valid eventfd is assigned. This is different behavior from
a range provided during MSI-X enabling where interrupts are allocated
for the entire range whether a valid eventfd is provided for each
interrupt or not.
The PCI-MSIX API requires that some number of irqs are allocated for
an initial set of vectors when enabling MSI-X on the device. When
dynamic MSIX allocation is not supported, the vector table, and thus
the allocated irq set can only be resized by disabling and re-enabling
MSI-X with a different range. In that case the irq allocation is
essentially a cache for configuring vectors within the previously
allocated vector range. When dynamic MSI-X allocation is supported,
the API still requires some initial set of irqs to be allocated, but
also supports allocating and freeing specific irq vectors both
within and beyond the initially allocated range.
For consistency between modes, as well as to reduce latency and improve
reliability of allocations, and also simplicity, this implementation
only releases irqs via pci_free_irq_vectors() when either the interrupt
mode changes or the device is released.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/lkml/20230403211841.0e206b67.alex.williamson@redhat.com/
---
Changes since V3:
- Remove vfio_msi_free_irq(). (Alex)
- Rework changelog. (Alex)
Changes since V2:
- Move vfio_irq_ctx_free() to earlier in series to support
earlier usage. (Alex)
- Use consistent terms in changelog: MSI-x changed to MSI-X.
- Make dynamic interrupt context creation generic across all
MSI/MSI-X interrupts. This resulted in code moving to earlier
in series as part of xarray introduction patch. (Alex)
- Remove the local allow_dyn_alloc and direct calling of
pci_msix_can_alloc_dyn(), use the new vdev->has_dyn_msix
introduced earlier instead. (Alex)
- Stop tracking new allocations (remove "new_ctx"). (Alex)
- Introduce new wrapper that returns Linux interrupt number or
dynamically allocate a new interrupt. Wrapper can be used for
all interrupt cases. (Alex)
- Only free dynamic MSI-X interrupts on MSI-X teardown. (Alex)
Changes since RFC V1:
- Add pointer to interrupt context as function parameter to
vfio_irq_ctx_free(). (Alex)
- Initialize new_ctx to false. (Dan Carpenter)
- Only support dynamic allocation if device supports it. (Alex)
drivers/vfio/pci/vfio_pci_intrs.c | 47 +++++++++++++++++++++++++++----
1 file changed, 41 insertions(+), 6 deletions(-)
Comments
> From: Chatre, Reinette <reinette.chatre@intel.com> > Sent: Friday, April 28, 2023 1:36 AM > > pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be > allocated after MSI-X enabling. > > Use dynamic MSI-X (if supported by the device) to allocate an interrupt > after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at > the time a valid eventfd is assigned. This is different behavior from > a range provided during MSI-X enabling where interrupts are allocated > for the entire range whether a valid eventfd is provided for each > interrupt or not. > > The PCI-MSIX API requires that some number of irqs are allocated for > an initial set of vectors when enabling MSI-X on the device. When > dynamic MSIX allocation is not supported, the vector table, and thus > the allocated irq set can only be resized by disabling and re-enabling > MSI-X with a different range. In that case the irq allocation is > essentially a cache for configuring vectors within the previously > allocated vector range. When dynamic MSI-X allocation is supported, > the API still requires some initial set of irqs to be allocated, but > also supports allocating and freeing specific irq vectors both > within and beyond the initially allocated range. > > For consistency between modes, as well as to reduce latency and improve > reliability of allocations, and also simplicity, this implementation > only releases irqs via pci_free_irq_vectors() when either the interrupt > mode changes or the device is released. It improves the reliability of allocations from the calling device p.o.v. But system-wide this is not efficient use of irqs and not releasing them timely may affect the reliability of allocations for other devices. Should this behavior be something configurable? > > +/* > + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector. > + * If a Linux IRQ number is not available then a new interrupt will be > + * allocated if dynamic MSI-X is supported. > + */ > +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev, > + unsigned int vector, bool msix) > +{ > + struct pci_dev *pdev = vdev->pdev; > + struct msi_map map; > + int irq; > + u16 cmd; > + > + irq = pci_irq_vector(pdev, vector); > + if (irq > 0 || !msix || !vdev->has_dyn_msix) > + return irq; if (irq >= 0 || ...) > + > +/* > + * Where is vfio_msi_free_irq() ? > + * > + * Allocated interrupts are maintained, essentially forming a cache that > + * subsequent allocations can draw from. Interrupts are freed using > + * pci_free_irq_vectors() when MSI/MSI-X is disabled. > + */ Probably merge it with the comment of vfio_msi_alloc_irq()? > @@ -401,6 +430,12 @@ static int vfio_msi_set_vector_signal(struct > vfio_pci_core_device *vdev, > if (fd < 0) > return 0; > > + if (irq == -EINVAL) { > + irq = vfio_msi_alloc_irq(vdev, vector, msix); > + if (irq < 0) > + return irq; > + } > + > ctx = vfio_irq_ctx_alloc(vdev, vector); > if (!ctx) > return -ENOMEM; This doesn't read clean that an irq is allocated but not released in the error unwind.
Hi Kevin, On 4/27/2023 11:50 PM, Tian, Kevin wrote: >> From: Chatre, Reinette <reinette.chatre@intel.com> >> Sent: Friday, April 28, 2023 1:36 AM >> >> pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be >> allocated after MSI-X enabling. >> >> Use dynamic MSI-X (if supported by the device) to allocate an interrupt >> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at >> the time a valid eventfd is assigned. This is different behavior from >> a range provided during MSI-X enabling where interrupts are allocated >> for the entire range whether a valid eventfd is provided for each >> interrupt or not. >> >> The PCI-MSIX API requires that some number of irqs are allocated for >> an initial set of vectors when enabling MSI-X on the device. When >> dynamic MSIX allocation is not supported, the vector table, and thus >> the allocated irq set can only be resized by disabling and re-enabling >> MSI-X with a different range. In that case the irq allocation is >> essentially a cache for configuring vectors within the previously >> allocated vector range. When dynamic MSI-X allocation is supported, >> the API still requires some initial set of irqs to be allocated, but >> also supports allocating and freeing specific irq vectors both >> within and beyond the initially allocated range. >> >> For consistency between modes, as well as to reduce latency and improve >> reliability of allocations, and also simplicity, this implementation >> only releases irqs via pci_free_irq_vectors() when either the interrupt >> mode changes or the device is released. > > It improves the reliability of allocations from the calling device p.o.v. > > But system-wide this is not efficient use of irqs and not releasing them > timely may affect the reliability of allocations for other devices. Could you please elaborate how other devices may be impacted? > Should this behavior be something configurable? This is not clear to me and I look to you for guidance here. From practical side it looks like configuration via module parameters is supported but whether it should be done is not clear to me. When considering this we need to think about what the user may expect when turning on/off the configuration. For example, MSI-X continues to allocate a range of interrupts during enabling. These have always been treated as a "cache" (interrupts remain allocated, whether they have an associated trigger or not). If there is new configurable behavior, do you expect that the driver needs to distinguish between the original "cache" that the user is used to and the new dynamic allocations? That is, should a dynamic MSI-X capable device always free interrupts when user space removes an eventfd or should only interrupts that were allocated dynamically be freed dynamically? >> +/* >> + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector. >> + * If a Linux IRQ number is not available then a new interrupt will be >> + * allocated if dynamic MSI-X is supported. >> + */ >> +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev, >> + unsigned int vector, bool msix) >> +{ >> + struct pci_dev *pdev = vdev->pdev; >> + struct msi_map map; >> + int irq; >> + u16 cmd; >> + >> + irq = pci_irq_vector(pdev, vector); >> + if (irq > 0 || !msix || !vdev->has_dyn_msix) >> + return irq; > > if (irq >= 0 || ...) > I am not sure about this request because pci_irq_vector() cannot return 0. The Linux interrupt number will be > 0 on success. 0 means "not found" (see msi_get_virq()), which is translated to -EINVAL by pci_irq_vector(). >> + >> +/* >> + * Where is vfio_msi_free_irq() ? >> + * >> + * Allocated interrupts are maintained, essentially forming a cache that >> + * subsequent allocations can draw from. Interrupts are freed using >> + * pci_free_irq_vectors() when MSI/MSI-X is disabled. >> + */ > > Probably merge it with the comment of vfio_msi_alloc_irq()? Sure, will do. > >> @@ -401,6 +430,12 @@ static int vfio_msi_set_vector_signal(struct >> vfio_pci_core_device *vdev, >> if (fd < 0) >> return 0; >> >> + if (irq == -EINVAL) { >> + irq = vfio_msi_alloc_irq(vdev, vector, msix); >> + if (irq < 0) >> + return irq; >> + } >> + >> ctx = vfio_irq_ctx_alloc(vdev, vector); >> if (!ctx) >> return -ENOMEM; > > This doesn't read clean that an irq is allocated but not released > in the error unwind. I can add a comment similar to the location where the trigger is released: /* Interrupt stays allocated, will be freed at MSI-X disable. */ Reinette
> From: Chatre, Reinette <reinette.chatre@intel.com> > Sent: Saturday, April 29, 2023 2:35 AM > > Hi Kevin, > > On 4/27/2023 11:50 PM, Tian, Kevin wrote: > >> From: Chatre, Reinette <reinette.chatre@intel.com> > >> Sent: Friday, April 28, 2023 1:36 AM > >> > >> pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be > >> allocated after MSI-X enabling. > >> > >> Use dynamic MSI-X (if supported by the device) to allocate an interrupt > >> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at > >> the time a valid eventfd is assigned. This is different behavior from > >> a range provided during MSI-X enabling where interrupts are allocated > >> for the entire range whether a valid eventfd is provided for each > >> interrupt or not. > >> > >> The PCI-MSIX API requires that some number of irqs are allocated for > >> an initial set of vectors when enabling MSI-X on the device. When > >> dynamic MSIX allocation is not supported, the vector table, and thus > >> the allocated irq set can only be resized by disabling and re-enabling > >> MSI-X with a different range. In that case the irq allocation is > >> essentially a cache for configuring vectors within the previously > >> allocated vector range. When dynamic MSI-X allocation is supported, > >> the API still requires some initial set of irqs to be allocated, but > >> also supports allocating and freeing specific irq vectors both > >> within and beyond the initially allocated range. > >> > >> For consistency between modes, as well as to reduce latency and improve > >> reliability of allocations, and also simplicity, this implementation > >> only releases irqs via pci_free_irq_vectors() when either the interrupt > >> mode changes or the device is released. > > > > It improves the reliability of allocations from the calling device p.o.v. > > > > But system-wide this is not efficient use of irqs and not releasing them > > timely may affect the reliability of allocations for other devices. > > Could you please elaborate how other devices may be impacted? the more this devices reserves the less remains for others, e.g. irte entries. > > > Should this behavior be something configurable? > > This is not clear to me and I look to you for guidance here. From practical > side it looks like configuration via module parameters is supported but > whether it should be done is not clear to me. > > When considering this we need to think about what the user may expect > when > turning on/off the configuration. For example, MSI-X continues to allocate a > range of interrupts during enabling. These have always been treated as a > "cache" (interrupts remain allocated, whether they have an associated > trigger > or not). If there is new configurable behavior, do you expect that the > driver needs to distinguish between the original "cache" that the user is > used to and the new dynamic allocations? That is, should a dynamic MSI-X > capable device always free interrupts when user space removes an eventfd > or should only interrupts that were allocated dynamically be freed > dynamically? That looks tricky. Probably that is why Alex suggested doing this simple scheme and it is on par with the old logic anyway. So I'll withdraw this comment. > > >> +/* > >> + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector. > >> + * If a Linux IRQ number is not available then a new interrupt will be > >> + * allocated if dynamic MSI-X is supported. > >> + */ > >> +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev, > >> + unsigned int vector, bool msix) > >> +{ > >> + struct pci_dev *pdev = vdev->pdev; > >> + struct msi_map map; > >> + int irq; > >> + u16 cmd; > >> + > >> + irq = pci_irq_vector(pdev, vector); > >> + if (irq > 0 || !msix || !vdev->has_dyn_msix) > >> + return irq; > > > > if (irq >= 0 || ...) > > > > I am not sure about this request because pci_irq_vector() cannot return 0. > The Linux interrupt number will be > 0 on success. 0 means "not found" > (see msi_get_virq()), which is translated to -EINVAL by pci_irq_vector(). > There is a subtle difference between the description and the code of pci_irq_vector(). /** * pci_irq_vector() - Get Linux IRQ number of a device interrupt vector * @dev: the PCI device to operate on * @nr: device-relative interrupt vector index (0-based); has different * meanings, depending on interrupt mode: * * * MSI-X the index in the MSI-X vector table * * MSI the index of the enabled MSI vectors * * INTx must be 0 * * Return: the Linux IRQ number, or -EINVAL if @nr is out of range */ From above '0' is a valid irq number. then in following code: irq = msi_get_virq(&dev->dev, nr); return irq ? irq : -EINVAL; '0' is obviously invalid for msi. I didn't realize the msi part when reading the patch. It left me in confusion that '0' is unhandled as here we only check ">0" while in other places "-EINVAL" is checked. Not big matter but it sounds slightly clearer to me to follow the description of pci_irq_vector() instead of its internal detail.
On Fri, 5 May 2023 08:10:33 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Chatre, Reinette <reinette.chatre@intel.com> > > Sent: Saturday, April 29, 2023 2:35 AM > > > > Hi Kevin, > > > > On 4/27/2023 11:50 PM, Tian, Kevin wrote: > > >> From: Chatre, Reinette <reinette.chatre@intel.com> > > >> Sent: Friday, April 28, 2023 1:36 AM > > >> > > >> pci_msix_alloc_irq_at() enables an individual MSI-X interrupt to be > > >> allocated after MSI-X enabling. > > >> > > >> Use dynamic MSI-X (if supported by the device) to allocate an interrupt > > >> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at > > >> the time a valid eventfd is assigned. This is different behavior from > > >> a range provided during MSI-X enabling where interrupts are allocated > > >> for the entire range whether a valid eventfd is provided for each > > >> interrupt or not. > > >> > > >> The PCI-MSIX API requires that some number of irqs are allocated for > > >> an initial set of vectors when enabling MSI-X on the device. When > > >> dynamic MSIX allocation is not supported, the vector table, and thus > > >> the allocated irq set can only be resized by disabling and re-enabling > > >> MSI-X with a different range. In that case the irq allocation is > > >> essentially a cache for configuring vectors within the previously > > >> allocated vector range. When dynamic MSI-X allocation is supported, > > >> the API still requires some initial set of irqs to be allocated, but > > >> also supports allocating and freeing specific irq vectors both > > >> within and beyond the initially allocated range. > > >> > > >> For consistency between modes, as well as to reduce latency and improve > > >> reliability of allocations, and also simplicity, this implementation > > >> only releases irqs via pci_free_irq_vectors() when either the interrupt > > >> mode changes or the device is released. > > > > > > It improves the reliability of allocations from the calling device p.o.v. > > > > > > But system-wide this is not efficient use of irqs and not releasing them > > > timely may affect the reliability of allocations for other devices. > > > > Could you please elaborate how other devices may be impacted? > > the more this devices reserves the less remains for others, e.g. irte entries. > > > > > > Should this behavior be something configurable? > > > > This is not clear to me and I look to you for guidance here. From practical > > side it looks like configuration via module parameters is supported but > > whether it should be done is not clear to me. > > > > When considering this we need to think about what the user may expect > > when > > turning on/off the configuration. For example, MSI-X continues to allocate a > > range of interrupts during enabling. These have always been treated as a > > "cache" (interrupts remain allocated, whether they have an associated > > trigger > > or not). If there is new configurable behavior, do you expect that the > > driver needs to distinguish between the original "cache" that the user is > > used to and the new dynamic allocations? That is, should a dynamic MSI-X > > capable device always free interrupts when user space removes an eventfd > > or should only interrupts that were allocated dynamically be freed > > dynamically? > > That looks tricky. Probably that is why Alex suggested doing this simple > scheme and it is on par with the old logic anyway. So I'll withdraw this > comment. Don't forget we're also releasing the irq reservations when the guest changes interrupt mode, ex. reboot, so the "caching" is really only within a session of the guest/userspace driver where it would be unusual to have an unused reservation for an extended period. Thanks, Alex
Hi Kevin, On 5/5/2023 1:10 AM, Tian, Kevin wrote: >> From: Chatre, Reinette <reinette.chatre@intel.com> >> Sent: Saturday, April 29, 2023 2:35 AM >> On 4/27/2023 11:50 PM, Tian, Kevin wrote: >>>> From: Chatre, Reinette <reinette.chatre@intel.com> >>>> Sent: Friday, April 28, 2023 1:36 AM ... >>>> +/* >>>> + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector. >>>> + * If a Linux IRQ number is not available then a new interrupt will be >>>> + * allocated if dynamic MSI-X is supported. >>>> + */ >>>> +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev, >>>> + unsigned int vector, bool msix) >>>> +{ >>>> + struct pci_dev *pdev = vdev->pdev; >>>> + struct msi_map map; >>>> + int irq; >>>> + u16 cmd; >>>> + >>>> + irq = pci_irq_vector(pdev, vector); >>>> + if (irq > 0 || !msix || !vdev->has_dyn_msix) >>>> + return irq; >>> >>> if (irq >= 0 || ...) >>> >> >> I am not sure about this request because pci_irq_vector() cannot return 0. >> The Linux interrupt number will be > 0 on success. 0 means "not found" >> (see msi_get_virq()), which is translated to -EINVAL by pci_irq_vector(). >> > > There is a subtle difference between the description and the code of > pci_irq_vector(). > > /** > * pci_irq_vector() - Get Linux IRQ number of a device interrupt vector > * @dev: the PCI device to operate on > * @nr: device-relative interrupt vector index (0-based); has different > * meanings, depending on interrupt mode: > * > * * MSI-X the index in the MSI-X vector table > * * MSI the index of the enabled MSI vectors > * * INTx must be 0 > * > * Return: the Linux IRQ number, or -EINVAL if @nr is out of range > */ > > From above '0' is a valid irq number. > > then in following code: > > irq = msi_get_virq(&dev->dev, nr); > return irq ? irq : -EINVAL; > > '0' is obviously invalid for msi. > > I didn't realize the msi part when reading the patch. It left me in > confusion that '0' is unhandled as here we only check ">0" while in > other places "-EINVAL" is checked. > > Not big matter but it sounds slightly clearer to me to follow the > description of pci_irq_vector() instead of its internal detail. I can add an explicit check for '0' and, as you confirmed, this is invalid for MSI and thus I think it should be treated as an error. This is perhaps another candidate for a WARN considering that pci_irq_vector() returning a '0' for MSI indicates a kernel problem . I now consider taking guidance from pci_irq_get_affinity(). Note that pci_irq_get_affinity() contains: const struct cpumask *pci_irq_get_affinity(struct pci_dev *dev, int nr) { int idx, irq = pci_irq_vector(dev, nr); ... if (WARN_ON_ONCE(irq <= 0)) return NULL; ... } Would you be ok with something like below? diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index b549f5c97cb8..a8e96254f953 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -393,6 +393,8 @@ static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev, u16 cmd; irq = pci_irq_vector(pdev, vector); + if (WARN_ON_ONCE(irq == 0)) + return -EINVAL; if (irq > 0 || !msix || !vdev->has_dyn_msix) return irq; I would prefer that vfio_msi_alloc_irq() returns negative errors. This enables callers to in turn just return the error code on failure (note that dynamic allocation can return different error codes), not needing to translate 0 into an error. Reinette
> From: Chatre, Reinette <reinette.chatre@intel.com> > Sent: Saturday, May 6, 2023 1:21 AM > > Would you be ok with something like below? > > diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c > index b549f5c97cb8..a8e96254f953 100644 > --- a/drivers/vfio/pci/vfio_pci_intrs.c > +++ b/drivers/vfio/pci/vfio_pci_intrs.c > @@ -393,6 +393,8 @@ static int vfio_msi_alloc_irq(struct > vfio_pci_core_device *vdev, > u16 cmd; > > irq = pci_irq_vector(pdev, vector); > + if (WARN_ON_ONCE(irq == 0)) > + return -EINVAL; > if (irq > 0 || !msix || !vdev->has_dyn_msix) > return irq; > > I would prefer that vfio_msi_alloc_irq() returns negative errors. This enables > callers to in turn just return the error code on failure (note that dynamic > allocation can return different error codes), not needing to translate 0 into > an error. > This looks good to me. Thanks.
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Friday, May 5, 2023 11:28 PM > > On Fri, 5 May 2023 08:10:33 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Chatre, Reinette <reinette.chatre@intel.com> > > > Sent: Saturday, April 29, 2023 2:35 AM > > > > > > Hi Kevin, > > > > > > On 4/27/2023 11:50 PM, Tian, Kevin wrote: > > > > > > > Should this behavior be something configurable? > > > > > > This is not clear to me and I look to you for guidance here. From practical > > > side it looks like configuration via module parameters is supported but > > > whether it should be done is not clear to me. > > > > > > When considering this we need to think about what the user may expect > > > when > > > turning on/off the configuration. For example, MSI-X continues to > allocate a > > > range of interrupts during enabling. These have always been treated as a > > > "cache" (interrupts remain allocated, whether they have an associated > > > trigger > > > or not). If there is new configurable behavior, do you expect that the > > > driver needs to distinguish between the original "cache" that the user is > > > used to and the new dynamic allocations? That is, should a dynamic MSI- > X > > > capable device always free interrupts when user space removes an > eventfd > > > or should only interrupts that were allocated dynamically be freed > > > dynamically? > > > > That looks tricky. Probably that is why Alex suggested doing this simple > > scheme and it is on par with the old logic anyway. So I'll withdraw this > > comment. > > Don't forget we're also releasing the irq reservations when the guest > changes interrupt mode, ex. reboot, so the "caching" is really only > within a session of the guest/userspace driver where it would be > unusual to have an unused reservation for an extended period. Thanks, > Yeah, that makes sense.
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index bdda7f46c2be..8340135b09fa 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -372,27 +372,56 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi return 0; } +/* + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector. + * If a Linux IRQ number is not available then a new interrupt will be + * allocated if dynamic MSI-X is supported. + */ +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev, + unsigned int vector, bool msix) +{ + struct pci_dev *pdev = vdev->pdev; + struct msi_map map; + int irq; + u16 cmd; + + irq = pci_irq_vector(pdev, vector); + if (irq > 0 || !msix || !vdev->has_dyn_msix) + return irq; + + cmd = vfio_pci_memory_lock_and_enable(vdev); + map = pci_msix_alloc_irq_at(pdev, vector, NULL); + vfio_pci_memory_unlock_and_restore(vdev, cmd); + + return map.index < 0 ? map.index : map.virq; +} + +/* + * Where is vfio_msi_free_irq() ? + * + * Allocated interrupts are maintained, essentially forming a cache that + * subsequent allocations can draw from. Interrupts are freed using + * pci_free_irq_vectors() when MSI/MSI-X is disabled. + */ + static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev, unsigned int vector, int fd, bool msix) { struct pci_dev *pdev = vdev->pdev; struct vfio_pci_irq_ctx *ctx; struct eventfd_ctx *trigger; - int irq, ret; + int irq = -EINVAL, ret; u16 cmd; - irq = pci_irq_vector(pdev, vector); - if (irq < 0) - return -EINVAL; - ctx = vfio_irq_ctx_get(vdev, vector); if (ctx) { irq_bypass_unregister_producer(&ctx->producer); - + irq = pci_irq_vector(pdev, vector); cmd = vfio_pci_memory_lock_and_enable(vdev); free_irq(irq, ctx->trigger); vfio_pci_memory_unlock_and_restore(vdev, cmd); + /* Interrupt stays allocated, will be freed at MSI-X disable. */ kfree(ctx->name); eventfd_ctx_put(ctx->trigger); vfio_irq_ctx_free(vdev, ctx, vector); @@ -401,6 +430,12 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev, if (fd < 0) return 0; + if (irq == -EINVAL) { + irq = vfio_msi_alloc_irq(vdev, vector, msix); + if (irq < 0) + return irq; + } + ctx = vfio_irq_ctx_alloc(vdev, vector); if (!ctx) return -ENOMEM;