Message ID | 86cda5cf2742feff3b14954284fb509863355050.1681837892.git.reinette.chatre@intel.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp3017867vqo; Tue, 18 Apr 2023 10:31:36 -0700 (PDT) X-Google-Smtp-Source: AKy350YGtbFA262kW+6IIU8zZi4gZn+S/07biXOxHrUooI8XDUn8lc/G0r1YFWXkeoSXw+2jrN4U X-Received: by 2002:a05:6a20:8f29:b0:d9:2d4e:c08c with SMTP id b41-20020a056a208f2900b000d92d4ec08cmr386813pzk.61.1681839096231; Tue, 18 Apr 2023 10:31:36 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1681839096; cv=none; d=google.com; s=arc-20160816; b=adeMjQUAwEUIPLMUoySwZ7p93YtHUmBmBCbT+eJiD6C6SHJZnTlCPG39SKSXCmygOF TcTNP5+ue6Rcj7uguxYN7HjN/z1t7KE3Y+QQ9NlMeVWNr+ZfTlJrLt5m3ueLz+k8A2YA DOl6OFOWht42Lh05aSj54BZp3hZNc0N/bPntmVCTmCAlM1ERh898HOSif9yhQmzIEtd+ /pxkF8MFxG3fY9Ys1/0byeyaJ/Ax9UQ02kiTzL+BjUxmEYqpeb7Aq8li8J5AZ5iDTHKr JPLwmnQUQd8qmY+/iFLFo0U4Gu1zUPePDFs0rUfeVatE2cIYmKsV3jdGPRLn5XCGZJd+ fPCw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=DWasRgYWVlulztPharkaQqtI8SSjpDH+g5u/1tsWCPI=; b=FfkmfemSS8BAcq3IGG828VkOfICPz6/hoTOoqOgma8KeiHARBN2aK1uG7uWOW7+p8h ptvkJhGDXh35YycrJhFelhwEodFiwTVnm9RGhMNQ8RMlqT5lfUMHLzo8PjD+V6V879MF ylnTdTFnOCdL8RvOyS8Nym4GteCL38djg0kgfFzxyplaE7rN96DUHjgF63VTb/dIwDP+ QwwS0bRSwmoehu63Naj8NNIeuPu1O3W1eHzRfSPuZafuuVIZiGXVyL56k8LuUKzosB6t j01KBuAipG453JFwBAd4YMZC0ODjDK89atL4O+FojqnWxd2Bs/VJmahT8W+DEFJqa4KW /ZQw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b="YEuByj/D"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id b18-20020a63eb52000000b00513f15fe82dsi15238982pgk.786.2023.04.18.10.31.21; Tue, 18 Apr 2023 10:31:36 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b="YEuByj/D"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231421AbjDRRaL (ORCPT <rfc822;leviz.kernel.dev@gmail.com> + 99 others); Tue, 18 Apr 2023 13:30:11 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39052 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232578AbjDRR3u (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 18 Apr 2023 13:29:50 -0400 Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 82A3844BC; Tue, 18 Apr 2023 10:29:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1681838989; x=1713374989; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=oMfakM0i/B5Q/gryK7ofBQXHR1G72EZy+CSdZ2o9fNM=; b=YEuByj/DMj3ZucwxUtZ6luDlPfNjLh/sdQpcehKBest7/o6cArsswcXB 4EPSNWcIHrAqoOU7l4vQvajKXu6vT6t9cOMKKndKgcfRVicCNBL0lNcak flPXaAG0CPInV2FloMeRzIfIjTZa/Y5uXZxhIoBdmgu1NVaDRFy5aYK6U ykjJ3+hgYKKxGO5aRtt4kNo7ornnhJUPfU2L4W4AXfcz75uQa9vgB0HrG MOcVsD2qccGOewwFf8/jnYF+BEBjxWjqmQG2YvtV2bkbnBNoD0K8dTqaz e1TXUcPRMmV8ozioKIxG48kbM0Ak5vaKSGxhFU6Cr5hJnxZxu/tn2/bY3 Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10684"; a="410466492" X-IronPort-AV: E=Sophos;i="5.99,207,1677571200"; d="scan'208";a="410466492" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Apr 2023 10:29:43 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10684"; a="865503494" X-IronPort-AV: E=Sophos;i="5.99,207,1677571200"; d="scan'208";a="865503494" Received: from rchatre-ws.ostc.intel.com ([10.54.69.144]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Apr 2023 10:29:42 -0700 From: Reinette Chatre <reinette.chatre@intel.com> To: jgg@nvidia.com, yishaih@nvidia.com, shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com, alex.williamson@redhat.com Cc: tglx@linutronix.de, darwi@linutronix.de, kvm@vger.kernel.org, dave.jiang@intel.com, jing2.liu@intel.com, ashok.raj@intel.com, fenghua.yu@intel.com, tom.zanussi@linux.intel.com, reinette.chatre@intel.com, linux-kernel@vger.kernel.org Subject: [PATCH V3 09/10] vfio/pci: Support dynamic MSI-X Date: Tue, 18 Apr 2023 10:29:20 -0700 Message-Id: <86cda5cf2742feff3b14954284fb509863355050.1681837892.git.reinette.chatre@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <cover.1681837892.git.reinette.chatre@intel.com> References: <cover.1681837892.git.reinette.chatre@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE, URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1763536111749490113?= X-GMAIL-MSGID: =?utf-8?q?1763536111749490113?= |
Series |
vfio/pci: Support dynamic allocation of MSI-X interrupts
|
|
Commit Message
Reinette Chatre
April 18, 2023, 5:29 p.m. UTC
Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
enables an individual MSI-X interrupt to be allocated and freed after
MSI-X enabling.
Use dynamic MSI-X (if supported by the device) to allocate an interrupt
after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
the time a valid eventfd is assigned. This is different behavior from
a range provided during MSI-X enabling where interrupts are allocated
for the entire range whether a valid eventfd is provided for each
interrupt or not.
Do not dynamically free interrupts, leave that to when MSI-X is
disabled.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/lkml/20230403211841.0e206b67.alex.williamson@redhat.com/
---
The get_cached_msi_msg()/pci_write_msi_msg() behavior is kept
similar to the scenario when MSI-X is enabled with triggers
provided for new interrupts. get_cached_msi_msg()/pci_write_msi_msg()
follows for interrupts recently allocated with pci_msix_alloc_irq_at()
just like get_cached_msi_msg()/pci_write_msi_msg() is done for
interrupts recently allocated with pci_alloc_irq_vectors().
Changes since V2:
- Move vfio_irq_ctx_free() to earlier in series to support
earlier usage. (Alex)
- Use consistent terms in changelog: MSI-x changed to MSI-X.
- Make dynamic interrupt context creation generic across all
MSI/MSI-X interrupts. This resulted in code moving to earlier
in series as part of xarray introduction patch. (Alex)
- Remove the local allow_dyn_alloc and direct calling of
pci_msix_can_alloc_dyn(), use the new vdev->has_dyn_msix
introduced earlier instead. (Alex)
- Stop tracking new allocations (remove "new_ctx"). (Alex)
- Introduce new wrapper that returns Linux interrupt number or
dynamically allocate a new interrupt. Wrapper can be used for
all interrupt cases. (Alex)
- Only free dynamic MSI-X interrupts on MSI-X teardown. (Alex)
Changes since RFC V1:
- Add pointer to interrupt context as function parameter to
vfio_irq_ctx_free(). (Alex)
- Initialize new_ctx to false. (Dan Carpenter)
- Only support dynamic allocation if device supports it. (Alex)
drivers/vfio/pci/vfio_pci_intrs.c | 73 +++++++++++++++++++++++++++----
1 file changed, 65 insertions(+), 8 deletions(-)
Comments
On Tue, 18 Apr 2023 10:29:20 -0700 Reinette Chatre <reinette.chatre@intel.com> wrote: > Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq() > enables an individual MSI-X interrupt to be allocated and freed after > MSI-X enabling. > > Use dynamic MSI-X (if supported by the device) to allocate an interrupt > after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at > the time a valid eventfd is assigned. This is different behavior from > a range provided during MSI-X enabling where interrupts are allocated > for the entire range whether a valid eventfd is provided for each > interrupt or not. > > Do not dynamically free interrupts, leave that to when MSI-X is > disabled. But we do, sometimes, even if essentially only on the error path. Is that worthwhile? It seems like we could entirely remove vfio_msi_free_irq() and rely only on pci_free_irq_vectors() on MSI/X teardown. I'd probably also add a comment in the commit log about the theory behind not dynamically freeing irqs, ie. latency, reliability, and whatever else we used to justify it. Thanks, Alex > Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> > Link: https://lore.kernel.org/lkml/20230403211841.0e206b67.alex.williamson@redhat.com/ > --- > The get_cached_msi_msg()/pci_write_msi_msg() behavior is kept > similar to the scenario when MSI-X is enabled with triggers > provided for new interrupts. get_cached_msi_msg()/pci_write_msi_msg() > follows for interrupts recently allocated with pci_msix_alloc_irq_at() > just like get_cached_msi_msg()/pci_write_msi_msg() is done for > interrupts recently allocated with pci_alloc_irq_vectors(). > > Changes since V2: > - Move vfio_irq_ctx_free() to earlier in series to support > earlier usage. (Alex) > - Use consistent terms in changelog: MSI-x changed to MSI-X. > - Make dynamic interrupt context creation generic across all > MSI/MSI-X interrupts. This resulted in code moving to earlier > in series as part of xarray introduction patch. (Alex) > - Remove the local allow_dyn_alloc and direct calling of > pci_msix_can_alloc_dyn(), use the new vdev->has_dyn_msix > introduced earlier instead. (Alex) > - Stop tracking new allocations (remove "new_ctx"). (Alex) > - Introduce new wrapper that returns Linux interrupt number or > dynamically allocate a new interrupt. Wrapper can be used for > all interrupt cases. (Alex) > - Only free dynamic MSI-X interrupts on MSI-X teardown. (Alex) > > Changes since RFC V1: > - Add pointer to interrupt context as function parameter to > vfio_irq_ctx_free(). (Alex) > - Initialize new_ctx to false. (Dan Carpenter) > - Only support dynamic allocation if device supports it. (Alex) > > drivers/vfio/pci/vfio_pci_intrs.c | 73 +++++++++++++++++++++++++++---- > 1 file changed, 65 insertions(+), 8 deletions(-) > > diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c > index bdda7f46c2be..c1a3e224c867 100644 > --- a/drivers/vfio/pci/vfio_pci_intrs.c > +++ b/drivers/vfio/pci/vfio_pci_intrs.c > @@ -372,27 +372,74 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi > return 0; > } > > +/* > + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector. > + * If a Linux IRQ number is not available then a new interrupt will be > + * allocated if dynamic MSI-X is supported. > + */ > +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev, > + unsigned int vector, bool msix) > +{ > + struct pci_dev *pdev = vdev->pdev; > + struct msi_map map; > + int irq; > + u16 cmd; > + > + irq = pci_irq_vector(pdev, vector); > + if (irq > 0 || !msix || !vdev->has_dyn_msix) > + return irq; > + > + cmd = vfio_pci_memory_lock_and_enable(vdev); > + map = pci_msix_alloc_irq_at(pdev, vector, NULL); > + vfio_pci_memory_unlock_and_restore(vdev, cmd); > + > + return map.index < 0 ? map.index : map.virq; > +} > + > +/* > + * Free interrupt if it can be re-allocated dynamically (while MSI-X > + * is enabled). > + */ > +static void vfio_msi_free_irq(struct vfio_pci_core_device *vdev, > + unsigned int vector, bool msix) > +{ > + struct pci_dev *pdev = vdev->pdev; > + struct msi_map map; > + int irq; > + u16 cmd; > + > + if (!msix || !vdev->has_dyn_msix) > + return; > + > + irq = pci_irq_vector(pdev, vector); > + map = (struct msi_map) { .index = vector, .virq = irq }; > + > + if (WARN_ON(irq < 0)) > + return; > + > + cmd = vfio_pci_memory_lock_and_enable(vdev); > + pci_msix_free_irq(pdev, map); > + vfio_pci_memory_unlock_and_restore(vdev, cmd); > +} > + > static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev, > unsigned int vector, int fd, bool msix) > { > struct pci_dev *pdev = vdev->pdev; > struct vfio_pci_irq_ctx *ctx; > struct eventfd_ctx *trigger; > - int irq, ret; > + int irq = -EINVAL, ret; > u16 cmd; > > - irq = pci_irq_vector(pdev, vector); > - if (irq < 0) > - return -EINVAL; > - > ctx = vfio_irq_ctx_get(vdev, vector); > > if (ctx) { > irq_bypass_unregister_producer(&ctx->producer); > - > + irq = pci_irq_vector(pdev, vector); > cmd = vfio_pci_memory_lock_and_enable(vdev); > free_irq(irq, ctx->trigger); > vfio_pci_memory_unlock_and_restore(vdev, cmd); > + /* Interrupt stays allocated, will be freed at MSI-X disable. */ > kfree(ctx->name); > eventfd_ctx_put(ctx->trigger); > vfio_irq_ctx_free(vdev, ctx, vector); > @@ -401,9 +448,17 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev, > if (fd < 0) > return 0; > > + if (irq == -EINVAL) { > + irq = vfio_msi_alloc_irq(vdev, vector, msix); > + if (irq < 0) > + return irq; > + } > + > ctx = vfio_irq_ctx_alloc(vdev, vector); > - if (!ctx) > - return -ENOMEM; > + if (!ctx) { > + ret = -ENOMEM; > + goto out_free_irq; > + } > > ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-msi%s[%d](%s)", > msix ? "x" : "", vector, pci_name(pdev)); > @@ -456,6 +511,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev, > kfree(ctx->name); > out_free_ctx: > vfio_irq_ctx_free(vdev, ctx, vector); > +out_free_irq: > + vfio_msi_free_irq(vdev, vector, msix); > return ret; > } >
Hi Alex, On 4/18/2023 3:38 PM, Alex Williamson wrote: > On Tue, 18 Apr 2023 10:29:20 -0700 > Reinette Chatre <reinette.chatre@intel.com> wrote: > >> Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq() >> enables an individual MSI-X interrupt to be allocated and freed after >> MSI-X enabling. >> >> Use dynamic MSI-X (if supported by the device) to allocate an interrupt >> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at >> the time a valid eventfd is assigned. This is different behavior from >> a range provided during MSI-X enabling where interrupts are allocated >> for the entire range whether a valid eventfd is provided for each >> interrupt or not. >> >> Do not dynamically free interrupts, leave that to when MSI-X is >> disabled. > > But we do, sometimes, even if essentially only on the error path. Is > that worthwhile? It seems like we could entirely remove > vfio_msi_free_irq() and rely only on pci_free_irq_vectors() on MSI/X > teardown. Yes, it is only on the error path where dynamic MSI-X interrupts are removed. I do not know how to determine if it is worthwhile. On the kernel side failure seems unlikely since it would mean memory cannot be allocated or request_irq() failed. In these cases it may not be worthwhile since user space may try again and having the interrupt already allocated would be helpful. The remaining error seems to be if user space provided an invalid eventfd. An allocation in response to wrong user input is a concern to me. Should we consider buggy/malicious users? I am uncertain here so would defer to your guidance. > I'd probably also add a comment in the commit log about the theory > behind not dynamically freeing irqs, ie. latency, reliability, and > whatever else we used to justify it. Thanks, Sure. How about something like below to replace the final sentence of the changelog: "When a guest disables an interrupt, user space (Qemu) does not disable the interrupt but instead assigns it a different trigger. A common flow is thus for the VFIO_DEVICE_SET_IRQS ioctl() to be used to assign a new eventfd to an already enabled interrupt. Freeing and re-allocating an interrupt in this scenario would add unnecessary latency as well as uncertainty since the re-allocation may fail. Do not dynamically free interrupts when an interrupt is disabled, instead support a subsequent re-enable to draw from the initial allocation when possible. Leave freeing of interrupts to when MSI-X is disabled." Reinette
On Wed, 19 Apr 2023 11:13:29 -0700 Reinette Chatre <reinette.chatre@intel.com> wrote: > Hi Alex, > > On 4/18/2023 3:38 PM, Alex Williamson wrote: > > On Tue, 18 Apr 2023 10:29:20 -0700 > > Reinette Chatre <reinette.chatre@intel.com> wrote: > > > >> Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq() > >> enables an individual MSI-X interrupt to be allocated and freed after > >> MSI-X enabling. > >> > >> Use dynamic MSI-X (if supported by the device) to allocate an interrupt > >> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at > >> the time a valid eventfd is assigned. This is different behavior from > >> a range provided during MSI-X enabling where interrupts are allocated > >> for the entire range whether a valid eventfd is provided for each > >> interrupt or not. > >> > >> Do not dynamically free interrupts, leave that to when MSI-X is > >> disabled. > > > > But we do, sometimes, even if essentially only on the error path. Is > > that worthwhile? It seems like we could entirely remove > > vfio_msi_free_irq() and rely only on pci_free_irq_vectors() on MSI/X > > teardown. > > Yes, it is only on the error path where dynamic MSI-X interrupts are > removed. I do not know how to determine if it is worthwhile. On the > kernel side failure seems unlikely since it would mean memory cannot > be allocated or request_irq() failed. In these cases it may not be > worthwhile since user space may try again and having the interrupt > already allocated would be helpful. The remaining error seems to be > if user space provided an invalid eventfd. An allocation in response > to wrong user input is a concern to me. Should we consider > buggy/malicious users? I am uncertain here so would defer to your > guidance. I don't really see that a malicious user can exploit anything here, their irq allocation is bound by the device support and they're entitled to make use of the full vector set of the device by virtue of having ownership of the device. All the MSI-X allocated irqs are freed when the interrupt mode is changed or the device is released regardless. The end result is also no different than if the user had not configured the vector when enabling MSI-X or configured it and later de-configured with a -1 eventfd. The irq is allocated but not attached to a ctx. We're intentionally using this as a cache. Also, as implemented here in v3, we might be freeing from the original allocation rather than a new, dynamically allocated irq. My thinking is that if we think there's a benefit to caching any allocated irqs, we should do so consistently. We don't currently know if the irq was allocated now or previously. Tracking that would add complexity for little benefit. The user can get to the same end result of an allocated, unused irq in numerous way, the state itself is not erroneous, and is actually in support of caching irq allocations. Removing the de-allocation of a single vector further simplifies the code as there exists only one path where irqs are freed, ie. pci_free_irq_vectors(). So I'd lean towards removing vfio_msi_free_irq(). > > I'd probably also add a comment in the commit log about the theory > > behind not dynamically freeing irqs, ie. latency, reliability, and > > whatever else we used to justify it. Thanks, > > Sure. How about something like below to replace the final sentence of > the changelog: > > "When a guest disables an interrupt, user space (Qemu) does not > disable the interrupt but instead assigns it a different trigger. A > common flow is thus for the VFIO_DEVICE_SET_IRQS ioctl() to be > used to assign a new eventfd to an already enabled interrupt. Freeing > and re-allocating an interrupt in this scenario would add unnecessary > latency as well as uncertainty since the re-allocation may fail. Do > not dynamically free interrupts when an interrupt is disabled, instead > support a subsequent re-enable to draw from the initial allocation when > possible. Leave freeing of interrupts to when MSI-X is disabled." There are other means besides caching irqs that could achieve the above, for instance if a trigger is simply swapped from one eventfd to another, that all happens within vfio_msi_set_vector_signal() where we could hold the irq for the transition. I think I might justify it as: The PCI-MSIX API requires that some number of irqs are allocated for an initial set of vectors when enabling MSI-X on the device. When dynamic MSIX allocation is not supported, the vector table, and thus the allocated irq set can only be resized by disabling and re-enabling MSIX with a different range. In that case the irq allocation is essentially a cache for configuring vectors within the previously allocated vector range. When dynamic MSIX allocation is supported, the API still requires some initial set of irqs to be allocated, but also supports allocating and freeing specific irq vectors both within and beyond the initially allocated range. For consistency between modes, as well as to reduce latency and improve reliability of allocations, and also simplicity, this implementation only releases irqs via pci_free_irq_vectors() when either the interrupt mode changes or the device is released. Does that cover the key points for someone that might want to revisit this decision later? Thanks, Alex
Hi Alex, On 4/19/2023 2:38 PM, Alex Williamson wrote: > On Wed, 19 Apr 2023 11:13:29 -0700 > Reinette Chatre <reinette.chatre@intel.com> wrote: >> On 4/18/2023 3:38 PM, Alex Williamson wrote: >>> On Tue, 18 Apr 2023 10:29:20 -0700 >>> Reinette Chatre <reinette.chatre@intel.com> wrote: >>> >>>> Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq() >>>> enables an individual MSI-X interrupt to be allocated and freed after >>>> MSI-X enabling. >>>> >>>> Use dynamic MSI-X (if supported by the device) to allocate an interrupt >>>> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at >>>> the time a valid eventfd is assigned. This is different behavior from >>>> a range provided during MSI-X enabling where interrupts are allocated >>>> for the entire range whether a valid eventfd is provided for each >>>> interrupt or not. >>>> >>>> Do not dynamically free interrupts, leave that to when MSI-X is >>>> disabled. >>> >>> But we do, sometimes, even if essentially only on the error path. Is >>> that worthwhile? It seems like we could entirely remove >>> vfio_msi_free_irq() and rely only on pci_free_irq_vectors() on MSI/X >>> teardown. >> >> Yes, it is only on the error path where dynamic MSI-X interrupts are >> removed. I do not know how to determine if it is worthwhile. On the >> kernel side failure seems unlikely since it would mean memory cannot >> be allocated or request_irq() failed. In these cases it may not be >> worthwhile since user space may try again and having the interrupt >> already allocated would be helpful. The remaining error seems to be >> if user space provided an invalid eventfd. An allocation in response >> to wrong user input is a concern to me. Should we consider >> buggy/malicious users? I am uncertain here so would defer to your >> guidance. > > I don't really see that a malicious user can exploit anything here, > their irq allocation is bound by the device support and they're > entitled to make use of the full vector set of the device by virtue of > having ownership of the device. All the MSI-X allocated irqs are freed > when the interrupt mode is changed or the device is released regardless. > > The end result is also no different than if the user had not configured > the vector when enabling MSI-X or configured it and later de-configured > with a -1 eventfd. The irq is allocated but not attached to a ctx. > We're intentionally using this as a cache. > > Also, as implemented here in v3, we might be freeing from the original > allocation rather than a new, dynamically allocated irq. Great point. > > My thinking is that if we think there's a benefit to caching any > allocated irqs, we should do so consistently. We don't currently know > if the irq was allocated now or previously. Tracking that would add > complexity for little benefit. The user can get to the same end result > of an allocated, unused irq in numerous way, the state itself is not > erroneous, and is actually in support of caching irq allocations. > Removing the de-allocation of a single vector further simplifies the > code as there exists only one path where irqs are freed, ie. > pci_free_irq_vectors(). > > So I'd lean towards removing vfio_msi_free_irq(). Thank you for your detailed analysis. I understand and agree. I will remove vfio_msi_free_irq(). > >>> I'd probably also add a comment in the commit log about the theory >>> behind not dynamically freeing irqs, ie. latency, reliability, and >>> whatever else we used to justify it. Thanks, >> >> Sure. How about something like below to replace the final sentence of >> the changelog: >> >> "When a guest disables an interrupt, user space (Qemu) does not >> disable the interrupt but instead assigns it a different trigger. A >> common flow is thus for the VFIO_DEVICE_SET_IRQS ioctl() to be >> used to assign a new eventfd to an already enabled interrupt. Freeing >> and re-allocating an interrupt in this scenario would add unnecessary >> latency as well as uncertainty since the re-allocation may fail. Do >> not dynamically free interrupts when an interrupt is disabled, instead >> support a subsequent re-enable to draw from the initial allocation when >> possible. Leave freeing of interrupts to when MSI-X is disabled." > > There are other means besides caching irqs that could achieve the above, > for instance if a trigger is simply swapped from one eventfd to another, > that all happens within vfio_msi_set_vector_signal() where we could > hold the irq for the transition. > > I think I might justify it as: > > The PCI-MSIX API requires that some number of irqs are > allocated for an initial set of vectors when enabling MSI-X on > the device. When dynamic MSIX allocation is not supported, the > vector table, and thus the allocated irq set can only be resized > by disabling and re-enabling MSIX with a different range. In > that case the irq allocation is essentially a cache for > configuring vectors within the previously allocated vector > range. When dynamic MSIX allocation is supported, the API > still requires some initial set of irqs to be allocated, but > also supports allocating and freeing specific irq vectors both > within and beyond the initially allocated range. > > For consistency between modes, as well as to reduce latency and > improve reliability of allocations, and also simplicity, this > implementation only releases irqs via pci_free_irq_vectors() > when either the interrupt mode changes or the device is > released. > > Does that cover the key points for someone that might want to revisit > this decision later? Thanks, It does so clearly, yes. Thank you so much for taking the time to write this. I will include it in the changelog. Reinette
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index bdda7f46c2be..c1a3e224c867 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -372,27 +372,74 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi return 0; } +/* + * Return Linux IRQ number of an MSI or MSI-X device interrupt vector. + * If a Linux IRQ number is not available then a new interrupt will be + * allocated if dynamic MSI-X is supported. + */ +static int vfio_msi_alloc_irq(struct vfio_pci_core_device *vdev, + unsigned int vector, bool msix) +{ + struct pci_dev *pdev = vdev->pdev; + struct msi_map map; + int irq; + u16 cmd; + + irq = pci_irq_vector(pdev, vector); + if (irq > 0 || !msix || !vdev->has_dyn_msix) + return irq; + + cmd = vfio_pci_memory_lock_and_enable(vdev); + map = pci_msix_alloc_irq_at(pdev, vector, NULL); + vfio_pci_memory_unlock_and_restore(vdev, cmd); + + return map.index < 0 ? map.index : map.virq; +} + +/* + * Free interrupt if it can be re-allocated dynamically (while MSI-X + * is enabled). + */ +static void vfio_msi_free_irq(struct vfio_pci_core_device *vdev, + unsigned int vector, bool msix) +{ + struct pci_dev *pdev = vdev->pdev; + struct msi_map map; + int irq; + u16 cmd; + + if (!msix || !vdev->has_dyn_msix) + return; + + irq = pci_irq_vector(pdev, vector); + map = (struct msi_map) { .index = vector, .virq = irq }; + + if (WARN_ON(irq < 0)) + return; + + cmd = vfio_pci_memory_lock_and_enable(vdev); + pci_msix_free_irq(pdev, map); + vfio_pci_memory_unlock_and_restore(vdev, cmd); +} + static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev, unsigned int vector, int fd, bool msix) { struct pci_dev *pdev = vdev->pdev; struct vfio_pci_irq_ctx *ctx; struct eventfd_ctx *trigger; - int irq, ret; + int irq = -EINVAL, ret; u16 cmd; - irq = pci_irq_vector(pdev, vector); - if (irq < 0) - return -EINVAL; - ctx = vfio_irq_ctx_get(vdev, vector); if (ctx) { irq_bypass_unregister_producer(&ctx->producer); - + irq = pci_irq_vector(pdev, vector); cmd = vfio_pci_memory_lock_and_enable(vdev); free_irq(irq, ctx->trigger); vfio_pci_memory_unlock_and_restore(vdev, cmd); + /* Interrupt stays allocated, will be freed at MSI-X disable. */ kfree(ctx->name); eventfd_ctx_put(ctx->trigger); vfio_irq_ctx_free(vdev, ctx, vector); @@ -401,9 +448,17 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev, if (fd < 0) return 0; + if (irq == -EINVAL) { + irq = vfio_msi_alloc_irq(vdev, vector, msix); + if (irq < 0) + return irq; + } + ctx = vfio_irq_ctx_alloc(vdev, vector); - if (!ctx) - return -ENOMEM; + if (!ctx) { + ret = -ENOMEM; + goto out_free_irq; + } ctx->name = kasprintf(GFP_KERNEL_ACCOUNT, "vfio-msi%s[%d](%s)", msix ? "x" : "", vector, pci_name(pdev)); @@ -456,6 +511,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev, kfree(ctx->name); out_free_ctx: vfio_irq_ctx_free(vdev, ctx, vector); +out_free_irq: + vfio_msi_free_irq(vdev, vector, msix); return ret; }