Message ID | 20221111135206.463650635@linutronix.de |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:6687:0:0:0:0:0 with SMTP id l7csp779854wru; Fri, 11 Nov 2022 06:43:44 -0800 (PST) X-Google-Smtp-Source: AA0mqf75h/v2h7RDJK/99cV89QsdScShzA2VLeb5x0Co/ZptGLKUR8Y651O5r/2emH6yVejZYCBF X-Received: by 2002:a17:90a:8990:b0:212:dc2f:7ee7 with SMTP id v16-20020a17090a899000b00212dc2f7ee7mr2161379pjn.172.1668177823834; Fri, 11 Nov 2022 06:43:43 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1668177823; cv=none; d=google.com; s=arc-20160816; b=ETsWx3I6lxyjL+omdSsuLnTx2hLQWAsgQI6c+SK5q5TaRY/Qm60uCYqQl5pSd3bYPK qJfpyeq0zr07oZyhwUDwTTXt64Nx3bNsABjnNO1aacMPpJRo9SGkHstDEoUk/mANrA5d 6AGd1XIJC3HwCANFYym3a7lz0xxuX5K7zGn5KnLvzQfYLBVPAUNrYx3H0Xu9ZepZv454 eQ01Pakd+yFYYwtefqPLBwUnlX9kVWpPWiv1/pU31XVOkdIznrq1eUsqQJLyKZLHnzoq BBoATbtXTReGhAJiaqTlLx6xajyTDu0al+4d68mHV79sqS6ADbFBEu1aFxpCdpeM4pya kBPw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:date:mime-version:references:subject:cc:to:from :dkim-signature:dkim-signature:message-id; bh=fxLPSxIR9ZSTFaDUWsnR87RJL6z40jyNrKWhIshWH24=; b=E2sxfZBXLhwpDC1nlB58dPLMGQu1WmY9ORbHTYgrlOi25xMh+5uA5eo6BzCVjad+si 4dUjF0LRbS9bj5MbNoi46bD9AnmB4U8GvENd4jfXz+pTYmRBYP+AEgfIP9H7FyoFLIUw wYBqVFAMo5/5h9dLS7tGeeV60uEn7Ip3hAagBmRlY2QlPwGeFwEI6XCDpCQQmnJjBsuJ thwU9GaRDpFU6/VDivhc4UXiuuPvvJKY6keZzY6MvRzV32qvvmPUnxeNcgnh4cBuQZqZ LwoqUiRiAbXXP8VpmRNdSSEIxetcBDm632pgp8KLZpR4zDqepGfJO8L9xzGYKmOMPKTp u7Qw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=gr6TbLeA; dkim=neutral (no key) header.i=@linutronix.de header.b=ATFUOEfB; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id h4-20020a056a00170400b0056334b72215si2808282pfc.86.2022.11.11.06.43.28; Fri, 11 Nov 2022 06:43:43 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=gr6TbLeA; dkim=neutral (no key) header.i=@linutronix.de header.b=ATFUOEfB; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234555AbiKKOJP (ORCPT <rfc822;winker.wchi@gmail.com> + 99 others); Fri, 11 Nov 2022 09:09:15 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52630 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234717AbiKKOH2 (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 11 Nov 2022 09:07:28 -0500 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 689EB8D7DF; Fri, 11 Nov 2022 06:00:09 -0800 (PST) Message-ID: <20221111135206.463650635@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1668175125; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=fxLPSxIR9ZSTFaDUWsnR87RJL6z40jyNrKWhIshWH24=; b=gr6TbLeAglJQA4CkKIu8La2LR+j4NvxNLSxGDetxfKPxYNQav72vYnviCrYCDLCAa9Atq5 1jV4Z+lPfKTkbz/YKiyiLdoovniJ8XpJrfOGpOJXWb68ygvUN3hYt/8IDwpLQBVfmFkfzp iCpcWLaMsa1W5cbBivMyZD4Z77BxlB4F5+EqAc8eacAku1E38KI/LjpBCc5DIyEO9vWkAW BD+L0znlUIDFS4DD2QNC2aIBPxziZnKVPlidD9PdefTVwuywXjk+a+igZCRDnTEkHbuTTS W00hpg+Y9axfBokUSrSmTvGXSofS3/FYJ47CC+LZrIMwJ2A0Mci5BkuVS+So6g== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1668175125; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=fxLPSxIR9ZSTFaDUWsnR87RJL6z40jyNrKWhIshWH24=; b=ATFUOEfBPF+bLWCFt9RheygXADG6b2cWwhn4hwR4Ev/sGKw76fIcEoGvjTj7FZE+F0Sv9u PI37+4wDpNkeCiBQ== From: Thomas Gleixner <tglx@linutronix.de> To: LKML <linux-kernel@vger.kernel.org> Cc: x86@kernel.org, Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>, linux-pci@vger.kernel.org, Bjorn Helgaas <bhelgaas@google.com>, Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>, Marc Zyngier <maz@kernel.org>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Jason Gunthorpe <jgg@mellanox.com>, Dave Jiang <dave.jiang@intel.com>, Alex Williamson <alex.williamson@redhat.com>, Kevin Tian <kevin.tian@intel.com>, Dan Williams <dan.j.williams@intel.com>, Logan Gunthorpe <logang@deltatee.com>, Ashok Raj <ashok.raj@intel.com>, Jon Mason <jdmason@kudzu.us>, Allen Hubbe <allenbh@gmail.com>, "Ahmed S. Darwish" <darwi@linutronix.de>, Reinette Chatre <reinette.chatre@intel.com> Subject: [patch 21/33] genirq/msi: Provide msi_domain_alloc_irq_at() References: <20221111133158.196269823@linutronix.de> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Date: Fri, 11 Nov 2022 14:58:44 +0100 (CET) X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE, SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1749211230123732790?= X-GMAIL-MSGID: =?utf-8?q?1749211230123732790?= |
Series |
genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 3 implementation
|
|
Commit Message
Thomas Gleixner
Nov. 11, 2022, 1:58 p.m. UTC
For supporting post MSI-X enable allocations and for the upcoming PCI/IMS
support a seperate interface is required which allows not only the
allocation of a specific index, but also the allocation of any, i.e. the
next free index. The latter is especially required for IMS because IMS
completely does away with index to functionality mappings which are
often found in MSI/MSI-X implementation.
But even with MSI-X there are devices where only the first few indices have
a fixed functionality and the rest is freely assignable by software,
e.g. to queues.
msi_domain_alloc_irq_at() is also different from the range based interfaces
as it always enforces that the MSI descriptor is allocated by the core code
and not preallocated by the caller like the PCI/MSI[-X] enable code path
does.
msi_domain_alloc_irq_at() can be invoked with the index argument set to
MSI_ANY_INDEX which makes the core code pick the next free index. The irq
domain can provide a prepare_desc() operation callback in its
msi_domain_ops to do domain specific post allocation initialization before
the actual Linux interrupt and the associated interrupt descriptor and
hierarchy alloccations are conducted.
The function also takes an optional @cookie argument which is of type union
msi_dev_cookie. This cookie is not used by the core code and is stored in
the allocated msi_desc::data::cookie. The meaning of the cookie is
completely implementation defined. In case of IMS this might be a PASID or
a pointer to a device queue, but for the MSI core it's opaque and not used
in any way.
The function returns a struct msi_map which on success contains the
allocated index number and the Linux interrupt number so the caller can
spare the index to Linux interrupt number lookup.
On failure map::index contains the error code and map::virq is 0.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
include/linux/msi.h | 4 +
include/linux/msi_api.h | 7 +++
kernel/irq/msi.c | 105 ++++++++++++++++++++++++++++++++++++++++++------
3 files changed, 105 insertions(+), 11 deletions(-)
Comments
On Fri, Nov 11, 2022 at 02:58:44PM +0100, Thomas Gleixner wrote: > For supporting post MSI-X enable allocations and for the upcoming PCI/IMS > support a seperate interface is required which allows not only the > allocation of a specific index, but also the allocation of any, i.e. the > next free index. The latter is especially required for IMS because IMS > completely does away with index to functionality mappings which are > often found in MSI/MSI-X implementation. > > But even with MSI-X there are devices where only the first few indices have > a fixed functionality and the rest is freely assignable by software, > e.g. to queues. > > msi_domain_alloc_irq_at() is also different from the range based interfaces > as it always enforces that the MSI descriptor is allocated by the core code > and not preallocated by the caller like the PCI/MSI[-X] enable code path > does. > > msi_domain_alloc_irq_at() can be invoked with the index argument set to > MSI_ANY_INDEX which makes the core code pick the next free index. The irq > domain can provide a prepare_desc() operation callback in its > msi_domain_ops to do domain specific post allocation initialization before > the actual Linux interrupt and the associated interrupt descriptor and > hierarchy alloccations are conducted. > > The function also takes an optional @cookie argument which is of type union > msi_dev_cookie. This cookie is not used by the core code and is stored in > the allocated msi_desc::data::cookie. The meaning of the cookie is > completely implementation defined. In case of IMS this might be a PASID or > a pointer to a device queue, but for the MSI core it's opaque and not used > in any way. To my mind it makes more sense to pass a 'void *' through from msi_domain_alloc_irq_at() to the prepare_desc() op with the idea that the driver calling msi_domain_alloc_irq_at() knows it is calling it against the domain that it allocated. The prepare_desc can then use the void * to properly initialize anything about the desc under the right lock. Before calling this the driver should have setup whatever thing is going to originate the interrupt, eg allocated the HW object that sources the interrupt and part of what the void * would convey is the detailed information on how to program the HW object. eg IDXD is using an iobase and an offset along with the enforcing PASID, but something like mlx5 would probably want an object id, type, and SF ID. This is again where I don't much like the use of an ID to refer to the domain. Having the driver allocate the device domain, retain a pointer to it, and use that domain pointer with all these new APIs seems much clearer than converting the pointer to an ID. Jason
On Wed, Nov 16 2022 at 15:36, Jason Gunthorpe wrote: > On Fri, Nov 11, 2022 at 02:58:44PM +0100, Thomas Gleixner wrote: >> The function also takes an optional @cookie argument which is of type union >> msi_dev_cookie. This cookie is not used by the core code and is stored in >> the allocated msi_desc::data::cookie. The meaning of the cookie is >> completely implementation defined. In case of IMS this might be a PASID or >> a pointer to a device queue, but for the MSI core it's opaque and not used >> in any way. > > To my mind it makes more sense to pass a 'void *' through from > msi_domain_alloc_irq_at() to the prepare_desc() op with the idea that > the driver calling msi_domain_alloc_irq_at() knows it is calling it > against the domain that it allocated. The prepare_desc can then use > the void * to properly initialize anything about the desc under the > right lock. You are looking at it from one particular use case. > Before calling this the driver should have setup whatever thing is > going to originate the interrupt, eg allocated the HW object that > sources the interrupt and part of what the void * would convey is the > detailed information on how to program the HW object. eg IDXD is using > an iobase and an offset along with the enforcing PASID, but something > like mlx5 would probably want an object id, type, and SF ID. Correct, and that's why the cookie is there. You can stash your pointer into the cookie and an IDXD user stores the PASID. The IDXD user which allocates an interrupt does not even know about iobase and offset. It does neither care about what the IDXD irq domain implementation does with that cookie. Neither should your queue code care. The queue driver code puts a pointer to struct mlx5_voodoo into the cookie when allocating the interrupt and then the mlx5 irqdomain code which is a complete separate entity gets this cookie handed into prepare_desc(). struct mlx5_voodoo contains all information for the irq domain code to set up the necessary things in the queue. That must be obviously a contract between the queue code and the irqdomain code but that's not any different than MSI or MSI-X. The only difference is that in the IMS case the contract is per device and not codified in a standard. > This is again where I don't much like the use of an ID to refer to the > domain. > > Having the driver allocate the device domain, retain a pointer to it, > and use that domain pointer with all these new APIs seems much clearer > than converting the pointer to an ID. You're really obsessed about this irqdomain pointer, right? You have to differentiate between the irq domain implementation and the actual usage sites and not conflate them into one thing. Let's look at the usage site: struct cookie cookie = { .ptr = mymagicqueue, } pci_ims_alloc_irq(pci_dev, &cookie); versus: struct cookie cookie = { .ptr = mymagicqueue, } ims_alloc_irq(&pci_dev->dev, mydev->ims_domain, &cookie); Even in the unlikely case that we have more than two domains, then still the usage site has zero interest in the domain pointer: struct cookie cookie = { .ptr = mymagicqueue, } pci_ims_alloc_irq(pci_dev, myqueue->domid, &cookie); where the code which instantiates myqueue sets up domid. The usage site has absolutely no business to touch irqdomain pointer or to even know that one exists. All it needs to know is how the cookie contract works, obviously. Now the functions you need in your irqdomain implementation to e.g. prepare the MSI descriptor surely need to know about the irqdomain pointer, but that gets handed in from the allocation code so the prepare function knows which instance it is operating on. So what does the irqdomain pointer buy you? Exactly nothing! Look at the IDXD reference implementation. The IDXD probe code which initializes the physical device instantiates the irq domain along with the iobase for the storage array. The actual queue (or whatever IDXD names it) setup code just sticks PASID into the cookie and allocates an interrupt. It gets a virtual irq number and requests the interrupt. Where is the need for a pointer? The queue code does not even know about the iobase of the storage array. It's completely irrelevant there. All it has to know is the cookie contract, not more. Let's take you pointer obsession to the extreme: struct irq_desc *desc = pci_alloc_msix_interrupt(pci_dev); request_irq(desc, handler, pci_dev); versus: int virq = pci_alloc_msix_interrupt(pci_dev); request_irq(virq, handler, pci_dev); You could argue the same way that there is no need for a Linux interrupt number and we could just use the interrupt descriptor pointer. Sure, you can do that, but then you violate _all_ encapsulation rules in one go for absolutely _ZERO_ value. Want another example based on kmalloc()? Almost 20 years ago I did a treewide mopup of drivers which decided that they need to fiddle in the irq descriptor for the very wrong reasons. I had to do that to be able to do a trivial change in the core code... C is patently bad for encapsulation, but you can make it worse by forcefully ignoring the design patterns which allow to completely hide implementation details of a subsystem or infrastructure. If you look at the last commit in the ARM part of this work: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/commit/?h=devmsi-arm&id=96c97746cbb431a306e95c04d6b3c75751244716 then you can see the final move to remove the visibility of the MSI management internals. This makes it possible to completely overhaul the inner workings of the MSI core without having to chase abuse all over the place. Thanks, tglx
Hi Thomas, I am trying all three parts of this work out with some experimental code within the IDXD driver that attempts to use IMS on the host. In the test, pci_ims_alloc_irq() always encounters -EBUSY and it seems that there is an attempt to insert the struct msi_desc into the xarray twice, the second attempt encountering the -EBUSY. While trying to understand what is going on I found myself looking at this code area and I'll annotate this patch with what I learned. On 11/11/2022 5:58 AM, Thomas Gleixner wrote: ... > --- a/kernel/irq/msi.c > +++ b/kernel/irq/msi.c > @@ -39,6 +39,7 @@ static inline int msi_sysfs_create_group > /* Invalid XA index which is outside of any searchable range */ > #define MSI_XA_MAX_INDEX (ULONG_MAX - 1) > #define MSI_XA_DOMAIN_SIZE (MSI_MAX_INDEX + 1) > +#define MSI_ANY_INDEX UINT_MAX > > static inline void msi_setup_default_irqdomain(struct device *dev, struct msi_device_data *md) > { > @@ -126,18 +127,34 @@ static int msi_insert_desc(struct device When calling pci_ims_alloc_irq(), msi_insert_desc() ends up being called twice, first with index = MSI_ANY_INDEX, second with index = 0. (domid = 1 both times) > } > > hwsize = msi_domain_get_hwsize(dev, domid); > - if (index >= hwsize) { > - ret = -ERANGE; > - goto fail; > - } > > - desc->msi_index = index; > - index += baseidx; > - ret = xa_insert(&md->__store, index, desc, GFP_KERNEL); > - if (ret) > - goto fail; > - return 0; > + if (index == MSI_ANY_INDEX) { > + struct xa_limit limit; > + unsigned int index; > + > + limit.min = baseidx; > + limit.max = baseidx + hwsize - 1; > > + /* Let the xarray allocate a free index within the limits */ > + ret = xa_alloc(&md->__store, &index, desc, limit, GFP_KERNEL); > + if (ret) > + goto fail; > + This path (index == MSI_ANY_INDEX) is followed when msi_insert_desc() is called the first time and the xa_alloc() succeeds at index 65536. > + desc->msi_index = index; This is problematic with desc->msi_index being a u16, assigning 65536 to it becomes 0. > + return 0; > + } else { > + if (index >= hwsize) { > + ret = -ERANGE; > + goto fail; > + } > + > + desc->msi_index = index; > + index += baseidx; > + ret = xa_insert(&md->__store, index, desc, GFP_KERNEL); > + if (ret) > + goto fail; This "else" path is followed when msi_insert_desc() is called the second time with "index = 0". The xa_insert() above fails at index 65536 (baseidx = 65536) with -EBUSY, trickling up as the return code to pci_ims_alloc_irq(). > + return 0; > + } > fail: > msi_free_desc(desc); > return ret; > @@ -335,7 +352,7 @@ int msi_setup_device_data(struct device > > msi_setup_default_irqdomain(dev, md); > > - xa_init(&md->__store); > + xa_init_flags(&md->__store, XA_FLAGS_ALLOC); > mutex_init(&md->mutex); > md->__iter_idx = MSI_XA_MAX_INDEX; > dev->msi.data = md; > @@ -1423,6 +1440,72 @@ int msi_domain_alloc_irqs_all_locked(str > return msi_domain_alloc_locked(dev, &ctrl); > } > > +/** > + * msi_domain_alloc_irq_at - Allocate an interrupt from a MSI interrupt domain at > + * a given index - or at the next free index > + * > + * @dev: Pointer to device struct of the device for which the interrupts > + * are allocated > + * @domid: Id of the interrupt domain to operate on > + * @index: Index for allocation. If @index == %MSI_ANY_INDEX the allocation > + * uses the next free index. > + * @affdesc: Optional pointer to an interrupt affinity descriptor structure > + * @cookie: Optional pointer to a descriptor specific cookie to be stored > + * in msi_desc::data. Must be NULL for MSI-X allocations > + * > + * This requires a MSI interrupt domain which lets the core code manage the > + * MSI descriptors. > + * > + * Return: struct msi_map > + * > + * On success msi_map::index contains the allocated index number and > + * msi_map::virq the corresponding Linux interrupt number > + * > + * On failure msi_map::index contains the error code and msi_map::virq > + * is %0. > + */ > +struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, unsigned int index, > + const struct irq_affinity_desc *affdesc, > + union msi_dev_cookie *cookie) > +{ > + struct irq_domain *domain; > + struct msi_map map = { }; > + struct msi_desc *desc; > + int ret; > + > + msi_lock_descs(dev); > + domain = msi_get_device_domain(dev, domid); > + if (!domain) { > + map.index = -ENODEV; > + goto unlock; > + } > + > + desc = msi_alloc_desc(dev, 1, affdesc); > + if (!desc) { > + map.index = -ENOMEM; > + goto unlock; > + } > + > + if (cookie) > + desc->data.cookie = *cookie; > + > + ret = msi_insert_desc(dev, desc, domid, index); > + if (ret) { > + map.index = ret; > + goto unlock; > + } Above is the first call to msi_insert_desc(/* index = MSI_ANY_INDEX */) > + > + map.index = desc->msi_index; msi_insert_desc() did attempt to set desc->msi_index to 65536 but map.index ends up being 0. > + ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index); Here is where the second call to msi_insert_desc() originates: msi_domain_alloc_irqs_range_locked() -> msi_domain_alloc_locked() -> \ __msi_domain_alloc_locked() -> msi_domain_alloc_simple_msi_descs() -> \ msi_domain_add_simple_msi_descs() -> msi_insert_desc() > + if (ret) > + map.index = ret; > + else > + map.virq = desc->irq; > +unlock: > + msi_unlock_descs(dev); > + return map; > +} > + > static void __msi_domain_free_irqs(struct device *dev, struct irq_domain *domain, > struct msi_ctrl *ctrl) > { > Reinette
On Thu, Nov 17 2022 at 15:33, Reinette Chatre wrote: > I am trying all three parts of this work out with some experimental > code within the IDXD driver that attempts to use IMS on the host. > > In the test, pci_ims_alloc_irq() always encounters -EBUSY and it > seems that there is an attempt to insert the struct msi_desc into > the xarray twice, the second attempt encountering the -EBUSY. > > While trying to understand what is going on I found myself looking > at this code area and I'll annotate this patch with what I learned. Ok. > When calling pci_ims_alloc_irq(), msi_insert_desc() ends up being > called twice, first with index = MSI_ANY_INDEX, second with index = 0. > (domid = 1 both times) How so? >> } >> >> hwsize = msi_domain_get_hwsize(dev, domid); >> - if (index >= hwsize) { >> - ret = -ERANGE; >> - goto fail; >> - } >> >> - desc->msi_index = index; >> - index += baseidx; >> - ret = xa_insert(&md->__store, index, desc, GFP_KERNEL); >> - if (ret) >> - goto fail; >> - return 0; >> + if (index == MSI_ANY_INDEX) { >> + struct xa_limit limit; >> + unsigned int index; >> + >> + limit.min = baseidx; >> + limit.max = baseidx + hwsize - 1; >> >> + /* Let the xarray allocate a free index within the limits */ >> + ret = xa_alloc(&md->__store, &index, desc, limit, GFP_KERNEL); >> + if (ret) >> + goto fail; >> + > > This path (index == MSI_ANY_INDEX) is followed when msi_insert_desc() > is called the first time and the xa_alloc() succeeds at index 65536. > >> + desc->msi_index = index; > > This is problematic with desc->msi_index being a u16, assigning > 65536 to it becomes 0. You are partially right. I need to fix that and make it explicit as it's a "works by chance or maybe not" construct right now. But desc->msi_index is correct to be truncated because it's the index within the domain space which is zero based. >> + return 0; >> + } else { >> + if (index >= hwsize) { >> + ret = -ERANGE; >> + goto fail; >> + } >> + >> + desc->msi_index = index; >> + index += baseidx; >> + ret = xa_insert(&md->__store, index, desc, GFP_KERNEL); >> + if (ret) >> + goto fail; > > This "else" path is followed when msi_insert_desc() is called the second > time with "index = 0". The xa_insert() above fails at index 65536 > (baseidx = 65536) with -EBUSY, trickling up as the return code to > pci_ims_alloc_irq(). Why is it called with index=0 the second time? >> + desc = msi_alloc_desc(dev, 1, affdesc); >> + if (!desc) { >> + map.index = -ENOMEM; >> + goto unlock; >> + } >> + >> + if (cookie) >> + desc->data.cookie = *cookie; >> + >> + ret = msi_insert_desc(dev, desc, domid, index); >> + if (ret) { >> + map.index = ret; >> + goto unlock; >> + } > > Above is the first call to msi_insert_desc(/* index = MSI_ANY_INDEX */) > >> + >> + map.index = desc->msi_index; > > msi_insert_desc() did attempt to set desc->msi_index to 65536 but map.index ends > up being 0. which is kinda correct. >> + ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index); > > Here is where the second call to msi_insert_desc() originates: > > msi_domain_alloc_irqs_range_locked() -> msi_domain_alloc_locked() -> \ > __msi_domain_alloc_locked() -> msi_domain_alloc_simple_msi_descs() -> \ > msi_domain_add_simple_msi_descs() -> msi_insert_desc() but yes, that's bogus because it tries to allocate what is allocated already. Too tired to decode this circular dependency right now. Will stare at it with brain awake in the morning. Duh! Thanks, tglx
On Fri, Nov 18 2022 at 01:58, Thomas Gleixner wrote: > On Thu, Nov 17 2022 at 15:33, Reinette Chatre wrote: >> When calling pci_ims_alloc_irq(), msi_insert_desc() ends up being >> called twice, first with index = MSI_ANY_INDEX, second with index = 0. >> (domid = 1 both times) > > How so? > >>> } >>> >>> hwsize = msi_domain_get_hwsize(dev, domid); >>> - if (index >= hwsize) { >>> - ret = -ERANGE; >>> - goto fail; >>> - } >>> >>> - desc->msi_index = index; >>> - index += baseidx; >>> - ret = xa_insert(&md->__store, index, desc, GFP_KERNEL); >>> - if (ret) >>> - goto fail; >>> - return 0; >>> + if (index == MSI_ANY_INDEX) { >>> + struct xa_limit limit; >>> + unsigned int index; >>> + >>> + limit.min = baseidx; >>> + limit.max = baseidx + hwsize - 1; >>> >>> + /* Let the xarray allocate a free index within the limits */ >>> + ret = xa_alloc(&md->__store, &index, desc, limit, GFP_KERNEL); >>> + if (ret) >>> + goto fail; >>> + >> >> This path (index == MSI_ANY_INDEX) is followed when msi_insert_desc() >> is called the first time and the xa_alloc() succeeds at index 65536. >> >>> + desc->msi_index = index; >> >> This is problematic with desc->msi_index being a u16, assigning >> 65536 to it becomes 0. > > You are partially right. I need to fix that and make it explicit as it's > a "works by chance or maybe not" construct right now. > > But desc->msi_index is correct to be truncated because it's the index > within the domain space which is zero based. It should obviously do: desc->msi_index = index - baseidx; >>> + return 0; >>> + } else { >>> + if (index >= hwsize) { >>> + ret = -ERANGE; >>> + goto fail; >>> + } >>> + >>> + desc->msi_index = index; >>> + index += baseidx; >>> + ret = xa_insert(&md->__store, index, desc, GFP_KERNEL); >>> + if (ret) >>> + goto fail; >> >> This "else" path is followed when msi_insert_desc() is called the second >> time with "index = 0". The xa_insert() above fails at index 65536 >> (baseidx = 65536) with -EBUSY, trickling up as the return code to >> pci_ims_alloc_irq(). > > Why is it called with index=0 the second time? >>> + desc = msi_alloc_desc(dev, 1, affdesc); >>> + if (!desc) { >>> + map.index = -ENOMEM; >>> + goto unlock; >>> + } >>> + >>> + if (cookie) >>> + desc->data.cookie = *cookie; >>> + >>> + ret = msi_insert_desc(dev, desc, domid, index); >>> + if (ret) { >>> + map.index = ret; >>> + goto unlock; >>> + } >> >> Above is the first call to msi_insert_desc(/* index = MSI_ANY_INDEX */) >> >>> + >>> + map.index = desc->msi_index; >> >> msi_insert_desc() did attempt to set desc->msi_index to 65536 but map.index ends >> up being 0. > > which is kinda correct. > >>> + ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index); >> >> Here is where the second call to msi_insert_desc() originates: >> >> msi_domain_alloc_irqs_range_locked() -> msi_domain_alloc_locked() -> \ >> __msi_domain_alloc_locked() -> msi_domain_alloc_simple_msi_descs() -> \ >> msi_domain_add_simple_msi_descs() -> msi_insert_desc() > > but yes, that's bogus because it tries to allocate what is allocated already. > > Too tired to decode this circular dependency right now. Will stare at it > with brain awake in the morning. Duh! Duh. I'm a moron. Of course I "tested" this by flipping default and secondary domain around and doing dynamic allocations from PCI/MSI-X but that won't catch the bug because PCI/MSI-X does not have the ALLOC_SIMPLE_DESCS flag set. Let me fix that. Thanks, tglx
On Fri, Nov 18 2022 at 10:15, Thomas Gleixner wrote: > On Fri, Nov 18 2022 at 01:58, Thomas Gleixner wrote: > Of course I "tested" this by flipping default and secondary domain > around and doing dynamic allocations from PCI/MSI-X but that won't catch > the bug because PCI/MSI-X does not have the ALLOC_SIMPLE_DESCS flag set. > > Let me fix that. Delta patch against git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git devmsi-v1G-part3 below. Thanks, tglx --- diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c index d4f26649a185..d243ad3e5489 100644 --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc, if (ret) goto fail; - desc->msi_index = index; + desc->msi_index = index - baseidx; return 0; } else { if (index >= hwsize) { @@ -1476,9 +1476,10 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u const struct irq_affinity_desc *affdesc, union msi_dev_cookie *cookie) { + struct msi_ctrl ctrl = { .domid = domid, .nirqs = 1, }; + struct msi_domain_info *info; struct irq_domain *domain; struct msi_map map = { }; - struct msi_desc *desc; int ret; msi_lock_descs(dev); @@ -1503,12 +1504,16 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u goto unlock; } - map.index = desc->msi_index; - ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index); - if (ret) + ctrl.first = ctrl.last = desc->msi_index; + info = domain->host_data; + + ret = __msi_domain_alloc_irqs(dev, domain, &ctrl); + if (ret) { map.index = ret; - else + msi_domain_free_locked(dev, &ctrl); + } else { map.virq = desc->irq; + } unlock: msi_unlock_descs(dev); return map;
Hi Thomas, On 11/18/2022 3:05 AM, Thomas Gleixner wrote: > On Fri, Nov 18 2022 at 10:15, Thomas Gleixner wrote: >> On Fri, Nov 18 2022 at 01:58, Thomas Gleixner wrote: ... > diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c > index d4f26649a185..d243ad3e5489 100644 > --- a/kernel/irq/msi.c > +++ b/kernel/irq/msi.c > @@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc, > if (ret) > goto fail; > > - desc->msi_index = index; > + desc->msi_index = index - baseidx; Could msi_desc->msi_index be made bigger? The hardware I am testing on claims to support more IMS entries than what the u16 can accommodate. > return 0; > } else { > if (index >= hwsize) { > @@ -1476,9 +1476,10 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u > const struct irq_affinity_desc *affdesc, > union msi_dev_cookie *cookie) > { > + struct msi_ctrl ctrl = { .domid = domid, .nirqs = 1, }; > + struct msi_domain_info *info; > struct irq_domain *domain; > struct msi_map map = { }; > - struct msi_desc *desc; (*desc is still needed) > int ret; > > msi_lock_descs(dev); > @@ -1503,12 +1504,16 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u > goto unlock; > } > > - map.index = desc->msi_index; > - ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index); > - if (ret) > + ctrl.first = ctrl.last = desc->msi_index; > + info = domain->host_data; > + > + ret = __msi_domain_alloc_irqs(dev, domain, &ctrl); > + if (ret) { > map.index = ret; > - else > + msi_domain_free_locked(dev, &ctrl); > + } else { > map.virq = desc->irq; > + } > unlock: > msi_unlock_descs(dev); > return map; Thank you very much. With the above snippet it is possible to allocate an IMS IRQ. I am not yet able to use the IRQ and I am working on more tracing to figure out why. In the mean time, I did just try the pci_ims_alloc_irq()/pci_ims_free_irq() flow and pci_ims_free_irq() triggered the WARN below: remove_proc_entry: removing non-empty directory 'irq/220', leaking at least 'idxd-portal' WARNING: CPU: XX PID: 4322 at fs/proc/generic.c:718 remove_proc_entry+0x184/0x190 [SNIP] RIP: 0010:remove_proc_entry+0x184/0x190 Code: a5 af 48 8d 90 68 ff ff ff 48 85 c0 48 0f 45 c2 48 8b 95 88 00 00 00 4c 8b 80 b0 00 00 00 48 8b 92 b0 00 00 00 e8 2d 67 c6 00 <0f> 0b e9 4d ff ff ff e8 a0 c1 ce 00 0f 1f 44 00 00 41 57 41 56 41 RSP: 0018:ff223b51cf947c80 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ff1b39f680241300 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffafa37b97 RDI: 00000000ffffffff RBP: ff1b3a06666b8000 R08: ff1b3a15bdbfffe8 R09: 0000000000000003 R10: ff1b3a15bce00000 R11: ff1b3a15bd900000 R12: ff1b3a06666b8090 R13: 00000000000000dd R14: ff1b3a069237fb80 R15: 0000000000000001 FS: 00007fedd2dff000(0000) GS:ff1b3a15be940000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ff8e5bfc01c CR3: 000000110a138006 CR4: 0000000000771ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> unregister_irq_proc+0xe3/0x110 free_desc+0x29/0x70 irq_free_descs+0x4b/0x80 msi_domain_free_locked.part.0+0x19b/0x1d0 msi_domain_free_irqs_range+0x67/0xb0 idxd_wq_free_irq+0x89/0x150 [idxd] drv_disable_wq+0x5f/0x90 [idxd] idxd_dmaengine_drv_remove+0xa3/0xc0 [idxd] device_release_driver_internal+0x1aa/0x230 driver_detach+0x44/0x90 bus_remove_driver+0x58/0xe0 idxd_exit_module+0x18/0x3a [idxd] __do_sys_delete_module.constprop.0+0x186/0x280 ? fpregs_assert_state_consistent+0x22/0x50 ? exit_to_user_mode_prepare+0x40/0x150 do_syscall_64+0x40/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fedd2526c9b Code: 73 01 c3 48 8b 0d 95 21 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 65 21 0f 00 f7 d8 64 89 01 48 RSP: 002b:00007ffca85a47d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 000055af0e1a3700 RCX: 00007fedd2526c9b RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055af0e1a3768 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 00007fedd25beac0 R11: 0000000000000206 R12: 00007ffca85a4a30 R13: 000055af0e1a32a0 R14: 00007ffca85a58e5 R15: 000055af0e1a3700 </TASK> Reinette
On Fri, Nov 18 2022 at 10:18, Reinette Chatre wrote: >> @@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc, >> if (ret) >> goto fail; >> >> - desc->msi_index = index; >> + desc->msi_index = index - baseidx; > > Could msi_desc->msi_index be made bigger? The hardware I am testing > on claims to support more IMS entries than what the u16 can > accommodate. Sure that's trivial. How big does it claim it is? >> @@ -1476,9 +1476,10 @@ struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, u >> const struct irq_affinity_desc *affdesc, >> union msi_dev_cookie *cookie) >> { >> + struct msi_ctrl ctrl = { .domid = domid, .nirqs = 1, }; >> + struct msi_domain_info *info; >> struct irq_domain *domain; >> struct msi_map map = { }; >> - struct msi_desc *desc; > > (*desc is still needed) Yes, I figured that out later :) > Thank you very much. With the above snippet it is possible to > allocate an IMS IRQ. I am not yet able to use the IRQ and I am working > on more tracing to figure out why. In the mean time, I did > just try the pci_ims_alloc_irq()/pci_ims_free_irq() flow and > pci_ims_free_irq() triggered the WARN below: > > remove_proc_entry: removing non-empty directory 'irq/220', leaking at least 'idxd-portal' Hrm, that's the irq action directory. No idea why that is not torn down. I assume your sequence is: pci_ims_alloc(); request_irq(); <- This creates it free_irq(); <- This removes it pci_ims_free(); Right? Thanks, tglx
Hi Thomas, On 11/18/2022 2:31 PM, Thomas Gleixner wrote: > On Fri, Nov 18 2022 at 10:18, Reinette Chatre wrote: >>> @@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc, >>> if (ret) >>> goto fail; >>> >>> - desc->msi_index = index; >>> + desc->msi_index = index - baseidx; >> >> Could msi_desc->msi_index be made bigger? The hardware I am testing >> on claims to support more IMS entries than what the u16 can >> accommodate. > > Sure that's trivial. How big does it claim it is? 2048 > I assume your sequence is: > > pci_ims_alloc(); > request_irq(); <- This creates it > free_irq(); <- This removes it > pci_ims_free(); > > Right? No. My mistake. Sorry for the noise. Reinette
Hi Thomas, On 11/18/2022 2:59 PM, Reinette Chatre wrote: > On 11/18/2022 2:31 PM, Thomas Gleixner wrote: >> On Fri, Nov 18 2022 at 10:18, Reinette Chatre wrote: >>>> @@ -141,7 +141,7 @@ static int msi_insert_desc(struct device *dev, struct msi_desc *desc, >>>> if (ret) >>>> goto fail; >>>> >>>> - desc->msi_index = index; >>>> + desc->msi_index = index - baseidx; >>> >>> Could msi_desc->msi_index be made bigger? The hardware I am testing >>> on claims to support more IMS entries than what the u16 can >>> accommodate. >> >> Sure that's trivial. How big does it claim it is? > > 2048 Dave Jiang corrected me ... the max the hardware can support is 16128 so the current size of msi_index is sufficient. Reinette
--- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -80,6 +80,7 @@ struct pci_dev; struct platform_msi_priv_data; struct device_attribute; struct irq_domain; +struct irq_affinity_desc; void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg); void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg); @@ -567,6 +568,9 @@ int msi_domain_alloc_irqs_range(struct d unsigned int first, unsigned int last); int msi_domain_alloc_irqs_all_locked(struct device *dev, unsigned int domid, int nirqs); +struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, unsigned int index, + const struct irq_affinity_desc *affdesc, + union msi_dev_cookie *cookie); void msi_domain_free_irqs_range_locked(struct device *dev, unsigned int domid, unsigned int first, unsigned int last); --- a/include/linux/msi_api.h +++ b/include/linux/msi_api.h @@ -48,6 +48,13 @@ struct msi_map { int virq; }; +/* + * Constant to be used for dynamic allocations when the allocation + * is any free MSI index (entry in the MSI-X table or a software + * managed index. + */ +#define MSI_ANY_INDEX UINT_MAX + unsigned int msi_domain_get_virq(struct device *dev, unsigned int domid, unsigned int index); /** --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -39,6 +39,7 @@ static inline int msi_sysfs_create_group /* Invalid XA index which is outside of any searchable range */ #define MSI_XA_MAX_INDEX (ULONG_MAX - 1) #define MSI_XA_DOMAIN_SIZE (MSI_MAX_INDEX + 1) +#define MSI_ANY_INDEX UINT_MAX static inline void msi_setup_default_irqdomain(struct device *dev, struct msi_device_data *md) { @@ -126,18 +127,34 @@ static int msi_insert_desc(struct device } hwsize = msi_domain_get_hwsize(dev, domid); - if (index >= hwsize) { - ret = -ERANGE; - goto fail; - } - desc->msi_index = index; - index += baseidx; - ret = xa_insert(&md->__store, index, desc, GFP_KERNEL); - if (ret) - goto fail; - return 0; + if (index == MSI_ANY_INDEX) { + struct xa_limit limit; + unsigned int index; + + limit.min = baseidx; + limit.max = baseidx + hwsize - 1; + /* Let the xarray allocate a free index within the limits */ + ret = xa_alloc(&md->__store, &index, desc, limit, GFP_KERNEL); + if (ret) + goto fail; + + desc->msi_index = index; + return 0; + } else { + if (index >= hwsize) { + ret = -ERANGE; + goto fail; + } + + desc->msi_index = index; + index += baseidx; + ret = xa_insert(&md->__store, index, desc, GFP_KERNEL); + if (ret) + goto fail; + return 0; + } fail: msi_free_desc(desc); return ret; @@ -335,7 +352,7 @@ int msi_setup_device_data(struct device msi_setup_default_irqdomain(dev, md); - xa_init(&md->__store); + xa_init_flags(&md->__store, XA_FLAGS_ALLOC); mutex_init(&md->mutex); md->__iter_idx = MSI_XA_MAX_INDEX; dev->msi.data = md; @@ -1423,6 +1440,72 @@ int msi_domain_alloc_irqs_all_locked(str return msi_domain_alloc_locked(dev, &ctrl); } +/** + * msi_domain_alloc_irq_at - Allocate an interrupt from a MSI interrupt domain at + * a given index - or at the next free index + * + * @dev: Pointer to device struct of the device for which the interrupts + * are allocated + * @domid: Id of the interrupt domain to operate on + * @index: Index for allocation. If @index == %MSI_ANY_INDEX the allocation + * uses the next free index. + * @affdesc: Optional pointer to an interrupt affinity descriptor structure + * @cookie: Optional pointer to a descriptor specific cookie to be stored + * in msi_desc::data. Must be NULL for MSI-X allocations + * + * This requires a MSI interrupt domain which lets the core code manage the + * MSI descriptors. + * + * Return: struct msi_map + * + * On success msi_map::index contains the allocated index number and + * msi_map::virq the corresponding Linux interrupt number + * + * On failure msi_map::index contains the error code and msi_map::virq + * is %0. + */ +struct msi_map msi_domain_alloc_irq_at(struct device *dev, unsigned int domid, unsigned int index, + const struct irq_affinity_desc *affdesc, + union msi_dev_cookie *cookie) +{ + struct irq_domain *domain; + struct msi_map map = { }; + struct msi_desc *desc; + int ret; + + msi_lock_descs(dev); + domain = msi_get_device_domain(dev, domid); + if (!domain) { + map.index = -ENODEV; + goto unlock; + } + + desc = msi_alloc_desc(dev, 1, affdesc); + if (!desc) { + map.index = -ENOMEM; + goto unlock; + } + + if (cookie) + desc->data.cookie = *cookie; + + ret = msi_insert_desc(dev, desc, domid, index); + if (ret) { + map.index = ret; + goto unlock; + } + + map.index = desc->msi_index; + ret = msi_domain_alloc_irqs_range_locked(dev, domid, map.index, map.index); + if (ret) + map.index = ret; + else + map.virq = desc->irq; +unlock: + msi_unlock_descs(dev); + return map; +} + static void __msi_domain_free_irqs(struct device *dev, struct irq_domain *domain, struct msi_ctrl *ctrl) {