Message ID | 20230724110406.107212-8-yi.l.liu@intel.com
---|---
State | New
Series | iommufd: Add nesting infrastructure
Headers | From: Yi Liu <yi.l.liu@intel.com>; To: joro@8bytes.org, alex.williamson@redhat.com, jgg@nvidia.com, kevin.tian@intel.com, robin.murphy@arm.com, baolu.lu@linux.intel.com; Cc: cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com, yi.l.liu@intel.com, yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com, shameerali.kolothum.thodi@huawei.com, lulu@redhat.com, suravee.suthikulpanit@amd.com, iommu@lists.linux.dev, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, zhenzhong.duan@intel.com; Subject: [PATCH v3 07/17] iommufd: Add IOMMU_RESV_IOVA_RANGES; Date: Mon, 24 Jul 2023 04:03:56 -0700; Message-Id: <20230724110406.107212-8-yi.l.liu@intel.com>; In-Reply-To: <20230724110406.107212-1-yi.l.liu@intel.com>
Commit Message
Yi Liu
July 24, 2023, 11:03 a.m. UTC
This reports a device's reserved IOVA regions to userspace. This is
needed for nested translation, where userspace owns the stage-1 HWPT
and needs to exclude the reserved IOVA regions from the stage-1 HWPT,
and hence from the device's DMA address space.

This can also be used to figure out the allowed IOVAs of an IOAS.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
drivers/iommu/iommufd/main.c | 54 ++++++++++++++++++++++++++++++++++++
include/uapi/linux/iommufd.h | 46 ++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)
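For reference, the uapi below defines a two-call query protocol: the caller passes the capacity of its resv_iovas array in num_iovas, and on -EMSGSIZE the kernel has written the required count back into num_iovas so the caller can grow the array and retry. A minimal C sketch of that retry loop from userspace; the function name and error handling are hypothetical, it assumes an open iommufd file descriptor and a valid device ID, and the ioctl it exercises is the one proposed here (ultimately dropped later in this thread):

/* Query all reserved IOVA ranges of a device, growing the output
 * array until the kernel stops returning -EMSGSIZE. */
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static struct iommu_resv_iova_range *
query_resv_ranges(int iommufd, uint32_t dev_id, uint32_t *nr)
{
	struct iommu_resv_iova_range *ranges = NULL;
	struct iommu_resv_iova_ranges cmd = {
		.size = sizeof(cmd),
		.dev_id = dev_id,
		.num_iovas = 0,	/* first call just asks for the count */
	};

	for (;;) {
		if (!ioctl(iommufd, IOMMU_RESV_IOVA_RANGES, &cmd)) {
			*nr = cmd.num_iovas;
			return ranges;	/* NULL is valid if no ranges */
		}
		if (errno != EMSGSIZE)
			break;
		/* The kernel wrote the required count into num_iovas */
		void *grown = realloc(ranges,
				      cmd.num_iovas * sizeof(*ranges));
		if (!grown)
			break;
		ranges = grown;
		cmd.resv_iovas = (uintptr_t)ranges;
	}
	free(ranges);
	return NULL;
}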
Comments
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, July 24, 2023 7:04 PM
>
> This reports a device's reserved IOVA regions to userspace. This is
> needed for nested translation, where userspace owns the stage-1 HWPT
> and needs to exclude the reserved IOVA regions from the stage-1 HWPT,
> and hence from the device's DMA address space.
>
> This can also be used to figure out the allowed IOVAs of an IOAS.

We may need a special type to mark SW_MSI since it requires identity
mapping in stage-1 instead of being reserved.
On Fri, Jul 28, 2023 at 10:07:58AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Monday, July 24, 2023 7:04 PM
> >
> > This reports a device's reserved IOVA regions to userspace. This is
> > needed for nested translation, where userspace owns the stage-1 HWPT
> > and needs to exclude the reserved IOVA regions from the stage-1 HWPT,
> > and hence from the device's DMA address space.
> >
> > This can also be used to figure out the allowed IOVAs of an IOAS.
>
> We may need a special type to mark SW_MSI since it requires identity
> mapping in stage-1 instead of being reserved.

Only the kernel can do this, so there is no action for user space to
take beyond knowing that it is not mappable IOVA.

The merit for "SW_MSI" may be to inform the rest of the system about
the IOVA of the ITS page, but with the current situation that isn't
required since only the kernel needs that information.

I think the long term way forward is to somehow arrange for the SW_MSI
to not become mapped when creating the parent HWPT and instead cause
the ITS page to be mapped through some explicit IOCTL.

Jason
On Mon, Jul 24, 2023 at 04:03:56AM -0700, Yi Liu wrote:

> +/**
> + * struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
> + * @size: sizeof(struct iommu_resv_iova_ranges)
> + * @dev_id: device to read resv iova ranges for
> + * @num_iovas: Input/Output total number of resv ranges for the device
> + * @__reserved: Must be 0
> + * @resv_iovas: Pointer to the output array of struct iommu_resv_iova_range
> + *
> + * Query a device for ranges of reserved IOVAs. num_iovas will be set to the
> + * total number of iovas and the resv_iovas[] will be filled in as space
> + * permits.
> + *
> + * On input num_iovas is the length of the resv_iovas array. On output it is
> + * the total number of iovas filled in. The ioctl will return -EMSGSIZE and
> + * set num_iovas to the required value if num_iovas is too small. In this
> + * case the caller should allocate a larger output array and re-issue the
> + * ioctl.
> + *
> + * Under nested translation, userspace should query the reserved IOVAs for a
> + * given device, and report it to the stage-1 I/O page table owner to exclude
> + * the reserved IOVAs. The reserved IOVAs can also be used to figure out the
> + * allowed IOVA ranges for the IOAS that the device is attached to. For detail
> + * see ioctl IOMMU_IOAS_IOVA_RANGES.

I'm not sure I like this, the other APIs here work with the *allowed*
IOVAs, which is the inverse of this one which works with the
*disallowed* IOVAs.

It means we can't take the output of this API and feed it into
IOMMUFD_CMD_IOAS_ALLOW_IOVAS.. Though I suppose qemu isn't going to do
this anyhow.

On the other hand, it is kind of hard to intersect an allowed list of
multiple idevs into a single master list.

As it is, userspace will have to aggregate the list, sort it, merge
adjacent overlapping reserved ranges then invert the list to get an
allowed list. This is not entirely simple..

Did you already write an algorithm to do this in qemu someplace?

Anyhow, this should be split out from this series. It seems simple
enough to merge it now if someone can confirm what qemu needs.

Jason
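For reference, the aggregate/sort/merge/invert pass described above is small but fiddly. A minimal C sketch of one way userspace could do it, assuming a flat 64-bit IOVA space and inclusive ranges like struct iommu_resv_iova_range; all names here are hypothetical, not anything from the patch or qemu:

/* Turn a combined list of reserved ranges (from all devices) into an
 * allowed list: sort by start, skip over merged/overlapping spans, and
 * emit the gaps between them. */
#include <stdint.h>
#include <stdlib.h>

struct iova_range { uint64_t start, last; };	/* inclusive bounds */

static int cmp_start(const void *a, const void *b)
{
	const struct iova_range *x = a, *y = b;

	return x->start < y->start ? -1 : x->start > y->start;
}

/* Sorts @resv in place and writes the complement of the merged reserved
 * list over [0, UINT64_MAX] into @allowed, which must hold at least
 * @n + 1 entries. Returns the number of allowed ranges written. */
static size_t invert_reserved(struct iova_range *resv, size_t n,
			      struct iova_range *allowed)
{
	uint64_t cursor = 0;	/* lowest IOVA not yet accounted for */
	size_t i, out = 0;

	qsort(resv, n, sizeof(*resv), cmp_start);
	for (i = 0; i < n; i++) {
		if (resv[i].start > cursor) {
			/* gap before this reserved range is allowed */
			allowed[out].start = cursor;
			allowed[out].last = resv[i].start - 1;
			out++;
		}
		if (resv[i].last == UINT64_MAX)
			return out;	/* reserved up to the very top */
		if (resv[i].last + 1 > cursor)
			cursor = resv[i].last + 1;
	}
	/* tail of the address space above the last reserved range */
	allowed[out].start = cursor;
	allowed[out].last = UINT64_MAX;
	return out + 1;
}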
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, July 29, 2023 1:17 AM
>
> On Fri, Jul 28, 2023 at 10:07:58AM +0000, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Monday, July 24, 2023 7:04 PM
> > >
> > > This reports a device's reserved IOVA regions to userspace. This is
> > > needed for nested translation, where userspace owns the stage-1
> > > HWPT and needs to exclude the reserved IOVA regions from the
> > > stage-1 HWPT, and hence from the device's DMA address space.
> > >
> > > This can also be used to figure out the allowed IOVAs of an IOAS.
> >
> > We may need a special type to mark SW_MSI since it requires identity
> > mapping in stage-1 instead of being reserved.
>
> Only the kernel can do this, so there is no action for user space to
> take beyond knowing that it is not mappable IOVA.
>
> The merit for "SW_MSI" may be to inform the rest of the system about
> the IOVA of the ITS page, but with the current situation that isn't
> required since only the kernel needs that information.

IIUC the guest kernel needs to know the "SW_MSI" region and then set up
a 1:1 mapping for it in S1. So Qemu needs to know and pass this
information to the guest?

> I think the long term way forward is to somehow arrange for the SW_MSI
> to not become mapped when creating the parent HWPT and instead cause
> the ITS page to be mapped through some explicit IOCTL.

Yes, this is a cleaner approach. Qemu selects the intermediate address
of the vITS page and maps it to the physical ITS page in S2. Then the
guest kernel just picks whatever "SW_MSI" address in S1 for the vITS as
it does today on bare metal.
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, July 29, 2023 1:45 AM
>
> On Mon, Jul 24, 2023 at 04:03:56AM -0700, Yi Liu wrote:
>
> > +/**
> > + * struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
> > + * @size: sizeof(struct iommu_resv_iova_ranges)
> > + * @dev_id: device to read resv iova ranges for
> > + * @num_iovas: Input/Output total number of resv ranges for the device
> > + * @__reserved: Must be 0
> > + * @resv_iovas: Pointer to the output array of struct iommu_resv_iova_range
> > + *
> > + * Query a device for ranges of reserved IOVAs. num_iovas will be set to the
> > + * total number of iovas and the resv_iovas[] will be filled in as space
> > + * permits.
> > + *
> > + * On input num_iovas is the length of the resv_iovas array. On output it is
> > + * the total number of iovas filled in. The ioctl will return -EMSGSIZE and
> > + * set num_iovas to the required value if num_iovas is too small. In this
> > + * case the caller should allocate a larger output array and re-issue the
> > + * ioctl.
> > + *
> > + * Under nested translation, userspace should query the reserved IOVAs for a
> > + * given device, and report it to the stage-1 I/O page table owner to exclude
> > + * the reserved IOVAs. The reserved IOVAs can also be used to figure out the
> > + * allowed IOVA ranges for the IOAS that the device is attached to. For detail
> > + * see ioctl IOMMU_IOAS_IOVA_RANGES.
>
> I'm not sure I like this, the other APIs here work with the *allowed*
> IOVAs, which is the inverse of this one which works with the
> *disallowed* IOVAs.
>
> It means we can't take the output of this API and feed it into
> IOMMUFD_CMD_IOAS_ALLOW_IOVAS.. Though I suppose qemu isn't going to do
> this anyhow.
>
> On the other hand, it is kind of hard to intersect an allowed list of
> multiple idevs into a single master list.
>
> As it is, userspace will have to aggregate the list, sort it, merge
> adjacent overlapping reserved ranges then invert the list to get an
> allowed list. This is not entirely simple..
>
> Did you already write an algorithm to do this in qemu someplace?

It is optional for Qemu to aggregate it for S2, given
IOMMU_IOAS_IOVA_RANGES is still being used. If the only purpose of
using this new cmd is to report per-device reserved ranges to the
guest then aggregation is not required.

Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this new cmd.
But it's already there and as you said it's actually more convenient
to be used if the user doesn't care about per-device reserved ranges...

> Anyhow, this should be split out from this series. It seems simple
> enough to merge it now if someone can confirm what qemu needs.
>
> Jason
> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Monday, July 31, 2023 2:22 PM
>
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, July 29, 2023 1:45 AM
> >
> > On Mon, Jul 24, 2023 at 04:03:56AM -0700, Yi Liu wrote:
> >
> > > +/**
> > > + * struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
> > > + * @size: sizeof(struct iommu_resv_iova_ranges)
> > > + * @dev_id: device to read resv iova ranges for
> > > + * @num_iovas: Input/Output total number of resv ranges for the device
> > > + * @__reserved: Must be 0
> > > + * @resv_iovas: Pointer to the output array of struct iommu_resv_iova_range
> > > + *
> > > + * Query a device for ranges of reserved IOVAs. num_iovas will be set to the
> > > + * total number of iovas and the resv_iovas[] will be filled in as space
> > > + * permits.
> > > + *
> > > + * On input num_iovas is the length of the resv_iovas array. On output it is
> > > + * the total number of iovas filled in. The ioctl will return -EMSGSIZE and
> > > + * set num_iovas to the required value if num_iovas is too small. In this
> > > + * case the caller should allocate a larger output array and re-issue the
> > > + * ioctl.
> > > + *
> > > + * Under nested translation, userspace should query the reserved IOVAs for a
> > > + * given device, and report it to the stage-1 I/O page table owner to exclude
> > > + * the reserved IOVAs. The reserved IOVAs can also be used to figure out the
> > > + * allowed IOVA ranges for the IOAS that the device is attached to. For detail
> > > + * see ioctl IOMMU_IOAS_IOVA_RANGES.
> >
> > I'm not sure I like this, the other APIs here work with the *allowed*
> > IOVAs, which is the inverse of this one which works with the
> > *disallowed* IOVAs.
> >
> > It means we can't take the output of this API and feed it into
> > IOMMUFD_CMD_IOAS_ALLOW_IOVAS.. Though I suppose qemu isn't going to do
> > this anyhow.
> >
> > On the other hand, it is kind of hard to intersect an allowed list of
> > multiple idevs into a single master list.
> >
> > As it is, userspace will have to aggregate the list, sort it, merge
> > adjacent overlapping reserved ranges then invert the list to get an
> > allowed list. This is not entirely simple..
> >
> > Did you already write an algorithm to do this in qemu someplace?
>
> It is optional for Qemu to aggregate it for S2, given
> IOMMU_IOAS_IOVA_RANGES is still being used. If the only purpose of
> using this new cmd is to report per-device reserved ranges to the
> guest then aggregation is not required.
>
> Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this new cmd.
> But it's already there and as you said it's actually more convenient
> to be used if the user doesn't care about per-device reserved ranges...

Yes, it's not simple, as userspace needs to do a lot of work to get the
allowed iovas if multiple devices are attached to an IOAS.

> > Anyhow, this should be split out from this series. It seems simple
> > enough to merge it now if someone can confirm what qemu needs.

Ok, so the reason is that this new ioctl can be used to figure out
allowed iovas. Right?

Regards,
Yi Liu
On Mon, Jul 31, 2023 at 06:14:50AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, July 29, 2023 1:17 AM
> >
> > On Fri, Jul 28, 2023 at 10:07:58AM +0000, Tian, Kevin wrote:
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Monday, July 24, 2023 7:04 PM
> > > >
> > > > This reports a device's reserved IOVA regions to userspace. This
> > > > is needed for nested translation, where userspace owns the
> > > > stage-1 HWPT and needs to exclude the reserved IOVA regions from
> > > > the stage-1 HWPT, and hence from the device's DMA address space.
> > > >
> > > > This can also be used to figure out the allowed IOVAs of an IOAS.
> > >
> > > We may need a special type to mark SW_MSI since it requires identity
> > > mapping in stage-1 instead of being reserved.
> >
> > Only the kernel can do this, so there is no action for user space to
> > take beyond knowing that it is not mappable IOVA.
> >
> > The merit for "SW_MSI" may be to inform the rest of the system about
> > the IOVA of the ITS page, but with the current situation that isn't
> > required since only the kernel needs that information.
>
> IIUC the guest kernel needs to know the "SW_MSI" region and then set up
> a 1:1 mapping for it in S1.

Yes, but qemu hardcodes this and for some reason people thought that
was a good idea back when.

> > I think the long term way forward is to somehow arrange for the SW_MSI
> > to not become mapped when creating the parent HWPT and instead cause
> > the ITS page to be mapped through some explicit IOCTL.
>
> Yes, this is a cleaner approach. Qemu selects the intermediate address
> of the vITS page and maps it to the physical ITS page in S2. Then the
> guest kernel just picks whatever "SW_MSI" address in S1 for the vITS
> as it does today on bare metal.

Right, so I've been inclined to minimize the amount of special stuff
created for this way of doing the MSI and hope we can reach a better
way sooner than later.

Jason
On Mon, Jul 31, 2023 at 06:21:44AM +0000, Tian, Kevin wrote:
> > As it is, userspace will have to aggregate the list, sort it, merge
> > adjacent overlapping reserved ranges then invert the list to get an
> > allowed list. This is not entirely simple..
> >
> > Did you already write an algorithm to do this in qemu someplace?
>
> It is optional for Qemu to aggregate it for S2, given
> IOMMU_IOAS_IOVA_RANGES is still being used. If the only purpose of
> using this new cmd is to report per-device reserved ranges to the
> guest then aggregation is not required.

I don't think it is entirely optional.. If qemu doesn't track this,
then it will have failures when attaching the S2 to the device. It
needs to make sure it punches the right holes in the guest memory map
to be compatible with the VFIO HW.

I suppose in reality the reserved regions are fairly predictable and
probably always match the existing qemu memory map so you can ignore
this and still work.

Plus most qemu cases don't deal with hotplug so you can build up the
identity ioas with all the devices and then use IOMMU_IOAS_IOVA_RANGES
as you say and still work.

> Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this new cmd.
> But it's already there and as you said it's actually more convenient
> to be used if the user doesn't care about per-device reserved ranges...

Yes and yes

So, I guess we should leave it like this?

Jason
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, July 31, 2023 9:24 PM
>
> On Mon, Jul 31, 2023 at 06:21:44AM +0000, Tian, Kevin wrote:
> > > As it is, userspace will have to aggregate the list, sort it, merge
> > > adjacent overlapping reserved ranges then invert the list to get an
> > > allowed list. This is not entirely simple..
> > >
> > > Did you already write an algorithm to do this in qemu someplace?
> >
> > It is optional for Qemu to aggregate it for S2, given
> > IOMMU_IOAS_IOVA_RANGES is still being used. If the only purpose of
> > using this new cmd is to report per-device reserved ranges to the
> > guest then aggregation is not required.
>
> I don't think it is entirely optional.. If qemu doesn't track this,
> then it will have failures when attaching the S2 to the device. It
> needs to make sure it punches the right holes in the guest memory map
> to be compatible with the VFIO HW.
>
> I suppose in reality the reserved regions are fairly predictable and
> probably always match the existing qemu memory map so you can ignore
> this and still work.
>
> Plus most qemu cases don't deal with hotplug so you can build up the
> identity ioas with all the devices and then use IOMMU_IOAS_IOVA_RANGES
> as you say and still work.
>
> > Arguably IOMMU_IOAS_IOVA_RANGES becomes redundant with this new cmd.
> > But it's already there and as you said it's actually more convenient
> > to be used if the user doesn't care about per-device reserved ranges...
>
> Yes and yes
>
> So, I guess we should leave it like this?

Yes. Along with this discussion (including what you explained for
sw_msi) let's abandon this new cmd and leave it as today.
On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
> > So, I guess we should leave it like this?
>
> Yes. Along with this discussion (including what you explained for
> sw_msi) let's abandon this new cmd and leave it as today.

You sure? This makes it basically impossible to write a "correct" vmm
that is aware of what the physical memory map must be early on.

Jason
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, August 2, 2023 2:23 AM
>
> On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
> > > So, I guess we should leave it like this?
> >
> > Yes. Along with this discussion (including what you explained for
> > sw_msi) let's abandon this new cmd and leave it as today.
>
> You sure? This makes it basically impossible to write a "correct" vmm
> that is aware of what the physical memory map must be early on.

emmm... I thought that's what you meant by "leave it like this", plus
the fact that existing VMMs' memory layout happens to match the
reserved regions. Nobody has complained about the lack of such an
interface for years, so we may postpone supporting it until it's
really required.

btw even if we add this new cmd now, getting the Qemu support to use
the aggregated list when creating the guest memory map is not a simple
task, given that currently vfio only passively acts on change
notifications in the guest memory layout. It requires a new mechanism
to enforce strict order (probe all vfio devices before creating the
memory layout) and then inject vfio reserved regions into the layout.

Preferably let's not make it a hard dependency for this series.
On Wed, Aug 02, 2023 at 01:09:28AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, August 2, 2023 2:23 AM
> >
> > On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
> > > > So, I guess we should leave it like this?
> > >
> > > Yes. Along with this discussion (including what you explained for
> > > sw_msi) let's abandon this new cmd and leave it as today.
> >
> > You sure? This makes it basically impossible to write a "correct" vmm
> > that is aware of what the physical memory map must be early on.
>
> emmm... I thought that's what you meant by "leave it like this", plus
> the fact that existing VMMs' memory layout happens to match the
> reserved regions. Nobody has complained about the lack of such an
> interface for years, so we may postpone supporting it until it's
> really required.
>
> btw even if we add this new cmd now, getting the Qemu support to use
> the aggregated list when creating the guest memory map is not a simple
> task, given that currently vfio only passively acts on change
> notifications in the guest memory layout. It requires a new mechanism
> to enforce strict order (probe all vfio devices before creating the
> memory layout) and then inject vfio reserved regions into the layout.
>
> Preferably let's not make it a hard dependency for this series.

I see, if qemu won't use it then let's not do it.

Jason
On Wed, Aug 02, 2023 at 01:09:28AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, August 2, 2023 2:23 AM
> >
> > On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
> > > > So, I guess we should leave it like this?
> > >
> > > Yes. Along with this discussion (including what you explained for
> > > sw_msi) let's abandon this new cmd and leave it as today.
> >
> > You sure? This makes it basically impossible to write a "correct" vmm
> > that is aware of what the physical memory map must be early on.
>
> emmm... I thought that's what you meant by "leave it like this", plus
> the fact that existing VMMs' memory layout happens to match the
> reserved regions. Nobody has complained about the lack of such an
> interface for years, so we may postpone supporting it until it's
> really required.
>
> btw even if we add this new cmd now, getting the Qemu support to use
> the aggregated list when creating the guest memory map is not a simple
> task, given that currently vfio only passively acts on change
> notifications in the guest memory layout. It requires a new mechanism
> to enforce strict order (probe all vfio devices before creating the
> memory layout) and then inject vfio reserved regions into the layout.
>
> Preferably let's not make it a hard dependency for this series.

Should we drop this and its selftest patch from this series?

Thanks
Nic
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, August 3, 2023 9:23 AM
>
> On Wed, Aug 02, 2023 at 01:09:28AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, August 2, 2023 2:23 AM
> > >
> > > On Tue, Aug 01, 2023 at 02:40:44AM +0000, Tian, Kevin wrote:
> > > > > So, I guess we should leave it like this?
> > > >
> > > > Yes. Along with this discussion (including what you explained for
> > > > sw_msi) let's abandon this new cmd and leave it as today.
> > >
> > > You sure? This makes it basically impossible to write a "correct"
> > > vmm that is aware of what the physical memory map must be early on.
> >
> > emmm... I thought that's what you meant by "leave it like this", plus
> > the fact that existing VMMs' memory layout happens to match the
> > reserved regions. Nobody has complained about the lack of such an
> > interface for years, so we may postpone supporting it until it's
> > really required.
> >
> > btw even if we add this new cmd now, getting the Qemu support to use
> > the aggregated list when creating the guest memory map is not a
> > simple task, given that currently vfio only passively acts on change
> > notifications in the guest memory layout. It requires a new mechanism
> > to enforce strict order (probe all vfio devices before creating the
> > memory layout) and then inject vfio reserved regions into the layout.
> >
> > Preferably let's not make it a hard dependency for this series.
>
> Should we drop this and its selftest patch from this series?

Yes, let's drop it.
On Thu, Aug 03, 2023 at 01:25:37AM +0000, Tian, Kevin wrote:
> > > Preferably let's not make it a hard dependency for this series.
> >
> > Should we drop this and its selftest patch from this series?
>
> Yes, let's drop it.

Ack. Thanks!
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index bd3efc1d8509..510db114fc61 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -250,6 +250,57 @@ static int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
 	return rc;
 }
 
+static int iommufd_resv_iova_ranges(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_resv_iova_ranges *cmd = ucmd->cmd;
+	struct iommu_resv_iova_range __user *ranges;
+	struct iommu_resv_region *resv;
+	struct iommufd_device *idev;
+	LIST_HEAD(resv_regions);
+	u32 max_iovas;
+	int rc;
+
+	if (cmd->__reserved)
+		return -EOPNOTSUPP;
+
+	idev = iommufd_get_device(ucmd, cmd->dev_id);
+	if (IS_ERR(idev))
+		return PTR_ERR(idev);
+
+	max_iovas = cmd->num_iovas;
+	ranges = u64_to_user_ptr(cmd->resv_iovas);
+	cmd->num_iovas = 0;
+
+	iommu_get_resv_regions(idev->dev, &resv_regions);
+
+	list_for_each_entry(resv, &resv_regions, list) {
+		if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
+			continue;
+		if (cmd->num_iovas < max_iovas) {
+			struct iommu_resv_iova_range elm = {
+				.start = resv->start,
+				.last = resv->length - 1 + resv->start,
+			};
+
+			if (copy_to_user(&ranges[cmd->num_iovas], &elm,
+					 sizeof(elm))) {
+				rc = -EFAULT;
+				goto out_put;
+			}
+		}
+		cmd->num_iovas++;
+	}
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	if (rc)
+		goto out_put;
+	if (cmd->num_iovas > max_iovas)
+		rc = -EMSGSIZE;
+out_put:
+	iommu_put_resv_regions(idev->dev, &resv_regions);
+	iommufd_put_object(&idev->obj);
+	return rc;
+}
+
 static int iommufd_fops_open(struct inode *inode, struct file *filp)
 {
 	struct iommufd_ctx *ictx;
@@ -347,6 +398,7 @@ union ucmd_buffer {
 	struct iommu_ioas_map map;
 	struct iommu_ioas_unmap unmap;
 	struct iommu_option option;
+	struct iommu_resv_iova_ranges resv_ranges;
 	struct iommu_vfio_ioas vfio_ioas;
 #ifdef CONFIG_IOMMUFD_TEST
 	struct iommu_test_cmd test;
@@ -389,6 +441,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 length),
 	IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option,
 		 val64),
+	IOCTL_OP(IOMMU_RESV_IOVA_RANGES, iommufd_resv_iova_ranges,
+		 struct iommu_resv_iova_ranges, resv_iovas),
 	IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
 		 __reserved),
 #ifdef CONFIG_IOMMUFD_TEST
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index d38bc54fd5f2..f2026cde2d64 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -47,6 +47,7 @@ enum {
 	IOMMUFD_CMD_VFIO_IOAS,
 	IOMMUFD_CMD_HWPT_ALLOC,
 	IOMMUFD_CMD_GET_HW_INFO,
+	IOMMUFD_CMD_RESV_IOVA_RANGES,
 };
 
 /**
@@ -422,4 +423,49 @@ struct iommu_hw_info {
 	__u32 __reserved;
 };
 #define IOMMU_GET_HW_INFO _IO(IOMMUFD_TYPE, IOMMUFD_CMD_GET_HW_INFO)
+
+/**
+ * struct iommu_resv_iova_range - ioctl(IOMMU_RESV_IOVA_RANGE)
+ * @start: First IOVA
+ * @last: Inclusive last IOVA
+ *
+ * An interval in IOVA space.
+ */
+struct iommu_resv_iova_range {
+	__aligned_u64 start;
+	__aligned_u64 last;
+};
+
+/**
+ * struct iommu_resv_iova_ranges - ioctl(IOMMU_RESV_IOVA_RANGES)
+ * @size: sizeof(struct iommu_resv_iova_ranges)
+ * @dev_id: device to read resv iova ranges for
+ * @num_iovas: Input/Output total number of resv ranges for the device
+ * @__reserved: Must be 0
+ * @resv_iovas: Pointer to the output array of struct iommu_resv_iova_range
+ *
+ * Query a device for ranges of reserved IOVAs. num_iovas will be set to the
+ * total number of iovas and the resv_iovas[] will be filled in as space
+ * permits.
+ *
+ * On input num_iovas is the length of the resv_iovas array. On output it is
+ * the total number of iovas filled in. The ioctl will return -EMSGSIZE and
+ * set num_iovas to the required value if num_iovas is too small. In this
+ * case the caller should allocate a larger output array and re-issue the
+ * ioctl.
+ *
+ * Under nested translation, userspace should query the reserved IOVAs for a
+ * given device, and report it to the stage-1 I/O page table owner to exclude
+ * the reserved IOVAs. The reserved IOVAs can also be used to figure out the
+ * allowed IOVA ranges for the IOAS that the device is attached to. For detail
+ * see ioctl IOMMU_IOAS_IOVA_RANGES.
+ */
+struct iommu_resv_iova_ranges {
+	__u32 size;
+	__u32 dev_id;
+	__u32 num_iovas;
+	__u32 __reserved;
+	__aligned_u64 resv_iovas;
+};
+#define IOMMU_RESV_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_RESV_IOVA_RANGES)
 #endif