Message ID: 20231117130717.19875-1-yi.l.liu@intel.com
From: Yi Liu <yi.l.liu@intel.com>
To: joro@8bytes.org, alex.williamson@redhat.com, jgg@nvidia.com, kevin.tian@intel.com, robin.murphy@arm.com, baolu.lu@linux.intel.com
Cc: cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com, yi.l.liu@intel.com, yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com, shameerali.kolothum.thodi@huawei.com, lulu@redhat.com, suravee.suthikulpanit@amd.com, iommu@lists.linux.dev, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, zhenzhong.duan@intel.com, joao.m.martins@oracle.com, xin.zeng@intel.com, yan.y.zhao@intel.com
Subject: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
Date: Fri, 17 Nov 2023 05:07:11 -0800
Series: iommufd: Add nesting infrastructure (part 2/2)

Message

Yi Liu, Nov. 17, 2023, 1:07 p.m. UTC
Nested translation is a hardware feature supported by many modern IOMMUs.
It performs a two-stage (stage-1, stage-2) address translation to reach
the physical address. The stage-1 translation table is owned by userspace
(e.g. by a guest OS), while stage-2 is owned by the kernel. Changes to the
stage-1 translation table should be followed by an IOTLB invalidation.

Take Intel VT-d as an example: the stage-1 translation table is the I/O
page table. As the diagram below shows, the guest I/O page table pointer
in GPA (guest physical address) is passed to the host and used to perform
the stage-1 address translation. Along with it, modifications to present
mappings in the guest I/O page table should be followed with an IOTLB
invalidation.

        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .------------------------.
        |   pIOMMU    |  |  FS for GIOVA->GPA     |
        |             |  '------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.----------------------------------.
        |             |   | SS for GPA->HPA, unmanaged domain|
        |             |   '----------------------------------'
        '-------------'
    Where:
     - FS = First stage page tables
     - SS = Second stage page tables

              <Intel VT-d Nested translation>

This series adds the cache invalidation path for userspace to invalidate
caches after modifying the stage-1 page table. It is based on the first
part of nesting [1]. Complete code can be found in [2]; QEMU code can be
found in [3].

Lastly, this is teamwork together with Nicolin Chen and Lu Baolu. Thanks
to them for the help. ^_^ Looking forward to your feedback.

[1] https://lore.kernel.org/linux-iommu/20231026044216.64964-1-yi.l.liu@intel.com/ - merged
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1

Change log:

v6:
 - Not much change; just a rebase on top of 6.7-rc1, as part 1/2 is merged

v5: https://lore.kernel.org/linux-iommu/20231020092426.13907-1-yi.l.liu@intel.com/#t
 - Split the iommufd nesting series into two parts: alloc_user and
   invalidation (Jason)
 - Split IOMMUFD_OBJ_HW_PAGETABLE into IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
   do the same with the structures/alloc()/abort()/destroy(). Reworked the
   selftest accordingly too. (Jason)
 - Move hwpt/data_type into struct iommu_user_data from standalone op
   arguments. (Jason)
 - Rename hwpt_type to data_type, HWPT_TYPE to HWPT_ALLOC_DATA, and
   _TYPE_DEFAULT to _ALLOC_DATA_NONE (Jason, Kevin)
 - Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
 - Add a macro to iommu_copy_struct_from_user() to calculate min_size
   (Jason)
 - Fix two bugs spotted by ZhaoYan

v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
 - Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
   and kernel-managed HWPTs
 - Rework the invalidate uAPI to be a multi-request array-based design
 - Add a struct iommu_user_data_array and a helper for drivers to sanitize
   and copy the entry data from the userspace invalidation array
 - Add a patch fixing TEST_LENGTH() in the selftest program
 - Drop the IOMMU_RESV_IOVA_RANGES patches
 - Update kdoc and inline comments
 - Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPTs in nested
   translation; this does not change the rule that resv regions should only
   be added to kernel-managed HWPTs. The IOMMU_RESV_SW_MSI stuff will be
   added in a later series, as it is needed only by SMMU so far.

v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.com/
 - Add new uAPI things in alphabetical order
 - Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
   sanity, replacing the previous op->domain_alloc_user_data_len solution
 - Return ERR_PTR from domain_alloc_user instead of NULL
 - Only add IOMMU_RESV_SW_MSI to kernel-managed HWPTs in nested translation
   (Kevin)
 - Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace so
   userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest
   I/O page table). (Kevin)
 - Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
 - Minor changes per Kevin's inputs

v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.com/
 - Add union iommu_domain_user_data to include all user data structures to
   avoid passing void * in kernel APIs
 - Add an iommu op to return the user data length for user domain
   allocation
 - Rename struct iommu_hwpt_alloc::data_type to hwpt_type
 - Store the invalidation data length in
   iommu_domain_ops::cache_invalidate_user_data_len
 - Convert the cache_invalidate_user op to return int instead of void
 - Remove @data_type in struct iommu_hwpt_invalidate
 - Remove out_hwpt_type_bitmap in struct iommu_hw_info, hence drop patch 08
   of v1

v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.com/

Thanks,
	Yi Liu

Lu Baolu (1):
  iommu: Add cache_invalidate_user op

Nicolin Chen (4):
  iommu: Add iommu_copy_struct_from_user_array helper
  iommufd/selftest: Add mock_domain_cache_invalidate_user support
  iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
  iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl

Yi Liu (1):
  iommufd: Add IOMMU_HWPT_INVALIDATE

 drivers/iommu/iommufd/hw_pagetable.c          | 35 ++++++++
 drivers/iommu/iommufd/iommufd_private.h       |  9 ++
 drivers/iommu/iommufd/iommufd_test.h          | 22 +++++
 drivers/iommu/iommufd/main.c                  |  3 +
 drivers/iommu/iommufd/selftest.c              | 69 +++++++++++++++
 include/linux/iommu.h                         | 84 +++++++++++++++++++
 include/uapi/linux/iommufd.h                  | 35 ++++++++
 tools/testing/selftests/iommu/iommufd.c       | 75 +++++++++++++++++
 tools/testing/selftests/iommu/iommufd_utils.h | 63 ++++++++++++++
 9 files changed, 395 insertions(+)
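The change log describes the invalidation uAPI as a multi-request, array-based design targeting a HWPT. As a rough illustration of the shape of that interface, here is a hedged userspace sketch; the structs, fields, and the ioctl number below are placeholders paraphrased from the change log, not the definitions from the patches (those live in include/uapi/linux/iommufd.h):

/*
 * Illustrative sketch only: placeholder names for the array-based
 * IOMMU_HWPT_INVALIDATE design described in the change log above.
 */
#include <stdint.h>
#include <sys/ioctl.h>

struct inv_entry {                 /* one driver-specific request */
	uint64_t addr;             /* start of the modified IOVA range */
	uint64_t npages;           /* pages whose mappings changed */
	uint32_t flags;
	uint32_t __reserved;
};

struct hwpt_invalidate {           /* placeholder for the real cmd struct */
	uint32_t size;
	uint32_t hwpt_id;          /* the stage-1 (nested) HWPT to flush */
	uint64_t data_uptr;        /* user pointer to the entry array */
	uint32_t entry_len;        /* sizeof(one entry), for versioning */
	uint32_t entry_num;        /* in: entries provided, out: consumed */
};

static int hwpt_flush(int iommufd, uint32_t hwpt_id,
		      struct inv_entry *ents, uint32_t n)
{
	struct hwpt_invalidate cmd = {
		.size = sizeof(cmd),
		.hwpt_id = hwpt_id,
		.data_uptr = (uintptr_t)ents,
		.entry_len = sizeof(*ents),
		.entry_num = n,
	};

	/* 0 is a stand-in; the real request code is IOMMU_HWPT_INVALIDATE */
	return ioctl(iommufd, 0, &cmd);
}

One array-based call can thus batch many invalidation requests against a single stage-1 HWPT, which matters for the per-tag batching discussed in the thread below.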
Comments
On Fri, Nov 17, 2023 at 05:07:11AM -0800, Yi Liu wrote:

> Take Intel VT-d as an example: the stage-1 translation table is the I/O
> page table. As the diagram below shows, the guest I/O page table pointer
> in GPA (guest physical address) is passed to the host and used to perform
> the stage-1 address translation. Along with it, modifications to present
> mappings in the guest I/O page table should be followed with an IOTLB
> invalidation.

I've been looking at what the three HWs need for invalidation; it is
a bit messy. Here is my thinking. Please let me know if I got it right.

What is the starting point of the guest memory walks:
 Intel: Single Scalable Mode PASID table entry indexed by a RID & PASID
 AMD: GCR3 table (a table of PASIDs) indexed by RID
 ARM: CD table (a table of PASIDs) indexed by RID

What key does the physical HW use for invalidation:
 Intel: Domain-ID (stored in hypervisor, per PASID), PASID
 AMD: Domain-ID (stored in hypervisor, per RID), PASID
 ARM: VMID (stored in hypervisor, per RID), ASID (stored in guest)

What key does the VM use for invalidation:
 Intel: vDomain-ID (per PASID), PASID
 AMD: vDomain-ID (per RID), PASID
 ARM: ASID

What is in a Nested domain:
 Intel: A single IO page table referred to by a PASID entry
        Each vDomain-ID,PASID allocates a unique nesting domain
 AMD: A GCR3 table pointer
      Nesting domains are created for every unique GCR3 pointer.
      vDomain-ID can possibly refer to multiple Nesting domains :(
 ARM: A CD table pointer
      Nesting domains are created for every unique CD table top pointer.

How does the hypervisor compute its cache tag:
 Intel: Each time a nesting domain is attached to a (RID,PASID) the
        hypervisor driver will try to find a (DID,PASID) that is
        already used by this domain, or allocate a new DID.
 AMD: When creating the Nesting Domain the vDomain-ID should be passed
      in. The hypervisor driver will allocate a unique pDomain-ID for
      each vDomain-ID (hand wave). Several Nesting Domains will share
      the same p/vDomain-ID.
 ARM: The VMID is uniquely assigned to the Nesting Parent when it
      is allocated; in some configurations it has to be shared with
      KVM's VMID, so allocating the Nesting Parent will require a KVM FD.

Will ATC be forwarded or synthesized:
 Intel: The (vDomain-ID,PASID) is a unique nesting domain, so the
        hypervisor knows exactly which RIDs this nesting domain is
        linked to and can generate an ATC invalidation. Plan is to
        suppress/discard the ATC invalidations from the VM and generate
        them in the hypervisor.
 AMD: (vDomain-ID,PASID) is ambiguous, it can refer to multiple GCR3
      tables. We know which maximal set of RIDs it represents, but not
      the actual set. I expect AMD will forward the ATC invalidation
      to avoid over-invalidation.
 ARM: ASID is ambiguous. We have no idea which Nesting Domain/CD table
      the ASID is contained in. ARM must forward the ATC invalidation
      from the guest.

What iommufd object should receive the IOTLB invalidation command list:
 Intel: The Nesting domain. The command list has to be broken up per
        (vDomain-ID,PASID) and that batch delivered to the single
        nesting domain. Kernel ignores vDomain-ID/PASID and just
        invalidates whatever the nesting domain is actually attached to.
 AMD: Any Nesting Domain in the vDomain-ID group. The command list has
      to be broken up per (vDomain-ID). Kernel replaces vDomain-ID
      with pDomain-ID from the nesting domain and executes the
      invalidation.
 ARM: The Nesting Parent domain. Kernel forces the VMID from the
      Nesting Parent and executes the invalidation.

In all cases the VM issues an ATC invalidation with (vRID, PASID) as
the tag. The VMM must translate vRID -> dev_id -> pRID.

For a pure SW flow the vRID can be mapped to the dev_id and the ATC
invalidation delivered to the device object (eg IOMMUFD_DEV_INVALIDATE).

Finally, we have the HW-driven invalidation DMA queues that can be
directly assigned to the guest. AMD and SMMUv3+vCMDQ support this. In
this case the HW is directly processing invalidation commands without
a hypervisor trap. To make this work the iommu needs to be programmed
with:
 AMD: A vDomain-ID -> pDomain-ID table
      A vRID -> pRID table
      This is all bound to some "virtual function"
 ARM: A vRID -> pRID table
      The vCMDQ is bound to a VMID, so to the Nesting Parent

For AMD, as above, I suggest the vDomain-ID be passed when creating
the nesting domain.

The AMD "virtual function"... It is probably best to create a new
iommufd object for this and it can be passed in to a few places.

The vRID->pRID table should be some mostly common
IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. AMD will need to pass in the virtual
function ID and ARM will need to pass in the Nesting Parent ID.

For the HW path some function will create the command queue and
DMA/mmap it, taking in the virtual function/nesting parent as the
handle to associate it with.

For a SW path:
 AMD: All invalidations can be delivered to the virtual function
      and the kernel can use the vDomainID/vRID tables to translate
      them fully
 ARM: All invalidations can be delivered to the nesting parent

In many ways the nesting parent/virtual function are very similar
things. Perhaps ARM should also create a virtual function object which
is just welded to the nesting parent for API consistency.

So... In short... Invalidation is a PITA. The idea is the same but
annoying little details interfere with actually having a completely
common API here. IMHO the uAPI in this series is fine. It will support
Intel invalidation and non-ATC invalidation on AMD/ARM. It should be
set up to allow that the target domain object can be any HWPT.

ARM will be able to do IOTLB invalidation using this API.

IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
(and just force the stream ID, userspace must direct the vRID to the
correct dev_id).

Then in yet another series we can tackle the entire "virtual function"
vRID/pRID translation stuff when the mmapable queue thing is
introduced.

Thus next steps:
 - Respin this and let's focus on Intel only (this will be tough for
   the holidays, but if it is available I will try)
 - Get an ARM patch that just does IOTLB invalidation and add it to my
   part 3
 - Start working on IOMMUFD_DEV_INVALIDATE along with an ARM
   implementation of it
 - Reorganize the AMD RFC broadly along these lines and let's see it
   freshened up in the next months as well. I would like to see the
   AMD support structured to implement the SW paths in first steps and
   later add in the "virtual function" acceleration stuff. The latter
   is going to be complex.

Jason
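To make the Intel delivery rule above concrete ("broken up per (vDomain-ID,PASID) and that batch delivered to the single nesting domain"), here is a hedged VMM-side sketch. guest_cmd, vmm_hwpt_for_tag(), and the batch size are invented for illustration; inv_entry and hwpt_flush() come from the sketch after the cover letter:

/*
 * Illustrative only: split a guest invalidation queue into batches
 * keyed by (vDomain-ID, PASID) and hand each batch to the nesting
 * domain the VMM allocated for that tag.
 */
#define BATCH_MAX 32

struct guest_cmd {
	uint32_t vdid;             /* guest's vDomain-ID */
	uint32_t pasid;
	uint64_t addr;
	uint64_t npages;
};

/* VMM-private lookup: (vDomain-ID, PASID) -> nesting domain HWPT id;
 * an assumed helper, populated when the VMM created the nesting domains */
uint32_t vmm_hwpt_for_tag(uint32_t vdid, uint32_t pasid);

static void replay_guest_cmds(int iommufd, struct guest_cmd *q, int len)
{
	struct inv_entry ents[BATCH_MAX];
	uint32_t cur_hwpt = 0, n = 0;

	for (int i = 0; i < len; i++) {
		uint32_t hwpt = vmm_hwpt_for_tag(q[i].vdid, q[i].pasid);

		/* tag changed or batch full: deliver what we have */
		if (n && (hwpt != cur_hwpt || n == BATCH_MAX)) {
			hwpt_flush(iommufd, cur_hwpt, ents, n);
			n = 0;
		}
		cur_hwpt = hwpt;
		ents[n++] = (struct inv_entry){ .addr = q[i].addr,
						.npages = q[i].npages };
	}
	if (n)
		hwpt_flush(iommufd, cur_hwpt, ents, n);
}

The AMD variant would batch per vDomain-ID only, and the ARM variant would deliver everything to the nesting parent, per the table above.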
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, December 9, 2023 9:47 AM
>
> What is in a Nested domain:
>  Intel: A single IO page table referred to by a PASID entry
>         Each vDomain-ID,PASID allocates a unique nesting domain
>  AMD: A GCR3 table pointer
>      Nesting domains are created for every unique GCR3 pointer.
>      vDomain-ID can possibly refer to multiple Nesting domains :(
>  ARM: A CD table pointer
>      Nesting domains are created for every unique CD table top pointer.

This AMD/ARM difference is not very clear to me.

How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
lead to a cache tag conflict when the same PASID entry in multiple GCR3
tables points to different I/O page tables?
On 2023/12/9 09:47, Jason Gunthorpe wrote:
> On Fri, Nov 17, 2023 at 05:07:11AM -0800, Yi Liu wrote:
>
>> Take Intel VT-d as an example: the stage-1 translation table is the I/O
>> page table. As the diagram below shows, the guest I/O page table pointer
>> in GPA (guest physical address) is passed to the host and used to perform
>> the stage-1 address translation. Along with it, modifications to present
>> mappings in the guest I/O page table should be followed with an IOTLB
>> invalidation.
>
> I've been looking at what the three HWs need for invalidation; it is
> a bit messy. Here is my thinking. Please let me know if I got it right.
>
> What is the starting point of the guest memory walks:
>  Intel: Single Scalable Mode PASID table entry indexed by a RID & PASID
>  AMD: GCR3 table (a table of PASIDs) indexed by RID
>  ARM: CD table (a table of PASIDs) indexed by RID
>
> What key does the physical HW use for invalidation:
>  Intel: Domain-ID (stored in hypervisor, per PASID), PASID
>  AMD: Domain-ID (stored in hypervisor, per RID), PASID
>  ARM: VMID (stored in hypervisor, per RID), ASID (stored in guest)
>
> What key does the VM use for invalidation:
>  Intel: vDomain-ID (per PASID), PASID
>  AMD: vDomain-ID (per RID), PASID
>  ARM: ASID
>
> What is in a Nested domain:
>  Intel: A single IO page table referred to by a PASID entry
>         Each vDomain-ID,PASID allocates a unique nesting domain
>  AMD: A GCR3 table pointer
>       Nesting domains are created for every unique GCR3 pointer.
>       vDomain-ID can possibly refer to multiple Nesting domains :(

Per section '2.2.6.3 Guest CR3 Table' in the spec below, the DTE has a
domainID, and it points to a guest CR3 table (it should be a set of
guest CR3s) which is indexed by PASID. So it looks like the PASID is
per-DomainID. HW should use domainID+PASID to tag the cache, otherwise
there would be a cache conflict...

https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf

>  ARM: A CD table pointer
>       Nesting domains are created for every unique CD table top pointer.
>
> How does the hypervisor compute its cache tag:
>  Intel: Each time a nesting domain is attached to a (RID,PASID) the
>         hypervisor driver will try to find a (DID,PASID) that is
>         already used by this domain, or allocate a new DID.
>  AMD: When creating the Nesting Domain the vDomain-ID should be passed
>       in. The hypervisor driver will allocate a unique pDomain-ID for
>       each vDomain-ID (hand wave). Several Nesting Domains will share
>       the same p/vDomain-ID.
>  ARM: The VMID is uniquely assigned to the Nesting Parent when it
>       is allocated; in some configurations it has to be shared with
>       KVM's VMID, so allocating the Nesting Parent will require a KVM FD.
>
> Will ATC be forwarded or synthesized:
>  Intel: The (vDomain-ID,PASID) is a unique nesting domain, so the
>         hypervisor knows exactly which RIDs this nesting domain is
>         linked to and can generate an ATC invalidation. Plan is to
>         suppress/discard the ATC invalidations from the VM and generate
>         them in the hypervisor.
>  AMD: (vDomain-ID,PASID) is ambiguous, it can refer to multiple GCR3
>       tables. We know which maximal set of RIDs it represents, but not
>       the actual set. I expect AMD will forward the ATC invalidation
>       to avoid over-invalidation.
>  ARM: ASID is ambiguous. We have no idea which Nesting Domain/CD table
>       the ASID is contained in. ARM must forward the ATC invalidation
>       from the guest.
>
> What iommufd object should receive the IOTLB invalidation command list:
>  Intel: The Nesting domain. The command list has to be broken up per
>         (vDomain-ID,PASID) and that batch delivered to the single
>         nesting domain. Kernel ignores vDomain-ID/PASID and just
>         invalidates whatever the nesting domain is actually attached to.

This is what we are doing in the current series [1], isn't it? The
guest needs to issue an invalidation request with vDomain-ID and PASID
(if enabled), and the affected pages for sure. But on the hypervisor
side, it is enough to use the vDomainID and PASID info to figure out
the target HWPT, then invoke a cache invalidation on the HWPT with the
invalidation range. The kernel can figure out which device/pasid this
HWPT has been attached to and invalidate the caches.

[1] https://lore.kernel.org/linux-iommu/20231117131816.24359-1-yi.l.liu@intel.com/

>  AMD: Any Nesting Domain in the vDomain-ID group. The command list has
>       to be broken up per (vDomain-ID). Kernel replaces vDomain-ID
>       with pDomain-ID from the nesting domain and executes the
>       invalidation.
>  ARM: The Nesting Parent domain. Kernel forces the VMID from the
>       Nesting Parent and executes the invalidation.
>
> In all cases the VM issues an ATC invalidation with (vRID, PASID) as
> the tag. The VMM must translate vRID -> dev_id -> pRID.
>
> For a pure SW flow the vRID can be mapped to the dev_id and the ATC
> invalidation delivered to the device object (eg IOMMUFD_DEV_INVALIDATE).
>
> Finally, we have the HW-driven invalidation DMA queues that can be
> directly assigned to the guest. AMD and SMMUv3+vCMDQ support this. In
> this case the HW is directly processing invalidation commands without
> a hypervisor trap.
>
> To make this work the iommu needs to be programmed with:
>  AMD: A vDomain-ID -> pDomain-ID table
>       A vRID -> pRID table
>       This is all bound to some "virtual function"
>  ARM: A vRID -> pRID table
>       The vCMDQ is bound to a VMID, so to the Nesting Parent
>
> For AMD, as above, I suggest the vDomain-ID be passed when creating
> the nesting domain.
>
> The AMD "virtual function"... It is probably best to create a new
> iommufd object for this and it can be passed in to a few places.
>
> The vRID->pRID table should be some mostly common
> IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. AMD will need to pass in the virtual
> function ID and ARM will need to pass in the Nesting Parent ID.
>
> For the HW path some function will create the command queue and
> DMA/mmap it, taking in the virtual function/nesting parent as the
> handle to associate it with.
>
> For a SW path:
>  AMD: All invalidations can be delivered to the virtual function
>       and the kernel can use the vDomainID/vRID tables to translate
>       them fully
>  ARM: All invalidations can be delivered to the nesting parent
>
> In many ways the nesting parent/virtual function are very similar
> things. Perhaps ARM should also create a virtual function object which
> is just welded to the nesting parent for API consistency.
>
> So... In short... Invalidation is a PITA. The idea is the same but
> annoying little details interfere with actually having a completely
> common API here. IMHO the uAPI in this series is fine. It will support
> Intel invalidation and non-ATC invalidation on AMD/ARM. It should be
> set up to allow that the target domain object can be any HWPT.

This HWPT is still a nested domain, isn't it? But it can represent a
guest I/O page table (VT-d), a guest CD table (ARM), or a guest CR3
table (AMD; it seems to be a set of guest CR3 table pointers). May the
ARM and AMD folks keep me honest here.

The Intel guest I/O page table case may be the simplest, as userspace
only needs to provide the HWPT ID and the affected ranges for
invalidation. As mentioned above, the kernel will find out the attached
device/pasid and invalidate the caches with the device/pasid. For the
ARM and AMD cases, extra information is needed. Am I getting you
correctly?

> ARM will be able to do IOTLB invalidation using this API.
>
> IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
> HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
> (and just force the stream ID, userspace must direct the vRID to the
> correct dev_id).
>
> Then in yet another series we can tackle the entire "virtual function"
> vRID/pRID translation stuff when the mmapable queue thing is
> introduced.
>
> Thus next steps:
>  - Respin this and let's focus on Intel only (this will be tough for
>    the holidays, but if it is available I will try)

I've respun the iommufd cache invalidation part with the change to
report error_code/error_data per invalidation entry, yet I am still
busy making the Intel VT-d part report the error_code. Besides that, I
didn't see any other respin needed for the Intel VT-d invalidation. If
I missed anything, please do let me know. :)

>  - Get an ARM patch that just does IOTLB invalidation and add it to my
>    part 3
>  - Start working on IOMMUFD_DEV_INVALIDATE along with an ARM
>    implementation of it
>  - Reorganize the AMD RFC broadly along these lines and let's see it
>    freshened up in the next months as well. I would like to see the
>    AMD support structured to implement the SW paths in first steps and
>    later add in the "virtual function" acceleration stuff. The latter
>    is going to be complex.
>
> Jason
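For the VT-d flow Yi describes (the guest's tags are dropped at the uAPI boundary and the kernel resolves the attachments itself), a hedged sketch of what the kernel-side fan-out could look like; every struct and helper here is invented for illustration, not the VT-d driver's real code:

/*
 * Illustrative only: one user-supplied range fans out to every
 * (device, pasid) the nested domain is attached to, flushed with
 * host cache tags. All names below are assumptions.
 */
#include <linux/list.h>

struct hw_iommu;                          /* stand-in for the IOMMU instance */
void hw_flush_iotlb(struct hw_iommu *iommu, u16 did, u32 pasid,
		    u64 addr, u64 npages); /* assumed hardware flush helper */

struct nd_attachment {
	struct list_head node;
	u16 host_did;                     /* host Domain-ID, not the guest's */
	u32 pasid;
};

struct nested_domain {
	struct hw_iommu *iommu;
	struct list_head attachments;     /* updated at attach/detach time */
};

static void nested_domain_flush(struct nested_domain *nd, u64 addr, u64 npages)
{
	struct nd_attachment *at;

	list_for_each_entry(at, &nd->attachments, node)
		hw_flush_iotlb(nd->iommu, at->host_did, at->pasid,
			       addr, npages);
}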
On 2023/12/11 10:29, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Saturday, December 9, 2023 9:47 AM
>>
>> What is in a Nested domain:
>>  Intel: A single IO page table referred to by a PASID entry
>>         Each vDomain-ID,PASID allocates a unique nesting domain
>>  AMD: A GCR3 table pointer
>>      Nesting domains are created for every unique GCR3 pointer.
>>      vDomain-ID can possibly refer to multiple Nesting domains :(
>>  ARM: A CD table pointer
>>      Nesting domains are created for every unique CD table top pointer.
>
> This AMD/ARM difference is not very clear to me.
>
> How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
> lead to a cache tag conflict when the same PASID entry in multiple GCR3
> tables points to different I/O page tables?

Perhaps because there is only one DomainID in the DTE table indexed by
BDF? Actually, the vDomainID will not be used to tag the cache; the
host DomainID would be used instead. @Jason?
On Mon, Dec 11, 2023 at 08:36:46PM +0800, Yi Liu wrote:
> On 2023/12/11 10:29, Tian, Kevin wrote:
>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>> Sent: Saturday, December 9, 2023 9:47 AM
>>>
>>> What is in a Nested domain:
>>>  Intel: A single IO page table referred to by a PASID entry
>>>         Each vDomain-ID,PASID allocates a unique nesting domain
>>>  AMD: A GCR3 table pointer
>>>      Nesting domains are created for every unique GCR3 pointer.
>>>      vDomain-ID can possibly refer to multiple Nesting domains :(
>>>  ARM: A CD table pointer
>>>      Nesting domains are created for every unique CD table top pointer.
>>
>> This AMD/ARM difference is not very clear to me.
>>
>> How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
>> lead to a cache tag conflict when the same PASID entry in multiple GCR3
>> tables points to different I/O page tables?
>
> Perhaps because there is only one DomainID in the DTE table indexed by
> BDF? Actually, the vDomainID will not be used to tag the cache; the
> host DomainID would be used instead. @Jason?

The DomainID comes from the DTE table, which is indexed by the RID, and
the DTE entry points to the GCR3 table. So the VM certainly can set up
a DTE table with multiple entries having the same vDomainID but
pointing to different GCR3s. So the VMM has to do *something* with
this.

Most likely this is not a useful thing to do. However, what should the
VMM do when it sees this? Block a random DTE or push the duplication
down to real HW would be my options. I'd probably try the latter, just
on the basis of better emulation.

Jason
On Mon, Dec 11, 2023 at 08:35:09PM +0800, Yi Liu wrote:
> > What iommufd object should receive the IOTLB invalidation command list:
> >  Intel: The Nesting domain. The command list has to be broken up per
> >         (vDomain-ID,PASID) and that batch delivered to the single
> >         nesting domain. Kernel ignores vDomain-ID/PASID and just
> >         invalidates whatever the nesting domain is actually attached to.
>
> This is what we are doing in the current series [1], isn't it?

I think so.

> > So... In short... Invalidation is a PITA. The idea is the same but
> > annoying little details interfere with actually having a completely
> > common API here. IMHO the uAPI in this series is fine. It will support
> > Intel invalidation and non-ATC invalidation on AMD/ARM. It should be
> > set up to allow that the target domain object can be any HWPT.
>
> This HWPT is still a nested domain, isn't it? But it can represent a
> guest I/O page table (VT-d), a guest CD table (ARM), or a guest CR3
> table (AMD; it seems to be a set of guest CR3 table pointers). May the
> ARM and AMD folks keep me honest here.

I was thinking ARM would not want to use a nested domain because
really the invalidation is global to the entire nesting parent.

But there is an issue with that - the nesting parent could be attached
to multiple iommu instances, but we only want to invalidate a single
instance.

However, specifying the nesting domain instead, and restricting the
nesting domain to be single-instance, would provide the kernel enough
information without providing weird new APIs.

So maybe it is best to just make everything use the nesting domain,
even if it is somewhat semantically weird on ARM.

> The Intel guest I/O page table case may be the simplest, as userspace
> only needs to provide the HWPT ID and the affected ranges for
> invalidation. As mentioned above, the kernel will find out the attached
> device/pasid and invalidate the caches with the device/pasid. For the
> ARM and AMD cases, extra information is needed. Am I getting you
> correctly?

Yes.

> > Thus next steps:
> >  - Respin this and let's focus on Intel only (this will be tough for
> >    the holidays, but if it is available I will try)
>
> I've respun the iommufd cache invalidation part with the change to
> report error_code/error_data per invalidation entry, yet I am still
> busy making the Intel VT-d part report the error_code. Besides that, I
> didn't see any other respin needed for the Intel VT-d invalidation. If
> I missed anything, please do let me know. :)

Let's see.

Jason
On 12/11/2023 8:05 PM, Jason Gunthorpe wrote:
> On Mon, Dec 11, 2023 at 08:36:46PM +0800, Yi Liu wrote:
>> On 2023/12/11 10:29, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Saturday, December 9, 2023 9:47 AM
>>>>
>>>> What is in a Nested domain:
>>>>  Intel: A single IO page table referred to by a PASID entry
>>>>         Each vDomain-ID,PASID allocates a unique nesting domain
>>>>  AMD: A GCR3 table pointer
>>>>      Nesting domains are created for every unique GCR3 pointer.
>>>>      vDomain-ID can possibly refer to multiple Nesting domains :(
>>>>  ARM: A CD table pointer
>>>>      Nesting domains are created for every unique CD table top pointer.
>>>
>>> This AMD/ARM difference is not very clear to me.
>>>
>>> How could a vDomain-ID refer to multiple GCR3 pointers? Wouldn't it
>>> lead to a cache tag conflict when the same PASID entry in multiple
>>> GCR3 tables points to different I/O page tables?
>>
>> Perhaps because there is only one DomainID in the DTE table indexed by
>> BDF? Actually, the vDomainID will not be used to tag the cache; the
>> host DomainID would be used instead. @Jason?
>
> The DomainID comes from the DTE table, which is indexed by the RID, and
> the DTE entry points to the GCR3 table. So the VM certainly can set up
> a DTE table with multiple entries having the same vDomainID but
> pointing to different GCR3s. So the VMM has to do *something* with
> this.
>
> Most likely this is not a useful thing to do. However, what should the
> VMM do when it sees this? Block a random DTE or push the duplication
> down to real HW would be my options. I'd probably try the latter, just
> on the basis of better emulation.
>
> Jason

For AMD, the hardware uses the host DomainID (hDomainId) and PASID to
tag the IOMMU TLB. The VM can set up the vDomainID independently of the
device (RID) and hDomainID. The vDomainId->hDomainId mapping would be
managed by the host IOMMU driver (since this is also needed by the HW
when enabling the HW-vIOMMU support, a.k.a. the virtual function).

Currently, the AMD IOMMU driver allocates a DomainId per IOMMU group.
One issue with this is when we have nested translation, where we could
end up with multiple devices (RIDs) sharing the same PASID and the same
hDomainID. For example:

- Host view
    Device1 (RID 1) w/ hDomainId 1
    Device2 (RID 2) w/ hDomainId 1
- Guest view
    Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
    Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0

We should be able to work around this by changing the way we assign
hDomainId to be per-device for VFIO pass-through devices, although they
share the same v1 (stage-2) page table. This would look like:

- Host view
    Device1 (RID 1) w/ hDomainId 1
    Device2 (RID 2) w/ hDomainId 2
- Guest view
    Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
    Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0

This should avoid the IOMMU TLB conflict. However, the invalidation
would need to be done for both DomainId 1 and 2 when updating the v1
(stage-2) page table.

Thanks,
Suravee
On Mon, Dec 11, 2023 at 10:34:09PM +0700, Suthikulpanit, Suravee wrote:
> Currently, the AMD IOMMU driver allocates a DomainId per IOMMU group.
> One issue with this is when we have nested translation, where we could
> end up with multiple devices (RIDs) sharing the same PASID and the same
> hDomainID.

Which means you also create multiple GCR3 tables, since those are
(soon) per-device, and we end up with the situation I described for a
functional, legitimate reason :( It is just wasting memory by
duplicating GCR3 tables.

> For example:
>
> - Host view
>     Device1 (RID 1) w/ hDomainId 1
>     Device2 (RID 2) w/ hDomainId 1

So... Groups are another ugly mess that we may have to do something
more robust about. The group infrastructure assumes that all devices in
the group have the same translation. This is not how the VM
communicates; each member of the group gets to have its own DTE, and
there are legitimate cases where the DTEs will be different (even if
just temporarily).

How to mesh this is not yet solved (most likely we need to allow group
members to have temporarily different translation). But in the long run
the group should definitely not be providing the cache tag; the driver
has to be smarter than this.

I think we talked about this before... For the AMD driver the v1 page
table should store the domainid in the iommu_domain and that value
should be used everywhere.

For modes with a GCR3 table the best you can do is to de-duplicate the
GCR3 tables and assign identical GCR3 tables to identical domain ids.
Ie, all devices in a group will eventually share GCR3 tables so they
can converge on the same domain id.

> - Guest view
>     Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
>     Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0
>
> We should be able to work around this by changing the way we assign
> hDomainId to be per-device for VFIO pass-through devices, although they
> share the same v1 (stage-2) page table. This would look like:

As I said, this doesn't quite work, since the VM could do other things.
The kernel must be aware of the vDomainID and must select an
appropriate hDomainID with that knowledge in mind, otherwise
multi-device groups in guests are fully broken.

> - Guest view
>     Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
>     Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0
>
> This should avoid the IOMMU TLB conflict. However, the invalidation
> would need to be done for both DomainId 1 and 2 when updating the v1
> (stage-2) page table.

Which is the key problem. If the VM thinks it has only one vDomainID,
the VMM can't split that into two hDomainIDs and expect the viommu
acceleration to work - so we shouldn't try to make it work in SW
either, IMHO.

Jason
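A hedged sketch of the de-duplication suggested above, under the assumption that nesting domains are keyed by the guest GCR3 table pointer so identical tables converge on one nesting domain and one host Domain-ID chosen with the vDomainID in mind. All names are invented for illustration, not the AMD driver's code:

/*
 * Illustrative only: look up a nesting domain by guest GCR3 pointer,
 * sharing it (and its host Domain-ID) when the pointer matches.
 */
#include <linux/list.h>

struct nested_gcr3 {
	struct list_head node;
	u64 gcr3_ptr;                     /* guest GCR3 table pointer (key) */
	u16 host_did;                     /* host Domain-ID for this group */
};

struct viommu_state {
	struct list_head nested_domains;
};

/* assumed helpers: allocator and the vDomainID -> hDomainID policy */
struct nested_gcr3 *nested_gcr3_alloc(struct viommu_state *vs, u64 gcr3_ptr);
u16 host_did_for_vdid(struct viommu_state *vs, u32 vdid);

static struct nested_gcr3 *gcr3_get_nested(struct viommu_state *vs,
					   u64 gcr3_ptr, u32 vdid)
{
	struct nested_gcr3 *nd;

	list_for_each_entry(nd, &vs->nested_domains, node)
		if (nd->gcr3_ptr == gcr3_ptr)
			return nd;        /* identical guest table: share it */

	nd = nested_gcr3_alloc(vs, gcr3_ptr);
	nd->host_did = host_did_for_vdid(vs, vdid);
	return nd;
}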
On 12/9/2023 8:47 AM, Jason Gunthorpe wrote:
> On Fri, Nov 17, 2023 at 05:07:11AM -0800, Yi Liu wrote:
>
>> Take Intel VT-d as an example: the stage-1 translation table is the I/O
>> page table. As the diagram below shows, the guest I/O page table pointer
>> in GPA (guest physical address) is passed to the host and used to perform
>> the stage-1 address translation. Along with it, modifications to present
>> mappings in the guest I/O page table should be followed with an IOTLB
>> invalidation.
>
> I've been looking at what the three HWs need for invalidation; it is
> a bit messy. Here is my thinking. Please let me know if I got it right.
>
> What is the starting point of the guest memory walks:
>  Intel: Single Scalable Mode PASID table entry indexed by a RID & PASID
>  AMD: GCR3 table (a table of PASIDs) indexed by RID

The GCR3 table is indexed by PASID.
The Device Table (DTE) is indexed by DeviceID (RID).

> ...
> Will ATC be forwarded or synthesized:
>  Intel: The (vDomain-ID,PASID) is a unique nesting domain, so the
>         hypervisor knows exactly which RIDs this nesting domain is
>         linked to and can generate an ATC invalidation. Plan is to
>         suppress/discard the ATC invalidations from the VM and generate
>         them in the hypervisor.
>  AMD: (vDomain-ID,PASID) is ambiguous, it can refer to multiple GCR3
>       tables. We know which maximal set of RIDs it represents, but not
>       the actual set. I expect AMD will forward the ATC invalidation
>       to avoid over-invalidation.

Not sure I understand your description here.

For the AMD IOMMU INVALIDATE_IOMMU_PAGES (i.e. invalidate the IOMMU
TLB), the hypervisor needs to map gDomainId->hDomainId and issue the
command on behalf of the VM along with the PASID and GVA (or GVA range)
provided by the guest.

For the AMD IOMMU INVALIDATE_IOTLB_PAGES (i.e. invalidate the ATC on
the device), the hypervisor needs to map gDeviceId->hDeviceId and issue
the command on behalf of the VM along with the PASID and GVA (or GVA
range) provided by the guest.

>  ARM: ASID is ambiguous. We have no idea which Nesting Domain/CD table
>       the ASID is contained in. ARM must forward the ATC invalidation
>       from the guest.
>
> What iommufd object should receive the IOTLB invalidation command list:
>  Intel: The Nesting domain. The command list has to be broken up per
>         (vDomain-ID,PASID) and that batch delivered to the single
>         nesting domain. Kernel ignores vDomain-ID/PASID and just
>         invalidates whatever the nesting domain is actually attached to.
>  AMD: Any Nesting Domain in the vDomain-ID group. The command list has
>       to be broken up per (vDomain-ID). Kernel replaces vDomain-ID
>       with pDomain-ID from the nesting domain and executes the
>       invalidation.
>  ARM: The Nesting Parent domain. Kernel forces the VMID from the
>       Nesting Parent and executes the invalidation.
>
> In all cases the VM issues an ATC invalidation with (vRID, PASID) as
> the tag. The VMM must translate vRID -> dev_id -> pRID.
>
> For a pure SW flow the vRID can be mapped to the dev_id and the ATC
> invalidation delivered to the device object (eg IOMMUFD_DEV_INVALIDATE).
>
> Finally, we have the HW-driven invalidation DMA queues that can be
> directly assigned to the guest. AMD and SMMUv3+vCMDQ support this. In
> this case the HW is directly processing invalidation commands without
> a hypervisor trap.
>
> To make this work the iommu needs to be programmed with:
>  AMD: A vDomain-ID -> pDomain-ID table
>       A vRID -> pRID table
>       This is all bound to some "virtual function"

By "virtual function", I assume you are referring to the AMD vIOMMU
instance in the guest?

>  ARM: A vRID -> pRID table
>       The vCMDQ is bound to a VMID, so to the Nesting Parent
>
> For AMD, as above, I suggest the vDomain-ID be passed when creating
> the nesting domain.

Sure, we can do this part.

> The AMD "virtual function"... It is probably best to create a new
> iommufd object for this and it can be passed in to a few places.

Something like IOMMUFD_OBJ_VIOMMU? Then operations would include
something like:
 * Init
 * Destroy
 * ...

> The vRID->pRID table should be some mostly common
> IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. AMD will need to pass in the virtual
> function ID and ARM will need to pass in the Nesting Parent ID.

Ok.

> ...
> Thus next steps:
>  - Respin this and let's focus on Intel only (this will be tough for
>    the holidays, but if it is available I will try)
>  - Get an ARM patch that just does IOTLB invalidation and add it to my
>    part 3
>  - Start working on IOMMUFD_DEV_INVALIDATE along with an ARM
>    implementation of it
>  - Reorganize the AMD RFC broadly along these lines and let's see it
>    freshened up in the next months as well. I would like to see the
>    AMD support structured to implement the SW paths in first steps and
>    later add in the "virtual function" acceleration stuff. The latter
>    is going to be complex.

Working on refining part 1 to add HW info reporting and nested
translation (minus the invalidation stuff). Should be sending it out
soon.

Suravee
On Tue, Dec 12, 2023 at 12:35:26AM +0700, Suthikulpanit, Suravee wrote:
> On 12/9/2023 8:47 AM, Jason Gunthorpe wrote:
> > On Fri, Nov 17, 2023 at 05:07:11AM -0800, Yi Liu wrote:
> >
> > > Take Intel VT-d as an example: the stage-1 translation table is the
> > > I/O page table. [...]
> >
> > I've been looking at what the three HWs need for invalidation; it is
> > a bit messy. Here is my thinking. Please let me know if I got it right.
> >
> > What is the starting point of the guest memory walks:
> >  Intel: Single Scalable Mode PASID table entry indexed by a RID & PASID
> >  AMD: GCR3 table (a table of PASIDs) indexed by RID
>
> The GCR3 table is indexed by PASID.
> The Device Table (DTE) is indexed by DeviceID (RID).

Yes, this is what I was trying to say.

> > Will ATC be forwarded or synthesized:
> >  Intel: The (vDomain-ID,PASID) is a unique nesting domain, so the
> >         hypervisor knows exactly which RIDs this nesting domain is
> >         linked to and can generate an ATC invalidation. Plan is to
> >         suppress/discard the ATC invalidations from the VM and generate
> >         them in the hypervisor.
> >  AMD: (vDomain-ID,PASID) is ambiguous, it can refer to multiple GCR3
> >       tables. We know which maximal set of RIDs it represents, but not
> >       the actual set. I expect AMD will forward the ATC invalidation
> >       to avoid over-invalidation.
>
> Not sure I understand your description here.
>
> For the AMD IOMMU INVALIDATE_IOMMU_PAGES (i.e. invalidate the IOMMU
> TLB), the hypervisor needs to map gDomainId->hDomainId and issue the
> command on behalf of the VM along with the PASID and GVA (or GVA range)
> provided by the guest.

Yes, that is the "forwarding" approach. Contrast this with the Intel
approach, where the ATC invalidation is synthesized by the kernel
emulating INVALIDATE_IOMMU_PAGES.

> > To make this work the iommu needs to be programmed with:
> >  AMD: A vDomain-ID -> pDomain-ID table
> >       A vRID -> pRID table
> >       This is all bound to some "virtual function"
>
> By "virtual function", I assume you are referring to the AMD vIOMMU
> instance in the guest?

Yes, but it is not in the guest; it has to be some concrete iommufd
object.

> Something like IOMMUFD_OBJ_VIOMMU? Then operations would include
> something like:
>  * Init
>  * Destroy
>  * ...

Yes, something like that. It needs to be able to work for the ARM vCMDQ
stuff too. I don't know what the name should be. Maybe viommu is OK for
now.

 - Alloc viommu (against a single iommu instance?)
 - Assign a virtual ID to an iommufd device within the same instance
 - Setup a submission and completion queue in userspace memory
 - mmap the doorbell page (both need this?)
 - Route back completion interrupts via eventfd

When you get here, you and Nicolin should work out something along
those lines that works for both.

But I'd like to keep things in steps, so if we can get info, nesting
parent, nesting domain and SW IOTLB and ATC invalidation as the first
(two?) steps, that would be great.

> > Thus next steps:
> >  - Respin this and let's focus on Intel only (this will be tough for
> >    the holidays, but if it is available I will try)
> >  - Get an ARM patch that just does IOTLB invalidation and add it to my
> >    part 3
> >  - Start working on IOMMUFD_DEV_INVALIDATE along with an ARM
> >    implementation of it
> >  - Reorganize the AMD RFC broadly along these lines and let's see it
> >    freshened up in the next months as well. I would like to see the
> >    AMD support structured to implement the SW paths in first steps and
> >    later add in the "virtual function" acceleration stuff. The latter
> >    is going to be complex.
>
> Working on refining part 1 to add HW info reporting and nested
> translation (minus the invalidation stuff). Should be sending it out
> soon.

Nice!

Jason
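A hedged sketch of the five-step viommu flow Jason lists in the message above. None of these ioctl wrappers exist in iommufd; every function below is a placeholder standing in for one listed step:

/*
 * Illustrative only: hypothetical wrappers, one per step of the
 * proposed viommu object lifecycle. Not a real API.
 */
#include <stdint.h>
#include <sys/mman.h>
#include <sys/eventfd.h>

uint32_t viommu_alloc(int iommufd, uint32_t iommu_instance);
void dev_assign_virtual_id(int iommufd, uint32_t dev_id,
			   uint32_t viommu_id, uint64_t vrid);
void viommu_setup_queues(int iommufd, uint32_t viommu_id,
			 uint64_t sq_base, uint64_t cq_base);
uint64_t viommu_doorbell_offset(uint32_t viommu_id);
void viommu_set_eventfd(int iommufd, uint32_t viommu_id, int efd);

static void *setup_viommu(int iommufd, uint32_t dev_id, uint64_t vrid,
			  uint64_t sq_base, uint64_t cq_base)
{
	/* 1. alloc a viommu against a single physical IOMMU instance */
	uint32_t viommu_id = viommu_alloc(iommufd, 0);

	/* 2. assign a virtual ID (vRID) to an iommufd device there */
	dev_assign_virtual_id(iommufd, dev_id, viommu_id, vrid);

	/* 3. submission/completion queues live in userspace memory */
	viommu_setup_queues(iommufd, viommu_id, sq_base, cq_base);

	/* 4. mmap the doorbell page so the guest can kick the HW queue */
	void *db = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
			iommufd, viommu_doorbell_offset(viommu_id));

	/* 5. route completion interrupts back via eventfd */
	viommu_set_eventfd(iommufd, viommu_id, eventfd(0, EFD_CLOEXEC));

	return db;
}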
On Mon, Dec 11, 2023 at 09:20:41AM -0400, Jason Gunthorpe wrote:
> On Mon, Dec 11, 2023 at 08:35:09PM +0800, Yi Liu wrote:
> > > So... In short... Invalidation is a PITA. The idea is the same but
> > > annoying little details interfere with actually having a completely
> > > common API here. IMHO the uAPI in this series is fine. It will support
> > > Intel invalidation and non-ATC invalidation on AMD/ARM. It should be
> > > set up to allow that the target domain object can be any HWPT.
> >
> > This HWPT is still a nested domain, isn't it? But it can represent a
> > guest I/O page table (VT-d), a guest CD table (ARM), or a guest CR3
> > table (AMD; it seems to be a set of guest CR3 table pointers). May the
> > ARM and AMD folks keep me honest here.
>
> I was thinking ARM would not want to use a nested domain because
> really the invalidation is global to the entire nesting parent.
>
> But there is an issue with that - the nesting parent could be attached
> to multiple iommu instances, but we only want to invalidate a single
> instance.

I am still not sure about attaching an S2 domain to multiple
SMMUs. An S2 domain is created per SMMU, and we have such a
rejection in arm_smmu_attach_dev():

	} else if (smmu_domain->smmu != smmu)
		ret = -EINVAL;

I understand that it would probably be ideal to share the S2
iopt among the SMMUs. But in the driver, the objects (domains)
holding a shared S2 iopt must be different, to allocate their
own VMIDs, right?

Thanks
Nicolin
On Fri, Dec 08, 2023 at 09:47:26PM -0400, Jason Gunthorpe wrote:
> What is in a Nested domain:
>  ARM: A CD table pointer
>       Nesting domains are created for every unique CD table top pointer.

I think we basically implemented it in a way of syncing the STE, i.e.,
vSTE.Config must be "S1 Translate" besides a CD table pointer, and a
nested domain is freed when vSTE.Config=BYPASS even if a CD table
pointer is present, right?

> To make this work the iommu needs to be programmed with:
>  AMD: A vDomain-ID -> pDomain-ID table
>       A vRID -> pRID table
>       This is all bound to some "virtual function"
>  ARM: A vRID -> pRID table
>       The vCMDQ is bound to a VMID, so to the Nesting Parent

VCMDQ also has something called a "virtual interface" that holds
a VMID and a list of CMDQ queues, which might be a bit similar
to AMD's "virtual function".

> For AMD, as above, I suggest the vDomain-ID be passed when creating
> the nesting domain.
>
> The AMD "virtual function"... It is probably best to create a new
> iommufd object for this and it can be passed in to a few places.
>
> The vRID->pRID table should be some mostly common
> IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. AMD will need to pass in the virtual
> function ID and ARM will need to pass in the Nesting Parent ID.

It sounds like our previous IOMMUFD_SET/UNSET_IDEV_DATA. I'm
wondering if we need to make it exclusive to the ID assigning.
Maybe set_idev_data could be reused for other potential cases?

If we do implement an IOMMUFD_DEV_ASSIGN_VIRTUAL_ID, do we need
an IOMMUFD_DEV_RESIGN_VIRTUAL_ID? (Or a better word than "resign".)

Could the structure just look like this?

struct iommu_dev_assign_virtual_id {
	__u32 size;
	__u32 dev_id;
	__u32 id_type;
	__u32 id;
};

> In many ways the nesting parent/virtual function are very similar
> things. Perhaps ARM should also create a virtual function object which
> is just welded to the nesting parent for API consistency.

A virtual function that holds an S2 domain/iopt + a VMID? If
this is for VCMDQ, the VCMDQ extension driver has that kind of
object holding an S2 domain: I implemented it as an extension
function at the end of arm_smmu_finalise_s2() previously.

> So... In short... Invalidation is a PITA. The idea is the same but
> annoying little details interfere with actually having a completely
> common API here. IMHO the uAPI in this series is fine. It will support
> Intel invalidation and non-ATC invalidation on AMD/ARM. It should be
> set up to allow that the target domain object can be any HWPT.
>
> ARM will be able to do IOTLB invalidation using this API.
>
> IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
> HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
> (and just force the stream ID, userspace must direct the vRID to the
> correct dev_id).

SMMU's CD invalidations could fall into this category too.

> Then in yet another series we can tackle the entire "virtual function"
> vRID/pRID translation stuff when the mmapable queue thing is
> introduced.

VCMDQ is also a mmapable queue. I feel that there could be
more common stuff between "virtual function" and "virtual
interface"; I'll need to take a look at AMD's stuff though.

I previously drafted something to test it out with iommufd.
Basically it needs the pairing of vRID/pRID in attach_dev()
and another ioctl to mmap/config user queue(s):

+struct iommu_hwpt_cache_config_tegra241_vcmdq {
+	__u32 vcmdq_id;			// queue id
+	__u32 vcmdq_log2size;		// queue size
+	__aligned_u64 vcmdq_base;	// queue guest PA
+};

> Thus next steps:
>  - Get an ARM patch that just does IOTLB invalidation and add it to my
>    part 3
>  - Start working on IOMMUFD_DEV_INVALIDATE along with an ARM
>    implementation of it.

I will work on these two, presumably including the new
IOMMUFD_DEV_ASSIGN_VIRTUAL_ID or so.

Thanks
Nicolin
On Mon, Dec 11, 2023 at 12:11:25PM -0800, Nicolin Chen wrote:
> On Mon, Dec 11, 2023 at 09:20:41AM -0400, Jason Gunthorpe wrote:
> > On Mon, Dec 11, 2023 at 08:35:09PM +0800, Yi Liu wrote:
> > > > So... In short... Invalidation is a PITA. The idea is the same but
> > > > annoying little details interfere with actually having a completely
> > > > common API here. IMHO the uAPI in this series is fine. It will
> > > > support Intel invalidation and non-ATC invalidation on AMD/ARM. It
> > > > should be set up to allow that the target domain object can be any
> > > > HWPT.
> > >
> > > This HWPT is still a nested domain, isn't it? But it can represent a
> > > guest I/O page table (VT-d), a guest CD table (ARM), or a guest CR3
> > > table (AMD; it seems to be a set of guest CR3 table pointers). May
> > > the ARM and AMD folks keep me honest here.
> >
> > I was thinking ARM would not want to use a nested domain because
> > really the invalidation is global to the entire nesting parent.
> >
> > But there is an issue with that - the nesting parent could be attached
> > to multiple iommu instances, but we only want to invalidate a single
> > instance.
>
> I am still not sure about attaching an S2 domain to multiple
> SMMUs. An S2 domain is created per SMMU, and we have such a
> rejection in arm_smmu_attach_dev():
>
>	} else if (smmu_domain->smmu != smmu)
>		ret = -EINVAL;

I intend to remove that eventually.

> I understand that it would probably be ideal to share the S2
> iopt among the SMMUs. But in the driver, the objects (domains)
> holding a shared S2 iopt must be different, to allocate their
> own VMIDs, right?

No, the VMID will be moved into struct arm_smmu_master_domain.

Jason
On Mon, Dec 11, 2023 at 01:27:49PM -0800, Nicolin Chen wrote:
> On Fri, Dec 08, 2023 at 09:47:26PM -0400, Jason Gunthorpe wrote:
> > What is in a Nested domain:
> >  ARM: A CD table pointer
> >      Nesting domains are created for every unique CD table top pointer.
>
> I think we basically implemented it by syncing the STE, i.e.,
> vSTE.Config must be "S1 Translate" besides a CD table pointer,
> and a nested domain is freed when vSTE.Config=BYPASS even if a
> CD table pointer is present, right?

Yes, but you can also de-duplicate the nested domains based on the CD
table pointer. It is not as critical for ARM as for others, but may
still be worth doing.

> > To make this work the iommu needs to be programmed with:
> >  AMD: A vDomain-ID -> pDomain-ID table
> >       A vRID -> pRID table
> >       This is all bound to some "virtual function"
> >  ARM: A vRID -> pRID table
> >       The vCMDQ is bound to a VM_ID, so to the Nesting Parent
>
> VCMDQ also has something called "virtual interface" that holds
> a VMID and a list of CMDQ queues, which might be a bit similar
> to AMD's "virtual function".

Yeah, there must be some kind of logical grouping of HW objects to
build that kind of stuff.

> > The vRID->pRID table should be some mostly common
> > IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. AMD will need to pass in the virtual
> > function ID and ARM will need to pass in the Nesting Parent ID.
>
> It sounds like our previous IOMMUFD_SET/UNSET_IDEV_DATA. I'm
> wondering if we need to make it exclusive to the ID assigning?
> Maybe set_idev_data could be reused for other potential cases?

No, it should be an API only for the ID

> If we do implement an IOMMUFD_DEV_ASSIGN_VIRTUAL_ID, do we need
> an IOMMUFD_DEV_RESIGN_VIRTUAL_ID? (or a better word than resign)

I don't think so.. The vRID is basically fixed; if it needs to be
changed then the device can be destroyed (or assign can just change it)

> Could the structure just look like this?
>
> 	struct iommu_dev_assign_virtual_id {
> 		__u32 size;
> 		__u32 dev_id;
> 		__u32 id_type;
> 		__u32 id;
> 	};

It needs to take in the viommu_id also, and I'd make the id 64 bits
just for good luck.

> > In many ways the nesting parent/virtual function are very similar
> > things. Perhaps ARM should also create a virtual function object which
> > is just welded to the nesting parent for API consistency.
>
> A virtual function that holds an S2 domain/iopt + a VMID? If
> this is for VCMDQ, the VCMDQ extension driver has that kind of
> object holding an S2 domain: I implemented it as an extension
> function at the end of arm_smmu_finalise_s2() previously.

Not so much holding an S2, but the VMID would be forced to be shared
among them somehow.

> > IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
> > HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
> > (and just force the stream ID, userspace must direct the vRID to the
> > correct dev_id).
>
> SMMU's CD invalidations could fall into this category too.

Yes, I forgot to look closely at the CD/GCR3 table invalidations :(
I actually can't tell how AMD invalidates any GCR3 cache, maybe
INVALIDATE_DEVTAB_ENTRY?

> > Then in yet another series we can tackle the entire "virtual function"
> > vRID/pRID translation stuff when the mmapable queue thing is
> > introduced.
>
> VCMDQ is also an mmapable queue. I feel that there could be
> more common stuff between "virtual function" and "virtual
> interface", I'll need to take a look at AMD's stuff though.

I'm not thinking of them as two things right now, at least..

> I previously drafted something to test it out with iommufd.
> Basically it needs the pairing of vRID/pRID in attach_dev()
> and another ioctl to mmap/config user queue(s):
>
> +struct iommu_hwpt_cache_config_tegra241_vcmdq {
> +	__u32 vcmdq_id;			// queue id
> +	__u32 vcmdq_log2size;		// queue size
> +	__aligned_u64 vcmdq_base;	// queue guest PA
> +};

vRID/pRID pairing should come from IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. When
a HWPT is allocated it would be connected to the viommu_id and then it
would all be bundled together in the HW somehow.

From there you can ask the viommu_id to set up a queue.

Jason
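For illustration, the de-duplication suggested earlier in this mail
might look roughly like this; all names are assumptions, not code from
the SMMUv3 driver:

	/* Reuse an existing nested domain for the same guest CD table pointer */
	static struct arm_smmu_nested_domain *
	arm_smmu_find_nested_domain(struct arm_smmu_domain *s2, u64 cd_tab_ptr)
	{
		struct arm_smmu_nested_domain *nd;

		list_for_each_entry(nd, &s2->nested_domains, node)
			if (nd->cd_tab_ptr == cd_tab_ptr)
				return nd;	/* de-duplicated hit */
		return NULL;		/* caller allocates a fresh one */
	}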
On Mon, Dec 11, 2023 at 05:57:38PM -0400, Jason Gunthorpe wrote:
> On Mon, Dec 11, 2023 at 01:27:49PM -0800, Nicolin Chen wrote:
> > On Fri, Dec 08, 2023 at 09:47:26PM -0400, Jason Gunthorpe wrote:
> > > What is in a Nested domain:
> > >  ARM: A CD table pointer
> > >      Nesting domains are created for every unique CD table top pointer.
> >
> > I think we basically implemented it by syncing the STE, i.e.,
> > vSTE.Config must be "S1 Translate" besides a CD table pointer,
> > and a nested domain is freed when vSTE.Config=BYPASS even if a
> > CD table pointer is present, right?
>
> Yes, but you can also de-duplicate the nested domains based on the CD
> table pointer. It is not as critical for ARM as for others, but may
> still be worth doing.

Ah right! Should do that.

> > If we do implement an IOMMUFD_DEV_ASSIGN_VIRTUAL_ID, do we need
> > an IOMMUFD_DEV_RESIGN_VIRTUAL_ID? (or a better word than resign)
>
> I don't think so.. The vRID is basically fixed; if it needs to be
> changed then the device can be destroyed (or assign can just change it)

OK. Will insert the cleanup piece into unbind()/destroy().

> > Could the structure just look like this?
> >
> > 	struct iommu_dev_assign_virtual_id {
> > 		__u32 size;
> > 		__u32 dev_id;
> > 		__u32 id_type;
> > 		__u32 id;
> > 	};
>
> It needs to take in the viommu_id also, and I'd make the id 64 bits
> just for good luck.

What is the viommu_id required for in this context? I thought we
already know which SMMU instance to issue commands to via dev_id?

> > > IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
> > > HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
> > > (and just force the stream ID, userspace must direct the vRID to the
> > > correct dev_id).
> >
> > SMMU's CD invalidations could fall into this category too.

Do we need a different iommu API for this ioctl? We currently
have cache_invalidate_user as a domain op. Will the new device
op be an iommu op? And should we rename "cache_invalidate_user"?
Would VT-d still use it for device cache?

> > I previously drafted something to test it out with iommufd.
> > Basically it needs the pairing of vRID/pRID in attach_dev()
> > and another ioctl to mmap/config user queue(s):
> >
> > +struct iommu_hwpt_cache_config_tegra241_vcmdq {
> > +	__u32 vcmdq_id;			// queue id
> > +	__u32 vcmdq_log2size;		// queue size
> > +	__aligned_u64 vcmdq_base;	// queue guest PA
> > +};
>
> vRID/pRID pairing should come from IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. When
> a HWPT is allocated it would be connected to the viommu_id and then it
> would all be bundled together in the HW somehow.

Since we were talking about sharing the stage-2 domain, the HWPT
for the stage-2 domain will be shared among the vIOMMU/pIOMMU
instances too? I think I am not quite getting the viommu_id
part yet. From the other side of this thread, a viommu object
is created per vIOMMU instance and each one is actually tied to
a pIOMMU by nature?

Thanks
Nicolin
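The per-device op being asked about might, as a sketch, sit next to the
existing domain op like this; the name and signature are assumptions
that merely mirror cache_invalidate_user:

	struct iommu_ops {
		/* ... existing ops ... */
		int (*dev_invalidate_user)(struct device *dev,
					   struct iommu_user_data_array *array);
	};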
On Mon, Dec 11, 2023 at 11:30:00PM -0800, Nicolin Chen wrote:
> > > Could the structure just look like this?
> > >
> > > 	struct iommu_dev_assign_virtual_id {
> > > 		__u32 size;
> > > 		__u32 dev_id;
> > > 		__u32 id_type;
> > > 		__u32 id;
> > > 	};
> >
> > It needs to take in the viommu_id also, and I'd make the id 64 bits
> > just for good luck.
>
> What is the viommu_id required for in this context? I thought we
> already know which SMMU instance to issue commands to via dev_id?

The viommu_id would be the container that holds the xarray that maps
the vRID to pRID

Logically we could have multiple mappings per iommufd as we could have
multiple iommu instances working here.

> > > > IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
> > > > HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
> > > > (and just force the stream ID, userspace must direct the vRID to the
> > > > correct dev_id).
> > >
> > > SMMU's CD invalidations could fall into this category too.
>
> Do we need a different iommu API for this ioctl? We currently
> have cache_invalidate_user as a domain op. Will the new device
> op be an iommu op?

Yes

> And should we rename "cache_invalidate_user"? Would VT-d
> still use it for device cache?

I think vt-d will not implement it

> > I previously drafted something to test it out with iommufd.
> > Basically it needs the pairing of vRID/pRID in attach_dev()
> > and another ioctl to mmap/config user queue(s):
> >
> > +struct iommu_hwpt_cache_config_tegra241_vcmdq {
> > +	__u32 vcmdq_id;			// queue id
> > +	__u32 vcmdq_log2size;		// queue size
> > +	__aligned_u64 vcmdq_base;	// queue guest PA
> > +};
>
> vRID/pRID pairing should come from IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. When
> a HWPT is allocated it would be connected to the viommu_id and then it
> would all be bundled together in the HW somehow.
>
> Since we were talking about sharing the stage-2 domain, the HWPT
> for the stage-2 domain will be shared among the vIOMMU/pIOMMU
> instances too?

The HWPT for the stage 2 should be shared

> I think I am not quite getting the viommu_id
> part yet. From the other side of this thread, a viommu object
> is created per vIOMMU instance and each one is actually tied to
> a pIOMMU by nature?

Yes

Jason
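A minimal sketch of the lookup such a container implies, assuming the
viommu object carries an xarray indexed by vRID (the function and field
names are illustrative only):

	static struct iommufd_device *
	iommufd_viommu_find_idev(struct iommufd_viommu *viommu, unsigned long vrid)
	{
		/*
		 * Entries would be installed at IOMMUFD_DEV_ASSIGN_VIRTUAL_ID
		 * time; xa_load() is safe under RCU for readers such as the
		 * invalidation path.
		 */
		return xa_load(&viommu->devices, vrid);
	}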
On Tue, Dec 12, 2023 at 10:44:21AM -0400, Jason Gunthorpe wrote:
> On Mon, Dec 11, 2023 at 11:30:00PM -0800, Nicolin Chen wrote:
> > > > Could the structure just look like this?
> > > >
> > > > 	struct iommu_dev_assign_virtual_id {
> > > > 		__u32 size;
> > > > 		__u32 dev_id;
> > > > 		__u32 id_type;
> > > > 		__u32 id;
> > > > 	};
> > >
> > > It needs to take in the viommu_id also, and I'd make the id 64 bits
> > > just for good luck.
> >
> > What is the viommu_id required for in this context? I thought we
> > already know which SMMU instance to issue commands to via dev_id?
>
> The viommu_id would be the container that holds the xarray that maps
> the vRID to pRID
>
> Logically we could have multiple mappings per iommufd as we could have
> multiple iommu instances working here.

I see. This is the object to hold a shared stage-2 HWPT/domain then.

// iommufd_private.h

enum iommufd_object_type {
	...
+	IOMMUFD_OBJ_VIOMMU,
	...
};

+struct iommufd_viommu {
+	struct iommufd_object obj;
+	struct iommufd_hwpt_paging *hwpt;
+	struct xarray devices;
+};

struct iommufd_hwpt_paging {
	...
+	struct list_head viommu_list;
	...
};

struct iommufd_group {
	...
+	struct iommufd_viommu *viommu; // should we attach to viommu instead of hwpt?
	...
};

A question to finalize how we map vRID->pRID in the xarray:
how should IOMMUFD_DEV_INVALIDATE work? The ioctl structure has
a dev_id and a list of commands that belong to the device. So,
it forwards the struct device pointer to the driver along with
the commands. Then, doesn't the driver already know the pRID
from the dev pointer without looking up a vRID->pRID table?

// uapi/linux/iommufd.h

struct iommu_hwpt_alloc {
	...
+	__u32 viommu_id;
};

+enum iommu_dev_virtual_id_type {
+	IOMMU_DEV_VIRTUAL_ID_TYPE_AMD_VIOMMU_DID, // not sure how this fits the xarray in the viommu obj.
+	IOMMU_DEV_VIRTUAL_ID_TYPE_AMD_VIOMMU_RID,
+	IOMMU_DEV_VIRTUAL_ID_TYPE_ARM_SMMU_SID,
+};
+
+struct iommu_dev_assign_virtual_id {
+	__u32 size;
+	__u32 dev_id;
+	__u32 viommu_id;
+	__u32 id_type;
+	__aligned_u64 id;
+};

Then, I think that we also need an iommu_viommu_alloc structure and
ioctl to allocate an object, and the VMM should know whether it needs
to allocate multiple viommu objects -- this probably needs the hw_info
ioctl to return a piommu_id so the VMM can derive the list of pIOMMUs
from the attached devices?

Another question: introducing the viommu obj complicates things a
lot. Do we want it to come with iommu_dev_assign_virtual_id, or maybe
put it in a later series? We could stage the xarray in the
iommu_hwpt_paging struct for now, so a single-IOMMU system could
still work with that.

> > > > > IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
> > > > > HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
> > > > > (and just force the stream ID, userspace must direct the vRID to the
> > > > > correct dev_id).
> > > >
> > > > SMMU's CD invalidations could fall into this category too.
> >
> > Do we need a different iommu API for this ioctl? We currently
> > have cache_invalidate_user as a domain op. Will the new device
> > op be an iommu op?
>
> Yes

OK. FWIW, I think either the VMM or the iommufd core knows which
nested HWPT the device is attached to. If we ever wanted to keep it
in the domain op list, we could still do that by passing an extra
domain pointer to the driver.

> > And should we rename "cache_invalidate_user"? Would VT-d
> > still use it for device cache?
>
> I think vt-d will not implement it

Then should we "s/cache_invalidate_user/iotlb_sync_user"?

Thanks
Nicolin
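A sketch of the iommu_viommu_alloc mentioned above, assuming it mirrors
the other iommufd allocation ioctls; every field here is a guess rather
than a settled design:

	+struct iommu_viommu_alloc {
	+	__u32 size;
	+	__u32 flags;
	+	__u32 type;		// vendor model of the vIOMMU
	+	__u32 dev_id;		// any device behind the target pIOMMU
	+	__u32 hwpt_id;		// nesting parent (stage-2) HWPT to link
	+	__u32 out_viommu_id;
	+};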
On Tue, Dec 12, 2023 at 11:13:37AM -0800, Nicolin Chen wrote:
> On Tue, Dec 12, 2023 at 10:44:21AM -0400, Jason Gunthorpe wrote:
> > On Mon, Dec 11, 2023 at 11:30:00PM -0800, Nicolin Chen wrote:
> > > > > Could the structure just look like this?
> > > > >
> > > > > 	struct iommu_dev_assign_virtual_id {
> > > > > 		__u32 size;
> > > > > 		__u32 dev_id;
> > > > > 		__u32 id_type;
> > > > > 		__u32 id;
> > > > > 	};
> > > >
> > > > It needs to take in the viommu_id also, and I'd make the id 64 bits
> > > > just for good luck.
> > >
> > > What is the viommu_id required for in this context? I thought we
> > > already know which SMMU instance to issue commands to via dev_id?
> >
> > The viommu_id would be the container that holds the xarray that maps
> > the vRID to pRID
> >
> > Logically we could have multiple mappings per iommufd as we could have
> > multiple iommu instances working here.
>
> I see. This is the object to hold a shared stage-2 HWPT/domain then.

It could be done like that, yes. I wasn't thinking about linking the
stage two so tightly, but perhaps? If we can avoid putting the hwpt
here that might be more general.

> // iommufd_private.h
>
> enum iommufd_object_type {
> 	...
> +	IOMMUFD_OBJ_VIOMMU,
> 	...
> };
>
> +struct iommufd_viommu {
> +	struct iommufd_object obj;
> +	struct iommufd_hwpt_paging *hwpt;
> +	struct xarray devices;
> +};
>
> struct iommufd_hwpt_paging {
> 	...
> +	struct list_head viommu_list;
> 	...
> };

I'd probably first try to go backwards and link the hwpt to the
viommu.

> struct iommufd_group {
> 	...
> +	struct iommufd_viommu *viommu; // should we attach to viommu instead of hwpt?
> 	...
> };

No. Attach is a statement of translation, so you still attach to the HWPT.

> A question to finalize how we map vRID->pRID in the xarray:
> how should IOMMUFD_DEV_INVALIDATE work? The ioctl structure has
> a dev_id and a list of commands that belong to the device. So,
> it forwards the struct device pointer to the driver along with
> the commands. Then, doesn't the driver already know the pRID
> from the dev pointer without looking up a vRID->pRID table?

The first version of DEV_INVALIDATE should have no xarray. The
invalidate commands are stripped of the SID and executed on the given
dev_id, period. The VMM splits up the invalidate command list.

The second version maybe we have the xarray, or maybe we just push the
xarray to the eventual viommu series.

> struct iommu_hwpt_alloc {
> 	...
> +	__u32 viommu_id;
> };
>
> +enum iommu_dev_virtual_id_type {
> +	IOMMU_DEV_VIRTUAL_ID_TYPE_AMD_VIOMMU_DID, // not sure how this fits the xarray in the viommu obj.
> +	IOMMU_DEV_VIRTUAL_ID_TYPE_AMD_VIOMMU_RID,

It is just DID. In both cases the ID is the index to the "STE" radix
tree, whatever the driver happens to call it.

> Then, I think that we also need an iommu_viommu_alloc structure and
> ioctl to allocate an object, and the VMM should know whether it needs
> to allocate multiple viommu objects -- this probably needs the hw_info
> ioctl to return a piommu_id so the VMM can derive the list of pIOMMUs
> from the attached devices?

Yes and yes

> Another question: introducing the viommu obj complicates things a
> lot. Do we want it to come with iommu_dev_assign_virtual_id, or maybe
> put it in a later series? We could stage the xarray in the
> iommu_hwpt_paging struct for now, so a single-IOMMU system could
> still work with that.

All this would be in its own series to enable HW accelerated viommu
support on ARM & AMD, as we've been doing so far. I imagine it coming
after we get the basic invalidation done.

> > > And should we rename "cache_invalidate_user"? Would VT-d
> > > still use it for device cache?
> >
> > I think vt-d will not implement it
>
> Then should we "s/cache_invalidate_user/iotlb_sync_user"?

I think cache_invalidate is still a fine name.. vt-d will generate ATC
invalidations under that function too.

Jason
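To make "the VMM splits up the invalidate command list" concrete, a
userspace-side sketch: walk the guest queue, strip the SID from each
ATC invalidation, and batch per dev_id before issuing one hypothetical
IOMMUFD_DEV_INVALIDATE per device. Every name here is an assumption:

	#include <stdint.h>

	struct atc_inv { uint64_t addr; uint64_t size; };	/* SID already stripped */

	struct dev_inv_batch {
		uint32_t dev_id;		/* iommufd dev_id the vSID resolves to */
		struct atc_inv cmds[32];
		unsigned int num;
	};

	/* VMM bookkeeping: vSID -> batch for the matching dev_id */
	struct dev_inv_batch *vmm_batch_for_vsid(uint32_t vsid);
	/* wraps the hypothetical IOMMUFD_DEV_INVALIDATE ioctl */
	void vmm_dev_invalidate(struct dev_inv_batch *b);

	void vmm_queue_atc_inv(uint32_t vsid, uint64_t addr, uint64_t size)
	{
		struct dev_inv_batch *b = vmm_batch_for_vsid(vsid);

		b->cmds[b->num++] = (struct atc_inv){ .addr = addr, .size = size };
		if (b->num == 32) {		/* batch full: issue and reset */
			vmm_dev_invalidate(b);
			b->num = 0;
		}
	}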
On Tue, Dec 12, 2023 at 03:21:00PM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 12, 2023 at 11:13:37AM -0800, Nicolin Chen wrote:
> > On Tue, Dec 12, 2023 at 10:44:21AM -0400, Jason Gunthorpe wrote:
> > > On Mon, Dec 11, 2023 at 11:30:00PM -0800, Nicolin Chen wrote:
> > > > > > Could the structure just look like this?
> > > > > >
> > > > > > 	struct iommu_dev_assign_virtual_id {
> > > > > > 		__u32 size;
> > > > > > 		__u32 dev_id;
> > > > > > 		__u32 id_type;
> > > > > > 		__u32 id;
> > > > > > 	};
> > > > >
> > > > > It needs to take in the viommu_id also, and I'd make the id 64 bits
> > > > > just for good luck.
> > > >
> > > > What is the viommu_id required for in this context? I thought we
> > > > already know which SMMU instance to issue commands to via dev_id?
> > >
> > > The viommu_id would be the container that holds the xarray that maps
> > > the vRID to pRID
> > >
> > > Logically we could have multiple mappings per iommufd as we could have
> > > multiple iommu instances working here.
> >
> > I see. This is the object to hold a shared stage-2 HWPT/domain then.
>
> It could be done like that, yes. I wasn't thinking about linking the
> stage two so tightly, but perhaps? If we can avoid putting the hwpt
> here that might be more general.
>
> > // iommufd_private.h
> >
> > enum iommufd_object_type {
> > 	...
> > +	IOMMUFD_OBJ_VIOMMU,
> > 	...
> > };
> >
> > +struct iommufd_viommu {
> > +	struct iommufd_object obj;
> > +	struct iommufd_hwpt_paging *hwpt;
> > +	struct xarray devices;
> > +};
> >
> > struct iommufd_hwpt_paging {
> > 	...
> > +	struct list_head viommu_list;
> > 	...
> > };
>
> I'd probably first try to go backwards and link the hwpt to the
> viommu.

I think a VM should have only one hwpt_paging object while having one
or more viommu objects, so we could do only viommu->hwpt_paging and
hwpt_paging->viommu_list. How to go backwards?

> > struct iommufd_group {
> > 	...
> > +	struct iommufd_viommu *viommu; // should we attach to viommu instead of hwpt?
> > 	...
> > };
>
> No. Attach is a statement of translation, so you still attach to the HWPT.

OK. It's probably not necessary since we know which piommu the
device is behind. And we only need to link viommu and piommu, right?

> > A question to finalize how we map vRID->pRID in the xarray:
> > how should IOMMUFD_DEV_INVALIDATE work? The ioctl structure has
> > a dev_id and a list of commands that belong to the device. So,
> > it forwards the struct device pointer to the driver along with
> > the commands. Then, doesn't the driver already know the pRID
> > from the dev pointer without looking up a vRID->pRID table?
>
> The first version of DEV_INVALIDATE should have no xarray. The
> invalidate commands are stripped of the SID and executed on the given
> dev_id, period. The VMM splits up the invalidate command list.

Yes. This makes sense to me. The VMM knows which device to issue an
IOMMUFD_DEV_INVALIDATE to, from the vSID/vRID in the commands.

> The second version maybe we have the xarray, or maybe we just push the
> xarray to the eventual viommu series.

I think that I still don't get the purpose of the xarray here.
It was needed previously because a cache invalidate per hwpt
doesn't know which device. Now IOMMUFD_DEV_INVALIDATE knows.

Maybe it's related to that narrative "logically we could have
multiple mappings per iommufd" that you mentioned above. Mind
elaborating a bit?

In my mind, viommu is allocated by the VMM per piommu, by detecting
the piommu_id via hw_info. In that case, a viommu can only have
one unique device list. If IOMMUFD_DEV_INVALIDATE passes in the
dev_id, we don't really need a mapping of vRID->pRID in a multi-
viommu case either? In other words, the VMM already has a mapping
from vRID to dev_id, so it could call the DEV_INVALIDATE ioctl
in the first place?

Thanks
Nicolin
On Tue, Dec 12, 2023 at 12:05:41PM -0800, Nicolin Chen wrote:
> > > // iommufd_private.h
> > >
> > > enum iommufd_object_type {
> > > 	...
> > > +	IOMMUFD_OBJ_VIOMMU,
> > > 	...
> > > };
> > >
> > > +struct iommufd_viommu {
> > > +	struct iommufd_object obj;
> > > +	struct iommufd_hwpt_paging *hwpt;
> > > +	struct xarray devices;
> > > +};
> > >
> > > struct iommufd_hwpt_paging {
> > > 	...
> > > +	struct list_head viommu_list;
> > > 	...
> > > };
> >
> > I'd probably first try to go backwards and link the hwpt to the
> > viommu.
>
> I think a VM should have only one hwpt_paging object while having one
> or more viommu objects, so we could do only viommu->hwpt_paging and
> hwpt_paging->viommu_list. How to go backwards?

That is probably how things would work, but I don't know if it makes
sense to enforce it in the kernel logic..

Point the S2 to a list of viommu objects it is linked to

> > struct iommufd_group {
> > 	...
> > +	struct iommufd_viommu *viommu; // should we attach to viommu instead of hwpt?
> > 	...
> > };
> >
> > No. Attach is a statement of translation, so you still attach to the HWPT.
>
> OK. It's probably not necessary since we know which piommu the
> device is behind. And we only need to link viommu and piommu, right?

Yes

> > The second version maybe we have the xarray, or maybe we just push the
> > xarray to the eventual viommu series.
>
> I think that I still don't get the purpose of the xarray here.
> It was needed previously because a cache invalidate per hwpt
> doesn't know which device. Now IOMMUFD_DEV_INVALIDATE knows.
>
> Maybe it's related to that narrative "logically we could have
> multiple mappings per iommufd" that you mentioned above. Mind
> elaborating a bit?
>
> In my mind, viommu is allocated by the VMM per piommu, by detecting
> the piommu_id via hw_info. In that case, a viommu can only have
> one unique device list. If IOMMUFD_DEV_INVALIDATE passes in the
> dev_id, we don't really need a mapping of vRID->pRID in a multi-
> viommu case either? In other words, the VMM already has a mapping
> from vRID to dev_id, so it could call the DEV_INVALIDATE ioctl
> in the first place?

The xarray exists to optimize the invalidation flow.

For SW you can imagine issuing an invalidation against the viommu
itself, and *all* commands, be they ASID or ATC invalidations, can be
processed in one shot. The xarray allows converting the vSID to pSID
to process ATC invalidations, and the viommu object forces a single
VMID to handle the ATC invalidations. If we want to do this, I don't
know.

For HW, the xarray holds the vSID to pSID mapping that must be
programmed into the HW operating the dedicated invalidation queue.

Jason
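As a sketch of the SW conversion step described above, a viommu-scoped
walk could rewrite each guest stream ID before handing the command to
the SMMU. Here viommu->devices and idev->psid are assumed fields, while
CMDQ_ATC_0_SID is the real SID field of an SMMUv3 ATC invalidation
command:

	#include <linux/bitfield.h>

	static int viommu_fix_atc_cmd(struct iommufd_viommu *viommu, u64 *cmd)
	{
		u32 vsid = FIELD_GET(CMDQ_ATC_0_SID, cmd[0]);
		struct iommufd_device *idev = xa_load(&viommu->devices, vsid);

		if (!idev)
			return -EINVAL;	/* guest used a vSID it never assigned */
		cmd[0] &= ~CMDQ_ATC_0_SID;
		cmd[0] |= FIELD_PREP(CMDQ_ATC_0_SID, idev->psid);
		return 0;
	}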
On Wed, Dec 13, 2023 at 08:40:55AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 12, 2023 at 12:05:41PM -0800, Nicolin Chen wrote:
> > > > // iommufd_private.h
> > > >
> > > > enum iommufd_object_type {
> > > > 	...
> > > > +	IOMMUFD_OBJ_VIOMMU,
> > > > 	...
> > > > };
> > > >
> > > > +struct iommufd_viommu {
> > > > +	struct iommufd_object obj;
> > > > +	struct iommufd_hwpt_paging *hwpt;
> > > > +	struct xarray devices;
> > > > +};
> > > >
> > > > struct iommufd_hwpt_paging {
> > > > 	...
> > > > +	struct list_head viommu_list;
> > > > 	...
> > > > };
> > >
> > > I'd probably first try to go backwards and link the hwpt to the
> > > viommu.
> >
> > I think a VM should have only one hwpt_paging object while having one
> > or more viommu objects, so we could do only viommu->hwpt_paging and
> > hwpt_paging->viommu_list. How to go backwards?
>
> That is probably how things would work, but I don't know if it makes
> sense to enforce it in the kernel logic..
>
> Point the S2 to a list of viommu objects it is linked to

Hmm, isn't that hwpt_paging->viommu_list already?

> > > The second version maybe we have the xarray, or maybe we just push the
> > > xarray to the eventual viommu series.
> >
> > I think that I still don't get the purpose of the xarray here.
> > It was needed previously because a cache invalidate per hwpt
> > doesn't know which device. Now IOMMUFD_DEV_INVALIDATE knows.
> >
> > Maybe it's related to that narrative "logically we could have
> > multiple mappings per iommufd" that you mentioned above. Mind
> > elaborating a bit?
> >
> > In my mind, viommu is allocated by the VMM per piommu, by detecting
> > the piommu_id via hw_info. In that case, a viommu can only have
> > one unique device list. If IOMMUFD_DEV_INVALIDATE passes in the
> > dev_id, we don't really need a mapping of vRID->pRID in a multi-
> > viommu case either? In other words, the VMM already has a mapping
> > from vRID to dev_id, so it could call the DEV_INVALIDATE ioctl
> > in the first place?
>
> The xarray exists to optimize the invalidation flow.
>
> For SW you can imagine issuing an invalidation against the viommu
> itself, and *all* commands, be they ASID or ATC invalidations, can be
> processed in one shot. The xarray allows converting the vSID to pSID
> to process ATC invalidations, and the viommu object forces a single
> VMID to handle the ATC invalidations. If we want to do this, I don't
> know.

I drafted some patches with IOMMUFD_DEV_INVALIDATE yesterday, and ran
into the same problem that you pointed out here: how the VMM should
handle a group of commands interlaced with ASID and ATC commands. If
the VMM dispatches the commands into two groups, the commands will
execute in a different order than the guest kernel issued them in.
This could get painful if an error occurs in the middle of command
execution, in which case some later invalidations would already have
completed successfully while the CONS index would have to stop at an
earlier command.

And even if there are only ATC invalidations in a guest queue, there
is no guarantee that all commands are for the same dev_id, i.e. the
ATC invalidations themselves would be dispatched into more groups and
separate IOMMUFD_DEV_INVALIDATE calls.

With the xarray, perhaps we could provide a viommu_id in the data
structure of the current iommu_hwpt_invalidate, i.e. reshape the
existing invalidate uAPI to be per-viommu, so it can be reused for ATC
invalidations too instead of adding IOMMUFD_DEV_INVALIDATE? Then we
wouldn't have the out-of-order execution problem above.

> For HW, the xarray holds the vSID to pSID mapping that must be
> programmed into the HW operating the dedicated invalidation queue.

Ah, right! VCMDQ at least needs that.

Thanks
Nicolin
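The reshaping floated above might look roughly like this, layered on
the uAPI this series adds; the viommu_id line is the proposal, and the
field names are assumptions rather than a posted patch:

	struct iommu_hwpt_invalidate {
		__u32 size;
		__u32 hwpt_id;
	+	__u32 viommu_id;	// would let the kernel do vSID->pSID via the xarray
		__aligned_u64 reqs_uptr;	// array of driver-specific requests
		__u32 req_type;
		__u32 req_len;		// bytes per array entry
		__u32 req_num;		// number of entries
	};

With such a handle, a single guest CMDQ batch could be forwarded in
guest order: ASID invalidations resolve through the viommu's S2/VMID,
and ATC invalidations resolve their vSID through the viommu's xarray,
avoiding the reordering problem described above.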
Hey Yi

I have been working with
https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
and have some questions regarding one of the commits in that series.
However, I cannot find it on lore.kernel.org. Can you please direct me
to where the RFC was posted? If it has not been posted yet, do you have
an alternate place for discussion?

Best

On Fri, Nov 17, 2023 at 05:07:11AM -0800, Yi Liu wrote:
> Nested translation is a hardware feature that is supported by many modern
> IOMMUs. It uses two stages (stage-1, stage-2) of address translation to
> get to the physical address. The stage-1 translation table is owned by
> userspace (e.g. by a guest OS), while stage-2 is owned by the kernel.
> Changes to the stage-1 translation table should be followed by an IOTLB
> invalidation.
>
> Take Intel VT-d as an example: the stage-1 translation table is the I/O
> page table. As the diagram below shows, the guest I/O page table pointer
> in GPA (guest physical address) is passed to the host and used to perform
> the stage-1 address translation. Along with it, modifications to present
> mappings in the guest I/O page table should be followed by an IOTLB
> invalidation.
>
>     .-------------.  .---------------------------.
>     |   vIOMMU    |  | Guest I/O page table      |
>     |             |  '---------------------------'
>     .----------------/
>     | PASID Entry |--- PASID cache flush --+
>     '-------------'                        |
>     |             |                        V
>     |             |           I/O page table pointer in GPA
>     '-------------'
> Guest
> ------| Shadow |---------------------------|--------
>       v        v                           v
> Host
>     .-------------.  .------------------------.
>     |   pIOMMU    |  | FS for GIOVA->GPA      |
>     |             |  '------------------------'
>     .----------------/  |
>     | PASID Entry |     V (Nested xlate)
>     '----------------\.----------------------------------.
>     |             |   | SS for GPA->HPA, unmanaged domain|
>     |             |   '----------------------------------'
>     '-------------'
> Where:
>  - FS = First stage page tables
>  - SS = Second stage page tables
> <Intel VT-d Nested translation>
>
> This series adds the cache invalidation path for userspace to invalidate
> the cache after modifying the stage-1 page table. This is based on the
> first part of nesting [1].
>
> Complete code can be found in [2], QEMU code can be found in [3].
>
> Lastly, this is teamwork together with Nicolin Chen and Lu Baolu. Thanks
> to them for the help. ^_^ Looking forward to your feedback.
>
> [1] https://lore.kernel.org/linux-iommu/20231026044216.64964-1-yi.l.liu@intel.com/ - merged
> [2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
> [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
>
> Change log:
>
> v6:
>  - Not much change, just a rebase on top of 6.7-rc1 as part 1/2 is merged
>
> v5: https://lore.kernel.org/linux-iommu/20231020092426.13907-1-yi.l.liu@intel.com/#t
>  - Split the iommufd nesting series into two parts of alloc_user and
>    invalidation (Jason)
>  - Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
>    do the same with the structures/alloc()/abort()/destroy(). Reworked the
>    selftest accordingly too. (Jason)
>  - Move hwpt/data_type into struct iommu_user_data from standalone op
>    arguments. (Jason)
>  - Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
>    _TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
>  - Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
>  - Add macro to the iommu_copy_struct_from_user() to calculate min_size
>    (Jason)
>  - Fix two bugs spotted by ZhaoYan
>
> v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
>  - Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
>    and kernel-managed HWPTs
>  - Rework invalidate uAPI to be a multi-request array-based design
>  - Add a struct iommu_user_data_array and a helper for drivers to sanitize
>    and copy the entry data from the user space invalidation array
>  - Add a patch fixing TEST_LENGTH() in the selftest program
>  - Drop IOMMU_RESV_IOVA_RANGES patches
>  - Update kdoc and inline comments
>  - Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested
>    translation; this does not change the rule that resv regions should only
>    be added to the kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be
>    added in a later series as it is needed only by SMMU so far.
>
> v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.com/
>  - Add new uAPI things in alphabetical order
>  - Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
>    sanity, replacing the previous op->domain_alloc_user_data_len solution
>  - Return ERR_PTR from domain_alloc_user instead of NULL
>  - Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation (Kevin)
>  - Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace so
>    userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest
>    I/O page table). (Kevin)
>  - Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
>  - Minor changes per Kevin's inputs
>
> v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.com/
>  - Add union iommu_domain_user_data to include all user data structures to
>    avoid passing void * in kernel APIs
>  - Add an iommu op to return the user data length for user domain allocation
>  - Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
>  - Store the invalidation data length in
>    iommu_domain_ops::cache_invalidate_user_data_len
>  - Convert the cache_invalidate_user op to return int instead of void
>  - Remove @data_type in struct iommu_hwpt_invalidate
>  - Remove out_hwpt_type_bitmap in struct iommu_hw_info, hence drop patch 08 of v1
>
> v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.com/
>
> Thanks,
> 	Yi Liu
>
> Lu Baolu (1):
>   iommu: Add cache_invalidate_user op
>
> Nicolin Chen (4):
>   iommu: Add iommu_copy_struct_from_user_array helper
>   iommufd/selftest: Add mock_domain_cache_invalidate_user support
>   iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
>   iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
>
> Yi Liu (1):
>   iommufd: Add IOMMU_HWPT_INVALIDATE
>
>  drivers/iommu/iommufd/hw_pagetable.c          | 35 ++++++++
>  drivers/iommu/iommufd/iommufd_private.h       |  9 ++
>  drivers/iommu/iommufd/iommufd_test.h          | 22 +++++
>  drivers/iommu/iommufd/main.c                  |  3 +
>  drivers/iommu/iommufd/selftest.c              | 69 +++++++++++++++
>  include/linux/iommu.h                         | 84 +++++++++++++++++++
>  include/uapi/linux/iommufd.h                  | 35 ++++++++
>  tools/testing/selftests/iommu/iommufd.c       | 75 +++++++++++++++++
>  tools/testing/selftests/iommu/iommufd_utils.h | 63 ++++++++++++++
>  9 files changed, 395 insertions(+)
>
> --
> 2.34.1
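As a userspace sketch of the flow the cover letter describes: after the
guest modifies its stage-1 I/O page table, the VMM batches the affected
ranges into one array-based invalidation call. The field and enum names
below follow this series loosely and may not match the posted patches
exactly; iommufd and nested_hwpt_id are assumed to be an open /dev/iommu
fd and a previously allocated nested HWPT:

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/iommufd.h>

	struct iommu_hwpt_vtd_s1_invalidate reqs[] = {
		{ .addr = giova, .npages = 1, .flags = 0 },	/* one modified mapping */
	};
	struct iommu_hwpt_invalidate cmd = {
		.size = sizeof(cmd),
		.hwpt_id = nested_hwpt_id,		/* the stage-1 (nested) HWPT */
		.reqs_uptr = (uintptr_t)reqs,
		.req_type = IOMMU_HWPT_DATA_VTD_S1,	/* assumed enum name */
		.req_len = sizeof(reqs[0]),
		.req_num = 1,
	};

	if (ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd))
		/* on error, req_num reports how many entries were handled */;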
On 2023/12/17 19:21, Joel Granados wrote:
> Hey Yi
>
> I have been working with
> https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1

Good to know about it.

> and have some questions regarding one of the commits in that series.
> However, I cannot find it on lore.kernel.org. Can you please direct me
> to where the RFC was posted? If it has not been posted yet, do you have
> an alternate place for discussion?

The QEMU series has not been posted yet as the kernel side is still
changing. It still needs some time to be ready for public review.
Zhenzhong Duan is going to post it when it's ready. If you have
questions to discuss, you can post your questions to Zhenzhong and me
first. I guess it may be fine to cc Alex Williamson, Eric Auger,
Nicolin Chen, Cédric Le Goater, Kevin Tian, Jason Gunthorpe and the
QEMU mailing list, as this discussion concerns something that is going
to be posted in public.

>
> Best

--
Regards,
Yi Liu
On Tue, Dec 19, 2023 at 05:26:21PM +0800, Yi Liu wrote:
> On 2023/12/17 19:21, Joel Granados wrote:
> > Hey Yi
> >
> > I have been working with
> > https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
>
> Good to know about it.
>
> > and have some questions regarding one of the commits in that series.
> > However, I cannot find it on lore.kernel.org. Can you please direct me
> > to where the RFC was posted? If it has not been posted yet, do you have
> > an alternate place for discussion?
>
> The QEMU series has not been posted yet as the kernel side is still
> changing. It still needs some time to be ready for public review.
> Zhenzhong Duan is going to post it when it's ready. If you have
> questions to discuss, you can post your questions to Zhenzhong and me
> first. I guess it may be fine to cc Alex Williamson, Eric Auger,
> Nicolin Chen, Cédric Le Goater, Kevin Tian, Jason Gunthorpe and the
> QEMU mailing list, as this discussion concerns something that is going
> to be posted in public.

Thx for getting back to me. I'll direct my questions to these
recipients.

Best