Message ID: 20231202091211.13376-1-yan.y.zhao@intel.com
Headers
From: Yan Zhao <yan.y.zhao@intel.com>
To: iommu@lists.linux.dev, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: alex.williamson@redhat.com, jgg@nvidia.com, pbonzini@redhat.com, seanjc@google.com, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, kevin.tian@intel.com, baolu.lu@linux.intel.com, dwmw2@infradead.org, yi.l.liu@intel.com, Yan Zhao <yan.y.zhao@intel.com>
Subject: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU
Date: Sat, 2 Dec 2023 17:12:11 +0800
Message-Id: <20231202091211.13376-1-yan.y.zhao@intel.com>
Series: Sharing KVM TDP to IOMMU
Message
Yan Zhao
Dec. 2, 2023, 9:12 a.m. UTC
This RFC series proposes a framework to share the KVM TDP (Two-Dimensional
Paging) page table with the IOMMU as its stage-2 paging structure, in order
to support IOPF (IO page fault) handling on the IOMMU's stage-2 paging
structure.

Previously, all guest pages had to be pinned and mapped in IOMMU stage-2
paging structures once pass-through devices were attached, even if a device
has IOPF capability. Such all-guest-memory pinning can be avoided when IOPF
handling for the stage-2 paging structure is supported, and if only
IOPF-capable devices are attached to a VM.

There are 2 approaches to support IOPF on IOMMU stage-2 paging structures:

- Support by IOMMUFD/IOMMU alone
  IOMMUFD handles IO page faults on the stage-2 HWPT by calling GUP and
  then iommu_map() to set up IOVA mappings. (An IOAS is required to keep
  the GPA-to-HVA info, but page pinning/unpinning needs to be skipped.)
  Then, upon MMU notifier events on the host primary MMU, iommu_unmap() is
  called to adjust IOVA mappings accordingly.
  The IOMMU driver needs to support unmapping sub-ranges of a previously
  mapped range and take care of huge page merge and split atomically
  [1][2].

- Sharing KVM TDP
  IOMMUFD sets the root of the KVM TDP page table (EPT/NPT on x86) as the
  root of the IOMMU stage-2 paging structure, and routes IO page faults to
  KVM. (This assumes that the IOMMU hw supports the same stage-2 page
  table format as the CPU.)
  In this model the page table is centrally managed by KVM (MMU notifier,
  page mapping, unmapping of ranges within a huge page, atomic huge page
  split/merge, etc.), while IOMMUFD only needs to invalidate the
  iotlb/devtlb properly.

Currently, there's no upstream code available to support stage-2 IOPF yet.

This RFC chooses to implement the "Sharing KVM TDP" approach, which has
the below main benefits:

- Unified page table management
  The complexity of allocating guest pages per GPA, registering to the MMU
  notifier on the host primary MMU, unmapping arbitrary sub-ranges, and
  atomic page merge/split only needs to be handled on the KVM side, which
  has been doing that well for a long time.

- Reduced page faults:
  Only one page fault is triggered on a single GPA, whether caused by IO
  access or by vCPU access. (Compared to one IO page fault for DMA plus
  one CPU page fault for vCPUs in the non-shared approach.)

- Reduced memory consumption:
  The memory of one page table is saved.

Design
==
In this series, the term "exported" is used in place of "shared" to avoid
confusion with the terminology "shared EPT" in TDX.

The framework contains 3 main objects:

"KVM TDP FD" object - The interface of KVM to export TDP page tables.
                      With this object, KVM allows external components to
                      access a TDP page table exported by KVM.

"IOMMUFD KVM HWPT" object - A proxy connecting KVM TDP FD to the IOMMU
                            driver. This HWPT has no IOAS associated.

"KVM domain" in IOMMU driver - A stage-2 domain in the IOMMU driver whose
                               paging structures are managed by KVM.
                               Its hardware TLB invalidation requests are
                               notified from KVM via the IOMMUFD KVM HWPT
                               object.

      2. IOMMU_HWPT_ALLOC(fd)                  1. KVM_CREATE_TDP_FD
                .------.
    +-----------| QEMU |---------------------------+
    |           '------'<---+ fd                   |
    v                       |                      v

 IOMMU driver     IOMMUFD KVM HWPT          KVM TDP FD          KVM
      |                  |                      |                |
      |                  |--kvm_tdp_fd_get(fd)->|                |
      |<--alloc(meta)----|------get meta------->|                |
 (KVM domain)            |                      |                |
      |                  |--register_importer-->|                |
      |                  |                      |                |
      |  3.              |                      |                |
      |--iopf handler--->|--------fault-------->|------map------>|
      |  4.              |                      |                |
      |<--invalidate-----|<----invalidate-------|<---TLB flush---|
      |  5.              |                      |                |
      |<-----free--------|-unregister_importer->|                |
      |                  |--kvm_tdp_fd_put()--->|                |

1. QEMU calls KVM_CREATE_TDP_FD to create a TDP FD object. The address
   space must be specified to identify the exported TDP page table (e.g.
   system memory or SMM-mode system memory in x86).

2. QEMU calls IOMMU_HWPT_ALLOC to create a KVM-type HWPT. The KVM-type
   HWPT is created upon an exported KVM TDP FD (rather than upon an IOAS),
   acting as the proxy between KVM TDP and the IOMMU driver:
   - It obtains a reference on the exported KVM TDP FD.
   - It gets and passes meta data of the KVM TDP page tables to the IOMMU
     driver for KVM domain allocation.
   - It registers importer callbacks to KVM for invalidation notification.
   - It registers an IOPF handler into the IOMMU's KVM domain.
   Upon device attachment, the root HPA of the exported TDP page table is
   installed into the IOMMU hardware.

3. When IO page faults come in, the IOMMUFD fault handler forwards the
   fault to KVM.

4. When KVM performs a TLB flush, it notifies all importers of the KVM TDP
   FD object. The IOMMUFD KVM HWPT, as an importer, passes the
   notification on to the IOMMU driver for hardware TLB invalidations.

5. On destroying the IOMMUFD KVM HWPT, it frees the IOMMU's KVM domain,
   unregisters itself as an importer from the KVM TDP FD object, and puts
   the reference count of the KVM TDP FD object.
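A minimal sketch of the userspace side of steps 1 and 2 above might look
as follows. The argument layouts and the IOMMU_HWPT_ALLOC_KVM flag name
are illustrative assumptions, not the uapi added by this series (which is
defined in patches 3 and 12):

	/* Assumes <sys/ioctl.h>, <linux/kvm.h> and <linux/iommufd.h>
	 * with this series' uapi additions applied. */
	struct kvm_create_tdp_fd_args {
		__u32 as_id;	/* address space to export, e.g. 0 */
		__u32 flags;
	};

	int create_kvm_hwpt(int vm_fd, int iommufd)
	{
		struct kvm_create_tdp_fd_args tdp_args = { .as_id = 0 };
		struct iommu_hwpt_alloc alloc = {};
		int tdp_fd;

		/* 1. ask KVM to export the TDP for address space 0 */
		tdp_fd = ioctl(vm_fd, KVM_CREATE_TDP_FD, &tdp_args);
		if (tdp_fd < 0)
			return -1;

		/* 2. allocate a KVM-type HWPT upon the TDP FD, no IOAS */
		alloc.size = sizeof(alloc);
		alloc.flags = IOMMU_HWPT_ALLOC_KVM;	/* assumed flag name */
		alloc.pt_id = tdp_fd;			/* TDP FD, not an IOAS */
		if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc) < 0)
			return -1;

		/* devices are then attached to alloc.out_hwpt_id as usual */
		return alloc.out_hwpt_id;
	}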
Status
==
The current support of IOPF on the IOMMU stage-2 paging structure has been
verified on Intel DSA devices on an Intel SPR platform. There's no vIOMMU
for the guest, and Intel DSA devices run in-kernel DMA tests successfully
with IOPFs handled in the host.

- Nested translation in the IOMMU is currently not supported.
- The QEMU code in IOMMUFD to create a KVM HWPT is just a temporary hack.
  As a KVM HWPT has no IOAS associated, it needs to fit into the current
  QEMU code to create a KVM HWPT with no IOAS and to ensure the address
  space is GPA to HPA.
- DSA IOPF hack in the guest driver.
  Although DSA hw tolerates IOPF in all DMA paths, the DSA driver has the
  flexibility to turn off IOPF in certain paths.
  This RFC currently hacks the guest driver to always turn on IOPF.

Note
==
- KVM page write-tracking

  Unlike write-protection, which usually adds back the write permission
  upon a write fault and re-executes the faulting instruction, KVM page
  write-tracking keeps the write permission disabled for the tracked
  pages and instead always emulates the faulting instruction upon fault.
  There is no way to emulate a faulting DMA request, so IOPF and KVM page
  write-tracking are incompatible.
  In this RFC we didn't handle the conflict, given that write-tracking is
  so far applied only to guest page table pages, which are unlikely to be
  used as DMA buffers.

- IOMMU page-walk coherency

  This is about whether the IOMMU hardware will snoop the processor cache
  of the I/O paging structures. If the IOMMU page walk is non-coherent,
  software needs to do clflush after changing the I/O paging structures.
  Supporting non-coherent IOMMU page walk adds an extra burden (i.e.
  clflush) in the KVM MMU in this shared model, which we don't plan to
  support.
  Fortunately, most Intel platforms do support coherent page walk in the
  IOMMU, so this exception should not be a big matter.

- Non-coherent DMA

  Non-coherent DMA requires the KVM MMU to align the effective memory
  type with the guest memory type (CR0.CD, vPAT, vMTRR) instead of
  forcing all guest memory to be WB. It further involves complexities in
  the fault handler to check the guest memory type too, which requires a
  vCPU context. There is certainly no vCPU context in an I/O page fault.
  So this RFC doesn't support devices which cannot be enforced to do
  coherent DMA. If there is interest in supporting non-coherent DMA in
  this shared model, there's a discussion about removing the vMTRR stuff
  in the KVM page fault handler [3], hence it's also possible to further
  remove the vCPU context there.

- Enforce DMA cache coherency

  This design requires the IOMMU to support a configuration forcing all
  DMAs to be coherent (even if the PCI request out of the device sets the
  non-snoop bit), for the aforementioned reason.
  The control of enforcing cache coherency could be per-IOPT or per-page,
  e.g. Intel VT-d defines a per-page format (bit 11 in the PTE represents
  the enforce-snoop bit) in legacy mode and a per-IOPT format (a control
  bit in the PASID entry) in scalable mode.
  Supporting the per-page format requires the KVM MMU to disable any
  software use of bit 11 and also provide additional ops for on-demand
  set/clear-snp requests from iommufd. It's complex and dirty. Therefore
  the per-IOPT scheme is assumed in this design. For the Intel IOMMU, the
  scalable mode is the default mode for all new IOMMU features (nested
  translation, PASID, etc.) anyway.

- About devices which partially support IOPF

  Many devices claiming the PCIe PRS capability actually only tolerate
  IOPF in certain paths (e.g. DMA paths for SVM applications, but not for
  non-SVM applications or driver data such as ring descriptors). But the
  PRS capability doesn't include a bit to tell whether a device 100%
  tolerates IOPF in all DMA paths.
  This creates trouble for how the userspace driver framework (e.g. VFIO)
  knows that a device with PRS can really avoid static pinning of the
  entire guest memory, and then reports such knowledge to the VMM.
  A simple way is to track an allowed list of devices which are known to
  be 100% IOPF-friendly in VFIO. Another option is to extend the PCIe
  spec to allow a device to report whether it fully or partially supports
  IOPF in the PRS capability.
  Another interesting option is to explore supporting partial-IOPF in
  this sharing model:
  * Create a VFIO variant driver to intercept guest operations which
    register non-faultable memory to the device, and to call KVM TDP ops
    to request on-demand pinning of the trapped memory pages in the KVM
    MMU. This allows the VMM to start with zero pinning, as for a
    100%-faultable device, with on-demand pinning initiated by the
    variant driver.
  * Supporting on-demand pinning in the KVM MMU however requires
    non-trivial effort. Besides introducing logic to pin pages long term
    and manage the list of pinned GFNs, more care is required to avoid
    breaking the implications of page pinning, e.g.:
    a. PTE updates in a pinned GFN range must be atomic, otherwise an
       in-flight DMA might be broken.
    b. PTE zap in a pinned GFN range is allowed only when the related
       memory slot is removed (indicating the guest won't use it for
       DMA). The PTE zap for the affected range must be either disabled
       or replaced by an atomic update.
    c. Any feature related to write-protecting the pinned GFN range is
       not allowed. This implies live migration is also broken in the
       current way, as it starts with write-protection even when TDP
       dirty-bit tracking is enabled. To support on-demand pinning it
       would have to rely on a less efficient way of always walking the
       TDP dirty bit instead of using write-protection. Or, we may
       enhance the live migration code to always treat pinned ranges as
       dirty.
    d. Auto NUMA balancing also needs to be disabled. [4]
  If the above trickiness can be resolved cleanly, this sharing model
  could in theory also support a non-faultable device, by
  pinning/unpinning guest memory on slot addition/removal.

- How to map the MSI page on the ARM platform demands discussion.

Patches layout
==
[01-08]: Skeleton implementation of KVM's TDP FD object.
         Patches 1 and 2 are for the public and arch-specific headers.
         Patch 4's commit message outlines the overall data structure
         hierarchy on x86 for preview.

[09-23]: IOMMU, IOMMUFD and Intel vt-d.
         - 09-11: IOMMU core part
         - 12-16: IOMMUFD part
                  Patch 13 is the main patch in IOMMUFD to implement KVM
                  HWPT.
         - 17-23: Intel vt-d part for the KVM domain
                  Patch 18 is the main patch to implement the KVM domain.

[24-42]: KVM x86 and VMX part
         - 24-34: KVM x86 preparation patches.
                  Patch 24: Let KVM reserve bit 11, since bit 11 is
                            reserved as 0 on the IOMMU side.
                  Patch 25: Abstract "struct kvm_mmu_common" from
                            "struct kvm_mmu" for "kvm_exported_tdp_mmu".
                  Patches 26~34: Prepare for page fault in non-vCPU
                            context.
         - 35-38: Core part in KVM x86
                  Patch 35: x86 MMU core part to show how the exported
                            TDP root page is shared between KVM external
                            components and vCPUs.
                  Patch 37: TDP FD fault op implementation.
         - 39-42: KVM VMX part for meta data composing and TLB flush
                  notification.

Code base
==
The code base is commit b85ea95d08647 ("Linux 6.7-rc1")
+ Yi Liu's v7 series "Add Intel VT-d nested translation (part 2/2)" [5]
+ Baolu's v7 series "iommu: Prepare to deliver page faults to user space"
  [6].

The complete code can be found at [7], the QEMU code at [8], and the
guest test script and workaround patch at [9].

[1] https://lore.kernel.org/all/20230814121016.32613-1-jijie.ji@linux.alibaba.com/
[2] https://lore.kernel.org/all/BN9PR11MB5276D897431C7E1399EFFF338C14A@BN9PR11MB5276.namprd11.prod.outlook.com/
[3] https://lore.kernel.org/all/ZUAC0jvFE0auohL4@google.com/
[4] https://lore.kernel.org/all/4cb536f6-2609-4e3e-b996-4a613c9844ad@nvidia.com/
[5] https://lore.kernel.org/linux-iommu/20231117131816.24359-1-yi.l.liu@intel.com/
[6] https://lore.kernel.org/linux-iommu/20231115030226.16700-1-baolu.lu@linux.intel.com/
[7] https://github.com/yanhwizhao/linux_kernel/tree/sharept_iopt
[8] https://github.com/yanhwizhao/qemu/tree/sharept_iopf
[9] https://github.com/yanhwizhao/misc/tree/master
Yan Zhao (42):
  KVM: Public header for KVM to export TDP
  KVM: x86: Arch header for kvm to export TDP for Intel
  KVM: Introduce VM ioctl KVM_CREATE_TDP_FD
  KVM: Skeleton of KVM TDP FD object
  KVM: Embed "arch" object and call arch init/destroy in TDP FD
  KVM: Register/Unregister importers to KVM exported TDP
  KVM: Forward page fault requests to arch specific code for exported TDP
  KVM: Add a helper to notify importers that KVM exported TDP is flushed
  iommu: Add IOMMU_DOMAIN_KVM
  iommu: Add new iommu op to create domains managed by KVM
  iommu: Add new domain op cache_invalidate_kvm
  iommufd: Introduce allocation data info and flag for KVM managed HWPT
  iommufd: Add a KVM HW pagetable object
  iommufd: Enable KVM HW page table object to be proxy between KVM and
    IOMMU
  iommufd: Add iopf handler to KVM hw pagetable
  iommufd: Enable device feature IOPF during device attachment to KVM
    HWPT
  iommu/vt-d: Make some macros and helpers to be extern
  iommu/vt-d: Support of IOMMU_DOMAIN_KVM domain in Intel IOMMU
  iommu/vt-d: Set bit PGSNP in PASIDTE if domain cache coherency is
    enforced
  iommu/vt-d: Support attach devices to IOMMU_DOMAIN_KVM domain
  iommu/vt-d: Check reserved bits for IOMMU_DOMAIN_KVM domain
  iommu/vt-d: Support cache invalidate of IOMMU_DOMAIN_KVM domain
  iommu/vt-d: Allow pasid 0 in IOPF
  KVM: x86/mmu: Move bit SPTE_MMU_PRESENT from bit 11 to bit 59
  KVM: x86/mmu: Abstract "struct kvm_mmu_common" from "struct kvm_mmu"
  KVM: x86/mmu: introduce new op get_default_mt_mask to kvm_x86_ops
  KVM: x86/mmu: change param "vcpu" to "kvm" in kvm_mmu_hugepage_adjust()
  KVM: x86/mmu: change "vcpu" to "kvm" in page_fault_handle_page_track()
  KVM: x86/mmu: remove param "vcpu" from kvm_mmu_get_tdp_level()
  KVM: x86/mmu: remove param "vcpu" from kvm_calc_tdp_mmu_root_page_role()
  KVM: x86/mmu: add extra param "kvm" to kvm_faultin_pfn()
  KVM: x86/mmu: add extra param "kvm" to make_mmio_spte()
  KVM: x86/mmu: add extra param "kvm" to make_spte()
  KVM: x86/mmu: add extra param "kvm" to tdp_mmu_map_handle_target_level()
  KVM: x86/mmu: Get/Put TDP root page to be exported
  KVM: x86/mmu: Keep exported TDP root valid
  KVM: x86: Implement KVM exported TDP fault handler on x86
  KVM: x86: "compose" and "get" interface for meta data of exported TDP
  KVM: VMX: add config KVM_INTEL_EXPORTED_EPT
  KVM: VMX: Compose VMX specific meta data for KVM exported TDP
  KVM: VMX: Implement ops .flush_remote_tlbs* in VMX when EPT is on
  KVM: VMX: Notify importers of exported TDP to flush TLBs on KVM flushes
    EPT

 arch/x86/include/asm/kvm-x86-ops.h       |   4 +
 arch/x86/include/asm/kvm_exported_tdp.h  |  43 +++
 arch/x86/include/asm/kvm_host.h          |  48 ++-
 arch/x86/kvm/Kconfig                     |  13 +
 arch/x86/kvm/mmu.h                       |  12 +-
 arch/x86/kvm/mmu/mmu.c                   | 434 +++++++++++++++++------
 arch/x86/kvm/mmu/mmu_internal.h          |   8 +-
 arch/x86/kvm/mmu/paging_tmpl.h           |  15 +-
 arch/x86/kvm/mmu/spte.c                  |  31 +-
 arch/x86/kvm/mmu/spte.h                  |  82 ++++-
 arch/x86/kvm/mmu/tdp_mmu.c               | 209 +++++++++--
 arch/x86/kvm/mmu/tdp_mmu.h               |   9 +
 arch/x86/kvm/svm/svm.c                   |   2 +-
 arch/x86/kvm/vmx/nested.c                |   2 +-
 arch/x86/kvm/vmx/vmx.c                   |  56 ++-
 arch/x86/kvm/x86.c                       |  68 +++-
 drivers/iommu/intel/Kconfig              |   9 +
 drivers/iommu/intel/Makefile             |   1 +
 drivers/iommu/intel/iommu.c              |  68 ++--
 drivers/iommu/intel/iommu.h              |  47 +++
 drivers/iommu/intel/kvm.c                | 185 ++++++++++
 drivers/iommu/intel/pasid.c              |   3 +-
 drivers/iommu/intel/svm.c                |  37 +-
 drivers/iommu/iommufd/Kconfig            |  10 +
 drivers/iommu/iommufd/Makefile           |   1 +
 drivers/iommu/iommufd/device.c           |  31 +-
 drivers/iommu/iommufd/hw_pagetable.c     |  29 +-
 drivers/iommu/iommufd/hw_pagetable_kvm.c | 270 ++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h  |  44 +++
 drivers/iommu/iommufd/main.c             |   4 +
 include/linux/iommu.h                    |  18 +
 include/linux/kvm_host.h                 |  58 +++
 include/linux/kvm_tdp_fd.h               | 137 +++++++
 include/linux/kvm_types.h                |  12 +
 include/uapi/linux/iommufd.h             |  15 +
 include/uapi/linux/kvm.h                 |  19 +
 virt/kvm/Kconfig                         |   6 +
 virt/kvm/Makefile.kvm                    |   1 +
 virt/kvm/kvm_main.c                      |  24 ++
 virt/kvm/tdp_fd.c                        | 344 ++++++++++++++++++
 virt/kvm/tdp_fd.h                        |  15 +
 41 files changed, 2177 insertions(+), 247 deletions(-)
 create mode 100644 arch/x86/include/asm/kvm_exported_tdp.h
 create mode 100644 drivers/iommu/intel/kvm.c
 create mode 100644 drivers/iommu/iommufd/hw_pagetable_kvm.c
 create mode 100644 include/linux/kvm_tdp_fd.h
 create mode 100644 virt/kvm/tdp_fd.c
 create mode 100644 virt/kvm/tdp_fd.h
Comments
On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> In this series, term "exported" is used in place of "shared" to avoid
> confusion with terminology "shared EPT" in TDX.
>
> The framework contains 3 main objects:
>
> "KVM TDP FD" object - The interface of KVM to export TDP page tables.
>                       With this object, KVM allows external components
>                       to access a TDP page table exported by KVM.

I don't know much about the internals of kvm, but why have this extra
user visible piece? Isn't there only one "TDP" per kvm fd? Why not just
use the KVM FD as a handle for the TDP?

> "IOMMUFD KVM HWPT" object - A proxy connecting KVM TDP FD to IOMMU
>                             driver. This HWPT has no IOAS associated.
>
> "KVM domain" in IOMMU driver - Stage 2 domain in IOMMU driver whose
>                                paging structures are managed by KVM.
>                                Its hardware TLB invalidation requests
>                                are notified from KVM via IOMMUFD KVM
>                                HWPT object.

This seems broadly the right direction.

> - About devices which partially support IOPF
>
>   Many devices claiming PCIe PRS capability actually only tolerate IOPF
>   in certain paths (e.g. DMA paths for SVM applications, but not for
>   non-SVM applications or driver data such as ring descriptors). But
>   the PRS capability doesn't include a bit to tell whether a device
>   100% tolerates IOPF in all DMA paths.

The lack of tolerance for truly DMA pinned guest memory is a significant
problem for any real deployment, IMHO. I am aware of no device that can
handle PRI on every single DMA path. :(

>   A simple way is to track an allowed list of devices which are known
>   100% IOPF-friendly in VFIO. Another option is to extend PCIe spec to
>   allow device reporting whether it fully or partially supports IOPF in
>   the PRS capability.

I think we need something like this.

> - How to map MSI page on arm platform demands discussions.

Yes, the recurring problem :(

Probably the same approach as nesting would work for a hack - map the
ITS page into the fixed reserved slot and tell the guest not to touch it
and to identity map it.

Jason
On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > In this series, term "exported" is used in place of "shared" to avoid
> > confusion with terminology "shared EPT" in TDX.
> >
> > The framework contains 3 main objects:
> >
> > "KVM TDP FD" object - The interface of KVM to export TDP page tables.
> >                       With this object, KVM allows external
> >                       components to access a TDP page table exported
> >                       by KVM.
>
> I don't know much about the internals of kvm, but why have this extra
> user visible piece?

That I don't know, I haven't looked at the gory details of this RFC.

> Isn't there only one "TDP" per kvm fd?

No. In steady state, with TDP (EPT) enabled and assuming homogeneous
capabilities across all vCPUs, KVM will have 3+ sets of TDP page tables
*active* at any given time:

  1. "Normal"
  2. SMM
  3-N. Guest (for L2, i.e. nested, VMs)

The number of possible TDP page tables used for nested VMs is well
bounded, but since devices obviously can't be nested VMs, I won't bother
trying to explain the various possibilities (nested NPT on AMD is
downright ridiculous).

Nested virtualization aside, devices are obviously not capable of
running in SMM and so they all need to use the "normal" page tables.

I highlighted "active" above because if _any_ memslot is deleted, KVM
will invalidate *all* existing page tables and rebuild new page tables
as needed. So over the lifetime of a VM, KVM could theoretically use an
infinite number of page tables.
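[For reference, the distinct "sets" enumerated above correspond to
distinct values of KVM's page role. A heavily simplified sketch of the
relevant bits (the real union kvm_mmu_page_role in
arch/x86/include/asm/kvm_host.h has many more fields):

	/* Simplified sketch, not the exact kernel layout. */
	union kvm_mmu_page_role_sketch {
		u32 word;
		struct {
			unsigned level:4;	/* paging level of the root */
			unsigned direct:1;	/* no shadow paging */
			unsigned guest_mode:1;	/* root for L2 (nested) */
			unsigned smm:1;		/* root for SMM addr space */
			/* access, efer_nx, ad_disabled, ... elided */
		};
	};

Two roots are the same page table only if all role bits match, which is
why the "normal" (smm=0, guest_mode=0), SMM, and nested roots coexist.]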
On Sat, Dec 02, 2023, Yan Zhao wrote:
> This RFC series proposes a framework to resolve IOPF by sharing KVM TDP
> (Two Dimensional Paging) page table to IOMMU as its stage 2 paging
> structure to support IOPF (IO page fault) on IOMMU's stage 2 paging
> structure.
>
> Previously, all guest pages have to be pinned and mapped in IOMMU stage
> 2 paging structures after pass-through devices attached, even if the
> device has IOPF capability. Such all-guest-memory pinning can be
> avoided when IOPF handling for stage 2 paging structure is supported
> and if there are only IOPF-capable devices attached to a VM.
>
> There are 2 approaches to support IOPF on IOMMU stage 2 paging
> structures:
> - Supporting by IOMMUFD/IOMMU alone
>   IOMMUFD handles IO page faults on stage-2 HWPT by calling GUPs and
>   then iommu_map() to setup IOVA mappings. (IOAS is required to keep
>   info of GPA to HVA, but page pinning/unpinning needs to be skipped.)
>   Then upon MMU notifiers on host primary MMU, iommu_unmap() is called
>   to adjust IOVA mappings accordingly.
>   IOMMU driver needs to support unmapping sub-ranges of a previously
>   mapped range and take care of huge page merge and split in atomic
>   way. [1][2]
>
> - Sharing KVM TDP
>   IOMMUFD sets the root of KVM TDP page table (EPT/NPT in x86) as the
>   root of IOMMU stage 2 paging structure, and routes IO page faults to
>   KVM. (This assumes that the iommu hw supports the same stage-2 page
>   table format as CPU.)
>   In this model the page table is centrally managed by KVM (mmu
>   notifier, page mapping, subpage unmapping, atomic huge page
>   split/merge, etc.), while IOMMUFD only needs to invalidate
>   iotlb/devtlb properly.

There are more approaches beyond having IOMMUFD and KVM be completely
separate entities. E.g. extract the bulk of KVM's "TDP MMU"
implementation to common code so that IOMMUFD doesn't need to reinvent
the wheel.

> Currently, there's no upstream code available to support stage 2 IOPF
> yet.
>
> This RFC chooses to implement "Sharing KVM TDP" approach which has
> below main benefits:

Please list out the pros and cons for each. In the cons column for
piggybacking KVM's page tables:

 - *Significantly* increases the complexity in KVM
 - Puts constraints on what KVM can/can't do in the future (see the
   movement of SPTE_MMU_PRESENT).
 - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot
   deletion mess, the truly nasty MTRR emulation (which I still hope to
   delete), the NX hugepage mitigation, etc.

Please also explain the intended/expected/targeted use cases. E.g. if
the main use case is for device passthrough to slice-of-hardware VMs
that aren't memory oversubscribed,

> - Unified page table management
>   The complexity of allocating guest pages per GPAs, registering to MMU
>   notifier on host primary MMU, sub-page unmapping, atomic page
>   merge/split

Please find different terminology than "sub-page". With Sub-Page
Protection, Intel has more or less established "sub-page" to mean "less
than 4KiB granularity". But that can't possibly be what you mean here
because KVM doesn't support (un)mapping memory at <4KiB granularity.
Based on context above, I assume you mean "unmapping arbitrary pages
within a given range".

>   are only required to be handled in KVM side, which has been doing
>   that well for a long time.
>
> - Reduced page faults:
>   Only one page fault is triggered on a single GPA, either caused by IO
>   access or by vCPU access. (compared to one IO page fault for DMA and
>   one CPU page fault for vCPUs in the non-shared approach.)

This would be relatively easy to solve with bi-directional notifiers,
i.e. KVM notifies IOMMUFD when a vCPU faults in a page, and vice versa.

> - Reduced memory consumption:
>   Memory of one page table is saved.

I'm not convinced that memory consumption is all that interesting. If a
VM is mapping the majority of memory into a device, then odds are good
that the guest is backed with at least 2MiB pages, if not 1GiB pages, at
which point the memory overhead for page tables is quite small,
especially relative to the total amount of memory overheads for such
systems.

If a VM is mapping only a small subset of its memory into devices, then
the IOMMU page tables should be sparsely populated, i.e. won't consume
much memory.
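[As a rough worked example of why that overhead is small, assuming
x86-64's 512-entry, 4KiB page-table pages:

	mapping 1 TiB with 4KiB PTEs : one leaf table per 2 MiB
	  -> 524,288 tables x 4 KiB ~= 2 GiB  (~0.2% of mapped memory)
	mapping 1 TiB with 2MiB PTEs : one leaf table per 1 GiB
	  -> 1,024 tables x 4 KiB = 4 MiB     (~0.0004%)

Higher-level tables add a negligible amount on top of the leaf tables.]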
On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> There are more approaches beyond having IOMMUFD and KVM be completely
> separate entities. E.g. extract the bulk of KVM's "TDP MMU"
> implementation to common code so that IOMMUFD doesn't need to reinvent
> the wheel.

We've pretty much done this already, it is called "hmm" and it is what
the IO world uses. Merging/splitting huge pages is just something that
needs some coding in the page table code, that people want for other
reasons anyhow.

> - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot
>   deletion mess, the truly nasty MTRR emulation (which I still hope to
>   delete), the NX hugepage mitigation, etc.

Does it? I think that just remains isolated in kvm. The output from KVM
is only a radix table top pointer, it is up to KVM how to manage it
still.

> I'm not convinced that memory consumption is all that interesting. If
> a VM is mapping the majority of memory into a device, then odds are
> good that the guest is backed with at least 2MiB pages, if not 1GiB
> pages, at which point the memory overhead for page tables is quite
> small, especially relative to the total amount of memory overheads for
> such systems.

AFAIK the main argument is performance. It is similar to why we want to
do IOMMU SVA with MM page table sharing.

If IOMMU mirrors/shadows/copies a page table using something like HMM
techniques then the invalidations will mark ranges of IOVA as
non-present and faults will occur to trigger hmm_range_fault to do the
shadowing.

This means that pretty much all IO will always encounter a non-present
fault, certainly at the start and maybe worse while ongoing.

On the other hand, if we share the exact page table then natural CPU
touches will usually make the page present before an IO happens in
almost all cases and we don't have to take the horribly expensive IO
page fault at all.

We were not able to make bi-dir notifiers with the CPU mm, I'm not sure
that is "relatively easy" :(

Jason
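[For context, the hmm_range_fault() pattern referred to above is the
sequence-count idiom from Documentation/mm/hmm.rst; roughly, with
take_lock()/release_lock() standing in for the driver's own page table
lock:

	struct hmm_range range = {
		.notifier      = &interval_sub,	/* mmu_interval_notifier */
		.start         = addr,
		.end           = addr + size,
		.hmm_pfns      = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(&interval_sub);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);	/* faults pages into primary MMU */
	mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto again;		/* raced with an invalidation */
	else if (ret)
		return ret;

	take_lock(&driver->update);
	if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
		release_lock(&driver->update);
		goto again;
	}
	/* mirror range.hmm_pfns[] into the device page table ... */
	release_lock(&driver->update);

An invalidation racing with the walk forces the retry loop, which is the
"all IO encounters a non-present fault" cost being discussed.]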
On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> > There are more approaches beyond having IOMMUFD and KVM be completely
> > separate entities. E.g. extract the bulk of KVM's "TDP MMU"
> > implementation to common code so that IOMMUFD doesn't need to
> > reinvent the wheel.
>
> We've pretty much done this already, it is called "hmm" and it is what
> the IO world uses. Merging/splitting huge pages is just something that
> needs some coding in the page table code, that people want for other
> reasons anyhow.

Not really. HMM is a wildly different implementation than KVM's TDP MMU.
At a glance, HMM is basically a variation on the primary MMU, e.g. deals
with VMAs, runs under mmap_lock (or per-VMA locks?), and faults memory
into the primary MMU while walking the "secondary" HMM page tables.

KVM's TDP MMU (and all of KVM's flavors of MMUs) is much more of a pure
secondary MMU. The core of a KVM MMU maps GFNs to PFNs, the intermediate
steps that involve the primary MMU are largely orthogonal. E.g. getting
a PFN from guest_memfd instead of the primary MMU essentially boils down
to invoking kvm_gmem_get_pfn() instead of __gfn_to_pfn_memslot(), the
MMU proper doesn't care how the PFN was resolved. I.e. 99% of KVM's MMU
logic has no interaction with the primary MMU.

> > - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the
> >   memslot deletion mess, the truly nasty MTRR emulation (which I
> >   still hope to delete), the NX hugepage mitigation, etc.
>
> Does it? I think that just remains isolated in kvm. The output from
> KVM is only a radix table top pointer, it is up to KVM how to manage
> it still.

Oh, I didn't mean from a code perspective, I meant from a behavioral
perspective. E.g. there's no reason to disallow huge mappings in the
IOMMU because the CPU is vulnerable to the iTLB multi-hit mitigation.

> > I'm not convinced that memory consumption is all that interesting.
> > If a VM is mapping the majority of memory into a device, then odds
> > are good that the guest is backed with at least 2MiB pages, if not
> > 1GiB pages, at which point the memory overhead for page tables is
> > quite small, especially relative to the total amount of memory
> > overheads for such systems.
>
> AFAIK the main argument is performance. It is similar to why we want
> to do IOMMU SVA with MM page table sharing.
>
> If IOMMU mirrors/shadows/copies a page table using something like HMM
> techniques then the invalidations will mark ranges of IOVA as
> non-present and faults will occur to trigger hmm_range_fault to do the
> shadowing.
>
> This means that pretty much all IO will always encounter a non-present
> fault, certainly at the start and maybe worse while ongoing.
>
> On the other hand, if we share the exact page table then natural CPU
> touches will usually make the page present before an IO happens in
> almost all cases and we don't have to take the horribly expensive IO
> page fault at all.

I'm not advocating mirroring/copying/shadowing page tables between KVM
and the IOMMU. I'm suggesting managing IOMMU page tables mostly
independently, but reusing KVM code to do so.

I wouldn't even be opposed to KVM outright managing the IOMMU's page
tables. E.g. add an "iommu" flag to "union kvm_mmu_page_role" and then
the implementation looks rather similar to this series.

What terrifies me is sharing page tables between the CPU and the IOMMU
verbatim.

Yes, sharing page tables will Just Work for faulting in memory, but the
downside is that _when_, not if, KVM modifies PTEs for whatever reason,
those modifications will also impact the IO path. My understanding is
that IO page faults are at least an order of magnitude more expensive
than CPU page faults. That means that what's optimal for CPU page
tables may not be optimal, or even _viable_, for IOMMU page tables.

E.g. based on our conversation at LPC, write-protecting guest memory to
do dirty logging is not a viable option for the IOMMU because the
latency of the resulting IOPF is too high. Forcing KVM to use D-bit
dirty logging for CPUs just because the VM has passthrough (mediated?)
devices would likely be a non-starter.

One of my biggest concerns with sharing page tables between KVM and
IOMMUs is that we will end up having to revert/reject changes that
benefit KVM's usage due to regressing the IOMMU usage. If instead KVM
treats IOMMU page tables as their own thing, then we can have divergent
behavior as needed, e.g. different dirty logging algorithms, different
software-available bits, etc. It would also allow us to define new ABI
instead of trying to reconcile the many incompatibilities and warts in
KVM's existing ABI. E.g. off the top of my head:

 - The virtual APIC page shouldn't be visible to devices, as it's not
   "real" guest memory.

 - Access tracking, i.e. page aging, by making PTEs !PRESENT because the
   CPU doesn't support A/D bits or because the admin turned them off via
   KVM's enable_ept_ad_bits module param.

 - Write-protecting GFNs for shadow paging when L1 is running nested
   VMs. KVM's ABI can be that device writes to L1's page tables are
   exempt.

 - KVM can exempt IOMMU page tables from KVM's awful "drop all page
   tables if any memslot is deleted" ABI.

> We were not able to make bi-dir notifiers with the CPU mm, I'm not
> sure that is "relatively easy" :(

I'm not suggesting full blown mirroring, all I'm suggesting is a
fire-and-forget notifier for KVM to tell IOMMUFD "I've faulted in GFN A,
you might want to do the same". It wouldn't even necessarily need to be
a notifier per se, e.g. if we taught KVM to manage IOMMU page tables,
then KVM could simply install mappings for multiple sets of page tables
as appropriate.
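[A hypothetical shape for that fire-and-forget hint, purely illustrative
and not an interface proposed by this series (all names are made up):

	/* Advisory only: must not sleep, may be dropped on the floor. */
	struct kvm_gfn_hint_ops {
		void (*gfn_faulted)(void *data, gfn_t gfn, int level);
	};

	/* In KVM's fault path, after a vCPU fault installs an SPTE: */
	static void kvm_hint_importers(struct kvm *kvm, gfn_t gfn, int level)
	{
		struct kvm_gfn_hint *hint;

		list_for_each_entry(hint, &kvm->gfn_hint_list, node)
			hint->ops->gfn_faulted(hint->data, gfn, level);
	}

IOMMUFD could react by pre-faulting the same GFN into its own page
tables, or ignore the hint entirely; correctness would still come from
the regular IOPF path.]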
On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > We've pretty much done this already, it is called "hmm" and it is
> > what the IO world uses. Merging/splitting huge pages is just
> > something that needs some coding in the page table code, that people
> > want for other reasons anyhow.
>
> Not really. HMM is a wildly different implementation than KVM's TDP
> MMU. At a glance, HMM is basically a variation on the primary MMU,
> e.g. deals with VMAs, runs under mmap_lock (or per-VMA locks?), and
> faults memory into the primary MMU while walking the "secondary" HMM
> page tables.

hmm supports the essential idea of shadowing parts of the primary MMU.
This is a big chunk of what kvm is doing, just differently.

> KVM's TDP MMU (and all of KVM's flavors of MMUs) is much more of a pure
> secondary MMU. The core of a KVM MMU maps GFNs to PFNs, the
> intermediate steps that involve the primary MMU are largely orthogonal.
> E.g. getting a PFN from guest_memfd instead of the primary MMU
> essentially boils down to invoking kvm_gmem_get_pfn() instead of
> __gfn_to_pfn_memslot(), the MMU proper doesn't care how the PFN was
> resolved. I.e. 99% of KVM's MMU logic has no interaction with the
> primary MMU.

Hopefully the memfd stuff will be generalized so we can use it in
iommufd too, without relying on kvm. At least the first basic stuff
should be doable fairly soon.

> I'm not advocating mirroring/copying/shadowing page tables between KVM
> and the IOMMU. I'm suggesting managing IOMMU page tables mostly
> independently, but reusing KVM code to do so.

I guess from my POV, if KVM has two copies of the logically same radix
tree then that is fine too.

> Yes, sharing page tables will Just Work for faulting in memory, but the
> downside is that _when_, not if, KVM modifies PTEs for whatever reason,
> those modifications will also impact the IO path. My understanding is
> that IO page faults are at least an order of magnitude more expensive
> than CPU page faults. That means that what's optimal for CPU page
> tables may not be optimal, or even _viable_, for IOMMU page tables.

Yes, you wouldn't want to do some of the same KVM techniques today in a
shared mode.

> E.g. based on our conversation at LPC, write-protecting guest memory to
> do dirty logging is not a viable option for the IOMMU because the
> latency of the resulting IOPF is too high. Forcing KVM to use D-bit
> dirty logging for CPUs just because the VM has passthrough (mediated?)
> devices would likely be a non-starter.

Yes

> One of my biggest concerns with sharing page tables between KVM and
> IOMMUs is that we will end up having to revert/reject changes that
> benefit KVM's usage due to regressing the IOMMU usage.

It is certainly a strong argument.

> I'm not suggesting full blown mirroring, all I'm suggesting is a
> fire-and-forget notifier for KVM to tell IOMMUFD "I've faulted in GFN
> A, you might want to do the same".

If we say the only thing this works with is the memfd version of KVM,
could we design the memfd stuff to not have the same challenges with
mirroring as normal VMAs?

> It wouldn't even necessarily need to be a notifier per se, e.g. if we
> taught KVM to manage IOMMU page tables, then KVM could simply install
> mappings for multiple sets of page tables as appropriate.

This somehow feels more achievable to me since KVM already has all the
code to handle multiple TDPs, having two parallel ones is probably much
easier than trying to weld KVM to a different page table implementation
through some kind of loosely coupled notifier.

Jason
On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> > I'm not suggesting full blown mirroring, all I'm suggesting is a
> > fire-and-forget notifier for KVM to tell IOMMUFD "I've faulted in GFN
> > A, you might want to do the same".
>
> If we say the only thing this works with is the memfd version of KVM,

That's likely a big "if", as guest_memfd is not and will not be a
wholesale replacement of VMA-based guest memory, at least not in the
foreseeable future. I would be quite surprised if the target use cases
for this could be moved to guest_memfd without losing required
functionality.

> could we design the memfd stuff to not have the same challenges with
> mirroring as normal VMAs?

What challenges in particular are you concerned about? And maybe also
define "mirroring"? E.g. ensuring that the CPU and IOMMU page tables are
synchronized is very different from ensuring that the IOMMU page tables
can only map memory that is mappable by the guest, i.e. that KVM can map
into the CPU page tables.
On Mon, Dec 04, 2023 at 12:11:46PM -0800, Sean Christopherson wrote:
> > could we design the memfd stuff to not have the same challenges with
> > mirroring as normal VMAs?
>
> What challenges in particular are you concerned about? And maybe also
> define "mirroring"? E.g. ensuring that the CPU and IOMMU page tables
> are synchronized is very different from ensuring that the IOMMU page
> tables can only map memory that is mappable by the guest, i.e. that
> KVM can map into the CPU page tables.

IIRC, it has been a while, it is difficult to get a newly populated PTE
out of the MM side and into an hmm user, and to get all the invalidation
locking to work as well. Especially when the devices want to do sleeping
invalidations.

kvm doesn't solve this problem either, but pushing populated TDP PTEs to
another observer may be simpler, as perhaps would pushing populated
memfd pages or something like that?

"mirroring" here would simply mean that if the CPU side has a populated
page then the hmm side copying it would also have a populated page,
instead of a fault-on-use model.

Jason
On Mon, Dec 04, 2023 at 08:38:17AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > > In this series, term "exported" is used in place of "shared" to
> > > avoid confusion with terminology "shared EPT" in TDX.
> > >
> > > The framework contains 3 main objects:
> > >
> > > "KVM TDP FD" object - The interface of KVM to export TDP page
> > >                       tables. With this object, KVM allows external
> > >                       components to access a TDP page table
> > >                       exported by KVM.
> >
> > I don't know much about the internals of kvm, but why have this extra
> > user visible piece?
>
> That I don't know, I haven't looked at the gory details of this RFC.
>
> > Isn't there only one "TDP" per kvm fd?
>
> No. In steady state, with TDP (EPT) enabled and assuming homogeneous
> capabilities across all vCPUs, KVM will have 3+ sets of TDP page tables
> *active* at any given time:
>
>   1. "Normal"
>   2. SMM
>   3-N. Guest (for L2, i.e. nested, VMs)

Yes, the reason to introduce the KVM TDP FD is to let KVM know which TDP
the user wants to export (share). For as_id=0 (which is currently the
only supported as_id to share), a TDP with smm=0, guest_mode=0 will be
chosen.

Upon receiving the KVM_CREATE_TDP_FD ioctl, KVM will try to find an
existing TDP root with the role specified by as_id 0. If an existing TDP
with the target role is found, KVM will just export that one; if no
existing one is found, KVM will create a new TDP root in non-vCPU
context. Then, KVM will mark the exported TDP as "exported".

 tdp_mmu_roots                  role | smm | guest_mode
       |                       ------+-----+-----------
       |                         0   |  0  |  0         ==> address space 0
       |                         1   |  1  |  0
       |                         2   |  0  |  1
       |                         3   |  1  |  1
       v
  .--------.   .--------.   .--------.
  |  root  |   |  root  |   |  root  |
  |(role 1)|   |(role 2)|   |(role 3)|
  '--------'   '--------'   '--------'
       ^
       |  create or get               .------.
       +------------------------------| vCPU |
       |  fault                       '------'
       |                              smm=1, guest_mode=0
       | (set root as exported)
       v
  .--------.  create or get  .---------------.  create or get  .------.
  | TDP FD |---------------->| root (role 0) |<----------------| vCPU |
  '--------'      fault      '---------------'      fault      '------'
                              smm=0, guest_mode=0

          non-vCPU context <---|---> vCPU context

No matter whether the TDP is exported or not, vCPUs just load a TDP root
according to their vCPU modes. In this way, KVM is able to share the TDP
for KVM address space 0 to the IOMMU side.

> The number of possible TDP page tables used for nested VMs is well
> bounded, but since devices obviously can't be nested VMs, I won't
> bother trying to explain the various possibilities (nested NPT on AMD
> is downright ridiculous).

In future, if possible, I wonder if we can export a TDP for nested VMs
too, e.g. in scenarios where the TDP is partitioned, and one piece is
for an L2 VM. Maybe we can specify that and tell KVM the very piece of
TDP to export.

> Nested virtualization aside, devices are obviously not capable of
> running in SMM and so they all need to use the "normal" page tables.
>
> I highlighted "active" above because if _any_ memslot is deleted, KVM
> will invalidate *all* existing page tables and rebuild new page tables
> as needed. So over the lifetime of a VM, KVM could theoretically use
> an infinite number of page tables.

Right. In patch 36, the TDP root which is marked as "exported" is
exempted from this "invalidate". Instead, an exported TDP just zaps all
leaf entries upon memory slot removal. That is to say, an exported TDP
can stay "active" until it's unmarked as exported.
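[A rough sketch of that get-or-create flow; the helper names are
illustrative, not the series' actual functions:

	/* Sketch of KVM_CREATE_TDP_FD root selection for as_id 0. */
	static struct kvm_mmu_page *get_exported_root(struct kvm *kvm)
	{
		union kvm_mmu_page_role role = {};  /* smm=0, guest_mode=0 */
		struct kvm_mmu_page *root;

		role.level  = kvm_mmu_get_tdp_level(kvm);
		role.direct = 1;

		/* reuse a live root with the target role ... */
		list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
			if (root->role.word == role.word &&
			    !root->role.invalid)
				return root;  /* take a ref, mark exported */
		}
		/* ... else allocate a fresh root in non-vCPU context */
		return tdp_mmu_alloc_exported_root(kvm, role);  /* assumed */
	}
]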
On Mon, Dec 04, 2023 at 11:08:00AM -0400, Jason Gunthorpe wrote:
> On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > In this series, term "exported" is used in place of "shared" to avoid
> > confusion with terminology "shared EPT" in TDX.
> >
> > The framework contains 3 main objects:
> >
> > "KVM TDP FD" object - The interface of KVM to export TDP page tables.
> >                       With this object, KVM allows external
> >                       components to access a TDP page table exported
> >                       by KVM.
>
> I don't know much about the internals of kvm, but why have this extra
> user visible piece? Isn't there only one "TDP" per kvm fd? Why not
> just use the KVM FD as a handle for the TDP?

As explained in a parallel mail, the reason to introduce the KVM TDP FD
is to let KVM know which TDP the user wants to export (share). Another
reason is to wrap the exported TDP and its exported ops in a single
structure, so that components outside of KVM can query meta data,
request page faults, and register invalidation callbacks through the
exported ops:

struct kvm_tdp_fd {
        /* Public */
        struct file *file;
        const struct kvm_exported_tdp_ops *ops;

        /* private to KVM */
        struct kvm_exported_tdp *priv;
};

For KVM, it only needs to expose this struct kvm_tdp_fd and the two
symbols kvm_tdp_fd_get() and kvm_tdp_fd_put().

> > "IOMMUFD KVM HWPT" object - A proxy connecting KVM TDP FD to IOMMU
> >                             driver. This HWPT has no IOAS associated.
> >
> > "KVM domain" in IOMMU driver - Stage 2 domain in IOMMU driver whose
> >                                paging structures are managed by KVM.
> >                                Its hardware TLB invalidation requests
> >                                are notified from KVM via IOMMUFD KVM
> >                                HWPT object.
>
> This seems broadly the right direction
>
> > - About devices which partially support IOPF
> >
> >   Many devices claiming PCIe PRS capability actually only tolerate
> >   IOPF in certain paths (e.g. DMA paths for SVM applications, but not
> >   for non-SVM applications or driver data such as ring descriptors).
> >   But the PRS capability doesn't include a bit to tell whether a
> >   device 100% tolerates IOPF in all DMA paths.
>
> The lack of tolerance for truly DMA pinned guest memory is a
> significant problem for any real deployment, IMHO. I am aware of no
> device that can handle PRI on every single DMA path. :(

DSA actually can handle PRI on all DMA paths. But it requires the driver
to turn on this capability :(

> > A simple way is to track an allowed list of devices which are known
> > 100% IOPF-friendly in VFIO. Another option is to extend PCIe spec to
> > allow device reporting whether it fully or partially supports IOPF in
> > the PRS capability.
>
> I think we need something like this.
>
> > - How to map MSI page on arm platform demands discussions.
>
> Yes, the recurring problem :(
>
> Probably the same approach as nesting would work for a hack - map the
> ITS page into the fixed reserved slot and tell the guest not to touch
> it and to identity map it.

Ok.
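[Based on that description, the exported ops could look roughly like the
sketch below; this is illustrative only, the actual kvm_exported_tdp_ops
live in include/linux/kvm_tdp_fd.h in this series and may differ in
names and signatures:

	struct kvm_exported_tdp_ops_sketch {
		/* query meta data: root HPA, level, page table format... */
		int (*get_metadata)(struct kvm_tdp_fd *tdp_fd, void *meta);

		/* ask KVM to resolve an IO page fault on the exported TDP */
		int (*fault)(struct kvm_tdp_fd *tdp_fd, gfn_t gfn,
			     u32 error_code);

		/* register an importer for TLB invalidation notifications */
		int (*register_importer)(struct kvm_tdp_fd *tdp_fd,
					 void (*invalidate)(void *data,
							    gfn_t start,
							    gfn_t end),
					 void *data);
		void (*unregister_importer)(struct kvm_tdp_fd *tdp_fd,
					    void *data);
	};
]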
On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> On Sat, Dec 02, 2023, Yan Zhao wrote:
> Please list out the pros and cons for each. In the cons column for
> piggybacking KVM's page tables:
>
> - *Significantly* increases the complexity in KVM

The complexities to KVM (up to now) are:

a. fault in non-vCPU context
b. keep the exported root always "active"
c. disallow non-coherent DMAs
d. movement of SPTE_MMU_PRESENT

For a, I think it's accepted, and we can see that eager page split
already allocates non-leaf pages in non-vCPU context.

For b, it requires the exported TDP root to stay "active" across KVM's
"fast zap" (which invalidates all active TDP roots). Instead, the
exported TDP's leaf entries are all zapped. Though that looks not "fast"
enough, it avoids an unnecessary root page zap, and the operation is
actually not frequent:
- one for memslot removal (an IO page fault is unlikely to happen during
  VM boot-up)
- one for MMIO gen wraparound (which is rare)
- one for nx huge page mode change (which is rare too)

For c, maybe we can work out a way to remove the MTRR stuff.

For d, I added a config to turn on/off this movement. But right, the KVM
side will have to sacrifice a bit of software usage and take care of it
when the config is on.

> - Puts constraints on what KVM can/can't do in the future (see the
>   movement of SPTE_MMU_PRESENT).
> - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot
>   deletion mess, the truly nasty MTRR emulation (which I still hope to
>   delete), the NX hugepage mitigation, etc.

The NX hugepage mitigation only exists on certain CPUs. I don't see it
on recent Intel platforms, e.g. SPR and GNR... We can disallow the
sharing approach if the NX huge page mitigation is enabled. But if
pinning or partial pinning are not involved, nx huge pages will only
cause unnecessary zaps and reduce performance; functionally it still
works well.

Besides, regarding the extra IO invalidation involved in TDP zaps, I
think SVM has the same issue, i.e. each zap in the primary MMU is also
accompanied by an IO invalidation.

> Please also explain the intended/expected/targeted use cases. E.g. if
> the main use case is for device passthrough to slice-of-hardware VMs
> that aren't memory oversubscribed,

The main use case is device passthrough with all devices supporting full
IOPF. Opportunistically, we hope it can be used in trusted IO, where the
TDP is shared to the IO side. Then only one page table audit is
required, and the out-of-sync window for mappings between the CPU and IO
side can also be eliminated.

> > - Unified page table management
> >   The complexity of allocating guest pages per GPAs, registering to
> >   MMU notifier on host primary MMU, sub-page unmapping, atomic page
> >   merge/split
>
> Please find different terminology than "sub-page". With Sub-Page
> Protection, Intel has more or less established "sub-page" to mean
> "less than 4KiB granularity". But that can't possibly be what you mean
> here because KVM doesn't support (un)mapping memory at <4KiB
> granularity. Based on context above, I assume you mean "unmapping
> arbitrary pages within a given range".

Ok, sorry for the confusion. By "sub-page unmapping", I mean atomic huge
page splitting and unmapping of a smaller range within the previous huge
page.
On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> > > I'm not convinced that memory consumption is all that interesting.
> > > If a VM is mapping the majority of memory into a device, then odds
> > > are good that the guest is backed with at least 2MiB pages, if not
> > > 1GiB pages, at which point the memory overhead for page tables is
> > > quite small, especially relative to the total amount of memory
> > > overheads for such systems.
> >
> > AFAIK the main argument is performance. It is similar to why we want
> > to do IOMMU SVA with MM page table sharing.
> >
> > If IOMMU mirrors/shadows/copies a page table using something like
> > HMM techniques then the invalidations will mark ranges of IOVA as
> > non-present and faults will occur to trigger hmm_range_fault to do
> > the shadowing.
> >
> > This means that pretty much all IO will always encounter a
> > non-present fault, certainly at the start and maybe worse while
> > ongoing.
> >
> > On the other hand, if we share the exact page table then natural CPU
> > touches will usually make the page present before an IO happens in
> > almost all cases and we don't have to take the horribly expensive IO
> > page fault at all.
>
> I'm not advocating mirroring/copying/shadowing page tables between KVM
> and the IOMMU. I'm suggesting managing IOMMU page tables mostly
> independently, but reusing KVM code to do so.
>
> I wouldn't even be opposed to KVM outright managing the IOMMU's page
> tables. E.g. add an "iommu" flag to "union kvm_mmu_page_role" and then
> the implementation looks rather similar to this series.

Yes, very similar to the current implementation, which adds an
"exported" flag to "union kvm_mmu_page_role".

> What terrifies me is sharing page tables between the CPU and the IOMMU
> verbatim.
>
> Yes, sharing page tables will Just Work for faulting in memory, but
> the downside is that _when_, not if, KVM modifies PTEs for whatever
> reason, those modifications will also impact the IO path. My
> understanding is that IO page faults are at least an order of
> magnitude more expensive than CPU page faults. That means that what's
> optimal for CPU page tables may not be optimal, or even _viable_, for
> IOMMU page tables.
>
> E.g. based on our conversation at LPC, write-protecting guest memory
> to do dirty logging is not a viable option for the IOMMU because the
> latency of the resulting IOPF is too high. Forcing KVM to use D-bit
> dirty logging for CPUs just because the VM has passthrough (mediated?)
> devices would likely be a non-starter.
>
> One of my biggest concerns with sharing page tables between KVM and
> IOMMUs is that we will end up having to revert/reject changes that
> benefit KVM's usage due to regressing the IOMMU usage.

As the TDP shared by the IOMMU is marked by KVM, could we limit the
changes (those that benefit KVM but regress the IOMMU) to TDPs that are
not shared?

> If instead KVM treats IOMMU page tables as their own thing, then we
> can have divergent behavior as needed, e.g. different dirty logging
> algorithms, different software-available bits, etc. It would also
> allow us to define new ABI instead of trying to reconcile the many
> incompatibilities and warts in KVM's existing ABI. E.g. off the top of
> my head:
>
> - The virtual APIC page shouldn't be visible to devices, as it's not
>   "real" guest memory.
>
> - Access tracking, i.e. page aging, by making PTEs !PRESENT because
>   the CPU doesn't support A/D bits or because the admin turned them
>   off via KVM's enable_ept_ad_bits module param.
>
> - Write-protecting GFNs for shadow paging when L1 is running nested
>   VMs. KVM's ABI can be that device writes to L1's page tables are
>   exempt.
>
> - KVM can exempt IOMMU page tables from KVM's awful "drop all page
>   tables if any memslot is deleted" ABI.
>
> > We were not able to make bi-dir notifiers with the CPU mm, I'm not
> > sure that is "relatively easy" :(
>
> I'm not suggesting full blown mirroring, all I'm suggesting is a
> fire-and-forget notifier for KVM to tell IOMMUFD "I've faulted in GFN
> A, you might want to do the same".
>
> It wouldn't even necessarily need to be a notifier per se, e.g. if we
> taught KVM to manage IOMMU page tables, then KVM could simply install
> mappings for multiple sets of page tables as appropriate.

I'm not sure which approach below is the one you are referring to by
"fire-and-forget notifier" and "if we taught KVM to manage IOMMU page
tables".

Approach A:
1. User space or IOMMUFD tells KVM which address space to share to
   IOMMUFD.
2. KVM creates a special TDP and maps this page table whenever a GFN in
   the specified address space is faulted to a PFN on the vCPU side.
3. IOMMUFD imports this special TDP and receives zap notifications from
   KVM. KVM will only send the zap notification for memslot removal or
   for certain MMU zap events.

Approach B:
1. User space or IOMMUFD tells KVM which address space to notify.
2. KVM notifies IOMMUFD whenever a GFN in the specified address space is
   faulted to a PFN on the vCPU side.
3. IOMMUFD translates the GFN to a PFN in its own way (through the VMA
   or through a certain new memfd interface), and maps IO PTEs by
   itself.
4. IOMMUFD zaps IO PTEs when a memslot is removed, and interacts with
   the MMU notifier for zap notifications in the primary MMU.

If approach A is preferred, could vCPUs also be allowed to attach to
this special TDP in VMs that don't suffer from the NX hugepage
mitigation, do not want live migration with passthrough devices, and
don't rely on write-protection for nested VMs?
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, December 4, 2023 11:08 PM
>
> On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > - How to map MSI page on arm platform demands discussions.
>
> Yes, the recurring problem :(
>
> Probably the same approach as nesting would work for a hack - map the
> ITS page into the fixed reserved slot and tell the guest not to touch
> it and to identity map it.

Yes, logically it should follow what is planned for nesting. Just that
kvm needs to involve more iommu-specific knowledge, e.g.
iommu_get_msi_cookie(), to reserve the slot.
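[For reference, iommu_get_msi_cookie() is the hook that VFIO-style
userspace-owned domains already use to reserve an IOVA window for MSI
doorbells; a KVM-managed domain would need an equivalent call. A sketch,
with msi_base chosen by the caller:

	#include <linux/iommu.h>

	static int reserve_msi_window(struct iommu_domain *domain,
				      dma_addr_t msi_base)
	{
		/*
		 * Sets up a cookie so iommu_dma_prepare_msi() can later
		 * map doorbell pages (e.g. the GIC ITS page) inside the
		 * window starting at msi_base.
		 */
		return iommu_get_msi_cookie(domain, msi_base);
	}
]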
> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Tuesday, December 5, 2023 9:32 AM
>
> On Mon, Dec 04, 2023 at 08:38:17AM -0800, Sean Christopherson wrote:
> > The number of possible TDP page tables used for nested VMs is well
> > bounded, but since devices obviously can't be nested VMs, I won't
> > bother trying to explain the various possibilities (nested NPT on
> > AMD is downright ridiculous).
>
> In future, if possible, I wonder if we can export a TDP for nested VMs
> too, e.g. in scenarios where the TDP is partitioned, and one piece is
> for an L2 VM. Maybe we can specify that and tell KVM the very piece of
> TDP to export.

Nesting is tricky. The reason why the sharing (w/o nesting) is logically
OK is that both the IOMMU and KVM page tables are for the same GPA
address space created by the host.

For a nested VM together with a vIOMMU, the same sharing story holds if
the stage-2 page table on both sides still translates GPA. It implies
the vIOMMU is enabled in nested translation mode and L0 KVM doesn't
expose vEPT to the L1 VMM (which then uses shadowing instead).

Things become tricky when the vIOMMU is working in a shadowing mode or
when L0 KVM exposes vEPT to the L1 VMM. In either case the stage-2 page
table of the L0 IOMMU/KVM actually translates a guest address space, and
then sharing becomes problematic (on figuring out whether both sides
refer to the same guest address space, while that fact might change at
any time).
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, December 5, 2023 3:51 AM
>
> On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> > It wouldn't even necessarily need to be a notifier per se, e.g. if
> > we taught KVM to manage IOMMU page tables, then KVM could simply
> > install mappings for multiple sets of page tables as appropriate.

The iommu driver still needs to be notified to invalidate the iotlb,
unless we want KVM to directly call the IOMMU API instead of going
through iommufd.

> This somehow feels more achievable to me since KVM already has all the
> code to handle multiple TDPs, having two parallel ones is probably
> much easier than trying to weld KVM to a different page table
> implementation through some kind of loosely coupled notifier.

Yes, performance-wise this can also reduce the I/O page faults, as the
sharing approach achieves. But how does it compare to another way of
supporting IOPF natively in iommufd and the iommu drivers?

Note that iommufd also needs to support native vfio applications, e.g.
dpdk. I'm not sure whether there will be strong interest in enabling
IOPF for those applications. But if the answer is yes, then it's
inevitable to have such logic implemented in the iommu stack, given KVM
is not in the picture there.

With that, is it more reasonable to develop the IOPF support natively on
the iommu side, plus an optional notifier mechanism to sync with
KVM-induced host PTE installation as an optimization?