[RFC,00/42] Sharing KVM TDP to IOMMU

Message ID 20231202091211.13376-1-yan.y.zhao@intel.com

Message

Yan Zhao Dec. 2, 2023, 9:12 a.m. UTC
  This RFC series proposes a framework that shares the KVM TDP (Two
Dimensional Paging) page table with the IOMMU as its stage 2 paging
structure, so that IO page faults (IOPF) on the IOMMU's stage 2 paging
structure are resolved by KVM.

Previously, all guest pages have to be pinned and mapped in IOMMU stage 2
paging structures once a pass-through device is attached, even if the device
has IOPF capability. Such all-guest-memory pinning can be avoided when IOPF
handling for the stage 2 paging structure is supported and only IOPF-capable
devices are attached to a VM.

There are 2 approaches to support IOPF on IOMMU stage 2 paging structures:
- Supported by IOMMUFD/IOMMU alone
  IOMMUFD handles IO page faults on the stage-2 HWPT by calling GUP and then
  iommu_map() to set up IOVA mappings. (An IOAS is still required to keep the
  GPA to HVA info, but page pinning/unpinning needs to be skipped.)
  Then, upon MMU notifier events on the host primary MMU, iommu_unmap() is
  called to adjust IOVA mappings accordingly.
  The IOMMU driver needs to support unmapping sub-ranges of a previously
  mapped range and to take care of huge page merge and split in an atomic
  way. [1][2]

- Sharing KVM TDP
  IOMMUFD sets the root of the KVM TDP page table (EPT/NPT in x86) as the
  root of the IOMMU stage 2 paging structure, and routes IO page faults to
  KVM. (This assumes that the IOMMU hw supports the same stage-2 page table
  format as the CPU.)
  In this model the page table is centrally managed by KVM (mmu notifier,
  page mapping, subpage unmapping, atomic huge page split/merge, etc.),
  while IOMMUFD only needs to invalidate the iotlb/devtlb properly.

Currently, there's no upstream code available to support stage 2 IOPF yet.

This RFC implements the "Sharing KVM TDP" approach, which has the following
main benefits:

- Unified page table management
  The complexity of allocating guest pages per GPA, registering to the MMU
  notifier on the host primary MMU, sub-page unmapping, and atomic page
  merge/split only needs to be handled on the KVM side, which has been doing
  that well for a long time.

- Reduced page faults:
  Only one page fault is triggered for a single GPA, whether caused by IO
  access or by vCPU access (compared to one IO page fault for DMA plus one
  CPU page fault for vCPUs in the non-shared approach).

- Reduced memory consumption:
  The memory for one page table is saved.


Design
==
In this series, term "exported" is used in place of "shared" to avoid
confusion with terminology "shared EPT" in TDX.

The framework contains 3 main objects:

"KVM TDP FD" object - The interface of KVM to export TDP page tables.
                      With this object, KVM allows external components to
                      access a TDP page table exported by KVM.

"IOMMUFD KVM HWPT" object - A proxy connecting KVM TDP FD to IOMMU driver.
                            This HWPT has no IOAS associated.

"KVM domain" in IOMMU driver - Stage 2 domain in IOMMU driver whose paging
                               structures are managed by KVM.
                               Its hardware TLB invalidation requests are
                               notified from KVM via IOMMUFD KVM HWPT
                               object.
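
For reference, a rough sketch of the export/import contract assumed by the
rest of this letter is below. "struct kvm_tdp_fd" and the fault /
register_importer / invalidate operations are the names used above and in
the patches; the member layout and signatures here are illustrative only,
not the exact kAPI (see patches 01-08 for the real interface):

  /* Implemented by KVM; handed out together with the KVM TDP FD. */
  struct kvm_exported_tdp_ops {
          /* query meta data: root HPA, paging level, max address width, ... */
          int (*get_metadata)(struct kvm_tdp_fd *tdp_fd, void *meta, size_t len);

          /* resolve an IO page fault on @gfn in non-vCPU context */
          int (*fault)(struct kvm_tdp_fd *tdp_fd, u64 gfn, u32 flags);

          /* importer (un)registration for TLB flush notifications */
          int (*register_importer)(struct kvm_tdp_fd *tdp_fd,
                                   const struct kvm_tdp_importer_ops *ops,
                                   void *importer_data);
          void (*unregister_importer)(struct kvm_tdp_fd *tdp_fd,
                                      const struct kvm_tdp_importer_ops *ops);
  };

  /* Implemented by the importer (IOMMUFD KVM HWPT); invoked by KVM when it
   * flushes the exported TDP so the importer can invalidate iotlb/devtlb.
   * This struct name is made up here for illustration.
   */
  struct kvm_tdp_importer_ops {
          void (*invalidate)(void *importer_data, u64 start_gfn, u64 nr_pages);
  };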


                                               
                2.IOMMU_HWPT_ALLOC(fd)            1. KVM_CREATE_TDP_FD
                                       .------.
                        +--------------| QEMU |----------------------+
                        |              '------'<---+ fd              |
                        |                          |                 v
                        |                          |             .-------.
                        v                          |      create |  KVM  |
             .------------------.           .------------.<------'-------'
             | IOMMUFD KVM HWPT |           | KVM TDP FD |           |
             '------------------'           '------------'           |
                        |    kvm_tdp_fd_get(fd)    |                 |
                        |------------------------->|                 |
  IOMMU                 |                          |                 |
  driver    alloc(meta) |---------get meta-------->|                 |
.------------.<---------|                          |                 |
| KVM Domain |          |----register_importer---->|                 |
'------------'          |                          |                 |
  |                     |                          |                 |
  |   3.                |                          |                 |
  |----iopf handler---->|----------fault---------->|------map------->|
  |                     |                          |  4.             |
  |<-------invalidate---|<-------invalidate--------|<---TLB flush----|
  |                     |                          |                 |
  |<-----free-----------| 5.                       |                 |
                        |----unregister_importer-->|                 |
                        |                          |                 |
                        |------------------------->|                 |
                             kvm_tdp_fd_put()


1. QEMU calls KVM_CREATE_TDP_FD to create a TDP FD object.
   An address space must be specified to identify the exported TDP page table
   (e.g. system memory or SMM mode system memory in x86).

2. QEMU calls IOMMU_HWPT_ALLOC to create a KVM-type HWPT (a userspace sketch
   of this flow follows step 5).
   The KVM-type HWPT is created upon an exported KVM TDP FD (rather than
   upon an IOAS), acting as the proxy between KVM TDP and the IOMMU driver:
   - Obtain a reference on the exported KVM TDP FD.
   - Get and pass meta data of the KVM TDP page tables to the IOMMU driver
     for KVM domain allocation.
   - Register importer callbacks to KVM for invalidation notification.
   - Register an IOPF handler into the IOMMU's KVM domain.

   Upon device attachment, the root HPA of the exported TDP page table is
   installed into the IOMMU hardware.

3. When an IO page fault arrives, the IOMMUFD fault handler forwards the
   fault to KVM.

4. When KVM performs a TLB flush, it notifies all importers of the KVM TDP FD
   object. The IOMMUFD KVM HWPT, as an importer, passes the notification on
   to the IOMMU driver for hardware TLB invalidations.

5. On destruction of the IOMMUFD KVM HWPT, it frees the IOMMU's KVM domain,
   unregisters itself as an importer from the KVM TDP FD object, and puts its
   reference to the KVM TDP FD object.
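
As referenced in step 2, a minimal userspace sketch of steps 1 and 2 is
below. The layout of the KVM_CREATE_TDP_FD argument, the IOMMU_HWPT_ALLOC_KVM
flag name and passing the TDP FD via pt_id are assumptions for illustration;
the actual uAPI is defined in patches 3 and 12:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>
  #include <linux/iommufd.h>

  static int kvm_tdp_hwpt_alloc(int kvm_vm_fd, int iommufd, uint32_t dev_id)
  {
          /* Hypothetical argument layout for KVM_CREATE_TDP_FD. */
          struct { uint32_t as_id; uint32_t flags; } tdp = {
                  .as_id = 0,             /* address space 0: normal memory */
          };
          int tdp_fd;

          /* 1. Export the TDP of address space 0 as a file descriptor. */
          tdp_fd = ioctl(kvm_vm_fd, KVM_CREATE_TDP_FD, &tdp);
          if (tdp_fd < 0)
                  return -1;

          /* 2. Allocate a KVM-type HWPT on top of the TDP FD (no IOAS). */
          struct iommu_hwpt_alloc alloc = {
                  .size = sizeof(alloc),
                  .flags = IOMMU_HWPT_ALLOC_KVM,  /* assumed flag name */
                  .dev_id = dev_id,
                  .pt_id = tdp_fd,                /* TDP FD instead of an IOAS */
          };
          if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc))
                  return -1;

          /* The device is then attached to alloc.out_hwpt_id. */
          return alloc.out_hwpt_id;
  }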


Status
==
Current support of IOPF on the IOMMU stage 2 paging structure is verified with
Intel DSA devices on an Intel SPR platform. There's no vIOMMU for the guest,
and the Intel DSA devices run in-kernel DMA tests successfully with IOPFs
handled in the host.

- Nested translation in IOMMU is currently not supported.

- The QEMU code in IOMMUFD to create the KVM HWPT is just a temporary hack.
  As the KVM HWPT has no IOAS associated, the current QEMU code needs to be
  adapted to create a KVM HWPT without an IOAS and to ensure the address
  space is from GPA to HPA.

- DSA IOPF hack in the guest driver.
  Although the DSA hardware tolerates IOPF in all DMA paths, the DSA driver
  has the flexibility to turn off IOPF in certain paths.
  This RFC currently hacks the guest driver to always turn on IOPF.


Note
==
- KVM page write-tracking

  Unlike write-protection, which usually adds back the write permission upon
  a write fault and re-executes the faulting instruction, KVM page
  write-tracking keeps the write permission disabled for the tracked pages
  and instead always emulates the faulting instruction upon fault.
  There is no way to emulate a faulting DMA request, so IOPF and KVM page
  write-tracking are incompatible.

  This RFC does not handle the conflict, given that write-tracking is so far
  only applied to guest page table pages, which are unlikely to be used as
  DMA buffers.

- IOMMU page-walk coherency

  This is about whether the IOMMU hardware snoops the processor cache when
  walking the I/O paging structures. If the IOMMU page-walk is non-coherent,
  software needs to do clflush after changing the I/O paging structures.

  Supporting non-coherent IOMMU page-walk adds an extra burden (i.e. clflush)
  to the KVM mmu in this shared model, which we don't plan to support.
  Fortunately most Intel platforms do support coherent page-walk in the
  IOMMU, so this exception should not be a big issue.

- Non-coherent DMA

  Non-coherent DMA requires the KVM mmu to align the effective memory type
  with the guest memory type (CR0.CD, vPAT, vMTRR) instead of forcing all
  guest memory to be WB. It further adds complexity to the fault handler,
  which then needs to check the guest memory type, and that requires a vCPU
  context.

  There is certainly no vCPU context in an IO page fault, so this RFC doesn't
  support devices which cannot be enforced to do coherent DMA.

  If there is interest in supporting non-coherent DMA in this shared model,
  there is a discussion about removing the vMTRR handling from the KVM page
  fault handler [3], hence it's also possible to further remove the vCPU
  context there.

- Enforce DMA cache coherency

  This design requires the IOMMU to support a configuration forcing all DMAs
  to be coherent (even if the PCIe request from the device sets the no-snoop
  bit), due to the aforementioned reason.

  The control of enforcing cache coherency could be per-IOPT or per-page,
  e.g. Intel VT-d defines a per-page format (bit 11 in the PTE is the
  enforce-snoop bit) in legacy mode and a per-IOPT format (a control bit in
  the PASID entry) in scalable mode.

  Supporting the per-page format requires the KVM mmu to disable any software
  use of bit 11 and also to provide additional ops for on-demand
  set/clear-snp requests from iommufd. It's complex and dirty.

  Therefore the per-IOPT scheme is assumed in this design. For the Intel
  IOMMU, the scalable mode is the default mode for all new IOMMU features
  (nested translation, pasid, etc.) anyway.


- About device which partially supports IOPF

  Many devices claiming PCIe PRS capability actually only tolerate IOPF in
  certain paths (e.g. DMA paths for SVM applications, but not for non-SVM
  applications or driver data such as ring descriptors). But the PRS
  capability doesn't include a bit to tell whether a device 100% tolerates
  IOPF in all DMA paths.

  This creates trouble for how the userspace driver framework (e.g. VFIO)
  knows that a device with PRS can really avoid static-pinning of the entire
  guest memory, and for how it reports such knowledge to the VMM.

  A simple way is to track an allowed list of devices which are known 100%
  IOPF-friendly in VFIO. Another option is to extend PCIe spec to allow
  device reporting whether it fully or partially supports IOPF in the PRS
  capability.

  Another interesting option is to explore supporting partial-IOPF in this
  sharing model:
  * Create a VFIO variant driver to intercept guest operations that register
    non-faultable memory to the device, and call KVM TDP ops to request
    on-demand pinning of the trapped memory pages in the KVM mmu (see the
    sketch after this list). This allows the VMM to start with zero pinning,
    as for a 100%-faultable device, with on-demand pinning initiated by the
    variant driver.

  * Supporting on-demand pinning in the KVM mmu however requires non-trivial
    effort. Besides introducing logic to pin pages long term and manage the
    list of pinned GFNs, more care is required to avoid breaking the
    implications of page pinning, e.g.:

      a. PTE updates in a pinned GFN range must be atomic, otherwise an
         in-flight DMA might be broken

      b. PTE zap in a pinned GFN range is allowed only when the related
         memory slot is removed (indicating guest won't use it for DMA).
         The PTE zap for the affected range must be either disabled or
         replaced by an atomic update.

      c. any feature related to write-protecting the pinned GFN range is not
         allowed. This implies live migration is also broken in its current
         form, as it starts with write-protection even when TDP dirty bit
         tracking is enabled. Supporting on-demand pinning then requires a
         less efficient approach of always walking the TDP dirty bits instead
         of using write-protection. Or, we may enhance the live migration
         code to always treat pinned ranges as dirty.

      d. Auto NUMA balance also needs to be disabled. [4]

  If the above trickiness can be resolved cleanly, this sharing model could
  also support a non-faultable device in theory, by pinning/unpinning guest
  memory on slot addition/removal.
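
  As referenced above, here is a purely hypothetical sketch of what such
  on-demand pin ops could look like if they were added to the exported TDP
  interface. The names and signatures are invented for illustration; nothing
  like this exists in the series:

  /* Hypothetical additions to the exported TDP ops, not part of this RFC. */
  struct kvm_exported_tdp_pin_ops {
          /*
           * Map and long-term pin @nr_pages starting at @gfn; called by a
           * VFIO variant driver when the guest registers non-faultable
           * memory (e.g. ring descriptors) with the device.
           */
          int (*pin_gfn_range)(struct kvm_tdp_fd *tdp_fd, u64 gfn,
                               u64 nr_pages);
          /* Drop the pin when the guest unregisters the memory. */
          void (*unpin_gfn_range)(struct kvm_tdp_fd *tdp_fd, u64 gfn,
                                  u64 nr_pages);
  };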


- How to map MSI page on arm platform demands discussions.


Patches layout
==
[01-08]: Skeleton implementation of KVM's TDP FD object.
         Patch 1 and 2 are for public and arch specific headers.
         Patch 4's commit message outlines overall data structure hierarchy
                 on x86 for preview. 

[09-23]: IOMMU, IOMMUFD and Intel vt-d.
       - 09-11: IOMMU core part
       - 12-16: IOMMUFD part
                Patch 13 is the main patch in IOMMUFD to implement KVM
                HWPT.
       - 17-23: Intel vt-d part for KVM domain
                Patch 18 is the main patch to implement KVM domain.

[24-42]: KVM x86 and VMX part
       - 24-34: KVM x86 preparation patches. 
                Patch 24: Let KVM reserve bit 11 since bit 11 is
                          reserved as 0 on the IOMMU side.
                Patch 25: Abstract "struct kvm_mmu_common" from
                          "struct kvm_mmu" for "kvm_exported_tdp_mmu"
                Patches 26~34: Prepare for page fault in non-vCPU context.

       - 35-38: Core part in KVM x86
                Patch 35: X86 MMU core part to show how exported TDP root
                          page is shared between KVM external components
                          and vCPUs.
                Patch 37: TDP FD fault op implementation

       - 39-42: KVM VMX part for meta data composing and tlb flush
                notification.


Code base
==
The code base is commit b85ea95d08647 ("Linux 6.7-rc1") +
Yi Liu's v7 series "Add Intel VT-d nested translation (part 2/2)" [5] +
Baolu's v7 series "iommu: Prepare to deliver page faults to user space" [6]

Complete code can be found at [7], QEMU can be found at [8], and the guest
test script and workaround patch are at [9].

[1] https://lore.kernel.org/all/20230814121016.32613-1-jijie.ji@linux.alibaba.com/
[2] https://lore.kernel.org/all/BN9PR11MB5276D897431C7E1399EFFF338C14A@BN9PR11MB5276.namprd11.prod.outlook.com/
[3] https://lore.kernel.org/all/ZUAC0jvFE0auohL4@google.com/
[4] https://lore.kernel.org/all/4cb536f6-2609-4e3e-b996-4a613c9844ad@nvidia.com/
[5] https://lore.kernel.org/linux-iommu/20231117131816.24359-1-yi.l.liu@intel.com/
[6] https://lore.kernel.org/linux-iommu/20231115030226.16700-1-baolu.lu@linux.intel.com/
[7] https://github.com/yanhwizhao/linux_kernel/tree/sharept_iopt
[8] https://github.com/yanhwizhao/qemu/tree/sharept_iopf 
[9] https://github.com/yanhwizhao/misc/tree/master


Yan Zhao (42):
  KVM: Public header for KVM to export TDP
  KVM: x86: Arch header for kvm to export TDP for Intel
  KVM: Introduce VM ioctl KVM_CREATE_TDP_FD
  KVM: Skeleton of KVM TDP FD object
  KVM: Embed "arch" object and call arch init/destroy in TDP FD
  KVM: Register/Unregister importers to KVM exported TDP
  KVM: Forward page fault requests to arch specific code for exported
    TDP
  KVM: Add a helper to notify importers that KVM exported TDP is flushed
  iommu: Add IOMMU_DOMAIN_KVM
  iommu: Add new iommu op to create domains managed by KVM
  iommu: Add new domain op cache_invalidate_kvm
  iommufd: Introduce allocation data info and flag for KVM managed HWPT
  iommufd: Add a KVM HW pagetable object
  iommufd: Enable KVM HW page table object to be proxy between KVM and
    IOMMU
  iommufd: Add iopf handler to KVM hw pagetable
  iommufd: Enable device feature IOPF during device attachment to KVM
    HWPT
  iommu/vt-d: Make some macros and helpers to be extern
  iommu/vt-d: Support of IOMMU_DOMAIN_KVM domain in Intel IOMMU
  iommu/vt-d: Set bit PGSNP in PASIDTE if domain cache coherency is
    enforced
  iommu/vt-d: Support attach devices to IOMMU_DOMAIN_KVM domain
  iommu/vt-d: Check reserved bits for IOMMU_DOMAIN_KVM domain
  iommu/vt-d: Support cache invalidate of IOMMU_DOMAIN_KVM domain
  iommu/vt-d: Allow pasid 0 in IOPF
  KVM: x86/mmu: Move bit SPTE_MMU_PRESENT from bit 11 to bit 59
  KVM: x86/mmu: Abstract "struct kvm_mmu_common" from "struct kvm_mmu"
  KVM: x86/mmu: introduce new op get_default_mt_mask to kvm_x86_ops
  KVM: x86/mmu: change param "vcpu" to "kvm" in
    kvm_mmu_hugepage_adjust()
  KVM: x86/mmu: change "vcpu" to "kvm" in page_fault_handle_page_track()
  KVM: x86/mmu: remove param "vcpu" from kvm_mmu_get_tdp_level()
  KVM: x86/mmu: remove param "vcpu" from
    kvm_calc_tdp_mmu_root_page_role()
  KVM: x86/mmu: add extra param "kvm" to kvm_faultin_pfn()
  KVM: x86/mmu: add extra param "kvm" to make_mmio_spte()
  KVM: x86/mmu: add extra param "kvm" to make_spte()
  KVM: x86/mmu: add extra param "kvm" to
    tdp_mmu_map_handle_target_level()
  KVM: x86/mmu: Get/Put TDP root page to be exported
  KVM: x86/mmu: Keep exported TDP root valid
  KVM: x86: Implement KVM exported TDP fault handler on x86
  KVM: x86: "compose" and "get" interface for meta data of exported TDP
  KVM: VMX: add config KVM_INTEL_EXPORTED_EPT
  KVM: VMX: Compose VMX specific meta data for KVM exported TDP
  KVM: VMX: Implement ops .flush_remote_tlbs* in VMX when EPT is on
  KVM: VMX: Notify importers of exported TDP to flush TLBs on KVM
    flushes EPT

 arch/x86/include/asm/kvm-x86-ops.h       |   4 +
 arch/x86/include/asm/kvm_exported_tdp.h  |  43 +++
 arch/x86/include/asm/kvm_host.h          |  48 ++-
 arch/x86/kvm/Kconfig                     |  13 +
 arch/x86/kvm/mmu.h                       |  12 +-
 arch/x86/kvm/mmu/mmu.c                   | 434 +++++++++++++++++------
 arch/x86/kvm/mmu/mmu_internal.h          |   8 +-
 arch/x86/kvm/mmu/paging_tmpl.h           |  15 +-
 arch/x86/kvm/mmu/spte.c                  |  31 +-
 arch/x86/kvm/mmu/spte.h                  |  82 ++++-
 arch/x86/kvm/mmu/tdp_mmu.c               | 209 +++++++++--
 arch/x86/kvm/mmu/tdp_mmu.h               |   9 +
 arch/x86/kvm/svm/svm.c                   |   2 +-
 arch/x86/kvm/vmx/nested.c                |   2 +-
 arch/x86/kvm/vmx/vmx.c                   |  56 ++-
 arch/x86/kvm/x86.c                       |  68 +++-
 drivers/iommu/intel/Kconfig              |   9 +
 drivers/iommu/intel/Makefile             |   1 +
 drivers/iommu/intel/iommu.c              |  68 ++--
 drivers/iommu/intel/iommu.h              |  47 +++
 drivers/iommu/intel/kvm.c                | 185 ++++++++++
 drivers/iommu/intel/pasid.c              |   3 +-
 drivers/iommu/intel/svm.c                |  37 +-
 drivers/iommu/iommufd/Kconfig            |  10 +
 drivers/iommu/iommufd/Makefile           |   1 +
 drivers/iommu/iommufd/device.c           |  31 +-
 drivers/iommu/iommufd/hw_pagetable.c     |  29 +-
 drivers/iommu/iommufd/hw_pagetable_kvm.c | 270 ++++++++++++++
 drivers/iommu/iommufd/iommufd_private.h  |  44 +++
 drivers/iommu/iommufd/main.c             |   4 +
 include/linux/iommu.h                    |  18 +
 include/linux/kvm_host.h                 |  58 +++
 include/linux/kvm_tdp_fd.h               | 137 +++++++
 include/linux/kvm_types.h                |  12 +
 include/uapi/linux/iommufd.h             |  15 +
 include/uapi/linux/kvm.h                 |  19 +
 virt/kvm/Kconfig                         |   6 +
 virt/kvm/Makefile.kvm                    |   1 +
 virt/kvm/kvm_main.c                      |  24 ++
 virt/kvm/tdp_fd.c                        | 344 ++++++++++++++++++
 virt/kvm/tdp_fd.h                        |  15 +
 41 files changed, 2177 insertions(+), 247 deletions(-)
 create mode 100644 arch/x86/include/asm/kvm_exported_tdp.h
 create mode 100644 drivers/iommu/intel/kvm.c
 create mode 100644 drivers/iommu/iommufd/hw_pagetable_kvm.c
 create mode 100644 include/linux/kvm_tdp_fd.h
 create mode 100644 virt/kvm/tdp_fd.c
 create mode 100644 virt/kvm/tdp_fd.h
  

Comments

Jason Gunthorpe Dec. 4, 2023, 3:08 p.m. UTC | #1
On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> In this series, term "exported" is used in place of "shared" to avoid
> confusion with terminology "shared EPT" in TDX.
> 
> The framework contains 3 main objects:
> 
> "KVM TDP FD" object - The interface of KVM to export TDP page tables.
>                       With this object, KVM allows external components to
>                       access a TDP page table exported by KVM.

I don't know much about the internals of kvm, but why have this extra
user visible piece? Isn't there only one "TDP" per kvm fd? Why not
just use the KVM FD as a handle for the TDP?

> "IOMMUFD KVM HWPT" object - A proxy connecting KVM TDP FD to IOMMU driver.
>                             This HWPT has no IOAS associated.
> 
> "KVM domain" in IOMMU driver - Stage 2 domain in IOMMU driver whose paging
>                                structures are managed by KVM.
>                                Its hardware TLB invalidation requests are
>                                notified from KVM via IOMMUFD KVM HWPT
>                                object.

This seems broadly the right direction

> - About device which partially supports IOPF
> 
>   Many devices claiming PCIe PRS capability actually only tolerate IOPF in
>   certain paths (e.g. DMA paths for SVM applications, but not for non-SVM
>   applications or driver data such as ring descriptors). But the PRS
>   capability doesn't include a bit to tell whether a device 100% tolerates
>   IOPF in all DMA paths.

The lack of tolerance for truely DMA pinned guest memory is a
significant problem for any real deployment, IMHO. I am aware of no
device that can handle PRI on every single DMA path. :(

>   A simple way is to track an allowed list of devices which are known 100%
>   IOPF-friendly in VFIO. Another option is to extend PCIe spec to allow
>   device reporting whether it fully or partially supports IOPF in the PRS
>   capability.

I think we need something like this.

> - How to map MSI page on arm platform demands discussions.

Yes, the recurring problem :(

Probably the same approach as nesting would work for a hack - map the
ITS page into the fixed reserved slot and tell the guest not to touch
it and to identity map it.

Jason
  
Sean Christopherson Dec. 4, 2023, 4:38 p.m. UTC | #2
On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > In this series, term "exported" is used in place of "shared" to avoid
> > confusion with terminology "shared EPT" in TDX.
> > 
> > The framework contains 3 main objects:
> > 
> > "KVM TDP FD" object - The interface of KVM to export TDP page tables.
> >                       With this object, KVM allows external components to
> >                       access a TDP page table exported by KVM.
> 
> I don't know much about the internals of kvm, but why have this extra
> user visible piece?

That I don't know, I haven't looked at the gory details of this RFC.

> Isn't there only one "TDP" per kvm fd?

No.  In steady state, with TDP (EPT) enabled and assuming homogeneous capabilities
across all vCPUs, KVM will have 3+ sets of TDP page tables *active* at any given time:

  1. "Normal"
  2. SMM
  3-N. Guest (for L2, i.e. nested, VMs)

The number of possible TDP page tables used for nested VMs is well bounded, but
since devices obviously can't be nested VMs, I won't bother trying to explain
the various possibilities (nested NPT on AMD is downright ridiculous).

Nested virtualization aside, devices are obviously not capable of running in SMM
and so they all need to use the "normal" page tables.

I highlighted "active" above because if _any_ memslot is deleted, KVM will invalidate
*all* existing page tables and rebuild new page tables as needed.  So over the
lifetime of a VM, KVM could theoretically use an infinite number of page tables.
  
Sean Christopherson Dec. 4, 2023, 5 p.m. UTC | #3
On Sat, Dec 02, 2023, Yan Zhao wrote:
> This RFC series proposes a framework to resolve IOPF by sharing KVM TDP
> (Two Dimensional Paging) page table to IOMMU as its stage 2 paging
> structure to support IOPF (IO page fault) on IOMMU's stage 2 paging
> structure.
> 
> Previously, all guest pages have to be pinned and mapped in IOMMU stage 2 
> paging structures after pass-through devices attached, even if the device
> has IOPF capability. Such all-guest-memory pinning can be avoided when IOPF
> handling for stage 2 paging structure is supported and if there are only
> IOPF-capable devices attached to a VM.
> 
> There are 2 approaches to support IOPF on IOMMU stage 2 paging structures:
> - Supporting by IOMMUFD/IOMMU alone
>   IOMMUFD handles IO page faults on stage-2 HWPT by calling GUPs and then
>   iommu_map() to setup IOVA mappings. (IOAS is required to keep info of GPA
>   to HVA, but page pinning/unpinning needs to be skipped.)
>   Then upon MMU notifiers on host primary MMU, iommu_unmap() is called to
>   adjust IOVA mappings accordingly.
>   IOMMU driver needs to support unmapping sub-ranges of a previous mapped
>   range and take care of huge page merge and split in atomic way. [1][2].
> 
> - Sharing KVM TDP
>   IOMMUFD sets the root of KVM TDP page table (EPT/NPT in x86) as the root
>   of IOMMU stage 2 paging structure, and routes IO page faults to KVM.
>   (This assumes that the iommu hw supports the same stage-2 page table
>   format as CPU.)
>   In this model the page table is centrally managed by KVM (mmu notifier,
>   page mapping, subpage unmapping, atomic huge page split/merge, etc.),
>   while IOMMUFD only needs to invalidate iotlb/devtlb properly.

There are more approaches beyond having IOMMUFD and KVM be completely separate
entities.  E.g. extract the bulk of KVM's "TDP MMU" implementation to common code
so that IOMMUFD doesn't need to reinvent the wheel.

> Currently, there's no upstream code available to support stage 2 IOPF yet.
> 
> This RFC chooses to implement "Sharing KVM TDP" approach which has below
> main benefits:

Please list out the pros and cons for each.  In the cons column for piggybacking
KVM's page tables:

 - *Significantly* increases the complexity in KVM
 - Puts constraints on what KVM can/can't do in the future (see the movement
   of SPTE_MMU_PRESENT).
 - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion
   mess, the truly nasty MTRR emulation (which I still hope to delete), the NX
   hugepage mitigation, etc.

Please also explain the intended/expected/targeted use cases.  E.g. if the main
use case is for device passthrough to slice-of-hardware VMs that aren't memory
oversubscribed, 

> - Unified page table management
>   The complexity of allocating guest pages per GPAs, registering to MMU
>   notifier on host primary MMU, sub-page unmapping, atomic page merge/split

Please find different terminology than "sub-page".  With Sub-Page Protection, Intel
has more or less established "sub-page" to mean "less than 4KiB granularity".  But
that can't possibly what you mean here because KVM doesn't support (un)mapping
memory at <4KiB granularity.  Based on context above, I assume you mean "unmapping
arbitrary pages within a given range".

>   are only required to by handled in KVM side, which has been doing that
>   well for a long time.
> 
> - Reduced page faults:
>   Only one page fault is triggered on a single GPA, either caused by IO
>   access or by vCPU access. (compared to one IO page fault for DMA and one
>   CPU page fault for vCPUs in the non-shared approach.)

This would be relatively easy to solve with bi-directional notifiers, i.e. KVM
notifies IOMMUFD when a vCPU faults in a page, and vice versa.
 
> - Reduced memory consumption:
>   Memory of one page table are saved.

I'm not convinced that memory consumption is all that interesting.  If a VM is
mapping the majority of memory into a device, then odds are good that the guest
is backed with at least 2MiB page, if not 1GiB pages, at which point the memory
overhead for pages tables is quite small, especially relative to the total amount
of memory overheads for such systems.

If a VM is mapping only a small subset of its memory into devices, then the IOMMU
page tables should be sparsely populated, i.e. won't consume much memory.
  
Jason Gunthorpe Dec. 4, 2023, 5:30 p.m. UTC | #4
On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:

> There are more approaches beyond having IOMMUFD and KVM be
> completely separate entities.  E.g. extract the bulk of KVM's "TDP
> MMU" implementation to common code so that IOMMUFD doesn't need to
> reinvent the wheel.

We've pretty much done this already, it is called "hmm" and it is what
the IO world uses. Merging/splitting huge page is just something that
needs some coding in the page table code, that people want for other
reasons anyhow.

> - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion
>   mess, the truly nasty MTRR emulation (which I still hope to delete), the NX
>   hugepage mitigation, etc.

Does it? I think that just remains isolated in kvm. The output from
KVM is only a radix table top pointer, it is up to KVM how to manage
it still.

> I'm not convinced that memory consumption is all that interesting.  If a VM is
> mapping the majority of memory into a device, then odds are good that the guest
> is backed with at least 2MiB page, if not 1GiB pages, at which point the memory
> overhead for pages tables is quite small, especially relative to the total amount
> of memory overheads for such systems.

AFAIK the main argument is performance. It is similar to why we want
to do IOMMU SVA with MM page table sharing.

If IOMMU mirrors/shadows/copies a page table using something like HMM
techniques then the invalidations will mark ranges of IOVA as
non-present and faults will occur to trigger hmm_range_fault to do the
shadowing.

This means that pretty much all IO will always encounter a non-present
fault, certainly at the start and maybe worse while ongoing.

On the other hand, if we share the exact page table then natural CPU
touches will usually make the page present before an IO happens in
almost all cases and we don't have to take the horribly expensive IO
page fault at all.

We were not able to make bi-dir notifiers with the CPU mm, I'm
not sure that is "relatively easy" :(

Jason
  
Sean Christopherson Dec. 4, 2023, 7:22 p.m. UTC | #5
On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> 
> > There are more approaches beyond having IOMMUFD and KVM be
> > completely separate entities.  E.g. extract the bulk of KVM's "TDP
> > MMU" implementation to common code so that IOMMUFD doesn't need to
> > reinvent the wheel.
> 
> We've pretty much done this already, it is called "hmm" and it is what
> the IO world uses. Merging/splitting huge page is just something that
> needs some coding in the page table code, that people want for other
> reasons anyhow.

Not really.  HMM is a wildly different implementation than KVM's TDP MMU.  At a
glance, HMM is basically a variation on the primary MMU, e.g. deals with VMAs,
runs under mmap_lock (or per-VMA locks?), and faults memory into the primary MMU
while walking the "secondary" HMM page tables.

KVM's TDP MMU (and all of KVM's flavors of MMUs) is much more of a pure secondary
MMU.  The core of a KVM MMU maps GFNs to PFNs, the intermediate steps that involve
the primary MMU are largely orthogonal.  E.g. getting a PFN from guest_memfd
instead of the primary MMU essentially boils down to invoking kvm_gmem_get_pfn()
instead of __gfn_to_pfn_memslot(), the MMU proper doesn't care how the PFN was
resolved.  I.e. 99% of KVM's MMU logic has no interaction with the primary MMU.

> > - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion
> >   mess, the truly nasty MTRR emulation (which I still hope to delete), the NX
> >   hugepage mitigation, etc.
> 
> Does it? I think that just remains isolated in kvm. The output from
> KVM is only a radix table top pointer, it is up to KVM how to manage
> it still.

Oh, I didn't mean from a code perspective, I meant from a behaviorial perspective.
E.g. there's no reason to disallow huge mappings in the IOMMU because the CPU is
vulnerable to the iTLB multi-hit mitigation.

> > I'm not convinced that memory consumption is all that interesting.  If a VM is
> > mapping the majority of memory into a device, then odds are good that the guest
> > is backed with at least 2MiB page, if not 1GiB pages, at which point the memory
> > overhead for pages tables is quite small, especially relative to the total amount
> > of memory overheads for such systems.
> 
> AFAIK the main argument is performance. It is similar to why we want
> to do IOMMU SVA with MM page table sharing.
> 
> If IOMMU mirrors/shadows/copies a page table using something like HMM
> techniques then the invalidations will mark ranges of IOVA as
> non-present and faults will occur to trigger hmm_range_fault to do the
> shadowing.
>
> This means that pretty much all IO will always encounter a non-present
> fault, certainly at the start and maybe worse while ongoing.
> 
> On the other hand, if we share the exact page table then natural CPU
> touches will usually make the page present before an IO happens in
> almost all cases and we don't have to take the horribly expensive IO
> page fault at all.

I'm not advocating mirroring/copying/shadowing page tables between KVM and the
IOMMU.  I'm suggesting managing IOMMU page tables mostly independently, but reusing
KVM code to do so.

I wouldn't even be opposed to KVM outright managing the IOMMU's page tables.  E.g.
add an "iommu" flag to "union kvm_mmu_page_role" and then the implementation looks
rather similar to this series.

What terrifies me is sharing page tables between the CPU and the IOMMU verbatim.

Yes, sharing page tables will Just Work for faulting in memory, but the downside
is that _when_, not if, KVM modifies PTEs for whatever reason, those modifications
will also impact the IO path.  My understanding is that IO page faults are at least
an order of magnitude more expensive than CPU page faults.  That means that what's
optimal for CPU page tables may not be optimal, or even _viable_, for IOMMU page
tables.

E.g. based on our conversation at LPC, write-protecting guest memory to do dirty
logging is not a viable option for the IOMMU because the latency of the resulting
IOPF is too high.  Forcing KVM to use D-bit dirty logging for CPUs just because
the VM has passthrough (mediated?) devices would be likely a non-starter.

One of my biggest concerns with sharing page tables between KVM and IOMMUs is that
we will end up having to revert/reject changes that benefit KVM's usage due to
regressing the IOMMU usage.

If instead KVM treats IOMMU page tables as their own thing, then we can have
divergent behavior as needed, e.g. different dirty logging algorithms, different
software-available bits, etc.  It would also allow us to define new ABI instead
of trying to reconcile the many incompatibilies and warts in KVM's existing ABI.
E.g. off the top of my head:

 - The virtual APIC page shouldn't be visible to devices, as it's not "real" guest
   memory.

 - Access tracking, i.e. page aging, by making PTEs !PRESENT because the CPU
   doesn't support A/D bits or because the admin turned them off via KVM's
   enable_ept_ad_bits module param.

 - Write-protecting GFNs for shadow paging when L1 is running nested VMs.  KVM's
   ABI can be that device writes to L1's page tables are exempt.

 - KVM can exempt IOMMU page tables from KVM's awful "drop all page tables if
   any memslot is deleted" ABI.

> We were not able to make bi-dir notifiers with the CPU mm, I'm
> not sure that is "relatively easy" :(

I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget
notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the
same".

It wouldn't even necessarily need to be a notifier per se, e.g. if we taught KVM
to manage IOMMU page tables, then KVM could simply install mappings for multiple
sets of page tables as appropriate.
  
Jason Gunthorpe Dec. 4, 2023, 7:50 p.m. UTC | #6
On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> > 
> > > There are more approaches beyond having IOMMUFD and KVM be
> > > completely separate entities.  E.g. extract the bulk of KVM's "TDP
> > > MMU" implementation to common code so that IOMMUFD doesn't need to
> > > reinvent the wheel.
> > 
> > We've pretty much done this already, it is called "hmm" and it is what
> > the IO world uses. Merging/splitting huge page is just something that
> > needs some coding in the page table code, that people want for other
> > reasons anyhow.
> 
> Not really.  HMM is a wildly different implementation than KVM's TDP MMU.  At a
> glance, HMM is basically a variation on the primary MMU, e.g. deals with VMAs,
> runs under mmap_lock (or per-VMA locks?), and faults memory into the primary MMU
> while walking the "secondary" HMM page tables.

hmm supports the essential idea of shadowing parts of the primary
MMU. This is a big chunk of what kvm is doing, just differently.

> KVM's TDP MMU (and all of KVM's flavors of MMUs) is much more of a pure secondary
> MMU.  The core of a KVM MMU maps GFNs to PFNs, the intermediate steps that involve
> the primary MMU are largely orthogonal.  E.g. getting a PFN from guest_memfd
> instead of the primary MMU essentially boils down to invoking kvm_gmem_get_pfn()
> instead of __gfn_to_pfn_memslot(), the MMU proper doesn't care how the PFN was
> resolved.  I.e. 99% of KVM's MMU logic has no interaction with the primary MMU.

Hopefully the memfd stuff will be generalized so we can use it in
iommufd too, without relying on kvm. At least the first basic stuff
should be doable fairly soon.

> I'm not advocating mirroring/copying/shadowing page tables between KVM and the
> IOMMU.  I'm suggesting managing IOMMU page tables mostly independently, but reusing
> KVM code to do so.

I guess from my POV, if KVM has two copies of the logically same radix
tree then that is fine too.

> Yes, sharing page tables will Just Work for faulting in memory, but the downside
> is that _when_, not if, KVM modifies PTEs for whatever reason, those modifications
> will also impact the IO path.  My understanding is that IO page faults are at least
> an order of magnitude more expensive than CPU page faults.  That means that what's
> optimal for CPU page tables may not be optimal, or even _viable_, for IOMMU page
> tables.

Yes, you wouldn't want to do some of the same KVM techniques today in
a shared mode.
 
> E.g. based on our conversation at LPC, write-protecting guest memory to do dirty
> logging is not a viable option for the IOMMU because the latency of the resulting
> IOPF is too high.  Forcing KVM to use D-bit dirty logging for CPUs just because
> the VM has passthrough (mediated?) devices would be likely a
> non-starter.

Yes

> One of my biggest concerns with sharing page tables between KVM and IOMMUs is that
> we will end up having to revert/reject changes that benefit KVM's usage due to
> regressing the IOMMU usage.

It is certainly a strong argument

> I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget
> notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the
> same".

If we say the only thing this works with is the memfd version of KVM,
could we design the memfd stuff to not have the same challenges with
mirroring as normal VMAs? 

> It wouldn't even necessarily need to be a notifier per se, e.g. if we taught KVM
> to manage IOMMU page tables, then KVM could simply install mappings for multiple
> sets of page tables as appropriate.

This somehow feels more achievable to me since KVM already has all the
code to handle multiple TDPs, having two parallel ones is probably
much easier than trying to weld KVM to a different page table
implementation through some kind of loose coupled notifier.

Jason
  
Sean Christopherson Dec. 4, 2023, 8:11 p.m. UTC | #7
On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> > I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget
> > notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the
> > same".
> 
> If we say the only thing this works with is the memfd version of KVM,

That's likely a big "if", as guest_memfd is not and will not be a wholesale
replacement of VMA-based guest memory, at least not in the foreseeable future.
I would be quite surprised if the target use cases for this could be moved to
guest_memfd without losing required functionality.

> could we design the memfd stuff to not have the same challenges with
> mirroring as normal VMAs? 

What challenges in particular are you concerned about?  And maybe also define
"mirroring"?  E.g. ensuring that the CPU and IOMMU page tables are synchronized
is very different than ensuring that the IOMMU page tables can only map memory
that is mappable by the guest, i.e. that KVM can map into the CPU page tables.
  
Jason Gunthorpe Dec. 4, 2023, 11:49 p.m. UTC | #8
On Mon, Dec 04, 2023 at 12:11:46PM -0800, Sean Christopherson wrote:

> > could we design the memfd stuff to not have the same challenges with
> > mirroring as normal VMAs? 
> 
> What challenges in particular are you concerned about?  And maybe also define
> "mirroring"?  E.g. ensuring that the CPU and IOMMU page tables are synchronized
> is very different than ensuring that the IOMMU page tables can only map memory
> that is mappable by the guest, i.e. that KVM can map into the CPU page tables.

IIRC, it has been awhile, it is difficult to get a new populated PTE
out of the MM side and into an hmm user and get all the invalidation
locking to work as well. Especially when the devices want to do
sleeping invalidations.

kvm doesn't solve this problem either, but pushing populated TDP PTEs
to another observer may be simpler, as perhaps would pushing populated
memfd pages or something like that?

"mirroring" here would simply mean that if the CPU side has a
populated page then the hmm side copying it would also have a
populated page. Instead of a fault on use model.

Jason
  
Yan Zhao Dec. 5, 2023, 1:31 a.m. UTC | #9
On Mon, Dec 04, 2023 at 08:38:17AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > > In this series, term "exported" is used in place of "shared" to avoid
> > > confusion with terminology "shared EPT" in TDX.
> > > 
> > > The framework contains 3 main objects:
> > > 
> > > "KVM TDP FD" object - The interface of KVM to export TDP page tables.
> > >                       With this object, KVM allows external components to
> > >                       access a TDP page table exported by KVM.
> > 
> > I don't know much about the internals of kvm, but why have this extra
> > user visible piece?
> 
> That I don't know, I haven't looked at the gory details of this RFC.
> 
> > Isn't there only one "TDP" per kvm fd?
> 
> No.  In steady state, with TDP (EPT) enabled and assuming homogeneous capabilities
> across all vCPUs, KVM will have 3+ sets of TDP page tables *active* at any given time:
> 
>   1. "Normal"
>   2. SMM
>   3-N. Guest (for L2, i.e. nested, VMs)
Yes, the reason to introduce the KVM TDP FD is to let KVM know which TDP the
user wants to export (share).

For as_id=0 (which is currently the only supported as_id to share), a TDP with
smm=0, guest_mode=0 will be chosen.

Upon receiving the KVM_CREATE_TDP_FD ioctl, KVM will try to find an existing
TDP root with the role specified by as_id 0. If an existing TDP with the target
role is found, KVM will just export that one; if none is found, KVM will create
a new TDP root in non-vCPU context.
Then, KVM will mark the exported TDP as "exported".


                                         tdp_mmu_roots                           
                                             |                                   
 role | smm | guest_mode              +------+-----------+----------+            
------|-----------------              |      |           |          |            
  0   |  0  |  0 ==> address space 0  |      v           v          v            
  1   |  1  |  0                      |  .--------.  .--------. .--------.       
  2   |  0  |  1                      |  |  root  |  |  root  | |  root  |       
  3   |  1  |  1                      |  |(role 1)|  |(role 2)| |(role 3)|       
                                      |  '--------'  '--------' '--------'       
                                      |      ^                                   
                                      |      |    create or get   .------.       
                                      |      +--------------------| vCPU |       
                                      |              fault        '------'       
                                      |                            smm=1         
                                      |                       guest_mode=0       
                                      |                                          
          (set root as exported)      v                                          
.--------.    create or get   .---------------.  create or get   .------.        
| TDP FD |------------------->| root (role 0) |<-----------------| vCPU |        
'--------'        fault       '---------------'     fault        '------'        
                                      .                            smm=0         
                                      .                       guest_mode=0       
                                      .                                          
                 non-vCPU context <---|---> vCPU context                         
                                      .                                          
                                      .                       

No matter whether the TDP is exported or not, vCPUs just load TDP roots
according to their vCPU modes.
In this way, KVM is able to share the TDP of KVM address space 0 with the
IOMMU side.
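
The root selection for KVM_CREATE_TDP_FD is roughly like below (simplified;
the helper name is not the actual function name in the patches, only the role
fields of "union kvm_mmu_page_role" are real):

/* Pick the role of the TDP root to export for @as_id (illustrative). */
static union kvm_mmu_page_role exported_tdp_root_role(struct kvm *kvm,
                                                      int as_id)
{
        union kvm_mmu_page_role role = kvm_tdp_base_role(kvm); /* assumed helper */

        role.smm = !!as_id;     /* as_id 0: normal memory, 1: SMM */
        role.guest_mode = 0;    /* never an L2 (nested) TDP for now */

        return role;
}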

> The number of possible TDP page tables used for nested VMs is well bounded, but
> since devices obviously can't be nested VMs, I won't bother trying to explain
> the various possibilities (nested NPT on AMD is downright ridiculous).
In the future, if possible, I wonder if we can export a TDP for a nested VM
too, e.g. in scenarios where the TDP is partitioned and one piece is for the
L2 VM. Maybe we can specify that and tell KVM the exact piece of TDP to export.

> Nested virtualization aside, devices are obviously not capable of running in SMM
> and so they all need to use the "normal" page tables.
>
> I highlighted "active" above because if _any_ memslot is deleted, KVM will invalidate
> *all* existing page tables and rebuild new page tables as needed.  So over the
> lifetime of a VM, KVM could theoretically use an infinite number of page tables.
Right. In patch 36, a TDP root which is marked as "exported" is exempt from
the "invalidate". Instead, an "exported" TDP just zaps all leaf entries upon
memory slot removal.
That is to say, an exported TDP can stay "active" until it's unmarked as
exported.
  
Yan Zhao Dec. 5, 2023, 1:52 a.m. UTC | #10
On Mon, Dec 04, 2023 at 11:08:00AM -0400, Jason Gunthorpe wrote:
> On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > In this series, term "exported" is used in place of "shared" to avoid
> > confusion with terminology "shared EPT" in TDX.
> > 
> > The framework contains 3 main objects:
> > 
> > "KVM TDP FD" object - The interface of KVM to export TDP page tables.
> >                       With this object, KVM allows external components to
> >                       access a TDP page table exported by KVM.
> 
> I don't know much about the internals of kvm, but why have this extra
> user visible piece? Isn't there only one "TDP" per kvm fd? Why not
> just use the KVM FD as a handle for the TDP?
As explained in a parallel mail, the reason to introduce the KVM TDP FD is to
let KVM know which TDP the user wants to export (share).
Another reason is to wrap the exported TDP and its exported ops in a single
structure, so that components outside of KVM can query meta data, request page
fault handling, and register invalidation callbacks through the exported ops.

struct kvm_tdp_fd {
        /* Public */
        struct file *file;
        const struct kvm_exported_tdp_ops *ops;

        /* private to KVM */
        struct kvm_exported_tdp *priv;
};
For KVM, it only needs to expose this struct kvm_tdp_fd and two symbols
kvm_tdp_fd_get() and kvm_tdp_fd_put().
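
An importer (the IOMMUFD KVM HWPT) then uses them roughly like below
(simplified; the iommufd-side structure names and the exact op signatures
here are not the ones in the patches, and error handling is trimmed):

/* Simplified importer-side flow in the IOMMUFD KVM HWPT code. */
static int hwpt_kvm_import_tdp(struct iommufd_hwpt_kvm *hwpt, int fd)
{
        struct kvm_tdp_fd *tdp_fd = kvm_tdp_fd_get(fd);

        if (!tdp_fd)
                return -EINVAL;

        hwpt->tdp_fd = tdp_fd;
        /* receive TLB flush notifications from KVM via the importer ops */
        return tdp_fd->ops->register_importer(tdp_fd, &hwpt_importer_ops,
                                              hwpt);
}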


> 
> > "IOMMUFD KVM HWPT" object - A proxy connecting KVM TDP FD to IOMMU driver.
> >                             This HWPT has no IOAS associated.
> > 
> > "KVM domain" in IOMMU driver - Stage 2 domain in IOMMU driver whose paging
> >                                structures are managed by KVM.
> >                                Its hardware TLB invalidation requests are
> >                                notified from KVM via IOMMUFD KVM HWPT
> >                                object.
> 
> This seems broadly the right direction
> 
> > - About device which partially supports IOPF
> > 
> >   Many devices claiming PCIe PRS capability actually only tolerate IOPF in
> >   certain paths (e.g. DMA paths for SVM applications, but not for non-SVM
> >   applications or driver data such as ring descriptors). But the PRS
> >   capability doesn't include a bit to tell whether a device 100% tolerates
> >   IOPF in all DMA paths.
> 
> The lack of tolerance for truely DMA pinned guest memory is a
> significant problem for any real deployment, IMHO. I am aware of no
> device that can handle PRI on every single DMA path. :(
DSA actually can handle PRI on all DMA paths. But it requires the driver to
turn on this capability :(

> >   A simple way is to track an allowed list of devices which are known 100%
> >   IOPF-friendly in VFIO. Another option is to extend PCIe spec to allow
> >   device reporting whether it fully or partially supports IOPF in the PRS
> >   capability.
> 
> I think we need something like this.
> 
> > - How to map MSI page on arm platform demands discussions.
> 
> Yes, the recurring problem :(
> 
> Probably the same approach as nesting would work for a hack - map the
> ITS page into the fixed reserved slot and tell the guest not to touch
> it and to identity map it.
Ok.
  
Yan Zhao Dec. 5, 2023, 3:51 a.m. UTC | #11
On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> On Sat, Dec 02, 2023, Yan Zhao wrote:
> Please list out the pros and cons for each.  In the cons column for piggybacking
> KVM's page tables:
> 
>  - *Significantly* increases the complexity in KVM
The complexities added to KVM (up to now) are
a. fault in non-vCPU context
b. keep exported root always "active"
c. disallow non-coherent DMAs
d. movement of SPTE_MMU_PRESENT

for a, I think it's accepted, and we can see eager page split allocates
       non-leaf pages in non-vCPU context already.
for b, it requires the exported TDP root to stay "active" across KVM's "fast
       zap" (which invalidates all active TDP roots). Instead, the exported
       TDP's leaf entries are all zapped.
       Though it doesn't look "fast" enough, it avoids an unnecessary root page
       zap, and it's actually not frequent --
       - one for memslot removal (IO page fault is unlikely to happen during VM
                                  boot-up)
       - one for MMIO gen wraparound (which is rare)
       - one for nx huge page mode change (which is rare too)
for c, maybe we can work out a way to remove the MTRR stuffs.
for d, I added a config to turn on/off this movement. But right, KVM side will
       have to sacrifice a bit for software usage and take care of it when the
       config is on.

>  - Puts constraints on what KVM can/can't do in the future (see the movement
>    of SPTE_MMU_PRESENT).
>  - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion
>    mess, the truly nasty MTRR emulation (which I still hope to delete), the NX
>    hugepage mitigation, etc.
The NX hugepage mitigation only exists on certain CPUs. I don't see it on
recent Intel platforms, e.g. SPR and GNR...
We can disallow the sharing approach if the NX huge page mitigation is enabled.
But if pinning or partial pinning is not involved, NX huge pages will only
cause unnecessary zaps that reduce performance; functionally it still works
well.

Besides, regarding the extra IO invalidation involved in a TDP zap, I think
SVM has the same issue, i.e. each zap in the primary MMU is also accompanied
by an IO invalidation.

> 
> Please also explain the intended/expected/targeted use cases.  E.g. if the main
> use case is for device passthrough to slice-of-hardware VMs that aren't memory
> oversubscribed, 
>
The main use case is device passthrough with all devices supporting full
IOPF.
Opportunistically, we hope it can be used in trusted IO, where the TDP is
shared with the IO side. Then only one page table audit is required, and the
out-of-sync window for mappings between the CPU and IO sides can also be
eliminated.

> > - Unified page table management
> >   The complexity of allocating guest pages per GPAs, registering to MMU
> >   notifier on host primary MMU, sub-page unmapping, atomic page merge/split
> 
> Please find different terminology than "sub-page".  With Sub-Page Protection, Intel
> has more or less established "sub-page" to mean "less than 4KiB granularity".  But
> that can't possibly what you mean here because KVM doesn't support (un)mapping
> memory at <4KiB granularity.  Based on context above, I assume you mean "unmapping
> arbitrary pages within a given range".
>
Ok, sorry for this confusion.
By "sub-page unmapping", I mean atomic huge page splitting and unmapping smaller
range in the previous huge page.
  
Yan Zhao Dec. 5, 2023, 5:53 a.m. UTC | #12
On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:

> > > I'm not convinced that memory consumption is all that interesting.  If a VM is
> > > mapping the majority of memory into a device, then odds are good that the guest
> > > is backed with at least 2MiB page, if not 1GiB pages, at which point the memory
> > > overhead for pages tables is quite small, especially relative to the total amount
> > > of memory overheads for such systems.
> > 
> > AFAIK the main argument is performance. It is similar to why we want
> > to do IOMMU SVA with MM page table sharing.
> > 
> > If IOMMU mirrors/shadows/copies a page table using something like HMM
> > techniques then the invalidations will mark ranges of IOVA as
> > non-present and faults will occur to trigger hmm_range_fault to do the
> > shadowing.
> >
> > This means that pretty much all IO will always encounter a non-present
> > fault, certainly at the start and maybe worse while ongoing.
> > 
> > On the other hand, if we share the exact page table then natural CPU
> > touches will usually make the page present before an IO happens in
> > almost all cases and we don't have to take the horribly expensive IO
> > page fault at all.
> 
> I'm not advocating mirroring/copying/shadowing page tables between KVM and the
> IOMMU.  I'm suggesting managing IOMMU page tables mostly independently, but reusing
> KVM code to do so.
> 
> I wouldn't even be opposed to KVM outright managing the IOMMU's page tables.  E.g.
> add an "iommu" flag to "union kvm_mmu_page_role" and then the implementation looks
> rather similar to this series.
Yes, very similar to the current implementation, which adds an "exported"
flag to "union kvm_mmu_page_role".
> 
> What terrifies me is sharing page tables between the CPU and the IOMMU verbatim.
> 
> Yes, sharing page tables will Just Work for faulting in memory, but the downside
> is that _when_, not if, KVM modifies PTEs for whatever reason, those modifications
> will also impact the IO path.  My understanding is that IO page faults are at least
> an order of magnitude more expensive than CPU page faults.  That means that what's
> optimal for CPU page tables may not be optimal, or even _viable_, for IOMMU page
> tables.
> 
> E.g. based on our conversation at LPC, write-protecting guest memory to do dirty
> logging is not a viable option for the IOMMU because the latency of the resulting
> IOPF is too high.  Forcing KVM to use D-bit dirty logging for CPUs just because
> the VM has passthrough (mediated?) devices would be likely a non-starter.
> 
> One of my biggest concerns with sharing page tables between KVM and IOMMUs is that
> we will end up having to revert/reject changes that benefit KVM's usage due to
> regressing the IOMMU usage.
>
As the TDP shared with the IOMMU is marked by KVM, could we limit the changes
(that benefit KVM but regress the IOMMU) to TDPs that are not shared?

> If instead KVM treats IOMMU page tables as their own thing, then we can have
> divergent behavior as needed, e.g. different dirty logging algorithms, different
> software-available bits, etc.  It would also allow us to define new ABI instead
> of trying to reconcile the many incompatibilies and warts in KVM's existing ABI.
> E.g. off the top of my head:
> 
>  - The virtual APIC page shouldn't be visible to devices, as it's not "real" guest
>    memory.
> 
>  - Access tracking, i.e. page aging, by making PTEs !PRESENT because the CPU
>    doesn't support A/D bits or because the admin turned them off via KVM's
>    enable_ept_ad_bits module param.
> 
>  - Write-protecting GFNs for shadow paging when L1 is running nested VMs.  KVM's
>    ABI can be that device writes to L1's page tables are exempt.
> 
>  - KVM can exempt IOMMU page tables from KVM's awful "drop all page tables if
>    any memslot is deleted" ABI.
> 
> > We were not able to make bi-dir notifiers with the CPU mm, I'm
> > not sure that is "relatively easy" :(
> 
> I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget
> notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the
> same".
> 
> It wouldn't even necessarily need to be a notifier per se, e.g. if we taught KVM
> to manage IOMMU page tables, then KVM could simply install mappings for multiple
> sets of page tables as appropriate.
Not sure which approach below is the one you are referring to by "fire-and-forget
notifier" and "if we taught KVM to manage IOMMU page tables".

Approach A:
1. User space or IOMMUFD tells KVM which address space to share with IOMMUFD.
2. KVM creates a special TDP, and maps into this page table whenever a GFN in
   the specified address space is faulted to a PFN on the vCPU side.
3. IOMMUFD imports this special TDP and receives zap notifications from KVM.
   KVM will only send zap notifications for memslot removal or for certain MMU
   zaps.

Approach B:
1. User space or IOMMUFD tells KVM which address space to notify about.
2. KVM notifies IOMMUFD whenever a GFN in the specified address space is
   faulted to a PFN on the vCPU side.
3. IOMMUFD translates the GFN to a PFN in its own way (through VMAs or through
   some new memfd interface), and maps IO PTEs by itself.
4. IOMMUFD zaps IO PTEs when a memslot is removed and interacts with the MMU
   notifier for zap notifications in the primary MMU.


If approach A is preferred, could vCPUs also be allowed to attach to this
special TDP in VMs that don't suffer from the NX hugepage mitigation, do not
want live migration with passthrough devices, and don't rely on
write-protection for nested VMs?
  
Tian, Kevin Dec. 5, 2023, 6:30 a.m. UTC | #13
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, December 4, 2023 11:08 PM
> 
> On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > - How to map MSI page on arm platform demands discussions.
> 
> Yes, the recurring problem :(
> 
> Probably the same approach as nesting would work for a hack - map the
> ITS page into the fixed reserved slot and tell the guest not to touch
> it and to identity map it.
> 

yes logically it should follow what is planned for nesting.

just that kvm needs to involve more iommu specific knowledge e.g.
iommu_get_msi_cookie() to reserve the slot.
  
Tian, Kevin Dec. 5, 2023, 6:45 a.m. UTC | #14
> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Tuesday, December 5, 2023 9:32 AM
> 
> On Mon, Dec 04, 2023 at 08:38:17AM -0800, Sean Christopherson wrote:
> > The number of possible TDP page tables used for nested VMs is well
> > bounded, but
> > since devices obviously can't be nested VMs, I won't bother trying to
> > explain the various possibilities (nested NPT on AMD is downright
> > ridiculous).
> In future, if possible, I wonder if we can export an TDP for nested VM too.
> E.g. in scenarios where TDP is partitioned, and one piece is for L2 VM.
> Maybe we can specify that and tell KVM the very piece of TDP to export.
> 

nesting is tricky.

The reason why the sharing (w/o nesting) is logically ok is that both IOMMU
and KVM page tables are for the same GPA address space created by
the host.

for a nested VM together with a vIOMMU, the same sharing story holds if the
stage-2 page table on both sides still translates GPA. It implies the vIOMMU
is enabled in nested translation mode and L0 KVM doesn't expose vEPT to the
L1 VMM (which then uses shadowing instead).

things become tricky when the vIOMMU is working in a shadowing mode or when
L0 KVM exposes vEPT to the L1 VMM. In either case the stage-2 page table of
the L0 IOMMU/KVM actually translates a guest address space, and then sharing
becomes problematic (figuring out whether both refer to the same guest
address space, while that fact might change at any time).
  
Tian, Kevin Dec. 5, 2023, 7:17 a.m. UTC | #15
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, December 5, 2023 3:51 AM
> 
> On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> > It wouldn't even necessarily need to be a notifier per se, e.g. if we
> > taught KVM to manage IOMMU page tables, then KVM could simply install
> > mappings for multiple sets of page tables as appropriate.

The iommu driver still needs to be notified to invalidate the iotlb, unless
we want KVM to directly call the IOMMU API instead of going through iommufd.

> 
> This somehow feels more achievable to me since KVM already has all the
> code to handle multiple TDPs, having two parallel ones is probably
> much easier than trying to weld KVM to a different page table
> implementation through some kind of loose coupled notifier.
> 

yes, performance-wise this can also reduce the I/O page faults as the
sharing approach does.

but how does it compare to another way of supporting IOPF natively in
iommufd and iommu drivers? Note that iommufd also needs to support
native vfio applications, e.g. dpdk. I'm not sure whether there will be
strong interest in enabling IOPF for those applications. But if the
answer is yes then it's inevitable to have such logic implemented in
the iommu stack, given KVM is not in the picture there.

With that, is it more reasonable to develop the IOPF support natively
on the iommu side, plus an optional notifier mechanism to sync with
KVM-induced host PTE installation as an optimization?