[RFC,v12,00/33] KVM: guest_memfd() and per-page attributes

Message ID 20230914015531.1419405-1-seanjc@google.com
Series: KVM: guest_memfd() and per-page attributes

Message

Sean Christopherson Sept. 14, 2023, 1:54 a.m. UTC
  This is hopefully the last RFC for implementing fd-based (instead of vma-based)
memory for KVM guests.  If you want the full background of why we are doing
this, please go read the v10 cover letter.  With luck, v13 will be a "normal"
series that's ready for inclusion.

Tagged RFC as there are still several empty changelogs, a lot of missing
documentation, and a handful of TODOs.  And I haven't tested or proofread this
anywhere near as much as I normally would.  I am posting now, even though the
remaining TODOs aren't _that_ big, so that people can test this new version
without having to wait a few weeks for those TODOs to be closed out, i.e. to
give us at least some chance of hitting v6.7.

The most relevant TODO item for non-KVM folks is that we are planning on
dropping the dedicated "gmem" file system.  Assuming that pans out, the patch
to export security_inode_init_security_anon() should go away.

KVM folks, there are a few changes I want to highlight and get feedback on, all of
which are directly related to the "annotated memory faults" series[*]:

 - Rename kvm_run.memory to kvm_run.memory_fault
 - Place "memory_fault" in a separate union
 - Return -EFAULT or -EHWPOISON when exiting with KVM_EXIT_MEMORY_FAULT

The first one is pretty self-explanatory: "run->memory.gpa" looks quite odd, and
taking the "memory" name would prevent ever doing something directly with
"memory".

Putting the struct in a separate union is not at all necessary for supporting
private memory; it's purely forward-looking to Anish's series, which wants to
annotate (fill memory_fault) on all faults, even if KVM ultimately doesn't exit
to userspace (x86 has a few unfortunate flows where KVM can clobber a previous
exit, or suppress a memory fault exit).  Using a separate union, i.e. different
bytes in kvm_run, allows exiting to userspace with both memory_fault and the
"normal" union filled, e.g. if KVM starts an MMIO exit and then hits a memory
fault exit, the MMIO exit will be preserved.  It's unlikely userspace will be
able to do anything useful with the info in that case, but the reverse will
likely be much more interesting, e.g. if KVM hits a memory fault and then doesn't
report it to userspace for whatever reason.

As for returning -EFAULT/-EHWPOISON, far too many helpers that touch guest
memory, i.e. can "fault", return 0 on success, which makes it all but impossible
to use '0' to signal "exit to userspace".  Rather than use '0' for _just_ the
case where the guest is accessing memory in the "wrong" private vs. shared
state, my thought is to use -EFAULT everywhere except for the poisoned page
case.
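
For userspace, the contract boils down to something like the sketch below.  This
is purely illustrative (the error-handling policy and the "resolve the fault"
step are hypothetical); only the errno/exit_reason behavior comes from this
series:

	#include <errno.h>
	#include <stdbool.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/*
	 * Returns true if KVM_RUN produced an exit the caller should process,
	 * i.e. run->exit_reason is valid; false on an unrecoverable error.
	 */
	static bool vcpu_run_once(int vcpu_fd, struct kvm_run *run)
	{
		if (!ioctl(vcpu_fd, KVM_RUN, 0))
			return true;	/* "normal" exit, KVM returned '0' */

		if ((errno == EFAULT || errno == EHWPOISON) &&
		    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			/*
			 * run->memory_fault is valid: [gpa, gpa + size) needs
			 * attention, e.g. an implicit private<=>shared
			 * conversion; KVM_MEMORY_EXIT_FLAG_PRIVATE in .flags
			 * says whether the faulting access was private.
			 */
			return true;
		}

		/* Bare -errno, exit_reason may be stale/undefined. */
		return false;
	}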

[*] https://lore.kernel.org/all/20230908222905.1321305-1-amoorthy@google.com

TODOs [owner]:
 - Documentation [none]
 - Changelogs [Sean]
 - Fully anonymous inode vs. proper filesystem [Paolo]
 - kvm_gmem_error_page() testing (my version is untested) [Isaku?]

v12:
 - Squash fixes from others. [Many people]
 - Kill off the .on_unlock() callback and use .on_lock() when handling
   memory attributes updates. [Isaku]
 - Add more tests. [Ackerley]
 - Move range_has_attrs() to common code. [Paolo]
 - Return the actual number of address spaces for the VM-scoped version of
   KVM_CAP_MULTI_ADDRESS_SPACE. [Paolo]
 - Move forward declaration of "struct kvm_gfn_range" to kvm_types.h. [Yuan]
 - Plumb code to have HVA-based mmu_notifier events affect only shared
   mappings. [Asish]
 - Clean up kvm_vm_ioctl_set_mem_attributes() math. [Binbin]
 - Collect a few reviews and acks. [Paolo, Paul]
 - Unconditionally advertise a synchronized MMU on PPC. [Paolo]
 - Check for error return from filemap_grab_folio(). [A
 - Make max_order optional. [Fuad]
 - Remove signal injection, zap SPTEs on memory error. [Isaku]
 - Add KVM_CAP_GUEST_MEMFD. [Xiaoyao]
 - Invoke kvm_arch_pre_set_memory_attributes() instead of
   kvm_mmu_unmap_gfn_range().
 - Rename kvm_run.memory to kvm_run.memory_fault
 - Place "memory_fault" in a separate union
 - Return -EFAULT and -EHWPOISON with KVM_EXIT_MEMORY_FAULT
 - "Init" run->exit_reason in x86's vcpu_run()

v11:
 - https://lore.kernel.org/all/20230718234512.1690985-1-seanjc@google.com
 - Test private<=>shared conversions *without* doing fallocate()
 - PUNCH_HOLE all memory between iterations of the conversion test so that
   KVM doesn't retain pages in the guest_memfd
 - Rename hugepage control to be a very generic ALLOW_HUGEPAGE, instead of
   giving it a THP or PMD specific name.
 - Fold in fixes from a lot of people (thank you!)
 - Zap SPTEs *before* updating attributes to ensure no weirdness, e.g. if
   KVM handles a page fault and looks at inconsistent attributes
 - Refactor MMU interaction with attributes updates to reuse much of KVM's
   framework for mmu_notifiers.

v10: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com

Ackerley Tng (1):
  KVM: selftests: Test KVM exit behavior for private memory/access

Chao Peng (8):
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  KVM: Introduce per-page memory attributes
  KVM: x86: Disallow hugepages when memory attributes are mixed
  KVM: x86/mmu: Handle page fault for private memory
  KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper
  KVM: selftests: Expand set_memory_region_test to validate
    guest_memfd()
  KVM: selftests: Add basic selftest for guest_memfd()

Sean Christopherson (21):
  KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn
    ranges
  KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER
  KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU
  KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to
    CONFIG_KVM_GENERIC_MMU_NOTIFIER
  KVM: Introduce KVM_SET_USER_MEMORY_REGION2
  KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
  KVM: Drop .on_unlock() mmu_notifier hook
  KVM: Set the stage for handling only shared mappings in mmu_notifier
    events
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
  security: Export security_inode_init_security_anon() for use by KVM
  KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing
    memory
  KVM: Add transparent hugepage support for dedicated guest memory
  KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
  KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
  KVM: Allow arch code to track number of memslot address spaces per VM
  KVM: x86: Add support for "protected VMs" that can utilize private
    memory
  KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper
  KVM: selftests: Convert lib's mem regions to
    KVM_SET_USER_MEMORY_REGION2
  KVM: selftests: Add support for creating private memslots
  KVM: selftests: Introduce VM "shape" to allow tests to specify the VM
    type
  KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data

Vishal Annapurve (3):
  KVM: selftests: Add helpers to convert guest memory b/w private and
    shared
  KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls
    (x86)
  KVM: selftests: Add x86-only selftest for private memory conversions

 Documentation/virt/kvm/api.rst                | 116 ++++
 arch/arm64/include/asm/kvm_host.h             |   2 -
 arch/arm64/kvm/Kconfig                        |   2 +-
 arch/mips/include/asm/kvm_host.h              |   2 -
 arch/mips/kvm/Kconfig                         |   2 +-
 arch/powerpc/include/asm/kvm_host.h           |   2 -
 arch/powerpc/kvm/Kconfig                      |   8 +-
 arch/powerpc/kvm/book3s_hv.c                  |   2 +-
 arch/powerpc/kvm/powerpc.c                    |   7 +-
 arch/riscv/include/asm/kvm_host.h             |   2 -
 arch/riscv/kvm/Kconfig                        |   2 +-
 arch/x86/include/asm/kvm_host.h               |  17 +-
 arch/x86/include/uapi/asm/kvm.h               |   3 +
 arch/x86/kvm/Kconfig                          |  14 +-
 arch/x86/kvm/debugfs.c                        |   2 +-
 arch/x86/kvm/mmu/mmu.c                        | 264 +++++++-
 arch/x86/kvm/mmu/mmu_internal.h               |   2 +
 arch/x86/kvm/mmu/tdp_mmu.c                    |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  11 +-
 arch/x86/kvm/x86.c                            |  25 +-
 include/linux/kvm_host.h                      | 143 +++-
 include/linux/kvm_types.h                     |   1 +
 include/linux/pagemap.h                       |  19 +-
 include/uapi/linux/kvm.h                      |  67 ++
 include/uapi/linux/magic.h                    |   1 +
 mm/compaction.c                               |  43 +-
 mm/migrate.c                                  |   2 +
 security/security.c                           |   1 +
 tools/testing/selftests/kvm/Makefile          |   3 +
 tools/testing/selftests/kvm/dirty_log_test.c  |   2 +-
 .../testing/selftests/kvm/guest_memfd_test.c  | 165 +++++
 .../selftests/kvm/include/kvm_util_base.h     | 148 +++-
 .../testing/selftests/kvm/include/test_util.h |   5 +
 .../selftests/kvm/include/ucall_common.h      |  11 +
 .../selftests/kvm/include/x86_64/processor.h  |  15 +
 .../selftests/kvm/kvm_page_table_test.c       |   2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c    | 231 ++++---
 tools/testing/selftests/kvm/lib/memstress.c   |   3 +-
 .../selftests/kvm/set_memory_region_test.c    | 100 +++
 .../kvm/x86_64/private_mem_conversions_test.c | 410 +++++++++++
 .../kvm/x86_64/private_mem_kvm_exits_test.c   | 121 ++++
 .../kvm/x86_64/ucna_injection_test.c          |   2 +-
 virt/kvm/Kconfig                              |  17 +
 virt/kvm/Makefile.kvm                         |   1 +
 virt/kvm/dirty_ring.c                         |   2 +-
 virt/kvm/guest_mem.c                          | 637 ++++++++++++++++++
 virt/kvm/kvm_main.c                           | 482 +++++++++++--
 virt/kvm/kvm_mm.h                             |  38 ++
 48 files changed, 2888 insertions(+), 271 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_test.c
 create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
 create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
 create mode 100644 virt/kvm/guest_mem.c


base-commit: 0bb80ecc33a8fb5a682236443c1e740d5c917d1d
  

Comments

Xiaoyao Li Sept. 22, 2023, 6:03 a.m. UTC | #1
On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
> 
> Add a new KVM exit type to allow userspace to handle memory faults that
> KVM cannot resolve, but that userspace *may* be able to handle (without
> terminating the guest).
> 
> KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
> conversions between private and shared memory.  With guest private memory,
> there will be two kinds of memory conversions:
> 
>    - explicit conversion: happens when the guest explicitly calls into KVM
>      to map a range (as private or shared)
> 
>    - implicit conversion: happens when the guest attempts to access a gfn
>      that is configured in the "wrong" state (private vs. shared)
> 
> On x86 (first architecture to support guest private memory), explicit
> conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,

side topic.

Do we expect to integrate TDVMCALL(MAPGPA) of TDX into KVM_HC_MAP_GPA_RANGE?

> but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
> as there is (obviously) no hypercall, and there is no guarantee that the
> guest actually intends to convert between private and shared, i.e. what
> KVM thinks is an implicit conversion "request" could actually be the
> result of a guest code bug.
> 
> KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
> be implicit conversions.
> 
> Place "struct memory_fault" in a second anonymous union so that filling
> memory_fault doesn't clobber state from other yet-to-be-fulfilled exits,
> and to provide additional information if KVM does NOT ultimately exit to
> userspace with KVM_EXIT_MEMORY_FAULT, e.g. if KVM suppresses (or worse,
> loses) the exit, as KVM often suppresses exits for memory failures that
> occur when accessing paravirt data structures.  The initial usage for
> private memory will be all-or-nothing, but other features such as the
> proposed "userfault on missing mappings" support will use
> KVM_EXIT_MEMORY_FAULT for potentially _all_ guest memory accesses, i.e.
> will run afoul of KVM's various quirks.

So when exit reason is KVM_EXIT_MEMORY_FAULT, how can we tell which 
field in the first union is valid?

When exit reason is not KVM_EXIT_MEMORY_FAULT, how can we know the info 
in the second union run.memory is valid without a run.memory.valid field?

> Use bit 3 for flagging private memory so that KVM can use bits 0-2 for
> capturing RWX behavior if/when userspace needs such information.
> 
> Note!  To allow for future possibilities where KVM reports
> KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
> fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
> perspective), not '0'!  Due to historical baggage within KVM, exiting to
> userspace with '0' from deep callstacks, e.g. in emulation paths, is
> infeasible as doing so would require a near-complete overhaul of KVM,
> whereas KVM already propagates -errno return codes to userspace even when
> the -errno originated in a low level helper.
> 
> Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
> Cc: Anish Moorthy <amoorthy@google.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   Documentation/virt/kvm/api.rst | 24 ++++++++++++++++++++++++
>   include/linux/kvm_host.h       | 15 +++++++++++++++
>   include/uapi/linux/kvm.h       | 24 ++++++++++++++++++++++++
>   3 files changed, 63 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 21a7578142a1..e28a13439a95 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6702,6 +6702,30 @@ array field represents return values. The userspace should update the return
>   values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>   spec refer, https://github.com/riscv/riscv-sbi-doc.
>   
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
> +
> +KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
> +could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
> +guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
> +describes properties of the faulting access that are likely pertinent:
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
> +   on a private memory access.  When clear, indicates the fault occurred on a
> +   shared access.
> +
> +Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
> +accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
> +or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
> +kvm_run.exit_reason is stale/undefined for all other error numbers.
> +

Initially, this section was a copy of struct kvm_run and had comments
for each field accordingly. Unfortunately, that consistency has not been
well maintained as new fields were added.

Do we expect to fix it?

>   ::
>   
>       /* KVM_EXIT_NOTIFY */
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4e741ff27af3..d8c6ce6c8211 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2327,4 +2327,19 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>   /* Max number of entries allowed for each kvm dirty ring */
>   #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>   
> +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +						 gpa_t gpa, gpa_t size,
> +						 bool is_write, bool is_exec,
> +						 bool is_private)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.size = size;
> +
> +	/* RWX flags are not (yet) defined or communicated to userspace. */
> +	vcpu->run->memory_fault.flags = 0;
> +	if (is_private)
> +		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
> +}
> +
>   #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index bd1abe067f28..d2d913acf0df 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -274,6 +274,7 @@ struct kvm_xen_exit {
>   #define KVM_EXIT_RISCV_SBI        35
>   #define KVM_EXIT_RISCV_CSR        36
>   #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>   
>   /* For KVM_EXIT_INTERNAL_ERROR */
>   /* Emulate instruction failed. */
> @@ -541,6 +542,29 @@ struct kvm_run {
>   		struct kvm_sync_regs regs;
>   		char padding[SYNC_REGS_SIZE_BYTES];
>   	} s;
> +
> +	/*
> +	 * This second exit union holds structs for exit types which may be
> +	 * triggered after KVM has already initiated a different exit, or which
> +	 * may be ultimately dropped by KVM.
> +	 *
> +	 * For example, because of limitations in KVM's uAPI, KVM x86 can
> +	 * generate a memory fault exit after an MMIO exit is initiated (exit_reason
> +	 * and kvm_run.mmio are filled).  And conversely, KVM often disables
> +	 * paravirt features if a memory fault occurs when accessing paravirt
> +	 * data instead of reporting the error to userspace.
> +	 */
> +	union {
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
> +		/* Fix the size of the union. */
> +		char padding2[256];
> +	};
>   };
>   
>   /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
  
Sean Christopherson Sept. 22, 2023, 2:30 p.m. UTC | #2
On Fri, Sep 22, 2023, Xiaoyao Li wrote:
> On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> > From: Chao Peng <chao.p.peng@linux.intel.com>
> > 
> > Add a new KVM exit type to allow userspace to handle memory faults that
> > KVM cannot resolve, but that userspace *may* be able to handle (without
> > terminating the guest).
> > 
> > KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
> > conversions between private and shared memory.  With guest private memory,
> > there will be two kinds of memory conversions:
> > 
> >    - explicit conversion: happens when the guest explicitly calls into KVM
> >      to map a range (as private or shared)
> > 
> >    - implicit conversion: happens when the guest attempts to access a gfn
> >      that is configured in the "wrong" state (private vs. shared)
> > 
> > On x86 (first architecture to support guest private memory), explicit
> > conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
> 
> side topic.
> 
> Do we expect to integrate TDVMCALL(MAPGPA) of TDX into KVM_HC_MAP_GPA_RANGE?

Yes, that's my expectation.

> > but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
> > as there is (obviously) no hypercall, and there is no guarantee that the
> > guest actually intends to convert between private and shared, i.e. what
> > KVM thinks is an implicit conversion "request" could actually be the
> > result of a guest code bug.
> > 
> > KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
> > be implicit conversions.
> > 
> > Place "struct memory_fault" in a second anonymous union so that filling
> > memory_fault doesn't clobber state from other yet-to-be-fulfilled exits,
> > and to provide additional information if KVM does NOT ultimately exit to
> > userspace with KVM_EXIT_MEMORY_FAULT, e.g. if KVM suppresses (or worse,
> > loses) the exit, as KVM often suppresses exits for memory failures that
> > occur when accessing paravirt data structures.  The initial usage for
> > private memory will be all-or-nothing, but other features such as the
> > proposed "userfault on missing mappings" support will use
> > KVM_EXIT_MEMORY_FAULT for potentially _all_ guest memory accesses, i.e.
> > will run afoul of KVM's various quirks.
> 
> So when exit reason is KVM_EXIT_MEMORY_FAULT, how can we tell which field in
> the first union is valid?
> 
> When exit reason is not KVM_EXIT_MEMORY_FAULT, how can we know the info in
> the second union run.memory is valid without a run.memory.valid field?

I'll respond to this separately with a trimmed Cc list.  I suspect this will be
a rather lengthy conversation, and it has almost nothing to do with guest_memfd.

> > +Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
> > +accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
> > +or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
> > +kvm_run.exit_reason is stale/undefined for all other error numbers.
> > +
> 
> Initially, this section was a copy of struct kvm_run and had comments for
> each field accordingly. Unfortunately, that consistency has not been well
> maintained as new fields were added.
> 
> Do we expect to fix it?

AFAIK, no one is working on cleaning up this section of the docs, but as always,
patches are welcome :-)
  
Sean Christopherson Sept. 22, 2023, 4:28 p.m. UTC | #3
Removing non-KVM lists/people from Cc, this is going to get way off the guest_memfd
track...

On Fri, Sep 22, 2023, Xiaoyao Li wrote:
> On 9/14/2023 9:55 AM, Sean Christopherson wrote:
> > Place "struct memory_fault" in a second anonymous union so that filling
> > memory_fault doesn't clobber state from other yet-to-be-fulfilled exits,
> > and to provide additional information if KVM does NOT ultimately exit to
> > userspace with KVM_EXIT_MEMORY_FAULT, e.g. if KVM suppresses (or worse,
> > loses) the exit, as KVM often suppresses exits for memory failures that
> > occur when accessing paravirt data structures.  The initial usage for
> > private memory will be all-or-nothing, but other features such as the
> > proposed "userfault on missing mappings" support will use
> > KVM_EXIT_MEMORY_FAULT for potentially _all_ guest memory accesses, i.e.
> > will run afoul of KVM's various quirks.
> 
> So when exit reason is KVM_EXIT_MEMORY_FAULT, how can we tell which field in
> the first union is valid?

/facepalm

At one point, I believe we had discussed a second exit reason field?  But yeah,
as is, there's no way for userspace to glean anything useful from the first union.

The more I think about this, the more I think it's a fool's errand.  Even if KVM
provides the exit_reason history, userspace can't act on the previous, unfulfilled
exit without *knowing* that it's safe/correct to process the previous exit.  I
don't see how that's remotely possible.

Practically speaking, there is one known instance of this in KVM, and it's a
rather ridiculous edge case that has existed "forever".  I'm very strongly inclined
to do nothing special, and simply treat clobbering an exit that userspace actually
cares about like any other KVM bug.

> When exit reason is not KVM_EXIT_MEMORY_FAULT, how can we know the info in
> the second union run.memory is valid without a run.memory.valid field?

Anish's series adds a flag in kvm_run.flags to track whether or not memory_fault
has been filled.  The idea is that KVM would clear the flag early in KVM_RUN, and
then set the flag when memory_fault is first filled.

	/* KVM_CAP_MEMORY_FAULT_INFO flag for kvm_run.flags */
	#define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)

I didn't propose that flag here because clobbering memory_fault from the page
fault path would be a flagrant KVM bug.
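
Roughly, the clear/set flow would look like the below (the #define is from
Anish's series; the reset helper and the exact placement are illustrative):

	static void kvm_reset_memory_fault_info(struct kvm_vcpu *vcpu)
	{
		/* Cleared early in KVM_RUN... */
		vcpu->run->flags &= ~KVM_RUN_MEMORY_FAULT_FILLED;
	}

	static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
							 gpa_t gpa, gpa_t size,
							 bool is_write, bool is_exec,
							 bool is_private)
	{
		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
		vcpu->run->memory_fault.gpa = gpa;
		vcpu->run->memory_fault.size = size;

		/* RWX flags are not (yet) defined or communicated to userspace. */
		vcpu->run->memory_fault.flags = 0;
		if (is_private)
			vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;

		/* ...and set whenever memory_fault is filled. */
		vcpu->run->flags |= KVM_RUN_MEMORY_FAULT_FILLED;
	}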

Honestly, I'm becoming more and more skeptical that separating memory_fault is
worthwhile, or even desirable.  Similar to memory_fault clobbering something else,
userspace can only take action when memory_fault is clobbered if userspace somehow
knows that it's safe/correct to do so.

Even if KVM exits "immediately" after initially filling memory_fault, the fact
that KVM is exiting for a different reason (or a different memory fault) means
that KVM did *something* between filling memory_fault and actually exiting.  And
it's completely impossible for userspace to know what that "something" was.

E.g. in the splat from selftests[1], KVM reacts to a failure during Real Mode
event injection by synthesizing a triple fault

	ret = emulate_int_real(ctxt, irq);

	if (ret != X86EMUL_CONTINUE) {
		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);

There are multiple KVM bugs at play: read_emulate() and write_emulate() incorrectly
assume *all* failures should be treated like MMIO, and conversely ->read_std() and
->write_std() don't handle *any* failures as MMIO.

Circling back to my "capturing the history is pointless" assertion, by the time
userspace gets an exit, the vCPU is already in shutdown, and KVM has clobbered
memory_fault something like five times.  There is zero chance userspace can do
anything but shed a tear for the VM and move on.

The whole "let's annotate all memory faults" idea came from my desire to push KVM
towards a future where all -EFAULT exits are annotated[2].  I still think we should
point KVM in that general direction, i.e. implement something that _can_ provide
100% "coverage" in the future, even though we don't expect to get there anytime soon.

I bring that up because neither private memory nor userfault-on-missing needs to
annotate anything other than -EFAULT during guest page faults.  I.e. all of this
paranoia about clobbering memory_fault isn't actually buying us anything other
than noise and complexity.  The cases we need to work _today_ are perfectly fine,
and _if_ some future use cases needs all/more paths to be 100% accurate, then the
right thing to do is to make whatever changes are necessary for KVM to be 100%
accurate.

In other words, trying to gracefully handle memory_fault clobbering is pointless.
KVM either needs to guarantee there's no clobbering (guest page fault paths) or
treat the annotation as best effort and informational-only (everything else at
this time).  Future features may grow the set of paths that needs strong guarantees,
but that just means fixing more paths and treating any violation of the contract
like any other KVM bug.

And if we stop being unnecessarily paranoid, KVM_RUN_MEMORY_FAULT_FILLED can also
go away.  The flag came about in part because *unconditionally* sanitizing
kvm_run.exit_reason at the start of KVM_RUN would break KVM's ABI, as userspace
may rely on the exit_reason being preserved when calling back into KVM to complete
userspace I/O (or MMIO)[3].  But the goal is purely to avoid exiting with stale
memory_fault information, not to sanitize every other existing exit_reason, and
that can be achieved by simply making the reset conditional.

Somewhat of a tangent, I think we should add KVM_CAP_MEMORY_FAULT_INFO if the
KVM_EXIT_MEMORY_FAULT supports comes in with guest_memfd.

Unless someone comes up with a good argument for keeping the paranoid behavior,
I'll post the below patch as fixup for the guest_memfd series, and work with Anish
to massage the attached patch (result of the below being squashed) in case his
series lands first.

[1] https://lore.kernel.org/all/202309141107.30863e9d-oliver.sang@intel.com
[2] https://lore.kernel.org/all/Y+6iX6a22+GEuH1b@google.com
[3] https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com

---
 Documentation/virt/kvm/api.rst | 21 +++++++++++++++++++
 arch/x86/kvm/x86.c             |  1 +
 include/uapi/linux/kvm.h       | 37 ++++++++++------------------------
 virt/kvm/kvm_main.c            | 10 +++++++++
 4 files changed, 43 insertions(+), 26 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 5e08f2a157ef..d5c9e46e2d12 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7829,6 +7829,27 @@ This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN *may* fill
+kvm_run.memory_fault in response to failed guest memory accesses in a vCPU
+context.  KVM only guarantees that errors that occur when handling guest page
+fault VM-Exits will be annotated, all other error paths are best effort.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
+See KVM_EXIT_MEMORY_FAULT for more information.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 767236b4d771..e25076fdd720 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4525,6 +4525,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_IRQFD_RESAMPLE:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 65fc983af840..7f0ee6475141 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -525,6 +525,13 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -546,29 +553,6 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
-
-	/*
-	 * This second exit union holds structs for exit types which may be
-	 * triggered after KVM has already initiated a different exit, or which
-	 * may be ultimately dropped by KVM.
-	 *
-	 * For example, because of limitations in KVM's uAPI, KVM x86 can
 -	 * generate a memory fault exit after an MMIO exit is initiated (exit_reason
-	 * and kvm_run.mmio are filled).  And conversely, KVM often disables
-	 * paravirt features if a memory fault occurs when accessing paravirt
-	 * data instead of reporting the error to userspace.
-	 */
-	union {
-		/* KVM_EXIT_MEMORY_FAULT */
-		struct {
-#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
-			__u64 flags;
-			__u64 gpa;
-			__u64 size;
-		} memory_fault;
-		/* Fix the size of the union. */
-		char padding2[256];
-	};
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
@@ -1231,9 +1215,10 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_USER_MEMORY2 230
-#define KVM_CAP_MEMORY_ATTRIBUTES 231
-#define KVM_CAP_GUEST_MEMFD 232
-#define KVM_CAP_VM_TYPES 233
+#define KVM_CAP_MEMORY_FAULT_INFO 231
+#define KVM_CAP_MEMORY_ATTRIBUTES 232
+#define KVM_CAP_GUEST_MEMFD 233
+#define KVM_CAP_VM_TYPES 234
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 96fc609459e3..d78e97b527e5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4450,6 +4450,16 @@ static long kvm_vcpu_ioctl(struct file *filp,
 				synchronize_rcu();
 			put_pid(oldpid);
 		}
+
+		/*
+		 * Reset the exit reason if the previous userspace exit was due
+		 * to a memory fault.  Not all -EFAULT exits are annotated, and
+		 * so leaving exit_reason set to KVM_EXIT_MEMORY_FAULT could
+		 * result in feeding userspace stale information.
+		 */
+		if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT)
 +			vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
 		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
 		break;

base-commit: 67aa951d727ad2715f7ad891929f18b7f2567a0f
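
The "guard against repeatedly receiving the same error/annotated fault" note in
the documentation above could be handled in userspace along these lines (purely
illustrative, nothing below is uAPI):

	#include <stdbool.h>
	#include <stdint.h>

	struct memory_fault_guard {
		uint64_t last_gpa;
		uint64_t last_flags;
		unsigned int repeats;
	};

	/* Returns true if the same annotated fault keeps coming back. */
	static bool memory_fault_is_stuck(struct memory_fault_guard *g,
					  uint64_t gpa, uint64_t flags)
	{
		if (gpa == g->last_gpa && flags == g->last_flags)
			return ++g->repeats > 3;

		g->last_gpa = gpa;
		g->last_flags = flags;
		g->repeats = 0;
		return false;
	}
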
--
  
Sean Christopherson Oct. 3, 2023, 1:42 a.m. UTC | #4
On Mon, Oct 02, 2023, Anish Moorthy wrote:
> On Fri, Sep 22, 2023 at 9:28 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > So when exit reason is KVM_EXIT_MEMORY_FAULT, how can we tell which field in
> > > the first union is valid?
> >
> > /facepalm
> >
> > At one point, I believe we had discussed a second exit reason field?  But yeah,
> > as is, there's no way for userspace to glean anything useful from the first union.
> 
> Oh, was this an objective? When I was pushing for the second union
> this I was just trying to make sure all the efault annotations
> wouldn't clobber *other* exits. But yeah, I don't/didn't see a
> meaningful way to have valid information in both structs.

Clobbering other exits means KVM is already broken, because simply accessing memory
in guest context after initiating an exit is a KVM bug as it would violate ordering
and maybe causality.   E.g. the only reason the preemption case (see below) isn't
completely buggy is specifically because it's host paravirt behavior.

In other words, ignoring preemption for the moment, not clobbering other exits isn't
useful because whatever buggy KVM behavior caused the clobbering already happened,
i.e. the VM is already in trouble either way.  The only realistic options are to fix
the KVM bugs, or to effectively take an errata and say "don't do that" (like we've
done for the silly PUSHD to MMIO case).

> > The more I think about this, the more I think it's a fool's errand.  Even if KVM
> > provides the exit_reason history, userspace can't act on the previous, unfulfilled
> > exit without *knowing* that it's safe/correct to process the previous exit.  I
> > don't see how that's remotely possible.
> >
> > Practically speaking, there is one known instance of this in KVM, and it's a
> > rather ridiculous edge case that has existed "forever".  I'm very strongly inclined
> > to do nothing special, and simply treat clobbering an exit that userspace actually
> > cares about like any other KVM bug.
> >
> > > When exit reason is not KVM_EXIT_MEMORY_FAULT, how can we know the info in
> > > the second union run.memory is valid without a run.memory.valid field?
> >
> > Anish's series adds a flag in kvm_run.flags to track whether or not memory_fault
> > has been filled.  The idea is that KVM would clear the flag early in KVM_RUN, and
> > then set the flag when memory_fault is first filled.
> >
> >         /* KVM_CAP_MEMORY_FAULT_INFO flag for kvm_run.flags */
> >         #define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)
> >
> > I didn't propose that flag here because clobbering memory_fault from the page
> > fault path would be a flagrant KVM bug.
> >
> > Honestly, I'm becoming more and more skeptical that separating memory_fault is
> > worthwhile, or even desirable.  Similar to memory_fault clobbering something else,
> > userspace can only take action when memory_fault is clobbered if userspace somehow
> > knows that it's safe/correct to do so.
> >
> > Even if KVM exits "immediately" after initially filling memory_fault, the fact
> > that KVM is exiting for a different reason (or a different memory fault) means
> > that KVM did *something* between filling memory_fault and actually exiting.  And
> > it's completely impossible for userspace to know what that "something" was.
> 
> Are you describing a scenario in which memory_fault is (initially)
> filled, then something else happens to fill memory_fault (thus
> clobbering it), then KVM_RUN exits? I'm confused by the tension
> between the "KVM exits 'immediately'" and "KVM did *something* between
filling memory_fault and actually exiting" statements here.

Yes, I'm describing a hypothetical scenario.  Immediately was in quotes because
even if KVM returns from the *current* function straightaway, it's possible that
control is deep in a call stack, i.e. KVM may "immediately" try to exit from the
current function's perspective, but in reality it may take a while to actually
get out to userspace.

> > E.g. in the splat from selftests[1], KVM reacts to a failure during Real Mode
> > event injection by synthesizing a triple fault
> >
> >         ret = emulate_int_real(ctxt, irq);
> >
> >         if (ret != X86EMUL_CONTINUE) {
> >                 kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> >
> > There are multiple KVM bugs at play: read_emulate() and write_emulate() incorrectly
> > assume *all* failures should be treated like MMIO, and conversely ->read_std() and
> > ->write_std() don't handle *any* failures as MMIO.
> >
> > Circling back to my "capturing the history is pointless" assertion, by the time
> > userspace gets an exit, the vCPU is already in shutdown, and KVM has clobbered
> > memory_fault something like five times.  There is zero chance userspace can do
> > anything but shed a tear for the VM and move on.
> >
> > The whole "let's annotate all memory faults" idea came from my desire to push KVM
> > towards a future where all -EFAULT exits are annotated[2].  I still think we should
> > point KVM in that general direction, i.e. implement something that _can_ provide
> > 100% "coverage" in the future, even though we don't expect to get there anytime soon.
> >
> > I bring that up because neither private memory nor userfault-on-missing needs to
> > annotate anything other than -EFAULT during guest page faults.  I.e. all of this
> > paranoia about clobbering memory_fault isn't actually buying us anything other
> > than noise and complexity.  The cases we need to work _today_ are perfectly fine,
> > and _if_ some future use cases needs all/more paths to be 100% accurate, then the
> > right thing to do is to make whatever changes are necessary for KVM to be 100%
> > accurate.
> >
> > In other words, trying to gracefully handle memory_fault clobbering is pointless.
> > KVM either needs to guarantee there's no clobbering (guest page fault paths) or
> > treat the annotation as best effort and informational-only (everything else at
> > this time).  Future features may grow the set of paths that needs strong guarantees,
> > but that just means fixing more paths and treating any violation of the contract
> > like any other KVM bug.
> 
> Ok, so if we're restricting the exit to just the places it's totally
> accurate (page-fault paths) then, IIUC,
> 
> - There's no reason to attach it to EFAULT, ie it becomes a "normal" exit

No, I still want at least partial line of sight to being able to provide useful
information to userspace on EFAULT.  Making KVM_EXIT_MEMORY_FAULT a "normal" exit
pretty much squashes any hope of that.

> - I should go drop the patches annotating kvm_vcpu_read/write_page
> from my series

Hold up on that.  I'd prefer to keep them as there's still value in giving userspace
debug information.  All I'm proposing is that we would firmly state in the
documentation that those paths must be treated as informational-only.

The whole kvm_steal_time_set_preempted() mess does give me pause though.  That
helper isn't actually problematic, but only because it uses copy_to_user_nofault()
directly :-/

But that doesn't necessarily mean we need to abandon the entire idea, e.g. it
might not be a terrible idea to explicitly differentiate accesses to guest memory
for paravirt stuff, from accesses to guest memory on behalf of the guest.

Anyways, don't do anything just yet.

> - The helper function [a] for filling the memory_fault field
> (downgraded back into the current union) can drop the "has the field
> already been filled?" check/WARN.

That would need to be dropped regardless because it's user-triggered (sadly).

> - [KVM_CAP_USERFAULT_ON_MISSING] The memslot flag check [b] needs to
> be moved back from __gfn_to_pfn_memslot() into
> user_mem_abort()/kvm_handle_error_pfn() since the slot flag-triggered
> fast-gup failures *have* to result in the memory fault exits, and we
> only want to do those in the two SLAT-failure paths (for now).

I'll look at this more closely when I review your series (slowly, slowly getting
there).  There's no right or wrong answer here, it's more a question of what's the
easiest to maintain.

> [a] https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com/
> [b] https://lore.kernel.org/all/20230908222905.1321305-11-amoorthy@google.com/
> 
> > And if we stop being unnecessarily paranoid, KVM_RUN_MEMORY_FAULT_FILLED can also
> > go away.  The flag came about in part because *unconditionally* sanitizing
> > kvm_run.exit_reason at the start of KVM_RUN would break KVM's ABI, as userspace
> > may rely on the exit_reason being preserved when calling back into KVM to complete
> > userspace I/O (or MMIO)[3].  But the goal is purely to avoid exiting with stale
> > memory_fault information, not to sanitize every other existing exit_reason, and
> > that can be achieved by simply making the reset conditional.
> >
> > ...
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 96fc609459e3..d78e97b527e5 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4450,6 +4450,16 @@ static long kvm_vcpu_ioctl(struct file *filp,
> >                                 synchronize_rcu();
> >                         put_pid(oldpid);
> >                 }
> > +
> > +               /*
> > +                * Reset the exit reason if the previous userspace exit was due
> > +                * to a memory fault.  Not all -EFAULT exits are annotated, and
> > +                * so leaving exit_reason set to KVM_EXIT_MEMORY_FAULT could
> > +                * result in feeding userspace stale information.
> > +                */
> > +               if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT)
> > +                       vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
> > +
> >                 r = kvm_arch_vcpu_ioctl_run(vcpu);
> 
> Under my reading of the earlier block I'm not sure why we need to keep
> this around. The original idea behind a canary of this type was to
> avoid stomping on non-memory-fault exits in cases where something
> caused an (ignored) annotated memory fault before the exit could be
> completed. But if the annotations are going to be restricted in
> general to just the page fault paths, then we can just eliminate the
> sentinel check (and just this block) entirely, right?

This isn't a canary, it's to ensure KVM doesn't feed userspace garbage.  As above,
I'm not saying we throw away all of the code for the "optional" paths, I'm saying
that we only commit to 100% accuracy for the paths that the two use cases need to
work, i.e. the page fault handlers.
  
Anish Moorthy Oct. 3, 2023, 10:59 p.m. UTC | #5
On Mon, Oct 2, 2023 at 6:43 PM Sean Christopherson <seanjc@google.com> wrote:
>
> > - I should go drop the patches annotating kvm_vcpu_read/write_page
> > from my series
>
> Hold up on that.  I'd prefer to keep them as there's still value in giving userspace
> debug information.  All I'm proposing is that we would firmly state in the
> documentation that those paths must be treated as informational-only.

Userspace would then need to know whether annotations were performed
from reliable/unreliable paths though, right? That'd imply another
flag bit beyond the current R/W/E bits.

> > - The helper function [a] for filling the memory_fault field
> > (downgraded back into the current union) can drop the "has the field
> > already been filled?" check/WARN.
>
> That would need to be dropped regardless because it's user-triggered (sadly).

Well, the current v5 of the series uses a non-userspace-visible canary -
it seems like there'd still be value in that if we were to keep the
annotations in potentially unreliable spots. Although perhaps that
test failure you noticed [1] is a good counter-argument, since it
shows a known case where a current flow does multiple writes to the
memory_fault member.

[1] https://lore.kernel.org/all/202309141107.30863e9d-oliver.sang@intel.com

> Anyways, don't do anything just yet.

:salutes:
  
Sean Christopherson Oct. 3, 2023, 11:46 p.m. UTC | #6
On Tue, Oct 03, 2023, Anish Moorthy wrote:
> On Mon, Oct 2, 2023 at 6:43 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > - I should go drop the patches annotating kvm_vcpu_read/write_page
> > > from my series
> >
> > Hold up on that.  I'd prefer to keep them as there's still value in giving userspace
> > debug information.  All I'm proposing is that we would firmly state in the
> > documentation that those paths must be treated as informational-only.
> 
> Userspace would then need to know whether annotations were performed
> from reliable/unreliable paths though, right? That'd imply another
> flag bit beyond the current R/W/E bits.

No, what's missing is a guarantee in KVM that every attempt to exit will actually
make it to userspace.  E.g. if a different exit, including another memory_fault
exit, clobbers an attempt to exit, the "unreliable" annotation will never be seen
by userspace.

The only way a KVM_EXIT_MEMORY_FAULT that actually reaches userspace could be
"unreliable" is if something other than a memory_fault exit clobbered the union,
but didn't signal its KVM_EXIT_* reason.  And that would be an egregious bug that
isn't unique to KVM_EXIT_MEMORY_FAULT, i.e. the same data corruption would affect
each and every other KVM_EXIT_* reason.

The "informational only" part is that userspace can't develop features that
*require* KVM to exit.

> > > - The helper function [a] for filling the memory_fault field
> > > (downgraded back into the current union) can drop the "has the field
> > > already been filled?" check/WARN.
> >
> > That would need to be dropped regardless because it's user-triggered (sadly).
> 
> Well the current v5 of the series uses a non-userspace visible canary-
> it seems like there'd still be value in that if we were to keep the
> annotations in potentially unreliable spots. Although perhaps that
> test failure you noticed [1] is a good counter-argument, since it
> shows a known case where a current flow does multiple writes to the
> memory_fault member.

The problem is that anything but a WARN will go unnoticed, and we can't have any
WARNs that are user-triggerable, at least not in upstream.  Internally, we can
and probably should add a canary, and an aggressive one at that, but I can't think
of a sane way to add a canary in upstream while avoiding the known offenders. :-(

> [1] https://lore.kernel.org/all/202309141107.30863e9d-oliver.sang@intel.com
> 
> > Anyways, don't do anything just yet.
> 
> :salutes:

LOL
  
Anish Moorthy Oct. 5, 2023, 10:07 p.m. UTC | #7
On Tue, Oct 3, 2023 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> The only way a KVM_EXIT_MEMORY_FAULT that actually reaches userspace could be
> "unreliable" is if something other than a memory_fault exit clobbered the union,
> but didn't signal its KVM_EXIT_* reason.  And that would be an egregious bug that
> isn't unique to KVM_EXIT_MEMORY_FAULT, i.e. the same data corruption would affect
> each and every other KVM_EXIT_* reason.

Keep in mind the case where an "unreliable" annotation sets up a
KVM_EXIT_MEMORY_FAULT, KVM_RUN ends up continuing, then something
unrelated comes up and causes KVM_RUN to EFAULT. Although this at
least is a case of "outdated" information rather than blatant
corruption.

IIRC the last time this came up we said that there's minimal harm in
userspace acting on the outdated info, but it seems like another good
argument for just restricting the annotations to paths we know are
reliable. What if the second EFAULT above is fatal (as I understand
all are today) and sets up subsequent KVM_RUNs to crash and burn
somehow? Seems like that'd be a safety issue.
  
Sean Christopherson Oct. 5, 2023, 10:46 p.m. UTC | #8
On Thu, Oct 05, 2023, Anish Moorthy wrote:
> On Tue, Oct 3, 2023 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > The only way a KVM_EXIT_MEMORY_FAULT that actually reaches userspace could be
> > "unreliable" is if something other than a memory_fault exit clobbered the union,
> > but didn't signal its KVM_EXIT_* reason.  And that would be an egregious bug that
> > isn't unique to KVM_EXIT_MEMORY_FAULT, i.e. the same data corruption would affect
> > each and every other KVM_EXIT_* reason.
> 
> Keep in mind the case where an "unreliable" annotation sets up a
> KVM_EXIT_MEMORY_FAULT, KVM_RUN ends up continuing, then something
> unrelated comes up and causes KVM_RUN to EFAULT. Although this at
> least is a case of "outdated" information rather than blatant
> corruption.

Drat, I managed to forget about that.

> IIRC the last time this came up we said that there's minimal harm in
> userspace acting on the outdated info, but it seems like another good
> argument for just restricting the annotations to paths we know are
> reliable. What if the second EFAULT above is fatal (as I understand
> all are today) and sets up subsequent KVM_RUNs to crash and burn
> somehow? Seems like that'd be a safety issue.

For your series, let's omit 

  KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page

and just fill memory_fault for the page fault paths.  That will be easier to
document too since we can simply say that if the exit reason is KVM_EXIT_MEMORY_FAULT,
then run->memory_fault is valid and fresh.

Adding a flag or whatever to mark the data as trustworthy would be the alternative,
but that's effectively adding ABI that says "KVM is buggy, sorry".

My dream of having KVM always return useful information for -EFAULT will have to
wait for another day.
  
David Matlack Oct. 10, 2023, 10:21 p.m. UTC | #9
On Thu, Oct 5, 2023 at 3:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Oct 05, 2023, Anish Moorthy wrote:
> > On Tue, Oct 3, 2023 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > The only way a KVM_EXIT_MEMORY_FAULT that actually reaches userspace could be
> > > "unreliable" is if something other than a memory_fault exit clobbered the union,
> > > but didn't signal its KVM_EXIT_* reason.  And that would be an egregious bug that
> > > isn't unique to KVM_EXIT_MEMORY_FAULT, i.e. the same data corruption would affect
> > > each and every other KVM_EXIT_* reason.
> >
> > Keep in mind the case where an "unreliable" annotation sets up a
> > KVM_EXIT_MEMORY_FAULT, KVM_RUN ends up continuing, then something
> > unrelated comes up and causes KVM_RUN to EFAULT. Although this at
> > least is a case of "outdated" information rather than blatant
> > corruption.
>
> Drat, I managed to forget about that.
>
> > IIRC the last time this came up we said that there's minimal harm in
> > userspace acting on the outdated info, but it seems like another good
> > argument for just restricting the annotations to paths we know are
> > reliable. What if the second EFAULT above is fatal (as I understand
> > all are today) and sets up subsequent KVM_RUNs to crash and burn
> > somehow? Seems like that'd be a safety issue.
>
> For your series, let's omit
>
>   KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page
>
> and just fill memory_fault for the page fault paths.  That will be easier to
> document too since we can simply say that if the exit reason is KVM_EXIT_MEMORY_FAULT,
> then run->memory_fault is valid and fresh.

+1

And from a performance perspective, I don't think we care about
kvm_vcpu_read/write_guest_page(). Our (Google) KVM Demand Paging
implementation just sends any kvm_vcpu_read/write_guest_page()
requests through the netlink socket, which is just a poor man's
userfaultfd. So I think we'll be fine sending these callsites through
uffd instead of exiting out to userspace.

And with that out of the way, is there any reason to keep tying
KVM_EXIT_MEMORY_FAULT to -EFAULT? As mentioned in the patch at the top
of this thread, -EFAULT is just a hack to allow the emulator paths to
return out to userspace. But that's no longer necessary. I just find
it odd that some KVM_EXIT_* correspond with KVM_RUN returning an error
and others don't. The exit_reason is sufficient to tell userspace
what's going on and has a firm contract, unlike -EFAULT which anything
KVM calls into can return.

>
> Adding a flag or whatever to mark the data as trustworthy would be the alternative,
> but that's effectively adding ABI that says "KVM is buggy, sorry".
>
> My dream of having KVM always return useful information for -EFAULT will have to
> wait for another day.
  
Sean Christopherson Oct. 13, 2023, 6:45 p.m. UTC | #10
On Tue, Oct 10, 2023, David Matlack wrote:
> On Thu, Oct 5, 2023 at 3:46 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Oct 05, 2023, Anish Moorthy wrote:
> > > On Tue, Oct 3, 2023 at 4:46 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > The only way a KVM_EXIT_MEMORY_FAULT that actually reaches userspace could be
> > > > "unreliable" is if something other than a memory_fault exit clobbered the union,
> > > > but didn't signal its KVM_EXIT_* reason.  And that would be an egregious bug that
> > > > isn't unique to KVM_EXIT_MEMORY_FAULT, i.e. the same data corruption would affect
> > > > each and every other KVM_EXIT_* reason.
> > >
> > > Keep in mind the case where an "unreliable" annotation sets up a
> > > KVM_EXIT_MEMORY_FAULT, KVM_RUN ends up continuing, then something
> > > unrelated comes up and causes KVM_RUN to EFAULT. Although this at
> > > least is a case of "outdated" information rather than blatant
> > > corruption.
> >
> > Drat, I managed to forget about that.
> >
> > > IIRC the last time this came up we said that there's minimal harm in
> > > userspace acting on the outdated info, but it seems like another good
> > > argument for just restricting the annotations to paths we know are
> > > reliable. What if the second EFAULT above is fatal (as I understand
> > > all are today) and sets up subsequent KVM_RUNs to crash and burn
> > > somehow? Seems like that'd be a safety issue.
> >
> > For your series, let's omit
> >
> >   KVM: Annotate -EFAULTs from kvm_vcpu_read/write_guest_page
> >
> > and just fill memory_fault for the page fault paths.  That will be easier to
> > document too since we can simply say that if the exit reason is KVM_EXIT_MEMORY_FAULT,
> > then run->memory_fault is valid and fresh.
> 
> +1
> 
> And from a performance perspective, I don't think we care about
> kvm_vcpu_read/write_guest_page(). Our (Google) KVM Demand Paging
> implementation just sends any kvm_vcpu_read/write_guest_page()
> requests through the netlink socket, which is just a poor man's
> userfaultfd. So I think we'll be fine sending these callsites through
> uffd instead of exiting out to userspace.
> 
> And with that out of the way, is there any reason to keep tying
> KVM_EXIT_MEMORY_FAULT to -EFAULT? As mentioned in the patch at the top
> of this thread, -EFAULT is just a hack to allow the emulator paths to
> return out to userspace. But that's no longer necessary.

Not forcing '0' makes handling other error codes simpler, e.g. if the memory is
poisoned, KVM can simply return -EHWPOISON instead of having to add a flag to
run->memory_fault[*].
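
Roughly, and purely as an illustration (the wrapper below is hypothetical, only
kvm_prepare_memory_fault_exit() comes from this series):

	static int kvm_exit_on_memory_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
					    gpa_t size, bool is_private,
					    bool is_poisoned)
	{
		kvm_prepare_memory_fault_exit(vcpu, gpa, size, false, false,
					      is_private);
		return is_poisoned ? -EHWPOISON : -EFAULT;
	}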

KVM would also have to make returning '0' instead of -EFAULT conditional based on
a capability being enabled.

And again, committing to returning '0' will make it all but impossible to extend
KVM_EXIT_MEMORY_FAULT beyond the page fault handlers.  Well, I suppose we could
have the top level kvm_arch_vcpu_ioctl_run() do

	if (r == -EFAULT && vcpu->kvm->enable_memory_fault_exits &&
	    kvm_run->exit_reason == KVM_EXIT_MEMORY_FAULT)
		r = 0;

but that's quite gross IMO.

> I just find it odd that some KVM_EXIT_* correspond with KVM_RUN returning an
> error and others don't.

FWIW, there is already precedent for run->exit_reason being valid with a non-zero
error code.  E.g. KVM selftests relies on run->exit_reason being preserved when
forcing an immediate exit, which returns -EINTR, not '0'.

	if (kvm_run->immediate_exit) {
		r = -EINTR;
		goto out;
	}

And pre-immediate_exit code that relies on signalling vCPUs is even more explicit
in setting exit_reason with a non-zero errno:

		if (signal_pending(current)) {
			r = -EINTR;
			kvm_run->exit_reason = KVM_EXIT_INTR;
			++vcpu->stat.signal_exits;
		}

I agree that -EFAULT with KVM_EXIT_MEMORY_FAULT *looks* a little odd, but IMO the
existing KVM behavior of returning '0' is actually what's truly odd.  E.g. returning
'0' + KVM_EXIT_MMIO if the guest accesses non-existent memory is downright weird.
KVM_RUN should arguably never return '0', because it can never actual completely
succeed.

> The exit_reason is sufficient to tell userspace what's going on and has a
> firm contract, unlike -EFAULT which anything KVM calls into can return.

Eh, I don't think it lessens the contract in a meaningful way.  KVM is still
contractually obligated to fill run->exit_reason when KVM returns '0', and
userspace will still likely terminate the VM on an undocumented EFAULT/EHWPOISON.

E.g. if KVM has a bug and doesn't return KVM_EXIT_MEMORY_FAULT when handling a
page fault, then odds are very good that the bug would result in KVM returning a
"bare" -EFAULT regardless of whether KVM_EXIT_MEMORY_FAULT is paried with '0' or
-EFAULT.

[*] https://lore.kernel.org/all/ZQHzVOIsesTTysgf@google.com