Message ID | 20231027182217.3615211-10-seanjc@google.com |
---|---|
State | New |
Headers |
Reply-To: Sean Christopherson <seanjc@google.com>
Date: Fri, 27 Oct 2023 11:21:51 -0700
In-Reply-To: <20231027182217.3615211-1-seanjc@google.com>
References: <20231027182217.3615211-1-seanjc@google.com>
Message-ID: <20231027182217.3615211-10-seanjc@google.com>
Subject: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
From: Sean Christopherson <seanjc@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>, Marc Zyngier <maz@kernel.org>, Oliver Upton <oliver.upton@linux.dev>, Huacai Chen <chenhuacai@kernel.org>, Michael Ellerman <mpe@ellerman.id.au>, Anup Patel <anup@brainfault.org>, Paul Walmsley <paul.walmsley@sifive.com>, Palmer Dabbelt <palmer@dabbelt.com>, Albert Ou <aou@eecs.berkeley.edu>, Sean Christopherson <seanjc@google.com>, Alexander Viro <viro@zeniv.linux.org.uk>, Christian Brauner <brauner@kernel.org>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, Andrew Morton <akpm@linux-foundation.org>
Cc: kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Xiaoyao Li <xiaoyao.li@intel.com>, Xu Yilun <yilun.xu@intel.com>, Chao Peng <chao.p.peng@linux.intel.com>, Fuad Tabba <tabba@google.com>, Jarkko Sakkinen <jarkko@kernel.org>, Anish Moorthy <amoorthy@google.com>, David Matlack <dmatlack@google.com>, Yu Zhang <yu.c.zhang@linux.intel.com>, Isaku Yamahata <isaku.yamahata@intel.com>, Mickaël Salaün <mic@digikod.net>, Vlastimil Babka <vbabka@suse.cz>, Vishal Annapurve <vannapurve@google.com>, Ackerley Tng <ackerleytng@google.com>, Maciej Szmigiero <mail@maciej.szmigiero.name>, David Hildenbrand <david@redhat.com>, Quentin Perret <qperret@google.com>, Michael Roth <michael.roth@amd.com>, Wang <wei.w.wang@intel.com>, Liam Merwick <liam.merwick@oracle.com>, Isaku Yamahata <isaku.yamahata@gmail.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
List-ID: <linux-kernel.vger.kernel.org> |
Series |
KVM: guest_memfd() and per-page attributes
|
|
Commit Message
Sean Christopherson
Oct. 27, 2023, 6:21 p.m. UTC
From: Chao Peng <chao.p.peng@linux.intel.com>

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory.  With guest private memory,
there will be two kinds of memory conversions:

  - explicit conversion: happens when the guest explicitly calls into KVM
    to map a range (as private or shared)

  - implicit conversion: happens when the guest attempts to access a gfn
    that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.

Note!  To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'!  Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.

Report the gpa+size instead of a single gfn even though the initial usage
is expected to always report single pages.  It's entirely possible, likely
even, that KVM will someday support sub-page granularity faults, e.g.
Intel's sub-page protection feature allows for additional protections at
128-byte granularity.

Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
Link: https://lore.kernel.org/all/ZQ3AmLO2SYv3DszH@google.com
Cc: Anish Moorthy <amoorthy@google.com>
Cc: David Matlack <dmatlack@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst | 41 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c             |  1 +
 include/linux/kvm_host.h       | 11 +++++++++
 include/uapi/linux/kvm.h       |  8 +++++++
 4 files changed, 61 insertions(+)
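Because the exit reports a byte range rather than a single gfn, userspace that tracks memory per page has to convert [gpa, gpa + size) into the pages it touches. The following is a minimal sketch of that conversion, not code from the series. The 4 KiB page size, the `GUEST_PAGE_SHIFT` macro, the `struct gfn_range` type and the `fault_to_gfn_range()` helper are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB guest pages, as on x86; the patch does not fix a page size. */
#define GUEST_PAGE_SHIFT 12

/* Hypothetical helper type: a half-open range [start, end) of guest frame numbers. */
struct gfn_range {
	uint64_t start;	/* first gfn touched by the fault */
	uint64_t end;	/* one past the last gfn touched by the fault */
};

/*
 * Convert the reported byte range [gpa, gpa + size) into the guest pages it
 * touches.  A sub-page fault (e.g. 128 bytes) still maps to one whole page,
 * and a range that straddles a page boundary maps to two.
 */
static struct gfn_range fault_to_gfn_range(uint64_t gpa, uint64_t size)
{
	struct gfn_range r;

	assert(size > 0);
	assert(gpa <= UINT64_MAX - size);	/* reject address wrap-around */

	r.start = gpa >> GUEST_PAGE_SHIFT;
	/* The last byte of the range decides the last page; end is exclusive. */
	r.end = ((gpa + size - 1) >> GUEST_PAGE_SHIFT) + 1;
	return r;
}
```

With these assumptions, a single-page fault at gpa 0x1000 with size 0x1000 covers only gfn 1, while a 0x100-byte fault starting at 0x1f80 covers gfns 1 and 2.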
Comments
On 10/27/23 20:21, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
>
> Add a new KVM exit type to allow userspace to handle memory faults that
> KVM cannot resolve, but that userspace *may* be able to handle (without
> terminating the guest).
>
> KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
> conversions between private and shared memory.  With guest private memory,
> there will be two kinds of memory conversions:
>
>   - explicit conversion: happens when the guest explicitly calls into KVM
>     to map a range (as private or shared)
>
>   - implicit conversion: happens when the guest attempts to access a gfn
>     that is configured in the "wrong" state (private vs. shared)
>
> On x86 (first architecture to support guest private memory), explicit
> conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
> but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
> as there is (obviously) no hypercall, and there is no guarantee that the
> guest actually intends to convert between private and shared, i.e. what
> KVM thinks is an implicit conversion "request" could actually be the
> result of a guest code bug.
>
> KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
> be implicit conversions.
>
> Note!  To allow for future possibilities where KVM reports
> KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
> fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
> perspective), not '0'!  Due to historical baggage within KVM, exiting to
> userspace with '0' from deep callstacks, e.g. in emulation paths, is
> infeasible as doing so would require a near-complete overhaul of KVM,
> whereas KVM already propagates -errno return codes to userspace even when
> the -errno originated in a low level helper.
>
> Report the gpa+size instead of a single gfn even though the initial usage
> is expected to always report single pages.  It's entirely possible, likely
> even, that KVM will someday support sub-page granularity faults, e.g.
> Intel's sub-page protection feature allows for additional protections at
> 128-byte granularity.
>
> Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
> Link: https://lore.kernel.org/all/ZQ3AmLO2SYv3DszH@google.com
> Cc: Anish Moorthy <amoorthy@google.com>
> Cc: David Matlack <dmatlack@google.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

> ---
>  Documentation/virt/kvm/api.rst | 41 ++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c             |  1 +
>  include/linux/kvm_host.h       | 11 +++++++++
>  include/uapi/linux/kvm.h       |  8 +++++++
>  4 files changed, 61 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index ace984acc125..860216536810 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6723,6 +6723,26 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
> +
> +KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
> +could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
> +guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
> +describes properties of the faulting access that are likely pertinent.
> +Currently, no flags are defined.
> +
> +Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
> +accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
> +or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
> +kvm_run.exit_reason is stale/undefined for all other error numbers.
> +
>  ::
>
>  		/* KVM_EXIT_NOTIFY */
> @@ -7757,6 +7777,27 @@ This capability is aimed to mitigate the threat that malicious VMs can
>  cause CPU stuck (due to event windows don't open up) and make the CPU
>  unavailable to host or other VMs.
>
> +7.34 KVM_CAP_MEMORY_FAULT_INFO
> +------------------------------
> +
> +:Architectures: x86
> +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> +
> +The presence of this capability indicates that KVM_RUN will fill
> +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> +there is a valid memslot but no backing VMA for the corresponding host virtual
> +address.
> +
> +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> +to KVM_EXIT_MEMORY_FAULT.
> +
> +Note: Userspaces which attempt to resolve memory faults so that they can retry
> +KVM_RUN are encouraged to guard against repeatedly receiving the same
> +error/annotated fault.
> +
> +See KVM_EXIT_MEMORY_FAULT for more information.
> +
>  8. Other capabilities.
>  ======================
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6409914428ca..ee3cd8c3c0ef 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4518,6 +4518,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_ENABLE_CAP:
>  	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
>  	case KVM_CAP_IRQFD_RESAMPLE:
> +	case KVM_CAP_MEMORY_FAULT_INFO:
>  		r = 1;
>  		break;
>  	case KVM_CAP_EXIT_HYPERCALL:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4e741ff27af3..96aa930536b1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
> +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +						 gpa_t gpa, gpa_t size)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.size = size;
> +
> +	/* Flags are not (yet) defined or communicated to userspace. */
> +	vcpu->run->memory_fault.flags = 0;
> +}
> +
>  #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index bd1abe067f28..7ae9987b48dd 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -274,6 +274,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -520,6 +521,12 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> @@ -1203,6 +1210,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
>  #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
>  #define KVM_CAP_USER_MEMORY2 230
> +#define KVM_CAP_MEMORY_FAULT_INFO 231
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
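The documentation hunk encourages userspace that resolves a fault and retries KVM_RUN to guard against receiving the same annotated fault over and over. One possible shape for such a guard is sketched below. It is purely illustrative: `struct fault_retry_guard`, `fault_retry_allowed()` and the `max_repeats` policy knob are hypothetical names, not part of KVM or of any VMM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU bookkeeping kept by the VMM; not part of the KVM ABI. */
struct fault_retry_guard {
	uint64_t last_gpa;		/* gpa of the most recent fault */
	uint64_t last_size;		/* size of the most recent fault */
	unsigned int repeats;		/* consecutive identical faults seen so far */
	unsigned int max_repeats;	/* assumed policy: give up beyond this count */
};

/*
 * Record a KVM_EXIT_MEMORY_FAULT for [gpa, gpa + size) and decide whether to
 * try again.  Returns true if userspace should attempt to resolve the fault
 * and re-enter KVM_RUN, false if the same range has now faulted more than
 * max_repeats times in a row and the VMM should stop retrying.
 */
static bool fault_retry_allowed(struct fault_retry_guard *g,
				uint64_t gpa, uint64_t size)
{
	assert(g);

	if (g->repeats > 0 && g->last_gpa == gpa && g->last_size == size) {
		g->repeats++;
	} else {
		/* A different range: restart the count. */
		g->last_gpa = gpa;
		g->last_size = size;
		g->repeats = 1;
	}
	return g->repeats <= g->max_repeats;
}
```

A real VMM would likely also reset the guard once the vCPU makes forward progress (any exit other than a memory fault); that is left out to keep the sketch short.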
On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> [...]
> Note! To allow for future possibilities where KVM reports
> KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
> fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
> perspective), not '0'!

Is "-EHWPOISON" case not considered unresolved, so it is not mentioned here?

> Due to historical baggage within KVM, exiting to
> userspace with '0' from deep callstacks, e.g. in emulation paths, is
> infeasible as doing so would require a near-complete overhaul of KVM,
> whereas KVM already propagates -errno return codes to userspace even when
> the -errno originated in a low level helper.
> [...]

> +7.34 KVM_CAP_MEMORY_FAULT_INFO
> +------------------------------
> +
> +:Architectures: x86
> +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> +
> +The presence of this capability indicates that KVM_RUN will fill
> +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> +there is a valid memslot but no backing VMA for the corresponding host virtual
> +address.
> +
> +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> +to KVM_EXIT_MEMORY_FAULT.

IIUC returning -EFAULT or whatever -errno is sort of KVM internal
implementation.  Is it better to relax the validity of kvm_run.memory_fault when
KVM_RUN returns any -errno?

[...]

> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
> +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +						 gpa_t gpa, gpa_t size)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.size = size;
> +
> +	/* Flags are not (yet) defined or communicated to userspace. */
> +	vcpu->run->memory_fault.flags = 0;
> +}
> +

KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function to
<asm/kvm_host.h>?
On Wed, Nov 01, 2023, Kai Huang wrote:
>
> > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > +------------------------------
> > +
> > +:Architectures: x86
> > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > +
> > +The presence of this capability indicates that KVM_RUN will fill
> > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > +address.
> > +
> > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > +to KVM_EXIT_MEMORY_FAULT.
>
> IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> implementation.

The errno that is returned to userspace is ABI.  In KVM, it's a _very_ poorly
defined ABI for the vast majority of ioctls(), but it's still technically ABI.
KVM gets away with being cavalier with errno because the vast majority of errors
are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
care about the exact errno.

A good example is KVM_RUN with -EINTR; if KVM were to return something other than
-EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.

> Is it better to relax the validity of kvm_run.memory_fault when
> KVM_RUN returns any -errno?

Not unless there's a need to do so, and if there is then we can update the
documentation accordingly.  If KVM's ABI is that kvm_run.memory_fault is valid
for any errno, then KVM would need to purge kvm_run.exit_reason super early in
KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
misinterpreted as KVM_EXIT_MEMORY_FAULT.  And purging exit_reason super early is
subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
pending signal).

https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com

> [...]
>
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> >  /* Max number of entries allowed for each kvm dirty ring */
> >  #define KVM_DIRTY_RING_MAX_ENTRIES 65536
> >
> > +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> > +						 gpa_t gpa, gpa_t size)
> > +{
> > +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > +	vcpu->run->memory_fault.gpa = gpa;
> > +	vcpu->run->memory_fault.size = size;
> > +
> > +	/* Flags are not (yet) defined or communicated to userspace. */
> > +	vcpu->run->memory_fault.flags = 0;
> > +}
> > +
>
> KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function to
> <asm/kvm_host.h>?

I'd prefer to keep it in generic code, as it's highly likely to end up there
sooner than later.  There's a known use case for ARM (exit to userspace on missing
userspace mapping[*]), and I'm guessing pKVM (also ARM) will also utilize this API.

[*] https://lore.kernel.org/all/20230908222905.1321305-8-amoorthy@google.com
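The "if and only if" rule from the documentation, and the -EINTR hazard described above, can be written as a small userspace predicate. This is a sketch under stated assumptions rather than code from the series: `memory_fault_info_valid()` is a hypothetical helper, `SKETCH_KVM_EXIT_MEMORY_FAULT` mirrors the value 38 from the uapi hunk of the patch (real code would use `<linux/kvm.h>`), and the EHWPOISON fallback of 133 is the asm-generic Linux value, used only if `<errno.h>` does not define it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors "#define KVM_EXIT_MEMORY_FAULT 38" from the patch's uapi hunk. */
#define SKETCH_KVM_EXIT_MEMORY_FAULT 38

#ifndef EHWPOISON
#define EHWPOISON 133	/* assumption: asm-generic Linux value, for non-Linux builds */
#endif

/*
 * ret:         return value of the KVM_RUN ioctl
 * err:         errno captured immediately after the ioctl returned
 * exit_reason: kvm_run.exit_reason read from the shared run structure
 *
 * Returns true only when kvm_run.memory_fault may be trusted.  For any other
 * errno (e.g. EINTR from immediate_exit) exit_reason may be stale, so a
 * leftover KVM_EXIT_MEMORY_FAULT value must be ignored.
 */
static bool memory_fault_info_valid(int ret, int err, uint32_t exit_reason)
{
	if (ret != -1)
		return false;
	if (err != EFAULT && err != EHWPOISON)
		return false;
	return exit_reason == SKETCH_KVM_EXIT_MEMORY_FAULT;
}
```

A caller would save errno right after the ioctl, before any other library call can overwrite it, and consult `run->memory_fault` only when this predicate holds.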
On 11/2/2023 1:36 AM, Sean Christopherson wrote:
>> KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function to
>> <asm/kvm_host.h>?
> I'd prefer to keep it in generic code, as it's highly likely to end up there
> sooner than later.  There's a known use case for ARM (exit to userspace on missing
> userspace mapping[*]), and I'm guessing pKVM (also ARM) will also utilize this API.
>
> [*]https://lore.kernel.org/all/20230908222905.1321305-8-amoorthy@google.com

I wonder how this CAP is supposed to be checked in userspace, for guest
memfd case? something like this?

	if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO) &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT)
		abort("unexpected KVM_EXIT_MEMORY_FAULT");

In my implementation of QEMU patches, I find it's unnecessary. When
userspace gets an exit with KVM_EXIT_MEMORY_FAULT, it implies
"KVM_CAP_MEMORY_FAULT_INFO".

So I don't see how it is necessary in this series. Whether it's
necessary or not for [*], I don't have the answer but we can leave the
discussion to that patch series.
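For a VMM that does want to probe the capability up front, the check reduces to one KVM_CHECK_EXTENSION ioctl. The sketch below is an illustration, not QEMU code: `has_memory_fault_info()` is a hypothetical helper, and the `SKETCH_` constants are local copies of the uapi values (KVMIO is 0xAE, KVM_CHECK_EXTENSION is `_IO(KVMIO, 0x03)`, and 231 comes from the patch), defined here only so the snippet builds without a recent `<linux/kvm.h>`.

```c
#include <assert.h>
#include <sys/ioctl.h>

/* Local copies of uapi values; real code would include <linux/kvm.h> instead. */
#define SKETCH_KVMIO				0xAE
#define SKETCH_KVM_CHECK_EXTENSION		_IO(SKETCH_KVMIO, 0x03)
#define SKETCH_KVM_CAP_MEMORY_FAULT_INFO	231	/* value from this patch */

/*
 * kvm_fd is assumed to be an open descriptor for /dev/kvm (or a VM fd on
 * kernels that support KVM_CHECK_EXTENSION on VM fds).
 *
 * Returns 1 if the kernel advertises KVM_CAP_MEMORY_FAULT_INFO, 0 otherwise.
 * Any ioctl failure (bad fd, old kernel) is treated as "not supported".
 */
static int has_memory_fault_info(int kvm_fd)
{
	int r = ioctl(kvm_fd, SKETCH_KVM_CHECK_EXTENSION,
		      (unsigned long)SKETCH_KVM_CAP_MEMORY_FAULT_INFO);

	return r > 0;
}
```

As the reply above points out, a VMM that only reacts to KVM_EXIT_MEMORY_FAULT after the fact can skip the probe, since receiving the exit already implies the capability.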
On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote:
> On Wed, Nov 01, 2023, Kai Huang wrote:
> >
> > > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > > +------------------------------
> > > +
> > > +:Architectures: x86
> > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > +
> > > +The presence of this capability indicates that KVM_RUN will fill
> > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > > +address.
> > > +
> > > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > > +to KVM_EXIT_MEMORY_FAULT.
> >
> > IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> > implementation.
>
> The errno that is returned to userspace is ABI.  In KVM, it's a _very_ poorly
> defined ABI for the vast majority of ioctls(), but it's still technically ABI.
> KVM gets away with being cavalier with errno because the vast majority of errors
> are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
> care about the exact errno.
>
> A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.
>
> > Is it better to relax the validity of kvm_run.memory_fault when
> > KVM_RUN returns any -errno?
>
> Not unless there's a need to do so, and if there is then we can update the
> documentation accordingly.  If KVM's ABI is that kvm_run.memory_fault is valid
> for any errno, then KVM would need to purge kvm_run.exit_reason super early in
> KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
> misinterpreted as KVM_EXIT_MEMORY_FAULT.  And purging exit_reason super early is
> subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
> reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
> pending signal).
>
> https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
>

Agreed on not relaxing this to any errno.  However, using -EFAULT as part of the
ABI definition seems a little bit dangerous, e.g., someone could accidentally or
mistakenly return -EFAULT in KVM_RUN at an early point and/or in a completely
different code path, etc.  -EINTR has a well-defined meaning, but -EFAULT (which
is "Bad address") doesn't seem to, though I am not sure either. :-)

One example: for a backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns
KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when the SGX vepc
fault handler failed to allocate EPC) and kvm_handle_error_pfn() will just
return -EFAULT.  If kvm_run.exit_reason isn't purged early, is it possible
to have some issue here?
On Thu, 2023-11-02 at 03:17 +0000, Huang, Kai wrote:
> On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote:
> > On Wed, Nov 01, 2023, Kai Huang wrote:
> > >
> > > > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > > > +------------------------------
> > > > +
> > > > +:Architectures: x86
> > > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > > +
> > > > +The presence of this capability indicates that KVM_RUN will fill
> > > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > > > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > > > +address.
> > > > +
> > > > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > > > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > > > +to KVM_EXIT_MEMORY_FAULT.
> > >
> > > IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> > > implementation.
> >
> > The errno that is returned to userspace is ABI.  In KVM, it's a _very_ poorly
> > defined ABI for the vast majority of ioctls(), but it's still technically ABI.
> > KVM gets away with being cavalier with errno because the vast majority of errors
> > are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
> > care about the exact errno.
> >
> > A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> > -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.
> >
> > > Is it better to relax the validity of kvm_run.memory_fault when
> > > KVM_RUN returns any -errno?
> >
> > Not unless there's a need to do so, and if there is then we can update the
> > documentation accordingly.  If KVM's ABI is that kvm_run.memory_fault is valid
> > for any errno, then KVM would need to purge kvm_run.exit_reason super early in
> > KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
> > misinterpreted as KVM_EXIT_MEMORY_FAULT.  And purging exit_reason super early is
> > subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
> > reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
> > pending signal).
> >
> > https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
> >
>
> Agreed with not to relax to any errno.  However using -EFAULT as part of ABI
> definition seems a little bit dangerous, e.g., someone could accidentally or
> mistakenly return -EFAULT in KVM_RUN at early time and/or in a completely
> different code path, etc.  -EINTR has well defined meaning, but -EFAULT (which
> is "Bad address") seems doesn't but I am not sure either. :-)
>
> One example is, for backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns
> KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when SGX vepc
> fault handler failed to allocate EPC) and kvm_handle_error_pfn() will just
> return -EFAULT.  If kvm_run.exit_reason isn't purged early then is it possible
> to have some issue here?
>

Also, regardless of whether -EFAULT is too ambiguous to be part of the ABI, could
you elaborate on the EHWPOISON part?  IIUC KVM can already handle the case of a
poisoned page by sending a signal to the user app:

	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
					struct kvm_page_fault *fault)
	{
		...

		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
			kvm_send_hwpoison_signal(fault->slot, fault->gfn);
			return RET_PF_RETRY;
		}
	}

And (sorry to hijack) I am wondering whether "SGX vepc unable to allocate EPC"
can also use this memory_fault mechanism.  Currently, as mentioned above, when
the vepc fault handler cannot allocate an EPC page, KVM returns -EFAULT to Qemu,
and Qemu prints ...

	...: Bad address
	<dump guest cpu registers>

... which is nonsense.

It would help if we could use memory_fault.flags (or is 'fault_reason' a better
name?) to carry a specific value for EPC, so that Qemu knows and can then do
more reasonable things.
On 11/1/23 18:36, Sean Christopherson wrote:
> A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.

And dually if KVM were to return KVM_EXIT_INTR together with something
other than -EINTR.

> And purging exit_reason super early is subtly tricky because KVM's
> (again, poorly documented) ABI is that *some* exit reasons are preserved
> across KVM_RUN with vcpu->run->immediate_exit (or with a pending
> signal). https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com

vcpu->run->immediate_exit preserves all exit reasons, but it's not a good
idea for immediate_exit to behave differently from a pending signal on entry
to KVM_RUN (remember that immediate_exit was meant to be a better-performing
alternative to KVM_SET_SIGNAL_MASK).

In principle, vcpu->run->immediate_exit could return KVM_EXIT_INTR
(perhaps even _should_, except that breaks selftests so at this point it
is ABI).

Paolo
On 11/2/23 10:35, Huang, Kai wrote:
> IIUC KVM can already handle the case of poisoned
> page by sending signal to user app:
>
> 	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
> 					struct kvm_page_fault *fault)
> 	{
> 		...
>
> 		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> 			kvm_send_hwpoison_signal(fault->slot, fault->gfn);
> 			return RET_PF_RETRY;
> 		}
> 	}

EHWPOISON is not implemented by this series, so it should be left out of
the documentation.

> Currently as mentioned above when
> vepc fault handler cannot allocate EPC page KVM returns -EFAULT to Qemu, and
> Qemu prints ...
>
> 	...: Bad address
> 	<dump guest cpu registers>
>
> ... which is nonsense.
>
> If we can use memory_fault.flags (or is 'fault_reason' a better name?) to carry
> a specific value for EPC to let Qemu know and Qemu can then do more reasonable
> things.

Yes, that's a good idea that can be implemented on top.

Paolo
On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> On 11/2/23 10:35, Huang, Kai wrote:
> > IIUC KVM can already handle the case of poisoned
> > page by sending signal to user app:
> >
> > 	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct
> > kvm_page_fault *fault) {
> > 		...
> >
> > 		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> > 			kvm_send_hwpoison_signal(fault->slot, fault->gfn);

No, this doesn't work, because that signals the host virtual address

	unsigned long hva = gfn_to_hva_memslot(slot, gfn);

	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current);

which is the *shared* page.

> > 			return RET_PF_RETRY;
> > 		}
> > 	}
>
> EHWPOISON is not implemented by this series, so it should be left out of the
> documentation.

EHWPOISON *is* implemented.  kvm_gmem_get_pfn() returns -EHWPOISON as appropriate,
and kvm_faultin_pfn() returns that directly without going through kvm_handle_error_pfn().

  kvm_faultin_pfn_private()
  |
  |-> kvm_gmem_get_pfn()
      |
      |-> if (folio_test_hwpoison(folio)) {
                  r = -EHWPOISON;
                  goto out_unlock;
          }

          |
          |-> r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
                                   &max_order);
              if (r) {
                      kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
                      return r;
              }

              |
              |-> ret = __kvm_faultin_pfn(vcpu, fault);
                  if (ret != RET_PF_CONTINUE)
                          return ret;

                  if (unlikely(is_error_pfn(fault->pfn)))
                          return kvm_handle_error_pfn(vcpu, fault);
On Thu, Nov 02, 2023, Xiaoyao Li wrote:
> On 11/2/2023 1:36 AM, Sean Christopherson wrote:
> > > KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function to
> > > <asm/kvm_host.h>?
> > I'd prefer to keep it in generic code, as it's highly likely to end up there
> > sooner than later.  There's a known use case for ARM (exit to userspace on missing
> > userspace mapping[*]), and I'm guessing pKVM (also ARM) will also utilize this API.
> >
> > [*]https://lore.kernel.org/all/20230908222905.1321305-8-amoorthy@google.com
>
> I wonder how this CAP is supposed to be checked in userspace, for guest
> memfd case?

It's basically useless for guest_memfd.

> 	if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO) &&
> 	    run->exit_reason == KVM_EXIT_MEMORY_FAULT)
> 		abort("unexpected KVM_EXIT_MEMORY_FAULT");
>
> In my implementation of QEMU patches, I find it's unnecessary. When
> userspace gets an exit with KVM_EXIT_MEMORY_FAULT, it implies
> "KVM_CAP_MEMORY_FAULT_INFO".
>
> So I don't see how it is necessary in this series. Whether it's necessary or
> not for [*], I don't have the answer but we can leave the discussion to that
> patch series.

It's not strictly necessary there either.

However, Oliver felt (and presumably still feels) quite strongly, and I agree,
that reporting extra information shouldn't be tightly coupled to either
KVM_CAP_EXIT_ON_MISSING or KVM_CAP_GUEST_MEMFD.

E.g. if userspace develops a "standalone" use case for KVM_CAP_MEMORY_FAULT_INFO,
userspace should be able to check for support without having to take a dependency
on KVM_CAP_GUEST_MEMFD, especially because KVM_CAP_GUEST_MEMFD may not be
supported, i.e. userspace should be able to do:

	if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO))
		abort("KVM_CAP_MEMORY_FAULT_INFO required for fancy feature XYZ");
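For readers not using QEMU's kvm_check_extension() helper, the standalone check
described above can be sketched directly on top of the KVM_CHECK_EXTENSION
ioctl. This is only an illustration, not code from the series: the helper name
have_memory_fault_info() is made up, and the fallback value 231 is copied from
the patch for the case where the installed <linux/kvm.h> predates it.

```c
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Older uapi headers do not know the capability; 231 is the value in the patch. */
#ifndef KVM_CAP_MEMORY_FAULT_INFO
#define KVM_CAP_MEMORY_FAULT_INFO 231
#endif

/*
 * Hypothetical helper: returns true if the kernel behind 'fd' (a /dev/kvm
 * or VM file descriptor) advertises KVM_CAP_MEMORY_FAULT_INFO.
 * KVM_CHECK_EXTENSION returns 0 for an unsupported capability, a positive
 * value for a supported one, and -1 on error (e.g. a bad descriptor).
 */
static bool have_memory_fault_info(int fd)
{
	int r = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_MEMORY_FAULT_INFO);

	return r > 0;
}
```

A VMM that needs the annotated exits for some feature would call this once at
startup and refuse to enable the feature if it returns false, without having to
probe KVM_CAP_GUEST_MEMFD at all.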
On Thu, Nov 02, 2023, Kai Huang wrote:
> On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote:
> > On Wed, Nov 01, 2023, Kai Huang wrote:
> > >
> > > > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > > > +------------------------------
> > > > +
> > > > +:Architectures: x86
> > > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > > +
> > > > +The presence of this capability indicates that KVM_RUN will fill
> > > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > > > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > > > +address.
> > > > +
> > > > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > > > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > > > +to KVM_EXIT_MEMORY_FAULT.
> > >
> > > IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> > > implementation.
> >
> > The errno that is returned to userspace is ABI.  In KVM, it's a _very_ poorly
> > defined ABI for the vast majority of ioctls(), but it's still technically ABI.
> > KVM gets away with being cavalier with errno because the vast majority of errors
> > are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
> > care about the exact errno.
> >
> > A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> > -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.
> >
> > > Is it better to relax the validity of kvm_run.memory_fault when
> > > KVM_RUN returns any -errno?
> >
> > Not unless there's a need to do so, and if there is then we can update the
> > documentation accordingly.  If KVM's ABI is that kvm_run.memory_fault is valid
> > for any errno, then KVM would need to purge kvm_run.exit_reason super early in
> > KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
> > misinterpreted as KVM_EXIT_MEMORY_FAULT.  And purging exit_reason super early is
> > subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
> > reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
> > pending signal).
> >
> > https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
> >
>
> Agreed with not to relax to any errno.  However using -EFAULT as part of ABI
> definition seems a little bit dangerous, e.g., someone could accidentally or
> mistakenly return -EFAULT in KVM_RUN at early time and/or in a completely
> different code path, etc.  -EINTR has well defined meaning, but -EFAULT (which
> is "Bad address") seems doesn't but I am not sure either. :-)

KVM has returned -EFAULT since forever, i.e. it's effectively already part of the
ABI.  I doubt there's a userspace that relies precisely on -EFAULT, but userspace
definitely will be confused if KVM returns '0' where KVM used to return -EFAULT.
And so if we want to return '0', it needs to be opt-in, which means forcing
userspace to enable a capability *and* requires code in KVM to conditionally return
'0' instead of -EFAULT/-EHWPOISON.

> One example is, for backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns
> KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when SGX vepc
> fault handler failed to allocate EPC) and kvm_handle_error_pfn() will just
> return -EFAULT.  If kvm_run.exit_reason isn't purged early then is it possible
> to have some issue here?

Well, yeah, but that's exactly why this series has a patch to reset exit_reason.
The solution to "if KVM is buggy then bad things happen" is to not have KVM bugs :-)
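The validity rule being argued over here (memory_fault is meaningful only when
KVM_RUN fails with EFAULT or EHWPOISON *and* exit_reason is
KVM_EXIT_MEMORY_FAULT) can be written down as a small userspace predicate. The
following is a sketch under stated assumptions, not code from the series:
struct sketch_run is a hypothetical stand-in for the relevant fields of struct
kvm_run, SKETCH_KVM_EXIT_MEMORY_FAULT copies the value 38 from the patch, and
the EHWPOISON fallback of 133 is the asm-generic value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Linux-specific errno; 133 is the asm-generic value, used only as a fallback. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/* Exit reason number proposed by the patch under discussion. */
#define SKETCH_KVM_EXIT_MEMORY_FAULT 38

/* Hypothetical stand-in for the parts of struct kvm_run used here. */
struct sketch_run {
	uint32_t exit_reason;
	struct {
		uint64_t flags;
		uint64_t gpa;
		uint64_t size;
	} memory_fault;
};

/*
 * 'ret' is the return value of ioctl(vcpu_fd, KVM_RUN) and 'err' is the errno
 * observed right after it.  For any other combination, exit_reason may be
 * stale (e.g. an -EINTR return), so memory_fault must not be consumed.
 */
static bool memory_fault_info_valid(int ret, int err, const struct sketch_run *run)
{
	assert(run != NULL);

	if (ret != -1)
		return false;
	if (err != EFAULT && err != EHWPOISON)
		return false;
	return run->exit_reason == SKETCH_KVM_EXIT_MEMORY_FAULT;
}
```

Checking errno before exit_reason is the point of the "if and only if" wording:
an -EINTR return can leave a previous KVM_EXIT_MEMORY_FAULT in exit_reason, and
this predicate rejects that case.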
On Thu, 2023-11-02 at 08:44 -0700, Sean Christopherson wrote:
> On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> > On 11/2/23 10:35, Huang, Kai wrote:
> > > IIUC KVM can already handle the case of poisoned
> > > page by sending signal to user app:
> > >
> > > 	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct
> > > kvm_page_fault *fault) {
> > > 		...
> > >
> > > 		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> > > 			kvm_send_hwpoison_signal(fault->slot, fault->gfn);
>
> No, this doesn't work, because that signals the host virtual address

Ah, right :-)
On Fri, Oct 27, 2023 at 11:21:51AM -0700, Sean Christopherson wrote:
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6723,6 +6723,26 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
                    ^
Should update to "memory_fault" to align with other places.

[...]

> @@ -520,6 +521,12 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};

Thanks,
Yilun
>
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ace984acc125..860216536810 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6723,6 +6723,26 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
+describes properties of the faulting access that are likely pertinent.
+Currently, no flags are defined.
+
+Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
 ::
 
 		/* KVM_EXIT_NOTIFY */
@@ -7757,6 +7777,27 @@ This capability is aimed to mitigate the threat that malicious VMs can cause
 CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
+See KVM_EXIT_MEMORY_FAULT for more information.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6409914428ca..ee3cd8c3c0ef 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4518,6 +4518,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_IRQFD_RESAMPLE:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4e741ff27af3..96aa930536b1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+						 gpa_t gpa, gpa_t size)
+{
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+	vcpu->run->memory_fault.gpa = gpa;
+	vcpu->run->memory_fault.size = size;
+
+	/* Flags are not (yet) defined or communicated to userspace. */
+	vcpu->run->memory_fault.flags = 0;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index bd1abe067f28..7ae9987b48dd 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -274,6 +274,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -520,6 +521,12 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1203,6 +1210,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_USER_MEMORY2 230
+#define KVM_CAP_MEMORY_FAULT_INFO 231
 
 #ifdef KVM_CAP_IRQ_ROUTING
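The documentation added by this patch encourages userspace that resolves a
fault and retries KVM_RUN to guard against receiving the same annotated fault
over and over. One possible shape for such a guard is sketched below. Nothing
here comes from the patch itself: the struct, the function name and the retry
limit of 3 are all illustrative assumptions, and a real VMM would pick its own
policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary illustrative limit on consecutive retries of the same range. */
#define SKETCH_MAX_SAME_FAULT_RETRIES 3

/* Hypothetical per-vCPU bookkeeping; zero-initialize before first use. */
struct sketch_fault_guard {
	uint64_t gpa;
	uint64_t size;
	unsigned int repeats;	/* consecutive repeats of [gpa, gpa + size) */
	bool seen;		/* false until the first fault is recorded */
};

/*
 * Record a fault reported in kvm_run.memory_fault.  Returns true if userspace
 * should try to resolve the fault and re-enter KVM_RUN, false once the same
 * range has repeated SKETCH_MAX_SAME_FAULT_RETRIES times in a row.
 */
static bool fault_guard_should_retry(struct sketch_fault_guard *g,
				     uint64_t gpa, uint64_t size)
{
	assert(g != NULL);

	if (g->seen && g->gpa == gpa && g->size == size) {
		g->repeats++;
	} else {
		/* A different range (or the first fault) restarts the count. */
		g->gpa = gpa;
		g->size = size;
		g->repeats = 0;
		g->seen = true;
	}
	return g->repeats < SKETCH_MAX_SAME_FAULT_RETRIES;
}
```

A caller would typically also zero the guard whenever KVM_RUN returns for any
other reason, so that two unrelated faults on the same range, separated by
forward progress, are not counted as a loop.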