Message ID | 20230505152046.6575-1-mic@digikod.net |
---|---|
Headers |
From: Mickaël Salaün <mic@digikod.net>
To: Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, "H. Peter Anvin" <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>, Kees Cook <keescook@chromium.org>, Paolo Bonzini <pbonzini@redhat.com>, Sean Christopherson <seanjc@google.com>, Thomas Gleixner <tglx@linutronix.de>, Vitaly Kuznetsov <vkuznets@redhat.com>, Wanpeng Li <wanpengli@tencent.com>
Cc: Mickaël Salaün <mic@digikod.net>, Alexander Graf <graf@amazon.com>, Forrest Yuan Yu <yuanyu@google.com>, James Morris <jamorris@linux.microsoft.com>, John Andersen <john.s.andersen@intel.com>, Liran Alon <liran.alon@oracle.com>, "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>, Marian Rotariu <marian.c.rotariu@gmail.com>, Mihai Donțu <mdontu@bitdefender.com>, Nicușor Cîțu <nicu.citu@icloud.com>, Rick Edgecombe <rick.p.edgecombe@intel.com>, Thara Gopinath <tgopinath@microsoft.com>, Will Deacon <will@kernel.org>, Zahra Tarkhani <ztarkhani@microsoft.com>, Ștefan Șicleru <ssicleru@bitdefender.com>, dev@lists.cloudhypervisor.org, kvm@vger.kernel.org, linux-hardening@vger.kernel.org, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-security-module@vger.kernel.org, qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org, x86@kernel.org, xen-devel@lists.xenproject.org
Subject: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity
Date: Fri, 5 May 2023 17:20:37 +0200
Message-Id: <20230505152046.6575-1-mic@digikod.net> |
Series | Hypervisor-Enforced Kernel Integrity |
Message
Mickaël Salaün
May 5, 2023, 3:20 p.m. UTC
Hi,

This patch series is a proof-of-concept that implements new KVM features (extended page tracking, MBEC support, CR pinning) and defines a new API to protect guest VMs. No VMM (e.g., Qemu) modification is required.

The main idea is that kernel self-protection mechanisms should be delegated to a more privileged part of the system, hence the hypervisor. It is still the role of the guest kernel to request such restrictions according to its configuration. The high-level security guarantees provided by the hypervisor are semantically the same as a subset of those the kernel already enforces on itself (CR pinning hardening and memory page table protections), but with much stronger enforcement.

We'd like the mainline kernel to support such hardening features leveraging virtualization. We're looking for reviews and comments that can help mainline these two parts: the KVM implementation, and the guest kernel API layer designed to support different hypervisors. The struct heki_hypervisor makes it possible to plug in different backend implementations, which are initialized with the heki_early_init() and heki_late_init() calls.

This RFC is an initial call for collaboration. There is a lot to do, on the hypervisor, guest kernel, and VMM sides. We took inspiration from previous patches, mainly the KVMI [1] [2] and KVM CR-pinning [3] series: we revamped and simplified the relevant parts to fit our goal, added support for MBEC to enable a deny-by-default kernel execution policy (e.g., write xor execute), added two hypercalls, and created a kernel API for VMs to request protection in a generic way that any hypervisor can leverage.

This proof-of-concept is named Hypervisor-Enforced Kernel Integrity (Heki), which reflects the goal of empowering guest kernels to protect themselves. The name is new to the kernel and makes it easy to identify the new code required for this set of features.

This patch series is based on Linux 6.2 and requires the host to support MBEC.
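The grep check below can also be done programmatically; here is a minimal user-space sketch (the helper name is ours, only the ept_mode_based_exec flag name comes from the series). Whole-token matching matters, since a substring search would happily match "ept" inside "ept_mode_based_exec":

```c
#include <string.h>

/* Whole-token match of `flag` in a space-separated cpuinfo flag list,
 * so that "ept" does not match inside "ept_mode_based_exec". */
int cpuinfo_has_flag(const char *flags_line, const char *flag)
{
	size_t flen = strlen(flag);
	const char *p = flags_line;

	while ((p = strstr(p, flag)) != NULL) {
		int starts = (p == flags_line) || (p[-1] == ' ');
		int ends = (p[flen] == '\0') || (p[flen] == ' ') ||
			   (p[flen] == '\n');
		if (starts && ends)
			return 1;
		p += flen;
	}
	return 0;
}
```

In practice this would run over the "flags" line read from /proc/cpuinfo; it is shown here on a literal string.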
This can easily be checked with:

    grep ept_mode_based_exec /proc/cpuinfo

You can test it by enabling CONFIG_HEKI, CONFIG_HEKI_TEST, and CONFIG_KUNIT_DEFAULT_ENABLED, and adding the heki_test=N boot argument to the guest, as explained in the last patch. Another way to test it is to try to load a kernel module in the guest: you'll see KVM create synthetic page faults. This only works with a bare-metal machine as the KVM host; nested virtualization is not supported yet.

# Threat model

The initial threat model is a malicious user space process exploiting a kernel vulnerability to gain more privileges or to bypass the access-control system. An extended threat model could include attacks coming from network or storage data (e.g., a malformed network packet, inconsistent drive content).

Considering all the potential ways to compromise a kernel, Heki's goal is to harden a sane kernel before a runtime attack, to make such an attack more difficult and, ideally, to make it fail. We consider the kernel itself to be partially malicious during its lifetime, e.g., because of a ROP attack that could disable kernel self-protection mechanisms and make kernel exploitation much easier. Indeed, an exploit is often split into several stages, each bypassing some security measures. The guarantee that adding new kernel executable code is not possible increases the cost of an attack, hopefully to the point that it is not worth it. To protect against persistent attacks, complementary security mechanisms should be used (e.g., kernel module signing, IMA, IPE, Lockdown).

# Prerequisites

For this set of features to be useful, guest kernels must be trusted by the VM owners at boot time, before launching any user space processes or receiving potentially malicious network packets. A security mechanism is therefore required to provide or check this initial trust (e.g., secure boot, kernel module signing).

# How does it work?
This implementation mainly leverages KVM capabilities to control the Second Layer Address Translation (also called Two-Dimensional Paging, e.g., Intel's EPT or AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC), introduced with the Kaby Lake (7th generation) architecture. This makes it possible to set permissions on memory pages in a way that complements the guest kernel's own memory permissions. Once these permissions are set, they are locked and there is no way back.

A first hypercall, KVM_HC_LOCK_MEM_PAGE_RANGES, enables the guest kernel to lock a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE or the HEKI_ATTR_MEM_EXEC attribute. The first denies write access to a specific set of pages (an allow-list approach), and the second allows kernel execution only for a set of pages (a deny-list approach).

The current implementation sets the whole kernel's .rodata (i.e., any const or __ro_after_init variables, which includes critical security data such as LSM parameters) and .text sections as non-writable, and the .text section is the only one where kernel execution is allowed. This is possible thanks to the new MBEC support also brought by this series (otherwise the vDSO would have to be executable). Thanks to this hardware support (VT-x, EPT and MBEC), the performance impact of such guest protection is negligible.

A second hypercall, KVM_HC_LOCK_CR_UPDATE, enables a guest to pin some of its CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP), which is another complementary hardening mechanism.

Heki can be enabled with the heki=1 boot command argument.

# Similar implementations

Here is a non-exhaustive list of similar implementations that we looked at and took ideas from. Linux mainline doesn't support such security features; let's change that!
Windows's Virtualization-Based Security is a proprietary technology that provides a superset of this kind of security mechanism. It relies on Hyper-V and Virtual Trust Levels, which make it possible to have a light and secure VM enforcing restrictions on a full guest VM. This includes several components, such as HVCI, which is in charge of code authenticity, and HyperGuard, which monitors and protects kernel code and data.

Samsung's Real-time Kernel Protection (RKP) and Huawei's Hypervisor Execution Environment (HHEE) rely on proprietary hypervisors to protect some Android devices. They monitor critical kernel data (e.g., page tables, credentials, selinux_enforcing).

The iOS Kernel Patch Protection is a proprietary solution that relies on a secure enclave (a dedicated hardware component) to monitor and protect critical parts of the kernel.

Bitdefender's Hypervisor Memory Introspection (HVMI) is an open-source (but out-of-tree) set of components leveraging virtualization. The HVMI implementation is very complex, and this approach implies potential semantic gap issues (i.e., kernel data structures may change from one version to another).

Linux Kernel Runtime Guard is an open-source kernel module that can detect some illegitimate modifications of kernel data. Because it runs in the same kernel as the compromised one, an attacker could also bypass or disable these checks.

Intel's Virtualization Based Hardening [4] [5] is an open-source proof-of-concept of a thin hypervisor dedicated to guest protection. As such, it cannot be used to manage several VMs.

# Similar Linux patches

The VM introspection [1] [2] patch series proposed a set of features to place probes in and introspect VMs for debugging and security reasons. We changed and included the prewrite page tracking and the fault_gva parts. Heki is much simpler because it focuses on guest hardening, not introspection.

Paravirtualized Control Register pinning [3] added a set of KVM IOCTLs to pin some control register flags.
Heki doesn't implement such a user space interface; it only provides a dedicated hypercall to lock such registers. A superset of these flags is configurable with Heki.

The Hypervisor Based Integrity patches [6] [7] only contain a generic IPC mechanism (the KVM_HC_UCALL hypercall) to request protection from the VMM. The idea was to extend the KVM_SET_USER_MEMORY_REGION IOCTL to support more permissions than read-only.

# Current limitations

The main limitation of this patch series is the statically enforced permissions. This is not an issue for kernels without modules, but it needs to be addressed. Mechanisms that dynamically impact kernel executable memory are not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such code will need to be authenticated. Because the hypervisor is highly privileged and critical to the security of all the VMs, we don't want to implement a code authentication mechanism in the hypervisor itself but to delegate this verification to something much less privileged. We are thinking of two ways to solve this: implement this verification in the VMM, or spawn a dedicated special VM (similar to Windows's VBS). There are pros and cons to each approach: complexity, verification code ownership (guest's or VMM's), and access to guest memory (i.e., confidential computing).

Because the guest's virtual address translation is not protected by the hypervisor, a compromised kernel could map existing physical pages at arbitrary virtual addresses. Intel's new Hypervisor-Managed Linear Address Translation [8] (HLAT) could be used to extend the current protection and cover this case.

ROP is not covered by this patch series. Guest kernels can still jump to arbitrary executable pages, subject to their own control-flow integrity protection.

# Future work

We think this kind of restriction can be leveraged to log attempts at forbidden actions. Forwarding such signals to the VMM could help improve attack detection.
Giving visibility to the VMM would also make it possible to migrate VMs.

New dynamic restrictions could extend the protected data to other security-sensitive state such as LSM states, seccomp filters, and keyrings. This requires support outside of the hypervisor.

An execute-only mode could also be useful (cf. XOM for KVM [9] [10]).

Register pinning could be extended (e.g., to MSRs).

Being able to protect nested guests might be possible, but we need to figure out the potential security implications.

Protecting the host would be useful, but that doesn't really fit with the KVM model. The Protected KVM project is a first step in this direction [11].

We only tested this with an Intel CPU, but this approach should work the same with AMD CPUs starting with the Zen 2 generation and their Guest Mode Execute Trap (GMET) capability.

We also kept some TODOs to highlight missing checks and code sharing issues, and some pr_warn() calls to help understand how it works. Tests need to be improved (e.g., invalid hypercall arguments).

We'll present this work at the Linux Security Summit North America next week.

[1] https://lore.kernel.org/all/20211006173113.26445-1-alazar@bitdefender.com/
[2] https://www.linux-kvm.org/images/7/72/KVMForum2017_Introspection.pdf
[3] https://lore.kernel.org/all/20200617190757.27081-1-john.s.andersen@intel.com/
[4] https://github.com/intel/vbh
[5] https://sched.co/TmwN
[6] https://sched.co/eE3f
[7] https://lore.kernel.org/all/20200501185147.208192-1-yuanyu@google.com/
[8] https://sched.co/eE4F
[9] https://lore.kernel.org/kvm/20191003212400.31130-1-rick.p.edgecombe@intel.com/
[10] https://lpc.events/event/4/contributions/283/
[11] https://sched.co/eE24

Please reach out to us by replying to this thread; we're looking for people to join and collaborate on this project!

Regards,

Madhavan T. Venkataraman (2):
  virt: Implement Heki common code
  KVM: x86: Add Heki hypervisor support

Mickaël Salaün (7):
  KVM: x86: Add kvm_x86_ops.fault_gva()
  KVM: x86/mmu: Add support for prewrite page tracking
  KVM: x86: Add new hypercall to set EPT permissions
  KVM: x86: Add new hypercall to lock control registers
  KVM: VMX: Add MBEC support
  KVM: x86/mmu: Enable guests to lock themselves thanks to MBEC
  virt: Add Heki KUnit tests

 Documentation/virt/kvm/x86/hypercalls.rst |  34 +++
 Kconfig                                   |   2 +
 arch/x86/Kconfig                          |   1 +
 arch/x86/include/asm/kvm-x86-ops.h        |   1 +
 arch/x86/include/asm/kvm_host.h           |   2 +
 arch/x86/include/asm/kvm_page_track.h     |  12 +
 arch/x86/include/asm/sections.h           |   4 +
 arch/x86/include/asm/vmx.h                |  11 +-
 arch/x86/include/asm/x86_init.h           |   2 +
 arch/x86/kernel/cpu/common.c              |   2 +-
 arch/x86/kernel/cpu/hypervisor.c          |   1 +
 arch/x86/kernel/kvm.c                     |  72 +++++
 arch/x86/kernel/setup.c                   |  49 +++
 arch/x86/kernel/x86_init.c                |   1 +
 arch/x86/kvm/Kconfig                      |   1 +
 arch/x86/kvm/mmu.h                        |   3 +-
 arch/x86/kvm/mmu/mmu.c                    | 105 ++++++-
 arch/x86/kvm/mmu/mmutrace.h               |  11 +-
 arch/x86/kvm/mmu/page_track.c             |  33 +-
 arch/x86/kvm/mmu/paging_tmpl.h            |  16 +-
 arch/x86/kvm/mmu/spte.c                   |  29 +-
 arch/x86/kvm/mmu/spte.h                   |  15 +-
 arch/x86/kvm/mmu/tdp_mmu.c                |  73 +++++
 arch/x86/kvm/mmu/tdp_mmu.h                |   4 +
 arch/x86/kvm/svm/svm.c                    |   9 +
 arch/x86/kvm/vmx/capabilities.h           |   7 +
 arch/x86/kvm/vmx/nested.c                 |   7 +
 arch/x86/kvm/vmx/vmx.c                    |  48 ++-
 arch/x86/kvm/vmx/vmx.h                    |   1 +
 arch/x86/kvm/x86.c                        | 352 +++++++++++++++++++++-
 arch/x86/kvm/x86.h                        |  23 ++
 include/linux/heki.h                      |  90 ++++++
 include/linux/kvm_host.h                  |  20 ++
 include/uapi/linux/kvm_para.h             |   2 +
 init/main.c                               |   3 +
 virt/Makefile                             |   1 +
 virt/heki/Kconfig                         |  41 +++
 virt/heki/Makefile                        |   3 +
 virt/heki/heki.c                          | 321 ++++++++++++++++++++
 virt/kvm/kvm_main.c                       |   5 +
 40 files changed, 1377 insertions(+), 40 deletions(-)
 create mode 100644 include/linux/heki.h
 create mode 100644 virt/heki/Kconfig
 create mode 100644 virt/heki/Makefile
 create mode 100644 virt/heki/heki.c

base-commit: c9c3395d5e3dcc6daee66c6908354d47bf98cb0c
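Editor's note: to make the guest-facing interface described in the cover letter concrete, here is a small user-space model of the two hypercalls' arguments. Only the hypercall and attribute names come from the cover letter; the numeric values, struct layout, and validation rules below are illustrative assumptions, not the series' actual ABI.

```c
#include <stdint.h>

/* Hypercall numbers: the names are from the cover letter, the values
 * are assumed for this sketch. */
#define KVM_HC_LOCK_MEM_PAGE_RANGES	100	/* assumed value */
#define KVM_HC_LOCK_CR_UPDATE		101	/* assumed value */

/* Attributes named in the cover letter; bit positions assumed. */
#define HEKI_ATTR_MEM_NOWRITE	(1ULL << 0)	/* deny writes (allow-list) */
#define HEKI_ATTR_MEM_EXEC	(1ULL << 1)	/* exec only here (deny-list) */

/* Hypothetical argument layout for KVM_HC_LOCK_MEM_PAGE_RANGES. */
struct heki_page_range {
	uint64_t gpa;		/* guest-physical start, page aligned */
	uint64_t npages;
	uint64_t attributes;
};

/* Reject obviously malformed lock requests before issuing the
 * hypercall: empty ranges, unaligned addresses, unknown attributes. */
int heki_range_valid(const struct heki_page_range *r)
{
	const uint64_t known = HEKI_ATTR_MEM_NOWRITE | HEKI_ATTR_MEM_EXEC;

	if (!r->npages)
		return 0;
	if (r->gpa & 0xfff)	/* 4 KiB alignment */
		return 0;
	if (!r->attributes || (r->attributes & ~known))
		return 0;
	return 1;
}

/* CR pinning as with KVM_HC_LOCK_CR_UPDATE: a register update is
 * allowed only if it keeps every pinned bit at its current value. */
int cr_update_allowed(uint64_t cur, uint64_t next, uint64_t pinned)
{
	return ((cur ^ next) & pinned) == 0;
}
```

Since the protections are one-way, a real guest would build its ranges once (e.g., .rodata and .text) at late init and issue the hypercall a single time; this sketch only captures the argument shape and the pinning semantics.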
Comments
On 5/5/2023 8:20 AM, Mickaël Salaün wrote:
> Hi,
>
> This patch series is a proof-of-concept that implements new KVM features
> (extended page tracking, MBEC support, CR pinning) and defines a new API
> to protect guest VMs. No VMM (e.g., Qemu) modification is required.
>
> The main idea is that kernel self-protection mechanisms should be
> delegated to a more privileged part of the system, hence the hypervisor.
> It is still the role of the guest kernel to request such restrictions
> according to its configuration.

Only for the guest kernel images here? Why not for the host OS kernel? The embedded devices w/ Android you have mentioned below support the host OS as well, it seems, right?

Do we suggest that all the functionalities should be implemented in the hypervisor (NS-EL2 for ARM), or even at a Secure EL like Secure-EL1 (ARM)?

I am hoping that whatever interface we suggest here from the guest to the hypervisor becomes the ABI, right?

> # Current limitations
>
> The main limitation of this patch series is the statically enforced
> permissions. [...] Mechanisms that dynamically impact kernel executable
> memory are not handled for now (e.g., kernel modules, tracepoints, eBPF
> JIT), and such code will need to be authenticated. [...] We are thinking
> of two ways to solve this: implement this verification in the VMM or
> spawn a dedicated special VM (similar to Windows's VBS). [...]

Do you foresee performance regressions due to a lot of tracking here? Production kernels do have a lot of tracepoints, and we use them as a feature in the GKI kernel for the vendor hooks implementation; in those cases, every vendor driver is a module. Does a separate VM further fragment this design and delegate more of it to proprietary solutions?

Do you have any performance numbers w/ the current RFC?

---Trilok Soni
On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote:
> # How does it work?
>
> This implementation mainly leverages KVM capabilities to control the
> Second Layer Address Translation (e.g., Intel's EPT or AMD's RVI/NPT)
> and Mode Based Execution Control (Intel's MBEC). [...] Once these
> permissions are set, they are locked and there is no way back.
>
> [...]
>
> Heki can be enabled with the heki=1 boot command argument.

Can the guest kernel ask the host VMM's emulated devices to DMA into the protected data? It should go through the host userspace mappings, I think, which don't care about EPT permissions. Or did I miss where you are protecting that another way? There are a lot of easy ways to ask the host to write to guest memory that don't involve the EPT. You probably need to protect the host userspace mappings, and also the places in KVM that kmap a GPA provided by the guest.

[ snip ]

> # Current limitations
>
> The main limitation of this patch series is the statically enforced
> permissions. [...] We are thinking of two ways to solve this: implement
> this verification in the VMM or spawn a dedicated special VM (similar
> to Windows's VBS). [...]

The kernel often creates writable aliases in order to write to protected data (kernel text, etc.). Some of this is done right as text is first being written out (alternatives, for example), and some happens much later (jump labels, etc.). So for verification, I wonder what stage you would be verifying? If you want to verify the end state, you would have to maintain knowledge in the verifier of all the touch-ups the kernel does. I think it would get very tricky.

It also seems it would be a decent ask for the guest kernel to keep track of GPA permissions as well as normal virtual memory permissions, if this thing is not widely used. So I'm wondering if you could go in two directions with this:

1. Make this a feature only for super locked-down kernels (no modules, etc.). Forbid any configurations that might modify text. But eBPF is used for seccomp, so you might be turning off some security protections to get this.

2. Loosen the rules to allow the protections to not be so one-way. Get less security, but be usable more widely.

There were similar dilemmas with the PV CR pinning stuff.
On 5/24/2023 3:20 PM, Edgecombe, Rick P wrote:
> On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote:
>> # How does it work?
>>
>> [...]
>
> [...]
>
> The kernel often creates writable aliases in order to write to
> protected data (kernel text, etc.). Some of this is done right as text
> is first being written out (alternatives, for example), and some
> happens much later (jump labels, etc.). So for verification, I wonder
> what stage you would be verifying? If you want to verify the end state,
> you would have to maintain knowledge in the verifier of all the
> touch-ups the kernel does. I think it would get very tricky.

Right, and on ARM (from what I know), erratas can be applied using the alternatives framework when you hotplug a CPU in after boot.

---Trilok Soni
On 24/05/2023 23:04, Trilok Soni wrote: > On 5/5/2023 8:20 AM, Mickaël Salaün wrote: >> Hi, >> >> This patch series is a proof-of-concept that implements new KVM features >> (extended page tracking, MBEC support, CR pinning) and defines a new API to >> protect guest VMs. No VMM (e.g., Qemu) modification is required. >> >> The main idea being that kernel self-protection mechanisms should be delegated >> to a more privileged part of the system, hence the hypervisor. It is still the >> role of the guest kernel to request such restrictions according to its > > Only for the guest kernel images here? Why not for the host OS kernel? As explained in the Future work section, protecting the host would be useful, but that doesn't really fit with the KVM model. The Protected KVM project is a first step to help in this direction [11]. In a nutshell, KVM is close to a type-2 hypervisor, and the host kernel is also part of the hypervisor. > Embedded devices w/ Android you have mentioned below supports the host > OS as well it seems, right? What do you mean? > > Do we suggest that all the functionalities should be implemented in the > Hypervisor (NS-EL2 for ARM) or even at Secure EL like Secure-EL1 (ARM). KVM runs in EL2. TrustZone is mainly used to enforce DRM, which means that we may not control the related code. This patch series is dedicated to hypervisor-enforced kernel integrity, then KVM. > > I am hoping that whatever we suggest the interface here from the Guest > to the Hypervisor becomes the ABI right? Yes, hypercalls are part of the KVM ABI. > > >> >> # Current limitations >> >> The main limitation of this patch series is the statically enforced >> permissions. This is not an issue for kernels without module but this needs to >> be addressed. Mechanisms that dynamically impact kernel executable memory are >> not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such >> code will need to be authenticated. 
Because the hypervisor is highly >> privileged and critical to the security of all the VMs, we don't want to >> implement a code authentication mechanism in the hypervisor itself but delegate >> this verification to something much less privileged. We are thinking of two >> ways to solve this: implement this verification in the VMM or spawn a dedicated >> special VM (similar to Windows's VBS). There are pros on cons to each approach: >> complexity, verification code ownership (guest's or VMM's), access to guest >> memory (i.e., confidential computing). > > Do you foresee the performance regressions due to lot of tracking here? The performance impact of execution prevention should be negligible because once configured the hypervisor does nothing except catch illegitimate access attempts. > Production kernels do have lot of tracepoints and we use it as feature > in the GKI kernel for the vendor hooks implementation and in those cases > every vendor driver is a module. As explained in this section, dynamic kernel modifications such as tracepoints or modules are not currently supported by this patch series. Handling tracepoints is possible but requires more work to define and check legitimate changes. This proposal is still useful for static kernels though. > Separate VM further fragments this > design and delegates more of it to proprietary solutions? What do you mean? KVM is not a proprietary solution. For dynamic checks, this would require code not run by KVM itself, but either the VMM or a dedicated VM. In this case, the dynamic authentication code could come from the guest VM or from the VMM itself. In the former case, it is more challenging from a security point of view but doesn't rely on an external (proprietary) solution. In the latter case, open-source VMMs should implement the specification to provide the required service (e.g. check kernel module signature). 
The goal of the common API layer provided by this RFC is to share code as much as possible between different hypervisor backends. > > Do you have any performance numbers w/ current RFC? No, but the only hypervisor performance impact is at boot time and should be negligible. I'll try to get some numbers for the hardware-enforcement impact, but it should be negligible too.
On 25/05/2023 00:20, Edgecombe, Rick P wrote: > On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote: >> # How does it work? >> >> This implementation mainly leverages KVM capabilities to control the >> Second >> Layer Address Translation (or the Two Dimensional Paging e.g., >> Intel's EPT or >> AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC) >> introduced with >> the Kaby Lake (7th generation) architecture. This allows to set >> permissions on >> memory pages in a complementary way to the guest kernel's managed >> memory >> permissions. Once these permissions are set, they are locked and >> there is no >> way back. >> >> A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest >> kernel to lock >> a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE >> or the >> HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a >> specific >> set of pages (allow-list approach), and the second only allows kernel >> execution >> for a set of pages (deny-list approach). >> >> The current implementation sets the whole kernel's .rodata (i.e., any >> const or >> __ro_after_init variables, which includes critical security data such >> as LSM >> parameters) and .text sections as non-writable, and the .text section >> is the >> only one where kernel execution is allowed. This is possible thanks >> to the new >> MBEC support also brough by this series (otherwise the vDSO would >> have to be >> executable). Thanks to this hardware support (VT-x, EPT and MBEC), >> the >> performance impact of such guest protection is negligible. >> >> The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some >> of its >> CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP, >> X86_CR4_SMAP), >> which is another complementary hardening mechanism. >> >> Heki can be enabled with the heki=1 boot command argument. >> >> > > Can the guest kernel ask the host VMM's emulated devices to DMA into > the protected data? 
It should go through the host userspace mappings I > think, which don't care about EPT permissions. Or did I miss where you > are protecting that another way? There are a lot of easy ways to ask > the host to write to guest memory that don't involve the EPT. You > probably need to protect the host userspace mappings, and also the > places in KVM that kmap a GPA provided by the guest. Good point, I'll check this confused deputy attack. Extended KVM protections should indeed handle all ways to map guests' memory. I'm wondering if current VMMs would gracefully handle such new restrictions though. > > [ snip ] > >> >> # Current limitations >> >> The main limitation of this patch series is the statically enforced >> permissions. This is not an issue for kernels without module but this >> needs to >> be addressed. Mechanisms that dynamically impact kernel executable >> memory are >> not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), >> and such >> code will need to be authenticated. Because the hypervisor is highly >> privileged and critical to the security of all the VMs, we don't want >> to >> implement a code authentication mechanism in the hypervisor itself >> but delegate >> this verification to something much less privileged. We are thinking >> of two >> ways to solve this: implement this verification in the VMM or spawn a >> dedicated >> special VM (similar to Windows's VBS). There are pros on cons to each >> approach: >> complexity, verification code ownership (guest's or VMM's), access to >> guest >> memory (i.e., confidential computing). > > The kernel often creates writable aliases in order to write to > protected data (kernel text, etc). Some of this is done right as text > is being first written out (alternatives for example), and some happens > way later (jump labels, etc). So for verification, I wonder what stage > you would be verifying? 
If you want to verify the end state, you would > have to maintain knowledge in the verifier of all the touch-ups the > kernel does. I think it would get very tricky. For now, in the static kernel case, all rodata and text GPA is restricted, so aliasing such memory in a writable way before or after the KVM enforcement would still restrict write access to this memory, which could be an issue but not a security one. Do you have such examples in mind? > > It also seems it will be a decent ask for the guest kernel to keep > track of GPA permissions as well as normal virtual memory pemirssions, > if this thing is not widely used. This would indeed be required to properly handle the dynamic cases. > > So I wondering if you could go in two directions with this: > 1. Make this a feature only for super locked down kernels (no modules, > etc). Forbid any configurations that might modify text. But eBPF is > used for seccomp, so you might be turning off some security protections > to get this. Good idea. For "super locked down kernels" :) , we should disable all kernel executable changes with the related kernel build configuration (e.g. eBPF JIT, kernel module, kprobes…) to make sure there is no such legitimate access. This looks like an acceptable initial feature. > 2. Loosen the rules to allow the protections to not be so one-way > enable. Get less security, but used more widely. This is our goal. I think both static and dynamic cases are legitimate and have value according to the level of security sought. This should be a build-time configuration. > > There were similar dilemmas with the PV CR pinning stuff.
On Thu, 2023-05-25 at 15:59 +0200, Mickaël Salaün wrote: [ snip ] > > The kernel often creates writable aliases in order to write to > > protected data (kernel text, etc). Some of this is done right as > > text > > is being first written out (alternatives for example), and some > > happens > > way later (jump labels, etc). So for verification, I wonder what > > stage > > you would be verifying? If you want to verify the end state, you > > would > > have to maintain knowledge in the verifier of all the touch-ups the > > kernel does. I think it would get very tricky. > > For now, in the static kernel case, all rodata and text GPA is > restricted, so aliasing such memory in a writable way before or after > the KVM enforcement would still restrict write access to this memory, > which could be an issue but not a security one. Do you have such > examples in mind? > On x86, look at all the callers of the text_poke() family. In arch/x86/include/asm/text-patching.h. > > > > > It also seems it will be a decent ask for the guest kernel to keep > > track of GPA permissions as well as normal virtual memory > > pemirssions, > > if this thing is not widely used. > > This would indeed be required to properly handle the dynamic cases. > > > > > > So I wondering if you could go in two directions with this: > > 1. Make this a feature only for super locked down kernels (no > > modules, > > etc). Forbid any configurations that might modify text. But eBPF is > > used for seccomp, so you might be turning off some security > > protections > > to get this. > > Good idea. For "super locked down kernels" :) , we should disable all > kernel executable changes with the related kernel build configuration > (e.g. eBPF JIT, kernel module, kprobes…) to make sure there is no > such > legitimate access. This looks like an acceptable initial feature. How many users do you think will want this protection but not protections that would have to be disabled? 
The main one that came to mind for me is cBPF seccomp stuff. But also, the alternative to JITing cBPF is the eBPF interpreter which AFAIU is considered a juicy enough target for speculative attacks that they created an option to compile it out. And leaving an interpreter in the kernel means any data could be "executed" in the normal non- speculative scenario, kind of working around the hypervisor executable protections. Dropping e/cBPF entirely would be an option, but then I wonder how many users you have left. Hopefully that is all correct, it's hard to keep track with the pace of BPF development. I wonder if it might be a good idea to POC the guest side before settling on the KVM interface. Then you can also look at the whole thing and judge how much usage it would get for the different options of restrictions. > > > > 2. Loosen the rules to allow the protections to not be so one-way > > enable. Get less security, but used more widely. > > This is our goal. I think both static and dynamic cases are > legitimate > and have value according to the level of security sought. This should > be > a build-time configuration. Yea, the proper way to do this is probably to move all text handling stuff into a separate domain of some sort, like you mentioned elsewhere. It would be quite a job.
On Thu, May 25, 2023, Rick P Edgecombe wrote: > I wonder if it might be a good idea to POC the guest side before > settling on the KVM interface. Then you can also look at the whole > thing and judge how much usage it would get for the different options > of restrictions. As I said earlier[*], IMO the control plane logic needs to live in host userspace. I think any attempt to have KVM provide anything but the low level plumbing will suffer the same fate as CR4 pinning and XO memory. Iterating on an imperfect solution to incrementally improve security is far, far easier to do in userspace, and far more likely to get merged. [*] https://lore.kernel.org/all/ZFUyhPuhtMbYdJ76@google.com
On 5/25/2023 6:25 AM, Mickaël Salaün wrote: > > On 24/05/2023 23:04, Trilok Soni wrote: >> On 5/5/2023 8:20 AM, Mickaël Salaün wrote: >>> Hi, >>> >>> This patch series is a proof-of-concept that implements new KVM features >>> (extended page tracking, MBEC support, CR pinning) and defines a new >>> API to >>> protect guest VMs. No VMM (e.g., Qemu) modification is required. >>> >>> The main idea being that kernel self-protection mechanisms should be >>> delegated >>> to a more privileged part of the system, hence the hypervisor. It is >>> still the >>> role of the guest kernel to request such restrictions according to its >> >> Only for the guest kernel images here? Why not for the host OS kernel? > > As explained in the Future work section, protecting the host would be > useful, but that doesn't really fit with the KVM model. The Protected > KVM project is a first step to help in this direction [11]. > > In a nutshell, KVM is close to a type-2 hypervisor, and the host kernel > is also part of the hypervisor. > > >> Embedded devices w/ Android you have mentioned below supports the host >> OS as well it seems, right? > > What do you mean? I think you have answered this above w/ pKVM, and I was referring to host protection as well w/ Heki. The link/references below refer to the Android OS, it seems, and not the guest VM. > > >> >> Do we suggest that all the functionalities should be implemented in the >> Hypervisor (NS-EL2 for ARM) or even at Secure EL like Secure-EL1 (ARM). > > KVM runs in EL2. TrustZone is mainly used to enforce DRM, which means > that we may not control the related code. > > This patch series is dedicated to hypervisor-enforced kernel integrity, > then KVM. > >> >> I am hoping that whatever we suggest the interface here from the Guest >> to the Hypervisor becomes the ABI right? > > Yes, hypercalls are part of the KVM ABI. Sure. 
I just hope that they are extensible enough to support other Hypervisors too. I am not sure if others like ACRN / Xen are on this list to see if it fits their needs too. Is there any other Hypervisor you plan to test this feature as well? > >> >> >>> >>> # Current limitations >>> >>> The main limitation of this patch series is the statically enforced >>> permissions. This is not an issue for kernels without module but this >>> needs to >>> be addressed. Mechanisms that dynamically impact kernel executable >>> memory are >>> not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), >>> and such >>> code will need to be authenticated. Because the hypervisor is highly >>> privileged and critical to the security of all the VMs, we don't want to >>> implement a code authentication mechanism in the hypervisor itself >>> but delegate >>> this verification to something much less privileged. We are thinking >>> of two >>> ways to solve this: implement this verification in the VMM or spawn a >>> dedicated >>> special VM (similar to Windows's VBS). There are pros on cons to each >>> approach: >>> complexity, verification code ownership (guest's or VMM's), access to >>> guest >>> memory (i.e., confidential computing). >> >> Do you foresee the performance regressions due to lot of tracking here? > > The performance impact of execution prevention should be negligible > because once configured the hypervisor do nothing except catch > illegitimate access attempts. Yes, if you are using the static kernel only and not considering the other dynamic patching features, as explained. They need to be thought about differently to reduce the likely impact. > > >> Production kernels do have lot of tracepoints and we use it as feature >> in the GKI kernel for the vendor hooks implementation and in those cases >> every vendor driver is a module. > > As explained in this section, dynamic kernel modifications such as > tracepoints or modules are not currently supported by this patch series. 
> Handling tracepoints is possible but requires more work to define and > check legitimate changes. This proposal is still useful for static > kernels though. > > >> Separate VM further fragments this >> design and delegates more of it to proprietary solutions? > > What do you mean? KVM is not a proprietary solution. Ah, I was referring to the Windows VBS VM mentioned in the above text. Is it open-source? The reference to a VM (or dedicated VM) didn't mention that the VM itself will be open-source, running a Linux kernel. > > For dynamic checks, this would require code not run by KVM itself, but > either the VMM or a dedicated VM. In this case, the dynamic > authentication code could come from the guest VM or from the VMM itself. > In the former case, it is more challenging from a security point of view > but doesn't rely on external (proprietary) solution. In the latter case, > open-source VMMs should implement the specification to provide the > required service (e.g. check kernel module signature). > > The goal of the common API layer provided by this RFC is to share code > as much as possible between different hypervisor backends. > > >> >> Do you have any performance numbers w/ current RFC? > > No, but the only hypervisor performance impact is at boot time and > should be negligible. I'll try to get some numbers for the > hardware-enforcement impact, but it should be negligible too. Thanks. Please share the data once you have it ready. ---Trilok Soni
On Thu, 2023-05-25 at 09:07 -0700, Sean Christopherson wrote: > On Thu, May 25, 2023, Rick P Edgecombe wrote: > > I wonder if it might be a good idea to POC the guest side before > > settling on the KVM interface. Then you can also look at the whole > > thing and judge how much usage it would get for the different > > options > > of restrictions. > > As I said earlier[*], IMO the control plane logic needs to live in > host userspace. > I think any attempt to have KVM providen anything but the low level > plumbing will > suffer the same fate as CR4 pinning and XO memory. Iterating on an > imperfect > solution to incremently improve security is far, far easier to do in > userspace, > and far more likely to get merged. > > [*] https://lore.kernel.org/all/ZFUyhPuhtMbYdJ76@google.com Sure, I should have put it more generally. I just meant people are not going to want to maintain host-based features that guests can't effectively use. My takeaway from the CR pinning was that the guest kernel integration was surprisingly tricky due to the one-way nature of the interface. XO was more flexible than CR pinning in that respect, because the guest could turn it off (and indeed, in the XO kernel text patches it had to do this a lot).
[Side topic] Would folks be interested in a Linux Plumbers Conference MC on this topic generally, across different hypervisors, VMMs, and architectures? If so, please let me know who the key folk would be and we can try writing up an MC proposal.
On 25/05/2023 15:59, Mickaël Salaün wrote: > > On 25/05/2023 00:20, Edgecombe, Rick P wrote: >> On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote: >>> # How does it work? >>> >>> This implementation mainly leverages KVM capabilities to control the >>> Second >>> Layer Address Translation (or the Two Dimensional Paging e.g., >>> Intel's EPT or >>> AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC) >>> introduced with >>> the Kaby Lake (7th generation) architecture. This allows to set >>> permissions on >>> memory pages in a complementary way to the guest kernel's managed >>> memory >>> permissions. Once these permissions are set, they are locked and >>> there is no >>> way back. >>> >>> A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest >>> kernel to lock >>> a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE >>> or the >>> HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a >>> specific >>> set of pages (allow-list approach), and the second only allows kernel >>> execution >>> for a set of pages (deny-list approach). >>> >>> The current implementation sets the whole kernel's .rodata (i.e., any >>> const or >>> __ro_after_init variables, which includes critical security data such >>> as LSM >>> parameters) and .text sections as non-writable, and the .text section >>> is the >>> only one where kernel execution is allowed. This is possible thanks >>> to the new >>> MBEC support also brough by this series (otherwise the vDSO would >>> have to be >>> executable). Thanks to this hardware support (VT-x, EPT and MBEC), >>> the >>> performance impact of such guest protection is negligible. >>> >>> The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some >>> of its >>> CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP, >>> X86_CR4_SMAP), >>> which is another complementary hardening mechanism. >>> >>> Heki can be enabled with the heki=1 boot command argument. 
>>> >>> >> >> Can the guest kernel ask the host VMM's emulated devices to DMA into >> the protected data? It should go through the host userspace mappings I >> think, which don't care about EPT permissions. Or did I miss where you >> are protecting that another way? There are a lot of easy ways to ask >> the host to write to guest memory that don't involve the EPT. You >> probably need to protect the host userspace mappings, and also the >> places in KVM that kmap a GPA provided by the guest. > > Good point, I'll check this confused deputy attack. Extended KVM > protections should indeed handle all ways to map guests' memory. I'm > wondering if current VMMs would gracefully handle such new restrictions > though. I guess the host could map arbitrary data to the guest, so that needs to be handled, but how could the VMM (not the host kernel) bypass/update the EPT initially used for the guest (and potentially later mapped to the host)?
On 25/05/2023 17:52, Edgecombe, Rick P wrote: > On Thu, 2023-05-25 at 15:59 +0200, Mickaël Salaün wrote: > [ snip ] > >>> The kernel often creates writable aliases in order to write to >>> protected data (kernel text, etc). Some of this is done right as >>> text >>> is being first written out (alternatives for example), and some >>> happens >>> way later (jump labels, etc). So for verification, I wonder what >>> stage >>> you would be verifying? If you want to verify the end state, you >>> would >>> have to maintain knowledge in the verifier of all the touch-ups the >>> kernel does. I think it would get very tricky. >> >> For now, in the static kernel case, all rodata and text GPA is >> restricted, so aliasing such memory in a writable way before or after >> the KVM enforcement would still restrict write access to this memory, >> which could be an issue but not a security one. Do you have such >> examples in mind? >> > > On x86, look at all the callers of the text_poke() family. In > arch/x86/include/asm/text-patching.h. OK, thanks! > >> >>> >>> It also seems it will be a decent ask for the guest kernel to keep >>> track of GPA permissions as well as normal virtual memory >>> pemirssions, >>> if this thing is not widely used. >> >> This would indeed be required to properly handle the dynamic cases. >> >> >>> >>> So I wondering if you could go in two directions with this: >>> 1. Make this a feature only for super locked down kernels (no >>> modules, >>> etc). Forbid any configurations that might modify text. But eBPF is >>> used for seccomp, so you might be turning off some security >>> protections >>> to get this. >> >> Good idea. For "super locked down kernels" :) , we should disable all >> kernel executable changes with the related kernel build configuration >> (e.g. eBPF JIT, kernel module, kprobes…) to make sure there is no >> such >> legitimate access. This looks like an acceptable initial feature. 
> > How many users do you think will want this protection but not > protections that would have to be disabled? The main one that came to > mind for me is cBPF seccomp stuff. > > But also, the alternative to JITing cBPF is the eBPF interpreter which > AFAIU is considered a juicy enough target for speculative attacks that > they created an option to compile it out. And leaving an interpreter in > the kernel means any data could be "executed" in the normal non- > speculative scenario, kind of working around the hypervisor executable > protections. Dropping e/cBPF entirely would be an option, but then I > wonder how many users you have left. Hopefully that is all correct, > it's hard to keep track with the pace of BPF development. seccomp-bpf doesn't rely on JIT, so it is not an issue. For eBPF, JIT is optional, but other text changes may be required according to the eBPF program type (e.g. using kprobes). > > I wonder if it might be a good idea to POC the guest side before > settling on the KVM interface. Then you can also look at the whole > thing and judge how much usage it would get for the different options > of restrictions. The next step is to handle dynamic permissions, but it will be easier to first implement that in KVM itself (which already has the required authentication code). The current interface may be flexible enough though, only new attribute flags should be required (and potentially an async mode). Anyway, this will enable to look at the whole thing. > >> >> >>> 2. Loosen the rules to allow the protections to not be so one-way >>> enable. Get less security, but used more widely. >> >> This is our goal. I think both static and dynamic cases are >> legitimate >> and have value according to the level of security sought. This should >> be >> a build-time configuration. > > Yea, the proper way to do this is probably to move all text handling > stuff into a separate domain of some sort, like you mentioned > elsewhere. It would be quite a job. 
Not necessarily to move this code, but to make sure that the changes are legitimate (e.g. text signatures, legitimate addresses). This doesn't need to be perfect but it should improve the current state by increasing the cost of attacks.
On 25/05/2023 20:34, Trilok Soni wrote: > On 5/25/2023 6:25 AM, Mickaël Salaün wrote: >> >> On 24/05/2023 23:04, Trilok Soni wrote: >>> On 5/5/2023 8:20 AM, Mickaël Salaün wrote: >>>> Hi, >>>> >>>> This patch series is a proof-of-concept that implements new KVM features >>>> (extended page tracking, MBEC support, CR pinning) and defines a new >>>> API to >>>> protect guest VMs. No VMM (e.g., Qemu) modification is required. >>>> >>>> The main idea being that kernel self-protection mechanisms should be >>>> delegated >>>> to a more privileged part of the system, hence the hypervisor. It is >>>> still the >>>> role of the guest kernel to request such restrictions according to its >>> >>> Only for the guest kernel images here? Why not for the host OS kernel? >> >> As explained in the Future work section, protecting the host would be >> useful, but that doesn't really fit with the KVM model. The Protected >> KVM project is a first step to help in this direction [11]. >> >> In a nutshell, KVM is close to a type-2 hypervisor, and the host kernel >> is also part of the hypervisor. >> >> >>> Embedded devices w/ Android you have mentioned below supports the host >>> OS as well it seems, right? >> >> What do you mean? > > I think you have answered this above w/ pKVM and I was referring the > host protection as well w/ Heki. The link/references below refers to the > Android OS it seems and not guest VM. > >> >> >>> >>> Do we suggest that all the functionalities should be implemented in the >>> Hypervisor (NS-EL2 for ARM) or even at Secure EL like Secure-EL1 (ARM). >> >> KVM runs in EL2. TrustZone is mainly used to enforce DRM, which means >> that we may not control the related code. >> >> This patch series is dedicated to hypervisor-enforced kernel integrity, >> then KVM. >> >>> >>> I am hoping that whatever we suggest the interface here from the Guest >>> to the Hypervisor becomes the ABI right? >> >> Yes, hypercalls are part of the KVM ABI. > > Sure. 
I just hope that they are extensible enough to support for other > Hypervisors too. I am not sure if they are on this list like ACRN / Xen > and see if it fits their need too. KVM, Hyper-V and Xen mailing lists are CCed. The KVM hypercalls are specific to KVM, but this patch series also includes a common guest API intended to be used with all hypervisors. > > Is there any other Hypervisor you plan to test this feature as well? We're also working on Hyper-V. > >> >>> >>> >>>> >>>> # Current limitations >>>> >>>> The main limitation of this patch series is the statically enforced >>>> permissions. This is not an issue for kernels without module but this >>>> needs to >>>> be addressed. Mechanisms that dynamically impact kernel executable >>>> memory are >>>> not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), >>>> and such >>>> code will need to be authenticated. Because the hypervisor is highly >>>> privileged and critical to the security of all the VMs, we don't want to >>>> implement a code authentication mechanism in the hypervisor itself >>>> but delegate >>>> this verification to something much less privileged. We are thinking >>>> of two >>>> ways to solve this: implement this verification in the VMM or spawn a >>>> dedicated >>>> special VM (similar to Windows's VBS). There are pros on cons to each >>>> approach: >>>> complexity, verification code ownership (guest's or VMM's), access to >>>> guest >>>> memory (i.e., confidential computing). >>> >>> Do you foresee the performance regressions due to lot of tracking here? >> >> The performance impact of execution prevention should be negligible >> because once configured the hypervisor do nothing except catch >> illegitimate access attempts. > > Yes, if you are using the static kernel only and not considering the > other dynamic patching features like explained. They need to be thought > upon differently to reduce the likely impact. What do you mean? 
We plan to support dynamic code, and performance is of course part of the requirement. > >> >> >>> Production kernels do have lot of tracepoints and we use it as feature >>> in the GKI kernel for the vendor hooks implementation and in those cases >>> every vendor driver is a module. >> >> As explained in this section, dynamic kernel modifications such as >> tracepoints or modules are not currently supported by this patch series. >> Handling tracepoints is possible but requires more work to define and >> check legitimate changes. This proposal is still useful for static >> kernels though. >> >> >>> Separate VM further fragments this >>> design and delegates more of it to proprietary solutions? >> >> What do you mean? KVM is not a proprietary solution. > > Ah, I was referring the VBS Windows VM mentioned in the above text. Is > it open-source? The reference of VM (or dedicated VM) didn't mention > that VM itself will be open-source running Linux kernel. This patch series is dedicated to KVM. Windows VBS was only mentioned as a comparable (but much more advanced) set of features. Everything required to use these new KVM features is and will be open-source. There is nothing to worry about regarding licensing; the goal is to make it widely and freely available to protect users. > >> >> For dynamic checks, this would require code not run by KVM itself, but >> either the VMM or a dedicated VM. In this case, the dynamic >> authentication code could come from the guest VM or from the VMM itself. >> In the former case, it is more challenging from a security point of view >> but doesn't rely on external (proprietary) solution. In the latter case, >> open-source VMMs should implement the specification to provide the >> required service (e.g. check kernel module signature). >> >> The goal of the common API layer provided by this RFC is to share code >> as much as possible between different hypervisor backends. >> >> >>> >>> Do you have any performance numbers w/ current RFC? 
>> >> No, but the only hypervisor performance impact is at boot time and >> should be negligible. I'll try to get some numbers for the >> hardware-enforcement impact, but it should be negligible too. > > Thanks. Please share the data once you have it ready. It's on my todo list, but again, that should not be an issue and I even doubt the difference would be measurable.
On Fri, 2023-05-26 at 17:22 +0200, Mickaël Salaün wrote: > > > Can the guest kernel ask the host VMM's emulated devices to DMA > > > into > > > the protected data? It should go through the host userspace > > > mappings I > > > think, which don't care about EPT permissions. Or did I miss > > > where you > > > are protecting that another way? There are a lot of easy ways to > > > ask > > > the host to write to guest memory that don't involve the EPT. You > > > probably need to protect the host userspace mappings, and also > > > the > > > places in KVM that kmap a GPA provided by the guest. > > > > Good point, I'll check this confused deputy attack. Extended KVM > > protections should indeed handle all ways to map guests' memory. > > I'm > > wondering if current VMMs would gracefully handle such new > > restrictions > > though. > > I guess the host could map arbitrary data to the guest, so that need > to > be handled, but how could the VMM (not the host kernel) bypass/update > EPT initially used for the guest (and potentially later mapped to the > host)? Well traditionally both QEMU and KVM accessed guest memory via host mappings instead of the EPT. So I'm wondering what is stopping the guest from passing a protected gfn when setting up the DMA, and QEMU being enticed to write to it? The emulator as well would use these host userspace mappings and not consult the EPT IIRC. I think Sean was suggesting host userspace should be more involved in this process, so perhaps it could protect its own alias of the protected memory, for example mprotect() it as read-only. There are (were?) some KVM PV features that accessed guest memory via the host direct map as well. I would think mprotect() should protect this at the get_user_pages() stage, but it looks like the details have changed since I last understood it.
On Tue, May 30, 2023, Rick P Edgecombe wrote:
> On Fri, 2023-05-26 at 17:22 +0200, Mickaël Salaün wrote:
> > > > Can the guest kernel ask the host VMM's emulated devices to DMA into
> > > > the protected data? It should go through the host userspace mappings I
> > > > think, which don't care about EPT permissions. Or did I miss where you
> > > > are protecting that another way? There are a lot of easy ways to ask
> > > > the host to write to guest memory that don't involve the EPT. You
> > > > probably need to protect the host userspace mappings, and also the
> > > > places in KVM that kmap a GPA provided by the guest.
> > >
> > > Good point, I'll check this confused deputy attack. Extended KVM
> > > protections should indeed handle all ways to map guests' memory. I'm
> > > wondering if current VMMs would gracefully handle such new restrictions
> > > though.
> >
> > I guess the host could map arbitrary data to the guest, so that need to be
> > handled, but how could the VMM (not the host kernel) bypass/update EPT
> > initially used for the guest (and potentially later mapped to the host)?
>
> Well traditionally both QEMU and KVM accessed guest memory via host
> mappings instead of the EPT. So I'm wondering what is stopping the
> guest from passing a protected gfn when setting up the DMA, and QEMU
> being enticed to write to it? The emulator as well would use these host
> userspace mappings and not consult the EPT IIRC.
>
> I think Sean was suggesting host userspace should be more involved in
> this process, so perhaps it could protect its own alias of the
> protected memory, for example mprotect() it as read-only.

Ya, though "suggesting" is really "demanding, unless someone provides super strong
justification for handling this directly in KVM". It's basically the same argument
that led to Linux Security Modules: I'm all for KVM providing the framework and
plumbing, but I don't want KVM to get involved in defining policy, threat models, etc.
On 31/05/2023 22:24, Sean Christopherson wrote:
> On Tue, May 30, 2023, Rick P Edgecombe wrote:
>> On Fri, 2023-05-26 at 17:22 +0200, Mickaël Salaün wrote:
>>>>> Can the guest kernel ask the host VMM's emulated devices to DMA into
>>>>> the protected data? It should go through the host userspace mappings I
>>>>> think, which don't care about EPT permissions. Or did I miss where you
>>>>> are protecting that another way? There are a lot of easy ways to ask
>>>>> the host to write to guest memory that don't involve the EPT. You
>>>>> probably need to protect the host userspace mappings, and also the
>>>>> places in KVM that kmap a GPA provided by the guest.
>>>>
>>>> Good point, I'll check this confused deputy attack. Extended KVM
>>>> protections should indeed handle all ways to map guests' memory. I'm
>>>> wondering if current VMMs would gracefully handle such new restrictions
>>>> though.
>>>
>>> I guess the host could map arbitrary data to the guest, so that need to be
>>> handled, but how could the VMM (not the host kernel) bypass/update EPT
>>> initially used for the guest (and potentially later mapped to the host)?
>>
>> Well traditionally both QEMU and KVM accessed guest memory via host
>> mappings instead of the EPT. So I'm wondering what is stopping the
>> guest from passing a protected gfn when setting up the DMA, and QEMU
>> being enticed to write to it? The emulator as well would use these host
>> userspace mappings and not consult the EPT IIRC.
>>
>> I think Sean was suggesting host userspace should be more involved in
>> this process, so perhaps it could protect its own alias of the
>> protected memory, for example mprotect() it as read-only.
>
> Ya, though "suggesting" is really "demanding, unless someone provides super strong
> justification for handling this directly in KVM". It's basically the same argument
> that led to Linux Security Modules: I'm all for KVM providing the framework and
> plumbing, but I don't want KVM to get involved in defining policy, threat models, etc.

I agree that KVM should not provide its own policy but only the building
blocks to enforce one. There are two complementary points:
- policy definition by the guest, provided to KVM and the host;
- policy enforcement by KVM and the host.

A potential extension of this framework could be to enable the host to
define its own policy for guests, but that would be a different threat
model.

To avoid too much latency from involving the host in policy enforcement,
I'd like to explore an asynchronous approach, which would be a
particularly good fit for dynamic restrictions.