Message ID: 20230511040857.6094-1-weijiang.yang@intel.com
Headers:
From: Yang Weijiang <weijiang.yang@intel.com>
To: seanjc@google.com, pbonzini@redhat.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: peterz@infradead.org, rppt@kernel.org, binbin.wu@linux.intel.com, rick.p.edgecombe@intel.com, weijiang.yang@intel.com, john.allen@amd.com
Subject: [PATCH v3 00/21] Enable CET Virtualization
Date: Thu, 11 May 2023 00:08:36 -0400
Message-Id: <20230511040857.6094-1-weijiang.yang@intel.com>
Series: Enable CET Virtualization
Message
Yang, Weijiang
May 11, 2023, 4:08 a.m. UTC
Control-flow Enforcement Technology (CET) is a CPU feature used to prevent Return/Jump-Oriented Programming (ROP/JOP) attacks. CET introduces a new exception type, Control Protection (#CP), and two sub-features, Shadow Stack (SHSTK) and Indirect Branch Tracking (IBT), to defend against ROP/JOP-style control-flow subversion attacks.

Shadow Stack (SHSTK): A shadow stack is a second stack used exclusively for control-transfer operations. The shadow stack is separate from the data/normal stack and can be enabled individually in user and kernel mode. When shadow stack is enabled, CALL pushes the return address onto both the data and shadow stacks. RET pops the return address from both stacks and compares them. If the return addresses from the two stacks do not match, the processor generates a #CP.

Indirect Branch Tracking (IBT): IBT adds a new instruction, ENDBRANCH, to mark valid target addresses of indirect branches (CALL, JMP, etc.). If an indirect branch is executed and the next instruction is _not_ an ENDBRANCH, the processor generates a #CP. The instruction behaves as a NOP on platforms that don't support CET.

Dependency:
--------------------------------------------------------------------------
The first 5 patches are taken from the CET native series [1] in linux-next. They are prerequisites for enabling guest user-mode SHSTK. Apply this full series before building the host kernel for guest CET testing, and also apply the CET enabling patches in [2] to build a qualified QEMU. These kernel-dependent patches will be carried in the KVM series until the CET native series is merged into the mainline tree.

Implementation:
--------------------------------------------------------------------------
Historically, the early KVM patches supported both user SHSTK and user IBT, and most of those patches are carried forward with changes in this new series. With the kernel IBT feature merged in 5.18, a new patch was added to support that feature in the guest.
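The CALL/RET matching that SHSTK enforces can be modeled in a few lines of C. This is a hypothetical userspace sketch for illustration only (names like `sstk_call` are invented here), not code from the series or from the kernel:

```c
#include <stdint.h>

/* Illustrative model of a hardware shadow stack; depth is arbitrary. */
#define SSTK_DEPTH 64

struct shadow_stack {
    uint64_t slot[SSTK_DEPTH];
    int top;
};

/* CALL pushes the return address onto both the data stack (elided here)
 * and the shadow stack. */
static void sstk_call(struct shadow_stack *ss, uint64_t ret_addr)
{
    if (ss->top < SSTK_DEPTH)
        ss->slot[ss->top++] = ret_addr;
}

/* RET pops both stacks and compares the two return addresses.  On real
 * hardware a mismatch raises #CP; here -1 stands in for that fault. */
static int sstk_ret(struct shadow_stack *ss, uint64_t data_stack_ret)
{
    uint64_t shadow_ret = ss->slot[--ss->top];

    return shadow_ret == data_stack_ret ? 0 : -1; /* -1 ~ #CP */
}
```

A ROP exploit that overwrites only the data-stack return address leaves the shadow copy intact, so the comparison in `sstk_ret()` fails and the processor would raise #CP.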
The last patch adds support for supervisor SHSTK, but the feature is not enabled on Intel platforms for now; the main purpose of that patch is to help AMD folks enable the feature. In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but doesn't fully support CET supervisor SHSTK; that enabling work is left for the future.

Supported CET sub-features:

                |
   User SHSTK   |  User IBT      (user mode)
  --------------------------------------------------
   s-SHSTK (X)  |  Kernel IBT    (kernel mode)
                |

Guest user-mode SHSTK/IBT relies on host-side XSAVES support (XSS[bit 11]) to swap CET states. Guest kernel IBT doesn't depend on host XSAVES. Supervisor SHSTK relies on host-side XSAVES support (XSS[bit 12]) for supervisor-mode CET state save/restore.

This version removes the unnecessary checks of host CET enabling status before exposing CET features to the guest, decoupling guest CET enabling from the host. This is expected to be friendlier to cloud-computing scenarios.

CET states management:
--------------------------------------------------------------------------
The CET user-mode states, MSR_IA32_{U_CET,PL3_SSP}, depend on the {XSAVES,XRSTORS} instructions to swap guest/host context when vm-exit/vm-entry happens. On vm-exit, the guest CET states are stored to the guest fpu area and the host user-mode states are loaded from thread/process context before the vCPU returns to userspace; vice-versa on vm-entry. See details in kvm_{load|put}_guest_fpu(). So user-mode state validity depends on the host having the U_CET bit set in MSR_XSS.

The CET supervisor-mode states are grouped into two categories, XSAVES-dependent and non-dependent: the former includes MSR_IA32_PL{0,1,2}_SSP, the latter consists of MSR_IA32_S_CET and MSR_IA32_INTR_SSP_TBL. Save/restore of the XSAVES-dependent MSRs depends on the S_CET bit being set in MSR_XSS. Since the native series doesn't enable S_CET support, these s-SHSTK shadow-stack pointers are invalid.
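The XSS-bit dependencies described above can be summarized in a small sketch. This is a model written for illustration (the helper names are invented), not code from the series:

```c
#include <stdint.h>

/* XSS bit positions for the CET state components:
 * bit 11 = CET user states, bit 12 = CET supervisor states. */
#define XFEATURE_MASK_CET_USER   (1ULL << 11)
#define XFEATURE_MASK_CET_KERNEL (1ULL << 12)

/* MSR_IA32_{U_CET,PL3_SSP} are swapped by XSAVES/XRSTORS only when the
 * user CET bit is set in MSR_XSS. */
static int xsaves_manages_user_cet(uint64_t xss)
{
    return !!(xss & XFEATURE_MASK_CET_USER);
}

/* Of the supervisor states, only MSR_IA32_PL{0,1,2}_SSP are
 * XSAVES-managed (gated on XSS bit 12); MSR_IA32_S_CET and
 * MSR_IA32_INTR_SSP_TBL are switched via dedicated VMCS fields,
 * which is why guest kernel IBT has no XSAVES dependency. */
static int xsaves_manages_pl012_ssp(uint64_t xss)
{
    return !!(xss & XFEATURE_MASK_CET_KERNEL);
}
```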
New VMCS fields, {GUEST|HOST}_{S_CET,SSP,INTR_SSP_TABL}, are introduced to switch the guest/host non-XSAVES-managed states. When the CET entry/exit load bits are set, guest/host MSR_IA32_{S_CET,INTR_SSP_TBL,SSP} are loaded from these fields at vm-exit/entry. With these new fields, the current guest kernel IBT enabling doesn't depend on the S_CET bit in XSS, i.e., on host {XSAVES|XRSTORS} support.

Tests:
--------------------------------------------------------------------------
This series passed the basic CET user shadow stack test and kernel IBT test in L1 and L2 guests. It also works with the CET KVM-unit-tests application. All KVM-unit-tests cases and KVM selftests were executed against this series; all test cases passed except the vmx test, whose failure is due to the CR4_CET bit testing in test_vmxon_bad_cr(). After adding the CR4_CET bit to the skip list, the test passed. I'll send a patch to fix this issue later.

To run the user shadow stack test and kernel IBT test in a VM, you need a CET-capable platform, e.g., a Sapphire Rapids server, and should follow the steps below to build the host/guest kernels properly:

1. Build the host kernel: apply this series to the kernel tree and build.
2. Build the guest kernel: apply the CET native series to the kernel tree and opt in to the CONFIG_X86_KERNEL_IBT and CONFIG_X86_USER_SHADOW_STACK options. Build with a CET-enabled gcc version (>= 8.5.0).
3. Use the patched QEMU to launch a VM.
Check the kernel selftest test_shadow_stack_64 output:

[INFO]  new_ssp = 7f8c82100ff8, *new_ssp = 7f8c82101001
[INFO]  changing ssp from 7f8c82900ff0 to 7f8c82100ff8
[INFO]  ssp is now 7f8c82101000
[OK]    Shadow stack pivot
[OK]    Shadow stack faults
[INFO]  Corrupting shadow stack
[INFO]  Generated shadow stack violation successfully
[OK]    Shadow stack violation test
[INFO]  Gup read -> shstk access success
[INFO]  Gup write -> shstk access success
[INFO]  Violation from normal write
[INFO]  Gup read -> write access success
[INFO]  Violation from normal write
[INFO]  Gup write -> write access success
[INFO]  Cow gup write -> write access success
[OK]    Shadow gup test
[INFO]  Violation from shstk access
[OK]    mprotect() test
[SKIP]  Userfaultfd unavailable.
[OK]    32 bit test

Check kernel IBT with dmesg | grep CET:

CET detected: Indirect Branch Tracking enabled

--------------------------------------------------------------------------
Changes in v3:
1. Moved MSR access check helper to x86 common file. [Mike]
2. Modified cover letter, commit logs and code per review comments. [PeterZ, Binbin, Rick]
3. Fixed an issue on host MSR_IA32_S_CET reload at vm-exit.
4. Rebased on kvm-x86/next [4].
[1]: linux-next: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/?h=next-20230420
[2]: QEMU patch: https://lore.kernel.org/all/20230421041227.90915-1-weijiang.yang@intel.com/
[3]: v2 patchset: https://lore.kernel.org/all/20230421134615.62539-1-weijiang.yang@intel.com/
[4]: Rebase branch: https://github.com/kvm-x86/linux.git, commit: 5c291b93e5d6 (tag: kvm-x86-next-2023.04.26)

Rick Edgecombe (5):
  x86/shstk: Add Kconfig option for shadow stack
  x86/cpufeatures: Add CPU feature flags for shadow stacks
  x86/cpufeatures: Enable CET CR4 bit for shadow stack
  x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  x86/fpu: Add helper for modifying xstate

Sean Christopherson (2):
  KVM:x86: Report XSS as to-be-saved if there are supported features
  KVM:x86: Load guest FPU state when accessing xsaves-managed MSRs

Yang Weijiang (14):
  KVM:x86: Refresh CPUID on write to guest MSR_IA32_XSS
  KVM:x86: Init kvm_caps.supported_xss with supported feature bits
  KVM:x86: Add #CP support in guest exception classification
  KVM:VMX: Introduce CET VMCS fields and control bits
  KVM:x86: Add fault checks for guest CR4.CET setting
  KVM:VMX: Emulate reads and writes to CET MSRs
  KVM:VMX: Add a synthetic MSR to allow userspace to access GUEST_SSP
  KVM:x86: Report CET MSRs as to-be-saved if CET is supported
  KVM:x86: Save/Restore GUEST_SSP to/from SMM state save area
  KVM:VMX: Pass through user CET MSRs to the guest
  KVM:x86: Enable CET virtualization for VMX and advertise to userspace
  KVM:nVMX: Enable user CET support for nested VMX
  KVM:x86: Enable kernel IBT support for guest
  KVM:x86: Support CET supervisor shadow stack MSR access

 arch/x86/Kconfig                         |  24 +++
 arch/x86/Kconfig.assembler               |   5 +
 arch/x86/include/asm/cpufeatures.h       |   2 +
 arch/x86/include/asm/disabled-features.h |   8 +-
 arch/x86/include/asm/fpu/api.h           |   9 ++
 arch/x86/include/asm/fpu/types.h         |  16 ++-
 arch/x86/include/asm/fpu/xstate.h        |   6 +-
 arch/x86/include/asm/kvm_host.h          |   3 +-
 arch/x86/include/asm/vmx.h               |   8 ++
 arch/x86/include/uapi/asm/kvm.h          |   1 +
 arch/x86/include/uapi/asm/kvm_para.h     |   1 +
 arch/x86/kernel/cpu/common.c             |  35 +++++--
 arch/x86/kernel/cpu/cpuid-deps.c         |   1 +
 arch/x86/kernel/fpu/core.c               |  19 ++++
 arch/x86/kernel/fpu/xstate.c             |  90 ++++++++--------
 arch/x86/kvm/cpuid.c                     |  19 +++-
 arch/x86/kvm/cpuid.h                     |   6 ++
 arch/x86/kvm/smm.c                       |  20 ++++
 arch/x86/kvm/vmx/capabilities.h          |   4 +
 arch/x86/kvm/vmx/nested.c                |  29 +++++-
 arch/x86/kvm/vmx/vmcs12.c                |   6 ++
 arch/x86/kvm/vmx/vmcs12.h                |  14 ++-
 arch/x86/kvm/vmx/vmx.c                   | 124 ++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h                   |   6 +-
 arch/x86/kvm/x86.c                       | 122 ++++++++++++++--
 arch/x86/kvm/x86.h                       |  47 ++++++++-
 26 files changed, 543 insertions(+), 82 deletions(-)

base-commit: 5c291b93e5d665380dbecc6944973583f9565ee5
Comments
On Thu, May 11, 2023, Yang Weijiang wrote:
> The last patch is introduced to support supervisor SHSTK but the feature is
> not enabled on Intel platform for now, the main purpose of this patch is to
> facilitate AMD folks to enable the feature.

I am beyond confused by the SDM's wording of CET_SSS.

First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is more appropriate phrasing).

  Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
  shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
  will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
  Architectures Software Developer's Manual, Volume 1).

But then it says VMMs shouldn't set the bit.

  When emulating the CPUID instruction, a virtual-machine monitor should return
  this bit as 0 if those pushes can cause VM exits.

Based on the Xen code (which is sadly a far better source of information than the SDM), I *think* that what the SDM is trying to say is that VMMs should not set CET_SSS if VM-Exits can occur ***and*** the bit is not set in the host CPU. Because if the SDM really means "VMMs should never set the bit", then what on earth is the point of the bit.

> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
> doesn't fully support CET supervisor SHSTK, the enabling work is left for
> the future.

Why? If my interpretation of the SDM is correct, then all the pieces are there.

> Executed all KVM-unit-test cases and KVM selftests against this series, all
> test cases passed except the vmx test, the failure is due to CR4_CET bit
> testing in test_vmxon_bad_cr(). After add CR4_CET bit to skip list, the test
> passed. I'll send a patch to fix this issue later.

Your cover letter from v2 back in April said the same thing. Why hasn't the patch been posted? And what exactly is the issue? IIUC, setting CR4.CET with MSR_IA32_S_CET=0 and MSR_IA32_U_CET=0 should be a nop, which suggests that there's a KVM bug. And if that's the case, the next obvious question is, why are you posting known buggy code?
On Thu, Jun 15, 2023, Sean Christopherson wrote:
> Your cover letter from v2 back in April said the same thing. Why hasn't the patch
> been posted? And what exactly is the issue? IIUC, setting CR4.CET with
> MSR_IA32_S_CET=0 and MSR_IA32_U_CET=0 should be a nop, which suggests that there's
> a KVM bug. And if that's the case, the next obvious question is, why are you
> posting known buggy code?

Ah, is the problem that the test doesn't set CR0.WP as required by CR4.CET=1?
On 6/16/2023 8:00 AM, Sean Christopherson wrote:
> On Thu, Jun 15, 2023, Sean Christopherson wrote:
>> Your cover letter from v2 back in April said the same thing. Why hasn't the patch
>> been posted? And what exactly is the issue? IIUC, setting CR4.CET with
>> MSR_IA32_S_CET=0 and MSR_IA32_U_CET=0 should be a nop, which suggests that there's
>> a KVM bug. And if that's the case, the next obvious question is, why are you
>> posting known buggy code?
> Ah, is the problem that the test doesn't set CR0.WP as required by CR4.CET=1?

Thanks for taking time to review this series!

Yes, it's because the CR0.WP bit is not set while CR4.CET is being set. The check is imposed by patch 12. I'll add the fixup patch together with the next version.
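For reference, the architectural constraint being tripped here is that CR4.CET may only be set while CR0.WP=1. A minimal sketch of that check follows (illustrative only; the function name is invented, not the code from patch 12):

```c
/* Architectural bit positions: CR0.WP is bit 16, CR4.CET is bit 23. */
#define X86_CR0_WP  (1UL << 16)
#define X86_CR4_CET (1UL << 23)

/* Setting CR4.CET while CR0.WP is clear is illegal and must fault;
 * a test that forgets to set CR0.WP first hits exactly this case. */
static int cr4_cet_write_faults(unsigned long cr0, unsigned long cr4)
{
    return (cr4 & X86_CR4_CET) && !(cr0 & X86_CR0_WP);
}
```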
On 6/16/2023 7:30 AM, Sean Christopherson wrote:
> On Thu, May 11, 2023, Yang Weijiang wrote:
>> The last patch is introduced to support supervisor SHSTK but the feature is
>> not enabled on Intel platform for now, the main purpose of this patch is to
>> facilitate AMD folks to enable the feature.
> I am beyond confused by the SDM's wording of CET_SSS.
>
> First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
> more appropriate phrasing).
>
>   Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
>   shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
>   will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
>   Architectures Software Developer's Manual, Volume 1).
>
> But then it says VMMs shouldn't set the bit.
>
>   When emulating the CPUID instruction, a virtual-machine monitor should return
>   this bit as 0 if those pushes can cause VM exits.
>
> Based on the Xen code (which is sadly a far better source of information than the
> SDM), I *think* that what the SDM is trying to say is that VMMs should not set
> CET_SSS if VM-Exits can occur ***and*** the bit is not set in the host CPU. Because
> if the SDM really means "VMMs should never set the bit", then what on earth is the
> point of the bit.

I need to double-check the vague description.

From my understanding, on the bare metal side, if the bit is 1, the OS can enable SSS if pushes won't cause page faults. But for the VM case, it's not recommended (regardless of the bit state) to set the bit, as vm-exits caused by guest SSS pushes cannot be fully excluded.

In other words, the bit is mainly for bare metal guidance now.

>> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
>> doesn't fully support CET supervisor SHSTK, the enabling work is left for
>> the future.
> Why? If my interpretation of the SDM is correct, then all the pieces are there.

My assumption is that VM supervisor SHSTK depends on bare metal kernel support, as the PL0_SSP MSR is backed by XSAVES via IA32_XSS bit 12 (CET_S), but this part of the support is not in Rick's native series. And also, based on the above SDM description, I don't want to add the support blindly now.

> [...]
On Fri, Jun 16, 2023, Weijiang Yang wrote:
> On 6/16/2023 7:30 AM, Sean Christopherson wrote:
>> On Thu, May 11, 2023, Yang Weijiang wrote:
>>> The last patch is introduced to support supervisor SHSTK but the feature is
>>> not enabled on Intel platform for now, the main purpose of this patch is to
>>> facilitate AMD folks to enable the feature.
>> [...]
>
> I need to double check for the vague description.
>
> From my understanding, on bare metal side, if the bit is 1, OS can enable
> SSS if pushes won't cause page fault. But for VM case, it's not recommended
> (regardless of the bit state) to set the bit as vm-exits caused by guest SSS
> pushes cannot be fully excluded.
>
> In other word, the bit is mainly for bare metal guidance now.
>
>>> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
>>> doesn't fully support CET supervisor SHSTK, the enabling work is left for
>>> the future.
>> Why? If my interpretation of the SDM is correct, then all the pieces are there.

...

> And also based on above SDM description, I don't want to add the support
> blindly now.

*sigh*

I got filled in on the details offlist.

1) In the next version of this series, please rework it to reincorporate Supervisor Shadow Stack support into the main series, i.e. pretend Intel's implementation isn't horribly flawed. KVM can't guarantee that a VM-Exit won't occur, i.e. can't advertise CET_SS, but I want the baseline support to be implemented, otherwise the series as a whole is a big confusing mess with unanswered questions left, right, and center. And more importantly, architecturally SSS exists if X86_FEATURE_SHSTK is enumerated, i.e. the guest should be allowed to utilize SSS if it so chooses, with the obvious caveat that there's a non-zero chance the guest risks death by doing so. Or if userspace can ensure no VM-Exit will occur, which is difficult but feasible (ignoring #MC), e.g. by statically partitioning memory, prefaulting all memory in guest firmware, and not dirty logging SSS pages. In such an extreme setup, userspace can enumerate CET_SSS to the guest, and KVM should support that.

2) Add the below patch to document exactly why KVM doesn't advertise CET_SSS. While Intel is apparently ok with treating KVM developers like mushrooms, I am not.

---
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 16 Jun 2023 10:04:37 -0700
Subject: [PATCH] KVM: x86: Explicitly document that KVM must not advertise CET_SSS

Explicitly call out that KVM must NOT advertise CET_SSS to userspace, i.e. must not tell userspace and thus the guest that it is safe for the guest to enable Supervisor Shadow Stacks (SSS).

Intel's implementation of SSS is fatally flawed for virtualized environments, as despite wording in the SDM that suggests otherwise, Intel CPUs' handling of shadow stack switches is NOT fully atomic. Only the check-and-update of the supervisor shadow stack token's busy bit is atomic.
Per the SDM:

  If the far CALL or event delivery pushes a stack frame after the token
  is acquired and any of the pushes causes a fault or VM exit, the
  processor will revert to the old shadow stack and the busy bit in the
  new shadow stack's token remains set.

Or more bluntly, any fault or VM-Exit that occurs when pushing to the shadow stack after the busy bit is set is fatal to the kernel, i.e. to the guest in KVM's case. The (guest) kernel can protect itself against faults, e.g. by ensuring that the shadow stack always has a valid mapping, but a guest kernel obviously has no control over, or even knowledge of, VM-Exits due to host activity.

To help software determine when it is safe to use SSS, Intel defined CPUID.0x7.1.EDX bit 18 (CET_SSS) and updated Intel CPUs to enumerate CET_SSS, i.e. bare metal Intel CPUs advertise to software that it is safe to enable SSS:

  If CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is enumerated as 1, it is
  sufficient for an operating system to ensure that none of the pushes can
  cause a page fault.

But CET_SSS also comes with a major caveat that is kinda sorta documented in the SDM:

  When emulating the CPUID instruction, a virtual-machine monitor should
  return this bit as 0 if those pushes can cause VM exits.

In other words, CET_SSS (bit 18) does NOT enumerate that the underlying CPU prevents VM-Exits, only that the environment in which the software is running will not generate VM-Exits. I.e. CET_SSS is a stopgap to stem the bleeding and allow kernels to enable SSS, not an indication that the underlying CPU is immune to the VM-Exit problem.

And unfortunately, KVM itself effectively has zero chance of ensuring that a shadow stack switch can't trigger a VM-Exit, e.g. KVM zaps *all* SPTEs when any memslot is deleted, enabling dirty logging write-protects SPTEs, etc. A sufficiently motivated userspace can, at least in theory, provide a safe environment for SSS, e.g. by statically partitioning and prefaulting (in guest firmware) all memory, disabling PML, never write-protecting guest shadow stacks, etc. But such a setup is far, far beyond typical KVM deployments.

Note, AMD CPUs have a similar erratum, but AMD CPUs *DO* perform the full shadow stack switch atomically so long as the stack is mapped WB and does not cross a page boundary, i.e. a "normal" KVM setup and a well-behaved guest play nice with SSS without additional shenanigans.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/cpuid.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 1e3ee96c879b..ecf4a68aaa08 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -658,7 +658,15 @@ void kvm_set_cpu_caps(void)
 	);
 
 	kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX,
-		F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI)
+		F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) |
+
+		/*
+		 * Do NOT advertise CET_SSS, i.e. do not tell userspace and the
+		 * guest that it is safe to use Supervisor Shadow Stacks under
+		 * KVM when running on Intel CPUs.  KVM itself cannot guarantee
+		 * that a VM-Exit won't occur during a shadow stack update.
+		 */
+		0 /* F(CET_SSS) */
 	);
 
 	kvm_cpu_cap_mask(CPUID_D_1_EAX,

base-commit: 9305c14847719870e9e08294034861360577ce08
--
On 6/17/2023 1:56 AM, Sean Christopherson wrote:
> On Fri, Jun 16, 2023, Weijiang Yang wrote:
>> [...]
>>
>> And also based on above SDM description, I don't want to add the support
>> blindly now.
> *sigh*
>
> I got filled in on the details offlist.
>
> 1) In the next version of this series, please rework it to reincorporate Supervisor
>    Shadow Stack support into the main series, i.e. pretend Intel's implementation
>    isn't horribly flawed.

Let me make it clear, you want me to do two things:

1) Add Supervisor Shadow Stack state support (i.e., XSS.bit12 (CET_S)) to the kernel so that the host can support guest Supervisor Shadow Stack MSRs in the guest/host FPU context switch.

2) Add Supervisor Shadow Stack support to the KVM part so that the guest OS is able to use SSS, at its own risk.

Is that correct?

> KVM can't guarantee that a VM-Exit won't occur, i.e.
> can't advertise CET_SS, but I want the baseline support to be implemented,
> otherwise the series as a whole is a big confusing mess with unanswered questions
> left, right, and center. And more importantly, architecturally SSS exists if
> X86_FEATURE_SHSTK is enumerated, i.e. the guest should be allowed to utilize
> SSS if it so chooses, with the obvious caveat that there's a non-zero chance
> the guest risks death by doing so. Or if userspace can ensure no VM-Exit will
> occur, which is difficult but feasible (ignoring #MC), e.g. by statically
> partitioning memory, prefaulting all memory in guest firmware, and not dirty
> logging SSS pages. In such an extreme setup, userspace can enumerate CET_SSS
> to the guest, and KVM should support that.

Makes sense: provide the support, but the guest takes the risk on its own.

> 2) Add the below patch to document exactly why KVM doesn't advertise CET_SSS.
>    While Intel is apparently ok with treating KVM developers like mushrooms, I
>    am not.
will add it, thanks a lot for detailed change logs! > > --- > From: Sean Christopherson <seanjc@google.com> > Date: Fri, 16 Jun 2023 10:04:37 -0700 > Subject: [PATCH] KVM: x86: Explicitly document that KVM must not advertise > CET_SSS > > Explicitly call out that KVM must NOT advertise CET_SSS to userspace, > i.e. must not tell userspace and thus the guest that it is safe for the > guest to enable Supervisor Shadow Stacks (SSS). > > Intel's implementation of SSS is fatally flawed for virtualized > environments, as despite wording in the SDM that suggests otherwise, > Intel CPUs' handling of shadow stack switches are NOT fully atomic. Only > the check-and-update of the supervisor shadow stack token's busy bit is > atomic. Per the SDM: > > If the far CALL or event delivery pushes a stack frame after the token > is acquired and any of the pushes causes a fault or VM exit, the > processor will revert to the old shadow stack and the busy bit in the > new shadow stack's token remains set. > > Or more bluntly, any fault or VM-Exit that occurs when pushing to the > shadow stack after the busy bit is set is fatal to the kernel, i.e. to > the guest in KVM's case. The (guest) kernel can protect itself against > faults, e.g. by ensuring that the shadow stack always has a valid mapping, > but a guest kernel obviously has no control over, or even knowledge of, > VM-Exits due to host activity. > > To help software determine when it is safe to use SSS, Intel defined > CPUID.0x7.1.EDX bit (CET_SSS) and updated Intel CPUs to enumerate CET_SS, > i.e. bare metal Intel CPUs advertise to software that it is safe to enable > SSS. > > If CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is enumerated as 1, it is > sufficient for an operating system to ensure that none of the pushes can > cause a page fault. 
> > But CET_SSS also comes with a major caveat that is kinda sorta documented > in the SDM: > > When emulating the CPUID instruction, a virtual-machine monitor should > return this bit as 0 if those pushes can cause VM exits. > > In other words, CET_SSS (bit 18) does NOT enumerate that the underlying > CPU prevents VM-Exits, only that the environment in which the software is > running will not generate VM-Exits. I.e. CET_SSS is a stopgap to stem the > bleeding and allow kernels to enable SSS, not an indication that the > underlying CPU is immune to the VM-Exit problem. > > And unfortunately, KVM itself effectively has zero chance of ensuring that > a shadow stack switch can't trigger a VM-Exit, e.g. KVM zaps *all* SPTEs > when any memslot is deleted, enabling dirty logging write-protects SPTEs, > etc. A sufficiently motivated userspace can, at least in theory, provide > a safe environment for SSS, e.g. by statically partitioning and > prefaulting (in guest firmware) all memory, disabling PML, never > write-protecting guest shadow stacks, etc. But such a setup is far, far > beyond typical KVM deployments. > > Note, AMD CPUs have a similar erratum, but AMD CPUs *DO* perform the full > shadow stack switch atomically so long as the stack is mapped WB and does > not cross a page boundary, i.e. a "normal" KVM setup and a well-behaved > guest play nice with SSS without additional shenanigans. > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/cpuid.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c > index 1e3ee96c879b..ecf4a68aaa08 100644 > --- a/arch/x86/kvm/cpuid.c > +++ b/arch/x86/kvm/cpuid.c > @@ -658,7 +658,15 @@ void kvm_set_cpu_caps(void) > ); > > kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX, > - F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) > + F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) | > + > + /* > + * Do NOT advertise CET_SSS, i.e.
do not tell userspace and the > + * guest that it is safe to use Supervisor Shadow Stacks under > + * KVM when running on Intel CPUs. KVM itself cannot guarantee > + * that a VM-Exit won't occur during a shadow stack update. > + */ > + 0 /* F(CET_SSS) */ > ); > > kvm_cpu_cap_mask(CPUID_D_1_EAX, > > base-commit: 9305c14847719870e9e08294034861360577ce08
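For readers following along, the enumeration at issue is CPUID.(EAX=07H,ECX=1H):EDX[bit 18]. A minimal sketch of the decode only (the helper name is invented for illustration; this is not KVM code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CET_SSS is CPUID.(EAX=07H,ECX=1H):EDX[bit 18]. */
#define CET_SSS_BIT 18

/* Illustrative helper: decode CET_SSS from a raw EDX value. */
static bool cet_sss_enumerated(uint32_t edx)
{
    return (edx >> CET_SSS_BIT) & 1;
}
```

On bare metal a kernel would feed this the EDX output of the CPUID instruction; the whole point of the patch above is that KVM never sets this bit for a guest.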
On Mon, Jun 19, 2023, Weijiang Yang wrote: > > On 6/17/2023 1:56 AM, Sean Christopherson wrote: > > On Fri, Jun 16, 2023, Weijiang Yang wrote: > > > On 6/16/2023 7:30 AM, Sean Christopherson wrote: > > > > On Thu, May 11, 2023, Yang Weijiang wrote: > > > > > The last patch is introduced to support supervisor SHSTK but the feature is > > > > > not enabled on Intel platform for now, the main purpose of this patch is to > > > > > facilitate AMD folks to enable the feature. > > > > I am beyond confused by the SDM's wording of CET_SSS. > > > > > > > > First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is > > > > more appropriate phrasing). > > > > > > > > Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor > > > > shadow stacks as long as it ensures that certain supervisor shadow-stack pushes > > > > will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32 > > > > Architectures Software Developer’s Manual, Volume 1). > > > > > > > > But then it says VMMs shouldn't set the bit. > > > > > > > > When emulating the CPUID instruction, a virtual-machine monitor should return > > > > this bit as 0 if those pushes can cause VM exits. > > > > > > > > Based on the Xen code (which is sadly a far better source of information than the > > > > SDM), I *think* that what the SDM is trying to say is that VMMs should not set > > > > CET_SSS if VM-Exits can occur ***and*** the bit is not set in the host CPU. Because > > > > if the SDM really means "VMMs should never set the bit", then what on earth is the > > > > point of the bit. > > > I need to double check on the vague description. > > > > > > From my understanding, on the bare metal side, if the bit is 1, the OS can enable > > > SSS if pushes won't cause page faults. But for the VM case, it's not recommended > > > (regardless of the bit state) to set the bit as vm-exits caused by guest SSS > > > pushes cannot be fully excluded.
> > > > > > In other words, the bit is mainly for bare metal guidance now. > > > > > > > > In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but > > > > > doesn't fully support CET supervisor SHSTK, the enabling work is left for > > > > > the future. > > > > Why? If my interpretation of the SDM is correct, then all the pieces are there. > > ... > > > > > And also based on the above SDM description, I don't want to add the support > > > blindly now. > > *sigh* > > > > I got filled in on the details offlist. > > > > 1) In the next version of this series, please rework it to reincorporate Supervisor > > Shadow Stack support into the main series, i.e. pretend Intel's implementation > > isn't horribly flawed. > > Let me make it clear: you want me to do two things: > > 1) Add Supervisor Shadow Stack state support (i.e., XSS.bit12 (CET_S)) into > the kernel so that the host can support guest Supervisor Shadow Stack MSRs in g/h FPU > context switch. If that's necessary for correct functionality, yes. > 2) Add Supervisor Shadow Stack support into the KVM part so that the guest OS is > able to use SSS at its own risk. Yes. Architecturally, if KVM advertises X86_FEATURE_SHSTK, then KVM needs to provide both User and Supervisor support. CET_SSS doesn't change the architecture, it's little more than a hint. And even if the guest follows SDM's recommendation to not enable shadow stacks, a clever kernel can still utilize SSS assets, e.g. use the MSRs as scratch registers.
On 6/24/2023 4:51 AM, Sean Christopherson wrote: > On Mon, Jun 19, 2023, Weijiang Yang wrote: >> On 6/17/2023 1:56 AM, Sean Christopherson wrote: >>> On Fri, Jun 16, 2023, Weijiang Yang wrote: >>>> On 6/16/2023 7:30 AM, Sean Christopherson wrote: >>>>> On Thu, May 11, 2023, Yang Weijiang wrote: >>>>>> The last patch is introduced to support supervisor SHSTK but the feature is >>>>>> not enabled on Intel platform for now, the main purpose of this patch is to >>>>>> facilitate AMD folks to enable the feature. >>>>> I am beyond confused by the SDM's wording of CET_SSS. >>>>> >>>>> First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is >>>>> more appropriate phrasing). >>>>> >>>>> Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor >>>>> shadow stacks as long as it ensures that certain supervisor shadow-stack pushes >>>>> will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32 >>>>> Architectures Software Developer’s Manual, Volume 1). >>>>> >>>>> But then it says VMMs shouldn't set the bit. >>>>> >>>>> When emulating the CPUID instruction, a virtual-machine monitor should return >>>>> this bit as 0 if those pushes can cause VM exits. >>>>> >>>>> Based on the Xen code (which is sadly a far better source of information than the >>>>> SDM), I *think* that what the SDM is trying to say is that VMMs should not set >>>>> CET_SSS if VM-Exits can occur ***and*** the bit is not set in the host CPU. Because >>>>> if the SDM really means "VMMs should never set the bit", then what on earth is the >>>>> point of the bit. >>>> I need to double check on the vague description. >>>> >>>> From my understanding, on the bare metal side, if the bit is 1, the OS can enable >>>> SSS if pushes won't cause page faults. But for the VM case, it's not recommended >>>> (regardless of the bit state) to set the bit as vm-exits caused by guest SSS >>>> pushes cannot be fully excluded.
>>>> >>>> In other words, the bit is mainly for bare metal guidance now. >>>> >>>>>> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but >>>>>> doesn't fully support CET supervisor SHSTK, the enabling work is left for >>>>>> the future. >>>>> Why? If my interpretation of the SDM is correct, then all the pieces are there. >>> ... >>> >>>> And also based on the above SDM description, I don't want to add the support >>>> blindly now. >>> *sigh* >>> >>> I got filled in on the details offlist. >>> >>> 1) In the next version of this series, please rework it to reincorporate Supervisor >>> Shadow Stack support into the main series, i.e. pretend Intel's implementation >>> isn't horribly flawed. >> Let me make it clear: you want me to do two things: >> >> 1) Add Supervisor Shadow Stack state support (i.e., XSS.bit12 (CET_S)) into >> the kernel so that the host can support guest Supervisor Shadow Stack MSRs in g/h FPU >> context switch. > If that's necessary for correct functionality, yes. > >> 2) Add Supervisor Shadow Stack support into the KVM part so that the guest OS is >> able to use SSS at its own risk. > Yes. Architecturally, if KVM advertises X86_FEATURE_SHSTK, then KVM needs to > provide both User and Supervisor support. CET_SSS doesn't change the architecture, > it's little more than a hint. And even if the guest follows SDM's recommendation > to not enable shadow stacks, a clever kernel can still utilize SSS assets, e.g. use > the MSRs as scratch registers. Understood, thanks!
> *sigh* > > I got filled in on the details offlist. > > 1) In the next version of this series, please rework it to reincorporate Supervisor > Shadow Stack support into the main series, i.e. pretend Intel's implementation > isn't horribly flawed. KVM can't guarantee that a VM-Exit won't occur, i.e. > can't advertise CET_SSS, but I want the baseline support to be implemented, > otherwise the series as a whole is a big confusing mess with unanswered questions > left, right, and center. And more importantly, architecturally SSS exists if > X86_FEATURE_SHSTK is enumerated, i.e. the guest should be allowed to utilize > SSS if it so chooses, with the obvious caveat that there's a non-zero chance > the guest risks death by doing so. Or if userspace can ensure no VM-Exit will > occur, which is difficult but feasible (ignoring #MC), e.g. by statically > partitioning memory, prefaulting all memory in guest firmware, and not dirty > logging SSS pages. In such an extreme setup, userspace can enumerate CET_SSS > to the guest, and KVM should support that. > > 2) Add the below patch to document exactly why KVM doesn't advertise CET_SSS. > While Intel is apparently ok with treating KVM developers like mushrooms, I > am not. > > --- > From: Sean Christopherson <seanjc@google.com> > Date: Fri, 16 Jun 2023 10:04:37 -0700 > Subject: [PATCH] KVM: x86: Explicitly document that KVM must not advertise > CET_SSS > > Explicitly call out that KVM must NOT advertise CET_SSS to userspace, > i.e. must not tell userspace and thus the guest that it is safe for the > guest to enable Supervisor Shadow Stacks (SSS). > > Intel's implementation of SSS is fatally flawed for virtualized > environments, as despite wording in the SDM that suggests otherwise, > Intel CPUs' handling of shadow stack switches is NOT fully atomic. Only > the check-and-update of the supervisor shadow stack token's busy bit is > atomic.
Per the SDM: > > If the far CALL or event delivery pushes a stack frame after the token > is acquired and any of the pushes causes a fault or VM exit, the > processor will revert to the old shadow stack and the busy bit in the > new shadow stack's token remains set. > > Or more bluntly, any fault or VM-Exit that occurs when pushing to the > shadow stack after the busy bit is set is fatal to the kernel, i.e. to > the guest in KVM's case. The (guest) kernel can protect itself against > faults, e.g. by ensuring that the shadow stack always has a valid mapping, > but a guest kernel obviously has no control over, or even knowledge of, > VM-Exits due to host activity. > > To help software determine when it is safe to use SSS, Intel defined > CPUID.0x7.1.EDX bit (CET_SSS) and updated Intel CPUs to enumerate CET_SSS, > i.e. bare metal Intel CPUs advertise to software that it is safe to enable > SSS. > > If CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is enumerated as 1, it is > sufficient for an operating system to ensure that none of the pushes can > cause a page fault. > > But CET_SSS also comes with a major caveat that is kinda sorta documented > in the SDM: > > When emulating the CPUID instruction, a virtual-machine monitor should > return this bit as 0 if those pushes can cause VM exits. > > In other words, CET_SSS (bit 18) does NOT enumerate that the underlying > CPU prevents VM-Exits, only that the environment in which the software is > running will not generate VM-Exits. I.e. CET_SSS is a stopgap to stem the > bleeding and allow kernels to enable SSS, not an indication that the > underlying CPU is immune to the VM-Exit problem. > > And unfortunately, KVM itself effectively has zero chance of ensuring that > a shadow stack switch can't trigger a VM-Exit, e.g. KVM zaps *all* SPTEs > when any memslot is deleted, enabling dirty logging write-protects SPTEs, > etc. A sufficiently motivated userspace can, at least in theory, provide > a safe environment for SSS, e.g.
by statically partitioning and > prefaulting (in guest firmware) all memory, disabling PML, never > write-protecting guest shadow stacks, etc. But such a setup is far, far > beyond typical KVM deployments. > > Note, AMD CPUs have a similar erratum, but AMD CPUs *DO* perform the full > shadow stack switch atomically so long as the stack is mapped WB and does > not cross a page boundary, i.e. a "normal" KVM setup and a well-behaved > guest play nice with SSS without additional shenanigans. > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/cpuid.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c > index 1e3ee96c879b..ecf4a68aaa08 100644 > --- a/arch/x86/kvm/cpuid.c > +++ b/arch/x86/kvm/cpuid.c > @@ -658,7 +658,15 @@ void kvm_set_cpu_caps(void) > ); > > kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX, > - F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) > + F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) | > + > + /* > + * Do NOT advertise CET_SSS, i.e. do not tell userspace and the > + * guest that it is safe to use Supervisor Shadow Stacks under > + * KVM when running on Intel CPUs. KVM itself cannot guarantee > + * that a VM-Exit won't occur during a shadow stack update. > + */ > + 0 /* F(CET_SSS) */ > ); > > kvm_cpu_cap_mask(CPUID_D_1_EAX, > > base-commit: 9305c14847719870e9e08294034861360577ce08 Hi, Sean, Gil reminded me that the SDM was recently updated on CET SSS related topics (June release): ====================================================================== Section 17.2.3 (Supervisor Shadow Stack Token) in Volume 1 of the SDM: If the far CALL or event delivery pushes a stack frame after the token is acquired and any of the pushes causes a fault or VM exit, the processor will revert to the old shadow stack and the busy bit in the new shadow stack's token remains set. The new shadow stack is said to be prematurely busy.
Software should enable supervisor shadow stacks only if it is certain that this situation cannot occur. If CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is enumerated as 1, it is sufficient for an operating system to ensure that none of the pushes can cause a page fault. Volume 2A: CPUID.(EAX=07H,ECX=1H):EDX[bit 18] as follows: CET_SSS. If 1, indicates that an operating system can enable supervisor shadow stacks as long as it ensures that a supervisor shadow stack cannot become prematurely busy due to page faults (see Section 17.2.3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1). When emulating the CPUID instruction, a virtual-machine monitor (VMM) should return this bit as 1 only if it ensures that VM exits cannot cause a guest supervisor shadow stack to appear to be prematurely busy. Such a VMM could set the “prematurely busy shadow stack” VM-exit control and use the additional information that it provides. Volume 3C: new “prematurely busy shadow stack” VM-exit control. ======================================================================== And Gil told me additional information was planned to be released later in the summer. Maybe you need to modify the above changelog a bit per the update. Given that the updated parts are a technical forecast, I don't plan to implement it in this series and still enumerate CET_SSS == 0 for the guest. What are your thoughts?
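To make the failure mode in the quoted SDM text concrete, here is a toy single-threaded model (all names invented; no real atomics, paging, or VMX involved) of the token handoff: the busy bit is acquired first, and a fault or VM exit during the subsequent pushes reverts the switch without clearing the bit, leaving the new stack prematurely busy:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a supervisor shadow-stack token (invented structure). */
struct shstk {
    bool token_busy; /* busy bit in the shadow-stack token */
};

/*
 * Model of a far CALL / event delivery switching shadow stacks.  Only the
 * check-and-update of the busy bit is atomic; pushes after that point may
 * fault or cause a VM exit (modeled by push_faults).  Returns true if the
 * switch completed, false if it was refused or aborted.
 */
static bool switch_shstk(struct shstk *ss, bool push_faults)
{
    if (ss->token_busy)
        return false;      /* token already busy: switch refused */
    ss->token_busy = true; /* the atomic check-and-update succeeds */
    if (push_faults)
        return false;      /* revert to old stack; busy bit stays set */
    return true;
}
```

Note how a single aborted switch leaves token_busy set, so every later switch onto that stack is refused; in a guest, a host-induced VM exit at the wrong moment has the same effect, which is exactly why KVM cannot promise a safe environment.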
On Mon, Jul 10, 2023, Weijiang Yang wrote: > Maybe you need to modify the above changelog a bit per the update. Ya, I'll make sure the changelog gets updated before CET support is merged, though feel free to post the next version without waiting for the new changelog. > Given that the updated parts are a technical forecast, I don't plan to implement it > in this series and still enumerate > > CET_SSS == 0 for the guest. What are your thoughts? Yes, definitely punt shadow-stack fixup to future enabling work.
On 7/11/2023 6:18 AM, Sean Christopherson wrote: > On Mon, Jul 10, 2023, Weijiang Yang wrote: >> Maybe you need to modify the above changelog a bit per the update. > Ya, I'll make sure the changelog gets updated before CET support is merged, though > feel free to post the next version without waiting for the new changelog. Sure, thanks! >> Given that the updated parts are a technical forecast, I don't plan to implement it >> in this series and still enumerate >> >> CET_SSS == 0 for the guest. What are your thoughts? > Yes, definitely punt shadow-stack fixup to future enabling work. Got it.
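The plan agreed above (keep CET_SSS == 0 for the guest) boils down to clearing bit 18 when the VMM emulates CPUID.(EAX=07H,ECX=1H):EDX. A standalone sketch of that masking (the helper name is invented and is independent of KVM's kvm_cpu_cap machinery):

```c
#include <assert.h>
#include <stdint.h>

#define CET_SSS_BIT 18

/*
 * Illustrative helper (not KVM code): pass the host's CPUID.7.1 EDX
 * through to the guest but force CET_SSS to 0, since the VMM cannot
 * rule out VM exits during a supervisor shadow-stack switch.
 */
static uint32_t emulated_cpuid_7_1_edx(uint32_t host_edx)
{
    return host_edx & ~(UINT32_C(1) << CET_SSS_BIT);
}
```

All other host bits are passed through untouched; only the one bit whose promise the VMM cannot keep is cleared.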
On 6/24/2023 4:51 AM, Sean Christopherson wrote: > On Mon, Jun 19, 2023, Weijiang Yang wrote: >> On 6/17/2023 1:56 AM, Sean Christopherson wrote: >>> On Fri, Jun 16, 2023, Weijiang Yang wrote: >>>> On 6/16/2023 7:30 AM, Sean Christopherson wrote: >>>>> On Thu, May 11, 2023, Yang Weijiang wrote: >>>>> [...] >> Let me make it clear: you want me to do two things: >> >> 1) Add Supervisor Shadow Stack state support (i.e., XSS.bit12 (CET_S)) into >> the kernel so that the host can support guest Supervisor Shadow Stack MSRs in g/h FPU >> context switch. > If that's necessary for correct functionality, yes. Hi, Sean, I held off posting the new version and want to sync up with you on this point to avoid surprising you. After discussing adding the patch in the kernel with Rick and Chao, we reached the below conclusions on doing so: the Pros: - Super easy to implement for KVM. - Automatically avoids saving and restoring this data when the vmexit is handled within KVM. the Cons: - Unnecessarily restores XFEATURE_CET_KERNEL when switching to non-KVM task's userspace. - Forces allocating space for this state on all tasks, whether or not they use KVM, and with likely zero users today and the near future. - Complicates the FPU optimization thinking by including things that can have no effect on userspace in the FPU Given the above reasons, I implemented guest CET supervisor states management in KVM instead of adding a kernel patch for it. Below are 3 KVM patches to support it: Patch 1: Save/reload guest CET supervisor states when necessary: ======================================================================= commit 16147ede75dee29583b7d42a6621d10d55b63595 Author: Yang Weijiang <weijiang.yang@intel.com> Date: Tue Jul 11 02:26:17 2023 -0400 KVM:x86: Make guest supervisor states as non-XSAVE managed Save and reload guest CET supervisor states, i.e., PL{0,1,2}_SSP, when vCPU context is being swapped before and after userspace <->kernel entry, also do the same operation when vCPU is sched-in or sched-out.
Enabling CET supervisor state management only in KVM due to: 1) Currently, supervisor SHSTK is not enabled on the host side, only KVM needs to care about the guest's supervisor SHSTK states. 2) Enabling them in the kernel FPU state framework has global effects on all threads on the host kernel, but the majority of the threads are free of CET supervisor states. And it requires additional storage in the thread FPU state area. Add a new helper kvm_arch_sched_out() for that purpose. Adding the support in kvm_arch_vcpu_put/load() without the new helper looks possible, but the put/load functions are also called in vcpu_put()/load(), the latter are heavily used in KVM, so adding a new helper makes the implementation clearer. Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 7e7e19ef6993..98235cb3d258 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -1023,6 +1023,7 @@ void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu); static inline void kvm_arch_sync_events(struct kvm *kvm) {} static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} +static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {} void kvm_arm_init_debug(void); void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu); diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index 957121a495f0..56c5e85ba5a3 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -893,6 +893,7 @@ static inline void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) {} static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {} static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} +static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {} static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {} static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu
*vcpu) {} diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 14ee0dece853..11587d953bf6 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -880,6 +880,7 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) {} static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {} static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {} static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} +static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {} static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {} static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {} diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h index ee0acccb1d3b..6ff4a04fe0f2 100644 --- a/arch/riscv/include/asm/kvm_host.h +++ b/arch/riscv/include/asm/kvm_host.h @@ -244,6 +244,7 @@ struct kvm_vcpu_arch { static inline void kvm_arch_sync_events(struct kvm *kvm) {} static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} +static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {} #define KVM_ARCH_WANT_MMU_NOTIFIER diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h index 2bbc3d54959d..d1750a6a86cf 100644 --- a/arch/s390/include/asm/kvm_host.h +++ b/arch/s390/include/asm/kvm_host.h @@ -1033,6 +1033,7 @@ extern int kvm_s390_gisc_unregister(struct kvm *kvm, u32 gisc); static inline void kvm_arch_sync_events(struct kvm *kvm) {} static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} +static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {} static inline void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) {} static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {} diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e2c549f147a5..7d9cfb7e2fe8 100644 --- a/arch/x86/kvm/x86.c +++ 
b/arch/x86/kvm/x86.c @@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) trace_kvm_fpu(0); } +static void kvm_save_cet_supervisor_ssp(struct kvm_vcpu *vcpu) +{ + preempt_disable(); + if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) { + rdmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]); + rdmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]); + rdmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]); + wrmsrl(MSR_IA32_PL0_SSP, 0); + wrmsrl(MSR_IA32_PL1_SSP, 0); + wrmsrl(MSR_IA32_PL2_SSP, 0); + } + preempt_enable(); +} + +static void kvm_reload_cet_supervisor_ssp(struct kvm_vcpu *vcpu) +{ + preempt_disable(); + if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) { + wrmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]); + wrmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]); + wrmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]); + } + preempt_enable(); +} + int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) { struct kvm_queued_exception *ex = &vcpu->arch.exception; @@ -11222,6 +11247,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) kvm_sigset_activate(vcpu); kvm_run->flags = 0; kvm_load_guest_fpu(vcpu); + kvm_reload_cet_supervisor_ssp(vcpu); kvm_vcpu_srcu_read_lock(vcpu); if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) { @@ -11310,6 +11336,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) r = vcpu_run(vcpu); out: + kvm_save_cet_supervisor_ssp(vcpu); kvm_put_guest_fpu(vcpu); if (kvm_run->kvm_valid_regs) store_regs(vcpu); @@ -12398,9 +12425,17 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) pmu->need_cleanup = true; kvm_make_request(KVM_REQ_PMU, vcpu); } + + kvm_reload_cet_supervisor_ssp(vcpu); + static_call(kvm_x86_sched_in)(vcpu, cpu); } +void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) +{ + kvm_save_cet_supervisor_ssp(vcpu); +} + void kvm_arch_free_vm(struct kvm *kvm) { kfree(to_kvm_hv(kvm)->hv_pa_pg); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index d90331f16db1..b3032a5f0641 100644 
--- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1423,6 +1423,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu); void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu); +void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu); void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu); void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 66c1447d3c7f..42f28e8905e1 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -5885,6 +5885,7 @@ static void kvm_sched_out(struct preempt_notifier *pn, { struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn); + kvm_arch_sched_out(vcpu, 0); if (current->on_rq) { WRITE_ONCE(vcpu->preempted, true); WRITE_ONCE(vcpu->ready, true); Patch 2: optimization patch for the above one: =================================================================== commit ae5fe7c81cc3b93193758d1b7b4ab74a92a51dad Author: Yang Weijiang <weijiang.yang@intel.com> Date: Fri Jul 14 20:03:52 2023 -0400 KVM:x86: Optimize CET supervisor SSP save/reload Make PL{0,1,2}_SSP write-intercepted to detect whether the guest is using these MSRs. Disable interception of the MSRs if they're written with non-zero values. KVM does save/reload for the MSRs only if they're used by the guest.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 69cbc9d9b277..c50b555234fb 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -748,6 +748,7 @@ struct kvm_vcpu_arch { bool tpr_access_reporting; bool xsaves_enabled; bool xfd_no_write_intercept; + bool cet_sss_active; u64 ia32_xss; u64 microcode_version; u64 arch_capabilities; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 90ce1c7d3fd7..21c89d200c88 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2156,6 +2156,18 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated return debugctl; } +static void vmx_disable_write_intercept_sss_msr(struct kvm_vcpu *vcpu) +{ + if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) { + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP, + MSR_TYPE_RW, false); + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP, + MSR_TYPE_RW, false); + vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP, + MSR_TYPE_RW, false); + } +} + /* * Writes msr value into the appropriate "register". * Returns 0 on success, non-0 otherwise. @@ -2427,7 +2439,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) #define VMX_CET_CONTROL_MASK (~GENMASK_ULL(9,6)) #define LEG_BITMAP_BASE(data) ((data) >> 12) case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP: - return kvm_set_msr_common(vcpu, msr_info); + if (kvm_set_msr_common(vcpu, msr_info)) + return 1; + /* + * Write to the base SSP MSRs should happen ahead of toggling + * of IA32_S_CET.SH_STK_EN bit. 
+ */ + if (!msr_info->host_initiated && + msr_index != MSR_IA32_PL3_SSP && data) { + vmx_disable_write_intercept_sss_msr(vcpu); + wrmsrl(msr_index, data); + } break; case MSR_IA32_U_CET: case MSR_IA32_S_CET: @@ -7773,12 +7795,17 @@ static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu) MSR_TYPE_RW, false); vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET, MSR_TYPE_RW, false); + /* + * Supervisor shadow stack MSRs are intercepted until + * they're written by guest, this is designed to + * optimize the save/restore overhead. + */ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP, - MSR_TYPE_RW, false); + MSR_TYPE_R, false); vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP, - MSR_TYPE_RW, false); + MSR_TYPE_R, false); vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP, - MSR_TYPE_RW, false); + MSR_TYPE_R, false); vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cab31dbb2bec..06dc5111da3b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4049,8 +4049,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if (!IS_ALIGNED(data, 4)) return 1; if (msr == MSR_IA32_PL0_SSP || msr == MSR_IA32_PL1_SSP || - msr == MSR_IA32_PL2_SSP) + msr == MSR_IA32_PL2_SSP) { + if (!msr_info->host_initiated && data) + vcpu->arch.cet_sss_active = true; vcpu->arch.cet_s_ssp[msr - MSR_IA32_PL0_SSP] = data; + } else if (msr == MSR_IA32_PL3_SSP) kvm_set_xsave_msr(msr_info); break; @@ -11250,7 +11253,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) kvm_sigset_activate(vcpu); kvm_run->flags = 0; kvm_load_guest_fpu(vcpu); - kvm_reload_cet_supervisor_ssp(vcpu); + if (vcpu->arch.cet_sss_active) + kvm_reload_cet_supervisor_ssp(vcpu); kvm_vcpu_srcu_read_lock(vcpu); if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) { @@ -11339,7 +11343,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu) r = vcpu_run(vcpu); out: - kvm_save_cet_supervisor_ssp(vcpu); + if (vcpu->arch.cet_sss_active) 
+ kvm_save_cet_supervisor_ssp(vcpu); kvm_put_guest_fpu(vcpu); if (kvm_run->kvm_valid_regs) store_regs(vcpu); @@ -12428,15 +12433,16 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) pmu->need_cleanup = true; kvm_make_request(KVM_REQ_PMU, vcpu); } - - kvm_reload_cet_supervisor_ssp(vcpu); + if (vcpu->arch.cet_sss_active) + kvm_reload_cet_supervisor_ssp(vcpu); static_call(kvm_x86_sched_in)(vcpu, cpu); } void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) { - kvm_save_cet_supervisor_ssp(vcpu); + if (vcpu->arch.cet_sss_active) + kvm_save_cet_supervisor_ssp(vcpu); } void kvm_arch_free_vm(struct kvm *kvm) ============================================================= Patch 3: support guest CET supervisor xstate bit: commit 2708b3c959db56fb9243f9a157884c2120b8810c Author: Yang Weijiang <weijiang.yang@intel.com> Date: Sat Jul 15 20:56:37 2023 -0400 KVM:x86: Enable guest CET supervisor xstate bit support Add the S_CET bit in kvm_caps.supported_xss so that the guest can enumerate the feature in CPUID(0xd,1).ECX. The guest S_CET xstate bit is specially handled, i.e., it can be exposed without related enabling on the host side, because KVM manually saves/reloads guest supervisor SHSTK SSPs and the current XSS swap logic for host/guest also supports doing so, thus it's safe to enable the bit without host support.
    Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2653e5eb54ee..071bcdedc530 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -228,7 +228,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
 				| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
 				| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)

-#define KVM_SUPPORTED_XSS	(XFEATURE_MASK_CET_USER)
+#define KVM_SUPPORTED_XSS	(XFEATURE_MASK_CET_USER | \
+				 XFEATURE_MASK_CET_KERNEL)

 u64 __read_mostly host_efer;
 EXPORT_SYMBOL_GPL(host_efer);
@@ -9639,6 +9640,7 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 	if (boot_cpu_has(X86_FEATURE_XSAVES)) {
 		rdmsrl(MSR_IA32_XSS, host_xss);
 		kvm_caps.supported_xss = host_xss & KVM_SUPPORTED_XSS;
+		kvm_caps.supported_xss |= XFEATURE_MASK_CET_KERNEL;
 	}

 	kvm_init_pmu_capability(ops->pmu_ops);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f8f042c91728..df187d7c3e74 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -362,7 +362,7 @@ static inline bool kvm_mpx_supported(void)
 		== (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
 }

-#define CET_XSTATE_MASK (XFEATURE_MASK_CET_USER)
+#define CET_XSTATE_MASK (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)
 /*
  * Shadow Stack and Indirect Branch Tracking feature enabling depends on
  * whether host side CET user xstate bit is supported or not.

=================================================================

What are your thoughts on the solution? Is it appropriate for KVM?

Thanks!

[...]
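[Editor's note] The lazy-activation pattern the three patches build on — skip the supervisor-SSP save/reload entirely until the guest performs its first non-zero write — can be sketched as a small user-space model. Everything below (`toy_vcpu`, `hw_msr`, the function names) is invented for illustration; it only mirrors the control flow of the `cet_sss_active` checks in the patches, not real KVM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the three supervisor SSP MSRs and the vCPU state;
 * names and layout are illustrative, not the real KVM structures. */
#define NR_S_SSP 3

struct toy_vcpu {
	uint64_t cet_s_ssp[NR_S_SSP];	/* saved guest PL{0,1,2}_SSP */
	bool cet_sss_active;		/* guest wrote a non-zero SSP */
};

static uint64_t hw_msr[NR_S_SSP];	/* stands in for rdmsrl()/wrmsrl() */

/* Guest WRMSR path: mark supervisor shadow stacks active on the first
 * non-zero, guest-initiated write (cf. the kvm_set_msr_common() hunk). */
static void toy_set_ssp(struct toy_vcpu *v, int idx, uint64_t data,
			bool host_initiated)
{
	if (!host_initiated && data)
		v->cet_sss_active = true;
	v->cet_s_ssp[idx] = data;
	hw_msr[idx] = data;
}

/* Context-switch paths: only pay the MSR save/restore cost once the
 * guest has actually used supervisor shadow stacks. */
static void toy_save(struct toy_vcpu *v)
{
	if (!v->cet_sss_active)
		return;
	for (int i = 0; i < NR_S_SSP; i++) {
		v->cet_s_ssp[i] = hw_msr[i];
		hw_msr[i] = 0;	/* don't run the host with guest values */
	}
}

static void toy_reload(struct toy_vcpu *v)
{
	if (!v->cet_sss_active)
		return;
	for (int i = 0; i < NR_S_SSP; i++)
		hw_msr[i] = v->cet_s_ssp[i];
}
```

Until the first guest write, both `toy_save()` and `toy_reload()` are free; a host-initiated (migration) write restores the value without flipping the flag, matching the `!msr_info->host_initiated` check above.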
On Mon, Jul 17, 2023, Weijiang Yang wrote:
>
> On 6/24/2023 4:51 AM, Sean Christopherson wrote:
> > > 1) Add Supervisor Shadow Stack state support (i.e., XSS.bit12 (CET_S)) into
> > > kernel so that host can support guest Supervisor Shadow Stack MSRs in g/h FPU
> > > context switch.
> > If that's necessary for correct functionality, yes.

...

> the Pros:
>  - Super easy to implement for KVM.
>  - Automatically avoids saving and restoring this data when the vmexit
>    is handled within KVM.
>
> the Cons:
>  - Unnecessarily restores XFEATURE_CET_KERNEL when switching to
>    non-KVM task's userspace.
>  - Forces allocating space for this state on all tasks, whether or not
>    they use KVM, and with likely zero users today and in the near future.
>  - Complicates the FPU optimization thinking by including things that
>    can have no effect on userspace in the FPU
>
> Given the above reasons, I implemented guest CET supervisor states management
> in KVM instead of adding a kernel patch for it.
>
> Below are 3 KVM patches to support it:
>
> Patch 1: Save/reload guest CET supervisor states when necessary:
>
> =======================================================================
>
> commit 16147ede75dee29583b7d42a6621d10d55b63595
> Author: Yang Weijiang <weijiang.yang@intel.com>
> Date:   Tue Jul 11 02:26:17 2023 -0400
>
>     KVM:x86: Make guest supervisor states as non-XSAVE managed
>
>     Save and reload guest CET supervisor states, i.e., PL{0,1,2}_SSP,
>     when vCPU context is being swapped before and after userspace
>     <->kernel entry, also do the same operation when vCPU is sched-in
>     or sched-out.

...
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e2c549f147a5..7d9cfb7e2fe8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
>         trace_kvm_fpu(0);
>  }
>
> +static void kvm_save_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
> +{
> +       preempt_disable();
> +       if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
> +               rdmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
> +               rdmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
> +               rdmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
> +               wrmsrl(MSR_IA32_PL0_SSP, 0);
> +               wrmsrl(MSR_IA32_PL1_SSP, 0);
> +               wrmsrl(MSR_IA32_PL2_SSP, 0);
> +       }
> +       preempt_enable();
> +}
> +
> +static void kvm_reload_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
> +{
> +       preempt_disable();
> +       if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
> +               wrmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
> +               wrmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
> +               wrmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
> +       }
> +       preempt_enable();
> +}

My understanding is that PL[0-2]_SSP are used only on transitions to the
corresponding privilege level from a *different* privilege level.  That means
KVM should be able to utilize the user_return_msr framework to load the host
values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
per-task, on every context switch.

But note my original wording: **If that's necessary**

If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
IA32_S_CET, then running host stuff with guest values should be ok.
KVM only needs to guarantee that it doesn't leak values between guests.  But
that should Just Work, e.g. KVM should load the new vCPU's values if SHSTK is
exposed to the guest, and intercept (to inject #GP) if SHSTK is not exposed to
the guest.

And regardless of what mechanism ends up managing the SSP MSRs, it should only
ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e.
will never consume PL{1,2}_SSP.

Am I missing something?
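[Editor's note] The user_return_msr idea sketched above — load the guest value for VM-entry, but defer restoring the host value until the thread actually returns to userspace, so vmexits handled inside the kernel pay nothing — can be modeled in plain C. The names below (`urm_slot`, `urm_load_guest`, `urm_on_user_return`) are invented for illustration; the real framework in arch/x86/kvm/x86.c (`kvm_set_user_return_msr()` and friends) is per-CPU and works over a fixed slot list, which this toy model omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t hw_pl0_ssp;	/* stands in for MSR_IA32_PL0_SSP */

struct urm_slot {
	uint64_t host_val;	/* value host userspace expects */
	uint64_t curr_val;	/* value currently loaded in the MSR */
	bool registered;	/* a restore is pending on user return */
};

/* VM-entry path: load the guest value, skipping the WRMSR when the MSR
 * already holds it (the common case across back-to-back VM-entries). */
static void urm_load_guest(struct urm_slot *s, uint64_t guest_val)
{
	if (s->curr_val != guest_val) {
		hw_pl0_ssp = guest_val;		/* wrmsrl() */
		s->curr_val = guest_val;
	}
	s->registered = true;
}

/* Runs once on exit to userspace; every intermediate vmexit handled
 * inside the kernel skips the host-value restore entirely. */
static void urm_on_user_return(struct urm_slot *s)
{
	if (s->registered && s->curr_val != s->host_val) {
		hw_pl0_ssp = s->host_val;	/* wrmsrl() */
		s->curr_val = s->host_val;
	}
	s->registered = false;
}
```

This also reflects Sean's point that only PL0_SSP would need a slot on Linux hosts: the model tracks a single MSR, and a second or third slot would only be added if the host ever consumed PL{1,2}_SSP.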
On Wed, Jul 19, 2023, Sean Christopherson wrote:
> On Mon, Jul 17, 2023, Weijiang Yang wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index e2c549f147a5..7d9cfb7e2fe8 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu
> > *vcpu)
> > ������� trace_kvm_fpu(0);
> > �}

Huh.  After a bit of debugging, the mangling is due to mutt's default for
send_charset being

  "us-ascii:iso-8859-1:utf-8"

and selecting iso-8859-1 instead of utf-8 as the encoding despite the original
mail being utf-8.  In this case, mutt ran afoul of nbsp (u+00a0).

AFAICT, the solution is to essentially tell mutt to never try to use iso-8859-1
for sending mail:

  set send_charset="us-ascii:utf-8"
On Wed, Jul 19, 2023 at 12:41:47PM -0700, Sean Christopherson wrote:
> My understanding is that PL[0-2]_SSP are used only on transitions to the
> corresponding privilege level from a *different* privilege level.  That means
> KVM should be able to utilize the user_return_msr framework to load the host
> values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
> have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> per-task, on every context switch.
>
> But note my original wording: **If that's necessary**
>
> If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
> needs to guarantee that it doesn't leak values between guests.  But that should
> Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
>
> And regardless of what the mechanism ends up managing SSP MSRs, it should only
> ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
> never consume PL{1,2}_SSP.

To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.
On 7/20/2023 3:41 AM, Sean Christopherson wrote:
> [...]
> My understanding is that PL[0-2]_SSP are used only on transitions to the
> corresponding privilege level from a *different* privilege level.  That means
> KVM should be able to utilize the user_return_msr framework to load the host
> values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
> have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> per-task, on every context switch.
>
> But note my original wording: **If that's necessary**

Thanks! I think host SSS enabling won't happen in the short term, so handling
the guest supervisor states on the KVM side is doable.

> If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
> needs to guarantee that it doesn't leak values between guests.  But that should
> Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.

Yes, this handling is covered by the new version.

> And regardless of what the mechanism ends up managing SSP MSRs, it should only
> ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
> never consume PL{1,2}_SSP.
>
> Am I missing something?

I think guest PL{0,1,2}_SSP can be handled as a bundle to make the handling
easy (instead of handling each one separately), because the guest can be a
non-Linux system; as you said before, they could even be used as scratch
registers.  But for the host side, as it's Linux, I can omit reloading/resetting
host PL{1,2}_SSP when the vCPU thread is preempted.

I will post a new version to the community if the above is only a minor
divergence.
On 7/20/2023 4:26 AM, Sean Christopherson wrote:
> On Wed, Jul 19, 2023, Sean Christopherson wrote:
>> On Mon, Jul 17, 2023, Weijiang Yang wrote:
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index e2c549f147a5..7d9cfb7e2fe8 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu
>>> *vcpu)
>>> ������� trace_kvm_fpu(0);
>>> �}
> Huh.  After a bit of debugging, the mangling is due to mutt's default for
> send_charset being
>
> "us-ascii:iso-8859-1:utf-8"
>
> and selecting iso-8859-1 instead of utf-8 as the encoding despite the original
> mail being utf-8.  In this case, mutt ran afoul of nbsp (u+00a0).
>
> AFAICT, the solution is to essentially tell mutt to never try to use iso-8859-1
> for sending mail
>
> set send_charset="us-ascii:utf-8"

It made me feel a bit guilty, as I had thought it resulted from wrong settings
in my own email system :-)
> > My understanding is that PL[0-2]_SSP are used only on transitions to the
> > corresponding privilege level from a *different* privilege level.  That means
> > KVM should be able to utilize the user_return_msr framework to load the host
> > values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
> > have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> > userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> > per-task, on every context switch.
> >
> > But note my original wording: **If that's necessary**
> >
> > If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> > IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
> > needs to guarantee that it doesn't leak values between guests.  But that should
> > Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> > guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
> >
> > And regardless of what the mechanism ends up managing SSP MSRs, it should only
> > ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
> > never consume PL{1,2}_SSP.
>
> To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.

Trying to understand more about what prevents SSS from being enabled pre-FRED:
is it better #CP exception handling with other nested exceptions?  Won't the
same problems (to some extent) happen with user-mode shadow stacks (and, in the
case of a guest, SSS inside the VM)?

Thanks,
Pankaj
On Thu, Jul 20, 2023 at 07:26:04AM +0200, Pankaj Gupta wrote:
> > > My understanding is that PL[0-2]_SSP are used only on transitions to the
> > > corresponding privilege level from a *different* privilege level.  That means
> > > KVM should be able to utilize the user_return_msr framework to load the host
> > > values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
> > > have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> > > userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> > > per-task, on every context switch.
> > >
> > > But note my original wording: **If that's necessary**
> > >
> > > If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> > > IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
> > > needs to guarantee that it doesn't leak values between guests.  But that should
> > > Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> > > guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
> > >
> > > And regardless of what the mechanism ends up managing SSP MSRs, it should only
> > > ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
> > > never consume PL{1,2}_SSP.
> >
> > To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.
>
> Trying to understand more what prevents SSS to enable in pre FRED, Is
> it better #CP exception
> handling with other nested exceptions?

SSS took the syscall gap and made it worse -- as in *way* worse.

To top it off, the whole SSS busy bit thing is fundamentally incompatible with
how we manage to survive nested exceptions in NMI context.

Basically, the whole x86 exception / stack switching logic was already
borderline impossible (consider taking an MCE in the early NMI path where we
set up, but have not finished, the re-entrancy stuff), and SSS pushed it over
the edge and set it on fire.
And NMI isn't the only problem, the various new virt exceptions #VC and #HV are
on their own already near impossible, adding SSS again pushes the whole thing
into clear insanity.

There's a good exposition of the whole trainwreck by Andrew here:

  https://www.youtube.com/watch?v=qcORS8CN0ow

(that is, sorry for the youtube link, but Google is failing me in finding the
actual Google Doc that talk is based on, or even the slide deck :/)


FRED solves all that by:

 - removing the stack gap, cs/ip/ss/sp/ssp/gs will all be switched
   atomically and consistently for every transition.

 - removing the non-reentrant IST mechanism and replacing it with stack
   levels

 - adding an explicit NMI latch

 - re-organising the actual shadow stacks and doing away with that busy
   bit thing (I need to re-read the FRED spec on this detail again).


Crazy as we are, we're not touching legacy/IDT SSS with a ten foot pole, sorry.
> > > Trying to understand more what prevents SSS to enable in pre FRED, Is
> > > it better #CP exception
> > > handling with other nested exceptions?
> >
> > SSS took the syscall gap and made it worse -- as in *way* worse.
> >
> > To top it off, the whole SSS busy bit thing is fundamentally
> > incompatible with how we manage to survive nested exceptions in NMI
> > context.
> >
> > Basically, the whole x86 exception / stack switching logic was already
> > borderline impossible (consider taking an MCE in the early NMI path
> > where we set up, but have not finished, the re-entrancy stuff), and SSS
> > pushed it over the edge and set it on fire.

Ah, I see. SSS takes it to the next level.

> > And NMI isn't the only problem, the various new virt exceptions #VC and
> > #HV are on their own already near impossible, adding SSS again pushes
> > the whole thing into clear insanity.
> >
> > There's a good exposition of the whole trainwreck by Andrew here:
> >
> >   https://www.youtube.com/watch?v=qcORS8CN0ow
> >
> > (that is, sorry for the youtube link, but Google is failing me in
> > finding the actual Google Doc that talk is based on, or even the slide
> > deck :/)

I think I got the link:

  https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?pli=1

> > FRED solves all that by:
> >
> >  - removing the stack gap, cc/ip/ss/sp/ssp/gs will all be switched
> >    atomically and consistently for every transition.
> >
> >  - removing the non-reentrant IST mechanism and replacing it with stack
> >    levels
> >
> >  - adding an explicit NMI latch
> >
> >  - re-organising the actual shadow stacks and doing away with that busy
> >    bit thing (I need to re-read the FRED spec on this detail again).

Thank you for explaining. I will also study the FRED spec and the corresponding
kernel patches posted on the mailing list.

> > Crazy as we are, we're not touching legacy/IDT SSS with a ten foot pole,
> > sorry.

Ya, interesting.
Best regards, Pankaj