[v8,00/26] Enable CET Virtualization

Message ID	20231221140239.4349-1-weijiang.yang@intel.com
Headers	Received-SPF: pass (google.com: domain of linux-kernel+bounces-8059-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) client-ip=2604:1380:45d1:ec00::1; From: Yang Weijiang <weijiang.yang@intel.com> To: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: peterz@infradead.org, chao.gao@intel.com, rick.p.edgecombe@intel.com, mlevitsk@redhat.com, john.allen@amd.com, weijiang.yang@intel.com Subject: [PATCH v8 00/26] Enable CET Virtualization Date: Thu, 21 Dec 2023 09:02:13 -0500 Message-Id: <20231221140239.4349-1-weijiang.yang@intel.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-getmail-retrieved-from-mailbox: INBOX
Series	Enable CET Virtualization \| [v8,00/26] Enable CET Virtualization [v8,01/26] x86/fpu/xstate: Always preserve non-user xfeatures/flags in __state_perm [v8,02/26] x86/fpu/xstate: Refine CET user xstate bit enabling [v8,03/26] x86/fpu/xstate: Add CET supervisor mode state support [v8,04/26] x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set [v8,05/26] x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration [v8,06/26] x86/fpu/xstate: Create guest fpstate with guest specific config [v8,07/26] x86/fpu/xstate: Warn if kernel dynamic xfeatures detected in normal fpstate [v8,08/26] KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data [v8,09/26] KVM: x86: Rename kvm_{g,s}et_msr() to menifest emulation operations [v8,10/26] KVM: x86: Refine xsave-managed guest register/MSR reset handling [v8,11/26] KVM: x86: Add kvm_msr_{read,write}() helpers [v8,12/26] KVM: x86: Report XSS as to-be-saved if there are supported features [v8,13/26] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS [v8,14/26] KVM: x86: Initialize kvm_caps.supported_xss [v8,15/26] KVM: x86: Load guest FPU state when access XSAVE-managed MSRs [v8,16/26] KVM: x86: Add fault checks for guest CR4.CET setting [v8,17/26] KVM: x86: Report KVM supported CET MSRs as to-be-saved [v8,18/26] KVM: VMX: Introduce CET VMCS fields and control bits [v8,19/26] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled" [v8,20/26] KVM: VMX: Emulate read and write to CET MSRs [v8,21/26] KVM: x86: Save and reload SSP to/from SMRAM [v8,22/26] KVM: VMX: Set up interception for CET MSRs [v8,23/26] KVM: VMX: Set host constant supervisor states to VMCS fields [v8,24/26] KVM: x86: Enable CET virtualization for VMX and advertise to userspace [v8,25/26] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1 [v8,26/26] KVM: nVMX: Enable CET support for nested guest

Message ID

20231221140239.4349-1-weijiang.yang@intel.com

Headers

Received-SPF: pass (google.com: domain of
 linux-kernel+bounces-8059-ouuuleilei=gmail.com@vger.kernel.org designates
 2604:1380:45d1:ec00::1 as permitted sender) client-ip=2604:1380:45d1:ec00::1;
From: Yang Weijiang <weijiang.yang@intel.com>
To: seanjc@google.com,
	pbonzini@redhat.com,
	dave.hansen@intel.com,
	kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org
Cc: peterz@infradead.org,
	chao.gao@intel.com,
	rick.p.edgecombe@intel.com,
	mlevitsk@redhat.com,
	john.allen@amd.com,
	weijiang.yang@intel.com
Subject: [PATCH v8 00/26] Enable CET Virtualization
Date: Thu, 21 Dec 2023 09:02:13 -0500
Message-Id: <20231221140239.4349-1-weijiang.yang@intel.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

Enable CET Virtualization |

Message

Yang, Weijiang Dec. 21, 2023, 2:02 p.m. UTC

  Control-flow Enforcement Technology (CET) is a kind of CPU feature used
to prevent Return/CALL/Jump-Oriented Programming (ROP/COP/JOP) attacks.
It provides two sub-features(SHSTK,IBT) to defend against ROP/COP/JOP
style control-flow subversion attacks.

Shadow Stack (SHSTK):
  A shadow stack is a second stack used exclusively for control transfer
  operations. The shadow stack is separate from the data/normal stack and
  can be enabled individually in user and kernel mode. When shadow stack
  is enabled, CALL pushes the return address on both the data and shadow
  stack. RET pops the return address from both stacks and compares them.
  If the return addresses from the two stacks do not match, the processor
  generates a #CP.

Indirect Branch Tracking (IBT):
  IBT introduces new instruction(ENDBRANCH)to mark valid target addresses of
  indirect branches (CALL, JMP etc...). If an indirect branch is executed
  and the next instruction is _not_ an ENDBRANCH, the processor generates a
  #CP. These instruction behaves as a NOP on platforms that doesn't support
  CET.

Dependency:
=====================
CET native series for user mode shadow stack has already been merged in v6.6
mainline kernel.

The first 7 kernel patches are prerequisites for this KVM patch series since
guest CET user mode and supervisor mode states depends on kernel FPU framework
to properly save/restore the states whenever FPU context switch is required,
e.g., after VM-Exit and before vCPU thread exits to userspace.

In this series, guest supervisor SHSTK mitigation solution isn't introduced
for Intel platform therefore guest SSS_CET bit of CPUID(0x7,1):EDX[bit18] is
cleared. Check SDM (Vol 1, Section 17.2.3) for details.

CET states management:
======================
KVM cooperates with host kernel FPU framework to manage guest CET registers.
With CET supervisor mode state support in this series, KVM can save/restore
full guest CET xsave-managed states.

CET user mode and supervisor mode xstates, i.e., MSR_IA32_{U_CET,PL3_SSP}
and MSR_IA32_PL{0,1,2}, depend on host FPU framework to swap guest and host
xstates. On VM-Exit, guest CET xstates are saved to guest fpu area and host
CET xstates are loaded from task/thread context before vCPU returns to
userspace, vice-versa on VM-Entry. See details in kvm_{load,put}_guest_fpu().
So guest CET xstates management depends on CET xstate bits(U_CET/S_CET bit)
set in host XSS MSR.

CET supervisor mode states are grouped into two categories : XSAVE-managed
and non-XSAVE-managed, the former includes MSR_IA32_PL{0,1,2}_SSP and are
controlled by CET supervisor mode bit(S_CET bit) in XSS, the later consists
of MSR_IA32_S_CET and MSR_IA32_INTR_SSP_TBL.

VMX introduces new VMCS fields, {GUEST|HOST}_{S_CET,SSP,INTR_SSP_TABL}, to
facilitate guest/host non-XSAVES-managed states. When VMX CET entry/exit load
bits are set, guest/host MSR_IA32_{S_CET,INTR_SSP_TBL,SSP} are loaded from
equivalent fields at VM-Exit/Entry. With these new fields, such supervisor
states require no addtional KVM save/reload actions.

Tests:
======================
This series passed basic CET user shadow stack test and kernel IBT test in L1
and L2 guest.
The patch series _has_ impact to existing vmx test cases in KVM-unit-tests,the
failures have been fixed here[1].
One new selftest app[2] is introduced for testing CET MSRs accessibilities.

Note, this series hasn't been tested on AMD platform yet.

To run user SHSTK test and kernel IBT test in guest, an CET capable platform
is required, e.g., Sapphire Rapids server, and follow below steps to build
the binaries:

1. Host kernel: Apply this series to mainline kernel (>= v6.6) and build.

2. Guest kernel: Pull kernel (>= v6.6), opt-in CONFIG_X86_KERNEL_IBT
and CONFIG_X86_USER_SHADOW_STACK options. Build with CET enabled gcc versions
(>= 8.5.0).

3. Apply CET QEMU patches[3] before build mainline QEMU.

Check kernel selftest test_shadow_stack_64 output:
[INFO]  new_ssp = 7f8c82100ff8, *new_ssp = 7f8c82101001
[INFO]  changing ssp from 7f8c82900ff0 to 7f8c82100ff8
[INFO]  ssp is now 7f8c82101000
[OK]    Shadow stack pivot
[OK]    Shadow stack faults
[INFO]  Corrupting shadow stack
[INFO]  Generated shadow stack violation successfully
[OK]    Shadow stack violation test
[INFO]  Gup read -> shstk access success
[INFO]  Gup write -> shstk access success
[INFO]  Violation from normal write
[INFO]  Gup read -> write access success
[INFO]  Violation from normal write
[INFO]  Gup write -> write access success
[INFO]  Cow gup write -> write access success
[OK]    Shadow gup test
[INFO]  Violation from shstk access
[OK]    mprotect() test
[SKIP]  Userfaultfd unavailable.
[OK]    32 bit test


Check kernel IBT with dmesg | grep CET:
CET detected: Indirect Branch Tracking enabled

Changes in v8:
=====================
1. Add annotation for fpu_{kernel,user, guest}_cfg fields. [Maxim]
2. Remove CET state bits in kvm_caps.supported_xss if CET is disabled. [Maxim]
3. Prevent 32-bit guest launch if CET is enabled in CPUID. [Maxim, Chao]
4. Use fpu_guest_cfg.default_size to calculate guest default fpstate size. [Rick]
5. Sync CET host states in vmcs12 to vmcs01 before L2 exits to L1. [Maxim]
7. Other minor changes due to review comments. [Rick, Maxim]
8. Rebased to: https://github.com/kvm-x86/linux tag:kvm-x86-next-2023.11.30


[1]: KVM-unit-tests fixup:
https://lore.kernel.org/all/20230913235006.74172-1-weijiang.yang@intel.com/
[2]: Selftest for CET MSRs:
https://lore.kernel.org/all/20230914064201.85605-1-weijiang.yang@intel.com/
[3]: QEMU patch:
https://lore.kernel.org/all/20230720111445.99509-1-weijiang.yang@intel.com/
[4]: v7 patchset:
https://lore.kernel.org/all/20231124055330.138870-1-weijiang.yang@intel.com/

Patch 1-7:	Fixup patches for kernel xstate and enable CET supervisor xstate.
Patch 8-11:	Cleanup patches for KVM.
Patch 12-15:	Enable KVM XSS MSR support.
Patch 16:	Fault check for CR4.CET setting.
Patch 17:	Report CET MSRs to userspace.
Patch 18:	Introduce CET VMCS fields.
Patch 19:	Add SHSTK/IBT to KVM-governed framework.(to be deprecated)
Patch 20:	Emulate CET MSR access.
Patch 21:	Handle SSP at entry/exit to SMM.
Patch 22:	Set up CET MSR interception.
Patch 23:	Initialize host constant supervisor state.
Patch 24:	Enable CET virtualization settings.
Patch 25-26:	Add CET nested support.


Sean Christopherson (4):
  x86/fpu/xstate: Always preserve non-user xfeatures/flags in
    __state_perm
  KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
  KVM: x86: Report XSS as to-be-saved if there are supported features
  KVM: x86: Load guest FPU state when access XSAVE-managed MSRs

Yang Weijiang (22):
  x86/fpu/xstate: Refine CET user xstate bit enabling
  x86/fpu/xstate: Add CET supervisor mode state support
  x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set
  x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration
  x86/fpu/xstate: Create guest fpstate with guest specific config
  x86/fpu/xstate: Warn if kernel dynamic xfeatures detected in normal fpstate
  KVM: x86: Rename kvm_{g,s}et_msr() to menifest emulation operations
  KVM: x86: Refine xsave-managed guest register/MSR reset handling
  KVM: x86: Add kvm_msr_{read,write}() helpers
  KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
  KVM: x86: Initialize kvm_caps.supported_xss
  KVM: x86: Add fault checks for guest CR4.CET setting
  KVM: x86: Report KVM supported CET MSRs as to-be-saved
  KVM: VMX: Introduce CET VMCS fields and control bits
  KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"
  KVM: VMX: Emulate read and write to CET MSRs
  KVM: x86: Save and reload SSP to/from SMRAM
  KVM: VMX: Set up interception for CET MSRs
  KVM: VMX: Set host constant supervisor states to VMCS fields
  KVM: x86: Enable CET virtualization for VMX and advertise to userspace
  KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1
  KVM: nVMX: Enable CET support for nested guest

 arch/x86/include/asm/fpu/types.h     |  16 +-
 arch/x86/include/asm/fpu/xstate.h    |  11 +-
 arch/x86/include/asm/kvm_host.h      |  12 +-
 arch/x86/include/asm/msr-index.h     |   1 +
 arch/x86/include/asm/vmx.h           |   8 +
 arch/x86/include/uapi/asm/kvm_para.h |   1 +
 arch/x86/kernel/fpu/core.c           | 117 ++++++++++--
 arch/x86/kernel/fpu/xstate.c         |  44 ++++-
 arch/x86/kernel/fpu/xstate.h         |   3 +
 arch/x86/kvm/cpuid.c                 |  80 ++++++---
 arch/x86/kvm/governed_features.h     |   2 +
 arch/x86/kvm/smm.c                   |  12 +-
 arch/x86/kvm/smm.h                   |   2 +-
 arch/x86/kvm/vmx/capabilities.h      |  10 ++
 arch/x86/kvm/vmx/nested.c            |  97 ++++++++--
 arch/x86/kvm/vmx/nested.h            |   5 +
 arch/x86/kvm/vmx/vmcs12.c            |   6 +
 arch/x86/kvm/vmx/vmcs12.h            |  14 +-
 arch/x86/kvm/vmx/vmx.c               | 110 +++++++++++-
 arch/x86/kvm/vmx/vmx.h               |   6 +-
 arch/x86/kvm/x86.c                   | 259 +++++++++++++++++++++++++--
 arch/x86/kvm/x86.h                   |  28 +++
 22 files changed, 746 insertions(+), 98 deletions(-)


base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6

Comments

Edgecombe, Rick P Jan. 3, 2024, 6:50 p.m. UTC | #1

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Control-flow Enforcement Technology (CET) is a kind of CPU feature
> used
> to prevent Return/CALL/Jump-Oriented Programming (ROP/COP/JOP)
> attacks.
> It provides two sub-features(SHSTK,IBT) to defend against ROP/COP/JOP
> style control-flow subversion attacks.
> 
> Shadow Stack (SHSTK):
>   A shadow stack is a second stack used exclusively for control
> transfer
>   operations. The shadow stack is separate from the data/normal stack
> and
>   can be enabled individually in user and kernel mode. When shadow
> stack
>   is enabled, CALL pushes the return address on both the data and
> shadow
>   stack. RET pops the return address from both stacks and compares
> them.
>   If the return addresses from the two stacks do not match, the
> processor
>   generates a #CP.
> 
> Indirect Branch Tracking (IBT):
>   IBT introduces new instruction(ENDBRANCH)to mark valid target
> addresses of
>   indirect branches (CALL, JMP etc...). If an indirect branch is
> executed
>   and the next instruction is _not_ an ENDBRANCH, the processor
> generates a
>   #CP. These instruction behaves as a NOP on platforms that doesn't
> support
>   CET.

What is the design around CET and the KVM emulator?

My understanding is that the KVM emulator kind of does what it has to
keep things running, and isn't expected to emulate every possible
instruction. With CET though, it is changing the behavior of existing
supported instructions. I could imagine a guest could skip over CET
enforcement by causing an MMIO exit and racing to overwrite the exit-
causing instruction from a different vcpu to be an indirect CALL/RET,
etc. With reasonable assumptions around the threat model in use by the
guest this is probably not a huge problem. And I guess also reasonable
assumptions about functional expectations, as a misshandled CALL or RET
by the emulator would corrupt the shadow stack.

But, another thing to do could be to just return X86EMUL_UNHANDLEABLE
or X86EMUL_RETRY_INSTR when CET is active and RET or CALL are emulated.
And I guess also for all instructions if the TRACKER bit is set. It
might tie up that loose end without too much trouble.

Anyway, was there a conscious decision to just punt on CET enforcement
in the emulator?

Yang, Weijiang Jan. 4, 2024, 7:11 a.m. UTC | #2

On 1/4/2024 2:50 AM, Edgecombe, Rick P wrote:
> On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
>> Control-flow Enforcement Technology (CET) is a kind of CPU feature
>> used
>> to prevent Return/CALL/Jump-Oriented Programming (ROP/COP/JOP)
>> attacks.
>> It provides two sub-features(SHSTK,IBT) to defend against ROP/COP/JOP
>> style control-flow subversion attacks.
>>
>> Shadow Stack (SHSTK):
>>    A shadow stack is a second stack used exclusively for control
>> transfer
>>    operations. The shadow stack is separate from the data/normal stack
>> and
>>    can be enabled individually in user and kernel mode. When shadow
>> stack
>>    is enabled, CALL pushes the return address on both the data and
>> shadow
>>    stack. RET pops the return address from both stacks and compares
>> them.
>>    If the return addresses from the two stacks do not match, the
>> processor
>>    generates a #CP.
>>
>> Indirect Branch Tracking (IBT):
>>    IBT introduces new instruction(ENDBRANCH)to mark valid target
>> addresses of
>>    indirect branches (CALL, JMP etc...). If an indirect branch is
>> executed
>>    and the next instruction is _not_ an ENDBRANCH, the processor
>> generates a
>>    #CP. These instruction behaves as a NOP on platforms that doesn't
>> support
>>    CET.
> What is the design around CET and the KVM emulator?

KVM doesn't emulate CET HW behavior for guest CET, instead it leaves CET related
checks and handling in guest kernel. E.g., if emulated JMP/CALL in emulator triggers
mismatch of data stack and shadow stack contents, #CP is generated in non-root
mode instead of being injected by KVM.  KVM only emulates basic x86 HW behaviors,
e.g., call/jmp/ret/in/out etc.

> My understanding is that the KVM emulator kind of does what it has to
> keep things running, and isn't expected to emulate every possible
> instruction. With CET though, it is changing the behavior of existing
> supported instructions. I could imagine a guest could skip over CET
> enforcement by causing an MMIO exit and racing to overwrite the exit-
> causing instruction from a different vcpu to be an indirect CALL/RET,
> etc.

Can you elaborate the case? I cannot figure out how it works.

> With reasonable assumptions around the threat model in use by the
> guest this is probably not a huge problem. And I guess also reasonable
> assumptions about functional expectations, as a misshandled CALL or RET
> by the emulator would corrupt the shadow stack.

KVM emulates general x86 HW behaviors, if something wrong happens after emulation
then it can happen even on bare metal, i.e., guest SW most likely gets wrong somewhere
and it's expected to trigger CET exceptions in guest kernel.

> But, another thing to do could be to just return X86EMUL_UNHANDLEABLE
> or X86EMUL_RETRY_INSTR when CET is active and RET or CALL are emulated.

IMHO, translating the CET induced exceptions into X86EMUL_UNHANDLEABLE or X86EMUL_RETRY_INSTR would confuse guest kernel or even VMM, I prefer letting guest kernel handle #CP directly.
> And I guess also for all instructions if the TRACKER bit is set. It
> might tie up that loose end without too much trouble.
>
> Anyway, was there a conscious decision to just punt on CET enforcement
> in the emulator?

I don't remember we ever discussed it in community, but since KVM maintainers reviewed
the CET virtualization series for a long time, I assume we're moving on the right way :-)

Edgecombe, Rick P Jan. 4, 2024, 9:10 p.m. UTC | #3

On Thu, 2024-01-04 at 15:11 +0800, Yang, Weijiang wrote:
> > What is the design around CET and the KVM emulator?
> 
> KVM doesn't emulate CET HW behavior for guest CET, instead it leaves
> CET related
> checks and handling in guest kernel. E.g., if emulated JMP/CALL in
> emulator triggers
> mismatch of data stack and shadow stack contents, #CP is generated in
> non-root
> mode instead of being injected by KVM.  KVM only emulates basic x86
> HW behaviors,
> e.g., call/jmp/ret/in/out etc.

Right. In the case of CET those basic behaviors (call/jmp/ret) now have
host emulation behavior that doesn't match what guest execution would
do.

> 
> > My understanding is that the KVM emulator kind of does what it has
> > to
> > keep things running, and isn't expected to emulate every possible
> > instruction. With CET though, it is changing the behavior of
> > existing
> > supported instructions. I could imagine a guest could skip over CET
> > enforcement by causing an MMIO exit and racing to overwrite the
> > exit-
> > causing instruction from a different vcpu to be an indirect
> > CALL/RET,
> > etc.
> 
> Can you elaborate the case? I cannot figure out how it works.

The point that it should be possible for KVM to emulate call/ret with
CET enabled. Not saying the specific case is critical, but the one I
used as an example was that the KVM emulator can (or at least in the
not too distant past) be forced to emulate arbitrary instructions if
the guest overwrites the instruction between the exit and the SW fetch
from the host. 

The steps are:
vcpu 1                         vcpu 2
-------------------------------------
mov to mmio addr
vm exit ept_misconfig
                               overwrite mov instruction to call %rax
host emulator fetches
host emulates call instruction

So then the guest call operation will skip the endbranch check. But I'm
not sure that there are not less exotic cases that would run across it.
I see a bunch of cases where write protected memory kicks to the
emulator as well. Not sure the exact scenarios and whether this could
happen naturally in races during live migration, dirty tracking, etc.
Again, I'm more just asking the exposure and thinking on it.

> 
> > With reasonable assumptions around the threat model in use by the
> > guest this is probably not a huge problem. And I guess also
> > reasonable
> > assumptions about functional expectations, as a misshandled CALL or
> > RET
> > by the emulator would corrupt the shadow stack.
> 
> KVM emulates general x86 HW behaviors, if something wrong happens
> after emulation
> then it can happen even on bare metal, i.e., guest SW most likely
> gets wrong somewhere
> and it's expected to trigger CET exceptions in guest kernel.
> 
> > But, another thing to do could be to just return
> > X86EMUL_UNHANDLEABLE
> > or X86EMUL_RETRY_INSTR when CET is active and RET or CALL are
> > emulated.
> 
> IMHO, translating the CET induced exceptions into
> X86EMUL_UNHANDLEABLE or X86EMUL_RETRY_INSTR would confuse guest
> kernel or even VMM, I prefer letting guest kernel handle #CP
> directly.

Doesn't X86EMUL_RETRY_INSTR kick it back to the guest which is what you
want? Today it will do the operations without the special CET behavior.

But I do see how this could be tricky to avoid the guest getting stuck
in a loop with X86EMUL_RETRY_INSTR. I guess the question is if this
situation is encountered, when KVM can't handle the emulation
correctly, what should happen? I think usually it returns
KVM_INTERNAL_ERROR_EMULATION to userspace? So I don't see why the CET
case is different.

If the scenario (call/ret emulation with CET enabled) doesn't happen,
how can the guest be confused? If it does happen, won't it be an issue?

> > And I guess also for all instructions if the TRACKER bit is set. It
> > might tie up that loose end without too much trouble.
> > 
> > Anyway, was there a conscious decision to just punt on CET
> > enforcement
> > in the emulator?
> 
> I don't remember we ever discussed it in community, but since KVM
> maintainers reviewed
> the CET virtualization series for a long time, I assume we're moving
> on the right way :-)

It seems like kind of leap that if it never came up that they must be
approving of the specific detail. Don't know. Maybe they will chime in.

Edgecombe, Rick P Jan. 4, 2024, 10:29 p.m. UTC | #4

On Thu, 2023-12-21 at 09:02 -0500, Yang Weijiang wrote:
> Tests:
> ======================
> This series passed basic CET user shadow stack test and kernel IBT
> test in L1
> and L2 guest.

With the build fix, reproduced the basic IBT and user shadow stack
tests, plus the CET enabled glibc unit tests.

Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Sean Christopherson Jan. 5, 2024, 12:22 a.m. UTC | #5

On Thu, Jan 04, 2024, Rick P Edgecombe wrote:
> On Thu, 2024-01-04 at 15:11 +0800, Yang, Weijiang wrote:
> > > What is the design around CET and the KVM emulator?
> > 
> > KVM doesn't emulate CET HW behavior for guest CET, instead it leaves CET
> > related checks and handling in guest kernel. E.g., if emulated JMP/CALL in
> > emulator triggers mismatch of data stack and shadow stack contents, #CP is
> > generated in non-root mode instead of being injected by KVM.  KVM only
> > emulates basic x86 HW behaviors, e.g., call/jmp/ret/in/out etc.
> 
> Right. In the case of CET those basic behaviors (call/jmp/ret) now have
> host emulation behavior that doesn't match what guest execution would
> do.

I wouldn't say that KVM emulates "basic" x86.  KVM emulates instructions that
BIOS and kernels execute in Big Real Mode (and other "illegal" modes prior to Intel
adding unrestricted guest), instructions that guests commonly use for MMIO, I/O,
and page table modifications, and few other tidbits that have cropped up over the
years.

In other words, as Weijiang suspects below, KVM's emulator handles juuust enough
stuff to squeak by and not barf on real world guests.  It is not, and has never
been, anything remotely resembling a fully capable architectural emulator.

> > > My understanding is that the KVM emulator kind of does what it has to
> > > keep things running, and isn't expected to emulate every possible
> > > instruction. With CET though, it is changing the behavior of existing
> > > supported instructions. I could imagine a guest could skip over CET
> > > enforcement by causing an MMIO exit and racing to overwrite the exit-
> > > causing instruction from a different vcpu to be an indirect CALL/RET,
> > > etc.
> > 
> > Can you elaborate the case? I cannot figure out how it works.
> 
> The point that it should be possible for KVM to emulate call/ret with
> CET enabled. Not saying the specific case is critical, but the one I
> used as an example was that the KVM emulator can (or at least in the
> not too distant past) be forced to emulate arbitrary instructions if
> the guest overwrites the instruction between the exit and the SW fetch
> from the host. 
> 
> The steps are:
> vcpu 1                         vcpu 2
> -------------------------------------
> mov to mmio addr
> vm exit ept_misconfig
>                                overwrite mov instruction to call %rax
> host emulator fetches
> host emulates call instruction
> 
> So then the guest call operation will skip the endbranch check. But I'm
> not sure that there are not less exotic cases that would run across it.
> I see a bunch of cases where write protected memory kicks to the
> emulator as well. Not sure the exact scenarios and whether this could
> happen naturally in races during live migration, dirty tracking, etc.

It's for shadow paging.  Instead of _immediately_ zapping SPTEs on any write to
a shadowed guest PTE, KVM instead tries to emulate the faulting instruction (and
then still zaps SPTE).  If KVM can't emulate the instruction for whatever reason,
then KVM will _usually_ just zap the SPTE and resume the guest, i.e. retry the
faulting instruction.

The reason KVM doesn't automatically/unconditionally zap and retry is that there
are circumstances where the guest can't make forward progress, e.g. if an
instruction is using a guest PTE that it is writing, if L2 is modifying L1 PTEs,
and probably a few other edge cases I'm forgetting.

> Again, I'm more just asking the exposure and thinking on it.

If you care about exposure to the emulator from a guest security perspective,
assume that a compromised guest can coerce KVM into attempting to emulate
arbitrary bytes.  As in the situation described above, it's not _that_ difficult
to play games with TLBs and instruction vs. data caches.

If all you care about is not breaking misbehaving guests, I wouldn't worry too
much about it.

> > > With reasonable assumptions around the threat model in use by the guest
> > > this is probably not a huge problem. And I guess also reasonable
> > > assumptions about functional expectations, as a misshandled CALL or RET
> > > by the emulator would corrupt the shadow stack.
> > 
> > KVM emulates general x86 HW behaviors, if something wrong happens after
> > emulation then it can happen even on bare metal, i.e., guest SW most likely
> > gets wrong somewhere and it's expected to trigger CET exceptions in guest
> > kernel.

No, the days of KVM making shit up from are done.  IIUC, you're advocating that
it's ok for KVM to induce a #CP that architecturally should not happen.  That is
not acceptable, full stop.

Retrying the instruction in the guest, exiting to userspace, and even terminating
the VM are all perfectly acceptable behaviors if KVM encounters something it can't
*correctly* emulate.  But clobbering the shadow stack or not detecting a CFI
violation, even if the guest is misbehaving, is not ok.

> > > But, another thing to do could be to just return X86EMUL_UNHANDLEABLE or
> > > X86EMUL_RETRY_INSTR when CET is active and RET or CALL are emulated.
> > 
> > IMHO, translating the CET induced exceptions into X86EMUL_UNHANDLEABLE or
> > X86EMUL_RETRY_INSTR would confuse guest kernel or even VMM, I prefer
> > letting guest kernel handle #CP directly.
>
> Doesn't X86EMUL_RETRY_INSTR kick it back to the guest which is what you
> want? Today it will do the operations without the special CET behavior.
> 
> But I do see how this could be tricky to avoid the guest getting stuck
> in a loop with X86EMUL_RETRY_INSTR. I guess the question is if this
> situation is encountered, when KVM can't handle the emulation
> correctly, what should happen? I think usually it returns
> KVM_INTERNAL_ERROR_EMULATION to userspace? So I don't see why the CET
> case is different.
> 
> If the scenario (call/ret emulation with CET enabled) doesn't happen,
> how can the guest be confused? If it does happen, won't it be an issue?
> 
> > > And I guess also for all instructions if the TRACKER bit is set. It
> > > might tie up that loose end without too much trouble.
> > > 
> > > Anyway, was there a conscious decision to just punt on CET enforcement in
> > > the emulator?
> > 
> > I don't remember we ever discussed it in community, but since KVM
> > maintainers reviewed the CET virtualization series for a long time, I
> > assume we're moving on the right way :-)
> 
> It seems like kind of leap that if it never came up that they must be
> approving of the specific detail. Don't know. Maybe they will chime in.

Yeah, I don't even know what the TRACKER bit does (I don't feel like reading the
SDM right now), let alone if what KVM does or doesn't do in response is remotely
correct.

For CALL/RET (and presumably any branch instructions with IBT?) other instructions
that are directly affected by CET, the simplest thing would probably be to disable
those in KVM's emulator if shadow stacks and/or IBT are enabled, and let KVM's
failure paths take it from there.

Then, *if* a use case comes along where the guest is utilizing CET and "needs"
KVM to emulate affected instructions, we can add the necessary support the emulator.

Alternatively, if teaching KVM's emulator to play nice with shadow stacks and IBT
is easy-ish, just do that.

Edgecombe, Rick P Jan. 5, 2024, 12:34 a.m. UTC | #6

On Thu, 2024-01-04 at 16:22 -0800, Sean Christopherson wrote:
> No, the days of KVM making shit up from are done.  IIUC, you're
> advocating that
> it's ok for KVM to induce a #CP that architecturally should not
> happen.  That is
> not acceptable, full stop.

Nope, not advocating that at all. I'm noticing that in this series KVM
has special emulator behavior that doesn't match the HW when CET is
enabled. That it *skips* emitting #CPs (and other CET behaviors SW
depends on), and wondering if it is a problem.

I'm worried that there is some way attackers will induce the host to
emulate an instruction and skip CET enforcement that the HW would
normally do.

> 
> Retrying the instruction in the guest, exiting to userspace, and even
> terminating
> the VM are all perfectly acceptable behaviors if KVM encounters
> something it can't
> *correctly* emulate.  But clobbering the shadow stack or not
> detecting a CFI
> violation, even if the guest is misbehaving, is not ok.
> 
[snip]
> Yeah, I don't even know what the TRACKER bit does (I don't feel like
> reading the
> SDM right now), let alone if what KVM does or doesn't do in response
> is remotely
> correct.
> 
> For CALL/RET (and presumably any branch instructions with IBT?) other
> instructions
> that are directly affected by CET, the simplest thing would probably
> be to disable
> those in KVM's emulator if shadow stacks and/or IBT are enabled, and
> let KVM's
> failure paths take it from there.

Right, that is what I was wondering might be the normal solution for
situations like this.

> 
> Then, *if* a use case comes along where the guest is utilizing CET
> and "needs"
> KVM to emulate affected instructions, we can add the necessary
> support the emulator.
> 
> Alternatively, if teaching KVM's emulator to play nice with shadow
> stacks and IBT
> is easy-ish, just do that.

I think it will not be very easy.

Jim Mattson Jan. 5, 2024, 12:44 a.m. UTC | #7

On Thu, Jan 4, 2024 at 4:34 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Thu, 2024-01-04 at 16:22 -0800, Sean Christopherson wrote:
> > No, the days of KVM making shit up from are done.  IIUC, you're
> > advocating that
> > it's ok for KVM to induce a #CP that architecturally should not
> > happen.  That is
> > not acceptable, full stop.
>
> Nope, not advocating that at all. I'm noticing that in this series KVM
> has special emulator behavior that doesn't match the HW when CET is
> enabled. That it *skips* emitting #CPs (and other CET behaviors SW
> depends on), and wondering if it is a problem.
>
> I'm worried that there is some way attackers will induce the host to
> emulate an instruction and skip CET enforcement that the HW would
> normally do.
>
> >
> > Retrying the instruction in the guest, exiting to userspace, and even
> > terminating
> > the VM are all perfectly acceptable behaviors if KVM encounters
> > something it can't
> > *correctly* emulate.  But clobbering the shadow stack or not
> > detecting a CFI
> > violation, even if the guest is misbehaving, is not ok.
> >
> [snip]
> > Yeah, I don't even know what the TRACKER bit does (I don't feel like
> > reading the
> > SDM right now), let alone if what KVM does or doesn't do in response
> > is remotely
> > correct.
> >
> > For CALL/RET (and presumably any branch instructions with IBT?) other
> > instructions
> > that are directly affected by CET, the simplest thing would probably
> > be to disable
> > those in KVM's emulator if shadow stacks and/or IBT are enabled, and
> > let KVM's
> > failure paths take it from there.
>
> Right, that is what I was wondering might be the normal solution for
> situations like this.

On AMD CPUs and on Intel CPUs with "unrestricted guest," I don't think
there is any need to emulate an instruction that doesn't either (a)
cause a VM-exit by opcode (e.g. CPUID) or (b) access memory.  I think
we should probably disable emulation of anything else, for both
security and sanity.

> >
> > Then, *if* a use case comes along where the guest is utilizing CET
> > and "needs"
> > KVM to emulate affected instructions, we can add the necessary
> > support the emulator.
> >
> > Alternatively, if teaching KVM's emulator to play nice with shadow
> > stacks and IBT
> > is easy-ish, just do that.
>
> I think it will not be very easy.

Sean Christopherson Jan. 5, 2024, 12:54 a.m. UTC | #8

On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
> On Thu, 2024-01-04 at 16:22 -0800, Sean Christopherson wrote:
> > No, the days of KVM making shit up from are done.  IIUC, you're advocating
> > that it's ok for KVM to induce a #CP that architecturally should not
> > happen.  That is not acceptable, full stop.
> 
> Nope, not advocating that at all.

Heh, wrong "you".  That "you" was directed at Weijiang, who I *think* is saying
that clobbering the shadow stack by emulating CALL+RET and thus inducing a bogus
#CP in the guest is ok.

> I'm noticing that in this series KVM has special emulator behavior that
> doesn't match the HW when CET is enabled. That it *skips* emitting #CPs (and
> other CET behaviors SW depends on), and wondering if it is a problem.

Yes, it's a problem.  But IIUC, as is KVM would also induce bogus #CPs (which is
probably less of a problem in practice, but still not acceptable).

> I'm worried that there is some way attackers will induce the host to
> emulate an instruction and skip CET enforcement that the HW would
> normally do.

Yep.  The best behavior for this is likely KVM's existing behavior, i.e. retry
the instruction in the guest, and if that doesn't work, kick out to userspace and
let userspace try to sort things out.

> > For CALL/RET (and presumably any branch instructions with IBT?) other
> > instructions that are directly affected by CET, the simplest thing would
> > probably be to disable those in KVM's emulator if shadow stacks and/or IBT
> > are enabled, and let KVM's failure paths take it from there.
> 
> Right, that is what I was wondering might be the normal solution for
> situations like this.

If KVM can't emulate something, it either retries the instruction (with some
decent logic to guard against infinite retries) or punts to userspace.

Or if the platform owner likes to play with fire and doesn't enable
KVM_CAP_EXIT_ON_EMULATION_FAILURE, KVM will inject a #UD (and still exit to
userspace if the emulation happened at CPL0).  And yes, that #UD is 100% KVM
making shit up, and yes, it has caused problems and confusion. :-)

> > Then, *if* a use case comes along where the guest is utilizing CET and
> > "needs" KVM to emulate affected instructions, we can add the necessary
> > support the emulator.
> > 
> > Alternatively, if teaching KVM's emulator to play nice with shadow stacks
> > and IBT is easy-ish, just do that.
> 
> I think it will not be very easy.

Yeah.  As Jim alluded to, I think it's probably time to admit that emulating
instructions for modern CPUs is a fools errand and KVM should simply stop trying.

Yang, Weijiang Jan. 5, 2024, 9:04 a.m. UTC | #9

On 1/5/2024 5:10 AM, Edgecombe, Rick P wrote:
> On Thu, 2024-01-04 at 15:11 +0800, Yang, Weijiang wrote:
[...]
>>> My understanding is that the KVM emulator kind of does what it has
>>> to
>>> keep things running, and isn't expected to emulate every possible
>>> instruction. With CET though, it is changing the behavior of
>>> existing
>>> supported instructions. I could imagine a guest could skip over CET
>>> enforcement by causing an MMIO exit and racing to overwrite the
>>> exit-
>>> causing instruction from a different vcpu to be an indirect
>>> CALL/RET,
>>> etc.
>> Can you elaborate the case? I cannot figure out how it works.
> The point that it should be possible for KVM to emulate call/ret with
> CET enabled. Not saying the specific case is critical, but the one I
> used as an example was that the KVM emulator can (or at least in the
> not too distant past) be forced to emulate arbitrary instructions if
> the guest overwrites the instruction between the exit and the SW fetch
> from the host.
>
> The steps are:
> vcpu 1                         vcpu 2
> -------------------------------------
> mov to mmio addr
> vm exit ept_misconfig
>                                 overwrite mov instruction to call %rax
> host emulator fetches
> host emulates call instruction
>
> So then the guest call operation will skip the endbranch check. But I'm
> not sure that there are not less exotic cases that would run across it.
> I see a bunch of cases where write protected memory kicks to the
> emulator as well. Not sure the exact scenarios and whether this could
> happen naturally in races during live migration, dirty tracking, etc.
> Again, I'm more just asking the exposure and thinking on it.

Now I get your points, I didn't think of exposure from guest and just thought of the
normal execution flow in guest, so I said let guest handle #CP directly.

Yes, I think we need to take these cases into account, as Sean suggested in following
replies, stopping emulation JMP/CALL/RET etc. instructions when guest CET is enabled
is effective and simple, I'll investigate the emulator code.

Thanks for raising the concerns!

Yang, Weijiang Jan. 5, 2024, 9:28 a.m. UTC | #10

On 1/5/2024 8:54 AM, Sean Christopherson wrote:
> On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
>> On Thu, 2024-01-04 at 16:22 -0800, Sean Christopherson wrote:
>>> No, the days of KVM making shit up from are done.  IIUC, you're advocating
>>> that it's ok for KVM to induce a #CP that architecturally should not
>>> happen.  That is not acceptable, full stop.
>> Nope, not advocating that at all.
> Heh, wrong "you".  That "you" was directed at Weijiang, who I *think* is saying
> that clobbering the shadow stack by emulating CALL+RET and thus inducing a bogus
> #CP in the guest is ok.

My fault, I just thought of the normal execution instead of the subverting cases :-)

>
>> I'm noticing that in this series KVM has special emulator behavior that
>> doesn't match the HW when CET is enabled. That it *skips* emitting #CPs (and
>> other CET behaviors SW depends on), and wondering if it is a problem.
> Yes, it's a problem.  But IIUC, as is KVM would also induce bogus #CPs (which is
> probably less of a problem in practice, but still not acceptable).

I'd choose to stop emulating the CET sensitive instructions while CET is enabled in guest
as re-enter guest after emulation would raise some kind of risk, but I don't know how to
stop the emulation decently.

>> I'm worried that there is some way attackers will induce the host to
>> emulate an instruction and skip CET enforcement that the HW would
>> normally do.
> Yep.  The best behavior for this is likely KVM's existing behavior, i.e. retry
> the instruction in the guest, and if that doesn't work, kick out to userspace and
> let userspace try to sort things out.
>
>>> For CALL/RET (and presumably any branch instructions with IBT?) other
>>> instructions that are directly affected by CET, the simplest thing would
>>> probably be to disable those in KVM's emulator if shadow stacks and/or IBT
>>> are enabled, and let KVM's failure paths take it from there.
>> Right, that is what I was wondering might be the normal solution for
>> situations like this.
> If KVM can't emulate something, it either retries the instruction (with some
> decent logic to guard against infinite retries) or punts to userspace.

What kind of error is proper if KVM has to punt to userspace? Or just inject #UD into guest
on detecting this case?

>
> Or if the platform owner likes to play with fire and doesn't enable
> KVM_CAP_EXIT_ON_EMULATION_FAILURE, KVM will inject a #UD (and still exit to
> userspace if the emulation happened at CPL0).  And yes, that #UD is 100% KVM
> making shit up, and yes, it has caused problems and confusion. :-)
>   
>>> Then, *if* a use case comes along where the guest is utilizing CET and
>>> "needs" KVM to emulate affected instructions, we can add the necessary
>>> support the emulator.
>>>
>>> Alternatively, if teaching KVM's emulator to play nice with shadow stacks
>>> and IBT is easy-ish, just do that.
>> I think it will not be very easy.
> Yeah.  As Jim alluded to, I think it's probably time to admit that emulating
> instructions for modern CPUs is a fools errand and KVM should simply stop trying.
>

Sean Christopherson Jan. 5, 2024, 4:21 p.m. UTC | #11

On Fri, Jan 05, 2024, Weijiang Yang wrote:
> On 1/5/2024 8:54 AM, Sean Christopherson wrote:
> > On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
> > > > For CALL/RET (and presumably any branch instructions with IBT?) other
> > > > instructions that are directly affected by CET, the simplest thing would
> > > > probably be to disable those in KVM's emulator if shadow stacks and/or IBT
> > > > are enabled, and let KVM's failure paths take it from there.
> > > Right, that is what I was wondering might be the normal solution for
> > > situations like this.
> > If KVM can't emulate something, it either retries the instruction (with some
> > decent logic to guard against infinite retries) or punts to userspace.
> 
> What kind of error is proper if KVM has to punt to userspace?

KVM_INTERNAL_ERROR_EMULATION.  See prepare_emulation_failure_exit().

> Or just inject #UD into guest on detecting this case?

No, do not inject #UD or do anything else that deviates from architecturally
defined behavior.

Edgecombe, Rick P Jan. 5, 2024, 5:52 p.m. UTC | #12

On Fri, 2024-01-05 at 08:21 -0800, Sean Christopherson wrote:
> No, do not inject #UD or do anything else that deviates from
> architecturally
> defined behavior.

Here is a, at least partial, list of CET touch points I just created by
searching the SDM:
1. The emulator SW fetch with TRACKER=1
2. CALL, RET, JMP, IRET, INT, SYSCALL, SYSENTER, SYSEXIT, SYSRET
3. Task switching
4. The new CET instructions (which I guess should be handled by
default): CLRSSBSY, INCSSPD, RSTORSSP, SAVEPREVSSP, SETSSBSYY, WRSS,
WRUSS

Not all of those are security checks, but would have some functional
implications. It's still not clear to me if this could happen naturally
(the TDP shadowing stuff), or only via strange attacker behavior. If we
only care about the attacker case, then we could have a smaller list.

It also sounds like the instructions in 2 could maybe be filtered by
mode instead of caring about CET being enabled. But maybe it's not good
to mix the CET problem with the bigger emulator issues. Don't know.

Jim Mattson Jan. 5, 2024, 6:09 p.m. UTC | #13

On Fri, Jan 5, 2024 at 9:53 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Fri, 2024-01-05 at 08:21 -0800, Sean Christopherson wrote:
> > No, do not inject #UD or do anything else that deviates from
> > architecturally
> > defined behavior.
>
> Here is a, at least partial, list of CET touch points I just created by
> searching the SDM:
> 1. The emulator SW fetch with TRACKER=1
> 2. CALL, RET, JMP, IRET, INT, SYSCALL, SYSENTER, SYSEXIT, SYSRET
> 3. Task switching

Sigh. KVM is forced to emulate task switch, because the hardware is
incapable of virtualizing it. How hard would it be to make KVM's
task-switch emulation CET-aware?

> 4. The new CET instructions (which I guess should be handled by
> default): CLRSSBSY, INCSSPD, RSTORSSP, SAVEPREVSSP, SETSSBSYY, WRSS,
> WRUSS
>
> Not all of those are security checks, but would have some functional
> implications. It's still not clear to me if this could happen naturally
> (the TDP shadowing stuff), or only via strange attacker behavior. If we
> only care about the attacker case, then we could have a smaller list.
>
> It also sounds like the instructions in 2 could maybe be filtered by
> mode instead of caring about CET being enabled. But maybe it's not good
> to mix the CET problem with the bigger emulator issues. Don't know.

Edgecombe, Rick P Jan. 5, 2024, 6:51 p.m. UTC | #14

On Fri, 2024-01-05 at 10:09 -0800, Jim Mattson wrote:
> > 3. Task switching
> 
> Sigh. KVM is forced to emulate task switch, because the hardware is
> incapable of virtualizing it. How hard would it be to make KVM's
> task-switch emulation CET-aware?

(I am not too familiar with this part of the arch).

See SDM Vol 3a, chapter 7.3, number 8 and 15. The behavior is around
actual task switching. At first glance, it looks annoying at least. It
would need to do a CMPXCHG to guest memory at some points and take care
to not implement the "Complex Shadow-Stack Updates" behavior.

But, would anyone use it? I'm not aware of any 32 bit supervisor shadow
stack support out there. So maybe it is ok to just punt to userspace in
this case?

Sean Christopherson Jan. 5, 2024, 7:34 p.m. UTC | #15

On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
> On Fri, 2024-01-05 at 10:09 -0800, Jim Mattson wrote:
> > > 3. Task switching
> > 
> > Sigh. KVM is forced to emulate task switch, because the hardware is
> > incapable of virtualizing it. How hard would it be to make KVM's
> > task-switch emulation CET-aware?
> 
> (I am not too familiar with this part of the arch).
> 
> See SDM Vol 3a, chapter 7.3, number 8 and 15. The behavior is around
> actual task switching. At first glance, it looks annoying at least. It
> would need to do a CMPXCHG to guest memory at some points and take care
> to not implement the "Complex Shadow-Stack Updates" behavior.
> 
> But, would anyone use it? I'm not aware of any 32 bit supervisor shadow
> stack support out there. So maybe it is ok to just punt to userspace in
> this case?

Yeah, I think KVM can punt.

Yang, Weijiang Jan. 8, 2024, 2:17 p.m. UTC | #16

On 1/6/2024 12:21 AM, Sean Christopherson wrote:
> On Fri, Jan 05, 2024, Weijiang Yang wrote:
>> On 1/5/2024 8:54 AM, Sean Christopherson wrote:
>>> On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
>>>>> For CALL/RET (and presumably any branch instructions with IBT?) other
>>>>> instructions that are directly affected by CET, the simplest thing would
>>>>> probably be to disable those in KVM's emulator if shadow stacks and/or IBT
>>>>> are enabled, and let KVM's failure paths take it from there.
>>>> Right, that is what I was wondering might be the normal solution for
>>>> situations like this.
>>> If KVM can't emulate something, it either retries the instruction (with some
>>> decent logic to guard against infinite retries) or punts to userspace.
>> What kind of error is proper if KVM has to punt to userspace?
> KVM_INTERNAL_ERROR_EMULATION.  See prepare_emulation_failure_exit().
>
>> Or just inject #UD into guest on detecting this case?
> No, do not inject #UD or do anything else that deviates from architecturally
> defined behavior.

Thanks!
But based on current KVM implementation and patch 24, seems that if CET is exposed
to guest, the emulation code or shadow paging mode couldn't be activated at the same time:

In vmx.c,
hardware_setup(void):
if (!cpu_has_vmx_unrestricted_guest() || !enable_ept)
         enable_unrestricted_guest = 0;

in vmx_set_cr0():
[...]
         if (enable_unrestricted_guest)
                 hw_cr0 |= KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST;
         else {
                 hw_cr0 |= KVM_VM_CR0_ALWAYS_ON;
                 if (!enable_ept)
                         hw_cr0 |= X86_CR0_WP;

                 if (vmx->rmode.vm86_active && (cr0 & X86_CR0_PE))
                         enter_pmode(vcpu);

                 if (!vmx->rmode.vm86_active && !(cr0 & X86_CR0_PE))
                         enter_rmode(vcpu);
         }
[...]

And in patch 24:

+   if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
+       !cpu_has_vmx_basic_no_hw_errcode()) {
+       kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+       kvm_cpu_cap_clear(X86_FEATURE_IBT);
+   }

Not sure if I missed anything.

Sean Christopherson Jan. 9, 2024, 3:10 p.m. UTC | #17

On Mon, Jan 08, 2024, Weijiang Yang wrote:
> On 1/6/2024 12:21 AM, Sean Christopherson wrote:
> > On Fri, Jan 05, 2024, Weijiang Yang wrote:
> > > On 1/5/2024 8:54 AM, Sean Christopherson wrote:
> > > > On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
> > > > > > For CALL/RET (and presumably any branch instructions with IBT?) other
> > > > > > instructions that are directly affected by CET, the simplest thing would
> > > > > > probably be to disable those in KVM's emulator if shadow stacks and/or IBT
> > > > > > are enabled, and let KVM's failure paths take it from there.
> > > > > Right, that is what I was wondering might be the normal solution for
> > > > > situations like this.
> > > > If KVM can't emulate something, it either retries the instruction (with some
> > > > decent logic to guard against infinite retries) or punts to userspace.
> > > What kind of error is proper if KVM has to punt to userspace?
> > KVM_INTERNAL_ERROR_EMULATION.  See prepare_emulation_failure_exit().
> > 
> > > Or just inject #UD into guest on detecting this case?
> > No, do not inject #UD or do anything else that deviates from architecturally
> > defined behavior.
> 
> Thanks!
> But based on current KVM implementation and patch 24, seems that if CET is exposed
> to guest, the emulation code or shadow paging mode couldn't be activated at the same time:

No, requiring unrestricted guest only disables the paths where KVM *delibeately*
emulates the entire guest code stream.  In no way, shape, or form does it prevent
KVM from attempting to emulate arbitrary instructions.

> In vmx.c,
> hardware_setup(void):
> if (!cpu_has_vmx_unrestricted_guest() || !enable_ept)
>         enable_unrestricted_guest = 0;
> 
> in vmx_set_cr0():
> [...]
>         if (enable_unrestricted_guest)
>                 hw_cr0 |= KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST;
>         else {
>                 hw_cr0 |= KVM_VM_CR0_ALWAYS_ON;
>                 if (!enable_ept)
>                         hw_cr0 |= X86_CR0_WP;
> 
>                 if (vmx->rmode.vm86_active && (cr0 & X86_CR0_PE))
>                         enter_pmode(vcpu);
> 
>                 if (!vmx->rmode.vm86_active && !(cr0 & X86_CR0_PE))
>                         enter_rmode(vcpu);
>         }
> [...]
> 
> And in patch 24:
> 
> +   if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
> +       !cpu_has_vmx_basic_no_hw_errcode()) {
> +       kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> +       kvm_cpu_cap_clear(X86_FEATURE_IBT);
> +   }
> 
> Not sure if I missed anything.
> 
>

Yang, Weijiang Jan. 11, 2024, 2:56 p.m. UTC | #18

On 1/9/2024 11:10 PM, Sean Christopherson wrote:
> On Mon, Jan 08, 2024, Weijiang Yang wrote:
>> On 1/6/2024 12:21 AM, Sean Christopherson wrote:
>>> On Fri, Jan 05, 2024, Weijiang Yang wrote:
>>>> On 1/5/2024 8:54 AM, Sean Christopherson wrote:
>>>>> On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
>>>>>>> For CALL/RET (and presumably any branch instructions with IBT?) other
>>>>>>> instructions that are directly affected by CET, the simplest thing would
>>>>>>> probably be to disable those in KVM's emulator if shadow stacks and/or IBT
>>>>>>> are enabled, and let KVM's failure paths take it from there.
>>>>>> Right, that is what I was wondering might be the normal solution for
>>>>>> situations like this.
>>>>> If KVM can't emulate something, it either retries the instruction (with some
>>>>> decent logic to guard against infinite retries) or punts to userspace.
>>>> What kind of error is proper if KVM has to punt to userspace?
>>> KVM_INTERNAL_ERROR_EMULATION.  See prepare_emulation_failure_exit().
>>>
>>>> Or just inject #UD into guest on detecting this case?
>>> No, do not inject #UD or do anything else that deviates from architecturally
>>> defined behavior.
>> Thanks!
>> But based on current KVM implementation and patch 24, seems that if CET is exposed
>> to guest, the emulation code or shadow paging mode couldn't be activated at the same time:
> No, requiring unrestricted guest only disables the paths where KVM *delibeately*
> emulates the entire guest code stream.  In no way, shape, or form does it prevent
> KVM from attempting to emulate arbitrary instructions.

Yes, also need to prevent sporadic emulation, how about adding below patch in emulator?


diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index e223043ef5b2..e817d8560ceb 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -178,6 +178,7 @@
  #define IncSP       ((u64)1 << 54)  /* SP is incremented before ModRM calc */
  #define TwoMemOp    ((u64)1 << 55)  /* Instruction has two memory operand */
  #define IsBranch    ((u64)1 << 56)  /* Instruction is considered a branch. */
+#define IsProtected ((u64)1 << 57)  /* Instruction is protected by CET. */

  #define DstXacc     (DstAccLo | SrcAccHi | SrcWrite)

@@ -4098,9 +4099,9 @@ static const struct opcode group4[] = {
  static const struct opcode group5[] = {
         F(DstMem | SrcNone | Lock,              em_inc),
         F(DstMem | SrcNone | Lock,              em_dec),
-       I(SrcMem | NearBranch | IsBranch,       em_call_near_abs),
-       I(SrcMemFAddr | ImplicitOps | IsBranch, em_call_far),
-       I(SrcMem | NearBranch | IsBranch,       em_jmp_abs),
+       I(SrcMem | NearBranch | IsBranch | IsProtected, em_call_near_abs),
+       I(SrcMemFAddr | ImplicitOps | IsBranch | IsProtected, em_call_far),
+       I(SrcMem | NearBranch | IsBranch | IsProtected, em_jmp_abs),
         I(SrcMemFAddr | ImplicitOps | IsBranch, em_jmp_far),
         I(SrcMem | Stack | TwoMemOp,            em_push), D(Undefined),
  };
@@ -4362,11 +4363,11 @@ static const struct opcode opcode_table[256] = {
         /* 0xC8 - 0xCF */
         I(Stack | SrcImmU16 | Src2ImmByte | IsBranch, em_enter),
         I(Stack | IsBranch, em_leave),
-       I(ImplicitOps | SrcImmU16 | IsBranch, em_ret_far_imm),
-       I(ImplicitOps | IsBranch, em_ret_far),
-       D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch, intn),
+       I(ImplicitOps | SrcImmU16 | IsBranch | IsProtected, em_ret_far_imm),
+       I(ImplicitOps | IsBranch | IsProtected, em_ret_far),
+       D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch | IsProtected, intn),
         D(ImplicitOps | No64 | IsBranch),
-       II(ImplicitOps | IsBranch, em_iret, iret),
+       II(ImplicitOps | IsBranch | IsProtected, em_iret, iret),
         /* 0xD0 - 0xD7 */
         G(Src2One | ByteOp, group2), G(Src2One, group2),
         G(Src2CL | ByteOp, group2), G(Src2CL, group2),
@@ -4382,7 +4383,7 @@ static const struct opcode opcode_table[256] = {
         I2bvIP(SrcImmUByte | DstAcc, em_in,  in,  check_perm_in),
         I2bvIP(SrcAcc | DstImmUByte, em_out, out, check_perm_out),
         /* 0xE8 - 0xEF */
-       I(SrcImm | NearBranch | IsBranch, em_call),
+       I(SrcImm | NearBranch | IsBranch | IsProtected, em_call),
         D(SrcImm | ImplicitOps | NearBranch | IsBranch),
         I(SrcImmFAddr | No64 | IsBranch, em_jmp_far),
         D(SrcImmByte | ImplicitOps | NearBranch | IsBranch),
@@ -4401,7 +4402,7 @@ static const struct opcode opcode_table[256] = {
  static const struct opcode twobyte_table[256] = {
         /* 0x00 - 0x0F */
         G(0, group6), GD(0, &group7), N, N,
-       N, I(ImplicitOps | EmulateOnUD | IsBranch, em_syscall),
+       N, I(ImplicitOps | EmulateOnUD | IsBranch | IsProtected, em_syscall),
         II(ImplicitOps | Priv, em_clts, clts), N,
         DI(ImplicitOps | Priv, invd), DI(ImplicitOps | Priv, wbinvd), N, N,
         N, D(ImplicitOps | ModRM | SrcMem | NoAccess), N, N,
@@ -4432,8 +4433,8 @@ static const struct opcode twobyte_table[256] = {
         IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
         II(ImplicitOps | Priv, em_rdmsr, rdmsr),
         IIP(ImplicitOps, em_rdpmc, rdpmc, check_rdpmc),
-       I(ImplicitOps | EmulateOnUD | IsBranch, em_sysenter),
-       I(ImplicitOps | Priv | EmulateOnUD | IsBranch, em_sysexit),
+       I(ImplicitOps | EmulateOnUD | IsBranch | IsProtected, em_sysenter),
+       I(ImplicitOps | Priv | EmulateOnUD | IsBranch | IsProtected, em_sysexit),
         N, N,
         N, N, N, N, N, N, N, N,
         /* 0x40 - 0x4F */
@@ -4971,6 +4972,12 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
         if (ctxt->d == 0)
                 return EMULATION_FAILED;
+       if ((opcode.flags & IsProtected) &&
+           (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_CET)) {
+               WARN_ONCE(1, "CET is active, emulation aborted.\n");
+               return EMULATION_FAILED;
+       }
+
         ctxt->execute = opcode.u.execute;

         if (unlikely(emulation_type & EMULTYPE_TRAP_UD) &&

Chao Gao Jan. 15, 2024, 1:55 a.m. UTC | #19

On Thu, Jan 11, 2024 at 10:56:55PM +0800, Yang, Weijiang wrote:
>On 1/9/2024 11:10 PM, Sean Christopherson wrote:
>> On Mon, Jan 08, 2024, Weijiang Yang wrote:
>> > On 1/6/2024 12:21 AM, Sean Christopherson wrote:
>> > > On Fri, Jan 05, 2024, Weijiang Yang wrote:
>> > > > On 1/5/2024 8:54 AM, Sean Christopherson wrote:
>> > > > > On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
>> > > > > > > For CALL/RET (and presumably any branch instructions with IBT?) other
>> > > > > > > instructions that are directly affected by CET, the simplest thing would
>> > > > > > > probably be to disable those in KVM's emulator if shadow stacks and/or IBT
>> > > > > > > are enabled, and let KVM's failure paths take it from there.
>> > > > > > Right, that is what I was wondering might be the normal solution for
>> > > > > > situations like this.
>> > > > > If KVM can't emulate something, it either retries the instruction (with some
>> > > > > decent logic to guard against infinite retries) or punts to userspace.
>> > > > What kind of error is proper if KVM has to punt to userspace?
>> > > KVM_INTERNAL_ERROR_EMULATION.  See prepare_emulation_failure_exit().
>> > > 
>> > > > Or just inject #UD into guest on detecting this case?
>> > > No, do not inject #UD or do anything else that deviates from architecturally
>> > > defined behavior.
>> > Thanks!
>> > But based on current KVM implementation and patch 24, seems that if CET is exposed
>> > to guest, the emulation code or shadow paging mode couldn't be activated at the same time:
>> No, requiring unrestricted guest only disables the paths where KVM *delibeately*
>> emulates the entire guest code stream.  In no way, shape, or form does it prevent
>> KVM from attempting to emulate arbitrary instructions.
>
>Yes, also need to prevent sporadic emulation, how about adding below patch in emulator?
>
>
>diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>index e223043ef5b2..e817d8560ceb 100644
>--- a/arch/x86/kvm/emulate.c
>+++ b/arch/x86/kvm/emulate.c
>@@ -178,6 +178,7 @@
> #define IncSP       ((u64)1 << 54)  /* SP is incremented before ModRM calc */
> #define TwoMemOp    ((u64)1 << 55)  /* Instruction has two memory operand */
> #define IsBranch    ((u64)1 << 56)  /* Instruction is considered a branch. */
>+#define IsProtected ((u64)1 << 57)  /* Instruction is protected by CET. */
>
> #define DstXacc     (DstAccLo | SrcAccHi | SrcWrite)
>
>@@ -4098,9 +4099,9 @@ static const struct opcode group4[] = {
> static const struct opcode group5[] = {
>        F(DstMem | SrcNone | Lock,              em_inc),
>        F(DstMem | SrcNone | Lock,              em_dec),
>-       I(SrcMem | NearBranch | IsBranch,       em_call_near_abs),
>-       I(SrcMemFAddr | ImplicitOps | IsBranch, em_call_far),
>-       I(SrcMem | NearBranch | IsBranch,       em_jmp_abs),
>+       I(SrcMem | NearBranch | IsBranch | IsProtected, em_call_near_abs),
>+       I(SrcMemFAddr | ImplicitOps | IsBranch | IsProtected, em_call_far),
>+       I(SrcMem | NearBranch | IsBranch | IsProtected, em_jmp_abs),
>        I(SrcMemFAddr | ImplicitOps | IsBranch, em_jmp_far),
>        I(SrcMem | Stack | TwoMemOp,            em_push), D(Undefined),
> };
>@@ -4362,11 +4363,11 @@ static const struct opcode opcode_table[256] = {
>        /* 0xC8 - 0xCF */
>        I(Stack | SrcImmU16 | Src2ImmByte | IsBranch, em_enter),
>        I(Stack | IsBranch, em_leave),
>-       I(ImplicitOps | SrcImmU16 | IsBranch, em_ret_far_imm),
>-       I(ImplicitOps | IsBranch, em_ret_far),
>-       D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch, intn),
>+       I(ImplicitOps | SrcImmU16 | IsBranch | IsProtected, em_ret_far_imm),
>+       I(ImplicitOps | IsBranch | IsProtected, em_ret_far),
>+       D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch | IsProtected, intn),
>        D(ImplicitOps | No64 | IsBranch),
>-       II(ImplicitOps | IsBranch, em_iret, iret),
>+       II(ImplicitOps | IsBranch | IsProtected, em_iret, iret),
>        /* 0xD0 - 0xD7 */
>        G(Src2One | ByteOp, group2), G(Src2One, group2),
>        G(Src2CL | ByteOp, group2), G(Src2CL, group2),
>@@ -4382,7 +4383,7 @@ static const struct opcode opcode_table[256] = {
>        I2bvIP(SrcImmUByte | DstAcc, em_in,  in,  check_perm_in),
>        I2bvIP(SrcAcc | DstImmUByte, em_out, out, check_perm_out),
>        /* 0xE8 - 0xEF */
>-       I(SrcImm | NearBranch | IsBranch, em_call),
>+       I(SrcImm | NearBranch | IsBranch | IsProtected, em_call),
>        D(SrcImm | ImplicitOps | NearBranch | IsBranch),
>        I(SrcImmFAddr | No64 | IsBranch, em_jmp_far),
>        D(SrcImmByte | ImplicitOps | NearBranch | IsBranch),
>@@ -4401,7 +4402,7 @@ static const struct opcode opcode_table[256] = {
> static const struct opcode twobyte_table[256] = {
>        /* 0x00 - 0x0F */
>        G(0, group6), GD(0, &group7), N, N,
>-       N, I(ImplicitOps | EmulateOnUD | IsBranch, em_syscall),
>+       N, I(ImplicitOps | EmulateOnUD | IsBranch | IsProtected, em_syscall),
>        II(ImplicitOps | Priv, em_clts, clts), N,
>        DI(ImplicitOps | Priv, invd), DI(ImplicitOps | Priv, wbinvd), N, N,
>        N, D(ImplicitOps | ModRM | SrcMem | NoAccess), N, N,
>@@ -4432,8 +4433,8 @@ static const struct opcode twobyte_table[256] = {
>        IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
>        II(ImplicitOps | Priv, em_rdmsr, rdmsr),
>        IIP(ImplicitOps, em_rdpmc, rdpmc, check_rdpmc),
>-       I(ImplicitOps | EmulateOnUD | IsBranch, em_sysenter),
>-       I(ImplicitOps | Priv | EmulateOnUD | IsBranch, em_sysexit),
>+       I(ImplicitOps | EmulateOnUD | IsBranch | IsProtected, em_sysenter),
>+       I(ImplicitOps | Priv | EmulateOnUD | IsBranch | IsProtected, em_sysexit),
>        N, N,
>        N, N, N, N, N, N, N, N,
>        /* 0x40 - 0x4F */
>@@ -4971,6 +4972,12 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
>        if (ctxt->d == 0)
>                return EMULATION_FAILED;
>+       if ((opcode.flags & IsProtected) &&
>+           (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_CET)) {

CR4.CET doesn't necessarily mean IBT or shadow stack is enabled. why not check
CPL and IA32_S/U_CET?

>+               WARN_ONCE(1, "CET is active, emulation aborted.\n");

remove this WARN_ONCE(). Guest can trigger this at will and overflow host dmesg.

if you really want to tell usespace the emulation_failure is due to CET, maybe
you can add a new flag like KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES.
for now, I won't bother to add this because probably userspace just terminates
the VM on any instruction failure (i.e., won't try to figure out the reason of
the instruction failure and fix it).

Yang, Weijiang Jan. 17, 2024, 12:53 a.m. UTC | #20

On 1/15/2024 9:55 AM, Chao Gao wrote:
> On Thu, Jan 11, 2024 at 10:56:55PM +0800, Yang, Weijiang wrote:
>> On 1/9/2024 11:10 PM, Sean Christopherson wrote:
>>> On Mon, Jan 08, 2024, Weijiang Yang wrote:
>>>> On 1/6/2024 12:21 AM, Sean Christopherson wrote:
>>>>> On Fri, Jan 05, 2024, Weijiang Yang wrote:
>>>>>> On 1/5/2024 8:54 AM, Sean Christopherson wrote:
>>>>>>> On Fri, Jan 05, 2024, Rick P Edgecombe wrote:
>>>>>>>>> For CALL/RET (and presumably any branch instructions with IBT?) other
>>>>>>>>> instructions that are directly affected by CET, the simplest thing would
>>>>>>>>> probably be to disable those in KVM's emulator if shadow stacks and/or IBT
>>>>>>>>> are enabled, and let KVM's failure paths take it from there.
>>>>>>>> Right, that is what I was wondering might be the normal solution for
>>>>>>>> situations like this.
>>>>>>> If KVM can't emulate something, it either retries the instruction (with some
>>>>>>> decent logic to guard against infinite retries) or punts to userspace.
>>>>>> What kind of error is proper if KVM has to punt to userspace?
>>>>> KVM_INTERNAL_ERROR_EMULATION.  See prepare_emulation_failure_exit().
>>>>>
>>>>>> Or just inject #UD into guest on detecting this case?
>>>>> No, do not inject #UD or do anything else that deviates from architecturally
>>>>> defined behavior.
>>>> Thanks!
>>>> But based on current KVM implementation and patch 24, seems that if CET is exposed
>>>> to guest, the emulation code or shadow paging mode couldn't be activated at the same time:
>>> No, requiring unrestricted guest only disables the paths where KVM *delibeately*
>>> emulates the entire guest code stream.  In no way, shape, or form does it prevent
>>> KVM from attempting to emulate arbitrary instructions.
>> Yes, also need to prevent sporadic emulation, how about adding below patch in emulator?
>>
>>
>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>> index e223043ef5b2..e817d8560ceb 100644
>> --- a/arch/x86/kvm/emulate.c
>> +++ b/arch/x86/kvm/emulate.c
>> @@ -178,6 +178,7 @@
>>   #define IncSP       ((u64)1 << 54)  /* SP is incremented before ModRM calc */
>>   #define TwoMemOp    ((u64)1 << 55)  /* Instruction has two memory operand */
>>   #define IsBranch    ((u64)1 << 56)  /* Instruction is considered a branch. */
>> +#define IsProtected ((u64)1 << 57)  /* Instruction is protected by CET. */
>>
>>   #define DstXacc     (DstAccLo | SrcAccHi | SrcWrite)
>>
>> @@ -4098,9 +4099,9 @@ static const struct opcode group4[] = {
>>   static const struct opcode group5[] = {
>>          F(DstMem | SrcNone | Lock,              em_inc),
>>          F(DstMem | SrcNone | Lock,              em_dec),
>> -       I(SrcMem | NearBranch | IsBranch,       em_call_near_abs),
>> -       I(SrcMemFAddr | ImplicitOps | IsBranch, em_call_far),
>> -       I(SrcMem | NearBranch | IsBranch,       em_jmp_abs),
>> +       I(SrcMem | NearBranch | IsBranch | IsProtected, em_call_near_abs),
>> +       I(SrcMemFAddr | ImplicitOps | IsBranch | IsProtected, em_call_far),
>> +       I(SrcMem | NearBranch | IsBranch | IsProtected, em_jmp_abs),
>>          I(SrcMemFAddr | ImplicitOps | IsBranch, em_jmp_far),
>>          I(SrcMem | Stack | TwoMemOp,            em_push), D(Undefined),
>>   };
>> @@ -4362,11 +4363,11 @@ static const struct opcode opcode_table[256] = {
>>          /* 0xC8 - 0xCF */
>>          I(Stack | SrcImmU16 | Src2ImmByte | IsBranch, em_enter),
>>          I(Stack | IsBranch, em_leave),
>> -       I(ImplicitOps | SrcImmU16 | IsBranch, em_ret_far_imm),
>> -       I(ImplicitOps | IsBranch, em_ret_far),
>> -       D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch, intn),
>> +       I(ImplicitOps | SrcImmU16 | IsBranch | IsProtected, em_ret_far_imm),
>> +       I(ImplicitOps | IsBranch | IsProtected, em_ret_far),
>> +       D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch | IsProtected, intn),
>>          D(ImplicitOps | No64 | IsBranch),
>> -       II(ImplicitOps | IsBranch, em_iret, iret),
>> +       II(ImplicitOps | IsBranch | IsProtected, em_iret, iret),
>>          /* 0xD0 - 0xD7 */
>>          G(Src2One | ByteOp, group2), G(Src2One, group2),
>>          G(Src2CL | ByteOp, group2), G(Src2CL, group2),
>> @@ -4382,7 +4383,7 @@ static const struct opcode opcode_table[256] = {
>>          I2bvIP(SrcImmUByte | DstAcc, em_in,  in,  check_perm_in),
>>          I2bvIP(SrcAcc | DstImmUByte, em_out, out, check_perm_out),
>>          /* 0xE8 - 0xEF */
>> -       I(SrcImm | NearBranch | IsBranch, em_call),
>> +       I(SrcImm | NearBranch | IsBranch | IsProtected, em_call),
>>          D(SrcImm | ImplicitOps | NearBranch | IsBranch),
>>          I(SrcImmFAddr | No64 | IsBranch, em_jmp_far),
>>          D(SrcImmByte | ImplicitOps | NearBranch | IsBranch),
>> @@ -4401,7 +4402,7 @@ static const struct opcode opcode_table[256] = {
>>   static const struct opcode twobyte_table[256] = {
>>          /* 0x00 - 0x0F */
>>          G(0, group6), GD(0, &group7), N, N,
>> -       N, I(ImplicitOps | EmulateOnUD | IsBranch, em_syscall),
>> +       N, I(ImplicitOps | EmulateOnUD | IsBranch | IsProtected, em_syscall),
>>          II(ImplicitOps | Priv, em_clts, clts), N,
>>          DI(ImplicitOps | Priv, invd), DI(ImplicitOps | Priv, wbinvd), N, N,
>>          N, D(ImplicitOps | ModRM | SrcMem | NoAccess), N, N,
>> @@ -4432,8 +4433,8 @@ static const struct opcode twobyte_table[256] = {
>>          IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
>>          II(ImplicitOps | Priv, em_rdmsr, rdmsr),
>>          IIP(ImplicitOps, em_rdpmc, rdpmc, check_rdpmc),
>> -       I(ImplicitOps | EmulateOnUD | IsBranch, em_sysenter),
>> -       I(ImplicitOps | Priv | EmulateOnUD | IsBranch, em_sysexit),
>> +       I(ImplicitOps | EmulateOnUD | IsBranch | IsProtected, em_sysenter),
>> +       I(ImplicitOps | Priv | EmulateOnUD | IsBranch | IsProtected, em_sysexit),
>>          N, N,
>>          N, N, N, N, N, N, N, N,
>>          /* 0x40 - 0x4F */
>> @@ -4971,6 +4972,12 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
>>          if (ctxt->d == 0)
>>                  return EMULATION_FAILED;
>> +       if ((opcode.flags & IsProtected) &&
>> +           (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_CET)) {
> CR4.CET doesn't necessarily mean IBT or shadow stack is enabled. why not check
> CPL and IA32_S/U_CET?

CR4.CET is the master control bit for CET features, a sane guest should set the bit iff it wants
to activate CET features. On the contrast, the IBT/SHSTK bits in IA32_S/U_CET only mean
the feature is enabled but maybe not active at the moment emulator is working, so no need
to stop emulation in this case.

>
>> +               WARN_ONCE(1, "CET is active, emulation aborted.\n");
> remove this WARN_ONCE(). Guest can trigger this at will and overflow host dmesg.

OK, the purpose is to give some informative message when guest hits the prohibited cases.
I can remove it. Thanks!
>
> if you really want to tell usespace the emulation_failure is due to CET, maybe
> you can add a new flag like KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES.
> for now, I won't bother to add this because probably userspace just terminates
> the VM on any instruction failure (i.e., won't try to figure out the reason of
> the instruction failure and fix it).

Agreed, don't need to another flag to indicate this is due to CET on.