[0/3] KVM: x86: SGX vs. XCR0 cleanups

Message ID 20230405005911.423699-1-seanjc@google.com
State New

Commit Message

Sean Christopherson April 5, 2023, 12:59 a.m. UTC
  *** WARNING *** ABI breakage.

Stop adjusting the guest's CPUID info for the allowed XFRM (a.k.a. XCR0)
for SGX enclaves.  Past me didn't understand the roles and responsibilities
between userspace and KVM with respect to CPUID leafs, i.e. I thought I was
being helpful by having KVM adjust the entries.

This is clearly an ABI breakage, but QEMU (tries to) do the right thing,
and AFAIK no other VMMs support SGX (yet), so I'm hoping we can excise the
ugly before userspace starts depending on the bad behavior.

Compile tested only (don't have an SGX system these days).

Note, QEMU commit 301e90675c ("target/i386: Enable support for XSAVES
based features") completely broke SGX by using allowed XSS instead of
XCR0, and no one has complained.  That gives me hope that this one will
go through as well.
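
To make the XSS vs. XCR0 distinction concrete, here is a minimal sketch of how
a VMM might compute the XFRM bits it advertises in CPUID.0x12.1 ECX/EDX.  The
helper name sgx_allowed_xfrm() and its parameters are hypothetical, this is not
QEMU or KVM code.  It assumes the allowed XFRM is the host's XFRM masked by the
guest's XCR0, with FP and SSE always allowed.

```c
#include <assert.h>
#include <stdint.h>

/* XSAVE feature bits: x87 state is bit 0, SSE state is bit 1. */
#define XSTATE_FP_MASK  (1ULL << 0)
#define XSTATE_SSE_MASK (1ULL << 1)

/*
 * Hypothetical helper: compute the allowed enclave XFRM to advertise in
 * CPUID.0x12.1, ECX (low 32 bits) and EDX (high 32 bits).  The host XFRM
 * is masked by the guest's XCR0 (user states), not by XSS.
 */
static inline void sgx_allowed_xfrm(uint64_t host_xfrm, uint64_t guest_xcr0,
                                    uint32_t *ecx, uint32_t *edx)
{
    uint64_t xfrm;

    assert(ecx && edx);

    xfrm = host_xfrm & guest_xcr0;

    /* FP and SSE are always allowed regardless of XSAVE/XCR0. */
    xfrm |= XSTATE_FP_MASK | XSTATE_SSE_MASK;

    *ecx = (uint32_t)xfrm;
    *edx = (uint32_t)(xfrm >> 32);
}
```

Passing an XSS-style mask (often zero) in place of XCR0 leaves only FP and SSE,
which would explain why enclaves that want e.g. AVX stop working.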

I believe the QEMU fix is below.  I'll post a patch at some point unless
someone wants to do the dirty work and claim the patch as their own.

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 6576287e5b..f083ff4335 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -5718,8 +5718,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         } else {
             *eax &= env->features[FEAT_SGX_12_1_EAX];
             *ebx &= 0; /* ebx reserve */
-            *ecx &= env->features[FEAT_XSAVE_XSS_LO];
-            *edx &= env->features[FEAT_XSAVE_XSS_HI];
+            *ecx &= env->features[FEAT_XSAVE_XCR0_LO];
+            *edx &= env->features[FEAT_XSAVE_XCR0_HI];
 
             /* FP and SSE are always allowed regardless of XSAVE/XCR0. */
             *ecx |= XSTATE_FP_MASK | XSTATE_SSE_MASK;


Sean Christopherson (3):
  KVM: VMX: Don't rely _only_ on CPUID to enforce XCR0 restrictions for
    ECREATE
  KVM: x86: Don't adjust guest's CPUID.0x12.1 (allowed SGX enclave XFRM)
  KVM: x86: Open code supported XCR0 calculation in
    kvm_vcpu_after_set_cpuid()

 arch/x86/kvm/cpuid.c   | 43 ++++++++++--------------------------------
 arch/x86/kvm/vmx/sgx.c |  3 ++-
 2 files changed, 12 insertions(+), 34 deletions(-)


base-commit: 27d6845d258b67f4eb3debe062b7dacc67e0c393
  

Comments

Kai Huang April 5, 2023, 3:05 a.m. UTC | #1
On Tue, 2023-04-04 at 17:59 -0700, Sean Christopherson wrote:
> *** WARNING *** ABI breakage.
> 
> Stop adjusting the guest's CPUID info for the allowed XFRM (a.k.a. XCR0)
> for SGX enclaves.  Past me didn't understand the roles and responsibilities
> between userspace and KVM with respect to CPUID leafs, i.e. I thought I was
> being helpful by having KVM adjust the entries.
> 
> This is clearly an ABI breakage, but QEMU (tries to) do the right thing,
> and AFAIK no other VMMs support SGX (yet), so I'm hoping we can excise the
> ugly before userspace starts depending on the bad behavior.
> 
> Compile tested only (don't have an SGX system these days).

I'll look into this, and in the meantime ...

> 
> Note, QEMU commit 301e90675c ("target/i386: Enable support for XSAVES
> based features") completely broke SGX by using allowed XSS instead of
> XCR0, and no one has complained.  That gives me hope that this one will
> go through as well.

...

Actually, we got a complaint around half a year ago:

https://github.com/gramineproject/gramine/issues/955#issuecomment-1272829510

> 
> I believe the QEMU fix is below.  I'll post a patch at some point unless
> someone wants to do the dirty work and claim the patch as their own.
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 6576287e5b..f083ff4335 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -5718,8 +5718,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          } else {
>              *eax &= env->features[FEAT_SGX_12_1_EAX];
>              *ebx &= 0; /* ebx reserve */
> -            *ecx &= env->features[FEAT_XSAVE_XSS_LO];
> -            *edx &= env->features[FEAT_XSAVE_XSS_HI];
> +            *ecx &= env->features[FEAT_XSAVE_XCR0_LO];
> +            *edx &= env->features[FEAT_XSAVE_XCR0_HI];
>  
>              /* FP and SSE are always allowed regardless of XSAVE/XCR0. */
>              *ecx |= XSTATE_FP_MASK | XSTATE_SSE_MASK;

Since then, Yang has posted a patch to the QEMU mailing list to fix it:

https://lists.nongnu.org/archive/html/qemu-devel/2022-10/msg04990.html

I thought it had been merged, but it seems it hasn't :)

> 
> Sean Christopherson (3):
>   KVM: VMX: Don't rely _only_ on CPUID to enforce XCR0 restrictions for
>     ECREATE
>   KVM: x86: Don't adjust guest's CPUID.0x12.1 (allowed SGX enclave XFRM)
>   KVM: x86: Open code supported XCR0 calculation in
>     kvm_vcpu_after_set_cpuid()
> 
>  arch/x86/kvm/cpuid.c   | 43 ++++++++++--------------------------------
>  arch/x86/kvm/vmx/sgx.c |  3 ++-
>  2 files changed, 12 insertions(+), 34 deletions(-)
> 
> 
> base-commit: 27d6845d258b67f4eb3debe062b7dacc67e0c393
> -- 
> 2.40.0.348.gf938b09366-goog
>
  
Kai Huang April 5, 2023, 9:44 a.m. UTC | #2
On Tue, 2023-04-04 at 17:59 -0700, Sean Christopherson wrote:
> *** WARNING *** ABI breakage.
> 
> Stop adjusting the guest's CPUID info for the allowed XFRM (a.k.a. XCR0)
> for SGX enclaves.  Past me didn't understand the roles and responsibilities
> between userspace and KVM with respect to CPUID leafs, i.e. I thought I was
> being helpful by having KVM adjust the entries.

Actually I am not clear about this topic.

So the rule is KVM should never adjust CPUID entries passed from userspace?

What if the userspace passed the incorrect CPUID entries?  Should KVM sanitize
those CPUID entries to ensure there's no insane configuration?  My concern is if
we allow guest to be created with insane CPUID configurations, the guest can be
confused and behave unexpectedly.

> 
> This is clearly an ABI breakage, but QEMU (tries to) do the right thing,
> and AFAIK no other VMMs support SGX (yet), so I'm hoping we can excise the
> ugly before userspace starts depending on the bad behavior.

I wouldn't worry about userspace being broken, because, IIUC, such breakage can
only happen when userspace doesn't do the right thing (i.e. it sets SGX
CPUID.0x12.1 to allow more bits than XCR0).
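
For illustration, the case above (CPUID.0x12.1 advertising more XFRM bits than
XCR0) can be sketched as a check applied when the guest runs ECREATE.  The
helper ecreate_xfrm_allowed() and its parameters are hypothetical and only
mirror the idea in the title of patch 1; this is not the actual KVM code.

```c
#include <stdbool.h>
#include <stdint.h>

#define XFEATURE_MASK_FP   (1ULL << 0)
#define XFEATURE_MASK_SSE  (1ULL << 1)

/*
 * Hypothetical check: the XFRM requested for a new enclave must be allowed
 * both by what CPUID.0x12.1 (ECX:EDX) advertises and by the XCR0 bits the
 * guest can actually enable.  FP and SSE are always allowed.
 */
static inline bool ecreate_xfrm_allowed(uint64_t secs_xfrm,
                                        uint64_t cpuid_12_1_xfrm,
                                        uint64_t guest_supported_xcr0)
{
    uint64_t fpsse = XFEATURE_MASK_FP | XFEATURE_MASK_SSE;

    /* Restriction 1: the guest's CPUID model, as provided by userspace. */
    if (secs_xfrm & ~cpuid_12_1_xfrm)
        return false;

    /* Restriction 2: XCR0 itself, independent of CPUID.0x12.1. */
    if (secs_xfrm & ~(guest_supported_xcr0 | fpsse))
        return false;

    return true;
}
```

With the second check in place, a userspace that over-advertises XFRM in CPUID
still cannot get an enclave created with features the guest can't enable in
XCR0.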

> 
> Compile tested only (don't have an SGX system these days).
> 
> Note, QEMU commit 301e90675c ("target/i386: Enable support for XSAVES
> based features") completely broke SGX by using allowed XSS instead of
> XCR0, and no one has complained.  That gives me hope that this one will
> go through as well.
> 
> I believe the QEMU fix is below.  I'll post a patch at some point unless
> someone wants to do the dirty work and claim the patch as their own.
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 6576287e5b..f083ff4335 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -5718,8 +5718,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          } else {
>              *eax &= env->features[FEAT_SGX_12_1_EAX];
>              *ebx &= 0; /* ebx reserve */
> -            *ecx &= env->features[FEAT_XSAVE_XSS_LO];
> -            *edx &= env->features[FEAT_XSAVE_XSS_HI];
> +            *ecx &= env->features[FEAT_XSAVE_XCR0_LO];
> +            *edx &= env->features[FEAT_XSAVE_XCR0_HI];
>  
>              /* FP and SSE are always allowed regardless of XSAVE/XCR0. */
>              *ecx |= XSTATE_FP_MASK | XSTATE_SSE_MASK;
> 
> Sean Christopherson (3):
>   KVM: VMX: Don't rely _only_ on CPUID to enforce XCR0 restrictions for
>     ECREATE
>   KVM: x86: Don't adjust guest's CPUID.0x12.1 (allowed SGX enclave XFRM)
>   KVM: x86: Open code supported XCR0 calculation in
>     kvm_vcpu_after_set_cpuid()
> 
>  arch/x86/kvm/cpuid.c   | 43 ++++++++++--------------------------------
>  arch/x86/kvm/vmx/sgx.c |  3 ++-
>  2 files changed, 12 insertions(+), 34 deletions(-)
> 
> 
> base-commit: 27d6845d258b67f4eb3debe062b7dacc67e0c393
> -- 
> 2.40.0.348.gf938b09366-goog
>
  
Sean Christopherson April 6, 2023, 2:10 a.m. UTC | #3
On Wed, Apr 05, 2023, Huang, Kai wrote:
> On Tue, 2023-04-04 at 17:59 -0700, Sean Christopherson wrote:
> > *** WARNING *** ABI breakage.
> > 
> > Stop adjusting the guest's CPUID info for the allowed XFRM (a.k.a. XCR0)
> > for SGX enclaves.  Past me didn't understand the roles and responsibilities
> > between userspace and KVM with respect to CPUID leafs, i.e. I thought I was
> > being helpful by having KVM adjust the entries.
> 
> Actually I am not clear about this topic.
> 
> So the rule is KVM should never adjust CPUID entries passed from userspace?

Yes, except for true runtime entries where a CPUID leaf is dynamic based on other
CPU state, e.g. CR4 bits, MISC_ENABLES in the MONITOR/MWAIT case, etc.
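
As a rough illustration of what "true runtime entries" means, the sketch below
recomputes the two dynamic bits of CPUID.1:ECX from other CPU state.  The bit
positions are architectural (CR4.OSXSAVE is CR4 bit 18, IA32_MISC_ENABLE bit 18
gates MONITOR/MWAIT), but the helper runtime_cpuid_1_ecx() is hypothetical and
much simpler than KVM's real logic.

```c
#include <stdint.h>

#define X86_CR4_OSXSAVE      (1ULL << 18)  /* CR4.OSXSAVE */
#define MISC_ENABLE_MWAIT    (1ULL << 18)  /* IA32_MISC_ENABLE[18] */
#define CPUID_1_ECX_MONITOR  (1U << 3)     /* CPUID.1:ECX.MONITOR */
#define CPUID_1_ECX_OSXSAVE  (1U << 27)    /* CPUID.1:ECX.OSXSAVE */

/*
 * Hypothetical helper: start from the ECX value userspace provided and
 * recompute only the bits that depend on current CPU state.  All other
 * bits are passed through untouched.
 */
static inline uint32_t runtime_cpuid_1_ecx(uint32_t userspace_ecx,
                                           uint64_t cr4,
                                           uint64_t misc_enable)
{
    uint32_t ecx = userspace_ecx &
                   ~(CPUID_1_ECX_MONITOR | CPUID_1_ECX_OSXSAVE);

    /* OSXSAVE mirrors CR4.OSXSAVE. */
    if (cr4 & X86_CR4_OSXSAVE)
        ecx |= CPUID_1_ECX_OSXSAVE;

    /* MONITOR is visible only if advertised and enabled in MISC_ENABLE. */
    if ((userspace_ecx & CPUID_1_ECX_MONITOR) &&
        (misc_enable & MISC_ENABLE_MWAIT))
        ecx |= CPUID_1_ECX_MONITOR;

    return ecx;
}
```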

> What if the userspace passed the incorrect CPUID entries?  Should KVM sanitize
> those CPUID entries to ensure there's no insane configuration?  My concern is if
> we allow guest to be created with insane CPUID configurations, the guest can be
> confused and behave unexpectedly.

It is userspace's responsibility to provide a sane, correct setup.  The one
exception is that KVM rejects KVM_SET_CPUID{2} if userspace attempts to define an
unsupported virtual address width, the argument being that a malicious userspace
could attack KVM by coercing KVM into stuffing a non-canonical address into e.g. a
VMCS field.

The reason for KVM punting to userspace is that it's all but impossible to define
what is/isn't sane.  A really good example would be an alternative we (Google)
considered for the "smaller MAXPHYADDR" fiasco, the underlying problem being that
migrating a vCPU with MAXPHYADDR=46 to a system with MAXPHYADDR=52 will incorrectly
miss reserved bit #PFs.

Rather than teach KVM to try and deal with smaller MAXPHYADDRs, an idea we considered
was to instead enumerate guest.MAXPHYADDR=52 on platforms with host.MAXPHYADDR=46 in
anticipation of eventual migration.  So long as userspace doesn't actually enumerate
memslots in the illegal address space, KVM would be able to treat such accesses as
emulated MMIO, and would only need to intercept #PF(RSVD).

Circling back to "what's sane", enumerating guest.MAXPHYADDR > host.MAXPHYADDR
definitely qualifies as insane since it really can't work correctly, but in our
opinion it was far superior to running with allow_smaller_maxphyaddr=true.
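
To spell out the reserved bit problem in numbers, here is a small sketch that
computes which physical address bits of a PTE are reserved for a given
MAXPHYADDR.  It relies only on the architectural rule that PTE address bits
51:MAXPHYADDR are reserved; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask of PTE physical-address bits [51:maxphyaddr], which are reserved. */
static inline uint64_t pte_rsvd_pa_mask(unsigned int maxphyaddr)
{
    assert(maxphyaddr >= 1 && maxphyaddr <= 52);

    uint64_t below_52 = (1ULL << 52) - 1;
    uint64_t valid = (1ULL << maxphyaddr) - 1;

    return below_52 & ~valid;
}

/* True if the PTE sets an address bit that is reserved for this width. */
static inline bool pte_has_rsvd_pa_bits(uint64_t pte, unsigned int maxphyaddr)
{
    return (pte & pte_rsvd_pa_mask(maxphyaddr)) != 0;
}
```

A present PTE with address bit 48 set must take a reserved bit #PF on a vCPU
with MAXPHYADDR=46, but hardware with MAXPHYADDR=52 sees nothing wrong with it,
which is the missed fault described above.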

And sane is not the same thing as architecturally legal.  AMX is a good example
of this.  It's _technically_ legal to enumerate support for XFEATURE_TILE_CFG but
not XFEATURE_TILE_DATA in CPUID, but illegal to actually try to enable TILE_CFG
in XCR0 without also enabling TILE_DATA.  KVM should arguably reject CPUID configs
with TILE_CFG but not TILE_DATA, and vice versa, but then KVM is rejecting a 100%
architecturally valid, if insane, CPUID configuration.  Ditto for nearly all of
the VMX control bits versus their CPUID counterparts.
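
The AMX case can be sketched as two separate questions: "can this XCR0 value be
written?" and "does the CPUID enumeration make sense?".  The bit positions
(TILE_CFG is XCR0 bit 17, TILE_DATA is bit 18) are architectural; the macro and
helper names below are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define XFEATURE_MASK_TILE_CFG   (1ULL << 17)
#define XFEATURE_MASK_TILE_DATA  (1ULL << 18)
#define XFEATURE_MASK_TILE \
    (XFEATURE_MASK_TILE_CFG | XFEATURE_MASK_TILE_DATA)

/* XCR0 may only have both AMX bits set or both clear. */
static inline bool xcr0_amx_bits_valid(uint64_t xcr0)
{
    uint64_t tile = xcr0 & XFEATURE_MASK_TILE;

    return tile == 0 || tile == XFEATURE_MASK_TILE;
}

/*
 * A CPUID-enumerated XCR0 mask with exactly one AMX bit is architecturally
 * legal to enumerate, but the enumerated AMX bit can never be enabled.
 */
static inline bool amx_enumerated_but_unusable(uint64_t supported_xcr0)
{
    uint64_t tile = supported_xcr0 & XFEATURE_MASK_TILE;

    return tile != 0 && tile != XFEATURE_MASK_TILE;
}
```

Rejecting configurations for which the second helper returns true is exactly
the kind of "sanity" policy that is hard to justify as an architectural rule.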

And sometimes there are good reasons to run a VM with a truly insane configuration,
e.g. for testing purposes.

TL;DR: trying to enforce "sane" CPUID/feature configuration is a gigantic can of worms.
  
Zhi Wang April 6, 2023, 10:01 a.m. UTC | #4
On Wed, 5 Apr 2023 19:10:40 -0700
Sean Christopherson <seanjc@google.com> wrote:

> On Wed, Apr 05, 2023, Huang, Kai wrote:
> > On Tue, 2023-04-04 at 17:59 -0700, Sean Christopherson wrote:
> > > *** WARNING *** ABI breakage.
> > > 
> > > Stop adjusting the guest's CPUID info for the allowed XFRM (a.k.a. XCR0)
> > > for SGX enclaves.  Past me didn't understand the roles and responsibilities
> > > between userspace and KVM with respect to CPUID leafs, i.e. I thought I was
> > > being helpful by having KVM adjust the entries.
> > 
> > Actually I am not clear about this topic.
> > 
> > So the rule is KVM should never adjust CPUID entries passed from userspace?
> 
> Yes, except for true runtime entries where a CPUID leaf is dynamic based on other
> CPU state, e.g. CR4 bits, MISC_ENABLES in the MONITOR/MWAIT case, etc.
> 
> > What if the userspace passed the incorrect CPUID entries?  Should KVM sanitize
> > those CPUID entries to ensure there's no insane configuration?  My concern is if
> > we allow guest to be created with insane CPUID configurations, the guest can be
> > confused and behave unexpectedly.
> 
> It is userspace's responsibility to provide a sane, correct setup.  The one
> exception is that KVM rejects KVM_SET_CPUID{2} if userspace attempts to define an
> unsupported virtual address width, the argument being that a malicious userspace
> could attack KVM by coercing KVM into stuffing a non-canonical address into e.g. a
> VMCS field.
> 
> The reason for KVM punting to userspace is that it's all but impossible to define
> what is/isn't sane.  A really good example would be an alternative we (Google)
> considered for the "smaller MAXPHYADDR" fiasco, the underlying problem being that
> migrating a vCPU with MAXPHYADDR=46 to a system with MAXPHYADDR=52 will incorrectly
> miss reserved bit #PFs.
> 
> Rather than teach KVM to try and deal with smaller MAXPHYADDRs, an idea we considered
> was to instead enumerate guest.MAXPHYADDR=52 on platforms with host.MAXPHYADDR=46 in
> anticipation of eventual migration.  So long as userspace doesn't actually enumerate
> memslots in the illegal address space, KVM would be able to treat such accesses as
> emulated MMIO, and would only need to intercept #PF(RSVD).
> 
> Circling back to "what's sane", enumerating guest.MAXPHYADDR > host.MAXPHYADDR
> definitely qualifies as insane since it really can't work correctly, but in our
> opinion it was far superior to running with allow_smaller_maxphyaddr=true.
> 
> And sane is not the same thing as architecturally legal.  AMX is a good example
> of this.  It's _technically_ legal to enumerate support for XFEATURE_TILE_CFG but
> not XFEATURE_TILE_DATA in CPUID, but illegal to actually try to enable TILE_CFG
> in XCR0 without also enabling TILE_DATA.  KVM should arguably reject CPUID configs
> with TILE_CFG but not TILE_DATA, and vice versa, but then KVM is rejecting a 100%
> architecturally valid, if insane, CPUID configuration.  Ditto for nearly all of
> the VMX control bits versus their CPUID counterparts.
> 
> And sometimes there are good reasons to run a VM with a truly insane configuration,
> e.g. for testing purposes.
> 
> TL;DR: trying to enforce "sane" CPUID/feature configuration is a gigantic can of worms.

Interesting point. I was digging into the CPUID virtualization of TDX/SNP.
It would be nice to have a conclusion of what is "sane" and what is the
proper role for KVM, as firmware/TDX module is going to validate the "sane"
CPUID.

TDX/SNP requires the CPUID to be pre-configured and validated before creating
a CC guest. (It is done via TDH.MNG.INIT in TDX and inserting a CPUID page in
SNP_LAUNCH_UPDATE in SNP).

IIUC according to what you mentioned, KVM should be treated like a "CPUID box"
for QEMU, and the checks in KVM are only to ensure the requirements of a chosen
one are literally possible and correct. KVM should not care if the
combination, the usage of the chosen ones is insane or not, which gives
QEMU flexibility.

As the valid CPUIDs have been decided when creating a CC guest, what should be
the proper behavior (basically any new checks?) of KVM for the later
SET_CPUID2? My gut feeling is KVM should know the "CPUID box" is reduced
at least, because some KVM code paths rely on guest CPUID configuration.
  
Kai Huang April 12, 2023, 12:07 p.m. UTC | #5
On Thu, 2023-04-06 at 13:01 +0300, Zhi Wang wrote:
> On Wed, 5 Apr 2023 19:10:40 -0700
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > On Wed, Apr 05, 2023, Huang, Kai wrote:
> > > On Tue, 2023-04-04 at 17:59 -0700, Sean Christopherson wrote:
> > > > *** WARNING *** ABI breakage.
> > > > 
> > > > Stop adjusting the guest's CPUID info for the allowed XFRM (a.k.a. XCR0)
> > > > for SGX enclaves.  Past me didn't understand the roles and responsibilities
> > > > between userspace and KVM with respect to CPUID leafs, i.e. I thought I was
> > > > being helpful by having KVM adjust the entries.
> > > 
> > > Actually I am not clear about this topic.
> > > 
> > > So the rule is KVM should never adjust CPUID entries passed from userspace?
> > 
> > Yes, except for true runtime entries where a CPUID leaf is dynamic based on other
> > CPU state, e.g. CR4 bits, MISC_ENABLES in the MONITOR/MWAIT case, etc.
> > 
> > > What if the userspace passed the incorrect CPUID entries?  Should KVM sanitize
> > > those CPUID entries to ensure there's no insane configuration?  My concern is if
> > > we allow guest to be created with insane CPUID configurations, the guest can be
> > > confused and behave unexpectedly.
> > 
> > It is userspace's responsibility to provide a sane, correct setup.  The one
> > exception is that KVM rejects KVM_SET_CPUID{2} if userspace attempts to define an
> > unsupported virtual address width, the argument being that a malicious userspace
> > could attack KVM by coercing KVM into stuffing a non-canonical address into e.g. a
> > VMCS field.
> > 
> > The reason for KVM punting to userspace is that it's all but impossible to define
> > what is/isn't sane.  A really good example would be an alternative we (Google)
> > considered for the "smaller MAXPHYADDR" fiasco, the underlying problem being that
> > migrating a vCPU with MAXPHYADDR=46 to a system with MAXPHYADDR=52 will incorrectly
> > miss reserved bit #PFs.
> > 
> > Rather than teach KVM to try and deal with smaller MAXPHYADDRs, an idea we considered
> > was to instead enumerate guest.MAXPHYADDR=52 on platforms with host.MAXPHYADDR=46 in
> > anticipation of eventual migration.  So long as userspace doesn't actually enumerate
> > memslots in the illegal address space, KVM would be able to treat such accesses as
> > emulated MMIO, and would only need to intercept #PF(RSVD).
> > 
> > Circling back to "what's sane", enumerating guest.MAXPHYADDR > host.MAXPHYADDR
> > definitely qualifies as insane since it really can't work correctly, but in our
> > opinion it was far superior to running with allow_smaller_maxphyaddr=true.
> > 
> > And sane is not the same thing as architecturally legal.  AMX is a good example
> > of this.  It's _technically_ legal to enumerate support for XFEATURE_TILE_CFG but
> > not XFEATURE_TILE_DATA in CPUID, but illegal to actually try to enable TILE_CFG
> > in XCR0 without also enabling TILE_DATA.  KVM should arguably reject CPUID configs
> > with TILE_CFG but not TILE_DATA, and vice versa, but then KVM is rejecting a 100%
> > architecturally valid, if insane, CPUID configuration.  Ditto for nearly all of
> > the VMX control bits versus their CPUID counterparts.
> > 
> > And sometimes there are good reasons to run a VM with a truly insane configuration,
> > e.g. for testing purposes.
> > 
> > TL;DR: trying to enforce "sane" CPUID/feature configuration is a gigantic can of worms.
> 
> Interesting point. I was digging into the CPUID virtualization of TDX/SNP.
> It would be nice to have a conclusion of what is "sane" and what is the
> proper role for KVM, as firmware/TDX module is going to validate the "sane"
> CPUID.
> 
> TDX/SNP requires the CPUID to be pre-configured and validated before creating
> a CC guest. (It is done via TDH.MNG.INIT in TDX and inserting a CPUID page in
> SNP_LAUNCH_UPDATE in SNP).
> 
> IIUC according to what you mentioned, KVM should be treated like "CPUID box"
> for QEMU and the checks in KVM is only to ensure the requirements of a chosen
> one is literally possible and correct. KVM should not care if the
> combination, the usage of the chosen ones is insane or not, which gives
> QEMU flexibility.
> 
> As the valid CPUIDs have been decided when creating a CC guest, what should be
> the proper behavior (basically any new checks?) of KVM for the later
> SET_CPUID2? My gut feeling is KVM should know the "CPUID box" is reduced
> at least, because some KVM code paths rely on guest CPUID configuration.

For a TDX guest, my preference is for KVM to save all the CPUID entries passed
to TDH.MNG.INIT and manually make the vCPU's CPUID point to the saved CPUIDs.
KVM would then just ignore SET_CPUID2 for a TDX guest.

Not sure whether the AMD counterpart can be done in a similar way, though.
  
Kai Huang April 12, 2023, 12:15 p.m. UTC | #6
On Wed, 2023-04-05 at 19:10 -0700, Sean Christopherson wrote:
> On Wed, Apr 05, 2023, Huang, Kai wrote:
> > On Tue, 2023-04-04 at 17:59 -0700, Sean Christopherson wrote:
> > > *** WARNING *** ABI breakage.
> > > 
> > > Stop adjusting the guest's CPUID info for the allowed XFRM (a.k.a. XCR0)
> > > for SGX enclaves.  Past me didn't understand the roles and responsibilities
> > > between userspace and KVM with respect to CPUID leafs, i.e. I thought I was
> > > being helpful by having KVM adjust the entries.
> > 
> > Actually I am not clear about this topic.
> > 
> > So the rule is KVM should never adjust CPUID entries passed from userspace?
> 
> Yes, except for true runtime entries where a CPUID leaf is dynamic based on other
> CPU state, e.g. CR4 bits, MISC_ENABLES in the MONITOR/MWAIT case, etc.
> 
> > What if the userspace passed the incorrect CPUID entries?  Should KVM sanitize
> > those CPUID entries to ensure there's no insane configuration?  My concern is if
> > we allow guest to be created with insane CPUID configurations, the guest can be
> > confused and behave unexpectedly.
> 
> It is userspace's responsibility to provide a sane, correct setup.  The one
> exception is that KVM rejects KVM_SET_CPUID{2} if userspace attempts to define an
> unsupported virtual address width, the argument being that a malicious userspace
> could attack KVM by coercing KVM into stuffing a non-canonical address into e.g. a
> VMCS field.

Sorry, could you elaborate with an example of such an attack? :)

> 
> The reason for KVM punting to userspace is that it's all but impossible to define
> what is/isn't sane.  A really good example would be an alternative we (Google)
> considered for the "smaller MAXPHYADDR" fiasco, the underlying problem being that
> migrating a vCPU with MAXPHYADDR=46 to a system with MAXPHYADDR=52 will incorrectly
> miss reserved bit #PFs.
> 
> Rather than teach KVM to try and deal with smaller MAXPHYADDRs, an idea we considered
> was to instead enumerate guest.MAXPHYADDR=52 on platforms with host.MAXPHYADDR=46 in
> anticipation of eventual migration.  So long as userspace doesn't actually enumerate
> memslots in the illegal address space, KVM would be able to treat such accesses as
> emulated MMIO, and would only need to intercept #PF(RSVD).
> 
> Circling back to "what's sane", enumerating guest.MAXPHYADDR > host.MAXPHYADDR
> definitely qualifies as insane since it really can't work correctly, but in our
> opinion it was far superior to running with allow_smaller_maxphyaddr=true.

I guess everyone wants performance.

> 
> And sane is not the same thing as architecturally legal.  AMX is a good example
> of this.  It's _technically_ legal to enumerate support for XFEATURE_TILE_CFG but
> not XFEATURE_TILE_DATA in CPUID, but illegal to actually try to enable TILE_CFG
> in XCR0 without also enabling TILE_DATA.  KVM should arguably reject CPUID configs
> with TILE_CFG but not TILE_DATA, and vice versa, but then KVM is rejecting a 100%
> architecturally valid, if insane, CPUID configuration.  Ditto for nearly all of
> the VMX control bits versus their CPUID counterparts.
> 
> And sometimes there are good reasons to run a VM with a truly insane configuration,
> e.g. for testing purposes.
> 
> TL;DR: trying to enforce "sane" CPUID/feature configuration is a gigantic can of worms.

Agreed.  Thanks for the clarification.
  
Sean Christopherson April 12, 2023, 2:57 p.m. UTC | #7
On Wed, Apr 12, 2023, Kai Huang wrote:
> On Wed, 2023-04-05 at 19:10 -0700, Sean Christopherson wrote:
> > On Wed, Apr 05, 2023, Huang, Kai wrote:
> > > On Tue, 2023-04-04 at 17:59 -0700, Sean Christopherson wrote:
> > > > *** WARNING *** ABI breakage.
> > > > 
> > > > Stop adjusting the guest's CPUID info for the allowed XFRM (a.k.a. XCR0)
> > > > for SGX enclaves.  Past me didn't understand the roles and responsibilities
> > > > between userspace and KVM with respect to CPUID leafs, i.e. I thought I was
> > > > being helpful by having KVM adjust the entries.
> > > 
> > > Actually I am not clear about this topic.
> > > 
> > > So the rule is KVM should never adjust CPUID entries passed from userspace?
> > 
> > Yes, except for true runtime entries where a CPUID leaf is dynamic based on other
> > CPU state, e.g. CR4 bits, MISC_ENABLES in the MONITOR/MWAIT case, etc.
> > 
> > > What if the userspace passed the incorrect CPUID entries?  Should KVM sanitize
> > > those CPUID entries to ensure there's no insane configuration?  My concern is if
> > > we allow guest to be created with insane CPUID configurations, the guest can be
> > > confused and behave unexpectedly.
> > 
> > It is userspace's responsibility to provide a sane, correct setup.  The one
> > exception is that KVM rejects KVM_SET_CPUID{2} if userspace attempts to define an
> > unsupported virtual address width, the argument being that a malicious userspace
> > could attack KVM by coercing KVM into stuffing a non-canonical address into e.g. a
> > VMCS field.
> 
> Sorry could you elaborate an example of such attack? :)

Hrm, I was going to say that userspace could shove a noncanonical address in
MSR_FS/GS_BASE and trigger an unexpected VM-Fail (VMX) or ??? behavior on VMLOAD
(I don't think SVM consistency checks FS/GS.base).  But is_noncanonical_address()
queries CR4.LA57, not the address width from CPUID.0x80000008, which makes sense
because enumerating 57 bits of virtual address space on a CPU without LA57 would
also allow shoving a bad value into hardware.

So even that example is bogus, i.e. commit dd598091de4a ("KVM: x86: Warn if guest
virtual address space is not 48-bits") really shouldn't have gone in.
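
For reference, a canonical address check keyed off CR4.LA57 rather than CPUID
might look like the sketch below.  CR4.LA57 being bit 12 is architectural; the
helper names are hypothetical and this is not a copy of KVM's
is_noncanonical_address().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_LA57  (1ULL << 12)

/* Virtual address width implied by the paging mode, not by CPUID. */
static inline unsigned int vaddr_bits_from_cr4(uint64_t cr4)
{
    return (cr4 & X86_CR4_LA57) ? 57 : 48;
}

/*
 * An address is canonical if bits 63:(bits - 1) are all zeros or all
 * ones, i.e. the upper bits are a sign extension of bit (bits - 1).
 */
static inline bool is_canonical(uint64_t addr, unsigned int bits)
{
    assert(bits >= 2 && bits <= 64);

    uint64_t upper = addr >> (bits - 1);
    uint64_t all_ones = UINT64_MAX >> (bits - 1);

    return upper == 0 || upper == all_ones;
}
```

An address such as 0x0000800000000000 is canonical with 57-bit addressing but
not with 48-bit addressing, so the width used for the check has to match what
hardware will actually enforce.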

> > The reason for KVM punting to userspace is that it's all but impossible to define
> > what is/isn't sane.  A really good example would be an alternative we (Google)
> > considered for the "smaller MAXPHYADDR" fiasco, the underlying problem being that
> > migrating a vCPU with MAXPHYADDR=46 to a system with MAXPHYADDR=52 will incorrectly
> > miss reserved bit #PFs.
> > 
> > Rather than teach KVM to try and deal with smaller MAXPHYADDRs, an idea we considered
> > was to instead enumerate guest.MAXPHYADDR=52 on platforms with host.MAXPHYADDR=46 in
> > anticipation of eventual migration.  So long as userspace doesn't actually enumerate
> > memslots in the illegal address space, KVM would be able to treat such accesses as
> > emulated MMIO, and would only need to intercept #PF(RSVD).
> > 
> > Circling back to "what's sane", enumerating guest.MAXPHYADDR > host.MAXPHYADDR
> > definitely qualifies as insane since it really can't work correctly, but in our
> > opinion it was far superior to running with allow_smaller_maxphyaddr=true.
> 
> I guess everyone wants performance.

Performance was a secondary concern, functional correctness was the main issue.
We were concerned that KVM would end up terminating healthy/sane guests due to
KVM's emulator being incomplete, i.e. if KVM failed to emulate an instruction in
the EPT violation handler when GPA > guest.MAXPHYADDR.  That, and SVM sets the
Accessed bit in the guest PTE before the NPT exit, i.e. KVM can't emulate a
smaller guest.MAXPHYADDR without creating an architectural violation from the
guest's perspective (a PTE with reserved bits should never set A/D bits).
  
Sean Christopherson April 12, 2023, 3:22 p.m. UTC | #8
On Wed, Apr 12, 2023, Kai Huang wrote:
> On Thu, 2023-04-06 at 13:01 +0300, Zhi Wang wrote:
> > On Wed, 5 Apr 2023 19:10:40 -0700
> > Sean Christopherson <seanjc@google.com> wrote:
> > > TL;DR: trying to enforce "sane" CPUID/feature configuration is a gigantic can of worms.
> > 
> > Interesting point. I was digging into the CPUID virtualization of TDX/SNP.
> > It would be nice to have a conclusion of what is "sane" and what is the
> > proper role for KVM, as firmware/TDX module is going to validate the "sane"
> > CPUID.
> > 
> > TDX/SNP requires the CPUID to be pre-configured and validated before creating
> > a CC guest. (It is done via TDH.MNG.INIT in TDX and inserting a CPUID page in
> > SNP_LAUNCH_UPDATE in SNP).
> > 
> > IIUC according to what you mentioned, KVM should be treated like "CPUID box"
> > for QEMU and the checks in KVM is only to ensure the requirements of a chosen
> > one is literally possible and correct. KVM should not care if the
> > combination, the usage of the chosen ones is insane or not, which gives
> > QEMU flexibility.
> > 
> > As the valid CPUIDs have been decided when creating a CC guest, what should be
> > the proper behavior (basically any new checks?) of KVM for the later
> > SET_CPUID2? My gut feeling is KVM should know the "CPUID box" is reduced
> > at least, because some KVM code paths rely on guest CPUID configuration.
> 
> For TDX guest my preference is KVM to save all CPUID entries in TDH.MNG.INIT and
> manually make vcpu's CPUID point to the saved CPUIDs.  And then KVM just ignore
> the SET_CPUID2 for TDX guest.

It's been a long while since I looked at TDX's CPUID management, but IIRC ignoring
SET_CPUID2 is not an option because TDH.MNG.INIT only allows leafs that are
known to the TDX Module, e.g. KVM's paravirt CPUID leafs can't be communicated via
TDH.MNG.INIT.  KVM's uAPI for initiating TDH.MNG.INIT could obviously filter out
unsupported leafs, but doing so would lead to potential ABI breaks, e.g. if a leaf
that KVM filters out becomes known to the TDX Module, then upgrading the TDX Module
could result in previously allowed input becoming invalid.

Even if that weren't the case, ignoring KVM_SET_CPUID{2} would be a bad option
because it doesn't allow KVM to open behavior in the future, i.e. ignoring the
leaf would effectively make _everything_ valid input.  If KVM were to rely solely
on TDH.MNG.INIT, then KVM would want to completely disallow KVM_SET_CPUID{2}.

Back to Zhi's question, the best thing to do for TDX and SNP is likely to require
that overlap between KVM_SET_CPUID{2} and the "trusted" CPUID be consistent.  The
key difference is that KVM would be enforcing consistency, not sanity.  I.e. KVM
isn't making arbitrary decisions on what is/isn't sane, KVM is simply requiring
that userspace provide a CPUID model that's consistent with what userspace provided
earlier.
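
A rough sketch of "enforce consistency, not sanity" is below.  Everything in it
is hypothetical (the struct layout, the helper names, and the choice to compare
all four registers exactly); it only shows the shape of the check: leafs that
appear in both sets must match, and leafs that only userspace provides (e.g.
KVM's paravirt leafs) are left alone.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified CPUID entry. */
struct cpuid_leaf {
    uint32_t function;
    uint32_t index;
    uint32_t eax, ebx, ecx, edx;
};

static inline const struct cpuid_leaf *
find_leaf(const struct cpuid_leaf *tbl, size_t n,
          uint32_t function, uint32_t index)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].function == function && tbl[i].index == index)
            return &tbl[i];
    }
    return NULL;
}

/*
 * Return true if every leaf present in both the "trusted" set and the set
 * passed later by userspace has identical register values.
 */
static inline bool
cpuid_consistent_with_trusted(const struct cpuid_leaf *trusted, size_t nt,
                              const struct cpuid_leaf *user, size_t nu)
{
    for (size_t i = 0; i < nu; i++) {
        const struct cpuid_leaf *t =
            find_leaf(trusted, nt, user[i].function, user[i].index);

        if (!t)
            continue;   /* no overlap, nothing to compare */

        if (t->eax != user[i].eax || t->ebx != user[i].ebx ||
            t->ecx != user[i].ecx || t->edx != user[i].edx)
            return false;
    }
    return true;
}
```

Note the check never asks whether a value is sane, only whether it matches what
userspace said earlier.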
  
Kai Huang April 13, 2023, 12:20 a.m. UTC | #9
On Wed, 2023-04-12 at 08:22 -0700, Sean Christopherson wrote:
> On Wed, Apr 12, 2023, Kai Huang wrote:
> > On Thu, 2023-04-06 at 13:01 +0300, Zhi Wang wrote:
> > > On Wed, 5 Apr 2023 19:10:40 -0700
> > > Sean Christopherson <seanjc@google.com> wrote:
> > > > TL;DR: trying to enforce "sane" CPUID/feature configuration is a gigantic can of worms.
> > > 
> > > Interesting point. I was digging into the CPUID virtualization of TDX/SNP.
> > > It would be nice to have a conclusion of what is "sane" and what is the
> > > proper role for KVM, as firmware/TDX module is going to validate the "sane"
> > > CPUID.
> > > 
> > > TDX/SNP requires the CPUID to be pre-configured and validated before creating
> > > a CC guest. (It is done via TDH.MNG.INIT in TDX and inserting a CPUID page in
> > > SNP_LAUNCH_UPDATE in SNP).
> > > 
> > > IIUC according to what you mentioned, KVM should be treated like "CPUID box"
> > > for QEMU and the checks in KVM is only to ensure the requirements of a chosen
> > > one is literally possible and correct. KVM should not care if the
> > > combination, the usage of the chosen ones is insane or not, which gives
> > > QEMU flexibility.
> > > 
> > > As the valid CPUIDs have been decided when creating a CC guest, what should be
> > > the proper behavior (basically any new checks?) of KVM for the later
> > > SET_CPUID2? My gut feeling is KVM should know the "CPUID box" is reduced
> > > at least, because some KVM code paths rely on guest CPUID configuration.
> > 
> > For TDX guest my preference is KVM to save all CPUID entries in TDH.MNG.INIT and
> > manually make vcpu's CPUID point to the saved CPUIDs.  And then KVM just ignore
> > the SET_CPUID2 for TDX guest.
> 
> It's been a long while since I looked at TDX's CPUID management, but IIRC ignoring
> SET_CPUID2 is not an option because TDH.MNG.INIT only allows leafs that are
> known to the TDX Module, e.g. KVM's paravirt CPUID leafs can't be communicated via
> TDH.MNG.INIT.  
> 

Oh yes.  I forgot this.

> KVM's uAPI for initiating TDH.MNG.INIT could obviously filter out
> unsupported leafs, but doing so would lead to potential ABI breaks, e.g. if a leaf
> that KVM filters out becomes known to the TDX Module, then upgrading the TDX Module
> could result in previously allowed input becoming invalid.

How about only filtering out PV related CPUIDs when applying CPUIDs to
TDH.MNG.INIT?  I think we can assume they are not gonna be known to TDX module
anyway.

> 
> Even if that weren't the case, ignoring KVM_SET_CPUID{2} would be a bad option
> because it doesn't allow KVM to open behavior in the future, i.e. ignoring the
> leaf would effectively make _everything_ valid input.  If KVM were to rely solely
> on TDH.MNG.INIT, then KVM would want to completely disallow KVM_SET_CPUID{2}.

Right.  Disallowing SET_CPUID{2} probably is better, as it gives userspace a
more concrete result.  

> 
> Back to Zhi's question, the best thing to do for TDX and SNP is likely to require
> that overlap between KVM_SET_CPUID{2} and the "trusted" CPUID be consistent.  The
> key difference is that KVM would be enforcing consistency, not sanity.  I.e. KVM
> isn't making arbitrary decisions on what is/isn't sane, KVM is simply requiring
> that userspace provide a CPUID model that's consistent with what userspace provided
> earlier.

So IIUC, you prefer verifying that the CPUIDs in SET_CPUID{2} are a superset of
the CPUIDs provided in TDH.MNG.INIT?  And KVM manually verifies that all CPUIDs
for all vcpus are consistent (the same) in SET_CPUID{2}?

This looks over-complicated, _if_ the "only filtering out PV related CPUIDs
when applying CPUIDs to TDH.MNG.INIT" approach works.
  
Zhi Wang April 13, 2023, 6:07 a.m. UTC | #10
On Wed, 12 Apr 2023 12:07:13 +0000
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Thu, 2023-04-06 at 13:01 +0300, Zhi Wang wrote:
> > On Wed, 5 Apr 2023 19:10:40 -0700
> > Sean Christopherson <seanjc@google.com> wrote:
> > 
> > > On Wed, Apr 05, 2023, Huang, Kai wrote:
> > > > On Tue, 2023-04-04 at 17:59 -0700, Sean Christopherson wrote:
> > > > > *** WARNING *** ABI breakage.
> > > > > 
> > > > > Stop adjusting the guest's CPUID info for the allowed XFRM (a.k.a. XCR0)
> > > > > for SGX enclaves.  Past me didn't understand the roles and responsibilities
> > > > > between userspace and KVM with respect to CPUID leafs, i.e. I thought I was
> > > > > being helpful by having KVM adjust the entries.
> > > > 
> > > > Actually I am not clear about this topic.
> > > > 
> > > > So the rule is KVM should never adjust CPUID entries passed from userspace?
> > > 
> > > Yes, except for true runtime entries where a CPUID leaf is dynamic based on other
> > > CPU state, e.g. CR4 bits, MISC_ENABLES in the MONITOR/MWAIT case, etc.
> > > 
> > > > What if the userspace passed the incorrect CPUID entries?  Should KVM sanitize
> > > > those CPUID entries to ensure there's no insane configuration?  My concern is if
> > > > we allow guest to be created with insane CPUID configurations, the guest can be
> > > > confused and behaviour unexpectedly.
> > > 
> > > It is userspace's responsibility to provide a sane, correct setup.  The one
> > > exception is that KVM rejects KVM_SET_CPUID{2} if userspace attempts to define an
> > > unsupported virtual address width, the argument being that a malicious userspace
> > > could attack KVM by coercing KVM into stuff a non-canonical address into e.g. a
> > > VMCS field.
> > > 
> > > The reason for KVM punting to userspace is that it's all but impossible to define
> > > what is/isn't sane.  A really good example would be an alternative we (Google)
> > > considered for the "smaller MAXPHYADDR" fiasco, the underlying problem being that
> > > migrating a vCPU with MAXPHYADDR=46 to a system with MAXPHYADDR=52 will incorrectly
> > > miss reserved bit #PFs.
> > > 
> > > Rather than teach KVM to try and deal with smaller MAXPHYADDRs, an idea we considered
> > > was to instead enumerate guest.MAXPHYADDR=52 on platforms with host.MAXPHYADDR=46 in
> > > anticipation of eventual migration.  So long as userspace doesn't actually enumerate
> > > memslots in the illegal address space, KVM would be able to treat such accesses as
> > > emulated MMIO, and would only need to intercept #PF(RSVD).
> > > 
> > > Circling back to "what's sane", enumerating guest.MAXPHYADDR > host.MAXPHYADDR
> > > definitely qualifies as insane since it really can't work correctly, but in our
> > > opinion it was far superior to running with allow_smaller_maxphyaddr=true.
> > > 
> > > And sane is not the same thing as architecturally legal.  AMX is a good example
> > > of this.  It's _technically_ legal to enumerate support for XFEATURE_TILE_CFG but
> > > not XFEATURE_TILE_DATA in CPUID, but illegal to actually try to enable TILE_CFG
> > > in XCR0 without also enabling TILE_DATA.  KVM should arguably reject CPUID configs
> > > with TILE_CFG but not TILE_DATA, and vice versa, but then KVM is rejecting a 100%
> > > architecturally valid, if insane, CPUID configuration.  Ditto for nearly all of
> > > the VMX control bits versus their CPUID counterparts.
> > > 
> > > And sometimes there are good reasons to run a VM with a truly insane configuration,
> > > e.g. for testing purposes.
> > > 
> > > TL;DR: trying to enforce "sane" CPUID/feature configuration is a gigantic can of worms.
> > 
> > Interesting point. I was digging into the CPUID virtualization of TDX/SNP.
> > It would be nice to have a conclusion of what is "sane" and what is the
> > proper role for KVM, as firmware/TDX module is going to validate the "sane"
> > CPUID.
> > 
> > TDX/SNP requires the CPUID to be pre-configured and validated before creating
> > a CC guest. (It is done via TDH.MNG.INIT in TDX and inserting a CPUID page in
> > SNP_LAUNCH_UPDATE in SNP).
> > 
> > IIUC according to what you mentioned, KVM should be treated like "CPUID box"
> > for QEMU and the checks in KVM is only to ensure the requirements of a chosen
> > one is literally possible and correct. KVM should not care if the
> > combination, the usage of the chosen ones is insane or not, which gives
> > QEMU flexibility.
> > 
> > As the valid CPUIDs have been decided when creating a CC guest, what should be
> > the proper behavior (basically any new checks?) of KVM for the later
> > SET_CPUID2? My gut feeling is KVM should know the "CPUID box" is reduced
> > at least, because some KVM code paths rely on guest CPUID configuration.
> 
> For TDX guest my preference is KVM to save all CPUID entries in TDH.MNG.INIT and
> manually make vcpu's CPUID point to the saved CPUIDs.  And then KVM just ignore
> the SET_CPUID2 for TDX guest.
> 
> Not sure whether AMD counterpart can be done in similar way though. 

I took a look at the AMD SNP kernel[1]. It supports both host-managed and
firmware-managed CPUID. The host-managed CPUID is done via a GHCB message call,
which may be removed in the future according to the SNP firmware ABI
spec[2]:

7.1 CPUID Reporting
Note: This guest message may be removed in future versions as it is redundant with the CPUID page in SNP_LAUNCH_UPDATE. (See Section 8.17.)

So the style of CPUID virtualization of TDX and SNP will be aligned eventually.
Both will configure the supported CPUID for the firmware/TDX module before
creating a vCPU. 

[1] https://github.com/AMDESE/linux/blob/upmv10-host-snp-v8-rfc/arch/x86/kvm/svm/sev.c
[2] https://www.amd.com/system/files/TechDocs/56860.pdf
  
Sean Christopherson April 13, 2023, 10:48 p.m. UTC | #11
On Thu, Apr 13, 2023, Kai Huang wrote:
> On Wed, 2023-04-12 at 08:22 -0700, Sean Christopherson wrote:
> > KVM's uAPI for initiating TDH.MNG.INIT could obviously filter out
> > unsupported leafs, but doing so would lead to potential ABI breaks, e.g. if a leaf
> > that KVM filters out becomes known to the TDX Module, then upgrading the TDX Module
> > could result in previously allowed input becoming invalid.
> 
> How about only filtering out PV related CPUIDs when applying CPUIDs to
> TDH.MNG.INIT?  I think we can assume they are not gonna be known to TDX module
> anyway.

Nope, not going down that road.  Fool me once[*], shame on you.  Fool me twice,
shame on me :-)

Objections to hardware vendors defining PV interfaces aside, there exist leafs
that are neither PV related nor known to the TDX module, e.g. Centaur leafs.  I
think it's extremely unlikely (understatement) that anyone will want to expose
Centaur leafs to a TDX guest, but again I want to stay out of the business of
telling userspace what is and isn't sane CPUID models.

[*] https://lore.kernel.org/all/20221210160046.2608762-6-chen.zhang@intel.com
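
To illustrate why a "filter out only PV leafs" rule does not cover every leaf
the TDX module doesn't know about, here is a minimal sketch that buckets CPUID
functions by range. The enum, the function names and the idea of a
"PV-only filter" helper are hypothetical and exist only for this example; the
ranges are the conventional ones (hypervisor/PV leafs at 0x40000000, extended
leafs at 0x80000000, Centaur leafs at 0xC0000000).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification, for illustration only. */
enum cpuid_leaf_class {
	CPUID_CLASS_BASIC,	/* 0x00000000 range */
	CPUID_CLASS_HYPERVISOR,	/* 0x40000000 range, used for PV interfaces */
	CPUID_CLASS_EXTENDED,	/* 0x80000000 range */
	CPUID_CLASS_CENTAUR,	/* 0xC0000000 range */
};

static enum cpuid_leaf_class classify_cpuid_leaf(uint32_t function)
{
	if (function >= 0x40000000u && function <= 0x4fffffffu)
		return CPUID_CLASS_HYPERVISOR;
	if (function >= 0xc0000000u)
		return CPUID_CLASS_CENTAUR;
	if (function >= 0x80000000u)
		return CPUID_CLASS_EXTENDED;
	return CPUID_CLASS_BASIC;
}

/*
 * Hypothetical "drop PV leafs only" filter: returns true if the leaf would
 * still be handed to TDH.MNG.INIT.  Centaur leafs pass this filter even
 * though they are not PV related.
 */
static bool survives_pv_only_filter(uint32_t function)
{
	return classify_cpuid_leaf(function) != CPUID_CLASS_HYPERVISOR;
}
```

With this sketch, a Centaur leaf such as 0xC0000000 survives the PV-only
filter, so the filter would have to keep growing special cases.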

> > Even if that weren't the case, ignoring KVM_SET_CPUID{2} would be a bad option
> > because it doesn't allow KVM to open behavior in the future, i.e. ignoring the
> > leaf would effectively make _everything_ valid input.  If KVM were to rely solely
> > on TDH.MNG.INIT, then KVM would want to completely disallow KVM_SET_CPUID{2}.
> 
> Right.  Disallowing SET_CPUID{2} probably is better, as it gives userspace a
> more concrete result.  
> 
> > 
> > Back to Zhi's question, the best thing to do for TDX and SNP is likely to require
> > that overlap between KVM_SET_CPUID{2} and the "trusted" CPUID be consistent.  The
> > key difference is that KVM would be enforcing consistency, not sanity.  I.e. KVM
> > isn't making arbitrary decisions on what is/isn't sane, KVM is simply requiring
> > that userspace provide a CPUID model that's consistent with what userspace provided
> > earlier.
> 
> So IIUC, you prefer to verifying the CPUIDs in SET_CPUID{2} are a super set of
> the CPUIDs provided in TDH.MNG.INIT?  And KVM manually verifies all CPUIDs for
> all vcpus are consistent (the same) in SET_CPUID{2}?

Yes, except KVM doesn't need to verify vCPUs are consistent with respect to each
other, just that each vCPU is consistent with respect to what was reported to the
TDX Module.

> Looks this is over-complicated, _if_ the "only filtering out PV related CPUIDs
> when applying CPUIDs to TDH.MNG.INIT" approach works. 

It's not complicated at all.  Walk through the leafs defined during TDH.MNG.INIT,
reject KVM_SET_CPUID if a leaf isn't present or doesn't match exactly.  Or has
the TDX spec changed and it's no longer that simple?
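
As a rough sketch of that walk, assuming a simplified leaf representation:
`struct cpuid_leaf`, `find_leaf()` and `check_cpuid_consistency()` are made-up
names for this example, not KVM's actual structures or code. Leafs that exist
only in the KVM_SET_CPUID{2} table (e.g. PV leafs) are allowed; every leaf
given to TDH.MNG.INIT must be present and identical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a CPUID entry; not KVM's struct. */
struct cpuid_leaf {
	uint32_t function;
	uint32_t index;
	uint32_t eax, ebx, ecx, edx;
};

static const struct cpuid_leaf *find_leaf(const struct cpuid_leaf *leafs,
					  size_t nr, uint32_t function,
					  uint32_t index)
{
	for (size_t i = 0; i < nr; i++) {
		if (leafs[i].function == function && leafs[i].index == index)
			return &leafs[i];
	}
	return NULL;
}

/*
 * Return 0 if every "trusted" leaf (as given to TDH.MNG.INIT) is present in
 * the userspace table with identical register values, -EINVAL otherwise.
 * Extra leafs in the userspace table are not checked.
 */
static int check_cpuid_consistency(const struct cpuid_leaf *trusted,
				   size_t nr_trusted,
				   const struct cpuid_leaf *user,
				   size_t nr_user)
{
	for (size_t i = 0; i < nr_trusted; i++) {
		const struct cpuid_leaf *t = &trusted[i];
		const struct cpuid_leaf *u = find_leaf(user, nr_user,
						       t->function, t->index);

		if (!u || u->eax != t->eax || u->ebx != t->ebx ||
		    u->ecx != t->ecx || u->edx != t->edx)
			return -EINVAL;
	}
	return 0;
}
```

The check is per vCPU against the trusted table only, so nothing here compares
one vCPU's CPUID model against another's.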
  
Kai Huang April 14, 2023, 1:42 p.m. UTC | #12
On Thu, 2023-04-13 at 15:48 -0700, Sean Christopherson wrote:
> On Thu, Apr 13, 2023, Kai Huang wrote:
> > On Wed, 2023-04-12 at 08:22 -0700, Sean Christopherson wrote:
> > > KVM's uAPI for initiating TDH.MNG.INIT could obviously filter out
> > > unsupported leafs, but doing so would lead to potential ABI breaks, e.g. if a leaf
> > > that KVM filters out becomes known to the TDX Module, then upgrading the TDX Module
> > > could result in previously allowed input becoming invalid.
> > 
> > How about only filtering out PV related CPUIDs when applying CPUIDs to
> > TDH.MNG.INIT?  I think we can assume they are not gonna be known to TDX module
> > anyway.
> 
> Nope, not going down that road.  Fool me once[*], shame on you.  Fool me twice,
> shame on me :-)

Ah OK :)

> 
> Objections to hardware vendors defining PV interfaces aside, there exist leafs
> that are neither PV related nor known to the TDX module, e.g. Centaur leafs.  I
> think it's extremely unlikely (understatement) that anyone will want to expose
> Centaur leafs to a TDX guest, but again I want to stay out of the business of
> telling userspace what is and isn't sane CPUID models.

Right.  There might be a use case where a TDX guest wants to use some CPUID leaf
that isn't handled by the TDX module but purely by KVM.  We don't want to rule
out that possibility.  Totally agree.

> 
> [*] https://lore.kernel.org/all/20221210160046.2608762-6-chen.zhang@intel.com
> 
> > > Even if that weren't the case, ignoring KVM_SET_CPUID{2} would be a bad option
> > > because it doesn't allow KVM to open behavior in the future, i.e. ignoring the
> > > leaf would effectively make _everything_ valid input.  If KVM were to rely solely
> > > on TDH.MNG.INIT, then KVM would want to completely disallow KVM_SET_CPUID{2}.
> > 
> > Right.  Disallowing SET_CPUID{2} probably is better, as it gives userspace a
> > more concrete result.  
> > 
> > > 
> > > Back to Zhi's question, the best thing to do for TDX and SNP is likely to require
> > > that overlap between KVM_SET_CPUID{2} and the "trusted" CPUID be consistent.  The
> > > key difference is that KVM would be enforcing consistency, not sanity.  I.e. KVM
> > > isn't making arbitrary decisions on what is/isn't sane, KVM is simply requiring
> > > that userspace provide a CPUID model that's consistent with what userspace provided
> > > earlier.
> > 
> > So IIUC, you prefer to verifying the CPUIDs in SET_CPUID{2} are a super set of
> > the CPUIDs provided in TDH.MNG.INIT?  And KVM manually verifies all CPUIDs for
> > all vcpus are consistent (the same) in SET_CPUID{2}?
> 
> Yes, except KVM doesn't need to verify vCPUs are consistent with respect to each
> other, just that each vCPU is consistent with respect to what was reported to the
> TDX Module.

OK.  Fine with me.

> 
> > Looks this is over-complicated, _if_ the "only filtering out PV related CPUIDs
> > when applying CPUIDs to TDH.MNG.INIT" approach works. 
> 
> It's not complicated at all.  Walk through the leafs defined during TDH.MNG.INIT,
> reject KVM_SET_CPUID if a leaf isn't present or doesn't match exactly.  Or has
> the TDX spec changed and it's no longer that simple?

No, the module hasn't changed, and yes, it should be as simple as you said.
My first impression was that handling CPUID in one IOCTL (the one that
initiates TDH.MNG.INIT) should be simpler than handling CPUID in two IOCTLs,
but I guess this might not be true :)

Anyway I agree with your suggestion.  Thanks.
  
Zhi Wang April 16, 2023, 6:36 a.m. UTC | #13
On Fri, 14 Apr 2023 13:42:11 +0000
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Thu, 2023-04-13 at 15:48 -0700, Sean Christopherson wrote:
> > On Thu, Apr 13, 2023, Kai Huang wrote:
> > > On Wed, 2023-04-12 at 08:22 -0700, Sean Christopherson wrote:
> > > > KVM's uAPI for initiating TDH.MNG.INIT could obviously filter out
> > > > unsupported leafs, but doing so would lead to potential ABI breaks, e.g. if a leaf
> > > > that KVM filters out becomes known to the TDX Module, then upgrading the TDX Module
> > > > could result in previously allowed input becoming invalid.
> > > 
> > > How about only filtering out PV related CPUIDs when applying CPUIDs to
> > > TDH.MNG.INIT?  I think we can assume they are not gonna be known to TDX module
> > > anyway.
> > 
> > Nope, not going down that road.  Fool me once[*], shame on you.  Fool me twice,
> > shame on me :-)
> 
> Ah OK :)
> 
> > 
> > Objections to hardware vendors defining PV interfaces aside, there exist leafs
> > that are neither PV related nor known to the TDX module, e.g. Centaur leafs.  I
> > think it's extremely unlikely (understatement) that anyone will want to expose
> > Centaur leafs to a TDX guest, but again I want to stay out of the business of
> > telling userspace what is and isn't sane CPUID models.
> 
> Right.  There might be use case that TDX guest wants to use some CPUID which
> isn't handled by the TDX module but purely by KVM.  We don't want to limit the
> possibility.  Totally agree.
> 
> > 
> > [*] https://lore.kernel.org/all/20221210160046.2608762-6-chen.zhang@intel.com
> > 
> > > > Even if that weren't the case, ignoring KVM_SET_CPUID{2} would be a bad option
> > > > because it doesn't allow KVM to open behavior in the future, i.e. ignoring the
> > > > leaf would effectively make _everything_ valid input.  If KVM were to rely solely
> > > > on TDH.MNG.INIT, then KVM would want to completely disallow KVM_SET_CPUID{2}.
> > > 
> > > Right.  Disallowing SET_CPUID{2} probably is better, as it gives userspace a
> > > more concrete result.  
> > > 
> > > > 
> > > > Back to Zhi's question, the best thing to do for TDX and SNP is likely to require
> > > > that overlap between KVM_SET_CPUID{2} and the "trusted" CPUID be consistent.  The
> > > > key difference is that KVM would be enforcing consistency, not sanity.  I.e. KVM
> > > > isn't making arbitrary decisions on what is/isn't sane, KVM is simply requiring
> > > > that userspace provide a CPUID model that's consistent with what userspace provided
> > > > earlier.
> > > 
> > > So IIUC, you prefer to verifying the CPUIDs in SET_CPUID{2} are a super set of
> > > the CPUIDs provided in TDH.MNG.INIT?  And KVM manually verifies all CPUIDs for
> > > all vcpus are consistent (the same) in SET_CPUID{2}?
> > 
> > Yes, except KVM doesn't need to verify vCPUs are consistent with respect to each
> > other, just that each vCPU is consistent with respect to what was reported to the
> > TDX Module.
> 
> OK.  Fine to me.

I did some investigation and I think this approach would work on both TDX
and SNP, as both of them let a CC guest handle CPUID leafs the firmware isn't
aware of (e.g. KVM paravirt CPUIDs) in #VE or #VC. We can also factor out and
re-use the "checking-CPUID-is-equal" logic in KVM_SET_CPUID{2}. But I think TDX
needs to filter out the firmware-unaware CPUIDs in TDH.MNG.INIT to pass the
check? (SNP firmware can adjust them automatically.) I have included some
details I found below in case you are interested in digging.

For TDX, KVM provides a CPUID table in TDH.MNG.INIT, and there are two policies
for the subsequent CPUID virtualization: 1) The TDX module handles the CPUID
interception from a TD guest and emulates it according to the CPUID table in
TDH.MNG.INIT. If the TDX module doesn't know this CPUID, #VE is injected. 2) A
TD guest can request to handle CPUID by itself by calling TDG.VP.CPUIDVE.SET.
A CPUID TD exit will then be forwarded to the guest as #VE. The code snippet
of the TDX module handling a TD CPUID exit can be found here[1].

For SNP, userspace provides a CPUID table in SNP_LAUNCH_UPDATE with
PAGE_TYPE_CPUID. The PSP checks and validates the CPUID in the table. The table
becomes part of the SNP metadata secrets and is passed to the guest later. A
guest can refer to the validated CPUID table when handling CPUID #VC, but can
also handle CPUIDs not in the table[2] (e.g. paravirt CPUID).
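
To make the two TDX policies above concrete, here is a small model of the
decision as described in this mail. It is only a sketch of that description,
not the TDX module's code; `struct td_cpuid_state`, `guest_wants_ve` and
`td_cpuid_exit_action()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum cpuid_exit_action {
	CPUID_EMULATE_FROM_TABLE,	/* answered from the TDH.MNG.INIT table */
	CPUID_INJECT_VE,		/* forwarded to the guest as #VE */
};

/* Hypothetical per-TD state, for illustration only. */
struct td_cpuid_state {
	const uint32_t *known_leafs;	/* leafs configured via TDH.MNG.INIT */
	size_t nr_known;
	bool guest_wants_ve;		/* guest opted in to handling CPUID itself */
};

static enum cpuid_exit_action
td_cpuid_exit_action(const struct td_cpuid_state *td, uint32_t function)
{
	/* Policy 2: the guest asked to handle every CPUID via #VE. */
	if (td->guest_wants_ve)
		return CPUID_INJECT_VE;

	/* Policy 1: emulate known leafs from the table. */
	for (size_t i = 0; i < td->nr_known; i++) {
		if (td->known_leafs[i] == function)
			return CPUID_EMULATE_FROM_TABLE;
	}

	/* Unknown leaf (e.g. a KVM paravirt leaf): the guest gets #VE. */
	return CPUID_INJECT_VE;
}
```

In both policies a leaf that is not in the table ends up in the guest's #VE
handler, which is what allows paravirt leafs to be served outside the table.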


[1] https://downloadmirror.intel.com/738876/tdx-module-v1.0.01.01.zip/src/td_dispatcher/vm_exits/td_cpuid.c
[2] https://github.com/AMDESE/linux-svsm/blob/main/src/cpu/vc.rs#L571
> 
> > 
> > > Looks this is over-complicated, _if_ the "only filtering out PV related CPUIDs
> > > when applying CPUIDs to TDH.MNG.INIT" approach works. 
> > 
> > It's not complicated at all.  Walk through the leafs defined during TDH.MNG.INIT,
> > reject KVM_SET_CPUID if a leaf isn't present or doesn't match exactly.  Or has
> > the TDX spec changed and it's no longer that simple?
> 
> No the module hasn't been changed, and yes it should be as simple as you said. 
> I just had some first impression that handling CPUID in one IOCTL (TDH.MNG.INIT)
> should be simpler than handling CPUID in two IOCTLs, but I guess this might not
> be true :)
> 
> Anyway I agree with your suggestion.  Thanks.
>
  

Patch

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 6576287e5b..f083ff4335 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -5718,8 +5718,8 @@  void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         } else {
             *eax &= env->features[FEAT_SGX_12_1_EAX];
             *ebx &= 0; /* ebx reserve */
-            *ecx &= env->features[FEAT_XSAVE_XSS_LO];
-            *edx &= env->features[FEAT_XSAVE_XSS_HI];
+            *ecx &= env->features[FEAT_XSAVE_XCR0_LO];
+            *edx &= env->features[FEAT_XSAVE_XCR0_HI];
 
             /* FP and SSE are always allowed regardless of XSAVE/XCR0. */
             *ecx |= XSTATE_FP_MASK | XSTATE_SSE_MASK;