x86: KVM: Add feature flag for AMD's FsGsKernelGsBaseNonSerializing

Message ID 20231004002038.907778-1-jmattson@google.com
State New
Headers
Series x86: KVM: Add feature flag for AMD's FsGsKernelGsBaseNonSerializing |

Commit Message

Jim Mattson Oct. 4, 2023, 12:20 a.m. UTC
  Define an X86_FEATURE_* flag for
CPUID.80000021H:EAX.FsGsKernelGsBaseNonSerializing[bit 1], and
advertise the feature to userspace via KVM_GET_SUPPORTED_CPUID.

This feature is not yet documented in the APM. See AMD's "Processor
Programming Reference (PPR) for AMD Family 19h Model 61h, Revision B1
Processors (56713-B1-PUB)."

Signed-off-by: Jim Mattson <jmattson@google.com>
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/kvm/cpuid.c               | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)
  

Comments

Dave Hansen Oct. 4, 2023, 12:57 a.m. UTC | #1
On 10/3/23 17:20, Jim Mattson wrote:
> Define an X86_FEATURE_* flag for
> CPUID.80000021H:EAX.FsGsKernelGsBaseNonSerializing[bit 1], and
> advertise the feature to userspace via KVM_GET_SUPPORTED_CPUID.
...
> +#define X86_FEATURE_BASES_NON_SERIAL	(20*32+ 1) /* "" FSBASE, GSBASE, and KERNELGSBASE are non-serializing */

This is failing to differentiate two *VERY* different things.

FSBASE, GSBASE, and KERNELGSBASE themselves are registers.  They have
*NOTHING* to do with serialization.  WRFSBASE, for instance is not
serializing.  Reading (with RDMSR) or using any of those three registers
is not serializing.

The *ONLY* thing that relates them to serialization is the WRMSR
instruction which itself is (mostly) architecturally serializing and the
fact that WRMSR has historically been the main way to write those three
registers.

The AMD docs call this out, which helps.  But the changelog, comments
and probably the feature naming need some work.

Why does this matter, btw?  Why do guests need this bit passed through?
  
Jim Mattson Oct. 4, 2023, 2:44 a.m. UTC | #2
On Tue, Oct 3, 2023 at 5:57 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/3/23 17:20, Jim Mattson wrote:
> > Define an X86_FEATURE_* flag for
> > CPUID.80000021H:EAX.FsGsKernelGsBaseNonSerializing[bit 1], and
> > advertise the feature to userspace via KVM_GET_SUPPORTED_CPUID.
> ...
> > +#define X86_FEATURE_BASES_NON_SERIAL (20*32+ 1) /* "" FSBASE, GSBASE, and KERNELGSBASE are non-serializing */
>
> This is failing to differentiate two *VERY* different things.
>
> FSBASE, GSBASE, and KERNELGSBASE themselves are registers.  They have
> *NOTHING* to do with serialization.  WRFSBASE, for instance is not
> serializing.  Reading (with RDMSR) or using any of those three registers
> is not serializing.
>
> The *ONLY* thing that relates them to serialization is the WRMSR
> instruction which itself is (mostly) architecturally serializing and the
> fact that WRMSR has historically been the main way to write those three
> registers.
>
> The AMD docs call this out, which helps.  But the changelog, comments
> and probably the feature naming need some work.

You're right; I was overly terse. I'll elucidate in v2.

> Why does this matter, btw?  Why do guests need this bit passed through?

The business of declaring breaking changes to the architectural
specification in a CPUID bit has never made much sense to me. Legacy
software that depends on the original architectural specification
isn't going to query the CPUID bit, because the CPUID bit didn't exist
when it was written. New software probably isn't going to query the
CPUID bit, either, because it has to have an implementation that works
on newer processors regardless. Why, then, would a developer bother to
provide an implementation that only works on older processors *and*
the code to select an implementation based on a CPUID bit?

Take, for example, CPUID.(EAX=7,ECX=0):EBX[bit 13], which, IIRC, was
the first CPUID bit of the "Ha ha; we're changing the architectural
specification" category. When Intel introduced this new behavior in
Haswell, they broke WIN87EM.DLL in Windows XP (see
https://communities.vmware.com/t5/Legacy-User-Blogs/General-Protection-Fault-in-module-WIN87EM-DLL-at-0001-02C6/ta-p/2770422).
I know of at least three software packages commonly running in VMs
that were broken as a result. The CPUID bit didn't solve any problems,
and I doubt that any software queries that bit today.

As a hypervisor developer, however, it's not up to me to make value
judgments on individual CPUID bits. If a bit indicates an innate
characteristic of the hardware, it should be passed through.

No one is likely to query CPUID.80000021H.EAX[bit 21] today, but if
someone does query the bit in the future, they can reasonably expect
that WRMSR({FS,GS,KERNELGS}_BASE) is a serializing operation whenever
this bit is clear. Therefore, any hypervisor that doesn't pass the bit
through is broken. Sadly, this also means that for a heterogenous
migration pool, the hypervisor must set this bit in the guest CPUID if
it is set on any host in the pool. Yes, that means that the legacy
behavior may sometimes be present in a VM that enumerates the CPUID
bit, but that's the best we can do.

I'm a little surprised at the pushback, TBH. Are you implying that
there is some advantage to *not* passing this bit through?
  
Dave Hansen Oct. 4, 2023, 3:27 a.m. UTC | #3
On 10/3/23 19:44, Jim Mattson wrote:
> I'm a little surprised at the pushback, TBH. Are you implying that
> there is some advantage to *not* passing this bit through?

I'm not really trying to push back.  I'm honestly just curious.  Linux
obviously doesn't cat about the bit.  So is this for some future Linux
or some other OS?
  
Jim Mattson Oct. 4, 2023, 4:24 a.m. UTC | #4
On Tue, Oct 3, 2023 at 8:27 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/3/23 19:44, Jim Mattson wrote:
> > I'm a little surprised at the pushback, TBH. Are you implying that
> > there is some advantage to *not* passing this bit through?
>
> I'm not really trying to push back.  I'm honestly just curious.  Linux
> obviously doesn't cat about the bit.  So is this for some future Linux
> or some other OS?

It's not for any particular guest OS. It's just for correctness of the
virtual CPU. Pedantically, hardware that enumerates this bit cannot
run a guest that doesn't. Pragmatically, it almost certainly doesn't
matter. Getting it right is trivial and has no impact on performance
or code size, so why not just do it?
  
Borislav Petkov Oct. 4, 2023, 7:58 a.m. UTC | #5
On Tue, Oct 03, 2023 at 07:44:51PM -0700, Jim Mattson wrote:
> The business of declaring breaking changes to the architectural
> specification in a CPUID bit has never made much sense to me.

How else should they be expressed then?

In some flaky PDF which changes URLs whenever the new corporate CMS gets
installed?

Or we should do f/m/s matching which doesn't make any sense for VMs?

When you think about it, CPUID is the best thing we have.

> No one is likely to query CPUID.80000021H.EAX[bit 21] today, but if
> someone does query the bit in the future, they can reasonably expect
> that WRMSR({FS,GS,KERNELGS}_BASE) is a serializing operation whenever
> this bit is clear. Therefore, any hypervisor that doesn't pass the bit
> through is broken. Sadly, this also means that for a heterogenous
> migration pool, the hypervisor must set this bit in the guest CPUID if
> it is set on any host in the pool. Yes, that means that the legacy
> behavior may sometimes be present in a VM that enumerates the CPUID
> bit, but that's the best we can do.

Yes, add this to your commit message.

> I'm a little surprised at the pushback, TBH. Are you implying that
> there is some advantage to *not* passing this bit through?

We don't add stuff which is not worth adding. There has to be *at*
*least* some justification for it.

Thx.
  
Jim Mattson Oct. 4, 2023, 8:29 p.m. UTC | #6
On Wed, Oct 4, 2023 at 12:59 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Oct 03, 2023 at 07:44:51PM -0700, Jim Mattson wrote:
> > The business of declaring breaking changes to the architectural
> > specification in a CPUID bit has never made much sense to me.
>
> How else should they be expressed then?
>
> In some flaky PDF which changes URLs whenever the new corporate CMS gets
> installed?
>
> Or we should do f/m/s matching which doesn't make any sense for VMs?
>
> When you think about it, CPUID is the best thing we have.
>
> > No one is likely to query CPUID.80000021H.EAX[bit 21] today, but if
> > someone does query the bit in the future, they can reasonably expect
> > that WRMSR({FS,GS,KERNELGS}_BASE) is a serializing operation whenever
> > this bit is clear. Therefore, any hypervisor that doesn't pass the bit
> > through is broken. Sadly, this also means that for a heterogenous
> > migration pool, the hypervisor must set this bit in the guest CPUID if
> > it is set on any host in the pool. Yes, that means that the legacy
> > behavior may sometimes be present in a VM that enumerates the CPUID
> > bit, but that's the best we can do.
>
> Yes, add this to your commit message.
>
> > I'm a little surprised at the pushback, TBH. Are you implying that
> > there is some advantage to *not* passing this bit through?
>
> We don't add stuff which is not worth adding. There has to be *at*
> *least* some justification for it.

Let me propose the following axiom as justification:

KVM_GET_SUPPORTED_CPUID must pass through any defeature bits that are
set on the host, unless KVM is prepared to emulate the missing
feature.

Here, a defeature bit is any CPUID bit where a value of '1' indicates
the absence of a feature.

> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
  
Jim Mattson Oct. 5, 2023, 4:22 p.m. UTC | #7
On Wed, Oct 4, 2023 at 12:59 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Oct 03, 2023 at 07:44:51PM -0700, Jim Mattson wrote:
> > The business of declaring breaking changes to the architectural
> > specification in a CPUID bit has never made much sense to me.
>
> How else should they be expressed then?
>
> In some flaky PDF which changes URLs whenever the new corporate CMS gets
> installed?
>
> Or we should do f/m/s matching which doesn't make any sense for VMs?
>
> When you think about it, CPUID is the best thing we have.

Every time a new defeature bit is introduced, it breaks existing
hypervisors, because no one can predict ahead of time that these bits
have to be passed through.

I wonder if we could convince x86 CPU vendors to put all defeature
bits under a single leaf, so that we can just set the entire leaf to
all 1's in KVM_GET_SUPPORTED_CPUID.
  
Dave Hansen Oct. 5, 2023, 4:35 p.m. UTC | #8
On 10/5/23 09:22, Jim Mattson wrote:
> On Wed, Oct 4, 2023 at 12:59 AM Borislav Petkov <bp@alien8.de> wrote:
>> On Tue, Oct 03, 2023 at 07:44:51PM -0700, Jim Mattson wrote:
>>> The business of declaring breaking changes to the architectural
>>> specification in a CPUID bit has never made much sense to me.
>> How else should they be expressed then?
>>
>> In some flaky PDF which changes URLs whenever the new corporate CMS gets
>> installed?
>>
>> Or we should do f/m/s matching which doesn't make any sense for VMs?
>>
>> When you think about it, CPUID is the best thing we have.
> Every time a new defeature bit is introduced, it breaks existing
> hypervisors, because no one can predict ahead of time that these bits
> have to be passed through.
> 
> I wonder if we could convince x86 CPU vendors to put all defeature
> bits under a single leaf, so that we can just set the entire leaf to
> all 1's in KVM_GET_SUPPORTED_CPUID.

I hope I'm not throwing stones from a glass house here...

But I'm struggling to think of cases where Intel has read-only
"defeature bits" like this one.  There are certainly things like
MSR_IA32_MISC_ENABLE_FAST_STRING that can be toggled, but read-only
indicators of a departure from established architecture seems ...
suboptimal.

It's arguable that TDX changed a bunch of architecture like causing
exceptions on CPUID and MSRs that never caused exceptions before and
_that_ constitutes a defeature.  But that's the least of the problems
for a TDX VM. :)

(Seriously, I'm not trying to shame Intel's x86 fellow travelers here,
 just trying to make sure I'm not missing something).
  
Paolo Bonzini Oct. 5, 2023, 4:38 p.m. UTC | #9
On 10/4/23 09:58, Borislav Petkov wrote:
> On Tue, Oct 03, 2023 at 07:44:51PM -0700, Jim Mattson wrote:
>> The business of declaring breaking changes to the architectural
>> specification in a CPUID bit has never made much sense to me.
> How else should they be expressed then?
> 
> In some flaky PDF which changes URLs whenever the new corporate CMS gets
> installed?
> 
> Or we should do f/m/s matching which doesn't make any sense for VMs?

Nothing *needs* to be done other than documenting this retroactive 
change to what constitutes architectural behavior.  It's not a CPUID 
that can be queried to change behavior; the user can use CPUID to 
diagnose that something has broken, but the broken program cannot know 
in the first place that the CPUID bit exists.

I agree with Jim that it would be nice to have some bits from Intel, and 
some bits from AMD, that current processors always return as 1.  Future 
processors can change those to 0 as desired.

Intel did something similar with VMX.  They have a bunch of bits for 
which we don't know the meaning, but we know it is something that "right 
now always causes vmexits".  Even if in the future you might be able to 
disable it, the polarity of the bit is the same as for all other vmexit 
controls.

Paolo
  
Jim Mattson Oct. 5, 2023, 4:41 p.m. UTC | #10
On Thu, Oct 5, 2023 at 9:35 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/5/23 09:22, Jim Mattson wrote:
> > On Wed, Oct 4, 2023 at 12:59 AM Borislav Petkov <bp@alien8.de> wrote:
> >> On Tue, Oct 03, 2023 at 07:44:51PM -0700, Jim Mattson wrote:
> >>> The business of declaring breaking changes to the architectural
> >>> specification in a CPUID bit has never made much sense to me.
> >> How else should they be expressed then?
> >>
> >> In some flaky PDF which changes URLs whenever the new corporate CMS gets
> >> installed?
> >>
> >> Or we should do f/m/s matching which doesn't make any sense for VMs?
> >>
> >> When you think about it, CPUID is the best thing we have.
> > Every time a new defeature bit is introduced, it breaks existing
> > hypervisors, because no one can predict ahead of time that these bits
> > have to be passed through.
> >
> > I wonder if we could convince x86 CPU vendors to put all defeature
> > bits under a single leaf, so that we can just set the entire leaf to
> > all 1's in KVM_GET_SUPPORTED_CPUID.
>
> I hope I'm not throwing stones from a glass house here...
>
> But I'm struggling to think of cases where Intel has read-only
> "defeature bits" like this one.  There are certainly things like
> MSR_IA32_MISC_ENABLE_FAST_STRING that can be toggled, but read-only
> indicators of a departure from established architecture seems ...
> suboptimal.
>
> It's arguable that TDX changed a bunch of architecture like causing
> exceptions on CPUID and MSRs that never caused exceptions before and
> _that_ constitutes a defeature.  But that's the least of the problems
> for a TDX VM. :)
>
> (Seriously, I'm not trying to shame Intel's x86 fellow travelers here,
>  just trying to make sure I'm not missing something).

Intel's defeature bits that I know of are:

CPUID.(EAX=7,ECX=0):EBX[bit 13] (Haswell) - "Deprecates FPU CS and FPU
DS values if 1."
CPUID.(EAX=7,ECX=0):EBX[bit 6] (Skylake) - "FDP_EXCPTN_ONLY. x87 FPU
Data Pointer updated only on x87 exceptions if 1."
  
Paolo Bonzini Oct. 5, 2023, 4:51 p.m. UTC | #11
On 10/5/23 18:41, Jim Mattson wrote:
>> I hope I'm not throwing stones from a glass house here...
>>
>> But I'm struggling to think of cases where Intel has read-only
>> "defeature bits" like this one.  There are certainly things like
>> MSR_IA32_MISC_ENABLE_FAST_STRING that can be toggled, but read-only
>> indicators of a departure from established architecture seems ...
>> suboptimal.
>>
>> It's arguable that TDX changed a bunch of architecture like causing
>> exceptions on CPUID and MSRs that never caused exceptions before and
>> _that_  constitutes a defeature.  But that's the least of the problems
>> for a TDX VM. 😄
>>
>> (Seriously, I'm not trying to shame Intel's x86 fellow travelers here,
>>   just trying to make sure I'm not missing something).
> Intel's defeature bits that I know of are:
> 
> CPUID.(EAX=7,ECX=0):EBX[bit 13] (Haswell) - "Deprecates FPU CS and FPU
> DS values if 1."
> CPUID.(EAX=7,ECX=0):EBX[bit 6] (Skylake) - "FDP_EXCPTN_ONLY. x87 FPU
> Data Pointer updated only on x87 exceptions if 1."

And for AMD, to get the full landscape:

- CPUID(EAX=8000_0021h).EAX[0], "Processor ignores nested data breakpoints"

- CPUID(EAX=8000_0021h).EAX[9], "SMM_CTL MSR is not present" (the MSR 
used to be always present if SVM is available)

AMD had a few processors without X86_BUG_NULL_SEG that do not expose 
X86_FEATURE_NULL_SEL_CLR_BASE, but that's conservative so not a big deal.

Paolo
  
Jim Mattson Oct. 5, 2023, 5:06 p.m. UTC | #12
On Thu, Oct 5, 2023 at 9:39 AM Paolo Bonzini <pbonzini@redhat.com> wrote:

> I agree with Jim that it would be nice to have some bits from Intel, and
> some bits from AMD, that current processors always return as 1.  Future
> processors can change those to 0 as desired.

That's not quite what I meant.

Today, hypervisors will not pass through a non-zero CPUID bit that
they don't know the definition of. This makes sense for positive
features, and for multi-bit fields.

I'm suggesting a leaf devoted to single bit negative features. If a
bit is set in hardware, it means that something has been taken away.
Hypervisors don't need to know exactly what was taken away. For this
leaf only, hypervisors will always pass through a non-zero bit, even
if they have know idea what it means.
  
Paolo Bonzini Oct. 5, 2023, 5:14 p.m. UTC | #13
On 10/5/23 19:06, Jim Mattson wrote:
> On Thu, Oct 5, 2023 at 9:39 AM Paolo Bonzini<pbonzini@redhat.com>  wrote:
> 
>> I agree with Jim that it would be nice to have some bits from Intel, and
>> some bits from AMD, that current processors always return as 1.  Future
>> processors can change those to 0 as desired.
> That's not quite what I meant.
> 
> I'm suggesting a leaf devoted to single bit negative features. If a
> bit is set in hardware, it means that something has been taken away.
> Hypervisors don't need to know exactly what was taken away. For this
> leaf only, hypervisors will always pass through a non-zero bit, even
> if they have know idea what it means.

Understood, but I'm suggesting that these might even have the right 
polarity: if a bit is set it means that something is there and might not 
in the future, even if we don't know exactly what.  We can pass through 
the bit, we can AND bits across the migration pool to define what to 
pass to the guest, we can force-set the leaves to zero (feature 
removed).  Either way, the point is to group future defeatures together.

That said, these bits are only for documentation/debugging purposes 
anyway.  So I like the idea because it would educate the architects 
about this issue, more than because it is actually useful...

Paolo
  
Jim Mattson Oct. 5, 2023, 5:27 p.m. UTC | #14
On Thu, Oct 5, 2023 at 10:14 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 10/5/23 19:06, Jim Mattson wrote:
> > On Thu, Oct 5, 2023 at 9:39 AM Paolo Bonzini<pbonzini@redhat.com>  wrote:
> >
> >> I agree with Jim that it would be nice to have some bits from Intel, and
> >> some bits from AMD, that current processors always return as 1.  Future
> >> processors can change those to 0 as desired.
> > That's not quite what I meant.
> >
> > I'm suggesting a leaf devoted to single bit negative features. If a
> > bit is set in hardware, it means that something has been taken away.
> > Hypervisors don't need to know exactly what was taken away. For this
> > leaf only, hypervisors will always pass through a non-zero bit, even
> > if they have know idea what it means.
>
> Understood, but I'm suggesting that these might even have the right
> polarity: if a bit is set it means that something is there and might not
> in the future, even if we don't know exactly what.  We can pass through
> the bit, we can AND bits across the migration pool to define what to
> pass to the guest, we can force-set the leaves to zero (feature
> removed).  Either way, the point is to group future defeatures together.

Oh, yeah. Your suggestion is better. :)
  
Dave Hansen Oct. 5, 2023, 5:52 p.m. UTC | #15
On 10/5/23 09:41, Jim Mattson wrote:
>>
>> But I'm struggling to think of cases where Intel has read-only
>> "defeature bits" like this one.  There are certainly things like
>> MSR_IA32_MISC_ENABLE_FAST_STRING that can be toggled, but read-only
>> indicators of a departure from established architecture seems ...
>> suboptimal.
...
> CPUID.(EAX=7,ECX=0):EBX[bit 13] (Haswell) - "Deprecates FPU CS and FPU
> DS values if 1."
> CPUID.(EAX=7,ECX=0):EBX[bit 6] (Skylake) - "FDP_EXCPTN_ONLY. x87 FPU
> Data Pointer updated only on x87 exceptions if 1."

Thanks!

Trying to group these does make sense to me.  I don't think people take
architecture breakage lightly, but I certainly never considered that it
might, for instance, be important enough to create a new VM migration pool.

I'll try to keep an eye out for these.
  

Patch

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 58cb9495e40f..b53951c83d1d 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -443,6 +443,7 @@ 
 
 /* AMD-defined Extended Feature 2 EAX, CPUID level 0x80000021 (EAX), word 20 */
 #define X86_FEATURE_NO_NESTED_DATA_BP	(20*32+ 0) /* "" No Nested Data Breakpoints */
+#define X86_FEATURE_BASES_NON_SERIAL	(20*32+ 1) /* "" FSBASE, GSBASE, and KERNELGSBASE are non-serializing */
 #define X86_FEATURE_LFENCE_RDTSC	(20*32+ 2) /* "" LFENCE always serializing / synchronizes RDTSC */
 #define X86_FEATURE_NULL_SEL_CLR_BASE	(20*32+ 6) /* "" Null Selector Clears Base */
 #define X86_FEATURE_AUTOIBRS		(20*32+ 8) /* "" Automatic IBRS */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0544e30b4946..5e776e8619be 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -761,7 +761,8 @@  void kvm_set_cpu_caps(void)
 
 	kvm_cpu_cap_mask(CPUID_8000_0021_EAX,
 		F(NO_NESTED_DATA_BP) | F(LFENCE_RDTSC) | 0 /* SmmPgCfgLock */ |
-		F(NULL_SEL_CLR_BASE) | F(AUTOIBRS) | 0 /* PrefetchCtlMsr */
+		F(NULL_SEL_CLR_BASE) | F(AUTOIBRS) | 0 /* PrefetchCtlMsr */ |
+		F(BASES_NON_SERIAL)
 	);
 
 	if (cpu_feature_enabled(X86_FEATURE_SRSO_NO))