[v2] KVM: x86/pmu: Fix emulation on Intel counters' bit width

Message ID 20230322093117.48335-1-likexu@tencent.com
State New

Commit Message

Like Xu March 22, 2023, 9:31 a.m. UTC
  From: Like Xu <likexu@tencent.com>

Per Intel SDM, the bit width of a PMU counter is specified via CPUID
only if the vCPU has FW_WRITE[bit 13] on IA32_PERF_CAPABILITIES.
When the FW_WRITE bit is not set, only EAX is valid and writes to out-of-bounds
bits do not generate #GP. Conversely, when this bit is set, the #GP for writes
to out-of-bounds bits also applies to the fixed counters.
vPMU currently does not support emulation of bit widths lower than 32
bits or higher than its host capability.

Signed-off-by: Like Xu <likexu@tencent.com>
---
Previous:
https://lore.kernel.org/kvm/20230316113312.54714-1-likexu@tencent.com/

V1 -> V2 Changelog:
- Apply #GP rule to fixed counters when guest has FW_WRITE;
- Apply sign-extension rule to fixed counters when guest doesn't have FW_WRITE;
- Counters' bit width set by cpuid cannot be less than 32 bits;

 arch/x86/kvm/vmx/pmu_intel.c | 10 ++++++++++
 1 file changed, 10 insertions(+)


base-commit: d8708b80fa0e6e21bc0c9e7276ad0bccef73b6e7
  

Comments

Paolo Bonzini March 27, 2023, 2:30 p.m. UTC | #1
On Wed, Mar 22, 2023 at 10:31 AM Like Xu <like.xu.linux@gmail.com> wrote:
>
> From: Like Xu <likexu@tencent.com>
>
> Per Intel SDM, the bit width of a PMU counter is specified via CPUID
> only if the vCPU has FW_WRITE[bit 13] on IA32_PERF_CAPABILITIES.
> When the FW_WRITE bit is not set, only EAX is valid and out-of-bounds
> bits accesses do not generate #GP. Conversely when this bit is set, #GP
> for out-of-bounds bits accesses will also appear on the fixed counters.
> vPMU currently does not support emulation of bit widths lower than 32
> bits or higher than its host capability.

Can you please point out the date and paragraph of the SDM?

Paolo
  
Like Xu March 28, 2023, 9:16 a.m. UTC | #2
On 27/3/2023 10:30 pm, Paolo Bonzini wrote:
> On Wed, Mar 22, 2023 at 10:31 AM Like Xu <like.xu.linux@gmail.com> wrote:
>>
>> From: Like Xu <likexu@tencent.com>
>>
>> Per Intel SDM, the bit width of a PMU counter is specified via CPUID
>> only if the vCPU has FW_WRITE[bit 13] on IA32_PERF_CAPABILITIES.
>> When the FW_WRITE bit is not set, only EAX is valid and out-of-bounds
>> bits accesses do not generate #GP. Conversely when this bit is set, #GP
>> for out-of-bounds bits accesses will also appear on the fixed counters.
>> vPMU currently does not support emulation of bit widths lower than 32
>> bits or higher than its host capability.
> 
> Can you please point out the date and paragraph of the SDM?
> 
> Paolo
> 

325462-078US, December 2022
20.2.6 Full-Width Writes to Performance Counter Registers

The general-purpose performance counter registers IA32_PMCx are writable via
WRMSR instruction. However, the value written into IA32_PMCx by WRMSR is the
signed extended 64-bit value of the EAX[31:0] input of WRMSR.

A processor that supports full-width writes to the general-purpose performance
counters enumerated by CPUID.0AH:EAX[15:8] will set IA32_PERF_CAPABILITIES[13]
to enumerate its full-width-write capability. See Figure 20-65.

If IA32_PERF_CAPABILITIES.FW_WRITE[bit 13] = 1, each IA32_PMCi is accompanied by
a corresponding alias address starting at 4C1H for IA32_A_PMC0.

The bit width of the performance monitoring counters is specified in
CPUID.0AH:EAX[23:16]. If IA32_A_PMCi is present, the 64-bit input value (EDX:EAX)
of WRMSR to IA32_A_PMCi will cause IA32_PMCi to be updated by:

	COUNTERWIDTH =
		CPUID.0AH:EAX[23:16] bit width of the performance monitoring counter
	IA32_PMCi[COUNTERWIDTH-1:32] := EDX[COUNTERWIDTH-33:0];
	IA32_PMCi[31:0] := EAX[31:0];
	EDX[63:COUNTERWIDTH] are reserved
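
As a rough illustration of the two write paths quoted above, here is a minimal
C model. It is only a sketch, not KVM code: counter_mask(), legacy_pmc_write()
and full_width_pmc_write() are hypothetical names, a false return stands in for
#GP, and masking the legacy result to the counter width is a modelling
assumption rather than something the SDM text states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask covering bits [width-1:0]; width is assumed to be in 0..64. */
static uint64_t counter_mask(unsigned int width)
{
	return width >= 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;
}

/*
 * Legacy write via IA32_PMCi: only EAX[31:0] is consumed and it is
 * sign-extended to 64 bits. Truncating the result to the counter width
 * is an assumption of this model.
 */
static uint64_t legacy_pmc_write(uint64_t edx_eax, unsigned int width)
{
	int64_t extended = (int64_t)(int32_t)(uint32_t)edx_eax;

	return (uint64_t)extended & counter_mask(width);
}

/*
 * Full-width write via IA32_A_PMCi: bits [63:width] are reserved.
 * Returns false (modelling #GP) if any reserved bit is set and leaves
 * the counter untouched in that case.
 */
static bool full_width_pmc_write(uint64_t edx_eax, unsigned int width,
				 uint64_t *counter)
{
	if (edx_eax & ~counter_mask(width))
		return false;

	*counter = edx_eax;
	return true;
}
```

With a 48-bit counter, legacy_pmc_write(0xFFFFFFFF, 48) stores all ones in the
low 48 bits, while full_width_pmc_write(1ULL << 48, 48, &c) is rejected.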

---

Some might argue that this is all talking about GP counters, not fixed counters.
In fact, the full-width write hw behaviour is presumed to be the same for all
counters.

Commercial hardware will not use less than 32 bits, or an odd bit width like 46
bits. A KVM user space (such as selftests) may set a strange bit width, for
example 33 bits, and with the current code, writing the reserved bits of fixed
counters doesn't cause #GP.

Also, when the guest does not have the Full-Width feature, the fixed counters can
be more than 32 bits wide via CPUID while the GP counters are only 32 bits wide,
which is also monstrous.

The current KVM is also not capable of emulating counter overflow when KVM user
space sets a bit width of less than 32 bits w/ FW_WRITE.

The above SDM-undefined behaviour led to this fix, which may lift some of the fog.
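
The width rules argued for above can be sketched as a small clamp. This is an
illustration only: effective_bit_width() and width_to_bitmask() are hypothetical
helpers, written to mirror the max_t()/min_t() clamp in the patch, and the
48-bit host width in the usage note is just an example value.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: the width that would be emulated for one counter
 * class. Without FW_WRITE the width is pinned to 32; with it, widths
 * below 32 are raised to 32. The host capability is always the upper
 * bound.
 */
static unsigned int effective_bit_width(bool fw_write,
					unsigned int guest_width,
					unsigned int host_width)
{
	unsigned int width = 32;

	if (fw_write && guest_width > 32)
		width = guest_width;

	return width < host_width ? width : host_width;
}

/* Bitmask of valid counter bits for a width in 0..64. */
static uint64_t width_to_bitmask(unsigned int width)
{
	return width >= 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;
}
```

On a host with 48-bit counters, a guest CPUID width of 33 with FW_WRITE stays
at 33, a width of 16 is raised to 32, and any width without FW_WRITE ends up
as 32.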
  
Paolo Bonzini March 28, 2023, 9:20 a.m. UTC | #3
On 3/28/23 11:16, Like Xu wrote:
> 
> 
> If IA32_PERF_CAPABILITIES.FW_WRITE[bit 13] =1, each IA32_PMCi is 
> accompanied by a
> corresponding alias address starting at 4C1H for IA32_A_PMC0.
> 
> The bit width of the performance monitoring counters is specified in 
> CPUID.0AH:EAX[23:16].
> If IA32_A_PMCi is present, the 64-bit input value (EDX:EAX) of WRMSR to 
> IA32_A_PMCi will cause
> IA32_PMCi to be updated by:
> 
>      COUNTERWIDTH =
>          CPUID.0AH:EAX[23:16] bit width of the performance monitoring counter
>      IA32_PMCi[COUNTERWIDTH-1:32] := EDX[COUNTERWIDTH-33:0]);
>      IA32_PMCi[31:0] := EAX[31:0];
>      EDX[63:COUNTERWIDTH] are reserved
> 
> ---
> 
> Some might argue that this is all talking about GP counters, not
> fixed counters. In fact, the full-width write hw behaviour is
> presumed to do the same thing for all counters.
But the above behavior, and the #GP, is only true for IA32_A_PMCi (the 
full-width MSR).  Did I understand correctly that the behavior for fixed 
counters is changed without introducing an alias MSR?

Paolo
  
Like Xu March 28, 2023, 10:04 a.m. UTC | #4
On 28/3/2023 5:20 pm, Paolo Bonzini wrote:
> On 3/28/23 11:16, Like Xu wrote:
>>
>>
>> If IA32_PERF_CAPABILITIES.FW_WRITE[bit 13] =1, each IA32_PMCi is accompanied by a
>> corresponding alias address starting at 4C1H for IA32_A_PMC0.
>>
>> The bit width of the performance monitoring counters is specified in 
>> CPUID.0AH:EAX[23:16].
>> If IA32_A_PMCi is present, the 64-bit input value (EDX:EAX) of WRMSR to 
>> IA32_A_PMCi will cause
>> IA32_PMCi to be updated by:
>>
>>      COUNTERWIDTH =
>>          CPUID.0AH:EAX[23:16] bit width of the performance monitoring counter
>>      IA32_PMCi[COUNTERWIDTH-1:32] := EDX[COUNTERWIDTH-33:0]);
>>      IA32_PMCi[31:0] := EAX[31:0];
>>      EDX[63:COUNTERWIDTH] are reserved
>>
>> ---
>>
>> Some might argue that this is all talking about GP counters, not
>> fixed counters. In fact, the full-width write hw behaviour is
>> presumed to do the same thing for all counters.
> But the above behavior, and the #GP, is only true for IA32_A_PMCi (the 
> full-width MSR).  Did I understand correctly that the behavior for fixed 
> counters is changed without introducing an alias MSR?
> 
> Paolo
> 

If true, why introduce those alias MSRs? My archaeological findings are:

A platform w/o full-width writes like Westmere (which already has 3 fixed
counters) is declared to have a counter width of (R:48, W:32), and its successor
Sandy Bridge has (R:48, W:32/48).

Thus I think the behaviour of the fixed counters changed from there, and the
alias GP MSRs were introduced to keep supporting 32-bit writes on GP counters
(via the original addresses).

[*] Intel® 64 and IA-32 Architectures Software Developer’s Manual Documentation
Changes (252046-030, January 2011), Table 30-18 Core PMU Comparison.
  
Sean Christopherson May 24, 2023, 8:33 p.m. UTC | #5
On Tue, Mar 28, 2023, Like Xu wrote:
> On 28/3/2023 5:20 pm, Paolo Bonzini wrote:
> > On 3/28/23 11:16, Like Xu wrote:
> > > 
> > > 
> > > If IA32_PERF_CAPABILITIES.FW_WRITE[bit 13] =1, each IA32_PMCi is accompanied by a
> > > corresponding alias address starting at 4C1H for IA32_A_PMC0.
> > > 
> > > The bit width of the performance monitoring counters is specified in
> > > CPUID.0AH:EAX[23:16].
> > > If IA32_A_PMCi is present, the 64-bit input value (EDX:EAX) of WRMSR
> > > to IA32_A_PMCi will cause
> > > IA32_PMCi to be updated by:
> > > 
> > >      COUNTERWIDTH =
> > >          CPUID.0AH:EAX[23:16] bit width of the performance monitoring counter
> > >      IA32_PMCi[COUNTERWIDTH-1:32] := EDX[COUNTERWIDTH-33:0]);
> > >      IA32_PMCi[31:0] := EAX[31:0];
> > >      EDX[63:COUNTERWIDTH] are reserved
> > > 
> > > ---
> > > 
> > > Some might argue that this is all talking about GP counters, not
> > > fixed counters. In fact, the full-width write hw behaviour is
> > > presumed to do the same thing for all counters.
> > But the above behavior, and the #GP, is only true for IA32_A_PMCi (the
> > > full-width MSR).  Did I understand correctly that the behavior for fixed
> > counters is changed without introducing an alias MSR?
> > 
> > Paolo
> > 
> 
> If true, why introducing those alias MSRs ?

My guess is there is/was software in the field that wrote -1 to the GP counters,
i.e. would have been broken by the new #GP behavior.
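
To make that guess concrete: a raw 64-bit -1 has bits 63:48 set, so under a
reserved-bits rule it would be rejected on a 48-bit counter, whereas a value
masked to the counter width would not. The sketch below uses hypothetical
helper names, assumes a 48-bit width purely for illustration, and models #GP
as a true return.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask covering bits [width-1:0]; width is assumed to be in 0..64. */
static uint64_t pmc_mask(unsigned int width)
{
	return width >= 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;
}

/* True if a full-width style write would touch reserved bits (modelled #GP). */
static bool full_width_write_faults(uint64_t value, unsigned int width)
{
	return (value & ~pmc_mask(width)) != 0;
}

/*
 * Hypothetical way for software to arm a counter so that it overflows
 * after 'period' events: negate the period and mask it to the counter
 * width so that no reserved bit is set.
 */
static uint64_t arm_value(uint64_t period, unsigned int width)
{
	return (uint64_t)(-(int64_t)period) & pmc_mask(width);
}
```

Here full_width_write_faults((uint64_t)-1, 48) is true, while the masked
arm_value(1, 48) passes the same check.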

> My archaeological findings are:
> 
> a platform w/o full-width like Westmere (has 3-fixed counters already) is
> declared to have a counter width (R:48, W:32) and its successor Sandy Bridge
> has (R:48 , W: 32/48).
> 
> Thus I think the behaviour of the fixed counter has changed from there, and
> the alias GP MSRs were introduced to keep the support on 32-bit writes on #GP
> counters (via original address).

FWIW, I see the #GP behavior for fixed counters on Haswell, so this does seem to
be the case.  That said, I would like to get confirmation from Intel that this is
architectural and/or working as intended.

Like, can you follow up with Intel to get clarification/confirmation?  And ideally
an SDM update...
  

Patch

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index e8a3be0b9df9..d38b820d6b9e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -470,6 +470,12 @@  static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			pmc_update_sample_period(pmc);
 			return 0;
 		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
+			if (fw_writes_is_enabled(vcpu)) {
+				if (data & ~pmu->counter_bitmask[KVM_PMC_FIXED])
+					return 1;
+			} else if (!msr_info->host_initiated) {
+				data = (s64)(s32)data;
+			}
 			pmc->counter += data - pmc_read_counter(pmc);
 			pmc_update_sample_period(pmc);
 			return 0;
@@ -516,6 +522,7 @@  static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	union cpuid10_edx edx;
 	u64 perf_capabilities;
 	u64 counter_mask;
+	bool fw_wr = fw_writes_is_enabled(vcpu);
 	int i;
 
 	pmu->nr_arch_gp_counters = 0;
@@ -543,6 +550,7 @@  static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 
 	pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
 					 kvm_pmu_cap.num_counters_gp);
+	eax.split.bit_width = fw_wr ? max_t(int, 32, eax.split.bit_width) : 32;
 	eax.split.bit_width = min_t(int, eax.split.bit_width,
 				    kvm_pmu_cap.bit_width_gp);
 	pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << eax.split.bit_width) - 1;
@@ -558,6 +566,8 @@  static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 			min3(ARRAY_SIZE(fixed_pmc_events),
 			     (size_t) edx.split.num_counters_fixed,
 			     (size_t)kvm_pmu_cap.num_counters_fixed);
+		edx.split.bit_width_fixed = fw_wr ?
+			max_t(int, 32, edx.split.bit_width_fixed) : 32;
 		edx.split.bit_width_fixed = min_t(int, edx.split.bit_width_fixed,
 						  kvm_pmu_cap.bit_width_fixed);
 		pmu->counter_bitmask[KVM_PMC_FIXED] =