KVM: x86: Use the correct size of struct kvm_vcpu_pv_apf_data and fix the documentation

Message ID 20231013070037.512051-1-xiaoyao.li@intel.com
State New

Commit Message

Xiaoyao Li Oct. 13, 2023, 7 a.m. UTC
The size of struct kvm_vcpu_pv_apf_data is 68 bytes, not 64 bytes.
Fix kvm_gfn_to_hva_cache_init() to use the correct size, even though KVM
only touches the first 8 bytes.
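
(Illustrative sketch, not part of the patch: the size arithmetic, with
the field layout copied from arch/x86/include/uapi/asm/kvm_para.h.)

	#include <linux/types.h>
	#include <stddef.h>

	/* Layout before any cleanup: the comment in the header claims
	 * 64 bytes, but the fields sum to 4 + 4 + 56 + 4 = 68.
	 */
	struct kvm_vcpu_pv_apf_data {
		__u32 flags;	/* bytes  0-3  */
		__u32 token;	/* bytes  4-7  */
		__u8 pad[56];	/* bytes  8-63 */
		__u32 enabled;	/* bytes 64-67, past the 64-byte cache line */
	};

	_Static_assert(sizeof(struct kvm_vcpu_pv_apf_data) == 68,
		       "4 + 4 + 56 + 4 = 68, not 64");
	_Static_assert(offsetof(struct kvm_vcpu_pv_apf_data, enabled) == 64,
		       "'enabled' starts on the second cache line");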

Fix the documentation accordingly and opportunistically refine its wording.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 Documentation/virt/kvm/x86/msr.rst | 22 +++++++++++-----------
 arch/x86/kvm/x86.c                 |  2 +-
 2 files changed, 12 insertions(+), 12 deletions(-)


base-commit: 5804c19b80bf625c6a9925317f845e497434d6d3
  

Comments

Sean Christopherson Oct. 17, 2023, 5:28 p.m. UTC | #1
On Fri, Oct 13, 2023, Xiaoyao Li wrote:
> The size of struct kvm_vcpu_pv_apf_data is 68 bytes, not 64 bytes.

LOL, the messed up size is downright hilarious.  Not only was the math botched,
but the "enabled" field that pushes the struct beyond a cache line is completely
unnecessary.

AFAICT, KVM (the host side) has *never* read kvm_vcpu_pv_apf_data.enabled, and
the documentation clearly states that enabling is based solely on the bit in the
synthetic MSR.

So rather than update the documentation, what if we fix the goof?  KVM-as-a-host
obviously doesn't enforce anything or consume the size, and changing the header
will only affect guests that are rebuilt against the new header, so there's no
chance of ABI breakage between KVM and its guests.  The only possible breakage
is if some other hypervisor is emulating KVM's async #PF (LOL) and relies on the
guest to set kvm_vcpu_pv_apf_data.enabled.  But (a) I highly doubt such a hypervisor
exists, (b) that would arguably be a violation of KVM's "spec", and (c) the worst
case scenario is that the guest would simply lose async #PF functionality.
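
For context, a minimal sketch of what enabling keyed solely off the MSR
bit looks like from the guest side; the MSR and flag names are the real
uapi ones, but the helpers themselves are hypothetical, not the actual
kvm.c code:

	#include <linux/types.h>
	#include <asm/msr.h>		/* wrmsrl() */
	#include <asm/kvm_para.h>	/* MSR_KVM_ASYNC_PF_EN, KVM_ASYNC_PF_ENABLED */

	/* Hypothetical helpers: the host keys async #PF entirely off bit 0
	 * of the synthetic MSR and never reads kvm_vcpu_pv_apf_data.enabled.
	 */
	static void sketch_enable_async_pf(u64 pa)
	{
		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
	}

	static void sketch_disable_async_pf(void)
	{
		wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
	}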

> Fix kvm_gfn_to_hva_cache_init() to use the correct size, even though KVM
> only touches the first 8 bytes.

This isn't a fix.  There's actually meaningful value in precisely initializing the
cache as it guards against KVM writing into the padding, e.g. this WARN would fire:

	if (WARN_ON_ONCE(len + offset > ghc->len))
		return -EINVAL;

So it's a bit odd, but I would prefer to keep the current behavior of mapping only
the first 8 bytes.
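
As a compile-time illustration (assuming the current uapi layout) of why
8 bytes is the precise bound:

	#include <stddef.h>
	#include <asm/kvm_para.h>	/* struct kvm_vcpu_pv_apf_data */

	/* KVM only ever writes 'flags' (bytes 0-3) and 'token' (bytes 4-7),
	 * so a cache initialized with ghc->len == sizeof(u64) makes any
	 * stray write into pad[] trip the check quoted above.
	 */
	_Static_assert(offsetof(struct kvm_vcpu_pv_apf_data, token)
		       + sizeof(((struct kvm_vcpu_pv_apf_data *)0)->token) == 8,
		       "flags + token exactly fill the 8 bytes KVM maps");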

Here's what I'm thinking to clean up the enabled field (compile tested only,
haven't touched the docs other than the obvious removal):

---
 Documentation/virt/kvm/x86/msr.rst   |  1 -
 arch/x86/include/uapi/asm/kvm_para.h |  1 -
 arch/x86/kernel/kvm.c                | 11 ++++++-----
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/x86/msr.rst b/Documentation/virt/kvm/x86/msr.rst
index 9315fc385fb0..f6d70f99a1a7 100644
--- a/Documentation/virt/kvm/x86/msr.rst
+++ b/Documentation/virt/kvm/x86/msr.rst
@@ -204,7 +204,6 @@ data:
 		__u32 token;
 
 		__u8 pad[56];
-		__u32 enabled;
 	  };
 
 	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 6e64b27b2c1e..605899594ebb 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -142,7 +142,6 @@ struct kvm_vcpu_pv_apf_data {
 	__u32 token;
 
 	__u8 pad[56];
-	__u32 enabled;
 };
 
 #define KVM_PV_EOI_BIT 0
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b8ab9ee5896c..2cd5f8d248a5 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -65,6 +65,7 @@ static int __init parse_no_stealacc(char *arg)
 
 early_param("no-steal-acc", parse_no_stealacc);
 
+static DEFINE_PER_CPU_READ_MOSTLY(bool, async_pf_enabled);
 static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 DEFINE_PER_CPU_DECRYPTED(struct kvm_steal_time, steal_time) __aligned(64) __visible;
 static int has_steal_clock = 0;
@@ -244,7 +245,7 @@ noinstr u32 kvm_read_and_reset_apf_flags(void)
 {
 	u32 flags = 0;
 
-	if (__this_cpu_read(apf_reason.enabled)) {
+	if (__this_cpu_read(async_pf_enabled)) {
 		flags = __this_cpu_read(apf_reason.flags);
 		__this_cpu_write(apf_reason.flags, 0);
 	}
@@ -295,7 +296,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_asyncpf_interrupt)
 
 	inc_irq_stat(irq_hv_callback_count);
 
-	if (__this_cpu_read(apf_reason.enabled)) {
+	if (__this_cpu_read(async_pf_enabled)) {
 		token = __this_cpu_read(apf_reason.token);
 		kvm_async_pf_task_wake(token);
 		__this_cpu_write(apf_reason.token, 0);
@@ -362,7 +363,7 @@ static void kvm_guest_cpu_init(void)
 		wrmsrl(MSR_KVM_ASYNC_PF_INT, HYPERVISOR_CALLBACK_VECTOR);
 
 		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa);
-		__this_cpu_write(apf_reason.enabled, 1);
+		__this_cpu_write(async_pf_enabled, true);
 		pr_debug("setup async PF for cpu %d\n", smp_processor_id());
 	}
 
@@ -383,11 +384,11 @@ static void kvm_guest_cpu_init(void)
 
 static void kvm_pv_disable_apf(void)
 {
-	if (!__this_cpu_read(apf_reason.enabled))
+	if (!__this_cpu_read(async_pf_enabled))
 		return;
 
 	wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
-	__this_cpu_write(apf_reason.enabled, 0);
+	__this_cpu_write(async_pf_enabled, false);
 
 	pr_debug("disable async PF for cpu %d\n", smp_processor_id());
 }

base-commit: 437bba5ad2bba00c2056c896753a32edf80860cc
--
  
Xiaoyao Li Oct. 18, 2023, 2:46 p.m. UTC | #2
On 10/18/2023 1:28 AM, Sean Christopherson wrote:
> On Fri, Oct 13, 2023, Xiaoyao Li wrote:
>> The size of struct kvm_vcpu_pv_apf_data is 68 bytes, not 64 bytes.
> 
> LOL, the messed up size is downright hilarious.  Not only was the math botched,
> but the "enabled" field that pushes the struct beyond a cache line is completely
> unnecessary.
> 
> AFAICT, KVM (the host side) has *never* read kvm_vcpu_pv_apf_data.enabled, and
> the documentation clearly states that enabling is based solely on the bit in the
> synthetic MSR.
> 
> So rather than update the documentation, what if we fix the goof?  KVM-as-a-host
> obviously doesn't enforce anything or consume the size, and changing the header
> will only affect guests that are rebuilt against the new header, so there's no
> chance of ABI breakage between KVM and its guests.  The only possible breakage
> is if some other hypervisor is emulating KVM's async #PF (LOL) and relies on the
> guest to set kvm_vcpu_pv_apf_data.enabled.  But (a) I highly doubt such a hypervisor
> exists, (b) that would arguably be a violation of KVM's "spec", and (c) the worst
> case scenario is that the guest would simply lose async #PF functionality.
> 
>> Fix kvm_gfn_to_hva_cache_init() to use the correct size, even though KVM
>> only touches the first 8 bytes.
> 
> This isn't a fix.  There's actually meaningful value in precisely initializing the
> cache as it guards against KVM writing into the padding, e.g. this WARN would fire:
> 
> 	if (WARN_ON_ONCE(len + offset > ghc->len))
> 		return -EINVAL;
> 
> So it's a bit odd, but I would prefer to keep the current behavior of mapping only
> the first 8 bytes.
> 
> Here's what I'm thinking to clean up the enabled field (compile tested only,
> haven't touched the docs other than the obvious removal):

It looks better.

Will you send out a formal patch yourself, or leave it to me?

> ---
>   Documentation/virt/kvm/x86/msr.rst   |  1 -
>   arch/x86/include/uapi/asm/kvm_para.h |  1 -
>   arch/x86/kernel/kvm.c                | 11 ++++++-----
>   3 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/x86/msr.rst b/Documentation/virt/kvm/x86/msr.rst
> index 9315fc385fb0..f6d70f99a1a7 100644
> --- a/Documentation/virt/kvm/x86/msr.rst
> +++ b/Documentation/virt/kvm/x86/msr.rst
> @@ -204,7 +204,6 @@ data:
>   		__u32 token;
>   
>   		__u8 pad[56];
> -		__u32 enabled;
>   	  };
>   
>   	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 6e64b27b2c1e..605899594ebb 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -142,7 +142,6 @@ struct kvm_vcpu_pv_apf_data {
>   	__u32 token;
>   
>   	__u8 pad[56];
> -	__u32 enabled;
>   };
>   
>   #define KVM_PV_EOI_BIT 0
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index b8ab9ee5896c..2cd5f8d248a5 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -65,6 +65,7 @@ static int __init parse_no_stealacc(char *arg)
>   
>   early_param("no-steal-acc", parse_no_stealacc);
>   
> +static DEFINE_PER_CPU_READ_MOSTLY(bool, async_pf_enabled);
>   static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
>   DEFINE_PER_CPU_DECRYPTED(struct kvm_steal_time, steal_time) __aligned(64) __visible;
>   static int has_steal_clock = 0;
> @@ -244,7 +245,7 @@ noinstr u32 kvm_read_and_reset_apf_flags(void)
>   {
>   	u32 flags = 0;
>   
> -	if (__this_cpu_read(apf_reason.enabled)) {
> +	if (__this_cpu_read(async_pf_enabled)) {
>   		flags = __this_cpu_read(apf_reason.flags);
>   		__this_cpu_write(apf_reason.flags, 0);
>   	}
> @@ -295,7 +296,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_asyncpf_interrupt)
>   
>   	inc_irq_stat(irq_hv_callback_count);
>   
> -	if (__this_cpu_read(apf_reason.enabled)) {
> +	if (__this_cpu_read(async_pf_enabled)) {
>   		token = __this_cpu_read(apf_reason.token);
>   		kvm_async_pf_task_wake(token);
>   		__this_cpu_write(apf_reason.token, 0);
> @@ -362,7 +363,7 @@ static void kvm_guest_cpu_init(void)
>   		wrmsrl(MSR_KVM_ASYNC_PF_INT, HYPERVISOR_CALLBACK_VECTOR);
>   
>   		wrmsrl(MSR_KVM_ASYNC_PF_EN, pa);
> -		__this_cpu_write(apf_reason.enabled, 1);
> +		__this_cpu_write(async_pf_enabled, true);
>   		pr_debug("setup async PF for cpu %d\n", smp_processor_id());
>   	}
>   
> @@ -383,11 +384,11 @@ static void kvm_guest_cpu_init(void)
>   
>   static void kvm_pv_disable_apf(void)
>   {
> -	if (!__this_cpu_read(apf_reason.enabled))
> +	if (!__this_cpu_read(async_pf_enabled))
>   		return;
>   
>   	wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
> -	__this_cpu_write(apf_reason.enabled, 0);
> +	__this_cpu_write(async_pf_enabled, false);
>   
>   	pr_debug("disable async PF for cpu %d\n", smp_processor_id());
>   }
> 
> base-commit: 437bba5ad2bba00c2056c896753a32edf80860cc
  
Sean Christopherson Oct. 18, 2023, 3:12 p.m. UTC | #3
On Wed, Oct 18, 2023, Xiaoyao Li wrote:
> On 10/18/2023 1:28 AM, Sean Christopherson wrote:
> > On Fri, Oct 13, 2023, Xiaoyao Li wrote:
> > > Fix kvm_gfn_to_hva_cache_init() to use the correct size, even though KVM
> > > only touches the first 8 bytes.
> > 
> > This isn't a fix.  There's actually meaningful value in precisely initializing the
> > cache as it guards against KVM writing into the padding, e.g. this WARN would fire:
> > 
> > 	if (WARN_ON_ONCE(len + offset > ghc->len))
> > 		return -EINVAL;
> > 
> > So it's a bit odd, but I would prefer to keep the current behavior of mapping only
> > the first 8 bytes.
> > 
> > Here's what I'm thinking to clean up the enabled field (compile tested only,
> > haven't touched the docs other than the obvious removal):
> 
> It looks better.
> 
> Will you send out a formal patch yourself, or leave it to me?

Your call, I don't have a preference.  Just let me know which option you choose.
  

Patch

diff --git a/Documentation/virt/kvm/x86/msr.rst b/Documentation/virt/kvm/x86/msr.rst
index 9315fc385fb0..27bcd49e46b9 100644
--- a/Documentation/virt/kvm/x86/msr.rst
+++ b/Documentation/virt/kvm/x86/msr.rst
@@ -192,9 +192,9 @@  MSR_KVM_ASYNC_PF_EN:
 data:
 	Asynchronous page fault (APF) control MSR.
 
-	Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
-	which must be in guest RAM and must be zeroed. This memory is expected
-	to hold a copy of the following structure::
+	Bits 63-6 hold 64-byte aligned physical address of a 68 byte memory
+	area which must be in guest RAM. This memory is expected to hold a copy
+	of the following structure::
 
 	  struct kvm_vcpu_pv_apf_data {
 		/* Used for 'page not present' events delivered via #PF */
@@ -220,7 +220,7 @@  data:
 	#PF exception. During delivery of these events APF CR2 register contains
 	a token that will be used to notify the guest when missing page becomes
 	available. Also, to make it possible to distinguish between real #PF and
-	APF, first 4 bytes of 64 byte memory location ('flags') will be written
+	APF, first 4 bytes of 68 byte memory location ('flags') will be written
 	to by the hypervisor at the time of injection. Only first bit of 'flags'
 	is currently supported, when set, it indicates that the guest is dealing
 	with asynchronous 'page not present' event. If during a page fault APF
@@ -232,14 +232,14 @@  data:
 	as regular page fault, guest must reset 'flags' to '0' before it does
 	something that can generate normal page fault.
 
-	Bytes 5-7 of 64 byte memory location ('token') will be written to by the
+	Bytes 4-7 of 68 byte memory location ('token') will be written to by the
 	hypervisor at the time of APF 'page ready' event injection. The content
-	of these bytes is a token which was previously delivered as 'page not
-	present' event. The event indicates the page in now available. Guest is
-	supposed to write '0' to 'token' when it is done handling 'page ready'
-	event and to write 1' to MSR_KVM_ASYNC_PF_ACK after clearing the location;
-	writing to the MSR forces KVM to re-scan its queue and deliver the next
-	pending notification.
+	of these bytes is a token which was previously delivered in CR2 as
+	'page not present' event. The event indicates the page is now available.
+	Guest is supposed to write '0' to 'token' when it is done handling
+	'page ready' event and to write '1' to MSR_KVM_ASYNC_PF_ACK after
+	clearing the location; writing to the MSR forces KVM to re-scan its
+	queue and deliver the next pending notification.
 
 	Note, MSR_KVM_ASYNC_PF_INT MSR specifying the interrupt vector for 'page
 	ready' APF delivery needs to be written to before enabling APF mechanism
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9f18b06bbda6..fc253d54cbd3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3427,7 +3427,7 @@  static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 	}
 
 	if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apf.data, gpa,
-					sizeof(u64)))
+					sizeof(struct kvm_vcpu_pv_apf_data)))
 		return 1;
 
 	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
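
(A minimal sketch of the 'page ready' flow the documentation hunk above
describes; the MSR and kvm_async_pf_task_wake() are the real names from
arch/x86/kernel/kvm.c, but this handler shape is illustrative only:)

	#include <linux/types.h>
	#include <linux/compiler.h>	/* READ_ONCE(), WRITE_ONCE() */
	#include <asm/msr.h>		/* wrmsrl() */
	#include <asm/kvm_para.h>	/* MSR_KVM_ASYNC_PF_ACK */

	/* Illustrative: read the token the hypervisor wrote, wake the task
	 * waiting on that page, zero 'token', then write '1' to
	 * MSR_KVM_ASYNC_PF_ACK so KVM re-scans its queue for the next
	 * pending notification.
	 */
	static void sketch_handle_page_ready(struct kvm_vcpu_pv_apf_data *apf)
	{
		u32 token = READ_ONCE(apf->token);

		if (token) {
			kvm_async_pf_task_wake(token);
			WRITE_ONCE(apf->token, 0);
			wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);
		}
	}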