KVM: x86/intr: Explicitly check NMI from guest to eliminate false positives

Message ID 20231204074535.9567-1-likexu@tencent.com
State New
Headers
Series KVM: x86/intr: Explicitly check NMI from guest to eliminate false positives |

Commit Message

Like Xu Dec. 4, 2023, 7:45 a.m. UTC
  From: Like Xu <likexu@tencent.com>

Explicitly checking the source of external interrupt is indeed NMI and not
other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false
positive samples generated after vm-exit but before kvm_before_interrupt()
from being incorrectly labelled as guest samples:

# test: perf-record + cpu-cycles:HP (which collects host-only precise samples)
# Symbol                                   Overhead       sys       usr  guest sys  guest usr
# .......................................  ........  ........  ........  .........  .........
#
# Before:
  [g] entry_SYSCALL_64                       24.63%     0.00%     0.00%     24.63%      0.00%
  [g] syscall_return_via_sysret              23.23%     0.00%     0.00%     23.23%      0.00%
  [g] files_lookup_fd_raw                     6.35%     0.00%     0.00%      6.35%      0.00%
# After:
  [k] perf_adjust_freq_unthr_context         57.23%    57.23%     0.00%      0.00%      0.00%
  [k] __vmx_vcpu_run                          4.09%     4.09%     0.00%      0.00%      0.00%
  [k] vmx_update_host_rsp                     3.17%     3.17%     0.00%      0.00%      0.00%

In the above case, perf records the samples labelled '[g]', the RIPs behind
the weird samples are actually being queried by perf_instruction_pointer()
after determining whether it's in GUEST state or not, and here's the issue:

If vm-exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and
at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest()
will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt().

During this window, if a PMI occurs on host (since the KVM instructions on
host are being executed), the control flow, with the help of the host NMI
context, will be transferred to perf/core to generate performance samples,
thus perf_instruction_pointer() and perf_guest_get_ip() is called.

Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may
cause perf/core to mistakenly assume that the source RIP of the host NMI
belongs to the guest world and use perf_guest_get_ip() to get the RIP of
a vCPU that has already exited by a non-NMI interrupt.

Error samples are recorded and presented to the end-user via perf-report.
Such false positive samples could be eliminated by explicitly determining
if the exit reason is KVM_HANDLING_NMI.

Note that when vm-exit is indeed triggered by PMI and before HANDLING_NMI
is cleared, it's also still possible that another PMI is generated on host.
In this case, perf/core should generate two samples, belonging to host and
guest separately, but that's perf/core's story.

Fixes: 73cd107b9685 ("KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable")
Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/include/asm/kvm_host.h | 9 ++++++++-
 arch/x86/kvm/x86.h              | 6 ------
 2 files changed, 8 insertions(+), 7 deletions(-)


base-commit: 1ab097653e4dd8d23272d028a61352c23486fd4a
  

Comments

Sean Christopherson Dec. 4, 2023, 3:55 p.m. UTC | #1
On Mon, Dec 04, 2023, Like Xu wrote:
> From: Like Xu <likexu@tencent.com>
> 
> Explicitly checking the source of external interrupt is indeed NMI and not
> other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false
> positive samples generated after vm-exit but before kvm_before_interrupt()
> from being incorrectly labelled as guest samples:

...

> Fixes: 73cd107b9685 ("KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable")

The behavior is deliberate, and was added by commit dd60d217062f ("KVM: x86: Fix
perf timer mode IP reporting").  *If* we want to undo that, then the best "fix"
would be to effective reverting that commit by dropping the IRQ usage of
kvm_before_interrupt() and renaming the helpers kvm_{before,after}_nmi().  But
my understanding is that the behavior is necessary for select PMU usage.
  
Like Xu Dec. 5, 2023, 6:23 a.m. UTC | #2
On 4/12/2023 11:55 pm, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Like Xu wrote:
>> From: Like Xu <likexu@tencent.com>
>>
>> Explicitly checking the source of external interrupt is indeed NMI and not
>> other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false
>> positive samples generated after vm-exit but before kvm_before_interrupt()
>> from being incorrectly labelled as guest samples:
> 
> ...
> 
>> Fixes: 73cd107b9685 ("KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable")
> 
> The behavior is deliberate, and was added by commit dd60d217062f ("KVM: x86: Fix
> perf timer mode IP reporting").  *If* we want to undo that, then the best "fix"
> would be to effective reverting that commit by dropping the IRQ usage of
> kvm_before_interrupt() and renaming the helpers kvm_{before,after}_nmi().  But
> my understanding is that the behavior is necessary for select PMU usage.

Thanks for your comment. Yes, the commit dd60d217062f should be tracked.

We don't have to undo the commit, and we also don't want to hurt
either of the perf/core use cases (including perf NMI and timer modes).

Thanks to the introduction of "enum kvm_intr_type", we can cover both cases
instead of sacrificing one of two modes, how about:

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c8c7e2475a18..5db607a564c6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1868,8 +1868,16 @@ static inline int kvm_arch_flush_remote_tlbs_range(struct 
kvm *kvm, gfn_t gfn,
  }
  #endif /* CONFIG_HYPERV */

+enum kvm_intr_type {
+	/* Values are arbitrary, but must be non-zero. */
+	KVM_HANDLING_IRQ = 1,
+	KVM_HANDLING_NMI,
+};
+
+/* Linux always use NMI for PMU. */
  #define kvm_arch_pmi_in_guest(vcpu) \
-	((vcpu) && (vcpu)->arch.handling_intr_from_guest)
+	((vcpu) && (vcpu)->arch.handling_intr_from_guest && \
+	 (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI)))

  void __init kvm_mmu_x86_module_init(void);
  int kvm_mmu_vendor_module_init(void);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 2f7e19166658..4dc38092d599 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -431,12 +431,6 @@ static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
  	return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
  }

-enum kvm_intr_type {
-	/* Values are arbitrary, but must be non-zero. */
-	KVM_HANDLING_IRQ = 1,
-	KVM_HANDLING_NMI,
-};
-
  static __always_inline void kvm_before_interrupt(struct kvm_vcpu *vcpu,
  						 enum kvm_intr_type intr)
  {
  
Sean Christopherson Dec. 6, 2023, 12:09 a.m. UTC | #3
On Tue, Dec 05, 2023, Like Xu wrote:
> On 4/12/2023 11:55 pm, Sean Christopherson wrote:
> > On Mon, Dec 04, 2023, Like Xu wrote:
> > > From: Like Xu <likexu@tencent.com>
> > > 
> > > Explicitly checking the source of external interrupt is indeed NMI and not
> > > other types in the kvm_arch_pmi_in_guest(), which prevents perf-kvm false
> > > positive samples generated after vm-exit but before kvm_before_interrupt()
> > > from being incorrectly labelled as guest samples:
> > 
> > ...
> > 
> > > Fixes: 73cd107b9685 ("KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable")
> > 
> > The behavior is deliberate, and was added by commit dd60d217062f ("KVM: x86: Fix
> > perf timer mode IP reporting").  *If* we want to undo that, then the best "fix"
> > would be to effective reverting that commit by dropping the IRQ usage of
> > kvm_before_interrupt() and renaming the helpers kvm_{before,after}_nmi().  But
> > my understanding is that the behavior is necessary for select PMU usage.
> 
> Thanks for your comment. Yes, the commit dd60d217062f should be tracked.
> 
> We don't have to undo the commit, and we also don't want to hurt
> either of the perf/core use cases (including perf NMI and timer modes).
> 
> Thanks to the introduction of "enum kvm_intr_type", we can cover both cases
> instead of sacrificing one of two modes, how about:

Hmm, yeah, that should work.  It's not the prettiest thing, but I don't see an
easy way to remedy that (I tried).  False positives are still possible, but it's
a clear improvement over what we have.

> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c8c7e2475a18..5db607a564c6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1868,8 +1868,16 @@ static inline int
> kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
>  }
>  #endif /* CONFIG_HYPERV */
> 
> +enum kvm_intr_type {
> +	/* Values are arbitrary, but must be non-zero. */
> +	KVM_HANDLING_IRQ = 1,
> +	KVM_HANDLING_NMI,
> +};
> +
> +/* Linux always use NMI for PMU. */
>  #define kvm_arch_pmi_in_guest(vcpu) \
> -	((vcpu) && (vcpu)->arch.handling_intr_from_guest)
> +	((vcpu) && (vcpu)->arch.handling_intr_from_guest && \
> +	 (in_nmi() == ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI)))
> 
>  void __init kvm_mmu_x86_module_init(void);
>  int kvm_mmu_vendor_module_init(void);
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 2f7e19166658..4dc38092d599 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -431,12 +431,6 @@ static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
>  	return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
>  }
> 
> -enum kvm_intr_type {
> -	/* Values are arbitrary, but must be non-zero. */
> -	KVM_HANDLING_IRQ = 1,
> -	KVM_HANDLING_NMI,
> -};
> -
>  static __always_inline void kvm_before_interrupt(struct kvm_vcpu *vcpu,
>  						 enum kvm_intr_type intr)
>  {
> -- 
> 2.43.0
> 
> I noticed that timer mode is used when perf-record uses SW events like
> "cpu-clock" event,
> with or without hw-PMU support. My side of the tests no longer show false
> positives and
> the above diff does not affect the use of perf timer mode as well. Any better move ?
  

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c8c7e2475a18..93e667f3099d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1868,8 +1868,15 @@  static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
 }
 #endif /* CONFIG_HYPERV */
 
+enum kvm_intr_type {
+	/* Values are arbitrary, but must be non-zero. */
+	KVM_HANDLING_IRQ = 1,
+	KVM_HANDLING_NMI,
+};
+
+/* Linux always use NMI for PMU. */
 #define kvm_arch_pmi_in_guest(vcpu) \
-	((vcpu) && (vcpu)->arch.handling_intr_from_guest)
+	((vcpu) && ((vcpu)->arch.handling_intr_from_guest == KVM_HANDLING_NMI))
 
 void __init kvm_mmu_x86_module_init(void);
 int kvm_mmu_vendor_module_init(void);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 2f7e19166658..4dc38092d599 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -431,12 +431,6 @@  static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
 	return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
 }
 
-enum kvm_intr_type {
-	/* Values are arbitrary, but must be non-zero. */
-	KVM_HANDLING_IRQ = 1,
-	KVM_HANDLING_NMI,
-};
-
 static __always_inline void kvm_before_interrupt(struct kvm_vcpu *vcpu,
 						 enum kvm_intr_type intr)
 {