[0/2] Fix the duplicate PMI injections in vPMU

Message ID 20230925173448.3518223-1-mizhang@google.com

Commit Message

Mingwei Zhang Sept. 25, 2023, 5:34 p.m. UTC
  When stress testing KVM vPMU with Intel VTune, we see the following warning
in the guest VM's kernel log:

[ 1437.487320] Uhhuh. NMI received for unknown reason 20 on CPU 3.
[ 1437.487330] Dazed and confused, but trying to continue

The Problem
===========

The above messages indicate that more NMIs were injected than the guest
can account for. After a month of investigation, we discovered that the
bug is caused by minor glitches in two separate parts of KVM: 1) KVM
vPMU mistakenly fires a PMI for an emulated counter overflow even though
the overflow has already been raised by the PMI handler on the host [1].
2) KVM's APIC emulation allows multiple PMI injections at one VM-entry,
which violates the Intel SDM. Both glitches contribute extra PMI
injections, which confuse the PMI handler in the guest VM and cause the
above warning messages.

The Fixes
=========

The patches fundamentally disallow multi-PMI injection at the APIC
level. In addition, they simplify the PMI injection path by removing the
irq_work callback and relying solely on KVM_REQ_PMI.
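Concretely, the two fixes take roughly the following shape (a sketch
based on the patch descriptions; exact placement and variable names are
approximate, not the literal diffs):

```c
/* 1) In kvm_apic_local_deliver(): after delivering a PMI, mask LVTPC,
 *    as a real local APIC does per the SDM.
 */
if (r && lvt_type == APIC_LVTPC)
	kvm_lapic_set_reg(apic, APIC_LVTPC, reg | APIC_LVT_MASKED);

/* 2) In __kvm_perf_overflow(): drop the irq_work self-IPI and rely
 *    solely on the request; a matching KVM_REQ_PMI check added to
 *    kvm_vcpu_has_events() keeps a halted vCPU from sleeping through
 *    a pending PMI.
 */
kvm_make_request(KVM_REQ_PMI, vcpu);
```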

The Testing
===========

With the series applied, we no longer see the above warning messages
when stress testing a VM with Intel VTune. In addition, with some extra
kernel prints, we verified that every emulated counter overflow happens
when the hardware counter value is 0 and the emulated counter value is 1
(i.e., prev_counter is -1). We never observed the unexpected
prev_counter values reported in [2].

Note that this series does break the upstream kvm-unit-tests/pmu with the
following error:

FAIL: Intel: emulated instruction: instruction counter overflow
FAIL: Intel: full-width writes: emulated instruction: instruction counter overflow

This is a test bug; applying the following diff (shown in the Patch
section below) should fix the issue:

We will post the above change soon.

[1] commit 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
[2] https://lore.kernel.org/all/CAL715WL9T8Ucnj_1AygwMgDjOJrttNZHRP9o-KUNfpx1aYZnog@mail.gmail.com/

Versioning
==========

This series is v1. Changes from v0:
 - Drop Dapeng's Reviewed-by, since the code changed.
 - Apply the fix-up in kvm_apic_local_deliver(). [seanjc]
 - Remove pmc->prev_counter. [seanjc]

Previous version (v0) shown as follows:
 - [APIC patches v0]: https://lore.kernel.org/all/20230901185646.2823254-1-jmattson@google.com/
 - [vPMU patch v0]: https://lore.kernel.org/all/ZQ4A4KaSyygKHDUI@google.com/

Jim Mattson (2):
  KVM: x86: Synthesize at most one PMI per VM-exit
  KVM: x86: Mask LVTPC when handling a PMI

 arch/x86/include/asm/kvm_host.h |  1 -
 arch/x86/kvm/lapic.c            |  8 ++++++--
 arch/x86/kvm/pmu.c              | 27 +--------------------------
 arch/x86/kvm/x86.c              |  3 +++
 4 files changed, 10 insertions(+), 29 deletions(-)


base-commit: 6de2ccc169683bf81feba163834dae7cdebdd826
  

Comments

Mingwei Zhang Sept. 25, 2023, 7:33 p.m. UTC | #1
On Mon, Sep 25, 2023 at 10:59 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Sep 25, 2023, Mingwei Zhang wrote:
> > From: Jim Mattson <jmattson@google.com>
> >
> > When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
> > VM-exit that also invokes __kvm_perf_overflow() as a result of
> > instruction emulation, kvm_pmu_deliver_pmi() will be called twice
> > before the next VM-entry.
> >
> > That shouldn't be a problem. The local APIC is supposed to
> > automatically set the mask flag in LVTPC when it handles a PMI, so the
> > second PMI should be inhibited. However, KVM's local APIC emulation
> > fails to set the mask flag in LVTPC when it handles a PMI, so two PMIs
> > are delivered via the local APIC. In the common case, where LVTPC is
> > configured to deliver an NMI, the first NMI is vectored through the
> > guest IDT, and the second one is held pending. When the NMI handler
> > returns, the second NMI is vectored through the IDT. For Linux guests,
> > this results in the "dazed and confused" spurious NMI message.
> >
> > Though the obvious fix is to set the mask flag in LVTPC when handling
> > a PMI, KVM's logic around synthesizing a PMI is unnecessarily
> > convoluted.
>
> Unless Jim outright objects, I strongly prefer placing this patch second, with
> the above two paragraphs replaced with my suggestion (or something similar):
>
>   Calling kvm_pmu_deliver_pmi() twice is unlikely to be problematic now that
>   KVM sets the LVTPC mask bit when delivering a PMI.  But using IRQ work to
>   trigger the PMI is still broken, albeit very theoretically.
>
>   E.g. if the self-IPI to trigger IRQ work is delayed long enough for the
>   vCPU to be migrated to a different pCPU, then it's possible for
>   kvm_pmi_trigger_fn() to race with the kvm_pmu_deliver_pmi() from
>   KVM_REQ_PMI and still generate two PMIs.
>
>   KVM could set the mask bit using an atomic operation, but that'd just be
>   piling on unnecessary code to work around what is effectively a hack.  The
>   *only* reason KVM uses IRQ work is to ensure the PMI is treated as a wake
>   event, e.g. if the vCPU just executed HLT.
>
> I understand Jim's desire for the patch to be more obviously valuable, but the
> people that need convincing are already convinced that the patch is worth taking.
>
> > Remove the irq_work callback for synthesizing a PMI, and all of the
> > logic for invoking it. Instead, to prevent a vcpu from leaving C0 with
> > a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events().
> >
> > Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> > Signed-off-by: Jim Mattson <jmattson@google.com>
> > Tested-by: Mingwei Zhang <mizhang@google.com>
> > Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>
> Needs your SoB

Signed-off-by: Mingwei Zhang <mizhang@google.com>
  
Sean Christopherson Sept. 25, 2023, 9:28 p.m. UTC | #2
On Mon, Sep 25, 2023, Mingwei Zhang wrote:
> On Mon, Sep 25, 2023 at 10:59 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Sep 25, 2023, Mingwei Zhang wrote:
> > > From: Jim Mattson <jmattson@google.com>
> > >
> > > When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
> > > VM-exit that also invokes __kvm_perf_overflow() as a result of
> > > instruction emulation, kvm_pmu_deliver_pmi() will be called twice
> > > before the next VM-entry.
> > >
> > > That shouldn't be a problem. The local APIC is supposed to
> > > automatically set the mask flag in LVTPC when it handles a PMI, so the
> > > second PMI should be inhibited. However, KVM's local APIC emulation
> > > fails to set the mask flag in LVTPC when it handles a PMI, so two PMIs
> > > are delivered via the local APIC. In the common case, where LVTPC is
> > > configured to deliver an NMI, the first NMI is vectored through the
> > > guest IDT, and the second one is held pending. When the NMI handler
> > > returns, the second NMI is vectored through the IDT. For Linux guests,
> > > this results in the "dazed and confused" spurious NMI message.
> > >
> > > Though the obvious fix is to set the mask flag in LVTPC when handling
> > > a PMI, KVM's logic around synthesizing a PMI is unnecessarily
> > > convoluted.
> >
> > Unless Jim outright objects, I strongly prefer placing this patch second, with
> > the above two paragraphs replaced with my suggestion (or something similar):
> >
> >   Calling kvm_pmu_deliver_pmi() twice is unlikely to be problematic now that
> >   KVM sets the LVTPC mask bit when delivering a PMI.  But using IRQ work to
> >   trigger the PMI is still broken, albeit very theoretically.
> >
> >   E.g. if the self-IPI to trigger IRQ work is delayed long enough for the
> >   vCPU to be migrated to a different pCPU, then it's possible for
> >   kvm_pmi_trigger_fn() to race with the kvm_pmu_deliver_pmi() from
> >   KVM_REQ_PMI and still generate two PMIs.
> >
> >   KVM could set the mask bit using an atomic operation, but that'd just be
> >   piling on unnecessary code to work around what is effectively a hack.  The
> >   *only* reason KVM uses IRQ work is to ensure the PMI is treated as a wake
> >   event, e.g. if the vCPU just executed HLT.
> >
> > I understand Jim's desire for the patch to be more obviously valuable, but the
> > people that need convincing are already convinced that the patch is worth taking.
> >
> > > Remove the irq_work callback for synthesizing a PMI, and all of the
> > > logic for invoking it. Instead, to prevent a vcpu from leaving C0 with
> > > a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events().
> > >
> > > Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> > > Signed-off-by: Jim Mattson <jmattson@google.com>
> > > Tested-by: Mingwei Zhang <mizhang@google.com>
> > > Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >
> > Needs your SoB
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>

Thanks!

Jim gave his blessing off-list for swapping the order, I'll do that and massage
the changelogs when applying, i.e. no need for a v3.
  
Sean Christopherson Sept. 28, 2023, 4:41 p.m. UTC | #3
On Mon, 25 Sep 2023 17:34:45 +0000, Mingwei Zhang wrote:
> When we do stress test on KVM vPMU using Intel vtune, we find the following
> warning kernel message in the guest VM:
> 
> [ 1437.487320] Uhhuh. NMI received for unknown reason 20 on CPU 3.
> [ 1437.487330] Dazed and confused, but trying to continue
> 
> The Problem
> ===========
> 
> [...]

Applied to kvm-x86 pmu, with the order swapped and a bit of changelog massaging.
Thanks!

[1/2] KVM: x86: Mask LVTPC when handling a PMI
      https://github.com/kvm-x86/linux/commit/a16eb25b09c0
[2/2] KVM: x86/pmu: Synthesize at most one PMI per VM-exit
      https://github.com/kvm-x86/linux/commit/73554b29bd70

--
https://github.com/kvm-x86/linux/tree/next
  

Patch

diff --git a/x86/pmu.c b/x86/pmu.c
index 0def2869..667e6233 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -68,6 +68,7 @@ volatile uint64_t irq_received;
 static void cnt_overflow(isr_regs_t *regs)
 {
 	irq_received++;
+	apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & ~APIC_LVT_MASKED);
 	apic_write(APIC_EOI, 0);
 }