KVM: SVM: vNMI pending bit is V_NMI_PENDING_MASK not V_NMI_BLOCKING_MASK

Message ID be4ca192eb0c1e69a210db3009ca984e6a54ae69.1684495380.git.maciej.szmigiero@oracle.com
State New
Series KVM: SVM: vNMI pending bit is V_NMI_PENDING_MASK not V_NMI_BLOCKING_MASK

Commit Message

Maciej S. Szmigiero May 19, 2023, 11:26 a.m. UTC
  From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

While testing Hyper-V enabled Windows Server 2019 guests on Zen4 hardware
I noticed that with a large enough vCPU count (> 16) they sometimes froze
at boot.
With a vCPU count of 64 they never booted successfully, suggesting some
kind of race condition.

Since adding the "vnmi=0" module parameter made these guests boot
successfully, it was clear that the problem was most likely (v)NMI-related.

Running kvm-unit-tests quickly showed failing NMI-related test cases, like
"multiple nmi" and "pending nmi" from the apic-split, x2apic and xapic
tests and the NMI parts of the eventinj test.

The issue was that once one NMI was being serviced no other NMI was allowed
to be set pending (NMI limit = 0), which was traced to
svm_is_vnmi_pending() wrongly testing for the "NMI blocked" flag rather
than for the "NMI pending" flag.
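
To make the "NMI limit = 0" symptom concrete, here is a minimal sketch in C.
It is a simplified model, not the kernel code: nmi_limit() and its
parameters are hypothetical names, and the "one slot while blocked, two
otherwise" rule is an assumption about how KVM's generic x86 code caps the
number of pending NMIs.

```c
#include <stdbool.h>

/*
 * Hypothetical, simplified model of the pending-NMI limit.
 *
 * nmi_blocked:  the guest is currently servicing an NMI
 * vnmi_pending: the value reported by svm_is_vnmi_pending()
 */
static inline unsigned int nmi_limit(bool nmi_blocked, bool vnmi_pending)
{
	/* Assumed rule: one slot while NMIs are blocked, two otherwise. */
	unsigned int limit = nmi_blocked ? 1 : 2;

	/* A vNMI already pending in hardware uses up one slot. */
	if (vnmi_pending)
		limit--;

	return limit;
}
```

While an NMI is in service and nothing is pending, the buggy helper
effectively passed the blocking flag as vnmi_pending, so
nmi_limit(true, true) yields 0 and no further NMI can be queued. With the
real pending flag, nmi_limit(true, false) yields 1.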

Fix this by testing for the right flag in svm_is_vnmi_pending().
Once this is done, the NMI-related kvm-unit-tests pass successfully and
the Windows guest no longer freezes at boot.

Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---

It's a bit sad that no-one apparently tested the vNMI patchset with
kvm-unit-tests on actual vNMI-enabled hardware...

 arch/x86/kvm/svm/svm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Sean Christopherson May 19, 2023, 3:51 p.m. UTC | #1
On Fri, May 19, 2023, Maciej S. Szmigiero wrote:
> [...]

Reviewed-by: Sean Christopherson <seanjc@google.com>

> ---
> 
> It's a bit sad that no-one apparently tested the vNMI patchset with
> kvm-unit-tests on actual vNMI-enabled hardware...

That's one way to put it.

Santosh, what happened?  This goof was present in both v3 and v4, i.e. it wasn't
something that we botched when applying/massaging at the last minute.  And the
cover letters for both v3 and v4 state "Series ... tested on AMD EPYC-Genoa".
  
Santosh Shukla May 20, 2023, 6:06 a.m. UTC | #2
Hi Sean and Maciej,

On 5/19/2023 9:21 PM, Sean Christopherson wrote:
> On Fri, May 19, 2023, Maciej S. Szmigiero wrote:
>> [...]
> 
> Reviewed-by: Sean Christopherson <seanjc@google.com>
> 
>> ---
>>
>> It's a bit sad that no-one apparently tested the vNMI patchset with
>> kvm-unit-tests on actual vNMI-enabled hardware...
> 
> That's one way to put it.
> 
> Santosh, what happened?  This goof was present in both v3 and v4, i.e. it wasn't
> something that we botched when applying/massaging at the last minute.  And the
> cover letters for both v3 and v4 state "Series ... tested on AMD EPYC-Genoa".

My bad: in the past I only ran svm_test with vNMI, using Sean's KUT branch
remotes/sean-kut/svm/vnmi_test, and saw that the vnmi test was passing.
Here are the logs:
---
PASS: vNMI enabled but NMI_INTERCEPT unset!
PASS: vNMI with vector 2 not injected
PASS: VNMI serviced
PASS: vnmi
--- 

However, when I ran the tests mentioned by Maciej, I do see the failures. Thanks for pointing this out.

Reviewed-by: Santosh Shukla <Santosh.Shukla@amd.com>

Best Regards,
Santosh
  
Maciej S. Szmigiero June 1, 2023, 5:53 p.m. UTC | #3
On 19.05.2023 17:51, Sean Christopherson wrote:
> On Fri, May 19, 2023, Maciej S. Szmigiero wrote:
>> [...]
> 
> Reviewed-by: Sean Christopherson <seanjc@google.com>
> 

I can't see this in kvm/kvm.git trees or the kvm-x86 ones on GitHub -
is this patch planned to be picked up for -rc5 soon?

Technically, just knowing the final commit ID would be sufficient for my
purposes.

Thanks,
Maciej
  
Sean Christopherson June 1, 2023, 6:04 p.m. UTC | #4
On Thu, Jun 01, 2023, Maciej S. Szmigiero wrote:
> On 19.05.2023 17:51, Sean Christopherson wrote:
> > On Fri, May 19, 2023, Maciej S. Szmigiero wrote:
> > > [...]
> > 
> > Reviewed-by: Sean Christopherson <seanjc@google.com>
> > 
> 
> I can't see this in kvm/kvm.git trees or the kvm-x86 ones on GitHub -
> is this patch planned to be picked up for -rc5 soon?
> 
> Technically, just knowing the final commit ID would be sufficient for my
> purposes.

If Paolo doesn't pick it up by tomorrow, I'll apply it and send a fixes pull
request for -rc5.
  
Maciej S. Szmigiero June 1, 2023, 6:05 p.m. UTC | #5
On 1.06.2023 20:04, Sean Christopherson wrote:
> On Thu, Jun 01, 2023, Maciej S. Szmigiero wrote:
>> On 19.05.2023 17:51, Sean Christopherson wrote:
>>> On Fri, May 19, 2023, Maciej S. Szmigiero wrote:
>>>> [...]
>>>
>>> Reviewed-by: Sean Christopherson <seanjc@google.com>
>>>
>>
>> I can't see this in kvm/kvm.git trees or the kvm-x86 ones on GitHub -
>> is this patch planned to be picked up for -rc5 soon?
>>
>> Technically, just knowing the final commit ID would be sufficient for my
>> purposes.
> 
> If Paolo doesn't pick it up by tomorrow, I'll apply it and send a fixes pull
> request for -rc5.

Thanks Sean.
  
Sean Christopherson June 3, 2023, 12:52 a.m. UTC | #6
On Fri, 19 May 2023 13:26:18 +0200, Maciej S. Szmigiero wrote:
> While testing Hyper-V enabled Windows Server 2019 guests on Zen4 hardware
> I noticed that with vCPU count large enough (> 16) they sometimes froze at
> boot.
> With vCPU count of 64 they never booted successfully - suggesting some kind
> of a race condition.
> 
> Since adding "vnmi=0" module parameter made these guests boot successfully
> it was clear that the problem is most likely (v)NMI-related.
> 
> [...]

Applied to kvm-x86 fixes, thanks!

[1/1] KVM: SVM: vNMI pending bit is V_NMI_PENDING_MASK not V_NMI_BLOCKING_MASK
      https://github.com/kvm-x86/linux/commit/b2ce89978889

--
https://github.com/kvm-x86/linux/tree/next
https://github.com/kvm-x86/linux/tree/fixes
  

Patch

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index ca32389f3c36..54089f990c8f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3510,7 +3510,7 @@  static bool svm_is_vnmi_pending(struct kvm_vcpu *vcpu)
 	if (!is_vnmi_enabled(svm))
 		return false;
 
-	return !!(svm->vmcb->control.int_ctl & V_NMI_BLOCKING_MASK);
+	return !!(svm->vmcb->control.int_ctl & V_NMI_PENDING_MASK);
 }
 
 static bool svm_set_vnmi_pending(struct kvm_vcpu *vcpu)
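
The effect of the one-line change can be checked outside the kernel with a
small standalone sketch. This is not part of the patch: the helper names
vnmi_pending_buggy() and vnmi_pending_fixed() are hypothetical, and the bit
positions used for the two masks (11 and 12) are an assumption based on a
reading of arch/x86/include/asm/svm.h, so verify them against the tree you
are working on.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions within the VMCB int_ctl field. */
#define V_NMI_PENDING_MASK   (1u << 11)  /* a vNMI is waiting to be delivered */
#define V_NMI_BLOCKING_MASK  (1u << 12)  /* the guest is inside an NMI handler */

/* Pre-patch behaviour: reports "pending" whenever NMIs are blocked. */
static inline bool vnmi_pending_buggy(uint32_t int_ctl)
{
	return !!(int_ctl & V_NMI_BLOCKING_MASK);
}

/* Post-patch behaviour: reports only the real pending flag. */
static inline bool vnmi_pending_fixed(uint32_t int_ctl)
{
	return !!(int_ctl & V_NMI_PENDING_MASK);
}
```

For an int_ctl value with only the blocking bit set (an NMI in service,
nothing pending), vnmi_pending_buggy() returns true while
vnmi_pending_fixed() returns false. The two agree only when both bits, or
neither, are set.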