[0/2] KVM: SVM: Set pCPU during IRTE update if vCPU is running

Message ID 20230808233132.2499764-1-seanjc@google.com

Message

Sean Christopherson Aug. 8, 2023, 11:31 p.m. UTC
Fix a bug where KVM doesn't set the pCPU affinity for running vCPUs when
updating IRTE routing.  Not setting the pCPU means the IOMMU will signal
the wrong pCPU's doorbell until the vCPU goes through a put+load cycle.
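
For readers who want the shape of the problem without reading avic.c, here
is a minimal userspace sketch. It is a toy model, not the kernel code: the
toy_* names, the with_fix flag, and the boolean stand-in for ir_list_lock
are all assumptions made for illustration. It only shows why an IRTE that is
attached while the vCPU is already running stays untargeted until the next
put+load, and how setting the pCPU at attach time (under the same lock that
load/put take) closes that window. It takes no position on what the IOMMU
does with an untargeted entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_PCPU   (-1)
#define MAX_IRTES 4

/* Toy stand-in for an interrupt remapping table entry (hypothetical). */
struct toy_irte {
	bool is_run;     /* entry currently targets a running vCPU */
	int  dest_pcpu;  /* pCPU whose doorbell would be signaled */
};

/* Toy stand-in for per-vCPU state (hypothetical). */
struct toy_vcpu {
	bool ir_list_locked;               /* models ir_list_lock being held */
	int  running_on;                   /* current pCPU, or NO_PCPU */
	struct toy_irte *irtes[MAX_IRTES]; /* entries posted to this vCPU */
	size_t nr_irtes;
};

static void toy_lock(struct toy_vcpu *v)
{
	assert(!v->ir_list_locked);
	v->ir_list_locked = true;
}

static void toy_unlock(struct toy_vcpu *v)
{
	assert(v->ir_list_locked);
	v->ir_list_locked = false;
}

/* Point every attached entry at the vCPU's current pCPU; lock must be held. */
static void toy_retarget_all(struct toy_vcpu *v)
{
	assert(v->ir_list_locked);
	for (size_t i = 0; i < v->nr_irtes; i++) {
		v->irtes[i]->is_run = (v->running_on != NO_PCPU);
		v->irtes[i]->dest_pcpu = v->running_on;
	}
}

void toy_vcpu_init(struct toy_vcpu *v)
{
	v->ir_list_locked = false;
	v->running_on = NO_PCPU;
	v->nr_irtes = 0;
}

/* Load: update the running pCPU and all entries under one lock hold. */
void toy_vcpu_load(struct toy_vcpu *v, int pcpu)
{
	assert(pcpu >= 0);
	toy_lock(v);
	v->running_on = pcpu;
	toy_retarget_all(v);
	toy_unlock(v);
}

/* Put: mark the vCPU as not running and untarget all entries. */
void toy_vcpu_put(struct toy_vcpu *v)
{
	toy_lock(v);
	v->running_on = NO_PCPU;
	toy_retarget_all(v);
	toy_unlock(v);
}

/*
 * Attach an entry to the vCPU.  Without the fix the entry is left as-is
 * until the next put+load; with the fix it is targeted immediately if the
 * vCPU is running.  Returns 0 on success, -1 if the list is full.
 */
int toy_irte_attach(struct toy_vcpu *v, struct toy_irte *irte, bool with_fix)
{
	int ret = -1;

	toy_lock(v);
	if (v->nr_irtes < MAX_IRTES) {
		v->irtes[v->nr_irtes++] = irte;
		if (with_fix && v->running_on != NO_PCPU) {
			irte->is_run = true;
			irte->dest_pcpu = v->running_on;
		}
		ret = 0;
	}
	toy_unlock(v);
	return ret;
}

/* pCPU that would get a direct doorbell, or NO_PCPU if none would. */
int toy_deliver(const struct toy_irte *irte)
{
	return irte->is_run ? irte->dest_pcpu : NO_PCPU;
}
```

In this model, attaching an entry with with_fix=false while the vCPU is
loaded leaves toy_deliver() reporting NO_PCPU until a put+load cycle, whereas
with_fix=true makes it report the current pCPU right away.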

I waffled for far too long between making this one patch or two.  Moving
the lock doesn't make all that much sense as a standalone patch, but in the
end, I decided that isolating the locking change would be useful in the
unlikely event that it breaks something.  If anyone feels strongly about
making this a single patch, I have no objection to squashing these together.

Sean Christopherson (2):
  KVM: SVM: Take and hold ir_list_lock when updating vCPU's Physical ID
    entry
  KVM: SVM: Set target pCPU during IRTE update if target vCPU is running

 arch/x86/kvm/svm/avic.c | 59 +++++++++++++++++++++++++++++++++++------
 1 file changed, 51 insertions(+), 8 deletions(-)


base-commit: 240f736891887939571854bd6d734b6c9291f22e
  

Comments

Joao Martins Aug. 9, 2023, 10:30 a.m. UTC | #1
On 09/08/2023 00:31, Sean Christopherson wrote:
> Fix a bug where KVM doesn't set the pCPU affinity for running vCPUs when
> updating IRTE routing.  Not setting the pCPU means the IOMMU will signal
> the wrong pCPU's doorbell until the vCPU goes through a put+load cycle.
> 

Or also framed as an inefficiency that we depend on the GALog (for a running
vCPU) for interrupt delivery until the put+load cycle happens. I don't think I
ever reproduced the missed interrupt case in our stress testing.

> I waffled for far too long between making this one patch or two.  Moving
> the lock doesn't make all that much sense as a standalone patch, but in the
> end, I decided that isolating the locking change would be useful in the
> unlikely event that it breaks something.  If anyone feels strongly about
> making this a single patch, I have no objection to squashing these together.
> 
IMHO, it looks better as two patches.

For what it's worth:

	Reviewed-by: Joao Martins <joao.m.martins@oracle.com>

I think Alejandro had reported his testing as successful here:

https://lore.kernel.org/kvm/caefe41b-2736-3df9-b5cd-b81fc4c30ff0@oracle.com/

OTOH, he didn't give the Tested-by explicitly
  
Sean Christopherson Aug. 9, 2023, 2:23 p.m. UTC | #2
On Wed, Aug 09, 2023, Joao Martins wrote:
> On 09/08/2023 00:31, Sean Christopherson wrote:
> > Fix a bug where KVM doesn't set the pCPU affinity for running vCPUs when
> > updating IRTE routing.  Not setting the pCPU means the IOMMU will signal
> > the wrong pCPU's doorbell until the vCPU goes through a put+load cycle.
> > 
> 
> Or also framed as an inefficiency that we depend on the GALog (for a running
> vCPU) for interrupt delivery until the put+load cycle happens. I don't think I
> ever reproduced the missed interrupt case in our stress testing.

Ah, I'll reword the changelog in patch 2 if this only delays the interrupt instead
of dropping it entirely.

> > I waffled for far too long between making this one patch or two.  Moving
> > the lock doesn't make all that much sense as a standalone patch, but in the
> > end, I decided that isolating the locking change would be useful in the
> > unlikely event that it breaks something.  If anyone feels strongly about
> > making this a single patch, I have no objection to squashing these together.
> > 
> IMHO, it looks better as two patches.
> 
> For what it's worth:
> 
> 	Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
> 
> I think Alejandro had reported his testing as successful here:
> 
> https://lore.kernel.org/kvm/caefe41b-2736-3df9-b5cd-b81fc4c30ff0@oracle.com/
> 
> OTOH, he didn't give the Tested-by explicitly

Yeah, I almost asked for a Tested-by, but figured it would be just as easy to
post the patches.
  
Alejandro Jimenez Aug. 9, 2023, 2:58 p.m. UTC | #3
On 8/9/23 10:23, Sean Christopherson wrote:
> On Wed, Aug 09, 2023, Joao Martins wrote:
>> On 09/08/2023 00:31, Sean Christopherson wrote:
>>> Fix a bug where KVM doesn't set the pCPU affinity for running vCPUs when
>>> updating IRTE routing.  Not setting the pCPU means the IOMMU will signal
>>> the wrong pCPU's doorbell until the vCPU goes through a put+load cycle.
>>>
>>
>> Or also framed as an inefficiency that we depend on the GALog (for a running
>> vCPU) for interrupt delivery until the put+load cycle happens. I don't think I
>> ever reproduced the missed interrupt case in our stress testing.

Right, I was never able to see any dropped interrupts when testing the baseline host kernel with "idle=poll" on my guest.
That said, I didn't reproduce Dengqiao's setup exactly; for example, their report implies using isolcpus in the host kernel parameters.

> 
> Ah, I'll reword the changelog in patch 2 if this only delays the interrupt instead
> of dropping it entirely.
> 
>>> I waffled for far too long between making this one patch or two.  Moving
>>> the lock doesn't make all that much sense as a standalone patch, but in the
>>> end, I decided that isolating the locking change would be useful in the
>>> unlikely event that it breaks something.  If anyone feels strongly about
>>> making this a single patch, I have no objection to squashing these together.
>>>
>> IMHO, it looks better as two patches.
>>
>> For what it's worth:
>>
>> 	Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
>>
>> I think Alejandro had reported his testing as successful here:
>>
>> https://lore.kernel.org/kvm/caefe41b-2736-3df9-b5cd-b81fc4c30ff0@oracle.com/
>>
>> OTOH, he didn't give the Tested-by explicitly
> 
> Yeah, I almost asked for a Tested-by, but figured it would be just as easy to
> post the patches.

I was hoping to find more time to test with other configs (i.e. more closely matching the original environment).
That being said, besides the positive results from the validation script mentioned earlier, I have been using the
patched kernel to launch guests in my setup for quite some time now without encountering any issues. From my side:

Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>