iommu/amd: Do not flush IRTE when only updating isRun and destination fields

Message ID 20231017144236.8287-1-suravee.suthikulpanit@amd.com
State New
Headers
Series iommu/amd: Do not flush IRTE when only updating isRun and destination fields |

Commit Message

Suravee Suthikulpanit Oct. 17, 2023, 2:42 p.m. UTC
  According to the recent update in the AMD IOMMU spec [1], the IsRun and
Destination fields of the Interrupt Remapping Table Entry (IRTE) are not
cached by the IOMMU hardware.

Therefore, do not issue the INVALIDATE_INTERRUPT_TABLE command when
updating IRTE[IsRun] and IRTE[Destination] when IRTE[GuestMode]=1, which
should help improve IOMMU AVIC/x2AVIC performance.

References:
[1] AMD IOMMU Spec Revision (Rev 3.08-PUB)
(Link: https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf)

Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
---
 drivers/iommu/amd/iommu.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)
  

Comments

Maxim Levitsky Oct. 17, 2023, 2:51 p.m. UTC | #1
У вт, 2023-10-17 у 09:42 -0500, Suravee Suthikulpanit пише:
> According to the recent update in the AMD IOMMU spec [1], the IsRun and
> Destination fields of the Interrupt Remapping Table Entry (IRTE) are not
> cached by the IOMMU hardware.

Is that true for all AMD hardware that supports AVIC? E.g Zen1/Zen2 hardware?

Is there a chance that this will cause a similar errata to the is_running
errata that Zen2 cpus have?



> 
> Therefore, do not issue the INVALIDATE_INTERRUPT_TABLE command when
> updating IRTE[IsRun] and IRTE[Destination] when IRTE[GuestMode]=1, which
> should help improve IOMMU AVIC/x2AVIC performance.
> 
> References:
> [1] AMD IOMMU Spec Revision (Rev 3.08-PUB)
> (Link: https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf)

Looks like the link is broken.


Best regards,
	Maxim Levitsky

> 
> Cc: Joao Martins <joao.m.martins@oracle.com>
> Cc: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
> ---
>  drivers/iommu/amd/iommu.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 089886485895..d63590563d3e 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2970,8 +2970,8 @@ static int alloc_irq_index(struct amd_iommu *iommu, u16 devid, int count,
>  	return index;
>  }
>  
> -static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
> -			  struct irte_ga *irte)
> +static int __modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
> +			    struct irte_ga *irte)
>  {
>  	struct irq_remap_table *table;
>  	struct irte_ga *entry;
> @@ -2998,6 +2998,18 @@ static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
>  
>  	raw_spin_unlock_irqrestore(&table->lock, flags);
>  
> +	return 0;
> +}
> +
> +static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
> +			  struct irte_ga *irte)
> +{
> +	bool ret;
> +
> +	ret = __modify_irte_ga(iommu, devid, index, irte);
> +	if (ret)
> +		return ret;
> +
>  	iommu_flush_irt_and_complete(iommu, devid);
>  
>  	return 0;
> @@ -3681,8 +3693,8 @@ int amd_iommu_update_ga(int cpu, bool is_run, void *data)
>  	}
>  	entry->lo.fields_vapic.is_run = is_run;
>  
> -	return modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
> -			      ir_data->irq_2_irte.index, entry);
> +	return __modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
> +				ir_data->irq_2_irte.index, entry);
>  }
>  EXPORT_SYMBOL(amd_iommu_update_ga);
>  #endif
  
Suravee Suthikulpanit Oct. 17, 2023, 3:36 p.m. UTC | #2
On 10/17/2023 9:51 PM, Maxim Levitsky wrote:
> У вт, 2023-10-17 у 09:42 -0500, Suravee Suthikulpanit пише:
>> According to the recent update in the AMD IOMMU spec [1], the IsRun and
>> Destination fields of the Interrupt Remapping Table Entry (IRTE) are not
>> cached by the IOMMU hardware.
> Is that true for all AMD hardware that supports AVIC? E.g Zen1/Zen2 hardware?

This is true for all AVIC/x2AVIC-capable IOMMU hardware in the past.

> Is there a chance that this will cause a similar errata to the is_running
> errata that Zen2 cpus have?

Please let me check on this and get back.

>> Therefore, do not issue the INVALIDATE_INTERRUPT_TABLE command when
>> updating IRTE[IsRun] and IRTE[Destination] when IRTE[GuestMode]=1, which
>> should help improve IOMMU AVIC/x2AVIC performance.
>>
>> References:
>> [1] AMD IOMMU Spec Revision (Rev 3.08-PUB)
>> (Link:https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf)
> Looks like the link is broken.

The link above is the default location for AMD IOMMU spec, (which is 
currently being fixed). In the mean time, here is the temporary link to 
the latest document.

(https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_3_07_PUB.pdf)

Thanks,
Suravee
  
Vasant Hegde Oct. 18, 2023, 6:07 a.m. UTC | #3
On 10/17/2023 8:12 PM, Suravee Suthikulpanit wrote:
> According to the recent update in the AMD IOMMU spec [1], the IsRun and
> Destination fields of the Interrupt Remapping Table Entry (IRTE) are not
> cached by the IOMMU hardware.
> 
> Therefore, do not issue the INVALIDATE_INTERRUPT_TABLE command when
> updating IRTE[IsRun] and IRTE[Destination] when IRTE[GuestMode]=1, which
> should help improve IOMMU AVIC/x2AVIC performance.
> 
> References:
> [1] AMD IOMMU Spec Revision (Rev 3.08-PUB)
> (Link: https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf)
> 
> Cc: Joao Martins <joao.m.martins@oracle.com>
> Cc: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>

Looks good to me.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>

-Vasant


> ---
>  drivers/iommu/amd/iommu.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 089886485895..d63590563d3e 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2970,8 +2970,8 @@ static int alloc_irq_index(struct amd_iommu *iommu, u16 devid, int count,
>  	return index;
>  }
>  
> -static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
> -			  struct irte_ga *irte)
> +static int __modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
> +			    struct irte_ga *irte)
>  {
>  	struct irq_remap_table *table;
>  	struct irte_ga *entry;
> @@ -2998,6 +2998,18 @@ static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
>  
>  	raw_spin_unlock_irqrestore(&table->lock, flags);
>  
> +	return 0;
> +}
> +
> +static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
> +			  struct irte_ga *irte)
> +{
> +	bool ret;
> +
> +	ret = __modify_irte_ga(iommu, devid, index, irte);
> +	if (ret)
> +		return ret;
> +
>  	iommu_flush_irt_and_complete(iommu, devid);
>  
>  	return 0;
> @@ -3681,8 +3693,8 @@ int amd_iommu_update_ga(int cpu, bool is_run, void *data)
>  	}
>  	entry->lo.fields_vapic.is_run = is_run;
>  
> -	return modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
> -			      ir_data->irq_2_irte.index, entry);
> +	return __modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
> +				ir_data->irq_2_irte.index, entry);
>  }
>  EXPORT_SYMBOL(amd_iommu_update_ga);
>  #endif
  
Alejandro Jimenez Oct. 18, 2023, 8:16 p.m. UTC | #4
On 10/17/23 10:42, Suravee Suthikulpanit wrote:
> According to the recent update in the AMD IOMMU spec [1], the IsRun and
> Destination fields of the Interrupt Remapping Table Entry (IRTE) are not
> cached by the IOMMU hardware.
> 
> Therefore, do not issue the INVALIDATE_INTERRUPT_TABLE command when
> updating IRTE[IsRun] and IRTE[Destination] when IRTE[GuestMode]=1, which
> should help improve IOMMU AVIC/x2AVIC performance.
> 
> References:
> [1] AMD IOMMU Spec Revision (Rev 3.08-PUB)
> (Link: https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf)
> 
> Cc: Joao Martins <joao.m.martins@oracle.com>
> Cc: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>

Built on top of v6.6-rc6 tag, and booted on EPYC Genoa host. Launched 380 vCPU guest with AVIC enabled and 16 VFs, and ran rds-ping workload that uses all of the VFs to send traffic. THis workload has been used in the past to stress IOMMU AVIC code paths. No issues found.

Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>

> ---
>   drivers/iommu/amd/iommu.c | 20 ++++++++++++++++----
>   1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 089886485895..d63590563d3e 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2970,8 +2970,8 @@ static int alloc_irq_index(struct amd_iommu *iommu, u16 devid, int count,
>   	return index;
>   }
>   
> -static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
> -			  struct irte_ga *irte)
> +static int __modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
> +			    struct irte_ga *irte)
>   {
>   	struct irq_remap_table *table;
>   	struct irte_ga *entry;
> @@ -2998,6 +2998,18 @@ static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
>   
>   	raw_spin_unlock_irqrestore(&table->lock, flags);
>   
> +	return 0;
> +}
> +
> +static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
> +			  struct irte_ga *irte)
> +{
> +	bool ret;
> +
> +	ret = __modify_irte_ga(iommu, devid, index, irte);
> +	if (ret)
> +		return ret;
> +
>   	iommu_flush_irt_and_complete(iommu, devid);
>   
>   	return 0;
> @@ -3681,8 +3693,8 @@ int amd_iommu_update_ga(int cpu, bool is_run, void *data)
>   	}
>   	entry->lo.fields_vapic.is_run = is_run;
>   
> -	return modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
> -			      ir_data->irq_2_irte.index, entry);
> +	return __modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
> +				ir_data->irq_2_irte.index, entry);
>   }
>   EXPORT_SYMBOL(amd_iommu_update_ga);
>   #endif
  
Suravee Suthikulpanit Oct. 19, 2023, 9:11 a.m. UTC | #5
Maxim,

On 10/17/2023 10:36 PM, Suthikulpanit, Suravee wrote:
> 
> On 10/17/2023 9:51 PM, Maxim Levitsky wrote:
>> У вт, 2023-10-17 у 09:42 -0500, Suravee Suthikulpanit пише:
>>> According to the recent update in the AMD IOMMU spec [1], the IsRun and
>>> Destination fields of the Interrupt Remapping Table Entry (IRTE) are not
>>> cached by the IOMMU hardware.
>> Is that true for all AMD hardware that supports AVIC? E.g Zen1/Zen2 
>> hardware?
> 
> This is true for all AVIC/x2AVIC-capable IOMMU hardware in the past.
> 
>> Is there a chance that this will cause a similar errata to the is_running
>> errata that Zen2 cpus have?
> 
> Please let me check on this and get back.

Just to be sure, could you please tell me which errata number are you 
referring to here?

Thanks,
Suravee
  
Maxim Levitsky Oct. 19, 2023, 1:32 p.m. UTC | #6
On Thu, 2023-10-19 at 16:11 +0700, Suthikulpanit, Suravee wrote:
> Maxim,
> 
> On 10/17/2023 10:36 PM, Suthikulpanit, Suravee wrote:
> > On 10/17/2023 9:51 PM, Maxim Levitsky wrote:
> > > У вт, 2023-10-17 у 09:42 -0500, Suravee Suthikulpanit пише:
> > > > According to the recent update in the AMD IOMMU spec [1], the IsRun and
> > > > Destination fields of the Interrupt Remapping Table Entry (IRTE) are not
> > > > cached by the IOMMU hardware.
> > > Is that true for all AMD hardware that supports AVIC? E.g Zen1/Zen2 
> > > hardware?
> > 
> > This is true for all AVIC/x2AVIC-capable IOMMU hardware in the past.
> > 
> > > Is there a chance that this will cause a similar errata to the is_running
> > > errata that Zen2 cpus have?
> > 
> > Please let me check on this and get back.
> 
> Just to be sure, could you please tell me which errata number are you 
> referring to here?

Hi!

The errata is this one:

https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/revision-guides/56323-PUB_1_01.pdf
(ERRATA #1235)

The errata meaning is that once the CPU sets is_running=0 in AVIC's physical ID table, 
the hardware might still not cause #incomplete_ipi vmexit, for a few times.

This means that KVM is not notified of the pending interrupt and it lets the target vCPU block instead of being woken up.


In regard to the IOMMU, the parallel errata like situation would be that CPU sets is_running=0 in the IOMMU tables, 
exactly afterwards an interrupt arrives and the IOMMU doesn't log the interrupt in the GA log.

Note that regardless if the IOMMU works correctly or not, the KVM has to check IRR after it sets is_running=0 to avoid memory
reordering races, but it does so and uses a memory barrier to ensure correctness. 


PS: Recently I realized that a very simple workaround exists for errata #1235: At expense of the performance, the KVM can
permanently set is_running=0 in the AVIC's physical id table.

This will ensure that all ICR writes except self IPIs will cause a VM exit, thus in the expense of performance will
ensure that AVIC works correctly.

And this configuration while slower than full AVIC, it's still much better than having no AVIC, 
because AVIC still accelerates the receiver side of IPIs, and also IOMMU can still post interrupts.
In fact with the workaround applied, AVIC is exactly equivalent to Intel's APICv without IPI virtualization.


This is the latest version of my patch series:

[PATCH v3 0/4] Allow AVIC's IPI virtualization to be optional
https://www.spinics.net/lists/kvm/msg328416.html

A help with a review will be highly appreciated.

In addition to that I would highly appreciate if AMD could respond to these questions:

1. Is the errata #1235 really not present on Zen1 CPUs? It doesn't appear in the revision guide,
but I still suspect that the guide might just not be up to date. I haven't tested a Zen1 machine
to see if I can reproduce it yet.

2. As I see AMD disabled AVIC on Zen3 machines, but it does appear to work just fine in my testing.
Is it possible that the reason for the disabling is also related to ICR/IPI emulation and that if I apply
the same workaround as for errata #1235, then AVIC could be safely enabled on Zen3,
despite support for it not being present in CPUID?

If I know the answer to these questions, it might be possible to enable AVIC on all current AMD hardware by default.

AFAIK (my personal opinion, based on various downstream bugs opened and their urgency) that our customers do care about APICv,
but somehow assume as a fact that AMD doesn't have such feature while in fact AMD does.

So I think that if AVIC were to be enabled by default on AMD, that would be something that AMD's (and ours) 
customers would appreciate very much.

Best regards,
	Maxim Levitsky

> 
> Thanks,
> Suravee
>
  
Joerg Roedel Dec. 11, 2023, 2:22 p.m. UTC | #7
On Tue, Oct 17, 2023 at 09:42:36AM -0500, Suravee Suthikulpanit wrote:
>  drivers/iommu/amd/iommu.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)

Applied, thanks.
  

Patch

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 089886485895..d63590563d3e 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2970,8 +2970,8 @@  static int alloc_irq_index(struct amd_iommu *iommu, u16 devid, int count,
 	return index;
 }
 
-static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
-			  struct irte_ga *irte)
+static int __modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
+			    struct irte_ga *irte)
 {
 	struct irq_remap_table *table;
 	struct irte_ga *entry;
@@ -2998,6 +2998,18 @@  static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
 
 	raw_spin_unlock_irqrestore(&table->lock, flags);
 
+	return 0;
+}
+
+static int modify_irte_ga(struct amd_iommu *iommu, u16 devid, int index,
+			  struct irte_ga *irte)
+{
+	bool ret;
+
+	ret = __modify_irte_ga(iommu, devid, index, irte);
+	if (ret)
+		return ret;
+
 	iommu_flush_irt_and_complete(iommu, devid);
 
 	return 0;
@@ -3681,8 +3693,8 @@  int amd_iommu_update_ga(int cpu, bool is_run, void *data)
 	}
 	entry->lo.fields_vapic.is_run = is_run;
 
-	return modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
-			      ir_data->irq_2_irte.index, entry);
+	return __modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
+				ir_data->irq_2_irte.index, entry);
 }
 EXPORT_SYMBOL(amd_iommu_update_ga);
 #endif