arm64: errata: Minimize tlb flush due to vttbr writes on AmpereOne

Message ID 20240207090458.463021-1-gankulkarni@os.amperecomputing.com
State New
Series arm64: errata: Minimize tlb flush due to vttbr writes on AmpereOne

Commit Message

Ganapatrao Kulkarni Feb. 7, 2024, 9:04 a.m. UTC
The AmpereOne implementation performs a TLB flush whenever there is
a write to vttbr_el2. In the current KVM implementation, vttbr_el2 is
updated with the VM's S2-MMU on every return to the VM. This write is
unnecessary when there is no VM context switch and KVM is simply
returning to the same guest.

Add a check that skips the vttbr_el2 write if the register already
holds the same value, to prevent needless TLB flushes.

Signed-off-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
---
 Documentation/arch/arm64/silicon-errata.rst |  2 ++
 arch/arm64/Kconfig                          | 13 +++++++++++++
 arch/arm64/include/asm/kvm_mmu.h            |  8 +++++++-
 arch/arm64/kernel/cpu_errata.c              |  7 +++++++
 arch/arm64/tools/cpucaps                    |  1 +
 5 files changed, 30 insertions(+), 1 deletion(-)
  

Comments

Oliver Upton Feb. 7, 2024, 9:45 a.m. UTC | #1
On Wed, Feb 07, 2024 at 01:04:58AM -0800, Ganapatrao Kulkarni wrote:
> The AmpereOne implementation performs a TLB flush whenever there is
> a write to vttbr_el2. In the current KVM implementation, vttbr_el2 is
> updated with the VM's S2-MMU on every return to the VM. This write is
> unnecessary when there is no VM context switch and KVM is simply
> returning to the same guest.
> 
> Add a check that skips the vttbr_el2 write if the register already
> holds the same value, to prevent needless TLB flushes.

Sorry, zero interest in taking what is really a uarch optimization.
The errata framework exists to allow the kernel to achieve *correctness*
on a variety of hardware and is not a collection of party tricks for
optimizing any given implementation.

Think of the precedent this would establish. What would stop
implementers from, say, changing out our memcpy implementation into
a hundred different uarch-specific routines? That isn't maintainable,
nor is it even testable as most folks don't have access to your
hardware.

Ignoring all of that -- I question the necessity of these patches
altogether. KVM writes to VTTBR at the time of vcpu load as of commit
934bf871f011 ("KVM: arm64: Load the stage-2 MMU context in
kvm_vcpu_load_vhe()"), which should drastically reduce the overhead of
this hardware fix.

--
Thanks,
Oliver
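
To illustrate the point about commit 934bf871f011, here is a minimal
user-space sketch, not kernel code. It assumes a simplified model in which
every write to a simulated VTTBR costs exactly one TLB flush, and it compares
writing on every guest entry with writing only at vCPU load. All names
(enum model_event, flushes_write_per_entry(), flushes_write_per_load()) are
hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds for the model; these are not kernel definitions. */
enum model_event {
	MODEL_VCPU_LOAD,	/* a vCPU is scheduled onto the physical CPU */
	MODEL_GUEST_ENTRY,	/* the hypervisor returns to the guest */
};

/*
 * Count simulated VTTBR writes for a trace of events.
 * Assumption: the register is written once per event of kind 'write_on',
 * and each write triggers exactly one flush.
 */
static unsigned long count_writes(const enum model_event *trace, size_t n,
				  enum model_event write_on)
{
	unsigned long flushes = 0;

	assert(trace != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (trace[i] == write_on)
			flushes++;
	}
	return flushes;
}

/* Strategy 1: write the register on every return to the guest. */
unsigned long flushes_write_per_entry(const enum model_event *trace, size_t n)
{
	return count_writes(trace, n, MODEL_GUEST_ENTRY);
}

/* Strategy 2: write the register only when a vCPU is loaded. */
unsigned long flushes_write_per_load(const enum model_event *trace, size_t n)
{
	return count_writes(trace, n, MODEL_VCPU_LOAD);
}
```

For a trace with two vCPU loads and five guest entries, the per-entry
strategy counts five writes and the per-load strategy counts two. The gap
grows with the number of exits handled between loads.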
  
Catalin Marinas Feb. 27, 2024, 8:11 p.m. UTC | #2
(catching up on emails)

On Wed, Feb 07, 2024 at 09:45:59AM +0000, Oliver Upton wrote:
> On Wed, Feb 07, 2024 at 01:04:58AM -0800, Ganapatrao Kulkarni wrote:
> > The AmpereOne implementation performs a TLB flush whenever there is
> > a write to vttbr_el2. In the current KVM implementation, vttbr_el2 is
> > updated with the VM's S2-MMU on every return to the VM. This write is
> > unnecessary when there is no VM context switch and KVM is simply
> > returning to the same guest.
> > 
> > Add a check that skips the vttbr_el2 write if the register already
> > holds the same value, to prevent needless TLB flushes.
> 
> Sorry, zero interest in taking what is really a uarch optimization.
> The errata framework exists to allow the kernel to achieve *correctness*
> on a variety of hardware and is not a collection of party tricks for
> optimizing any given implementation.

Definitely, we should not abuse the errata framework for uarch
optimisations.

> Think of the precedent this would establish. What would stop
> implementers from, say, changing out our memcpy implementation into
> a hundred different uarch-specific routines? That isn't maintainable,
> nor is it even testable as most folks don't have access to your
> hardware.

I agree. FTR, I'm fine with uarch optimisations if (a) they don't
run-time patch the kernel binary, (b) they don't affect existing
hardware and (c) they show significant gains on the targeted uarch in
some meaningful benchmarks (definitely not a microbenchmark hammering a
certain kernel path).

We did have uarch optimisations in the past that broke rule (a). We
tried to make them somewhat more justifiable by creating optimisation
classes (well, I think it was only ARM64_HAS_NO_HW_PREFETCH). But such
changes don't scale well for maintainers, so I'd rather not go back
there.

So, if one wants an optimisation, it had better benefit the other
implementations or at least not make them worse. Now, we do have
hardware from mobiles to large enterprise systems, so at some point we
may have to make a call on different kernel behaviours, possibly even at
run-time. We already do this at build-time, e.g. CONFIG_NUMA, which
doesn't make much sense on a mobile (yet). But these should not be seen
as uarch-specific tweaks, more like higher-level classes of
optimisations.
  
Oliver Upton Feb. 27, 2024, 8:26 p.m. UTC | #3
On Tue, Feb 27, 2024 at 08:11:22PM +0000, Catalin Marinas wrote:
> On Wed, Feb 07, 2024 at 09:45:59AM +0000, Oliver Upton wrote:

[...]

> > Think of the precedent this would establish. What would stop
> > implementers from, say, changing out our memcpy implementation into
> > a hundred different uarch-specific routines? That isn't maintainable,
> > nor is it even testable as most folks don't have access to your
> > hardware.
> 
> I agree. FTR, I'm fine with uarch optimisations if (a) they don't
> run-time patch the kernel binary, (b) they don't affect existing
> hardware and (c) they show significant gains on the targeted uarch in
> some meaningful benchmarks (definitely not a microbenchmark hammering a
> certain kernel path).

and (d) they have a minimal, maintainable code footprint :)

> So, if one wants an optimisation, it had better benefit the other
> implementations or at least not make them worse. Now, we do have
> hardware from mobiles to large enterprise systems, so at some point we
> may have to make a call on different kernel behaviours, possibly even at
> run-time. We already do this at build-time, e.g. CONFIG_NUMA, which
> doesn't make much sense on a mobile (yet). But these should not be seen
> as uarch-specific tweaks, more like higher-level classes of
> optimisations.

Agreed. I think the way we handled this case is a great example of how
this sort of thing should go -- a general improvement to how the stage-2
MMU gets loaded on VHE systems, which ought to benefit other
implementations too.

Only if we can't extract a generalization should we even think about
something implementation-specific, IMO.
  

Patch

diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index e8c2ce1f9df6..8924e84358c9 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -54,6 +54,8 @@  stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | Ampere         | AmpereOne       | AC03_CPU_38     | AMPERE_ERRATUM_AC03_CPU_38  |
 +----------------+-----------------+-----------------+-----------------------------+
+| Ampere         | AmpereOne       | N/A             | AMPERE_AC03_REDUCE_TLB_FLUSH|
++----------------+-----------------+-----------------+-----------------------------+
 +----------------+-----------------+-----------------+-----------------------------+
 | ARM            | Cortex-A510     | #2457168        | ARM64_ERRATUM_2457168       |
 +----------------+-----------------+-----------------+-----------------------------+
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index aa7c1d435139..77485d0322e4 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -436,6 +436,19 @@  config AMPERE_ERRATUM_AC03_CPU_38
 
 	  If unsure, say Y.
 
+config AMPERE_AC03_REDUCE_TLB_FLUSH
+	bool "AmpereOne: Minimize the writes to vttbr_el2 register"
+	default y
+	help
+	  On AmpereOne, any write to vttbr_el2 results in a TLB flush.
+	  The flush can be avoided, improving performance, when there is no
+	  VM context switch and the hypervisor simply returns to the same VM.
+
+	  This option adds a check to avoid rewriting the same value
+	  to vttbr_el2.
+
+	  If unsure, say Y.
+
 config ARM64_WORKAROUND_CLEAN_CACHE
 	bool
 
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e3e793d0ec30..da39e4749434 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -317,8 +317,14 @@  static __always_inline u64 kvm_get_vttbr(struct kvm_s2_mmu *mmu)
 static __always_inline void __load_stage2(struct kvm_s2_mmu *mmu,
 					  struct kvm_arch *arch)
 {
+	u64 vttbr;
+
+	vttbr = kvm_get_vttbr(mmu);
 	write_sysreg(mmu->vtcr, vtcr_el2);
-	write_sysreg(kvm_get_vttbr(mmu), vttbr_el2);
+
+	if (!cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_TLB_FLUSH) ||
+	    read_sysreg(vttbr_el2) != vttbr)
+		write_sysreg(vttbr, vttbr_el2);
 
 	/*
 	 * ARM errata 1165522 and 1530923 require the actual execution of the
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 967c7c7a4e7d..f612975e0cb5 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -740,6 +740,13 @@  const struct arm64_cpu_capabilities arm64_errata[] = {
 		.capability = ARM64_WORKAROUND_AMPERE_AC03_CPU_38,
 		ERRATA_MIDR_ALL_VERSIONS(MIDR_AMPERE1),
 	},
+#endif
+#ifdef CONFIG_AMPERE_AC03_REDUCE_TLB_FLUSH
+	{
+		.desc = "AmpereOne, minimize tlb flush due to vttbr write",
+		.capability = ARM64_WORKAROUND_AMPERE_AC03_TLB_FLUSH,
+		ERRATA_MIDR_ALL_VERSIONS(MIDR_AMPERE1),
+	},
 #endif
 	{
 	}
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index b912b1409fc0..b4bee37d0527 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -85,6 +85,7 @@  WORKAROUND_2457168
 WORKAROUND_2645198
 WORKAROUND_2658417
 WORKAROUND_AMPERE_AC03_CPU_38
+WORKAROUND_AMPERE_AC03_TLB_FLUSH
 WORKAROUND_TRBE_OVERWRITE_FILL_MODE
 WORKAROUND_TSB_FLUSH_FAILURE
 WORKAROUND_TRBE_WRITE_OUT_OF_RANGE