KVM: arm64: Only save S1PIE registers when dirty

Message ID 20240301-kvm-arm64-defer-regs-v1-1-401e3de92e97@kernel.org
State New
Headers
Series KVM: arm64: Only save S1PIE registers when dirty |

Commit Message

Mark Brown March 1, 2024, 6:05 p.m. UTC
  Currently we save the S1PIE registers every time we exit the guest but
the expected usage pattern for these registers is that they will be
written to very infrequently, likely once during initialisation and then
never updated again.  This means that most likely most of our saves of
these registers are redundant.  Let's avoid these redundant saves by
enabling fine grained write traps for the EL0 and EL1 PIE registers when
switching to the guest and only saving if a write happened.

We track if the registers have been written by storing a mask of bits
for HFGWTR_EL2, we may be able to use the same approach for other
registers with similar access patterns.  We assume that it is likely
that both registers will be written in quick succession and mark both
PIR_EL1 and PIRE0_EL1 as dirty if either is written in order to minimise
overhead.

This will have a negative performance impact if guests do start updating
these registers frequently but since the PIE indexes have a wide impact
on the page tables it seems likely that this will not be the case.

We do not need to check for FGT support since it is mandatory for
systems with PIE.

Signed-off-by: Mark Brown <broonie@kernel.org>
---
I don't have a good sense if this is a good idea or not, or if this is a
desirable implementation of the concept - the patch is based on some
concerns about the cost of the system register context switching.  We
should be able to do something similar for some of the other registers.
---
 arch/arm64/include/asm/kvm_host.h          | 37 ++++++++++++++++++++++++++++++
 arch/arm64/kvm/hyp/include/hyp/switch.h    | 36 +++++++++++++++++++++++++++++
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h |  6 ++++-
 3 files changed, 78 insertions(+), 1 deletion(-)


---
base-commit: 54be6c6c5ae8e0d93a6c4641cb7528eb0b6ba478
change-id: 20240227-kvm-arm64-defer-regs-d29ae460d0b3

Best regards,
  

Comments

Oliver Upton March 1, 2024, 7:32 p.m. UTC | #1
On Fri, Mar 01, 2024 at 06:05:53PM +0000, Mark Brown wrote:
> Currently we save the S1PIE registers every time we exit the guest but
> the expected usage pattern for these registers is that they will be
> written to very infrequently, likely once during initialisation and then
> never updated again.  This means that most likely most of our saves of
> these registers are redundant.  Let's avoid these redundant saves by
> enabling fine grained write traps for the EL0 and EL1 PIE registers when
> switching to the guest and only saving if a write happened.
> 
> We track if the registers have been written by storing a mask of bits
> for HFGWTR_EL2, we may be able to use the same approach for other
> registers with similar access patterns.  We assume that it is likely
> that both registers will be written in quick succession and mark both
> PIR_EL1 and PIRE0_EL1 as dirty if either is written in order to minimise
> overhead.
> 
> This will have a negative performance impact if guests do start updating
> these registers frequently but since the PIE indexes have a wide impact
> on the page tables it seems likely that this will not be the case.
> 
> We do not need to check for FGT support since it is mandatory for
> systems with PIE.
> 
> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---
> I don't have a good sense if this is a good idea or not, or if this is a
> desirable implementation of the concept - the patch is based on some
> concerns about the cost of the system register context switching.  We
> should be able to do something similar for some of the other registers.

Is there any data beyond a microbenchmark to suggest save elision
benefits the VM at all? The idea of baking the trap configuration based
on what KVM _thinks_ the guest will do isn't particularly exciting. This
doesn't seem to be a one-size-fits-all solution.

The overheads of guest exits are extremely configuration dependent, and
on VHE the save/restore of EL1 state happens at vcpu_load() / vcpu_put()
rather than every exit. There isn't a whole lot KVM can do to lessen the
blow of sharing EL1 in the nVHE configuration.

Looking a bit further out, the cost of traps will be dramatically higher
when running as a guest hypervisor, so we'd want to avoid them if
possible...
  
Mark Brown March 1, 2024, 9:13 p.m. UTC | #2
On Fri, Mar 01, 2024 at 07:32:28PM +0000, Oliver Upton wrote:
> On Fri, Mar 01, 2024 at 06:05:53PM +0000, Mark Brown wrote:

> > I don't have a good sense if this is a good idea or not, or if this is a
> > desirable implementation of the concept - the patch is based on some
> > concerns about the cost of the system register context switching.  We
> > should be able to do something similar for some of the other registers.

> Is there any data beyond a microbenchmark to suggest save elision
> benefits the VM at all? The idea of baking the trap configuration based
> on what KVM _thinks_ the guest will do isn't particularly exciting. This
> doesn't seem to be a one-size-fits-all solution.

No, and as I said above I'm really not confident about this.  There's no
hardware with these registers yet as far as I know so I don't know how
meaningful any benchmark would be anyway, and as you suggest even with a
benchmark a new guest could always come along and blow performance up
with a change in access patterns.

> The overheads of guest exits are extremely configuration dependent, and
> on VHE the save/restore of EL1 state happens at vcpu_load() / vcpu_put()
> rather than every exit. There isn't a whole lot KVM can do to lessen the
> blow of sharing EL1 in the nVHE configuration.

> Looking a bit further out, the cost of traps will be dramatically higher
> when running as a guest hypervisor, so we'd want to avoid them if
> possible...

Indeed, but OTOH I got some complaints about adding more system register
switching in __sysreg_save_el1_state() for one of my other serieses that
specifically mentioned nested virt and there don't seem to be a huge
range of other options for reducing what we're doing with context
switching without using traps to figure out what's in use, especially in
the nVHE case.  I figured I'd send the patch so the idea could be
considered.
  
Marc Zyngier March 2, 2024, 10:28 a.m. UTC | #3
On Fri, 01 Mar 2024 21:13:26 +0000,
Mark Brown <broonie@kernel.org> wrote:
> 
> [1  <text/plain; us-ascii (7bit)>]
> On Fri, Mar 01, 2024 at 07:32:28PM +0000, Oliver Upton wrote:
> > On Fri, Mar 01, 2024 at 06:05:53PM +0000, Mark Brown wrote:
> 
> > > I don't have a good sense if this is a good idea or not, or if this is a
> > > desirable implementation of the concept - the patch is based on some
> > > concerns about the cost of the system register context switching.  We
> > > should be able to do something similar for some of the other registers.
> 
> > Is there any data beyond a microbenchmark to suggest save elision
> > benefits the VM at all? The idea of baking the trap configuration based
> > on what KVM _thinks_ the guest will do isn't particularly exciting. This
> > doesn't seem to be a one-size-fits-all solution.
> 
> No, and as I said above I'm really not confident about this.  There's no
> hardware with these registers yet as far as I know so I don't know how
> meaningful any benchmark would be anyway, and as you suggest even with a
> benchmark a new guest could always come along and blow performance up
> with a change in access patterns.
> 
> > The overheads of guest exits are extremely configuration dependent, and
> > on VHE the save/restore of EL1 state happens at vcpu_load() / vcpu_put()
> > rather than every exit. There isn't a whole lot KVM can do to lessen the
> > blow of sharing EL1 in the nVHE configuration.
> 
> > Looking a bit further out, the cost of traps will be dramatically higher
> > when running as a guest hypervisor, so we'd want to avoid them if
> > possible...
> 
> Indeed, but OTOH I got some complaints about adding more system register

Complains from whom? I can't see anything in my inbox, so it my
conclusion that these "issues" are not serious enough to be publicly
mentioned.

> switching in __sysreg_save_el1_state() for one of my other serieses that
> specifically mentioned nested virt and there don't seem to be a huge
> range of other options for reducing what we're doing with context
> switching without using traps to figure out what's in use, especially in
> the nVHE case.  I figured I'd send the patch so the idea could be
> considered.

nVHE has a cost. If someone finds it too slow, they can use VHE. If
they rely on nVHE for isolation, then they know exactly what they are
paying for.

They can also avoid exposing the feature to the guest, which will
remove *any* save/restore. because *that* is the right thing to do if
you want to minimise the impact of additional features.

If anything, I'm actually minded to remove existing instances of this
stupid trapping, such as PAuth, which is entirely pointless.

Sysreg access should essentially be free. 90% of it is done out of
context, and requires no synchronisation. If someone has HW that is so
badly affected by something that should have the same cost as a NOP,
they can borrow a hammer from my toolbox and put this HW out of its
misery.

The only case where this sort of trap can be beneficial is when the
cost of the trap (over 60 64bit registers being saved/restored --
that's host and guest's GPRs) is small compared to the cost of the
context switch of the trapped state. This may make sense for FP, SVE
and the like. Doesn't make much sense for anything else.

	M.
  
Mark Brown March 4, 2024, 2:11 p.m. UTC | #4
On Sat, Mar 02, 2024 at 10:28:18AM +0000, Marc Zyngier wrote:
> Mark Brown <broonie@kernel.org> wrote:
> > On Fri, Mar 01, 2024 at 07:32:28PM +0000, Oliver Upton wrote:

> > > The overheads of guest exits are extremely configuration dependent, and
> > > on VHE the save/restore of EL1 state happens at vcpu_load() / vcpu_put()
> > > rather than every exit. There isn't a whole lot KVM can do to lessen the
> > > blow of sharing EL1 in the nVHE configuration.

> > > Looking a bit further out, the cost of traps will be dramatically higher
> > > when running as a guest hypervisor, so we'd want to avoid them if
> > > possible...

> > Indeed, but OTOH I got some complaints about adding more system register

> Complains from whom? I can't see anything in my inbox, so it my
> conclusion that these "issues" are not serious enough to be publicly
> mentioned.

This was you saying that adding more registers to be context switched
here needed special explanation, rather than just being the default and
generally unremarkable place to put context switching of registers for
EL0/1.

> If anything, I'm actually minded to remove existing instances of this
> stupid trapping, such as PAuth, which is entirely pointless.

That one was part of why it appeared that this sort of thing was what
you were asking for.  Especially given that there's nothing I can see
explaining why this would be deferred it's really unclear, I'd expect it
to be likely that those registers will be quite frequently accessed if
pointer authentication is in use.  Either it needs more explanation of
why it's special or it does seem like it should be removed.
  
Marc Zyngier March 4, 2024, 2:39 p.m. UTC | #5
On Mon, 04 Mar 2024 14:11:19 +0000,
Mark Brown <broonie@kernel.org> wrote:
> 
> [1  <text/plain; us-ascii (7bit)>]
> On Sat, Mar 02, 2024 at 10:28:18AM +0000, Marc Zyngier wrote:
> > Mark Brown <broonie@kernel.org> wrote:
> > > On Fri, Mar 01, 2024 at 07:32:28PM +0000, Oliver Upton wrote:
> 
> > > > The overheads of guest exits are extremely configuration dependent, and
> > > > on VHE the save/restore of EL1 state happens at vcpu_load() / vcpu_put()
> > > > rather than every exit. There isn't a whole lot KVM can do to lessen the
> > > > blow of sharing EL1 in the nVHE configuration.
> 
> > > > Looking a bit further out, the cost of traps will be dramatically higher
> > > > when running as a guest hypervisor, so we'd want to avoid them if
> > > > possible...
> 
> > > Indeed, but OTOH I got some complaints about adding more system register
> 
> > Complains from whom? I can't see anything in my inbox, so it my
> > conclusion that these "issues" are not serious enough to be publicly
> > mentioned.
> 
> This was you saying that adding more registers to be context switched
> here needed special explanation, rather than just being the default and
> generally unremarkable place to put context switching of registers for
> EL0/1.

What I remember saying is that it is wrong to add extra registers to
the context switch without gating them with the VM configuration.
Which is a very different thing.

I don't know where you got the idea that I wanted to make this sort of
things lazy. Quite the contrary, actually. I want to trap things to
make them UNDEF. And this is exactly how -next now behaves (see
58627b722ee2).

What I want to see explained in all cases is why a register has to be
eagerly switched and not deferred to the load/put phases, specially on
VHE. because that has a very visible impact on the overall performance.

> > If anything, I'm actually minded to remove existing instances of this
> > stupid trapping, such as PAuth, which is entirely pointless.
> 
> That one was part of why it appeared that this sort of thing was what
> you were asking for.

No, really not.

	M.
  
Mark Brown March 4, 2024, 5:09 p.m. UTC | #6
On Mon, Mar 04, 2024 at 02:39:19PM +0000, Marc Zyngier wrote:
> Mark Brown <broonie@kernel.org> wrote:
> > On Sat, Mar 02, 2024 at 10:28:18AM +0000, Marc Zyngier wrote:

> > > Complains from whom? I can't see anything in my inbox, so it my
> > > conclusion that these "issues" are not serious enough to be publicly
> > > mentioned.

> > This was you saying that adding more registers to be context switched
> > here needed special explanation, rather than just being the default and
> > generally unremarkable place to put context switching of registers for
> > EL0/1.

> What I remember saying is that it is wrong to add extra registers to
> the context switch without gating them with the VM configuration.
> Which is a very different thing.

You said both things separately.  This is specifically addressing your
comment:

| For the benefit of the unsuspecting reviewers, and in the absence of a
| public specification (which the XML drop isn't), it would be good to
| have the commit message explaining the rationale of what gets saved
| when.

which did not seem obviously tied to your separate comments about using
your at the time still in flight patches to add support for parsing
features out of the ID registers, it seemed like a separate concern.

> What I want to see explained in all cases is why a register has to be
> eagerly switched and not deferred to the load/put phases, specially on
> VHE. because that has a very visible impact on the overall performance.

I'm confused here, the specific register save/restores that you were
asking for an explanation of are as far as I can tell switched in
load/put (eg, the specific complaint was attached to loads in
__sysreg_restore_el1_state() which for VHE is called from 

	__vcpu_load_switch_sysregs()
	kvm_vcpu_load_vhe()
	kvm_arch_vcpu_load()

which is to my understanding part of the load/put phase).  This should
be true for all the GCS registers, the explanation would be something
along the lines of "these are completely unremarkable EL0/1 registers
and are switched in the default place where we switch all the other
EL0/1 registers".
  

Patch

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 21c57b812569..340567e9b206 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -475,6 +475,8 @@  enum vcpu_sysreg {
 };
 
 struct kvm_cpu_context {
+	u64	reg_dirty;		/* Mask of HFGWTR_EL2 bits */
+
 	struct user_pt_regs regs;	/* sp = sp_el0 */
 
 	u64	spsr_abt;
@@ -492,6 +494,32 @@  struct kvm_cpu_context {
 	u64 *vncr_array;
 };
 
+static inline bool __ctxt_reg_dirty(const struct kvm_cpu_context *ctxt,
+				    u64 trap)
+{
+	return ctxt->reg_dirty & trap;
+}
+
+static inline bool __ctxt_reg_set_dirty(struct kvm_cpu_context *ctxt, u64 trap)
+{
+	return ctxt->reg_dirty |= trap;
+}
+
+static inline bool __ctxt_reg_set_clean(struct kvm_cpu_context *ctxt, u64 trap)
+{
+	return ctxt->reg_dirty &= ~trap;
+}
+
+#define ctxt_reg_dirty(ctxt, trap) \
+	__ctxt_reg_dirty(ctxt, HFGxTR_EL2_##trap)
+
+
+#define ctxt_reg_set_dirty(ctxt, trap) \
+	__ctxt_reg_set_dirty(ctxt, HFGxTR_EL2_##trap)
+
+#define ctxt_reg_set_clean(ctxt, trap) \
+	__ctxt_reg_set_clean(ctxt, HFGxTR_EL2_##trap)
+
 struct kvm_host_data {
 	struct kvm_cpu_context host_ctxt;
 };
@@ -1118,6 +1146,15 @@  static inline void kvm_init_host_cpu_context(struct kvm_cpu_context *cpu_ctxt)
 {
 	/* The host's MPIDR is immutable, so let's set it up at boot time */
 	ctxt_sys_reg(cpu_ctxt, MPIDR_EL1) = read_cpuid_mpidr();
+
+	/*
+	 * Save the S1PIE setup on first use, we assume the host will
+	 * never update.
+	 */
+	if (cpus_have_final_cap(ARM64_HAS_S1PIE)) {
+		ctxt_reg_set_dirty(cpu_ctxt, nPIR_EL1);
+		ctxt_reg_set_dirty(cpu_ctxt, nPIRE0_EL1);
+	}
 }
 
 static inline bool kvm_system_needs_idmapped_vectors(void)
diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
index a038320cdb08..2cccc7f2b492 100644
--- a/arch/arm64/kvm/hyp/include/hyp/switch.h
+++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
@@ -152,6 +152,12 @@  static inline void __activate_traps_hfgxtr(struct kvm_vcpu *vcpu)
 	if (cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38))
 		w_set |= HFGxTR_EL2_TCR_EL1_MASK;
 
+	/*
+	 * Trap writes to infrequently updated registers.
+	 */
+	if (cpus_have_final_cap(ARM64_HAS_S1PIE))
+		w_clr |= HFGxTR_EL2_nPIR_EL1 | HFGxTR_EL2_nPIRE0_EL1;
+
 	if (vcpu_has_nv(vcpu) && !is_hyp_ctxt(vcpu)) {
 		compute_clr_set(vcpu, HFGRTR_EL2, r_clr, r_set);
 		compute_clr_set(vcpu, HFGWTR_EL2, w_clr, w_set);
@@ -570,6 +576,33 @@  static bool handle_ampere1_tcr(struct kvm_vcpu *vcpu)
 	return true;
 }
 
+static inline bool kvm_hyp_handle_read_mostly_sysreg(struct kvm_vcpu *vcpu)
+{
+	u32 sysreg = esr_sys64_to_sysreg(kvm_vcpu_get_esr(vcpu));
+	u64 fgt_w_set;
+
+	switch (sysreg) {
+	case SYS_PIR_EL1:
+	case SYS_PIRE0_EL1:
+		/*
+		 * PIR registers are always restored, we just need to
+		 * disable write traps.  We expect EL0 and EL1
+		 * controls to be updated close together so enable
+		 * both.
+		 */
+		if (!cpus_have_final_cap(ARM64_HAS_S1PIE))
+			return false;
+		fgt_w_set = HFGxTR_EL2_nPIRE0_EL1 | HFGxTR_EL2_nPIR_EL1;
+		break;
+	default:
+		return false;
+	}
+
+	sysreg_clear_set_s(SYS_HFGWTR_EL2, 0, fgt_w_set);
+	vcpu->arch.ctxt.reg_dirty |= fgt_w_set;
+	return true;
+}
+
 static bool kvm_hyp_handle_sysreg(struct kvm_vcpu *vcpu, u64 *exit_code)
 {
 	if (cpus_have_final_cap(ARM64_WORKAROUND_CAVIUM_TX2_219_TVM) &&
@@ -590,6 +623,9 @@  static bool kvm_hyp_handle_sysreg(struct kvm_vcpu *vcpu, u64 *exit_code)
 	if (kvm_hyp_handle_cntpct(vcpu))
 		return true;
 
+	if (kvm_hyp_handle_read_mostly_sysreg(vcpu))
+		return true;
+
 	return false;
 }
 
diff --git a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
index bb6b571ec627..f60d27288b2a 100644
--- a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
+++ b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h
@@ -55,9 +55,13 @@  static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt)
 	ctxt_sys_reg(ctxt, CONTEXTIDR_EL1) = read_sysreg_el1(SYS_CONTEXTIDR);
 	ctxt_sys_reg(ctxt, AMAIR_EL1)	= read_sysreg_el1(SYS_AMAIR);
 	ctxt_sys_reg(ctxt, CNTKCTL_EL1)	= read_sysreg_el1(SYS_CNTKCTL);
-	if (cpus_have_final_cap(ARM64_HAS_S1PIE)) {
+	if (ctxt_reg_dirty(ctxt, nPIR_EL1)) {
 		ctxt_sys_reg(ctxt, PIR_EL1)	= read_sysreg_el1(SYS_PIR);
+		ctxt_reg_set_clean(ctxt, nPIR_EL1);
+	}
+	if (ctxt_reg_dirty(ctxt, nPIRE0_EL1)) {
 		ctxt_sys_reg(ctxt, PIRE0_EL1)	= read_sysreg_el1(SYS_PIRE0);
+		ctxt_reg_set_clean(ctxt, nPIRE0_EL1);
 	}
 	ctxt_sys_reg(ctxt, PAR_EL1)	= read_sysreg_par();
 	ctxt_sys_reg(ctxt, TPIDR_EL1)	= read_sysreg(tpidr_el1);