perf: Optimize perf_pmu_migrate_context()

Message ID 20230403090858.GT4253@hirez.programming.kicks-ass.net
State New
Series perf: Optimize perf_pmu_migrate_context()

Commit Message

Peter Zijlstra April 3, 2023, 9:08 a.m. UTC
  Thomas reported that offlining CPUs spends a lot of time in
synchronize_rcu() as called from perf_pmu_migrate_context() even though
he's not actually using uncore events.

Turns out, the thing is unconditionally waiting for RCU, even if there
are no actual events to migrate.

Fixes: 0cda4c023132 ("perf: Introduce perf_pmu_migrate_context()")
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/events/core.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)
  

Comments

Thomas Gleixner April 3, 2023, 10:07 p.m. UTC | #1
On Mon, Apr 03 2023 at 11:08, Peter Zijlstra wrote:
> Thomas reported that offlining CPUs spends a lot of time in
> synchronize_rcu() as called from perf_pmu_migrate_context() even though
> he's not actually using uncore events.

That happens when offlining CPUs from a socket > 0 in the same order in
which those CPUs were brought up. On socket 0 this is not observable
unless the bogus CPU0 offlining hack is enabled.

If the offlining happens in the reverse order then all is shiny.

The reason is that the first online CPU on a socket gets the uncore
events assigned, and when it is offlined they are moved to the next
online CPU in the same socket.
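
As a minimal sketch of that path, loosely modeled on the x86 uncore
driver's hotplug callback (the identifiers below are illustrative, not
the actual kernel code):

  #include <linux/cpumask.h>
  #include <linux/perf_event.h>
  #include <linux/topology.h>

  static struct pmu example_uncore_pmu;         /* one per-socket PMU instance */
  static unsigned int example_uncore_owner_cpu; /* CPU currently hosting its events */

  static int example_uncore_cpu_offline(unsigned int cpu)
  {
          unsigned int target;

          /* Only the CPU that owns the per-socket events has anything to move. */
          if (cpu != example_uncore_owner_cpu)
                  return 0;

          /*
           * Pick another CPU in the same package; a real driver would also
           * intersect with cpu_online_mask.
           */
          target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
          if (target >= nr_cpu_ids)
                  return 0; /* last CPU in the socket, nothing left to migrate to */

          /*
           * Every per-socket PMU instance takes this path on offline, and
           * before this patch each call paid one synchronize_rcu() even
           * with zero events.
           */
          perf_pmu_migrate_context(&example_uncore_pmu, cpu, target);
          example_uncore_owner_cpu = target;
          return 0;
  }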

On a SKL-X with 56 threads per socket this results in a whopping _1_
second delay per thread (except for the last one, which shuts down the
per-socket uncore events with no delay because there are no users) due
to 62 pointless synchronize_rcu() invocations, each taking ~16ms on a
HZ=250 kernel.

Which in turn is interesting because that machine is completely idle
other than running the offline muck...
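
(Spelled out with the figures above: 62 calls x ~16 ms ~= ~1 s per
offlined CPU, and 55 paying CPUs x ~1 s ~= ~55 s to drain one
56-thread socket.)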

> Turns out, the thing is unconditionally waiting for RCU, even if there
> are no actual events to migrate.
>
> Fixes: 0cda4c023132 ("perf: Introduce perf_pmu_migrate_context()")
> Reported-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
  
Paul E. McKenney April 3, 2023, 10:51 p.m. UTC | #2
On Tue, Apr 04, 2023 at 12:07:30AM +0200, Thomas Gleixner wrote:
> That happens when offlining CPUs from a socket > 0 in the same order in
> which those CPUs were brought up.
>
> [...]
>
> On a SKL-X with 56 threads per socket this results in a whopping _1_
> second delay per thread (except for the last one, which shuts down the
> per-socket uncore events with no delay because there are no users) due
> to 62 pointless synchronize_rcu() invocations, each taking ~16ms on a
> HZ=250 kernel.
>
> [...]
>
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

Yow!  ;-)

Assuming that all the events run under RCU protection, as in preemption
disabled:

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
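
As a generic sketch of the pattern that assumption refers to (not the
actual perf code): readers traverse the event lists in RCU read-side
critical sections, here with preemption disabled, so after unlinking
events the writer must wait a full grace period before modifying and
re-publishing them.

  #include <linux/rcupdate.h>
  #include <linux/rculist.h>

  struct example_event {
          struct list_head node;
          int cpu;
  };

  static LIST_HEAD(example_ctx_list);

  /* Reader side: an RCU read-side critical section (preemption disabled). */
  static int example_reader(void)
  {
          struct example_event *e;
          int n = 0;

          rcu_read_lock();
          list_for_each_entry_rcu(e, &example_ctx_list, node)
                  n += READ_ONCE(e->cpu) >= 0; /* @e may safely be dereferenced here */
          rcu_read_unlock();

          return n;
  }

  /* Writer side: the shape of the migrate path. */
  static void example_migrate(struct example_event *e, int dst_cpu)
  {
          list_del_rcu(&e->node); /* unpublish from the source context */

          /*
           * Wait for all pre-existing readers to finish with @e. This is
           * the synchronize_rcu() that the patch makes conditional on
           * there being any events at all.
           */
          synchronize_rcu();

          e->cpu = dst_cpu; /* now safe to modify */
          list_add_rcu(&e->node, &example_ctx_list); /* republish on the destination */
  }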
  

Patch

diff --git a/kernel/events/core.c b/kernel/events/core.c
index fb3e436bcd4a..115320faf1db 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12893,12 +12893,14 @@ void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
 	__perf_pmu_remove(src_ctx, src_cpu, pmu, &src_ctx->pinned_groups, &events);
 	__perf_pmu_remove(src_ctx, src_cpu, pmu, &src_ctx->flexible_groups, &events);
 
-	/*
-	 * Wait for the events to quiesce before re-instating them.
-	 */
-	synchronize_rcu();
+	if (!list_empty(&events)) {
+		/*
+		 * Wait for the events to quiesce before re-instating them.
+		 */
+		synchronize_rcu();
 
-	__perf_pmu_install(dst_ctx, dst_cpu, pmu, &events);
+		__perf_pmu_install(dst_ctx, dst_cpu, pmu, &events);
+	}
 
 	mutex_unlock(&dst_ctx->mutex);
 	mutex_unlock(&src_ctx->mutex);
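
One detail worth noting about the fix: with an empty @events list,
__perf_pmu_install() would be a no-op anyway, so moving it under the
list_empty() check does not change behaviour; it just lets the common
no-events case skip the grace period entirely.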