[RFC,0/4] Reduce cost of accessing tg->load_avg

Message ID 20230718134120.81199-1-aaron.lu@intel.com
Series: Reduce cost of accessing tg->load_avg

Aaron Lu July 18, 2023, 1:41 p.m. UTC
Nitin Tekchandani noticed that some scheduler functions have a high cost
according to perf/cycles while running the postgres_sysbench workload.
I perf-annotated the high cost functions, update_cfs_group() and
update_load_avg(), and found ~90% of their cost was due to accessing
tg->load_avg. This series is an attempt to reduce the cost of accessing
tg->load_avg.

The 1st patch is a bug fix. I probably should have tried PeterZ's SBRM,
which would make things a lot cleaner, but I'm not sure whether this fix
is stable material; if it is, writing it in the traditional way makes
backporting easier.
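
The shape of the fix is the usual unwind-on-error pattern; below is a
small self-contained userspace sketch of that pattern, not the kernel
code itself (the struct, sizes and names are made up for illustration):

	#include <stdlib.h>

	/* Illustrative stand-ins for the per-cpu se/cfs_rq allocations. */
	struct group {
		void **se;
		void **cfs_rq;
	};

	/*
	 * Unwind-on-error: every allocation that succeeded before the
	 * failing one is freed before returning, so nothing leaks.
	 */
	static int group_alloc(struct group *g, int nr_cpus)
	{
		int i;

		g->se = calloc(nr_cpus, sizeof(*g->se));
		if (!g->se)
			return -1;

		g->cfs_rq = calloc(nr_cpus, sizeof(*g->cfs_rq));
		if (!g->cfs_rq)
			goto err_se;

		for (i = 0; i < nr_cpus; i++) {
			g->se[i] = malloc(64);
			if (!g->se[i])
				goto err_loop;
		}
		return 0;

	err_loop:
		while (i--)
			free(g->se[i]);
		free(g->cfs_rq);
	err_se:
		free(g->se);
		return -1;
	}

	int main(void)
	{
		struct group g;

		return group_alloc(&g, 4) ? 1 : 0;
	}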

The 2nd patch is actually v2; the first version was sent here:
https://lore.kernel.org/lkml/20230327053955.GA570404@ziqianlu-desk2/
Since it is now part of this series, I didn't bump its version number.
The idea is from PeterZ: by making the hot variable per-node, at least
the write side, update_load_avg(), becomes node-local. The sad part is
that the read side is still global because it has to sum up each node's
part, and that hurts performance. Also, update_cfs_group() is called
very frequently, e.g. on every en/dequeue_task_fair(). With this patch,
postgres_sysbench on SPR sees some improvement; hackbench/netperf also
see some improvement on SPR and Cascade Lake. For details, please see
patch2.
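
To illustrate the per-node idea, here is a rough self-contained
userspace sketch (not the actual patch; the node count and names are
made up): the write side only touches its own node's slot while the
read side still has to walk all nodes.

	#include <stdatomic.h>
	#include <stdio.h>

	#define NR_NODES 2

	/* One cacheline-aligned slot per node so writers on different
	 * nodes don't bounce the same cacheline. */
	struct { _Alignas(64) atomic_long load_avg; } tg_load[NR_NODES];

	/* Write side: only touches the caller's node -> stays node-local. */
	static void add_load(int node, long delta)
	{
		atomic_fetch_add(&tg_load[node].load_avg, delta);
	}

	/* Read side: still has to sum every node's part, which is why
	 * update_cfs_group() remains relatively expensive. */
	static long read_load(void)
	{
		long sum = 0;

		for (int n = 0; n < NR_NODES; n++)
			sum += atomic_load(&tg_load[n].load_avg);
		return sum;
	}

	int main(void)
	{
		add_load(0, 100);
		add_load(1, 50);
		printf("tg load_avg = %ld\n", read_load());
		return 0;
	}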

While there are improvements from making tg->load_avg per node, the two
functions' cost is still there while running postgres_sysbench on SPR:
~7% for update_cfs_group() and ~5% for update_load_avg(). To further
reduce the cost of accessing tg->load_avg, the 3rd patch tries to reduce
the number of updates to tg->load_avg when a task migrates on wakeup:
current code immediately calls update_tg_load_avg() once
update_load_avg() finds the cfs_rq has removed_load pending; patch3
changes this behaviour: it ignores the cfs_rq's pending removed_load and
relies on subsequent events, like task attach, to fold that change into
tg->load_avg, effectively aggregating several updates into one. For a
wakeup heavy workload, this roughly halves the number of calls to
update_tg_load_avg().
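
Here is a toy sketch of that behaviour change (not the actual patch; the
struct and function names are made up), showing how pending removed_load
alone no longer triggers an immediate tg->load_avg update:

	#include <stdbool.h>
	#include <stdio.h>

	/* Toy flags standing in for what update_load_avg() looks at. */
	struct cfs_rq_state {
		bool decayed;        /* cfs_rq's own load changed this update */
		bool removed_load;   /* load of migrated-away tasks is pending */
	};

	/*
	 * Before patch3: any pending removed_load forces an immediate
	 * tg->load_avg update.  After patch3: removed_load alone is
	 * ignored; the change gets folded into tg->load_avg by a later
	 * event such as a task attaching to this cfs_rq.
	 */
	static bool should_update_tg_load_avg(struct cfs_rq_state *s,
					      bool delay_removed)
	{
		if (s->decayed)
			return true;
		if (s->removed_load && !delay_removed)
			return true;
		return false;
	}

	int main(void)
	{
		struct cfs_rq_state s = { .decayed = false, .removed_load = true };

		printf("before patch3: %d, after patch3: %d\n",
		       should_update_tg_load_avg(&s, false),
		       should_update_tg_load_avg(&s, true));
		return 0;
	}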

After patch3, the cost of the two functions while running
postgres_sysbench on SPR dropped to ~1%, but when running netperf/UDP_RR
on SPR the two functions still cost ~20% and ~10%. So patch4 tries to
reduce the number of calls to update_cfs_group() on the en/dequeue path,
and that brought the two functions' cost down to ~2% when running
netperf/UDP_RR on SPR.
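
The shape of that change, very roughly (a toy userspace sketch; the skip
condition below is a made-up placeholder, the real condition is in
patch4):

	#include <stdbool.h>
	#include <stdio.h>

	static int nr_group_updates;

	/* Stand-in for update_cfs_group(): recomputes the entity's share
	 * of the group, reading the (expensive, global) tg->load_avg. */
	static void update_cfs_group_stub(void)
	{
		nr_group_updates++;
	}

	/* Instead of recomputing the group share on every enqueue,
	 * only do it when actually needed. */
	static void enqueue_entity_stub(bool share_update_needed)
	{
		if (share_update_needed)
			update_cfs_group_stub();
		/* ... rest of enqueue ... */
	}

	int main(void)
	{
		for (int i = 0; i < 1000; i++)
			enqueue_entity_stub(i % 4 == 0); /* arbitrary skip pattern */
		printf("update_cfs_group() called %d times out of 1000 enqueues\n",
		       nr_group_updates);
		return 0;
	}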

One issue with patch3 and patch4 is that they both reduce the tracking
accuracy of the task group's load_avg in favor of reducing the cost of
accessing tg->load_avg. patch3 is probably better than patch4 in this
regard, because we already delay processing a cfs_rq's removed_load,
although patch3 makes the delay even longer; patch4 skips some calls to
update_cfs_group() on the en/dequeue path and that may affect things. I
ran a test inspired by an old commit and the result seems to suggest
it's not bad, but I may have missed some scenarios.

This series is based on v6.5-rc1 with one more patch applied that makes
tg->load_avg stay in a dedicated cacheline, to avoid the false sharing
issue discovered by Deng Pan:
https://lore.kernel.org/lkml/20230621081425.420607-1-pan.deng@intel.com/
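
For illustration, the dedicated-cacheline idea looks roughly like this
(a userspace sketch, not Deng Pan's actual patch; field names are made
up):

	#include <stddef.h>
	#include <stdio.h>

	/*
	 * A field written frequently from many CPUs gets a cacheline of
	 * its own, so its updates don't keep invalidating the line that
	 * holds neighbouring, mostly-read fields (false sharing).
	 */
	struct task_group_like {
		int mostly_read_stuff;
		_Alignas(64) long load_avg;      /* hot: written from many CPUs */
		char pad[64 - sizeof(long)];     /* nothing else on load_avg's line */
		long shares;                     /* mostly read */
	};

	int main(void)
	{
		printf("load_avg offset: %zu, shares offset: %zu\n",
		       offsetof(struct task_group_like, load_avg),
		       offsetof(struct task_group_like, shares));
		return 0;
	}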

In the performance data in each commit's changelog, "node" means patch2,
"delay" means patch3 and "skip" means patch4. All performance change
percentages are against base.

Comments are welcome.

Aaron Lu (4):
  sched/fair: free allocated memory on error in alloc_fair_sched_group()
  sched/fair: Make tg->load_avg per node
  sched/fair: delay update_tg_load_avg() for cfs_rq's removed load
  sched/fair: skip some update_cfs_group() on en/dequeue_entity()

 kernel/sched/debug.c |  2 +-
 kernel/sched/fair.c  | 80 ++++++++++++++++++++++++++++++++++----------
 kernel/sched/sched.h | 44 ++++++++++++++++++------
 3 files changed, 97 insertions(+), 29 deletions(-)