[0/2] sched/fair: Limit access to overutilized

Message ID 20240223150707.410417-1-sshegde@linux.ibm.com

Shrikanth Hegde Feb. 23, 2024, 3:07 p.m. UTC
When running an ISV workload on a large system (240 cores, SMT8), it was
observed from the perf profile that newidle_balance and enqueue_task_fair
were consuming more cycles. Perf annotate showed that most of the time
was spent accessing the overutilized field of the root domain.

Aboorva was able to simulate a similar perf profile by making some
changes to stress-ng --wait. Both newidle_balance and enqueue_task_fair
consume close to 5-7% of cycles each. Perf annotate shows that most of the
cycles are spent accessing the rd and rd->overutilized fields.

perf profile:
7.18%  swapper          [kernel.vmlinux]              [k] enqueue_task_fair
6.78%  s                [kernel.vmlinux]              [k] newidle_balance

perf annotate of enqueue_task_fair:
    1.66 :   c000000000223ba4:       beq     c000000000223c50 <enqueue_task_fair+0x238>
         : 6789             update_overutilized_status():
         : 6675             if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
   95.42 :   c000000000223ba8:       ld      r8,2752(r28)
    0.08 :   c000000000223bac:       lwz     r9,540(r8)
Debugging it further, in enqueue_task_fair:
ld      r8,2752(r28) <-- loads rd
lwz     r9,540(r8)   <-- loads rd->overutilized
Frequent writes to rd from other CPUs cause load/store tearing, and hence
loading rd can take more time.
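For reference, the helper doing this access looks roughly like the following
in current kernels (paraphrased from kernel/sched/fair.c; exact code may
differ by version):

static inline void update_overutilized_status(struct rq *rq)
{
	/* rd->overutilized is read on every enqueue, even when EAS is not in use */
	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
		trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
	}
}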

Perf annotate of newidle_balance:
         : 12333            sd = rcu_dereference_check_sched_domain(this_rq->sd);
   41.54 :   c000000000228070:       ld      r30,2760(r31)
         : 12335            if (!READ_ONCE(this_rq->rd->overload) ||
    0.07 :   c000000000228074:       lwz     r9,536(r9)
Similarly, in newidle_balance:
ld      r9,2752(r31) <-- loads rd
lwz     r9,536(r9)   <-- loads rd->overload
Though overutilized is not used in this function, the writes to overutilized
could cause the load of overload to take more time, since both overload and
overutilized are part of the same cacheline.
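For context, the two fields sit next to each other in struct root_domain; a
trimmed sketch of the relevant part of kernel/sched/sched.h (surrounding
fields omitted, exact layout depends on config):

struct root_domain {
	/* ... other fields ... */

	/* Indicate pullable load on at least one CPU */
	int		overload;

	/* Indicate one or more CPUs over-utilized (tipping point) */
	int		overutilized;

	/* ... other fields ... */
};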

overutilized was added for EAS (Energy Aware Scheduling) to choose between
EAS-aware load balancing and regular load balancing. Hence this field should
only be updated when EAS is active.

As checked, on both x86 and powerpc, overload and overutilized share the
same cacheline in rd. Updating overutilized is not required on non-EAS
platforms, so this patch can help reduce cache contention on such archs.
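The idea in patch 1/2 is essentially a bail-out along these lines (a minimal
sketch, not the exact diff; sched_energy_enabled() is the existing
static-key test for EAS):

static inline void update_overutilized_status(struct rq *rq)
{
	/* Skip the rd->overutilized read/write entirely when EAS is not active */
	if (!sched_energy_enabled())
		return;

	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
		trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
	}
}

Since sched_energy_enabled() is a static branch, the check is near-free, and
non-EAS systems stop touching the shared cacheline on every enqueue.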

Patch 1/2 is the main patch; it addresses the issue described above. With
it applied, neither function shows up in the profile, and a before/after
comparison is in its changelog. The stated problem in the ISV workload was
also solved and throughput improved. The Fixes: 2802bf3cd936 tag may be
dropped if it causes issues with clean backports all the way back; I wasn't
sure what the right thing to do here is.
Patch 2/2 is only code refactoring to use a helper function instead of
accessing the field directly, so that it is clear the field is accessed only
for EAS. It depends on 1/2 being applied first.
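The helper added by patch 2/2 is just a thin accessor of this shape (the
name below is illustrative, not necessarily the one used in the patch):

/* Illustrative only; see patch 2/2 for the actual helper name */
static inline bool is_rd_overutilized(struct root_domain *rd)
{
	return READ_ONCE(rd->overutilized);
}

EAS-only call sites in fair.c then go through the helper rather than
open-coding READ_ONCE(rq->rd->overutilized), which makes the EAS-only
accesses easy to spot.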

Thanks to Aboorva Devarajan and Nysal Jan K A for helping to recreate and
debug this issue and for verifying the patch.

Shrikanth Hegde (2):
  sched/fair: Add EAS checks before updating overutilized
  sched/fair: Use helper function to access rd->overutilized

 kernel/sched/fair.c | 50 +++++++++++++++++++++++++++++++++------------
 1 file changed, 37 insertions(+), 13 deletions(-)

--
2.39.3