[1/2] sched/fair: Add EAS checks before updating overutilized

Message ID 20240223150707.410417-2-sshegde@linux.ibm.com
State New
Series sched/fair: Limit access to overutilized

Commit Message

Shrikanth Hegde Feb. 23, 2024, 3:07 p.m. UTC
  The overutilized field of the root domain is only used by EAS (energy aware
scheduler) to decide whether to do a regular load balance or an EAS-aware load
balance. It is not used if EAS is not possible.

Currently enqueue_task_fair and task_tick_fair access this field and sometimes
update it. In update_sd_lb_stats it is updated often. This causes cache
contention due to load/store tearing and burns a lot of cycles. Hence add an
EAS check before updating this field. The EAS check is either optimized out at
compile time or is a static branch, so it shouldn't cost much.
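
For context, sched_energy_enabled() is roughly the following (a trimmed sketch
of kernel/sched/sched.h around v6.8, so details may differ slightly): it is a
compile-time constant false when EAS cannot be built in, and a static branch
otherwise:

#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
DECLARE_STATIC_KEY_FALSE(sched_energy_present);

static inline bool sched_energy_enabled(void)
{
	/* Static key patched at runtime; no load from the root domain. */
	return static_branch_unlikely(&sched_energy_present);
}
#else
static inline bool sched_energy_enabled(void)
{
	return false;
}
#endif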

With the patch, neither enqueue_task_fair nor newidle_balance shows up as a
hot routine in the perf profile.

6.8-rc4:
7.18%  swapper          [kernel.vmlinux]              [k] enqueue_task_fair
6.78%  s                [kernel.vmlinux]              [k] newidle_balance
+patch:
0.14%  swapper          [kernel.vmlinux]              [k] enqueue_task_fair
0.00%  swapper          [kernel.vmlinux]              [k] newidle_balance

While here, fix updating overutilized to be either SG_OVERUTILIZED or 0.
The current code can make it 0, 1 or 2. This shouldn't alter the
functionality.

Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

--
2.39.3
  

Comments

Chen Yu Feb. 27, 2024, 4:45 p.m. UTC | #1
On 2024-02-23 at 20:37:06 +0530, Shrikanth Hegde wrote:
> Overutilized field of root domain is only used for EAS(energy aware scheduler)
> to decide whether to do regular load balance or EAS aware load balance. It
> is not used if EAS not possible.
> 
> Currently enqueue_task_fair and task_tick_fair accesses, sometime updates
> this field. In update_sd_lb_stats it is updated often.
> Which causes cache contention due to load/store tearing and burns
> a lot of cycles.

Looks like typical cache false sharing: CPU1 updates rd->overutilized,
which invalidates the cache line when CPU2 accesses the adjacent rd->overload.
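
For reference, the two fields sit next to each other in struct root_domain
(a trimmed sketch of kernel/sched/sched.h; the exact layout depends on config):

struct root_domain {
	atomic_t	refcount;
	atomic_t	rto_count;
	struct rcu_head	rcu;
	cpumask_var_t	span;
	cpumask_var_t	online;

	/* Pullable load on at least one CPU (e.g. more than one runnable task) */
	int		overload;

	/* One or more CPUs over-utilized (tipping point) */
	int		overutilized;

	/* ... remaining fields omitted ... */
};
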
This change looks good to me, just some minor questions:

> Hence add EAS check before updating this field.
> EAS check is optimized at compile time or it is static branch.
> Hence it shouldn't cost much.
> 
> With the patch, both enqueue_task_fair and newidle_balance don't show
> up as hot routines in perf profile.
> 
> 6.8-rc4:
> 7.18%  swapper          [kernel.vmlinux]              [k] enqueue_task_fair
> 6.78%  s                [kernel.vmlinux]              [k] newidle_balance
> +patch:
> 0.14%  swapper          [kernel.vmlinux]              [k] enqueue_task_fair
> 0.00%  swapper          [kernel.vmlinux]              [k] newidle_balance
> 
> While here, Fix updating overutilized as either SG_OVERUTILIZED or 0
> instead. Current code can make it 0, 1 or 2. This shouldn't alter the
> functionality.

Just wondering where the 1 comes from? In the current code we either write SG_OVERUTILIZED
or sg_status & SG_OVERUTILIZED.

> 
> Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  kernel/sched/fair.c | 36 +++++++++++++++++++++++++-----------
>  1 file changed, 25 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8e30e2bb77a0..9529d9ef2c5b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6670,15 +6670,30 @@ static inline bool cpu_overutilized(int cpu)
>  	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
>  }
> 
> -static inline void update_overutilized_status(struct rq *rq)
> +static inline void update_rd_overutilized_status(struct root_domain *rd,
> +						 int status)
>  {
> -	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
> -		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
> -		trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
> +	if (sched_energy_enabled()) {
> +		WRITE_ONCE(rd->overutilized, status);
> +		trace_sched_overutilized_tp(rd, !!status);

Is this !!status intentional? The original one is SG_OVERUTILIZED = 2,
now it is either 0 or 1.

thanks,
Chenyu
  
Shrikanth Hegde Feb. 27, 2024, 5:49 p.m. UTC | #2
On 2/27/24 10:15 PM, Chen Yu wrote:

> On 2024-02-23 at 20:37:06 +0530, Shrikanth Hegde wrote:
>> Overutilized field of root domain is only used for EAS(energy aware scheduler)
>> to decide whether to do regular load balance or EAS aware load balance. It
>> is not used if EAS not possible.
>>
>> Currently enqueue_task_fair and task_tick_fair accesses, sometime updates
>> this field. In update_sd_lb_stats it is updated often.
>> Which causes cache contention due to load/store tearing and burns
>> a lot of cycles.
> 
> Looks like a typical cache false sharing: CPU1 updates the rd->overutilized,
> which invalid the cache line when CPU2 access adjacent rd->overload.
> This changes looks good to me, just some minor questions:

Thanks for taking a look and reviewing it. 

> 
>> Hence add EAS check before updating this field.
>> EAS check is optimized at compile time or it is static branch.
>> Hence it shouldn't cost much.
>>
>> With the patch, both enqueue_task_fair and newidle_balance don't show
>> up as hot routines in perf profile.
>>
>> 6.8-rc4:
>> 7.18%  swapper          [kernel.vmlinux]              [k] enqueue_task_fair
>> 6.78%  s                [kernel.vmlinux]              [k] newidle_balance
>> +patch:
>> 0.14%  swapper          [kernel.vmlinux]              [k] enqueue_task_fair
>> 0.00%  swapper          [kernel.vmlinux]              [k] newidle_balance
>>
>> While here, Fix updating overutilized as either SG_OVERUTILIZED or 0
>> instead. Current code can make it 0, 1 or 2. This shouldn't alter the
>> functionality.
> 
> Just wonder where 1 comes from? In current code we either write SG_OVERUTILIZED
> or sg_status & SG_OVERUTILIZED.

Thanks for catching this, silly mistake.
Because of the if conditions around it, I wrongly thought it could be 1.

I will correct that and send a next version soon.

> 
>>
>> Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>  kernel/sched/fair.c | 36 +++++++++++++++++++++++++-----------
>>  1 file changed, 25 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8e30e2bb77a0..9529d9ef2c5b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6670,15 +6670,30 @@ static inline bool cpu_overutilized(int cpu)
>>  	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
>>  }
>>
>> -static inline void update_overutilized_status(struct rq *rq)
>> +static inline void update_rd_overutilized_status(struct root_domain *rd,
>> +						 int status)
>>  {
>> -	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
>> -		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
>> -		trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
>> +	if (sched_energy_enabled()) {
>> +		WRITE_ONCE(rd->overutilized, status);
>> +		trace_sched_overutilized_tp(rd, !!status);
> 
> Is this !!status intentional? The original one is SG_OVERUTILIZED = 2,
> now it is either 0 or 1.
> 

Yes, this is intentional, to convert it into a bool.
The tracepoint hook currently defines the second argument as bool.

include/trace/events/sched.h
DECLARE_TRACE(sched_overutilized_tp,
        TP_PROTO(struct root_domain *rd, bool overutilized),
        TP_ARGS(rd, overutilized));

> thanks,
> Chenyu
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e30e2bb77a0..9529d9ef2c5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6670,15 +6670,30 @@  static inline bool cpu_overutilized(int cpu)
 	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
 }

-static inline void update_overutilized_status(struct rq *rq)
+static inline void update_rd_overutilized_status(struct root_domain *rd,
+						 int status)
 {
-	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
-		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
-		trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
+	if (sched_energy_enabled()) {
+		WRITE_ONCE(rd->overutilized, status);
+		trace_sched_overutilized_tp(rd, !!status);
+	}
+}
+
+static inline void check_update_overutilized_status(struct rq *rq)
+{
+	/*
+	 * overutilized field is used for load balancing decisions only
+	 * if energy aware scheduler is being used
+	 */
+	if (sched_energy_enabled()) {
+		if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+			update_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
 	}
 }
 #else
-static inline void update_overutilized_status(struct rq *rq) { }
+static inline void check_update_overutilized_status(struct rq *rq) { }
+static inline void update_rd_overutilized_status(struct root_domain *rd,
+						 bool status) { }
 #endif

 /* Runqueue only has SCHED_IDLE tasks enqueued */
@@ -6779,7 +6794,7 @@  enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 * and the following generally works well enough in practice.
 	 */
 	if (!task_new)
-		update_overutilized_status(rq);
+		check_update_overutilized_status(rq);

 enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);
@@ -10613,13 +10628,12 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);

 		/* Update over-utilization (tipping point, U >= 0) indicator */
-		WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
-		trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
+		update_rd_overutilized_status(rd,
+				(sg_status & SG_OVERUTILIZED) ? SG_OVERUTILIZED : 0);
 	} else if (sg_status & SG_OVERUTILIZED) {
 		struct root_domain *rd = env->dst_rq->rd;

-		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
-		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
+		update_rd_overutilized_status(rd, SG_OVERUTILIZED);
 	}

 	update_idle_cpu_scan(env, sum_util);
@@ -12625,7 +12639,7 @@  static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		task_tick_numa(rq, curr);

 	update_misfit_status(curr, rq);
-	update_overutilized_status(task_rq(curr));
+	check_update_overutilized_status(task_rq(curr));

 	task_tick_core(rq, curr);
 }