[RFC] sched/fair: Skip idle CPU search on busy system

Message ID: 20230726093612.1882644-1-sshegde@linux.vnet.ibm.com
State: New
Series: [RFC] sched/fair: Skip idle CPU search on busy system

Commit Message

Shrikanth Hegde July 26, 2023, 9:36 a.m. UTC
When the system is fully busy, there will not be any idle CPUs.
In that case, load_balance will be called mainly with the CPU_NOT_IDLE
type. should_we_balance() currently checks for an idle CPU if one
exists. When the system is 100% busy, there will not be an idle CPU,
so these idle_cpu checks can be skipped. This avoids fetching those rq
structures.

This is a minor optimization for the specific case of 100% utilization.

.....
Coming to the current implementation: it is a very basic approach to
the issue and may not be the best/perfect way to do this. It works only
with CONFIG_NO_HZ_COMMON. nohz.nr_cpus is a global counter that tracks
idle CPUs; AFAIU there isn't any other such info. If there is, we could
use that instead. nohz.nr_cpus is atomic, which might be costly too.
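
For context, the global nohz bookkeeping in kernel/sched/fair.c looks
roughly like this (abridged sketch, not the full field list):

	static struct {
		cpumask_var_t	idle_cpus_mask;	/* CPUs currently in NO_HZ idle */
		atomic_t	nr_cpus;	/* number of CPUs in idle_cpus_mask */
		...
	} nohz ____cacheline_aligned;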

An alternative would be to add a new attribute to sched_domain and
update it in the CPU idle entry/exit path, per CPU. The advantage is
that the check can be per env->sd instead of global. Slightly more
complicated, but maybe better; there could be other advantages at
wakeup too, such as limiting the scan.
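
A minimal sketch of that alternative, assuming a hypothetical
nr_busy_cpus field in struct sched_domain (illustrative only, untested):

	/*
	 * Hypothetical: from the idle entry/exit path, walk the CPU's
	 * domain hierarchy and maintain a per-domain busy count.
	 */
	static void update_sd_busy(int cpu, bool entering_idle)
	{
		struct sched_domain *sd;

		rcu_read_lock();
		for_each_domain(cpu, sd) {
			if (entering_idle)
				atomic_dec(&sd->nr_busy_cpus);
			else
				atomic_inc(&sd->nr_busy_cpus);
		}
		rcu_read_unlock();
	}

	/* should_we_balance() could then check per env->sd: */
	if (env->idle == CPU_NOT_IDLE &&
	    atomic_read(&env->sd->nr_busy_cpus) == env->sd->span_weight)
		return group_balance_cpu(sg) == env->dst_cpu;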

Your feedback would really help. Does this optimization make sense?

Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

--
2.31.1
  

Comments

Chen Yu July 27, 2023, 7:25 a.m. UTC | #1
On 2023-07-26 at 15:06:12 +0530, Shrikanth Hegde wrote:
> When the system is fully busy, there will not be any idle CPUs.
> In that case, load_balance will be called mainly with the CPU_NOT_IDLE
> type. should_we_balance() currently checks for an idle CPU if one
> exists. When the system is 100% busy, there will not be an idle CPU,
> so these idle_cpu checks can be skipped. This avoids fetching those rq
> structures.
>

Yes, I guess this could help reduce the cost if the sched group
has many CPUs.

> This is a minor optimization for the specific case of 100% utilization.
> 
> .....
> Coming to the current implementation: it is a very basic approach to
> the issue and may not be the best/perfect way to do this. It works only
> with CONFIG_NO_HZ_COMMON. nohz.nr_cpus is a global counter that tracks
> idle CPUs; AFAIU there isn't any other such info. If there is, we could
> use that instead. nohz.nr_cpus is atomic, which might be costly too.
>
> An alternative would be to add a new attribute to sched_domain and
> update it in the CPU idle entry/exit path, per CPU. The advantage is
> that the check can be per env->sd instead of global. Slightly more
> complicated, but maybe better; there could be other advantages at
> wakeup too, such as limiting the scan.
> 

When checking the code, I found that there is a per-domain nr_busy_cpus.
However, that variable exists only for the LLC domain. Maybe extending
sd_share to the domains below the NUMA level could be applicable, IMO.
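
For reference, nr_busy_cpus today lives in the shared LLC data and is
updated from set_cpu_sd_state_busy()/set_cpu_sd_state_idle(). A check
built on an extended sd_share might look like this sketch (the
extension itself is hypothetical):

	/* Existing (include/linux/sched/topology.h, abridged): */
	struct sched_domain_shared {
		atomic_t	ref;
		atomic_t	nr_busy_cpus;
		int		has_idle_cores;
		...
	};

	/* Hypothetical, if non-NUMA domains above the LLC also had sd->shared: */
	static inline bool sd_fully_busy(struct sched_domain *sd)
	{
		return sd->shared &&
		       atomic_read(&sd->shared->nr_busy_cpus) == sd->span_weight;
	}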

thanks,
Chenyu

> Your feedback would really help. Does this optimization make sense?
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..903d59b5290c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
>  		return 1;
>  	}
> 
> +#ifdef CONFIG_NO_HZ_COMMON
> +	/* If the system is fully busy, it's better to skip the idle checks */
> +	if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
> +		return group_balance_cpu(sg) == env->dst_cpu;
> +#endif
> +
>  	/* Try to find first idle CPU */
>  	for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
>  		if (!idle_cpu(cpu))
> --
> 2.31.1
>
  
Shrikanth Hegde July 27, 2023, 3:04 p.m. UTC | #2
On 7/27/23 12:55 PM, Chen Yu wrote:
> On 2023-07-26 at 15:06:12 +0530, Shrikanth Hegde wrote:
>> When the system is fully busy, there will not be any idle CPUs.
>> In that case, load_balance will be called mainly with the CPU_NOT_IDLE
>> type. should_we_balance() currently checks for an idle CPU if one
>> exists. When the system is 100% busy, there will not be an idle CPU,
>> so these idle_cpu checks can be skipped. This avoids fetching those rq
>> structures.
>>
> 
> Yes, I guess this could help reduce the cost if the sched group
> has many CPUs.

Thank you for the review, Chen Yu.

> 
>> This is a minor optimization for the specific case of 100% utilization.
>>
>> .....
>> Coming to the current implementation: it is a very basic approach to
>> the issue and may not be the best/perfect way to do this. It works only
>> with CONFIG_NO_HZ_COMMON. nohz.nr_cpus is a global counter that tracks
>> idle CPUs; AFAIU there isn't any other such info. If there is, we could
>> use that instead. nohz.nr_cpus is atomic, which might be costly too.
>>
>> An alternative would be to add a new attribute to sched_domain and
>> update it in the CPU idle entry/exit path, per CPU. The advantage is
>> that the check can be per env->sd instead of global. Slightly more
>> complicated, but maybe better; there could be other advantages at
>> wakeup too, such as limiting the scan.
>>
> 
> When checking the code, I found that there is a per-domain nr_busy_cpus.
> However, that variable exists only for the LLC domain. Maybe extending
> sd_share to the domains below the NUMA level could be applicable, IMO.

True. I did see that. Doing this at every level when there are a large
number of CPUs will likely need a lock when updating sd_share, and that
could become a bottleneck as well. Since sd_share never makes sense for
NUMA, this would require different code paths for NUMA and non-NUMA.
Though the main benefit for this corner case would be at the NUMA levels,
since that is where the large CPU counts are.

I will keep that thought in mind and try to work out something along those lines.

> 
> thanks,
> Chenyu
> 
>> Your feedback would really help. Does this optimization make sense?
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
>> ---
>>  kernel/sched/fair.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 373ff5f55884..903d59b5290c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
>>  		return 1;
>>  	}
>>
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +	/* If the system is fully busy, it's better to skip the idle checks */
>> +	if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
>> +		return group_balance_cpu(sg) == env->dst_cpu;
>> +#endif
>> +
>>  	/* Try to find first idle CPU */
>>  	for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
>>  		if (!idle_cpu(cpu))
>> --
>> 2.31.1
>>
  
Vishal Chourasia Aug. 9, 2023, 6:44 p.m. UTC | #3
On Wed, Jul 26, 2023 at 03:06:12PM +0530, Shrikanth Hegde wrote:
> When the system is fully busy, there will not be any idle CPUs.
> In that case, load_balance will be called mainly with the CPU_NOT_IDLE
> type. should_we_balance() currently checks for an idle CPU if one
> exists. When the system is 100% busy, there will not be an idle CPU,
> so these idle_cpu checks can be skipped. This avoids fetching those rq
> structures.
> 
> This is a minor optimization for the specific case of 100% utilization.
> 
> .....
> Coming to the current implementation: it is a very basic approach to
> the issue and may not be the best/perfect way to do this. It works only
> with CONFIG_NO_HZ_COMMON. nohz.nr_cpus is a global counter that tracks
> idle CPUs; AFAIU there isn't any other such info. If there is, we could
> use that instead. nohz.nr_cpus is atomic, which might be costly too.
> 
> An alternative would be to add a new attribute to sched_domain and
> update it in the CPU idle entry/exit path, per CPU. The advantage is
> that the check can be per env->sd instead of global. Slightly more
> complicated, but maybe better; there could be other advantages at
> wakeup too, such as limiting the scan.
> 
> Your feedback would really help. Does this optimization make sense?
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..903d59b5290c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
>  		return 1;
>  	}
> 
> +#ifdef CONFIG_NO_HZ_COMMON
> +	/* If the system is fully busy, it's better to skip the idle checks */
> +	if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
> +		return group_balance_cpu(sg) == env->dst_cpu;
> +#endif
> +
>  	/* Try to find first idle CPU */
>  	for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
>  		if (!idle_cpu(cpu))
> --
> 2.31.1
> 
Tested this patch on top of v6.4.

5 runs of stress-ng (100% load) on a system with 16 CPUs, spawning 23
threads for 60 minutes.

stress-ng: 16 CPUs, 23 threads, 60 mins
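
(Hypothetical invocation; the exact command line isn't reproduced here,
but it was along these lines:)

	stress-ng --cpu 23 --timeout 60m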

- 6.4.0

| completion time(sec) |      user |        sys |
|----------------------+-----------+------------|
|              3600.05 |  57582.44 |       0.70 |
|              3600.10 |  57597.07 |       0.68 |
|              3600.05 |  57596.65 |       0.47 |
|              3600.04 |  57596.36 |       0.71 |
|              3600.06 |  57595.32 |       0.42 |
|              3600.06 | 57593.568 |      0.596 | average
|          0.046904158 | 12.508392 | 0.27878307 | stddev

- 6.4.0+ (with patch)

| completion time(sec) |      user |         sys |
|----------------------+-----------+-------------|
|              3600.04 |  57596.58 |        0.50 |
|              3600.04 |  57595.19 |        0.48 |
|              3600.05 |  57597.39 |        0.49 |
|              3600.04 |  57596.64 |        0.53 |
|              3600.04 |  57595.94 |        0.43 |
|             3600.042 | 57596.348 |       0.486 | average
|         0.0089442719 | 1.6529610 | 0.072938330 | stddev

The average system time is slightly lower in the patched version (0.486 seconds)
compared to the 6.4.0 version (0.596 seconds). 
The standard deviation for system time is also lower in the patched version
(0.0729 seconds) than in the 6.4.0 version (0.2788 seconds), suggesting more
consistent system time results with the patch.

vishal.c
  
Shrikanth Hegde Aug. 10, 2023, 3:44 p.m. UTC | #4
On 8/10/23 12:14 AM, Vishal Chourasia wrote:
> On Wed, Jul 26, 2023 at 03:06:12PM +0530, Shrikanth Hegde wrote:
>> When the system is fully busy, there will not be any idle CPUs.
>>
> Tested this patch on top of v6.4.
[...]
> 5 runs of stress-ng (100% load) on a system with 16 CPUs, spawning 23
> threads for 60 minutes.
> 
> stress-ng: 16 CPUs, 23 threads, 60 mins
> 
> - 6.4.0
> 
> | completion time(sec) |      user |        sys |
> |----------------------+-----------+------------|
> |              3600.05 |  57582.44 |       0.70 |
> |              3600.10 |  57597.07 |       0.68 |
> |              3600.05 |  57596.65 |       0.47 |
> |              3600.04 |  57596.36 |       0.71 |
> |              3600.06 |  57595.32 |       0.42 |
> |              3600.06 | 57593.568 |      0.596 | average
> |          0.046904158 | 12.508392 | 0.27878307 | stddev
> 
> - 6.4.0+ (with patch)
> 
> | completion time(sec) |      user |         sys |
> |----------------------+-----------+-------------|
> |              3600.04 |  57596.58 |        0.50 |
> |              3600.04 |  57595.19 |        0.48 |
> |              3600.05 |  57597.39 |        0.49 |
> |              3600.04 |  57596.64 |        0.53 |
> |              3600.04 |  57595.94 |        0.43 |
> |             3600.042 | 57596.348 |       0.486 | average
> |         0.0089442719 | 1.6529610 | 0.072938330 | stddev
> 
> The average system time is slightly lower in the patched version (0.486 seconds)
> compared to the 6.4.0 version (0.596 seconds). 
> The standard deviation for system time is also lower in the patched version
> (0.0729 seconds) than in the 6.4.0 version (0.2788 seconds), suggesting more
> consistent system time results with the patch.
> 
> vishal.c

Thank you very much, Vishal, for trying this out.

Meanwhile, I have yet to try the suggestion given by Chen. Let me see if
that works okay.
  
Ingo Molnar Oct. 4, 2023, 4:25 p.m. UTC | #5
* Shrikanth Hegde <sshegde@linux.vnet.ibm.com> wrote:

> When the system is fully busy, there will not be any idle CPUs.
> In that case, load_balance will be called mainly with the CPU_NOT_IDLE
> type. should_we_balance() currently checks for an idle CPU if one
> exists. When the system is 100% busy, there will not be an idle CPU,
> so these idle_cpu checks can be skipped. This avoids fetching those rq
> structures.
> 
> This is a minor optimization for the specific case of 100% utilization.
> 
> .....
> Coming to the current implementation: it is a very basic approach to
> the issue and may not be the best/perfect way to do this. It works only
> with CONFIG_NO_HZ_COMMON. nohz.nr_cpus is a global counter that tracks
> idle CPUs; AFAIU there isn't any other such info. If there is, we could
> use that instead. nohz.nr_cpus is atomic, which might be costly too.
> 
> An alternative would be to add a new attribute to sched_domain and
> update it in the CPU idle entry/exit path, per CPU. The advantage is
> that the check can be per env->sd instead of global. Slightly more
> complicated, but maybe better; there could be other advantages at
> wakeup too, such as limiting the scan.
> 
> Your feedback would really help. Does this optimization make sense?
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..903d59b5290c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
>  		return 1;
>  	}
> 
> +#ifdef CONFIG_NO_HZ_COMMON
> +	/* If the system is fully busy, it's better to skip the idle checks */
> +	if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
> +		return group_balance_cpu(sg) == env->dst_cpu;
> +#endif

Not a big fan of coupling NOHZ to a scheduler optimization in this fashion, 
and not a big fan of the nohz.nr_cpus global cacheline either.

I think it should be done unconditionally, via the scheduler topology tree:

 - We should probably slow-propagate "permanently busy" status of a CPU
   down the topology tree, i.e.:

     - mark a domain fully-busy with a delay & batching, probably driven
       by the busy-tick only,

     - while marking a domain idle instantly & propagating this up the
       domain tree only if necessary. The propagation can stop if it
       finds a non-busy domain, so usually it won't reach the root domain.

 - This approach ensures there's no real overhead problem in the domain 
   tree: think of hundreds of CPUs all accessing the nohz.nr_cpus global 
   variable... I bet it's a measurable problem already on large systems.

 - The "atomic_read(&nohz.nr_cpus) == 0" condition in your patch is simply
   the busy-flag checked at the root domain: a readonly global cacheline
   that never gets modified on a permanently busy system.
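
A rough sketch of that scheme (all field and helper names hypothetical,
purely illustrative):

	/* Busy-tick driven: promote "fully busy" slowly, one level at a time */
	static void sd_tick_mark_busy(struct sched_domain *sd)
	{
		if (!READ_ONCE(sd->fully_busy) && all_children_fully_busy(sd))
			WRITE_ONCE(sd->fully_busy, 1);
	}

	/* Idle entry: clear instantly, walking up only through busy domains */
	static void sd_mark_idle(int cpu)
	{
		struct sched_domain *sd;

		rcu_read_lock();
		for_each_domain(cpu, sd) {
			if (!READ_ONCE(sd->fully_busy))
				break;
			WRITE_ONCE(sd->fully_busy, 0);
		}
		rcu_read_unlock();
	}

	/*
	 * should_we_balance() would then read a flag that stays read-only
	 * on a permanently busy system:
	 */
	if (env->idle == CPU_NOT_IDLE && READ_ONCE(env->sd->fully_busy))
		return group_balance_cpu(sg) == env->dst_cpu;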

Thanks,

	Ingo
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..903d59b5290c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
 		return 1;
 	}

+#ifdef CONFIG_NO_HZ_COMMON
+	/* If the system is fully busy, it's better to skip the idle checks */
+	if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
+		return group_balance_cpu(sg) == env->dst_cpu;
+#endif
+
 	/* Try to find first idle CPU */
 	for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
 		if (!idle_cpu(cpu))