[v4] sched/topology: change behaviour of sysctl sched_energy_aware based on the platform

Message ID 20230926100046.405188-1-sshegde@linux.vnet.ibm.com
State New
Headers
Series [v4] sched/topology: change behaviour of sysctl sched_energy_aware based on the platform |

Commit Message

Shrikanth Hegde Sept. 26, 2023, 10 a.m. UTC
  sysctl sched_energy_aware is available for the admin to disable/enable
energy aware scheduling(EAS). EAS is enabled only if few conditions are
met by the platform. They are, asymmetric CPU capacity, no SMT,
valid cpufreq policy, frequency invariant load tracking etc. A platform may
boot without EAS capability, but could gain such capability at runtime
For example, changing/registering the cpufreq policy.

At present, though platform doesn't support EAS, this sysctl returns 1
and it ends up calling rebuild of sched domain on write to 1 and
NOP when writing to 0. That is confusing and un-necessary.

Desired behavior would be to, have this sysctl to enable/disable the EAS
on supported platform. On Non supported platform write to the sysctl
would return not supported error and read of the sysctl would return
empty. So
sched_energy_aware returns empty - EAS is not possible at this moment
sched_energy_aware returns 0 - EAS is supported but disabled by admin.
sched_energy_aware returns 1 - EAS is supported and enabled.
User can find out the reason why EAS is not possible by checking
info messages.

sched_is_eas_possible return if the platform can do EAS at this moment.
It takes most of the cases into account except one where EM complexity is
too high as the code was bit tricky to separate that.

v3->v4:
valentin suggested it would be better to consider simpler approach that
was mentioned in v2. It is a standard approach to keep the knob visible
but change how read and write are handled. Did that and Refactored the
code to use a common function in build_perf_domains and in sysctl handler.
v2->v3:
Chen Yu and Pierre Gondois both pointed out that if platform becomes
capable of EAS later, this patch was not allowing that to happen.
Addressed that by using a variable to indicate the sysctl change
and re-worded the commit message with desired behaviour,
v1->v2:
Chen Yu had pointed out that this will not destroy the perf domains on
architectures where EAS is supported by changing the sysctl.
[v1] Link: https://lore.kernel.org/lkml/20230829065040.920629-1-sshegde@linux.vnet.ibm.com/
[v2] Link: https://lore.kernel.org/lkml/20230901065249.137242-1-sshegde@linux.vnet.ibm.com/
[v3] Link: https://lore.kernel.org/lkml/20230913114807.665094-1-sshegde@linux.vnet.ibm.com/

Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
---
 Documentation/admin-guide/sysctl/kernel.rst |   3 +-
 kernel/sched/topology.c                     | 107 +++++++++++++-------
 2 files changed, 71 insertions(+), 39 deletions(-)

--
2.31.1
  

Comments

Shrikanth Hegde Sept. 26, 2023, 10:50 a.m. UTC | #1
On 9/26/23 3:30 PM, Shrikanth Hegde wrote:
> sysctl sched_energy_aware is available for the admin to disable/enable
> energy aware scheduling(EAS). EAS is enabled only if few conditions are
> met by the platform. They are, asymmetric CPU capacity, no SMT,
> valid cpufreq policy, frequency invariant load tracking etc. A platform may
> boot without EAS capability, but could gain such capability at runtime
> For example, changing/registering the cpufreq policy.
> 
> At present, though platform doesn't support EAS, this sysctl returns 1
> and it ends up calling rebuild of sched domain on write to 1 and
> NOP when writing to 0. That is confusing and un-necessary.
> 
> Desired behavior would be to, have this sysctl to enable/disable the EAS
> on supported platform. On Non supported platform write to the sysctl
> would return not supported error and read of the sysctl would return
> empty. So
> sched_energy_aware returns empty - EAS is not possible at this moment
> sched_energy_aware returns 0 - EAS is supported but disabled by admin.
> sched_energy_aware returns 1 - EAS is supported and enabled.
> User can find out the reason why EAS is not possible by checking
> info messages.
> 


On Power10 system which has SMT and symmetric CPU capacity operations              
would be as below.                                                              
# cat sched_energy_aware                                                        
# echo 0 > sched_energy_aware                                                   
-bash: echo: write error: Operation not supported                               
# echo 1 > sched_energy_aware                                                      
-bash: echo: write error: Operation not supported                               
dmesg | tail                                                                    
[ 1608.233159] rd 0-95: Checking EAS, CPUs do not have asymmetric capacities     
[ 1612.026148] rd 0-95: Checking EAS, CPUs do not have asymmetric capacities       
[ 1616.122406] rd 0-95: Checking EAS, CPUs do not have asymmetric capacities  


Pierre, 
Could you please help testing this on your platform which supports EAS. 
That would be helpful. 

> sched_is_eas_possible return if the platform can do EAS at this moment.
> It takes most of the cases into account except one where EM complexity is
> too high as the code was bit tricky to separate that.
> 
> v3->v4:
> valentin suggested it would be better to consider simpler approach that
> was mentioned in v2. It is a standard approach to keep the knob visible
> but change how read and write are handled. Did that and Refactored the
> code to use a common function in build_perf_domains and in sysctl handler.
> v2->v3:
> Chen Yu and Pierre Gondois both pointed out that if platform becomes
> capable of EAS later, this patch was not allowing that to happen.
> Addressed that by using a variable to indicate the sysctl change
> and re-worded the commit message with desired behaviour,
> v1->v2:
> Chen Yu had pointed out that this will not destroy the perf domains on
> architectures where EAS is supported by changing the sysctl.
> [v1] Link: https://lore.kernel.org/lkml/20230829065040.920629-1-sshegde@linux.vnet.ibm.com/
> [v2] Link: https://lore.kernel.org/lkml/20230901065249.137242-1-sshegde@linux.vnet.ibm.com/
> [v3] Link: https://lore.kernel.org/lkml/20230913114807.665094-1-sshegde@linux.vnet.ibm.com/
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> ---
>  Documentation/admin-guide/sysctl/kernel.rst |   3 +-
>  kernel/sched/topology.c                     | 107 +++++++++++++-------
>  2 files changed, 71 insertions(+), 39 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index cf33de56da27..d89ac2bd8dc4 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -1182,7 +1182,8 @@ automatically on platforms where it can run (that is,
>  platforms with asymmetric CPU topologies and having an Energy
>  Model available). If your platform happens to meet the
>  requirements for EAS but you do not want to use it, change
> -this value to 0.
> +this value to 0. On Non-EAS platforms, write operation fails and
> +read doesn't return anything.
> 
>  task_delayacct
>  ===============
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index a7b50bba7829..839ddc80a5ac 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -212,6 +212,64 @@ static unsigned int sysctl_sched_energy_aware = 1;
>  static DEFINE_MUTEX(sched_energy_mutex);
>  static bool sched_energy_update;
> 
> +extern struct cpufreq_governor schedutil_gov;
> +static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
> +{
> +	int cpu = cpumask_first(cpu_mask);
> +	struct cpufreq_policy *policy;
> +	struct cpufreq_governor *gov;
> +	int i;
> +
> +	/* EAS is enabled for asymmetric CPU capacity topologies. */
> +	if (!per_cpu(sd_asym_cpucapacity, cpu)) {
> +		if (sched_debug()) {
> +			pr_info("rd %*pbl: Checking EAS, CPUs do not have asymmetric capacities\n",
> +				cpumask_pr_args(cpu_mask));
> +		}
> +		return false;
> +	}
> +
> +	/* EAS definitely does *not* handle SMT */
> +	if (sched_smt_active()) {
> +		if (sched_debug()) {
> +			pr_info("rd %*pbl: Checking EAS, SMT is not supported\n",
> +				cpumask_pr_args(cpu_mask));
> +		}
> +		return false;
> +	}
> +
> +	if (!arch_scale_freq_invariant()) {
> +		if (sched_debug()) {
> +			pr_info("rd %*pbl: Checking EAS: frequency-invariant load tracking not yet supported",
> +				cpumask_pr_args(cpu_mask));
> +		}
> +		return false;
> +	}
> +
> +	/* Do not attempt EAS if schedutil is not being used. */
> +	for_each_cpu(i, cpu_mask) {
> +		policy = cpufreq_cpu_get(i);
> +		if (!policy) {
> +			if (sched_debug()) {
> +				pr_info("rd %*pbl: Checking EAS, cpufreq policy not set for CPU: %d",
> +					cpumask_pr_args(cpu_mask), i);
> +			}
> +			return false;
> +		}
> +		gov = policy->governor;
> +		cpufreq_cpu_put(policy);
> +		if (gov != &schedutil_gov) {
> +			if (sched_debug()) {
> +				pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
> +					cpumask_pr_args(cpu_mask));
> +			}
> +			return false;
> +		}
> +	}
> +
> +	return true;
> +}
> +
>  void rebuild_sched_domains_energy(void)
>  {
>  	mutex_lock(&sched_energy_mutex);
> @@ -231,6 +289,14 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
>  		return -EPERM;
> 
>  	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +	if (!sched_is_eas_possible(cpu_active_mask)) {
> +		if (write) {
> +			return -EOPNOTSUPP;
> +		} else {
> +			*lenp = 0;
> +			return 0;
> +		}
> +	}
>  	if (!ret && write) {
>  		state = static_branch_unlikely(&sched_energy_present);
>  		if (state != sysctl_sched_energy_aware)
> @@ -370,61 +436,24 @@ static void sched_energy_set(bool has_eas)
>   */
>  #define EM_MAX_COMPLEXITY 2048
> 
> -extern struct cpufreq_governor schedutil_gov;
>  static bool build_perf_domains(const struct cpumask *cpu_map)
>  {
>  	int i, nr_pd = 0, nr_ps = 0, nr_cpus = cpumask_weight(cpu_map);
>  	struct perf_domain *pd = NULL, *tmp;
>  	int cpu = cpumask_first(cpu_map);
>  	struct root_domain *rd = cpu_rq(cpu)->rd;
> -	struct cpufreq_policy *policy;
> -	struct cpufreq_governor *gov;
> 
>  	if (!sysctl_sched_energy_aware)
>  		goto free;
> 
> -	/* EAS is enabled for asymmetric CPU capacity topologies. */
> -	if (!per_cpu(sd_asym_cpucapacity, cpu)) {
> -		if (sched_debug()) {
> -			pr_info("rd %*pbl: CPUs do not have asymmetric capacities\n",
> -					cpumask_pr_args(cpu_map));
> -		}
> -		goto free;
> -	}
> -
> -	/* EAS definitely does *not* handle SMT */
> -	if (sched_smt_active()) {
> -		pr_warn("rd %*pbl: Disabling EAS, SMT is not supported\n",
> -			cpumask_pr_args(cpu_map));
> -		goto free;
> -	}
> -
> -	if (!arch_scale_freq_invariant()) {
> -		if (sched_debug()) {
> -			pr_warn("rd %*pbl: Disabling EAS: frequency-invariant load tracking not yet supported",
> -				cpumask_pr_args(cpu_map));
> -		}
> +	if (!sched_is_eas_possible(cpu_map))
>  		goto free;
> -	}
> 
>  	for_each_cpu(i, cpu_map) {
>  		/* Skip already covered CPUs. */
>  		if (find_pd(pd, i))
>  			continue;
> 
> -		/* Do not attempt EAS if schedutil is not being used. */
> -		policy = cpufreq_cpu_get(i);
> -		if (!policy)
> -			goto free;
> -		gov = policy->governor;
> -		cpufreq_cpu_put(policy);
> -		if (gov != &schedutil_gov) {
> -			if (rd->pd)
> -				pr_warn("rd %*pbl: Disabling EAS, schedutil is mandatory\n",
> -						cpumask_pr_args(cpu_map));
> -			goto free;
> -		}
> -
>  		/* Create the new pd and add it to the local list. */
>  		tmp = pd_init(i);
>  		if (!tmp)
> @@ -458,6 +487,8 @@ static bool build_perf_domains(const struct cpumask *cpu_map)
>  	return !!pd;
> 
>  free:
> +	if (sched_debug())
> +		pr_warn("rd %*pbl: Disabling EAS", cpumask_pr_args(cpu_map));
>  	free_pd(pd);
>  	tmp = rd->pd;
>  	rcu_assign_pointer(rd->pd, NULL);
> --
> 2.31.1
>
  
Dietmar Eggemann Sept. 26, 2023, 9:29 p.m. UTC | #2
On 26/09/2023 12:00, Shrikanth Hegde wrote:
> sysctl sched_energy_aware is available for the admin to disable/enable
> energy aware scheduling(EAS). EAS is enabled only if few conditions are
> met by the platform. They are, asymmetric CPU capacity, no SMT,
> valid cpufreq policy, frequency invariant load tracking etc. A platform may

s/valid cpufreq policy/Schedutil CPUfreq governor

+ EM complexity < EM_MAX_COMPLEXITY

> boot without EAS capability, but could gain such capability at runtime
> For example, changing/registering the cpufreq policy.

s/cpufreq policy/CPUfreq governor to Schedutil

> At present, though platform doesn't support EAS, this sysctl returns 1
> and it ends up calling rebuild of sched domain on write to 1 and

sched domains are not rebuild in this case, i.e.
partition_sched_domains_locked() does not call detach_destroy_domains()
or build_sched_domains(). Only build_perf_domains() is called which
bails out if !sysctl_sched_energy_aware or one of the EAS conditions is
not true.

> NOP when writing to 0. That is confusing and un-necessary.
> 
> Desired behavior would be to, have this sysctl to enable/disable the EAS
> on supported platform. On Non supported platform write to the sysctl
> would return not supported error and read of the sysctl would return
> empty. So
> sched_energy_aware returns empty - EAS is not possible at this moment
> sched_energy_aware returns 0 - EAS is supported but disabled by admin.
> sched_energy_aware returns 1 - EAS is supported and enabled.
> User can find out the reason why EAS is not possible by checking
> info messages.

This will include EAS capable platforms which have at least one EAS
condition false during startup, e.g. using a Performance CPUfreq governor:

root@juno:~# cat /proc/sys/kernel/sched_energy_aware

root@juno:~# echo 1 > /proc/sys/kernel/sched_energy_aware
echo: write error: Operation not supported

log messages:
...
[  160.902138] rd 0-5: Checking EAS, schedutil is mandatory
...
[  324.467341] rd 0-5: Checking EAS, schedutil is mandatory
...

root@juno:~# echo schedutil >
/sys/devices/system/cpu/cpufreq/policy0/scaling_governor
root@juno:~# echo schedutil >
/sys/devices/system/cpu/cpufreq/policy1/scaling_governor

log messages:
...
[  414.195073] root_domain 0-5: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
cpus=0,3-5 nr_pstate=5 }
[  414.210877] sched_energy_set: starting EAS
...

root@juno:~# cat /proc/sys/kernel/sched_energy_aware
1

root@juno:~# echo 0 > /proc/sys/kernel/sched_energy_aware

log messages:
...
[  414.195073] root_domain 0-5: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
cpus=0,3-5 nr_pstate=5 }
[  414.210877] sched_energy_set: starting EAS
...
[  544.482690] rd 0-5: Disabling EAS
[  544.493729] sched_energy_set: stopping EAS

> sched_is_eas_possible return if the platform can do EAS at this moment.

sched_is_eas_possible() returns

> It takes most of the cases into account except one where EM complexity is
> too high as the code was bit tricky to separate that.

I remember that there was a patch from Pierre to get rid of the EM
complexity check:

https://lkml.kernel.org/r/20221121094336.3250917-1-pierre.gondois@arm.com

If this makes this patch easier, maybe both patches can go in as a
patch-set together?

Although I don't see that not checking EM complexity in
sched_energy_aware_handler() -> sched_is_eas_possible() is an issue.

> 
> v3->v4:
> valentin suggested it would be better to consider simpler approach that
> was mentioned in v2. It is a standard approach to keep the knob visible
> but change how read and write are handled. Did that and Refactored the
> code to use a common function in build_perf_domains and in sysctl handler.
> v2->v3:
> Chen Yu and Pierre Gondois both pointed out that if platform becomes
> capable of EAS later, this patch was not allowing that to happen.
> Addressed that by using a variable to indicate the sysctl change
> and re-worded the commit message with desired behaviour,
> v1->v2:
> Chen Yu had pointed out that this will not destroy the perf domains on
> architectures where EAS is supported by changing the sysctl.
> [v1] Link: https://lore.kernel.org/lkml/20230829065040.920629-1-sshegde@linux.vnet.ibm.com/
> [v2] Link: https://lore.kernel.org/lkml/20230901065249.137242-1-sshegde@linux.vnet.ibm.com/
> [v3] Link: https://lore.kernel.org/lkml/20230913114807.665094-1-sshegde@linux.vnet.ibm.com/
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>

[...]

> @@ -231,6 +289,14 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
>  		return -EPERM;
> 
>  	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +	if (!sched_is_eas_possible(cpu_active_mask)) {

This is using `cpu_active_mask` so EAS can only be disabled system-wide.

So I experimented with an exclusive cpuset setup creating a symmetric
(cs1) and an asymmetric (cs2) island on my Juno with its cpumask = [l B
B l l l] (l - little CPU, B - Big CPU).

root@juno:~# cd /sys/fs/cgroup/cpuset
root@juno:/sys/fs/cgroup/cpuset# mkdir cs1
root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs1/cpuset.cpu_exclusive
root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs1/cpuset.mems
root@juno:/sys/fs/cgroup/cpuset# echo 4,5 > cs1/cpuset.cpus
root@juno:/sys/fs/cgroup/cpuset# mkdir cs2
root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs2/cpuset.cpu_exclusive
root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs2/cpuset.mems
root@juno:/sys/fs/cgroup/cpuset# echo 0-3 > cs2/cpuset.cpus
root@juno:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance

[ 3021.761278] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
cpus=0,3-5 nr_pstate=5 }

root@juno:~# echo 0 > /proc/sys/kernel/sched_energy_aware

log messages:
...
[ 3143.538583] rd 4-5: Disabling EAS
[ 3143.549569] rd 0-3: Disabling EAS
[ 3143.560681] sched_energy_set: stopping EAS
...

root@juno:~# echo 1 > /proc/sys/kernel/sched_energy_aware

log messages:
...
[ 3223.277521] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
cpus=0,3-5 nr_pstate=5 }
[ 3223.293409] sched_energy_set: starting EAS

Seems still to work correctly.

[...]

> @@ -458,6 +487,8 @@ static bool build_perf_domains(const struct cpumask *cpu_map)
>  	return !!pd;
> 
>  free:
> +	if (sched_debug())
> +		pr_warn("rd %*pbl: Disabling EAS", cpumask_pr_args(cpu_map));

Shouldn't this be used in condition `if (!sysctl_sched_energy_aware)`?
Otherwise we have 2 warnings when the other conditions which leads to
`goto free` aren't met.

[...]
  
Shrikanth Hegde Sept. 27, 2023, 8:14 a.m. UTC | #3
On 9/27/23 2:59 AM, Dietmar Eggemann wrote:
> On 26/09/2023 12:00, Shrikanth Hegde wrote:
Thank you very much Dietmar, for taking a look at this and trying this patch.

>> sysctl sched_energy_aware is available for the admin to disable/enable
>> energy aware scheduling(EAS). EAS is enabled only if few conditions are
>> met by the platform. They are, asymmetric CPU capacity, no SMT,
>> valid cpufreq policy, frequency invariant load tracking etc. A platform may
> 
> s/valid cpufreq policy/Schedutil CPUfreq governor
> 
> + EM complexity < EM_MAX_COMPLEXITY

ok. will add. 
> 
>> boot without EAS capability, but could gain such capability at runtime
>> For example, changing/registering the cpufreq policy.
> 
> s/cpufreq policy/CPUfreq governor to Schedutil

ok will change.

> 
>> At present, though platform doesn't support EAS, this sysctl returns 1
>> and it ends up calling rebuild of sched domain on write to 1 and
> 
> sched domains are not rebuild in this case, i.e.
> partition_sched_domains_locked() does not call detach_destroy_domains()
> or build_sched_domains(). Only build_perf_domains() is called which
> bails out if !sysctl_sched_energy_aware or one of the EAS conditions is
> not true.
> 

ok. that's because it goes to match1 and match2 right?

>> NOP when writing to 0. That is confusing and un-necessary.
>>
>> Desired behavior would be to, have this sysctl to enable/disable the EAS
>> on supported platform. On Non supported platform write to the sysctl
>> would return not supported error and read of the sysctl would return
>> empty. So
>> sched_energy_aware returns empty - EAS is not possible at this moment
>> sched_energy_aware returns 0 - EAS is supported but disabled by admin.
>> sched_energy_aware returns 1 - EAS is supported and enabled.
>> User can find out the reason why EAS is not possible by checking
>> info messages.
> 
> This will include EAS capable platforms which have at least one EAS
> condition false during startup, e.g. using a Performance CPUfreq governor:
> 

ok. will include that in the changelog.

> root@juno:~# cat /proc/sys/kernel/sched_energy_aware
> 
> root@juno:~# echo 1 > /proc/sys/kernel/sched_energy_aware
> echo: write error: Operation not supported
> 
> log messages:
> ...
> [  160.902138] rd 0-5: Checking EAS, schedutil is mandatory
> ...
> [  324.467341] rd 0-5: Checking EAS, schedutil is mandatory
> ...
> 
> root@juno:~# echo schedutil >
> /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
> root@juno:~# echo schedutil >
> /sys/devices/system/cpu/cpufreq/policy1/scaling_governor
> 
> log messages:
> ...
> [  414.195073] root_domain 0-5: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
> cpus=0,3-5 nr_pstate=5 }
> [  414.210877] sched_energy_set: starting EAS
> ...
> 
> root@juno:~# cat /proc/sys/kernel/sched_energy_aware
> 1
> 
> root@juno:~# echo 0 > /proc/sys/kernel/sched_energy_aware
> 
> log messages:
> ...
> [  414.195073] root_domain 0-5: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
> cpus=0,3-5 nr_pstate=5 }
> [  414.210877] sched_energy_set: starting EAS
> ...
> [  544.482690] rd 0-5: Disabling EAS
> [  544.493729] sched_energy_set: stopping EAS
> 

Thanks for testing.  Glad to see it works.

>> sched_is_eas_possible return if the platform can do EAS at this moment.
> 
> sched_is_eas_possible() returns
> 
>> It takes most of the cases into account except one where EM complexity is
>> too high as the code was bit tricky to separate that.
> 
> I remember that there was a patch from Pierre to get rid of the EM
> complexity check:
> 
> https://lkml.kernel.org/r/20221121094336.3250917-1-pierre.gondois@arm.com
> 
> If this makes this patch easier, maybe both patches can go in as a
> patch-set together?
> 
> Although I don't see that not checking EM complexity in
> sched_energy_aware_handler() -> sched_is_eas_possible() is an issue.
> 

ok. I can combine it with this patch and send it as patchset. 

>>
>> v3->v4:
>> valentin suggested it would be better to consider simpler approach that
>> was mentioned in v2. It is a standard approach to keep the knob visible
>> but change how read and write are handled. Did that and Refactored the
>> code to use a common function in build_perf_domains and in sysctl handler.
>> v2->v3:
>> Chen Yu and Pierre Gondois both pointed out that if platform becomes
>> capable of EAS later, this patch was not allowing that to happen.
>> Addressed that by using a variable to indicate the sysctl change
>> and re-worded the commit message with desired behaviour,
>> v1->v2:
>> Chen Yu had pointed out that this will not destroy the perf domains on
>> architectures where EAS is supported by changing the sysctl.
>> [v1] Link: https://lore.kernel.org/lkml/20230829065040.920629-1-sshegde@linux.vnet.ibm.com/
>> [v2] Link: https://lore.kernel.org/lkml/20230901065249.137242-1-sshegde@linux.vnet.ibm.com/
>> [v3] Link: https://lore.kernel.org/lkml/20230913114807.665094-1-sshegde@linux.vnet.ibm.com/
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> 
> [...]
> 
>> @@ -231,6 +289,14 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
>>  		return -EPERM;
>>
>>  	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>> +	if (!sched_is_eas_possible(cpu_active_mask)) {
> 
> This is using `cpu_active_mask` so EAS can only be disabled system-wide.
> 
> So I experimented with an exclusive cpuset setup creating a symmetric
> (cs1) and an asymmetric (cs2) island on my Juno with its cpumask = [l B
> B l l l] (l - little CPU, B - Big CPU).
> 
> root@juno:~# cd /sys/fs/cgroup/cpuset
> root@juno:/sys/fs/cgroup/cpuset# mkdir cs1
> root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs1/cpuset.cpu_exclusive
> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs1/cpuset.mems
> root@juno:/sys/fs/cgroup/cpuset# echo 4,5 > cs1/cpuset.cpus
> root@juno:/sys/fs/cgroup/cpuset# mkdir cs2
> root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs2/cpuset.cpu_exclusive
> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs2/cpuset.mems
> root@juno:/sys/fs/cgroup/cpuset# echo 0-3 > cs2/cpuset.cpus
> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> 
> [ 3021.761278] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
> cpus=0,3-5 nr_pstate=5 }
> 
> root@juno:~# echo 0 > /proc/sys/kernel/sched_energy_aware
> 
> log messages:
> ...
> [ 3143.538583] rd 4-5: Disabling EAS
> [ 3143.549569] rd 0-3: Disabling EAS
> [ 3143.560681] sched_energy_set: stopping EAS
> ...
> 
> root@juno:~# echo 1 > /proc/sys/kernel/sched_energy_aware
> 
> log messages:
> ...
> [ 3223.277521] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
> cpus=0,3-5 nr_pstate=5 }
> [ 3223.293409] sched_energy_set: starting EAS
> 
> Seems still to work correctly.

I see that can be a issue. using first cpu here check to asymmetric cpu capacity. 
It would have worked here, since the first group had asymmetry. ( l B B l ). 
It wouldn't have worked if the first group had like ( l l ) and second group has ( l B B l )


Instead of cpu_active_mask, I can make use of ndoms_cur and doms_cur[i] logic to 
traverse through possible doms, and if none of the domains have asymmetric cpu capacity
return false.  Does that look right?

> 
> [...]
> 
>> @@ -458,6 +487,8 @@ static bool build_perf_domains(const struct cpumask *cpu_map)
>>  	return !!pd;
>>
>>  free:
>> +	if (sched_debug())
>> +		pr_warn("rd %*pbl: Disabling EAS", cpumask_pr_args(cpu_map));
> 
> Shouldn't this be used in condition `if (!sysctl_sched_energy_aware)`?
> Otherwise we have 2 warnings when the other conditions which leads to
> `goto free` aren't met.
Since sched_energy_set has the info messages about start and stop of EAS, maybe 
this debug is not needed. Will remove it. Doing it only `if (!sysctl_sched_energy_aware)`
also doesn't seem correct, as calling free in this function would free the perf_domains.  

> 
> [...]
  
Dietmar Eggemann Sept. 27, 2023, 1:14 p.m. UTC | #4
Ah, BTW s/quentin.perret@arm.com/qperret@google.com

On 27/09/2023 10:14, Shrikanth Hegde wrote:
> 
> 
> On 9/27/23 2:59 AM, Dietmar Eggemann wrote:
>> On 26/09/2023 12:00, Shrikanth Hegde wrote:

[...]

>>> At present, though platform doesn't support EAS, this sysctl returns 1
>>> and it ends up calling rebuild of sched domain on write to 1 and
>>
>> sched domains are not rebuild in this case, i.e.
>> partition_sched_domains_locked() does not call detach_destroy_domains()
>> or build_sched_domains(). Only build_perf_domains() is called which
>> bails out if !sysctl_sched_energy_aware or one of the EAS conditions is
>> not true.
>>
> 
> ok. that's because it goes to match1 and match2 right?

yes.

[...]

>>> @@ -231,6 +289,14 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
>>>  		return -EPERM;
>>>
>>>  	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>>> +	if (!sched_is_eas_possible(cpu_active_mask)) {
>>
>> This is using `cpu_active_mask` so EAS can only be disabled system-wide.
>>
>> So I experimented with an exclusive cpuset setup creating a symmetric
>> (cs1) and an asymmetric (cs2) island on my Juno with its cpumask = [l B
>> B l l l] (l - little CPU, B - Big CPU).
>>
>> root@juno:~# cd /sys/fs/cgroup/cpuset
>> root@juno:/sys/fs/cgroup/cpuset# mkdir cs1
>> root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs1/cpuset.cpu_exclusive
>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs1/cpuset.mems
>> root@juno:/sys/fs/cgroup/cpuset# echo 4,5 > cs1/cpuset.cpus
>> root@juno:/sys/fs/cgroup/cpuset# mkdir cs2
>> root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs2/cpuset.cpu_exclusive
>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs2/cpuset.mems
>> root@juno:/sys/fs/cgroup/cpuset# echo 0-3 > cs2/cpuset.cpus
>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
>>
>> [ 3021.761278] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
>> cpus=0,3-5 nr_pstate=5 }
>>
>> root@juno:~# echo 0 > /proc/sys/kernel/sched_energy_aware
>>
>> log messages:
>> ...
>> [ 3143.538583] rd 4-5: Disabling EAS
>> [ 3143.549569] rd 0-3: Disabling EAS
>> [ 3143.560681] sched_energy_set: stopping EAS
>> ...
>>
>> root@juno:~# echo 1 > /proc/sys/kernel/sched_energy_aware
>>
>> log messages:
>> ...
>> [ 3223.277521] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
>> cpus=0,3-5 nr_pstate=5 }
>> [ 3223.293409] sched_energy_set: starting EAS
>>
>> Seems still to work correctly.
> 
> I see that can be a issue. using first cpu here check to asymmetric cpu capacity. 
> It would have worked here, since the first group had asymmetry. ( l B B l ). 
> It wouldn't have worked if the first group had like ( l l ) and second group has ( l B B l )

Yeah, that's true.

  sched_is_eas_possible(const struct cpumask *cpu_mask)

    !per_cpu(sd_asym_cpucapacity, cpumask_first(cpu_mask));

cpusets cs1=[0,5] and cs2=[1-4] as such an example.


> Instead of cpu_active_mask, I can make use of ndoms_cur and doms_cur[i] logic to 
> traverse through possible doms, and if none of the domains have asymmetric cpu capacity
> return false.  Does that look right?

  rebuild_sched_domains()

    rebuild_sched_domains_locked()

      ndoms = generate_sched_domains(&doms, &attr)

You would need generate_sched_domains() in sched_energy_aware_handler()?

>> [...]
>>
>>> @@ -458,6 +487,8 @@ static bool build_perf_domains(const struct cpumask *cpu_map)
>>>  	return !!pd;
>>>
>>>  free:
>>> +	if (sched_debug())
>>> +		pr_warn("rd %*pbl: Disabling EAS", cpumask_pr_args(cpu_map));
>>
>> Shouldn't this be used in condition `if (!sysctl_sched_energy_aware)`?
>> Otherwise we have 2 warnings when the other conditions which leads to
>> `goto free` aren't met.
> Since sched_energy_set has the info messages about start and stop of EAS, maybe 
> this debug is not needed. Will remove it. Doing it only `if (!sysctl_sched_energy_aware)`

OK.

> also doesn't seem correct, as calling free in this function would free the perf_domains.  

But !sched_is_eas_possible() also does `goto free` and in there we we
emit pr_info's indicating why EAS isn't possible right now.

When issuing a:

# echo 0 > /proc/sys/kernel/sched_energy_aware

we would see in the logs:

...
[  416.325324] rd 0-5: sysctl_sched_energy_aware is N   <-- (*)
[  416.337844] sched_energy_set: stopping EAS
...

but maybe (*) is not necessary ...
  
Shrikanth Hegde Sept. 27, 2023, 5:08 p.m. UTC | #5
On 9/27/23 6:44 PM, Dietmar Eggemann wrote:
> Ah, BTW s/quentin.perret@arm.com/qperret@google.com
> 
> On 27/09/2023 10:14, Shrikanth Hegde wrote:
>>
>>
>> On 9/27/23 2:59 AM, Dietmar Eggemann wrote:
>>> On 26/09/2023 12:00, Shrikanth Hegde wrote:
> 
> [...]
> 
>>>> At present, though platform doesn't support EAS, this sysctl returns 1
>>>> and it ends up calling rebuild of sched domain on write to 1 and
>>>
>>> sched domains are not rebuild in this case, i.e.
>>> partition_sched_domains_locked() does not call detach_destroy_domains()
>>> or build_sched_domains(). Only build_perf_domains() is called which
>>> bails out if !sysctl_sched_energy_aware or one of the EAS conditions is
>>> not true.
>>>
>>
>> ok. that's because it goes to match1 and match2 right?
> 
> yes.
> 
> [...]
> 
>>>> @@ -231,6 +289,14 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
>>>>  		return -EPERM;
>>>>
>>>>  	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>>>> +	if (!sched_is_eas_possible(cpu_active_mask)) {
>>>
>>> This is using `cpu_active_mask` so EAS can only be disabled system-wide.
>>>
>>> So I experimented with an exclusive cpuset setup creating a symmetric
>>> (cs1) and an asymmetric (cs2) island on my Juno with its cpumask = [l B
>>> B l l l] (l - little CPU, B - Big CPU).
>>>
>>> root@juno:~# cd /sys/fs/cgroup/cpuset
>>> root@juno:/sys/fs/cgroup/cpuset# mkdir cs1
>>> root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs1/cpuset.cpu_exclusive
>>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs1/cpuset.mems
>>> root@juno:/sys/fs/cgroup/cpuset# echo 4,5 > cs1/cpuset.cpus
>>> root@juno:/sys/fs/cgroup/cpuset# mkdir cs2
>>> root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs2/cpuset.cpu_exclusive
>>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs2/cpuset.mems
>>> root@juno:/sys/fs/cgroup/cpuset# echo 0-3 > cs2/cpuset.cpus
>>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
>>>
>>> [ 3021.761278] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
>>> cpus=0,3-5 nr_pstate=5 }
>>>
>>> root@juno:~# echo 0 > /proc/sys/kernel/sched_energy_aware
>>>
>>> log messages:
>>> ...
>>> [ 3143.538583] rd 4-5: Disabling EAS
>>> [ 3143.549569] rd 0-3: Disabling EAS
>>> [ 3143.560681] sched_energy_set: stopping EAS
>>> ...
>>>
>>> root@juno:~# echo 1 > /proc/sys/kernel/sched_energy_aware
>>>
>>> log messages:
>>> ...
>>> [ 3223.277521] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
>>> cpus=0,3-5 nr_pstate=5 }
>>> [ 3223.293409] sched_energy_set: starting EAS
>>>
>>> Seems still to work correctly.
>>
>> I see that can be a issue. using first cpu here check to asymmetric cpu capacity. 
>> It would have worked here, since the first group had asymmetry. ( l B B l ). 
>> It wouldn't have worked if the first group had like ( l l ) and second group has ( l B B l )
> 
> Yeah, that's true.
> 
>   sched_is_eas_possible(const struct cpumask *cpu_mask)
> 
>     !per_cpu(sd_asym_cpucapacity, cpumask_first(cpu_mask));
> 
> cpusets cs1=[0,5] and cs2=[1-4] as such an example.
> 

right. 

> 
>> Instead of cpu_active_mask, I can make use of ndoms_cur and doms_cur[i] logic to 
>> traverse through possible doms, and if none of the domains have asymmetric cpu capacity
>> return false.  Does that look right?
> 
>   rebuild_sched_domains()
> 
>     rebuild_sched_domains_locked()
> 
>       ndoms = generate_sched_domains(&doms, &attr)
> 
> You would need generate_sched_domains() in sched_energy_aware_handler()?
> 

clearly I didnt think through this. ndoms_cur and doms_cur are updated at the end. 
So If EAS is possible at boot, this would fail. 


What  sched_is_eas_possible needs is if there is asymmetric cpu topology. 
Simpler loop of all CPU's and checking if there any of them have sd_asym_cpucapacity might do,
though it goes through all CPU's, Since these functions are not in hot path
So it should not affect any performance. Something like below would work?

	bool any_asym_capacity = false;

        /* EAS is enabled for asymmetric CPU capacity topologies. */
        for_each_cpu(i, cpu_mask) {
                if (per_cpu(sd_asym_cpucapacity, i)) {
                        any_asym_capacity = true;
                        break;
                }
        }
        if (!any_asym_capacity) {
                if (sched_debug()) {
                        pr_info("rd %*pbl: Checking EAS, CPUs do not have asymmetric capacities\n",
                                cpumask_pr_args(cpu_mask));
                }
                return false;
        }



>>> [...]
>>>
>>>> @@ -458,6 +487,8 @@ static bool build_perf_domains(const struct cpumask *cpu_map)
>>>>  	return !!pd;
>>>>
>>>>  free:
>>>> +	if (sched_debug())
>>>> +		pr_warn("rd %*pbl: Disabling EAS", cpumask_pr_args(cpu_map));
>>>
>>> Shouldn't this be used in condition `if (!sysctl_sched_energy_aware)`?
>>> Otherwise we have 2 warnings when the other conditions which leads to
>>> `goto free` aren't met.
>> Since sched_energy_set has the info messages about start and stop of EAS, maybe 
>> this debug is not needed. Will remove it. Doing it only `if (!sysctl_sched_energy_aware)`
> 
> OK.
> 
>> also doesn't seem correct, as calling free in this function would free the perf_domains.  
> 
> But !sched_is_eas_possible() also does `goto free` and in there we we
> emit pr_info's indicating why EAS isn't possible right now.
> 
> When issuing a:
> 
> # echo 0 > /proc/sys/kernel/sched_energy_aware
> 
> we would see in the logs:
> 
> ...
> [  416.325324] rd 0-5: sysctl_sched_energy_aware is N   <-- (*)
> [  416.337844] sched_energy_set: stopping EAS
> ...
> 
> but maybe (*) is not necessary ...

ok. I think we can add similar info message in if(!sysctl_sched_energy_aware) like below?
Would that be enough? 

        if (!sysctl_sched_energy_aware) {
                if (sched_debug()) {
                        pr_info("rd %*pbl: Checking EAS, sysclt sched_energy_aware set to N\n",
                                cpumask_pr_args(cpu_map));
                }
                goto free;
        }

and remove the below one.
        if (sched_debug())
                pr_warn("rd %*pbl: Disabling EAS", cpumask_pr_args(cpu_map));


So there will be these "Checking EAS" and if possible, "starting EAS" or "stopping EAS" message.
  
Pierre Gondois Sept. 28, 2023, 9:25 a.m. UTC | #6
Hello Shrikanth, Dietmar,

On 9/27/23 19:08, Shrikanth Hegde wrote:
> 
> 
> On 9/27/23 6:44 PM, Dietmar Eggemann wrote:
>> Ah, BTW s/quentin.perret@arm.com/qperret@google.com
>>
>> On 27/09/2023 10:14, Shrikanth Hegde wrote:
>>>
>>>
>>> On 9/27/23 2:59 AM, Dietmar Eggemann wrote:
>>>> On 26/09/2023 12:00, Shrikanth Hegde wrote:
>>
>> [...]
>>
>>>>> At present, though platform doesn't support EAS, this sysctl returns 1
>>>>> and it ends up calling rebuild of sched domain on write to 1 and
>>>>
>>>> sched domains are not rebuild in this case, i.e.
>>>> partition_sched_domains_locked() does not call detach_destroy_domains()
>>>> or build_sched_domains(). Only build_perf_domains() is called which
>>>> bails out if !sysctl_sched_energy_aware or one of the EAS conditions is
>>>> not true.
>>>>
>>>
>>> ok. that's because it goes to match1 and match2 right?
>>
>> yes.
>>
>> [...]
>>
>>>>> @@ -231,6 +289,14 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
>>>>>   		return -EPERM;
>>>>>
>>>>>   	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>>>>> +	if (!sched_is_eas_possible(cpu_active_mask)) {
>>>>
>>>> This is using `cpu_active_mask` so EAS can only be disabled system-wide.
>>>>
>>>> So I experimented with an exclusive cpuset setup creating a symmetric
>>>> (cs1) and an asymmetric (cs2) island on my Juno with its cpumask = [l B
>>>> B l l l] (l - little CPU, B - Big CPU).
>>>>
>>>> root@juno:~# cd /sys/fs/cgroup/cpuset
>>>> root@juno:/sys/fs/cgroup/cpuset# mkdir cs1
>>>> root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs1/cpuset.cpu_exclusive
>>>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs1/cpuset.mems
>>>> root@juno:/sys/fs/cgroup/cpuset# echo 4,5 > cs1/cpuset.cpus
>>>> root@juno:/sys/fs/cgroup/cpuset# mkdir cs2
>>>> root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs2/cpuset.cpu_exclusive
>>>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs2/cpuset.mems
>>>> root@juno:/sys/fs/cgroup/cpuset# echo 0-3 > cs2/cpuset.cpus
>>>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
>>>>
>>>> [ 3021.761278] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
>>>> cpus=0,3-5 nr_pstate=5 }
>>>>
>>>> root@juno:~# echo 0 > /proc/sys/kernel/sched_energy_aware
>>>>
>>>> log messages:
>>>> ...
>>>> [ 3143.538583] rd 4-5: Disabling EAS
>>>> [ 3143.549569] rd 0-3: Disabling EAS
>>>> [ 3143.560681] sched_energy_set: stopping EAS
>>>> ...
>>>>
>>>> root@juno:~# echo 1 > /proc/sys/kernel/sched_energy_aware
>>>>
>>>> log messages:
>>>> ...
>>>> [ 3223.277521] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
>>>> cpus=0,3-5 nr_pstate=5 }
>>>> [ 3223.293409] sched_energy_set: starting EAS
>>>>
>>>> Seems still to work correctly.
>>>
>>> I see that can be a issue. using first cpu here check to asymmetric cpu capacity.
>>> It would have worked here, since the first group had asymmetry. ( l B B l ).
>>> It wouldn't have worked if the first group had like ( l l ) and second group has ( l B B l )
>>
>> Yeah, that's true.
>>
>>    sched_is_eas_possible(const struct cpumask *cpu_mask)
>>
>>      !per_cpu(sd_asym_cpucapacity, cpumask_first(cpu_mask));
>>
>> cpusets cs1=[0,5] and cs2=[1-4] as such an example.
>>
> 
> right.
> 
>>
>>> Instead of cpu_active_mask, I can make use of ndoms_cur and doms_cur[i] logic to
>>> traverse through possible doms, and if none of the domains have asymmetric cpu capacity
>>> return false.  Does that look right?
>>
>>    rebuild_sched_domains()
>>
>>      rebuild_sched_domains_locked()
>>
>>        ndoms = generate_sched_domains(&doms, &attr)
>>
>> You would need generate_sched_domains() in sched_energy_aware_handler()?
>>
> 
> clearly I didnt think through this. ndoms_cur and doms_cur are updated at the end.
> So If EAS is possible at boot, this would fail.
> 
> 
> What  sched_is_eas_possible needs is if there is asymmetric cpu topology.
> Simpler loop of all CPU's and checking if there any of them have sd_asym_cpucapacity might do,
> though it goes through all CPU's, Since these functions are not in hot path
> So it should not affect any performance. Something like below would work?
> 
> 	bool any_asym_capacity = false;
> 
>          /* EAS is enabled for asymmetric CPU capacity topologies. */
>          for_each_cpu(i, cpu_mask) {
>                  if (per_cpu(sd_asym_cpucapacity, i)) {
>                          any_asym_capacity = true;
>                          break;
>                  }
>          }
>          if (!any_asym_capacity) {
>                  if (sched_debug()) {
>                          pr_info("rd %*pbl: Checking EAS, CPUs do not have asymmetric capacities\n",
>                                  cpumask_pr_args(cpu_mask));
>                  }
>                  return false;
>          }
> 

FYIW I could reproduce the issue mentioned above, and the suggested bit
seems to solve it.

> 
> 
>>>> [...]
>>>>
>>>>> @@ -458,6 +487,8 @@ static bool build_perf_domains(const struct cpumask *cpu_map)
>>>>>   	return !!pd;
>>>>>
>>>>>   free:
>>>>> +	if (sched_debug())
>>>>> +		pr_warn("rd %*pbl: Disabling EAS", cpumask_pr_args(cpu_map));
>>>>
>>>> Shouldn't this be used in condition `if (!sysctl_sched_energy_aware)`?
>>>> Otherwise we have 2 warnings when the other conditions which leads to
>>>> `goto free` aren't met.
>>> Since sched_energy_set has the info messages about start and stop of EAS, maybe
>>> this debug is not needed. Will remove it. Doing it only `if (!sysctl_sched_energy_aware)`
>>
>> OK.
>>
>>> also doesn't seem correct, as calling free in this function would free the perf_domains.
>>
>> But !sched_is_eas_possible() also does `goto free` and in there we we
>> emit pr_info's indicating why EAS isn't possible right now.
>>
>> When issuing a:
>>
>> # echo 0 > /proc/sys/kernel/sched_energy_aware
>>
>> we would see in the logs:
>>
>> ...
>> [  416.325324] rd 0-5: sysctl_sched_energy_aware is N   <-- (*)
>> [  416.337844] sched_energy_set: stopping EAS
>> ...
>>
>> but maybe (*) is not necessary ...

I m not sure I understand 100% the point, but it seems to me that when
changing sysctl_sched_energy_aware's value, either:
- EAS is not possible, and sched_is_eas_possible() will output the reason why
   (i.e. "Checking EAS, [...]")
- EAS was deactivated by the user, and it is possible to check the value of
   sysctl_sched_energy_aware. So there would be no need to have an additional
   message "Checking EAS, sysclt sched_energy_aware set to N"

When build_perf_domains() is called while rebuilding the sched domains, it is also
possible to check sched_energy_aware's value and understand why EAS is not enabled.

> 
> ok. I think we can add similar info message in if(!sysctl_sched_energy_aware) like below?
> Would that be enough?
> 
>          if (!sysctl_sched_energy_aware) {
>                  if (sched_debug()) {
>                          pr_info("rd %*pbl: Checking EAS, sysclt sched_energy_aware set to N\n",
>                                  cpumask_pr_args(cpu_map));
>                  }

(No need for brackets here, just in case)

>                  goto free;
>          }
> 
> and remove the below one.
>          if (sched_debug())
>                  pr_warn("rd %*pbl: Disabling EAS", cpumask_pr_args(cpu_map));
> 
> 
> So there will be these "Checking EAS" and if possible, "starting EAS" or "stopping EAS" message.


I will rebase the patch removing the EM complexity and check it/resend it,
like this the 2 patches could go together:
https://lore.kernel.org/lkml/20221121094336.3250917-1-pierre.gondois@arm.com/

Regards,
Pierre
  
Pierre Gondois Sept. 28, 2023, 9:54 a.m. UTC | #7
On 9/28/23 11:25, Pierre Gondois wrote:
> Hello Shrikanth, Dietmar,
> 
> On 9/27/23 19:08, Shrikanth Hegde wrote:
>>
>>
>> On 9/27/23 6:44 PM, Dietmar Eggemann wrote:
>>> Ah, BTW s/quentin.perret@arm.com/qperret@google.com
>>>
>>> On 27/09/2023 10:14, Shrikanth Hegde wrote:
>>>>
>>>>
>>>> On 9/27/23 2:59 AM, Dietmar Eggemann wrote:
>>>>> On 26/09/2023 12:00, Shrikanth Hegde wrote:
>>>
>>> [...]
>>>
>>>>>> At present, though platform doesn't support EAS, this sysctl returns 1
>>>>>> and it ends up calling rebuild of sched domain on write to 1 and
>>>>>
>>>>> sched domains are not rebuild in this case, i.e.
>>>>> partition_sched_domains_locked() does not call detach_destroy_domains()
>>>>> or build_sched_domains(). Only build_perf_domains() is called which
>>>>> bails out if !sysctl_sched_energy_aware or one of the EAS conditions is
>>>>> not true.
>>>>>
>>>>
>>>> ok. that's because it goes to match1 and match2 right?
>>>
>>> yes.
>>>
>>> [...]
>>>
>>>>>> @@ -231,6 +289,14 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
>>>>>>    		return -EPERM;
>>>>>>
>>>>>>    	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>>>>>> +	if (!sched_is_eas_possible(cpu_active_mask)) {
>>>>>
>>>>> This is using `cpu_active_mask` so EAS can only be disabled system-wide.
>>>>>
>>>>> So I experimented with an exclusive cpuset setup creating a symmetric
>>>>> (cs1) and an asymmetric (cs2) island on my Juno with its cpumask = [l B
>>>>> B l l l] (l - little CPU, B - Big CPU).
>>>>>
>>>>> root@juno:~# cd /sys/fs/cgroup/cpuset
>>>>> root@juno:/sys/fs/cgroup/cpuset# mkdir cs1
>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs1/cpuset.cpu_exclusive
>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs1/cpuset.mems
>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 4,5 > cs1/cpuset.cpus
>>>>> root@juno:/sys/fs/cgroup/cpuset# mkdir cs2
>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs2/cpuset.cpu_exclusive
>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs2/cpuset.mems
>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 0-3 > cs2/cpuset.cpus
>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
>>>>>
>>>>> [ 3021.761278] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
>>>>> cpus=0,3-5 nr_pstate=5 }
>>>>>
>>>>> root@juno:~# echo 0 > /proc/sys/kernel/sched_energy_aware
>>>>>
>>>>> log messages:
>>>>> ...
>>>>> [ 3143.538583] rd 4-5: Disabling EAS
>>>>> [ 3143.549569] rd 0-3: Disabling EAS
>>>>> [ 3143.560681] sched_energy_set: stopping EAS
>>>>> ...
>>>>>
>>>>> root@juno:~# echo 1 > /proc/sys/kernel/sched_energy_aware
>>>>>
>>>>> log messages:
>>>>> ...
>>>>> [ 3223.277521] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
>>>>> cpus=0,3-5 nr_pstate=5 }
>>>>> [ 3223.293409] sched_energy_set: starting EAS
>>>>>
>>>>> Seems still to work correctly.
>>>>
>>>> I see that can be a issue. using first cpu here check to asymmetric cpu capacity.
>>>> It would have worked here, since the first group had asymmetry. ( l B B l ).
>>>> It wouldn't have worked if the first group had like ( l l ) and second group has ( l B B l )
>>>
>>> Yeah, that's true.
>>>
>>>     sched_is_eas_possible(const struct cpumask *cpu_mask)
>>>
>>>       !per_cpu(sd_asym_cpucapacity, cpumask_first(cpu_mask));
>>>
>>> cpusets cs1=[0,5] and cs2=[1-4] as such an example.
>>>
>>
>> right.
>>
>>>
>>>> Instead of cpu_active_mask, I can make use of ndoms_cur and doms_cur[i] logic to
>>>> traverse through possible doms, and if none of the domains have asymmetric cpu capacity
>>>> return false.  Does that look right?
>>>
>>>     rebuild_sched_domains()
>>>
>>>       rebuild_sched_domains_locked()
>>>
>>>         ndoms = generate_sched_domains(&doms, &attr)
>>>
>>> You would need generate_sched_domains() in sched_energy_aware_handler()?
>>>
>>
>> clearly I didnt think through this. ndoms_cur and doms_cur are updated at the end.
>> So If EAS is possible at boot, this would fail.
>>
>>
>> What  sched_is_eas_possible needs is if there is asymmetric cpu topology.
>> Simpler loop of all CPU's and checking if there any of them have sd_asym_cpucapacity might do,
>> though it goes through all CPU's, Since these functions are not in hot path
>> So it should not affect any performance. Something like below would work?
>>
>> 	bool any_asym_capacity = false;
>>
>>           /* EAS is enabled for asymmetric CPU capacity topologies. */
>>           for_each_cpu(i, cpu_mask) {
>>                   if (per_cpu(sd_asym_cpucapacity, i)) {
>>                           any_asym_capacity = true;
>>                           break;
>>                   }
>>           }
>>           if (!any_asym_capacity) {
>>                   if (sched_debug()) {
>>                           pr_info("rd %*pbl: Checking EAS, CPUs do not have asymmetric capacities\n",
>>                                   cpumask_pr_args(cpu_mask));
>>                   }
>>                   return false;
>>           }
>>
> 
> FYIW I could reproduce the issue mentioned above, and the suggested bit
> seems to solve it.
> 
>>
>>
>>>>> [...]
>>>>>
>>>>>> @@ -458,6 +487,8 @@ static bool build_perf_domains(const struct cpumask *cpu_map)
>>>>>>    	return !!pd;
>>>>>>
>>>>>>    free:
>>>>>> +	if (sched_debug())
>>>>>> +		pr_warn("rd %*pbl: Disabling EAS", cpumask_pr_args(cpu_map));
>>>>>
>>>>> Shouldn't this be used in condition `if (!sysctl_sched_energy_aware)`?
>>>>> Otherwise we have 2 warnings when the other conditions which leads to
>>>>> `goto free` aren't met.
>>>> Since sched_energy_set has the info messages about start and stop of EAS, maybe
>>>> this debug is not needed. Will remove it. Doing it only `if (!sysctl_sched_energy_aware)`
>>>
>>> OK.
>>>
>>>> also doesn't seem correct, as calling free in this function would free the perf_domains.
>>>
>>> But !sched_is_eas_possible() also does `goto free` and in there we we
>>> emit pr_info's indicating why EAS isn't possible right now.
>>>
>>> When issuing a:
>>>
>>> # echo 0 > /proc/sys/kernel/sched_energy_aware
>>>
>>> we would see in the logs:
>>>
>>> ...
>>> [  416.325324] rd 0-5: sysctl_sched_energy_aware is N   <-- (*)
>>> [  416.337844] sched_energy_set: stopping EAS
>>> ...
>>>
>>> but maybe (*) is not necessary ...
> 
> I m not sure I understand 100% the point, but it seems to me that when
> changing sysctl_sched_energy_aware's value, either:
> - EAS is not possible, and sched_is_eas_possible() will output the reason why
>     (i.e. "Checking EAS, [...]")
> - EAS was deactivated by the user, and it is possible to check the value of
>     sysctl_sched_energy_aware. So there would be no need to have an additional
>     message "Checking EAS, sysclt sched_energy_aware set to N"
> 
> When build_perf_domains() is called while rebuilding the sched domains, it is also
> possible to check sched_energy_aware's value and understand why EAS is not enabled.
> 
>>
>> ok. I think we can add similar info message in if(!sysctl_sched_energy_aware) like below?
>> Would that be enough?
>>
>>           if (!sysctl_sched_energy_aware) {
>>                   if (sched_debug()) {
>>                           pr_info("rd %*pbl: Checking EAS, sysclt sched_energy_aware set to N\n",
>>                                   cpumask_pr_args(cpu_map));
>>                   }
> 
> (No need for brackets here, just in case)
> 
>>                   goto free;
>>           }
>>
>> and remove the below one.
>>           if (sched_debug())
>>                   pr_warn("rd %*pbl: Disabling EAS", cpumask_pr_args(cpu_map));
>>
>>
>> So there will be these "Checking EAS" and if possible, "starting EAS" or "stopping EAS" message.
> 
> 
> I will rebase the patch removing the EM complexity and check it/resend it,
> like this the 2 patches could go together:
> https://lore.kernel.org/lkml/20221121094336.3250917-1-pierre.gondois@arm.com/

The patch actually seems to apply cleanly on v6.6-rc3, and the complexity of
find_energy_efficient_cpu() seems to be the same as what it was when the patch
was sent, so I believe you can:
- Rebase your patch on top of it
- Provide a pointer to it in the changelog to mention the dependency
- Remove this bit in the commit message of your patch:
"It takes most of the cases into account except one where EM complexity is
too high as the code was bit tricky to separate that."

Regards,
Pierre
  
Shrikanth Hegde Sept. 29, 2023, 8:48 a.m. UTC | #8
On 9/28/23 3:24 PM, Pierre Gondois wrote:
> 
> 
> On 9/28/23 11:25, Pierre Gondois wrote:
>> Hello Shrikanth, Dietmar,
>>
>> On 9/27/23 19:08, Shrikanth Hegde wrote:
>>>
>>>
>>> On 9/27/23 6:44 PM, Dietmar Eggemann wrote:
>>>> Ah, BTW s/quentin.perret@arm.com/qperret@google.com
>>>>
>>>> On 27/09/2023 10:14, Shrikanth Hegde wrote:
>>>>>
>>>>>
>>>>> On 9/27/23 2:59 AM, Dietmar Eggemann wrote:
>>>>>> On 26/09/2023 12:00, Shrikanth Hegde wrote:
>>>>
>>>> [...]
>>>>
>>>>>>> At present, though platform doesn't support EAS, this sysctl
>>>>>>> returns 1
>>>>>>> and it ends up calling rebuild of sched domain on write to 1 and
>>>>>>
>>>>>> sched domains are not rebuild in this case, i.e.
>>>>>> partition_sched_domains_locked() does not call
>>>>>> detach_destroy_domains()
>>>>>> or build_sched_domains(). Only build_perf_domains() is called which
>>>>>> bails out if !sysctl_sched_energy_aware or one of the EAS
>>>>>> conditions is
>>>>>> not true.
>>>>>>
>>>>>
>>>>> ok. that's because it goes to match1 and match2 right?
>>>>
>>>> yes.
>>>>
>>>> [...]
>>>>
>>>>>>> @@ -231,6 +289,14 @@ static int sched_energy_aware_handler(struct
>>>>>>> ctl_table *table, int write,
>>>>>>>            return -EPERM;
>>>>>>>
>>>>>>>        ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>>>>>>> +    if (!sched_is_eas_possible(cpu_active_mask)) {
>>>>>>
>>>>>> This is using `cpu_active_mask` so EAS can only be disabled
>>>>>> system-wide.
>>>>>>
>>>>>> So I experimented with an exclusive cpuset setup creating a symmetric
>>>>>> (cs1) and an asymmetric (cs2) island on my Juno with its cpumask =
>>>>>> [l B
>>>>>> B l l l] (l - little CPU, B - Big CPU).
>>>>>>
>>>>>> root@juno:~# cd /sys/fs/cgroup/cpuset
>>>>>> root@juno:/sys/fs/cgroup/cpuset# mkdir cs1
>>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs1/cpuset.cpu_exclusive
>>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs1/cpuset.mems
>>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 4,5 > cs1/cpuset.cpus
>>>>>> root@juno:/sys/fs/cgroup/cpuset# mkdir cs2
>>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 1 > cs2/cpuset.cpu_exclusive
>>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cs2/cpuset.mems
>>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 0-3 > cs2/cpuset.cpus
>>>>>> root@juno:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
>>>>>>
>>>>>> [ 3021.761278] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
>>>>>> cpus=0,3-5 nr_pstate=5 }
>>>>>>
>>>>>> root@juno:~# echo 0 > /proc/sys/kernel/sched_energy_aware
>>>>>>
>>>>>> log messages:
>>>>>> ...
>>>>>> [ 3143.538583] rd 4-5: Disabling EAS
>>>>>> [ 3143.549569] rd 0-3: Disabling EAS
>>>>>> [ 3143.560681] sched_energy_set: stopping EAS
>>>>>> ...
>>>>>>
>>>>>> root@juno:~# echo 1 > /proc/sys/kernel/sched_energy_aware
>>>>>>
>>>>>> log messages:
>>>>>> ...
>>>>>> [ 3223.277521] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
>>>>>> cpus=0,3-5 nr_pstate=5 }
>>>>>> [ 3223.293409] sched_energy_set: starting EAS
>>>>>>
>>>>>> Seems still to work correctly.
>>>>>
>>>>> I see that can be a issue. using first cpu here check to asymmetric
>>>>> cpu capacity.
>>>>> It would have worked here, since the first group had asymmetry. ( l
>>>>> B B l ).
>>>>> It wouldn't have worked if the first group had like ( l l ) and
>>>>> second group has ( l B B l )
>>>>
>>>> Yeah, that's true.
>>>>
>>>>     sched_is_eas_possible(const struct cpumask *cpu_mask)
>>>>
>>>>       !per_cpu(sd_asym_cpucapacity, cpumask_first(cpu_mask));
>>>>
>>>> cpusets cs1=[0,5] and cs2=[1-4] as such an example.
>>>>
>>>
>>> right.
>>>
>>>>
>>>>> Instead of cpu_active_mask, I can make use of ndoms_cur and
>>>>> doms_cur[i] logic to
>>>>> traverse through possible doms, and if none of the domains have
>>>>> asymmetric cpu capacity
>>>>> return false.  Does that look right?
>>>>
>>>>     rebuild_sched_domains()
>>>>
>>>>       rebuild_sched_domains_locked()
>>>>
>>>>         ndoms = generate_sched_domains(&doms, &attr)
>>>>
>>>> You would need generate_sched_domains() in
>>>> sched_energy_aware_handler()?
>>>>
>>>
>>> clearly I didnt think through this. ndoms_cur and doms_cur are
>>> updated at the end.
>>> So If EAS is possible at boot, this would fail.
>>>
>>>
>>> What  sched_is_eas_possible needs is if there is asymmetric cpu
>>> topology.
>>> Simpler loop of all CPU's and checking if there any of them have
>>> sd_asym_cpucapacity might do,
>>> though it goes through all CPU's, Since these functions are not in
>>> hot path
>>> So it should not affect any performance. Something like below would
>>> work?
>>>
>>>     bool any_asym_capacity = false;
>>>
>>>           /* EAS is enabled for asymmetric CPU capacity topologies. */
>>>           for_each_cpu(i, cpu_mask) {
>>>                   if (per_cpu(sd_asym_cpucapacity, i)) {
>>>                           any_asym_capacity = true;
>>>                           break;
>>>                   }
>>>           }
>>>           if (!any_asym_capacity) {
>>>                   if (sched_debug()) {
>>>                           pr_info("rd %*pbl: Checking EAS, CPUs do
>>> not have asymmetric capacities\n",
>>>                                   cpumask_pr_args(cpu_mask));
>>>                   }
>>>                   return false;
>>>           }
>>>
>>
>> FYIW I could reproduce the issue mentioned above, and the suggested bit
>> seems to solve it.
>>

Thanks for trying this out. 

>>>
>>>
>>>>>> [...]
>>>>>>
>>>>>>> @@ -458,6 +487,8 @@ static bool build_perf_domains(const struct
>>>>>>> cpumask *cpu_map)
>>>>>>>        return !!pd;
>>>>>>>
>>>>>>>    free:
>>>>>>> +    if (sched_debug())
>>>>>>> +        pr_warn("rd %*pbl: Disabling EAS",
>>>>>>> cpumask_pr_args(cpu_map));
>>>>>>
>>>>>> Shouldn't this be used in condition `if
>>>>>> (!sysctl_sched_energy_aware)`?
>>>>>> Otherwise we have 2 warnings when the other conditions which leads to
>>>>>> `goto free` aren't met.
>>>>> Since sched_energy_set has the info messages about start and stop
>>>>> of EAS, maybe
>>>>> this debug is not needed. Will remove it. Doing it only `if
>>>>> (!sysctl_sched_energy_aware)`
>>>>
>>>> OK.
>>>>
>>>>> also doesn't seem correct, as calling free in this function would
>>>>> free the perf_domains.
>>>>
>>>> But !sched_is_eas_possible() also does `goto free` and in there we we
>>>> emit pr_info's indicating why EAS isn't possible right now.
>>>>
>>>> When issuing a:
>>>>
>>>> # echo 0 > /proc/sys/kernel/sched_energy_aware
>>>>
>>>> we would see in the logs:
>>>>
>>>> ...
>>>> [  416.325324] rd 0-5: sysctl_sched_energy_aware is N   <-- (*)
>>>> [  416.337844] sched_energy_set: stopping EAS
>>>> ...
>>>>
>>>> but maybe (*) is not necessary ...
>>
>> I m not sure I understand 100% the point, but it seems to me that when
>> changing sysctl_sched_energy_aware's value, either:
>> - EAS is not possible, and sched_is_eas_possible() will output the
>> reason why
>>     (i.e. "Checking EAS, [...]")
>> - EAS was deactivated by the user, and it is possible to check the
>> value of
>>     sysctl_sched_energy_aware. So there would be no need to have an
>> additional
>>     message "Checking EAS, sysclt sched_energy_aware set to N"
>>
>> When build_perf_domains() is called while rebuilding the sched
>> domains, it is also
>> possible to check sched_energy_aware's value and understand why EAS is
>> not enabled.
>>
>>>
>>> ok. I think we can add similar info message in
>>> if(!sysctl_sched_energy_aware) like below?
>>> Would that be enough?
>>>
>>>           if (!sysctl_sched_energy_aware) {
>>>                   if (sched_debug()) {
>>>                           pr_info("rd %*pbl: Checking EAS, sysclt
>>> sched_energy_aware set to N\n",
>>>                                   cpumask_pr_args(cpu_map));
>>>                   }
>>
>> (No need for brackets here, just in case)
>>
>>>                   goto free;
>>>           }
>>>
>>> and remove the below one.
>>>           if (sched_debug())
>>>                   pr_warn("rd %*pbl: Disabling EAS",
>>> cpumask_pr_args(cpu_map));
>>>
>>>
>>> So there will be these "Checking EAS" and if possible, "starting EAS"
>>> or "stopping EAS" message.
>>
>>
>> I will rebase the patch removing the EM complexity and check it/resend
>> it,
>> like this the 2 patches could go together:
>> https://lore.kernel.org/lkml/20221121094336.3250917-1-pierre.gondois@arm.com/
> 
> The patch actually seems to apply cleanly on v6.6-rc3, and the
> complexity of
> find_energy_efficient_cpu() seems to be the same as what it was when the
> patch
> was sent, so I believe you can:
> - Rebase your patch on top of it
> - Provide a pointer to it in the changelog to mention the dependency
> - Remove this bit in the commit message of your patch:
> "It takes most of the cases into account except one where EM complexity is
> too high as the code was bit tricky to separate that."
> 
> Regards,
> Pierre

Ok. I should be able to do this and send out v5 soon.
  

Patch

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index cf33de56da27..d89ac2bd8dc4 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1182,7 +1182,8 @@  automatically on platforms where it can run (that is,
 platforms with asymmetric CPU topologies and having an Energy
 Model available). If your platform happens to meet the
 requirements for EAS but you do not want to use it, change
-this value to 0.
+this value to 0. On Non-EAS platforms, write operation fails and
+read doesn't return anything.

 task_delayacct
 ===============
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a7b50bba7829..839ddc80a5ac 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -212,6 +212,64 @@  static unsigned int sysctl_sched_energy_aware = 1;
 static DEFINE_MUTEX(sched_energy_mutex);
 static bool sched_energy_update;

+extern struct cpufreq_governor schedutil_gov;
+static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
+{
+	int cpu = cpumask_first(cpu_mask);
+	struct cpufreq_policy *policy;
+	struct cpufreq_governor *gov;
+	int i;
+
+	/* EAS is enabled for asymmetric CPU capacity topologies. */
+	if (!per_cpu(sd_asym_cpucapacity, cpu)) {
+		if (sched_debug()) {
+			pr_info("rd %*pbl: Checking EAS, CPUs do not have asymmetric capacities\n",
+				cpumask_pr_args(cpu_mask));
+		}
+		return false;
+	}
+
+	/* EAS definitely does *not* handle SMT */
+	if (sched_smt_active()) {
+		if (sched_debug()) {
+			pr_info("rd %*pbl: Checking EAS, SMT is not supported\n",
+				cpumask_pr_args(cpu_mask));
+		}
+		return false;
+	}
+
+	if (!arch_scale_freq_invariant()) {
+		if (sched_debug()) {
+			pr_info("rd %*pbl: Checking EAS: frequency-invariant load tracking not yet supported",
+				cpumask_pr_args(cpu_mask));
+		}
+		return false;
+	}
+
+	/* Do not attempt EAS if schedutil is not being used. */
+	for_each_cpu(i, cpu_mask) {
+		policy = cpufreq_cpu_get(i);
+		if (!policy) {
+			if (sched_debug()) {
+				pr_info("rd %*pbl: Checking EAS, cpufreq policy not set for CPU: %d",
+					cpumask_pr_args(cpu_mask), i);
+			}
+			return false;
+		}
+		gov = policy->governor;
+		cpufreq_cpu_put(policy);
+		if (gov != &schedutil_gov) {
+			if (sched_debug()) {
+				pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
+					cpumask_pr_args(cpu_mask));
+			}
+			return false;
+		}
+	}
+
+	return true;
+}
+
 void rebuild_sched_domains_energy(void)
 {
 	mutex_lock(&sched_energy_mutex);
@@ -231,6 +289,14 @@  static int sched_energy_aware_handler(struct ctl_table *table, int write,
 		return -EPERM;

 	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (!sched_is_eas_possible(cpu_active_mask)) {
+		if (write) {
+			return -EOPNOTSUPP;
+		} else {
+			*lenp = 0;
+			return 0;
+		}
+	}
 	if (!ret && write) {
 		state = static_branch_unlikely(&sched_energy_present);
 		if (state != sysctl_sched_energy_aware)
@@ -370,61 +436,24 @@  static void sched_energy_set(bool has_eas)
  */
 #define EM_MAX_COMPLEXITY 2048

-extern struct cpufreq_governor schedutil_gov;
 static bool build_perf_domains(const struct cpumask *cpu_map)
 {
 	int i, nr_pd = 0, nr_ps = 0, nr_cpus = cpumask_weight(cpu_map);
 	struct perf_domain *pd = NULL, *tmp;
 	int cpu = cpumask_first(cpu_map);
 	struct root_domain *rd = cpu_rq(cpu)->rd;
-	struct cpufreq_policy *policy;
-	struct cpufreq_governor *gov;

 	if (!sysctl_sched_energy_aware)
 		goto free;

-	/* EAS is enabled for asymmetric CPU capacity topologies. */
-	if (!per_cpu(sd_asym_cpucapacity, cpu)) {
-		if (sched_debug()) {
-			pr_info("rd %*pbl: CPUs do not have asymmetric capacities\n",
-					cpumask_pr_args(cpu_map));
-		}
-		goto free;
-	}
-
-	/* EAS definitely does *not* handle SMT */
-	if (sched_smt_active()) {
-		pr_warn("rd %*pbl: Disabling EAS, SMT is not supported\n",
-			cpumask_pr_args(cpu_map));
-		goto free;
-	}
-
-	if (!arch_scale_freq_invariant()) {
-		if (sched_debug()) {
-			pr_warn("rd %*pbl: Disabling EAS: frequency-invariant load tracking not yet supported",
-				cpumask_pr_args(cpu_map));
-		}
+	if (!sched_is_eas_possible(cpu_map))
 		goto free;
-	}

 	for_each_cpu(i, cpu_map) {
 		/* Skip already covered CPUs. */
 		if (find_pd(pd, i))
 			continue;

-		/* Do not attempt EAS if schedutil is not being used. */
-		policy = cpufreq_cpu_get(i);
-		if (!policy)
-			goto free;
-		gov = policy->governor;
-		cpufreq_cpu_put(policy);
-		if (gov != &schedutil_gov) {
-			if (rd->pd)
-				pr_warn("rd %*pbl: Disabling EAS, schedutil is mandatory\n",
-						cpumask_pr_args(cpu_map));
-			goto free;
-		}
-
 		/* Create the new pd and add it to the local list. */
 		tmp = pd_init(i);
 		if (!tmp)
@@ -458,6 +487,8 @@  static bool build_perf_domains(const struct cpumask *cpu_map)
 	return !!pd;

 free:
+	if (sched_debug())
+		pr_warn("rd %*pbl: Disabling EAS", cpumask_pr_args(cpu_map));
 	free_pd(pd);
 	tmp = rd->pd;
 	rcu_assign_pointer(rd->pd, NULL);