[RFC,v5,2/2] cpuidle: teo: Introduce util-awareness

Message ID 20221130153204.2085591-3-kajetan.puchalski@arm.com
State New
Headers
Series cpuidle: teo: Introduce util-awareness |

Commit Message

Kajetan Puchalski Nov. 30, 2022, 3:32 p.m. UTC
  Modern interactive systems, such as recent Android phones, tend to have power
efficient shallow idle states. Selecting deeper idle states on a device while a
latency-sensitive workload is running can adversely impact performance due to
increased latency. Additionally, if the CPU wakes up from a deeper sleep before
its target residency as is often the case, it results in a waste of energy on
top of that.

At the moment, all the available idle governors operate mainly based on their
own past correctness metrics along with timer events without taking into account
any scheduling information. Especially on interactive systems, this results in
them frequently selecting a deeper idle state and then waking up before its
target residency is hit, thus leading to increased wakeup latency and lower
performance with no power saving. For 'menu' while web browsing on Android for
instance, those types of wakeups ('too deep') account for over 24% of all
wakeups.

At the same time, on some platforms C0 can be power efficient enough to warrant
wanting to prefer it over C1. This is because the power usage of the two states
can be so close that sufficient amounts of too deep C1 sleeps can completely
offset the C1 power saving to the point where it would've been more power
efficient to just use C0 instead.

Sleeps that happened in C0 while they could have used C1 ('too shallow') only
save less power than they otherwise could have. Too deep sleeps, on the other
hand, harm performance and nullify the potential power saving from using C1 in
the first place. While taking this into account, it is clear that on balance it
is preferable for an idle governor to have more too shallow sleeps instead of
more too deep sleeps on those kinds of platforms.

This patch specifically tunes TEO to minimise too deep sleeps and minimise
latency to achieve better performance. To this end, before selecting the next
idle state it uses the avg_util signal of a CPU's runqueue in order to determine
to what extent the CPU is being utilized. This util value is then compared to a
threshold defined as a percentage of the cpu's capacity (capacity >> 6 ie. ~1.5%
in the current implementation). If the util is above the threshold, the
idle state selected by TEO metrics will be reduced by 1, thus selecting a
shallower state. If the util is below the threshold, the governor defaults to
the TEO metrics mechanism to try to select the deepest available idle state
based on the closest timer event and its own correctness.

The main goal of this is to reduce latency and increase performance for some
workloads. Under some workloads it will result in an increase in power usage
(Geekbench 5) while for other workloads it will also result in a decrease in
power usage compared to TEO (PCMark Web, Jankbench, Speedometer).

It can provide drastically decreased latency and performance benefits in certain
types of workloads that are sensitive to latency.

Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
---
 drivers/cpuidle/governors/teo.c | 85 ++++++++++++++++++++++++++++++++-
 1 file changed, 84 insertions(+), 1 deletion(-)
  

Comments

Rafael J. Wysocki Jan. 3, 2023, 5:53 p.m. UTC | #1
On Wed, Nov 30, 2022 at 4:33 PM Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:
>
> Modern interactive systems, such as recent Android phones, tend to have power
> efficient shallow idle states. Selecting deeper idle states on a device while a
> latency-sensitive workload is running can adversely impact performance due to
> increased latency. Additionally, if the CPU wakes up from a deeper sleep before
> its target residency as is often the case, it results in a waste of energy on
> top of that.
>
> At the moment, all the available idle governors operate mainly based on their
> own past correctness metrics along with timer events without taking into account
> any scheduling information.

I still don't quite agree with the above statement.

It would be accurate enough to state the fact that currently cpuidle
governors don't take scheduling information into account.

> Especially on interactive systems, this results in
> them frequently selecting a deeper idle state and then waking up before its
> target residency is hit, thus leading to increased wakeup latency and lower
> performance with no power saving. For 'menu' while web browsing on Android for
> instance, those types of wakeups ('too deep') account for over 24% of all
> wakeups.

I don't think that you can convincingly establish a cause-and-effect
relationship between not taking scheduling information into account
and overestimating the idle duration.

It would be just fine to say something like "They also tend to
overestimate the idle duration quite often, which causes them to
select excessively deep idle states, which leads to ...".

> At the same time, on some platforms C0 can be power efficient enough to warrant
> wanting to prefer it over C1.

If you say C0 or C1, a casual reader may think about x86 which
probably is not your intention.

I would say "idle state 0" and "idle state 1" instead.  I would also
say that this is on systems where idle state 0 is not a polling state.

> This is because the power usage of the two states
> can be so close that sufficient amounts of too deep C1 sleeps can completely
> offset the C1 power saving to the point where it would've been more power
> efficient to just use C0 instead.
>
> Sleeps that happened in C0 while they could have used C1 ('too shallow') only
> save less power than they otherwise could have. Too deep sleeps, on the other
> hand, harm performance and nullify the potential power saving from using C1 in
> the first place. While taking this into account, it is clear that on balance it
> is preferable for an idle governor to have more too shallow sleeps instead of
> more too deep sleeps on those kinds of platforms.

I don't think that the above paragraphs, while generally true, are
relevant for what the patch really does.

They would have been relevant if the patch had improved the
energy-efficiency, but it doesn't.  It sacrifices energy for
performance by reducing the CPU wakeup latency.

> This patch specifically tunes TEO to minimise too deep sleeps and minimise
> latency to achieve better performance.

I'm not sure if you can demonstrate that the number of "too deep
sleeps" is really reduced in all cases, but the reduction of latency
is readily demonstrable, so I would focus on that part.

> To this end, before selecting the next
> idle state it uses the avg_util signal of a CPU's runqueue in order to determine
> to what extent the CPU is being utilized. This util value is then compared to a
> threshold defined as a percentage of the cpu's capacity (capacity >> 6 ie. ~1.5%
> in the current implementation). If the util is above the threshold, the
> idle state selected by TEO metrics will be reduced by 1, thus selecting a
> shallower state. If the util is below the threshold, the governor defaults to
> the TEO metrics mechanism to try to select the deepest available idle state
> based on the closest timer event and its own correctness.
>
> The main goal of this is to reduce latency and increase performance for some
> workloads. Under some workloads it will result in an increase in power usage
> (Geekbench 5) while for other workloads it will also result in a decrease in
> power usage compared to TEO (PCMark Web, Jankbench, Speedometer).
>
> It can provide drastically decreased latency and performance benefits in certain
> types of workloads that are sensitive to latency.

And I would put some numbers from your cover letter in here.

> Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
> ---
>  drivers/cpuidle/governors/teo.c | 85 ++++++++++++++++++++++++++++++++-
>  1 file changed, 84 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index e2864474a98d..2f85dde301d3 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -2,8 +2,13 @@
>  /*
>   * Timer events oriented CPU idle governor
>   *
> + * TEO governor:
>   * Copyright (C) 2018 - 2021 Intel Corporation
>   * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> + *
> + * Util-awareness mechanism:
> + * Copyright (C) 2022 Arm Ltd.
> + * Author: Kajetan Puchalski <kajetan.puchalski@arm.com>
>   */

I don't think the above change is really useful.  At least it is not
necessary for any practical purpose.

See the copyright notice in the schedutil governor, for example: it
hasn't changed since the introduction of it even though many people
have changed it substantially since then.

>  /**
> @@ -99,14 +104,55 @@
>   *      select the given idle state instead of the candidate one.
>   *
>   * 3. By default, select the candidate state.
> + *
> + * Util-awareness mechanism:
> + *
> + * The idea behind the util-awareness extension is that there are two distinct
> + * scenarios for the CPU which should result in two different approaches to idle
> + * state selection - utilized and not utilized.
> + *
> + * In this case, 'utilized' means that the average runqueue util of the CPU is
> + * above a certain threshold.
> + *
> + * When the CPU is utilized while going into idle, more likely than not it will
> + * be woken up to do more work soon and so a shallower idle state should be
> + * selected to minimise latency and maximise performance. When the CPU is not
> + * being utilized, the usual metrics-based approach to selecting the deepest
> + * available idle state should be preferred to take advantage of the power
> + * saving.

I would say "energy saving" instead of "power saving", as the former
is technically more accurate.

> + *
> + * In order to achieve this, the governor uses a utilization threshold.
> + * The threshold is computed per-cpu as a percentage of the CPU's capacity
> + * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
> + * seems to be getting the best results.
> + *
> + * Before selecting the next idle state, the governor compares the current CPU
> + * util to the precomputed util threshold. If it's below, it defaults to the
> + * TEO metrics mechanism. If it's above, the idle state will be reduced to C0
> + * as long as C0 is not a polling state.

Again, I would avoid the C0 etc terminology here, because it may
confuse people easily enough.

>   */
>
>  #include <linux/cpuidle.h>
>  #include <linux/jiffies.h>
>  #include <linux/kernel.h>
> +#include <linux/sched.h>
>  #include <linux/sched/clock.h>
> +#include <linux/sched/topology.h>
>  #include <linux/tick.h>
>
> +/*
> + * The number of bits to shift the cpu's capacity by in order to determine
> + * the utilized threshold.

Please spell CPU in capitals.

> + *
> + * 6 was chosen based on testing as the number that achieved the best balance
> + * of power and performance on average.
> + *
> + * The resulting threshold is high enough to not be triggered by background
> + * noise and low enough to react quickly when activity starts to ramp up.
> + */
> +#define UTIL_THRESHOLD_SHIFT 6
> +
> +
>  /*
>   * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
>   * is used for decreasing metrics on a regular basis.
> @@ -137,9 +183,11 @@ struct teo_bin {
>   * @time_span_ns: Time between idle state selection and post-wakeup update.
>   * @sleep_length_ns: Time till the closest timer event (at the selection time).
>   * @state_bins: Idle state data bins for this CPU.
> - * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
> + * @total: Grand total of the "intercepts" and "hits" metrics for all bins.

This spell fix is unrelated to the rest of the patch.  It should be
covered in the changelog at least, but IMV it;s just better to put it
into a separate patch.

>   * @next_recent_idx: Index of the next @recent_idx entry to update.
>   * @recent_idx: Indices of bins corresponding to recent "intercepts".
> + * @util_threshold: Threshold above which the CPU is considered utilized
> + * @utilized: Whether the last sleep on the CPU happened while utilized
>   */
>  struct teo_cpu {
>         s64 time_span_ns;
> @@ -148,10 +196,22 @@ struct teo_cpu {
>         unsigned int total;
>         int next_recent_idx;
>         int recent_idx[NR_RECENT];
> +       unsigned long util_threshold;
> +       bool utilized;
>  };
>
>  static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
>
> +/**
> + * teo_get_util - Update the CPU utilized status
> + * @dev: Target CPU
> + * @cpu_data: Governor CPU data for the target CPU
> + */
> +static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
> +{
> +       cpu_data->utilized = sched_cpu_util(dev->cpu) > cpu_data->util_threshold;
> +}
> +

This has been covered by the earlier message from today.

>  /**
>   * teo_update - Update CPU metrics after wakeup.
>   * @drv: cpuidle driver containing state data.
> @@ -323,6 +383,20 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>                         goto end;
>         }
>
> +       teo_get_util(dev, cpu_data);

And this too.

> +       /*
> +        * The cpu is being utilized over the threshold there are only 2 states to choose from.

I would say "If the CPU is being utilized over the threshold and there
are only 2 states to choose from,"

> +        * No need to consider metrics, choose the shallowest non-polling state and exit.

"the shallower one can be returned right away as long as it is not a
polling one".

> +        */
> +       if (drv->state_count < 3 && cpu_data->utilized) {

I would reverse the order of checks here, so the code and the comment
match each other exactly.

> +               for (i = 0; i < drv->state_count; ++i) {
> +                       if (!dev->states_usage[i].disable && !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
> +                               idx = i;
> +                               goto end;
> +                       }
> +               }
> +       }
> +
>         /*
>          * Find the deepest idle state whose target residency does not exceed
>          * the current sleep length and the deepest idle state not deeper than
> @@ -454,6 +528,13 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>         if (idx > constraint_idx)
>                 idx = constraint_idx;
>
> +       /*
> +        * If the CPU is being utilized over the threshold,
> +        * choose a shallower non-polling state to improve latency
> +        */
> +       if (cpu_data->utilized)
> +               idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
> +
>  end:
>         /*
>          * Don't stop the tick if the selected state is a polling one or if the
> @@ -510,9 +591,11 @@ static int teo_enable_device(struct cpuidle_driver *drv,
>                              struct cpuidle_device *dev)
>  {
>         struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> +       unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
>         int i;
>
>         memset(cpu_data, 0, sizeof(*cpu_data));
> +       cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
>
>         for (i = 0; i < NR_RECENT; i++)
>                 cpu_data->recent_idx[i] = -1;
> --
  
Lukasz Luba Jan. 4, 2023, 9:57 a.m. UTC | #2
Hi Rafael,

Forgive me to jump into the discussion, I would like to clarify some
stuff.

On 1/3/23 17:53, Rafael J. Wysocki wrote:
> On Wed, Nov 30, 2022 at 4:33 PM Kajetan Puchalski
> <kajetan.puchalski@arm.com> wrote:
>>
>> Modern interactive systems, such as recent Android phones, tend to have power
>> efficient shallow idle states. Selecting deeper idle states on a device while a
>> latency-sensitive workload is running can adversely impact performance due to
>> increased latency. Additionally, if the CPU wakes up from a deeper sleep before
>> its target residency as is often the case, it results in a waste of energy on
>> top of that.
>>
>> At the moment, all the available idle governors operate mainly based on their
>> own past correctness metrics along with timer events without taking into account
>> any scheduling information.
> 
> I still don't quite agree with the above statement.
> 
> It would be accurate enough to state the fact that currently cpuidle
> governors don't take scheduling information into account.

Fair enough.

> 
>> Especially on interactive systems, this results in
>> them frequently selecting a deeper idle state and then waking up before its
>> target residency is hit, thus leading to increased wakeup latency and lower
>> performance with no power saving. For 'menu' while web browsing on Android for
>> instance, those types of wakeups ('too deep') account for over 24% of all
>> wakeups.
> 
> I don't think that you can convincingly establish a cause-and-effect
> relationship between not taking scheduling information into account
> and overestimating the idle duration.

This is tough topic to prove this correctly, good for some PhD studies.
There is some correlation which we see in the results and measurements,
though. An interesting topic to study would be the CPU runqueue
utilization signal nature, when the tasks sleep or migrate. We don't
have time to study that signal nature over time and if that function
matches some already discovered process and invented math theory,
e.g. Erlang distribution or only exponential distribution...
Your exponentially decaying statistics might fall into same bucket.

> 
> It would be just fine to say something like "They also tend to
> overestimate the idle duration quite often, which causes them to
> select excessively deep idle states, which leads to ...".

Fair enough.

> 
>> At the same time, on some platforms C0 can be power efficient enough to warrant
>> wanting to prefer it over C1.
> 
> If you say C0 or C1, a casual reader may think about x86 which
> probably is not your intention.
> 
> I would say "idle state 0" and "idle state 1" instead.  I would also
> say that this is on systems where idle state 0 is not a polling state.

I agree.

> 
>> This is because the power usage of the two states
>> can be so close that sufficient amounts of too deep C1 sleeps can completely
>> offset the C1 power saving to the point where it would've been more power
>> efficient to just use C0 instead.
>>
>> Sleeps that happened in C0 while they could have used C1 ('too shallow') only
>> save less power than they otherwise could have. Too deep sleeps, on the other
>> hand, harm performance and nullify the potential power saving from using C1 in
>> the first place. While taking this into account, it is clear that on balance it
>> is preferable for an idle governor to have more too shallow sleeps instead of
>> more too deep sleeps on those kinds of platforms.
> 
> I don't think that the above paragraphs, while generally true, are
> relevant for what the patch really does.

I disagree. That's the patch improvement, precisely the threshold
value of ~1.56% makes this trade-off. We more often prefer too shallow
idle state 0. When you change that threshold to bigger value, you reduce
this preference.

> 
> They would have been relevant if the patch had improved the
> energy-efficiency, but it doesn't.  It sacrifices energy for
> performance by reducing the CPU wakeup latency.

I disagree. Let me clarify those results, because there is a lot of
data I think you might miss some important bits.

Kajetan has provided 4 benchmarks' test results, which are really
comprehensive.

A. There are 2 benchmarks 'performance-first', so we want better score,
*no matter the energy cost* (Geekbench 5 and Speedometer).
The Geekbench5 is most important since a lot of people check it before
buying a phone. For this benchmark TEO was an issue as you can see:
multicore score TEO vs TEO_util:
   2764.8 vs. 2865
which is ~3.5% improvement, trust me it's a lot.

B. There are 2 benchmarks trying to reflect 'normal usage' of a phone:
- PCMark Web Browsing
   score TEO vs. TEO_util
     5219.8 vs. 5239.9 ---> better score
   avg power
     184.8 vs. 184.1 mW ---> lower power, better
- Jankbench
   score TEO vs. TEO_util (lower janky frames percentage is better)
     2.1% vs. 1.3% ---> better score
   avg power
     136.9 mW vs. 121.3 mW ----> lower power, better

The results in B. show that it doesn't sacrifice energy for performance.
We were able to reduce the avg power while even improving the score
results. It would lead to lower energy for the whole test.

You might ask: how is this possible?
Or HW design has evolved and made the idle state 0 very attractive
(reducing more and more power). This created a potential new area
to explore which was in my head for quite a long. Kajetan was keen to
explore that space and found many interesting behaviors there.
Some background about energy and costs.
Don't forget that to enter a deeper idle state you have to flush the
caches gently, not just drop the data blindly. That will cost you some
extra energy (you can drop instr. cache) (comparing to a situation when
you don't have to do this and you keep the caches on). Then when you
wakeup the CPU you have to load instr. and data into cache, which cost
you another extra energy. If you have to do this to/from LPDDR, then it
is ~10x bigger energy vs. some internal data passing (e.g. L3 -> L2).
If you do this wrongly (that's what Kajetan is calling "too deep",
which is a nice new metric IMO) then you will pay those two extra energy
penalties and most likely not saving enough energy to compensate that
loss. That's why you might wonder why choosing a deeper idle I couldn't
save energy... That's the reason.


> 
>> This patch specifically tunes TEO to minimise too deep sleeps and minimise
>> latency to achieve better performance.
> 
> I'm not sure if you can demonstrate that the number of "too deep
> sleeps" is really reduced in all cases, but the reduction of latency
> is readily demonstrable, so I would focus on that part.

Sounds sane

> 
>> To this end, before selecting the next
>> idle state it uses the avg_util signal of a CPU's runqueue in order to determine
>> to what extent the CPU is being utilized. This util value is then compared to a
>> threshold defined as a percentage of the cpu's capacity (capacity >> 6 ie. ~1.5%
>> in the current implementation). If the util is above the threshold, the
>> idle state selected by TEO metrics will be reduced by 1, thus selecting a
>> shallower state. If the util is below the threshold, the governor defaults to
>> the TEO metrics mechanism to try to select the deepest available idle state
>> based on the closest timer event and its own correctness.
>>
>> The main goal of this is to reduce latency and increase performance for some
>> workloads. Under some workloads it will result in an increase in power usage
>> (Geekbench 5) while for other workloads it will also result in a decrease in
>> power usage compared to TEO (PCMark Web, Jankbench, Speedometer).
>>
>> It can provide drastically decreased latency and performance benefits in certain
>> types of workloads that are sensitive to latency.
> 
> And I would put some numbers from your cover letter in here.
>

I agree.

[snip]

>> + *
>> + * When the CPU is utilized while going into idle, more likely than not it will
>> + * be woken up to do more work soon and so a shallower idle state should be
>> + * selected to minimise latency and maximise performance. When the CPU is not
>> + * being utilized, the usual metrics-based approach to selecting the deepest
>> + * available idle state should be preferred to take advantage of the power
>> + * saving.
> 
> I would say "energy saving" instead of "power saving", as the former
> is technically more accurate.

I agree.

I hope this could help to clarify some bits.

Regards,
Lukasz
  

Patch

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index e2864474a98d..2f85dde301d3 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -2,8 +2,13 @@ 
 /*
  * Timer events oriented CPU idle governor
  *
+ * TEO governor:
  * Copyright (C) 2018 - 2021 Intel Corporation
  * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ *
+ * Util-awareness mechanism:
+ * Copyright (C) 2022 Arm Ltd.
+ * Author: Kajetan Puchalski <kajetan.puchalski@arm.com>
  */
 
 /**
@@ -99,14 +104,55 @@ 
  *      select the given idle state instead of the candidate one.
  *
  * 3. By default, select the candidate state.
+ *
+ * Util-awareness mechanism:
+ *
+ * The idea behind the util-awareness extension is that there are two distinct
+ * scenarios for the CPU which should result in two different approaches to idle
+ * state selection - utilized and not utilized.
+ *
+ * In this case, 'utilized' means that the average runqueue util of the CPU is
+ * above a certain threshold.
+ *
+ * When the CPU is utilized while going into idle, more likely than not it will
+ * be woken up to do more work soon and so a shallower idle state should be
+ * selected to minimise latency and maximise performance. When the CPU is not
+ * being utilized, the usual metrics-based approach to selecting the deepest
+ * available idle state should be preferred to take advantage of the power
+ * saving.
+ *
+ * In order to achieve this, the governor uses a utilization threshold.
+ * The threshold is computed per-cpu as a percentage of the CPU's capacity
+ * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%)
+ * seems to be getting the best results.
+ *
+ * Before selecting the next idle state, the governor compares the current CPU
+ * util to the precomputed util threshold. If it's below, it defaults to the
+ * TEO metrics mechanism. If it's above, the idle state will be reduced to C0
+ * as long as C0 is not a polling state.
  */
 
 #include <linux/cpuidle.h>
 #include <linux/jiffies.h>
 #include <linux/kernel.h>
+#include <linux/sched.h>
 #include <linux/sched/clock.h>
+#include <linux/sched/topology.h>
 #include <linux/tick.h>
 
+/*
+ * The number of bits to shift the cpu's capacity by in order to determine
+ * the utilized threshold.
+ *
+ * 6 was chosen based on testing as the number that achieved the best balance
+ * of power and performance on average.
+ *
+ * The resulting threshold is high enough to not be triggered by background
+ * noise and low enough to react quickly when activity starts to ramp up.
+ */
+#define UTIL_THRESHOLD_SHIFT 6
+
+
 /*
  * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
  * is used for decreasing metrics on a regular basis.
@@ -137,9 +183,11 @@  struct teo_bin {
  * @time_span_ns: Time between idle state selection and post-wakeup update.
  * @sleep_length_ns: Time till the closest timer event (at the selection time).
  * @state_bins: Idle state data bins for this CPU.
- * @total: Grand total of the "intercepts" and "hits" mertics for all bins.
+ * @total: Grand total of the "intercepts" and "hits" metrics for all bins.
  * @next_recent_idx: Index of the next @recent_idx entry to update.
  * @recent_idx: Indices of bins corresponding to recent "intercepts".
+ * @util_threshold: Threshold above which the CPU is considered utilized
+ * @utilized: Whether the last sleep on the CPU happened while utilized
  */
 struct teo_cpu {
 	s64 time_span_ns;
@@ -148,10 +196,22 @@  struct teo_cpu {
 	unsigned int total;
 	int next_recent_idx;
 	int recent_idx[NR_RECENT];
+	unsigned long util_threshold;
+	bool utilized;
 };
 
 static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
 
+/**
+ * teo_get_util - Update the CPU utilized status
+ * @dev: Target CPU
+ * @cpu_data: Governor CPU data for the target CPU
+ */
+static void teo_get_util(struct cpuidle_device *dev, struct teo_cpu *cpu_data)
+{
+	cpu_data->utilized = sched_cpu_util(dev->cpu) > cpu_data->util_threshold;
+}
+
 /**
  * teo_update - Update CPU metrics after wakeup.
  * @drv: cpuidle driver containing state data.
@@ -323,6 +383,20 @@  static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 			goto end;
 	}
 
+	teo_get_util(dev, cpu_data);
+	/*
+	 * The cpu is being utilized over the threshold there are only 2 states to choose from.
+	 * No need to consider metrics, choose the shallowest non-polling state and exit.
+	 */
+	if (drv->state_count < 3 && cpu_data->utilized) {
+		for (i = 0; i < drv->state_count; ++i) {
+			if (!dev->states_usage[i].disable && !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
+				idx = i;
+				goto end;
+			}
+		}
+	}
+
 	/*
 	 * Find the deepest idle state whose target residency does not exceed
 	 * the current sleep length and the deepest idle state not deeper than
@@ -454,6 +528,13 @@  static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	if (idx > constraint_idx)
 		idx = constraint_idx;
 
+	/*
+	 * If the CPU is being utilized over the threshold,
+	 * choose a shallower non-polling state to improve latency
+	 */
+	if (cpu_data->utilized)
+		idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);
+
 end:
 	/*
 	 * Don't stop the tick if the selected state is a polling one or if the
@@ -510,9 +591,11 @@  static int teo_enable_device(struct cpuidle_driver *drv,
 			     struct cpuidle_device *dev)
 {
 	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+	unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu);
 	int i;
 
 	memset(cpu_data, 0, sizeof(*cpu_data));
+	cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT;
 
 	for (i = 0; i < NR_RECENT; i++)
 		cpu_data->recent_idx[i] = -1;