Documentation: sched: Add a new sched-util-clamp.rst
Commit Message
From: Qais Yousef <qais.yousef@arm.com>
The new util clamp feature needs a document explaining what it is and
how to use it. The new document hopefully covers everything one needs to
know about uclamp.
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
Hopefully this is a decent first attempt at explaining everything about
uclamp: what it is and how to use it.
I have repeated some ideas to help reinforce them for readers who are new to
the concept.
Documentation/scheduler/sched-util-clamp.rst | 678 +++++++++++++++++++
1 file changed, 678 insertions(+)
create mode 100644 Documentation/scheduler/sched-util-clamp.rst
Comments
Hi Qais,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on lwn/docs-next]
[also build test WARNING on tip/master linus/master v6.1-rc3 next-20221104]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Qais-Yousef/Documentation-sched-Add-a-new-sched-util-clamp-rst/20221106-072619
base: git://git.lwn.net/linux.git docs-next
patch link: https://lore.kernel.org/r/20221105232343.887199-1-qyousef%40layalina.io
patch subject: [PATCH] Documentation: sched: Add a new sched-util-clamp.rst
reproduce:
# https://github.com/intel-lab-lkp/linux/commit/18b40e54df3058f348a5df25fad6baad82d28f1a
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Qais-Yousef/Documentation-sched-Add-a-new-sched-util-clamp-rst/20221106-072619
git checkout 18b40e54df3058f348a5df25fad6baad82d28f1a
make menuconfig
# enable CONFIG_COMPILE_TEST, CONFIG_WARN_MISSING_DOCUMENTS, CONFIG_WARN_ABI_ERRORS
make htmldocs
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
All warnings (new ones prefixed by >>):
>> Documentation/scheduler/sched-util-clamp.rst:181: WARNING: Unexpected indentation.
>> Documentation/scheduler/sched-util-clamp.rst: WARNING: document isn't included in any toctree
vim +181 Documentation/scheduler/sched-util-clamp.rst
175
176 0 1024
177 | |
178 +-----------+-----------+-----------+---- ----+-----------+
179 | Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N |
180 +-----------+-----------+-----------+---- ----+-----------+
> 181 : : :
182 +- p0 +- p3 +- p4
183 : :
184 +- p1 +- p5
185 :
186 +- p2
187
188
189 DISCLAMER:
190 The diagram above is an illustration rather than a true depiction of the
191 internal data structure.
192
On 11/07/22 01:47, kernel test robot wrote:
> Hi Qais,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on lwn/docs-next]
> [also build test WARNING on tip/master linus/master v6.1-rc3 next-20221104]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Qais-Yousef/Documentation-sched-Add-a-new-sched-util-clamp-rst/20221106-072619
> base: git://git.lwn.net/linux.git docs-next
> patch link: https://lore.kernel.org/r/20221105232343.887199-1-qyousef%40layalina.io
> patch subject: [PATCH] Documentation: sched: Add a new sched-util-clamp.rst
> reproduce:
> # https://github.com/intel-lab-lkp/linux/commit/18b40e54df3058f348a5df25fad6baad82d28f1a
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review Qais-Yousef/Documentation-sched-Add-a-new-sched-util-clamp-rst/20221106-072619
> git checkout 18b40e54df3058f348a5df25fad6baad82d28f1a
> make menuconfig
> # enable CONFIG_COMPILE_TEST, CONFIG_WARN_MISSING_DOCUMENTS, CONFIG_WARN_ABI_ERRORS
> make htmldocs
>
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <lkp@intel.com>
>
> All warnings (new ones prefixed by >>):
>
> >> Documentation/scheduler/sched-util-clamp.rst:181: WARNING: Unexpected indentation.
> >> Documentation/scheduler/sched-util-clamp.rst: WARNING: document isn't included in any toctree
Thanks! I have the below fixup patch that addresses these. It made me realize
my html output could look better. It's cosmetic; so won't post a new version
till some feedback is provided first.
Cheers
--
Qais Yousef
--->8---
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index b430d856056a..f12d0d06de3a 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -15,6 +15,7 @@ Linux Scheduler
sched-capacity
sched-energy
schedutil
+ sched-util-clamp
sched-nice-design
sched-rt-group
sched-stats
diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
index e75b69767afb..728ffa364fc7 100644
--- a/Documentation/scheduler/sched-util-clamp.rst
+++ b/Documentation/scheduler/sched-util-clamp.rst
@@ -169,24 +169,27 @@ could change with implementation details.
2.1 BUCKETS:
-------------
+.. code-block:: c
+
[struct rq]
-(bottom) (top)
+ (bottom) (top)
- 0 1024
- | |
- +-----------+-----------+-----------+---- ----+-----------+
- | Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N |
- +-----------+-----------+-----------+---- ----+-----------+
- : : :
- +- p0 +- p3 +- p4
- : :
- +- p1 +- p5
- :
- +- p2
+ 0 1024
+ | |
+ +-----------+-----------+-----------+---- ----+-----------+
+ | Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N |
+ +-----------+-----------+-----------+---- ----+-----------+
+ : : :
+ +- p0 +- p3 +- p4
+ : :
+ +- p1 +- p5
+ :
+ +- p2
-DISCLAMER:
+.. note::
+ DISCLAMER:
The diagram above is an illustration rather than a true depiction of the
internal data structure.
@@ -200,6 +203,8 @@ The rq has a bucket for each uclamp_id: [UCLAMP_MIN, UCLAMP_MAX].
The range of each bucket is 1024/N. For example for the default value of 5 we
will have 5 buckets, each of which will cover the following range:
+.. code-block:: c
+
DELTA = round_closest(1024/5) = 204.8 = 205
Bucket 0: [0:204]
@@ -210,6 +215,8 @@ will have 5 buckets, each of which will cover the following range:
When a task p
+.. code-block:: c
+
p->uclamp[UCLAMP_MIN] = 300
p->uclamp[UCLAMP_MAX] = 1024
@@ -222,12 +229,16 @@ uclamp_id.
When a task p is enqueued, the rq value changes as follows:
+.. code-block:: c
+
// update bucket logic goes here
rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN])
// repeat for UCLAMP_MAX
When a task is p dequeued the rq value changes as follows:
+.. code-block:: c
+
// update bucket logic goes here
rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value()
// repeat for UCLAMP_MAX
@@ -249,6 +260,8 @@ another task that doesn't need it or is disallowed from reaching this point.
For example, if there are multiple tasks attached to an rq with the following
values:
+.. code-block:: c
+
p0->uclamp[UCLAMP_MIN] = 300
p0->uclamp[UCLAMP_MAX] = 900
@@ -257,6 +270,8 @@ values:
then assuming both p0 and p1 are enqueued to the same rq
+.. code-block:: c
+
rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500
rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900
@@ -316,6 +331,8 @@ sched_setattr() syscall was extended to accept two new fields:
For example:
+.. code-block:: c
+
attr->sched_util_min = 40% * 1024;
attr->sched_util_max = 80% * 1024;
@@ -333,9 +350,13 @@ default.
Note that resetting the uclamp value to system default using -1 is not the same
as setting the uclamp value to system default.
+.. code-block:: c
+
attr->sched_util_min = -1 // p0 is reset to system default e.g: 0
- not the same as
+not the same as
+
+.. code-block:: c
attr->sched_util_min = 0 // p0 is set to 0, the fact it is the same
// as system default is irrelevant
@@ -375,6 +396,8 @@ as follows:
For example:
+.. code-block:: c
+
p0->uclamp[UCLAMP_MIN] = // system default;
p0->uclamp[UCLAMP_MAX] = // system default;
@@ -389,6 +412,8 @@ For example:
when p0 and p1 are attached to cgroup0
+.. code-block:: c
+
p0->uclamp[UCLAMP_MIN] = cgroup0->cpu.uclamp.min = 20% * 1024;
p0->uclamp[UCLAMP_MAX] = cgroup0->cpu.uclamp.max = 60% * 1024;
@@ -397,6 +422,8 @@ when p0 and p1 are attached to cgroup0
when p0 and p1 are attached to cgroup1
+.. code-block:: c
+
p0->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
p0->uclamp[UCLAMP_MAX] = cgroup1->cpu.uclamp.max = 100% * 1024;
@@ -452,6 +479,8 @@ The value must be greater than or equal to sched_util_clamp_min.
By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to:
+.. code-block:: c
+
p_fair->uclamp[UCLAMP_MIN] = 0
p_fair->uclamp[UCLAMP_MAX] = 1024
@@ -461,6 +490,8 @@ provide this, but can be added in the future.
For SCHED_FIFO/SCHED_RR tasks:
+.. code-block:: c
+
p_rt->uclamp[UCLAMP_MIN] = 1024
p_rt->uclamp[UCLAMP_MAX] = 1024
@@ -564,15 +595,21 @@ still would like to keep your browser performance intact; uclamp enables that.
If task p0 is capped to run at 512
+.. code-block:: c
+
p0->uclamp[UCLAMP_MAX] = 512
is sharing the rq with p1 which is free to run at any performance point
+.. code-block:: c
+
p1->uclamp[UCLAMP_MAX] = 1024
then due to max aggregation the rq will be allowed to reach max performance
point
+.. code-block:: c
+
rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024
Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection for
@@ -597,16 +634,22 @@ when severely capped tasks share the rq with a small non capped task.
As an example if task p
+.. code-block:: c
+
p0->util_avg = 300
p0->uclamp[UCLAMP_MAX] = 0
wakes up on an idle CPU, then it will run at min frequency this CPU is capable
of.
+.. code-block:: c
+
rq->uclamp[UCLAMP_MAX] = 0
If the ratio of Fmax/Fmin is 3, then
+.. code-block:: c
+
300 * (Fmax/Fmin) = 900
Which indicates the CPU will still see idle time since 900 is < 1024. The
@@ -614,19 +657,27 @@ _actual_ util_avg will NOT be 900 though. It will be higher than 300, but won't
approach 900. As long as there's idle time, p->util_avg updates will be off by
a some margin, but not proportional to Fmax/Fmin.
+.. code-block:: c
+
p0->util_avg = 300 + small_error
Now if the ratio of Fmax/Fmin is 4, then
+.. code-block:: c
+
300 * (Fmax/Fmin) = 1200
which is higher than 1024 and indicates that the CPU has no idle time. When
this happens, then the _actual_ util_avg will become 1024.
+.. code-block:: c
+
p0->util_avg = 1024
If task p1 wakes up on this CPU
+.. code-block:: c
+
p1->util_avg = 200
p1->uclamp[UCLAMP_MAX] = 1024
@@ -634,6 +685,8 @@ then the effective UCLAMP_MAX for the CPU will be 1024 according to max
aggregation rule. But since the capped p0 task was running and throttled
severely, then the rq->util_avg will be 1024.
+.. code-block:: c
+
p0->util_avg = 1024
p1->util_avg = 200
@@ -642,6 +695,8 @@ severely, then the rq->util_avg will be 1024.
Hence lead to a frequency spike since if p0 wasn't throttled we should get
+.. code-block:: c
+
p0->util_avg = 300
p1->util_avg = 200
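For readers following the thread, the max aggregation rule shown in the hunks above can be sketched in plain C. This is only an illustration under stated assumptions, not kernel code: `struct task`, `rq_uclamp()` and its `fallback` parameter are hypothetical names, and the sketch rescans every task rather than using the bucket counters the document describes.

```c
#include <assert.h>
#include <stddef.h>

enum { UCLAMP_MIN, UCLAMP_MAX, UCLAMP_CNT };

/* Hypothetical stand-in for a task; only the clamp values matter here. */
struct task {
	unsigned int uclamp[UCLAMP_CNT];
};

/*
 * Effective rq clamp for one clamp id: the maximum over all enqueued
 * tasks. 'fallback' models the system default used when nothing is
 * enqueued (an assumption of this sketch).
 */
unsigned int rq_uclamp(const struct task *tasks, size_t n, int clamp_id,
		       unsigned int fallback)
{
	unsigned int val;
	size_t i;

	assert(clamp_id == UCLAMP_MIN || clamp_id == UCLAMP_MAX);

	if (n == 0)
		return fallback;

	val = tasks[0].uclamp[clamp_id];
	for (i = 1; i < n; i++) {
		if (tasks[i].uclamp[clamp_id] > val)
			val = tasks[i].uclamp[clamp_id];
	}
	return val;
}
```

With p0 = {300, 900} and p1 = {500, 500} from the quoted example, this yields 500 for UCLAMP_MIN and 900 for UCLAMP_MAX, the same values as in the hunk.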
On Sat, Nov 05, 2022 at 11:23:43PM +0000, Qais Yousef wrote:
> From: Qais Yousef <qais.yousef@arm.com>
>
> The new util clamp feature needs a document explaining what it is and
> how to use it. The new document hopefully covers everything one needs to
> know about uclamp.
>
> Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Hmm, why didn't you send this patch from your arm address instead?
On the other hand, thanks for including the SoB from your sending address,
which is different.
I will comment on the content in reply to your fixup message.
Thanks.
On Sun, Nov 13, 2022 at 03:26:29PM +0000, Qais Yousef wrote:
> Thanks! I have the below fixup patch that addresses these. It made me realize
> my html output could look better. It's cosmetic; so won't post a new version
> till some feedback is provided first.
>
>
> Cheers
>
> --
> Qais Yousef
>
>
> --->8---
>
> diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> index b430d856056a..f12d0d06de3a 100644
> --- a/Documentation/scheduler/index.rst
> +++ b/Documentation/scheduler/index.rst
> @@ -15,6 +15,7 @@ Linux Scheduler
> sched-capacity
> sched-energy
> schedutil
> + sched-util-clamp
> sched-nice-design
> sched-rt-group
> sched-stats
> diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
> index e75b69767afb..728ffa364fc7 100644
> --- a/Documentation/scheduler/sched-util-clamp.rst
> +++ b/Documentation/scheduler/sched-util-clamp.rst
> @@ -169,24 +169,27 @@ could change with implementation details.
> 2.1 BUCKETS:
> -------------
>
> +.. code-block:: c
> +
> [struct rq]
>
> -(bottom) (top)
> + (bottom) (top)
>
> - 0 1024
> - | |
> - +-----------+-----------+-----------+---- ----+-----------+
> - | Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N |
> - +-----------+-----------+-----------+---- ----+-----------+
> - : : :
> - +- p0 +- p3 +- p4
> - : :
> - +- p1 +- p5
> - :
> - +- p2
> + 0 1024
> + | |
> + +-----------+-----------+-----------+---- ----+-----------+
> + | Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N |
> + +-----------+-----------+-----------+---- ----+-----------+
> + : : :
> + +- p0 +- p3 +- p4
> + : :
> + +- p1 +- p5
> + :
> + +- p2
The code block above is a diagram, isn't it? Thus specifying a language for
syntax highlighting (in this case ``c``) isn't appropriate.
>
>
> -DISCLAMER:
> +.. note::
> + DISCLAMER:
> The diagram above is an illustration rather than a true depiction of the
> internal data structure.
The DISCLAIMER line above isn't needed, since the note block should do the
job.
>
> @@ -200,6 +203,8 @@ The rq has a bucket for each uclamp_id: [UCLAMP_MIN, UCLAMP_MAX].
> The range of each bucket is 1024/N. For example for the default value of 5 we
> will have 5 buckets, each of which will cover the following range:
>
> +.. code-block:: c
> +
Again, why ``c`` syntax highlighting?
Otherwise no new warnings. Thanks for fixing this up.
However, in the future, for documentation patches you should always Cc the
linux-doc list. Adding it to the Cc list now.
Thanks.
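As an aside on the bucket ranges in the quoted hunk, the mapping from a clamp value to its bucket can be sketched as below. This is a minimal illustration assuming the default of 5 buckets mentioned in the document; the macro and function names are chosen for this sketch and are not taken from the patch.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024u
#define UCLAMP_BUCKETS		5u	/* assumed default bucket count */

/* Integer division rounded to the closest value: 1024/5 = 204.8 -> 205 */
#define DIV_ROUND_CLOSEST_U(x, d)	(((x) + (d) / 2u) / (d))

#define UCLAMP_BUCKET_DELTA \
	DIV_ROUND_CLOSEST_U(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)

/* Map a clamp value in [0:1024] to a bucket index in [0:UCLAMP_BUCKETS-1]. */
unsigned int bucket_id(unsigned int clamp_value)
{
	unsigned int id;

	assert(clamp_value <= SCHED_CAPACITY_SCALE);

	id = clamp_value / UCLAMP_BUCKET_DELTA;
	/* Cap the index so the top of the range stays in the last bucket. */
	return id < UCLAMP_BUCKETS - 1u ? id : UCLAMP_BUCKETS - 1u;
}
```

With these definitions a clamp value of 300 lands in bucket 1 and 1024 lands in bucket 4, which matches the enqueue example in the document.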
On Sat, Nov 05, 2022 at 11:23:43PM +0000, Qais Yousef wrote:
> +2. DESIGN:
> +===========
Why ALLCAPS and a trailing colon in the section title?
> +When a task is attached to a CPU controller, its uclamp values will be impacted
> +as follows:
> +
> +* cpu.uclamp.min is a protection as described in section 3-3 in
> + Documentation/admin-guide/cgroup-v2.rst.
> <snipped>...
> +* cpu.uclamp.max is a limit as described in section 3-2 in
> + Documentation/admin-guide/cgroup-v2.rst.
> +
Exactly which section of the cgroup doc are you referring to? I don't see any
section numbers there. Did you mean this?:
---- >8 ----
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index dc254a3cb95686..fd448069c11562 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -619,6 +619,8 @@ process migrations.
and is an example of this type.
+.. _cgroupv2-limits-distributor:
+
Limits
------
@@ -635,6 +637,7 @@ process migrations.
"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
on an IO device and is an example of this type.
+.. _cgroupv2-protections-distributor:
Protections
-----------
diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
index 6601bda176d16e..5741acb35b7db2 100644
--- a/Documentation/scheduler/sched-util-clamp.rst
+++ b/Documentation/scheduler/sched-util-clamp.rst
@@ -364,8 +364,8 @@ There are two uclamp related values in the CPU cgroup controller:
When a task is attached to a CPU controller, its uclamp values will be impacted
as follows:
-* cpu.uclamp.min is a protection as described in section 3-3 in
- Documentation/admin-guide/cgroup-v2.rst.
+* cpu.uclamp.min is a protection as described in :ref:`section 3-3 of cgroup
+ v2 documentation <cgroupv2-protections-distributor>`.
If a task uclamp_min value is lower than cpu.uclamp.min, then the task will
inherit the cgroup cpu.uclamp.min value.
@@ -373,8 +373,8 @@ as follows:
In a cgroup hierarchy, effective cpu.uclamp.min is the max of (child,
parent).
-* cpu.uclamp.max is a limit as described in section 3-2 in
- Documentation/admin-guide/cgroup-v2.rst.
+* cpu.uclamp.max is a limit as described in :ref:`section 3-2 of cgroup v2
+ documentation <cgroupv2-limits-distributor>`.
If a task uclamp_max value is higher than cpu.uclamp.max, then the task will
inherit the cgroup cpu.uclamp.max value.
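To make the protection/limit wording in the hunk above concrete, here is a small sketch of the rule as the text states it. It is an assumption-laden illustration rather than the kernel's implementation: `struct clamp_range` and `cgroup_restrict()` are hypothetical names, and all values are expressed as percentages (0 to 100) for brevity.

```c
#include <assert.h>

/* Hypothetical pair of clamp values, in percent (0..100). */
struct clamp_range {
	unsigned int min;
	unsigned int max;
};

/*
 * Apply the cgroup values to a task's requested clamps, following the
 * quoted wording: cpu.uclamp.min raises a lower task minimum, and
 * cpu.uclamp.max lowers a higher task maximum.
 */
struct clamp_range cgroup_restrict(struct clamp_range task,
				   struct clamp_range cgroup)
{
	struct clamp_range eff = task;

	assert(task.min <= 100 && task.max <= 100);
	assert(cgroup.min <= 100 && cgroup.max <= 100);

	if (eff.min < cgroup.min)
		eff.min = cgroup.min;	/* task inherits cpu.uclamp.min */
	if (eff.max > cgroup.max)
		eff.max = cgroup.max;	/* task inherits cpu.uclamp.max */

	return eff;
}
```

For a cgroup with min 20 and max 60, a task at the defaults (0, 100) ends up at (20, 60), while a task requesting (40, 50) is left intact, as in the cgroup0 example of the document.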
IMO, the doc wording can be improved (applied on top of your fixup [1]):
---- >8 ----
diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
index 728ffa364fc7ad..6601bda176d16e 100644
--- a/Documentation/scheduler/sched-util-clamp.rst
+++ b/Documentation/scheduler/sched-util-clamp.rst
@@ -2,31 +2,29 @@
Utilization Clamping
====================
-1. INTRODUCTION
-================
+1. Introduction
+===============
-Utilization clamping is a scheduler feature that allows user space to help in
-managing the performance requirement of tasks. It was introduced in v5.3
-release. The CGroup support was merged in v5.4.
-
-It is often referred to as util clamp and uclamp. You'll find all variations
-used interchangeably in this documentation and in the source code.
+Utilization clamping, also known as util clamp or uclamp, is a scheduler
+feature that allows user space to help in managing the performance requirement
+of tasks. It was introduced in v5.3 release. The CGroup support was merged in
+v5.4.
Uclamp is a hinting mechanism that allows the scheduler to understand the
-performance requirements and restrictions of the tasks. Hence help it make
-a better placement decision. And when schedutil cpufreq governor is used, util
-clamp will influence the frequency selection as well.
+performance requirements and restrictions of the tasks, thus helping the
+scheduler make better decisions. And when the schedutil cpufreq governor is
+used, util clamp will influence the frequency selection as well.
Since scheduler and schedutil are both driven by PELT (util_avg) signals, util
clamp acts on that to achieve its goal by clamping the signal to a certain
-point; hence the name. I.e: by clamping utilization we are making the system
-run at a certain performance point.
+point; hence the name. That is, by clamping utilization we are making the
+system run at a certain performance point.
-The right way to view util clamp is as a mechanism to make performance
-constraints request/hint. It consists of two components:
+The right way to view util clamp is as a mechanism to make requests or hints
+on performance constraints. It consists of two tunables:
- * UCLAMP_MIN, which sets a lower bound.
- * UCLAMP_MAX, which sets an upper bound.
+ * UCLAMP_MIN, which sets the lower bound.
+ * UCLAMP_MAX, which sets the upper bound.
These two bounds will ensure a task will operate within this performance range
of the system. UCLAMP_MIN implies boosting a task, while UCLAMP_MAX implies
@@ -35,18 +33,18 @@ capping a task.
One can tell the system (scheduler) that some tasks require a minimum
performance point to operate at to deliver the desired user experience. Or one
can tell the system that some tasks should be restricted from consuming too
-much resources and should NOT go above a specific performance point. Viewing
+much resources and should not go above a specific performance point. Viewing
the uclamp values as performance points rather than utilization is a better
abstraction from user space point of view.
As an example, a game can use util clamp to form a feedback loop with its
perceived FPS. It can dynamically increase the minimum performance point
required by its display pipeline to ensure no frame is dropped. It can also
-dynamically 'prime' up these tasks if it knows in the coming few 100ms
-a computationally intensive scene is about to happen.
+dynamically 'prime' up these tasks if it knows in the coming few hundred
+milliseconds a computationally intensive scene is about to happen.
On mobile hardware where the capability of the devices varies a lot, this
-dynamic feedback loop offers a great flexibility in ensuring best user
+dynamic feedback loop offers great flexibility to ensure the best user
experience given the capabilities of any system.
Of course a static configuration is possible too. The exact usage will depend
@@ -68,17 +66,17 @@ stay on the little cores which will ensure that:
are CPU intensive tasks.
By making these uclamp performance requests, or rather hints, user space can
-ensure system resources are used optimally to deliver the best user experience
-the system is capable of.
+ensure system resources are used optimally to deliver the best possible user
+experience.
Another use case is to help with overcoming the ramp up latency inherit in how
scheduler utilization signal is calculated.
-A busy task for instance that requires to run at maximum performance point will
-suffer a delay of ~200ms (PELT HALFIFE = 32ms) for the scheduler to realize
-that. This is known to affect workloads like gaming on mobile devices where
-frames will drop due to slow response time to select the higher frequency
-required for the tasks to finish their work in time.
+On the other hand, a busy task for instance that needs to run at maximum
+performance point will suffer a delay of ~200ms (PELT HALFLIFE = 32ms) for the
+scheduler to realize that. This is known to affect workloads like gaming on
+mobile devices where frames will drop due to slow response time to select the
+higher frequency required for the tasks to finish their work in time.
The overall visible effect goes beyond better perceived user
experience/performance and stretches to help achieve a better overall
@@ -101,11 +99,12 @@ when an RT task wakes up. This cost is unchanged by using uclamp. Uclamp only
helps picking what frequency to request instead of schedutil always requesting
MAX for all RT tasks.
-See section 3.4 for default values and 3.4.1 on how to change RT tasks default
-value.
+See :ref:`section 3.4 <uclamp-default-values>` for default values and
+:ref:`3.4.1 <sched-util-clamp-min-rt-default>` on how to change RT tasks
+default value.
-2. DESIGN:
-===========
+2. Design
+=========
Util clamp is a property of every task in the system. It sets the boundaries of
its utilization signal; acting as a bias mechanism that influences certain
@@ -123,10 +122,10 @@ which have implications on the utilization value at rq level, which brings us
to the main design challenge.
When a task wakes up on an rq, the utilization signal of the rq will be
-impacted by the uclamp settings of all the tasks enqueued on it. For example if
+affected by the uclamp settings of all the tasks enqueued on it. For example if
a task requests to run at UTIL_MIN = 512, then the util signal of the rq needs
-to respect this request as well as all other requests from all of the enqueued
-tasks.
+to respect this request as well as all other requests from all of the
+enqueued tasks.
To be able to aggregate the util clamp value of all the tasks attached to the
rq, uclamp must do some housekeeping at every enqueue/dequeue, which is the
@@ -138,19 +137,21 @@ The way this is handled is by dividing the utilization range into buckets
(struct uclamp_bucket) which allows us to reduce the search space from every
task on the rq to only a subset of tasks on the top-most bucket.
-When a task is enqueued, we increment a counter in the matching bucket. And on
-dequeue we decrement it. This makes keeping track of the effective uclamp value
-at rq level a lot easier.
+When a task is enqueued, the counter in the matching bucket is incremented,
+and on dequeue it is decremented. This makes keeping track of the effective
+uclamp value at rq level a lot easier.
-As we enqueue and dequeue tasks we keep track of the current effective uclamp
-value of the rq. See section 2.1 for details on how this works.
+As tasks are enqueued and dequeued, we keep track of the current effective
+uclamp value of the rq. See :ref:`section 2.1 <uclamp-buckets>` for details on
+how this works.
Later at any path that wants to identify the effective uclamp value of the rq,
it will simply need to read this effective uclamp value of the rq at that exact
moment of time it needs to take a decision.
For task placement case, only Energy Aware and Capacity Aware Scheduling
-(EAS/CAS) make use of uclamp for now. This implies heterogeneous systems only.
+(EAS/CAS) make use of uclamp for now, which implies that it is applied on
+heterogeneous systems only.
When a task wakes up, the scheduler will look at the current effective uclamp
value of every rq and compare it with the potential new value if the task were
to be enqueued there. Favoring the rq that will end up with the most energy
@@ -159,17 +160,19 @@ efficient combination.
Similarly in schedutil, when it needs to make a frequency update it will look
at the current effective uclamp value of the rq which is influenced by the set
of tasks currently enqueued there and select the appropriate frequency that
-will honour uclamp requests.
+will satisfy constraints from requests.
Other paths like setting overutilization state (which effectively disables EAS)
make use of uclamp as well. Such cases are considered necessary housekeeping to
allow the 2 main use cases above and will not be covered in detail here as they
could change with implementation details.
-2.1 BUCKETS:
--------------
+.. _uclamp-buckets:
-.. code-block:: c
+2.1. Buckets
+------------
+
+.. code-block::
[struct rq]
@@ -189,7 +192,6 @@ could change with implementation details.
.. note::
- DISCLAMER:
The diagram above is an illustration rather than a true depiction of the
internal data structure.
@@ -198,12 +200,11 @@ an rq as tasks are enqueued/dequeued, the whole utilization range is divided
into N buckets where N is configured at compile time by setting
CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5.
-The rq has a bucket for each uclamp_id: [UCLAMP_MIN, UCLAMP_MAX].
+The rq has a bucket for each uclamp_id tunable: [UCLAMP_MIN, UCLAMP_MAX].
-The range of each bucket is 1024/N. For example for the default value of 5 we
-will have 5 buckets, each of which will cover the following range:
+The range of each bucket is 1024/N. For example, for the default value of 5 there will be 5 buckets, each of which will cover the following range:
-.. code-block:: c
+.. code-block::
DELTA = round_closest(1024/5) = 204.8 = 205
@@ -213,21 +214,21 @@ will have 5 buckets, each of which will cover the following range:
Bucket 3: [615:819]
Bucket 4: [820:1024]
-When a task p
+When a task p with the following tunable parameters
.. code-block:: c
p->uclamp[UCLAMP_MIN] = 300
p->uclamp[UCLAMP_MAX] = 1024
-is enqueued into the rq, Bucket 1 will be incremented for UCLAMP_MIN and Bucket
+is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and bucket
4 will be incremented for UCLAMP_MAX to reflect the fact the rq has a task in
this range.
The rq then keeps track of its current effective uclamp value for each
uclamp_id.
-When a task p is enqueued, the rq value changes as follows:
+When a task p is enqueued, the rq value changes to:
.. code-block:: c
@@ -235,7 +236,7 @@ When a task p is enqueued, the rq value changes as follows:
rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN])
// repeat for UCLAMP_MAX
-When a task is p dequeued the rq value changes as follows:
+Similarly, when p is dequeued the rq value changes to:
.. code-block:: c
@@ -244,11 +245,11 @@ When a task is p dequeued the rq value changes as follows:
// repeat for UCLAMP_MAX
When all buckets are empty, the rq uclamp values are reset to system defaults.
-See section 3.4 for default values.
+See :ref:`section 3.4 <uclamp-default-values>` for details on default values.
-2.2 MAX AGGREGATION:
----------------------
+2.2. Max aggregation
+--------------------
Util clamp is tuned to honour the request for the task that requires the
highest performance point.
@@ -268,19 +269,20 @@ values:
p1->uclamp[UCLAMP_MIN] = 500
p1->uclamp[UCLAMP_MAX] = 500
-then assuming both p0 and p1 are enqueued to the same rq
+then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN
+and UCLAMP_MAX become:
.. code-block:: c
rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500
rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900
-As we shall see in section 5.1, this max aggregation is the cause of one of the
-limitations when using util clamp. Particularly for UCLAMP_MAX hint when user
-space would like to save power.
+As we shall see in :ref:`section 5.1 <uclamp-capping-fail>`, this max
+aggregation is the cause of one of the limitations when using util clamp, in
+particular for UCLAMP_MAX hint when user space would like to save power.
-2.3 HIERARCHICAL AGGREGATION:
-------------------------------
+2.3. Hierarchical aggregation
+-----------------------------
As stated earlier, util clamp is a property of every task in the system. But
the actual applied (effective) value can be influenced by more than just the
@@ -293,80 +295,66 @@ The effective util clamp value of any task is restricted as follows:
2. The restricted value in (1) is then further restricted by the system wide
uclamp settings.
-Section 3 discusses the interfaces and will expand further on that.
+:ref:`Section 3 <uclamp-interfaces>` discusses the interfaces and will expand further on that.
For now suffice to say that if a task makes a request, its actual effective
value will have to adhere to some restrictions imposed by cgroup and system
wide settings.
-The system will still accept the request even if effectively will look
-different; but as soon as the task moves to a different cgroup or a sysadmin
-modifies the system settings, it'll be able to get what it wants if the new
-settings allows it.
+The system will still accept the request even if it is effectively
+beyond the constraints, but as soon as the task moves to a different cgroup
+or a sysadmin modifies the system settings, the request will be satisfied
+only if it is within the new constraints.
In other words, this aggregation will not cause an error when a task changes
-its uclamp values. It just might not be able to achieve it based on those
-factors.
+its uclamp values, but rather the system may not be able to satisfy requests
+based on those factors.
2.4 Range:
-----------
-Uclamp performance request follow the utilization range: [0:1024] inclusive.
+Uclamp performance request has the range of 0 to 1024 inclusive.
-For cgroup interface percentage is used: [0:100] inclusive.
-You can use 'max' instead of 100 like other cgroup interfaces.
+For cgroup interface percentage is used (that is 0 to 100 inclusive).
+Just like other cgroup interfaces, you can use 'max' instead of 100.
-3. INTERFACES:
-===============
+.. _uclamp-interfaces:
-3.1 PER TASK INTERFACE:
-------------------------
+3. Interfaces
+==============
+
+3.1 Per-task interface
+-----------------------
sched_setattr() syscall was extended to accept two new fields:
* sched_util_min: requests the minimum performance point the system should run
- at when this task is running. Or lower performance bound.
+ at when this task is running. Or lower performance bound.
* sched_util_max: requests the maximum performance point the system should run
- at when this task is running. Or upper performance bound.
+ at when this task is running. Or upper performance bound.
-For example:
+For example, the following scenario has 40% to 80% utilization constraints:
.. code-block:: c
attr->sched_util_min = 40% * 1024;
attr->sched_util_max = 80% * 1024;
-Will tell the system that when task @p is running, it should try its best to
-ensure it starts at a performance point no less than 40% of maximum system's
-capability.
-
-And if the task runs for a long enough time so that its actual utilization goes
-above 80%, then it should not cause the system to operate at a performance
-point higher than that.
+When task @p is running, the scheduler should try its best to ensure it starts
+at a performance point no less than 40% of the system's maximum capability. If
+the task runs long enough for its actual utilization to go above 80%, it
+should not cause the system to operate at a performance point higher than that.
The special value -1 is used to reset the uclamp settings to the system
default.
Note that resetting the uclamp value to system default using -1 is not the same
-as setting the uclamp value to system default.
+as manually setting the uclamp value to the system default. This distinction
+is important because, as we shall see in system interfaces, the default value
+for RT could be changed. SCHED_NORMAL/OTHER might gain similar knobs too in
+the future.
-.. code-block:: c
-
- attr->sched_util_min = -1 // p0 is reset to system default e.g: 0
-
-not the same as
-
-.. code-block:: c
-
- attr->sched_util_min = 0 // p0 is set to 0, the fact it is the same
- // as system default is irrelevant
-
-This distinction is important because as we shall see in system interfaces, the
-default value for RT could be changed. SCHED_NORMAL/OTHER might gain similar
-knobs too in the future.
-
-3.2 CGROUP INTERFACE:
-----------------------
+3.2. cgroup interface
+---------------------
There are two uclamp related values in the CPU cgroup controller:
@@ -394,7 +382,7 @@ as follows:
In a cgroup hierarchy, effective cpu.uclamp.max is the min of (child,
parent).
-For example:
+For example, given the following parameters:
.. code-block:: c
@@ -410,7 +398,7 @@ For example:
cgroup1->cpu.uclamp.min = 60% * 1024;
cgroup1->cpu.uclamp.max = 100% * 1024;
-when p0 and p1 are attached to cgroup0
+when p0 and p1 are attached to cgroup0, the values become:
.. code-block:: c
@@ -420,7 +408,7 @@ when p0 and p1 are attached to cgroup0
p1->uclamp[UCLAMP_MIN] = 40% * 1024; // intact
p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact
-when p0 and p1 are attached to cgroup1
+when p0 and p1 are attached to cgroup1, these instead become:
.. code-block:: c
@@ -433,49 +421,46 @@ when p0 and p1 are attached to cgroup1
Note that cgroup interfaces allows cpu.uclamp.max value to be lower than
cpu.uclamp.min. Other interfaces don't allow that.
-3.3 SYSTEM INTERFACE:
+3.3. System interface
----------------------
-3.3.1 sched_util_clamp_min:
-----------------------------
+3.3.1 sched_util_clamp_min
+---------------------------
-System wide limit of allowed UCLAMP_MIN range. By default set to 1024, which
-means tasks are allowed to reach an effective UCLAMP_MIN value in the range of
-[0:1024].
+System wide limit of allowed UCLAMP_MIN range. By default it is set to 1024,
+which means that the permitted effective UCLAMP_MIN range for tasks is
+[0:1024]. By changing it to 512, for example, the range reduces to [0:512].
+This is useful to restrict how much boosting tasks are allowed to acquire.
-By changing it to 512 for example the effective allowed range reduces to
-[0:512].
-
-This is useful to restrict how much boosting tasks are allowed to acquire.
-
-Requests from tasks to go above this point will still succeed, but effectively
-they won't be achieved until this value is >= p->uclamp[UCLAMP_MIN].
+Requests from tasks to go above this knob value will still succeed, but
+they won't be satisfied until it is greater than or equal to
+p->uclamp[UCLAMP_MIN].
The value must be smaller than or equal to sched_util_clamp_max.
-3.3.2 sched_util_clamp_max:
-----------------------------
+3.3.2 sched_util_clamp_max
+---------------------------
-System wide limit of allowed UCLAMP_MAX range. By default set to 1024, which
-means tasks are allowed to reach an effective UCLAMP_MAX value in the range of
-[0:1024].
+System wide limit of allowed UCLAMP_MAX range. By default it is set to 1024,
+which means that the permitted effective UCLAMP_MAX range for tasks is
+[0:1024].
By changing it to 512 for example the effective allowed range reduces to
-[0:512]. The visible impact of this is that no task can run above 512, which in
-return means that all rqs are restricted too. IOW, the whole system is capped
-to half its performance capacity.
+[0:512]. This means that no task can run above 512, which implies that all
+rqs are restricted too. IOW, the whole system is capped to half its performance
+capacity.
-This is useful to restrict the overall maximum performance point of the system.
+This is useful to restrict the overall maximum performance point of the
+system. For example, it can be handy to limit performance when running low on
+battery.
-Can be handy to limit performance when running low on battery.
-
-Requests from tasks to go above this point will still succeed, but effectively
-they won't be achieved until this value is >= p->uclamp[UCLAMP_MAX].
+Requests from tasks to go above this knob value will still succeed, but
+they won't be satisfied until it is greater than or equal to
+p->uclamp[UCLAMP_MAX].
The value must be greater than or equal to sched_util_clamp_min.
-3.4 DEFAULT VALUES:
-----------------------
+.. _uclamp-default-values:
+
+3.4. Default values
+-------------------
By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to:
@@ -484,7 +469,7 @@ By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to:
p_fair->uclamp[UCLAMP_MIN] = 0
p_fair->uclamp[UCLAMP_MAX] = 1024
-That is no boosting or restriction on any task. These default values can't be
+That is, no boosting or restriction on any task. These default values can't be
changed at boot or runtime. No argument was made yet as to why we should
provide this, but can be added in the future.
@@ -495,33 +480,35 @@ For SCHED_FIFO/SCHED_RR tasks:
p_rt->uclamp[UCLAMP_MIN] = 1024
p_rt->uclamp[UCLAMP_MAX] = 1024
-That is by default they're boosted to run at the maximum performance point of
+That is, by default they're boosted to run at the maximum performance point of
the system which retains the historical behavior of the RT tasks.
RT tasks default uclamp_min value can be modified at boot or runtime via
-sysctl. See section 3.4.1.
+sysctl. See the section below.
+
+.. _sched-util-clamp-min-rt-default:
3.4.1 sched_util_clamp_min_rt_default:
---------------------------------------
Running RT tasks at maximum performance point is expensive on battery powered
-devices and not necessary. To allow system designers to offer good performance
-guarantees for RT tasks without pushing it all the way to maximum performance
+devices and not necessary. To allow system developers to offer good performance
+guarantees for these tasks without pushing them all the way to maximum performance
point, this sysctl knob allows tuning the best boost value to address the
system requirement without burning power running at maximum performance point
all the time.
-Application designers are encouraged to use the per task util clamp interface
+Application developers are encouraged to use the per task util clamp interface
to ensure they are performance and power aware. Ideally this knob should be set
to 0 by system designers and leave the task of managing performance
-requirements to the apps themselves.
+requirements to the apps.
-4. HOW TO USE UTIL CLAMP:
-==========================
+4. How to use util clamp
+========================
Util clamp promotes the concept of user space assisted power and performance
-management. At the scheduler level the info required to make the best decision
-are non existent. But with util clamp user space can hint to the scheduler to
+management. The scheduler on its own lacks the info required to make the best
+decision. However, with util clamp user space can hint to the scheduler to
make better decision about task placement and frequency selection.
Best results are achieved by not making any assumptions about the system the
@@ -530,41 +517,41 @@ dynamically monitor and adjust. Ultimately this will allow for a better user
experience at a better perf/watt.
For some systems and use cases, static setup will help to achieve good results.
-Portability will be a problem in this case. After all how much work one can do
-at 100, 200 or 1024 is unknown and a special property of every system. Unless
-there's a specific target system, static setup should be avoided.
+Portability will be a problem in this case. How much work one can do at 100,
+200 or 1024 is different for each system. Unless there's a specific target
+system, static setup should be avoided.
-All in all there are enough possibilities to create a whole framework based on
+There are enough possibilities to create a whole framework based on
util clamp or self contained app that makes use of it directly.
-4.1 BOOST IMPORTANT AND DVFS-LATENCY-SENSITIVE TASKS:
-------------------------------------------------------
+4.1. Boost important and DVFS-latency-sensitive tasks
+-----------------------------------------------------
A GUI task might not be busy to warrant driving the frequency high when it
-wakes up. But it requires to finish its work within a specific period of time
+wakes up. However, it needs to finish its work within a specific time window
to deliver the desired user experience. The right frequency it requires at
wakeup will be system dependent. On some underpowered systems it will be high,
-on other overpowered ones, it will be low or 0.
+on other overpowered ones it will be low or 0.
-This task can increase its UCLAMP_MIN value every time it misses a deadline to
-ensure on next wake up it runs at a higher performance point. It should try to
-approach the lowest UCLAMP_MIN value that allows to meet its deadline on any
+This task can increase its UCLAMP_MIN value every time it misses a deadline to
+ensure on next wake up it runs at a higher performance point. It should try to
+approach the lowest UCLAMP_MIN value that allows it to meet its deadline on any
particular system to achieve the best possible perf/watt for that system.
On heterogeneous systems, it might be important for this task to run on
-a bigger CPU.
+a faster CPU.
Generally it is advised to perceive the input as performance level or point
which will imply both task placement and frequency selection.
-4.2 CAP BACKGROUND TASKS:
---------------------------
+4.2. Cap background tasks
+-------------------------
Like explained for Android case in the introduction. Any app can lower
UCLAMP_MAX for some background tasks that don't care about performance but
could end up being busy and consume unnecessary system resources on the system.
-4.3 POWERSAVE MODE:
+4.3. Powersave mode
--------------------
sched_util_clamp_max system wide interface can be used to limit all tasks from
@@ -575,8 +562,8 @@ This is not unique to uclamp as one can achieve the same by reducing max
frequency of the cpufreq governor. It can be considered a more convenient
alternative interface.
-4.4 PER APP PERFORMANCE RESTRICTIONS:
---------------------------------------
+4.4. Per-app performance restriction
+-------------------------------------
Middleware/Utility can provide the user an option to set UCLAMP_MIN/MAX for an
app every time it is executed to guarantee a minimum performance point and/or
@@ -585,28 +572,31 @@ these apps.
If you want to prevent your laptop from heating up while on the go from
compiling the kernel and happy to sacrifice performance to save power, but
-still would like to keep your browser performance intact; uclamp enables that.
+still would like to keep your browser performance intact, uclamp makes it
+possible.
-5. LIMITATIONS:
-================
+5. Limitations
+==============
-5.1 CAPPING FREQUENCY WITH UCLAMP_MAX FAILS UNDER CERTAIN CONDITIONS:
-----------------------------------------------------------------------
+.. _uclamp-capping-fail:
-If task p0 is capped to run at 512
+5.1. Capping frequency with uclamp_max fails under certain conditions
+---------------------------------------------------------------------
+
+If task p0 is capped to run at 512:
.. code-block:: c
p0->uclamp[UCLAMP_MAX] = 512
-is sharing the rq with p1 which is free to run at any performance point
+and it shares the rq with p1 which is free to run at any performance point:
.. code-block:: c
p1->uclamp[UCLAMP_MAX] = 1024
then due to max aggregation the rq will be allowed to reach max performance
-point
+point:
.. code-block:: c
@@ -620,19 +610,19 @@ both are running at the same rq, p1 will cause the frequency capping to be left
from the rq although p1, which is allowed to run at any performance point,
doesn't actually need to run at that frequency.
-5.2 UCLAMP_MAX CAN BREAK PELT (UTIL_AVG) SIGNAL
+5.2. UCLAMP_MAX can break PELT (util_avg) signal
------------------------------------------------
PELT assumes that frequency will always increase as the signals grow to ensure
-there's always some idle time on the CPU. But with UCLAMP_MAX, we will prevent
-this frequency increase which can lead to no idle time in some circumstances.
-When there's no idle time, then a task will look like a busy loop, which would
-result in util_avg being 1024.
+there's always some idle time on the CPU. But with UCLAMP_MAX, this frequency
+increase will be prevented, which can lead to no idle time in some
+circumstances. When there's no idle time, a task will look like a busy loop,
+which would result in util_avg being 1024.
-Combing with issue described in 5.2, this an lead to unwanted frequency spikes
+Combined with the issue described below, this can lead to unwanted frequency spikes
when severely capped tasks share the rq with a small non capped task.
-As an example if task p
+As an example, if task p, which has:
.. code-block:: c
@@ -646,35 +636,35 @@ of.
rq->uclamp[UCLAMP_MAX] = 0
-If the ratio of Fmax/Fmin is 3, then
+If the ratio of Fmax/Fmin is 3, then the maximum value will be:
.. code-block:: c
300 * (Fmax/Fmin) = 900
-Which indicates the CPU will still see idle time since 900 is < 1024. The
-_actual_ util_avg will NOT be 900 though. It will be higher than 300, but won't
-approach 900. As long as there's idle time, p->util_avg updates will be off by
-a some margin, but not proportional to Fmax/Fmin.
+which indicates the CPU will still see idle time since 900 is < 1024. The
+_actual_ util_avg will not be 900 though. It will be higher than 300, but won't
+approach 900. As long as there's idle time, p->util_avg updates will be off by
+some margin, but not proportional to Fmax/Fmin.
.. code-block:: c
p0->util_avg = 300 + small_error
-Now if the ratio of Fmax/Fmin is 4, then
+Now if the ratio of Fmax/Fmin is 4, the maximum value becomes:
.. code-block:: c
300 * (Fmax/Fmin) = 1200
which is higher than 1024 and indicates that the CPU has no idle time. When
-this happens, then the _actual_ util_avg will become 1024.
+this happens, then the _actual_ util_avg will become:
.. code-block:: c
p0->util_avg = 1024
-If task p1 wakes up on this CPU
+If task p1, which has the following values, wakes up on this CPU:
.. code-block:: c
@@ -683,7 +673,7 @@ If task p1 wakes up on this CPU
then the effective UCLAMP_MAX for the CPU will be 1024 according to max
aggregation rule. But since the capped p0 task was running and throttled
-severely, then the rq->util_avg will be 1024.
+severely, then the rq->util_avg will be:
.. code-block:: c
@@ -693,7 +683,7 @@ severely, then the rq->util_avg will be 1024.
rq->util_avg = 1024
rq->uclamp[UCLAMP_MAX] = 1024
-Hence lead to a frequency spike since if p0 wasn't throttled we should get
+This leads to a frequency spike, since if p0 wasn't throttled we should get:
.. code-block:: c
Thanks.
[1]: https://lore.kernel.org/lkml/20221113152629.3wbyeejsj5v33rvu@airbuntu/
On 11/14/22 16:22, Bagas Sanjaya wrote:
> On Sat, Nov 05, 2022 at 11:23:43PM +0000, Qais Yousef wrote:
> > +2. DESIGN:
> > +===========
>
> Why ALLCAPS and trailing colon for section title?
Fixed.
>
> > +When a task is attached to a CPU controller, its uclamp values will be impacted
> > +as follows:
> > +
> > +* cpu.uclamp.min is a protection as described in section 3-3 in
> > + Documentation/admin-guide/cgroup-v2.rst.
> > <snipped>...
> > +* cpu.uclamp.max is a limit as described in section 3-2 in
> > + Documentation/admin-guide/cgroup-v2.rst.
> > +
>
> Exactly what section on cgroup doc do you refer? I don't see any section
I got the number from '.. Contents' near the top of the doc
37 3. Resource Distribution Models
38 3-1. Weights
39 3-2. Limits
40 3-3. Protections
41 3-4. Allocations
> number there. Did you mean this?:
Correct.
Thanks!
--
Qais Yousef
>
> ---- >8 ----
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index dc254a3cb95686..fd448069c11562 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -619,6 +619,8 @@ process migrations.
> and is an example of this type.
>
>
> +.. _cgroupv2-limits-distributor:
> +
> Limits
> ------
>
> @@ -635,6 +637,7 @@ process migrations.
> "io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
> on an IO device and is an example of this type.
>
> +.. _cgroupv2-protections-distributor:
>
> Protections
> -----------
> diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
> index 6601bda176d16e..5741acb35b7db2 100644
> --- a/Documentation/scheduler/sched-util-clamp.rst
> +++ b/Documentation/scheduler/sched-util-clamp.rst
> @@ -364,8 +364,8 @@ There are two uclamp related values in the CPU cgroup controller:
> When a task is attached to a CPU controller, its uclamp values will be impacted
> as follows:
>
> -* cpu.uclamp.min is a protection as described in section 3-3 in
> - Documentation/admin-guide/cgroup-v2.rst.
> +* cpu.uclamp.min is a protection as described in :ref:`section 3-3 of cgroup
> + v2 documentation <cgroupv2-protections-distributor>`.
>
> If a task uclamp_min value is lower than cpu.uclamp.min, then the task will
> inherit the cgroup cpu.uclamp.min value.
> @@ -373,8 +373,8 @@ as follows:
> In a cgroup hierarchy, effective cpu.uclamp.min is the max of (child,
> parent).
>
> -* cpu.uclamp.max is a limit as described in section 3-2 in
> - Documentation/admin-guide/cgroup-v2.rst.
> +* cpu.uclamp.max is a limit as described in :ref:`section 3-2 of cgroup v2
> + documentation <cgroupv2-limits-distributor>`.
>
> If a task uclamp_max value is higher than cpu.uclamp.max, then the task will
> inherit the cgroup cpu.uclamp.max value.
>
>
> IMO, the doc wording can be improved (applied on top of your fixup [1]):
>
> ---- >8 ----
>
> diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
> index 728ffa364fc7ad..6601bda176d16e 100644
> --- a/Documentation/scheduler/sched-util-clamp.rst
> +++ b/Documentation/scheduler/sched-util-clamp.rst
> @@ -2,31 +2,29 @@
> Utilization Clamping
> ====================
>
> -1. INTRODUCTION
> -================
> +1. Introduction
> +===============
>
> -Utilization clamping is a scheduler feature that allows user space to help in
> -managing the performance requirement of tasks. It was introduced in v5.3
> -release. The CGroup support was merged in v5.4.
> -
> -It is often referred to as util clamp and uclamp. You'll find all variations
> -used interchangeably in this documentation and in the source code.
> +Utilization clamping, also known as util clamp or uclamp, is a scheduler
> +feature that allows user space to help in managing the performance requirement
> +of tasks. It was introduced in v5.3 release. The CGroup support was merged in
> +v5.4.
>
> Uclamp is a hinting mechanism that allows the scheduler to understand the
> -performance requirements and restrictions of the tasks. Hence help it make
> -a better placement decision. And when schedutil cpufreq governor is used, util
> -clamp will influence the frequency selection as well.
> +performance requirements and restrictions of the tasks, thus helping the
> +scheduler make a better decision. And when schedutil cpufreq governor is
> +used, util clamp will influence the frequency selection as well.
>
> Since scheduler and schedutil are both driven by PELT (util_avg) signals, util
> clamp acts on that to achieve its goal by clamping the signal to a certain
> -point; hence the name. I.e: by clamping utilization we are making the system
> -run at a certain performance point.
> +point; hence the name. That is, by clamping utilization we are making the
> +system run at a certain performance point.
>
> -The right way to view util clamp is as a mechanism to make performance
> -constraints request/hint. It consists of two components:
> +The right way to view util clamp is as a mechanism to make a request or hint
> +about performance constraints. It consists of two tunables:
>
> - * UCLAMP_MIN, which sets a lower bound.
> - * UCLAMP_MAX, which sets an upper bound.
> + * UCLAMP_MIN, which sets the lower bound.
> + * UCLAMP_MAX, which sets the upper bound.
>
> These two bounds will ensure a task will operate within this performance range
> of the system. UCLAMP_MIN implies boosting a task, while UCLAMP_MAX implies
> @@ -35,18 +33,18 @@ capping a task.
> One can tell the system (scheduler) that some tasks require a minimum
> performance point to operate at to deliver the desired user experience. Or one
> can tell the system that some tasks should be restricted from consuming too
> -much resources and should NOT go above a specific performance point. Viewing
> +much resources and should not go above a specific performance point. Viewing
> the uclamp values as performance points rather than utilization is a better
> abstraction from user space point of view.
>
> As an example, a game can use util clamp to form a feedback loop with its
> perceived FPS. It can dynamically increase the minimum performance point
> required by its display pipeline to ensure no frame is dropped. It can also
> -dynamically 'prime' up these tasks if it knows in the coming few 100ms
> -a computationally intensive scene is about to happen.
> +dynamically 'prime' up these tasks if it knows in the coming few hundred
> +milliseconds a computationally intensive scene is about to happen.
>
> On mobile hardware where the capability of the devices varies a lot, this
> -dynamic feedback loop offers a great flexibility in ensuring best user
> +dynamic feedback loop offers a great flexibility to ensure best user
> experience given the capabilities of any system.
>
> Of course a static configuration is possible too. The exact usage will depend
> @@ -68,17 +66,17 @@ stay on the little cores which will ensure that:
> are CPU intensive tasks.
>
> By making these uclamp performance requests, or rather hints, user space can
> -ensure system resources are used optimally to deliver the best user experience
> -the system is capable of.
> +ensure system resources are used optimally to deliver the best possible user
> +experience.
>
> Another use case is to help with overcoming the ramp up latency inherit in how
> scheduler utilization signal is calculated.
>
> -A busy task for instance that requires to run at maximum performance point will
> -suffer a delay of ~200ms (PELT HALFIFE = 32ms) for the scheduler to realize
> -that. This is known to affect workloads like gaming on mobile devices where
> -frames will drop due to slow response time to select the higher frequency
> -required for the tasks to finish their work in time.
> +For instance, a busy task that requires to run at maximum performance point
> +will suffer a delay of ~200ms (PELT HALFLIFE = 32ms) for the scheduler to
> +realize that. This is known to affect workloads like gaming on mobile devices
> +where frames will drop due to slow response time to select the higher
> +frequency required for the tasks to finish their work in time.
>
> The overall visible effect goes beyond better perceived user
> experience/performance and stretches to help achieve a better overall
> @@ -101,11 +99,12 @@ when an RT task wakes up. This cost is unchanged by using uclamp. Uclamp only
> helps picking what frequency to request instead of schedutil always requesting
> MAX for all RT tasks.
>
> -See section 3.4 for default values and 3.4.1 on how to change RT tasks default
> -value.
> +See :ref:`section 3.4 <uclamp-default-values>` for default values and
> +:ref:`3.4.1 <sched-util-clamp-min-rt-default>` on how to change RT tasks
> +default value.
>
> -2. DESIGN:
> -===========
> +2. Design
> +=========
>
> Util clamp is a property of every task in the system. It sets the boundaries of
> its utilization signal; acting as a bias mechanism that influences certain
> @@ -123,10 +122,10 @@ which have implications on the utilization value at rq level, which brings us
> to the main design challenge.
>
> When a task wakes up on an rq, the utilization signal of the rq will be
> -impacted by the uclamp settings of all the tasks enqueued on it. For example if
> +affected by the uclamp settings of all the tasks enqueued on it. For example if
> a task requests to run at UTIL_MIN = 512, then the util signal of the rq needs
> -to respect this request as well as all other requests from all of the enqueued
> -tasks.
> +to respect this request as well as all other requests from all of the
> +enqueued tasks.
>
> To be able to aggregate the util clamp value of all the tasks attached to the
> rq, uclamp must do some housekeeping at every enqueue/dequeue, which is the
> @@ -138,19 +137,21 @@ The way this is handled is by dividing the utilization range into buckets
> (struct uclamp_bucket) which allows us to reduce the search space from every
> task on the rq to only a subset of tasks on the top-most bucket.
>
> -When a task is enqueued, we increment a counter in the matching bucket. And on
> -dequeue we decrement it. This makes keeping track of the effective uclamp value
> -at rq level a lot easier.
> +When a task is enqueued, the counter in the matching bucket is incremented,
> +and on dequeue it is decremented. This makes keeping track of the effective
> +uclamp value at rq level a lot easier.
>
> -As we enqueue and dequeue tasks we keep track of the current effective uclamp
> -value of the rq. See section 2.1 for details on how this works.
> +As tasks are enqueued and dequeued, we keep track of the current effective
> +uclamp value of the rq. See :ref:`section 2.1 <uclamp-buckets>` for details on
> +how this works.
>
> Later at any path that wants to identify the effective uclamp value of the rq,
> it will simply need to read this effective uclamp value of the rq at that exact
> moment of time it needs to take a decision.
>
> For task placement case, only Energy Aware and Capacity Aware Scheduling
> -(EAS/CAS) make use of uclamp for now. This implies heterogeneous systems only.
> +(EAS/CAS) make use of uclamp for now, which implies that it is applied on
> +heterogeneous systems only.
> When a task wakes up, the scheduler will look at the current effective uclamp
> value of every rq and compare it with the potential new value if the task were
> to be enqueued there. Favoring the rq that will end up with the most energy
> @@ -159,17 +160,19 @@ efficient combination.
> Similarly in schedutil, when it needs to make a frequency update it will look
> at the current effective uclamp value of the rq which is influenced by the set
> of tasks currently enqueued there and select the appropriate frequency that
> -will honour uclamp requests.
> +will satisfy constraints from requests.
>
> Other paths like setting overutilization state (which effectively disables EAS)
> make use of uclamp as well. Such cases are considered necessary housekeeping to
> allow the 2 main use cases above and will not be covered in detail here as they
> could change with implementation details.
>
> -2.1 BUCKETS:
> --------------
> +.. _uclamp-buckets:
>
> -.. code-block:: c
> +2.1. Buckets
> +------------
> +
> +.. code-block::
>
> [struct rq]
>
> @@ -189,7 +192,6 @@ could change with implementation details.
>
>
> .. note::
> - DISCLAMER:
> The diagram above is an illustration rather than a true depiction of the
> internal data structure.
>
> @@ -198,12 +200,11 @@ an rq as tasks are enqueued/dequeued, the whole utilization range is divided
> into N buckets where N is configured at compile time by setting
> CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5.
>
> -The rq has a bucket for each uclamp_id: [UCLAMP_MIN, UCLAMP_MAX].
> +The rq has a bucket for each uclamp_id tunable: [UCLAMP_MIN, UCLAMP_MAX].
>
> -The range of each bucket is 1024/N. For example for the default value of 5 we
> -will have 5 buckets, each of which will cover the following range:
> +The range of each bucket is 1024/N. For example, for the default value of 5
> +there will be 5 buckets, each of which will cover the following range:
>
> -.. code-block:: c
> +.. code-block::
>
> DELTA = round_closest(1024/5) = 204.8 = 205
>
> @@ -213,21 +214,21 @@ will have 5 buckets, each of which will cover the following range:
> Bucket 3: [615:819]
> Bucket 4: [820:1024]
>
> -When a task p
> +When a task p with the following tunable parameters
>
> .. code-block:: c
>
> p->uclamp[UCLAMP_MIN] = 300
> p->uclamp[UCLAMP_MAX] = 1024
>
> -is enqueued into the rq, Bucket 1 will be incremented for UCLAMP_MIN and Bucket
> +is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and bucket
> 4 will be incremented for UCLAMP_MAX to reflect the fact the rq has a task in
> this range.
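FWIW, the value-to-bucket mapping described above could be illustrated with
a small sketch like the one below. It only assumes what the text says: 5
buckets, DELTA = 205, and everything from 820 upwards landing in the last
bucket. BUCKET_DELTA and bucket_id() are names made up for this sketch and
are not meant to mirror the kernel source.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024u
#define UCLAMP_BUCKETS		5u	/* assumed: the default bucket count */

/* round_closest(1024/5) = 205, using integer arithmetic only */
#define BUCKET_DELTA \
	((SCHED_CAPACITY_SCALE + UCLAMP_BUCKETS / 2) / UCLAMP_BUCKETS)

/* Map a clamp value in [0:1024] to the index of the bucket covering it. */
static unsigned int bucket_id(unsigned int clamp_value)
{
	unsigned int id;

	assert(clamp_value <= SCHED_CAPACITY_SCALE);
	id = clamp_value / BUCKET_DELTA;

	/* Guard so the result never goes past the last bucket. */
	return id < UCLAMP_BUCKETS ? id : UCLAMP_BUCKETS - 1;
}
```

With these assumptions, the task p quoted above (UCLAMP_MIN = 300,
UCLAMP_MAX = 1024) maps to bucket 1 and bucket 4 respectively.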
>
> The rq then keeps track of its current effective uclamp value for each
> uclamp_id.
>
> -When a task p is enqueued, the rq value changes as follows:
> +When a task p is enqueued, the rq value changes to:
>
> .. code-block:: c
>
> @@ -235,7 +236,7 @@ When a task p is enqueued, the rq value changes as follows:
> rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN])
> // repeat for UCLAMP_MAX
>
> -When a task is p dequeued the rq value changes as follows:
> +Similarly, when p is dequeued the rq value changes to:
>
> .. code-block:: c
>
> @@ -244,11 +245,11 @@ When a task is p dequeued the rq value changes as follows:
> // repeat for UCLAMP_MAX
>
> When all buckets are empty, the rq uclamp values are reset to system defaults.
> -See section 3.4 for default values.
> +See :ref:`section 3.4 <uclamp-default-values>` for details on default values.
>
>
> -2.2 MAX AGGREGATION:
> ----------------------
> +2.2. Max aggregation
> +--------------------
>
> Util clamp is tuned to honour the request for the task that requires the
> highest performance point.
> @@ -268,19 +269,20 @@ values:
> p1->uclamp[UCLAMP_MIN] = 500
> p1->uclamp[UCLAMP_MAX] = 500
>
> -then assuming both p0 and p1 are enqueued to the same rq
> +then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN
> +and UCLAMP_MAX become:
>
> .. code-block:: c
>
> rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500
> rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900
>
> -As we shall see in section 5.1, this max aggregation is the cause of one of the
> -limitations when using util clamp. Particularly for UCLAMP_MAX hint when user
> -space would like to save power.
> +As we shall see in :ref:`section 5.1 <uclamp-capping-fail>`, this max
> +aggregation is the cause of one of the limitations when using util clamp, in
> +particular for UCLAMP_MAX hint when user space would like to save power.
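As a side note, the max aggregation rule quoted above might be easier to
follow with a tiny sketch. This is a plain linear scan for illustration only,
not the bucket based bookkeeping the scheduler does, and struct sketch_task
and rq_max_aggregate() are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

enum { UCLAMP_MIN, UCLAMP_MAX, UCLAMP_CNT };

/* Hypothetical stand-in for a task carrying its two uclamp requests. */
struct sketch_task {
	unsigned int uclamp[UCLAMP_CNT];
};

/*
 * For each clamp id, the rq level value is the max over all enqueued
 * tasks. Expects at least one task.
 */
static void rq_max_aggregate(const struct sketch_task *tasks, size_t nr,
			     unsigned int rq_uclamp[UCLAMP_CNT])
{
	size_t i;
	int id;

	assert(nr > 0);

	for (id = 0; id < UCLAMP_CNT; id++) {
		rq_uclamp[id] = tasks[0].uclamp[id];
		for (i = 1; i < nr; i++) {
			if (tasks[i].uclamp[id] > rq_uclamp[id])
				rq_uclamp[id] = tasks[i].uclamp[id];
		}
	}
}
```

Feeding it p0 = {300, 900} and p1 = {500, 500} gives the rq values shown in
the quoted example. It also shows why a capped task cannot hold the rq down
once an uncapped task is enqueued next to it.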
>
> -2.3 HIERARCHICAL AGGREGATION:
> -------------------------------
> +2.3. Hierarchical aggregation
> +-----------------------------
>
> As stated earlier, util clamp is a property of every task in the system. But
> the actual applied (effective) value can be influenced by more than just the
> @@ -293,80 +295,66 @@ The effective util clamp value of any task is restricted as follows:
> 2. The restricted value in (1) is then further restricted by the system wide
> uclamp settings.
>
> -Section 3 discusses the interfaces and will expand further on that.
> +:ref:`Section 3 <uclamp-interfaces>` discusses the interfaces and will expand
> +further on that.
>
> For now suffice to say that if a task makes a request, its actual effective
> value will have to adhere to some restrictions imposed by cgroup and system
> wide settings.
>
> -The system will still accept the request even if effectively will look
> -different; but as soon as the task moves to a different cgroup or a sysadmin
> -modifies the system settings, it'll be able to get what it wants if the new
> -settings allows it.
> +The system will still accept the request even if it will effectively be
> +beyond the constraints, but as soon as the task moves to a different cgroup
> +or a sysadmin modifies the system settings, the request will be satisfied
> +only if it is within new constraints.
>
> In other words, this aggregation will not cause an error when a task changes
> -its uclamp values. It just might not be able to achieve it based on those
> -factors.
> +its uclamp values, but rather the system may not be able to satisfy requests
> +based on those factors.
>
> 2.4 Range:
> -----------
>
> -Uclamp performance request follow the utilization range: [0:1024] inclusive.
> +Uclamp performance request has the range of 0 to 1024 inclusive.
>
> -For cgroup interface percentage is used: [0:100] inclusive.
> -You can use 'max' instead of 100 like other cgroup interfaces.
> +For cgroup interface percentage is used (that is 0 to 100 inclusive).
> +Just like other cgroup interfaces, you can use 'max' instead of 100.
>
> -3. INTERFACES:
> -===============
> +.. _uclamp-interfaces:
>
> -3.1 PER TASK INTERFACE:
> -------------------------
> +3. Interfaces
> +==============
> +
> +3.1 Per-task interface
> +-----------------------
>
> sched_setattr() syscall was extended to accept two new fields:
>
> * sched_util_min: requests the minimum performance point the system should run
> - at when this task is running. Or lower performance bound.
> + at when this task is running. Or lower performance bound.
> * sched_util_max: requests the maximum performance point the system should run
> - at when this task is running. Or upper performance bound.
> + at when this task is running. Or upper performance bound.
>
> -For example:
> +For example, the following scenario has 40% to 80% utilization constraints:
>
> .. code-block:: c
>
> attr->sched_util_min = 40% * 1024;
> attr->sched_util_max = 80% * 1024;
>
> -Will tell the system that when task @p is running, it should try its best to
> -ensure it starts at a performance point no less than 40% of maximum system's
> -capability.
> -
> -And if the task runs for a long enough time so that its actual utilization goes
> -above 80%, then it should not cause the system to operate at a performance
> -point higher than that.
> +When task @p is running, the scheduler should try its best to ensure it starts
> +at 40% utilization. If the task runs for a long enough time so that its actual
> +utilization goes above 80%, the utilization will be capped.
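
The "40% * 1024" notation in the example is shorthand, not valid C. A minimal
sketch of how user space might fill in the attributes is shown below. The
struct is a local mirror of the kernel's struct sched_attr, and the names
sched_attr_sketch, pct_to_util() and uclamp_attr_init() are hypothetical
helpers for this illustration only. The flag values are assumed to match
include/uapi/linux/sched.h, and the percentage conversion rounds down.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCHED_CAPACITY_SCALE		1024

/* Flag values assumed to match include/uapi/linux/sched.h. */
#define SCHED_FLAG_KEEP_POLICY		0x08
#define SCHED_FLAG_KEEP_PARAMS		0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40

/* Local mirror of the kernel's struct sched_attr layout (56 bytes). */
struct sched_attr_sketch {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* Convert a percentage (0..100) to the 0..1024 scale, rounding down. */
static uint32_t pct_to_util(unsigned int pct)
{
	assert(pct <= 100);
	return pct * SCHED_CAPACITY_SCALE / 100;
}

/* Fill @attr so that only the two clamp values of the task are changed. */
static void uclamp_attr_init(struct sched_attr_sketch *attr,
			     unsigned int min_pct, unsigned int max_pct)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->sched_flags = SCHED_FLAG_KEEP_POLICY | SCHED_FLAG_KEEP_PARAMS |
			    SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX;
	attr->sched_util_min = pct_to_util(min_pct);
	attr->sched_util_max = pct_to_util(max_pct);
}
```

After uclamp_attr_init(&attr, 40, 80) the structure would typically be handed
to the kernel with syscall(SYS_sched_setattr, pid, &attr, 0). Whether the
request takes effect still depends on the cgroup and system wide limits.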
>
> The special value -1 is used to reset the uclamp settings to the system
> default.
>
> Note that resetting the uclamp value to system default using -1 is not the same
> -as setting the uclamp value to system default.
> +as manually setting uclamp value to system default. This distinction is
> +important because as we shall see in system interfaces, the default value for
> +RT could be changed. SCHED_NORMAL/OTHER might gain similar knobs too in the
> +future.
>
> -.. code-block:: c
> -
> - attr->sched_util_min = -1 // p0 is reset to system default e.g: 0
> -
> -not the same as
> -
> -.. code-block:: c
> -
> - attr->sched_util_min = 0 // p0 is set to 0, the fact it is the same
> - // as system default is irrelevant
> -
> -This distinction is important because as we shall see in system interfaces, the
> -default value for RT could be changed. SCHED_NORMAL/OTHER might gain similar
> -knobs too in the future.
> -
> -3.2 CGROUP INTERFACE:
> -----------------------
> +3.2. cgroup interface
> +---------------------
>
> There are two uclamp related values in the CPU cgroup controller:
>
> @@ -394,7 +382,7 @@ as follows:
> In a cgroup hierarchy, effective cpu.uclamp.max is the min of (child,
> parent).
>
> -For example:
> +For example, given following parameters:
>
> .. code-block:: c
>
> @@ -410,7 +398,7 @@ For example:
> cgroup1->cpu.uclamp.min = 60% * 1024;
> cgroup1->cpu.uclamp.max = 100% * 1024;
>
> -when p0 and p1 are attached to cgroup0
> +when p0 and p1 are attached to cgroup0, the values become:
>
> .. code-block:: c
>
> @@ -420,7 +408,7 @@ when p0 and p1 are attached to cgroup0
> p1->uclamp[UCLAMP_MIN] = 40% * 1024; // intact
> p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact
>
> -when p0 and p1 are attached to cgroup1
> +when p0 and p1 are attached to cgroup1, these instead become:
>
> .. code-block:: c
>
> @@ -433,49 +421,46 @@ when p0 and p1 are attached to cgroup1
> Note that cgroup interfaces allows cpu.uclamp.max value to be lower than
> cpu.uclamp.min. Other interfaces don't allow that.
>
> -3.3 SYSTEM INTERFACE:
> +3.3. System interface
> ----------------------
>
> -3.3.1 sched_util_clamp_min:
> -----------------------------
> +3.3.1 sched_util_clamp_min
> +---------------------------
>
> -System wide limit of allowed UCLAMP_MIN range. By default set to 1024, which
> -means tasks are allowed to reach an effective UCLAMP_MIN value in the range of
> -[0:1024].
> +System wide limit of allowed UCLAMP_MIN range. By default it is set to 1024,
> +which means that permitted effective UCLAMP_MIN range for tasks is [0:1024].
> +By changing it to 512 for example the range reduces to [0:512]. This is useful
> +to restrict how much boosting tasks are allowed to acquire.
>
> -By changing it to 512 for example the effective allowed range reduces to
> -[0:512].
> -
> -This is useful to restrict how much boosting tasks are allowed to acquire.
> -
> -Requests from tasks to go above this point will still succeed, but effectively
> -they won't be achieved until this value is >= p->uclamp[UCLAMP_MIN].
> +Requests from tasks to go above this knob value will still succeed, but
> +they won't be satisfied until it is at least p->uclamp[UCLAMP_MIN].
>
> The value must be smaller than or equal to sched_util_clamp_max.
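
The "request succeeds but is not satisfied" behaviour can be pictured as a
simple cap applied at the time the value is used. The helper below is a
hypothetical sketch under that assumption, not the kernel code: the task's
request is stored untouched and only the effective value is limited by the
system wide knob.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024

/*
 * Effective value of a request under a system wide limit. The stored
 * request is never modified; only the value that gets applied is capped.
 */
static unsigned int uclamp_effective(unsigned int request, unsigned int limit)
{
	assert(request <= SCHED_CAPACITY_SCALE);
	assert(limit <= SCHED_CAPACITY_SCALE);

	return request < limit ? request : limit;
}
```

A task requesting 800 while the knob is 512 effectively gets 512. Once the
knob is raised back to 1024 the same stored request yields 800 without the
task having to ask again.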
>
> -3.3.2 sched_util_clamp_max:
> -----------------------------
> +3.3.2 sched_util_clamp_max
> +---------------------------
>
> -System wide limit of allowed UCLAMP_MAX range. By default set to 1024, which
> -means tasks are allowed to reach an effective UCLAMP_MAX value in the range of
> -[0:1024].
> +System wide limit of allowed UCLAMP_MAX range. By default it is set to 1024,
> +which means that permitted effective UCLAMP_MAX range for tasks is [0:1024].
>
> By changing it to 512 for example the effective allowed range reduces to
> -[0:512]. The visible impact of this is that no task can run above 512, which in
> -return means that all rqs are restricted too. IOW, the whole system is capped
> -to half its performance capacity.
> +[0:512]. This means that no task can run above 512, which implies that all
> +rqs are restricted too. IOW, the whole system is capped to half its performance
> +capacity.
>
> -This is useful to restrict the overall maximum performance point of the system.
> +This is useful to restrict the overall maximum performance point of the
> +system. For example, it can be handy to limit performance when running low on
> +battery.
>
> -Can be handy to limit performance when running low on battery.
> -
> -Requests from tasks to go above this point will still succeed, but effectively
> -they won't be achieved until this value is >= p->uclamp[UCLAMP_MAX].
> +Requests from tasks to go above this knob value will still succeed, but
> +they won't be satisfied until it is at least p->uclamp[UCLAMP_MAX].
>
> The value must be greater than or equal to sched_util_clamp_min.
>
> -3.4 DEFAULT VALUES:
> -----------------------
> +.. _uclamp-default-values:
> +
> +3.4. Default values
> +-------------------
>
> By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to:
>
> @@ -484,7 +469,7 @@ By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to:
> p_fair->uclamp[UCLAMP_MIN] = 0
> p_fair->uclamp[UCLAMP_MAX] = 1024
>
> -That is no boosting or restriction on any task. These default values can't be
> +That is, no boosting or restriction on any task. These default values can't be
> changed at boot or runtime. No argument was made yet as to why we should
> provide this, but can be added in the future.
>
> @@ -495,33 +480,35 @@ For SCHED_FIFO/SCHED_RR tasks:
> p_rt->uclamp[UCLAMP_MIN] = 1024
> p_rt->uclamp[UCLAMP_MAX] = 1024
>
> -That is by default they're boosted to run at the maximum performance point of
> +That is, by default they're boosted to run at the maximum performance point of
> the system which retains the historical behavior of the RT tasks.
>
> RT tasks default uclamp_min value can be modified at boot or runtime via
> -sysctl. See section 3.4.1.
> +sysctl. See below section.
> +
> +.. _sched-util-clamp-min-rt-default:
>
> 3.4.1 sched_util_clamp_min_rt_default:
> ---------------------------------------
>
> Running RT tasks at maximum performance point is expensive on battery powered
> -devices and not necessary. To allow system designers to offer good performance
> -guarantees for RT tasks without pushing it all the way to maximum performance
> +devices and not necessary. To allow system developers to offer good performance
> +guarantees for these tasks without pushing it all the way to maximum performance
> point, this sysctl knob allows tuning the best boost value to address the
> system requirement without burning power running at maximum performance point
> all the time.
>
> -Application designers are encouraged to use the per task util clamp interface
> +Application developers are encouraged to use the per task util clamp interface
> to ensure they are performance and power aware. Ideally this knob should be set
> to 0 by system designers and leave the task of managing performance
> -requirements to the apps themselves.
> +requirements to the apps.
>
> -4. HOW TO USE UTIL CLAMP:
> -==========================
> +4. How to use util clamp
> +========================
>
> Util clamp promotes the concept of user space assisted power and performance
> -management. At the scheduler level the info required to make the best decision
> -are non existent. But with util clamp user space can hint to the scheduler to
> +management. At the scheduler level the info needed to make the best decision
> +is missing. However, with util clamp user space can hint to the scheduler to
> make better decision about task placement and frequency selection.
>
> Best results are achieved by not making any assumptions about the system the
> @@ -530,41 +517,41 @@ dynamically monitor and adjust. Ultimately this will allow for a better user
> experience at a better perf/watt.
>
> For some systems and use cases, static setup will help to achieve good results.
> -Portability will be a problem in this case. After all how much work one can do
> -at 100, 200 or 1024 is unknown and a special property of every system. Unless
> -there's a specific target system, static setup should be avoided.
> +Portability will be a problem in this case. How much work one can do at 100,
> +200 or 1024 is different for each system. Unless there's a specific target
> +system, static setup should be avoided.
>
> -All in all there are enough possibilities to create a whole framework based on
> +There are enough possibilities to create a whole framework based on
> util clamp or self contained app that makes use of it directly.
>
> -4.1 BOOST IMPORTANT AND DVFS-LATENCY-SENSITIVE TASKS:
> -------------------------------------------------------
> +4.1. Boost important and DVFS-latency-sensitive tasks
> +-----------------------------------------------------
>
> A GUI task might not be busy to warrant driving the frequency high when it
> -wakes up. But it requires to finish its work within a specific period of time
> +wakes up. However, it requires to finish its work within a specific time window
> to deliver the desired user experience. The right frequency it requires at
> wakeup will be system dependent. On some underpowered systems it will be high,
> -on other overpowered ones, it will be low or 0.
> +on other overpowered ones it will be low or 0.
>
> -This task can increase its UCLAMP_MIN value every time it misses a deadline to
> -ensure on next wake up it runs at a higher performance point. It should try to
> -approach the lowest UCLAMP_MIN value that allows to meet its deadline on any
> +This task can increase its UCLAMP_MIN value every time it misses the deadline
> +to ensure on next wake up it runs at a higher performance point. It should try
> +to approach the lowest UCLAMP_MIN value that allows to meet its deadline on any
> particular system to achieve the best possible perf/watt for that system.
>
> On heterogeneous systems, it might be important for this task to run on
> -a bigger CPU.
> +a faster CPU.
>
> Generally it is advised to perceive the input as performance level or point
> which will imply both task placement and frequency selection.
>
> -4.2 CAP BACKGROUND TASKS:
> ---------------------------
> +4.2. Cap background tasks
> +-------------------------
>
> Like explained for Android case in the introduction. Any app can lower
> UCLAMP_MAX for some background tasks that don't care about performance but
> could end up being busy and consume unnecessary system resources on the system.
>
> -4.3 POWERSAVE MODE:
> +4.3. Powersave mode
> --------------------
>
> sched_util_clamp_max system wide interface can be used to limit all tasks from
> @@ -575,8 +562,8 @@ This is not unique to uclamp as one can achieve the same by reducing max
> frequency of the cpufreq governor. It can be considered a more convenient
> alternative interface.
>
> -4.4 PER APP PERFORMANCE RESTRICTIONS:
> ---------------------------------------
> +4.4. Per-app performance restriction
> +-------------------------------------
>
> Middleware/Utility can provide the user an option to set UCLAMP_MIN/MAX for an
> app every time it is executed to guarantee a minimum performance point and/or
> @@ -585,28 +572,31 @@ these apps.
>
> If you want to prevent your laptop from heating up while on the go from
> compiling the kernel and happy to sacrifice performance to save power, but
> -still would like to keep your browser performance intact; uclamp enables that.
> +still would like to keep your browser performance intact, uclamp makes it
> +possible.
>
> -5. LIMITATIONS:
> -================
> +5. Limitations
> +==============
>
> -5.1 CAPPING FREQUENCY WITH UCLAMP_MAX FAILS UNDER CERTAIN CONDITIONS:
> -----------------------------------------------------------------------
> +.. _uclamp-capping-fail:
>
> -If task p0 is capped to run at 512
> +5.1. Capping frequency with uclamp_max fails under certain conditions
> +---------------------------------------------------------------------
> +
> +If task p0 is capped to run at 512:
>
> .. code-block:: c
>
> p0->uclamp[UCLAMP_MAX] = 512
>
> -is sharing the rq with p1 which is free to run at any performance point
> +and it shares the rq with p1 which is free to run at any performance point:
>
> .. code-block:: c
>
> p1->uclamp[UCLAMP_MAX] = 1024
>
> then due to max aggregation the rq will be allowed to reach max performance
> -point
> +point:
>
> .. code-block:: c
>
> @@ -620,19 +610,19 @@ both are running at the same rq, p1 will cause the frequency capping to be left
> from the rq although p1, which is allowed to run at any performance point,
> doesn't actually need to run at that frequency.
>
> -5.2 UCLAMP_MAX CAN BREAK PELT (UTIL_AVG) SIGNAL
> +5.2. UCLAMP_MAX can break pelt (util_avg) signal
> ------------------------------------------------
>
> PELT assumes that frequency will always increase as the signals grow to ensure
> -there's always some idle time on the CPU. But with UCLAMP_MAX, we will prevent
> -this frequency increase which can lead to no idle time in some circumstances.
> -When there's no idle time, then a task will look like a busy loop, which would
> -result in util_avg being 1024.
> +there's always some idle time on the CPU. But with UCLAMP_MAX, this frequency
> +increase will be prevented which can lead to no idle time in some
> +circumstances. When there's no idle time, a task will look like a busy loop,
> +which would result in util_avg being 1024.
>
> -Combing with issue described in 5.2, this an lead to unwanted frequency spikes
> +Combined with the issue below, this can lead to unwanted frequency spikes
> when severely capped tasks share the rq with a small non capped task.
>
> -As an example if task p
> +As an example if task p, which has:
>
> .. code-block:: c
>
> @@ -646,35 +636,35 @@ of.
>
> rq->uclamp[UCLAMP_MAX] = 0
>
> -If the ratio of Fmax/Fmin is 3, then
> +If the ratio of Fmax/Fmin is 3, then maximum value will be:
>
> .. code-block:: c
>
> 300 * (Fmax/Fmin) = 900
>
> -Which indicates the CPU will still see idle time since 900 is < 1024. The
> -_actual_ util_avg will NOT be 900 though. It will be higher than 300, but won't
> -approach 900. As long as there's idle time, p->util_avg updates will be off by
> -a some margin, but not proportional to Fmax/Fmin.
> +which indicates the CPU will still see idle time since 900 is < 1024. The
> +_actual_ util_avg will not be 900 though, but somewhere between 300 and 900. As
> +long as there's idle time, p->util_avg updates will be off by some margin,
> +but not proportional to Fmax/Fmin.
>
> .. code-block:: c
>
> p0->util_avg = 300 + small_error
>
> -Now if the ratio of Fmax/Fmin is 4, then
> +Now if the ratio of Fmax/Fmin is 4, the maximum value becomes:
>
> .. code-block:: c
>
> 300 * (Fmax/Fmin) = 1200
>
> which is higher than 1024 and indicates that the CPU has no idle time. When
> -this happens, then the _actual_ util_avg will become 1024.
> +this happens, then the _actual_ util_avg will become:
>
> .. code-block:: c
>
> p0->util_avg = 1024
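
The two Fmax/Fmin cases can be checked with a small helper. This is only
a sketch of the arithmetic in this example: has_idle_time() is a hypothetical
name, and it assumes the task is pinned to Fmin so that its demand scales by
the whole Fmax/Fmin ratio.

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE	1024

/*
 * A task with utilization @util at Fmax needs util * (Fmax/Fmin) when the
 * CPU is held at Fmin. The CPU keeps some idle time only while that scaled
 * demand stays below the full capacity of 1024.
 */
static bool has_idle_time(unsigned int util, unsigned int fmax_over_fmin)
{
	assert(util <= SCHED_CAPACITY_SCALE);
	assert(fmax_over_fmin >= 1);

	return util * fmax_over_fmin < SCHED_CAPACITY_SCALE;
}
```

For util = 300 a ratio of 3 gives 900 and idle time remains, while a ratio of
4 gives 1200 and the idle time is gone, which is when util_avg saturates at
1024.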
>
> -If task p1 wakes up on this CPU
> +If task p1 wakes up on this CPU, which has:
>
> .. code-block:: c
>
> @@ -683,7 +673,7 @@ If task p1 wakes up on this CPU
>
> then the effective UCLAMP_MAX for the CPU will be 1024 according to max
> aggregation rule. But since the capped p0 task was running and throttled
> -severely, then the rq->util_avg will be 1024.
> +severely, then the rq->util_avg will be:
>
> .. code-block:: c
>
> @@ -693,7 +683,7 @@ severely, then the rq->util_avg will be 1024.
> rq->util_avg = 1024
> rq->uclamp[UCLAMP_MAX] = 1024
>
> -Hence lead to a frequency spike since if p0 wasn't throttled we should get
> +This leads to a frequency spike, since if p0 wasn't throttled we should get:
>
> .. code-block:: c
>
>
> Thanks.
>
> [1]: https://lore.kernel.org/lkml/20221113152629.3wbyeejsj5v33rvu@airbuntu/
>
> --
> An old man doll... just what I always wanted! - Clara
On 11/14/22 15:55, Bagas Sanjaya wrote:
> On Sun, Nov 13, 2022 at 03:26:29PM +0000, Qais Yousef wrote:
> > Thanks! I have the below fixup patch that addresses these. It made me realize
> > my html output could look better. It's cosmetic; so won't post a new version
> > till some feedback is provided first.
> >
> >
> > Cheers
> >
> > --
> > Qais Yousef
> >
> >
> > --->8---
> >
> > diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> > index b430d856056a..f12d0d06de3a 100644
> > --- a/Documentation/scheduler/index.rst
> > +++ b/Documentation/scheduler/index.rst
> > @@ -15,6 +15,7 @@ Linux Scheduler
> > sched-capacity
> > sched-energy
> > schedutil
> > + sched-util-clamp
> > sched-nice-design
> > sched-rt-group
> > sched-stats
> > diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
> > index e75b69767afb..728ffa364fc7 100644
> > --- a/Documentation/scheduler/sched-util-clamp.rst
> > +++ b/Documentation/scheduler/sched-util-clamp.rst
> > @@ -169,24 +169,27 @@ could change with implementation details.
> > 2.1 BUCKETS:
> > -------------
> >
> > +.. code-block:: c
> > +
> > [struct rq]
> >
> > -(bottom) (top)
> > + (bottom) (top)
> >
> > - 0 1024
> > - | |
> > - +-----------+-----------+-----------+---- ----+-----------+
> > - | Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N |
> > - +-----------+-----------+-----------+---- ----+-----------+
> > - : : :
> > - +- p0 +- p3 +- p4
> > - : :
> > - +- p1 +- p5
> > - :
> > - +- p2
> > + 0 1024
> > + | |
> > + +-----------+-----------+-----------+---- ----+-----------+
> > + | Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N |
> > + +-----------+-----------+-----------+---- ----+-----------+
> > + : : :
> > + +- p0 +- p3 +- p4
> > + : :
> > + +- p1 +- p5
> > + :
> > + +- p2
>
> The code block above is a diagram, isn't it? Thus specifying language for
> syntax highlighting (in this case ``c``) isn't appropriate.
I could do with a helping hand here actually. I am a text only person but
trying to follow the new rst docs; but I don't have a clue to be honest.
I did try to find the right directive, but I couldn't find it. What should be
specified for this diagram?
>
> >
> >
> > -DISCLAMER:
> > +.. note::
> > + DISCLAMER:
> > The diagram above is an illustration rather than a true depiction of the
> > internal data structure.
>
> The DISCLAIMER line above isn't needed, since note block should do the
> job.
Okay.
>
> >
> > @@ -200,6 +203,8 @@ The rq has a bucket for each uclamp_id: [UCLAMP_MIN, UCLAMP_MAX].
> > The range of each bucket is 1024/N. For example for the default value of 5 we
> > will have 5 buckets, each of which will cover the following range:
> >
> > +.. code-block:: c
> > +
>
> Again, why ``c`` syntax highlighting?
This is a C code snippet. What would be better to use? I think I was getting
errors if I don't specify something. But again; I was touching my way around in
the dark here trying to figure it out.
> Otherwise no new warnings. Thanks for fixing this up.
>
> However, in the future, for documentation patches you should always Cc:
> linux-doc list. Adding it to Cc list now.
Indeed. Maybe I went into auto-mode and didn't use get_maintainer properly.
Apologies.
Thanks!
--
Qais Yousef
On 11/14/22 09:47, Bagas Sanjaya wrote:
> On Sat, Nov 05, 2022 at 11:23:43PM +0000, Qais Yousef wrote:
> > From: Qais Yousef <qais.yousef@arm.com>
> >
> > The new util clamp feature needs a document explaining what it is and
> > how to use it. The new document hopefully covers everything one needs to
> > know about uclamp.
> >
> > Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
>
> Hmm, why didn't you send this patch from your arm address instead?
> On the other hand, thanks for including SoB from your sending address,
> which is different.
I changed jobs now; but started the patch when I was with Arm, hence the
2 SoBs.
>
> I will be commenting for the content on your fixup message.
Thanks for having a look!
Cheers
--
Qais Yousef
On Tue, Nov 15, 2022 at 08:55:47PM +0000, Qais Yousef wrote:
> > > 2.1 BUCKETS:
> > > -------------
> > >
> > > +.. code-block:: c
> > > +
> > > [struct rq]
> > >
> > > -(bottom) (top)
> > > + (bottom) (top)
> > >
> > > - 0 1024
> > > - | |
> > > - +-----------+-----------+-----------+---- ----+-----------+
> > > - | Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N |
> > > - +-----------+-----------+-----------+---- ----+-----------+
> > > - : : :
> > > - +- p0 +- p3 +- p4
> > > - : :
> > > - +- p1 +- p5
> > > - :
> > > - +- p2
> > > + 0 1024
> > > + | |
> > > + +-----------+-----------+-----------+---- ----+-----------+
> > > + | Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N |
> > > + +-----------+-----------+-----------+---- ----+-----------+
> > > + : : :
> > > + +- p0 +- p3 +- p4
> > > + : :
> > > + +- p1 +- p5
> > > + :
> > > + +- p2
> >
> > The code block above is a diagram, isn't it? Thus specifying language for
> > syntax highlighting (in this case ``c``) isn't appropriate.
>
> I could do with a helping hand here actually. I am a text only person but
> trying to follow the new rst docs; but I don't have a clue to be honest.
>
> I did try to find the right directive, but I couldn't find it. What should be
> specified for this diagram?
Just leave ..code-block:: directive alone or use simpler double colon
(::). The highlighting will not be applied to the code snippet.
> > > @@ -200,6 +203,8 @@ The rq has a bucket for each uclamp_id: [UCLAMP_MIN, UCLAMP_MAX].
> > > The range of each bucket is 1024/N. For example for the default value of 5 we
> > > will have 5 buckets, each of which will cover the following range:
> > >
> > > +.. code-block:: c
> > > +
> >
> > Again, why ``c`` syntax highlighting?
>
> This is a C code snippet. What would be better to use? I think I was getting
> errors if I don't specify something. But again; I was touching my way around in
> the dark here trying to figure it out.
>
Yup, that's the correct language for highlighting.
Thanks.
On 11/16/22 15:36, Bagas Sanjaya wrote:
> On Tue, Nov 15, 2022 at 08:55:47PM +0000, Qais Yousef wrote:
> > > > 2.1 BUCKETS:
> > > > -------------
> > > >
> > > > +.. code-block:: c
> > > > +
> > > > [struct rq]
> > > >
> > > > -(bottom) (top)
> > > > + (bottom) (top)
> > > >
> > > > - 0 1024
> > > > - | |
> > > > - +-----------+-----------+-----------+---- ----+-----------+
> > > > - | Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N |
> > > > - +-----------+-----------+-----------+---- ----+-----------+
> > > > - : : :
> > > > - +- p0 +- p3 +- p4
> > > > - : :
> > > > - +- p1 +- p5
> > > > - :
> > > > - +- p2
> > > > + 0 1024
> > > > + | |
> > > > + +-----------+-----------+-----------+---- ----+-----------+
> > > > + | Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N |
> > > > + +-----------+-----------+-----------+---- ----+-----------+
> > > > + : : :
> > > > + +- p0 +- p3 +- p4
> > > > + : :
> > > > + +- p1 +- p5
> > > > + :
> > > > + +- p2
> > >
> > > The code block above is a diagram, isn't it? Thus specifying language for
> > > syntax highlighting (in this case ``c``) isn't appropriate.
> >
> > I could do with a helping hand here actually. I am a text only person but
> > trying to follow the new rst docs; but I don't have a clue to be honest.
> >
> > I did try to find the right directive, but I couldn't find it. What should be
> > specified for this diagram?
>
> Just leave ..code-block:: directive alone or use simpler double colon
> (::). The highlighting will not be applied to the code snippet.
Leaving
..code-block::
produces this error:
sched-util-clamp.rst:172: WARNING: Error in "code-block" directive: 1 argument(s) required, 0 supplied
I used
::
and it seems to produce the desired results. I tried this first but I think my
indentation was broken then which is why it didn't work at the time and I moved
to code-block.
>
> > > > @@ -200,6 +203,8 @@ The rq has a bucket for each uclamp_id: [UCLAMP_MIN, UCLAMP_MAX].
> > > > The range of each bucket is 1024/N. For example for the default value of 5 we
> > > > will have 5 buckets, each of which will cover the following range:
> > > >
> > > > +.. code-block:: c
> > > > +
> > >
> > > Again, why ``c`` syntax highlighting?
> >
> > This is a C code snippet. What would be better to use? I think I was getting
> > errors if I don't specify something. But again; I was touching my way around in
> > the dark here trying to figure it out.
> >
>
> Yup, that's the correct language for highlighting.
Thanks Bagas!
--
Qais Yousef
new file mode 100644
@@ -0,0 +1,678 @@
+====================
+Utilization Clamping
+====================
+
+1. INTRODUCTION
+================
+
+Utilization clamping is a scheduler feature that allows user space to help in
+managing the performance requirement of tasks. It was introduced in v5.3
+release. The CGroup support was merged in v5.4.
+
+It is often referred to as util clamp and uclamp. You'll find all variations
+used interchangeably in this documentation and in the source code.
+
+Uclamp is a hinting mechanism that allows the scheduler to understand the
+performance requirements and restrictions of the tasks, which helps it make
+a better placement decision. When the schedutil cpufreq governor is used, util
+clamp will influence the frequency selection as well.
+
+Since scheduler and schedutil are both driven by PELT (util_avg) signals, util
+clamp acts on that to achieve its goal by clamping the signal to a certain
+point; hence the name. I.e: by clamping utilization we are making the system
+run at a certain performance point.
+
+The right way to view util clamp is as a mechanism for requesting (hinting)
+performance constraints. It consists of two components:
+
+ * UCLAMP_MIN, which sets a lower bound.
+ * UCLAMP_MAX, which sets an upper bound.
+
+These two bounds will ensure a task will operate within this performance range
+of the system. UCLAMP_MIN implies boosting a task, while UCLAMP_MAX implies
+capping a task.
+
+One can tell the system (scheduler) that some tasks require a minimum
+performance point to operate at to deliver the desired user experience. Or one
+can tell the system that some tasks should be restricted from consuming too
+many resources and should NOT go above a specific performance point. Viewing
+the uclamp values as performance points rather than utilization is a better
+abstraction from user space point of view.
+
+As an example, a game can use util clamp to form a feedback loop with its
+perceived FPS. It can dynamically increase the minimum performance point
+required by its display pipeline to ensure no frame is dropped. It can also
+dynamically 'prime' up these tasks if it knows in the coming few 100ms
+a computationally intensive scene is about to happen.
+
+On mobile hardware where the capability of the devices varies a lot, this
+dynamic feedback loop offers a great flexibility in ensuring best user
+experience given the capabilities of any system.
+
+Of course a static configuration is possible too. The exact usage will depend
+on the system, application and the desired outcome.
+
+Another example is in Android where tasks are classified as background,
+foreground, top-app, etc. Util clamp can be used to constrain how many
+resources background tasks are consuming by capping the performance point they
+can run at. This constraint helps reserve resources for important tasks, like
+the ones belonging to the currently active app (top-app group). Besides, this
+helps in limiting how much power they consume. This can be more obvious in
+heterogeneous systems; the constraint will help bias the background tasks to
+stay on the little cores which will ensure that:
+
+ 1. The big cores are free to run top-app tasks immediately. top-app
+ tasks are the tasks the user is currently interacting with, hence
+ the most important tasks in the system.
+ 2. They don't run on a power hungry core and drain battery even if they
+ are CPU intensive tasks.
+
+By making these uclamp performance requests, or rather hints, user space can
+ensure system resources are used optimally to deliver the best user experience
+the system is capable of.
+
+Another use case is to help with overcoming the ramp up latency inherent in how
+scheduler utilization signal is calculated.
+
+A busy task, for instance, that needs to run at the maximum performance point
+will suffer a delay of ~200ms (PELT HALFLIFE = 32ms) for the scheduler to realize
+that. This is known to affect workloads like gaming on mobile devices where
+frames will drop due to slow response time to select the higher frequency
+required for the tasks to finish their work in time.
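
The ~200ms figure can be reproduced with a back-of-the-envelope model. The
sketch below is a deliberate simplification and not the PELT implementation:
it assumes the task starts from a utilization of 0, runs without interruption,
and that the remaining gap to 1024 halves once per completed half-life.
util_after_ms() is a hypothetical helper name.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024
#define PELT_HALFLIFE_MS	32

/*
 * Approximate utilization of an always-running task after @ms, counting
 * only whole half-lives: the gap to 1024 halves every 32ms.
 */
static unsigned int util_after_ms(unsigned int ms)
{
	unsigned int halflives = ms / PELT_HALFLIFE_MS;

	assert(halflives < 31);

	return SCHED_CAPACITY_SCALE - (SCHED_CAPACITY_SCALE >> halflives);
}
```

In this model the signal is only at 512 after 32ms and needs six half-lives
(192ms) to reach 1008, roughly 98% of the maximum. Setting UCLAMP_MIN lets
a task skip this ramp-up.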
+
+The overall visible effect goes beyond better perceived user
+experience/performance and stretches to help achieve a better overall
+performance/watt if used effectively.
+
+User space can form a feedback loop with thermal subsystem too to ensure the
+device doesn't heat up to the point where it will throttle.
+
+Both SCHED_NORMAL/OTHER and SCHED_FIFO/RR honour uclamp requests/hints.
+
+In SCHED_FIFO/RR case, uclamp gives the option to run RT tasks at any
+performance point rather than being tied to MAX frequency all the time, which
+can be useful on general purpose systems that run on battery powered devices.
+
+Note that by design RT tasks don't have per-task PELT signal and must always
+run at a constant frequency to combat non-deterministic DVFS ramp-up delays.
+
+Note that using schedutil always implies a single delay to modify the frequency
+when an RT task wakes up. This cost is unchanged by using uclamp. Uclamp only
+helps picking what frequency to request instead of schedutil always requesting
+MAX for all RT tasks.
+
+See section 3.4 for default values and 3.4.1 on how to change RT tasks default
+value.
+
+2. DESIGN:
+===========
+
+Util clamp is a property of every task in the system. It sets the boundaries of
+its utilization signal; acting as a bias mechanism that influences certain
+decisions within the scheduler.
+
+The actual utilization signal of a task is never clamped in reality. If you
+inspect PELT signals at any point in time you should continue to see them
+intact. Clamping happens only when needed, e.g. when a task wakes up and the
+scheduler needs to select a suitable CPU for it to run on.
+
+Since the goal of util clamp is to allow requesting a minimum and maximum
+performance point for a task to run on, it must be able to influence the
+frequency selection as well as task placement to be most effective. Both have
+implications on the utilization value at rq level, which brings us to the main
+design challenge.
+
+When a task wakes up on an rq, the utilization signal of the rq will be
+impacted by the uclamp settings of all the tasks enqueued on it. For example if
+a task requests to run at UTIL_MIN = 512, then the util signal of the rq needs
+to respect this request as well as all other requests from all of the enqueued
+tasks.
+
+To be able to aggregate the util clamp value of all the tasks attached to the
+rq, uclamp must do some housekeeping at every enqueue/dequeue, which is the
+scheduler hot path. Hence care must be taken since any slow down will have
+significant impact on a lot of use cases and could hinder its usability in
+practice.
+
+The way this is handled is by dividing the utilization range into buckets
+(struct uclamp_bucket) which allows us to reduce the search space from every
+task on the rq to only a subset of tasks on the top-most bucket.
+
+When a task is enqueued, we increment a counter in the matching bucket. And on
+dequeue we decrement it. This makes keeping track of the effective uclamp value
+at rq level a lot easier.
+
+As we enqueue and dequeue tasks we keep track of the current effective uclamp
+value of the rq. See section 2.1 for details on how this works.
+
+Later, any path that wants to identify the effective uclamp value of the rq
+simply needs to read this value at the exact moment it has to take
+a decision.
+
+For the task placement case, only Energy Aware and Capacity Aware Scheduling
+(EAS/CAS) make use of uclamp for now. This implies heterogeneous systems only.
+When a task wakes up, the scheduler will look at the current effective uclamp
+value of every rq and compare it with the potential new value if the task were
+to be enqueued there, favoring the rq that will end up with the most energy
+efficient combination.
+
+Similarly in schedutil, when it needs to make a frequency update it will look
+at the current effective uclamp value of the rq which is influenced by the set
+of tasks currently enqueued there and select the appropriate frequency that
+will honour uclamp requests.
+
+Other paths like setting overutilization state (which effectively disables EAS)
+make use of uclamp as well. Such cases are considered necessary housekeeping to
+allow the 2 main use cases above and will not be covered in detail here as they
+could change with implementation details.
+
+2.1 BUCKETS:
+-------------
+
+                           [struct rq]
+
+  (bottom)                                                    (top)
+
+    0                                                          1024
+    |                                                           |
+    +-----------+-----------+-----------+----   ----+-----------+
+    |  Bucket 0 |  Bucket 1 |  Bucket 2 |    ...    |  Bucket N |
+    +-----------+-----------+-----------+----   ----+-----------+
+       :           :                                   :
+       +- p0       +- p3                               +- p4
+       :                                               :
+       +- p1                                           +- p5
+       :
+       +- p2
+
+
+DISCLAIMER:
+ The diagram above is an illustration rather than a true depiction of the
+ internal data structure.
+
+To reduce the search space when trying to decide the effective uclamp value of
+an rq as tasks are enqueued/dequeued, the whole utilization range is divided
+into N buckets where N is configured at compile time by setting
+CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5.
+
+The rq has a bucket for each uclamp_id: [UCLAMP_MIN, UCLAMP_MAX].
+
+The range of each bucket is 1024/N. For example for the default value of 5 we
+will have 5 buckets, each of which will cover the following range:
+
+ DELTA = round_closest(1024/5) = round_closest(204.8) = 205
+
+ Bucket 0: [0:204]
+ Bucket 1: [205:409]
+ Bucket 2: [410:614]
+ Bucket 3: [615:819]
+ Bucket 4: [820:1024]
+
+When a task p
+
+ p->uclamp[UCLAMP_MIN] = 300
+ p->uclamp[UCLAMP_MAX] = 1024
+
+is enqueued into the rq, Bucket 1 will be incremented for UCLAMP_MIN and Bucket
+4 will be incremented for UCLAMP_MAX to reflect the fact the rq has a task in
+this range.
+
+The rq then keeps track of its current effective uclamp value for each
+uclamp_id.
+
+When a task p is enqueued, the rq value changes as follows:
+
+ // update bucket logic goes here
+ rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN])
+ // repeat for UCLAMP_MAX
+
+When a task p is dequeued the rq value changes as follows:
+
+ // update bucket logic goes here
+ rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value()
+ // repeat for UCLAMP_MAX
+
+When all buckets are empty, the rq uclamp values are reset to system defaults.
+See section 3.4 for default values.
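+
+The bookkeeping described in this section can be modelled in user space as
+follows. This is a minimal sketch for a single uclamp_id, not the kernel
+implementation: the structure and function names are hypothetical, the bucket
+count is hard coded to the default of 5, and the value used when all buckets
+are empty is passed in by the caller. It also assumes the first task enqueued
+on an empty rq replaces the idle value rather than being max aggregated with
+it.
+

```c
#include <assert.h>

#define CAPACITY_SCALE	1024u
#define NR_BUCKETS	5u	/* mirrors the default CONFIG_UCLAMP_BUCKETS_COUNT */
/* round_closest(1024 / 5) = 205 */
#define BUCKET_DELTA	((CAPACITY_SCALE + NR_BUCKETS / 2) / NR_BUCKETS)

struct bucket {
	unsigned int tasks;	/* tasks currently enqueued in this range */
	unsigned int value;	/* highest clamp value seen while non-empty */
};

struct rq_clamp {
	struct bucket bucket[NR_BUCKETS];
	unsigned int nr_tasks;		/* total tasks across all buckets */
	unsigned int value;		/* current effective clamp of the rq */
	unsigned int idle_value;	/* used when all buckets are empty */
};

static unsigned int bucket_id(unsigned int clamp)
{
	unsigned int id;

	assert(clamp <= CAPACITY_SCALE);
	id = clamp / BUCKET_DELTA;

	return id < NR_BUCKETS ? id : NR_BUCKETS - 1;
}

static void rq_clamp_init(struct rq_clamp *rq, unsigned int idle_value)
{
	unsigned int i;

	for (i = 0; i < NR_BUCKETS; i++) {
		rq->bucket[i].tasks = 0;
		rq->bucket[i].value = 0;
	}
	rq->nr_tasks = 0;
	rq->idle_value = idle_value;
	rq->value = idle_value;
}

static void rq_clamp_enqueue(struct rq_clamp *rq, unsigned int clamp)
{
	struct bucket *b = &rq->bucket[bucket_id(clamp)];

	if (b->tasks++ == 0 || clamp > b->value)
		b->value = clamp;

	/* The first task replaces the idle value; later ones max aggregate. */
	if (rq->nr_tasks++ == 0 || clamp > rq->value)
		rq->value = clamp;
}

static void rq_clamp_dequeue(struct rq_clamp *rq, unsigned int clamp)
{
	struct bucket *b = &rq->bucket[bucket_id(clamp)];
	unsigned int i;

	assert(b->tasks > 0);
	b->tasks--;
	rq->nr_tasks--;
	if (b->tasks)
		return;		/* bucket still populated, rq value unchanged */

	/* Search from the top-most bucket for the highest populated one. */
	for (i = NR_BUCKETS; i-- > 0; ) {
		if (rq->bucket[i].tasks) {
			rq->value = rq->bucket[i].value;
			return;
		}
	}
	rq->value = rq->idle_value;
}
```

+Note how the dequeue path only walks the buckets, never the tasks, which is
+what keeps the cost bounded in the enqueue/dequeue hot path. The price is that
+a bucket remembers the highest value it has seen until it becomes empty.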
+
+
+2.2 MAX AGGREGATION:
+---------------------
+
+Util clamp is tuned to honour the request for the task that requires the
+highest performance point.
+
+When multiple tasks are attached to the same rq, then util clamp must make sure
+the task that needs the highest performance point gets it even if there's
+another task that doesn't need it or is disallowed from reaching this point.
+
+For example, if there are multiple tasks attached to an rq with the following
+values:
+
+ p0->uclamp[UCLAMP_MIN] = 300
+ p0->uclamp[UCLAMP_MAX] = 900
+
+ p1->uclamp[UCLAMP_MIN] = 500
+ p1->uclamp[UCLAMP_MAX] = 500
+
+then assuming both p0 and p1 are enqueued to the same rq
+
+ rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500
+ rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900
+
+As we shall see in section 5.1, this max aggregation is the cause of one of the
+limitations when using util clamp, particularly for the UCLAMP_MAX hint when
+user space would like to save power.
+
+2.3 HIERARCHICAL AGGREGATION:
+------------------------------
+
+As stated earlier, util clamp is a property of every task in the system. But
+the actual applied (effective) value can be influenced by more than just the
+request made by the task or another actor on its behalf (middleware library).
+
+The effective util clamp value of any task is restricted as follows:
+
+ 1. By the uclamp settings defined by the cgroup CPU controller it is attached
+ to, if any.
+ 2. The restricted value in (1) is then further restricted by the system wide
+ uclamp settings.
+
+Section 3 discusses the interfaces and will expand further on that.
+
+For now suffice to say that if a task makes a request, its actual effective
+value will have to adhere to some restrictions imposed by cgroup and system
+wide settings.
+
+The system will still accept the request even if the effective value will look
+different; but as soon as the task moves to a different cgroup or a sysadmin
+modifies the system settings, the task will be able to get what it wants if the
+new settings allow it.
+
+In other words, this aggregation will not cause an error when a task changes
+its uclamp values. The task just might not be able to achieve them based on
+those factors.
+
+2.4 RANGE:
+-----------
+
+Uclamp performance requests follow the utilization range: [0:1024] inclusive.
+
+For the cgroup interface a percentage is used: [0:100] inclusive.
+You can use 'max' instead of 100 like in other cgroup interfaces.
+
+3. INTERFACES:
+===============
+
+3.1 PER TASK INTERFACE:
+------------------------
+
+The sched_setattr() syscall was extended to accept two new fields:
+
+* sched_util_min: requests the minimum performance point the system should run
+ at when this task is running, i.e. the lower performance bound.
+* sched_util_max: requests the maximum performance point the system should run
+ at when this task is running, i.e. the upper performance bound.
+
+For example:
+
+ attr->sched_util_min = 40% * 1024;
+ attr->sched_util_max = 80% * 1024;
+
+This tells the system that when task @p is running, it should try its best to
+ensure it starts at a performance point no less than 40% of the system's
+maximum capability.
+
+And if the task runs for a long enough time so that its actual utilization goes
+above 80%, then it should not cause the system to operate at a performance
+point higher than that.
+
+The special value -1 is used to reset the uclamp settings to the system
+default.
+
+Note that resetting the uclamp value to the system default using -1 is not the
+same as explicitly setting it to the value the system default currently has.
+
+ attr->sched_util_min = -1 // p0 is reset to system default e.g: 0
+
+ not the same as
+
+ attr->sched_util_min = 0 // p0 is set to 0, the fact it is the same
+ // as system default is irrelevant
+
+This distinction is important because as we shall see in system interfaces, the
+default value for RT could be changed. SCHED_NORMAL/OTHER might gain similar
+knobs too in the future.
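+
+A user space program could issue such a request roughly as sketched below. The
+structure is a local mirror of struct sched_attr under a different name to
+avoid clashing with libc headers, and the UCLAMP_FLAG_* macros and function
+names are made up for this example; the flag values are those of the
+SCHED_FLAG_* bits in include/uapi/linux/sched.h. The sketch assumes the caller
+wants to keep the current policy and priority and that percentages are
+converted with plain integer division.
+

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Values of the SCHED_FLAG_* bits in include/uapi/linux/sched.h. */
#define UCLAMP_FLAG_KEEP_ALL	0x18	/* KEEP_POLICY | KEEP_PARAMS */
#define UCLAMP_FLAG_UTIL_MIN	0x20	/* SCHED_FLAG_UTIL_CLAMP_MIN */
#define UCLAMP_FLAG_UTIL_MAX	0x40	/* SCHED_FLAG_UTIL_CLAMP_MAX */

/* Local mirror of struct sched_attr, including the two uclamp fields. */
struct uclamp_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* Fill @attr with a request of [min_pct%, max_pct%] of the 0..1024 range. */
static void uclamp_attr_init(struct uclamp_sched_attr *attr,
			     unsigned int min_pct, unsigned int max_pct)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	/* Keep the current policy and priority; only change the clamps. */
	attr->sched_flags = UCLAMP_FLAG_KEEP_ALL |
			    UCLAMP_FLAG_UTIL_MIN | UCLAMP_FLAG_UTIL_MAX;
	attr->sched_util_min = min_pct * 1024 / 100;
	attr->sched_util_max = max_pct * 1024 / 100;
}

/* Returns 0 on success, -1 with errno set otherwise. pid 0 is the caller. */
int uclamp_attr_apply(pid_t pid, const struct uclamp_sched_attr *attr)
{
#ifdef SYS_sched_setattr
	return (int)syscall(SYS_sched_setattr, pid, attr, 0);
#else
	(void)pid;
	(void)attr;
	errno = ENOSYS;
	return -1;
#endif
}
```

+Calling uclamp_attr_init(&attr, 40, 80) followed by uclamp_attr_apply(0, &attr)
+would make the 40%/80% request from the example above for the calling task. The
+call is expected to fail on kernels built without uclamp support, so the return
+value should be checked.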
+
+3.2 CGROUP INTERFACE:
+----------------------
+
+There are two uclamp related values in the CPU cgroup controller:
+
+* cpu.uclamp.min
+* cpu.uclamp.max
+
+When a task is attached to a CPU controller, its uclamp values will be impacted
+as follows:
+
+* cpu.uclamp.min is a protection as described in section 3-3 in
+ Documentation/admin-guide/cgroup-v2.rst.
+
+ If a task's uclamp_min value is lower than cpu.uclamp.min, then the task will
+ inherit the cgroup cpu.uclamp.min value.
+
+ In a cgroup hierarchy, effective cpu.uclamp.min is the max of (child,
+ parent).
+
+* cpu.uclamp.max is a limit as described in section 3-2 in
+ Documentation/admin-guide/cgroup-v2.rst.
+
+ If a task's uclamp_max value is higher than cpu.uclamp.max, then the task will
+ inherit the cgroup cpu.uclamp.max value.
+
+ In a cgroup hierarchy, effective cpu.uclamp.max is the min of (child,
+ parent).
+
+For example:
+
+ p0->uclamp[UCLAMP_MIN] = // system default;
+ p0->uclamp[UCLAMP_MAX] = // system default;
+
+ p1->uclamp[UCLAMP_MIN] = 40% * 1024;
+ p1->uclamp[UCLAMP_MAX] = 50% * 1024;
+
+ cgroup0->cpu.uclamp.min = 20% * 1024;
+ cgroup0->cpu.uclamp.max = 60% * 1024;
+
+ cgroup1->cpu.uclamp.min = 60% * 1024;
+ cgroup1->cpu.uclamp.max = 100% * 1024;
+
+when p0 and p1 are attached to cgroup0
+
+ p0->uclamp[UCLAMP_MIN] = cgroup0->cpu.uclamp.min = 20% * 1024;
+ p0->uclamp[UCLAMP_MAX] = cgroup0->cpu.uclamp.max = 60% * 1024;
+
+ p1->uclamp[UCLAMP_MIN] = 40% * 1024; // intact
+ p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact
+
+when p0 and p1 are attached to cgroup1
+
+ p0->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
+ p0->uclamp[UCLAMP_MAX] = cgroup1->cpu.uclamp.max = 100% * 1024;
+
+ p1->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
+ p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact
+
+Note that the cgroup interface allows the cpu.uclamp.max value to be lower than
+cpu.uclamp.min. Other interfaces don't allow that.
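+
+The protection and limit rules above can be written down as a small function.
+This is only a sketch of the rules as described in this section, not the
+kernel code; the names are hypothetical, percentages are converted with integer
+division, and p0 is assumed to carry the SCHED_NORMAL defaults of 0 and 1024.
+

```c
#define CAPACITY_SCALE	1024u

struct clamp_pair {
	unsigned int min;	/* UCLAMP_MIN in the [0:1024] range */
	unsigned int max;	/* UCLAMP_MAX in the [0:1024] range */
};

/* Convert a cgroup style percentage into the [0:1024] utilization range. */
static unsigned int pct_to_util(unsigned int pct)
{
	return pct * CAPACITY_SCALE / 100;
}

/* Apply the cgroup protection (min) and limit (max) to a task's request. */
static struct clamp_pair cgroup_restrict(struct clamp_pair task,
					 struct clamp_pair cgroup)
{
	struct clamp_pair eff = task;

	/* cpu.uclamp.min is a protection: raise requests that fall below it. */
	if (eff.min < cgroup.min)
		eff.min = cgroup.min;

	/* cpu.uclamp.max is a limit: lower requests that go above it. */
	if (eff.max > cgroup.max)
		eff.max = cgroup.max;

	return eff;
}
```

+Feeding the p0, p1, cgroup0 and cgroup1 values from the example through
+cgroup_restrict() gives the same effective values as listed above, including
+the case where p1 ends up in cgroup1 with a minimum above its maximum.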
+
+3.3 SYSTEM INTERFACE:
+----------------------
+
+3.3.1 sched_util_clamp_min:
+----------------------------
+
+System wide limit of allowed UCLAMP_MIN range. By default set to 1024, which
+means tasks are allowed to reach an effective UCLAMP_MIN value in the range of
+[0:1024].
+
+By changing it to 512 for example the effective allowed range reduces to
+[0:512].
+
+This is useful to restrict how much boosting tasks are allowed to acquire.
+
+Requests from tasks to go above this point will still succeed, but effectively
+they won't be achieved until this value is >= p->uclamp[UCLAMP_MIN].
+
+The value must be smaller than or equal to sched_util_clamp_max.
+
+3.3.2 sched_util_clamp_max:
+----------------------------
+
+System wide limit of allowed UCLAMP_MAX range. By default set to 1024, which
+means tasks are allowed to reach an effective UCLAMP_MAX value in the range of
+[0:1024].
+
+By changing it to 512 for example the effective allowed range reduces to
+[0:512]. The visible impact of this is that no task can run above 512, which in
+turn means that all rqs are restricted too. IOW, the whole system is capped
+to half its performance capacity.
+
+This is useful to restrict the overall maximum performance point of the system.
+
+This can be handy to limit performance when running low on battery.
+
+Requests from tasks to go above this point will still succeed, but effectively
+they won't be achieved until this value is >= p->uclamp[UCLAMP_MAX].
+
+The value must be greater than or equal to sched_util_clamp_min.
+
+3.4 DEFAULT VALUES:
+----------------------
+
+By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to:
+
+ p_fair->uclamp[UCLAMP_MIN] = 0
+ p_fair->uclamp[UCLAMP_MAX] = 1024
+
+That is, no boosting or restriction on any task. These default values can't be
+changed at boot or runtime. No argument has been made yet as to why such a knob
+should be provided, but it can be added in the future.
+
+For SCHED_FIFO/SCHED_RR tasks:
+
+ p_rt->uclamp[UCLAMP_MIN] = 1024
+ p_rt->uclamp[UCLAMP_MAX] = 1024
+
+That is, by default they're boosted to run at the maximum performance point of
+the system which retains the historical behavior of the RT tasks.
+
+The default uclamp_min value of RT tasks can be modified at boot or runtime via
+sysctl. See section 3.4.1.
+
+3.4.1 sched_util_clamp_min_rt_default:
+---------------------------------------
+
+Running RT tasks at maximum performance point is expensive on battery powered
+devices and not necessary. To allow system designers to offer good performance
+guarantees for RT tasks without pushing it all the way to maximum performance
+point, this sysctl knob allows tuning the best boost value to address the
+system requirement without burning power running at maximum performance point
+all the time.
+
+Application designers are encouraged to use the per task util clamp interface
+to ensure they are performance and power aware. Ideally this knob should be set
+to 0 by system designers, leaving the task of managing performance
+requirements to the apps themselves.
+
+4. HOW TO USE UTIL CLAMP:
+==========================
+
+Util clamp promotes the concept of user space assisted power and performance
+management. At the scheduler level the info required to make the best decision
+is non-existent. But with util clamp user space can hint to the scheduler to
+make better decisions about task placement and frequency selection.
+
+Best results are achieved by not making any assumptions about the system the
+application is running on and by using util clamp in conjunction with
+a feedback loop to dynamically monitor and adjust. Ultimately this will allow
+for a better user experience at a better perf/watt.
+
+For some systems and use cases, a static setup will help to achieve good
+results. Portability will be a problem in this case. After all, how much work
+one can do at 100, 200 or 1024 is unknown and a special property of every
+system. Unless there's a specific target system, a static setup should be
+avoided.
+
+All in all there are enough possibilities to create a whole framework based on
+util clamp or a self-contained app that makes use of it directly.
+
+4.1 BOOST IMPORTANT AND DVFS-LATENCY-SENSITIVE TASKS:
+------------------------------------------------------
+
+A GUI task might not be busy enough to warrant driving the frequency high when
+it wakes up. But it needs to finish its work within a specific period of time
+to deliver the desired user experience. The right frequency it requires at
+wakeup will be system dependent. On some underpowered systems it will be high,
+on other overpowered ones, it will be low or 0.
+
+This task can increase its UCLAMP_MIN value every time it misses a deadline to
+ensure that on the next wake up it runs at a higher performance point. It
+should try to approach the lowest UCLAMP_MIN value that allows it to meet its
+deadline on any particular system to achieve the best possible perf/watt for
+that system.
+
+On heterogeneous systems, it might be important for this task to run on
+a bigger CPU.
+
+Generally it is advised to perceive the input as a performance level or point,
+which implies both task placement and frequency selection.
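+
+One possible shape for such a feedback loop is sketched below. It is an
+illustration only: the structure, the function name and the step sizes are
+assumptions made for this example, and lowering the request slowly after each
+met deadline is just one way of approaching the lowest workable UCLAMP_MIN
+value. The returned value would then be passed to sched_setattr() as
+sched_util_min.
+

```c
#define UCLAMP_UTIL_MAX	1024u

struct boost_ctl {
	unsigned int util_min;	/* current UCLAMP_MIN request */
	unsigned int step_up;	/* added after a missed deadline */
	unsigned int step_down;	/* removed after a met deadline */
};

/* Update the request after one frame and return the new UCLAMP_MIN value. */
static unsigned int boost_ctl_update(struct boost_ctl *ctl, int missed_deadline)
{
	if (missed_deadline) {
		unsigned int room = UCLAMP_UTIL_MAX - ctl->util_min;

		/* Boost quickly, but never beyond the top of the range. */
		ctl->util_min += ctl->step_up < room ? ctl->step_up : room;
	} else {
		unsigned int down = ctl->step_down;

		/* Back off slowly to probe for a lower workable value. */
		ctl->util_min -= down < ctl->util_min ? down : ctl->util_min;
	}

	return ctl->util_min;
}
```

+Using a large step_up and a small step_down biases the loop towards meeting
+deadlines first and saving power second; the right values are system and
+workload dependent.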
+
+4.2 CAP BACKGROUND TASKS:
+--------------------------
+
+As explained for the Android case in the introduction, any app can lower
+UCLAMP_MAX for some background tasks that don't care about performance but
+could end up being busy and consume unnecessary system resources.
+
+4.3 POWERSAVE MODE:
+--------------------
+
+sched_util_clamp_max system wide interface can be used to limit all tasks from
+operating at the higher performance points which are usually energy
+inefficient.
+
+This is not unique to uclamp as one can achieve the same by reducing max
+frequency of the cpufreq governor. It can be considered a more convenient
+alternative interface.
+
+4.4 PER APP PERFORMANCE RESTRICTIONS:
+--------------------------------------
+
+Middleware/Utility can provide the user an option to set UCLAMP_MIN/MAX for an
+app every time it is executed to guarantee a minimum performance point and/or
+limit it from draining system power at the cost of reduced performance for
+these apps.
+
+If you want to prevent your laptop from heating up while compiling the kernel
+on the go, and are happy to sacrifice performance to save power, but would
+still like to keep your browser performance intact, uclamp makes that possible.
+
+5. LIMITATIONS:
+================
+
+5.1 CAPPING FREQUENCY WITH UCLAMP_MAX FAILS UNDER CERTAIN CONDITIONS:
+----------------------------------------------------------------------
+
+If task p0, which is capped to run at 512
+
+ p0->uclamp[UCLAMP_MAX] = 512
+
+is sharing the rq with p1 which is free to run at any performance point
+
+ p1->uclamp[UCLAMP_MAX] = 1024
+
+then due to max aggregation the rq will be allowed to reach max performance
+point
+
+ rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024
+
+Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection for
+the rq will depend on the actual utilization value of the tasks.
+
+If p1 is a small task but p0 is a CPU intensive task, then because both are
+running on the same rq, p1 will cause the frequency capping to be lifted from
+the rq, although p1, which is allowed to run at any performance point, doesn't
+actually need to run at that frequency.
+
+5.2 UCLAMP_MAX CAN BREAK PELT (UTIL_AVG) SIGNAL:
+------------------------------------------------
+
+PELT assumes that frequency will always increase as the signals grow to ensure
+there's always some idle time on the CPU. But with UCLAMP_MAX, we will prevent
+this frequency increase which can lead to no idle time in some circumstances.
+When there's no idle time, then a task will look like a busy loop, which would
+result in util_avg being 1024.
+
+Combined with the issue described in 5.1, this can lead to unwanted frequency
+spikes when severely capped tasks share the rq with a small non capped task.
+
+As an example if task p0
+
+ p0->util_avg = 300
+ p0->uclamp[UCLAMP_MAX] = 0
+
+wakes up on an idle CPU, then it will run at the minimum frequency this CPU is
+capable of.
+
+ rq->uclamp[UCLAMP_MAX] = 0
+
+If the ratio of Fmax/Fmin is 3, then
+
+ 300 * (Fmax/Fmin) = 900
+
+This indicates the CPU will still see idle time since 900 is < 1024. The
+_actual_ util_avg will NOT be 900 though. It will be higher than 300, but won't
+approach 900. As long as there's idle time, p->util_avg updates will be off by
+some margin, but not proportional to Fmax/Fmin.
+
+ p0->util_avg = 300 + small_error
+
+Now if the ratio of Fmax/Fmin is 4, then
+
+ 300 * (Fmax/Fmin) = 1200
+
+which is higher than 1024 and indicates that the CPU has no idle time. When
+this happens, then the _actual_ util_avg will become 1024.
+
+ p0->util_avg = 1024
+
+If task p1 wakes up on this CPU
+
+ p1->util_avg = 200
+ p1->uclamp[UCLAMP_MAX] = 1024
+
+then the effective UCLAMP_MAX for the CPU will be 1024 according to max
+aggregation rule. But since the capped p0 task was running and throttled
+severely, then the rq->util_avg will be 1024.
+
+ p0->util_avg = 1024
+ p1->util_avg = 200
+
+ rq->util_avg = 1024
+ rq->uclamp[UCLAMP_MAX] = 1024
+
+This leads to a frequency spike since, if p0 wasn't throttled, we should get
+
+ p0->util_avg = 300
+ p1->util_avg = 200
+
+ rq->util_avg = 500
+
+and run somewhere near mid performance point of that CPU, not the Fmax we get.
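+
+The arithmetic of this example can be captured in a few helpers. They are
+a deliberately coarse model with made-up names: the small error that appears
+while idle time remains is ignored, and Fmax/Fmin is taken as an integer
+ratio.
+

```c
#define UTIL_SCALE	1024u

/* Demand seen by a CPU held at Fmin for a task with util_avg @util at Fmax. */
static unsigned int demand_at_fmin(unsigned int util, unsigned int fmax_over_fmin)
{
	return util * fmax_over_fmin;
}

/* The CPU keeps some idle time only while the scaled demand stays below 1024. */
static int has_idle_time(unsigned int util, unsigned int fmax_over_fmin)
{
	return demand_at_fmin(util, fmax_over_fmin) < UTIL_SCALE;
}

/*
 * util_avg the capped task settles at: roughly unchanged while idle time
 * remains, saturated at 1024 once the task looks like a busy loop.
 */
static unsigned int observed_util_avg(unsigned int util, unsigned int fmax_over_fmin)
{
	return has_idle_time(util, fmax_over_fmin) ? util : UTIL_SCALE;
}
```

+With util = 300, a ratio of 3 keeps idle time on the CPU while a ratio of 4
+saturates the signal, matching the two cases walked through above.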
+
+5.3 SCHEDUTIL RESPONSE TIME ISSUES:
+------------------------------------
+
+schedutil has three limitations:
+
+ 1. Hardware takes non-zero time to respond to any frequency change
+ request. On some platforms this can be in the order of a few ms.
+ 2. Non fast-switch systems require a worker deadline thread to wake up
+ and perform the frequency change, which adds measurable overhead.
+ 3. schedutil rate_limit_us drops any requests during this rate_limit_us
+ window.
+
+If a relatively small task is doing a critical job and requires a certain
+performance point when it wakes up and starts running, then all these
+limitations will prevent it from getting what it wants in the time scale it
+expects.
+
+This limitation is not only impactful when using uclamp, but will be more
+prevalent as we no longer gradually ramp up or down. We could easily be
+jumping between frequencies depending on the order tasks wake up, and their
+respective uclamp values.
+
+We regard that as a limitation of the capabilities of the underlying system
+itself.
+
+There is room to improve the behavior of schedutil rate_limit_us, but not much
+to be done for 1 or 2. They are considered hard limitations of the system.