From patchwork Thu Jan  5 11:13:51 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: tip-bot2 for Thomas Gleixner <tip-bot2@linutronix.de>
X-Patchwork-Id: 39486
Return-Path: <linux-kernel-owner@vger.kernel.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a5d:4e01:0:0:0:0:0 with SMTP id p1csp249997wrt;
        Thu, 5 Jan 2023 03:24:07 -0800 (PST)
X-Google-Smtp-Source: 
 AMrXdXvVaAER1frx48JLlfV73u/e8Qi9WOI0rymGxrcOr997YsAj5OVQaVhx5qY58l2I2CSWerF9
X-Received: by 2002:a17:907:c007:b0:7ad:f165:70c2 with SMTP id
 ss7-20020a170907c00700b007adf16570c2mr60374874ejc.27.1672917847100;
        Thu, 05 Jan 2023 03:24:07 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1672917847; cv=none;
        d=google.com; s=arc-20160816;
        b=hR/98Hgn/XEFh6RTHTwzNpPiDZ/l6m7DoUw6euyMEbLfN3ETsi/OGqhrne3c14qn63
         kVvb4kD2Mltg5fiquBFPcvo3a6F/lnWeCoHbQCT2l1eFrM2z1xUVV4TxED2h0VcFDRj0
         3rQALAFhuPx/cExXzAEviu6imo1y3GmMmPcnzXK+bmZ7FMLN4kaRFvUPMYdhTJ+ZcXNP
         J/MlBlaXwV30TUZMi/wMdTx2gilXo+ch5itYh6fE9vUGhOVtmw3n0cLBzm7qtmXwlzC+
         AaRdwe76pKhH50+LMEzElpu3n8fbEss8DllgzPizSSfJRH4zJMDv7aGF75IOKsWi5QHs
         H0FA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=list-id:precedence:content-transfer-encoding:robot-unsubscribe
         :robot-id:message-id:mime-version:references:in-reply-to:cc:subject
         :to:reply-to:sender:from:dkim-signature:dkim-signature:date;
        bh=Dnq7uKyCBT/pabujwzazsx5CxWC1EKYxMChej63F37Q=;
        b=bRIghO8MGw6z8FzyRAJrHjQMTw13VvK9wJdu08pA0gVuxdwqSZs9IexjMfv4snrQPs
         dtOAnJIucMadsa36gpfV9AaXIeA2bDrev5dJtH/6KpVx0bc+aBDjUaaLy8g/P8AG+GLM
         EJSlOc+7mPuUibMcyFWgVmAjAdRIcbPnz63AP5yWmyXmEAJfyOQ3sKok5vQ7/7cbH6HY
         Gief+3H25KlSkIbLxhA+Dz7vqObmUb8PMqo9PXFPKFHHtpE3P+9aA/EsGYIte3G9ysHq
         5NpbHkbR48tqAgVCy5O5YOFdUV8vApaDgMIz1nLOLeZ0XTjkYUkXmcVvJ8JPdoq+asDI
         rMWQ==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@linutronix.de header.s=2020 header.b=M46X043u;
       dkim=neutral (no key) header.i=@linutronix.de header.b=IbN8QNJb;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de
Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20])
        by mx.google.com with ESMTP id
 i13-20020a1709064fcd00b0079dccbbf0c0si15875239ejw.146.2023.01.05.03.23.44;
        Thu, 05 Jan 2023 03:24:07 -0800 (PST)
Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@linutronix.de header.s=2020 header.b=M46X043u;
       dkim=neutral (no key) header.i=@linutronix.de header.b=IbN8QNJb;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S232475AbjAELOA (ORCPT <rfc822;tmhikaru@gmail.com> + 99 others);
        Thu, 5 Jan 2023 06:14:00 -0500
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51994 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S231496AbjAELN4 (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Thu, 5 Jan 2023 06:13:56 -0500
Received: from galois.linutronix.de (Galois.linutronix.de
 [IPv6:2a0a:51c0:0:12e:550::1])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9B37B4E40D;
        Thu,  5 Jan 2023 03:13:53 -0800 (PST)
Date: Thu, 05 Jan 2023 11:13:51 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de;
        s=2020; t=1672917232;
        h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
         message-id:message-id:to:to:cc:cc:mime-version:mime-version:
         content-type:content-type:
         content-transfer-encoding:content-transfer-encoding:
         in-reply-to:in-reply-to:references:references;
        bh=Dnq7uKyCBT/pabujwzazsx5CxWC1EKYxMChej63F37Q=;
        b=M46X043uIPWZVSUsJqavdFzNy94ojt35pz7VAvsvcVFAm4MRQ527EqbW5/ZuTUitq+ik1L
        atEMGbEQDTKzLIbUO89++Si881gtEhrbZnwMnyg1NBNvd3hLRy6KJ7Nm9RuFDsYwA+BLld
        p0KWV296qnXG48FRZhp8F+hinSKRbEjCDm7fZKxWm8kFrPoHk9DKJj3ioUOF53JC4nNJw3
        q1CuQ4XM01hFH6UIB4xnp06FK7PdjSu0nzBGOKsp2PbKZJSCyN0Xv5d6vgTlEmINu4r/7m
        /L7b5HCIvL7fBEfQ/7ek9uwlhv/FYoVRJ0eDZNiaNnlseQ1OYgz1knS6j8fHdw==
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de;
        s=2020e; t=1672917232;
        h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
         message-id:message-id:to:to:cc:cc:mime-version:mime-version:
         content-type:content-type:
         content-transfer-encoding:content-transfer-encoding:
         in-reply-to:in-reply-to:references:references;
        bh=Dnq7uKyCBT/pabujwzazsx5CxWC1EKYxMChej63F37Q=;
        b=IbN8QNJbDeV6wPg/+y8VOfNFuyeq8GAfkFM8nNvyjQAkyDec8cANmqeB0kRZnDifIEbwd5
        L5H3I8geWf7x6pAQ==
From: "tip-bot2 for Qais Yousef" <tip-bot2@linutronix.de>
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: sched/core] sched/documentation: Document the util clamp
 feature
Cc: Qais Yousef <qais.yousef@arm.com>,
        "Qais Yousef (Google)" <qyousef@layalina.io>,
        Ingo Molnar <mingo@kernel.org>,
        Lukasz Luba <lukasz.luba@arm.com>,
        Jonathan Corbet <corbet@lwn.net>, x86@kernel.org,
        linux-kernel@vger.kernel.org
In-Reply-To: <20221216235716.201923-1-qyousef@layalina.io>
References: <20221216235716.201923-1-qyousef@layalina.io>
MIME-Version: 1.0
Message-ID: <167291723170.4906.16687249195595490000.tip-bot2@tip-bot2>
Robot-ID: <tip-bot2@linutronix.de>
Robot-Unsubscribe: Contact <mailto:tglx@linutronix.de> to get blacklisted from
 these emails
X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIM_SIGNED,
        DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE,
        SPF_PASS autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
        lindbergh.monkeyblade.net
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?=
X-GMAIL-THRID: =?utf-8?q?1754181504810250316?=
X-GMAIL-MSGID: =?utf-8?q?1754181504810250316?=

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     acbee592f1a0913e908b141570034b3fb2991db9
Gitweb:        https://git.kernel.org/tip/acbee592f1a0913e908b141570034b3fb2991db9
Author:        Qais Yousef <qyousef@layalina.io>
AuthorDate:    Fri, 16 Dec 2022 23:57:16 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 05 Jan 2023 12:08:34 +01:00

sched/documentation: Document the util clamp feature

Add a document explaining the util clamp feature: what it is and
how to use it. The new document hopefully covers everything one needs to
know about uclamp.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lore.kernel.org/r/20221216235716.201923-1-qyousef@layalina.io
Cc: Jonathan Corbet <corbet@lwn.net>
---
 Documentation/admin-guide/cgroup-v2.rst      |   3 +-
 Documentation/scheduler/index.rst            |   1 +-
 Documentation/scheduler/sched-util-clamp.rst | 741 ++++++++++++++++++-
 3 files changed, 745 insertions(+)
 create mode 100644 Documentation/scheduler/sched-util-clamp.rst

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index c8ae7c8..1b3ed1c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -619,6 +619,8 @@ process migrations.
 and is an example of this type.
 
 
+.. _cgroupv2-limits-distributor:
+
 Limits
 ------
 
@@ -635,6 +637,7 @@ process migrations.
 "io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
 on an IO device and is an example of this type.
 
+.. _cgroupv2-protections-distributor:
 
 Protections
 -----------
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index b430d85..f12d0d0 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -15,6 +15,7 @@ Linux Scheduler
     sched-capacity
     sched-energy
     schedutil
+    sched-util-clamp
     sched-nice-design
     sched-rt-group
     sched-stats
diff --git a/Documentation/scheduler/sched-util-clamp.rst b/Documentation/scheduler/sched-util-clamp.rst
new file mode 100644
index 0000000..74d5b7c
--- /dev/null
+++ b/Documentation/scheduler/sched-util-clamp.rst
@@ -0,0 +1,741 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Utilization Clamping
+====================
+
+1. Introduction
+===============
+
+Utilization clamping, also known as util clamp or uclamp, is a scheduler
+feature that allows user space to help in managing the performance requirement
+of tasks. It was introduced in v5.3 release. The CGroup support was merged in
+v5.4.
+
+Uclamp is a hinting mechanism that allows the scheduler to understand the
+performance requirements and restrictions of the tasks, thus it helps the
+scheduler to make a better decision. And when schedutil cpufreq governor is
+used, util clamp will influence the CPU frequency selection as well.
+
+Since the scheduler and schedutil are both driven by PELT (util_avg) signals,
+util clamp acts on that to achieve its goal by clamping the signal to a certain
+point; hence the name. That is, by clamping utilization we are making the
+system run at a certain performance point.
+
+The right way to view util clamp is as a mechanism to make request or hint on
+performance constraints. It consists of two tunables:
+
+        * UCLAMP_MIN, which sets the lower bound.
+        * UCLAMP_MAX, which sets the upper bound.
+
+These two bounds will ensure a task will operate within this performance range
+of the system. UCLAMP_MIN implies boosting a task, while UCLAMP_MAX implies
+capping a task.
+
+One can tell the system (scheduler) that some tasks require a minimum
+performance point to operate at to deliver the desired user experience. Or one
+can tell the system that some tasks should be restricted from consuming too
+much resources and should not go above a specific performance point. Viewing
+the uclamp values as performance points rather than utilization is a better
+abstraction from user space point of view.
+
+As an example, a game can use util clamp to form a feedback loop with its
+perceived Frames Per Second (FPS). It can dynamically increase the minimum
+performance point required by its display pipeline to ensure no frame is
+dropped. It can also dynamically 'prime' up these tasks if it knows in the
+coming few hundred milliseconds a computationally intensive scene is about to
+happen.
+
+On mobile hardware where the capability of the devices varies a lot, this
+dynamic feedback loop offers a great flexibility to ensure best user experience
+given the capabilities of any system.
+
+Of course a static configuration is possible too. The exact usage will depend
+on the system, application and the desired outcome.
+
+Another example is in Android where tasks are classified as background,
+foreground, top-app, etc. Util clamp can be used to constrain how much
+resources background tasks are consuming by capping the performance point they
+can run at. This constraint helps reserve resources for important tasks, like
+the ones belonging to the currently active app (top-app group). Beside this
+helps in limiting how much power they consume. This can be more obvious in
+heterogeneous systems (e.g. Arm big.LITTLE); the constraint will help bias the
+background tasks to stay on the little cores which will ensure that:
+
+        1. The big cores are free to run top-app tasks immediately. top-app
+           tasks are the tasks the user is currently interacting with, hence
+           the most important tasks in the system.
+        2. They don't run on a power hungry core and drain battery even if they
+           are CPU intensive tasks.
+
+.. note::
+  **little cores**:
+    CPUs with capacity < 1024
+
+  **big cores**:
+    CPUs with capacity = 1024
+
+By making these uclamp performance requests, or rather hints, user space can
+ensure system resources are used optimally to deliver the best possible user
+experience.
+
+Another use case is to help with **overcoming the ramp up latency inherit in
+how scheduler utilization signal is calculated**.
+
+On the other hand, a busy task for instance that requires to run at maximum
+performance point will suffer a delay of ~200ms (PELT HALFIFE = 32ms) for the
+scheduler to realize that. This is known to affect workloads like gaming on
+mobile devices where frames will drop due to slow response time to select the
+higher frequency required for the tasks to finish their work in time. Setting
+UCLAMP_MIN=1024 will ensure such tasks will always see the highest performance
+level when they start running.
+
+The overall visible effect goes beyond better perceived user
+experience/performance and stretches to help achieve a better overall
+performance/watt if used effectively.
+
+User space can form a feedback loop with the thermal subsystem too to ensure
+the device doesn't heat up to the point where it will throttle.
+
+Both SCHED_NORMAL/OTHER and SCHED_FIFO/RR honour uclamp requests/hints.
+
+In the SCHED_FIFO/RR case, uclamp gives the option to run RT tasks at any
+performance point rather than being tied to MAX frequency all the time. Which
+can be useful on general purpose systems that run on battery powered devices.
+
+Note that by design RT tasks don't have per-task PELT signal and must always
+run at a constant frequency to combat undeterministic DVFS rampup delays.
+
+Note that using schedutil always implies a single delay to modify the frequency
+when an RT task wakes up. This cost is unchanged by using uclamp. Uclamp only
+helps picking what frequency to request instead of schedutil always requesting
+MAX for all RT tasks.
+
+See :ref:`section 3.4 <uclamp-default-values>` for default values and
+:ref:`3.4.1 <sched-util-clamp-min-rt-default>` on how to change RT tasks
+default value.
+
+2. Design
+=========
+
+Util clamp is a property of every task in the system. It sets the boundaries of
+its utilization signal; acting as a bias mechanism that influences certain
+decisions within the scheduler.
+
+The actual utilization signal of a task is never clamped in reality. If you
+inspect PELT signals at any point of time you should continue to see them as
+they are intact. Clamping happens only when needed, e.g: when a task wakes up
+and the scheduler needs to select a suitable CPU for it to run on.
+
+Since the goal of util clamp is to allow requesting a minimum and maximum
+performance point for a task to run on, it must be able to influence the
+frequency selection as well as task placement to be most effective. Both of
+which have implications on the utilization value at CPU runqueue (rq for short)
+level, which brings us to the main design challenge.
+
+When a task wakes up on an rq, the utilization signal of the rq will be
+affected by the uclamp settings of all the tasks enqueued on it. For example if
+a task requests to run at UTIL_MIN = 512, then the util signal of the rq needs
+to respect to this request as well as all other requests from all of the
+enqueued tasks.
+
+To be able to aggregate the util clamp value of all the tasks attached to the
+rq, uclamp must do some housekeeping at every enqueue/dequeue, which is the
+scheduler hot path. Hence care must be taken since any slow down will have
+significant impact on a lot of use cases and could hinder its usability in
+practice.
+
+The way this is handled is by dividing the utilization range into buckets
+(struct uclamp_bucket) which allows us to reduce the search space from every
+task on the rq to only a subset of tasks on the top-most bucket.
+
+When a task is enqueued, the counter in the matching bucket is incremented,
+and on dequeue it is decremented. This makes keeping track of the effective
+uclamp value at rq level a lot easier.
+
+As tasks are enqueued and dequeued, we keep track of the current effective
+uclamp value of the rq. See :ref:`section 2.1 <uclamp-buckets>` for details on
+how this works.
+
+Later at any path that wants to identify the effective uclamp value of the rq,
+it will simply need to read this effective uclamp value of the rq at that exact
+moment of time it needs to take a decision.
+
+For task placement case, only Energy Aware and Capacity Aware Scheduling
+(EAS/CAS) make use of uclamp for now, which implies that it is applied on
+heterogeneous systems only.
+When a task wakes up, the scheduler will look at the current effective uclamp
+value of every rq and compare it with the potential new value if the task were
+to be enqueued there. Favoring the rq that will end up with the most energy
+efficient combination.
+
+Similarly in schedutil, when it needs to make a frequency update it will look
+at the current effective uclamp value of the rq which is influenced by the set
+of tasks currently enqueued there and select the appropriate frequency that
+will satisfy constraints from requests.
+
+Other paths like setting overutilization state (which effectively disables EAS)
+make use of uclamp as well. Such cases are considered necessary housekeeping to
+allow the 2 main use cases above and will not be covered in detail here as they
+could change with implementation details.
+
+.. _uclamp-buckets:
+
+2.1. Buckets
+------------
+
+::
+
+                           [struct rq]
+
+  (bottom)                                                    (top)
+
+    0                                                          1024
+    |                                                           |
+    +-----------+-----------+-----------+----   ----+-----------+
+    |  Bucket 0 |  Bucket 1 |  Bucket 2 |    ...    |  Bucket N |
+    +-----------+-----------+-----------+----   ----+-----------+
+       :           :                                   :
+       +- p0       +- p3                               +- p4
+       :                                               :
+       +- p1                                           +- p5
+       :
+       +- p2
+
+
+.. note::
+  The diagram above is an illustration rather than a true depiction of the
+  internal data structure.
+
+To reduce the search space when trying to decide the effective uclamp value of
+an rq as tasks are enqueued/dequeued, the whole utilization range is divided
+into N buckets where N is configured at compile time by setting
+CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5.
+
+The rq has a bucket for each uclamp_id tunables: [UCLAMP_MIN, UCLAMP_MAX].
+
+The range of each bucket is 1024/N. For example, for the default value of
+5 there will be 5 buckets, each of which will cover the following range:
+
+::
+
+        DELTA = round_closest(1024/5) = 204.8 = 205
+
+        Bucket 0: [0:204]
+        Bucket 1: [205:409]
+        Bucket 2: [410:614]
+        Bucket 3: [615:819]
+        Bucket 4: [820:1024]
+
+When a task p with following tunable parameters
+
+::
+
+        p->uclamp[UCLAMP_MIN] = 300
+        p->uclamp[UCLAMP_MAX] = 1024
+
+is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and bucket
+4 will be incremented for UCLAMP_MAX to reflect the fact the rq has a task in
+this range.
+
+The rq then keeps track of its current effective uclamp value for each
+uclamp_id.
+
+When a task p is enqueued, the rq value changes to:
+
+::
+
+        // update bucket logic goes here
+        rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN])
+        // repeat for UCLAMP_MAX
+
+Similarly, when p is dequeued the rq value changes to:
+
+::
+
+        // update bucket logic goes here
+        rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value()
+        // repeat for UCLAMP_MAX
+
+When all buckets are empty, the rq uclamp values are reset to system defaults.
+See :ref:`section 3.4 <uclamp-default-values>` for details on default values.
+
+
+2.2. Max aggregation
+--------------------
+
+Util clamp is tuned to honour the request for the task that requires the
+highest performance point.
+
+When multiple tasks are attached to the same rq, then util clamp must make sure
+the task that needs the highest performance point gets it even if there's
+another task that doesn't need it or is disallowed from reaching this point.
+
+For example, if there are multiple tasks attached to an rq with the following
+values:
+
+::
+
+        p0->uclamp[UCLAMP_MIN] = 300
+        p0->uclamp[UCLAMP_MAX] = 900
+
+        p1->uclamp[UCLAMP_MIN] = 500
+        p1->uclamp[UCLAMP_MAX] = 500
+
+then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN
+and UCLAMP_MAX become:
+
+::
+
+        rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500
+        rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900
+
+As we shall see in :ref:`section 5.1 <uclamp-capping-fail>`, this max
+aggregation is the cause of one of limitations when using util clamp, in
+particular for UCLAMP_MAX hint when user space would like to save power.
+
+2.3. Hierarchical aggregation
+-----------------------------
+
+As stated earlier, util clamp is a property of every task in the system. But
+the actual applied (effective) value can be influenced by more than just the
+request made by the task or another actor on its behalf (middleware library).
+
+The effective util clamp value of any task is restricted as follows:
+
+  1. By the uclamp settings defined by the cgroup CPU controller it is attached
+     to, if any.
+  2. The restricted value in (1) is then further restricted by the system wide
+     uclamp settings.
+
+:ref:`Section 3 <uclamp-interfaces>` discusses the interfaces and will expand
+further on that.
+
+For now suffice to say that if a task makes a request, its actual effective
+value will have to adhere to some restrictions imposed by cgroup and system
+wide settings.
+
+The system will still accept the request even if effectively will be beyond the
+constraints, but as soon as the task moves to a different cgroup or a sysadmin
+modifies the system settings, the request will be satisfied only if it is
+within new constraints.
+
+In other words, this aggregation will not cause an error when a task changes
+its uclamp values, but rather the system may not be able to satisfy requests
+based on those factors.
+
+2.4. Range
+----------
+
+Uclamp performance request has the range of 0 to 1024 inclusive.
+
+For cgroup interface percentage is used (that is 0 to 100 inclusive).
+Just like other cgroup interfaces, you can use 'max' instead of 100.
+
+.. _uclamp-interfaces:
+
+3. Interfaces
+=============
+
+3.1. Per task interface
+-----------------------
+
+sched_setattr() syscall was extended to accept two new fields:
+
+* sched_util_min: requests the minimum performance point the system should run
+  at when this task is running. Or lower performance bound.
+* sched_util_max: requests the maximum performance point the system should run
+  at when this task is running. Or upper performance bound.
+
+For example, the following scenario have 40% to 80% utilization constraints:
+
+::
+
+        attr->sched_util_min = 40% * 1024;
+        attr->sched_util_max = 80% * 1024;
+
+When task @p is running, **the scheduler should try its best to ensure it
+starts at 40% performance level**. If the task runs for a long enough time so
+that its actual utilization goes above 80%, the utilization, or performance
+level, will be capped.
+
+The special value -1 is used to reset the uclamp settings to the system
+default.
+
+Note that resetting the uclamp value to system default using -1 is not the same
+as manually setting uclamp value to system default. This distinction is
+important because as we shall see in system interfaces, the default value for
+RT could be changed. SCHED_NORMAL/OTHER might gain similar knobs too in the
+future.
+
+3.2. cgroup interface
+---------------------
+
+There are two uclamp related values in the CPU cgroup controller:
+
+* cpu.uclamp.min
+* cpu.uclamp.max
+
+When a task is attached to a CPU controller, its uclamp values will be impacted
+as follows:
+
+* cpu.uclamp.min is a protection as described in :ref:`section 3-3 of cgroup
+  v2 documentation <cgroupv2-protections-distributor>`.
+
+  If a task uclamp_min value is lower than cpu.uclamp.min, then the task will
+  inherit the cgroup cpu.uclamp.min value.
+
+  In a cgroup hierarchy, effective cpu.uclamp.min is the max of (child,
+  parent).
+
+* cpu.uclamp.max is a limit as described in :ref:`section 3-2 of cgroup v2
+  documentation <cgroupv2-limits-distributor>`.
+
+  If a task uclamp_max value is higher than cpu.uclamp.max, then the task will
+  inherit the cgroup cpu.uclamp.max value.
+
+  In a cgroup hierarchy, effective cpu.uclamp.max is the min of (child,
+  parent).
+
+For example, given following parameters:
+
+::
+
+        p0->uclamp[UCLAMP_MIN] = // system default;
+        p0->uclamp[UCLAMP_MAX] = // system default;
+
+        p1->uclamp[UCLAMP_MIN] = 40% * 1024;
+        p1->uclamp[UCLAMP_MAX] = 50% * 1024;
+
+        cgroup0->cpu.uclamp.min = 20% * 1024;
+        cgroup0->cpu.uclamp.max = 60% * 1024;
+
+        cgroup1->cpu.uclamp.min = 60% * 1024;
+        cgroup1->cpu.uclamp.max = 100% * 1024;
+
+when p0 and p1 are attached to cgroup0, the values become:
+
+::
+
+        p0->uclamp[UCLAMP_MIN] = cgroup0->cpu.uclamp.min = 20% * 1024;
+        p0->uclamp[UCLAMP_MAX] = cgroup0->cpu.uclamp.max = 60% * 1024;
+
+        p1->uclamp[UCLAMP_MIN] = 40% * 1024; // intact
+        p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact
+
+when p0 and p1 are attached to cgroup1, these instead become:
+
+::
+
+        p0->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
+        p0->uclamp[UCLAMP_MAX] = cgroup1->cpu.uclamp.max = 100% * 1024;
+
+        p1->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
+        p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact
+
+Note that cgroup interfaces allows cpu.uclamp.max value to be lower than
+cpu.uclamp.min. Other interfaces don't allow that.
+
+3.3. System interface
+---------------------
+
+3.3.1 sched_util_clamp_min
+--------------------------
+
+System wide limit of allowed UCLAMP_MIN range. By default it is set to 1024,
+which means that permitted effective UCLAMP_MIN range for tasks is [0:1024].
+By changing it to 512 for example the range reduces to [0:512]. This is useful
+to restrict how much boosting tasks are allowed to acquire.
+
+Requests from tasks to go above this knob value will still succeed, but
+they won't be satisfied until it is more than p->uclamp[UCLAMP_MIN].
+
+The value must be smaller than or equal to sched_util_clamp_max.
+
+3.3.2 sched_util_clamp_max
+--------------------------
+
+System wide limit of allowed UCLAMP_MAX range. By default it is set to 1024,
+which means that permitted effective UCLAMP_MAX range for tasks is [0:1024].
+
+By changing it to 512 for example the effective allowed range reduces to
+[0:512]. This means is that no task can run above 512, which implies that all
+rqs are restricted too. IOW, the whole system is capped to half its performance
+capacity.
+
+This is useful to restrict the overall maximum performance point of the system.
+For example, it can be handy to limit performance when running low on battery
+or when the system wants to limit access to more energy hungry performance
+levels when it's in idle state or screen is off.
+
+Requests from tasks to go above this knob value will still succeed, but they
+won't be satisfied until it is more than p->uclamp[UCLAMP_MAX].
+
+The value must be greater than or equal to sched_util_clamp_min.
+
+.. _uclamp-default-values:
+
+3.4. Default values
+-------------------
+
+By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to:
+
+::
+
+        p_fair->uclamp[UCLAMP_MIN] = 0
+        p_fair->uclamp[UCLAMP_MAX] = 1024
+
+That is, by default they're boosted to run at the maximum performance point of
+changed at boot or runtime. No argument was made yet as to why we should
+provide this, but can be added in the future.
+
+For SCHED_FIFO/SCHED_RR tasks:
+
+::
+
+        p_rt->uclamp[UCLAMP_MIN] = 1024
+        p_rt->uclamp[UCLAMP_MAX] = 1024
+
+That is by default they're boosted to run at the maximum performance point of
+the system which retains the historical behavior of the RT tasks.
+
+RT tasks default uclamp_min value can be modified at boot or runtime via
+sysctl. See below section.
+
+.. _sched-util-clamp-min-rt-default:
+
+3.4.1 sched_util_clamp_min_rt_default
+-------------------------------------
+
+Running RT tasks at maximum performance point is expensive on battery powered
+devices and not necessary. To allow system developer to offer good performance
+guarantees for these tasks without pushing it all the way to maximum
+performance point, this sysctl knob allows tuning the best boost value to
+address the system requirement without burning power running at maximum
+performance point all the time.
+
+Application developer are encouraged to use the per task util clamp interface
+to ensure they are performance and power aware. Ideally this knob should be set
+to 0 by system designers and leave the task of managing performance
+requirements to the apps.
+
+4. How to use util clamp
+========================
+
+Util clamp promotes the concept of user space assisted power and performance
+management. At the scheduler level there is no info required to make the best
+decision. However, with util clamp user space can hint to the scheduler to make
+better decision about task placement and frequency selection.
+
+Best results are achieved by not making any assumptions about the system the
+application is running on and to use it in conjunction with a feedback loop to
+dynamically monitor and adjust. Ultimately this will allow for a better user
+experience at a better perf/watt.
+
+For some systems and use cases, static setup will help to achieve good results.
+Portability will be a problem in this case. How much work one can do at 100,
+200 or 1024 is different for each system. Unless there's a specific target
+system, static setup should be avoided.
+
+There are enough possibilities to create a whole framework based on util clamp
+or self contained app that makes use of it directly.
+
+4.1. Boost important and DVFS-latency-sensitive tasks
+-----------------------------------------------------
+
+A GUI task might not be busy to warrant driving the frequency high when it
+wakes up. However, it requires to finish its work within a specific time window
+to deliver the desired user experience. The right frequency it requires at
+wakeup will be system dependent. On some underpowered systems it will be high,
+on other overpowered ones it will be low or 0.
+
+This task can increase its UCLAMP_MIN value every time it misses the deadline
+to ensure on next wake up it runs at a higher performance point. It should try
+to approach the lowest UCLAMP_MIN value that allows to meet its deadline on any
+particular system to achieve the best possible perf/watt for that system.
+
+On heterogeneous systems, it might be important for this task to run on
+a faster CPU.
+
+**Generally it is advised to perceive the input as performance level or point
+which will imply both task placement and frequency selection**.
+
+4.2. Cap background tasks
+-------------------------
+
+Like explained for Android case in the introduction. Any app can lower
+UCLAMP_MAX for some background tasks that don't care about performance but
+could end up being busy and consume unnecessary system resources on the system.
+
+4.3. Powersave mode
+-------------------
+
+sched_util_clamp_max system wide interface can be used to limit all tasks from
+operating at the higher performance points which are usually energy
+inefficient.
+
+This is not unique to uclamp as one can achieve the same by reducing max
+frequency of the cpufreq governor. It can be considered a more convenient
+alternative interface.
+
+4.4. Per-app performance restriction
+------------------------------------
+
+Middleware/Utility can provide the user an option to set UCLAMP_MIN/MAX for an
+app every time it is executed to guarantee a minimum performance point and/or
+limit it from draining system power at the cost of reduced performance for
+these apps.
+
+If you want to prevent your laptop from heating up while on the go from
+compiling the kernel and happy to sacrifice performance to save power, but
+still would like to keep your browser performance intact, uclamp makes it
+possible.
+
+5. Limitations
+==============
+
+.. _uclamp-capping-fail:
+
+5.1. Capping frequency with uclamp_max fails under certain conditions
+---------------------------------------------------------------------
+
+If task p0 is capped to run at 512:
+
+::
+
+        p0->uclamp[UCLAMP_MAX] = 512
+
+and it shares the rq with p1 which is free to run at any performance point:
+
+::
+
+        p1->uclamp[UCLAMP_MAX] = 1024
+
+then due to max aggregation the rq will be allowed to reach max performance
+point:
+
+::
+
+        rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024
+
+Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection for
+the rq will depend on the actual utilization value of the tasks.
+
+If p1 is a small task but p0 is a CPU intensive task, then due to the fact that
+both are running at the same rq, p1 will cause the frequency capping to be left
+from the rq although p1, which is allowed to run at any performance point,
+doesn't actually need to run at that frequency.
+
+5.2. UCLAMP_MAX can break PELT (util_avg) signal
+------------------------------------------------
+
+PELT assumes that frequency will always increase as the signals grow to ensure
+there's always some idle time on the CPU. But with UCLAMP_MAX, this frequency
+increase will be prevented which can lead to no idle time in some
+circumstances. When there's no idle time, a task will stuck in a busy loop,
+which would result in util_avg being 1024.
+
+Combing with issue described below, this can lead to unwanted frequency spikes
+when severely capped tasks share the rq with a small non capped task.
+
+As an example if task p, which have:
+
+::
+
+        p0->util_avg = 300
+        p0->uclamp[UCLAMP_MAX] = 0
+
+wakes up on an idle CPU, then it will run at min frequency (Fmin) this
+CPU is capable of. The max CPU frequency (Fmax) matters here as well,
+since it designates the shortest computational time to finish the task's
+work on this CPU.
+
+::
+
+        rq->uclamp[UCLAMP_MAX] = 0
+
+If the ratio of Fmax/Fmin is 3, then maximum value will be:
+
+::
+
+        300 * (Fmax/Fmin) = 900
+
+which indicates the CPU will still see idle time since 900 is < 1024. The
+_actual_ util_avg will not be 900 though, but somewhere between 300 and 900. As
+long as there's idle time, p->util_avg updates will be off by a some margin,
+but not proportional to Fmax/Fmin.
+
+::
+
+        p0->util_avg = 300 + small_error
+
+Now if the ratio of Fmax/Fmin is 4, the maximum value becomes:
+
+::
+
+        300 * (Fmax/Fmin) = 1200
+
+which is higher than 1024 and indicates that the CPU has no idle time. When
+this happens, then the _actual_ util_avg will become:
+
+::
+
+        p0->util_avg = 1024
+
+If task p1 wakes up on this CPU, which have:
+
+::
+
+        p1->util_avg = 200
+        p1->uclamp[UCLAMP_MAX] = 1024
+
+then the effective UCLAMP_MAX for the CPU will be 1024 according to max
+aggregation rule. But since the capped p0 task was running and throttled
+severely, then the rq->util_avg will be:
+
+::
+
+        p0->util_avg = 1024
+        p1->util_avg = 200
+
+        rq->util_avg = 1024
+        rq->uclamp[UCLAMP_MAX] = 1024
+
+Hence lead to a frequency spike since if p0 wasn't throttled we should get:
+
+::
+
+        p0->util_avg = 300
+        p1->util_avg = 200
+
+        rq->util_avg = 500
+
+and run somewhere near mid performance point of that CPU, not the Fmax we get.
+
+5.3. Schedutil response time issues
+-----------------------------------
+
+schedutil has three limitations:
+
+        1. Hardware takes non-zero time to respond to any frequency change
+           request. On some platforms can be in the order of few ms.
+        2. Non fast-switch systems require a worker deadline thread to wake up
+           and perform the frequency change, which adds measurable overhead.
+        3. schedutil rate_limit_us drops any requests during this rate_limit_us
+           window.
+
+If a relatively small task is doing critical job and requires a certain
+performance point when it wakes up and starts running, then all these
+limitations will prevent it from getting what it wants in the time scale it
+expects.
+
+This limitation is not only impactful when using uclamp, but will be more
+prevalent as we no longer gradually ramp up or down. We could easily be
+jumping between frequencies depending on the order tasks wake up, and their
+respective uclamp values.
+
+We regard that as a limitation of the capabilities of the underlying system
+itself.
+
+There is room to improve the behavior of schedutil rate_limit_us, but not much
+to be done for 1 or 2. They are considered hard limitations of the system.