Message ID | cover.1696345700.git.Hongyan.Xia2@arm.com |
---|---|
Headers |
From: Hongyan Xia <Hongyan.Xia2@arm.com>
To: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Vincent Guittot <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Qais Yousef <qyousef@layalina.io>, Morten Rasmussen <morten.rasmussen@arm.com>, Lukasz Luba <lukasz.luba@arm.com>, Christian Loehle <christian.loehle@arm.com>, linux-kernel@vger.kernel.org, Hongyan Xia <Hongyan.Xia2@arm.com>
Subject: [RFC PATCH 0/6] sched: uclamp sum aggregation
Date: Wed, 4 Oct 2023 10:04:48 +0100
Message-Id: <cover.1696345700.git.Hongyan.Xia2@arm.com> |
Series | sched: uclamp sum aggregation |
Message
Hongyan Xia
Oct. 4, 2023, 9:04 a.m. UTC
Current implementation of uclamp leaves many things to be desired. There are several problems:

1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or with the default value, which is 1024) can ruin the previous settings the moment it is enqueued, as shown in the uclamp frequency spike problem in Section 5.2 of Documentation/scheduler/sched-util-clamp.rst. Constantly running at 1024 utilization implies that the CPU is driven at its max capacity. However, with UCLAMP_MAX, this assumption is no longer true. To mitigate this, one idea is to filter out uclamp values for short-running tasks. However, the effectiveness of this idea remains to be seen.

2. No way to differentiate between a UCLAMP_MAX throttled CPU and a CPU running at its peak, as both show utilization of 1024. However, in certain cases it's important to tell the difference, as we still want to take the opportunity to enqueue tasks in the former case.

It's also debatable whether uclamp should be a frequency hint. An example is a system with a mid CPU of capacity 500, and a little CPU with capacity 100. If we have 10 tasks with UCLAMP_MIN of 101, then util_fits_cpu() will schedule them on the mid CPU because feec() thinks they don't fit on the little, even though we should use the little core at some point instead of making the mid even more crowded. Of course, this in reality doesn't happen because of other mechanisms like load balancing, but it's also not good when other mechanisms can cancel the effect of uclamp in feec(). A CPU running at capacity 500 for 1ms or for 1000ms gives completely different performance levels, so trying to fit only the frequency does not give us any performance guarantees. If we then talk about not only running at some frequency but also for some amount of time, then what we really mean is a capacity hint, not a frequency hint.

It's even worse when the tasks scheduled on the mid CPU also have UCLAMP_MAX values.
In the previous example, instead of running at 500, a single UCLAMP_MAX, assuming it's 110, can make the mid CPU run at a much lower frequency than 500, so it is then a really bad decision to honor the UCLAMP_MIN frequency hint and place it on the mid CPU, instead of running it on the little CPU which is less crowded, and has pretty much the same capacity (100 vs 110) anyway.

This series attempts to solve these problems by tracking a util_avg_uclamp signal in tasks and cfs_rq. At task level, p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp must always be the sum of all util_avg_uclamp of all the entities on this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum aggregation of all the clamped values, which indicates the frequency this rq should run at and what the utilization is.

This idea solves Problem 1 by capping the utilization of an always-running task throttled by UCLAMP_MAX. Although the task (denoted by Task A) has no idle time, the util_avg_uclamp signal gives its UCLAMP_MAX value instead of 1024, so even if another task (Task B) with a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization, which means we no longer have the frequency spike problem caused by B. This should mean that we might completely avoid the need for uclamp filtering.

It also solves Problem 2 by tracking the capped utilization of a rq. Using util_avg_uclamp, we are able to differentiate between a CPU throttled by UCLAMP_MAX and one that is actually running at its peak capacity. An always-running rq with a task having UCLAMP_MAX at 300 will report a cfs_rq util_avg_uclamp at 300, not 1024, which means task placement sees spare capacity on this CPU and will allocate some tasks to it, instead of treating it the same way as a CPU actually running at the peak.
This brings us closer to the benefits of Android's sum aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510), which shows energy benefits because we are able to schedule tasks on smaller cores which are UCLAMP_MAX capped, instead of finding a fresh big CPU. However, a major difference is that this series has an amortized cost of O(1), instead of O(n) in cpu_util_next() in Android kernels.

It avoids the shortcomings from uclamp-being-a-frequency-hint. Under sum aggregation, the scheduler sees the sum aggregated util_avg_uclamp and will avoid the problem of enqueueing too many tasks just to fit the frequency, and will place tasks on little CPUs in the previous example.

Patch 2/6 tries to simulate PELT decay in the new util_avg_uclamp signal, as this gives us gradual decay of utilization which avoids premature CPU frequency drops. This is a major caveat of this series. We should explore if there's a better way to integrate uclamp directly into the util_avg signal, instead of introducing a new util_avg_uclamp and then simulating PELT on it.

This new design significantly simplifies uclamp logic. Patch 4/6 removes the tri-state return value of util_fits_cpu(). Patch 5/6 then completely removes uclamp buckets. Because the util_avg_uclamp is already a clamped value propagated from bottom to top, there's no need to clamp anything at the top level and we can just use this value for frequency selection and spare capacity search. Basically, the idea is that all uclamp logic happens inside util_avg_uclamp, and other code using this new signal can just use it like util_avg, without even knowing that uclamp exists. This drastically reduces the complexity of uclamp and makes the code considering all the uclamp corner cases unnecessary.
At the end of the series, we remove 749 lines of code while adding a bit more than 300 (although once we update Documentation.rst, it will be a bit more than that).

Note that this series is still considered RFC status. TODO items are:

1. Implement sum aggregation for RT tasks.
2. Improve handling of cpu_util(boost).

TESTING:

Initial test and benchmark results suggest that sum aggregation not only is significantly simpler, but generally performs much better in achieving what uclamp is supposed to do. Two notebooks are shared at

https://nbviewer.org/github/honxia02/sum_aggre_notebooks/blob/main/upstream.ipynb
https://nbviewer.org/github/honxia02/sum_aggre_notebooks/blob/main/sum.ipynb

The experiments done in notebooks are on Arm Juno r2 board. CPU0-3 are little cores with capacity of 383. CPU4-5 are big cores. The rt-app profiles used for these experiments are included in the notebooks.

Scenario 1: Scheduling 4 always-running tasks with UCLAMP_MAX at 200.

The scheduling decisions are plotted in Out[11] and Out[14] respectively. Max aggregation fails to recognize the opportunity to run all of them on the efficient little Power Domain (PD), whereas sum aggregation runs all 4 on the little PD, leaving the big PD completely powered off, satisfying uclamp requests while saving power.

Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value slightly above the capacity of the little CPU.

Results are in Out[17] and Out[82]. The purpose is to use UCLAMP_MIN to place tasks on the big core. Max aggregation is pretty much in an overutilized state the entire time. Sum aggregation sees that the big CPU can hold two such tasks (satisfying both UCLAMP_MIN and UCLAMP_MAX requests for all 4 tasks) on each big CPU and quickly settles down, and is still under EAS without overutilization.

Scenario 3: Task A is a task with a small utilization pinned to CPU4. Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped at 200.
After a while, task A is then pinned to CPU5, joining B.

Results are in Out[23] and Out[95]. The blue util_avg signal is the original root CFS util_avg. The yellow line is the root CFS utilization after considering uclamp. Max aggregation sees a frequency spike at 579.1s. When zoomed in, one can see square-wave-like utilization values because of A periodically going to sleep. When A wakes up, its default UCLAMP_MAX of 1024 will uncap B and reach the highest CPU frequency. When A sleeps, B's UCLAMP_MAX will be in effect and will reduce rq utilization to 200. This happens repeatedly, hence the square wave. In contrast, sum aggregation sees a normal increase in utilization when A joins B, at around 2507s, without any square-wave behavior.

Scenario 4: 4 always-running tasks with UCLAMP_MAX of 120 pinned to the little PD (CPU0-3). 4 same tasks pinned to the big PD (CPU4-5). After a while, remove the CPU pinning of the 4 tasks on the big PD.

Results are in Out[29] and Out[101]. Max aggregation fails to identify that all 8 tasks can be packed on the little PD, whereas sum aggregation immediately moves the 4 tasks pinned to big PD to the little PD the moment pinning is removed. Sum aggregation understands that even when the rq seems to have utilization of 1024, this is because of UCLAMP_MAX and there's still opportunity to pack 2 such tasks on each little CPU, leaving the big PD untouched, saving power.

BENCHMARKS:

One TODO item is to handle cpu_util(boost) better. The current handling of boost is far from ideal and is a known caveat. All below benchmark numbers are done without any boosting in cpu_util(), in both max and sum aggregation.

Geekbench 6 (on Rock-5B board)

+-----+-------------+------------+
|     | Single-core | Multi-core |
+-----+-------------+------------+
| Max | 800.6       | 2977.0     |
| Sum | 801.2       | 2981.4     |
+-----+-------------+------------+

No regression is seen after switching to sum aggregation.
Jankbench (backporting sum aggregation to Android 5.18 mainline kernel)

+------+-----------------+-------+-----------+
|      | variable        | value | perc_diff |
+------+-----------------+-------+-----------+
| main | jank_percentage | 1.1   | 0.00%     |
| sum  | jank_percentage | 0.5   | -53.92%   |
+------+-----------------+-------+-----------+

+------------+--------+------+-------+-----------+
|            | metric | tag  | value | perc_diff |
+------------+--------+------+-------+-----------+
| CPU        | gmean  | main | 166.1 | 0.00%     |
| CPU-Big    | gmean  | main | 55.1  | 0.00%     |
| CPU-Little | gmean  | main | 91.7  | 0.00%     |
| CPU-Mid    | gmean  | main | 19.2  | 0.00%     |
| CPU        | gmean  | sum  | 161.3 | -2.85%    |
| CPU-Big    | gmean  | sum  | 52.9  | -3.97%    |
| CPU-Little | gmean  | sum  | 86.7  | -5.39%    |
| CPU-Mid    | gmean  | sum  | 21.6  | 12.63%    |
+------------+--------+------+-------+-----------+

The TL;DR of Jankbench is that sum aggregation reduces jank by 54% while using 2.85% less CPU power. Note that this benchmark on Pixel 6 by default has some sort of uclamp feedback applied to it. When jank is detected, certain display and benchmark threads are applied with a UCLAMP_MIN value of 512 (any help in identifying where these UCLAMP_MIN values come from in the mainline Android kernel is appreciated). In contrast, we constantly apply a UCLAMP_MIN value of 60 to these threads without any feedback loop. If a similar feedback loop is involved to only apply UCLAMP_MIN when needed, power consumption can be expected to be even lower.
Hongyan Xia (6):
  sched/uclamp: Track uclamped util_avg in sched_avg
  sched/uclamp: Simulate PELT decay in util_avg_uclamp
  sched/fair: Use CFS util_avg_uclamp for utilization and frequency
  sched/fair: Rewrite util_fits_cpu()
  sched/uclamp: Remove all uclamp bucket logic
  sched/uclamp: Simplify uclamp_eff_value()

 include/linux/sched.h            |  13 +-
 init/Kconfig                     |  32 ---
 kernel/sched/core.c              | 312 ++-------------------------
 kernel/sched/cpufreq_schedutil.c |  19 +-
 kernel/sched/fair.c              | 354 +++++++++----------------------
 kernel/sched/pelt.c              | 146 ++++++++++++-
 kernel/sched/rt.c                |   4 -
 kernel/sched/sched.h             | 208 ++++++------------
 8 files changed, 339 insertions(+), 749 deletions(-)
Comments
On 04/10/2023 11:04, Hongyan Xia wrote:

> Current implementation of uclamp leaves many things to be desired. There are several problems:
>
> 1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or with the default value, which is 1024) can ruin the previous settings the moment it is enqueued, as shown in the uclamp frequency spike problem in Section 5.2 of Documentation/scheduler/sched-util-clamp.rst. Constantly running at 1024 utilization implies that the CPU is driven at its max capacity. However, with UCLAMP_MAX, this assumption is no longer true. To mitigate this, one idea is to filter out uclamp values for short-running tasks. However, the effectiveness of this idea remains to be seen.

The difference is that we don't have to lift the uclamp_max cap of runnable p1's uclamp_max (< 1024) when a short running p2 with uclamp_max = 1024 becomes runnable? Since now, when this happens, we would just add p2's util_avg_uclamp to cfs_rq's util_avg_uclamp which is tiny compared to its uclamp_max = 1024.

> 2. No way to differentiate between UCLAMP_MAX throttled CPU or a CPU running at its peak, as both show utilization of 1024. However, in certain cases it's important to tell the difference, as we still want to take the opportunity to enqueue tasks in the former case.

Is this related to the `best_fits/max_fits` logic in find_energy_efficient_cpu() (feec()) and select_idle_capacity() (sic()) (commit e5ed0550c04c "sched/fair: unlink misfit task from cpu overutilized")?

With your approach, having cfs_rq's `util_avg_uclamp` next to its `util_avg` would allow to see those differences by comparing the two signals?

> It's also debatable whether uclamp should be a frequency hint. An example is a system with a mid CPU of capacity 500, and a little CPU with capacity 100.
> If we have 10 tasks with UCLAMP_MIN of 101, then util_fits_cpu() will schedule them on the mid CPU because feec() thinks they don't fit on the little, even though we should use the little core at some point instead of making the mid even more crowded. Of course, this in reality doesn't happen because of other mechanisms like load balancing, but it's also not good when other mechanisms can

CPU overutilization detection enables CFS load-balance & `sis()->sic()` and disables feec().

> cancel the effect of uclamp in feec(). A CPU running at capacity 500 for 1ms or for 1000ms gives completely different performance levels, so trying to fit only the frequency does not give us any performance guarantees. If we then talk about not only running at some frequency but also for some amount of time, then what we really mean is a capacity hint, not a frequency hint.

Aren't CPU frequency and capacity going in the same direction? IPC * CPU frequency = Instructions per Second == Performance (Capacity). And in the scheduler, utilization is the portion of the Capacity currently used.

What sum aggregation does differently is that you can sum up individually clamped utilization contributions and compare them against capacity rather than being forced to use the maximum clamp value of one (runnable) task to guide frequency. This avoids those discontinuity moments when a task with a high uclamp value switches between runnable and sleeping state.

> It's even worse when the tasks scheduled on the mid CPU also have UCLAMP_MAX values. In the previous example, instead of running at 500, a single UCLAMP_MAX, assuming it's 110, can make the mid CPU run at a much lower frequency than 500, so it is then a really bad decision to honor the UCLAMP_MIN frequency hint and place it on the mid CPU, instead of running it on the little CPU which is less crowded, and has pretty much the same capacity (100 vs 110) anyway.
> This series attempts to solve these problems by tracking a util_avg_uclamp signal in tasks and cfs_rq. At task level, p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp must always be the sum of all util_avg_uclamp of all the entities on this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum aggregation of all the clamped values, which indicates the frequency this rq should run at and what the utilization is.
>
> This idea solves Problem 1 by capping the utilization of an always-running task throttled by UCLAMP_MAX. Although the task (denoted by Task A) has no idle time, the util_avg_uclamp signal gives its UCLAMP_MAX value instead of 1024, so even if another task (Task B) with a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization, which means we no longer have the frequency spike problem caused by B. This should mean that we might completely avoid the need for uclamp filtering.

That would be very nice since I remember that this filtering approach had to figure out the actual runtime of the task and the implementation couldn't be just in the sched class code but had to be done in core code as well.

> It also solves Problem 2 by tracking the capped utilization of a rq. Using util_avg_uclamp, we are able to differentiate between a CPU throttled by UCLAMP_MAX and one that is actually running at its peak capacity. An always-running rq with a task having UCLAMP_MAX at 300 will report a cfs_rq util_avg_uclamp at 300, not 1024, which means task placement sees spare capacity on this CPU and will allocate some tasks to it, instead of treating it the same way as a CPU actually running at the peak.
> This brings us closer to the benefits of Android's sum aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510),

That's true although the latest Pixel phone does not seem to rely on CONFIG_USE_GROUP_THROTTLE anymore but on customized PELT tracking for so called Vendor CFS Util Groups and its sum aggregation on CPU (root cfs_rq) level (CONFIG_USE_VENDOR_GROUP_UTIL).

CONFIG_USE_GROUP_THROTTLE was using the fact that Android only uses 1. level taskgroups to map its different task types (e.g. foreground, background etc.) into:

for_each_leaf_cfs_rq_safe(): calc subgroup_util_sum and throttled_subgroup_util_sum and skip root

cpu_util = root_util - subgroup_util_sum + throttled_subgroup_util_sum

So summing up the (throttled (capped)) contributions of the 1. level taskgroups plus the root taskgroup util.

With CONFIG_USE_VENDOR_GROUP_UTIL each task has a vendor pointer into a CFS util array indexed by an Android task type ID to which the task contributes its util. CPU util is then sum aggregated over this array.

We should recall that this is all done because the current uclamp-max max aggregation isn't working for Android's use-cases. So to overcome this issue in mainline is key here.

> which shows energy benefits because we are able to schedule tasks on smaller cores which are UCLAMP_MAX capped, instead of finding a fresh big CPU. However, a major difference is that this series has an amortized cost of O(1), instead of O(n) in cpu_util_next() in Android kernels.
>
> It avoids the shortcomings from uclamp-being-a-frequency-hint. Under sum aggregation, the scheduler sees the sum aggregated util_avg_uclamp and will avoid the problem of enqueueing too many tasks just to fit the frequency, and will place tasks on little CPUs in the previous example.
> Patch 2/6 tries to simulate PELT decay in the new util_avg_uclamp signal, as this gives us gradual decay of utilization which avoids premature CPU frequency drops. This is a major caveat of this series. We should explore if there's a better way to integrate uclamp directly into the util_avg signal, instead of introducing a new util_avg_uclamp and then simulate PELT on it.
>
> This new design significantly simplifies uclamp logic. Patch 4/6 removes the tri-state return value of util_fits_cpu(). Patch 5/6 then completely removes uclamp buckets. Because the util_avg_uclamp is already a clamped value propagated from bottom to top, there's no need to clamp anything at the top level and we can just use this value for frequency selection and spare capacity search. Basically, the idea is that all uclamp logic happens inside util_avg_uclamp, and other code using this new signal can just use it like util_avg, without even knowing that uclamp exists. This drastically reduces the complexity of uclamp and makes the code considering all the uclamp corner cases unnecessary. At the end of the series, we remove 749 lines of code while adding a bit more than 300 (although once we update Documentation.rst, it will be a bit more than that).
>
> Note that this series is still considered RFC status. TODO items are:
>
> 1. Implement sum aggregation for RT tasks.
> 2. Improve handling of cpu_util(boost).

What about the integration with util_est here?

In cpu_util(), &rq->cfs->avg.util_avg is replaced by rq->root_cfs_util_uclamp and in task_util_est() (should be actually named task_util() to be in sync with cpu_util(), i.e. returning max(util, util_est)), task_util(p) returns p->se.avg.util_avg_uclamp.

Are there use cases for the original avg.util_avg still in place?
> TESTING:
>
> Initial test and benchmark results suggest that sum aggregation not only is significantly simpler, but generally performs much better in achieving what uclamp is supposed to do. Two notebooks are shared at
>
> https://nbviewer.org/github/honxia02/sum_aggre_notebooks/blob/main/upstream.ipynb
> https://nbviewer.org/github/honxia02/sum_aggre_notebooks/blob/main/sum.ipynb
>
> The experiments done in notebooks are on Arm Juno r2 board. CPU0-3 are little cores with capacity of 383. CPU4-5 are big cores. The rt-app profiles used for these experiments are included in the notebooks.
>
> Scenario 1: Scheduling 4 always-running tasks with UCLAMP_MAX at 200.
>
> The scheduling decisions are plotted in Out[11] and Out[14] respectively. Max aggregation fails to recognize the opportunity to run all of them on the efficient little Power Domain (PD), whereas sum aggregation runs all 4 on the little PD, leaving the big PD completely powered off, satisfying uclamp requests while saving power.

Does `upstream` already contain the v6.7 fixes `Fix uclamp code corner cases`?

https://lkml.kernel.org/r/20230916232955.2099394-1-qyousef@layalina.io

> Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value slightly above the capacity of the little CPU.
>
> Results are in Out[17] and Out[82]. The purpose is to use UCLAMP_MIN to place tasks on the big core. Max aggregation is pretty much in an overutilized state the entire time. Sum aggregation sees that the big CPU can hold two such tasks (satisfying both UCLAMP_MIN and UCLAMP_MAX requests for all 4 tasks) on each big CPU and quickly settles down, and is still under EAS without overutilization.

thread0-[0-3] uclamp_max = 309. So p->se.avg.util_avg_uclamp is constrained by this value for all 4 tasks, letting 2 tasks fit on each of the big CPUs. You have to zoom into Out[82] to actually see this.
And I guess for max aggregation cpu_overutilized() can't hold the clamp continuously because of all the other short running uclamp_max = 1024 (default) tasks on the rq's.

> Scenario 3: Task A is a task with a small utilization pinned to CPU4. Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped at 200. After a while, task A is then pinned to CPU5, joining B.
>
> Results are in Out[23] and Out[95]. The blue util_avg signal is the original root CFS util_avg. The yellow line is the root CFS utilization after considering uclamp. Max aggregation sees a frequency spike at 579.1s. When zoomed in, one can see square-wave-like utilization values because of A periodically going to sleep. When A wakes up, its default UCLAMP_MAX of 1024 will uncap B and reach the highest CPU frequency. When A sleeps, B's UCLAMP_MAX will be in effect and will reduce rq utilization to 200. This happens repeatedly, hence the square wave. In contrast, sum aggregation sees a normal increase in utilization when A joins B, at around 2507s, without any square-wave behavior.

Makes sense. But there shouldn't be a root_cfs_util_uclamp in main? Which signal does the yellow line represent in Out[23]?

> Scenario 4: 4 always-running tasks with UCLAMP_MAX of 120 pinned to the little PD (CPU0-3). 4 same tasks pinned to the big PD (CPU4-5). After a while, remove the CPU pinning of the 4 tasks on the big PD.
>
> Results are in Out[29] and Out[101]. Max aggregation fails to identify that all 8 tasks can be packed on the little PD, whereas sum aggregation immediately moves the 4 tasks pinned to big PD to the little PD the moment pinning is removed. Sum aggregation understands that even when the rq seems to have utilization of 1024, this is because of UCLAMP_MAX and there's still opportunity to pack 2 such tasks on each little CPU, leaving the big PD untouched, saving power.

Looks good to me.
> BENCHMARKS:
>
> One TODO item is to handle cpu_util(boost) better. The current handling of boost is far from ideal and is a known caveat. All below benchmarks numbers are done without any boosting in cpu_util(), in both max and sum aggregation.

Maybe too early for black-box tests at this stage but good to see that nothing seems to go sideways with uclamp sum aggregation.

> Geekbench 6 (on Rock-5B board)
>
> +-----+-------------+------------+
> |     | Single-core | Multi-core |
> +-----+-------------+------------+
> | Max | 800.6       | 2977.0     |
> | Sum | 801.2       | 2981.4     |
> +-----+-------------+------------+
>
> No regression is seen after switching to sum aggregation.
>
> Jankbench (backporting sum aggregation to Android 5.18 mainline kernel)
>
> +------+-----------------+-------+-----------+
> |      | variable        | value | perc_diff |
> +------+-----------------+-------+-----------+
> | main | jank_percentage | 1.1   | 0.00%     |
> | sum  | jank_percentage | 0.5   | -53.92%   |
> +------+-----------------+-------+-----------+
>
> +------------+--------+------+-------+-----------+
> |            | metric | tag  | value | perc_diff |
> +------------+--------+------+-------+-----------+
> | CPU        | gmean  | main | 166.1 | 0.00%     |
> | CPU-Big    | gmean  | main | 55.1  | 0.00%     |
> | CPU-Little | gmean  | main | 91.7  | 0.00%     |
> | CPU-Mid    | gmean  | main | 19.2  | 0.00%     |
> | CPU        | gmean  | sum  | 161.3 | -2.85%    |
> | CPU-Big    | gmean  | sum  | 52.9  | -3.97%    |
> | CPU-Little | gmean  | sum  | 86.7  | -5.39%    |
> | CPU-Mid    | gmean  | sum  | 21.6  | 12.63%    |
> +------------+--------+------+-------+-----------+
>
> The TL;DR of Jankbench is that sum aggregation reduces jank by 54% while with 2.85% less CPU power. Note that this benchmark on Pixel 6 by default has some sort of uclamp feedback applied to it.
When jank is > detected, certain display and benchmark threads are applied with a > UCLAMP_MIN value of 512 (any help in identifying where these UCLAMP_MIN > values come from in the mainline Android kernel is appreciated). In > contrast, we constantly apply a UCLAMP_MIN value of 60 to these threads > without any feedback loop. If a similar feedback loop is involved to > only apply UCLAMP_MIN when needed, power consumption can be expected to > be even lower. [...]
On 30/10/2023 18:46, Dietmar Eggemann wrote:
> On 04/10/2023 11:04, Hongyan Xia wrote:
>> Current implementation of uclamp leaves many things to be desired. There are several problems:
>>
>> 1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or with the default value, which is 1024) can ruin the previous settings the moment it is enqueued, as shown in the uclamp frequency spike problem in Section 5.2 of Documentation/scheduler/sched-util-clamp.rst. Constantly running at 1024 utilization implies that the CPU is driven at its max capacity. However, with UCLAMP_MAX, this assumption is no longer true. To mitigate this, one idea is to filter out uclamp values for short-running tasks. However, the effectiveness of this idea remains to be seen.
>
> The difference is that we don't have to lift the uclamp_max cap of runnable p1's uclamp_max (< 1024) when a short running p2 with uclamp_max = 1024 becomes runnable? Since now, when this happens, we would just add p2's util_avg_uclamp to cfs_rq's util_avg_uclamp which is tiny compared to its uclamp_max = 1024.

Yes. That's a correct interpretation. I can add this version to the cover letter to make things clearer.

>> 2. No way to differentiate between UCLAMP_MAX throttled CPU or a CPU running at its peak, as both show utilization of 1024. However, in certain cases it's important to tell the difference, as we still want to take the opportunity to enqueue tasks in the former case.
>
> Is this related to the `best_fits/max_fits` logic in find_energy_efficient_cpu() (feec()) and select_idle_capacity() (sic()) (commit e5ed0550c04c "sched/fair: unlink misfit task from cpu overutilized")? With your approach, having cfs_rq's `util_avg_uclamp` next to its `util_avg` would allow us to see those differences by comparing the two signals?

Somewhat related. The usefulness is mostly around cpu_util().
Max aggregation will see cpu_util() of 1024 (spare capacity of 0) even when this is caused by UCLAMP_MAX, not by actually running the CPU at max capacity, whereas this series will have a util < 1024 if capped by UCLAMP_MAX.

This means the patch to consider 0 spare capacities (because under max aggregation 0 capacity doesn't mean the CPU is at max) is unnecessary:

6b00a40147653c8ea748e8f4396510f252763364
sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0

>> It's also debatable whether uclamp should be a frequency hint. An example is a system with a mid CPU of capacity 500, and a little CPU with capacity 100. If we have 10 tasks with UCLAMP_MIN of 101, then util_fits_cpu() will schedule them on the mid CPU because feec() thinks they don't fit on the little, even though we should use the little core at some point instead of making the mid even more crowded. Of course, this in reality doesn't happen because of other mechanisms like load balancing, but it's also not good when other mechanisms can
>
> CPU overutilization detection enables CFS load-balance & `sis()->sic()` and disables feec().
>
>> cancel the effect of uclamp in feec(). A CPU running at capacity 500 for 1ms or for 1000ms gives completely different performance levels, so trying to fit only the frequency does not give us any performance guarantees. If we then talk about not only running at some frequency but also for some amount of time, then what we really mean is a capacity hint, not a frequency hint.
>
> Isn't CPU frequency and capacity going in the same direction?
>
>   IPC * CPU frequency = Instructions per Second == Performance (Capacity).
>
> And in the scheduler, utilization is the portion of the Capacity currently used.
> What sum aggregation does differently is that you can sum up individually clamped utilization contributions and compare them against capacity rather than being forced to use the maximum value of a clamp value of one (runnable) task to guide frequency. This avoids those discontinuity moments when a task with a high uclamp value switches between runnable and sleeping state.

I guess what I was describing was a new metric:

    Work = Capacity * Time

Max aggregation aims to fit capacity, but if a task can be run at a high capacity but for only a very short period of time, then this scheduling is not particularly useful. Like in the above example, moving a task with UCLAMP_MIN of 101 to the mid CPU isn't helpful when there are already 10 tasks on the mid CPU, because Time is reduced.

>> It's even worse when the tasks scheduled on the mid CPU also have UCLAMP_MAX values. In the previous example, instead of running at 500, a single UCLAMP_MAX, assuming it's 110, can make the mid CPU run at a much lower frequency than 500, so it is then a really bad decision to honor the UCLAMP_MIN frequency hint and place it on the mid CPU, instead of running it on the little CPU which is less crowded, and has pretty much the same capacity (100 vs 110) anyway.
>>
>> This series attempts to solve these problems by tracking a util_avg_uclamp signal in tasks and cfs_rq. At task level, p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp must always be the sum of all util_avg_uclamp of all the entities on this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum aggregation of all the clamped values, which indicates the frequency this rq should run at and what the utilization is.
>>
>> This idea solves Problem 1 by capping the utilization of an always-running task throttled by UCLAMP_MAX.
>> Although the task (denoted by Task A) has no idle time, the util_avg_uclamp signal gives its UCLAMP_MAX value instead of 1024, so even if another task (Task B) with a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization, which means we no longer have the frequency spike problem caused by B. This should mean that we might completely avoid the need for uclamp filtering.
>
> That would be very nice since I remember that this filtering approach had to figure out the actual runtime of the task and the implementation couldn't be just in the sched class code but had to be done in core code as well.

Yes. So far I don't see any need for filtering and runtime accounting in uclamp sum aggregation.

>> It also solves Problem 2 by tracking the capped utilization of a rq. Using util_avg_uclamp, we are able to differentiate between a CPU throttled by UCLAMP_MAX and one that is actually running at its peak capacity. An always-running rq with a task having UCLAMP_MAX at 300 will report a cfs_rq util_avg_uclamp at 300, not 1024, which means task placement sees spare capacity on this CPU and will allocate some tasks to it, instead of treating it the same way as a CPU actually running at the peak. This brings us closer to the benefits of Android's sum aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510),
>
> That's true although the latest Pixel phone does not seem to rely on CONFIG_USE_GROUP_THROTTLE anymore but on customized PELT tracking for so-called Vendor CFS Util Groups and its sum aggregation on CPU (root cfs_rq) level (CONFIG_USE_VENDOR_GROUP_UTIL).
>
> CONFIG_USE_GROUP_THROTTLE was using the fact that Android only uses 1st-level taskgroups to map its different task types (e.g. foreground, background etc.) into:
>
>   for_each_leaf_cfs_rq_safe()
>     calc subgroup_util_sum and throttled_subgroup_util_sum and skip root
>
>   cpu_util = root_util - subgroup_util_sum + throttled_subgroup_util_sum
>
> So summing up the (throttled (capped)) contributions of the 1st-level taskgroups plus the root taskgroup util.
>
> With CONFIG_USE_VENDOR_GROUP_UTIL each task has a vendor pointer into a CFS util array indexed by an Android task type ID to which the task contributes its util. CPU util is then sum aggregated over this array.
>
> We should recall that this is all done because the current uclamp-max max aggregation isn't working for Android's use-cases.
>
> So overcoming this issue in mainline is key here.

Thanks for pointing me to the latest Pixel code.

The basic ideas of that and this series look very similar. Both sum up CFS utilization instead of directly tracking the root CFS utilization. This series so far has only implemented uclamp on tasks, but the same code can also be implemented on group sched_entities. Once I implement that, then it essentially does the same thing as GROUP_THROTTLE. I think if the current Pixel code works, then it's likely this series works too.

The big difference is that this series relies on CONFIG_FAIR_GROUP_SCHED. There seem to be complaints about it and I may need to know why it's not desirable for Android.

>> which shows energy benefits because we are able to schedule tasks on smaller cores which are UCLAMP_MAX capped, instead of finding a fresh big CPU. However, a major difference is that this series has an amortized cost of O(1), instead of O(n) in cpu_util_next() in Android kernels.
>>
>> It avoids the shortcomings from uclamp-being-a-frequency-hint. Under sum aggregation, the scheduler sees the sum aggregated util_avg_uclamp and will avoid the problem of enqueueing too many tasks just to fit the frequency, and will place tasks on little CPUs in the previous example.
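The amortized O(1) claim above can be pictured with a small sketch: the rq-level signal is a running sum adjusted at enqueue/dequeue time, rather than recomputed by walking all tasks the way cpu_util_next() does. This is a toy model of the assumed semantics, not the actual patch (which also has to handle PELT decay and propagation):

```python
class ToyRq:
    """Sketch of cfs_rq->avg.util_avg_uclamp kept as a running sum,
    updated in O(1) at enqueue/dequeue (assumed semantics, not the
    actual patch)."""

    def __init__(self):
        self.util_avg_uclamp = 0

    def enqueue(self, task_util, uclamp_min, uclamp_max):
        # Per-task signal: util_avg clamped into [uclamp_min, uclamp_max].
        contrib = min(max(task_util, uclamp_min), uclamp_max)
        self.util_avg_uclamp += contrib
        return contrib

    def dequeue(self, contrib):
        self.util_avg_uclamp -= contrib

rq = ToyRq()
capped = rq.enqueue(900, 0, 300)     # busy task capped at 300
boosted = rq.enqueue(50, 101, 1024)  # small task boosted to 101
print(rq.util_avg_uclamp)            # 401
rq.dequeue(capped)
print(rq.util_avg_uclamp)            # 101
```

Each enqueue/dequeue touches one counter, so the rq signal stays current without any per-task walk at read time.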
>> Patch 2/6 tries to simulate PELT decay in the new util_avg_uclamp signal, as this gives us gradual decay of utilization which avoids premature CPU frequency drops. This is a major caveat of this series. We should explore if there's a better way to integrate uclamp directly into the util_avg signal, instead of introducing a new util_avg_uclamp and then simulating PELT on it.
>>
>> This new design significantly simplifies uclamp logic. Patch 4/6 removes the tri-state return value of util_fits_cpu(). Patch 5/6 then completely removes uclamp buckets. Because the util_avg_uclamp is already a clamped value propagated from bottom to top, there's no need to clamp anything at the top level and we can just use this value for frequency selection and spare capacity search. Basically, the idea is that all uclamp logic happens inside util_avg_uclamp, and other code using this new signal can just use it like util_avg, without even knowing that uclamp exists. This drastically reduces the complexity of uclamp and makes the code considering all the uclamp corner cases unnecessary. At the end of the series, we remove 749 lines of code while adding a bit more than 300 (although once we update Documentation.rst, it will be a bit more than that).
>>
>> Note that this series is still considered RFC status. TODO items are:
>>
>> 1. Implement sum aggregation for RT tasks.
>> 2. Improve handling of cpu_util(boost).
>
> What about the integration with util_est here?
>
> In cpu_util(), &rq->cfs->avg.util_avg is replaced by rq->root_cfs_util_uclamp
>
> and in
>
> task_util_est() (should actually be named task_util() to be in sync with cpu_util(), i.e. returning max(util, util_est)), task_util(p) returns p->se.avg.util_avg_uclamp.
>
> Are there use cases for the original avg.util_avg still in place?
We still need the original avg.util_avg for tasks because util_avg_uclamp is clamped from it, but other than that, at the moment there's no reason to use the normal util_avg.

>> TESTING:
>>
>> Initial test and benchmark results suggest that sum aggregation not only is significantly simpler, but generally performs much better in achieving what uclamp is supposed to do. Two notebooks are shared at
>>
>> https://nbviewer.org/github/honxia02/sum_aggre_notebooks/blob/main/upstream.ipynb
>> https://nbviewer.org/github/honxia02/sum_aggre_notebooks/blob/main/sum.ipynb
>>
>> The experiments done in the notebooks are on an Arm Juno r2 board. CPU0-3 are little cores with capacity of 383. CPU4-5 are big cores. The rt-app profiles used for these experiments are included in the notebooks.
>>
>> Scenario 1: Scheduling 4 always-running tasks with UCLAMP_MAX at 200.
>>
>> The scheduling decisions are plotted in Out[11] and Out[14] respectively. Max aggregation fails to recognize the opportunity to run all of them on the efficient little Power Domain (PD), whereas sum aggregation runs all 4 on the little PD, leaving the big PD completely powered off, satisfying uclamp requests while saving power.
>
> Does `upstream` already contain the v6.7 fixes `Fix uclamp code corner cases`?
>
> https://lkml.kernel.org/r/20230916232955.2099394-1-qyousef@layalina.io

Unfortunately, no. These experiments were done before that patch. I quickly did a test and that patch did fix this issue. Now the four little tasks are scheduled on the small PD.

But I remember that you have another test indicating that the patch may expose some new issues?

>> Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value slightly above the capacity of the little CPU.
>>
>> Results are in Out[17] and Out[82]. The purpose is to use UCLAMP_MIN to place tasks on the big core. Max aggregation is pretty much in an overutilized state the entire time. Sum aggregation sees that the big CPU can hold two such tasks (satisfying both UCLAMP_MIN and UCLAMP_MAX requests for all 4 tasks) on each big CPU and quickly settles down, and is still under EAS without overutilization.
>
> thread0-[0-3] uclamp_max = 309. So p->se.avg.util_avg_uclamp is constrained by this value for all 4 tasks, letting 2 tasks fit on each of the big CPUs. You have to zoom into Out[82] to actually see this.
>
> And I guess for max aggregation cpu_overutilized() can't hold the clamp continuously because of all the other short running uclamp_max = 1024 (default) tasks on the rq's.

Actually I'm not 100% sure about the reason why max aggregation under this experiment never stabilizes. I can investigate further.

>> Scenario 3: Task A is a task with a small utilization pinned to CPU4. Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped at 200. After a while, task A is then pinned to CPU5, joining B.
>>
>> Results are in Out[23] and Out[95]. The blue util_avg signal is the original root CFS util_avg. The yellow line is the root CFS utilization after considering uclamp. Max aggregation sees a frequency spike at 579.1s. When zoomed in, one can see square-wave-like utilization values because of A periodically going to sleep. When A wakes up, its default UCLAMP_MAX of 1024 will uncap B and reach the highest CPU frequency. When A sleeps, B's UCLAMP_MAX will be in effect and will reduce rq utilization to 200. This happens repeatedly, hence the square wave. In contrast, sum aggregation sees a normal increase in utilization when A joins B, at around 2507s, without any square-wave behavior.
>
> Makes sense. But there shouldn't be a root_cfs_util_uclamp in main? Which signal does the yellow line represent in Out[23]?

Ah, that can be confusing. For current upstream, root_cfs_util_uclamp comes from uclamp_rq_util_with(rq, cfs_rq->avg.util_avg).
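The square-wave behavior described for Scenario 3 can be reproduced with a toy model of the two aggregation schemes (a sketch under simplified assumptions: A's utilization is taken as roughly 100 and the upstream signal follows uclamp_rq_util_with()-style max aggregation):

```python
# Toy timeline for Scenario 3 (a sketch, not kernel code). Task B is an
# always-running task with UCLAMP_MAX = 200; task A is a small task
# (util_avg assumed ~100, default UCLAMP_MAX = 1024) that alternates
# between runnable and sleeping.

B = (1024, 200)    # (util_avg, uclamp_max)
A = (100, 1024)

def max_agg(runnable):
    # uclamp_rq_util_with()-style: clamp the raw rq util by the highest
    # uclamp_max among runnable tasks.
    raw = min(sum(u for u, _ in runnable), 1024)
    rq_uclamp_max = max(m for _, m in runnable)
    return min(raw, rq_uclamp_max)

def sum_agg(runnable):
    # Sum of per-task clamped contributions.
    return min(sum(min(u, m) for u, m in runnable), 1024)

print(max_agg([B]), sum_agg([B]))        # 200 200 (A sleeping)
print(max_agg([B, A]), sum_agg([B, A]))  # 1024 300 (A awake)
# Max aggregation flips 200 <-> 1024 as A wakes/sleeps (the square wave);
# sum aggregation moves smoothly from 200 to 300 (B's cap + A's util).
```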
>> Scenario 4: 4 always-running tasks with UCLAMP_MAX of 120 pinned to the little PD (CPU0-3). 4 same tasks pinned to the big PD (CPU4-5). After a while, remove the CPU pinning of the 4 tasks on the big PD.
>>
>> Results are in Out[29] and Out[101]. Max aggregation fails to identify that all 8 tasks can be packed on the little PD, whereas sum aggregation immediately moves the 4 tasks pinned to big PD to the little PD the moment pinning is removed. Sum aggregation understands that even when the rq seems to have utilization of 1024, this is because of UCLAMP_MAX and there's still opportunity to pack 2 such tasks on each little CPU, leaving the big PD untouched, saving power.
>
> Looks good to me.
>
>> BENCHMARKS:
>>
>> One TODO item is to handle cpu_util(boost) better. The current handling of boost is far from ideal and is a known caveat. All below benchmark numbers are done without any boosting in cpu_util(), in both max and sum aggregation.
>
> Maybe too early for black-box tests at this stage, but good to see that nothing seems to go sideways with uclamp sum aggregation.

>> Geekbench 6 (on Rock-5B board)
>>
>> +-----+-------------+------------+
>> |     | Single-core | Multi-core |
>> +-----+-------------+------------+
>> | Max | 800.6       | 2977.0     |
>> | Sum | 801.2       | 2981.4     |
>> +-----+-------------+------------+
>>
>> No regression is seen after switching to sum aggregation.
>>
>> Jankbench (backporting sum aggregation to Android 5.18 mainline kernel)
>>
>> +------+-----------------+-------+-----------+
>> |      | variable        | value | perc_diff |
>> +------+-----------------+-------+-----------+
>> | main | jank_percentage | 1.1   | 0.00%     |
>> | sum  | jank_percentage | 0.5   | -53.92%   |
>> +------+-----------------+-------+-----------+
>>
>> +------------+--------+------+-------+-----------+
>> |            | metric | tag  | value | perc_diff |
>> +------------+--------+------+-------+-----------+
>> | CPU        | gmean  | main | 166.1 | 0.00%     |
>> | CPU-Big    | gmean  | main | 55.1  | 0.00%     |
>> | CPU-Little | gmean  | main | 91.7  | 0.00%     |
>> | CPU-Mid    | gmean  | main | 19.2  | 0.00%     |
>> | CPU        | gmean  | sum  | 161.3 | -2.85%    |
>> | CPU-Big    | gmean  | sum  | 52.9  | -3.97%    |
>> | CPU-Little | gmean  | sum  | 86.7  | -5.39%    |
>> | CPU-Mid    | gmean  | sum  | 21.6  | 12.63%    |
>> +------------+--------+------+-------+-----------+
>>
>> The TL;DR of Jankbench is that sum aggregation reduces jank by 54% while using 2.85% less CPU power. Note that this benchmark on Pixel 6 by default has some sort of uclamp feedback applied to it. When jank is detected, certain display and benchmark threads are applied with a UCLAMP_MIN value of 512 (any help in identifying where these UCLAMP_MIN values come from in the mainline Android kernel is appreciated). In contrast, we constantly apply a UCLAMP_MIN value of 60 to these threads without any feedback loop. If a similar feedback loop is involved to only apply UCLAMP_MIN when needed, power consumption can be expected to be even lower.

[...]
On 03/11/2023 12:19, Hongyan Xia wrote:
> On 30/10/2023 18:46, Dietmar Eggemann wrote:
>> On 04/10/2023 11:04, Hongyan Xia wrote:
>>> Current implementation of uclamp leaves many things to be desired. There are several problems:
>>>
>>> 1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or with the default value, which is 1024) can ruin the previous settings the moment it is enqueued, as shown in the uclamp frequency spike problem in Section 5.2 of Documentation/scheduler/sched-util-clamp.rst. Constantly running at 1024 utilization implies that the CPU is driven at its max capacity. However, with UCLAMP_MAX, this assumption is no longer true. To mitigate this, one idea is to filter out uclamp values for short-running tasks. However, the effectiveness of this idea remains to be seen.
>>
>> The difference is that we don't have to lift the uclamp_max cap of runnable p1's uclamp_max (< 1024) when a short running p2 with uclamp_max = 1024 becomes runnable? Since now, when this happens, we would just add p2's util_avg_uclamp to cfs_rq's util_avg_uclamp which is tiny compared to its uclamp_max = 1024.
>
> Yes. That's a correct interpretation. I can add this version to the cover letter to make things clearer.

OK.

>>> 2. No way to differentiate between UCLAMP_MAX throttled CPU or a CPU running at its peak, as both show utilization of 1024. However, in certain cases it's important to tell the difference, as we still want to take the opportunity to enqueue tasks in the former case.
>>
>> Is this related to the `best_fits/max_fits` logic in find_energy_efficient_cpu() (feec()) and select_idle_capacity() (sic()) (commit e5ed0550c04c "sched/fair: unlink misfit task from cpu overutilized")?
>> With your approach, having cfs_rq's `util_avg_uclamp` next to its `util_avg` would allow us to see those differences by comparing the two signals?
>
> Somewhat related. The usefulness is mostly around cpu_util(). Max aggregation will see cpu_util() of 1024 (spare capacity of 0) even when this is caused by UCLAMP_MAX, not by actually running the CPU at max capacity, whereas this series will have a util < 1024 if capped by UCLAMP_MAX.

I see. For me the main diff is: cpu_util() (not considering util_est here) returns a completely different signal with sum aggregation:

  sum over runnable tasks p0..pN of
    clamp(p->se.avg.util_avg,
          p->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MAX])          (1)

instead of max aggregation:

  rq->cfs.avg.util_avg

which then has to be clamped later in different places using cpu_util():

  clamp(rq->cfs.avg.util_avg,
        rq->uclamp[UCLAMP_MIN], rq->uclamp[UCLAMP_MAX])          (2)

with the clamp values dictated by the runnable tasks with the highest uclamp values.

> This means the patch to consider 0 spare capacities (because under max aggregation 0 capacity doesn't mean the CPU is at max) is unnecessary:
>
> 6b00a40147653c8ea748e8f4396510f252763364
> sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0

Yes, this patch makes sure we consider PDs with max_spare_cap = 0 for EM calculations (eenv_pd_busy_time(), compute_energy()). I just asked about commit e5ed0550c04c since you removed the (best, max, prev)_fits and the (best, prev)_thermal_cap logic from feec(). And if I remember correctly this brought us an implementation of capacity inversion handling.

[...]

>>> cancel the effect of uclamp in feec(). A CPU running at capacity 500 for 1ms or for 1000ms gives completely different performance levels, so trying to fit only the frequency does not give us any performance guarantees. If we then talk about not only running at some frequency but also for some amount of time, then what we really mean is a capacity hint, not a frequency hint.
>> Isn't CPU frequency and capacity going in the same direction?
>>
>>   IPC * CPU frequency = Instructions per Second == Performance (Capacity).
>>
>> And in the scheduler, utilization is the portion of the Capacity currently used.
>>
>> What sum aggregation does differently is that you can sum up individually clamped utilization contributions and compare them against capacity rather than being forced to use the maximum value of a clamp value of one (runnable) task to guide frequency. This avoids those discontinuity moments when a task with a high uclamp value switches between runnable and sleeping state.
>
> I guess what I was describing was a new metric:
>
>     Work = Capacity * Time
>
> Max aggregation aims to fit capacity, but if a task can be run at a high capacity but for only a very short period of time, then this scheduling is not particularly useful. Like in the above example, moving a task with UCLAMP_MIN of 101 to the mid CPU isn't helpful when there are already 10 tasks on the mid CPU, because Time is reduced.

Not sure I understand this. Isn't the main difference between sum (1) and max (2) aggregation the fact that in (1) a p->uclamp[UCLAMP_MAX] only has an effect when it's really capping p->se.avg.util_avg? This avoids what can happen under (2), namely that a tiny task with a default uclamp_max = 1024 (not capped) can ruin the capping scenario for the time it's runnable.

I assume we understand (2) as a frequency hint (i.e. the currently highest request of a runnable task towards the CPU capacity) and (1) more as a capacity hint because of the summation of clamped p->se.avg.util_avg over all runnable tasks on the CPU.

But with 'IPC * CPU frequency == Capacity' what's the actual difference in these hinting strategies anyway?

[...]

>>> This idea solves Problem 1 by capping the utilization of an always-running task throttled by UCLAMP_MAX.
>>> Although the task (denoted by Task A) has no idle time, the util_avg_uclamp signal gives its UCLAMP_MAX value instead of 1024, so even if another task (Task B) with a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization, which means we no longer have the frequency spike problem caused by B. This should mean that we might completely avoid the need for uclamp filtering.
>>
>> That would be very nice since I remember that this filtering approach had to figure out the actual runtime of the task and the implementation couldn't be just in the sched class code but had to be done in core code as well.
>
> Yes. So far I don't see any need for filtering and runtime accounting in uclamp sum aggregation.

OK.

[...]

>> We should recall that this is all done because the current uclamp-max max aggregation isn't working for Android's use-cases.
>>
>> So overcoming this issue in mainline is key here.
>
> Thanks for pointing me to the latest Pixel code.
>
> The basic ideas of that and this series look very similar. Both sum up CFS utilization instead of directly tracking the root CFS utilization. This series so far has only implemented uclamp on tasks, but the same code can also be implemented on group sched_entities. Once I implement that, then it essentially does the same thing as GROUP_THROTTLE. I think if the current Pixel code works, then it's likely this series works too.
>
> The big difference is that this series relies on CONFIG_FAIR_GROUP_SCHED. There seem to be complaints about it and I may need to know why it's not desirable for Android.

It looks like Pixel 8 also has uclamp filtering to exclude a sufficiently short running task from influencing rq's uclamp_max (uclamp_max filtering).

GROUP_THROTTLING/VENDOR_GROUP_UTIL was introduced since they couldn't constrain on taskgroup level with the cpu.shares feature. And mainline uclamp max (on taskgroup level) didn't give them this functionality either. So it looks like with sum aggregation uclamp could provide both.

[...]

>>> Note that this series is still considered RFC status. TODO items are:
>>>
>>> 1. Implement sum aggregation for RT tasks.
>>> 2. Improve handling of cpu_util(boost).
>>
>> What about the integration with util_est here?
>>
>> In cpu_util(), &rq->cfs->avg.util_avg is replaced by rq->root_cfs_util_uclamp
>>
>> and in
>>
>> task_util_est() (should actually be named task_util() to be in sync with cpu_util(), i.e. returning max(util, util_est)), task_util(p) returns p->se.avg.util_avg_uclamp.
>>
>> Are there use cases for the original avg.util_avg still in place?
>
> We still need the original avg.util_avg for tasks because util_avg_uclamp is clamped from it, but other than that, at the moment there's no reason to use the normal util_avg.

OK. But it looks like util_est escapes uclamp:

  cpu_util()

      return min(max(util_uclamp, util_est), arch_scale_cpu_capacity(cpu))
                                  ^^^^^^^^

[...]

>>> Scenario 1: Scheduling 4 always-running tasks with UCLAMP_MAX at 200.

[...]

>> Does `upstream` already contain the v6.7 fixes `Fix uclamp code corner cases`?
>>
>> https://lkml.kernel.org/r/20230916232955.2099394-1-qyousef@layalina.io
>
> Unfortunately, no. These experiments were done before that patch. I quickly did a test and that patch did fix this issue. Now the four little tasks are scheduled on the small PD.
>
> But I remember that you have another test indicating that the patch may expose some new issues?

I remember 2 tests: 6 periodic tasks runtime/period=12/16ms uclamp_min/max=[0, M] on big.LITTLE (2 big, 4 little (arch_scale_cpu_capacity() = 675))

  test (1) M = 665
  test (2) M = 100

2 kernels:

  (A) tip sched/core w/ `Fix uclamp code corner cases`
  (B) uclamp sum aggregation

(A) managed to force all tasks to run on little CPUs for (1) but not for (2). For (B) it was the other way around.

These tests also show the _sum_ aggregation aspect of uclamp_max. It's important now to consider the actual uclamp-max value in case you want to achieve packing on little. I still have the ipython notebook for these tests.

>>> Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value

[...]

>> thread0-[0-3] uclamp_max = 309. So p->se.avg.util_avg_uclamp is constrained by this value for all 4 tasks, letting 2 tasks fit on each of the big CPUs. You have to zoom into Out[82] to actually see this.
>>
>> And I guess for max aggregation cpu_overutilized() can't hold the clamp continuously because of all the other short running uclamp_max = 1024 (default) tasks on the rq's.
>
> Actually I'm not 100% sure about the reason why max aggregation under this experiment never stabilizes. I can investigate further.

OK.

>>> Scenario 3: Task A is a task with a small utilization pinned to CPU4.

[...]

>> Makes sense. But there shouldn't be a root_cfs_util_uclamp in main? Which signal does the yellow line represent in Out[23]?
>
> Ah, that can be confusing. For current upstream, root_cfs_util_uclamp comes from uclamp_rq_util_with(rq, cfs_rq->avg.util_avg).

Topic is already pretty complicated so better use the correct naming.

[...]
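The sum-aggregation side of those two tests can be checked with back-of-the-envelope numbers (a sketch only; it assumes a 12/16ms periodic task settles around util ≈ 1024 * 12/16 = 768 and that packing on little requires two tasks to share one little CPU):

```python
LITTLE_CAP = 675                    # arch_scale_cpu_capacity() of a little CPU
TASK_UTIL = int(1024 * 12 / 16)     # ~768 for a 12/16ms always-ready task

def fits_two_tasks_sum_agg(uclamp_max):
    # Under sum aggregation, two co-scheduled tasks contribute
    # 2 * clamp(util, 0, uclamp_max); packing 6 tasks onto 4 little CPUs
    # needs two of them to fit on one little CPU.
    return 2 * min(TASK_UTIL, uclamp_max) <= LITTLE_CAP

print(fits_two_tasks_sum_agg(665))  # False: 2 * 665 = 1330 > 675
print(fits_two_tasks_sum_agg(100))  # True:  2 * 100 = 200 <= 675
```

This matches (B) packing on little for M = 100 but not for M = 665: with sum aggregation the actual uclamp-max value, not just its presence, decides whether packing is seen as possible.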
Hi Hongyan

On 10/04/23 10:04, Hongyan Xia wrote:
> Current implementation of uclamp leaves many things to be desired. There are several problems:
>
> 1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or with the default value, which is 1024) can ruin the previous settings the moment it is enqueued, as shown in the uclamp frequency spike problem in Section 5.2 of Documentation/scheduler/sched-util-clamp.rst. Constantly running at 1024 utilization implies that the CPU is driven at its max capacity. However, with UCLAMP_MAX, this assumption is no longer true. To mitigate this, one idea is to filter out uclamp values for short-running tasks. However, the effectiveness of this idea remains to be seen.

This was proved when I was at Arm and it is used in products now too. It is effective.

> 2. No way to differentiate between UCLAMP_MAX throttled CPU or a CPU running at its peak, as both show utilization of 1024. However, in certain cases it's important to tell the difference, as we still want to take the opportunity to enqueue tasks in the former case.
>
> It's also debatable whether uclamp should be a frequency hint. An

It is not debatable. Uclamp is a performance hint which translates into task placement (for HMP) and frequency selection. On SMP the latter is the only meaning for uclamp. For the time being at least, until we get a new technology that can impact performance ;-)

> example is a system with a mid CPU of capacity 500, and a little CPU with capacity 100. If we have 10 tasks with UCLAMP_MIN of 101, then util_fits_cpu() will schedule them on the mid CPU because feec() thinks they don't fit on the little, even though we should use the little core at some point instead of making the mid even more crowded.

This is not a problem. One of the major design points for uclamp is to allow small tasks to request to run faster.
So if you have 100 tasks that run for 100us and to meet that they need to run on the mid CPU, that is fine. That doesn't necessarily mean they'll overutilize the cluster. If the cluster gets overutilized, then EAS will be disabled and find_idle_capacity() will spread into the highest-spare-capacity idle CPU, which includes little CPUs.

And this behavior is no different from having tasks with util greater than 100 wanting to run on mid cores. The only difference is that we can pack more on the mid core before overutilized can be triggered. Which is the desired effect. So this is a non-issue and we should spread once the cluster is busy.

> Of course, this in reality doesn't happen because of other mechanisms like load balancing, but it's also not good when other mechanisms can cancel the effect of uclamp in feec(). A CPU running at capacity 500 for 1ms or for 1000ms gives completely different performance levels, so trying to fit only the frequency does not give us any performance guarantees. If we then talk about not only running at some frequency but also for some amount of time, then what we really mean is a capacity hint, not a frequency hint.

I'm not sure I parsed that. But again, it's a performance hint which implies both.

> It's even worse when the tasks scheduled on the mid CPU also have UCLAMP_MAX values. In the previous example, instead of running at 500, a single UCLAMP_MAX, assuming it's 110, can make the mid CPU run at a much lower frequency than 500, so it is then a really bad decision to honor the UCLAMP_MIN frequency hint and place it on the mid CPU, instead of running it on the little CPU which is less crowded, and has pretty much the same capacity (100 vs 110) anyway.

I can't follow this reasoning. It seems you're assuming the user has picked 110 by mistake and it is okay to ignore it. FYI, one of the use cases is to set uclamp_min to capacity_of(little_core)+1 because they're so crappy to handle their workload.
It is still better than using affinity as when the system gets busy we can still use the little core. But only when the system is busy and crowded already. I don't know what the reference to the mid core having a task with uclamp_max is, but it seems a bug in feec() if it's not predicting the frequency correctly. > > This series attempts to solve these problems by tracking a > util_avg_uclamp signal in tasks and cfs_rq. At task level, This was discussed offline and brought up several times on the list already. uclamp is a performance hint, not a bandwidth hint. I remember Dietmar was suggesting this concept in the past and I objected that this converts uclamp into a bandwidth hint. This overlaps with the CFS bandwidth controller. We can use that if somebody wants to limit the bandwidth the task can consume. uclamp is about limiting the performance level it can reach. ie: it is not about cycles consumed, but Performance Point reached. > p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but > clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp > must always be the sum of all util_avg_uclamp of all the entities on > this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum > aggregation of all the clamped values, which indicates the frequency > this rq should run at and what the utilization is. This is a no-go for me. A task's performance requirements cannot be added up to resemble the 'size' of the tasks. Tasks with util_avg 100 but a uclamp_min of 1024 should allow us to pack them in the big core. Which is an important use case. A bunch of small tasks that need to run faster are okay to pack on a bigger core as long as this doesn't cause overutilized to trigger. > > This idea solves Problem 1 by capping the utilization of an
Although the task (denoted > by Task A) has no idle time, the util_avg_uclamp signal gives its > UCLAMP_MAX value instead of 1024, so even if another task (Task B) with > a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's > UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization, > which means we no longer have the frequency spike problem caused by B. > This should mean that we might completely avoid the need for uclamp > filtering. > > It also solves Problem 2 by tracking the capped utilization of a rq. > Using util_avg_uclamp, we are able to differentiate between a CPU > throttled by UCLAMP_MAX and one that is actually running at its peak > capacity. An always-running rq with a task having UCLAMP_MAX at 300 will > report a cfs_rq util_avg_uclamp at 300, not 1024, which means task > placement sees spare capacity on this CPU and will allocate some tasks > to it, instead of treating it the same way as a CPU actually running at We don't need to know the actual spare capacity, though that would be nice to have. But generally in my view this CPU should be available for non-capped tasks to run on. I view this as a load balancer problem where the wakeup path can consider this CPU as a candidate and rely on the load balancer to move this task away to another (smaller) CPU ASAP. If this capped task is already on the smaller CPU and there's nowhere else to place the uncapped task in the system then we must be quite busy already. If userspace ends up capping too many tasks so hard that we end up with no idle time in the system (where otherwise we should) I consider this bad usage by userspace and will ask them not to cap that hard. This is a case of one shooting themselves in the foot. Maybe we need to care in the future, but so far this doesn't seem like a valid problem to address. I'd like to see why userspace has failed to do proper management first before jumping to fix this.
What I'd expect is more use cases of tasks that are truly 1024 but being capped. In this case there's truly no idle time on the CPU and your proposal won't address this. I don't think we have a way to distinguish between the two cases at the moment. > the peak. This brings us closer to the benefits of Android's sum > aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at > https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510), > which shows energy benefits because we are able to schedule tasks on > smaller cores which are UCLAMP_MAX capped, instead of finding a fresh You're jumping to the wrong conclusion about why this is being used. This was to overcome issues when using cpu.shares. This has nothing to do with UCLAMP_MAX. uclamp_max filter is what is being used. Based on the proposal I posted long ago https://android-review.googlesource.com/c/kernel/common/+/1914696 > big CPU. However, a major difference is that this series has an > amortized cost of O(1), instead of O(n) in cpu_util_next() in Android > kernels. > > It avoids the shortcomings from uclamp-being-a-frequency-hint. Under sum This is not a shortcoming. Performance implies frequency. We should not confuse how schedutil works to estimate the perf requirement based on historical run time (util_avg) and the fact that a task with short historical runtime needs to run faster. Running faster means increasing frequency. > aggregation, the scheduler sees the sum aggregated util_avg_uclamp and > will avoid the problem of enqueueing too many tasks just to fit the > frequency, and will place tasks on little CPUs in the previous example. I think the issue of uclamp being a performance but not a bandwidth hint was raised several times. Did I miss something that makes this not convert it into a bandwidth hint? Generally I think the max-aggregation should go away too.
We've been overcompensating for a problem that must be fixed by the cpufreq governor. I've been building up to this solution with patches to improve the governor and will post them soon. Cheers -- Qais Yousef
Hi Qais, On 03/12/2023 00:25, Qais Yousef wrote: > Hi Hongyan > > On 10/04/23 10:04, Hongyan Xia wrote: >> Current implementation of uclamp leaves many things to be desired. >> There are several problems: >> >> 1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or >> with the default value, which is 1024) can ruin the previous >> settings the moment it is enqueued, as shown in the uclamp frequency >> spike problem in Section 5.2 of >> Documentation/scheduler/sched-util-clamp.rst. Constantly running at >> 1024 utilization implies that the CPU is driven at its max capacity. >> However, with UCLAMP_MAX, this assumption is no longer true. To >> mitigate this, one idea is to filter out uclamp values for >> short-running tasks. However, the effectiveness of this idea remains >> to be seen. > > This was proved when I was at arm and it is used in products now too. It is > effective. > >> >> 2. No way to differentiate between UCLAMP_MAX throttled CPU or a CPU >> running at its peak, as both show utilization of 1024. However, in >> certain cases it's important to tell the difference, as we still want >> to take the opportunity to enqueue tasks in the former case. >> >> It's also debatable whether uclamp should be a frequency hint. An > > It is not debatable. Uclamp is a performance hint which translates into task > placement (for HMP) and frequency selection. On SMP the latter is the only > meaning for uclamp. For the time being at least until we get a new technology > that can impact performance ;-) I agree. In this new series, uclamp is still a performance hint which affects task placement and frequency, so we don't have a disagreement here. >> example is a system with a mid CPU of capacity 500, and a little CPU >> with capacity 100. 
If we have 10 tasks with UCLAMP_MIN of 101, then >> util_fits_cpu() will schedule them on the mid CPU because feec() >> thinks they don't fit on the little, even though we should use the >> little core at some point instead of making the mid even more crowded. > > This is not a problem. One of the major design points for uclamp is to allow > small tasks to request to run faster. So if you have 100 tasks that run for > 100us and to meet that they need to run at mid CPU that is fine. That doesn't > necessarily mean they'll overutilized the cluster. I understand. This design point is what sum aggregation does as well. This alone is not a problem, but moving a task to another CPU to fit its uclamp_min requirement can be a problem when the destination CPU is crowded and mixed with uclamp_max. I will elaborate on this below. > > If the cluster gets overutilized, then EAS will be disabled and > find_idle_capacity() will spread into highest spare capacity idle CPU, which > includes little CPUs. And this behavior is no different to have tasks with util > greater than 100 wanting to run on mid cores. The only difference is that we > can pack more on mid core before overutilized can be triggered. Which is the > desired effect. > > So this is a no issue and we should spread once the cluster is busy. >> Of course, this in reality doesn't happen because of other mechanisms >> like load balancing, but it's also not good when other mechanisms can >> cancel the effect of uclamp in feec(). A CPU running at capacity 500 for >> 1ms or for 1000ms gives completely different performance levels, so >> trying to fit only the frequency does not give us any performance >> guarantees. If we then talk about not only running at some frequency but >> also for some amount of time, then what we really mean is a capacity >> hint, not a frequency hint. > > I'm not sure I parsed that.
> >> >> It's even worse when the tasks scheduled on the mid CPU also have >> UCLAMP_MAX values. In the previous example, instead of running at 500, a >> single UCLAMP_MAX, assuming it's 110, can make the mid CPU run at a much >> lower frequency than 500, so it is then a really bad decision to honor >> the UCLAMP_MIN frequency hint and place it on the mid CPU, instead of >> running it on the little CPU which is less crowded, and has pretty much >> the same capacity (100 vs 110) anyway. > > I can't follow this reasoning. > > It seems you're assuming the user has picked 110 by mistake and it is okay to > ignore it. > > FYI, one of the use cases is to set uclamp_min to capacity_of(little_core)+1 > because they're so crappy to handle their workload. It is still better than > using affinity as when the system gets busy we can still use the little core. > But only when the system is busy and crowded already. > > I don't know what's the reference to the mid core having a task with > uclamp_max, but seems a bug in feec() if not predicting the frequency > correctly. The example in the cover letter might be complex. I can simplify as a "N tasks" problem. Say 2 CPUs, one with capacity 1024 and the other 512. Assume there's a task with uclamp_min 513 and uclamp_max of 800, then the big CPU can, in theory, fit an infinite number of such tasks on the big CPU, without triggering overutilization. What does uclamp_min even mean when this happens? uclamp_min should be a hint of a minimum performance it should get, but it breaks down when there are N tasks, so the hint from uclamp_min is divided by N. Under max aggregation, uclamp_min doesn't give you the "performance" of this task. It gives performance of the CPU, and this is not the performance of this task when N > 1. In fact, N can be infinity. This is the same problem shared by multiple uclamp users from LPC, where max aggregation hints the performance of the CPU, but not the task itself. 
As to "FYI, one of the use cases is to set uclamp_min to capacity_of(little_core)+1 because they're so crappy to handle their workload", I'm afraid I disagree with this. I think uclamp is a simple hinting mechanism to the scheduler to say: please try to give me [uclamp_min, uclamp_max] performance. It's then the job of EAS to pick the most efficient CPU. We know the little CPU is crappy because we have deep knowledge of the device, but uclamp values could just come from game developers (using ADPF) who have no knowledge of the SoC, and in the game developer's view, asking uclamp of little CPU + 1 is simply because he/she wants that performance, not because the developer thinks the little CPU should be avoided. I don't think a soft-affinity-like property belongs here. We should improve EM and let EAS do its job instead of teaching it to interpret uclamp as affinity hints. >> >> This series attempts to solve these problems by tracking a >> util_avg_uclamp signal in tasks and cfs_rq. At task level, > > This was discussed offline and brought several times on the list already. > uclamp is a performance not a bandwidth hint. I remember Dietmar was suggesting > this concept in the past and I objected that this converts uclamp into > a bandwidth hint. This overlaps with CFS bandwidth controller. We can use that > if somebody wants to limit the bandwidth the task can consume. uclamp is about > limiting the performance level it can reach. ie: it is not about cycles > consumed, but Performance Point reached. I'd argue the current max aggregation implementation is further away from a performance hint. The moment there are more tasks interacting with each other on the same rq, the performance hinting ability of max aggregation tends to do worse, like uclamp_min being divided by N, frequency spikes, uneven task distribution (both shown in the evaluation section of my cover letter) etc, whereas sum aggregation still performs well. 
Other shortcomings are not that critical, but the fact that uclamp_min's effectiveness is divided by N under max aggregation I think is not acceptable. >> p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but >> clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp >> must always be the sum of all util_avg_uclamp of all the entities on >> this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum >> aggregation of all the clamped values, which indicates the frequency >> this rq should run at and what the utilization is. > > This is a no go for me. A task performance requirements can not be added to > resemble the 'size' of the tasks. Tasks with util_avg 100 but a uclamp_min of > 1024 should allow us to pack them in the big core. Which is an important use > case to use. A bunch of small tasks that need to run faster are okay to pack on > a bigger core if they need to run faster as long as this doesn't cause > overutilized to trigger. I see. I now understand where we start to diverge. It seems we want completely different things from uclamp. Max aggregation: please try to place me on a CPU that can give CPU capacity [uclamp_min, uclamp_max] and run that CPU within that range. Sum aggregation: please try to treat me like a task with utilization within [uclamp_min, uclamp_max]. Like I mentioned in the previous comment, I'm afraid I disagree with using uclamp as a CPU affinity hint, even if it's just a soft affinity. Its job should just be a performance hint and EM should tell EAS which CPU to pick under that hint. If the little CPU is crappy, then the crappiness should be reflected in the EM, and maybe dynamic EM can help. >> >> This idea solves Problem 1 by capping the utilization of an >> always-running task throttled by UCLAMP_MAX. 
Although the task (denoted >> by Task A) has no idle time, the util_avg_uclamp signal gives its >> UCLAMP_MAX value instead of 1024, so even if another task (Task B) with >> a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's >> UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization, >> which means we no longer have the frequency spike problem caused by B. >> This should mean that we might completely avoid the need for uclamp >> filtering. >> >> It also solves Problem 2 by tracking the capped utilization of a rq. >> Using util_avg_uclamp, we are able to differentiate between a CPU >> throttled by UCLAMP_MAX and one that is actually running at its peak >> capacity. An always-running rq with a task having UCLAMP_MAX at 300 will >> report a cfs_rq util_avg_uclamp at 300, not 1024, which means task >> placement sees spare capacity on this CPU and will allocate some tasks >> to it, instead of treating it the same way as a CPU actually running at > > We don't need to know the actual spare capacity, though that would be nice to > have. > > But generally in my view this CPU should be available for non capped task to > run on. I view this a load balancer problem where wake up path can consider > this CPU as a candidate and rely on load balancer to move this task away to > another (smaller) CPU ASAP. If this capped task is already on the smaller CPU > and there's nowhere else to place the uncapped task in the system then we must > be quite busy already. I'm not quite sure relying on the load balancer is a good idea, because then it means we need uclamp code in the load balancer and the complexity keeps growing. An example is that after applying the patch 'sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0' even though the scheduler now knows that CPUs with 0 spare capacity are now candidates, it creates a very unbalanced task placement and CPU0 tends to receive a lot of tasks in my tests. 
This certainly needs some uclamp-aware code in the load balancer, whereas sum aggregation has no such problem and balance is achieved during task placement in the first place. > If userspace ends up capping too many tasks too hard that we end up with no > idle time in the system (where otherwise we should) I consider this a bad usage > in the userspace and will ask them not to cap that hard. This is a case of one > shooting themselves in the foot. Maybe we need to care in the future, but so > far this doesn't seem a valid problem to address. I'd like to see why userspace > has failed to do proper management first before jumping to fix this. > > What I'd expect more use cases of tasks that are truly 1024 but being capped. > In this case there's truly no idle time on the CPU and your proposal won't > address this. I don't think we have a way to distinguish between the two cases > at the moment. I don't quite get what you mean. In the examples in my cover letter the capped tasks are truly 1024. Two CPUs, one with an uncapped 1024 task and the other a 1024 task but with uclamp_max of 300. The scheduler sees cfs_rq->avg.util_avg_uclamp of 1024 in the first CPU and 300 in the 2nd, so my proposal indeed sees the difference, and will have the opportunity to place more tasks on the 2nd CPU without any corner case code. >> the peak. This brings us closer to the benefits of Android's sum >> aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at >> https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510), >> which shows energy benefits because we are able to schedule tasks on >> smaller cores which are UCLAMP_MAX capped, instead of finding a fresh > > You're jumping to the wrong conclusion about why this is being used. This was > to overcome issues when using cpu.shares. This has nothing to do with > UCLAMP_MAX. uclamp_max filter is what is being used. 
Based on the proposal > I posted long ago > > https://android-review.googlesource.com/c/kernel/common/+/1914696 > Maybe I need to rephrase. I said that GROUP_THROTTLE can give us some energy benefits. I didn't draw any conclusion that this was why it was invented. I was also trying to suggest that, under sum aggregation, the same code can work for both uclamp and GROUP_THROTTLE. However, I probably need to better understand the issues around cpu.shares before saying this concretely. Also, uclamp filtering might be another example of why I think max aggregation is over-complicated. Sum aggregation has only around 300 lines of code so far without the need for filtering at all. >> big CPU. However, a major difference is that this series has an >> amortized cost of O(1), instead of O(n) in cpu_util_next() in Android >> kernels. >> >> It avoids the shortcomings from uclamp-being-a-frequency-hint. Under sum > > This is not a short coming. Performance implies frequency. We should not > confused how schedutil works to estimate perf requirement based on historical > run time (util_avg) and the fact that a task with short historical runtime > needs to run faster. Running faster means increasing frequency. > >> aggregation, the scheduler sees the sum aggregated util_avg_uclamp and >> will avoid the problem of enqueueing too many tasks just to fit the >> frequency, and will place tasks on little CPUs in the previous example. > > I think the issue of uclamp being a performance but not a bandwidth hint was > raised several times. Did I miss something that makes this not converting it to > a bandwidth hint? > I wonder what a performance hint even means if it doesn't at least correlate to bandwidth in some way? Yes, specifying uclamp_min under max aggregation can run the CPU at least at that frequency, but if you have N tasks on the same rq, then each task can only get a share of uclamp_min/N.
I highly doubt that when a mobile phone developer increases the uclamp_min of an application, he or she is okay with doubling the CPU frequency when it runs and running it only for 1/10 of the time because it gets moved to a crowded big CPU, which essentially gives 0.2x the "performance" in my opinion, so I'm not sure fully disconnecting bandwidth from performance is meaningful. Of course, a (not yet) uclamp-aware load balancer could come in and migrate those tasks, but why not avoid this problem in the first place via sum aggregation without involving the load balancer, especially when today we don't have any indication of how well a uclamp-aware load balancer can perform anyway? In the end I may still not be able to fully grasp what the issues of sum aggregation are. I think so far I get that it's closer to "bandwidth" and not "performance" (though I think "performance" doesn't mean much if it doesn't correlate to bandwidth) and it doesn't work as a CPU affinity hint (it actually does to some degree, but the uclamp values need to be different). But I think focusing on these ignores the whole picture, namely the benefits of sum aggregation (already presented in the evaluation section of the cover letter of this series):

1. It's fewer than 350 lines of code (max aggregation is 750+ and still growing).

2. It doesn't suffer from frequency spikes or load balancing issues.

3. It needs no code to consider uclamp as a corner case in the scheduler. No need to carry around uclamp_min and uclamp_max in so many functions. util_fits_cpu() is just one line of code.

4. It's effective. When you say a uclamp_min of X, the scheduler really tries to give you X for your task, not for the CPU you are on, as seen in a 54% reduction of jank in jankbench, without increasing energy consumption.

5. It looks like something multiple developers want from LPC 2023.
The use cases of "additive uclamp", and hinting the host from the VM via util_guest, can be achieved exactly by uclamp sum aggregation. Given these benefits, I think it's probably a good idea to give sum aggregation a serious look. On my Pixel 6, sum aggregation either gives better benchmark scores at the same energy as max aggregation, or consumes less power at the same benchmark scores. I'll keep benchmarking it but so far so good. Hongyan
On Mon, 4 Dec 2023 at 02:48, Hongyan Xia <hongyan.xia2@arm.com> wrote: > > Hi Qais, > > On 03/12/2023 00:25, Qais Yousef wrote: > > Hi Hongyan > > > > On 10/04/23 10:04, Hongyan Xia wrote: > >> Current implementation of uclamp leaves many things to be desired. > >> There are several problems: > >> > >> 1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or > >> with the default value, which is 1024) can ruin the previous > >> settings the moment it is enqueued, as shown in the uclamp frequency > >> spike problem in Section 5.2 of > >> Documentation/scheduler/sched-util-clamp.rst. Constantly running at > >> 1024 utilization implies that the CPU is driven at its max capacity. > >> However, with UCLAMP_MAX, this assumption is no longer true. To > >> mitigate this, one idea is to filter out uclamp values for > >> short-running tasks. However, the effectiveness of this idea remains > >> to be seen. > > > > This was proved when I was at arm and it is used in products now too. It is > > effective. > > > >> > >> 2. No way to differentiate between UCLAMP_MAX throttled CPU or a CPU > >> running at its peak, as both show utilization of 1024. However, in > >> certain cases it's important to tell the difference, as we still want > >> to take the opportunity to enqueue tasks in the former case. > >> > >> It's also debatable whether uclamp should be a frequency hint. An > > > > It is not debatable. Uclamp is a performance hint which translates into task > > placement (for HMP) and frequency selection. On SMP the latter is the only > > meaning for uclamp. For the time being at least until we get a new technology > > that can impact performance ;-) > > I agree. In this new series, uclamp is still a performance hint which > affects task placement and frequency, so we don't have a disagreement here. > > >> example is a system with a mid CPU of capacity 500, and a little CPU > >> with capacity 100. 
If we have 10 tasks with UCLAMP_MIN of 101, then > >> util_fits_cpu() will schedule them on the mid CPU because feec() > >> thinks they don't fit on the little, even though we should use the > >> little core at some point instead of making the mid even more crowded. > > > > This is not a problem. One of the major design points for uclamp is to allow > > small tasks to request to run faster. So if you have 100 tasks that run for > > 100us and to meet that they need to run at mid CPU that is fine. That doesn't > > necessarily mean they'll overutilized the cluster. > > I understand. This design point is what sum aggregation does as well. > This alone is not a problem, but moving a CPU to another CPU to fit > uclamp_min requirement can be a problem, when the destination CPU is > crowded and mixed with uclamp_max. I will elaborate on this below. > > > > > If the cluster gets overutilized, then EAS will be disabled and > > find_idle_capacity() will spread into highest spare capacity idle CPU, which > > includes little CPUs. And this behavior is no different to have tasks with util > > greater than 100 wanting to run on mid cores. The only difference is that we > > can pack more on mid core before overutilized can be triggered. Which is the > > desired effect. > > > > So this is a no issue and we should spread once the cluster is busy. > >> Of course, this in reality doesn't happen because of other mechanisms > >> like load balancing, but it's also not good when other mechanisms can > >> cancel the effect of uclamp in feec(). A CPU running at capacity 500 for > >> 1ms or for 1000ms gives completely different performance levels, so > >> trying to fit only the frequency does not give us any performance > >> guarantees. If we then talk about not only running at some frequency but > >> also for some amount of time, then what we really mean is a capacity > >> hint, not a frequency hint. > > > > I'm not sure I parsed that. 
But again, it's a performance hint which implies > > both. > > > >> > >> It's even worse when the tasks scheduled on the mid CPU also have > >> UCLAMP_MAX values. In the previous example, instead of running at 500, a > >> single UCLAMP_MAX, assuming it's 110, can make the mid CPU run at a much > >> lower frequency than 500, so it is then a really bad decision to honor > >> the UCLAMP_MIN frequency hint and place it on the mid CPU, instead of > >> running it on the little CPU which is less crowded, and has pretty much > >> the same capacity (100 vs 110) anyway. > > > > I can't follow this reasoning. > > > > It seems you're assuming the user has picked 110 by mistake and it is okay to > > ignore it. > > > > FYI, one of the use cases is to set uclamp_min to capacity_of(little_core)+1 > > because they're so crappy to handle their workload. It is still better than > > using affinity as when the system gets busy we can still use the little core. > > But only when the system is busy and crowded already. > > > > I don't know what's the reference to the mid core having a task with > > uclamp_max, but seems a bug in feec() if not predicting the frequency > > correctly. > > The example in the cover letter might be complex. I can simplify as a "N > tasks" problem. > > Say 2 CPUs, one with capacity 1024 and the other 512. Assume there's a > task with uclamp_min 513 and uclamp_max of 800, then the big CPU can, in > theory, fit an infinite number of such tasks on the big CPU, without > triggering overutilization. What does uclamp_min even mean when this > happens? uclamp_min should be a hint of a minimum performance it should > get, but it breaks down when there are N tasks, so the hint from > uclamp_min is divided by N. Under max aggregation, uclamp_min doesn't > give you the "performance" of this task. It gives performance of the > CPU, and this is not the performance of this task when N > 1. In fact, N > can be infinity. 
This is the same problem shared by multiple uclamp > users from LPC, where max aggregation hints the performance of the CPU, > but not the task itself. > > As to "FYI, one of the use cases is to set uclamp_min to > capacity_of(little_core)+1 because they're so crappy to handle their > workload", I'm afraid I disagree with this. I think uclamp is a simple > hinting mechanism to the scheduler to say: please try to give me > [uclamp_min, uclamp_max] performance. It's then the job of EAS to pick You get the point. It's an EAS decision and must stay in EAS, not go into PELT. You have all the information: actual utilization, uclamp_min and uclamp_max. Make EAS smarter > the most efficient CPU. We know the little CPU is crappy because we have > deep knowledge of the device, but uclamp values could just come from > game developers (using ADPF) who have no knowledge of the SoC, and in > the game developer's view, asking uclamp of little CPU + 1 is simply > because he/she wants that performance, not because the developer thinks > the little CPU should be avoided. I don't think a soft-affinity-like > property belongs here. We should improve EM and let EAS do its job > instead of teaching it to interpret uclamp as affinity hints. > > >> > >> This series attempts to solve these problems by tracking a > >> util_avg_uclamp signal in tasks and cfs_rq. At task level, > > > > This was discussed offline and brought several times on the list already. > > uclamp is a performance not a bandwidth hint. I remember Dietmar was suggesting > > this concept in the past and I objected that this converts uclamp into > > a bandwidth hint. This overlaps with CFS bandwidth controller. We can use that > > if somebody wants to limit the bandwidth the task can consume. uclamp is about > > limiting the performance level it can reach. ie: it is not about cycles > > consumed, but Performance Point reached. > > I'd argue the current max aggregation implementation is further away > from a performance hint.
The moment there are more tasks interacting > with each other on the same rq, the performance hinting ability of max > aggregation tends to do worse, like uclamp_min being divided by N, > frequency spikes, uneven task distribution (both shown in the evaluation > section of my cover letter) etc, whereas sum aggregation still performs > well. > > Other shortcomings are not that critical, but the fact that uclamp_min's > effectiveness is divided by N under max aggregation I think is not > acceptable. Change EAS task placement policy in this case to take into account actual utilization and uclamp_min/max > > >> p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but > >> clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp > >> must always be the sum of all util_avg_uclamp of all the entities on > >> this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum > >> aggregation of all the clamped values, which indicates the frequency > >> this rq should run at and what the utilization is. > > > > This is a no go for me. A task performance requirements can not be added to > > resemble the 'size' of the tasks. Tasks with util_avg 100 but a uclamp_min of > > 1024 should allow us to pack them in the big core. Which is an important use > > case to use. A bunch of small tasks that need to run faster are okay to pack on > > a bigger core if they need to run faster as long as this doesn't cause > > overutilized to trigger. > > I see. I now understand where we start to diverge. It seems we want > completely different things from uclamp. > > Max aggregation: please try to place me on a CPU that can give CPU > capacity [uclamp_min, uclamp_max] and run that CPU within that range. > Sum aggregation: please try to treat me like a task with utilization > within [uclamp_min, uclamp_max]. > > Like I mentioned in the previous comment, I'm afraid I disagree with > using uclamp as a CPU affinity hint, even if it's just a soft affinity. 
> Its job should just be a performance hint, and the EM should tell EAS
> which CPU to pick under that hint. If the little CPU is crappy, then the
> crappiness should be reflected in the EM, and maybe a dynamic EM can
> help.
>
> >> This idea solves Problem 1 by capping the utilization of an
> >> always-running task throttled by UCLAMP_MAX. Although the task
> >> (denoted by Task A) has no idle time, the util_avg_uclamp signal gives
> >> its UCLAMP_MAX value instead of 1024, so even if another task (Task B)
> >> with a UCLAMP_MAX value of 1024 joins the rq, the util_avg_uclamp is
> >> A's UCLAMP_MAX plus B's utilization, instead of 1024 plus B's
> >> utilization, which means we no longer have the frequency spike problem
> >> caused by B. This should mean that we might completely avoid the need
> >> for uclamp filtering.
> >>
> >> It also solves Problem 2 by tracking the capped utilization of a rq.
> >> Using util_avg_uclamp, we are able to differentiate between a CPU
> >> throttled by UCLAMP_MAX and one that is actually running at its peak
> >> capacity. An always-running rq with a task having UCLAMP_MAX at 300
> >> will report a cfs_rq util_avg_uclamp of 300, not 1024, which means
> >> task placement sees spare capacity on this CPU and will allocate some
> >> tasks to it, instead of treating it the same way as a CPU actually
> >> running at
> >
> > We don't need to know the actual spare capacity, though that would be
> > nice to have.
> >
> > But generally, in my view this CPU should be available for non-capped
> > tasks to run on. I view this as a load balancer problem, where the
> > wake-up path can consider this CPU as a candidate and rely on the load
> > balancer to move this task away to another (smaller) CPU ASAP. If this
> > capped task is already on the smaller CPU and there's nowhere else to
> > place the uncapped task in the system, then we must be quite busy
> > already.
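[Editor's note: the Problem 1 scenario quoted above — the frequency spike when Task B joins Task A's rq — can be sketched numerically. A toy Python model, with made-up utilization numbers; not kernel code.]

```python
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

UCLAMP_MAX_A = 300   # Task A: always running (raw util ~1024), capped at 300
UTIL_B = 200         # Task B: uncapped (uclamp_max = 1024), moderate util

# Max aggregation: once B joins, the rq-level uclamp_max becomes
# max(300, 1024) = 1024, so A's raw 1024 is no longer capped and the
# requested level spikes to the ceiling.
rq_max_agg = clamp(1024 + UTIL_B, 0, max(UCLAMP_MAX_A, 1024))

# Sum aggregation: A contributes clamp(1024, 0, 300) = 300 regardless of
# who else is runnable, so B only adds its own utilization on top.
rq_sum_agg = clamp(1024, 0, UCLAMP_MAX_A) + UTIL_B

print(rq_max_agg)  # 1024 (frequency spike caused by B joining)
print(rq_sum_agg)  # 500  (A's cap plus B's utilization)
```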
>
> I'm not quite sure relying on the load balancer is a good idea, because
> then it means we need uclamp code in the load balancer and the

But you will have to, because uclamp_max can make any task behave like an
always-running task (i.e. one without a wakeup event to migrate the task).

> complexity keeps growing. An example is that after applying the patch
>
> 'sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0'
>
> even though the scheduler now knows that CPUs with 0 spare capacity are
> now candidates, it creates a very unbalanced task placement, and CPU0
> tends to receive a lot of tasks in my tests. This certainly needs some
> uclamp-aware code in the load balancer, whereas sum aggregation has no
> such problem, and balance is achieved during task placement in the first
> place.
>
> > If userspace ends up capping too many tasks so hard that we end up
> > with no idle time in the system (where otherwise we should have some),
> > I consider this bad usage in userspace and will ask them not to cap
> > that hard. This is a case of shooting oneself in the foot. Maybe we
> > need to care in the future, but so far this doesn't seem a valid
> > problem to address. I'd like to see why userspace has failed to do
> > proper management first before jumping to fix this.
> >
> > What I'd expect is more use cases of tasks that are truly 1024 but
> > being capped. In this case there's truly no idle time on the CPU and
> > your proposal won't address this. I don't think we have a way to
> > distinguish between the two cases at the moment.
>
> I don't quite get what you mean. In the examples in my cover letter the
> capped tasks are truly 1024. Two CPUs, one with an uncapped 1024 task
> and the other with a 1024 task but with uclamp_max of 300. The scheduler
> sees cfs_rq->avg.util_avg_uclamp of 1024 on the first CPU and 300 on the
> second, so my proposal indeed sees the difference, and will have the
> opportunity to place more tasks on the second CPU without any
> corner-case code.
> >> the peak. This brings us closer to the benefits of Android's sum
> >> aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at
> >> https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510),
> >> which shows energy benefits because we are able to schedule tasks on
> >> smaller cores which are UCLAMP_MAX capped, instead of finding a fresh
> >
> > You're jumping to the wrong conclusion about why this is being used.
> > This was to overcome issues when using cpu.shares. It has nothing to do
> > with UCLAMP_MAX. The uclamp_max filter is what is being used, based on
> > the proposal I posted long ago:
> >
> > https://android-review.googlesource.com/c/kernel/common/+/1914696
>
> Maybe I need to rephrase. I said that GROUP_THROTTLE can give us some
> energy benefits. I didn't draw any conclusion that this was why it was
> invented. I was also trying to suggest that, under sum aggregation, the
> same code can work for both uclamp and GROUP_THROTTLE. However, I
> probably need to better understand the issues around cpu.shares before
> saying this concretely.
>
> Also, uclamp filtering might be another example of why I think max
> aggregation is over-complicated. Sum aggregation has only around 300
> lines of code so far, without the need for filtering at all.
>
> >> big CPU. However, a major difference is that this series has an
> >> amortized cost of O(1), instead of O(n) in cpu_util_next() in Android
> >> kernels.
> >>
> >> It avoids the shortcomings of uclamp-being-a-frequency-hint. Under sum
> >
> > This is not a shortcoming. Performance implies frequency. We should not
> > confuse how schedutil estimates perf requirements based on historical
> > run time (util_avg) with the fact that a task with a short historical
> > runtime needs to run faster. Running faster means increasing frequency.
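[Editor's note: the O(1)-vs-O(n) claim quoted above can be illustrated with a toy model. The `RqSignal` class and its methods are hypothetical names, not kernel code; the point is only that a running sum adjusted at enqueue/dequeue avoids walking all tasks.]

```python
class RqSignal:
    """Toy rq-level signal: maintain the sum aggregation incrementally,
    instead of recomputing over all runnable tasks (O(n)) on every query,
    as cpu_util_next() in the Android kernels does."""

    def __init__(self):
        self.util_avg_uclamp = 0

    def enqueue(self, task_clamped_util):
        self.util_avg_uclamp += task_clamped_util   # O(1) per enqueue

    def dequeue(self, task_clamped_util):
        self.util_avg_uclamp -= task_clamped_util   # O(1) per dequeue

rq = RqSignal()
for u in (300, 200, 150):   # three tasks' clamped utilizations join the rq
    rq.enqueue(u)
rq.dequeue(200)             # one task leaves
print(rq.util_avg_uclamp)   # 300 + 200 + 150 - 200 = 450
```

Reading the aggregate is then a single field access, which is where the amortized O(1) cost comes from.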
> >> aggregation, the scheduler sees the sum-aggregated util_avg_uclamp
> >> and will avoid the problem of enqueueing too many tasks just to fit
> >> the frequency, and will place tasks on little CPUs in the previous
> >> example.
> >
> > I think the issue of uclamp being a performance hint but not a
> > bandwidth hint was raised several times. Did I miss something that
> > makes this not convert it into a bandwidth hint?
>
> I wonder what a performance hint even means if it doesn't at least
> correlate with bandwidth in some way? Yes, specifying uclamp_min under
> max aggregation can run the CPU at least at that frequency, but if you
> have N tasks on the same rq, then each task can only get a share of
> uclamp_min/N. I highly doubt that when a mobile phone developer
> increases the uclamp_min of an application, he or she is okay with
> doubling the CPU frequency when it runs but running it for only 1/10 of
> the time because it gets moved to a crowded big CPU, which essentially
> gives 0.2x the "performance" in my opinion, so I'm not sure fully
> disconnecting bandwidth from performance is meaningful. Of course, a
> (not yet existing) uclamp-aware load balancer could come in and migrate
> those tasks, but why not avoid this problem in the first place via sum
> aggregation, without involving the load balancer, especially when today
> we don't have any indication of how well a uclamp-aware load balancer
> can perform anyway?
>
> In the end I may still not be able to fully grasp what the issues of
> sum aggregation are. So far I get that it's closer to "bandwidth" and
> not "performance" (and "performance" doesn't mean much to me if it
> doesn't correlate with bandwidth), and that it doesn't work as a CPU
> affinity hint (it actually does to some degree, but the uclamp values
> need to be different).
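[Editor's note: the "0.2x performance" figure above comes from simple arithmetic, shown here for clarity. The 2x frequency and 1/10 runtime share are the example numbers from the paragraph, not measurements.]

```python
freq_boost = 2.0      # uclamp_min doubles the CPU frequency while running
time_share = 1 / 10   # the task shares a crowded rq, getting 1/10 of the time
effective = freq_boost * time_share
print(effective)      # 0.2, i.e. 0.2x the intended "performance"
```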
> But I think focusing on these ignores the whole picture, which is the
> set of benefits of sum aggregation (already presented in the evaluation
> section of the cover letter of this series):
>
> 1. It's fewer than 350 lines of code (max aggregation is 750+ and still
> growing).
> 2. It doesn't suffer from frequency spikes or load balancing issues.
> 3. It needs no code to treat uclamp as a corner case in the scheduler.
> No need to carry around uclamp_min and uclamp_max in so many functions.
> util_fits_cpu() is just one line of code.
> 4. It's effective. When you say a uclamp_min of X, the scheduler really
> tries to give you X for your task, not for the CPU you are on, as seen
> in the 54% reduction of jank in Jankbench, without increasing energy
> consumption.
> 5. It looks like something multiple developers wanted at LPC 2023. The
> use cases of "additive uclamp", and hinting the host from the VM via
> util_guest, can be achieved exactly by uclamp sum aggregation.
>
> Given these benefits, I think it's probably a good idea to give sum
> aggregation a serious look. On my Pixel 6, sum aggregation either gives
> better benchmark scores at the same energy as max aggregation, or
> consumes less power at the same benchmark scores. I'll keep
> benchmarking it, but so far so good.
>
> Hongyan
On 04/12/2023 16:12, Vincent Guittot wrote:
> On Mon, 4 Dec 2023 at 02:48, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>>
>> [...]
>>
>> Other shortcomings are not that critical, but the fact that uclamp_min's
>> effectiveness is divided by N under max aggregation I think is not
>> acceptable.
>
> Change EAS task placement policy in this case to take into account
> actual utilization and uclamp_min/max

Thank you. I agree. I want to emphasize this specifically because this is
exactly what I'm trying to do. The whole series can be rephrased in a
different way:

- The PELT signal is distorted when uclamp is active.
- Let's consider the [PELT, uclamp_min, uclamp_max] tuple.
- Always carrying all three variables is too much, but
  [PELT, clamped(PELT)] is an approximation that works really well.

Of course, I'll explore whether there's a way to make things less messy.
I just realized why I didn't do things the util_est way but instead clamp
directly on PELT: util_est only boosts util_avg, so it can't work for
uclamp_max. I'll keep exploring options.

>> [...]
On Tue, 5 Dec 2023 at 16:19, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>
> On 04/12/2023 16:12, Vincent Guittot wrote:
> > On Mon, 4 Dec 2023 at 02:48, Hongyan Xia <hongyan.xia2@arm.com> wrote:
> >>
> >> [...]
> >>
> >> Other shortcomings are not that critical, but the fact that
> >> uclamp_min's effectiveness is divided by N under max aggregation
> >> I think is not acceptable.
> >
> > Change EAS task placement policy in this case to take into account
> > actual utilization and uclamp_min/max
>
> Thank you. I agree. I want to emphasize this specifically because this
> is exactly what I'm trying to do. The whole series can be rephrased in a
> different way:
>
> - The PELT signal is distorted when uclamp is active.

Sorry, but no, it's not.

> - Let's consider the [PELT, uclamp_min, uclamp_max] tuple.

That's what we are already doing with effective_cpu_util(). We might want
to improve how we use them in EAS, but that's another story than your
proposal.

> - Always carrying all three variables is too much, but [PELT,
> clamped(PELT)] is an approximation that works really well.

As said before, it's a no-go for this mix.

>
> Of course, I'll explore if there's a way to make things less messy. I
> just realized why I didn't do things util_est way but instead directly
> clamping on PELT, it's because util_est boosts util_avg and can't work
> for uclamp_max. I'll keep exploring options.
>
> >> [...]
On 05/12/2023 16:26, Vincent Guittot wrote:
> On Tue, 5 Dec 2023 at 16:19, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>>
>> On 04/12/2023 16:12, Vincent Guittot wrote:
>>> On Mon, 4 Dec 2023 at 02:48, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>>>>
>>>> [...]
>>>>
>>>> Other shortcomings are not that critical, but the fact that uclamp_min's
>>>> effectiveness is divided by N under max aggregation I think is not
>>>> acceptable.
>>>
>>> Change EAS task placement policy in this case to take into account
>>> actual utilization and uclamp_min/max
>>
>> Thank you. I agree. I want to emphasize this specifically because this
>> is exactly what I'm trying to do. The whole series can be rephrased in a
>> different way:
>>
>> - The PELT signal is distorted when uclamp is active.
>
> Sorry but no it's not
>
>> - Let's consider the [PELT, uclamp_min, uclamp_max] tuple.
>
> That's what we are already doing with effective_cpu_util. We might
> want to improve how we use them in EAS but that's another story than
> your proposal

It's different. We never capture how we *got* the PELT value. If we wake
up a task, what we do now is to consult the following:

[p->util_avg, p->uclamp_min, p->uclamp_max, target_rq->uclamp_min,
target_rq->uclamp_max]

But to best understand how big this task really was, we want:

[p->util_avg, previous_rq->uclamp_min_back_then,
previous_rq->uclamp_max_back_then]

Without such information, issues cannot be avoided, because we have no
idea how big the task really was. Frequency spikes are just one of the
symptoms of mis-interpreting how big the task was.

>> - Always carrying all three variables is too much, but [PELT,
>> clamped(PELT)] is an approximation that works really well.
>
> As said before. It's a no go for this mix

I see your concern.
To rephrase this series again, I'm simply arguing that

[p->util_avg, previous_uclamp_min, previous_uclamp_max]

is better than

[p->util_avg, p->uclamp_min, p->uclamp_max, target_rq->uclamp_min,
target_rq->uclamp_max]

plus the code to mitigate the issues in estimating how big the task is.
I anticipate this series to be significantly smaller than the current
max aggregation approach plus the future code to mitigate its problems,
but I'll keep trying to improve it to hopefully address your concerns.

>>
>> Of course, I'll explore if there's a way to make things less messy. I
>> just realized why I didn't do things util_est way but instead directly
>> clamping on PELT, it's because util_est boosts util_avg and can't work
>> for uclamp_max. I'll keep exploring options.
>>
>>>> [...]
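[Editor's note: the [PELT, clamped(PELT)] pair argued for above can be sketched as a toy model. `TaskSignal` and its fields mirror the names discussed in the thread (util_avg, util_avg_uclamp) but this is plain Python, not the series' kernel code; the update rule is a simplification of the actual PELT accounting.]

```python
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

class TaskSignal:
    """Carry both the raw PELT signal and its clamped counterpart, so the
    clamp context in effect while the history accumulated is baked into
    the signal itself, instead of being re-derived at wakeup time."""

    def __init__(self, uclamp_min=0, uclamp_max=1024):
        self.uclamp_min = uclamp_min
        self.uclamp_max = uclamp_max
        self.util_avg = 0          # raw PELT
        self.util_avg_uclamp = 0   # clamped(PELT)

    def update(self, util_avg):
        self.util_avg = util_avg
        # Clamp with the uclamp values in effect *now*, so later readers
        # don't need to know which rq (and which clamps) the task ran under.
        self.util_avg_uclamp = clamp(util_avg, self.uclamp_min, self.uclamp_max)

t = TaskSignal(uclamp_min=0, uclamp_max=300)
t.update(1024)                           # always-running while capped at 300
print(t.util_avg, t.util_avg_uclamp)     # 1024 300
```

At wakeup, task placement can then read `util_avg_uclamp` directly, rather than combining `util_avg` with the target rq's current clamps.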