From patchwork Fri Dec  8 01:52:38 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Qais Yousef <qyousef@layalina.io>
X-Patchwork-Id: 17723
From: Qais Yousef <qyousef@layalina.io>
To: Ingo Molnar <mingo@kernel.org>,
        Peter Zijlstra <peterz@infradead.org>,
        "Rafael J. Wysocki" <rafael@kernel.org>,
        Viresh Kumar <viresh.kumar@linaro.org>,
        Vincent Guittot <vincent.guittot@linaro.org>,
        Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
        Lukasz Luba <lukasz.luba@arm.com>, Wei Wang <wvw@google.com>,
        Rick Yiu <rickyiu@google.com>,
        Chung-Kai Mei <chungkai@google.com>,
        Hongyan Xia <hongyan.xia2@arm.com>,
        Qais Yousef <qyousef@layalina.io>
Subject: [PATCH 0/4] sched: cpufreq: Remove uclamp max-aggregation
Date: Fri,  8 Dec 2023 01:52:38 +0000
Message-Id: <20231208015242.385103-1-qyousef@layalina.io>
X-Mailer: git-send-email 2.34.1
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

One of the practical issues that has come up when deploying uclamp is making
uclamp_max more effective. Due to max-aggregation at the rq, the effective
uclamp_max value for a capped task can easily be lifted by any other task that
gets enqueued. This happens often enough in practice to make the cap
ineffective, leaving much to be desired.
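
To illustrate the problem, here is a minimal userspace sketch of
max-aggregation. It is not kernel code: struct task_clamp and rq_uclamp_max()
are hypothetical names made up for this example, and the real implementation
tracks clamps in buckets rather than walking a list of tasks.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a task's clamp; not a kernel struct. */
struct task_clamp {
	unsigned int uclamp_max;	/* 0..1024, 1024 means uncapped */
};

/*
 * Max-aggregation: the rq-level uclamp_max is the maximum of the
 * uclamp_max values of all currently enqueued tasks.
 * Simplification: an empty rq yields 0 here.
 */
static unsigned int rq_uclamp_max(const struct task_clamp *tasks, size_t nr)
{
	unsigned int max = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (tasks[i].uclamp_max > max)
			max = tasks[i].uclamp_max;

	return max;
}
```

With only a task capped at 90 enqueued, the rq value is 90. As soon as an
uncapped task (1024) is enqueued next to it, the rq value becomes 1024 and the
cap no longer limits the frequency request.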

One of the solutions attempted out of tree was to filter out short running
tasks based on sched_slice(), and this proved to be effective enough:

	https://android-review.googlesource.com/c/kernel/common/+/1914696

But the solution is not upstream friendly as it introduces more magic
thresholds, and it is not clear how it would work after the EEVDF changes got
merged in.

In principle, the max-aggregation is required to address the frequency hint
part of uclamp. We don't need it in the wakeup path, as the per-task value
should let us know whether the task fits a CPU or not on HMP systems.

As pointed out in the remove magic hardcoded margins series [1], the limitation
is actually in the governor/hardware, which is constrained in how quickly it
can change frequencies, and uclamp usage can induce frequency changes at a high
rate.

To address this, we can make it the cpufreq governor's job to deal with
whatever limitation it has in responding to a task's perf requirement hints.
This means we need to tell the governor that we need a frequency update to
cater for a task's perf hints, hence the new SCHED_CPUFREQ_PERF_HINTS flag.

With the new flag, we can then send special updates to the cpufreq governor on
context switch *if* the incoming task has perf requirements that aren't met
already.
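
As a rough sketch of what "aren't met already" could mean, the check below
compares the performance level the CPU is currently running at against the
task's clamps. This is an assumption made for illustration and is not taken
from the patches: struct perf_hints, perf_hints_unmet() and the idea of a
single cur_perf value on the 0..1024 capacity scale are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical view of the incoming task's hints, on the 0..1024 scale. */
struct perf_hints {
	unsigned int uclamp_min;	/* task wants at least this much */
	unsigned int uclamp_max;	/* task should get at most this much */
};

/*
 * Return true when the current performance level falls outside the
 * task's [uclamp_min, uclamp_max] range, i.e. when an update carrying
 * the perf hints flag would be worth sending at context switch.
 */
static bool perf_hints_unmet(unsigned int cur_perf, const struct perf_hints *h)
{
	return cur_perf < h->uclamp_min || cur_perf > h->uclamp_max;
}
```

A task boosted to 512 that is switched in while the CPU runs at 256 would
trigger an update, and so would a task capped at 90 that is switched in while
the CPU runs at 1024. A task whose range already covers the current level
would not.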

The governor can then choose to honour or ignore these requests based on
whatever criteria it sees fit. For schedutil I made use of the new
approximate_util_avg() to create a threshold based on rate_limit_us, and ignore
perf requirements for tasks whose util_avg tells us they have historically
run for less time than the hardware needs to change frequency. Such tasks
will go back to sleep before they see the freq change, so honouring their
request is pointless, hence we ignore it.

Instead of being exact, I made an assumption that the task has to run for at
least 500us above rate_limit_us. The number is magical, but it is what I've
seen in practice as a runtime in which a task can actually do meaningful work.
The exact definition of the threshold is open to debate. We can make it
1.5 * rate_limit_us for example. I preferred the absolute number as, even on
lower end hardware, I think this is a meaningful unit of time for producing
results that can make an impact. There's the hidden assumption that most
modern hardware already has fast enough DVFS. The absolute number fails for
super fast hardware though.
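
The threshold idea can be sketched in userspace as follows. This does not
reproduce approximate_util_avg(). It assumes a simplified PELT model with
a 32ms half-life and 1ms periods, where a task starting from zero
utilization ramps towards 1024 while it runs. The helper names
util_after_running() and ignore_perf_hint() are hypothetical.

```c
#include <stdbool.h>

#define PELT_MAX_UTIL	1024.0
/* Per-millisecond decay factor y with y^32 = 0.5 (32ms half-life). */
#define PELT_Y		0.97857206

/* Utilization reached by a task that starts from zero and runs for runtime_us. */
static double util_after_running(unsigned int runtime_us)
{
	double decay = 1.0;
	unsigned int ms = runtime_us / 1000;
	unsigned int rem_us = runtime_us % 1000;
	unsigned int i;

	for (i = 0; i < ms; i++)
		decay *= PELT_Y;
	/* Linear interpolation for the sub-millisecond remainder. */
	decay *= 1.0 - (1.0 - PELT_Y) * rem_us / 1000.0;

	return PELT_MAX_UTIL * (1.0 - decay);
}

/*
 * Ignore the perf hint when the task's util_avg is below what a task
 * running for rate_limit_us + 500us would have accumulated.
 */
static bool ignore_perf_hint(double task_util_avg, unsigned int rate_limit_us)
{
	return task_util_avg < util_after_running(rate_limit_us + 500);
}
```

In this model, rate_limit_us = 2000 gives a minimum runtime of 2500us, which
maps to a util_avg of roughly 54. A task with util_avg of 20 would have its
perf hints ignored, while a task with util_avg of 200 would be honoured.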

This allows us to remove uclamp max-aggregation logic completely and moves the
problem to be a cpufreq governor problem instead. Which I think is the right
thing to do as the scheduler was overcompensating for what is in reality
a cpufreq governor limitation and policy. We just need to improve the
interface with the governor.

Implementing different policies/strategies to deal with the problem would be
easier now that the problem space has moved to the governor. And it keeps the
scheduler code clean and focused on things that matter from a scheduling point
of view only.

For example, max-aggregation can be implemented in the governor by adding a new
flag when sending cpufreq_update_util() at enqueue/dequeue_task(). Not that
I think it's a good idea, but the possibility is there. Especially if platforms
like x86 have a lot of intelligence in firmware and they'd like to implement
something smarter at that level. They'll just need to improve the interface
with the governor.

===

This series is based on the remove margins series [1], and the data is
collected against it as a baseline.

Testing on Pixel 6 with a mainline(ish) kernel

==

Speedometer browser benchmark

       | baseline  | 1.25 headroom |   patch   | patch + 1.25 headroom
-------+-----------+---------------+-----------+---------------------
score  |  108.03   |     135.72    |   108.09  |    137.47
-------+-----------+---------------+-----------+---------------------
power  |  1204.75  |    1451.79    |  1216.17  |    1484.73
-------+-----------+---------------+-----------+---------------------

No regressions.

===

UiBench

       | baseline  | 1.25 headroom |   patch   | patch + 1.25 headroom
-------+-----------+---------------+-----------+---------------------
jank   |    68.0   |      56.0     |    65.6   |    57.3
-------+-----------+---------------+-----------+---------------------
power  |   146.54  |     164.49    |   144.91  |    167.57
-------+-----------+---------------+-----------+---------------------

No regressions.

===

Spawning 8 busyloop threads, each pinned to a CPU with uclamp_max set to
a low-ish OPP:

```
adb shell "uclampset -M 90 taskset -a 01 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 90 taskset -a 02 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 90 taskset -a 04 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 90 taskset -a 08 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 270 taskset -a 10 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 270 taskset -a 20 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 670 taskset -a 40 cat /dev/zero > /dev/null &" &
adb shell "uclampset -M 670 taskset -a 80 cat /dev/zero > /dev/null &" &
```

And running speedometer for a single iteration

       | baseline  |   patch   |
-------+-----------+-----------+
score  |   73.44   |   75.62   |
-------+-----------+-----------+
power  |   520.46  |   489.49  |
-------+-----------+-----------+

Similar score at lower power.

Little's Freq Residency:

         | baseline  |   patch   |
---------+-----------+-----------+
OPP@90   |   29.59%  |   49.25%  |
---------+-----------+-----------+
OPP@max  |   40.02%  |   12.31%  |
---------+-----------+-----------+

Mid's Freq Residency:

         | baseline  |   patch   |
---------+-----------+-----------+
OPP@270  |   50.02%  |   77.53%  |
---------+-----------+-----------+
OPP@max  |   49.17%  |   22.46%  |
---------+-----------+-----------+

Big's Freq Residency:

         | baseline  |   patch   |
---------+-----------+-----------+
OPP@670  |   46.43%  |   54.44%  |
---------+-----------+-----------+
OPP@max  |   1.76%   |   4.57%   |
---------+-----------+-----------+

As you can see, the residency at the capped frequency has increased
considerably for all clusters. The time spent running at max frequency is
reduced significantly for little and mid. For big there's a slight increase,
but both numbers are suspiciously low. With the busy loops in the background,
these cores are more subject to throttling, so the higher number indicates
we've been throttled less.
---

Patch 1 cleans up the code that sends cpufreq_update_util() to be more
intentional and less noisy.

Patch 2 removes uclamp-max aggregation and implements sending
cpufreq_update_util() updates at context switch instead.

Patch 3 implements the logic to filter short running tasks compared to
rate_limit_us and adds the perf hint flag at task_tick_fair().

Patch 4 updates uclamp docs to reflect the new changes.

[1] https://lore.kernel.org/lkml/20231208002342.367117-1-qyousef@layalina.io/

Thanks!

--
Qais Yousef

Qais Yousef (4):
  sched/fair: Be less aggressive in calling cpufreq_update_util()
  sched/uclamp: Remove rq max aggregation
  sched/schedutil: Ignore update requests for short running tasks
  sched/documentation: Remove reference to max aggregation

 Documentation/scheduler/sched-util-clamp.rst | 239 ++---------
 include/linux/sched.h                        |  16 -
 include/linux/sched/cpufreq.h                |   3 +-
 init/Kconfig                                 |  31 --
 kernel/sched/core.c                          | 426 +++----------------
 kernel/sched/cpufreq_schedutil.c             |  61 ++-
 kernel/sched/fair.c                          | 157 ++-----
 kernel/sched/rt.c                            |   4 -
 kernel/sched/sched.h                         | 150 ++-----
 9 files changed, 244 insertions(+), 843 deletions(-)