Message ID | 20230205224318.2035646-2-qyousef@layalina.io |
---|---|
State | New |
Headers |
From: Qais Yousef <qyousef@layalina.io>
To: Ingo Molnar <mingo@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Vincent Guittot <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: linux-kernel@vger.kernel.org, Lukasz Luba <lukasz.luba@arm.com>, Wei Wang <wvw@google.com>, Xuewen Yan <xuewen.yan94@gmail.com>, Hank <han.lin@mediatek.com>, Jonathan JMChen <Jonathan.JMChen@mediatek.com>, Qais Yousef <qyousef@layalina.io>
Subject: [PATCH v2 1/3] sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
Date: Sun, 5 Feb 2023 22:43:16 +0000
Message-Id: <20230205224318.2035646-2-qyousef@layalina.io>
In-Reply-To: <20230205224318.2035646-1-qyousef@layalina.io>
References: <20230205224318.2035646-1-qyousef@layalina.io>
Series | Fix a couple of corner cases in feec() when using uclamp_max |
Commit Message
Qais Yousef
Feb. 5, 2023, 10:43 p.m. UTC
When uclamp_max is being used, the util of the task could be higher than
the spare capacity of the CPU, but due to the uclamp_max value we force-fit
it there.

The way the condition for checking max_spare_cap in
find_energy_efficient_cpu() was constructed, it ignored any CPU whose
spare_cap is less than or _equal_ to max_spare_cap. Since we initialize
max_spare_cap to 0, this led to never setting max_spare_cap_cpu and
hence never performing compute_energy() for this cluster, missing an
opportunity for a more energy-efficient placement that honours the
uclamp_max setting.

max_spare_cap = 0;
cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high
...
util_fits_cpu(...); // will return true if uclamp_max forces it to fit
...
// this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
if (cpu_cap > max_spare_cap) {
	max_spare_cap = cpu_cap;
	max_spare_cap_cpu = cpu;
}

prev_spare_cap suffers from a similar problem.
Fix the logic by converting the variables into long and treating a -1
value as 'not populated' instead of 0, which is a viable and correct
spare capacity value.

Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
kernel/sched/fair.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
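
[Editorial note: the failure mode is easy to reproduce outside the kernel. Below is a minimal, self-contained userspace C sketch (not kernel code; the CPU count and spare-capacity values are made up) contrasting the old 0-initialized max_spare_cap with the signed -1 sentinel the patch introduces.]

#include <stdio.h>

int main(void)
{
	/* Three CPUs whose spare capacity is already 0, i.e. the task only
	 * "fits" because uclamp_max forces it to. */
	unsigned long spare_cap[3] = { 0, 0, 0 };

	/* Old logic: 0 doubles as "not populated", so a CPU with exactly 0
	 * spare capacity can never be recorded. */
	unsigned long max_spare_cap_old = 0;
	int max_spare_cap_cpu_old = -1;

	/* Fixed logic: -1 means "not populated"; 0 is a valid spare cap.
	 * The cast keeps the comparison signed (see the discussion below). */
	long max_spare_cap_new = -1;
	int max_spare_cap_cpu_new = -1;

	for (int cpu = 0; cpu < 3; cpu++) {
		if (spare_cap[cpu] > max_spare_cap_old) {
			max_spare_cap_old = spare_cap[cpu];
			max_spare_cap_cpu_old = cpu;
		}
		if ((long)spare_cap[cpu] > max_spare_cap_new) {
			max_spare_cap_new = (long)spare_cap[cpu];
			max_spare_cap_cpu_new = cpu;
		}
	}

	/* Prints "old: cpu=-1 new: cpu=0": the old logic never picks a
	 * max_spare_cap_cpu, so compute_energy() is skipped for the PD. */
	printf("old: cpu=%d new: cpu=%d\n",
	       max_spare_cap_cpu_old, max_spare_cap_cpu_new);
	return 0;
}
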
Comments
On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote: > > When uclamp_max is being used, the util of the task could be higher than > the spare capacity of the CPU, but due to uclamp_max value we force fit > it there. > > The way the condition for checking for max_spare_cap in > find_energy_efficient_cpu() was constructed; it ignored any CPU that has > its spare_cap less than or _equal_ to max_spare_cap. Since we initialize > max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and > hence ending up never performing compute_energy() for this cluster and > missing an opportunity for a better energy efficient placement to honour > uclamp_max setting. > > max_spare_cap = 0; > cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high > > ... > > util_fits_cpu(...); // will return true if uclamp_max forces it to fit > > ... > > // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0 > if (cpu_cap > max_spare_cap) { > max_spare_cap = cpu_cap; > max_spare_cap_cpu = cpu; > } > > prev_spare_cap suffers from a similar problem. > > Fix the logic by converting the variables into long and treating -1 > value as 'not populated' instead of 0 which is a viable and correct > spare capacity value. > > Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions") > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> > --- > kernel/sched/fair.c | 9 ++++----- > 1 file changed, 4 insertions(+), 5 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index c6c8e7f52935..7a21ee74139f 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7382,11 +7382,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > for (; pd; pd = pd->next) { > unsigned long util_min = p_util_min, util_max = p_util_max; > unsigned long cpu_cap, cpu_thermal_cap, util; > - unsigned long cur_delta, max_spare_cap = 0; > + long prev_spare_cap = -1, max_spare_cap = -1; > unsigned long rq_util_min, rq_util_max; > - unsigned long prev_spare_cap = 0; > + unsigned long cur_delta, base_energy; > int max_spare_cap_cpu = -1; > - unsigned long base_energy; > int fits, max_fits = -1; > > cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask); > @@ -7461,7 +7460,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > } > } > > - if (max_spare_cap_cpu < 0 && prev_spare_cap == 0) > + if (max_spare_cap_cpu < 0 && prev_spare_cap < 0) > continue; > > eenv_pd_busy_time(&eenv, cpus, p); > @@ -7469,7 +7468,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > base_energy = compute_energy(&eenv, pd, cpus, p, -1); > > /* Evaluate the energy impact of using prev_cpu. */ > - if (prev_spare_cap > 0) { > + if (prev_spare_cap > -1) { > prev_delta = compute_energy(&eenv, pd, cpus, p, > prev_cpu); > /* CPU utilization has changed */ I think that you also need the change below to make sure that the signed comparison will be used. 
I have quickly checked the assembly code for aarch64 and your patch keeps using unsigned comparison (b.ls):

    ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
    ffff8000080e4c94: f94067e0   ldr  x0, [sp, #200]
    ffff8000080e4c98: eb00003f   cmp  x1, x0
    ffff8000080e4c9c: 54ff98a9   b.ls ffff8000080e3fb0 <select_task_rq_fair+0x570>  // b.plast

Whereas the change below makes it use the signed version (b.le):

    ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
    ffff8000080e4c94: f94067e0   ldr  x0, [sp, #200]
    ffff8000080e4c98: eb00003f   cmp  x1, x0
    ffff8000080e4c9c: 54ff98ad   b.le ffff8000080e3fb0 <select_task_rq_fair+0x570>

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7522,7 +7522,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 				prev_spare_cap = cpu_cap;
 				prev_fits = fits;
 			} else if ((fits > max_fits) ||
-				   ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
+				   ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
 				/*
 				 * Find the CPU with the maximum spare capacity
 				 * among the remaining CPUs in the performance

> --
> 2.25.1
>
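
[Editorial note: for readers less familiar with C's integer conversion rules, the issue Vincent points out can be shown in a few lines of standalone (non-kernel) C; the variable names mirror the kernel ones, the values are invented.]

#include <stdio.h>

int main(void)
{
	unsigned long cpu_cap = 100;	/* some spare capacity */
	long max_spare_cap = -1;	/* 'not populated' sentinel */

	/* The usual arithmetic conversions turn -1 into ULONG_MAX here, so
	 * this unsigned comparison (b.ls on aarch64) is never true. */
	if (cpu_cap > max_spare_cap)
		printf("unsigned compare: taken\n");
	else
		printf("unsigned compare: NOT taken\n");

	/* Casting cpu_cap keeps the comparison signed (b.le), which is what
	 * the suggested hunk does. */
	if ((long)cpu_cap > max_spare_cap)
		printf("signed compare: taken\n");

	return 0;
}
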
On 07/02/2023 10:45, Vincent Guittot wrote: > On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote: >> >> When uclamp_max is being used, the util of the task could be higher than >> the spare capacity of the CPU, but due to uclamp_max value we force fit >> it there. >> >> The way the condition for checking for max_spare_cap in >> find_energy_efficient_cpu() was constructed; it ignored any CPU that has >> its spare_cap less than or _equal_ to max_spare_cap. Since we initialize >> max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and >> hence ending up never performing compute_energy() for this cluster and >> missing an opportunity for a better energy efficient placement to honour >> uclamp_max setting. >> >> max_spare_cap = 0; >> cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high >> >> ... >> >> util_fits_cpu(...); // will return true if uclamp_max forces it to fit s/true/1/ ? >> >> ... >> >> // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0 >> if (cpu_cap > max_spare_cap) { >> max_spare_cap = cpu_cap; >> max_spare_cap_cpu = cpu; >> } >> >> prev_spare_cap suffers from a similar problem. >> >> Fix the logic by converting the variables into long and treating -1 >> value as 'not populated' instead of 0 which is a viable and correct >> spare capacity value. The issue I see here is in case we don't have any spare capacity left, the energy calculation (in terms CPU utilization) isn't correct anymore. Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()` you never know how big the `busy_time` for the PD really is in this moment. eenv_pd_busy_time() for_each_cpu(cpu, pd_cpus) busy_time += effective_cpu_util(..., ENERGY_UTIL, ...) ^^^^^^^^^ with: sum_util = min(busy_time + task_busy_time, pd_cap) ^^^^^^^^^ freq = (1.25 * max_util / max) * max_freq energy = (perf_state(freq)->cost / max) * sum_util energy is not related to CPU utilization anymore (since there is no idle time/spare capacity) left. So EAS keeps packing on the cheaper PD/clamped OPP. E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440 tasks all running on little PD, cpumask=0x39. The big PD is idle but never beats prev_cpu in terms of energy consumption for the wakee. [...]
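
[Editorial note: Dietmar's saturation point can be made concrete with a rough, standalone (non-kernel) C sketch of the sum_util formula he quotes. The busy-time figures are taken from the Juno trace later in the thread; the perf-state cost and max are invented, and the frequency-estimation step is ignored for brevity.]

#include <stdio.h>

/* energy = (perf_state(freq)->cost / max) * sum_util, reordered to avoid
 * integer truncation */
static unsigned long energy(unsigned long sum_util, unsigned long cost,
			    unsigned long max)
{
	return cost * sum_util / max;
}

int main(void)
{
	unsigned long pd_cap = 2048;		/* two big CPUs x 1024 */
	unsigned long pd_busy_time = 1998;	/* PD already nearly saturated */
	unsigned long task_busy_time = 921;	/* the waking task */
	unsigned long cost = 800, max = 1024;	/* invented perf-state cost */

	unsigned long sum_util = pd_busy_time + task_busy_time;
	unsigned long capped = sum_util < pd_cap ? sum_util : pd_cap;

	unsigned long base = energy(pd_busy_time, cost, max);
	unsigned long delta_capped = energy(capped, cost, max) - base;
	unsigned long delta_uncapped = energy(sum_util, cost, max) - base;

	/* Once the PD is saturated, the 921-util task is billed as if it only
	 * added 2048 - 1998 = 50 util, so the energy diff looks tiny and the
	 * saturated PD keeps winning the placement comparison. */
	printf("energy diff capped=%lu vs uncapped=%lu\n",
	       delta_capped, delta_uncapped);
	return 0;
}
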
On 02/07/23 10:45, Vincent Guittot wrote: > On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote: > > > > When uclamp_max is being used, the util of the task could be higher than > > the spare capacity of the CPU, but due to uclamp_max value we force fit > > it there. > > > > The way the condition for checking for max_spare_cap in > > find_energy_efficient_cpu() was constructed; it ignored any CPU that has > > its spare_cap less than or _equal_ to max_spare_cap. Since we initialize > > max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and > > hence ending up never performing compute_energy() for this cluster and > > missing an opportunity for a better energy efficient placement to honour > > uclamp_max setting. > > > > max_spare_cap = 0; > > cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high > > > > ... > > > > util_fits_cpu(...); // will return true if uclamp_max forces it to fit > > > > ... > > > > // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0 > > if (cpu_cap > max_spare_cap) { > > max_spare_cap = cpu_cap; > > max_spare_cap_cpu = cpu; > > } > > > > prev_spare_cap suffers from a similar problem. > > > > Fix the logic by converting the variables into long and treating -1 > > value as 'not populated' instead of 0 which is a viable and correct > > spare capacity value. > > > > Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions") > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> > > --- > > kernel/sched/fair.c | 9 ++++----- > > 1 file changed, 4 insertions(+), 5 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index c6c8e7f52935..7a21ee74139f 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -7382,11 +7382,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > for (; pd; pd = pd->next) { > > unsigned long util_min = p_util_min, util_max = p_util_max; > > unsigned long cpu_cap, cpu_thermal_cap, util; > > - unsigned long cur_delta, max_spare_cap = 0; > > + long prev_spare_cap = -1, max_spare_cap = -1; > > unsigned long rq_util_min, rq_util_max; > > - unsigned long prev_spare_cap = 0; > > + unsigned long cur_delta, base_energy; > > int max_spare_cap_cpu = -1; > > - unsigned long base_energy; > > int fits, max_fits = -1; > > > > cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask); > > @@ -7461,7 +7460,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > } > > } > > > > - if (max_spare_cap_cpu < 0 && prev_spare_cap == 0) > > + if (max_spare_cap_cpu < 0 && prev_spare_cap < 0) > > continue; > > > > eenv_pd_busy_time(&eenv, cpus, p); > > @@ -7469,7 +7468,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > base_energy = compute_energy(&eenv, pd, cpus, p, -1); > > > > /* Evaluate the energy impact of using prev_cpu. */ > > - if (prev_spare_cap > 0) { > > + if (prev_spare_cap > -1) { > > prev_delta = compute_energy(&eenv, pd, cpus, p, > > prev_cpu); > > /* CPU utilization has changed */ > > I think that you also need the change below to make sure that the > signed comparison will be used. 
I have quickly checked the assembly > code for aarch64 and your patch keeps using unsigned comparison (b.ls) > ((fits == max_fits) && (cpu_cap > max_spare_cap))) { > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200] > ffff8000080e4c98: eb00003f cmp x1, x0 > ffff8000080e4c9c: 54ff98a9 b.ls ffff8000080e3fb0 > <select_task_rq_fair+0x570> // b.plast > > Whereas the change below make it to use the signed version (b.le) > ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) { > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200] > ffff8000080e4c98: eb00003f cmp x1, x0 > ffff8000080e4c9c: 54ff98ad b.le ffff8000080e3fb0 <select_task_rq_fair+0x570> > > -- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7522,7 +7522,7 @@ static int find_energy_efficient_cpu(struct > task_struct *p, int prev_cpu) > prev_spare_cap = cpu_cap; > prev_fits = fits; > } else if ((fits > max_fits) || > - ((fits == max_fits) && (cpu_cap > > max_spare_cap))) { > + ((fits == max_fits) && > ((long)cpu_cap > max_spare_cap))) { > /* > * Find the CPU with the maximum spare capacity > * among the remaining CPUs in the performance Isn't it better to go back to v1 form then? The inconsistent type paired with the cast isn't getting too ugly for me :( I don't think we can convert cpu_cap to long without having to do more work as it is used with 'util'. Cheers -- Qais Yousef
On 02/09/23 19:02, Dietmar Eggemann wrote: > On 07/02/2023 10:45, Vincent Guittot wrote: > > On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote: > >> > >> When uclamp_max is being used, the util of the task could be higher than > >> the spare capacity of the CPU, but due to uclamp_max value we force fit > >> it there. > >> > >> The way the condition for checking for max_spare_cap in > >> find_energy_efficient_cpu() was constructed; it ignored any CPU that has > >> its spare_cap less than or _equal_ to max_spare_cap. Since we initialize > >> max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and > >> hence ending up never performing compute_energy() for this cluster and > >> missing an opportunity for a better energy efficient placement to honour > >> uclamp_max setting. > >> > >> max_spare_cap = 0; > >> cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high > >> > >> ... > >> > >> util_fits_cpu(...); // will return true if uclamp_max forces it to fit > > s/true/1/ ? > > >> > >> ... > >> > >> // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0 > >> if (cpu_cap > max_spare_cap) { > >> max_spare_cap = cpu_cap; > >> max_spare_cap_cpu = cpu; > >> } > >> > >> prev_spare_cap suffers from a similar problem. > >> > >> Fix the logic by converting the variables into long and treating -1 > >> value as 'not populated' instead of 0 which is a viable and correct > >> spare capacity value. > > The issue I see here is in case we don't have any spare capacity left, > the energy calculation (in terms CPU utilization) isn't correct anymore. > > Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()` > you never know how big the `busy_time` for the PD really is in this moment. > > eenv_pd_busy_time() > > for_each_cpu(cpu, pd_cpus) > busy_time += effective_cpu_util(..., ENERGY_UTIL, ...) > ^^^^^^^^^ > > with: > > sum_util = min(busy_time + task_busy_time, pd_cap) > ^^^^^^^^^ > > freq = (1.25 * max_util / max) * max_freq > > energy = (perf_state(freq)->cost / max) * sum_util > > > energy is not related to CPU utilization anymore (since there is no idle > time/spare capacity) left. Am I right that what you're saying is that the energy calculation for the PD will be capped to a certain value and this is why you think the energy is incorrect? What should be the correct energy (in theory at least)? > > So EAS keeps packing on the cheaper PD/clamped OPP. Which is the desired behavior for uclamp_max? The only issue I see is that we want to distribute within a pd. Which is something I was going to work on and send after later - but can lump it in this series if it helps. > > E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440 > tasks all running on little PD, cpumask=0x39. The big PD is idle but > never beats prev_cpu in terms of energy consumption for the wakee. IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max is two folds: 1. Prevent tasks from consuming energy. 2. Keep them away from expensive CPUs. 2 is actually very important for 2 reasons: a. Because of max aggregation - any uncapped tasks that wakes up will cause a frequency spike on this 'expensive' cpu. We don't have a mechanism to downmigrate it - which is another thing I'm working on. b. It is desired to keep these bigger cpu idle ready for more important work. For 2, generally we don't want these tasks to steal bandwidth from these CPUs that we'd like to preserve for other type of work. 
Of course userspace has control by selecting the right uclamp_max value. They can increase it to allow a spill to next pd - or keep it low to steer them more strongly on a specific pd. Cheers -- Qais Yousef
On 11/02/2023 18:50, Qais Yousef wrote: > On 02/09/23 19:02, Dietmar Eggemann wrote: >> On 07/02/2023 10:45, Vincent Guittot wrote: >>> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote: [...] >>>> Fix the logic by converting the variables into long and treating -1 >>>> value as 'not populated' instead of 0 which is a viable and correct >>>> spare capacity value. >> >> The issue I see here is in case we don't have any spare capacity left, >> the energy calculation (in terms CPU utilization) isn't correct anymore. >> >> Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()` >> you never know how big the `busy_time` for the PD really is in this moment. >> >> eenv_pd_busy_time() >> >> for_each_cpu(cpu, pd_cpus) >> busy_time += effective_cpu_util(..., ENERGY_UTIL, ...) >> ^^^^^^^^^ >> >> with: >> >> sum_util = min(busy_time + task_busy_time, pd_cap) >> ^^^^^^^^^ >> >> freq = (1.25 * max_util / max) * max_freq >> >> energy = (perf_state(freq)->cost / max) * sum_util >> >> >> energy is not related to CPU utilization anymore (since there is no idle >> time/spare capacity) left. > > Am I right that what you're saying is that the energy calculation for the PD > will be capped to a certain value and this is why you think the energy is > incorrect? > > What should be the correct energy (in theory at least)? The energy value for the perf_state is correct but the decision based on `energy diff` in which PD to put the task might not be. In case CPUx already has some tasks, its `pd_busy_time` contribution is still capped by its capacity_orig. eenv_pd_busy_time() -> cpu_util_next() return min(util, capacity_orig_of(cpu)) In this case we can't use `energy diff` anymore to make the decision where to put the task. The base_energy of CPUx's PD is already so high that the `energy diff = cur_energy - base_energy` is small enough so that CPUx is selected as target again. >> So EAS keeps packing on the cheaper PD/clamped OPP. Sometimes yes, but there are occurrences in which a big CPU ends up with the majority of the tasks. It depends on the start condition. > Which is the desired behavior for uclamp_max? Not sure. Essentially, EAS can't do its job properly if there is no idle time left. As soon as uclamp_max restricts the system (like in my example) task placement decisions via EAS (min energy diff based) can be wrong. > The only issue I see is that we want to distribute within a pd. Which is > something I was going to work on and send after later - but can lump it in this > series if it helps. I assume you refer to } else if ((fits > max_fits) || ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) { here? >> E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440 >> tasks all running on little PD, cpumask=0x39. The big PD is idle but >> never beats prev_cpu in terms of energy consumption for the wakee. > > IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max > is two folds: > > 1. Prevent tasks from consuming energy. > 2. Keep them away from expensive CPUs. Like I tried to reason about above, the energy diff based task placement can't always assure this. > > 2 is actually very important for 2 reasons: > > a. Because of max aggregation - any uncapped tasks that wakes up will > cause a frequency spike on this 'expensive' cpu. We don't have > a mechanism to downmigrate it - which is another thing I'm working > on. True. And it can also lead to CPU overutilization which then leads to load-balancing kicking in. > b. 
It is desired to keep these bigger cpu idle ready for more important > work. But this is not happening all the time. Using the same workload (6 8ms/16ms uclamp_max=440) on Juno-r0 [446 1024 1024 446 446 446] sometimes ends up with all 6 tasks on big CPU1. A corresponding EAS task placement looks like this one: <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[1]=0 cpu_util[f|e]=[446 994] <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[2]=0 cpu_util[f|e]=[445 1004] <idle>-0 [005] select_task_rq_fair: CPU5 cpu=1 busy_time=994 util=994 pd_cap=2048 <idle>-0 [005] select_task_rq_fair: CPU5 cpu=2 busy_time=1998 util=1004 pd_cap=2048 <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=821866 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=1998 cpu_cap=1024 <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=1 energy=842434 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=2048 cpu_cap=1024 ^^^^^^^^^^^^^ diff = 20,568 <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[0]=0 cpu_util[f|e]=[440 446] <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[3]=0 cpu_util[f|e]=[6 6] <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[4]=0 cpu_util[f|e]=[440 446] <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[5]=0 cpu_util[f|e]=[440 446] <idle>-0 [005] select_task_rq_fair: CPU5 cpu=0 busy_time=446 util=446 pd_cap=1784 <idle>-0 [005] select_task_rq_fair: CPU5 cpu=3 busy_time=452 util=2 pd_cap=1784 <idle>-0 [005] select_task_rq_fair: CPU5 cpu=4 busy_time=898 util=446 pd_cap=1784 <idle>-0 [005] select_task_rq_fair: CPU5 cpu=5 busy_time=907 util=1 pd_cap=1784 <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=242002 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=907 cpu_cap=446 <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=5 energy=476000 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=1784 cpu_cap=446 ^^^^^^^^^^^^^ diff = 233,998 <idle>-0 [005] select_task_rq_fair: p=[task0-5 2551] target=1 > For 2, generally we don't want these tasks to steal bandwidth from these CPUs > that we'd like to preserve for other type of work. > > Of course userspace has control by selecting the right uclamp_max value. They > can increase it to allow a spill to next pd - or keep it low to steer them more > strongly on a specific pd. [...]
On 02/14/23 13:47, Dietmar Eggemann wrote: > On 11/02/2023 18:50, Qais Yousef wrote: > > On 02/09/23 19:02, Dietmar Eggemann wrote: > >> On 07/02/2023 10:45, Vincent Guittot wrote: > >>> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote: > > [...] > > >>>> Fix the logic by converting the variables into long and treating -1 > >>>> value as 'not populated' instead of 0 which is a viable and correct > >>>> spare capacity value. > >> > >> The issue I see here is in case we don't have any spare capacity left, > >> the energy calculation (in terms CPU utilization) isn't correct anymore. > >> > >> Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()` > >> you never know how big the `busy_time` for the PD really is in this moment. > >> > >> eenv_pd_busy_time() > >> > >> for_each_cpu(cpu, pd_cpus) > >> busy_time += effective_cpu_util(..., ENERGY_UTIL, ...) > >> ^^^^^^^^^ > >> > >> with: > >> > >> sum_util = min(busy_time + task_busy_time, pd_cap) > >> ^^^^^^^^^ > >> > >> freq = (1.25 * max_util / max) * max_freq > >> > >> energy = (perf_state(freq)->cost / max) * sum_util > >> > >> > >> energy is not related to CPU utilization anymore (since there is no idle > >> time/spare capacity) left. > > > > Am I right that what you're saying is that the energy calculation for the PD > > will be capped to a certain value and this is why you think the energy is > > incorrect? > > > > What should be the correct energy (in theory at least)? > > The energy value for the perf_state is correct but the decision based > on `energy diff` in which PD to put the task might not be. > > In case CPUx already has some tasks, its `pd_busy_time` contribution > is still capped by its capacity_orig. > > eenv_pd_busy_time() -> cpu_util_next() > return min(util, capacity_orig_of(cpu)) > > In this case we can't use `energy diff` anymore to make the decision > where to put the task. > > The base_energy of CPUx's PD is already so high that the > `energy diff = cur_energy - base_energy` is small enough so that CPUx > is selected as target again. I'm sorry bear with me as I'm still struggling to fully understand the case. You're thinking that if all the CPUs in the pd are _already_ fully busy before adding this newly woken up task there, then we'll end up with the wrong energy_diff? IOW, when the pd is fully loaded, then the cost of adding extra load will always look cheap is what you're saying? > > >> So EAS keeps packing on the cheaper PD/clamped OPP. > > Sometimes yes, but there are occurrences in which a big CPU ends up > with the majority of the tasks. It depends on the start condition. It should depend on the energy cost, yes. Which does depend on the current state of the system. > > > Which is the desired behavior for uclamp_max? > > Not sure. Essentially, EAS can't do its job properly if there is no idle This "not sure" statement is making me worried. Are you not sure about how uclamp_max should work in force fitting big tasks into small cores? Or just about handling some of the corner cases? I understood the former, not the latter. > time left. As soon as uclamp_max restricts the system (like in my > example) task placement decisions via EAS (min energy diff based) can be > wrong. I'm assuming 'no idle time' refers to the pd being fully loaded already _before_ placing the newly woken up task. 
If you refer to it not having idle time _after_ placing it - then I'm confused as one of the goals of uclamp_max is to act as a task placement hint and if it can't do that in this simple scenarios, it won't be much useful? I can appreciate it failing at some corner cases. But for example if a little core is idle and a 1024 task wakes up with uclamp_max that makes the little a candidate; then yeah it'll not leave any idle time on that cpu - but placing it there (if it the energy efficient cpu) is the desired effect, no? > > > The only issue I see is that we want to distribute within a pd. Which is > > something I was going to work on and send after later - but can lump it in this > > series if it helps. > > I assume you refer to > > } else if ((fits > max_fits) || > ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) { > > here? Yes. If there are multiple cpus with the same max_spare_cap maybe we can distribute among them rather than always pick the first one. > > >> E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440 > >> tasks all running on little PD, cpumask=0x39. The big PD is idle but > >> never beats prev_cpu in terms of energy consumption for the wakee. > > > > IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max > > is two folds: > > > > 1. Prevent tasks from consuming energy. > > 2. Keep them away from expensive CPUs. > > Like I tried to reason about above, the energy diff based task placement > can't always assure this. Re assure: It is not expected to be 100% guarantee. But it shouldn't fail simply too. > > > > > 2 is actually very important for 2 reasons: > > > > a. Because of max aggregation - any uncapped tasks that wakes up will > > cause a frequency spike on this 'expensive' cpu. We don't have > > a mechanism to downmigrate it - which is another thing I'm working > > on. > > True. And it can also lead to CPU overutilization which then leads to > load-balancing kicking in. Yep. > > > b. It is desired to keep these bigger cpu idle ready for more important > > work. > > But this is not happening all the time. Using the same workload > (6 8ms/16ms uclamp_max=440) on Juno-r0 [446 1024 1024 446 446 446] > sometimes ends up with all 6 tasks on big CPU1. This seems similar to a case I see on pinebook pro but with uclamp_min. 
$ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity /sys/devices/system/cpu/cpu0/cpu_capacity:381 /sys/devices/system/cpu/cpu1/cpu_capacity:381 /sys/devices/system/cpu/cpu2/cpu_capacity:381 /sys/devices/system/cpu/cpu3/cpu_capacity:381 /sys/devices/system/cpu/cpu4/cpu_capacity:1024 /sys/devices/system/cpu/cpu5/cpu_capacity:1024 ts, dst_cpu, p_util, uclamp_min, uclamp_max, energy 253049258065, 4, 0, 381, 1024, 10562 253049269732, 3, 0, 381, 1024, 18814 253065609673, 4, 0, 381, 1024, 10562 253065621340, 3, 0, 381, 1024, 17874 253082033696, 4, 0, 381, 1024, 14669 253082045071, 2, 0, 381, 1024, 17403 253098438637, 4, 0, 381, 1024, 14082 253098450011, 3, 0, 381, 1024, 17403 253114803452, 4, 0, 381, 1024, 17016 253114814827, 0, 0, 381, 1024, 16933 253131192435, 4, 0, 381, 1024, 15843 253131204101, 2, 0, 381, 1024, 16933 253147557125, 4, 0, 381, 1024, 18776 253147568500, 0, 0, 381, 1024, 15992 253163935608, 4, 0, 381, 1024, 17603 253163946108, 0, 0, 381, 1024, 15522 253180299382, 4, 0, 381, 1024, 17016 253180306965, 3, 0, 381, 1024, 26811 253196598074, 4, 0, 381, 1024, 16429 253196606532, 0, 0, 381, 1024, 25870 253212953723, 4, 0, 381, 1024, 15256 253212965681, 0, 0, 381, 1024, 25400 253229376288, 4, 0, 381, 1024, 19363 With uclamp_max energy looks different, but p_util is different too which has impact on compute_energy() calculations ts, dst_cpu, p_util, uclamp_min, uclamp_max, energy 760154735422179, 4, 407, 0, 381, 237058 760154735426845, 0, 407, 0, 381, 192382 760154751547464, 4, 407, 0, 381, 237058 760154751552131, 0, 407, 0, 381, 191912 760154767690833, 4, 407, 0, 381, 237058 760154767696667, 0, 407, 0, 381, 202730 760154783818744, 4, 407, 0, 381, 237058 760154783823994, 0, 407, 0, 381, 198967 760154799944613, 4, 407, 0, 381, 237058 760154799949280, 0, 407, 0, 381, 198967 760154816074565, 4, 407, 0, 381, 237058 760154816079232, 0, 407, 0, 381, 195675 760154832201309, 4, 407, 0, 381, 237058 760154832205976, 0, 407, 0, 381, 195204 760154848328053, 4, 407, 0, 381, 237058 760154848333012, 0, 407, 0, 381, 193793 760154864453631, 4, 407, 0, 381, 237058 760154864458297, 0, 407, 0, 381, 193793 760154880578333, 4, 407, 0, 381, 237058 760154880583000, 0, 407, 0, 381, 192852 760154896705369, 4, 407, 0, 381, 237058 760154896710619, 0, 407, 0, 381, 192852 I'm not sure if there's an error in compute_energy to be honest - but as you said depends on the conditions of the system the most energy efficient cpu could be different. Without this patch we don't even call compute_energy() for the little core; but now it is a viable candidate. > > A corresponding EAS task placement looks like this one: > > <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[1]=0 cpu_util[f|e]=[446 994] > <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[2]=0 cpu_util[f|e]=[445 1004] > > <idle>-0 [005] select_task_rq_fair: CPU5 cpu=1 busy_time=994 util=994 pd_cap=2048 > <idle>-0 [005] select_task_rq_fair: CPU5 cpu=2 busy_time=1998 util=1004 pd_cap=2048 > > <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=821866 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=1998 cpu_cap=1024 > <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=1 energy=842434 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=2048 cpu_cap=1024 > ^^^^^^^^^^^^^ > diff = 20,568 This is not necessarily a problem, unless the energy calculations are wrong of course. 
Setting uclamp_max near the top capacity of the littles will hopefully make it run there more often - but we know that the top frequencies of the little are not necessarily efficient ones too. Does lowering uclamp_max more towards the 'knee' of the little help keeping the tasks there? Note what this patch does is make sure that the little is a 'candidate'. Energy efficiency is the ultimate choice of which candidate cpu/pd to pick. Being more strict about uclamp_max choice might be necessary if there's a stronger desire by userspace to keep the tasks on specific cluster. If there're errors in calculating energy, I'd appreciate your help on how to resolve them. Thanks! -- Qais Yousef > > <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[0]=0 cpu_util[f|e]=[440 446] > <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[3]=0 cpu_util[f|e]=[6 6] > <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[4]=0 cpu_util[f|e]=[440 446] > <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[5]=0 cpu_util[f|e]=[440 446] > > <idle>-0 [005] select_task_rq_fair: CPU5 cpu=0 busy_time=446 util=446 pd_cap=1784 > <idle>-0 [005] select_task_rq_fair: CPU5 cpu=3 busy_time=452 util=2 pd_cap=1784 > <idle>-0 [005] select_task_rq_fair: CPU5 cpu=4 busy_time=898 util=446 pd_cap=1784 > <idle>-0 [005] select_task_rq_fair: CPU5 cpu=5 busy_time=907 util=1 pd_cap=1784 > > <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=242002 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=907 cpu_cap=446 > <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=5 energy=476000 pd_busy_time=907 task_busy_time=921 max_util=440 busy_time=1784 cpu_cap=446 > ^^^^^^^^^^^^^ > diff = 233,998 > > <idle>-0 [005] select_task_rq_fair: p=[task0-5 2551] target=1 > > > For 2, generally we don't want these tasks to steal bandwidth from these CPUs > > that we'd like to preserve for other type of work. > > > > Of course userspace has control by selecting the right uclamp_max value. They > > can increase it to allow a spill to next pd - or keep it low to steer them more > > strongly on a specific pd. > > [...] >
On Sat, 11 Feb 2023 at 18:28, Qais Yousef <qyousef@layalina.io> wrote: > > On 02/07/23 10:45, Vincent Guittot wrote: > > On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > When uclamp_max is being used, the util of the task could be higher than > > > the spare capacity of the CPU, but due to uclamp_max value we force fit > > > it there. > > > > > > The way the condition for checking for max_spare_cap in > > > find_energy_efficient_cpu() was constructed; it ignored any CPU that has > > > its spare_cap less than or _equal_ to max_spare_cap. Since we initialize > > > max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and > > > hence ending up never performing compute_energy() for this cluster and > > > missing an opportunity for a better energy efficient placement to honour > > > uclamp_max setting. > > > > > > max_spare_cap = 0; > > > cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high > > > > > > ... > > > > > > util_fits_cpu(...); // will return true if uclamp_max forces it to fit > > > > > > ... > > > > > > // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0 > > > if (cpu_cap > max_spare_cap) { > > > max_spare_cap = cpu_cap; > > > max_spare_cap_cpu = cpu; > > > } > > > > > > prev_spare_cap suffers from a similar problem. > > > > > > Fix the logic by converting the variables into long and treating -1 > > > value as 'not populated' instead of 0 which is a viable and correct > > > spare capacity value. > > > > > > Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions") > > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> > > > --- > > > kernel/sched/fair.c | 9 ++++----- > > > 1 file changed, 4 insertions(+), 5 deletions(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index c6c8e7f52935..7a21ee74139f 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -7382,11 +7382,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > > for (; pd; pd = pd->next) { > > > unsigned long util_min = p_util_min, util_max = p_util_max; > > > unsigned long cpu_cap, cpu_thermal_cap, util; > > > - unsigned long cur_delta, max_spare_cap = 0; > > > + long prev_spare_cap = -1, max_spare_cap = -1; > > > unsigned long rq_util_min, rq_util_max; > > > - unsigned long prev_spare_cap = 0; > > > + unsigned long cur_delta, base_energy; > > > int max_spare_cap_cpu = -1; > > > - unsigned long base_energy; > > > int fits, max_fits = -1; > > > > > > cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask); > > > @@ -7461,7 +7460,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > > } > > > } > > > > > > - if (max_spare_cap_cpu < 0 && prev_spare_cap == 0) > > > + if (max_spare_cap_cpu < 0 && prev_spare_cap < 0) > > > continue; > > > > > > eenv_pd_busy_time(&eenv, cpus, p); > > > @@ -7469,7 +7468,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > > base_energy = compute_energy(&eenv, pd, cpus, p, -1); > > > > > > /* Evaluate the energy impact of using prev_cpu. */ > > > - if (prev_spare_cap > 0) { > > > + if (prev_spare_cap > -1) { > > > prev_delta = compute_energy(&eenv, pd, cpus, p, > > > prev_cpu); > > > /* CPU utilization has changed */ > > > > I think that you also need the change below to make sure that the > > signed comparison will be used. 
I have quickly checked the assembly > > code for aarch64 and your patch keeps using unsigned comparison (b.ls) > > ((fits == max_fits) && (cpu_cap > max_spare_cap))) { > > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200] > > ffff8000080e4c98: eb00003f cmp x1, x0 > > ffff8000080e4c9c: 54ff98a9 b.ls ffff8000080e3fb0 > > <select_task_rq_fair+0x570> // b.plast > > > > Whereas the change below make it to use the signed version (b.le) > > ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) { > > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200] > > ffff8000080e4c98: eb00003f cmp x1, x0 > > ffff8000080e4c9c: 54ff98ad b.le ffff8000080e3fb0 <select_task_rq_fair+0x570> > > > > -- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -7522,7 +7522,7 @@ static int find_energy_efficient_cpu(struct > > task_struct *p, int prev_cpu) > > prev_spare_cap = cpu_cap; > > prev_fits = fits; > > } else if ((fits > max_fits) || > > - ((fits == max_fits) && (cpu_cap > > > max_spare_cap))) { > > + ((fits == max_fits) && > > ((long)cpu_cap > max_spare_cap))) { > > /* > > * Find the CPU with the maximum spare capacity > > * among the remaining CPUs in the performance > > Isn't it better to go back to v1 form then? The inconsistent type paired with > the cast isn't getting too ugly for me :( the cast into a long of the cpu capacity in the condition was a good way to fix this unsigned/signed comparison and make is consistent with the use of -1 as default value IMO ((long)cpu_cap > max_spare_cap) > > I don't think we can convert cpu_cap to long without having to do more work as > it is used with 'util'. > > > Cheers > > -- > Qais Yousef
On 02/20/23 18:02, Vincent Guittot wrote: > On Sat, 11 Feb 2023 at 18:28, Qais Yousef <qyousef@layalina.io> wrote: > > > > On 02/07/23 10:45, Vincent Guittot wrote: > > > On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > When uclamp_max is being used, the util of the task could be higher than > > > > the spare capacity of the CPU, but due to uclamp_max value we force fit > > > > it there. > > > > > > > > The way the condition for checking for max_spare_cap in > > > > find_energy_efficient_cpu() was constructed; it ignored any CPU that has > > > > its spare_cap less than or _equal_ to max_spare_cap. Since we initialize > > > > max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and > > > > hence ending up never performing compute_energy() for this cluster and > > > > missing an opportunity for a better energy efficient placement to honour > > > > uclamp_max setting. > > > > > > > > max_spare_cap = 0; > > > > cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high > > > > > > > > ... > > > > > > > > util_fits_cpu(...); // will return true if uclamp_max forces it to fit > > > > > > > > ... > > > > > > > > // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0 > > > > if (cpu_cap > max_spare_cap) { > > > > max_spare_cap = cpu_cap; > > > > max_spare_cap_cpu = cpu; > > > > } > > > > > > > > prev_spare_cap suffers from a similar problem. > > > > > > > > Fix the logic by converting the variables into long and treating -1 > > > > value as 'not populated' instead of 0 which is a viable and correct > > > > spare capacity value. > > > > > > > > Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions") > > > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> > > > > --- > > > > kernel/sched/fair.c | 9 ++++----- > > > > 1 file changed, 4 insertions(+), 5 deletions(-) > > > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > > index c6c8e7f52935..7a21ee74139f 100644 > > > > --- a/kernel/sched/fair.c > > > > +++ b/kernel/sched/fair.c > > > > @@ -7382,11 +7382,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > > > for (; pd; pd = pd->next) { > > > > unsigned long util_min = p_util_min, util_max = p_util_max; > > > > unsigned long cpu_cap, cpu_thermal_cap, util; > > > > - unsigned long cur_delta, max_spare_cap = 0; > > > > + long prev_spare_cap = -1, max_spare_cap = -1; > > > > unsigned long rq_util_min, rq_util_max; > > > > - unsigned long prev_spare_cap = 0; > > > > + unsigned long cur_delta, base_energy; > > > > int max_spare_cap_cpu = -1; > > > > - unsigned long base_energy; > > > > int fits, max_fits = -1; > > > > > > > > cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask); > > > > @@ -7461,7 +7460,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > > > } > > > > } > > > > > > > > - if (max_spare_cap_cpu < 0 && prev_spare_cap == 0) > > > > + if (max_spare_cap_cpu < 0 && prev_spare_cap < 0) > > > > continue; > > > > > > > > eenv_pd_busy_time(&eenv, cpus, p); > > > > @@ -7469,7 +7468,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > > > base_energy = compute_energy(&eenv, pd, cpus, p, -1); > > > > > > > > /* Evaluate the energy impact of using prev_cpu. 
*/ > > > > - if (prev_spare_cap > 0) { > > > > + if (prev_spare_cap > -1) { > > > > prev_delta = compute_energy(&eenv, pd, cpus, p, > > > > prev_cpu); > > > > /* CPU utilization has changed */ > > > > > > I think that you also need the change below to make sure that the > > > signed comparison will be used. I have quickly checked the assembly > > > code for aarch64 and your patch keeps using unsigned comparison (b.ls) > > > ((fits == max_fits) && (cpu_cap > max_spare_cap))) { > > > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200] > > > ffff8000080e4c98: eb00003f cmp x1, x0 > > > ffff8000080e4c9c: 54ff98a9 b.ls ffff8000080e3fb0 > > > <select_task_rq_fair+0x570> // b.plast > > > > > > Whereas the change below make it to use the signed version (b.le) > > > ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) { > > > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200] > > > ffff8000080e4c98: eb00003f cmp x1, x0 > > > ffff8000080e4c9c: 54ff98ad b.le ffff8000080e3fb0 <select_task_rq_fair+0x570> > > > > > > -- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -7522,7 +7522,7 @@ static int find_energy_efficient_cpu(struct > > > task_struct *p, int prev_cpu) > > > prev_spare_cap = cpu_cap; > > > prev_fits = fits; > > > } else if ((fits > max_fits) || > > > - ((fits == max_fits) && (cpu_cap > > > > max_spare_cap))) { > > > + ((fits == max_fits) && > > > ((long)cpu_cap > max_spare_cap))) { > > > /* > > > * Find the CPU with the maximum spare capacity > > > * among the remaining CPUs in the performance > > > > Isn't it better to go back to v1 form then? The inconsistent type paired with > > the cast isn't getting too ugly for me :( > > the cast into a long of the cpu capacity in the condition was a good > way to fix this unsigned/signed comparison and make is consistent with > the use of -1 as default value IMO > ((long)cpu_cap > max_spare_cap) Fair enough. Our boolean brains differ :-) I'll add the cast. Do you see the energy calculation issue Dietmar was trying to highlight? As it stands I only have boolean tweaks to do for v3. Cheers -- Qais Yousef > > > > I don't think we can convert cpu_cap to long without having to do more work as > > it is used with 'util'. > > > > > > Cheers > > > > -- > > Qais Yousef
On 14/02/2023 19:09, Qais Yousef wrote: > On 02/14/23 13:47, Dietmar Eggemann wrote: >> On 11/02/2023 18:50, Qais Yousef wrote: >>> On 02/09/23 19:02, Dietmar Eggemann wrote: >>>> On 07/02/2023 10:45, Vincent Guittot wrote: >>>>> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote: [...] >> >> The energy value for the perf_state is correct but the decision based >> on `energy diff` in which PD to put the task might not be. >> >> In case CPUx already has some tasks, its `pd_busy_time` contribution >> is still capped by its capacity_orig. >> >> eenv_pd_busy_time() -> cpu_util_next() >> return min(util, capacity_orig_of(cpu)) >> >> In this case we can't use `energy diff` anymore to make the decision >> where to put the task. >> >> The base_energy of CPUx's PD is already so high that the >> `energy diff = cur_energy - base_energy` is small enough so that CPUx >> is selected as target again. > > I'm sorry bear with me as I'm still struggling to fully understand the case. > > You're thinking that if all the CPUs in the pd are _already_ fully busy before > adding this newly woken up task there, then we'll end up with the wrong > energy_diff? IOW, when the pd is fully loaded, then the cost of adding extra > load will always look cheap is what you're saying? Sort of. The key to this is: compute_energy() if (dst_cpu >= 0) busy_time = min(pd_capacity, pd_busy_time + task_busy_time); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ pd + task contribution If `pd + task contribution` > `pd_capacity` then we're saturated and the energy diff is bogus. This includes the case in which `pd contribution` > `pd_capacity`. >>>> So EAS keeps packing on the cheaper PD/clamped OPP. >> >> Sometimes yes, but there are occurrences in which a big CPU ends up >> with the majority of the tasks. It depends on the start condition. > > It should depend on the energy cost, yes. Which does depend on the current > state of the system. I analyzed one of the last traces I got with my example: During the initial wakeup the system is in CPU OU. So feec() returns `target = -1` and `sis()->sic()` (a) does the initial placement for all the 6 tasks. CPU (a) (b) (c) (d) (e) ^ ^ ^ ^ ^ 0 t1-------->| | 1 t0 t3 |t5 |t1 |t4| | | | | ... 2 t2--------------------->| 3 4 t4------------>| | 5 t5---->| | (b) feec() for t5: eenv_pd_busy_time() pd_busy_time = 1998 = 994 (CPU1) + 1004 (CPU2) compute_energy() busy_time = min(pd_capacity, pd_busy_time + task_busy_time) = min(2048, 1998 + 921) = 2048 We go into saturation here and EAS assumes it can place t5 for the cost of 2048 - 1998 = 50 util where in reality t5 has a util of ~921. And that's why t5 ends up on CPU1 too. (c) & (d) similar to (b) (e) From here on no wakeups anymore. Only tick preemption on CPU1 every 4ms (250Hz) between the 5 tasks and t2 finishing on CPU2 eventually. >>> Which is the desired behavior for uclamp_max? >> >> Not sure. Essentially, EAS can't do its job properly if there is no idle > > This "not sure" statement is making me worried. Are you not sure about how > uclamp_max should work in force fitting big tasks into small cores? Or just > about handling some of the corner cases? I understood the former, not the > latter. I'm not sure if we can call the issue that EAS doesn't work in saturation anymore a corner case. In case uclamp_max has an effect, it can quite easily create these saturation scenarios in which EAS is still enabled but it can't do its job anymore. Making sure that we probe each PD is not getting rid of this more fundamental issue. >> time left. 
As soon as uclamp_max restricts the system (like in my >> example) task placement decisions via EAS (min energy diff based) can be >> wrong. > > I'm assuming 'no idle time' refers to the pd being fully loaded already > _before_ placing the newly woken up task. If you refer to it not having idle > time _after_ placing it - then I'm confused as one of the goals of uclamp_max > is to act as a task placement hint and if it can't do that in this simple We can't consider an individual task here. After placing `t0` might be before placing `t1` for which we might then run into this saturation. > scenarios, it won't be much useful? I can appreciate it failing at some corner > cases. But for example if a little core is idle and a 1024 task wakes up with > uclamp_max that makes the little a candidate; then yeah it'll not leave any > idle time on that cpu - but placing it there (if it the energy efficient cpu) > is the desired effect, no? Not having idle time means EAS can't do its job properly and uclamp_max can create these scenarios. >>> The only issue I see is that we want to distribute within a pd. Which is >>> something I was going to work on and send after later - but can lump it in this >>> series if it helps. >> >> I assume you refer to >> >> } else if ((fits > max_fits) || >> ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) { >> >> here? > > Yes. If there are multiple cpus with the same max_spare_cap maybe we can > distribute among them rather than always pick the first one. This can mitigate the issue but not solve it. [...] >>> b. It is desired to keep these bigger cpu idle ready for more important >>> work. >> >> But this is not happening all the time. Using the same workload >> (6 8ms/16ms uclamp_max=440) on Juno-r0 [446 1024 1024 446 446 446] >> sometimes ends up with all 6 tasks on big CPU1. > > This seems similar to a case I see on pinebook pro but with uclamp_min. > > $ grep '' /sys/devices/system/cpu/cpu*/cpu_capacity > /sys/devices/system/cpu/cpu0/cpu_capacity:381 > /sys/devices/system/cpu/cpu1/cpu_capacity:381 > /sys/devices/system/cpu/cpu2/cpu_capacity:381 > /sys/devices/system/cpu/cpu3/cpu_capacity:381 > /sys/devices/system/cpu/cpu4/cpu_capacity:1024 > /sys/devices/system/cpu/cpu5/cpu_capacity:1024 > > ts, dst_cpu, p_util, uclamp_min, uclamp_max, energy > 253049258065, 4, 0, 381, 1024, 10562 > 253049269732, 3, 0, 381, 1024, 18814 So the energy after placing p on CPU4 (big) is lower than placing it on CPU3 (LITTLE). But to see the energy-diff we would need to see the base energy (dst_cpu = -1) as well. Why is p_util = 0 here? > 253065609673, 4, 0, 381, 1024, 10562 > 253065621340, 3, 0, 381, 1024, 17874 [..] > > With uclamp_max energy looks different, but p_util is different too which has > impact on compute_energy() calculations p_util is p->se.avg.util_avg here? > ts, dst_cpu, p_util, uclamp_min, uclamp_max, energy > 760154735422179, 4, 407, 0, 381, 237058 > 760154735426845, 0, 407, 0, 381, 192382 > > I'm not sure if there's an error in compute_energy to be honest - but as you > said depends on the conditions of the system the most energy efficient cpu > could be different. > > Without this patch we don't even call compute_energy() for the little core; but > now it is a viable candidate. Understood. But this doesn't fix the `EAS not working under saturation problem`. 
>> A corresponding EAS task placement looks like this one: >> >> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[1]=0 cpu_util[f|e]=[446 994] >> <idle>-0 [005] select_task_rq_fair: CPU5 p=[task0-5 2551] prev_cpu=5 util=940 uclamp_min=0 uclamp_max=440 cpu_cap[2]=0 cpu_util[f|e]=[445 1004] >> >> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=1 busy_time=994 util=994 pd_cap=2048 >> <idle>-0 [005] select_task_rq_fair: CPU5 cpu=2 busy_time=1998 util=1004 pd_cap=2048 >> >> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=-1 energy=821866 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=1998 cpu_cap=1024 >> <idle>-0 [005] compute_energy: CPU5 p=[task0-5 2551] cpu=1 energy=842434 pd_busy_time=1998 task_busy_time=921 max_util=446 busy_time=2048 cpu_cap=1024 >> ^^^^^^^^^^^^^ >> diff = 20,568 > > This is not necessarily a problem, unless the energy calculations are wrong of > course. The problem is that placing p=[task0-5 2551] 'costs' only the energy-equivalence of 2048-1998 = 50 util instead of 921. > > Setting uclamp_max near the top capacity of the littles will hopefully make it > run there more often - but we know that the top frequencies of the little are > not necessarily efficient ones too. > > Does lowering uclamp_max more towards the 'knee' of the little help keeping the > tasks there? > > Note what this patch does is make sure that the little is a 'candidate'. Energy > efficiency is the ultimate choice of which candidate cpu/pd to pick. > > Being more strict about uclamp_max choice might be necessary if there's > a stronger desire by userspace to keep the tasks on specific cluster. > > If there're errors in calculating energy, I'd appreciate your help on how to > resolve them. I don't think there is an error in the energy calculation. But EAS doesn't work properly in these saturation cases which uclamp_max can create. And considering each PD won't solve this. I'm afraid we do need a solution for this saturation issue.
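A minimal sketch of the saturated energy delta described above, using the numbers from the Juno trace; the cost constant and helper name are illustrative stand-ins and not the kernel's actual EM code:

  #include <stdio.h>

  #define PD_CAP        2048UL  /* two little CPUs x 1024 after capping */
  #define COST_PER_UTIL 411UL   /* made-up cost factor for one perf state */

  /* Roughly mirrors compute_energy(): busy time is clamped to pd capacity. */
  static unsigned long pd_energy(unsigned long pd_busy, unsigned long task_busy)
  {
          unsigned long busy = pd_busy + task_busy;

          if (busy > PD_CAP)
                  busy = PD_CAP;  /* saturation: the task's extra util is lost here */

          return COST_PER_UTIL * busy;
  }

  int main(void)
  {
          unsigned long base    = pd_energy(1998, 0);    /* dst_cpu = -1 case */
          unsigned long with_t5 = pd_energy(1998, 921);  /* placing t5 (~921 util) */

          /* prints a delta worth only 2048 - 1998 = 50 util, not ~921 */
          printf("energy diff = %lu\n", with_t5 - base);
          return 0;
  }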
On Tue, 21 Feb 2023 at 13:05, Qais Yousef <qyousef@layalina.io> wrote: > > On 02/20/23 18:02, Vincent Guittot wrote: > > On Sat, 11 Feb 2023 at 18:28, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > On 02/07/23 10:45, Vincent Guittot wrote: > > > > On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > > > When uclamp_max is being used, the util of the task could be higher than > > > > > the spare capacity of the CPU, but due to uclamp_max value we force fit > > > > > it there. > > > > > > > > > > The way the condition for checking for max_spare_cap in > > > > > find_energy_efficient_cpu() was constructed; it ignored any CPU that has > > > > > its spare_cap less than or _equal_ to max_spare_cap. Since we initialize > > > > > max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and > > > > > hence ending up never performing compute_energy() for this cluster and > > > > > missing an opportunity for a better energy efficient placement to honour > > > > > uclamp_max setting. > > > > > > > > > > max_spare_cap = 0; > > > > > cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high > > > > > > > > > > ... > > > > > > > > > > util_fits_cpu(...); // will return true if uclamp_max forces it to fit > > > > > > > > > > ... > > > > > > > > > > // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0 > > > > > if (cpu_cap > max_spare_cap) { > > > > > max_spare_cap = cpu_cap; > > > > > max_spare_cap_cpu = cpu; > > > > > } > > > > > > > > > > prev_spare_cap suffers from a similar problem. > > > > > > > > > > Fix the logic by converting the variables into long and treating -1 > > > > > value as 'not populated' instead of 0 which is a viable and correct > > > > > spare capacity value. > > > > > > > > > > Fixes: 1d42509e475c ("sched/fair: Make EAS wakeup placement consider uclamp restrictions") > > > > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> > > > > > --- > > > > > kernel/sched/fair.c | 9 ++++----- > > > > > 1 file changed, 4 insertions(+), 5 deletions(-) > > > > > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > > > index c6c8e7f52935..7a21ee74139f 100644 > > > > > --- a/kernel/sched/fair.c > > > > > +++ b/kernel/sched/fair.c > > > > > @@ -7382,11 +7382,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > > > > for (; pd; pd = pd->next) { > > > > > unsigned long util_min = p_util_min, util_max = p_util_max; > > > > > unsigned long cpu_cap, cpu_thermal_cap, util; > > > > > - unsigned long cur_delta, max_spare_cap = 0; > > > > > + long prev_spare_cap = -1, max_spare_cap = -1; > > > > > unsigned long rq_util_min, rq_util_max; > > > > > - unsigned long prev_spare_cap = 0; > > > > > + unsigned long cur_delta, base_energy; > > > > > int max_spare_cap_cpu = -1; > > > > > - unsigned long base_energy; > > > > > int fits, max_fits = -1; > > > > > > > > > > cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask); > > > > > @@ -7461,7 +7460,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > > > > } > > > > > } > > > > > > > > > > - if (max_spare_cap_cpu < 0 && prev_spare_cap == 0) > > > > > + if (max_spare_cap_cpu < 0 && prev_spare_cap < 0) > > > > > continue; > > > > > > > > > > eenv_pd_busy_time(&eenv, cpus, p); > > > > > @@ -7469,7 +7468,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) > > > > > base_energy = compute_energy(&eenv, pd, cpus, p, -1); > > > > > > > > > > /* Evaluate the energy impact of using 
prev_cpu. */ > > > > > - if (prev_spare_cap > 0) { > > > > > + if (prev_spare_cap > -1) { > > > > > prev_delta = compute_energy(&eenv, pd, cpus, p, > > > > > prev_cpu); > > > > > /* CPU utilization has changed */ > > > > > > > > I think that you also need the change below to make sure that the > > > > signed comparison will be used. I have quickly checked the assembly > > > > code for aarch64 and your patch keeps using unsigned comparison (b.ls) > > > > ((fits == max_fits) && (cpu_cap > max_spare_cap))) { > > > > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200] > > > > ffff8000080e4c98: eb00003f cmp x1, x0 > > > > ffff8000080e4c9c: 54ff98a9 b.ls ffff8000080e3fb0 > > > > <select_task_rq_fair+0x570> // b.plast > > > > > > > > Whereas the change below make it to use the signed version (b.le) > > > > ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) { > > > > ffff8000080e4c94: f94067e0 ldr x0, [sp, #200] > > > > ffff8000080e4c98: eb00003f cmp x1, x0 > > > > ffff8000080e4c9c: 54ff98ad b.le ffff8000080e3fb0 <select_task_rq_fair+0x570> > > > > > > > > -- a/kernel/sched/fair.c > > > > +++ b/kernel/sched/fair.c > > > > @@ -7522,7 +7522,7 @@ static int find_energy_efficient_cpu(struct > > > > task_struct *p, int prev_cpu) > > > > prev_spare_cap = cpu_cap; > > > > prev_fits = fits; > > > > } else if ((fits > max_fits) || > > > > - ((fits == max_fits) && (cpu_cap > > > > > max_spare_cap))) { > > > > + ((fits == max_fits) && > > > > ((long)cpu_cap > max_spare_cap))) { > > > > /* > > > > * Find the CPU with the maximum spare capacity > > > > * among the remaining CPUs in the performance > > > > > > Isn't it better to go back to v1 form then? The inconsistent type paired with > > > the cast isn't getting too ugly for me :( > > > > the cast into a long of the cpu capacity in the condition was a good > > way to fix this unsigned/signed comparison and make is consistent with > > the use of -1 as default value IMO > > ((long)cpu_cap > max_spare_cap) > > Fair enough. Our boolean brains differ :-) I'll add the cast. > > Do you see the energy calculation issue Dietmar was trying to highlight? As it > stands I only have boolean tweaks to do for v3. I haven't looked too much at uclamp_max impact in energy calculation. Nevertheless, I wonder if one solution could be to not clamp the utilization to cpu max capacity in this case. The fact that utilization can go above cpu capacity when we clamp its frequency reflect that it would need more compute capacity or it will run longer. I will try to look more deeply in this use case > > > Cheers > > -- > Qais Yousef > > > > > > > I don't think we can convert cpu_cap to long without having to do more work as > > > it is used with 'util'. > > > > > > > > > Cheers > > > > > > -- > > > Qais Yousef
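A standalone illustration of the signed/unsigned comparison pitfall pointed out above; the variable names mirror the patch, but this is a toy program rather than the kernel code:

  #include <stdio.h>

  int main(void)
  {
          long max_spare_cap = -1;   /* the 'not populated' sentinel from the patch */
          unsigned long cpu_cap = 0; /* a fully loaded CPU with no spare capacity */

          /*
           * Without a cast, max_spare_cap is converted to unsigned long
           * (i.e. ULONG_MAX) by the usual arithmetic conversions, so the
           * comparison is false and the CPU is wrongly skipped.
           */
          if (cpu_cap > max_spare_cap)
                  printf("unsigned compare: CPU considered\n");
          else
                  printf("unsigned compare: CPU skipped\n");

          /* Casting cpu_cap to long keeps the comparison signed, as intended. */
          if ((long)cpu_cap > max_spare_cap)
                  printf("signed compare: CPU considered\n");

          return 0;
  }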
On 02/21/23 13:20, Dietmar Eggemann wrote:
> I analyzed one of the last traces I got with my example:
>
> During the initial wakeup the system is in CPU OU. So feec() returns
> `target = -1` and `sis()->sic()` (a) does the initial placement for all
> the 6 tasks.
>
> CPU    (a)  (b)  (c)  (d)  (e)
>         ^    ^    ^    ^    ^
> 0  t1-------->|              |
> 1  t0 t3 |t5 |t1 |t4| | | | | ...
> 2  t2--------------------->|
> 3
> 4  t4------------>|          |
> 5  t5---->|                  |
>
> (b) feec() for t5:
>
>     eenv_pd_busy_time()
>
>       pd_busy_time = 1998 = 994 (CPU1) + 1004 (CPU2)
>
>     compute_energy()
>
>       busy_time = min(pd_capacity, pd_busy_time + task_busy_time)
>
>                 = min(2048, 1998 + 921)
>
>                 = 2048
>
> We go into saturation here and EAS assumes it can place t5 for the
> cost of 2048 - 1998 = 50 util where in reality t5 has a util of
> ~921. And that's why t5 ends up on CPU1 too.

So to rephrase, as it was hard to follow your line of thought: the problem you're highlighting is that with uclamp_max we can end up with a fully utilized pd, but the energy cost of this always-running pd is capped to a constant value. You think this should be modelled better to reflect that the pd will run for a longer period of time, and hence cost more energy.

There are multiple compounding issues that make this difficult to reflect accurately, and that make me think this best-effort is not really as bad as you think:

1. The capacity of little cores has become small, 158 on pixel 6. This makes the definition of a 'busy' task from the little's point of view rather relaxed/small. But there's a strong desire to enable uclamp_max to act as a soft affinity to prevent these tasks from interfering with more important work on bigger cores and potentially wasting power.

2. If the task is a truly long running 1024 task (even on a big core), then knowing its actual utilization and runtime would be impossible and we'd have to cap it to something. It's basically infinity.

3. Because of uclamp_max capping, the util_avg could be wrong - it depends on how big the task actually is compared to uclamp_max. In the worst case scenario, if no idle time is seen, util_avg can easily grow into 1024 - although if the task were allowed to run uncapped it might actually be 300 or something in reality.

4. Because of max aggregation, uclamp_max tasks are a frequency spike hazard. Again, if the task is a true 1024 (and capping those tasks is one of the major use cases for uclamp_max, if not The Major use case) then as soon as an uncapped task wakes up on the CPU, there'd be a frequency spike regardless of how small this task is. If at wake-up we thought a big core was okay, and the task ends up running for the next few 100s of ms with random rt/kworkers waking up on that core - the big core will run often enough at the most energy inefficient point of the system. Which is a bigger problem, as the power cost will be very high and noticeable. And uclamp_max isn't being used today because of this already.

And because this is an always running task for the foreseeable future from the scheduler PoV, and because of the lack of a downmigration mechanism to move these tasks away (I have patches for that - but am trying to send fixes one by one), these frequency spikes pose a bigger hazard of wasting power than making a one-off miscalculation at wakeup time.

The decision at wake up is sampled at that exact instant of time. But generally uclamp_max is desired for those long running tasks whose potential energy cost over a long period of time (many ticks worth of real time) is the worrisome part.
Immediately after that sampling time the condition could change, even today, and render our decision completely wrong. And there's no fixing that; it's the nature of the beast.

5. One of the desired properties of uclamp_max in practice is to act as a soft affinity. In Android, background tasks are restricted by cpuset. But this could potentially lead to lost opportunities for better energy placement under normal lightly loaded circumstances where there are a lot of small tasks running but nothing particularly busy. EAS shines under those scenarios, but fails for long running tasks - and that's where uclamp_max is expected to help.

>
> (c) & (d) similar to (b)
>
> (e) From here on no wakeups anymore. Only tick preemption on CPU1 every
> 4ms (250Hz) between the 5 tasks and t2 finishing on CPU2 eventually.
>
> >>> Which is the desired behavior for uclamp_max?
> >>
> >> Not sure. Essentially, EAS can't do its job properly if there is no idle
> >
> > This "not sure" statement is making me worried. Are you not sure about how
> > uclamp_max should work in force fitting big tasks into small cores? Or just
> > about handling some of the corner cases? I understood the former, not the
> > latter.
>
> I'm not sure if we can call the issue that EAS doesn't work in
> saturation anymore a corner case. In case uclamp_max has an effect, it
> can quite easily create these saturation scenarios in which EAS is still
> enabled but it can't do its job anymore.
> Making sure that we probe each PD is not getting rid of this more
> fundamental issue.

I disagree that there is a fundamental issue with regard to the energy calculation. I see a potential nice-to-have improvement for estimating the energy of long running tasks. What I see as a fundamental error, and what this series is fixing, is that the hint to keep tasks on little cores doesn't work. Being able to consider a better energy efficient CPU is less of a problem, and I'm not sure it's really affecting anyone in practice.

Today's mainline will not consider any busy task as a candidate for a little core even if uclamp_max says it's okay. The task has to be small enough (< 158 on pixel 6) for EAS to consider it, but those tasks are not the ones that need the uclamp_max hint.

We could try to remove the min(). But I worry this is premature now, before first introducing my fixes to teach the load balancer to do down migration. And I think we need to do more to address the frequency spike problem - for which I have proposals ready too.

As it stands we can't use uclamp_max in products, and there is a series of fixes required to achieve that. And what you see as a fundamental issue is not one of the blocking issues I am aware of. It seems a nice enhancement to make the wakeup energy estimation even better. But I do think it requires some crystal ball, and it'll remain a best effort even after that.

I think the current outcome is the desired behavior, to be honest. Maybe we can do better, but that's a separate problem/question and not a fundamental blocker to enabling uclamp_max to do what it says on the tin. I can add this as another limitation in the uclamp doc.

The only thing we probably should do now is add a limit to how much cramming we can do before we say it's too much for this pd even with uclamp_max. Maybe we can put a limit based on load_avg, as I think it'll continue to grow, unlike util_avg which will settle at 1024.

Thanks!

--
Qais Yousef
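A rough sketch of the load_avg based cramming limit floated at the end of the message above; the threshold, helper name and hook point are hypothetical and only illustrate the idea, they are not part of the posted series:

  #include <stdbool.h>

  #define CRAMMING_FACTOR 4 /* illustrative, not a tuned value */

  /*
   * Hypothetical guard for feec(): once the pd's summed load_avg exceeds
   * some multiple of its capacity, treat it as too crammed to accept yet
   * another uclamp_max capped task. load_avg keeps growing with queued
   * time, unlike util_avg which saturates at 1024 per CPU, so it can
   * still distinguish levels of overload.
   */
  static bool pd_too_crammed(unsigned long pd_load_avg, unsigned long pd_cap)
  {
          return pd_load_avg > CRAMMING_FACTOR * pd_cap;
  }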
On 02/22/23 11:59, Vincent Guittot wrote: > I haven't looked too much at uclamp_max impact in energy calculation. > Nevertheless, I wonder if one solution could be to not clamp the > utilization to cpu max capacity in this case. The fact that > utilization can go above cpu capacity when we clamp its frequency > reflect that it would need more compute capacity or it will run > longer. I will try to look more deeply in this use case Okay thanks! -- Qais Yousef
Hi Qais, I have a question regarding the 'soft cpu affinity'. On 2/11/23 17:50, Qais Yousef wrote: > On 02/09/23 19:02, Dietmar Eggemann wrote: >> On 07/02/2023 10:45, Vincent Guittot wrote: >>> On Sun, 5 Feb 2023 at 23:43, Qais Yousef <qyousef@layalina.io> wrote: >>>> >>>> When uclamp_max is being used, the util of the task could be higher than >>>> the spare capacity of the CPU, but due to uclamp_max value we force fit >>>> it there. >>>> >>>> The way the condition for checking for max_spare_cap in >>>> find_energy_efficient_cpu() was constructed; it ignored any CPU that has >>>> its spare_cap less than or _equal_ to max_spare_cap. Since we initialize >>>> max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and >>>> hence ending up never performing compute_energy() for this cluster and >>>> missing an opportunity for a better energy efficient placement to honour >>>> uclamp_max setting. >>>> >>>> max_spare_cap = 0; >>>> cpu_cap = capacity_of(cpu) - task_util(p); // 0 if task_util(p) is high >>>> >>>> ... >>>> >>>> util_fits_cpu(...); // will return true if uclamp_max forces it to fit >> >> s/true/1/ ? >> >>>> >>>> ... >>>> >>>> // this logic will fail to update max_spare_cap_cpu if cpu_cap is 0 >>>> if (cpu_cap > max_spare_cap) { >>>> max_spare_cap = cpu_cap; >>>> max_spare_cap_cpu = cpu; >>>> } >>>> >>>> prev_spare_cap suffers from a similar problem. >>>> >>>> Fix the logic by converting the variables into long and treating -1 >>>> value as 'not populated' instead of 0 which is a viable and correct >>>> spare capacity value. >> >> The issue I see here is in case we don't have any spare capacity left, >> the energy calculation (in terms CPU utilization) isn't correct anymore. >> >> Due to `effective_cpu_util()` returning `max=arch_scale_cpu_capacity()` >> you never know how big the `busy_time` for the PD really is in this moment. >> >> eenv_pd_busy_time() >> >> for_each_cpu(cpu, pd_cpus) >> busy_time += effective_cpu_util(..., ENERGY_UTIL, ...) >> ^^^^^^^^^ >> >> with: >> >> sum_util = min(busy_time + task_busy_time, pd_cap) >> ^^^^^^^^^ >> >> freq = (1.25 * max_util / max) * max_freq >> >> energy = (perf_state(freq)->cost / max) * sum_util >> >> >> energy is not related to CPU utilization anymore (since there is no idle >> time/spare capacity) left. > > Am I right that what you're saying is that the energy calculation for the PD > will be capped to a certain value and this is why you think the energy is > incorrect? > > What should be the correct energy (in theory at least)? > >> >> So EAS keeps packing on the cheaper PD/clamped OPP. > > Which is the desired behavior for uclamp_max? > > The only issue I see is that we want to distribute within a pd. Which is > something I was going to work on and send after later - but can lump it in this > series if it helps. > >> >> E.g. Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440 >> tasks all running on little PD, cpumask=0x39. The big PD is idle but >> never beats prev_cpu in terms of energy consumption for the wakee. > > IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max > is two folds: > > 1. Prevent tasks from consuming energy. > 2. Keep them away from expensive CPUs. > > 2 is actually very important for 2 reasons: > > a. Because of max aggregation - any uncapped tasks that wakes up will > cause a frequency spike on this 'expensive' cpu. We don't have > a mechanism to downmigrate it - which is another thing I'm working > on. > b. 
It is desired to keep these bigger cpu idle ready for more important > work. > > For 2, generally we don't want these tasks to steal bandwidth from these CPUs > that we'd like to preserve for other type of work. I'm a bit afraid about such 'strong force'. That means the task would not go via EAS if we set uclamp_max e.g. 90, while the little capacity is 125. Or am I missing something? This might effectively use more energy for those tasks which can run on any CPU and EAS would figure a good energy placement. I'm worried about this, since we have L3+littles in one DVFS domain and the L3 would be only bigger in future. IMO to keep the big cpus more in idle, we should give them big energy wake up cost. That's my 3rd feature to the EM presented in OSPM2023. > > Of course userspace has control by selecting the right uclamp_max value. They > can increase it to allow a spill to next pd - or keep it low to steer them more > strongly on a specific pd. This would we be interesting to see in practice. I think we need such experiment, for such changes. Regards, Lukasz
Hi Lukasz! Sorry for late response.. On 05/22/23 09:30, Lukasz Luba wrote: > Hi Qais, > > I have a question regarding the 'soft cpu affinity'. [...] > > IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max > > is two folds: > > > > 1. Prevent tasks from consuming energy. > > 2. Keep them away from expensive CPUs. > > > > 2 is actually very important for 2 reasons: > > > > a. Because of max aggregation - any uncapped tasks that wakes up will > > cause a frequency spike on this 'expensive' cpu. We don't have > > a mechanism to downmigrate it - which is another thing I'm working > > on. > > b. It is desired to keep these bigger cpu idle ready for more important > > work. > > > > For 2, generally we don't want these tasks to steal bandwidth from these CPUs > > that we'd like to preserve for other type of work. > > I'm a bit afraid about such 'strong force'. That means the task would > not go via EAS if we set uclamp_max e.g. 90, while the little capacity > is 125. Or am I missing something? We should go via EAS, actually that's the whole point. Why do you think we won't go via EAS? The logic should be is we give a hint to prefer the little core, but we still can pick something else if it's more energy efficient. What uclamp_max enables us is to still consider that little core even if it's utilization says it doesn't fit there. We need to merge these patches first though as it's broken at the moment. if little capacity is 125 and utilization of the task is 125, then even if uclamp_max is 0, EAS will skip the whole little cluster as apotential candidate because there's no spare_capacity there. Even if the whole little cluster is idle. > > This might effectively use more energy for those tasks which can run on > any CPU and EAS would figure a good energy placement. I'm worried > about this, since we have L3+littles in one DVFS domain and the L3 > would be only bigger in future. It's a bias that will enable the search algorithm in EAS to still consider the little core for big tasks. This bias will depend on the uclamp_max value chosen by userspace (so they have some control on how hard to cap the task), and what else is happening in the system at the time it wakes up. > > IMO to keep the big cpus more in idle, we should give them big energy > wake up cost. That's my 3rd feature to the EM presented in OSPM2023. Considering the wake up cost in EAS would be a great addition to have :) > > > > > Of course userspace has control by selecting the right uclamp_max value. They > > can increase it to allow a spill to next pd - or keep it low to steer them more > > strongly on a specific pd. > > This would we be interesting to see in practice. I think we need such > experiment, for such changes. I'm not sure what you mean by 'such changes'. I hope you don't mean these patches as they are not the key. They fix an obvious bug where task placement hint won't work at all. They don't modify any behavior that shouldn't have already been there. Nor introduce new limitation. I have to say I am disappointed that these patches aren't considered an important fix for an obvious breakage. Thanks -- Qais Yousef
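A simplified model of the fit check being discussed; it only captures the uclamp_max side of util_fits_cpu() and ignores uclamp_min, thermal pressure and the capacity margin handling of the real function:

  #include <stdbool.h>

  /*
   * A task is considered to fit if its request, after clamping by
   * uclamp_max, fits the CPU's original capacity.
   */
  static bool fits_with_uclamp_max(unsigned long util, unsigned long uclamp_max,
                                   unsigned long cpu_cap_orig)
  {
          unsigned long request = util < uclamp_max ? util : uclamp_max;

          return request <= cpu_cap_orig;
  }

  /*
   * Example: util = 900, uclamp_max = 100, little cpu_cap_orig = 125.
   * The clamped request (100) fits, so the little CPU stays a candidate
   * even though capacity_of(cpu) - task_util(p) is 0 there - which is
   * exactly why max_spare_cap must be allowed to record a 0 spare capacity.
   */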
On 31/05/2023 20:22, Qais Yousef wrote: > Hi Lukasz! > > Sorry for late response.. > > On 05/22/23 09:30, Lukasz Luba wrote: >> Hi Qais, >> >> I have a question regarding the 'soft cpu affinity'. > > [...] > >>> IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max >>> is two folds: >>> >>> 1. Prevent tasks from consuming energy. >>> 2. Keep them away from expensive CPUs. >>> >>> 2 is actually very important for 2 reasons: >>> >>> a. Because of max aggregation - any uncapped tasks that wakes up will >>> cause a frequency spike on this 'expensive' cpu. We don't have >>> a mechanism to downmigrate it - which is another thing I'm working >>> on. >>> b. It is desired to keep these bigger cpu idle ready for more important >>> work. >>> >>> For 2, generally we don't want these tasks to steal bandwidth from these CPUs >>> that we'd like to preserve for other type of work. >> >> I'm a bit afraid about such 'strong force'. That means the task would >> not go via EAS if we set uclamp_max e.g. 90, while the little capacity >> is 125. Or am I missing something? > > We should go via EAS, actually that's the whole point. > > Why do you think we won't go via EAS? The logic should be is we give a hint to > prefer the little core, but we still can pick something else if it's more > energy efficient. > > What uclamp_max enables us is to still consider that little core even if it's > utilization says it doesn't fit there. We need to merge these patches first > though as it's broken at the moment. if little capacity is 125 and utilization > of the task is 125, then even if uclamp_max is 0, EAS will skip the whole > little cluster as apotential candidate because there's no spare_capacity there. > Even if the whole little cluster is idle. I'm not against letting uclamp_max force fit the placement of p. I'm just worried that using the EM (especially in it's current state) for that is wrong and will only work in certain scenarios like the one you picked. I did show a counter-example (Juno-r0 [446 1024 1024 446 446 446] with 6 8ms/16ms uclamp_max=440). The issue is that once we have no idle time left, the decisions of the EM are simply wrong and we shouldn't enforce those decisions. There is a related issue. E.g. on Pixel6 with its 4 little CPUs with cpu_cap_orig = 124. If you admit a task with p->util_avg > cpu_cap_orig it doesn't even fit onto a CPU. But we then do an energy calculation taking the values of the whole little PD into consideration. This makes no sense. The EM is not uclamp_max aware right now. What about using `sis() -> sic()` for uclamp_max constrained tasks? We would just have to iterate over the CPUs in cpu_cap_orig order. (1) Or the EM has to be made uclamp_max aware. (2) Your example is the idle system with 1 task p waking up. This task has a util_avg which excludes the little CPUs from being the new task_cpu(p). But p's uclamp_max value would allow this. But we can't just consider the placement of 1 task here. >> This might effectively use more energy for those tasks which can run on >> any CPU and EAS would figure a good energy placement. I'm worried >> about this, since we have L3+littles in one DVFS domain and the L3 >> would be only bigger in future. > > It's a bias that will enable the search algorithm in EAS to still consider the > little core for big tasks. This bias will depend on the uclamp_max value chosen > by userspace (so they have some control on how hard to cap the task), and what > else is happening in the system at the time it wakes up. 
To teach the EM about such tricky dependencies is IMHO outside the scope of `how to select a CPU for a uclamp_max constrained task`. (3) >> IMO to keep the big cpus more in idle, we should give them big energy >> wake up cost. That's my 3rd feature to the EM presented in OSPM2023. > > Considering the wake up cost in EAS would be a great addition to have :) I see this one as unrelated to (3) as well. >>> Of course userspace has control by selecting the right uclamp_max value. They >>> can increase it to allow a spill to next pd - or keep it low to steer them more >>> strongly on a specific pd. >> >> This would we be interesting to see in practice. I think we need such >> experiment, for such changes. > > I'm not sure what you mean by 'such changes'. I hope you don't mean these > patches as they are not the key. They fix an obvious bug where task placement > hint won't work at all. They don't modify any behavior that shouldn't have > already been there. Nor introduce new limitation. I have to say I am > disappointed that these patches aren't considered an important fix for an > obvious breakage. To me it's a dead-end to go this way. We need to see the full picture including something like (1) or (2) or patches you have mentioned, like the `down-migration in load-balance` thing. Maybe we can at least list all the use cases for uclamp_max capping here: It was mentioned: (A) `soft affinity for tasks w/ util_avg > uclamp_max`. Are there more?
On 5/31/23 19:22, Qais Yousef wrote: > Hi Lukasz! > > Sorry for late response.. > > On 05/22/23 09:30, Lukasz Luba wrote: >> Hi Qais, >> >> I have a question regarding the 'soft cpu affinity'. > > [...] > >>> IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max >>> is two folds: >>> >>> 1. Prevent tasks from consuming energy. >>> 2. Keep them away from expensive CPUs. >>> >>> 2 is actually very important for 2 reasons: >>> >>> a. Because of max aggregation - any uncapped tasks that wakes up will >>> cause a frequency spike on this 'expensive' cpu. We don't have >>> a mechanism to downmigrate it - which is another thing I'm working >>> on. >>> b. It is desired to keep these bigger cpu idle ready for more important >>> work. >>> >>> For 2, generally we don't want these tasks to steal bandwidth from these CPUs >>> that we'd like to preserve for other type of work. >> >> I'm a bit afraid about such 'strong force'. That means the task would >> not go via EAS if we set uclamp_max e.g. 90, while the little capacity >> is 125. Or am I missing something? > > We should go via EAS, actually that's the whole point. > > Why do you think we won't go via EAS? The logic should be is we give a hint to > prefer the little core, but we still can pick something else if it's more > energy efficient. > > What uclamp_max enables us is to still consider that little core even if it's > utilization says it doesn't fit there. We need to merge these patches first > though as it's broken at the moment. if little capacity is 125 and utilization > of the task is 125, then even if uclamp_max is 0, EAS will skip the whole > little cluster as apotential candidate because there's no spare_capacity there. > Even if the whole little cluster is idle. OK, I see now - it's a bug then. > >> >> This might effectively use more energy for those tasks which can run on >> any CPU and EAS would figure a good energy placement. I'm worried >> about this, since we have L3+littles in one DVFS domain and the L3 >> would be only bigger in future. > > It's a bias that will enable the search algorithm in EAS to still consider the > little core for big tasks. This bias will depend on the uclamp_max value chosen > by userspace (so they have some control on how hard to cap the task), and what > else is happening in the system at the time it wakes up. OK, so we would go via EAS and check the energy impact in 3 PDs - which is desired. > >> >> IMO to keep the big cpus more in idle, we should give them big energy >> wake up cost. That's my 3rd feature to the EM presented in OSPM2023. > > Considering the wake up cost in EAS would be a great addition to have :) > >> >>> >>> Of course userspace has control by selecting the right uclamp_max value. They >>> can increase it to allow a spill to next pd - or keep it low to steer them more >>> strongly on a specific pd. >> >> This would we be interesting to see in practice. I think we need such >> experiment, for such changes. > > I'm not sure what you mean by 'such changes'. I hope you don't mean these > patches as they are not the key. They fix an obvious bug where task placement > hint won't work at all. They don't modify any behavior that shouldn't have > already been there. Nor introduce new limitation. I have to say I am > disappointed that these patches aren't considered an important fix for an > obvious breakage. I mean, in practice - in our pixel6 test 3-gear :) Thank for explanation.
Hi Qais, On 2023-02-11 17:28, Qais Yousef wrote: > On 02/07/23 10:45, Vincent Guittot wrote: >> [...] > > Isn't it better to go back to v1 form then? The inconsistent type paired with > the cast isn't getting too ugly for me :( > > I don't think we can convert cpu_cap to long without having to do more work as > it is used with 'util'. Sorry if I'm missing something obvious, but why can't we convert util to long? The only place I see which mixes with util is lsub_positive(&cpu_cap, util); but at this point both cpu_cap and util should have sane values (top bit not set) so doing clamped subtraction should just work fine? Hongyan
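For reference, a self-contained equivalent of the clamped subtraction Hongyan refers to; it is written out here for illustration and is not the kernel's lsub_positive() macro itself:

  /* Subtract val from *ptr, clamping at zero instead of wrapping around. */
  static inline void clamped_sub(unsigned long *ptr, unsigned long val)
  {
          *ptr -= (*ptr < val) ? *ptr : val;
  }

  /* e.g. cpu_cap = 50, util = 200: cpu_cap becomes 0 rather than a huge wrapped value. */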
Hi Qais, On 2023-02-11 17:50, Qais Yousef wrote: > [...] >> >> So EAS keeps packing on the cheaper PD/clamped OPP. > > Which is the desired behavior for uclamp_max? > > The only issue I see is that we want to distribute within a pd. Which is > something I was going to work on and send after later - but can lump it in this > series if it helps. I more or less share the same concern with Dietmar, which is packing things on the same small CPU when everyone has spare cpu_cap of 0. I wonder if this could be useful: On the side of cfs_rq->avg.util_avg, we have a cfs_rq->avg.util_avg_uclamp_max. It is keeping track of util_avg, but each task on the rq is capped at its uclamp_max value, so even if there's two always-running tasks with uclamp_max values of 100 with no idle time, the cfs_rq only sees cpu_util() of 200 and still has remaining capacity of 1024 - 200, not 0. This also helps balancing the load when rqs have no idle time. Even if two CPUs both have no idle time, but one is running a single task clamped at 100, the other running 2 such tasks, the first sees a remaining capacity of 1024 - 100, while the 2nd is 1024 - 200, so we still prefer the first one. And I wonder if this could also help calculating energy when there's no idle time under uclamp_max. Instead of seeing a util_avg at 1024, we actually see a lower value. This is also what cpu_util_next() does in Android's sum aggregation, but I'm thinking of maintaining it right beside util_avg so that we don't have to sum up everything every time. Hongyan
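A minimal sketch of the bookkeeping Hongyan describes; the struct and helper names are hypothetical and the real per-entity PELT update path is omitted:

  struct clamped_task {
          unsigned long util_avg;
          unsigned long uclamp_max;
  };

  /*
   * Hypothetical rq-level signal: sum each runnable task's util_avg with
   * the contribution capped at that task's uclamp_max. Two always-running
   * tasks with uclamp_max = 100 then contribute 200, so the CPU reports
   * 1024 - 200 remaining capacity instead of 0.
   */
  static unsigned long rq_util_avg_uclamp_max(const struct clamped_task *tasks,
                                              int nr)
  {
          unsigned long sum = 0;
          int i;

          for (i = 0; i < nr; i++) {
                  unsigned long u = tasks[i].util_avg;

                  sum += u < tasks[i].uclamp_max ? u : tasks[i].uclamp_max;
          }

          return sum;
  }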
On 06/05/23 13:29, Dietmar Eggemann wrote: > On 31/05/2023 20:22, Qais Yousef wrote: > > Hi Lukasz! > > > > Sorry for late response.. > > > > On 05/22/23 09:30, Lukasz Luba wrote: > >> Hi Qais, > >> > >> I have a question regarding the 'soft cpu affinity'. > > > > [...] > > > >>> IIUC I'm not seeing this being a problem. The goal of capping with uclamp_max > >>> is two folds: > >>> > >>> 1. Prevent tasks from consuming energy. > >>> 2. Keep them away from expensive CPUs. > >>> > >>> 2 is actually very important for 2 reasons: > >>> > >>> a. Because of max aggregation - any uncapped tasks that wakes up will > >>> cause a frequency spike on this 'expensive' cpu. We don't have > >>> a mechanism to downmigrate it - which is another thing I'm working > >>> on. > >>> b. It is desired to keep these bigger cpu idle ready for more important > >>> work. > >>> > >>> For 2, generally we don't want these tasks to steal bandwidth from these CPUs > >>> that we'd like to preserve for other type of work. > >> > >> I'm a bit afraid about such 'strong force'. That means the task would > >> not go via EAS if we set uclamp_max e.g. 90, while the little capacity > >> is 125. Or am I missing something? > > > > We should go via EAS, actually that's the whole point. > > > > Why do you think we won't go via EAS? The logic should be is we give a hint to > > prefer the little core, but we still can pick something else if it's more > > energy efficient. > > > > What uclamp_max enables us is to still consider that little core even if it's > > utilization says it doesn't fit there. We need to merge these patches first > > though as it's broken at the moment. if little capacity is 125 and utilization > > of the task is 125, then even if uclamp_max is 0, EAS will skip the whole > > little cluster as apotential candidate because there's no spare_capacity there. > > Even if the whole little cluster is idle. > > I'm not against letting uclamp_max force fit the placement of p. I'm > just worried that using the EM (especially in it's current state) for > that is wrong and will only work in certain scenarios like the one you > picked. > > I did show a counter-example (Juno-r0 [446 1024 1024 446 446 446] with 6 > 8ms/16ms uclamp_max=440). The issue is that once we have no idle time > left, the decisions of the EM are simply wrong and we shouldn't enforce > those decisions. But this is a different problem, no? If you consider the current level of brokeness, the impact is worse IMO. Given that Android force packs things on little cores using cpuset; I see this wrong calculation bug less impactful. But I do agree we should do better. But I don't see this as a blocker to merging this fix. The thing we need to keep in mind is that if the big cores are busy or running at high frequencies, then yes we want to cram these tasks on the little cores. The impact of them running accidentally there (especially we lack support for correction action in the load balancer yet) is worse. > > There is a related issue. E.g. on Pixel6 with its 4 little CPUs with > cpu_cap_orig = 124. If you admit a task with p->util_avg > cpu_cap_orig > it doesn't even fit onto a CPU. But we then do an energy calculation > taking the values of the whole little PD into consideration. This makes > no sense. The EM is not uclamp_max aware right now. Sorry could explain the difference between this problem and the one we already were discussing? > > What about using `sis() -> sic()` for uclamp_max constrained tasks? 
We > would just have to iterate over the CPUs in cpu_cap_orig order. (1) What's the advantage here? This is not useful because we only search idle capacities. So on scenarios where the system is busy, they'll end up randomly but we do want to pack them on little cores IMHO. > Or the EM has to be made uclamp_max aware. (2) Do you have any suggestions here? Note we lack the evidence how problematic this is in real world scenarios. All the systems I've seen already pack everything on little cores via cpuset. So nothing gets worse here. But we get a chance to do better in other cases while we look for ways to improve this. > > Your example is the idle system with 1 task p waking up. This task has a > util_avg which excludes the little CPUs from being the new task_cpu(p). > But p's uclamp_max value would allow this. But we can't just consider > the placement of 1 task here. No one is saying we need to consider the placement of 1 task only. I think you exaggerate the impact of cramming things and not noticing that whether it's 1 or 100 tasks, EAS will not even consider the little cores for those busy tasks. uclamp_max was introduce to allow this, and it just doesn't work since by definition these tasks will cause the little cores to have 0 spare capacity. But we skip all cpus with 0 spare capacities. > > >> This might effectively use more energy for those tasks which can run on > >> any CPU and EAS would figure a good energy placement. I'm worried > >> about this, since we have L3+littles in one DVFS domain and the L3 > >> would be only bigger in future. > > > > It's a bias that will enable the search algorithm in EAS to still consider the > > little core for big tasks. This bias will depend on the uclamp_max value chosen > > by userspace (so they have some control on how hard to cap the task), and what > > else is happening in the system at the time it wakes up. > > To teach the EM about such tricky dependencies is IMHO outside the scope > of `how to select a CPU for a uclamp_max constrained task`. (3) > > >> IMO to keep the big cpus more in idle, we should give them big energy > >> wake up cost. That's my 3rd feature to the EM presented in OSPM2023. > > > > Considering the wake up cost in EAS would be a great addition to have :) > > I see this one as unrelated to (3) as well. > > >>> Of course userspace has control by selecting the right uclamp_max value. They > >>> can increase it to allow a spill to next pd - or keep it low to steer them more > >>> strongly on a specific pd. > >> > >> This would we be interesting to see in practice. I think we need such > >> experiment, for such changes. > > > > I'm not sure what you mean by 'such changes'. I hope you don't mean these > > patches as they are not the key. They fix an obvious bug where task placement > > hint won't work at all. They don't modify any behavior that shouldn't have > > already been there. Nor introduce new limitation. I have to say I am > > disappointed that these patches aren't considered an important fix for an > > obvious breakage. > > To me it's a dead-end to go this way. We need to see the full picture > including something like (1) or (2) or patches you have mentioned, like > the `down-migration in load-balance` thing. The down migration is to address another problem. It is a necessary fix to do corrections when we end up in bad situation. Wake up path still has to do the right thing first. 
>
> Maybe we can at least list all the use cases for uclamp_max capping here:
>
> It was mentioned:
>
> (A) `soft affinity for tasks w/ util_avg > uclamp_max`.
>
> Are there more?

We are trying to fix bugs here. Do you want to restart the discussion of why uclamp_max was introduced?

uclamp_max is a hint to bias task placement and power consumption by helping keep these tasks away from big cpus and higher frequencies, based on the maximum performance level the task is allowed to reach.

Cheers

--
Qais Yousef
On 06/07/23 15:52, Hongyan Xia wrote:
> Hi Qais,
>
> On 2023-02-11 17:50, Qais Yousef wrote:
> > [...]
> >
> >> So EAS keeps packing on the cheaper PD/clamped OPP.
> >
> > Which is the desired behavior for uclamp_max?
> >
> > The only issue I see is that we want to distribute within a pd. Which is
> > something I was going to work on and send after later - but can lump it in this
> > series if it helps.
>
> I more or less share the same concern with Dietmar, which is packing things
> on the same small CPU when everyone has spare cpu_cap of 0.
>
> I wonder if this could be useful: On the side of cfs_rq->avg.util_avg, we
> have a cfs_rq->avg.util_avg_uclamp_max. It is keeping track of util_avg, but
> each task on the rq is capped at its uclamp_max value, so even if there's
> two always-running tasks with uclamp_max values of 100 with no idle time,
> the cfs_rq only sees cpu_util() of 200 and still has remaining capacity of
> 1024 - 200, not 0. This also helps balancing the load when rqs have no idle
> time. Even if two CPUs both have no idle time, but one is running a single
> task clamped at 100, the other running 2 such tasks, the first sees a
> remaining capacity of 1024 - 100, while the 2nd is 1024 - 200, so we still
> prefer the first one.

If I understood correctly, you're suggesting doing accounting of the sum of uclamp_max for all the enqueued tasks?

I think we discussed this in the past. I can't remember the details now, but adding additional accounting seemed undesirable. And I had an issue with treating uclamp_max as a bandwidth hint rather than a performance requirement hint. Limiting a task to 200 means it can't run faster than this, but it doesn't mean it is not allowed to consume more bandwidth than 200. Nice value and the cfs bandwidth controller should be used for that.

> And I wonder if this could also help calculating energy when there's no idle
> time under uclamp_max. Instead of seeing a util_avg at 1024, we actually see
> a lower value. This is also what cpu_util_next() does in Android's sum
> aggregation, but I'm thinking of maintaining it right beside util_avg so
> that we don't have to sum up everything every time.

I haven't thought about how to improve the EM calculations, to be honest. I see this as a secondary problem compared to the other issue we need to fix first.

It seems load_avg can grow unboundedly; can you look at using this signal to distribute within a cluster, and as a hint that we might be better off spilling to another one if its CPUs are already running at a perf level <= uclamp_max?

Thanks

--
Qais Yousef
On 06/07/23 12:50, Hongyan Xia wrote: > Hi Qais, > > On 2023-02-11 17:28, Qais Yousef wrote: > > On 02/07/23 10:45, Vincent Guittot wrote: > > > [...] > > > > Isn't it better to go back to v1 form then? The inconsistent type paired with > > the cast isn't getting too ugly for me :( > > > > I don't think we can convert cpu_cap to long without having to do more work as > > it is used with 'util'. > > Sorry if I'm missing something obvious, but why can't we convert util to > long? The only place I see which mixes with util is > > lsub_positive(&cpu_cap, util); > > but at this point both cpu_cap and util should have sane values (top bit not > set) so doing clamped subtraction should just work fine? I completely have this series paged out now. I'll consider this comment when I prepare v3 - which I hope I'll have some down time in few weeks time to send some fixes I have been sitting on for a bit. Cheers -- Qais Yousef
On 06/07/23 12:50, Hongyan Xia wrote: > Hi Qais, > > On 2023-02-11 17:28, Qais Yousef wrote: > > On 02/07/23 10:45, Vincent Guittot wrote: > > > [...] > > > > Isn't it better to go back to v1 form then? The inconsistent type paired with > > the cast isn't getting too ugly for me :( > > > > I don't think we can convert cpu_cap to long without having to do more work as > > it is used with 'util'. > > Sorry if I'm missing something obvious, but why can't we convert util to > long? The only place I see which mixes with util is > > lsub_positive(&cpu_cap, util); > > but at this point both cpu_cap and util should have sane values (top bit not > set) so doing clamped subtraction should just work fine? I think all options are not very pretty. But the least evil is what Vincent suggested to simply do the cast in this one place. Util is used in more places and assumed to be unsigned long, so chances of unhappy accidents are higher and I don't feel like auditing more code for correctness given we have a simple straightforward alternative now :) Thanks -- Qais Yousef
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6c8e7f52935..7a21ee74139f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7382,11 +7382,10 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	for (; pd; pd = pd->next) {
 		unsigned long util_min = p_util_min, util_max = p_util_max;
 		unsigned long cpu_cap, cpu_thermal_cap, util;
-		unsigned long cur_delta, max_spare_cap = 0;
+		long prev_spare_cap = -1, max_spare_cap = -1;
 		unsigned long rq_util_min, rq_util_max;
-		unsigned long prev_spare_cap = 0;
+		unsigned long cur_delta, base_energy;
 		int max_spare_cap_cpu = -1;
-		unsigned long base_energy;
 		int fits, max_fits = -1;
 
 		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
@@ -7461,7 +7460,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			}
 		}
 
-		if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
+		if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
 			continue;
 
 		eenv_pd_busy_time(&eenv, cpus, p);
@@ -7469,7 +7468,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		base_energy = compute_energy(&eenv, pd, cpus, p, -1);
 
 		/* Evaluate the energy impact of using prev_cpu. */
-		if (prev_spare_cap > 0) {
+		if (prev_spare_cap > -1) {
 			prev_delta = compute_energy(&eenv, pd, cpus, p,
 						    prev_cpu);
 			/* CPU utilization has changed */