Message ID | 20231019160523.1582101-2-mathieu.desnoyers@efficios.com |
---|---|
State | New |
Headers | From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Peter Zijlstra <peterz@infradead.org>
Subject: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)
Date: Thu, 19 Oct 2023 12:05:22 -0400
Message-Id: <20231019160523.1582101-2-mathieu.desnoyers@efficios.com>
In-Reply-To: <20231019160523.1582101-1-mathieu.desnoyers@efficios.com>
References: <20231019160523.1582101-1-mathieu.desnoyers@efficios.com> |
Series | sched/fair migration reduction features |
Commit Message
Mathieu Desnoyers
Oct. 19, 2023, 4:05 p.m. UTC
Introduce the UTIL_FITS_CAPACITY scheduler feature. With it, runqueue
selection picks the previous, target, or recent runqueue if it has
enough remaining capacity to enqueue the task, before scanning for an
idle CPU.
This feature is introduced in preparation for the SELECT_BIAS_PREV
scheduler feature.
The following benchmarks cover only the UTIL_FITS_CAPACITY feature.
They were performed on a v6.5.5 kernel with mitigations=off.
The following hackbench workload on a 2-socket AMD EPYC 9654 96-Core
Processor system (192 cores total) improves the wall time from 49s to
40s (an 18% reduction in wall time).
hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
We can observe that the number of migrations is reduced significantly
with this patch (improvement):
Baseline: 117M cpu-migrations (9.355 K/sec)
With patch: 47M cpu-migrations (3.977 K/sec)
The task-clock utilization is increased (improvement):
Baseline: 253.275 CPUs utilized
With patch: 271.367 CPUs utilized
The number of context-switches is increased (degradation):
Baseline: 445M context-switches (35.516 K/sec)
With patch: 586M context-switches (48.823 K/sec)
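The counters above are in the format printed by perf stat. Assuming perf is
available, a run of the following shape reproduces the same counters for the
baseline and patched kernels (the explicit event list is an assumption; it is
not given in the original message):

perf stat -e task-clock,context-switches,cpu-migrations -- \
        hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100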
Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/
Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/
Link: https://lore.kernel.org/lkml/20230929183350.239721-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20231018204511.1563390-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Chen Yu <yu.c.chen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Gautham R. Shenoy <gautham.shenoy@amd.com>
Cc: x86@kernel.org
---
Changes since v1:
- Use scale_rt_capacity(),
- Use fits_capacity(), which leaves 20% of the capacity unused to account
  for metrics inaccuracy (see the sketch after this list).
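A minimal standalone sketch of that 20% headroom rule, assuming the usual
1280/1024 ratio of the in-kernel fits_capacity() macro; the utilization and
capacity values in the example are made up:

/*
 * Standalone illustration of the 20% headroom rule mentioned above.
 * The 1280/1024 ratio is assumed to mirror the in-kernel
 * fits_capacity() macro; build with a plain C compiler.
 */
#include <stdbool.h>
#include <stdio.h>

static bool fits_capacity(unsigned long util, unsigned long capacity)
{
	/* true only while util stays below ~80% of capacity */
	return util * 1280 < capacity * 1024;
}

int main(void)
{
	unsigned long capacity = 1024;	/* fully capable CPU */

	/* 800 * 1280 = 1024000 < 1048576 -> fits (prints 1) */
	printf("%d\n", fits_capacity(800, capacity));
	/* 850 * 1280 = 1088000 >= 1048576 -> does not fit (prints 0) */
	printf("%d\n", fits_capacity(850, capacity));
	return 0;
}

With a full capacity of 1024, utilization fits only while it stays below
about 819, i.e. roughly 80% of the capacity.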
---
kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++------
kernel/sched/features.h | 6 ++++++
kernel/sched/sched.h | 5 +++++
3 files changed, 45 insertions(+), 6 deletions(-)
Comments
On 19/10/2023 18:05, Mathieu Desnoyers wrote: > Introduce the UTIL_FITS_CAPACITY scheduler feature. The runqueue > selection picks the previous, target, or recent runqueues if they have > enough remaining capacity to enqueue the task before scanning for an > idle cpu. > > This feature is introduced in preparation for the SELECT_BIAS_PREV > scheduler feature. > > The following benchmarks only cover the UTIL_FITS_CAPACITY feature. > Those are performed on a v6.5.5 kernel with mitigations=off. > > The following hackbench workload on a 192 cores AMD EPYC 9654 96-Core > Processor (over 2 sockets) improves the wall time from 49s to 40s > (18% speedup). > > hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100 > > We can observe that the number of migrations is reduced significantly > with this patch (improvement): > > Baseline: 117M cpu-migrations (9.355 K/sec) > With patch: 47M cpu-migrations (3.977 K/sec) > > The task-clock utilization is increased (improvement): > > Baseline: 253.275 CPUs utilized > With patch: 271.367 CPUs utilized > > The number of context-switches is increased (degradation): > > Baseline: 445M context-switches (35.516 K/sec) > With patch: 586M context-switches (48.823 K/sec) > Haven't run any benchmarks yet to prove the benefit of this prefer packing over spreading or migration avoidance algorithm. [...] > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4497,6 +4497,28 @@ static inline void util_est_update(struct cfs_rq *cfs_rq, > trace_sched_util_est_se_tp(&p->se); > } > > +static unsigned long scale_rt_capacity(int cpu); > + > +/* > + * Returns true if adding the task utilization to the estimated > + * utilization of the runnable tasks on @cpu does not exceed the > + * capacity of @cpu. > + * > + * This considers only the utilization of _runnable_ tasks on the @cpu > + * runqueue, excluding blocked and sleeping tasks. This is achieved by > + * using the runqueue util_est.enqueued. > + */ > +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util, > + int cpu) This is almost like the existing task_fits_cpu(p, cpu) (used in Capacity Aware Scheduling (CAS) for Asymmetric CPU capacity systems) except the latter only uses `util = task_util_est(p)` and deals with uclamp as well and only tests whether p could fit on the CPU. Or like find_energy_efficient_cpu() (feec(), used in Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get: max(util_avg(CPU + p), util_est(CPU + p)) feec() ... for (; pd; pd = pd->next) ... util = cpu_util(cpu, p, cpu, 0); ... fits = util_fits_cpu(util, util_min, util_max, cpu) ^^^^^^^^^^^^^^^^^^ not used when uclamp is not active (1) ... capacity = capacity_of(cpu) fits = fits_capacity(util, capacity) if (!uclamp_is_used()) (1) return fits So not introducing new functions like task_fits_remaining_cpu_capacity() in this area and using existing one would be good. > +{ > + unsigned long total_util; > + > + if (!sched_util_fits_capacity_active()) > + return false; > + total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util; > + return fits_capacity(total_util, scale_rt_capacity(cpu)); Why not use: static unsigned long capacity_of(int cpu) return cpu_rq(cpu)->cpu_capacity; which is maintained in update_cpu_capacity() as scale_rt_capacity(cpu)? [...] 
> @@ -7173,7 +7200,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > if (recent_used_cpu != prev && > recent_used_cpu != target && > cpus_share_cache(recent_used_cpu, target) && > - (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && > + (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu) || > + task_fits_remaining_cpu_capacity(task_util, recent_used_cpu)) && > cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) && > asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) { > return recent_used_cpu; > diff --git a/kernel/sched/features.h b/kernel/sched/features.h > index ee7f23c76bd3..9a84a1401123 100644 > --- a/kernel/sched/features.h > +++ b/kernel/sched/features.h > @@ -97,6 +97,12 @@ SCHED_FEAT(WA_BIAS, true) > SCHED_FEAT(UTIL_EST, true) > SCHED_FEAT(UTIL_EST_FASTUP, true) IMHO, asymmetric CPU capacity systems would have to disable the sched feature UTIL_FITS_CAPACITY. Otherwise CAS could deliver different results. task_fits_remaining_cpu_capacity() and asym_fits_cpu() work slightly different. [...]
On 2023-10-23 10:11, Dietmar Eggemann wrote: > On 19/10/2023 18:05, Mathieu Desnoyers wrote: [...] >> >> +static unsigned long scale_rt_capacity(int cpu); >> + >> +/* >> + * Returns true if adding the task utilization to the estimated >> + * utilization of the runnable tasks on @cpu does not exceed the >> + * capacity of @cpu. >> + * >> + * This considers only the utilization of _runnable_ tasks on the @cpu >> + * runqueue, excluding blocked and sleeping tasks. This is achieved by >> + * using the runqueue util_est.enqueued. >> + */ >> +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util, >> + int cpu) > > This is almost like the existing task_fits_cpu(p, cpu) (used in Capacity > Aware Scheduling (CAS) for Asymmetric CPU capacity systems) except the > latter only uses `util = task_util_est(p)` and deals with uclamp as well > and only tests whether p could fit on the CPU. This is indeed a major difference between how asym capacity check works and what is introduced here: asym capacity check only checks whether the given task theoretically fits in the cpu if that cpu was completely idle, without considering the current cpu utilization. My approach is to consider the current util_est of the cpu to check whether the task fits in the remaining capacity. I did not want to use the existing task_fits_cpu() helper because the notions of uclamp bounds appear to be heavily tied to the fact that it checks whether the task fits in an _idle_ runqueue, whereas the check I am introducing here is much more restrictive: it checks that the task fits on the runqueue within the remaining capacity. > > Or like find_energy_efficient_cpu() (feec(), used in > Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get: > > max(util_avg(CPU + p), util_est(CPU + p)) I've tried using cpu_util(), but unfortunately anything that considers blocked/sleeping tasks in its utilization total does not work for my use-case. From cpu_util(): * CPU utilization is the sum of running time of runnable tasks plus the * recent utilization of currently non-runnable tasks on that CPU. > > feec() > ... > for (; pd; pd = pd->next) > ... > util = cpu_util(cpu, p, cpu, 0); > ... > fits = util_fits_cpu(util, util_min, util_max, cpu) > ^^^^^^^^^^^^^^^^^^ > not used when uclamp is not active (1) > ... > capacity = capacity_of(cpu) > fits = fits_capacity(util, capacity) > if (!uclamp_is_used()) (1) > return fits > > So not introducing new functions like task_fits_remaining_cpu_capacity() > in this area and using existing one would be good. If the notion of uclamp is not tied to the way asym capacity check is done against a theoretically idle runqueue, I'd be OK with using this, but so far both appear to be very much tied. When I stumbled on this fundamental difference between asym cpu capacity check and the check introduced here, I've started wondering whether the asym cpu capacity check would benefit from considering the target cpu current utilization as well. > >> +{ >> + unsigned long total_util; >> + >> + if (!sched_util_fits_capacity_active()) >> + return false; >> + total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util; >> + return fits_capacity(total_util, scale_rt_capacity(cpu)); > > Why not use: > > static unsigned long capacity_of(int cpu) > return cpu_rq(cpu)->cpu_capacity; > > which is maintained in update_cpu_capacity() as scale_rt_capacity(cpu)? 
The reason for preferring scale_rt_capacity(cpu) over capacity_of(cpu) is that update_cpu_capacity() only runs periodically every balance-interval, therefore providing a coarse-grained remaining capacity approximation with respect to irq, rt, dl, and thermal utilization. If it turns out that being coarse-grained is good enough, we may be able to save some cycles by using capacity_of(), but not without carefully considering the impacts of being imprecise. > > [...] > >> @@ -7173,7 +7200,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) >> if (recent_used_cpu != prev && >> recent_used_cpu != target && >> cpus_share_cache(recent_used_cpu, target) && >> - (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && >> + (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu) || >> + task_fits_remaining_cpu_capacity(task_util, recent_used_cpu)) && >> cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) && >> asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) { >> return recent_used_cpu; >> diff --git a/kernel/sched/features.h b/kernel/sched/features.h >> index ee7f23c76bd3..9a84a1401123 100644 >> --- a/kernel/sched/features.h >> +++ b/kernel/sched/features.h >> @@ -97,6 +97,12 @@ SCHED_FEAT(WA_BIAS, true) >> SCHED_FEAT(UTIL_EST, true) >> SCHED_FEAT(UTIL_EST_FASTUP, true) > > IMHO, asymmetric CPU capacity systems would have to disable the sched > feature UTIL_FITS_CAPACITY. Otherwise CAS could deliver different > results. task_fits_remaining_cpu_capacity() and asym_fits_cpu() work > slightly different. I don't think they should be mutually exclusive. We should look into the differences between those two more closely to make them work nicely together instead. For instance, why does asym capacity only consider whether tasks fit in a theoretically idle runqueue, when it could use the current utilization of the runqueue to check that the task fits in the remaining capacity ? Unfortunately I don't have a machine with asym cpu to test locally. Thanks for your feedback ! Mathieu > > [...] >
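For concreteness, the coarse-grained alternative discussed in this exchange
would only change the capacity argument of the check added by the patch. A
hedged sketch of that alternative (not part of the posted series; the helper
name is invented):

/*
 * Sketch of the coarse-grained alternative discussed above: same shape
 * as the posted task_fits_remaining_cpu_capacity(), but reading the
 * periodically updated rq->cpu_capacity via capacity_of() instead of
 * recomputing scale_rt_capacity(). Illustrative only, not the patch.
 */
static inline bool task_fits_remaining_cpu_capacity_coarse(unsigned long task_util,
							    int cpu)
{
	unsigned long total_util;

	if (!sched_util_fits_capacity_active())
		return false;
	total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
	/* capacity_of() returns cpu_rq(cpu)->cpu_capacity, refreshed at load balance. */
	return fits_capacity(total_util, capacity_of(cpu));
}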
On 2023-10-23 at 11:04:49 -0400, Mathieu Desnoyers wrote: > On 2023-10-23 10:11, Dietmar Eggemann wrote: > > On 19/10/2023 18:05, Mathieu Desnoyers wrote: > > [...] > > > +static unsigned long scale_rt_capacity(int cpu); > > > + > > > +/* > > > + * Returns true if adding the task utilization to the estimated > > > + * utilization of the runnable tasks on @cpu does not exceed the > > > + * capacity of @cpu. > > > + * > > > + * This considers only the utilization of _runnable_ tasks on the @cpu > > > + * runqueue, excluding blocked and sleeping tasks. This is achieved by > > > + * using the runqueue util_est.enqueued. > > > + */ > > > +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util, > > > + int cpu) > > > > Or like find_energy_efficient_cpu() (feec(), used in > > Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get: > > > > max(util_avg(CPU + p), util_est(CPU + p)) > > I've tried using cpu_util(), but unfortunately anything that considers > blocked/sleeping tasks in its utilization total does not work for my > use-case. > > From cpu_util(): > > * CPU utilization is the sum of running time of runnable tasks plus the > * recent utilization of currently non-runnable tasks on that CPU. > I thought cpu_util() indicates the utilization decay sum of task that was once "running" on this CPU, but will not sum up the "util/load" of the blocked/sleeping task? accumulate_sum() /* only the running task's util will be sum up */ if (running) sa->util_sum += contrib << SCHED_CAPACITY_SHIFT; WRITE_ONCE(sa->util_avg, sa->util_sum / divider); thanks, Chenyu
On 23/10/2023 17:04, Mathieu Desnoyers wrote: > On 2023-10-23 10:11, Dietmar Eggemann wrote: >> On 19/10/2023 18:05, Mathieu Desnoyers wrote: > > [...] >>> +static unsigned long scale_rt_capacity(int cpu); >>> + >>> +/* >>> + * Returns true if adding the task utilization to the estimated >>> + * utilization of the runnable tasks on @cpu does not exceed the >>> + * capacity of @cpu. >>> + * >>> + * This considers only the utilization of _runnable_ tasks on the @cpu >>> + * runqueue, excluding blocked and sleeping tasks. This is achieved by >>> + * using the runqueue util_est.enqueued. >>> + */ >>> +static inline bool task_fits_remaining_cpu_capacity(unsigned long >>> task_util, >>> + int cpu) >> >> This is almost like the existing task_fits_cpu(p, cpu) (used in Capacity >> Aware Scheduling (CAS) for Asymmetric CPU capacity systems) except the >> latter only uses `util = task_util_est(p)` and deals with uclamp as well >> and only tests whether p could fit on the CPU. > > This is indeed a major difference between how asym capacity check works > and what is introduced here: > > asym capacity check only checks whether the given task theoretically > fits in the cpu if that cpu was completely idle, without considering the > current cpu utilization. Yeah, asymmetric CPU capacity systems have to make sure that p fits on the idle/sched_idle CPU, hence the use of sync_entity_load_avg() and asym_fits_cpu(). > My approach is to consider the current util_est of the cpu to check > whether the task fits in the remaining capacity. True. > I did not want to use the existing task_fits_cpu() helper because the > notions of uclamp bounds appear to be heavily tied to the fact that it > checks whether the task fits in an _idle_ runqueue, whereas the check I > am introducing here is much more restrictive: it checks that the task > fits on the runqueue within the remaining capacity. I see. Essentially what you do is util_fits_cpu(util_est(CPU + p), 0, 1024, CPU) in !uclamp_is_used() The uclamp_is_used() case is task-centric though. (*) >> Or like find_energy_efficient_cpu() (feec(), used in >> Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to >> get: >> >> max(util_avg(CPU + p), util_est(CPU + p)) > > I've tried using cpu_util(), but unfortunately anything that considers > blocked/sleeping tasks in its utilization total does not work for my > use-case. > > From cpu_util(): > > * CPU utilization is the sum of running time of runnable tasks plus the > * recent utilization of currently non-runnable tasks on that CPU. OK, I see. Occasions in which `util_avg(CPU + p) > util_est(CPU + p)` would ruin it for your use-case. >> feec() >> ... >> for (; pd; pd = pd->next) >> ... >> util = cpu_util(cpu, p, cpu, 0); >> ... >> fits = util_fits_cpu(util, util_min, util_max, cpu) >> ^^^^^^^^^^^^^^^^^^ >> not used when uclamp is not active (1) >> ... >> capacity = capacity_of(cpu) >> fits = fits_capacity(util, capacity) >> if (!uclamp_is_used()) (1) >> return fits >> >> So not introducing new functions like task_fits_remaining_cpu_capacity() >> in this area and using existing one would be good. > > If the notion of uclamp is not tied to the way asym capacity check is > done against a theoretically idle runqueue, I'd be OK with using this, > but so far both appear to be very much tied. Yeah, uclamp_is_used() scenarios are more complicated (see *). 
> When I stumbled on this fundamental difference between asym cpu capacity > check and the check introduced here, I've started wondering whether the > asym cpu capacity check would benefit from considering the target cpu > current utilization as well. We just adapted select_idle_sibling() for asymmetric CPU capacity systems by adding the asym_fits_cpu() to the idle/sched_idle check. For me so far sis() is all about finding an idle CPU and not task packing. >>> +{ >>> + unsigned long total_util; >>> + >>> + if (!sched_util_fits_capacity_active()) >>> + return false; >>> + total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + >>> task_util; >>> + return fits_capacity(total_util, scale_rt_capacity(cpu)); >> >> Why not use: >> >> static unsigned long capacity_of(int cpu) >> return cpu_rq(cpu)->cpu_capacity; >> >> which is maintained in update_cpu_capacity() as scale_rt_capacity(cpu)? > > The reason for preferring scale_rt_capacity(cpu) over capacity_of(cpu) > is that update_cpu_capacity() only runs periodically every > balance-interval, therefore providing a coarse-grained remaining > capacity approximation with respect to irq, rt, dl, and thermal > utilization. >> If it turns out that being coarse-grained is good enough, we may be able > to save some cycles by using capacity_of(), but not without carefully > considering the impacts of being imprecise. OK, I see. We normally consider capacity_of(cpu) as accurate enough. [...] >>> diff --git a/kernel/sched/features.h b/kernel/sched/features.h >>> index ee7f23c76bd3..9a84a1401123 100644 >>> --- a/kernel/sched/features.h >>> +++ b/kernel/sched/features.h >>> @@ -97,6 +97,12 @@ SCHED_FEAT(WA_BIAS, true) >>> SCHED_FEAT(UTIL_EST, true) >>> SCHED_FEAT(UTIL_EST_FASTUP, true) >> >> IMHO, asymmetric CPU capacity systems would have to disable the sched >> feature UTIL_FITS_CAPACITY. Otherwise CAS could deliver different >> results. task_fits_remaining_cpu_capacity() and asym_fits_cpu() work >> slightly different. > > I don't think they should be mutually exclusive. We should look into the > differences between those two more closely to make them work nicely > together instead. For instance, why does asym capacity only consider > whether tasks fit in a theoretically idle runqueue, when it could use > the current utilization of the runqueue to check that the task fits in > the remaining capacity ? We have EAS (feec()) for this on asymmetric CPU capacity systems (as our per-performance_domain packing strategy), which only works when !overutilized. When overutilized, we just need asym_fits_cpu() (select_idle_capacity() -> util_fits_cpu()) to select a fitting idle/sched_idle CPU in CAS which includes the uclamp handling. [...]
On 2023-10-24 02:10, Chen Yu wrote: > On 2023-10-23 at 11:04:49 -0400, Mathieu Desnoyers wrote: >> On 2023-10-23 10:11, Dietmar Eggemann wrote: >>> On 19/10/2023 18:05, Mathieu Desnoyers wrote: >> >> [...] >>>> +static unsigned long scale_rt_capacity(int cpu); >>>> + >>>> +/* >>>> + * Returns true if adding the task utilization to the estimated >>>> + * utilization of the runnable tasks on @cpu does not exceed the >>>> + * capacity of @cpu. >>>> + * >>>> + * This considers only the utilization of _runnable_ tasks on the @cpu >>>> + * runqueue, excluding blocked and sleeping tasks. This is achieved by >>>> + * using the runqueue util_est.enqueued. >>>> + */ >>>> +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util, >>>> + int cpu) >>> >>> Or like find_energy_efficient_cpu() (feec(), used in >>> Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get: >>> >>> max(util_avg(CPU + p), util_est(CPU + p)) >> >> I've tried using cpu_util(), but unfortunately anything that considers >> blocked/sleeping tasks in its utilization total does not work for my >> use-case. >> >> From cpu_util(): >> >> * CPU utilization is the sum of running time of runnable tasks plus the >> * recent utilization of currently non-runnable tasks on that CPU. >> > > I thought cpu_util() indicates the utilization decay sum of task that was once > "running" on this CPU, but will not sum up the "util/load" of the blocked/sleeping > task? > > accumulate_sum() > /* only the running task's util will be sum up */ > if (running) > sa->util_sum += contrib << SCHED_CAPACITY_SHIFT; > > WRITE_ONCE(sa->util_avg, sa->util_sum / divider); The accumulation into the cfs_rq->avg.util_sum indeed only happens when the task is running, which means that the task does not actively contribute to increment the util_sum when it is blocked/sleeping. However, when the task is blocked/sleeping, the task is still attached to the runqueue, and therefore its historic util_sum still contributes to the cfs_rq util_sum/util_avg. This completely differs from what happens when the task is migrated to a different runqueue, in which case its util_sum contribution is entirely removed from the cfs_rq util_sum: static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { [...] update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH) [...] static void dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { [...] if (entity_is_task(se) && task_on_rq_migrating(task_of(se))) action |= DO_DETACH; [...] static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { [...] if (!se->avg.last_update_time && (flags & DO_ATTACH)) { /* * DO_ATTACH means we're here from enqueue_entity(). * !last_update_time means we've passed through * migrate_task_rq_fair() indicating we migrated. * * IOW we're enqueueing a task on a new CPU. */ attach_entity_load_avg(cfs_rq, se); update_tg_load_avg(cfs_rq); } else if (flags & DO_DETACH) { /* * DO_DETACH means we're here from dequeue_entity() * and we are migrating task out of the CPU. */ detach_entity_load_avg(cfs_rq, se); update_tg_load_avg(cfs_rq); [...] In comparison, util_est_enqueue()/util_est_dequeue() are called from enqueue_task_fair() and dequeue_task_fair(), which include blocked/sleeping tasks scenarios. Therefore, util_est only considers runnable tasks in its cfs_rq->avg.util_est.enqueued. 
The current rq utilization total used for rq selection should not include the historic utilization of all blocked/sleeping tasks, because we are making a decision to bring a recently blocked/sleeping task back onto a runqueue at that point. Considering the historic util_sum from the set of other blocked/sleeping tasks still attached to that runqueue in the current utilization mistakenly makes the rq selection think that the rq is busier than it really is.

I suspect that cpu_util_without() is a half-successful attempt at solving this by removing the task p from the considered utilization, but it does not take into account scenarios where many other tasks happen to be blocked/sleeping as well.

Thanks,

Mathieu
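To make the contrast above concrete, here is an illustrative pair of helpers
for the two per-runqueue signals being compared; the helper names are
invented, the field paths follow the v6.5-era code quoted in this thread:

/*
 * Illustrative helpers contrasting the two signals discussed above
 * (helper names invented; field paths follow the code quoted in this
 * thread and in the patch).
 */
static inline unsigned long cpu_runnable_util_est(int cpu)
{
	/* Sum of util_est of currently enqueued (runnable) tasks only. */
	return READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued);
}

static inline unsigned long cpu_cfs_util_avg(int cpu)
{
	/*
	 * PELT util_avg: also carries the decayed history of tasks that
	 * are blocked/sleeping but still attached to this runqueue.
	 */
	return READ_ONCE(cpu_rq(cpu)->cfs.avg.util_avg);
}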
On 24/10/2023 08:10, Chen Yu wrote: > On 2023-10-23 at 11:04:49 -0400, Mathieu Desnoyers wrote: >> On 2023-10-23 10:11, Dietmar Eggemann wrote: >>> On 19/10/2023 18:05, Mathieu Desnoyers wrote: [...] >>> Or like find_energy_efficient_cpu() (feec(), used in >>> Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get: >>> >>> max(util_avg(CPU + p), util_est(CPU + p)) >> >> I've tried using cpu_util(), but unfortunately anything that considers >> blocked/sleeping tasks in its utilization total does not work for my >> use-case. >> >> From cpu_util(): >> >> * CPU utilization is the sum of running time of runnable tasks plus the >> * recent utilization of currently non-runnable tasks on that CPU. >> > > I thought cpu_util() indicates the utilization decay sum of task that was once > "running" on this CPU, but will not sum up the "util/load" of the blocked/sleeping > task? cpu_util() here refers to: cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost) which when called with (cpu, p, cpu, 0) and task_cpu(p) != cpu returns: max(util_avg(CPU + p), util_est(CPU + p)) The term `CPU utilization` in cpu_util()'s header stands for cfs_rq->avg.util_avg. It does not sum up the utilization of blocked tasks but it can contain it. They have to be a blocked tasks and not tasks which were running in cfs_rq since we subtract utilization of tasks which are migrating away from the cfs_rq (cfs_rq->removed.util_avg in remove_entity_load_avg() and update_cfs_rq_load_avg()). > accumulate_sum() > /* only the running task's util will be sum up */ > if (running) > sa->util_sum += contrib << SCHED_CAPACITY_SHIFT; > > WRITE_ONCE(sa->util_avg, sa->util_sum / divider); __update_load_avg_cfs_rq() ___update_load_sum(..., cfs_rq->curr != NULL ^^^^^^^^^^^^^^^^^^^^ running accumulate_sum() if (periods) /* decay _sum */ sa->util_sum = decay_load(sa->util_sum, ...) if (load) /* decay and accrue _sum */ contrib = __accumulate_pelt_segments(...) When crossing periods we decay the old _sum and when additionally load != 0 we decay and accrue the new _sum as well.
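As a rough userspace model of the decay-and-accrue step described above (the
kernel's accumulate_sum() uses fixed-point arithmetic and lookup tables; only
the 1 ms period and the y^32 = 1/2 half-life are taken from PELT, everything
else here is a simplification):

/*
 * Rough userspace model of the PELT decay-and-accrue step discussed
 * above. Purely illustrative; build with: cc pelt_model.c -lm
 */
#include <math.h>
#include <stdio.h>

static double pelt_step(double sum, unsigned int periods, int running)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* half-life of 32 periods */

	sum *= pow(y, periods);			/* decay the old history */
	if (running)				/* accrue only while running */
		sum += (1.0 - pow(y, periods)) / (1.0 - y);
	return sum;
}

int main(void)
{
	double sum = 0.0;
	int i;

	for (i = 0; i < 64; i++)		/* 64 ms of running */
		sum = pelt_step(sum, 1, 1);
	printf("sum after running: %.1f\n", sum);
	for (i = 0; i < 32; i++)		/* 32 ms blocked: decays, does not accrue */
		sum = pelt_step(sum, 1, 0);
	printf("sum after sleeping: %.1f\n", sum);
	return 0;
}

In this model, running accrues the sum toward its geometric maximum, while a
blocked stretch only decays the existing history, which is why util_avg still
carries a contribution from recently slept tasks that remain attached to the
runqueue.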
On 2023-10-24 at 10:49:37 -0400, Mathieu Desnoyers wrote: > On 2023-10-24 02:10, Chen Yu wrote: > > On 2023-10-23 at 11:04:49 -0400, Mathieu Desnoyers wrote: > > > On 2023-10-23 10:11, Dietmar Eggemann wrote: > > > > On 19/10/2023 18:05, Mathieu Desnoyers wrote: > > > > > > [...] > > > > > +static unsigned long scale_rt_capacity(int cpu); > > > > > + > > > > > +/* > > > > > + * Returns true if adding the task utilization to the estimated > > > > > + * utilization of the runnable tasks on @cpu does not exceed the > > > > > + * capacity of @cpu. > > > > > + * > > > > > + * This considers only the utilization of _runnable_ tasks on the @cpu > > > > > + * runqueue, excluding blocked and sleeping tasks. This is achieved by > > > > > + * using the runqueue util_est.enqueued. > > > > > + */ > > > > > +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util, > > > > > + int cpu) > > > > > > > > Or like find_energy_efficient_cpu() (feec(), used in > > > > Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get: > > > > > > > > max(util_avg(CPU + p), util_est(CPU + p)) > > > > > > I've tried using cpu_util(), but unfortunately anything that considers > > > blocked/sleeping tasks in its utilization total does not work for my > > > use-case. > > > > > > From cpu_util(): > > > > > > * CPU utilization is the sum of running time of runnable tasks plus the > > > * recent utilization of currently non-runnable tasks on that CPU. > > > > > > > I thought cpu_util() indicates the utilization decay sum of task that was once > > "running" on this CPU, but will not sum up the "util/load" of the blocked/sleeping > > task? > > > > accumulate_sum() > > /* only the running task's util will be sum up */ > > if (running) > > sa->util_sum += contrib << SCHED_CAPACITY_SHIFT; > > > > WRITE_ONCE(sa->util_avg, sa->util_sum / divider); > > The accumulation into the cfs_rq->avg.util_sum indeed only happens when the task > is running, which means that the task does not actively contribute to increment > the util_sum when it is blocked/sleeping. > > However, when the task is blocked/sleeping, the task is still attached to the > runqueue, and therefore its historic util_sum still contributes to the cfs_rq > util_sum/util_avg. This completely differs from what happens when the task is > migrated to a different runqueue, in which case its util_sum contribution is > entirely removed from the cfs_rq util_sum: > > static void > enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > { > [...] > update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH) > [...] > > static void > dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > { > [...] > if (entity_is_task(se) && task_on_rq_migrating(task_of(se))) > action |= DO_DETACH; > [...] > > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > { > [...] > if (!se->avg.last_update_time && (flags & DO_ATTACH)) { > > /* > * DO_ATTACH means we're here from enqueue_entity(). > * !last_update_time means we've passed through > * migrate_task_rq_fair() indicating we migrated. > * > * IOW we're enqueueing a task on a new CPU. > */ > attach_entity_load_avg(cfs_rq, se); > update_tg_load_avg(cfs_rq); > > } else if (flags & DO_DETACH) { > /* > * DO_DETACH means we're here from dequeue_entity() > * and we are migrating task out of the CPU. > */ > detach_entity_load_avg(cfs_rq, se); > update_tg_load_avg(cfs_rq); > [...] 
> > In comparison, util_est_enqueue()/util_est_dequeue() are called from enqueue_task_fair() > and dequeue_task_fair(), which include blocked/sleeping tasks scenarios. Therefore, util_est > only considers runnable tasks in its cfs_rq->avg.util_est.enqueued. > > The current rq utilization total used for rq selection should not include historic > utilization of all blocked/sleeping tasks, because we are taking a decision to bring > back a recently blocked/sleeping task onto a runqueue at that point. Considering > the historic util_sum from the set of other blocked/sleeping tasks still attached to that > runqueue in the current utilization mistakenly makes the rq selection think that the rq is > busier than it really is. > Thanks for the description in detail, it is very helpful! Now I understand that using cpu_util() could overestimate the busyness of the CPU in UTIL_FITS_CAPACITY. > I suspect that cpu_util_without() is an half-successful attempt at solving this by removing > the task p from the considered utilization, but it does not take into account scenarios where many > other tasks happen to be blocked/sleeping as well. Agree, those non-migrated tasks could contribute to cfs_rq's util_avg. thanks, Chenyu
On 2023-10-24 at 17:03:25 +0200, Dietmar Eggemann wrote: > On 24/10/2023 08:10, Chen Yu wrote: > > On 2023-10-23 at 11:04:49 -0400, Mathieu Desnoyers wrote: > >> On 2023-10-23 10:11, Dietmar Eggemann wrote: > >>> On 19/10/2023 18:05, Mathieu Desnoyers wrote: > > [...] > > >>> Or like find_energy_efficient_cpu() (feec(), used in > >>> Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get: > >>> > >>> max(util_avg(CPU + p), util_est(CPU + p)) > >> > >> I've tried using cpu_util(), but unfortunately anything that considers > >> blocked/sleeping tasks in its utilization total does not work for my > >> use-case. > >> > >> From cpu_util(): > >> > >> * CPU utilization is the sum of running time of runnable tasks plus the > >> * recent utilization of currently non-runnable tasks on that CPU. > >> > > > > I thought cpu_util() indicates the utilization decay sum of task that was once > > "running" on this CPU, but will not sum up the "util/load" of the blocked/sleeping > > task? > > cpu_util() here refers to: > > cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost) > > which when called with (cpu, p, cpu, 0) and task_cpu(p) != cpu returns: > > max(util_avg(CPU + p), util_est(CPU + p)) > > The term `CPU utilization` in cpu_util()'s header stands for > cfs_rq->avg.util_avg. > > It does not sum up the utilization of blocked tasks but it can contain > it. They have to be a blocked tasks and not tasks which were running in > cfs_rq since we subtract utilization of tasks which are migrating away > from the cfs_rq (cfs_rq->removed.util_avg in remove_entity_load_avg() > and update_cfs_rq_load_avg()). Thanks for this description in detail, Dietmar. Yes, I just realized that, if the blocked tasks once ran on this cfs_rq and not being migrated away, the cfs_rq's util_avg will contain those utils. thanks, Chenyu > > accumulate_sum() > > /* only the running task's util will be sum up */ > > if (running) > > sa->util_sum += contrib << SCHED_CAPACITY_SHIFT; > > > > WRITE_ONCE(sa->util_avg, sa->util_sum / divider); > > __update_load_avg_cfs_rq() > > ___update_load_sum(..., cfs_rq->curr != NULL > ^^^^^^^^^^^^^^^^^^^^ > running > accumulate_sum() > > if (periods) > /* decay _sum */ > sa->util_sum = decay_load(sa->util_sum, ...) > > if (load) > /* decay and accrue _sum */ > contrib = __accumulate_pelt_segments(...) > > When crossing periods we decay the old _sum and when additionally load > != 0 we decay and accrue the new _sum as well.
On Thu, Oct 19, 2023 at 12:05:22PM -0400, Mathieu Desnoyers wrote: > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index e93e006a942b..463e75084aed 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -2090,6 +2090,11 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features = > > #endif /* SCHED_DEBUG */ > > +static __always_inline bool sched_util_fits_capacity_active(void) > +{ > + return sched_feat(UTIL_EST) && sched_feat(UTIL_FITS_CAPACITY); > +} This generates pretty terrible code; it cannot collapse this into a single branch. And since sched_feat is at best a debug interface for people who knows wtf they're doing, just make this UTIL_FITS_CAPACITY with a comment or so.
On 2023-10-25 03:56, Peter Zijlstra wrote: > On Thu, Oct 19, 2023 at 12:05:22PM -0400, Mathieu Desnoyers wrote: > >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >> index e93e006a942b..463e75084aed 100644 >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -2090,6 +2090,11 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features = >> >> #endif /* SCHED_DEBUG */ >> >> +static __always_inline bool sched_util_fits_capacity_active(void) >> +{ >> + return sched_feat(UTIL_EST) && sched_feat(UTIL_FITS_CAPACITY); >> +} > > This generates pretty terrible code; it cannot collapse this into a > single branch. And since sched_feat is at best a debug interface for > people who knows wtf they're doing, just make this UTIL_FITS_CAPACITY > with a comment or so. OK, change applied for the next round, thanks! Mathieu
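A minimal sketch of the simplification suggested here (illustrative; the reply
above only promises the change for the next round of the series):

/*
 * Sketch of the suggested simplification: test a single sched_feat()
 * so the helper can collapse to one branch; the dependency on UTIL_EST
 * is then documented rather than tested. Illustrative only.
 */
static __always_inline bool sched_util_fits_capacity_active(void)
{
	/* UTIL_FITS_CAPACITY requires UTIL_EST; see features.h. */
	return sched_feat(UTIL_FITS_CAPACITY);
}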
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1d9c2482c5a3..cc86d1ffeb27 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4497,6 +4497,28 @@ static inline void util_est_update(struct cfs_rq *cfs_rq, trace_sched_util_est_se_tp(&p->se); } +static unsigned long scale_rt_capacity(int cpu); + +/* + * Returns true if adding the task utilization to the estimated + * utilization of the runnable tasks on @cpu does not exceed the + * capacity of @cpu. + * + * This considers only the utilization of _runnable_ tasks on the @cpu + * runqueue, excluding blocked and sleeping tasks. This is achieved by + * using the runqueue util_est.enqueued. + */ +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util, + int cpu) +{ + unsigned long total_util; + + if (!sched_util_fits_capacity_active()) + return false; + total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util; + return fits_capacity(total_util, scale_rt_capacity(cpu)); +} + static inline int util_fits_cpu(unsigned long util, unsigned long uclamp_min, unsigned long uclamp_max, @@ -7124,12 +7146,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) int i, recent_used_cpu; /* - * On asymmetric system, update task utilization because we will check - * that the task fits with cpu's capacity. + * With the UTIL_FITS_CAPACITY feature and on asymmetric system, + * update task utilization because we will check that the task + * fits with cpu's capacity. */ - if (sched_asym_cpucap_active()) { + if (sched_util_fits_capacity_active() || sched_asym_cpucap_active()) { sync_entity_load_avg(&p->se); task_util = task_util_est(p); + } + if (sched_asym_cpucap_active()) { util_min = uclamp_eff_value(p, UCLAMP_MIN); util_max = uclamp_eff_value(p, UCLAMP_MAX); } @@ -7139,7 +7164,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) */ lockdep_assert_irqs_disabled(); - if ((available_idle_cpu(target) || sched_idle_cpu(target)) && + if ((available_idle_cpu(target) || sched_idle_cpu(target) || + task_fits_remaining_cpu_capacity(task_util, target)) && asym_fits_cpu(task_util, util_min, util_max, target)) return target; @@ -7147,7 +7173,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) * If the previous CPU is cache affine and idle, don't be stupid: */ if (prev != target && cpus_share_cache(prev, target) && - (available_idle_cpu(prev) || sched_idle_cpu(prev)) && + (available_idle_cpu(prev) || sched_idle_cpu(prev) || + task_fits_remaining_cpu_capacity(task_util, prev)) && asym_fits_cpu(task_util, util_min, util_max, prev)) return prev; @@ -7173,7 +7200,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if (recent_used_cpu != prev && recent_used_cpu != target && cpus_share_cache(recent_used_cpu, target) && - (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && + (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu) || + task_fits_remaining_cpu_capacity(task_util, recent_used_cpu)) && cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) && asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) { return recent_used_cpu; diff --git a/kernel/sched/features.h b/kernel/sched/features.h index ee7f23c76bd3..9a84a1401123 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -97,6 +97,12 @@ SCHED_FEAT(WA_BIAS, true) SCHED_FEAT(UTIL_EST, true) SCHED_FEAT(UTIL_EST_FASTUP, true) +/* + * Select the previous, target, or recent runqueue if they have enough + 
* remaining capacity to enqueue the task. Requires UTIL_EST. + */ +SCHED_FEAT(UTIL_FITS_CAPACITY, true) + SCHED_FEAT(LATENCY_WARN, false) SCHED_FEAT(ALT_PERIOD, true) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index e93e006a942b..463e75084aed 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2090,6 +2090,11 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features = #endif /* SCHED_DEBUG */ +static __always_inline bool sched_util_fits_capacity_active(void) +{ + return sched_feat(UTIL_EST) && sched_feat(UTIL_FITS_CAPACITY); +} + extern struct static_key_false sched_numa_balancing; extern struct static_key_false sched_schedstats;