Message ID | 20230929183350.239721-1-mathieu.desnoyers@efficios.com |
---|---|
State | New |
Series | [RFC] sched/fair: Bias runqueue selection towards almost idle prev CPU |
Commit Message
Mathieu Desnoyers
Sept. 29, 2023, 6:33 p.m. UTC
Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases
select_task_rq towards the previous CPU if that CPU was almost idle
(avg_load <= 0.1%), which eliminates frequent task migrations from an
almost idle CPU to completely idle CPUs. This is achieved by using the
CPU load of the previously used CPU as the "almost idle" criterion in
wake_affine_idle() and select_idle_sibling().
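Concretely, the "almost idle" test is the following new predicate (shown
here as it appears in the diff quoted in the replies below):

static bool
almost_idle_cpu(int cpu, struct task_struct *p)
{
	if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
		return false;
	return cpu_load_without(cpu_rq(cpu), p) <= LOAD_AVG_MAX / 1000;
}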
The following benchmarks were performed on a v6.5.5 kernel with
mitigations=off.

This speeds up the following hackbench workload on a 2-socket AMD EPYC
9654 (96 cores per socket, 192 cores total):

hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

from 49s to 32s (-35% runtime).
We can observe that the number of migrations is reduced significantly
(-94%) with this patch, which may explain the speedup:

Baseline:    118M cpu-migrations (9.286 K/sec)
With patch:    7M cpu-migrations (0.709 K/sec)

As a consequence, the stalled-cycles-backend ratio is reduced:

Baseline:    8.16% backend cycles idle
With patch:  6.70% backend cycles idle
Interestingly, the context-switch rate increases with the patch, but it
does not appear to be an issue performance-wise:

Baseline:    454M context-switches (35.677 K/sec)
With patch:  654M context-switches (62.290 K/sec)
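(The counters above are in perf stat's output format; a representative
invocation, with the event list assumed rather than taken from the
original report, would be:

perf stat -e cpu-migrations,context-switches,stalled-cycles-backend -- \
	hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100)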
This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench. It turned out that replacing this raw
spinlock with a loop of 10000x cpu_relax() within do_idle() had similar
benefits.
This patch achieves a similar effect without the busy-waiting by
allowing select_task_rq to favor an almost idle previously used CPU,
based on that CPU's load. The 0.1% avg_load threshold for an almost
idle CPU was identified empirically using the hackbench workload.
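For reference, with LOAD_AVG_MAX = 47742 the cutoff evaluates to
cpu_load_without() <= 47742 / 1000 = 47, i.e. the previous CPU qualifies
as almost idle when its runqueue load is at most roughly 0.1% of the
maximum PELT load contribution.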
Feedback is welcome. I am especially interested to learn whether this
patch has positive or detrimental effects on the performance of other
workloads.
Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/
Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Chen Yu <yu.c.chen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Gautham R. Shenoy <gautham.shenoy@amd.com>
Cc: x86@kernel.org
---
kernel/sched/fair.c | 18 +++++++++++++-----
kernel/sched/features.h | 6 ++++++
2 files changed, 19 insertions(+), 5 deletions(-)
Comments
Hi Mathieu,

On 2023-09-29 at 14:33:50 -0400, Mathieu Desnoyers wrote:
> Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases
> select_task_rq towards the previous CPU if that CPU was almost idle
> (avg_load <= 0.1%).

Yes, this is a promising direction IMO. One question is that,
can cfs_rq->avg.load_avg be used for percentage comparison?
If I understand correctly, load_avg reflects that more than
1 task could have been running on this runqueue, and the
load_avg is in direct proportion to the load_weight of that
cfs_rq. Besides, LOAD_AVG_MAX seems to not be the max value
that load_avg can reach; it is the sum of
1024 * (y + y^1 + y^2 ... )

For example:

taskset -c 1 nice -n -20 stress -c 1
cat /sys/kernel/debug/sched/debug | grep 'cfs_rq\[1\]' -A 12 | grep "\.load_avg"
  .load_avg : 88763
  .load_avg : 1024

88763 is higher than LOAD_AVG_MAX=47742.

Maybe the util_avg can be used for percentage comparison I suppose?

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d9c2482c5a3..65a7d923ea61 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6599,6 +6599,14 @@ static int wake_wide(struct task_struct *p)
>  	return 1;
>  }
>
> +static bool
> +almost_idle_cpu(int cpu, struct task_struct *p)
> +{
> +	if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
> +		return false;
> +	return cpu_load_without(cpu_rq(cpu), p) <= LOAD_AVG_MAX / 1000;

Or

	return cpu_util_without(cpu_rq(cpu), p) * 1000 <= capacity_orig_of(cpu) ?

thanks,
Chenyu
On 9/30/23 03:11, Chen Yu wrote:
> Yes, this is a promising direction IMO. One question is that,
> can cfs_rq->avg.load_avg be used for percentage comparison?
[...]
> 88763 is higher than LOAD_AVG_MAX=47742.

I would have expected the load_avg to be limited to LOAD_AVG_MAX somehow,
but it appears that it does not happen in practice.

That being said, if the cutoff is really at 0.1% or 0.2% of the real max,
does it really matter?

> Maybe the util_avg can be used for percentage comparison I suppose?
[...]
> Or
> 	return cpu_util_without(cpu_rq(cpu), p) * 1000 <= capacity_orig_of(cpu) ?

Unfortunately using util_avg does not seem to work based on my testing,
even at utilization thresholds of 0.1%, 1% and 10%.

Based on comments in fair.c:

 * CPU utilization is the sum of running time of runnable tasks plus the
 * recent utilization of currently non-runnable tasks on that CPU.

I think we don't want to include currently non-runnable tasks in the
statistics we use, because we are trying to figure out if the cpu is an
idle-enough target based on the tasks which are currently running, for
the purpose of runqueue selection when waking up a task which is
considered, at that point in time, a non-runnable task on that cpu, and
which is about to become runnable again.

Thanks,

Mathieu
On 2023-09-30 at 07:45:38 -0400, Mathieu Desnoyers wrote:
> On 9/30/23 03:11, Chen Yu wrote:
[...]
> I think we don't want to include currently non-runnable tasks in the
> statistics we use, because we are trying to figure out if the cpu is an
> idle-enough target based on the tasks which are currently running, for
> the purpose of runqueue selection when waking up a task which is
> considered, at that point in time, a non-runnable task on that cpu, and
> which is about to become runnable again.

Although LOAD_AVG_MAX is not the max possible load_avg, we still want to
find a proper threshold to decide if the CPU is almost idle. The
LOAD_AVG_MAX based threshold is modified a little bit:

The theory is, if there is only 1 task on the CPU, that task has a nice
of 0, and the task runs 50 us every 1000 us, then this CPU is regarded
as almost idle.

The load_sum of the task is:
	50 * (1 + y + y^2 + ... + y^n)
The corresponding load_avg of the task is approximately
	NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX = 50.
So:

/* which is close to LOAD_AVG_MAX/1000 = 47 */
#define ALMOST_IDLE_CPU_LOAD	50

static bool
almost_idle_cpu(int cpu, struct task_struct *p)
{
	if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
		return false;
	return cpu_load_without(cpu_rq(cpu), p) <= ALMOST_IDLE_CPU_LOAD;
}

Tested this on an Intel Xeon Platinum 8360Y (Ice Lake server), 36
cores/package, 72 cores / 144 CPUs in total. A slight improvement is
observed in hackbench socket mode:

socket mode:
hackbench -g 16 -f 20 -l 480000 -s 100

Before patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 81.084

After patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 78.083

pipe mode:
hackbench -g 16 -f 20 --pipe -l 480000 -s 100

Before patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 38.219

After patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 38.348

It suggests that, if the workload has a larger working-set/cache
footprint, waking the task up on its previous CPU gets more benefit.

thanks,
Chenyu
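As an aside, the 50 us per 1000 us derivation above can be checked
numerically against the standard PELT constants (decay factor y with
y^32 = 0.5, one decay per 1024 us period, LOAD_AVG_MAX = 47742). The
small userspace program below only illustrates the arithmetic; it is
not kernel code:

#include <math.h>
#include <stdio.h>

/* Build with: cc pelt_check.c -o pelt_check -lm */
int main(void)
{
	/* PELT decays past contributions by y per 1024 us period, y^32 = 0.5. */
	double y = pow(0.5, 1.0 / 32.0);
	/* Steady-state load_sum of a task running 50 us per period:
	 * 50 * (1 + y + y^2 + ...) = 50 / (1 - y). */
	double load_sum = 50.0 / (1.0 - y);
	/* load_avg ~= NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX. */
	double load_avg = 1024.0 * load_sum / 47742.0;

	printf("load_sum ~= %.0f, load_avg ~= %.1f\n", load_sum, load_avg);
	return 0;
}

This prints a load_avg of roughly 50, matching the ALMOST_IDLE_CPU_LOAD
value derived above.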
On 2023-09-30 at 07:45:38 -0400, Mathieu Desnoyers wrote:
[...]
> I think we don't want to include currently non-runnable tasks in the
> statistics we use, because we are trying to figure out if the cpu is an
> idle-enough target based on the tasks which are currently running, for
> the purpose of runqueue selection when waking up a task which is
> considered, at that point in time, a non-runnable task on that cpu, and
> which is about to become runnable again.

Based on the discussion, another effort to inhibit task migration is to
make WA_BIAS prefer the previous CPU rather than the current CPU.
However, it did not show much difference with/without this change
applied. I think this is because, although wake_affine_weight() chooses
the previous CPU, select_idle_sibling() would still prefer the current
CPU to the previous CPU if no idle CPU is detected. Based on this I did
the following changes in select_idle_sibling():

1. When the system is underloaded, change the sequence of idle CPU
   checking. If both the target and previous CPU are idle, choose the
   previous CPU first.

2. When the system is overloaded, and all CPUs are busy, choose the
   previous CPU over the target CPU.

hackbench -g 16 -f 20 -l 480000 -s 100

Before the patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 81.076

After the patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 77.527

Track the task migration counts over 10 seconds (bpftrace):

kretfunc:select_task_rq_fair
{
	$p = (struct task_struct *)args->p;
	if ($p->comm == "hackbench") {
		if ($p->thread_info.cpu == retval) {
			@wakeup_prev = count();
		} else if (retval == cpu) {
			@wakeup_curr = count();
		} else {
			@wakeup_migrate = count();
		}
	}
}

Before the patch:
@wakeup_prev: 8369160
@wakeup_curr: 3624480
@wakeup_migrate: 523936

After the patch:
@wakeup_prev: 15465952
@wakeup_curr: 214540
@wakeup_migrate: 65020

The percentage of wakeups on the previous CPU has increased from
8369160 / (8369160 + 3624480 + 523936) = 66.85% to
15465952 / (15465952 + 214540 + 65020) = 98.22%.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e2a69af8be36..9131cb359723 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7264,18 +7264,20 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 	lockdep_assert_irqs_disabled();
 
-	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
-	    asym_fits_cpu(task_util, util_min, util_max, target))
-		return target;
-
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
+	 * The previous CPU is checked prio to the target CPU to inhibit
+	 * costly task migration.
 	 */
 	if (prev != target && cpus_share_cache(prev, target) &&
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev))
 		return prev;
 
+	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
+	    asym_fits_cpu(task_util, util_min, util_max, target))
+		return target;
+
 	/*
 	 * Allow a per-cpu kthread to stack with the wakee if the
 	 * kworker thread and the tasks previous CPUs are the same.
@@ -7342,6 +7344,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	/* if all CPUs are busy, prefer previous CPU to inhibit migration */
+	if (prev != target && cpus_share_cache(prev, target))
+		return prev;
+
 	return target;
 }
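(The migration counters above were collected with the bpftrace script
shown; assuming the script is saved as, say, wakeup.bt, one way to bound
the collection to a 10 second window would be "timeout 10 bpftrace
wakeup.bt", since bpftrace prints the count maps on exit.)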
On 2023-10-09 01:14, Chen Yu wrote:
[...]
> Although LOAD_AVG_MAX is not the max possible load_avg, we still want to
> find a proper threshold to decide if the CPU is almost idle. The
> LOAD_AVG_MAX based threshold is modified a little bit:
>
> The theory is, if there is only 1 task on the CPU, that task has a nice
> of 0, and the task runs 50 us every 1000 us, then this CPU is regarded
> as almost idle.
>
> The load_sum of the task is:
> 	50 * (1 + y + y^2 + ... + y^n)
> The corresponding load_avg of the task is approximately
> 	NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX = 50.
> So:
>
> /* which is close to LOAD_AVG_MAX/1000 = 47 */
> #define ALMOST_IDLE_CPU_LOAD	50

Sorry to be slow at understanding this concept, but this whole "load"
value is still somewhat magic to me.

Should it vary based on CONFIG_HZ_{100,250,300,1000}, or is it
independent? Where is it documented that the load is a value in "us"
out of a window of 1000 us?

And with this value "50", it would cover the case where there is only a
single task taking less than 50 us per 1000 us, and cases where the sum
for the set of tasks on the runqueue is taking less than 50 us per
1000 us overall.

[...]
> It suggests that, if the workload has a larger working-set/cache
> footprint, waking the task up on its previous CPU gets more benefit.

In those tests, what is the average % of idleness of your cpus?

Thanks,

Mathieu
On 2023-10-09 01:36, Chen Yu wrote:
[...]
> Based on the discussion, another effort to inhibit task migration is to
> make WA_BIAS prefer the previous CPU rather than the current CPU.
> However, it did not show much difference with/without this change
> applied. I think this is because, although wake_affine_weight() chooses
> the previous CPU, select_idle_sibling() would still prefer the current
> CPU to the previous CPU if no idle CPU is detected. Based on this I did
> the following changes in select_idle_sibling():
>
> 1. When the system is underloaded, change the sequence of idle CPU
>    checking. If both the target and previous CPU are idle, choose the
>    previous CPU first.

Are you suggesting that the patch below be used in combination with my
"almost_idle" approach, or as a replacement?

I've tried my workload with only your patch, and the performance was
close to the baseline (bad). With both patches combined, the performance
is as good as with my almost_idle patch. This workload on my test
machine has cpus at about 50% idle with the baseline.

> 2. When the system is overloaded, and all CPUs are busy, choose the
>    previous CPU over the target CPU.
[...]
> Before the patch:
> @wakeup_prev: 8369160
> @wakeup_curr: 3624480
> @wakeup_migrate: 523936
>
> After the patch:
> @wakeup_prev: 15465952
> @wakeup_curr: 214540
> @wakeup_migrate: 65020
>
> The percentage of wakeups on the previous CPU has increased from
> 8369160 / (8369160 + 3624480 + 523936) = 66.85% to
> 15465952 / (15465952 + 214540 + 65020) = 98.22%.

Those results are interesting. I wonder if this change negatively
affects other workloads though.

[...]
> 	/*
> 	 * If the previous CPU is cache affine and idle, don't be stupid:
> 	 * The previous CPU is checked prio to the target CPU to inhibit

prio -> prior

Thanks,

Mathieu

> 	 * costly task migration.
> 	 */
[...]
On Tue, 10 Oct 2023 at 15:49, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
[...]
> > The load_sum of the task is:
> > 	50 * (1 + y + y^2 + ... + y^n)
> > The corresponding load_avg of the task is approximately
> > 	NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX = 50.
> > So:
> >
> > /* which is close to LOAD_AVG_MAX/1000 = 47 */
> > #define ALMOST_IDLE_CPU_LOAD	50
>
> Sorry to be slow at understanding this concept, but this whole "load"
> value is still somewhat magic to me.
>
> Should it vary based on CONFIG_HZ_{100,250,300,1000}, or is it
> independent? Where is it documented that the load is a value in "us"
> out of a window of 1000 us?

Nowhere, because load_avg is not in usec. load_avg is the sum of the
entities' load_avg, which is based on the weight of each entity. The
weight of an entity is in the range [2:88761], and as a result so is
its load_avg. LOAD_AVG_MAX can be used with the *_sum fields, but not
with the *_avg fields of struct sched_avg.

If you want to evaluate the idleness of a CPU with the PELT signals,
you should rather use util_avg or runnable_avg, which are unweighted
values in the range [0:1024].

[...]
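To make Vincent's suggestion concrete, an unweighted variant of the
predicate might look like the untested sketch below; it assumes the
cpu_runnable() helper from kernel/sched/fair.c, uses a 1% cutoff purely
as an example, and is not code proposed in the thread:

/*
 * Untested sketch: "almost idle" based on the unweighted runnable_avg,
 * which lies in the range [0:1024] (SCHED_CAPACITY_SCALE).
 */
static bool almost_idle_cpu(int cpu, struct task_struct *p)
{
	if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
		return false;
	/* runnable_avg at or below ~1% of SCHED_CAPACITY_SCALE. */
	return cpu_runnable(cpu_rq(cpu)) * 100 <= SCHED_CAPACITY_SCALE;
}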
On 2023-10-10 at 09:49:54 -0400, Mathieu Desnoyers wrote:
[...]
> Sorry to be slow at understanding this concept, but this whole "load"
> value is still somewhat magic to me.
>
> Should it vary based on CONFIG_HZ_{100,250,300,1000}, or is it
> independent? Where is it documented that the load is a value in "us"
> out of a window of 1000 us?

My understanding is that the load_sum of a single task is a value in
"us" out of a window of 1000 us, while the load_avg of the task is
additionally scaled by the weight of the task. In this case a task with
nice 0 has NICE_0_WEIGHT = 1024.

__update_load_avg_se() -> ___update_load_sum() calculates the load_sum
of a task (there are comments around ___update_load_sum() describing the
PELT calculation), and ___update_load_avg() calculates the load_avg
based on the task's weight.

> In those tests, what is the average % of idleness of your cpus?

For hackbench -g 16 -f 20 --pipe -l 480000 -s 100, it is around 8~10% idle.
For hackbench -g 16 -f 20 -l 480000 -s 100, it is around 2~3% idle.

Then the CPUs in package 1 are offlined to get a stable result when the
group number is low.

hackbench -g 1 -f 20 --pipe -l 480000 -s 100
Some CPUs are busy, others are idle, and some are half-busy.

Core	CPU	Busy%
-	-	49.57
0	0	1.89
0	72	75.55
1	1	100.00
1	73	0.00
2	2	100.00
2	74	0.00
3	3	100.00
3	75	0.01
4	4	78.29
4	76	17.72
5	5	100.00
5	77	0.00

hackbench -g 1 -f 20 -l 480000 -s 100

Core	CPU	Busy%
-	-	48.29
0	0	57.94
0	72	21.41
1	1	83.28
1	73	0.00
2	2	11.44
2	74	83.38
3	3	21.45
3	75	77.27
4	4	26.89
4	76	80.95
5	5	5.01
5	77	83.09

echo NO_WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features
hackbench -g 1 -f 20 --pipe -l 480000 -s 100
Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 9.434

echo WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features
hackbench -g 1 -f 20 --pipe -l 480000 -s 100
Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 9.373

thanks,
Chenyu
On 2023-10-10 at 10:18:18 -0400, Mathieu Desnoyers wrote:
[...]
> Are you suggesting that the patch below be used in combination with my
> "almost_idle" approach, or as a replacement?

This patch is composed of two parts: the first one deals with the
underloaded case, and was supposed to check whether choosing the
previous idle CPU over an almost idle CPU helps. But according to your
test, this does not have an effect. The second part deals with the
overloaded case, where I found it could bring benefit when the system
is overloaded.

> I've tried my workload with only your patch, and the performance was
> close to the baseline (bad). With both patches combined, the
> performance is as good as with my almost_idle patch. This workload on
> my test machine has cpus at about 50% idle with the baseline.

Yes, the benefit should come from the almost idle strategy.

[...]
> Those results are interesting. I wonder if this change negatively
> affects other workloads though.

I haven't tested other benchmarks yet and just wanted to send this idea
out. I can launch more tests to see if it brings other impacts.

> > 	 * The previous CPU is checked prio to the target CPU to inhibit
>
> prio -> prior

Thanks for pointing this out.

thanks,
Chenyu
On 2023-10-11 06:16, Chen Yu wrote:
> On 2023-10-10 at 09:49:54 -0400, Mathieu Desnoyers wrote:
>> On 2023-10-09 01:14, Chen Yu wrote:
>>> On 2023-09-30 at 07:45:38 -0400, Mathieu Desnoyers wrote:
>>>> On 9/30/23 03:11, Chen Yu wrote:
>>>>> Hi Mathieu,
>>>>>
>>>>> On 2023-09-29 at 14:33:50 -0400, Mathieu Desnoyers wrote:
>>>>>> Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases
>>>>>> select_task_rq towards the previous CPU if it was almost idle
>>>>>> (avg_load <= 0.1%).
>>>>>
>>>>> Yes, this is a promising direction IMO. One question is that,
>>>>> can cfs_rq->avg.load_avg be used for percentage comparison?
>>>>> If I understand correctly, load_avg reflects that more than
>>>>> one task could have been running on this runqueue, and the
>>>>> load_avg is directly proportional to the load_weight of that
>>>>> cfs_rq. Besides, LOAD_AVG_MAX seems to not be the max value
>>>>> that load_avg can reach; it is the sum of
>>>>> 1024 * (y + y^1 + y^2 ... )
>>>>>
>>>>> For example,
>>>>> taskset -c 1 nice -n -20 stress -c 1
>>>>> cat /sys/kernel/debug/sched/debug | grep 'cfs_rq\[1\]' -A 12 | grep "\.load_avg"
>>>>>   .load_avg : 88763
>>>>>   .load_avg : 1024
>>>>>
>>>>> 88763 is higher than LOAD_AVG_MAX=47742
>>>>
>>>> I would have expected the load_avg to be limited to LOAD_AVG_MAX somehow,
>>>> but it appears that it does not happen in practice.
>>>>
>>>> That being said, if the cutoff is really at 0.1% or 0.2% of the real max,
>>>> does it really matter ?
>>>>
>>>>> Maybe the util_avg can be used for percentage comparison I suppose?
>>>> [...]
>>>>> Or
>>>>> return cpu_util_without(cpu_rq(cpu), p) * 1000 <= capacity_orig_of(cpu) ?
>>>>
>>>> Unfortunately using util_avg does not seem to work based on my testing,
>>>> even at utilization thresholds of 0.1%, 1% and 10%.
>>>>
>>>> Based on comments in fair.c:
>>>>
>>>>  * CPU utilization is the sum of running time of runnable tasks plus the
>>>>  * recent utilization of currently non-runnable tasks on that CPU.
>>>>
>>>> I think we don't want to include currently non-runnable tasks in the
>>>> statistics we use, because we are trying to figure out if the cpu is an
>>>> idle-enough target based on the tasks which are currently running, for the
>>>> purpose of runqueue selection when waking up a task which is considered at
>>>> that point in time a non-runnable task on that cpu, and which is about to
>>>> become runnable again.
>>>>
>>>
>>> Although LOAD_AVG_MAX is not the max possible load_avg, we still want to find
>>> a proper threshold to decide if the CPU is almost idle. The LOAD_AVG_MAX
>>> based threshold is modified a little bit:
>>>
>>> The theory is, if there is only 1 task on the CPU, that task has a nice
>>> of 0, and the task runs 50 us every 1000 us, then this CPU is regarded
>>> as almost idle.
>>>
>>> The load_sum of the task is:
>>> 50 * (1 + y + y^2 + ... + y^n)
>>> The corresponding load_avg of the task is approximately
>>> NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX = 50.
>>> So:
>>>
>>> /* which is close to LOAD_AVG_MAX/1000 = 47 */
>>> #define ALMOST_IDLE_CPU_LOAD 50
>>
>> Sorry to be slow at understanding this concept, but this whole "load" value
>> is still somewhat magic to me.
>>
>> Should it vary based on CONFIG_HZ_{100,250,300,1000}, or is it independent ?
>> Where is it documented that the load is a value in "us" out of a window of
>> 1000 us ?
>>
>
> My understanding is that the load_sum of a single task is a value in "us" out
> of a window of 1000 us, while the load_avg of the task multiplies it by the
> weight of the task. A task with nice 0 has NICE_0_WEIGHT = 1024.
>
> __update_load_avg_se -> ___update_load_sum calculates the load_sum of a task
> (there are comments around ___update_load_sum describing the PELT calculation),
> and ___update_load_avg() calculates the load_avg based on the task's weight.

Thanks for your thorough explanation, now it makes sense.

I understand as well that the cfs_rq->avg.load_sum is the result of summing
each task load_sum multiplied by their weight:

static inline void
enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	cfs_rq->avg.load_avg += se->avg.load_avg;
	cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
}

Therefore I think we need to multiply the load_sum value we aim for by
get_pelt_divider(&cpu_rq(cpu)->cfs.avg) to compare it to a rq load_sum.

I plan to compare the rq load_sum to "10 * get_pelt_divider(&cpu_rq(cpu)->cfs.avg)"
to match runqueues which were previously idle (therefore with prior periods'
contribution to the rq->load_sum being pretty much zero), and which have a
current period rq load_sum below or equal to 10us per 1024us (<= 1%):

static inline unsigned long cfs_rq_weighted_load_sum(struct cfs_rq *cfs_rq)
{
	return cfs_rq->avg.load_sum;
}

static unsigned long cpu_weighted_load_sum(struct rq *rq)
{
	return cfs_rq_weighted_load_sum(&rq->cfs);
}

/*
 * A runqueue is considered almost idle if:
 *
 *   cfs_rq->avg.load_sum / get_pelt_divider(&cfs_rq->avg) / 1024 <= 1%
 *
 * This inequality is transformed as follows to minimize arithmetic:
 *
 *   cfs_rq->avg.load_sum <= get_pelt_divider(&cfs_rq->avg) * 10
 */
static bool
almost_idle_cpu(int cpu, struct task_struct *p)
{
	if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
		return false;
	return cpu_weighted_load_sum(cpu_rq(cpu)) <= 10 * get_pelt_divider(&cpu_rq(cpu)->cfs.avg);
}

Does it make sense ?

Thanks,

Mathieu

>
>> And with this value "50", it would cover the case where there is only a
>> single task taking less than 50us per 1000us, and cases where the sum for
>> the set of tasks on the runqueue is taking less than 50us per 1000us
>> overall.
>>
>>>
>>> static bool
>>> almost_idle_cpu(int cpu, struct task_struct *p)
>>> {
>>> 	if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
>>> 		return false;
>>> 	return cpu_load_without(cpu_rq(cpu), p) <= ALMOST_IDLE_CPU_LOAD;
>>> }
>>>
>>> Tested this on Intel Xeon Platinum 8360Y, Ice Lake server, 36 core/package,
>>> total 72 core/144 CPUs. Slight improvement is observed in hackbench socket
>>> mode:
>>>
>>> socket mode:
>>> hackbench -g 16 -f 20 -l 480000 -s 100
>>>
>>> Before patch:
>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
>>> Each sender will pass 480000 messages of 100 bytes
>>> Time: 81.084
>>>
>>> After patch:
>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
>>> Each sender will pass 480000 messages of 100 bytes
>>> Time: 78.083
>>>
>>> pipe mode:
>>> hackbench -g 16 -f 20 --pipe -l 480000 -s 100
>>>
>>> Before patch:
>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
>>> Each sender will pass 480000 messages of 100 bytes
>>> Time: 38.219
>>>
>>> After patch:
>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
>>> Each sender will pass 480000 messages of 100 bytes
>>> Time: 38.348
>>>
>>> It suggests that, if the workload has a larger working-set/cache footprint,
>>> waking up the task on its previous CPU could get more benefit.
>>
>> In those tests, what is the average % of idleness of your cpus ?
>>
>
> For hackbench -g 16 -f 20 --pipe -l 480000 -s 100, it is around 8~10% idle
> For hackbench -g 16 -f 20 -l 480000 -s 100, it is around 2~3% idle
>
> Then the CPUs in package 1 are offlined to get a stable result when the
> group number is low.
> hackbench -g 1 -f 20 --pipe -l 480000 -s 100
> Some CPUs are busy, others are idle, and some are half-busy.
> Core	CPU	Busy%
> -	-	49.57
> 0	0	1.89
> 0	72	75.55
> 1	1	100.00
> 1	73	0.00
> 2	2	100.00
> 2	74	0.00
> 3	3	100.00
> 3	75	0.01
> 4	4	78.29
> 4	76	17.72
> 5	5	100.00
> 5	77	0.00
>
> hackbench -g 1 -f 20 -l 480000 -s 100
> Core	CPU	Busy%
> -	-	48.29
> 0	0	57.94
> 0	72	21.41
> 1	1	83.28
> 1	73	0.00
> 2	2	11.44
> 2	74	83.38
> 3	3	21.45
> 3	75	77.27
> 4	4	26.89
> 4	76	80.95
> 5	5	5.01
> 5	77	83.09
>
> echo NO_WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features
> hackbench -g 1 -f 20 --pipe -l 480000 -s 100
> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
> Each sender will pass 480000 messages of 100 bytes
> Time: 9.434
>
> echo WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features
> hackbench -g 1 -f 20 --pipe -l 480000 -s 100
> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
> Each sender will pass 480000 messages of 100 bytes
> Time: 9.373
>
> thanks,
> Chenyu
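As a quick sanity check on the ALMOST_IDLE_CPU_LOAD derivation quoted above,
here is a small stand-alone C sketch assuming the continuous PELT decay
y = 0.5^(1/32) and the kernel constants NICE_0_WEIGHT = 1024 and
LOAD_AVG_MAX = 47742; a nice-0 task running 50us out of every ~1000us window
indeed settles at a load_avg of about 50 (build with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
	/* PELT decay factor: a contribution halves every 32 periods. */
	double y = pow(0.5, 1.0 / 32);

	/* load_sum = 50 * (1 + y + y^2 + ...) for 50us of run per window. */
	double load_sum = 50.0 / (1.0 - y);

	/* load_avg ~= NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX, as above. */
	double load_avg = 1024.0 * load_sum / 47742.0;

	printf("load_sum = %.1f, load_avg = %.1f\n", load_sum, load_avg);
	return 0;	/* prints load_avg ~= 50.0 */
}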
On Thu, 12 Oct 2023 at 16:33, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
[...]
> Thanks for your thorough explanation, now it makes sense.
>
> I understand as well that the cfs_rq->avg.load_sum is the result of summing
> each task load_sum multiplied by their weight:

Please don't use load_sum but only *_avg.
As already said, util_avg or runnable_avg are better metrics for you.

[...]
On Wed, 11 Oct 2023 at 12:17, Chen Yu <yu.c.chen@intel.com> wrote:
[...]
> My understanding is that, the load_sum of a single task is a value in "us" out
> of a window of 1000 us, while the load_avg of the task will multiply the weight
> of the task.

I'm not sure we can say this. We use a 1024us sampling rate for calculating
the weighted average, but load_sum is in the range [0:47742], so what would
47742us out of a window of 1000us mean ?

Besides this, we have util_avg, in the range [0:cpu capacity], which gives
you the average running time of the cpu.

[...]
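Vincent's [0:47742] bound on load_sum follows from the same geometric series:
a permanently running entity contributes 1024us every period, decayed by
y = 0.5^(1/32). Below is a sketch of the continuous approximation; the
kernel's discrete fixed-point tables land on LOAD_AVG_MAX = 47742, about
0.1% below the value this prints (build with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32);	/* per-1024us PELT decay factor */

	/* 1024 * (1 + y + y^2 + ...) = 1024 / (1 - y) */
	double max_sum = 1024.0 / (1.0 - y);

	printf("continuous max load_sum: %.0f (kernel LOAD_AVG_MAX: 47742)\n",
	       max_sum);
	return 0;
}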
On 2023-10-12 11:01, Vincent Guittot wrote:
[...]
>> I understand as well that the cfs_rq->avg.load_sum is the result of summing
>> each task load_sum multiplied by their weight:
>
> Please don't use load_sum but only *_avg.
> As already said, util_avg or runnable_avg are better metrics for you

I think I found out why using util_avg was not working for me.

Considering this comment from cpu_util():

 * CPU utilization is the sum of running time of runnable tasks plus the
 * recent utilization of currently non-runnable tasks on that CPU.

I don't want to include the recent utilization of currently non-runnable
tasks on that CPU in order to choose that CPU to do task placement in a
context where many tasks were recently running on that cpu (but are
currently blocked). I do not want those blocked tasks to be part of the
avg.

So I think the issue here is that I was using the cpu_util() (and
cpu_util_without()) helpers which are considering max(util, runnable),
rather than just "util".

Based on your comments, just doing this to match a rq util_avg <= 1%
(10us of 1024us) seems to work fine:

	return cpu_rq(cpu)->cfs.avg.util_avg <= 10 * capacity_of(cpu);

Is this approach acceptable ?

Thanks!

Mathieu

[...]
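A numeric aside on the cutoff just proposed, assuming Vincent's earlier point
that util_avg lies in [0:cpu capacity] (so at most ~1024 on a full-capacity
CPU): "util_avg <= 10 * capacity_of(cpu)" compares against a bound of 10240,
which essentially any util_avg value satisfies, whereas a literal
1%-of-capacity test would compare against ~10. Vincent picks this up below
("it's not 10us of 1024us"). A sketch with hypothetical stand-in values:

#include <stdio.h>

#define CAPACITY 1024UL	/* stand-in for capacity_of(cpu), full capacity */

int main(void)
{
	unsigned long as_written = 10 * CAPACITY;	/* 10240 */
	unsigned long one_percent = CAPACITY / 100;	/* 10 */

	/* util_avg is bounded by the cpu capacity, so the as-written bound
	 * is never the limiting factor; the 1% reading is far stricter. */
	printf("as written:     util_avg <= %lu\n", as_written);
	printf("1%% of capacity: util_avg <= %lu\n", one_percent);
	return 0;
}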
On 2023-10-12 11:56, Mathieu Desnoyers wrote:
[...]
> So I think the issue here is that I was using the cpu_util() (and
> cpu_util_without()) helpers which are considering max(util, runnable),
> rather than just "util".

Actually, AFAIU the part of cpu_util() responsible for adding the
utilization of recently blocked tasks is the code under UTIL_EST.

Thanks,

Mathieu

[...]
On Thu, 12 Oct 2023 at 17:56, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
[...]
> I don't want to include the recent utilization of currently non-runnable
> tasks on that CPU in order to choose that CPU to do task placement in a
> context where many tasks were recently running on that cpu (but are
> currently blocked). I do not want those blocked tasks to be part of the
> avg.

But you have the exact same behavior with load_sum/avg.

> So I think the issue here is that I was using the cpu_util() (and
> cpu_util_without()) helpers which are considering max(util, runnable),
> rather than just "util".

cpu_util_without() only uses util_avg, not runnable_avg. Nevertheless,
cpu_util_without() and cpu_util() use util_est, which is used to predict
the final utilization.

Let's take the example of task A running 20ms every 200ms on CPU0.
The util_avg of the cpu will vary in the range [7:365]. When task A
wakes up on CPU0, CPU0 util_avg = 7 (below 1%), but task A will run for
20ms, which is not really almost idle. On the other side, CPU0 util_est
will be 365 as soon as task A is enqueued (which will be the value of
CPU0 util_avg just before going idle).

Let's now take a task B running 100us every 1024us.
The util_avg of the cpu should vary in the range [101:103], and once
task B is enqueued, CPU0 util_est will be 103.

> Based on your comments, just doing this to match a rq util_avg <= 1%
> (10us of 1024us)

it's not 10us of 1024us

> seems to work fine:
>
> 	return cpu_rq(cpu)->cfs.avg.util_avg <= 10 * capacity_of(cpu);
>
> Is this approach acceptable ?

[...]
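Vincent's task-A numbers can be reproduced with a small user-space PELT
approximation. This is a sketch only: one update per 1024us period,
continuous decay, no util_est and no fixed-point tables, so the endpoints
differ slightly from the kernel's (it prints roughly [8:365] against the
[7:365] quoted above; build with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32);	/* PELT decay per 1024us period */
	double u = 0.0;
	double lo = 1024.0, hi = 0.0;

	/* Task A: runs ~20 periods (20ms) out of every ~195 (200ms). */
	for (int cycle = 0; cycle < 50; cycle++) {
		for (int p = 0; p < 195; p++) {
			int running = p < 20;

			u = u * y + (running ? 1024.0 * (1.0 - y) : 0.0);
			if (cycle > 10) {	/* skip warm-up cycles */
				if (u < lo) lo = u;
				if (u > hi) hi = u;
			}
		}
	}
	printf("steady-state util_avg range: [%.0f:%.0f]\n", lo, hi);
	return 0;
}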
On 2023-10-12 12:24, Vincent Guittot wrote:
[...]
>> So I think the issue here is that I was using the cpu_util() (and
>> cpu_util_without()) helpers which are considering max(util, runnable),
>> rather than just "util".
>
> cpu_util_without() only uses util_avg, not runnable_avg.

Ah, yes, @boost=0, which prevents it from using the runnable_avg.

> Nevertheless, cpu_util_without() and cpu_util() use util_est, which is
> used to predict the final utilization.

Yes, I suspect it's the util_est which prevents me from getting
performance improvements when I use cpu_util_without() to implement
almost-idle.

> Let's take the example of task A running 20ms every 200ms on CPU0.
> The util_avg of the cpu will vary in the range [7:365]. When task A
> wakes up on CPU0, CPU0 util_avg = 7 (below 1%), but task A will run for
> 20ms, which is not really almost idle. On the other side, CPU0 util_est
> will be 365 as soon as task A is enqueued (which will be the value of
> CPU0 util_avg just before going idle).

If task A sleeps (becomes non-runnable) without being migrated, and
therefore still has CPU0 as its cpu, is it still considered part of the
util_est of CPU0 while it is blocked ?

If so, then util_est prevents rq selection from considering a rq almost
idle when waking up sleeping tasks, because it takes the set of sleeping
tasks into account in its utilization estimate.

> Let's now take a task B running 100us every 1024us.
> The util_avg of the cpu should vary in the range [101:103], and once
> task B is enqueued, CPU0 util_est will be 103.
>
>> Based on your comments, just doing this to match a rq util_avg <= 1%
>> (10us of 1024us)
>
> it's not 10us of 1024us

Is the range of util_avg within [0..1024] * capacity_of(cpu), or am I
missing something ?

Thanks,

Mathieu

[...]
Slight improvement is observed in hackbench socket mode: >>>>>>> >>>>>>> socket mode: >>>>>>> hackbench -g 16 -f 20 -l 480000 -s 100 >>>>>>> >>>>>>> Before patch: >>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks) >>>>>>> Each sender will pass 480000 messages of 100 bytes >>>>>>> Time: 81.084 >>>>>>> >>>>>>> After patch: >>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks) >>>>>>> Each sender will pass 480000 messages of 100 bytes >>>>>>> Time: 78.083 >>>>>>> >>>>>>> >>>>>>> pipe mode: >>>>>>> hackbench -g 16 -f 20 --pipe -l 480000 -s 100 >>>>>>> >>>>>>> Before patch: >>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks) >>>>>>> Each sender will pass 480000 messages of 100 bytes >>>>>>> Time: 38.219 >>>>>>> >>>>>>> After patch: >>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks) >>>>>>> Each sender will pass 480000 messages of 100 bytes >>>>>>> Time: 38.348 >>>>>>> >>>>>>> It suggests that, if the workload has larger working-set/cache footprint, waking up >>>>>>> the task on its previous CPU could get more benefit. >>>>>> >>>>>> In those tests, what is the average % of idleness of your cpus ? >>>>>> >>>>> >>>>> For hackbench -g 16 -f 20 --pipe -l 480000 -s 100, it is around 8~10% idle >>>>> For hackbench -g 16 -f 20 -l 480000 -s 100, it is around 2~3% idle >>>>> >>>>> Then the CPUs in packge 1 are offlined to get stable result when the group number is low. >>>>> hackbench -g 1 -f 20 --pipe -l 480000 -s 100 >>>>> Some CPUs are busy, others are idle, and some are half-busy. >>>>> Core CPU Busy% >>>>> - - 49.57 >>>>> 0 0 1.89 >>>>> 0 72 75.55 >>>>> 1 1 100.00 >>>>> 1 73 0.00 >>>>> 2 2 100.00 >>>>> 2 74 0.00 >>>>> 3 3 100.00 >>>>> 3 75 0.01 >>>>> 4 4 78.29 >>>>> 4 76 17.72 >>>>> 5 5 100.00 >>>>> 5 77 0.00 >>>>> >>>>> >>>>> hackbench -g 1 -f 20 -l 480000 -s 100 >>>>> Core CPU Busy% >>>>> - - 48.29 >>>>> 0 0 57.94 >>>>> 0 72 21.41 >>>>> 1 1 83.28 >>>>> 1 73 0.00 >>>>> 2 2 11.44 >>>>> 2 74 83.38 >>>>> 3 3 21.45 >>>>> 3 75 77.27 >>>>> 4 4 26.89 >>>>> 4 76 80.95 >>>>> 5 5 5.01 >>>>> 5 77 83.09 >>>>> >>>>> >>>>> echo NO_WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features >>>>> hackbench -g 1 -f 20 --pipe -l 480000 -s 100 >>>>> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks) >>>>> Each sender will pass 480000 messages of 100 bytes >>>>> Time: 9.434 >>>>> >>>>> echo WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features >>>>> hackbench -g 1 -f 20 --pipe -l 480000 -s 100 >>>>> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks) >>>>> Each sender will pass 480000 messages of 100 bytes >>>>> Time: 9.373 >>>>> >>>>> thanks, >>>>> Chenyu >>>> >>>> -- >>>> Mathieu Desnoyers >>>> EfficiOS Inc. >>>> https://www.efficios.com >>>> >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >>
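The 1% inequality Mathieu proposes in the message above is easy to sanity-check outside the kernel. The following is a standalone back-of-the-envelope sketch, not kernel code: it uses the continuous-series approximation of PELT (the kernel's discrete segment accounting yields LOAD_AVG_MAX = 47742, slightly below the ideal series), and assumes a single nice-0 task that runs 10us out of every 1024us window:

#include <stdio.h>
#include <math.h>

int main(void)
{
	/* PELT decays by y per 1024us window, chosen so that y^32 = 0.5 */
	double y = pow(0.5, 1.0 / 32.0);
	/* steady-state divider: 1024 * (1 + y + y^2 + ...) ~= LOAD_AVG_MAX */
	double divider = 1024.0 / (1.0 - y);
	/* a nice-0 task running busy_us out of every 1024us window */
	double busy_us = 10.0;
	/* its load_sum converges to busy_us * (1 + y + y^2 + ...) */
	double se_load_sum = busy_us / (1.0 - y);
	/* rq load_sum is weighted: cfs_rq->avg.load_sum += se_weight * se->load_sum */
	double rq_load_sum = 1024.0 * se_load_sum;

	printf("divider      ~= %.0f\n", divider);        /* ~47788 */
	printf("rq load_sum  ~= %.0f\n", rq_load_sum);    /* ~477882 */
	printf("10 * divider ~= %.0f\n", 10.0 * divider); /* ~477884 */
	return 0;
}

At exactly 10us per window, the runqueue load_sum sits right at the "10 * divider" cutoff, which is what makes the transformed inequality a ~1% busy threshold.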
On Thu, 12 Oct 2023 at 18:48, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2023-10-12 12:24, Vincent Guittot wrote: > > On Thu, 12 Oct 2023 at 17:56, Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > >> > >> On 2023-10-12 11:01, Vincent Guittot wrote: > >>> On Thu, 12 Oct 2023 at 16:33, Mathieu Desnoyers > >>> <mathieu.desnoyers@efficios.com> wrote: > >>>> > >>>> On 2023-10-11 06:16, Chen Yu wrote: > >>>>> On 2023-10-10 at 09:49:54 -0400, Mathieu Desnoyers wrote: > >>>>>> On 2023-10-09 01:14, Chen Yu wrote: > >>>>>>> On 2023-09-30 at 07:45:38 -0400, Mathieu Desnoyers wrote: > >>>>>>>> On 9/30/23 03:11, Chen Yu wrote: > >>>>>>>>> Hi Mathieu, > >>>>>>>>> > >>>>>>>>> On 2023-09-29 at 14:33:50 -0400, Mathieu Desnoyers wrote: > >>>>>>>>>> Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases > >>>>>>>>>> select_task_rq towards the previous CPU if it was almost idle > >>>>>>>>>> (avg_load <= 0.1%). > >>>>>>>>> > >>>>>>>>> Yes, this is a promising direction IMO. One question is that, > >>>>>>>>> can cfs_rq->avg.load_avg be used for percentage comparison? > >>>>>>>>> If I understand correctly, load_avg reflects that more than > >>>>>>>>> 1 tasks could have been running this runqueue, and the > >>>>>>>>> load_avg is the direct proportion to the load_weight of that > >>>>>>>>> cfs_rq. Besides, LOAD_AVG_MAX seems to not be the max value > >>>>>>>>> that load_avg can reach, it is the sum of > >>>>>>>>> 1024 * (y + y^1 + y^2 ... ) > >>>>>>>>> > >>>>>>>>> For example, > >>>>>>>>> taskset -c 1 nice -n -20 stress -c 1 > >>>>>>>>> cat /sys/kernel/debug/sched/debug | grep 'cfs_rq\[1\]' -A 12 | grep "\.load_avg" > >>>>>>>>> .load_avg : 88763 > >>>>>>>>> .load_avg : 1024 > >>>>>>>>> > >>>>>>>>> 88763 is higher than LOAD_AVG_MAX=47742 > >>>>>>>> > >>>>>>>> I would have expected the load_avg to be limited to LOAD_AVG_MAX somehow, > >>>>>>>> but it appears that it does not happen in practice. > >>>>>>>> > >>>>>>>> That being said, if the cutoff is really at 0.1% or 0.2% of the real max, > >>>>>>>> does it really matter ? > >>>>>>>> > >>>>>>>>> Maybe the util_avg can be used for precentage comparison I suppose? > >>>>>>>> [...] > >>>>>>>>> Or > >>>>>>>>> return cpu_util_without(cpu_rq(cpu), p) * 1000 <= capacity_orig_of(cpu) ? > >>>>>>>> > >>>>>>>> Unfortunately using util_avg does not seem to work based on my testing. > >>>>>>>> Even at utilization thresholds at 0.1%, 1% and 10%. > >>>>>>>> > >>>>>>>> Based on comments in fair.c: > >>>>>>>> > >>>>>>>> * CPU utilization is the sum of running time of runnable tasks plus the > >>>>>>>> * recent utilization of currently non-runnable tasks on that CPU. > >>>>>>>> > >>>>>>>> I think we don't want to include currently non-runnable tasks in the > >>>>>>>> statistics we use, because we are trying to figure out if the cpu is a > >>>>>>>> idle-enough target based on the tasks which are currently running, for the > >>>>>>>> purpose of runqueue selection when waking up a task which is considered at > >>>>>>>> that point in time a non-runnable task on that cpu, and which is about to > >>>>>>>> become runnable again. > >>>>>>>> > >>>>>>> > >>>>>>> Although LOAD_AVG_MAX is not the max possible load_avg, we still want to find > >>>>>>> a proper threshold to decide if the CPU is almost idle. 
The LOAD_AVG_MAX > >>>>>>> based threshold is modified a little bit: > >>>>>>> > >>>>>>> The theory is, if there is only 1 task on the CPU, and that task has a nice > >>>>>>> of 0, the task runs 50 us every 1000 us, then this CPU is regarded as almost > >>>>>>> idle. > >>>>>>> > >>>>>>> The load_sum of the task is: > >>>>>>> 50 * (1 + y + y^2 + ... + y^n) > >>>>>>> The corresponding avg_load of the task is approximately > >>>>>>> NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX = 50. > >>>>>>> So: > >>>>>>> > >>>>>>> /* which is close to LOAD_AVG_MAX/1000 = 47 */ > >>>>>>> #define ALMOST_IDLE_CPU_LOAD 50 > >>>>>> > >>>>>> Sorry to be slow at understanding this concept, but this whole "load" value > >>>>>> is still somewhat magic to me. > >>>>>> > >>>>>> Should it vary based on CONFIG_HZ_{100,250,300,1000}, or is it independent ? > >>>>>> Where is it documented that the load is a value in "us" out of a window of > >>>>>> 1000 us ? > >>>>>> > >>>>> > >>>>> My understanding is that, the load_sum of a single task is a value in "us" out > >>>>> of a window of 1000 us, while the load_avg of the task will multiply the weight > >>>>> of the task. In this case a task with nice 0 is NICE_0_WEIGHT = 1024. > >>>>> > >>>>> __update_load_avg_se -> ___update_load_sum calculate the load_sum of a task(there > >>>>> is comments around ___update_load_sum to describe the pelt calculation), > >>>>> and ___update_load_avg() calculate the load_avg based on the task's weight. > >>>> > >>>> Thanks for your thorough explanation, now it makes sense. > >>>> > >>>> I understand as well that the cfs_rq->avg.load_sum is the result of summing > >>>> each task load_sum multiplied by their weight: > >>> > >>> Please don't use load_sum but only *_avg. > >>> As already said, util_avg or runnable_avg are better metrics for you > >> > >> I think I found out why using util_avg was not working for me. > >> > >> Considering this comment from cpu_util(): > >> > >> * CPU utilization is the sum of running time of runnable tasks plus the > >> * recent utilization of currently non-runnable tasks on that CPU. > >> > >> I don't want to include the recent utilization of currently non-runnable > >> tasks on that CPU in order to choose that CPU to do task placement in a > >> context where many tasks were recently running on that cpu (but are > >> currently blocked). I do not want those blocked tasks to be part of the > >> avg. > > > > But you have the exact same behavior with load_sum/avg. > > > >> > >> So I think the issue here is that I was using the cpu_util() (and > >> cpu_util_without()) helpers which are considering max(util, runnable), > >> rather than just "util". > > > > cpu_util_without() only use util_avg but not runnable_avg. > > Ah, yes, @boost=0, which prevents it from using the runnable_avg. > > > Nevertheless, cpu_util_without ans cpu_util uses util_est which is > > used to predict the final utilization. > > Yes, I suspect it's the util_est which prevents me from getting > performance improvements when I use cpu_util_without to implement > almost-idle. > > > > > Let's take the example of task A running 20ms every 200ms on CPU0. > > The util_avg of the cpu will vary in the range [7:365]. When task A > > wakes up on CPU0, CPU0 util_avg = 7 (below 1%) but taskA will run for > > 20ms which is not really almost idle. 
On the other side, CPU0 util_est > > will be 365 as soon as task A is enqueued (which will be the value of > > CPU0 util_avg just before going idle) > > If task A sleeps (becomes non-runnable) without being migrated, and therefore > still have CPU0 as its cpu, is it still considered as part of the util_est of > CPU0 while it is blocked ? If it is the case, then the util_est is preventing

No, the util_est of a CPU only accounts for the runnable tasks, not the sleeping ones:

CPU's util_est = Sum of the util_est of the runnable tasks

> rq selection from considering a rq almost idle when waking up sleeping tasks > due to taking into account the set of sleeping tasks in its utilization estimate. > > > > Let's now take a task B running 100us every 1024us > > The util_avg of the cpu should vary in the range [101:103] and once > > task B is enqueued, CPU0 util_est will be 103 > > > >> > >> Based on your comments, just doing this to match a rq util_avg <= 1% (10us of 1024us) > > > > it's not 10us of 1024us > > Is the range of util_avg within [0..1024] * capacity_of(cpu), or am I missing something ?

util_avg is in the range [0:1024] and the CPU's capacity is in the range [0:1024] too. 1024 is the compute capacity of the most powerful CPU of the system. On an SMP system, all CPUs have a capacity of 1024. On a heterogeneous system, big cores have a capacity of 1024 and the others have a lower capacity.

> > Thanks, > > Mathieu > > > > >> seems to work fine: > >> > >> return cpu_rq(cpu)->cfs.avg.util_avg <= 10 * capacity_of(cpu); > >> > >> Is this approach acceptable ? > >> > >> Thanks! > >> > >> Mathieu > >> > >>> > >>>> > >>>> static inline void > >>>> enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) > >>>> { > >>>> cfs_rq->avg.load_avg += se->avg.load_avg; > >>>> cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum; > >>>> } > >>>> > >>>> Therefore I think we need to multiply the load_sum value we aim for by > >>>> get_pelt_divider(&cpu_rq(cpu)->cfs.avg) to compare it to a rq load_sum. > >>>> > >>>> I plan to compare the rq load sum to "10 * get_pelt_divider(&cpu_rq(cpu)->cfs.avg)" > >>>> to match runqueues which were previously idle (therefore with prior periods contribution > >>>> to the rq->load_sum being pretty much zero), and which have a current period rq load_sum > >>>> below or equal 10us per 1024us (<= 1%): > >>>> > >>>> static inline unsigned long cfs_rq_weighted_load_sum(struct cfs_rq *cfs_rq) > >>>> { > >>>> return cfs_rq->avg.load_sum; > >>>> } > >>>> > >>>> static unsigned long cpu_weighted_load_sum(struct rq *rq) > >>>> { > >>>> return cfs_rq_weighted_load_sum(&rq->cfs); > >>>> } > >>>> > >>>> /* > >>>> * A runqueue is considered almost idle if: > >>>> * > >>>> * cfs_rq->avg.load_sum / get_pelt_divider(&cfs_rq->avg) / 1024 <= 1% > >>>> * > >>>> * This inequality is transformed as follows to minimize arithmetic: > >>>> * > >>>> * cfs_rq->avg.load_sum <= get_pelt_divider(&cfs_rq->avg) * 10 > >>>> */ > >>>> static bool > >>>> almost_idle_cpu(int cpu, struct task_struct *p) > >>>> { > >>>> if (!sched_feat(WAKEUP_BIAS_PREV_IDLE)) > >>>> return false; > >>>> return cpu_weighted_load_sum(cpu_rq(cpu)) <= 10 * get_pelt_divider(&cpu_rq(cpu)->cfs.avg); > >>>> } > >>>> > >>>> Does it make sense ?
> >>>> > >>>> Thanks, > >>>> > >>>> Mathieu > >>>> > >>>> > >>>>> > >>>>>> And with this value "50", it would cover the case where there is only a > >>>>>> single task taking less than 50us per 1000us, and cases where the sum for > >>>>>> the set of tasks on the runqueue is taking less than 50us per 1000us > >>>>>> overall. > >>>>>> > >>>>>>> > >>>>>>> static bool > >>>>>>> almost_idle_cpu(int cpu, struct task_struct *p) > >>>>>>> { > >>>>>>> if (!sched_feat(WAKEUP_BIAS_PREV_IDLE)) > >>>>>>> return false; > >>>>>>> return cpu_load_without(cpu_rq(cpu), p) <= ALMOST_IDLE_CPU_LOAD; > >>>>>>> } > >>>>>>> > >>>>>>> Tested this on Intel Xeon Platinum 8360Y, Ice Lake server, 36 core/package, > >>>>>>> total 72 core/144 CPUs. Slight improvement is observed in hackbench socket mode: > >>>>>>> > >>>>>>> socket mode: > >>>>>>> hackbench -g 16 -f 20 -l 480000 -s 100 > >>>>>>> > >>>>>>> Before patch: > >>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks) > >>>>>>> Each sender will pass 480000 messages of 100 bytes > >>>>>>> Time: 81.084 > >>>>>>> > >>>>>>> After patch: > >>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks) > >>>>>>> Each sender will pass 480000 messages of 100 bytes > >>>>>>> Time: 78.083 > >>>>>>> > >>>>>>> > >>>>>>> pipe mode: > >>>>>>> hackbench -g 16 -f 20 --pipe -l 480000 -s 100 > >>>>>>> > >>>>>>> Before patch: > >>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks) > >>>>>>> Each sender will pass 480000 messages of 100 bytes > >>>>>>> Time: 38.219 > >>>>>>> > >>>>>>> After patch: > >>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks) > >>>>>>> Each sender will pass 480000 messages of 100 bytes > >>>>>>> Time: 38.348 > >>>>>>> > >>>>>>> It suggests that, if the workload has larger working-set/cache footprint, waking up > >>>>>>> the task on its previous CPU could get more benefit. > >>>>>> > >>>>>> In those tests, what is the average % of idleness of your cpus ? > >>>>>> > >>>>> > >>>>> For hackbench -g 16 -f 20 --pipe -l 480000 -s 100, it is around 8~10% idle > >>>>> For hackbench -g 16 -f 20 -l 480000 -s 100, it is around 2~3% idle > >>>>> > >>>>> Then the CPUs in packge 1 are offlined to get stable result when the group number is low. > >>>>> hackbench -g 1 -f 20 --pipe -l 480000 -s 100 > >>>>> Some CPUs are busy, others are idle, and some are half-busy. 
> >>>>> Core CPU Busy% > >>>>> - - 49.57 > >>>>> 0 0 1.89 > >>>>> 0 72 75.55 > >>>>> 1 1 100.00 > >>>>> 1 73 0.00 > >>>>> 2 2 100.00 > >>>>> 2 74 0.00 > >>>>> 3 3 100.00 > >>>>> 3 75 0.01 > >>>>> 4 4 78.29 > >>>>> 4 76 17.72 > >>>>> 5 5 100.00 > >>>>> 5 77 0.00 > >>>>> > >>>>> > >>>>> hackbench -g 1 -f 20 -l 480000 -s 100 > >>>>> Core CPU Busy% > >>>>> - - 48.29 > >>>>> 0 0 57.94 > >>>>> 0 72 21.41 > >>>>> 1 1 83.28 > >>>>> 1 73 0.00 > >>>>> 2 2 11.44 > >>>>> 2 74 83.38 > >>>>> 3 3 21.45 > >>>>> 3 75 77.27 > >>>>> 4 4 26.89 > >>>>> 4 76 80.95 > >>>>> 5 5 5.01 > >>>>> 5 77 83.09 > >>>>> > >>>>> > >>>>> echo NO_WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features > >>>>> hackbench -g 1 -f 20 --pipe -l 480000 -s 100 > >>>>> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks) > >>>>> Each sender will pass 480000 messages of 100 bytes > >>>>> Time: 9.434 > >>>>> > >>>>> echo WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features > >>>>> hackbench -g 1 -f 20 --pipe -l 480000 -s 100 > >>>>> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks) > >>>>> Each sender will pass 480000 messages of 100 bytes > >>>>> Time: 9.373 > >>>>> > >>>>> thanks, > >>>>> Chenyu > >>>> > >>>> -- > >>>> Mathieu Desnoyers > >>>> EfficiOS Inc. > >>>> https://www.efficios.com > >>>> > >> > >> -- > >> Mathieu Desnoyers > >> EfficiOS Inc. > >> https://www.efficios.com > >> > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
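Vincent's task-A example can be reproduced with a window-by-window simulation. This is a sketch under simplifying assumptions (1024us windows treated as ~1ms, accrual at whole-window granularity), so the exact values differ slightly from a real PELT trace:

#include <stdio.h>
#include <math.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* PELT decay per 1024us window */
	double max_sum = 1024.0 / (1.0 - y);	/* normalization for util_avg */
	double sum = 0.0, lo = 1e9, hi = 0.0, at_dequeue = 0.0;
	int w;

	for (w = 0; w < 200 * 50; w++) {	/* 50 periods of ~200ms */
		int running = (w % 200) < 20;	/* task A: runs ~20ms every ~200ms */
		sum = sum * y + (running ? 1024.0 : 0.0);
		if (w >= 200 * 40) {		/* sample once converged */
			double util = 1024.0 * sum / max_sum;
			if (util < lo) lo = util;
			if (util > hi) hi = util;
			if ((w % 200) == 19)	/* util_avg when A stops running */
				at_dequeue = util;
		}
	}
	printf("cpu util_avg range ~ [%.0f:%.0f]\n", lo, hi);	/* ~[7:365] */
	/* util_est snapshots util_avg at dequeue and is re-added on enqueue */
	printf("cpu util_est once A is enqueued ~ %.0f\n", at_dequeue);
	return 0;
}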
On 2023-10-12 13:00, Vincent Guittot wrote: > On Thu, 12 Oct 2023 at 18:48, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> On 2023-10-12 12:24, Vincent Guittot wrote: >>> On Thu, 12 Oct 2023 at 17:56, Mathieu Desnoyers >>> <mathieu.desnoyers@efficios.com> wrote: >>>> >>>> On 2023-10-12 11:01, Vincent Guittot wrote: >>>>> On Thu, 12 Oct 2023 at 16:33, Mathieu Desnoyers >>>>> <mathieu.desnoyers@efficios.com> wrote: >>>>>> >>>>>> On 2023-10-11 06:16, Chen Yu wrote: >>>>>>> On 2023-10-10 at 09:49:54 -0400, Mathieu Desnoyers wrote: >>>>>>>> On 2023-10-09 01:14, Chen Yu wrote: >>>>>>>>> On 2023-09-30 at 07:45:38 -0400, Mathieu Desnoyers wrote: >>>>>>>>>> On 9/30/23 03:11, Chen Yu wrote: >>>>>>>>>>> Hi Mathieu, >>>>>>>>>>> >>>>>>>>>>> On 2023-09-29 at 14:33:50 -0400, Mathieu Desnoyers wrote: >>>>>>>>>>>> Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases >>>>>>>>>>>> select_task_rq towards the previous CPU if it was almost idle >>>>>>>>>>>> (avg_load <= 0.1%). >>>>>>>>>>> >>>>>>>>>>> Yes, this is a promising direction IMO. One question is that, >>>>>>>>>>> can cfs_rq->avg.load_avg be used for percentage comparison? >>>>>>>>>>> If I understand correctly, load_avg reflects that more than >>>>>>>>>>> 1 tasks could have been running this runqueue, and the >>>>>>>>>>> load_avg is the direct proportion to the load_weight of that >>>>>>>>>>> cfs_rq. Besides, LOAD_AVG_MAX seems to not be the max value >>>>>>>>>>> that load_avg can reach, it is the sum of >>>>>>>>>>> 1024 * (y + y^1 + y^2 ... ) >>>>>>>>>>> >>>>>>>>>>> For example, >>>>>>>>>>> taskset -c 1 nice -n -20 stress -c 1 >>>>>>>>>>> cat /sys/kernel/debug/sched/debug | grep 'cfs_rq\[1\]' -A 12 | grep "\.load_avg" >>>>>>>>>>> .load_avg : 88763 >>>>>>>>>>> .load_avg : 1024 >>>>>>>>>>> >>>>>>>>>>> 88763 is higher than LOAD_AVG_MAX=47742 >>>>>>>>>> >>>>>>>>>> I would have expected the load_avg to be limited to LOAD_AVG_MAX somehow, >>>>>>>>>> but it appears that it does not happen in practice. >>>>>>>>>> >>>>>>>>>> That being said, if the cutoff is really at 0.1% or 0.2% of the real max, >>>>>>>>>> does it really matter ? >>>>>>>>>> >>>>>>>>>>> Maybe the util_avg can be used for precentage comparison I suppose? >>>>>>>>>> [...] >>>>>>>>>>> Or >>>>>>>>>>> return cpu_util_without(cpu_rq(cpu), p) * 1000 <= capacity_orig_of(cpu) ? >>>>>>>>>> >>>>>>>>>> Unfortunately using util_avg does not seem to work based on my testing. >>>>>>>>>> Even at utilization thresholds at 0.1%, 1% and 10%. >>>>>>>>>> >>>>>>>>>> Based on comments in fair.c: >>>>>>>>>> >>>>>>>>>> * CPU utilization is the sum of running time of runnable tasks plus the >>>>>>>>>> * recent utilization of currently non-runnable tasks on that CPU. >>>>>>>>>> >>>>>>>>>> I think we don't want to include currently non-runnable tasks in the >>>>>>>>>> statistics we use, because we are trying to figure out if the cpu is a >>>>>>>>>> idle-enough target based on the tasks which are currently running, for the >>>>>>>>>> purpose of runqueue selection when waking up a task which is considered at >>>>>>>>>> that point in time a non-runnable task on that cpu, and which is about to >>>>>>>>>> become runnable again. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Although LOAD_AVG_MAX is not the max possible load_avg, we still want to find >>>>>>>>> a proper threshold to decide if the CPU is almost idle. 
The LOAD_AVG_MAX >>>>>>>>> based threshold is modified a little bit: >>>>>>>>> >>>>>>>>> The theory is, if there is only 1 task on the CPU, and that task has a nice >>>>>>>>> of 0, the task runs 50 us every 1000 us, then this CPU is regarded as almost >>>>>>>>> idle. >>>>>>>>> >>>>>>>>> The load_sum of the task is: >>>>>>>>> 50 * (1 + y + y^2 + ... + y^n) >>>>>>>>> The corresponding avg_load of the task is approximately >>>>>>>>> NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX = 50. >>>>>>>>> So: >>>>>>>>> >>>>>>>>> /* which is close to LOAD_AVG_MAX/1000 = 47 */ >>>>>>>>> #define ALMOST_IDLE_CPU_LOAD 50 >>>>>>>> >>>>>>>> Sorry to be slow at understanding this concept, but this whole "load" value >>>>>>>> is still somewhat magic to me. >>>>>>>> >>>>>>>> Should it vary based on CONFIG_HZ_{100,250,300,1000}, or is it independent ? >>>>>>>> Where is it documented that the load is a value in "us" out of a window of >>>>>>>> 1000 us ? >>>>>>>> >>>>>>> >>>>>>> My understanding is that, the load_sum of a single task is a value in "us" out >>>>>>> of a window of 1000 us, while the load_avg of the task will multiply the weight >>>>>>> of the task. In this case a task with nice 0 is NICE_0_WEIGHT = 1024. >>>>>>> >>>>>>> __update_load_avg_se -> ___update_load_sum calculate the load_sum of a task(there >>>>>>> is comments around ___update_load_sum to describe the pelt calculation), >>>>>>> and ___update_load_avg() calculate the load_avg based on the task's weight. >>>>>> >>>>>> Thanks for your thorough explanation, now it makes sense. >>>>>> >>>>>> I understand as well that the cfs_rq->avg.load_sum is the result of summing >>>>>> each task load_sum multiplied by their weight: >>>>> >>>>> Please don't use load_sum but only *_avg. >>>>> As already said, util_avg or runnable_avg are better metrics for you >>>> >>>> I think I found out why using util_avg was not working for me. >>>> >>>> Considering this comment from cpu_util(): >>>> >>>> * CPU utilization is the sum of running time of runnable tasks plus the >>>> * recent utilization of currently non-runnable tasks on that CPU. >>>> >>>> I don't want to include the recent utilization of currently non-runnable >>>> tasks on that CPU in order to choose that CPU to do task placement in a >>>> context where many tasks were recently running on that cpu (but are >>>> currently blocked). I do not want those blocked tasks to be part of the >>>> avg. >>> >>> But you have the exact same behavior with load_sum/avg. >>> >>>> >>>> So I think the issue here is that I was using the cpu_util() (and >>>> cpu_util_without()) helpers which are considering max(util, runnable), >>>> rather than just "util". >>> >>> cpu_util_without() only use util_avg but not runnable_avg. >> >> Ah, yes, @boost=0, which prevents it from using the runnable_avg. >> >>> Nevertheless, cpu_util_without ans cpu_util uses util_est which is >>> used to predict the final utilization. >> >> Yes, I suspect it's the util_est which prevents me from getting >> performance improvements when I use cpu_util_without to implement >> almost-idle. >> >>> >>> Let's take the example of task A running 20ms every 200ms on CPU0. >>> The util_avg of the cpu will vary in the range [7:365]. When task A >>> wakes up on CPU0, CPU0 util_avg = 7 (below 1%) but taskA will run for >>> 20ms which is not really almost idle. 
On the other side, CPU0 util_est >>> will be 365 as soon as task A is enqueued (which will be the value of >>> CPU0 util_avg just before going idle) >> >> If task A sleeps (becomes non-runnable) without being migrated, and therefore >> still have CPU0 as its cpu, is it still considered as part of the util_est of >> CPU0 while it is blocked ? If it is the case, then the util_est is preventing > > No, the util_est of a cpu only accounts the runnable tasks not the sleeping one > > CPU's util_est = /Sum of the util_est of runnable tasks

OK, after further testing, it turns out that cpu_util_without() works for my case now. I am not sure what I got wrong in my past attempts.

> >> rq selection from considering a rq almost idle when waking up sleeping tasks >> due to taking into account the set of sleeping tasks in its utilization estimate. >> >>> >>> Let's now take a task B running 100us every 1024us >>> The util_avg of the cpu should vary in the range [101:103] and once >>> task B is enqueued, CPU0 util_est will be 103 >>> >>>> >>>> Based on your comments, just doing this to match a rq util_avg <= 1% (10us of 1024us) >>> >>> it's not 10us of 1024us >> >> Is the range of util_avg within [0..1024] * capacity_of(cpu), or am I missing something ? > > util_avg is in the range [0:1024] and CPU's capacity is in the range > [0:1024] too. 1024 is the compute capacity of the most powerful CPU of > the system. On SMP system, all CPUs have a capacity of 1024. On > heterogeneous system, big core have a capacity of 1024 and others will > have a lower capacity

Sounds good, I will prepare an updated patch.

Thanks,

Mathieu

> > >> >> Thanks, >> >> Mathieu >> >> >>> >>>> seems to work fine: >>>> >>>> return cpu_rq(cpu)->cfs.avg.util_avg <= 10 * capacity_of(cpu); >>>> >>>> Is this approach acceptable ? >>>> >>>> Thanks! >>>> >>>> Mathieu >>>> >>>>> >>>>>> >>>>>> static inline void >>>>>> enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) >>>>>> { >>>>>> cfs_rq->avg.load_avg += se->avg.load_avg; >>>>>> cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum; >>>>>> } >>>>>> >>>>>> Therefore I think we need to multiply the load_sum value we aim for by >>>>>> get_pelt_divider(&cpu_rq(cpu)->cfs.avg) to compare it to a rq load_sum. >>>>>> >>>>>> I plan to compare the rq load sum to "10 * get_pelt_divider(&cpu_rq(cpu)->cfs.avg)" >>>>>> to match runqueues which were previously idle (therefore with prior periods contribution >>>>>> to the rq->load_sum being pretty much zero), and which have a current period rq load_sum >>>>>> below or equal 10us per 1024us (<= 1%): >>>>>> >>>>>> static inline unsigned long cfs_rq_weighted_load_sum(struct cfs_rq *cfs_rq) >>>>>> { >>>>>> return cfs_rq->avg.load_sum; >>>>>> } >>>>>> >>>>>> static unsigned long cpu_weighted_load_sum(struct rq *rq) >>>>>> { >>>>>> return cfs_rq_weighted_load_sum(&rq->cfs); >>>>>> } >>>>>> >>>>>> /* >>>>>> * A runqueue is considered almost idle if: >>>>>> * >>>>>> * cfs_rq->avg.load_sum / get_pelt_divider(&cfs_rq->avg) / 1024 <= 1% >>>>>> * >>>>>> * This inequality is transformed as follows to minimize arithmetic: >>>>>> * >>>>>> * cfs_rq->avg.load_sum <= get_pelt_divider(&cfs_rq->avg) * 10 >>>>>> */ >>>>>> static bool >>>>>> almost_idle_cpu(int cpu, struct task_struct *p) >>>>>> { >>>>>> if (!sched_feat(WAKEUP_BIAS_PREV_IDLE)) >>>>>> return false; >>>>>> return cpu_weighted_load_sum(cpu_rq(cpu)) <= 10 * get_pelt_divider(&cpu_rq(cpu)->cfs.avg); >>>>>> } >>>>>> >>>>>> Does it make sense ?
>>>>>> >>>>>> Thanks, >>>>>> >>>>>> Mathieu >>>>>> >>>>>> >>>>>>> >>>>>>>> And with this value "50", it would cover the case where there is only a >>>>>>>> single task taking less than 50us per 1000us, and cases where the sum for >>>>>>>> the set of tasks on the runqueue is taking less than 50us per 1000us >>>>>>>> overall. >>>>>>>> >>>>>>>>> >>>>>>>>> static bool >>>>>>>>> almost_idle_cpu(int cpu, struct task_struct *p) >>>>>>>>> { >>>>>>>>> if (!sched_feat(WAKEUP_BIAS_PREV_IDLE)) >>>>>>>>> return false; >>>>>>>>> return cpu_load_without(cpu_rq(cpu), p) <= ALMOST_IDLE_CPU_LOAD; >>>>>>>>> } >>>>>>>>> >>>>>>>>> Tested this on Intel Xeon Platinum 8360Y, Ice Lake server, 36 core/package, >>>>>>>>> total 72 core/144 CPUs. Slight improvement is observed in hackbench socket mode: >>>>>>>>> >>>>>>>>> socket mode: >>>>>>>>> hackbench -g 16 -f 20 -l 480000 -s 100 >>>>>>>>> >>>>>>>>> Before patch: >>>>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks) >>>>>>>>> Each sender will pass 480000 messages of 100 bytes >>>>>>>>> Time: 81.084 >>>>>>>>> >>>>>>>>> After patch: >>>>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks) >>>>>>>>> Each sender will pass 480000 messages of 100 bytes >>>>>>>>> Time: 78.083 >>>>>>>>> >>>>>>>>> >>>>>>>>> pipe mode: >>>>>>>>> hackbench -g 16 -f 20 --pipe -l 480000 -s 100 >>>>>>>>> >>>>>>>>> Before patch: >>>>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks) >>>>>>>>> Each sender will pass 480000 messages of 100 bytes >>>>>>>>> Time: 38.219 >>>>>>>>> >>>>>>>>> After patch: >>>>>>>>> Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks) >>>>>>>>> Each sender will pass 480000 messages of 100 bytes >>>>>>>>> Time: 38.348 >>>>>>>>> >>>>>>>>> It suggests that, if the workload has larger working-set/cache footprint, waking up >>>>>>>>> the task on its previous CPU could get more benefit. >>>>>>>> >>>>>>>> In those tests, what is the average % of idleness of your cpus ? >>>>>>>> >>>>>>> >>>>>>> For hackbench -g 16 -f 20 --pipe -l 480000 -s 100, it is around 8~10% idle >>>>>>> For hackbench -g 16 -f 20 -l 480000 -s 100, it is around 2~3% idle >>>>>>> >>>>>>> Then the CPUs in packge 1 are offlined to get stable result when the group number is low. >>>>>>> hackbench -g 1 -f 20 --pipe -l 480000 -s 100 >>>>>>> Some CPUs are busy, others are idle, and some are half-busy. 
>>>>>>> Core CPU Busy% >>>>>>> - - 49.57 >>>>>>> 0 0 1.89 >>>>>>> 0 72 75.55 >>>>>>> 1 1 100.00 >>>>>>> 1 73 0.00 >>>>>>> 2 2 100.00 >>>>>>> 2 74 0.00 >>>>>>> 3 3 100.00 >>>>>>> 3 75 0.01 >>>>>>> 4 4 78.29 >>>>>>> 4 76 17.72 >>>>>>> 5 5 100.00 >>>>>>> 5 77 0.00 >>>>>>> >>>>>>> >>>>>>> hackbench -g 1 -f 20 -l 480000 -s 100 >>>>>>> Core CPU Busy% >>>>>>> - - 48.29 >>>>>>> 0 0 57.94 >>>>>>> 0 72 21.41 >>>>>>> 1 1 83.28 >>>>>>> 1 73 0.00 >>>>>>> 2 2 11.44 >>>>>>> 2 74 83.38 >>>>>>> 3 3 21.45 >>>>>>> 3 75 77.27 >>>>>>> 4 4 26.89 >>>>>>> 4 76 80.95 >>>>>>> 5 5 5.01 >>>>>>> 5 77 83.09 >>>>>>> >>>>>>> >>>>>>> echo NO_WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features >>>>>>> hackbench -g 1 -f 20 --pipe -l 480000 -s 100 >>>>>>> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks) >>>>>>> Each sender will pass 480000 messages of 100 bytes >>>>>>> Time: 9.434 >>>>>>> >>>>>>> echo WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features >>>>>>> hackbench -g 1 -f 20 --pipe -l 480000 -s 100 >>>>>>> Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks) >>>>>>> Each sender will pass 480000 messages of 100 bytes >>>>>>> Time: 9.373 >>>>>>> >>>>>>> thanks, >>>>>>> Chenyu >>>>>> >>>>>> -- >>>>>> Mathieu Desnoyers >>>>>> EfficiOS Inc. >>>>>> https://www.efficios.com >>>>>> >>>> >>>> -- >>>> Mathieu Desnoyers >>>> EfficiOS Inc. >>>> https://www.efficios.com >>>> >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >>
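For completeness, one plausible shape of the updated test Mathieu says he will prepare is sketched below. This is hypothetical (the final patch is not shown in this thread): it keeps the structure of the earlier almost_idle_cpu() attempts but compares util against 1% of the CPU capacity in integer arithmetic, relying on both values being in the [0:1024] range clarified above:

static bool
almost_idle_cpu(int cpu, struct task_struct *p)
{
	if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
		return false;
	/* hypothetical: util (without p) at or below 1% of the CPU capacity */
	return cpu_util_without(cpu, p) * 100 <= capacity_of(cpu);
}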
On 2023-10-12 at 17:26:36 +0200, Vincent Guittot wrote: > On Wed, 11 Oct 2023 at 12:17, Chen Yu <yu.c.chen@intel.com> wrote: > > > > On 2023-10-10 at 09:49:54 -0400, Mathieu Desnoyers wrote: > > > On 2023-10-09 01:14, Chen Yu wrote: > > > > On 2023-09-30 at 07:45:38 -0400, Mathieu Desnoyers wrote: > > > > > On 9/30/23 03:11, Chen Yu wrote: > > > > > > Hi Mathieu, > > > > > > > > > > > > On 2023-09-29 at 14:33:50 -0400, Mathieu Desnoyers wrote: > > > > > > > Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases > > > > > > > select_task_rq towards the previous CPU if it was almost idle > > > > > > > (avg_load <= 0.1%). > > > > > > > > > > > > Yes, this is a promising direction IMO. One question is that, > > > > > > can cfs_rq->avg.load_avg be used for percentage comparison? > > > > > > If I understand correctly, load_avg reflects that more than > > > > > > 1 tasks could have been running this runqueue, and the > > > > > > load_avg is the direct proportion to the load_weight of that > > > > > > cfs_rq. Besides, LOAD_AVG_MAX seems to not be the max value > > > > > > that load_avg can reach, it is the sum of > > > > > > 1024 * (y + y^1 + y^2 ... ) > > > > > > > > > > > > For example, > > > > > > taskset -c 1 nice -n -20 stress -c 1 > > > > > > cat /sys/kernel/debug/sched/debug | grep 'cfs_rq\[1\]' -A 12 | grep "\.load_avg" > > > > > > .load_avg : 88763 > > > > > > .load_avg : 1024 > > > > > > > > > > > > 88763 is higher than LOAD_AVG_MAX=47742 > > > > > > > > > > I would have expected the load_avg to be limited to LOAD_AVG_MAX somehow, > > > > > but it appears that it does not happen in practice. > > > > > > > > > > That being said, if the cutoff is really at 0.1% or 0.2% of the real max, > > > > > does it really matter ? > > > > > > > > > > > Maybe the util_avg can be used for precentage comparison I suppose? > > > > > [...] > > > > > > Or > > > > > > return cpu_util_without(cpu_rq(cpu), p) * 1000 <= capacity_orig_of(cpu) ? > > > > > > > > > > Unfortunately using util_avg does not seem to work based on my testing. > > > > > Even at utilization thresholds at 0.1%, 1% and 10%. > > > > > > > > > > Based on comments in fair.c: > > > > > > > > > > * CPU utilization is the sum of running time of runnable tasks plus the > > > > > * recent utilization of currently non-runnable tasks on that CPU. > > > > > > > > > > I think we don't want to include currently non-runnable tasks in the > > > > > statistics we use, because we are trying to figure out if the cpu is a > > > > > idle-enough target based on the tasks which are currently running, for the > > > > > purpose of runqueue selection when waking up a task which is considered at > > > > > that point in time a non-runnable task on that cpu, and which is about to > > > > > become runnable again. > > > > > > > > > > > > > Although LOAD_AVG_MAX is not the max possible load_avg, we still want to find > > > > a proper threshold to decide if the CPU is almost idle. The LOAD_AVG_MAX > > > > based threshold is modified a little bit: > > > > > > > > The theory is, if there is only 1 task on the CPU, and that task has a nice > > > > of 0, the task runs 50 us every 1000 us, then this CPU is regarded as almost > > > > idle. > > > > > > > > The load_sum of the task is: > > > > 50 * (1 + y + y^2 + ... + y^n) > > > > The corresponding avg_load of the task is approximately > > > > NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX = 50. 
> > > > So: > > > > > > > > /* which is close to LOAD_AVG_MAX/1000 = 47 */ > > > > #define ALMOST_IDLE_CPU_LOAD 50 > > > > > > Sorry to be slow at understanding this concept, but this whole "load" value > > > is still somewhat magic to me. > > > > > > Should it vary based on CONFIG_HZ_{100,250,300,1000}, or is it independent ? > > > Where is it documented that the load is a value in "us" out of a window of > > > 1000 us ? > > > > > > > My understanding is that, the load_sum of a single task is a value in "us" out > > of a window of 1000 us, while the load_avg of the task will multiply the weight
>
> I'm not sure we can say this. We use a 1024us sampling rate for
> calculating the weighted average, but load_sum is in the range [0:47742],
> so what does it mean to have 47742us out of a window of 1000us ?
>
> Beside this, we have util_avg in the range [0:cpu capacity], which gives
> you the average running time of the cpu
>

Sorry, I did not describe it accurately. Yes, it should be 1024us instead of 1000us, and the load_sum is the decayed accumulated duration. util_avg was used previously and Mathieu found that it did not work, but in the latest version it works again; I'll run a test on that version.

thanks,
Chenyu
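The [0:47742] range Vincent quotes, and Chen Yu's "decayed accumulated duration" description, can both be seen from a tiny iteration. A sketch, again using the continuous approximation (the kernel's discrete segment accounting gives exactly 47742):

#include <stdio.h>
#include <math.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* y^32 = 0.5 */
	double sum = 0.0;
	int w;

	/* always-running task: each 1024us window adds 1024, then decays */
	for (w = 0; w < 2000; w++)
		sum = sum * y + 1024.0;

	printf("load_sum upper bound ~= %.0f\n", sum);	/* ~47788 vs LOAD_AVG_MAX = 47742 */
	return 0;
}

So load_sum is not "microseconds out of a 1000us window": it is a geometrically decayed sum whose upper bound is LOAD_AVG_MAX, reached only by a task that never stops running.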
On 2023-10-12 at 18:24:11 +0200, Vincent Guittot wrote: > On Thu, 12 Oct 2023 at 17:56, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: > > > > On 2023-10-12 11:01, Vincent Guittot wrote: > > > On Thu, 12 Oct 2023 at 16:33, Mathieu Desnoyers > > > <mathieu.desnoyers@efficios.com> wrote: > > >> > > >> On 2023-10-11 06:16, Chen Yu wrote: > > >>> On 2023-10-10 at 09:49:54 -0400, Mathieu Desnoyers wrote: > > >>>> On 2023-10-09 01:14, Chen Yu wrote: > > >>>>> On 2023-09-30 at 07:45:38 -0400, Mathieu Desnoyers wrote: > > >>>>>> On 9/30/23 03:11, Chen Yu wrote: > > >>>>>>> Hi Mathieu, > > >>>>>>> > > >>>>>>> On 2023-09-29 at 14:33:50 -0400, Mathieu Desnoyers wrote: > > >>>>>>>> Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases > > >>>>>>>> select_task_rq towards the previous CPU if it was almost idle > > >>>>>>>> (avg_load <= 0.1%). > > >>>>>>> > > >>>>>>> Yes, this is a promising direction IMO. One question is that, > > >>>>>>> can cfs_rq->avg.load_avg be used for percentage comparison? > > >>>>>>> If I understand correctly, load_avg reflects that more than > > >>>>>>> 1 tasks could have been running this runqueue, and the > > >>>>>>> load_avg is the direct proportion to the load_weight of that > > >>>>>>> cfs_rq. Besides, LOAD_AVG_MAX seems to not be the max value > > >>>>>>> that load_avg can reach, it is the sum of > > >>>>>>> 1024 * (y + y^1 + y^2 ... ) > > >>>>>>> > > >>>>>>> For example, > > >>>>>>> taskset -c 1 nice -n -20 stress -c 1 > > >>>>>>> cat /sys/kernel/debug/sched/debug | grep 'cfs_rq\[1\]' -A 12 | grep "\.load_avg" > > >>>>>>> .load_avg : 88763 > > >>>>>>> .load_avg : 1024 > > >>>>>>> > > >>>>>>> 88763 is higher than LOAD_AVG_MAX=47742 > > >>>>>> > > >>>>>> I would have expected the load_avg to be limited to LOAD_AVG_MAX somehow, > > >>>>>> but it appears that it does not happen in practice. > > >>>>>> > > >>>>>> That being said, if the cutoff is really at 0.1% or 0.2% of the real max, > > >>>>>> does it really matter ? > > >>>>>> > > >>>>>>> Maybe the util_avg can be used for precentage comparison I suppose? > > >>>>>> [...] > > >>>>>>> Or > > >>>>>>> return cpu_util_without(cpu_rq(cpu), p) * 1000 <= capacity_orig_of(cpu) ? > > >>>>>> > > >>>>>> Unfortunately using util_avg does not seem to work based on my testing. > > >>>>>> Even at utilization thresholds at 0.1%, 1% and 10%. > > >>>>>> > > >>>>>> Based on comments in fair.c: > > >>>>>> > > >>>>>> * CPU utilization is the sum of running time of runnable tasks plus the > > >>>>>> * recent utilization of currently non-runnable tasks on that CPU. > > >>>>>> > > >>>>>> I think we don't want to include currently non-runnable tasks in the > > >>>>>> statistics we use, because we are trying to figure out if the cpu is a > > >>>>>> idle-enough target based on the tasks which are currently running, for the > > >>>>>> purpose of runqueue selection when waking up a task which is considered at > > >>>>>> that point in time a non-runnable task on that cpu, and which is about to > > >>>>>> become runnable again. > > >>>>>> > > >>>>> > > >>>>> Although LOAD_AVG_MAX is not the max possible load_avg, we still want to find > > >>>>> a proper threshold to decide if the CPU is almost idle. The LOAD_AVG_MAX > > >>>>> based threshold is modified a little bit: > > >>>>> > > >>>>> The theory is, if there is only 1 task on the CPU, and that task has a nice > > >>>>> of 0, the task runs 50 us every 1000 us, then this CPU is regarded as almost > > >>>>> idle. 
> > >>>>> > > >>>>> The load_sum of the task is: > > >>>>> 50 * (1 + y + y^2 + ... + y^n) > > >>>>> The corresponding avg_load of the task is approximately > > >>>>> NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX = 50. > > >>>>> So: > > >>>>> > > >>>>> /* which is close to LOAD_AVG_MAX/1000 = 47 */ > > >>>>> #define ALMOST_IDLE_CPU_LOAD 50 > > >>>> > > >>>> Sorry to be slow at understanding this concept, but this whole "load" value > > >>>> is still somewhat magic to me. > > >>>> > > >>>> Should it vary based on CONFIG_HZ_{100,250,300,1000}, or is it independent ? > > >>>> Where is it documented that the load is a value in "us" out of a window of > > >>>> 1000 us ? > > >>>> > > >>> > > >>> My understanding is that, the load_sum of a single task is a value in "us" out > > >>> of a window of 1000 us, while the load_avg of the task will multiply the weight > > >>> of the task. In this case a task with nice 0 is NICE_0_WEIGHT = 1024. > > >>> > > >>> __update_load_avg_se -> ___update_load_sum calculate the load_sum of a task(there > > >>> is comments around ___update_load_sum to describe the pelt calculation), > > >>> and ___update_load_avg() calculate the load_avg based on the task's weight. > > >> > > >> Thanks for your thorough explanation, now it makes sense. > > >> > > >> I understand as well that the cfs_rq->avg.load_sum is the result of summing > > >> each task load_sum multiplied by their weight: > > > > > > Please don't use load_sum but only *_avg. > > > As already said, util_avg or runnable_avg are better metrics for you > > > > I think I found out why using util_avg was not working for me. > > > > Considering this comment from cpu_util(): > > > > * CPU utilization is the sum of running time of runnable tasks plus the > > * recent utilization of currently non-runnable tasks on that CPU. > > > > I don't want to include the recent utilization of currently non-runnable > > tasks on that CPU in order to choose that CPU to do task placement in a > > context where many tasks were recently running on that cpu (but are > > currently blocked). I do not want those blocked tasks to be part of the > > avg. > > But you have the exact same behavior with load_sum/avg. > > > > > So I think the issue here is that I was using the cpu_util() (and > > cpu_util_without()) helpers which are considering max(util, runnable), > > rather than just "util". > > cpu_util_without() only use util_avg but not runnable_avg. > Nevertheless, cpu_util_without and cpu_util use util_est which is > used to predict the final utilization. > > Let's take the example of task A running 20ms every 200ms on CPU0. > The util_avg of the cpu will vary in the range [7:365]. When task A

It took me some time to find out where the 7 and 365 come from (Dietmar once told me about it, but I forgot, sorry :P ). If I understand correctly, we are checking two scenarios. One scenario is that A is running right now, so its running time is the least decayed (the max). The other scenario is that A has been sleeping for a while up to now, hence the min value.

For a task running A ms every B ms:

scenario 1:
                      now
                       ^
                       |
 ...|--------|--A ms--|
 ...|--------B ms-----|

util_sum = 1024(y^0+...+y^(A-1) + y^B+...+y^(B+A-1) + ...)
util_avg = util_sum/(y^0+...+y^n)
         = 1024(1-y^A)/(1-y^B)

When A = 20, B = 200, util_avg is 365.

scenario 2:
                      now
                       ^
                       |
 ...|--A ms--|--------|
 ...|--------B ms-----|

util_sum = 1024(y^(B-A)+y^(B-A+1)+...+y^(B-1) + y^(2B-A)+y^(2B-A+1)+...+y^(2B-1) + ...)
         = 1024 y^(B-A)(y^0+...+y^(A-1) + y^B+...+y^(B+A-1) + ...)
util_avg  = 1024 y^(B-A)(1-y^A)/(1-y^B)

When A = 20, B = 200, util_avg is 7.

I just wonder if there is any description of this in the kernel, and if not, does it make sense to add one (either in the comments or in the documentation)? It could be useful for estimating the util_avg from a task's behavior.

> wakes up on CPU0, CPU0 util_avg = 7 (below 1%) but taskA will run for
> 20ms which is not really almost idle. On the other side, CPU0 util_est
> will be 365 as soon as task A is enqueued (which will be the value of
> CPU0 util_avg just before going idle)
>
> Let's now take a task B running 100us every 1024us
> The util_avg of the cpu should vary in the range [101:103] and once

May I know how we get 101 and 103? I thought that for task B, the util_sum is 100 * (1+y+...+y^n) and the util_avg is util_sum/(1+y+...+y^n), so it is always 100? Am I missing something?

thanks,
Chenyu
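Chen Yu's closed forms evaluate directly; a minimal sketch at window granularity (treating the 20ms/200ms pattern as 20 out of every 200 PELT windows of 1024us, so the continuous approximation again):

#include <stdio.h>
#include <math.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* PELT: y^32 = 0.5 */
	double A = 20.0, B = 200.0;		/* run A windows out of every B */

	/* scenario 1: the running phase just ended (max) */
	double hi = 1024.0 * (1.0 - pow(y, A)) / (1.0 - pow(y, B));
	/* scenario 2: the task has been sleeping for B-A windows (min) */
	double lo = pow(y, B - A) * hi;

	printf("util_avg range ~ [%.0f:%.0f]\n", lo, hi);	/* ~[7:365] */
	return 0;
}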
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d9c2482c5a3..65a7d923ea61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6599,6 +6599,14 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
+static bool
+almost_idle_cpu(int cpu, struct task_struct *p)
+{
+	if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
+		return false;
+	return cpu_load_without(cpu_rq(cpu), p) <= LOAD_AVG_MAX / 1000;
+}
+
 /*
  * The purpose of wake_affine() is to quickly determine on which CPU we can run
  * soonest. For the purpose of speed we only consider the waking and previous
@@ -6612,7 +6620,7 @@ static int wake_wide(struct task_struct *p)
  * for the overloaded case.
  */
 static int
-wake_affine_idle(int this_cpu, int prev_cpu, int sync)
+wake_affine_idle(int this_cpu, int prev_cpu, int sync, struct task_struct *p)
 {
 	/*
 	 * If this_cpu is idle, it implies the wakeup is from interrupt
@@ -6632,7 +6640,7 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
 		return this_cpu;
 
-	if (available_idle_cpu(prev_cpu))
+	if (available_idle_cpu(prev_cpu) || almost_idle_cpu(prev_cpu, p))
 		return prev_cpu;
 
 	return nr_cpumask_bits;
@@ -6687,7 +6695,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	int target = nr_cpumask_bits;
 
 	if (sched_feat(WA_IDLE))
-		target = wake_affine_idle(this_cpu, prev_cpu, sync);
+		target = wake_affine_idle(this_cpu, prev_cpu, sync, p);
 
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
@@ -7139,7 +7147,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 	lockdep_assert_irqs_disabled();
 
-	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
+	if ((available_idle_cpu(target) || sched_idle_cpu(target) || (prev == target && almost_idle_cpu(target, p))) &&
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
@@ -7147,7 +7155,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
 	if (prev != target && cpus_share_cache(prev, target) &&
-	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev) || almost_idle_cpu(prev, p)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev))
 		return prev;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..b06d06c2b728 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -37,6 +37,12 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)
 
+/*
+ * Bias runqueue selection towards previous runqueue if it is almost
+ * idle.
+ */
+SCHED_FEAT(WAKEUP_BIAS_PREV_IDLE, true)
+
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(HRTICK_DL, false)
 SCHED_FEAT(DOUBLE_TICK, false)