Message ID | 20230718134120.81199-4-aaron.lu@intel.com |
---|---|
State | New |
Headers | From: Aaron Lu <aaron.lu@intel.com>; To: Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Daniel Jordan <daniel.m.jordan@oracle.com>; Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Valentin Schneider <vschneid@redhat.com>, Tim Chen <tim.c.chen@intel.com>, Nitin Tekchandani <nitin.tekchandani@intel.com>, Yu Chen <yu.c.chen@intel.com>, Waiman Long <longman@redhat.com>, linux-kernel@vger.kernel.org; Subject: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load; Date: Tue, 18 Jul 2023 21:41:19 +0800; Message-ID: <20230718134120.81199-4-aaron.lu@intel.com>; In-Reply-To: <20230718134120.81199-1-aaron.lu@intel.com>; References: <20230718134120.81199-1-aaron.lu@intel.com> |
Series | Reduce cost of accessing tg->load_avg |
Commit Message
Aaron Lu
July 18, 2023, 1:41 p.m. UTC
When a workload involves many wake-time task migrations, tg->load_avg
can be heavily contended among CPUs because every migration involves
removing the task's load from its src cfs_rq and attaching that load to
its new cfs_rq. Both the removal and the attach require an update to
tg->load_avg as well as propagating the change up the hierarchy.
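(For reference, the contended write is each cfs_rq folding the change of its
local load_avg into the shared per-task-group counter. The sketch below is
abridged from the upstream update_tg_load_avg() that also appears in the diff
context quoted in the replies; the read side, update_cfs_group()/
calc_group_shares(), reads the same tg->load_avg.)

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	/* No need to update load_avg for root_task_group as it is not used. */
	if (cfs_rq->tg == &root_task_group)
		return;

	/* Filter out changes smaller than ~1.5% (1/64) of the contribution. */
	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		/* Cross-CPU atomic on the shared tg->load_avg: the hot spot. */
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}
}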
E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
Sapphire Rapids machine, a 5s window sees about 14 million wakeups and
11 million migrations. Since the workload triggers that many wakeups and
migrations, the accesses (both read and write) to tg->load_avg are
unbounded. For the above workload, the profile shows update_cfs_group()
costs ~13% and update_load_avg() costs ~10%. With
netperf/nr_client=nr_cpu/UDP_RR, there are 21 million wakeups and
14 million migrations; update_cfs_group() costs ~25% and
update_load_avg() costs ~16%.
This patch is an attempt to reduce the cost of accessing tg->load_avg.
The current logic immediately does an update_tg_load_avg() if the cfs_rq
has removed load; this patch changes that behavior: when
update_cfs_rq_load_avg() finds that the cfs_rq has removed load, it does
not call update_tg_load_avg() or propagate the removed load immediately;
instead, the update to tg->load_avg and the propagated load are dealt
with by a following event, such as a task attaching to this cfs_rq or
update_blocked_averages(). This way, the calls to update_tg_load_avg()
for this cfs_rq and its ancestors are reduced by about half.
================================================
postgres_sysbench(transaction, higher is better)
nr_thread=100%/75%/50% were tested on 2-socket SPR and Ice Lake and the
results that have a measurable difference are:
nr_thread=100% on SPR:
base: 90569.11±1.15%
node: 104152.26±0.34% +15.0%
delay: 127309.46±4.25% +40.6%
nr_thread=75% on SPR:
base: 100803.96±0.57%
node: 107333.58±0.44% +6.5%
delay: 124332.39±0.51% +23.3%
nr_thread=75% on ICL:
base: 61961.26±0.41%
node: 61585.45±0.50%
delay: 72420.52±0.14% +16.9%
=======================================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2-socket SPR and Cascade Lake and the
results that have a measurable difference are:
group=8 on SPR:
base: 437163±2.6%
node: 471203±1.2% +7.8%
delay: 490780±0.9% +12.3%
group=16 on SPR:
base: 468279±1.9%
node: 580385±1.7% +23.9%
delay: 664422±0.2% +41.9%
================================================
netperf/TCP_STREAM (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and the results that have a measurable difference are:
nr_thread=50% on CSL:
base: 16258±0.7%
node: 16172±2.9%
delay: 17729±0.7% +9.0%
nr_thread=75% on CSL:
base: 12923±1.2%
node: 13011±2.2%
delay: 15452±1.6% +19.6%
nr_thread=75% on SPR:
base: 16232±11.9%
node: 13962±5.1%
delay: 21089±0.8% +29.9%
nr_thread=100% on SPR:
base: 13220±0.6%
node: 13113±0.0%
delay: 18258±11.3% +38.1%
=============================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and the results that have a measurable difference are:
nr_thread=1 on CSL:
base: 128521±0.5%
node: 127935±0.6%
delay: 126317±0.4% -1.7%
nr_thread=75% on CSL:
base: 36701±1.7%
node: 39949±1.4% +8.8%
delay: 42516±0.3% +15.8%
nr_thread=75% on SPR:
base: 14249±3.8%
node: 19890±2.0% +39.6%
delay: 31331±0.5% +119.9%
nr_thread=100% on CSL:
base: 52275±0.6%
node: 53827±0.4% +3.0%
delay: 78386±0.7% +49.9%
nr_thread=100% on SPR:
base: 9560±1.6%
node: 14186±3.9% +48.4%
delay: 20779±2.8% +117.4%
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
kernel/sched/fair.c | 23 ++++++++++++++++++-----
kernel/sched/sched.h | 1 +
2 files changed, 19 insertions(+), 5 deletions(-)
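A condensed sketch of the core change, extracted from the full diff quoted in
the replies below: removed load is parked in a new cfs_rq->prop_removed_sum
field instead of being propagated right away, and it is folded into
prop_runnable_sum the next time something else calls add_tg_cfs_propagate() or
when update_blocked_averages() walks this cfs_rq (cfs_rq_is_decayed() also
keeps such a cfs_rq on the leaf list until the pending sum is flushed).

/* kernel/sched/sched.h: new per-cfs_rq accumulator */
	long			prop_removed_sum;

/* kernel/sched/fair.c: fold any parked removed load into the next propagation */
static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
{
	cfs_rq->propagate = 1;
	cfs_rq->prop_runnable_sum += runnable_sum;

	if (cfs_rq->prop_removed_sum) {
		cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum;
		cfs_rq->prop_removed_sum = 0;
	}
}

/* update_cfs_rq_load_avg(): park the removed load instead of propagating it
 * immediately (and no longer force decayed = 1 for this case):
 */
	cfs_rq->prop_removed_sum +=
		-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT;

/* __update_blocked_fair(): flush the pending sum before this cfs_rq could be
 * removed from the leaf list:
 */
	if (cfs_rq->prop_removed_sum)
		add_tg_cfs_propagate(cfs_rq, 0);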
Comments
On Tue, 18 Jul 2023 at 15:41, Aaron Lu <aaron.lu@intel.com> wrote: > > When a workload involves many wake time task migrations, tg->load_avg > can be heavily contended among CPUs because every migration involves > removing the task's load from its src cfs_rq and attach that load to > its new cfs_rq. Both the remove and attach requires an update to > tg->load_avg as well as propagating the change up the hierarchy. > > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel > Sappire Rapids, during a 5s window, the wakeup number is 14millions and > migration number is 11millions. Since the workload can trigger many > wakeups and migrations, the access(both read and write) to tg->load_avg > can be unbound. For the above said workload, the profile shows > update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With > netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21millions and > migration number is 14millions; update_cfs_group() costs ~25% and > update_load_avg() costs ~16%. > > This patch is an attempt to reduce the cost of accessing tg->load_avg. here you mention tg->load_avg which is updated with update_tg_load_avg() but your patch below modifies the local update of cfs's util_avg, runnable_avg and load_avg which need to be maintained up to date You should be better to delay or rate limit the call to update_tg_load_avg() or calc_group_shares()/update_cfs_group() which access tg->load_avg and are the culprit instead of modifying other place. Have you tried to remove update_cfs_group() from enqueue/dequeue and only let the tick update the share periodically ? Have you tried to make update_tg_load_avg() return early ? the change of cfs' load_avg is written in the tg->load_avg only when the change is bigger than 1.5%. maybe increase it to 6% ? Or like for update_cfs_group, remove it from attach/detach entity and let the periodic update to propagate the change But please don't touch local update of *_avg > > Current logic will immediately do a update_tg_load_avg() if cfs_rq has > removed load; this patch changes this behavior: if this cfs_rq has > removed load as discovered by update_cfs_rq_load_avg(), it didn't call > update_tg_load_avg() or propagate the removed load immediately, instead, > the update to tg->load_avg and propagated load can be dealed with by a > following event like task attached to this cfs_rq or in > update_blocked_averages(). This way, the call to update_tg_load_avg() > for this cfs_rq and its ancestors can be reduced by about half. 
> > ================================================ > postgres_sysbench(transaction, higher is better) > nr_thread=100%/75%/50% were tested on 2 sockets SPR and Icelake and > results that have a measuable difference are: > > nr_thread=100% on SPR: > base: 90569.11±1.15% > node: 104152.26±0.34% +15.0% > delay: 127309.46±4.25% +40.6% > > nr_thread=75% on SPR: > base: 100803.96±0.57% > node: 107333.58±0.44% +6.5% > delay: 124332.39±0.51% +23.3% > > nr_thread=75% on ICL: > base: 61961.26±0.41% > node: 61585.45±0.50% > delay: 72420.52±0.14% +16.9% > > ======================================================================= > hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better) > group=1/4/8/16 were tested on 2 sockets SPR and Cascade lake and the > results that have a measuable difference are: > > group=8 on SPR: > base: 437163±2.6% > node: 471203±1.2% +7.8% > delay: 490780±0.9% +12.3% > > group=16 on SPR: > base: 468279±1.9% > node: 580385±1.7% +23.9% > delay: 664422±0.2% +41.9% > > ================================================ > netperf/TCP_STRAM (throughput, higher is better) > nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade > Lake and results that have a measuable difference are: > > nr_thread=50% on CSL: > base: 16258±0.7% > node: 16172±2.9% > delay: 17729±0.7% +9.0% > > nr_thread=75% on CSL: > base: 12923±1.2% > node: 13011±2.2% > delay: 15452±1.6% +19.6% > > nr_thread=75% on SPR: > base: 16232±11.9% > node: 13962±5.1% > delay: 21089±0.8% +29.9% > > nr_thread=100% on SPR: > base: 13220±0.6% > node: 13113±0.0% > delay: 18258±11.3% +38.1% > > ============================================= > netperf/UDP_RR (throughput, higher is better) > nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade > Lake and results that have measuable difference are: > > nr_thread=1 on CSL: > base: 128521±0.5% > node: 127935±0.6% > delay: 126317±0.4% -1.7% > > nr_thread=75% on CSL: > base: 36701±1.7% > node: 39949±1.4% +8.8% > delay: 42516±0.3% +15.8% > > nr_thread=75% on SPR: > base: 14249±3.8% > node: 19890±2.0% +39.6% > delay: 31331±0.5% +119.9% > > nr_thread=100% on CSL: > base: 52275±0.6% > node: 53827±0.4% +3.0% > delay: 78386±0.7% +49.9% > > nr_thread=100% on SPR: > base: 9560±1.6% > node: 14186±3.9% +48.4% > delay: 20779±2.8% +117.4% > > Signed-off-by: Aaron Lu <aaron.lu@intel.com> > --- > kernel/sched/fair.c | 23 ++++++++++++++++++----- > kernel/sched/sched.h | 1 + > 2 files changed, 19 insertions(+), 5 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index aceb8f5922cb..564ffe3e59c1 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3645,6 +3645,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) > if (child_cfs_rq_on_list(cfs_rq)) > return false; > > + if (cfs_rq->prop_removed_sum) > + return false; > + > return true; > } > > @@ -3911,6 +3914,11 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum > { > cfs_rq->propagate = 1; > cfs_rq->prop_runnable_sum += runnable_sum; > + > + if (cfs_rq->prop_removed_sum) { > + cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum; > + cfs_rq->prop_removed_sum = 0; > + } > } > > /* Update task and its cfs_rq load average */ > @@ -4133,13 +4141,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > * removed_runnable is the unweighted version of removed_load so we > * can use it to estimate removed_load_sum. 
> */ > - add_tg_cfs_propagate(cfs_rq, > - -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); > - > - decayed = 1; > + cfs_rq->prop_removed_sum += > + -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT; > } > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq); > + decayed = __update_load_avg_cfs_rq(now, cfs_rq); > u64_u32_store_copy(sa->last_update_time, > cfs_rq->last_update_time_copy, > sa->last_update_time); > @@ -9001,6 +9007,13 @@ static bool __update_blocked_fair(struct rq *rq, bool *done) > > if (cfs_rq == &rq->cfs) > decayed = true; > + > + /* > + * If the aggregated removed_sum hasn't been taken care of, > + * deal with it now before this cfs_rq is removed from the list. > + */ > + if (cfs_rq->prop_removed_sum) > + add_tg_cfs_propagate(cfs_rq, 0); > } > > /* Propagate pending load changes to the parent, if any: */ > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 9cece2dbc95b..ab540b21d071 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -619,6 +619,7 @@ struct cfs_rq { > unsigned long tg_load_avg_contrib; > long propagate; > long prop_runnable_sum; > + long prop_removed_sum; > > /* > * h_load = weight * f(tg) > -- > 2.41.0 >
On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote: > On Tue, 18 Jul 2023 at 15:41, Aaron Lu <aaron.lu@intel.com> wrote: > > > > When a workload involves many wake time task migrations, tg->load_avg > > can be heavily contended among CPUs because every migration involves > > removing the task's load from its src cfs_rq and attach that load to > > its new cfs_rq. Both the remove and attach requires an update to > > tg->load_avg as well as propagating the change up the hierarchy. > > > > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel > > Sappire Rapids, during a 5s window, the wakeup number is 14millions and > > migration number is 11millions. Since the workload can trigger many > > wakeups and migrations, the access(both read and write) to tg->load_avg > > can be unbound. For the above said workload, the profile shows > > update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With > > netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21millions and > > migration number is 14millions; update_cfs_group() costs ~25% and > > update_load_avg() costs ~16%. > > > > This patch is an attempt to reduce the cost of accessing tg->load_avg. > > here you mention tg->load_avg which is updated with update_tg_load_avg() > > but your patch below modifies the local update of cfs's util_avg, > runnable_avg and load_avg which need to be maintained up to date Yes, since it delayed propagating the removed load, the upper cfs_rqs' *_avg could be updated later than current code. > You should be better to delay or rate limit the call to > update_tg_load_avg() or calc_group_shares()/update_cfs_group() which > access tg->load_avg and are the culprit instead of modifying other > place. Thanks for the suggestion and I think it makes perfect sense. I tried below to rate limit the update to tg->load_avg at most once per ms in update_tg_load_avg(): diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a80a73909dc2..e48fd0e6982d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) { long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; + u64 now = cfs_rq_clock_pelt(cfs_rq); /* * No need to update load_avg for root_task_group as it is not used. @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) if (cfs_rq->tg == &root_task_group) return; - if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { + if ((now - cfs_rq->last_update_tg_load_avg > 1000000) && + abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { atomic_long_add(delta, &cfs_rq->tg->load_avg); cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; + cfs_rq->last_update_tg_load_avg = now; } } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 14dfaafb3a8f..b5201d44d820 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -594,6 +594,7 @@ struct cfs_rq { #ifdef CONFIG_FAIR_GROUP_SCHED unsigned long tg_load_avg_contrib; + u64 last_update_tg_load_avg; long propagate; long prop_runnable_sum; From some quick tests using postgres_sysbench and netperf/UDP_RR on SPR, this alone already delivers good results. For postgres_sysbench, the two functions cost dropped to 1%-2% each; and for netperf/UDP_RR, the two functions cost dropped to ~2% and ~4%. Togerther with some more rate limiting on the read side, I think the end result will be better. Does the above look reasonable to you? 
Alternatively, I can remove some callsites of update_tg_load_avg() like you suggested below and only call update_tg_load_avg() when cfs_rq is decayed(really just decayed, not when it detected it has removed load pending or load propagated from its children). This way it would give us similar result as above(roughly once per ms). > > Have you tried to remove update_cfs_group() from enqueue/dequeue and > only let the tick update the share periodically ? patch4 kind of did that :-) > > Have you tried to make update_tg_load_avg() return early ? the change > of cfs' load_avg is written in the tg->load_avg only when the change > is bigger than 1.5%. maybe increase it to 6% ? Yes, increase the delta is also a good way to limit the update to tg->load_avg. Will try that too. > > Or like for update_cfs_group, remove it from attach/detach entity and > let the periodic update to propagate the change > > But please don't touch local update of *_avg OK I see. Thanks again for the comments, they are very helpful.
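For concreteness on the threshold idea above (my own arithmetic, not something
spelled out in the thread): the existing ~1.5% filter is the "/ 64" divisor in
update_tg_load_avg(), so a ~6% threshold would amount to dividing by 16
instead, e.g.:

	/* Hypothetical sketch: 1/64 is ~1.5% of the contribution, 1/16 is ~6%. */
	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 16) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}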
On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote: > Alternatively, I can remove some callsites of update_tg_load_avg() like > you suggested below and only call update_tg_load_avg() when cfs_rq is > decayed(really just decayed, not when it detected it has removed load > pending or load propagated from its children). This way it would give us > similar result as above(roughly once per ms). Something like this: (I think this is better since it removed those unnecessary calls to update_tg_load_avg(), although it is inline but still) From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001 From: Aaron Lu <aaron.lu@intel.com> Date: Wed, 19 Jul 2023 13:54:48 +0800 Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed --- kernel/sched/fair.c | 22 +++++++--------------- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a80a73909dc2..7d5b7352b8b5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum } /* Update task and its cfs_rq load average */ -static inline int propagate_entity_load_avg(struct sched_entity *se) +static inline void propagate_entity_load_avg(struct sched_entity *se) { struct cfs_rq *cfs_rq, *gcfs_rq; if (entity_is_task(se)) - return 0; + return; gcfs_rq = group_cfs_rq(se); if (!gcfs_rq->propagate) - return 0; + return; gcfs_rq->propagate = 0; @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se) trace_pelt_cfs_tp(cfs_rq); trace_pelt_se_tp(se); - - return 1; } /* @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se) static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {} -static inline int propagate_entity_load_avg(struct sched_entity *se) +static inline void propagate_entity_load_avg(struct sched_entity *se) { - return 0; } static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {} @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) { unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0; struct sched_avg *sa = &cfs_rq->avg; - int decayed = 0; + int decayed; if (cfs_rq->removed.nr) { unsigned long r; @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) */ add_tg_cfs_propagate(cfs_rq, -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); - - decayed = 1; } - decayed |= __update_load_avg_cfs_rq(now, cfs_rq); + decayed = __update_load_avg_cfs_rq(now, cfs_rq); u64_u32_store_copy(sa->last_update_time, cfs_rq->last_update_time_copy, sa->last_update_time); @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s __update_load_avg_se(now, cfs_rq, se); decayed = update_cfs_rq_load_avg(now, cfs_rq); - decayed |= propagate_entity_load_avg(se); + propagate_entity_load_avg(se); if (!se->avg.last_update_time && (flags & DO_ATTACH)) { @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s * IOW we're enqueueing a task on a new CPU. */ attach_entity_load_avg(cfs_rq, se); - update_tg_load_avg(cfs_rq); - } else if (flags & DO_DETACH) { /* * DO_DETACH means we're here from dequeue_entity() * and we are migrating task out of the CPU. */ detach_entity_load_avg(cfs_rq, se); - update_tg_load_avg(cfs_rq); } else if (decayed) { cfs_rq_util_change(cfs_rq, 0);
On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote: > On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote: > > Have you tried to remove update_cfs_group() from enqueue/dequeue and > > only let the tick update the share periodically ? > > patch4 kind of did that :-) > More about this. If I remove update_cfs_group() in dequeue_task_fair() on top of patch4 like this: From 43d5c12f0b2180c99149e663a71c610e31023d90 Mon Sep 17 00:00:00 2001 From: Aaron Lu <aaron.lu@intel.com> Date: Wed, 19 Jul 2023 14:51:07 +0800 Subject: [PATCH 1/2] sched/fair: completely remove update_cfs_group() in dequeue path --- kernel/sched/fair.c | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2adb6a6abbce..a21ab72819ce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6434,7 +6434,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) update_load_avg(cfs_rq, se, UPDATE_TG); se_update_runnable(se); - update_cfs_group(se); cfs_rq->h_nr_running--; cfs_rq->idle_h_nr_running -= idle_h_nr_running;
On Wed, 19 Jul 2023 at 07:18, Aaron Lu <aaron.lu@intel.com> wrote: > > On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote: > > On Tue, 18 Jul 2023 at 15:41, Aaron Lu <aaron.lu@intel.com> wrote: > > > > > > When a workload involves many wake time task migrations, tg->load_avg > > > can be heavily contended among CPUs because every migration involves > > > removing the task's load from its src cfs_rq and attach that load to > > > its new cfs_rq. Both the remove and attach requires an update to > > > tg->load_avg as well as propagating the change up the hierarchy. > > > > > > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel > > > Sappire Rapids, during a 5s window, the wakeup number is 14millions and > > > migration number is 11millions. Since the workload can trigger many > > > wakeups and migrations, the access(both read and write) to tg->load_avg > > > can be unbound. For the above said workload, the profile shows > > > update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With > > > netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21millions and > > > migration number is 14millions; update_cfs_group() costs ~25% and > > > update_load_avg() costs ~16%. > > > > > > This patch is an attempt to reduce the cost of accessing tg->load_avg. > > > > here you mention tg->load_avg which is updated with update_tg_load_avg() > > > > but your patch below modifies the local update of cfs's util_avg, > > runnable_avg and load_avg which need to be maintained up to date > > Yes, since it delayed propagating the removed load, the upper cfs_rqs' > *_avg could be updated later than current code. > > > You should be better to delay or rate limit the call to > > update_tg_load_avg() or calc_group_shares()/update_cfs_group() which > > access tg->load_avg and are the culprit instead of modifying other > > place. > > Thanks for the suggestion and I think it makes perfect sense. > > I tried below to rate limit the update to tg->load_avg at most once per > ms in update_tg_load_avg(): > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index a80a73909dc2..e48fd0e6982d 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > { > long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > + u64 now = cfs_rq_clock_pelt(cfs_rq); > > /* > * No need to update load_avg for root_task_group as it is not used. > @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > if (cfs_rq->tg == &root_task_group) > return; > > - if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > + if ((now - cfs_rq->last_update_tg_load_avg > 1000000) && > + abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > atomic_long_add(delta, &cfs_rq->tg->load_avg); > cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; > + cfs_rq->last_update_tg_load_avg = now; > } > } > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 14dfaafb3a8f..b5201d44d820 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -594,6 +594,7 @@ struct cfs_rq { > > #ifdef CONFIG_FAIR_GROUP_SCHED > unsigned long tg_load_avg_contrib; > + u64 last_update_tg_load_avg; > long propagate; > long prop_runnable_sum; > > From some quick tests using postgres_sysbench and netperf/UDP_RR on SPR, > this alone already delivers good results. 
For postgres_sysbench, the two > functions cost dropped to 1%-2% each; and for netperf/UDP_RR, the two > functions cost dropped to ~2% and ~4%. Togerther with some more rate > limiting on the read side, I think the end result will be better. Does > the above look reasonable to you? Yes, that seems a better way to limit write access > > Alternatively, I can remove some callsites of update_tg_load_avg() like > you suggested below and only call update_tg_load_avg() when cfs_rq is > decayed(really just decayed, not when it detected it has removed load > pending or load propagated from its children). This way it would give us > similar result as above(roughly once per ms). > > > > > Have you tried to remove update_cfs_group() from enqueue/dequeue and > > only let the tick update the share periodically ? > > patch4 kind of did that :-) That's what I have noticed when reading patch 4 :-) > > > > > Have you tried to make update_tg_load_avg() return early ? the change > > of cfs' load_avg is written in the tg->load_avg only when the change > > is bigger than 1.5%. maybe increase it to 6% ? > > Yes, increase the delta is also a good way to limit the update to > tg->load_avg. Will try that too. > > > > > Or like for update_cfs_group, remove it from attach/detach entity and > > let the periodic update to propagate the change > > > > But please don't touch local update of *_avg > > OK I see. > > Thanks again for the comments, they are very helpful.
On Wed, 19 Jul 2023 at 10:11, Aaron Lu <aaron.lu@intel.com> wrote: > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote: > > On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote: > > > Have you tried to remove update_cfs_group() from enqueue/dequeue and > > > only let the tick update the share periodically ? > > > > patch4 kind of did that :-) > > > > More about this. > > If I remove update_cfs_group() in dequeue_task_fair() on top of patch4 > like this: > > From 43d5c12f0b2180c99149e663a71c610e31023d90 Mon Sep 17 00:00:00 2001 > From: Aaron Lu <aaron.lu@intel.com> > Date: Wed, 19 Jul 2023 14:51:07 +0800 > Subject: [PATCH 1/2] sched/fair: completely remove update_cfs_group() in > dequeue path > > --- > kernel/sched/fair.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 2adb6a6abbce..a21ab72819ce 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6434,7 +6434,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > update_load_avg(cfs_rq, se, UPDATE_TG); > se_update_runnable(se); > - update_cfs_group(se); > > cfs_rq->h_nr_running--; > cfs_rq->idle_h_nr_running -= idle_h_nr_running; > -- > 2.40.1 > > Than P95 latency of the schbench workload I described in patch4's > changelog will increase to > 1ms(base and patch4's P95 < 100us): > > Latency percentiles (usec) runtime 300 (s) (18504 total samples) > 50.0th: 20 (9537 samples) > 75.0th: 25 (4869 samples) > 90.0th: 29 (2264 samples) > 95.0th: 2564 (909 samples) > *99.0th: 20768 (740 samples) > 99.5th: 23520 (93 samples) > 99.9th: 31520 (74 samples) > min=6, max=40072 > > If I further remove update_cfs_group() completely in enqueue path on top > of the last change: > > From 4e4cb31590ca2e4080ece9cfa9dfaaf26501c60d Mon Sep 17 00:00:00 2001 > From: Aaron Lu <aaron.lu@intel.com> > Date: Wed, 19 Jul 2023 15:36:24 +0800 > Subject: [PATCH 2/2] sched/fair: completely remove update_cfs_group() from > enqueue path > > --- > kernel/sched/fair.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index a21ab72819ce..8fc325112282 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4847,8 +4847,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > */ > update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH); > se_update_runnable(se); > - if (cfs_rq->nr_running > 0) > - update_cfs_group(se); > account_entity_enqueue(cfs_rq, se); > > if (flags & ENQUEUE_WAKEUP) > @@ -6344,7 +6342,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > update_load_avg(cfs_rq, se, UPDATE_TG); > se_update_runnable(se); > - update_cfs_group(se); > > cfs_rq->h_nr_running++; > cfs_rq->idle_h_nr_running += idle_h_nr_running; > -- > 2.40.1 > > Then P50's latency will bump to ~4ms from ~20us: > Latency percentiles (usec) runtime 300 (s) (17940 total samples) > 50.0th: 3996 (12092 samples) > 75.0th: 4004 (4919 samples) > 90.0th: 4004 (0 samples) > 95.0th: 4012 (353 samples) > *99.0th: 20000 (487 samples) > 99.5th: 20000 (0 samples) > 99.9th: 31136 (72 samples) > min=7, max=37402 > real 5m36.633s > user 47m33.947s > sys 4m47.097s > > So for the read side, maybe just keep what patch4 does? yes, skipping update_cfs_group() at enqueue bypass the opportunity to increase the share and get more running time for the group until the update really happen > > Thanks, > Aaron
On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote: > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote: > > Alternatively, I can remove some callsites of update_tg_load_avg() like > > you suggested below and only call update_tg_load_avg() when cfs_rq is > > decayed(really just decayed, not when it detected it has removed load > > pending or load propagated from its children). This way it would give us > > similar result as above(roughly once per ms). > > Something like this: (I think this is better since it removed those > unnecessary calls to update_tg_load_avg(), although it is inline but > still) > > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001 > From: Aaron Lu <aaron.lu@intel.com> > Date: Wed, 19 Jul 2023 13:54:48 +0800 > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed > > --- > kernel/sched/fair.c | 22 +++++++--------------- > 1 file changed, 7 insertions(+), 15 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index a80a73909dc2..7d5b7352b8b5 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum > } > > /* Update task and its cfs_rq load average */ > -static inline int propagate_entity_load_avg(struct sched_entity *se) > +static inline void propagate_entity_load_avg(struct sched_entity *se) > { > struct cfs_rq *cfs_rq, *gcfs_rq; > > if (entity_is_task(se)) > - return 0; > + return; > > gcfs_rq = group_cfs_rq(se); > if (!gcfs_rq->propagate) > - return 0; > + return; > > gcfs_rq->propagate = 0; > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se) > > trace_pelt_cfs_tp(cfs_rq); > trace_pelt_se_tp(se); > - > - return 1; > } > > /* > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se) > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {} > > -static inline int propagate_entity_load_avg(struct sched_entity *se) > +static inline void propagate_entity_load_avg(struct sched_entity *se) > { > - return 0; > } > > static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {} > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > { > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0; > struct sched_avg *sa = &cfs_rq->avg; > - int decayed = 0; > + int decayed; > > if (cfs_rq->removed.nr) { > unsigned long r; > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > */ > add_tg_cfs_propagate(cfs_rq, > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); > - > - decayed = 1; > } > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq); > + decayed = __update_load_avg_cfs_rq(now, cfs_rq); > u64_u32_store_copy(sa->last_update_time, > cfs_rq->last_update_time_copy, > sa->last_update_time); > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > __update_load_avg_se(now, cfs_rq, se); > > decayed = update_cfs_rq_load_avg(now, cfs_rq); > - decayed |= propagate_entity_load_avg(se); > + propagate_entity_load_avg(se); but then you also skip the call to cfs_rq_util_change() > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) { > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > * IOW we're enqueueing a task on a new CPU. 
> */ > attach_entity_load_avg(cfs_rq, se); > - update_tg_load_avg(cfs_rq); > - > } else if (flags & DO_DETACH) { > /* > * DO_DETACH means we're here from dequeue_entity() > * and we are migrating task out of the CPU. > */ > detach_entity_load_avg(cfs_rq, se); > - update_tg_load_avg(cfs_rq); > } else if (decayed) { > cfs_rq_util_change(cfs_rq, 0); > > -- > 2.41.0 >
On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote: > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote: > > > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote: > > > Alternatively, I can remove some callsites of update_tg_load_avg() like > > > you suggested below and only call update_tg_load_avg() when cfs_rq is > > > decayed(really just decayed, not when it detected it has removed load > > > pending or load propagated from its children). This way it would give us > > > similar result as above(roughly once per ms). > > > > Something like this: (I think this is better since it removed those > > unnecessary calls to update_tg_load_avg(), although it is inline but > > still) > > > > > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001 > > From: Aaron Lu <aaron.lu@intel.com> > > Date: Wed, 19 Jul 2023 13:54:48 +0800 > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed > > > > --- > > kernel/sched/fair.c | 22 +++++++--------------- > > 1 file changed, 7 insertions(+), 15 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index a80a73909dc2..7d5b7352b8b5 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum > > } > > > > /* Update task and its cfs_rq load average */ > > -static inline int propagate_entity_load_avg(struct sched_entity *se) > > +static inline void propagate_entity_load_avg(struct sched_entity *se) > > { > > struct cfs_rq *cfs_rq, *gcfs_rq; > > > > if (entity_is_task(se)) > > - return 0; > > + return; > > > > gcfs_rq = group_cfs_rq(se); > > if (!gcfs_rq->propagate) > > - return 0; > > + return; > > > > gcfs_rq->propagate = 0; > > > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se) > > > > trace_pelt_cfs_tp(cfs_rq); > > trace_pelt_se_tp(se); > > - > > - return 1; > > } > > > > /* > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se) > > > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {} > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se) > > +static inline void propagate_entity_load_avg(struct sched_entity *se) > > { > > - return 0; > > } > > > > static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {} > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > { > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0; > > struct sched_avg *sa = &cfs_rq->avg; > > - int decayed = 0; > > + int decayed; > > > > if (cfs_rq->removed.nr) { > > unsigned long r; > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > */ > > add_tg_cfs_propagate(cfs_rq, > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); > > - > > - decayed = 1; > > } > > > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq); > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq); > > u64_u32_store_copy(sa->last_update_time, > > cfs_rq->last_update_time_copy, > > sa->last_update_time); > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > __update_load_avg_se(now, cfs_rq, se); > > > > decayed = update_cfs_rq_load_avg(now, cfs_rq); > > - decayed |= propagate_entity_load_avg(se); > > + propagate_entity_load_avg(se); > > but then you also skip the call to cfs_rq_util_change() Ah right, I missed that, 
thanks for catching this. Updated: From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001 From: Aaron Lu <aaron.lu@intel.com> Date: Wed, 19 Jul 2023 13:54:48 +0800 Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed --- kernel/sched/fair.c | 17 ++++++----------- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a80a73909dc2..8d4b9e0a19b6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) { unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0; struct sched_avg *sa = &cfs_rq->avg; - int decayed = 0; + int decayed; if (cfs_rq->removed.nr) { unsigned long r; @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) */ add_tg_cfs_propagate(cfs_rq, -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); - - decayed = 1; } - decayed |= __update_load_avg_cfs_rq(now, cfs_rq); + decayed = __update_load_avg_cfs_rq(now, cfs_rq); u64_u32_store_copy(sa->last_update_time, cfs_rq->last_update_time_copy, sa->last_update_time); @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { u64 now = cfs_rq_clock_pelt(cfs_rq); - int decayed; + int decayed, propagated; /* * Track task load average for carrying it to new CPU after migrated, and @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s __update_load_avg_se(now, cfs_rq, se); decayed = update_cfs_rq_load_avg(now, cfs_rq); - decayed |= propagate_entity_load_avg(se); + propagated = propagate_entity_load_avg(se); if (!se->avg.last_update_time && (flags & DO_ATTACH)) { @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s * IOW we're enqueueing a task on a new CPU. */ attach_entity_load_avg(cfs_rq, se); - update_tg_load_avg(cfs_rq); - } else if (flags & DO_DETACH) { /* * DO_DETACH means we're here from dequeue_entity() * and we are migrating task out of the CPU. */ detach_entity_load_avg(cfs_rq, se); - update_tg_load_avg(cfs_rq); - } else if (decayed) { + } else if (decayed || propagated) { cfs_rq_util_change(cfs_rq, 0); - if (flags & UPDATE_TG) + if (decayed && (flags & UPDATE_TG)) update_tg_load_avg(cfs_rq); } }
On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote: > > On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote: > > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote: > > > > > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote: > > > > Alternatively, I can remove some callsites of update_tg_load_avg() like > > > > you suggested below and only call update_tg_load_avg() when cfs_rq is > > > > decayed(really just decayed, not when it detected it has removed load > > > > pending or load propagated from its children). This way it would give us > > > > similar result as above(roughly once per ms). > > > > > > Something like this: (I think this is better since it removed those > > > unnecessary calls to update_tg_load_avg(), although it is inline but > > > still) > > > > > > > > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001 > > > From: Aaron Lu <aaron.lu@intel.com> > > > Date: Wed, 19 Jul 2023 13:54:48 +0800 > > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed > > > > > > --- > > > kernel/sched/fair.c | 22 +++++++--------------- > > > 1 file changed, 7 insertions(+), 15 deletions(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index a80a73909dc2..7d5b7352b8b5 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum > > > } > > > > > > /* Update task and its cfs_rq load average */ > > > -static inline int propagate_entity_load_avg(struct sched_entity *se) > > > +static inline void propagate_entity_load_avg(struct sched_entity *se) > > > { > > > struct cfs_rq *cfs_rq, *gcfs_rq; > > > > > > if (entity_is_task(se)) > > > - return 0; > > > + return; > > > > > > gcfs_rq = group_cfs_rq(se); > > > if (!gcfs_rq->propagate) > > > - return 0; > > > + return; > > > > > > gcfs_rq->propagate = 0; > > > > > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se) > > > > > > trace_pelt_cfs_tp(cfs_rq); > > > trace_pelt_se_tp(se); > > > - > > > - return 1; > > > } > > > > > > /* > > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se) > > > > > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {} > > > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se) > > > +static inline void propagate_entity_load_avg(struct sched_entity *se) > > > { > > > - return 0; > > > } > > > > > > static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {} > > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > { > > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0; > > > struct sched_avg *sa = &cfs_rq->avg; > > > - int decayed = 0; > > > + int decayed; > > > > > > if (cfs_rq->removed.nr) { > > > unsigned long r; > > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > */ > > > add_tg_cfs_propagate(cfs_rq, > > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); > > > - > > > - decayed = 1; > > > } > > > > > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq); > > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq); > > > u64_u32_store_copy(sa->last_update_time, > > > cfs_rq->last_update_time_copy, > > > sa->last_update_time); > > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > 
__update_load_avg_se(now, cfs_rq, se); > > > > > > decayed = update_cfs_rq_load_avg(now, cfs_rq); > > > - decayed |= propagate_entity_load_avg(se); > > > + propagate_entity_load_avg(se); > > > > but then you also skip the call to cfs_rq_util_change() > > Ah right, I missed that, thanks for catching this. > > Updated: > > From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001 > From: Aaron Lu <aaron.lu@intel.com> > Date: Wed, 19 Jul 2023 13:54:48 +0800 > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed > > --- > kernel/sched/fair.c | 17 ++++++----------- > 1 file changed, 6 insertions(+), 11 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index a80a73909dc2..8d4b9e0a19b6 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > { > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0; > struct sched_avg *sa = &cfs_rq->avg; > - int decayed = 0; > + int decayed; > > if (cfs_rq->removed.nr) { > unsigned long r; > @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > */ > add_tg_cfs_propagate(cfs_rq, > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); > - > - decayed = 1; We need this to propagate the change in other place like cpufreq > } > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq); > + decayed = __update_load_avg_cfs_rq(now, cfs_rq); > u64_u32_store_copy(sa->last_update_time, > cfs_rq->last_update_time_copy, > sa->last_update_time); > @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > { > u64 now = cfs_rq_clock_pelt(cfs_rq); > - int decayed; > + int decayed, propagated; > > /* > * Track task load average for carrying it to new CPU after migrated, and > @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > __update_load_avg_se(now, cfs_rq, se); > > decayed = update_cfs_rq_load_avg(now, cfs_rq); > - decayed |= propagate_entity_load_avg(se); > + propagated = propagate_entity_load_avg(se); > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) { > > @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > * IOW we're enqueueing a task on a new CPU. > */ > attach_entity_load_avg(cfs_rq, se); > - update_tg_load_avg(cfs_rq); > - > } else if (flags & DO_DETACH) { > /* > * DO_DETACH means we're here from dequeue_entity() > * and we are migrating task out of the CPU. > */ > detach_entity_load_avg(cfs_rq, se); > - update_tg_load_avg(cfs_rq); > - } else if (decayed) { > + } else if (decayed || propagated) { > cfs_rq_util_change(cfs_rq, 0); > > - if (flags & UPDATE_TG) > + if (decayed && (flags & UPDATE_TG)) It would be simpler and more readable to clear UPDATE_TG or not set it from the beginning IIUC, you rely on the fact that a decay happens every 1024 us of the cfs_rq_clock_pelt() which is scaled by frequency and cpu compute capacity. 
So you can end up with a cfs_rq_clock_pelt() that is far slower than real clock and the 1ms can easily be extended to dozens of ms > update_tg_load_avg(cfs_rq); > } > } > -- > 2.41.0 > > > > > > > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) { > > > > > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > * IOW we're enqueueing a task on a new CPU. > > > */ > > > attach_entity_load_avg(cfs_rq, se); > > > - update_tg_load_avg(cfs_rq); > > > - > > > } else if (flags & DO_DETACH) { > > > /* > > > * DO_DETACH means we're here from dequeue_entity() > > > * and we are migrating task out of the CPU. > > > */ > > > detach_entity_load_avg(cfs_rq, se); > > > - update_tg_load_avg(cfs_rq); > > > } else if (decayed) { > > > cfs_rq_util_change(cfs_rq, 0); > > > > > > -- > > > 2.41.0 > > >
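To illustrate Vincent's point about the PELT clock: a toy model with assumed
numbers (the scaling shape is abridged from update_rq_clock_pelt() in
kernel/sched/pelt.h; treat the details as an assumption), showing how 1024us
of PELT time can take several milliseconds of wall time on a down-clocked or
low-capacity CPU.

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10

/*
 * Toy model: the PELT clock advances by a wall-clock delta scaled by CPU
 * capacity and by current frequency, both expressed against a scale of 1024.
 */
static unsigned long long pelt_delta_us(unsigned long long wall_us,
					unsigned long cpu_scale,
					unsigned long freq_scale)
{
	wall_us = (wall_us * cpu_scale) >> SCHED_CAPACITY_SHIFT;
	wall_us = (wall_us * freq_scale) >> SCHED_CAPACITY_SHIFT;
	return wall_us;
}

int main(void)
{
	/* Assumed example: a capacity-512 CPU running at half its max frequency. */
	unsigned long long pelt_per_wall_ms = pelt_delta_us(1000, 512, 512);

	/*
	 * 1000us of wall time advances the PELT clock by only ~250us here, so
	 * one 1024us PELT period takes roughly 4ms of wall time to elapse.
	 */
	printf("PELT us per wall-clock ms: %llu\n", pelt_per_wall_ms);
	return 0;
}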
On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote: > On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote: > > > > On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote: > > > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote: > > > > > > > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote: > > > > > Alternatively, I can remove some callsites of update_tg_load_avg() like > > > > > you suggested below and only call update_tg_load_avg() when cfs_rq is > > > > > decayed(really just decayed, not when it detected it has removed load > > > > > pending or load propagated from its children). This way it would give us > > > > > similar result as above(roughly once per ms). > > > > > > > > Something like this: (I think this is better since it removed those > > > > unnecessary calls to update_tg_load_avg(), although it is inline but > > > > still) > > > > > > > > > > > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001 > > > > From: Aaron Lu <aaron.lu@intel.com> > > > > Date: Wed, 19 Jul 2023 13:54:48 +0800 > > > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed > > > > > > > > --- > > > > kernel/sched/fair.c | 22 +++++++--------------- > > > > 1 file changed, 7 insertions(+), 15 deletions(-) > > > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > > index a80a73909dc2..7d5b7352b8b5 100644 > > > > --- a/kernel/sched/fair.c > > > > +++ b/kernel/sched/fair.c > > > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum > > > > } > > > > > > > > /* Update task and its cfs_rq load average */ > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se) > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se) > > > > { > > > > struct cfs_rq *cfs_rq, *gcfs_rq; > > > > > > > > if (entity_is_task(se)) > > > > - return 0; > > > > + return; > > > > > > > > gcfs_rq = group_cfs_rq(se); > > > > if (!gcfs_rq->propagate) > > > > - return 0; > > > > + return; > > > > > > > > gcfs_rq->propagate = 0; > > > > > > > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se) > > > > > > > > trace_pelt_cfs_tp(cfs_rq); > > > > trace_pelt_se_tp(se); > > > > - > > > > - return 1; > > > > } > > > > > > > > /* > > > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se) > > > > > > > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {} > > > > > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se) > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se) > > > > { > > > > - return 0; > > > > } > > > > > > > > static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {} > > > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > > { > > > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0; > > > > struct sched_avg *sa = &cfs_rq->avg; > > > > - int decayed = 0; > > > > + int decayed; > > > > > > > > if (cfs_rq->removed.nr) { > > > > unsigned long r; > > > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > > */ > > > > add_tg_cfs_propagate(cfs_rq, > > > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); > > > > - > > > > - decayed = 1; > > > > } > > > > > > > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq); > > > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq); 
> > > > u64_u32_store_copy(sa->last_update_time, > > > > cfs_rq->last_update_time_copy, > > > > sa->last_update_time); > > > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > > __update_load_avg_se(now, cfs_rq, se); > > > > > > > > decayed = update_cfs_rq_load_avg(now, cfs_rq); > > > > - decayed |= propagate_entity_load_avg(se); > > > > + propagate_entity_load_avg(se); > > > > > > but then you also skip the call to cfs_rq_util_change() > > > > Ah right, I missed that, thanks for catching this. > > > > Updated: > > > > From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001 > > From: Aaron Lu <aaron.lu@intel.com> > > Date: Wed, 19 Jul 2023 13:54:48 +0800 > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed > > > > --- > > kernel/sched/fair.c | 17 ++++++----------- > > 1 file changed, 6 insertions(+), 11 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index a80a73909dc2..8d4b9e0a19b6 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > { > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0; > > struct sched_avg *sa = &cfs_rq->avg; > > - int decayed = 0; > > + int decayed; > > > > if (cfs_rq->removed.nr) { > > unsigned long r; > > @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > */ > > add_tg_cfs_propagate(cfs_rq, > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); > > - > > - decayed = 1; > > We need this to propagate the change in other place like cpufreq Ah, I just made the same mistake again, sorry. So there are three cases for a cfs_rq: load decayed, load removed and load propagated. For all three cases, cfs_rq_util_change() needs to be called and only for decayed, update_tg_load_avg() needs to be called. I'll update the patch accordingly. > > } > > > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq); > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq); > > u64_u32_store_copy(sa->last_update_time, > > cfs_rq->last_update_time_copy, > > sa->last_update_time); > > @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > > { > > u64 now = cfs_rq_clock_pelt(cfs_rq); > > - int decayed; > > + int decayed, propagated; > > > > /* > > * Track task load average for carrying it to new CPU after migrated, and > > @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > __update_load_avg_se(now, cfs_rq, se); > > > > decayed = update_cfs_rq_load_avg(now, cfs_rq); > > - decayed |= propagate_entity_load_avg(se); > > + propagated = propagate_entity_load_avg(se); > > > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) { > > > > @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > * IOW we're enqueueing a task on a new CPU. > > */ > > attach_entity_load_avg(cfs_rq, se); > > - update_tg_load_avg(cfs_rq); > > - > > } else if (flags & DO_DETACH) { > > /* > > * DO_DETACH means we're here from dequeue_entity() > > * and we are migrating task out of the CPU. 
> > */ > > detach_entity_load_avg(cfs_rq, se); > > - update_tg_load_avg(cfs_rq); > > - } else if (decayed) { > > + } else if (decayed || propagated) { > > cfs_rq_util_change(cfs_rq, 0); > > > > - if (flags & UPDATE_TG) > > + if (decayed && (flags & UPDATE_TG)) > > It would be simpler and more readable to clear UPDATE_TG or not set it > from the beginning Do you mean something like this? diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8d4b9e0a19b6..084d63371355 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4249,7 +4249,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) __update_load_avg_se(now, cfs_rq, se); - decayed = update_cfs_rq_load_avg(now, cfs_rq); + decayed = update_cfs_rq_load_avg(now, cfs_rq) && (flags & UPDATE_TG); propagated = propagate_entity_load_avg(se); if (!se->avg.last_update_time && (flags & DO_ATTACH)) { @@ -4271,7 +4271,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s } else if (decayed || propagated) { cfs_rq_util_change(cfs_rq, 0); - if (decayed && (flags & UPDATE_TG)) + if (decayed) update_tg_load_avg(cfs_rq); } } > IIUC, you rely on the fact that a decay happens every 1024 us of the > cfs_rq_clock_pelt() which is scaled by frequency and cpu compute > capacity. So you can end up with a cfs_rq_clock_pelt() that is far > slower than real clock and the 1ms can easily be extended to dozens of > ms Thanks for the info. I'm not familiar with this clock scale part and will need to take a closer look. As you know, the intent is to make the unbound update to tg->load_avg become bound for those wakeup migration heavy workloads and the way this change does to achieve it is to remove the update to tg->load_avg on attach and detach path, just leave the update on load decay path. And if the current implementation makes load decay longer than 1024us, that shouldn't be a problem for this change. I don't see an immediate problem if update to tg->load_avg happens less often than once per ms but please let me know if I missed something, thanks. > > > update_tg_load_avg(cfs_rq); > > } > > } > > -- > > 2.41.0 > > > > > > > > > > > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) { > > > > > > > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > > * IOW we're enqueueing a task on a new CPU. > > > > */ > > > > attach_entity_load_avg(cfs_rq, se); > > > > - update_tg_load_avg(cfs_rq); > > > > - > > > > } else if (flags & DO_DETACH) { > > > > /* > > > > * DO_DETACH means we're here from dequeue_entity() > > > > * and we are migrating task out of the CPU. > > > > */ > > > > detach_entity_load_avg(cfs_rq, se); > > > > - update_tg_load_avg(cfs_rq); > > > > } else if (decayed) { > > > > cfs_rq_util_change(cfs_rq, 0); > > > > > > > > -- > > > > 2.41.0 > > > >
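To make the rule stated above concrete: any of the three events (decay, removed load, propagation) should end up in cfs_rq_util_change(), while only a real decay should end up in update_tg_load_avg(). The following is a minimal standalone sketch of that decision, not kernel code; the two stub functions only print which kernel function would be called, and everything else is illustrative.

/*
 * Standalone model of the rule above: decay, removed load and propagation
 * all need cfs_rq_util_change(), but only decay needs update_tg_load_avg().
 */
#include <stdbool.h>
#include <stdio.h>

static void cfs_rq_util_change_stub(void) { puts("  -> cfs_rq_util_change()"); }
static void update_tg_load_avg_stub(void) { puts("  -> update_tg_load_avg()"); }

static void model_update_load_avg(bool decayed, bool removed, bool propagated,
				  bool update_tg_flag)
{
	printf("decayed=%d removed=%d propagated=%d:\n",
	       decayed, removed, propagated);

	/* All three events change the cfs_rq signals -> notify cpufreq. */
	if (decayed || removed || propagated)
		cfs_rq_util_change_stub();

	/* Only a real decay (and only with UPDATE_TG) updates tg->load_avg. */
	if (decayed && update_tg_flag)
		update_tg_load_avg_stub();
}

int main(void)
{
	model_update_load_avg(true,  false, false, true);	/* decay only       */
	model_update_load_avg(false, true,  false, true);	/* removed load     */
	model_update_load_avg(false, false, true,  true);	/* propagation only */
	return 0;
}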
On Thu, 20 Jul 2023 at 16:42, Aaron Lu <aaron.lu@intel.com> wrote: > > On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote: > > On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote: > > > > > > On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote: > > > > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote: > > > > > > > > > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote: > > > > > > Alternatively, I can remove some callsites of update_tg_load_avg() like > > > > > > you suggested below and only call update_tg_load_avg() when cfs_rq is > > > > > > decayed(really just decayed, not when it detected it has removed load > > > > > > pending or load propagated from its children). This way it would give us > > > > > > similar result as above(roughly once per ms). > > > > > > > > > > Something like this: (I think this is better since it removed those > > > > > unnecessary calls to update_tg_load_avg(), although it is inline but > > > > > still) > > > > > > > > > > > > > > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001 > > > > > From: Aaron Lu <aaron.lu@intel.com> > > > > > Date: Wed, 19 Jul 2023 13:54:48 +0800 > > > > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed > > > > > > > > > > --- > > > > > kernel/sched/fair.c | 22 +++++++--------------- > > > > > 1 file changed, 7 insertions(+), 15 deletions(-) > > > > > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > > > index a80a73909dc2..7d5b7352b8b5 100644 > > > > > --- a/kernel/sched/fair.c > > > > > +++ b/kernel/sched/fair.c > > > > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum > > > > > } > > > > > > > > > > /* Update task and its cfs_rq load average */ > > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se) > > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se) > > > > > { > > > > > struct cfs_rq *cfs_rq, *gcfs_rq; > > > > > > > > > > if (entity_is_task(se)) > > > > > - return 0; > > > > > + return; > > > > > > > > > > gcfs_rq = group_cfs_rq(se); > > > > > if (!gcfs_rq->propagate) > > > > > - return 0; > > > > > + return; > > > > > > > > > > gcfs_rq->propagate = 0; > > > > > > > > > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se) > > > > > > > > > > trace_pelt_cfs_tp(cfs_rq); > > > > > trace_pelt_se_tp(se); > > > > > - > > > > > - return 1; > > > > > } > > > > > > > > > > /* > > > > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se) > > > > > > > > > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {} > > > > > > > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se) > > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se) > > > > > { > > > > > - return 0; > > > > > } > > > > > > > > > > static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {} > > > > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > > > { > > > > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0; > > > > > struct sched_avg *sa = &cfs_rq->avg; > > > > > - int decayed = 0; > > > > > + int decayed; > > > > > > > > > > if (cfs_rq->removed.nr) { > > > > > unsigned long r; > > > > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > > > */ > > > > > add_tg_cfs_propagate(cfs_rq, > > 
> > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); > > > > > - > > > > > - decayed = 1; > > > > > } > > > > > > > > > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq); > > > > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq); > > > > > u64_u32_store_copy(sa->last_update_time, > > > > > cfs_rq->last_update_time_copy, > > > > > sa->last_update_time); > > > > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > > > __update_load_avg_se(now, cfs_rq, se); > > > > > > > > > > decayed = update_cfs_rq_load_avg(now, cfs_rq); > > > > > - decayed |= propagate_entity_load_avg(se); > > > > > + propagate_entity_load_avg(se); > > > > > > > > but then you also skip the call to cfs_rq_util_change() > > > > > > Ah right, I missed that, thanks for catching this. > > > > > > Updated: > > > > > > From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001 > > > From: Aaron Lu <aaron.lu@intel.com> > > > Date: Wed, 19 Jul 2023 13:54:48 +0800 > > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed > > > > > > --- > > > kernel/sched/fair.c | 17 ++++++----------- > > > 1 file changed, 6 insertions(+), 11 deletions(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index a80a73909dc2..8d4b9e0a19b6 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > { > > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0; > > > struct sched_avg *sa = &cfs_rq->avg; > > > - int decayed = 0; > > > + int decayed; > > > > > > if (cfs_rq->removed.nr) { > > > unsigned long r; > > > @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > */ > > > add_tg_cfs_propagate(cfs_rq, > > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); > > > - > > > - decayed = 1; > > > > We need this to propagate the change in other place like cpufreq > > Ah, I just made the same mistake again, sorry. > > So there are three cases for a cfs_rq: load decayed, load removed and > load propagated. For all three cases, cfs_rq_util_change() needs to be > called and only for decayed, update_tg_load_avg() needs to be called. > > I'll update the patch accordingly. > > > > } > > > > > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq); > > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq); > > > u64_u32_store_copy(sa->last_update_time, > > > cfs_rq->last_update_time_copy, > > > sa->last_update_time); > > > @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > > > { > > > u64 now = cfs_rq_clock_pelt(cfs_rq); > > > - int decayed; > > > + int decayed, propagated; > > > > > > /* > > > * Track task load average for carrying it to new CPU after migrated, and > > > @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > __update_load_avg_se(now, cfs_rq, se); > > > > > > decayed = update_cfs_rq_load_avg(now, cfs_rq); > > > - decayed |= propagate_entity_load_avg(se); > > > + propagated = propagate_entity_load_avg(se); > > > > > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) { > > > > > > @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > * IOW we're enqueueing a task on a new CPU. 
> > > */ > > > attach_entity_load_avg(cfs_rq, se); > > > - update_tg_load_avg(cfs_rq); > > > - > > > } else if (flags & DO_DETACH) { > > > /* > > > * DO_DETACH means we're here from dequeue_entity() > > > * and we are migrating task out of the CPU. > > > */ > > > detach_entity_load_avg(cfs_rq, se); > > > - update_tg_load_avg(cfs_rq); > > > - } else if (decayed) { > > > + } else if (decayed || propagated) { > > > cfs_rq_util_change(cfs_rq, 0); > > > > > > - if (flags & UPDATE_TG) > > > + if (decayed && (flags & UPDATE_TG)) > > > > It would be simpler and more readable to clear UPDATE_TG or not set it > > from the beginning > > Do you mean something like this? > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 8d4b9e0a19b6..084d63371355 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4249,7 +4249,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) > __update_load_avg_se(now, cfs_rq, se); > > - decayed = update_cfs_rq_load_avg(now, cfs_rq); > + decayed = update_cfs_rq_load_avg(now, cfs_rq) && (flags & UPDATE_TG); > propagated = propagate_entity_load_avg(se); > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) { > @@ -4271,7 +4271,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > } else if (decayed || propagated) { > cfs_rq_util_change(cfs_rq, 0); > > - if (decayed && (flags & UPDATE_TG)) > + if (decayed) > update_tg_load_avg(cfs_rq); > } > } > > > IIUC, you rely on the fact that a decay happens every 1024 us of the > > cfs_rq_clock_pelt() which is scaled by frequency and cpu compute > > capacity. So you can end up with a cfs_rq_clock_pelt() that is far > > slower than real clock and the 1ms can easily be extended to dozens of > > ms > > Thanks for the info. I'm not familiar with this clock scale part and will > need to take a closer look. > > As you know, the intent is to make the unbound update to tg->load_avg > become bound for those wakeup migration heavy workloads and the way this > change does to achieve it is to remove the update to tg->load_avg on > attach and detach path, just leave the update on load decay path. And if > the current implementation makes load decay longer than 1024us, that > shouldn't be a problem for this change. I don't see an immediate problem > if update to tg->load_avg happens less often than once per ms but please > let me know if I missed something, thanks. What was wrong with your proposal to limit the update inside update_tg_load_avg() ? except maybe s/1000000/NSEC_PER_MSEC/ and computing delta after testing the time since last update diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a80a73909dc2..e48fd0e6982d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) { long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; + u64 now = cfs_rq_clock_pelt(cfs_rq); /* * No need to update load_avg for root_task_group as it is not used. 
@@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) if (cfs_rq->tg == &root_task_group) return; - if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { + if ((now - cfs_rq->last_update_tg_load_avg > 1000000) && + abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { atomic_long_add(delta, &cfs_rq->tg->load_avg); cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; + cfs_rq->last_update_tg_load_avg = now; } } > > > > > > update_tg_load_avg(cfs_rq); > > > } > > > } > > > -- > > > 2.41.0 > > > > > > > > > > > > > > > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) { > > > > > > > > > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > > > * IOW we're enqueueing a task on a new CPU. > > > > > */ > > > > > attach_entity_load_avg(cfs_rq, se); > > > > > - update_tg_load_avg(cfs_rq); > > > > > - > > > > > } else if (flags & DO_DETACH) { > > > > > /* > > > > > * DO_DETACH means we're here from dequeue_entity() > > > > > * and we are migrating task out of the CPU. > > > > > */ > > > > > detach_entity_load_avg(cfs_rq, se); > > > > > - update_tg_load_avg(cfs_rq); > > > > > } else if (decayed) { > > > > > cfs_rq_util_change(cfs_rq, 0); > > > > > > > > > > -- > > > > > 2.41.0 > > > > >
On Thu, 20 Jul 2023 at 16:42, Aaron Lu <aaron.lu@intel.com> wrote: > > On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote: > > On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote: > > > > > > On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote: > > > > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote: > > > > > > > > > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote: > > > > > > Alternatively, I can remove some callsites of update_tg_load_avg() like > > > > > > you suggested below and only call update_tg_load_avg() when cfs_rq is > > > > > > decayed(really just decayed, not when it detected it has removed load > > > > > > pending or load propagated from its children). This way it would give us > > > > > > similar result as above(roughly once per ms). > > > > > > > > > > Something like this: (I think this is better since it removed those > > > > > unnecessary calls to update_tg_load_avg(), although it is inline but > > > > > still) > > > > > > > > > > > > > > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001 > > > > > From: Aaron Lu <aaron.lu@intel.com> > > > > > Date: Wed, 19 Jul 2023 13:54:48 +0800 > > > > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed > > > > > > > > > > --- > > > > > kernel/sched/fair.c | 22 +++++++--------------- > > > > > 1 file changed, 7 insertions(+), 15 deletions(-) > > > > > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > > > index a80a73909dc2..7d5b7352b8b5 100644 > > > > > --- a/kernel/sched/fair.c > > > > > +++ b/kernel/sched/fair.c > > > > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum > > > > > } > > > > > > > > > > /* Update task and its cfs_rq load average */ > > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se) > > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se) > > > > > { > > > > > struct cfs_rq *cfs_rq, *gcfs_rq; > > > > > > > > > > if (entity_is_task(se)) > > > > > - return 0; > > > > > + return; > > > > > > > > > > gcfs_rq = group_cfs_rq(se); > > > > > if (!gcfs_rq->propagate) > > > > > - return 0; > > > > > + return; > > > > > > > > > > gcfs_rq->propagate = 0; > > > > > > > > > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se) > > > > > > > > > > trace_pelt_cfs_tp(cfs_rq); > > > > > trace_pelt_se_tp(se); > > > > > - > > > > > - return 1; > > > > > } > > > > > > > > > > /* > > > > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se) > > > > > > > > > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {} > > > > > > > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se) > > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se) > > > > > { > > > > > - return 0; > > > > > } > > > > > > > > > > static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {} > > > > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > > > { > > > > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0; > > > > > struct sched_avg *sa = &cfs_rq->avg; > > > > > - int decayed = 0; > > > > > + int decayed; > > > > > > > > > > if (cfs_rq->removed.nr) { > > > > > unsigned long r; > > > > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > > > */ > > > > > add_tg_cfs_propagate(cfs_rq, > > 
> > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); > > > > > - > > > > > - decayed = 1; > > > > > } > > > > > > > > > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq); > > > > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq); > > > > > u64_u32_store_copy(sa->last_update_time, > > > > > cfs_rq->last_update_time_copy, > > > > > sa->last_update_time); > > > > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > > > __update_load_avg_se(now, cfs_rq, se); > > > > > > > > > > decayed = update_cfs_rq_load_avg(now, cfs_rq); > > > > > - decayed |= propagate_entity_load_avg(se); > > > > > + propagate_entity_load_avg(se); > > > > > > > > but then you also skip the call to cfs_rq_util_change() > > > > > > Ah right, I missed that, thanks for catching this. > > > > > > Updated: > > > > > > From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001 > > > From: Aaron Lu <aaron.lu@intel.com> > > > Date: Wed, 19 Jul 2023 13:54:48 +0800 > > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed > > > > > > --- > > > kernel/sched/fair.c | 17 ++++++----------- > > > 1 file changed, 6 insertions(+), 11 deletions(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index a80a73909dc2..8d4b9e0a19b6 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > { > > > unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0; > > > struct sched_avg *sa = &cfs_rq->avg; > > > - int decayed = 0; > > > + int decayed; > > > > > > if (cfs_rq->removed.nr) { > > > unsigned long r; > > > @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) > > > */ > > > add_tg_cfs_propagate(cfs_rq, > > > -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); > > > - > > > - decayed = 1; > > > > We need this to propagate the change in other place like cpufreq > > Ah, I just made the same mistake again, sorry. > > So there are three cases for a cfs_rq: load decayed, load removed and > load propagated. For all three cases, cfs_rq_util_change() needs to be > called and only for decayed, update_tg_load_avg() needs to be called. > > I'll update the patch accordingly. > > > > } > > > > > > - decayed |= __update_load_avg_cfs_rq(now, cfs_rq); > > > + decayed = __update_load_avg_cfs_rq(now, cfs_rq); > > > u64_u32_store_copy(sa->last_update_time, > > > cfs_rq->last_update_time_copy, > > > sa->last_update_time); > > > @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) > > > { > > > u64 now = cfs_rq_clock_pelt(cfs_rq); > > > - int decayed; > > > + int decayed, propagated; > > > > > > /* > > > * Track task load average for carrying it to new CPU after migrated, and > > > @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > __update_load_avg_se(now, cfs_rq, se); > > > > > > decayed = update_cfs_rq_load_avg(now, cfs_rq); > > > - decayed |= propagate_entity_load_avg(se); > > > + propagated = propagate_entity_load_avg(se); > > > > > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) { > > > > > > @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > * IOW we're enqueueing a task on a new CPU. 
> > > */ > > > attach_entity_load_avg(cfs_rq, se); > > > - update_tg_load_avg(cfs_rq); > > > - > > > } else if (flags & DO_DETACH) { > > > /* > > > * DO_DETACH means we're here from dequeue_entity() > > > * and we are migrating task out of the CPU. > > > */ > > > detach_entity_load_avg(cfs_rq, se); > > > - update_tg_load_avg(cfs_rq); > > > - } else if (decayed) { > > > + } else if (decayed || propagated) { > > > cfs_rq_util_change(cfs_rq, 0); > > > > > > - if (flags & UPDATE_TG) > > > + if (decayed && (flags & UPDATE_TG)) > > > > It would be simpler and more readable to clear UPDATE_TG or not set it > > from the beginning > > Do you mean something like this? > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 8d4b9e0a19b6..084d63371355 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4249,7 +4249,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) > __update_load_avg_se(now, cfs_rq, se); > > - decayed = update_cfs_rq_load_avg(now, cfs_rq); > + decayed = update_cfs_rq_load_avg(now, cfs_rq) && (flags & UPDATE_TG); > propagated = propagate_entity_load_avg(se); > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) { > @@ -4271,7 +4271,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > } else if (decayed || propagated) { > cfs_rq_util_change(cfs_rq, 0); > > - if (decayed && (flags & UPDATE_TG)) > + if (decayed) > update_tg_load_avg(cfs_rq); > } something like below: --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4252,6 +4252,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s __update_load_avg_se(now, cfs_rq, se); decayed = update_cfs_rq_load_avg(now, cfs_rq); + if (!decayed) + flags &= ~UPDATE_TG; decayed |= propagate_entity_load_avg(se); if (!se->avg.last_update_time && (flags & DO_ATTACH)) { > } > > > IIUC, you rely on the fact that a decay happens every 1024 us of the > > cfs_rq_clock_pelt() which is scaled by frequency and cpu compute > > capacity. So you can end up with a cfs_rq_clock_pelt() that is far > > slower than real clock and the 1ms can easily be extended to dozens of > > ms > > Thanks for the info. I'm not familiar with this clock scale part and will > need to take a closer look. > > As you know, the intent is to make the unbound update to tg->load_avg > become bound for those wakeup migration heavy workloads and the way this > change does to achieve it is to remove the update to tg->load_avg on > attach and detach path, just leave the update on load decay path. And if > the current implementation makes load decay longer than 1024us, that > shouldn't be a problem for this change. I don't see an immediate problem > if update to tg->load_avg happens less often than once per ms but please > let me know if I missed something, thanks. > > > > > > update_tg_load_avg(cfs_rq); > > > } > > > } > > > -- > > > 2.41.0 > > > > > > > > > > > > > > > > if (!se->avg.last_update_time && (flags & DO_ATTACH)) { > > > > > > > > > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > > > > * IOW we're enqueueing a task on a new CPU. > > > > > */ > > > > > attach_entity_load_avg(cfs_rq, se); > > > > > - update_tg_load_avg(cfs_rq); > > > > > - > > > > > } else if (flags & DO_DETACH) { > > > > > /* > > > > > * DO_DETACH means we're here from dequeue_entity() > > > > > * and we are migrating task out of the CPU. 
> > > > > */ > > > > > detach_entity_load_avg(cfs_rq, se); > > > > > - update_tg_load_avg(cfs_rq); > > > > > } else if (decayed) { > > > > > cfs_rq_util_change(cfs_rq, 0); > > > > > > > > > > -- > > > > > 2.41.0 > > > > >
On 20/07/2023 17:02, Vincent Guittot wrote: > On Thu, 20 Jul 2023 at 16:42, Aaron Lu <aaron.lu@intel.com> wrote: >> >> On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote: >>> On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote: >>>> >>>> On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote: >>>>> On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote: >>>>>> >>>>>> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote: [...] > What was wrong with your proposal to limit the update inside > update_tg_load_avg() ? except maybe s/1000000/NSEC_PER_MSEC/ and > computing delta after testing the time since last update > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index a80a73909dc2..e48fd0e6982d 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct > cfs_rq *cfs_rq) > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > { > long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > + u64 now = cfs_rq_clock_pelt(cfs_rq); Could this be `u64 now = sched_clock_cpu()` like in migrate_se_pelt_lag() or newidle_balance() to avoid the time morphing due to PELT's frequency and uArch invariance? > > /* > * No need to update load_avg for root_task_group as it is not used. [...]
On Thu, 20 Jul 2023 at 17:22, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 20/07/2023 17:02, Vincent Guittot wrote: > > On Thu, 20 Jul 2023 at 16:42, Aaron Lu <aaron.lu@intel.com> wrote: > >> > >> On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote: > >>> On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote: > >>>> > >>>> On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote: > >>>>> On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote: > >>>>>> > >>>>>> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote: > > [...] > > > What was wrong with your proposal to limit the update inside > > update_tg_load_avg() ? except maybe s/1000000/NSEC_PER_MSEC/ and > > computing delta after testing the time since last update > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index a80a73909dc2..e48fd0e6982d 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct > > cfs_rq *cfs_rq) > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > > { > > long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > > + u64 now = cfs_rq_clock_pelt(cfs_rq); > > Could this be `u64 now = sched_clock_cpu()` like in > migrate_se_pelt_lag() or newidle_balance() to avoid the time morphing > due to PELT's frequency and uArch invariance? yes that's a good point. I missed this > > > > /* > > * No need to update load_avg for root_task_group as it is not used. > > [...] >
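As background for the point Vincent and Dietmar make above: PELT time advances more slowly than wall time when a CPU runs below its maximum frequency or capacity, which is why "once per 1024us of PELT time" can stretch to several milliseconds of real time. The snippet below is a rough standalone illustration only; the scale-by-1024 convention mirrors SCHED_CAPACITY_SCALE, but the exact kernel behaviour differs in detail and should be taken from pelt.h, not from this sketch.

/*
 * Rough standalone model: elapsed wall-clock nanoseconds are scaled by
 * frequency and CPU capacity (both expressed out of 1024) before being
 * added to the PELT clock, so PELT time can lag wall time.
 */
#include <stdint.h>
#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10

static uint64_t cap_scale(uint64_t delta, unsigned long scale)
{
	return (delta * scale) >> SCHED_CAPACITY_SHIFT;
}

int main(void)
{
	uint64_t wall_delta = 1000000;	/* 1ms of wall time               */
	unsigned long freq_scale = 512;	/* CPU at half its max frequency  */
	unsigned long cpu_scale  = 512;	/* CPU at half its max capacity   */

	uint64_t pelt_delta = cap_scale(cap_scale(wall_delta, freq_scale),
					cpu_scale);

	/* 1ms of wall time only advances PELT time by 0.25ms here. */
	printf("wall=%lluns pelt=%lluns\n",
	       (unsigned long long)wall_delta, (unsigned long long)pelt_delta);
	return 0;
}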
On Thu, Jul 20, 2023 at 05:02:32PM +0200, Vincent Guittot wrote: > > What was wrong with your proposal to limit the update inside > update_tg_load_avg() ? except maybe s/1000000/NSEC_PER_MSEC/ and > computing delta after testing the time since last update I was thinking it might be better to align the update_tg_load_avg() with cfs_rq's decay point but that's just my feeling. Absolutely nothing wrong with the below approach, it's straightforward and effective. I'll fix the use of cfs_rq_clock_pelt() and collect some data and then send out v2. Thank you Vincent for all your comments, they're very useful to me. > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index a80a73909dc2..e48fd0e6982d 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct > cfs_rq *cfs_rq) > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > { > long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > + u64 now = cfs_rq_clock_pelt(cfs_rq); > > /* > * No need to update load_avg for root_task_group as it is not used. > @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct > cfs_rq *cfs_rq) > if (cfs_rq->tg == &root_task_group) > return; > > - if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > + if ((now - cfs_rq->last_update_tg_load_avg > 1000000) && > + abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > atomic_long_add(delta, &cfs_rq->tg->load_avg); > cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; > + cfs_rq->last_update_tg_load_avg = now; > } > }
On Thu, Jul 20, 2023 at 05:22:26PM +0200, Dietmar Eggemann wrote: > On 20/07/2023 17:02, Vincent Guittot wrote: > > On Thu, 20 Jul 2023 at 16:42, Aaron Lu <aaron.lu@intel.com> wrote: > >> > >> On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote: > >>> On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote: > >>>> > >>>> On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote: > >>>>> On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote: > >>>>>> > >>>>>> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote: > > [...] > > > What was wrong with your proposal to limit the update inside > > update_tg_load_avg() ? except maybe s/1000000/NSEC_PER_MSEC/ and > > computing delta after testing the time since last update > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index a80a73909dc2..e48fd0e6982d 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct > > cfs_rq *cfs_rq) > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > > { > > long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > > + u64 now = cfs_rq_clock_pelt(cfs_rq); > > Could this be `u64 now = sched_clock_cpu()` like in > migrate_se_pelt_lag() or newidle_balance() to avoid the time morphing > due to PELT's frequency and uArch invariance? Yes, will use sched_clock_cpu() instead of cfs_rq_clock_pelt(). Thanks. > > > > /* > > * No need to update load_avg for root_task_group as it is not used. > > [...] >
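Putting the two points together (Vincent's rate limit inside update_tg_load_avg() and Dietmar's suggestion to take the time from sched_clock_cpu() rather than the PELT clock), the v2 check could look roughly like the standalone model below. The field name last_update_tg_load_avg comes from Vincent's diff; the struct, the global standing in for the atomic tg->load_avg and main() are illustrative stand-ins, not the actual kernel code.

/*
 * Standalone model of the ratelimited update_tg_load_avg(): skip the
 * tg->load_avg update unless at least 1ms passed since the last update
 * and the contribution drifted by more than 1/64.
 */
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define NSEC_PER_MSEC	1000000ULL

struct fake_cfs_rq {
	long	 load_avg;		/* models cfs_rq->avg.load_avg */
	long	 tg_load_avg_contrib;
	uint64_t last_update_tg_load_avg;
};

static long tg_load_avg;		/* models atomic tg->load_avg */

/* 'now' would come from sched_clock_cpu() in the kernel, per the thread. */
static void model_update_tg_load_avg(struct fake_cfs_rq *cfs_rq, uint64_t now)
{
	long delta = cfs_rq->load_avg - cfs_rq->tg_load_avg_contrib;

	if (now - cfs_rq->last_update_tg_load_avg > NSEC_PER_MSEC &&
	    labs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		tg_load_avg += delta;
		cfs_rq->tg_load_avg_contrib = cfs_rq->load_avg;
		cfs_rq->last_update_tg_load_avg = now;
	}
}

int main(void)
{
	struct fake_cfs_rq cfs_rq = { .load_avg = 1024, .tg_load_avg_contrib = 512 };

	model_update_tg_load_avg(&cfs_rq, 2 * NSEC_PER_MSEC);		/* applied        */
	model_update_tg_load_avg(&cfs_rq, 2 * NSEC_PER_MSEC + 10);	/* too soon: skip */
	printf("tg_load_avg=%ld contrib=%ld\n", tg_load_avg, cfs_rq.tg_load_avg_contrib);
	return 0;
}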
On Fri, Jul 21, 2023 at 09:57:04AM +0800, Aaron Lu wrote: > On Thu, Jul 20, 2023 at 05:02:32PM +0200, Vincent Guittot wrote: > > > > What was wrong with your proposal to limit the update inside > > update_tg_load_avg() ? except maybe s/1000000/NSEC_PER_MSEC/ and > > computing delta after testing the time since last update > > I was thinking it might be better to align the update_tg_load_avg() with > cfs_rq's decay point but that's just my feeling. > > Absolutely nothing wrong with the below approach, it's straightforward > and effective. I'll fix the use of cfs_rq_clock_pelt() and collect > some data and then send out v2. > > Thank you Vincent for all your comments, they're very useful to me. > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index a80a73909dc2..e48fd0e6982d 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct > > cfs_rq *cfs_rq) > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > > { > > long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > > + u64 now = cfs_rq_clock_pelt(cfs_rq); > > > > /* > > * No need to update load_avg for root_task_group as it is not used. > > @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct > > cfs_rq *cfs_rq) > > if (cfs_rq->tg == &root_task_group) > > return; > > > > - if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > > + if ((now - cfs_rq->last_update_tg_load_avg > 1000000) && > > + abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > > atomic_long_add(delta, &cfs_rq->tg->load_avg); > > cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; > > + cfs_rq->last_update_tg_load_avg = now; > > } > > } While collecting data for v2, I find that with this "rate limit updates to tg->load_avg to at most once per ms", the other two optimizations: "skip some update_cfs_group() on en/dequeue_entity()" and "Make tg->load_avg per node" doesn't matter anymore, i.e. as long as "ratelimit" patch is there, adding those two other optimizations doesn't improve performance further. I think this is reasonable, since this "ratelimit" patch reduced updates to tg->load_avg so much like from some 10 millions during a 5s window to ~25k, the overhead of accessing tg->load_avg is dropped greatly and improving it by reducing the read side calls and making the counter per-node doesn't matter anymore. So I think for v2, I'll go with this "ratelimit" patch alone, but let me know if you think otherwise. The tests I've run are listed below. 1 postgres_sysbench with nr_thread=25%/50%/75%/100% on a 2 sockets, 112 cores, 224 threads Intel Sapphire Rapids(SPR); 2 hackbench with group=1/4/8/16, pipe, thread mode on a 2 sockets, 64 cores, 128 threads Intel Skylake(ICL) and another 2 sockets, 112 cores, 224 threads Intel SPR; 3 netperf with tcp_stream and udp_rr modes. For each mode, nr_client is set to 25%/50%/75%/100%. Tested on the same ICL and SPR as in 2). The test summary is: postgres_sysbench on SPR - when nr_thread=25% and 50%, results are in noise range; - when nr_thread=75% and 100%, "ratelimit" patch can improve transaction by 12.2% and 21.2%. hackbench: - when nr_group=1 and 4, results are in noise range for both ICL and SPR; - when nr_group=8 and 16, "ratelimit" patch can improve throughput by 6.6% and 22.4% on ICL. For SPR, the improvement is 12.5% and 48.8%. 
netperf:
- for tcp_stream mode test, all test results are in noise range for ICL and SPR;
- for udp_rr mode test, when nr_client=25% and 50% of nr_cpu, test results
  are in noise range; when nr_client=75% and 100% of nr_cpu, on ICL,
  "ratelimit" patch can improve throughput by 38.5% and 13.4%; on SPR, the
  improvement is 162% and 189%.

Full test results are listed below.

base: peterz's sched/core branch with commit e8d5ff8258bf7("sched/fair:
      Block nohz tick_stop when cfs bandwidth in use") as head.
rate: the patch that rate limit updates to tg->load_avg to at most once
      per ms; applied on top of "base".
skip: the patch "sched/fair: skip some update_cfs_group() on
      en/dequeue_entity()" as in this v1 series; applied on top of "rate".
node: the patch "sched/fair: Make tg->load_avg per node" as in this v1
      series + peterz's numa optimizations; applied on top of "skip".

All improvement percents are against "base".

postgres_sysbench on SPR:
25% (all in noise range)
base:    42382±19.8%
rate:    50174±9.5%
skip:    46311±12.0%
node:    48690±13.9%

50% (all in noise range)
base:    67626±1.3%
rate:    67365±3.1%
skip:    68218±4.9%
node:    66878±3.3%

75%
base:   100216±1.2%
rate:   112470±0.1%  +12.2%
skip:   113345±0.3%  +13.1%
node:   112962±0.1%  +12.7%

100%
base:    93671±0.4%
rate:   113563±0.2%  +21.2%
skip:   113741±0.2%  +21.4%
node:   112793±0.2%  +20.4%

hackbench on ICL:
group=1 (all in noise range)
base:   114912±5.2%
rate:   117857±2.5%
skip:   118501±4.3%
node:   119240±7.0%

group=4 (all in noise range)
base:   359902±1.6%
rate:   361685±2.7%
skip:   356354±1.6%
node:   358744±1.7%

group=8
base:   461070±0.8%
rate:   491713±0.3%  +6.6%
skip:   490183±1.4%  +6.3%
node:   490864±0.5%  +6.5%

group=16
base:   309032±5.0%
rate:   378337±1.3%  +22.4%
skip:   385637±1.4%  +24.8%
node:   375087±0.9%  +21.4%

hackbench on SPR:
group=1 (all in noise range)
base:   100768±2.9%
rate:   103134±2.9%
skip:    99642±1.7%
node:   104251±2.5%

group=4 (all in noise range)
base:   413830±12.5%
rate:   378660±16.6%
skip:   367234±13.1%
node:   372122±14.4%

group=8
base:   436124±0.6%
rate:   490787±3.2%  +12.5%
skip:   493097±1.0%  +13.0%
node:   497336±1.8%  +14.0%

group=16
base:   457730±3.2%
rate:   680452±1.3%  +48.8%
skip:   679271±0.9%  +48.4%
node:   672365±0.5%  +46.9%

netperf/udp_rr on ICL
25%
base:   114413±0.1%
rate:   115111±0.0%  +0.6%
skip:   115546±0.2%  +1.0%
node:   115214±0.3%  +0.7%

50% (all in noise range)
base:    86803±0.5%
rate:    86611±0.0%
skip:    86986±0.5%
node:    87466±0.8%

75%
base:    35959±5.3%
rate:    49801±0.6%  +38.5%
skip:    50745±1.5%  +41.1%
node:    50457±1.5%  +40.3%

100%
base:    61951±6.4%
rate:    70224±0.8%  +13.4%
skip:    67321±4.3%  +8.7%
node:    70591±2.3%  +13.9%

netperf/udp_rr on SPR
25% (all in noise range)
base:   104954±1.3%
rate:   107312±2.8%
skip:   107688±1.7%
node:   106024±1.8%

50% (all in noise range)
base:    55394±4.6%
rate:    54940±7.4%
skip:    57829±1.8%
node:    57144±6.4%

75%
base:    13779±3.1%
rate:    36105±1.1%  +162%
skip:    35319±6.3%  +156%
node:    36764±3.5%  +167%

100%
base:     9703±3.7%
rate:    28011±0.2%  +189%
skip:    27087±1.3%  +179%
node:    27337±2.2%  +182%

netperf/tcp_stream on ICL
25% (all in noise range)
base:    43092±0.1%
rate:    42891±0.5%
skip:    43208±0.4%
node:    43044±0.6%

50% (all in noise range)
base:    19278±14.9%
rate:    22369±7.2%
skip:    19375±16.0%
node:    21312±8.2%

75% (all in noise range)
base:    16822±3.0%
rate:    17086±2.3%
skip:    17571±1.8%
node:    16801±2.4%

100% (all in noise range)
base:    18216±0.6%
rate:    18078±2.9%
skip:    18040±1.0%
node:    18064±1.6%

netperf/tcp_stream on SPR
25% (all in noise range)
base:    34491±0.3%
rate:    34886±0.5%
skip:    34894±0.9%
node:    34692±0.6%

50% (all in noise range)
base:    19278±14.9%
rate:    22369±7.2%
skip:    19375±16.0%
node:    21312±8.2%

75% (all in noise range)
base:    16822±3.0%
rate:    17086±2.3%
skip:    17571±1.8%
node:    16801±2.4%

100% (all in noise range)
base:    18216±0.6%
rate:    18078±2.9%
skip:    18040±0.1%
node:    18064±1.6%

Thanks,
Aaron
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aceb8f5922cb..564ffe3e59c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3645,6 +3645,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;
 
+	if (cfs_rq->prop_removed_sum)
+		return false;
+
 	return true;
 }
 
@@ -3911,6 +3914,11 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
 {
 	cfs_rq->propagate = 1;
 	cfs_rq->prop_runnable_sum += runnable_sum;
+
+	if (cfs_rq->prop_removed_sum) {
+		cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum;
+		cfs_rq->prop_removed_sum = 0;
+	}
 }
 
 /* Update task and its cfs_rq load average */
@@ -4133,13 +4141,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		 * removed_runnable is the unweighted version of removed_load so we
 		 * can use it to estimate removed_load_sum.
 		 */
-		add_tg_cfs_propagate(cfs_rq,
-			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
-
-		decayed = 1;
+		cfs_rq->prop_removed_sum +=
+			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT;
 	}
 
-	decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
+	decayed = __update_load_avg_cfs_rq(now, cfs_rq);
 	u64_u32_store_copy(sa->last_update_time,
 			   cfs_rq->last_update_time_copy,
 			   sa->last_update_time);
@@ -9001,6 +9007,13 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
 
 		if (cfs_rq == &rq->cfs)
 			decayed = true;
+
+		/*
+		 * If the aggregated removed_sum hasn't been taken care of,
+		 * deal with it now before this cfs_rq is removed from the list.
+		 */
+		if (cfs_rq->prop_removed_sum)
+			add_tg_cfs_propagate(cfs_rq, 0);
 	}
 
 	/* Propagate pending load changes to the parent, if any: */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9cece2dbc95b..ab540b21d071 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -619,6 +619,7 @@ struct cfs_rq {
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
+	long			prop_removed_sum;
 
 	/*
 	 * h_load = weight * f(tg)
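A standalone model of what the hunks above do: removed load is parked in prop_removed_sum instead of being propagated (and triggering an immediate update of tg->load_avg); it is folded into prop_runnable_sum only when a propagation happens anyway, or when the cfs_rq is about to leave the leaf list. The types, values and helper names below are simplified stand-ins for illustration, not the kernel code.

/*
 * Standalone model of the "park removed load, fold it in later" idea.
 */
#include <stdio.h>

struct fake_cfs_rq {
	int  propagate;
	long prop_runnable_sum;
	long prop_removed_sum;
};

static void add_tg_cfs_propagate(struct fake_cfs_rq *cfs_rq, long runnable_sum)
{
	cfs_rq->propagate = 1;
	cfs_rq->prop_runnable_sum += runnable_sum;

	/* Fold in any removed load that was parked earlier. */
	if (cfs_rq->prop_removed_sum) {
		cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum;
		cfs_rq->prop_removed_sum = 0;
	}
}

static void remove_load(struct fake_cfs_rq *cfs_rq, long removed_runnable_sum)
{
	/* Instead of propagating immediately, just remember the removal. */
	cfs_rq->prop_removed_sum += -removed_runnable_sum;
}

int main(void)
{
	struct fake_cfs_rq cfs_rq = { 0 };

	remove_load(&cfs_rq, 512);		/* task migrated away: parked    */
	printf("parked: removed=%ld runnable=%ld\n",
	       cfs_rq.prop_removed_sum, cfs_rq.prop_runnable_sum);

	add_tg_cfs_propagate(&cfs_rq, 256);	/* next propagation folds it in  */
	printf("folded: removed=%ld runnable=%ld propagate=%d\n",
	       cfs_rq.prop_removed_sum, cfs_rq.prop_runnable_sum, cfs_rq.propagate);
	return 0;
}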