[RFC,3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load

Message ID 20230718134120.81199-4-aaron.lu@intel.com
State New
Series Reduce cost of accessing tg->load_avg

Commit Message

Aaron Lu July 18, 2023, 1:41 p.m. UTC
When a workload involves many wake-time task migrations, tg->load_avg
can be heavily contended among CPUs because every migration involves
removing the task's load from its src cfs_rq and attaching that load to
its new cfs_rq. Both the remove and the attach require an update to
tg->load_avg as well as propagating the change up the hierarchy.
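For reference, a minimal sketch of the write side that ends up contended
(trimmed from mainline update_tg_load_avg(), which also appears as diff
context later in this thread): every cfs_rq of a task group funnels its
delta into the single shared tg->load_avg, so frequent remove/attach
updates from many CPUs keep bouncing the same cache line.

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	/* root_task_group's load_avg is not used. */
	if (cfs_rq->tg == &root_task_group)
		return;

	/* Only write when the local change exceeds ~1.5% of the last contribution. */
	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		/* tg->load_avg is shared by this group's cfs_rq on every CPU. */
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}
}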

E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
Sapphire Rapids machine, a 5s window sees about 14 million wakeups and
11 million migrations. Since the workload triggers this many wakeups and
migrations, the accesses (both read and write) to tg->load_avg are
effectively unbounded. For this workload, the profile shows
update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With
netperf/nr_client=nr_cpu/UDP_RR, there are 21 million wakeups and 14
million migrations in the same window; update_cfs_group() costs ~25% and
update_load_avg() costs ~16%.

This patch is an attempt to reduce the cost of accessing tg->load_avg.

The current logic immediately calls update_tg_load_avg() if a cfs_rq has
removed load; this patch changes that behavior: when
update_cfs_rq_load_avg() finds removed load on a cfs_rq, it no longer
calls update_tg_load_avg() or propagates the removed load immediately.
Instead, the update to tg->load_avg and the propagation are deferred to a
following event, such as a task being attached to this cfs_rq or the next
update_blocked_averages(). This way, the calls to update_tg_load_avg()
for this cfs_rq and its ancestors are roughly halved.
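The core of the change, in sketch form (the full diff is quoted in the
replies below): removed load is parked in a new per-cfs_rq field instead
of being propagated right away, and is folded into prop_runnable_sum the
next time add_tg_cfs_propagate() runs for this cfs_rq, e.g. on a later
attach or from update_blocked_averages().

	/* In update_cfs_rq_load_avg(): accumulate instead of propagating now. */
	cfs_rq->prop_removed_sum +=
		-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT;

	/* In add_tg_cfs_propagate(): fold the deferred sum in on the next event. */
	if (cfs_rq->prop_removed_sum) {
		cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum;
		cfs_rq->prop_removed_sum = 0;
	}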

================================================
postgres_sysbench (transactions, higher is better)
nr_thread=100%/75%/50% were tested on 2-socket SPR and Ice Lake and the
results that have a measurable difference are:

nr_thread=100% on SPR:
base:   90569.11±1.15%
node:  104152.26±0.34%  +15.0%
delay: 127309.46±4.25%  +40.6%

nr_thread=75% on SPR:
base:  100803.96±0.57%
node:  107333.58±0.44%   +6.5%
delay: 124332.39±0.51%  +23.3%

nr_thread=75% on ICL:
base:  61961.26±0.41%
node:  61585.45±0.50%
delay: 72420.52±0.14%  +16.9%

=======================================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2-socket SPR and Cascade Lake and the
results that have a measurable difference are:

group=8 on SPR:
base:  437163±2.6%
node:  471203±1.2%   +7.8%
delay: 490780±0.9%  +12.3%

group=16 on SPR:
base:  468279±1.9%
node:  580385±1.7%  +23.9%
delay: 664422±0.2%  +41.9%

================================================
netperf/TCP_STREAM (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and the results that have a measurable difference are:

nr_thread=50% on CSL:
base:  16258±0.7%
node:  16172±2.9%
delay: 17729±0.7%  +9.0%

nr_thread=75% on CSL:
base:  12923±1.2%
node:  13011±2.2%
delay: 15452±1.6%  +19.6%

nr_thread=75% on SPR:
base:  16232±11.9%
node:  13962±5.1%
delay: 21089±0.8%  +29.9%

nr_thread=100% on SPR:
base:  13220±0.6%
node:  13113±0.0%
delay: 18258±11.3% +38.1%

=============================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and the results that have a measurable difference are:

nr_thread=1 on CSL:
base:  128521±0.5%
node:  127935±0.6%
delay: 126317±0.4%  -1.7%

nr_thread=75% on CSL:
base:  36701±1.7%
node:  39949±1.4%   +8.8%
delay: 42516±0.3%  +15.8%

nr_thread=75% on SPR:
base:  14249±3.8%
node:  19890±2.0%   +39.6%
delay: 31331±0.5%  +119.9%

nr_thread=100% on CSL:
base:  52275±0.6%
node:  53827±0.4%   +3.0%
delay: 78386±0.7%  +49.9%

nr_thread=100% on SPR:
base:   9560±1.6%
node:  14186±3.9%   +48.4%
delay: 20779±2.8%  +117.4%

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 kernel/sched/fair.c  | 23 ++++++++++++++++++-----
 kernel/sched/sched.h |  1 +
 2 files changed, 19 insertions(+), 5 deletions(-)
  

Comments

Vincent Guittot July 18, 2023, 4:01 p.m. UTC | #1
On Tue, 18 Jul 2023 at 15:41, Aaron Lu <aaron.lu@intel.com> wrote:
>
> When a workload involves many wake time task migrations, tg->load_avg
> can be heavily contended among CPUs because every migration involves
> removing the task's load from its src cfs_rq and attach that load to
> its new cfs_rq. Both the remove and attach requires an update to
> tg->load_avg as well as propagating the change up the hierarchy.
>
> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> Sappire Rapids, during a 5s window, the wakeup number is 14millions and
> migration number is 11millions. Since the workload can trigger many
> wakeups and migrations, the access(both read and write) to tg->load_avg
> can be unbound. For the above said workload, the profile shows
> update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With
> netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21millions and
> migration number is 14millions; update_cfs_group() costs ~25% and
> update_load_avg() costs ~16%.
>
> This patch is an attempt to reduce the cost of accessing tg->load_avg.

Here you mention tg->load_avg, which is updated with update_tg_load_avg(),

but your patch below modifies the local updates of the cfs_rq's util_avg,
runnable_avg and load_avg, which need to be kept up to date.

It would be better to delay or rate-limit the calls to
update_tg_load_avg() or calc_group_shares()/update_cfs_group(), which
access tg->load_avg and are the real culprit, instead of modifying other
places.

Have you tried removing update_cfs_group() from enqueue/dequeue and
only letting the tick update the shares periodically?

Have you tried making update_tg_load_avg() return early? The change of
the cfs_rq's load_avg is written to tg->load_avg only when the change is
bigger than 1.5%; maybe increase it to 6%?
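(For reference: in mainline that threshold is the
abs(delta) > cfs_rq->tg_load_avg_contrib / 64 check in
update_tg_load_avg(), i.e. roughly 1/64 ~= 1.5%; a divisor of 16 would
give the ~6% mentioned here.)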

Or, as with update_cfs_group(), remove it from entity attach/detach and
let the periodic update propagate the change.

But please don't touch the local updates of *_avg.

>
> Current logic will immediately do a update_tg_load_avg() if cfs_rq has
> removed load; this patch changes this behavior: if this cfs_rq has
> removed load as discovered by update_cfs_rq_load_avg(), it didn't call
> update_tg_load_avg() or propagate the removed load immediately, instead,
> the update to tg->load_avg and propagated load can be dealed with by a
> following event like task attached to this cfs_rq or in
> update_blocked_averages(). This way, the call to update_tg_load_avg()
> for this cfs_rq and its ancestors can be reduced by about half.
>
> ================================================
> postgres_sysbench(transaction, higher is better)
> nr_thread=100%/75%/50% were tested on 2 sockets SPR and Icelake and
> results that have a measuable difference are:
>
> nr_thread=100% on SPR:
> base:   90569.11±1.15%
> node:  104152.26±0.34%  +15.0%
> delay: 127309.46±4.25%  +40.6%
>
> nr_thread=75% on SPR:
> base:  100803.96±0.57%
> node:  107333.58±0.44%   +6.5%
> delay: 124332.39±0.51%  +23.3%
>
> nr_thread=75% on ICL:
> base:  61961.26±0.41%
> node:  61585.45±0.50%
> delay: 72420.52±0.14%  +16.9%
>
> =======================================================================
> hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
> group=1/4/8/16 were tested on 2 sockets SPR and Cascade lake and the
> results that have a measuable difference are:
>
> group=8 on SPR:
> base:  437163±2.6%
> node:  471203±1.2%   +7.8%
> delay: 490780±0.9%  +12.3%
>
> group=16 on SPR:
> base:  468279±1.9%
> node:  580385±1.7%  +23.9%
> delay: 664422±0.2%  +41.9%
>
> ================================================
> netperf/TCP_STRAM (throughput, higher is better)
> nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
> Lake and results that have a measuable difference are:
>
> nr_thread=50% on CSL:
> base:  16258±0.7%
> node:  16172±2.9%
> delay: 17729±0.7%  +9.0%
>
> nr_thread=75% on CSL:
> base:  12923±1.2%
> node:  13011±2.2%
> delay: 15452±1.6%  +19.6%
>
> nr_thread=75% on SPR:
> base:  16232±11.9%
> node:  13962±5.1%
> delay: 21089±0.8%  +29.9%
>
> nr_thread=100% on SPR:
> base:  13220±0.6%
> node:  13113±0.0%
> delay: 18258±11.3% +38.1%
>
> =============================================
> netperf/UDP_RR (throughput, higher is better)
> nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
> Lake and results that have measuable difference are:
>
> nr_thread=1 on CSL:
> base:  128521±0.5%
> node:  127935±0.6%
> delay: 126317±0.4%  -1.7%
>
> nr_thread=75% on CSL:
> base:  36701±1.7%
> node:  39949±1.4%   +8.8%
> delay: 42516±0.3%  +15.8%
>
> nr_thread=75% on SPR:
> base:  14249±3.8%
> node:  19890±2.0%   +39.6%
> delay: 31331±0.5%  +119.9%
>
> nr_thread=100% on CSL:
> base:  52275±0.6%
> node:  53827±0.4%   +3.0%
> delay: 78386±0.7%  +49.9%
>
> nr_thread=100% on SPR:
> base:   9560±1.6%
> node:  14186±3.9%   +48.4%
> delay: 20779±2.8%  +117.4%
>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  kernel/sched/fair.c  | 23 ++++++++++++++++++-----
>  kernel/sched/sched.h |  1 +
>  2 files changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aceb8f5922cb..564ffe3e59c1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3645,6 +3645,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>         if (child_cfs_rq_on_list(cfs_rq))
>                 return false;
>
> +       if (cfs_rq->prop_removed_sum)
> +               return false;
> +
>         return true;
>  }
>
> @@ -3911,6 +3914,11 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
>  {
>         cfs_rq->propagate = 1;
>         cfs_rq->prop_runnable_sum += runnable_sum;
> +
> +       if (cfs_rq->prop_removed_sum) {
> +               cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum;
> +               cfs_rq->prop_removed_sum = 0;
> +       }
>  }
>
>  /* Update task and its cfs_rq load average */
> @@ -4133,13 +4141,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                  * removed_runnable is the unweighted version of removed_load so we
>                  * can use it to estimate removed_load_sum.
>                  */
> -               add_tg_cfs_propagate(cfs_rq,
> -                       -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> -
> -               decayed = 1;
> +               cfs_rq->prop_removed_sum +=
> +                       -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT;
>         }
>
> -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
>         u64_u32_store_copy(sa->last_update_time,
>                            cfs_rq->last_update_time_copy,
>                            sa->last_update_time);
> @@ -9001,6 +9007,13 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
>
>                         if (cfs_rq == &rq->cfs)
>                                 decayed = true;
> +
> +                       /*
> +                        * If the aggregated removed_sum hasn't been taken care of,
> +                        * deal with it now before this cfs_rq is removed from the list.
> +                        */
> +                       if (cfs_rq->prop_removed_sum)
> +                               add_tg_cfs_propagate(cfs_rq, 0);
>                 }
>
>                 /* Propagate pending load changes to the parent, if any: */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9cece2dbc95b..ab540b21d071 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -619,6 +619,7 @@ struct cfs_rq {
>         unsigned long           tg_load_avg_contrib;
>         long                    propagate;
>         long                    prop_runnable_sum;
> +       long                    prop_removed_sum;
>
>         /*
>          *   h_load = weight * f(tg)
> --
> 2.41.0
>
  
Aaron Lu July 19, 2023, 5:18 a.m. UTC | #2
On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote:
> On Tue, 18 Jul 2023 at 15:41, Aaron Lu <aaron.lu@intel.com> wrote:
> >
> > When a workload involves many wake time task migrations, tg->load_avg
> > can be heavily contended among CPUs because every migration involves
> > removing the task's load from its src cfs_rq and attach that load to
> > its new cfs_rq. Both the remove and attach requires an update to
> > tg->load_avg as well as propagating the change up the hierarchy.
> >
> > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> > Sappire Rapids, during a 5s window, the wakeup number is 14millions and
> > migration number is 11millions. Since the workload can trigger many
> > wakeups and migrations, the access(both read and write) to tg->load_avg
> > can be unbound. For the above said workload, the profile shows
> > update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With
> > netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21millions and
> > migration number is 14millions; update_cfs_group() costs ~25% and
> > update_load_avg() costs ~16%.
> >
> > This patch is an attempt to reduce the cost of accessing tg->load_avg.
> 
> here you mention tg->load_avg which is updated with update_tg_load_avg()
> 
> but your patch below modifies the local update of cfs's util_avg,
> runnable_avg  and load_avg which need to be maintained up to date

Yes, since it delays propagating the removed load, the upper cfs_rqs'
*_avg can be updated later than with the current code.

> You should be better to delay or rate limit the call to
> update_tg_load_avg() or calc_group_shares()/update_cfs_group() which
> access tg->load_avg and are the culprit instead of modifying other
> place.

Thanks for the suggestion and I think it makes perfect sense.

I tried the below to rate-limit the update to tg->load_avg to at most
once per ms in update_tg_load_avg():

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..e48fd0e6982d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
 	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	u64 now = cfs_rq_clock_pelt(cfs_rq);
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
-	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
+	if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
+	    abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14dfaafb3a8f..b5201d44d820 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -594,6 +594,7 @@ struct cfs_rq {
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	unsigned long		tg_load_avg_contrib;
+	u64			last_update_tg_load_avg;
 	long			propagate;
 	long			prop_runnable_sum;
 
From some quick tests using postgres_sysbench and netperf/UDP_RR on SPR,
this alone already delivers good results. For postgres_sysbench, the two
functions' cost dropped to 1%-2% each; and for netperf/UDP_RR, the two
functions' cost dropped to ~2% and ~4%. Together with some more rate
limiting on the read side, I think the end result will be better. Does
the above look reasonable to you?

Alternatively, I can remove some call sites of update_tg_load_avg() as
you suggested below and only call update_tg_load_avg() when the cfs_rq
has decayed (really just decayed, not when it has removed load pending or
load propagated from its children). This would give us a similar result
to the above (roughly once per ms).

> 
> Have you tried to remove update_cfs_group() from enqueue/dequeue and
> only let the tick update the share periodically ?

patch4 kind of did that :-)

> 
> Have you tried to make update_tg_load_avg() return early ? the change
> of cfs' load_avg is written in the tg->load_avg only when the change
> is bigger than 1.5%. maybe increase it to 6% ?

Yes, increasing the delta threshold is also a good way to limit the
updates to tg->load_avg. Will try that too.

> 
> Or like for update_cfs_group, remove it from attach/detach entity and
> let the periodic update to propagate the change
> 
> But please don't touch local update of *_avg

OK I see.

Thanks again for the comments, they are very helpful.
  
Aaron Lu July 19, 2023, 8:01 a.m. UTC | #3
On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> Alternatively, I can remove some callsites of update_tg_load_avg() like
> you suggested below and only call update_tg_load_avg() when cfs_rq is
> decayed(really just decayed, not when it detected it has removed load
> pending or load propagated from its children). This way it would give us
> similar result as above(roughly once per ms).

Something like this (I think this is better since it removes those
unnecessary calls to update_tg_load_avg(); it is inline, but still):


From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
From: Aaron Lu <aaron.lu@intel.com>
Date: Wed, 19 Jul 2023 13:54:48 +0800
Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed

---
 kernel/sched/fair.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..7d5b7352b8b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
 }
 
 /* Update task and its cfs_rq load average */
-static inline int propagate_entity_load_avg(struct sched_entity *se)
+static inline void propagate_entity_load_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq, *gcfs_rq;
 
 	if (entity_is_task(se))
-		return 0;
+		return;
 
 	gcfs_rq = group_cfs_rq(se);
 	if (!gcfs_rq->propagate)
-		return 0;
+		return;
 
 	gcfs_rq->propagate = 0;
 
@@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 
 	trace_pelt_cfs_tp(cfs_rq);
 	trace_pelt_se_tp(se);
-
-	return 1;
 }
 
 /*
@@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
 
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
 
-static inline int propagate_entity_load_avg(struct sched_entity *se)
+static inline void propagate_entity_load_avg(struct sched_entity *se)
 {
-	return 0;
 }
 
 static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
@@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
 	unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
-	int decayed = 0;
+	int decayed;
 
 	if (cfs_rq->removed.nr) {
 		unsigned long r;
@@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		 */
 		add_tg_cfs_propagate(cfs_rq,
 			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
-
-		decayed = 1;
 	}
 
-	decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
+	decayed = __update_load_avg_cfs_rq(now, cfs_rq);
 	u64_u32_store_copy(sa->last_update_time,
 			   cfs_rq->last_update_time_copy,
 			   sa->last_update_time);
@@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		__update_load_avg_se(now, cfs_rq, se);
 
 	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
-	decayed |= propagate_entity_load_avg(se);
+	propagate_entity_load_avg(se);
 
 	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
 
@@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		 * IOW we're enqueueing a task on a new CPU.
 		 */
 		attach_entity_load_avg(cfs_rq, se);
-		update_tg_load_avg(cfs_rq);
-
 	} else if (flags & DO_DETACH) {
 		/*
 		 * DO_DETACH means we're here from dequeue_entity()
 		 * and we are migrating task out of the CPU.
 		 */
 		detach_entity_load_avg(cfs_rq, se);
-		update_tg_load_avg(cfs_rq);
 	} else if (decayed) {
 		cfs_rq_util_change(cfs_rq, 0);
  
Aaron Lu July 19, 2023, 8:11 a.m. UTC | #4
On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote:
> > Have you tried to remove update_cfs_group() from enqueue/dequeue and
> > only let the tick update the share periodically ?
> 
> patch4 kind of did that :-)
> 

More about this.

If I remove update_cfs_group() in dequeue_task_fair() on top of patch4
like this:

From 43d5c12f0b2180c99149e663a71c610e31023d90 Mon Sep 17 00:00:00 2001
From: Aaron Lu <aaron.lu@intel.com>
Date: Wed, 19 Jul 2023 14:51:07 +0800
Subject: [PATCH 1/2] sched/fair: completely remove update_cfs_group() in
 dequeue path

---
 kernel/sched/fair.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2adb6a6abbce..a21ab72819ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6434,7 +6434,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 		se_update_runnable(se);
-		update_cfs_group(se);
 
 		cfs_rq->h_nr_running--;
 		cfs_rq->idle_h_nr_running -= idle_h_nr_running;
  
Vincent Guittot July 19, 2023, 9:09 a.m. UTC | #5
On Wed, 19 Jul 2023 at 07:18, Aaron Lu <aaron.lu@intel.com> wrote:
>
> On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote:
> > On Tue, 18 Jul 2023 at 15:41, Aaron Lu <aaron.lu@intel.com> wrote:
> > >
> > > When a workload involves many wake time task migrations, tg->load_avg
> > > can be heavily contended among CPUs because every migration involves
> > > removing the task's load from its src cfs_rq and attach that load to
> > > its new cfs_rq. Both the remove and attach requires an update to
> > > tg->load_avg as well as propagating the change up the hierarchy.
> > >
> > > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> > > Sappire Rapids, during a 5s window, the wakeup number is 14millions and
> > > migration number is 11millions. Since the workload can trigger many
> > > wakeups and migrations, the access(both read and write) to tg->load_avg
> > > can be unbound. For the above said workload, the profile shows
> > > update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With
> > > netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21millions and
> > > migration number is 14millions; update_cfs_group() costs ~25% and
> > > update_load_avg() costs ~16%.
> > >
> > > This patch is an attempt to reduce the cost of accessing tg->load_avg.
> >
> > here you mention tg->load_avg which is updated with update_tg_load_avg()
> >
> > but your patch below modifies the local update of cfs's util_avg,
> > runnable_avg  and load_avg which need to be maintained up to date
>
> Yes, since it delayed propagating the removed load, the upper cfs_rqs'
> *_avg could be updated later than current code.
>
> > You should be better to delay or rate limit the call to
> > update_tg_load_avg() or calc_group_shares()/update_cfs_group() which
> > access tg->load_avg and are the culprit instead of modifying other
> > place.
>
> Thanks for the suggestion and I think it makes perfect sense.
>
> I tried below to rate limit the update to tg->load_avg at most once per
> ms in update_tg_load_avg():
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..e48fd0e6982d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  {
>         long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> +       u64 now = cfs_rq_clock_pelt(cfs_rq);
>
>         /*
>          * No need to update load_avg for root_task_group as it is not used.
> @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>         if (cfs_rq->tg == &root_task_group)
>                 return;
>
> -       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> +       if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
> +           abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>                 atomic_long_add(delta, &cfs_rq->tg->load_avg);
>                 cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> +               cfs_rq->last_update_tg_load_avg = now;
>         }
>  }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 14dfaafb3a8f..b5201d44d820 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -594,6 +594,7 @@ struct cfs_rq {
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         unsigned long           tg_load_avg_contrib;
> +       u64                     last_update_tg_load_avg;
>         long                    propagate;
>         long                    prop_runnable_sum;
>
> From some quick tests using postgres_sysbench and netperf/UDP_RR on SPR,
> this alone already delivers good results. For postgres_sysbench, the two
> functions cost dropped to 1%-2% each; and for netperf/UDP_RR, the two
> functions cost dropped to ~2% and ~4%. Togerther with some more rate
> limiting on the read side, I think the end result will be better. Does
> the above look reasonable to you?

Yes, that seems a better way to limit write access

>
> Alternatively, I can remove some callsites of update_tg_load_avg() like
> you suggested below and only call update_tg_load_avg() when cfs_rq is
> decayed(really just decayed, not when it detected it has removed load
> pending or load propagated from its children). This way it would give us
> similar result as above(roughly once per ms).
>
> >
> > Have you tried to remove update_cfs_group() from enqueue/dequeue and
> > only let the tick update the share periodically ?
>
> patch4 kind of did that :-)

That's what I have noticed when reading patch 4 :-)

>
> >
> > Have you tried to make update_tg_load_avg() return early ? the change
> > of cfs' load_avg is written in the tg->load_avg only when the change
> > is bigger than 1.5%. maybe increase it to 6% ?
>
> Yes, increase the delta is also a good way to limit the update to
> tg->load_avg. Will try that too.
>
> >
> > Or like for update_cfs_group, remove it from attach/detach entity and
> > let the periodic update to propagate the change
> >
> > But please don't touch local update of *_avg
>
> OK I see.
>
> Thanks again for the comments, they are very helpful.
  
Vincent Guittot July 19, 2023, 9:12 a.m. UTC | #6
On Wed, 19 Jul 2023 at 10:11, Aaron Lu <aaron.lu@intel.com> wrote:
>
> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote:
> > > Have you tried to remove update_cfs_group() from enqueue/dequeue and
> > > only let the tick update the share periodically ?
> >
> > patch4 kind of did that :-)
> >
>
> More about this.
>
> If I remove update_cfs_group() in dequeue_task_fair() on top of patch4
> like this:
>
> From 43d5c12f0b2180c99149e663a71c610e31023d90 Mon Sep 17 00:00:00 2001
> From: Aaron Lu <aaron.lu@intel.com>
> Date: Wed, 19 Jul 2023 14:51:07 +0800
> Subject: [PATCH 1/2] sched/fair: completely remove update_cfs_group() in
>  dequeue path
>
> ---
>  kernel/sched/fair.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2adb6a6abbce..a21ab72819ce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6434,7 +6434,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>                 update_load_avg(cfs_rq, se, UPDATE_TG);
>                 se_update_runnable(se);
> -               update_cfs_group(se);
>
>                 cfs_rq->h_nr_running--;
>                 cfs_rq->idle_h_nr_running -= idle_h_nr_running;
> --
> 2.40.1
>
> Then the P95 latency of the schbench workload I described in patch4's
> changelog will increase to > 1ms (base and patch4's P95 < 100us):
>
> Latency percentiles (usec) runtime 300 (s) (18504 total samples)
>         50.0th: 20 (9537 samples)
>         75.0th: 25 (4869 samples)
>         90.0th: 29 (2264 samples)
>         95.0th: 2564 (909 samples)
>         *99.0th: 20768 (740 samples)
>         99.5th: 23520 (93 samples)
>         99.9th: 31520 (74 samples)
>         min=6, max=40072
>
> If I further remove update_cfs_group() completely in enqueue path on top
> of the last change:
>
> From 4e4cb31590ca2e4080ece9cfa9dfaaf26501c60d Mon Sep 17 00:00:00 2001
> From: Aaron Lu <aaron.lu@intel.com>
> Date: Wed, 19 Jul 2023 15:36:24 +0800
> Subject: [PATCH 2/2] sched/fair: completely remove update_cfs_group() from
>  enqueue path
>
> ---
>  kernel/sched/fair.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a21ab72819ce..8fc325112282 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4847,8 +4847,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>          */
>         update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
>         se_update_runnable(se);
> -       if (cfs_rq->nr_running > 0)
> -               update_cfs_group(se);
>         account_entity_enqueue(cfs_rq, se);
>
>         if (flags & ENQUEUE_WAKEUP)
> @@ -6344,7 +6342,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>                 update_load_avg(cfs_rq, se, UPDATE_TG);
>                 se_update_runnable(se);
> -               update_cfs_group(se);
>
>                 cfs_rq->h_nr_running++;
>                 cfs_rq->idle_h_nr_running += idle_h_nr_running;
> --
> 2.40.1
>
> Then P50's latency will bump to ~4ms from ~20us:
> Latency percentiles (usec) runtime 300 (s) (17940 total samples)
>         50.0th: 3996 (12092 samples)
>         75.0th: 4004 (4919 samples)
>         90.0th: 4004 (0 samples)
>         95.0th: 4012 (353 samples)
>         *99.0th: 20000 (487 samples)
>         99.5th: 20000 (0 samples)
>         99.9th: 31136 (72 samples)
>         min=7, max=37402
> real    5m36.633s
> user    47m33.947s
> sys     4m47.097s
>
> So for the read side, maybe just keep what patch4 does?

Yes, skipping update_cfs_group() at enqueue bypasses the opportunity to
increase the share and get more running time for the group until the
update really happens.
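
For context, a rough sketch of the read side being discussed here
(simplified from mainline calc_group_shares(); details may differ): the
group entity's weight is derived from the shared tg->load_avg, so until
update_cfs_group() runs again after an enqueue, a newly busy group keeps
its old, smaller weight and gets correspondingly less CPU time.

	/* Simplified sketch of calc_group_shares(); corner cases elided. */
	load      = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
	tg_weight = atomic_long_read(&tg->load_avg);	/* shared, contended read */
	tg_weight += load - cfs_rq->tg_load_avg_contrib; /* substitute local view */
	shares    = tg_shares * load;
	if (tg_weight)
		shares /= tg_weight;
	shares    = clamp_t(long, shares, MIN_SHARES, tg_shares);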

>
> Thanks,
> Aaron
  
Vincent Guittot July 19, 2023, 9:47 a.m. UTC | #7
On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote:
>
> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > Alternatively, I can remove some callsites of update_tg_load_avg() like
> > you suggested below and only call update_tg_load_avg() when cfs_rq is
> > decayed(really just decayed, not when it detected it has removed load
> > pending or load propagated from its children). This way it would give us
> > similar result as above(roughly once per ms).
>
> Something like this: (I think this is better since it removed those
> unnecessary calls to update_tg_load_avg(), although it is inline but
> still)
>
>
> From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
> From: Aaron Lu <aaron.lu@intel.com>
> Date: Wed, 19 Jul 2023 13:54:48 +0800
> Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
>
> ---
>  kernel/sched/fair.c | 22 +++++++---------------
>  1 file changed, 7 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..7d5b7352b8b5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
>  }
>
>  /* Update task and its cfs_rq load average */
> -static inline int propagate_entity_load_avg(struct sched_entity *se)
> +static inline void propagate_entity_load_avg(struct sched_entity *se)
>  {
>         struct cfs_rq *cfs_rq, *gcfs_rq;
>
>         if (entity_is_task(se))
> -               return 0;
> +               return;
>
>         gcfs_rq = group_cfs_rq(se);
>         if (!gcfs_rq->propagate)
> -               return 0;
> +               return;
>
>         gcfs_rq->propagate = 0;
>
> @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
>
>         trace_pelt_cfs_tp(cfs_rq);
>         trace_pelt_se_tp(se);
> -
> -       return 1;
>  }
>
>  /*
> @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
>
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
>
> -static inline int propagate_entity_load_avg(struct sched_entity *se)
> +static inline void propagate_entity_load_avg(struct sched_entity *se)
>  {
> -       return 0;
>  }
>
>  static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
> @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
>         unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
>         struct sched_avg *sa = &cfs_rq->avg;
> -       int decayed = 0;
> +       int decayed;
>
>         if (cfs_rq->removed.nr) {
>                 unsigned long r;
> @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                  */
>                 add_tg_cfs_propagate(cfs_rq,
>                         -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> -
> -               decayed = 1;
>         }
>
> -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
>         u64_u32_store_copy(sa->last_update_time,
>                            cfs_rq->last_update_time_copy,
>                            sa->last_update_time);
> @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>                 __update_load_avg_se(now, cfs_rq, se);
>
>         decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> -       decayed |= propagate_entity_load_avg(se);
> +       propagate_entity_load_avg(se);

but then you also skip the call to cfs_rq_util_change()
>
>         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
>
> @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>                  * IOW we're enqueueing a task on a new CPU.
>                  */
>                 attach_entity_load_avg(cfs_rq, se);
> -               update_tg_load_avg(cfs_rq);
> -
>         } else if (flags & DO_DETACH) {
>                 /*
>                  * DO_DETACH means we're here from dequeue_entity()
>                  * and we are migrating task out of the CPU.
>                  */
>                 detach_entity_load_avg(cfs_rq, se);
> -               update_tg_load_avg(cfs_rq);
>         } else if (decayed) {
>                 cfs_rq_util_change(cfs_rq, 0);
>
> --
> 2.41.0
>
  
Aaron Lu July 19, 2023, 1:29 p.m. UTC | #8
On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote:
> >
> > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > > Alternatively, I can remove some callsites of update_tg_load_avg() like
> > > you suggested below and only call update_tg_load_avg() when cfs_rq is
> > > decayed(really just decayed, not when it detected it has removed load
> > > pending or load propagated from its children). This way it would give us
> > > similar result as above(roughly once per ms).
> >
> > Something like this: (I think this is better since it removed those
> > unnecessary calls to update_tg_load_avg(), although it is inline but
> > still)
> >
> >
> > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
> > From: Aaron Lu <aaron.lu@intel.com>
> > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> >
> > ---
> >  kernel/sched/fair.c | 22 +++++++---------------
> >  1 file changed, 7 insertions(+), 15 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..7d5b7352b8b5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> >  }
> >
> >  /* Update task and its cfs_rq load average */
> > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> >  {
> >         struct cfs_rq *cfs_rq, *gcfs_rq;
> >
> >         if (entity_is_task(se))
> > -               return 0;
> > +               return;
> >
> >         gcfs_rq = group_cfs_rq(se);
> >         if (!gcfs_rq->propagate)
> > -               return 0;
> > +               return;
> >
> >         gcfs_rq->propagate = 0;
> >
> > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> >
> >         trace_pelt_cfs_tp(cfs_rq);
> >         trace_pelt_se_tp(se);
> > -
> > -       return 1;
> >  }
> >
> >  /*
> > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
> >
> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
> >
> > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> >  {
> > -       return 0;
> >  }
> >
> >  static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
> > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >  {
> >         unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> >         struct sched_avg *sa = &cfs_rq->avg;
> > -       int decayed = 0;
> > +       int decayed;
> >
> >         if (cfs_rq->removed.nr) {
> >                 unsigned long r;
> > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >                  */
> >                 add_tg_cfs_propagate(cfs_rq,
> >                         -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > -
> > -               decayed = 1;
> >         }
> >
> > -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> >         u64_u32_store_copy(sa->last_update_time,
> >                            cfs_rq->last_update_time_copy,
> >                            sa->last_update_time);
> > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >                 __update_load_avg_se(now, cfs_rq, se);
> >
> >         decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> > -       decayed |= propagate_entity_load_avg(se);
> > +       propagate_entity_load_avg(se);
> 
> but then you  also skip the call to cfs_rq_util_change()

Ah right, I missed that, thanks for catching this.

Updated:

From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001
From: Aaron Lu <aaron.lu@intel.com>
Date: Wed, 19 Jul 2023 13:54:48 +0800
Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed

---
 kernel/sched/fair.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..8d4b9e0a19b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
 	unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
-	int decayed = 0;
+	int decayed;
 
 	if (cfs_rq->removed.nr) {
 		unsigned long r;
@@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		 */
 		add_tg_cfs_propagate(cfs_rq,
 			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
-
-		decayed = 1;
 	}
 
-	decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
+	decayed = __update_load_avg_cfs_rq(now, cfs_rq);
 	u64_u32_store_copy(sa->last_update_time,
 			   cfs_rq->last_update_time_copy,
 			   sa->last_update_time);
@@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 now = cfs_rq_clock_pelt(cfs_rq);
-	int decayed;
+	int decayed, propagated;
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		__update_load_avg_se(now, cfs_rq, se);
 
 	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
-	decayed |= propagate_entity_load_avg(se);
+	propagated = propagate_entity_load_avg(se);
 
 	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
 
@@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		 * IOW we're enqueueing a task on a new CPU.
 		 */
 		attach_entity_load_avg(cfs_rq, se);
-		update_tg_load_avg(cfs_rq);
-
 	} else if (flags & DO_DETACH) {
 		/*
 		 * DO_DETACH means we're here from dequeue_entity()
 		 * and we are migrating task out of the CPU.
 		 */
 		detach_entity_load_avg(cfs_rq, se);
-		update_tg_load_avg(cfs_rq);
-	} else if (decayed) {
+	} else if (decayed || propagated) {
 		cfs_rq_util_change(cfs_rq, 0);
 
-		if (flags & UPDATE_TG)
+		if (decayed && (flags & UPDATE_TG))
 			update_tg_load_avg(cfs_rq);
 	}
 }
  
Vincent Guittot July 20, 2023, 1:10 p.m. UTC | #9
On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote:
>
> On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote:
> > >
> > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > > > Alternatively, I can remove some callsites of update_tg_load_avg() like
> > > > you suggested below and only call update_tg_load_avg() when cfs_rq is
> > > > decayed(really just decayed, not when it detected it has removed load
> > > > pending or load propagated from its children). This way it would give us
> > > > similar result as above(roughly once per ms).
> > >
> > > Something like this: (I think this is better since it removed those
> > > unnecessary calls to update_tg_load_avg(), although it is inline but
> > > still)
> > >
> > >
> > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
> > > From: Aaron Lu <aaron.lu@intel.com>
> > > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> > >
> > > ---
> > >  kernel/sched/fair.c | 22 +++++++---------------
> > >  1 file changed, 7 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index a80a73909dc2..7d5b7352b8b5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> > >  }
> > >
> > >  /* Update task and its cfs_rq load average */
> > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > >  {
> > >         struct cfs_rq *cfs_rq, *gcfs_rq;
> > >
> > >         if (entity_is_task(se))
> > > -               return 0;
> > > +               return;
> > >
> > >         gcfs_rq = group_cfs_rq(se);
> > >         if (!gcfs_rq->propagate)
> > > -               return 0;
> > > +               return;
> > >
> > >         gcfs_rq->propagate = 0;
> > >
> > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> > >
> > >         trace_pelt_cfs_tp(cfs_rq);
> > >         trace_pelt_se_tp(se);
> > > -
> > > -       return 1;
> > >  }
> > >
> > >  /*
> > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
> > >
> > >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
> > >
> > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > >  {
> > > -       return 0;
> > >  }
> > >
> > >  static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
> > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > >  {
> > >         unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > >         struct sched_avg *sa = &cfs_rq->avg;
> > > -       int decayed = 0;
> > > +       int decayed;
> > >
> > >         if (cfs_rq->removed.nr) {
> > >                 unsigned long r;
> > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > >                  */
> > >                 add_tg_cfs_propagate(cfs_rq,
> > >                         -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > > -
> > > -               decayed = 1;
> > >         }
> > >
> > > -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > > +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > >         u64_u32_store_copy(sa->last_update_time,
> > >                            cfs_rq->last_update_time_copy,
> > >                            sa->last_update_time);
> > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > >                 __update_load_avg_se(now, cfs_rq, se);
> > >
> > >         decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> > > -       decayed |= propagate_entity_load_avg(se);
> > > +       propagate_entity_load_avg(se);
> >
> > but then you  also skip the call to cfs_rq_util_change()
>
> Ah right, I missed that, thanks for catching this.
>
> Updated:
>
> From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001
> From: Aaron Lu <aaron.lu@intel.com>
> Date: Wed, 19 Jul 2023 13:54:48 +0800
> Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
>
> ---
>  kernel/sched/fair.c | 17 ++++++-----------
>  1 file changed, 6 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..8d4b9e0a19b6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
>         unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
>         struct sched_avg *sa = &cfs_rq->avg;
> -       int decayed = 0;
> +       int decayed;
>
>         if (cfs_rq->removed.nr) {
>                 unsigned long r;
> @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                  */
>                 add_tg_cfs_propagate(cfs_rq,
>                         -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> -
> -               decayed = 1;

We need this to propagate the change to other places like cpufreq

>         }
>
> -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
>         u64_u32_store_copy(sa->last_update_time,
>                            cfs_rq->last_update_time_copy,
>                            sa->last_update_time);
> @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
>         u64 now = cfs_rq_clock_pelt(cfs_rq);
> -       int decayed;
> +       int decayed, propagated;
>
>         /*
>          * Track task load average for carrying it to new CPU after migrated, and
> @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>                 __update_load_avg_se(now, cfs_rq, se);
>
>         decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> -       decayed |= propagate_entity_load_avg(se);
> +       propagated = propagate_entity_load_avg(se);
>
>         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
>
> @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>                  * IOW we're enqueueing a task on a new CPU.
>                  */
>                 attach_entity_load_avg(cfs_rq, se);
> -               update_tg_load_avg(cfs_rq);
> -
>         } else if (flags & DO_DETACH) {
>                 /*
>                  * DO_DETACH means we're here from dequeue_entity()
>                  * and we are migrating task out of the CPU.
>                  */
>                 detach_entity_load_avg(cfs_rq, se);
> -               update_tg_load_avg(cfs_rq);
> -       } else if (decayed) {
> +       } else if (decayed || propagated) {
>                 cfs_rq_util_change(cfs_rq, 0);
>
> -               if (flags & UPDATE_TG)
> +               if (decayed && (flags & UPDATE_TG))

It would be simpler and more readable to clear UPDATE_TG or not set it
from the beginning

IIUC, you rely on the fact that a decay happens every 1024us of
cfs_rq_clock_pelt(), which is scaled by frequency and CPU compute
capacity. So you can end up with a cfs_rq_clock_pelt() that runs far
slower than the real clock, and the 1ms can easily be stretched to
dozens of ms.
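
For reference, a trimmed sketch of the scaling referred to here (from
mainline update_rq_clock_pelt(); treat the details as approximate):

	/* The pelt clock advances slower than the wall clock on small/slow CPUs. */
	delta = cap_scale(delta, arch_scale_cpu_capacity(cpu_of(rq)));
	delta = cap_scale(delta, arch_scale_freq_capacity(cpu_of(rq)));
	rq->clock_pelt += delta;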

>                         update_tg_load_avg(cfs_rq);
>         }
>  }
> --
> 2.41.0
>
>
> > >
> > >         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> > >
> > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > >                  * IOW we're enqueueing a task on a new CPU.
> > >                  */
> > >                 attach_entity_load_avg(cfs_rq, se);
> > > -               update_tg_load_avg(cfs_rq);
> > > -
> > >         } else if (flags & DO_DETACH) {
> > >                 /*
> > >                  * DO_DETACH means we're here from dequeue_entity()
> > >                  * and we are migrating task out of the CPU.
> > >                  */
> > >                 detach_entity_load_avg(cfs_rq, se);
> > > -               update_tg_load_avg(cfs_rq);
> > >         } else if (decayed) {
> > >                 cfs_rq_util_change(cfs_rq, 0);
> > >
> > > --
> > > 2.41.0
> > >
  
Aaron Lu July 20, 2023, 2:42 p.m. UTC | #10
On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote:
> On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote:
> >
> > On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> > > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote:
> > > >
> > > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > > > > Alternatively, I can remove some callsites of update_tg_load_avg() like
> > > > > you suggested below and only call update_tg_load_avg() when cfs_rq is
> > > > > decayed(really just decayed, not when it detected it has removed load
> > > > > pending or load propagated from its children). This way it would give us
> > > > > similar result as above(roughly once per ms).
> > > >
> > > > Something like this: (I think this is better since it removed those
> > > > unnecessary calls to update_tg_load_avg(), although it is inline but
> > > > still)
> > > >
> > > >
> > > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
> > > > From: Aaron Lu <aaron.lu@intel.com>
> > > > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> > > >
> > > > ---
> > > >  kernel/sched/fair.c | 22 +++++++---------------
> > > >  1 file changed, 7 insertions(+), 15 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index a80a73909dc2..7d5b7352b8b5 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> > > >  }
> > > >
> > > >  /* Update task and its cfs_rq load average */
> > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > >  {
> > > >         struct cfs_rq *cfs_rq, *gcfs_rq;
> > > >
> > > >         if (entity_is_task(se))
> > > > -               return 0;
> > > > +               return;
> > > >
> > > >         gcfs_rq = group_cfs_rq(se);
> > > >         if (!gcfs_rq->propagate)
> > > > -               return 0;
> > > > +               return;
> > > >
> > > >         gcfs_rq->propagate = 0;
> > > >
> > > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > >
> > > >         trace_pelt_cfs_tp(cfs_rq);
> > > >         trace_pelt_se_tp(se);
> > > > -
> > > > -       return 1;
> > > >  }
> > > >
> > > >  /*
> > > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
> > > >
> > > >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
> > > >
> > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > >  {
> > > > -       return 0;
> > > >  }
> > > >
> > > >  static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
> > > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > >  {
> > > >         unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > > >         struct sched_avg *sa = &cfs_rq->avg;
> > > > -       int decayed = 0;
> > > > +       int decayed;
> > > >
> > > >         if (cfs_rq->removed.nr) {
> > > >                 unsigned long r;
> > > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > >                  */
> > > >                 add_tg_cfs_propagate(cfs_rq,
> > > >                         -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > > > -
> > > > -               decayed = 1;
> > > >         }
> > > >
> > > > -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > > > +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > > >         u64_u32_store_copy(sa->last_update_time,
> > > >                            cfs_rq->last_update_time_copy,
> > > >                            sa->last_update_time);
> > > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > >                 __update_load_avg_se(now, cfs_rq, se);
> > > >
> > > >         decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> > > > -       decayed |= propagate_entity_load_avg(se);
> > > > +       propagate_entity_load_avg(se);
> > >
> > > but then you  also skip the call to cfs_rq_util_change()
> >
> > Ah right, I missed that, thanks for catching this.
> >
> > Updated:
> >
> > From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001
> > From: Aaron Lu <aaron.lu@intel.com>
> > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> >
> > ---
> >  kernel/sched/fair.c | 17 ++++++-----------
> >  1 file changed, 6 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..8d4b9e0a19b6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >  {
> >         unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> >         struct sched_avg *sa = &cfs_rq->avg;
> > -       int decayed = 0;
> > +       int decayed;
> >
> >         if (cfs_rq->removed.nr) {
> >                 unsigned long r;
> > @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >                  */
> >                 add_tg_cfs_propagate(cfs_rq,
> >                         -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > -
> > -               decayed = 1;
> 
> We need this to propagate the change in other place like cpufreq

Ah, I just made the same mistake again, sorry.

So there are three cases for a cfs_rq: load decayed, load removed and
load propagated. For all three cases, cfs_rq_util_change() needs to be
called, and only for the decayed case does update_tg_load_avg() need to
be called.

I'll update the patch accordingly.
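
A hypothetical shape of that follow-up (not the posted patch; having
update_cfs_rq_load_avg() report removed load separately as 'removed' is
an assumption here):

	decayed    = update_cfs_rq_load_avg(now, cfs_rq);	/* pure PELT decay */
	propagated = propagate_entity_load_avg(se);

	/* ... */
	if (decayed || removed || propagated)
		cfs_rq_util_change(cfs_rq, 0);

	if (decayed && (flags & UPDATE_TG))
		update_tg_load_avg(cfs_rq);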

> >         }
> >
> > -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> >         u64_u32_store_copy(sa->last_update_time,
> >                            cfs_rq->last_update_time_copy,
> >                            sa->last_update_time);
> > @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  {
> >         u64 now = cfs_rq_clock_pelt(cfs_rq);
> > -       int decayed;
> > +       int decayed, propagated;
> >
> >         /*
> >          * Track task load average for carrying it to new CPU after migrated, and
> > @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >                 __update_load_avg_se(now, cfs_rq, se);
> >
> >         decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> > -       decayed |= propagate_entity_load_avg(se);
> > +       propagated = propagate_entity_load_avg(se);
> >
> >         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> >
> > @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >                  * IOW we're enqueueing a task on a new CPU.
> >                  */
> >                 attach_entity_load_avg(cfs_rq, se);
> > -               update_tg_load_avg(cfs_rq);
> > -
> >         } else if (flags & DO_DETACH) {
> >                 /*
> >                  * DO_DETACH means we're here from dequeue_entity()
> >                  * and we are migrating task out of the CPU.
> >                  */
> >                 detach_entity_load_avg(cfs_rq, se);
> > -               update_tg_load_avg(cfs_rq);
> > -       } else if (decayed) {
> > +       } else if (decayed || propagated) {
> >                 cfs_rq_util_change(cfs_rq, 0);
> >
> > -               if (flags & UPDATE_TG)
> > +               if (decayed && (flags & UPDATE_TG))
> 
> It would be simpler and more readable to clear UPDATE_TG or not set it
> from the beginning

Do you mean something like this?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8d4b9e0a19b6..084d63371355 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4249,7 +4249,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
 		__update_load_avg_se(now, cfs_rq, se);
 
-	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
+	decayed  = update_cfs_rq_load_avg(now, cfs_rq) && (flags & UPDATE_TG);
 	propagated = propagate_entity_load_avg(se);
 
 	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
@@ -4271,7 +4271,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	} else if (decayed || propagated) {
 		cfs_rq_util_change(cfs_rq, 0);
 
-		if (decayed && (flags & UPDATE_TG))
+		if (decayed)
 			update_tg_load_avg(cfs_rq);
 	}
 }

> IIUC, you rely on the fact that a decay happens every 1024 us of the
> cfs_rq_clock_pelt() which is scaled by frequency and cpu compute
> capacity. So you can end up with a cfs_rq_clock_pelt() that is far
> slower than real clock and the 1ms can easily be extended to dozens of
> ms

Thanks for the info. I'm not familiar with this clock scale part and will
need to take a closer look.

As you know, the intent is to make the unbound updates to tg->load_avg
become bounded for those wakeup migration heavy workloads, and the way
this change achieves that is to remove the update to tg->load_avg on the
attach and detach paths and only leave the update on the load decay path.
And if the current implementation makes the load decay period longer than
1024us, that shouldn't be a problem for this change. I don't see an
immediate problem if the update to tg->load_avg happens less often than
once per ms, but please let me know if I missed something, thanks.

> 
> >                         update_tg_load_avg(cfs_rq);
> >         }
> >  }
> > --
> > 2.41.0
> >
> >
> > > >
> > > >         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> > > >
> > > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > >                  * IOW we're enqueueing a task on a new CPU.
> > > >                  */
> > > >                 attach_entity_load_avg(cfs_rq, se);
> > > > -               update_tg_load_avg(cfs_rq);
> > > > -
> > > >         } else if (flags & DO_DETACH) {
> > > >                 /*
> > > >                  * DO_DETACH means we're here from dequeue_entity()
> > > >                  * and we are migrating task out of the CPU.
> > > >                  */
> > > >                 detach_entity_load_avg(cfs_rq, se);
> > > > -               update_tg_load_avg(cfs_rq);
> > > >         } else if (decayed) {
> > > >                 cfs_rq_util_change(cfs_rq, 0);
> > > >
> > > > --
> > > > 2.41.0
> > > >
  
Vincent Guittot July 20, 2023, 3:02 p.m. UTC | #11
On Thu, 20 Jul 2023 at 16:42, Aaron Lu <aaron.lu@intel.com> wrote:
>
> On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote:
> > On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote:
> > >
> > > On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> > > > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote:
> > > > >
> > > > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > > > > > Alternatively, I can remove some callsites of update_tg_load_avg() like
> > > > > > you suggested below and only call update_tg_load_avg() when cfs_rq is
> > > > > > decayed(really just decayed, not when it detected it has removed load
> > > > > > pending or load propagated from its children). This way it would give us
> > > > > > similar result as above(roughly once per ms).
> > > > >
> > > > > Something like this: (I think this is better since it removed those
> > > > > unnecessary calls to update_tg_load_avg(), although it is inline but
> > > > > still)
> > > > >
> > > > >
> > > > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
> > > > > From: Aaron Lu <aaron.lu@intel.com>
> > > > > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > > > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> > > > >
> > > > > ---
> > > > >  kernel/sched/fair.c | 22 +++++++---------------
> > > > >  1 file changed, 7 insertions(+), 15 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index a80a73909dc2..7d5b7352b8b5 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> > > > >  }
> > > > >
> > > > >  /* Update task and its cfs_rq load average */
> > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > > >  {
> > > > >         struct cfs_rq *cfs_rq, *gcfs_rq;
> > > > >
> > > > >         if (entity_is_task(se))
> > > > > -               return 0;
> > > > > +               return;
> > > > >
> > > > >         gcfs_rq = group_cfs_rq(se);
> > > > >         if (!gcfs_rq->propagate)
> > > > > -               return 0;
> > > > > +               return;
> > > > >
> > > > >         gcfs_rq->propagate = 0;
> > > > >
> > > > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > >
> > > > >         trace_pelt_cfs_tp(cfs_rq);
> > > > >         trace_pelt_se_tp(se);
> > > > > -
> > > > > -       return 1;
> > > > >  }
> > > > >
> > > > >  /*
> > > > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
> > > > >
> > > > >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
> > > > >
> > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > > >  {
> > > > > -       return 0;
> > > > >  }
> > > > >
> > > > >  static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
> > > > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > > >  {
> > > > >         unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > > > >         struct sched_avg *sa = &cfs_rq->avg;
> > > > > -       int decayed = 0;
> > > > > +       int decayed;
> > > > >
> > > > >         if (cfs_rq->removed.nr) {
> > > > >                 unsigned long r;
> > > > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > > >                  */
> > > > >                 add_tg_cfs_propagate(cfs_rq,
> > > > >                         -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > > > > -
> > > > > -               decayed = 1;
> > > > >         }
> > > > >
> > > > > -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > > > > +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > > > >         u64_u32_store_copy(sa->last_update_time,
> > > > >                            cfs_rq->last_update_time_copy,
> > > > >                            sa->last_update_time);
> > > > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > > >                 __update_load_avg_se(now, cfs_rq, se);
> > > > >
> > > > >         decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> > > > > -       decayed |= propagate_entity_load_avg(se);
> > > > > +       propagate_entity_load_avg(se);
> > > >
> > > > but then you  also skip the call to cfs_rq_util_change()
> > >
> > > Ah right, I missed that, thanks for catching this.
> > >
> > > Updated:
> > >
> > > From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001
> > > From: Aaron Lu <aaron.lu@intel.com>
> > > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> > >
> > > ---
> > >  kernel/sched/fair.c | 17 ++++++-----------
> > >  1 file changed, 6 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index a80a73909dc2..8d4b9e0a19b6 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > >  {
> > >         unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > >         struct sched_avg *sa = &cfs_rq->avg;
> > > -       int decayed = 0;
> > > +       int decayed;
> > >
> > >         if (cfs_rq->removed.nr) {
> > >                 unsigned long r;
> > > @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > >                  */
> > >                 add_tg_cfs_propagate(cfs_rq,
> > >                         -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > > -
> > > -               decayed = 1;
> >
> > We need this to propagate the change in other place like cpufreq
>
> Ah, I just made the same mistake again, sorry.
>
> So there are three cases for a cfs_rq: load decayed, load removed and
> load propagated. cfs_rq_util_change() needs to be called in all three
> cases, while update_tg_load_avg() only needs to be called when the load
> actually decayed.
>
> I'll update the patch accordingly.
>
> > >         }
> > >
> > > -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > > +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > >         u64_u32_store_copy(sa->last_update_time,
> > >                            cfs_rq->last_update_time_copy,
> > >                            sa->last_update_time);
> > > @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > >  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > >  {
> > >         u64 now = cfs_rq_clock_pelt(cfs_rq);
> > > -       int decayed;
> > > +       int decayed, propagated;
> > >
> > >         /*
> > >          * Track task load average for carrying it to new CPU after migrated, and
> > > @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > >                 __update_load_avg_se(now, cfs_rq, se);
> > >
> > >         decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> > > -       decayed |= propagate_entity_load_avg(se);
> > > +       propagated = propagate_entity_load_avg(se);
> > >
> > >         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> > >
> > > @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > >                  * IOW we're enqueueing a task on a new CPU.
> > >                  */
> > >                 attach_entity_load_avg(cfs_rq, se);
> > > -               update_tg_load_avg(cfs_rq);
> > > -
> > >         } else if (flags & DO_DETACH) {
> > >                 /*
> > >                  * DO_DETACH means we're here from dequeue_entity()
> > >                  * and we are migrating task out of the CPU.
> > >                  */
> > >                 detach_entity_load_avg(cfs_rq, se);
> > > -               update_tg_load_avg(cfs_rq);
> > > -       } else if (decayed) {
> > > +       } else if (decayed || propagated) {
> > >                 cfs_rq_util_change(cfs_rq, 0);
> > >
> > > -               if (flags & UPDATE_TG)
> > > +               if (decayed && (flags & UPDATE_TG))
> >
> > It would be simpler and more readable to clear UPDATE_TG or not set it
> > from the beginning
>
> Do you mean something like this?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8d4b9e0a19b6..084d63371355 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4249,7 +4249,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>         if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
>                 __update_load_avg_se(now, cfs_rq, se);
>
> -       decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> +       decayed  = update_cfs_rq_load_avg(now, cfs_rq) && (flags & UPDATE_TG);
>         propagated = propagate_entity_load_avg(se);
>
>         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> @@ -4271,7 +4271,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>         } else if (decayed || propagated) {
>                 cfs_rq_util_change(cfs_rq, 0);
>
> -               if (decayed && (flags & UPDATE_TG))
> +               if (decayed)
>                         update_tg_load_avg(cfs_rq);
>         }
>  }
>
> > IIUC, you rely on the fact that a decay happens every 1024 us of the
> > cfs_rq_clock_pelt() which is scaled by frequency and cpu compute
> > capacity. So you can end up with a cfs_rq_clock_pelt() that is far
> > slower than real clock and the 1ms can easily be extended to dozens of
> > ms
>
> Thanks for the info. I'm not familiar with this clock scale part and will
> need to take a closer look.
>
> As you know, the intent is to make the unbound updates to tg->load_avg
> become bounded for those wakeup migration heavy workloads, and the way
> this change achieves that is to remove the update to tg->load_avg on the
> attach and detach paths and only leave the update on the load decay path.
> And if the current implementation makes the load decay period longer than
> 1024us, that shouldn't be a problem for this change. I don't see an
> immediate problem if the update to tg->load_avg happens less often than
> once per ms, but please let me know if I missed something, thanks.

What was wrong with your proposal to limit the update inside
update_tg_load_avg()? Except maybe s/1000000/NSEC_PER_MSEC/ and
computing the delta after testing the time since the last update:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..e48fd0e6982d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
        long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+       u64 now = cfs_rq_clock_pelt(cfs_rq);

        /*
         * No need to update load_avg for root_task_group as it is not used.
@@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
        if (cfs_rq->tg == &root_task_group)
                return;

-       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
+       if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
+           abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
                atomic_long_add(delta, &cfs_rq->tg->load_avg);
                cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+               cfs_rq->last_update_tg_load_avg = now;
        }
 }


>
> >
> > >                         update_tg_load_avg(cfs_rq);
> > >         }
> > >  }
> > > --
> > > 2.41.0
> > >
> > >
> > > > >
> > > > >         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> > > > >
> > > > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > > >                  * IOW we're enqueueing a task on a new CPU.
> > > > >                  */
> > > > >                 attach_entity_load_avg(cfs_rq, se);
> > > > > -               update_tg_load_avg(cfs_rq);
> > > > > -
> > > > >         } else if (flags & DO_DETACH) {
> > > > >                 /*
> > > > >                  * DO_DETACH means we're here from dequeue_entity()
> > > > >                  * and we are migrating task out of the CPU.
> > > > >                  */
> > > > >                 detach_entity_load_avg(cfs_rq, se);
> > > > > -               update_tg_load_avg(cfs_rq);
> > > > >         } else if (decayed) {
> > > > >                 cfs_rq_util_change(cfs_rq, 0);
> > > > >
> > > > > --
> > > > > 2.41.0
> > > > >
  
Vincent Guittot July 20, 2023, 3:04 p.m. UTC | #12
On Thu, 20 Jul 2023 at 16:42, Aaron Lu <aaron.lu@intel.com> wrote:
>
> On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote:
> > On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote:
> > >
> > > On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> > > > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote:
> > > > >
> > > > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > > > > > Alternatively, I can remove some callsites of update_tg_load_avg() like
> > > > > > you suggested below and only call update_tg_load_avg() when cfs_rq is
> > > > > > decayed(really just decayed, not when it detected it has removed load
> > > > > > pending or load propagated from its children). This way it would give us
> > > > > > similar result as above(roughly once per ms).
> > > > >
> > > > > Something like this: (I think this is better since it removed those
> > > > > unnecessary calls to update_tg_load_avg(), although it is inline but
> > > > > still)
> > > > >
> > > > >
> > > > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
> > > > > From: Aaron Lu <aaron.lu@intel.com>
> > > > > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > > > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> > > > >
> > > > > ---
> > > > >  kernel/sched/fair.c | 22 +++++++---------------
> > > > >  1 file changed, 7 insertions(+), 15 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index a80a73909dc2..7d5b7352b8b5 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> > > > >  }
> > > > >
> > > > >  /* Update task and its cfs_rq load average */
> > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > > >  {
> > > > >         struct cfs_rq *cfs_rq, *gcfs_rq;
> > > > >
> > > > >         if (entity_is_task(se))
> > > > > -               return 0;
> > > > > +               return;
> > > > >
> > > > >         gcfs_rq = group_cfs_rq(se);
> > > > >         if (!gcfs_rq->propagate)
> > > > > -               return 0;
> > > > > +               return;
> > > > >
> > > > >         gcfs_rq->propagate = 0;
> > > > >
> > > > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > >
> > > > >         trace_pelt_cfs_tp(cfs_rq);
> > > > >         trace_pelt_se_tp(se);
> > > > > -
> > > > > -       return 1;
> > > > >  }
> > > > >
> > > > >  /*
> > > > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
> > > > >
> > > > >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
> > > > >
> > > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > > >  {
> > > > > -       return 0;
> > > > >  }
> > > > >
> > > > >  static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
> > > > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > > >  {
> > > > >         unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > > > >         struct sched_avg *sa = &cfs_rq->avg;
> > > > > -       int decayed = 0;
> > > > > +       int decayed;
> > > > >
> > > > >         if (cfs_rq->removed.nr) {
> > > > >                 unsigned long r;
> > > > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > > >                  */
> > > > >                 add_tg_cfs_propagate(cfs_rq,
> > > > >                         -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > > > > -
> > > > > -               decayed = 1;
> > > > >         }
> > > > >
> > > > > -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > > > > +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > > > >         u64_u32_store_copy(sa->last_update_time,
> > > > >                            cfs_rq->last_update_time_copy,
> > > > >                            sa->last_update_time);
> > > > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > > >                 __update_load_avg_se(now, cfs_rq, se);
> > > > >
> > > > >         decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> > > > > -       decayed |= propagate_entity_load_avg(se);
> > > > > +       propagate_entity_load_avg(se);
> > > >
> > > > but then you  also skip the call to cfs_rq_util_change()
> > >
> > > Ah right, I missed that, thanks for catching this.
> > >
> > > Updated:
> > >
> > > From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001
> > > From: Aaron Lu <aaron.lu@intel.com>
> > > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> > >
> > > ---
> > >  kernel/sched/fair.c | 17 ++++++-----------
> > >  1 file changed, 6 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index a80a73909dc2..8d4b9e0a19b6 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > >  {
> > >         unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > >         struct sched_avg *sa = &cfs_rq->avg;
> > > -       int decayed = 0;
> > > +       int decayed;
> > >
> > >         if (cfs_rq->removed.nr) {
> > >                 unsigned long r;
> > > @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > >                  */
> > >                 add_tg_cfs_propagate(cfs_rq,
> > >                         -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > > -
> > > -               decayed = 1;
> >
> > We need this to propagate the change in other place like cpufreq
>
> Ah, I just made the same mistake again, sorry.
>
> So there are three cases for a cfs_rq: load decayed, load removed and
> load propagated. cfs_rq_util_change() needs to be called in all three
> cases, while update_tg_load_avg() only needs to be called when the load
> actually decayed.
>
> I'll update the patch accordingly.
>
> > >         }
> > >
> > > -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > > +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > >         u64_u32_store_copy(sa->last_update_time,
> > >                            cfs_rq->last_update_time_copy,
> > >                            sa->last_update_time);
> > > @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > >  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > >  {
> > >         u64 now = cfs_rq_clock_pelt(cfs_rq);
> > > -       int decayed;
> > > +       int decayed, propagated;
> > >
> > >         /*
> > >          * Track task load average for carrying it to new CPU after migrated, and
> > > @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > >                 __update_load_avg_se(now, cfs_rq, se);
> > >
> > >         decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> > > -       decayed |= propagate_entity_load_avg(se);
> > > +       propagated = propagate_entity_load_avg(se);
> > >
> > >         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> > >
> > > @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > >                  * IOW we're enqueueing a task on a new CPU.
> > >                  */
> > >                 attach_entity_load_avg(cfs_rq, se);
> > > -               update_tg_load_avg(cfs_rq);
> > > -
> > >         } else if (flags & DO_DETACH) {
> > >                 /*
> > >                  * DO_DETACH means we're here from dequeue_entity()
> > >                  * and we are migrating task out of the CPU.
> > >                  */
> > >                 detach_entity_load_avg(cfs_rq, se);
> > > -               update_tg_load_avg(cfs_rq);
> > > -       } else if (decayed) {
> > > +       } else if (decayed || propagated) {
> > >                 cfs_rq_util_change(cfs_rq, 0);
> > >
> > > -               if (flags & UPDATE_TG)
> > > +               if (decayed && (flags & UPDATE_TG))



> >
> > It would be simpler and more readable to clear UPDATE_TG or not set it
> > from the beginning
>
> Do you mean something like this?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8d4b9e0a19b6..084d63371355 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4249,7 +4249,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>         if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
>                 __update_load_avg_se(now, cfs_rq, se);
>
> -       decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> +       decayed  = update_cfs_rq_load_avg(now, cfs_rq) && (flags & UPDATE_TG);
>         propagated = propagate_entity_load_avg(se);
>
>         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> @@ -4271,7 +4271,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>         } else if (decayed || propagated) {
>                 cfs_rq_util_change(cfs_rq, 0);
>
> -               if (decayed && (flags & UPDATE_TG))
> +               if (decayed)
>                         update_tg_load_avg(cfs_rq);
>         }

something like below:

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4252,6 +4252,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
                __update_load_avg_se(now, cfs_rq, se);

        decayed  = update_cfs_rq_load_avg(now, cfs_rq);
+       if (!decayed)
+               flags &= ~UPDATE_TG;
        decayed |= propagate_entity_load_avg(se);

        if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
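
With that, the tail of update_load_avg() can then stay as it is today:

	} else if (decayed) {
		cfs_rq_util_change(cfs_rq, 0);

		if (flags & UPDATE_TG)
			update_tg_load_avg(cfs_rq);
	}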


>  }
>
> > IIUC, you rely on the fact that a decay happens every 1024 us of the
> > cfs_rq_clock_pelt() which is scaled by frequency and cpu compute
> > capacity. So you can end up with a cfs_rq_clock_pelt() that is far
> > slower than real clock and the 1ms can easily be extended to dozens of
> > ms
>
> Thanks for the info. I'm not familiar with this clock scale part and will
> need to take a closer look.
>
> As you know, the intent is to make the unbound updates to tg->load_avg
> become bounded for those wakeup migration heavy workloads, and the way
> this change achieves that is to remove the update to tg->load_avg on the
> attach and detach paths and only leave the update on the load decay path.
> And if the current implementation makes the load decay period longer than
> 1024us, that shouldn't be a problem for this change. I don't see an
> immediate problem if the update to tg->load_avg happens less often than
> once per ms, but please let me know if I missed something, thanks.
>
> >
> > >                         update_tg_load_avg(cfs_rq);
> > >         }
> > >  }
> > > --
> > > 2.41.0
> > >
> > >
> > > > >
> > > > >         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> > > > >
> > > > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > > >                  * IOW we're enqueueing a task on a new CPU.
> > > > >                  */
> > > > >                 attach_entity_load_avg(cfs_rq, se);
> > > > > -               update_tg_load_avg(cfs_rq);
> > > > > -
> > > > >         } else if (flags & DO_DETACH) {
> > > > >                 /*
> > > > >                  * DO_DETACH means we're here from dequeue_entity()
> > > > >                  * and we are migrating task out of the CPU.
> > > > >                  */
> > > > >                 detach_entity_load_avg(cfs_rq, se);
> > > > > -               update_tg_load_avg(cfs_rq);
> > > > >         } else if (decayed) {
> > > > >                 cfs_rq_util_change(cfs_rq, 0);
> > > > >
> > > > > --
> > > > > 2.41.0
> > > > >
  
Dietmar Eggemann July 20, 2023, 3:22 p.m. UTC | #13
On 20/07/2023 17:02, Vincent Guittot wrote:
> On Thu, 20 Jul 2023 at 16:42, Aaron Lu <aaron.lu@intel.com> wrote:
>>
>> On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote:
>>> On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote:
>>>>
>>>> On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
>>>>> On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote:
>>>>>>
>>>>>> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:

[...]

> What was wrong with your proposal to limit the update inside
> update_tg_load_avg()  ? except maybe s/1000000/NSEC_PER_MSEC/ and
> computing delta after testing the time since last update
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..e48fd0e6982d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct
> cfs_rq *cfs_rq)
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  {
>         long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> +       u64 now = cfs_rq_clock_pelt(cfs_rq);

Could this be `u64 now = sched_clock_cpu()` like in
migrate_se_pelt_lag() or newidle_balance() to avoid the time morphing
due to PELT's frequency and uArch invariance?
> 
>         /*
>          * No need to update load_avg for root_task_group as it is not used.

[...]
  
Vincent Guittot July 20, 2023, 3:24 p.m. UTC | #14
On Thu, 20 Jul 2023 at 17:22, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 20/07/2023 17:02, Vincent Guittot wrote:
> > On Thu, 20 Jul 2023 at 16:42, Aaron Lu <aaron.lu@intel.com> wrote:
> >>
> >> On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote:
> >>> On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote:
> >>>>
> >>>> On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> >>>>> On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote:
> >>>>>>
> >>>>>> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
>
> [...]
>
> > What was wrong with your proposal to limit the update inside
> > update_tg_load_avg()  ? except maybe s/1000000/NSEC_PER_MSEC/ and
> > computing delta after testing the time since last update
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..e48fd0e6982d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct
> > cfs_rq *cfs_rq)
> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >  {
> >         long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > +       u64 now = cfs_rq_clock_pelt(cfs_rq);
>
> Could this be `u64 now = sched_clock_cpu()` like in
> migrate_se_pelt_lag() or newidle_balance() to avoid the time morphing
> due to PELT's frequency and uArch invariance?

yes that's a good point. I missed this

> >
> >         /*
> >          * No need to update load_avg for root_task_group as it is not used.
>
> [...]
>
  
Aaron Lu July 21, 2023, 1:57 a.m. UTC | #15
On Thu, Jul 20, 2023 at 05:02:32PM +0200, Vincent Guittot wrote:
>
> What was wrong with your proposal to limit the update inside
> update_tg_load_avg()  ? except maybe s/1000000/NSEC_PER_MSEC/ and
> computing delta after testing the time since last update

I was thinking it might be better to align update_tg_load_avg() with the
cfs_rq's decay point, but that's just my feeling.

Absolutely nothing wrong with the approach below; it's straightforward
and effective. I'll fix the use of cfs_rq_clock_pelt(), collect some
data and then send out v2.

Thank you Vincent for all your comments, they're very useful to me.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..e48fd0e6982d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct
> cfs_rq *cfs_rq)
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  {
>         long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> +       u64 now = cfs_rq_clock_pelt(cfs_rq);
> 
>         /*
>          * No need to update load_avg for root_task_group as it is not used.
> @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct
> cfs_rq *cfs_rq)
>         if (cfs_rq->tg == &root_task_group)
>                 return;
> 
> -       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> +       if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
> +           abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>                 atomic_long_add(delta, &cfs_rq->tg->load_avg);
>                 cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> +               cfs_rq->last_update_tg_load_avg = now;
>         }
>  }
  
Aaron Lu July 21, 2023, 6:42 a.m. UTC | #16
On Thu, Jul 20, 2023 at 05:22:26PM +0200, Dietmar Eggemann wrote:
> On 20/07/2023 17:02, Vincent Guittot wrote:
> > On Thu, 20 Jul 2023 at 16:42, Aaron Lu <aaron.lu@intel.com> wrote:
> >>
> >> On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote:
> >>> On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote:
> >>>>
> >>>> On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> >>>>> On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote:
> >>>>>>
> >>>>>> On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> 
> [...]
> 
> > What was wrong with your proposal to limit the update inside
> > update_tg_load_avg()  ? except maybe s/1000000/NSEC_PER_MSEC/ and
> > computing delta after testing the time since last update
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..e48fd0e6982d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct
> > cfs_rq *cfs_rq)
> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >  {
> >         long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > +       u64 now = cfs_rq_clock_pelt(cfs_rq);
> 
> Could this be `u64 now = sched_clock_cpu()` like in
> migrate_se_pelt_lag() or newidle_balance() to avoid the time morphing
> due to PELT's frequency and uArch invariance?

Yes, will use sched_clock_cpu() instead of cfs_rq_clock_pelt().
Thanks.
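
Something like below is what I have in mind for v2 (an untested sketch,
on top of Vincent's diff that adds cfs_rq->last_update_tg_load_avg):

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
	long delta;
	u64 now;

	/*
	 * No need to update load_avg for root_task_group as it is not used.
	 */
	if (cfs_rq->tg == &root_task_group)
		return;

	/*
	 * Rate limit updates to once per ms to reduce contention on
	 * tg->load_avg for migration heavy workloads.
	 */
	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return;

	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
		cfs_rq->last_update_tg_load_avg = now;
	}
}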

> > 
> >         /*
> >          * No need to update load_avg for root_task_group as it is not used.
> 
> [...]
>
  
Aaron Lu Aug. 11, 2023, 9:28 a.m. UTC | #17
On Fri, Jul 21, 2023 at 09:57:04AM +0800, Aaron Lu wrote:
> On Thu, Jul 20, 2023 at 05:02:32PM +0200, Vincent Guittot wrote:
> >
> > What was wrong with your proposal to limit the update inside
> > update_tg_load_avg()  ? except maybe s/1000000/NSEC_PER_MSEC/ and
> > computing delta after testing the time since last update
> 
> I was thinking it might be better to align the update_tg_load_avg() with
> cfs_rq's decay point but that's just my feeling.
> 
> Absolutely nothing wrong with the below approach, it's straightforward
> and effective. I'll fix the use of cfs_rq_clock_pelt() and collect
> some data and then send out v2.
> 
> Thank you Vincent for all your comments, they're very useful to me.
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..e48fd0e6982d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct
> > cfs_rq *cfs_rq)
> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >  {
> >         long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > +       u64 now = cfs_rq_clock_pelt(cfs_rq);
> > 
> >         /*
> >          * No need to update load_avg for root_task_group as it is not used.
> > @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct
> > cfs_rq *cfs_rq)
> >         if (cfs_rq->tg == &root_task_group)
> >                 return;
> > 
> > -       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > +       if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
> > +           abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> >                 atomic_long_add(delta, &cfs_rq->tg->load_avg);
> >                 cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> > +               cfs_rq->last_update_tg_load_avg = now;
> >         }
> >  }

While collecting data for v2, I found that with this "rate limit updates
to tg->load_avg to at most once per ms" change in place, the other two
optimizations, "skip some update_cfs_group() on en/dequeue_entity()" and
"Make tg->load_avg per node", don't matter anymore, i.e. as long as the
"ratelimit" patch is there, adding those two optimizations doesn't
improve performance further.

I think this is reasonable: since the "ratelimit" patch reduces updates
to tg->load_avg so much (from some 10 millions during a 5s window to
~25k), the overhead of accessing tg->load_avg drops greatly, and further
improving it by reducing the read side calls or making the counter
per-node doesn't matter anymore.

So I think for v2, I'll go with this "ratelimit" patch alone, but let me
know if you think otherwise.

The tests I've run are listed below.
1) postgres_sysbench with nr_thread=25%/50%/75%/100% on a 2 sockets, 112
cores, 224 threads Intel Sapphire Rapids(SPR);
2) hackbench with group=1/4/8/16, pipe, thread mode on a 2 sockets, 64
cores, 128 threads Intel Icelake(ICL) and another 2 sockets, 112 cores,
224 threads Intel SPR;
3) netperf with tcp_stream and udp_rr modes. For each mode, nr_client is
set to 25%/50%/75%/100%. Tested on the same ICL and SPR as in 2).

The test summary is:

postgres_sysbench on SPR
- when nr_thread=25% and 50%, results are in the noise range;
- when nr_thread=75% and 100%, the "ratelimit" patch improves transactions
  by 12.2% and 21.2% respectively.

hackbench:
- when nr_group=1 and 4, results are in the noise range for both ICL and SPR;
- when nr_group=8 and 16, the "ratelimit" patch improves throughput by
  6.6% and 22.4% on ICL; for SPR, the improvement is 12.5% and 48.8%.

netperf:
- for the tcp_stream mode test, all results are in the noise range for ICL
  and SPR;
- for the udp_rr mode test, when nr_client=25% and 50% of nr_cpu, results
  are in the noise range; when nr_client=75% and 100% of nr_cpu, the
  "ratelimit" patch improves throughput by 38.5% and 13.4% on ICL; on SPR,
  the improvement is 162% and 189%.

Full test results are listed below.

base: peterz's sched/core branch with commit e8d5ff8258bf7 ("sched/fair:
Block nohz tick_stop when cfs bandwidth in use") as head.
rate: the patch that rate limits updates to tg->load_avg to at most once
per ms; applied on top of "base".
skip: the patch "sched/fair: skip some update_cfs_group() on
en/dequeue_entity()" as in this v1 series; applied on top of "rate".
node: the patch "sched/fair: Make tg->load_avg per node" as in this v1
series + peterz's numa optimizations; applied on top of "skip".

All improvement percentages are against "base".

postgres_sysbench on SPR:

25%  (all in noise range)
base:   42382±19.8%
rate:   50174±9.5%
skip:   46311±12.0%
node:   48690±13.9%

50%  (all in noise range)
base:   67626±1.3%
rate:   67365±3.1%
skip:   68218±4.9%
node:   66878±3.3%

75%
base:  100216±1.2%
rate:  112470±0.1% +12.2%
skip:  113345±0.3% +13.1%
node:  112962±0.1% +12.7%

100%
base:   93671±0.4%
rate:  113563±0.2% +21.2%
skip:  113741±0.2% +21.4%
node:  112793±0.2% +20.4%

hackbench on ICL:
group=1 (all in noise range)
base:   114912±5.2%
rate:   117857±2.5%
skip:   118501±4.3%
node:   119240±7.0%

group=4 (all in noise range)
base:   359902±1.6%
rate:   361685±2.7%
skip:   356354±1.6%
node:   358744±1.7%

group=8
base:   461070±0.8%
rate:   491713±0.3% +6.6%
skip:   490183±1.4% +6.3%
node:   490864±0.5% +6.5%

group=16
base:   309032±5.0%
rate:   378337±1.3% +22.4%
skip:   385637±1.4% +24.8%
node:   375087±0.9% +21.4%

hackbench on SPR:
group=1 (all in noise range)
base:   100768±2.9%
rate:   103134±2.9%
skip:    99642±1.7%
node:   104251±2.5%

group=4 (all in noise range)
base:   413830±12.5%
rate:   378660±16.6%
skip:   367234±13.1%
node:   372122±14.4%

group=8
base:   436124±0.6%
rate:   490787±3.2% +12.5%
skip:   493097±1.0% +13.0%
node:   497336±1.8% +14.0%

group=16
base:   457730±3.2%
rate:   680452±1.3% +48.8%
skip:   679271±0.9% +48.4%
node:   672365±0.5% +46.9%

netperf/udp_rr on ICL
25%
base:   114413±0.1%
rate:   115111±0.0%  +0.6%
skip:   115546±0.2%  +1.0%
node:   115214±0.3%  +0.7%

50%  (all in noise range)
base:   86803±0.5%
rate:   86611±0.0%
skip:   86986±0.5%
node:   87466±0.8%

75%
base:   35959±5.3%
rate:   49801±0.6% +38.5%
skip:   50745±1.5% +41.1%
node:   50457±1.5% +40.3%

100%
base:   61951±6.4%
rate:   70224±0.8% +13.4%
skip:   67321±4.3%  +8.7%
node:   70591±2.3% +13.9%

netperf/udp_rr on SPR
25%  (all in noise range)
base:   104954±1.3%
rate:   107312±2.8%
skip:   107688±1.7%
node:   106024±1.8%

50%  (all in noise range)
base:   55394±4.6%
rate:   54940±7.4%
skip:   57829±1.8%
node:   57144±6.4%

75%
base:   13779±3.1%
rate:   36105±1.1% +162%
skip:   35319±6.3% +156%
node:   36764±3.5% +167%

100%
base:   9703±3.7%
rate:  28011±0.2% +189%
skip:  27087±1.3% +179%
node:  27337±2.2% +182%

netperf/tcp_stream on ICL
25%  (all in noise range)
base:   43092±0.1%
rate:   42891±0.5%
skip:   43208±0.4%
node:   43044±0.6%

50%  (all in noise range)
base:   19278±14.9%
rate:   22369±7.2%
skip:   19375±16.0%
node:   21312±8.2%

75%  (all in noise range)
base:   16822±3.0%
rate:   17086±2.3%
skip:   17571±1.8%
node:   16801±2.4%

100%  (all in noise range)
base:   18216±0.6%
rate:   18078±2.9%
skip:   18040±1.0%
node:   18064±1.6%

netperf/tcp_stream on SPR
25%  (all in noise range)
base:   34491±0.3%
rate:   34886±0.5%
skip:   34894±0.9%
node:   34692±0.6%

50%  (all in noise range)
base:   19278±14.9%
rate:   22369±7.2%
skip:   19375±16.0%
node:   21312±8.2%

75%  (all in noise range)
base:   16822±3.0%
rate:   17086±2.3%
skip:   17571±1.8%
node:   16801±2.4%

100%  (all in noise range)
base:   18216±0.6%
rate:   18078±2.9%
skip:   18040±0.1%
node:   18064±1.6%

Thanks,
Aaron
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aceb8f5922cb..564ffe3e59c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3645,6 +3645,9 @@  static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;
 
+	if (cfs_rq->prop_removed_sum)
+		return false;
+
 	return true;
 }
 
@@ -3911,6 +3914,11 @@  static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
 {
 	cfs_rq->propagate = 1;
 	cfs_rq->prop_runnable_sum += runnable_sum;
+
+	if (cfs_rq->prop_removed_sum) {
+		cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum;
+		cfs_rq->prop_removed_sum = 0;
+	}
 }
 
 /* Update task and its cfs_rq load average */
@@ -4133,13 +4141,11 @@  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		 * removed_runnable is the unweighted version of removed_load so we
 		 * can use it to estimate removed_load_sum.
 		 */
-		add_tg_cfs_propagate(cfs_rq,
-			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
-
-		decayed = 1;
+		cfs_rq->prop_removed_sum +=
+			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT;
 	}
 
-	decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
+	decayed = __update_load_avg_cfs_rq(now, cfs_rq);
 	u64_u32_store_copy(sa->last_update_time,
 			   cfs_rq->last_update_time_copy,
 			   sa->last_update_time);
@@ -9001,6 +9007,13 @@  static bool __update_blocked_fair(struct rq *rq, bool *done)
 
 			if (cfs_rq == &rq->cfs)
 				decayed = true;
+
+			/*
+			 * If the aggregated removed_sum hasn't been taken care of,
+			 * deal with it now before this cfs_rq is removed from the list.
+			 */
+			if (cfs_rq->prop_removed_sum)
+				add_tg_cfs_propagate(cfs_rq, 0);
 		}
 
 		/* Propagate pending load changes to the parent, if any: */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9cece2dbc95b..ab540b21d071 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -619,6 +619,7 @@  struct cfs_rq {
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
+	long			prop_removed_sum;
 
 	/*
 	 *   h_load = weight * f(tg)