sched/fair: Always update_curr() before placing at enqueue

Message ID: 20231006164826.335223-1-daniel.m.jordan@oracle.com
State: New
Series: sched/fair: Always update_curr() before placing at enqueue

Commit Message

Daniel Jordan Oct. 6, 2023, 4:48 p.m. UTC
Placing wants current's vruntime and the cfs_rq's min_vruntime up to
date so that avg_runtime() is too.  It similarly wants the entity
re-weighted and its lag adjusted so that vslice and vlag are fresh.
So, always do update_curr() and update_cfs_group() beforehand.

There doesn't seem to be a reason to treat the 'curr' case specially
after e8f331bcc270 since vruntime doesn't get normalized anymore.

Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---

Not sure what the XXX above place_entity() is for, maybe it can go away?

Based on tip/sched/core.

 kernel/sched/fair.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)
  

Comments

K Prateek Nayak Oct. 16, 2023, 5:39 a.m. UTC | #1
Hello Daniel,

I see a good and consistent improvement in Stream (with shorter loops)
with this change. Everything else is more or less the same.

I'll leave the detailed results below.

On 10/6/2023 10:18 PM, Daniel Jordan wrote:
> Placing wants current's vruntime and the cfs_rq's min_vruntime up to
> date so that avg_runtime() is too, and similarly it wants the entity to
> be re-weighted and lag adjusted so vslice and vlag are fresh, so always
> do update_curr() and update_cfs_group() beforehand.
> 
> There doesn't seem to be a reason to treat the 'curr' case specially
> after e8f331bcc270 since vruntime doesn't get normalized anymore.
> 
> Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel Details

- tip:	tip:sched/core at commit 3657680f38cd ("sched/psi: Delete the
	'update_total' function parameter from update_triggers()") +
	cherry-pick commit 8dafa9d0eb1a ("sched/eevdf: Fix min_deadline heap
	integrity") from tip:sched/urgent + cherry-pick commit b01db23d5923
	("sched/eevdf: Fix pick_eevdf()") from tip:sched/urgent

- update_curr_opt: tip + this patch

o Benchmark Results

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           tip[pct imp](CV)    update_curr_opt[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.69)     1.01 [ -0.71]( 2.88)
 2-groups     1.00 [ -0.00]( 1.69)     1.01 [ -0.62]( 1.40)
 4-groups     1.00 [ -0.00]( 1.25)     1.01 [ -1.17]( 1.03)
 8-groups     1.00 [ -0.00]( 1.36)     1.00 [ -0.43]( 0.83)
16-groups     1.00 [ -0.00]( 1.44)     1.00 [ -0.13]( 2.32)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:    tip[pct imp](CV)    update_curr_opt[pct imp](CV)
    1     1.00 [  0.00]( 0.33)     1.01 [  0.52]( 1.51)
    2     1.00 [  0.00]( 0.22)     1.00 [ -0.01]( 0.37)
    4     1.00 [  0.00]( 0.25)     0.99 [ -0.71]( 0.60)
    8     1.00 [  0.00]( 0.71)     1.00 [ -0.26]( 0.36)
   16     1.00 [  0.00]( 0.79)     0.99 [ -1.21]( 0.77)
   32     1.00 [  0.00]( 0.94)     0.99 [ -0.82]( 1.46)
   64     1.00 [  0.00]( 1.76)     0.99 [ -0.92]( 1.25)
  128     1.00 [  0.00]( 0.68)     0.98 [ -2.22]( 1.19)
  256     1.00 [  0.00]( 1.23)     0.99 [ -1.43]( 0.79)
  512     1.00 [  0.00]( 0.28)     0.99 [ -0.93]( 0.14)
 1024     1.00 [  0.00]( 0.20)     0.99 [ -1.44]( 0.41)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)    update_curr_opt[pct imp](CV)
 Copy     1.00 [  0.00](11.88)     1.22 [ 21.92]( 7.37)
Scale     1.00 [  0.00]( 7.01)     1.04 [  4.02]( 4.89)
  Add     1.00 [  0.00]( 6.56)     1.11 [ 11.03]( 4.77)
Triad     1.00 [  0.00]( 8.81)     1.14 [ 14.12]( 3.89)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)    update_curr_opt[pct imp](CV)
 Copy     1.00 [  0.00]( 1.07)     1.01 [  0.77]( 1.59)
Scale     1.00 [  0.00]( 4.81)     0.97 [ -2.99]( 7.18)
  Add     1.00 [  0.00]( 4.56)     0.98 [ -2.39]( 6.86)
Triad     1.00 [  0.00]( 1.78)     1.00 [ -0.35]( 4.22)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:         tip[pct imp](CV)    update_curr_opt[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.62)     0.99 [ -1.03]( 0.24)
 2-clients     1.00 [  0.00]( 0.36)     1.00 [ -0.32]( 0.66)
 4-clients     1.00 [  0.00]( 0.31)     1.00 [ -0.17]( 0.44)
 8-clients     1.00 [  0.00]( 0.39)     1.00 [  0.24]( 0.67)
16-clients     1.00 [  0.00]( 0.58)     1.00 [  0.50]( 0.46)
32-clients     1.00 [  0.00]( 0.71)     1.01 [  0.54]( 0.66)
64-clients     1.00 [  0.00]( 2.13)     1.00 [  0.35]( 1.80)
128-clients    1.00 [  0.00]( 0.94)     0.99 [ -0.71]( 0.97)
256-clients    1.00 [  0.00]( 6.09)     1.01 [  1.28]( 3.41)
512-clients    1.00 [  0.00](55.28)     1.01 [  1.32](49.78)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers: tip[pct imp](CV)    update_curr_opt[pct imp](CV)
  1     1.00 [ -0.00]( 2.91)     0.97 [  2.56]( 1.53)
  2     1.00 [ -0.00](21.78)     0.89 [ 10.53](10.23)
  4     1.00 [ -0.00]( 4.88)     1.07 [ -7.32]( 6.82)
  8     1.00 [ -0.00]( 2.49)     1.00 [ -0.00]( 9.53)
 16     1.00 [ -0.00]( 3.70)     1.02 [ -1.75]( 0.99)
 32     1.00 [ -0.00](12.65)     0.83 [ 16.51]( 4.41)
 64     1.00 [ -0.00]( 3.98)     0.97 [  2.59]( 8.27)
128     1.00 [ -0.00]( 1.49)     0.96 [  3.60]( 8.01)
256     1.00 [ -0.00](40.79)     0.80 [ 20.39](36.89)
512     1.00 [ -0.00]( 1.12)     0.98 [  2.20]( 0.75)


==================================================================
Test          : ycsb-cassandra
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
metric      tip     update_curr_opt (%diff)
throughput  1.00    1.00 (%diff: -0.45%)


==================================================================
Test          : ycsb-mongodb
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
metric      tip     update_curr_opt (%diff)
throughput  1.00    1.00 (%diff: -0.13%)


==================================================================
Test          : DeathStarBench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Pinning   scaling   tip     update_curr_opt (%diff)
1CCD        1       1.00    1.01 (%diff: 0.57%)
2CCD        2       1.00    1.00 (%diff: -0.27%)
4CCD        4       1.00    1.00 (%diff: 0.06%)
8CCD        8       1.00    1.00 (%diff: 0.45%)

--
Feel free to include

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

I'll keep a lookout for future versions.

> ---
> 
> Not sure what the XXX above place_entity() is for, maybe it can go away?
> 
> Based on tip/sched/core.
> 
>  kernel/sched/fair.c | 14 ++------------
>  1 file changed, 2 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04fbcbda97d5f..db2ca9bf9cc49 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5047,15 +5047,6 @@ static inline bool cfs_bandwidth_used(void);
>  static void
>  enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
> -	bool curr = cfs_rq->curr == se;
> -
> -	/*
> -	 * If we're the current task, we must renormalise before calling
> -	 * update_curr().
> -	 */
> -	if (curr)
> -		place_entity(cfs_rq, se, flags);
> -
>  	update_curr(cfs_rq);
>  
>  	/*
> @@ -5080,8 +5071,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 * XXX now that the entity has been re-weighted, and it's lag adjusted,
>  	 * we can place the entity.
>  	 */
> -	if (!curr)
> -		place_entity(cfs_rq, se, flags);
> +	place_entity(cfs_rq, se, flags);
>  
>  	account_entity_enqueue(cfs_rq, se);
>  
> @@ -5091,7 +5081,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  
>  	check_schedstat_required();
>  	update_stats_enqueue_fair(cfs_rq, se, flags);
> -	if (!curr)
> +	if (cfs_rq->curr != se)
>  		__enqueue_entity(cfs_rq, se);
>  	se->on_rq = 1;
>  

--
Thanks and Regards,
Prateek
  
Daniel Jordan Oct. 18, 2023, 12:43 a.m. UTC | #2
On Fri, Oct 06, 2023 at 09:58:10PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 06, 2023 at 12:48:26PM -0400, Daniel Jordan wrote:
> > Placing wants current's vruntime and the cfs_rq's min_vruntime up to
> > date so that avg_runtime() is too, and similarly it wants the entity to
> > be re-weighted and lag adjusted so vslice and vlag are fresh, so always
> > do update_curr() and update_cfs_group() beforehand.
> > 
> > There doesn't seem to be a reason to treat the 'curr' case specially
> > after e8f331bcc270 since vruntime doesn't get normalized anymore.
> > 
> > Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> > Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> > ---
> > 
> > Not sure what the XXX above place_entity() is for, maybe it can go away?
> > 
> > Based on tip/sched/core.
> > 
> >  kernel/sched/fair.c | 14 ++------------
> >  1 file changed, 2 insertions(+), 12 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04fbcbda97d5f..db2ca9bf9cc49 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5047,15 +5047,6 @@ static inline bool cfs_bandwidth_used(void);
> >  static void
> >  enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  {
> > -	bool curr = cfs_rq->curr == se;
> > -
> > -	/*
> > -	 * If we're the current task, we must renormalise before calling
> > -	 * update_curr().
> > -	 */
> > -	if (curr)
> > -		place_entity(cfs_rq, se, flags);
> > -
> >  	update_curr(cfs_rq);
> 
> IIRC part of the reason for this order is the:
> 
>   dequeue
>   update
>   enqueue
> 
> pattern we have all over the place. You don't want the enqueue to move
> time forward in this case.
> 
> Could be that all magically works, but please double check.

Yes, I wasn't thinking of the dequeue/update/enqueue places.
Considering these, it seems like there's more to fix (from before EEVDF
even).

Sorry for the delayed response, been staring for a while thinking I'd
have it all by the next day.  It'll take a bit longer to sort out all
the cases, but I'll keep going.
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04fbcbda97d5f..db2ca9bf9cc49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5047,15 +5047,6 @@ static inline bool cfs_bandwidth_used(void);
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	bool curr = cfs_rq->curr == se;
-
-	/*
-	 * If we're the current task, we must renormalise before calling
-	 * update_curr().
-	 */
-	if (curr)
-		place_entity(cfs_rq, se, flags);
-
 	update_curr(cfs_rq);
 
 	/*
@@ -5080,8 +5071,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * XXX now that the entity has been re-weighted, and it's lag adjusted,
 	 * we can place the entity.
 	 */
-	if (!curr)
-		place_entity(cfs_rq, se, flags);
+	place_entity(cfs_rq, se, flags);
 
 	account_entity_enqueue(cfs_rq, se);
 
@@ -5091,7 +5081,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	check_schedstat_required();
 	update_stats_enqueue_fair(cfs_rq, se, flags);
-	if (!curr)
+	if (cfs_rq->curr != se)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;