sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline

Message ID 20231004011737.188682-1-daniel.m.jordan@oracle.com
State New
Headers
Series sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline |

Commit Message

Daniel Jordan Oct. 4, 2023, 1:17 a.m. UTC
  An entity is supposed to get an earlier deadline with
PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
overwritten soon after in enqueue_entity() the first time a forked
entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.

Placing in task_fork_fair() seems unnecessary since none of the values
that get set (slice, vruntime, deadline) are used before they're set
again at enqueue time, so get rid of that and just pass ENQUEUE_INITIAL
to enqueue_entity() via wake_up_new_task().

Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---

Tested on top of peterz/sched/eevdf from 2023-10-03.

 kernel/sched/core.c | 2 +-
 kernel/sched/fair.c | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)
  

Comments

K Prateek Nayak Oct. 5, 2023, 5:56 a.m. UTC | #1
Hello Daniel,

On 10/4/2023 6:47 AM, Daniel Jordan wrote:
> An entity is supposed to get an earlier deadline with
> PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> overwritten soon after in enqueue_entity() the first time a forked
> entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
> 
> Placing in task_fork_fair() seems unnecessary since none of the values
> that get set (slice, vruntime, deadline) are used before they're set
> again at enqueue time, so get rid of that and just pass ENQUEUE_INITIAL
> to enqueue_entity() via wake_up_new_task().
> 
> Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>

I got a chance to this this on a 3rd Generation EPYC system. I don't
see anything out of the ordinary except for a small regression on
hackbench. I'll leave the full result below.

o System details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- Boost enabled, C2 Disabled (POLL and MWAIT based C1 remained enabled)


o Kernel Details

- tip:	tip:sched/core at commit d4d6596b4386 ("sched/headers: Remove
	duplicate header inclusions")

- place-deadline-fix: tip + this patch


o Benchmark Results

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.58)     1.04 [ -3.63]( 3.14)
 2-groups     1.00 [ -0.00]( 1.87)     1.03 [ -2.98]( 1.85)
 4-groups     1.00 [ -0.00]( 1.63)     1.02 [ -2.35]( 1.59)
 8-groups     1.00 [ -0.00]( 1.38)     1.03 [ -2.92]( 1.20)
16-groups     1.00 [ -0.00]( 2.67)     1.02 [ -1.61]( 2.08)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
    1     1.00 [  0.00]( 0.59)     1.02 [  2.09]( 0.07)
    2     1.00 [  0.00]( 1.19)     1.02 [  2.38]( 0.82)
    4     1.00 [  0.00]( 0.33)     1.03 [  2.89]( 0.99)
    8     1.00 [  0.00]( 0.76)     1.02 [  2.10]( 0.46)
   16     1.00 [  0.00]( 1.10)     1.01 [  0.81]( 0.49)
   32     1.00 [  0.00]( 1.47)     1.02 [  1.77]( 0.58)
   64     1.00 [  0.00]( 1.77)     1.02 [  1.83]( 1.77)
  128     1.00 [  0.00]( 0.41)     1.02 [  2.49]( 0.52)
  256     1.00 [  0.00]( 0.63)     1.03 [  3.03]( 1.38)
  512     1.00 [  0.00]( 0.32)     1.02 [  1.61]( 0.45)
 1024     1.00 [  0.00]( 0.22)     1.01 [  1.00]( 0.26)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
 Copy     1.00 [  0.00]( 9.30)     0.85 [-15.36](11.26)
Scale     1.00 [  0.00]( 6.67)     0.98 [ -2.36]( 7.53)
  Add     1.00 [  0.00]( 6.77)     0.92 [ -7.86]( 7.83)
Triad     1.00 [  0.00]( 7.36)     0.94 [ -5.57]( 6.82)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
 Copy     1.00 [  0.00]( 1.83)     0.96 [ -3.68]( 5.08)
Scale     1.00 [  0.00]( 6.41)     1.03 [  2.66]( 5.28)
  Add     1.00 [  0.00]( 6.23)     1.02 [  1.54]( 4.97)
Triad     1.00 [  0.00]( 0.89)     0.94 [ -5.68]( 6.78)


==================================================================
Test          : netperf
Units         : Normalized Througput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.05)     1.02 [  1.83]( 1.98)
 2-clients     1.00 [  0.00]( 0.93)     1.02 [  1.87]( 2.45)
 4-clients     1.00 [  0.00]( 0.54)     1.02 [  2.19]( 1.99)
 8-clients     1.00 [  0.00]( 0.48)     1.02 [  2.29]( 2.27)
16-clients     1.00 [  0.00]( 0.42)     1.02 [  1.60]( 1.70)
32-clients     1.00 [  0.00]( 0.78)     1.02 [  1.88]( 2.08)
64-clients     1.00 [  0.00]( 1.45)     1.02 [  2.33]( 2.18)
128-clients    1.00 [  0.00]( 0.97)     1.02 [  2.38]( 1.95)
256-clients    1.00 [  0.00]( 4.57)     1.02 [  2.50]( 5.42)
512-clients    1.00 [  0.00](52.74)     1.03 [  3.38](49.69)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
  1     1.00 [ -0.00]( 3.95)     0.90 [ 10.26](31.80)
  2     1.00 [ -0.00](10.45)     1.08 [ -7.89](15.33)
  4     1.00 [ -0.00]( 4.76)     0.93 [  7.14]( 3.95)
  8     1.00 [ -0.00]( 9.35)     1.06 [ -6.25]( 8.90)
 16     1.00 [ -0.00]( 8.84)     0.92 [  8.06]( 4.39)
 32     1.00 [ -0.00]( 3.33)     1.04 [ -4.40]( 3.68)
 64     1.00 [ -0.00]( 6.70)     0.96 [  4.17]( 2.75)
128     1.00 [ -0.00]( 0.71)     0.96 [  3.55]( 1.26)
256     1.00 [ -0.00](31.20)     1.28 [-28.21]( 9.69)
512     1.00 [ -0.00]( 4.98)     1.00 [  0.48]( 2.76)


==================================================================
Test          : ycsb-cassandra
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Metric          tip    place-deadline-fix(pct imp)
Throughput      1.00    1.01 (%diff: 1.06%)


==================================================================
Test          : ycsb-mondodb
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Metric          tip    place-deadline-fix(pct imp)
Throughput      1.00    1.00 (%diff: 0.25%)


==================================================================
Test          : DeathStarBench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Pinning      scaling     tip            place-deadline-fix(pct imp)
 1CCD           1       1.00            1.00 (%diff: -0.32%)
 2CCD           2       1.00            1.00 (%diff: -0.26%)
 4CCD           4       1.00            1.00 (%diff: 0.17%)
 8CCD           8       1.00            1.00 (%diff: -0.17%)


--
I see there is a v2. I'll give that a spin as well.

> ---
> 
> Tested on top of peterz/sched/eevdf from 2023-10-03.
> 
>  kernel/sched/core.c | 2 +-
>  kernel/sched/fair.c | 1 -
>  2 files changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 779cdc7969c81..500e2dbfd41dd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4854,7 +4854,7 @@ void wake_up_new_task(struct task_struct *p)
>  	update_rq_clock(rq);
>  	post_init_entity_util_avg(p);
>  
> -	activate_task(rq, p, ENQUEUE_NOCLOCK);
> +	activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK);
>  	trace_sched_wakeup_new(p);
>  	wakeup_preempt(rq, p, WF_FORK);
>  #ifdef CONFIG_SMP
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a0b4dac2662c9..5872b8a3f5891 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12446,7 +12446,6 @@ static void task_fork_fair(struct task_struct *p)
>  	curr = cfs_rq->curr;
>  	if (curr)
>  		update_curr(cfs_rq);
> -	place_entity(cfs_rq, se, ENQUEUE_INITIAL);
>  	rq_unlock(rq, &rf);
>  }
>  
 
--
Thanks and Regards,
Prateek
  
Daniel Jordan Oct. 6, 2023, 4:35 p.m. UTC | #2
Hi Prateek,

On Thu, Oct 05, 2023 at 11:26:07AM +0530, K Prateek Nayak wrote:
> Hello Daniel,
> 
> On 10/4/2023 6:47 AM, Daniel Jordan wrote:
> > An entity is supposed to get an earlier deadline with
> > PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> > overwritten soon after in enqueue_entity() the first time a forked
> > entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
> > 
> > Placing in task_fork_fair() seems unnecessary since none of the values
> > that get set (slice, vruntime, deadline) are used before they're set
> > again at enqueue time, so get rid of that and just pass ENQUEUE_INITIAL
> > to enqueue_entity() via wake_up_new_task().
> > 
> > Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> > Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> 
> I got a chance to this this on a 3rd Generation EPYC system. I don't
> see anything out of the ordinary except for a small regression on
> hackbench. I'll leave the full result below.

Thanks for testing!

> o System details
> 
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - Boost enabled, C2 Disabled (POLL and MWAIT based C1 remained enabled)
> 
> 
> o Kernel Details
> 
> - tip:	tip:sched/core at commit d4d6596b4386 ("sched/headers: Remove
> 	duplicate header inclusions")
> 
> - place-deadline-fix: tip + this patch
> 
> 
> o Benchmark Results
> 
> ==================================================================
> Test          : hackbench
> Units         : Normalized time in seconds
> Interpretation: Lower is better
> Statistic     : AMean
> ==================================================================
> Case:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
>  1-groups     1.00 [ -0.00]( 2.58)     1.04 [ -3.63]( 3.14)
>  2-groups     1.00 [ -0.00]( 1.87)     1.03 [ -2.98]( 1.85)
>  4-groups     1.00 [ -0.00]( 1.63)     1.02 [ -2.35]( 1.59)
>  8-groups     1.00 [ -0.00]( 1.38)     1.03 [ -2.92]( 1.20)
> 16-groups     1.00 [ -0.00]( 2.67)     1.02 [ -1.61]( 2.08)

Huh, numbers do seem a bit outside the noise.  Doesn't hackbench only
fork at the beginning?  I glanced at perf messaging source just now, but
not sure if you use that version.  Anyway, I wouldn't expect this patch
to have much of an effect in that case.
  

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 779cdc7969c81..500e2dbfd41dd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4854,7 +4854,7 @@  void wake_up_new_task(struct task_struct *p)
 	update_rq_clock(rq);
 	post_init_entity_util_avg(p);
 
-	activate_task(rq, p, ENQUEUE_NOCLOCK);
+	activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK);
 	trace_sched_wakeup_new(p);
 	wakeup_preempt(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0b4dac2662c9..5872b8a3f5891 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12446,7 +12446,6 @@  static void task_fork_fair(struct task_struct *p)
 	curr = cfs_rq->curr;
 	if (curr)
 		update_curr(cfs_rq);
-	place_entity(cfs_rq, se, ENQUEUE_INITIAL);
 	rq_unlock(rq, &rf);
 }