[v2] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline

Message ID 20231004130908.238992-1-daniel.m.jordan@oracle.com
State New
Headers
Series [v2] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline |

Commit Message

Daniel Jordan Oct. 4, 2023, 1:09 p.m. UTC
  An entity is supposed to get an earlier deadline with
PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
overwritten soon after in enqueue_entity() the first time a forked
entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.

Placing in task_fork_fair() seems unnecessary since none of the values
that get set (slice, vruntime, deadline) are used before they're set
again at enqueue time, so get rid of that (and with it all of
task_fork_fair()) and just pass ENQUEUE_INITIAL to enqueue_entity() via
wake_up_new_task().

Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---

v2
 - place_entity() seems like the only reason for task_fork_fair() to exist
   after the recent removal of sysctl_sched_child_runs_first, so take out
   the whole function.

Still based on today's peterz/sched/eevdf

 kernel/sched/core.c |  2 +-
 kernel/sched/fair.c | 24 ------------------------
 2 files changed, 1 insertion(+), 25 deletions(-)
  

Comments

Chen Yu Oct. 4, 2023, 3:46 p.m. UTC | #1
Hi Daniel,

On 2023-10-04 at 09:09:08 -0400, Daniel Jordan wrote:
> An entity is supposed to get an earlier deadline with
> PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> overwritten soon after in enqueue_entity() the first time a forked
> entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
> 
> Placing in task_fork_fair() seems unnecessary since none of the values
> that get set (slice, vruntime, deadline) are used before they're set
> again at enqueue time, so get rid of that (and with it all of
> task_fork_fair()) and just pass ENQUEUE_INITIAL to enqueue_entity() via
> wake_up_new_task().
> 
> Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> ---
> 
> v2
>  - place_entity() seems like the only reason for task_fork_fair() to exist
>    after the recent removal of sysctl_sched_child_runs_first, so take out
>    the whole function.

At first glance I thought if we remove task_fork_fair(), do we lose one chance to
update the parent task's statistic in update_curr()?  We might get out-of-date
parent task's deadline and make preemption decision based on the stale data in
wake_up_new_task() -> wakeup_preempt() -> pick_eevdf(). But after a second thought,
I found that wake_up_new_task() -> enqueue_entity() itself would invoke update_curr(),
so this should not be a problem.

Then I was wondering why can't we just skip place_entity() in enqueue_entity()
if ENQUEUE_WAKEUP is not set, just like the code before e8f331bcc270? In this
way the new fork task's deadline will not be overwritten by wake_up_new_task()->
enqueue_entity(). Then I realized that, after e8f331bcc270, the task's vruntime
and deadline are all calculated by place_entity() rather than being renormalised
to cfs_rq->min_vruntime in enqueue_entity(), so we can not simply skip place_entity()
in enqueue_entity().

Per my understanding, this patch looks good,

Reviewed-by: Chen Yu <yu.c.chen@intel.com>

thanks,
Chenyu
  
Daniel Jordan Oct. 6, 2023, 4:31 p.m. UTC | #2
Hi Chenyu,

On Wed, Oct 04, 2023 at 11:46:21PM +0800, Chen Yu wrote:
> Hi Daniel,
> 
> On 2023-10-04 at 09:09:08 -0400, Daniel Jordan wrote:
> > An entity is supposed to get an earlier deadline with
> > PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> > overwritten soon after in enqueue_entity() the first time a forked
> > entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
> > 
> > Placing in task_fork_fair() seems unnecessary since none of the values
> > that get set (slice, vruntime, deadline) are used before they're set
> > again at enqueue time, so get rid of that (and with it all of
> > task_fork_fair()) and just pass ENQUEUE_INITIAL to enqueue_entity() via
> > wake_up_new_task().
> > 
> > Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> > Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> > ---
> > 
> > v2
> >  - place_entity() seems like the only reason for task_fork_fair() to exist
> >    after the recent removal of sysctl_sched_child_runs_first, so take out
> >    the whole function.
> 
> At first glance I thought if we remove task_fork_fair(), do we lose one chance to
> update the parent task's statistic in update_curr()?  We might get out-of-date
> parent task's deadline and make preemption decision based on the stale data in
> wake_up_new_task() -> wakeup_preempt() -> pick_eevdf(). But after a second thought,
> I found that wake_up_new_task() -> enqueue_entity() itself would invoke update_curr(),
> so this should not be a problem.
>
> Then I was wondering why can't we just skip place_entity() in enqueue_entity()
> if ENQUEUE_WAKEUP is not set, just like the code before e8f331bcc270? In this
> way the new fork task's deadline will not be overwritten by wake_up_new_task()->
> enqueue_entity(). Then I realized that, after e8f331bcc270, the task's vruntime
> and deadline are all calculated by place_entity() rather than being renormalised
> to cfs_rq->min_vruntime in enqueue_entity(), so we can not simply skip place_entity()
> in enqueue_entity().

This all made me wonder if the order of update_curr() for the parent and
place_entity() for the child matters.  And it does, since placing uses
avg_vruntime(), which wants an up-to-date vruntime for current and
min_vruntime for cfs_rq.  Good that 'curr' in enqueue_entity() is false
on fork so that the parent's vruntime is up to date, but it seems
placing should always happen after update_curr().

> Per my understanding, this patch looks good,
> 
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>

Thanks!

Daniel
  
K Prateek Nayak Oct. 12, 2023, 4:48 a.m. UTC | #3
Hello Daniel,

Same as v1, I do not see any regressions with this version either.
I'll leave the full results below.

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel Details

- tip:	tip:sched/core at commit 238437d88cea ("intel_idle: Add ibrs_off
	module parameter to force-disable IBRS")
	[For DeathStarBench comparisons alone since I ran to the issue
	which below commit solves]
	+ min_deadline fix commit 8dafa9d0eb1a ("sched/eevdf: Fix
	min_deadline heap integrity") from tip:sched/urgent 

- place-initial-fix: tip + this patch as is

o Benchmark Results

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           tip[pct imp](CV)    place-initial-fix[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.11)     1.01 [ -1.08]( 2.60)
 2-groups     1.00 [ -0.00]( 1.31)     1.01 [ -0.93]( 1.61)
 4-groups     1.00 [ -0.00]( 1.04)     1.00 [ -0.00]( 1.25)
 8-groups     1.00 [ -0.00]( 1.34)     0.99 [  1.15]( 0.85)
16-groups     1.00 [ -0.00]( 2.45)     1.00 [ -0.27]( 2.32)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:    tip[pct imp](CV)    place-initial-fix[pct imp](CV)
    1     1.00 [  0.00]( 0.46)     0.99 [ -0.59]( 0.88)
    2     1.00 [  0.00]( 0.64)     0.99 [ -1.43]( 0.69)
    4     1.00 [  0.00]( 0.59)     0.99 [ -1.49]( 0.76)
    8     1.00 [  0.00]( 0.34)     1.00 [ -0.35]( 0.20)
   16     1.00 [  0.00]( 0.72)     0.98 [ -1.96]( 1.97)
   32     1.00 [  0.00]( 0.65)     1.00 [ -0.24]( 1.07)
   64     1.00 [  0.00]( 0.59)     1.00 [ -0.14]( 1.18)
  128     1.00 [  0.00]( 1.19)     0.99 [ -1.04]( 0.93)
  256     1.00 [  0.00]( 0.16)     1.00 [ -0.18]( 0.34)
  512     1.00 [  0.00]( 0.20)     0.99 [ -0.62]( 0.02)
 1024     1.00 [  0.00]( 0.06)     1.00 [ -0.49]( 0.37)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)    place-initial-fix[pct imp](CV)
 Copy     1.00 [  0.00]( 6.04)     1.00 [ -0.21]( 7.98)
Scale     1.00 [  0.00]( 5.44)     0.99 [ -0.75]( 5.75)
  Add     1.00 [  0.00]( 5.44)     0.99 [ -1.48]( 5.40)
Triad     1.00 [  0.00]( 7.82)     1.02 [  2.21]( 8.33)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)    place-initial-fix[pct imp](CV)
 Copy     1.00 [  0.00]( 1.14)     1.00 [  0.40]( 1.12)
Scale     1.00 [  0.00]( 4.60)     1.01 [  1.05]( 4.99)
  Add     1.00 [  0.00]( 4.91)     1.00 [ -0.14]( 4.97)
Triad     1.00 [  0.00]( 0.60)     0.96 [ -3.53]( 6.13)


==================================================================
Test          : netperf
Units         : Normalized Througput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:         tip[pct imp](CV)    place-initial-fix[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.61)     1.00 [  0.40]( 0.75)
 2-clients     1.00 [  0.00]( 0.44)     1.00 [ -0.47]( 0.91)
 4-clients     1.00 [  0.00]( 0.75)     1.00 [ -0.23]( 0.84)
 8-clients     1.00 [  0.00]( 0.65)     1.00 [ -0.07]( 0.62)
16-clients     1.00 [  0.00]( 0.49)     1.00 [ -0.29]( 0.56)
32-clients     1.00 [  0.00]( 0.57)     1.00 [ -0.14]( 0.46)
64-clients     1.00 [  0.00]( 1.67)     1.00 [ -0.14]( 1.81)
128-clients    1.00 [  0.00]( 1.11)     1.01 [  0.64]( 1.04)
256-clients    1.00 [  0.00]( 2.64)     0.99 [ -1.29]( 5.25)
512-clients    1.00 [  0.00](52.49)     0.99 [ -0.57](53.01)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers: tip[pct imp](CV)    place-initial-fix[pct imp](CV)
  1     1.00 [ -0.00]( 8.41)     1.05 [ -5.41](13.45)
  2     1.00 [ -0.00]( 5.29)     0.88 [ 12.50](13.21)
  4     1.00 [ -0.00]( 1.32)     1.00 [ -0.00]( 4.80)
  8     1.00 [ -0.00]( 9.52)     0.94 [  6.25]( 8.85)
 16     1.00 [ -0.00]( 1.61)     0.97 [  3.23]( 5.00)
 32     1.00 [ -0.00]( 7.27)     0.88 [ 12.50]( 2.30)
 64     1.00 [ -0.00]( 6.96)     1.07 [ -6.94]( 4.94)
128     1.00 [ -0.00]( 3.41)     0.99 [  1.44]( 2.69)
256     1.00 [ -0.00](32.95)     0.81 [ 19.17](16.38)
512     1.00 [ -0.00]( 3.20)     0.98 [  1.66]( 2.35)


==================================================================
Test          : ycsb-cassandra
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
metric          tip    place-initial-fix(%diff)
throughput      1.00    0.99 (%diff: -0.67%)


==================================================================
Test          : ycsb-mondodb
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
metric          tip    place-initial-fix(%diff)
throughput      1.00    0.99 (%diff: -0.68%)


==================================================================
Test          : DeathStarBench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
Note	      : Comparisons contains additional commit 8dafa9d0eb1a
		("sched/eevdf: Fix min_deadline heap integrity") from
		tip:sched/urgent to fix an EEVDF issue being hit
==================================================================
Pinning      scaling    tip     place-initial-fix (%diff)
1CCD            1       1.00    1.00 (%diff: -0.09%)
2CCD            2       1.00    1.02 (%diff: 2.46%)
4CCD            4       1.00    1.00 (%diff: 0.45%)
8CCD            8       1.00    1.00 (%diff: -0.46%)

--

On 10/4/2023 6:39 PM, Daniel Jordan wrote:
> An entity is supposed to get an earlier deadline with
> PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> overwritten soon after in enqueue_entity() the first time a forked
> entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
> 
> Placing in task_fork_fair() seems unnecessary since none of the values
> that get set (slice, vruntime, deadline) are used before they're set
> again at enqueue time, so get rid of that (and with it all of
> task_fork_fair()) and just pass ENQUEUE_INITIAL to enqueue_entity() via
> wake_up_new_task().
> 
> Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

> ---
> 
> v2
>  - place_entity() seems like the only reason for task_fork_fair() to exist
>    after the recent removal of sysctl_sched_child_runs_first, so take out
>    the whole function.
> 
> Still based on today's peterz/sched/eevdf
> 
>  kernel/sched/core.c |  2 +-
>  kernel/sched/fair.c | 24 ------------------------
>  2 files changed, 1 insertion(+), 25 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 779cdc7969c81..500e2dbfd41dd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4854,7 +4854,7 @@ void wake_up_new_task(struct task_struct *p)
>  	update_rq_clock(rq);
>  	post_init_entity_util_avg(p);
>  
> -	activate_task(rq, p, ENQUEUE_NOCLOCK);
> +	activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK);
>  	trace_sched_wakeup_new(p);
>  	wakeup_preempt(rq, p, WF_FORK);
>  #ifdef CONFIG_SMP
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a0b4dac2662c9..3827b302eeb9b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12427,29 +12427,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  	task_tick_core(rq, curr);
>  }
>  
> -/*
> - * called on fork with the child task as argument from the parent's context
> - *  - child not yet on the tasklist
> - *  - preemption disabled
> - */
> -static void task_fork_fair(struct task_struct *p)
> -{
> -	struct sched_entity *se = &p->se, *curr;
> -	struct cfs_rq *cfs_rq;
> -	struct rq *rq = this_rq();
> -	struct rq_flags rf;
> -
> -	rq_lock(rq, &rf);
> -	update_rq_clock(rq);
> -
> -	cfs_rq = task_cfs_rq(current);
> -	curr = cfs_rq->curr;
> -	if (curr)
> -		update_curr(cfs_rq);
> -	place_entity(cfs_rq, se, ENQUEUE_INITIAL);
> -	rq_unlock(rq, &rf);
> -}
> -
>  /*
>   * Priority of the task has changed. Check to see if we preempt
>   * the current task.
> @@ -12953,7 +12930,6 @@ DEFINE_SCHED_CLASS(fair) = {
>  #endif
>  
>  	.task_tick		= task_tick_fair,
> -	.task_fork		= task_fork_fair,
>  
>  	.prio_changed		= prio_changed_fair,
>  	.switched_from		= switched_from_fair,

--
Thanks and Regards,
Prateek
  

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 779cdc7969c81..500e2dbfd41dd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4854,7 +4854,7 @@  void wake_up_new_task(struct task_struct *p)
 	update_rq_clock(rq);
 	post_init_entity_util_avg(p);
 
-	activate_task(rq, p, ENQUEUE_NOCLOCK);
+	activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK);
 	trace_sched_wakeup_new(p);
 	wakeup_preempt(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0b4dac2662c9..3827b302eeb9b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12427,29 +12427,6 @@  static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	task_tick_core(rq, curr);
 }
 
-/*
- * called on fork with the child task as argument from the parent's context
- *  - child not yet on the tasklist
- *  - preemption disabled
- */
-static void task_fork_fair(struct task_struct *p)
-{
-	struct sched_entity *se = &p->se, *curr;
-	struct cfs_rq *cfs_rq;
-	struct rq *rq = this_rq();
-	struct rq_flags rf;
-
-	rq_lock(rq, &rf);
-	update_rq_clock(rq);
-
-	cfs_rq = task_cfs_rq(current);
-	curr = cfs_rq->curr;
-	if (curr)
-		update_curr(cfs_rq);
-	place_entity(cfs_rq, se, ENQUEUE_INITIAL);
-	rq_unlock(rq, &rf);
-}
-
 /*
  * Priority of the task has changed. Check to see if we preempt
  * the current task.
@@ -12953,7 +12930,6 @@  DEFINE_SCHED_CLASS(fair) = {
 #endif
 
 	.task_tick		= task_tick_fair,
-	.task_fork		= task_fork_fair,
 
 	.prio_changed		= prio_changed_fair,
 	.switched_from		= switched_from_fair,