[RFC] sched/eevdf: sched feature to dismiss lag on wakeup

Message ID 20240228161018.14253-1-huschle@linux.ibm.com
State New
Series [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

Commit Message

Tobias Huschle Feb. 28, 2024, 4:10 p.m. UTC
The previously used CFS scheduler gave tasks that were woken up an
enhanced chance to see runtime immediately by deducting a certain value
from their vruntime on runqueue placement during wakeup.
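
As an illustration of the behavior described above, here is a simplified
user-space model of such a wakeup placement. It is only a sketch, not the
kernel code: the function name cfs_place_on_wakeup() and the parameters
min_vruntime and sleeper_credit are assumptions made for this example, and
the saturating subtraction stands in for the kernel's wrap-safe arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of a CFS-style wakeup placement (not kernel source).
 * The woken entity is placed slightly before the runqueue's minimum
 * vruntime, which makes it look "owed" runtime and likely to run soon.
 */
static uint64_t cfs_place_on_wakeup(uint64_t se_vruntime,
				    uint64_t min_vruntime,
				    uint64_t sleeper_credit)
{
	/* Deduct the credit from the runqueue minimum, saturating at 0. */
	uint64_t placed = min_vruntime > sleeper_credit ?
			  min_vruntime - sleeper_credit : 0;

	/* Never move the entity backwards relative to its own vruntime. */
	return se_vruntime > placed ? se_vruntime : placed;
}
```

With min_vruntime = 1000 and sleeper_credit = 300, an entity that slept
for a long time (old vruntime 100) is placed at 700, ahead of everything
else on the queue in this model.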

This property was used by some users, at least vhost, to ensure that certain
kworkers are scheduled immediately after being woken up. The EEVDF
scheduler does not support this so far. Instead, if such a woken-up
entity carries a negative lag from its previous execution, it has
to wait for the current time slice to finish, which hurts the
performance of the process that expects the immediate execution.

To address this issue, implement EEVDF strategy #2 for rejoining
entities, which dismisses the lag from previous execution and allows
the woken up task to run immediately (if no other entities are deemed
to be preferred for scheduling by EEVDF).

The vruntime is decremented by an additional value of 1 to make sure
that the woken-up task actually gets to run. This of course does not
follow strategy #2 exactly, but it guarantees the expected
behavior for the scenario described above. Without the additional
decrement, performance degrades even further, so there are some
side effects that I do not understand yet.
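
The effect of the change on placement can be modeled in user space as
follows. This is a sketch under simplifying assumptions: V stands for the
runqueue's average vruntime, the load scaling of the lag done in
place_entity() is left out, and the helper names place_vruntime() and
is_eligible() are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the placement step "se->vruntime = vruntime - lag".
 * V is the average vruntime of the runqueue, vlag the lag saved at the
 * last dequeue. With nolag_wakeup set, the saved lag is ignored and the
 * entity is placed at V - 1, as in the proposed patch.
 */
static int64_t place_vruntime(int64_t V, int64_t vlag, bool nolag_wakeup)
{
	int64_t lag = nolag_wakeup ? 1 : vlag;

	return V - lag;
}

/* An entity is eligible when its vruntime is not ahead of V (lag >= 0). */
static bool is_eligible(int64_t V, int64_t vruntime)
{
	return V - vruntime >= 0;
}
```

In this model, an entity with a saved lag of -500 is placed at V + 500 and
is not eligible, while the same entity with the feature enabled is placed
at V - 1 and is eligible right away.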

Questions:
1. The kworker gets its negative lag in the following scenario:
   - kworker and a cgroup are supposed to execute on the same CPU
   - one task within the cgroup is executing and wakes up the kworker
   - kworker, with 0 lag, gets picked immediately and finishes its
     execution within ~5000ns
   - on dequeue, kworker gets assigned a negative lag
   Is this expected behavior? With this short execution time, I would
   expect the kworker to be fine.
   For a more detailed discussion on this symptom, please see:
   https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/
2. The proposed code change of course only addresses the symptom. Am I
   assuming correctly that this is in general the expected behavior and
   that the task waking up the kworker should rather do an explicit
   reschedule of itself to grant the kworker time to execute?
   In the vhost case, this is currently attempted through a cond_resched(),
   which does nothing because the need_resched flag is not set.
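
To make the scenario in question 1 concrete, the following user-space
model computes the lag of an entity that starts with 0 lag and then runs
briefly while one other entity waits. It is a sketch under stated
assumptions, not the kernel's bookkeeping: only two entities are modeled,
the average vruntime is a plain load-weighted mean, and struct entity,
avg_vruntime2(), run_for() and lag_after_run() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024	/* weight of a nice-0 entity in this model */

struct entity {
	int64_t vruntime;
	int64_t weight;
};

/* Load-weighted average vruntime of two entities. */
static int64_t avg_vruntime2(const struct entity *a, const struct entity *b)
{
	assert(a->weight + b->weight > 0);
	return (a->vruntime * a->weight + b->vruntime * b->weight) /
	       (a->weight + b->weight);
}

/* Charge delta_ns of runtime to an entity, scaled by its weight. */
static void run_for(struct entity *se, int64_t delta_ns)
{
	assert(se->weight > 0);
	se->vruntime += delta_ns * NICE_0_LOAD / se->weight;
}

/* Lag of 'se' after it ran for delta_ns while 'other' did not run. */
static int64_t lag_after_run(struct entity se, struct entity other,
			     int64_t delta_ns)
{
	run_for(&se, delta_ns);
	return avg_vruntime2(&se, &other) - se.vruntime;
}
```

With two equally weighted entities at the same vruntime, running one of
them for 5000ns advances its vruntime by 5000 but the average by only
2500, so the model yields a lag of -2500. In this simplified picture, any
runtime consumed while another entity is queued produces a negative lag,
however short the run is.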

Feedback and opinions would be highly appreciated.

Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
---
 kernel/sched/fair.c     | 5 +++++
 kernel/sched/features.h | 1 +
 2 files changed, 6 insertions(+)
  

Comments

K Prateek Nayak Feb. 29, 2024, 3:36 a.m. UTC | #1
(+ Xuewen Yan, Ke Wang)

Hello Tobias,

On 2/28/2024 9:40 PM, Tobias Huschle wrote:
> The previously used CFS scheduler gave tasks that were woken up an
> enhanced chance to see runtime immediately by deducting a certain value
> from its vruntime on runqueue placement during wakeup.
> 
> This property was used by some, at least vhost, to ensure, that certain
> kworkers are scheduled immediately after being woken up. The EEVDF
> scheduler, does not support this so far. Instead, if such a woken up
> entity carries a negative lag from its previous execution, it will have
> to wait for the current time slice to finish, which affects the
> performance of the process expecting the immediate execution negatively.
> 
> To address this issue, implement EEVDF strategy #2 for rejoining
> entities, which dismisses the lag from previous execution and allows
> the woken up task to run immediately (if no other entities are deemed
> to be preferred for scheduling by EEVDF).
> 
> The vruntime is decremented by an additional value of 1 to make sure,
> that the woken up tasks gets to actually run. This is of course not
> following strategy #2 in an exact manner but guarantees the expected
> behavior for the scenario described above. Without the additional
> decrement, the performance goes south even more. So there are some
> side effects I could not get my head around yet.
> 
> Questions:
> 1. The kworker getting its negative lag occurs in the following scenario
>    - kworker and a cgroup are supposed to execute on the same CPU
>    - one task within the cgroup is executing and wakes up the kworker
>    - kworker with 0 lag, gets picked immediately and finishes its
>      execution within ~5000ns
>    - on dequeue, kworker gets assigned a negative lag
>    Is this expected behavior? With this short execution time, I would
>    expect the kworker to be fine.
>    For a more detailed discussion on this symptom, please see:
>    https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/

Does the lag clamping patch from Xuewen Yan [1] work for the vhost case
mentioned in the thread? Instead of placing the task just behind the
0-lag point, clamping the lag seems to be a more principled approach since
EEVDF already does it in update_entity_lag().
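
For reference, clamping here means bounding the lag symmetrically around
zero. The following is a minimal user-space sketch of that idea; the
function name clamp_lag() and the plain limit parameter are assumptions
for this example, and how the limit is derived is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bound a lag value to the range [-limit, limit].
 * 'limit' is a free parameter in this sketch.
 */
static int64_t clamp_lag(int64_t lag, int64_t limit)
{
	assert(limit >= 0);

	if (lag > limit)
		return limit;
	if (lag < -limit)
		return -limit;
	return lag;
}
```

A lag of -2500 with a limit of 1000 would be stored as -1000, so the
entity would have less ground to make up on its next wakeup than with the
unclamped value.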

If the lag is still too large, maybe the above coupled with Peter's
delayed dequeue patch [2] can help. (Note: the tree is prone to force
updates.)

[1] https://lore.kernel.org/lkml/20240130080643.1828-1-xuewen.yan@unisoc.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e62ef63a888c97188a977daddb72b61548da8417

> 2. The proposed code change of course only addresses the symptom. Am I
>    assuming correctly that this is in general the expected behavior and
>    that the task waking up the kworker should rather do an explicit
>    reschedule of itself to grant the kworker time to execute?
>    In the vhost case, this is currently attempted through a cond_resched
>    which is not doing anything because the need_resched flag is not set.
> 
> Feedback and opinions would be highly appreciated.
> 
> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
> ---
>  kernel/sched/fair.c     | 5 +++++
>  kernel/sched/features.h | 1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 533547e3c90a..c20ae6d62961 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5239,6 +5239,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  		lag = div_s64(lag, load);
>  	}
>  
> +	if (sched_feat(NOLAG_WAKEUP) && (flags & ENQUEUE_WAKEUP)) {
> +		se->vlag = 0;
> +		lag = 1;
> +	}
> +
>  	se->vruntime = vruntime - lag;
>  
>  	/*
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 143f55df890b..d3118e7568b4 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -7,6 +7,7 @@
>  SCHED_FEAT(PLACE_LAG, true)
>  SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
>  SCHED_FEAT(RUN_TO_PARITY, true)
> +SCHED_FEAT(NOLAG_WAKEUP, true)
>  
>  /*
>   * Prefer to schedule the task we woke last (assuming it failed

--
Thanks and Regards,
Prateek
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..c20ae6d62961 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5239,6 +5239,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		lag = div_s64(lag, load);
 	}
 
+	if (sched_feat(NOLAG_WAKEUP) && (flags & ENQUEUE_WAKEUP)) {
+		se->vlag = 0;
+		lag = 1;
+	}
+
 	se->vruntime = vruntime - lag;
 
 	/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 143f55df890b..d3118e7568b4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,6 +7,7 @@ 
 SCHED_FEAT(PLACE_LAG, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 SCHED_FEAT(RUN_TO_PARITY, true)
+SCHED_FEAT(NOLAG_WAKEUP, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed