[v10,8/9] sched/fair: Add latency list
Commit Message
Add an rb tree for latency-sensitive entities so we can schedule the
most sensitive one first, even when it failed to preempt current at
wakeup or when it was quickly preempted by another entity of higher
priority.
In order to keep fairness, the latency is used once at wakeup to get a
minimum slice and not during the following scheduling slices, to
prevent a long-running entity from getting more running time than is
allocated to its nice priority.
The rb tree covers the last corner case, where a latency-sensitive
entity can't get scheduled quickly after its wakeup.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 95 +++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 1 +
4 files changed, 95 insertions(+), 3 deletions(-)
Comments
On Fri, Jan 13, 2023 at 03:12:33PM +0100, Vincent Guittot wrote:
> @@ -12552,8 +12634,15 @@ int sched_group_set_latency(struct task_group *tg, s64 latency)
>
> for_each_possible_cpu(i) {
> struct sched_entity *se = tg->se[i];
> + struct rq *rq = cpu_rq(i);
> + struct rq_flags rf;
> +
> + rq_lock_irqsave(rq, &rf);
>
> + __dequeue_latency(se->cfs_rq, se);
> WRITE_ONCE(se->latency_offset, latency);
> +
> + rq_unlock_irqrestore(rq, &rf);
> }
This seems asymmetric; maybe something like:
	queued = __dequeue_latency(..);
	WRITE_ONCE(...);
	if (queued)
		__enqueue_latency(...);
?
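A minimal sketch of the symmetric shape suggested here, assuming
__dequeue_latency() is changed to return whether the entity was
actually queued on the latency rb tree (the posted helper returns
void), and passing ENQUEUE_WAKEUP so the requeue skips the delta_exec
filter; neither detail is part of the posted patch:

static int __dequeue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	if (RB_EMPTY_NODE(&se->latency_node))
		return 0;

	rb_erase_cached(&se->latency_node, &cfs_rq->latency_timeline);
	RB_CLEAR_NODE(&se->latency_node);
	return 1;
}

/* ...then in sched_group_set_latency(), under rq_lock_irqsave(): */
	int queued = __dequeue_latency(se->cfs_rq, se);

	WRITE_ONCE(se->latency_offset, latency);
	if (queued)
		__enqueue_latency(se->cfs_rq, se, ENQUEUE_WAKEUP);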
On Tue, 21 Feb 2023 at 16:12, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jan 13, 2023 at 03:12:33PM +0100, Vincent Guittot wrote:
> > @@ -12552,8 +12634,15 @@ int sched_group_set_latency(struct task_group *tg, s64 latency)
> >
> > for_each_possible_cpu(i) {
> > struct sched_entity *se = tg->se[i];
> > + struct rq *rq = cpu_rq(i);
> > + struct rq_flags rf;
> > +
> > + rq_lock_irqsave(rq, &rf);
> >
> > + __dequeue_latency(se->cfs_rq, se);
> > WRITE_ONCE(se->latency_offset, latency);
> > +
> > + rq_unlock_irqrestore(rq, &rf);
> > }
>
> This seems asymmetric; maybe something like:
>
>	queued = __dequeue_latency(..);
>	WRITE_ONCE(...);
>	if (queued)
>		__enqueue_latency(...);
>
> ?
Fair enough
On Fri, Jan 13, 2023 at 03:12:33PM +0100, Vincent Guittot wrote:
> +static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +{
> +
> + /* Only latency sensitive entity can be added to the list */
> + if (se->latency_offset >= 0)
> + return;
> +
> + if (!RB_EMPTY_NODE(&se->latency_node))
> + return;
> +
> + /*
> + * An execution time less than sysctl_sched_min_granularity means that
> + * the entity has been preempted by a higher sched class or an entity
> + * with higher latency constraint.
> + * Put it back in the list so it gets a chance to run 1st during the
> + * next slice.
> + */
> + if (!(flags & ENQUEUE_WAKEUP)) {
> + u64 delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> +
> + if (delta_exec >= sysctl_sched_min_granularity)
> + return;
> + }
I'm not a big fan of this dynamic enqueueing condition; it makes it
rather hard to interpret the below addition to pick_next_entity().
Let me think about this more... at the very least the comment with
__pick_first_latency() use below needs to be expanded upon if we keep it
like so.
> +
> + rb_add_cached(&se->latency_node, &cfs_rq->latency_timeline, __latency_less);
> +}
> @@ -4966,7 +5040,7 @@ static struct sched_entity *
> pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> {
> struct sched_entity *left = __pick_first_entity(cfs_rq);
> - struct sched_entity *se;
> + struct sched_entity *latency, *se;
>
> /*
> * If curr is set we have to see if its left of the leftmost entity
> @@ -5008,6 +5082,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> se = cfs_rq->last;
> }
>
> + /* Check for latency sensitive entity waiting for running */
> + latency = __pick_first_latency(cfs_rq);
> + if (latency && (latency != se) &&
> + wakeup_preempt_entity(latency, se) < 1)
> + se = latency;
I'm not quite sure why this condition isn't sufficient on its own.
After all, if a task does a 'spurious' nanosleep it can get around the
'restriction' in __enqueue_latency() without any great penalty to its
actual bandwidth consumption.
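To make the gaming concern concrete, here is a userspace illustration
(not kernel code; the durations are invented) of a task that keeps
re-entering the latency rb tree by turning every enqueue into a
wakeup:

#include <time.h>

/* Busy-loop for roughly ns nanoseconds. */
static void spin_for_ns(long ns)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000000000L +
		 (now.tv_nsec - start.tv_nsec) < ns);
}

int main(void)
{
	const struct timespec tiny = { 0, 1 };

	for (;;) {
		spin_for_ns(2 * 1000 * 1000);	/* burn some CPU... */
		/*
		 * ...then a 'spurious' nanosleep: the next enqueue is
		 * an ENQUEUE_WAKEUP, so the delta_exec filter in
		 * __enqueue_latency() never applies.
		 */
		nanosleep(&tiny, NULL);
	}
}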
On Wed, 22 Feb 2023 at 10:50, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jan 13, 2023 at 03:12:33PM +0100, Vincent Guittot wrote:
>
> > +static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > +{
> > +
> > + /* Only latency sensitive entity can be added to the list */
> > + if (se->latency_offset >= 0)
> > + return;
> > +
> > + if (!RB_EMPTY_NODE(&se->latency_node))
> > + return;
> > +
> > + /*
> > + * An execution time less than sysctl_sched_min_granularity means that
> > + * the entity has been preempted by a higher sched class or an entity
> > + * with higher latency constraint.
> > + * Put it back in the list so it gets a chance to run 1st during the
> > + * next slice.
> > + */
> > + if (!(flags & ENQUEUE_WAKEUP)) {
> > + u64 delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > +
> > + if (delta_exec >= sysctl_sched_min_granularity)
> > + return;
> > + }
>
> I'm not a big fan of this dynamic enqueueing condition; it makes it
> rather hard to interpret the below addition to pick_next_entity().
>
> Let me think about this more... at the very least the comment with
> __pick_first_latency() use below needs to be expanded upon if we keep it
> like so.
Only waking tasks should be added to the latency rb tree so they
can be selected to run first (as long as they don't use too much
runtime). But task A can wake up, preempt the current task B thanks to
its latency nice, start to run for a few usecs, and then be immediately
preempted by an RT task C, as an example. In this case, we consider that
task A didn't get a chance to run after its wakeup, and we put it
back into the latency rb tree just as if task A had just woken up but
didn't preempt the new current task C.
I have used sysctl_sched_min_granularity as this is the minimum runtime
for a task before it can possibly be preempted at tick by another cfs
task with a lower vruntime, so if it ran for less than
sysctl_sched_min_granularity we are sure that it was preempted by
higher-priority tasks and not because it used up all its runtime
compared to the others.
>
> > +
> > + rb_add_cached(&se->latency_node, &cfs_rq->latency_timeline, __latency_less);
> > +}
>
> > @@ -4966,7 +5040,7 @@ static struct sched_entity *
> > pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > {
> > struct sched_entity *left = __pick_first_entity(cfs_rq);
> > - struct sched_entity *se;
> > + struct sched_entity *latency, *se;
> >
> > /*
> > * If curr is set we have to see if its left of the leftmost entity
> > @@ -5008,6 +5082,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > se = cfs_rq->last;
> > }
> >
> > + /* Check for latency sensitive entity waiting for running */
> > + latency = __pick_first_latency(cfs_rq);
> > + if (latency && (latency != se) &&
> > + wakeup_preempt_entity(latency, se) < 1)
> > + se = latency;
>
> I'm not quite sure why this condition isn't sufficient on its own.
> After all, if a task does a 'spurious' nanosleep it can get around the
> 'restriction' in __enqueue_latency() without any great penalty to its
> actual bandwidth consumption.
Yes, until it has used all of its runtime.
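Restated as a standalone predicate, the heuristic Vincent describes
above looks like this (a sketch only; se_was_preempted_early() is an
invented name, and the body mirrors the __enqueue_latency() logic in
the patch below):

static bool se_was_preempted_early(struct sched_entity *se)
{
	/*
	 * Less than one min_granularity of execution since the entity
	 * was last picked means it cannot have consumed its slice; it
	 * must have been preempted by a higher sched class or a more
	 * latency-sensitive entity.
	 */
	u64 delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;

	return delta_exec < sysctl_sched_min_granularity;
}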
On Wed, Feb 22, 2023 at 12:16:29PM +0100, Vincent Guittot wrote:
> On Wed, 22 Feb 2023 at 10:50, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Jan 13, 2023 at 03:12:33PM +0100, Vincent Guittot wrote:
> >
> > > +static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > > +{
> > > +
> > > + /* Only latency sensitive entity can be added to the list */
> > > + if (se->latency_offset >= 0)
> > > + return;
> > > +
> > > + if (!RB_EMPTY_NODE(&se->latency_node))
> > > + return;
> > > +
> > > + /*
> > > + * An execution time less than sysctl_sched_min_granularity means that
> > > + * the entity has been preempted by a higher sched class or an entity
> > > + * with higher latency constraint.
> > > + * Put it back in the list so it gets a chance to run 1st during the
> > > + * next slice.
> > > + */
> > > + if (!(flags & ENQUEUE_WAKEUP)) {
> > > + u64 delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > > +
> > > + if (delta_exec >= sysctl_sched_min_granularity)
> > > + return;
> > > + }
> >
> > I'm not a big fan of this dynamic enqueueing condition; it makes it
> > rather hard to interpret the below addition to pick_next_entity().
> >
> > Let me think about this more... at the very least the comment with
> > __pick_first_latency() use below needs to be expanded upon if we keep it
> > like so.
>
> Only waking tasks should be added to the latency rb tree so they
But that's what I'm saying, you can game this by doing super short
sleeps every min_gran.
> can be selected to run first (as long as they don't use too much
> runtime). But task A can wake up, preempt the current task B thanks to
> its latency nice, start to run for a few usecs, and then be immediately
> preempted by an RT task C, as an example. In this case, we consider that
> task A didn't get a chance to run after its wakeup, and we put it
> back into the latency rb tree just as if task A had just woken up but
> didn't preempt the new current task C.
So ideally, and this is where I'm very slow with thinking, that
wakeup_preempt_entity() condition here:
> > > @@ -5008,6 +5082,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > se = cfs_rq->last;
> > > }
> > >
> > > + /* Check for latency sensitive entity waiting for running */
> > > + latency = __pick_first_latency(cfs_rq);
> > > + if (latency && (latency != se) &&
> > > + wakeup_preempt_entity(latency, se) < 1)
> > > + se = latency;
should be sufficient to provide fair bandwidth usage. The EEVDF paper
achieves this by selecting the leftmost eligible task, where eligibility
is dependent on negative lag. Only those tasks that are behind the pack
are allowed runtime.
Now clearly our min_vruntime is unsuited for that exact scheme, but iirc
wakeup_preempt_entity() does not allow for starvation, so we should be
good, even without that weird condition in __enqueue_latency(), hmm?
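For reference, the eligibility notion from the EEVDF paper that Peter
alludes to can be sketched as below; entity_eligible() is an invented
name, and avg_vruntime stands in for the load-weighted average
vruntime of the queue that the paper's scheme needs (and that CFS's
min_vruntime does not provide):

static bool entity_eligible(struct sched_entity *se, u64 avg_vruntime)
{
	/*
	 * An entity is eligible when it has received no more service
	 * than the queue average, i.e. it is "behind the pack".
	 */
	return (s64)(avg_vruntime - se->vruntime) >= 0;
}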
On Mon, 27 Feb 2023 at 14:29, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 22, 2023 at 12:16:29PM +0100, Vincent Guittot wrote:
> > On Wed, 22 Feb 2023 at 10:50, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Jan 13, 2023 at 03:12:33PM +0100, Vincent Guittot wrote:
> > >
> > > > +static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > > > +{
> > > > +
> > > > + /* Only latency sensitive entity can be added to the list */
> > > > + if (se->latency_offset >= 0)
> > > > + return;
> > > > +
> > > > + if (!RB_EMPTY_NODE(&se->latency_node))
> > > > + return;
> > > > +
> > > > + /*
> > > > + * An execution time less than sysctl_sched_min_granularity means that
> > > > + * the entity has been preempted by a higher sched class or an entity
> > > > + * with higher latency constraint.
> > > > + * Put it back in the list so it gets a chance to run 1st during the
> > > > + * next slice.
> > > > + */
> > > > + if (!(flags & ENQUEUE_WAKEUP)) {
> > > > + u64 delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > > > +
> > > > + if (delta_exec >= sysctl_sched_min_granularity)
> > > > + return;
> > > > + }
> > >
> > > I'm not a big fan of this dynamic enqueueing condition; it makes it
> > > rather hard to interpret the below addition to pick_next_entity().
> > >
> > > Let me think about this more... at the very least the comment with
> > > __pick_first_latency() use below needs to be expanded upon if we keep it
> > > like so.
> >
> > Only waking tasks should be added to the latency rb tree so they
>
> But that's what I'm saying, you can game this by doing super short
> sleeps every min_gran.
Yes, it is for a time. But I'm not sure this will always be beneficial
in the end, because most of the time you will get your full slice
without playing this game.
>
> > can be selected to run first (as long as they don't use too much
> > runtime). But task A can wake up, preempt the current task B thanks to
> > its latency nice, start to run for a few usecs, and then be immediately
> > preempted by an RT task C, as an example. In this case, we consider that
> > task A didn't get a chance to run after its wakeup, and we put it
> > back into the latency rb tree just as if task A had just woken up but
> > didn't preempt the new current task C.
>
> So ideally, and this is where I'm very slow with thinking, that
> wakeup_preempt_entity() condition here:
>
> > > > @@ -5008,6 +5082,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > > se = cfs_rq->last;
> > > > }
> > > >
> > > > + /* Check for latency sensitive entity waiting for running */
> > > > + latency = __pick_first_latency(cfs_rq);
> > > > + if (latency && (latency != se) &&
> > > > + wakeup_preempt_entity(latency, se) < 1)
> > > > + se = latency;
>
> should be sufficient to provide fair bandwidth usage. The EEVDF paper
> achieves this by selecting the leftmost eligible task, where eligibility
> is dependent on negative lag. Only those tasks that are behind the pack
> are allowed runtime.
>
> Now clearly our min_vruntime is unsuited for that exact scheme, but iirc
> wakeup_preempt_entity() does not allow for starvation, so we should be
> good, even without that weird condition in __enqueue_latency(), hmm?
If we unconditionally __enqueue_latency() the task, then it can end up
providing more bandwidth to those tasks, because it's like having a
larger sleep credit than the others.
The original condition in __enqueue_latency() was:
	if (!(flags & ENQUEUE_WAKEUP)) {
		return;
	}
So that task gets a chance to preempt others only at wakeup.
But then, I have seen such tasks being preempted immediately by RT
tasks and, as a result, losing their latency advantage. Maybe I should
keep the condition above and add the weird condition in a separate
patch with dedicated figures.
@@ -548,6 +548,7 @@ struct sched_entity {
/* For load-balancing: */
struct load_weight load;
struct rb_node run_node;
+ struct rb_node latency_node;
struct list_head group_node;
unsigned int on_rq;
@@ -4390,6 +4390,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.nr_migrations = 0;
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);
+ RB_CLEAR_NODE(&p->se.latency_node);
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
@@ -680,7 +680,76 @@ struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
return __node_2_se(last);
}
+#endif
+
+/**************************************************************
+ * Scheduling class tree data structure manipulation methods:
+ * for latency
+ */
+
+static inline bool latency_before(struct sched_entity *a,
+ struct sched_entity *b)
+{
+ return (s64)(a->vruntime + a->latency_offset - b->vruntime - b->latency_offset) < 0;
+}
+
+#define __latency_node_2_se(node) \
+ rb_entry((node), struct sched_entity, latency_node)
+
+static inline bool __latency_less(struct rb_node *a, const struct rb_node *b)
+{
+ return latency_before(__latency_node_2_se(a), __latency_node_2_se(b));
+}
+
+/*
+ * Enqueue an entity into the latency rb-tree:
+ */
+static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+
+ /* Only latency sensitive entity can be added to the list */
+ if (se->latency_offset >= 0)
+ return;
+
+ if (!RB_EMPTY_NODE(&se->latency_node))
+ return;
+
+ /*
+ * An execution time less than sysctl_sched_min_granularity means that
+ * the entity has been preempted by a higher sched class or an entity
+ * with higher latency constraint.
+ * Put it back in the list so it gets a chance to run 1st during the
+ * next slice.
+ */
+ if (!(flags & ENQUEUE_WAKEUP)) {
+ u64 delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+ if (delta_exec >= sysctl_sched_min_granularity)
+ return;
+ }
+
+ rb_add_cached(&se->latency_node, &cfs_rq->latency_timeline, __latency_less);
+}
+
+static void __dequeue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ if (!RB_EMPTY_NODE(&se->latency_node)) {
+ rb_erase_cached(&se->latency_node, &cfs_rq->latency_timeline);
+ RB_CLEAR_NODE(&se->latency_node);
+ }
+}
+
+static struct sched_entity *__pick_first_latency(struct cfs_rq *cfs_rq)
+{
+ struct rb_node *left = rb_first_cached(&cfs_rq->latency_timeline);
+
+ if (!left)
+ return NULL;
+
+ return __latency_node_2_se(left);
+}
+#ifdef CONFIG_SCHED_DEBUG
/**************************************************************
* Scheduling class statistics methods:
*/
@@ -4751,8 +4820,10 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
check_schedstat_required();
update_stats_enqueue_fair(cfs_rq, se, flags);
check_spread(cfs_rq, se);
- if (!curr)
+ if (!curr) {
__enqueue_entity(cfs_rq, se);
+ __enqueue_latency(cfs_rq, se, flags);
+ }
se->on_rq = 1;
if (cfs_rq->nr_running == 1) {
@@ -4838,8 +4909,10 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
clear_buddies(cfs_rq, se);
- if (se != cfs_rq->curr)
+ if (se != cfs_rq->curr) {
__dequeue_entity(cfs_rq, se);
+ __dequeue_latency(cfs_rq, se);
+ }
se->on_rq = 0;
account_entity_dequeue(cfs_rq, se);
@@ -4928,6 +5001,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
update_stats_wait_end_fair(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
+ __dequeue_latency(cfs_rq, se);
update_load_avg(cfs_rq, se, UPDATE_TG);
}
@@ -4966,7 +5040,7 @@ static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
struct sched_entity *left = __pick_first_entity(cfs_rq);
- struct sched_entity *se;
+ struct sched_entity *latency, *se;
/*
* If curr is set we have to see if its left of the leftmost entity
@@ -5008,6 +5082,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
se = cfs_rq->last;
}
+ /* Check for latency sensitive entity waiting for running */
+ latency = __pick_first_latency(cfs_rq);
+ if (latency && (latency != se) &&
+ wakeup_preempt_entity(latency, se) < 1)
+ se = latency;
+
return se;
}
@@ -5031,6 +5111,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
update_stats_wait_start_fair(cfs_rq, prev);
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
+ __enqueue_latency(cfs_rq, prev, 0);
/* in !on_rq case, update occurred at dequeue */
update_load_avg(cfs_rq, prev, 0);
}
@@ -12244,6 +12325,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
void init_cfs_rq(struct cfs_rq *cfs_rq)
{
cfs_rq->tasks_timeline = RB_ROOT_CACHED;
+ cfs_rq->latency_timeline = RB_ROOT_CACHED;
u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20)));
#ifdef CONFIG_SMP
raw_spin_lock_init(&cfs_rq->removed.lock);
@@ -12552,8 +12634,15 @@ int sched_group_set_latency(struct task_group *tg, s64 latency)
for_each_possible_cpu(i) {
struct sched_entity *se = tg->se[i];
+ struct rq *rq = cpu_rq(i);
+ struct rq_flags rf;
+
+ rq_lock_irqsave(rq, &rf);
+ __dequeue_latency(se->cfs_rq, se);
WRITE_ONCE(se->latency_offset, latency);
+
+ rq_unlock_irqrestore(rq, &rf);
}
mutex_unlock(&shares_mutex);
@@ -575,6 +575,7 @@ struct cfs_rq {
#endif
struct rb_root_cached tasks_timeline;
+ struct rb_root_cached latency_timeline;
/*
* 'curr' points to currently running entity on this cfs_rq.
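As a closing aside, the ordering that latency_before() imposes on the
latency_timeline can be checked with a standalone userspace snippet
(plain C; the values are invented): a latency-sensitive entity with a
larger vruntime can still sort ahead of a neutral one, because its
negative latency_offset shifts its key to the left.

#include <stdio.h>
#include <stdint.h>

struct ent { int64_t vruntime; int64_t latency_offset; };

static int latency_before(const struct ent *a, const struct ent *b)
{
	return (int64_t)(a->vruntime + a->latency_offset -
			 b->vruntime - b->latency_offset) < 0;
}

int main(void)
{
	struct ent a = { .vruntime = 1000000, .latency_offset = -500000 };
	struct ent b = { .vruntime =  800000, .latency_offset = 0 };

	/* a's key is 500000 and b's is 800000, so a sorts first. */
	printf("a before b: %d\n", latency_before(&a, &b));	/* 1 */
	return 0;
}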