[23/30] sched/fair: handle tick expiry under lazy preemption

Message ID 20240213055554.1802415-24-ankur.a.arora@oracle.com
State New
Series: PREEMPT_AUTO: support lazy rescheduling

Commit Message

Ankur Arora Feb. 13, 2024, 5:55 a.m. UTC
The default policy for lazy scheduling is to schedule in exit-to-user,
assuming that would happen within the remaining time quanta of the
task.

However, that runs into the 'hog' problem -- the target task might
be running in the kernel and might not relinquish CPU on its own.

Handle that by upgrading the ignored tif_resched(NR_lazy) bit to
tif_resched(NR_now) at the next tick.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

---
Note:
  Instead of special casing the tick, it might be simpler to always
  do the upgrade on the second resched_curr().

  The theoretical problem with doing that is that the current
  approach deterministically provides a well-defined extra unit of
  time. Going with a second resched_curr() might mean that the
  amount of extra time the task gets depends on the vagaries of
  the incoming resched_curr() (which is fine if it's mostly from
  the tick; not fine if we could get it due to other reasons.)

  Practically, both performed equally well in my tests.
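
  For concreteness, here is the difference between the two, expressed
  against resched_opt_translate() from the patch below. The first hunk
  is what this patch does; the second is only a sketch of the
  second-resched_curr() idea, not something the series implements:

      /* This patch: upgrade lazy -> eager only from the tick path. */
      if (opt == RESCHED_TICK &&
          unlikely(test_tsk_need_resched(curr, NR_lazy)))
              return NR_now;

      /* Sketch: upgrade on any repeated resched_curr(), whatever its
       * source. Simpler, but the amount of extra time granted before
       * the upgrade then depends on where the second call comes from.
       */
      if (unlikely(test_tsk_need_resched(curr, NR_lazy)))
              return NR_now;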

  Thoughts?

 kernel/sched/core.c     | 8 ++++++++
 kernel/sched/deadline.c | 2 +-
 kernel/sched/fair.c     | 2 +-
 kernel/sched/rt.c       | 2 +-
 kernel/sched/sched.h    | 6 ++++++
 5 files changed, 17 insertions(+), 3 deletions(-)
  

Comments

Steven Rostedt Feb. 21, 2024, 9:38 p.m. UTC | #1
On Mon, 12 Feb 2024 21:55:47 -0800
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> The default policy for lazy scheduling is to schedule in exit-to-user,
> assuming that would happen within the remaining time quanta of the
> task.
> 
> However, that runs into the 'hog' problem -- the target task might
> be running in the kernel and might not relinquish CPU on its own.
> 
> Handle that by upgrading the ignored tif_resched(NR_lazy) bit to
> tif_resched(NR_now) at the next tick.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> 
> ---
> Note:
>   Instead of special casing the tick, it might be simpler to always
>   do the upgrade on the second resched_curr().
> 
>   The theoretical problem with doing that is that the current
>   approach deterministically provides a well-defined extra unit of
>   time. Going with a second resched_curr() might mean that the
>   amount of extra time the task gets depends on the vagaries of
>   the incoming resched_curr() (which is fine if it's mostly from
>   the tick; not fine if we could get it due to other reasons.)
> 
>   Practically, both performed equally well in my tests.
> 
>   Thoughts?

I personally prefer the determinism of using the tick to force the resched.

-- Steve
  
Juri Lelli Feb. 28, 2024, 1:47 p.m. UTC | #2
Hi Ankur,

On 12/02/24 21:55, Ankur Arora wrote:
> The default policy for lazy scheduling is to schedule in exit-to-user,
> assuming that would happen within the remaining time quanta of the
> task.
> 
> However, that runs into the 'hog' problem -- the target task might
> be running in the kernel and might not relinquish CPU on its own.
> 
> Handle that by upgrading the ignored tif_resched(NR_lazy) bit to
> tif_resched(NR_now) at the next tick.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Originally-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> 
> ---
> Note:
>   Instead of special casing the tick, it might be simpler to always
>   do the upgrade on the second resched_curr().
> 
>   The theoretical problem with doing that is that the current
>   approach deterministically provides a well-defined extra unit of
>   time. Going with a second resched_curr() might mean that the
>   amount of extra time the task gets depends on the vagaries of
>   the incoming resched_curr() (which is fine if it's mostly from
>   the tick; not fine if we could get it due to other reasons.)
> 
>   Practically, both performed equally well in my tests.
> 
>   Thoughts?

I'm still digesting the series, so I could simply be confused, but I
have the impression that the extra unit of time might be a problem for
deadline (and maybe rt as well?).

For deadline we call resched_curr_tick() from the throttle part of
update_curr_dl_se() if the dl_se happens to not be the leftmost anymore,
so in this case I believe we really want to reschedule straight away and
not wait for the second time around (otherwise we might be breaking the
new leftmost task's guarantees)?
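
For reference, the path I mean, heavily simplified from the
update_curr_dl_se() throttle branch touched by the hunk below
(elisions mine):

    if (dl_runtime_exceeded(dl_se)) {
            /* ... throttle and dequeue dl_se ... */

            if (!is_leftmost(dl_se, &rq->dl))
                    resched_curr_tick(rq);  /* with this patch the
                                             * first tick only sets
                                             * NR_lazy, so the new
                                             * leftmost task can wait
                                             * up to a full tick */
    }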

Thanks,
Juri
  
Ankur Arora Feb. 29, 2024, 6:43 a.m. UTC | #3
Juri Lelli <juri.lelli@redhat.com> writes:

> Hi Ankur,
>
> On 12/02/24 21:55, Ankur Arora wrote:
>> The default policy for lazy scheduling is to schedule in exit-to-user,
>> assuming that would happen within the remaining time quanta of the
>> task.
>>
>> However, that runs into the 'hog' problem -- the target task might
>> be running in the kernel and might not relinquish CPU on its own.
>>
>> Handle that by upgrading the ignored tif_resched(NR_lazy) bit to
>> tif_resched(NR_now) at the next tick.
>>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Originally-by: Thomas Gleixner <tglx@linutronix.de>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>
>> ---
>> Note:
>>   Instead of special casing the tick, it might be simpler to always
>>   do the upgrade on the second resched_curr().
>>
>>   The theoretical problem with doing that is that the current
>>   approach deterministically provides a well-defined extra unit of
>>   time. Going with a second resched_curr() might mean that the
>>   amount of extra time the task gets depends on the vagaries of
>>   the incoming resched_curr() (which is fine if it's mostly from
>>   the tick; not fine if we could get it due to other reasons.)
>>
>>   Practically, both performed equally well in my tests.
>>
>>   Thoughts?
>
> I'm still digesting the series, so I could simply be confused, but I
> have the impression that the extra unit of time might be a problem for
> deadline (and maybe rt as well?).
>
> For deadline we call resched_curr_tick() from the throttle part of
> update_curr_dl_se() if the dl_se happens to not be the leftmost anymore,
> so in this case I believe we really want to reschedule straight away and
> not wait for the second time around (otherwise we might be breaking the
> new leftmost task's guarantees)?

Yes, agreed, this looks like it breaks the deadline invariant for both
preempt=none and preempt=voluntary.

For RT, update_curr_rt() seems to have a similar problem if the task
doesn't have RUNTIME_INF.

Relatedly, do you think there's a similar problem when switching to
a task with a higher scheduling class?
(The related code is in patches 25 and 26.)

For preempt=voluntary, wakeup_preempt() will do the right thing, but
for preempt=none, we only reschedule lazily so the target might
continue to run until the end of the tick.
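
Concretely, in terms of resched_opt_translate() below: a wakeup-time
resched_curr() comes in as RESCHED_DEFAULT, so it only sets
tif_resched(NR_lazy), and the target runs on until it exits to user
or the tick upgrades the bit. Roughly:

    resched_curr(rq);            /* RESCHED_DEFAULT -> NR_lazy: no
                                  * immediate preemption of current */

    /* ... up to one full tick later ... */

    task_tick_fair()
      -> resched_curr_tick(rq);  /* RESCHED_TICK with NR_lazy already
                                  * set -> upgraded to NR_now */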

Thanks for the review, btw.

--
ankur
  
Juri Lelli Feb. 29, 2024, 9:33 a.m. UTC | #4
On 28/02/24 22:43, Ankur Arora wrote:
> Juri Lelli <juri.lelli@redhat.com> writes:

..

> > For deadline we call resched_curr_tick() from the throttle part of
> > update_curr_dl_se() if the dl_se happens to not be the leftmost anymore,
> > so in this case I believe we really want to reschedule straight away and
> > not wait for the second time around (otherwise we might be breaking the
> > new leftmost task's guarantees)?
> 
> Yes, agreed, this looks like it breaks the deadline invariant for both
> preempt=none and preempt=voluntary.
> 
> For RT, update_curr_rt() seems to have a similar problem if the task
> doesn't have RUNTIME_INF.
> 
> Relatedly, do you think there's a similar problem when switching to
> a task with a higher scheduling class?
> (The related code is in patches 25 and 26.)
> 
> For preempt=voluntary, wakeup_preempt() will do the right thing, but

Right.

> for preempt=none, we only reschedule lazily so the target might
> continue to run until the end of the tick.

Hummm, not sure honestly, but I seem to understand that with
preempt=none we want to be super conservative wrt preemptions, so maybe
current behavior (1 tick of laziness) is OK? Otherwise what would be the
difference wrt preempt=voluntary from a scheduler PoV? Yes, it might
break deadline guarantees, but if you wanted to use preempt=none maybe
there is a strong reason for it, I'm thinking.

> Thanks for the review, btw.

Sure. Thanks for working on this actually! :)

Best,
Juri
  
Ankur Arora Feb. 29, 2024, 11:54 p.m. UTC | #5
Juri Lelli <juri.lelli@redhat.com> writes:

> On 28/02/24 22:43, Ankur Arora wrote:
>> Juri Lelli <juri.lelli@redhat.com> writes:
>
> ..
>
>> > For deadline we call resched_curr_tick() from the throttle part of
>> > update_curr_dl_se() if the dl_se happens to not be the leftmost anymore,
>> > so in this case I believe we really want to reschedule straight away and
>> > not wait for the second time around (otherwise we might be breaking the
>> > new leftmost task's guarantees)?
>>
>> Yes, agreed, this looks like it breaks the deadline invariant for both
>> preempt=none and preempt=voluntary.
>>
>> For RT, update_curr_rt() seems to have a similar problem if the task
>> doesn't have RUNTIME_INF.
>>
>> Relatedly, do you think there's a similar problem when switching to
>> a task with a higher scheduling class?
>> (The related code is in patches 25 and 26.)
>>
>> For preempt=voluntary, wakeup_preempt() will do the right thing, but
>
> Right.
>
>> for preempt=none, we only reschedule lazily so the target might
>> continue to run until the end of the tick.
>
> Hummm, not sure honestly, but I seem to understand that with
> preempt=none we want to be super conservative wrt preemptions, so maybe
> current behavior (1 tick of laziness) is OK? Otherwise what would be the

Yeah, that's kind of where I'm thinking of getting to. Be lazy so long
as we don't violate guarantees.

> difference wrt preempt=voluntary from a scheduler PoV? Yes, it might
> break deadline guarantees, but if you wanted to use preempt=none maybe
> there is a strong reason for it, I'm thinking.

Yeah, the difference between preempt=none and preempt=voluntary is
looking narrower and narrower, and maybe a bit artificial in that
there seem to be very few cases where the two models would actually
differ in behaviour.
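
One possible shape for "lazy unless guarantees are at stake", purely
illustrative and not part of the posted series: have
resched_opt_translate() treat RT and deadline tasks eagerly:

    /* Hypothetical: never be lazy when current is an RT or deadline
     * task, so their invariants don't wait on the next tick.
     */
    if (task_has_rt_policy(curr) || task_has_dl_policy(curr))
            return NR_now;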

Thanks
Ankur

>> Thanks for the review, btw.
>
> Sure. Thanks for working on this actually! :)
>
> Best,
> Juri
  
Paul E. McKenney March 1, 2024, 12:28 a.m. UTC | #6
On Thu, Feb 29, 2024 at 03:54:42PM -0800, Ankur Arora wrote:
> 
> Juri Lelli <juri.lelli@redhat.com> writes:
> 
> > On 28/02/24 22:43, Ankur Arora wrote:
> >> Juri Lelli <juri.lelli@redhat.com> writes:
> >
> > ..
> >
> >> > For deadline we call resched_curr_tick() from the throttle part of
> >> > update_curr_dl_se() if the dl_se happens to not be the leftmost anymore,
> >> > so in this case I believe we really want to reschedule straight away and
> >> > not wait for the second time around (otherwise we might be breaking the
> >> > new leftmost task's guarantees)?
> >>
> >> Yes, agreed, this looks like it breaks the deadline invariant for both
> >> preempt=none and preempt=voluntary.
> >>
> >> For RT, update_curr_rt() seems to have a similar problem if the task
> >> doesn't have RUNTIME_INF.
> >>
> >> Relatedly, do you think there's a similar problem when switching to
> >> a task with a higher scheduling class?
> >> (The related code is in patches 25 and 26.)
> >>
> >> For preempt=voluntary, wakeup_preempt() will do the right thing, but
> >
> > Right.
> >
> >> for preempt=none, we only reschedule lazily so the target might
> >> continue to run until the end of the tick.
> >
> > Hummm, not sure honestly, but I seem to understand that with
> > preempt=none we want to be super conservative wrt preemptions, so maybe
> > current behavior (1 tick of laziness) is OK? Otherwise what would be the
> 
> Yeah, that's kind of where I'm thinking of getting to. Be lazy so long
> as we don't violate guarantees.
> 
> > difference wrt preempt=voluntary from a scheduler PoV? Yes, it might
> > break deadline guarantees, but if you wanted to use preempt=none maybe
> > there is a strong reason for it, I'm thinking.
> 
> Yeah, the difference between preempt=none and preempt=voluntary is
> looking narrower and narrower, and maybe a bit artificial in that
> there seem to be very few cases where the two models would actually
> differ in behaviour.

If it turns out that cond_resched() and the preemption points in
might_sleep() really can be dispensed with, then there would be little
difference between them.  But that is still "if".  ;-)

							Thanx, Paul

> Thanks
> Ankur
> 
> >> Thanks for the review, btw.
> >
> > Sure. Thanks for working on this actually! :)
> >
> > Best,
> > Juri
  

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index de963e8e2bee..5df59a4548dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1038,6 +1038,10 @@  void wake_up_q(struct wake_q_head *head)
  * For PREEMPT_AUTO: schedule idle threads eagerly, allow everything
  * else, whether running in user or kernel context, to finish its time
  * quanta, and mark for rescheduling at the next exit to user.
+ *
+ * Note: to avoid the hog problem, where the user does not relinquish
+ * CPU even after its time quanta has expired, upgrade lazy to eager
+ * on the second tick.
  */
 static resched_t resched_opt_translate(struct task_struct *curr,
 				       enum resched_opt opt)
@@ -1051,6 +1055,10 @@  static resched_t resched_opt_translate(struct task_struct *curr,
 	if (is_idle_task(curr))
 		return NR_now;
 
+	if (opt == RESCHED_TICK &&
+	    unlikely(test_tsk_need_resched(curr, NR_lazy)))
+		return NR_now;
+
 	return NR_lazy;
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b4e68cfc3c3a..b935e634fbf8 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1379,7 +1379,7 @@  static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 		}
 
 		if (!is_leftmost(dl_se, &rq->dl))
-			resched_curr(rq);
+			resched_curr_tick(rq);
 	}
 
 	/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 278eebe6656a..92910b721adb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12621,7 +12621,7 @@  static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	}
 
 	if (resched)
-		resched_curr(rq);
+		resched_curr_tick(rq);
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index c57cc8427a57..1a2f3524d0eb 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1023,7 +1023,7 @@  static void update_curr_rt(struct rq *rq)
 			rt_rq->rt_time += delta_exec;
 			exceeded = sched_rt_runtime_exceeded(rt_rq);
 			if (exceeded)
-				resched_curr(rq);
+				resched_curr_tick(rq);
 			raw_spin_unlock(&rt_rq->rt_runtime_lock);
 			if (exceeded)
 				do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3600a8673d08..c7e7acab1022 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2465,6 +2465,7 @@  extern void reweight_task(struct task_struct *p, int prio);
 enum resched_opt {
 	RESCHED_DEFAULT,
 	RESCHED_FORCE,
+	RESCHED_TICK,
 };
 
 extern void __resched_curr(struct rq *rq, enum resched_opt opt);
@@ -2474,6 +2475,11 @@  static inline void resched_curr(struct rq *rq)
 	__resched_curr(rq, RESCHED_DEFAULT);
 }
 
+static inline void resched_curr_tick(struct rq *rq)
+{
+	__resched_curr(rq, RESCHED_TICK);
+}
+
 extern void resched_cpu(int cpu);
 
 extern struct rt_bandwidth def_rt_bandwidth;