[RFC,41/86] sched: handle resched policy in resched_curr()

Message ID 20231107215742.363031-42-ankur.a.arora@oracle.com
State New
Series Make the kernel preemptible

Commit Message

Ankur Arora Nov. 7, 2023, 9:57 p.m. UTC
  One of the last ports of call before rescheduling is triggered
is resched_curr().

Its task is to set TIF_NEED_RESCHED and, if running locally,
either fold it into the preempt_count, or send a resched-IPI so
the target CPU folds it in.

To handle TIF_NEED_RESCHED_LAZY -- since the reschedule is not
imminent -- it only needs to set the appropriate bit.

Move all of the underlying mechanism into __resched_curr(), and define
resched_curr(), which handles the policy on when to set which
need-resched variant.

For now the approach is to run to completion (TIF_NEED_RESCHED_LAZY)
with the following exceptions where we always want to reschedule
at the next preemptible point (TIF_NEED_RESCHED):

 - idle: if we are polling in idle, then set_nr_if_polling() will do
   the right thing. When not polling, we force TIF_NEED_RESCHED
   and send a resched-IPI if needed.

 - the target CPU is in userspace: run to completion semantics are
   only for kernel tasks

 - running under the full preemption model

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/sched/core.c | 80 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 70 insertions(+), 10 deletions(-)
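
Condensed from the diff in the Patch section below, the policy implemented
by the new resched_curr() amounts to roughly the following sketch (the
actual hunk structures the same checks with goto labels):

	resched_t rs = RESCHED_lazy;	/* default: run to completion */

	if (IS_ENABLED(CONFIG_PREEMPT) ||		/* full preemption model */
	    rq->curr->sched_class == &idle_sched_class)	/* target is idle */
		rs = RESCHED_eager;
	else {
		int context = ct_state_cpu(cpu_of(rq));

		/* target executing in userspace or guest context */
		if (context == CONTEXT_USER || context == CONTEXT_GUEST)
			rs = RESCHED_eager;
	}

	__resched_curr(rq, rs);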
  

Comments

Peter Zijlstra Nov. 8, 2023, 9:36 a.m. UTC | #1
On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
>  }
>  
>  /*
> - * resched_curr - mark rq's current task 'to be rescheduled now'.
> + * __resched_curr - mark rq's current task 'to be rescheduled'.
>   *
> - * On UP this means the setting of the need_resched flag, on SMP it
> - * might also involve a cross-CPU call to trigger the scheduler on
> - * the target CPU.
> + * On UP this means the setting of the need_resched flag, on SMP, for
> + * eager resched it might also involve a cross-CPU call to trigger
> + * the scheduler on the target CPU.
>   */
> -void resched_curr(struct rq *rq)
> +void __resched_curr(struct rq *rq, resched_t rs)
>  {
>  	struct task_struct *curr = rq->curr;
>  	int cpu;
> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
>  	cpu = cpu_of(rq);
>  
>  	if (cpu == smp_processor_id()) {
> -		set_tsk_need_resched(curr, RESCHED_eager);
> -		set_preempt_need_resched();
> +		set_tsk_need_resched(curr, rs);
> +		if (rs == RESCHED_eager)
> +			set_preempt_need_resched();
>  		return;
>  	}
>  
> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
> -		smp_send_reschedule(cpu);
> -	else
> +	if (set_nr_and_not_polling(curr, rs)) {
> +		if (rs == RESCHED_eager)
> +			smp_send_reschedule(cpu);

I think you just broke things.

Not all idle threads have POLLING support, in which case you need that
IPI to wake them up, even if it's LAZY.

> +	} else if (rs == RESCHED_eager)
>  		trace_sched_wake_idle_without_ipi(cpu);
>  }



>  
> +/*
> + * resched_curr - mark rq's current task 'to be rescheduled' eagerly
> + * or lazily according to the current policy.
> + *
> + * Always schedule eagerly, if:
> + *
> + *  - running under full preemption
> + *
> + *  - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
> + *    force TIF_NEED_RESCHED to be set and send a resched IPI.
> + *    (the polling case has already set TIF_NEED_RESCHED via
> + *     set_nr_if_polling()).
> + *
> + *  - in userspace: run to completion semantics are only for kernel tasks
> + *
> + * Otherwise (regardless of priority), run to completion.
> + */
> +void resched_curr(struct rq *rq)
> +{
> +	resched_t rs = RESCHED_lazy;
> +	int context;
> +
> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
> +	    (rq->curr->sched_class == &idle_sched_class)) {
> +		rs = RESCHED_eager;
> +		goto resched;
> +	}
> +
> +	/*
> +	 * We might race with the target CPU while checking its ct_state:
> +	 *
> +	 * 1. The task might have just entered the kernel, but has not yet
> +	 * called user_exit(). We will see stale state (CONTEXT_USER) and
> +	 * send an unnecessary resched-IPI.
> +	 *
> +	 * 2. The user task is through with exit_to_user_mode_loop() but has
> +	 * not yet called user_enter().
> +	 *
> +	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
> +	 * schedule it lazily. There's obviously nothing that will handle
> +	 * this need-resched bit until the thread enters the kernel next.
> +	 *
> +	 * The scheduler will still do tick accounting, but a potentially
> +	 * higher priority task waited to be scheduled for a user tick,
> +	 * instead of execution time in the kernel.
> +	 */
> +	context = ct_state_cpu(cpu_of(rq));
> +	if ((context == CONTEXT_USER) ||
> +	    (context == CONTEXT_GUEST)) {
> +
> +		rs = RESCHED_eager;
> +		goto resched;
> +	}

Like said, this simply cannot be. You must not rely on the remote CPU
being in some state or not. Also, it's racy, you could observe USER and
then it enters KERNEL.

> +
> +resched:
> +	__resched_curr(rq, rs);
> +}
> +
>  void resched_cpu(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
  
Ankur Arora Nov. 8, 2023, 10:26 a.m. UTC | #2
Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
>>  }
>>
>>  /*
>> - * resched_curr - mark rq's current task 'to be rescheduled now'.
>> + * __resched_curr - mark rq's current task 'to be rescheduled'.
>>   *
>> - * On UP this means the setting of the need_resched flag, on SMP it
>> - * might also involve a cross-CPU call to trigger the scheduler on
>> - * the target CPU.
>> + * On UP this means the setting of the need_resched flag, on SMP, for
>> + * eager resched it might also involve a cross-CPU call to trigger
>> + * the scheduler on the target CPU.
>>   */
>> -void resched_curr(struct rq *rq)
>> +void __resched_curr(struct rq *rq, resched_t rs)
>>  {
>>  	struct task_struct *curr = rq->curr;
>>  	int cpu;
>> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
>>  	cpu = cpu_of(rq);
>>
>>  	if (cpu == smp_processor_id()) {
>> -		set_tsk_need_resched(curr, RESCHED_eager);
>> -		set_preempt_need_resched();
>> +		set_tsk_need_resched(curr, rs);
>> +		if (rs == RESCHED_eager)
>> +			set_preempt_need_resched();
>>  		return;
>>  	}
>>
>> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
>> -		smp_send_reschedule(cpu);
>> -	else
>> +	if (set_nr_and_not_polling(curr, rs)) {
>> +		if (rs == RESCHED_eager)
>> +			smp_send_reschedule(cpu);
>
> I think you just broke things.
>
> Not all idle threads have POLLING support, in which case you need that
> IPI to wake them up, even if it's LAZY.

Yes, I was concerned about that too. But doesn't this check against the
idle_sched_class in resched_curr() cover that?

>> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
>> +	    (rq->curr->sched_class == &idle_sched_class)) {
>> +		rs = RESCHED_eager;
>> +		goto resched;

>> +	} else if (rs == RESCHED_eager)
>>  		trace_sched_wake_idle_without_ipi(cpu);
>>  }
>
>
>
>>
>> +/*
>> + * resched_curr - mark rq's current task 'to be rescheduled' eagerly
>> + * or lazily according to the current policy.
>> + *
>> + * Always schedule eagerly, if:
>> + *
>> + *  - running under full preemption
>> + *
>> + *  - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
>> + *    force TIF_NEED_RESCHED to be set and send a resched IPI.
>> + *    (the polling case has already set TIF_NEED_RESCHED via
>> + *     set_nr_if_polling()).
>> + *
>> + *  - in userspace: run to completion semantics are only for kernel tasks
>> + *
>> + * Otherwise (regardless of priority), run to completion.
>> + */
>> +void resched_curr(struct rq *rq)
>> +{
>> +	resched_t rs = RESCHED_lazy;
>> +	int context;
>> +
>> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
>> +	    (rq->curr->sched_class == &idle_sched_class)) {
>> +		rs = RESCHED_eager;
>> +		goto resched;
>> +	}
>> +
>> +	/*
>> +	 * We might race with the target CPU while checking its ct_state:
>> +	 *
>> +	 * 1. The task might have just entered the kernel, but has not yet
>> +	 * called user_exit(). We will see stale state (CONTEXT_USER) and
>> +	 * send an unnecessary resched-IPI.
>> +	 *
>> +	 * 2. The user task is through with exit_to_user_mode_loop() but has
>> +	 * not yet called user_enter().
>> +	 *
>> +	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
>> +	 * schedule it lazily. There's obviously nothing that will handle
>> +	 * this need-resched bit until the thread enters the kernel next.
>> +	 *
>> +	 * The scheduler will still do tick accounting, but a potentially
>> +	 * higher priority task waited to be scheduled for a user tick,
>> +	 * instead of execution time in the kernel.
>> +	 */
>> +	context = ct_state_cpu(cpu_of(rq));
>> +	if ((context == CONTEXT_USER) ||
>> +	    (context == CONTEXT_GUEST)) {
>> +
>> +		rs = RESCHED_eager;
>> +		goto resched;
>> +	}
>
> Like said, this simply cannot be. You must not rely on the remote CPU
> being in some state or not. Also, it's racy, you could observe USER and
> then it enters KERNEL.

Or worse. We might observe KERNEL and it enters USER.

I think we would be fine if we observe USER: it would be upgraded
to RESCHED_eager and we would send an unnecessary IPI.

But if we observe KERNEL and it enters USER, then we will have
set the need-resched-lazy bit, which the thread might not see
until its next entry into the kernel (it might already have left
exit_to_user_mode_loop()).

But yes, I would like to avoid the ct_state check as well. The
catch is that need-resched-lazy only makes sense when the task on
the runqueue is executing in the kernel...

--
ankur
  
Peter Zijlstra Nov. 8, 2023, 10:46 a.m. UTC | #3
On Wed, Nov 08, 2023 at 02:26:37AM -0800, Ankur Arora wrote:
> 
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
> >
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
> >>  }
> >>
> >>  /*
> >> - * resched_curr - mark rq's current task 'to be rescheduled now'.
> >> + * __resched_curr - mark rq's current task 'to be rescheduled'.
> >>   *
> >> - * On UP this means the setting of the need_resched flag, on SMP it
> >> - * might also involve a cross-CPU call to trigger the scheduler on
> >> - * the target CPU.
> >> + * On UP this means the setting of the need_resched flag, on SMP, for
> >> + * eager resched it might also involve a cross-CPU call to trigger
> >> + * the scheduler on the target CPU.
> >>   */
> >> -void resched_curr(struct rq *rq)
> >> +void __resched_curr(struct rq *rq, resched_t rs)
> >>  {
> >>  	struct task_struct *curr = rq->curr;
> >>  	int cpu;
> >> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
> >>  	cpu = cpu_of(rq);
> >>
> >>  	if (cpu == smp_processor_id()) {
> >> -		set_tsk_need_resched(curr, RESCHED_eager);
> >> -		set_preempt_need_resched();
> >> +		set_tsk_need_resched(curr, rs);
> >> +		if (rs == RESCHED_eager)
> >> +			set_preempt_need_resched();
> >>  		return;
> >>  	}
> >>
> >> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
> >> -		smp_send_reschedule(cpu);
> >> -	else
> >> +	if (set_nr_and_not_polling(curr, rs)) {
> >> +		if (rs == RESCHED_eager)
> >> +			smp_send_reschedule(cpu);
> >
> > I think you just broke things.
> >
> > Not all idle threads have POLLING support, in which case you need that
> > IPI to wake them up, even if it's LAZY.
> 
> Yes, I was concerned about that too. But doesn't this check against the
> idle_sched_class in resched_curr() cover that?

Ah, that's what that was. Hmm, maybe.

I mean, we have idle-injection too; those run as FIFO, but as such,
they can only get preempted by RT/DL, and those will already force
preempt anyway.

The way you've split and structured the code makes it very hard to
follow. Something like:

	if (set_nr_and_not_polling(curr, rs) &&
	    (rs == RESCHED_force || is_idle_task(curr)))
		smp_send_reschedule();

is *far* clearer, no?
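
For reference, folding that condition back into the tail of __resched_curr()
would look roughly like the sketch below (RESCHED_force in the snippet above
presumably corresponds to RESCHED_eager in the posted patch):

	if (set_nr_and_not_polling(curr, rs)) {
		/*
		 * A non-polling idle target still needs the wakeup IPI,
		 * even for a lazy resched; otherwise only an eager
		 * resched sends one.
		 */
		if (rs == RESCHED_eager || is_idle_task(curr))
			smp_send_reschedule(cpu);
	} else if (rs == RESCHED_eager) {
		trace_sched_wake_idle_without_ipi(cpu);
	}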
  
Ankur Arora Nov. 21, 2023, 6:31 a.m. UTC | #4
Ankur Arora <ankur.a.arora@oracle.com> writes:

> Peter Zijlstra <peterz@infradead.org> writes:
>
>> On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
>>
>>> +	 * We might race with the target CPU while checking its ct_state:
>>> +	 *
>>> +	 * 1. The task might have just entered the kernel, but has not yet
>>> +	 * called user_exit(). We will see stale state (CONTEXT_USER) and
>>> +	 * send an unnecessary resched-IPI.
>>> +	 *
>>> +	 * 2. The user task is through with exit_to_user_mode_loop() but has
>>> +	 * not yet called user_enter().
>>> +	 *
>>> +	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
>>> +	 * schedule it lazily. There's obviously nothing that will handle
>>> +	 * this need-resched bit until the thread enters the kernel next.
>>> +	 *
>>> +	 * The scheduler will still do tick accounting, but a potentially
>>> +	 * higher priority task waited to be scheduled for a user tick,
>>> +	 * instead of execution time in the kernel.
>>> +	 */
>>> +	context = ct_state_cpu(cpu_of(rq));
>>> +	if ((context == CONTEXT_USER) ||
>>> +	    (context == CONTEXT_GUEST)) {
>>> +
>>> +		rs = RESCHED_eager;
>>> +		goto resched;
>>> +	}
>>
>> Like said, this simply cannot be. You must not rely on the remote CPU
>> being in some state or not. Also, it's racy, you could observe USER and
>> then it enters KERNEL.
>
> Or worse. We might observe KERNEL and it enters USER.
>
> I think we would be fine if we observe USER: it would be upgraded
> to RESCHED_eager and we would send an unnecessary IPI.
>
> But if we observe KERNEL and it enters USER, then we will have
> set the need-resched-lazy bit, which the thread might not see
> until its next entry into the kernel (it might already have left
> exit_to_user_mode_loop()).
>
> But yes, I would like to avoid the ct_state check as well. The
> catch is that need-resched-lazy only makes sense when the task on
> the runqueue is executing in the kernel...

So, I discussed this with Thomas offlist, and he pointed out that I'm
overengineering this.

If we decide to wake up a remote rq lazily with (!sched_feat(TTWU_QUEUE)),
and if the target is running in user space, then the resched would
happen when the process enters kernel mode.

That's somewhat similar to how in this preemption model we let a task
run for up to one extra tick while in kernel mode. So I'll drop this and
allow the same behaviour in userspace instead of solving it in
unnecessarily complicated ways.

--
ankur
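
With the ct_state check dropped, the lazy/eager policy would presumably
reduce to something like the sketch below (not code from a posted revision):

	void resched_curr(struct rq *rq)
	{
		resched_t rs = RESCHED_lazy;	/* default: run to completion */

		/*
		 * Force an eager resched under full preemption or when the
		 * target is running the idle task; a target in userspace
		 * just picks the need-resched bit up on its next kernel
		 * entry, at worst one tick later.
		 */
		if (IS_ENABLED(CONFIG_PREEMPT) ||
		    rq->curr->sched_class == &idle_sched_class)
			rs = RESCHED_eager;

		__resched_curr(rq, rs);
	}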
  
Ankur Arora Nov. 21, 2023, 6:34 a.m. UTC | #5
Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Nov 08, 2023 at 02:26:37AM -0800, Ankur Arora wrote:
>>
>> Peter Zijlstra <peterz@infradead.org> writes:
>>
>> > On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:
>> >
>> >> --- a/kernel/sched/core.c
>> >> +++ b/kernel/sched/core.c
>> >> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
>> >>  }
>> >>
>> >>  /*
>> >> - * resched_curr - mark rq's current task 'to be rescheduled now'.
>> >> + * __resched_curr - mark rq's current task 'to be rescheduled'.
>> >>   *
>> >> - * On UP this means the setting of the need_resched flag, on SMP it
>> >> - * might also involve a cross-CPU call to trigger the scheduler on
>> >> - * the target CPU.
>> >> + * On UP this means the setting of the need_resched flag, on SMP, for
>> >> + * eager resched it might also involve a cross-CPU call to trigger
>> >> + * the scheduler on the target CPU.
>> >>   */
>> >> -void resched_curr(struct rq *rq)
>> >> +void __resched_curr(struct rq *rq, resched_t rs)
>> >>  {
>> >>  	struct task_struct *curr = rq->curr;
>> >>  	int cpu;
>> >> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
>> >>  	cpu = cpu_of(rq);
>> >>
>> >>  	if (cpu == smp_processor_id()) {
>> >> -		set_tsk_need_resched(curr, RESCHED_eager);
>> >> -		set_preempt_need_resched();
>> >> +		set_tsk_need_resched(curr, rs);
>> >> +		if (rs == RESCHED_eager)
>> >> +			set_preempt_need_resched();
>> >>  		return;
>> >>  	}
>> >>
>> >> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
>> >> -		smp_send_reschedule(cpu);
>> >> -	else
>> >> +	if (set_nr_and_not_polling(curr, rs)) {
>> >> +		if (rs == RESCHED_eager)
>> >> +			smp_send_reschedule(cpu);
>> >
>> > I think you just broke things.
>> >
>> > Not all idle threads have POLLING support, in which case you need that
>> > IPI to wake them up, even if it's LAZY.
>>
>> Yes, I was concerned about that too. But doesn't this check against the
>> idle_sched_class in resched_curr() cover that?
>
> Ah, that's what that was. Hmm, maybe.
>
> I mean, we have idle-injection too; those run as FIFO, but as such,
> they can only get preempted by RT/DL, and those will already force
> preempt anyway.

Aah yes, of course those are FIFO. Thanks, I missed that.

> The way you've split and structured the code makes it very hard to
> follow. Something like:
>
> 	if (set_nr_and_not_polling(curr, rs) &&
> 	    (rs == RESCHED_force || is_idle_task(curr)))
> 		smp_send_reschedule();
>
> is *far* clearer, no?

Nods. I was trying to separate where we decide whether we do things
eagerly or lazily. But this is way clearer.


--
ankur
  

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01df5ac2982c..f65bf3ce0e9d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1027,13 +1027,13 @@  void wake_up_q(struct wake_q_head *head)
 }
 
 /*
- * resched_curr - mark rq's current task 'to be rescheduled now'.
+ * __resched_curr - mark rq's current task 'to be rescheduled'.
  *
- * On UP this means the setting of the need_resched flag, on SMP it
- * might also involve a cross-CPU call to trigger the scheduler on
- * the target CPU.
+ * On UP this means the setting of the need_resched flag, on SMP, for
+ * eager resched it might also involve a cross-CPU call to trigger
+ * the scheduler on the target CPU.
  */
-void resched_curr(struct rq *rq)
+void __resched_curr(struct rq *rq, resched_t rs)
 {
 	struct task_struct *curr = rq->curr;
 	int cpu;
@@ -1046,17 +1046,77 @@  void resched_curr(struct rq *rq)
 	cpu = cpu_of(rq);
 
 	if (cpu == smp_processor_id()) {
-		set_tsk_need_resched(curr, RESCHED_eager);
-		set_preempt_need_resched();
+		set_tsk_need_resched(curr, rs);
+		if (rs == RESCHED_eager)
+			set_preempt_need_resched();
 		return;
 	}
 
-	if (set_nr_and_not_polling(curr, RESCHED_eager))
-		smp_send_reschedule(cpu);
-	else
+	if (set_nr_and_not_polling(curr, rs)) {
+		if (rs == RESCHED_eager)
+			smp_send_reschedule(cpu);
+	} else if (rs == RESCHED_eager)
 		trace_sched_wake_idle_without_ipi(cpu);
 }
 
+/*
+ * resched_curr - mark rq's current task 'to be rescheduled' eagerly
+ * or lazily according to the current policy.
+ *
+ * Always schedule eagerly, if:
+ *
+ *  - running under full preemption
+ *
+ *  - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
+ *    force TIF_NEED_RESCHED to be set and send a resched IPI.
+ *    (the polling case has already set TIF_NEED_RESCHED via
+ *     set_nr_if_polling()).
+ *
+ *  - in userspace: run to completion semantics are only for kernel tasks
+ *
+ * Otherwise (regardless of priority), run to completion.
+ */
+void resched_curr(struct rq *rq)
+{
+	resched_t rs = RESCHED_lazy;
+	int context;
+
+	if (IS_ENABLED(CONFIG_PREEMPT) ||
+	    (rq->curr->sched_class == &idle_sched_class)) {
+		rs = RESCHED_eager;
+		goto resched;
+	}
+
+	/*
+	 * We might race with the target CPU while checking its ct_state:
+	 *
+	 * 1. The task might have just entered the kernel, but has not yet
+	 * called user_exit(). We will see stale state (CONTEXT_USER) and
+	 * send an unnecessary resched-IPI.
+	 *
+	 * 2. The user task is through with exit_to_user_mode_loop() but has
+	 * not yet called user_enter().
+	 *
+	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
+	 * schedule it lazily. There's obviously nothing that will handle
+	 * this need-resched bit until the thread enters the kernel next.
+	 *
+	 * The scheduler will still do tick accounting, but a potentially
+	 * higher priority task waited to be scheduled for a user tick,
+	 * instead of execution time in the kernel.
+	 */
+	context = ct_state_cpu(cpu_of(rq));
+	if ((context == CONTEXT_USER) ||
+	    (context == CONTEXT_GUEST)) {
+
+		rs = RESCHED_eager;
+		goto resched;
+	}
+
+resched:
+	__resched_curr(rq, rs);
+}
+
 void resched_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);