sched/core: Minor optimize ttwu_runnable()

Message ID 20221102102343.57845-1-zhouchengming@bytedance.com
State New
Series sched/core: Minor optimize ttwu_runnable()

Commit Message

Chengming Zhou Nov. 2, 2022, 10:23 a.m. UTC
ttwu_runnable() is used as a fast wakeup path when the wakee task
is between set_current_state() and schedule(), in which case all
we need to do is change p->state back to TASK_RUNNING. So we don't
need to call update_rq_clock() or check_preempt_curr() in this case.
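
For context, the wakee in this fast path is sitting in the usual wait-loop
idiom, roughly (a sketch of the generic pattern, not any specific call site):

    for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);  /* p->__state != TASK_RUNNING */

            if (CONDITION)
                    break;

            schedule();  /* only dequeues p if its state is still non-RUNNING */
    }
    __set_current_state(TASK_RUNNING);

Since p never left the runqueue, setting p->__state back to TASK_RUNNING is
enough to cancel the pending sleep.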

Some performance numbers using mmtests/perfpipe on Intel Xeon server:

                           linux-next                patched
Min       Time        8.67 (   0.00%)        8.66 (   0.13%)
1st-qrtle Time        8.83 (   0.00%)        8.72 (   1.19%)
2nd-qrtle Time        8.90 (   0.00%)        8.76 (   1.57%)
3rd-qrtle Time        8.98 (   0.00%)        8.82 (   1.82%)
Max-1     Time        8.67 (   0.00%)        8.66 (   0.13%)
Max-5     Time        8.67 (   0.00%)        8.66 (   0.13%)
Max-10    Time        8.79 (   0.00%)        8.69 (   1.09%)
Max-90    Time        9.01 (   0.00%)        8.84 (   1.94%)
Max-95    Time        9.02 (   0.00%)        8.85 (   1.86%)
Max-99    Time        9.02 (   0.00%)        8.88 (   1.56%)
Max       Time        9.59 (   0.00%)        8.89 (   7.29%)
Amean     Time        8.92 (   0.00%)        8.77 *   1.65%*

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/core.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
  

Comments

Valentin Schneider Nov. 4, 2022, 6:19 p.m. UTC | #1
On 02/11/22 18:23, Chengming Zhou wrote:
> ttwu_runnable() is used as a fast wakeup path when the wakee task
> is between set_current_state() and schedule(), in which case all
> we need to do is change p->state back to TASK_RUNNING. So we don't
> need to update_rq_clock() and check_preempt_curr() in this case.
>
> Some performance numbers using mmtests/perfpipe on Intel Xeon server:
>
>                            linux-next                patched
> Min       Time        8.67 (   0.00%)        8.66 (   0.13%)
> 1st-qrtle Time        8.83 (   0.00%)        8.72 (   1.19%)
> 2nd-qrtle Time        8.90 (   0.00%)        8.76 (   1.57%)
> 3rd-qrtle Time        8.98 (   0.00%)        8.82 (   1.82%)
> Max-1     Time        8.67 (   0.00%)        8.66 (   0.13%)
> Max-5     Time        8.67 (   0.00%)        8.66 (   0.13%)
> Max-10    Time        8.79 (   0.00%)        8.69 (   1.09%)
> Max-90    Time        9.01 (   0.00%)        8.84 (   1.94%)
> Max-95    Time        9.02 (   0.00%)        8.85 (   1.86%)
> Max-99    Time        9.02 (   0.00%)        8.88 (   1.56%)
> Max       Time        9.59 (   0.00%)        8.89 (   7.29%)
> Amean     Time        8.92 (   0.00%)        8.77 *   1.65%*
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  kernel/sched/core.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 87c9cdf37a26..3785418de127 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>
>       rq = __task_rq_lock(p, &rf);
>       if (task_on_rq_queued(p)) {
> -		/* check_preempt_curr() may use rq clock */
> -		update_rq_clock(rq);
> -		ttwu_do_wakeup(rq, p, wake_flags, &rf);
> +		WRITE_ONCE(p->__state, TASK_RUNNING);
> +		trace_sched_wakeup(p);

This also loses the class->task_woken() call, AFAICT we could have
!p->on_cpu here (e.g. an IRQ hit before the task got to schedule()), but
then again if there is a need to push we should have done that at the IRQ
preempt via set_next_task_{rt, dl}()... So I'm starting to think this is
OK, but that needs elaborating in the changelog.


>               ret = 1;
>       }
>       __task_rq_unlock(rq, &rf);
> --
> 2.37.2
  
Chengming Zhou Nov. 5, 2022, 1:34 a.m. UTC | #2
On 2022/11/5 02:19, Valentin Schneider wrote:
> On 02/11/22 18:23, Chengming Zhou wrote:
>> ttwu_runnable() is used as a fast wakeup path when the wakee task
>> is between set_current_state() and schedule(), in which case all
>> we need to do is change p->state back to TASK_RUNNING. So we don't
>> need to update_rq_clock() and check_preempt_curr() in this case.
>>
>> Some performance numbers using mmtests/perfpipe on Intel Xeon server:
>>
>>                            linux-next                patched
>> Min       Time        8.67 (   0.00%)        8.66 (   0.13%)
>> 1st-qrtle Time        8.83 (   0.00%)        8.72 (   1.19%)
>> 2nd-qrtle Time        8.90 (   0.00%)        8.76 (   1.57%)
>> 3rd-qrtle Time        8.98 (   0.00%)        8.82 (   1.82%)
>> Max-1     Time        8.67 (   0.00%)        8.66 (   0.13%)
>> Max-5     Time        8.67 (   0.00%)        8.66 (   0.13%)
>> Max-10    Time        8.79 (   0.00%)        8.69 (   1.09%)
>> Max-90    Time        9.01 (   0.00%)        8.84 (   1.94%)
>> Max-95    Time        9.02 (   0.00%)        8.85 (   1.86%)
>> Max-99    Time        9.02 (   0.00%)        8.88 (   1.56%)
>> Max       Time        9.59 (   0.00%)        8.89 (   7.29%)
>> Amean     Time        8.92 (   0.00%)        8.77 *   1.65%*
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  kernel/sched/core.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 87c9cdf37a26..3785418de127 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>>
>>       rq = __task_rq_lock(p, &rf);
>>       if (task_on_rq_queued(p)) {
>> -		/* check_preempt_curr() may use rq clock */
>> -		update_rq_clock(rq);
>> -		ttwu_do_wakeup(rq, p, wake_flags, &rf);
>> +		WRITE_ONCE(p->__state, TASK_RUNNING);
>> +		trace_sched_wakeup(p);
> 
> This also loses the class->task_woken() call, AFAICT we could have
> !p->on_cpu here (e.g. an IRQ hit before the task got to schedule()), but
> then again if there is a need to push we should have done that at the IRQ
> preempt via set_next_task_{rt, dl}()... So I'm starting to think this is
> OK, but that needs elaborating in the changelog.

Sorry, I don't get why we could have !p->on_cpu here?

I thought if we have task_on_rq_queued(p) here, it means p hasn't got to
__schedule() to deactivate_task(), so p should still be on_cpu?

Thanks.

> 
> 
>>               ret = 1;
>>       }
>>       __task_rq_unlock(rq, &rf);
>> --
>> 2.37.2
>
  
Valentin Schneider Nov. 7, 2022, 12:56 p.m. UTC | #3
On 05/11/22 09:34, Chengming Zhou wrote:
> On 2022/11/5 02:19, Valentin Schneider wrote:
>> On 02/11/22 18:23, Chengming Zhou wrote:
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 87c9cdf37a26..3785418de127 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>>>
>>>       rq = __task_rq_lock(p, &rf);
>>>       if (task_on_rq_queued(p)) {
>>> -		/* check_preempt_curr() may use rq clock */
>>> -		update_rq_clock(rq);
>>> -		ttwu_do_wakeup(rq, p, wake_flags, &rf);
>>> +		WRITE_ONCE(p->__state, TASK_RUNNING);
>>> +		trace_sched_wakeup(p);
>>
>> This also loses the class->task_woken() call, AFAICT we could have
>> !p->on_cpu here (e.g. an IRQ hit before the task got to schedule()), but
>> then again if there is a need to push we should have done that at the IRQ
>> preempt via set_next_task_{rt, dl}()... So I'm starting to think this is
>> OK, but that needs elaborating in the changelog.
>
> Sorry, I don't get why we could have !p->on_cpu here?
>
> I thought if we have task_on_rq_queued(p) here, it means p hasn't got to
> __schedule() to deactivate_task(), so p should still be on_cpu?
>
> Thanks.
>

So p is still on the rq for sure, but it may not be the currently running
task. Consider, on a CONFIG_PREEMPT kernel, task p0 in a wait loop:

0    for (;;) {
1       set_current_state(TASK_UNINTERRUPTIBLE);
2
3       if (CONDITION)
4          break;
5
6       schedule();
7    }
8    __set_current_state(TASK_RUNNING);

Now further consider p0 executes line 1, but then gets interrupted by an
IRQ, at the end of which preempt_schedule_irq() happens. We enter
__schedule(SM_PREEMPT), so p0 isn't dequeued due to its non-zero task
state, but we *can* end up switching the CPU's current task.
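
For reference, the guard in __schedule() that makes this happen is roughly
the following (paraphrased and heavily trimmed, not the exact upstream code):

    prev_state = READ_ONCE(prev->__state);
    if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
            /* Voluntary switch with a non-RUNNING state: really go to sleep. */
            deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
    }
    /*
     * Preemption (SM_PREEMPT): prev stays queued even though its state
     * isn't TASK_RUNNING, yet another task can still be picked as rq->curr.
     */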

ISTR that's why Peter renamed that function ttwu_*runnable*() and not
ttwu_*running*().

>>
>>
>>>               ret = 1;
>>>       }
>>>       __task_rq_unlock(rq, &rf);
>>> --
>>> 2.37.2
>>
  
Chengming Zhou Nov. 7, 2022, 1:19 p.m. UTC | #4
On 2022/11/7 20:56, Valentin Schneider wrote:
> On 05/11/22 09:34, Chengming Zhou wrote:
>> On 2022/11/5 02:19, Valentin Schneider wrote:
>>> On 02/11/22 18:23, Chengming Zhou wrote:
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 87c9cdf37a26..3785418de127 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>>>>
>>>>       rq = __task_rq_lock(p, &rf);
>>>>       if (task_on_rq_queued(p)) {
>>>> -		/* check_preempt_curr() may use rq clock */
>>>> -		update_rq_clock(rq);
>>>> -		ttwu_do_wakeup(rq, p, wake_flags, &rf);
>>>> +		WRITE_ONCE(p->__state, TASK_RUNNING);
>>>> +		trace_sched_wakeup(p);
>>>
>>> This also loses the class->task_woken() call, AFAICT we could have
>>> !p->on_cpu here (e.g. an IRQ hit before the task got to schedule()), but
>>> then again if there is a need to push we should have done that at the IRQ
>>> preempt via set_next_task_{rt, dl}()... So I'm starting to think this is
>>> OK, but that needs elaborating in the changelog.
>>
>> Sorry, I don't get why we could have !p->on_cpu here?
>>
>> I thought if we have task_on_rq_queued(p) here, it means p hasn't got to
>> __schedule() to deactivate_task(), so p should still be on_cpu?
>>
>> Thanks.
>>
> 
> So p is still on the rq for sure, but it may not be the currently running
> task. Consider, on a CONFIG_PREEMPT kernel, task p0 in a wait loop:
> 
> 0    for (;;) {
> 1       set_current_state(TASK_UNINTERRUPTIBLE);
> 2
> 3       if (CONDITION)
> 4          break;
> 5
> 6       schedule();
> 7    }
> 8    __set_current_state(TASK_RUNNING);
> 
> Now further consider p0 executes line 1, but then gets interrupted by an
> IRQ, at the end of which preempt_schedule_irq() happens. We enter
> __schedule(SM_PREEMPT), so p0 isn't dequeued due to its non-zero task
> state, but we *can* end up switching the CPU's current task.

Thank you very much for this detailed explanation, I get it now. Yes, this
can happen on a CONFIG_PREEMPT kernel.

> 
> ISTR that's why Peter renamed that function ttwu_*runnable*() and not
> ttwu_*running*().

So this task p didn't really sleep: it was preempted, later scheduled back
in, and got to execute line 6's schedule(), but since its state had already
been set back to TASK_RUNNING, __schedule() won't deactivate_task(p).

Looks like this patch is still reasonable? Should I put this detailed
explanation in the changelog and send v2?

Thanks!

> 
>>>
>>>
>>>>               ret = 1;
>>>>       }
>>>>       __task_rq_unlock(rq, &rf);
>>>> --
>>>> 2.37.2
>>>
>
  
Valentin Schneider Nov. 7, 2022, 3:54 p.m. UTC | #5
On 07/11/22 21:19, Chengming Zhou wrote:
> On 2022/11/7 20:56, Valentin Schneider wrote:
>>
>> So p is still on the rq for sure, but it may not be the currently running
>> task. Consider, on a CONFIG_PREEMPT kernel, task p0 in a wait loop:
>>
>> 0    for (;;) {
>> 1       set_current_state(TASK_UNINTERRUPTIBLE);
>> 2
>> 3       if (CONDITION)
>> 4          break;
>> 5
>> 6       schedule();
>> 7    }
>> 8    __set_current_state(TASK_RUNNING);
>>
>> Now further consider p0 executes line 1, but then gets interrupted by an
>> IRQ, at the end of which preempt_schedule_irq() happens. We enter
>> __schedule(SM_PREEMPT), so p0 isn't dequeued due to its non-zero task
>> state, but we *can* end up switching the CPU's current task.
>
> Thank you very much for this detailed explanation, I get it. Yes, this
> process can be seen on CONFIG_PREEMPT kernel.
>
>>
>> ISTR that's why Peter renamed that function ttwu_*runnable*() and not
>> ttwu_*running*().
>
> So this task p didn't really sleep, it's preempted, later scheduled in,
> get to execute line 6 schedule(), but its state has been set to TASK_RUNNING,
> it won't deactivate_task(p).
>

Right!

> Looks like this patch is still reasonable? Should I put this detailed
> explanation in the changelog and send v2?
>

So that's the part for the p->sched_class->task_woken() callback, which
only affects RT and DL (and only does something when !p->on_cpu). I *think*
it's fine to remove it from ttwu_runnable() as any push/pull should have
happened when other tasks were enqueued on the same CPU - with that said,
it wouldn't hurt to double check this :-)
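
For reference, the RT flavour of that callback looks roughly like the below
(paraphrased from kernel/sched/rt.c, with the priority comparison against
rq->curr trimmed, so treat the exact condition as approximate):

    static void task_woken_rt(struct rq *rq, struct task_struct *p)
    {
            /* Only consider pushing p when it isn't the CPU's current task. */
            bool need_to_push = !task_on_cpu(rq, p) &&
                                !test_tsk_need_resched(rq->curr) &&
                                p->nr_cpus_allowed > 1 &&
                                (dl_task(rq->curr) || rt_task(rq->curr));

            if (need_to_push)
                    push_rt_tasks(rq);  /* try to move p to another CPU */
    }

i.e. it only does something in the !p->on_cpu case discussed above.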


As for the check_preempt_curr(), since per the above p can be preempted,
you could have scenarios right now with CFS tasks where
ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set.

e.g. p0 does

  set_current_state(TASK_UNINTERRUPTIBLE)

but then gets interrupted by the tick, a p1 gets selected to run instead
because of check_preempt_tick(), and then runs long enough to have
check_preempt_curr() decide to let p0 preempt p1.

That does require specific timing (lower tick frequency should make this
more likely) and probably task niceness distribution too, but isn't
impossible.
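
For completeness, check_preempt_curr() itself is roughly (paraphrased):

    void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
    {
            if (p->sched_class == rq->curr->sched_class)
                    /* e.g. CFS's wakeup-preemption check, which may resched_curr() */
                    rq->curr->sched_class->check_preempt_curr(rq, p, flags);
            else if (sched_class_above(p->sched_class, rq->curr->sched_class))
                    resched_curr(rq);

            /* Skip a back-to-back clock update if we're about to reschedule. */
            if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
                    rq_clock_skip_update(rq);
    }

so dropping the call outright would lose that NEED_RESCHED in the scenario
above.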

Maybe try reading p->on_cpu, and only do the quick task state update if
it's still the current task, otherwise do the preemption checks?

> Thanks!
>
  
Chengming Zhou Nov. 8, 2022, 7:38 a.m. UTC | #6
On 2022/11/7 23:54, Valentin Schneider wrote:
> On 07/11/22 21:19, Chengming Zhou wrote:
>> On 2022/11/7 20:56, Valentin Schneider wrote:
>>>
>>> So p is still on the rq for sure, but it may not be the currently running
>>> task. Consider, on a CONFIG_PREEMPT kernel, task p0 in a wait loop:
>>>
>>> 0    for (;;) {
>>> 1       set_current_state(TASK_UNINTERRUPTIBLE);
>>> 2
>>> 3       if (CONDITION)
>>> 4          break;
>>> 5
>>> 6       schedule();
>>> 7    }
>>> 8    __set_current_state(TASK_RUNNING);
>>>
>>> Now further consider p0 executes line 1, but then gets interrupted by an
>>> IRQ, at the end of which preempt_schedule_irq() happens. We enter
>>> __schedule(SM_PREEMPT), so p0 isn't dequeued due to its non-zero task
>>> state, but we *can* end up switching the CPU's current task.
>>
>> Thank you very much for this detailed explanation, I get it. Yes, this
>> process can be seen on CONFIG_PREEMPT kernel.
>>
>>>
>>> ISTR that's why Peter renamed that function ttwu_*runnable*() and not
>>> ttwu_*running*().
>>
>> So this task p didn't really sleep, it's preempted, later scheduled in,
>> get to execute line 6 schedule(), but its state has been set to TASK_RUNNING,
>> it won't deactivate_task(p).
>>
> 
> Right!
> 
>> Looks like this patch is still reasonable? Should I put this detailed
>> explanation in the changelog and send v2?
>>
> 
> So that's the part for the p->sched_class->task_woken() callback, which
> only affects RT and DL (and only does something when !p->on_cpu). I *think*
> it's fine to remove it from ttwu_runnable() as any push/pull should have
> happened when other tasks were enqueued on the same CPU - with that said,
> it wouldn't hurt to double check this :-)

Yes, it should be fine for ttwu_runnable() to drop the
p->sched_class->task_woken() call.

> 
> 
> As for the check_preempt_curr(), since per the above p can be preempted,
> you could have scenarios right now with CFS tasks where
> ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set.
> 
> e.g. p0 does
> 
>   set_current_state(TASK_UNINTERRUPTIBLE)
> 
> but then gets interrupted by the tick, a p1 gets selected to run instead
> because of check_preempt_tick(), and then runs long enough to have
> check_preempt_curr() decide to let p0 preempt p1.
> 
> That does require specific timing (lower tick frequency should make this
> more likely) and probably task niceness distribution too, but isn't
> impossible.
> 
> Maybe try reading p->on_cpu, and only do the quick task state update if
> it's still the current task, otherwise do the preemption checks?

I think that's a good suggestion: we keep the preemption check for the case
where p0 has been preempted by p1, letting p0 get scheduled back in more
promptly.

I will update and send v2 later.

> 
>> Thanks!
>>
>
  
Peter Zijlstra Nov. 8, 2022, 9:11 a.m. UTC | #7
On Mon, Nov 07, 2022 at 03:54:38PM +0000, Valentin Schneider wrote:

> So that's the part for the p->sched_class->task_woken() callback, which
> only affects RT and DL (and only does something when !p->on_cpu). I *think*
> it's fine to remove it from ttwu_runnable() as any push/pull should have
> happened when other tasks were enqueued on the same CPU - with that said,
> it wouldn't hurt to double check this :-)
> 
> 
> As for the check_preempt_curr(), since per the above p can be preempted,
> you could have scenarios right now with CFS tasks where
> ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set.
> 
> e.g. p0 does
> 
>   set_current_state(TASK_UNINTERRUPTIBLE)
> 
> but then gets interrupted by the tick, a p1 gets selected to run instead
> because of check_preempt_tick(), and then runs long enough to have
> check_preempt_curr() decide to let p0 preempt p1.
> 
> That does require specific timing (lower tick frequency should make this
> more likely) and probably task niceness distribution too, but isn't
> impossible.
> 
> Maybe try reading p->on_cpu, and only do the quick task state update if
> it's still the current task, otherwise do the preemption checks?

I'm confused...

So all relevant parties take rq->lock:

 - __schedule()
 - scheduler_tick()
 - ttwu_runnable()

So if ttwu_runnable() sees on_rq and switches the state back to RUNNING,
then neither check_preempt_curr() nor task_woken() makes any sense.

Specifically:

 - you can't very well preempt yourself (which is what
   check_preempt_curr() is trying to determine -- if the woken task
   should preempt the running task, they're both the same in this case);

 - the task did not actually wake up, because it was still on the
   runqueue to begin with. This path prevents a sleep, rather than
   issuing a wakeup.

So yes, I think the patch as proposed is ok.
  
Peter Zijlstra Nov. 8, 2022, 9:20 a.m. UTC | #8
On Wed, Nov 02, 2022 at 06:23:43PM +0800, Chengming Zhou wrote:
> ttwu_runnable() is used as a fast wakeup path when the wakee task
> is between set_current_state() and schedule(), in which case all
> we need to do is change p->state back to TASK_RUNNING. So we don't
> need to update_rq_clock() and check_preempt_curr() in this case.
> 
> Some performance numbers using mmtests/perfpipe on Intel Xeon server:
> 
>                            linux-next                patched
> Min       Time        8.67 (   0.00%)        8.66 (   0.13%)
> 1st-qrtle Time        8.83 (   0.00%)        8.72 (   1.19%)
> 2nd-qrtle Time        8.90 (   0.00%)        8.76 (   1.57%)
> 3rd-qrtle Time        8.98 (   0.00%)        8.82 (   1.82%)
> Max-1     Time        8.67 (   0.00%)        8.66 (   0.13%)
> Max-5     Time        8.67 (   0.00%)        8.66 (   0.13%)
> Max-10    Time        8.79 (   0.00%)        8.69 (   1.09%)
> Max-90    Time        9.01 (   0.00%)        8.84 (   1.94%)
> Max-95    Time        9.02 (   0.00%)        8.85 (   1.86%)
> Max-99    Time        9.02 (   0.00%)        8.88 (   1.56%)
> Max       Time        9.59 (   0.00%)        8.89 (   7.29%)
> Amean     Time        8.92 (   0.00%)        8.77 *   1.65%*
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  kernel/sched/core.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 87c9cdf37a26..3785418de127 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3718,9 +3718,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  
>  	rq = __task_rq_lock(p, &rf);
>  	if (task_on_rq_queued(p)) {
> -		/* check_preempt_curr() may use rq clock */
> -		update_rq_clock(rq);
> -		ttwu_do_wakeup(rq, p, wake_flags, &rf);
> +		WRITE_ONCE(p->__state, TASK_RUNNING);
> +		trace_sched_wakeup(p);
>  		ret = 1;
>  	}
>  	__task_rq_unlock(rq, &rf);

Yes, I think this is correct; however, I would re-organize the code a little.

How's this?


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb2aa2b54c7a..43d9a1551a5d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3602,15 +3602,40 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 		__schedstat_inc(p->stats.nr_wakeups_sync);
 }
 
 /*
- * Mark the task runnable and perform wakeup-preemption.
+ * Mark the task runnable...
  */
-static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
-			   struct rq_flags *rf)
+static inline void ttwu_do_wakeup(struct task_struct *p)
 {
-	check_preempt_curr(rq, p, wake_flags);
 	WRITE_ONCE(p->__state, TASK_RUNNING);
 	trace_sched_wakeup(p);
+}
+
+static void
+ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
+		 struct rq_flags *rf)
+{
+	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
+
+	lockdep_assert_rq_held(rq);
+
+	if (p->sched_contributes_to_load)
+		rq->nr_uninterruptible--;
+
+#ifdef CONFIG_SMP
+	if (wake_flags & WF_MIGRATED)
+		en_flags |= ENQUEUE_MIGRATED;
+	else
+#endif
+	if (p->in_iowait) {
+		delayacct_blkio_end(p);
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
+	activate_task(rq, p, en_flags);
+	check_preempt_curr(rq, p, wake_flags);
+
+	ttwu_do_wakeup(p);
 
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_woken) {
@@ -3640,31 +3666,6 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
 #endif
 }
 
-static void
-ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
-		 struct rq_flags *rf)
-{
-	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
-
-	lockdep_assert_rq_held(rq);
-
-	if (p->sched_contributes_to_load)
-		rq->nr_uninterruptible--;
-
-#ifdef CONFIG_SMP
-	if (wake_flags & WF_MIGRATED)
-		en_flags |= ENQUEUE_MIGRATED;
-	else
-#endif
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
-
-	activate_task(rq, p, en_flags);
-	ttwu_do_wakeup(rq, p, wake_flags, rf);
-}
-
 /*
  * Consider @p being inside a wait loop:
  *
@@ -3698,9 +3699,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 
 	rq = __task_rq_lock(p, &rf);
 	if (task_on_rq_queued(p)) {
-		/* check_preempt_curr() may use rq clock */
-		update_rq_clock(rq);
-		ttwu_do_wakeup(rq, p, wake_flags, &rf);
+		ttwu_do_wakeup(p);
 		ret = 1;
 	}
 	__task_rq_unlock(rq, &rf);
@@ -4062,8 +4061,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 			goto out;
 
 		trace_sched_waking(p);
-		WRITE_ONCE(p->__state, TASK_RUNNING);
-		trace_sched_wakeup(p);
+		ttwu_do_wakeup(p);
 		goto out;
 	}
  
Peter Zijlstra Nov. 8, 2022, 10:08 a.m. UTC | #9
On Tue, Nov 08, 2022 at 10:11:49AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 07, 2022 at 03:54:38PM +0000, Valentin Schneider wrote:
> 
> > So that's the part for the p->sched_class->task_woken() callback, which
> > only affects RT and DL (and only does something when !p->on_cpu). I *think*
> > it's fine to remove it from ttwu_runnable() as any push/pull should have
> > happened when other tasks were enqueued on the same CPU - with that said,
> > it wouldn't hurt to double check this :-)
> > 
> > 
> > As for the check_preempt_curr(), since per the above p can be preempted,
> > you could have scenarios right now with CFS tasks where
> > ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set.
> > 
> > e.g. p0 does
> > 
> >   set_current_state(TASK_UNINTERRUPTIBLE)
> > 
> > but then gets interrupted by the tick, a p1 gets selected to run instead
> > because of check_preempt_tick(), and then runs long enough to have
> > check_preempt_curr() decide to let p0 preempt p1.
> > 
> > That does require specific timing (lower tick frequency should make this
> > more likely) and probably task niceness distribution too, but isn't
> > impossible.
> > 
> > Maybe try reading p->on_cpu, and only do the quick task state update if
> > it's still the current task, otherwise do the preemption checks?
> 
> I'm confused...

I am, and Valentin has a point. It could indeed be preempted, and in that
case check_preempt_curr() could indeed make it get back on the CPU.

In that case his suggestion might make sense; something along the lines
of the below, I suppose...

(And I think we can still do the reorg I proposed elsewhere, but I've not
yet tried.)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb2aa2b54c7a..6944d9473295 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3698,9 +3698,16 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 
 	rq = __task_rq_lock(p, &rf);
 	if (task_on_rq_queued(p)) {
-		/* check_preempt_curr() may use rq clock */
-		update_rq_clock(rq);
-		ttwu_do_wakeup(rq, p, wake_flags, &rf);
+		if (!p->on_cpu) {
+			/*
+			 * When on_rq && !on_cpu the task is preempted, see if
+			 * it should preempt whatever is current there now.
+			 */
+			update_rq_clock(rq);
+			check_preempt_curr(rq, p, wake_flags);
+		}
+		WRITE_ONCE(p->__state, TASK_RUNNING);
+		trace_sched_wakeup(p);
 		ret = 1;
 	}
 	__task_rq_unlock(rq, &rf);
  
Chengming Zhou Nov. 8, 2022, 11:17 a.m. UTC | #10
On 2022/11/8 18:08, Peter Zijlstra wrote:
> On Tue, Nov 08, 2022 at 10:11:49AM +0100, Peter Zijlstra wrote:
>> On Mon, Nov 07, 2022 at 03:54:38PM +0000, Valentin Schneider wrote:
>>
>>> So that's the part for the p->sched_class->task_woken() callback, which
>>> only affects RT and DL (and only does something when !p->on_cpu). I *think*
>>> it's fine to remove it from ttwu_runnable() as any push/pull should have
>>> happened when other tasks were enqueued on the same CPU - with that said,
>>> it wouldn't hurt to double check this :-)
>>>
>>>
>>> As for the check_preempt_curr(), since per the above p can be preempted,
>>> you could have scenarios right now with CFS tasks where
>>> ttwu_runnable()->check_preempt_curr() causes NEED_RESCHED to be set.
>>>
>>> e.g. p0 does
>>>
>>>   set_current_state(TASK_UNINTERRUPTIBLE)
>>>
>>> but then gets interrupted by the tick, a p1 gets selected to run instead
>>> because of check_preempt_tick(), and then runs long enough to have
>>> check_preempt_curr() decide to let p0 preempt p1.
>>>
>>> That does require specific timing (lower tick frequency should make this
>>> more likely) and probably task niceness distribution too, but isn't
>>> impossible.
>>>
>>> Maybe try reading p->on_cpu, and only do the quick task state update if
>>> it's still the current task, otherwise do the preemption checks?
>>
>> I'm confused...
> 
> I am and Valentin has a point. It could indeed be preempted and in that
> case check_preempt_curr() could indeed make it get back on.
> 
> In that case his suggestion might make sense; something along the lines
> of so I suppose...
> 
> (And I think we can still do the reorg I proposed elsewhere, but I've not
> yet tried.)

Ok, thanks for these suggestions.

I will try to do this, do some tests and send v2.

Thanks.

> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index cb2aa2b54c7a..6944d9473295 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3698,9 +3698,16 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  
>  	rq = __task_rq_lock(p, &rf);
>  	if (task_on_rq_queued(p)) {
> -		/* check_preempt_curr() may use rq clock */
> -		update_rq_clock(rq);
> -		ttwu_do_wakeup(rq, p, wake_flags, &rf);
> +		if (!p->on_cpu) {
> +			/*
> +			 * When on_rq && !on_cpu the task is preempted, see if
> +			 * it should preempt whatever is current there now.
> +			 */
> +			update_rq_clock(rq);
> +			check_preempt_curr(rq, p, wake_flags);
> +		}
> +		WRITE_ONCE(p->__state, TASK_RUNNING);
> +		trace_sched_wakeup(p);
>  		ret = 1;
>  	}
>  	__task_rq_unlock(rq, &rf);
  

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 87c9cdf37a26..3785418de127 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3718,9 +3718,8 @@  static int ttwu_runnable(struct task_struct *p, int wake_flags)
 
 	rq = __task_rq_lock(p, &rf);
 	if (task_on_rq_queued(p)) {
-		/* check_preempt_curr() may use rq clock */
-		update_rq_clock(rq);
-		ttwu_do_wakeup(rq, p, wake_flags, &rf);
+		WRITE_ONCE(p->__state, TASK_RUNNING);
+		trace_sched_wakeup(p);
 		ret = 1;
 	}
 	__task_rq_unlock(rq, &rf);