sched/rt: Don't try push tasks if there are none.

Message ID 20230801152648._y603AS_@linutronix.de
State New
Series sched/rt: Don't try push tasks if there are none.

Commit Message

Sebastian Andrzej Siewior Aug. 1, 2023, 3:26 p.m. UTC
I have an RT task X at a high priority and a cyclictest thread on each
CPU with a lower priority than X's. If X is active and each CPU wakes
its own cyclictest thread then it ends in a long rto_push storm.
A random CPU determines via balance_rt() that the CPU on which X is
running needs to push tasks. X has the highest priority and cyclictest
is next in line, so there is nothing that can be done since the task
with the higher priority is not touched.

tell_cpu_to_push() increments rto_loop_next and schedules
rto_push_irq_work_func() on X's CPU. The other CPUs also increment the
loop counter and do the same. Once rto_push_irq_work_func() is active it
does nothing because it has _no_ pushable tasks on its runqueue. It then
checks rto_next_cpu() and decides to queue irq_work on the local CPU
because another CPU requested a push by incrementing the counter.

I have traces where ~30 CPUs request this ~3 times each before it
finally ends. This greatly increases X's runtime while X isn't making
much progress.

Teach rto_next_cpu() to only return CPUs which also have tasks on their
runqueue which can be pushed away. This does not reduce the number of
tell_cpu_to_push() invocations (rto_loop_next counter increments) but
reduces the number of rto_push_irq_work_func() invocations issued when
nothing can be done. As a result the overloaded CPU is interrupted less
often.

There are still cases where the "same job" is repeated several times
(for instance the current CPU needs to resched but hasn't yet because
the irq-work is repeated a few times and so the old task remains on the
CPU) but the majority of requests now end in tell_cpu_to_push() before
an IPI is issued.
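To make the walk concrete, here is a userspace sketch, plain C and explicitly not kernel code: `sim_rd`, `sim_drain`, the eight-CPU setup and the simplified loop-counter handling are made-up stand-ins for the real root_domain state. The `patched` branch mirrors the hunk below; the rest approximates the behaviour described above, with static pushable state instead of tasks actually being pushed away.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

struct sim_rd {
	int rto_cpu;		/* cursor into rto_mask, -1 when idle */
	int rto_loop;
	int rto_loop_next;	/* bumped by tell_cpu_to_push() */
	bool rto_mask[NR_CPUS];	/* CPUs flagged as overloaded */
	bool has_pushable[NR_CPUS];
};

/* Next CPU after 'cur' that is set in rto_mask, or NR_CPUS if none. */
static int next_masked_cpu(struct sim_rd *rd, int cur)
{
	for (int cpu = cur + 1; cpu < NR_CPUS; cpu++)
		if (rd->rto_mask[cpu])
			return cpu;
	return NR_CPUS;
}

static int sim_rto_next_cpu(struct sim_rd *rd, bool patched)
{
	for (;;) {
		int cpu = next_masked_cpu(rd, rd->rto_cpu);

		rd->rto_cpu = cpu;
		if (cpu < NR_CPUS) {
			/* The fix: skip CPUs whose irq_work could do nothing. */
			if (patched && !rd->has_pushable[cpu])
				continue;
			return cpu;
		}

		rd->rto_cpu = -1;
		/* Another CPU requested a push meanwhile: one more pass. */
		if (rd->rto_loop == rd->rto_loop_next)
			return -1;
		rd->rto_loop = rd->rto_loop_next;
	}
}

/* Count how many irq_work dispatches one full drain of the mask costs. */
static int sim_drain(struct sim_rd *rd, bool patched)
{
	int dispatches = 0;

	while (sim_rto_next_cpu(rd, patched) >= 0)
		dispatches++;
	return dispatches;
}
```

With seven CPUs flagged overloaded, only one of them holding a pushable task and the loop counter bumped while draining, the unpatched walk dispatches irq_work on every flagged CPU for two full passes, while the patched walk only targets the CPU that actually has work.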

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/sched/rt.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
  

Comments

Steven Rostedt Aug. 1, 2023, 4 p.m. UTC | #1
On Tue, 1 Aug 2023 17:26:48 +0200
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 00e0e50741153..338cd150973ff 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2247,8 +2247,11 @@ static int rto_next_cpu(struct root_domain *rd)
>  
>  		rd->rto_cpu = cpu;
>  
> -		if (cpu < nr_cpu_ids)
> +		if (cpu < nr_cpu_ids) {
> +			if (!has_pushable_tasks(cpu_rq(cpu)))
> +				continue;

Hmm, I had to read this again to make sure we can't get into an infinite
loop. But it looks like we are safe: if none of the CPUs have pushable
tasks, we still hit the cpu == nr_cpu_ids case, fall into the loop
check, and break out there.

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve


>  			return cpu;
> +		}
>  
>  		rd->rto_cpu = -1;
>
  
Valentin Schneider Aug. 9, 2023, 5:02 p.m. UTC | #2
On 01/08/23 17:26, Sebastian Andrzej Siewior wrote:
> I have a RT task X at a high priority and cyclictest on each CPU with
> lower priority than X's. If X is active and each CPU wakes their own
> cylictest thread then it ends in a longer rto_push storm.
> A random CPU determines via balance_rt() that the CPU on which X is
> running needs to push tasks. X has the highest priority, cyclictest is
> next in line so there is nothing that can be done since the task with
> the higher priority is not touched.
>
> tell_cpu_to_push() increments rto_loop_next and schedules
> rto_push_irq_work_func() on X's CPU. The other CPUs also increment the
> loop counter and do the same. Once rto_push_irq_work_func() is active it
> does nothing because it has _no_ pushable tasks on its runqueue. Then
> checks rto_next_cpu() and decides to queue irq_work on the local CPU
> because another CPU requested a push by incrementing the counter.
>

For a CPU to be in the rto_mask, it needs:

  rt_rq->rt_nr_migratory && rt_rq->rt_nr_total > 1

But if that CPU has no pushable tasks, then that means only the current
task has p->nr_cpus_allowed > 1.

Should we change it so a CPU is only in the rto_mask iff it has pushable
tasks? AFAICT that should not break the case where we push the current task
away due to migration_disabled, as that still relies on the
migration_disabled task to be in the pushable list.
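As a sketch of the divergence this proposal closes (plain C, not kernel code; `sim_rt_rq` and its fields are simplified stand-ins for the real rt_rq counters and pushable plist), the storm scenario is exactly the case where the two rules disagree: current task X is the only migratory task, so the existing condition flags the CPU overloaded even though its pushable list is empty.

```c
#include <assert.h>
#include <stdbool.h>

struct sim_rt_rq {
	int rt_nr_total;	/* queued RT tasks, including current */
	int rt_nr_migratory;	/* tasks with nr_cpus_allowed > 1 */
	int nr_pushable;	/* length of the pushable_tasks plist */
};

/* Existing rule: overloaded when more than one RT task is queued and at
 * least one of them could run elsewhere. */
static bool rto_mask_current(const struct sim_rt_rq *rt)
{
	return rt->rt_nr_migratory && rt->rt_nr_total > 1;
}

/* Proposed rule: overloaded only when something is actually pushable. */
static bool rto_mask_proposed(const struct sim_rt_rq *rt)
{
	return rt->nr_pushable > 0;
}
```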
  
Sebastian Andrzej Siewior Aug. 10, 2023, 8:07 a.m. UTC | #3
On 2023-08-09 18:02:32 [+0100], Valentin Schneider wrote:
> On 01/08/23 17:26, Sebastian Andrzej Siewior wrote:
> > [...]
> 
> For a CPU to be in the rto_mask, it needs:
> 
>   rt_rq->rt_nr_migratory && rt_rq->rt_nr_total > 1
> 
> But if that CPU has no pushable tasks, then that means only the current
> task has p->nr_cpus_allowed > 1.
> 
> Should we change it so a CPU is only in the rto_mask iff it has pushable
> tasks? AFAICT that should not break the case where we push the current task
> away due to migration_disabled, as that still relies on the
> migration_disabled task to be in the pushable list.

Sounds good. The task with the highest priority becomes pushable if it
gets preempted (by a task with higher priority). This gets considered,
right?

Sebastian
  
Valentin Schneider Aug. 10, 2023, 9:09 a.m. UTC | #4
On 10/08/23 10:07, Sebastian Andrzej Siewior wrote:
> On 2023-08-09 18:02:32 [+0100], Valentin Schneider wrote:
>> On 01/08/23 17:26, Sebastian Andrzej Siewior wrote:
>> > [...]
>>
>> For a CPU to be in the rto_mask, it needs:
>>
>>   rt_rq->rt_nr_migratory && rt_rq->rt_nr_total > 1
>>
>> But if that CPU has no pushable tasks, then that means only the current
>> task has p->nr_cpus_allowed > 1.
>>
>> Should we change it so a CPU is only in the rto_mask iff it has pushable
>> tasks? AFAICT that should not break the case where we push the current task
>> away due to migration_disabled, as that still relies on the
>> migration_disabled task to be in the pushable list.
>
> Sounds good. The task with the highest priority becomes pushable if it
> gets preempted (by a task with higher priority). This gets considered,
> right?
>

Yeah, when we switch the current task in put_prev_task_rt() we add the
previously-running (IOW preempted) task to the pushable list (if it
isn't affined to just one CPU).
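That behaviour can be sketched as follows (plain C, not kernel code; `sim_task` and `sim_rq` are made-up stand-ins, and the condition mirrors the nr_cpus_allowed > 1 check described above):

```c
#include <assert.h>
#include <stdbool.h>

struct sim_task {
	int nr_cpus_allowed;
	bool on_pushable_list;
};

struct sim_rq {
	int nr_pushable;
};

/* Sketch of what happens to the task being switched out: a preempted
 * task that may run elsewhere joins the pushable list, a task affined
 * to a single CPU does not. */
static void sim_put_prev_task_rt(struct sim_rq *rq, struct sim_task *prev)
{
	if (prev->nr_cpus_allowed > 1) {
		prev->on_pushable_list = true;
		rq->nr_pushable++;
	}
}
```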

Now it looks like if we change update_rt_migration() to look at the
pushable list instead of rt_nr_migratory, we could get rid of
rt_nr_migratory itself - let me poke at this a little, I might just be
about to figure out why that isn't already the case...

> Sebastian
  

Patch

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 00e0e50741153..338cd150973ff 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2247,8 +2247,11 @@ static int rto_next_cpu(struct root_domain *rd)
 
 		rd->rto_cpu = cpu;
 
-		if (cpu < nr_cpu_ids)
+		if (cpu < nr_cpu_ids) {
+			if (!has_pushable_tasks(cpu_rq(cpu)))
+				continue;
 			return cpu;
+		}
 
 		rd->rto_cpu = -1;