Message ID | 20230811112044.3302588-1-vschneid@redhat.com |
---|---|
State | New |
Headers |
From: Valentin Schneider <vschneid@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Ingo Molnar <mingo@redhat.com>, Juri Lelli <juri.lelli@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Steven Rostedt <rostedt@goodmis.org>, Vincent Guittot <vincent.guittot@linaro.org>, Thomas Gleixner <tglx@linutronix.de>
Subject: [PATCH] sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask
Date: Fri, 11 Aug 2023 12:20:44 +0100
Message-Id: <20230811112044.3302588-1-vschneid@redhat.com>
Series | sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask |
Commit Message
Valentin Schneider
Aug. 11, 2023, 11:20 a.m. UTC
Sebastian noted that the rto_push_work IRQ work can be queued for a CPU
that has an empty pushable_tasks list, which means nothing useful will be
done in the IPI other than queue the work for the next CPU on the rto_mask.
rto_push_irq_work_func() only operates on tasks in the pushable_tasks list,
but the conditions for that irq_work to be queued (and for a CPU to be
added to the rto_mask) rely on rt_rq->nr_migratory instead.
nr_migratory is increased whenever an RT task entity is enqueued and it has
nr_cpus_allowed > 1. Unlike the pushable_tasks list, nr_migratory includes a
rt_rq's current task. This means a rt_rq can have a migratable current, N
non-migratable queued tasks, and be flagged as overloaded / have its CPU
set in the rto_mask, despite having an empty pushable_tasks list.
Make an rt_rq's overload logic be driven by {enqueue,dequeue}_pushable_task().
Since rt_rq->{rt_nr_migratory,rt_nr_total} become unused, remove them.
Note that the case where the current task is pushed away to make way for a
migration-disabled task remains unchanged: the migration-disabled task has
to be in the pushable_tasks list in the first place, which means it has
nr_cpus_allowed > 1.
Link: http://lore.kernel.org/r/20230801152648._y603AS_@linutronix.de
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
This is lightly tested and looks to be working OK, but I don't have, nor am
I aware of, a test case for RT balancing. I suppose we want something that
asserts we always run the N highest-priority tasks for N CPUs, with a small
margin for migrations? (A rough userspace sketch of such a check is
included after the diffstat below.)
---
kernel/sched/debug.c | 3 --
kernel/sched/rt.c | 70 +++++++-------------------------------------
kernel/sched/sched.h | 2 --
3 files changed, 10 insertions(+), 65 deletions(-)
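
On the test-case question raised in the notes above: what follows is a rough userspace sketch (an illustration only, not part of the patch; the priorities, runtime and output format are arbitrary) of the kind of assertion described, i.e. that on N CPUs the N highest-priority FIFO spinners get essentially all the CPU time while a lower-priority FIFO task is starved:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static volatile int stop;

/* Busy-spin until told to stop, then record how much CPU time we got. */
static void *spin(void *arg)
{
        struct timespec *used = arg;

        while (!stop)
                ;
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, used);
        return NULL;
}

static pthread_t start_spinner(int prio, struct timespec *used)
{
        struct sched_param sp = { .sched_priority = prio };
        pthread_attr_t attr;
        pthread_t tid;

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        pthread_attr_setschedparam(&attr, &sp);
        if (pthread_create(&tid, &attr, spin, used)) {
                perror("pthread_create (missing rtprio privileges?)");
                exit(1);
        }
        return tid;
}

int main(void)
{
        int i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        struct sched_param sp = { .sched_priority = 90 };
        struct timespec hi[ncpus], lo;
        pthread_t hi_tid[ncpus], lo_tid;

        /* Keep the orchestrating thread above the spinners so it can't starve. */
        if (sched_setscheduler(0, SCHED_FIFO, &sp))
                perror("sched_setscheduler");

        /* N high-prio spinners (one per CPU's worth of work) + 1 low-prio one. */
        for (i = 0; i < ncpus; i++)
                hi_tid[i] = start_spinner(80, &hi[i]);
        lo_tid = start_spinner(10, &lo);

        sleep(2);
        stop = 1;

        for (i = 0; i < ncpus; i++)
                pthread_join(hi_tid[i], NULL);
        pthread_join(lo_tid, NULL);

        /*
         * Expectation: each high-prio spinner got ~2s of CPU time, the
         * low-prio one close to none (modulo a small migration margin and
         * whatever RT throttling hands back to non-RT tasks).
         */
        for (i = 0; i < ncpus; i++)
                printf("hi[%d]: %ld.%03ld s\n", i, (long)hi[i].tv_sec, hi[i].tv_nsec / 1000000);
        printf("lo   : %ld.%03ld s\n", (long)lo.tv_sec, lo.tv_nsec / 1000000);

        return 0;
}

The spinners are deliberately left unpinned so the outcome depends on the RT push/pull balancing paths touched by this patch; run as root (or with an rtprio rlimit) and link with -lpthread.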
Comments
On 2023-09-11 12:54:50 [+0200], Valentin Schneider wrote:
> Ok, back to this :)
>
> On 15/08/23 16:21, Sebastian Andrzej Siewior wrote:
> > What I still observe is:
> > - CPU0 is idle. CPU0 gets a task assigned from CPU1. That task receives
> >   a wakeup. CPU0 returns from idle and schedules the task.
> >   pull_rt_task() on CPU1 and sometimes on other CPU observe this, too.
> >   CPU1 sends irq_work to CPU0 while at the time rto_next_cpu() sees that
> >   has_pushable_tasks() return 0. That bit was cleared earlier (as per
> >   tracing).
> >
> > - CPU0 is idle. CPU0 gets a task assigned from CPU1. The task on CPU0 is
> >   woken up without an IPI (yay). But then pull_rt_task() decides that
> >   send irq_work and has_pushable_tasks() said that is has tasks left
> >   so….
> >   Now: rto_push_irq_work_func() run once once on CPU0, does nothing,
> >   rto_next_cpu() return CPU0 again and enqueues itself again on CPU0.
> >   Usually after the second or third round the scheduler on CPU0 makes
> >   enough progress to remove the task/ clear the CPU from mask.
> >
>
> If CPU0 is selected for the push IPI, then we should have
>
>   rd->rto_cpu == CPU0
>
> So per the
>
>   cpumask_next(rd->rto_cpu, rd->rto_mask);
>
> in rto_next_cpu(), it shouldn't be able to re-select itself.
>
> Do you have a simple enough reproducer I could use to poke at this?

Not really a reproducer. What I had earlier was a high priority RT task
(ntpsec at prio 99) and cyclictest below it (prio 90). And PREEMPT_RT which
adds a few tasks (due to threaded interrupts). Then I added trace-printks
to observe. Initially I had latency spikes due to ntpsec but also a bunch
IRQ-work-IPIs which I decided to look at.

> > I understand that there is a race and the CPU is cleared from rto_mask
> > shortly after checking. Therefore I would suggest to look at
> > has_pushable_tasks() before returning a CPU in rto_next_cpu() as I did
> > just to avoid the interruption which does nothing.
> >
> > For the second case the irq_work seems to make no progress. I don't see
> > any trace_events in hardirq, the mask is cleared outside hardirq (idle
> > code). The NEED_RESCHED bit is set for current therefore it doesn't make
> > sense to send irq_work to reschedule if the current already has this on
> > its agenda.
> >
> > So what about something like:
> >
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index 00e0e50741153..d963408855e25 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2247,8 +2247,23 @@ static int rto_next_cpu(struct root_domain *rd)
> >
> >  		rd->rto_cpu = cpu;
> >
> > -		if (cpu < nr_cpu_ids)
> > +		if (cpu < nr_cpu_ids) {
> > +			struct task_struct *t;
> > +
> > +			if (!has_pushable_tasks(cpu_rq(cpu)))
> > +				continue;
> > +
>
> IIUC that's just to plug the race between the CPU emptying its
> pushable_tasks list and it removing itself from the rto_mask - that looks
> fine to me.
>
> > +			rcu_read_lock();
> > +			t = rcu_dereference(rq->curr);
> > +			/* if (test_preempt_need_resched_cpu(cpu_rq(cpu))) */
> > +			if (test_tsk_need_resched(t)) {
>
> We need to make sure this doesn't cause us to loose IPIs we actually need.
>
> We do have a call to put_prev_task_balance() through entering __schedule()
> if the previous task is RT/DL, and balance_rt() can issue a push IPI, but
> AFAICT only if the previous task was the last DL task. So I don't think we
> can do this.

I observed that the CPU/ task on that CPU already had the need-resched bit
set so a task-switch is in progress. Therefore it looks like any further
IPIs are needless because the IRQ-work IPI just "leave early" via
resched_curr() and don't do anything useful. So they don't contribute
anything but stall the CPU from making progress and performing the actual
context switch.

> > +				rcu_read_unlock();
> > +				continue;
> > +			}
> > +			rcu_read_unlock();
> > +
> >  			return cpu;
> > +		}
> >
> >  		rd->rto_cpu = -1;

Sebastian
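
For reference, the uncontroversial half of the proposal above, skipping rto_mask CPUs whose pushable list has already been emptied, would sit in rto_next_cpu()'s selection loop roughly as follows. This is a sketch of the idea against the loop shape quoted above, not a posted patch:

	for (;;) {
		/* When rto_cpu is -1 this acts like cpumask_first() */
		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);

		rd->rto_cpu = cpu;

		if (cpu < nr_cpu_ids) {
			/*
			 * The CPU may have emptied its pushable_tasks list but
			 * not yet cleared its rto_mask bit; an IPI would find
			 * nothing to push, so move on to the next candidate.
			 */
			if (!has_pushable_tasks(cpu_rq(cpu)))
				continue;

			return cpu;
		}

		rd->rto_cpu = -1;
		/* ... rto_loop / rto_loop_next termination handling unchanged ... */
	}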
* Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> On 2023-08-11 12:20:44 [+0100], Valentin Schneider wrote:
> > Sebastian noted that the rto_push_work IRQ work can be queued for a CPU
> > that has an empty pushable_tasks list, which means nothing useful will be
> > done in the IPI other than queue the work for the next CPU on the rto_mask.
> >
> > [ rest of the commit message, tags and the "lightly tested" note quoted
> >   in full ]
>
> I don't see the storm of IPIs I saw before. So as far that goes:
> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

I've applied Valentin's initial fix to tip:sched/core, for an eventual
v6.7 merge, as it addresses the IPI storm bug. Let me know if merging
this is not desirable for some reason.

> What I still observe is:
> - CPU0 is idle. CPU0 gets a task assigned from CPU1. That task receives
>   a wakeup. CPU0 returns from idle and schedules the task.
>   pull_rt_task() on CPU1 and sometimes on other CPU observe this, too.
>   CPU1 sends irq_work to CPU0 while at the time rto_next_cpu() sees that
>   has_pushable_tasks() return 0. That bit was cleared earlier (as per
>   tracing).
>
> - CPU0 is idle. CPU0 gets a task assigned from CPU1. The task on CPU0 is
>   woken up without an IPI (yay). But then pull_rt_task() decides that
>   send irq_work and has_pushable_tasks() said that is has tasks left
>   so….
>   Now: rto_push_irq_work_func() run once once on CPU0, does nothing,
>   rto_next_cpu() return CPU0 again and enqueues itself again on CPU0.
>   Usually after the second or third round the scheduler on CPU0 makes
>   enough progress to remove the task/ clear the CPU from mask.

Just curious, any progress on solving this?

Thanks,

	Ingo
On Mon, Sep 25, 2023 at 08:55:10AM -0000, tip-bot2 for Valentin Schneider wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID:     612f769edd06a6e42f7cd72425488e68ddaeef0a
> Gitweb:        https://git.kernel.org/tip/612f769edd06a6e42f7cd72425488e68ddaeef0a
> Author:        Valentin Schneider <vschneid@redhat.com>
> AuthorDate:    Fri, 11 Aug 2023 12:20:44 +01:00
> Committer:     Ingo Molnar <mingo@kernel.org>
> CommitterDate: Mon, 25 Sep 2023 10:25:29 +02:00
>
> sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask
>
> [ commit message quoted in full ]
>
> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Link: https://lore.kernel.org/r/20230811112044.3302588-1-vschneid@redhat.com
> ---
>  kernel/sched/debug.c |  3 +--
>  kernel/sched/rt.c    | 70 ++++++-------------------------------------
>  kernel/sched/sched.h |  2 +-
>  3 files changed, 10 insertions(+), 65 deletions(-)
>
> @@ -358,53 +357,6 @@ static inline void rt_clear_overload(struct rq *rq)
>  	cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask);
>  }
>
> [ hunk removing update_rt_migration() and {inc,dec}_rt_migration() quoted
>   in full ]

sched/deadline.c has something very similar, does that need updating
too?
On 25/09/23 10:27, Ingo Molnar wrote:
> * Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>
>> On 2023-08-11 12:20:44 [+0100], Valentin Schneider wrote:
>>
>> [ commit message and "lightly tested" note quoted in full ]
>>
>> I don't see the storm of IPIs I saw before. So as far that goes:
>> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> I've applied Valentin's initial fix to tip:sched/core, for an eventual
> v6.7 merge, as it addresses the IPI storm bug. Let me know if merging
> this is not desirable for some reason.
>
>> What I still observe is:
>>
>> [ the two remaining-IPI scenarios quoted in full ]
>
> Just curious, any progress on solving this?
>

On my side not really, I need to stop getting distracted and probably get
this to reproduce on a system so I can understand-by-tracing

> Thanks,
>
> Ingo
On 25/09/23 12:11, Peter Zijlstra wrote:
> On Mon, Sep 25, 2023 at 08:55:10AM -0000, tip-bot2 for Valentin Schneider wrote:
>> The following commit has been merged into the sched/core branch of tip:
>>
>> Commit-ID:     612f769edd06a6e42f7cd72425488e68ddaeef0a
>>
>> sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask
>>
>> [ commit message, tags, diffstat and the update_rt_migration() /
>>   {inc,dec}_rt_migration() removal hunk quoted in full ]
>
> sched/deadline.c has something very similar, does that need updating
> too?

Hm I think so yes:
- push_dl_task() is an obvious noop if the pushable tree is empty
- pull_dl_task() can be kicked if !rq->dl.overloaded, which similarly to rt
  is driven by nr_migratory but could be boiled down to having pushable
  tasks (due to the nr_cpus_allowed > 1 constraint).

Lemme poke at it.
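
Mirroring the RT patch, the direction described here for kernel/sched/deadline.c would be to drive dl_rq->overloaded (and hence the dlo_mask) from the pushable_dl_tasks tree. Roughly, as a sketch of the approach rather than the exact change that eventually got merged:

--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
 	/* ... insert p into the pushable_dl_tasks rb-tree ... */
+
+	if (!rq->dl.overloaded) {
+		dl_set_overload(rq);
+		rq->dl.overloaded = 1;
+	}
 }

@@ static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
 	/* ... remove p from the pushable_dl_tasks rb-tree ... */
+
+	if (!has_pushable_dl_tasks(rq) && rq->dl.overloaded) {
+		dl_clear_overload(rq);
+		rq->dl.overloaded = 0;
+	}
 }

dl_rq->{dl_nr_migratory,dl_nr_total} and {inc,dec}_dl_migration() would then become unused and removable, just like their RT counterparts.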
* Valentin Schneider <vschneid@redhat.com> wrote:

> > sched/deadline.c has something very similar, does that need updating
> > too?
>
> Hm I think so yes:
> - push_dl_task() is an obvious noop if the pushable tree is empty
> - pull_dl_task() can be kicked if !rq->dl.overloaded, which similarly to rt
>   is driven by nr_migratory but could be boiled down to having pushable
>   tasks (due to the nr_cpus_allowed > 1 constraint).
>
> Lemme poke at it.

For reference, the matching DL change now lives in tip:sched/core as:

  177f608a4c82 ("sched/deadline: Make dl_rq->pushable_dl_tasks update drive dl_rq->overloaded")

Thanks,

	Ingo
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4c3d0d9f3db63..b4f6fb592a123 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -724,9 +724,6 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(rt_rq->x))
 
 	PU(rt_nr_running);
-#ifdef CONFIG_SMP
-	PU(rt_nr_migratory);
-#endif
 	P(rt_throttled);
 	PN(rt_time);
 	PN(rt_runtime);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 00e0e50741153..12100f3b6e5f2 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -143,7 +143,6 @@ void init_rt_rq(struct rt_rq *rt_rq)
 #if defined CONFIG_SMP
 	rt_rq->highest_prio.curr = MAX_RT_PRIO-1;
 	rt_rq->highest_prio.next = MAX_RT_PRIO-1;
-	rt_rq->rt_nr_migratory = 0;
 	rt_rq->overloaded = 0;
 	plist_head_init(&rt_rq->pushable_tasks);
 #endif /* CONFIG_SMP */
@@ -358,53 +357,6 @@ static inline void rt_clear_overload(struct rq *rq)
 	cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask);
 }
 
-static void update_rt_migration(struct rt_rq *rt_rq)
-{
-	if (rt_rq->rt_nr_migratory && rt_rq->rt_nr_total > 1) {
-		if (!rt_rq->overloaded) {
-			rt_set_overload(rq_of_rt_rq(rt_rq));
-			rt_rq->overloaded = 1;
-		}
-	} else if (rt_rq->overloaded) {
-		rt_clear_overload(rq_of_rt_rq(rt_rq));
-		rt_rq->overloaded = 0;
-	}
-}
-
-static void inc_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
-{
-	struct task_struct *p;
-
-	if (!rt_entity_is_task(rt_se))
-		return;
-
-	p = rt_task_of(rt_se);
-	rt_rq = &rq_of_rt_rq(rt_rq)->rt;
-
-	rt_rq->rt_nr_total++;
-	if (p->nr_cpus_allowed > 1)
-		rt_rq->rt_nr_migratory++;
-
-	update_rt_migration(rt_rq);
-}
-
-static void dec_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
-{
-	struct task_struct *p;
-
-	if (!rt_entity_is_task(rt_se))
-		return;
-
-	p = rt_task_of(rt_se);
-	rt_rq = &rq_of_rt_rq(rt_rq)->rt;
-
-	rt_rq->rt_nr_total--;
-	if (p->nr_cpus_allowed > 1)
-		rt_rq->rt_nr_migratory--;
-
-	update_rt_migration(rt_rq);
-}
-
 static inline int has_pushable_tasks(struct rq *rq)
 {
 	return !plist_head_empty(&rq->rt.pushable_tasks);
@@ -438,6 +390,11 @@ static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
 	/* Update the highest prio pushable task */
 	if (p->prio < rq->rt.highest_prio.next)
 		rq->rt.highest_prio.next = p->prio;
+
+	if (!rq->rt.overloaded) {
+		rt_set_overload(rq);
+		rq->rt.overloaded = 1;
+	}
 }
 
 static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
@@ -451,6 +408,11 @@ static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
 		rq->rt.highest_prio.next = p->prio;
 	} else {
 		rq->rt.highest_prio.next = MAX_RT_PRIO-1;
+
+		if (rq->rt.overloaded) {
+			rt_clear_overload(rq);
+			rq->rt.overloaded = 0;
+		}
 	}
 }
 
@@ -464,16 +426,6 @@ static inline void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
 {
 }
 
-static inline
-void inc_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
-{
-}
-
-static inline
-void dec_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
-{
-}
-
 static inline void rt_queue_push_tasks(struct rq *rq)
 {
 }
@@ -1281,7 +1233,6 @@ void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 		rt_rq->rr_nr_running += rt_se_rr_nr_running(rt_se);
 
 	inc_rt_prio(rt_rq, prio);
-	inc_rt_migration(rt_se, rt_rq);
 	inc_rt_group(rt_se, rt_rq);
 }
 
@@ -1294,7 +1245,6 @@ void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 		rt_rq->rr_nr_running -= rt_se_rr_nr_running(rt_se);
 
 	dec_rt_prio(rt_rq, rt_se_prio(rt_se));
-	dec_rt_migration(rt_se, rt_rq);
 	dec_rt_group(rt_se, rt_rq);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9c5035ca3b06d..bea6a9ec8cde0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -675,8 +675,6 @@ struct rt_rq {
 	} highest_prio;
 #endif
 #ifdef CONFIG_SMP
-	unsigned int		rt_nr_migratory;
-	unsigned int		rt_nr_total;
 	int			overloaded;
 	struct plist_head	pushable_tasks;
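
For context on where the overload state maintained above is consumed (a path this patch leaves unchanged), a condensed excerpt of the pull side; the layout below is abridged and only meant to show the control flow that ends up in the rto_push_work IPI discussed in the thread:

static void pull_rt_task(struct rq *this_rq)
{
	int rt_overload_count = rt_overloaded(this_rq);

	if (likely(!rt_overload_count))
		return;

	smp_rmb();

	/* If we are the only overloaded CPU do nothing */
	if (rt_overload_count == 1 &&
	    cpumask_test_cpu(this_rq->cpu, this_rq->rd->rto_mask))
		return;

#ifdef HAVE_RT_PUSH_IPI
	if (sched_feat(RT_PUSH_IPI)) {
		/* Kicks rto_push_irq_work_func() -> rto_next_cpu() walking the rto_mask */
		tell_cpu_to_push(this_rq);
		return;
	}
#endif

	/* ... otherwise iterate the rto_mask and try to pull directly ... */
}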