Message ID: 20221123012104.3317665-2-joel@joelfernandes.org
State: New
Headers:
From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
To: linux-kernel@vger.kernel.org
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>, Ben Segall <bsegall@google.com>, Daniel Bristot de Oliveira <bristot@redhat.com>, Davidlohr Bueso <dave@stgolabs.net>, Ingo Molnar <mingo@redhat.com>, Josh Triplett <josh@joshtriplett.org>, Juri Lelli <juri.lelli@redhat.com>, Mel Gorman <mgorman@suse.de>, "Paul E. McKenney" <paulmck@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Steven Rostedt <rostedt@goodmis.org>, Valentin Schneider <vschneid@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, kernel-team@android.com, John Stultz <jstultz@google.com>, Joel Fernandes <joelaf@google.com>, Qais Yousef <qais.yousef@arm.com>, Will Deacon <will@kernel.org>, Waiman Long <longman@redhat.com>, Boqun Feng <boqun.feng@gmail.com>, "Connor O'Brien" <connoro@google.com>
Subject: [PATCH RFC 1/3] sched/pe: Exclude balance callback queuing during proxy()'s migrate
Date: Wed, 23 Nov 2022 01:21:02 +0000
Message-Id: <20221123012104.3317665-2-joel@joelfernandes.org>
In-Reply-To: <20221123012104.3317665-1-joel@joelfernandes.org>
References: <20221123012104.3317665-1-joel@joelfernandes.org>
Series: Patches on top of PE series
Commit Message
Joel Fernandes
Nov. 23, 2022, 1:21 a.m. UTC
In commit 565790d28b1e ("sched: Fix balance_callback()"), it is clear
that the rq lock needs to be held from when balance callbacks are queued
to when __balance_callbacks() in schedule() is called.
This has to be done without dropping the runqueue lock in between. If it
is dropped between queuing and executing callbacks, it is possible that
another CPU, say in __sched_setscheduler(), can queue balancing callbacks
and cause issues as shown in that commit.
This is precisely what happens in proxy(). During a proxy(), the current
CPU's rq lock is temporarily dropped when moving the tasks in the migrate
list to the owner CPU.
This commit therefore makes proxy() exclude balance callback queuing on
other CPUs in the section where proxy() temporarily drops the rq lock of
the current CPU.
CPUs that acquire a remote CPU's rq lock but later queue a balance
callback are made to call the new helpers in this commit to check
whether balance_lock is held. If it is held, the rq lock is released
and another attempt is made to acquire it, in the hope that the ban on
balance callback queuing has been lifted.
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
kernel/sched/core.c | 72 ++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 7 ++++-
2 files changed, 76 insertions(+), 3 deletions(-)
Comments
On 23/11/2022 02:21, Joel Fernandes (Google) wrote: > In commit 565790d28b1e ("sched: Fix balance_callback()"), it is clear > that rq lock needs to be held from when balance callbacks are queued to > when __balance_callbacks() in schedule() is called. > > This has to be done without dropping the runqueue lock in between. If > dropped between queuing and executing callbacks, it is possible that > another CPU, say in __sched_setscheduler() can queue balancing callbacks > and cause issues as show in that commit. > > This is precisely what happens in proxy(). During a proxy(), the current > CPU's rq lock is temporary dropped when moving the tasks in the migrate > list to the owner CPU. > > This commit therefore make proxy() exclude balance callback queuing on > other CPUs, in the section where proxy() temporarily drops the rq lock > of current CPU. > > CPUs that acquire a remote CPU's rq lock but later queue a balance > callback, are made to call the new helpers in this commit to check > whether balance_lock is held. If it is held, then the rq lock is > released and a re-attempt is made to acquire it in the hopes that > the ban on balancing callback queuing has elapsed. > > Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > --- > kernel/sched/core.c | 72 ++++++++++++++++++++++++++++++++++++++++++-- > kernel/sched/sched.h | 7 ++++- > 2 files changed, 76 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 88a5fa34dc06..aba90b3dc3ef 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -633,6 +633,29 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf) > } > } > > +/* > + * Helper to call __task_rq_lock safely, in scenarios where we might be about to > + * queue a balance callback on a remote CPU. That CPU might be in proxy(), and > + * could have released its rq lock while holding balance_lock. 
So release rq > + * lock in such a situation to avoid deadlock, and retry. > + */ > +struct rq *__task_rq_lock_balance(struct task_struct *p, struct rq_flags *rf) > +{ > + struct rq *rq; > + bool locked = false; > + > + do { > + if (locked) { > + __task_rq_unlock(rq, rf); > + cpu_relax(); > + } > + rq = __task_rq_lock(p, rf); > + locked = true; > + } while (raw_spin_is_locked(&rq->balance_lock)); > + > + return rq; > +} > + > /* > * task_rq_lock - lock p->pi_lock and lock the rq @p resides on. > */ > @@ -675,6 +698,29 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf) > } > } > > +/* > + * Helper to call task_rq_lock safely, in scenarios where we might be about to > + * queue a balance callback on a remote CPU. That CPU might be in proxy(), and > + * could have released its rq lock while holding balance_lock. So release rq > + * lock in such a situation to avoid deadlock, and retry. > + */ > +struct rq *task_rq_lock_balance(struct task_struct *p, struct rq_flags *rf) > +{ > + struct rq *rq; > + bool locked = false; > + > + do { > + if (locked) { > + task_rq_unlock(rq, p, rf); > + cpu_relax(); > + } > + rq = task_rq_lock(p, rf); > + locked = true; > + } while (raw_spin_is_locked(&rq->balance_lock)); > + > + return rq; > +} > + > /* > * RQ-clock updating methods: > */ > @@ -6739,6 +6785,12 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf) > p->wake_cpu = wake_cpu; > } > > + /* > + * Prevent other CPUs from queuing balance callbacks while we migrate > + * tasks in the migrate_list with the rq lock released. > + */ > + raw_spin_lock(&rq->balance_lock); > + > rq_unpin_lock(rq, rf); > raw_spin_rq_unlock(rq); > raw_spin_rq_lock(that_rq); > @@ -6758,7 +6810,21 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf) > } > > raw_spin_rq_unlock(that_rq); > + > + /* > + * This may make lockdep unhappy as we acquire rq->lock with > + * balance_lock held. 
But that should be a false positive, as the > + * following pattern happens only on the current CPU with interrupts > + * disabled: > + * rq_lock() > + * balance_lock(); > + * rq_unlock(); > + * rq_lock(); > + */ > raw_spin_rq_lock(rq); > + > + raw_spin_unlock(&rq->balance_lock); > + > rq_repin_lock(rq, rf); > > return NULL; /* Retry task selection on _this_ CPU. */ > @@ -7489,7 +7555,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) > if (p->pi_top_task == pi_task && prio == p->prio && !dl_prio(prio)) > return; > > - rq = __task_rq_lock(p, &rf); > + rq = __task_rq_lock_balance(p, &rf); > update_rq_clock(rq); > /* > * Set under pi_lock && rq->lock, such that the value can be used under > @@ -8093,7 +8159,8 @@ static int __sched_setscheduler(struct task_struct *p, > * To be able to change p->policy safely, the appropriate > * runqueue lock must be held. > */ > - rq = task_rq_lock(p, &rf); > + rq = task_rq_lock_balance(p, &rf); > + > update_rq_clock(rq); > > /* You consider rt_mutex_setprio() and __sched_setscheduler() versus proxy() but what about all the other places like load_balance(), update_blocked_averages(), __set_cpus_allowed_ptr() and many more in which we take the rq lock (via task_rq_lock() or rq_lock{_xxx}())? With your changes to locktorture in {2-3}/3 you still run CFS lock_torture_writers but you should see the issue popping up in __set_cpus_allowed_ptr() (from torture_shuffle()) for example. Tried with: insmod /lib/modules/torture.ko insmod /lib/modules/locktorture.ko torture_type=mutex_lock rt_boost=1 rt_boost_factor=1 nlocks=3 ^^^^^^^^^^^^^^^^^ When changing all lock_torture_writer's to FIFO it becomes even more visible. 
-->8-- diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c index e4529c2166e9..ea75d525fe7c 100644 --- a/kernel/locking/locktorture.c +++ b/kernel/locking/locktorture.c @@ -683,7 +683,8 @@ static int lock_torture_writer(void *arg) DEFINE_TORTURE_RANDOM(rand); VERBOSE_TOROUT_STRING("lock_torture_writer task started"); - set_user_nice(current, MAX_NICE); + if (!rt_task(current)) + set_user_nice(current, MAX_NICE); do { if ((torture_random(&rand) & 0xfffff) == 0) diff --git a/kernel/torture.c b/kernel/torture.c index 1d0dd88369e3..55d8ac417a4a 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -57,6 +57,9 @@ module_param(verbose_sleep_duration, int, 0444); static int random_shuffle; module_param(random_shuffle, int, 0444); +static int lock_torture_writer_fifo; +module_param(lock_torture_writer_fifo, int, 0444); + static char *torture_type; static int verbose; @@ -734,7 +737,7 @@ bool stutter_wait(const char *title) cond_resched_tasks_rcu_qs(); spt = READ_ONCE(stutter_pause_test); for (; spt; spt = READ_ONCE(stutter_pause_test)) { - if (!ret) { + if (!ret && !rt_task(current)) { sched_set_normal(current, MAX_NICE); ret = true; } @@ -944,6 +947,11 @@ int _torture_create_kthread(int (*fn)(void *arg), void *arg, char *s, char *m, *tp = NULL; return ret; } + + if (lock_torture_writer_fifo && + !strncmp(s, "lock_torture_writer", strlen(s))) + sched_set_fifo(*tp); + wake_up_process(*tp); // Process is sleeping, so ordering provided. torture_shuffle_task_register(*tp); return ret; [...]
Hi Dietmar! On Fri, Dec 9, 2022 at 3:07 PM Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 23/11/2022 02:21, Joel Fernandes (Google) wrote: > > In commit 565790d28b1e ("sched: Fix balance_callback()"), it is clear > > that rq lock needs to be held from when balance callbacks are queued to > > when __balance_callbacks() in schedule() is called. > > > > This has to be done without dropping the runqueue lock in between. If > > dropped between queuing and executing callbacks, it is possible that > > another CPU, say in __sched_setscheduler() can queue balancing callbacks > > and cause issues as show in that commit. > > > > This is precisely what happens in proxy(). During a proxy(), the current > > CPU's rq lock is temporary dropped when moving the tasks in the migrate > > list to the owner CPU. > > > > This commit therefore make proxy() exclude balance callback queuing on > > other CPUs, in the section where proxy() temporarily drops the rq lock > > of current CPU. > > > > CPUs that acquire a remote CPU's rq lock but later queue a balance > > callback, are made to call the new helpers in this commit to check > > whether balance_lock is held. If it is held, then the rq lock is > > released and a re-attempt is made to acquire it in the hopes that > > the ban on balancing callback queuing has elapsed. 
> > > > Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > --- > > kernel/sched/core.c | 72 ++++++++++++++++++++++++++++++++++++++++++-- > > kernel/sched/sched.h | 7 ++++- > > 2 files changed, 76 insertions(+), 3 deletions(-) > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index 88a5fa34dc06..aba90b3dc3ef 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -633,6 +633,29 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf) > > } > > } > > > > +/* > > + * Helper to call __task_rq_lock safely, in scenarios where we might be about to > > + * queue a balance callback on a remote CPU. That CPU might be in proxy(), and > > + * could have released its rq lock while holding balance_lock. So release rq > > + * lock in such a situation to avoid deadlock, and retry. > > + */ > > +struct rq *__task_rq_lock_balance(struct task_struct *p, struct rq_flags *rf) > > +{ > > + struct rq *rq; > > + bool locked = false; > > + > > + do { > > + if (locked) { > > + __task_rq_unlock(rq, rf); > > + cpu_relax(); > > + } > > + rq = __task_rq_lock(p, rf); > > + locked = true; > > + } while (raw_spin_is_locked(&rq->balance_lock)); > > + > > + return rq; > > +} > > + > > /* > > * task_rq_lock - lock p->pi_lock and lock the rq @p resides on. > > */ > > @@ -675,6 +698,29 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf) > > } > > } > > > > +/* > > + * Helper to call task_rq_lock safely, in scenarios where we might be about to > > + * queue a balance callback on a remote CPU. That CPU might be in proxy(), and > > + * could have released its rq lock while holding balance_lock. So release rq > > + * lock in such a situation to avoid deadlock, and retry. 
> > + */ > > +struct rq *task_rq_lock_balance(struct task_struct *p, struct rq_flags *rf) > > +{ > > + struct rq *rq; > > + bool locked = false; > > + > > + do { > > + if (locked) { > > + task_rq_unlock(rq, p, rf); > > + cpu_relax(); > > + } > > + rq = task_rq_lock(p, rf); > > + locked = true; > > + } while (raw_spin_is_locked(&rq->balance_lock)); > > + > > + return rq; > > +} > > + > > /* > > * RQ-clock updating methods: > > */ > > @@ -6739,6 +6785,12 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf) > > p->wake_cpu = wake_cpu; > > } > > > > + /* > > + * Prevent other CPUs from queuing balance callbacks while we migrate > > + * tasks in the migrate_list with the rq lock released. > > + */ > > + raw_spin_lock(&rq->balance_lock); > > + > > rq_unpin_lock(rq, rf); > > raw_spin_rq_unlock(rq); > > raw_spin_rq_lock(that_rq); > > @@ -6758,7 +6810,21 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf) > > } > > > > raw_spin_rq_unlock(that_rq); > > + > > + /* > > + * This may make lockdep unhappy as we acquire rq->lock with > > + * balance_lock held. But that should be a false positive, as the > > + * following pattern happens only on the current CPU with interrupts > > + * disabled: > > + * rq_lock() > > + * balance_lock(); > > + * rq_unlock(); > > + * rq_lock(); > > + */ > > raw_spin_rq_lock(rq); > > + > > + raw_spin_unlock(&rq->balance_lock); > > + > > rq_repin_lock(rq, rf); > > > > return NULL; /* Retry task selection on _this_ CPU. 
*/ > > @@ -7489,7 +7555,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) > > if (p->pi_top_task == pi_task && prio == p->prio && !dl_prio(prio)) > > return; > > > > - rq = __task_rq_lock(p, &rf); > > + rq = __task_rq_lock_balance(p, &rf); > > update_rq_clock(rq); > > /* > > * Set under pi_lock && rq->lock, such that the value can be used under > > @@ -8093,7 +8159,8 @@ static int __sched_setscheduler(struct task_struct *p, > > * To be able to change p->policy safely, the appropriate > > * runqueue lock must be held. > > */ > > - rq = task_rq_lock(p, &rf); > > + rq = task_rq_lock_balance(p, &rf); > > + > > update_rq_clock(rq); > > > > /* > > You consider rt_mutex_setprio() and __sched_setscheduler() versus > proxy() but what about all the other places like load_balance(), > update_blocked_averages(), __set_cpus_allowed_ptr() and many > more in which we take the rq lock (via task_rq_lock() or > rq_lock{_xxx}())? You are right. Those paths potentially need updates as well. Any chance you could post stack traces or logs of those issues, just in case they have new nuggets of information? If you don't have them, don't bother, I'll reproduce it. > With your changes to locktorture in {2-3}/3 you still run CFS > lock_torture_writers but you should see the issue popping up > in __set_cpus_allowed_ptr() (from torture_shuffle()) for example. > > Tried with: > > insmod /lib/modules/torture.ko > insmod /lib/modules/locktorture.ko torture_type=mutex_lock rt_boost=1 rt_boost_factor=1 nlocks=3 > ^^^^^^^^^^^^^^^^^ > > When changing all lock_torture_writer's to FIFO it becomes even > more visible. Ok, thank you, I will make it more aggressively set to RT. The previous locktorture was setting RT once every minute or so, at least now that is down to 10 seconds ;-). But as you highlight with the locktorture diff below, it needs to go further. 
Anyway, this is going to be a nice holiday/Christmas project for me, so thank you in advance for the holiday gift :-) ^_^ thanks, - Joel > > -->8-- > > diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c > index e4529c2166e9..ea75d525fe7c 100644 > --- a/kernel/locking/locktorture.c > +++ b/kernel/locking/locktorture.c > @@ -683,7 +683,8 @@ static int lock_torture_writer(void *arg) > DEFINE_TORTURE_RANDOM(rand); > > VERBOSE_TOROUT_STRING("lock_torture_writer task started"); > - set_user_nice(current, MAX_NICE); > + if (!rt_task(current)) > + set_user_nice(current, MAX_NICE); > > do { > if ((torture_random(&rand) & 0xfffff) == 0) > diff --git a/kernel/torture.c b/kernel/torture.c > index 1d0dd88369e3..55d8ac417a4a 100644 > --- a/kernel/torture.c > +++ b/kernel/torture.c > @@ -57,6 +57,9 @@ module_param(verbose_sleep_duration, int, 0444); > static int random_shuffle; > module_param(random_shuffle, int, 0444); > > +static int lock_torture_writer_fifo; > +module_param(lock_torture_writer_fifo, int, 0444); > + > static char *torture_type; > static int verbose; > > @@ -734,7 +737,7 @@ bool stutter_wait(const char *title) > cond_resched_tasks_rcu_qs(); > spt = READ_ONCE(stutter_pause_test); > for (; spt; spt = READ_ONCE(stutter_pause_test)) { > - if (!ret) { > + if (!ret && !rt_task(current)) { > sched_set_normal(current, MAX_NICE); > ret = true; > } > @@ -944,6 +947,11 @@ int _torture_create_kthread(int (*fn)(void *arg), void *arg, char *s, char *m, > *tp = NULL; > return ret; > } > + > + if (lock_torture_writer_fifo && > + !strncmp(s, "lock_torture_writer", strlen(s))) > + sched_set_fifo(*tp); > + > wake_up_process(*tp); // Process is sleeping, so ordering provided. > torture_shuffle_task_register(*tp); > return ret; > > [...]
Hi Joel, On 09/12/2022 17:52, Joel Fernandes wrote: > Hi Dietmar! > > On Fri, Dec 9, 2022 at 3:07 PM Dietmar Eggemann > <dietmar.eggemann@arm.com> wrote: >> >> On 23/11/2022 02:21, Joel Fernandes (Google) wrote: [...] >> You consider rt_mutex_setprio() and __sched_setscheduler() versus >> proxy() but what about all the other places like load_balance(), >> update_blocked_averages(), __set_cpus_allowed_ptr() and many >> more in which we take the rq lock (via task_rq_lock() or >> rq_lock{_xxx}())? > > You are right. Those paths potentially need updates as well. Any IMHO, you would still have to lock the entire `queue->execute` (1)->(2) thing, like keeping the rq lock currently. __schedule() pick_next_task() pick_next_task_{rt/dl}() set_next_task_{rt/dl}() {rt/deadline}_queue_push_tasks() queue_balance_callback() --> (1) proxy() --- !!! finish_lock_switch() __balance_callbacks() <-- (2) __balance_callbacks(rq) <-- (2) Otherwise. something like this could happen: With `x-x` : smp_processor_id()-cpu_of(rq) lock_torture_wr-1745 [003] 338.270963: queue_balance_callback: 3-3-> lock_torture_wr-1745 [003] 338.270965: queue_balance_callback: 3-3<- cat-1726 [001] 338.270969: __schedule: proxy() 1-1-> lock_torture_wr-1745 [003] 338.270971: __schedule: proxy() 3-3-> cat-1726 [001] 338.270972: __schedule: proxy() 1-1<- lock_torture_wr-1745 [003] 338.270979: __schedule: proxy() 3-3<- <idle>-0 [002] 338.270981: __schedule: proxy() 2-2-> <idle>-0 [002] 338.270984: __schedule: proxy() 2-2<- lock_torture_wr-1745 [003] 338.270985: __schedule: proxy() 3-3-> migration/4-34 [004] 338.270989: active_load_balance_cpu_stop: rq_pin_lock() 4-3 <-- ! cb on CPU3 still enqueued > chance you could post stack traces or logs of those issues, just in > case they have new nuggets of information? If you don't have them, > don't bother, I'll reproduce it. 
insmod /lib/modules/torture.ko random_shuffle=1 lock_torture_writer_fifo=1 insmod /lib/modules/locktorture.ko torture_type=mutex_lock nlocks=3 [ 447.046916] rq->balance_callback && rq->balance_callback != &balance_push_callback [ 447.046926] WARNING: CPU: 1 PID: 1648 at kernel/sched/sched.h:1583 task_rq_lock+0x148/0x170 [ 447.062914] Modules linked in: locktorture(O) torture(O) [ 447.068254] CPU: 1 PID: 1648 Comm: torture_shuffle Tainted: G W O 6.1.0-rc2-00036-g397ce37b37a8-dirty #203 [ 447.079168] Hardware name: ARM Juno development board (r0) (DT) [ 447.085106] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 447.092095] pc : task_rq_lock+0x148/0x170 [ 447.096121] lr : task_rq_lock+0x148/0x170 [ 447.100146] sp : ffff80000b85bd30 ... [ 447.175138] Call trace: [ 447.177589] task_rq_lock+0x148/0x170 [ 447.181267] __set_cpus_allowed_ptr+0x34/0xc0 [ 447.185643] set_cpus_allowed_ptr+0x30/0x60 [ 447.189843] torture_shuffle+0x158/0x224 [torture] [ 447.194666] kthread+0x10c/0x110 [ 447.197906] ret_from_fork+0x10/0x20 ... [ 447.233560] ---[ end trace 0000000000000000 ]--- [ 446.542532] ------------[ cut here ]------------ [ 446.553224] rq->balance_callback && rq->balance_callback != &balance_push_callback [ 446.553243] WARNING: CPU: 3 PID: 0 at kernel/sched/sched.h:1583 update_blocked_averages+0x784/0x78c [ 446.576089] Modules linked in: locktorture(O+) torture(O) [ 446.581551] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 6.1.0-rc2-00036-g397ce37b37a8-dirty #203 [ 446.591723] Hardware name: ARM Juno development board (r0) (DT) [ 446.597691] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 446.604712] pc : update_blocked_averages+0x784/0x78c [ 446.609723] lr : update_blocked_averages+0x784/0x78c [ 446.614733] sp : ffff80000b403b70 ... 
[ 446.690142] Call trace: [ 446.692609] update_blocked_averages+0x784/0x78c [ 446.697270] newidle_balance+0x184/0x5f0 [ 446.701232] pick_next_task_fair+0x2c/0x500 [ 446.705456] __schedule+0x1d4/0x1084 [ 446.709070] schedule_idle+0x28/0x4c [ 446.712682] do_idle+0x1d4/0x2d0 [ 446.715946] cpu_startup_entry+0x28/0x30 [ 446.719909] secondary_start_kernel+0x138/0x15c [ 446.724486] __secondary_switched+0xb0/0xb4 ... [ 446.765848] ---[ end trace 0000000000000000 ]--- [ 95.091675] ------------[ cut here ]------------ [ 95.096301] rq->balance_callback && rq->balance_callback != &balance_push_callback [ 95.096322] WARNING: CPU: 5 PID: 39 at kernel/sched/sched.h:1583 load_balance+0x644/0xdc0 [ 95.103135] mutex_lock-torture: Creating lock_torture_writer task [ 95.103238] mutex_lock-torture: lock_torture_writer task started [ 95.110692] Modules linked in: locktorture(O+) torture(O) [ 95.136699] CPU: 5 PID: 39 Comm: migration/5 Tainted: G O 6.1.0-rc2-00036-g397ce37b37a8-dirty #204 [ 95.147137] Hardware name: ARM Juno development board (r0) (DT) [ 95.153105] Stopper: 0x0 <- 0x0 [ 95.156282] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 95.163306] pc : load_balance+0x644/0xdc0 [ 95.167356] lr : load_balance+0x644/0xdc0 [ 95.171405] sp : ffff80000b4cbaf0 ... [ 95.246833] Call trace: [ 95.249300] load_balance+0x644/0xdc0 [ 95.253000] newidle_balance+0x290/0x6f0 [ 95.256963] pick_next_task_fair+0x2c/0x510 [ 95.261188] __schedule+0x1d4/0x1084 [ 95.264801] schedule+0x64/0x11c [ 95.268063] smpboot_thread_fn+0xa4/0x270 [ 95.272115] kthread+0x10c/0x110 [ 95.275375] ret_from_fork+0x10/0x20 ... 
[ 95.316477] ---[ end trace 0000000000000000 ]--- [ 134.893379] ------------[ cut here ]------------ [ 134.898066] rq->balance_callback && rq->balance_callback != &balance_push_callback [ 134.898088] WARNING: CPU: 4 PID: 0 at kernel/sched/sched.h:1583 sched_rt_period_timer+0x1dc/0x3f0 [ 134.914683] Modules linked in: locktorture(O) torture(O) [ 134.920059] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G W O 6.1.0-rc2-00036-g397ce37b37a8-dirty #204 [ 134.930235] Hardware name: ARM Juno development board (r0) (DT) [ 134.936205] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 134.943229] pc : sched_rt_period_timer+0x1dc/0x3f0 [ 134.948069] lr : sched_rt_period_timer+0x1dc/0x3f0 [ 134.952908] sp : ffff80000b2cbde0 ... [ 135.028342] Call trace: [ 135.030810] sched_rt_period_timer+0x1dc/0x3f0 [ 135.035300] __hrtimer_run_queues+0x184/0x504 [ 135.039700] hrtimer_interrupt+0xe8/0x244 [ 135.043749] arch_timer_handler_phys+0x2c/0x44 [ 135.048239] handle_percpu_devid_irq+0x8c/0x150 [ 135.052815] generic_handle_domain_irq+0x2c/0x44 [ 135.057480] gic_handle_irq+0x44/0xc4 [ 135.061180] call_on_irq_stack+0x2c/0x5c [ 135.065143] do_interrupt_handler+0x80/0x84 ... [ 135.141122] ---[ end trace 0000000000000000 ]--- >> With your changes to locktorture in {2-3}/3 you still run CFS >> lock_torture_writers but you should see the issue popping up >> in __set_cpus_allowed_ptr() (from torture_shuffle()) for example. >> >> Tried with: >> >> insmod /lib/modules/torture.ko >> insmod /lib/modules/locktorture.ko torture_type=mutex_lock rt_boost=1 rt_boost_factor=1 nlocks=3 >> ^^^^^^^^^^^^^^^^^ >> >> When changing all lock_torture_writer's to FIFO it becomes even >> more visible. > > Ok, thank you, I will make it more aggressively set to RT. The > previous locktorture was setting RT once every minute or so, at least > now that is down to 10 seconds ;-). But as you highlight with the > locktorture diff below, it needs to go further. 
Yeah, the test env is more aggressive this way and you spot potential issues quicker. > Anyway, this is going to be a nice holiday/Christmas project for me, > so thank you in advance for the holiday gift :-) ^_^ Enjoy ;-)
On Mon, Dec 12, 2022 at 9:39 AM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> Hi Joel,
>
> On 09/12/2022 17:52, Joel Fernandes wrote:
> > Hi Dietmar!
> >
> > On Fri, Dec 9, 2022 at 3:07 PM Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 23/11/2022 02:21, Joel Fernandes (Google) wrote:
>
> [...]
>
> >> You consider rt_mutex_setprio() and __sched_setscheduler() versus
> >> proxy() but what about all the other places like load_balance(),
> >> update_blocked_averages(), __set_cpus_allowed_ptr() and many
> >> more in which we take the rq lock (via task_rq_lock() or
> >> rq_lock{_xxx}())?
> >
> > You are right. Those paths potentially need updates as well. Any
>
> IMHO, you would still have to lock the entire `queue->execute` (1)->(2)
> thing, like keeping the rq lock currently.
>
> __schedule()
>
>   pick_next_task()
>     pick_next_task_{rt/dl}()
>       set_next_task_{rt/dl}()
>         {rt/deadline}_queue_push_tasks()
>           queue_balance_callback() --> (1)
>
>   proxy() --- !!!
>
>   finish_lock_switch()
>     __balance_callbacks() <-- (2)
>
>   __balance_callbacks(rq) <-- (2)

Instead of locking it throughout, I think we can keep my initial patch
and just execute the balance callbacks in proxy() before dropping the
rq lock. I *think* that will make it work properly, but I could be
missing something.

Anyway, I see the issue you bring up: I took care of balance CBs queued
from *other CPUs* while the rq lock is dropped, but the current CPU
executing proxy() could itself have queued balance callbacks. Dropping
the rq lock then causes other CPUs to see the proxy() CPU's balance
CBs in the callback list.

Anyway, I will try the above and get back to you. Thanks so much again
for the insights. Will test as you suggested below.

Thanks,

 - Joel

> Otherwise, something like this could happen:
>
> With `x-x` : smp_processor_id()-cpu_of(rq)
>
> lock_torture_wr-1745 [003] 338.270963: queue_balance_callback: 3-3->
> lock_torture_wr-1745 [003] 338.270965: queue_balance_callback: 3-3<-
> cat-1726 [001] 338.270969: __schedule: proxy() 1-1->
> lock_torture_wr-1745 [003] 338.270971: __schedule: proxy() 3-3->
> cat-1726 [001] 338.270972: __schedule: proxy() 1-1<-
> lock_torture_wr-1745 [003] 338.270979: __schedule: proxy() 3-3<-
> <idle>-0 [002] 338.270981: __schedule: proxy() 2-2->
> <idle>-0 [002] 338.270984: __schedule: proxy() 2-2<-
> lock_torture_wr-1745 [003] 338.270985: __schedule: proxy() 3-3->
> migration/4-34 [004] 338.270989: active_load_balance_cpu_stop: rq_pin_lock() 4-3 <-- ! cb on CPU3 still enqueued
>
> > chance you could post stack traces or logs of those issues, just in
> > case they have new nuggets of information? If you don't have them,
> > don't bother, I'll reproduce it.
>
> insmod /lib/modules/torture.ko random_shuffle=1 lock_torture_writer_fifo=1
> insmod /lib/modules/locktorture.ko torture_type=mutex_lock nlocks=3
>
> [ 447.046916] rq->balance_callback && rq->balance_callback != &balance_push_callback
> [ 447.046926] WARNING: CPU: 1 PID: 1648 at kernel/sched/sched.h:1583 task_rq_lock+0x148/0x170
> [ 447.062914] Modules linked in: locktorture(O) torture(O)
> [ 447.068254] CPU: 1 PID: 1648 Comm: torture_shuffle Tainted: G W O 6.1.0-rc2-00036-g397ce37b37a8-dirty #203
> [ 447.079168] Hardware name: ARM Juno development board (r0) (DT)
> [ 447.085106] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 447.092095] pc : task_rq_lock+0x148/0x170
> [ 447.096121] lr : task_rq_lock+0x148/0x170
> [ 447.100146] sp : ffff80000b85bd30
> ...
> [ 447.175138] Call trace:
> [ 447.177589]  task_rq_lock+0x148/0x170
> [ 447.181267]  __set_cpus_allowed_ptr+0x34/0xc0
> [ 447.185643]  set_cpus_allowed_ptr+0x30/0x60
> [ 447.189843]  torture_shuffle+0x158/0x224 [torture]
> [ 447.194666]  kthread+0x10c/0x110
> [ 447.197906]  ret_from_fork+0x10/0x20
> ...
> [ 447.233560] ---[ end trace 0000000000000000 ]---
>
>
> [ 446.542532] ------------[ cut here ]------------
> [ 446.553224] rq->balance_callback && rq->balance_callback != &balance_push_callback
> [ 446.553243] WARNING: CPU: 3 PID: 0 at kernel/sched/sched.h:1583 update_blocked_averages+0x784/0x78c
> [ 446.576089] Modules linked in: locktorture(O+) torture(O)
> [ 446.581551] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 6.1.0-rc2-00036-g397ce37b37a8-dirty #203
> [ 446.591723] Hardware name: ARM Juno development board (r0) (DT)
> [ 446.597691] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 446.604712] pc : update_blocked_averages+0x784/0x78c
> [ 446.609723] lr : update_blocked_averages+0x784/0x78c
> [ 446.614733] sp : ffff80000b403b70
> ...
> [ 446.690142] Call trace:
> [ 446.692609]  update_blocked_averages+0x784/0x78c
> [ 446.697270]  newidle_balance+0x184/0x5f0
> [ 446.701232]  pick_next_task_fair+0x2c/0x500
> [ 446.705456]  __schedule+0x1d4/0x1084
> [ 446.709070]  schedule_idle+0x28/0x4c
> [ 446.712682]  do_idle+0x1d4/0x2d0
> [ 446.715946]  cpu_startup_entry+0x28/0x30
> [ 446.719909]  secondary_start_kernel+0x138/0x15c
> [ 446.724486]  __secondary_switched+0xb0/0xb4
> ...
> [ 446.765848] ---[ end trace 0000000000000000 ]---
>
>
> [ 95.091675] ------------[ cut here ]------------
> [ 95.096301] rq->balance_callback && rq->balance_callback != &balance_push_callback
> [ 95.096322] WARNING: CPU: 5 PID: 39 at kernel/sched/sched.h:1583 load_balance+0x644/0xdc0
> [ 95.103135] mutex_lock-torture: Creating lock_torture_writer task
> [ 95.103238] mutex_lock-torture: lock_torture_writer task started
> [ 95.110692] Modules linked in: locktorture(O+) torture(O)
> [ 95.136699] CPU: 5 PID: 39 Comm: migration/5 Tainted: G O 6.1.0-rc2-00036-g397ce37b37a8-dirty #204
> [ 95.147137] Hardware name: ARM Juno development board (r0) (DT)
> [ 95.153105] Stopper: 0x0 <- 0x0
> [ 95.156282] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 95.163306] pc : load_balance+0x644/0xdc0
> [ 95.167356] lr : load_balance+0x644/0xdc0
> [ 95.171405] sp : ffff80000b4cbaf0
> ...
> [ 95.246833] Call trace:
> [ 95.249300]  load_balance+0x644/0xdc0
> [ 95.253000]  newidle_balance+0x290/0x6f0
> [ 95.256963]  pick_next_task_fair+0x2c/0x510
> [ 95.261188]  __schedule+0x1d4/0x1084
> [ 95.264801]  schedule+0x64/0x11c
> [ 95.268063]  smpboot_thread_fn+0xa4/0x270
> [ 95.272115]  kthread+0x10c/0x110
> [ 95.275375]  ret_from_fork+0x10/0x20
> ...
> [ 95.316477] ---[ end trace 0000000000000000 ]---
>
>
> [ 134.893379] ------------[ cut here ]------------
> [ 134.898066] rq->balance_callback && rq->balance_callback != &balance_push_callback
> [ 134.898088] WARNING: CPU: 4 PID: 0 at kernel/sched/sched.h:1583 sched_rt_period_timer+0x1dc/0x3f0
> [ 134.914683] Modules linked in: locktorture(O) torture(O)
> [ 134.920059] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G W O 6.1.0-rc2-00036-g397ce37b37a8-dirty #204
> [ 134.930235] Hardware name: ARM Juno development board (r0) (DT)
> [ 134.936205] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 134.943229] pc : sched_rt_period_timer+0x1dc/0x3f0
> [ 134.948069] lr : sched_rt_period_timer+0x1dc/0x3f0
> [ 134.952908] sp : ffff80000b2cbde0
> ...
> [ 135.028342] Call trace:
> [ 135.030810]  sched_rt_period_timer+0x1dc/0x3f0
> [ 135.035300]  __hrtimer_run_queues+0x184/0x504
> [ 135.039700]  hrtimer_interrupt+0xe8/0x244
> [ 135.043749]  arch_timer_handler_phys+0x2c/0x44
> [ 135.048239]  handle_percpu_devid_irq+0x8c/0x150
> [ 135.052815]  generic_handle_domain_irq+0x2c/0x44
> [ 135.057480]  gic_handle_irq+0x44/0xc4
> [ 135.061180]  call_on_irq_stack+0x2c/0x5c
> [ 135.065143]  do_interrupt_handler+0x80/0x84
> ...
> [ 135.141122] ---[ end trace 0000000000000000 ]---
>
> >> With your changes to locktorture in {2-3}/3 you still run CFS
> >> lock_torture_writers but you should see the issue popping up
> >> in __set_cpus_allowed_ptr() (from torture_shuffle()) for example.
> >>
> >> Tried with:
> >>
> >> insmod /lib/modules/torture.ko
> >> insmod /lib/modules/locktorture.ko torture_type=mutex_lock rt_boost=1 rt_boost_factor=1 nlocks=3
> >>                                                            ^^^^^^^^^^^^^^^^^
> >>
> >> When changing all lock_torture_writer's to FIFO it becomes even
> >> more visible.
> >
> > Ok, thank you, I will make it more aggressively set to RT. The
> > previous locktorture was setting RT once every minute or so, at least
> > now that is down to 10 seconds ;-). But as you highlight with the
> > locktorture diff below, it needs to go further.
>
> Yeah, the test env is more aggressive this way and you spot potential
> issues quicker.
>
> > Anyway, this is going to be a nice holiday/Christmas project for me,
> > so thank you in advance for the holiday gift :-) ^_^
>
> Enjoy ;-)
> On Dec 15, 2022, at 6:12 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Mon, Dec 12, 2022 at 9:39 AM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> Hi Joel,
>>
>>> On 09/12/2022 17:52, Joel Fernandes wrote:
>>> Hi Dietmar!
>>>
>>> On Fri, Dec 9, 2022 at 3:07 PM Dietmar Eggemann
>>> <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 23/11/2022 02:21, Joel Fernandes (Google) wrote:
>>
>> [...]
>>
>>>> You consider rt_mutex_setprio() and __sched_setscheduler() versus
>>>> proxy() but what about all the other places like load_balance(),
>>>> update_blocked_averages(), __set_cpus_allowed_ptr() and many
>>>> more in which we take the rq lock (via task_rq_lock() or
>>>> rq_lock{_xxx}())?
>>>
>>> You are right. Those paths potentially need updates as well. Any
>>
>> IMHO, you would still have to lock the entire `queue->execute` (1)->(2)
>> thing, like keeping the rq lock currently.
>>
>> __schedule()
>>
>>   pick_next_task()
>>     pick_next_task_{rt/dl}()
>>       set_next_task_{rt/dl}()
>>         {rt/deadline}_queue_push_tasks()
>>           queue_balance_callback() --> (1)
>>
>>   proxy() --- !!!
>>
>>   finish_lock_switch()
>>     __balance_callbacks() <-- (2)
>>
>>   __balance_callbacks(rq) <-- (2)
>
> Instead of locking it throughout, I think we can keep my initial patch
> and just execute the balance callbacks in proxy() before dropping the
> rq lock. I *think* that will make it work properly, but I could be
> missing something.
>
> Anyway, I see the issue you bring up: I took care of balance CBs queued
> from *other CPUs* while the rq lock is dropped, but the current CPU
> executing proxy() could itself have queued balance callbacks. Dropping
> the rq lock then causes other CPUs to see the proxy() CPU's balance
> CBs in the callback list.
>
> Anyway, I will try the above and get back to you. Thanks so much again
> for the insights. Will test as you suggested below.

Another option is to dequeue them before dropping the rq lock, and then
requeue them later. I'm not sure whether that opens a can of worms, but
the first option appears safer to me in theory.

Anyway, it looks like we have a couple of options on the table here for
me to try.

 - Joel

>
> Thanks,
>
> - Joel
>
>
>> Otherwise, something like this could happen:
>>
>> With `x-x` : smp_processor_id()-cpu_of(rq)
>>
>> lock_torture_wr-1745 [003] 338.270963: queue_balance_callback: 3-3->
>> lock_torture_wr-1745 [003] 338.270965: queue_balance_callback: 3-3<-
>> cat-1726 [001] 338.270969: __schedule: proxy() 1-1->
>> lock_torture_wr-1745 [003] 338.270971: __schedule: proxy() 3-3->
>> cat-1726 [001] 338.270972: __schedule: proxy() 1-1<-
>> lock_torture_wr-1745 [003] 338.270979: __schedule: proxy() 3-3<-
>> <idle>-0 [002] 338.270981: __schedule: proxy() 2-2->
>> <idle>-0 [002] 338.270984: __schedule: proxy() 2-2<-
>> lock_torture_wr-1745 [003] 338.270985: __schedule: proxy() 3-3->
>> migration/4-34 [004] 338.270989: active_load_balance_cpu_stop: rq_pin_lock() 4-3 <-- ! cb on CPU3 still enqueued
>>
>>> chance you could post stack traces or logs of those issues, just in
>>> case they have new nuggets of information? If you don't have them,
>>> don't bother, I'll reproduce it.
>>
>> insmod /lib/modules/torture.ko random_shuffle=1 lock_torture_writer_fifo=1
>> insmod /lib/modules/locktorture.ko torture_type=mutex_lock nlocks=3
>>
>> [ 447.046916] rq->balance_callback && rq->balance_callback != &balance_push_callback
>> [ 447.046926] WARNING: CPU: 1 PID: 1648 at kernel/sched/sched.h:1583 task_rq_lock+0x148/0x170
>> [ 447.062914] Modules linked in: locktorture(O) torture(O)
>> [ 447.068254] CPU: 1 PID: 1648 Comm: torture_shuffle Tainted: G W O 6.1.0-rc2-00036-g397ce37b37a8-dirty #203
>> [ 447.079168] Hardware name: ARM Juno development board (r0) (DT)
>> [ 447.085106] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 447.092095] pc : task_rq_lock+0x148/0x170
>> [ 447.096121] lr : task_rq_lock+0x148/0x170
>> [ 447.100146] sp : ffff80000b85bd30
>> ...
>> [ 447.175138] Call trace:
>> [ 447.177589]  task_rq_lock+0x148/0x170
>> [ 447.181267]  __set_cpus_allowed_ptr+0x34/0xc0
>> [ 447.185643]  set_cpus_allowed_ptr+0x30/0x60
>> [ 447.189843]  torture_shuffle+0x158/0x224 [torture]
>> [ 447.194666]  kthread+0x10c/0x110
>> [ 447.197906]  ret_from_fork+0x10/0x20
>> ...
>> [ 447.233560] ---[ end trace 0000000000000000 ]---
>>
>>
>> [ 446.542532] ------------[ cut here ]------------
>> [ 446.553224] rq->balance_callback && rq->balance_callback != &balance_push_callback
>> [ 446.553243] WARNING: CPU: 3 PID: 0 at kernel/sched/sched.h:1583 update_blocked_averages+0x784/0x78c
>> [ 446.576089] Modules linked in: locktorture(O+) torture(O)
>> [ 446.581551] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 6.1.0-rc2-00036-g397ce37b37a8-dirty #203
>> [ 446.591723] Hardware name: ARM Juno development board (r0) (DT)
>> [ 446.597691] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 446.604712] pc : update_blocked_averages+0x784/0x78c
>> [ 446.609723] lr : update_blocked_averages+0x784/0x78c
>> [ 446.614733] sp : ffff80000b403b70
>> ...
>> [ 446.690142] Call trace:
>> [ 446.692609]  update_blocked_averages+0x784/0x78c
>> [ 446.697270]  newidle_balance+0x184/0x5f0
>> [ 446.701232]  pick_next_task_fair+0x2c/0x500
>> [ 446.705456]  __schedule+0x1d4/0x1084
>> [ 446.709070]  schedule_idle+0x28/0x4c
>> [ 446.712682]  do_idle+0x1d4/0x2d0
>> [ 446.715946]  cpu_startup_entry+0x28/0x30
>> [ 446.719909]  secondary_start_kernel+0x138/0x15c
>> [ 446.724486]  __secondary_switched+0xb0/0xb4
>> ...
>> [ 446.765848] ---[ end trace 0000000000000000 ]---
>>
>>
>> [ 95.091675] ------------[ cut here ]------------
>> [ 95.096301] rq->balance_callback && rq->balance_callback != &balance_push_callback
>> [ 95.096322] WARNING: CPU: 5 PID: 39 at kernel/sched/sched.h:1583 load_balance+0x644/0xdc0
>> [ 95.103135] mutex_lock-torture: Creating lock_torture_writer task
>> [ 95.103238] mutex_lock-torture: lock_torture_writer task started
>> [ 95.110692] Modules linked in: locktorture(O+) torture(O)
>> [ 95.136699] CPU: 5 PID: 39 Comm: migration/5 Tainted: G O 6.1.0-rc2-00036-g397ce37b37a8-dirty #204
>> [ 95.147137] Hardware name: ARM Juno development board (r0) (DT)
>> [ 95.153105] Stopper: 0x0 <- 0x0
>> [ 95.156282] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 95.163306] pc : load_balance+0x644/0xdc0
>> [ 95.167356] lr : load_balance+0x644/0xdc0
>> [ 95.171405] sp : ffff80000b4cbaf0
>> ...
>> [ 95.246833] Call trace:
>> [ 95.249300]  load_balance+0x644/0xdc0
>> [ 95.253000]  newidle_balance+0x290/0x6f0
>> [ 95.256963]  pick_next_task_fair+0x2c/0x510
>> [ 95.261188]  __schedule+0x1d4/0x1084
>> [ 95.264801]  schedule+0x64/0x11c
>> [ 95.268063]  smpboot_thread_fn+0xa4/0x270
>> [ 95.272115]  kthread+0x10c/0x110
>> [ 95.275375]  ret_from_fork+0x10/0x20
>> ...
>> [ 95.316477] ---[ end trace 0000000000000000 ]---
>>
>>
>> [ 134.893379] ------------[ cut here ]------------
>> [ 134.898066] rq->balance_callback && rq->balance_callback != &balance_push_callback
>> [ 134.898088] WARNING: CPU: 4 PID: 0 at kernel/sched/sched.h:1583 sched_rt_period_timer+0x1dc/0x3f0
>> [ 134.914683] Modules linked in: locktorture(O) torture(O)
>> [ 134.920059] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G W O 6.1.0-rc2-00036-g397ce37b37a8-dirty #204
>> [ 134.930235] Hardware name: ARM Juno development board (r0) (DT)
>> [ 134.936205] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 134.943229] pc : sched_rt_period_timer+0x1dc/0x3f0
>> [ 134.948069] lr : sched_rt_period_timer+0x1dc/0x3f0
>> [ 134.952908] sp : ffff80000b2cbde0
>> ...
>> [ 135.028342] Call trace:
>> [ 135.030810]  sched_rt_period_timer+0x1dc/0x3f0
>> [ 135.035300]  __hrtimer_run_queues+0x184/0x504
>> [ 135.039700]  hrtimer_interrupt+0xe8/0x244
>> [ 135.043749]  arch_timer_handler_phys+0x2c/0x44
>> [ 135.048239]  handle_percpu_devid_irq+0x8c/0x150
>> [ 135.052815]  generic_handle_domain_irq+0x2c/0x44
>> [ 135.057480]  gic_handle_irq+0x44/0xc4
>> [ 135.061180]  call_on_irq_stack+0x2c/0x5c
>> [ 135.065143]  do_interrupt_handler+0x80/0x84
>> ...
>> [ 135.141122] ---[ end trace 0000000000000000 ]---
>>
>>>> With your changes to locktorture in {2-3}/3 you still run CFS
>>>> lock_torture_writers but you should see the issue popping up
>>>> in __set_cpus_allowed_ptr() (from torture_shuffle()) for example.
>>>>
>>>> Tried with:
>>>>
>>>> insmod /lib/modules/torture.ko
>>>> insmod /lib/modules/locktorture.ko torture_type=mutex_lock rt_boost=1 rt_boost_factor=1 nlocks=3
>>>>                                                            ^^^^^^^^^^^^^^^^^
>>>>
>>>> When changing all lock_torture_writer's to FIFO it becomes even
>>>> more visible.
>>>
>>> Ok, thank you, I will make it more aggressively set to RT. The
>>> previous locktorture was setting RT once every minute or so, at least
>>> now that is down to 10 seconds ;-). But as you highlight with the
>>> locktorture diff below, it needs to go further.
>>
>> Yeah, the test env is more aggressive this way and you spot potential
>> issues quicker.
>>
>>> Anyway, this is going to be a nice holiday/Christmas project for me,
>>> so thank you in advance for the holiday gift :-) ^_^
>>
>> Enjoy ;-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 88a5fa34dc06..aba90b3dc3ef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -633,6 +633,29 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	}
 }
 
+/*
+ * Helper to call __task_rq_lock safely, in scenarios where we might be about to
+ * queue a balance callback on a remote CPU. That CPU might be in proxy(), and
+ * could have released its rq lock while holding balance_lock. So release rq
+ * lock in such a situation to avoid deadlock, and retry.
+ */
+struct rq *__task_rq_lock_balance(struct task_struct *p, struct rq_flags *rf)
+{
+	struct rq *rq;
+	bool locked = false;
+
+	do {
+		if (locked) {
+			__task_rq_unlock(rq, rf);
+			cpu_relax();
+		}
+		rq = __task_rq_lock(p, rf);
+		locked = true;
+	} while (raw_spin_is_locked(&rq->balance_lock));
+
+	return rq;
+}
+
 /*
  * task_rq_lock - lock p->pi_lock and lock the rq @p resides on.
  */
@@ -675,6 +698,29 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	}
 }
 
+/*
+ * Helper to call task_rq_lock safely, in scenarios where we might be about to
+ * queue a balance callback on a remote CPU. That CPU might be in proxy(), and
+ * could have released its rq lock while holding balance_lock. So release rq
+ * lock in such a situation to avoid deadlock, and retry.
+ */
+struct rq *task_rq_lock_balance(struct task_struct *p, struct rq_flags *rf)
+{
+	struct rq *rq;
+	bool locked = false;
+
+	do {
+		if (locked) {
+			task_rq_unlock(rq, p, rf);
+			cpu_relax();
+		}
+		rq = task_rq_lock(p, rf);
+		locked = true;
+	} while (raw_spin_is_locked(&rq->balance_lock));
+
+	return rq;
+}
+
 /*
  * RQ-clock updating methods:
  */
@@ -6739,6 +6785,12 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 		p->wake_cpu = wake_cpu;
 	}
 
+	/*
+	 * Prevent other CPUs from queuing balance callbacks while we migrate
+	 * tasks in the migrate_list with the rq lock released.
+	 */
+	raw_spin_lock(&rq->balance_lock);
+
 	rq_unpin_lock(rq, rf);
 	raw_spin_rq_unlock(rq);
 	raw_spin_rq_lock(that_rq);
@@ -6758,7 +6810,21 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 	}
 
 	raw_spin_rq_unlock(that_rq);
+
+	/*
+	 * This may make lockdep unhappy as we acquire rq->lock with
+	 * balance_lock held. But that should be a false positive, as the
+	 * following pattern happens only on the current CPU with interrupts
+	 * disabled:
+	 *   rq_lock()
+	 *   balance_lock();
+	 *   rq_unlock();
+	 *   rq_lock();
+	 */
 	raw_spin_rq_lock(rq);
+
+	raw_spin_unlock(&rq->balance_lock);
+
 	rq_repin_lock(rq, rf);
 
 	return NULL; /* Retry task selection on _this_ CPU. */
@@ -7489,7 +7555,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 	if (p->pi_top_task == pi_task && prio == p->prio && !dl_prio(prio))
 		return;
 
-	rq = __task_rq_lock(p, &rf);
+	rq = __task_rq_lock_balance(p, &rf);
 	update_rq_clock(rq);
 	/*
 	 * Set under pi_lock && rq->lock, such that the value can be used under
@@ -8093,7 +8159,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * To be able to change p->policy safely, the appropriate
 	 * runqueue lock must be held.
 	 */
-	rq = task_rq_lock(p, &rf);
+	rq = task_rq_lock_balance(p, &rf);
+
 	update_rq_clock(rq);
 
 	/*
@@ -10312,6 +10379,7 @@ void __init sched_init(void)
 		rq = cpu_rq(i);
 
 		raw_spin_lock_init(&rq->__lock);
+		raw_spin_lock_init(&rq->balance_lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
 		rq->calc_load_update = jiffies + LOAD_FREQ;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 354e75587fed..17947a4009de 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1057,6 +1057,7 @@ struct rq {
 	unsigned long		cpu_capacity_orig;
 
 	struct callback_head	*balance_callback;
+	raw_spinlock_t		balance_lock;
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
@@ -1748,18 +1749,22 @@ queue_balance_callback(struct rq *rq, void (*func)(struct rq *rq))
 {
 	lockdep_assert_rq_held(rq);
 
+	raw_spin_lock(&rq->balance_lock);
 	/*
 	 * Don't (re)queue an already queued item; nor queue anything when
 	 * balance_push() is active, see the comment with
 	 * balance_push_callback.
 	 */
-	if (unlikely(head->next || rq->balance_callback == &balance_push_callback))
+	if (unlikely(head->next || rq->balance_callback == &balance_push_callback)) {
+		raw_spin_unlock(&rq->balance_lock);
 		return;
+	}
 
 	head->func = (void (*)(struct callback_head *))func;
 	head->next = rq->balance_callback;
 	rq->balance_callback = head;
+	raw_spin_unlock(&rq->balance_lock);
 }
 
 #define rcu_dereference_check_sched_domain(p) \