Message ID | 20230120150246.20797-2-wander@redhat.com |
---|---|
State | New |
Headers |
From: Wander Lairson Costa <wander@redhat.com>
To: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Valentin Schneider <vschneid@redhat.com>, "Eric W. Biederman" <ebiederm@xmission.com>, Wander Lairson Costa <wander@redhat.com>, Stafford Horne <shorne@gmail.com>, Kefeng Wang <wangkefeng.wang@huawei.com>, Oleg Nesterov <oleg@redhat.com>, Andrew Morton <akpm@linux-foundation.org>, Thomas Gleixner <tglx@linutronix.de>, Sebastian Andrzej Siewior <bigeasy@linutronix.de>, Andy Lutomirski <luto@kernel.org>, "Liam R. Howlett" <Liam.Howlett@Oracle.com>, Fenghua Yu <fenghua.yu@intel.com>, Andrei Vagin <avagin@gmail.com>, linux-kernel@vger.kernel.org (open list)
Cc: Paul McKenney <paulmck@kernel.org>
Subject: [PATCH v2 1/4] sched/task: Add the put_task_struct_atomic_safe function
Date: Fri, 20 Jan 2023 12:02:39 -0300
Message-Id: <20230120150246.20797-2-wander@redhat.com>
In-Reply-To: <20230120150246.20797-1-wander@redhat.com>
References: <20230120150246.20797-1-wander@redhat.com> |
Series | Fix put_task_struct() calls under PREEMPT_RT |
Commit Message
Wander Lairson Costa
Jan. 20, 2023, 3:02 p.m. UTC
With PREEMPT_RT, it is unsafe to call put_task_struct() in atomic
contexts because it indirectly acquires sleeping locks.
Introduce put_task_struct_atomic_safe(), which schedules
__put_task_struct() through call_rcu() when the kernel is compiled with
PREEMPT_RT.
A more natural approach would be to use a workqueue, but under
PREEMPT_RT we cannot allocate dynamic memory from atomic context, so
the work_struct instance would have to be embedded in task_struct and
initialized whenever a new task_struct is allocated, making the code
more complex.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Cc: Paul McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/sched/task.h | 21 +++++++++++++++++++++
 kernel/fork.c              |  8 ++++++++
 2 files changed, 29 insertions(+)
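For illustration, a minimal sketch of the kind of call site the helper
targets; the lock and function names below are hypothetical, not part
of the patch:

	/*
	 * Hypothetical call site: dropping a task reference while a raw
	 * spinlock is held. Such a region stays non-preemptible even on
	 * PREEMPT_RT, so a plain put_task_struct() here could end up
	 * acquiring sleeping locks inside __put_task_struct().
	 */
	static DEFINE_RAW_SPINLOCK(example_lock);

	static void example_drop_ref(struct task_struct *owner)
	{
		unsigned long flags;

		raw_spin_lock_irqsave(&example_lock, flags);
		/* ... tear down state that referenced 'owner' ... */
		put_task_struct_atomic_safe(owner); /* defers the free via RCU on RT */
		raw_spin_unlock_irqrestore(&example_lock, flags);
	}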
Comments
On 01/20, Wander Lairson Costa wrote:
>
> +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> +{
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> +		/*
> +		 * Decrement the refcount explicitly to avoid unnecessarily
> +		 * calling call_rcu.
> +		 */
> +		if (refcount_dec_and_test(&task->usage))
> +			/*
> +			 * under PREEMPT_RT, we can't call put_task_struct
> +			 * in atomic context because it will indirectly
> +			 * acquire sleeping locks.
> +			 */
> +			call_rcu(&task->rcu, __delayed_put_task_struct);
                              ^^^^^^^^^
I am not sure the usage of task->rcu is safe...

Suppose that, before __delayed_put_task_struct() is called by RCU, this task
does the last schedule and calls put_task_struct_rcu_user().

And, can't we simply turn put_task_struct() into something like

	put_task_struct(struct task_struct *t)
	{
		if (refcount_dec_and_test(&t->usage)) {
			if (IS_ENABLED(CONFIG_PREEMPT_RT)
			    && (in_atomic() || irqs_disabled()))
				call_rcu(...);
			else
				__put_task_struct(t);
		}
	}

?

Oleg.
On 01/23, Oleg Nesterov wrote:
>
> On 01/20, Wander Lairson Costa wrote:
> >
> > +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> > [...]
> > +		if (refcount_dec_and_test(&task->usage))
> > [...]
> > +			call_rcu(&task->rcu, __delayed_put_task_struct);
>                               ^^^^^^^^^
> I am not sure the usage of task->rcu is safe...
>
> Suppose that, before __delayed_put_task_struct() is called by RCU, this task
> does the last schedule and calls put_task_struct_rcu_user().

Ah, sorry, please forget, rcu_users != 0 implies task->usage != 0.

> And, can't we simply turn put_task_struct() into something like
> [...]
> ?
>
> Oleg.
On Mon, Jan 23, 2023 at 1:30 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 01/20, Wander Lairson Costa wrote:
> > [...]
>
> I am not sure the usage of task->rcu is safe...
>
> Suppose that, before __delayed_put_task_struct() is called by RCU, this task
> does the last schedule and calls put_task_struct_rcu_user().
>
> And, can't we simply turn put_task_struct() into something like
>
> 	put_task_struct(struct task_struct *t)
> 	{
> 		if (refcount_dec_and_test(&t->usage)) {
> 			if (IS_ENABLED(CONFIG_PREEMPT_RT)
> 			    && (in_atomic() || irqs_disabled()))
> 				call_rcu(...);
> 			else
> 				__put_task_struct(t);
> 		}
> 	}
>
> ?

Yeah, that was one approach I thought about. I chose to use an
explicit function because I assumed calling __put_task_struct() from a
non-preemptable context should be the exception, not the rule.
Therefore (if I am correct in my assumption), it would make sense for
only some call sites to pay the overhead price for it. But this is
just a guess, and I have no evidence to support my claim.
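Written out, Oleg's alternative would look roughly like the sketch
below. His mail left the call_rcu() callback elided as "...", so wiring
it to __delayed_put_task_struct() from this series is an assumption:

	/*
	 * Sketch of the alternative: make put_task_struct() itself detect
	 * atomic context and defer the free. Callback name is assumed.
	 */
	static inline void put_task_struct(struct task_struct *t)
	{
		if (refcount_dec_and_test(&t->usage)) {
			if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
			    (in_atomic() || irqs_disabled()))
				call_rcu(&t->rcu, __delayed_put_task_struct);
			else
				__put_task_struct(t);
		}
	}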
On 23/01/23 14:24, Wander Lairson Costa wrote:
> On Mon, Jan 23, 2023 at 1:30 PM Oleg Nesterov <oleg@redhat.com> wrote:
>> [...]
>> And, can't we simply turn put_task_struct() into something like
>> [...]
>
> Yeah, that was one approach I thought about. I chose to use an
> explicit function because I assumed calling __put_task_struct() from a
> non-preemptable context should be the exception, not the rule.

I'd tend to agree.

> Therefore (if I am correct in my assumption), it would make sense for
> only some call sites to pay the overhead price for it. But this is
> just a guess, and I have no evidence to support my claim.

My worry here is that it's easy to miss problematic callgraphs, and it's
potentially easy for new ones to creep in. Having a solution within
put_task_struct() itself would prevent that.

Another thing, if you look at release_task_stack(), it either caches the
outgoing stack for later use, or frees it via RCU (regardless of
PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
of the task struct to RCU?
On Fri, Jan 27, 2023 at 12:55 PM Valentin Schneider <vschneid@redhat.com> wrote:
>
> On 23/01/23 14:24, Wander Lairson Costa wrote:
> > [...]
> > Therefore (if I am correct in my assumption), it would make sense for
> > only some call sites to pay the overhead price for it. But this is
> > just a guess, and I have no evidence to support my claim.
>
> My worry here is that it's easy to miss problematic callgraphs, and it's
> potentially easy for new ones to creep in. Having a solution within
> put_task_struct() itself would prevent that.

We could add a WARN_ON statement in put_task_struct() to detect such cases.

> Another thing, if you look at release_task_stack(), it either caches the
> outgoing stack for later use, or frees it via RCU (regardless of
> PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
> of the task struct to RCU?

That's a point. Do you mean doing that even for !PREEMPT_RT?
On 30/01/23 08:49, Wander Lairson Costa wrote:
> On Fri, Jan 27, 2023 at 12:55 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> [...]
>> My worry here is that it's easy to miss problematic callgraphs, and it's
>> potentially easy for new ones to creep in. Having a solution within
>> put_task_struct() itself would prevent that.
>
> We could add a WARN_ON statement in put_task_struct() to detect such cases.

Anyone running their kernel with DEBUG_ATOMIC_SLEEP should be able to
detect misuse, but it doesn't change that some callgraphs will only
materialize under certain hardware/configuration combos.

>> Another thing, if you look at release_task_stack(), it either caches the
>> outgoing stack for later use, or frees it via RCU (regardless of
>> PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
>> of the task struct to RCU?
>
> That's a point. Do you mean doing that even for !PREEMPT_RT?

Could be worth a try? I think because of the cache thing the task stack is
a bit less aggressive wrt RCU callback processing, but at a quick glance I
don't see any fundamental reason why the task_struct itself can't be given
the same treatment.
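A sketch of that "always punt to RCU" idea, assuming the series'
__delayed_put_task_struct() callback; this is an untested illustration
of the proposal, not code from the series:

	/*
	 * Unconditional variant, per the release_task_stack() analogy:
	 * every final put goes through an RCU callback, on RT and !RT
	 * alike (without the stack cache's fast path).
	 */
	static inline void put_task_struct(struct task_struct *t)
	{
		if (refcount_dec_and_test(&t->usage))
			call_rcu(&t->rcu, __delayed_put_task_struct);
	}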
On Mon, Jan 30, 2023 at 11:47 AM Valentin Schneider <vschneid@redhat.com> wrote:
>
> On 30/01/23 08:49, Wander Lairson Costa wrote:
> > [...]
> > We could add a WARN_ON statement in put_task_struct() to detect such cases.
>
> Anyone running their kernel with DEBUG_ATOMIC_SLEEP should be able to
> detect misuse, but it doesn't change that some callgraphs will only
> materialize under certain hardware/configuration combos.

If we put a WARN_ON in put_task_struct(), we catch cases where the
reference count didn't reach zero.

> > That's a point. Do you mean doing that even for !PREEMPT_RT?
>
> Could be worth a try?

Sure. But I would do it only for PREEMPT_RT.

> I think because of the cache thing the task stack is
> a bit less aggressive wrt RCU callback processing, but at a quick glance I
> don't see any fundamental reason why the task_struct itself can't be given
> the same treatment.

Any idea about tests to catch performance regressions?
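As a sketch, the debug check being discussed could sit at the top of
put_task_struct() so it fires on every atomic-context caller, not only
when the count happens to hit zero; placement and the exact condition
are assumptions here:

	static inline void put_task_struct(struct task_struct *t)
	{
		/*
		 * Illustrative check: complain about callers that might drop
		 * the last reference from a context where __put_task_struct()
		 * could sleep on PREEMPT_RT.
		 */
		WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT) &&
			     (in_atomic() || irqs_disabled()));

		if (refcount_dec_and_test(&t->usage))
			__put_task_struct(t);
	}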
On 30/01/23 11:58, Wander Lairson Costa wrote:
> On Mon, Jan 30, 2023 at 11:47 AM Valentin Schneider <vschneid@redhat.com> wrote:
>> [...]
>> Anyone running their kernel with DEBUG_ATOMIC_SLEEP should be able to
>> detect misuse, but it doesn't change that some callgraphs will only
>> materialize under certain hardware/configuration combos.
>
> If we put a WARN_ON in put_task_struct(), we catch cases where the
> reference count didn't reach zero.

True, that'd be an improvement.

>> [...]
>> Could be worth a try?
>
> Sure. But I would do it only for PREEMPT_RT.
>
>> I think because of the cache thing the task stack is
>> a bit less aggressive wrt RCU callback processing, but at a quick glance I
>> don't see any fundamental reason why the task_struct itself can't be given
>> the same treatment.
>
> Any idea about tests to catch performance regressions?

I would wager anything fork-heavy with short-lived tasks, say loops of
short hackbench runs; I believe stress-ng also has a fork test case.
On 1/30/23 08:49, Wander Lairson Costa wrote:
>> Another thing, if you look at release_task_stack(), it either caches the
>> outgoing stack for later use, or frees it via RCU (regardless of
>> PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
>> of the task struct to RCU?
>
> That's a point. Do you mean doing that even for !PREEMPT_RT?

It seems that Luca Abeni (in Cc) is hitting the bug in the non-rt kernel
with SCHED_DEADLINE... adding him to the thread.

-- Daniel
Hi all,

On Fri, 17 Feb 2023 14:35:22 -0300
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> On 1/30/23 08:49, Wander Lairson Costa wrote:
> > [...]
> > That's a point. Do you mean doing that even for !PREEMPT_RT?
>
> It seems that Luca Abeni (in Cc) is hitting the bug in the non-rt
> kernel with SCHED_DEADLINE...
>
> adding him to the thread.

Thanks Daniel for adding me.

I triggered this "BUG: Invalid wait context" with a 5.15.76 kernel,
without PREEMPT_RT. I can reproduce it easily on a KVM-based VM; if
more information or tests are needed, let me know.

Luca
On Fri, Feb 17, 2023 at 08:04:37PM +0100, luca abeni wrote:
> Hi all,
> [...]
> I triggered this "BUG: Invalid wait context" with a 5.15.76 kernel,
> without PREEMPT_RT. I can reproduce it easily on a KVM-based VM; if
> more information or tests are needed, let me know.

Does it happen in linux-6.1 or linux-6.2?
On Wed, 22 Feb 2023 15:42:37 -0300
Wander Lairson Costa <wander@redhat.com> wrote:
[...]
> > I triggered this "BUG: Invalid wait context" with a 5.15.76 kernel,
> > without PREEMPT_RT. I can reproduce it easily on a KVM-based VM; if
> > more information or tests are needed, let me know.
>
> Does it happen in linux-6.1 or linux-6.2?

I only tried with 5.15.76... I am going to test 6.2 and I'll let you
know ASAP.

Luca
On Wed, 22 Feb 2023 22:00:34 +0100
luca abeni <luca.abeni@santannapisa.it> wrote:

> On Wed, 22 Feb 2023 15:42:37 -0300
> Wander Lairson Costa <wander@redhat.com> wrote:
> [...]
> > > I triggered this "BUG: Invalid wait context" with a 5.15.76
> > > kernel, without PREEMPT_RT. I can reproduce it easily on a
> > > KVM-based VM; if more information or tests are needed, let me
> > > know.
> >
> > Does it happen in linux-6.1 or linux-6.2?
>
> I only tried with 5.15.76... I am going to test 6.2 and I'll let you
> know ASAP.

For various reasons it took more time than expected, but I managed to
reproduce the bug with 6.2... Here are the relevant kernel messages:

[ 1246.556100] =============================
[ 1246.559104] [ BUG: Invalid wait context ]
[ 1246.562270] 6.2.0 #4 Not tainted
[ 1246.564854] -----------------------------
[ 1246.567260] swapper/3/0 is trying to lock:
[ 1246.568665] ffff8c2c7ebb2c10 (&c->lock){..-.}-{3:3}, at: put_cpu_partial+0x24/0x1c0
[ 1246.571325] other info that might help us debug this:
[ 1246.573045] context-{2:2}
[ 1246.574166] no locks held by swapper/3/0.
[ 1246.575434] stack backtrace:
[ 1246.576207] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.2.0 #4
[ 1246.578184] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 1246.580815] Call Trace:
[ 1246.581723]  <IRQ>
[ 1246.582570]  dump_stack_lvl+0x49/0x61
[ 1246.583860]  __lock_acquire.cold+0xc8/0x31c
[ 1246.584923]  ? __lock_acquire+0x3be/0x1df0
[ 1246.585915]  lock_acquire+0xce/0x2f0
[ 1246.586819]  ? put_cpu_partial+0x24/0x1c0
[ 1246.588177]  ? lock_is_held_type+0xdb/0x130
[ 1246.589519]  put_cpu_partial+0x5b/0x1c0
[ 1246.590996]  ? put_cpu_partial+0x24/0x1c0
[ 1246.592212]  inactive_task_timer+0x263/0x4c0
[ 1246.593509]  ? __pfx_inactive_task_timer+0x10/0x10
[ 1246.594953]  __hrtimer_run_queues+0x1bf/0x470
[ 1246.596297]  hrtimer_interrupt+0x117/0x250
[ 1246.597528]  __sysvec_apic_timer_interrupt+0x99/0x270
[ 1246.599015]  sysvec_apic_timer_interrupt+0x8d/0xc0
[ 1246.600416]  </IRQ>
[ 1246.601170]  <TASK>
[ 1246.601918]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 1246.603377] RIP: 0010:default_idle+0xf/0x20
[ 1246.604640] Code: f6 5d 41 5c e9 72 4a 6e ff cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 03 52 2a 00 fb f4 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
[ 1246.609718] RSP: 0018:ffffa1a2c009bed0 EFLAGS: 00000202
[ 1246.611259] RAX: ffffffffa4961a60 RBX: ffff8c2c4126b000 RCX: 0000000000000000
[ 1246.613230] RDX: 0000000000000000 RSI: ffffffffa510271b RDI: ffffffffa50d5b15
[ 1246.615266] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000001
[ 1246.617275] R10: 0000000000000000 R11: ffff8c2c4126b000 R12: ffff8c2c4126b000
[ 1246.619318] R13: ffff8c2c4126b000 R14: 0000000000000000 R15: 0000000000000000
[ 1246.621293]  ? __pfx_default_idle+0x10/0x10
[ 1246.622581]  default_idle_call+0x71/0x220
[ 1246.623790]  do_idle+0x210/0x290
[ 1246.624827]  cpu_startup_entry+0x18/0x20
[ 1246.626016]  start_secondary+0xf1/0x100
[ 1246.627200]  secondary_startup_64_no_verify+0xe0/0xeb
[ 1246.628707]  </TASK>

Let me know if you need more information, or I should run other
tests/experiments.

Luca
On Fri, Feb 24, 2023 at 09:46:48AM +0100, luca abeni wrote:
> [...]
> For various reasons it took more time than expected, but I managed to
> reproduce the bug with 6.2... Here are the relevant kernel messages:
>
> [ 1246.556100] =============================
> [ 1246.559104] [ BUG: Invalid wait context ]
> [ 1246.562270] 6.2.0 #4 Not tainted
> [ 1246.564854] -----------------------------
> [ 1246.567260] swapper/3/0 is trying to lock:
> [ 1246.568665] ffff8c2c7ebb2c10 (&c->lock){..-.}-{3:3}, at: put_cpu_partial+0x24/0x1c0
> [...]
> [ 1246.589519]  put_cpu_partial+0x5b/0x1c0
> [ 1246.592212]  inactive_task_timer+0x263/0x4c0
> [ 1246.594953]  __hrtimer_run_queues+0x1bf/0x470
> [...]
>
> Let me know if you need more information, or
> I should run other tests/experiments.

This seems to be a different (maybe related?) issue. Would you mind
sharing your .config and steps to reproduce it?
On Fri, 24 Feb 2023 10:02:40 -0300
Wander Lairson Costa <wander@redhat.com> wrote:
[...]
> > [ 1246.556100] =============================
> > [ 1246.559104] [ BUG: Invalid wait context ]
> > [...]
> >
> > Let me know if you need more information, or
> > I should run other tests/experiments.
>
> This seems to be a different (maybe related?) issue. Would you mind
> sharing your .config and steps to reproduce it?

Ah, sorry then... I probably misunderstood the kernel messages (in my
understanding, this is lockdep complaining because put_task_struct() -
which can take a sleeping lock - is invoked from a timer callback).

Anyway, I attach the config (it is basically a "make defconfig; make
kvm_guest.config" with some debug options manually enabled - I think
the relevant one is CONFIG_PROVE_RAW_LOCK_NESTING)

Luca
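The trace decodes to the following shape: an hrtimer callback
(inactive_task_timer() from SCHED_DEADLINE) runs in hard-irq context
and drops what turns out to be the last task reference. A simplified,
hypothetical reconstruction of that pattern — not the actual
kernel/sched/deadline.c code:

	struct example_watch {
		struct hrtimer timer;
		struct task_struct *task;	/* holds a counted reference */
	};

	static enum hrtimer_restart example_timer_fn(struct hrtimer *timer)
	{
		struct example_watch *w =
			container_of(timer, struct example_watch, timer);

		/*
		 * Expired-timer callbacks run with interrupts disabled. If
		 * this put drops the last reference, __put_task_struct()
		 * frees memory right here, reaching allocator locks that
		 * lockdep (with CONFIG_PROVE_RAW_LOCK_NESTING) flags as an
		 * invalid wait context -- matching the splat above.
		 */
		put_task_struct(w->task);
		return HRTIMER_NORESTART;
	}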
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 357e0068497c..80b4c5812563 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -127,6 +127,27 @@ static inline void put_task_struct_many(struct task_struct *t, int nr)
 
 void put_task_struct_rcu_user(struct task_struct *task);
 
+extern void __delayed_put_task_struct(struct rcu_head *rhp);
+
+static inline void put_task_struct_atomic_safe(struct task_struct *task)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+		/*
+		 * Decrement the refcount explicitly to avoid unnecessarily
+		 * calling call_rcu.
+		 */
+		if (refcount_dec_and_test(&task->usage))
+			/*
+			 * under PREEMPT_RT, we can't call put_task_struct
+			 * in atomic context because it will indirectly
+			 * acquire sleeping locks.
+			 */
+			call_rcu(&task->rcu, __delayed_put_task_struct);
+	} else {
+		put_task_struct(task);
+	}
+}
+
 /* Free all architecture-specific resources held by a thread. */
 void release_thread(struct task_struct *dead_task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..3d7a4e9311b3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -859,6 +859,14 @@ void __put_task_struct(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(__put_task_struct);
 
+void __delayed_put_task_struct(struct rcu_head *rhp)
+{
+	struct task_struct *task = container_of(rhp, struct task_struct, rcu);
+
+	__put_task_struct(task);
+}
+EXPORT_SYMBOL_GPL(__delayed_put_task_struct);
+
 void __init __weak arch_task_cache_init(void) { }
 
 /*
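The remaining patches in the series (not shown on this page) convert
individual call sites to the new helper; for a hypothetical
atomic-context site the conversion is a one-line change:

	-	put_task_struct(p);
	+	put_task_struct_atomic_safe(p);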