Message ID | 20230516191441.34377-1-wander@redhat.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp652154vqo; Tue, 16 May 2023 12:24:01 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ7pAZbbDpFggKSdk1/B/x+elszyIZevFLy9Wf24GanubFJJAnXcKLwFYEc3RyqMjdmOlSx/ X-Received: by 2002:a05:6a21:329c:b0:102:8f0a:5acf with SMTP id yt28-20020a056a21329c00b001028f0a5acfmr29487520pzb.62.1684265041255; Tue, 16 May 2023 12:24:01 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1684265041; cv=none; d=google.com; s=arc-20160816; b=BLEU7Ul/9YPZxvsJ4OB/00gXdOdemXG7LUOs1bBN2Fd7fckuyaF2OTXuEQaaL1hG+v LwOx5XejsOytmu7PNAkfdH7ArF3XnBuaZj/IJZmtxSXTDBtudAppYuhsgI2rwlklCX+M PPWncGlAUIMPrJl+E5KINxVXropdTElwKCNceFw/UZA6PrKq/Z0n0TWOJKHH41xGvezc fxLDL/Mn+o4LyJPVL34+FscwdXI1AN4QK5JCgGEnL/QoWxT6BiBv/vMbUTnpJnVW9kpW M5VMQMIQ+71jl1THdiUVC35qytfhCbIImPD2CNuQ+NMSnn+nswuX6EakdNz0bwGItco1 F5Jw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=5L1aJYtcbR7dHch+1g8syxuuSJfOlW1pxYhUzrmVLTw=; b=lMDb5ANZI1N9oWF72qT+FDhDyc5gMnQYZSgrP1PgM0J84WIgrKtciedIPSq87HOiek dOVU4H3Ly1q3c6xh1H0tYcfcGH5goxSaadzkNo3B5DDJdk5ltPGGCOvY1QbSp7mUqm5F ppMvDLyroMzIv/x77oer4oPALq383NIuSxBbnzQa2/ifjZeremhkHPs2AhWT0/tWOyK6 39seBSm2byya3ZV+q7c+lCP0Ylvgly6IBS2RLLSRL8C4uKmuBS/ztNpgtZE2E8S+cDLg cQiNiO2cwiJfwp5kQgxjLz87TQWWalkLIZCJJkFD5J1+QGm6gjT076qr/cJP/ESvhh+k gt3g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=IsnFrOg9; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id 76-20020a63004f000000b005303b23ed6csi19489776pga.531.2023.05.16.12.23.48; Tue, 16 May 2023 12:24:01 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=IsnFrOg9; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229539AbjEPTPr (ORCPT <rfc822;peekingduck44@gmail.com> + 99 others); Tue, 16 May 2023 15:15:47 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44042 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229568AbjEPTPp (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 16 May 2023 15:15:45 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 51FDA49EF for <linux-kernel@vger.kernel.org>; Tue, 16 May 2023 12:15:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1684264499; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=5L1aJYtcbR7dHch+1g8syxuuSJfOlW1pxYhUzrmVLTw=; b=IsnFrOg9Q/WVv2Bsr+abbocaRKO2iBSXaqMn2PeCQ0ydqVrGJv1Ngh1nbvGYK9mxUCgIZz ncZbNkxZg4zX3lJVSDxFQvSSgqhuHLGbhpNRoKNkFrCMdYhPTFTID0NyDQr2bkNPt2TLzs 6DIbfjx+ODnKD9w5Z3mkOBZcvs3qNNg= Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-228-Cgpq0UWTNzKc8Fd-rELAiA-1; Tue, 16 May 2023 15:14:52 -0400 X-MC-Unique: Cgpq0UWTNzKc8Fd-rELAiA-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.rdu2.redhat.com [10.11.54.6]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id B70751C060CB; Tue, 16 May 2023 19:14:50 +0000 (UTC) Received: from fedora.redhat.com (unknown [10.22.9.87]) by smtp.corp.redhat.com (Postfix) with ESMTP id D0B8F2166B31; Tue, 16 May 2023 19:14:44 +0000 (UTC) From: Wander Lairson Costa <wander@redhat.com> To: Oleg Nesterov <oleg@redhat.com>, "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk>, Brian Cain <bcain@quicinc.com>, Michael Ellerman <mpe@ellerman.id.au>, Stafford Horne <shorne@gmail.com>, Kefeng Wang <wangkefeng.wang@huawei.com>, Wander Lairson Costa <wander@redhat.com>, Andrew Morton <akpm@linux-foundation.org>, "Liam R. Howlett" <Liam.Howlett@Oracle.com>, Vlastimil Babka <vbabka@suse.cz>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, "Eric W. Biederman" <ebiederm@xmission.com>, Andrei Vagin <avagin@gmail.com>, Peter Zijlstra <peterz@infradead.org>, "Paul E. McKenney" <paulmck@kernel.org>, Daniel Bristot de Oliveira <bristot@kernel.org>, Yu Zhao <yuzhao@google.com>, Alexey Gladkov <legion@kernel.org>, Mike Kravetz <mike.kravetz@oracle.com>, Yang Shi <shy828301@gmail.com>, linux-kernel@vger.kernel.org (open list) Cc: Hu Chunyu <chuhu@redhat.com>, Valentin Schneider <vschneid@redhat.com>, Sebastian Andrzej Siewior <bigeasy@linutronix.de>, Steven Rostedt <rostedt@goodmis.org>, Luis Goncalves <lgoncalv@redhat.com> Subject: [PATCH v9] kernel/fork: beware of __put_task_struct calling context Date: Tue, 16 May 2023 16:14:41 -0300 Message-Id: <20230516191441.34377-1-wander@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 3.1 on 10.11.54.6 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1766079900048804221?= X-GMAIL-MSGID: =?utf-8?q?1766079900048804221?= |
Series |
[v9] kernel/fork: beware of __put_task_struct calling context
|
|
Commit Message
Wander Lairson Costa
May 16, 2023, 7:14 p.m. UTC
Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
locks. Therefore, it can't be called from an non-preemptible context.
One practical example is splat inside inactive_task_timer(), which is
called in a interrupt context:
CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
Call Trace:
dump_stack_lvl+0x57/0x7d
mark_lock_irq.cold+0x33/0xba
? stack_trace_save+0x4b/0x70
? save_trace+0x55/0x150
mark_lock+0x1e7/0x400
mark_usage+0x11d/0x140
__lock_acquire+0x30d/0x930
lock_acquire.part.0+0x9c/0x210
? refill_obj_stock+0x3d/0x3a0
? rcu_read_lock_sched_held+0x3f/0x70
? trace_lock_acquire+0x38/0x140
? lock_acquire+0x30/0x80
? refill_obj_stock+0x3d/0x3a0
rt_spin_lock+0x27/0xe0
? refill_obj_stock+0x3d/0x3a0
refill_obj_stock+0x3d/0x3a0
? inactive_task_timer+0x1ad/0x340
kmem_cache_free+0x357/0x560
inactive_task_timer+0x1ad/0x340
? switched_from_dl+0x2d0/0x2d0
__run_hrtimer+0x8a/0x1a0
__hrtimer_run_queues+0x91/0x130
hrtimer_interrupt+0x10f/0x220
__sysvec_apic_timer_interrupt+0x7b/0xd0
sysvec_apic_timer_interrupt+0x4f/0xd0
? asm_sysvec_apic_timer_interrupt+0xa/0x20
asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0033:0x7fff196bf6f5
Instead of calling __put_task_struct() directly, we defer it using
call_rcu(). A more natural approach would use a workqueue, but since
in PREEMPT_RT, we can't allocate dynamic memory from atomic context,
the code would become more complex because we would need to put the
work_struct instance in the task_struct and initialize it when we
allocate a new task_struct.
Changelog
=========
v1:
* Initial implementation fixing the splat.
v2:
* Isolate the logic in its own function.
* Fix two more cases caught in review.
v3:
* Change __put_task_struct() to handle the issue internally.
v4:
* Explain why call_rcu() is safe to call from interrupt context.
v5:
* Explain why __put_task_struct() doesn't conflict with
put_task_sruct_rcu_user.
v6:
* As per Sebastian's review, revert back the implementation of v2
with a distinct function.
* Add a check in put_task_struct() to warning when called from a
non-sleepable context.
* Address more call sites.
v7:
* Fix typos.
* Add an explanation why the new function doesn't conflict with
delayed_free_task().
v8:
* Bring back v5.
* Fix coding style.
v9:
* Reorganize to not need ___put_task_struct() by Oleg's suggestion.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Reported-by: Hu Chunyu <chuhu@redhat.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Paul McKenney <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Luis Goncalves <lgoncalv@redhat.com>
---
include/linux/sched/task.h | 28 +++++++++++++++++++++++++++-
kernel/fork.c | 8 ++++++++
2 files changed, 35 insertions(+), 1 deletion(-)
Comments
On Tue, May 16, 2023 at 04:14:41PM -0300, Wander Lairson Costa wrote: > +void __put_task_struct_rcu_cb(struct rcu_head *rhp) > +{ > + struct task_struct *task = container_of(rhp, struct task_struct, rcu); > + > + __put_task_struct(task); > +} > +EXPORT_SYMBOL_GPL(__put_task_struct_rcu_cb); Why does this need to be exported when its only caller is within the main kernel and cannot possibly be built as a module?
On Tue, 16 May 2023 20:24:04 +0100 Matthew Wilcox <willy@infradead.org> wrote: > On Tue, May 16, 2023 at 04:14:41PM -0300, Wander Lairson Costa wrote: > > +void __put_task_struct_rcu_cb(struct rcu_head *rhp) > > +{ > > + struct task_struct *task = container_of(rhp, struct task_struct, rcu); > > + > > + __put_task_struct(task); > > +} > > +EXPORT_SYMBOL_GPL(__put_task_struct_rcu_cb); > > Why does this need to be exported when its only caller is within the > main kernel and cannot possibly be built as a module? It's referenced by inlined put_task_struct(), which is called from all over. However I believe the above definition could be inside #ifdef CONFIG_PREEMPT_RT, to save a scrap of resources?
On Tue, May 16, 2023 at 02:05:55PM -0700, Andrew Morton wrote: > On Tue, 16 May 2023 20:24:04 +0100 Matthew Wilcox <willy@infradead.org> wrote: > > > On Tue, May 16, 2023 at 04:14:41PM -0300, Wander Lairson Costa wrote: > > > +void __put_task_struct_rcu_cb(struct rcu_head *rhp) > > > +{ > > > + struct task_struct *task = container_of(rhp, struct task_struct, rcu); > > > + > > > + __put_task_struct(task); > > > +} > > > +EXPORT_SYMBOL_GPL(__put_task_struct_rcu_cb); > > > > Why does this need to be exported when its only caller is within the > > main kernel and cannot possibly be built as a module? > > It's referenced by inlined put_task_struct(), which is called from all > over. Oh, I missed that put_task_struct() was still inlined. Should it be? It seems quite large now. > However I believe the above definition could be inside #ifdef > CONFIG_PREEMPT_RT, to save a scrap of resources?
On Tue, 16 May 2023 22:41:18 +0100 Matthew Wilcox <willy@infradead.org> wrote: > > Oh, I missed that put_task_struct() was still inlined. Should it be? > It seems quite large now. It's not significantly worse because of this patch. In fact, it's unchanged for non-RT kernels. Possibly put_task_struct() *should* be uninlined, because it made the mistake of using the dang refcount stuff, which never saw a byte which it couldn't consume :( I mean... --- a/fs/open.c~a +++ a/fs/open.c @@ -1572,3 +1572,9 @@ int stream_open(struct inode *inode, str } EXPORT_SYMBOL(stream_open); + +#include <linux/refcount.h> +bool foo(refcount_t *r) +{ + return refcount_dec_and_test(r); +}
On 05/16, Wander Lairson Costa wrote: > > static inline void put_task_struct(struct task_struct *t) > { > - if (refcount_dec_and_test(&t->usage)) > + if (!refcount_dec_and_test(&t->usage)) > + return; > + > + /* > + * under PREEMPT_RT, we can't call put_task_struct > + * in atomic context because it will indirectly > + * acquire sleeping locks. > + * > + * call_rcu() will schedule delayed_put_task_struct_rcu() > + * to be called in process context. > + * > + * __put_task_struct() is called when > + * refcount_dec_and_test(&t->usage) succeeds. > + * > + * This means that it can't "conflict" with > + * put_task_struct_rcu_user() which abuses ->rcu the same > + * way; rcu_users has a reference so task->usage can't be > + * zero after rcu_users 1 -> 0 transition. > + * > + * delayed_free_task() also uses ->rcu, but it is only called > + * when it fails to fork a process. Therefore, there is no > + * way it can conflict with put_task_struct(). > + */ > + if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) > + call_rcu(&t->rcu, __put_task_struct_rcu_cb); > + else > __put_task_struct(t); > } LGTM but we still need to understand the possible problems with CONFIG_PROVE_RAW_LOCK_NESTING ... Again, I'll try to investigate when I have time although I am not sure I can really help. Perhaps you too can try to do this ? ;) Oleg.
On Wed, May 17, 2023 at 12:26 PM Oleg Nesterov <oleg@redhat.com> wrote: > > On 05/16, Wander Lairson Costa wrote: > > > > static inline void put_task_struct(struct task_struct *t) > > { > > - if (refcount_dec_and_test(&t->usage)) > > + if (!refcount_dec_and_test(&t->usage)) > > + return; > > + > > + /* > > + * under PREEMPT_RT, we can't call put_task_struct > > + * in atomic context because it will indirectly > > + * acquire sleeping locks. > > + * > > + * call_rcu() will schedule delayed_put_task_struct_rcu() > > + * to be called in process context. > > + * > > + * __put_task_struct() is called when > > + * refcount_dec_and_test(&t->usage) succeeds. > > + * > > + * This means that it can't "conflict" with > > + * put_task_struct_rcu_user() which abuses ->rcu the same > > + * way; rcu_users has a reference so task->usage can't be > > + * zero after rcu_users 1 -> 0 transition. > > + * > > + * delayed_free_task() also uses ->rcu, but it is only called > > + * when it fails to fork a process. Therefore, there is no > > + * way it can conflict with put_task_struct(). > > + */ > > + if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) > > + call_rcu(&t->rcu, __put_task_struct_rcu_cb); > > + else > > __put_task_struct(t); > > } > > LGTM but we still need to understand the possible problems with CONFIG_PROVE_RAW_LOCK_NESTING ... > > Again, I'll try to investigate when I have time although I am not sure I can really help. > > Perhaps you too can try to do this ? ;) > FWIW, I tested this patch with CONFIG_PROVE_LOCK_NESTING in RT and stock kernels. No splat happened. > Oleg. >
On 05/17, Wander Lairson Costa wrote: > > On Wed, May 17, 2023 at 12:26 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > LGTM but we still need to understand the possible problems with CONFIG_PROVE_RAW_LOCK_NESTING ... > > > > Again, I'll try to investigate when I have time although I am not sure I can really help. > > > > Perhaps you too can try to do this ? ;) > > > > FWIW, I tested this patch with CONFIG_PROVE_LOCK_NESTING in RT and > stock kernels. No splat happened. Strange... FYI, I am running the kernel with this patch diff --git a/kernel/sys.c b/kernel/sys.c index 339fee3eff6a..3169cceddf3b 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2412,6 +2412,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = 0; switch (option) { + case 666: { + static DEFINE_SPINLOCK(l); + static DEFINE_RAW_SPINLOCK(r); + + raw_spin_lock(&r); + spin_lock(&l); + spin_unlock(&l); + raw_spin_unlock(&r); + + break; + } case PR_SET_PDEATHSIG: if (!valid_signal(arg2)) { error = -EINVAL; applied (because I am too lazy to compile a module ;) and # perl -e 'syscall 157,666' triggers the lockdep bug ============================= [ BUG: Invalid wait context ] 6.4.0-rc2-00018-g4d6d4c7f541d-dirty #1176 Not tainted ----------------------------- perl/35 is trying to lock: ffffffff81c4cc18 (l){....}-{3:3}, at: __do_sys_prctl+0x21b/0x87b other info that might help us debug this: context-{5:5} ... as expected. Looks like your testing was wrong... Or maybe you missed another lockdep problem ? Did you check dmesg? Perhaps lockdep detected another bug,say, even at boot time ? In this case debug_locks_off() sets debug_locks = 0 and this disables lockdep. Oleg.
On Mon, May 29, 2023 at 9:23 AM Oleg Nesterov <oleg@redhat.com> wrote: > > On 05/17, Wander Lairson Costa wrote: > > > > On Wed, May 17, 2023 at 12:26 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > LGTM but we still need to understand the possible problems with CONFIG_PROVE_RAW_LOCK_NESTING ... > > > > > > Again, I'll try to investigate when I have time although I am not sure I can really help. > > > > > > Perhaps you too can try to do this ? ;) > > > > > > > FWIW, I tested this patch with CONFIG_PROVE_LOCK_NESTING in RT and > > stock kernels. No splat happened. > > Strange... FYI, I am running the kernel with this patch > > diff --git a/kernel/sys.c b/kernel/sys.c > index 339fee3eff6a..3169cceddf3b 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -2412,6 +2412,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > > error = 0; > switch (option) { > + case 666: { > + static DEFINE_SPINLOCK(l); > + static DEFINE_RAW_SPINLOCK(r); > + > + raw_spin_lock(&r); > + spin_lock(&l); > + spin_unlock(&l); > + raw_spin_unlock(&r); > + > + break; > + } > case PR_SET_PDEATHSIG: > if (!valid_signal(arg2)) { > error = -EINVAL; > > applied (because I am too lazy to compile a module ;) and > FWIW, I converted it to a module [1] > # perl -e 'syscall 157,666' > > triggers the lockdep bug > > ============================= > [ BUG: Invalid wait context ] > 6.4.0-rc2-00018-g4d6d4c7f541d-dirty #1176 Not tainted > ----------------------------- > perl/35 is trying to lock: > ffffffff81c4cc18 (l){....}-{3:3}, at: __do_sys_prctl+0x21b/0x87b > other info that might help us debug this: > context-{5:5} > ... > > as expected. > Yeah, I tried it here and I had the same results, but only in the RT kernel. But running the reproducer for put_task_struct(), works fine. > Looks like your testing was wrong... Or maybe you missed another lockdep problem ? > Did you check dmesg? Perhaps lockdep detected another bug,say, even at boot time ? > In this case debug_locks_off() sets debug_locks = 0 and this disables lockdep. > > Oleg. >
On 06/01, Wander Lairson Costa wrote: > > On Mon, May 29, 2023 at 9:23 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > On 05/17, Wander Lairson Costa wrote: > > > > > > On Wed, May 17, 2023 at 12:26 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > LGTM but we still need to understand the possible problems with CONFIG_PROVE_RAW_LOCK_NESTING ... > > > > > > > > Again, I'll try to investigate when I have time although I am not sure I can really help. > > > > > > > > Perhaps you too can try to do this ? ;) > > > > > > > > > > FWIW, I tested this patch with CONFIG_PROVE_LOCK_NESTING in RT and > > > stock kernels. No splat happened. > > > > Strange... FYI, I am running the kernel with this patch > > > > diff --git a/kernel/sys.c b/kernel/sys.c > > index 339fee3eff6a..3169cceddf3b 100644 > > --- a/kernel/sys.c > > +++ b/kernel/sys.c > > @@ -2412,6 +2412,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > > > > error = 0; > > switch (option) { > > + case 666: { > > + static DEFINE_SPINLOCK(l); > > + static DEFINE_RAW_SPINLOCK(r); > > + > > + raw_spin_lock(&r); > > + spin_lock(&l); > > + spin_unlock(&l); > > + raw_spin_unlock(&r); > > + > > + break; > > + } > > case PR_SET_PDEATHSIG: > > if (!valid_signal(arg2)) { > > error = -EINVAL; > > > > applied (because I am too lazy to compile a module ;) and > > > > FWIW, I converted it to a module [1] where is [1] ? not that I think this matters though... > > # perl -e 'syscall 157,666' > > > > triggers the lockdep bug > > > > ============================= > > [ BUG: Invalid wait context ] > > 6.4.0-rc2-00018-g4d6d4c7f541d-dirty #1176 Not tainted > > ----------------------------- > > perl/35 is trying to lock: > > ffffffff81c4cc18 (l){....}-{3:3}, at: __do_sys_prctl+0x21b/0x87b > > other info that might help us debug this: > > context-{5:5} > > ... > > > > as expected. > > > > Yeah, I tried it here and I had the same results, OK, > but only in the RT kernel this again suggests that your testing was wrong or I am totally confused (quite possible, I know nothing about RT). I did the testing without CONFIG_PREEMPT_RT. > But running the reproducer for put_task_struct(), works fine. which reproducer ? Oleg.
On Thu, Jun 1, 2023 at 3:14 PM Oleg Nesterov <oleg@redhat.com> wrote: > > On 06/01, Wander Lairson Costa wrote: > > > > On Mon, May 29, 2023 at 9:23 AM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > On 05/17, Wander Lairson Costa wrote: > > > > > > > > On Wed, May 17, 2023 at 12:26 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > LGTM but we still need to understand the possible problems with CONFIG_PROVE_RAW_LOCK_NESTING ... > > > > > > > > > > Again, I'll try to investigate when I have time although I am not sure I can really help. > > > > > > > > > > Perhaps you too can try to do this ? ;) > > > > > > > > > > > > > FWIW, I tested this patch with CONFIG_PROVE_LOCK_NESTING in RT and > > > > stock kernels. No splat happened. > > > > > > Strange... FYI, I am running the kernel with this patch > > > > > > diff --git a/kernel/sys.c b/kernel/sys.c > > > index 339fee3eff6a..3169cceddf3b 100644 > > > --- a/kernel/sys.c > > > +++ b/kernel/sys.c > > > @@ -2412,6 +2412,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > > > > > > error = 0; > > > switch (option) { > > > + case 666: { > > > + static DEFINE_SPINLOCK(l); > > > + static DEFINE_RAW_SPINLOCK(r); > > > + > > > + raw_spin_lock(&r); > > > + spin_lock(&l); > > > + spin_unlock(&l); > > > + raw_spin_unlock(&r); > > > + > > > + break; > > > + } > > > case PR_SET_PDEATHSIG: > > > if (!valid_signal(arg2)) { > > > error = -EINVAL; > > > > > > applied (because I am too lazy to compile a module ;) and > > > > > > > FWIW, I converted it to a module [1] > > where is [1] ? not that I think this matters though... > > > > # perl -e 'syscall 157,666' > > > > > > triggers the lockdep bug > > > > > > ============================= > > > [ BUG: Invalid wait context ] > > > 6.4.0-rc2-00018-g4d6d4c7f541d-dirty #1176 Not tainted > > > ----------------------------- > > > perl/35 is trying to lock: > > > ffffffff81c4cc18 (l){....}-{3:3}, at: __do_sys_prctl+0x21b/0x87b > > > other info that might help us debug this: > > > context-{5:5} > > > ... > > > > > > as expected. > > > > > > > Yeah, I tried it here and I had the same results, > > OK, > > > but only in the RT kernel > > this again suggests that your testing was wrong or I am totally confused (quite > possible, I know nothing about RT). I did the testing without CONFIG_PREEMPT_RT. > Hrm, could you please share your .config? > > But running the reproducer for put_task_struct(), works fine. > > which reproducer ? > Only now I noticed I didn't add the reproducer to the commit message: while true; do stress-ng --sched deadline --sched-period 1000000000 --sched-runtime 800000000 --sched-deadline 1000000000 --mmapfork 23 -t 20 done
On 06/01, Wander Lairson Costa wrote: > > On Thu, Jun 1, 2023 at 3:14 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > but only in the RT kernel > > > > this again suggests that your testing was wrong or I am totally confused (quite > > possible, I know nothing about RT). I did the testing without CONFIG_PREEMPT_RT. > > > > Hrm, could you please share your .config? Sure. I do not want to spam the list, I'll send you a private email. Can you share your kernel module code? Did you verify that debug_locks != 0 as I asked in my previous email ? > > > But running the reproducer for put_task_struct(), works fine. > > > > which reproducer ? > > > > Only now I noticed I didn't add the reproducer to the commit message: > > while true; do > stress-ng --sched deadline --sched-period 1000000000 > --sched-runtime 800000000 --sched-deadline 1000000000 --mmapfork 23 -t > 20 > done Cough ;) I think we need a more simple one to enssure that refcount_sub_and_test(nr, &t->usage) returns true under raw_spin_lock() and then __put_task_struct() actually takes spin_lock(). Oleg.
On 06/01, Wander Lairson Costa wrote: > > On Thu, Jun 1, 2023 at 3:14 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > but only in the RT kernel > > > > this again suggests that your testing was wrong or I am totally confused (quite > > possible, I know nothing about RT). I did the testing without CONFIG_PREEMPT_RT. > > > > Hrm, could you please share your .config? Sure. I do not want to spam the list, I'll send you a private email. Can you share your kernel module code? Did you verify that debug_locks != 0 as I asked in my previous email ? > > > But running the reproducer for put_task_struct(), works fine. > > > > which reproducer ? > > > > Only now I noticed I didn't add the reproducer to the commit message: > > while true; do > stress-ng --sched deadline --sched-period 1000000000 > --sched-runtime 800000000 --sched-deadline 1000000000 --mmapfork 23 -t > 20 > done Cough ;) I think we need something more simple to ensure that refcount_sub_and_test(nr, &t->usage) returns true under raw_spin_lock() and then __put_task_struct() actually takes spin_lock(). Oleg.
On Fri, Jun 2, 2023 at 2:34 PM Oleg Nesterov <oleg@redhat.com> wrote: > > On 06/01, Wander Lairson Costa wrote: > > > > On Thu, Jun 1, 2023 at 3:14 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > but only in the RT kernel > > > > > > this again suggests that your testing was wrong or I am totally confused (quite > > > possible, I know nothing about RT). I did the testing without CONFIG_PREEMPT_RT. > > > > > > > Hrm, could you please share your .config? > > Sure. I do not want to spam the list, I'll send you a private email. > Thanks. I found an unrelated earlier splat in the console code. That's why I couldn't reproduce it in the stock kernel. > Can you share your kernel module code? > *facepalm* I forgot to post the link: https://github.com/walac/test-prove-lock/ > Did you verify that debug_locks != 0 as I asked in my previous email ? > > > > > But running the reproducer for put_task_struct(), works fine. > > > > > > which reproducer ? > > > > > > > Only now I noticed I didn't add the reproducer to the commit message: > > > > while true; do > > stress-ng --sched deadline --sched-period 1000000000 > > --sched-runtime 800000000 --sched-deadline 1000000000 --mmapfork 23 -t > > 20 > > done > > Cough ;) I think we need a more simple one to enssure that > refcount_sub_and_test(nr, &t->usage) returns true under raw_spin_lock() > and then __put_task_struct() actually takes spin_lock(). > > Oleg. >
On 06/05, Wander Lairson Costa wrote: > > Thanks. I found an unrelated earlier splat in the console code. That's > why I couldn't reproduce it in the stock kernel. As expected... So... Not sure what can I say ;) can you verify that this patch doesn't solve the issues with CONFIG_PROVE_RAW_LOCK_NESTING pointed out by Sebastian? Using stress-ng or anything else. This is not that bad, unless I am totally confused the current code (without your patch) has the same problem (otherwise we wouldn't need this fix). But perhaps you can make 2/2 which adds the DEFINE_WAIT_OVERRIDE_MAP() hack as Peter suggested? Oleg.
On Tue, Jun 6, 2023 at 5:40 PM Oleg Nesterov <oleg@redhat.com> wrote: > > On 06/05, Wander Lairson Costa wrote: > > > > Thanks. I found an unrelated earlier splat in the console code. That's > > why I couldn't reproduce it in the stock kernel. > > As expected... > > So... Not sure what can I say ;) can you verify that this patch doesn't solve > the issues with CONFIG_PROVE_RAW_LOCK_NESTING pointed out by Sebastian? Using > stress-ng or anything else. > I managed to test it without a console. No issues happened in the stock kernel. > This is not that bad, unless I am totally confused the current code (without > your patch) has the same problem (otherwise we wouldn't need this fix). > That's my understanding as well. > But perhaps you can make 2/2 which adds the DEFINE_WAIT_OVERRIDE_MAP() hack > as Peter suggested? > Yes, sure. I would like to get the issue reproduced in practice to make sure I am really fixing the problem. But I can live with that. > Oleg. >
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index d6c48163c6de..9bcb9535d4e1 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -112,10 +112,36 @@ static inline struct task_struct *get_task_struct(struct task_struct *t) } extern void __put_task_struct(struct task_struct *t); +extern void __put_task_struct_rcu_cb(struct rcu_head *rhp); static inline void put_task_struct(struct task_struct *t) { - if (refcount_dec_and_test(&t->usage)) + if (!refcount_dec_and_test(&t->usage)) + return; + + /* + * under PREEMPT_RT, we can't call put_task_struct + * in atomic context because it will indirectly + * acquire sleeping locks. + * + * call_rcu() will schedule delayed_put_task_struct_rcu() + * to be called in process context. + * + * __put_task_struct() is called when + * refcount_dec_and_test(&t->usage) succeeds. + * + * This means that it can't "conflict" with + * put_task_struct_rcu_user() which abuses ->rcu the same + * way; rcu_users has a reference so task->usage can't be + * zero after rcu_users 1 -> 0 transition. + * + * delayed_free_task() also uses ->rcu, but it is only called + * when it fails to fork a process. Therefore, there is no + * way it can conflict with put_task_struct(). + */ + if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) + call_rcu(&t->rcu, __put_task_struct_rcu_cb); + else __put_task_struct(t); } diff --git a/kernel/fork.c b/kernel/fork.c index 08969f5aa38d..fd3bb4a554c4 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -846,6 +846,14 @@ void __put_task_struct(struct task_struct *tsk) } EXPORT_SYMBOL_GPL(__put_task_struct); +void __put_task_struct_rcu_cb(struct rcu_head *rhp) +{ + struct task_struct *task = container_of(rhp, struct task_struct, rcu); + + __put_task_struct(task); +} +EXPORT_SYMBOL_GPL(__put_task_struct_rcu_cb); + void __init __weak arch_task_cache_init(void) { } /*