Message ID | 20231229074916.53547-1-andrea.righi@canonical.com |
---|---|
State | New |
Headers |
From: Andrea Righi <andrea.righi@canonical.com>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>, Alexei Starovoitov <ast@kernel.org>, linux-kernel@vger.kernel.org
Subject: [PATCH] kernfs: convert kernfs_idr_lock to an irq safe raw spinlock
Date: Fri, 29 Dec 2023 08:49:16 +0100
Message-ID: <20231229074916.53547-1-andrea.righi@canonical.com> |
Series |
kernfs: convert kernfs_idr_lock to an irq safe raw spinlock
|
|
Commit Message
Andrea Righi
Dec. 29, 2023, 7:49 a.m. UTC
bpf_cgroup_from_id() is basically a wrapper around cgroup_get_from_id(),
which relies on kernfs to determine the cgroup associated with the
target id.
As a kfunc, it can be called by BPF programs attached to virtually any
kernel function, including contexts where certain locks are held.
However, kernfs does not use an irq-safe spinlock for kernfs_idr_lock,
which means any kernfs function acquiring this lock can be interrupted
and potentially hit bpf_cgroup_from_id() in the process, triggering a
deadlock.
For example, it is really easy to trigger a lockdep splat between
kernfs_idr_lock and rq->__lock by attaching a small BPF program to
__set_cpus_allowed_ptr_locked() that just calls bpf_cgroup_from_id()
(see the reproducer sketch after the splat below):
=====================================================
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
6.7.0-rc7-virtme #5 Not tainted
-----------------------------------------------------
repro/131 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
ffffffffb2dc4578 (kernfs_idr_lock){+.+.}-{2:2}, at: kernfs_find_and_get_node_by_id+0x1d/0x80
and this task is already holding:
ffff911cbecaf218 (&rq->__lock){-.-.}-{2:2}, at: task_rq_lock+0x50/0xc0
which would create a new lock dependency:
(&rq->__lock){-.-.}-{2:2} -> (kernfs_idr_lock){+.+.}-{2:2}
but this new dependency connects a HARDIRQ-irq-safe lock:
(&rq->__lock){-.-.}-{2:2}
... which became HARDIRQ-irq-safe at:
lock_acquire+0xbf/0x2b0
_raw_spin_lock_nested+0x2e/0x40
scheduler_tick+0x5d/0x170
update_process_times+0x9c/0xb0
tick_periodic+0x27/0xe0
tick_handle_periodic+0x24/0x70
__sysvec_apic_timer_interrupt+0x64/0x1a0
sysvec_apic_timer_interrupt+0x6f/0x80
asm_sysvec_apic_timer_interrupt+0x1a/0x20
memcpy+0xc/0x20
arch_dup_task_struct+0x15/0x30
copy_process+0x1ce/0x1eb0
kernel_clone+0xac/0x390
kernel_thread+0x6f/0xa0
kthreadd+0x199/0x230
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x1b/0x30
to a HARDIRQ-irq-unsafe lock:
(kernfs_idr_lock){+.+.}-{2:2}
... which became HARDIRQ-irq-unsafe at:
...
lock_acquire+0xbf/0x2b0
_raw_spin_lock+0x30/0x40
__kernfs_new_node.isra.0+0x83/0x280
kernfs_create_root+0xf6/0x1d0
sysfs_init+0x1b/0x70
mnt_init+0xd9/0x2a0
vfs_caches_init+0xcf/0xe0
start_kernel+0x58a/0x6a0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xc5/0xe0
secondary_startup_64_no_verify+0x178/0x17b
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(kernfs_idr_lock);
                               local_irq_disable();
                               lock(&rq->__lock);
                               lock(kernfs_idr_lock);
  <Interrupt>
    lock(&rq->__lock);
*** DEADLOCK ***
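A minimal BPF reproducer along these lines might look like the following
sketch (illustrative only: the fentry attach point, file name and kfunc
declarations are assumptions, not the exact program used):

/* repro.bpf.c - hypothetical sketch of the reproducer described above:
 * call bpf_cgroup_from_id() from a hook where rq->__lock is held.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* kfuncs exported by the kernel (declarations assumed here) */
struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
void bpf_cgroup_release(struct cgroup *cgrp) __ksym;

SEC("fentry/__set_cpus_allowed_ptr_locked")
int BPF_PROG(repro)
{
	struct cgroup *cgrp;

	/* this lookup ends up taking kernfs_idr_lock while the tracee
	 * already holds rq->__lock */
	cgrp = bpf_cgroup_from_id(1);
	if (cgrp)
		bpf_cgroup_release(cgrp);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";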
Prevent this deadlock condition by converting kernfs_idr_lock to an
irq-safe raw spinlock.
The performance impact of this change should be negligible, and it also
helps prevent similar deadlock conditions with any other subsystem that
may depend on kernfs.
Fixes: 332ea1f697be ("bpf: Add bpf_cgroup_from_id() kfunc")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
---
fs/kernfs/dir.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
Comments
On Fri, Dec 29, 2023 at 08:49:16AM +0100, Andrea Righi wrote:
> bpf_cgroup_from_id() is basically a wrapper around cgroup_get_from_id(),
> which relies on kernfs to determine the cgroup associated with the
> target id.
...
> Fixes: 332ea1f697be ("bpf: Add bpf_cgroup_from_id() kfunc")
> Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

Acked-by: Tejun Heo <tj@kernel.org>

As an immediate fix, this looks great. In the longer term, it probably
would make sense to make that idr RCU protected so that the lookup path
doesn't have to worry about locking order.

Thanks.
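For reference, the RCU-protected lookup being hinted at could look roughly
like this (a sketch only, assuming kernfs_node objects are freed after an
RCU grace period; it is not the series that was eventually posted):

/* Sketch: lookup without taking kernfs_idr_lock, relying on RCU.
 * Assumes idr_remove() callers free the kernfs_node only after an RCU
 * grace period, so the pointer returned by idr_find() stays valid
 * inside the read-side critical section.
 */
struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
						   u64 id)
{
	struct kernfs_node *kn;
	ino_t ino = kernfs_id_ino(id);
	u32 gen = kernfs_id_gen(id);

	rcu_read_lock();

	kn = idr_find(&root->ino_idr, (u32)ino);
	if (!kn)
		goto err_unlock;

	/* 32-bit ino/gen consistency checks elided in this sketch */

	if (unlikely(!__kernfs_active(kn) || !atomic_inc_not_zero(&kn->count)))
		goto err_unlock;

	rcu_read_unlock();
	return kn;
err_unlock:
	rcu_read_unlock();
	return NULL;
}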
Hi Andrea,

On Fri, Dec 29, 2023 at 8:56 AM Andrea Righi <andrea.righi@canonical.com> wrote:
> bpf_cgroup_from_id() is basically a wrapper around cgroup_get_from_id(),
> which relies on kernfs to determine the cgroup associated with the
> target id.
...
> Fixes: 332ea1f697be ("bpf: Add bpf_cgroup_from_id() kfunc")
> Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

Thanks for your patch, which is now commit c312828c37a72fe2
("kernfs: convert kernfs_idr_lock to an irq safe raw spinlock")
in driver-core/driver-core-next.

Unfortunately this interacts badly with commit 4eff7d62abdeb293 ("Revert
"mm/kmemleak: move the initialisation of object to __link_object"") in
v6.7-rc5.

driver-core/driver-core-next is still at v6.7-rc3, so it does not
yet have commit 4eff7d62abdeb293, and thus still triggers:

=============================
[ BUG: Invalid wait context ]
6.7.0-rc3-kzm9g-00052-gc312828c37a7 #576 Not tainted
-----------------------------
swapper/0 is trying to lock:
c0c6e3c4 (&zone->lock){....}-{3:3}, at: __rmqueue_pcplist+0x358/0x3c8
other info that might help us debug this:
context-{5:5}
3 locks held by swapper/0:
 #0: c0bf35a0 (slab_mutex){....}-{4:4}, at: kmem_cache_create_usercopy+0xc8/0x2d0
 #1: c0bfab0c (kmemleak_lock){....}-{2:2}, at: __create_object+0x2c/0x7c
 #2: dfbc8c90 (&pcp->lock){....}-{3:3}, at: get_page_from_freelist+0x1a0/0x684
stack backtrace:
CPU: 0 PID: 0 Comm: swapper Not tainted 6.7.0-rc3-kzm9g-00052-gc312828c37a7 #576
Hardware name: Generic SH73A0 (Flattened Device Tree)
 unwind_backtrace from show_stack+0x10/0x14
 show_stack from dump_stack_lvl+0x68/0x90
 dump_stack_lvl from __lock_acquire+0x3cc/0x168c
 __lock_acquire from lock_acquire+0x274/0x30c
 lock_acquire from _raw_spin_lock_irqsave+0x50/0x64
 _raw_spin_lock_irqsave from __rmqueue_pcplist+0x358/0x3c8
 __rmqueue_pcplist from get_page_from_freelist+0x3bc/0x684
 get_page_from_freelist from __alloc_pages+0xe8/0xad8
 __alloc_pages from __stack_depot_save+0x160/0x398
 __stack_depot_save from set_track_prepare+0x48/0x74
 set_track_prepare from __link_object+0xac/0x204
 __link_object from __create_object+0x48/0x7c
 __create_object from kmemleak_alloc+0x2c/0x38
 kmemleak_alloc from slab_post_alloc_hook.constprop.0+0x9c/0xac
 slab_post_alloc_hook.constprop.0 from kmem_cache_alloc+0xcc/0x148
 kmem_cache_alloc from kmem_cache_create_usercopy+0x1c4/0x2d0
 kmem_cache_create_usercopy from kmem_cache_create+0x1c/0x24
 kmem_cache_create from kmemleak_init+0x58/0xfc
 kmemleak_init from mm_core_init+0x244/0x2c8
 mm_core_init from start_kernel+0x274/0x528
 start_kernel from 0x0

After merging driver-core/driver-core-next into a tree based on
v6.7-rc5, or after cherry-picking commit 4eff7d62abdeb293 into
driver-core/driver-core-next, the above BUG is gone, but a different
one appears:

=============================
[ BUG: Invalid wait context ]
6.7.0-rc5-kzm9g-00251-g655022a45b1c #578 Not tainted
-----------------------------
swapper/0/0 is trying to lock:
dfbcd488 (&c->lock){....}-{3:3}, at: local_lock_acquire+0x0/0xa4
other info that might help us debug this:
context-{5:5}
2 locks held by swapper/0/0:
 #0: dfbc9c60 (lock){+.+.}-{3:3}, at: local_lock_acquire+0x0/0xa4
 #1: c0c012a8 (kernfs_idr_lock){....}-{2:2}, at: __kernfs_new_node.constprop.0+0x68/0x258
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc5-kzm9g-00251-g655022a45b1c #578
Hardware name: Generic SH73A0 (Flattened Device Tree)
 unwind_backtrace from show_stack+0x10/0x14
 show_stack from dump_stack_lvl+0x68/0x90
 dump_stack_lvl from __lock_acquire+0x3cc/0x168c
 __lock_acquire from lock_acquire+0x274/0x30c
 lock_acquire from local_lock_acquire+0x28/0xa4
 local_lock_acquire from ___slab_alloc+0x234/0x8a8
 ___slab_alloc from __slab_alloc.constprop.0+0x30/0x44
 __slab_alloc.constprop.0 from kmem_cache_alloc+0x7c/0x148
 kmem_cache_alloc from radix_tree_node_alloc.constprop.0+0x44/0xdc
 radix_tree_node_alloc.constprop.0 from idr_get_free+0x110/0x2b8
 idr_get_free from idr_alloc_u32+0x9c/0x108
 idr_alloc_u32 from idr_alloc_cyclic+0x50/0xb8
 idr_alloc_cyclic from __kernfs_new_node.constprop.0+0x88/0x258
 __kernfs_new_node.constprop.0 from kernfs_create_root+0xbc/0x154
 kernfs_create_root from sysfs_init+0x18/0x5c
 sysfs_init from mnt_init+0xc4/0x220
 mnt_init from vfs_caches_init+0x6c/0x88
 vfs_caches_init from start_kernel+0x474/0x528
 start_kernel from 0x0

Reverting commit c312828c37a72fe2 fixes that.
I have seen this issue on several Renesas arm32 and arm64 platforms.

Also, I am wondering if the issue fixed by commit c312828c37a72fe2
can still be reproduced on v6.7-rc5 or later?

Thanks!

Gr{oetje,eeting}s,

Geert
On Tue, Jan 09, 2024 at 05:35:36PM +0100, Geert Uytterhoeven wrote:
...
> Reverting commit c312828c37a72fe2 fixes that.
> I have seen this issue on several Renesas arm32 and arm64 platforms.
>
> Also, I am wondering if the issue fixed by commit c312828c37a72fe2
> can still be reproduced on v6.7-rc5 or later?

Yep, I can still reproduce it (this is with v6.7):

[    3.082822] =====================================================
[    3.084543] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
[    3.086252] 6.7.0-virtme #4 Not tainted
[    3.087002] -----------------------------------------------------
[    3.087385] swapper/5/0 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[    3.087768] ffffffff8f9c5378 (kernfs_idr_lock){+.+.}-{2:2}, at: kernfs_find_and_get_node_by_id+0x1d/0x80
[    3.088335]
[    3.088335] and this task is already holding:
[    3.088685] ffff8a83becbf758 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0xda/0xef0
[    3.089128] which would create a new lock dependency:
[    3.089435]  (&rq->__lock){-.-.}-{2:2} -> (kernfs_idr_lock){+.+.}-{2:2}
[    3.089827]
[    3.089827] but this new dependency connects a HARDIRQ-irq-safe lock:
[    3.090296]  (&rq->__lock){-.-.}-{2:2}
[    3.090297]
[    3.090297] ... which became HARDIRQ-irq-safe at:
[    3.090885]   lock_acquire+0xcb/0x2c0
[    3.091108]   _raw_spin_lock_nested+0x2e/0x40
[    3.091374]   scheduler_tick+0x5b/0x3d0
[    3.091607]   update_process_times+0x9c/0xb0
[    3.091867]   tick_periodic+0x27/0xe0
[    3.092089]   tick_handle_periodic+0x24/0x70
[    3.092351]   timer_interrupt+0x18/0x30
[    3.092585]   __handle_irq_event_percpu+0x8b/0x240
[    3.092865]   handle_irq_event+0x38/0x80
[    3.093095]   handle_level_irq+0x90/0x170
[    3.093340]   __common_interrupt+0x4a/0xf0
[    3.093586]   common_interrupt+0x83/0xa0
[    3.093820]   asm_common_interrupt+0x26/0x40
[    3.094080]   _raw_spin_unlock_irqrestore+0x36/0x70
[    3.094381]   __setup_irq+0x441/0x6a0
[    3.094602]   request_threaded_irq+0xe5/0x190
[    3.094862]   hpet_time_init+0x3a/0x60
[    3.095090]   x86_late_time_init+0x1b/0x40
[    3.095344]   start_kernel+0x53a/0x6a0
[    3.095569]   x86_64_start_reservations+0x18/0x30
[    3.095849]   x86_64_start_kernel+0xc5/0xe0
[    3.096097]   secondary_startup_64_no_verify+0x178/0x17b
[    3.096426]
[    3.096426] to a HARDIRQ-irq-unsafe lock:
[    3.096749]  (kernfs_idr_lock){+.+.}-{2:2}
[    3.096751]
[    3.096751] ... which became HARDIRQ-irq-unsafe at:
[    3.097372] ...
[    3.097372]   lock_acquire+0xcb/0x2c0
[    3.097701]   _raw_spin_lock+0x30/0x40
[    3.097925]   __kernfs_new_node.isra.0+0x83/0x280
[    3.098205]   kernfs_create_root+0xf6/0x1d0
[    3.098463]   sysfs_init+0x1b/0x70
[    3.098670]   mnt_init+0xd9/0x2a0
[    3.098872]   vfs_caches_init+0xcf/0xe0
[    3.099105]   start_kernel+0x58a/0x6a0
[    3.099334]   x86_64_start_reservations+0x18/0x30
[    3.099613]   x86_64_start_kernel+0xc5/0xe0
[    3.099862]   secondary_startup_64_no_verify+0x178/0x17b
[    3.100175]
[    3.100175] other info that might help us debug this:
[    3.100175]
[    3.100652]  Possible interrupt unsafe locking scenario:
[    3.100652]
[    3.101049]        CPU0                    CPU1
[    3.101323]        ----                    ----
[    3.101641]   lock(kernfs_idr_lock);
[    3.101909]                                local_irq_disable();
[    3.102473]                                lock(&rq->__lock);
[    3.102854]                                lock(kernfs_idr_lock);
[    3.103171]   <Interrupt>
[    3.103308]     lock(&rq->__lock);
[    3.103492]
[    3.103492]  *** DEADLOCK ***

I'm wondering if using a regular spinlock instead of a raw spinlock
could be a reasonable compromise.

We have a GFP_ATOMIC allocation in __kernfs_new_node():

  raw_spin_lock_irqsave(&kernfs_idr_lock, irqflags);
  ret = idr_alloc_cyclic(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
  ...
  raw_spin_unlock_irqrestore(&kernfs_idr_lock, irqflags);

That should become valid using a
spin_lock_irqsave/spin_unlock_irqrestore(), right?

Thanks,
-Andrea
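For illustration, the non-raw variant being proposed above would look
roughly like this (sketch only, not the applied change; as noted in the
reply below, it remains problematic on PREEMPT_RT, where spinlock_t is a
sleeping lock and cannot be nested inside raw spinlocks such as
rq->__lock):

/* Sketch: irq-safe, but not raw, kernfs_idr_lock.  The GFP_ATOMIC
 * allocation keeps idr_alloc_cyclic() legal with interrupts disabled.
 */
static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */

	/* in __kernfs_new_node() */
	unsigned long irqflags;

	idr_preload(GFP_KERNEL);
	spin_lock_irqsave(&kernfs_idr_lock, irqflags);
	ret = idr_alloc_cyclic(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
	spin_unlock_irqrestore(&kernfs_idr_lock, irqflags);
	idr_preload_end();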
On Tue, Jan 09, 2024 at 06:05:09PM +0100, Andrea Righi wrote:
> On Tue, Jan 09, 2024 at 05:35:36PM +0100, Geert Uytterhoeven wrote:
> > Reverting commit c312828c37a72fe2 fixes that.
> > I have seen this issue on several Renesas arm32 and arm64 platforms.
> >
> > Also, I am wondering if the issue fixed by commit c312828c37a72fe2
> > can still be reproduced on v6.7-rc5 or later?
>
> Yep, I can still reproduce it (this is with v6.7):
...
> I'm wondering if using a regular spinlock instead of a raw spinlock
> could be a reasonable compromise.

I don't think that'd work on RT as we can end up nesting mutex inside a
raw spinlock.

> We have a GFP_ATOMIC allocation in __kernfs_new_node():
>
>   raw_spin_lock_irqsave(&kernfs_idr_lock, irqflags);
>   ret = idr_alloc_cyclic(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
>   ...
>   raw_spin_unlock_irqrestore(&kernfs_idr_lock, irqflags);
>
> That should become valid using a
> spin_lock_irqsave/spin_unlock_irqrestore(), right?

Yeah, this part should be fine. I think the right thing to do here is
making the idr RCU safe so that the lookup path doesn't depend on the
lock.

Greg, can you please revert c312828c37a72fe2 for now?

Thanks.
On Tue, Jan 09, 2024 at 09:07:34AM -1000, Tejun Heo wrote:
...
> Yeah, this part should be fine. I think the right thing to do here is
> making the idr RCU safe so that the lookup path doesn't depend on the
> lock.
>
> Greg, can you please revert c312828c37a72fe2 for now?

I sent out a patchset to revert the commit and implement a different fix.

  http://lkml.kernel.org/r/20240109214828.252092-1-tj@kernel.org

Thanks.
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 8b2bd65d70e7..9ce7d2872b55 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -27,7 +27,7 @@ static DEFINE_RWLOCK(kernfs_rename_lock); /* kn->parent and ->name */
  */
 static DEFINE_SPINLOCK(kernfs_pr_cont_lock);
 static char kernfs_pr_cont_buf[PATH_MAX];	/* protected by pr_cont_lock */
-static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
+static DEFINE_RAW_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
 
 #define rb_to_kn(X) rb_entry((X), struct kernfs_node, rb)
 
@@ -539,6 +539,7 @@ void kernfs_put(struct kernfs_node *kn)
 {
 	struct kernfs_node *parent;
 	struct kernfs_root *root;
+	unsigned long flags;
 
 	if (!kn || !atomic_dec_and_test(&kn->count))
 		return;
@@ -563,9 +564,9 @@ void kernfs_put(struct kernfs_node *kn)
 		simple_xattrs_free(&kn->iattr->xattrs, NULL);
 		kmem_cache_free(kernfs_iattrs_cache, kn->iattr);
 	}
-	spin_lock(&kernfs_idr_lock);
+	raw_spin_lock_irqsave(&kernfs_idr_lock, flags);
 	idr_remove(&root->ino_idr, (u32)kernfs_ino(kn));
-	spin_unlock(&kernfs_idr_lock);
+	raw_spin_unlock_irqrestore(&kernfs_idr_lock, flags);
 	kmem_cache_free(kernfs_node_cache, kn);
 
 	kn = parent;
@@ -607,6 +608,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	struct kernfs_node *kn;
 	u32 id_highbits;
 	int ret;
+	unsigned long irqflags;
 
 	name = kstrdup_const(name, GFP_KERNEL);
 	if (!name)
@@ -617,13 +619,13 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 		goto err_out1;
 
 	idr_preload(GFP_KERNEL);
-	spin_lock(&kernfs_idr_lock);
+	raw_spin_lock_irqsave(&kernfs_idr_lock, irqflags);
 	ret = idr_alloc_cyclic(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
 	if (ret >= 0 && ret < root->last_id_lowbits)
 		root->id_highbits++;
 	id_highbits = root->id_highbits;
 	root->last_id_lowbits = ret;
-	spin_unlock(&kernfs_idr_lock);
+	raw_spin_unlock_irqrestore(&kernfs_idr_lock, irqflags);
 	idr_preload_end();
 	if (ret < 0)
 		goto err_out2;
@@ -659,9 +661,9 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	return kn;
 
  err_out3:
-	spin_lock(&kernfs_idr_lock);
+	raw_spin_lock_irqsave(&kernfs_idr_lock, irqflags);
 	idr_remove(&root->ino_idr, (u32)kernfs_ino(kn));
-	spin_unlock(&kernfs_idr_lock);
+	raw_spin_unlock_irqrestore(&kernfs_idr_lock, irqflags);
 err_out2:
 	kmem_cache_free(kernfs_node_cache, kn);
 err_out1:
@@ -702,8 +704,9 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 	struct kernfs_node *kn;
 	ino_t ino = kernfs_id_ino(id);
 	u32 gen = kernfs_id_gen(id);
+	unsigned long flags;
 
-	spin_lock(&kernfs_idr_lock);
+	raw_spin_lock_irqsave(&kernfs_idr_lock, flags);
 
 	kn = idr_find(&root->ino_idr, (u32)ino);
 	if (!kn)
@@ -727,10 +730,10 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 	if (unlikely(!__kernfs_active(kn) || !atomic_inc_not_zero(&kn->count)))
 		goto err_unlock;
 
-	spin_unlock(&kernfs_idr_lock);
+	raw_spin_unlock_irqrestore(&kernfs_idr_lock, flags);
 	return kn;
 err_unlock:
-	spin_unlock(&kernfs_idr_lock);
+	raw_spin_unlock_irqrestore(&kernfs_idr_lock, flags);
 	return NULL;
 }