Message ID: <20230511151702.14704-1-tycho@tycho.pizza>
State: New
Headers:
From: Tycho Andersen <tycho@tycho.pizza>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org, Tycho Andersen <tycho@tycho.pizza>, Tycho Andersen <tandersen@netflix.com>, "Eric W. Biederman" <ebiederm@xmission.com>
Subject: [PATCH] xfs: don't do inodegc work if task is exiting
Date: Thu, 11 May 2023 09:17:02 -0600
Message-Id: <20230511151702.14704-1-tycho@tycho.pizza>
X-Mailer: git-send-email 2.34.1
Series: xfs: don't do inodegc work if task is exiting
Commit Message
Tycho Andersen
May 11, 2023, 3:17 p.m. UTC
From: Tycho Andersen <tandersen@netflix.com>

Similar to 5a8bee63b10f ("fuse: in fuse_flush only wait if someone wants
the return code"), we have a task that is stuck that can't be killed, so a
pid ns can't exit, wreaking all kinds of havoc:

INFO: task C2 CompilerThre:3546103 blocked for more than 1912 seconds.
      Tainted: G           OE 5.15.35netflix-g54efd87a8576 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:C2 CompilerThre state:D stack:    0 pid:3546103 ppid:3546047 flags:0x00004222
Call Trace:
 <TASK>
 __schedule+0x2c5/0x8d0
 schedule+0x3a/0xa0
 schedule_timeout+0x115/0x280
 ? __schedule+0x2cd/0x8d0
 wait_for_completion+0x9f/0x100
 __flush_work+0x161/0x1f0
 ? worker_detach_from_pool+0xb0/0xb0
 destroy_inode+0x3b/0x70
 __dentry_kill+0xcc/0x160
 dput+0x141/0x2e0
 ovl_destroy_inode+0x15/0x50 [overlay]
 destroy_inode+0x3b/0x70
 __dentry_kill+0xcc/0x160
 dput+0x141/0x2e0
 __fput+0xe1/0x250
 task_work_run+0x73/0xb0
 do_exit+0x37e/0xb80
 do_group_exit+0x3a/0xa0
 get_signal+0x140/0x870
 ? perf_event_groups_first+0x80/0x80
 arch_do_signal_or_restart+0xae/0x7c0
 ? __x64_sys_futex+0x5e/0x1d0
 ? __x64_sys_futex+0x5e/0x1d0
 exit_to_user_mode_prepare+0x10f/0x1c0
 syscall_exit_to_user_mode+0x26/0x40
 do_syscall_64+0x46/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f3295cf3cf5
RSP: 002b:00007f327c834d00 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00007f32900bde50 RCX: 00007f3295cf3cf5
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007f32900bde78
RBP: 00007f327c834dd0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f32900bde74
R13: 00007f32900bde78 R14: 00007f32900bde28 R15: 0000000000000000
 </TASK>

The bad call path is:

xfs_fs_destroy_inode() ->
  xfs_inode_mark_reclaimable() ->
    xfs_inodegc_queue() ->
      xfs_inodegc_want_queue_work()
      xfs_inodegc_want_flush_work() ->
        flush_work() ->
          __flush_work() ->
            wait_for_completion()

We can avoid this task getting stuck by just not queuing the gc work from
do_exit().

The fact that there's a lockup at all is probably indicative of another xfs
bug somewhere else that I am still looking for, but we should at least not
generate unkillable tasks as a result.

Signed-off-by: Tycho Andersen <tandersen@netflix.com>
CC: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/xfs/xfs_icache.c | 6 ++++++
 1 file changed, 6 insertions(+)

base-commit: 78b421b6a7c6dbb6a213877c742af52330f5026d
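The fix amounts to an extra early-return guard in `xfs_inodegc_want_flush_work()`. A minimal user-space sketch of that decision logic follows; `struct task`, `want_flush_work()`, and the `PF_EXITING` value here are illustrative stand-ins, not the kernel definitions, and the remaining heuristics of the real function are elided.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for the kernel's PF_EXITING task flag. */
#define PF_EXITING 0x00000004u

/* Minimal stand-in for the task_struct fields this check reads. */
struct task {
    unsigned int flags;
    void *journal_info; /* non-NULL while the task runs a transaction */
};

/* Sketch of the patched decision: never block on an inodegc flush
 * from transaction context, and (with this patch) never block from
 * a task that is already exiting. */
static bool want_flush_work(const struct task *t)
{
    if (t->journal_info)
        return false;  /* in a transaction: waiting could deadlock */

    if (t->flags & PF_EXITING)
        return false;  /* exiting: blocking here makes the task unkillable */

    /* further heuristics (shrinker pressure, etc.) elided */
    return true;
}
```

An exiting task thus skips the `flush_work()` path entirely instead of parking in an uninterruptible wait inside `do_exit()`.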
Comments
On Thu, May 11, 2023 at 09:17:02AM -0600, Tycho Andersen wrote:
> From: Tycho Andersen <tandersen@netflix.com>
>
> Similar to 5a8bee63b10f ("fuse: in fuse_flush only wait if someone wants
> the return code"), we have a task that is stuck that can't be killed, so a
> pid ns can't exit, wreaking all kinds of havoc:
>
> [call trace and register dump snipped; see the patch description above]
>
> The bad call path is:
>
> xfs_fs_destroy_inode() ->
>   xfs_inode_mark_reclaimable() ->
>     xfs_inodegc_queue() ->
>       xfs_inodegc_want_queue_work()
>       xfs_inodegc_want_flush_work() ->
>         flush_work() ->
>           __flush_work() ->
>             wait_for_completion()
>
> We can avoid this task getting stuck by just not queuing the gc work from
> do_exit().
>
> The fact that there's a lockup at all is probably indicative of another xfs
> bug somewhere else that I am still looking for, but we should at least not
> generate unkillable tasks as a result.

Yeah, we just added a couple of fix patches to 6.4 to deal with inodegc
hangs:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/xfs?id=03e0add80f4cf3f7393edb574eeb3a89a1db7758
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/xfs?id=2254a7396a0ca6309854948ee1c0a33fa4268cec

If you've got a spare machine and a reproducer, could you try applying
those two to see if the problem goes away?

If you have online fsck enabled (I hope not in a 5.15 kernel) then turn it
off or apply:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/xfs?id=2d5f38a31980d7090f5bf91021488dc61a0ba8ee

> Signed-off-by: Tycho Andersen <tandersen@netflix.com>
> CC: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/xfs/xfs_icache.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 351849fc18ff..90e94d84f8ad 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -2011,6 +2011,9 @@ xfs_inodegc_want_queue_work(
>   *
>   * Note: If the current thread is running a transaction, we don't ever want to
>   * wait for other transactions because that could introduce a deadlock.
> + * If the task is currently exiting there is nobody to wait for
> + * the flush and it can deadlock, so we should not try to flush in this case

What do you mean by "there is nobody to wait for"?  The process state
still exists in the kernel, so the completion should wake up the exiting
process, right?

--D

> + * either.
>   */
>  static inline bool
>  xfs_inodegc_want_flush_work(
> @@ -2021,6 +2024,9 @@ xfs_inodegc_want_flush_work(
>  	if (current->journal_info)
>  		return false;
>
> +	if (current->flags & PF_EXITING)
> +		return false;
> +
>  	if (shrinker_hits > 0)
>  		return true;
>
> base-commit: 78b421b6a7c6dbb6a213877c742af52330f5026d
> --
> 2.34.1
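Darrick's question turns on what `__flush_work()` actually waits on: it is a completion rendezvous, so the wakeup does arrive once the work item finishes; the hang occurs when the work item itself never completes. The shape of that rendezvous can be sketched in user space with pthreads (illustrative names like `run_flush_demo`; this is not the kernel implementation, and note the wait loop checks only the `done` flag, mirroring how `wait_for_completion()` does not respond to signals):

```c
#include <pthread.h>
#include <stdbool.h>

/* Minimal completion object: a flag guarded by a mutex/condvar pair. */
struct completion {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            done;
};

static void complete(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = true;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)              /* no signal check: "uninterruptible" */
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

static int work_result;

/* The queued work item; in the kernel this would be inodegc. */
static void *worker(void *arg)
{
    struct completion *c = arg;
    work_result = 42;             /* do the deferred work */
    complete(c);
    return NULL;
}

/* Flush-style caller: blocks until the worker has run, like __flush_work(). */
int run_flush_demo(void)
{
    struct completion c = { PTHREAD_MUTEX_INITIALIZER,
                            PTHREAD_COND_INITIALIZER, false };
    pthread_t t;
    pthread_create(&t, NULL, worker, &c);
    wait_for_completion(&c);
    pthread_join(t, NULL);
    return work_result;
}
```

If `worker()` were itself blocked forever (as the inodegc worker is in the reported traces), `run_flush_demo()` would never return, and nothing short of tearing down the process could dislodge it.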
On Thu, May 11, 2023 at 09:17:02AM -0600, Tycho Andersen wrote:
> From: Tycho Andersen <tandersen@netflix.com>
>
> Similar to 5a8bee63b10f ("fuse: in fuse_flush only wait if someone wants
> the return code"), we have a task that is stuck that can't be killed, so a
> pid ns can't exit, wreaking all kinds of havoc:
>
> INFO: task C2 CompilerThre:3546103 blocked for more than 1912 seconds.
> [...]
> wait_for_completion+0x9f/0x100
> __flush_work+0x161/0x1f0
> ? worker_detach_from_pool+0xb0/0xb0
> destroy_inode+0x3b/0x70

It is likely that something has gone wrong in background inodegc if we
block here for that length of time.

> [rest of the trace, call path, and commit message snipped]
>
> +	if (current->flags & PF_EXITING)
> +		return false;

Yeah, this is papering over the observed symptom, not addressing the root
cause of the inodegc flush delay.

What do you see when you run sysrq-w and sysrq-l? Are there inodegc worker
threads blocked performing inodegc? e.g. inodegc flushes could simply be
delayed by an unlinked inode being processed that has millions of extents
that need to be freed.

In reality, inode reclaim can block for long periods of time on any
filesystem, so the concept of "inode reclaim should not block when
PF_EXITING" is not a behaviour that we guarantee anywhere or could
guarantee across the board.

Let's get to the bottom of why inodegc has apparently stalled before
trying to work out how to fix it...

Cheers,

Dave.
On Thu, May 11, 2023 at 05:07:24PM -0700, Darrick J. Wong wrote:
> On Thu, May 11, 2023 at 09:17:02AM -0600, Tycho Andersen wrote:
> > [quoted patch description, call trace, and register dump snipped]
> >
> > The fact that there's a lockup at all is probably indicative of another xfs
> > bug somewhere else that I am still looking for, but we should at least not
> > generate unkillable tasks as a result.
>
> Yeah, we just added a couple of fix patches to 6.4 to deal with inodegc
> hangs:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/xfs?id=03e0add80f4cf3f7393edb574eeb3a89a1db7758
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/xfs?id=2254a7396a0ca6309854948ee1c0a33fa4268cec
>
> If you've got a spare machine and a reproducer, could you try applying
> those two to see if the problem goes away?

Thanks for these. I haven't figured out how to reproduce it on demand, but
I also haven't spent a bunch of time. Maybe we can just apply these to our
kernel and see. I put some stack traces below from around the time when the
lockup happens; maybe those will ring a bell?
> If you have online fsck enabled (I hope not in a 5.15 kernel) then
> turn it off or apply:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/xfs?id=2d5f38a31980d7090f5bf91021488dc61a0ba8ee

Yes, we do not have it enabled.

> > Signed-off-by: Tycho Andersen <tandersen@netflix.com>
> > CC: "Eric W. Biederman" <ebiederm@xmission.com>
> > ---
> > [quoted diff snipped]
> > + * If the task is currently exiting there is nobody to wait for
> > + * the flush and it can deadlock, so we should not try to flush in this case
>
> What do you mean by "there is nobody to wait for"?  The process state
> still exists in the kernel, so the completion should wake up the exiting
> process, right?

Yeah, sloppy language there. The problem is really that we can't kill this
thing when it's locked up like this, since signals have been disabled. To
get rid of these things, we have to evac a whole box, which is a pretty big
hammer. Everything not stuck in do_exit() is "mostly fine", since it will
still respond well to kill in most cases.

Tycho

INFO: task java:3546393 blocked for more than 1912 seconds.
      Tainted: G           OE 5.15.35netflix-g54efd87a8576 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:skyframe-evalua state:D stack:    0 pid:3546393 ppid:3532734 flags:0x00000220
Call Trace:
 <TASK>
 __schedule+0x2c5/0x8d0
 schedule+0x3a/0xa0
 schedule_timeout+0x115/0x280
 ? xfs_buf_read_map+0x52/0x2e0 [xfs]
 ? xfs_da_read_buf+0xcf/0x120 [xfs]
 __down+0x90/0xe0
 ? down+0x43/0x60
 down+0x43/0x60
 xfs_buf_lock+0x29/0xa0 [xfs]
 xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
 xfs_buf_get_map+0x4c/0x400 [xfs]
 xfs_buf_read_map+0x52/0x2e0 [xfs]
 ? xfs_read_agi+0x8c/0x120 [xfs]
 xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
 ? xfs_read_agi+0x8c/0x120 [xfs]
 xfs_read_agi+0x8c/0x120 [xfs]
 xfs_ialloc_read_agi+0x2a/0x90 [xfs]
 xfs_dialloc+0x1ab/0x680 [xfs]
 ? xfs_trans_reserve_quota_icreate+0x32/0x40 [xfs]
 ? xfs_trans_alloc_icreate+0x75/0xf0 [xfs]
 xfs_create+0x39e/0x5b0 [xfs]
 xfs_generic_create+0x129/0x380 [xfs]
 ? generic_permission+0x27/0x200
 vfs_create+0x12b/0x1c0
 ovl_create_real+0xd7/0x220 [overlay]
 ovl_create_or_link+0x166/0x670 [overlay]
 ? bpf_lsm_path_notify+0x10/0x10
 ? security_inode_alloc+0x4d/0x90
 ? inode_init_always+0x1bc/0x230
 ? alloc_inode+0x31/0xa0
 ? new_inode_pseudo+0x23/0x60
 ovl_create_object+0xe6/0x120 [overlay]
 path_openat+0xe79/0x1050
 do_filp_open+0xc5/0x140
 ? process_backlog+0xd0/0x160
 ? __check_object_size+0xd1/0x1a0
 ? do_sys_openat2+0x23a/0x310
 do_sys_openat2+0x23a/0x310
 do_sys_open+0x57/0x80
 do_syscall_64+0x3a/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f077c4e7274
RSP: 002b:00007f05bc5c2290 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f077c4e7274
RDX: 0000000000000241 RSI: 00007f061400e6b0 RDI: 00000000ffffff9c
RBP: 00007f05bc5c22f0 R08: 0000000000000000 R09: 00000006322fa308
R10: 00000000000001b6 R11: 0000000000000293 R12: 0000000000000000
R13: 00007f061400e6b0 R14: 0000000000000241 R15: 00007f0540002340
 </TASK>

INFO: task git:3543232 blocked for more than 1912 seconds.
      Tainted: G           OE 5.15.35netflix-g54efd87a8576 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:git state:D stack:    0 pid:3543232 ppid:3543231 flags:0x00000224
Call Trace:
 <TASK>
 __schedule+0x2c5/0x8d0
 schedule+0x3a/0xa0
 schedule_timeout+0x115/0x280
 ? xfs_buf_get_map+0x2cf/0x400 [xfs]
 ? obj_cgroup_charge_pages+0xc3/0x100
 __down+0x90/0xe0
 ? down+0x43/0x60
 down+0x43/0x60
 xfs_buf_lock+0x29/0xa0 [xfs]
 xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
 xfs_buf_get_map+0x4c/0x400 [xfs]
 ? xfs_vn_lookup+0x87/0xa0 [xfs]
 xfs_buf_read_map+0x52/0x2e0 [xfs]
 ? xfs_read_agi+0x8c/0x120 [xfs]
 xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
 ? xfs_read_agi+0x8c/0x120 [xfs]
 xfs_read_agi+0x8c/0x120 [xfs]
 xfs_ialloc_read_agi+0x2a/0x90 [xfs]
 xfs_dialloc+0x1ab/0x680 [xfs]
 ? xfs_trans_reserve_quota_icreate+0x32/0x40 [xfs]
 ? xfs_trans_alloc_icreate+0x75/0xf0 [xfs]
 xfs_create+0x39e/0x5b0 [xfs]
 xfs_generic_create+0x129/0x380 [xfs]
 ? generic_permission+0x27/0x200
 vfs_mkdir+0x123/0x1d0
 ovl_mkdir_real+0x28/0xa0 [overlay]
 ovl_create_real+0x143/0x220 [overlay]
 ovl_create_or_link+0x166/0x670 [overlay]
 ? bpf_lsm_path_notify+0x10/0x10
 ? security_inode_alloc+0x4d/0x90
 ? inode_init_always+0x1bc/0x230
 ? alloc_inode+0x31/0xa0
 ? new_inode_pseudo+0x23/0x60
 ovl_create_object+0xe6/0x120 [overlay]
 vfs_mkdir+0x123/0x1d0
 do_mkdirat+0xfa/0x120
 __x64_sys_mkdir+0x26/0x30
 do_syscall_64+0x3a/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f99dc46bf27
RSP: 002b:00007ffca63aa6a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 00007ffca63aa8c0 RCX: 00007f99dc46bf27
RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00005648b2aa2b20
RBP: 00005648b298ba40 R08: 00007ffca63aa5ac R09: 0000000000000000
R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000018
R13: 000000000000000c R14: 00005648b2aa2b1f R15: 00005648b2af1af0
 </TASK>

INFO: task C2 CompilerThre:3546103 blocked for more than 1912 seconds.
[identical to the call trace quoted in the patch description above]

INFO: task java:3546788 blocked for more than 1912 seconds.
      Tainted: G           OE 5.15.35netflix-g54efd87a8576 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:java state:D stack:    0 pid:3546788 ppid:3546785 flags:0x00000220
Call Trace:
 <TASK>
 __schedule+0x2c5/0x8d0
 schedule+0x3a/0xa0
 schedule_timeout+0x115/0x280
 __down+0x90/0xe0
 ? down+0x43/0x60
 down+0x43/0x60
 xfs_buf_lock+0x29/0xa0 [xfs]
 xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
 xfs_buf_get_map+0x4c/0x400 [xfs]
 xfs_buf_read_map+0x52/0x2e0 [xfs]
 ? xfs_read_agi+0x8c/0x120 [xfs]
 xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
 ? xfs_read_agi+0x8c/0x120 [xfs]
 xfs_read_agi+0x8c/0x120 [xfs]
 xfs_iunlink+0x62/0x1b0 [xfs]
 xfs_remove+0x1f5/0x300 [xfs]
 xfs_vn_unlink+0x55/0xa0 [xfs]
 vfs_unlink+0x11c/0x2a0
 ovl_do_remove+0x3d5/0x4c0 [overlay]
 vfs_unlink+0x11c/0x2a0
 do_unlinkat+0x2ac/0x300
 __x64_sys_unlink+0x1f/0x30
 do_syscall_64+0x3a/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f45d14ad1a7
RSP: 002b:00007f45d1fe3d08 EFLAGS: 00000202 ORIG_RAX: 0000000000000057
RAX: ffffffffffffffda RBX: 00007f45c800cbe0 RCX: 00007f45d14ad1a7
RDX: 00007f45c8019e70 RSI: 00007f45c8000030 RDI: 00007f45c800cbe0
RBP: 00007f45d1fe3d30 R08: 0000000000000c70 R09: 0000000000000000
R10: 0000000000000008 R11: 0000000000000202 R12: 00007f45c8009d80
R13: 00007f45d135c6e0 R14: 00007f45c8009880 R15: 00007f45c800e7c0
 </TASK>

INFO: task kworker/66:1:3548191 blocked for more than 1912 seconds.
      Tainted: G           OE 5.15.35netflix-g54efd87a8576 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/66:1 state:D stack:    0 pid:3548191 ppid:     2 flags:0x00004000
Workqueue: xfs-inodegc/nvme1n1 xfs_inodegc_worker [xfs]
Call Trace:
 <TASK>
 __schedule+0x2c5/0x8d0
 schedule+0x3a/0xa0
 schedule_timeout+0x115/0x280
 __down+0x90/0xe0
 ? down+0x43/0x60
 down+0x43/0x60
 xfs_buf_lock+0x29/0xa0 [xfs]
 xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
 xfs_buf_get_map+0x4c/0x400 [xfs]
 xfs_buf_read_map+0x52/0x2e0 [xfs]
 ? xfs_read_agi+0x8c/0x120 [xfs]
 xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
 ? xfs_read_agi+0x8c/0x120 [xfs]
 xfs_read_agi+0x8c/0x120 [xfs]
 xfs_iunlink_remove+0x5a/0x220 [xfs]
 xfs_ifree+0x7a/0x4c0 [xfs]
 ? xfs_trans_alloc+0xec/0x1e0 [xfs]
 xfs_inactive_ifree+0xa1/0x1a0 [xfs]
 xfs_inactive+0xf1/0x170 [xfs]
 xfs_inodegc_worker+0x76/0x100 [xfs]
 process_one_work+0x200/0x3d0
 worker_thread+0x2d/0x3e0
 ? process_one_work+0x3d0/0x3d0
 kthread+0x11a/0x140
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
 </TASK>

INFO: task python3.7:3548278 blocked for more than 1912 seconds.
      Tainted: G           OE 5.15.35netflix-g54efd87a8576 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:python3.7 state:D stack:    0 pid:3548278 ppid:3548276 flags:0x00000224
Call Trace:
 <TASK>
 __schedule+0x2c5/0x8d0
 schedule+0x3a/0xa0
 rwsem_down_write_slowpath+0x237/0x550
 ? do_unlinkat+0x125/0x300
 do_unlinkat+0x125/0x300
 __x64_sys_unlink+0x1f/0x30
 do_syscall_64+0x3a/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f06794fa1a7
RSP: 002b:00007ffca979e3b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
RAX: ffffffffffffffda RBX: 00000000018f1380 RCX: 00007f06794fa1a7
RDX: 00007f067b8f7bc0 RSI: 00007f067818fb10 RDI: 00007f067818fb10
RBP: 00000000ffffff9c R08: 0000000000000000 R09: 00007f067b411fc3
R10: 00007f067b35fd50 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000001db7678 R14: 0000000000000082 R15: 0000000000000000
 </TASK>

INFO: task python3.7:3548500 blocked for more than 1912 seconds.
      Tainted: G           OE 5.15.35netflix-g54efd87a8576 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:python3.7 state:D stack:    0 pid:3548500 ppid:3548349 flags:0x00000224
Call Trace:
 <TASK>
 __schedule+0x2c5/0x8d0
 schedule+0x3a/0xa0
 schedule_timeout+0x115/0x280
 __down+0x90/0xe0
 ? down+0x43/0x60
 down+0x43/0x60
 xfs_buf_lock+0x29/0xa0 [xfs]
 xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
 xfs_buf_get_map+0x4c/0x400 [xfs]
 xfs_buf_read_map+0x52/0x2e0 [xfs]
 ? xfs_read_agi+0x8c/0x120 [xfs]
 xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
 ? xfs_read_agi+0x8c/0x120 [xfs]
 xfs_read_agi+0x8c/0x120 [xfs]
 xfs_ialloc_read_agi+0x2a/0x90 [xfs]
 xfs_dialloc+0x1ab/0x680 [xfs]
 ? xfs_trans_reserve_quota_icreate+0x32/0x40 [xfs]
 ? xfs_trans_alloc_icreate+0x75/0xf0 [xfs]
 xfs_create+0x39e/0x5b0 [xfs]
 xfs_generic_create+0x129/0x380 [xfs]
 ? generic_permission+0x184/0x200
 vfs_mkdir+0x123/0x1d0
 ovl_mkdir_real+0x28/0xa0 [overlay]
 ovl_create_real+0x143/0x220 [overlay]
 ovl_create_or_link+0x166/0x670 [overlay]
 ? bpf_lsm_path_notify+0x10/0x10
 ? security_inode_alloc+0x4d/0x90
 ? inode_init_always+0x1bc/0x230
 ? alloc_inode+0x31/0xa0
 ? new_inode_pseudo+0x23/0x60
 ovl_create_object+0xe6/0x120 [overlay]
 vfs_mkdir+0x123/0x1d0
 do_mkdirat+0xfa/0x120
 __x64_sys_mkdir+0x26/0x30
 do_syscall_64+0x3a/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f375aad1f27
RSP: 002b:00007ffc07169838 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 00000000ffffff9c RCX: 00007f375aad1f27
RDX: 00007f375ced1bc0 RSI: 00000000000001c0 RDI: 00007f3759584750
RBP: 00000000000001c0 R08: 0000000000000000 R09: 00007f375c9ebebc
R10: 00007ffc071694f0 R11: 0000000000000246 R12: 0000000000000000
R13: 00000000011e8160 R14: 0000000000acb0a0 R15: 0000000000000000
 </TASK>

INFO: task exe:3548515 blocked for more than 1912 seconds.
      Tainted: G           OE 5.15.35netflix-g54efd87a8576 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:exe state:D stack:    0 pid:3548515 ppid: 23215 flags:0x00000000
Call Trace:
 <TASK>
 __schedule+0x2c5/0x8d0
 schedule+0x3a/0xa0
 schedule_timeout+0x115/0x280
 ? ___perf_sw_event+0x11f/0x150
 __down+0x90/0xe0
 ? down+0x43/0x60
 down+0x43/0x60
 xfs_buf_lock+0x29/0xa0 [xfs]
 xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
 xfs_buf_get_map+0x4c/0x400 [xfs]
 xfs_buf_read_map+0x52/0x2e0 [xfs]
 ? xfs_read_agi+0x8c/0x120 [xfs]
 xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
 ? xfs_read_agi+0x8c/0x120 [xfs]
 xfs_read_agi+0x8c/0x120 [xfs]
 xfs_ialloc_read_agi+0x2a/0x90 [xfs]
 xfs_dialloc+0x1ab/0x680 [xfs]
 ? xfs_trans_reserve_quota_icreate+0x32/0x40 [xfs]
 ? xfs_trans_alloc_icreate+0x75/0xf0 [xfs]
 xfs_create+0x39e/0x5b0 [xfs]
 xfs_generic_create+0x129/0x380 [xfs]
 ? generic_permission+0xf3/0x200
 vfs_mkdir+0x123/0x1d0
 ovl_mkdir_real+0x28/0xa0 [overlay]
 ovl_create_real+0x143/0x220 [overlay]
 ovl_create_or_link+0x166/0x670 [overlay]
 ? bpf_lsm_path_notify+0x10/0x10
 ? security_inode_alloc+0x4d/0x90
 ? inode_init_always+0x1bc/0x230
 ? alloc_inode+0x31/0xa0
 ? new_inode_pseudo+0x23/0x60
 ovl_create_object+0xe6/0x120 [overlay]
 vfs_mkdir+0x123/0x1d0
 do_mkdirat+0xfa/0x120
 __x64_sys_mkdirat+0x28/0x30
 do_syscall_64+0x3a/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x560269a7fdfb
RSP: 002b:000000c00079fc08 EFLAGS: 00000216 ORIG_RAX: 0000000000000102
RAX: ffffffffffffffda RBX: 000000c000081800 RCX: 0000560269a7fdfb
RDX: 00000000000001c0 RSI: 000000c000ab2480 RDI: ffffffffffffff9c
RBP: 000000c00079fc68 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000216 R12: 0000000000000000
R13: 0000000000000001 R14: 000000000000000a R15: ffffffffffffffff
 </TASK>

INFO: task bazelenv:3548683 blocked for more than 1912 seconds.
      Tainted: G           OE 5.15.35netflix-g54efd87a8576 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bazelenv state:D stack:    0 pid:3548683 ppid:3435211 flags:0x00000220
Call Trace:
 <TASK>
 __schedule+0x2c5/0x8d0
 schedule+0x3a/0xa0
 schedule_timeout+0x115/0x280
 __down+0x90/0xe0
 ? down+0x43/0x60
 down+0x43/0x60
 xfs_buf_lock+0x29/0xa0 [xfs]
 xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
 xfs_buf_get_map+0x4c/0x400 [xfs]
 xfs_buf_read_map+0x52/0x2e0 [xfs]
 ? xfs_read_agf+0x84/0x100 [xfs]
 xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
 ? xfs_read_agf+0x84/0x100 [xfs]
 xfs_read_agf+0x84/0x100 [xfs]
 xfs_alloc_read_agf+0x3a/0x1b0 [xfs]
 xfs_alloc_fix_freelist+0x434/0x500 [xfs]
 ? __kmalloc+0x3c0/0x400
 ? xfs_dquot_to_disk+0xfe/0x110 [xfs]
 ? xfs_qm_dquot_logitem_format+0x89/0x150 [xfs]
 xfs_free_extent_fix_freelist+0x61/0xa0 [xfs]
 __xfs_free_extent+0x6a/0x1c0 [xfs]
 xfs_trans_free_extent+0x3b/0xd0 [xfs]
 xfs_extent_free_finish_item+0x23/0x40 [xfs]
 xfs_defer_finish_noroll+0x24b/0x5a0 [xfs]
 xfs_defer_finish+0x13/0x80 [xfs]
 xfs_itruncate_extents_flags+0x13b/0x220 [xfs]
 xfs_setattr_size+0x35c/0x3d0 [xfs]
 ? aa_compute_fperms+0x16f/0x190
 xfs_vn_setattr+0xf7/0x110 [xfs]
 notify_change+0x355/0x4a0
 ? tomoyo_init_request_info+0x7c/0x90
 ?
ovl_setattr+0xd1/0x1b0 [overlay] ovl_setattr+0xd1/0x1b0 [overlay] notify_change+0x355/0x4a0 ? do_truncate+0x7c/0xd0 do_truncate+0x7c/0xd0 path_openat+0xb94/0x1050 ? __alloc_pages+0x17f/0x320 do_filp_open+0xc5/0x140 ? __check_object_size+0xd1/0x1a0 ? do_sys_openat2+0x23a/0x310 do_sys_openat2+0x23a/0x310 do_sys_open+0x57/0x80 do_syscall_64+0x3a/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f94a252f000 RSP: 002b:00007ffeaa4d0e50 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 RAX: ffffffffffffffda RBX: 000055ecddc90420 RCX: 00007f94a252f000 RDX: 0000000000000241 RSI: 000055ecddc8db60 RDI: 00000000ffffff9c RBP: 00007ffeaa4d0eb0 R08: 0000000000000020 R09: 0000000000000000 R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000003 R14: 0000000000000001 R15: 000055ecddc8db60 </TASK> INFO: task journalctl:3549038 blocked for more than 1912 seconds. Tainted: G OE 5.15.35netflix-g54efd87a8576 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:journalctl state:D stack: 0 pid:3549038 ppid:469330 flags:0x00000224 Call Trace: <TASK> __schedule+0x2c5/0x8d0 schedule+0x3a/0xa0 schedule_timeout+0x115/0x280 ? __traceiter_sched_wakeup+0x29/0x40 ? ttwu_do_wakeup+0x106/0x170 wait_for_completion+0x9f/0x100 __flush_work+0x161/0x1f0 ? 
worker_detach_from_pool+0xb0/0xb0 xfs_inodegc_flush.part.25+0x4f/0xa0 [xfs] xfs_fs_statfs+0x35/0x1e0 [xfs] statfs_by_dentry+0x67/0x90 vfs_statfs+0x16/0xd0 ovl_statfs+0x40/0x70 [overlay] statfs_by_dentry+0x67/0x90 vfs_statfs+0x16/0xd0 fd_statfs+0x2d/0x60 __do_sys_fstatfs+0x20/0x50 do_syscall_64+0x3a/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f81de0c0cf7 RSP: 002b:00007ffc758c49c8 EFLAGS: 00000246 ORIG_RAX: 000000000000008a RAX: ffffffffffffffda RBX: 000055b2a7d70d20 RCX: 00007f81de0c0cf7 RDX: 000000000000000d RSI: 00007ffc758c49d0 RDI: 0000000000000005 RBP: 0000000000000040 R08: 00007f81ddcc8603 R09: 0000000000000018 R10: 00007f81ddf23760 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000001 R14: 00000000ffffffff R15: 0000000000000000 </TASK>
On Fri, May 12, 2023 at 11:45:47AM +1000, Dave Chinner wrote:
>
> Yeah, this is papering over the observed symptom, not addressing the
> root cause of the inodegc flush delay. What do you see when you run
> sysrq-w and sysrq-l? Are there inodegc worker threads blocked
> performing inodegc?

I will try this next time we encounter this.

> e.g. inodegc flushes could simply be delayed by an unlinked inode
> being processed that has millions of extents that need to be freed.
>
> In reality, inode reclaim can block for long periods of time
> on any filesystem, so the concept of "inode reclaim should
> not block when PF_EXITING" is not a behaviour that we guarantee
> anywhere or could guarantee across the board.
>
> Let's get to the bottom of why inodegc has apparently stalled before
> trying to work out how to fix it...

I'm happy to try, but I think it is also worth applying this patch.
Like I said in the other thread, having to evac a box to get rid of
an unkillable userspace process is annoying.

Thanks for the debugging tips.

Tycho
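For anyone following along: the sysrq-w and sysrq-l dumps Dave asks for above can be collected from a root shell via the standard /proc sysrq interface; the helper function name below is only for this sketch.

```shell
# Sketch: collect sysrq-w (blocked tasks) and sysrq-l (CPU backtraces)
# into the kernel log. Requires root; /proc/sys/kernel/sysrq and
# /proc/sysrq-trigger are the standard sysrq control files.
capture_sysrq_dumps() {
    echo 1 > /proc/sys/kernel/sysrq    # enable all sysrq functions
    echo w > /proc/sysrq-trigger       # dump tasks in uninterruptible (D) state
    echo l > /proc/sysrq-trigger       # dump backtraces of all active CPUs
    dmesg | tail -n 500                # the dumps land in the kernel ring buffer
}

if [ "$(id -u)" -eq 0 ]; then
    capture_sysrq_dumps
else
    echo "re-run as root to trigger sysrq" >&2
fi
```

The dumps can also be saved directly from the serial console or `journalctl -k` if dmesg has wrapped.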
On Fri, May 12, 2023 at 09:16:36AM -0600, Tycho Andersen wrote:
> On Fri, May 12, 2023 at 11:45:47AM +1000, Dave Chinner wrote:
> >
> > Yeah, this is papering over the observed symptom, not addressing the
> > root cause of the inodegc flush delay. What do you see when you run
> > sysrq-w and sysrq-l? Are there inodegc worker threads blocked
> > performing inodegc?
>
> I will try this next time we encounter this.
>
> > e.g. inodegc flushes could simply be delayed by an unlinked inode
> > being processed that has millions of extents that need to be freed.
> >
> > In reality, inode reclaim can block for long periods of time
> > on any filesystem, so the concept of "inode reclaim should
> > not block when PF_EXITING" is not a behaviour that we guarantee
> > anywhere or could guarantee across the board.
> >
> > Let's get to the bottom of why inodegc has apparently stalled before
> > trying to work out how to fix it...
>
> I'm happy to try, but I think it is also worth applying this patch.
> Like I said in the other thread, having to evac a box to get rid of an
> unkillable userspace process is annoying.

If inodegc is stuck, then it's only a matter of time before the
filesystem will completely lock up and you'll have to cycle the
machine anyway. This patch merely kicks the can down the road a few
minutes, it doesn't change anything material.

-Dave.
On Fri, May 12, 2023 at 09:15:11AM -0600, Tycho Andersen wrote:
> INFO: task java:3546393 blocked for more than 1912 seconds.
> Tainted: G OE 5.15.35netflix-g54efd87a8576 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:skyframe-evalua state:D stack: 0 pid:3546393 ppid:3532734 flags:0x00000220
> Call Trace:
> <TASK>
> __schedule+0x2c5/0x8d0
> schedule+0x3a/0xa0
> schedule_timeout+0x115/0x280
> ? xfs_buf_read_map+0x52/0x2e0 [xfs]
> ? xfs_da_read_buf+0xcf/0x120 [xfs]
> __down+0x90/0xe0
> ? down+0x43/0x60
> down+0x43/0x60
> xfs_buf_lock+0x29/0xa0 [xfs]
> xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
> xfs_buf_get_map+0x4c/0x400 [xfs]
> xfs_buf_read_map+0x52/0x2e0 [xfs]
> ? xfs_read_agi+0x8c/0x120 [xfs]
> xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
> ? xfs_read_agi+0x8c/0x120 [xfs]
> xfs_read_agi+0x8c/0x120 [xfs]
> xfs_ialloc_read_agi+0x2a/0x90 [xfs]
> xfs_dialloc+0x1ab/0x680 [xfs]
> ? xfs_trans_reserve_quota_icreate+0x32/0x40 [xfs]
> ? xfs_trans_alloc_icreate+0x75/0xf0 [xfs]
> xfs_create+0x39e/0x5b0 [xfs]
> xfs_generic_create+0x129/0x380 [xfs]
> ? generic_permission+0x27/0x200
> vfs_create+0x12b/0x1c0
> ovl_create_real+0xd7/0x220 [overlay]
> ovl_create_or_link+0x166/0x670 [overlay]

Lots of stuff blocked on the AGI lock doing inode allocation. Kinda
what I expected to see...

> INFO: task kworker/66:1:3548191 blocked for more than 1912 seconds.
> Tainted: G OE 5.15.35netflix-g54efd87a8576 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/66:1 state:D stack: 0 pid:3548191 ppid: 2 flags:0x00004000
> Workqueue: xfs-inodegc/nvme1n1 xfs_inodegc_worker [xfs]
> Call Trace:
> <TASK>
> __schedule+0x2c5/0x8d0
> schedule+0x3a/0xa0
> schedule_timeout+0x115/0x280
> __down+0x90/0xe0
> ? down+0x43/0x60
> down+0x43/0x60
> xfs_buf_lock+0x29/0xa0 [xfs]
> xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
> xfs_buf_get_map+0x4c/0x400 [xfs]
> xfs_buf_read_map+0x52/0x2e0 [xfs]
> ? xfs_read_agi+0x8c/0x120 [xfs]
> xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
> ? xfs_read_agi+0x8c/0x120 [xfs]
> xfs_read_agi+0x8c/0x120 [xfs]
> xfs_iunlink_remove+0x5a/0x220 [xfs]
> xfs_ifree+0x7a/0x4c0 [xfs]
> ? xfs_trans_alloc+0xec/0x1e0 [xfs]
> xfs_inactive_ifree+0xa1/0x1a0 [xfs]
> xfs_inactive+0xf1/0x170 [xfs]
> xfs_inodegc_worker+0x76/0x100 [xfs]
> process_one_work+0x200/0x3d0
> worker_thread+0x2d/0x3e0
> ? process_one_work+0x3d0/0x3d0
> kthread+0x11a/0x140
> ? set_kthread_struct+0x40/0x40
> ret_from_fork+0x22/0x30
> </TASK>

Yup, there's the stuck inodegc worker - also trying to get the AGI
lock doing inode unlink. I'm not surprised at all by this - working
out why this worker is stuck here is the goal because that is what
is causing all your flush problems.

It also gives me a better idea of what the likely issue is, too.

> INFO: task bazelenv:3548683 blocked for more than 1912 seconds.
> Tainted: G OE 5.15.35netflix-g54efd87a8576 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:bazelenv state:D stack: 0 pid:3548683 ppid:3435211 flags:0x00000220
> Call Trace:
> <TASK>
> __schedule+0x2c5/0x8d0
> schedule+0x3a/0xa0
> schedule_timeout+0x115/0x280
> __down+0x90/0xe0
> ? down+0x43/0x60
> down+0x43/0x60
> xfs_buf_lock+0x29/0xa0 [xfs]
> xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
> xfs_buf_get_map+0x4c/0x400 [xfs]
> xfs_buf_read_map+0x52/0x2e0 [xfs]
> ? xfs_read_agf+0x84/0x100 [xfs]
> xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
> ? xfs_read_agf+0x84/0x100 [xfs]
> xfs_read_agf+0x84/0x100 [xfs]
> xfs_alloc_read_agf+0x3a/0x1b0 [xfs]
> xfs_alloc_fix_freelist+0x434/0x500 [xfs]
> ? __kmalloc+0x3c0/0x400
> ? xfs_dquot_to_disk+0xfe/0x110 [xfs]
> ? xfs_qm_dquot_logitem_format+0x89/0x150 [xfs]
> xfs_free_extent_fix_freelist+0x61/0xa0 [xfs]
> __xfs_free_extent+0x6a/0x1c0 [xfs]
> xfs_trans_free_extent+0x3b/0xd0 [xfs]
> xfs_extent_free_finish_item+0x23/0x40 [xfs]
> xfs_defer_finish_noroll+0x24b/0x5a0 [xfs]
> xfs_defer_finish+0x13/0x80 [xfs]
> xfs_itruncate_extents_flags+0x13b/0x220 [xfs]
> xfs_setattr_size+0x35c/0x3d0 [xfs]
> ? aa_compute_fperms+0x16f/0x190
> xfs_vn_setattr+0xf7/0x110 [xfs]

And that confirms the suspicions I have - another process blocked on
an AGF lock. That might be IO it's blocking on here, but combined
with the inode unlink being stuck in inodegc, I have a suspicion
that this is an AGF - AGI - inode buffer lock ordering issue.

There's been a heap of work done in this space since 5.15, and I'd
suggest that almost none of it is trivially backportable as it's
deeply intertwined with relatively complex algorithm changes.

I also recently found evidence of another AGI -> inode buffer -> AGF
-> inode buffer deadlock issue debugging another hang from a machine
making heavy use of O_TMPFILEs via OVL. Without kernel code to
examine in intricate detail, I can't say if this is what you are
seeing.

I also don't have a fix for that yet because, well, it's complex and
I can really only accommodate one complex high priority bug at a time
in my head and it hasn't been my highest priority. As it is, I don't
think the fix for it will be trivially backportable to 5.15,
either....

> INFO: task journalctl:3549038 blocked for more than 1912 seconds.
> Tainted: G OE 5.15.35netflix-g54efd87a8576 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:journalctl state:D stack: 0 pid:3549038 ppid:469330 flags:0x00000224
> Call Trace:
> <TASK>
> __schedule+0x2c5/0x8d0
> schedule+0x3a/0xa0
> schedule_timeout+0x115/0x280
> ? __traceiter_sched_wakeup+0x29/0x40
> ? ttwu_do_wakeup+0x106/0x170
> wait_for_completion+0x9f/0x100
> __flush_work+0x161/0x1f0
> ? worker_detach_from_pool+0xb0/0xb0
> xfs_inodegc_flush.part.25+0x4f/0xa0 [xfs]
> xfs_fs_statfs+0x35/0x1e0 [xfs]
> statfs_by_dentry+0x67/0x90
> vfs_statfs+0x16/0xd0
> ovl_statfs+0x40/0x70 [overlay]

Fixed by commit 5e672cd69f0a ("xfs: introduce xfs_inodegc_push()")
which also means that the commits Darrick suggested you try aren't
relevant. The mod_delayed_work() issues they address weren't
introduced until commit 7cf2b0f9611b ("xfs: bound maximum wait time
for inodegc work") which has a direct dependency on 5e672cd69f0a
being present....

-Dave.
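As a quick check for whether a given kernel already carries the fix Dave cites: a sketch, assuming a kernel git checkout for the first helper and a loaded xfs module for the second; the helper names are illustrative.

```shell
# Look for the fix in a kernel source tree by commit subject.
tree_has_inodegc_push() {
    git -C "${1:-.}" log --oneline --grep 'xfs: introduce xfs_inodegc_push'
}

# Look for the symbol on a running kernel (xfs must be loaded; symbol
# names in /proc/kallsyms are readable even when addresses are hidden).
running_kernel_has_inodegc_push() {
    grep -qw xfs_inodegc_push /proc/kallsyms
}
```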
On Sat, May 13, 2023 at 08:54:14AM +1000, Dave Chinner wrote:
> On Fri, May 12, 2023 at 09:15:11AM -0600, Tycho Andersen wrote:
> I also recently found evidence of another AGI -> inode buffer -> AGF
> -> inode buffer deadlock issue debugging another hang from a machine
> making heavy use of O_TMPFILEs via OVL. Without kernel code to
> examine in intricate detail, I can't say if this is what you are
> seeing.
>
> I also don't have a fix for that yet because, well, it's complex and
> I can really only accommodate one complex high priority bug at a time
> in my head and it hasn't been my highest priority. As it is, I
> don't think the fix for it will be trivially backportable to 5.15,
> either....

Just FYI: Having paged this back into my brain looking at this issue,
it occurred to me yesterday that there is a relatively simple fix for
this issue: using ->iop_precommit() to move adding the inode to the
inode cluster buffer from xfs_trans_log_inode() to inside
xfs_trans_commit().

This ensures the inode cluster buffers don't get locked until
xfs_trans_commit() is called, as we already do with the
xfs_iunlink_item processing for unlinked list operations.

I've just written the patch to do this, but it also isn't a 5.15
backport candidate because the infrastructure and xfs_iunlink_item
processing wasn't introduced until 5.19....

-Dave.
Hi Dave,

Thanks for this analysis, all very helpful,

On Sat, May 13, 2023 at 08:54:14AM +1000, Dave Chinner wrote:
> On Fri, May 12, 2023 at 09:15:11AM -0600, Tycho Andersen wrote:
> > INFO: task kworker/66:1:3548191 blocked for more than 1912 seconds.
> > Tainted: G OE 5.15.35netflix-g54efd87a8576 #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:kworker/66:1 state:D stack: 0 pid:3548191 ppid: 2 flags:0x00004000
> > Workqueue: xfs-inodegc/nvme1n1 xfs_inodegc_worker [xfs]
> > Call Trace:
> > <TASK>
> > __schedule+0x2c5/0x8d0
> > schedule+0x3a/0xa0
> > schedule_timeout+0x115/0x280
> > __down+0x90/0xe0
> > ? down+0x43/0x60
> > down+0x43/0x60
> > xfs_buf_lock+0x29/0xa0 [xfs]
> > xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
> > xfs_buf_get_map+0x4c/0x400 [xfs]
> > xfs_buf_read_map+0x52/0x2e0 [xfs]
> > ? xfs_read_agi+0x8c/0x120 [xfs]
> > xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
> > ? xfs_read_agi+0x8c/0x120 [xfs]
> > xfs_read_agi+0x8c/0x120 [xfs]
> > xfs_iunlink_remove+0x5a/0x220 [xfs]
> > xfs_ifree+0x7a/0x4c0 [xfs]
> > ? xfs_trans_alloc+0xec/0x1e0 [xfs]
> > xfs_inactive_ifree+0xa1/0x1a0 [xfs]
> > xfs_inactive+0xf1/0x170 [xfs]
> > xfs_inodegc_worker+0x76/0x100 [xfs]
> > process_one_work+0x200/0x3d0
> > worker_thread+0x2d/0x3e0
> > ? process_one_work+0x3d0/0x3d0
> > kthread+0x11a/0x140
> > ? set_kthread_struct+0x40/0x40
> > ret_from_fork+0x22/0x30
> > </TASK>
>
> Yup, there's the stuck inodegc worker - also trying to get the AGI
> lock doing inode unlink. I'm not surprised at all by this - working
> out why this worker is stuck here is the goal because that is what
> is causing all your flush problems.
>
> It also gives me a better idea of what the likely issue is, too.
>
> > INFO: task bazelenv:3548683 blocked for more than 1912 seconds.
> > Tainted: G OE 5.15.35netflix-g54efd87a8576 #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:bazelenv state:D stack: 0 pid:3548683 ppid:3435211 flags:0x00000220
> > Call Trace:
> > <TASK>
> > __schedule+0x2c5/0x8d0
> > schedule+0x3a/0xa0
> > schedule_timeout+0x115/0x280
> > __down+0x90/0xe0
> > ? down+0x43/0x60
> > down+0x43/0x60
> > xfs_buf_lock+0x29/0xa0 [xfs]
> > xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
> > xfs_buf_get_map+0x4c/0x400 [xfs]
> > xfs_buf_read_map+0x52/0x2e0 [xfs]
> > ? xfs_read_agf+0x84/0x100 [xfs]
> > xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
> > ? xfs_read_agf+0x84/0x100 [xfs]
> > xfs_read_agf+0x84/0x100 [xfs]
> > xfs_alloc_read_agf+0x3a/0x1b0 [xfs]
> > xfs_alloc_fix_freelist+0x434/0x500 [xfs]
> > ? __kmalloc+0x3c0/0x400
> > ? xfs_dquot_to_disk+0xfe/0x110 [xfs]
> > ? xfs_qm_dquot_logitem_format+0x89/0x150 [xfs]
> > xfs_free_extent_fix_freelist+0x61/0xa0 [xfs]
> > __xfs_free_extent+0x6a/0x1c0 [xfs]
> > xfs_trans_free_extent+0x3b/0xd0 [xfs]
> > xfs_extent_free_finish_item+0x23/0x40 [xfs]
> > xfs_defer_finish_noroll+0x24b/0x5a0 [xfs]
> > xfs_defer_finish+0x13/0x80 [xfs]
> > xfs_itruncate_extents_flags+0x13b/0x220 [xfs]
> > xfs_setattr_size+0x35c/0x3d0 [xfs]
> > ? aa_compute_fperms+0x16f/0x190
> > xfs_vn_setattr+0xf7/0x110 [xfs]
>
> And that confirms the suspicions I have - another process blocked on
> an AGF lock. That might be IO it's blocking on here, but combined
> with the inode unlink being stuck in inodegc, I have a suspicion
> that this is an AGF - AGI - inode buffer lock ordering issue.
>
> There's been a heap of work done in this space since 5.15, and I'd
> suggest that almost none of it is trivially backportable as it's
> deeply intertwined with relatively complex algorithm changes.

Sounds like this is the type of work that might fix the bug? I could
try to convince people to upgrade to 6.1, would that be enough?

> I also recently found evidence of another AGI -> inode buffer -> AGF
> -> inode buffer deadlock issue debugging another hang from a machine
> making heavy use of O_TMPFILEs via OVL. Without kernel code to
> examine in intricate detail, I can't say if this is what you are
> seeing.

Ok. I haven't had success reproducing this without overlay in the
mix, but if it could be related I'll give it a shot. Next time we
see this I'll try to grab a kernel core as well.

> I also don't have a fix for that yet because, well, it's complex and
> I can really only accommodate one complex high priority bug at a time
> in my head and it hasn't been my highest priority. As it is, I
> don't think the fix for it will be trivially backportable to 5.15,
> either....

Yup, no worries.

> > INFO: task journalctl:3549038 blocked for more than 1912 seconds.
> > Tainted: G OE 5.15.35netflix-g54efd87a8576 #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:journalctl state:D stack: 0 pid:3549038 ppid:469330 flags:0x00000224
> > Call Trace:
> > <TASK>
> > __schedule+0x2c5/0x8d0
> > schedule+0x3a/0xa0
> > schedule_timeout+0x115/0x280
> > ? __traceiter_sched_wakeup+0x29/0x40
> > ? ttwu_do_wakeup+0x106/0x170
> > wait_for_completion+0x9f/0x100
> > __flush_work+0x161/0x1f0
> > ? worker_detach_from_pool+0xb0/0xb0
> > xfs_inodegc_flush.part.25+0x4f/0xa0 [xfs]
> > xfs_fs_statfs+0x35/0x1e0 [xfs]
> > statfs_by_dentry+0x67/0x90
> > vfs_statfs+0x16/0xd0
> > ovl_statfs+0x40/0x70 [overlay]
>
> Fixed by commit 5e672cd69f0a ("xfs: introduce xfs_inodegc_push()")
> which also means that the commits Darrick suggested you try aren't
> relevant. The mod_delayed_work() issues they address weren't
> introduced until commit 7cf2b0f9611b ("xfs: bound maximum wait time
> for inodegc work") which has a direct dependency on 5e672cd69f0a
> being present....

Thanks, this looks like a useful patch. I wonder if we could do the
same thing when PF_EXITING, instead of just avoiding queuing work at
all? We tend to get complaints about tasks in weird states when we
block in do_exit() for any nontrivial length of time.

In any case, this looks like it backports pretty cleanly to 5.15, so
I'll put it in our kernels for now.

Thanks again for your help.

Tycho
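The backport described above amounts to a single cherry-pick; roughly the following, assuming a clone of a kernel tree containing the commit (the branch name is illustrative, and conflicts may still need hand resolution):

```shell
# Sketch: backport the inodegc push fix onto the 5.15 base used in
# this report.
backport_inodegc_push() {
    git checkout -b inodegc-push-5.15 v5.15.35   # base version from the report
    git cherry-pick -x 5e672cd69f0a              # "xfs: introduce xfs_inodegc_push()"
}
```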
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 351849fc18ff..90e94d84f8ad 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -2011,6 +2011,9 @@ xfs_inodegc_want_queue_work(
  *
  * Note: If the current thread is running a transaction, we don't ever want to
  * wait for other transactions because that could introduce a deadlock.
+ * If the task is currently exiting there is nobody to wait for
+ * the flush and it can deadlock, so we should not try to flush in this case
+ * either.
  */
 static inline bool
 xfs_inodegc_want_flush_work(
@@ -2021,6 +2024,9 @@ xfs_inodegc_want_flush_work(
 	if (current->journal_info)
 		return false;
 
+	if (current->flags & PF_EXITING)
+		return false;
+
 	if (shrinker_hits > 0)
 		return true;