Message ID | 20221206093833.3812138-1-harshit.m.mogalapalli@oracle.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:f944:0:0:0:0:0 with SMTP id q4csp2725150wrr; Tue, 6 Dec 2022 01:46:23 -0800 (PST) X-Google-Smtp-Source: AA0mqf7dF/OrFyWDIKpLYXB882y6j/VVRDlWGO67kpoY9ag5baPj9k/Vpu3GRqnxIe+uGsmfQpna X-Received: by 2002:a63:2001:0:b0:477:b0d0:bbee with SMTP id g1-20020a632001000000b00477b0d0bbeemr58510120pgg.51.1670319983010; Tue, 06 Dec 2022 01:46:23 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1670319983; cv=none; d=google.com; s=arc-20160816; b=09naB2dy2bfSX7lOgPgqrFk/0iVdkrhbKdb558CR5owuzgQ9NJ5x++CYR3YPFuItEu v7qdpYE4/MrUkRHbtzRm2YuU1AOmQ9Cs8JOAz3d/O16iq7w3oTyjXvEKMo/vJEGzHTYv yU++4QWp7MHlB4nVq3lwKxKhzv9kaLtuGhiZIr/BkjI0IHv8V0hPOhLRVmuCooZigU4c 3elmFT+PdIeXt1hvTOToUPxzgO7b1XqdJyk8SiiZvlfJuw305lIb6SZX98MpIb7XOT2C Xm/F7qkQedqPV9mHMdM1hZa2MNievkkFEG3qppCHNswH8ige03ciHdaXNyNZQ8x2gMo8 Ob/A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:to:content-transfer-encoding:mime-version :message-id:date:subject:cc:from:dkim-signature; bh=h+xqY3ap6uUHi01HvJo5PCdAUlQP/OuCOPGu3xuefWk=; b=NV03bCA4A2wfHdrLILrad379ysy4uzlqbngKLSY5cDUtM8mOAtisFe/Hn6STOEEcVA gqMIqVvocLwtdIWdobe8g8C45hAbZplbcJY5gcYjKYrzWI3gxv7PrPsE8b5UDsCjQrYn ElCskLa5arK8xpKQyLNgzPh1DAnpRLN+F184QRZ0eFyNFlun0od1UQgRRY06R8DX/Edf 1KoYgOtBXSQWbaRE34U2ssm24nL182ZgJt62L7Z+5eEu5aLe3Qygb9Lx+of9F2qZvrZS tjOz20m6N1MvDZGfyh0bV64yb6d4z9JzPLwmK7UWPWAELyOAW7nZPA5RJVdPYIwCbT2I qK9w== ARC-Authentication-Results: i=1; mx.google.com; dkim=fail header.i=@oracle.com header.s=corp-2022-7-12 header.b=U6bRebwq; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=oracle.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id f23-20020a63e317000000b0047697505449si16427180pgh.365.2022.12.06.01.46.08; Tue, 06 Dec 2022 01:46:22 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=fail header.i=@oracle.com header.s=corp-2022-7-12 header.b=U6bRebwq; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=oracle.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234361AbiLFJjD (ORCPT <rfc822;jaysivo@gmail.com> + 99 others); Tue, 6 Dec 2022 04:39:03 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49150 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234156AbiLFJi7 (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 6 Dec 2022 04:38:59 -0500 Received: from mx0b-00069f02.pphosted.com (mx0b-00069f02.pphosted.com [205.220.177.32]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0EAAC1AD99; Tue, 6 Dec 2022 01:38:57 -0800 (PST) Received: from pps.filterd (m0246632.ppops.net [127.0.0.1]) by mx0b-00069f02.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 2B68NiHI017013; Tue, 6 Dec 2022 09:38:56 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=from : to : cc : subject : date : message-id : mime-version : content-transfer-encoding; s=corp-2022-7-12; bh=h+xqY3ap6uUHi01HvJo5PCdAUlQP/OuCOPGu3xuefWk=; b=U6bRebwq0a09eqoYyMUBMzvDDuwlnT3h2M5neEXXJTjqnJUyjSZL82rFrVrIJGNRuFuw mY51NhaY9FsGHfoY22j9HVv4THWFsxhfi4Bnd1jhww8Wx0ODlHbxJaHkNzyTxWUhTzZq eza7QQfbD7WwT/nK3cfckhC5vH7baAwGYnZKmMiDFSbR3h6RvEUzFDRlQA7t1qpaHtTV b57TX+gTY7VlaIDq9c0+McWcVhvm/FtkV/xauiQIG9zoaCVlx0PsUJVJxK0c70KERHFp r20RA8VJ6vPT2zPCfHA336k4/37bA7JcksXw/pWU1Ex3OhqolgmqXGbs8L7dIFFVm/uV /g== Received: from phxpaimrmta03.imrmtpd1.prodappphxaev1.oraclevcn.com (phxpaimrmta03.appoci.oracle.com [138.1.37.129]) by mx0b-00069f02.pphosted.com (PPS) with ESMTPS id 3m7ybgpmat-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Tue, 06 Dec 2022 09:38:55 +0000 Received: from pps.filterd (phxpaimrmta03.imrmtpd1.prodappphxaev1.oraclevcn.com [127.0.0.1]) by phxpaimrmta03.imrmtpd1.prodappphxaev1.oraclevcn.com (8.17.1.5/8.17.1.5) with ESMTP id 2B67VZf2031092; Tue, 6 Dec 2022 09:38:47 GMT Received: from pps.reinject (localhost [127.0.0.1]) by phxpaimrmta03.imrmtpd1.prodappphxaev1.oraclevcn.com (PPS) with ESMTPS id 3m8ua0ahce-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Tue, 06 Dec 2022 09:38:47 +0000 Received: from phxpaimrmta03.imrmtpd1.prodappphxaev1.oraclevcn.com (phxpaimrmta03.imrmtpd1.prodappphxaev1.oraclevcn.com [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 2B69clcE006666; Tue, 6 Dec 2022 09:38:47 GMT Received: from ca-dev112.us.oracle.com (ca-dev112.us.oracle.com [10.129.136.47]) by phxpaimrmta03.imrmtpd1.prodappphxaev1.oraclevcn.com (PPS) with ESMTP id 3m8ua0ahc9-1; Tue, 06 Dec 2022 09:38:47 +0000 From: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Cc: harshit.m.mogalapalli@oracle.com, harshit.m.mogalapalli@gmail.com, vegard.nossum@oracle.com, george.kennedy@oracle.com, darren.kenny@oracle.com, syzkaller <syzkaller@googlegroups.com>, Jens Axboe <axboe@kernel.dk>, Pavel Begunkov <asml.silence@gmail.com>, io-uring@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH] io_uring: Fix a null-ptr-deref in io_tctx_exit_cb() Date: Tue, 6 Dec 2022 01:38:32 -0800 Message-Id: <20221206093833.3812138-1-harshit.m.mogalapalli@oracle.com> X-Mailer: git-send-email 2.31.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.923,Hydra:6.0.545,FMLib:17.11.122.1 definitions=2022-12-06_05,2022-12-05_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 mlxlogscore=901 mlxscore=0 adultscore=0 phishscore=0 suspectscore=0 malwarescore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2210170000 definitions=main-2212060079 X-Proofpoint-ORIG-GUID: pYNp8N-3mGfidp7miOcVNRWjSqIG7PZh X-Proofpoint-GUID: pYNp8N-3mGfidp7miOcVNRWjSqIG7PZh X-Spam-Status: No, score=-2.8 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net To: unlisted-recipients:; (no To-header on input) Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1751457446238361235?= X-GMAIL-MSGID: =?utf-8?q?1751457446238361235?= |
Series |
io_uring: Fix a null-ptr-deref in io_tctx_exit_cb()
|
|
Commit Message
Harshit Mogalapalli
Dec. 6, 2022, 9:38 a.m. UTC
Syzkaller reports a NULL deref bug as follows:
BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3
Read of size 4 at addr 0000000000000138 by task file1/1955
CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xcd/0x134
? io_tctx_exit_cb+0x53/0xd3
kasan_report+0xbb/0x1f0
? io_tctx_exit_cb+0x53/0xd3
kasan_check_range+0x140/0x190
io_tctx_exit_cb+0x53/0xd3
task_work_run+0x164/0x250
? task_work_cancel+0x30/0x30
get_signal+0x1c3/0x2440
? lock_downgrade+0x6e0/0x6e0
? lock_downgrade+0x6e0/0x6e0
? exit_signals+0x8b0/0x8b0
? do_raw_read_unlock+0x3b/0x70
? do_raw_spin_unlock+0x50/0x230
arch_do_signal_or_restart+0x82/0x2470
? kmem_cache_free+0x260/0x4b0
? putname+0xfe/0x140
? get_sigframe_size+0x10/0x10
? do_execveat_common.isra.0+0x226/0x710
? lockdep_hardirqs_on+0x79/0x100
? putname+0xfe/0x140
? do_execveat_common.isra.0+0x238/0x710
exit_to_user_mode_prepare+0x15f/0x250
syscall_exit_to_user_mode+0x19/0x50
do_syscall_64+0x42/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0023:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Kernel panic - not syncing: panic_on_warn set ...
Add a NULL check on tctx to prevent this.
Fixes: d56d938b4bef ("io_uring: do ctx initiated file note removal")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
---
Could not find the root cause of this.
---
io_uring/io_uring.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Comments
On 12/6/22 10:38, Harshit Mogalapalli wrote: > Syzkaller reports a NULL deref bug as follows: > > BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3 [...] > Add a NULL check on tctx to prevent this. > > Fixes: d56d938b4bef ("io_uring: do ctx initiated file note removal") > Reported-by: syzkaller <syzkaller@googlegroups.com> > Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> > --- > Could not find the root cause of this. Hi, I don't think the patch is correct as-is -- we should in any case probably understand better what's going on. I think what's happening is something like this, where tsk->io_uring is set to NULL in begin_new_exec() while we have a pending callback: fd = io_uring_setup() [...] close(fd) ? - __fput() - io_uring_release() - io_ring_ctx_wait_and_kill() - init_task_work(..., io_tctx_exit_cb) // callback posted exec() - begin_new_exec() - io_uring_task_cancel() - __io_uring_cancel() - io_uring_cancel_generic() - __io_uring_free() - tsk->io_uring = NULL // pointer nulled - syscall_exit_to_user_mode() - [...] - task_work_run() - io_tctx_exit_cb() - *current->io_uring // callback runs: oops As far as I can tell, whatever is happening in io_ring_exit_work() is happening too late, as task->io_uring has already been set to NULL. It looks a bit like this is supposed to be handled in io_uring_cancel_generic() already where it tries to cancel and wait for all the outstanding work items to finish, but maybe that is not taking into account the fact that the exit callback is still pending? Should io_ring_ctx_wait_and_kill() bump the inflight counter..? It's unclear to me whether the io_ring_ctx_wait_and_kill() call is coming through close(), dup2(), or simply exec(), but it looks like this could potentially get delayed (from the current syscall) and thus pushed into the exec() call. Maybe flush_delayed_fput() needs to be called somewhere..? Anyway, I could be completely off base here as I'm not really familiar with the code, just wanted to share my notes. Vegard
On 12/6/22 2:38?AM, Harshit Mogalapalli wrote: > Syzkaller reports a NULL deref bug as follows: > > BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3 > Read of size 4 at addr 0000000000000138 by task file1/1955 > > CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 > Call Trace: > <TASK> > dump_stack_lvl+0xcd/0x134 > ? io_tctx_exit_cb+0x53/0xd3 > kasan_report+0xbb/0x1f0 > ? io_tctx_exit_cb+0x53/0xd3 > kasan_check_range+0x140/0x190 > io_tctx_exit_cb+0x53/0xd3 > task_work_run+0x164/0x250 > ? task_work_cancel+0x30/0x30 > get_signal+0x1c3/0x2440 > ? lock_downgrade+0x6e0/0x6e0 > ? lock_downgrade+0x6e0/0x6e0 > ? exit_signals+0x8b0/0x8b0 > ? do_raw_read_unlock+0x3b/0x70 > ? do_raw_spin_unlock+0x50/0x230 > arch_do_signal_or_restart+0x82/0x2470 > ? kmem_cache_free+0x260/0x4b0 > ? putname+0xfe/0x140 > ? get_sigframe_size+0x10/0x10 > ? do_execveat_common.isra.0+0x226/0x710 > ? lockdep_hardirqs_on+0x79/0x100 > ? putname+0xfe/0x140 > ? do_execveat_common.isra.0+0x238/0x710 > exit_to_user_mode_prepare+0x15f/0x250 > syscall_exit_to_user_mode+0x19/0x50 > do_syscall_64+0x42/0xb0 > entry_SYSCALL_64_after_hwframe+0x63/0xcd > RIP: 0023:0x0 > Code: Unable to access opcode bytes at 0xffffffffffffffd6. > RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 > R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 > </TASK> > Kernel panic - not syncing: panic_on_warn set ... > > Add a NULL check on tctx to prevent this. I agree with Vegard that I don't think this is fixing the core of the issue. I think what is happening here is that we don't run the task_work in io_uring_cancel_generic() unconditionally, if we don't need to in the loop above. But we do need to ensure we run it before we clear current->io_uring. Do you have a reproducer? If so, can you try the below? I _think_ this is all we need. We can't be hitting the delayed fput path as the task isn't exiting, and we're dealing with current here. diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 36cb63e4174f..4791d94c88f5 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -3125,6 +3125,15 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd) io_uring_clean_tctx(tctx); if (cancel_all) { + /* + * If we didn't run task_work in the loop above, ensure we + * do so here. If an fput() queued up exit task_work for the + * ring descriptor before we started the exec that led to this + * cancelation, then we need to have that run before we proceed + * with tearing down current->io_uring. + */ + io_run_task_work(); + /* * We shouldn't run task_works after cancel, so just leave * ->in_idle set for normal exit.
On 12/6/22 2:15?PM, Jens Axboe wrote: > On 12/6/22 2:38?AM, Harshit Mogalapalli wrote: >> Syzkaller reports a NULL deref bug as follows: >> >> BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3 >> Read of size 4 at addr 0000000000000138 by task file1/1955 >> >> CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 >> Call Trace: >> <TASK> >> dump_stack_lvl+0xcd/0x134 >> ? io_tctx_exit_cb+0x53/0xd3 >> kasan_report+0xbb/0x1f0 >> ? io_tctx_exit_cb+0x53/0xd3 >> kasan_check_range+0x140/0x190 >> io_tctx_exit_cb+0x53/0xd3 >> task_work_run+0x164/0x250 >> ? task_work_cancel+0x30/0x30 >> get_signal+0x1c3/0x2440 >> ? lock_downgrade+0x6e0/0x6e0 >> ? lock_downgrade+0x6e0/0x6e0 >> ? exit_signals+0x8b0/0x8b0 >> ? do_raw_read_unlock+0x3b/0x70 >> ? do_raw_spin_unlock+0x50/0x230 >> arch_do_signal_or_restart+0x82/0x2470 >> ? kmem_cache_free+0x260/0x4b0 >> ? putname+0xfe/0x140 >> ? get_sigframe_size+0x10/0x10 >> ? do_execveat_common.isra.0+0x226/0x710 >> ? lockdep_hardirqs_on+0x79/0x100 >> ? putname+0xfe/0x140 >> ? do_execveat_common.isra.0+0x238/0x710 >> exit_to_user_mode_prepare+0x15f/0x250 >> syscall_exit_to_user_mode+0x19/0x50 >> do_syscall_64+0x42/0xb0 >> entry_SYSCALL_64_after_hwframe+0x63/0xcd >> RIP: 0023:0x0 >> Code: Unable to access opcode bytes at 0xffffffffffffffd6. >> RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b >> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 >> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 >> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 >> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 >> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 >> </TASK> >> Kernel panic - not syncing: panic_on_warn set ... >> >> Add a NULL check on tctx to prevent this. > > I agree with Vegard that I don't think this is fixing the core of > the issue. I think what is happening here is that we don't run the > task_work in io_uring_cancel_generic() unconditionally, if we don't > need to in the loop above. But we do need to ensure we run it before > we clear current->io_uring. > > Do you have a reproducer? If so, can you try the below? I _think_ > this is all we need. We can't be hitting the delayed fput path as > the task isn't exiting, and we're dealing with current here. While I think the above is the right description of what happens, I think there's still a race with the proposed solution. If the task_work gets added right after the newly inserted io_run_task_work(), then we'll still crash when the targeted task exits to userspace and runs the task_work. It should actually be fine to add that NULL check in io_tctx_exit_cb. We can't be racing here, as both the clear and io_tctx_exit_cb() are run by current itself. It's really just an ordering issue.
On 12/6/22 2:30 PM, Jens Axboe wrote: > On 12/6/22 2:15?PM, Jens Axboe wrote: >> On 12/6/22 2:38?AM, Harshit Mogalapalli wrote: >>> Syzkaller reports a NULL deref bug as follows: >>> >>> BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3 >>> Read of size 4 at addr 0000000000000138 by task file1/1955 >>> >>> CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75 >>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 >>> Call Trace: >>> <TASK> >>> dump_stack_lvl+0xcd/0x134 >>> ? io_tctx_exit_cb+0x53/0xd3 >>> kasan_report+0xbb/0x1f0 >>> ? io_tctx_exit_cb+0x53/0xd3 >>> kasan_check_range+0x140/0x190 >>> io_tctx_exit_cb+0x53/0xd3 >>> task_work_run+0x164/0x250 >>> ? task_work_cancel+0x30/0x30 >>> get_signal+0x1c3/0x2440 >>> ? lock_downgrade+0x6e0/0x6e0 >>> ? lock_downgrade+0x6e0/0x6e0 >>> ? exit_signals+0x8b0/0x8b0 >>> ? do_raw_read_unlock+0x3b/0x70 >>> ? do_raw_spin_unlock+0x50/0x230 >>> arch_do_signal_or_restart+0x82/0x2470 >>> ? kmem_cache_free+0x260/0x4b0 >>> ? putname+0xfe/0x140 >>> ? get_sigframe_size+0x10/0x10 >>> ? do_execveat_common.isra.0+0x226/0x710 >>> ? lockdep_hardirqs_on+0x79/0x100 >>> ? putname+0xfe/0x140 >>> ? do_execveat_common.isra.0+0x238/0x710 >>> exit_to_user_mode_prepare+0x15f/0x250 >>> syscall_exit_to_user_mode+0x19/0x50 >>> do_syscall_64+0x42/0xb0 >>> entry_SYSCALL_64_after_hwframe+0x63/0xcd >>> RIP: 0023:0x0 >>> Code: Unable to access opcode bytes at 0xffffffffffffffd6. >>> RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b >>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 >>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 >>> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 >>> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 >>> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 >>> </TASK> >>> Kernel panic - not syncing: panic_on_warn set ... >>> >>> Add a NULL check on tctx to prevent this. >> >> I agree with Vegard that I don't think this is fixing the core of >> the issue. I think what is happening here is that we don't run the >> task_work in io_uring_cancel_generic() unconditionally, if we don't >> need to in the loop above. But we do need to ensure we run it before >> we clear current->io_uring. >> >> Do you have a reproducer? If so, can you try the below? I _think_ >> this is all we need. We can't be hitting the delayed fput path as >> the task isn't exiting, and we're dealing with current here. > > While I think the above is the right description of what happens, I > think there's still a race with the proposed solution. If the task_work > gets added right after the newly inserted io_run_task_work(), then we'll > still crash when the targeted task exits to userspace and runs the > task_work. > > It should actually be fine to add that NULL check in io_tctx_exit_cb. We > can't be racing here, as both the clear and io_tctx_exit_cb() are run by > current itself. It's really just an ordering issue. I've queued it up with an improved commit message, and also a code comment: https://git.kernel.dk/cgit/linux-block/commit/?h=for-6.2/io_uring-next&id=6d1b48314b989d059642958fc94ef0a58b25fc8c
On 07/12/22 2:45 am, Jens Axboe wrote: > On 12/6/22 2:38?AM, Harshit Mogalapalli wrote: >> Syzkaller reports a NULL deref bug as follows: >> >> BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3 >> Read of size 4 at addr 0000000000000138 by task file1/1955 >> >> CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 >> Call Trace: >> <TASK> >> dump_stack_lvl+0xcd/0x134 >> ? io_tctx_exit_cb+0x53/0xd3 >> kasan_report+0xbb/0x1f0 >> ? io_tctx_exit_cb+0x53/0xd3 >> kasan_check_range+0x140/0x190 >> io_tctx_exit_cb+0x53/0xd3 >> task_work_run+0x164/0x250 >> ? task_work_cancel+0x30/0x30 >> get_signal+0x1c3/0x2440 >> ? lock_downgrade+0x6e0/0x6e0 >> ? lock_downgrade+0x6e0/0x6e0 >> ? exit_signals+0x8b0/0x8b0 >> ? do_raw_read_unlock+0x3b/0x70 >> ? do_raw_spin_unlock+0x50/0x230 >> arch_do_signal_or_restart+0x82/0x2470 >> ? kmem_cache_free+0x260/0x4b0 >> ? putname+0xfe/0x140 >> ? get_sigframe_size+0x10/0x10 >> ? do_execveat_common.isra.0+0x226/0x710 >> ? lockdep_hardirqs_on+0x79/0x100 >> ? putname+0xfe/0x140 >> ? do_execveat_common.isra.0+0x238/0x710 >> exit_to_user_mode_prepare+0x15f/0x250 >> syscall_exit_to_user_mode+0x19/0x50 >> do_syscall_64+0x42/0xb0 >> entry_SYSCALL_64_after_hwframe+0x63/0xcd >> RIP: 0023:0x0 >> Code: Unable to access opcode bytes at 0xffffffffffffffd6. >> RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b >> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 >> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 >> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 >> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 >> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 >> </TASK> >> Kernel panic - not syncing: panic_on_warn set ... >> >> Add a NULL check on tctx to prevent this. > > I agree with Vegard that I don't think this is fixing the core of > the issue. I think what is happening here is that we don't run the > task_work in io_uring_cancel_generic() unconditionally, if we don't > need to in the loop above. But we do need to ensure we run it before > we clear current->io_uring. > > Do you have a reproducer? If so, can you try the below? I _think_ > this is all we need. We can't be hitting the delayed fput path as > the task isn't exiting, and we're dealing with current here. > > Thanks Jens and Vegard for the suggestions and analysis. Yes, the below patch silences the reproducer. > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c > index 36cb63e4174f..4791d94c88f5 100644 > --- a/io_uring/io_uring.c > +++ b/io_uring/io_uring.c > @@ -3125,6 +3125,15 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd) > > io_uring_clean_tctx(tctx); > if (cancel_all) { > + /* > + * If we didn't run task_work in the loop above, ensure we > + * do so here. If an fput() queued up exit task_work for the > + * ring descriptor before we started the exec that led to this > + * cancelation, then we need to have that run before we proceed > + * with tearing down current->io_uring. > + */ > + io_run_task_work(); > + > /* > * We shouldn't run task_works after cancel, so just leave > * ->in_idle set for normal exit. > Thanks, Harshit
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 8840cf3e20f2..20f7d8655b50 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -2708,7 +2708,7 @@ static __cold void io_tctx_exit_cb(struct callback_head *cb) * When @in_idle, we're in cancellation and it's racy to remove the * node. It'll be removed by the end of cancellation, just ignore it. */ - if (!atomic_read(&tctx->in_idle)) + if (tctx && !atomic_read(&tctx->in_idle)) io_uring_del_tctx_node((unsigned long)work->ctx); complete(&work->completion); }