diff mbox series

io_uring: Fix a null-ptr-deref in io_tctx_exit_cb()

Message ID	20221206093833.3812138-1-harshit.m.mogalapalli@oracle.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Cc: harshit.m.mogalapalli@oracle.com, harshit.m.mogalapalli@gmail.com, vegard.nossum@oracle.com, george.kennedy@oracle.com, darren.kenny@oracle.com, syzkaller <syzkaller@googlegroups.com>, Jens Axboe <axboe@kernel.dk>, Pavel Begunkov <asml.silence@gmail.com>, io-uring@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH] io_uring: Fix a null-ptr-deref in io_tctx_exit_cb() Date: Tue, 6 Dec 2022 01:38:32 -0800 Message-Id: <20221206093833.3812138-1-harshit.m.mogalapalli@oracle.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit To: unlisted-recipients:; (no To-header on input) Precedence: bulk
Series	io_uring: Fix a null-ptr-deref in io_tctx_exit_cb() \| io_uring: Fix a null-ptr-deref in io_tctx_exit_cb()

Commit Message

Harshit Mogalapalli Dec. 6, 2022, 9:38 a.m. UTC

  Syzkaller reports a NULL deref bug as follows:

 BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3
 Read of size 4 at addr 0000000000000138 by task file1/1955

 CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0xcd/0x134
  ? io_tctx_exit_cb+0x53/0xd3
  kasan_report+0xbb/0x1f0
  ? io_tctx_exit_cb+0x53/0xd3
  kasan_check_range+0x140/0x190
  io_tctx_exit_cb+0x53/0xd3
  task_work_run+0x164/0x250
  ? task_work_cancel+0x30/0x30
  get_signal+0x1c3/0x2440
  ? lock_downgrade+0x6e0/0x6e0
  ? lock_downgrade+0x6e0/0x6e0
  ? exit_signals+0x8b0/0x8b0
  ? do_raw_read_unlock+0x3b/0x70
  ? do_raw_spin_unlock+0x50/0x230
  arch_do_signal_or_restart+0x82/0x2470
  ? kmem_cache_free+0x260/0x4b0
  ? putname+0xfe/0x140
  ? get_sigframe_size+0x10/0x10
  ? do_execveat_common.isra.0+0x226/0x710
  ? lockdep_hardirqs_on+0x79/0x100
  ? putname+0xfe/0x140
  ? do_execveat_common.isra.0+0x238/0x710
  exit_to_user_mode_prepare+0x15f/0x250
  syscall_exit_to_user_mode+0x19/0x50
  do_syscall_64+0x42/0xb0
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0023:0x0
 Code: Unable to access opcode bytes at 0xffffffffffffffd6.
 RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  </TASK>
 Kernel panic - not syncing: panic_on_warn set ...

Add a NULL check on tctx to prevent this.

Fixes: d56d938b4bef ("io_uring: do ctx initiated file note removal")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
---
Could not find the root cause of this.
---
 io_uring/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Vegard Nossum Dec. 6, 2022, 1:52 p.m. UTC | #1

On 12/6/22 10:38, Harshit Mogalapalli wrote:
> Syzkaller reports a NULL deref bug as follows:
> 
>   BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3

[...]

> Add a NULL check on tctx to prevent this.
> 
> Fixes: d56d938b4bef ("io_uring: do ctx initiated file note removal")
> Reported-by: syzkaller <syzkaller@googlegroups.com>
> Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
> ---
> Could not find the root cause of this.

Hi,

I don't think the patch is correct as-is -- we should in any case
probably understand better what's going on.

I think what's happening is something like this, where tsk->io_uring is
set to NULL in begin_new_exec() while we have a pending callback:

fd = io_uring_setup()
[...]

close(fd) ?
- __fput()
   - io_uring_release()
     - io_ring_ctx_wait_and_kill()
       - init_task_work(..., io_tctx_exit_cb) // callback posted

exec()
- begin_new_exec()
   - io_uring_task_cancel()
     - __io_uring_cancel()
       - io_uring_cancel_generic()
         - __io_uring_free()
           - tsk->io_uring = NULL // pointer nulled
- syscall_exit_to_user_mode()
   - [...]
     - task_work_run()
       - io_tctx_exit_cb()
         - *current->io_uring // callback runs: oops

As far as I can tell, whatever is happening in io_ring_exit_work() is
happening too late, as task->io_uring has already been set to NULL.

It looks a bit like this is supposed to be handled in
io_uring_cancel_generic() already where it tries to cancel and wait for
all the outstanding work items to finish, but maybe that is not taking
into account the fact that the exit callback is still pending? Should
io_ring_ctx_wait_and_kill() bump the inflight counter..?

It's unclear to me whether the io_ring_ctx_wait_and_kill() call is
coming through close(), dup2(), or simply exec(), but it looks like this
could potentially get delayed (from the current syscall) and thus pushed
into the exec() call. Maybe flush_delayed_fput() needs to be called
somewhere..?

Anyway, I could be completely off base here as I'm not really familiar
with the code, just wanted to share my notes.

Vegard

Jens Axboe Dec. 6, 2022, 9:15 p.m. UTC | #2

On 12/6/22 2:38?AM, Harshit Mogalapalli wrote:
> Syzkaller reports a NULL deref bug as follows:
> 
>  BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3
>  Read of size 4 at addr 0000000000000138 by task file1/1955
> 
>  CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0xcd/0x134
>   ? io_tctx_exit_cb+0x53/0xd3
>   kasan_report+0xbb/0x1f0
>   ? io_tctx_exit_cb+0x53/0xd3
>   kasan_check_range+0x140/0x190
>   io_tctx_exit_cb+0x53/0xd3
>   task_work_run+0x164/0x250
>   ? task_work_cancel+0x30/0x30
>   get_signal+0x1c3/0x2440
>   ? lock_downgrade+0x6e0/0x6e0
>   ? lock_downgrade+0x6e0/0x6e0
>   ? exit_signals+0x8b0/0x8b0
>   ? do_raw_read_unlock+0x3b/0x70
>   ? do_raw_spin_unlock+0x50/0x230
>   arch_do_signal_or_restart+0x82/0x2470
>   ? kmem_cache_free+0x260/0x4b0
>   ? putname+0xfe/0x140
>   ? get_sigframe_size+0x10/0x10
>   ? do_execveat_common.isra.0+0x226/0x710
>   ? lockdep_hardirqs_on+0x79/0x100
>   ? putname+0xfe/0x140
>   ? do_execveat_common.isra.0+0x238/0x710
>   exit_to_user_mode_prepare+0x15f/0x250
>   syscall_exit_to_user_mode+0x19/0x50
>   do_syscall_64+0x42/0xb0
>   entry_SYSCALL_64_after_hwframe+0x63/0xcd
>  RIP: 0023:0x0
>  Code: Unable to access opcode bytes at 0xffffffffffffffd6.
>  RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b
>  RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>  RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>   </TASK>
>  Kernel panic - not syncing: panic_on_warn set ...
> 
> Add a NULL check on tctx to prevent this.

I agree with Vegard that I don't think this is fixing the core of
the issue. I think what is happening here is that we don't run the
task_work in io_uring_cancel_generic() unconditionally, if we don't
need to in the loop above. But we do need to ensure we run it before
we clear current->io_uring.

Do you have a reproducer? If so, can you try the below? I _think_
this is all we need. We can't be hitting the delayed fput path as
the task isn't exiting, and we're dealing with current here.


diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 36cb63e4174f..4791d94c88f5 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3125,6 +3125,15 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
 
 	io_uring_clean_tctx(tctx);
 	if (cancel_all) {
+		/*
+		 * If we didn't run task_work in the loop above, ensure we
+		 * do so here. If an fput() queued up exit task_work for the
+		 * ring descriptor before we started the exec that led to this
+		 * cancelation, then we need to have that run before we proceed
+		 * with tearing down current->io_uring.
+		 */
+		io_run_task_work();
+
 		/*
 		 * We shouldn't run task_works after cancel, so just leave
 		 * ->in_idle set for normal exit.

Jens Axboe Dec. 6, 2022, 9:30 p.m. UTC | #3

On 12/6/22 2:15?PM, Jens Axboe wrote:
> On 12/6/22 2:38?AM, Harshit Mogalapalli wrote:
>> Syzkaller reports a NULL deref bug as follows:
>>
>>  BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3
>>  Read of size 4 at addr 0000000000000138 by task file1/1955
>>
>>  CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75
>>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
>>  Call Trace:
>>   <TASK>
>>   dump_stack_lvl+0xcd/0x134
>>   ? io_tctx_exit_cb+0x53/0xd3
>>   kasan_report+0xbb/0x1f0
>>   ? io_tctx_exit_cb+0x53/0xd3
>>   kasan_check_range+0x140/0x190
>>   io_tctx_exit_cb+0x53/0xd3
>>   task_work_run+0x164/0x250
>>   ? task_work_cancel+0x30/0x30
>>   get_signal+0x1c3/0x2440
>>   ? lock_downgrade+0x6e0/0x6e0
>>   ? lock_downgrade+0x6e0/0x6e0
>>   ? exit_signals+0x8b0/0x8b0
>>   ? do_raw_read_unlock+0x3b/0x70
>>   ? do_raw_spin_unlock+0x50/0x230
>>   arch_do_signal_or_restart+0x82/0x2470
>>   ? kmem_cache_free+0x260/0x4b0
>>   ? putname+0xfe/0x140
>>   ? get_sigframe_size+0x10/0x10
>>   ? do_execveat_common.isra.0+0x226/0x710
>>   ? lockdep_hardirqs_on+0x79/0x100
>>   ? putname+0xfe/0x140
>>   ? do_execveat_common.isra.0+0x238/0x710
>>   exit_to_user_mode_prepare+0x15f/0x250
>>   syscall_exit_to_user_mode+0x19/0x50
>>   do_syscall_64+0x42/0xb0
>>   entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>  RIP: 0023:0x0
>>  Code: Unable to access opcode bytes at 0xffffffffffffffd6.
>>  RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b
>>  RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>  RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>>  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>>  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>   </TASK>
>>  Kernel panic - not syncing: panic_on_warn set ...
>>
>> Add a NULL check on tctx to prevent this.
> 
> I agree with Vegard that I don't think this is fixing the core of
> the issue. I think what is happening here is that we don't run the
> task_work in io_uring_cancel_generic() unconditionally, if we don't
> need to in the loop above. But we do need to ensure we run it before
> we clear current->io_uring.
> 
> Do you have a reproducer? If so, can you try the below? I _think_
> this is all we need. We can't be hitting the delayed fput path as
> the task isn't exiting, and we're dealing with current here.

While I think the above is the right description of what happens, I
think there's still a race with the proposed solution. If the task_work
gets added right after the newly inserted io_run_task_work(), then we'll
still crash when the targeted task exits to userspace and runs the
task_work.

It should actually be fine to add that NULL check in io_tctx_exit_cb. We
can't be racing here, as both the clear and io_tctx_exit_cb() are run by
current itself. It's really just an ordering issue.

Jens Axboe Dec. 6, 2022, 10:51 p.m. UTC | #4

On 12/6/22 2:30 PM, Jens Axboe wrote:
> On 12/6/22 2:15?PM, Jens Axboe wrote:
>> On 12/6/22 2:38?AM, Harshit Mogalapalli wrote:
>>> Syzkaller reports a NULL deref bug as follows:
>>>
>>>  BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3
>>>  Read of size 4 at addr 0000000000000138 by task file1/1955
>>>
>>>  CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75
>>>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
>>>  Call Trace:
>>>   <TASK>
>>>   dump_stack_lvl+0xcd/0x134
>>>   ? io_tctx_exit_cb+0x53/0xd3
>>>   kasan_report+0xbb/0x1f0
>>>   ? io_tctx_exit_cb+0x53/0xd3
>>>   kasan_check_range+0x140/0x190
>>>   io_tctx_exit_cb+0x53/0xd3
>>>   task_work_run+0x164/0x250
>>>   ? task_work_cancel+0x30/0x30
>>>   get_signal+0x1c3/0x2440
>>>   ? lock_downgrade+0x6e0/0x6e0
>>>   ? lock_downgrade+0x6e0/0x6e0
>>>   ? exit_signals+0x8b0/0x8b0
>>>   ? do_raw_read_unlock+0x3b/0x70
>>>   ? do_raw_spin_unlock+0x50/0x230
>>>   arch_do_signal_or_restart+0x82/0x2470
>>>   ? kmem_cache_free+0x260/0x4b0
>>>   ? putname+0xfe/0x140
>>>   ? get_sigframe_size+0x10/0x10
>>>   ? do_execveat_common.isra.0+0x226/0x710
>>>   ? lockdep_hardirqs_on+0x79/0x100
>>>   ? putname+0xfe/0x140
>>>   ? do_execveat_common.isra.0+0x238/0x710
>>>   exit_to_user_mode_prepare+0x15f/0x250
>>>   syscall_exit_to_user_mode+0x19/0x50
>>>   do_syscall_64+0x42/0xb0
>>>   entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>>  RIP: 0023:0x0
>>>  Code: Unable to access opcode bytes at 0xffffffffffffffd6.
>>>  RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b
>>>  RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>>  RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>>>  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>>>  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>>   </TASK>
>>>  Kernel panic - not syncing: panic_on_warn set ...
>>>
>>> Add a NULL check on tctx to prevent this.
>>
>> I agree with Vegard that I don't think this is fixing the core of
>> the issue. I think what is happening here is that we don't run the
>> task_work in io_uring_cancel_generic() unconditionally, if we don't
>> need to in the loop above. But we do need to ensure we run it before
>> we clear current->io_uring.
>>
>> Do you have a reproducer? If so, can you try the below? I _think_
>> this is all we need. We can't be hitting the delayed fput path as
>> the task isn't exiting, and we're dealing with current here.
> 
> While I think the above is the right description of what happens, I
> think there's still a race with the proposed solution. If the task_work
> gets added right after the newly inserted io_run_task_work(), then we'll
> still crash when the targeted task exits to userspace and runs the
> task_work.
> 
> It should actually be fine to add that NULL check in io_tctx_exit_cb. We
> can't be racing here, as both the clear and io_tctx_exit_cb() are run by
> current itself. It's really just an ordering issue.

I've queued it up with an improved commit message, and also a code
comment:

https://git.kernel.dk/cgit/linux-block/commit/?h=for-6.2/io_uring-next&id=6d1b48314b989d059642958fc94ef0a58b25fc8c

Harshit Mogalapalli Dec. 7, 2022, 2:40 a.m. UTC | #5

On 07/12/22 2:45 am, Jens Axboe wrote:
> On 12/6/22 2:38?AM, Harshit Mogalapalli wrote:
>> Syzkaller reports a NULL deref bug as follows:
>>
>>   BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3
>>   Read of size 4 at addr 0000000000000138 by task file1/1955
>>
>>   CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75
>>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
>>   Call Trace:
>>    <TASK>
>>    dump_stack_lvl+0xcd/0x134
>>    ? io_tctx_exit_cb+0x53/0xd3
>>    kasan_report+0xbb/0x1f0
>>    ? io_tctx_exit_cb+0x53/0xd3
>>    kasan_check_range+0x140/0x190
>>    io_tctx_exit_cb+0x53/0xd3
>>    task_work_run+0x164/0x250
>>    ? task_work_cancel+0x30/0x30
>>    get_signal+0x1c3/0x2440
>>    ? lock_downgrade+0x6e0/0x6e0
>>    ? lock_downgrade+0x6e0/0x6e0
>>    ? exit_signals+0x8b0/0x8b0
>>    ? do_raw_read_unlock+0x3b/0x70
>>    ? do_raw_spin_unlock+0x50/0x230
>>    arch_do_signal_or_restart+0x82/0x2470
>>    ? kmem_cache_free+0x260/0x4b0
>>    ? putname+0xfe/0x140
>>    ? get_sigframe_size+0x10/0x10
>>    ? do_execveat_common.isra.0+0x226/0x710
>>    ? lockdep_hardirqs_on+0x79/0x100
>>    ? putname+0xfe/0x140
>>    ? do_execveat_common.isra.0+0x238/0x710
>>    exit_to_user_mode_prepare+0x15f/0x250
>>    syscall_exit_to_user_mode+0x19/0x50
>>    do_syscall_64+0x42/0xb0
>>    entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>   RIP: 0023:0x0
>>   Code: Unable to access opcode bytes at 0xffffffffffffffd6.
>>   RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b
>>   RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>   RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>   RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>>   R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>>   R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>    </TASK>
>>   Kernel panic - not syncing: panic_on_warn set ...
>>
>> Add a NULL check on tctx to prevent this.
> 
> I agree with Vegard that I don't think this is fixing the core of
> the issue. I think what is happening here is that we don't run the
> task_work in io_uring_cancel_generic() unconditionally, if we don't
> need to in the loop above. But we do need to ensure we run it before
> we clear current->io_uring.
> 
> Do you have a reproducer? If so, can you try the below? I _think_
> this is all we need. We can't be hitting the delayed fput path as
> the task isn't exiting, and we're dealing with current here.
> 
> 
Thanks Jens and Vegard for the suggestions and analysis.

Yes, the below patch silences the reproducer.

> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 36cb63e4174f..4791d94c88f5 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3125,6 +3125,15 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
>   
>   	io_uring_clean_tctx(tctx);
>   	if (cancel_all) {
> +		/*
> +		 * If we didn't run task_work in the loop above, ensure we
> +		 * do so here. If an fput() queued up exit task_work for the
> +		 * ring descriptor before we started the exec that led to this
> +		 * cancelation, then we need to have that run before we proceed
> +		 * with tearing down current->io_uring.
> +		 */
> +		io_run_task_work();
> +
>   		/*
>   		 * We shouldn't run task_works after cancel, so just leave
>   		 * ->in_idle set for normal exit.
> 

Thanks,
Harshit

diff mbox series

Patch

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8840cf3e20f2..20f7d8655b50 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2708,7 +2708,7 @@  static __cold void io_tctx_exit_cb(struct callback_head *cb)
 	 * When @in_idle, we're in cancellation and it's racy to remove the
 	 * node. It'll be removed by the end of cancellation, just ignore it.
 	 */
-	if (!atomic_read(&tctx->in_idle))
+	if (tctx && !atomic_read(&tctx->in_idle))
 		io_uring_del_tctx_node((unsigned long)work->ctx);
 	complete(&work->completion);
 }