[6/7] rust: file: add `DeferredFdCloser`

Message ID 20231129-alice-file-v1-6-f81afe8c7261@google.com
State New
Series File abstractions needed by Rust Binder

Commit Message

Alice Ryhl Nov. 29, 2023, 1:12 p.m. UTC
To close an fd from kernel space, we could call `ksys_close`. However,
if we do this to an fd that is held using `fdget`, then we may trigger a
use-after-free. Introduce a helper that can be used to close an fd even
if the fd is currently held with `fdget`. This is done by grabbing an
extra refcount to the file and dropping it in a task work once we return
to userspace.

This is necessary for Rust Binder because otherwise the user might try
to have Binder close its fd for /dev/binder, which would cause problems
as this happens inside an ioctl on /dev/binder, and ioctls hold the fd
using `fdget`.

Additional motivation can be found in commit 80cd795630d6 ("binder: fix
use-after-free due to ksys_close() during fdget()") and in the comments
on `binder_do_fd_close`.

If there is some way to detect whether an fd is currently held with
`fdget`, then this could be optimized to skip the allocation and task
work when this is not the case. Another possible optimization would be
to combine several fds into a single task work, since this is used with
fd arrays that might hold several fds.

That said, it might not be necessary to optimize it, because Rust Binder
has two ways to send fds: BINDER_TYPE_FD and BINDER_TYPE_FDA. With
BINDER_TYPE_FD, it is userspace's responsibility to close the fd, so
this mechanism is used only by BINDER_TYPE_FDA, but fd arrays are used
rarely these days.

Signed-off-by: Alice Ryhl <aliceryhl@google.com>
---
 rust/bindings/bindings_helper.h |  2 +
 rust/helpers.c                  |  8 ++++
 rust/kernel/file.rs             | 84 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 93 insertions(+), 1 deletion(-)
  

Comments

Benno Lossin Nov. 30, 2023, 5:12 p.m. UTC | #1
On 11/29/23 14:12, Alice Ryhl wrote:
> +    /// Schedule a task work that closes the file descriptor when this task returns to userspace.
> +    pub fn close_fd(mut self, fd: u32) {
> +        use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
> +
> +        let file = unsafe { bindings::close_fd_get_file(fd) };
> +        if file.is_null() {
> +            // Nothing further to do. The allocation is freed by the destructor of `self.inner`.
> +            return;
> +        }
> +
> +        self.inner.file = file;
> +
> +        // SAFETY: Since `DeferredFdCloserInner` is `#[repr(C)]`, casting the pointers gives a
> +        // pointer to the `twork` field.
> +        let inner = Box::into_raw(self.inner) as *mut bindings::callback_head;

Here you can just use `.cast::<...>()`.

> +        // SAFETY: Getting a pointer to current is always safe.
> +        let current = unsafe { bindings::get_current() };
> +        // SAFETY: The `file` pointer points at a valid file.
> +        unsafe { bindings::get_file(file) };
> +        // SAFETY: Due to the above `get_file`, even if the current task holds an `fdget` to
> +        // this file right now, the refcount will not drop to zero until after it is released
> +        // with `fdput`. This is because when using `fdget`, you must always use `fdput` before
> +        // returning to userspace, and our task work runs after any `fdget` users have returned
> +        // to userspace.
> +        //
> +        // Note: fl_owner_t is currently a void pointer.
> +        unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
> +        // SAFETY: The `inner` pointer is compatible with the `do_close_fd` method.
> +        unsafe { bindings::init_task_work(inner, Some(Self::do_close_fd)) };
> +        // SAFETY: The `inner` pointer points at a valid and fully initialized task work that is
> +        // ready to be scheduled.
> +        unsafe { bindings::task_work_add(current, inner, TWA_RESUME) };

I am a bit confused, when does `do_close_fd` actually run? Does
`TWA_RESUME` mean that `inner` is scheduled to run after the current
task has been completed?

> +    }
> +
> +    // SAFETY: This function is an implementation detail of `close_fd`, so its safety comments
> +    // should be read in extension of that method.
> +    unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
> +        // SAFETY: In `close_fd` we use this method together with a pointer that originates from a
> +        // `Box<DeferredFdCloserInner>`, and we have just been given ownership of that allocation.
> +        let inner = unsafe { Box::from_raw(inner as *mut DeferredFdCloserInner) };

In order for this call to be sound, `inner` must be an exclusive
pointer (including any possible references into the `callback_head`).
Is this the case?
  
Alice Ryhl Dec. 1, 2023, 11:35 a.m. UTC | #2
Benno Lossin <benno.lossin@proton.me> writes:
>> +        // SAFETY: The `inner` pointer points at a valid and fully initialized task work that is
>> +        // ready to be scheduled.
>> +        unsafe { bindings::task_work_add(current, inner, TWA_RESUME) };
> 
> I am a bit confused, when does `do_close_fd` actually run? Does
> `TWA_RESUME` mean that `inner` is scheduled to run after the current
> task has been completed?

When the current syscall returns to userspace.

>> +    // SAFETY: This function is an implementation detail of `close_fd`, so its safety comments
>> +    // should be read in extension of that method.
>> +    unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
>> +        // SAFETY: In `close_fd` we use this method together with a pointer that originates from a
>> +        // `Box<DeferredFdCloserInner>`, and we have just been given ownership of that allocation.
>> +        let inner = unsafe { Box::from_raw(inner as *mut DeferredFdCloserInner) };
> 
> In order for this call to be sound, `inner` must be an exclusive
> pointer (including any possible references into the `callback_head`).
> Is this the case?

Yes, when this is called, it's been removed from the linked list of task
work. That's why we can kfree it.

>> +        // SAFETY: Since `DeferredFdCloserInner` is `#[repr(C)]`, casting the pointers gives a
>> +        // pointer to the `twork` field.
>> +        let inner = Box::into_raw(self.inner) as *mut bindings::callback_head;
> 
> Here you can just use `.cast::<...>()`.

Will do.

Alice
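
For readers following the `.cast::<...>()` suggestion and the question of
pointer exclusivity, here is a minimal userspace sketch of the round trip.
It is not the kernel code: `CallbackHead` and `Inner` are hypothetical
stand-ins for `bindings::callback_head` and `DeferredFdCloserInner`, and
`round_trip` exists only for illustration. It assumes, as the patch does,
that the header is the first field of a `#[repr(C)]` struct.

```rust
use std::mem::MaybeUninit;

/// Hypothetical stand-in for `bindings::callback_head`.
#[allow(dead_code)]
#[repr(C)]
struct CallbackHead {
    next: *mut CallbackHead,
    func: Option<unsafe extern "C" fn(*mut CallbackHead)>,
}

/// Hypothetical stand-in for `DeferredFdCloserInner`: the header comes first.
#[repr(C)]
struct Inner {
    twork: MaybeUninit<CallbackHead>,
    payload: u32,
}

/// Turn the box into a pointer to its first field, using `cast` rather than `as`.
fn into_head(inner: Box<Inner>) -> *mut CallbackHead {
    // With `#[repr(C)]`, the first field sits at offset 0, so the pointer to
    // the whole struct is also a pointer to `twork`.
    Box::into_raw(inner).cast::<CallbackHead>()
}

/// Recover ownership of the allocation.
///
/// # Safety
///
/// `head` must come from `into_head`, and nothing else may still use it.
unsafe fn from_head(head: *mut CallbackHead) -> Box<Inner> {
    // SAFETY: Guaranteed by the caller; the pointer originates from a `Box<Inner>`.
    unsafe { Box::from_raw(head.cast::<Inner>()) }
}

/// Send a value through the raw-pointer round trip and read it back.
fn round_trip(payload: u32) -> u32 {
    let head = into_head(Box::new(Inner {
        twork: MaybeUninit::uninit(),
        payload,
    }));
    // SAFETY: `head` was just produced by `into_head` and is not shared.
    let inner = unsafe { from_head(head) };
    inner.payload
}
```

Compared with `as`, `cast` cannot accidentally change mutability or turn the
pointer into an integer, which is presumably why it was suggested.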
  
Benno Lossin Dec. 2, 2023, 10:16 a.m. UTC | #3
On 12/1/23 12:35, Alice Ryhl wrote:
> Benno Lossin <benno.lossin@proton.me> writes:
>>> +        // SAFETY: The `inner` pointer points at a valid and fully initialized task work that is
>>> +        // ready to be scheduled.
>>> +        unsafe { bindings::task_work_add(current, inner, TWA_RESUME) };
>>
>> I am a bit confused, when does `do_close_fd` actually run? Does
>> `TWA_RESUME` mean that `inner` is scheduled to run after the current
>> task has been completed?
> 
> When the current syscall returns to userspace.

What happens when I use `DeferredFdCloser` outside of a syscall? Will
it never run? Maybe add some documentation about that?

>>> +    // SAFETY: This function is an implementation detail of `close_fd`, so its safety comments
>>> +    // should be read in extension of that method.
>>> +    unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
>>> +        // SAFETY: In `close_fd` we use this method together with a pointer that originates from a
>>> +        // `Box<DeferredFdCloserInner>`, and we have just been given ownership of that allocation.
>>> +        let inner = unsafe { Box::from_raw(inner as *mut DeferredFdCloserInner) };
>>
>> In order for this call to be sound, `inner` must be an exclusive
>> pointer (including any possible references into the `callback_head`).
>> Is this the case?
> 
> Yes, when this is called, it's been removed from the linked list of task
> work. That's why we can kfree it.

Please add this to the SAFETY comment.
  
Alice Ryhl Dec. 5, 2023, 2:43 p.m. UTC | #4
Benno Lossin <benno.lossin@proton.me> writes:
> On 12/1/23 12:35, Alice Ryhl wrote:
>> Benno Lossin <benno.lossin@proton.me> writes:
>>>> +        // SAFETY: The `inner` pointer points at a valid and fully initialized task work that is
>>>> +        // ready to be scheduled.
>>>> +        unsafe { bindings::task_work_add(current, inner, TWA_RESUME) };
>>>
>>> I am a bit confused, when does `do_close_fd` actually run? Does
>>> `TWA_RESUME` mean that `inner` is scheduled to run after the current
>>> task has been completed?
>> 
>> When the current syscall returns to userspace.
> 
> What happens when I use `DeferredFdCloser` outside of a syscall? Will
> it never run? Maybe add some documentation about that?

Christian Brauner, I think I need your help here.

I spent a bunch of time today trying to understand the correct way of
closing an fd held with fdget, and I'm unsure what the best way is.

So, first, `task_work_add` only really works when we're called from a
syscall. For one, it's fallible, and for another, you shouldn't even
attempt to use it from a kthread. (See e.g., the implementation of
`fput` in `fs/file_table.c`.)

To handle the above, we could fall back to the workqueue and schedule
the `fput` there when we are on a kthread or `task_work_add` fails. And
since I don't really care about the performance of this utility, let's
say we just unconditionally use the workqueue to simplify the
implementation.

However, it's not clear to me that this is okay. Consider this
execution: (please compare to `binder_deferred_fd_close`)

    Thread A                Thread B (workqueue)
    fdget()
    close_fd_get_file()
    get_file()
    filp_close()
    schedule_work(do_close_fd)
    // we are preempted
                            fput()
    fdput()

And now, since the workqueue can run before thread A returns to
userspace, we are in trouble again, right? Unless I missed an upgrade
to shared file descriptor somewhere that somehow makes this okay? I
looked around the C code and couldn't find one and I guess such an
upgrade has to happen before the call to `fdget` anyway?

In Binder, the above is perfectly fine since it closes the fd from a
context where `task_work_add` will always work, and a task work
definitely runs after the `fdput`. But I added this as a utility in the
shared kernel crate, and I want to avoid the situation where someone
comes along later and uses it from a kthread, gets the fallback to
workqueue, and then has a UAF due to the previously mentioned
execution...

What do you advise that I do?

Maybe the answer is just that, if you're in a context where it makes
sense to talk about an fd of the current task, then task_work_add will
also definitely work? So if `task_work_add` won't work, then
`close_fd_get_file` will return a null pointer and we never reach the
`task_work_add`. This seems fragile though.

Alice
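
The interleaving above can be traced with a small single-threaded toy model
of the reference count. This is only a sketch under stated assumptions:
`ToyFile`, `Outcome` and `simulate` are hypothetical names, the counter is a
plain integer rather than the kernel's atomic count, and `fdget` is modelled
as the non-shared case where it borrows the file without taking a reference.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    Safe,
    UseAfterFree,
}

/// Hypothetical toy file: only the reference count is modelled.
struct ToyFile {
    refcount: u32,
}

impl ToyFile {
    fn get(&mut self) {
        self.refcount += 1;
    }
    fn put(&mut self) {
        self.refcount -= 1;
    }
    fn is_freed(&self) -> bool {
        self.refcount == 0
    }
}

/// Replay the sequence from the message. The flag chooses whether the deferred
/// `fput` runs before `fdput` (workqueue) or after it (task work).
fn simulate(deferred_fput_before_fdput: bool) -> Outcome {
    // The fd table owns the only reference.
    let mut file = ToyFile { refcount: 1 };

    // fdget(): assumed to borrow without incrementing the count.
    // close_fd_get_file(): the fd table's reference moves to us; count unchanged.
    file.get(); // get_file(): extra reference for the deferred work, count is 2
    file.put(); // filp_close(): drops the reference taken from the fd table, count is 1

    if deferred_fput_before_fdput {
        file.put(); // workqueue runs fput() early, count is 0
    }

    // Thread A is still using the borrowed file right up to fdput().
    let outcome = if file.is_freed() {
        Outcome::UseAfterFree
    } else {
        Outcome::Safe
    };

    if !deferred_fput_before_fdput {
        file.put(); // task work runs fput() after the return to userspace
    }
    outcome
}
```

In this model `simulate(true)` reports `UseAfterFree` and `simulate(false)`
reports `Safe`, matching the concern that a workqueue fallback could reopen
the bug that the task work avoids.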
  
Alice Ryhl Dec. 5, 2023, 6:16 p.m. UTC | #5
On Tue, Dec 5, 2023 at 3:43 PM Alice Ryhl <aliceryhl@google.com> wrote:
>
> Benno Lossin <benno.lossin@proton.me> writes:
> > On 12/1/23 12:35, Alice Ryhl wrote:
> >> Benno Lossin <benno.lossin@proton.me> writes:
> >>>> +        // SAFETY: The `inner` pointer points at a valid and fully initialized task work that is
> >>>> +        // ready to be scheduled.
> >>>> +        unsafe { bindings::task_work_add(current, inner, TWA_RESUME) };
> >>>
> >>> I am a bit confused, when does `do_close_fd` actually run? Does
> >>> `TWA_RESUME` mean that `inner` is scheduled to run after the current
> >>> task has been completed?
> >>
> >> When the current syscall returns to userspace.
> >
> > What happens when I use `DeferredFdCloser` outside of a syscall? Will
> > it never run? Maybe add some documentation about that?
>
> Christian Brauner, I think I need your help here.
>
> I spent a bunch of time today trying to understand the correct way of
> closing an fd held with fdget, and I'm unsure what the best way is.
>
> So, first, `task_work_add` only really works when we're called from a
> syscall. For one, it's fallible, and for another, you shouldn't even
> attempt to use it from a kthread. (See e.g., the implementation of
> `fput` in `fs/file_table.c`.)
>
> To handle the above, we could fall back to the workqueue and schedule
> the `fput` there when we are on a kthread or `task_work_add` fails. And
> since I don't really care about the performance of this utility, let's
> say we just unconditionally use the workqueue to simplify the
> implementation.
>
> However, it's not clear to me that this is okay. Consider this
> execution: (please compare to `binder_deferred_fd_close`)
>
>     Thread A                Thread B (workqueue)
>     fdget()
>     close_fd_get_file()
>     get_file()
>     filp_close()
>     schedule_work(do_close_fd)
>     // we are preempted
>                             fput()
>     fdput()
>
> And now, since the workqueue can run before thread A returns to
> userspace, we are in trouble again, right? Unless I missed an upgrade
> to shared file descriptor somewhere that somehow makes this okay? I
> looked around the C code and couldn't find one and I guess such an
> upgrade has to happen before the call to `fdget` anyway?
>
> In Binder, the above is perfectly fine since it closes the fd from a
> context where `task_work_add` will always work, and a task work
> definitely runs after the `fdput`. But I added this as a utility in the
> shared kernel crate, and I want to avoid the situation where someone
> comes along later and uses it from a kthread, gets the fallback to
> workqueue, and then has a UAF due to the previously mentioned
> execution...
>
> What do you advise that I do?
>
> Maybe the answer is just that, if you're in a context where it makes
> sense to talk about an fd of the current task, then task_work_add will
> also definitely work? So if `task_work_add` won't work, then
> `close_fd_get_file` will return a null pointer and we never reach the
> `task_work_add`. This seems fragile though.
>
> Alice

Ah! I realized that there's another option: Report an error if we
can't schedule the task work.

I didn't suggest this originally because I didn't want to leak the
file in the error path, and I couldn't think of anything else sane to
do.

But! We can schedule the task work *first*, then attempt to close the
file. This way, the file doesn't get closed in the error path. And
there's no race condition since the task work is guaranteed to get
scheduled later on the same thread, so there's no way for it to get
executed in between us scheduling it and closing the file.

Thoughts?

Alice
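
The proposed ordering (schedule first, close second) can be sketched as a
toy model. Everything here is hypothetical: `ToyTask`, `CloseError` and the
method names are illustrative, `accepts_task_work` stands in for whether
`task_work_add` would succeed, and treating an unknown fd as a no-op task
work is an assumption of this sketch rather than something stated in the
thread.

```rust
#[derive(Debug, PartialEq)]
enum CloseError {
    /// The task work could not be scheduled; the fd is left open.
    CannotSchedule,
    /// The fd was not open in this task.
    BadFd,
}

/// Hypothetical model of a task with an fd table and a task-work list.
struct ToyTask {
    accepts_task_work: bool,
    pending: Vec<Option<u32>>,
    open_fds: Vec<u32>,
}

impl ToyTask {
    fn new(accepts_task_work: bool, open_fds: Vec<u32>) -> Self {
        Self { accepts_task_work, pending: Vec::new(), open_fds }
    }

    /// Schedule the deferred work first, then remove the fd from the table.
    fn close_fd(&mut self, fd: u32) -> Result<(), CloseError> {
        if !self.accepts_task_work {
            // Nothing has been closed yet, so there is nothing to leak.
            return Err(CloseError::CannotSchedule);
        }
        // Queue an empty work item. It cannot run before we return, because
        // it only runs on this same task when it goes back to userspace.
        self.pending.push(None);
        let slot = self.pending.len() - 1;

        match self.open_fds.iter().position(|&f| f == fd) {
            Some(idx) => {
                self.open_fds.remove(idx);
                self.pending[slot] = Some(fd);
                Ok(())
            }
            // The queued work item stays empty and does nothing later.
            None => Err(CloseError::BadFd),
        }
    }

    /// Run all queued work; returns the fds whose final `fput` ran.
    fn return_to_userspace(&mut self) -> Vec<u32> {
        self.pending.drain(..).flatten().collect()
    }
}
```

The point of the ordering is visible in the `CannotSchedule` branch: the
error is reported before the fd table is touched, so the caller still owns
an open fd and can decide what to do with it.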
  

Patch

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 700f01840188..c8daee341df6 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -9,6 +9,7 @@ 
 #include <kunit/test.h>
 #include <linux/cred.h>
 #include <linux/errname.h>
+#include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/pid_namespace.h>
@@ -17,6 +18,7 @@ 
 #include <linux/refcount.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
+#include <linux/task_work.h>
 #include <linux/workqueue.h>
 
 /* `bindgen` gets confused at certain things. */
diff --git a/rust/helpers.c b/rust/helpers.c
index 58e3a9dff349..d146bbf25aec 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -32,6 +32,7 @@ 
 #include <linux/sched/signal.h>
 #include <linux/security.h>
 #include <linux/spinlock.h>
+#include <linux/task_work.h>
 #include <linux/wait.h>
 #include <linux/workqueue.h>
 
@@ -243,6 +244,13 @@  void rust_helper_security_release_secctx(char *secdata, u32 seclen)
 EXPORT_SYMBOL_GPL(rust_helper_security_release_secctx);
 #endif
 
+void rust_helper_init_task_work(struct callback_head *twork,
+				task_work_func_t func)
+{
+	init_task_work(twork, func);
+}
+EXPORT_SYMBOL_GPL(rust_helper_init_task_work);
+
 /*
  * `bindgen` binds the C `size_t` type as the Rust `usize` type, so we can
  * use it in contexts where Rust expects a `usize` like slice (array) indices.
diff --git a/rust/kernel/file.rs b/rust/kernel/file.rs
index 2186a6ea3f2f..578ee307093f 100644
--- a/rust/kernel/file.rs
+++ b/rust/kernel/file.rs
@@ -11,7 +11,8 @@ 
     error::{code::*, Error, Result},
     types::{ARef, AlwaysRefCounted, Opaque},
 };
-use core::{marker::PhantomData, ptr};
+use alloc::boxed::Box;
+use core::{alloc::AllocError, marker::PhantomData, mem, ptr};
 
 /// Flags associated with a [`File`].
 pub mod flags {
@@ -242,6 +243,87 @@  fn drop(&mut self) {
     }
 }
 
+/// Helper used for closing file descriptors in a way that is safe even if the file is currently
+/// held using `fdget`.
+///
+/// Additional motivation can be found in commit 80cd795630d6 ("binder: fix use-after-free due to
+/// ksys_close() during fdget()") and in the comments on `binder_do_fd_close`.
+pub struct DeferredFdCloser {
+    inner: Box<DeferredFdCloserInner>,
+}
+
+/// SAFETY: This just holds an allocation with no real content, so there's no safety issue with
+/// moving it across threads.
+unsafe impl Send for DeferredFdCloser {}
+unsafe impl Sync for DeferredFdCloser {}
+
+#[repr(C)]
+struct DeferredFdCloserInner {
+    twork: mem::MaybeUninit<bindings::callback_head>,
+    file: *mut bindings::file,
+}
+
+impl DeferredFdCloser {
+    /// Create a new [`DeferredFdCloser`].
+    pub fn new() -> Result<Self, AllocError> {
+        Ok(Self {
+            inner: Box::try_new(DeferredFdCloserInner {
+                twork: mem::MaybeUninit::uninit(),
+                file: core::ptr::null_mut(),
+            })?,
+        })
+    }
+
+    /// Schedule a task work that closes the file descriptor when this task returns to userspace.
+    pub fn close_fd(mut self, fd: u32) {
+        use bindings::task_work_notify_mode_TWA_RESUME as TWA_RESUME;
+
+        let file = unsafe { bindings::close_fd_get_file(fd) };
+        if file.is_null() {
+            // Nothing further to do. The allocation is freed by the destructor of `self.inner`.
+            return;
+        }
+
+        self.inner.file = file;
+
+        // SAFETY: Since `DeferredFdCloserInner` is `#[repr(C)]`, casting the pointers gives a
+        // pointer to the `twork` field.
+        let inner = Box::into_raw(self.inner) as *mut bindings::callback_head;
+
+        // SAFETY: Getting a pointer to current is always safe.
+        let current = unsafe { bindings::get_current() };
+        // SAFETY: The `file` pointer points at a valid file.
+        unsafe { bindings::get_file(file) };
+        // SAFETY: Due to the above `get_file`, even if the current task holds an `fdget` to
+        // this file right now, the refcount will not drop to zero until after it is released
+        // with `fdput`. This is because when using `fdget`, you must always use `fdput` before
+        // returning to userspace, and our task work runs after any `fdget` users have returned
+        // to userspace.
+        //
+        // Note: fl_owner_t is currently a void pointer.
+        unsafe { bindings::filp_close(file, (*current).files as bindings::fl_owner_t) };
+        // SAFETY: The `inner` pointer is compatible with the `do_close_fd` method.
+        unsafe { bindings::init_task_work(inner, Some(Self::do_close_fd)) };
+        // SAFETY: The `inner` pointer points at a valid and fully initialized task work that is
+        // ready to be scheduled.
+        unsafe { bindings::task_work_add(current, inner, TWA_RESUME) };
+    }
+
+    // SAFETY: This function is an implementation detail of `close_fd`, so its safety comments
+    // should be read in extension of that method.
+    unsafe extern "C" fn do_close_fd(inner: *mut bindings::callback_head) {
+        // SAFETY: In `close_fd` we use this method together with a pointer that originates from a
+        // `Box<DeferredFdCloserInner>`, and we have just been given ownership of that allocation.
+        let inner = unsafe { Box::from_raw(inner as *mut DeferredFdCloserInner) };
+        // SAFETY: This drops a refcount we acquired in `close_fd`. Since this callback runs in a
+        // task work after we return to userspace, it is guaranteed that the current thread doesn't
+        // hold this file with `fdget`, as `fdget` must be released before returning to userspace.
+        unsafe { bindings::fput(inner.file) };
+        // Free the allocation.
+        drop(inner);
+    }
+}
+
 /// Represents the `EBADF` error code.
 ///
 /// Used for methods that can only fail with `EBADF`.