[1/1] Add a new sysctl to disable io_uring system-wide

Message ID 20230627120058.2214509-2-matteorizzo@google.com
State New
Headers
Series Add a sysctl to disable io_uring system-wide |

Commit Message

Matteo Rizzo June 27, 2023, noon UTC
  Introduce a new sysctl (io_uring_disabled) which can be either 0 or 1.
When 0 (the default), all processes are allowed to create io_uring
instances, which is the current behavior. When 1, all calls to
io_uring_setup fail with -EPERM.

Signed-off-by: Matteo Rizzo <matteorizzo@google.com>
---
 Documentation/admin-guide/sysctl/kernel.rst | 14 ++++++++++++
 io_uring/io_uring.c                         | 24 +++++++++++++++++++++
 2 files changed, 38 insertions(+)
  

Comments

Randy Dunlap June 27, 2023, 4:23 p.m. UTC | #1
Hi--

On 6/27/23 05:00, Matteo Rizzo wrote:
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index d85d90f5d000..3c53a238332a 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -450,6 +450,20 @@ this allows system administrators to override the
>  ``IA64_THREAD_UAC_NOPRINT`` ``prctl`` and avoid logs being flooded.
>  
>  
> +io_uring_disabled
> +=========================
> +
> +Prevents all processes from creating new io_uring instances. Enabling this
> +shrinks the kernel's attack surface.
> +
> += =============================================================
> +0 All processes can create io_uring instances as normal. This is the default
> +  setting.
> +1 io_uring is disabled. io_uring_setup always fails with -EPERM. Existing
> +  io_uring instances can still be used.
> += =============================================================

These table lines should be extended at least as far as the text that they
enclose. I.e., the top and bottom lines should be like:

> += ==========================================================================

thanks.
  
Bart Van Assche June 27, 2023, 5:10 p.m. UTC | #2
On 6/27/23 05:00, Matteo Rizzo wrote:
> +Prevents all processes from creating new io_uring instances. Enabling this
> +shrinks the kernel's attack surface.
> +
> += =============================================================
> +0 All processes can create io_uring instances as normal. This is the default
> +  setting.
> +1 io_uring is disabled. io_uring_setup always fails with -EPERM. Existing
> +  io_uring instances can still be used.
> += =============================================================

I'm using fio + io_uring all the time on Android devices. I think we need a
better solution than disabling io_uring system-wide, e.g. a mechanism based
on SELinux that disables io_uring for apps and that keeps io_uring enabled
for processes started via 'adb root && adb shell ...'

Bart.
  
Matteo Rizzo June 27, 2023, 6:15 p.m. UTC | #3
On Tue, 27 Jun 2023 at 19:10, Bart Van Assche <bvanassche@acm.org> wrote:
> I'm using fio + io_uring all the time on Android devices. I think we need a
> better solution than disabling io_uring system-wide, e.g. a mechanism based
> on SELinux that disables io_uring for apps and that keeps io_uring enabled
> for processes started via 'adb root && adb shell ...'

Android already uses seccomp to prevent untrusted applications from using
io_uring. This patch is aimed at server/desktop environments where there is
no easy way to set a system-wide seccomp policy and right now the only way
to disable io_uring system-wide is to compile it out of the kernel entirely
(not really feasible for e.g. a general-purpose distro).

I thought about adding a capability check that lets privileged processes
bypass this sysctl, but it wasn't clear to me which capability I should use.
For userfaultfd the kernel uses CAP_SYS_PTRACE, but I wasn't sure that's
the best choice here since io_uring has nothing to do with ptrace.
If anyone has any suggestions please let me know. A LSM hook also sounds
like an option but it would be more complicated to implement and use.
  
Ricardo Ribalda June 28, 2023, 11:36 a.m. UTC | #4
Hi Matteo

On Tue, 27 Jun 2023 at 20:15, Matteo Rizzo <matteorizzo@google.com> wrote:
>
> On Tue, 27 Jun 2023 at 19:10, Bart Van Assche <bvanassche@acm.org> wrote:
> > I'm using fio + io_uring all the time on Android devices. I think we need a
> > better solution than disabling io_uring system-wide, e.g. a mechanism based
> > on SELinux that disables io_uring for apps and that keeps io_uring enabled
> > for processes started via 'adb root && adb shell ...'
>
> Android already uses seccomp to prevent untrusted applications from using
> io_uring. This patch is aimed at server/desktop environments where there is
> no easy way to set a system-wide seccomp policy and right now the only way
> to disable io_uring system-wide is to compile it out of the kernel entirely
> (not really feasible for e.g. a general-purpose distro).
>
> I thought about adding a capability check that lets privileged processes
> bypass this sysctl, but it wasn't clear to me which capability I should use.
> For userfaultfd the kernel uses CAP_SYS_PTRACE, but I wasn't sure that's
> the best choice here since io_uring has nothing to do with ptrace.
> If anyone has any suggestions please let me know. A LSM hook also sounds
> like an option but it would be more complicated to implement and use.

Have you considered that the new sysctl is "sticky like kexec_load_disabled.
When the user disables it there is no way to turn it back on until the
system is rebooted.

Best regards!
  
Gabriel Krisman Bertazi June 28, 2023, 1:50 p.m. UTC | #5
Matteo Rizzo <matteorizzo@google.com> writes:

> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index d85d90f5d000..3c53a238332a 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -450,6 +450,20 @@ this allows system administrators to override the
>  ``IA64_THREAD_UAC_NOPRINT`` ``prctl`` and avoid logs being flooded.
>  
>  
> +io_uring_disabled
> +=========================
> +
> +Prevents all processes from creating new io_uring instances. Enabling this
> +shrinks the kernel's attack surface.
> +
> += =============================================================
> +0 All processes can create io_uring instances as normal. This is the default
> +  setting.
> +1 io_uring is disabled. io_uring_setup always fails with -EPERM. Existing
> +  io_uring instances can still be used.
> += =============================================================

I had an internal request for something like this recently.  If we go
this route, we could use a intermediary option that limits io_uring
to root processes only.
  
Matteo Rizzo June 28, 2023, 3:12 p.m. UTC | #6
On Wed, 28 Jun 2023 at 13:44, Ricardo Ribalda <ribalda@chromium.org> wrote:
>
> Have you considered that the new sysctl is "sticky like kexec_load_disabled.
> When the user disables it there is no way to turn it back on until the
> system is rebooted.

Are you suggesting making this sysctl sticky? Are there any examples of how to
implement a sticky sysctl that can take more than 2 values in case we want to
add an intermediate level that still allows privileged processes to use
io_uring? Also, what would be the use case? Preventing privileged processes
from re-enabling io_uring?

Thanks!
--
Matteo
  
Jeff Moyer June 28, 2023, 3:59 p.m. UTC | #7
Gabriel Krisman Bertazi <krisman@suse.de> writes:

> Matteo Rizzo <matteorizzo@google.com> writes:
>
>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
>> index d85d90f5d000..3c53a238332a 100644
>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>> @@ -450,6 +450,20 @@ this allows system administrators to override the
>>  ``IA64_THREAD_UAC_NOPRINT`` ``prctl`` and avoid logs being flooded.
>>  
>>  
>> +io_uring_disabled
>> +=========================
>> +
>> +Prevents all processes from creating new io_uring instances. Enabling this
>> +shrinks the kernel's attack surface.
>> +
>> += =============================================================
>> +0 All processes can create io_uring instances as normal. This is the default
>> +  setting.
>> +1 io_uring is disabled. io_uring_setup always fails with -EPERM. Existing
>> +  io_uring instances can still be used.
>> += =============================================================
>
> I had an internal request for something like this recently.  If we go
> this route, we could use a intermediary option that limits io_uring
> to root processes only.

This is all regrettable, but this option makes the most sense to me.
Testing for CAP_SYS_ADMIN or CAP_SYS_RAW_IO would work for that third
option, I think.

-Jeff
  
Jeff Moyer June 28, 2023, 3:59 p.m. UTC | #8
Matteo Rizzo <matteorizzo@google.com> writes:

> On Wed, 28 Jun 2023 at 13:44, Ricardo Ribalda <ribalda@chromium.org> wrote:
>>
>> Have you considered that the new sysctl is "sticky like kexec_load_disabled.
>> When the user disables it there is no way to turn it back on until the
>> system is rebooted.
>
> Are you suggesting making this sysctl sticky? Are there any examples of how to
> implement a sticky sysctl that can take more than 2 values in case we want to
> add an intermediate level that still allows privileged processes to use
> io_uring? Also, what would be the use case? Preventing privileged processes
> from re-enabling io_uring?

See unprivileged_bpf_disabled for an example.  I can't speak to the use
case for a sticky value.

-Jeff
  
Ricardo Ribalda June 28, 2023, 3:59 p.m. UTC | #9
HI Matteo

On Wed, 28 Jun 2023 at 17:12, Matteo Rizzo <matteorizzo@google.com> wrote:
>
> On Wed, 28 Jun 2023 at 13:44, Ricardo Ribalda <ribalda@chromium.org> wrote:
> >
> > Have you considered that the new sysctl is "sticky like kexec_load_disabled.
> > When the user disables it there is no way to turn it back on until the
> > system is rebooted.
>
> Are you suggesting making this sysctl sticky? Are there any examples of how to
> implement a sticky sysctl that can take more than 2 values in case we want to
> add an intermediate level that still allows privileged processes to use
> io_uring? Also, what would be the use case? Preventing privileged processes
> from re-enabling io_uring?

Yes, if this sysctl is accepted, I think it would make sense to make it sticky.

For more than one value take a look to  kexec_load_limit_reboot and
kexec_load_limit_panic

Thanks!

>
> Thanks!
> --
> Matteo
  

Patch

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index d85d90f5d000..3c53a238332a 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -450,6 +450,20 @@  this allows system administrators to override the
 ``IA64_THREAD_UAC_NOPRINT`` ``prctl`` and avoid logs being flooded.
 
 
+io_uring_disabled
+=========================
+
+Prevents all processes from creating new io_uring instances. Enabling this
+shrinks the kernel's attack surface.
+
+= =============================================================
+0 All processes can create io_uring instances as normal. This is the default
+  setting.
+1 io_uring is disabled. io_uring_setup always fails with -EPERM. Existing
+  io_uring instances can still be used.
+= =============================================================
+
+
 kexec_load_disabled
 ===================
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1b53a2ab0a27..0496ae7017f7 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -153,6 +153,22 @@  static __cold void io_fallback_tw(struct io_uring_task *tctx);
 
 struct kmem_cache *req_cachep;
 
+static int __read_mostly sysctl_io_uring_disabled;
+#ifdef CONFIG_SYSCTL
+static struct ctl_table kernel_io_uring_disabled_table[] = {
+	{
+		.procname	= "io_uring_disabled",
+		.data		= &sysctl_io_uring_disabled,
+		.maxlen		= sizeof(sysctl_io_uring_disabled),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+	{},
+};
+#endif
+
 struct sock *io_uring_get_socket(struct file *file)
 {
 #if defined(CONFIG_UNIX)
@@ -4003,6 +4019,9 @@  static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 SYSCALL_DEFINE2(io_uring_setup, u32, entries,
 		struct io_uring_params __user *, params)
 {
+	if (sysctl_io_uring_disabled)
+		return -EPERM;
+
 	return io_uring_setup(entries, params);
 }
 
@@ -4577,6 +4596,11 @@  static int __init io_uring_init(void)
 
 	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC |
 				SLAB_ACCOUNT | SLAB_TYPESAFE_BY_RCU);
+
+#ifdef CONFIG_SYSCTL
+	register_sysctl_init("kernel", kernel_io_uring_disabled_table);
+#endif
+
 	return 0;
 };
 __initcall(io_uring_init);