Message ID | 20230120102512.3195094-1-gscrivan@redhat.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp121160wrn; Fri, 20 Jan 2023 02:30:30 -0800 (PST) X-Google-Smtp-Source: AMrXdXt4wH3UYASjiblcuxhtnUsde2iPhZZ47gp2x/D0Pwx14IjiA4kvdqgdA6oYUXAQCmugBqAs X-Received: by 2002:a17:902:e548:b0:194:687c:e883 with SMTP id n8-20020a170902e54800b00194687ce883mr20538449plf.37.1674210630059; Fri, 20 Jan 2023 02:30:30 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1674210630; cv=none; d=google.com; s=arc-20160816; b=EwV3WFoc5SC7NMbjPnBOd9US0b/xjCgMP0e+hSTw64mQDnCOfYTqoO73XqU0nuX6Cy hz5Et7C6Tys6lmBeqdkeH5x7z5SKGOurm+LmGOLPSL6fmFHyz/w7Qn8u3u1k1Nk1tVHv RyLSdA+WJgbQ9bfIUjvUlVCTJQtcu4B6E3K6v53NQeUNIq23Q60Zsa52+pIc8t66QiZh dn3KQIMnzS4RWiW8vpvZMde28/+0pOOcHOT5hOiumGhRrQ6ZT/WkOUCkD6BHgx8hnR2x 0BkBDCi38DTvLbNyuz3i29BWD/klzHn5lUJGODHCF2lYtNyz6Xbp5z1pQOVmmzZ8kU7X 5kvA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=ieCymmvmHcLwxBMNWiBKMIP9Q3Vx9XZSJMoHLCt0yd4=; b=yd0qh5fh3RpK+pbFiuZWW9w3kF24f6AD3/olc62KpdcWK2d3hnToltl0KJFq8CYGmg XVUYZ1pDIAhJ9tc2fxJzudU2cAnvZ0QFUxgSQWDvWRi2zS30HbDcXdINQLAxWGZJnn+c j6/tjfK64MdhFJpQQGSd3ecoZIOUgmrQ3cEbBLYkcJlzo1l0trzsEVvIdYpXA8xqFig/ f686uGM5mzRf/nB8kSXjCC078xeAM+DW1gEjJfbiWtfia99Czq1Zu6r/YGA0ZWwvQ0IY zFibq9Bji3R0F5gpQzwHsD9SR20FEOZLhAXgpczfU+W9sE0Vlxay8sMVUAqZAqRPuG1P gkvw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=YxFqShtE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id b5-20020a170902e94500b001787f1922a7si23219408pll.19.2023.01.20.02.30.17; Fri, 20 Jan 2023 02:30:30 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=YxFqShtE; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229749AbjATK01 (ORCPT <rfc822;changwoo.m@gmail.com> + 99 others); Fri, 20 Jan 2023 05:26:27 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36636 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229897AbjATK0U (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 20 Jan 2023 05:26:20 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D3EE28B76A for <linux-kernel@vger.kernel.org>; Fri, 20 Jan 2023 02:25:38 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1674210337; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=ieCymmvmHcLwxBMNWiBKMIP9Q3Vx9XZSJMoHLCt0yd4=; b=YxFqShtEbNs/Br8Gk3hVeYdkcENuR4IjEXcTk2DLTZyfNSkNKRm6Tqk+nlcACgiLrVPx1r LFyDULaWjxuLB0Q1Vb8xT2r8MbBbfy/N4dwNOe+lPjsRtxYGLwVtzSI2N5WFGS6WN2B7ju EP3T6vAOUoz/o9daejmRNvRwu5At5nE= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-448-V7OXi7UvNzqjtAFUhAJTDw-1; Fri, 20 Jan 2023 05:25:34 -0500 X-MC-Unique: V7OXi7UvNzqjtAFUhAJTDw-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 34B2718E0924; Fri, 20 Jan 2023 10:25:34 +0000 (UTC) Received: from lithium.redhat.com (unknown [10.39.194.249]) by smtp.corp.redhat.com (Postfix) with ESMTP id 1E98F53AA; Fri, 20 Jan 2023 10:25:31 +0000 (UTC) From: Giuseppe Scrivano <gscrivan@redhat.com> To: linux-kernel@vger.kernel.org Cc: keescook@chromium.org, bristot@redhat.com, ebiederm@xmission.com, brauner@kernel.org, cyphar@cyphar.com, viro@zeniv.linux.org.uk, alexl@redhat.com, peterz@infradead.org, bmasney@redhat.com, gscrivan@redhat.com Subject: [PATCH v3 1/2] exec: add PR_HIDE_SELF_EXE prctl Date: Fri, 20 Jan 2023 11:25:11 +0100 Message-Id: <20230120102512.3195094-1-gscrivan@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 3.1 on 10.11.54.5 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1755537085425460256?= X-GMAIL-MSGID: =?utf-8?q?1755537085425460256?= |
Series |
[v3,1/2] exec: add PR_HIDE_SELF_EXE prctl
|
|
Commit Message
Giuseppe Scrivano
Jan. 20, 2023, 10:25 a.m. UTC
This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
processes to hide their own /proc/*/exe file. When this prctl is
used, every access to /proc/*/exe for the calling process will
fail with ENOENT.
This is useful for preventing issues like CVE-2019-5736, where an
attacker can gain host root access by overwriting the binary
in OCI runtimes through file-descriptor mishandling in containers.
The current fix for CVE-2019-5736 is to create a read-only copy or
a bind-mount of the current executable, and then re-exec the current
process. With the new prctl, the read-only copy or bind-mount copy is
not needed anymore.
While map_files/ also might contain symlinks to files in host,
proc_map_files_get_link() permissions checks are already sufficient.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---
v2: https://lkml.org/lkml/2023/1/19/849
Differences from v2:
- fixed the test to check PR_SET_HIDE_SELF_EXE after fork
v1: https://lkml.org/lkml/2023/1/4/334
Differences from v1:
- amended more information in the commit message wrt map_files not
requiring the same protection.
- changed the test to verify PR_HIDE_SELF_EXE cannot be unset after
a fork.
fs/exec.c | 1 +
fs/proc/base.c | 8 +++++---
include/linux/sched.h | 5 +++++
include/uapi/linux/prctl.h | 3 +++
kernel/sys.c | 9 +++++++++
tools/include/uapi/linux/prctl.h | 3 +++
6 files changed, 26 insertions(+), 3 deletions(-)
Comments
On Fri, Jan 20, 2023, at 5:25 AM, Giuseppe Scrivano wrote: > This patch adds a new prctl called PR_HIDE_SELF_EXE which allows > processes to hide their own /proc/*/exe file. When this prctl is > used, every access to /proc/*/exe for the calling process will > fail with ENOENT. How about a mount option for procfs like `mount -t procfs procfs /proc -o rw,nosuid,nodev,magiclink-no-xdev` Where `magiclink-no-xdev` would cause all magic links to fail to cross a pid namespace or so?
"Colin Walters" <walters@verbum.org> writes: > On Fri, Jan 20, 2023, at 5:25 AM, Giuseppe Scrivano wrote: >> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows >> processes to hide their own /proc/*/exe file. When this prctl is >> used, every access to /proc/*/exe for the calling process will >> fail with ENOENT. > > How about a mount option for procfs like `mount -t procfs procfs /proc -o rw,nosuid,nodev,magiclink-no-xdev` > > Where `magiclink-no-xdev` would cause all magic links to fail to cross a pid namespace or so? wouldn't that break also stuff like "/proc/self/fd/$FD" after you join a different PID namespace?
On Mon, Jan 23, 2023, at 2:21 PM, Giuseppe Scrivano wrote: > "Colin Walters" <walters@verbum.org> writes: > >> On Fri, Jan 20, 2023, at 5:25 AM, Giuseppe Scrivano wrote: >>> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows >>> processes to hide their own /proc/*/exe file. When this prctl is >>> used, every access to /proc/*/exe for the calling process will >>> fail with ENOENT. >> >> How about a mount option for procfs like `mount -t procfs procfs /proc -o rw,nosuid,nodev,magiclink-no-xdev` >> >> Where `magiclink-no-xdev` would cause all magic links to fail to cross a pid namespace or so? > > wouldn't that break also stuff like "/proc/self/fd/$FD" after you join a > different PID namespace? Ah, right. Hmm. But building on the reply to https://lwn.net/Articles/920876/ maybe an opt-in flag like `-o magiclink-restricted=/proc/<pid>/ns/pid` that required callers to have CAP_SYS_ADMIN in the referenced namespace? Then things like crun/runc would open a fd for `/proc/self/ns/pid` when they start (usually, this is the init ns), then pass the reference to that fd into magiclink-restricted. There may be a more elegant userspace API here; I think my overall point is reiterating what I mentioned in https://github.com/containers/crun/pull/1105#issuecomment-1360085059 "A minor worry I have with both is that any namespacing-based approach that controls visibility in /proc runs the risk of someone adding a new way to get to the binary; maybe it's something like us having a fd for ourselves open under /proc/pid/fd/ or so." So instead of hiding just /proc/self/exe, we add some sort of API that aims to restrict all other magic links that may be added in the future. The history of map_files is interesting here; it looks like https://github.com/torvalds/linux/commit/bdb4d100afe9818aebd1d98ced575c5ef143456c introduces a comment that says "* Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the * symlinks may be used to bypass permissions on ancestor directories in the * path to the file in question." yet isn't there an inconsistency here in not applying the same restrictions on /proc/self/exe? Or another way to look at this is that if we were to add some sort of API like this on /proc, we'd also change the proc_maps code to also honor it *instead* if present instead of limiting to the init ns?
"Colin Walters" <walters@verbum.org> writes: > On Mon, Jan 23, 2023, at 2:21 PM, Giuseppe Scrivano wrote: >> "Colin Walters" <walters@verbum.org> writes: >> >>> On Fri, Jan 20, 2023, at 5:25 AM, Giuseppe Scrivano wrote: >>>> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows >>>> processes to hide their own /proc/*/exe file. When this prctl is >>>> used, every access to /proc/*/exe for the calling process will >>>> fail with ENOENT. >>> >>> How about a mount option for procfs like `mount -t procfs procfs /proc -o rw,nosuid,nodev,magiclink-no-xdev` >>> >>> Where `magiclink-no-xdev` would cause all magic links to fail to cross a pid namespace or so? >> >> wouldn't that break also stuff like "/proc/self/fd/$FD" after you join a >> different PID namespace? > > Ah, right. Hmm. But building on the reply to > https://lwn.net/Articles/920876/ > maybe an opt-in flag like `-o magiclink-restricted=/proc/<pid>/ns/pid` > that required callers to have CAP_SYS_ADMIN in the referenced > namespace? Then things like crun/runc would open a fd for > `/proc/self/ns/pid` when they start (usually, this is the init ns), > then pass the reference to that fd into magiclink-restricted. > > There may be a more elegant userspace API here; I think my overall point is reiterating what I mentioned in > https://github.com/containers/crun/pull/1105#issuecomment-1360085059 > > "A minor worry I have with both is that any namespacing-based approach > that controls visibility in /proc runs the risk of someone adding a > new way to get to the binary; maybe it's something like us having a fd > for ourselves open under /proc/pid/fd/ or so." > > So instead of hiding just /proc/self/exe, we add some sort of API that aims to restrict all other magic links that may be added in the future. > > The history of map_files is interesting here; it looks like https://github.com/torvalds/linux/commit/bdb4d100afe9818aebd1d98ced575c5ef143456c introduces a comment that says > > "* Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the > * symlinks may be used to bypass permissions on ancestor directories in the > * path to the file in question." > > yet isn't there an inconsistency here in not applying the same restrictions on /proc/self/exe? > > Or another way to look at this is that if we were to add some sort of > API like this on /proc, we'd also change the proc_maps code to also > honor it *instead* if present instead of limiting to the init ns? I realize it seems like a one-off fix, but it is done only for backward compatibility. Other paths under /proc/self/map_files require CAP_SYS_ADMIN in the initial user namespace, or have CAP_CHECKPOINT_RESTORE in the user namespace. Sure, it is not future-proof, but it would look weird if after CVE-2019-19814 there will be more ways to access files from the host without requiring some capabilities. One problem with having it as a mount option for procfs (so for the /proc in the container) is that then the container runtime has the the burden to verify that there are no other procfs mounts accessible from the mount namespace as well as to check that the mount option is available before attempting to use it from the container. Potentially it could also be a TOCTOU race since a proc mount could appear later, e.g. parent mount with shared propagation. If the runtime finds out that another procfs is reachable, then it would have to fallback to re-exec itself much later than what we do today. With the prctl a runtime would just need to do the following and live happily ever after: __attribute__ ((constructor)) static void hide_self_exe (void) { if (prctl(PR_SET_HIDE_SELF_EXE, 1, 0, 0, 0) == 0) return; /* ...reexec as we do today... */ } and we won't have to worry about what mount options are supported or used by proc.
On Mon, Jan 23, 2023, at 5:54 PM, Giuseppe Scrivano wrote: > > I realize it seems like a one-off fix, but it is done only for backward > compatibility. > > Other paths under /proc/self/map_files require CAP_SYS_ADMIN in the > initial user namespace, or have CAP_CHECKPOINT_RESTORE in the user > namespace. Sure, it is not future-proof, but it would look weird if > after CVE-2019-19814 there will be more ways to access files from the > host without requiring some capabilities. I think a way to rephrase what I'm saying is that it feels like this should be about making /proc/self/exe and /proc/self/map_files consistent. > With the prctl a runtime would just need to do the following and live > happily ever after: > > __attribute__ ((constructor)) static void hide_self_exe (void) > { > if (prctl(PR_SET_HIDE_SELF_EXE, 1, 0, 0, 0) == 0) > return; > > /* ...reexec as we do today... */ > } > > and we won't have to worry about what mount options are supported or > used by proc. Yeah, OK - having the logical operation be on the process and not the view into it (procfs) definitely is more robust. But how about calling this PR_SET_PROCFS_RESTRICTED or so, and then *also* changing the /proc/self/map_files lookup to deny if this is set? Yes, it'd be mostly redundant, but it'd help clarify things for any future changes since it'd be clear that the logic *should* be consistent for /proc/self/exe and /proc/self/map_files. And actually as a bonus, it would make the case of e.g. `podman run --cap-add=checkpoint_restore` secure right? (Though honestly I don't know how common that is or whether one can practically use checkpoint_restore without other caps)
On 2023-01-20, Giuseppe Scrivano <gscrivan@redhat.com> wrote: > This patch adds a new prctl called PR_HIDE_SELF_EXE which allows > processes to hide their own /proc/*/exe file. When this prctl is > used, every access to /proc/*/exe for the calling process will > fail with ENOENT. > > This is useful for preventing issues like CVE-2019-5736, where an > attacker can gain host root access by overwriting the binary > in OCI runtimes through file-descriptor mishandling in containers. > > The current fix for CVE-2019-5736 is to create a read-only copy or > a bind-mount of the current executable, and then re-exec the current > process. With the new prctl, the read-only copy or bind-mount copy is > not needed anymore. > > While map_files/ also might contain symlinks to files in host, > proc_map_files_get_link() permissions checks are already sufficient. I suspect this doesn't protect against the execve("/proc/self/exe") tactic (because it clears the bit on execve), so I'm not sure this is much safer than PR_SET_DUMPABLE (yeah, it stops root in the source userns from accessing /proc/$pid/exe but the above attack makes that no longer that important). I think the only way to fix this properly is by blocking re-opens of magic links that have more permissions than they originally did. I just got back from vacation, but I'm working on fixing up [1] so it's ready to be an RFC so we can close this hole once and for all. [1]: https://github.com/cyphar/linux/tree/magiclink/open_how-reopen > > Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> > --- > v2: https://lkml.org/lkml/2023/1/19/849 > > Differences from v2: > > - fixed the test to check PR_SET_HIDE_SELF_EXE after fork > > v1: https://lkml.org/lkml/2023/1/4/334 > > Differences from v1: > > - amended more information in the commit message wrt map_files not > requiring the same protection. > - changed the test to verify PR_HIDE_SELF_EXE cannot be unset after > a fork. > > fs/exec.c | 1 + > fs/proc/base.c | 8 +++++--- > include/linux/sched.h | 5 +++++ > include/uapi/linux/prctl.h | 3 +++ > kernel/sys.c | 9 +++++++++ > tools/include/uapi/linux/prctl.h | 3 +++ > 6 files changed, 26 insertions(+), 3 deletions(-) > > diff --git a/fs/exec.c b/fs/exec.c > index ab913243a367..5a5dd964c3a3 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1855,6 +1855,7 @@ static int bprm_execve(struct linux_binprm *bprm, > /* execve succeeded */ > current->fs->in_exec = 0; > current->in_execve = 0; > + task_clear_hide_self_exe(current); > rseq_execve(current); > acct_update_integrals(current); > task_numa_free(current, false); > diff --git a/fs/proc/base.c b/fs/proc/base.c > index 9e479d7d202b..959968e2da0d 100644 > --- a/fs/proc/base.c > +++ b/fs/proc/base.c > @@ -1723,19 +1723,21 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path) > { > struct task_struct *task; > struct file *exe_file; > + long hide_self_exe; > > task = get_proc_task(d_inode(dentry)); > if (!task) > return -ENOENT; > exe_file = get_task_exe_file(task); > + hide_self_exe = task_hide_self_exe(task); > put_task_struct(task); > - if (exe_file) { > + if (exe_file && !hide_self_exe) { > *exe_path = exe_file->f_path; > path_get(&exe_file->f_path); > fput(exe_file); > return 0; > - } else > - return -ENOENT; > + } > + return -ENOENT; > } > > static const char *proc_pid_get_link(struct dentry *dentry, > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 853d08f7562b..8db32d5fc285 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1790,6 +1790,7 @@ static __always_inline bool is_percpu_thread(void) > #define PFA_SPEC_IB_DISABLE 5 /* Indirect branch speculation restricted */ > #define PFA_SPEC_IB_FORCE_DISABLE 6 /* Indirect branch speculation permanently restricted */ > #define PFA_SPEC_SSB_NOEXEC 7 /* Speculative Store Bypass clear on execve() */ > +#define PFA_HIDE_SELF_EXE 8 /* Hide /proc/self/exe for the process */ > > #define TASK_PFA_TEST(name, func) \ > static inline bool task_##func(struct task_struct *p) \ > @@ -1832,6 +1833,10 @@ TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable) > TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable) > TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable) > > +TASK_PFA_TEST(HIDE_SELF_EXE, hide_self_exe) > +TASK_PFA_SET(HIDE_SELF_EXE, hide_self_exe) > +TASK_PFA_CLEAR(HIDE_SELF_EXE, hide_self_exe) > + > static inline void > current_restore_flags(unsigned long orig_flags, unsigned long flags) > { > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > index a5e06dcbba13..f12f3df12468 100644 > --- a/include/uapi/linux/prctl.h > +++ b/include/uapi/linux/prctl.h > @@ -284,4 +284,7 @@ struct prctl_mm_map { > #define PR_SET_VMA 0x53564d41 > # define PR_SET_VMA_ANON_NAME 0 > > +#define PR_SET_HIDE_SELF_EXE 65 > +#define PR_GET_HIDE_SELF_EXE 66 > + > #endif /* _LINUX_PRCTL_H */ > diff --git a/kernel/sys.c b/kernel/sys.c > index 5fd54bf0e886..e992f1b72973 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -2626,6 +2626,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > case PR_SET_VMA: > error = prctl_set_vma(arg2, arg3, arg4, arg5); > break; > + case PR_SET_HIDE_SELF_EXE: > + if (arg2 != 1 || arg3 || arg4 || arg5) > + return -EINVAL; > + task_set_hide_self_exe(current); > + break; > + case PR_GET_HIDE_SELF_EXE: > + if (arg2 || arg3 || arg4 || arg5) > + return -EINVAL; > + return task_hide_self_exe(current) ? 1 : 0; > default: > error = -EINVAL; > break; > diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h > index a5e06dcbba13..f12f3df12468 100644 > --- a/tools/include/uapi/linux/prctl.h > +++ b/tools/include/uapi/linux/prctl.h > @@ -284,4 +284,7 @@ struct prctl_mm_map { > #define PR_SET_VMA 0x53564d41 > # define PR_SET_VMA_ANON_NAME 0 > > +#define PR_SET_HIDE_SELF_EXE 65 > +#define PR_GET_HIDE_SELF_EXE 66 > + > #endif /* _LINUX_PRCTL_H */ > -- > 2.38.1 >
Aleksa Sarai <cyphar@cyphar.com> writes: > On 2023-01-20, Giuseppe Scrivano <gscrivan@redhat.com> wrote: >> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows >> processes to hide their own /proc/*/exe file. When this prctl is >> used, every access to /proc/*/exe for the calling process will >> fail with ENOENT. >> >> This is useful for preventing issues like CVE-2019-5736, where an >> attacker can gain host root access by overwriting the binary >> in OCI runtimes through file-descriptor mishandling in containers. >> >> The current fix for CVE-2019-5736 is to create a read-only copy or >> a bind-mount of the current executable, and then re-exec the current >> process. With the new prctl, the read-only copy or bind-mount copy is >> not needed anymore. >> >> While map_files/ also might contain symlinks to files in host, >> proc_map_files_get_link() permissions checks are already sufficient. > > I suspect this doesn't protect against the execve("/proc/self/exe") > tactic (because it clears the bit on execve), so I'm not sure this is > much safer than PR_SET_DUMPABLE (yeah, it stops root in the source > userns from accessing /proc/$pid/exe but the above attack makes that no > longer that important). it protects against that attack too. It clears the bit _after_ the execve() syscall is done. If you attempt execve("/proc/self/exe") you still get ENOENT: ``` #include <stdlib.h> #include <stdio.h> #include <sys/prctl.h> #include <unistd.h> int main(void) { int ret; ret = prctl(65, 1, 0, 0, 0); if (ret != 0) exit(1); execl("/proc/self/exe", "foo", NULL); exit(2); } ``` # strace -e prctl,execve ./hide-self-exe execve("./hide-self-exe", ["./hide-self-exe"], 0x7fff975a3690 /* 39 vars */) = 0 prctl(0x41 /* PR_??? */, 0x1, 0, 0, 0) = 0 execve("/proc/self/exe", ["foo"], 0x7ffcf51868b8 /* 39 vars */) = -1 ENOENT (No such file or directory) +++ exited with 2 +++ I've also tried execv'ing with a script that uses "#!/proc/self/exe" and I get the same ENOENT. > > I think the only way to fix this properly is by blocking re-opens of > magic links that have more permissions than they originally did. I just > got back from vacation, but I'm working on fixing up [1] so it's ready > to be an RFC so we can close this hole once and for all. so that relies on the fact opening /proc/self/exe with O_WRONLY fails with ETXTBSY? > [1]: https://github.com/cyphar/linux/tree/magiclink/open_how-reopen > >> >> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> >> --- >> v2: https://lkml.org/lkml/2023/1/19/849 >> >> Differences from v2: >> >> - fixed the test to check PR_SET_HIDE_SELF_EXE after fork >> >> v1: https://lkml.org/lkml/2023/1/4/334 >> >> Differences from v1: >> >> - amended more information in the commit message wrt map_files not >> requiring the same protection. >> - changed the test to verify PR_HIDE_SELF_EXE cannot be unset after >> a fork. >> >> fs/exec.c | 1 + >> fs/proc/base.c | 8 +++++--- >> include/linux/sched.h | 5 +++++ >> include/uapi/linux/prctl.h | 3 +++ >> kernel/sys.c | 9 +++++++++ >> tools/include/uapi/linux/prctl.h | 3 +++ >> 6 files changed, 26 insertions(+), 3 deletions(-) >> >> diff --git a/fs/exec.c b/fs/exec.c >> index ab913243a367..5a5dd964c3a3 100644 >> --- a/fs/exec.c >> +++ b/fs/exec.c >> @@ -1855,6 +1855,7 @@ static int bprm_execve(struct linux_binprm *bprm, >> /* execve succeeded */ >> current->fs->in_exec = 0; >> current->in_execve = 0; >> + task_clear_hide_self_exe(current); >> rseq_execve(current); >> acct_update_integrals(current); >> task_numa_free(current, false); >> diff --git a/fs/proc/base.c b/fs/proc/base.c >> index 9e479d7d202b..959968e2da0d 100644 >> --- a/fs/proc/base.c >> +++ b/fs/proc/base.c >> @@ -1723,19 +1723,21 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path) >> { >> struct task_struct *task; >> struct file *exe_file; >> + long hide_self_exe; >> >> task = get_proc_task(d_inode(dentry)); >> if (!task) >> return -ENOENT; >> exe_file = get_task_exe_file(task); >> + hide_self_exe = task_hide_self_exe(task); >> put_task_struct(task); >> - if (exe_file) { >> + if (exe_file && !hide_self_exe) { >> *exe_path = exe_file->f_path; >> path_get(&exe_file->f_path); >> fput(exe_file); >> return 0; >> - } else >> - return -ENOENT; >> + } >> + return -ENOENT; >> } >> >> static const char *proc_pid_get_link(struct dentry *dentry, >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index 853d08f7562b..8db32d5fc285 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -1790,6 +1790,7 @@ static __always_inline bool is_percpu_thread(void) >> #define PFA_SPEC_IB_DISABLE 5 /* Indirect branch speculation restricted */ >> #define PFA_SPEC_IB_FORCE_DISABLE 6 /* Indirect branch speculation permanently restricted */ >> #define PFA_SPEC_SSB_NOEXEC 7 /* Speculative Store Bypass clear on execve() */ >> +#define PFA_HIDE_SELF_EXE 8 /* Hide /proc/self/exe for the process */ >> >> #define TASK_PFA_TEST(name, func) \ >> static inline bool task_##func(struct task_struct *p) \ >> @@ -1832,6 +1833,10 @@ TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable) >> TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable) >> TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable) >> >> +TASK_PFA_TEST(HIDE_SELF_EXE, hide_self_exe) >> +TASK_PFA_SET(HIDE_SELF_EXE, hide_self_exe) >> +TASK_PFA_CLEAR(HIDE_SELF_EXE, hide_self_exe) >> + >> static inline void >> current_restore_flags(unsigned long orig_flags, unsigned long flags) >> { >> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h >> index a5e06dcbba13..f12f3df12468 100644 >> --- a/include/uapi/linux/prctl.h >> +++ b/include/uapi/linux/prctl.h >> @@ -284,4 +284,7 @@ struct prctl_mm_map { >> #define PR_SET_VMA 0x53564d41 >> # define PR_SET_VMA_ANON_NAME 0 >> >> +#define PR_SET_HIDE_SELF_EXE 65 >> +#define PR_GET_HIDE_SELF_EXE 66 >> + >> #endif /* _LINUX_PRCTL_H */ >> diff --git a/kernel/sys.c b/kernel/sys.c >> index 5fd54bf0e886..e992f1b72973 100644 >> --- a/kernel/sys.c >> +++ b/kernel/sys.c >> @@ -2626,6 +2626,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, >> case PR_SET_VMA: >> error = prctl_set_vma(arg2, arg3, arg4, arg5); >> break; >> + case PR_SET_HIDE_SELF_EXE: >> + if (arg2 != 1 || arg3 || arg4 || arg5) >> + return -EINVAL; >> + task_set_hide_self_exe(current); >> + break; >> + case PR_GET_HIDE_SELF_EXE: >> + if (arg2 || arg3 || arg4 || arg5) >> + return -EINVAL; >> + return task_hide_self_exe(current) ? 1 : 0; >> default: >> error = -EINVAL; >> break; >> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h >> index a5e06dcbba13..f12f3df12468 100644 >> --- a/tools/include/uapi/linux/prctl.h >> +++ b/tools/include/uapi/linux/prctl.h >> @@ -284,4 +284,7 @@ struct prctl_mm_map { >> #define PR_SET_VMA 0x53564d41 >> # define PR_SET_VMA_ANON_NAME 0 >> >> +#define PR_SET_HIDE_SELF_EXE 65 >> +#define PR_GET_HIDE_SELF_EXE 66 >> + >> #endif /* _LINUX_PRCTL_H */ >> -- >> 2.38.1 >>
On Mon, Jan 23, 2023 at 6:04 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2023-01-20, Giuseppe Scrivano <gscrivan@redhat.com> wrote: > > This patch adds a new prctl called PR_HIDE_SELF_EXE which allows > > processes to hide their own /proc/*/exe file. When this prctl is > > used, every access to /proc/*/exe for the calling process will > > fail with ENOENT. > > > > This is useful for preventing issues like CVE-2019-5736, where an > > attacker can gain host root access by overwriting the binary > > in OCI runtimes through file-descriptor mishandling in containers. > > > > The current fix for CVE-2019-5736 is to create a read-only copy or > > a bind-mount of the current executable, and then re-exec the current > > process. With the new prctl, the read-only copy or bind-mount copy is > > not needed anymore. > > > > While map_files/ also might contain symlinks to files in host, > > proc_map_files_get_link() permissions checks are already sufficient. > > I suspect this doesn't protect against the execve("/proc/self/exe") > tactic (because it clears the bit on execve), so I'm not sure this is > much safer than PR_SET_DUMPABLE (yeah, it stops root in the source > userns from accessing /proc/$pid/exe but the above attack makes that no > longer that important). > > I think the only way to fix this properly is by blocking re-opens of > magic links that have more permissions than they originally did. I just > got back from vacation, but I'm working on fixing up [1] so it's ready > to be an RFC so we can close this hole once and for all. > > [1]: https://github.com/cyphar/linux/tree/magiclink/open_how-reopen pls add me into cc when you will send this change. We need to be sure that it doesn't break CRIU... Thanks, Andrei
On 2023-01-24, Giuseppe Scrivano <gscrivan@redhat.com> wrote: > Aleksa Sarai <cyphar@cyphar.com> writes: > > > On 2023-01-20, Giuseppe Scrivano <gscrivan@redhat.com> wrote: > >> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows > >> processes to hide their own /proc/*/exe file. When this prctl is > >> used, every access to /proc/*/exe for the calling process will > >> fail with ENOENT. > >> > >> This is useful for preventing issues like CVE-2019-5736, where an > >> attacker can gain host root access by overwriting the binary > >> in OCI runtimes through file-descriptor mishandling in containers. > >> > >> The current fix for CVE-2019-5736 is to create a read-only copy or > >> a bind-mount of the current executable, and then re-exec the current > >> process. With the new prctl, the read-only copy or bind-mount copy is > >> not needed anymore. > >> > >> While map_files/ also might contain symlinks to files in host, > >> proc_map_files_get_link() permissions checks are already sufficient. > > > > I suspect this doesn't protect against the execve("/proc/self/exe") > > tactic (because it clears the bit on execve), so I'm not sure this is > > much safer than PR_SET_DUMPABLE (yeah, it stops root in the source > > userns from accessing /proc/$pid/exe but the above attack makes that no > > longer that important). > > it protects against that attack too. It clears the bit _after_ the > execve() syscall is done. > > If you attempt execve("/proc/self/exe") you still get ENOENT: > > ``` > #include <stdlib.h> > #include <stdio.h> > #include <sys/prctl.h> > #include <unistd.h> > > int main(void) > { > int ret; > > ret = prctl(65, 1, 0, 0, 0); > if (ret != 0) > exit(1); > > execl("/proc/self/exe", "foo", NULL); > exit(2); > } > ``` > > # strace -e prctl,execve ./hide-self-exe > execve("./hide-self-exe", ["./hide-self-exe"], 0x7fff975a3690 /* 39 vars */) = 0 > prctl(0x41 /* PR_??? */, 0x1, 0, 0, 0) = 0 > execve("/proc/self/exe", ["foo"], 0x7ffcf51868b8 /* 39 vars */) = -1 ENOENT (No such file or directory) > +++ exited with 2 +++ > > I've also tried execv'ing with a script that uses "#!/proc/self/exe" and > I get the same ENOENT. Ah, you're right. As you mentioned, you could still do the attack through /proc/self/map_files but that would require you to know where the binary will be located (and being non-dumpable blocks container processes from doing tricks to get the right path). I wonder if we should somehow require (or auto-apply) SUID_DUMP_NONE when setting this prctl, since it does currently depend on it to be properly secure... > > I think the only way to fix this properly is by blocking re-opens of > > magic links that have more permissions than they originally did. I just > > got back from vacation, but I'm working on fixing up [1] so it's ready > > to be an RFC so we can close this hole once and for all. > > so that relies on the fact opening /proc/self/exe with O_WRONLY fails > with ETXTBSY? Not quite, it relies on the fact that /proc/self/exe (and any other magiclink to /proc/self/exe) does not have a write mode (semantically, because of -ETXTBSY) and thus blocks any attempt to open it (or re-open it) with a write mode. It also fixes some other possible issues and lets you have upgrade masks (a-la capabilities) to file descriptors. Ultimately I think having a complete "no really, nobody can touch this" knob is also a good idea, and as this is is much simpler we can it in much quicker than the magiclink stuff (which I still think is necessary in general). > > [1]: https://github.com/cyphar/linux/tree/magiclink/open_how-reopen > > > >> > >> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> > >> --- > >> v2: https://lkml.org/lkml/2023/1/19/849 > >> > >> Differences from v2: > >> > >> - fixed the test to check PR_SET_HIDE_SELF_EXE after fork > >> > >> v1: https://lkml.org/lkml/2023/1/4/334 > >> > >> Differences from v1: > >> > >> - amended more information in the commit message wrt map_files not > >> requiring the same protection. > >> - changed the test to verify PR_HIDE_SELF_EXE cannot be unset after > >> a fork. > >> > >> fs/exec.c | 1 + > >> fs/proc/base.c | 8 +++++--- > >> include/linux/sched.h | 5 +++++ > >> include/uapi/linux/prctl.h | 3 +++ > >> kernel/sys.c | 9 +++++++++ > >> tools/include/uapi/linux/prctl.h | 3 +++ > >> 6 files changed, 26 insertions(+), 3 deletions(-) > >> > >> diff --git a/fs/exec.c b/fs/exec.c > >> index ab913243a367..5a5dd964c3a3 100644 > >> --- a/fs/exec.c > >> +++ b/fs/exec.c > >> @@ -1855,6 +1855,7 @@ static int bprm_execve(struct linux_binprm *bprm, > >> /* execve succeeded */ > >> current->fs->in_exec = 0; > >> current->in_execve = 0; > >> + task_clear_hide_self_exe(current); > >> rseq_execve(current); > >> acct_update_integrals(current); > >> task_numa_free(current, false); > >> diff --git a/fs/proc/base.c b/fs/proc/base.c > >> index 9e479d7d202b..959968e2da0d 100644 > >> --- a/fs/proc/base.c > >> +++ b/fs/proc/base.c > >> @@ -1723,19 +1723,21 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path) > >> { > >> struct task_struct *task; > >> struct file *exe_file; > >> + long hide_self_exe; > >> > >> task = get_proc_task(d_inode(dentry)); > >> if (!task) > >> return -ENOENT; > >> exe_file = get_task_exe_file(task); > >> + hide_self_exe = task_hide_self_exe(task); > >> put_task_struct(task); > >> - if (exe_file) { > >> + if (exe_file && !hide_self_exe) { > >> *exe_path = exe_file->f_path; > >> path_get(&exe_file->f_path); > >> fput(exe_file); > >> return 0; > >> - } else > >> - return -ENOENT; > >> + } > >> + return -ENOENT; > >> } > >> > >> static const char *proc_pid_get_link(struct dentry *dentry, > >> diff --git a/include/linux/sched.h b/include/linux/sched.h > >> index 853d08f7562b..8db32d5fc285 100644 > >> --- a/include/linux/sched.h > >> +++ b/include/linux/sched.h > >> @@ -1790,6 +1790,7 @@ static __always_inline bool is_percpu_thread(void) > >> #define PFA_SPEC_IB_DISABLE 5 /* Indirect branch speculation restricted */ > >> #define PFA_SPEC_IB_FORCE_DISABLE 6 /* Indirect branch speculation permanently restricted */ > >> #define PFA_SPEC_SSB_NOEXEC 7 /* Speculative Store Bypass clear on execve() */ > >> +#define PFA_HIDE_SELF_EXE 8 /* Hide /proc/self/exe for the process */ > >> > >> #define TASK_PFA_TEST(name, func) \ > >> static inline bool task_##func(struct task_struct *p) \ > >> @@ -1832,6 +1833,10 @@ TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable) > >> TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable) > >> TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable) > >> > >> +TASK_PFA_TEST(HIDE_SELF_EXE, hide_self_exe) > >> +TASK_PFA_SET(HIDE_SELF_EXE, hide_self_exe) > >> +TASK_PFA_CLEAR(HIDE_SELF_EXE, hide_self_exe) > >> + > >> static inline void > >> current_restore_flags(unsigned long orig_flags, unsigned long flags) > >> { > >> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > >> index a5e06dcbba13..f12f3df12468 100644 > >> --- a/include/uapi/linux/prctl.h > >> +++ b/include/uapi/linux/prctl.h > >> @@ -284,4 +284,7 @@ struct prctl_mm_map { > >> #define PR_SET_VMA 0x53564d41 > >> # define PR_SET_VMA_ANON_NAME 0 > >> > >> +#define PR_SET_HIDE_SELF_EXE 65 > >> +#define PR_GET_HIDE_SELF_EXE 66 > >> + > >> #endif /* _LINUX_PRCTL_H */ > >> diff --git a/kernel/sys.c b/kernel/sys.c > >> index 5fd54bf0e886..e992f1b72973 100644 > >> --- a/kernel/sys.c > >> +++ b/kernel/sys.c > >> @@ -2626,6 +2626,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > >> case PR_SET_VMA: > >> error = prctl_set_vma(arg2, arg3, arg4, arg5); > >> break; > >> + case PR_SET_HIDE_SELF_EXE: > >> + if (arg2 != 1 || arg3 || arg4 || arg5) > >> + return -EINVAL; > >> + task_set_hide_self_exe(current); > >> + break; > >> + case PR_GET_HIDE_SELF_EXE: > >> + if (arg2 || arg3 || arg4 || arg5) > >> + return -EINVAL; > >> + return task_hide_self_exe(current) ? 1 : 0; > >> default: > >> error = -EINVAL; > >> break; > >> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h > >> index a5e06dcbba13..f12f3df12468 100644 > >> --- a/tools/include/uapi/linux/prctl.h > >> +++ b/tools/include/uapi/linux/prctl.h > >> @@ -284,4 +284,7 @@ struct prctl_mm_map { > >> #define PR_SET_VMA 0x53564d41 > >> # define PR_SET_VMA_ANON_NAME 0 > >> > >> +#define PR_SET_HIDE_SELF_EXE 65 > >> +#define PR_GET_HIDE_SELF_EXE 66 > >> + > >> #endif /* _LINUX_PRCTL_H */ > >> -- > >> 2.38.1 > >> >
Aleksa Sarai <cyphar@cyphar.com> writes: > On 2023-01-24, Giuseppe Scrivano <gscrivan@redhat.com> wrote: >> Aleksa Sarai <cyphar@cyphar.com> writes: >> >> > On 2023-01-20, Giuseppe Scrivano <gscrivan@redhat.com> wrote: >> >> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows >> >> processes to hide their own /proc/*/exe file. When this prctl is >> >> used, every access to /proc/*/exe for the calling process will >> >> fail with ENOENT. >> >> >> >> This is useful for preventing issues like CVE-2019-5736, where an >> >> attacker can gain host root access by overwriting the binary >> >> in OCI runtimes through file-descriptor mishandling in containers. >> >> >> >> The current fix for CVE-2019-5736 is to create a read-only copy or >> >> a bind-mount of the current executable, and then re-exec the current >> >> process. With the new prctl, the read-only copy or bind-mount copy is >> >> not needed anymore. >> >> >> >> While map_files/ also might contain symlinks to files in host, >> >> proc_map_files_get_link() permissions checks are already sufficient. >> > >> > I suspect this doesn't protect against the execve("/proc/self/exe") >> > tactic (because it clears the bit on execve), so I'm not sure this is >> > much safer than PR_SET_DUMPABLE (yeah, it stops root in the source >> > userns from accessing /proc/$pid/exe but the above attack makes that no >> > longer that important). >> >> it protects against that attack too. It clears the bit _after_ the >> execve() syscall is done. >> >> If you attempt execve("/proc/self/exe") you still get ENOENT: >> >> ``` >> #include <stdlib.h> >> #include <stdio.h> >> #include <sys/prctl.h> >> #include <unistd.h> >> >> int main(void) >> { >> int ret; >> >> ret = prctl(65, 1, 0, 0, 0); >> if (ret != 0) >> exit(1); >> >> execl("/proc/self/exe", "foo", NULL); >> exit(2); >> } >> ``` >> >> # strace -e prctl,execve ./hide-self-exe >> execve("./hide-self-exe", ["./hide-self-exe"], 0x7fff975a3690 /* 39 vars */) = 0 >> prctl(0x41 /* PR_??? */, 0x1, 0, 0, 0) = 0 >> execve("/proc/self/exe", ["foo"], 0x7ffcf51868b8 /* 39 vars */) = -1 ENOENT (No such file or directory) >> +++ exited with 2 +++ >> >> I've also tried execv'ing with a script that uses "#!/proc/self/exe" and >> I get the same ENOENT. > > Ah, you're right. As you mentioned, you could still do the attack > through /proc/self/map_files but that would require you to know where > the binary will be located (and being non-dumpable blocks container > processes from doing tricks to get the right path). > > I wonder if we should somehow require (or auto-apply) SUID_DUMP_NONE > when setting this prctl, since it does currently depend on it to be > properly secure... from what I can see, access to /proc/*/map_files is already protected by proc_map_files_get_link() that requires either CAP_SYS_ADMIN in the initial user namespace or CAP_CHECKPOINT_RESTORE in the user namespace. Setting SUID_DUMP_NONE wouldn't hurt though :-) After reading some comments on the LWN.net article, I wonder if PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user namespace or if in this case root should keep the privilege to inspect the binary of a process. If a container runs with that many privileges then it has already other ways to damage the host anyway. >> > I think the only way to fix this properly is by blocking re-opens of >> > magic links that have more permissions than they originally did. I just >> > got back from vacation, but I'm working on fixing up [1] so it's ready >> > to be an RFC so we can close this hole once and for all. >> >> so that relies on the fact opening /proc/self/exe with O_WRONLY fails >> with ETXTBSY? > > Not quite, it relies on the fact that /proc/self/exe (and any other > magiclink to /proc/self/exe) does not have a write mode (semantically, > because of -ETXTBSY) and thus blocks any attempt to open it (or re-open > it) with a write mode. It also fixes some other possible issues and lets > you have upgrade masks (a-la capabilities) to file descriptors. > > Ultimately I think having a complete "no really, nobody can touch this" > knob is also a good idea, and as this is is much simpler we can it in > much quicker than the magiclink stuff (which I still think is necessary > in general). > >> > [1]: https://github.com/cyphar/linux/tree/magiclink/open_how-reopen >> > >> >> >> >> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> >> >> --- >> >> v2: https://lkml.org/lkml/2023/1/19/849 >> >> >> >> Differences from v2: >> >> >> >> - fixed the test to check PR_SET_HIDE_SELF_EXE after fork >> >> >> >> v1: https://lkml.org/lkml/2023/1/4/334 >> >> >> >> Differences from v1: >> >> >> >> - amended more information in the commit message wrt map_files not >> >> requiring the same protection. >> >> - changed the test to verify PR_HIDE_SELF_EXE cannot be unset after >> >> a fork. >> >> >> >> fs/exec.c | 1 + >> >> fs/proc/base.c | 8 +++++--- >> >> include/linux/sched.h | 5 +++++ >> >> include/uapi/linux/prctl.h | 3 +++ >> >> kernel/sys.c | 9 +++++++++ >> >> tools/include/uapi/linux/prctl.h | 3 +++ >> >> 6 files changed, 26 insertions(+), 3 deletions(-) >> >> >> >> diff --git a/fs/exec.c b/fs/exec.c >> >> index ab913243a367..5a5dd964c3a3 100644 >> >> --- a/fs/exec.c >> >> +++ b/fs/exec.c >> >> @@ -1855,6 +1855,7 @@ static int bprm_execve(struct linux_binprm *bprm, >> >> /* execve succeeded */ >> >> current->fs->in_exec = 0; >> >> current->in_execve = 0; >> >> + task_clear_hide_self_exe(current); >> >> rseq_execve(current); >> >> acct_update_integrals(current); >> >> task_numa_free(current, false); >> >> diff --git a/fs/proc/base.c b/fs/proc/base.c >> >> index 9e479d7d202b..959968e2da0d 100644 >> >> --- a/fs/proc/base.c >> >> +++ b/fs/proc/base.c >> >> @@ -1723,19 +1723,21 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path) >> >> { >> >> struct task_struct *task; >> >> struct file *exe_file; >> >> + long hide_self_exe; >> >> >> >> task = get_proc_task(d_inode(dentry)); >> >> if (!task) >> >> return -ENOENT; >> >> exe_file = get_task_exe_file(task); >> >> + hide_self_exe = task_hide_self_exe(task); >> >> put_task_struct(task); >> >> - if (exe_file) { >> >> + if (exe_file && !hide_self_exe) { >> >> *exe_path = exe_file->f_path; >> >> path_get(&exe_file->f_path); >> >> fput(exe_file); >> >> return 0; >> >> - } else >> >> - return -ENOENT; >> >> + } >> >> + return -ENOENT; >> >> } >> >> >> >> static const char *proc_pid_get_link(struct dentry *dentry, >> >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> >> index 853d08f7562b..8db32d5fc285 100644 >> >> --- a/include/linux/sched.h >> >> +++ b/include/linux/sched.h >> >> @@ -1790,6 +1790,7 @@ static __always_inline bool is_percpu_thread(void) >> >> #define PFA_SPEC_IB_DISABLE 5 /* Indirect branch speculation restricted */ >> >> #define PFA_SPEC_IB_FORCE_DISABLE 6 /* Indirect branch speculation permanently restricted */ >> >> #define PFA_SPEC_SSB_NOEXEC 7 /* Speculative Store Bypass clear on execve() */ >> >> +#define PFA_HIDE_SELF_EXE 8 /* Hide /proc/self/exe for the process */ >> >> >> >> #define TASK_PFA_TEST(name, func) \ >> >> static inline bool task_##func(struct task_struct *p) \ >> >> @@ -1832,6 +1833,10 @@ TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable) >> >> TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable) >> >> TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable) >> >> >> >> +TASK_PFA_TEST(HIDE_SELF_EXE, hide_self_exe) >> >> +TASK_PFA_SET(HIDE_SELF_EXE, hide_self_exe) >> >> +TASK_PFA_CLEAR(HIDE_SELF_EXE, hide_self_exe) >> >> + >> >> static inline void >> >> current_restore_flags(unsigned long orig_flags, unsigned long flags) >> >> { >> >> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h >> >> index a5e06dcbba13..f12f3df12468 100644 >> >> --- a/include/uapi/linux/prctl.h >> >> +++ b/include/uapi/linux/prctl.h >> >> @@ -284,4 +284,7 @@ struct prctl_mm_map { >> >> #define PR_SET_VMA 0x53564d41 >> >> # define PR_SET_VMA_ANON_NAME 0 >> >> >> >> +#define PR_SET_HIDE_SELF_EXE 65 >> >> +#define PR_GET_HIDE_SELF_EXE 66 >> >> + >> >> #endif /* _LINUX_PRCTL_H */ >> >> diff --git a/kernel/sys.c b/kernel/sys.c >> >> index 5fd54bf0e886..e992f1b72973 100644 >> >> --- a/kernel/sys.c >> >> +++ b/kernel/sys.c >> >> @@ -2626,6 +2626,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, >> >> case PR_SET_VMA: >> >> error = prctl_set_vma(arg2, arg3, arg4, arg5); >> >> break; >> >> + case PR_SET_HIDE_SELF_EXE: >> >> + if (arg2 != 1 || arg3 || arg4 || arg5) >> >> + return -EINVAL; >> >> + task_set_hide_self_exe(current); >> >> + break; >> >> + case PR_GET_HIDE_SELF_EXE: >> >> + if (arg2 || arg3 || arg4 || arg5) >> >> + return -EINVAL; >> >> + return task_hide_self_exe(current) ? 1 : 0; >> >> default: >> >> error = -EINVAL; >> >> break; >> >> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h >> >> index a5e06dcbba13..f12f3df12468 100644 >> >> --- a/tools/include/uapi/linux/prctl.h >> >> +++ b/tools/include/uapi/linux/prctl.h >> >> @@ -284,4 +284,7 @@ struct prctl_mm_map { >> >> #define PR_SET_VMA 0x53564d41 >> >> # define PR_SET_VMA_ANON_NAME 0 >> >> >> >> +#define PR_SET_HIDE_SELF_EXE 65 >> >> +#define PR_GET_HIDE_SELF_EXE 66 >> >> + >> >> #endif /* _LINUX_PRCTL_H */ >> >> -- >> >> 2.38.1 >> >> >>
On Thu, Jan 26, 2023 at 02:28:47AM +1100, Aleksa Sarai wrote: > On 2023-01-24, Giuseppe Scrivano <gscrivan@redhat.com> wrote: > > Aleksa Sarai <cyphar@cyphar.com> writes: > > > > > On 2023-01-20, Giuseppe Scrivano <gscrivan@redhat.com> wrote: > > >> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows > > >> processes to hide their own /proc/*/exe file. When this prctl is > > >> used, every access to /proc/*/exe for the calling process will > > >> fail with ENOENT. > > >> > > >> This is useful for preventing issues like CVE-2019-5736, where an > > >> attacker can gain host root access by overwriting the binary > > >> in OCI runtimes through file-descriptor mishandling in containers. > > >> > > >> The current fix for CVE-2019-5736 is to create a read-only copy or > > >> a bind-mount of the current executable, and then re-exec the current > > >> process. With the new prctl, the read-only copy or bind-mount copy is > > >> not needed anymore. > > >> > > >> While map_files/ also might contain symlinks to files in host, > > >> proc_map_files_get_link() permissions checks are already sufficient. > > > > > > I suspect this doesn't protect against the execve("/proc/self/exe") > > > tactic (because it clears the bit on execve), so I'm not sure this is > > > much safer than PR_SET_DUMPABLE (yeah, it stops root in the source > > > userns from accessing /proc/$pid/exe but the above attack makes that no > > > longer that important). > > > > it protects against that attack too. It clears the bit _after_ the > > execve() syscall is done. > > > > If you attempt execve("/proc/self/exe") you still get ENOENT: > > > > ``` > > #include <stdlib.h> > > #include <stdio.h> > > #include <sys/prctl.h> > > #include <unistd.h> > > > > int main(void) > > { > > int ret; > > > > ret = prctl(65, 1, 0, 0, 0); > > if (ret != 0) > > exit(1); > > > > execl("/proc/self/exe", "foo", NULL); > > exit(2); > > } > > ``` > > > > # strace -e prctl,execve ./hide-self-exe > > execve("./hide-self-exe", ["./hide-self-exe"], 0x7fff975a3690 /* 39 vars */) = 0 > > prctl(0x41 /* PR_??? */, 0x1, 0, 0, 0) = 0 > > execve("/proc/self/exe", ["foo"], 0x7ffcf51868b8 /* 39 vars */) = -1 ENOENT (No such file or directory) > > +++ exited with 2 +++ > > > > I've also tried execv'ing with a script that uses "#!/proc/self/exe" and > > I get the same ENOENT. > > Ah, you're right. As you mentioned, you could still do the attack > through /proc/self/map_files but that would require you to know where > the binary will be located (and being non-dumpable blocks container > processes from doing tricks to get the right path). > > I wonder if we should somehow require (or auto-apply) SUID_DUMP_NONE > when setting this prctl, since it does currently depend on it to be > properly secure... > > > > I think the only way to fix this properly is by blocking re-opens of > > > magic links that have more permissions than they originally did. I just > > > got back from vacation, but I'm working on fixing up [1] so it's ready > > > to be an RFC so we can close this hole once and for all. > > > > so that relies on the fact opening /proc/self/exe with O_WRONLY fails > > with ETXTBSY? > > Not quite, it relies on the fact that /proc/self/exe (and any other > magiclink to /proc/self/exe) does not have a write mode (semantically, > because of -ETXTBSY) and thus blocks any attempt to open it (or re-open > it) with a write mode. It also fixes some other possible issues and lets > you have upgrade masks (a-la capabilities) to file descriptors. > > Ultimately I think having a complete "no really, nobody can touch this" > knob is also a good idea, and as this is is much simpler we can it in > much quicker than the magiclink stuff (which I still think is necessary > in general). It definitely but let's not tie our generic vfs apis to this problem.
From: Christian Brauner (Microsoft) <brauner@kernel.org> On Fri, 20 Jan 2023 11:25:11 +0100, Giuseppe Scrivano wrote: > This patch adds a new prctl called PR_HIDE_SELF_EXE which allows > processes to hide their own /proc/*/exe file. When this prctl is > used, every access to /proc/*/exe for the calling process will > fail with ENOENT. > > This is useful for preventing issues like CVE-2019-5736, where an > attacker can gain host root access by overwriting the binary > in OCI runtimes through file-descriptor mishandling in containers. > > [...] Only needed for privileged sandboxes. The userspace mitigations Aleksa and I did for the CVE in all affected runtimes back then are nifty but complicated. The patch is a decent compromise. Picking up this prctl() for now, [1/2] exec: add PR_HIDE_SELF_EXE prctl commit: 673301182d473ef61a98c292cf64650c73117172 [2/2] selftests: add tests for prctl(SET_HIDE_SELF_EXE) commit: bafa339eda3f79d567386e1fae59bb0537156c96
On Fri, Jan 27, 2023 at 01:31:13PM +0100, Christian Brauner wrote: > From: Christian Brauner (Microsoft) <brauner@kernel.org> > > > On Fri, 20 Jan 2023 11:25:11 +0100, Giuseppe Scrivano wrote: > > This patch adds a new prctl called PR_HIDE_SELF_EXE which allows > > processes to hide their own /proc/*/exe file. When this prctl is > > used, every access to /proc/*/exe for the calling process will > > fail with ENOENT. > > > > This is useful for preventing issues like CVE-2019-5736, where an > > attacker can gain host root access by overwriting the binary > > in OCI runtimes through file-descriptor mishandling in containers. > > > > [...] > > Only needed for privileged sandboxes. The userspace mitigations Aleksa > and I did for the CVE in all affected runtimes back then are nifty but > complicated. The patch is a decent compromise. > Picking up this prctl() for now, > > [1/2] exec: add PR_HIDE_SELF_EXE prctl > commit: 673301182d473ef61a98c292cf64650c73117172 > [2/2] selftests: add tests for prctl(SET_HIDE_SELF_EXE) > commit: bafa339eda3f79d567386e1fae59bb0537156c96 Thanks! I'm late to the party, but I came to the same conclusion as you did. :) Reviewed-by: Kees Cook <keescook@chromium.org>
On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote: > > After reading some comments on the LWN.net article, I wonder if > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user > namespace or if in this case root should keep the privilege to inspect > the binary of a process. If a container runs with that many privileges > then it has already other ways to damage the host anyway. Right, that's what I was trying to express with the "make it work the same as map_files". Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason.
On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote: > > > On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote: > > > > After reading some comments on the LWN.net article, I wonder if > > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user > > namespace or if in this case root should keep the privilege to inspect > > the binary of a process. If a container runs with that many privileges > > then it has already other ways to damage the host anyway. > > Right, that's what I was trying to express with the "make it work the same as map_files". Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason. If this can be circumvented via CAP_SYS_ADMIN then this mitigation becomes immediately way less interesting because the userspace mitigation we came up with protects against CAP_SYS_ADMIN as well without any regression risk. At which point this is only useful for some privileged sandboxes at what point this isn't worth it. I'm still looking at userspace codebases to ensure that this is a change we can risk in general as this has the potential to prevent criu from dumping such processes. I'll talk to them tomorrow anyway.
On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote: > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote: >> >> >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote: >> > >> > After reading some comments on the LWN.net article, I wonder if >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user >> > namespace or if in this case root should keep the privilege to inspect >> > the binary of a process. If a container runs with that many privileges >> > then it has already other ways to damage the host anyway. >> >> Right, that's what I was trying to express with the "make it work the same as map_files". Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason. > > If this can be circumvented via CAP_SYS_ADMIN To be clear, I'm proposing CAP_SYS_ADMIN in the current user namespace at the time of the prctl(). (Or if keeping around a reference just for this is too problematic, perhaps hardcoding to the init ns) A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary. > then this mitigation > becomes immediately way less interesting because the userspace > mitigation we came up with protects against CAP_SYS_ADMIN as well > without any regression risk. The userspace mitigation here being "clone self to memfd"? But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/
On Sun, Jan 29, 2023 at 01:12:45PM -0500, Colin Walters wrote: > > > On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote: > > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote: > >> > >> > >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote: > >> > > >> > After reading some comments on the LWN.net article, I wonder if > >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user > >> > namespace or if in this case root should keep the privilege to inspect > >> > the binary of a process. If a container runs with that many privileges > >> > then it has already other ways to damage the host anyway. > >> > >> Right, that's what I was trying to express with the "make it work the same as map_files". Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason. > > > > If this can be circumvented via CAP_SYS_ADMIN > > To be clear, I'm proposing CAP_SYS_ADMIN in the current user namespace at the time of the prctl(). (Or if keeping around a reference just for this is too problematic, perhaps hardcoding to the init ns) Oh no, I fully understand. The point was that the userspace fix protects even against attackers with CAP_SYS_ADMIN in init_user_ns. And that was important back then and is still relevant today for some workloads. For unprivileged containers where host and container are separate by a meaningful user namespace boundary this whole mitigation is irrelevant as the binary can't be overwritten. > > A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary. > > > then this mitigation > > becomes immediately way less interesting because the userspace > > mitigation we came up with protects against CAP_SYS_ADMIN as well > > without any regression risk. > > The userspace mitigation here being "clone self to memfd"? But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/ But this is a problem with the memfd api not with the fix. Following the thread the ability to create executable memfds will stay around. As it should be given how long this has been supported. And they have backward compatibility in mind which is great.
On Mon, Jan 30, 2023 at 10:53:31AM +0100, Christian Brauner wrote: > On Sun, Jan 29, 2023 at 01:12:45PM -0500, Colin Walters wrote: > > > > > > On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote: > > > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote: > > >> > > >> > > >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote: > > >> > > > >> > After reading some comments on the LWN.net article, I wonder if > > >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user > > >> > namespace or if in this case root should keep the privilege to inspect > > >> > the binary of a process. If a container runs with that many privileges > > >> > then it has already other ways to damage the host anyway. > > >> > > >> Right, that's what I was trying to express with the "make it work the same as map_files". Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason. > > > > > > If this can be circumvented via CAP_SYS_ADMIN > > > > To be clear, I'm proposing CAP_SYS_ADMIN in the current user namespace at the time of the prctl(). (Or if keeping around a reference just for this is too problematic, perhaps hardcoding to the init ns) > > Oh no, I fully understand. The point was that the userspace fix protects > even against attackers with CAP_SYS_ADMIN in init_user_ns. And that was > important back then and is still relevant today for some workloads. > > For unprivileged containers where host and container are separate by a > meaningful user namespace boundary this whole mitigation is irrelevant > as the binary can't be overwritten. > > > > > A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary. > > > > > then this mitigation > > > becomes immediately way less interesting because the userspace > > > mitigation we came up with protects against CAP_SYS_ADMIN as well > > > without any regression risk. > > > > The userspace mitigation here being "clone self to memfd"? But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/ > > But this is a problem with the memfd api not with the fix. Following the > thread the ability to create executable memfds will stay around. As it > should be given how long this has been supported. And they have backward > compatibility in mind which is great. Following up from yesterday's promise to check with the criu org I'm part of: this is going to break criu unforunately as it dumps (and restores) /proc/self/exe. Even with an escape hatch we'd still risk breaking it. Whereas again, the memfd solution doesn't cause those issues. Don't get me wrong it's pretty obvious that I was pretty supportive of this fix especially because it looked rather simple but this is turning out to be less simple than we tought. I don't think that this is worth it given the functioning fixes we already have. The good thing is that - even if it will take a longer - that Aleksa's patchset will provide a more general solution by making it possible for runc/crun/lxc to open the target binary with a restricted upgrade mask making it impossible to open the binary read-write again. This won't break criu and will fix this issue and is generally useful.
On Mon, Jan 30, 2023, at 5:06 AM, Christian Brauner wrote: > The good thing is that - even if it will take a longer - that Aleksa's > patchset will provide a more general solution by making it possible for > runc/crun/lxc to open the target binary with a restricted upgrade mask > making it impossible to open the binary read-write again. This won't > break criu and will fix this issue and is generally useful. Had to go back up thread more carefully; looking at the referenced commits now in https://github.com/cyphar/linux/commits/magiclink/open_how-reopen I do agree that that direction is the most elegant. The main downside I can think of is the potential blast radius is larger, and more nontrivial code. But...in practice I guess today for the runc/crun type attacks today there are commonly multiple mitigations (e.g. read-only rootfs, multiple LSMs, and finally the memfd copy fallback) so we can probably wait for that patchset to land. In short: agreed!
Christian Brauner <brauner@kernel.org> writes: > On Mon, Jan 30, 2023 at 10:53:31AM +0100, Christian Brauner wrote: >> On Sun, Jan 29, 2023 at 01:12:45PM -0500, Colin Walters wrote: >> > >> > >> > On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote: >> > > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote: >> > >> >> > >> >> > >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote: >> > >> > >> > >> > After reading some comments on the LWN.net article, I wonder if >> > >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user >> > >> > namespace or if in this case root should keep the privilege to inspect >> > >> > the binary of a process. If a container runs with that many privileges >> > >> > then it has already other ways to damage the host anyway. >> > >> >> > >> Right, that's what I was trying to express with the "make it >> > >> work the same as map_files". Hiding the entry entirely even >> > >> for initial-namespace-root (real root) seems like it's going to >> > >> potentially confuse profiling/tracing/debugging tools for no >> > >> good reason. >> > > >> > > If this can be circumvented via CAP_SYS_ADMIN >> > >> > To be clear, I'm proposing CAP_SYS_ADMIN in the current user >> > namespace at the time of the prctl(). (Or if keeping around a >> > reference just for this is too problematic, perhaps hardcoding to >> > the init ns) >> >> Oh no, I fully understand. The point was that the userspace fix protects >> even against attackers with CAP_SYS_ADMIN in init_user_ns. And that was >> important back then and is still relevant today for some workloads. >> >> For unprivileged containers where host and container are separate by a >> meaningful user namespace boundary this whole mitigation is irrelevant >> as the binary can't be overwritten. >> >> > >> > A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary. >> > >> > > then this mitigation >> > > becomes immediately way less interesting because the userspace >> > > mitigation we came up with protects against CAP_SYS_ADMIN as well >> > > without any regression risk. >> > >> > The userspace mitigation here being "clone self to memfd"? But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/ >> >> But this is a problem with the memfd api not with the fix. Following the >> thread the ability to create executable memfds will stay around. As it >> should be given how long this has been supported. And they have backward >> compatibility in mind which is great. > > Following up from yesterday's promise to check with the criu org I'm > part of: this is going to break criu unforunately as it dumps (and > restores) /proc/self/exe. Even with an escape hatch we'd still risk > breaking it. Whereas again, the memfd solution doesn't cause those > issues. > > Don't get me wrong it's pretty obvious that I was pretty supportive of > this fix especially because it looked rather simple but this is turning > out to be less simple than we tought. I don't think that this is worth > it given the functioning fixes we already have. > > The good thing is that - even if it will take a longer - that Aleksa's > patchset will provide a more general solution by making it possible for > runc/crun/lxc to open the target binary with a restricted upgrade mask > making it impossible to open the binary read-write again. This won't > break criu and will fix this issue and is generally useful. I was not aware that running with CAP_SYS_ADMIN in the initial userns was considered as a use case, but in this case don't we need to protect /proc/$PID/map_files as well or do we rely only on randomize_va_space? It is a more difficult to guess the name but we can still exec these files and grab a reference to them. The current patch I've proposed is probably a too big hammer for the small issue we really have: other processes from the container are already blocked by PR_SET_DUMPABLE unless CAP_SYS_PTRACE is granted; but if CAP_SYS_PTRACE is granted then it seems already vulnerable today since processes from the container can just read the /proc/PID/map_files files without even requiring the exec trick. So the only hole left, that I can see, is that the container runtime can be tricked to exec /proc/self/exe (or /proc/self/map_files/*) and from there open a reference to the binary. Could we just restrict the usage to the current thread group? That won't affect in any way other processes. The patch won't be too much more complicated, we just need to amend the following fix: diff --git a/fs/proc/base.c b/fs/proc/base.c index e9127084b82a..2f5c5ed2dae8 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1723,6 +1723,7 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path) { struct task_struct *task; struct file *exe_file; + bool is_same_tgroup; long hide_self_exe; task = get_proc_task(d_inode(dentry)); @@ -1730,8 +1731,9 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path) return -ENOENT; exe_file = get_task_exe_file(task); hide_self_exe = task_hide_self_exe(task); + is_same_tgroup = same_thread_group(current, task); put_task_struct(task); - if (hide_self_exe) + if (hide_self_exe && is_same_tgroup) return -EPERM; if (exe_file) { *exe_path = exe_file->f_path; Would that be sufficient or are there other ways to attack it? Given the premise about CAP_SYS_ADMIN (and even more loosen CAP_CHECKPOINT_RESTORE in the *current* user namespace), I think we probably need a similar protection fo /proc/PID/map_files. Regards, Giuseppe
On Mon, Jan 30, 2023 at 11:06:02AM +0100, Christian Brauner wrote: > On Mon, Jan 30, 2023 at 10:53:31AM +0100, Christian Brauner wrote: > > On Sun, Jan 29, 2023 at 01:12:45PM -0500, Colin Walters wrote: > > > > > > > > > On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote: > > > > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote: > > > >> > > > >> > > > >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote: > > > >> > > > > >> > After reading some comments on the LWN.net article, I wonder if > > > >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user > > > >> > namespace or if in this case root should keep the privilege to inspect > > > >> > the binary of a process. If a container runs with that many privileges > > > >> > then it has already other ways to damage the host anyway. > > > >> > > > >> Right, that's what I was trying to express with the "make it work the same as map_files". Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason. > > > > > > > > If this can be circumvented via CAP_SYS_ADMIN > > > > > > To be clear, I'm proposing CAP_SYS_ADMIN in the current user namespace at the time of the prctl(). (Or if keeping around a reference just for this is too problematic, perhaps hardcoding to the init ns) > > > > Oh no, I fully understand. The point was that the userspace fix protects > > even against attackers with CAP_SYS_ADMIN in init_user_ns. And that was > > important back then and is still relevant today for some workloads. > > > > For unprivileged containers where host and container are separate by a > > meaningful user namespace boundary this whole mitigation is irrelevant > > as the binary can't be overwritten. > > > > > > > > A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary. > > > > > > > then this mitigation > > > > becomes immediately way less interesting because the userspace > > > > mitigation we came up with protects against CAP_SYS_ADMIN as well > > > > without any regression risk. > > > > > > The userspace mitigation here being "clone self to memfd"? But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/ > > > > But this is a problem with the memfd api not with the fix. Following the > > thread the ability to create executable memfds will stay around. As it > > should be given how long this has been supported. And they have backward > > compatibility in mind which is great. > > Following up from yesterday's promise to check with the criu org I'm > part of: this is going to break criu unforunately as it dumps (and > restores) /proc/self/exe. Even with an escape hatch we'd still risk > breaking it. Whereas again, the memfd solution doesn't cause those > issues. > > Don't get me wrong it's pretty obvious that I was pretty supportive of > this fix especially because it looked rather simple but this is turning > out to be less simple than we tought. I don't think that this is worth > it given the functioning fixes we already have. btw: can we use PR_SET_MM_EXE_FILE or PR_SET_MM_MAP (prctl_map.exe_fd) to set a dummy exe. Will it have the required effect? > > The good thing is that - even if it will take a longer - that Aleksa's > patchset will provide a more general solution by making it possible for > runc/crun/lxc to open the target binary with a restricted upgrade mask > making it impossible to open the binary read-write again. This won't > break criu and will fix this issue and is generally useful.
Andrei Vagin <avagin@gmail.com> writes: > On Mon, Jan 30, 2023 at 11:06:02AM +0100, Christian Brauner wrote: >> On Mon, Jan 30, 2023 at 10:53:31AM +0100, Christian Brauner wrote: >> > On Sun, Jan 29, 2023 at 01:12:45PM -0500, Colin Walters wrote: >> > > >> > > >> > > On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote: >> > > > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote: >> > > >> >> > > >> >> > > >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote: >> > > >> > >> > > >> > After reading some comments on the LWN.net article, I wonder if >> > > >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user >> > > >> > namespace or if in this case root should keep the privilege to inspect >> > > >> > the binary of a process. If a container runs with that many privileges >> > > >> > then it has already other ways to damage the host anyway. >> > > >> >> > > >> Right, that's what I was trying to express with the "make it >> > > >> work the same as map_files". Hiding the entry entirely even >> > > >> for initial-namespace-root (real root) seems like it's going >> > > >> to potentially confuse profiling/tracing/debugging tools for >> > > >> no good reason. >> > > > >> > > > If this can be circumvented via CAP_SYS_ADMIN >> > > >> > > To be clear, I'm proposing CAP_SYS_ADMIN in the current user >> > > namespace at the time of the prctl(). (Or if keeping around a >> > > reference just for this is too problematic, perhaps hardcoding >> > > to the init ns) >> > >> > Oh no, I fully understand. The point was that the userspace fix protects >> > even against attackers with CAP_SYS_ADMIN in init_user_ns. And that was >> > important back then and is still relevant today for some workloads. >> > >> > For unprivileged containers where host and container are separate by a >> > meaningful user namespace boundary this whole mitigation is irrelevant >> > as the binary can't be overwritten. >> > >> > > >> > > A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary. >> > > >> > > > then this mitigation >> > > > becomes immediately way less interesting because the userspace >> > > > mitigation we came up with protects against CAP_SYS_ADMIN as well >> > > > without any regression risk. >> > > >> > > The userspace mitigation here being "clone self to memfd"? But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/ >> > >> > But this is a problem with the memfd api not with the fix. Following the >> > thread the ability to create executable memfds will stay around. As it >> > should be given how long this has been supported. And they have backward >> > compatibility in mind which is great. >> >> Following up from yesterday's promise to check with the criu org I'm >> part of: this is going to break criu unforunately as it dumps (and >> restores) /proc/self/exe. Even with an escape hatch we'd still risk >> breaking it. Whereas again, the memfd solution doesn't cause those >> issues. >> >> Don't get me wrong it's pretty obvious that I was pretty supportive of >> this fix especially because it looked rather simple but this is turning >> out to be less simple than we tought. I don't think that this is worth >> it given the functioning fixes we already have. > > btw: can we use PR_SET_MM_EXE_FILE or PR_SET_MM_MAP (prctl_map.exe_fd) to > set a dummy exe. Will it have the required effect? if I am understanding it correctly, that seems a bit more complicated, we first need to unmap the current executable and then replace it with its copy? Creating the dummy exe could also be a problem since we need a new copy each time we want to hide the executable. Or are you suggesting using it differently? Thanks, Giuseppe >> The good thing is that - even if it will take a longer - that Aleksa's >> patchset will provide a more general solution by making it possible for >> runc/crun/lxc to open the target binary with a restricted upgrade mask >> making it impossible to open the binary read-write again. This won't >> break criu and will fix this issue and is generally useful.
diff --git a/fs/exec.c b/fs/exec.c index ab913243a367..5a5dd964c3a3 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1855,6 +1855,7 @@ static int bprm_execve(struct linux_binprm *bprm, /* execve succeeded */ current->fs->in_exec = 0; current->in_execve = 0; + task_clear_hide_self_exe(current); rseq_execve(current); acct_update_integrals(current); task_numa_free(current, false); diff --git a/fs/proc/base.c b/fs/proc/base.c index 9e479d7d202b..959968e2da0d 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1723,19 +1723,21 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path) { struct task_struct *task; struct file *exe_file; + long hide_self_exe; task = get_proc_task(d_inode(dentry)); if (!task) return -ENOENT; exe_file = get_task_exe_file(task); + hide_self_exe = task_hide_self_exe(task); put_task_struct(task); - if (exe_file) { + if (exe_file && !hide_self_exe) { *exe_path = exe_file->f_path; path_get(&exe_file->f_path); fput(exe_file); return 0; - } else - return -ENOENT; + } + return -ENOENT; } static const char *proc_pid_get_link(struct dentry *dentry, diff --git a/include/linux/sched.h b/include/linux/sched.h index 853d08f7562b..8db32d5fc285 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1790,6 +1790,7 @@ static __always_inline bool is_percpu_thread(void) #define PFA_SPEC_IB_DISABLE 5 /* Indirect branch speculation restricted */ #define PFA_SPEC_IB_FORCE_DISABLE 6 /* Indirect branch speculation permanently restricted */ #define PFA_SPEC_SSB_NOEXEC 7 /* Speculative Store Bypass clear on execve() */ +#define PFA_HIDE_SELF_EXE 8 /* Hide /proc/self/exe for the process */ #define TASK_PFA_TEST(name, func) \ static inline bool task_##func(struct task_struct *p) \ @@ -1832,6 +1833,10 @@ TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable) TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable) TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable) +TASK_PFA_TEST(HIDE_SELF_EXE, hide_self_exe) +TASK_PFA_SET(HIDE_SELF_EXE, hide_self_exe) +TASK_PFA_CLEAR(HIDE_SELF_EXE, hide_self_exe) + static inline void current_restore_flags(unsigned long orig_flags, unsigned long flags) { diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index a5e06dcbba13..f12f3df12468 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -284,4 +284,7 @@ struct prctl_mm_map { #define PR_SET_VMA 0x53564d41 # define PR_SET_VMA_ANON_NAME 0 +#define PR_SET_HIDE_SELF_EXE 65 +#define PR_GET_HIDE_SELF_EXE 66 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/sys.c b/kernel/sys.c index 5fd54bf0e886..e992f1b72973 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2626,6 +2626,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_SET_VMA: error = prctl_set_vma(arg2, arg3, arg4, arg5); break; + case PR_SET_HIDE_SELF_EXE: + if (arg2 != 1 || arg3 || arg4 || arg5) + return -EINVAL; + task_set_hide_self_exe(current); + break; + case PR_GET_HIDE_SELF_EXE: + if (arg2 || arg3 || arg4 || arg5) + return -EINVAL; + return task_hide_self_exe(current) ? 1 : 0; default: error = -EINVAL; break; diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h index a5e06dcbba13..f12f3df12468 100644 --- a/tools/include/uapi/linux/prctl.h +++ b/tools/include/uapi/linux/prctl.h @@ -284,4 +284,7 @@ struct prctl_mm_map { #define PR_SET_VMA 0x53564d41 # define PR_SET_VMA_ANON_NAME 0 +#define PR_SET_HIDE_SELF_EXE 65 +#define PR_GET_HIDE_SELF_EXE 66 + #endif /* _LINUX_PRCTL_H */