[v3,1/2] exec: add PR_HIDE_SELF_EXE prctl

Message ID 20230120102512.3195094-1-gscrivan@redhat.com
State New
Headers
Series [v3,1/2] exec: add PR_HIDE_SELF_EXE prctl |

Commit Message

Giuseppe Scrivano Jan. 20, 2023, 10:25 a.m. UTC
  This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
processes to hide their own /proc/*/exe file. When this prctl is
used, every access to /proc/*/exe for the calling process will
fail with ENOENT.

This is useful for preventing issues like CVE-2019-5736, where an
attacker can gain host root access by overwriting the binary
in OCI runtimes through file-descriptor mishandling in containers.

The current fix for CVE-2019-5736 is to create a read-only copy or
a bind-mount of the current executable, and then re-exec the current
process.  With the new prctl, the read-only copy or bind-mount copy is
not needed anymore.

While map_files/ also might contain symlinks to files in host,
proc_map_files_get_link() permissions checks are already sufficient.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---
v2: https://lkml.org/lkml/2023/1/19/849

Differences from v2:

- fixed the test to check PR_SET_HIDE_SELF_EXE after fork

v1: https://lkml.org/lkml/2023/1/4/334

Differences from v1:

- amended more information in the commit message wrt map_files not
  requiring the same protection.
- changed the test to verify PR_HIDE_SELF_EXE cannot be unset after
  a fork.

fs/exec.c                        | 1 +
 fs/proc/base.c                   | 8 +++++---
 include/linux/sched.h            | 5 +++++
 include/uapi/linux/prctl.h       | 3 +++
 kernel/sys.c                     | 9 +++++++++
 tools/include/uapi/linux/prctl.h | 3 +++
 6 files changed, 26 insertions(+), 3 deletions(-)
  

Comments

Colin Walters Jan. 23, 2023, 6:41 p.m. UTC | #1
On Fri, Jan 20, 2023, at 5:25 AM, Giuseppe Scrivano wrote:
> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
> processes to hide their own /proc/*/exe file. When this prctl is
> used, every access to /proc/*/exe for the calling process will
> fail with ENOENT.

How about a mount option for procfs like `mount -t procfs procfs /proc -o rw,nosuid,nodev,magiclink-no-xdev`

Where `magiclink-no-xdev` would cause all magic links to fail to cross a pid namespace or so?
  
Giuseppe Scrivano Jan. 23, 2023, 7:21 p.m. UTC | #2
"Colin Walters" <walters@verbum.org> writes:

> On Fri, Jan 20, 2023, at 5:25 AM, Giuseppe Scrivano wrote:
>> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
>> processes to hide their own /proc/*/exe file. When this prctl is
>> used, every access to /proc/*/exe for the calling process will
>> fail with ENOENT.
>
> How about a mount option for procfs like `mount -t procfs procfs /proc -o rw,nosuid,nodev,magiclink-no-xdev`
>
> Where `magiclink-no-xdev` would cause all magic links to fail to cross a pid namespace or so?

wouldn't that break also stuff like "/proc/self/fd/$FD" after you join a
different PID namespace?
  
Colin Walters Jan. 23, 2023, 10:07 p.m. UTC | #3
On Mon, Jan 23, 2023, at 2:21 PM, Giuseppe Scrivano wrote:
> "Colin Walters" <walters@verbum.org> writes:
>
>> On Fri, Jan 20, 2023, at 5:25 AM, Giuseppe Scrivano wrote:
>>> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
>>> processes to hide their own /proc/*/exe file. When this prctl is
>>> used, every access to /proc/*/exe for the calling process will
>>> fail with ENOENT.
>>
>> How about a mount option for procfs like `mount -t procfs procfs /proc -o rw,nosuid,nodev,magiclink-no-xdev`
>>
>> Where `magiclink-no-xdev` would cause all magic links to fail to cross a pid namespace or so?
>
> wouldn't that break also stuff like "/proc/self/fd/$FD" after you join a
> different PID namespace?

Ah, right.  Hmm.  But building on the reply to
https://lwn.net/Articles/920876/
maybe an opt-in flag like `-o magiclink-restricted=/proc/<pid>/ns/pid` that required callers to have CAP_SYS_ADMIN in the referenced namespace?  Then things like crun/runc would open a fd for `/proc/self/ns/pid` when they start (usually, this is the init ns), then pass the reference to that fd into magiclink-restricted.

There may be a more elegant userspace API here; I think my overall point is reiterating what I mentioned in 
https://github.com/containers/crun/pull/1105#issuecomment-1360085059

"A minor worry I have with both is that any namespacing-based approach that controls visibility in /proc runs the risk of someone adding a new way to get to the binary; maybe it's something like us having a fd for ourselves open under /proc/pid/fd/ or so."

So instead of hiding just /proc/self/exe, we add some sort of API that aims to restrict all other magic links that may be added in the future.

The history of map_files is interesting here; it looks like https://github.com/torvalds/linux/commit/bdb4d100afe9818aebd1d98ced575c5ef143456c introduces a comment that says

"* Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
 * symlinks may be used to bypass permissions on ancestor directories in the
 * path to the file in question."

yet isn't there an inconsistency here in not applying the same restrictions on /proc/self/exe?

Or another way to look at this is that if we were to add some sort of API like this on /proc, we'd also change the proc_maps code to also honor it *instead* if present instead of limiting to the init ns?
  
Giuseppe Scrivano Jan. 23, 2023, 10:54 p.m. UTC | #4
"Colin Walters" <walters@verbum.org> writes:

> On Mon, Jan 23, 2023, at 2:21 PM, Giuseppe Scrivano wrote:
>> "Colin Walters" <walters@verbum.org> writes:
>>
>>> On Fri, Jan 20, 2023, at 5:25 AM, Giuseppe Scrivano wrote:
>>>> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
>>>> processes to hide their own /proc/*/exe file. When this prctl is
>>>> used, every access to /proc/*/exe for the calling process will
>>>> fail with ENOENT.
>>>
>>> How about a mount option for procfs like `mount -t procfs procfs /proc -o rw,nosuid,nodev,magiclink-no-xdev`
>>>
>>> Where `magiclink-no-xdev` would cause all magic links to fail to cross a pid namespace or so?
>>
>> wouldn't that break also stuff like "/proc/self/fd/$FD" after you join a
>> different PID namespace?
>
> Ah, right.  Hmm.  But building on the reply to
> https://lwn.net/Articles/920876/
> maybe an opt-in flag like `-o magiclink-restricted=/proc/<pid>/ns/pid`
> that required callers to have CAP_SYS_ADMIN in the referenced
> namespace?  Then things like crun/runc would open a fd for
> `/proc/self/ns/pid` when they start (usually, this is the init ns),
> then pass the reference to that fd into magiclink-restricted.
>
> There may be a more elegant userspace API here; I think my overall point is reiterating what I mentioned in 
> https://github.com/containers/crun/pull/1105#issuecomment-1360085059
>
> "A minor worry I have with both is that any namespacing-based approach
> that controls visibility in /proc runs the risk of someone adding a
> new way to get to the binary; maybe it's something like us having a fd
> for ourselves open under /proc/pid/fd/ or so."
>
> So instead of hiding just /proc/self/exe, we add some sort of API that aims to restrict all other magic links that may be added in the future.
>
> The history of map_files is interesting here; it looks like https://github.com/torvalds/linux/commit/bdb4d100afe9818aebd1d98ced575c5ef143456c introduces a comment that says
>
> "* Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
>  * symlinks may be used to bypass permissions on ancestor directories in the
>  * path to the file in question."
>
> yet isn't there an inconsistency here in not applying the same restrictions on /proc/self/exe?
>
> Or another way to look at this is that if we were to add some sort of
> API like this on /proc, we'd also change the proc_maps code to also
> honor it *instead* if present instead of limiting to the init ns?

I realize it seems like a one-off fix, but it is done only for backward
compatibility.

Other paths under /proc/self/map_files require CAP_SYS_ADMIN in the
initial user namespace, or have CAP_CHECKPOINT_RESTORE in the user
namespace.  Sure, it is not future-proof, but it would look weird if
after CVE-2019-19814 there will be more ways to access files from the
host without requiring some capabilities.

One problem with having it as a mount option for procfs (so for the
/proc in the container) is that then the container runtime has the
the burden to verify that there are no other procfs mounts accessible
from the mount namespace as well as to check that the mount option is
available before attempting to use it from the container.  Potentially
it could also be a TOCTOU race since a proc mount could appear later,
e.g. parent mount with shared propagation.  If the runtime finds out
that another procfs is reachable, then it would have to fallback to
re-exec itself much later than what we do today.

With the prctl a runtime would just need to do the following and live
happily ever after:

__attribute__ ((constructor)) static void hide_self_exe (void)
{
	if (prctl(PR_SET_HIDE_SELF_EXE, 1, 0, 0, 0) == 0)
		return;

	/* ...reexec as we do today... */
}

and we won't have to worry about what mount options are supported or
used by proc.
  
Colin Walters Jan. 23, 2023, 11:14 p.m. UTC | #5
On Mon, Jan 23, 2023, at 5:54 PM, Giuseppe Scrivano wrote:
>
> I realize it seems like a one-off fix, but it is done only for backward
> compatibility.
>
> Other paths under /proc/self/map_files require CAP_SYS_ADMIN in the
> initial user namespace, or have CAP_CHECKPOINT_RESTORE in the user
> namespace.  Sure, it is not future-proof, but it would look weird if
> after CVE-2019-19814 there will be more ways to access files from the
> host without requiring some capabilities.

I think a way to rephrase what I'm saying is that it feels like this should be about making /proc/self/exe and /proc/self/map_files consistent.

> With the prctl a runtime would just need to do the following and live
> happily ever after:
>
> __attribute__ ((constructor)) static void hide_self_exe (void)
> {
> 	if (prctl(PR_SET_HIDE_SELF_EXE, 1, 0, 0, 0) == 0)
> 		return;
>
> 	/* ...reexec as we do today... */
> }
>
> and we won't have to worry about what mount options are supported or
> used by proc.

Yeah, OK - having the logical operation be on the process and not the view into it (procfs) definitely is more robust.

But how about calling this PR_SET_PROCFS_RESTRICTED or so, and then *also* changing the /proc/self/map_files lookup to deny if this is set?  Yes, it'd be mostly redundant, but it'd help clarify things for any future changes since it'd be clear that the logic *should* be consistent for /proc/self/exe and /proc/self/map_files.  And actually as a bonus, it would make the case of e.g. `podman run --cap-add=checkpoint_restore` secure right?  (Though honestly I don't know how common that is or whether one can practically use checkpoint_restore without other caps)
  
Aleksa Sarai Jan. 24, 2023, 1:53 a.m. UTC | #6
On 2023-01-20, Giuseppe Scrivano <gscrivan@redhat.com> wrote:
> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
> processes to hide their own /proc/*/exe file. When this prctl is
> used, every access to /proc/*/exe for the calling process will
> fail with ENOENT.
> 
> This is useful for preventing issues like CVE-2019-5736, where an
> attacker can gain host root access by overwriting the binary
> in OCI runtimes through file-descriptor mishandling in containers.
> 
> The current fix for CVE-2019-5736 is to create a read-only copy or
> a bind-mount of the current executable, and then re-exec the current
> process.  With the new prctl, the read-only copy or bind-mount copy is
> not needed anymore.
> 
> While map_files/ also might contain symlinks to files in host,
> proc_map_files_get_link() permissions checks are already sufficient.

I suspect this doesn't protect against the execve("/proc/self/exe")
tactic (because it clears the bit on execve), so I'm not sure this is
much safer than PR_SET_DUMPABLE (yeah, it stops root in the source
userns from accessing /proc/$pid/exe but the above attack makes that no
longer that important).

I think the only way to fix this properly is by blocking re-opens of
magic links that have more permissions than they originally did. I just
got back from vacation, but I'm working on fixing up [1] so it's ready
to be an RFC so we can close this hole once and for all.

[1]: https://github.com/cyphar/linux/tree/magiclink/open_how-reopen

> 
> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
> ---
> v2: https://lkml.org/lkml/2023/1/19/849
> 
> Differences from v2:
> 
> - fixed the test to check PR_SET_HIDE_SELF_EXE after fork
> 
> v1: https://lkml.org/lkml/2023/1/4/334
> 
> Differences from v1:
> 
> - amended more information in the commit message wrt map_files not
>   requiring the same protection.
> - changed the test to verify PR_HIDE_SELF_EXE cannot be unset after
>   a fork.
> 
> fs/exec.c                        | 1 +
>  fs/proc/base.c                   | 8 +++++---
>  include/linux/sched.h            | 5 +++++
>  include/uapi/linux/prctl.h       | 3 +++
>  kernel/sys.c                     | 9 +++++++++
>  tools/include/uapi/linux/prctl.h | 3 +++
>  6 files changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index ab913243a367..5a5dd964c3a3 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1855,6 +1855,7 @@ static int bprm_execve(struct linux_binprm *bprm,
>  	/* execve succeeded */
>  	current->fs->in_exec = 0;
>  	current->in_execve = 0;
> +	task_clear_hide_self_exe(current);
>  	rseq_execve(current);
>  	acct_update_integrals(current);
>  	task_numa_free(current, false);
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 9e479d7d202b..959968e2da0d 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1723,19 +1723,21 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
>  {
>  	struct task_struct *task;
>  	struct file *exe_file;
> +	long hide_self_exe;
>  
>  	task = get_proc_task(d_inode(dentry));
>  	if (!task)
>  		return -ENOENT;
>  	exe_file = get_task_exe_file(task);
> +	hide_self_exe = task_hide_self_exe(task);
>  	put_task_struct(task);
> -	if (exe_file) {
> +	if (exe_file && !hide_self_exe) {
>  		*exe_path = exe_file->f_path;
>  		path_get(&exe_file->f_path);
>  		fput(exe_file);
>  		return 0;
> -	} else
> -		return -ENOENT;
> +	}
> +	return -ENOENT;
>  }
>  
>  static const char *proc_pid_get_link(struct dentry *dentry,
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 853d08f7562b..8db32d5fc285 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1790,6 +1790,7 @@ static __always_inline bool is_percpu_thread(void)
>  #define PFA_SPEC_IB_DISABLE		5	/* Indirect branch speculation restricted */
>  #define PFA_SPEC_IB_FORCE_DISABLE	6	/* Indirect branch speculation permanently restricted */
>  #define PFA_SPEC_SSB_NOEXEC		7	/* Speculative Store Bypass clear on execve() */
> +#define PFA_HIDE_SELF_EXE		8	/* Hide /proc/self/exe for the process */
>  
>  #define TASK_PFA_TEST(name, func)					\
>  	static inline bool task_##func(struct task_struct *p)		\
> @@ -1832,6 +1833,10 @@ TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable)
>  TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
>  TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
>  
> +TASK_PFA_TEST(HIDE_SELF_EXE, hide_self_exe)
> +TASK_PFA_SET(HIDE_SELF_EXE, hide_self_exe)
> +TASK_PFA_CLEAR(HIDE_SELF_EXE, hide_self_exe)
> +
>  static inline void
>  current_restore_flags(unsigned long orig_flags, unsigned long flags)
>  {
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index a5e06dcbba13..f12f3df12468 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -284,4 +284,7 @@ struct prctl_mm_map {
>  #define PR_SET_VMA		0x53564d41
>  # define PR_SET_VMA_ANON_NAME		0
>  
> +#define PR_SET_HIDE_SELF_EXE		65
> +#define PR_GET_HIDE_SELF_EXE		66
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 5fd54bf0e886..e992f1b72973 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2626,6 +2626,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  	case PR_SET_VMA:
>  		error = prctl_set_vma(arg2, arg3, arg4, arg5);
>  		break;
> +	case PR_SET_HIDE_SELF_EXE:
> +		if (arg2 != 1 || arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		task_set_hide_self_exe(current);
> +		break;
> +	case PR_GET_HIDE_SELF_EXE:
> +		if (arg2 || arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		return task_hide_self_exe(current) ? 1 : 0;
>  	default:
>  		error = -EINVAL;
>  		break;
> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
> index a5e06dcbba13..f12f3df12468 100644
> --- a/tools/include/uapi/linux/prctl.h
> +++ b/tools/include/uapi/linux/prctl.h
> @@ -284,4 +284,7 @@ struct prctl_mm_map {
>  #define PR_SET_VMA		0x53564d41
>  # define PR_SET_VMA_ANON_NAME		0
>  
> +#define PR_SET_HIDE_SELF_EXE		65
> +#define PR_GET_HIDE_SELF_EXE		66
> +
>  #endif /* _LINUX_PRCTL_H */
> -- 
> 2.38.1
>
  
Giuseppe Scrivano Jan. 24, 2023, 7:29 a.m. UTC | #7
Aleksa Sarai <cyphar@cyphar.com> writes:

> On 2023-01-20, Giuseppe Scrivano <gscrivan@redhat.com> wrote:
>> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
>> processes to hide their own /proc/*/exe file. When this prctl is
>> used, every access to /proc/*/exe for the calling process will
>> fail with ENOENT.
>> 
>> This is useful for preventing issues like CVE-2019-5736, where an
>> attacker can gain host root access by overwriting the binary
>> in OCI runtimes through file-descriptor mishandling in containers.
>> 
>> The current fix for CVE-2019-5736 is to create a read-only copy or
>> a bind-mount of the current executable, and then re-exec the current
>> process.  With the new prctl, the read-only copy or bind-mount copy is
>> not needed anymore.
>> 
>> While map_files/ also might contain symlinks to files in host,
>> proc_map_files_get_link() permissions checks are already sufficient.
>
> I suspect this doesn't protect against the execve("/proc/self/exe")
> tactic (because it clears the bit on execve), so I'm not sure this is
> much safer than PR_SET_DUMPABLE (yeah, it stops root in the source
> userns from accessing /proc/$pid/exe but the above attack makes that no
> longer that important).

it protects against that attack too.  It clears the bit _after_ the
execve() syscall is done.

If you attempt execve("/proc/self/exe") you still get ENOENT:

```
#include <stdlib.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

int main(void)
{
        int ret;

        ret = prctl(65, 1, 0, 0, 0);
        if (ret != 0)
                exit(1);

        execl("/proc/self/exe", "foo", NULL);
        exit(2);
}
```

# strace -e prctl,execve ./hide-self-exe
execve("./hide-self-exe", ["./hide-self-exe"], 0x7fff975a3690 /* 39 vars */) = 0
prctl(0x41 /* PR_??? */, 0x1, 0, 0, 0)  = 0
execve("/proc/self/exe", ["foo"], 0x7ffcf51868b8 /* 39 vars */) = -1 ENOENT (No such file or directory)
+++ exited with 2 +++

I've also tried execv'ing with a script that uses "#!/proc/self/exe" and
I get the same ENOENT.


>
> I think the only way to fix this properly is by blocking re-opens of
> magic links that have more permissions than they originally did. I just
> got back from vacation, but I'm working on fixing up [1] so it's ready
> to be an RFC so we can close this hole once and for all.

so that relies on the fact opening /proc/self/exe with O_WRONLY fails
with ETXTBSY?

> [1]: https://github.com/cyphar/linux/tree/magiclink/open_how-reopen
>
>> 
>> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
>> ---
>> v2: https://lkml.org/lkml/2023/1/19/849
>> 
>> Differences from v2:
>> 
>> - fixed the test to check PR_SET_HIDE_SELF_EXE after fork
>> 
>> v1: https://lkml.org/lkml/2023/1/4/334
>> 
>> Differences from v1:
>> 
>> - amended more information in the commit message wrt map_files not
>>   requiring the same protection.
>> - changed the test to verify PR_HIDE_SELF_EXE cannot be unset after
>>   a fork.
>> 
>> fs/exec.c                        | 1 +
>>  fs/proc/base.c                   | 8 +++++---
>>  include/linux/sched.h            | 5 +++++
>>  include/uapi/linux/prctl.h       | 3 +++
>>  kernel/sys.c                     | 9 +++++++++
>>  tools/include/uapi/linux/prctl.h | 3 +++
>>  6 files changed, 26 insertions(+), 3 deletions(-)
>> 
>> diff --git a/fs/exec.c b/fs/exec.c
>> index ab913243a367..5a5dd964c3a3 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1855,6 +1855,7 @@ static int bprm_execve(struct linux_binprm *bprm,
>>  	/* execve succeeded */
>>  	current->fs->in_exec = 0;
>>  	current->in_execve = 0;
>> +	task_clear_hide_self_exe(current);
>>  	rseq_execve(current);
>>  	acct_update_integrals(current);
>>  	task_numa_free(current, false);
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 9e479d7d202b..959968e2da0d 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -1723,19 +1723,21 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
>>  {
>>  	struct task_struct *task;
>>  	struct file *exe_file;
>> +	long hide_self_exe;
>>  
>>  	task = get_proc_task(d_inode(dentry));
>>  	if (!task)
>>  		return -ENOENT;
>>  	exe_file = get_task_exe_file(task);
>> +	hide_self_exe = task_hide_self_exe(task);
>>  	put_task_struct(task);
>> -	if (exe_file) {
>> +	if (exe_file && !hide_self_exe) {
>>  		*exe_path = exe_file->f_path;
>>  		path_get(&exe_file->f_path);
>>  		fput(exe_file);
>>  		return 0;
>> -	} else
>> -		return -ENOENT;
>> +	}
>> +	return -ENOENT;
>>  }
>>  
>>  static const char *proc_pid_get_link(struct dentry *dentry,
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 853d08f7562b..8db32d5fc285 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1790,6 +1790,7 @@ static __always_inline bool is_percpu_thread(void)
>>  #define PFA_SPEC_IB_DISABLE		5	/* Indirect branch speculation restricted */
>>  #define PFA_SPEC_IB_FORCE_DISABLE	6	/* Indirect branch speculation permanently restricted */
>>  #define PFA_SPEC_SSB_NOEXEC		7	/* Speculative Store Bypass clear on execve() */
>> +#define PFA_HIDE_SELF_EXE		8	/* Hide /proc/self/exe for the process */
>>  
>>  #define TASK_PFA_TEST(name, func)					\
>>  	static inline bool task_##func(struct task_struct *p)		\
>> @@ -1832,6 +1833,10 @@ TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable)
>>  TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
>>  TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
>>  
>> +TASK_PFA_TEST(HIDE_SELF_EXE, hide_self_exe)
>> +TASK_PFA_SET(HIDE_SELF_EXE, hide_self_exe)
>> +TASK_PFA_CLEAR(HIDE_SELF_EXE, hide_self_exe)
>> +
>>  static inline void
>>  current_restore_flags(unsigned long orig_flags, unsigned long flags)
>>  {
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index a5e06dcbba13..f12f3df12468 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -284,4 +284,7 @@ struct prctl_mm_map {
>>  #define PR_SET_VMA		0x53564d41
>>  # define PR_SET_VMA_ANON_NAME		0
>>  
>> +#define PR_SET_HIDE_SELF_EXE		65
>> +#define PR_GET_HIDE_SELF_EXE		66
>> +
>>  #endif /* _LINUX_PRCTL_H */
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 5fd54bf0e886..e992f1b72973 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -2626,6 +2626,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>  	case PR_SET_VMA:
>>  		error = prctl_set_vma(arg2, arg3, arg4, arg5);
>>  		break;
>> +	case PR_SET_HIDE_SELF_EXE:
>> +		if (arg2 != 1 || arg3 || arg4 || arg5)
>> +			return -EINVAL;
>> +		task_set_hide_self_exe(current);
>> +		break;
>> +	case PR_GET_HIDE_SELF_EXE:
>> +		if (arg2 || arg3 || arg4 || arg5)
>> +			return -EINVAL;
>> +		return task_hide_self_exe(current) ? 1 : 0;
>>  	default:
>>  		error = -EINVAL;
>>  		break;
>> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
>> index a5e06dcbba13..f12f3df12468 100644
>> --- a/tools/include/uapi/linux/prctl.h
>> +++ b/tools/include/uapi/linux/prctl.h
>> @@ -284,4 +284,7 @@ struct prctl_mm_map {
>>  #define PR_SET_VMA		0x53564d41
>>  # define PR_SET_VMA_ANON_NAME		0
>>  
>> +#define PR_SET_HIDE_SELF_EXE		65
>> +#define PR_GET_HIDE_SELF_EXE		66
>> +
>>  #endif /* _LINUX_PRCTL_H */
>> -- 
>> 2.38.1
>>
  
Andrei Vagin Jan. 24, 2023, 7:17 p.m. UTC | #8
On Mon, Jan 23, 2023 at 6:04 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2023-01-20, Giuseppe Scrivano <gscrivan@redhat.com> wrote:
> > This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
> > processes to hide their own /proc/*/exe file. When this prctl is
> > used, every access to /proc/*/exe for the calling process will
> > fail with ENOENT.
> >
> > This is useful for preventing issues like CVE-2019-5736, where an
> > attacker can gain host root access by overwriting the binary
> > in OCI runtimes through file-descriptor mishandling in containers.
> >
> > The current fix for CVE-2019-5736 is to create a read-only copy or
> > a bind-mount of the current executable, and then re-exec the current
> > process.  With the new prctl, the read-only copy or bind-mount copy is
> > not needed anymore.
> >
> > While map_files/ also might contain symlinks to files in host,
> > proc_map_files_get_link() permissions checks are already sufficient.
>
> I suspect this doesn't protect against the execve("/proc/self/exe")
> tactic (because it clears the bit on execve), so I'm not sure this is
> much safer than PR_SET_DUMPABLE (yeah, it stops root in the source
> userns from accessing /proc/$pid/exe but the above attack makes that no
> longer that important).
>
> I think the only way to fix this properly is by blocking re-opens of
> magic links that have more permissions than they originally did. I just
> got back from vacation, but I'm working on fixing up [1] so it's ready
> to be an RFC so we can close this hole once and for all.
>
> [1]: https://github.com/cyphar/linux/tree/magiclink/open_how-reopen

pls add me into cc when you will send this change. We need to be sure
that it doesn't break CRIU...

Thanks,
Andrei
  
Aleksa Sarai Jan. 25, 2023, 3:28 p.m. UTC | #9
On 2023-01-24, Giuseppe Scrivano <gscrivan@redhat.com> wrote:
> Aleksa Sarai <cyphar@cyphar.com> writes:
> 
> > On 2023-01-20, Giuseppe Scrivano <gscrivan@redhat.com> wrote:
> >> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
> >> processes to hide their own /proc/*/exe file. When this prctl is
> >> used, every access to /proc/*/exe for the calling process will
> >> fail with ENOENT.
> >> 
> >> This is useful for preventing issues like CVE-2019-5736, where an
> >> attacker can gain host root access by overwriting the binary
> >> in OCI runtimes through file-descriptor mishandling in containers.
> >> 
> >> The current fix for CVE-2019-5736 is to create a read-only copy or
> >> a bind-mount of the current executable, and then re-exec the current
> >> process.  With the new prctl, the read-only copy or bind-mount copy is
> >> not needed anymore.
> >> 
> >> While map_files/ also might contain symlinks to files in host,
> >> proc_map_files_get_link() permissions checks are already sufficient.
> >
> > I suspect this doesn't protect against the execve("/proc/self/exe")
> > tactic (because it clears the bit on execve), so I'm not sure this is
> > much safer than PR_SET_DUMPABLE (yeah, it stops root in the source
> > userns from accessing /proc/$pid/exe but the above attack makes that no
> > longer that important).
> 
> it protects against that attack too.  It clears the bit _after_ the
> execve() syscall is done.
> 
> If you attempt execve("/proc/self/exe") you still get ENOENT:
> 
> ```
> #include <stdlib.h>
> #include <stdio.h>
> #include <sys/prctl.h>
> #include <unistd.h>
> 
> int main(void)
> {
>         int ret;
> 
>         ret = prctl(65, 1, 0, 0, 0);
>         if (ret != 0)
>                 exit(1);
> 
>         execl("/proc/self/exe", "foo", NULL);
>         exit(2);
> }
> ```
> 
> # strace -e prctl,execve ./hide-self-exe
> execve("./hide-self-exe", ["./hide-self-exe"], 0x7fff975a3690 /* 39 vars */) = 0
> prctl(0x41 /* PR_??? */, 0x1, 0, 0, 0)  = 0
> execve("/proc/self/exe", ["foo"], 0x7ffcf51868b8 /* 39 vars */) = -1 ENOENT (No such file or directory)
> +++ exited with 2 +++
> 
> I've also tried execv'ing with a script that uses "#!/proc/self/exe" and
> I get the same ENOENT.

Ah, you're right. As you mentioned, you could still do the attack
through /proc/self/map_files but that would require you to know where
the binary will be located (and being non-dumpable blocks container
processes from doing tricks to get the right path).

I wonder if we should somehow require (or auto-apply) SUID_DUMP_NONE
when setting this prctl, since it does currently depend on it to be
properly secure...

> > I think the only way to fix this properly is by blocking re-opens of
> > magic links that have more permissions than they originally did. I just
> > got back from vacation, but I'm working on fixing up [1] so it's ready
> > to be an RFC so we can close this hole once and for all.
> 
> so that relies on the fact opening /proc/self/exe with O_WRONLY fails
> with ETXTBSY?

Not quite, it relies on the fact that /proc/self/exe (and any other
magiclink to /proc/self/exe) does not have a write mode (semantically,
because of -ETXTBSY) and thus blocks any attempt to open it (or re-open
it) with a write mode. It also fixes some other possible issues and lets
you have upgrade masks (a-la capabilities) to file descriptors.

Ultimately I think having a complete "no really, nobody can touch this"
knob is also a good idea, and as this is is much simpler we can it in
much quicker than the magiclink stuff (which I still think is necessary
in general).

> > [1]: https://github.com/cyphar/linux/tree/magiclink/open_how-reopen
> >
> >> 
> >> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
> >> ---
> >> v2: https://lkml.org/lkml/2023/1/19/849
> >> 
> >> Differences from v2:
> >> 
> >> - fixed the test to check PR_SET_HIDE_SELF_EXE after fork
> >> 
> >> v1: https://lkml.org/lkml/2023/1/4/334
> >> 
> >> Differences from v1:
> >> 
> >> - amended more information in the commit message wrt map_files not
> >>   requiring the same protection.
> >> - changed the test to verify PR_HIDE_SELF_EXE cannot be unset after
> >>   a fork.
> >> 
> >> fs/exec.c                        | 1 +
> >>  fs/proc/base.c                   | 8 +++++---
> >>  include/linux/sched.h            | 5 +++++
> >>  include/uapi/linux/prctl.h       | 3 +++
> >>  kernel/sys.c                     | 9 +++++++++
> >>  tools/include/uapi/linux/prctl.h | 3 +++
> >>  6 files changed, 26 insertions(+), 3 deletions(-)
> >> 
> >> diff --git a/fs/exec.c b/fs/exec.c
> >> index ab913243a367..5a5dd964c3a3 100644
> >> --- a/fs/exec.c
> >> +++ b/fs/exec.c
> >> @@ -1855,6 +1855,7 @@ static int bprm_execve(struct linux_binprm *bprm,
> >>  	/* execve succeeded */
> >>  	current->fs->in_exec = 0;
> >>  	current->in_execve = 0;
> >> +	task_clear_hide_self_exe(current);
> >>  	rseq_execve(current);
> >>  	acct_update_integrals(current);
> >>  	task_numa_free(current, false);
> >> diff --git a/fs/proc/base.c b/fs/proc/base.c
> >> index 9e479d7d202b..959968e2da0d 100644
> >> --- a/fs/proc/base.c
> >> +++ b/fs/proc/base.c
> >> @@ -1723,19 +1723,21 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
> >>  {
> >>  	struct task_struct *task;
> >>  	struct file *exe_file;
> >> +	long hide_self_exe;
> >>  
> >>  	task = get_proc_task(d_inode(dentry));
> >>  	if (!task)
> >>  		return -ENOENT;
> >>  	exe_file = get_task_exe_file(task);
> >> +	hide_self_exe = task_hide_self_exe(task);
> >>  	put_task_struct(task);
> >> -	if (exe_file) {
> >> +	if (exe_file && !hide_self_exe) {
> >>  		*exe_path = exe_file->f_path;
> >>  		path_get(&exe_file->f_path);
> >>  		fput(exe_file);
> >>  		return 0;
> >> -	} else
> >> -		return -ENOENT;
> >> +	}
> >> +	return -ENOENT;
> >>  }
> >>  
> >>  static const char *proc_pid_get_link(struct dentry *dentry,
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index 853d08f7562b..8db32d5fc285 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -1790,6 +1790,7 @@ static __always_inline bool is_percpu_thread(void)
> >>  #define PFA_SPEC_IB_DISABLE		5	/* Indirect branch speculation restricted */
> >>  #define PFA_SPEC_IB_FORCE_DISABLE	6	/* Indirect branch speculation permanently restricted */
> >>  #define PFA_SPEC_SSB_NOEXEC		7	/* Speculative Store Bypass clear on execve() */
> >> +#define PFA_HIDE_SELF_EXE		8	/* Hide /proc/self/exe for the process */
> >>  
> >>  #define TASK_PFA_TEST(name, func)					\
> >>  	static inline bool task_##func(struct task_struct *p)		\
> >> @@ -1832,6 +1833,10 @@ TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable)
> >>  TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
> >>  TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
> >>  
> >> +TASK_PFA_TEST(HIDE_SELF_EXE, hide_self_exe)
> >> +TASK_PFA_SET(HIDE_SELF_EXE, hide_self_exe)
> >> +TASK_PFA_CLEAR(HIDE_SELF_EXE, hide_self_exe)
> >> +
> >>  static inline void
> >>  current_restore_flags(unsigned long orig_flags, unsigned long flags)
> >>  {
> >> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> >> index a5e06dcbba13..f12f3df12468 100644
> >> --- a/include/uapi/linux/prctl.h
> >> +++ b/include/uapi/linux/prctl.h
> >> @@ -284,4 +284,7 @@ struct prctl_mm_map {
> >>  #define PR_SET_VMA		0x53564d41
> >>  # define PR_SET_VMA_ANON_NAME		0
> >>  
> >> +#define PR_SET_HIDE_SELF_EXE		65
> >> +#define PR_GET_HIDE_SELF_EXE		66
> >> +
> >>  #endif /* _LINUX_PRCTL_H */
> >> diff --git a/kernel/sys.c b/kernel/sys.c
> >> index 5fd54bf0e886..e992f1b72973 100644
> >> --- a/kernel/sys.c
> >> +++ b/kernel/sys.c
> >> @@ -2626,6 +2626,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >>  	case PR_SET_VMA:
> >>  		error = prctl_set_vma(arg2, arg3, arg4, arg5);
> >>  		break;
> >> +	case PR_SET_HIDE_SELF_EXE:
> >> +		if (arg2 != 1 || arg3 || arg4 || arg5)
> >> +			return -EINVAL;
> >> +		task_set_hide_self_exe(current);
> >> +		break;
> >> +	case PR_GET_HIDE_SELF_EXE:
> >> +		if (arg2 || arg3 || arg4 || arg5)
> >> +			return -EINVAL;
> >> +		return task_hide_self_exe(current) ? 1 : 0;
> >>  	default:
> >>  		error = -EINVAL;
> >>  		break;
> >> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
> >> index a5e06dcbba13..f12f3df12468 100644
> >> --- a/tools/include/uapi/linux/prctl.h
> >> +++ b/tools/include/uapi/linux/prctl.h
> >> @@ -284,4 +284,7 @@ struct prctl_mm_map {
> >>  #define PR_SET_VMA		0x53564d41
> >>  # define PR_SET_VMA_ANON_NAME		0
> >>  
> >> +#define PR_SET_HIDE_SELF_EXE		65
> >> +#define PR_GET_HIDE_SELF_EXE		66
> >> +
> >>  #endif /* _LINUX_PRCTL_H */
> >> -- 
> >> 2.38.1
> >> 
>
  
Giuseppe Scrivano Jan. 25, 2023, 4:30 p.m. UTC | #10
Aleksa Sarai <cyphar@cyphar.com> writes:

> On 2023-01-24, Giuseppe Scrivano <gscrivan@redhat.com> wrote:
>> Aleksa Sarai <cyphar@cyphar.com> writes:
>> 
>> > On 2023-01-20, Giuseppe Scrivano <gscrivan@redhat.com> wrote:
>> >> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
>> >> processes to hide their own /proc/*/exe file. When this prctl is
>> >> used, every access to /proc/*/exe for the calling process will
>> >> fail with ENOENT.
>> >> 
>> >> This is useful for preventing issues like CVE-2019-5736, where an
>> >> attacker can gain host root access by overwriting the binary
>> >> in OCI runtimes through file-descriptor mishandling in containers.
>> >> 
>> >> The current fix for CVE-2019-5736 is to create a read-only copy or
>> >> a bind-mount of the current executable, and then re-exec the current
>> >> process.  With the new prctl, the read-only copy or bind-mount copy is
>> >> not needed anymore.
>> >> 
>> >> While map_files/ also might contain symlinks to files in host,
>> >> proc_map_files_get_link() permissions checks are already sufficient.
>> >
>> > I suspect this doesn't protect against the execve("/proc/self/exe")
>> > tactic (because it clears the bit on execve), so I'm not sure this is
>> > much safer than PR_SET_DUMPABLE (yeah, it stops root in the source
>> > userns from accessing /proc/$pid/exe but the above attack makes that no
>> > longer that important).
>> 
>> it protects against that attack too.  It clears the bit _after_ the
>> execve() syscall is done.
>> 
>> If you attempt execve("/proc/self/exe") you still get ENOENT:
>> 
>> ```
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <sys/prctl.h>
>> #include <unistd.h>
>> 
>> int main(void)
>> {
>>         int ret;
>> 
>>         ret = prctl(65, 1, 0, 0, 0);
>>         if (ret != 0)
>>                 exit(1);
>> 
>>         execl("/proc/self/exe", "foo", NULL);
>>         exit(2);
>> }
>> ```
>> 
>> # strace -e prctl,execve ./hide-self-exe
>> execve("./hide-self-exe", ["./hide-self-exe"], 0x7fff975a3690 /* 39 vars */) = 0
>> prctl(0x41 /* PR_??? */, 0x1, 0, 0, 0)  = 0
>> execve("/proc/self/exe", ["foo"], 0x7ffcf51868b8 /* 39 vars */) = -1 ENOENT (No such file or directory)
>> +++ exited with 2 +++
>> 
>> I've also tried execv'ing with a script that uses "#!/proc/self/exe" and
>> I get the same ENOENT.
>
> Ah, you're right. As you mentioned, you could still do the attack
> through /proc/self/map_files but that would require you to know where
> the binary will be located (and being non-dumpable blocks container
> processes from doing tricks to get the right path).
>
> I wonder if we should somehow require (or auto-apply) SUID_DUMP_NONE
> when setting this prctl, since it does currently depend on it to be
> properly secure...

from what I can see, access to /proc/*/map_files is already protected
by proc_map_files_get_link() that requires either CAP_SYS_ADMIN in the
initial user namespace or CAP_CHECKPOINT_RESTORE in the user namespace.

Setting SUID_DUMP_NONE wouldn't hurt though :-)

After reading some comments on the LWN.net article, I wonder if
PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user
namespace or if in this case root should keep the privilege to inspect
the binary of a process.  If a container runs with that many privileges
then it has already other ways to damage the host anyway.

>> > I think the only way to fix this properly is by blocking re-opens of
>> > magic links that have more permissions than they originally did. I just
>> > got back from vacation, but I'm working on fixing up [1] so it's ready
>> > to be an RFC so we can close this hole once and for all.
>> 
>> so that relies on the fact opening /proc/self/exe with O_WRONLY fails
>> with ETXTBSY?
>
> Not quite, it relies on the fact that /proc/self/exe (and any other
> magiclink to /proc/self/exe) does not have a write mode (semantically,
> because of -ETXTBSY) and thus blocks any attempt to open it (or re-open
> it) with a write mode. It also fixes some other possible issues and lets
> you have upgrade masks (a-la capabilities) to file descriptors.
>
> Ultimately I think having a complete "no really, nobody can touch this"
> knob is also a good idea, and as this is is much simpler we can it in
> much quicker than the magiclink stuff (which I still think is necessary
> in general).
>
>> > [1]: https://github.com/cyphar/linux/tree/magiclink/open_how-reopen
>> >
>> >> 
>> >> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
>> >> ---
>> >> v2: https://lkml.org/lkml/2023/1/19/849
>> >> 
>> >> Differences from v2:
>> >> 
>> >> - fixed the test to check PR_SET_HIDE_SELF_EXE after fork
>> >> 
>> >> v1: https://lkml.org/lkml/2023/1/4/334
>> >> 
>> >> Differences from v1:
>> >> 
>> >> - amended more information in the commit message wrt map_files not
>> >>   requiring the same protection.
>> >> - changed the test to verify PR_HIDE_SELF_EXE cannot be unset after
>> >>   a fork.
>> >> 
>> >> fs/exec.c                        | 1 +
>> >>  fs/proc/base.c                   | 8 +++++---
>> >>  include/linux/sched.h            | 5 +++++
>> >>  include/uapi/linux/prctl.h       | 3 +++
>> >>  kernel/sys.c                     | 9 +++++++++
>> >>  tools/include/uapi/linux/prctl.h | 3 +++
>> >>  6 files changed, 26 insertions(+), 3 deletions(-)
>> >> 
>> >> diff --git a/fs/exec.c b/fs/exec.c
>> >> index ab913243a367..5a5dd964c3a3 100644
>> >> --- a/fs/exec.c
>> >> +++ b/fs/exec.c
>> >> @@ -1855,6 +1855,7 @@ static int bprm_execve(struct linux_binprm *bprm,
>> >>  	/* execve succeeded */
>> >>  	current->fs->in_exec = 0;
>> >>  	current->in_execve = 0;
>> >> +	task_clear_hide_self_exe(current);
>> >>  	rseq_execve(current);
>> >>  	acct_update_integrals(current);
>> >>  	task_numa_free(current, false);
>> >> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> >> index 9e479d7d202b..959968e2da0d 100644
>> >> --- a/fs/proc/base.c
>> >> +++ b/fs/proc/base.c
>> >> @@ -1723,19 +1723,21 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
>> >>  {
>> >>  	struct task_struct *task;
>> >>  	struct file *exe_file;
>> >> +	long hide_self_exe;
>> >>  
>> >>  	task = get_proc_task(d_inode(dentry));
>> >>  	if (!task)
>> >>  		return -ENOENT;
>> >>  	exe_file = get_task_exe_file(task);
>> >> +	hide_self_exe = task_hide_self_exe(task);
>> >>  	put_task_struct(task);
>> >> -	if (exe_file) {
>> >> +	if (exe_file && !hide_self_exe) {
>> >>  		*exe_path = exe_file->f_path;
>> >>  		path_get(&exe_file->f_path);
>> >>  		fput(exe_file);
>> >>  		return 0;
>> >> -	} else
>> >> -		return -ENOENT;
>> >> +	}
>> >> +	return -ENOENT;
>> >>  }
>> >>  
>> >>  static const char *proc_pid_get_link(struct dentry *dentry,
>> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> >> index 853d08f7562b..8db32d5fc285 100644
>> >> --- a/include/linux/sched.h
>> >> +++ b/include/linux/sched.h
>> >> @@ -1790,6 +1790,7 @@ static __always_inline bool is_percpu_thread(void)
>> >>  #define PFA_SPEC_IB_DISABLE		5	/* Indirect branch speculation restricted */
>> >>  #define PFA_SPEC_IB_FORCE_DISABLE	6	/* Indirect branch speculation permanently restricted */
>> >>  #define PFA_SPEC_SSB_NOEXEC		7	/* Speculative Store Bypass clear on execve() */
>> >> +#define PFA_HIDE_SELF_EXE		8	/* Hide /proc/self/exe for the process */
>> >>  
>> >>  #define TASK_PFA_TEST(name, func)					\
>> >>  	static inline bool task_##func(struct task_struct *p)		\
>> >> @@ -1832,6 +1833,10 @@ TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable)
>> >>  TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
>> >>  TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
>> >>  
>> >> +TASK_PFA_TEST(HIDE_SELF_EXE, hide_self_exe)
>> >> +TASK_PFA_SET(HIDE_SELF_EXE, hide_self_exe)
>> >> +TASK_PFA_CLEAR(HIDE_SELF_EXE, hide_self_exe)
>> >> +
>> >>  static inline void
>> >>  current_restore_flags(unsigned long orig_flags, unsigned long flags)
>> >>  {
>> >> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> >> index a5e06dcbba13..f12f3df12468 100644
>> >> --- a/include/uapi/linux/prctl.h
>> >> +++ b/include/uapi/linux/prctl.h
>> >> @@ -284,4 +284,7 @@ struct prctl_mm_map {
>> >>  #define PR_SET_VMA		0x53564d41
>> >>  # define PR_SET_VMA_ANON_NAME		0
>> >>  
>> >> +#define PR_SET_HIDE_SELF_EXE		65
>> >> +#define PR_GET_HIDE_SELF_EXE		66
>> >> +
>> >>  #endif /* _LINUX_PRCTL_H */
>> >> diff --git a/kernel/sys.c b/kernel/sys.c
>> >> index 5fd54bf0e886..e992f1b72973 100644
>> >> --- a/kernel/sys.c
>> >> +++ b/kernel/sys.c
>> >> @@ -2626,6 +2626,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>> >>  	case PR_SET_VMA:
>> >>  		error = prctl_set_vma(arg2, arg3, arg4, arg5);
>> >>  		break;
>> >> +	case PR_SET_HIDE_SELF_EXE:
>> >> +		if (arg2 != 1 || arg3 || arg4 || arg5)
>> >> +			return -EINVAL;
>> >> +		task_set_hide_self_exe(current);
>> >> +		break;
>> >> +	case PR_GET_HIDE_SELF_EXE:
>> >> +		if (arg2 || arg3 || arg4 || arg5)
>> >> +			return -EINVAL;
>> >> +		return task_hide_self_exe(current) ? 1 : 0;
>> >>  	default:
>> >>  		error = -EINVAL;
>> >>  		break;
>> >> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
>> >> index a5e06dcbba13..f12f3df12468 100644
>> >> --- a/tools/include/uapi/linux/prctl.h
>> >> +++ b/tools/include/uapi/linux/prctl.h
>> >> @@ -284,4 +284,7 @@ struct prctl_mm_map {
>> >>  #define PR_SET_VMA		0x53564d41
>> >>  # define PR_SET_VMA_ANON_NAME		0
>> >>  
>> >> +#define PR_SET_HIDE_SELF_EXE		65
>> >> +#define PR_GET_HIDE_SELF_EXE		66
>> >> +
>> >>  #endif /* _LINUX_PRCTL_H */
>> >> -- 
>> >> 2.38.1
>> >> 
>>
  
Christian Brauner Jan. 26, 2023, 8:25 a.m. UTC | #11
On Thu, Jan 26, 2023 at 02:28:47AM +1100, Aleksa Sarai wrote:
> On 2023-01-24, Giuseppe Scrivano <gscrivan@redhat.com> wrote:
> > Aleksa Sarai <cyphar@cyphar.com> writes:
> > 
> > > On 2023-01-20, Giuseppe Scrivano <gscrivan@redhat.com> wrote:
> > >> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
> > >> processes to hide their own /proc/*/exe file. When this prctl is
> > >> used, every access to /proc/*/exe for the calling process will
> > >> fail with ENOENT.
> > >> 
> > >> This is useful for preventing issues like CVE-2019-5736, where an
> > >> attacker can gain host root access by overwriting the binary
> > >> in OCI runtimes through file-descriptor mishandling in containers.
> > >> 
> > >> The current fix for CVE-2019-5736 is to create a read-only copy or
> > >> a bind-mount of the current executable, and then re-exec the current
> > >> process.  With the new prctl, the read-only copy or bind-mount copy is
> > >> not needed anymore.
> > >> 
> > >> While map_files/ also might contain symlinks to files in host,
> > >> proc_map_files_get_link() permissions checks are already sufficient.
> > >
> > > I suspect this doesn't protect against the execve("/proc/self/exe")
> > > tactic (because it clears the bit on execve), so I'm not sure this is
> > > much safer than PR_SET_DUMPABLE (yeah, it stops root in the source
> > > userns from accessing /proc/$pid/exe but the above attack makes that no
> > > longer that important).
> > 
> > it protects against that attack too.  It clears the bit _after_ the
> > execve() syscall is done.
> > 
> > If you attempt execve("/proc/self/exe") you still get ENOENT:
> > 
> > ```
> > #include <stdlib.h>
> > #include <stdio.h>
> > #include <sys/prctl.h>
> > #include <unistd.h>
> > 
> > int main(void)
> > {
> >         int ret;
> > 
> >         ret = prctl(65, 1, 0, 0, 0);
> >         if (ret != 0)
> >                 exit(1);
> > 
> >         execl("/proc/self/exe", "foo", NULL);
> >         exit(2);
> > }
> > ```
> > 
> > # strace -e prctl,execve ./hide-self-exe
> > execve("./hide-self-exe", ["./hide-self-exe"], 0x7fff975a3690 /* 39 vars */) = 0
> > prctl(0x41 /* PR_??? */, 0x1, 0, 0, 0)  = 0
> > execve("/proc/self/exe", ["foo"], 0x7ffcf51868b8 /* 39 vars */) = -1 ENOENT (No such file or directory)
> > +++ exited with 2 +++
> > 
> > I've also tried execv'ing with a script that uses "#!/proc/self/exe" and
> > I get the same ENOENT.
> 
> Ah, you're right. As you mentioned, you could still do the attack
> through /proc/self/map_files but that would require you to know where
> the binary will be located (and being non-dumpable blocks container
> processes from doing tricks to get the right path).
> 
> I wonder if we should somehow require (or auto-apply) SUID_DUMP_NONE
> when setting this prctl, since it does currently depend on it to be
> properly secure...
> 
> > > I think the only way to fix this properly is by blocking re-opens of
> > > magic links that have more permissions than they originally did. I just
> > > got back from vacation, but I'm working on fixing up [1] so it's ready
> > > to be an RFC so we can close this hole once and for all.
> > 
> > so that relies on the fact opening /proc/self/exe with O_WRONLY fails
> > with ETXTBSY?
> 
> Not quite, it relies on the fact that /proc/self/exe (and any other
> magiclink to /proc/self/exe) does not have a write mode (semantically,
> because of -ETXTBSY) and thus blocks any attempt to open it (or re-open
> it) with a write mode. It also fixes some other possible issues and lets
> you have upgrade masks (a-la capabilities) to file descriptors.
> 
> Ultimately I think having a complete "no really, nobody can touch this"
> knob is also a good idea, and as this is is much simpler we can it in
> much quicker than the magiclink stuff (which I still think is necessary
> in general).

It definitely but let's not tie our generic vfs apis to this problem.
  
Christian Brauner Jan. 27, 2023, 12:31 p.m. UTC | #12
From: Christian Brauner (Microsoft) <brauner@kernel.org>


On Fri, 20 Jan 2023 11:25:11 +0100, Giuseppe Scrivano wrote:
> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
> processes to hide their own /proc/*/exe file. When this prctl is
> used, every access to /proc/*/exe for the calling process will
> fail with ENOENT.
> 
> This is useful for preventing issues like CVE-2019-5736, where an
> attacker can gain host root access by overwriting the binary
> in OCI runtimes through file-descriptor mishandling in containers.
> 
> [...]

Only needed for privileged sandboxes. The userspace mitigations Aleksa
and I did for the CVE in all affected runtimes back then are nifty but
complicated. The patch is a decent compromise.
Picking up this prctl() for now,

[1/2] exec: add PR_HIDE_SELF_EXE prctl
      commit: 673301182d473ef61a98c292cf64650c73117172
[2/2] selftests: add tests for prctl(SET_HIDE_SELF_EXE)
      commit: bafa339eda3f79d567386e1fae59bb0537156c96
  
Kees Cook Jan. 27, 2023, 8:34 p.m. UTC | #13
On Fri, Jan 27, 2023 at 01:31:13PM +0100, Christian Brauner wrote:
> From: Christian Brauner (Microsoft) <brauner@kernel.org>
> 
> 
> On Fri, 20 Jan 2023 11:25:11 +0100, Giuseppe Scrivano wrote:
> > This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
> > processes to hide their own /proc/*/exe file. When this prctl is
> > used, every access to /proc/*/exe for the calling process will
> > fail with ENOENT.
> > 
> > This is useful for preventing issues like CVE-2019-5736, where an
> > attacker can gain host root access by overwriting the binary
> > in OCI runtimes through file-descriptor mishandling in containers.
> > 
> > [...]
> 
> Only needed for privileged sandboxes. The userspace mitigations Aleksa
> and I did for the CVE in all affected runtimes back then are nifty but
> complicated. The patch is a decent compromise.
> Picking up this prctl() for now,
> 
> [1/2] exec: add PR_HIDE_SELF_EXE prctl
>       commit: 673301182d473ef61a98c292cf64650c73117172
> [2/2] selftests: add tests for prctl(SET_HIDE_SELF_EXE)
>       commit: bafa339eda3f79d567386e1fae59bb0537156c96

Thanks! I'm late to the party, but I came to the same conclusion as you
did. :)

Reviewed-by: Kees Cook <keescook@chromium.org>
  
Colin Walters Jan. 29, 2023, 1:59 p.m. UTC | #14
On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote:
> 
> After reading some comments on the LWN.net article, I wonder if
> PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user
> namespace or if in this case root should keep the privilege to inspect
> the binary of a process.  If a container runs with that many privileges
> then it has already other ways to damage the host anyway.

Right, that's what I was trying to express with the "make it work the same as map_files".  Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason.
  
Christian Brauner Jan. 29, 2023, 4:58 p.m. UTC | #15
On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote:
> 
> 
> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote:
> > 
> > After reading some comments on the LWN.net article, I wonder if
> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user
> > namespace or if in this case root should keep the privilege to inspect
> > the binary of a process.  If a container runs with that many privileges
> > then it has already other ways to damage the host anyway.
> 
> Right, that's what I was trying to express with the "make it work the same as map_files".  Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason.

If this can be circumvented via CAP_SYS_ADMIN then this mitigation
becomes immediately way less interesting because the userspace
mitigation we came up with protects against CAP_SYS_ADMIN as well
without any regression risk. At which point this is only useful for some
privileged sandboxes at what point this isn't worth it.

I'm still looking at userspace codebases to ensure that this is a change
we can risk in general as this has the potential to prevent criu from
dumping such processes. I'll talk to them tomorrow anyway.
  
Colin Walters Jan. 29, 2023, 6:12 p.m. UTC | #16
On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote:
> On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote:
>> 
>> 
>> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote:
>> > 
>> > After reading some comments on the LWN.net article, I wonder if
>> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user
>> > namespace or if in this case root should keep the privilege to inspect
>> > the binary of a process.  If a container runs with that many privileges
>> > then it has already other ways to damage the host anyway.
>> 
>> Right, that's what I was trying to express with the "make it work the same as map_files".  Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason.
>
> If this can be circumvented via CAP_SYS_ADMIN 

To be clear, I'm proposing CAP_SYS_ADMIN in the current user namespace at the time of the prctl().  (Or if keeping around a reference just for this is too problematic, perhaps hardcoding to the init ns)

A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary.

> then this mitigation
> becomes immediately way less interesting because the userspace
> mitigation we came up with protects against CAP_SYS_ADMIN as well
> without any regression risk. 

The userspace mitigation here being "clone self to memfd"?  But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/
  
Christian Brauner Jan. 30, 2023, 9:53 a.m. UTC | #17
On Sun, Jan 29, 2023 at 01:12:45PM -0500, Colin Walters wrote:
> 
> 
> On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote:
> > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote:
> >> 
> >> 
> >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote:
> >> > 
> >> > After reading some comments on the LWN.net article, I wonder if
> >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user
> >> > namespace or if in this case root should keep the privilege to inspect
> >> > the binary of a process.  If a container runs with that many privileges
> >> > then it has already other ways to damage the host anyway.
> >> 
> >> Right, that's what I was trying to express with the "make it work the same as map_files".  Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason.
> >
> > If this can be circumvented via CAP_SYS_ADMIN 
> 
> To be clear, I'm proposing CAP_SYS_ADMIN in the current user namespace at the time of the prctl().  (Or if keeping around a reference just for this is too problematic, perhaps hardcoding to the init ns)

Oh no, I fully understand. The point was that the userspace fix protects
even against attackers with CAP_SYS_ADMIN in init_user_ns. And that was
important back then and is still relevant today for some workloads.

For unprivileged containers where host and container are separate by a
meaningful user namespace boundary this whole mitigation is irrelevant
as the binary can't be overwritten.

> 
> A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary.
> 
> > then this mitigation
> > becomes immediately way less interesting because the userspace
> > mitigation we came up with protects against CAP_SYS_ADMIN as well
> > without any regression risk. 
> 
> The userspace mitigation here being "clone self to memfd"?  But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/

But this is a problem with the memfd api not with the fix. Following the
thread the ability to create executable memfds will stay around. As it
should be given how long this has been supported. And they have backward
compatibility in mind which is great.
  
Christian Brauner Jan. 30, 2023, 10:06 a.m. UTC | #18
On Mon, Jan 30, 2023 at 10:53:31AM +0100, Christian Brauner wrote:
> On Sun, Jan 29, 2023 at 01:12:45PM -0500, Colin Walters wrote:
> > 
> > 
> > On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote:
> > > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote:
> > >> 
> > >> 
> > >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote:
> > >> > 
> > >> > After reading some comments on the LWN.net article, I wonder if
> > >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user
> > >> > namespace or if in this case root should keep the privilege to inspect
> > >> > the binary of a process.  If a container runs with that many privileges
> > >> > then it has already other ways to damage the host anyway.
> > >> 
> > >> Right, that's what I was trying to express with the "make it work the same as map_files".  Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason.
> > >
> > > If this can be circumvented via CAP_SYS_ADMIN 
> > 
> > To be clear, I'm proposing CAP_SYS_ADMIN in the current user namespace at the time of the prctl().  (Or if keeping around a reference just for this is too problematic, perhaps hardcoding to the init ns)
> 
> Oh no, I fully understand. The point was that the userspace fix protects
> even against attackers with CAP_SYS_ADMIN in init_user_ns. And that was
> important back then and is still relevant today for some workloads.
> 
> For unprivileged containers where host and container are separate by a
> meaningful user namespace boundary this whole mitigation is irrelevant
> as the binary can't be overwritten.
> 
> > 
> > A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary.
> > 
> > > then this mitigation
> > > becomes immediately way less interesting because the userspace
> > > mitigation we came up with protects against CAP_SYS_ADMIN as well
> > > without any regression risk. 
> > 
> > The userspace mitigation here being "clone self to memfd"?  But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/
> 
> But this is a problem with the memfd api not with the fix. Following the
> thread the ability to create executable memfds will stay around. As it
> should be given how long this has been supported. And they have backward
> compatibility in mind which is great.

Following up from yesterday's promise to check with the criu org I'm
part of: this is going to break criu unforunately as it dumps (and
restores) /proc/self/exe. Even with an escape hatch we'd still risk
breaking it. Whereas again, the memfd solution doesn't cause those
issues.

Don't get me wrong it's pretty obvious that I was pretty supportive of
this fix especially because it looked rather simple but this is turning
out to be less simple than we tought. I don't think that this is worth
it given the functioning fixes we already have.

The good thing is that - even if it will take a longer - that Aleksa's
patchset will provide a more general solution by making it possible for
runc/crun/lxc to open the target binary with a restricted upgrade mask
making it impossible to open the binary read-write again. This won't
break criu and will fix this issue and is generally useful.
  
Colin Walters Jan. 30, 2023, 9:52 p.m. UTC | #19
On Mon, Jan 30, 2023, at 5:06 AM, Christian Brauner wrote:

> The good thing is that - even if it will take a longer - that Aleksa's
> patchset will provide a more general solution by making it possible for
> runc/crun/lxc to open the target binary with a restricted upgrade mask
> making it impossible to open the binary read-write again. This won't
> break criu and will fix this issue and is generally useful.

Had to go back up thread more carefully; looking at the referenced commits now in
https://github.com/cyphar/linux/commits/magiclink/open_how-reopen
I do agree that that direction is the most elegant.  The main downside I can think of is the potential blast radius is larger, and more nontrivial code.  

But...in practice I guess today for the runc/crun type attacks today there are commonly multiple mitigations (e.g. read-only rootfs, multiple LSMs, and finally the memfd copy fallback) so we can probably wait for that patchset to land.

In short: agreed!
  
Giuseppe Scrivano Jan. 31, 2023, 2:17 p.m. UTC | #20
Christian Brauner <brauner@kernel.org> writes:

> On Mon, Jan 30, 2023 at 10:53:31AM +0100, Christian Brauner wrote:
>> On Sun, Jan 29, 2023 at 01:12:45PM -0500, Colin Walters wrote:
>> > 
>> > 
>> > On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote:
>> > > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote:
>> > >> 
>> > >> 
>> > >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote:
>> > >> > 
>> > >> > After reading some comments on the LWN.net article, I wonder if
>> > >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user
>> > >> > namespace or if in this case root should keep the privilege to inspect
>> > >> > the binary of a process.  If a container runs with that many privileges
>> > >> > then it has already other ways to damage the host anyway.
>> > >> 
>> > >> Right, that's what I was trying to express with the "make it
>> > >> work the same as map_files".  Hiding the entry entirely even
>> > >> for initial-namespace-root (real root) seems like it's going to
>> > >> potentially confuse profiling/tracing/debugging tools for no
>> > >> good reason.
>> > >
>> > > If this can be circumvented via CAP_SYS_ADMIN 
>> > 
>> > To be clear, I'm proposing CAP_SYS_ADMIN in the current user
>> > namespace at the time of the prctl().  (Or if keeping around a
>> > reference just for this is too problematic, perhaps hardcoding to
>> > the init ns)
>> 
>> Oh no, I fully understand. The point was that the userspace fix protects
>> even against attackers with CAP_SYS_ADMIN in init_user_ns. And that was
>> important back then and is still relevant today for some workloads.
>> 
>> For unprivileged containers where host and container are separate by a
>> meaningful user namespace boundary this whole mitigation is irrelevant
>> as the binary can't be overwritten.
>> 
>> > 
>> > A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary.
>> > 
>> > > then this mitigation
>> > > becomes immediately way less interesting because the userspace
>> > > mitigation we came up with protects against CAP_SYS_ADMIN as well
>> > > without any regression risk. 
>> > 
>> > The userspace mitigation here being "clone self to memfd"?  But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/
>> 
>> But this is a problem with the memfd api not with the fix. Following the
>> thread the ability to create executable memfds will stay around. As it
>> should be given how long this has been supported. And they have backward
>> compatibility in mind which is great.
>
> Following up from yesterday's promise to check with the criu org I'm
> part of: this is going to break criu unforunately as it dumps (and
> restores) /proc/self/exe. Even with an escape hatch we'd still risk
> breaking it. Whereas again, the memfd solution doesn't cause those
> issues.
>
> Don't get me wrong it's pretty obvious that I was pretty supportive of
> this fix especially because it looked rather simple but this is turning
> out to be less simple than we tought. I don't think that this is worth
> it given the functioning fixes we already have.
>
> The good thing is that - even if it will take a longer - that Aleksa's
> patchset will provide a more general solution by making it possible for
> runc/crun/lxc to open the target binary with a restricted upgrade mask
> making it impossible to open the binary read-write again. This won't
> break criu and will fix this issue and is generally useful.

I was not aware that running with CAP_SYS_ADMIN in the initial userns
was considered as a use case, but in this case don't we need to protect
/proc/$PID/map_files as well or do we rely only on randomize_va_space? 
It is a more difficult to guess the name but we can still exec these
files and grab a reference to them.

The current patch I've proposed is probably a too big hammer for the
small issue we really have:

other processes from the container are already blocked by PR_SET_DUMPABLE unless
CAP_SYS_PTRACE is granted; but if CAP_SYS_PTRACE is granted then it seems
already vulnerable today since processes from the container can just
read the /proc/PID/map_files files without even requiring the exec trick.

So the only hole left, that I can see, is that the container runtime
can be tricked to exec /proc/self/exe (or /proc/self/map_files/*) and
from there open a reference to the binary.

Could we just restrict the usage to the current thread group?  That
won't affect in any way other processes.

The patch won't be too much more complicated, we just need to amend the
following fix:

diff --git a/fs/proc/base.c b/fs/proc/base.c
index e9127084b82a..2f5c5ed2dae8 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1723,6 +1723,7 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
 {
        struct task_struct *task;
        struct file *exe_file;
+       bool is_same_tgroup;
        long hide_self_exe;
 
        task = get_proc_task(d_inode(dentry));
@@ -1730,8 +1731,9 @@ static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
                return -ENOENT;
        exe_file = get_task_exe_file(task);
        hide_self_exe = task_hide_self_exe(task);
+       is_same_tgroup = same_thread_group(current, task);
        put_task_struct(task);
-       if (hide_self_exe)
+       if (hide_self_exe && is_same_tgroup)
                return -EPERM;
        if (exe_file) {
                *exe_path = exe_file->f_path;

Would that be sufficient or are there other ways to attack it?

Given the premise about CAP_SYS_ADMIN (and even more loosen
CAP_CHECKPOINT_RESTORE in the *current* user namespace), I think we
probably need a similar protection fo /proc/PID/map_files.

Regards,
Giuseppe
  
Andrei Vagin Feb. 25, 2023, 12:27 a.m. UTC | #21
On Mon, Jan 30, 2023 at 11:06:02AM +0100, Christian Brauner wrote:
> On Mon, Jan 30, 2023 at 10:53:31AM +0100, Christian Brauner wrote:
> > On Sun, Jan 29, 2023 at 01:12:45PM -0500, Colin Walters wrote:
> > > 
> > > 
> > > On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote:
> > > > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote:
> > > >> 
> > > >> 
> > > >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote:
> > > >> > 
> > > >> > After reading some comments on the LWN.net article, I wonder if
> > > >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user
> > > >> > namespace or if in this case root should keep the privilege to inspect
> > > >> > the binary of a process.  If a container runs with that many privileges
> > > >> > then it has already other ways to damage the host anyway.
> > > >> 
> > > >> Right, that's what I was trying to express with the "make it work the same as map_files".  Hiding the entry entirely even for initial-namespace-root (real root) seems like it's going to potentially confuse profiling/tracing/debugging tools for no good reason.
> > > >
> > > > If this can be circumvented via CAP_SYS_ADMIN 
> > > 
> > > To be clear, I'm proposing CAP_SYS_ADMIN in the current user namespace at the time of the prctl().  (Or if keeping around a reference just for this is too problematic, perhaps hardcoding to the init ns)
> > 
> > Oh no, I fully understand. The point was that the userspace fix protects
> > even against attackers with CAP_SYS_ADMIN in init_user_ns. And that was
> > important back then and is still relevant today for some workloads.
> > 
> > For unprivileged containers where host and container are separate by a
> > meaningful user namespace boundary this whole mitigation is irrelevant
> > as the binary can't be overwritten.
> > 
> > > 
> > > A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary.
> > > 
> > > > then this mitigation
> > > > becomes immediately way less interesting because the userspace
> > > > mitigation we came up with protects against CAP_SYS_ADMIN as well
> > > > without any regression risk. 
> > > 
> > > The userspace mitigation here being "clone self to memfd"?  But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/
> > 
> > But this is a problem with the memfd api not with the fix. Following the
> > thread the ability to create executable memfds will stay around. As it
> > should be given how long this has been supported. And they have backward
> > compatibility in mind which is great.
> 
> Following up from yesterday's promise to check with the criu org I'm
> part of: this is going to break criu unforunately as it dumps (and
> restores) /proc/self/exe. Even with an escape hatch we'd still risk
> breaking it. Whereas again, the memfd solution doesn't cause those
> issues.
> 
> Don't get me wrong it's pretty obvious that I was pretty supportive of
> this fix especially because it looked rather simple but this is turning
> out to be less simple than we tought. I don't think that this is worth
> it given the functioning fixes we already have.

btw: can we use PR_SET_MM_EXE_FILE or PR_SET_MM_MAP (prctl_map.exe_fd) to
set a dummy exe. Will it have the required effect?

> 
> The good thing is that - even if it will take a longer - that Aleksa's
> patchset will provide a more general solution by making it possible for
> runc/crun/lxc to open the target binary with a restricted upgrade mask
> making it impossible to open the binary read-write again. This won't
> break criu and will fix this issue and is generally useful.
  
Giuseppe Scrivano Feb. 28, 2023, 2:19 p.m. UTC | #22
Andrei Vagin <avagin@gmail.com> writes:

> On Mon, Jan 30, 2023 at 11:06:02AM +0100, Christian Brauner wrote:
>> On Mon, Jan 30, 2023 at 10:53:31AM +0100, Christian Brauner wrote:
>> > On Sun, Jan 29, 2023 at 01:12:45PM -0500, Colin Walters wrote:
>> > >
>> > >
>> > > On Sun, Jan 29, 2023, at 11:58 AM, Christian Brauner wrote:
>> > > > On Sun, Jan 29, 2023 at 08:59:32AM -0500, Colin Walters wrote:
>> > > >>
>> > > >>
>> > > >> On Wed, Jan 25, 2023, at 11:30 AM, Giuseppe Scrivano wrote:
>> > > >> >
>> > > >> > After reading some comments on the LWN.net article, I wonder if
>> > > >> > PR_HIDE_SELF_EXE should apply to CAP_SYS_ADMIN in the initial user
>> > > >> > namespace or if in this case root should keep the privilege to inspect
>> > > >> > the binary of a process.  If a container runs with that many privileges
>> > > >> > then it has already other ways to damage the host anyway.
>> > > >>
>> > > >> Right, that's what I was trying to express with the "make it
>> > > >> work the same as map_files".  Hiding the entry entirely even
>> > > >> for initial-namespace-root (real root) seems like it's going
>> > > >> to potentially confuse profiling/tracing/debugging tools for
>> > > >> no good reason.
>> > > >
>> > > > If this can be circumvented via CAP_SYS_ADMIN
>> > >
>> > > To be clear, I'm proposing CAP_SYS_ADMIN in the current user
>> > > namespace at the time of the prctl().  (Or if keeping around a
>> > > reference just for this is too problematic, perhaps hardcoding
>> > > to the init ns)
>> >
>> > Oh no, I fully understand. The point was that the userspace fix protects
>> > even against attackers with CAP_SYS_ADMIN in init_user_ns. And that was
>> > important back then and is still relevant today for some workloads.
>> >
>> > For unprivileged containers where host and container are separate by a
>> > meaningful user namespace boundary this whole mitigation is irrelevant
>> > as the binary can't be overwritten.
>> >
>> > >
>> > > A process with CAP_SYS_ADMIN in a child namespace would still not be able to read the binary.
>> > >
>> > > > then this mitigation
>> > > > becomes immediately way less interesting because the userspace
>> > > > mitigation we came up with protects against CAP_SYS_ADMIN as well
>> > > > without any regression risk.
>> > >
>> > > The userspace mitigation here being "clone self to memfd"?  But that's a sufficiently ugly workaround that it's created new problems; see https://lwn.net/Articles/918106/
>> >
>> > But this is a problem with the memfd api not with the fix. Following the
>> > thread the ability to create executable memfds will stay around. As it
>> > should be given how long this has been supported. And they have backward
>> > compatibility in mind which is great.
>>
>> Following up from yesterday's promise to check with the criu org I'm
>> part of: this is going to break criu unforunately as it dumps (and
>> restores) /proc/self/exe. Even with an escape hatch we'd still risk
>> breaking it. Whereas again, the memfd solution doesn't cause those
>> issues.
>>
>> Don't get me wrong it's pretty obvious that I was pretty supportive of
>> this fix especially because it looked rather simple but this is turning
>> out to be less simple than we tought. I don't think that this is worth
>> it given the functioning fixes we already have.
>
> btw: can we use PR_SET_MM_EXE_FILE or PR_SET_MM_MAP (prctl_map.exe_fd) to
> set a dummy exe. Will it have the required effect?

if I am understanding it correctly, that seems a bit more complicated,
we first need to unmap the current executable and then replace it with
its copy?

Creating the dummy exe could also be a problem since we need a new copy
each time we want to hide the executable.

Or are you suggesting using it differently?

Thanks,
Giuseppe

>> The good thing is that - even if it will take a longer - that Aleksa's
>> patchset will provide a more general solution by making it possible for
>> runc/crun/lxc to open the target binary with a restricted upgrade mask
>> making it impossible to open the binary read-write again. This won't
>> break criu and will fix this issue and is generally useful.
  

Patch

diff --git a/fs/exec.c b/fs/exec.c
index ab913243a367..5a5dd964c3a3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1855,6 +1855,7 @@  static int bprm_execve(struct linux_binprm *bprm,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	task_clear_hide_self_exe(current);
 	rseq_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9e479d7d202b..959968e2da0d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1723,19 +1723,21 @@  static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
 {
 	struct task_struct *task;
 	struct file *exe_file;
+	long hide_self_exe;
 
 	task = get_proc_task(d_inode(dentry));
 	if (!task)
 		return -ENOENT;
 	exe_file = get_task_exe_file(task);
+	hide_self_exe = task_hide_self_exe(task);
 	put_task_struct(task);
-	if (exe_file) {
+	if (exe_file && !hide_self_exe) {
 		*exe_path = exe_file->f_path;
 		path_get(&exe_file->f_path);
 		fput(exe_file);
 		return 0;
-	} else
-		return -ENOENT;
+	}
+	return -ENOENT;
 }
 
 static const char *proc_pid_get_link(struct dentry *dentry,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 853d08f7562b..8db32d5fc285 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1790,6 +1790,7 @@  static __always_inline bool is_percpu_thread(void)
 #define PFA_SPEC_IB_DISABLE		5	/* Indirect branch speculation restricted */
 #define PFA_SPEC_IB_FORCE_DISABLE	6	/* Indirect branch speculation permanently restricted */
 #define PFA_SPEC_SSB_NOEXEC		7	/* Speculative Store Bypass clear on execve() */
+#define PFA_HIDE_SELF_EXE		8	/* Hide /proc/self/exe for the process */
 
 #define TASK_PFA_TEST(name, func)					\
 	static inline bool task_##func(struct task_struct *p)		\
@@ -1832,6 +1833,10 @@  TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable)
 TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
 TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable)
 
+TASK_PFA_TEST(HIDE_SELF_EXE, hide_self_exe)
+TASK_PFA_SET(HIDE_SELF_EXE, hide_self_exe)
+TASK_PFA_CLEAR(HIDE_SELF_EXE, hide_self_exe)
+
 static inline void
 current_restore_flags(unsigned long orig_flags, unsigned long flags)
 {
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a5e06dcbba13..f12f3df12468 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -284,4 +284,7 @@  struct prctl_mm_map {
 #define PR_SET_VMA		0x53564d41
 # define PR_SET_VMA_ANON_NAME		0
 
+#define PR_SET_HIDE_SELF_EXE		65
+#define PR_GET_HIDE_SELF_EXE		66
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 5fd54bf0e886..e992f1b72973 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2626,6 +2626,15 @@  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_SET_VMA:
 		error = prctl_set_vma(arg2, arg3, arg4, arg5);
 		break;
+	case PR_SET_HIDE_SELF_EXE:
+		if (arg2 != 1 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		task_set_hide_self_exe(current);
+		break;
+	case PR_GET_HIDE_SELF_EXE:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		return task_hide_self_exe(current) ? 1 : 0;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index a5e06dcbba13..f12f3df12468 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -284,4 +284,7 @@  struct prctl_mm_map {
 #define PR_SET_VMA		0x53564d41
 # define PR_SET_VMA_ANON_NAME		0
 
+#define PR_SET_HIDE_SELF_EXE		65
+#define PR_GET_HIDE_SELF_EXE		66
+
 #endif /* _LINUX_PRCTL_H */