[v3] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC

Message ID 20221202013404.163143-3-jeffxu@google.com
State New
Headers
Series [v3] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC |

Commit Message

Jeff Xu Dec. 2, 2022, 1:34 a.m. UTC
  From: Jeff Xu <jeffxu@chromium.org>

The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to
set executable bit at creation time (memfd_create).

When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
(mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to
be executable (mode: 0777) after creation.

when MFD_EXEC flag is set, memfd is created with executable bit
(mode:0777), this is the same as the old behavior of memfd_create.

The new pid namespaced sysctl vm.memfd_noexec has 3 values:
0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
	MFD_EXEC was set.
1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
	MFD_NOEXEC_SEAL was set.
2:memfd_create() without MFD_NOEXEC_SEAL will be rejected.

The sysctl allows finer control of memfd_create for old-software
that doesn't set the executable bit, for example, a container with
vm.memfd_noexec=1 means the old-software will create non-executable
memfd by default.

Co-developed-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 include/linux/pid_namespace.h | 19 ++++++++++++++
 include/uapi/linux/memfd.h    |  4 +++
 kernel/pid_namespace.c        | 47 +++++++++++++++++++++++++++++++++++
 mm/memfd.c                    | 44 ++++++++++++++++++++++++++++++--
 4 files changed, 112 insertions(+), 2 deletions(-)
  

Comments

kernel test robot Dec. 2, 2022, 11:32 a.m. UTC | #1
Hi,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/jeffxu-chromium-org/mm-memfd-add-MFD_NOEXEC_SEAL-and-MFD_EXEC/20221202-093847
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20221202013404.163143-3-jeffxu%40google.com
patch subject: [PATCH v3] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
config: i386-randconfig-a003
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/d6b8fec1440cb9206bef29eb27359a61e6c73f41
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review jeffxu-chromium-org/mm-memfd-add-MFD_NOEXEC_SEAL-and-MFD_EXEC/20221202-093847
        git checkout d6b8fec1440cb9206bef29eb27359a61e6c73f41
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/pid_namespace.c:263:5: warning: no previous prototype for 'pid_mfd_noexec_dointvec_minmax' [-Wmissing-prototypes]
     263 | int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write,
         |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/pid_mfd_noexec_dointvec_minmax +263 kernel/pid_namespace.c

   261	
   262	#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
 > 263	int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write,
   264		void *buffer, size_t *lenp, loff_t *ppos)
   265	{
   266		struct pid_namespace *ns = task_active_pid_ns(current);
   267		struct ctl_table table_copy;
   268	
   269		if (write && !capable(CAP_SYS_ADMIN))
   270			return -EPERM;
   271	
   272		table_copy = *table;
   273		if (ns != &init_pid_ns)
   274			table_copy.data = &ns->memfd_noexec_scope;
   275	
   276		/*
   277		 * set minimum to current value, the effect is only bigger
   278		 * value is accepted.
   279		 */
   280		if (*(int *)table_copy.data > *(int *)table_copy.extra1)
   281			table_copy.extra1 = table_copy.data;
   282	
   283		return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
   284	}
   285
  
kernel test robot Dec. 2, 2022, 1:33 p.m. UTC | #2
Hi,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/jeffxu-chromium-org/mm-memfd-add-MFD_NOEXEC_SEAL-and-MFD_EXEC/20221202-093847
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20221202013404.163143-3-jeffxu%40google.com
patch subject: [PATCH v3] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
config: i386-randconfig-a002
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/d6b8fec1440cb9206bef29eb27359a61e6c73f41
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review jeffxu-chromium-org/mm-memfd-add-MFD_NOEXEC_SEAL-and-MFD_EXEC/20221202-093847
        git checkout d6b8fec1440cb9206bef29eb27359a61e6c73f41
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/pid_namespace.c:263:5: warning: no previous prototype for function 'pid_mfd_noexec_dointvec_minmax' [-Wmissing-prototypes]
   int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write,
       ^
   kernel/pid_namespace.c:263:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write,
   ^
   static 
   1 warning generated.


vim +/pid_mfd_noexec_dointvec_minmax +263 kernel/pid_namespace.c

   261	
   262	#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
 > 263	int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write,
   264		void *buffer, size_t *lenp, loff_t *ppos)
   265	{
   266		struct pid_namespace *ns = task_active_pid_ns(current);
   267		struct ctl_table table_copy;
   268	
   269		if (write && !capable(CAP_SYS_ADMIN))
   270			return -EPERM;
   271	
   272		table_copy = *table;
   273		if (ns != &init_pid_ns)
   274			table_copy.data = &ns->memfd_noexec_scope;
   275	
   276		/*
   277		 * set minimum to current value, the effect is only bigger
   278		 * value is accepted.
   279		 */
   280		if (*(int *)table_copy.data > *(int *)table_copy.extra1)
   281			table_copy.extra1 = table_copy.data;
   282	
   283		return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
   284	}
   285
  
kernel test robot Dec. 2, 2022, 1:43 p.m. UTC | #3
Hi,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/jeffxu-chromium-org/mm-memfd-add-MFD_NOEXEC_SEAL-and-MFD_EXEC/20221202-093847
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20221202013404.163143-3-jeffxu%40google.com
patch subject: [PATCH v3] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
config: i386-randconfig-a013
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/d6b8fec1440cb9206bef29eb27359a61e6c73f41
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review jeffxu-chromium-org/mm-memfd-add-MFD_NOEXEC_SEAL-and-MFD_EXEC/20221202-093847
        git checkout d6b8fec1440cb9206bef29eb27359a61e6c73f41
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/pid_namespace.c:263:5: warning: no previous prototype for function 'pid_mfd_noexec_dointvec_minmax' [-Wmissing-prototypes]
   int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write,
       ^
   kernel/pid_namespace.c:263:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write,
   ^
   static 
   1 warning generated.


vim +/pid_mfd_noexec_dointvec_minmax +263 kernel/pid_namespace.c

   261	
   262	#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
 > 263	int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write,
   264		void *buffer, size_t *lenp, loff_t *ppos)
   265	{
   266		struct pid_namespace *ns = task_active_pid_ns(current);
   267		struct ctl_table table_copy;
   268	
   269		if (write && !capable(CAP_SYS_ADMIN))
   270			return -EPERM;
   271	
   272		table_copy = *table;
   273		if (ns != &init_pid_ns)
   274			table_copy.data = &ns->memfd_noexec_scope;
   275	
   276		/*
   277		 * set minimum to current value, the effect is only bigger
   278		 * value is accepted.
   279		 */
   280		if (*(int *)table_copy.data > *(int *)table_copy.extra1)
   281			table_copy.extra1 = table_copy.data;
   282	
   283		return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
   284	}
   285
  
Kees Cook Dec. 2, 2022, 10:56 p.m. UTC | #4
On Fri, Dec 02, 2022 at 01:34:00AM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
> 
> The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to
> set executable bit at creation time (memfd_create).
> 
> When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
> (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to
> be executable (mode: 0777) after creation.
> 
> when MFD_EXEC flag is set, memfd is created with executable bit
> (mode:0777), this is the same as the old behavior of memfd_create.
> 
> The new pid namespaced sysctl vm.memfd_noexec has 3 values:
> 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
> 	MFD_EXEC was set.
> 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
> 	MFD_NOEXEC_SEAL was set.
> 2:memfd_create() without MFD_NOEXEC_SEAL will be rejected.

    ^ nit: missing space

> 
> The sysctl allows finer control of memfd_create for old-software
> that doesn't set the executable bit, for example, a container with
> vm.memfd_noexec=1 means the old-software will create non-executable
> memfd by default.
> 
> Co-developed-by: Daniel Verkamp <dverkamp@chromium.org>
> Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  include/linux/pid_namespace.h | 19 ++++++++++++++
>  include/uapi/linux/memfd.h    |  4 +++
>  kernel/pid_namespace.c        | 47 +++++++++++++++++++++++++++++++++++
>  mm/memfd.c                    | 44 ++++++++++++++++++++++++++++++--
>  4 files changed, 112 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 07481bb87d4e..a4789a7b34a9 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -16,6 +16,21 @@
>  
>  struct fs_pin;
>  
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> +/*
> + * sysctl for vm.memfd_noexec
> + * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
> + *	acts like MFD_EXEC was set.
> + * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
> + *	acts like MFD_NOEXEC_SEAL was set.
> + * 2: memfd_create() without MFD_NOEXEC_SEAL will be
> + *	rejected.
> + */
> +#define MEMFD_NOEXEC_SCOPE_EXEC		0
> +#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL		1
> +#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED	2
> +#endif
> +
>  struct pid_namespace {
>  	struct idr idr;
>  	struct rcu_head rcu;
> @@ -31,6 +46,10 @@ struct pid_namespace {
>  	struct ucounts *ucounts;
>  	int reboot;	/* group exit code if this pidns was rebooted */
>  	struct ns_common ns;
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> +	/* sysctl for vm.memfd_noexec */
> +	int memfd_noexec_scope;
> +#endif
>  } __randomize_layout;
>  
>  extern struct pid_namespace init_pid_ns;
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 7a8a26751c23..273a4e15dfcf 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -8,6 +8,10 @@
>  #define MFD_CLOEXEC		0x0001U
>  #define MFD_ALLOW_SEALING	0x0002U
>  #define MFD_HUGETLB		0x0004U
> +/* not executable and sealed to prevent changing to executable. */
> +#define MFD_NOEXEC_SEAL		0x0008U
> +/* executable */
> +#define MFD_EXEC		0x0010U
>  
>  /*
>   * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index f4f8cb0435b4..71dd9b0a0f62 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -110,6 +110,10 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>  	ns->ucounts = ucounts;
>  	ns->pid_allocated = PIDNS_ADDING;
>  
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> +	ns->memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC;
> +#endif

I think this should be inherited from the parent pid namespace instead?

> +
>  	return ns;
>  
>  out_free_idr:
> @@ -255,6 +259,45 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	return;
>  }
>  
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> +int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write,
> +	void *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	struct pid_namespace *ns = task_active_pid_ns(current);
> +	struct ctl_table table_copy;
> +
> +	if (write && !capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	table_copy = *table;
> +	if (ns != &init_pid_ns)
> +		table_copy.data = &ns->memfd_noexec_scope;
> +
> +	/*
> +	 * set minimum to current value, the effect is only bigger
> +	 * value is accepted.
> +	 */
> +	if (*(int *)table_copy.data > *(int *)table_copy.extra1)
> +		table_copy.extra1 = table_copy.data;

Yeah, I like this kind of enforcement.

> +
> +	return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
> +}
> +
> +static struct ctl_table pid_ns_ctl_table_vm[] = {
> +	{
> +		.procname	= "memfd_noexec",
> +		.data		= &init_pid_ns.memfd_noexec_scope,
> +		.maxlen		= sizeof(init_pid_ns.memfd_noexec_scope),
> +		.mode		= 0644,
> +		.proc_handler	= pid_mfd_noexec_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_TWO,
> +	},
> +	{ }
> +};
> +static struct ctl_path vm_path[] = { { .procname = "vm", }, { } };
> +#endif
> +
>  #ifdef CONFIG_CHECKPOINT_RESTORE
>  static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>  		void *buffer, size_t *lenp, loff_t *ppos)
> @@ -455,6 +498,10 @@ static __init int pid_namespaces_init(void)
>  #ifdef CONFIG_CHECKPOINT_RESTORE
>  	register_sysctl_paths(kern_path, pid_ns_ctl_table);
>  #endif
> +
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> +	register_sysctl_paths(vm_path, pid_ns_ctl_table_vm);
> +#endif
>  	return 0;
>  }
>  
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 4ebeab94aa74..69e897dea6d5 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -18,6 +18,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/memfd.h>
> +#include <linux/pid_namespace.h>
>  #include <uapi/linux/memfd.h>
>  
>  /*
> @@ -263,12 +264,13 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
>  #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>  #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>  
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
>  
>  SYSCALL_DEFINE2(memfd_create,
>  		const char __user *, uname,
>  		unsigned int, flags)
>  {
> +	struct pid_namespace *ns;
>  	unsigned int *file_seals;
>  	struct file *file;
>  	int fd, error;
> @@ -285,6 +287,36 @@ SYSCALL_DEFINE2(memfd_create,
>  			return -EINVAL;
>  	}
>  
> +	/* Invalid if both EXEC and NOEXEC_SEAL are set.*/
> +	if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL))
> +		return -EINVAL;
> +
> +	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
> +#ifdef CONFIG_SYSCTL
> +		int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
> +
> +		ns = task_active_pid_ns(current);
> +		if (ns)
> +			sysctl = ns->memfd_noexec_scope;
> +
> +		if (sysctl == MEMFD_NOEXEC_SCOPE_EXEC) {
> +			flags |= MFD_EXEC;
> +		} else if (sysctl == MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL) {
> +			flags |= MFD_NOEXEC_SEAL;
> +		} else {
> +			pr_warn_ratelimited(
> +				"memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d\n",
> +				task_pid_nr(current));
> +			return -EINVAL;
> +		}

Not a huge deal, but the above could be a switch statement to improve
readability.

> +#else
> +		flags |= MFD_EXEC;
> +#endif
> +		pr_warn_ratelimited(
> +			"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d\n",
> +			task_pid_nr(current));

Perhaps include process name as well -- makes admin lives easier. :)

		pr_warn_ratelimited(
			"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL: %s[%d]\n",
			current->comm, task_pid_nr(current));

> +	}
> +
>  	/* length includes terminating zero */
>  	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
>  	if (len <= 0)
> @@ -328,7 +360,15 @@ SYSCALL_DEFINE2(memfd_create,
>  	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
>  	file->f_flags |= O_LARGEFILE;
>  
> -	if (flags & MFD_ALLOW_SEALING) {
> +	if (flags & MFD_NOEXEC_SEAL) {
> +		struct inode *inode = file_inode(file);
> +
> +		inode->i_mode &= ~0111;
> +		file_seals = memfd_file_seals_ptr(file);
> +		*file_seals &= ~F_SEAL_SEAL;
> +		*file_seals |= F_SEAL_EXEC;
> +	} else if (flags & MFD_ALLOW_SEALING) {
> +		/* MFD_EXEC and MFD_ALLOW_SEALING are set */
>  		file_seals = memfd_file_seals_ptr(file);
>  		*file_seals &= ~F_SEAL_SEAL;
>  	}
> -- 
> 2.39.0.rc0.267.gcb52ba06e7-goog
> 

Otherwise, looks good to me.

Reviewed-by: Kees Cook <keescook@chromium.org>
  
Jeff Xu Dec. 2, 2022, 11:32 p.m. UTC | #5
On Fri, Dec 2, 2022 at 2:56 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Fri, Dec 02, 2022 at 01:34:00AM +0000, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to
> > set executable bit at creation time (memfd_create).
> >
> > When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
> > (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to
> > be executable (mode: 0777) after creation.
> >
> > when MFD_EXEC flag is set, memfd is created with executable bit
> > (mode:0777), this is the same as the old behavior of memfd_create.
> >
> > The new pid namespaced sysctl vm.memfd_noexec has 3 values:
> > 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
> >       MFD_EXEC was set.
> > 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
> >       MFD_NOEXEC_SEAL was set.
> > 2:memfd_create() without MFD_NOEXEC_SEAL will be rejected.
>
>     ^ nit: missing space
>
> >
> > The sysctl allows finer control of memfd_create for old-software
> > that doesn't set the executable bit, for example, a container with
> > vm.memfd_noexec=1 means the old-software will create non-executable
> > memfd by default.
> >
> > Co-developed-by: Daniel Verkamp <dverkamp@chromium.org>
> > Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
> > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > ---
> >  include/linux/pid_namespace.h | 19 ++++++++++++++
> >  include/uapi/linux/memfd.h    |  4 +++
> >  kernel/pid_namespace.c        | 47 +++++++++++++++++++++++++++++++++++
> >  mm/memfd.c                    | 44 ++++++++++++++++++++++++++++++--
> >  4 files changed, 112 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> > index 07481bb87d4e..a4789a7b34a9 100644
> > --- a/include/linux/pid_namespace.h
> > +++ b/include/linux/pid_namespace.h
> > @@ -16,6 +16,21 @@
> >
> >  struct fs_pin;
> >
> > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> > +/*
> > + * sysctl for vm.memfd_noexec
> > + * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
> > + *   acts like MFD_EXEC was set.
> > + * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
> > + *   acts like MFD_NOEXEC_SEAL was set.
> > + * 2: memfd_create() without MFD_NOEXEC_SEAL will be
> > + *   rejected.
> > + */
> > +#define MEMFD_NOEXEC_SCOPE_EXEC              0
> > +#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL               1
> > +#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED   2
> > +#endif
> > +
> >  struct pid_namespace {
> >       struct idr idr;
> >       struct rcu_head rcu;
> > @@ -31,6 +46,10 @@ struct pid_namespace {
> >       struct ucounts *ucounts;
> >       int reboot;     /* group exit code if this pidns was rebooted */
> >       struct ns_common ns;
> > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> > +     /* sysctl for vm.memfd_noexec */
> > +     int memfd_noexec_scope;
> > +#endif
> >  } __randomize_layout;
> >
> >  extern struct pid_namespace init_pid_ns;
> > diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> > index 7a8a26751c23..273a4e15dfcf 100644
> > --- a/include/uapi/linux/memfd.h
> > +++ b/include/uapi/linux/memfd.h
> > @@ -8,6 +8,10 @@
> >  #define MFD_CLOEXEC          0x0001U
> >  #define MFD_ALLOW_SEALING    0x0002U
> >  #define MFD_HUGETLB          0x0004U
> > +/* not executable and sealed to prevent changing to executable. */
> > +#define MFD_NOEXEC_SEAL              0x0008U
> > +/* executable */
> > +#define MFD_EXEC             0x0010U
> >
> >  /*
> >   * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> > index f4f8cb0435b4..71dd9b0a0f62 100644
> > --- a/kernel/pid_namespace.c
> > +++ b/kernel/pid_namespace.c
> > @@ -110,6 +110,10 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
> >       ns->ucounts = ucounts;
> >       ns->pid_allocated = PIDNS_ADDING;
> >
> > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> > +     ns->memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC;
> > +#endif
>
> I think this should be inherited from the parent pid namespace instead?
>
OK, I will add logic to make the child name space to inherit from the
parent namespace at creation time,
but if the parent changes this setting later, the child will not pick
up the parent's change automatically, would that be OK ?

Or do we want the child to keep in sync with the parent, to the
highest value. For example,
if init namespace val=1, child namespace=1, later when init namespace
= 2, child namespace will automatically become 2.

Jeff

> > +
> >       return ns;
> >
> >  out_free_idr:
> > @@ -255,6 +259,45 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> >       return;
> >  }
> >
> > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> > +int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write,
> > +     void *buffer, size_t *lenp, loff_t *ppos)
> > +{
> > +     struct pid_namespace *ns = task_active_pid_ns(current);
> > +     struct ctl_table table_copy;
> > +
> > +     if (write && !capable(CAP_SYS_ADMIN))
> > +             return -EPERM;
> > +
> > +     table_copy = *table;
> > +     if (ns != &init_pid_ns)
> > +             table_copy.data = &ns->memfd_noexec_scope;
> > +
> > +     /*
> > +      * set minimum to current value, the effect is only bigger
> > +      * value is accepted.
> > +      */
> > +     if (*(int *)table_copy.data > *(int *)table_copy.extra1)
> > +             table_copy.extra1 = table_copy.data;
>
> Yeah, I like this kind of enforcement.
>
> > +
> > +     return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
> > +}
> > +
> > +static struct ctl_table pid_ns_ctl_table_vm[] = {
> > +     {
> > +             .procname       = "memfd_noexec",
> > +             .data           = &init_pid_ns.memfd_noexec_scope,
> > +             .maxlen         = sizeof(init_pid_ns.memfd_noexec_scope),
> > +             .mode           = 0644,
> > +             .proc_handler   = pid_mfd_noexec_dointvec_minmax,
> > +             .extra1         = SYSCTL_ZERO,
> > +             .extra2         = SYSCTL_TWO,
> > +     },
> > +     { }
> > +};
> > +static struct ctl_path vm_path[] = { { .procname = "vm", }, { } };
> > +#endif
> > +
> >  #ifdef CONFIG_CHECKPOINT_RESTORE
> >  static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> >               void *buffer, size_t *lenp, loff_t *ppos)
> > @@ -455,6 +498,10 @@ static __init int pid_namespaces_init(void)
> >  #ifdef CONFIG_CHECKPOINT_RESTORE
> >       register_sysctl_paths(kern_path, pid_ns_ctl_table);
> >  #endif
> > +
> > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> > +     register_sysctl_paths(vm_path, pid_ns_ctl_table_vm);
> > +#endif
> >       return 0;
> >  }
> >
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 4ebeab94aa74..69e897dea6d5 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/hugetlb.h>
> >  #include <linux/shmem_fs.h>
> >  #include <linux/memfd.h>
> > +#include <linux/pid_namespace.h>
> >  #include <uapi/linux/memfd.h>
> >
> >  /*
> > @@ -263,12 +264,13 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> >  #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> >  #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
> >
> > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> > +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
> >
> >  SYSCALL_DEFINE2(memfd_create,
> >               const char __user *, uname,
> >               unsigned int, flags)
> >  {
> > +     struct pid_namespace *ns;
> >       unsigned int *file_seals;
> >       struct file *file;
> >       int fd, error;
> > @@ -285,6 +287,36 @@ SYSCALL_DEFINE2(memfd_create,
> >                       return -EINVAL;
> >       }
> >
> > +     /* Invalid if both EXEC and NOEXEC_SEAL are set.*/
> > +     if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL))
> > +             return -EINVAL;
> > +
> > +     if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
> > +#ifdef CONFIG_SYSCTL
> > +             int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
> > +
> > +             ns = task_active_pid_ns(current);
> > +             if (ns)
> > +                     sysctl = ns->memfd_noexec_scope;
> > +
> > +             if (sysctl == MEMFD_NOEXEC_SCOPE_EXEC) {
> > +                     flags |= MFD_EXEC;
> > +             } else if (sysctl == MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL) {
> > +                     flags |= MFD_NOEXEC_SEAL;
> > +             } else {
> > +                     pr_warn_ratelimited(
> > +                             "memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d\n",
> > +                             task_pid_nr(current));
> > +                     return -EINVAL;
> > +             }
>
> Not a huge deal, but the above could be a switch statement to improve
> readability.
>
> > +#else
> > +             flags |= MFD_EXEC;
> > +#endif
> > +             pr_warn_ratelimited(
> > +                     "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d\n",
> > +                     task_pid_nr(current));
>
> Perhaps include process name as well -- makes admin lives easier. :)
>
>                 pr_warn_ratelimited(
>                         "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL: %s[%d]\n",
>                         current->comm, task_pid_nr(current));
>
> > +     }
> > +
> >       /* length includes terminating zero */
> >       len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
> >       if (len <= 0)
> > @@ -328,7 +360,15 @@ SYSCALL_DEFINE2(memfd_create,
> >       file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> >       file->f_flags |= O_LARGEFILE;
> >
> > -     if (flags & MFD_ALLOW_SEALING) {
> > +     if (flags & MFD_NOEXEC_SEAL) {
> > +             struct inode *inode = file_inode(file);
> > +
> > +             inode->i_mode &= ~0111;
> > +             file_seals = memfd_file_seals_ptr(file);
> > +             *file_seals &= ~F_SEAL_SEAL;
> > +             *file_seals |= F_SEAL_EXEC;
> > +     } else if (flags & MFD_ALLOW_SEALING) {
> > +             /* MFD_EXEC and MFD_ALLOW_SEALING are set */
> >               file_seals = memfd_file_seals_ptr(file);
> >               *file_seals &= ~F_SEAL_SEAL;
> >       }
> > --
> > 2.39.0.rc0.267.gcb52ba06e7-goog
> >
>
> Otherwise, looks good to me.
>
> Reviewed-by: Kees Cook <keescook@chromium.org>
>
> --
> Kees Cook
  

Patch

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 07481bb87d4e..a4789a7b34a9 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -16,6 +16,21 @@ 
 
 struct fs_pin;
 
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+/*
+ * sysctl for vm.memfd_noexec
+ * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
+ *	acts like MFD_EXEC was set.
+ * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
+ *	acts like MFD_NOEXEC_SEAL was set.
+ * 2: memfd_create() without MFD_NOEXEC_SEAL will be
+ *	rejected.
+ */
+#define MEMFD_NOEXEC_SCOPE_EXEC		0
+#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL		1
+#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED	2
+#endif
+
 struct pid_namespace {
 	struct idr idr;
 	struct rcu_head rcu;
@@ -31,6 +46,10 @@  struct pid_namespace {
 	struct ucounts *ucounts;
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+	/* sysctl for vm.memfd_noexec */
+	int memfd_noexec_scope;
+#endif
 } __randomize_layout;
 
 extern struct pid_namespace init_pid_ns;
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..273a4e15dfcf 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,10 @@ 
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+/* not executable and sealed to prevent changing to executable. */
+#define MFD_NOEXEC_SEAL		0x0008U
+/* executable */
+#define MFD_EXEC		0x0010U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index f4f8cb0435b4..71dd9b0a0f62 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -110,6 +110,10 @@  static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	ns->ucounts = ucounts;
 	ns->pid_allocated = PIDNS_ADDING;
 
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+	ns->memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC;
+#endif
+
 	return ns;
 
 out_free_idr:
@@ -255,6 +259,45 @@  void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	return;
 }
 
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write,
+	void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct pid_namespace *ns = task_active_pid_ns(current);
+	struct ctl_table table_copy;
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	table_copy = *table;
+	if (ns != &init_pid_ns)
+		table_copy.data = &ns->memfd_noexec_scope;
+
+	/*
+	 * set minimum to current value, the effect is only bigger
+	 * value is accepted.
+	 */
+	if (*(int *)table_copy.data > *(int *)table_copy.extra1)
+		table_copy.extra1 = table_copy.data;
+
+	return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
+}
+
+static struct ctl_table pid_ns_ctl_table_vm[] = {
+	{
+		.procname	= "memfd_noexec",
+		.data		= &init_pid_ns.memfd_noexec_scope,
+		.maxlen		= sizeof(init_pid_ns.memfd_noexec_scope),
+		.mode		= 0644,
+		.proc_handler	= pid_mfd_noexec_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_TWO,
+	},
+	{ }
+};
+static struct ctl_path vm_path[] = { { .procname = "vm", }, { } };
+#endif
+
 #ifdef CONFIG_CHECKPOINT_RESTORE
 static int pid_ns_ctl_handler(struct ctl_table *table, int write,
 		void *buffer, size_t *lenp, loff_t *ppos)
@@ -455,6 +498,10 @@  static __init int pid_namespaces_init(void)
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	register_sysctl_paths(kern_path, pid_ns_ctl_table);
 #endif
+
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+	register_sysctl_paths(vm_path, pid_ns_ctl_table_vm);
+#endif
 	return 0;
 }
 
diff --git a/mm/memfd.c b/mm/memfd.c
index 4ebeab94aa74..69e897dea6d5 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -18,6 +18,7 @@ 
 #include <linux/hugetlb.h>
 #include <linux/shmem_fs.h>
 #include <linux/memfd.h>
+#include <linux/pid_namespace.h>
 #include <uapi/linux/memfd.h>
 
 /*
@@ -263,12 +264,13 @@  long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
 		unsigned int, flags)
 {
+	struct pid_namespace *ns;
 	unsigned int *file_seals;
 	struct file *file;
 	int fd, error;
@@ -285,6 +287,36 @@  SYSCALL_DEFINE2(memfd_create,
 			return -EINVAL;
 	}
 
+	/* Invalid if both EXEC and NOEXEC_SEAL are set.*/
+	if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL))
+		return -EINVAL;
+
+	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
+#ifdef CONFIG_SYSCTL
+		int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
+
+		ns = task_active_pid_ns(current);
+		if (ns)
+			sysctl = ns->memfd_noexec_scope;
+
+		if (sysctl == MEMFD_NOEXEC_SCOPE_EXEC) {
+			flags |= MFD_EXEC;
+		} else if (sysctl == MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL) {
+			flags |= MFD_NOEXEC_SEAL;
+		} else {
+			pr_warn_ratelimited(
+				"memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d\n",
+				task_pid_nr(current));
+			return -EINVAL;
+		}
+#else
+		flags |= MFD_EXEC;
+#endif
+		pr_warn_ratelimited(
+			"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d\n",
+			task_pid_nr(current));
+	}
+
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
 	if (len <= 0)
@@ -328,7 +360,15 @@  SYSCALL_DEFINE2(memfd_create,
 	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
 	file->f_flags |= O_LARGEFILE;
 
-	if (flags & MFD_ALLOW_SEALING) {
+	if (flags & MFD_NOEXEC_SEAL) {
+		struct inode *inode = file_inode(file);
+
+		inode->i_mode &= ~0111;
+		file_seals = memfd_file_seals_ptr(file);
+		*file_seals &= ~F_SEAL_SEAL;
+		*file_seals |= F_SEAL_EXEC;
+	} else if (flags & MFD_ALLOW_SEALING) {
+		/* MFD_EXEC and MFD_ALLOW_SEALING are set */
 		file_seals = memfd_file_seals_ptr(file);
 		*file_seals &= ~F_SEAL_SEAL;
 	}