Message ID | 20221207154939.2532830-4-jeffxu@google.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:f944:0:0:0:0:0 with SMTP id q4csp260377wrr; Wed, 7 Dec 2022 07:51:34 -0800 (PST) X-Google-Smtp-Source: AA0mqf4ULw8B8QjUWmjD4yoLsbufugBkXvTrAERRgFzuM/1aAbokAJ+Bimjf7ZEySm9yV1NpPA0b X-Received: by 2002:a17:906:5e55:b0:7c0:cfc5:dd0c with SMTP id b21-20020a1709065e5500b007c0cfc5dd0cmr17949993eju.743.1670428293955; Wed, 07 Dec 2022 07:51:33 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1670428293; cv=none; d=google.com; s=arc-20160816; b=HW+LDflL8+zqfM/RvSyTgVPIX8YI6UTimfezWEC6L4xcl3b6qtviAK+niNEkewYkLz gfT2GkJdY3QDYSPjU23VOU6pcUQeHcbm/tlRU+F6b5zNhtMbcvMRip7AoxQLpMiBOPH9 NW/JitPoHHUoWF6E4lOyOq0msmn0R2UdOxayCM4+dS2x2HMCJGO2hBprc/lqgWHB1Jgw pAMSFbdmH65/ggEX5iuTj4Z3sLJdePTUQrUKS+jiCH2/LAg7GC5jsYMbwXTeQOCM0NLT EgS45acbgePKzCIBPD0rVN/4Jq7YEB3oqlAImW8i6/GkqsmxFdNQC5dsUB+rZLRg8rvY GAgA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=j02hcUmpYLv5fRuUpSeXXOL/JCHtO2qkj12Q1X4RZAg=; b=WgF9Y/iKKLgU7lRXGA+r26oKHsqELhQ+Hg1cmn5lDP7FiynG90GEsyniar1TIn/VOH bpfT8O+P8kZqv9JwkC5iTgqlRtY6JliRNvY4CUlFjLmBt0cWImTqrDEo4W6exzxMTVs7 ol5CLrsaFc9LhBWzlh5NfadByJB/mIq+bZ6RMYYh8ZXiu8+RRj+5Cua9O5p4Q1DKjvIr LH03Vn71U0kQjE5DuC0bOxuefLWNrweA88ZDDLr6NTfjpuX7JxajiuNImPU3w+CbDS3V vRkj1hfTwlqlWYp1pmK2PlJiH3OeP1CV6JPsvPuxsuY89+SOISXtGi9G5Hb/VVwZIeMx PuzA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@chromium.org header.s=google header.b=R5BHpK7Z; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chromium.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id tz14-20020a170907c78e00b0078e1d213812si13724635ejc.184.2022.12.07.07.51.10; Wed, 07 Dec 2022 07:51:33 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@chromium.org header.s=google header.b=R5BHpK7Z; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chromium.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229911AbiLGPuI (ORCPT <rfc822;foxyelen666@gmail.com> + 99 others); Wed, 7 Dec 2022 10:50:08 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50628 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229819AbiLGPtt (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 7 Dec 2022 10:49:49 -0500 Received: from mail-pj1-x102c.google.com (mail-pj1-x102c.google.com [IPv6:2607:f8b0:4864:20::102c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2DE1060B47 for <linux-kernel@vger.kernel.org>; Wed, 7 Dec 2022 07:49:48 -0800 (PST) Received: by mail-pj1-x102c.google.com with SMTP id u5so8621868pjy.5 for <linux-kernel@vger.kernel.org>; Wed, 07 Dec 2022 07:49:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=j02hcUmpYLv5fRuUpSeXXOL/JCHtO2qkj12Q1X4RZAg=; b=R5BHpK7ZG9aovJ+1jeInMWGBsCCU+DnF25bMLAW45g60cIzG8FFjGMOXE+YWkWnupK kQHfR+uCW3ib+lGzbzFcJTQMcw+b6mTj+yo7v+J3Y5PsUrokernH9eTgmzS5L2ZyHQPx He6qg6g0nzNS1MVsYGfSwJbIUQhYQcP1HAQvE= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=j02hcUmpYLv5fRuUpSeXXOL/JCHtO2qkj12Q1X4RZAg=; b=Wv5JlTAPsow0UyR6HhW2K15nG4LJOJNvRVBvXS5K7NQPvTyHKIWEOqDXJoLXOciHM1 E+L3sFK3ha+GwaZb2H+yz943wZnaTtpwhzbX01QK2PWSPDL4yAGbSPXY0MUgvu9bO8oI DmXuASuEcSzY54+HKRsbXr0XAUb5QYdsQYhiqmwSNfRYm+RIs3GvuwKqpt6/yi5DCjkx R+y70iuvSYapIPmZB97a/ozvvUwhwMkRjEYyoxvsTHO2pn0/drX7GWUGfVj4I1zqqFWV tfN+isC55V6bWdt6wR50dgYKAvdZL8X9d/Iwajzdsb4e32/oOVHYl/lyy473UBhg3fI5 bykw== X-Gm-Message-State: ANoB5pnf7gxLgLUSaco02zhYm5tgYTEpGJzE5UuO2MgL7IoprTiBgFCM 80RIgwHfZJOG0iNQRlTdavduSg== X-Received: by 2002:a17:902:9891:b0:189:2688:c97f with SMTP id s17-20020a170902989100b001892688c97fmr690338plp.50.1670428187656; Wed, 07 Dec 2022 07:49:47 -0800 (PST) Received: from jeffxud.c.googlers.com.com (30.202.168.34.bc.googleusercontent.com. [34.168.202.30]) by smtp.gmail.com with ESMTPSA id a9-20020a170902ecc900b0017f7628cbddsm14920934plh.30.2022.12.07.07.49.46 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 07 Dec 2022 07:49:46 -0800 (PST) From: jeffxu@chromium.org To: skhan@linuxfoundation.org, keescook@chromium.org Cc: akpm@linux-foundation.org, dmitry.torokhov@gmail.com, dverkamp@chromium.org, hughd@google.com, jeffxu@google.com, jorgelo@chromium.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org, jannh@google.com, linux-hardening@vger.kernel.org, kernel test robot <lkp@intel.com> Subject: [PATCH v6 3/6] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC Date: Wed, 7 Dec 2022 15:49:36 +0000 Message-Id: <20221207154939.2532830-4-jeffxu@google.com> X-Mailer: git-send-email 2.39.0.rc0.267.gcb52ba06e7-goog In-Reply-To: <20221207154939.2532830-1-jeffxu@google.com> References: <20221207154939.2532830-1-jeffxu@google.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1751571018444711132?= X-GMAIL-MSGID: =?utf-8?q?1751571018444711132?= |
Series |
mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC
|
|
Commit Message
Jeff Xu
Dec. 7, 2022, 3:49 p.m. UTC
From: Jeff Xu <jeffxu@google.com> The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set executable bit at creation time (memfd_create). When MFD_NOEXEC_SEAL is set, memfd is created without executable bit (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to be executable (mode: 0777) after creation. when MFD_EXEC flag is set, memfd is created with executable bit (mode:0777), this is the same as the old behavior of memfd_create. The new pid namespaced sysctl vm.memfd_noexec has 3 values: 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_EXEC was set. 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_NOEXEC_SEAL was set. 2: memfd_create() without MFD_NOEXEC_SEAL will be rejected. The sysctl allows finer control of memfd_create for old-software that doesn't set the executable bit, for example, a container with vm.memfd_noexec=1 means the old-software will create non-executable memfd by default. Also, the value of memfd_noexec is passed to child namespace at creation time. For example, if the init namespace has vm.memfd_noexec=2, all its children namespaces will be created with 2. Signed-off-by: Jeff Xu <jeffxu@google.com> Co-developed-by: Daniel Verkamp <dverkamp@chromium.org> Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> Reported-by: kernel test robot <lkp@intel.com> --- include/linux/pid_namespace.h | 19 +++++++++++ include/uapi/linux/memfd.h | 4 +++ kernel/pid_namespace.c | 5 +++ kernel/pid_sysctl.h | 59 +++++++++++++++++++++++++++++++++++ mm/memfd.c | 48 ++++++++++++++++++++++++++-- 5 files changed, 133 insertions(+), 2 deletions(-) create mode 100644 kernel/pid_sysctl.h
Comments
On Wed, Dec 07, 2022 at 03:49:36PM +0000, jeffxu@chromium.org wrote: > From: Jeff Xu <jeffxu@google.com> > > The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to > set executable bit at creation time (memfd_create). > > When MFD_NOEXEC_SEAL is set, memfd is created without executable bit > (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to > be executable (mode: 0777) after creation. > > when MFD_EXEC flag is set, memfd is created with executable bit > (mode:0777), this is the same as the old behavior of memfd_create. > > The new pid namespaced sysctl vm.memfd_noexec has 3 values: > 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like > MFD_EXEC was set. > 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like > MFD_NOEXEC_SEAL was set. > 2: memfd_create() without MFD_NOEXEC_SEAL will be rejected. > > The sysctl allows finer control of memfd_create for old-software > that doesn't set the executable bit, for example, a container with > vm.memfd_noexec=1 means the old-software will create non-executable > memfd by default. Also, the value of memfd_noexec is passed to child > namespace at creation time. For example, if the init namespace has > vm.memfd_noexec=2, all its children namespaces will be created with 2. > > Signed-off-by: Jeff Xu <jeffxu@google.com> > Co-developed-by: Daniel Verkamp <dverkamp@chromium.org> > Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> > Reported-by: kernel test robot <lkp@intel.com> Please rearrange these tags, and add a link to the lkp report: Reported-by: kernel test robot <lkp@intel.com> Link: ...url.to.lkp.lore.email... Co-developed-by: Daniel Verkamp <dverkamp@chromium.org> Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> Signed-off-by: Jeff Xu <jeffxu@google.com> > --- > include/linux/pid_namespace.h | 19 +++++++++++ > include/uapi/linux/memfd.h | 4 +++ > kernel/pid_namespace.c | 5 +++ > kernel/pid_sysctl.h | 59 +++++++++++++++++++++++++++++++++++ > mm/memfd.c | 48 ++++++++++++++++++++++++++-- > 5 files changed, 133 insertions(+), 2 deletions(-) > create mode 100644 kernel/pid_sysctl.h > > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h > index 07481bb87d4e..a4789a7b34a9 100644 > --- a/include/linux/pid_namespace.h > +++ b/include/linux/pid_namespace.h > @@ -16,6 +16,21 @@ > > struct fs_pin; > > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) > +/* > + * sysctl for vm.memfd_noexec > + * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL > + * acts like MFD_EXEC was set. > + * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL > + * acts like MFD_NOEXEC_SEAL was set. > + * 2: memfd_create() without MFD_NOEXEC_SEAL will be > + * rejected. > + */ > +#define MEMFD_NOEXEC_SCOPE_EXEC 0 > +#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 > +#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 These don't align? I think a tab is missing on MEMFD_NOEXEC_SCOPE_EXEC. > +#endif > + > struct pid_namespace { > struct idr idr; > struct rcu_head rcu; > @@ -31,6 +46,10 @@ struct pid_namespace { > struct ucounts *ucounts; > int reboot; /* group exit code if this pidns was rebooted */ > struct ns_common ns; > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) > + /* sysctl for vm.memfd_noexec */ > + int memfd_noexec_scope; > +#endif > } __randomize_layout; > > extern struct pid_namespace init_pid_ns; > diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h > index 7a8a26751c23..273a4e15dfcf 100644 > --- a/include/uapi/linux/memfd.h > +++ b/include/uapi/linux/memfd.h > @@ -8,6 +8,10 @@ > #define MFD_CLOEXEC 0x0001U > #define MFD_ALLOW_SEALING 0x0002U > #define MFD_HUGETLB 0x0004U > +/* not executable and sealed to prevent changing to executable. */ > +#define MFD_NOEXEC_SEAL 0x0008U > +/* executable */ > +#define MFD_EXEC 0x0010U > > /* > * Huge page size encoding when MFD_HUGETLB is specified, and a huge page > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c > index f4f8cb0435b4..8a98b1af9376 100644 > --- a/kernel/pid_namespace.c > +++ b/kernel/pid_namespace.c > @@ -23,6 +23,7 @@ > #include <linux/sched/task.h> > #include <linux/sched/signal.h> > #include <linux/idr.h> > +#include "pid_sysctl.h" > > static DEFINE_MUTEX(pid_caches_mutex); > static struct kmem_cache *pid_ns_cachep; > @@ -110,6 +111,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns > ns->ucounts = ucounts; > ns->pid_allocated = PIDNS_ADDING; > > + initialize_memfd_noexec_scope(ns); > + > return ns; > > out_free_idr: > @@ -455,6 +458,8 @@ static __init int pid_namespaces_init(void) > #ifdef CONFIG_CHECKPOINT_RESTORE > register_sysctl_paths(kern_path, pid_ns_ctl_table); > #endif > + > + register_pid_ns_sysctl_table_vm(); > return 0; > } > > diff --git a/kernel/pid_sysctl.h b/kernel/pid_sysctl.h > new file mode 100644 > index 000000000000..5986d6493b5b > --- /dev/null > +++ b/kernel/pid_sysctl.h > @@ -0,0 +1,59 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef LINUX_PID_SYSCTL_H > +#define LINUX_PID_SYSCTL_H > + > +#include <linux/pid_namespace.h> > + > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) > +static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns) > +{ > + ns->memfd_noexec_scope = > + task_active_pid_ns(current)->memfd_noexec_scope; > +} > + > +static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, > + int write, void *buf, size_t *lenp, loff_t *ppos) > +{ > + struct pid_namespace *ns = task_active_pid_ns(current); > + struct ctl_table table_copy; > + > + if (write && !capable(CAP_SYS_ADMIN)) > + return -EPERM; Should this be CAP_SYS_ADMIN within the userns, rather than the global init_task CAP_SYS_ADMIN? > + > + table_copy = *table; > + if (ns != &init_pid_ns) > + table_copy.data = &ns->memfd_noexec_scope; > + > + /* > + * set minimum to current value, the effect is only bigger > + * value is accepted. > + */ > + if (*(int *)table_copy.data > *(int *)table_copy.extra1) > + table_copy.extra1 = table_copy.data; > + > + return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos); > +} > + > +static struct ctl_table pid_ns_ctl_table_vm[] = { > + { > + .procname = "memfd_noexec", > + .data = &init_pid_ns.memfd_noexec_scope, > + .maxlen = sizeof(init_pid_ns.memfd_noexec_scope), > + .mode = 0644, > + .proc_handler = pid_mfd_noexec_dointvec_minmax, > + .extra1 = SYSCTL_ZERO, > + .extra2 = SYSCTL_TWO, > + }, > + { } > +}; > +static struct ctl_path vm_path[] = { { .procname = "vm", }, { } }; > +static inline void register_pid_ns_sysctl_table_vm(void) > +{ > + register_sysctl_paths(vm_path, pid_ns_ctl_table_vm); > +} > +#else > +static inline void set_memfd_noexec_scope(struct pid_namespace *ns) {} > +static inline void register_pid_ns_ctl_table_vm(void) {} > +#endif > + > +#endif /* LINUX_PID_SYSCTL_H */ > diff --git a/mm/memfd.c b/mm/memfd.c > index 4ebeab94aa74..ec70675a7069 100644 > --- a/mm/memfd.c > +++ b/mm/memfd.c > @@ -18,6 +18,7 @@ > #include <linux/hugetlb.h> > #include <linux/shmem_fs.h> > #include <linux/memfd.h> > +#include <linux/pid_namespace.h> > #include <uapi/linux/memfd.h> > > /* > @@ -263,12 +264,14 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) > #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) > #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) > > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) > +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC) > > SYSCALL_DEFINE2(memfd_create, > const char __user *, uname, > unsigned int, flags) > { > + char comm[TASK_COMM_LEN]; I'm fine with using "comm", but technically, it's not needed: task->comm will always be %NUL terminated. > + struct pid_namespace *ns; > unsigned int *file_seals; > struct file *file; > int fd, error; > @@ -285,6 +288,39 @@ SYSCALL_DEFINE2(memfd_create, > return -EINVAL; > } > > + /* Invalid if both EXEC and NOEXEC_SEAL are set.*/ > + if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL)) > + return -EINVAL; > + > + if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) { > +#ifdef CONFIG_SYSCTL > + int sysctl = MEMFD_NOEXEC_SCOPE_EXEC; > + > + ns = task_active_pid_ns(current); > + if (ns) > + sysctl = ns->memfd_noexec_scope; > + > + switch (sysctl) { > + case MEMFD_NOEXEC_SCOPE_EXEC: > + flags |= MFD_EXEC; > + break; > + case MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL: > + flags |= MFD_NOEXEC_SEAL; > + break; > + default: > + pr_warn_ratelimited( > + "memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d '%s'\n", > + task_pid_nr(current), get_task_comm(comm, current)); > + return -EINVAL; > + } > +#else > + flags |= MFD_EXEC; > +#endif > + pr_warn_ratelimited( > + "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n", > + task_pid_nr(current), get_task_comm(comm, current)); > + } > + > /* length includes terminating zero */ > len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); > if (len <= 0) > @@ -328,7 +364,15 @@ SYSCALL_DEFINE2(memfd_create, > file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; > file->f_flags |= O_LARGEFILE; > > - if (flags & MFD_ALLOW_SEALING) { > + if (flags & MFD_NOEXEC_SEAL) { > + struct inode *inode = file_inode(file); > + > + inode->i_mode &= ~0111; > + file_seals = memfd_file_seals_ptr(file); > + *file_seals &= ~F_SEAL_SEAL; > + *file_seals |= F_SEAL_EXEC; > + } else if (flags & MFD_ALLOW_SEALING) { > + /* MFD_EXEC and MFD_ALLOW_SEALING are set */ > file_seals = memfd_file_seals_ptr(file); > *file_seals &= ~F_SEAL_SEAL; > } > -- > 2.39.0.rc0.267.gcb52ba06e7-goog > Otherwise looks good!
On Thu, Dec 8, 2022 at 8:27 AM Kees Cook <keescook@chromium.org> wrote: > > On Wed, Dec 07, 2022 at 03:49:36PM +0000, jeffxu@chromium.org wrote: > > From: Jeff Xu <jeffxu@google.com> > > > > The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to > > set executable bit at creation time (memfd_create). > > > > When MFD_NOEXEC_SEAL is set, memfd is created without executable bit > > (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to > > be executable (mode: 0777) after creation. > > > > when MFD_EXEC flag is set, memfd is created with executable bit > > (mode:0777), this is the same as the old behavior of memfd_create. > > > > The new pid namespaced sysctl vm.memfd_noexec has 3 values: > > 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like > > MFD_EXEC was set. > > 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like > > MFD_NOEXEC_SEAL was set. > > 2: memfd_create() without MFD_NOEXEC_SEAL will be rejected. > > > > The sysctl allows finer control of memfd_create for old-software > > that doesn't set the executable bit, for example, a container with > > vm.memfd_noexec=1 means the old-software will create non-executable > > memfd by default. Also, the value of memfd_noexec is passed to child > > namespace at creation time. For example, if the init namespace has > > vm.memfd_noexec=2, all its children namespaces will be created with 2. > > > > Signed-off-by: Jeff Xu <jeffxu@google.com> > > Co-developed-by: Daniel Verkamp <dverkamp@chromium.org> > > Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> > > Reported-by: kernel test robot <lkp@intel.com> > > Please rearrange these tags, and add a link to the lkp report: > > Reported-by: kernel test robot <lkp@intel.com> > Link: ...url.to.lkp.lore.email... > Co-developed-by: Daniel Verkamp <dverkamp@chromium.org> > Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> > Signed-off-by: Jeff Xu <jeffxu@google.com> > > > --- > > include/linux/pid_namespace.h | 19 +++++++++++ > > include/uapi/linux/memfd.h | 4 +++ > > kernel/pid_namespace.c | 5 +++ > > kernel/pid_sysctl.h | 59 +++++++++++++++++++++++++++++++++++ > > mm/memfd.c | 48 ++++++++++++++++++++++++++-- > > 5 files changed, 133 insertions(+), 2 deletions(-) > > create mode 100644 kernel/pid_sysctl.h > > > > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h > > index 07481bb87d4e..a4789a7b34a9 100644 > > --- a/include/linux/pid_namespace.h > > +++ b/include/linux/pid_namespace.h > > @@ -16,6 +16,21 @@ > > > > struct fs_pin; > > > > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) > > +/* > > + * sysctl for vm.memfd_noexec > > + * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL > > + * acts like MFD_EXEC was set. > > + * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL > > + * acts like MFD_NOEXEC_SEAL was set. > > + * 2: memfd_create() without MFD_NOEXEC_SEAL will be > > + * rejected. > > + */ > > +#define MEMFD_NOEXEC_SCOPE_EXEC 0 > > +#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 > > +#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 > > These don't align? I think a tab is missing on MEMFD_NOEXEC_SCOPE_EXEC. > Done > > +#endif > > + > > struct pid_namespace { > > struct idr idr; > > struct rcu_head rcu; > > @@ -31,6 +46,10 @@ struct pid_namespace { > > struct ucounts *ucounts; > > int reboot; /* group exit code if this pidns was rebooted */ > > struct ns_common ns; > > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) > > + /* sysctl for vm.memfd_noexec */ > > + int memfd_noexec_scope; > > +#endif > > } __randomize_layout; > > > > extern struct pid_namespace init_pid_ns; > > diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h > > index 7a8a26751c23..273a4e15dfcf 100644 > > --- a/include/uapi/linux/memfd.h > > +++ b/include/uapi/linux/memfd.h > > @@ -8,6 +8,10 @@ > > #define MFD_CLOEXEC 0x0001U > > #define MFD_ALLOW_SEALING 0x0002U > > #define MFD_HUGETLB 0x0004U > > +/* not executable and sealed to prevent changing to executable. */ > > +#define MFD_NOEXEC_SEAL 0x0008U > > +/* executable */ > > +#define MFD_EXEC 0x0010U > > > > /* > > * Huge page size encoding when MFD_HUGETLB is specified, and a huge page > > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c > > index f4f8cb0435b4..8a98b1af9376 100644 > > --- a/kernel/pid_namespace.c > > +++ b/kernel/pid_namespace.c > > @@ -23,6 +23,7 @@ > > #include <linux/sched/task.h> > > #include <linux/sched/signal.h> > > #include <linux/idr.h> > > +#include "pid_sysctl.h" > > > > static DEFINE_MUTEX(pid_caches_mutex); > > static struct kmem_cache *pid_ns_cachep; > > @@ -110,6 +111,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns > > ns->ucounts = ucounts; > > ns->pid_allocated = PIDNS_ADDING; > > > > + initialize_memfd_noexec_scope(ns); > > + > > return ns; > > > > out_free_idr: > > @@ -455,6 +458,8 @@ static __init int pid_namespaces_init(void) > > #ifdef CONFIG_CHECKPOINT_RESTORE > > register_sysctl_paths(kern_path, pid_ns_ctl_table); > > #endif > > + > > + register_pid_ns_sysctl_table_vm(); > > return 0; > > } > > > > diff --git a/kernel/pid_sysctl.h b/kernel/pid_sysctl.h > > new file mode 100644 > > index 000000000000..5986d6493b5b > > --- /dev/null > > +++ b/kernel/pid_sysctl.h > > @@ -0,0 +1,59 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +#ifndef LINUX_PID_SYSCTL_H > > +#define LINUX_PID_SYSCTL_H > > + > > +#include <linux/pid_namespace.h> > > + > > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) > > +static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns) > > +{ > > + ns->memfd_noexec_scope = > > + task_active_pid_ns(current)->memfd_noexec_scope; > > +} > > + > > +static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, > > + int write, void *buf, size_t *lenp, loff_t *ppos) > > +{ > > + struct pid_namespace *ns = task_active_pid_ns(current); > > + struct ctl_table table_copy; > > + > > + if (write && !capable(CAP_SYS_ADMIN)) > > + return -EPERM; > > Should this be CAP_SYS_ADMIN within the userns, rather than the global > init_task CAP_SYS_ADMIN? > Done. > > + > > + table_copy = *table; > > + if (ns != &init_pid_ns) > > + table_copy.data = &ns->memfd_noexec_scope; > > + > > + /* > > + * set minimum to current value, the effect is only bigger > > + * value is accepted. > > + */ > > + if (*(int *)table_copy.data > *(int *)table_copy.extra1) > > + table_copy.extra1 = table_copy.data; > > + > > + return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos); > > +} > > + > > +static struct ctl_table pid_ns_ctl_table_vm[] = { > > + { > > + .procname = "memfd_noexec", > > + .data = &init_pid_ns.memfd_noexec_scope, > > + .maxlen = sizeof(init_pid_ns.memfd_noexec_scope), > > + .mode = 0644, > > + .proc_handler = pid_mfd_noexec_dointvec_minmax, > > + .extra1 = SYSCTL_ZERO, > > + .extra2 = SYSCTL_TWO, > > + }, > > + { } > > +}; > > +static struct ctl_path vm_path[] = { { .procname = "vm", }, { } }; > > +static inline void register_pid_ns_sysctl_table_vm(void) > > +{ > > + register_sysctl_paths(vm_path, pid_ns_ctl_table_vm); > > +} > > +#else > > +static inline void set_memfd_noexec_scope(struct pid_namespace *ns) {} > > +static inline void register_pid_ns_ctl_table_vm(void) {} > > +#endif > > + > > +#endif /* LINUX_PID_SYSCTL_H */ > > diff --git a/mm/memfd.c b/mm/memfd.c > > index 4ebeab94aa74..ec70675a7069 100644 > > --- a/mm/memfd.c > > +++ b/mm/memfd.c > > @@ -18,6 +18,7 @@ > > #include <linux/hugetlb.h> > > #include <linux/shmem_fs.h> > > #include <linux/memfd.h> > > +#include <linux/pid_namespace.h> > > #include <uapi/linux/memfd.h> > > > > /* > > @@ -263,12 +264,14 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) > > #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) > > #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) > > > > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) > > +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC) > > > > SYSCALL_DEFINE2(memfd_create, > > const char __user *, uname, > > unsigned int, flags) > > { > > + char comm[TASK_COMM_LEN]; > > I'm fine with using "comm", but technically, it's not needed: task->comm > will always be %NUL terminated. > get_task_comm takes a lock. Do we need to consider the case of task->comm mutation in a multithreaded environment ? There seems to be work related with replacing task->comm with get_task_comm, such as: https://lore.kernel.org/netdev/20211108083840.4627-4-laoar.shao@gmail.com/ > > + struct pid_namespace *ns; > > unsigned int *file_seals; > > struct file *file; > > int fd, error; > > @@ -285,6 +288,39 @@ SYSCALL_DEFINE2(memfd_create, > > return -EINVAL; > > } > > > > + /* Invalid if both EXEC and NOEXEC_SEAL are set.*/ > > + if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL)) > > + return -EINVAL; > > + > > + if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) { > > +#ifdef CONFIG_SYSCTL > > + int sysctl = MEMFD_NOEXEC_SCOPE_EXEC; > > + > > + ns = task_active_pid_ns(current); > > + if (ns) > > + sysctl = ns->memfd_noexec_scope; > > + > > + switch (sysctl) { > > + case MEMFD_NOEXEC_SCOPE_EXEC: > > + flags |= MFD_EXEC; > > + break; > > + case MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL: > > + flags |= MFD_NOEXEC_SEAL; > > + break; > > + default: > > + pr_warn_ratelimited( > > + "memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d '%s'\n", > > + task_pid_nr(current), get_task_comm(comm, current)); > > + return -EINVAL; > > + } > > +#else > > + flags |= MFD_EXEC; > > +#endif > > + pr_warn_ratelimited( > > + "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n", > > + task_pid_nr(current), get_task_comm(comm, current)); > > + } > > + > > /* length includes terminating zero */ > > len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); > > if (len <= 0) > > @@ -328,7 +364,15 @@ SYSCALL_DEFINE2(memfd_create, > > file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; > > file->f_flags |= O_LARGEFILE; > > > > - if (flags & MFD_ALLOW_SEALING) { > > + if (flags & MFD_NOEXEC_SEAL) { > > + struct inode *inode = file_inode(file); > > + > > + inode->i_mode &= ~0111; > > + file_seals = memfd_file_seals_ptr(file); > > + *file_seals &= ~F_SEAL_SEAL; > > + *file_seals |= F_SEAL_EXEC; > > + } else if (flags & MFD_ALLOW_SEALING) { > > + /* MFD_EXEC and MFD_ALLOW_SEALING are set */ > > file_seals = memfd_file_seals_ptr(file); > > *file_seals &= ~F_SEAL_SEAL; > > } > > -- > > 2.39.0.rc0.267.gcb52ba06e7-goog > > > > Otherwise looks good! > > -- > Kees Cook
Hi, Jeff, On Thu, Dec 08, 2022 at 02:55:45PM -0800, Jeff Xu wrote: > > > + if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) { [...] > > > + pr_warn_ratelimited( > > > + "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n", > > > + task_pid_nr(current), get_task_comm(comm, current)); This will be frequently dumped right now with mm-unstable. Is that what it wanted to achieve? [ 10.822575] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=491 'systemd' [ 10.824743] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=495 '(sd-executor)' ... If there's already a sane default value (and also knobs for the user to change the default) not sure whether it's saner to just keep it silent as before?
On Fri, Dec 16, 2022 at 7:47 AM Peter Xu <peterx@redhat.com> wrote: > > Hi, Jeff, > > On Thu, Dec 08, 2022 at 02:55:45PM -0800, Jeff Xu wrote: > > > > + if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) { > > [...] > > > > > + pr_warn_ratelimited( > > > > + "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n", > > > > + task_pid_nr(current), get_task_comm(comm, current)); > > This will be frequently dumped right now with mm-unstable. Is that what it > wanted to achieve? > > [ 10.822575] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=491 'systemd' > [ 10.824743] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=495 '(sd-executor)' > ... > > If there's already a sane default value (and also knobs for the user to > change the default) not sure whether it's saner to just keep it silent as > before? > Thanks for your comments. The intention is it is a reminder to adjust API calls to explicitly setting this bit. The sysctl vm.memfd_noexec = 0 1 is for transaction to the final state, and 2 depends on API call setting this bit. The log is ratelimited, and there is a rate limit setting: /proc/sys/kernel/printk_ratelimit /proc/sys/kernel/printk_ratelimit_burst Best regards, Jeff > -- > Peter Xu >
On Fri, 16 Dec 2022 09:15:40 -0800 Jeff Xu <jeffxu@google.com> wrote: > On Fri, Dec 16, 2022 at 7:47 AM Peter Xu <peterx@redhat.com> wrote: > > > > Hi, Jeff, > > > > On Thu, Dec 08, 2022 at 02:55:45PM -0800, Jeff Xu wrote: > > > > > + if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) { > > > > [...] > > > > > > > + pr_warn_ratelimited( > > > > > + "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n", > > > > > + task_pid_nr(current), get_task_comm(comm, current)); > > > > This will be frequently dumped right now with mm-unstable. Is that what it > > wanted to achieve? > > > > [ 10.822575] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=491 'systemd' > > [ 10.824743] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=495 '(sd-executor)' > > ... > > > > If there's already a sane default value (and also knobs for the user to > > change the default) not sure whether it's saner to just keep it silent as > > before? > > > Thanks for your comments. > > The intention is it is a reminder to adjust API calls to explicitly > setting this bit. Do we need to warn more than once per boot? If not, use pr_warn_once()? > The sysctl vm.memfd_noexec = 0 1 is for transaction to the final > state, and 2 depends on API call setting this bit. > > The log is ratelimited, and there is a rate limit setting: > /proc/sys/kernel/printk_ratelimit > /proc/sys/kernel/printk_ratelimit_burst >
On Fri, Dec 16, 2022 at 9:43 AM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Fri, 16 Dec 2022 09:15:40 -0800 Jeff Xu <jeffxu@google.com> wrote: > > > On Fri, Dec 16, 2022 at 7:47 AM Peter Xu <peterx@redhat.com> wrote: > > > > > > Hi, Jeff, > > > > > > On Thu, Dec 08, 2022 at 02:55:45PM -0800, Jeff Xu wrote: > > > > > > + if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) { > > > > > > [...] > > > > > > > > > + pr_warn_ratelimited( > > > > > > + "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n", > > > > > > + task_pid_nr(current), get_task_comm(comm, current)); > > > > > > This will be frequently dumped right now with mm-unstable. Is that what it > > > wanted to achieve? > > > > > > [ 10.822575] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=491 'systemd' > > > [ 10.824743] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=495 '(sd-executor)' > > > ... > > > > > > If there's already a sane default value (and also knobs for the user to > > > change the default) not sure whether it's saner to just keep it silent as > > > before? > > > > > Thanks for your comments. > > > > The intention is it is a reminder to adjust API calls to explicitly > > setting this bit. > > Do we need to warn more than once per boot? If not, use pr_warn_once()? > Once per boot seems too little, it would be nice if we can list all processes. I agree ratelimited might be too much. There is a feature gap here for logging. Kees, what do you think ? > > The sysctl vm.memfd_noexec = 0 1 is for transaction to the final > > state, and 2 depends on API call setting this bit. > > > > The log is ratelimited, and there is a rate limit setting: > > /proc/sys/kernel/printk_ratelimit > > /proc/sys/kernel/printk_ratelimit_burst > > >
On Fri, Dec 16, 2022 at 10:11:44AM -0800, Jeff Xu wrote: > Once per boot seems too little, it would be nice if we can list all processes. > I agree ratelimited might be too much. > There is a feature gap here for logging. > > Kees, what do you think ? I agree once per boot is kind of frustrating "I fixed the one warning, oh, now it's coming from a different process". But ratelimit is, in retrospect, still too often. Let's go with per boot -- this should be noisy "enough" to get the changes in API into the callers without being too much of a hassle.
On Fri, Dec 16, 2022 at 12:35 PM Kees Cook <keescook@chromium.org> wrote: > > On Fri, Dec 16, 2022 at 10:11:44AM -0800, Jeff Xu wrote: > > Once per boot seems too little, it would be nice if we can list all processes. > > I agree ratelimited might be too much. > > There is a feature gap here for logging. > > > > Kees, what do you think ? > > I agree once per boot is kind of frustrating "I fixed the one warning, > oh, now it's coming from a different process". But ratelimit is, in > retrospect, still too often. > > Let's go with per boot -- this should be noisy "enough" to get the > changes in API into the callers without being too much of a hassle. > Agreed. Let's go with per boot. Hi Andrew, what is your preference ? I can send a patch or you directly fix it in mm-unstable ? Thanks -Jeff > -- > Kees Cook
On Fri, 16 Dec 2022 13:46:58 -0800 Jeff Xu <jeffxu@google.com> wrote: > On Fri, Dec 16, 2022 at 12:35 PM Kees Cook <keescook@chromium.org> wrote: > > > > On Fri, Dec 16, 2022 at 10:11:44AM -0800, Jeff Xu wrote: > > > Once per boot seems too little, it would be nice if we can list all processes. > > > I agree ratelimited might be too much. > > > There is a feature gap here for logging. > > > > > > Kees, what do you think ? > > > > I agree once per boot is kind of frustrating "I fixed the one warning, > > oh, now it's coming from a different process". But ratelimit is, in > > retrospect, still too often. > > > > Let's go with per boot -- this should be noisy "enough" to get the > > changes in API into the callers without being too much of a hassle. > > > Agreed. Let's go with per boot. > > Hi Andrew, what is your preference ? I can send a patch or you > directly fix it in mm-unstable ? Like this? --- a/mm/memfd.c~mm-memfd-add-mfd_noexec_seal-and-mfd_exec-fix-3 +++ a/mm/memfd.c @@ -308,7 +308,7 @@ SYSCALL_DEFINE2(memfd_create, flags |= MFD_NOEXEC_SEAL; break; default: - pr_warn_ratelimited( + pr_warn_once( "memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d '%s'\n", task_pid_nr(current), get_task_comm(comm, current)); return -EINVAL; @@ -316,7 +316,7 @@ SYSCALL_DEFINE2(memfd_create, #else flags |= MFD_EXEC; #endif - pr_warn_ratelimited( + pr_warn_once( "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n", task_pid_nr(current), get_task_comm(comm, current)); }
On Fri, Dec 16, 2022 at 2:06 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Fri, 16 Dec 2022 13:46:58 -0800 Jeff Xu <jeffxu@google.com> wrote: > > > On Fri, Dec 16, 2022 at 12:35 PM Kees Cook <keescook@chromium.org> wrote: > > > > > > On Fri, Dec 16, 2022 at 10:11:44AM -0800, Jeff Xu wrote: > > > > Once per boot seems too little, it would be nice if we can list all processes. > > > > I agree ratelimited might be too much. > > > > There is a feature gap here for logging. > > > > > > > > Kees, what do you think ? > > > > > > I agree once per boot is kind of frustrating "I fixed the one warning, > > > oh, now it's coming from a different process". But ratelimit is, in > > > retrospect, still too often. > > > > > > Let's go with per boot -- this should be noisy "enough" to get the > > > changes in API into the callers without being too much of a hassle. > > > > > Agreed. Let's go with per boot. > > > > Hi Andrew, what is your preference ? I can send a patch or you > > directly fix it in mm-unstable ? > > Like this? > Yes. Thanks! > --- a/mm/memfd.c~mm-memfd-add-mfd_noexec_seal-and-mfd_exec-fix-3 > +++ a/mm/memfd.c > @@ -308,7 +308,7 @@ SYSCALL_DEFINE2(memfd_create, > flags |= MFD_NOEXEC_SEAL; > break; > default: > - pr_warn_ratelimited( > + pr_warn_once( > "memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d '%s'\n", > task_pid_nr(current), get_task_comm(comm, current)); > return -EINVAL; > @@ -316,7 +316,7 @@ SYSCALL_DEFINE2(memfd_create, > #else > flags |= MFD_EXEC; > #endif > - pr_warn_ratelimited( > + pr_warn_once( > "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n", > task_pid_nr(current), get_task_comm(comm, current)); > } > _ >
On 12/16/22 16:40, Jeff Xu wrote: > On Fri, Dec 16, 2022 at 2:06 PM Andrew Morton <akpm@linux-foundation.org> wrote: >> >> On Fri, 16 Dec 2022 13:46:58 -0800 Jeff Xu <jeffxu@google.com> wrote: >> >>> On Fri, Dec 16, 2022 at 12:35 PM Kees Cook <keescook@chromium.org> wrote: >>>> >>>> On Fri, Dec 16, 2022 at 10:11:44AM -0800, Jeff Xu wrote: >>>>> Once per boot seems too little, it would be nice if we can list all processes. >>>>> I agree ratelimited might be too much. >>>>> There is a feature gap here for logging. >>>>> >>>>> Kees, what do you think ? >>>> >>>> I agree once per boot is kind of frustrating "I fixed the one warning, >>>> oh, now it's coming from a different process". But ratelimit is, in >>>> retrospect, still too often. >>>> >>>> Let's go with per boot -- this should be noisy "enough" to get the >>>> changes in API into the callers without being too much of a hassle. >>>> >>> Agreed. Let's go with per boot. >>> >>> Hi Andrew, what is your preference ? I can send a patch or you >>> directly fix it in mm-unstable ? >> >> Like this? >> > Yes. Thanks! > Sorry jumping into this discussion a bit late. Is it possible to provide a way to enable full logging as a debug option to tag more processes? thanks, -- Shuah
On Tue, Dec 20, 2022 at 8:55 AM Shuah Khan <skhan@linuxfoundation.org> wrote: > > On 12/16/22 16:40, Jeff Xu wrote: > > On Fri, Dec 16, 2022 at 2:06 PM Andrew Morton <akpm@linux-foundation.org> wrote: > >> > >> On Fri, 16 Dec 2022 13:46:58 -0800 Jeff Xu <jeffxu@google.com> wrote: > >> > >>> On Fri, Dec 16, 2022 at 12:35 PM Kees Cook <keescook@chromium.org> wrote: > >>>> > >>>> On Fri, Dec 16, 2022 at 10:11:44AM -0800, Jeff Xu wrote: > >>>>> Once per boot seems too little, it would be nice if we can list all processes. > >>>>> I agree ratelimited might be too much. > >>>>> There is a feature gap here for logging. > >>>>> > >>>>> Kees, what do you think ? > >>>> > >>>> I agree once per boot is kind of frustrating "I fixed the one warning, > >>>> oh, now it's coming from a different process". But ratelimit is, in > >>>> retrospect, still too often. > >>>> > >>>> Let's go with per boot -- this should be noisy "enough" to get the > >>>> changes in API into the callers without being too much of a hassle. > >>>> > >>> Agreed. Let's go with per boot. > >>> > >>> Hi Andrew, what is your preference ? I can send a patch or you > >>> directly fix it in mm-unstable ? > >> > >> Like this? > >> > > Yes. Thanks! > > > > Sorry jumping into this discussion a bit late. Is it possible to provide > a way to enable full logging as a debug option to tag more processes? > Codewise it is possible, maybe by adding a sysctl or CONFIG_, but I am not sure the best practice to do this with the kernel? Kees/Andrew, do you have suggestions ? Thanks Jeff > thanks, > -- Shuah >
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h index 07481bb87d4e..a4789a7b34a9 100644 --- a/include/linux/pid_namespace.h +++ b/include/linux/pid_namespace.h @@ -16,6 +16,21 @@ struct fs_pin; +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) +/* + * sysctl for vm.memfd_noexec + * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL + * acts like MFD_EXEC was set. + * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL + * acts like MFD_NOEXEC_SEAL was set. + * 2: memfd_create() without MFD_NOEXEC_SEAL will be + * rejected. + */ +#define MEMFD_NOEXEC_SCOPE_EXEC 0 +#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 +#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 +#endif + struct pid_namespace { struct idr idr; struct rcu_head rcu; @@ -31,6 +46,10 @@ struct pid_namespace { struct ucounts *ucounts; int reboot; /* group exit code if this pidns was rebooted */ struct ns_common ns; +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) + /* sysctl for vm.memfd_noexec */ + int memfd_noexec_scope; +#endif } __randomize_layout; extern struct pid_namespace init_pid_ns; diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h index 7a8a26751c23..273a4e15dfcf 100644 --- a/include/uapi/linux/memfd.h +++ b/include/uapi/linux/memfd.h @@ -8,6 +8,10 @@ #define MFD_CLOEXEC 0x0001U #define MFD_ALLOW_SEALING 0x0002U #define MFD_HUGETLB 0x0004U +/* not executable and sealed to prevent changing to executable. */ +#define MFD_NOEXEC_SEAL 0x0008U +/* executable */ +#define MFD_EXEC 0x0010U /* * Huge page size encoding when MFD_HUGETLB is specified, and a huge page diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index f4f8cb0435b4..8a98b1af9376 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -23,6 +23,7 @@ #include <linux/sched/task.h> #include <linux/sched/signal.h> #include <linux/idr.h> +#include "pid_sysctl.h" static DEFINE_MUTEX(pid_caches_mutex); static struct kmem_cache *pid_ns_cachep; @@ -110,6 +111,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns ns->ucounts = ucounts; ns->pid_allocated = PIDNS_ADDING; + initialize_memfd_noexec_scope(ns); + return ns; out_free_idr: @@ -455,6 +458,8 @@ static __init int pid_namespaces_init(void) #ifdef CONFIG_CHECKPOINT_RESTORE register_sysctl_paths(kern_path, pid_ns_ctl_table); #endif + + register_pid_ns_sysctl_table_vm(); return 0; } diff --git a/kernel/pid_sysctl.h b/kernel/pid_sysctl.h new file mode 100644 index 000000000000..5986d6493b5b --- /dev/null +++ b/kernel/pid_sysctl.h @@ -0,0 +1,59 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef LINUX_PID_SYSCTL_H +#define LINUX_PID_SYSCTL_H + +#include <linux/pid_namespace.h> + +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) +static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns) +{ + ns->memfd_noexec_scope = + task_active_pid_ns(current)->memfd_noexec_scope; +} + +static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, + int write, void *buf, size_t *lenp, loff_t *ppos) +{ + struct pid_namespace *ns = task_active_pid_ns(current); + struct ctl_table table_copy; + + if (write && !capable(CAP_SYS_ADMIN)) + return -EPERM; + + table_copy = *table; + if (ns != &init_pid_ns) + table_copy.data = &ns->memfd_noexec_scope; + + /* + * set minimum to current value, the effect is only bigger + * value is accepted. + */ + if (*(int *)table_copy.data > *(int *)table_copy.extra1) + table_copy.extra1 = table_copy.data; + + return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos); +} + +static struct ctl_table pid_ns_ctl_table_vm[] = { + { + .procname = "memfd_noexec", + .data = &init_pid_ns.memfd_noexec_scope, + .maxlen = sizeof(init_pid_ns.memfd_noexec_scope), + .mode = 0644, + .proc_handler = pid_mfd_noexec_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_TWO, + }, + { } +}; +static struct ctl_path vm_path[] = { { .procname = "vm", }, { } }; +static inline void register_pid_ns_sysctl_table_vm(void) +{ + register_sysctl_paths(vm_path, pid_ns_ctl_table_vm); +} +#else +static inline void set_memfd_noexec_scope(struct pid_namespace *ns) {} +static inline void register_pid_ns_ctl_table_vm(void) {} +#endif + +#endif /* LINUX_PID_SYSCTL_H */ diff --git a/mm/memfd.c b/mm/memfd.c index 4ebeab94aa74..ec70675a7069 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -18,6 +18,7 @@ #include <linux/hugetlb.h> #include <linux/shmem_fs.h> #include <linux/memfd.h> +#include <linux/pid_namespace.h> #include <uapi/linux/memfd.h> /* @@ -263,12 +264,14 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC) SYSCALL_DEFINE2(memfd_create, const char __user *, uname, unsigned int, flags) { + char comm[TASK_COMM_LEN]; + struct pid_namespace *ns; unsigned int *file_seals; struct file *file; int fd, error; @@ -285,6 +288,39 @@ SYSCALL_DEFINE2(memfd_create, return -EINVAL; } + /* Invalid if both EXEC and NOEXEC_SEAL are set.*/ + if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL)) + return -EINVAL; + + if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) { +#ifdef CONFIG_SYSCTL + int sysctl = MEMFD_NOEXEC_SCOPE_EXEC; + + ns = task_active_pid_ns(current); + if (ns) + sysctl = ns->memfd_noexec_scope; + + switch (sysctl) { + case MEMFD_NOEXEC_SCOPE_EXEC: + flags |= MFD_EXEC; + break; + case MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL: + flags |= MFD_NOEXEC_SEAL; + break; + default: + pr_warn_ratelimited( + "memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d '%s'\n", + task_pid_nr(current), get_task_comm(comm, current)); + return -EINVAL; + } +#else + flags |= MFD_EXEC; +#endif + pr_warn_ratelimited( + "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n", + task_pid_nr(current), get_task_comm(comm, current)); + } + /* length includes terminating zero */ len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); if (len <= 0) @@ -328,7 +364,15 @@ SYSCALL_DEFINE2(memfd_create, file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; file->f_flags |= O_LARGEFILE; - if (flags & MFD_ALLOW_SEALING) { + if (flags & MFD_NOEXEC_SEAL) { + struct inode *inode = file_inode(file); + + inode->i_mode &= ~0111; + file_seals = memfd_file_seals_ptr(file); + *file_seals &= ~F_SEAL_SEAL; + *file_seals |= F_SEAL_EXEC; + } else if (flags & MFD_ALLOW_SEALING) { + /* MFD_EXEC and MFD_ALLOW_SEALING are set */ file_seals = memfd_file_seals_ptr(file); *file_seals &= ~F_SEAL_SEAL; }