[RFC,v3,1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted
Message ID | 592ebd9e33a906ba026d56dc68f42d691706f865.1680306489.git.ackerleytng@google.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp925426vqo; Fri, 31 Mar 2023 17:19:28 -0700 (PDT) X-Google-Smtp-Source: AKy350YjnSTv7G5eSB/mSJ3RxWocyt1TU12JU1CrNxKXFbqyER2ikKiNZzWClFXrCCpvKtEfKk43 X-Received: by 2002:a17:906:74f:b0:933:3b2e:6016 with SMTP id z15-20020a170906074f00b009333b2e6016mr27361099ejb.7.1680308368158; Fri, 31 Mar 2023 17:19:28 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1680308368; cv=none; d=google.com; s=arc-20160816; b=U2b1W+0bCGOlvdo6/J+gTUC+D8rjyOGvvo7hFe/ct1i+3JlDdUhhjuXfg5AG74efGp HLtdLsR+XSRKie9X0cLR7SSJ6zFuyXQDvYwUpyIre+xHllrIWgc27UriGdfoe+t4LfSd BTn4f+vC/DFOzlkjUVvKFZQECRYiY68esLPp2Hdxh3I58WJscTA4tM96/dnoLlz6Lm6a 1+LiA4nnxV2xqyPcBeVextJZDSFKJt2qS/ayeHTkT+P+aMzGBpxAJsxTWpdW269+zmF7 G8t1Xv+P6lK8utKw58l36+oD5cLaMfIuyKYUsxpSNuP95tkez4YLApydvnX5dynWKs2T osvg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:dkim-signature; bh=67aN/B+WtOaSUEhiQW5ieKe9kdphJ9yXXZLhFTmAYCY=; b=MPBbYdaMKyggZ2hySfity87aO8L/luT2dfXU8ag7LTvwjANO0/f9RLiHOY8Jkcp5Zo PGzOoGgiB6svs/gknAQ18XW+lyW8wg7lNAeWs0UGbZhzxzTzYP7EgxOumfCGVOFzYr5B fXuI0Rrh4MUic4dts49MFc8lIZJP7ovmvsOm2q6xJ1/641FgIAqOnABIo3UIK1+uhTff qENlroBmATQwID3sSGa3q5swfdsytkv9bOzaD54j5NUBIqypw2/vuYtw6l3zdYsQ5XeK P2rh6tXV4eFt3YsND4LJ1WV1UtxcvoEJoLbzvQFkSZfw9nj7xo3a7h+hDKHt1AhRnqJD fIug== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=aqbsmcr2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id n7-20020aa7c787000000b005026897d7bdsi3383888eds.1.2023.03.31.17.19.04; Fri, 31 Mar 2023 17:19:28 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=aqbsmcr2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233302AbjCaXuz (ORCPT <rfc822;dexuan.linux@gmail.com> + 99 others); Fri, 31 Mar 2023 19:50:55 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57100 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233175AbjCaXuu (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 31 Mar 2023 19:50:50 -0400 Received: from mail-pj1-x104a.google.com (mail-pj1-x104a.google.com [IPv6:2607:f8b0:4864:20::104a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F24B71C1E8 for <linux-kernel@vger.kernel.org>; Fri, 31 Mar 2023 16:50:48 -0700 (PDT) Received: by mail-pj1-x104a.google.com with SMTP id pt5-20020a17090b3d0500b0023d3ffe542fso5201679pjb.0 for <linux-kernel@vger.kernel.org>; Fri, 31 Mar 2023 16:50:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; t=1680306648; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=67aN/B+WtOaSUEhiQW5ieKe9kdphJ9yXXZLhFTmAYCY=; b=aqbsmcr2pK1zJc/QddLocmFe7u0eN9wv3vYw/b6U0d43YucmpKQusaaQyAaVSa0brL LpSKYapJtGiMflg5mvZkz3d5j4vniDYYeKTW9Q4ay1SgV2A+IZ22Cf6Rxm/yGJKvgMN8 SMKUd9my6zvqI6T79WbKu7yBu6efMqRgxeGWTlGD4T7pNs+f8J8J0byx5Zn/rGxIBacK RO2pdC/oG+kndVLe0/EN60h+Ufhb6JX9aJrTNcBTxw5DYOj8yvfJ2FTBeGdPqfrf3kJN vw3lhr7RqWCr8GoHdH9B4JRBqn6YLWfmu0XgDDbuD1vuUAihW3PD1pSNBYYD+O6dlOsv bwXw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1680306648; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=67aN/B+WtOaSUEhiQW5ieKe9kdphJ9yXXZLhFTmAYCY=; b=RkYg7ylOzYoEhH/87escTKgzVIB/GyQoV72cPtCVsUd5nAZ9ff5iMMBb8W9/uybaG8 M5Ahq4O1OtbWPLhGVbohWrYD7RhjTXXYCANUp6Im8QWShkDs6xFXMUAPMYeRQDAf2lQx CDXG+miCnh9H80dogr71gfjwPLh8o4Yop2gDSHthlL0wbkvevu7ub4RCdn5TO+bnlYRa e6be66FzPYMPBuPCD2cya9JPIRS3crBe5nkqkiPseq8jpERjpmUk0Y5cujEu1aw7B8lb kzfXYC1zeVRKk39mQEC/N+hKw/b4SvzrE8RjQni7A+2dLaYOTKTRcRcyKq1ECjpalUdk E6Rg== X-Gm-Message-State: AAQBX9ekTVTI5vhTaSrj8OnIB7hkNRN8bxlcFpbrOWeuQNxGzV3VHLgF Ykn1UTGe+JAbqCU2/jrKfttsI92qsxdcpwKXPw== X-Received: from ackerleytng-cloudtop.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:1f5f]) (user=ackerleytng job=sendgmr) by 2002:a17:90a:ba0a:b0:23f:6eff:9430 with SMTP id s10-20020a17090aba0a00b0023f6eff9430mr9066957pjr.3.1680306648452; Fri, 31 Mar 2023 16:50:48 -0700 (PDT) Date: Fri, 31 Mar 2023 23:50:39 +0000 In-Reply-To: <cover.1680306489.git.ackerleytng@google.com> Mime-Version: 1.0 References: <cover.1680306489.git.ackerleytng@google.com> X-Mailer: git-send-email 2.40.0.348.gf938b09366-goog Message-ID: <592ebd9e33a906ba026d56dc68f42d691706f865.1680306489.git.ackerleytng@google.com> Subject: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted From: Ackerley Tng <ackerleytng@google.com> To: kvm@vger.kernel.org, linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, qemu-devel@nongnu.org Cc: aarcange@redhat.com, ak@linux.intel.com, akpm@linux-foundation.org, arnd@arndb.de, bfields@fieldses.org, bp@alien8.de, chao.p.peng@linux.intel.com, corbet@lwn.net, dave.hansen@intel.com, david@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, hpa@zytor.com, hughd@google.com, jlayton@kernel.org, jmattson@google.com, joro@8bytes.org, jun.nakajima@intel.com, kirill.shutemov@linux.intel.com, linmiaohe@huawei.com, luto@kernel.org, mail@maciej.szmigiero.name, mhocko@suse.com, michael.roth@amd.com, mingo@redhat.com, naoya.horiguchi@nec.com, pbonzini@redhat.com, qperret@google.com, rppt@kernel.org, seanjc@google.com, shuah@kernel.org, steven.price@arm.com, tabba@google.com, tglx@linutronix.de, vannapurve@google.com, vbabka@suse.cz, vkuznets@redhat.com, wanpengli@tencent.com, wei.w.wang@intel.com, x86@kernel.org, yu.c.zhang@linux.intel.com, Ackerley Tng <ackerleytng@google.com> Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-7.7 required=5.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1761931027269525465?= X-GMAIL-MSGID: =?utf-8?q?1761931027269525465?= |
Series |
Providing mount in memfd_restricted() syscall
|
|
Commit Message
Ackerley Tng
March 31, 2023, 11:50 p.m. UTC
By default, the backing shmem file for a restrictedmem fd is created
on shmem's kernel space mount.
With this patch, an optional tmpfs mount can be specified via an fd,
which will be used as the mountpoint for backing the shmem file
associated with a restrictedmem fd.
This will help restrictedmem fds inherit the properties of the
provided tmpfs mounts, for example, hugepage allocation hints, NUMA
binding hints, etc.
Permissions for the fd passed to memfd_restricted() is modeled after
the openat() syscall, since both of these allow creation of a file
upon a mount/directory.
Permission to reference the mount the fd represents is checked upon fd
creation by other syscalls (e.g. fsmount(), open(), or open_tree(),
etc) and any process that can present memfd_restricted() with a valid
fd is expected to have obtained permission to use the mount
represented by the fd. This behavior is intended to parallel that of
the openat() syscall.
memfd_restricted() will check that the tmpfs superblock is
writable, and that the mount is also writable, before attempting to
create a restrictedmem file on the mount.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
include/linux/syscalls.h | 2 +-
include/uapi/linux/restrictedmem.h | 8 ++++
mm/restrictedmem.c | 74 +++++++++++++++++++++++++++---
3 files changed, 77 insertions(+), 7 deletions(-)
create mode 100644 include/uapi/linux/restrictedmem.h
--
2.40.0.348.gf938b09366-goog
Comments
On 01.04.23 01:50, Ackerley Tng wrote: > By default, the backing shmem file for a restrictedmem fd is created > on shmem's kernel space mount. > > With this patch, an optional tmpfs mount can be specified via an fd, > which will be used as the mountpoint for backing the shmem file > associated with a restrictedmem fd. > > This will help restrictedmem fds inherit the properties of the > provided tmpfs mounts, for example, hugepage allocation hints, NUMA > binding hints, etc. > > Permissions for the fd passed to memfd_restricted() is modeled after > the openat() syscall, since both of these allow creation of a file > upon a mount/directory. > > Permission to reference the mount the fd represents is checked upon fd > creation by other syscalls (e.g. fsmount(), open(), or open_tree(), > etc) and any process that can present memfd_restricted() with a valid > fd is expected to have obtained permission to use the mount > represented by the fd. This behavior is intended to parallel that of > the openat() syscall. > > memfd_restricted() will check that the tmpfs superblock is > writable, and that the mount is also writable, before attempting to > create a restrictedmem file on the mount. > > Signed-off-by: Ackerley Tng <ackerleytng@google.com> > --- > include/linux/syscalls.h | 2 +- > include/uapi/linux/restrictedmem.h | 8 ++++ > mm/restrictedmem.c | 74 +++++++++++++++++++++++++++--- > 3 files changed, 77 insertions(+), 7 deletions(-) > create mode 100644 include/uapi/linux/restrictedmem.h > > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index f9e9e0c820c5..a23c4c385cd3 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags); > asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len, > unsigned long home_node, > unsigned long flags); > -asmlinkage long sys_memfd_restricted(unsigned int flags); > +asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd); > > /* > * Architecture-specific system calls > diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h > new file mode 100644 > index 000000000000..22d6f2285f6d > --- /dev/null > +++ b/include/uapi/linux/restrictedmem.h > @@ -0,0 +1,8 @@ > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H > +#define _UAPI_LINUX_RESTRICTEDMEM_H > + > +/* flags for memfd_restricted */ > +#define RMFD_USERMNT 0x0001U I wonder if we can come up with a more expressive prefix than RMFD. Sounds more like "rm fd" ;) Maybe it should better match the "memfd_restricted" syscall name, like "MEMFD_RSTD_USERMNT". > + > +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */ > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c > index c5d869d8c2d8..f7b62364a31a 100644 > --- a/mm/restrictedmem.c > +++ b/mm/restrictedmem.c > @@ -1,11 +1,12 @@ > // SPDX-License-Identifier: GPL-2.0 > -#include "linux/sbitmap.h" Looks like an unrelated change? > +#include <linux/namei.h> > #include <linux/pagemap.h> > #include <linux/pseudo_fs.h> > #include <linux/shmem_fs.h> > #include <linux/syscalls.h> > #include <uapi/linux/falloc.h> > #include <uapi/linux/magic.h> > +#include <uapi/linux/restrictedmem.h> > #include <linux/restrictedmem.h> > > struct restrictedmem { > @@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd) > return file; > } > > -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > +static int restrictedmem_create(struct vfsmount *mount) > { > struct file *file, *restricted_file; > int fd, err; > > - if (flags) > - return -EINVAL; > - > fd = get_unused_fd_flags(0); > if (fd < 0) > return fd; > > - file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); > + if (mount) > + file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE); > + else > + file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); > + > if (IS_ERR(file)) { > err = PTR_ERR(file); > goto err_fd; > @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > return err; > } > > +static bool is_shmem_mount(struct vfsmount *mnt) > +{ > + return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC; > +} > + > +static bool is_mount_root(struct file *file) > +{ > + return file->f_path.dentry == file->f_path.mnt->mnt_root; > +} I'd inline at least that function, pretty self-explaining. > + > +static int restrictedmem_create_on_user_mount(int mount_fd) > +{ > + int ret; > + struct fd f; > + struct vfsmount *mnt; > + > + f = fdget_raw(mount_fd); > + if (!f.file) > + return -EBADF; > + > + ret = -EINVAL; > + if (!is_mount_root(f.file)) > + goto out; > + > + mnt = f.file->f_path.mnt; > + if (!is_shmem_mount(mnt)) > + goto out; > + > + ret = file_permission(f.file, MAY_WRITE | MAY_EXEC); > + if (ret) > + goto out; > + > + ret = mnt_want_write(mnt); > + if (unlikely(ret)) > + goto out; > + > + ret = restrictedmem_create(mnt); > + > + mnt_drop_write(mnt); > +out: > + fdput(f); > + > + return ret; > +} > + > +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) > +{ > + if (flags & ~RMFD_USERMNT) > + return -EINVAL; > + > + if (flags == RMFD_USERMNT) { > + if (mount_fd < 0) > + return -EINVAL; > + > + return restrictedmem_create_on_user_mount(mount_fd); > + } else { > + return restrictedmem_create(NULL); > + } You can drop the else case: if (flags == RMFD_USERMNT) { ... return restrictedmem_create_on_user_mount(mount_fd); } return restrictedmem_create(NULL); I do wonder if you want to properly check for a flag instead of comparing values. Results in a more natural way to deal with flags: if (flags & RMFD_USERMNT) { } > +} > + > int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end, > struct restrictedmem_notifier *notifier, bool exclusive) > { The "memfd_restricted" vs. "restrictedmem" terminology is a bit unfortunate, but not your fault here. I'm not a FS person, but it does look good to me.
On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote: > By default, the backing shmem file for a restrictedmem fd is created > on shmem's kernel space mount. > > With this patch, an optional tmpfs mount can be specified via an fd, > which will be used as the mountpoint for backing the shmem file > associated with a restrictedmem fd. > > This will help restrictedmem fds inherit the properties of the > provided tmpfs mounts, for example, hugepage allocation hints, NUMA > binding hints, etc. > > Permissions for the fd passed to memfd_restricted() is modeled after > the openat() syscall, since both of these allow creation of a file > upon a mount/directory. > > Permission to reference the mount the fd represents is checked upon fd > creation by other syscalls (e.g. fsmount(), open(), or open_tree(), > etc) and any process that can present memfd_restricted() with a valid > fd is expected to have obtained permission to use the mount > represented by the fd. This behavior is intended to parallel that of > the openat() syscall. > > memfd_restricted() will check that the tmpfs superblock is > writable, and that the mount is also writable, before attempting to > create a restrictedmem file on the mount. > > Signed-off-by: Ackerley Tng <ackerleytng@google.com> > --- > include/linux/syscalls.h | 2 +- > include/uapi/linux/restrictedmem.h | 8 ++++ > mm/restrictedmem.c | 74 +++++++++++++++++++++++++++--- > 3 files changed, 77 insertions(+), 7 deletions(-) > create mode 100644 include/uapi/linux/restrictedmem.h > > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index f9e9e0c820c5..a23c4c385cd3 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags); > asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len, > unsigned long home_node, > unsigned long flags); > -asmlinkage long sys_memfd_restricted(unsigned int flags); > +asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd); > > /* > * Architecture-specific system calls > diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h > new file mode 100644 > index 000000000000..22d6f2285f6d > --- /dev/null > +++ b/include/uapi/linux/restrictedmem.h > @@ -0,0 +1,8 @@ > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H > +#define _UAPI_LINUX_RESTRICTEDMEM_H > + > +/* flags for memfd_restricted */ > +#define RMFD_USERMNT 0x0001U > + > +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */ > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c > index c5d869d8c2d8..f7b62364a31a 100644 > --- a/mm/restrictedmem.c > +++ b/mm/restrictedmem.c > @@ -1,11 +1,12 @@ > // SPDX-License-Identifier: GPL-2.0 > -#include "linux/sbitmap.h" > +#include <linux/namei.h> > #include <linux/pagemap.h> > #include <linux/pseudo_fs.h> > #include <linux/shmem_fs.h> > #include <linux/syscalls.h> > #include <uapi/linux/falloc.h> > #include <uapi/linux/magic.h> > +#include <uapi/linux/restrictedmem.h> > #include <linux/restrictedmem.h> > > struct restrictedmem { > @@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd) > return file; > } > > -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > +static int restrictedmem_create(struct vfsmount *mount) > { > struct file *file, *restricted_file; > int fd, err; > > - if (flags) > - return -EINVAL; > - > fd = get_unused_fd_flags(0); > if (fd < 0) > return fd; > > - file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); > + if (mount) > + file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE); > + else > + file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); > + > if (IS_ERR(file)) { > err = PTR_ERR(file); > goto err_fd; > @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > return err; > } > > +static bool is_shmem_mount(struct vfsmount *mnt) > +{ > + return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC; > +} > + > +static bool is_mount_root(struct file *file) > +{ > + return file->f_path.dentry == file->f_path.mnt->mnt_root; > +} > + > +static int restrictedmem_create_on_user_mount(int mount_fd) > +{ > + int ret; > + struct fd f; > + struct vfsmount *mnt; > + > + f = fdget_raw(mount_fd); > + if (!f.file) > + return -EBADF; > + > + ret = -EINVAL; > + if (!is_mount_root(f.file)) > + goto out; > + > + mnt = f.file->f_path.mnt; > + if (!is_shmem_mount(mnt)) > + goto out; > + > + ret = file_permission(f.file, MAY_WRITE | MAY_EXEC); Why MAY_EXEC? > + if (ret) > + goto out; > + > + ret = mnt_want_write(mnt); > + if (unlikely(ret)) > + goto out; > + > + ret = restrictedmem_create(mnt); > + > + mnt_drop_write(mnt); > +out: > + fdput(f); > + > + return ret; > +} We need review from fs folks. Look mostly sensible, but I have no experience in fs. > + > +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) > +{ > + if (flags & ~RMFD_USERMNT) > + return -EINVAL; > + > + if (flags == RMFD_USERMNT) { > + if (mount_fd < 0) > + return -EINVAL; > + > + return restrictedmem_create_on_user_mount(mount_fd); > + } else { > + return restrictedmem_create(NULL); > + } Maybe restructure with single restrictedmem_create() call? struct vfsmount *mnt = NULL; if (flags == RMFD_USERMNT) { ... mnt = ...(); } return restrictedmem_create(mnt); > +} > + > int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end, > struct restrictedmem_notifier *notifier, bool exclusive) > { > -- > 2.40.0.348.gf938b09366-goog
On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote: > By default, the backing shmem file for a restrictedmem fd is created > on shmem's kernel space mount. > > With this patch, an optional tmpfs mount can be specified via an fd, > which will be used as the mountpoint for backing the shmem file > associated with a restrictedmem fd. > > This will help restrictedmem fds inherit the properties of the > provided tmpfs mounts, for example, hugepage allocation hints, NUMA > binding hints, etc. > > Permissions for the fd passed to memfd_restricted() is modeled after > the openat() syscall, since both of these allow creation of a file > upon a mount/directory. > > Permission to reference the mount the fd represents is checked upon fd > creation by other syscalls (e.g. fsmount(), open(), or open_tree(), > etc) and any process that can present memfd_restricted() with a valid > fd is expected to have obtained permission to use the mount > represented by the fd. This behavior is intended to parallel that of > the openat() syscall. > > memfd_restricted() will check that the tmpfs superblock is > writable, and that the mount is also writable, before attempting to > create a restrictedmem file on the mount. > > Signed-off-by: Ackerley Tng <ackerleytng@google.com> > --- > include/linux/syscalls.h | 2 +- > include/uapi/linux/restrictedmem.h | 8 ++++ > mm/restrictedmem.c | 74 +++++++++++++++++++++++++++--- > 3 files changed, 77 insertions(+), 7 deletions(-) > create mode 100644 include/uapi/linux/restrictedmem.h > > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index f9e9e0c820c5..a23c4c385cd3 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags); > asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len, > unsigned long home_node, > unsigned long flags); > -asmlinkage long sys_memfd_restricted(unsigned int flags); > +asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd); > > /* > * Architecture-specific system calls > diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h > new file mode 100644 > index 000000000000..22d6f2285f6d > --- /dev/null > +++ b/include/uapi/linux/restrictedmem.h > @@ -0,0 +1,8 @@ > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H > +#define _UAPI_LINUX_RESTRICTEDMEM_H > + > +/* flags for memfd_restricted */ > +#define RMFD_USERMNT 0x0001U > + > +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */ > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c > index c5d869d8c2d8..f7b62364a31a 100644 > --- a/mm/restrictedmem.c > +++ b/mm/restrictedmem.c > @@ -1,11 +1,12 @@ > // SPDX-License-Identifier: GPL-2.0 > -#include "linux/sbitmap.h" > +#include <linux/namei.h> > #include <linux/pagemap.h> > #include <linux/pseudo_fs.h> > #include <linux/shmem_fs.h> > #include <linux/syscalls.h> > #include <uapi/linux/falloc.h> > #include <uapi/linux/magic.h> > +#include <uapi/linux/restrictedmem.h> > #include <linux/restrictedmem.h> > > struct restrictedmem { > @@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd) > return file; > } > > -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > +static int restrictedmem_create(struct vfsmount *mount) > { > struct file *file, *restricted_file; > int fd, err; > > - if (flags) > - return -EINVAL; > - > fd = get_unused_fd_flags(0); Any reasons the file descriptors aren't O_CLOEXEC by default? I don't see any reasons why we should introduce new fdtypes that aren't O_CLOEXEC by default. The "don't mix-and-match" train has already left the station anyway as we do have seccomp noitifer fds and pidfds both of which are O_CLOEXEC by default. > if (fd < 0) > return fd; > > - file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); > + if (mount) > + file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE); > + else > + file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); > + > if (IS_ERR(file)) { > err = PTR_ERR(file); > goto err_fd; > @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > return err; > } > > +static bool is_shmem_mount(struct vfsmount *mnt) > +{ > + return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC; This can just be if (mnt->mnt_sb->s_magic == TMPFS_MAGIC). > +} > + > +static bool is_mount_root(struct file *file) > +{ > + return file->f_path.dentry == file->f_path.mnt->mnt_root; mount -t tmpfs tmpfs /mnt touch /mnt/bla touch /mnt/ble mount --bind /mnt/bla /mnt/ble fd = open("/mnt/ble") fd_restricted = memfd_restricted(fd) IOW, this doesn't restrict it to the tmpfs root. It only restricts it to paths that refer to the root of any tmpfs mount. To exclude bind-mounts that aren't bind-mounts of the whole filesystem you want: path->dentry == path->mnt->mnt_root && path->mnt->mnt_root == path->mnt->mnt_sb->s_root > +} > + > +static int restrictedmem_create_on_user_mount(int mount_fd) > +{ > + int ret; > + struct fd f; > + struct vfsmount *mnt; > + > + f = fdget_raw(mount_fd); > + if (!f.file) > + return -EBADF; > + > + ret = -EINVAL; > + if (!is_mount_root(f.file)) > + goto out; > + > + mnt = f.file->f_path.mnt; > + if (!is_shmem_mount(mnt)) > + goto out; > + > + ret = file_permission(f.file, MAY_WRITE | MAY_EXEC); With the current semantics you're asking whether you have write permissions on the /mnt/ble file in order to get answer to the question whether you're allowed to create an unlinked restricted memory file. That doesn't make much sense afaict. > + if (ret) > + goto out; > + > + ret = mnt_want_write(mnt); > + if (unlikely(ret)) > + goto out; > + > + ret = restrictedmem_create(mnt); > + > + mnt_drop_write(mnt); > +out: > + fdput(f); > + > + return ret; > +} > + > +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) > +{ > + if (flags & ~RMFD_USERMNT) > + return -EINVAL; > + > + if (flags == RMFD_USERMNT) { Why do you even need this flag? It seems that @mount_fd being < 0 is sufficient to indicate that a new restricted memory fd is supposed to be created in the system instance. > + if (mount_fd < 0) > + return -EINVAL; > + > + return restrictedmem_create_on_user_mount(mount_fd); > + } else { > + return restrictedmem_create(NULL); > + } > +} I have to say that I'm very confused by all of this the more I look at it. Effectively memfd restricted functions as a wrapper filesystem around the tmpfs filesystem. This is basically a weird overlay filesystem. You're allocating tmpfs files that you stash in restrictedmem files. I have to say that this seems very hacky. I didn't get this at all at first. So what does the caller get if they call statx() on a restricted memfd? Do they get the device number of the tmpfs mount and the inode numbers of the tmpfs mount? Because it looks like they would: static int restrictedmem_getattr(struct user_namespace *mnt_userns, const struct path *path, struct kstat *stat, u32 request_mask, unsigned int query_flags) { struct inode *inode = d_inode(path->dentry); struct restrictedmem *rm = inode->i_mapping->private_data; struct file *memfd = rm->memfd; return memfd->f_inode->i_op->getattr(mnt_userns, path, stat, request_mask, query_flags); That @memfd would be a struct file allocated in a tmpfs instance, no? So you'd be calling the inode operation of the tmpfs file meaning that struct kstat will be filled up with the info from the tmpfs instance. But then if I call statfs() and check the fstype I would get RESTRICTEDMEM_MAGIC, no? This is... unorthodox? I'm honestly puzzled and this sounds really strange. There must be a better way to implement all of this. Shouldn't you try and make this a part of tmpfs proper? Make a really separate filesystem and add a memfs library that both tmpfs and restrictedmemfs can use? Add a mount option to tmpfs that makes it a restricted tmpfs?
On Tue, Apr 04, 2023 at 03:53:13PM +0200, Christian Brauner wrote: > On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote: > > By default, the backing shmem file for a restrictedmem fd is created > > on shmem's kernel space mount. > > > > With this patch, an optional tmpfs mount can be specified via an fd, > > which will be used as the mountpoint for backing the shmem file > > associated with a restrictedmem fd. > > > > This will help restrictedmem fds inherit the properties of the > > provided tmpfs mounts, for example, hugepage allocation hints, NUMA > > binding hints, etc. > > > > Permissions for the fd passed to memfd_restricted() is modeled after > > the openat() syscall, since both of these allow creation of a file > > upon a mount/directory. > > > > Permission to reference the mount the fd represents is checked upon fd > > creation by other syscalls (e.g. fsmount(), open(), or open_tree(), > > etc) and any process that can present memfd_restricted() with a valid > > fd is expected to have obtained permission to use the mount > > represented by the fd. This behavior is intended to parallel that of > > the openat() syscall. > > > > memfd_restricted() will check that the tmpfs superblock is > > writable, and that the mount is also writable, before attempting to > > create a restrictedmem file on the mount. > > > > Signed-off-by: Ackerley Tng <ackerleytng@google.com> > > --- > > include/linux/syscalls.h | 2 +- > > include/uapi/linux/restrictedmem.h | 8 ++++ > > mm/restrictedmem.c | 74 +++++++++++++++++++++++++++--- > > 3 files changed, 77 insertions(+), 7 deletions(-) > > create mode 100644 include/uapi/linux/restrictedmem.h > > > > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > > index f9e9e0c820c5..a23c4c385cd3 100644 > > --- a/include/linux/syscalls.h > > +++ b/include/linux/syscalls.h > > @@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags); > > asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len, > > unsigned long home_node, > > unsigned long flags); > > -asmlinkage long sys_memfd_restricted(unsigned int flags); > > +asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd); > > > > /* > > * Architecture-specific system calls > > diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h > > new file mode 100644 > > index 000000000000..22d6f2285f6d > > --- /dev/null > > +++ b/include/uapi/linux/restrictedmem.h > > @@ -0,0 +1,8 @@ > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > > +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H > > +#define _UAPI_LINUX_RESTRICTEDMEM_H > > + > > +/* flags for memfd_restricted */ > > +#define RMFD_USERMNT 0x0001U > > + > > +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */ > > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c > > index c5d869d8c2d8..f7b62364a31a 100644 > > --- a/mm/restrictedmem.c > > +++ b/mm/restrictedmem.c > > @@ -1,11 +1,12 @@ > > // SPDX-License-Identifier: GPL-2.0 > > -#include "linux/sbitmap.h" > > +#include <linux/namei.h> > > #include <linux/pagemap.h> > > #include <linux/pseudo_fs.h> > > #include <linux/shmem_fs.h> > > #include <linux/syscalls.h> > > #include <uapi/linux/falloc.h> > > #include <uapi/linux/magic.h> > > +#include <uapi/linux/restrictedmem.h> > > #include <linux/restrictedmem.h> > > > > struct restrictedmem { > > @@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd) > > return file; > > } > > > > -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > > +static int restrictedmem_create(struct vfsmount *mount) > > { > > struct file *file, *restricted_file; > > int fd, err; > > > > - if (flags) > > - return -EINVAL; > > - > > fd = get_unused_fd_flags(0); > > Any reasons the file descriptors aren't O_CLOEXEC by default? I don't > see any reasons why we should introduce new fdtypes that aren't > O_CLOEXEC by default. The "don't mix-and-match" train has already left > the station anyway as we do have seccomp noitifer fds and pidfds both of > which are O_CLOEXEC by default. > > > if (fd < 0) > > return fd; > > > > - file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); > > + if (mount) > > + file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE); > > + else > > + file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); > > + > > if (IS_ERR(file)) { > > err = PTR_ERR(file); > > goto err_fd; > > @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > > return err; > > } > > > > +static bool is_shmem_mount(struct vfsmount *mnt) > > +{ > > + return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC; > > This can just be if (mnt->mnt_sb->s_magic == TMPFS_MAGIC). > > > +} > > + > > +static bool is_mount_root(struct file *file) > > +{ > > + return file->f_path.dentry == file->f_path.mnt->mnt_root; > > mount -t tmpfs tmpfs /mnt > touch /mnt/bla > touch /mnt/ble > mount --bind /mnt/bla /mnt/ble > fd = open("/mnt/ble") > fd_restricted = memfd_restricted(fd) > > IOW, this doesn't restrict it to the tmpfs root. It only restricts it to > paths that refer to the root of any tmpfs mount. To exclude bind-mounts > that aren't bind-mounts of the whole filesystem you want: > > path->dentry == path->mnt->mnt_root && > path->mnt->mnt_root == path->mnt->mnt_sb->s_root > > > +} > > + > > +static int restrictedmem_create_on_user_mount(int mount_fd) > > +{ > > + int ret; > > + struct fd f; > > + struct vfsmount *mnt; > > + > > + f = fdget_raw(mount_fd); > > + if (!f.file) > > + return -EBADF; > > + > > + ret = -EINVAL; > > + if (!is_mount_root(f.file)) > > + goto out; > > + > > + mnt = f.file->f_path.mnt; > > + if (!is_shmem_mount(mnt)) > > + goto out; > > + > > + ret = file_permission(f.file, MAY_WRITE | MAY_EXEC); > > With the current semantics you're asking whether you have write > permissions on the /mnt/ble file in order to get answer to the question > whether you're allowed to create an unlinked restricted memory file. > That doesn't make much sense afaict. > > > + if (ret) > > + goto out; > > + > > + ret = mnt_want_write(mnt); > > + if (unlikely(ret)) > > + goto out; > > + > > + ret = restrictedmem_create(mnt); > > + > > + mnt_drop_write(mnt); > > +out: > > + fdput(f); > > + > > + return ret; > > +} > > + > > +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) > > +{ > > + if (flags & ~RMFD_USERMNT) > > + return -EINVAL; > > + > > + if (flags == RMFD_USERMNT) { > > Why do you even need this flag? It seems that @mount_fd being < 0 is > sufficient to indicate that a new restricted memory fd is supposed to be > created in the system instance. > > > + if (mount_fd < 0) > > + return -EINVAL; > > + > > + return restrictedmem_create_on_user_mount(mount_fd); > > + } else { > > + return restrictedmem_create(NULL); > > + } > > +} > > I have to say that I'm very confused by all of this the more I look at it. > > Effectively memfd restricted functions as a wrapper filesystem around > the tmpfs filesystem. This is basically a weird overlay filesystem. > You're allocating tmpfs files that you stash in restrictedmem files. > I have to say that this seems very hacky. I didn't get this at all at > first. > > So what does the caller get if they call statx() on a restricted memfd? > Do they get the device number of the tmpfs mount and the inode numbers > of the tmpfs mount? Because it looks like they would: > > static int restrictedmem_getattr(struct user_namespace *mnt_userns, > const struct path *path, struct kstat *stat, > u32 request_mask, unsigned int query_flags) > { > struct inode *inode = d_inode(path->dentry); > struct restrictedmem *rm = inode->i_mapping->private_data; > struct file *memfd = rm->memfd; > > return memfd->f_inode->i_op->getattr(mnt_userns, path, stat, This is pretty broken btw, because @path refers to a restrictedmem path which you're passing to a tmpfs iop... I see that in return memfd->f_inode->i_op->getattr(mnt_userns, &memfd->f_path, stat, request_mask, query_flags); this if fixed but still, this is... not great. > request_mask, query_flags); > > That @memfd would be a struct file allocated in a tmpfs instance, no? So > you'd be calling the inode operation of the tmpfs file meaning that > struct kstat will be filled up with the info from the tmpfs instance. > > But then if I call statfs() and check the fstype I would get > RESTRICTEDMEM_MAGIC, no? This is... unorthodox? > > I'm honestly puzzled and this sounds really strange. There must be a > better way to implement all of this. > > Shouldn't you try and make this a part of tmpfs proper? Make a really > separate filesystem and add a memfs library that both tmpfs and > restrictedmemfs can use? Add a mount option to tmpfs that makes it a > restricted tmpfs?
Thanks again for your review! Christian Brauner <brauner@kernel.org> writes: > On Tue, Apr 04, 2023 at 03:53:13PM +0200, Christian Brauner wrote: >> On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote: >> > >> > ... >> > >> > -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) >> > +static int restrictedmem_create(struct vfsmount *mount) >> > { >> > struct file *file, *restricted_file; >> > int fd, err; >> > >> > - if (flags) >> > - return -EINVAL; >> > - >> > fd = get_unused_fd_flags(0); >> Any reasons the file descriptors aren't O_CLOEXEC by default? I don't >> see any reasons why we should introduce new fdtypes that aren't >> O_CLOEXEC by default. The "don't mix-and-match" train has already left >> the station anyway as we do have seccomp noitifer fds and pidfds both of >> which are O_CLOEXEC by default. Thanks for pointing this out. I agree with using O_CLOEXEC, but didn’t notice this before. Let us discuss this under the original series at [1]. >> > if (fd < 0) >> > return fd; >> > >> > - file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); >> > + if (mount) >> > + file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, >> VM_NORESERVE); >> > + else >> > + file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); >> > + >> > if (IS_ERR(file)) { >> > err = PTR_ERR(file); >> > goto err_fd; >> > @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, >> flags) >> > return err; >> > } >> > >> > +static bool is_shmem_mount(struct vfsmount *mnt) >> > +{ >> > + return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC; >> This can just be if (mnt->mnt_sb->s_magic == TMPFS_MAGIC). Will simplify this in the next revision. >> > +} >> > + >> > +static bool is_mount_root(struct file *file) >> > +{ >> > + return file->f_path.dentry == file->f_path.mnt->mnt_root; >> mount -t tmpfs tmpfs /mnt >> touch /mnt/bla >> touch /mnt/ble >> mount --bind /mnt/bla /mnt/ble >> fd = open("/mnt/ble") >> fd_restricted = memfd_restricted(fd) >> IOW, this doesn't restrict it to the tmpfs root. It only restricts it to >> paths that refer to the root of any tmpfs mount. To exclude bind-mounts >> that aren't bind-mounts of the whole filesystem you want: >> path->dentry == path->mnt->mnt_root && >> path->mnt->mnt_root == path->mnt->mnt_sb->s_root Will adopt this in the next revision and add a selftest to check this. Thanks for pointing this out! >> > +} >> > + >> > +static int restrictedmem_create_on_user_mount(int mount_fd) >> > +{ >> > + int ret; >> > + struct fd f; >> > + struct vfsmount *mnt; >> > + >> > + f = fdget_raw(mount_fd); >> > + if (!f.file) >> > + return -EBADF; >> > + >> > + ret = -EINVAL; >> > + if (!is_mount_root(f.file)) >> > + goto out; >> > + >> > + mnt = f.file->f_path.mnt; >> > + if (!is_shmem_mount(mnt)) >> > + goto out; >> > + >> > + ret = file_permission(f.file, MAY_WRITE | MAY_EXEC); >> With the current semantics you're asking whether you have write >> permissions on the /mnt/ble file in order to get answer to the question >> whether you're allowed to create an unlinked restricted memory file. >> That doesn't make much sense afaict. That's true. Since mnt_want_write() already checks for write permissions and this syscall creates an unlinked file on the mount, we don't have to check permissions on the file then. Will remove this in the next revision! >> > + if (ret) >> > + goto out; >> > + >> > + ret = mnt_want_write(mnt); >> > + if (unlikely(ret)) >> > + goto out; >> > + >> > + ret = restrictedmem_create(mnt); >> > + >> > + mnt_drop_write(mnt); >> > +out: >> > + fdput(f); >> > + >> > + return ret; >> > +} >> > + >> > +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) >> > +{ >> > + if (flags & ~RMFD_USERMNT) >> > + return -EINVAL; >> > + >> > + if (flags == RMFD_USERMNT) { >> Why do you even need this flag? It seems that @mount_fd being < 0 is >> sufficient to indicate that a new restricted memory fd is supposed to be >> created in the system instance. I'm hoping to have this patch series merged after Chao's patch series introduces the memfd_restricted() syscall [1]. This flag is necessary to indicate the validity of the second argument. With this flag, we can definitively return an error if the fd is invalid, which I think is a better experience for the userspace programmer than if we just silently default to the kernel mount when the fd provided is invalid. >> > + if (mount_fd < 0) >> > + return -EINVAL; >> > + >> > + return restrictedmem_create_on_user_mount(mount_fd); >> > + } else { >> > + return restrictedmem_create(NULL); >> > + } >> > +} >> I have to say that I'm very confused by all of this the more I look at >> it. >> Effectively memfd restricted functions as a wrapper filesystem around >> the tmpfs filesystem. This is basically a weird overlay filesystem. >> You're allocating tmpfs files that you stash in restrictedmem files. >> I have to say that this seems very hacky. I didn't get this at all at >> first. >> So what does the caller get if they call statx() on a restricted memfd? >> Do they get the device number of the tmpfs mount and the inode numbers >> of the tmpfs mount? Because it looks like they would: >> static int restrictedmem_getattr(struct user_namespace *mnt_userns, >> const struct path *path, struct kstat *stat, >> u32 request_mask, unsigned int query_flags) >> { >> struct inode *inode = d_inode(path->dentry); >> struct restrictedmem *rm = inode->i_mapping->private_data; >> struct file *memfd = rm->memfd; >> return memfd->f_inode->i_op->getattr(mnt_userns, path, stat, > This is pretty broken btw, because @path refers to a restrictedmem path > which you're passing to a tmpfs iop... > I see that in > return memfd->f_inode->i_op->getattr(mnt_userns, &memfd->f_path, stat, > request_mask, query_flags); > this if fixed but still, this is... not great. Thanks, this will be fixed in the next revision by rebasing on Chao's latest code. >> request_mask, query_flags); >> That @memfd would be a struct file allocated in a tmpfs instance, no? So >> you'd be calling the inode operation of the tmpfs file meaning that >> struct kstat will be filled up with the info from the tmpfs instance. >> But then if I call statfs() and check the fstype I would get >> RESTRICTEDMEM_MAGIC, no? This is... unorthodox? >> I'm honestly puzzled and this sounds really strange. There must be a >> better way to implement all of this. >> Shouldn't you try and make this a part of tmpfs proper? Make a really >> separate filesystem and add a memfs library that both tmpfs and >> restrictedmemfs can use? Add a mount option to tmpfs that makes it a >> restricted tmpfs? This was discussed earlier in the patch series introducing memfd_restricted and this approach was taken to better manage ownership of required functionalities between two subsystems. Please see discussion beginning [2] [1] -> https://lore.kernel.org/lkml/20221202061347.1070246-1-chao.p.peng@linux.intel.com/T/. [2] -> https://lore.kernel.org/lkml/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com/
Thanks for your review! David Hildenbrand <david@redhat.com> writes: > On 01.04.23 01:50, Ackerley Tng wrote: >> ... >> diff --git a/include/uapi/linux/restrictedmem.h >> b/include/uapi/linux/restrictedmem.h >> new file mode 100644 >> index 000000000000..22d6f2285f6d >> --- /dev/null >> +++ b/include/uapi/linux/restrictedmem.h >> @@ -0,0 +1,8 @@ >> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ >> +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H >> +#define _UAPI_LINUX_RESTRICTEDMEM_H >> + >> +/* flags for memfd_restricted */ >> +#define RMFD_USERMNT 0x0001U > I wonder if we can come up with a more expressive prefix than RMFD. > Sounds more like "rm fd" ;) Maybe it should better match the > "memfd_restricted" syscall name, like "MEMFD_RSTD_USERMNT". RMFD did actually sound vulgar, I'm good with MEMFD_RSTD_USERMNT! >> + >> +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */ >> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c >> index c5d869d8c2d8..f7b62364a31a 100644 >> --- a/mm/restrictedmem.c >> +++ b/mm/restrictedmem.c >> @@ -1,11 +1,12 @@ >> // SPDX-License-Identifier: GPL-2.0 >> -#include "linux/sbitmap.h" > Looks like an unrelated change? Will remove this in the next revision. >> +#include <linux/namei.h> >> #include <linux/pagemap.h> >> #include <linux/pseudo_fs.h> >> #include <linux/shmem_fs.h> >> #include <linux/syscalls.h> >> #include <uapi/linux/falloc.h> >> #include <uapi/linux/magic.h> >> +#include <uapi/linux/restrictedmem.h> >> #include <linux/restrictedmem.h> >> struct restrictedmem { >> @@ -189,19 +190,20 @@ static struct file >> *restrictedmem_file_create(struct file *memfd) >> return file; >> } >> -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) >> +static int restrictedmem_create(struct vfsmount *mount) >> { >> struct file *file, *restricted_file; >> int fd, err; >> - if (flags) >> - return -EINVAL; >> - >> fd = get_unused_fd_flags(0); >> if (fd < 0) >> return fd; >> - file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); >> + if (mount) >> + file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, >> VM_NORESERVE); >> + else >> + file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); >> + >> if (IS_ERR(file)) { >> err = PTR_ERR(file); >> goto err_fd; >> @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, >> flags) >> return err; >> } >> +static bool is_shmem_mount(struct vfsmount *mnt) >> +{ >> + return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC; >> +} >> + >> +static bool is_mount_root(struct file *file) >> +{ >> + return file->f_path.dentry == file->f_path.mnt->mnt_root; >> +} > I'd inline at least that function, pretty self-explaining. Will inline this in the next revision. >> + >> +static int restrictedmem_create_on_user_mount(int mount_fd) >> +{ >> + int ret; >> + struct fd f; >> + struct vfsmount *mnt; >> + >> + f = fdget_raw(mount_fd); >> + if (!f.file) >> + return -EBADF; >> + >> + ret = -EINVAL; >> + if (!is_mount_root(f.file)) >> + goto out; >> + >> + mnt = f.file->f_path.mnt; >> + if (!is_shmem_mount(mnt)) >> + goto out; >> + >> + ret = file_permission(f.file, MAY_WRITE | MAY_EXEC); >> + if (ret) >> + goto out; >> + >> + ret = mnt_want_write(mnt); >> + if (unlikely(ret)) >> + goto out; >> + >> + ret = restrictedmem_create(mnt); >> + >> + mnt_drop_write(mnt); >> +out: >> + fdput(f); >> + >> + return ret; >> +} >> + >> +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) >> +{ >> + if (flags & ~RMFD_USERMNT) >> + return -EINVAL; >> + >> + if (flags == RMFD_USERMNT) { >> + if (mount_fd < 0) >> + return -EINVAL; >> + >> + return restrictedmem_create_on_user_mount(mount_fd); >> + } else { >> + return restrictedmem_create(NULL); >> + } > You can drop the else case: > if (flags == RMFD_USERMNT) { > ... > return restrictedmem_create_on_user_mount(mount_fd); > } > return restrictedmem_create(NULL); I'll be refactoring this to adopt Kirill's suggestion of using a single restrictedmem_create(mnt) call. > I do wonder if you want to properly check for a flag instead of > comparing values. Results in a more natural way to deal with flags: > if (flags & RMFD_USERMNT) { > } Will use this in the next revision. >> +} >> + >> int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end, >> struct restrictedmem_notifier *notifier, bool exclusive) >> { > The "memfd_restricted" vs. "restrictedmem" terminology is a bit > unfortunate, but not your fault here. > I'm not a FS person, but it does look good to me.
Thanks for reviewing these patches! "Kirill A. Shutemov" <kirill@shutemov.name> writes: > On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote: >> ... >> +static int restrictedmem_create_on_user_mount(int mount_fd) >> +{ >> + int ret; >> + struct fd f; >> + struct vfsmount *mnt; >> + >> + f = fdget_raw(mount_fd); >> + if (!f.file) >> + return -EBADF; >> + >> + ret = -EINVAL; >> + if (!is_mount_root(f.file)) >> + goto out; >> + >> + mnt = f.file->f_path.mnt; >> + if (!is_shmem_mount(mnt)) >> + goto out; >> + >> + ret = file_permission(f.file, MAY_WRITE | MAY_EXEC); > Why MAY_EXEC? Christian pointed out that this check does not make sense, I'll be removing the entire check in the next revision. >> + if (ret) >> + goto out; >> + >> + ret = mnt_want_write(mnt); >> + if (unlikely(ret)) >> + goto out; >> + >> + ret = restrictedmem_create(mnt); >> + >> + mnt_drop_write(mnt); >> +out: >> + fdput(f); >> + >> + return ret; >> +} > We need review from fs folks. Look mostly sensible, but I have no > experience in fs. >> + >> +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) >> +{ >> + if (flags & ~RMFD_USERMNT) >> + return -EINVAL; >> + >> + if (flags == RMFD_USERMNT) { >> + if (mount_fd < 0) >> + return -EINVAL; >> + >> + return restrictedmem_create_on_user_mount(mount_fd); >> + } else { >> + return restrictedmem_create(NULL); >> + } > Maybe restructure with single restrictedmem_create() call? > struct vfsmount *mnt = NULL; > if (flags == RMFD_USERMNT) { > ... > mnt = ...(); > } > return restrictedmem_create(mnt); Will do so in the next revision. >> +} >> + >> int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end, >> struct restrictedmem_notifier *notifier, bool exclusive) >> { >> -- >> 2.40.0.348.gf938b09366-goog
On Wed, Apr 05, 2023 at 09:58:44PM +0000, Ackerley Tng wrote: > > Thanks again for your review! > > Christian Brauner <brauner@kernel.org> writes: > > On Tue, Apr 04, 2023 at 03:53:13PM +0200, Christian Brauner wrote: > > > On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote: > > > > > > > > ... > > > > > > > > -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > > > > +static int restrictedmem_create(struct vfsmount *mount) > > > > { > > > > struct file *file, *restricted_file; > > > > int fd, err; > > > > > > > > - if (flags) > > > > - return -EINVAL; > > > > - > > > > fd = get_unused_fd_flags(0); > > > > Any reasons the file descriptors aren't O_CLOEXEC by default? I don't > > > see any reasons why we should introduce new fdtypes that aren't > > > O_CLOEXEC by default. The "don't mix-and-match" train has already left > > > the station anyway as we do have seccomp noitifer fds and pidfds both of > > > which are O_CLOEXEC by default. > > > Thanks for pointing this out. I agree with using O_CLOEXEC, but didn’t > notice this before. Let us discuss this under the original series at > [1]. > > > > > if (fd < 0) > > > > return fd; > > > > > > > > - file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); > > > > + if (mount) > > > > + file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", > > > 0, VM_NORESERVE); > > > > + else > > > > + file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); > > > > + > > > > if (IS_ERR(file)) { > > > > err = PTR_ERR(file); > > > > goto err_fd; > > > > @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned > > > int, flags) > > > > return err; > > > > } > > > > > > > > +static bool is_shmem_mount(struct vfsmount *mnt) > > > > +{ > > > > + return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC; > > > > This can just be if (mnt->mnt_sb->s_magic == TMPFS_MAGIC). > > > Will simplify this in the next revision. > > > > > +} > > > > + > > > > +static bool is_mount_root(struct file *file) > > > > +{ > > > > + return file->f_path.dentry == file->f_path.mnt->mnt_root; > > > > mount -t tmpfs tmpfs /mnt > > > touch /mnt/bla > > > touch /mnt/ble > > > mount --bind /mnt/bla /mnt/ble > > > fd = open("/mnt/ble") > > > fd_restricted = memfd_restricted(fd) > > > > IOW, this doesn't restrict it to the tmpfs root. It only restricts it to > > > paths that refer to the root of any tmpfs mount. To exclude bind-mounts > > > that aren't bind-mounts of the whole filesystem you want: > > > > path->dentry == path->mnt->mnt_root && > > > path->mnt->mnt_root == path->mnt->mnt_sb->s_root > > > Will adopt this in the next revision and add a selftest to check > this. Thanks for pointing this out! > > > > > +} > > > > + > > > > +static int restrictedmem_create_on_user_mount(int mount_fd) > > > > +{ > > > > + int ret; > > > > + struct fd f; > > > > + struct vfsmount *mnt; > > > > + > > > > + f = fdget_raw(mount_fd); > > > > + if (!f.file) > > > > + return -EBADF; > > > > + > > > > + ret = -EINVAL; > > > > + if (!is_mount_root(f.file)) > > > > + goto out; > > > > + > > > > + mnt = f.file->f_path.mnt; > > > > + if (!is_shmem_mount(mnt)) > > > > + goto out; > > > > + > > > > + ret = file_permission(f.file, MAY_WRITE | MAY_EXEC); > > > > With the current semantics you're asking whether you have write > > > permissions on the /mnt/ble file in order to get answer to the question > > > whether you're allowed to create an unlinked restricted memory file. > > > That doesn't make much sense afaict. > > > That's true. Since mnt_want_write() already checks for write permissions > and this syscall creates an unlinked file on the mount, we don't have to > check permissions on the file then. Will remove this in the next > revision! > > > > > + if (ret) > > > > + goto out; > > > > + > > > > + ret = mnt_want_write(mnt); > > > > + if (unlikely(ret)) > > > > + goto out; > > > > + > > > > + ret = restrictedmem_create(mnt); > > > > + > > > > + mnt_drop_write(mnt); > > > > +out: > > > > + fdput(f); > > > > + > > > > + return ret; > > > > +} > > > > + > > > > +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) > > > > +{ > > > > + if (flags & ~RMFD_USERMNT) > > > > + return -EINVAL; > > > > + > > > > + if (flags == RMFD_USERMNT) { > > > > Why do you even need this flag? It seems that @mount_fd being < 0 is > > > sufficient to indicate that a new restricted memory fd is supposed to be > > > created in the system instance. > > > I'm hoping to have this patch series merged after Chao's patch series > introduces the memfd_restricted() syscall [1]. I'm curious, is there an LSFMM session for this?
Christian Brauner <brauner@kernel.org> writes: > On Wed, Apr 05, 2023 at 09:58:44PM +0000, Ackerley Tng wrote: >> ... >> > > Why do you even need this flag? It seems that @mount_fd being < 0 is >> > > sufficient to indicate that a new restricted memory fd is supposed >> to be >> > > created in the system instance. >> I'm hoping to have this patch series merged after Chao's patch series >> introduces the memfd_restricted() syscall [1]. > I'm curious, is there an LSFMM session for this? As far as I know, there is no LSFMM session for this.
On Thu, Apr 13, 2023, Ackerley Tng wrote: > Christian Brauner <brauner@kernel.org> writes: > > I'm curious, is there an LSFMM session for this? > > As far as I know, there is no LSFMM session for this. Correct, no LSFMM session. In hindsight, that's obviously something we should have pursued :-(
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index f9e9e0c820c5..a23c4c385cd3 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags); asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len, unsigned long home_node, unsigned long flags); -asmlinkage long sys_memfd_restricted(unsigned int flags); +asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd); /* * Architecture-specific system calls diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h new file mode 100644 index 000000000000..22d6f2285f6d --- /dev/null +++ b/include/uapi/linux/restrictedmem.h @@ -0,0 +1,8 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H +#define _UAPI_LINUX_RESTRICTEDMEM_H + +/* flags for memfd_restricted */ +#define RMFD_USERMNT 0x0001U + +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */ diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c index c5d869d8c2d8..f7b62364a31a 100644 --- a/mm/restrictedmem.c +++ b/mm/restrictedmem.c @@ -1,11 +1,12 @@ // SPDX-License-Identifier: GPL-2.0 -#include "linux/sbitmap.h" +#include <linux/namei.h> #include <linux/pagemap.h> #include <linux/pseudo_fs.h> #include <linux/shmem_fs.h> #include <linux/syscalls.h> #include <uapi/linux/falloc.h> #include <uapi/linux/magic.h> +#include <uapi/linux/restrictedmem.h> #include <linux/restrictedmem.h> struct restrictedmem { @@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd) return file; } -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) +static int restrictedmem_create(struct vfsmount *mount) { struct file *file, *restricted_file; int fd, err; - if (flags) - return -EINVAL; - fd = get_unused_fd_flags(0); if (fd < 0) return fd; - file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); + if (mount) + file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE); + else + file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); + if (IS_ERR(file)) { err = PTR_ERR(file); goto err_fd; @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) return err; } +static bool is_shmem_mount(struct vfsmount *mnt) +{ + return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC; +} + +static bool is_mount_root(struct file *file) +{ + return file->f_path.dentry == file->f_path.mnt->mnt_root; +} + +static int restrictedmem_create_on_user_mount(int mount_fd) +{ + int ret; + struct fd f; + struct vfsmount *mnt; + + f = fdget_raw(mount_fd); + if (!f.file) + return -EBADF; + + ret = -EINVAL; + if (!is_mount_root(f.file)) + goto out; + + mnt = f.file->f_path.mnt; + if (!is_shmem_mount(mnt)) + goto out; + + ret = file_permission(f.file, MAY_WRITE | MAY_EXEC); + if (ret) + goto out; + + ret = mnt_want_write(mnt); + if (unlikely(ret)) + goto out; + + ret = restrictedmem_create(mnt); + + mnt_drop_write(mnt); +out: + fdput(f); + + return ret; +} + +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) +{ + if (flags & ~RMFD_USERMNT) + return -EINVAL; + + if (flags == RMFD_USERMNT) { + if (mount_fd < 0) + return -EINVAL; + + return restrictedmem_create_on_user_mount(mount_fd); + } else { + return restrictedmem_create(NULL); + } +} + int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end, struct restrictedmem_notifier *notifier, bool exclusive) {