[v9,1/8] mm: Introduce memfd_restricted system call to create restricted user memory

Message ID 20221025151344.3784230-2-chao.p.peng@linux.intel.com
State New
Series KVM: mm: fd-based approach for supporting KVM

Commit Message

Chao Peng Oct. 25, 2022, 3:13 p.m. UTC
  From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Introduce the 'memfd_restricted' system call, which creates memory
areas that are restricted from userspace access through ordinary MMU
operations (e.g. read/write/mmap). The memory content is expected to be
used through a new in-kernel interface by a third kernel module.

memfd_restricted() is useful for scenarios where a file descriptor (fd)
can be used as an interface into mm, but where userspace's ability to
operate on the fd needs to be restricted. Initially it is designed to
provide protection for KVM encrypted guest memory.

Normally KVM uses memfd memory by mmap()-ing the memfd into KVM
userspace (e.g. QEMU) and then using the mmapped virtual address to set
up the mapping in the KVM secondary page table (e.g. EPT). With
confidential computing technologies like Intel TDX, the memfd memory may
be encrypted with a special key for a specific software domain (e.g. a
KVM guest) and is not expected to be directly accessed by userspace.
In fact, userspace access to such encrypted memory may lead to a host
crash, so it should be prevented.

memfd_restricted() provides the semantics required for KVM guest
encrypted memory support: an fd created with memfd_restricted() is going
to be used as the source of guest memory in a confidential computing
environment, and KVM can directly interact with core-mm without the need
to expose the memory content into KVM userspace.

KVM userspace is still in charge of the lifecycle of the fd. It should
pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
obtain the physical memory page and then uses it to populate the KVM
secondary page table entries.
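
For illustration only, a consumer-side sketch of resolving a PFN through
the new interface might look like the code below (hypothetical code, not
part of this patch; the real KVM hooks are added later in this series):

  #include <linux/mm.h>
  #include <linux/restrictedmem.h>

  /*
   * Hypothetical consumer-side sketch: resolve the host PFN backing a
   * given page offset of a restricted memfd that userspace handed in.
   */
  static int consumer_get_restricted_pfn(struct file *restricted_file,
                                         pgoff_t index, unsigned long *pfn,
                                         int *order)
  {
          struct page *page;
          int ret;

          ret = restrictedmem_get_page(restricted_file, index, &page, order);
          if (ret)
                  return ret;

          /* The page is returned up-to-date, unlocked and with a reference
           * held; the caller is responsible for dropping it when done. */
          *pfn = page_to_pfn(page);
          return 0;
  }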

The userspace restricted memfd can be fallocate()-ed or hole-punched
from userspace. When these operations happen, KVM gets notified
through restrictedmem_notifier and then gets a chance to remove any
mapped entries of the range from the secondary page tables.
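
A sketch of the notifier side is shown below (again hypothetical; the
demo_* names are made up for illustration):

  #include <linux/restrictedmem.h>

  /* Hypothetical consumer callbacks: zap any secondary page table entries
   * covering [start, end) before the backing pages go away, and allow
   * faulting them in again afterwards. */
  static void demo_invalidate_start(struct restrictedmem_notifier *notifier,
                                    pgoff_t start, pgoff_t end)
  {
          /* e.g. zap the GFN range mapped by this memfd range */
  }

  static void demo_invalidate_end(struct restrictedmem_notifier *notifier,
                                  pgoff_t start, pgoff_t end)
  {
          /* e.g. allow faults on the range again */
  }

  static const struct restrictedmem_notifier_ops demo_notifier_ops = {
          .invalidate_start = demo_invalidate_start,
          .invalidate_end   = demo_invalidate_end,
  };

  static struct restrictedmem_notifier demo_notifier = {
          .ops = &demo_notifier_ops,
  };

  /* Registered once the restricted fd is received from userspace:
   *     restrictedmem_register_notifier(restricted_file, &demo_notifier);
   */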

memfd_restricted() itself is implemented as a shim layer on top of real
memory file systems (currently tmpfs). Pages in restrictedmem are marked
as unmovable and unevictable; this is required for the current
confidential usage, but it might change in the future.

By default memfd_restricted() prevents userspace read, write and mmap.
By defining new bits in 'flags', it can be extended to support other
restricted semantics in the future.

The system call is currently wired up for x86 arch.
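
For reference, a minimal userspace usage sketch might look as follows
(assuming the syscall number 451 wired up by this patch; how the fd is
then handed to KVM is defined by later patches in this series):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  #ifndef __NR_memfd_restricted
  #define __NR_memfd_restricted 451       /* number wired up by this patch */
  #endif

  int main(void)
  {
          /* flags must currently be 0 */
          int fd = syscall(__NR_memfd_restricted, 0);
          if (fd < 0) {
                  perror("memfd_restricted");
                  return 1;
          }

          /* Size the backing memory; the size must be page-aligned and can
           * only be set while the file is still empty.  read/write/mmap of
           * the fd are intentionally not supported. */
          if (ftruncate(fd, 2UL << 20) < 0) {
                  perror("ftruncate");
                  return 1;
          }

          /* Pass 'fd' to KVM via the extended memslot API added later in
           * this series; KVM then resolves pages on guest faults through
           * restrictedmem_get_page(). */
          close(fd);
          return 0;
  }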

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/restrictedmem.h          |  62 ++++++
 include/linux/syscalls.h               |   1 +
 include/uapi/asm-generic/unistd.h      |   5 +-
 include/uapi/linux/magic.h             |   1 +
 kernel/sys_ni.c                        |   3 +
 mm/Kconfig                             |   4 +
 mm/Makefile                            |   1 +
 mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
 10 files changed, 328 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/restrictedmem.h
 create mode 100644 mm/restrictedmem.c
  

Comments

Isaku Yamahata Oct. 26, 2022, 5:31 p.m. UTC | #1
On Tue, Oct 25, 2022 at 11:13:37PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +			   struct page **pagep, int *order)
> +{
> +	struct restrictedmem_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	struct page *page;
> +	int ret;
> +
> +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);

shmem_getpage() was removed.
https://lkml.kernel.org/r/20220902194653.1739778-34-willy@infradead.org

I needed the following fix to compile.

thanks,

diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index e5bf8907e0f8..4694dd5609d6 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -231,13 +231,15 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset,
 {
        struct restrictedmem_data *data = file->f_mapping->private_data;
        struct file *memfd = data->memfd;
+       struct folio *folio = NULL;
        struct page *page;
        int ret;
 
-       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
+       ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE);
        if (ret)
                return ret;
 
+       page = folio_file_page(folio, offset);
        *pagep = page;
        if (order)
                *order = thp_order(compound_head(page));
  
Fuad Tabba Oct. 27, 2022, 10:20 a.m. UTC | #2
Hi,


On Tue, Oct 25, 2022 at 4:18 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Introduce 'memfd_restricted' system call with the ability to create
> memory areas that are restricted from userspace access through ordinary
> MMU operations (e.g. read/write/mmap). The memory content is expected to
> be used through a new in-kernel interface by a third kernel module.
>
> memfd_restricted() is useful for scenarios where a file descriptor(fd)
> can be used as an interface into mm but want to restrict userspace's
> ability on the fd. Initially it is designed to provide protections for
> KVM encrypted guest memory.
>
> Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> (e.g. QEMU) and then using the mmaped virtual address to setup the
> mapping in the KVM secondary page table (e.g. EPT). With confidential
> computing technologies like Intel TDX, the memfd memory may be encrypted
> with special key for special software domain (e.g. KVM guest) and is not
> expected to be directly accessed by userspace. Precisely, userspace
> access to such encrypted memory may lead to host crash so should be
> prevented.
>
> memfd_restricted() provides semantics required for KVM guest encrypted
> memory support that a fd created with memfd_restricted() is going to be
> used as the source of guest memory in confidential computing environment
> and KVM can directly interact with core-mm without the need to expose
> the memoy content into KVM userspace.
>
> KVM userspace is still in charge of the lifecycle of the fd. It should
> pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> obtain the physical memory page and then uses it to populate the KVM
> secondary page table entries.
>
> The userspace restricted memfd can be fallocate-ed or hole-punched
> from userspace. When these operations happen, KVM can get notified
> through restrictedmem_notifier, it then gets chance to remove any
> mapped entries of the range in the secondary page tables.
>
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
>
> By default memfd_restricted() prevents userspace read, write and mmap.
> By defining new bit in the 'flags', it can be extended to support other
> restricted semantics in the future.
>
> The system call is currently wired up for x86 arch.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---

Reviewed-by: Fuad Tabba <tabba@google.com>

And I'm working on porting to arm64 and testing V9.

Cheers,
/fuad


>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/restrictedmem.h          |  62 ++++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
>  10 files changed, 328 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..dc70ba90247e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
>  448    i386    process_mrelease        sys_process_mrelease
>  449    i386    futex_waitv             sys_futex_waitv
>  450    i386    set_mempolicy_home_node         sys_set_mempolicy_home_node
> +451    i386    memfd_restricted        sys_memfd_restricted
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..06516abc8318 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448    common  process_mrelease        sys_process_mrelease
>  449    common  futex_waitv             sys_futex_waitv
>  450    common  set_mempolicy_home_node sys_set_mempolicy_home_node
> +451    common  memfd_restricted        sys_memfd_restricted
>
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..9c37c3ea3180
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H
> +
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/pfn_t.h>
> +
> +struct restrictedmem_notifier;
> +
> +struct restrictedmem_notifier_ops {
> +       void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> +                                pgoff_t start, pgoff_t end);
> +       void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> +                              pgoff_t start, pgoff_t end);
> +};
> +
> +struct restrictedmem_notifier {
> +       struct list_head list;
> +       const struct restrictedmem_notifier_ops *ops;
> +};
> +
> +#ifdef CONFIG_RESTRICTEDMEM
> +
> +void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier);
> +void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                          struct page **pagep, int *order);
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +       return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> +}
> +
> +#else
> +
> +static inline void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                                        struct page **pagep, int *order)
> +{
> +       return -1;
> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +       return false;
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a34b0f9a9972..f9e9e0c820c5 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>                                             unsigned long home_node,
>                                             unsigned long flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags);
>
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 45fa180cc56a..e93cd35e46d0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>  #define __NR_set_mempolicy_home_node 450
>  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>
> +#define __NR_memfd_restricted 451
> +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 451
> +#define __NR_syscalls 452
>
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..8aa38324b90a 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC          0x444d4142      /* "DMAB" */
>  #define DEVMEM_MAGIC           0x454d444d      /* "DMEM" */
>  #define SECRETMEM_MAGIC                0x5345434d      /* "SECM" */
> +#define RESTRICTEDMEM_MAGIC    0x5245534d      /* "RESM" */
>
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 860b2dcf3ac4..7c4a32cbd2e7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
>  /* memfd_secret */
>  COND_SYSCALL(memfd_secret);
>
> +/* memfd_restricted */
> +COND_SYSCALL(memfd_restricted);
> +
>  /*
>   * Architecture specific weak syscall entries.
>   */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0331f1461f81..0177d53676c7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1076,6 +1076,10 @@ config IO_MAPPING
>  config SECRETMEM
>         def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>
> +config RESTRICTEDMEM
> +       bool
> +       depends on TMPFS
> +
>  config ANON_VMA_NAME
>         bool "Anonymous VMA name support"
>         depends on PROC_FS && ADVISE_SYSCALLS && MMU
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f836403..6cb6403ffd40 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -117,6 +117,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_SECRETMEM) += secretmem.o
> +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
>  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..e5bf8907e0f8
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,250 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {
> +       struct mutex lock;
> +       struct file *memfd;
> +       struct list_head notifiers;
> +};
> +
> +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> +                                pgoff_t start, pgoff_t end, bool notify_start)
> +{
> +       struct restrictedmem_notifier *notifier;
> +
> +       mutex_lock(&data->lock);
> +       list_for_each_entry(notifier, &data->notifiers, list) {
> +               if (notify_start)
> +                       notifier->ops->invalidate_start(notifier, start, end);
> +               else
> +                       notifier->ops->invalidate_end(notifier, start, end);
> +       }
> +       mutex_unlock(&data->lock);
> +}
> +
> +static int restrictedmem_release(struct inode *inode, struct file *file)
> +{
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +
> +       fput(data->memfd);
> +       kfree(data);
> +       return 0;
> +}
> +
> +static long restrictedmem_fallocate(struct file *file, int mode,
> +                                   loff_t offset, loff_t len)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +                       return -EINVAL;
> +       }
> +
> +       restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +       restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> +       return ret;
> +}
> +
> +static const struct file_operations restrictedmem_fops = {
> +       .release = restrictedmem_release,
> +       .fallocate = restrictedmem_fallocate,
> +};
> +
> +static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> +                                const struct path *path, struct kstat *stat,
> +                                u32 request_mask, unsigned int query_flags)
> +{
> +       struct inode *inode = d_inode(path->dentry);
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +
> +       return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> +                                            request_mask, query_flags);
> +}
> +
> +static int restrictedmem_setattr(struct user_namespace *mnt_userns,
> +                                struct dentry *dentry, struct iattr *attr)
> +{
> +       struct inode *inode = d_inode(dentry);
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (attr->ia_valid & ATTR_SIZE) {
> +               if (memfd->f_inode->i_size)
> +                       return -EPERM;
> +
> +               if (!PAGE_ALIGNED(attr->ia_size))
> +                       return -EINVAL;
> +       }
> +
> +       ret = memfd->f_inode->i_op->setattr(mnt_userns,
> +                                           file_dentry(memfd), attr);
> +       return ret;
> +}
> +
> +static const struct inode_operations restrictedmem_iops = {
> +       .getattr = restrictedmem_getattr,
> +       .setattr = restrictedmem_setattr,
> +};
> +
> +static int restrictedmem_init_fs_context(struct fs_context *fc)
> +{
> +       if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
> +               return -ENOMEM;
> +
> +       fc->s_iflags |= SB_I_NOEXEC;
> +       return 0;
> +}
> +
> +static struct file_system_type restrictedmem_fs = {
> +       .owner          = THIS_MODULE,
> +       .name           = "memfd:restrictedmem",
> +       .init_fs_context = restrictedmem_init_fs_context,
> +       .kill_sb        = kill_anon_super,
> +};
> +
> +static struct vfsmount *restrictedmem_mnt;
> +
> +static __init int restrictedmem_init(void)
> +{
> +       restrictedmem_mnt = kern_mount(&restrictedmem_fs);
> +       if (IS_ERR(restrictedmem_mnt))
> +               return PTR_ERR(restrictedmem_mnt);
> +       return 0;
> +}
> +fs_initcall(restrictedmem_init);
> +
> +static struct file *restrictedmem_file_create(struct file *memfd)
> +{
> +       struct restrictedmem_data *data;
> +       struct address_space *mapping;
> +       struct inode *inode;
> +       struct file *file;
> +
> +       data = kzalloc(sizeof(*data), GFP_KERNEL);
> +       if (!data)
> +               return ERR_PTR(-ENOMEM);
> +
> +       data->memfd = memfd;
> +       mutex_init(&data->lock);
> +       INIT_LIST_HEAD(&data->notifiers);
> +
> +       inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> +       if (IS_ERR(inode)) {
> +               kfree(data);
> +               return ERR_CAST(inode);
> +       }
> +
> +       inode->i_mode |= S_IFREG;
> +       inode->i_op = &restrictedmem_iops;
> +       inode->i_mapping->private_data = data;
> +
> +       file = alloc_file_pseudo(inode, restrictedmem_mnt,
> +                                "restrictedmem", O_RDWR,
> +                                &restrictedmem_fops);
> +       if (IS_ERR(file)) {
> +               iput(inode);
> +               kfree(data);
> +               return ERR_CAST(file);
> +       }
> +
> +       file->f_flags |= O_LARGEFILE;
> +
> +       mapping = memfd->f_mapping;
> +       mapping_set_unevictable(mapping);
> +       mapping_set_gfp_mask(mapping,
> +                            mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> +       return file;
> +}
> +
> +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> +{
> +       struct file *file, *restricted_file;
> +       int fd, err;
> +
> +       if (flags)
> +               return -EINVAL;
> +
> +       fd = get_unused_fd_flags(0);
> +       if (fd < 0)
> +               return fd;
> +
> +       file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +       if (IS_ERR(file)) {
> +               err = PTR_ERR(file);
> +               goto err_fd;
> +       }
> +       file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> +       file->f_flags |= O_LARGEFILE;
> +
> +       restricted_file = restrictedmem_file_create(file);
> +       if (IS_ERR(restricted_file)) {
> +               err = PTR_ERR(restricted_file);
> +               fput(file);
> +               goto err_fd;
> +       }
> +
> +       fd_install(fd, restricted_file);
> +       return fd;
> +err_fd:
> +       put_unused_fd(fd);
> +       return err;
> +}
> +
> +void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_add(&notifier->list, &data->notifiers);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
> +
> +void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_del(&notifier->list);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                          struct page **pagep, int *order)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       struct page *page;
> +       int ret;
> +
> +       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +       if (ret)
> +               return ret;
> +
> +       *pagep = page;
> +       if (order)
> +               *order = thp_order(compound_head(page));
> +
> +       SetPageUptodate(page);
> +       unlock_page(page);
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> --
> 2.25.1
>
  
Chao Peng Oct. 28, 2022, 6:12 a.m. UTC | #3
On Wed, Oct 26, 2022 at 10:31:45AM -0700, Isaku Yamahata wrote:
> On Tue, Oct 25, 2022 at 11:13:37PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +			   struct page **pagep, int *order)
> > +{
> > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > +	struct file *memfd = data->memfd;
> > +	struct page *page;
> > +	int ret;
> > +
> > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> 
> shmem_getpage() was removed.
> https://lkml.kernel.org/r/20220902194653.1739778-34-willy@infradead.org

Thanks for pointing that out. My current base (kvm/queue) does not include
this change yet, so it still uses shmem_getpage().

Chao
> 
> I needed the following fix to compile.
> 
> thanks,
> 
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index e5bf8907e0f8..4694dd5609d6 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -231,13 +231,15 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset,
>  {
>         struct restrictedmem_data *data = file->f_mapping->private_data;
>         struct file *memfd = data->memfd;
> +       struct folio *folio = NULL;
>         struct page *page;
>         int ret;
>  
> -       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +       ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE);
>         if (ret)
>                 return ret;
>  
> +       page = folio_file_page(folio, offset);
>         *pagep = page;
>         if (order)
>                 *order = thp_order(compound_head(page));
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>
  
Michael Roth Oct. 31, 2022, 5:47 p.m. UTC | #4
On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Introduce 'memfd_restricted' system call with the ability to create
> memory areas that are restricted from userspace access through ordinary
> MMU operations (e.g. read/write/mmap). The memory content is expected to
> be used through a new in-kernel interface by a third kernel module.
> 
> memfd_restricted() is useful for scenarios where a file descriptor(fd)
> can be used as an interface into mm but want to restrict userspace's
> ability on the fd. Initially it is designed to provide protections for
> KVM encrypted guest memory.
> 
> Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> (e.g. QEMU) and then using the mmaped virtual address to setup the
> mapping in the KVM secondary page table (e.g. EPT). With confidential
> computing technologies like Intel TDX, the memfd memory may be encrypted
> with special key for special software domain (e.g. KVM guest) and is not
> expected to be directly accessed by userspace. Precisely, userspace
> access to such encrypted memory may lead to host crash so should be
> prevented.
> 
> memfd_restricted() provides semantics required for KVM guest encrypted
> memory support that a fd created with memfd_restricted() is going to be
> used as the source of guest memory in confidential computing environment
> and KVM can directly interact with core-mm without the need to expose
> the memoy content into KVM userspace.
> 
> KVM userspace is still in charge of the lifecycle of the fd. It should
> pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> obtain the physical memory page and then uses it to populate the KVM
> secondary page table entries.
> 
> The userspace restricted memfd can be fallocate-ed or hole-punched
> from userspace. When these operations happen, KVM can get notified
> through restrictedmem_notifier, it then gets chance to remove any
> mapped entries of the range in the secondary page tables.
> 
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
> 
> By default memfd_restricted() prevents userspace read, write and mmap.
> By defining new bit in the 'flags', it can be extended to support other
> restricted semantics in the future.
> 
> The system call is currently wired up for x86 arch.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/restrictedmem.h          |  62 ++++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
>  10 files changed, 328 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..dc70ba90247e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
>  448	i386	process_mrelease	sys_process_mrelease
>  449	i386	futex_waitv		sys_futex_waitv
>  450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	i386	memfd_restricted	sys_memfd_restricted
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..06516abc8318 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448	common	process_mrelease	sys_process_mrelease
>  449	common	futex_waitv		sys_futex_waitv
>  450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
> +451	common	memfd_restricted	sys_memfd_restricted
>  
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..9c37c3ea3180
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H
> +
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/pfn_t.h>
> +
> +struct restrictedmem_notifier;
> +
> +struct restrictedmem_notifier_ops {
> +	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> +				 pgoff_t start, pgoff_t end);
> +	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> +			       pgoff_t start, pgoff_t end);
> +};
> +
> +struct restrictedmem_notifier {
> +	struct list_head list;
> +	const struct restrictedmem_notifier_ops *ops;
> +};
> +
> +#ifdef CONFIG_RESTRICTEDMEM
> +
> +void restrictedmem_register_notifier(struct file *file,
> +				     struct restrictedmem_notifier *notifier);
> +void restrictedmem_unregister_notifier(struct file *file,
> +				       struct restrictedmem_notifier *notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +			   struct page **pagep, int *order);
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> +}
> +
> +#else
> +
> +static inline void restrictedmem_register_notifier(struct file *file,
> +				     struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline void restrictedmem_unregister_notifier(struct file *file,
> +				       struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +					 struct page **pagep, int *order)
> +{
> +	return -1;
> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a34b0f9a9972..f9e9e0c820c5 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>  					    unsigned long home_node,
>  					    unsigned long flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 45fa180cc56a..e93cd35e46d0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>  #define __NR_set_mempolicy_home_node 450
>  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>  
> +#define __NR_memfd_restricted 451
> +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 451
> +#define __NR_syscalls 452
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..8aa38324b90a 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
>  #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
>  #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> +#define RESTRICTEDMEM_MAGIC	0x5245534d	/* "RESM" */
>  
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 860b2dcf3ac4..7c4a32cbd2e7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
>  /* memfd_secret */
>  COND_SYSCALL(memfd_secret);
>  
> +/* memfd_restricted */
> +COND_SYSCALL(memfd_restricted);
> +
>  /*
>   * Architecture specific weak syscall entries.
>   */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0331f1461f81..0177d53676c7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1076,6 +1076,10 @@ config IO_MAPPING
>  config SECRETMEM
>  	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>  
> +config RESTRICTEDMEM
> +	bool
> +	depends on TMPFS
> +
>  config ANON_VMA_NAME
>  	bool "Anonymous VMA name support"
>  	depends on PROC_FS && ADVISE_SYSCALLS && MMU
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f836403..6cb6403ffd40 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -117,6 +117,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_SECRETMEM) += secretmem.o
> +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
>  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..e5bf8907e0f8
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,250 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {
> +	struct mutex lock;
> +	struct file *memfd;
> +	struct list_head notifiers;
> +};
> +
> +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> +				 pgoff_t start, pgoff_t end, bool notify_start)
> +{
> +	struct restrictedmem_notifier *notifier;
> +
> +	mutex_lock(&data->lock);
> +	list_for_each_entry(notifier, &data->notifiers, list) {
> +		if (notify_start)
> +			notifier->ops->invalidate_start(notifier, start, end);
> +		else
> +			notifier->ops->invalidate_end(notifier, start, end);
> +	}
> +	mutex_unlock(&data->lock);
> +}
> +
> +static int restrictedmem_release(struct inode *inode, struct file *file)
> +{
> +	struct restrictedmem_data *data = inode->i_mapping->private_data;
> +
> +	fput(data->memfd);
> +	kfree(data);
> +	return 0;
> +}
> +
> +static long restrictedmem_fallocate(struct file *file, int mode,
> +				    loff_t offset, loff_t len)
> +{
> +	struct restrictedmem_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	int ret;
> +
> +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +			return -EINVAL;
> +	}
> +
> +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> +	return ret;
> +}

In v8 there was some discussion about potentially passing the page/folio
and order as part of the invalidation callback. I ended up needing
something similar for SEV-SNP, and think it might make sense for other
platforms. The main reasoning is:

  1) restoring kernel directmap:

     Currently SNP (and I believe TDX) need to either split or remove kernel
     direct mappings for restricted PFNs, since there is no guarantee that
     other PFNs within a 2MB range won't be used for non-restricted memory
     (which will cause an RMP #PF in the case of SNP since the 2MB
     mapping overlaps with guest-owned pages)

     Previously we were able to restore 2MB mappings to some degree
     since both shared/restricted pages were all pinned, so anything
     backed by a THP (or hugetlb page once that is implemented) at guest
     teardown could be restored as 2MB direct mapping.

     Invalidation seems like the most logical time to have this happen,
     but whether or not to restore as 2MB requires the order to be 2MB
     or larger, and for GPA range being invalidated to cover the entire
     2MB (otherwise it means the page was potentially split and some
     subpages free back to host already, in which case it can't be
     restored as 2MB).

  2) Potentially fewer invalidations:

     If we pass the entire folio or compound_page as part of the
     invalidation, we only need to issue 1 invalidation per folio.

  3) Potentially useful for hugetlbfs support:

     One issue with hugetlbfs is that we don't support splitting the
     hugepage in such cases, which was a big obstacle prior to UPM. Now
     however, we may have the option of doing "lazy" invalidations where
     fallocate(PUNCH_HOLE, ...) won't free a shmem-allocated page unless
     all the subpages within the 2M range are either hole-punched, or the
     guest is shut down, so in that way we never have to split it. Sean
     was pondering something similar in another thread:

       https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/

     Issuing invalidations with folio-granularity ties in fairly well
     with this sort of approach if we end up going that route.

I need to rework things for v9, and we'll probably want to use struct
folio instead of struct page now, but as a proof-of-concept of sorts this
is what I'd added on top of v8 of your patchset to implement 1) and 2):

  https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
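
Roughly, the shape of the interface change could look something like the
following (hypothetical, simplified sketch; the commit above is the actual
proof-of-concept):

  /*
   * Hypothetical extension: also hand the backing folio to the callee,
   * so it can tell whether the whole range is covered by one large folio
   * (e.g. to restore a 2MB directmap entry) and so only one callback is
   * needed per folio.
   */
  struct restrictedmem_notifier_ops {
          void (*invalidate_start)(struct restrictedmem_notifier *notifier,
                                   pgoff_t start, pgoff_t end);
          void (*invalidate_end)(struct restrictedmem_notifier *notifier,
                                 pgoff_t start, pgoff_t end);
          /* new: called once per folio in the punched/truncated range */
          void (*invalidate_folio)(struct restrictedmem_notifier *notifier,
                                   struct folio *folio,
                                   pgoff_t start, pgoff_t end);
  };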

Does an approach like this seem reasonable? Should we work this into the
base restricted memslot support?

Thanks,

Mike
  
Chao Peng Nov. 1, 2022, 11:37 a.m. UTC | #5
On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Introduce 'memfd_restricted' system call with the ability to create
> > memory areas that are restricted from userspace access through ordinary
> > MMU operations (e.g. read/write/mmap). The memory content is expected to
> > be used through a new in-kernel interface by a third kernel module.
> > 
> > memfd_restricted() is useful for scenarios where a file descriptor(fd)
> > can be used as an interface into mm but want to restrict userspace's
> > ability on the fd. Initially it is designed to provide protections for
> > KVM encrypted guest memory.
> > 
> > Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> > (e.g. QEMU) and then using the mmaped virtual address to setup the
> > mapping in the KVM secondary page table (e.g. EPT). With confidential
> > computing technologies like Intel TDX, the memfd memory may be encrypted
> > with special key for special software domain (e.g. KVM guest) and is not
> > expected to be directly accessed by userspace. Precisely, userspace
> > access to such encrypted memory may lead to host crash so should be
> > prevented.
> > 
> > memfd_restricted() provides semantics required for KVM guest encrypted
> > memory support that a fd created with memfd_restricted() is going to be
> > used as the source of guest memory in confidential computing environment
> > and KVM can directly interact with core-mm without the need to expose
> > the memoy content into KVM userspace.
> > 
> > KVM userspace is still in charge of the lifecycle of the fd. It should
> > pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> > obtain the physical memory page and then uses it to populate the KVM
> > secondary page table entries.
> > 
> > The userspace restricted memfd can be fallocate-ed or hole-punched
> > from userspace. When these operations happen, KVM can get notified
> > through restrictedmem_notifier, it then gets chance to remove any
> > mapped entries of the range in the secondary page tables.
> > 
> > memfd_restricted() itself is implemented as a shim layer on top of real
> > memory file systems (currently tmpfs). Pages in restrictedmem are marked
> > as unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> > 
> > By default memfd_restricted() prevents userspace read, write and mmap.
> > By defining new bit in the 'flags', it can be extended to support other
> > restricted semantics in the future.
> > 
> > The system call is currently wired up for x86 arch.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
> >  include/linux/restrictedmem.h          |  62 ++++++
> >  include/linux/syscalls.h               |   1 +
> >  include/uapi/asm-generic/unistd.h      |   5 +-
> >  include/uapi/linux/magic.h             |   1 +
> >  kernel/sys_ni.c                        |   3 +
> >  mm/Kconfig                             |   4 +
> >  mm/Makefile                            |   1 +
> >  mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
> >  10 files changed, 328 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/restrictedmem.h
> >  create mode 100644 mm/restrictedmem.c
> > 
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 320480a8db4f..dc70ba90247e 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -455,3 +455,4 @@
> >  448	i386	process_mrelease	sys_process_mrelease
> >  449	i386	futex_waitv		sys_futex_waitv
> >  450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
> > +451	i386	memfd_restricted	sys_memfd_restricted
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index c84d12608cd2..06516abc8318 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -372,6 +372,7 @@
> >  448	common	process_mrelease	sys_process_mrelease
> >  449	common	futex_waitv		sys_futex_waitv
> >  450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
> > +451	common	memfd_restricted	sys_memfd_restricted
> >  
> >  #
> >  # Due to a historical design error, certain syscalls are numbered differently
> > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> > new file mode 100644
> > index 000000000000..9c37c3ea3180
> > --- /dev/null
> > +++ b/include/linux/restrictedmem.h
> > @@ -0,0 +1,62 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _LINUX_RESTRICTEDMEM_H
> > +
> > +#include <linux/file.h>
> > +#include <linux/magic.h>
> > +#include <linux/pfn_t.h>
> > +
> > +struct restrictedmem_notifier;
> > +
> > +struct restrictedmem_notifier_ops {
> > +	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> > +				 pgoff_t start, pgoff_t end);
> > +	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> > +			       pgoff_t start, pgoff_t end);
> > +};
> > +
> > +struct restrictedmem_notifier {
> > +	struct list_head list;
> > +	const struct restrictedmem_notifier_ops *ops;
> > +};
> > +
> > +#ifdef CONFIG_RESTRICTEDMEM
> > +
> > +void restrictedmem_register_notifier(struct file *file,
> > +				     struct restrictedmem_notifier *notifier);
> > +void restrictedmem_unregister_notifier(struct file *file,
> > +				       struct restrictedmem_notifier *notifier);
> > +
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +			   struct page **pagep, int *order);
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> > +}
> > +
> > +#else
> > +
> > +static inline void restrictedmem_register_notifier(struct file *file,
> > +				     struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline void restrictedmem_unregister_notifier(struct file *file,
> > +				       struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +					 struct page **pagep, int *order)
> > +{
> > +	return -1;
> > +}
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +	return false;
> > +}
> > +
> > +#endif /* CONFIG_RESTRICTEDMEM */
> > +
> > +#endif /* _LINUX_RESTRICTEDMEM_H */
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index a34b0f9a9972..f9e9e0c820c5 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
> >  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
> >  					    unsigned long home_node,
> >  					    unsigned long flags);
> > +asmlinkage long sys_memfd_restricted(unsigned int flags);
> >  
> >  /*
> >   * Architecture-specific system calls
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index 45fa180cc56a..e93cd35e46d0 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
> >  #define __NR_set_mempolicy_home_node 450
> >  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
> >  
> > +#define __NR_memfd_restricted 451
> > +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> > +
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 451
> > +#define __NR_syscalls 452
> >  
> >  /*
> >   * 32 bit systems traditionally used different
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index 6325d1d0e90f..8aa38324b90a 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -101,5 +101,6 @@
> >  #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
> >  #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
> >  #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> > +#define RESTRICTEDMEM_MAGIC	0x5245534d	/* "RESM" */
> >  
> >  #endif /* __LINUX_MAGIC_H__ */
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 860b2dcf3ac4..7c4a32cbd2e7 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
> >  /* memfd_secret */
> >  COND_SYSCALL(memfd_secret);
> >  
> > +/* memfd_restricted */
> > +COND_SYSCALL(memfd_restricted);
> > +
> >  /*
> >   * Architecture specific weak syscall entries.
> >   */
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 0331f1461f81..0177d53676c7 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1076,6 +1076,10 @@ config IO_MAPPING
> >  config SECRETMEM
> >  	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
> >  
> > +config RESTRICTEDMEM
> > +	bool
> > +	depends on TMPFS
> > +
> >  config ANON_VMA_NAME
> >  	bool "Anonymous VMA name support"
> >  	depends on PROC_FS && ADVISE_SYSCALLS && MMU
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 9a564f836403..6cb6403ffd40 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -117,6 +117,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
> >  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
> >  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
> >  obj-$(CONFIG_SECRETMEM) += secretmem.o
> > +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
> >  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
> >  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> >  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > new file mode 100644
> > index 000000000000..e5bf8907e0f8
> > --- /dev/null
> > +++ b/mm/restrictedmem.c
> > @@ -0,0 +1,250 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/shmem_fs.h>
> > +#include <linux/syscalls.h>
> > +#include <uapi/linux/falloc.h>
> > +#include <uapi/linux/magic.h>
> > +#include <linux/restrictedmem.h>
> > +
> > +struct restrictedmem_data {
> > +	struct mutex lock;
> > +	struct file *memfd;
> > +	struct list_head notifiers;
> > +};
> > +
> > +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> > +				 pgoff_t start, pgoff_t end, bool notify_start)
> > +{
> > +	struct restrictedmem_notifier *notifier;
> > +
> > +	mutex_lock(&data->lock);
> > +	list_for_each_entry(notifier, &data->notifiers, list) {
> > +		if (notify_start)
> > +			notifier->ops->invalidate_start(notifier, start, end);
> > +		else
> > +			notifier->ops->invalidate_end(notifier, start, end);
> > +	}
> > +	mutex_unlock(&data->lock);
> > +}
> > +
> > +static int restrictedmem_release(struct inode *inode, struct file *file)
> > +{
> > +	struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +
> > +	fput(data->memfd);
> > +	kfree(data);
> > +	return 0;
> > +}
> > +
> > +static long restrictedmem_fallocate(struct file *file, int mode,
> > +				    loff_t offset, loff_t len)
> > +{
> > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > +	struct file *memfd = data->memfd;
> > +	int ret;
> > +
> > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +			return -EINVAL;
> > +	}
> > +
> > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > +	return ret;
> > +}
> 
> In v8 there was some discussion about potentially passing the page/folio
> and order as part of the invalidation callback, I ended up needing
> something similar for SEV-SNP, and think it might make sense for other
> platforms. This main reasoning is:

In that context what we talked about was inaccessible_get_pfn(); I was
not aware there was a need for an invalidation callback as well.

> 
>   1) restoring kernel directmap:
> 
>      Currently SNP (and I believe TDX) need to either split or remove kernel
>      direct mappings for restricted PFNs, since there is no guarantee that
>      other PFNs within a 2MB range won't be used for non-restricted
>      (which will cause an RMP #PF in the case of SNP since the 2MB
>      mapping overlaps with guest-owned pages)

Has the splitting and restoring been a well-discussed direction? I'm
just curious whether there are other options to solve this issue.

> 
>      Previously we were able to restore 2MB mappings to some degree
>      since both shared/restricted pages were all pinned, so anything
>      backed by a THP (or hugetlb page once that is implemented) at guest
>      teardown could be restored as 2MB direct mapping.
> 
>      Invalidation seems like the most logical time to have this happen,

Currently invalidation only happens on user-initiated fallocate(). It
does not cover the VM teardown case, where the restore might also be
expected to be handled.

>      but whether or not to restore as 2MB requires the order to be 2MB
>      or larger, and for GPA range being invalidated to cover the entire
>      2MB (otherwise it means the page was potentially split and some
>      subpages free back to host already, in which case it can't be
>      restored as 2MB).
> 
>   2) Potentially less invalidations:
>       
>      If we pass the entire folio or compound_page as part of
>      invalidation, we only needed to issue 1 invalidation per folio.

I'm not sure I agree; the current invalidation covers the whole range
that is passed from userspace, and the invalidation is invoked only once
for each userspace fallocate().

> 
>   3) Potentially useful for hugetlbfs support:
> 
>      One issue with hugetlbfs is that we don't support splitting the
>      hugepage in such cases, which was a big obstacle prior to UPM. Now
>      however, we may have the option of doing "lazy" invalidations where
>      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
>      all the subpages within the 2M range are either hole-punched, or the
>      guest is shut down, so in that way we never have to split it. Sean
>      was pondering something similar in another thread:
> 
>        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
> 
>      Issuing invalidations with folio-granularity ties in fairly well
>      with this sort of approach if we end up going that route.

There is a semantic difference between the current one and the proposed
one: the invalidation range is exactly what userspace passed down to the
kernel (being fallocated), while the proposed one will be a subset of
that (if the userspace-provided addr/size is not aligned to a power of
two). I'm not quite confident this difference has no side effects.

> 
> I need to rework things for v9, and we'll probably want to use struct
> folio instead of struct page now, but as a proof-of-concept of sorts this
> is what I'd added on top of v8 of your patchset to implement 1) and 2):
> 
>   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
> 
> Does an approach like this seem reasonable? Should be work this into the
> base restricted memslot support?

If the above-mentioned semantic difference is not a problem, I don't
have a strong objection to this.

Sean, since you have a much better understanding of this, what is your
take on it?

Chao
> 
> Thanks,
> 
> Mike
  
Michael Roth Nov. 1, 2022, 3:19 p.m. UTC | #6
On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > 
> > > +struct restrictedmem_data {
> > > +	struct mutex lock;
> > > +	struct file *memfd;
> > > +	struct list_head notifiers;
> > > +};
> > > +
> > > +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> > > +				 pgoff_t start, pgoff_t end, bool notify_start)
> > > +{
> > > +	struct restrictedmem_notifier *notifier;
> > > +
> > > +	mutex_lock(&data->lock);
> > > +	list_for_each_entry(notifier, &data->notifiers, list) {
> > > +		if (notify_start)
> > > +			notifier->ops->invalidate_start(notifier, start, end);
> > > +		else
> > > +			notifier->ops->invalidate_end(notifier, start, end);
> > > +	}
> > > +	mutex_unlock(&data->lock);
> > > +}
> > > +
> > > +static int restrictedmem_release(struct inode *inode, struct file *file)
> > > +{
> > > +	struct restrictedmem_data *data = inode->i_mapping->private_data;
> > > +
> > > +	fput(data->memfd);
> > > +	kfree(data);
> > > +	return 0;
> > > +}
> > > +
> > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > +				    loff_t offset, loff_t len)
> > > +{
> > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > +	struct file *memfd = data->memfd;
> > > +	int ret;
> > > +
> > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > +			return -EINVAL;
> > > +	}
> > > +
> > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > > +	return ret;
> > > +}
> > 
> > In v8 there was some discussion about potentially passing the page/folio
> > and order as part of the invalidation callback, I ended up needing
> > something similar for SEV-SNP, and think it might make sense for other
> > platforms. This main reasoning is:
> 
> In that context what we talked on is the inaccessible_get_pfn(), I was
> not aware there is need for invalidation callback as well.

Right, your understanding is correct. I think Sean had only mentioned in
passing that it was something we could potentially do, and in the cases I
was looking at it ended up being useful. I only mentioned it so I don't
seem like I'm too far out in the weeds here :)

> 
> > 
> >   1) restoring kernel directmap:
> > 
> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> >      direct mappings for restricted PFNs, since there is no guarantee that
> >      other PFNs within a 2MB range won't be used for non-restricted
> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> >      mapping overlaps with guest-owned pages)
> 
> Has the splitting and restoring been a well-discussed direction? I'm
> just curious whether there is other options to solve this issue.

For SNP it's been discussed for quite some time, and either splitting the
directmap or removing private entries from it are the well-discussed ways
I'm aware of to avoid RMP violations caused by some other kernel process
using a 2MB mapping to access shared memory when there are private pages
within that range.

In both cases, how to restore the directmap as 2M afterwards becomes a
problem.
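
For reference, here's a minimal sketch of the kind of directmap
manipulation being discussed, using the existing set_memory helpers.
Where and when these calls actually happen is exactly the open question,
so treat it as purely illustrative (the wrapper names are made up):

#include <linux/set_memory.h>

/*
 * Called when a PFN becomes guest-private: drop its directmap entry so
 * no other kernel user can touch the 2MB region it lives in via a
 * large host mapping (made-up helper name).
 */
static int rmem_direct_map_remove(struct page *page)
{
	return set_direct_map_invalid_noflush(page);
}

/*
 * Called when the page is handed back to the host: re-establish a (4K)
 * directmap entry. Restoring the original 2MB mapping is the hard part
 * discussed above (made-up helper name).
 */
static int rmem_direct_map_restore(struct page *page)
{
	return set_direct_map_default_noflush(page);
}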

I was also under the impression TDX had similar requirements. If so,
do you know what the plan is for handling this for TDX?

There are also 2 potential alternatives I'm aware of, but these haven't
been discussed in much detail AFAIK:

a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
   request 2MB THP pages, but I'm not sure how reliably we can guarantee
   that enough THPs are available, so if we went that route we'd probably
   be better off requiring the use of hugetlbfs as the backing store. But
   obviously that's a bit limiting and it would be nice to have the option
   of using normal pages as well. One nice thing with the invalidation
   scheme proposed here is that this would "Just Work" if we implement
   hugetlbfs support, so an admin who doesn't want any directmap
   splitting has this option available; otherwise it's done as a
   best effort.

b) Implement general support for restoring directmap as 2M even when
   subpages might be in use by other kernel threads. This would be the
   most flexible approach since it requires no special handling during
   invalidations, but I think it's only possible if all the CPA
   attributes for the 2M range are the same at the time the mapping is
   restored/unsplit, so there are some potential locking issues there and
   still a chance of splitting the directmap over time.

> 
> > 
> >      Previously we were able to restore 2MB mappings to some degree
> >      since both shared/restricted pages were all pinned, so anything
> >      backed by a THP (or hugetlb page once that is implemented) at guest
> >      teardown could be restored as 2MB direct mapping.
> > 
> >      Invalidation seems like the most logical time to have this happen,
> 
> Currently invalidation only happens at user-initiated fallocate(). It
> does not cover the VM teardown case where the restoring might also be
> expected to be handled.

Right, I forgot to add that in my proposed changes I added invalidations
for any still-allocated private pages present when the restricted memfd
notifier is unregistered. This was needed to avoid leaking pages back to
the kernel that still need directmap or RMP table fixups. I also added
similar invalidations for memfd->release(), since it seems possible that
userspace might close() it before shutting down the guest, but maybe the
latter is not needed if KVM takes a reference on the FD for the life of
the guest.
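
As a rough sketch (not the actual patch), the release-time invalidation
I'm describing amounts to something like the below, reusing
restrictedmem_release() from this series; 0..ULONG_MAX is just a
stand-in for "the whole file":

static int restrictedmem_release(struct inode *inode, struct file *file)
{
	struct restrictedmem_data *data = inode->i_mapping->private_data;

	/*
	 * Give consumers (e.g. KVM) a final chance to do RMP/directmap
	 * fixups for any still-allocated private pages before the
	 * backing memfd goes away.
	 */
	restrictedmem_notifier_invalidate(data, 0, ULONG_MAX, true);
	restrictedmem_notifier_invalidate(data, 0, ULONG_MAX, false);

	fput(data->memfd);
	kfree(data);
	return 0;
}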

> 
> >      but whether or not to restore as 2MB requires the order to be 2MB
> >      or larger, and for GPA range being invalidated to cover the entire
> >      2MB (otherwise it means the page was potentially split and some
> >      subpages free back to host already, in which case it can't be
> >      restored as 2MB).
> > 
> >   2) Potentially less invalidations:
> >       
> >      If we pass the entire folio or compound_page as part of
> >      invalidation, we only needed to issue 1 invalidation per folio.
> 
> I'm not sure I agree, the current invalidation covers the whole range
> that passed from userspace and the invalidation is invoked only once for
> each usrspace fallocate().

That's true, it only reduces invalidations if we decide to provide a
struct page/folio as part of the invalidation callbacks, which isn't
the case yet. Sorry for the confusion.

> 
> > 
> >   3) Potentially useful for hugetlbfs support:
> > 
> >      One issue with hugetlbfs is that we don't support splitting the
> >      hugepage in such cases, which was a big obstacle prior to UPM. Now
> >      however, we may have the option of doing "lazy" invalidations where
> >      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
> >      all the subpages within the 2M range are either hole-punched, or the
> >      guest is shut down, so in that way we never have to split it. Sean
> >      was pondering something similar in another thread:
> > 
> >        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
> > 
> >      Issuing invalidations with folio-granularity ties in fairly well
> >      with this sort of approach if we end up going that route.
> 
> There is semantics difference between the current one and the proposed
> one: The invalidation range is exactly what userspace passed down to the
> kernel (being fallocated) while the proposed one will be subset of that
> (if userspace-provided addr/size is not aligned to power of two), I'm
> not quite confident this difference has no side effect.

In theory userspace should not be allocating/hole-punching restricted
pages for GPA ranges that are already mapped as private in the xarray,
and KVM could potentially fail such requests (though it doesn't currently).

But if we somehow enforced that, then we could rely on
KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
which would free up the restricted fd invalidation callbacks to be used
purely to handle doing things like RMP/directmap fixups prior to returning
restricted pages back to the host. So that was sort of my thinking why the
new semantics would still cover all the necessary cases.

-Mike

> 
> > 
> > I need to rework things for v9, and we'll probably want to use struct
> > folio instead of struct page now, but as a proof-of-concept of sorts this
> > is what I'd added on top of v8 of your patchset to implement 1) and 2):
> > 
> >   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
> > 
> > Does an approach like this seem reasonable? Should be work this into the
> > base restricted memslot support?
> 
> If the above mentioned semantics difference is not a problem, I don't
> have strong objection on this.
> 
> Sean, since you have much better understanding on this, what is your
> take on this?
> 
> Chao
> > 
> > Thanks,
> > 
> > Mike
  
Michael Roth Nov. 1, 2022, 7:30 p.m. UTC | #7
On Tue, Nov 01, 2022 at 10:19:44AM -0500, Michael Roth wrote:
> On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> > On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > 
> > > 
> > >   3) Potentially useful for hugetlbfs support:
> > > 
> > >      One issue with hugetlbfs is that we don't support splitting the
> > >      hugepage in such cases, which was a big obstacle prior to UPM. Now
> > >      however, we may have the option of doing "lazy" invalidations where
> > >      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
> > >      all the subpages within the 2M range are either hole-punched, or the
> > >      guest is shut down, so in that way we never have to split it. Sean
> > >      was pondering something similar in another thread:
> > > 
> > >        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
> > > 
> > >      Issuing invalidations with folio-granularity ties in fairly well
> > >      with this sort of approach if we end up going that route.
> > 
> > There is semantics difference between the current one and the proposed
> > one: The invalidation range is exactly what userspace passed down to the
> > kernel (being fallocated) while the proposed one will be subset of that
> > (if userspace-provided addr/size is not aligned to power of two), I'm
> > not quite confident this difference has no side effect.
> 
> In theory userspace should not be allocating/hole-punching restricted
> pages for GPA ranges that are already mapped as private in the xarray,
> and KVM could potentially fail such requests (though it doesn't currently).
> 
> But if we somehow enforced that, then we could rely on
> KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
> which would free up the restricted fd invalidation callbacks to be used
> purely to handle doing things like RMP/directmap fixups prior to returning
> restricted pages back to the host. So that was sort of my thinking why the
> new semantics would still cover all the necessary cases.

Sorry, that explanation assumes we rely on userspace to fallocate() on 2MB
boundaries, and ignore any non-aligned requests in the kernel. But
that's not how I actually ended up implementing things, so I'm not sure
why I answered that way...

In my implementation we actually do issue invalidations for fallocate()
even for non-2M-aligned GPA/offset ranges. For instance (assuming
restricted FD offset 0 corresponds to GPA 0), an fallocate() on GPA
range 0x1000-0x402000 would result in the following invalidations being
issued if everything was backed by a 2MB page:

  invalidate GPA: 0x001000-0x200000, Page: pfn_to_page(I), order:9
  invalidate GPA: 0x200000-0x400000, Page: pfn_to_page(J), order:9
  invalidate GPA: 0x400000-0x402000, Page: pfn_to_page(K), order:9

So you still cover the same range, but the arch/platform callbacks can
then, as a best effort, do things like restore the 2M directmap if they
see that the backing page is 2MB+ and the GPA range covers the entire
range. If the GPA range doesn't cover the whole range, or the backing
page is order:0, then we are still forced to leave the directmap split.
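
To make that concrete, here's a rough sketch of how an invalidate_start()
callback could use such a page/order hint. The signature and names are
illustrative only (not what's in the patchset), and it assumes the
backing page is naturally aligned within the file:

/*
 * Illustrative only: hypothetical invalidate_start() callback that
 * receives a page/order hint along with the invalidated range.
 */
static void snp_invalidate_start(struct restrictedmem_notifier *notifier,
				 pgoff_t start, pgoff_t end,
				 struct page *page, unsigned int order)
{
	unsigned long npages = 1UL << order;
	pgoff_t folio_start = round_down(start, npages);

	/* ... tear down RMP/NPT state for [start, end) as before ... */

	/*
	 * Best effort: only restore a 2MB directmap entry if the backing
	 * page is 2MB+ and the invalidated range covers all of it;
	 * otherwise some subpages may still be private and the split
	 * mapping has to stay.
	 */
	if (order >= 9 /* 2MB on x86 */ &&
	    start == folio_start && end >= folio_start + npages) {
		/* restore large direct mapping for 'page' here */
	}
}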

But with that in place we can then improve on that by allowing for the
use of hugetlbfs.

We'd still be somewhat reliant on userspace to issue fallocate()'s on
2M-aligned boundaries to some degree (guest teardown invalidations
could be issued as 2M-aligned, which would be the bulk of the pages
in most cases, but for discarding pages after private->shared
conversion we could still get fragmentation). This could maybe be
addressed by keeping track of those partial/non-2M-aligned fallocate()
requests and then issuing them as a batched 2M invalidation once all
the subpages have been fallocate(PUNCH_HOLE)'d. We'd need to enforce
that fallocate(PUNCH_HOLE) is preceded by
KVM_MEMORY_ENCRYPT_UNREG_REGION to make sure MMU invalidations happen
though.

Not sure on these potential follow-ups, but they all at least seem
compatible with the proposed invalidation scheme.

-Mike

> 
> -Mike
> 
> > 
> > > 
> > > I need to rework things for v9, and we'll probably want to use struct
> > > folio instead of struct page now, but as a proof-of-concept of sorts this
> > > is what I'd added on top of v8 of your patchset to implement 1) and 2):
> > > 
> > >   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
> > > 
> > > Does an approach like this seem reasonable? Should be work this into the
> > > base restricted memslot support?
> > 
> > If the above mentioned semantics difference is not a problem, I don't
> > have strong objection on this.
> > 
> > Sean, since you have much better understanding on this, what is your
> > take on this?
> > 
> > Chao
> > > 
> > > Thanks,
> > > 
> > > Mike
  
Chao Peng Nov. 2, 2022, 2:53 p.m. UTC | #8
On Tue, Nov 01, 2022 at 02:30:58PM -0500, Michael Roth wrote:
> On Tue, Nov 01, 2022 at 10:19:44AM -0500, Michael Roth wrote:
> > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> > > On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > 
> > > > 
> > > >   3) Potentially useful for hugetlbfs support:
> > > > 
> > > >      One issue with hugetlbfs is that we don't support splitting the
> > > >      hugepage in such cases, which was a big obstacle prior to UPM. Now
> > > >      however, we may have the option of doing "lazy" invalidations where
> > > >      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
> > > >      all the subpages within the 2M range are either hole-punched, or the
> > > >      guest is shut down, so in that way we never have to split it. Sean
> > > >      was pondering something similar in another thread:
> > > > 
> > > >        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
> > > > 
> > > >      Issuing invalidations with folio-granularity ties in fairly well
> > > >      with this sort of approach if we end up going that route.
> > > 
> > > There is semantics difference between the current one and the proposed
> > > one: The invalidation range is exactly what userspace passed down to the
> > > kernel (being fallocated) while the proposed one will be subset of that
> > > (if userspace-provided addr/size is not aligned to power of two), I'm
> > > not quite confident this difference has no side effect.
> > 
> > In theory userspace should not be allocating/hole-punching restricted
> > pages for GPA ranges that are already mapped as private in the xarray,
> > and KVM could potentially fail such requests (though it doesn't currently).
> > 
> > But if we somehow enforced that, then we could rely on
> > KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
> > which would free up the restricted fd invalidation callbacks to be used
> > purely to handle doing things like RMP/directmap fixups prior to returning
> > restricted pages back to the host. So that was sort of my thinking why the
> > new semantics would still cover all the necessary cases.
> 
> Sorry, this explanation is if we rely on userspace to fallocate() on 2MB
> boundaries, and ignore any non-aligned requests in the kernel. But
> that's not how I actually ended up implementing things, so I'm not sure
> why answered that way...
> 
> In my implementation we actually do issue invalidations for fallocate()
> even for non-2M-aligned GPA/offset ranges. For instance (assuming
> restricted FD offset 0 corresponds to GPA 0), an fallocate() on GPA
> range 0x1000-0x402000 would result in the following invalidations being
> issued if everything was backed by a 2MB page:
> 
>   invalidate GPA: 0x001000-0x200000, Page: pfn_to_page(I), order:9
>   invalidate GPA: 0x200000-0x400000, Page: pfn_to_page(J), order:9
>   invalidate GPA: 0x400000-0x402000, Page: pfn_to_page(K), order:9

Only after seeing this do I understand what you are actually going to
propose ;)

So the memory range (start/end) will still be there and cover exactly
what it should from userspace's point of view; the page+order (or just
folio) is really just a _hint_ for the invalidation callbacks. Looks
ugly though.

In v9 we use an invalidate_start()/invalidate_end() pair to solve a race
contention issue (https://lore.kernel.org/kvm/Y1LOe4JvnTbFNs4u@google.com/).
To work with this, I believe we only need to pass this hint info to
invalidate_start(), since by invalidate_end() time the page has already
been discarded.

Another thing worth mentioning is that invalidate_start/end is not just
invoked for hole punching, but also for allocation (e.g. default
fallocate), and for allocation we can get the page only at
invalidate_end() time. But AFAICS, invalidate() was called for
fallocate(allocation) because previously we relied on a page's existence
in the memory backing store to tell whether it is private, and we needed
to notify KVM that the page was being converted from shared to private.
That is no longer true for the current code, and fallocate() is also not
mandatory since KVM can call restrictedmem_get_page() to allocate
dynamically, so I think we can remove the invalidation path for
fallocate(allocation).
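
If that holds, restrictedmem_fallocate() could presumably shrink to only
notifying on hole punch, roughly like this (just a sketch of the shape of
the change, not tested):

static long restrictedmem_fallocate(struct file *file, int mode,
				    loff_t offset, loff_t len)
{
	struct restrictedmem_data *data = file->f_mapping->private_data;
	struct file *memfd = data->memfd;
	int ret;

	/*
	 * Plain allocation needs no notification; KVM can allocate pages
	 * on demand via restrictedmem_get_page().
	 */
	if (!(mode & FALLOC_FL_PUNCH_HOLE))
		return memfd->f_op->fallocate(memfd, mode, offset, len);

	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
		return -EINVAL;

	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
	return ret;
}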

> 
> So you still cover the same range, but the arch/platform callbacks can
> then, as a best effort, do things like restore 2M directmap if they see
> that the backing page is 2MB+ and the GPA range covers the entire range.
> If the GPA doesn't covers the whole range, or the backing page is
> order:0, then in that case we are still forced to leave the directmap
> split.
> 
> But with that in place we can then improve on that by allowing for the
> use of hugetlbfs.
> 
> We'd still be somewhat reliant on userspace to issue fallocate()'s on
> 2M-aligned boundaries to some degree (guest teardown invalidations
> could be issued as 2M-aligned, which would be the bulk of the pages
> in most cases, but for discarding pages after private->shared
> conversion we could still get fragmentation). This could maybe be
> addressed by keeping track of those partial/non-2M-aligned fallocate()
> requests and then issuing them as a batched 2M invalidation once all
> the subpages have been fallocate(HOLE_PUNCH)'d. We'd need to enforce
> that fallocate(PUNCH_HOLE) is preceeded by
> KVM_MEMORY_ENCRYPT_UNREG_REGION to make sure MMU invalidations happen
> though.

I don't understand why the sequence matters here; we should do MMU
invalidation for both fallocate(PUNCH_HOLE) and
KVM_MEMORY_ENCRYPT_UNREG_REGION, right?

Thanks,
Chao
> 
> Not sure on these potential follow-ups, but they all at least seem
> compatible with the proposed invalidation scheme.
> 
> -Mike
> 
> > 
> > -Mike
> > 
> > > 
> > > > 
> > > > I need to rework things for v9, and we'll probably want to use struct
> > > > folio instead of struct page now, but as a proof-of-concept of sorts this
> > > > is what I'd added on top of v8 of your patchset to implement 1) and 2):
> > > > 
> > > >   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
> > > > 
> > > > Does an approach like this seem reasonable? Should be work this into the
> > > > base restricted memslot support?
> > > 
> > > If the above mentioned semantics difference is not a problem, I don't
> > > have strong objection on this.
> > > 
> > > Sean, since you have much better understanding on this, what is your
> > > take on this?
> > > 
> > > Chao
> > > > 
> > > > Thanks,
> > > > 
> > > > Mike
  
Kirill A. Shutemov Nov. 2, 2022, 9:14 p.m. UTC | #9
On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> 
> In v8 there was some discussion about potentially passing the page/folio
> and order as part of the invalidation callback, I ended up needing
> something similar for SEV-SNP, and think it might make sense for other
> platforms. This main reasoning is:
> 
>   1) restoring kernel directmap:
> 
>      Currently SNP (and I believe TDX) need to either split or remove kernel
>      direct mappings for restricted PFNs, since there is no guarantee that
>      other PFNs within a 2MB range won't be used for non-restricted
>      (which will cause an RMP #PF in the case of SNP since the 2MB
>      mapping overlaps with guest-owned pages)

That's news to me. Where does the restriction for SNP come from? There's
no such limitation on the TDX side AFAIK?

Could you point me to relevant documentation if there's any?
  
Michael Roth Nov. 2, 2022, 9:19 p.m. UTC | #10
On Wed, Nov 02, 2022 at 10:53:25PM +0800, Chao Peng wrote:
> On Tue, Nov 01, 2022 at 02:30:58PM -0500, Michael Roth wrote:
> > On Tue, Nov 01, 2022 at 10:19:44AM -0500, Michael Roth wrote:
> > > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> > > > On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > > > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > > 
> > > > > 
> > > > >   3) Potentially useful for hugetlbfs support:
> > > > > 
> > > > >      One issue with hugetlbfs is that we don't support splitting the
> > > > >      hugepage in such cases, which was a big obstacle prior to UPM. Now
> > > > >      however, we may have the option of doing "lazy" invalidations where
> > > > >      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
> > > > >      all the subpages within the 2M range are either hole-punched, or the
> > > > >      guest is shut down, so in that way we never have to split it. Sean
> > > > >      was pondering something similar in another thread:
> > > > > 
> > > > >        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
> > > > > 
> > > > >      Issuing invalidations with folio-granularity ties in fairly well
> > > > >      with this sort of approach if we end up going that route.
> > > > 
> > > > There is semantics difference between the current one and the proposed
> > > > one: The invalidation range is exactly what userspace passed down to the
> > > > kernel (being fallocated) while the proposed one will be subset of that
> > > > (if userspace-provided addr/size is not aligned to power of two), I'm
> > > > not quite confident this difference has no side effect.
> > > 
> > > In theory userspace should not be allocating/hole-punching restricted
> > > pages for GPA ranges that are already mapped as private in the xarray,
> > > and KVM could potentially fail such requests (though it doesn't currently).
> > > 
> > > But if we somehow enforced that, then we could rely on
> > > KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
> > > which would free up the restricted fd invalidation callbacks to be used
> > > purely to handle doing things like RMP/directmap fixups prior to returning
> > > restricted pages back to the host. So that was sort of my thinking why the
> > > new semantics would still cover all the necessary cases.
> > 
> > Sorry, this explanation is if we rely on userspace to fallocate() on 2MB
> > boundaries, and ignore any non-aligned requests in the kernel. But
> > that's not how I actually ended up implementing things, so I'm not sure
> > why answered that way...
> > 
> > In my implementation we actually do issue invalidations for fallocate()
> > even for non-2M-aligned GPA/offset ranges. For instance (assuming
> > restricted FD offset 0 corresponds to GPA 0), an fallocate() on GPA
> > range 0x1000-0x402000 would result in the following invalidations being
> > issued if everything was backed by a 2MB page:
> > 
> >   invalidate GPA: 0x001000-0x200000, Page: pfn_to_page(I), order:9
> >   invalidate GPA: 0x200000-0x400000, Page: pfn_to_page(J), order:9
> >   invalidate GPA: 0x400000-0x402000, Page: pfn_to_page(K), order:9
> 
> Only see this I understand what you are actually going to propose;)
> 
> So the memory range(start/end) will be still there and covers exactly
> what it should be from usrspace point of view, the page+order(or just
> folio) is really just a _hint_ for the invalidation callbacks. Looks
> ugly though.

Yes that's accurate: callbacks still need to handle partial ranges, so
it's more of a hint/optimization for cases where callbacks can benefit
from knowing the entire backing hugepage is being invalidated/freed.

> 
> In v9 we use a invalidate_start/ invalidate_end pair to solve a race
> contention issue (https://lore.kernel.org/kvm/Y1LOe4JvnTbFNs4u@google.com/).
> To work with this, I believe we only need pass this hint info for
> invalidate_start() since at the invalidate_end() time, the page has
> already been discarded.

Ok, yah, that's the approach I'm looking at for v9: pass the page/order
for invalidate_start, but keep invalidate_end as-is.
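
For the sake of discussion, the notifier ops would then look something
like the below; this is just the shape I have in mind, not final (the
real struct may well have other members):

/*
 * Illustrative shape only: invalidate_start() gains a page/order hint,
 * invalidate_end() stays as-is.
 */
struct restrictedmem_notifier_ops {
	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
				 pgoff_t start, pgoff_t end,
				 struct page *page, unsigned int order);
	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
			       pgoff_t start, pgoff_t end);
};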

> 
> Another worth-mentioning-thing is invalidate_start/end is not just
> invoked for hole punching, but also for allocation(e.g. default
> fallocate). While for allocation we can get the page only at the
> invalidate_end() time. But AFAICS, the invalidate() is called for
> fallocate(allocation) is because previously we rely on the existence in
> memory backing store to tell a page is private and we need notify KVM
> that the page is being converted from shared to private, but that is not
> true for current code and fallocate() is also not mandatory since KVM
> can call restrictedmem_get_page() to allocate dynamically, so I think we
> can remove the invalidation path for fallocate(allocation).

I actually ended up doing that for the v8 implementation. I figured it
was a holdover from before {REG,UNREG}_REGION were used, but wasn't too
sure about that, so it's good to have some confirmation there.

> 
> > 
> > So you still cover the same range, but the arch/platform callbacks can
> > then, as a best effort, do things like restore 2M directmap if they see
> > that the backing page is 2MB+ and the GPA range covers the entire range.
> > If the GPA doesn't covers the whole range, or the backing page is
> > order:0, then in that case we are still forced to leave the directmap
> > split.
> > 
> > But with that in place we can then improve on that by allowing for the
> > use of hugetlbfs.
> > 
> > We'd still be somewhat reliant on userspace to issue fallocate()'s on
> > 2M-aligned boundaries to some degree (guest teardown invalidations
> > could be issued as 2M-aligned, which would be the bulk of the pages
> > in most cases, but for discarding pages after private->shared
> > conversion we could still get fragmentation). This could maybe be
> > addressed by keeping track of those partial/non-2M-aligned fallocate()
> > requests and then issuing them as a batched 2M invalidation once all
> > the subpages have been fallocate(HOLE_PUNCH)'d. We'd need to enforce
> > that fallocate(PUNCH_HOLE) is preceeded by
> > KVM_MEMORY_ENCRYPT_UNREG_REGION to make sure MMU invalidations happen
> > though.
> 
> Don't understand why the sequence matters here, we should do MMU
> invalidation for both fallocate(PUNCH_HOLE) and
> KVM_MEMORY_ENCRYPT_UNREG_REGION, right?

It should happen in both places as long as it's possible for userspace
to fallocate(PUNCH_HOLE) a private page while it is mapped to a guest.
I'm not necessarily suggesting we should make any changes there right
now, but...

We might need to consider changing that if we decide that we don't want
to allow userspace to basically force splitting the directmap by always
issuing fallocate(PUNCH_HOLE) for each 4K page rather than trying to do
it in 2M intervals when possible, since it would still result in 4K
invalidations being issued, such that optimizations like restoring the
2M directmap can't be done, even when backed by THPs or hugetlbfs pages.
One approach to deal with this is to introduce a bitmap (for instance) to
track what subpages have been fallocate(PUNCH_HOLE)'d, and defer the
actual free'ing/invalidation until a whole page has been marked for
deallocation. This would keep huge-pages/huge-invalidations intact even
if userspace is malicious/non-optimal and actively trying to slow the
host down by forcing the directmap to get split.
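
Purely as a strawman, the tracking for that could look something like
this per-2M structure; all the names are made up and it glosses over
locking entirely:

/*
 * Strawman sketch of deferred hole punching: record punched 4K subpages
 * per 2M region, and only do the real free/invalidation once the whole
 * region has been punched. Made-up names, no locking shown.
 */
struct rmem_2m_tracker {
	DECLARE_BITMAP(punched, 512);	/* one bit per 4K subpage */
};

/* Returns true once every subpage of the 2M region has been punched. */
static bool rmem_record_punch(struct rmem_2m_tracker *t,
			      pgoff_t start, pgoff_t end)
{
	pgoff_t i;

	for (i = start; i < end; i++)
		set_bit(i & 511, t->punched);

	return bitmap_full(t->punched, 512);
}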

*If* we took that approach though, then the MMU invalidations would also
get deferred, which is bad. But if we added a check/callback that
restrictedfd.c could use to confirm that the page is already in a
shared/non-private state, then we'd know the MMU invalidation for the
private page must have already happened, so if anything got faulted in
afterward it should be a shared page. (Though I guess update_mem_attr
would also need to check this bitmap and fail for cases where a
shared->private conversion is issued for a page/range that's been
recorded as having previously been issued a deferred PUNCH_HOLE.)

But that's only an optimization and probably needs a lot more thought,
not necessarily something I think we need to implement now.

-Mike

> 
> Thanks,
> Chao
> > 
> > Not sure on these potential follow-ups, but they all at least seem
> > compatible with the proposed invalidation scheme.
> > 
> > -Mike
> > 
> > > 
> > > -Mike
> > > 
> > > > 
> > > > > 
> > > > > I need to rework things for v9, and we'll probably want to use struct
> > > > > folio instead of struct page now, but as a proof-of-concept of sorts this
> > > > > is what I'd added on top of v8 of your patchset to implement 1) and 2):
> > > > > 
> > > > >   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
> > > > > 
> > > > > Does an approach like this seem reasonable? Should be work this into the
> > > > > base restricted memslot support?
> > > > 
> > > > If the above mentioned semantics difference is not a problem, I don't
> > > > have strong objection on this.
> > > > 
> > > > Sean, since you have much better understanding on this, what is your
> > > > take on this?
> > > > 
> > > > Chao
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Mike
  
Michael Roth Nov. 2, 2022, 9:26 p.m. UTC | #11
On Thu, Nov 03, 2022 at 12:14:04AM +0300, Kirill A. Shutemov wrote:
> On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > 
> > In v8 there was some discussion about potentially passing the page/folio
> > and order as part of the invalidation callback, I ended up needing
> > something similar for SEV-SNP, and think it might make sense for other
> > platforms. This main reasoning is:
> > 
> >   1) restoring kernel directmap:
> > 
> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> >      direct mappings for restricted PFNs, since there is no guarantee that
> >      other PFNs within a 2MB range won't be used for non-restricted
> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> >      mapping overlaps with guest-owned pages)
> 
> That's news to me. Where the restriction for SNP comes from? There's no
> such limitation on TDX side AFAIK?
> 
> Could you point me to relevant documentation if there's any?

I could be mistaken, I haven't looked into the specific documentation and was
going off of this discussion from a ways back:

  https://lore.kernel.org/all/YWb8WG6Ravbs1nbx@google.com/

Sean, is my read of that correct? Do you happen to know where there's
some documentation on that for the TDX side?

Thanks,

Mike

> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov
  
Michael Roth Nov. 2, 2022, 10:07 p.m. UTC | #12
On Thu, Nov 03, 2022 at 12:14:04AM +0300, Kirill A. Shutemov wrote:
> On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > 
> > In v8 there was some discussion about potentially passing the page/folio
> > and order as part of the invalidation callback, I ended up needing
> > something similar for SEV-SNP, and think it might make sense for other
> > platforms. This main reasoning is:
> > 
> >   1) restoring kernel directmap:
> > 
> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> >      direct mappings for restricted PFNs, since there is no guarantee that
> >      other PFNs within a 2MB range won't be used for non-restricted
> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> >      mapping overlaps with guest-owned pages)
> 
> That's news to me. Where the restriction for SNP comes from?

Sorry, missed your first question.

For SNP at least, the restriction is documented in APM Volume 2, Section
15.36.10, first row of Table 15-36 (the preceding paragraph has more
context). I forgot to mention this only pertains to writes by the
host to 2MB pages that contain guest-owned subpages; for reads it's
not an issue, but I think the implementation requirements end up being
the same either way:

  https://www.amd.com/system/files/TechDocs/24593.pdf

-Mike

> That's news to me. Where the restriction for SNP comes from? There's no
> such limitation on TDX side AFAIK?
> 
> Could you point me to relevant documentation if there's any?
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov
  
Kirill A. Shutemov Nov. 3, 2022, 4:30 p.m. UTC | #13
On Wed, Nov 02, 2022 at 05:07:00PM -0500, Michael Roth wrote:
> On Thu, Nov 03, 2022 at 12:14:04AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > > 
> > > In v8 there was some discussion about potentially passing the page/folio
> > > and order as part of the invalidation callback, I ended up needing
> > > something similar for SEV-SNP, and think it might make sense for other
> > > platforms. This main reasoning is:
> > > 
> > >   1) restoring kernel directmap:
> > > 
> > >      Currently SNP (and I believe TDX) need to either split or remove kernel
> > >      direct mappings for restricted PFNs, since there is no guarantee that
> > >      other PFNs within a 2MB range won't be used for non-restricted
> > >      (which will cause an RMP #PF in the case of SNP since the 2MB
> > >      mapping overlaps with guest-owned pages)
> > 
> > That's news to me. Where the restriction for SNP comes from?
> 
> Sorry, missed your first question.
> 
> For SNP at least, the restriction is documented in APM Volume 2, Section
> 15.36.10, First row of Table 15-36 (preceeding paragraph has more
> context). I forgot to mention this is only pertaining to writes by the
> host to 2MB pages that contain guest-owned subpages, for reads it's
> not an issue, but I think the implementation requirements end up being
> the same either way:
> 
>   https://www.amd.com/system/files/TechDocs/24593.pdf

Looks like you wanted the restricted memfd to be backed by secretmem
rather than a normal memfd. It would help preserve the directmap.
  
Vlastimil Babka Nov. 14, 2022, 2:02 p.m. UTC | #14
On 11/1/22 16:19, Michael Roth wrote:
> On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
>> > 
>> >   1) restoring kernel directmap:
>> > 
>> >      Currently SNP (and I believe TDX) need to either split or remove kernel
>> >      direct mappings for restricted PFNs, since there is no guarantee that
>> >      other PFNs within a 2MB range won't be used for non-restricted
>> >      (which will cause an RMP #PF in the case of SNP since the 2MB
>> >      mapping overlaps with guest-owned pages)
>> 
>> Has the splitting and restoring been a well-discussed direction? I'm
>> just curious whether there is other options to solve this issue.
> 
> For SNP it's been discussed for quite some time, and either splitting or
> removing private entries from directmap are the well-discussed way I'm
> aware of to avoid RMP violations due to some other kernel process using
> a 2MB mapping to access shared memory if there are private pages that
> happen to be within that range.
> 
> In both cases the issue of how to restore directmap as 2M becomes a
> problem.
> 
> I was also under the impression TDX had similar requirements. If so,
> do you know what the plan is for handling this for TDX?
> 
> There are also 2 potential alternatives I'm aware of, but these haven't
> been discussed in much detail AFAIK:
> 
> a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
>    request 2MB THP pages, but I'm not sure how reliably we can guarantee
>    that enough THPs are available, so if we went that route we'd probably
>    be better off requiring the use of hugetlbfs as the backing store. But
>    obviously that's a bit limiting and it would be nice to have the option
>    of using normal pages as well. One nice thing with invalidation
>    scheme proposed here is that this would "Just Work" if implement
>    hugetlbfs support, so an admin that doesn't want any directmap
>    splitting has this option available, otherwise it's done as a
>    best-effort.
> 
> b) Implement general support for restoring directmap as 2M even when
>    subpages might be in use by other kernel threads. This would be the
>    most flexible approach since it requires no special handling during
>    invalidations, but I think it's only possible if all the CPA
>    attributes for the 2M range are the same at the time the mapping is
>    restored/unsplit, so some potential locking issues there and still
>    chance for splitting directmap over time.

I've been hoping that

c) using a mechanism such as [1] [2], where the goal is to group together
these small allocations that need to increase directmap granularity so the
maximum number of large mappings is preserved. But I guess that means
knowing at allocation time that this will happen. So I've been wondering
how this could be employed in the SNP/UPM case? I guess it depends on how
we expect the private/shared conversions to happen in practice, and I
don't know the details. I can imagine the following complications:

- a memfd_restricted region is created such that it's 2MB large/aligned,
i.e. like case a) above, we can allocate it normally. Now, what if a 4k
page in the middle is to be temporarily converted to shared for some
communication between host and guest (can such a thing happen?). With the
punch-hole approach, I wonder if we end up fragmenting the directmap
unnecessarily? IIUC the now-shared page will become backed by some other
page (as the memslot supports both private and shared pages
simultaneously). But does it make sense to really split the direct mapping
(and e.g. the shmem page)? We could leave the whole 2MB unmapped without
splitting if we didn't free the private 4k subpage.

- a restricted region is created that's below 2MB. If something like [1] is
merged, it could be used for the backing pages to limit directmap
fragmentation. But then in case it's eventually fallocated to become larger
and gain one or more 2MB-aligned ranges, the result is suboptimal. Unless
in that case we migrate the existing pages to a THP-backed shmem, kinda like
khugepaged collapses hugepages. But that would have to be coordinated with
the guest, maybe not even possible?

[1] https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org/
[2] https://lwn.net/Articles/894557/

>> 
>> > 
>> >      Previously we were able to restore 2MB mappings to some degree
>> >      since both shared/restricted pages were all pinned, so anything
>> >      backed by a THP (or hugetlb page once that is implemented) at guest
>> >      teardown could be restored as 2MB direct mapping.
>> > 
>> >      Invalidation seems like the most logical time to have this happen,
>> 
>> Currently invalidation only happens at user-initiated fallocate(). It
>> does not cover the VM teardown case where the restoring might also be
>> expected to be handled.
> 
> Right, I forgot to add that in my proposed changes I added invalidations
> for any still-allocated private pages present when the restricted memfd
> notifier is unregistered. This was needed to avoid leaking pages back to
> the kernel that still need directmap or RMP table fixups. I also added
> similar invalidations for memfd->release(), since it seems possible that
> userspace might close() it before shutting down guest, but maybe the
> latter is not needed if KVM takes a reference on the FD during life of
> the guest.
> 
>> 
>> >      but whether or not to restore as 2MB requires the order to be 2MB
>> >      or larger, and for GPA range being invalidated to cover the entire
>> >      2MB (otherwise it means the page was potentially split and some
>> >      subpages free back to host already, in which case it can't be
>> >      restored as 2MB).
>> > 
>> >   2) Potentially less invalidations:
>> >       
>> >      If we pass the entire folio or compound_page as part of
>> >      invalidation, we only needed to issue 1 invalidation per folio.
>> 
>> I'm not sure I agree, the current invalidation covers the whole range
>> that passed from userspace and the invalidation is invoked only once for
>> each usrspace fallocate().
> 
> That's true, it only reduces invalidations if we decide to provide a
> struct page/folio as part of the invalidation callbacks, which isn't
> the case yet. Sorry for the confusion.
> 
>> 
>> > 
>> >   3) Potentially useful for hugetlbfs support:
>> > 
>> >      One issue with hugetlbfs is that we don't support splitting the
>> >      hugepage in such cases, which was a big obstacle prior to UPM. Now
>> >      however, we may have the option of doing "lazy" invalidations where
>> >      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
>> >      all the subpages within the 2M range are either hole-punched, or the
>> >      guest is shut down, so in that way we never have to split it. Sean
>> >      was pondering something similar in another thread:
>> > 
>> >        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
>> > 
>> >      Issuing invalidations with folio-granularity ties in fairly well
>> >      with this sort of approach if we end up going that route.
>> 
>> There is semantics difference between the current one and the proposed
>> one: The invalidation range is exactly what userspace passed down to the
>> kernel (being fallocated) while the proposed one will be subset of that
>> (if userspace-provided addr/size is not aligned to power of two), I'm
>> not quite confident this difference has no side effect.
> 
> In theory userspace should not be allocating/hole-punching restricted
> pages for GPA ranges that are already mapped as private in the xarray,
> > and KVM could potentially fail such requests (though it doesn't currently).
> 
> But if we somehow enforced that, then we could rely on
> KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
> which would free up the restricted fd invalidation callbacks to be used
> purely to handle doing things like RMP/directmap fixups prior to returning
> restricted pages back to the host. So that was sort of my thinking why the
> new semantics would still cover all the necessary cases.
> 
> -Mike
> 
>> 
>> > 
>> > I need to rework things for v9, and we'll probably want to use struct
>> > folio instead of struct page now, but as a proof-of-concept of sorts this
>> > is what I'd added on top of v8 of your patchset to implement 1) and 2):
>> > 
>> >   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
>> > 
>> > Does an approach like this seem reasonable? Should be work this into the
>> > base restricted memslot support?
>> 
>> If the above mentioned semantics difference is not a problem, I don't
>> have strong objection on this.
>> 
>> Sean, since you have much better understanding on this, what is your
>> take on this?
>> 
>> Chao
>> > 
>> > Thanks,
>> > 
>> > Mike
  
Kirill A. Shutemov Nov. 14, 2022, 3:28 p.m. UTC | #15
On Mon, Nov 14, 2022 at 03:02:37PM +0100, Vlastimil Babka wrote:
> On 11/1/22 16:19, Michael Roth wrote:
> > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> >> > 
> >> >   1) restoring kernel directmap:
> >> > 
> >> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> >> >      direct mappings for restricted PFNs, since there is no guarantee that
> >> >      other PFNs within a 2MB range won't be used for non-restricted
> >> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> >> >      mapping overlaps with guest-owned pages)
> >> 
> >> Has the splitting and restoring been a well-discussed direction? I'm
> >> just curious whether there is other options to solve this issue.
> > 
> > For SNP it's been discussed for quite some time, and either splitting or
> > removing private entries from directmap are the well-discussed way I'm
> > aware of to avoid RMP violations due to some other kernel process using
> > a 2MB mapping to access shared memory if there are private pages that
> > happen to be within that range.
> > 
> > In both cases the issue of how to restore directmap as 2M becomes a
> > problem.
> > 
> > I was also under the impression TDX had similar requirements. If so,
> > do you know what the plan is for handling this for TDX?
> > 
> > There are also 2 potential alternatives I'm aware of, but these haven't
> > been discussed in much detail AFAIK:
> > 
> > a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
> >    request 2MB THP pages, but I'm not sure how reliably we can guarantee
> >    that enough THPs are available, so if we went that route we'd probably
> >    be better off requiring the use of hugetlbfs as the backing store. But
> >    obviously that's a bit limiting and it would be nice to have the option
> >    of using normal pages as well. One nice thing with invalidation
> >    scheme proposed here is that this would "Just Work" if implement
> >    hugetlbfs support, so an admin that doesn't want any directmap
> >    splitting has this option available, otherwise it's done as a
> >    best-effort.
> > 
> > b) Implement general support for restoring directmap as 2M even when
> >    subpages might be in use by other kernel threads. This would be the
> >    most flexible approach since it requires no special handling during
> >    invalidations, but I think it's only possible if all the CPA
> >    attributes for the 2M range are the same at the time the mapping is
> >    restored/unsplit, so some potential locking issues there and still
> >    chance for splitting directmap over time.
> 
> I've been hoping that
> 
> c) using a mechanism such as [1] [2] where the goal is to group together
> these small allocations that need to increase directmap granularity so
> maximum number of large mappings are preserved.

As I mentioned in the other thread, the restricted memfd can be backed by
secretmem instead of a plain memfd. It already handles the directmap with
care.

But I don't think it has to be part of the initial restricted memfd
implementation. It is an SEV-specific requirement and AMD folks can extend
the implementation as needed later.
  
Michael Roth Nov. 14, 2022, 10:16 p.m. UTC | #16
On Mon, Nov 14, 2022 at 03:02:37PM +0100, Vlastimil Babka wrote:
> On 11/1/22 16:19, Michael Roth wrote:
> > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> >> > 
> >> >   1) restoring kernel directmap:
> >> > 
> >> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> >> >      direct mappings for restricted PFNs, since there is no guarantee that
> >> >      other PFNs within a 2MB range won't be used for non-restricted
> >> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> >> >      mapping overlaps with guest-owned pages)
> >> 
> >> Has the splitting and restoring been a well-discussed direction? I'm
> >> just curious whether there is other options to solve this issue.
> > 
> > For SNP it's been discussed for quite some time, and either splitting or
> > removing private entries from directmap are the well-discussed way I'm
> > aware of to avoid RMP violations due to some other kernel process using
> > a 2MB mapping to access shared memory if there are private pages that
> > happen to be within that range.
> > 
> > In both cases the issue of how to restore directmap as 2M becomes a
> > problem.
> > 
> > I was also under the impression TDX had similar requirements. If so,
> > do you know what the plan is for handling this for TDX?
> > 
> > There are also 2 potential alternatives I'm aware of, but these haven't
> > been discussed in much detail AFAIK:
> > 
> > a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
> >    request 2MB THP pages, but I'm not sure how reliably we can guarantee
> >    that enough THPs are available, so if we went that route we'd probably
> >    be better off requiring the use of hugetlbfs as the backing store. But
> >    obviously that's a bit limiting and it would be nice to have the option
> >    of using normal pages as well. One nice thing with invalidation
> >    scheme proposed here is that this would "Just Work" if implement
> >    hugetlbfs support, so an admin that doesn't want any directmap
> >    splitting has this option available, otherwise it's done as a
> >    best-effort.
> > 
> > b) Implement general support for restoring directmap as 2M even when
> >    subpages might be in use by other kernel threads. This would be the
> >    most flexible approach since it requires no special handling during
> >    invalidations, but I think it's only possible if all the CPA
> >    attributes for the 2M range are the same at the time the mapping is
> >    restored/unsplit, so some potential locking issues there and still
> >    chance for splitting directmap over time.
> 
> I've been hoping that
> 
> c) using a mechanism such as [1] [2] where the goal is to group together
> these small allocations that need to increase directmap granularity so
> maximum number of large mappings are preserved. But I guess that means

Thanks for the references. I wasn't aware there was work in this area;
this opens up some possibilities on how to approach this.

> maximum number of large mappings are preserved. But I guess that means
> knowing at allocation time that this will happen. So I've been wondering how
> this would be possible to employ in the SNP/UPM case? I guess it depends on
> how we expect the private/shared conversions to happen in practice, and I
> don't know the details. I can imagine the following complications:
> 
> - a memfd_restricted region is created such that it's 2MB large/aligned,
> i.e. like case a) above, we can allocate it normally. Now, what if a 4k page
> in the middle is to be temporarily converted to shared for some
> communication between host and guest (can such thing happen?). With the
> punch hole approach, I wonder if we end up fragmenting directmap
> unnecessarily? IIUC the now shared page will become backed by some other

Yes, we end up fragmenting in cases where a guest converts a sub-page to a
shared page because the fallocate(PUNCH_HOLE) gets forwarded through to shmem
which will then split it. At that point the subpage might get used elsewhere
so we no longer have the ability to restore as 2M after
invalidation/shutdown. We could potentially just intercept those
fallocate()'s and only issue the invalidation once all the subpages have
been PUNCH_HOLE'd. We'd still need to ensure KVM MMU invalidations
happen immediately though, but since we rely on a KVM ioctl to do the
conversion in advance, we can rely on the KVM MMU invalidation that
happens at that point and simply make fallocate(PUNCH_HOLE) fail if
someone attempts it on a page that hasn't been converted to shared yet.

Otherwise we could end up splitting the directmap for a good chunk of
pages depending on how the guest allocates shared pages, but I'm slightly
less concerned about that seeing as there are some general solutions to
directmap fragmentation being considered. I need to think more about how
these hooks would tie in to that though.

And since we'd only really be able to avoid unrecoverable splits if
the restrictedmem is hugepage-backed (if we get a bunch of 4K pages
to begin with there's no handling that would avoid fragmentation), it seems
like we'd end up relying on hugetlbfs support for instances where a host
really wants to avoid splitting, and maybe in the case of hugetlbfs
fallocate(PUNCH_HOLE) is already a no-op of sorts? Either way maybe it's
better to explore this aspect in the context of hugetlbfs support.

> page (as the memslot supports both private and shared pages simultaneously).
> But does it make sense to really split the direct mapping (and e.g. the
> shmem page?) We could leave the whole 2MB unmapped without splitting if we
> didn't free the private 4k subpage.
> 
> - a restricted region is created that's below 2MB. If something like [1] is
> merged, it could be used for the backing pages to limit directmap
> fragmentation. But then in case it's eventually fallocated to become larger
> and gain one more more 2MB aligned ranges, the result is suboptimal. Unless
> in that case we migrate the existing pages to a THP-backed shmem, kinda like
> khugepaged collapses hugepages. But that would have to be coordinated with
> the guest, maybe not even possible?

Any migrations would need to be coordinated with SNP firmware at least. I
think it's possible, but that support is probably a ways out. Near-term
I think it might be more straightforward to say: if you don't want
directmap fragmentation (for SNP anyway), you need to ensure restricted
ranges are backed by THPs or hugetlbfs, and make that the basis for
avoiding directmap splitting for now. Otherwise, it's simply done as a
best-effort, and then maybe over time, with things like [1] and migration
support in place, this restriction can go away, or become less impactful
at least.

Thanks,

Mike

> 
> [1] https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org/
> [2] https://lwn.net/Articles/894557/
> 
> >> 
> >> > 
> >> >      Previously we were able to restore 2MB mappings to some degree
> >> >      since both shared/restricted pages were all pinned, so anything
> >> >      backed by a THP (or hugetlb page once that is implemented) at guest
> >> >      teardown could be restored as 2MB direct mapping.
> >> > 
> >> >      Invalidation seems like the most logical time to have this happen,
> >> 
> >> Currently invalidation only happens at user-initiated fallocate(). It
> >> does not cover the VM teardown case where the restoring might also be
> >> expected to be handled.
> > 
> > Right, I forgot to add that in my proposed changes I added invalidations
> > for any still-allocated private pages present when the restricted memfd
> > notifier is unregistered. This was needed to avoid leaking pages back to
> > the kernel that still need directmap or RMP table fixups. I also added
> > similar invalidations for memfd->release(), since it seems possible that
> > userspace might close() it before shutting down guest, but maybe the
> > latter is not needed if KVM takes a reference on the FD during life of
> > the guest.
> > 
> >> 
> >> >      but whether or not to restore as 2MB requires the order to be 2MB
> >> >      or larger, and for GPA range being invalidated to cover the entire
> >> >      2MB (otherwise it means the page was potentially split and some
> >> >      subpages freed back to the host already, in which case it can't be
> >> >      restored as 2MB).
> >> > 
> >> >   2) Potentially less invalidations:
> >> >       
> >> >      If we pass the entire folio or compound_page as part of
> >> >      invalidation, we only needed to issue 1 invalidation per folio.
> >> 
> >> I'm not sure I agree, the current invalidation covers the whole range
> >> that was passed from userspace and the invalidation is invoked only once
> >> for each userspace fallocate().
> > 
> > That's true, it only reduces invalidations if we decide to provide a
> > struct page/folio as part of the invalidation callbacks, which isn't
> > the case yet. Sorry for the confusion.
> > 
> >> 
> >> > 
> >> >   3) Potentially useful for hugetlbfs support:
> >> > 
> >> >      One issue with hugetlbfs is that we don't support splitting the
> >> >      hugepage in such cases, which was a big obstacle prior to UPM. Now
> >> >      however, we may have the option of doing "lazy" invalidations where
> >> >      fallocate(PUNCH_HOLE, ...) won't free a shmem-allocated page unless
> >> >      all the subpages within the 2M range are either hole-punched, or the
> >> >      guest is shut down, so in that way we never have to split it. Sean
> >> >      was pondering something similar in another thread:
> >> > 
> >> >        https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/
> >> > 
> >> >      Issuing invalidations with folio-granularity ties in fairly well
> >> >      with this sort of approach if we end up going that route.
> >> 
> >> There is semantics difference between the current one and the proposed
> >> one: The invalidation range is exactly what userspace passed down to the
> >> kernel (being fallocated) while the proposed one will be subset of that
> >> (if userspace-provided addr/size is not aligned to power of two), I'm
> >> not quite confident this difference has no side effect.
> > 
> > In theory userspace should not be allocating/hole-punching restricted
> > pages for GPA ranges that are already mapped as private in the xarray,
> > and KVM could potentially fail such requests (though it doesn't currently).
> > 
> > But if we somehow enforced that, then we could rely on
> > KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
> > which would free up the restricted fd invalidation callbacks to be used
> > purely to handle doing things like RMP/directmap fixups prior to returning
> > restricted pages back to the host. So that was sort of my thinking why the
> > new semantics would still cover all the necessary cases.
> > 
> > -Mike
> > 
> >> 
> >> > 
> >> > I need to rework things for v9, and we'll probably want to use struct
> >> > folio instead of struct page now, but as a proof-of-concept of sorts this
> >> > is what I'd added on top of v8 of your patchset to implement 1) and 2):
> >> > 
> >> >   https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b
> >> > 
> >> > Does an approach like this seem reasonable? Should we work this into the
> >> > base restricted memslot support?
> >> 
> >> If the above mentioned semantics difference is not a problem, I don't
> >> have strong objection on this.
> >> 
> >> Sean, since you have much better understanding on this, what is your
> >> take on this?
> >> 
> >> Chao
> >> > 
> >> > Thanks,
> >> > 
> >> > Mike
>
  
Michael Roth Nov. 14, 2022, 10:16 p.m. UTC | #17
On Mon, Nov 14, 2022 at 06:28:43PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 14, 2022 at 03:02:37PM +0100, Vlastimil Babka wrote:
> > On 11/1/22 16:19, Michael Roth wrote:
> > > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> > >> > 
> > >> >   1) restoring kernel directmap:
> > >> > 
> > >> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> > >> >      direct mappings for restricted PFNs, since there is no guarantee that
> > >> >      other PFNs within a 2MB range won't be used for non-restricted memory
> > >> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> > >> >      mapping overlaps with guest-owned pages)
> > >> 
> > >> Has the splitting and restoring been a well-discussed direction? I'm
> > >> just curious whether there is other options to solve this issue.
> > > 
> > > For SNP it's been discussed for quite some time, and either splitting or
> > > removing private entries from the directmap are the well-discussed ways I'm
> > > aware of to avoid RMP violations due to some other kernel process using
> > > a 2MB mapping to access shared memory if there are private pages that
> > > happen to be within that range.
> > > 
> > > In both cases the issue of how to restore directmap as 2M becomes a
> > > problem.
> > > 
> > > I was also under the impression TDX had similar requirements. If so,
> > > do you know what the plan is for handling this for TDX?
> > > 
> > > There are also 2 potential alternatives I'm aware of, but these haven't
> > > been discussed in much detail AFAIK:
> > > 
> > > a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
> > >    request 2MB THP pages, but I'm not sure how reliably we can guarantee
> > >    that enough THPs are available, so if we went that route we'd probably
> > >    be better off requiring the use of hugetlbfs as the backing store. But
> > >    obviously that's a bit limiting and it would be nice to have the option
> > >    of using normal pages as well. One nice thing with the invalidation
> > >    scheme proposed here is that this would "Just Work" if we implement
> > >    hugetlbfs support, so an admin that doesn't want any directmap
> > >    splitting has this option available, otherwise it's done as a
> > >    best-effort.
> > > 
> > > b) Implement general support for restoring directmap as 2M even when
> > >    subpages might be in use by other kernel threads. This would be the
> > >    most flexible approach since it requires no special handling during
> > >    invalidations, but I think it's only possible if all the CPA
> > >    attributes for the 2M range are the same at the time the mapping is
> > >    restored/unsplit, so some potential locking issues there and still
> > >    chance for splitting directmap over time.
> > 
> > I've been hoping that
> > 
> > c) using a mechanism such as [1] [2] where the goal is to group together
> > these small allocations that need to increase directmap granularity so
> > maximum number of large mappings are preserved.
> 
> As I mentioned in the other thread the restricted memfd can be backed by
> secretmem instead of plain memfd. It already handles directmap with care.

It looks like it would handle direct unmapping/cleanup nicely, but it
seems to lack fallocate(PUNCH_HOLE) support, which we'd probably want in order to
avoid additional memory requirements. I think once we added that we'd
still end up needing some sort of handling for the invalidations.

Also, I know Chao has been considering hugetlbfs support, I assume by
leveraging the support that already exists in shmem. Ideally SNP would
be able to make use of that support as well, but relying on a separate
backend seems likely to result in more complications getting there
later.

> 
> But I don't think it has to be part of initial restricted memfd
> implementation. It is SEV-specific requirement and AMD folks can extend
> implementation as needed later.

Admittedly the suggested changes to the invalidation mechanism made a
lot more sense to me when I was under the impression that TDX would have
similar requirements and we might end up with a common hook. Since that
doesn't actually seem to be the case, it makes sense to try to do it as
a platform-specific hook for SNP.

I think, given a memslot, a GFN range, and kvm_restricted_mem_get_pfn(),
we should be able to get the same information needed to figure out whether
the range is backed by huge pages or not. I'll see how that works out
instead.

Thanks,

Mike

> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov
  
Chao Peng Nov. 15, 2022, 9:48 a.m. UTC | #18
On Mon, Nov 14, 2022 at 04:16:32PM -0600, Michael Roth wrote:
> On Mon, Nov 14, 2022 at 06:28:43PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Nov 14, 2022 at 03:02:37PM +0100, Vlastimil Babka wrote:
> > > On 11/1/22 16:19, Michael Roth wrote:
> > > > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> > > >> > 
> > > >> >   1) restoring kernel directmap:
> > > >> > 
> > > >> >      Currently SNP (and I believe TDX) need to either split or remove kernel
> > > >> >      direct mappings for restricted PFNs, since there is no guarantee that
> > > >> >      other PFNs within a 2MB range won't be used for non-restricted
> > > >> >      (which will cause an RMP #PF in the case of SNP since the 2MB
> > > >> >      mapping overlaps with guest-owned pages)
> > > >> 
> > > >> Has the splitting and restoring been a well-discussed direction? I'm
> > > >> just curious whether there is other options to solve this issue.
> > > > 
> > > > For SNP it's been discussed for quite some time, and either splitting or
> > > > removing private entries from directmap are the well-discussed way I'm
> > > > aware of to avoid RMP violations due to some other kernel process using
> > > > a 2MB mapping to access shared memory if there are private pages that
> > > > happen to be within that range.
> > > > 
> > > > In both cases the issue of how to restore directmap as 2M becomes a
> > > > problem.
> > > > 
> > > > I was also under the impression TDX had similar requirements. If so,
> > > > do you know what the plan is for handling this for TDX?
> > > > 
> > > > There are also 2 potential alternatives I'm aware of, but these haven't
> > > > been discussed in much detail AFAIK:
> > > > 
> > > > a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
> > > >    request 2MB THP pages, but I'm not sure how reliably we can guarantee
> > > >    that enough THPs are available, so if we went that route we'd probably
> > > >    be better off requiring the use of hugetlbfs as the backing store. But
> > > >    obviously that's a bit limiting and it would be nice to have the option
> > > >    of using normal pages as well. One nice thing with invalidation
> > > >    scheme proposed here is that this would "Just Work" if implement
> > > >    hugetlbfs support, so an admin that doesn't want any directmap
> > > >    splitting has this option available, otherwise it's done as a
> > > >    best-effort.
> > > > 
> > > > b) Implement general support for restoring directmap as 2M even when
> > > >    subpages might be in use by other kernel threads. This would be the
> > > >    most flexible approach since it requires no special handling during
> > > >    invalidations, but I think it's only possible if all the CPA
> > > >    attributes for the 2M range are the same at the time the mapping is
> > > >    restored/unsplit, so some potential locking issues there and still
> > > >    chance for splitting directmap over time.
> > > 
> > > I've been hoping that
> > > 
> > > c) using a mechanism such as [1] [2] where the goal is to group together
> > > these small allocations that need to increase directmap granularity so
> > > maximum number of large mappings are preserved.
> > 
> > As I mentioned in the other thread the restricted memfd can be backed by
> > secretmem instead of plain memfd. It already handles directmap with care.
> 
> It looks like it would handle direct unmapping/cleanup nicely, but it
> seems to lack fallocate(PUNCH_HOLE) support which we'd probably want to
> avoid additional memory requirements. I think once we added that we'd
> still end up needing some sort of handling for the invalidations.
> 
> Also, I know Chao has been considering hugetlbfs support, I assume by
> leveraging the support that already exists in shmem. Ideally SNP would
> be able to make use of that support as well, but relying on a separate
> backend seems likely to result in more complications getting there
> later.
> 
> > 
> > But I don't think it has to be part of initial restricted memfd
> > implementation. It is SEV-specific requirement and AMD folks can extend
> > implementation as needed later.
> 
> Admittedly the suggested changes to the invalidation mechanism made a
> lot more sense to me when I was under the impression that TDX would have
> similar requirements and we might end up with a common hook. Since that
> doesn't actually seem to be the case, it makes sense to try to do it as
> a platform-specific hook for SNP.
> 
> I think, given a memslot, a GFN range, and kvm_restricted_mem_get_pfn(),
> we should be able to get the same information needed to figure out whether
> the range is backed by huge pages or not. I'll see how that works out
> instead.

Sounds like a viable solution, it's just that kvm_restricted_mem_get_pfn() will
only give you the ability to check a page, not a range. But you can
still call it many times, I think.
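
E.g. (sketch only; the exact kvm_restricted_mem_get_pfn() signature and the
helper name here are assumptions, and you could also skip ahead by 1 << order
per iteration instead of stepping one GFN at a time):

  static bool range_is_2m_backed(struct kvm_memory_slot *slot, gfn_t gfn,
                                 unsigned long nr_pages)
  {
          unsigned long i;

          for (i = 0; i < nr_pages; i++) {
                  kvm_pfn_t pfn;
                  int order = 0;

                  if (kvm_restricted_mem_get_pfn(slot, gfn + i, &pfn, &order))
                          return false;
                  /* drop the reference the lookup took on the page */
                  put_page(pfn_to_page(pfn));
                  if (order < PMD_SHIFT - PAGE_SHIFT)
                          return false;
          }

          return true;
  }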

The invalidation callback will still be needed; it gives you the chance
to do the restoring.

Chao
> 
> Thanks,
> 
> Mike
> 
> > 
> > -- 
> >   Kiryl Shutsemau / Kirill A. Shutemov
  
Michael Roth Nov. 29, 2022, 12:06 a.m. UTC | #19
On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 

<snip>

> +static struct file *restrictedmem_file_create(struct file *memfd)
> +{
> +	struct restrictedmem_data *data;
> +	struct address_space *mapping;
> +	struct inode *inode;
> +	struct file *file;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return ERR_PTR(-ENOMEM);
> +
> +	data->memfd = memfd;
> +	mutex_init(&data->lock);
> +	INIT_LIST_HEAD(&data->notifiers);
> +
> +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> +	if (IS_ERR(inode)) {
> +		kfree(data);
> +		return ERR_CAST(inode);
> +	}
> +
> +	inode->i_mode |= S_IFREG;
> +	inode->i_op = &restrictedmem_iops;
> +	inode->i_mapping->private_data = data;
> +
> +	file = alloc_file_pseudo(inode, restrictedmem_mnt,
> +				 "restrictedmem", O_RDWR,
> +				 &restrictedmem_fops);
> +	if (IS_ERR(file)) {
> +		iput(inode);
> +		kfree(data);
> +		return ERR_CAST(file);
> +	}
> +
> +	file->f_flags |= O_LARGEFILE;
> +
> +	mapping = memfd->f_mapping;
> +	mapping_set_unevictable(mapping);
> +	mapping_set_gfp_mask(mapping,
> +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);

Is this supposed to prevent migration of pages being used for
restrictedmem/shmem backend?

In my case I've been testing SNP support based on UPM v9, and for
large guests (128GB+), if I force 2M THPs via:

  echo always >/sys/kernel/mm/transparent_hugepage/shmem_enabled

it will in some cases trigger the below trace, which suggests that
kcompactd is trying to call migrate_folio() on a PFN that was/is
still allocated for guest private memory (and so has been removed from
directmap as part of shared->private conversion via the REG_REGION kvm
ioctl, leading to the crash). This trace seems to occur during early
OVMF boot while the guest is in the middle of pre-accepting private
memory (no lazy accept in this case).

Is this expected behavior? What else needs to be done to ensure
migrations aren't attempted in this case?

Thanks!

-Mike


# Host logs with debug info for crash during SNP boot

...
[  904.373632] kvm_restricted_mem_get_pfn: GFN: 0x1caced1, PFN: 0x156b7f, page: ffffea0006b197b0, ref_count: 2
[  904.373634] kvm_restricted_mem_get_pfn: GFN: 0x1caced2, PFN: 0x156840, page: ffffea0006b09400, ref_count: 2
[  904.373637] kvm_restricted_mem_get_pfn: GFN: 0x1caced3, PFN: 0x156841, page: ffffea0006b09450, ref_count: 2
[  904.373639] kvm_restricted_mem_get_pfn: GFN: 0x1caced4, PFN: 0x156842, page: ffffea0006b094a0, ref_count: 2
[  904.373641] kvm_restricted_mem_get_pfn: GFN: 0x1caced5, PFN: 0x156843, page: ffffea0006b094f0, ref_count: 2
[  904.373645] kvm_restricted_mem_get_pfn: GFN: 0x1caced6, PFN: 0x156844, page: ffffea0006b09540, ref_count: 2
[  904.373647] kvm_restricted_mem_get_pfn: GFN: 0x1caced7, PFN: 0x156845, page: ffffea0006b09590, ref_count: 2
[  904.373649] kvm_restricted_mem_get_pfn: GFN: 0x1caced8, PFN: 0x156846, page: ffffea0006b095e0, ref_count: 2
[  904.373652] kvm_restricted_mem_get_pfn: GFN: 0x1caced9, PFN: 0x156847, page: ffffea0006b09630, ref_count: 2
[  904.373654] kvm_restricted_mem_get_pfn: GFN: 0x1caceda, PFN: 0x156848, page: ffffea0006b09680, ref_count: 2
[  904.373656] kvm_restricted_mem_get_pfn: GFN: 0x1cacedb, PFN: 0x156849, page: ffffea0006b096d0, ref_count: 2
[  904.373661] kvm_restricted_mem_get_pfn: GFN: 0x1cacedc, PFN: 0x15684a, page: ffffea0006b09720, ref_count: 2
[  904.373663] kvm_restricted_mem_get_pfn: GFN: 0x1cacedd, PFN: 0x15684b, page: ffffea0006b09770, ref_count: 2

# PFN 0x15684c is allocated for guest private memory, so it will have been removed from the directmap as part of RMP requirements

[  904.373665] kvm_restricted_mem_get_pfn: GFN: 0x1cacede, PFN: 0x15684c, page: ffffea0006b097c0, ref_count: 2
...

# kcompactd crashes trying to copy PFN 0x15684c to a new folio, faulting when it accesses the PFN via the directmap

[  904.470135] Migrating restricted page, SRC pfn: 0x15684c, folio_ref_count: 2, folio_order: 0
[  904.470154] BUG: unable to handle page fault for address: ffff88815684c000
[  904.470314] kvm_restricted_mem_get_pfn: GFN: 0x1cafe00, PFN: 0x19f6d0, page: ffffea00081d2100, ref_count: 2
[  904.477828] #PF: supervisor read access in kernel mode
[  904.477831] #PF: error_code(0x0000) - not-present page
[  904.477833] PGD 6601067 P4D 6601067 PUD 1569ad063 PMD 1569af063 PTE 800ffffea97b3060
[  904.508806] Oops: 0000 [#1] SMP NOPTI
[  904.512892] CPU: 52 PID: 1563 Comm: kcompactd0 Tainted: G            E      6.0.0-rc7-hsnp-v7pfdv9d+ #10
[  904.523473] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM1006B 08/20/2021
[  904.532499] RIP: 0010:copy_page+0x7/0x10
[  904.536877] Code: 00 66 90 48 89 f8 48 89 d1 f3 a4 31 c0 c3 cc cc cc cc 48 89 c8 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
[  904.557831] RSP: 0018:ffffc900106dfb78 EFLAGS: 00010286
[  904.563661] RAX: ffff888000000000 RBX: ffffea0006b09810 RCX: 0000000000000200
[  904.571622] RDX: ffffea0000000000 RSI: ffff88815684c000 RDI: ffff88816bc5d000
[  904.579581] RBP: ffffc900106dfba0 R08: 0000000000000001 R09: ffffea0006b097c0
[  904.587541] R10: 0000000000000002 R11: ffffc900106dfb38 R12: ffffea00071add60
[  904.595502] R13: cccccccccccccccd R14: ffffea0006b09810 R15: ffff888159c1e0f8
[  904.603462] FS:  0000000000000000(0000) GS:ffff88a04df00000(0000) knlGS:0000000000000000
[  904.612489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  904.618897] CR2: ffff88815684c000 CR3: 00000020eae16002 CR4: 0000000000770ee0
[  904.626855] PKRU: 55555554
[  904.629870] Call Trace:
[  904.632594]  <TASK>
[  904.634928]  ? folio_copy+0x8c/0xe0
[  904.638818]  migrate_folio+0x5b/0x110
[  904.642901]  move_to_new_folio+0x5b/0x150
[  904.647371]  migrate_pages+0x11bb/0x1830
[  904.651743]  ? move_freelist_tail+0xc0/0xc0
[  904.656406]  ? isolate_freepages_block+0x470/0x470
[  904.661749]  compact_zone+0x681/0xda0
[  904.665832]  kcompactd_do_work+0x1b3/0x2c0
[  904.670400]  kcompactd+0x257/0x330
[  904.674190]  ? prepare_to_wait_event+0x120/0x120
[  904.679338]  ? kcompactd_do_work+0x2c0/0x2c0
[  904.684098]  kthread+0xcf/0xf0
[  904.687501]  ? kthread_complete_and_exit+0x20/0x20
[  904.692844]  ret_from_fork+0x22/0x30
[  904.696830]  </TASK>
[  904.699262] Modules linked in: nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) xt_CHECKSUM(E) xt_MASQUERADE(E) xt_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) ip6table_mangle(E) ip6table_nat(E) iptable_mangle(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) nfnetlink(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) bpfilter(E) intel_rapl_msr(E) intel_rapl_common(E) amd64_edac(E) bridge(E) stp(E) llc(E) kvm_amd(E) overlay(E) nls_iso8859_1(E) kvm(E) crct10dif_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) crypto_simd(E) cryptd(E) rapl(E) ipmi_si(E) ipmi_devintf(E) wmi_bmof(E) ipmi_msghandler(E) efi_pstore(E) binfmt_misc(E) ast(E) drm_vram_helper(E) joydev(E) drm_ttm_helper(E) ttm(E) drm_kms_helper(E) input_leds(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ccp(E) k10temp(E) mac_hid(E) sch_fq_codel(E) parport_pc(E) ppdev(E) lp(E) parport(E) drm(E) ip_tables(E)
[  904.699316]  x_tables(E) autofs4(E) btrfs(E) blake2b_generic(E) zstd_compress(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) raid1(E) raid0(E) multipath(E) linear(E) crc32_pclmul(E) hid_generic(E) usbhid(E) hid(E) e1000e(E) i2c_piix4(E) wmi(E)
[  904.828498] CR2: ffff88815684c000
[  904.832193] ---[ end trace 0000000000000000 ]---
[  904.937159] RIP: 0010:copy_page+0x7/0x10
[  904.941524] Code: 00 66 90 48 89 f8 48 89 d1 f3 a4 31 c0 c3 cc cc cc cc 48 89 c8 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
[  904.962478] RSP: 0018:ffffc900106dfb78 EFLAGS: 00010286
[  904.968305] RAX: ffff888000000000 RBX: ffffea0006b09810 RCX: 0000000000000200
[  904.976265] RDX: ffffea0000000000 RSI: ffff88815684c000 RDI: ffff88816bc5d000
[  904.984227] RBP: ffffc900106dfba0 R08: 0000000000000001 R09: ffffea0006b097c0
[  904.992187] R10: 0000000000000002 R11: ffffc900106dfb38 R12: ffffea00071add60
[  905.000145] R13: cccccccccccccccd R14: ffffea0006b09810 R15: ffff888159c1e0f8
[  905.008105] FS:  0000000000000000(0000) GS:ffff88a04df00000(0000) knlGS:0000000000000000
[  905.017132] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  905.023540] CR2: ffff88815684c000 CR3: 00000020eae16002 CR4: 0000000000770ee0
[  905.031501] PKRU: 55555554
[  905.034558] kvm_restricted_mem_get_pfn: GFN: 0x1cafe01, PFN: 0x19f6d1, page: ffffea00081d2150, ref_count: 2
[  905.045455] kvm_restricted_mem_get_pfn: GFN: 0x1cafe02, PFN: 0x19f6d2, page: ffffea00081d21a0, ref_count: 2
...
  
Michael Roth Nov. 29, 2022, 12:37 a.m. UTC | #20
On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Introduce 'memfd_restricted' system call with the ability to create
> memory areas that are restricted from userspace access through ordinary
> MMU operations (e.g. read/write/mmap). The memory content is expected to
> be used through a new in-kernel interface by a third kernel module.
> 
> memfd_restricted() is useful for scenarios where a file descriptor(fd)
> can be used as an interface into mm but want to restrict userspace's
> ability on the fd. Initially it is designed to provide protections for
> KVM encrypted guest memory.
> 
> Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> (e.g. QEMU) and then using the mmaped virtual address to setup the
> mapping in the KVM secondary page table (e.g. EPT). With confidential
> computing technologies like Intel TDX, the memfd memory may be encrypted
> with special key for special software domain (e.g. KVM guest) and is not
> expected to be directly accessed by userspace. Precisely, userspace
> access to such encrypted memory may lead to host crash so should be
> prevented.
> 
> memfd_restricted() provides semantics required for KVM guest encrypted
> memory support that a fd created with memfd_restricted() is going to be
> used as the source of guest memory in confidential computing environment
> and KVM can directly interact with core-mm without the need to expose
> the memoy content into KVM userspace.
> 
> KVM userspace is still in charge of the lifecycle of the fd. It should
> pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> obtain the physical memory page and then uses it to populate the KVM
> secondary page table entries.
> 
> The userspace restricted memfd can be fallocate-ed or hole-punched
> from userspace. When these operations happen, KVM can get notified
> through restrictedmem_notifier, it then gets chance to remove any
> mapped entries of the range in the secondary page tables.
> 
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
> 
> By default memfd_restricted() prevents userspace read, write and mmap.
> By defining new bit in the 'flags', it can be extended to support other
> restricted semantics in the future.
> 
> The system call is currently wired up for x86 arch.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/restrictedmem.h          |  62 ++++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
>  10 files changed, 328 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..dc70ba90247e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
>  448	i386	process_mrelease	sys_process_mrelease
>  449	i386	futex_waitv		sys_futex_waitv
>  450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	i386	memfd_restricted	sys_memfd_restricted
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..06516abc8318 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448	common	process_mrelease	sys_process_mrelease
>  449	common	futex_waitv		sys_futex_waitv
>  450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
> +451	common	memfd_restricted	sys_memfd_restricted
>  
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..9c37c3ea3180
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H
> +
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/pfn_t.h>
> +
> +struct restrictedmem_notifier;
> +
> +struct restrictedmem_notifier_ops {
> +	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> +				 pgoff_t start, pgoff_t end);
> +	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> +			       pgoff_t start, pgoff_t end);
> +};
> +
> +struct restrictedmem_notifier {
> +	struct list_head list;
> +	const struct restrictedmem_notifier_ops *ops;
> +};
> +
> +#ifdef CONFIG_RESTRICTEDMEM
> +
> +void restrictedmem_register_notifier(struct file *file,
> +				     struct restrictedmem_notifier *notifier);
> +void restrictedmem_unregister_notifier(struct file *file,
> +				       struct restrictedmem_notifier *notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +			   struct page **pagep, int *order);
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> +}
> +
> +#else
> +
> +static inline void restrictedmem_register_notifier(struct file *file,
> +				     struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline void restrictedmem_unregister_notifier(struct file *file,
> +				       struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +					 struct page **pagep, int *order)
> +{
> +	return -1;
> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a34b0f9a9972..f9e9e0c820c5 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>  					    unsigned long home_node,
>  					    unsigned long flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 45fa180cc56a..e93cd35e46d0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>  #define __NR_set_mempolicy_home_node 450
>  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>  
> +#define __NR_memfd_restricted 451
> +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 451
> +#define __NR_syscalls 452
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..8aa38324b90a 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
>  #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
>  #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> +#define RESTRICTEDMEM_MAGIC	0x5245534d	/* "RESM" */
>  
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 860b2dcf3ac4..7c4a32cbd2e7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
>  /* memfd_secret */
>  COND_SYSCALL(memfd_secret);
>  
> +/* memfd_restricted */
> +COND_SYSCALL(memfd_restricted);
> +
>  /*
>   * Architecture specific weak syscall entries.
>   */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0331f1461f81..0177d53676c7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1076,6 +1076,10 @@ config IO_MAPPING
>  config SECRETMEM
>  	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>  
> +config RESTRICTEDMEM
> +	bool
> +	depends on TMPFS
> +
>  config ANON_VMA_NAME
>  	bool "Anonymous VMA name support"
>  	depends on PROC_FS && ADVISE_SYSCALLS && MMU
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f836403..6cb6403ffd40 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -117,6 +117,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_SECRETMEM) += secretmem.o
> +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
>  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..e5bf8907e0f8
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,250 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {
> +	struct mutex lock;
> +	struct file *memfd;
> +	struct list_head notifiers;
> +};
> +
> +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> +				 pgoff_t start, pgoff_t end, bool notify_start)
> +{
> +	struct restrictedmem_notifier *notifier;
> +
> +	mutex_lock(&data->lock);
> +	list_for_each_entry(notifier, &data->notifiers, list) {
> +		if (notify_start)
> +			notifier->ops->invalidate_start(notifier, start, end);
> +		else
> +			notifier->ops->invalidate_end(notifier, start, end);
> +	}
> +	mutex_unlock(&data->lock);
> +}
> +
> +static int restrictedmem_release(struct inode *inode, struct file *file)
> +{
> +	struct restrictedmem_data *data = inode->i_mapping->private_data;
> +
> +	fput(data->memfd);
> +	kfree(data);
> +	return 0;
> +}
> +
> +static long restrictedmem_fallocate(struct file *file, int mode,
> +				    loff_t offset, loff_t len)
> +{
> +	struct restrictedmem_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	int ret;
> +
> +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +			return -EINVAL;
> +	}
> +
> +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);

The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
loff_t. For SNP we've made this change as part of the following patch
and it seems to produce the expected behavior:

  https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
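
The gist of it (sketch; the actual commit above may differ in details) is
converting the byte range to page offsets before notifying:

  	pgoff_t start = offset >> PAGE_SHIFT;
  	pgoff_t end = (offset + len) >> PAGE_SHIFT;

  	restrictedmem_notifier_invalidate(data, start, end, true);
  	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
  	restrictedmem_notifier_invalidate(data, start, end, false);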

> +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> +	return ret;
> +}
> +

<snip>

> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +			   struct page **pagep, int *order)
> +{
> +	struct restrictedmem_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	struct page *page;
> +	int ret;
> +
> +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);

This will result in KVM allocating pages that userspace hasn't necessarily
fallocate()'d. In the case of SNP we need to get the PFN so we can clean
up the RMP entries when restrictedmem invalidations are issued for a GFN
range.

If the guest supports lazy-acceptance however, these pages may not have
been faulted in yet, and if the VMM defers actually fallocate()'ing space
until the guest actually tries to issue a shared->private conversion for that GFN
(to support lazy-pinning), then there may never be a need to allocate
pages for these backends.

However, the restrictedmem invalidations are for GFN ranges so there's
no way to know in advance whether it's been allocated yet or not. The
xarray is one option but currently it defaults to 'private' so that
doesn't help us here. It might if we introduced a 'uninitialized' state
or something along that line instead of just the binary
'shared'/'private' though...

But for now we added a restrictedmem_get_page_noalloc() that uses
SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
of memory as part of guest shutdown, and a
kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
default, and we just propagate an error to userspace if they didn't
fallocate() in advance?
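
For illustration, the SGP_NOALLOC flavor could look something like this
(sketch only, not necessarily what our tree does; with SGP_NOALLOC shmem
fails on a hole instead of allocating a page for it):

  int restrictedmem_get_page_noalloc(struct file *file, pgoff_t offset,
                                     struct page **pagep, int *order)
  {
          struct restrictedmem_data *data = file->f_mapping->private_data;
          struct file *memfd = data->memfd;
          struct page *page;
          int ret;

          /* SGP_NOALLOC: report a hole as an error rather than filling it */
          ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_NOALLOC);
          if (ret)
                  return ret;

          *pagep = page;
          if (order)
                  *order = thp_order(compound_head(page));

          SetPageUptodate(page);
          unlock_page(page);

          return 0;
  }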

-Mike

> +	if (ret)
> +		return ret;
> +
> +	*pagep = page;
> +	if (order)
> +		*order = thp_order(compound_head(page));
> +
> +	SetPageUptodate(page);
> +	unlock_page(page);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> -- 
> 2.25.1
>
  
Kirill A. Shutemov Nov. 29, 2022, 11:21 a.m. UTC | #21
On Mon, Nov 28, 2022 at 06:06:32PM -0600, Michael Roth wrote:
> On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> 
> <snip>
> 
> > +static struct file *restrictedmem_file_create(struct file *memfd)
> > +{
> > +	struct restrictedmem_data *data;
> > +	struct address_space *mapping;
> > +	struct inode *inode;
> > +	struct file *file;
> > +
> > +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> > +	if (!data)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	data->memfd = memfd;
> > +	mutex_init(&data->lock);
> > +	INIT_LIST_HEAD(&data->notifiers);
> > +
> > +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> > +	if (IS_ERR(inode)) {
> > +		kfree(data);
> > +		return ERR_CAST(inode);
> > +	}
> > +
> > +	inode->i_mode |= S_IFREG;
> > +	inode->i_op = &restrictedmem_iops;
> > +	inode->i_mapping->private_data = data;
> > +
> > +	file = alloc_file_pseudo(inode, restrictedmem_mnt,
> > +				 "restrictedmem", O_RDWR,
> > +				 &restrictedmem_fops);
> > +	if (IS_ERR(file)) {
> > +		iput(inode);
> > +		kfree(data);
> > +		return ERR_CAST(file);
> > +	}
> > +
> > +	file->f_flags |= O_LARGEFILE;
> > +
> > +	mapping = memfd->f_mapping;
> > +	mapping_set_unevictable(mapping);
> > +	mapping_set_gfp_mask(mapping,
> > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> 
> Is this supposed to prevent migration of pages being used for
> restrictedmem/shmem backend?

Yes, my bad. I expected it to prevent migration, but it is not true.

Looks like we need to bump refcount in restrictedmem_get_page() and reduce
it back when KVM is no longer using it.

Chao, could you adjust it?
  
David Hildenbrand Nov. 29, 2022, 11:39 a.m. UTC | #22
On 29.11.22 12:21, Kirill A. Shutemov wrote:
> On Mon, Nov 28, 2022 at 06:06:32PM -0600, Michael Roth wrote:
>> On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>
>>
>> <snip>
>>
>>> +static struct file *restrictedmem_file_create(struct file *memfd)
>>> +{
>>> +	struct restrictedmem_data *data;
>>> +	struct address_space *mapping;
>>> +	struct inode *inode;
>>> +	struct file *file;
>>> +
>>> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
>>> +	if (!data)
>>> +		return ERR_PTR(-ENOMEM);
>>> +
>>> +	data->memfd = memfd;
>>> +	mutex_init(&data->lock);
>>> +	INIT_LIST_HEAD(&data->notifiers);
>>> +
>>> +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
>>> +	if (IS_ERR(inode)) {
>>> +		kfree(data);
>>> +		return ERR_CAST(inode);
>>> +	}
>>> +
>>> +	inode->i_mode |= S_IFREG;
>>> +	inode->i_op = &restrictedmem_iops;
>>> +	inode->i_mapping->private_data = data;
>>> +
>>> +	file = alloc_file_pseudo(inode, restrictedmem_mnt,
>>> +				 "restrictedmem", O_RDWR,
>>> +				 &restrictedmem_fops);
>>> +	if (IS_ERR(file)) {
>>> +		iput(inode);
>>> +		kfree(data);
>>> +		return ERR_CAST(file);
>>> +	}
>>> +
>>> +	file->f_flags |= O_LARGEFILE;
>>> +
>>> +	mapping = memfd->f_mapping;
>>> +	mapping_set_unevictable(mapping);
>>> +	mapping_set_gfp_mask(mapping,
>>> +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
>>
>> Is this supposed to prevent migration of pages being used for
>> restrictedmem/shmem backend?
> 
> Yes, my bad. I expected it to prevent migration, but it is not true.

Maybe add a comment that these pages are not movable and we don't want 
to place them into movable pageblocks (including CMA and ZONE_MOVABLE). 
That's the primary purpose of the GFP mask here.
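
E.g. something along these lines (comment wording is just a suggestion):

  	mapping = memfd->f_mapping;
  	mapping_set_unevictable(mapping);
  	/*
  	 * Restricted pages cannot be migrated for now, so don't allocate
  	 * them from movable pageblocks (e.g. CMA and ZONE_MOVABLE); that's
  	 * the main purpose of clearing __GFP_MOVABLE here.
  	 */
  	mapping_set_gfp_mask(mapping,
  			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);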
  
Chao Peng Nov. 29, 2022, 1:58 p.m. UTC | #23
On Tue, Nov 29, 2022 at 02:21:39PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 28, 2022 at 06:06:32PM -0600, Michael Roth wrote:
> > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > 
> > 
> > <snip>
> > 
> > > +static struct file *restrictedmem_file_create(struct file *memfd)
> > > +{
> > > +	struct restrictedmem_data *data;
> > > +	struct address_space *mapping;
> > > +	struct inode *inode;
> > > +	struct file *file;
> > > +
> > > +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> > > +	if (!data)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	data->memfd = memfd;
> > > +	mutex_init(&data->lock);
> > > +	INIT_LIST_HEAD(&data->notifiers);
> > > +
> > > +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> > > +	if (IS_ERR(inode)) {
> > > +		kfree(data);
> > > +		return ERR_CAST(inode);
> > > +	}
> > > +
> > > +	inode->i_mode |= S_IFREG;
> > > +	inode->i_op = &restrictedmem_iops;
> > > +	inode->i_mapping->private_data = data;
> > > +
> > > +	file = alloc_file_pseudo(inode, restrictedmem_mnt,
> > > +				 "restrictedmem", O_RDWR,
> > > +				 &restrictedmem_fops);
> > > +	if (IS_ERR(file)) {
> > > +		iput(inode);
> > > +		kfree(data);
> > > +		return ERR_CAST(file);
> > > +	}
> > > +
> > > +	file->f_flags |= O_LARGEFILE;
> > > +
> > > +	mapping = memfd->f_mapping;
> > > +	mapping_set_unevictable(mapping);
> > > +	mapping_set_gfp_mask(mapping,
> > > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > 
> > Is this supposed to prevent migration of pages being used for
> > restrictedmem/shmem backend?
> 
> Yes, my bad. I expected it to prevent migration, but it is not true.
> 
> Looks like we need to bump refcount in restrictedmem_get_page() and reduce
> it back when KVM is no longer use it.

restrictedmem_get_page() has taken a reference, but KVM later drops it
(via kvm_release_pfn_clean()) after populating the secondary page table
entry. One option would be to let the user feature (e.g. TDX/SEV) do the
get_page()/put_page() while populating the secondary page table entry;
AFAICS, this requirement also comes from those features.
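
As a sketch of that option (the function names below are placeholders, not
from this series), the feature code would hold its own reference for as long
as the PFN stays mapped in the secondary page table:

  /* when installing the PFN into the secondary page table (e.g. EPT) */
  static void feature_map_restricted_pfn(kvm_pfn_t pfn)
  {
          get_page(pfn_to_page(pfn));     /* pinned while mapped */
  }

  /* when the secondary page table entry is zapped/invalidated */
  static void feature_unmap_restricted_pfn(kvm_pfn_t pfn)
  {
          put_page(pfn_to_page(pfn));     /* drop the pin once unmapped */
  }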

Chao
> 
> Chao, could you adjust it?
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov
  
Chao Peng Nov. 29, 2022, 1:59 p.m. UTC | #24
On Tue, Nov 29, 2022 at 12:39:06PM +0100, David Hildenbrand wrote:
> On 29.11.22 12:21, Kirill A. Shutemov wrote:
> > On Mon, Nov 28, 2022 at 06:06:32PM -0600, Michael Roth wrote:
> > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > > > 
> > > 
> > > <snip>
> > > 
> > > > +static struct file *restrictedmem_file_create(struct file *memfd)
> > > > +{
> > > > +	struct restrictedmem_data *data;
> > > > +	struct address_space *mapping;
> > > > +	struct inode *inode;
> > > > +	struct file *file;
> > > > +
> > > > +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> > > > +	if (!data)
> > > > +		return ERR_PTR(-ENOMEM);
> > > > +
> > > > +	data->memfd = memfd;
> > > > +	mutex_init(&data->lock);
> > > > +	INIT_LIST_HEAD(&data->notifiers);
> > > > +
> > > > +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> > > > +	if (IS_ERR(inode)) {
> > > > +		kfree(data);
> > > > +		return ERR_CAST(inode);
> > > > +	}
> > > > +
> > > > +	inode->i_mode |= S_IFREG;
> > > > +	inode->i_op = &restrictedmem_iops;
> > > > +	inode->i_mapping->private_data = data;
> > > > +
> > > > +	file = alloc_file_pseudo(inode, restrictedmem_mnt,
> > > > +				 "restrictedmem", O_RDWR,
> > > > +				 &restrictedmem_fops);
> > > > +	if (IS_ERR(file)) {
> > > > +		iput(inode);
> > > > +		kfree(data);
> > > > +		return ERR_CAST(file);
> > > > +	}
> > > > +
> > > > +	file->f_flags |= O_LARGEFILE;
> > > > +
> > > > +	mapping = memfd->f_mapping;
> > > > +	mapping_set_unevictable(mapping);
> > > > +	mapping_set_gfp_mask(mapping,
> > > > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > 
> > > Is this supposed to prevent migration of pages being used for
> > > restrictedmem/shmem backend?
> > 
> > Yes, my bad. I expected it to prevent migration, but it is not true.
> 
> Maybe add a comment that these pages are not movable and we don't want to
> place them into movable pageblocks (including CMA and ZONE_MOVABLE). That's
> the primary purpose of the GFP mask here.

Yes I can do that.

Chao
> 
> -- 
> Thanks,
> 
> David / dhildenb
  
Chao Peng Nov. 29, 2022, 2:06 p.m. UTC | #25
On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
...
> > +static long restrictedmem_fallocate(struct file *file, int mode,
> > +				    loff_t offset, loff_t len)
> > +{
> > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > +	struct file *memfd = data->memfd;
> > +	int ret;
> > +
> > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +			return -EINVAL;
> > +	}
> > +
> > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> 
> The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> loff_t. For SNP we've made this strange as part of the following patch
> and it seems to produce the expected behavior:

That's correct. Thanks.

> 
>   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> 
> > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > +	return ret;
> > +}
> > +
> 
> <snip>
> 
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +			   struct page **pagep, int *order)
> > +{
> > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > +	struct file *memfd = data->memfd;
> > +	struct page *page;
> > +	int ret;
> > +
> > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> 
> This will result in KVM allocating pages that userspace hasn't necessary
> fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> up the RMP entries when restrictedmem invalidations are issued for a GFN
> range.

Yes, fallocate() is unnecessary unless someone wants to reserve some
space (e.g. for determinism or performance purposes); this matches its
semantics perfectly, as described at:
https://www.man7.org/linux/man-pages/man2/fallocate.2.html

> 
> If the guest supports lazy-acceptance however, these pages may not have
> been faulted in yet, and if the VMM defers actually fallocate()'ing space
> until the guest actually tries to issue a shared->private for that GFN
> (to support lazy-pinning), then there may never be a need to allocate
> pages for these backends.
> 
> However, the restrictedmem invalidations are for GFN ranges so there's
> no way to know inadvance whether it's been allocated yet or not. The
> xarray is one option but currently it defaults to 'private' so that
> doesn't help us here. It might if we introduced a 'uninitialized' state
> or something along that line instead of just the binary
> 'shared'/'private' though...

How about if we change the default to 'shared' as we discussed at
https://lore.kernel.org/all/Y35gI0L8GMt9+OkK@google.com/?
> 
> But for now we added a restrictedmem_get_page_noalloc() that uses
> SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> of memory as part of guest shutdown, and a
> kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> default, and we just propagate an error to userspace if they didn't
> fallocate() in advance?

This (making fallocate() a hard requirement) not only complicates
userspace but also forces lazy-faulting to go through a long path of
exiting to userspace. Unless we have no other options, I would not go
this way.

Chao
> 
> -Mike
> 
> > +	if (ret)
> > +		return ret;
> > +
> > +	*pagep = page;
> > +	if (order)
> > +		*order = thp_order(compound_head(page));
> > +
> > +	SetPageUptodate(page);
> > +	unlock_page(page);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > -- 
> > 2.25.1
> >
  
Vishal Annapurve Nov. 29, 2022, 6:01 p.m. UTC | #26
On Mon, Nov 28, 2022 at 4:37 PM Michael Roth <michael.roth@amd.com> wrote:
>
> On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > Introduce 'memfd_restricted' system call with the ability to create
> > memory areas that are restricted from userspace access through ordinary
> > MMU operations (e.g. read/write/mmap). The memory content is expected to
> > be used through a new in-kernel interface by a third kernel module.
> >
> > memfd_restricted() is useful for scenarios where a file descriptor(fd)
> > can be used as an interface into mm but want to restrict userspace's
> > ability on the fd. Initially it is designed to provide protections for
> > KVM encrypted guest memory.
> >
> > Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> > (e.g. QEMU) and then using the mmaped virtual address to setup the
> > mapping in the KVM secondary page table (e.g. EPT). With confidential
> > computing technologies like Intel TDX, the memfd memory may be encrypted
> > with special key for special software domain (e.g. KVM guest) and is not
> > expected to be directly accessed by userspace. Precisely, userspace
> > access to such encrypted memory may lead to host crash so should be
> > prevented.
> >
> > memfd_restricted() provides semantics required for KVM guest encrypted
> > memory support that a fd created with memfd_restricted() is going to be
> > used as the source of guest memory in confidential computing environment
> > and KVM can directly interact with core-mm without the need to expose
> > the memoy content into KVM userspace.
> >
> > KVM userspace is still in charge of the lifecycle of the fd. It should
> > pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> > obtain the physical memory page and then uses it to populate the KVM
> > secondary page table entries.
> >
> > The userspace restricted memfd can be fallocate-ed or hole-punched
> > from userspace. When these operations happen, KVM can get notified
> > through restrictedmem_notifier, it then gets chance to remove any
> > mapped entries of the range in the secondary page tables.
> >
> > memfd_restricted() itself is implemented as a shim layer on top of real
> > memory file systems (currently tmpfs). Pages in restrictedmem are marked
> > as unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> >
> > By default memfd_restricted() prevents userspace read, write and mmap.
> > By defining new bit in the 'flags', it can be extended to support other
> > restricted semantics in the future.
> >
> > The system call is currently wired up for x86 arch.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
> >  include/linux/restrictedmem.h          |  62 ++++++
> >  include/linux/syscalls.h               |   1 +
> >  include/uapi/asm-generic/unistd.h      |   5 +-
> >  include/uapi/linux/magic.h             |   1 +
> >  kernel/sys_ni.c                        |   3 +
> >  mm/Kconfig                             |   4 +
> >  mm/Makefile                            |   1 +
> >  mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
> >  10 files changed, 328 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/restrictedmem.h
> >  create mode 100644 mm/restrictedmem.c
> >
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 320480a8db4f..dc70ba90247e 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -455,3 +455,4 @@
> >  448  i386    process_mrelease        sys_process_mrelease
> >  449  i386    futex_waitv             sys_futex_waitv
> >  450  i386    set_mempolicy_home_node         sys_set_mempolicy_home_node
> > +451  i386    memfd_restricted        sys_memfd_restricted
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index c84d12608cd2..06516abc8318 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -372,6 +372,7 @@
> >  448  common  process_mrelease        sys_process_mrelease
> >  449  common  futex_waitv             sys_futex_waitv
> >  450  common  set_mempolicy_home_node sys_set_mempolicy_home_node
> > +451  common  memfd_restricted        sys_memfd_restricted
> >
> >  #
> >  # Due to a historical design error, certain syscalls are numbered differently
> > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> > new file mode 100644
> > index 000000000000..9c37c3ea3180
> > --- /dev/null
> > +++ b/include/linux/restrictedmem.h
> > @@ -0,0 +1,62 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _LINUX_RESTRICTEDMEM_H
> > +
> > +#include <linux/file.h>
> > +#include <linux/magic.h>
> > +#include <linux/pfn_t.h>
> > +
> > +struct restrictedmem_notifier;
> > +
> > +struct restrictedmem_notifier_ops {
> > +     void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> > +                              pgoff_t start, pgoff_t end);
> > +     void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> > +                            pgoff_t start, pgoff_t end);
> > +};
> > +
> > +struct restrictedmem_notifier {
> > +     struct list_head list;
> > +     const struct restrictedmem_notifier_ops *ops;
> > +};
> > +
> > +#ifdef CONFIG_RESTRICTEDMEM
> > +
> > +void restrictedmem_register_notifier(struct file *file,
> > +                                  struct restrictedmem_notifier *notifier);
> > +void restrictedmem_unregister_notifier(struct file *file,
> > +                                    struct restrictedmem_notifier *notifier);
> > +
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                        struct page **pagep, int *order);
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +     return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> > +}
> > +
> > +#else
> > +
> > +static inline void restrictedmem_register_notifier(struct file *file,
> > +                                  struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline void restrictedmem_unregister_notifier(struct file *file,
> > +                                    struct restrictedmem_notifier *notifier)
> > +{
> > +}
> > +
> > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                                      struct page **pagep, int *order)
> > +{
> > +     return -1;
> > +}
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +     return false;
> > +}
> > +
> > +#endif /* CONFIG_RESTRICTEDMEM */
> > +
> > +#endif /* _LINUX_RESTRICTEDMEM_H */
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index a34b0f9a9972..f9e9e0c820c5 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
> >  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
> >                                           unsigned long home_node,
> >                                           unsigned long flags);
> > +asmlinkage long sys_memfd_restricted(unsigned int flags);
> >
> >  /*
> >   * Architecture-specific system calls
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index 45fa180cc56a..e93cd35e46d0 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
> >  #define __NR_set_mempolicy_home_node 450
> >  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
> >
> > +#define __NR_memfd_restricted 451
> > +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> > +
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 451
> > +#define __NR_syscalls 452
> >
> >  /*
> >   * 32 bit systems traditionally used different
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index 6325d1d0e90f..8aa38324b90a 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -101,5 +101,6 @@
> >  #define DMA_BUF_MAGIC                0x444d4142      /* "DMAB" */
> >  #define DEVMEM_MAGIC         0x454d444d      /* "DMEM" */
> >  #define SECRETMEM_MAGIC              0x5345434d      /* "SECM" */
> > +#define RESTRICTEDMEM_MAGIC  0x5245534d      /* "RESM" */
> >
> >  #endif /* __LINUX_MAGIC_H__ */
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 860b2dcf3ac4..7c4a32cbd2e7 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
> >  /* memfd_secret */
> >  COND_SYSCALL(memfd_secret);
> >
> > +/* memfd_restricted */
> > +COND_SYSCALL(memfd_restricted);
> > +
> >  /*
> >   * Architecture specific weak syscall entries.
> >   */
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 0331f1461f81..0177d53676c7 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1076,6 +1076,10 @@ config IO_MAPPING
> >  config SECRETMEM
> >       def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
> >
> > +config RESTRICTEDMEM
> > +     bool
> > +     depends on TMPFS
> > +
> >  config ANON_VMA_NAME
> >       bool "Anonymous VMA name support"
> >       depends on PROC_FS && ADVISE_SYSCALLS && MMU
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 9a564f836403..6cb6403ffd40 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -117,6 +117,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
> >  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
> >  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
> >  obj-$(CONFIG_SECRETMEM) += secretmem.o
> > +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
> >  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
> >  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
> >  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > new file mode 100644
> > index 000000000000..e5bf8907e0f8
> > --- /dev/null
> > +++ b/mm/restrictedmem.c
> > @@ -0,0 +1,250 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include <linux/pagemap.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/shmem_fs.h>
> > +#include <linux/syscalls.h>
> > +#include <uapi/linux/falloc.h>
> > +#include <uapi/linux/magic.h>
> > +#include <linux/restrictedmem.h>
> > +
> > +struct restrictedmem_data {
> > +     struct mutex lock;
> > +     struct file *memfd;
> > +     struct list_head notifiers;
> > +};
> > +
> > +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> > +                              pgoff_t start, pgoff_t end, bool notify_start)
> > +{
> > +     struct restrictedmem_notifier *notifier;
> > +
> > +     mutex_lock(&data->lock);
> > +     list_for_each_entry(notifier, &data->notifiers, list) {
> > +             if (notify_start)
> > +                     notifier->ops->invalidate_start(notifier, start, end);
> > +             else
> > +                     notifier->ops->invalidate_end(notifier, start, end);
> > +     }
> > +     mutex_unlock(&data->lock);
> > +}
> > +
> > +static int restrictedmem_release(struct inode *inode, struct file *file)
> > +{
> > +     struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +
> > +     fput(data->memfd);
> > +     kfree(data);
> > +     return 0;
> > +}
> > +
> > +static long restrictedmem_fallocate(struct file *file, int mode,
> > +                                 loff_t offset, loff_t len)
> > +{
> > +     struct restrictedmem_data *data = file->f_mapping->private_data;
> > +     struct file *memfd = data->memfd;
> > +     int ret;
> > +
> > +     if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +             if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +                     return -EINVAL;
> > +     }
> > +
> > +     restrictedmem_notifier_invalidate(data, offset, offset + len, true);
>
> The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> loff_t. For SNP we've made this strange as part of the following patch
> and it seems to produce the expected behavior:
>
>   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
>
> > +     ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +     restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > +     return ret;
> > +}
> > +
>
> <snip>
>
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                        struct page **pagep, int *order)
> > +{
> > +     struct restrictedmem_data *data = file->f_mapping->private_data;
> > +     struct file *memfd = data->memfd;
> > +     struct page *page;
> > +     int ret;
> > +
> > +     ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
>
> This will result in KVM allocating pages that userspace hasn't necessarily
> fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> up the RMP entries when restrictedmem invalidations are issued for a GFN
> range.
>
> If the guest supports lazy-acceptance however, these pages may not have
> been faulted in yet, and if the VMM defers actually fallocate()'ing space
> until the guest actually tries to issue a shared->private conversion for that GFN
> (to support lazy-pinning), then there may never be a need to allocate
> pages for these backends.
>
> However, the restrictedmem invalidations are for GFN ranges so there's
> no way to know in advance whether it's been allocated yet or not. The
> xarray is one option but currently it defaults to 'private' so that
> doesn't help us here. It might if we introduced a 'uninitialized' state
> or something along that line instead of just the binary
> 'shared'/'private' though...
>
> But for now we added a restrictedmem_get_page_noalloc() that uses
> SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> of memory as part of guest shutdown, and a
> kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> default, and we just propagate an error to userspace if they didn't
> fallocate() in advance?
>

One caveat with SGP_NOALLOC being the default: for performance reasons
(to avoid frequent userspace exits), the VMM will have to preallocate
all of the guest restricted memory up front. In general this will
prevent the VMM from overcommitting.
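
Just to make that concrete, a minimal sketch of the _noalloc variant being
discussed might look like the following (this is not the posted patch; the
function name and the exact SGP flag/error behaviour are assumptions):

/*
 * Sketch only: same as restrictedmem_get_page(), but SGP_NOALLOC makes
 * shmem fail on a hole instead of allocating, so ranges that were never
 * fallocate()'d are reported back as an error rather than populated.
 */
int restrictedmem_get_page_noalloc(struct file *file, pgoff_t offset,
				   struct page **pagep, int *order)
{
	struct restrictedmem_data *data = file->f_mapping->private_data;
	struct file *memfd = data->memfd;
	struct page *page;
	int ret;

	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_NOALLOC);
	if (ret)
		return ret;

	*pagep = page;
	if (order)
		*order = thp_order(compound_head(page));

	SetPageUptodate(page);
	unlock_page(page);
	return 0;
}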


> -Mike
>
> > +     if (ret)
> > +             return ret;
> > +
> > +     *pagep = page;
> > +     if (order)
> > +             *order = thp_order(compound_head(page));
> > +
> > +     SetPageUptodate(page);
> > +     unlock_page(page);
> > +
> > +     return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > --
> > 2.25.1
> >
  
Michael Roth Nov. 29, 2022, 7:06 p.m. UTC | #27
On Tue, Nov 29, 2022 at 10:06:15PM +0800, Chao Peng wrote:
> On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> ...
> > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > +				    loff_t offset, loff_t len)
> > > +{
> > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > +	struct file *memfd = data->memfd;
> > > +	int ret;
> > > +
> > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > +			return -EINVAL;
> > > +	}
> > > +
> > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> > 
> > The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> > loff_t. For SNP we've made this strange as part of the following patch
> > and it seems to produce the expected behavior:
> 
> That's correct. Thanks.
> 
> > 
> >   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> > 
> > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > > +	return ret;
> > > +}
> > > +
> > 
> > <snip>
> > 
> > > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > > +			   struct page **pagep, int *order)
> > > +{
> > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > +	struct file *memfd = data->memfd;
> > > +	struct page *page;
> > > +	int ret;
> > > +
> > > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > 
> > This will result in KVM allocating pages that userspace hasn't necessary
> > fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> > up the RMP entries when restrictedmem invalidations are issued for a GFN
> > range.
> 
> Yes fallocate() is unnecessary unless someone wants to reserve some
> space (e.g. for determinism or performance purposes); this matches its
> semantics perfectly at:
> https://www.man7.org/linux/man-pages/man2/fallocate.2.html
> 
> > 
> > If the guest supports lazy-acceptance however, these pages may not have
> > been faulted in yet, and if the VMM defers actually fallocate()'ing space
> > until the guest actually tries to issue a shared->private for that GFN
> > (to support lazy-pinning), then there may never be a need to allocate
> > pages for these backends.
> > 
> > However, the restrictedmem invalidations are for GFN ranges so there's
> > no way to know inadvance whether it's been allocated yet or not. The
> > xarray is one option but currently it defaults to 'private' so that
> > doesn't help us here. It might if we introduced a 'uninitialized' state
> > or something along that line instead of just the binary
> > 'shared'/'private' though...
> 
> How about if we change the default to 'shared' as we discussed at
> https://lore.kernel.org/all/Y35gI0L8GMt9+OkK@google.com/?

Need to look at this a bit more, but I think that could work as well.

> > 
> > But for now we added a restrictedmem_get_page_noalloc() that uses
> > SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> > of memory as part of guest shutdown, and a
> > kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> > maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> > default, and we just propagate an error to userspace if they didn't
> > fallocate() in advance?
> 
> This (making fallocate() a hard requirement) not only complicates the
> userspace but also forces the lazy-faulting to go through a long path of
> exiting to userspace. Unless we don't have other options I would not go
> this way.

Unless I'm missing something, it's already the case that userspace is
responsible for handling all the shared->private transitions in response
to KVM_EXIT_MEMORY_FAULT or (in our case) KVM_EXIT_VMGEXIT. So it only
places the additional requirement on the VMM that if they *don't*
preallocate, then they'll need to issue the fallocate() prior to issuing
the KVM_MEM_ENCRYPT_REG_REGION ioctl in response to these events.

QEMU for example already has a separate 'prealloc' option for cases
where they want to prefault all the guest memory, so it makes sense to
continue making that an optional thing with regard to UPM.
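
For illustration, the ordering being described boils down to something like
the sketch below on the VMM side (the helper name and error handling are
made up; the ioctl is the existing KVM_MEMORY_ENCRYPT_REG_REGION from the
uapi headers, spelled KVM_MEM_ENCRYPT_REG_REGION above):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: back the restricted range before telling KVM it is private. */
static int convert_shared_to_private(int vm_fd, int restricted_fd,
				     uint64_t gpa, uint64_t offset,
				     uint64_t size)
{
	struct kvm_enc_region region = { .addr = gpa, .size = size };

	/* allocate the backing pages first (mode 0 == allocate) */
	if (fallocate(restricted_fd, 0, offset, size))
		return -errno;

	/* then register the range as private/encrypted with KVM */
	if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region))
		return -errno;

	return 0;
}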

-Mike

> 
> Chao
> > 
> > -Mike
> > 
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	*pagep = page;
> > > +	if (order)
> > > +		*order = thp_order(compound_head(page));
> > > +
> > > +	SetPageUptodate(page);
> > > +	unlock_page(page);
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > > -- 
> > > 2.25.1
> > >
  
Michael Roth Nov. 29, 2022, 7:18 p.m. UTC | #28
On Tue, Nov 29, 2022 at 01:06:58PM -0600, Michael Roth wrote:
> On Tue, Nov 29, 2022 at 10:06:15PM +0800, Chao Peng wrote:
> > On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > ...
> > > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > > +				    loff_t offset, loff_t len)
> > > > +{
> > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > +	struct file *memfd = data->memfd;
> > > > +	int ret;
> > > > +
> > > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > +			return -EINVAL;
> > > > +	}
> > > > +
> > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> > > 
> > > The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> > > loff_t. For SNP we've made this strange as part of the following patch
> > > and it seems to produce the expected behavior:
> > 
> > That's correct. Thanks.
> > 
> > > 
> > >   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> > > 
> > > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > > > +	return ret;
> > > > +}
> > > > +
> > > 
> > > <snip>
> > > 
> > > > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > > > +			   struct page **pagep, int *order)
> > > > +{
> > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > +	struct file *memfd = data->memfd;
> > > > +	struct page *page;
> > > > +	int ret;
> > > > +
> > > > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > > 
> > > This will result in KVM allocating pages that userspace hasn't necessary
> > > fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> > > up the RMP entries when restrictedmem invalidations are issued for a GFN
> > > range.
> > 
> > Yes fallocate() is unnecessary unless someone wants to reserve some
> > space (e.g. for determination or performance purpose), this matches its
> > semantics perfectly at:
> > https://www.man7.org/linux/man-pages/man2/fallocate.2.html
> > 
> > > 
> > > If the guest supports lazy-acceptance however, these pages may not have
> > > been faulted in yet, and if the VMM defers actually fallocate()'ing space
> > > until the guest actually tries to issue a shared->private for that GFN
> > > (to support lazy-pinning), then there may never be a need to allocate
> > > pages for these backends.
> > > 
> > > However, the restrictedmem invalidations are for GFN ranges so there's
> > > no way to know inadvance whether it's been allocated yet or not. The
> > > xarray is one option but currently it defaults to 'private' so that
> > > doesn't help us here. It might if we introduced a 'uninitialized' state
> > > or something along that line instead of just the binary
> > > 'shared'/'private' though...
> > 
> > How about if we change the default to 'shared' as we discussed at
> > https://lore.kernel.org/all/Y35gI0L8GMt9+OkK@google.com/?
> 
> Need to look at this a bit more, but I think that could work as well.
> 
> > > 
> > > But for now we added a restrictedmem_get_page_noalloc() that uses
> > > SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> > > of memory as part of guest shutdown, and a
> > > kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> > > maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> > > default, and we just propagate an error to userspace if they didn't
> > > fallocate() in advance?
> > 
> > This (making fallocate() a hard requirement) not only complicates the
> > userspace but also forces the lazy-faulting going through a long path of
> > exiting to userspace. Unless we don't have other options I would not go
> > this way.
> 
> Unless I'm missing something, it's already the case that userspace is
> responsible for handling all the shared->private transitions in response
> to KVM_EXIT_MEMORY_FAULT or (in our case) KVM_EXIT_VMGEXIT. So it only
> places the additional requirements on the VMM that if they *don't*
> preallocate, then they'll need to issue the fallocate() prior to issuing
> the KVM_MEM_ENCRYPT_REG_REGION ioctl in response to these events.
> 
> QEMU for example already has a separate 'prealloc' option for cases
> where they want to prefault all the guest memory, so it makes sense to
> continue making that an optional thing with regard to UPM.

Although I guess what you're suggesting doesn't stop userspace from
deciding whether they want to prefault or not. I know the Google folks
had some concerns over unexpected allocations causing 2x memory usage,
though, so giving userspace full control of what is/isn't allocated in
the restrictedmem backend seems to make it easier to guard against that.
But I think checking the xarray and defaulting to 'shared' would work
for us if that's the direction we end up going.

-Mike

> 
> -Mike
> 
> > 
> > Chao
> > > 
> > > -Mike
> > > 
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	*pagep = page;
> > > > +	if (order)
> > > > +		*order = thp_order(compound_head(page));
> > > > +
> > > > +	SetPageUptodate(page);
> > > > +	unlock_page(page);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > > > -- 
> > > > 2.25.1
> > > >
  
Chao Peng Nov. 30, 2022, 9:39 a.m. UTC | #29
On Tue, Nov 29, 2022 at 01:18:15PM -0600, Michael Roth wrote:
> On Tue, Nov 29, 2022 at 01:06:58PM -0600, Michael Roth wrote:
> > On Tue, Nov 29, 2022 at 10:06:15PM +0800, Chao Peng wrote:
> > > On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> > > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > ...
> > > > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > > > +				    loff_t offset, loff_t len)
> > > > > +{
> > > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > > +	struct file *memfd = data->memfd;
> > > > > +	int ret;
> > > > > +
> > > > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > +			return -EINVAL;
> > > > > +	}
> > > > > +
> > > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> > > > 
> > > > The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> > > > loff_t. For SNP we've made this strange as part of the following patch
> > > > and it seems to produce the expected behavior:
> > > 
> > > That's correct. Thanks.
> > > 
> > > > 
> > > >   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> > > > 
> > > > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > 
> > > > <snip>
> > > > 
> > > > > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > > > > +			   struct page **pagep, int *order)
> > > > > +{
> > > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > > +	struct file *memfd = data->memfd;
> > > > > +	struct page *page;
> > > > > +	int ret;
> > > > > +
> > > > > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > > > 
> > > > This will result in KVM allocating pages that userspace hasn't necessary
> > > > fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> > > > up the RMP entries when restrictedmem invalidations are issued for a GFN
> > > > range.
> > > 
> > > Yes fallocate() is unnecessary unless someone wants to reserve some
> > > space (e.g. for determination or performance purpose), this matches its
> > > semantics perfectly at:
> > > https://www.man7.org/linux/man-pages/man2/fallocate.2.html
> > > 
> > > > 
> > > > If the guest supports lazy-acceptance however, these pages may not have
> > > > been faulted in yet, and if the VMM defers actually fallocate()'ing space
> > > > until the guest actually tries to issue a shared->private for that GFN
> > > > (to support lazy-pinning), then there may never be a need to allocate
> > > > pages for these backends.
> > > > 
> > > > However, the restrictedmem invalidations are for GFN ranges so there's
> > > > no way to know inadvance whether it's been allocated yet or not. The
> > > > xarray is one option but currently it defaults to 'private' so that
> > > > doesn't help us here. It might if we introduced a 'uninitialized' state
> > > > or something along that line instead of just the binary
> > > > 'shared'/'private' though...
> > > 
> > > How about if we change the default to 'shared' as we discussed at
> > > https://lore.kernel.org/all/Y35gI0L8GMt9+OkK@google.com/?
> > 
> > Need to look at this a bit more, but I think that could work as well.
> > 
> > > > 
> > > > But for now we added a restrictedmem_get_page_noalloc() that uses
> > > > SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> > > > of memory as part of guest shutdown, and a
> > > > kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> > > > maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> > > > default, and we just propagate an error to userspace if they didn't
> > > > fallocate() in advance?
> > > 
> > > This (making fallocate() a hard requirement) not only complicates the
> > > userspace but also forces the lazy-faulting going through a long path of
> > > exiting to userspace. Unless we don't have other options I would not go
> > > this way.
> > 
> > Unless I'm missing something, it's already the case that userspace is
> > responsible for handling all the shared->private transitions in response
> > to KVM_EXIT_MEMORY_FAULT or (in our case) KVM_EXIT_VMGEXIT. So it only
> > places the additional requirements on the VMM that if they *don't*
> > preallocate, then they'll need to issue the fallocate() prior to issuing
> > the KVM_MEM_ENCRYPT_REG_REGION ioctl in response to these events.

Preallocating and memory conversion between shared<->private are two
different things. No doubt fallocate() and conversion can be called
together in response to KVM_EXIT_MEMORY_FAULT, but they don't have to be
paired. And the fallocate() does not have to operate on the same memory
range as memory conversion does.

> > 
> > QEMU for example already has a separate 'prealloc' option for cases
> > where they want to prefault all the guest memory, so it makes sense to
> > continue making that an optional thing with regard to UPM.

Making 'prealloc' work for UPM in QEMU does sound reasonable. Anyway,
it's just an option, so it doesn't change the assumption here.

> 
> Although I guess what you're suggesting doesn't stop userspace from
> deciding whether they want to prefault or not. I know the Google folks
> had some concerns over unexpected allocations causing 2x memory usage
> though so giving userspace full control of what is/isn't allocated in
> the restrictedmem backend seems to make it easier to guard against this,
> but I think checking the xarray and defaulting to 'shared' would work
> for us if that's the direction we end up going.

Yeah, that looks very likely the direction satisfying all people here.

Chao
> 
> -Mike
> 
> > 
> > -Mike
> > 
> > > 
> > > Chao
> > > > 
> > > > -Mike
> > > > 
> > > > > +	if (ret)
> > > > > +		return ret;
> > > > > +
> > > > > +	*pagep = page;
> > > > > +	if (order)
> > > > > +		*order = thp_order(compound_head(page));
> > > > > +
> > > > > +	SetPageUptodate(page);
> > > > > +	unlock_page(page);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > > > > -- 
> > > > > 2.25.1
> > > > >
  
Michael Roth Nov. 30, 2022, 2:31 p.m. UTC | #30
On Wed, Nov 30, 2022 at 05:39:31PM +0800, Chao Peng wrote:
> On Tue, Nov 29, 2022 at 01:18:15PM -0600, Michael Roth wrote:
> > On Tue, Nov 29, 2022 at 01:06:58PM -0600, Michael Roth wrote:
> > > On Tue, Nov 29, 2022 at 10:06:15PM +0800, Chao Peng wrote:
> > > > On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> > > > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > > ...
> > > > > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > > > > +				    loff_t offset, loff_t len)
> > > > > > +{
> > > > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > > > +	struct file *memfd = data->memfd;
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > +			return -EINVAL;
> > > > > > +	}
> > > > > > +
> > > > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> > > > > 
> > > > > The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> > > > > loff_t. For SNP we've made this strange as part of the following patch
> > > > > and it seems to produce the expected behavior:
> > > > 
> > > > That's correct. Thanks.
> > > > 
> > > > > 
> > > > >   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> > > > > 
> > > > > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > > > > > +	return ret;
> > > > > > +}
> > > > > > +
> > > > > 
> > > > > <snip>
> > > > > 
> > > > > > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > > > > > +			   struct page **pagep, int *order)
> > > > > > +{
> > > > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > > > +	struct file *memfd = data->memfd;
> > > > > > +	struct page *page;
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > > > > 
> > > > > This will result in KVM allocating pages that userspace hasn't necessary
> > > > > fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> > > > > up the RMP entries when restrictedmem invalidations are issued for a GFN
> > > > > range.
> > > > 
> > > > Yes fallocate() is unnecessary unless someone wants to reserve some
> > > > space (e.g. for determination or performance purpose), this matches its
> > > > semantics perfectly at:
> > > > https://www.man7.org/linux/man-pages/man2/fallocate.2.html
> > > > 
> > > > > 
> > > > > If the guest supports lazy-acceptance however, these pages may not have
> > > > > been faulted in yet, and if the VMM defers actually fallocate()'ing space
> > > > > until the guest actually tries to issue a shared->private for that GFN
> > > > > (to support lazy-pinning), then there may never be a need to allocate
> > > > > pages for these backends.
> > > > > 
> > > > > However, the restrictedmem invalidations are for GFN ranges so there's
> > > > > no way to know inadvance whether it's been allocated yet or not. The
> > > > > xarray is one option but currently it defaults to 'private' so that
> > > > > doesn't help us here. It might if we introduced a 'uninitialized' state
> > > > > or something along that line instead of just the binary
> > > > > 'shared'/'private' though...
> > > > 
> > > > How about if we change the default to 'shared' as we discussed at
> > > > https://lore.kernel.org/all/Y35gI0L8GMt9+OkK@google.com/?
> > > 
> > > Need to look at this a bit more, but I think that could work as well.
> > > 
> > > > > 
> > > > > But for now we added a restrictedmem_get_page_noalloc() that uses
> > > > > SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> > > > > of memory as part of guest shutdown, and a
> > > > > kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> > > > > maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> > > > > default, and we just propagate an error to userspace if they didn't
> > > > > fallocate() in advance?
> > > > 
> > > > This (making fallocate() a hard requirement) not only complicates the
> > > > userspace but also forces the lazy-faulting going through a long path of
> > > > exiting to userspace. Unless we don't have other options I would not go
> > > > this way.
> > > 
> > > Unless I'm missing something, it's already the case that userspace is
> > > responsible for handling all the shared->private transitions in response
> > > to KVM_EXIT_MEMORY_FAULT or (in our case) KVM_EXIT_VMGEXIT. So it only
> > > places the additional requirements on the VMM that if they *don't*
> > > preallocate, then they'll need to issue the fallocate() prior to issuing
> > > the KVM_MEM_ENCRYPT_REG_REGION ioctl in response to these events.
> 
> Preallocating and memory conversion between shared<->private are two
> different things. No doubt fallocate() and conversion can be called

I just mean that we don't actually have additional userspace exits for
doing lazy-faulting in this manner: prior to mapping a restricted page
into the TDP, we will have gotten a KVM_EXIT_MEMORY_FAULT anyway so that
userspace can handle the conversion. So if you do the fallocate() prior
to KVM_MEM_ENCRYPT_REG_REGION, there are no additional KVM exits (unless
you count the fallocate() syscall itself, but that seems negligible
compared to the memory allocation).

For instance, on the QEMU side we do the fallocate() as part of the
kvm_convert_memory() helper.
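
Not the actual QEMU code, but the shape of such a helper (with the KVM-side
attribute update omitted and all names illustrative) is roughly:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Sketch only: allocate restricted pages on shared->private, punch them
 * out on private->shared, then update KVM's view of the range (omitted).
 */
static int convert_range(int restricted_fd, off_t offset, off_t len,
			 bool to_private)
{
	int mode = to_private ? 0 /* allocate */ :
			FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;

	if (fallocate(restricted_fd, mode, offset, len))
		return -errno;

	return 0;
}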

But thinking about it more, the main upside to this approach (giving the
VMM control/accounting over restrictedmem allocations) doesn't actually
work out. For instance, if the VMM fallocate()'s memory for a single 4K
page prior to a shared->private conversion, shmem might still allocate a
THP for that whole 2M range, and userspace doesn't have a good way to
account for this. So what I'm proposing probably isn't feasible anyway.

> different things. No doubt fallocate() and conversion can be called
> together in response to KVM_EXIT_MEMORY_FAULT, but they don't have to be
> paired. And the fallocate() does not have to operate on the same memory
> range as memory conversion does.
> 
> > > 
> > > QEMU for example already has a separate 'prealloc' option for cases
> > > where they want to prefault all the guest memory, so it makes sense to
> > > continue making that an optional thing with regard to UPM.
> 
> Making 'prealloc' work for UPM in QEMU does sound reasonable. Anyway,
> it's just an option, so it doesn't change the assumption here.
> 
> > 
> > Although I guess what you're suggesting doesn't stop userspace from
> > deciding whether they want to prefault or not. I know the Google folks
> > had some concerns over unexpected allocations causing 2x memory usage
> > though so giving userspace full control of what is/isn't allocated in
> > the restrictedmem backend seems to make it easier to guard against this,
> > but I think checking the xarray and defaulting to 'shared' would work
> > for us if that's the direction we end up going.
> 
> Yeah, that looks very likely the direction satisfying all people here.

Ok, yah after some more thought this probably is the more feasible
approach. Thanks for your input on this.

-Mike

> 
> Chao
> > 
> > -Mike
> > 
> > > 
> > > -Mike
> > > 
> > > > 
> > > > Chao
> > > > > 
> > > > > -Mike
> > > > > 
> > > > > > +	if (ret)
> > > > > > +		return ret;
> > > > > > +
> > > > > > +	*pagep = page;
> > > > > > +	if (order)
> > > > > > +		*order = thp_order(compound_head(page));
> > > > > > +
> > > > > > +	SetPageUptodate(page);
> > > > > > +	unlock_page(page);
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > > > > > -- 
> > > > > > 2.25.1
> > > > > >
  
Vishal Annapurve Dec. 2, 2022, 2:16 a.m. UTC | #31
On Tue, Oct 25, 2022 at 8:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Introduce 'memfd_restricted' system call with the ability to create
> memory areas that are restricted from userspace access through ordinary
> MMU operations (e.g. read/write/mmap). The memory content is expected to
> be used through a new in-kernel interface by a third kernel module.
>
> memfd_restricted() is useful for scenarios where a file descriptor(fd)
> can be used as an interface into mm but want to restrict userspace's
> ability on the fd. Initially it is designed to provide protections for
> KVM encrypted guest memory.
>
> Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> (e.g. QEMU) and then using the mmaped virtual address to setup the
> mapping in the KVM secondary page table (e.g. EPT). With confidential
> computing technologies like Intel TDX, the memfd memory may be encrypted
> with special key for special software domain (e.g. KVM guest) and is not
> expected to be directly accessed by userspace. Precisely, userspace
> access to such encrypted memory may lead to host crash so should be
> prevented.
>
> memfd_restricted() provides semantics required for KVM guest encrypted
> memory support that a fd created with memfd_restricted() is going to be
> used as the source of guest memory in confidential computing environment
> and KVM can directly interact with core-mm without the need to expose
> the memoy content into KVM userspace.
>
> KVM userspace is still in charge of the lifecycle of the fd. It should
> pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> obtain the physical memory page and then uses it to populate the KVM
> secondary page table entries.
>
> The userspace restricted memfd can be fallocate-ed or hole-punched
> from userspace. When these operations happen, KVM can get notified
> through restrictedmem_notifier, it then gets chance to remove any
> mapped entries of the range in the secondary page tables.
>
> memfd_restricted() itself is implemented as a shim layer on top of real
> memory file systems (currently tmpfs). Pages in restrictedmem are marked
> as unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
>
> By default memfd_restricted() prevents userspace read, write and mmap.
> By defining new bit in the 'flags', it can be extended to support other
> restricted semantics in the future.
>
> The system call is currently wired up for x86 arch.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/restrictedmem.h          |  62 ++++++
>  include/linux/syscalls.h               |   1 +
>  include/uapi/asm-generic/unistd.h      |   5 +-
>  include/uapi/linux/magic.h             |   1 +
>  kernel/sys_ni.c                        |   3 +
>  mm/Kconfig                             |   4 +
>  mm/Makefile                            |   1 +
>  mm/restrictedmem.c                     | 250 +++++++++++++++++++++++++
>  10 files changed, 328 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/restrictedmem.h
>  create mode 100644 mm/restrictedmem.c
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..dc70ba90247e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
>  448    i386    process_mrelease        sys_process_mrelease
>  449    i386    futex_waitv             sys_futex_waitv
>  450    i386    set_mempolicy_home_node         sys_set_mempolicy_home_node
> +451    i386    memfd_restricted        sys_memfd_restricted
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..06516abc8318 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448    common  process_mrelease        sys_process_mrelease
>  449    common  futex_waitv             sys_futex_waitv
>  450    common  set_mempolicy_home_node sys_set_mempolicy_home_node
> +451    common  memfd_restricted        sys_memfd_restricted
>
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..9c37c3ea3180
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H
> +
> +#include <linux/file.h>
> +#include <linux/magic.h>
> +#include <linux/pfn_t.h>
> +
> +struct restrictedmem_notifier;
> +
> +struct restrictedmem_notifier_ops {
> +       void (*invalidate_start)(struct restrictedmem_notifier *notifier,
> +                                pgoff_t start, pgoff_t end);
> +       void (*invalidate_end)(struct restrictedmem_notifier *notifier,
> +                              pgoff_t start, pgoff_t end);
> +};
> +
> +struct restrictedmem_notifier {
> +       struct list_head list;
> +       const struct restrictedmem_notifier_ops *ops;
> +};
> +
> +#ifdef CONFIG_RESTRICTEDMEM
> +
> +void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier);
> +void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                          struct page **pagep, int *order);
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +       return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
> +}
> +
> +#else
> +
> +static inline void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier)
> +{
> +}
> +
> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                                        struct page **pagep, int *order)
> +{
> +       return -1;
> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> +       return false;
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a34b0f9a9972..f9e9e0c820c5 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>                                             unsigned long home_node,
>                                             unsigned long flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags);
>
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 45fa180cc56a..e93cd35e46d0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>  #define __NR_set_mempolicy_home_node 450
>  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>
> +#define __NR_memfd_restricted 451
> +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 451
> +#define __NR_syscalls 452
>
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..8aa38324b90a 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC          0x444d4142      /* "DMAB" */
>  #define DEVMEM_MAGIC           0x454d444d      /* "DMEM" */
>  #define SECRETMEM_MAGIC                0x5345434d      /* "SECM" */
> +#define RESTRICTEDMEM_MAGIC    0x5245534d      /* "RESM" */
>
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 860b2dcf3ac4..7c4a32cbd2e7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free);
>  /* memfd_secret */
>  COND_SYSCALL(memfd_secret);
>
> +/* memfd_restricted */
> +COND_SYSCALL(memfd_restricted);
> +
>  /*
>   * Architecture specific weak syscall entries.
>   */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0331f1461f81..0177d53676c7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1076,6 +1076,10 @@ config IO_MAPPING
>  config SECRETMEM
>         def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
>
> +config RESTRICTEDMEM
> +       bool
> +       depends on TMPFS
> +
>  config ANON_VMA_NAME
>         bool "Anonymous VMA name support"
>         depends on PROC_FS && ADVISE_SYSCALLS && MMU
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f836403..6cb6403ffd40 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -117,6 +117,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>  obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
>  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>  obj-$(CONFIG_SECRETMEM) += secretmem.o
> +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
>  obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
>  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index 000000000000..e5bf8907e0f8
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,250 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include <linux/pagemap.h>
> +#include <linux/pseudo_fs.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/syscalls.h>
> +#include <uapi/linux/falloc.h>
> +#include <uapi/linux/magic.h>
> +#include <linux/restrictedmem.h>
> +
> +struct restrictedmem_data {
> +       struct mutex lock;
> +       struct file *memfd;
> +       struct list_head notifiers;
> +};
> +
> +static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
> +                                pgoff_t start, pgoff_t end, bool notify_start)
> +{
> +       struct restrictedmem_notifier *notifier;
> +
> +       mutex_lock(&data->lock);
> +       list_for_each_entry(notifier, &data->notifiers, list) {
> +               if (notify_start)
> +                       notifier->ops->invalidate_start(notifier, start, end);
> +               else
> +                       notifier->ops->invalidate_end(notifier, start, end);
> +       }
> +       mutex_unlock(&data->lock);
> +}
> +
> +static int restrictedmem_release(struct inode *inode, struct file *file)
> +{
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +
> +       fput(data->memfd);
> +       kfree(data);
> +       return 0;
> +}
> +
> +static long restrictedmem_fallocate(struct file *file, int mode,
> +                                   loff_t offset, loff_t len)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +                       return -EINVAL;
> +       }
> +
> +       restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> +       restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> +       return ret;
> +}
> +
> +static const struct file_operations restrictedmem_fops = {
> +       .release = restrictedmem_release,
> +       .fallocate = restrictedmem_fallocate,
> +};
> +
> +static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> +                                const struct path *path, struct kstat *stat,
> +                                u32 request_mask, unsigned int query_flags)
> +{
> +       struct inode *inode = d_inode(path->dentry);
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +
> +       return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> +                                            request_mask, query_flags);
> +}
> +
> +static int restrictedmem_setattr(struct user_namespace *mnt_userns,
> +                                struct dentry *dentry, struct iattr *attr)
> +{
> +       struct inode *inode = d_inode(dentry);
> +       struct restrictedmem_data *data = inode->i_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       int ret;
> +
> +       if (attr->ia_valid & ATTR_SIZE) {
> +               if (memfd->f_inode->i_size)
> +                       return -EPERM;
> +
> +               if (!PAGE_ALIGNED(attr->ia_size))
> +                       return -EINVAL;
> +       }
> +
> +       ret = memfd->f_inode->i_op->setattr(mnt_userns,
> +                                           file_dentry(memfd), attr);
> +       return ret;
> +}
> +
> +static const struct inode_operations restrictedmem_iops = {
> +       .getattr = restrictedmem_getattr,
> +       .setattr = restrictedmem_setattr,
> +};
> +
> +static int restrictedmem_init_fs_context(struct fs_context *fc)
> +{
> +       if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
> +               return -ENOMEM;
> +
> +       fc->s_iflags |= SB_I_NOEXEC;
> +       return 0;
> +}
> +
> +static struct file_system_type restrictedmem_fs = {
> +       .owner          = THIS_MODULE,
> +       .name           = "memfd:restrictedmem",
> +       .init_fs_context = restrictedmem_init_fs_context,
> +       .kill_sb        = kill_anon_super,
> +};
> +
> +static struct vfsmount *restrictedmem_mnt;
> +
> +static __init int restrictedmem_init(void)
> +{
> +       restrictedmem_mnt = kern_mount(&restrictedmem_fs);
> +       if (IS_ERR(restrictedmem_mnt))
> +               return PTR_ERR(restrictedmem_mnt);
> +       return 0;
> +}
> +fs_initcall(restrictedmem_init);
> +
> +static struct file *restrictedmem_file_create(struct file *memfd)
> +{
> +       struct restrictedmem_data *data;
> +       struct address_space *mapping;
> +       struct inode *inode;
> +       struct file *file;
> +
> +       data = kzalloc(sizeof(*data), GFP_KERNEL);
> +       if (!data)
> +               return ERR_PTR(-ENOMEM);
> +
> +       data->memfd = memfd;
> +       mutex_init(&data->lock);
> +       INIT_LIST_HEAD(&data->notifiers);
> +
> +       inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> +       if (IS_ERR(inode)) {
> +               kfree(data);
> +               return ERR_CAST(inode);
> +       }
> +
> +       inode->i_mode |= S_IFREG;
> +       inode->i_op = &restrictedmem_iops;
> +       inode->i_mapping->private_data = data;
> +
> +       file = alloc_file_pseudo(inode, restrictedmem_mnt,
> +                                "restrictedmem", O_RDWR,
> +                                &restrictedmem_fops);
> +       if (IS_ERR(file)) {
> +               iput(inode);
> +               kfree(data);
> +               return ERR_CAST(file);
> +       }
> +
> +       file->f_flags |= O_LARGEFILE;
> +
> +       mapping = memfd->f_mapping;
> +       mapping_set_unevictable(mapping);
> +       mapping_set_gfp_mask(mapping,
> +                            mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> +       return file;
> +}
> +
> +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> +{

Looking at the underlying shmem implementation, there seems to be no
way to enable transparent huge pages specifically for restricted memfd
files.

Michael earlier discussed tweaking the
/sys/kernel/mm/transparent_hugepage/shmem_enabled setting to allow
hugepages to be used when backing restricted memfds. Such a change
would affect the rest of the shmem use cases as well. Even setting the
shmem_enabled policy to "advise" wouldn't help unless file-based advice
for hugepage allocation is implemented.

Does it make sense to provide a flag here to allow creating restricted
memfds possibly backed by huge pages, to give more granular control?
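
For what it's worth, from userspace the flag being asked about could look
something like the sketch below; RESTRICTEDMEM_HUGEPAGE and its value are
purely hypothetical, and with this patch any non-zero flags value returns
-EINVAL:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_memfd_restricted
#define __NR_memfd_restricted 451		/* x86_64 number from this patch */
#endif

#define RESTRICTEDMEM_HUGEPAGE	(1u << 0)	/* hypothetical flag */

int main(void)
{
	int fd = syscall(__NR_memfd_restricted, RESTRICTEDMEM_HUGEPAGE);

	if (fd < 0)
		perror("memfd_restricted");	/* EINVAL with this patch */
	return fd < 0 ? 1 : 0;
}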

> +       struct file *file, *restricted_file;
> +       int fd, err;
> +
> +       if (flags)
> +               return -EINVAL;
> +
> +       fd = get_unused_fd_flags(0);
> +       if (fd < 0)
> +               return fd;
> +
> +       file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +       if (IS_ERR(file)) {
> +               err = PTR_ERR(file);
> +               goto err_fd;
> +       }
> +       file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> +       file->f_flags |= O_LARGEFILE;
> +
> +       restricted_file = restrictedmem_file_create(file);
> +       if (IS_ERR(restricted_file)) {
> +               err = PTR_ERR(restricted_file);
> +               fput(file);
> +               goto err_fd;
> +       }
> +
> +       fd_install(fd, restricted_file);
> +       return fd;
> +err_fd:
> +       put_unused_fd(fd);
> +       return err;
> +}
> +
> +void restrictedmem_register_notifier(struct file *file,
> +                                    struct restrictedmem_notifier *notifier)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_add(&notifier->list, &data->notifiers);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
> +
> +void restrictedmem_unregister_notifier(struct file *file,
> +                                      struct restrictedmem_notifier *notifier)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +
> +       mutex_lock(&data->lock);
> +       list_del(&notifier->list);
> +       mutex_unlock(&data->lock);
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
> +
> +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +                          struct page **pagep, int *order)
> +{
> +       struct restrictedmem_data *data = file->f_mapping->private_data;
> +       struct file *memfd = data->memfd;
> +       struct page *page;
> +       int ret;
> +
> +       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +       if (ret)
> +               return ret;
> +
> +       *pagep = page;
> +       if (order)
> +               *order = thp_order(compound_head(page));
> +
> +       SetPageUptodate(page);
> +       unlock_page(page);
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> --
> 2.25.1
>
  
Chao Peng Dec. 2, 2022, 6:49 a.m. UTC | #32
On Thu, Dec 01, 2022 at 06:16:46PM -0800, Vishal Annapurve wrote:
> On Tue, Oct 25, 2022 at 8:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
...
> > +}
> > +
> > +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> > +{
> 
> Looking at the underlying shmem implementation, there seems to be no
> way to enable transparent huge pages specifically for restricted memfd
> files.
> 
> Michael discussed earlier about tweaking
> /sys/kernel/mm/transparent_hugepage/shmem_enabled setting to allow
> hugepages to be used while backing restricted memfd. Such a change
> will affect the rest of the shmem usecases as well. Even setting the
> shmem_enabled policy to "advise" wouldn't help unless file based
> advise for hugepage allocation is implemented.

Had a look at fadvise() and it looks like it does not support HUGEPAGE
for any filesystem yet.

> 
> Does it make sense to provide a flag here to allow creating restricted
> memfds backed possibly by huge pages to give a more granular control?

We do have an unused 'flags' argument that can be extended for such
usage, but I would let Kirill have a further look; it perhaps needs more
discussion.

Chao
> 
> > +       struct file *file, *restricted_file;
> > +       int fd, err;
> > +
> > +       if (flags)
> > +               return -EINVAL;
> > +
> > +       fd = get_unused_fd_flags(0);
> > +       if (fd < 0)
> > +               return fd;
> > +
> > +       file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> > +       if (IS_ERR(file)) {
> > +               err = PTR_ERR(file);
> > +               goto err_fd;
> > +       }
> > +       file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> > +       file->f_flags |= O_LARGEFILE;
> > +
> > +       restricted_file = restrictedmem_file_create(file);
> > +       if (IS_ERR(restricted_file)) {
> > +               err = PTR_ERR(restricted_file);
> > +               fput(file);
> > +               goto err_fd;
> > +       }
> > +
> > +       fd_install(fd, restricted_file);
> > +       return fd;
> > +err_fd:
> > +       put_unused_fd(fd);
> > +       return err;
> > +}
> > +
> > +void restrictedmem_register_notifier(struct file *file,
> > +                                    struct restrictedmem_notifier *notifier)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_add(&notifier->list, &data->notifiers);
> > +       mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
> > +
> > +void restrictedmem_unregister_notifier(struct file *file,
> > +                                      struct restrictedmem_notifier *notifier)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +
> > +       mutex_lock(&data->lock);
> > +       list_del(&notifier->list);
> > +       mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
> > +
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +                          struct page **pagep, int *order)
> > +{
> > +       struct restrictedmem_data *data = file->f_mapping->private_data;
> > +       struct file *memfd = data->memfd;
> > +       struct page *page;
> > +       int ret;
> > +
> > +       ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > +       if (ret)
> > +               return ret;
> > +
> > +       *pagep = page;
> > +       if (order)
> > +               *order = thp_order(compound_head(page));
> > +
> > +       SetPageUptodate(page);
> > +       unlock_page(page);
> > +
> > +       return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > --
> > 2.25.1
> >
  
Kirill A. Shutemov Dec. 2, 2022, 1:44 p.m. UTC | #33
On Fri, Dec 02, 2022 at 02:49:09PM +0800, Chao Peng wrote:
> On Thu, Dec 01, 2022 at 06:16:46PM -0800, Vishal Annapurve wrote:
> > On Tue, Oct 25, 2022 at 8:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> ...
> > > +}
> > > +
> > > +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> > > +{
> > 
> > Looking at the underlying shmem implementation, there seems to be no
> > way to enable transparent huge pages specifically for restricted memfd
> > files.
> > 
> > Michael earlier discussed tweaking the
> > /sys/kernel/mm/transparent_hugepage/shmem_enabled setting to allow
> > hugepages to be used when backing a restricted memfd. Such a change
> > would affect the rest of the shmem use cases as well. Even setting the
> > shmem_enabled policy to "advise" wouldn't help unless file-based
> > advice for hugepage allocation is implemented.
> 
> I had a look at fadvise() and it looks like it does not support HUGEPAGE
> for any filesystem yet.

Yes, I think fadvise() is the right direction here. The problem is similar
to NUMA policy, where the existing APIs are focused on virtual memory
addresses. We need to extend the ABI to take fd+offset as input instead.
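
Purely as a sketch of the shape such an interface could take (the
POSIX_FADV_HUGEPAGE name and value below are made up, nothing like it
exists today):

	#include <fcntl.h>

	/* hypothetical advice value, not in any kernel or libc */
	#ifndef POSIX_FADV_HUGEPAGE
	#define POSIX_FADV_HUGEPAGE	8
	#endif

	/*
	 * Request THP backing for a range of the restricted memfd,
	 * addressed by fd + offset rather than by virtual address.
	 */
	static int advise_huge(int fd, off_t offset, off_t len)
	{
		return posix_fadvise(fd, offset, len, POSIX_FADV_HUGEPAGE);
	}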
  

Patch

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 320480a8db4f..dc70ba90247e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -455,3 +455,4 @@ 
 448	i386	process_mrelease	sys_process_mrelease
 449	i386	futex_waitv		sys_futex_waitv
 450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	i386	memfd_restricted	sys_memfd_restricted
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..06516abc8318 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@ 
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	memfd_restricted	sys_memfd_restricted
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
new file mode 100644
index 000000000000..9c37c3ea3180
--- /dev/null
+++ b/include/linux/restrictedmem.h
@@ -0,0 +1,62 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_RESTRICTEDMEM_H
+
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/pfn_t.h>
+
+struct restrictedmem_notifier;
+
+struct restrictedmem_notifier_ops {
+	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
+				 pgoff_t start, pgoff_t end);
+	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
+			       pgoff_t start, pgoff_t end);
+};
+
+struct restrictedmem_notifier {
+	struct list_head list;
+	const struct restrictedmem_notifier_ops *ops;
+};
+
+#ifdef CONFIG_RESTRICTEDMEM
+
+void restrictedmem_register_notifier(struct file *file,
+				     struct restrictedmem_notifier *notifier);
+void restrictedmem_unregister_notifier(struct file *file,
+				       struct restrictedmem_notifier *notifier);
+
+int restrictedmem_get_page(struct file *file, pgoff_t offset,
+			   struct page **pagep, int *order);
+
+static inline bool file_is_restrictedmem(struct file *file)
+{
+	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
+}
+
+#else
+
+static inline void restrictedmem_register_notifier(struct file *file,
+				     struct restrictedmem_notifier *notifier)
+{
+}
+
+static inline void restrictedmem_unregister_notifier(struct file *file,
+				       struct restrictedmem_notifier *notifier)
+{
+}
+
+static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
+					 struct page **pagep, int *order)
+{
+	return -1;
+}
+
+static inline bool file_is_restrictedmem(struct file *file)
+{
+	return false;
+}
+
+#endif /* CONFIG_RESTRICTEDMEM */
+
+#endif /* _LINUX_RESTRICTEDMEM_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a34b0f9a9972..f9e9e0c820c5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1056,6 +1056,7 @@  asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
+asmlinkage long sys_memfd_restricted(unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..e93cd35e46d0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@  __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#define __NR_memfd_restricted 451
+__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..8aa38324b90a 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@ 
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define RESTRICTEDMEM_MAGIC	0x5245534d	/* "RESM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..7c4a32cbd2e7 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -360,6 +360,9 @@  COND_SYSCALL(pkey_free);
 /* memfd_secret */
 COND_SYSCALL(memfd_secret);
 
+/* memfd_restricted */
+COND_SYSCALL(memfd_restricted);
+
 /*
  * Architecture specific weak syscall entries.
  */
diff --git a/mm/Kconfig b/mm/Kconfig
index 0331f1461f81..0177d53676c7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1076,6 +1076,10 @@  config IO_MAPPING
 config SECRETMEM
 	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
 
+config RESTRICTEDMEM
+	bool
+	depends on TMPFS
+
 config ANON_VMA_NAME
 	bool "Anonymous VMA name support"
 	depends on PROC_FS && ADVISE_SYSCALLS && MMU
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..6cb6403ffd40 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -117,6 +117,7 @@  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_SECRETMEM) += secretmem.o
+obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o
 obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
new file mode 100644
index 000000000000..e5bf8907e0f8
--- /dev/null
+++ b/mm/restrictedmem.c
@@ -0,0 +1,250 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include "linux/sbitmap.h"
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/shmem_fs.h>
+#include <linux/syscalls.h>
+#include <uapi/linux/falloc.h>
+#include <uapi/linux/magic.h>
+#include <linux/restrictedmem.h>
+
+struct restrictedmem_data {
+	struct mutex lock;
+	struct file *memfd;
+	struct list_head notifiers;
+};
+
+static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
+				 pgoff_t start, pgoff_t end, bool notify_start)
+{
+	struct restrictedmem_notifier *notifier;
+
+	mutex_lock(&data->lock);
+	list_for_each_entry(notifier, &data->notifiers, list) {
+		if (notify_start)
+			notifier->ops->invalidate_start(notifier, start, end);
+		else
+			notifier->ops->invalidate_end(notifier, start, end);
+	}
+	mutex_unlock(&data->lock);
+}
+
+static int restrictedmem_release(struct inode *inode, struct file *file)
+{
+	struct restrictedmem_data *data = inode->i_mapping->private_data;
+
+	fput(data->memfd);
+	kfree(data);
+	return 0;
+}
+
+static long restrictedmem_fallocate(struct file *file, int mode,
+				    loff_t offset, loff_t len)
+{
+	struct restrictedmem_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+			return -EINVAL;
+	}
+
+	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
+	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
+	return ret;
+}
+
+static const struct file_operations restrictedmem_fops = {
+	.release = restrictedmem_release,
+	.fallocate = restrictedmem_fallocate,
+};
+
+static int restrictedmem_getattr(struct user_namespace *mnt_userns,
+				 const struct path *path, struct kstat *stat,
+				 u32 request_mask, unsigned int query_flags)
+{
+	struct inode *inode = d_inode(path->dentry);
+	struct restrictedmem_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+
+	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
+					     request_mask, query_flags);
+}
+
+static int restrictedmem_setattr(struct user_namespace *mnt_userns,
+				 struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = d_inode(dentry);
+	struct restrictedmem_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (attr->ia_valid & ATTR_SIZE) {
+		if (memfd->f_inode->i_size)
+			return -EPERM;
+
+		if (!PAGE_ALIGNED(attr->ia_size))
+			return -EINVAL;
+	}
+
+	ret = memfd->f_inode->i_op->setattr(mnt_userns,
+					    file_dentry(memfd), attr);
+	return ret;
+}
+
+static const struct inode_operations restrictedmem_iops = {
+	.getattr = restrictedmem_getattr,
+	.setattr = restrictedmem_setattr,
+};
+
+static int restrictedmem_init_fs_context(struct fs_context *fc)
+{
+	if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
+		return -ENOMEM;
+
+	fc->s_iflags |= SB_I_NOEXEC;
+	return 0;
+}
+
+static struct file_system_type restrictedmem_fs = {
+	.owner		= THIS_MODULE,
+	.name		= "memfd:restrictedmem",
+	.init_fs_context = restrictedmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static struct vfsmount *restrictedmem_mnt;
+
+static __init int restrictedmem_init(void)
+{
+	restrictedmem_mnt = kern_mount(&restrictedmem_fs);
+	if (IS_ERR(restrictedmem_mnt))
+		return PTR_ERR(restrictedmem_mnt);
+	return 0;
+}
+fs_initcall(restrictedmem_init);
+
+static struct file *restrictedmem_file_create(struct file *memfd)
+{
+	struct restrictedmem_data *data;
+	struct address_space *mapping;
+	struct inode *inode;
+	struct file *file;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return ERR_PTR(-ENOMEM);
+
+	data->memfd = memfd;
+	mutex_init(&data->lock);
+	INIT_LIST_HEAD(&data->notifiers);
+
+	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
+	if (IS_ERR(inode)) {
+		kfree(data);
+		return ERR_CAST(inode);
+	}
+
+	inode->i_mode |= S_IFREG;
+	inode->i_op = &restrictedmem_iops;
+	inode->i_mapping->private_data = data;
+
+	file = alloc_file_pseudo(inode, restrictedmem_mnt,
+				 "restrictedmem", O_RDWR,
+				 &restrictedmem_fops);
+	if (IS_ERR(file)) {
+		iput(inode);
+		kfree(data);
+		return ERR_CAST(file);
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	mapping = memfd->f_mapping;
+	mapping_set_unevictable(mapping);
+	mapping_set_gfp_mask(mapping,
+			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
+
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
+{
+	struct file *file, *restricted_file;
+	int fd, err;
+
+	if (flags)
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(0);
+	if (fd < 0)
+		return fd;
+
+	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_fd;
+	}
+	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
+	file->f_flags |= O_LARGEFILE;
+
+	restricted_file = restrictedmem_file_create(file);
+	if (IS_ERR(restricted_file)) {
+		err = PTR_ERR(restricted_file);
+		fput(file);
+		goto err_fd;
+	}
+
+	fd_install(fd, restricted_file);
+	return fd;
+err_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+void restrictedmem_register_notifier(struct file *file,
+				     struct restrictedmem_notifier *notifier)
+{
+	struct restrictedmem_data *data = file->f_mapping->private_data;
+
+	mutex_lock(&data->lock);
+	list_add(&notifier->list, &data->notifiers);
+	mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
+
+void restrictedmem_unregister_notifier(struct file *file,
+				       struct restrictedmem_notifier *notifier)
+{
+	struct restrictedmem_data *data = file->f_mapping->private_data;
+
+	mutex_lock(&data->lock);
+	list_del(&notifier->list);
+	mutex_unlock(&data->lock);
+}
+EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
+
+int restrictedmem_get_page(struct file *file, pgoff_t offset,
+			   struct page **pagep, int *order)
+{
+	struct restrictedmem_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
+	if (ret)
+		return ret;
+
+	*pagep = page;
+	if (order)
+		*order = thp_order(compound_head(page));
+
+	SetPageUptodate(page);
+	unlock_page(page);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(restrictedmem_get_page);
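
For completeness, a minimal userspace sketch exercising the new syscall
(this assumes an x86-64 kernel with this series applied; glibc has no
wrapper yet, so the raw syscall number wired up above is used):

	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef __NR_memfd_restricted
	#define __NR_memfd_restricted 451
	#endif

	int main(void)
	{
		/* flags must be 0 for now; any other value gets -EINVAL */
		int fd = syscall(__NR_memfd_restricted, 0);

		if (fd < 0) {
			perror("memfd_restricted");
			return 1;
		}

		/*
		 * Size the backing file; restrictedmem_setattr() only
		 * accepts a page-aligned size, and only while the file
		 * is still empty.
		 */
		if (ftruncate(fd, 2UL << 20) < 0) {
			perror("ftruncate");
			return 1;
		}

		/*
		 * No read/write/mmap file operations are provided; the
		 * fd is meant to be handed to an in-kernel consumer
		 * (KVM) through that consumer's own ABI.
		 */
		return 0;
	}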