[v10,1/9] mm: Introduce memfd_restricted system call to create restricted user memory
Message ID | 20221202061347.1070246-2-chao.p.peng@linux.intel.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:f944:0:0:0:0:0 with SMTP id q4csp673927wrr; Thu, 1 Dec 2022 22:21:48 -0800 (PST) X-Google-Smtp-Source: AA0mqf5bDtu0SQ8FS0IM34NWafRRGdFsYo2cUYXGtH3Mw8oywInEM60l7X2blnDr6O8cMJ/H0gr1 X-Received: by 2002:a17:902:7006:b0:181:b55a:f987 with SMTP id y6-20020a170902700600b00181b55af987mr51989282plk.67.1669962108356; Thu, 01 Dec 2022 22:21:48 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1669962108; cv=none; d=google.com; s=arc-20160816; b=Wl789sBqcsve9t/ZzCK86TR0rkvGUpeR/VoMKhYTJ2gdAtJvddnIiozaadYPF3jK1N qCg+RqKG1uY34lTLrpnQysMFR/s7A75DTz9aDz3PGKM7cQFMpjBxeYikNN0PvGC5BBN5 AGsEaVtXAdK/77fhtwkPNMvJEL/O/CB9pRMpeDJcX4PPL9EmR9mIKYZf3apRX4kUklZR RCcWqNZcEy2+S2Nkr1/ldExDwG4jIpxwphYR+kGx3nf1pTEatHrfJ/khJHD1MtKWCCw/ 71oVzVElxKmUJrBXNvlpIoYmiGok2Gu92ARBcvYQeBEvf2sE3xjHVjpsGB7Kp9nYNGXu IQ2Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=hxMo29MPLERIVN55Mc5nap4HmlHiqc4o86xMClsF7e4=; b=06cnMVLLFwgTPM1Mh8QTNiMSGr+0NcugLLJN3t43MLmF4OOA9ab5NqUETN2foOY4fF LDL3qgWkPVC05snd7OK6WdHZp2I2EdWrvMVYiO4VQZP8qvy9o02CcFmsdASLU8clyBFQ R+RTQjpn4iHQcUAmIVktDWx3wetg7u0WTVYlghtr+r1skwrDvJBCHzvsUbuoIyMTHNfM U79GA512leL6tPgKzJYNwgpCvhNd8WOX84ePkRlx4RHOUAk7Wkc2PY6YC3tu6LmiiXEn Z3QOtaWKnBFDXSDvWEyn6wy/mVOFVjpLUZWqqBlSJGPrRgqXEiS3M8MhqggsC88pWN/O 3VbA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=nuKeKhbj; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id n2-20020a170902e54200b0017b3392ca72si7049121plf.323.2022.12.01.22.21.35; Thu, 01 Dec 2022 22:21:48 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=nuKeKhbj; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232227AbiLBGTN (ORCPT <rfc822;heyuhang3455@gmail.com> + 99 others); Fri, 2 Dec 2022 01:19:13 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40012 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232290AbiLBGTG (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Fri, 2 Dec 2022 01:19:06 -0500 Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2AA1ECE41A; Thu, 1 Dec 2022 22:18:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1669961911; x=1701497911; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=t0LZ0YyKrKPl65oG/djA5Wt39fL9WiC+pwAjlkHR9fE=; b=nuKeKhbj/Jr6CtLsIapJ+bf98OkmZ6+orHfxeUdKZ1VbJToWHzK2WGoq W3J3o2lx/buoTT5BYkSbkmvmVFkU6Ae4ycOVCXNxZ4xj9dOtDxJbTuxuk 4rmnajxsK35+FH/z1XssOc4f7LenQzswpCUOmKP1PvsPEd5A3irW7aliH 1d6/UYfqvIInINIlJRi3wxCcR9MPgsp9Z8WAmX+qkxXFLqb83Cf+ZJFpM Ov/l45dgT4GR/xyuWmGGAeoYKjogMglQzgeSfQ66kCKsxUu6ApPXDCq9o k0zB4ugr8ZCWGnS4Hzw4PjXK/RlpdtLcQxwdcU16qNeb3g8Y7hw8wuBJT A==; X-IronPort-AV: E=McAfee;i="6500,9779,10548"; a="380170379" X-IronPort-AV: E=Sophos;i="5.96,210,1665471600"; d="scan'208";a="380170379" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Dec 2022 22:18:30 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10548"; a="733698589" X-IronPort-AV: E=Sophos;i="5.96,210,1665471600"; d="scan'208";a="733698589" Received: from chaop.bj.intel.com ([10.240.193.75]) by FMSMGA003.fm.intel.com with ESMTP; 01 Dec 2022 22:18:19 -0800 From: Chao Peng <chao.p.peng@linux.intel.com> To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini <pbonzini@redhat.com>, Jonathan Corbet <corbet@lwn.net>, Sean Christopherson <seanjc@google.com>, Vitaly Kuznetsov <vkuznets@redhat.com>, Wanpeng Li <wanpengli@tencent.com>, Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Arnd Bergmann <arnd@arndb.de>, Naoya Horiguchi <naoya.horiguchi@nec.com>, Miaohe Lin <linmiaohe@huawei.com>, x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>, Hugh Dickins <hughd@google.com>, Jeff Layton <jlayton@kernel.org>, "J . Bruce Fields" <bfields@fieldses.org>, Andrew Morton <akpm@linux-foundation.org>, Shuah Khan <shuah@kernel.org>, Mike Rapoport <rppt@kernel.org>, Steven Price <steven.price@arm.com>, "Maciej S . Szmigiero" <mail@maciej.szmigiero.name>, Vlastimil Babka <vbabka@suse.cz>, Vishal Annapurve <vannapurve@google.com>, Yu Zhang <yu.c.zhang@linux.intel.com>, Chao Peng <chao.p.peng@linux.intel.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret <qperret@google.com>, tabba@google.com, Michael Roth <michael.roth@amd.com>, mhocko@suse.com, wei.w.wang@intel.com Subject: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory Date: Fri, 2 Dec 2022 14:13:39 +0800 Message-Id: <20221202061347.1070246-2-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20221202061347.1070246-1-chao.p.peng@linux.intel.com> References: <20221202061347.1070246-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-4.3 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE, SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1751082187677968779?= X-GMAIL-MSGID: =?utf-8?q?1751082187677968779?= |
Series |
KVM: mm: fd-based approach for supporting KVM
|
|
Commit Message
Chao Peng
Dec. 2, 2022, 6:13 a.m. UTC
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Introduce 'memfd_restricted' system call with the ability to create memory areas that are restricted from userspace access through ordinary MMU operations (e.g. read/write/mmap). The memory content is expected to be used through the new in-kernel interface by a third kernel module. memfd_restricted() is useful for scenarios where a file descriptor(fd) can be used as an interface into mm but want to restrict userspace's ability on the fd. Initially it is designed to provide protections for KVM encrypted guest memory. Normally KVM uses memfd memory via mmapping the memfd into KVM userspace (e.g. QEMU) and then using the mmaped virtual address to setup the mapping in the KVM secondary page table (e.g. EPT). With confidential computing technologies like Intel TDX, the memfd memory may be encrypted with special key for special software domain (e.g. KVM guest) and is not expected to be directly accessed by userspace. Precisely, userspace access to such encrypted memory may lead to host crash so should be prevented. memfd_restricted() provides semantics required for KVM guest encrypted memory support that a fd created with memfd_restricted() is going to be used as the source of guest memory in confidential computing environment and KVM can directly interact with core-mm without the need to expose the memoy content into KVM userspace. KVM userspace is still in charge of the lifecycle of the fd. It should pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to obtain the physical memory page and then uses it to populate the KVM secondary page table entries. The userspace restricted memfd can be fallocate-ed or hole-punched from userspace. When hole-punched, KVM can get notified through invalidate_start/invalidate_end() callbacks, KVM then gets chance to remove any mapped entries of the range in the secondary page tables. Machine check can happen for memory pages in the restricted memfd, instead of routing this directly to userspace, we call the error() callback that KVM registered. KVM then gets chance to handle it correctly. memfd_restricted() itself is implemented as a shim layer on top of real memory file systems (currently tmpfs). Pages in restrictedmem are marked as unmovable and unevictable, this is required for current confidential usage. But in future this might be changed. By default memfd_restricted() prevents userspace read, write and mmap. By defining new bit in the 'flags', it can be extended to support other restricted semantics in the future. The system call is currently wired up for x86 arch. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> --- arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + include/linux/restrictedmem.h | 71 ++++++ include/linux/syscalls.h | 1 + include/uapi/asm-generic/unistd.h | 5 +- include/uapi/linux/magic.h | 1 + kernel/sys_ni.c | 3 + mm/Kconfig | 4 + mm/Makefile | 1 + mm/memory-failure.c | 3 + mm/restrictedmem.c | 318 +++++++++++++++++++++++++ 11 files changed, 408 insertions(+), 1 deletion(-) create mode 100644 include/linux/restrictedmem.h create mode 100644 mm/restrictedmem.c
Comments
Hi, On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote: > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> > > Introduce 'memfd_restricted' system call with the ability to create > memory areas that are restricted from userspace access through ordinary > MMU operations (e.g. read/write/mmap). The memory content is expected to > be used through the new in-kernel interface by a third kernel module. > > memfd_restricted() is useful for scenarios where a file descriptor(fd) > can be used as an interface into mm but want to restrict userspace's > ability on the fd. Initially it is designed to provide protections for > KVM encrypted guest memory. > > Normally KVM uses memfd memory via mmapping the memfd into KVM userspace > (e.g. QEMU) and then using the mmaped virtual address to setup the > mapping in the KVM secondary page table (e.g. EPT). With confidential > computing technologies like Intel TDX, the memfd memory may be encrypted > with special key for special software domain (e.g. KVM guest) and is not > expected to be directly accessed by userspace. Precisely, userspace > access to such encrypted memory may lead to host crash so should be > prevented. > > memfd_restricted() provides semantics required for KVM guest encrypted > memory support that a fd created with memfd_restricted() is going to be > used as the source of guest memory in confidential computing environment > and KVM can directly interact with core-mm without the need to expose > the memoy content into KVM userspace. nit: memory > > KVM userspace is still in charge of the lifecycle of the fd. It should > pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to > obtain the physical memory page and then uses it to populate the KVM > secondary page table entries. > > The userspace restricted memfd can be fallocate-ed or hole-punched > from userspace. When hole-punched, KVM can get notified through > invalidate_start/invalidate_end() callbacks, KVM then gets chance to > remove any mapped entries of the range in the secondary page tables. > > Machine check can happen for memory pages in the restricted memfd, > instead of routing this directly to userspace, we call the error() > callback that KVM registered. KVM then gets chance to handle it > correctly. > > memfd_restricted() itself is implemented as a shim layer on top of real > memory file systems (currently tmpfs). Pages in restrictedmem are marked > as unmovable and unevictable, this is required for current confidential > usage. But in future this might be changed. > > By default memfd_restricted() prevents userspace read, write and mmap. > By defining new bit in the 'flags', it can be extended to support other > restricted semantics in the future. > > The system call is currently wired up for x86 arch. Reviewed-by: Fuad Tabba <tabba@google.com> After wiring the system call for arm64 (on qemu/arm64): Tested-by: Fuad Tabba <tabba@google.com> Cheers, /fuad > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > --- > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > include/linux/restrictedmem.h | 71 ++++++ > include/linux/syscalls.h | 1 + > include/uapi/asm-generic/unistd.h | 5 +- > include/uapi/linux/magic.h | 1 + > kernel/sys_ni.c | 3 + > mm/Kconfig | 4 + > mm/Makefile | 1 + > mm/memory-failure.c | 3 + > mm/restrictedmem.c | 318 +++++++++++++++++++++++++ > 11 files changed, 408 insertions(+), 1 deletion(-) > create mode 100644 include/linux/restrictedmem.h > create mode 100644 mm/restrictedmem.c > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl > index 320480a8db4f..dc70ba90247e 100644 > --- a/arch/x86/entry/syscalls/syscall_32.tbl > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > @@ -455,3 +455,4 @@ > 448 i386 process_mrelease sys_process_mrelease > 449 i386 futex_waitv sys_futex_waitv > 450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node > +451 i386 memfd_restricted sys_memfd_restricted > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index c84d12608cd2..06516abc8318 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -372,6 +372,7 @@ > 448 common process_mrelease sys_process_mrelease > 449 common futex_waitv sys_futex_waitv > 450 common set_mempolicy_home_node sys_set_mempolicy_home_node > +451 common memfd_restricted sys_memfd_restricted > > # > # Due to a historical design error, certain syscalls are numbered differently > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h > new file mode 100644 > index 000000000000..c2700c5daa43 > --- /dev/null > +++ b/include/linux/restrictedmem.h > @@ -0,0 +1,71 @@ > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > +#ifndef _LINUX_RESTRICTEDMEM_H > + > +#include <linux/file.h> > +#include <linux/magic.h> > +#include <linux/pfn_t.h> > + > +struct restrictedmem_notifier; > + > +struct restrictedmem_notifier_ops { > + void (*invalidate_start)(struct restrictedmem_notifier *notifier, > + pgoff_t start, pgoff_t end); > + void (*invalidate_end)(struct restrictedmem_notifier *notifier, > + pgoff_t start, pgoff_t end); > + void (*error)(struct restrictedmem_notifier *notifier, > + pgoff_t start, pgoff_t end); > +}; > + > +struct restrictedmem_notifier { > + struct list_head list; > + const struct restrictedmem_notifier_ops *ops; > +}; > + > +#ifdef CONFIG_RESTRICTEDMEM > + > +void restrictedmem_register_notifier(struct file *file, > + struct restrictedmem_notifier *notifier); > +void restrictedmem_unregister_notifier(struct file *file, > + struct restrictedmem_notifier *notifier); > + > +int restrictedmem_get_page(struct file *file, pgoff_t offset, > + struct page **pagep, int *order); > + > +static inline bool file_is_restrictedmem(struct file *file) > +{ > + return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC; > +} > + > +void restrictedmem_error_page(struct page *page, struct address_space *mapping); > + > +#else > + > +static inline void restrictedmem_register_notifier(struct file *file, > + struct restrictedmem_notifier *notifier) > +{ > +} > + > +static inline void restrictedmem_unregister_notifier(struct file *file, > + struct restrictedmem_notifier *notifier) > +{ > +} > + > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset, > + struct page **pagep, int *order) > +{ > + return -1; > +} > + > +static inline bool file_is_restrictedmem(struct file *file) > +{ > + return false; > +} > + > +static inline void restrictedmem_error_page(struct page *page, > + struct address_space *mapping) > +{ > +} > + > +#endif /* CONFIG_RESTRICTEDMEM */ > + > +#endif /* _LINUX_RESTRICTEDMEM_H */ > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index a34b0f9a9972..f9e9e0c820c5 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags); > asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len, > unsigned long home_node, > unsigned long flags); > +asmlinkage long sys_memfd_restricted(unsigned int flags); > > /* > * Architecture-specific system calls > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h > index 45fa180cc56a..e93cd35e46d0 100644 > --- a/include/uapi/asm-generic/unistd.h > +++ b/include/uapi/asm-generic/unistd.h > @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) > #define __NR_set_mempolicy_home_node 450 > __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) > > +#define __NR_memfd_restricted 451 > +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted) > + > #undef __NR_syscalls > -#define __NR_syscalls 451 > +#define __NR_syscalls 452 > > /* > * 32 bit systems traditionally used different > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h > index 6325d1d0e90f..8aa38324b90a 100644 > --- a/include/uapi/linux/magic.h > +++ b/include/uapi/linux/magic.h > @@ -101,5 +101,6 @@ > #define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */ > #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */ > #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */ > +#define RESTRICTEDMEM_MAGIC 0x5245534d /* "RESM" */ > > #endif /* __LINUX_MAGIC_H__ */ > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > index 860b2dcf3ac4..7c4a32cbd2e7 100644 > --- a/kernel/sys_ni.c > +++ b/kernel/sys_ni.c > @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free); > /* memfd_secret */ > COND_SYSCALL(memfd_secret); > > +/* memfd_restricted */ > +COND_SYSCALL(memfd_restricted); > + > /* > * Architecture specific weak syscall entries. > */ > diff --git a/mm/Kconfig b/mm/Kconfig > index 57e1d8c5b505..06b0e1d6b8c1 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -1076,6 +1076,10 @@ config IO_MAPPING > config SECRETMEM > def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED > > +config RESTRICTEDMEM > + bool > + depends on TMPFS > + > config ANON_VMA_NAME > bool "Anonymous VMA name support" > depends on PROC_FS && ADVISE_SYSCALLS && MMU > diff --git a/mm/Makefile b/mm/Makefile > index 8e105e5b3e29..bcbb0edf9ba1 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -121,6 +121,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o > obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o > obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o > obj-$(CONFIG_SECRETMEM) += secretmem.o > +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o > obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o > obj-$(CONFIG_USERFAULTFD) += userfaultfd.o > obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 145bb561ddb3..f91b444e471e 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -62,6 +62,7 @@ > #include <linux/page-isolation.h> > #include <linux/pagewalk.h> > #include <linux/shmem_fs.h> > +#include <linux/restrictedmem.h> > #include "swap.h" > #include "internal.h" > #include "ras/ras_event.h" > @@ -940,6 +941,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p) > goto out; > } > > + restrictedmem_error_page(p, mapping); > + > /* > * The shmem page is kept in page cache instead of truncating > * so is expected to have an extra refcount after error-handling. > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c > new file mode 100644 > index 000000000000..56953c204e5c > --- /dev/null > +++ b/mm/restrictedmem.c > @@ -0,0 +1,318 @@ > +// SPDX-License-Identifier: GPL-2.0 > +#include "linux/sbitmap.h" > +#include <linux/pagemap.h> > +#include <linux/pseudo_fs.h> > +#include <linux/shmem_fs.h> > +#include <linux/syscalls.h> > +#include <uapi/linux/falloc.h> > +#include <uapi/linux/magic.h> > +#include <linux/restrictedmem.h> > + > +struct restrictedmem_data { > + struct mutex lock; > + struct file *memfd; > + struct list_head notifiers; > +}; > + > +static void restrictedmem_invalidate_start(struct restrictedmem_data *data, > + pgoff_t start, pgoff_t end) > +{ > + struct restrictedmem_notifier *notifier; > + > + mutex_lock(&data->lock); > + list_for_each_entry(notifier, &data->notifiers, list) { > + notifier->ops->invalidate_start(notifier, start, end); > + } > + mutex_unlock(&data->lock); > +} > + > +static void restrictedmem_invalidate_end(struct restrictedmem_data *data, > + pgoff_t start, pgoff_t end) > +{ > + struct restrictedmem_notifier *notifier; > + > + mutex_lock(&data->lock); > + list_for_each_entry(notifier, &data->notifiers, list) { > + notifier->ops->invalidate_end(notifier, start, end); > + } > + mutex_unlock(&data->lock); > +} > + > +static void restrictedmem_notifier_error(struct restrictedmem_data *data, > + pgoff_t start, pgoff_t end) > +{ > + struct restrictedmem_notifier *notifier; > + > + mutex_lock(&data->lock); > + list_for_each_entry(notifier, &data->notifiers, list) { > + notifier->ops->error(notifier, start, end); > + } > + mutex_unlock(&data->lock); > +} > + > +static int restrictedmem_release(struct inode *inode, struct file *file) > +{ > + struct restrictedmem_data *data = inode->i_mapping->private_data; > + > + fput(data->memfd); > + kfree(data); > + return 0; > +} > + > +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode, > + loff_t offset, loff_t len) > +{ > + int ret; > + pgoff_t start, end; > + struct file *memfd = data->memfd; > + > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) > + return -EINVAL; > + > + start = offset >> PAGE_SHIFT; > + end = (offset + len) >> PAGE_SHIFT; > + > + restrictedmem_invalidate_start(data, start, end); > + ret = memfd->f_op->fallocate(memfd, mode, offset, len); > + restrictedmem_invalidate_end(data, start, end); > + > + return ret; > +} > + > +static long restrictedmem_fallocate(struct file *file, int mode, > + loff_t offset, loff_t len) > +{ > + struct restrictedmem_data *data = file->f_mapping->private_data; > + struct file *memfd = data->memfd; > + > + if (mode & FALLOC_FL_PUNCH_HOLE) > + return restrictedmem_punch_hole(data, mode, offset, len); > + > + return memfd->f_op->fallocate(memfd, mode, offset, len); > +} > + > +static const struct file_operations restrictedmem_fops = { > + .release = restrictedmem_release, > + .fallocate = restrictedmem_fallocate, > +}; > + > +static int restrictedmem_getattr(struct user_namespace *mnt_userns, > + const struct path *path, struct kstat *stat, > + u32 request_mask, unsigned int query_flags) > +{ > + struct inode *inode = d_inode(path->dentry); > + struct restrictedmem_data *data = inode->i_mapping->private_data; > + struct file *memfd = data->memfd; > + > + return memfd->f_inode->i_op->getattr(mnt_userns, path, stat, > + request_mask, query_flags); > +} > + > +static int restrictedmem_setattr(struct user_namespace *mnt_userns, > + struct dentry *dentry, struct iattr *attr) > +{ > + struct inode *inode = d_inode(dentry); > + struct restrictedmem_data *data = inode->i_mapping->private_data; > + struct file *memfd = data->memfd; > + int ret; > + > + if (attr->ia_valid & ATTR_SIZE) { > + if (memfd->f_inode->i_size) > + return -EPERM; > + > + if (!PAGE_ALIGNED(attr->ia_size)) > + return -EINVAL; > + } > + > + ret = memfd->f_inode->i_op->setattr(mnt_userns, > + file_dentry(memfd), attr); > + return ret; > +} > + > +static const struct inode_operations restrictedmem_iops = { > + .getattr = restrictedmem_getattr, > + .setattr = restrictedmem_setattr, > +}; > + > +static int restrictedmem_init_fs_context(struct fs_context *fc) > +{ > + if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC)) > + return -ENOMEM; > + > + fc->s_iflags |= SB_I_NOEXEC; > + return 0; > +} > + > +static struct file_system_type restrictedmem_fs = { > + .owner = THIS_MODULE, > + .name = "memfd:restrictedmem", > + .init_fs_context = restrictedmem_init_fs_context, > + .kill_sb = kill_anon_super, > +}; > + > +static struct vfsmount *restrictedmem_mnt; > + > +static __init int restrictedmem_init(void) > +{ > + restrictedmem_mnt = kern_mount(&restrictedmem_fs); > + if (IS_ERR(restrictedmem_mnt)) > + return PTR_ERR(restrictedmem_mnt); > + return 0; > +} > +fs_initcall(restrictedmem_init); > + > +static struct file *restrictedmem_file_create(struct file *memfd) > +{ > + struct restrictedmem_data *data; > + struct address_space *mapping; > + struct inode *inode; > + struct file *file; > + > + data = kzalloc(sizeof(*data), GFP_KERNEL); > + if (!data) > + return ERR_PTR(-ENOMEM); > + > + data->memfd = memfd; > + mutex_init(&data->lock); > + INIT_LIST_HEAD(&data->notifiers); > + > + inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb); > + if (IS_ERR(inode)) { > + kfree(data); > + return ERR_CAST(inode); > + } > + > + inode->i_mode |= S_IFREG; > + inode->i_op = &restrictedmem_iops; > + inode->i_mapping->private_data = data; > + > + file = alloc_file_pseudo(inode, restrictedmem_mnt, > + "restrictedmem", O_RDWR, > + &restrictedmem_fops); > + if (IS_ERR(file)) { > + iput(inode); > + kfree(data); > + return ERR_CAST(file); > + } > + > + file->f_flags |= O_LARGEFILE; > + > + /* > + * These pages are currently unmovable so don't place them into movable > + * pageblocks (e.g. CMA and ZONE_MOVABLE). > + */ > + mapping = memfd->f_mapping; > + mapping_set_unevictable(mapping); > + mapping_set_gfp_mask(mapping, > + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); > + > + return file; > +} > + > +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > +{ > + struct file *file, *restricted_file; > + int fd, err; > + > + if (flags) > + return -EINVAL; > + > + fd = get_unused_fd_flags(0); > + if (fd < 0) > + return fd; > + > + file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); > + if (IS_ERR(file)) { > + err = PTR_ERR(file); > + goto err_fd; > + } > + file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; > + file->f_flags |= O_LARGEFILE; > + > + restricted_file = restrictedmem_file_create(file); > + if (IS_ERR(restricted_file)) { > + err = PTR_ERR(restricted_file); > + fput(file); > + goto err_fd; > + } > + > + fd_install(fd, restricted_file); > + return fd; > +err_fd: > + put_unused_fd(fd); > + return err; > +} > + > +void restrictedmem_register_notifier(struct file *file, > + struct restrictedmem_notifier *notifier) > +{ > + struct restrictedmem_data *data = file->f_mapping->private_data; > + > + mutex_lock(&data->lock); > + list_add(¬ifier->list, &data->notifiers); > + mutex_unlock(&data->lock); > +} > +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier); > + > +void restrictedmem_unregister_notifier(struct file *file, > + struct restrictedmem_notifier *notifier) > +{ > + struct restrictedmem_data *data = file->f_mapping->private_data; > + > + mutex_lock(&data->lock); > + list_del(¬ifier->list); > + mutex_unlock(&data->lock); > +} > +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier); > + > +int restrictedmem_get_page(struct file *file, pgoff_t offset, > + struct page **pagep, int *order) > +{ > + struct restrictedmem_data *data = file->f_mapping->private_data; > + struct file *memfd = data->memfd; > + struct folio *folio; > + struct page *page; > + int ret; > + > + ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE); > + if (ret) > + return ret; > + > + page = folio_file_page(folio, offset); > + *pagep = page; > + if (order) > + *order = thp_order(compound_head(page)); > + > + SetPageUptodate(page); > + unlock_page(page); > + > + return 0; > +} > +EXPORT_SYMBOL_GPL(restrictedmem_get_page); > + > +void restrictedmem_error_page(struct page *page, struct address_space *mapping) > +{ > + struct super_block *sb = restrictedmem_mnt->mnt_sb; > + struct inode *inode, *next; > + > + if (!shmem_mapping(mapping)) > + return; > + > + spin_lock(&sb->s_inode_list_lock); > + list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { > + struct restrictedmem_data *data = inode->i_mapping->private_data; > + struct file *memfd = data->memfd; > + > + if (memfd->f_mapping == mapping) { > + pgoff_t start, end; > + > + spin_unlock(&sb->s_inode_list_lock); > + > + start = page->index; > + end = start + thp_nr_pages(page); > + restrictedmem_notifier_error(data, start, end); > + return; > + } > + } > + spin_unlock(&sb->s_inode_list_lock); > +} > -- > 2.25.1 >
On Tue, Dec 06, 2022 at 02:57:04PM +0000, Fuad Tabba wrote: > Hi, > > On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote: > > > > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> > > > > Introduce 'memfd_restricted' system call with the ability to create > > memory areas that are restricted from userspace access through ordinary > > MMU operations (e.g. read/write/mmap). The memory content is expected to > > be used through the new in-kernel interface by a third kernel module. > > > > memfd_restricted() is useful for scenarios where a file descriptor(fd) > > can be used as an interface into mm but want to restrict userspace's > > ability on the fd. Initially it is designed to provide protections for > > KVM encrypted guest memory. > > > > Normally KVM uses memfd memory via mmapping the memfd into KVM userspace > > (e.g. QEMU) and then using the mmaped virtual address to setup the > > mapping in the KVM secondary page table (e.g. EPT). With confidential > > computing technologies like Intel TDX, the memfd memory may be encrypted > > with special key for special software domain (e.g. KVM guest) and is not > > expected to be directly accessed by userspace. Precisely, userspace > > access to such encrypted memory may lead to host crash so should be > > prevented. > > > > memfd_restricted() provides semantics required for KVM guest encrypted > > memory support that a fd created with memfd_restricted() is going to be > > used as the source of guest memory in confidential computing environment > > and KVM can directly interact with core-mm without the need to expose > > the memoy content into KVM userspace. > > nit: memory Ya! > > > > > KVM userspace is still in charge of the lifecycle of the fd. It should > > pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to > > obtain the physical memory page and then uses it to populate the KVM > > secondary page table entries. > > > > The userspace restricted memfd can be fallocate-ed or hole-punched > > from userspace. When hole-punched, KVM can get notified through > > invalidate_start/invalidate_end() callbacks, KVM then gets chance to > > remove any mapped entries of the range in the secondary page tables. > > > > Machine check can happen for memory pages in the restricted memfd, > > instead of routing this directly to userspace, we call the error() > > callback that KVM registered. KVM then gets chance to handle it > > correctly. > > > > memfd_restricted() itself is implemented as a shim layer on top of real > > memory file systems (currently tmpfs). Pages in restrictedmem are marked > > as unmovable and unevictable, this is required for current confidential > > usage. But in future this might be changed. > > > > By default memfd_restricted() prevents userspace read, write and mmap. > > By defining new bit in the 'flags', it can be extended to support other > > restricted semantics in the future. > > > > The system call is currently wired up for x86 arch. > > Reviewed-by: Fuad Tabba <tabba@google.com> > After wiring the system call for arm64 (on qemu/arm64): > Tested-by: Fuad Tabba <tabba@google.com> Thanks. Chao > > Cheers, > /fuad > > > > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > > --- > > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > > include/linux/restrictedmem.h | 71 ++++++ > > include/linux/syscalls.h | 1 + > > include/uapi/asm-generic/unistd.h | 5 +- > > include/uapi/linux/magic.h | 1 + > > kernel/sys_ni.c | 3 + > > mm/Kconfig | 4 + > > mm/Makefile | 1 + > > mm/memory-failure.c | 3 + > > mm/restrictedmem.c | 318 +++++++++++++++++++++++++ > > 11 files changed, 408 insertions(+), 1 deletion(-) > > create mode 100644 include/linux/restrictedmem.h > > create mode 100644 mm/restrictedmem.c > > > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl > > index 320480a8db4f..dc70ba90247e 100644 > > --- a/arch/x86/entry/syscalls/syscall_32.tbl > > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > > @@ -455,3 +455,4 @@ > > 448 i386 process_mrelease sys_process_mrelease > > 449 i386 futex_waitv sys_futex_waitv > > 450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node > > +451 i386 memfd_restricted sys_memfd_restricted > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > > index c84d12608cd2..06516abc8318 100644 > > --- a/arch/x86/entry/syscalls/syscall_64.tbl > > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > > @@ -372,6 +372,7 @@ > > 448 common process_mrelease sys_process_mrelease > > 449 common futex_waitv sys_futex_waitv > > 450 common set_mempolicy_home_node sys_set_mempolicy_home_node > > +451 common memfd_restricted sys_memfd_restricted > > > > # > > # Due to a historical design error, certain syscalls are numbered differently > > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h > > new file mode 100644 > > index 000000000000..c2700c5daa43 > > --- /dev/null > > +++ b/include/linux/restrictedmem.h > > @@ -0,0 +1,71 @@ > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > > +#ifndef _LINUX_RESTRICTEDMEM_H > > + > > +#include <linux/file.h> > > +#include <linux/magic.h> > > +#include <linux/pfn_t.h> > > + > > +struct restrictedmem_notifier; > > + > > +struct restrictedmem_notifier_ops { > > + void (*invalidate_start)(struct restrictedmem_notifier *notifier, > > + pgoff_t start, pgoff_t end); > > + void (*invalidate_end)(struct restrictedmem_notifier *notifier, > > + pgoff_t start, pgoff_t end); > > + void (*error)(struct restrictedmem_notifier *notifier, > > + pgoff_t start, pgoff_t end); > > +}; > > + > > +struct restrictedmem_notifier { > > + struct list_head list; > > + const struct restrictedmem_notifier_ops *ops; > > +}; > > + > > +#ifdef CONFIG_RESTRICTEDMEM > > + > > +void restrictedmem_register_notifier(struct file *file, > > + struct restrictedmem_notifier *notifier); > > +void restrictedmem_unregister_notifier(struct file *file, > > + struct restrictedmem_notifier *notifier); > > + > > +int restrictedmem_get_page(struct file *file, pgoff_t offset, > > + struct page **pagep, int *order); > > + > > +static inline bool file_is_restrictedmem(struct file *file) > > +{ > > + return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC; > > +} > > + > > +void restrictedmem_error_page(struct page *page, struct address_space *mapping); > > + > > +#else > > + > > +static inline void restrictedmem_register_notifier(struct file *file, > > + struct restrictedmem_notifier *notifier) > > +{ > > +} > > + > > +static inline void restrictedmem_unregister_notifier(struct file *file, > > + struct restrictedmem_notifier *notifier) > > +{ > > +} > > + > > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset, > > + struct page **pagep, int *order) > > +{ > > + return -1; > > +} > > + > > +static inline bool file_is_restrictedmem(struct file *file) > > +{ > > + return false; > > +} > > + > > +static inline void restrictedmem_error_page(struct page *page, > > + struct address_space *mapping) > > +{ > > +} > > + > > +#endif /* CONFIG_RESTRICTEDMEM */ > > + > > +#endif /* _LINUX_RESTRICTEDMEM_H */ > > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > > index a34b0f9a9972..f9e9e0c820c5 100644 > > --- a/include/linux/syscalls.h > > +++ b/include/linux/syscalls.h > > @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags); > > asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len, > > unsigned long home_node, > > unsigned long flags); > > +asmlinkage long sys_memfd_restricted(unsigned int flags); > > > > /* > > * Architecture-specific system calls > > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h > > index 45fa180cc56a..e93cd35e46d0 100644 > > --- a/include/uapi/asm-generic/unistd.h > > +++ b/include/uapi/asm-generic/unistd.h > > @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) > > #define __NR_set_mempolicy_home_node 450 > > __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) > > > > +#define __NR_memfd_restricted 451 > > +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted) > > + > > #undef __NR_syscalls > > -#define __NR_syscalls 451 > > +#define __NR_syscalls 452 > > > > /* > > * 32 bit systems traditionally used different > > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h > > index 6325d1d0e90f..8aa38324b90a 100644 > > --- a/include/uapi/linux/magic.h > > +++ b/include/uapi/linux/magic.h > > @@ -101,5 +101,6 @@ > > #define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */ > > #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */ > > #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */ > > +#define RESTRICTEDMEM_MAGIC 0x5245534d /* "RESM" */ > > > > #endif /* __LINUX_MAGIC_H__ */ > > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > > index 860b2dcf3ac4..7c4a32cbd2e7 100644 > > --- a/kernel/sys_ni.c > > +++ b/kernel/sys_ni.c > > @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free); > > /* memfd_secret */ > > COND_SYSCALL(memfd_secret); > > > > +/* memfd_restricted */ > > +COND_SYSCALL(memfd_restricted); > > + > > /* > > * Architecture specific weak syscall entries. > > */ > > diff --git a/mm/Kconfig b/mm/Kconfig > > index 57e1d8c5b505..06b0e1d6b8c1 100644 > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -1076,6 +1076,10 @@ config IO_MAPPING > > config SECRETMEM > > def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED > > > > +config RESTRICTEDMEM > > + bool > > + depends on TMPFS > > + > > config ANON_VMA_NAME > > bool "Anonymous VMA name support" > > depends on PROC_FS && ADVISE_SYSCALLS && MMU > > diff --git a/mm/Makefile b/mm/Makefile > > index 8e105e5b3e29..bcbb0edf9ba1 100644 > > --- a/mm/Makefile > > +++ b/mm/Makefile > > @@ -121,6 +121,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o > > obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o > > obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o > > obj-$(CONFIG_SECRETMEM) += secretmem.o > > +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o > > obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o > > obj-$(CONFIG_USERFAULTFD) += userfaultfd.o > > obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > > index 145bb561ddb3..f91b444e471e 100644 > > --- a/mm/memory-failure.c > > +++ b/mm/memory-failure.c > > @@ -62,6 +62,7 @@ > > #include <linux/page-isolation.h> > > #include <linux/pagewalk.h> > > #include <linux/shmem_fs.h> > > +#include <linux/restrictedmem.h> > > #include "swap.h" > > #include "internal.h" > > #include "ras/ras_event.h" > > @@ -940,6 +941,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p) > > goto out; > > } > > > > + restrictedmem_error_page(p, mapping); > > + > > /* > > * The shmem page is kept in page cache instead of truncating > > * so is expected to have an extra refcount after error-handling. > > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c > > new file mode 100644 > > index 000000000000..56953c204e5c > > --- /dev/null > > +++ b/mm/restrictedmem.c > > @@ -0,0 +1,318 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +#include "linux/sbitmap.h" > > +#include <linux/pagemap.h> > > +#include <linux/pseudo_fs.h> > > +#include <linux/shmem_fs.h> > > +#include <linux/syscalls.h> > > +#include <uapi/linux/falloc.h> > > +#include <uapi/linux/magic.h> > > +#include <linux/restrictedmem.h> > > + > > +struct restrictedmem_data { > > + struct mutex lock; > > + struct file *memfd; > > + struct list_head notifiers; > > +}; > > + > > +static void restrictedmem_invalidate_start(struct restrictedmem_data *data, > > + pgoff_t start, pgoff_t end) > > +{ > > + struct restrictedmem_notifier *notifier; > > + > > + mutex_lock(&data->lock); > > + list_for_each_entry(notifier, &data->notifiers, list) { > > + notifier->ops->invalidate_start(notifier, start, end); > > + } > > + mutex_unlock(&data->lock); > > +} > > + > > +static void restrictedmem_invalidate_end(struct restrictedmem_data *data, > > + pgoff_t start, pgoff_t end) > > +{ > > + struct restrictedmem_notifier *notifier; > > + > > + mutex_lock(&data->lock); > > + list_for_each_entry(notifier, &data->notifiers, list) { > > + notifier->ops->invalidate_end(notifier, start, end); > > + } > > + mutex_unlock(&data->lock); > > +} > > + > > +static void restrictedmem_notifier_error(struct restrictedmem_data *data, > > + pgoff_t start, pgoff_t end) > > +{ > > + struct restrictedmem_notifier *notifier; > > + > > + mutex_lock(&data->lock); > > + list_for_each_entry(notifier, &data->notifiers, list) { > > + notifier->ops->error(notifier, start, end); > > + } > > + mutex_unlock(&data->lock); > > +} > > + > > +static int restrictedmem_release(struct inode *inode, struct file *file) > > +{ > > + struct restrictedmem_data *data = inode->i_mapping->private_data; > > + > > + fput(data->memfd); > > + kfree(data); > > + return 0; > > +} > > + > > +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode, > > + loff_t offset, loff_t len) > > +{ > > + int ret; > > + pgoff_t start, end; > > + struct file *memfd = data->memfd; > > + > > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) > > + return -EINVAL; > > + > > + start = offset >> PAGE_SHIFT; > > + end = (offset + len) >> PAGE_SHIFT; > > + > > + restrictedmem_invalidate_start(data, start, end); > > + ret = memfd->f_op->fallocate(memfd, mode, offset, len); > > + restrictedmem_invalidate_end(data, start, end); > > + > > + return ret; > > +} > > + > > +static long restrictedmem_fallocate(struct file *file, int mode, > > + loff_t offset, loff_t len) > > +{ > > + struct restrictedmem_data *data = file->f_mapping->private_data; > > + struct file *memfd = data->memfd; > > + > > + if (mode & FALLOC_FL_PUNCH_HOLE) > > + return restrictedmem_punch_hole(data, mode, offset, len); > > + > > + return memfd->f_op->fallocate(memfd, mode, offset, len); > > +} > > + > > +static const struct file_operations restrictedmem_fops = { > > + .release = restrictedmem_release, > > + .fallocate = restrictedmem_fallocate, > > +}; > > + > > +static int restrictedmem_getattr(struct user_namespace *mnt_userns, > > + const struct path *path, struct kstat *stat, > > + u32 request_mask, unsigned int query_flags) > > +{ > > + struct inode *inode = d_inode(path->dentry); > > + struct restrictedmem_data *data = inode->i_mapping->private_data; > > + struct file *memfd = data->memfd; > > + > > + return memfd->f_inode->i_op->getattr(mnt_userns, path, stat, > > + request_mask, query_flags); > > +} > > + > > +static int restrictedmem_setattr(struct user_namespace *mnt_userns, > > + struct dentry *dentry, struct iattr *attr) > > +{ > > + struct inode *inode = d_inode(dentry); > > + struct restrictedmem_data *data = inode->i_mapping->private_data; > > + struct file *memfd = data->memfd; > > + int ret; > > + > > + if (attr->ia_valid & ATTR_SIZE) { > > + if (memfd->f_inode->i_size) > > + return -EPERM; > > + > > + if (!PAGE_ALIGNED(attr->ia_size)) > > + return -EINVAL; > > + } > > + > > + ret = memfd->f_inode->i_op->setattr(mnt_userns, > > + file_dentry(memfd), attr); > > + return ret; > > +} > > + > > +static const struct inode_operations restrictedmem_iops = { > > + .getattr = restrictedmem_getattr, > > + .setattr = restrictedmem_setattr, > > +}; > > + > > +static int restrictedmem_init_fs_context(struct fs_context *fc) > > +{ > > + if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC)) > > + return -ENOMEM; > > + > > + fc->s_iflags |= SB_I_NOEXEC; > > + return 0; > > +} > > + > > +static struct file_system_type restrictedmem_fs = { > > + .owner = THIS_MODULE, > > + .name = "memfd:restrictedmem", > > + .init_fs_context = restrictedmem_init_fs_context, > > + .kill_sb = kill_anon_super, > > +}; > > + > > +static struct vfsmount *restrictedmem_mnt; > > + > > +static __init int restrictedmem_init(void) > > +{ > > + restrictedmem_mnt = kern_mount(&restrictedmem_fs); > > + if (IS_ERR(restrictedmem_mnt)) > > + return PTR_ERR(restrictedmem_mnt); > > + return 0; > > +} > > +fs_initcall(restrictedmem_init); > > + > > +static struct file *restrictedmem_file_create(struct file *memfd) > > +{ > > + struct restrictedmem_data *data; > > + struct address_space *mapping; > > + struct inode *inode; > > + struct file *file; > > + > > + data = kzalloc(sizeof(*data), GFP_KERNEL); > > + if (!data) > > + return ERR_PTR(-ENOMEM); > > + > > + data->memfd = memfd; > > + mutex_init(&data->lock); > > + INIT_LIST_HEAD(&data->notifiers); > > + > > + inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb); > > + if (IS_ERR(inode)) { > > + kfree(data); > > + return ERR_CAST(inode); > > + } > > + > > + inode->i_mode |= S_IFREG; > > + inode->i_op = &restrictedmem_iops; > > + inode->i_mapping->private_data = data; > > + > > + file = alloc_file_pseudo(inode, restrictedmem_mnt, > > + "restrictedmem", O_RDWR, > > + &restrictedmem_fops); > > + if (IS_ERR(file)) { > > + iput(inode); > > + kfree(data); > > + return ERR_CAST(file); > > + } > > + > > + file->f_flags |= O_LARGEFILE; > > + > > + /* > > + * These pages are currently unmovable so don't place them into movable > > + * pageblocks (e.g. CMA and ZONE_MOVABLE). > > + */ > > + mapping = memfd->f_mapping; > > + mapping_set_unevictable(mapping); > > + mapping_set_gfp_mask(mapping, > > + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); > > + > > + return file; > > +} > > + > > +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > > +{ > > + struct file *file, *restricted_file; > > + int fd, err; > > + > > + if (flags) > > + return -EINVAL; > > + > > + fd = get_unused_fd_flags(0); > > + if (fd < 0) > > + return fd; > > + > > + file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); > > + if (IS_ERR(file)) { > > + err = PTR_ERR(file); > > + goto err_fd; > > + } > > + file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; > > + file->f_flags |= O_LARGEFILE; > > + > > + restricted_file = restrictedmem_file_create(file); > > + if (IS_ERR(restricted_file)) { > > + err = PTR_ERR(restricted_file); > > + fput(file); > > + goto err_fd; > > + } > > + > > + fd_install(fd, restricted_file); > > + return fd; > > +err_fd: > > + put_unused_fd(fd); > > + return err; > > +} > > + > > +void restrictedmem_register_notifier(struct file *file, > > + struct restrictedmem_notifier *notifier) > > +{ > > + struct restrictedmem_data *data = file->f_mapping->private_data; > > + > > + mutex_lock(&data->lock); > > + list_add(¬ifier->list, &data->notifiers); > > + mutex_unlock(&data->lock); > > +} > > +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier); > > + > > +void restrictedmem_unregister_notifier(struct file *file, > > + struct restrictedmem_notifier *notifier) > > +{ > > + struct restrictedmem_data *data = file->f_mapping->private_data; > > + > > + mutex_lock(&data->lock); > > + list_del(¬ifier->list); > > + mutex_unlock(&data->lock); > > +} > > +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier); > > + > > +int restrictedmem_get_page(struct file *file, pgoff_t offset, > > + struct page **pagep, int *order) > > +{ > > + struct restrictedmem_data *data = file->f_mapping->private_data; > > + struct file *memfd = data->memfd; > > + struct folio *folio; > > + struct page *page; > > + int ret; > > + > > + ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE); > > + if (ret) > > + return ret; > > + > > + page = folio_file_page(folio, offset); > > + *pagep = page; > > + if (order) > > + *order = thp_order(compound_head(page)); > > + > > + SetPageUptodate(page); > > + unlock_page(page); > > + > > + return 0; > > +} > > +EXPORT_SYMBOL_GPL(restrictedmem_get_page); > > + > > +void restrictedmem_error_page(struct page *page, struct address_space *mapping) > > +{ > > + struct super_block *sb = restrictedmem_mnt->mnt_sb; > > + struct inode *inode, *next; > > + > > + if (!shmem_mapping(mapping)) > > + return; > > + > > + spin_lock(&sb->s_inode_list_lock); > > + list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { > > + struct restrictedmem_data *data = inode->i_mapping->private_data; > > + struct file *memfd = data->memfd; > > + > > + if (memfd->f_mapping == mapping) { > > + pgoff_t start, end; > > + > > + spin_unlock(&sb->s_inode_list_lock); > > + > > + start = page->index; > > + end = start + thp_nr_pages(page); > > + restrictedmem_notifier_error(data, start, end); > > + return; > > + } > > + } > > + spin_unlock(&sb->s_inode_list_lock); > > +} > > -- > > 2.25.1 > >
> > memfd_restricted() itself is implemented as a shim layer on top of real > memory file systems (currently tmpfs). Pages in restrictedmem are marked > as unmovable and unevictable, this is required for current confidential > usage. But in future this might be changed. > > I didn't dig full histroy, but I interpret this as we don't support page migration and swapping for restricted memfd for now. IMHO "page marked as unmovable" can be confused with PageMovable(), which is a different thing from this series. It's better to just say something like "those pages cannot be migrated and swapped". [...] > + > + /* > + * These pages are currently unmovable so don't place them into movable > + * pageblocks (e.g. CMA and ZONE_MOVABLE). > + */ > + mapping = memfd->f_mapping; > + mapping_set_unevictable(mapping); > + mapping_set_gfp_mask(mapping, > + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from non- movable zones, but doesn't necessarily prevent page from being migrated. My first glance is you need to implement either a_ops->migrate_folio() or just get_page() after faulting in the page to prevent. So I think the comment also needs improvement -- IMHO we can just call out currently those pages cannot be migrated and swapped, which is clearer (and the latter justifies mapping_set_unevictable() clearly).
On Tue, Dec 13, 2022 at 11:49:13PM +0000, Huang, Kai wrote: > > > > memfd_restricted() itself is implemented as a shim layer on top of real > > memory file systems (currently tmpfs). Pages in restrictedmem are marked > > as unmovable and unevictable, this is required for current confidential > > usage. But in future this might be changed. > > > > > I didn't dig full histroy, but I interpret this as we don't support page > migration and swapping for restricted memfd for now. IMHO "page marked as > unmovable" can be confused with PageMovable(), which is a different thing from > this series. It's better to just say something like "those pages cannot be > migrated and swapped". Yes, if that helps some clarification. > > [...] > > > + > > + /* > > + * These pages are currently unmovable so don't place them into movable > > + * pageblocks (e.g. CMA and ZONE_MOVABLE). > > + */ > > + mapping = memfd->f_mapping; > > + mapping_set_unevictable(mapping); > > + mapping_set_gfp_mask(mapping, > > + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from non- > movable zones, but doesn't necessarily prevent page from being migrated. My > first glance is you need to implement either a_ops->migrate_folio() or just > get_page() after faulting in the page to prevent. The current api restrictedmem_get_page() already does this, after the caller calling it, it holds a reference to the page. The caller then decides when to call put_page() appropriately. > > So I think the comment also needs improvement -- IMHO we can just call out > currently those pages cannot be migrated and swapped, which is clearer (and the > latter justifies mapping_set_unevictable() clearly). Good to me. Thanks, Chao > >
On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote: > > > > [...] > > > > > + > > > + /* > > > + * These pages are currently unmovable so don't place them into > > > movable > > > + * pageblocks (e.g. CMA and ZONE_MOVABLE). > > > + */ > > > + mapping = memfd->f_mapping; > > > + mapping_set_unevictable(mapping); > > > + mapping_set_gfp_mask(mapping, > > > +     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from > > non- > > movable zones, but doesn't necessarily prevent page from being migrated. My > > first glance is you need to implement either a_ops->migrate_folio() or just > > get_page() after faulting in the page to prevent. > > The current api restrictedmem_get_page() already does this, after the > caller calling it, it holds a reference to the page. The caller then > decides when to call put_page() appropriately. I tried to dig some history. Perhaps I am missing something, but it seems Kirill said in v9 that this code doesn't prevent page migration, and we need to increase page refcount in restrictedmem_get_page(): https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/ But looking at this series it seems restrictedmem_get_page() in this v10 is identical to the one in v9 (except v10 uses 'folio' instead of 'page')? Anyway if this is not fixed, then it should be fixed. Otherwise, a comment at the place where page refcount is increased will be helpful to help people understand page migration is actually prevented.
On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote: > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote: > > > > > > [...] > > > > > > > + > > > > + /* > > > > + * These pages are currently unmovable so don't place them into > > > > movable > > > > + * pageblocks (e.g. CMA and ZONE_MOVABLE). > > > > + */ > > > > + mapping = memfd->f_mapping; > > > > + mapping_set_unevictable(mapping); > > > > + mapping_set_gfp_mask(mapping, > > > > +     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); > > > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from > > > non- > > > movable zones, but doesn't necessarily prevent page from being migrated. My > > > first glance is you need to implement either a_ops->migrate_folio() or just > > > get_page() after faulting in the page to prevent. > > > > The current api restrictedmem_get_page() already does this, after the > > caller calling it, it holds a reference to the page. The caller then > > decides when to call put_page() appropriately. > > I tried to dig some history. Perhaps I am missing something, but it seems Kirill > said in v9 that this code doesn't prevent page migration, and we need to > increase page refcount in restrictedmem_get_page(): > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/ > > But looking at this series it seems restrictedmem_get_page() in this v10 is > identical to the one in v9 (except v10 uses 'folio' instead of 'page')? restrictedmem_get_page() increases page refcount several versions ago so no change in v10 is needed. You probably missed my reply: https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/ The current solution is clear: unless we have better approach, we will let restrictedmem user (KVM in this case) to hold the refcount to prevent page migration. Thanks, Chao > > Anyway if this is not fixed, then it should be fixed. Otherwise, a comment at > the place where page refcount is increased will be helpful to help people > understand page migration is actually prevented. >
On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote: > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote: > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote: > > > > > > > > [...] > > > > > > > > > + > > > > > + /* > > > > > + * These pages are currently unmovable so don't place them into > > > > > movable > > > > > + * pageblocks (e.g. CMA and ZONE_MOVABLE). > > > > > + */ > > > > > + mapping = memfd->f_mapping; > > > > > + mapping_set_unevictable(mapping); > > > > > + mapping_set_gfp_mask(mapping, > > > > > +     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); > > > > > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from > > > > non- > > > > movable zones, but doesn't necessarily prevent page from being migrated. My > > > > first glance is you need to implement either a_ops->migrate_folio() or just > > > > get_page() after faulting in the page to prevent. > > > > > > The current api restrictedmem_get_page() already does this, after the > > > caller calling it, it holds a reference to the page. The caller then > > > decides when to call put_page() appropriately. > > > > I tried to dig some history. Perhaps I am missing something, but it seems Kirill > > said in v9 that this code doesn't prevent page migration, and we need to > > increase page refcount in restrictedmem_get_page(): > > > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/ > > > > But looking at this series it seems restrictedmem_get_page() in this v10 is > > identical to the one in v9 (except v10 uses 'folio' instead of 'page')? > > restrictedmem_get_page() increases page refcount several versions ago so > no change in v10 is needed. You probably missed my reply: > > https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/ But for non-restricted-mem case, it is correct for KVM to decrease page's refcount after setting up mapping in the secondary mmu, otherwise the page will be pinned by KVM for normal VM (since KVM uses GUP to get the page). So what we are expecting is: for KVM if the page comes from restricted mem, then KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should. > > The current solution is clear: unless we have better approach, we will > let restrictedmem user (KVM in this case) to hold the refcount to > prevent page migration. > OK. Will leave to others :)
On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote: > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote: > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote: > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote: > > > > > > > > > > [...] > > > > > > > > > > > + > > > > > > + /* > > > > > > + * These pages are currently unmovable so don't place them into > > > > > > movable > > > > > > + * pageblocks (e.g. CMA and ZONE_MOVABLE). > > > > > > + */ > > > > > > + mapping = memfd->f_mapping; > > > > > > + mapping_set_unevictable(mapping); > > > > > > + mapping_set_gfp_mask(mapping, > > > > > > +     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); > > > > > > > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from > > > > > non- > > > > > movable zones, but doesn't necessarily prevent page from being migrated. My > > > > > first glance is you need to implement either a_ops->migrate_folio() or just > > > > > get_page() after faulting in the page to prevent. > > > > > > > > The current api restrictedmem_get_page() already does this, after the > > > > caller calling it, it holds a reference to the page. The caller then > > > > decides when to call put_page() appropriately. > > > > > > I tried to dig some history. Perhaps I am missing something, but it seems Kirill > > > said in v9 that this code doesn't prevent page migration, and we need to > > > increase page refcount in restrictedmem_get_page(): > > > > > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/ > > > > > > But looking at this series it seems restrictedmem_get_page() in this v10 is > > > identical to the one in v9 (except v10 uses 'folio' instead of 'page')? > > > > restrictedmem_get_page() increases page refcount several versions ago so > > no change in v10 is needed. You probably missed my reply: > > > > https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/ > > But for non-restricted-mem case, it is correct for KVM to decrease page's > refcount after setting up mapping in the secondary mmu, otherwise the page will > be pinned by KVM for normal VM (since KVM uses GUP to get the page). That's true. Actually even true for restrictedmem case, most likely we will still need the kvm_release_pfn_clean() for KVM generic code. On one side, other restrictedmem users like pKVM may not require page pinning at all. On the other side, see below. > > So what we are expecting is: for KVM if the page comes from restricted mem, then > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should. I argue that this page pinning (or page migration prevention) is not tied to where the page comes from, instead related to how the page will be used. Whether the page is restrictedmem backed or GUP() backed, once it's used by current version of TDX then the page pinning is needed. So such page migration prevention is really TDX thing, even not KVM generic thing (that's why I think we don't need change the existing logic of kvm_release_pfn_clean()). Wouldn't better to let TDX code (or who requires that) to increase/decrease the refcount when it populates/drops the secure EPT entries? This is exactly what the current TDX code does: get_page(): https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217 put_page(): https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334 Thanks, Chao > > > > > The current solution is clear: unless we have better approach, we will > > let restrictedmem user (KVM in this case) to hold the refcount to > > prevent page migration. > > > > OK. Will leave to others :) >
On Wed, 2022-12-21 at 21:39 +0800, Chao Peng wrote: > > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote: > > > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote: > > > > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote: > > > > > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote: > > > > > > > > > > > > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > + /* > > > > > > > > > > > > > > + * These pages are currently unmovable so don't place them into > > > > > > > > > > > > > > movable > > > > > > > > > > > > > > + * pageblocks (e.g. CMA and ZONE_MOVABLE). > > > > > > > > > > > > > > + */ > > > > > > > > > > > > > > + mapping = memfd->f_mapping; > > > > > > > > > > > > > > + mapping_set_unevictable(mapping); > > > > > > > > > > > > > > + mapping_set_gfp_mask(mapping, > > > > > > > > > > > > > > +     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); > > > > > > > > > > > > > > > > > > > > > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from > > > > > > > > > > > > non- > > > > > > > > > > > > movable zones, but doesn't necessarily prevent page from being migrated. My > > > > > > > > > > > > first glance is you need to implement either a_ops->migrate_folio() or just > > > > > > > > > > > > get_page() after faulting in the page to prevent. > > > > > > > > > > > > > > > > > > > > The current api restrictedmem_get_page() already does this, after the > > > > > > > > > > caller calling it, it holds a reference to the page. The caller then > > > > > > > > > > decides when to call put_page() appropriately. > > > > > > > > > > > > > > > > I tried to dig some history. Perhaps I am missing something, but it seems Kirill > > > > > > > > said in v9 that this code doesn't prevent page migration, and we need to > > > > > > > > increase page refcount in restrictedmem_get_page(): > > > > > > > > > > > > > > > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/ > > > > > > > > > > > > > > > > But looking at this series it seems restrictedmem_get_page() in this v10 is > > > > > > > > identical to the one in v9 (except v10 uses 'folio' instead of 'page')? > > > > > > > > > > > > restrictedmem_get_page() increases page refcount several versions ago so > > > > > > no change in v10 is needed. You probably missed my reply: > > > > > > > > > > > > https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/ > > > > > > > > But for non-restricted-mem case, it is correct for KVM to decrease page's > > > > refcount after setting up mapping in the secondary mmu, otherwise the page will > > > > be pinned by KVM for normal VM (since KVM uses GUP to get the page). > > > > That's true. Actually even true for restrictedmem case, most likely we > > will still need the kvm_release_pfn_clean() for KVM generic code. On one > > side, other restrictedmem users like pKVM may not require page pinning > > at all. On the other side, see below. OK. Agreed. > > > > > > > > > > So what we are expecting is: for KVM if the page comes from restricted mem, then > > > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should. > > > > I argue that this page pinning (or page migration prevention) is not > > tied to where the page comes from, instead related to how the page will > > be used. Whether the page is restrictedmem backed or GUP() backed, once > > it's used by current version of TDX then the page pinning is needed. So > > such page migration prevention is really TDX thing, even not KVM generic > > thing (that's why I think we don't need change the existing logic of > > kvm_release_pfn_clean()). > > This essentially boils down to who "owns" page migration handling, and sadly, page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page migration by itself -- it's just a passive receiver. For normal pages, page migration is totally done by the core-kernel (i.e. it unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops- >migrate_page() to actually migrate the page). In the sense of TDX, conceptually it should be done in the same way. The more important thing is: yes KVM can use get_page() to prevent page migration, but when KVM wants to support it, KVM cannot just remove get_page(), as the core- kernel will still just do migrate_page() which won't work for TDX (given restricted_memfd doesn't have a_ops->migrate_page() implemented). So I think the restricted_memfd filesystem should own page migration handling, (i.e. by implementing a_ops->migrate_page() to either just reject page migration or somehow support it). To support page migration, it may require KVM's help in case of TDX (the TDH.MEM.PAGE.RELOCATE SEAMCALL requires "GPA" and "level" of EPT mapping, which are only available in KVM), but that doesn't make KVM to own the handling of page migration. > > Wouldn't better to let TDX code (or who > > requires that) to increase/decrease the refcount when it populates/drops > > the secure EPT entries? This is exactly what the current TDX code does: > > > > get_page(): > > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217 > > > > put_page(): > > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334 > > As explained above, I think doing so in KVM is wrong: it can prevent by using get_page(), but you cannot simply remove it to support page migration. Sean also said similar thing when reviewing v8 KVM TDX series and I also agree: https://lore.kernel.org/lkml/Yvu5PsAndEbWKTHc@google.com/ https://lore.kernel.org/lkml/31fec1b4438a6d9bb7ff719f96caa8b23ed764d6.camel@intel.com/
On Wed, Dec 21, 2022, Chao Peng wrote: > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote: > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote: > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote: > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote: > > But for non-restricted-mem case, it is correct for KVM to decrease page's > > refcount after setting up mapping in the secondary mmu, otherwise the page will > > be pinned by KVM for normal VM (since KVM uses GUP to get the page). > > That's true. Actually even true for restrictedmem case, most likely we > will still need the kvm_release_pfn_clean() for KVM generic code. On one > side, other restrictedmem users like pKVM may not require page pinning > at all. On the other side, see below. > > > > > So what we are expecting is: for KVM if the page comes from restricted mem, then > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should. No, requiring the user (KVM) to guard against lack of support for page migration in restricted mem is a terrible API. It's totally fine for restricted mem to not support page migration until there's a use case, but punting the problem to KVM is not acceptable. Restricted mem itself doesn't yet support page migration, e.g. explosions would occur even if KVM wanted to allow migration since there is no notification to invalidate existing mappings. > I argue that this page pinning (or page migration prevention) is not > tied to where the page comes from, instead related to how the page will > be used. Whether the page is restrictedmem backed or GUP() backed, once > it's used by current version of TDX then the page pinning is needed. So > such page migration prevention is really TDX thing, even not KVM generic > thing (that's why I think we don't need change the existing logic of > kvm_release_pfn_clean()). Wouldn't better to let TDX code (or who > requires that) to increase/decrease the refcount when it populates/drops > the secure EPT entries? This is exactly what the current TDX code does: I agree that whether or not migration is supported should be controllable by the user, but I strongly disagree on punting refcount management to KVM (or TDX). The whole point of restricted mem is to support technologies like TDX and SNP, accomodating their special needs for things like page migration should be part of the API, not some footnote in the documenation. It's not difficult to let the user communicate support for page migration, e.g. if/when restricted mem gains support, add a hook to restrictedmem_notifier_ops to signal support (or lack thereof) for page migration. NULL == no migration, non-NULL == migration allowed. We know that supporting page migration in TDX and SNP is possible, and we know that page migration will require a dedicated API since the backing store can't memcpy() the page. I don't see any reason to ignore that eventuality. But again, unless I'm missing something, that's a future problem because restricted mem doesn't yet support page migration regardless of the downstream user.
On Thu, 2022-12-22 at 18:15 +0000, Sean Christopherson wrote: > On Wed, Dec 21, 2022, Chao Peng wrote: > > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote: > > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote: > > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote: > > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote: > > > But for non-restricted-mem case, it is correct for KVM to decrease page's > > > refcount after setting up mapping in the secondary mmu, otherwise the page will > > > be pinned by KVM for normal VM (since KVM uses GUP to get the page). > > > > That's true. Actually even true for restrictedmem case, most likely we > > will still need the kvm_release_pfn_clean() for KVM generic code. On one > > side, other restrictedmem users like pKVM may not require page pinning > > at all. On the other side, see below. > > > > > > > > So what we are expecting is: for KVM if the page comes from restricted mem, then > > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should. > > No, requiring the user (KVM) to guard against lack of support for page migration > in restricted mem is a terrible API. It's totally fine for restricted mem to not > support page migration until there's a use case, but punting the problem to KVM > is not acceptable. Restricted mem itself doesn't yet support page migration, > e.g. explosions would occur even if KVM wanted to allow migration since there is > no notification to invalidate existing mappings. > > Yes totally agree (I also replied separately).
On Thu, Dec 22, 2022 at 12:37:19AM +0000, Huang, Kai wrote: > On Wed, 2022-12-21 at 21:39 +0800, Chao Peng wrote: > > > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote: > > > > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote: > > > > > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote: > > > > > > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > + /* > > > > > > > > > > > > > > > + * These pages are currently unmovable so don't place them into > > > > > > > > > > > > > > > movable > > > > > > > > > > > > > > > + * pageblocks (e.g. CMA and ZONE_MOVABLE). > > > > > > > > > > > > > > > + */ > > > > > > > > > > > > > > > + mapping = memfd->f_mapping; > > > > > > > > > > > > > > > + mapping_set_unevictable(mapping); > > > > > > > > > > > > > > > + mapping_set_gfp_mask(mapping, > > > > > > > > > > > > > > > +     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); > > > > > > > > > > > > > > > > > > > > > > > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from > > > > > > > > > > > > > non- > > > > > > > > > > > > > movable zones, but doesn't necessarily prevent page from being migrated. My > > > > > > > > > > > > > first glance is you need to implement either a_ops->migrate_folio() or just > > > > > > > > > > > > > get_page() after faulting in the page to prevent. > > > > > > > > > > > > > > > > > > > > > > The current api restrictedmem_get_page() already does this, after the > > > > > > > > > > > caller calling it, it holds a reference to the page. The caller then > > > > > > > > > > > decides when to call put_page() appropriately. > > > > > > > > > > > > > > > > > > I tried to dig some history. Perhaps I am missing something, but it seems Kirill > > > > > > > > > said in v9 that this code doesn't prevent page migration, and we need to > > > > > > > > > increase page refcount in restrictedmem_get_page(): > > > > > > > > > > > > > > > > > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/ > > > > > > > > > > > > > > > > > > But looking at this series it seems restrictedmem_get_page() in this v10 is > > > > > > > > > identical to the one in v9 (except v10 uses 'folio' instead of 'page')? > > > > > > > > > > > > > > restrictedmem_get_page() increases page refcount several versions ago so > > > > > > > no change in v10 is needed. You probably missed my reply: > > > > > > > > > > > > > > https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/ > > > > > > > > > > But for non-restricted-mem case, it is correct for KVM to decrease page's > > > > > refcount after setting up mapping in the secondary mmu, otherwise the page will > > > > > be pinned by KVM for normal VM (since KVM uses GUP to get the page). > > > > > > That's true. Actually even true for restrictedmem case, most likely we > > > will still need the kvm_release_pfn_clean() for KVM generic code. On one > > > side, other restrictedmem users like pKVM may not require page pinning > > > at all. On the other side, see below. > > OK. Agreed. > > > > > > > > > > > > > > So what we are expecting is: for KVM if the page comes from restricted mem, then > > > > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should. > > > > > > I argue that this page pinning (or page migration prevention) is not > > > tied to where the page comes from, instead related to how the page will > > > be used. Whether the page is restrictedmem backed or GUP() backed, once > > > it's used by current version of TDX then the page pinning is needed. So > > > such page migration prevention is really TDX thing, even not KVM generic > > > thing (that's why I think we don't need change the existing logic of > > > kvm_release_pfn_clean()). > > > > > This essentially boils down to who "owns" page migration handling, and sadly, > page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page > migration by itself -- it's just a passive receiver. No, I'm not talking on the page migration handling itself, I know page migration requires coordination from both core-mm and KVM. I'm more concerning on the page migration prevention here. This is something we need to address for TDX before the page migration is supported. > > For normal pages, page migration is totally done by the core-kernel (i.e. it > unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops- > >migrate_page() to actually migrate the page). > > In the sense of TDX, conceptually it should be done in the same way. The more > important thing is: yes KVM can use get_page() to prevent page migration, but > when KVM wants to support it, KVM cannot just remove get_page(), as the core- > kernel will still just do migrate_page() which won't work for TDX (given > restricted_memfd doesn't have a_ops->migrate_page() implemented). > > So I think the restricted_memfd filesystem should own page migration handling, > (i.e. by implementing a_ops->migrate_page() to either just reject page migration > or somehow support it). > > To support page migration, it may require KVM's help in case of TDX (the > TDH.MEM.PAGE.RELOCATE SEAMCALL requires "GPA" and "level" of EPT mapping, which > are only available in KVM), but that doesn't make KVM to own the handling of > page migration. > > > > > Wouldn't better to let TDX code (or who > > > requires that) to increase/decrease the refcount when it populates/drops > > > the secure EPT entries? This is exactly what the current TDX code does: > > > > > > get_page(): > > > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217 > > > > > > put_page(): > > > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334 > > > > > As explained above, I think doing so in KVM is wrong: it can prevent by using > get_page(), but you cannot simply remove it to support page migration. Removing get_page() is definitely not enough for page migration support. But the key thing is for page migration prevention, other than get_page(), do we really have alternative. Thanks, Chao > > Sean also said similar thing when reviewing v8 KVM TDX series and I also agree: > > https://lore.kernel.org/lkml/Yvu5PsAndEbWKTHc@google.com/ > https://lore.kernel.org/lkml/31fec1b4438a6d9bb7ff719f96caa8b23ed764d6.camel@intel.com/ >
On Thu, Dec 22, 2022 at 06:15:24PM +0000, Sean Christopherson wrote: > On Wed, Dec 21, 2022, Chao Peng wrote: > > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote: > > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote: > > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote: > > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote: > > > But for non-restricted-mem case, it is correct for KVM to decrease page's > > > refcount after setting up mapping in the secondary mmu, otherwise the page will > > > be pinned by KVM for normal VM (since KVM uses GUP to get the page). > > > > That's true. Actually even true for restrictedmem case, most likely we > > will still need the kvm_release_pfn_clean() for KVM generic code. On one > > side, other restrictedmem users like pKVM may not require page pinning > > at all. On the other side, see below. > > > > > > > > So what we are expecting is: for KVM if the page comes from restricted mem, then > > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should. > > No, requiring the user (KVM) to guard against lack of support for page migration > in restricted mem is a terrible API. It's totally fine for restricted mem to not > support page migration until there's a use case, but punting the problem to KVM > is not acceptable. Restricted mem itself doesn't yet support page migration, > e.g. explosions would occur even if KVM wanted to allow migration since there is > no notification to invalidate existing mappings. > > > I argue that this page pinning (or page migration prevention) is not > > tied to where the page comes from, instead related to how the page will > > be used. Whether the page is restrictedmem backed or GUP() backed, once > > it's used by current version of TDX then the page pinning is needed. So > > such page migration prevention is really TDX thing, even not KVM generic > > thing (that's why I think we don't need change the existing logic of > > kvm_release_pfn_clean()). Wouldn't better to let TDX code (or who > > requires that) to increase/decrease the refcount when it populates/drops > > the secure EPT entries? This is exactly what the current TDX code does: > > I agree that whether or not migration is supported should be controllable by the > user, but I strongly disagree on punting refcount management to KVM (or TDX). > The whole point of restricted mem is to support technologies like TDX and SNP, > accomodating their special needs for things like page migration should be part of > the API, not some footnote in the documenation. I never doubt page migration should be part of restrictedmem API, but that's not an initial implementing as we all agreed? Then before that API being introduced, we need find a solution to prevent page migration for TDX. Other than refcount management, do we have any other workable solution? > > It's not difficult to let the user communicate support for page migration, e.g. > if/when restricted mem gains support, add a hook to restrictedmem_notifier_ops > to signal support (or lack thereof) for page migration. NULL == no migration, > non-NULL == migration allowed. I know. > > We know that supporting page migration in TDX and SNP is possible, and we know > that page migration will require a dedicated API since the backing store can't > memcpy() the page. I don't see any reason to ignore that eventuality. No, I'm not ignoring it. It's just about the short-term page migration prevention before that dedicated API being introduced. > > But again, unless I'm missing something, that's a future problem because restricted > mem doesn't yet support page migration regardless of the downstream user. It's true a future problem for page migration support itself, but page migration prevention is not a future problem since TDX pages need to be pinned before page migration gets supported. Thanks, Chao
On Fri, Dec 02, 2022, Chao Peng wrote: > The system call is currently wired up for x86 arch. Building on other architectures (except for arm64 for some reason) yields: CALL /.../scripts/checksyscalls.sh <stdin>:1565:2: warning: #warning syscall memfd_restricted not implemented [-Wcpp] Do we care? It's the only such warning, which makes me think we either need to wire this up for all architectures, or explicitly document that it's unsupported. > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > --- ... > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h > new file mode 100644 > index 000000000000..c2700c5daa43 > --- /dev/null > +++ b/include/linux/restrictedmem.h > @@ -0,0 +1,71 @@ > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > +#ifndef _LINUX_RESTRICTEDMEM_H Missing #define _LINUX_RESTRICTEDMEM_H which causes fireworks if restrictedmem.h is included more than once. > +#include <linux/file.h> > +#include <linux/magic.h> > +#include <linux/pfn_t.h> ... > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset, > + struct page **pagep, int *order) > +{ > + return -1; This should be a proper -errno, though in the current incarnation of things it's a moot point because no stub is needed. KVM can (and should) easily provide its own stub for this one. > +} > + > +static inline bool file_is_restrictedmem(struct file *file) > +{ > + return false; > +} > + > +static inline void restrictedmem_error_page(struct page *page, > + struct address_space *mapping) > +{ > +} > + > +#endif /* CONFIG_RESTRICTEDMEM */ > + > +#endif /* _LINUX_RESTRICTEDMEM_H */ ... > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c > new file mode 100644 > index 000000000000..56953c204e5c > --- /dev/null > +++ b/mm/restrictedmem.c > @@ -0,0 +1,318 @@ > +// SPDX-License-Identifier: GPL-2.0 > +#include "linux/sbitmap.h" > +#include <linux/pagemap.h> > +#include <linux/pseudo_fs.h> > +#include <linux/shmem_fs.h> > +#include <linux/syscalls.h> > +#include <uapi/linux/falloc.h> > +#include <uapi/linux/magic.h> > +#include <linux/restrictedmem.h> > + > +struct restrictedmem_data { Any objection to simply calling this "restrictedmem"? And then using either "rm" or "rmem" for local variable names? I kept reading "data" as the underyling data being written to the page, as opposed to the metadata describing the restrictedmem instance. > + struct mutex lock; > + struct file *memfd; > + struct list_head notifiers; > +}; > + > +static void restrictedmem_invalidate_start(struct restrictedmem_data *data, > + pgoff_t start, pgoff_t end) > +{ > + struct restrictedmem_notifier *notifier; > + > + mutex_lock(&data->lock); This can be a r/w semaphore instead of a mutex, that way punching holes at multiple points in the file can at least run the notifiers in parallel. The actual allocation by shmem will still be serialized, but I think it's worth the simple optimization since zapping and flushing in KVM may be somewhat slow. > + list_for_each_entry(notifier, &data->notifiers, list) { > + notifier->ops->invalidate_start(notifier, start, end); Two major design issues that we overlooked long ago: 1. Blindly invoking notifiers will not scale. E.g. if userspace configures a VM with a large number of convertible memslots that are all backed by a single large restrictedmem instance, then converting a single page will result in a linear walk through all memslots. I don't expect anyone to actually do something silly like that, but I also never expected there to be a legitimate usecase for thousands of memslots. 2. This approach fails to provide the ability for KVM to ensure a guest has exclusive access to a page. As discussed in the past, the kernel can rely on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but only for SNP and TDX VMs. For VMs where userspace is trusted to some extent, e.g. SEV, there is value in ensuring a 1:1 association. And probably more importantly, relying on hardware for SNP and TDX yields a poor ABI and complicates KVM's internals. If the kernel doesn't guarantee a page is exclusive to a guest, i.e. if userspace can hand out the same page from a restrictedmem instance to multiple VMs, then failure will occur only when KVM tries to assign the page to the second VM. That will happen deep in KVM, which means KVM needs to gracefully handle such errors, and it means that KVM's ABI effectively allows plumbing garbage into its memslots. Rather than use a simple list of notifiers, this appears to be yet another opportunity to use an xarray. Supporting sharing of restrictedmem will be non-trivial, but IMO we should punt that to the future since it's still unclear exactly how sharing will work. An xarray will solve #1 by notifying only the consumers (memslots) that are bound to the affected range. And for #2, it's relatively straightforward (knock wood) to detect existing entries, i.e. if the user wants exclusive access to memory, then the bind operation can be reject if there's an existing entry. VERY lightly tested code snippet at the bottom (will provide link to fully worked code in cover letter). > +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode, > + loff_t offset, loff_t len) > +{ > + int ret; > + pgoff_t start, end; > + struct file *memfd = data->memfd; > + > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) > + return -EINVAL; > + > + start = offset >> PAGE_SHIFT; > + end = (offset + len) >> PAGE_SHIFT; > + > + restrictedmem_invalidate_start(data, start, end); > + ret = memfd->f_op->fallocate(memfd, mode, offset, len); > + restrictedmem_invalidate_end(data, start, end); The lock needs to be end for the entire duration of the hole punch, i.e. needs to be taken before invalidate_start() and released after invalidate_end(). If a user (un)binds/(un)registers after invalidate_state(), it will see an unpaired notification, e.g. could leave KVM with incorrect notifier counts. > + > + return ret; > +} What I ended up with for an xarray-based implementation. I'm very flexible on names and whatnot, these are just what made sense to me. static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode, loff_t offset, loff_t len) { struct restrictedmem_notifier *notifier; struct file *memfd = rm->memfd; unsigned long index; pgoff_t start, end; int ret; if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) return -EINVAL; start = offset >> PAGE_SHIFT; end = (offset + len) >> PAGE_SHIFT; /* * Bindings must stable across invalidation to ensure the start+end * are balanced. */ down_read(&rm->lock); xa_for_each_range(&rm->bindings, index, notifier, start, end) notifier->ops->invalidate_start(notifier, start, end); ret = memfd->f_op->fallocate(memfd, mode, offset, len); xa_for_each_range(&rm->bindings, index, notifier, start, end) notifier->ops->invalidate_end(notifier, start, end); up_read(&rm->lock); return ret; } int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end, struct restrictedmem_notifier *notifier, bool exclusive) { struct restrictedmem *rm = file->f_mapping->private_data; int ret = -EINVAL; down_write(&rm->lock); /* Non-exclusive mappings are not yet implemented. */ if (!exclusive) goto out_unlock; if (!xa_empty(&rm->bindings)) { if (exclusive != rm->exclusive) goto out_unlock; if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT)) goto out_unlock; } xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL); rm->exclusive = exclusive; ret = 0; out_unlock: up_write(&rm->lock); return ret; } EXPORT_SYMBOL_GPL(restrictedmem_bind); void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end, struct restrictedmem_notifier *notifier) { struct restrictedmem *rm = file->f_mapping->private_data; down_write(&rm->lock); xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL); synchronize_rcu(); up_write(&rm->lock); } EXPORT_SYMBOL_GPL(restrictedmem_unbind);
On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote: > On Fri, Dec 02, 2022, Chao Peng wrote: > > The system call is currently wired up for x86 arch. > > Building on other architectures (except for arm64 for some reason) yields: > > CALL /.../scripts/checksyscalls.sh > <stdin>:1565:2: warning: #warning syscall memfd_restricted not implemented [-Wcpp] > > Do we care? It's the only such warning, which makes me think we either need to > wire this up for all architectures, or explicitly document that it's unsupported. I'm a bit conservative and prefer enabling only on x86 where we know the exact usecase. For the warning we can get rid of by changing scripts/checksyscalls.sh, just like __IGNORE_memfd_secret: https://lkml.kernel.org/r/20210518072034.31572-7-rppt@kernel.org > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > > --- > > ... > > > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h > > new file mode 100644 > > index 000000000000..c2700c5daa43 > > --- /dev/null > > +++ b/include/linux/restrictedmem.h > > @@ -0,0 +1,71 @@ > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > > +#ifndef _LINUX_RESTRICTEDMEM_H > > Missing > > #define _LINUX_RESTRICTEDMEM_H > > which causes fireworks if restrictedmem.h is included more than once. > > > +#include <linux/file.h> > > +#include <linux/magic.h> > > +#include <linux/pfn_t.h> > > ... > > > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset, > > + struct page **pagep, int *order) > > +{ > > + return -1; > > This should be a proper -errno, though in the current incarnation of things it's > a moot point because no stub is needed. KVM can (and should) easily provide its > own stub for this one. > > > +} > > + > > +static inline bool file_is_restrictedmem(struct file *file) > > +{ > > + return false; > > +} > > + > > +static inline void restrictedmem_error_page(struct page *page, > > + struct address_space *mapping) > > +{ > > +} > > + > > +#endif /* CONFIG_RESTRICTEDMEM */ > > + > > +#endif /* _LINUX_RESTRICTEDMEM_H */ > > ... > > > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c > > new file mode 100644 > > index 000000000000..56953c204e5c > > --- /dev/null > > +++ b/mm/restrictedmem.c > > @@ -0,0 +1,318 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +#include "linux/sbitmap.h" > > +#include <linux/pagemap.h> > > +#include <linux/pseudo_fs.h> > > +#include <linux/shmem_fs.h> > > +#include <linux/syscalls.h> > > +#include <uapi/linux/falloc.h> > > +#include <uapi/linux/magic.h> > > +#include <linux/restrictedmem.h> > > + > > +struct restrictedmem_data { > > Any objection to simply calling this "restrictedmem"? And then using either "rm" > or "rmem" for local variable names? I kept reading "data" as the underyling data > being written to the page, as opposed to the metadata describing the restrictedmem > instance. > > > + struct mutex lock; > > + struct file *memfd; > > + struct list_head notifiers; > > +}; > > + > > +static void restrictedmem_invalidate_start(struct restrictedmem_data *data, > > + pgoff_t start, pgoff_t end) > > +{ > > + struct restrictedmem_notifier *notifier; > > + > > + mutex_lock(&data->lock); > > This can be a r/w semaphore instead of a mutex, that way punching holes at multiple > points in the file can at least run the notifiers in parallel. The actual allocation > by shmem will still be serialized, but I think it's worth the simple optimization > since zapping and flushing in KVM may be somewhat slow. > > > + list_for_each_entry(notifier, &data->notifiers, list) { > > + notifier->ops->invalidate_start(notifier, start, end); > > Two major design issues that we overlooked long ago: > > 1. Blindly invoking notifiers will not scale. E.g. if userspace configures a > VM with a large number of convertible memslots that are all backed by a > single large restrictedmem instance, then converting a single page will > result in a linear walk through all memslots. I don't expect anyone to > actually do something silly like that, but I also never expected there to be > a legitimate usecase for thousands of memslots. > > 2. This approach fails to provide the ability for KVM to ensure a guest has > exclusive access to a page. As discussed in the past, the kernel can rely > on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but > only for SNP and TDX VMs. For VMs where userspace is trusted to some extent, > e.g. SEV, there is value in ensuring a 1:1 association. > > And probably more importantly, relying on hardware for SNP and TDX yields a > poor ABI and complicates KVM's internals. If the kernel doesn't guarantee a > page is exclusive to a guest, i.e. if userspace can hand out the same page > from a restrictedmem instance to multiple VMs, then failure will occur only > when KVM tries to assign the page to the second VM. That will happen deep > in KVM, which means KVM needs to gracefully handle such errors, and it means > that KVM's ABI effectively allows plumbing garbage into its memslots. It may not be a valid usage, but in my TDX environment I do meet below issue. kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0 kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0 kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22 Slot#2('SMRAM') is actually an alias into system memory(Slot#0) in QEMU and slot#2 fails due to below exclusive check. Currently I changed QEMU code to mark these alias slots as shared instead of private but I'm not 100% confident this is correct fix. > > Rather than use a simple list of notifiers, this appears to be yet another > opportunity to use an xarray. Supporting sharing of restrictedmem will be > non-trivial, but IMO we should punt that to the future since it's still unclear > exactly how sharing will work. > > An xarray will solve #1 by notifying only the consumers (memslots) that are bound > to the affected range. > > And for #2, it's relatively straightforward (knock wood) to detect existing > entries, i.e. if the user wants exclusive access to memory, then the bind operation > can be reject if there's an existing entry. > > VERY lightly tested code snippet at the bottom (will provide link to fully worked > code in cover letter). > > > > +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode, > > + loff_t offset, loff_t len) > > +{ > > + int ret; > > + pgoff_t start, end; > > + struct file *memfd = data->memfd; > > + > > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) > > + return -EINVAL; > > + > > + start = offset >> PAGE_SHIFT; > > + end = (offset + len) >> PAGE_SHIFT; > > + > > + restrictedmem_invalidate_start(data, start, end); > > + ret = memfd->f_op->fallocate(memfd, mode, offset, len); > > + restrictedmem_invalidate_end(data, start, end); > > The lock needs to be end for the entire duration of the hole punch, i.e. needs to > be taken before invalidate_start() and released after invalidate_end(). If a user > (un)binds/(un)registers after invalidate_state(), it will see an unpaired notification, > e.g. could leave KVM with incorrect notifier counts. > > > + > > + return ret; > > +} > > What I ended up with for an xarray-based implementation. I'm very flexible on > names and whatnot, these are just what made sense to me. > > static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode, > loff_t offset, loff_t len) > { > struct restrictedmem_notifier *notifier; > struct file *memfd = rm->memfd; > unsigned long index; > pgoff_t start, end; > int ret; > > if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) > return -EINVAL; > > start = offset >> PAGE_SHIFT; > end = (offset + len) >> PAGE_SHIFT; > > /* > * Bindings must stable across invalidation to ensure the start+end > * are balanced. > */ > down_read(&rm->lock); > > xa_for_each_range(&rm->bindings, index, notifier, start, end) > notifier->ops->invalidate_start(notifier, start, end); > > ret = memfd->f_op->fallocate(memfd, mode, offset, len); > > xa_for_each_range(&rm->bindings, index, notifier, start, end) > notifier->ops->invalidate_end(notifier, start, end); > > up_read(&rm->lock); > > return ret; > } > > int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end, > struct restrictedmem_notifier *notifier, bool exclusive) > { > struct restrictedmem *rm = file->f_mapping->private_data; > int ret = -EINVAL; > > down_write(&rm->lock); > > /* Non-exclusive mappings are not yet implemented. */ > if (!exclusive) > goto out_unlock; > > if (!xa_empty(&rm->bindings)) { > if (exclusive != rm->exclusive) > goto out_unlock; > > if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT)) > goto out_unlock; > } > > xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL); > rm->exclusive = exclusive; > ret = 0; > out_unlock: > up_write(&rm->lock); > return ret; > } > EXPORT_SYMBOL_GPL(restrictedmem_bind); > > void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end, > struct restrictedmem_notifier *notifier) > { > struct restrictedmem *rm = file->f_mapping->private_data; > > down_write(&rm->lock); > xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL); > synchronize_rcu(); > up_write(&rm->lock); > } > EXPORT_SYMBOL_GPL(restrictedmem_unbind);
On Tue, Jan 17, 2023, Chao Peng wrote: > On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote: > > > + list_for_each_entry(notifier, &data->notifiers, list) { > > > + notifier->ops->invalidate_start(notifier, start, end); > > > > Two major design issues that we overlooked long ago: > > > > 1. Blindly invoking notifiers will not scale. E.g. if userspace configures a > > VM with a large number of convertible memslots that are all backed by a > > single large restrictedmem instance, then converting a single page will > > result in a linear walk through all memslots. I don't expect anyone to > > actually do something silly like that, but I also never expected there to be > > a legitimate usecase for thousands of memslots. > > > > 2. This approach fails to provide the ability for KVM to ensure a guest has > > exclusive access to a page. As discussed in the past, the kernel can rely > > on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but > > only for SNP and TDX VMs. For VMs where userspace is trusted to some extent, > > e.g. SEV, there is value in ensuring a 1:1 association. > > > > And probably more importantly, relying on hardware for SNP and TDX yields a > > poor ABI and complicates KVM's internals. If the kernel doesn't guarantee a > > page is exclusive to a guest, i.e. if userspace can hand out the same page > > from a restrictedmem instance to multiple VMs, then failure will occur only > > when KVM tries to assign the page to the second VM. That will happen deep > > in KVM, which means KVM needs to gracefully handle such errors, and it means > > that KVM's ABI effectively allows plumbing garbage into its memslots. > > It may not be a valid usage, but in my TDX environment I do meet below > issue. > > kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0 > kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0 > kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22 > > Slot#2('SMRAM') is actually an alias into system memory(Slot#0) in QEMU > and slot#2 fails due to below exclusive check. > > Currently I changed QEMU code to mark these alias slots as shared > instead of private but I'm not 100% confident this is correct fix. That's a QEMU bug of sorts. SMM is mutually exclusive with TDX, QEMU shouldn't be configuring SMRAM (or any SMM memslots for that matter) for TDX guests. Actually, KVM should enforce that by disallowing SMM memslots for TDX guests. Ditto for SNP guests and UPM-backed SEV and SEV-ES guests. I think it probably even makes sense to introduce that restriction in the base UPM support, e.g. something like the below. That would unnecessarily prevent emulating SMM for KVM_X86_PROTECTED_VM types that aren't encrypted, but IMO that's an acceptable limitation until there's an actual use case for KVM_X86_PROTECTED_VM guests beyond SEV (my thought is that KVM_X86_PROTECTED_VM will mostly be a vehicle for selftests and UPM-based SEV and SEV-ES guests). diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 48b7bdad1e0a..0a8aac821cb0 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4357,6 +4357,14 @@ bool kvm_arch_has_private_mem(struct kvm *kvm) return kvm->arch.vm_type != KVM_X86_DEFAULT_VM; } +bool kvm_arch_nr_address_spaces(struct kvm *kvm) +{ + if (kvm->arch.vm_type != KVM_X86_DEFAULT_VM) + return 1; + + return KVM_ADDRESS_SPACE_NUM; +} + static bool kvm_is_vm_type_supported(unsigned long type) { return type == KVM_X86_DEFAULT_VM || diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 97801d81ee42..e0a3fc819fe5 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2126,7 +2126,7 @@ int __kvm_set_memory_region(struct kvm *kvm, mem->restricted_offset + mem->memory_size < mem->restricted_offset || 0 /* TODO: require gfn be aligned with restricted offset */)) return -EINVAL; - if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM) + if (as_id >= kvm_arch_nr_address_spaces(vm) || id >= KVM_MEM_SLOTS_NUM) return -EINVAL; if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr) return -EINVAL;
On Tue, Jan 17, 2023 at 04:34:15PM +0000, Sean Christopherson wrote: > On Tue, Jan 17, 2023, Chao Peng wrote: > > On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote: > > > > + list_for_each_entry(notifier, &data->notifiers, list) { > > > > + notifier->ops->invalidate_start(notifier, start, end); > > > > > > Two major design issues that we overlooked long ago: > > > > > > 1. Blindly invoking notifiers will not scale. E.g. if userspace configures a > > > VM with a large number of convertible memslots that are all backed by a > > > single large restrictedmem instance, then converting a single page will > > > result in a linear walk through all memslots. I don't expect anyone to > > > actually do something silly like that, but I also never expected there to be > > > a legitimate usecase for thousands of memslots. > > > > > > 2. This approach fails to provide the ability for KVM to ensure a guest has > > > exclusive access to a page. As discussed in the past, the kernel can rely > > > on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but > > > only for SNP and TDX VMs. For VMs where userspace is trusted to some extent, > > > e.g. SEV, there is value in ensuring a 1:1 association. > > > > > > And probably more importantly, relying on hardware for SNP and TDX yields a > > > poor ABI and complicates KVM's internals. If the kernel doesn't guarantee a > > > page is exclusive to a guest, i.e. if userspace can hand out the same page > > > from a restrictedmem instance to multiple VMs, then failure will occur only > > > when KVM tries to assign the page to the second VM. That will happen deep > > > in KVM, which means KVM needs to gracefully handle such errors, and it means > > > that KVM's ABI effectively allows plumbing garbage into its memslots. > > > > It may not be a valid usage, but in my TDX environment I do meet below > > issue. > > > > kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0 > > kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0 > > kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22 > > > > Slot#2('SMRAM') is actually an alias into system memory(Slot#0) in QEMU > > and slot#2 fails due to below exclusive check. > > > > Currently I changed QEMU code to mark these alias slots as shared > > instead of private but I'm not 100% confident this is correct fix. > > That's a QEMU bug of sorts. SMM is mutually exclusive with TDX, QEMU shouldn't > be configuring SMRAM (or any SMM memslots for that matter) for TDX guests. Thanks for the confirmation. As long as we only bind one notifier for each address, using xarray does make things simple. Chao
On Wed, Jan 18, 2023 at 04:16:41PM +0800, Chao Peng <chao.p.peng@linux.intel.com> wrote: > On Tue, Jan 17, 2023 at 04:34:15PM +0000, Sean Christopherson wrote: > > On Tue, Jan 17, 2023, Chao Peng wrote: > > > On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote: > > > > > + list_for_each_entry(notifier, &data->notifiers, list) { > > > > > + notifier->ops->invalidate_start(notifier, start, end); > > > > > > > > Two major design issues that we overlooked long ago: > > > > > > > > 1. Blindly invoking notifiers will not scale. E.g. if userspace configures a > > > > VM with a large number of convertible memslots that are all backed by a > > > > single large restrictedmem instance, then converting a single page will > > > > result in a linear walk through all memslots. I don't expect anyone to > > > > actually do something silly like that, but I also never expected there to be > > > > a legitimate usecase for thousands of memslots. > > > > > > > > 2. This approach fails to provide the ability for KVM to ensure a guest has > > > > exclusive access to a page. As discussed in the past, the kernel can rely > > > > on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but > > > > only for SNP and TDX VMs. For VMs where userspace is trusted to some extent, > > > > e.g. SEV, there is value in ensuring a 1:1 association. > > > > > > > > And probably more importantly, relying on hardware for SNP and TDX yields a > > > > poor ABI and complicates KVM's internals. If the kernel doesn't guarantee a > > > > page is exclusive to a guest, i.e. if userspace can hand out the same page > > > > from a restrictedmem instance to multiple VMs, then failure will occur only > > > > when KVM tries to assign the page to the second VM. That will happen deep > > > > in KVM, which means KVM needs to gracefully handle such errors, and it means > > > > that KVM's ABI effectively allows plumbing garbage into its memslots. > > > > > > It may not be a valid usage, but in my TDX environment I do meet below > > > issue. > > > > > > kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0 > > > kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0 > > > kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22 > > > > > > Slot#2('SMRAM') is actually an alias into system memory(Slot#0) in QEMU > > > and slot#2 fails due to below exclusive check. > > > > > > Currently I changed QEMU code to mark these alias slots as shared > > > instead of private but I'm not 100% confident this is correct fix. > > > > That's a QEMU bug of sorts. SMM is mutually exclusive with TDX, QEMU shouldn't > > be configuring SMRAM (or any SMM memslots for that matter) for TDX guests. > > Thanks for the confirmation. As long as we only bind one notifier for > each address, using xarray does make things simple. In the past, I had patches for qemu to disable PAM and SMRAM, but they were dropped for simplicity because SMRAM/PAM are disabled as reset state with unused memslot registered. TDX guest bios(TDVF or EDK2) doesn't enable them. Now we can revive them.
On 12/22/22 01:37, Huang, Kai wrote: >>> I argue that this page pinning (or page migration prevention) is not >>> tied to where the page comes from, instead related to how the page will >>> be used. Whether the page is restrictedmem backed or GUP() backed, once >>> it's used by current version of TDX then the page pinning is needed. So >>> such page migration prevention is really TDX thing, even not KVM generic >>> thing (that's why I think we don't need change the existing logic of >>> kvm_release_pfn_clean()). >>> > This essentially boils down to who "owns" page migration handling, and sadly, > page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page > migration by itself -- it's just a passive receiver. > > For normal pages, page migration is totally done by the core-kernel (i.e. it > unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops- >> migrate_page() to actually migrate the page). > In the sense of TDX, conceptually it should be done in the same way. The more > important thing is: yes KVM can use get_page() to prevent page migration, but > when KVM wants to support it, KVM cannot just remove get_page(), as the core- > kernel will still just do migrate_page() which won't work for TDX (given > restricted_memfd doesn't have a_ops->migrate_page() implemented). > > So I think the restricted_memfd filesystem should own page migration handling, > (i.e. by implementing a_ops->migrate_page() to either just reject page migration > or somehow support it). While this thread seems to be settled on refcounts already, just wanted to point out that it wouldn't be ideal to prevent migrations by a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e. by memory compaction) by isolating the pages for migration and then releasing them after the callback rejects it (at least we wouldn't waste time creating and undoing migration entries in the userspace page tables as there's no mmap). Elevated refcount on the other hand is detected very early in compaction so no isolation is attempted, so from that aspect it's optimal.
On Mon, Jan 23, 2023 at 03:03:45PM +0100, Vlastimil Babka wrote: > On 12/22/22 01:37, Huang, Kai wrote: > >>> I argue that this page pinning (or page migration prevention) is not > >>> tied to where the page comes from, instead related to how the page will > >>> be used. Whether the page is restrictedmem backed or GUP() backed, once > >>> it's used by current version of TDX then the page pinning is needed. So > >>> such page migration prevention is really TDX thing, even not KVM generic > >>> thing (that's why I think we don't need change the existing logic of > >>> kvm_release_pfn_clean()). > >>> > > This essentially boils down to who "owns" page migration handling, and sadly, > > page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page > > migration by itself -- it's just a passive receiver. > > > > For normal pages, page migration is totally done by the core-kernel (i.e. it > > unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops- > >> migrate_page() to actually migrate the page). > > In the sense of TDX, conceptually it should be done in the same way. The more > > important thing is: yes KVM can use get_page() to prevent page migration, but > > when KVM wants to support it, KVM cannot just remove get_page(), as the core- > > kernel will still just do migrate_page() which won't work for TDX (given > > restricted_memfd doesn't have a_ops->migrate_page() implemented). > > > > So I think the restricted_memfd filesystem should own page migration handling, > > (i.e. by implementing a_ops->migrate_page() to either just reject page migration > > or somehow support it). > > While this thread seems to be settled on refcounts already, just wanted > to point out that it wouldn't be ideal to prevent migrations by > a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e. > by memory compaction) by isolating the pages for migration and then > releasing them after the callback rejects it (at least we wouldn't waste > time creating and undoing migration entries in the userspace page tables > as there's no mmap). Elevated refcount on the other hand is detected > very early in compaction so no isolation is attempted, so from that > aspect it's optimal. Hm. Do we need a new hook in a_ops to check if the page is migratable before going with longer path to migrate_page(). Or maybe add AS_UNMOVABLE?
On Thu, Dec 22, 2022 at 06:15:24PM +0000, Sean Christopherson wrote: > On Wed, Dec 21, 2022, Chao Peng wrote: > > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote: > > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote: > > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote: > > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote: > > > But for non-restricted-mem case, it is correct for KVM to decrease page's > > > refcount after setting up mapping in the secondary mmu, otherwise the page will > > > be pinned by KVM for normal VM (since KVM uses GUP to get the page). > > > > That's true. Actually even true for restrictedmem case, most likely we > > will still need the kvm_release_pfn_clean() for KVM generic code. On one > > side, other restrictedmem users like pKVM may not require page pinning > > at all. On the other side, see below. > > > > > > > > So what we are expecting is: for KVM if the page comes from restricted mem, then > > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should. > > No, requiring the user (KVM) to guard against lack of support for page migration > in restricted mem is a terrible API. It's totally fine for restricted mem to not > support page migration until there's a use case, but punting the problem to KVM > is not acceptable. Restricted mem itself doesn't yet support page migration, > e.g. explosions would occur even if KVM wanted to allow migration since there is > no notification to invalidate existing mappings. I tried to find a way to hook into migration path from restrictedmem. It is not easy because from code-mm PoV the restrictedmem page just yet another shmem page. It is somewhat dubious, but I think it should be safe to override mapping->a_ops for the shmem mapping. It also eliminates need in special treatment for the restrictedmem pages from memory-failure code. shmem_mapping() uses ->a_ops to detect shmem mapping. Modify the implementation to still be true for restrictedmem pages. Build tested only. Any comments? diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h index 6fddb08f03cc..73ded3c3bad1 100644 --- a/include/linux/restrictedmem.h +++ b/include/linux/restrictedmem.h @@ -36,8 +36,6 @@ static inline bool file_is_restrictedmem(struct file *file) return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC; } -void restrictedmem_error_page(struct page *page, struct address_space *mapping); - #else static inline bool file_is_restrictedmem(struct file *file) @@ -45,11 +43,6 @@ static inline bool file_is_restrictedmem(struct file *file) return false; } -static inline void restrictedmem_error_page(struct page *page, - struct address_space *mapping) -{ -} - #endif /* CONFIG_RESTRICTEDMEM */ #endif /* _LINUX_RESTRICTEDMEM_H */ diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index d500ea967dc7..a4af160f37e4 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -9,6 +9,7 @@ #include <linux/percpu_counter.h> #include <linux/xattr.h> #include <linux/fs_parser.h> +#include <linux/magic.h> /* inode in-kernel data */ @@ -75,10 +76,9 @@ extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags); extern int shmem_lock(struct file *file, int lock, struct ucounts *ucounts); #ifdef CONFIG_SHMEM -extern const struct address_space_operations shmem_aops; static inline bool shmem_mapping(struct address_space *mapping) { - return mapping->a_ops == &shmem_aops; + return mapping->host->i_sb->s_magic == TMPFS_MAGIC; } #else static inline bool shmem_mapping(struct address_space *mapping) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index f91b444e471e..145bb561ddb3 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -62,7 +62,6 @@ #include <linux/page-isolation.h> #include <linux/pagewalk.h> #include <linux/shmem_fs.h> -#include <linux/restrictedmem.h> #include "swap.h" #include "internal.h" #include "ras/ras_event.h" @@ -941,8 +940,6 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p) goto out; } - restrictedmem_error_page(p, mapping); - /* * The shmem page is kept in page cache instead of truncating * so is expected to have an extra refcount after error-handling. diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c index 15c52301eeb9..d0ca609b82cb 100644 --- a/mm/restrictedmem.c +++ b/mm/restrictedmem.c @@ -189,6 +189,51 @@ static struct file *restrictedmem_file_create(struct file *memfd) return file; } +static int restricted_error_remove_page(struct address_space *mapping, + struct page *page) +{ + struct super_block *sb = restrictedmem_mnt->mnt_sb; + struct inode *inode, *next; + pgoff_t start, end; + + start = page->index; + end = start + thp_nr_pages(page); + + spin_lock(&sb->s_inode_list_lock); + list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { + struct restrictedmem *rm = inode->i_mapping->private_data; + struct restrictedmem_notifier *notifier; + struct file *memfd = rm->memfd; + unsigned long index; + + if (memfd->f_mapping != mapping) + continue; + + xa_for_each_range(&rm->bindings, index, notifier, start, end) + notifier->ops->error(notifier, start, end); + break; + } + spin_unlock(&sb->s_inode_list_lock); + + return 0; +} + +#ifdef CONFIG_MIGRATION +static int restricted_folio(struct address_space *mapping, struct folio *dst, + struct folio *src, enum migrate_mode mode) +{ + return -EBUSY; +} +#endif + +static struct address_space_operations restricted_aops = { + .dirty_folio = noop_dirty_folio, + .error_remove_page = restricted_error_remove_page, +#ifdef CONFIG_MIGRATION + .migrate_folio = restricted_folio, +#endif +}; + SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) { struct file *file, *restricted_file; @@ -209,6 +254,8 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; file->f_flags |= O_LARGEFILE; + file->f_mapping->a_ops = &restricted_aops; + restricted_file = restrictedmem_file_create(file); if (IS_ERR(restricted_file)) { err = PTR_ERR(restricted_file); @@ -293,31 +340,3 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset, } EXPORT_SYMBOL_GPL(restrictedmem_get_page); -void restrictedmem_error_page(struct page *page, struct address_space *mapping) -{ - struct super_block *sb = restrictedmem_mnt->mnt_sb; - struct inode *inode, *next; - pgoff_t start, end; - - if (!shmem_mapping(mapping)) - return; - - start = page->index; - end = start + thp_nr_pages(page); - - spin_lock(&sb->s_inode_list_lock); - list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { - struct restrictedmem *rm = inode->i_mapping->private_data; - struct restrictedmem_notifier *notifier; - struct file *memfd = rm->memfd; - unsigned long index; - - if (memfd->f_mapping != mapping) - continue; - - xa_for_each_range(&rm->bindings, index, notifier, start, end) - notifier->ops->error(notifier, start, end); - break; - } - spin_unlock(&sb->s_inode_list_lock); -} diff --git a/mm/shmem.c b/mm/shmem.c index c1d8b8a1aa3b..3df4d95784b9 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -231,7 +231,7 @@ static inline void shmem_inode_unacct_blocks(struct inode *inode, long pages) } static const struct super_operations shmem_ops; -const struct address_space_operations shmem_aops; +static const struct address_space_operations shmem_aops; static const struct file_operations shmem_file_operations; static const struct inode_operations shmem_inode_operations; static const struct inode_operations shmem_dir_inode_operations; @@ -3894,7 +3894,7 @@ static int shmem_error_remove_page(struct address_space *mapping, return 0; } -const struct address_space_operations shmem_aops = { +static const struct address_space_operations shmem_aops = { .writepage = shmem_writepage, .dirty_folio = noop_dirty_folio, #ifdef CONFIG_TMPFS @@ -3906,7 +3906,6 @@ const struct address_space_operations shmem_aops = { #endif .error_remove_page = shmem_error_remove_page, }; -EXPORT_SYMBOL(shmem_aops); static const struct file_operations shmem_file_operations = { .mmap = shmem_mmap,
On Mon, 2023-01-23 at 15:03 +0100, Vlastimil Babka wrote: > On 12/22/22 01:37, Huang, Kai wrote: > > > > I argue that this page pinning (or page migration prevention) is not > > > > tied to where the page comes from, instead related to how the page will > > > > be used. Whether the page is restrictedmem backed or GUP() backed, once > > > > it's used by current version of TDX then the page pinning is needed. So > > > > such page migration prevention is really TDX thing, even not KVM generic > > > > thing (that's why I think we don't need change the existing logic of > > > > kvm_release_pfn_clean()). > > > > > > This essentially boils down to who "owns" page migration handling, and sadly, > > page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page > > migration by itself -- it's just a passive receiver. > > > > For normal pages, page migration is totally done by the core-kernel (i.e. it > > unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops- > > > migrate_page() to actually migrate the page). > > In the sense of TDX, conceptually it should be done in the same way. The more > > important thing is: yes KVM can use get_page() to prevent page migration, but > > when KVM wants to support it, KVM cannot just remove get_page(), as the core- > > kernel will still just do migrate_page() which won't work for TDX (given > > restricted_memfd doesn't have a_ops->migrate_page() implemented). > > > > So I think the restricted_memfd filesystem should own page migration handling, > > (i.e. by implementing a_ops->migrate_page() to either just reject page migration > > or somehow support it). > > While this thread seems to be settled on refcounts already, > I am not sure but will let Sean/Paolo to decide. > just wanted > to point out that it wouldn't be ideal to prevent migrations by > a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e. > by memory compaction) by isolating the pages for migration and then > releasing them after the callback rejects it (at least we wouldn't waste > time creating and undoing migration entries in the userspace page tables > as there's no mmap). Elevated refcount on the other hand is detected > very early in compaction so no isolation is attempted, so from that > aspect it's optimal. I am probably missing something, but IIUC the checking of refcount happens at very last stage of page migration too, for instance: migrate_folio(...) -> migrate_folio_extra(..., 0 /* extra_count */) -> folio_migrate_mapping(...). And it is folio_migrate_mapping() who does the actual compare with the refcount, which is at very late stage too: int folio_migrate_mapping(struct address_space *mapping, struct folio *newfolio, struct folio *folio, int extra_count) { ... int expected_count = folio_expected_refs(mapping, folio) + extra_count; if (!mapping) { /* Anonymous page without mapping */ if (folio_ref_count(folio) != expected_count) return -EAGAIN; .... return MIGRATEPAGE_SUCCESS; } .... if (!folio_ref_freeze(folio, expected_count)) { xas_unlock_irq(&xas); return -EAGAIN; } ... }
On Mon, Jan 23, 2023, Huang, Kai wrote: > On Mon, 2023-01-23 at 15:03 +0100, Vlastimil Babka wrote: > > On 12/22/22 01:37, Huang, Kai wrote: > > > > > I argue that this page pinning (or page migration prevention) is not > > > > > tied to where the page comes from, instead related to how the page will > > > > > be used. Whether the page is restrictedmem backed or GUP() backed, once > > > > > it's used by current version of TDX then the page pinning is needed. So > > > > > such page migration prevention is really TDX thing, even not KVM generic > > > > > thing (that's why I think we don't need change the existing logic of > > > > > kvm_release_pfn_clean()). > > > > > > > > This essentially boils down to who "owns" page migration handling, and sadly, > > > page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page > > > migration by itself -- it's just a passive receiver. > > > > > > For normal pages, page migration is totally done by the core-kernel (i.e. it > > > unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops- > > > > migrate_page() to actually migrate the page). > > > In the sense of TDX, conceptually it should be done in the same way. The more > > > important thing is: yes KVM can use get_page() to prevent page migration, but > > > when KVM wants to support it, KVM cannot just remove get_page(), as the core- > > > kernel will still just do migrate_page() which won't work for TDX (given > > > restricted_memfd doesn't have a_ops->migrate_page() implemented). > > > > > > So I think the restricted_memfd filesystem should own page migration handling, > > > (i.e. by implementing a_ops->migrate_page() to either just reject page migration > > > or somehow support it). > > > > While this thread seems to be settled on refcounts already, > > > > I am not sure but will let Sean/Paolo to decide. My preference is whatever is most performant without being hideous :-) > > just wanted > > to point out that it wouldn't be ideal to prevent migrations by > > a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e. > > by memory compaction) by isolating the pages for migration and then > > releasing them after the callback rejects it (at least we wouldn't waste > > time creating and undoing migration entries in the userspace page tables > > as there's no mmap). Elevated refcount on the other hand is detected > > very early in compaction so no isolation is attempted, so from that > > aspect it's optimal. > > I am probably missing something, Heh, me too, I could have sworn that using refcounts was the least efficient way to block migration. > but IIUC the checking of refcount happens at very last stage of page migration too
On 1/24/23 00:38, Sean Christopherson wrote: > On Mon, Jan 23, 2023, Huang, Kai wrote: >> On Mon, 2023-01-23 at 15:03 +0100, Vlastimil Babka wrote: >>> On 12/22/22 01:37, Huang, Kai wrote: >>>>>> I argue that this page pinning (or page migration prevention) is not >>>>>> tied to where the page comes from, instead related to how the page will >>>>>> be used. Whether the page is restrictedmem backed or GUP() backed, once >>>>>> it's used by current version of TDX then the page pinning is needed. So >>>>>> such page migration prevention is really TDX thing, even not KVM generic >>>>>> thing (that's why I think we don't need change the existing logic of >>>>>> kvm_release_pfn_clean()). >>>>>> >>>> This essentially boils down to who "owns" page migration handling, and sadly, >>>> page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page >>>> migration by itself -- it's just a passive receiver. >>>> >>>> For normal pages, page migration is totally done by the core-kernel (i.e. it >>>> unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops- >>>>> migrate_page() to actually migrate the page). >>>> In the sense of TDX, conceptually it should be done in the same way. The more >>>> important thing is: yes KVM can use get_page() to prevent page migration, but >>>> when KVM wants to support it, KVM cannot just remove get_page(), as the core- >>>> kernel will still just do migrate_page() which won't work for TDX (given >>>> restricted_memfd doesn't have a_ops->migrate_page() implemented). >>>> >>>> So I think the restricted_memfd filesystem should own page migration handling, >>>> (i.e. by implementing a_ops->migrate_page() to either just reject page migration >>>> or somehow support it). >>> >>> While this thread seems to be settled on refcounts already, >>> >> >> I am not sure but will let Sean/Paolo to decide. > > My preference is whatever is most performant without being hideous :-) > >>> just wanted >>> to point out that it wouldn't be ideal to prevent migrations by >>> a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e. >>> by memory compaction) by isolating the pages for migration and then >>> releasing them after the callback rejects it (at least we wouldn't waste >>> time creating and undoing migration entries in the userspace page tables >>> as there's no mmap). Elevated refcount on the other hand is detected >>> very early in compaction so no isolation is attempted, so from that >>> aspect it's optimal. >> >> I am probably missing something, > > Heh, me too, I could have sworn that using refcounts was the least efficient way > to block migration. Well I admit that due to my experience with it, I do mostly consider migration through memory compaction POV, which is a significant user of migration on random pages that's not requested by userspace actions on specific ranges. And compaction has in isolate_migratepages_block(): /* * Migration will fail if an anonymous page is pinned in memory, * so avoid taking lru_lock and isolating it unnecessarily in an * admittedly racy check. */ mapping = page_mapping(page); if (!mapping && page_count(page) > page_mapcount(page)) goto isolate_fail; so that prevents migration of pages with elevated refcount very early, before they are even isolated, so before migrate_pages() is called. But it's true there are other sources of "random pages migration" - numa balancing, demotion in lieu of reclaim... and I'm not sure if all have such early check too. Anyway, whatever is decided to be a better way than elevated refcounts, would ideally be checked before isolation as well, as that's the most efficient way. >> but IIUC the checking of refcount happens at very last stage of page migration too
> +static int restrictedmem_getattr(struct user_namespace *mnt_userns, > + const struct path *path, struct kstat *stat, > + u32 request_mask, unsigned int query_flags) > +{ > + struct inode *inode = d_inode(path->dentry); > + struct restrictedmem_data *data = inode->i_mapping->private_data; > + struct file *memfd = data->memfd; > + > + return memfd->f_inode->i_op->getattr(mnt_userns, path, stat, > + request_mask, query_flags); Instead of calling shmem's getattr() with path, we should be using the the memfd's path. Otherwise, shmem's getattr() will use restrictedmem's inode instead of shmem's inode. The private fields will be of the wrong type, and the host will crash when shmem_is_huge() does SHMEM_SB(inode->i_sb)->huge), since inode->i_sb->s_fs_info is NULL for the restrictedmem's superblock. Here's the patch: diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c index 37191cd9eed1..06b72d593bd8 100644 --- a/mm/restrictedmem.c +++ b/mm/restrictedmem.c @@ -84,7 +84,7 @@ static int restrictedmem_getattr(struct user_namespace *mnt_userns, struct restrictedmem *rm = inode->i_mapping->private_data; struct file *memfd = rm->memfd; - return memfd->f_inode->i_op->getattr(mnt_userns, path, stat, + return memfd->f_inode->i_op->getattr(mnt_userns, &memfd->f_path, stat, request_mask, query_flags); } > +} > + > +static int restrictedmem_setattr(struct user_namespace *mnt_userns, > + struct dentry *dentry, struct iattr *attr) > +{ > + struct inode *inode = d_inode(dentry); > + struct restrictedmem_data *data = inode->i_mapping->private_data; > + struct file *memfd = data->memfd; > + int ret; > + > + if (attr->ia_valid & ATTR_SIZE) { > + if (memfd->f_inode->i_size) > + return -EPERM; > + > + if (!PAGE_ALIGNED(attr->ia_size)) > + return -EINVAL; > + } > + > + ret = memfd->f_inode->i_op->setattr(mnt_userns, > + file_dentry(memfd), attr); > + return ret; > +} > + > +static const struct inode_operations restrictedmem_iops = { > + .getattr = restrictedmem_getattr, > + .setattr = restrictedmem_setattr, > +};
On Monday, January 30, 2023 1:26 PM, Ackerley Tng wrote: > > > +static int restrictedmem_getattr(struct user_namespace *mnt_userns, > > + const struct path *path, struct kstat *stat, > > + u32 request_mask, unsigned int query_flags) > { > > + struct inode *inode = d_inode(path->dentry); > > + struct restrictedmem_data *data = inode->i_mapping- > >private_data; > > + struct file *memfd = data->memfd; > > + > > + return memfd->f_inode->i_op->getattr(mnt_userns, path, stat, > > + request_mask, query_flags); > > Instead of calling shmem's getattr() with path, we should be using the the > memfd's path. > > Otherwise, shmem's getattr() will use restrictedmem's inode instead of > shmem's inode. The private fields will be of the wrong type, and the host will > crash when shmem_is_huge() does SHMEM_SB(inode->i_sb)->huge), since > inode->i_sb->s_fs_info is NULL for the restrictedmem's superblock. > > Here's the patch: > > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c index > 37191cd9eed1..06b72d593bd8 100644 > --- a/mm/restrictedmem.c > +++ b/mm/restrictedmem.c > @@ -84,7 +84,7 @@ static int restrictedmem_getattr(struct user_namespace > *mnt_userns, > struct restrictedmem *rm = inode->i_mapping->private_data; > struct file *memfd = rm->memfd; > > - return memfd->f_inode->i_op->getattr(mnt_userns, path, stat, > + return memfd->f_inode->i_op->getattr(mnt_userns, &memfd- > >f_path, stat, > request_mask, query_flags); > } > Nice catch. I also encountered this issue during my work. The fix can further be enforced by shmem: index c301487be5fb..d850c0190359 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -472,8 +472,9 @@ bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode, pgoff_t index, bool shmem_huge_force) { loff_t i_size; + struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); - if (!S_ISREG(inode->i_mode)) + if (!sbinfo || !S_ISREG(inode->i_mode)) return false; if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) || test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))) @@ -485,7 +486,7 @@ bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode, if (shmem_huge == SHMEM_HUGE_DENY) return false; - switch (SHMEM_SB(inode->i_sb)->huge) { + switch (sbinfo->huge) { case SHMEM_HUGE_ALWAYS: return true; case SHMEM_HUGE_WITHIN_SIZE:
On 1/23/23 16:43, Kirill A. Shutemov wrote: > On Thu, Dec 22, 2022 at 06:15:24PM +0000, Sean Christopherson wrote: >> On Wed, Dec 21, 2022, Chao Peng wrote: >> > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote: >> > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote: >> > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote: >> > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote: >> > > But for non-restricted-mem case, it is correct for KVM to decrease page's >> > > refcount after setting up mapping in the secondary mmu, otherwise the page will >> > > be pinned by KVM for normal VM (since KVM uses GUP to get the page). >> > >> > That's true. Actually even true for restrictedmem case, most likely we >> > will still need the kvm_release_pfn_clean() for KVM generic code. On one >> > side, other restrictedmem users like pKVM may not require page pinning >> > at all. On the other side, see below. >> > >> > > >> > > So what we are expecting is: for KVM if the page comes from restricted mem, then >> > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should. >> >> No, requiring the user (KVM) to guard against lack of support for page migration >> in restricted mem is a terrible API. It's totally fine for restricted mem to not >> support page migration until there's a use case, but punting the problem to KVM >> is not acceptable. Restricted mem itself doesn't yet support page migration, >> e.g. explosions would occur even if KVM wanted to allow migration since there is >> no notification to invalidate existing mappings. > > I tried to find a way to hook into migration path from restrictedmem. It > is not easy because from code-mm PoV the restrictedmem page just yet > another shmem page. > > It is somewhat dubious, but I think it should be safe to override > mapping->a_ops for the shmem mapping. > > It also eliminates need in special treatment for the restrictedmem pages > from memory-failure code. > > shmem_mapping() uses ->a_ops to detect shmem mapping. Modify the > implementation to still be true for restrictedmem pages. > > Build tested only. > > Any comments? > > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h > index 6fddb08f03cc..73ded3c3bad1 100644 > --- a/include/linux/restrictedmem.h > +++ b/include/linux/restrictedmem.h > @@ -36,8 +36,6 @@ static inline bool file_is_restrictedmem(struct file *file) > return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC; > } > > -void restrictedmem_error_page(struct page *page, struct address_space *mapping); > - > #else > > static inline bool file_is_restrictedmem(struct file *file) > @@ -45,11 +43,6 @@ static inline bool file_is_restrictedmem(struct file *file) > return false; > } > > -static inline void restrictedmem_error_page(struct page *page, > - struct address_space *mapping) > -{ > -} > - > #endif /* CONFIG_RESTRICTEDMEM */ > > #endif /* _LINUX_RESTRICTEDMEM_H */ > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h > index d500ea967dc7..a4af160f37e4 100644 > --- a/include/linux/shmem_fs.h > +++ b/include/linux/shmem_fs.h > @@ -9,6 +9,7 @@ > #include <linux/percpu_counter.h> > #include <linux/xattr.h> > #include <linux/fs_parser.h> > +#include <linux/magic.h> > > /* inode in-kernel data */ > > @@ -75,10 +76,9 @@ extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr, > unsigned long len, unsigned long pgoff, unsigned long flags); > extern int shmem_lock(struct file *file, int lock, struct ucounts *ucounts); > #ifdef CONFIG_SHMEM > -extern const struct address_space_operations shmem_aops; > static inline bool shmem_mapping(struct address_space *mapping) > { > - return mapping->a_ops == &shmem_aops; > + return mapping->host->i_sb->s_magic == TMPFS_MAGIC; Alternatively just check a_ops against two possible values? Fewer chained dereferences, no-op with !CONFIG_RESTRICTEDMEM, maybe Hugh would be less unhappy with that. Besides that, IIRC Michael Roth mentioned that this approach for preventing migration would be simpler for SNP than the refcount elevation? Do I recall right and should this be pursued then? > } > #else > static inline bool shmem_mapping(struct address_space *mapping) > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index f91b444e471e..145bb561ddb3 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -62,7 +62,6 @@ > #include <linux/page-isolation.h> > #include <linux/pagewalk.h> > #include <linux/shmem_fs.h> > -#include <linux/restrictedmem.h> > #include "swap.h" > #include "internal.h" > #include "ras/ras_event.h" > @@ -941,8 +940,6 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p) > goto out; > } > > - restrictedmem_error_page(p, mapping); > - > /* > * The shmem page is kept in page cache instead of truncating > * so is expected to have an extra refcount after error-handling. > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c > index 15c52301eeb9..d0ca609b82cb 100644 > --- a/mm/restrictedmem.c > +++ b/mm/restrictedmem.c > @@ -189,6 +189,51 @@ static struct file *restrictedmem_file_create(struct file *memfd) > return file; > } > > +static int restricted_error_remove_page(struct address_space *mapping, > + struct page *page) > +{ > + struct super_block *sb = restrictedmem_mnt->mnt_sb; > + struct inode *inode, *next; > + pgoff_t start, end; > + > + start = page->index; > + end = start + thp_nr_pages(page); > + > + spin_lock(&sb->s_inode_list_lock); > + list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { > + struct restrictedmem *rm = inode->i_mapping->private_data; > + struct restrictedmem_notifier *notifier; > + struct file *memfd = rm->memfd; > + unsigned long index; > + > + if (memfd->f_mapping != mapping) > + continue; > + > + xa_for_each_range(&rm->bindings, index, notifier, start, end) > + notifier->ops->error(notifier, start, end); > + break; > + } > + spin_unlock(&sb->s_inode_list_lock); > + > + return 0; > +} > + > +#ifdef CONFIG_MIGRATION > +static int restricted_folio(struct address_space *mapping, struct folio *dst, > + struct folio *src, enum migrate_mode mode) > +{ > + return -EBUSY; > +} > +#endif > + > +static struct address_space_operations restricted_aops = { > + .dirty_folio = noop_dirty_folio, > + .error_remove_page = restricted_error_remove_page, > +#ifdef CONFIG_MIGRATION > + .migrate_folio = restricted_folio, > +#endif > +}; > + > SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > { > struct file *file, *restricted_file; > @@ -209,6 +254,8 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; > file->f_flags |= O_LARGEFILE; > > + file->f_mapping->a_ops = &restricted_aops; > + > restricted_file = restrictedmem_file_create(file); > if (IS_ERR(restricted_file)) { > err = PTR_ERR(restricted_file); > @@ -293,31 +340,3 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset, > } > EXPORT_SYMBOL_GPL(restrictedmem_get_page); > > -void restrictedmem_error_page(struct page *page, struct address_space *mapping) > -{ > - struct super_block *sb = restrictedmem_mnt->mnt_sb; > - struct inode *inode, *next; > - pgoff_t start, end; > - > - if (!shmem_mapping(mapping)) > - return; > - > - start = page->index; > - end = start + thp_nr_pages(page); > - > - spin_lock(&sb->s_inode_list_lock); > - list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { > - struct restrictedmem *rm = inode->i_mapping->private_data; > - struct restrictedmem_notifier *notifier; > - struct file *memfd = rm->memfd; > - unsigned long index; > - > - if (memfd->f_mapping != mapping) > - continue; > - > - xa_for_each_range(&rm->bindings, index, notifier, start, end) > - notifier->ops->error(notifier, start, end); > - break; > - } > - spin_unlock(&sb->s_inode_list_lock); > -} > diff --git a/mm/shmem.c b/mm/shmem.c > index c1d8b8a1aa3b..3df4d95784b9 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -231,7 +231,7 @@ static inline void shmem_inode_unacct_blocks(struct inode *inode, long pages) > } > > static const struct super_operations shmem_ops; > -const struct address_space_operations shmem_aops; > +static const struct address_space_operations shmem_aops; > static const struct file_operations shmem_file_operations; > static const struct inode_operations shmem_inode_operations; > static const struct inode_operations shmem_dir_inode_operations; > @@ -3894,7 +3894,7 @@ static int shmem_error_remove_page(struct address_space *mapping, > return 0; > } > > -const struct address_space_operations shmem_aops = { > +static const struct address_space_operations shmem_aops = { > .writepage = shmem_writepage, > .dirty_folio = noop_dirty_folio, > #ifdef CONFIG_TMPFS > @@ -3906,7 +3906,6 @@ const struct address_space_operations shmem_aops = { > #endif > .error_remove_page = shmem_error_remove_page, > }; > -EXPORT_SYMBOL(shmem_aops); > > static const struct file_operations shmem_file_operations = { > .mmap = shmem_mmap,
On Mon, Jan 23, 2023 at 06:43:34PM +0300, Kirill A. Shutemov wrote: > On Thu, Dec 22, 2022 at 06:15:24PM +0000, Sean Christopherson wrote: > > On Wed, Dec 21, 2022, Chao Peng wrote: > > > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote: > > > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote: > > > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote: > > > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote: > > > > But for non-restricted-mem case, it is correct for KVM to decrease page's > > > > refcount after setting up mapping in the secondary mmu, otherwise the page will > > > > be pinned by KVM for normal VM (since KVM uses GUP to get the page). > > > > > > That's true. Actually even true for restrictedmem case, most likely we > > > will still need the kvm_release_pfn_clean() for KVM generic code. On one > > > side, other restrictedmem users like pKVM may not require page pinning > > > at all. On the other side, see below. > > > > > > > > > > > So what we are expecting is: for KVM if the page comes from restricted mem, then > > > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should. > > > > No, requiring the user (KVM) to guard against lack of support for page migration > > in restricted mem is a terrible API. It's totally fine for restricted mem to not > > support page migration until there's a use case, but punting the problem to KVM > > is not acceptable. Restricted mem itself doesn't yet support page migration, > > e.g. explosions would occur even if KVM wanted to allow migration since there is > > no notification to invalidate existing mappings. > > I tried to find a way to hook into migration path from restrictedmem. It > is not easy because from code-mm PoV the restrictedmem page just yet > another shmem page. > > It is somewhat dubious, but I think it should be safe to override > mapping->a_ops for the shmem mapping. > > It also eliminates need in special treatment for the restrictedmem pages > from memory-failure code. > > shmem_mapping() uses ->a_ops to detect shmem mapping. Modify the > implementation to still be true for restrictedmem pages. > > Build tested only. > > Any comments? Hi Kirill, We've been testing your approach to handle pinning for the SNP+UPM implementation and haven't noticed any problems so far: (based on top of Sean's updated UPM v10 tree) https://github.com/mdroth/linux/commit/f780033e6812a01f8732060605d941474fee2bd6 Prior to your patch we also tried elevating refcount via restrictedmem_get_page() for cases where shmem_get_folio(..., SGP_NOALLOC) indicates the page hasn't been allocated yet, and that approach also seems to work, but there are potential races and other ugliness that make your approach seem a lot cleaner. -Mike > > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h > index 6fddb08f03cc..73ded3c3bad1 100644 > --- a/include/linux/restrictedmem.h > +++ b/include/linux/restrictedmem.h > @@ -36,8 +36,6 @@ static inline bool file_is_restrictedmem(struct file *file) > return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC; > } > > -void restrictedmem_error_page(struct page *page, struct address_space *mapping); > - > #else > > static inline bool file_is_restrictedmem(struct file *file) > @@ -45,11 +43,6 @@ static inline bool file_is_restrictedmem(struct file *file) > return false; > } > > -static inline void restrictedmem_error_page(struct page *page, > - struct address_space *mapping) > -{ > -} > - > #endif /* CONFIG_RESTRICTEDMEM */ > > #endif /* _LINUX_RESTRICTEDMEM_H */ > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h > index d500ea967dc7..a4af160f37e4 100644 > --- a/include/linux/shmem_fs.h > +++ b/include/linux/shmem_fs.h > @@ -9,6 +9,7 @@ > #include <linux/percpu_counter.h> > #include <linux/xattr.h> > #include <linux/fs_parser.h> > +#include <linux/magic.h> > > /* inode in-kernel data */ > > @@ -75,10 +76,9 @@ extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr, > unsigned long len, unsigned long pgoff, unsigned long flags); > extern int shmem_lock(struct file *file, int lock, struct ucounts *ucounts); > #ifdef CONFIG_SHMEM > -extern const struct address_space_operations shmem_aops; > static inline bool shmem_mapping(struct address_space *mapping) > { > - return mapping->a_ops == &shmem_aops; > + return mapping->host->i_sb->s_magic == TMPFS_MAGIC; > } > #else > static inline bool shmem_mapping(struct address_space *mapping) > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index f91b444e471e..145bb561ddb3 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -62,7 +62,6 @@ > #include <linux/page-isolation.h> > #include <linux/pagewalk.h> > #include <linux/shmem_fs.h> > -#include <linux/restrictedmem.h> > #include "swap.h" > #include "internal.h" > #include "ras/ras_event.h" > @@ -941,8 +940,6 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p) > goto out; > } > > - restrictedmem_error_page(p, mapping); > - > /* > * The shmem page is kept in page cache instead of truncating > * so is expected to have an extra refcount after error-handling. > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c > index 15c52301eeb9..d0ca609b82cb 100644 > --- a/mm/restrictedmem.c > +++ b/mm/restrictedmem.c > @@ -189,6 +189,51 @@ static struct file *restrictedmem_file_create(struct file *memfd) > return file; > } > > +static int restricted_error_remove_page(struct address_space *mapping, > + struct page *page) > +{ > + struct super_block *sb = restrictedmem_mnt->mnt_sb; > + struct inode *inode, *next; > + pgoff_t start, end; > + > + start = page->index; > + end = start + thp_nr_pages(page); > + > + spin_lock(&sb->s_inode_list_lock); > + list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { > + struct restrictedmem *rm = inode->i_mapping->private_data; > + struct restrictedmem_notifier *notifier; > + struct file *memfd = rm->memfd; > + unsigned long index; > + > + if (memfd->f_mapping != mapping) > + continue; > + > + xa_for_each_range(&rm->bindings, index, notifier, start, end) > + notifier->ops->error(notifier, start, end); > + break; > + } > + spin_unlock(&sb->s_inode_list_lock); > + > + return 0; > +} > + > +#ifdef CONFIG_MIGRATION > +static int restricted_folio(struct address_space *mapping, struct folio *dst, > + struct folio *src, enum migrate_mode mode) > +{ > + return -EBUSY; > +} > +#endif > + > +static struct address_space_operations restricted_aops = { > + .dirty_folio = noop_dirty_folio, > + .error_remove_page = restricted_error_remove_page, > +#ifdef CONFIG_MIGRATION > + .migrate_folio = restricted_folio, > +#endif > +}; > + > SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > { > struct file *file, *restricted_file; > @@ -209,6 +254,8 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) > file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; > file->f_flags |= O_LARGEFILE; > > + file->f_mapping->a_ops = &restricted_aops; > + > restricted_file = restrictedmem_file_create(file); > if (IS_ERR(restricted_file)) { > err = PTR_ERR(restricted_file); > @@ -293,31 +340,3 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset, > } > EXPORT_SYMBOL_GPL(restrictedmem_get_page); > > -void restrictedmem_error_page(struct page *page, struct address_space *mapping) > -{ > - struct super_block *sb = restrictedmem_mnt->mnt_sb; > - struct inode *inode, *next; > - pgoff_t start, end; > - > - if (!shmem_mapping(mapping)) > - return; > - > - start = page->index; > - end = start + thp_nr_pages(page); > - > - spin_lock(&sb->s_inode_list_lock); > - list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { > - struct restrictedmem *rm = inode->i_mapping->private_data; > - struct restrictedmem_notifier *notifier; > - struct file *memfd = rm->memfd; > - unsigned long index; > - > - if (memfd->f_mapping != mapping) > - continue; > - > - xa_for_each_range(&rm->bindings, index, notifier, start, end) > - notifier->ops->error(notifier, start, end); > - break; > - } > - spin_unlock(&sb->s_inode_list_lock); > -} > diff --git a/mm/shmem.c b/mm/shmem.c > index c1d8b8a1aa3b..3df4d95784b9 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -231,7 +231,7 @@ static inline void shmem_inode_unacct_blocks(struct inode *inode, long pages) > } > > static const struct super_operations shmem_ops; > -const struct address_space_operations shmem_aops; > +static const struct address_space_operations shmem_aops; > static const struct file_operations shmem_file_operations; > static const struct inode_operations shmem_inode_operations; > static const struct inode_operations shmem_dir_inode_operations; > @@ -3894,7 +3894,7 @@ static int shmem_error_remove_page(struct address_space *mapping, > return 0; > } > > -const struct address_space_operations shmem_aops = { > +static const struct address_space_operations shmem_aops = { > .writepage = shmem_writepage, > .dirty_folio = noop_dirty_folio, > #ifdef CONFIG_TMPFS > @@ -3906,7 +3906,6 @@ const struct address_space_operations shmem_aops = { > #endif > .error_remove_page = shmem_error_remove_page, > }; > -EXPORT_SYMBOL(shmem_aops); > > static const struct file_operations shmem_file_operations = { > .mmap = shmem_mmap, > -- > Kiryl Shutsemau / Kirill A. Shutemov
On 1/23/23 16:18, Kirill A. Shutemov wrote: > On Mon, Jan 23, 2023 at 03:03:45PM +0100, Vlastimil Babka wrote: >> On 12/22/22 01:37, Huang, Kai wrote: >> >>> I argue that this page pinning (or page migration prevention) is not >> >>> tied to where the page comes from, instead related to how the page will >> >>> be used. Whether the page is restrictedmem backed or GUP() backed, once >> >>> it's used by current version of TDX then the page pinning is needed. So >> >>> such page migration prevention is really TDX thing, even not KVM generic >> >>> thing (that's why I think we don't need change the existing logic of >> >>> kvm_release_pfn_clean()). >> >>> >> > This essentially boils down to who "owns" page migration handling, and sadly, >> > page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page >> > migration by itself -- it's just a passive receiver. >> > >> > For normal pages, page migration is totally done by the core-kernel (i.e. it >> > unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops- >> >> migrate_page() to actually migrate the page). >> > In the sense of TDX, conceptually it should be done in the same way. The more >> > important thing is: yes KVM can use get_page() to prevent page migration, but >> > when KVM wants to support it, KVM cannot just remove get_page(), as the core- >> > kernel will still just do migrate_page() which won't work for TDX (given >> > restricted_memfd doesn't have a_ops->migrate_page() implemented). >> > >> > So I think the restricted_memfd filesystem should own page migration handling, >> > (i.e. by implementing a_ops->migrate_page() to either just reject page migration >> > or somehow support it). >> >> While this thread seems to be settled on refcounts already, just wanted >> to point out that it wouldn't be ideal to prevent migrations by >> a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e. >> by memory compaction) by isolating the pages for migration and then >> releasing them after the callback rejects it (at least we wouldn't waste >> time creating and undoing migration entries in the userspace page tables >> as there's no mmap). Elevated refcount on the other hand is detected >> very early in compaction so no isolation is attempted, so from that >> aspect it's optimal. > > Hm. Do we need a new hook in a_ops to check if the page is migratable > before going with longer path to migrate_page(). > > Or maybe add AS_UNMOVABLE? AS_UNMOVABLE should indeed allow a test in e.g. compaction to descide that the page is not worth isolating in the first place.
> +static struct file *restrictedmem_file_create(struct file *memfd) > +{ > + struct restrictedmem_data *data; > + struct address_space *mapping; > + struct inode *inode; > + struct file *file; > + > + data = kzalloc(sizeof(*data), GFP_KERNEL); > + if (!data) > + return ERR_PTR(-ENOMEM); > + > + data->memfd = memfd; > + mutex_init(&data->lock); > + INIT_LIST_HEAD(&data->notifiers); > + > + inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb); > + if (IS_ERR(inode)) { > + kfree(data); > + return ERR_CAST(inode); > + } alloc_anon_inode() uses new_pseudo_inode() to get the inode. As per the comment, new inode is not added to the superblock s_inodes list. /** * new_inode_pseudo - obtain an inode * @sb: superblock * * Allocates a new inode for given superblock. * Inode wont be chained in superblock s_inodes list * This means : * - fs can't be unmount * - quotas, fsnotify, writeback can't work */ So the restrictedmem_error_page will not find the inode as it was never added to the s_inodes list. We might need to add the inode after allocating. inode_sb_list_add(inode); > +void restrictedmem_error_page(struct page *page, struct address_space *mapping) > +{ > + struct super_block *sb = restrictedmem_mnt->mnt_sb; > + struct inode *inode, *next; > + > + if (!shmem_mapping(mapping)) > + return; > + > + spin_lock(&sb->s_inode_list_lock); > + list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { > + struct restrictedmem_data *data = inode->i_mapping->private_data; > + struct file *memfd = data->memfd; > + > + if (memfd->f_mapping == mapping) { > + pgoff_t start, end; > + > + spin_unlock(&sb->s_inode_list_lock); > + > + start = page->index; > + end = start + thp_nr_pages(page); > + restrictedmem_notifier_error(data, start, end); > + return; > + } > + } > + spin_unlock(&sb->s_inode_list_lock); > +} Regards Nikunj
On 14/1/23 08:54, Sean Christopherson wrote: > On Fri, Dec 02, 2022, Chao Peng wrote: >> The system call is currently wired up for x86 arch. > > Building on other architectures (except for arm64 for some reason) yields: > > CALL /.../scripts/checksyscalls.sh > <stdin>:1565:2: warning: #warning syscall memfd_restricted not implemented [-Wcpp] > > Do we care? It's the only such warning, which makes me think we either need to > wire this up for all architectures, or explicitly document that it's unsupported. > >> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> >> --- > > ... > >> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h >> new file mode 100644 >> index 000000000000..c2700c5daa43 >> --- /dev/null >> +++ b/include/linux/restrictedmem.h >> @@ -0,0 +1,71 @@ >> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ >> +#ifndef _LINUX_RESTRICTEDMEM_H > > Missing > > #define _LINUX_RESTRICTEDMEM_H > > which causes fireworks if restrictedmem.h is included more than once. > >> +#include <linux/file.h> >> +#include <linux/magic.h> >> +#include <linux/pfn_t.h> > > ... > >> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset, >> + struct page **pagep, int *order) >> +{ >> + return -1; > > This should be a proper -errno, though in the current incarnation of things it's > a moot point because no stub is needed. KVM can (and should) easily provide its > own stub for this one. > >> +} >> + >> +static inline bool file_is_restrictedmem(struct file *file) >> +{ >> + return false; >> +} >> + >> +static inline void restrictedmem_error_page(struct page *page, >> + struct address_space *mapping) >> +{ >> +} >> + >> +#endif /* CONFIG_RESTRICTEDMEM */ >> + >> +#endif /* _LINUX_RESTRICTEDMEM_H */ > > ... > >> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c >> new file mode 100644 >> index 000000000000..56953c204e5c >> --- /dev/null >> +++ b/mm/restrictedmem.c >> @@ -0,0 +1,318 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +#include "linux/sbitmap.h" >> +#include <linux/pagemap.h> >> +#include <linux/pseudo_fs.h> >> +#include <linux/shmem_fs.h> >> +#include <linux/syscalls.h> >> +#include <uapi/linux/falloc.h> >> +#include <uapi/linux/magic.h> >> +#include <linux/restrictedmem.h> >> + >> +struct restrictedmem_data { > > Any objection to simply calling this "restrictedmem"? And then using either "rm" > or "rmem" for local variable names? I kept reading "data" as the underyling data > being written to the page, as opposed to the metadata describing the restrictedmem > instance. > >> + struct mutex lock; >> + struct file *memfd; >> + struct list_head notifiers; >> +}; >> + >> +static void restrictedmem_invalidate_start(struct restrictedmem_data *data, >> + pgoff_t start, pgoff_t end) >> +{ >> + struct restrictedmem_notifier *notifier; >> + >> + mutex_lock(&data->lock); > > This can be a r/w semaphore instead of a mutex, that way punching holes at multiple > points in the file can at least run the notifiers in parallel. The actual allocation > by shmem will still be serialized, but I think it's worth the simple optimization > since zapping and flushing in KVM may be somewhat slow. > >> + list_for_each_entry(notifier, &data->notifiers, list) { >> + notifier->ops->invalidate_start(notifier, start, end); > > Two major design issues that we overlooked long ago: > > 1. Blindly invoking notifiers will not scale. E.g. if userspace configures a > VM with a large number of convertible memslots that are all backed by a > single large restrictedmem instance, then converting a single page will > result in a linear walk through all memslots. I don't expect anyone to > actually do something silly like that, but I also never expected there to be > a legitimate usecase for thousands of memslots. > > 2. This approach fails to provide the ability for KVM to ensure a guest has > exclusive access to a page. As discussed in the past, the kernel can rely > on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but > only for SNP and TDX VMs. For VMs where userspace is trusted to some extent, > e.g. SEV, there is value in ensuring a 1:1 association. > > And probably more importantly, relying on hardware for SNP and TDX yields a > poor ABI and complicates KVM's internals. If the kernel doesn't guarantee a > page is exclusive to a guest, i.e. if userspace can hand out the same page > from a restrictedmem instance to multiple VMs, then failure will occur only > when KVM tries to assign the page to the second VM. That will happen deep > in KVM, which means KVM needs to gracefully handle such errors, and it means > that KVM's ABI effectively allows plumbing garbage into its memslots. > > Rather than use a simple list of notifiers, this appears to be yet another > opportunity to use an xarray. Supporting sharing of restrictedmem will be > non-trivial, but IMO we should punt that to the future since it's still unclear > exactly how sharing will work. > > An xarray will solve #1 by notifying only the consumers (memslots) that are bound > to the affected range. > > And for #2, it's relatively straightforward (knock wood) to detect existing > entries, i.e. if the user wants exclusive access to memory, then the bind operation > can be reject if there's an existing entry. > > VERY lightly tested code snippet at the bottom (will provide link to fully worked > code in cover letter). > > >> +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode, >> + loff_t offset, loff_t len) >> +{ >> + int ret; >> + pgoff_t start, end; >> + struct file *memfd = data->memfd; >> + >> + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) >> + return -EINVAL; >> + >> + start = offset >> PAGE_SHIFT; >> + end = (offset + len) >> PAGE_SHIFT; >> + >> + restrictedmem_invalidate_start(data, start, end); >> + ret = memfd->f_op->fallocate(memfd, mode, offset, len); >> + restrictedmem_invalidate_end(data, start, end); > > The lock needs to be end for the entire duration of the hole punch, i.e. needs to > be taken before invalidate_start() and released after invalidate_end(). If a user > (un)binds/(un)registers after invalidate_state(), it will see an unpaired notification, > e.g. could leave KVM with incorrect notifier counts. > >> + >> + return ret; >> +} > > What I ended up with for an xarray-based implementation. I'm very flexible on > names and whatnot, these are just what made sense to me. > > static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode, > loff_t offset, loff_t len) > { > struct restrictedmem_notifier *notifier; > struct file *memfd = rm->memfd; > unsigned long index; > pgoff_t start, end; > int ret; > > if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) > return -EINVAL; > > start = offset >> PAGE_SHIFT; > end = (offset + len) >> PAGE_SHIFT; > > /* > * Bindings must stable across invalidation to ensure the start+end > * are balanced. > */ > down_read(&rm->lock); > > xa_for_each_range(&rm->bindings, index, notifier, start, end) > notifier->ops->invalidate_start(notifier, start, end); > > ret = memfd->f_op->fallocate(memfd, mode, offset, len); > > xa_for_each_range(&rm->bindings, index, notifier, start, end) > notifier->ops->invalidate_end(notifier, start, end); > > up_read(&rm->lock); > > return ret; > } > > int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end, > struct restrictedmem_notifier *notifier, bool exclusive) > { > struct restrictedmem *rm = file->f_mapping->private_data; > int ret = -EINVAL; > > down_write(&rm->lock); > > /* Non-exclusive mappings are not yet implemented. */ > if (!exclusive) > goto out_unlock; > > if (!xa_empty(&rm->bindings)) { > if (exclusive != rm->exclusive) > goto out_unlock; > > if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT)) > goto out_unlock; > } > > xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL); || ld: mm/restrictedmem.o: in function `restrictedmem_bind': mm/restrictedmem.c|295| undefined reference to `xa_store_range' This is missing: === diff --git a/mm/Kconfig b/mm/Kconfig index f952d0172080..03aca542c0da 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1087,6 +1087,7 @@ config SECRETMEM config RESTRICTEDMEM bool depends on TMPFS + select XARRAY_MULTI === Thanks, > rm->exclusive = exclusive; > ret = 0; > out_unlock: > up_write(&rm->lock); > return ret; > } > EXPORT_SYMBOL_GPL(restrictedmem_bind); > > void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end, > struct restrictedmem_notifier *notifier) > { > struct restrictedmem *rm = file->f_mapping->private_data; > > down_write(&rm->lock); > xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL); > synchronize_rcu(); > up_write(&rm->lock); > } > EXPORT_SYMBOL_GPL(restrictedmem_unbind);
> > int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end, > > struct restrictedmem_notifier *notifier, bool exclusive) > > { > > struct restrictedmem *rm = file->f_mapping->private_data; > > int ret = -EINVAL; > > > > down_write(&rm->lock); > > > > /* Non-exclusive mappings are not yet implemented. */ > > if (!exclusive) > > goto out_unlock; > > > > if (!xa_empty(&rm->bindings)) { > > if (exclusive != rm->exclusive) > > goto out_unlock; > > > > if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT)) > > goto out_unlock; > > } > > > > xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL); > > > || ld: mm/restrictedmem.o: in function `restrictedmem_bind': > mm/restrictedmem.c|295| undefined reference to `xa_store_range' Right, xa_store_range() is only available for XARRAY_MULTI. > > > This is missing: > === > diff --git a/mm/Kconfig b/mm/Kconfig > index f952d0172080..03aca542c0da 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -1087,6 +1087,7 @@ config SECRETMEM > config RESTRICTEDMEM > bool > depends on TMPFS > + select XARRAY_MULTI > === > > Thanks, > > > > > rm->exclusive = exclusive; > > ret = 0; > > out_unlock: > > up_write(&rm->lock); > > return ret; > > } > > EXPORT_SYMBOL_GPL(restrictedmem_bind); > > > > void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end, > > struct restrictedmem_notifier *notifier) > > { > > struct restrictedmem *rm = file->f_mapping->private_data; > > > > down_write(&rm->lock); > > xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL); > > synchronize_rcu(); > > up_write(&rm->lock); > > } > > EXPORT_SYMBOL_GPL(restrictedmem_unbind); > > -- > Alexey
On Thu, Feb 16, 2023 at 03:21:21PM +0530, Nikunj A. Dadhania wrote: > > > +static struct file *restrictedmem_file_create(struct file *memfd) > > +{ > > + struct restrictedmem_data *data; > > + struct address_space *mapping; > > + struct inode *inode; > > + struct file *file; > > + > > + data = kzalloc(sizeof(*data), GFP_KERNEL); > > + if (!data) > > + return ERR_PTR(-ENOMEM); > > + > > + data->memfd = memfd; > > + mutex_init(&data->lock); > > + INIT_LIST_HEAD(&data->notifiers); > > + > > + inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb); > > + if (IS_ERR(inode)) { > > + kfree(data); > > + return ERR_CAST(inode); > > + } > > alloc_anon_inode() uses new_pseudo_inode() to get the inode. As per the comment, new inode > is not added to the superblock s_inodes list. Another issue somewhat related to alloc_anon_inode() is that the shmem code in some cases assumes the inode struct was allocated via shmem_alloc_inode(), which allocates a struct shmem_inode_info, which is a superset of struct inode with additional fields for things like spinlocks. These additional fields don't get allocated/ininitialized in the case of restrictedmem, so when restrictedmem_getattr() tries to pass the inode on to shmem handler, it can cause a crash. For instance, the following trace was seen when executing 'sudo lsof' while a process/guest was running with an open memfd FD: [24393.121409] general protection fault, probably for non-canonical address 0xfe9fb182fea3f077: 0000 [#1] PREEMPT SMP NOPTI [24393.133546] CPU: 2 PID: 590073 Comm: lsof Tainted: G E 6.1.0-rc4-upm10b-host-snp-v8b+ #4 [24393.144125] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM1009B 05/14/2022 [24393.153150] RIP: 0010:native_queued_spin_lock_slowpath+0x3a3/0x3e0 [24393.160049] Code: f3 90 41 8b 04 24 85 c0 74 ea eb f4 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 00 41 04 00 48 03 04 d5 e0 ea 8b 82 <48> 89 18 8b 43 08 85 c0 75 09 f3 90 8b 43 08 85 c0 74 f7 48 8b 13 [24393.181004] RSP: 0018:ffffc9006b6a3cf8 EFLAGS: 00010086 [24393.186832] RAX: fe9fb182fea3f077 RBX: ffff889fcc144100 RCX: 0000000000000000 [24393.194793] RDX: 0000000000003ffe RSI: ffffffff827acde9 RDI: ffffc9006b6a3cdf [24393.202751] RBP: ffffc9006b6a3d20 R08: 0000000000000001 R09: 0000000000000000 [24393.210710] R10: 0000000000000000 R11: 000000000000ffff R12: ffff888179fa50e0 [24393.218670] R13: ffff889fcc144100 R14: 00000000000c0000 R15: 00000000000c0000 [24393.226629] FS: 00007f9440f45400(0000) GS:ffff889fcc100000(0000) knlGS:0000000000000000 [24393.235692] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [24393.242101] CR2: 000055c55a9cf088 CR3: 0008000220e9c003 CR4: 0000000000770ee0 [24393.250059] PKRU: 55555554 [24393.253073] Call Trace: [24393.255797] <TASK> [24393.258133] do_raw_spin_lock+0xc4/0xd0 [24393.262410] _raw_spin_lock_irq+0x50/0x70 [24393.266880] ? shmem_getattr+0x4c/0xf0 [24393.271060] shmem_getattr+0x4c/0xf0 [24393.275044] restrictedmem_getattr+0x34/0x40 [24393.279805] vfs_getattr_nosec+0xbd/0xe0 [24393.284178] vfs_getattr+0x37/0x50 [24393.287971] vfs_statx+0xa0/0x150 [24393.291668] vfs_fstatat+0x59/0x80 [24393.295462] __do_sys_newstat+0x35/0x70 [24393.299739] __x64_sys_newstat+0x16/0x20 [24393.304111] do_syscall_64+0x3b/0x90 [24393.308098] entry_SYSCALL_64_after_hwframe+0x63/0xcd As a workaround we've been doing the following, but it's probably not the proper fix: https://github.com/AMDESE/linux/commit/0378116b5c4e373295c9101727f2cb5112d6b1f4 -Mike
Chao Peng <chao.p.peng@linux.intel.com> writes: > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> > Introduce 'memfd_restricted' system call with the ability to create > memory areas that are restricted from userspace access through ordinary > MMU operations (e.g. read/write/mmap). The memory content is expected to > be used through the new in-kernel interface by a third kernel module. > ... > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c > new file mode 100644 > index 000000000000..56953c204e5c > --- /dev/null > +++ b/mm/restrictedmem.c > @@ -0,0 +1,318 @@ > +// SPDX-License-Identifier: GPL-2.0 > +#include "linux/sbitmap.h" > +#include <linux/pagemap.h> > +#include <linux/pseudo_fs.h> > +#include <linux/shmem_fs.h> > +#include <linux/syscalls.h> > +#include <uapi/linux/falloc.h> > +#include <uapi/linux/magic.h> > +#include <linux/restrictedmem.h> > + > +struct restrictedmem_data { > + struct mutex lock; > + struct file *memfd; Can this be renamed to file, or lower_file (as in stacking filesystems)? It's a little confusing because this pointer doesn't actually refer to an fd. 'memfd' is already used by udmabuf to refer to an actual fd [1], which makes this a little misleading. [1] https://elixir.bootlin.com/linux/v6.2.10/source/tools/testing/selftests/drivers/dma-buf/udmabuf.c#L63 > + struct list_head notifiers; > +}; > + > ...
On Thu, Apr 13, 2023, Christian Brauner wrote: > On Thu, Aug 18, 2022 at 04:24:21PM +0300, Kirill A . Shutemov wrote: > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote: > > > Here's what I would prefer, and imagine much easier for you to maintain; > > > but I'm no system designer, and may be misunderstanding throughout. > > > > > > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps > > > the fallocate syscall interface itself) to allocate and free the memory, > > > ioctl for initializing some of it too. KVM in control of whether that > > > fd can be read or written or mmap'ed or whatever, no need to prevent it > > > in shmem.c, no need for flags, seals, notifications to and fro because > > > KVM is already in control and knows the history. If shmem actually has > > > value, call into it underneath - somewhat like SysV SHM, and /dev/zero > > > mmap, and i915/gem make use of it underneath. If shmem has nothing to > > > add, just allocate and free kernel memory directly, recorded in your > > > own xarray. > > > > I guess shim layer on top of shmem *can* work. I don't see immediately why > > it would not. But I'm not sure it is right direction. We risk creating yet > > another parallel VM with own rules/locking/accounting that opaque to > > core-mm. > > Sorry for necrobumping this thread but I've been reviewing the No worries, I'm just stoked someone who actually knows what they're doing is chiming in :-) > memfd_restricted() extension that Ackerley is currently working on. I > was pointed to this thread as this is what the extension is building > on but I'll reply to both threads here. > > From a glance at v10, memfd_restricted() is currently implemented as an > in-kernel stacking filesystem. A call to memfd_restricted() creates a > new restricted memfd file and a new unlinked tmpfs file and stashes the > tmpfs file into the memfd file's private data member. It then uses the > tmpfs file's f_ops and i_ops to perform the relevant file and inode > operations. So it has the same callstack as a general stacking > filesystem like overlayfs in some cases: > > memfd_restricted->getattr() > -> tmpfs->getattr() ... > Since you're effectively acting like a stacking filesystem you should > really use the device number of your memfd restricted filesystem. IOW, > sm like: > > stat->dev = memfd_restricted_dentry->d_sb->s_dev; > > But then you run into trouble if you want to go forward with Ackerley's > extension that allows to explicitly pass in tmpfs fds to > memfd_restricted(). Afaict, two tmpfs instances might allocate the same > inode number. So now the inode and device number pair isn't unique > anymore. > > So you might best be served by allocating and reporting your own inode > numbers as well. > > But if you want to preserve the inode number and device number of the > relevant tmpfs instance but still report memfd restricted as your > filesystem type Unless I missed something along the way, reporting memfd_restricted as a distinct filesystem is very much a non-goal. AFAIK it's purely a side effect of the proposed implementation. > then I think it's reasonable to ask whether a stacking implementation really > makes sense here. > > If you extend memfd_restricted() or even consider extending it in the > future to take tmpfs file descriptors as arguments to identify the tmpfs > instance in which to allocate the underlying tmpfs file for the new > restricted memfd file you should really consider a tmpfs based > implementation. > > Because at that point it just feels like a pointless wrapper to get > custom f_ops and i_ops. Plus it's wasteful because you allocate dentries > and inodes that you don't really care about at all. > > Just off the top of my hat you might be better served: > * by a new ioctl() on tmpfs instances that > yield regular tmpfs file descriptors with restricted f_ops and i_ops. > That's not that different from btrfs subvolumes which effectively are > directories but are created through an ioctl(). I think this is more or less what we want to do, except via a dedicated syscall instead of an ioctl() so that the primary interface isn't strictly tied to tmpfs, e.g. so that it can be extended to other backing types in the future. > * by a mount option to tmpfs that makes it act > in this restricted manner then you don't need an ioctl() and can get > away with regular open calls. Such a tmpfs instance would only create > regular, restricted memfds. I'd prefer to not go this route, becuase IIUC, it would require relatively invasive changes to shmem code, and IIUC would require similar changes to other support backings in the future, e.g. hugetlbfs? And as above, I don't think any of the potential use cases need restrictedmem to be a uniquely identifiable mount. One of the goals (hopefully not a pipe dream) is to design restrictmem in such a way that extending it to support other backing types isn't terribly difficult. In case it's not obvious, most of us working on this stuff aren't filesystems experts, and many of us aren't mm experts either. The more we (KVM folks for the most part) can leverage existing code to do the heavy lifting, the better. After giving myself a bit of a crash course in file systems, would something like the below have any chance of (a) working, (b) getting merged, and (c) being maintainable? The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem hijacks a f_ops and a_ops to create a lightweight shim around tmpfs. There are undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might be doable" or a "no, that's absolutely bonkers, don't try it". Thanks! struct restrictedmem { struct rw_semaphore lock; struct file *file; const struct file_operations *backing_f_ops; const struct address_space_operations *backing_a_ops; struct xarray bindings; bool exclusive; }; static int restrictedmem_release(struct inode *inode, struct file *file) { struct restrictedmem *rm = inode->i_mapping->private_data; xa_destroy(&rm->bindings); kfree(rm); WARN_ON_ONCE(rm->backing_f_ops->release); return 0; } static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode, loff_t offset, loff_t len) { struct restrictedmem_notifier *notifier; unsigned long index; pgoff_t start, end; int ret; if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) return -EINVAL; start = offset >> PAGE_SHIFT; end = (offset + len) >> PAGE_SHIFT; /* * Bindings must be stable across invalidation to ensure the start+end * are balanced. */ down_read(&rm->lock); xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) notifier->ops->invalidate_start(notifier, start, end); ret = rm->backing_f_ops->fallocate(rm->file, mode, offset, len); xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) notifier->ops->invalidate_end(notifier, start, end); up_read(&rm->lock); return ret; } static long restrictedmem_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { struct restrictedmem *rm = file->f_mapping->private_data; if (mode & FALLOC_FL_PUNCH_HOLE) return restrictedmem_punch_hole(rm, mode, offset, len); return rm->backing_f_ops->fallocate(file, mode, offset, len); } static int restrictedmem_migrate_folio(struct address_space *mapping, struct folio *dst, struct folio *src, enum migrate_mode) { WARN_ON_ONCE(1); return -EINVAL; } static int restrictedmem_error_page(struct address_space *mapping, struct page *page) { struct restrictedmem *rm = mapping->private_data; struct restrictedmem_notifier *notifier; unsigned long index; pgoff_t start, end; start = page->index; end = start + thp_nr_pages(page); down_read(&rm->lock); xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) notifier->ops->error(notifier, start, end); up_read(&rm->lock); return rm->backing_a_ops->error_remove_page(mapping, page); } static const struct file_operations restrictedmem_fops = { .release = restrictedmem_release, .fallocate = restrictedmem_fallocate, }; static const struct address_space_operations restrictedmem_aops = { .dirty_folio = noop_dirty_folio, #ifdef CONFIG_MIGRATION .migrate_folio = restrictedmem_migrate_folio, #endif .error_remove_page = restrictedmem_error_page, }; static int restrictedmem_file_create(struct file *file) { struct address_space *mapping = file->f_mapping; struct restrictedmem *rm; rm = kzalloc(sizeof(*rm), GFP_KERNEL); if (!rm) return -ENOMEM; rm->backing_f_ops = file->f_op; rm->backing_a_ops = mapping->a_ops; rm->file = file; init_rwsem(&rm->lock); xa_init(&rm->bindings); file->f_flags |= O_LARGEFILE; file->f_op = &restrictedmem_fops; mapping->a_ops = &restrictedmem_aops; mapping_set_unevictable(mapping); mapping_set_unmovable(mapping); mapping_set_gfp_mask(mapping, mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); return 0; } static int restrictedmem_create(struct vfsmount *mount) { struct file *file; int fd, err; fd = get_unused_fd_flags(0); if (fd < 0) return fd; file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE); if (IS_ERR(file)) { err = PTR_ERR(file); goto err_fd; } if (WARN_ON_ONCE(file->private_data)) { err = -EEXIST; goto err_fd; } file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; file->f_flags |= O_LARGEFILE; err = restrictedmem_file_create(file); if (err) { fput(file); goto err_fd; } fd_install(fd, file); return fd; err_fd: put_unused_fd(fd); return err; } SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) { struct vfsmount *mnt; struct path *path; struct fd f; int ret; if (flags) return -EINVAL; f = fdget_raw(mount_fd); if (!f.file) return -EBADF; ret = -EINVAL; path = &f.file->f_path; if (path->dentry != path->mnt->mnt_root) goto out; /* Disallow bind-mounts that aren't bind-mounts of the whole filesystem. */ mnt = path->mnt; if (mnt->mnt_root != mnt->mnt_sb->s_root) goto out; /* * The filesystem must be mounted no-execute, executing from guest * private memory in the host is nonsensical and unsafe. */ if (!(mnt->mnt_sb->s_iflags & SB_I_NOEXEC)) goto out; /* Currently only TMPFS is supported as underlying storage. */ if (mnt->mnt_sb->s_magic != TMPFS_MAGIC) goto out; ret = mnt_want_write(mnt); if (ret) goto out; ret = restrictedmem_create(mnt); if (mnt) mnt_drop_write(mnt); out: if (f.file) fdput(f); return ret; }
Sean Christopherson <seanjc@google.com> writes: > On Thu, Apr 13, 2023, Christian Brauner wrote: >> On Thu, Aug 18, 2022 at 04:24:21PM +0300, Kirill A . Shutemov wrote: >> > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote: >> > > Here's what I would prefer, and imagine much easier for you to >> maintain; >> > > but I'm no system designer, and may be misunderstanding throughout. >> > > >> > > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps >> > > the fallocate syscall interface itself) to allocate and free the >> memory, >> > > ioctl for initializing some of it too. KVM in control of whether >> that >> > > fd can be read or written or mmap'ed or whatever, no need to prevent >> it >> > > in shmem.c, no need for flags, seals, notifications to and fro >> because >> > > KVM is already in control and knows the history. If shmem actually >> has >> > > value, call into it underneath - somewhat like SysV SHM, and >> /dev/zero >> > > mmap, and i915/gem make use of it underneath. If shmem has nothing >> to >> > > add, just allocate and free kernel memory directly, recorded in your >> > > own xarray. >> > >> > I guess shim layer on top of shmem *can* work. I don't see immediately >> why >> > it would not. But I'm not sure it is right direction. We risk creating >> yet >> > another parallel VM with own rules/locking/accounting that opaque to >> > core-mm. >> Sorry for necrobumping this thread but I've been reviewing the > No worries, I'm just stoked someone who actually knows what they're doing > is > chiming in :-) +1, thanks Christian! >> memfd_restricted() extension that Ackerley is currently working on. I >> was pointed to this thread as this is what the extension is building >> on but I'll reply to both threads here. >> From a glance at v10, memfd_restricted() is currently implemented as an >> in-kernel stacking filesystem. A call to memfd_restricted() creates a >> new restricted memfd file and a new unlinked tmpfs file and stashes the >> tmpfs file into the memfd file's private data member. It then uses the >> tmpfs file's f_ops and i_ops to perform the relevant file and inode >> operations. So it has the same callstack as a general stacking >> filesystem like overlayfs in some cases: >> memfd_restricted->getattr() >> -> tmpfs->getattr() > ... >> Since you're effectively acting like a stacking filesystem you should >> really use the device number of your memfd restricted filesystem. IOW, >> sm like: >> stat->dev = memfd_restricted_dentry->d_sb->s_dev; >> But then you run into trouble if you want to go forward with Ackerley's >> extension that allows to explicitly pass in tmpfs fds to >> memfd_restricted(). Afaict, two tmpfs instances might allocate the same >> inode number. So now the inode and device number pair isn't unique >> anymore. >> So you might best be served by allocating and reporting your own inode >> numbers as well. >> But if you want to preserve the inode number and device number of the >> relevant tmpfs instance but still report memfd restricted as your >> filesystem type > Unless I missed something along the way, reporting memfd_restricted as a > distinct > filesystem is very much a non-goal. AFAIK it's purely a side effect of > the > proposed implementation. >> then I think it's reasonable to ask whether a stacking implementation >> really >> makes sense here. >> If you extend memfd_restricted() or even consider extending it in the >> future to take tmpfs file descriptors as arguments to identify the tmpfs >> instance in which to allocate the underlying tmpfs file for the new >> restricted memfd file you should really consider a tmpfs based >> implementation. >> Because at that point it just feels like a pointless wrapper to get >> custom f_ops and i_ops. Plus it's wasteful because you allocate dentries >> and inodes that you don't really care about at all. >> Just off the top of my hat you might be better served: >> * by a new ioctl() on tmpfs instances that >> yield regular tmpfs file descriptors with restricted f_ops and i_ops. >> That's not that different from btrfs subvolumes which effectively are >> directories but are created through an ioctl(). > I think this is more or less what we want to do, except via a dedicated > syscall > instead of an ioctl() so that the primary interface isn't strictly tied > to tmpfs, > e.g. so that it can be extended to other backing types in the future. >> * by a mount option to tmpfs that makes it act >> in this restricted manner then you don't need an ioctl() and can get >> away with regular open calls. Such a tmpfs instance would only create >> regular, restricted memfds. > I'd prefer to not go this route, becuase IIUC, it would require > relatively invasive > changes to shmem code, and IIUC would require similar changes to other > support > backings in the future, e.g. hugetlbfs? And as above, I don't think any > of the > potential use cases need restrictedmem to be a uniquely identifiable > mount. FWIW, I'm starting to look at extending restrictedmem to hugetlbfs and the separation that the current implementation has is very helpful. Also helps that hugetlbfs and tmpfs are structured similarly, I guess. > One of the goals (hopefully not a pipe dream) is to design restrictmem in > such a > way that extending it to support other backing types isn't terribly > difficult. > In case it's not obvious, most of us working on this stuff aren't > filesystems > experts, and many of us aren't mm experts either. The more we (KVM folks > for the > most part) can leverage existing code to do the heavy lifting, the better. > After giving myself a bit of a crash course in file systems, would > something like > the below have any chance of (a) working, (b) getting merged, and (c) > being > maintainable? > The idea is similar to a stacking filesystem, but instead of stacking, > restrictedmem > hijacks a f_ops and a_ops to create a lightweight shim around tmpfs. > There are > undoubtedly issues and edge cases, I'm just looking for a quick "yes, > this might > be doable" or a "no, that's absolutely bonkers, don't try it". Not an FS expert by any means, but I did think of approaching it this way as well! "Hijacking" perhaps gives this approach a bit of a negative connotation. I thought this is pretty close to subclassing (as in Object Oriented Programming). When some methods (e.g. fallocate) are called, restrictedmem does some work, and calls the same method in the superclass. The existing restrictedmem code is a more like instantiating an shmem object and keeping that object as a field within the restrictedmem object. Some (maybe small) issues I can think of now: (1) One difficulty with this approach is that other functions may make assumptions about private_data being of a certain type, or functions may use private_data. I checked and IIUC neither shmem nor hugetlbfs use the private_data field in the inode's i_mapping (also file's f_mapping). But there's fs/buffer.c which uses private_data, although those functions seem to be used by FSes like ext4 and fat, not memory-backed FSes. We can probably fix this if any backing filesystems of restrictedmem, like tmpfs and future ones use private_data. Could the solution here be to store private_data of the superclass instance in restrictedmem, and then override every method in the superclass that uses private_data to first restore private_data before making the superclass call? Perhaps we can take private_lock to change private_data. (2) Perhaps there are other slightly hidden cases that might need cleaning up. For example, one of the patches in this series amends the shmem_mapping() function from return mapping->a_ops == &shmem_aops; to return mapping->host->i_sb->s_magic == TMPFS_MAGIC; The former/original is more accurate since it checks a property of the mapping itself instead of checking a property of the mapping's host's superblock. The impact of changing this guard is more obvious if we now override a_ops but keep the mapping's host's superblock's s_magic. Specifically for this example, maybe we should handle restrictedmem in the caller (me_pagecache_clean()) specially, in addition to shmem. > Thanks! > struct restrictedmem { > struct rw_semaphore lock; > struct file *file; > const struct file_operations *backing_f_ops; > const struct address_space_operations *backing_a_ops; > struct xarray bindings; > bool exclusive; > }; > static int restrictedmem_release(struct inode *inode, struct file *file) > { > struct restrictedmem *rm = inode->i_mapping->private_data; > xa_destroy(&rm->bindings); > kfree(rm); > WARN_ON_ONCE(rm->backing_f_ops->release); > return 0; > } > static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode, > loff_t offset, loff_t len) > { > struct restrictedmem_notifier *notifier; > unsigned long index; > pgoff_t start, end; > int ret; > if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) > return -EINVAL; > start = offset >> PAGE_SHIFT; > end = (offset + len) >> PAGE_SHIFT; > /* > * Bindings must be stable across invalidation to ensure the start+end > * are balanced. > */ > down_read(&rm->lock); > xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) > notifier->ops->invalidate_start(notifier, start, end); > ret = rm->backing_f_ops->fallocate(rm->file, mode, offset, len); > xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) > notifier->ops->invalidate_end(notifier, start, end); > up_read(&rm->lock); > return ret; > } > static long restrictedmem_fallocate(struct file *file, int mode, > loff_t offset, loff_t len) > { > struct restrictedmem *rm = file->f_mapping->private_data; > if (mode & FALLOC_FL_PUNCH_HOLE) > return restrictedmem_punch_hole(rm, mode, offset, len); > return rm->backing_f_ops->fallocate(file, mode, offset, len); > } > static int restrictedmem_migrate_folio(struct address_space *mapping, > struct folio *dst, struct folio *src, > enum migrate_mode) > { > WARN_ON_ONCE(1); > return -EINVAL; > } > static int restrictedmem_error_page(struct address_space *mapping, > struct page *page) > { > struct restrictedmem *rm = mapping->private_data; > struct restrictedmem_notifier *notifier; > unsigned long index; > pgoff_t start, end; > start = page->index; > end = start + thp_nr_pages(page); > down_read(&rm->lock); > xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) > notifier->ops->error(notifier, start, end); > up_read(&rm->lock); > return rm->backing_a_ops->error_remove_page(mapping, page); > } When I was thinking of this I was stuck on handling error_remove_page, because it was looking up the superblock to iterate over the inodes to find the right mapping. Glad to see that the solution is simply to use the given mapping from the arguments! > static const struct file_operations restrictedmem_fops = { > .release = restrictedmem_release, > .fallocate = restrictedmem_fallocate, > }; > static const struct address_space_operations restrictedmem_aops = { > .dirty_folio = noop_dirty_folio, > #ifdef CONFIG_MIGRATION > .migrate_folio = restrictedmem_migrate_folio, > #endif > .error_remove_page = restrictedmem_error_page, > }; > static int restrictedmem_file_create(struct file *file) > { > struct address_space *mapping = file->f_mapping; > struct restrictedmem *rm; > rm = kzalloc(sizeof(*rm), GFP_KERNEL); > if (!rm) > return -ENOMEM; > rm->backing_f_ops = file->f_op; > rm->backing_a_ops = mapping->a_ops; > rm->file = file; We don't really need to do this, since rm->file is already the same as file, we could just pass the file itself when it's needed > init_rwsem(&rm->lock); > xa_init(&rm->bindings); > file->f_flags |= O_LARGEFILE; > file->f_op = &restrictedmem_fops; > mapping->a_ops = &restrictedmem_aops; I think we probably have to override inode_operations as well, because otherwise other methods would become available to a restrictedmem file (like link, unlink, mkdir, tmpfile). Or maybe that's a feature instead of a bug. > mapping_set_unevictable(mapping); > mapping_set_unmovable(mapping); > mapping_set_gfp_mask(mapping, > mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); > return 0; > } > static int restrictedmem_create(struct vfsmount *mount) > { > struct file *file; > int fd, err; > fd = get_unused_fd_flags(0); > if (fd < 0) > return fd; > file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, > VM_NORESERVE); > if (IS_ERR(file)) { > err = PTR_ERR(file); > goto err_fd; > } > if (WARN_ON_ONCE(file->private_data)) { > err = -EEXIST; > goto err_fd; > } Did you intend this as a check that the backing filesystem isn't using the private_data field in the mapping? I think you meant file->f_mapping->private_data. On this note, we will probably have to fix things whenever any backing filesystems need the private_data field. > file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; > file->f_flags |= O_LARGEFILE; > err = restrictedmem_file_create(file); > if (err) { > fput(file); > goto err_fd; > } > fd_install(fd, file); > return fd; > err_fd: > put_unused_fd(fd); > return err; > } > SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) > { > struct vfsmount *mnt; > struct path *path; > struct fd f; > int ret; > if (flags) > return -EINVAL; > f = fdget_raw(mount_fd); > if (!f.file) > return -EBADF; > ret = -EINVAL; > path = &f.file->f_path; > if (path->dentry != path->mnt->mnt_root) > goto out; > /* Disallow bind-mounts that aren't bind-mounts of the whole filesystem. > */ > mnt = path->mnt; > if (mnt->mnt_root != mnt->mnt_sb->s_root) > goto out; > /* > * The filesystem must be mounted no-execute, executing from guest > * private memory in the host is nonsensical and unsafe. > */ > if (!(mnt->mnt_sb->s_iflags & SB_I_NOEXEC)) > goto out; > /* Currently only TMPFS is supported as underlying storage. */ > if (mnt->mnt_sb->s_magic != TMPFS_MAGIC) > goto out; > ret = mnt_want_write(mnt); > if (ret) > goto out; > ret = restrictedmem_create(mnt); > if (mnt) > mnt_drop_write(mnt); > out: > if (f.file) > fdput(f); > return ret; > }
On Fri, Apr 14, 2023, Ackerley Tng wrote: > Sean Christopherson <seanjc@google.com> writes: > > > On Thu, Apr 13, 2023, Christian Brauner wrote: > > > * by a mount option to tmpfs that makes it act > > > in this restricted manner then you don't need an ioctl() and can get > > > away with regular open calls. Such a tmpfs instance would only create > > > regular, restricted memfds. > > > I'd prefer to not go this route, becuase IIUC, it would require relatively > > invasive changes to shmem code, and IIUC would require similar changes to > > other support backings in the future, e.g. hugetlbfs? And as above, I > > don't think any of the potential use cases need restrictedmem to be a > > uniquely identifiable mount. > > FWIW, I'm starting to look at extending restrictedmem to hugetlbfs and > the separation that the current implementation has is very helpful. Also > helps that hugetlbfs and tmpfs are structured similarly, I guess. > > > One of the goals (hopefully not a pipe dream) is to design restrictmem in > > such a way that extending it to support other backing types isn't terribly > > difficult. In case it's not obvious, most of us working on this stuff > > aren't filesystems experts, and many of us aren't mm experts either. The > > more we (KVM folks for the most part) can leverage existing code to do the > > heavy lifting, the better. > > > After giving myself a bit of a crash course in file systems, would > > something like the below have any chance of (a) working, (b) getting > > merged, and (c) being maintainable? > > > The idea is similar to a stacking filesystem, but instead of stacking, > > restrictedmem hijacks a f_ops and a_ops to create a lightweight shim around > > tmpfs. There are undoubtedly issues and edge cases, I'm just looking for a > > quick "yes, this might be doable" or a "no, that's absolutely bonkers, > > don't try it". > > Not an FS expert by any means, but I did think of approaching it this > way as well! > > "Hijacking" perhaps gives this approach a bit of a negative connotation. Heh, commandeer then. > I thought this is pretty close to subclassing (as in Object > Oriented Programming). When some methods (e.g. fallocate) are called, > restrictedmem does some work, and calls the same method in the > superclass. > > The existing restrictedmem code is a more like instantiating an shmem > object and keeping that object as a field within the restrictedmem > object. > > Some (maybe small) issues I can think of now: > > (1) > > One difficulty with this approach is that other functions may make > assumptions about private_data being of a certain type, or functions may > use private_data. > > I checked and IIUC neither shmem nor hugetlbfs use the private_data > field in the inode's i_mapping (also file's f_mapping). > > But there's fs/buffer.c which uses private_data, although those > functions seem to be used by FSes like ext4 and fat, not memory-backed > FSes. > > We can probably fix this if any backing filesystems of restrictedmem, > like tmpfs and future ones use private_data. Ya, if we go the route of poking into f_ops and stuff, I would want to add WARN_ON_ONCE() hardening of everything that restrictemem wants to "commandeer" ;-) > > static int restrictedmem_file_create(struct file *file) > > { > > struct address_space *mapping = file->f_mapping; > > struct restrictedmem *rm; > > > rm = kzalloc(sizeof(*rm), GFP_KERNEL); > > if (!rm) > > return -ENOMEM; > > > rm->backing_f_ops = file->f_op; > > rm->backing_a_ops = mapping->a_ops; > > rm->file = file; > > We don't really need to do this, since rm->file is already the same as > file, we could just pass the file itself when it's needed Aha! I was working on getting rid of it, but forgot to go back and do another pass. > > init_rwsem(&rm->lock); > > xa_init(&rm->bindings); > > > file->f_flags |= O_LARGEFILE; > > > file->f_op = &restrictedmem_fops; > > mapping->a_ops = &restrictedmem_aops; > > I think we probably have to override inode_operations as well, because > otherwise other methods would become available to a restrictedmem file > (like link, unlink, mkdir, tmpfile). Or maybe that's a feature instead > of a bug. I think we want those? What we want to restrict are operations that require read/write/execute access to the file, everything else should be ok. fallocate() is a special case because restrictmem needs to tell KVM to unmap the memory when a hole is punched. I assume ->setattr() needs similar treatment to handle ftruncate()? I'd love to hear Christian's input on this aspect of things. > > if (WARN_ON_ONCE(file->private_data)) { > > err = -EEXIST; > > goto err_fd; > > } > > Did you intend this as a check that the backing filesystem isn't using > the private_data field in the mapping? > > I think you meant file->f_mapping->private_data. Ya, sounds right. I should have added disclaimers that (a) I wrote this quite quickly and (b) it's compile tested only at this point. > On this note, we will probably have to fix things whenever any backing > filesystems need the private_data field. Yep. > > f = fdget_raw(mount_fd); > > if (!f.file) > > return -EBADF; ... > > /* > > * The filesystem must be mounted no-execute, executing from guest > > * private memory in the host is nonsensical and unsafe. > > */ > > if (!(mnt->mnt_sb->s_iflags & SB_I_NOEXEC)) > > goto out; Looking at this more closely, I don't think we need to require NOEXEC, things like like execve() should get squashed by virtue of not providing any read/write implementations. And dropping my misguided NOEXEC requirement means there's no reason to disallow using the kernel internal mount.
On Fri, Apr 14, 2023, Sean Christopherson wrote: > On Fri, Apr 14, 2023, Ackerley Tng wrote: > > Sean Christopherson <seanjc@google.com> writes: > > > if (WARN_ON_ONCE(file->private_data)) { > > > err = -EEXIST; > > > goto err_fd; > > > } > > > > Did you intend this as a check that the backing filesystem isn't using > > the private_data field in the mapping? > > > > I think you meant file->f_mapping->private_data. > > Ya, sounds right. I should have added disclaimers that (a) I wrote this quite > quickly and (b) it's compile tested only at this point. FWIW, here's a very lightly tested version that doesn't explode on a basic selftest. https://github.com/sean-jc/linux/tree/x86/upm_base_support
On Thu, Apr 13, 2023 at 03:28:43PM -0700, Sean Christopherson wrote: > On Thu, Apr 13, 2023, Christian Brauner wrote: > > On Thu, Aug 18, 2022 at 04:24:21PM +0300, Kirill A . Shutemov wrote: > > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote: > > > > Here's what I would prefer, and imagine much easier for you to maintain; > > > > but I'm no system designer, and may be misunderstanding throughout. > > > > > > > > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps > > > > the fallocate syscall interface itself) to allocate and free the memory, > > > > ioctl for initializing some of it too. KVM in control of whether that > > > > fd can be read or written or mmap'ed or whatever, no need to prevent it > > > > in shmem.c, no need for flags, seals, notifications to and fro because > > > > KVM is already in control and knows the history. If shmem actually has > > > > value, call into it underneath - somewhat like SysV SHM, and /dev/zero > > > > mmap, and i915/gem make use of it underneath. If shmem has nothing to > > > > add, just allocate and free kernel memory directly, recorded in your > > > > own xarray. > > > > > > I guess shim layer on top of shmem *can* work. I don't see immediately why > > > it would not. But I'm not sure it is right direction. We risk creating yet > > > another parallel VM with own rules/locking/accounting that opaque to > > > core-mm. > > > > Sorry for necrobumping this thread but I've been reviewing the > > No worries, I'm just stoked someone who actually knows what they're doing is > chiming in :-) It's a dangerous business, going out of your subsystem. You step into code, and if you don't watch your hands, there is no knowing where you might be swept off to. That saying goes for me here specifically... > > > memfd_restricted() extension that Ackerley is currently working on. I > > was pointed to this thread as this is what the extension is building > > on but I'll reply to both threads here. > > > > From a glance at v10, memfd_restricted() is currently implemented as an > > in-kernel stacking filesystem. A call to memfd_restricted() creates a > > new restricted memfd file and a new unlinked tmpfs file and stashes the > > tmpfs file into the memfd file's private data member. It then uses the > > tmpfs file's f_ops and i_ops to perform the relevant file and inode > > operations. So it has the same callstack as a general stacking > > filesystem like overlayfs in some cases: > > > > memfd_restricted->getattr() > > -> tmpfs->getattr() > > ... > > > Since you're effectively acting like a stacking filesystem you should > > really use the device number of your memfd restricted filesystem. IOW, > > sm like: > > > > stat->dev = memfd_restricted_dentry->d_sb->s_dev; > > > > But then you run into trouble if you want to go forward with Ackerley's > > extension that allows to explicitly pass in tmpfs fds to > > memfd_restricted(). Afaict, two tmpfs instances might allocate the same > > inode number. So now the inode and device number pair isn't unique > > anymore. > > > > So you might best be served by allocating and reporting your own inode > > numbers as well. > > > > But if you want to preserve the inode number and device number of the > > relevant tmpfs instance but still report memfd restricted as your > > filesystem type > > Unless I missed something along the way, reporting memfd_restricted as a distinct > filesystem is very much a non-goal. AFAIK it's purely a side effect of the > proposed implementation. In the current implementation you would have to put in effort to fake this. For example, you would need to also implement ->statfs super_operation where you'd need to fill in the details of the tmpfs instance. At that point all that memfd_restricted fs code that you've written is nothing but deadweight, I would reckon. > > > then I think it's reasonable to ask whether a stacking implementation really > > makes sense here. > > > > If you extend memfd_restricted() or even consider extending it in the > > future to take tmpfs file descriptors as arguments to identify the tmpfs > > instance in which to allocate the underlying tmpfs file for the new > > restricted memfd file you should really consider a tmpfs based > > implementation. > > > > Because at that point it just feels like a pointless wrapper to get > > custom f_ops and i_ops. Plus it's wasteful because you allocate dentries > > and inodes that you don't really care about at all. > > > > Just off the top of my hat you might be better served: > > * by a new ioctl() on tmpfs instances that > > yield regular tmpfs file descriptors with restricted f_ops and i_ops. > > That's not that different from btrfs subvolumes which effectively are > > directories but are created through an ioctl(). > > I think this is more or less what we want to do, except via a dedicated syscall > instead of an ioctl() so that the primary interface isn't strictly tied to tmpfs, > e.g. so that it can be extended to other backing types in the future. Ok. But just to point this out, this would make memfd_restricted() a multiplexer on types of memory. And my wild guess is that not all memory types you might reasonably want to use will have a filesystem like interface such. So in the future you might end up with multiple ways of specifying the type of memory: // use tmpfs backing memfd_restricted(fd_tmpfs, 0); // use hugetlbfs backing memfd_restricted(fd_hugetlbfs, 0); // use non-fs type memory backing memfd_restricted(-EBADF, MEMFD_SUPER_FANCY_MEMORY_TYPE); interface wise I find an unpleasant design. But that multi-memory-open goal also makes it a bit hard to come up with a clean design (On possibility would be to use an extensible struct - versioned by size - similar to openat2() and clone3() such that you can specify all types of options on the memory in the future.). > > > * by a mount option to tmpfs that makes it act > > in this restricted manner then you don't need an ioctl() and can get > > away with regular open calls. Such a tmpfs instance would only create > > regular, restricted memfds. > > I'd prefer to not go this route, becuase IIUC, it would require relatively invasive > changes to shmem code, and IIUC would require similar changes to other support > backings in the future, e.g. hugetlbfs? And as above, I don't think any of the > potential use cases need restrictedmem to be a uniquely identifiable mount. Ok, see my comment above then. > > One of the goals (hopefully not a pipe dream) is to design restrictmem in such a > way that extending it to support other backing types isn't terribly difficult. Not necessarily difficult, just difficult to do tastefully imho. But it's not that has traditionally held people back. ;) > In case it's not obvious, most of us working on this stuff aren't filesystems > experts, and many of us aren't mm experts either. The more we (KVM folks for the > most part) can leverage existing code to do the heavy lifting, the better. Well, hopefully we can complement each other's knowledge here. > > After giving myself a bit of a crash course in file systems, would something like > the below have any chance of (a) working, (b) getting merged, and (c) being > maintainable? > > The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem > hijacks a f_ops and a_ops to create a lightweight shim around tmpfs. There are > undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might > be doable" or a "no, that's absolutely bonkers, don't try it". Maybe, but I think it's weird. _Replacing_ f_ops isn't something that's unprecedented. It happens everytime a character device is opened (see fs/char_dev.c:chrdev_open()). And debugfs does a similar (much more involved) thing where it replaces it's proxy f_ops with the relevant subsystem's f_ops. The difference is that in both cases the replace happens at ->open() time; and the replace is done once. Afterwards only the newly added f_ops are relevant. In your case you'd be keeping two sets of {f,a}_ops; one usable by userspace and another only usable by in-kernel consumers. And there are some concerns (non-exhaustive list), I think: * {f,a}_ops weren't designed for this. IOW, one set of {f,a}_ops is authoritative per @file and it is left to the individual subsystems to maintain driver specific ops (see the sunrpc stuff or sockets). * lifetime management for the two sets of {f,a}_ops: If the ops belong to a module then you need to make sure that the module can't get unloaded while you're using the fops. Might not be a concern in this case. * brittleness: Not all f_ops for example deal with userspace functionality some deal with cleanup when the file is closed like ->release(). So it's delicate to override that functionality with custom f_ops. Restricted memfds could easily forget to cleanup resources. * Potential for confusion why there's two sets of {f,a}_ops. * f_ops specifically are generic across a vast amount of consumers and are subject to change. If memfd_restricted() has specific requirements because of this weird double-use they won't be taken into account. I find this hard to navigate tbh and it feels like taking a shortcut to avoid building a proper api. If you only care about a specific set of operations specific to memfd restricte that needs to be available to in-kernel consumers, I wonder if you shouldn't just go one step further then your proposal below and build a dedicated minimal ops api. Idk, sketching like a madman on a drawning board here with no claim to feasibility from a mm perspective whatsoever: struct restrictedmem_ops { // only contains very limited stuff you need or special stuff // you nee, similar to struct proto_ops (sockets) and so on }; struct restrictedmem { const struct restrictedmem_ops ops; }; This would avoid fuzzing with two different set of {f,a}_ops in this brittle way. It would force you to clarify the semantics that you need and the operations that you need or don't need implemented. And it would get rid of the ambiguity inherent to using two sets of {f,a}_ops.
On Wed, Apr 19, 2023, Christian Brauner wrote: > On Thu, Apr 13, 2023 at 03:28:43PM -0700, Sean Christopherson wrote: > > > But if you want to preserve the inode number and device number of the > > > relevant tmpfs instance but still report memfd restricted as your > > > filesystem type > > > > Unless I missed something along the way, reporting memfd_restricted as a distinct > > filesystem is very much a non-goal. AFAIK it's purely a side effect of the > > proposed implementation. > > In the current implementation you would have to put in effort to fake > this. For example, you would need to also implement ->statfs > super_operation where you'd need to fill in the details of the tmpfs > instance. At that point all that memfd_restricted fs code that you've > written is nothing but deadweight, I would reckon. After digging a bit, I suspect the main reason Kirill implemented an overlay to inode_operations was to prevent modifying the file size via ->setattr(). Relying on shmem_setattr() to unmap entries in KVM's MMU wouldn't work because, by design, the memory can't be mmap()'d into host userspace. if (attr->ia_valid & ATTR_SIZE) { if (memfd->f_inode->i_size) return -EPERM; if (!PAGE_ALIGNED(attr->ia_size)) return -EINVAL; } But I think we can solve this particular problem by using F_SEAL_{GROW,SHRINK} or SHMEM_LONGPIN. For a variety of reasons, I'm leaning more and more toward making this a KVM ioctl() instead of a dedicated syscall, at which point we can be both more flexible and more draconian, e.g. let userspace provide the file size at the time of creation, but make the size immutable, at least by default. > > After giving myself a bit of a crash course in file systems, would something like > > the below have any chance of (a) working, (b) getting merged, and (c) being > > maintainable? > > > > The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem > > hijacks a f_ops and a_ops to create a lightweight shim around tmpfs. There are > > undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might > > be doable" or a "no, that's absolutely bonkers, don't try it". > > Maybe, but I think it's weird. Yeah, agreed. > _Replacing_ f_ops isn't something that's unprecedented. It happens everytime > a character device is opened (see fs/char_dev.c:chrdev_open()). And debugfs > does a similar (much more involved) thing where it replaces it's proxy f_ops > with the relevant subsystem's f_ops. The difference is that in both cases the > replace happens at ->open() time; and the replace is done once. Afterwards > only the newly added f_ops are relevant. > > In your case you'd be keeping two sets of {f,a}_ops; one usable by > userspace and another only usable by in-kernel consumers. And there are > some concerns (non-exhaustive list), I think: > > * {f,a}_ops weren't designed for this. IOW, one set of {f,a}_ops is > authoritative per @file and it is left to the individual subsystems to > maintain driver specific ops (see the sunrpc stuff or sockets). > * lifetime management for the two sets of {f,a}_ops: If the ops belong > to a module then you need to make sure that the module can't get > unloaded while you're using the fops. Might not be a concern in this > case. Ah, whereas I assume the owner of inode_operations is pinned by ??? (dentry?) holding a reference to the inode? > * brittleness: Not all f_ops for example deal with userspace > functionality some deal with cleanup when the file is closed like > ->release(). So it's delicate to override that functionality with > custom f_ops. Restricted memfds could easily forget to cleanup > resources. > * Potential for confusion why there's two sets of {f,a}_ops. > * f_ops specifically are generic across a vast amount of consumers and > are subject to change. If memfd_restricted() has specific requirements > because of this weird double-use they won't be taken into account. > > I find this hard to navigate tbh and it feels like taking a shortcut to > avoid building a proper api. Agreed. At the very least, it would be better to take an explicit dependency on whatever APIs are being used instead of somewhat blindly bouncing through ->fallocate(). I think that gives us a clearer path to getting something merged too, as we'll need Acks on making specific functions visible, i.e. will give MM maintainers something concrete to react too. > If you only care about a specific set of operations specific to memfd > restricte that needs to be available to in-kernel consumers, I wonder if you > shouldn't just go one step further then your proposal below and build a > dedicated minimal ops api. This is actually very doable for shmem. Unless I'm missing something, because our use case doesn't allow mmap(), swap, or migration, a good chunk of shmem_fallocate() is simply irrelevant. The result is only ~100 lines of code, and quite straightforward. My biggest concern, outside of missing a detail in shmem, is adding support for HugeTLBFS, which is likely going to be requested/needed sooner than later. At a glance, hugetlbfs_fallocate() is quite a bit more complex, i.e. not something I'm keen to duplicate. But that's also a future problem to some extent, as it's purely kernel internals; the uAPI side of things doesn't seem like it'll be messy at all. Thanks again!
On Wed, Apr 19, 2023 at 05:49:55PM -0700, Sean Christopherson wrote: > On Wed, Apr 19, 2023, Christian Brauner wrote: > > On Thu, Apr 13, 2023 at 03:28:43PM -0700, Sean Christopherson wrote: > > > > But if you want to preserve the inode number and device number of the > > > > relevant tmpfs instance but still report memfd restricted as your > > > > filesystem type > > > > > > Unless I missed something along the way, reporting memfd_restricted as a distinct > > > filesystem is very much a non-goal. AFAIK it's purely a side effect of the > > > proposed implementation. > > > > In the current implementation you would have to put in effort to fake > > this. For example, you would need to also implement ->statfs > > super_operation where you'd need to fill in the details of the tmpfs > > instance. At that point all that memfd_restricted fs code that you've > > written is nothing but deadweight, I would reckon. > > After digging a bit, I suspect the main reason Kirill implemented an overlay to > inode_operations was to prevent modifying the file size via ->setattr(). Relying > on shmem_setattr() to unmap entries in KVM's MMU wouldn't work because, by design, > the memory can't be mmap()'d into host userspace. > > if (attr->ia_valid & ATTR_SIZE) { > if (memfd->f_inode->i_size) > return -EPERM; > > if (!PAGE_ALIGNED(attr->ia_size)) > return -EINVAL; > } > > But I think we can solve this particular problem by using F_SEAL_{GROW,SHRINK} or > SHMEM_LONGPIN. For a variety of reasons, I'm leaning more and more toward making > this a KVM ioctl() instead of a dedicated syscall, at which point we can be both > more flexible and more draconian, e.g. let userspace provide the file size at the > time of creation, but make the size immutable, at least by default. > > > > After giving myself a bit of a crash course in file systems, would something like > > > the below have any chance of (a) working, (b) getting merged, and (c) being > > > maintainable? > > > > > > The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem > > > hijacks a f_ops and a_ops to create a lightweight shim around tmpfs. There are > > > undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might > > > be doable" or a "no, that's absolutely bonkers, don't try it". > > > > Maybe, but I think it's weird. > > Yeah, agreed. > > > _Replacing_ f_ops isn't something that's unprecedented. It happens everytime > > a character device is opened (see fs/char_dev.c:chrdev_open()). And debugfs > > does a similar (much more involved) thing where it replaces it's proxy f_ops > > with the relevant subsystem's f_ops. The difference is that in both cases the > > replace happens at ->open() time; and the replace is done once. Afterwards > > only the newly added f_ops are relevant. > > > > In your case you'd be keeping two sets of {f,a}_ops; one usable by > > userspace and another only usable by in-kernel consumers. And there are > > some concerns (non-exhaustive list), I think: > > > > * {f,a}_ops weren't designed for this. IOW, one set of {f,a}_ops is > > authoritative per @file and it is left to the individual subsystems to > > maintain driver specific ops (see the sunrpc stuff or sockets). > > * lifetime management for the two sets of {f,a}_ops: If the ops belong > > to a module then you need to make sure that the module can't get > > unloaded while you're using the fops. Might not be a concern in this > > case. > > Ah, whereas I assume the owner of inode_operations is pinned by ??? (dentry?) > holding a reference to the inode? I don't think it would be possible to safely replace inode_operations after the inode's been made visible in caches. It works with file_operations because when a file is opened a new struct file is allocated which isn't reachable anywhere before fd_install() is called. So it is possible to replace f_ops in the default f->f_op->open() method (which is what devices do as the inode is located on e.g., ext4/xfs/tmpfs but the functionality of the device usually provided by some driver/module through its file_operations). The default f_ops are taken from i_fop of the inode. The lifetime of the file_/inode_operations will be aligned with the lifetime of the module they're originating from. If only file_/inode_operations are used from within the same module then there should never be any lifetime concerns. So an inode doesn't explictly pin file_/inode_operations because there's usually no need to do that and it be weird if each new inode would take a reference on the f_ops/i_ops on the off-chance that someone _might_ open the file. Let alone the overhead of calling try_module_get() everytime a new inode is added to the cache. There are various fs objects - the superblock which is pinning the filesystem/module - that exceed the lifetime of inodes and dentries. Both also may be dropped from their respective caches and readded later. Pinning of the module for f_ops is done because it is possible that some filesystem/driver might want to use the file_operations of some other filesystem/driver by default and they are in separate modules. So the fops_get() in do_dentry_open is there because it's not guaranteed that file_/inode_operations originate from the same module as the inode that's opened. If the module is still alive during the open then a reference to its f_ops is taken if not then the open will fail with ENODEV. That's to the best of my knowledge. > > > * brittleness: Not all f_ops for example deal with userspace > > functionality some deal with cleanup when the file is closed like > > ->release(). So it's delicate to override that functionality with > > custom f_ops. Restricted memfds could easily forget to cleanup > > resources. > > * Potential for confusion why there's two sets of {f,a}_ops. > > * f_ops specifically are generic across a vast amount of consumers and > > are subject to change. If memfd_restricted() has specific requirements > > because of this weird double-use they won't be taken into account. > > > > I find this hard to navigate tbh and it feels like taking a shortcut to > > avoid building a proper api. > > Agreed. At the very least, it would be better to take an explicit dependency on > whatever APIs are being used instead of somewhat blindly bouncing through ->fallocate(). > I think that gives us a clearer path to getting something merged too, as we'll > need Acks on making specific functions visible, i.e. will give MM maintainers > something concrete to react too. > > > If you only care about a specific set of operations specific to memfd > > restricte that needs to be available to in-kernel consumers, I wonder if you > > shouldn't just go one step further then your proposal below and build a > > dedicated minimal ops api. > > This is actually very doable for shmem. Unless I'm missing something, because > our use case doesn't allow mmap(), swap, or migration, a good chunk of > shmem_fallocate() is simply irrelevant. The result is only ~100 lines of code, > and quite straightforward. > > My biggest concern, outside of missing a detail in shmem, is adding support for > HugeTLBFS, which is likely going to be requested/needed sooner than later. At a > glance, hugetlbfs_fallocate() is quite a bit more complex, i.e. not something I'm > keen to duplicate. But that's also a future problem to some extent, as it's > purely kernel internals; the uAPI side of things doesn't seem like it'll be messy > at all. > > Thanks again! Sure thing.
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 320480a8db4f..dc70ba90247e 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -455,3 +455,4 @@ 448 i386 process_mrelease sys_process_mrelease 449 i386 futex_waitv sys_futex_waitv 450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node +451 i386 memfd_restricted sys_memfd_restricted diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index c84d12608cd2..06516abc8318 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -372,6 +372,7 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common memfd_restricted sys_memfd_restricted # # Due to a historical design error, certain syscalls are numbered differently diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h new file mode 100644 index 000000000000..c2700c5daa43 --- /dev/null +++ b/include/linux/restrictedmem.h @@ -0,0 +1,71 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _LINUX_RESTRICTEDMEM_H + +#include <linux/file.h> +#include <linux/magic.h> +#include <linux/pfn_t.h> + +struct restrictedmem_notifier; + +struct restrictedmem_notifier_ops { + void (*invalidate_start)(struct restrictedmem_notifier *notifier, + pgoff_t start, pgoff_t end); + void (*invalidate_end)(struct restrictedmem_notifier *notifier, + pgoff_t start, pgoff_t end); + void (*error)(struct restrictedmem_notifier *notifier, + pgoff_t start, pgoff_t end); +}; + +struct restrictedmem_notifier { + struct list_head list; + const struct restrictedmem_notifier_ops *ops; +}; + +#ifdef CONFIG_RESTRICTEDMEM + +void restrictedmem_register_notifier(struct file *file, + struct restrictedmem_notifier *notifier); +void restrictedmem_unregister_notifier(struct file *file, + struct restrictedmem_notifier *notifier); + +int restrictedmem_get_page(struct file *file, pgoff_t offset, + struct page **pagep, int *order); + +static inline bool file_is_restrictedmem(struct file *file) +{ + return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC; +} + +void restrictedmem_error_page(struct page *page, struct address_space *mapping); + +#else + +static inline void restrictedmem_register_notifier(struct file *file, + struct restrictedmem_notifier *notifier) +{ +} + +static inline void restrictedmem_unregister_notifier(struct file *file, + struct restrictedmem_notifier *notifier) +{ +} + +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset, + struct page **pagep, int *order) +{ + return -1; +} + +static inline bool file_is_restrictedmem(struct file *file) +{ + return false; +} + +static inline void restrictedmem_error_page(struct page *page, + struct address_space *mapping) +{ +} + +#endif /* CONFIG_RESTRICTEDMEM */ + +#endif /* _LINUX_RESTRICTEDMEM_H */ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index a34b0f9a9972..f9e9e0c820c5 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags); asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len, unsigned long home_node, unsigned long flags); +asmlinkage long sys_memfd_restricted(unsigned int flags); /* * Architecture-specific system calls diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 45fa180cc56a..e93cd35e46d0 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) #define __NR_set_mempolicy_home_node 450 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) +#define __NR_memfd_restricted 451 +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted) + #undef __NR_syscalls -#define __NR_syscalls 451 +#define __NR_syscalls 452 /* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h index 6325d1d0e90f..8aa38324b90a 100644 --- a/include/uapi/linux/magic.h +++ b/include/uapi/linux/magic.h @@ -101,5 +101,6 @@ #define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */ #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */ #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */ +#define RESTRICTEDMEM_MAGIC 0x5245534d /* "RESM" */ #endif /* __LINUX_MAGIC_H__ */ diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 860b2dcf3ac4..7c4a32cbd2e7 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free); /* memfd_secret */ COND_SYSCALL(memfd_secret); +/* memfd_restricted */ +COND_SYSCALL(memfd_restricted); + /* * Architecture specific weak syscall entries. */ diff --git a/mm/Kconfig b/mm/Kconfig index 57e1d8c5b505..06b0e1d6b8c1 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1076,6 +1076,10 @@ config IO_MAPPING config SECRETMEM def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED +config RESTRICTEDMEM + bool + depends on TMPFS + config ANON_VMA_NAME bool "Anonymous VMA name support" depends on PROC_FS && ADVISE_SYSCALLS && MMU diff --git a/mm/Makefile b/mm/Makefile index 8e105e5b3e29..bcbb0edf9ba1 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -121,6 +121,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o obj-$(CONFIG_SECRETMEM) += secretmem.o +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 145bb561ddb3..f91b444e471e 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -62,6 +62,7 @@ #include <linux/page-isolation.h> #include <linux/pagewalk.h> #include <linux/shmem_fs.h> +#include <linux/restrictedmem.h> #include "swap.h" #include "internal.h" #include "ras/ras_event.h" @@ -940,6 +941,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p) goto out; } + restrictedmem_error_page(p, mapping); + /* * The shmem page is kept in page cache instead of truncating * so is expected to have an extra refcount after error-handling. diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c new file mode 100644 index 000000000000..56953c204e5c --- /dev/null +++ b/mm/restrictedmem.c @@ -0,0 +1,318 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "linux/sbitmap.h" +#include <linux/pagemap.h> +#include <linux/pseudo_fs.h> +#include <linux/shmem_fs.h> +#include <linux/syscalls.h> +#include <uapi/linux/falloc.h> +#include <uapi/linux/magic.h> +#include <linux/restrictedmem.h> + +struct restrictedmem_data { + struct mutex lock; + struct file *memfd; + struct list_head notifiers; +}; + +static void restrictedmem_invalidate_start(struct restrictedmem_data *data, + pgoff_t start, pgoff_t end) +{ + struct restrictedmem_notifier *notifier; + + mutex_lock(&data->lock); + list_for_each_entry(notifier, &data->notifiers, list) { + notifier->ops->invalidate_start(notifier, start, end); + } + mutex_unlock(&data->lock); +} + +static void restrictedmem_invalidate_end(struct restrictedmem_data *data, + pgoff_t start, pgoff_t end) +{ + struct restrictedmem_notifier *notifier; + + mutex_lock(&data->lock); + list_for_each_entry(notifier, &data->notifiers, list) { + notifier->ops->invalidate_end(notifier, start, end); + } + mutex_unlock(&data->lock); +} + +static void restrictedmem_notifier_error(struct restrictedmem_data *data, + pgoff_t start, pgoff_t end) +{ + struct restrictedmem_notifier *notifier; + + mutex_lock(&data->lock); + list_for_each_entry(notifier, &data->notifiers, list) { + notifier->ops->error(notifier, start, end); + } + mutex_unlock(&data->lock); +} + +static int restrictedmem_release(struct inode *inode, struct file *file) +{ + struct restrictedmem_data *data = inode->i_mapping->private_data; + + fput(data->memfd); + kfree(data); + return 0; +} + +static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode, + loff_t offset, loff_t len) +{ + int ret; + pgoff_t start, end; + struct file *memfd = data->memfd; + + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) + return -EINVAL; + + start = offset >> PAGE_SHIFT; + end = (offset + len) >> PAGE_SHIFT; + + restrictedmem_invalidate_start(data, start, end); + ret = memfd->f_op->fallocate(memfd, mode, offset, len); + restrictedmem_invalidate_end(data, start, end); + + return ret; +} + +static long restrictedmem_fallocate(struct file *file, int mode, + loff_t offset, loff_t len) +{ + struct restrictedmem_data *data = file->f_mapping->private_data; + struct file *memfd = data->memfd; + + if (mode & FALLOC_FL_PUNCH_HOLE) + return restrictedmem_punch_hole(data, mode, offset, len); + + return memfd->f_op->fallocate(memfd, mode, offset, len); +} + +static const struct file_operations restrictedmem_fops = { + .release = restrictedmem_release, + .fallocate = restrictedmem_fallocate, +}; + +static int restrictedmem_getattr(struct user_namespace *mnt_userns, + const struct path *path, struct kstat *stat, + u32 request_mask, unsigned int query_flags) +{ + struct inode *inode = d_inode(path->dentry); + struct restrictedmem_data *data = inode->i_mapping->private_data; + struct file *memfd = data->memfd; + + return memfd->f_inode->i_op->getattr(mnt_userns, path, stat, + request_mask, query_flags); +} + +static int restrictedmem_setattr(struct user_namespace *mnt_userns, + struct dentry *dentry, struct iattr *attr) +{ + struct inode *inode = d_inode(dentry); + struct restrictedmem_data *data = inode->i_mapping->private_data; + struct file *memfd = data->memfd; + int ret; + + if (attr->ia_valid & ATTR_SIZE) { + if (memfd->f_inode->i_size) + return -EPERM; + + if (!PAGE_ALIGNED(attr->ia_size)) + return -EINVAL; + } + + ret = memfd->f_inode->i_op->setattr(mnt_userns, + file_dentry(memfd), attr); + return ret; +} + +static const struct inode_operations restrictedmem_iops = { + .getattr = restrictedmem_getattr, + .setattr = restrictedmem_setattr, +}; + +static int restrictedmem_init_fs_context(struct fs_context *fc) +{ + if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC)) + return -ENOMEM; + + fc->s_iflags |= SB_I_NOEXEC; + return 0; +} + +static struct file_system_type restrictedmem_fs = { + .owner = THIS_MODULE, + .name = "memfd:restrictedmem", + .init_fs_context = restrictedmem_init_fs_context, + .kill_sb = kill_anon_super, +}; + +static struct vfsmount *restrictedmem_mnt; + +static __init int restrictedmem_init(void) +{ + restrictedmem_mnt = kern_mount(&restrictedmem_fs); + if (IS_ERR(restrictedmem_mnt)) + return PTR_ERR(restrictedmem_mnt); + return 0; +} +fs_initcall(restrictedmem_init); + +static struct file *restrictedmem_file_create(struct file *memfd) +{ + struct restrictedmem_data *data; + struct address_space *mapping; + struct inode *inode; + struct file *file; + + data = kzalloc(sizeof(*data), GFP_KERNEL); + if (!data) + return ERR_PTR(-ENOMEM); + + data->memfd = memfd; + mutex_init(&data->lock); + INIT_LIST_HEAD(&data->notifiers); + + inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb); + if (IS_ERR(inode)) { + kfree(data); + return ERR_CAST(inode); + } + + inode->i_mode |= S_IFREG; + inode->i_op = &restrictedmem_iops; + inode->i_mapping->private_data = data; + + file = alloc_file_pseudo(inode, restrictedmem_mnt, + "restrictedmem", O_RDWR, + &restrictedmem_fops); + if (IS_ERR(file)) { + iput(inode); + kfree(data); + return ERR_CAST(file); + } + + file->f_flags |= O_LARGEFILE; + + /* + * These pages are currently unmovable so don't place them into movable + * pageblocks (e.g. CMA and ZONE_MOVABLE). + */ + mapping = memfd->f_mapping; + mapping_set_unevictable(mapping); + mapping_set_gfp_mask(mapping, + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); + + return file; +} + +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) +{ + struct file *file, *restricted_file; + int fd, err; + + if (flags) + return -EINVAL; + + fd = get_unused_fd_flags(0); + if (fd < 0) + return fd; + + file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE); + if (IS_ERR(file)) { + err = PTR_ERR(file); + goto err_fd; + } + file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; + file->f_flags |= O_LARGEFILE; + + restricted_file = restrictedmem_file_create(file); + if (IS_ERR(restricted_file)) { + err = PTR_ERR(restricted_file); + fput(file); + goto err_fd; + } + + fd_install(fd, restricted_file); + return fd; +err_fd: + put_unused_fd(fd); + return err; +} + +void restrictedmem_register_notifier(struct file *file, + struct restrictedmem_notifier *notifier) +{ + struct restrictedmem_data *data = file->f_mapping->private_data; + + mutex_lock(&data->lock); + list_add(¬ifier->list, &data->notifiers); + mutex_unlock(&data->lock); +} +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier); + +void restrictedmem_unregister_notifier(struct file *file, + struct restrictedmem_notifier *notifier) +{ + struct restrictedmem_data *data = file->f_mapping->private_data; + + mutex_lock(&data->lock); + list_del(¬ifier->list); + mutex_unlock(&data->lock); +} +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier); + +int restrictedmem_get_page(struct file *file, pgoff_t offset, + struct page **pagep, int *order) +{ + struct restrictedmem_data *data = file->f_mapping->private_data; + struct file *memfd = data->memfd; + struct folio *folio; + struct page *page; + int ret; + + ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE); + if (ret) + return ret; + + page = folio_file_page(folio, offset); + *pagep = page; + if (order) + *order = thp_order(compound_head(page)); + + SetPageUptodate(page); + unlock_page(page); + + return 0; +} +EXPORT_SYMBOL_GPL(restrictedmem_get_page); + +void restrictedmem_error_page(struct page *page, struct address_space *mapping) +{ + struct super_block *sb = restrictedmem_mnt->mnt_sb; + struct inode *inode, *next; + + if (!shmem_mapping(mapping)) + return; + + spin_lock(&sb->s_inode_list_lock); + list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) { + struct restrictedmem_data *data = inode->i_mapping->private_data; + struct file *memfd = data->memfd; + + if (memfd->f_mapping == mapping) { + pgoff_t start, end; + + spin_unlock(&sb->s_inode_list_lock); + + start = page->index; + end = start + thp_nr_pages(page); + restrictedmem_notifier_error(data, start, end); + return; + } + } + spin_unlock(&sb->s_inode_list_lock); +}