Message ID | 20230218211433.26859-34-rick.p.edgecombe@intel.com |
---|---|
State | New |
Headers |
From: Rick Edgecombe <rick.p.edgecombe@intel.com>
To: x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann <arnd@arndb.de>, Andy Lutomirski <luto@kernel.org>, Balbir Singh <bsingharora@gmail.com>, Borislav Petkov <bp@alien8.de>, Cyrill Gorcunov <gorcunov@gmail.com>, Dave Hansen <dave.hansen@linux.intel.com>, Eugene Syromiatnikov <esyr@redhat.com>, Florian Weimer <fweimer@redhat.com>, "H.J. Lu" <hjl.tools@gmail.com>, Jann Horn <jannh@google.com>, Jonathan Corbet <corbet@lwn.net>, Kees Cook <keescook@chromium.org>, Mike Kravetz <mike.kravetz@oracle.com>, Nadav Amit <nadav.amit@gmail.com>, Oleg Nesterov <oleg@redhat.com>, Pavel Machek <pavel@ucw.cz>, Peter Zijlstra <peterz@infradead.org>, Randy Dunlap <rdunlap@infradead.org>, Weijiang Yang <weijiang.yang@intel.com>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, John Allen <john.allen@amd.com>, kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com
Cc: rick.p.edgecombe@intel.com
Subject: [PATCH v6 33/41] x86/shstk: Introduce map_shadow_stack syscall
Date: Sat, 18 Feb 2023 13:14:25 -0800
Message-Id: <20230218211433.26859-34-rick.p.edgecombe@intel.com>
In-Reply-To: <20230218211433.26859-1-rick.p.edgecombe@intel.com>
References: <20230218211433.26859-1-rick.p.edgecombe@intel.com> |
Series | Shadow stacks for userspace |
Commit Message
Edgecombe, Rick P
Feb. 18, 2023, 9:14 p.m. UTC
When operating with shadow stacks enabled, the kernel will automatically allocate shadow stacks for new threads; however, in some cases userspace will need additional shadow stacks. The main example of this is the ucontext family of functions, which require userspace to allocate and pivot to userspace-managed stacks.

Unlike most other user memory permissions, shadow stacks need to be provisioned with special data in order to be useful. They need to be set up with a restore token so that userspace can pivot to them via the RSTORSSP instruction. But the security design of shadow stacks is that they should not be written to except in limited circumstances. This presents a problem for userspace: how can it provision this special data without allowing the shadow stack to be generally writable?

Previously, a new PROT_SHADOW_STACK was attempted, which could be mprotect()ed from RW permissions after the data was provisioned. This was found to not be secure enough, as other threads could write to the shadow stack during the writable window.

The kernel can use a special instruction, WRUSS, to write directly to userspace shadow stacks. So the solution can be that memory is mapped with shadow stack permissions from the beginning (never generally writable in userspace), and the kernel itself writes the restore token.

First, a new madvise() flag was explored, which could operate on PROT_SHADOW_STACK memory. This had a couple of downsides:
1. Extra checks were needed in mprotect() to prevent writable memory from ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to prevent restore tokens being written into the middle of pre-used shadow stacks. It is ideal to prevent restore tokens being added at arbitrary locations, so the check was to make sure the shadow stack had never been written to.
3. It stood out from the rest of the madvise flags, as more of a direct action than a hint at future desired behavior.

So rather than repurpose two existing syscalls (mmap, madvise) that don't quite fit, just implement a new map_shadow_stack syscall to allow userspace to map and set up new shadow stacks in one step. While ucontext is the primary motivator, userspace may have other unforeseen reasons to set up its own shadow stacks using the WRSS instruction. Toward this end, provide a flag so that stacks can optionally be set up securely for the common case of ucontext without enabling WRSS, or so that the kernel can set up the shadow stack in some new way in the future.

The following example demonstrates how to create a new shadow stack with map_shadow_stack:

void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v5:
- Fix addr/mapped_addr (Kees)
- Switch to EOPNOTSUPP (Kees suggested ENOTSUPP, but checkpatch suggests this)
- Return error for addresses below 4G

v3:
- Change syscall common -> 64 (Kees)
- Use bit shift notation instead of 0x1 for uapi header (Kees)
- Call do_mmap() with MAP_FIXED_NOREPLACE (Kees)
- Block unsupported flags (Kees)
- Require size >= 8 to set token (Kees)

v2:
- Change syscall to take address like mmap() for CRIU's usage

v1:
- New patch (replaces PROT_SHADOW_STACK).
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 arch/x86/include/uapi/asm/mman.h       |  3 ++
 arch/x86/kernel/shstk.c                | 59 ++++++++++++++++++++++----
 include/linux/syscalls.h               |  1 +
 include/uapi/asm-generic/unistd.h      |  2 +-
 kernel/sys_ni.c                        |  1 +
 6 files changed, 58 insertions(+), 9 deletions(-)
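[Editor's note: as a quick illustration of the new interface from userspace, here is a minimal sketch. It assumes a kernel built with this series and no libc wrapper, so it goes through syscall(2) using the x86-64 syscall number 451 and the SHADOW_STACK_SET_TOKEN value added by this patch; the fallback defines exist only in case installed headers predate the series.]

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from this patch; defined here in case installed headers lack them. */
#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack 451
#endif
#ifndef SHADOW_STACK_SET_TOKEN
#define SHADOW_STACK_SET_TOKEN (1ULL << 0)
#endif

int main(void)
{
	unsigned long size = 4 * 4096;	/* four pages of shadow stack */

	/* addr = 0 lets the kernel pick a placement (above 4G on x86). */
	long shstk = syscall(__NR_map_shadow_stack, 0UL, size,
			     SHADOW_STACK_SET_TOKEN);
	if (shstk == -1) {
		perror("map_shadow_stack");
		return 1;
	}

	/* The restore token was written by the kernel; RSTORSSP can pivot here. */
	printf("shadow stack at %#lx\n", (unsigned long)shstk);
	return 0;
}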
Comments
On Sat, Feb 18, 2023 at 01:14:25PM -0800, Rick Edgecombe wrote:
> So rather than repurpose two existing syscalls (mmap, madvise) that don't quite fit, just implement a new map_shadow_stack syscall to allow userspace to map and set up new shadow stacks in one step. While ucontext is the primary motivator, userspace may have other unforeseen reasons to set up its own shadow stacks using the WRSS instruction. Toward this end, provide a flag so that stacks can optionally be set up securely for the common case of ucontext without enabling WRSS, or so that the kernel can set up the shadow stack in some new way in the future.

Was the following ever attempted?

void *shstk = mmap(0, size, PROT_SHADOWSTACK, ...);

- Limit the PROT_SHADOWSTACK protection flag to mmap only (and thus mprotect can't convert memory from shadow stack to non-shadow-stack type or vice versa).
- Limit the PROT_SHADOWSTACK protection flag to anonymous memory only.
- Have the top-level mmap handler put a token at the base using WRUSS if prot == PROT_SHADOWSTACK.

You essentially would get shadow stack manufacturing with an existing (single) syscall. Acting a bit selfish here: this would also allow other architectures to re-use the interface and do their own implementation of mapping and placing the token at the base.
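[Editor's note: to make the proposal concrete, a sketch of the alternative being suggested follows. PROT_SHADOWSTACK is hypothetical and is not part of this series or any kernel uapi; its value here is invented purely for illustration.]

#include <sys/mman.h>

/* Hypothetical flag for the alternative discussed above -- not part of
 * this series or any kernel uapi. The value is made up for the sketch. */
#define PROT_SHADOWSTACK 0x40

/*
 * Under this proposal, mmap() itself would recognize the flag, restrict
 * it to anonymous private mappings, and write the restore token at the
 * base (e.g. via WRUSS on x86) before returning.
 */
static void *alloc_shadow_stack(size_t size)
{
	return mmap(NULL, size, PROT_SHADOWSTACK,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}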
On Wed, 2023-02-22 at 16:03 -0800, Deepak Gupta wrote:
> Was the following ever attempted?
>
> void *shstk = mmap(0, size, PROT_SHADOWSTACK, ...);
>
> You essentially would get shadow stack manufacturing with an existing (single) syscall. Acting a bit selfish here: this would also allow other architectures to re-use the interface and do their own implementation of mapping and placing the token at the base.

Yes, I looked at it. You end up with a pile of checks and hooks added to mmap() and various other places, as you outline. We also now have the MAP_ABOVE4G limitation for x86 shadow stack that would need checking for too. It's not exactly a clean fit. Then, callers would have to pass special x86 flags in anyway.

It doesn't seem like the complexity of the checks is worth saving the tiny syscall. Is there some reason why riscv can't use the same syscall stub? It doesn't need to live forever in x86 code. Not sure what the savings of the mmap+checks approach are for riscv either...

I did wonder if there could be some sort of more general syscall for mapping and provisioning special security-type memory. But we probably need a few more non-shadow-stack examples to get an idea of what that would look like.
On Wed, Feb 22, 2023 at 5:11 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> Yes, I looked at it. You end up with a pile of checks and hooks added to mmap() and various other places, as you outline. We also now have the MAP_ABOVE4G limitation for x86 shadow stack that would need checking for too. It's not exactly a clean fit. Then, callers would have to pass special x86 flags in anyway.

riscv has mechanisms by which a 32-bit app can run on a 64-bit kernel. So technically, if there were 32-bit and 64-bit code in an address space, MAP_ABOVE4G could be useful, although I am not sure (or aware) whether there is such a requirement from app developers yet (to guarantee address mappings above 4G). But I see this as orthogonal to memory protection flags.

> It doesn't seem like the complexity of the checks is worth saving the tiny syscall. Is there some reason why riscv can't use the same syscall stub? It doesn't need to live forever in x86 code. Not sure what the savings of the mmap+checks approach are for riscv either...

I don't see a lot of extra complexity here. If mprotect and friends don't know about PROT_SHADOWSTACK, they'll just fail by default (which is desired). It's only mmap that needs to be enlightened, and it can just pass VMA_SHADOW_STACK to do_mmap() if the input is PROT_SHADOWSTACK.

Adding a syscall just for mapping shadow stacks is weird when it can be solved with existing system calls. As you say in your response below, it would be good to have a syscall which serves a larger purpose (e.g. provisioning special security-type memory).

arm64's memory tagging is one such example. It is not exactly security-type memory (though the eventual application of the feature is security); it adds extra meaning to virtual addresses (i.e. an address has tags). arm64 went about it using a protection flag, PROT_MTE, instead of a special system call.

That being said, since this patch has gone through multiple revisions and I am new to the party: if others don't have issues with this special system call, I think it's fine. In the case of riscv, I can choose to use this mechanism or go via arm's route and define an arch-specific PROT_SHADOWSTACK.

> I did wonder if there could be some sort of more general syscall for mapping and provisioning special security-type memory. But we probably need a few more non-shadow-stack examples to get an idea of what that would look like.

As I mentioned, memory tagging (and thus PROT_MTE) is already such a use case, which uses mmap/mprotect protection flags to designate special meaning to a virtual address.
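[Editor's note: for reference, a sketch of the arm64 pattern being cited. PROT_MTE is real arm64 uapi; only the fallback value below is an assumption, for toolchains whose headers predate MTE support.]

#include <sys/mman.h>

/* PROT_MTE is arm64 uapi; the fallback value is an assumption for
 * headers that predate MTE support. */
#ifndef PROT_MTE
#define PROT_MTE 0x20
#endif

/* Tag-checked anonymous region: tagging is granted via a prot flag at
 * mmap() time, rather than via a dedicated syscall. */
static void *alloc_tagged(size_t len)
{
	return mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_MTE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}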
On Thu, 2023-02-23 at 13:20 -0800, Deepak Gupta wrote:
> I don't see a lot of extra complexity here. If mprotect and friends don't know about PROT_SHADOWSTACK, they'll just fail by default (which is desired).

To me, the options look like: cram it into an existing syscall, or create one that does exactly what is needed. To replace the new syscall with mmap(), with the existing MM implementation, I think you would need to add to mmap:
1. Limit PROT_SHADOW_STACK to anonymous, private memory.
2. Limit PROT_SHADOW_STACK to MAP_ABOVE4G (or create a MAP_SHADOW_STACK that does both). MAP_ABOVE4G protects from using shadow stack in 32-bit mode, after some ABI issues were raised, so it is supposed to be enforced.
3. Add additional logic for MAP_ABOVE4G to work with MAP_FIXED, as is required by CRIU.
4. Add another MAP_ flag to specify whether to write the token (AFAIK the first flag that would do something like that, so then likely have debates about whether it fits into the flags). Sort out the behavior of trying to write the token in a non-PROT_SHADOW_STACK mmap call.
5. Add an arch breakout for mmap to call into the token writing.

I think you are right that no mprotect() changes would be needed with the current shadow stack MM implementation (it wasn't the case for the original PROT_SHADOW_STACK). But I'm still not sure it is simpler than the syscall.

I do wonder a little bit about trying to remove some of these limitations of the existing shadow stack MM implementation, which could make an mmap-based interface a little better fit, like if someday shadow stack memory added support for all the options that mmap supports. But I'm not sure if that would just result in more complexity in other places (MM, signals) that would barely get used. For instance, I'm not sure how shadow-stack-permissioned mmap()ed files should work. You would have to require writable files to map them as shadow stack, but then that is not as locked down as what we have today with anonymous memory, and a "shadow stack" file permission would seem a bit overboard. Then there are probably some more dirty-bit issues with mmaped files; I'm not fully sure. That one was definitely an issue when PROT_SHADOW_STACK was dropped, but David Hildenbrand has now removed at least some of the issues it hit. So the optimum solution might depend on whether we add more shadow stack MM support later. But it is always possible to add mmap support later too.

> arm64's memory tagging is one such example. It is not exactly security-type memory (though the eventual application of the feature is security); it adds extra meaning to virtual addresses (i.e. an address has tags). arm64 went about it using a protection flag, PROT_MTE, instead of a special system call.

It looks like that memory can be written with a special instruction? And so it doesn't need to be provisioned by the kernel, as is the case here.

> That being said, since this patch has gone through multiple revisions and I am new to the party: if others don't have issues with this special system call, I think it's fine. In the case of riscv, I can choose to use this mechanism or go via arm's route and define an arch-specific PROT_SHADOWSTACK.

Ok, sounds good. The advice I got from maintainers after the many attempts to cram it into the existing interfaces was: don't be afraid to add a syscall. And sure enough, when MAP_ABOVE4G came along it continued to make things simpler.
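[Editor's note: since the addr argument came up (added in v2 for CRIU), here is a brief sketch of that mode. The wrapper name is hypothetical; the constraint it illustrates comes from the patch: a non-zero addr must be above 4G or the syscall returns -EINVAL, and a non-zero addr makes the kernel use MAP_FIXED_NOREPLACE internally.]

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack 451	/* x86-64 number from this series */
#endif
#ifndef SHADOW_STACK_SET_TOKEN
#define SHADOW_STACK_SET_TOKEN (1ULL << 0)
#endif

/*
 * Restore-style usage: request an exact placement, as CRIU would when
 * re-creating a checkpointed address space. addr must be above 4G, or
 * the syscall returns -EINVAL; an existing mapping at addr makes the
 * underlying MAP_FIXED_NOREPLACE fail rather than clobber it.
 */
static int restore_shstk_at(unsigned long addr, unsigned long size)
{
	long ret = syscall(__NR_map_shadow_stack, addr, size,
			   SHADOW_STACK_SET_TOKEN);
	if (ret == -1) {
		perror("map_shadow_stack(fixed)");
		return -1;
	}
	return 0;
}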
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..f65c671ce3b1 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	64	map_shadow_stack	sys_map_shadow_stack
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 5a0256e73f1e..8148bdddbd2c 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -13,6 +13,9 @@
 		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
 #endif
 
+/* Flags for map_shadow_stack(2) */
+#define SHADOW_STACK_SET_TOKEN	(1ULL << 0)	/* Set up a restore token in the shadow stack */
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 40f0a55762a9..0a3decab70ee 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -17,6 +17,7 @@
 #include <linux/compat.h>
 #include <linux/sizes.h>
 #include <linux/user.h>
+#include <linux/syscalls.h>
 #include <asm/msr.h>
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/types.h>
@@ -71,19 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
 	return 0;
 }
 
-static unsigned long alloc_shstk(unsigned long size)
+static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
+				 unsigned long token_offset, bool set_res_tok)
 {
 	int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
 	struct mm_struct *mm = current->mm;
-	unsigned long addr, unused;
+	unsigned long mapped_addr, unused;
 
-	mmap_write_lock(mm);
-	addr = do_mmap(NULL, 0, size, PROT_READ, flags,
-		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+	if (addr)
+		flags |= MAP_FIXED_NOREPLACE;
 
+	mmap_write_lock(mm);
+	mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+			      VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
 	mmap_write_unlock(mm);
 
-	return addr;
+	if (!set_res_tok || IS_ERR_VALUE(mapped_addr))
+		goto out;
+
+	if (create_rstor_token(mapped_addr + token_offset, NULL)) {
+		vm_munmap(mapped_addr, size);
+		return -EINVAL;
+	}
+
+out:
+	return mapped_addr;
 }
 
 static unsigned long adjust_shstk_size(unsigned long size)
@@ -134,7 +147,7 @@ static int shstk_setup(void)
 		return -EOPNOTSUPP;
 
 	size = adjust_shstk_size(0);
-	addr = alloc_shstk(size);
+	addr = alloc_shstk(0, size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return PTR_ERR((void *)addr);
 
@@ -178,7 +191,7 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
 		return 0;
 
 	size = adjust_shstk_size(stack_size);
-	addr = alloc_shstk(size);
+	addr = alloc_shstk(0, size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return PTR_ERR((void *)addr);
 
@@ -371,6 +384,36 @@ static int shstk_disable(void)
 	return 0;
 }
 
+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
+{
+	bool set_tok = flags & SHADOW_STACK_SET_TOKEN;
+	unsigned long aligned_size;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return -EOPNOTSUPP;
+
+	if (flags & ~SHADOW_STACK_SET_TOKEN)
+		return -EINVAL;
+
+	/* If there isn't space for a token */
+	if (set_tok && size < 8)
+		return -EINVAL;
+
+	if (addr && addr <= 0xFFFFFFFF)
+		return -EINVAL;
+
+	/*
+	 * An overflow would result in attempting to write the restore token
+	 * to the wrong location. Not catastrophic, but just return the right
+	 * error code and block it.
+	 */
+	aligned_size = PAGE_ALIGN(size);
+	if (aligned_size < size)
+		return -EOVERFLOW;
+
+	return alloc_shstk(addr, aligned_size, size, set_tok);
+}
+
 long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 {
 	if (option == ARCH_SHSTK_LOCK) {
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 33a0ee3bcb2e..392dc11e3556 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1058,6 +1058,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
+asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..b12940ec5926 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -887,7 +887,7 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..cb9aebd34646 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -381,6 +381,7 @@ COND_SYSCALL(vm86old);
 COND_SYSCALL(modify_ldt);
 COND_SYSCALL(vm86);
 COND_SYSCALL(kexec_file_load);
+COND_SYSCALL(map_shadow_stack);
 
 /* s390 */
 COND_SYSCALL(s390_pci_mmio_read);
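[Editor's note: finally, a small sketch exercising the error paths the syscall defines above: unknown flag bits, a token request with size < 8, and a fixed address at or below 4G all return -EINVAL. This assumes a kernel and CPU with shadow stack support enabled; otherwise the feature check returns -EOPNOTSUPP before any of these checks run.]

#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack 451
#endif
#ifndef SHADOW_STACK_SET_TOKEN
#define SHADOW_STACK_SET_TOKEN (1ULL << 0)
#endif

static long map_shstk(unsigned long addr, unsigned long size, unsigned int flags)
{
	return syscall(__NR_map_shadow_stack, addr, size, flags);
}

int main(void)
{
	/* Unknown flag bits are rejected. */
	assert(map_shstk(0, 4096, ~0U) == -1 && errno == EINVAL);

	/* A restore token needs at least 8 bytes of stack. */
	assert(map_shstk(0, 4, SHADOW_STACK_SET_TOKEN) == -1 && errno == EINVAL);

	/* Fixed addresses at or below 4G are rejected. */
	assert(map_shstk(0x1000, 4096, SHADOW_STACK_SET_TOKEN) == -1 &&
	       errno == EINVAL);

	return 0;
}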