Message ID | 20230227222957.24501-34-rick.p.edgecombe@intel.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:5915:0:0:0:0:0 with SMTP id v21csp2682959wrd; Mon, 27 Feb 2023 14:35:18 -0800 (PST) X-Google-Smtp-Source: AK7set8/34rU41+iPlqA4DMIeRhNS8nnDrpoCQQdpVyuIa4VsdxHW2QRHt79OndnqOEGW3FUVyJN X-Received: by 2002:a17:906:859:b0:872:84dd:8903 with SMTP id f25-20020a170906085900b0087284dd8903mr262591ejd.59.1677537318628; Mon, 27 Feb 2023 14:35:18 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1677537318; cv=none; d=google.com; s=arc-20160816; b=IW5gBKB2/mrFA4rjS+6aUNbnTVwtFu0ymNPc/ZSN4xSMIjzOmKHRIcjylhXyP67zXr nDWLkSIg6X5U5HCtsyXhoYqJyl3Bj/J/sXs8irzeUM9WAlsVNR2hc9sEgf4byZLnFaFG P3shU1M0hYuSrMVawxNStray43p0LcGpUFAbzbx2LJ6W55/fg5WhauNKYtF1uPmZv0yc IwHWCNN7VKz7jogysHlCKVFrRZkHol6crjv+71S83KSJSSqB/wq/H6HUs+fBqYNBVZ8u KnWcayHFg4/Iii9xohq1Qj2FqtXC3wT4G34zBoK1/VTn41IWybc2vyXvP2mkMneIkfK4 atuA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:references:in-reply-to:message-id:date:subject :cc:to:from:dkim-signature; bh=2FtOU/X+ROob91Drvw/u5zl2nlpfMJbMpmIA9iRxoYU=; b=LoOoDCwdwMcKCGR+fvc/9cIwtAKIZdicsfwwppGacpX/rEzEiYTMeCxK/t5SqKG96g bEHaF8fnOlNWQURQkVaDyI90breNA8LvdKTdW2mKoN6lmyuMBN1GBd8t2g+87Ktylmqi XvSdMs4I8pUPvFW8K1q92Cc86rZ15j+KYUBD66u51q0IY5IwPdKtyj9szHsNWm262rJz hJvdkDXmNBYlBOERS8ikGH5ysEzTIk0Qf0waVUt7JG+x0vj6PFrJr7z/cEMaXfsaiMHK Gd4UvFTw21+fPo/3K4fqADQmWC2tswRQASA/Cgrhj1vN/y53LzSAuYcdq6e/MhEzuC/i ddNg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=A3y7fbqt; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id gn1-20020a1709070d0100b008d8c9f4fd28si953473ejc.808.2023.02.27.14.34.53; Mon, 27 Feb 2023 14:35:18 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=A3y7fbqt; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230506AbjB0Wel (ORCPT <rfc822;wenzhi022@gmail.com> + 99 others); Mon, 27 Feb 2023 17:34:41 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38322 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230425AbjB0WdA (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 27 Feb 2023 17:33:00 -0500 Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C32332A16E; Mon, 27 Feb 2023 14:32:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1677537126; x=1709073126; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=0sN0qiJhpy5r3MZ28FxXbF8Pdb6RhgABRtgYTnCuxVM=; b=A3y7fbqtgkT2L8f5gZieYaTsaFrXrkb4c1qdhvbDaJzKznKO8x7Li5iL AY5JjgbxE19TTYIjW+xdFei3aGNiRYLvaVVf+8aNBTYywFP/9gl81tz/I u4N9tqxGSTWXVwvYNJaTlVG/QeIt1hnNG/WsXxjdxJZnbQt6SBgstS0Dv D+7ruia8Cds+QsTB14zqPeXFsk+CJuaucx3AJOS08ok6zU7rEILQSa2FY Mg9iJt5QFQ7yfWKaNcE7voXPVEkqjJuoRmnQg3aCfRmL7622AjY/dlvpx hiNeQ/tKta+jMTx3Ni8Nmy2LkMzy6NU9NkFOnbrMqLatUrlQewwnN278+ g==; X-IronPort-AV: E=McAfee;i="6500,9779,10634"; a="313657802" X-IronPort-AV: E=Sophos;i="5.98,220,1673942400"; d="scan'208";a="313657802" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 27 Feb 2023 14:31:32 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10634"; a="848024769" X-IronPort-AV: E=Sophos;i="5.98,220,1673942400"; d="scan'208";a="848024769" Received: from leonqu-mobl1.amr.corp.intel.com (HELO rpedgeco-desk.amr.corp.intel.com) ([10.209.72.19]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 27 Feb 2023 14:31:32 -0800 From: Rick Edgecombe <rick.p.edgecombe@intel.com> To: x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, Arnd Bergmann <arnd@arndb.de>, Andy Lutomirski <luto@kernel.org>, Balbir Singh <bsingharora@gmail.com>, Borislav Petkov <bp@alien8.de>, Cyrill Gorcunov <gorcunov@gmail.com>, Dave Hansen <dave.hansen@linux.intel.com>, Eugene Syromiatnikov <esyr@redhat.com>, Florian Weimer <fweimer@redhat.com>, "H . J . Lu" <hjl.tools@gmail.com>, Jann Horn <jannh@google.com>, Jonathan Corbet <corbet@lwn.net>, Kees Cook <keescook@chromium.org>, Mike Kravetz <mike.kravetz@oracle.com>, Nadav Amit <nadav.amit@gmail.com>, Oleg Nesterov <oleg@redhat.com>, Pavel Machek <pavel@ucw.cz>, Peter Zijlstra <peterz@infradead.org>, Randy Dunlap <rdunlap@infradead.org>, Weijiang Yang <weijiang.yang@intel.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, John Allen <john.allen@amd.com>, kcc@google.com, eranian@google.com, rppt@kernel.org, jamorris@linux.microsoft.com, dethoma@microsoft.com, akpm@linux-foundation.org, Andrew.Cooper3@citrix.com, christina.schimpe@intel.com, david@redhat.com, debug@rivosinc.com Cc: rick.p.edgecombe@intel.com Subject: [PATCH v7 33/41] x86/shstk: Introduce map_shadow_stack syscall Date: Mon, 27 Feb 2023 14:29:49 -0800 Message-Id: <20230227222957.24501-34-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20230227222957.24501-1-rick.p.edgecombe@intel.com> References: <20230227222957.24501-1-rick.p.edgecombe@intel.com> X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED, SPF_HELO_PASS,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1759025371535216656?= X-GMAIL-MSGID: =?utf-8?q?1759025371535216656?= |
Series |
Shadow stacks for userspace
|
|
Commit Message
Edgecombe, Rick P
Feb. 27, 2023, 10:29 p.m. UTC
When operating with shadow stacks enabled, the kernel will automatically allocate shadow stacks for new threads, however in some cases userspace will need additional shadow stacks. The main example of this is the ucontext family of functions, which require userspace allocating and pivoting to userspace managed stacks. Unlike most other user memory permissions, shadow stacks need to be provisioned with special data in order to be useful. They need to be setup with a restore token so that userspace can pivot to them via the RSTORSSP instruction. But, the security design of shadow stack's is that they should not be written to except in limited circumstances. This presents a problem for userspace, as to how userspace can provision this special data, without allowing for the shadow stack to be generally writable. Previously, a new PROT_SHADOW_STACK was attempted, which could be mprotect()ed from RW permissions after the data was provisioned. This was found to not be secure enough, as other thread's could write to the shadow stack during the writable window. The kernel can use a special instruction, WRUSS, to write directly to userspace shadow stacks. So the solution can be that memory can be mapped as shadow stack permissions from the beginning (never generally writable in userspace), and the kernel itself can write the restore token. First, a new madvise() flag was explored, which could operate on the PROT_SHADOW_STACK memory. This had a couple downsides: 1. Extra checks were needed in mprotect() to prevent writable memory from ever becoming PROT_SHADOW_STACK. 2. Extra checks/vma state were needed in the new madvise() to prevent restore tokens being written into the middle of pre-used shadow stacks. It is ideal to prevent restore tokens being added at arbitrary locations, so the check was to make sure the shadow stack had never been written to. 3. It stood out from the rest of the madvise flags, as more of direct action than a hint at future desired behavior. So rather than repurpose two existing syscalls (mmap, madvise) that don't quite fit, just implement a new map_shadow_stack syscall to allow userspace to map and setup new shadow stacks in one step. While ucontext is the primary motivator, userspace may have other unforeseen reasons to setup it's own shadow stacks using the WRSS instruction. Towards this provide a flag so that stacks can be optionally setup securely for the common case of ucontext without enabling WRSS. Or potentially have the kernel set up the shadow stack in some new way. The following example demonstrates how to create a new shadow stack with map_shadow_stack: void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN); Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> --- v5: - Fix addr/mapped_addr (Kees) - Switch to EOPNOTSUPP (Kees suggested ENOTSUPP, but checkpatch suggests this) - Return error for addresses below 4G v3: - Change syscall common -> 64 (Kees) - Use bit shift notation instead of 0x1 for uapi header (Kees) - Call do_mmap() with MAP_FIXED_NOREPLACE (Kees) - Block unsupported flags (Kees) - Require size >= 8 to set token (Kees) v2: - Change syscall to take address like mmap() for CRIU's usage v1: - New patch (replaces PROT_SHADOW_STACK). --- arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/x86/include/uapi/asm/mman.h | 3 ++ arch/x86/kernel/shstk.c | 59 ++++++++++++++++++++++---- include/linux/syscalls.h | 1 + include/uapi/asm-generic/unistd.h | 2 +- kernel/sys_ni.c | 1 + 6 files changed, 58 insertions(+), 9 deletions(-)
Comments
The 02/27/2023 14:29, Rick Edgecombe wrote: > Previously, a new PROT_SHADOW_STACK was attempted, ... > So rather than repurpose two existing syscalls (mmap, madvise) that don't > quite fit, just implement a new map_shadow_stack syscall to allow > userspace to map and setup new shadow stacks in one step. While ucontext > is the primary motivator, userspace may have other unforeseen reasons to > setup it's own shadow stacks using the WRSS instruction. Towards this > provide a flag so that stacks can be optionally setup securely for the > common case of ucontext without enabling WRSS. Or potentially have the > kernel set up the shadow stack in some new way. ... > The following example demonstrates how to create a new shadow stack with > map_shadow_stack: > void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN); i think mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0); could do the same with less disruption to users (new syscalls are harder to deal with than new flags). it would do the guard page and initial token setup too (there is no flag for it but could be squeezed in). most of the mmap features need not be available (EINVAL) when MAP_SHADOW_STACK is specified. the main drawback is running out of mmap flags so extension is limited. (but the new syscall has limitations too).
On Thu, 2023-03-02 at 17:22 +0000, Szabolcs Nagy wrote: > The 02/27/2023 14:29, Rick Edgecombe wrote: > > Previously, a new PROT_SHADOW_STACK was attempted, > > ... > > So rather than repurpose two existing syscalls (mmap, madvise) that > > don't > > quite fit, just implement a new map_shadow_stack syscall to allow > > userspace to map and setup new shadow stacks in one step. While > > ucontext > > is the primary motivator, userspace may have other unforeseen > > reasons to > > setup it's own shadow stacks using the WRSS instruction. Towards > > this > > provide a flag so that stacks can be optionally setup securely for > > the > > common case of ucontext without enabling WRSS. Or potentially have > > the > > kernel set up the shadow stack in some new way. > > ... > > The following example demonstrates how to create a new shadow stack > > with > > map_shadow_stack: > > void *shstk = map_shadow_stack(addr, stack_size, > > SHADOW_STACK_SET_TOKEN); > > i think > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0); > > could do the same with less disruption to users (new syscalls > are harder to deal with than new flags). it would do the > guard page and initial token setup too (there is no flag for > it but could be squeezed in). > > most of the mmap features need not be available (EINVAL) when > MAP_SHADOW_STACK is specified. > > the main drawback is running out of mmap flags so extension > is limited. (but the new syscall has limitations too). Deepak Gupta (working on riscv shadow stack) asked something similar. Can you see if this thread answers your questions? https://lore.kernel.org/lkml/20230223000340.GB945966@debug.ba.rivosinc.com/
On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote: >The 02/27/2023 14:29, Rick Edgecombe wrote: >> Previously, a new PROT_SHADOW_STACK was attempted, >... >> So rather than repurpose two existing syscalls (mmap, madvise) that don't >> quite fit, just implement a new map_shadow_stack syscall to allow >> userspace to map and setup new shadow stacks in one step. While ucontext >> is the primary motivator, userspace may have other unforeseen reasons to >> setup it's own shadow stacks using the WRSS instruction. Towards this >> provide a flag so that stacks can be optionally setup securely for the >> common case of ucontext without enabling WRSS. Or potentially have the >> kernel set up the shadow stack in some new way. >... >> The following example demonstrates how to create a new shadow stack with >> map_shadow_stack: >> void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN); > >i think > >mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0); > >could do the same with less disruption to users (new syscalls >are harder to deal with than new flags). it would do the >guard page and initial token setup too (there is no flag for >it but could be squeezed in). Discussion on this topic in v6 https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/ Again I know earlier CET patches had protection flag and somehow due to pushback on mailing list, it was adopted to go for special syscall because no one else had shadow stack. Seeing a response from Szabolcs, I am assuming arm4 would also want to follow using mmap to manufacture shadow stack. For reference RFC patches for risc-v shadow stack, use a new protection flag = PROT_SHADOWSTACK. https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/ I know earlier discussion had been that we let this go and do a re-factor later as other arch support trickle in. But as I thought more on this and I think it may just be messy from user mode point of view as well to have cognition of two different ways of creating shadow stack. One would be special syscall (in current libc) and another `mmap` (whenever future re-factor happens) If it's not too late, it would be more wise to take `mmap` approach rather than special `syscall` approach. > >most of the mmap features need not be available (EINVAL) when >MAP_SHADOW_STACK is specified. > >the main drawback is running out of mmap flags so extension >is limited. (but the new syscall has limitations too).
On Thu, 2023-03-09 at 10:55 -0800, Deepak Gupta wrote: > On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote: > > The 02/27/2023 14:29, Rick Edgecombe wrote: > > > Previously, a new PROT_SHADOW_STACK was attempted, > > > > ... > > > So rather than repurpose two existing syscalls (mmap, madvise) > > > that don't > > > quite fit, just implement a new map_shadow_stack syscall to allow > > > userspace to map and setup new shadow stacks in one step. While > > > ucontext > > > is the primary motivator, userspace may have other unforeseen > > > reasons to > > > setup it's own shadow stacks using the WRSS instruction. Towards > > > this > > > provide a flag so that stacks can be optionally setup securely > > > for the > > > common case of ucontext without enabling WRSS. Or potentially > > > have the > > > kernel set up the shadow stack in some new way. > > > > ... > > > The following example demonstrates how to create a new shadow > > > stack with > > > map_shadow_stack: > > > void *shstk = map_shadow_stack(addr, stack_size, > > > SHADOW_STACK_SET_TOKEN); > > > > i think > > > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0); > > > > could do the same with less disruption to users (new syscalls > > are harder to deal with than new flags). it would do the > > guard page and initial token setup too (there is no flag for > > it but could be squeezed in). > > Discussion on this topic in v6 > https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/ > > Again I know earlier CET patches had protection flag and somehow due > to pushback > on mailing list, > it was adopted to go for special syscall because no one else > had shadow stack. > > Seeing a response from Szabolcs, I am assuming arm4 would also want > to follow > using mmap to manufacture shadow stack. For reference RFC patches for > risc-v shadow stack, > use a new protection flag = PROT_SHADOWSTACK. > https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/ > > I know earlier discussion had been that we let this go and do a re- > factor later as other > arch support trickle in. But as I thought more on this and I think it > may just be > messy from user mode point of view as well to have cognition of two > different ways of > creating shadow stack. One would be special syscall (in current libc) > and another `mmap` > (whenever future re-factor happens) > > If it's not too late, it would be more wise to take `mmap` > approach rather than special `syscall` approach. There is sort of two things intermixed here when we talk about a PROT_SHADOW_STACK. One is: what is the interface for specifying how the shadow stack should be provisioned with data? Right now there are two ways supported, all zero or with an X86 shadow stack restore token at the end. Then there was already some conversation about a third type. In which case the question would be is using mmap MAP_ flags the right place for this? How many types of initialization will be needed in the end and what is the overlap between the architectures? The other thing is: should shadow stack memory creation be tightly controlled? For example in x86 we limit this to anonymous memory, etc. Some reasons for this are x86 specific, but some are not. So if we disallow most of the options why allow the interface to take them? And then you are in the position of carefully maintaining a list of not- allowed options instead letting a list of allowed options sit there. The only benefit I've heard is that it saves creating a new syscall, but it also saves several MAP_ flags. That, and that the RFC for riscv did a PROT_SHADOW_STACK to start. So, yes, two people asked the same question, but I'm still not seeing any benefits. Can you give the pros and cons please? BTW, in glibc map_shadow_stack is called from arch code. So I think userspace wise, for this to affect other architectures there would need to be some code that could do things generically, with somehow the shadow stack pivot abstracted but the shadow stack allocation not.
On Thu, Mar 09, 2023 at 07:39:41PM +0000, Edgecombe, Rick P wrote: >On Thu, 2023-03-09 at 10:55 -0800, Deepak Gupta wrote: >> On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote: >> > The 02/27/2023 14:29, Rick Edgecombe wrote: >> > > Previously, a new PROT_SHADOW_STACK was attempted, >> > >> > ... >> > > So rather than repurpose two existing syscalls (mmap, madvise) >> > > that don't >> > > quite fit, just implement a new map_shadow_stack syscall to allow >> > > userspace to map and setup new shadow stacks in one step. While >> > > ucontext >> > > is the primary motivator, userspace may have other unforeseen >> > > reasons to >> > > setup it's own shadow stacks using the WRSS instruction. Towards >> > > this >> > > provide a flag so that stacks can be optionally setup securely >> > > for the >> > > common case of ucontext without enabling WRSS. Or potentially >> > > have the >> > > kernel set up the shadow stack in some new way. >> > >> > ... >> > > The following example demonstrates how to create a new shadow >> > > stack with >> > > map_shadow_stack: >> > > void *shstk = map_shadow_stack(addr, stack_size, >> > > SHADOW_STACK_SET_TOKEN); >> > >> > i think >> > >> > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0); >> > >> > could do the same with less disruption to users (new syscalls >> > are harder to deal with than new flags). it would do the >> > guard page and initial token setup too (there is no flag for >> > it but could be squeezed in). >> >> Discussion on this topic in v6 >> >https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/ >> >> Again I know earlier CET patches had protection flag and somehow due >> to pushback >> on mailing list, >> it was adopted to go for special syscall because no one else >> had shadow stack. >> >> Seeing a response from Szabolcs, I am assuming arm4 would also want >> to follow >> using mmap to manufacture shadow stack. For reference RFC patches for >> risc-v shadow stack, >> use a new protection flag = PROT_SHADOWSTACK. >> >https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/ >> >> I know earlier discussion had been that we let this go and do a re- >> factor later as other >> arch support trickle in. But as I thought more on this and I think it >> may just be >> messy from user mode point of view as well to have cognition of two >> different ways of >> creating shadow stack. One would be special syscall (in current libc) >> and another `mmap` >> (whenever future re-factor happens) >> >> If it's not too late, it would be more wise to take `mmap` >> approach rather than special `syscall` approach. > >There is sort of two things intermixed here when we talk about a >PROT_SHADOW_STACK. > >One is: what is the interface for specifying how the shadow stack >should be provisioned with data? Right now there are two ways >supported, all zero or with an X86 shadow stack restore token at the >end. Then there was already some conversation about a third type. In >which case the question would be is using mmap MAP_ flags the right >place for this? How many types of initialization will be needed in the >end and what is the overlap between the architectures? First of all, arches can choose to have token at the bottom or not. Token serve following purposes - It allows one to put desired value in shadow stack pointer in safe/secure manner. Note: x86 doesn't provide any opcode encoding to value in SSP register. So having a token is kind of a necessity because x86 doesn't easily allow writing shadow stack. - A token at the bottom acts marker / barrier and can be useful in debugging - If (and a big *if*) we ever reach a point in future where return address is only pushed on shadow stack (x86 should have motivation to do this because less uops on call/ret), a token at the bottom (bottom means lower address) is ensuring sure shot way of getting a fault when exhausted. Current RISCV zisslpcfi proposal doesn't define CPU based tokens because it's RISC. It allows mechanisms using which software can define formatting of token for itself. Not sure of what ARM is doing. Now coming to the point of all zero v/s shadow stack token. Why not always have token at the bottom? In case of x86, Why need for two ways and why not always have a token at the bottom. The way x86 is going, user mode is responsible for establishing shadow stack and thus whenever shadow stack is created then if x86 kernel implementation always place a token at the base/bottom. Now user mode can do following:-- - If it has access to WRSS, it can sure go ahead and create a token of its choosing and overwrite kernel created token. and then do RSTORSSP on it's own created token. - If it doesn't have access to WRSS (and dont need to create its own token), it can do RSTORSSP on this. As soon as it does, no other thread in process can restore to it. On `fork`, you get the same un-restorable token. So why not always have a token at the bottom. This is my plan for riscv implementation as well (to have a token at the bottom) > >The other thing is: should shadow stack memory creation be tightly >controlled? For example in x86 we limit this to anonymous memory, etc. >Some reasons for this are x86 specific, but some are not. So if we >disallow most of the options why allow the interface to take them? And >then you are in the position of carefully maintaining a list of not- >allowed options instead letting a list of allowed options sit there. I am new to linux kernel and thus may be not able to follow the argument of limiting to anonymous memory. Why is limiting it to anonymous memory a problem. IIRC, ARM's PROT_MTE is applicable only to anonymous memory. I can probably find few more examples. Eventually syscall will also go ahead and use memory management code to perform mapping. So I didn't understand the reasoning here. The way syscall can limit it to anonymous memory, why mmap can't do the same if it sees PROT_SHADOWSTACK. > >The only benefit I've heard is that it saves creating a new syscall, >but it also saves several MAP_ flags. That, and that the RFC for riscv >did a PROT_SHADOW_STACK to start. So, yes, two people asked the same >question, but I'm still not seeing any benefits. Can you give the pros >and cons please? Again the way syscall will limit it to anonymous memory, Why mmap can't do same? There is precedence for it (like PROT_MTE is applicable only to anonymous memory) So if it can be done, then why introduce a new syscall? > >BTW, in glibc map_shadow_stack is called from arch code. So I think >userspace wise, for this to affect other architectures there would need >to be some code that could do things generically, with somehow the >shadow stack pivot abstracted but the shadow stack allocation not. Agreed, yes it can be done in a way where it won't put tax on other architectures. But what about fragmentation within x86. Will x86 always choose to use system call method map shadow stack. If future re-factor results in x86 also use `mmap` method. Isn't it a mess for x86 glibc to figure out what to do; whether to use system call or `mmap`?
On Thu, 2023-03-09 at 13:08 -0800, Deepak Gupta wrote: > On Thu, Mar 09, 2023 at 07:39:41PM +0000, Edgecombe, Rick P wrote: > > On Thu, 2023-03-09 at 10:55 -0800, Deepak Gupta wrote: > > > On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote: > > > > The 02/27/2023 14:29, Rick Edgecombe wrote: > > > > > Previously, a new PROT_SHADOW_STACK was attempted, > > > > > > > > ... > > > > > So rather than repurpose two existing syscalls (mmap, > > > > > madvise) > > > > > that don't > > > > > quite fit, just implement a new map_shadow_stack syscall to > > > > > allow > > > > > userspace to map and setup new shadow stacks in one step. > > > > > While > > > > > ucontext > > > > > is the primary motivator, userspace may have other unforeseen > > > > > reasons to > > > > > setup it's own shadow stacks using the WRSS instruction. > > > > > Towards > > > > > this > > > > > provide a flag so that stacks can be optionally setup > > > > > securely > > > > > for the > > > > > common case of ucontext without enabling WRSS. Or potentially > > > > > have the > > > > > kernel set up the shadow stack in some new way. > > > > > > > > ... > > > > > The following example demonstrates how to create a new shadow > > > > > stack with > > > > > map_shadow_stack: > > > > > void *shstk = map_shadow_stack(addr, stack_size, > > > > > SHADOW_STACK_SET_TOKEN); > > > > > > > > i think > > > > > > > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0); > > > > > > > > could do the same with less disruption to users (new syscalls > > > > are harder to deal with than new flags). it would do the > > > > guard page and initial token setup too (there is no flag for > > > > it but could be squeezed in). > > > > > > Discussion on this topic in v6 > > > > > > > https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/ > > > > > > Again I know earlier CET patches had protection flag and somehow > > > due > > > to pushback > > > on mailing list, > > > it was adopted to go for special syscall because no one else > > > had shadow stack. > > > > > > Seeing a response from Szabolcs, I am assuming arm4 would also > > > want > > > to follow > > > using mmap to manufacture shadow stack. For reference RFC patches > > > for > > > risc-v shadow stack, > > > use a new protection flag = PROT_SHADOWSTACK. > > > > > > > https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/ > > > > > > I know earlier discussion had been that we let this go and do a > > > re- > > > factor later as other > > > arch support trickle in. But as I thought more on this and I > > > think it > > > may just be > > > messy from user mode point of view as well to have cognition of > > > two > > > different ways of > > > creating shadow stack. One would be special syscall (in current > > > libc) > > > and another `mmap` > > > (whenever future re-factor happens) > > > > > > If it's not too late, it would be more wise to take `mmap` > > > approach rather than special `syscall` approach. > > > > There is sort of two things intermixed here when we talk about a > > PROT_SHADOW_STACK. > > > > One is: what is the interface for specifying how the shadow stack > > should be provisioned with data? Right now there are two ways > > supported, all zero or with an X86 shadow stack restore token at > > the > > end. Then there was already some conversation about a third type. > > In > > which case the question would be is using mmap MAP_ flags the right > > place for this? How many types of initialization will be needed in > > the > > end and what is the overlap between the architectures? > > First of all, arches can choose to have token at the bottom or not. > > Token serve following purposes > - It allows one to put desired value in shadow stack pointer in > safe/secure manner. > Note: x86 doesn't provide any opcode encoding to value in SSP > register. So having > a token is kind of a necessity because x86 doesn't easily allow > writing shadow stack. > > - A token at the bottom acts marker / barrier and can be useful in > debugging > > - If (and a big *if*) we ever reach a point in future where return > address is only pushed > on shadow stack (x86 should have motivation to do this because > less uops on call/ret), > a token at the bottom (bottom means lower address) is ensuring > sure shot way of getting > a fault when exhausted. > > Current RISCV zisslpcfi proposal doesn't define CPU based tokens > because it's RISC. > It allows mechanisms using which software can define formatting of > token for itself. > Not sure of what ARM is doing. Ok, so riscv doesn't need to have the kernel write the token, but x86 does. > > Now coming to the point of all zero v/s shadow stack token. > Why not always have token at the bottom? With WRSS you can setup the shadow stack however you want. So the user would then have to take care to erase the token if they didn't want it. Not the end of the world, but kind of clunky if there is no reason for it. > > In case of x86, Why need for two ways and why not always have a token > at the bottom. > The way x86 is going, user mode is responsible for establishing > shadow stack and thus > whenever shadow stack is created then if x86 kernel implementation > always place a token > at the base/bottom. There was also some discussion recently of adding a token AND an end of stack marker, as a potential solution for backtracing in ucontext stacks. In this case it could cause an ABI break to just start adding the end of stack marker where the token was, and so would require a new map_shadow_stack flag. > > Now user mode can do following:-- > - If it has access to WRSS, it can sure go ahead and create a token > of its choosing and > overwrite kernel created token. and then do RSTORSSP on it's own > created token. > > - If it doesn't have access to WRSS (and dont need to create its > own token), it can do > RSTORSSP on this. As soon as it does, no other thread in process > can restore to it. > On `fork`, you get the same un-restorable token. > > So why not always have a token at the bottom. > This is my plan for riscv implementation as well (to have a token at > the bottom) > > > > > The other thing is: should shadow stack memory creation be tightly > > controlled? For example in x86 we limit this to anonymous memory, > > etc. > > Some reasons for this are x86 specific, but some are not. So if we > > disallow most of the options why allow the interface to take them? > > And > > then you are in the position of carefully maintaining a list of > > not- > > allowed options instead letting a list of allowed options sit > > there. > > I am new to linux kernel and thus may be not able to follow the > argument of > limiting to anonymous memory. > > Why is limiting it to anonymous memory a problem. IIRC, ARM's > PROT_MTE is applicable > only to anonymous memory. I can probably find few more examples. Oh I see, they have a special arch VMA flag VM_MTE_ALLOWED that only gets set if all the rules are followed. Then PROT_MTE can only be set on that to set VM_MTE. That is kind of nice because certain other special situations can choose to support it. It does take another arch vma flag though. For x86 I guess I would need to figure out how to squeeze VM_SHADOW_STACK into other flags to have a free flag to use the same method. It also only supports mprotect() and shadow stack would only want to support mmap(). And you still have the initialization stuff to plumb through. Yea, I think the PROT_MTE is a good thing to consider, but it's not super obvious to me how similar the logic would be for shadow stack. The question I'm asking though is, not "can mmap code and rules be changed to enforce the required limitations?". I think it is yes. But the question is "why is that plumbing better than a new syscall?". I guess to get a better idea, the mmap solution would need to get POCed. I had half done this at one point, but abandoned the approach. For your question about why limit it, the special x86 case is the Dirty=1,Write=0 PTE bit combination for shadow stacks. So for shadow stack you could have some confusion about whether a PTE is actually dirty for writeback, etc. I wouldn't say it's known to be impossible to do MAP_SHARED, but it has not been fully analyzed enough to know what the changes would be. There were some solvable concrete issues that tipped the scale as well. It was also not expected to be a common usage, if at all. The non-x86, general reasons for it, are for a smaller benefit. It blocks a lot of ways shadow stack memory could be written to. Like say you have a memory mapped writable file, and you also map it shadow stack. So it has better security properties depending on what your threat model is. > > Eventually syscall will also go ahead and use memory management code > to > perform mapping. So I didn't understand the reasoning here. The way > syscall > can limit it to anonymous memory, why mmap can't do the same if it > sees > PROT_SHADOWSTACK. > > > > > The only benefit I've heard is that it saves creating a new > > syscall, > > but it also saves several MAP_ flags. That, and that the RFC for > > riscv > > did a PROT_SHADOW_STACK to start. So, yes, two people asked the > > same > > question, but I'm still not seeing any benefits. Can you give the > > pros > > and cons please? > > Again the way syscall will limit it to anonymous memory, Why mmap > can't do same? > There is precedence for it (like PROT_MTE is applicable only to > anonymous memory) > > So if it can be done, then why introduce a new syscall? > > > > > BTW, in glibc map_shadow_stack is called from arch code. So I think > > userspace wise, for this to affect other architectures there would > > need > > to be some code that could do things generically, with somehow the > > shadow stack pivot abstracted but the shadow stack allocation not. > > Agreed, yes it can be done in a way where it won't put tax on other > architectures. > > But what about fragmentation within x86. Will x86 always choose to > use system call > method map shadow stack. If future re-factor results in x86 also use > `mmap` method. > Isn't it a mess for x86 glibc to figure out what to do; whether to > use system call > or `mmap`? > Ok, so this is the downside I guess. What happens if we want to support the other types of memory in the future and end up using mmap for this? Then we have 15-20 lines of extra syscall wrapping code to maintain to support legacy. For the mmap solution, we have the downside of using extra MAP_ flags, and *some* amount of currently unknown vm_flag and address range logic, plus mmap arch breakouts to add to core MM. Like I said earlier, you would need to POC it out to see how bad that looks and get some core MM feedback on the new type of MAP flag usage. But, syscalls being pretty straightforward, it would probably be *some* amount of added complexity _now_ to support something that might happen in the future. I'm not seeing either one as a landslide win. It's kind of an eternal software design philosophical question, isn't it? How much work should you do to prepare for things that might be needed in the future? From what I've seen the balance in the kernel seems to be to try not to paint yourself in to an ABI corner, but otherwise let the kernel evolve naturally in response to real usages. If anyone wants to correct this, please do. But otherwise I think the new syscall is aligned with that. TBH, you are making me wonder if I'm missing something. It seems you strongly don't prefer this approach, but I'm not hearing any huge potential negative impacts. And you also say it won't tax the riscv implementation. Is this just something just smells bad here? Or it would shrink the riscv series?
On Mon, Feb 27, 2023 at 02:29:49PM -0800, Rick Edgecombe wrote: > When operating with shadow stacks enabled, the kernel will automatically > allocate shadow stacks for new threads, however in some cases userspace > will need additional shadow stacks. The main example of this is the > ucontext family of functions, which require userspace allocating and > pivoting to userspace managed stacks. > > Unlike most other user memory permissions, shadow stacks need to be > provisioned with special data in order to be useful. They need to be setup > with a restore token so that userspace can pivot to them via the RSTORSSP > instruction. But, the security design of shadow stack's is that they "stacks" > should not be written to except in limited circumstances. This presents a > problem for userspace, as to how userspace can provision this special > data, without allowing for the shadow stack to be generally writable. > > Previously, a new PROT_SHADOW_STACK was attempted, which could be > mprotect()ed from RW permissions after the data was provisioned. This was > found to not be secure enough, as other thread's could write to the "threads" > shadow stack during the writable window. > > The kernel can use a special instruction, WRUSS, to write directly to > userspace shadow stacks. So the solution can be that memory can be mapped > as shadow stack permissions from the beginning (never generally writable > in userspace), and the kernel itself can write the restore token. > > First, a new madvise() flag was explored, which could operate on the > PROT_SHADOW_STACK memory. This had a couple downsides: ^ of > 1. Extra checks were needed in mprotect() to prevent writable memory from > ever becoming PROT_SHADOW_STACK. > 2. Extra checks/vma state were needed in the new madvise() to prevent > restore tokens being written into the middle of pre-used shadow stacks. > It is ideal to prevent restore tokens being added at arbitrary > locations, so the check was to make sure the shadow stack had never been > written to. > 3. It stood out from the rest of the madvise flags, as more of direct > action than a hint at future desired behavior. > > So rather than repurpose two existing syscalls (mmap, madvise) that don't > quite fit, just implement a new map_shadow_stack syscall to allow > userspace to map and setup new shadow stacks in one step. While ucontext > is the primary motivator, userspace may have other unforeseen reasons to > setup it's own shadow stacks using the WRSS instruction. Towards this "its" > provide a flag so that stacks can be optionally setup securely for the > common case of ucontext without enabling WRSS. Or potentially have the > kernel set up the shadow stack in some new way. > > The following example demonstrates how to create a new shadow stack with > map_shadow_stack: > void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN); ... > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index c84d12608cd2..f65c671ce3b1 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -372,6 +372,7 @@ > 448 common process_mrelease sys_process_mrelease > 449 common futex_waitv sys_futex_waitv > 450 common set_mempolicy_home_node sys_set_mempolicy_home_node > +451 64 map_shadow_stack sys_map_shadow_stack Yeah, this'll need a manpage too, I presume. But later. > +SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags) > +{ > + bool set_tok = flags & SHADOW_STACK_SET_TOKEN; > + unsigned long aligned_size; > + > + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) > + return -EOPNOTSUPP; > + > + if (flags & ~SHADOW_STACK_SET_TOKEN) > + return -EINVAL; > + > + /* If there isn't space for a token */ > + if (set_tok && size < 8) > + return -EINVAL; > + > + if (addr && addr <= 0xFFFFFFFF) < SZ_4G > + return -EINVAL; Can we use distinct negative retvals in each case so that it is clear to userspace where it fails, *if* it fails? > + /* > + * An overflow would result in attempting to write the restore token > + * to the wrong location. Not catastrophic, but just return the right > + * error code and block it. > + */ > + aligned_size = PAGE_ALIGN(size); > + if (aligned_size < size) > + return -EOVERFLOW; > + > + return alloc_shstk(addr, aligned_size, size, set_tok); > +}
On Fri, 2023-03-10 at 17:11 +0100, Borislav Petkov wrote: [...] Thanks on all the text edits. > On Mon, Feb 27, 2023 at 02:29:49PM -0800, Rick Edgecombe wrote: > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl > > b/arch/x86/entry/syscalls/syscall_64.tbl > > index c84d12608cd2..f65c671ce3b1 100644 > > --- a/arch/x86/entry/syscalls/syscall_64.tbl > > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > > @@ -372,6 +372,7 @@ > > 448 common process_mrelease sys_process_mreleas > > e > > 449 common futex_waitv sys_futex_waitv > > 450 common set_mempolicy_home_node sys_set_mempolicy_h > > ome_node > > +451 64 map_shadow_stack sys_map_shadow_stack > > Yeah, this'll need a manpage too, I presume. But later. I have one to submit. [...] > > + > > + if (addr && addr <= 0xFFFFFFFF) > > < SZ_4G > > > + return -EINVAL; > > Can we use distinct negative retvals in each case so that it is clear > to > userspace where it fails, *if* it fails? Good idea, I think maybe ERANGE.
On Fri, Mar 10, 2023 at 05:12:40PM +0000, Edgecombe, Rick P wrote: > > Can we use distinct negative retvals in each case so that it is clear > > to > > userspace where it fails, *if* it fails? > > Good idea, I think maybe ERANGE. For those two, right? /* If there isn't space for a token */ if (set_tok && size < 8) return -EINVAL; if (addr && addr <= 0xFFFFFFFF) return -EINVAL; They are kinda range-checking of sorts. A wider range but still similar...
On Fri, 2023-03-10 at 21:05 +0100, Borislav Petkov wrote: > On Fri, Mar 10, 2023 at 05:12:40PM +0000, Edgecombe, Rick P wrote: > > > Can we use distinct negative retvals in each case so that it is > > > clear > > > to > > > userspace where it fails, *if* it fails? > > > > Good idea, I think maybe ERANGE. > > For those two, right? > > /* If there isn't space for a token */ > if (set_tok && size < 8) > return -EINVAL; > > if (addr && addr <= 0xFFFFFFFF) > return -EINVAL; > > They are kinda range-checking of sorts. A wider range but still > similar... I was thinking ERANGE would be for the 4GB limit. This is the weird 32 bit limiting thing. So if someone hit it they could look up in the docs what is going on. The size-big-enough-for-a-token check could be ENOSPC? Then each reason could have a different error code.
On Fri, Mar 10, 2023 at 08:19:46PM +0000, Edgecombe, Rick P wrote: > I was thinking ERANGE would be for the 4GB limit. This is the weird 32 > bit limiting thing. So if someone hit it they could look up in the docs > what is going on. The size-big-enough-for-a-token check could be > ENOSPC? Then each reason could have a different error code. Yah, makes perfect sense to me. Thx.
On Fri, Mar 10, 2023 at 12:14:01AM +0000, Edgecombe, Rick P wrote: >On Thu, 2023-03-09 at 13:08 -0800, Deepak Gupta wrote: >> On Thu, Mar 09, 2023 at 07:39:41PM +0000, Edgecombe, Rick P wrote: >> > On Thu, 2023-03-09 at 10:55 -0800, Deepak Gupta wrote: >> > > On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote: >> > > > The 02/27/2023 14:29, Rick Edgecombe wrote: >> > > > > Previously, a new PROT_SHADOW_STACK was attempted, >> > > > >> > > > ... >> > > > > So rather than repurpose two existing syscalls (mmap, >> > > > > madvise) >> > > > > that don't >> > > > > quite fit, just implement a new map_shadow_stack syscall to >> > > > > allow >> > > > > userspace to map and setup new shadow stacks in one step. >> > > > > While >> > > > > ucontext >> > > > > is the primary motivator, userspace may have other unforeseen >> > > > > reasons to >> > > > > setup it's own shadow stacks using the WRSS instruction. >> > > > > Towards >> > > > > this >> > > > > provide a flag so that stacks can be optionally setup >> > > > > securely >> > > > > for the >> > > > > common case of ucontext without enabling WRSS. Or potentially >> > > > > have the >> > > > > kernel set up the shadow stack in some new way. >> > > > >> > > > ... >> > > > > The following example demonstrates how to create a new shadow >> > > > > stack with >> > > > > map_shadow_stack: >> > > > > void *shstk = map_shadow_stack(addr, stack_size, >> > > > > SHADOW_STACK_SET_TOKEN); >> > > > >> > > > i think >> > > > >> > > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0); >> > > > >> > > > could do the same with less disruption to users (new syscalls >> > > > are harder to deal with than new flags). it would do the >> > > > guard page and initial token setup too (there is no flag for >> > > > it but could be squeezed in). >> > > >> > > Discussion on this topic in v6 >> > > >> > >> > >https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/ >> > > >> > > Again I know earlier CET patches had protection flag and somehow >> > > due >> > > to pushback >> > > on mailing list, >> > > it was adopted to go for special syscall because no one else >> > > had shadow stack. >> > > >> > > Seeing a response from Szabolcs, I am assuming arm4 would also >> > > want >> > > to follow >> > > using mmap to manufacture shadow stack. For reference RFC patches >> > > for >> > > risc-v shadow stack, >> > > use a new protection flag = PROT_SHADOWSTACK. >> > > >> > >> > >https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/ >> > > >> > > I know earlier discussion had been that we let this go and do a >> > > re- >> > > factor later as other >> > > arch support trickle in. But as I thought more on this and I >> > > think it >> > > may just be >> > > messy from user mode point of view as well to have cognition of >> > > two >> > > different ways of >> > > creating shadow stack. One would be special syscall (in current >> > > libc) >> > > and another `mmap` >> > > (whenever future re-factor happens) >> > > >> > > If it's not too late, it would be more wise to take `mmap` >> > > approach rather than special `syscall` approach. >> > >> > There is sort of two things intermixed here when we talk about a >> > PROT_SHADOW_STACK. >> > >> > One is: what is the interface for specifying how the shadow stack >> > should be provisioned with data? Right now there are two ways >> > supported, all zero or with an X86 shadow stack restore token at >> > the >> > end. Then there was already some conversation about a third type. >> > In >> > which case the question would be is using mmap MAP_ flags the right >> > place for this? How many types of initialization will be needed in >> > the >> > end and what is the overlap between the architectures? >> >> First of all, arches can choose to have token at the bottom or not. >> >> Token serve following purposes >> - It allows one to put desired value in shadow stack pointer in >> safe/secure manner. >> Note: x86 doesn't provide any opcode encoding to value in SSP >> register. So having >> a token is kind of a necessity because x86 doesn't easily allow >> writing shadow stack. >> >> - A token at the bottom acts marker / barrier and can be useful in >> debugging >> >> - If (and a big *if*) we ever reach a point in future where return >> address is only pushed >> on shadow stack (x86 should have motivation to do this because >> less uops on call/ret), >> a token at the bottom (bottom means lower address) is ensuring >> sure shot way of getting >> a fault when exhausted. >> >> Current RISCV zisslpcfi proposal doesn't define CPU based tokens >> because it's RISC. >> It allows mechanisms using which software can define formatting of >> token for itself. >> Not sure of what ARM is doing. > >Ok, so riscv doesn't need to have the kernel write the token, but x86 >does. > >> >> Now coming to the point of all zero v/s shadow stack token. >> Why not always have token at the bottom? > >With WRSS you can setup the shadow stack however you want. So the user >would then have to take care to erase the token if they didn't want it. >Not the end of the world, but kind of clunky if there is no reason for >it. Yes but kernel always assumes the user is going to use the token. It' upto the user to decide whether they want to use the restore token or not. If they've WRSS capability security posture is anyways diluted. An attacker who would be clever enough to re-use `RSTORSSP` present in address space to restore using kernel prepared token, should anyways can be clever enough to use WRSS as well. It kind of makes shadow stack creation simpler for kernel to always place the token. This point is irrespective of whether to use system call or mmap. > >> >> In case of x86, Why need for two ways and why not always have a token >> at the bottom. >> The way x86 is going, user mode is responsible for establishing >> shadow stack and thus >> whenever shadow stack is created then if x86 kernel implementation >> always place a token >> at the base/bottom. > >There was also some discussion recently of adding a token AND an end of >stack marker, as a potential solution for backtracing in ucontext >stacks. In this case it could cause an ABI break to just start adding >the end of stack marker where the token was, and so would require a new >map_shadow_stack flag. Was this discussed why restore token itself can't be used as marker for end of stack (if we assume there is always going to be one at the bottom). It's a unique value. An address pointing to itself. > >> >> Now user mode can do following:-- >> - If it has access to WRSS, it can sure go ahead and create a token >> of its choosing and >> overwrite kernel created token. and then do RSTORSSP on it's own >> created token. >> >> - If it doesn't have access to WRSS (and dont need to create its >> own token), it can do >> RSTORSSP on this. As soon as it does, no other thread in process >> can restore to it. >> On `fork`, you get the same un-restorable token. >> >> So why not always have a token at the bottom. >> This is my plan for riscv implementation as well (to have a token at >> the bottom) >> >> > >> > The other thing is: should shadow stack memory creation be tightly >> > controlled? For example in x86 we limit this to anonymous memory, >> > etc. >> > Some reasons for this are x86 specific, but some are not. So if we >> > disallow most of the options why allow the interface to take them? >> > And >> > then you are in the position of carefully maintaining a list of >> > not- >> > allowed options instead letting a list of allowed options sit >> > there. >> >> I am new to linux kernel and thus may be not able to follow the >> argument of >> limiting to anonymous memory. >> >> Why is limiting it to anonymous memory a problem. IIRC, ARM's >> PROT_MTE is applicable >> only to anonymous memory. I can probably find few more examples. > >Oh I see, they have a special arch VMA flag VM_MTE_ALLOWED that only >gets set if all the rules are followed. Then PROT_MTE can only be set >on that to set VM_MTE. That is kind of nice because certain other >special situations can choose to support it. That's because MTE is different. It allows to assign tags to existing virtual memory. So one need to know whether a memory can have tags assigned. > >It does take another arch vma flag though. For x86 I guess I would need >to figure out how to squeeze VM_SHADOW_STACK into other flags to have a >free flag to use the same method. It also only supports mprotect() and >shadow stack would only want to support mmap(). And you still have the >initialization stuff to plumb through. Yea, I think the PROT_MTE is a >good thing to consider, but it's not super obvious to me how similar >the logic would be for shadow stack. I dont think you need another VMA flag. Memory tagging allows adding tags to existing virtual memory. That's why having `mprotect` makes sense for MTE. In shadow stack case, there is no requirement of changing a shadow stack to regular memory or vice-versa. All that's needed to change is `mmap`. `mprotect` should fail. Syscall approach gives that benefit by default because there is no protection flag for shadow stack. I was giving example that any feature which gives new meaning to virtual memory has been able to work with existing memory mapping APIs without the need of new system call (including whether you're dealing with anonymous memory). > >The question I'm asking though is, not "can mmap code and rules be >changed to enforce the required limitations?". I think it is yes. But >the question is "why is that plumbing better than a new syscall?". I >guess to get a better idea, the mmap solution would need to get POCed. >I had half done this at one point, but abandoned the approach. > >For your question about why limit it, the special x86 case is the >Dirty=1,Write=0 PTE bit combination for shadow stacks. So for shadow >stack you could have some confusion about whether a PTE is actually >dirty for writeback, etc. I wouldn't say it's known to be impossible to >do MAP_SHARED, but it has not been fully analyzed enough to know what >the changes would be. There were some solvable concrete issues that >tipped the scale as well. It was also not expected to be a common >usage, if at all. I am not sure how confusion of D=1,W=0 is not completely taken away by syscall approach. It'll always be there. One can only do things to minimize the chances. In case of syscall approach, syscall makes sure that `flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G` This can be easily checked in arch specific landing function for mmap. Additionally, If you always have the token at base, you don't need that ABI between user and kernel. > >The non-x86, general reasons for it, are for a smaller benefit. It >blocks a lot of ways shadow stack memory could be written to. Like say >you have a memory mapped writable file, and you also map it shadow >stack. So it has better security properties depending on what your >threat model is. I wouldn't say any architecture should allow such primitives. It kind of defeats the purpose for shadow stack. Yes if some sort of secure memory is needed, there may be new ISA extensions for that. > >> >> Eventually syscall will also go ahead and use memory management code >> to >> perform mapping. So I didn't understand the reasoning here. The way >> syscall >> can limit it to anonymous memory, why mmap can't do the same if it >> sees >> PROT_SHADOWSTACK. >> >> > >> > The only benefit I've heard is that it saves creating a new >> > syscall, >> > but it also saves several MAP_ flags. That, and that the RFC for >> > riscv >> > did a PROT_SHADOW_STACK to start. So, yes, two people asked the >> > same >> > question, but I'm still not seeing any benefits. Can you give the >> > pros >> > and cons please? >> >> Again the way syscall will limit it to anonymous memory, Why mmap >> can't do same? >> There is precedence for it (like PROT_MTE is applicable only to >> anonymous memory) >> >> So if it can be done, then why introduce a new syscall? >> >> > >> > BTW, in glibc map_shadow_stack is called from arch code. So I think >> > userspace wise, for this to affect other architectures there would >> > need >> > to be some code that could do things generically, with somehow the >> > shadow stack pivot abstracted but the shadow stack allocation not. >> >> Agreed, yes it can be done in a way where it won't put tax on other >> architectures. >> >> But what about fragmentation within x86. Will x86 always choose to >> use system call >> method map shadow stack. If future re-factor results in x86 also use >> `mmap` method. >> Isn't it a mess for x86 glibc to figure out what to do; whether to >> use system call >> or `mmap`? >> > >Ok, so this is the downside I guess. What happens if we want to support >the other types of memory in the future and end up using mmap for this? >Then we have 15-20 lines of extra syscall wrapping code to maintain to >support legacy. > >For the mmap solution, we have the downside of using extra MAP_ flags, >and *some* amount of currently unknown vm_flag and address range logic, >plus mmap arch breakouts to add to core MM. Like I said earlier, you >would need to POC it out to see how bad that looks and get some core MM >feedback on the new type of MAP flag usage. But, syscalls being pretty >straightforward, it would probably be *some* amount of added complexity >_now_ to support something that might happen in the future. I'm not >seeing either one as a landslide win. > >It's kind of an eternal software design philosophical question, isn't >it? How much work should you do to prepare for things that might be >needed in the future? From what I've seen the balance in the kernel >seems to be to try not to paint yourself in to an ABI corner, but >otherwise let the kernel evolve naturally in response to real usages. >If anyone wants to correct this, please do. But otherwise I think the >new syscall is aligned with that. > >TBH, you are making me wonder if I'm missing something. It seems you >strongly don't prefer this approach, but I'm not hearing any huge >potential negative impacts. And you also say it won't tax the riscv >implementation. Is this just something just smells bad here? Or it >would shrink the riscv series? No you're not missing anything. It's just wierdness of adding a system call which enforces certain MAP_XX flags and pretty much mapping API. And difference between architectures on how they will create shadow stack. + if x86 chooses to use `mmap` in future, then there is ugliness in user mode to decide which method to choose. And yes you got it right, to some extent there is my own selfishness playing out as well here to reduce riscv patches.
On Fri, 2023-03-10 at 13:00 -0800, Deepak Gupta wrote: > On Fri, Mar 10, 2023 at 12:14:01AM +0000, Edgecombe, Rick P wrote: > > On Thu, 2023-03-09 at 13:08 -0800, Deepak Gupta wrote: > > > On Thu, Mar 09, 2023 at 07:39:41PM +0000, Edgecombe, Rick P > > > wrote: > > > > On Thu, 2023-03-09 at 10:55 -0800, Deepak Gupta wrote: > > > > > On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy > > > > > wrote: > > > > > > The 02/27/2023 14:29, Rick Edgecombe wrote: > > > > > > > Previously, a new PROT_SHADOW_STACK was attempted, > > > > > > > > > > > > ... > > > > > > > So rather than repurpose two existing syscalls (mmap, > > > > > > > madvise) > > > > > > > that don't > > > > > > > quite fit, just implement a new map_shadow_stack syscall > > > > > > > to > > > > > > > allow > > > > > > > userspace to map and setup new shadow stacks in one step. > > > > > > > While > > > > > > > ucontext > > > > > > > is the primary motivator, userspace may have other > > > > > > > unforeseen > > > > > > > reasons to > > > > > > > setup it's own shadow stacks using the WRSS instruction. > > > > > > > Towards > > > > > > > this > > > > > > > provide a flag so that stacks can be optionally setup > > > > > > > securely > > > > > > > for the > > > > > > > common case of ucontext without enabling WRSS. Or > > > > > > > potentially > > > > > > > have the > > > > > > > kernel set up the shadow stack in some new way. > > > > > > > > > > > > ... > > > > > > > The following example demonstrates how to create a new > > > > > > > shadow > > > > > > > stack with > > > > > > > map_shadow_stack: > > > > > > > void *shstk = map_shadow_stack(addr, stack_size, > > > > > > > SHADOW_STACK_SET_TOKEN); > > > > > > > > > > > > i think > > > > > > > > > > > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, > > > > > > 0); > > > > > > > > > > > > could do the same with less disruption to users (new > > > > > > syscalls > > > > > > are harder to deal with than new flags). it would do the > > > > > > guard page and initial token setup too (there is no flag > > > > > > for > > > > > > it but could be squeezed in). > > > > > > > > > > Discussion on this topic in v6 > > > > > > > > > > > > > > > > > https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/ > > > > > > > > > > Again I know earlier CET patches had protection flag and > > > > > somehow > > > > > due > > > > > to pushback > > > > > on mailing list, > > > > > it was adopted to go for special syscall because no one else > > > > > had shadow stack. > > > > > > > > > > Seeing a response from Szabolcs, I am assuming arm4 would > > > > > also > > > > > want > > > > > to follow > > > > > using mmap to manufacture shadow stack. For reference RFC > > > > > patches > > > > > for > > > > > risc-v shadow stack, > > > > > use a new protection flag = PROT_SHADOWSTACK. > > > > > > > > > > > > > > > > > https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/ > > > > > > > > > > I know earlier discussion had been that we let this go and do > > > > > a > > > > > re- > > > > > factor later as other > > > > > arch support trickle in. But as I thought more on this and I > > > > > think it > > > > > may just be > > > > > messy from user mode point of view as well to have cognition > > > > > of > > > > > two > > > > > different ways of > > > > > creating shadow stack. One would be special syscall (in > > > > > current > > > > > libc) > > > > > and another `mmap` > > > > > (whenever future re-factor happens) > > > > > > > > > > If it's not too late, it would be more wise to take `mmap` > > > > > approach rather than special `syscall` approach. > > > > > > > > There is sort of two things intermixed here when we talk about > > > > a > > > > PROT_SHADOW_STACK. > > > > > > > > One is: what is the interface for specifying how the shadow > > > > stack > > > > should be provisioned with data? Right now there are two ways > > > > supported, all zero or with an X86 shadow stack restore token > > > > at > > > > the > > > > end. Then there was already some conversation about a third > > > > type. > > > > In > > > > which case the question would be is using mmap MAP_ flags the > > > > right > > > > place for this? How many types of initialization will be needed > > > > in > > > > the > > > > end and what is the overlap between the architectures? > > > > > > First of all, arches can choose to have token at the bottom or > > > not. > > > > > > Token serve following purposes > > > - It allows one to put desired value in shadow stack pointer in > > > safe/secure manner. > > > Note: x86 doesn't provide any opcode encoding to value in SSP > > > register. So having > > > a token is kind of a necessity because x86 doesn't easily > > > allow > > > writing shadow stack. > > > > > > - A token at the bottom acts marker / barrier and can be useful > > > in > > > debugging > > > > > > - If (and a big *if*) we ever reach a point in future where > > > return > > > address is only pushed > > > on shadow stack (x86 should have motivation to do this > > > because > > > less uops on call/ret), > > > a token at the bottom (bottom means lower address) is > > > ensuring > > > sure shot way of getting > > > a fault when exhausted. > > > > > > Current RISCV zisslpcfi proposal doesn't define CPU based tokens > > > because it's RISC. > > > It allows mechanisms using which software can define formatting > > > of > > > token for itself. > > > Not sure of what ARM is doing. > > > > Ok, so riscv doesn't need to have the kernel write the token, but > > x86 > > does. > > > > > > > > Now coming to the point of all zero v/s shadow stack token. > > > Why not always have token at the bottom? > > > > With WRSS you can setup the shadow stack however you want. So the > > user > > would then have to take care to erase the token if they didn't want > > it. > > Not the end of the world, but kind of clunky if there is no reason > > for > > it. > > Yes but kernel always assumes the user is going to use the token. It' > upto the user > to decide whether they want to use the restore token or not. If > they've WRSS capability > security posture is anyways diluted. An attacker who would be clever > enough to > re-use `RSTORSSP` present in address space to restore using kernel > prepared token, should > anyways can be clever enough to use WRSS as well. > > It kind of makes shadow stack creation simpler for kernel to always > place the token. > This point is irrespective of whether to use system call or mmap. Think about like CRIU restoring the shadow stack, or other special cases like that. Userspace can always overwrite the token, but this involves some amount of extra work (extra writes, earlier faulting in the page, etc). It is clunky and very negligibly worse. > > > > > > > > > In case of x86, Why need for two ways and why not always have a > > > token > > > at the bottom. > > > The way x86 is going, user mode is responsible for establishing > > > shadow stack and thus > > > whenever shadow stack is created then if x86 kernel > > > implementation > > > always place a token > > > at the base/bottom. > > > > There was also some discussion recently of adding a token AND an > > end of > > stack marker, as a potential solution for backtracing in ucontext > > stacks. In this case it could cause an ABI break to just start > > adding > > the end of stack marker where the token was, and so would require a > > new > > map_shadow_stack flag. > > Was this discussed why restore token itself can't be used as marker > for > end of stack (if we assume there is always going to be one at the > bottom). > It's a unique value. An address pointing to itself. I thought the same thing at first, but it gets clobbered during the pivot and push. > > > > > > > > > Now user mode can do following:-- > > > - If it has access to WRSS, it can sure go ahead and create a > > > token > > > of its choosing and > > > overwrite kernel created token. and then do RSTORSSP on it's > > > own > > > created token. > > > > > > - If it doesn't have access to WRSS (and dont need to create > > > its > > > own token), it can do > > > RSTORSSP on this. As soon as it does, no other thread in > > > process > > > can restore to it. > > > On `fork`, you get the same un-restorable token. > > > > > > So why not always have a token at the bottom. > > > This is my plan for riscv implementation as well (to have a token > > > at > > > the bottom) > > > > > > > > > > > The other thing is: should shadow stack memory creation be > > > > tightly > > > > controlled? For example in x86 we limit this to anonymous > > > > memory, > > > > etc. > > > > Some reasons for this are x86 specific, but some are not. So if > > > > we > > > > disallow most of the options why allow the interface to take > > > > them? > > > > And > > > > then you are in the position of carefully maintaining a list of > > > > not- > > > > allowed options instead letting a list of allowed options sit > > > > there. > > > > > > I am new to linux kernel and thus may be not able to follow the > > > argument of > > > limiting to anonymous memory. > > > > > > Why is limiting it to anonymous memory a problem. IIRC, ARM's > > > PROT_MTE is applicable > > > only to anonymous memory. I can probably find few more examples. > > > > Oh I see, they have a special arch VMA flag VM_MTE_ALLOWED that > > only > > gets set if all the rules are followed. Then PROT_MTE can only be > > set > > on that to set VM_MTE. That is kind of nice because certain other > > special situations can choose to support it. > > That's because MTE is different. It allows to assign tags to existing > virtual memory. So one need to know whether a memory can have tags > assigned. > > > > > It does take another arch vma flag though. For x86 I guess I would > > need > > to figure out how to squeeze VM_SHADOW_STACK into other flags to > > have a > > free flag to use the same method. It also only supports mprotect() > > and > > shadow stack would only want to support mmap(). And you still have > > the > > initialization stuff to plumb through. Yea, I think the PROT_MTE is > > a > > good thing to consider, but it's not super obvious to me how > > similar > > the logic would be for shadow stack. > > I dont think you need another VMA flag. Memory tagging allows adding > tags > to existing virtual memory. ...need another VMA flag to use the existing mmap arch breakouts in the same way as VM_MTE. Of course changing mmap makes other solutions possible. > That's why having `mprotect` makes sense for MTE. > In shadow stack case, there is no requirement of changing a shadow > stack > to regular memory or vice-versa. uffd needs mprotect internals. You might take a look at it in regards to your VM_WRITE/mprotect blocking approach for riscv. I was imagining, even if mmap was the syscall, mprotect() would not be blocked in the x86 case at least. The mprotect() blocking is a separate thing than the syscall, right? > > All that's needed to change is `mmap`. `mprotect` should fail. > Syscall > approach gives that benefit by default because there is no protection > flag > for shadow stack. > > I was giving example that any feature which gives new meaning to > virtual memory > has been able to work with existing memory mapping APIs without the > need of new > system call (including whether you're dealing with anonymous memory). > > > > > The question I'm asking though is, not "can mmap code and rules be > > changed to enforce the required limitations?". I think it is yes. > > But > > the question is "why is that plumbing better than a new syscall?". > > I > > guess to get a better idea, the mmap solution would need to get > > POCed. > > I had half done this at one point, but abandoned the approach. > > > > For your question about why limit it, the special x86 case is the > > Dirty=1,Write=0 PTE bit combination for shadow stacks. So for > > shadow > > stack you could have some confusion about whether a PTE is actually > > dirty for writeback, etc. I wouldn't say it's known to be > > impossible to > > do MAP_SHARED, but it has not been fully analyzed enough to know > > what > > the changes would be. There were some solvable concrete issues that > > tipped the scale as well. It was also not expected to be a common > > usage, if at all. > > I am not sure how confusion of D=1,W=0 is not completely taken away > by > syscall approach. It'll always be there. One can only do things to > minimize > the chances. > > In case of syscall approach, syscall makes sure that > > `flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G` > > This can be easily checked in arch specific landing function for > mmap. Right, this is why I listed two types of things in the mix here. The memory features supported, and what the syscall is. You asked why limit the memory features, so that is the explanation. > > > Additionally, If you always have the token at base, you don't need > that ABI > between user and kernel. > > > > > > The non-x86, general reasons for it, are for a smaller benefit. It > > blocks a lot of ways shadow stack memory could be written to. Like > > say > > you have a memory mapped writable file, and you also map it shadow > > stack. So it has better security properties depending on what your > > threat model is. > > I wouldn't say any architecture should allow such primitives. It kind > of defeats > the purpose for shadow stack. Yes if some sort of secure memory is > needed, there may > be new ISA extensions for that. Yea, seems reasonable to prevent this regardless of the extra x86 reasons, if that is what you are saying. It depends on people's threat models (as always in security). > > > > > > > > > Eventually syscall will also go ahead and use memory management > > > code > > > to > > > perform mapping. So I didn't understand the reasoning here. The > > > way > > > syscall > > > can limit it to anonymous memory, why mmap can't do the same if > > > it > > > sees > > > PROT_SHADOWSTACK. > > > > > > > > > > > The only benefit I've heard is that it saves creating a new > > > > syscall, > > > > but it also saves several MAP_ flags. That, and that the RFC > > > > for > > > > riscv > > > > did a PROT_SHADOW_STACK to start. So, yes, two people asked the > > > > same > > > > question, but I'm still not seeing any benefits. Can you give > > > > the > > > > pros > > > > and cons please? > > > > > > Again the way syscall will limit it to anonymous memory, Why mmap > > > can't do same? > > > There is precedence for it (like PROT_MTE is applicable only to > > > anonymous memory) > > > > > > So if it can be done, then why introduce a new syscall? > > > > > > > > > > > BTW, in glibc map_shadow_stack is called from arch code. So I > > > > think > > > > userspace wise, for this to affect other architectures there > > > > would > > > > need > > > > to be some code that could do things generically, with somehow > > > > the > > > > shadow stack pivot abstracted but the shadow stack allocation > > > > not. > > > > > > Agreed, yes it can be done in a way where it won't put tax on > > > other > > > architectures. > > > > > > But what about fragmentation within x86. Will x86 always choose > > > to > > > use system call > > > method map shadow stack. If future re-factor results in x86 also > > > use > > > `mmap` method. > > > Isn't it a mess for x86 glibc to figure out what to do; whether > > > to > > > use system call > > > or `mmap`? > > > > > > > Ok, so this is the downside I guess. What happens if we want to > > support > > the other types of memory in the future and end up using mmap for > > this? > > Then we have 15-20 lines of extra syscall wrapping code to maintain > > to > > support legacy. > > > > For the mmap solution, we have the downside of using extra MAP_ > > flags, > > and *some* amount of currently unknown vm_flag and address range > > logic, > > plus mmap arch breakouts to add to core MM. Like I said earlier, > > you > > would need to POC it out to see how bad that looks and get some > > core MM > > feedback on the new type of MAP flag usage. But, syscalls being > > pretty > > straightforward, it would probably be *some* amount of added > > complexity > > _now_ to support something that might happen in the future. I'm not > > seeing either one as a landslide win. > > > > It's kind of an eternal software design philosophical question, > > isn't > > it? How much work should you do to prepare for things that might be > > needed in the future? From what I've seen the balance in the kernel > > seems to be to try not to paint yourself in to an ABI corner, but > > otherwise let the kernel evolve naturally in response to real > > usages. > > If anyone wants to correct this, please do. But otherwise I think > > the > > new syscall is aligned with that. > > > > TBH, you are making me wonder if I'm missing something. It seems > > you > > strongly don't prefer this approach, but I'm not hearing any huge > > potential negative impacts. And you also say it won't tax the riscv > > implementation. Is this just something just smells bad here? Or it > > would shrink the riscv series? > > No you're not missing anything. It's just wierdness of adding a > system call > which enforces certain MAP_XX flags and pretty much mapping API. > And difference between architectures on how they will create shadow > stack. + > if x86 chooses to use `mmap` in future, then there is ugliness in > user mode to > decide which method to choose. Ok, I think I will leave it given it's entirely in arch/x86. It just got some special error codes in the other thread today too. > > And yes you got it right, to some extent there is my own selfishness > playing out > as well here to reduce riscv patches. > Feel free to join the map_shadow_stack party. :)
Hi, On Thu, Mar 09, 2023 at 10:55:11AM -0800, Deepak Gupta wrote: > On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote: > > The 02/27/2023 14:29, Rick Edgecombe wrote: > > > Previously, a new PROT_SHADOW_STACK was attempted, > > ... > > > So rather than repurpose two existing syscalls (mmap, madvise) that don't > > > quite fit, just implement a new map_shadow_stack syscall to allow > > > userspace to map and setup new shadow stacks in one step. While ucontext > > > is the primary motivator, userspace may have other unforeseen reasons to > > > setup it's own shadow stacks using the WRSS instruction. Towards this > > > provide a flag so that stacks can be optionally setup securely for the > > > common case of ucontext without enabling WRSS. Or potentially have the > > > kernel set up the shadow stack in some new way. > > ... > > > The following example demonstrates how to create a new shadow stack with > > > map_shadow_stack: > > > void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN); > > > > i think > > > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0); > > > > could do the same with less disruption to users (new syscalls > > are harder to deal with than new flags). it would do the > > guard page and initial token setup too (there is no flag for > > it but could be squeezed in). > > Discussion on this topic in v6 > https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/ > > Again I know earlier CET patches had protection flag and somehow due to pushback > on mailing list, it was adopted to go for special syscall because no one else > had shadow stack. > > Seeing a response from Szabolcs, I am assuming arm4 would also want to follow > using mmap to manufacture shadow stack. For reference RFC patches for risc-v shadow stack, > use a new protection flag = PROT_SHADOWSTACK. > https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/ > > I know earlier discussion had been that we let this go and do a re-factor later as other > arch support trickle in. But as I thought more on this and I think it may just be > messy from user mode point of view as well to have cognition of two different ways of > creating shadow stack. One would be special syscall (in current libc) and another `mmap` > (whenever future re-factor happens) > > If it's not too late, it would be more wise to take `mmap` > approach rather than special `syscall` approach. I disagree. Having shadow stack flags for mmap() adds unnecessary complexity to the core-mm, while having a dedicated syscall hides all the details in the architecture specific code. Another reason to use a dedicated system call allows for better extensibility if/when we'd need to update the way shadow stack VMA is created. As for the userspace convenience, it is anyway required to add special code for creating the shadow stack and it wouldn't matter if that code would use mmap(NEW_FLAG) or map_shadow_stack(). > > most of the mmap features need not be available (EINVAL) when > > MAP_SHADOW_STACK is specified. > > > > the main drawback is running out of mmap flags so extension > > is limited. (but the new syscall has limitations too).
On Tue, Mar 14, 2023 at 12:19 AM Mike Rapoport <rppt@kernel.org> wrote: > > Hi, > > On Thu, Mar 09, 2023 at 10:55:11AM -0800, Deepak Gupta wrote: > > On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy wrote: > > > The 02/27/2023 14:29, Rick Edgecombe wrote: > > > > Previously, a new PROT_SHADOW_STACK was attempted, > > > ... > > > > So rather than repurpose two existing syscalls (mmap, madvise) that don't > > > > quite fit, just implement a new map_shadow_stack syscall to allow > > > > userspace to map and setup new shadow stacks in one step. While ucontext > > > > is the primary motivator, userspace may have other unforeseen reasons to > > > > setup it's own shadow stacks using the WRSS instruction. Towards this > > > > provide a flag so that stacks can be optionally setup securely for the > > > > common case of ucontext without enabling WRSS. Or potentially have the > > > > kernel set up the shadow stack in some new way. > > > ... > > > > The following example demonstrates how to create a new shadow stack with > > > > map_shadow_stack: > > > > void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN); > > > > > > i think > > > > > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, 0); > > > > > > could do the same with less disruption to users (new syscalls > > > are harder to deal with than new flags). it would do the > > > guard page and initial token setup too (there is no flag for > > > it but could be squeezed in). > > > > Discussion on this topic in v6 > > https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/ > > > > Again I know earlier CET patches had protection flag and somehow due to pushback > > on mailing list, it was adopted to go for special syscall because no one else > > had shadow stack. > > > > Seeing a response from Szabolcs, I am assuming arm4 would also want to follow > > using mmap to manufacture shadow stack. For reference RFC patches for risc-v shadow stack, > > use a new protection flag = PROT_SHADOWSTACK. > > https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/ > > > > I know earlier discussion had been that we let this go and do a re-factor later as other > > arch support trickle in. But as I thought more on this and I think it may just be > > messy from user mode point of view as well to have cognition of two different ways of > > creating shadow stack. One would be special syscall (in current libc) and another `mmap` > > (whenever future re-factor happens) > > > > If it's not too late, it would be more wise to take `mmap` > > approach rather than special `syscall` approach. > > I disagree. > > Having shadow stack flags for mmap() adds unnecessary complexity to the > core-mm, while having a dedicated syscall hides all the details in the > architecture specific code. Again reiterating it would've made sense if only x86 had a shadow stack. aarch64 announced support for guarded stack. risc-v spec is in development to support shadow stack. So there will be shadow stack related flow in these arches. > > Another reason to use a dedicated system call allows for better > extensibility if/when we'd need to update the way shadow stack VMA is > created. I see two valid points here - Shadow stack doesn't need conversion into different memory types (which is usually the case for address ranges created by mmap) So there is a static page permissions on shadow stack which is not mutable. - Future feature addition (if there is one needed) at the time of shadow stack creation It would avoid future tax on mmap I'll think more about this. > > As for the userspace convenience, it is anyway required to add special > code for creating the shadow stack and it wouldn't matter if that code > would use mmap(NEW_FLAG) or map_shadow_stack(). Yes *strictly* from userspace convenience, it doesn't matter which option. > > > > most of the mmap features need not be available (EINVAL) when > > > MAP_SHADOW_STACK is specified. > > > > > > the main drawback is running out of mmap flags so extension > > > is limited. (but the new syscall has limitations too). > > -- > Sincerely yours, > Mike.
On Fri, Mar 10, 2023 at 1:43 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Fri, 2023-03-10 at 13:00 -0800, Deepak Gupta wrote: > > On Fri, Mar 10, 2023 at 12:14:01AM +0000, Edgecombe, Rick P wrote: > > > On Thu, 2023-03-09 at 13:08 -0800, Deepak Gupta wrote: > > > > On Thu, Mar 09, 2023 at 07:39:41PM +0000, Edgecombe, Rick P > > > > wrote: > > > > > On Thu, 2023-03-09 at 10:55 -0800, Deepak Gupta wrote: > > > > > > On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy > > > > > > wrote: > > > > > > > The 02/27/2023 14:29, Rick Edgecombe wrote: > > > > > > > > Previously, a new PROT_SHADOW_STACK was attempted, > > > > > > > > > > > > > > ... > > > > > > > > So rather than repurpose two existing syscalls (mmap, > > > > > > > > madvise) > > > > > > > > that don't > > > > > > > > quite fit, just implement a new map_shadow_stack syscall > > > > > > > > to > > > > > > > > allow > > > > > > > > userspace to map and setup new shadow stacks in one step. > > > > > > > > While > > > > > > > > ucontext > > > > > > > > is the primary motivator, userspace may have other > > > > > > > > unforeseen > > > > > > > > reasons to > > > > > > > > setup it's own shadow stacks using the WRSS instruction. > > > > > > > > Towards > > > > > > > > this > > > > > > > > provide a flag so that stacks can be optionally setup > > > > > > > > securely > > > > > > > > for the > > > > > > > > common case of ucontext without enabling WRSS. Or > > > > > > > > potentially > > > > > > > > have the > > > > > > > > kernel set up the shadow stack in some new way. > > > > > > > > > > > > > > ... > > > > > > > > The following example demonstrates how to create a new > > > > > > > > shadow > > > > > > > > stack with > > > > > > > > map_shadow_stack: > > > > > > > > void *shstk = map_shadow_stack(addr, stack_size, > > > > > > > > SHADOW_STACK_SET_TOKEN); > > > > > > > > > > > > > > i think > > > > > > > > > > > > > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, > > > > > > > 0); > > > > > > > > > > > > > > could do the same with less disruption to users (new > > > > > > > syscalls > > > > > > > are harder to deal with than new flags). it would do the > > > > > > > guard page and initial token setup too (there is no flag > > > > > > > for > > > > > > > it but could be squeezed in). > > > > > > > > > > > > Discussion on this topic in v6 > > > > > > > > > > > > > > > > > > > > > > > https://lore.kernel.org/all/20230223000340.GB945966@debug.ba.rivosinc.com/ > > > > > > > > > > > > Again I know earlier CET patches had protection flag and > > > > > > somehow > > > > > > due > > > > > > to pushback > > > > > > on mailing list, > > > > > > it was adopted to go for special syscall because no one else > > > > > > had shadow stack. > > > > > > > > > > > > Seeing a response from Szabolcs, I am assuming arm4 would > > > > > > also > > > > > > want > > > > > > to follow > > > > > > using mmap to manufacture shadow stack. For reference RFC > > > > > > patches > > > > > > for > > > > > > risc-v shadow stack, > > > > > > use a new protection flag = PROT_SHADOWSTACK. > > > > > > > > > > > > > > > > > > > > > > > https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/ > > > > > > > > > > > > I know earlier discussion had been that we let this go and do > > > > > > a > > > > > > re- > > > > > > factor later as other > > > > > > arch support trickle in. But as I thought more on this and I > > > > > > think it > > > > > > may just be > > > > > > messy from user mode point of view as well to have cognition > > > > > > of > > > > > > two > > > > > > different ways of > > > > > > creating shadow stack. One would be special syscall (in > > > > > > current > > > > > > libc) > > > > > > and another `mmap` > > > > > > (whenever future re-factor happens) > > > > > > > > > > > > If it's not too late, it would be more wise to take `mmap` > > > > > > approach rather than special `syscall` approach. > > > > > > > > > > There is sort of two things intermixed here when we talk about > > > > > a > > > > > PROT_SHADOW_STACK. > > > > > > > > > > One is: what is the interface for specifying how the shadow > > > > > stack > > > > > should be provisioned with data? Right now there are two ways > > > > > supported, all zero or with an X86 shadow stack restore token > > > > > at > > > > > the > > > > > end. Then there was already some conversation about a third > > > > > type. > > > > > In > > > > > which case the question would be is using mmap MAP_ flags the > > > > > right > > > > > place for this? How many types of initialization will be needed > > > > > in > > > > > the > > > > > end and what is the overlap between the architectures? > > > > > > > > First of all, arches can choose to have token at the bottom or > > > > not. > > > > > > > > Token serve following purposes > > > > - It allows one to put desired value in shadow stack pointer in > > > > safe/secure manner. > > > > Note: x86 doesn't provide any opcode encoding to value in SSP > > > > register. So having > > > > a token is kind of a necessity because x86 doesn't easily > > > > allow > > > > writing shadow stack. > > > > > > > > - A token at the bottom acts marker / barrier and can be useful > > > > in > > > > debugging > > > > > > > > - If (and a big *if*) we ever reach a point in future where > > > > return > > > > address is only pushed > > > > on shadow stack (x86 should have motivation to do this > > > > because > > > > less uops on call/ret), > > > > a token at the bottom (bottom means lower address) is > > > > ensuring > > > > sure shot way of getting > > > > a fault when exhausted. > > > > > > > > Current RISCV zisslpcfi proposal doesn't define CPU based tokens > > > > because it's RISC. > > > > It allows mechanisms using which software can define formatting > > > > of > > > > token for itself. > > > > Not sure of what ARM is doing. > > > > > > Ok, so riscv doesn't need to have the kernel write the token, but > > > x86 > > > does. > > > > > > > > > > > Now coming to the point of all zero v/s shadow stack token. > > > > Why not always have token at the bottom? > > > > > > With WRSS you can setup the shadow stack however you want. So the > > > user > > > would then have to take care to erase the token if they didn't want > > > it. > > > Not the end of the world, but kind of clunky if there is no reason > > > for > > > it. > > > > Yes but kernel always assumes the user is going to use the token. It' > > upto the user > > to decide whether they want to use the restore token or not. If > > they've WRSS capability > > security posture is anyways diluted. An attacker who would be clever > > enough to > > re-use `RSTORSSP` present in address space to restore using kernel > > prepared token, should > > anyways can be clever enough to use WRSS as well. > > > > It kind of makes shadow stack creation simpler for kernel to always > > place the token. > > This point is irrespective of whether to use system call or mmap. > > Think about like CRIU restoring the shadow stack, or other special > cases like that. Userspace can always overwrite the token, but this > involves some amount of extra work (extra writes, earlier faulting in > the page, etc). It is clunky and very negligibly worse. Earlier faulting in the page because the kernel is writing the token at base? > > > > > > > > > > > > > > In case of x86, Why need for two ways and why not always have a > > > > token > > > > at the bottom. > > > > The way x86 is going, user mode is responsible for establishing > > > > shadow stack and thus > > > > whenever shadow stack is created then if x86 kernel > > > > implementation > > > > always place a token > > > > at the base/bottom. > > > > > > There was also some discussion recently of adding a token AND an > > > end of > > > stack marker, as a potential solution for backtracing in ucontext > > > stacks. In this case it could cause an ABI break to just start > > > adding > > > the end of stack marker where the token was, and so would require a > > > new > > > map_shadow_stack flag. > > > > Was this discussed why restore token itself can't be used as marker > > for > > end of stack (if we assume there is always going to be one at the > > bottom). > > It's a unique value. An address pointing to itself. > > I thought the same thing at first, but it gets clobbered during the > pivot and push. aah I remember. It was changed from savessp/rstorssp pair to rstorssp/saveprevssp to follow the `make before break` model. > > > > > > > > > > > > > > Now user mode can do following:-- > > > > - If it has access to WRSS, it can sure go ahead and create a > > > > token > > > > of its choosing and > > > > overwrite kernel created token. and then do RSTORSSP on it's > > > > own > > > > created token. > > > > > > > > - If it doesn't have access to WRSS (and dont need to create > > > > its > > > > own token), it can do > > > > RSTORSSP on this. As soon as it does, no other thread in > > > > process > > > > can restore to it. > > > > On `fork`, you get the same un-restorable token. > > > > > > > > So why not always have a token at the bottom. > > > > This is my plan for riscv implementation as well (to have a token > > > > at > > > > the bottom) > > > > > > > > > > > > > > The other thing is: should shadow stack memory creation be > > > > > tightly > > > > > controlled? For example in x86 we limit this to anonymous > > > > > memory, > > > > > etc. > > > > > Some reasons for this are x86 specific, but some are not. So if > > > > > we > > > > > disallow most of the options why allow the interface to take > > > > > them? > > > > > And > > > > > then you are in the position of carefully maintaining a list of > > > > > not- > > > > > allowed options instead letting a list of allowed options sit > > > > > there. > > > > > > > > I am new to linux kernel and thus may be not able to follow the > > > > argument of > > > > limiting to anonymous memory. > > > > > > > > Why is limiting it to anonymous memory a problem. IIRC, ARM's > > > > PROT_MTE is applicable > > > > only to anonymous memory. I can probably find few more examples. > > > > > > Oh I see, they have a special arch VMA flag VM_MTE_ALLOWED that > > > only > > > gets set if all the rules are followed. Then PROT_MTE can only be > > > set > > > on that to set VM_MTE. That is kind of nice because certain other > > > special situations can choose to support it. > > > > That's because MTE is different. It allows to assign tags to existing > > virtual memory. So one need to know whether a memory can have tags > > assigned. > > > > > > > > It does take another arch vma flag though. For x86 I guess I would > > > need > > > to figure out how to squeeze VM_SHADOW_STACK into other flags to > > > have a > > > free flag to use the same method. It also only supports mprotect() > > > and > > > shadow stack would only want to support mmap(). And you still have > > > the > > > initialization stuff to plumb through. Yea, I think the PROT_MTE is > > > a > > > good thing to consider, but it's not super obvious to me how > > > similar > > > the logic would be for shadow stack. > > > > I dont think you need another VMA flag. Memory tagging allows adding > > tags > > to existing virtual memory. > > ...need another VMA flag to use the existing mmap arch breakouts in the > same way as VM_MTE. Of course changing mmap makes other solutions > possible. > > > That's why having `mprotect` makes sense for MTE. > > In shadow stack case, there is no requirement of changing a shadow > > stack > > to regular memory or vice-versa. > > uffd needs mprotect internals. You might take a look at it in regards > to your VM_WRITE/mprotect blocking approach for riscv. I was imagining, > even if mmap was the syscall, mprotect() would not be blocked in the > x86 case at least. The mprotect() blocking is a separate thing than the > syscall, right? Yes, mprotect blocking is a different thing. VM_XXX flags are not exposed to mprotect (or any memory mapping API). PROT_XXX flags are. On riscv, in my current plan if mprotect or mmap specifies PROT_WRITE (no PROT_READ), It'll be mapped to `VM_READ | VM_WRITE` on vma flags (this is to make sure we don't break compat with existing user code which has been using only PROT_WRITE) If PROT_SHADOWSTACK (new protection flag) is specified, it'll be mapped to `VM_WRITE` on the vma_flag. Yes I am aware of uffd. I intend to handle it the same way I am handling the fork of riscv shadow stack. core-mm write protect checks if VM_WRiTE is specified it'll convert PTE encodings to read-only. uffd for regular memory should work as it is. In case if someone was monitoring shadow stack memory, following could occur 1) A write happens on shadow stack memory, a store page fault would occur. 2) A read happens, this would be allowed. 3) A shadow stack load / store happens, a store access fault would occur. Case 1 and 3 are reported to the kernel and it can make sure the uffd monitor is reported about it. Was there a specific concern here with respect to uffd and x86 shadow stack? > > > > > All that's needed to change is `mmap`. `mprotect` should fail. > > Syscall > > approach gives that benefit by default because there is no protection > > flag > > for shadow stack. > > > > I was giving example that any feature which gives new meaning to > > virtual memory > > has been able to work with existing memory mapping APIs without the > > need of new > > system call (including whether you're dealing with anonymous memory). > > > > > > > > The question I'm asking though is, not "can mmap code and rules be > > > changed to enforce the required limitations?". I think it is yes. > > > But > > > the question is "why is that plumbing better than a new syscall?". > > > I > > > guess to get a better idea, the mmap solution would need to get > > > POCed. > > > I had half done this at one point, but abandoned the approach. > > > > > > For your question about why limit it, the special x86 case is the > > > Dirty=1,Write=0 PTE bit combination for shadow stacks. So for > > > shadow > > > stack you could have some confusion about whether a PTE is actually > > > dirty for writeback, etc. I wouldn't say it's known to be > > > impossible to > > > do MAP_SHARED, but it has not been fully analyzed enough to know > > > what > > > the changes would be. There were some solvable concrete issues that > > > tipped the scale as well. It was also not expected to be a common > > > usage, if at all. > > > > I am not sure how confusion of D=1,W=0 is not completely taken away > > by > > syscall approach. It'll always be there. One can only do things to > > minimize > > the chances. > > > > In case of syscall approach, syscall makes sure that > > > > `flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G` > > > > This can be easily checked in arch specific landing function for > > mmap. > > Right, this is why I listed two types of things in the mix here. The > memory features supported, and what the syscall is. You asked why limit > the memory features, so that is the explanation. > > > > > > > Additionally, If you always have the token at base, you don't need > > that ABI > > between user and kernel. > > > > > > > > > > The non-x86, general reasons for it, are for a smaller benefit. It > > > blocks a lot of ways shadow stack memory could be written to. Like > > > say > > > you have a memory mapped writable file, and you also map it shadow > > > stack. So it has better security properties depending on what your > > > threat model is. > > > > I wouldn't say any architecture should allow such primitives. It kind > > of defeats > > the purpose for shadow stack. Yes if some sort of secure memory is > > needed, there may > > be new ISA extensions for that. > > Yea, seems reasonable to prevent this regardless of the extra x86 > reasons, if that is what you are saying. It depends on people's threat > models (as always in security). > > > > > > > > > > > > > > Eventually syscall will also go ahead and use memory management > > > > code > > > > to > > > > perform mapping. So I didn't understand the reasoning here. The > > > > way > > > > syscall > > > > can limit it to anonymous memory, why mmap can't do the same if > > > > it > > > > sees > > > > PROT_SHADOWSTACK. > > > > > > > > > > > > > > The only benefit I've heard is that it saves creating a new > > > > > syscall, > > > > > but it also saves several MAP_ flags. That, and that the RFC > > > > > for > > > > > riscv > > > > > did a PROT_SHADOW_STACK to start. So, yes, two people asked the > > > > > same > > > > > question, but I'm still not seeing any benefits. Can you give > > > > > the > > > > > pros > > > > > and cons please? > > > > > > > > Again the way syscall will limit it to anonymous memory, Why mmap > > > > can't do same? > > > > There is precedence for it (like PROT_MTE is applicable only to > > > > anonymous memory) > > > > > > > > So if it can be done, then why introduce a new syscall? > > > > > > > > > > > > > > BTW, in glibc map_shadow_stack is called from arch code. So I > > > > > think > > > > > userspace wise, for this to affect other architectures there > > > > > would > > > > > need > > > > > to be some code that could do things generically, with somehow > > > > > the > > > > > shadow stack pivot abstracted but the shadow stack allocation > > > > > not. > > > > > > > > Agreed, yes it can be done in a way where it won't put tax on > > > > other > > > > architectures. > > > > > > > > But what about fragmentation within x86. Will x86 always choose > > > > to > > > > use system call > > > > method map shadow stack. If future re-factor results in x86 also > > > > use > > > > `mmap` method. > > > > Isn't it a mess for x86 glibc to figure out what to do; whether > > > > to > > > > use system call > > > > or `mmap`? > > > > > > > > > > Ok, so this is the downside I guess. What happens if we want to > > > support > > > the other types of memory in the future and end up using mmap for > > > this? > > > Then we have 15-20 lines of extra syscall wrapping code to maintain > > > to > > > support legacy. > > > > > > For the mmap solution, we have the downside of using extra MAP_ > > > flags, > > > and *some* amount of currently unknown vm_flag and address range > > > logic, > > > plus mmap arch breakouts to add to core MM. Like I said earlier, > > > you > > > would need to POC it out to see how bad that looks and get some > > > core MM > > > feedback on the new type of MAP flag usage. But, syscalls being > > > pretty > > > straightforward, it would probably be *some* amount of added > > > complexity > > > _now_ to support something that might happen in the future. I'm not > > > seeing either one as a landslide win. > > > > > > It's kind of an eternal software design philosophical question, > > > isn't > > > it? How much work should you do to prepare for things that might be > > > needed in the future? From what I've seen the balance in the kernel > > > seems to be to try not to paint yourself in to an ABI corner, but > > > otherwise let the kernel evolve naturally in response to real > > > usages. > > > If anyone wants to correct this, please do. But otherwise I think > > > the > > > new syscall is aligned with that. > > > > > > TBH, you are making me wonder if I'm missing something. It seems > > > you > > > strongly don't prefer this approach, but I'm not hearing any huge > > > potential negative impacts. And you also say it won't tax the riscv > > > implementation. Is this just something just smells bad here? Or it > > > would shrink the riscv series? > > > > No you're not missing anything. It's just wierdness of adding a > > system call > > which enforces certain MAP_XX flags and pretty much mapping API. > > And difference between architectures on how they will create shadow > > stack. + > > if x86 chooses to use `mmap` in future, then there is ugliness in > > user mode to > > decide which method to choose. > > Ok, I think I will leave it given it's entirely in arch/x86. It just > got some special error codes in the other thread today too. > > > > > And yes you got it right, to some extent there is my own selfishness > > playing out > > as well here to reduce riscv patches. > > > > Feel free to join the map_shadow_stack party. :) I am warming up to it :-)
The 03/16/2023 12:30, Deepak Gupta wrote: > On Tue, Mar 14, 2023 at 12:19 AM Mike Rapoport <rppt@kernel.org> wrote: > > As for the userspace convenience, it is anyway required to add special > > code for creating the shadow stack and it wouldn't matter if that code > > would use mmap(NEW_FLAG) or map_shadow_stack(). > > Yes *strictly* from userspace convenience, it doesn't matter which option. everybody seems to assume that the new syscall only matters for the code allocating the shadow stack. there are tools like strace, seccomp,.. that need to learn about the new syscall and anything that's built on top of them as well as libc api interposers like address sanitizer need to learn about the related new libc apis (if there are any.. which will be another long debate on the userspace side, delaying the usability of shadow stacks even more). such tools already know about mmap and often can handle new flags without much change. i agree that too much special logic in mmap is not ideal and using an mmap flag limits future extensions of both mmap and shadow map functionality. but i disagree that a new syscall is generally easy for userspace to deal with. in this case the cost seems acceptable to me, but it's not free at all.
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index c84d12608cd2..f65c671ce3b1 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -372,6 +372,7 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 64 map_shadow_stack sys_map_shadow_stack # # Due to a historical design error, certain syscalls are numbered differently diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h index 5a0256e73f1e..8148bdddbd2c 100644 --- a/arch/x86/include/uapi/asm/mman.h +++ b/arch/x86/include/uapi/asm/mman.h @@ -13,6 +13,9 @@ ((key) & 0x8 ? VM_PKEY_BIT3 : 0)) #endif +/* Flags for map_shadow_stack(2) */ +#define SHADOW_STACK_SET_TOKEN (1ULL << 0) /* Set up a restore token in the shadow stack */ + #include <asm-generic/mman.h> #endif /* _ASM_X86_MMAN_H */ diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index 40f0a55762a9..0a3decab70ee 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -17,6 +17,7 @@ #include <linux/compat.h> #include <linux/sizes.h> #include <linux/user.h> +#include <linux/syscalls.h> #include <asm/msr.h> #include <asm/fpu/xstate.h> #include <asm/fpu/types.h> @@ -71,19 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr) return 0; } -static unsigned long alloc_shstk(unsigned long size) +static unsigned long alloc_shstk(unsigned long addr, unsigned long size, + unsigned long token_offset, bool set_res_tok) { int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G; struct mm_struct *mm = current->mm; - unsigned long addr, unused; + unsigned long mapped_addr, unused; - mmap_write_lock(mm); - addr = do_mmap(NULL, 0, size, PROT_READ, flags, - VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL); + if (addr) + flags |= MAP_FIXED_NOREPLACE; + mmap_write_lock(mm); + mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags, + VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL); mmap_write_unlock(mm); - return addr; + if (!set_res_tok || IS_ERR_VALUE(mapped_addr)) + goto out; + + if (create_rstor_token(mapped_addr + token_offset, NULL)) { + vm_munmap(mapped_addr, size); + return -EINVAL; + } + +out: + return mapped_addr; } static unsigned long adjust_shstk_size(unsigned long size) @@ -134,7 +147,7 @@ static int shstk_setup(void) return -EOPNOTSUPP; size = adjust_shstk_size(0); - addr = alloc_shstk(size); + addr = alloc_shstk(0, size, 0, false); if (IS_ERR_VALUE(addr)) return PTR_ERR((void *)addr); @@ -178,7 +191,7 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags, return 0; size = adjust_shstk_size(stack_size); - addr = alloc_shstk(size); + addr = alloc_shstk(0, size, 0, false); if (IS_ERR_VALUE(addr)) return PTR_ERR((void *)addr); @@ -371,6 +384,36 @@ static int shstk_disable(void) return 0; } +SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags) +{ + bool set_tok = flags & SHADOW_STACK_SET_TOKEN; + unsigned long aligned_size; + + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) + return -EOPNOTSUPP; + + if (flags & ~SHADOW_STACK_SET_TOKEN) + return -EINVAL; + + /* If there isn't space for a token */ + if (set_tok && size < 8) + return -EINVAL; + + if (addr && addr <= 0xFFFFFFFF) + return -EINVAL; + + /* + * An overflow would result in attempting to write the restore token + * to the wrong location. Not catastrophic, but just return the right + * error code and block it. + */ + aligned_size = PAGE_ALIGN(size); + if (aligned_size < size) + return -EOVERFLOW; + + return alloc_shstk(addr, aligned_size, size, set_tok); +} + long shstk_prctl(struct task_struct *task, int option, unsigned long features) { if (option == ARCH_SHSTK_LOCK) { diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 33a0ee3bcb2e..392dc11e3556 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -1058,6 +1058,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags); asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len, unsigned long home_node, unsigned long flags); +asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags); /* * Architecture-specific system calls diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 45fa180cc56a..b12940ec5926 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -887,7 +887,7 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) #undef __NR_syscalls -#define __NR_syscalls 451 +#define __NR_syscalls 452 /* * 32 bit systems traditionally used different diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 860b2dcf3ac4..cb9aebd34646 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -381,6 +381,7 @@ COND_SYSCALL(vm86old); COND_SYSCALL(modify_ldt); COND_SYSCALL(vm86); COND_SYSCALL(kexec_file_load); +COND_SYSCALL(map_shadow_stack); /* s390 */ COND_SYSCALL(s390_pci_mmio_read);