[v6,33/41] x86/shstk: Introduce map_shadow_stack syscall

Message ID 20230218211433.26859-34-rick.p.edgecombe@intel.com
State New
Headers
Series Shadow stacks for userspace |

Commit Message

Edgecombe, Rick P Feb. 18, 2023, 9:14 p.m. UTC
  When operating with shadow stacks enabled, the kernel will automatically
allocate shadow stacks for new threads, however in some cases userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace allocating and
pivoting to userspace managed stacks.

Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to be setup
with a restore token so that userspace can pivot to them via the RSTORSSP
instruction. But, the security design of shadow stack's is that they
should not be written to except in limited circumstances. This presents a
problem for userspace, as to how userspace can provision this special
data, without allowing for the shadow stack to be generally writable.

Previously, a new PROT_SHADOW_STACK was attempted, which could be
mprotect()ed from RW permissions after the data was provisioned. This was
found to not be secure enough, as other thread's could write to the
shadow stack during the writable window.

The kernel can use a special instruction, WRUSS, to write directly to
userspace shadow stacks. So the solution can be that memory can be mapped
as shadow stack permissions from the beginning (never generally writable
in userspace), and the kernel itself can write the restore token.

First, a new madvise() flag was explored, which could operate on the
PROT_SHADOW_STACK memory. This had a couple downsides:
1. Extra checks were needed in mprotect() to prevent writable memory from
   ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to prevent
   restore tokens being written into the middle of pre-used shadow stacks.
   It is ideal to prevent restore tokens being added at arbitrary
   locations, so the check was to make sure the shadow stack had never been
   written to.
3. It stood out from the rest of the madvise flags, as more of direct
   action than a hint at future desired behavior.

So rather than repurpose two existing syscalls (mmap, madvise) that don't
quite fit, just implement a new map_shadow_stack syscall to allow
userspace to map and setup new shadow stacks in one step. While ucontext
is the primary motivator, userspace may have other unforeseen reasons to
setup it's own shadow stacks using the WRSS instruction. Towards this
provide a flag so that stacks can be optionally setup securely for the
common case of ucontext without enabling WRSS. Or potentially have the
kernel set up the shadow stack in some new way.

The following example demonstrates how to create a new shadow stack with
map_shadow_stack:
void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v5:
 - Fix addr/mapped_addr (Kees)
 - Switch to EOPNOTSUPP (Kees suggested ENOTSUPP, but checkpatch
   suggests this)
 - Return error for addresses below 4G

v3:
 - Change syscall common -> 64 (Kees)
 - Use bit shift notation instead of 0x1 for uapi header (Kees)
 - Call do_mmap() with MAP_FIXED_NOREPLACE (Kees)
 - Block unsupported flags (Kees)
 - Require size >= 8 to set token (Kees)

v2:
 - Change syscall to take address like mmap() for CRIU's usage

v1:
 - New patch (replaces PROT_SHADOW_STACK).
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 arch/x86/include/uapi/asm/mman.h       |  3 ++
 arch/x86/kernel/shstk.c                | 59 ++++++++++++++++++++++----
 include/linux/syscalls.h               |  1 +
 include/uapi/asm-generic/unistd.h      |  2 +-
 kernel/sys_ni.c                        |  1 +
 6 files changed, 58 insertions(+), 9 deletions(-)
  

Comments

Deepak Gupta Feb. 23, 2023, 12:03 a.m. UTC | #1
On Sat, Feb 18, 2023 at 01:14:25PM -0800, Rick Edgecombe wrote:
>When operating with shadow stacks enabled, the kernel will automatically
>allocate shadow stacks for new threads, however in some cases userspace
>will need additional shadow stacks. The main example of this is the
>ucontext family of functions, which require userspace allocating and
>pivoting to userspace managed stacks.
>
>Unlike most other user memory permissions, shadow stacks need to be
>provisioned with special data in order to be useful. They need to be setup
>with a restore token so that userspace can pivot to them via the RSTORSSP
>instruction. But, the security design of shadow stack's is that they
>should not be written to except in limited circumstances. This presents a
>problem for userspace, as to how userspace can provision this special
>data, without allowing for the shadow stack to be generally writable.
>
>Previously, a new PROT_SHADOW_STACK was attempted, which could be
>mprotect()ed from RW permissions after the data was provisioned. This was
>found to not be secure enough, as other thread's could write to the
>shadow stack during the writable window.
>
>The kernel can use a special instruction, WRUSS, to write directly to
>userspace shadow stacks. So the solution can be that memory can be mapped
>as shadow stack permissions from the beginning (never generally writable
>in userspace), and the kernel itself can write the restore token.
>
>First, a new madvise() flag was explored, which could operate on the
>PROT_SHADOW_STACK memory. This had a couple downsides:
>1. Extra checks were needed in mprotect() to prevent writable memory from
>   ever becoming PROT_SHADOW_STACK.
>2. Extra checks/vma state were needed in the new madvise() to prevent
>   restore tokens being written into the middle of pre-used shadow stacks.
>   It is ideal to prevent restore tokens being added at arbitrary
>   locations, so the check was to make sure the shadow stack had never been
>   written to.
>3. It stood out from the rest of the madvise flags, as more of direct
>   action than a hint at future desired behavior.
>
>So rather than repurpose two existing syscalls (mmap, madvise) that don't
>quite fit, just implement a new map_shadow_stack syscall to allow
>userspace to map and setup new shadow stacks in one step. While ucontext
>is the primary motivator, userspace may have other unforeseen reasons to
>setup it's own shadow stacks using the WRSS instruction. Towards this
>provide a flag so that stacks can be optionally setup securely for the
>common case of ucontext without enabling WRSS. Or potentially have the
>kernel set up the shadow stack in some new way.

Was following ever attempted?

void *shstk = mmap(0, size, PROT_SHADOWSTACK, ...);
- limit PROT_SHADOWSTACK protection flag to only mmap (and thus mprotect can't
   convert memory from shadow stack to non-shadow stack type or vice versa)
- limit PROT_SHADOWSTACK protection flag to anonymous memory only.
- top level mmap handler to put a token at the base using WRUSS if prot == PROT_SHADOWSTACK

You essentially would get shadow stack manufacturing with existing (single) syscall.
Acting a bit selfish here, this allows other architectures as well to re-use this and 
do their own implementation of mapping and placing the token at the base.

>
>The following example demonstrates how to create a new shadow stack with
>map_shadow_stack:
>void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
>
>Tested-by: Pengfei Xu <pengfei.xu@intel.com>
>Tested-by: John Allen <john.allen@amd.com>
>Reviewed-by: Kees Cook <keescook@chromium.org>
>Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>
>---
>v5:
> - Fix addr/mapped_addr (Kees)
> - Switch to EOPNOTSUPP (Kees suggested ENOTSUPP, but checkpatch
>   suggests this)
> - Return error for addresses below 4G
>
>v3:
> - Change syscall common -> 64 (Kees)
> - Use bit shift notation instead of 0x1 for uapi header (Kees)
> - Call do_mmap() with MAP_FIXED_NOREPLACE (Kees)
> - Block unsupported flags (Kees)
> - Require size >= 8 to set token (Kees)
>
>v2:
> - Change syscall to take address like mmap() for CRIU's usage
>
>v1:
> - New patch (replaces PROT_SHADOW_STACK).
>---
> arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> arch/x86/include/uapi/asm/mman.h       |  3 ++
> arch/x86/kernel/shstk.c                | 59 ++++++++++++++++++++++----
> include/linux/syscalls.h               |  1 +
> include/uapi/asm-generic/unistd.h      |  2 +-
> kernel/sys_ni.c                        |  1 +
> 6 files changed, 58 insertions(+), 9 deletions(-)
>
>diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
>index c84d12608cd2..f65c671ce3b1 100644
>--- a/arch/x86/entry/syscalls/syscall_64.tbl
>+++ b/arch/x86/entry/syscalls/syscall_64.tbl
>@@ -372,6 +372,7 @@
> 448	common	process_mrelease	sys_process_mrelease
> 449	common	futex_waitv		sys_futex_waitv
> 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
>+451	64	map_shadow_stack	sys_map_shadow_stack
>
> #
> # Due to a historical design error, certain syscalls are numbered differently
>diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
>index 5a0256e73f1e..8148bdddbd2c 100644
>--- a/arch/x86/include/uapi/asm/mman.h
>+++ b/arch/x86/include/uapi/asm/mman.h
>@@ -13,6 +13,9 @@
> 		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
> #endif
>
>+/* Flags for map_shadow_stack(2) */
>+#define SHADOW_STACK_SET_TOKEN	(1ULL << 0)	/* Set up a restore token in the shadow stack */
>+
> #include <asm-generic/mman.h>
>
> #endif /* _ASM_X86_MMAN_H */
>diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
>index 40f0a55762a9..0a3decab70ee 100644
>--- a/arch/x86/kernel/shstk.c
>+++ b/arch/x86/kernel/shstk.c
>@@ -17,6 +17,7 @@
> #include <linux/compat.h>
> #include <linux/sizes.h>
> #include <linux/user.h>
>+#include <linux/syscalls.h>
> #include <asm/msr.h>
> #include <asm/fpu/xstate.h>
> #include <asm/fpu/types.h>
>@@ -71,19 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
> 	return 0;
> }
>
>-static unsigned long alloc_shstk(unsigned long size)
>+static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
>+				 unsigned long token_offset, bool set_res_tok)
> {
> 	int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
> 	struct mm_struct *mm = current->mm;
>-	unsigned long addr, unused;
>+	unsigned long mapped_addr, unused;
>
>-	mmap_write_lock(mm);
>-	addr = do_mmap(NULL, 0, size, PROT_READ, flags,
>-		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
>+	if (addr)
>+		flags |= MAP_FIXED_NOREPLACE;
>
>+	mmap_write_lock(mm);
>+	mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
>+			      VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
> 	mmap_write_unlock(mm);
>
>-	return addr;
>+	if (!set_res_tok || IS_ERR_VALUE(mapped_addr))
>+		goto out;
>+
>+	if (create_rstor_token(mapped_addr + token_offset, NULL)) {
>+		vm_munmap(mapped_addr, size);
>+		return -EINVAL;
>+	}
>+
>+out:
>+	return mapped_addr;
> }
>
> static unsigned long adjust_shstk_size(unsigned long size)
>@@ -134,7 +147,7 @@ static int shstk_setup(void)
> 		return -EOPNOTSUPP;
>
> 	size = adjust_shstk_size(0);
>-	addr = alloc_shstk(size);
>+	addr = alloc_shstk(0, size, 0, false);
> 	if (IS_ERR_VALUE(addr))
> 		return PTR_ERR((void *)addr);
>
>@@ -178,7 +191,7 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
> 		return 0;
>
> 	size = adjust_shstk_size(stack_size);
>-	addr = alloc_shstk(size);
>+	addr = alloc_shstk(0, size, 0, false);
> 	if (IS_ERR_VALUE(addr))
> 		return PTR_ERR((void *)addr);
>
>@@ -371,6 +384,36 @@ static int shstk_disable(void)
> 	return 0;
> }
>
>+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
>+{
>+	bool set_tok = flags & SHADOW_STACK_SET_TOKEN;
>+	unsigned long aligned_size;
>+
>+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
>+		return -EOPNOTSUPP;
>+
>+	if (flags & ~SHADOW_STACK_SET_TOKEN)
>+		return -EINVAL;
>+
>+	/* If there isn't space for a token */
>+	if (set_tok && size < 8)
>+		return -EINVAL;
>+
>+	if (addr && addr <= 0xFFFFFFFF)
>+		return -EINVAL;
>+
>+	/*
>+	 * An overflow would result in attempting to write the restore token
>+	 * to the wrong location. Not catastrophic, but just return the right
>+	 * error code and block it.
>+	 */
>+	aligned_size = PAGE_ALIGN(size);
>+	if (aligned_size < size)
>+		return -EOVERFLOW;
>+
>+	return alloc_shstk(addr, aligned_size, size, set_tok);
>+}
>+
> long shstk_prctl(struct task_struct *task, int option, unsigned long features)
> {
> 	if (option == ARCH_SHSTK_LOCK) {
>diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>index 33a0ee3bcb2e..392dc11e3556 100644
>--- a/include/linux/syscalls.h
>+++ b/include/linux/syscalls.h
>@@ -1058,6 +1058,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
> asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
> 					    unsigned long home_node,
> 					    unsigned long flags);
>+asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
>
> /*
>  * Architecture-specific system calls
>diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
>index 45fa180cc56a..b12940ec5926 100644
>--- a/include/uapi/asm-generic/unistd.h
>+++ b/include/uapi/asm-generic/unistd.h
>@@ -887,7 +887,7 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
> __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>
> #undef __NR_syscalls
>-#define __NR_syscalls 451
>+#define __NR_syscalls 452
>
> /*
>  * 32 bit systems traditionally used different
>diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>index 860b2dcf3ac4..cb9aebd34646 100644
>--- a/kernel/sys_ni.c
>+++ b/kernel/sys_ni.c
>@@ -381,6 +381,7 @@ COND_SYSCALL(vm86old);
> COND_SYSCALL(modify_ldt);
> COND_SYSCALL(vm86);
> COND_SYSCALL(kexec_file_load);
>+COND_SYSCALL(map_shadow_stack);
>
> /* s390 */
> COND_SYSCALL(s390_pci_mmio_read);
>-- 
>2.17.1
>
  
Edgecombe, Rick P Feb. 23, 2023, 1:11 a.m. UTC | #2
On Wed, 2023-02-22 at 16:03 -0800, Deepak Gupta wrote:
> On Sat, Feb 18, 2023 at 01:14:25PM -0800, Rick Edgecombe wrote:
> > When operating with shadow stacks enabled, the kernel will
> > automatically
> > allocate shadow stacks for new threads, however in some cases
> > userspace
> > will need additional shadow stacks. The main example of this is the
> > ucontext family of functions, which require userspace allocating
> > and
> > pivoting to userspace managed stacks.
> > 
> > Unlike most other user memory permissions, shadow stacks need to be
> > provisioned with special data in order to be useful. They need to
> > be setup
> > with a restore token so that userspace can pivot to them via the
> > RSTORSSP
> > instruction. But, the security design of shadow stack's is that
> > they
> > should not be written to except in limited circumstances. This
> > presents a
> > problem for userspace, as to how userspace can provision this
> > special
> > data, without allowing for the shadow stack to be generally
> > writable.
> > 
> > Previously, a new PROT_SHADOW_STACK was attempted, which could be
> > mprotect()ed from RW permissions after the data was provisioned.
> > This was
> > found to not be secure enough, as other thread's could write to the
> > shadow stack during the writable window.
> > 
> > The kernel can use a special instruction, WRUSS, to write directly
> > to
> > userspace shadow stacks. So the solution can be that memory can be
> > mapped
> > as shadow stack permissions from the beginning (never generally
> > writable
> > in userspace), and the kernel itself can write the restore token.
> > 
> > First, a new madvise() flag was explored, which could operate on
> > the
> > PROT_SHADOW_STACK memory. This had a couple downsides:
> > 1. Extra checks were needed in mprotect() to prevent writable
> > memory from
> >    ever becoming PROT_SHADOW_STACK.
> > 2. Extra checks/vma state were needed in the new madvise() to
> > prevent
> >    restore tokens being written into the middle of pre-used shadow
> > stacks.
> >    It is ideal to prevent restore tokens being added at arbitrary
> >    locations, so the check was to make sure the shadow stack had
> > never been
> >    written to.
> > 3. It stood out from the rest of the madvise flags, as more of
> > direct
> >    action than a hint at future desired behavior.
> > 
> > So rather than repurpose two existing syscalls (mmap, madvise) that
> > don't
> > quite fit, just implement a new map_shadow_stack syscall to allow
> > userspace to map and setup new shadow stacks in one step. While
> > ucontext
> > is the primary motivator, userspace may have other unforeseen
> > reasons to
> > setup it's own shadow stacks using the WRSS instruction. Towards
> > this
> > provide a flag so that stacks can be optionally setup securely for
> > the
> > common case of ucontext without enabling WRSS. Or potentially have
> > the
> > kernel set up the shadow stack in some new way.
> 
> Was following ever attempted?
> 
> void *shstk = mmap(0, size, PROT_SHADOWSTACK, ...);
> - limit PROT_SHADOWSTACK protection flag to only mmap (and thus
> mprotect can't
>    convert memory from shadow stack to non-shadow stack type or vice
> versa)
> - limit PROT_SHADOWSTACK protection flag to anonymous memory only.
> - top level mmap handler to put a token at the base using WRUSS if
> prot == PROT_SHADOWSTACK
> 
> You essentially would get shadow stack manufacturing with existing
> (single) syscall.
> Acting a bit selfish here, this allows other architectures as well to
> re-use this and 
> do their own implementation of mapping and placing the token at the
> base.

Yes, I looked at it. You end up with a pile of checks and hooks added
to mmap() and various other places as you outline. We also now have the
MAP_ABOVE4G limitation for x86 shadow stack that would need checking
for too. It's not exactly a clean fit. Then, callers would have to pass
special x86 flags in anyway.

It doesn't seem like the complexity of the checks is worth saving the
tiny syscall. Is there some reason why riscv can't use the same syscall
stub? It doesn't need to live forever in x86 code. Not sure what the
savings are for riscv of the mmap+checks approach are either...

I did wonder if there could be some sort of more general syscall for
mapping and provisioning special security-type memory. But we probably
need a few more non-shadow stack examples to get an idea of what that
would look like.
  
Deepak Gupta Feb. 23, 2023, 9:20 p.m. UTC | #3
On Wed, Feb 22, 2023 at 5:11 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2023-02-22 at 16:03 -0800, Deepak Gupta wrote:
> > On Sat, Feb 18, 2023 at 01:14:25PM -0800, Rick Edgecombe wrote:
> > > When operating with shadow stacks enabled, the kernel will
> > > automatically
> > > allocate shadow stacks for new threads, however in some cases
> > > userspace
> > > will need additional shadow stacks. The main example of this is the
> > > ucontext family of functions, which require userspace allocating
> > > and
> > > pivoting to userspace managed stacks.
> > >
> > > Unlike most other user memory permissions, shadow stacks need to be
> > > provisioned with special data in order to be useful. They need to
> > > be setup
> > > with a restore token so that userspace can pivot to them via the
> > > RSTORSSP
> > > instruction. But, the security design of shadow stack's is that
> > > they
> > > should not be written to except in limited circumstances. This
> > > presents a
> > > problem for userspace, as to how userspace can provision this
> > > special
> > > data, without allowing for the shadow stack to be generally
> > > writable.
> > >
> > > Previously, a new PROT_SHADOW_STACK was attempted, which could be
> > > mprotect()ed from RW permissions after the data was provisioned.
> > > This was
> > > found to not be secure enough, as other thread's could write to the
> > > shadow stack during the writable window.
> > >
> > > The kernel can use a special instruction, WRUSS, to write directly
> > > to
> > > userspace shadow stacks. So the solution can be that memory can be
> > > mapped
> > > as shadow stack permissions from the beginning (never generally
> > > writable
> > > in userspace), and the kernel itself can write the restore token.
> > >
> > > First, a new madvise() flag was explored, which could operate on
> > > the
> > > PROT_SHADOW_STACK memory. This had a couple downsides:
> > > 1. Extra checks were needed in mprotect() to prevent writable
> > > memory from
> > >    ever becoming PROT_SHADOW_STACK.
> > > 2. Extra checks/vma state were needed in the new madvise() to
> > > prevent
> > >    restore tokens being written into the middle of pre-used shadow
> > > stacks.
> > >    It is ideal to prevent restore tokens being added at arbitrary
> > >    locations, so the check was to make sure the shadow stack had
> > > never been
> > >    written to.
> > > 3. It stood out from the rest of the madvise flags, as more of
> > > direct
> > >    action than a hint at future desired behavior.
> > >
> > > So rather than repurpose two existing syscalls (mmap, madvise) that
> > > don't
> > > quite fit, just implement a new map_shadow_stack syscall to allow
> > > userspace to map and setup new shadow stacks in one step. While
> > > ucontext
> > > is the primary motivator, userspace may have other unforeseen
> > > reasons to
> > > setup it's own shadow stacks using the WRSS instruction. Towards
> > > this
> > > provide a flag so that stacks can be optionally setup securely for
> > > the
> > > common case of ucontext without enabling WRSS. Or potentially have
> > > the
> > > kernel set up the shadow stack in some new way.
> >
> > Was following ever attempted?
> >
> > void *shstk = mmap(0, size, PROT_SHADOWSTACK, ...);
> > - limit PROT_SHADOWSTACK protection flag to only mmap (and thus
> > mprotect can't
> >    convert memory from shadow stack to non-shadow stack type or vice
> > versa)
> > - limit PROT_SHADOWSTACK protection flag to anonymous memory only.
> > - top level mmap handler to put a token at the base using WRUSS if
> > prot == PROT_SHADOWSTACK
> >
> > You essentially would get shadow stack manufacturing with existing
> > (single) syscall.
> > Acting a bit selfish here, this allows other architectures as well to
> > re-use this and
> > do their own implementation of mapping and placing the token at the
> > base.
>
> Yes, I looked at it. You end up with a pile of checks and hooks added
> to mmap() and various other places as you outline. We also now have the
> MAP_ABOVE4G limitation for x86 shadow stack that would need checking
> for too. It's not exactly a clean fit. Then, callers would have to pass
> special x86 flags in anyway.

riscv has mechanisms using which a 32bit app can run on 64bit kernel.
So technically if there are 32bit and 64bit code in address space,
MAP_ABOVE4G could be useful.
Although I am not sure (or aware of) if there are such requirement
from app/developers yet (to guarantee address mapping above 4G)

But I see this as orthogonal to memory protection flags.

>
> It doesn't seem like the complexity of the checks is worth saving the
> tiny syscall. Is there some reason why riscv can't use the same syscall
> stub? It doesn't need to live forever in x86 code. Not sure what the
> savings are for riscv of the mmap+checks approach are either...

I don't see a lot of extra complexity here.
If `mprotect` and friends don't know about `PROT_SHADOWSTACK`, they'll
just fail by default (which is desired)

It's only `mmap` that needs to be enlightened. And it can just pass
`VMA_SHADOW_STACK` to `do_mmap` if input is `PROT_SHADOWSTACK`.

Adding a syscall just for mapping shadow stack is weird when it can be
solved with existing system calls.
As you say in your response below, it would be good to have such a
syscall which serve larger purposes (e.g. provisioning special
security-type memory)

arm64's memory tagging is one such example. Not exactly security-type
memory (but eventual application is security for this feature) .
It adds extra meaning to virtual addresses (i.e. an address has tags).
arm64 went about using a protection flag `PROT_MTE` instead of a
special system call.

Being said that since this patch has gone through multiple revisions
and I am new to the party. If others dont have issues on this special
system call,
I think it's fine then. In case of riscv I can choose to use this
mechanism or go via arm's route to define PROT_SHADOWSTACK which is
arch specific.

>
> I did wonder if there could be some sort of more general syscall for
> mapping and provisioning special security-type memory. But we probably
> need a few more non-shadow stack examples to get an idea of what that
> would look like.

As I mentioned memory tagging and thus PROT_MTE is already such a use
case which uses `mmap/mprotect` protection flags to designate special
meaning to a virtual address.
  
Edgecombe, Rick P Feb. 23, 2023, 11:42 p.m. UTC | #4
On Thu, 2023-02-23 at 13:20 -0800, Deepak Gupta wrote:
> > It doesn't seem like the complexity of the checks is worth saving
> > the
> > tiny syscall. Is there some reason why riscv can't use the same
> > syscall
> > stub? It doesn't need to live forever in x86 code. Not sure what
> > the
> > savings are for riscv of the mmap+checks approach are either...
> 
> I don't see a lot of extra complexity here.
> If `mprotect` and friends don't know about `PROT_SHADOWSTACK`,
> they'll
> just fail by default (which is desired)

To me, the options look like: cram it into an existing syscall or
create one that does exactly what is needed.

To replace the new syscall with mmap(), with the existing MM
implementation, I think you would need to add to mmap:
1. Limit PROT_SHADOW_STACK to anonymous, private memory.
2. Limit PROT_SHADOW_STACK to MAP_ABOVE4G (or create a MAP_SHADOW_STACK
that does both). MAP_ABOVE4G protects from using shadow stack in 32 bit
mode, after some ABI issues were raised. So it is supposed to be
enforced.
3. Add additional logic for MAP_ABOVE4G to work with MAP_FIXED as is
required by CRIU.
4. Add another MAP_ flag to specify whether to write the token (AFAIK
the first flag that would do something like that. So then likely have
debates about whether it fits into the flags). Sort out the behavior of
trying to write the token to a non-PROT_SHADOW_STACK mmap call.
5. Add arch breakout for mmap to call into the token writing.

I think you are right that no mprotect() changes would be needed with
the current shadow stack MM implementation (it wasn't the case for the
original PROT_SHADOW_STACK). But I'm still not sure if it is simpler
then the syscall.

I do wonder a little bit about trying to remove some of these
limitations of the existing shadow stack MM implementation, which could
make an mmap based interface a little better fit. Like if someday
shadow stack memory added support for all the options that mmap
supports. But I'm not sure if that would just result in more complexity
in other places (MM, signals) that would barely get used. Like, I'm not
sure how shadow stack permissioned mmap()ed files should work. You
would have to require writable files to map them as shadow stack, but
then that is not as locked down as we have today with the anonymous
memory. And a "shadow stack" file permission would seem a bit
overboard.

Then probably some more dirty bit issues with mmaped files. I'm not
fully sure. That one was definitely an issue when PROT_SHADOW_STACK was
dropped, but David Hildenbrand has now removed at least some of the
issues it hit.

So the optimum solution might depend on if we add more shadow stack MM
support later. But it is always possible to add mmap support later too.

> 
> It's only `mmap` that needs to be enlightened. And it can just pass
> `VMA_SHADOW_STACK` to `do_mmap` if input is `PROT_SHADOWSTACK`.
> 
> Adding a syscall just for mapping shadow stack is weird when it can
> be
> solved with existing system calls.
> As you say in your response below, it would be good to have such a
> syscall which serve larger purposes (e.g. provisioning special
> security-type memory)
> 
> arm64's memory tagging is one such example. Not exactly security-type
> memory (but eventual application is security for this feature) .
> It adds extra meaning to virtual addresses (i.e. an address has
> tags).
> arm64 went about using a protection flag `PROT_MTE` instead of a
> special system call.

It looks like that memory can be written with a special instruction?
And so it doesn't need to be provisioned by the kernel, as is the case
here.

> 
> Being said that since this patch has gone through multiple revisions
> and I am new to the party. If others dont have issues on this special
> system call,
> I think it's fine then. In case of riscv I can choose to use this
> mechanism or go via arm's route to define PROT_SHADOWSTACK which is
> arch specific.

Ok, sounds good. The advice I got from maintainers after the many
attempts to cram it into the existing interfaces was: don't be afraid
to add a syscall. And sure enough, when MAP_ABOVE4G came along it
continued to make things simpler.
  

Patch

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..f65c671ce3b1 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@ 
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	64	map_shadow_stack	sys_map_shadow_stack
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 5a0256e73f1e..8148bdddbd2c 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -13,6 +13,9 @@ 
 		((key) & 0x8 ? VM_PKEY_BIT3 : 0))
 #endif
 
+/* Flags for map_shadow_stack(2) */
+#define SHADOW_STACK_SET_TOKEN	(1ULL << 0)	/* Set up a restore token in the shadow stack */
+
 #include <asm-generic/mman.h>
 
 #endif /* _ASM_X86_MMAN_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 40f0a55762a9..0a3decab70ee 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -17,6 +17,7 @@ 
 #include <linux/compat.h>
 #include <linux/sizes.h>
 #include <linux/user.h>
+#include <linux/syscalls.h>
 #include <asm/msr.h>
 #include <asm/fpu/xstate.h>
 #include <asm/fpu/types.h>
@@ -71,19 +72,31 @@  static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
 	return 0;
 }
 
-static unsigned long alloc_shstk(unsigned long size)
+static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
+				 unsigned long token_offset, bool set_res_tok)
 {
 	int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
 	struct mm_struct *mm = current->mm;
-	unsigned long addr, unused;
+	unsigned long mapped_addr, unused;
 
-	mmap_write_lock(mm);
-	addr = do_mmap(NULL, 0, size, PROT_READ, flags,
-		       VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+	if (addr)
+		flags |= MAP_FIXED_NOREPLACE;
 
+	mmap_write_lock(mm);
+	mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+			      VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
 	mmap_write_unlock(mm);
 
-	return addr;
+	if (!set_res_tok || IS_ERR_VALUE(mapped_addr))
+		goto out;
+
+	if (create_rstor_token(mapped_addr + token_offset, NULL)) {
+		vm_munmap(mapped_addr, size);
+		return -EINVAL;
+	}
+
+out:
+	return mapped_addr;
 }
 
 static unsigned long adjust_shstk_size(unsigned long size)
@@ -134,7 +147,7 @@  static int shstk_setup(void)
 		return -EOPNOTSUPP;
 
 	size = adjust_shstk_size(0);
-	addr = alloc_shstk(size);
+	addr = alloc_shstk(0, size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return PTR_ERR((void *)addr);
 
@@ -178,7 +191,7 @@  int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
 		return 0;
 
 	size = adjust_shstk_size(stack_size);
-	addr = alloc_shstk(size);
+	addr = alloc_shstk(0, size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return PTR_ERR((void *)addr);
 
@@ -371,6 +384,36 @@  static int shstk_disable(void)
 	return 0;
 }
 
+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
+{
+	bool set_tok = flags & SHADOW_STACK_SET_TOKEN;
+	unsigned long aligned_size;
+
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return -EOPNOTSUPP;
+
+	if (flags & ~SHADOW_STACK_SET_TOKEN)
+		return -EINVAL;
+
+	/* If there isn't space for a token */
+	if (set_tok && size < 8)
+		return -EINVAL;
+
+	if (addr && addr <= 0xFFFFFFFF)
+		return -EINVAL;
+
+	/*
+	 * An overflow would result in attempting to write the restore token
+	 * to the wrong location. Not catastrophic, but just return the right
+	 * error code and block it.
+	 */
+	aligned_size = PAGE_ALIGN(size);
+	if (aligned_size < size)
+		return -EOVERFLOW;
+
+	return alloc_shstk(addr, aligned_size, size, set_tok);
+}
+
 long shstk_prctl(struct task_struct *task, int option, unsigned long features)
 {
 	if (option == ARCH_SHSTK_LOCK) {
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 33a0ee3bcb2e..392dc11e3556 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1058,6 +1058,7 @@  asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
+asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..b12940ec5926 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -887,7 +887,7 @@  __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..cb9aebd34646 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -381,6 +381,7 @@  COND_SYSCALL(vm86old);
 COND_SYSCALL(modify_ldt);
 COND_SYSCALL(vm86);
 COND_SYSCALL(kexec_file_load);
+COND_SYSCALL(map_shadow_stack);
 
 /* s390 */
 COND_SYSCALL(s390_pci_mmio_read);