[v9,00/42] Shadow stacks for userspace

Message ID 20230613001108.3040476-1-rick.p.edgecombe@intel.com
Headers
Series Shadow stacks for userspace |

Message

Edgecombe, Rick P June 13, 2023, 12:10 a.m. UTC
  Hi,

This series implements Shadow Stacks for userspace using x86's Control-flow 
Enforcement Technology (CET). CET consists of two related security
features: shadow stacks and indirect branch tracking. This series
implements just the  shadow stack part of this feature, and just for
userspace.

The main use case for shadow stack is providing protection against return 
oriented programming attacks. It works by maintaining a secondary (shadow) 
stack using a special memory type that has protections against 
modification. When executing a CALL instruction, the processor pushes the 
return address to both the normal stack and to the special permission 
shadow stack. Upon RET, the processor pops the shadow stack copy and 
compares it to the normal stack copy. For more details, see the 
coverletter from v1 [0].

Shadow Stack was rejected by Linus for 6.4 [1][2]. This is a new version 
that addresses his concerns. In the months since the series was queued in 
tip, there were also some non-critical things that turned up, that I was 
planning do as fast follow ups. Since we are doing a re-spin, I thought to 
just include them in the initial series. (see 3 and 4) Also, the whole 
series has been re-ordered, after some comments from Linus prompted some 
reflection.

Most of the patches are the same, so I’ll specifically list the patches 
with changes to help focus review.

1. Redo of the pte_mkwrite() refactoring patches
------------------------------------------------
The point of these patches was to make pte_mkwrite() take a VMA. 
Unfortunately the original version of this refactor had a bug. Linus 
suggested an alternate way of doing the refactor that would be less error 
prone. It sounds like this will be used by the riscv and possibly arm 
shadow stack features. It would be great to collect some Reviewed-by tags 
on these from anyone else that would depend on them.

Changed (and renamed) patches:
	mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
	mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()
	mm: Make pte_mkwrite() take a VMA

2. SavedDirty overhaul
----------------------
The Shadow Stack PTEs are defined by the HW as Write=0,Dirty=1. Since 
Linux usually creates PTEs like that when it write-protects dirty memory, 
the series introduced a SavedDirty software bit. When a PTE gets 
write-protected the Dirty bit shifts to the SavedDirty bit so it won't
become shadow stack. When it is made writable, the opposite happens.

In the previously queued version, the SavedDirty bit was only used when 
shadow stack was configured and available on the CPU. But this created two 
versions of the Dirty bit behavior on x86, adding complexity. Linus 
objected to this, and also the use of conditional control flow logic 
instead of bit math. After some trial and error, I ended up with something 
that tries to incorporate the feedback, but with some adjustments on the 
specifics. I would like some feedback on the maintainability tradeoffs 
taken and some scrutiny on the correctness as well.


First of all, switching to bit math for the SavedDirty dance really seems 
to be a big improvement. The conditional part is now branchless and 
isolated to two functions. Descriptions of the other changes follows, and 
a little less of a clear win to me.


One thing that came up in this discussion was the performance impact of 
the CMPXCHG loop added to ptep_set_wrprotect() in order to atomically do 
the shift of Dirty to SavedDirty when a live PTE is being write-protected. 
The concern was that this could impact the performance of fork() where it 
is used to write-protect memory in the parent MM.

Linus had suggested to optimize the single threaded fork case to offset 
the concerns of a performance impact. However, trying to stress this case, 
I was unable to concoct a microbenchmark to show any slowdown of the LOCK 
CMPXCHG loop vs the original LOCK AND. Hypothetically the CMPXCHG could 
scale worse, but since I couldn’t actually entice it to show any slowdown, 
I was thinking that the original worries might have been misplaced and we 
could get away with the unconditional CMPXCHG loop.

As Dave previously mentioned, the other wrinkle in all of this is that on 
CPUs that don’t support shadow stack, a CPU could rarely set Dirty=1 on a 
PTE with Write=0. So the kernel logic needs to be robust to PTEs that have 
Write=0,Dirty=1 PTEs in non-shadow stack cases on these CPUs. And also 
should be able to handle Write=0,Dirty=1,SavedDirty=1 PTEs. Similarly, a 
KNL (Xeon Phi platform) erratum can result in Dirty=1 bits getting set on 
Present=0 PTEs.

In order to make the core-MM logic work correctly with shadow stack, 
pte_write() needs to also return true for Write=0,Dirty=1 memory. Since 
those older platforms can cause Dirty=1,Write=0 PTEs despite the kernels 
efforts to not create any itself, the kernel can’t have this shadow stack 
pte_write() logic when running on them. So the kernel can only check for 
shadow stack memory in pte_write() when shadow stack is supported by the 
CPU. Also, the kernel needs to make sure to not trigger any warnings when 
it sees a Dirty=0,Write=1 PTE in an unexpected place. So these are also 
only done when shadow stack is supported on the CPU. These checks are 
isolated to single place in pte_shstk().

So in the end, we can shift the Dirty bit around unconditionally, but the 
kernels behavior around the Dirty bit still needs to adjust depending on 
the CPU actually supporting shadow stack.

Refactoring the SavedDirty<->Dirty setting logic into bitmath made it so 
it would be a lot easier to compile the SavedDirty bit out if needed. 
Basically it can be removed with only two checks in 
mksaveddirty_shift()/clear_saveddirty_shift() (see “x86/mm: Introduce 
_PAGE_SAVED_DIRTY”).

That all makes me wonder if it would still be better to disable SavedDirty 
when shadow stack is not supported. One aspect of the idea to make 
SavedDirty  unconditional, was to unify the sets of rules to have to 
reason about. But due to the behavior of the pre-shadow stack CPUs, this 
also comes at the increased runtime behaviors we need to worry about. The 
other point brought up was the increased testing of having SavedDirty used 
more widely. But since we have to turn off the warnings and other logic, 
this testing isn’t fully happening on these older platforms either. So I’m 
not sure if the unconditional SavedDirty is really a win or not. I thought 
it was slightly.

So in this version SavedDirty is turned on universally for x86. Even for 
32 bit, which, while seeming a bit silly, allows there to be only one 
version of the ptep_set_wrprotect() logic.

Changed patches:
	x86/mm: Start actually marking _PAGE_SAVED_DIRTY
	x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY
	x86/mm: Introduce _PAGE_SAVED_DIRTY
	mm: Warn on shadow stack memory in wrong vma
	x86/mm: Warn if create Write=0,Dirty=1 with raw prot

3. Shadow stack protections enhancements
----------------------------------------
The shadow stack signal frame format uses a special shadow stack frame 
pattern that should not occur naturally in order to avoid forgery on 
sigreturn. Two patches are added to strengthen the forgery checks. These 
could have been squashed into the signal patch, but I thought leaving them 
separate might make review easier for those familiar with the last series. 
I was waffling on whether to postpone them to minimize changes from the 
previously queued changed to v9. In the end, as long as the series was 
already getting re-spun, I thought the extra protections were worth 
starting with.

The new mmap maple tree code needs to be taught specifically about 
VM_SHADOW_STACK, instead of just relying on vm_start_gap() like the old RB 
stuff, so that is added as well to retain the start guard gap with maple
tree.

Added/changes patches:
	x86/shstk: Check that SSP is aligned on sigreturn
	x86/shstk: Check that signal frame is shadow stack mem
	mm: Add guard pages around a shadow stack

4. Selftest enhancements
------------------------
A few miscellaneous selftest enhancements that accumulated since the old 
series landed in tip. Added a test for the shadow stack guard gap and the 
shadow stack ptrace interface. Also fixed a race that caused the uffd test 
to sometimes hang.

Changed patches:
	selftests/x86: Add shadow stack test

Since some of the changes were extensive in the modified patches, I 
dropped some review tags. But I left testing tags, testers please retest.

Thanks,

Rick

[0] https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/lkml/CAHk-=wiuVXTfgapmjYQvrEDzn3naF2oYnHuky+feEJSj_G_yFQ@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAHk-=wiB0wy6oXOsPtYU4DSbqJAY8z5iNBKdjdOp2LP23khUoA@mail.gmail.com/

Mike Rapoport (1):
  x86/shstk: Add ARCH_SHSTK_UNLOCK

Rick Edgecombe (38):
  mm: Rename arch pte_mkwrite()'s to pte_mkwrite_novma()
  mm: Move pte/pmd_mkwrite() callers with no VMA to _novma()
  mm: Make pte_mkwrite() take a VMA
  x86/shstk: Add Kconfig option for shadow stack
  x86/traps: Move control protection handler to separate file
  x86/cpufeatures: Add CPU feature flags for shadow stacks
  x86/mm: Move pmd_write(), pud_write() up in the file
  x86/mm: Introduce _PAGE_SAVED_DIRTY
  x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY
  x86/mm: Start actually marking _PAGE_SAVED_DIRTY
  x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  x86/mm: Check shadow stack page fault errors
  mm: Add guard pages around a shadow stack.
  mm: Warn on shadow stack memory in wrong vma
  x86/mm: Warn if create Write=0,Dirty=1 with raw prot
  mm/mmap: Add shadow stack pages to memory accounting
  x86/mm: Introduce MAP_ABOVE4G
  x86/mm: Teach pte_mkwrite() about stack memory
  mm: Don't allow write GUPs to shadow stack memory
  Documentation/x86: Add CET shadow stack description
  x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  x86/fpu: Add helper for modifying xstate
  x86: Introduce userspace API for shadow stack
  x86/shstk: Add user control-protection fault handler
  x86/shstk: Add user-mode shadow stack support
  x86/shstk: Handle thread shadow stack
  x86/shstk: Introduce routines modifying shstk
  x86/shstk: Handle signals for shadow stack
  x86/shstk: Check that SSP is aligned on sigreturn
  x86/shstk: Check that signal frame is shadow stack mem
  x86/shstk: Introduce map_shadow_stack syscall
  x86/shstk: Support WRSS for userspace
  x86: Expose thread features in /proc/$PID/status
  x86/shstk: Wire in shadow stack interface
  x86/cpufeatures: Enable CET CR4 bit for shadow stack
  selftests/x86: Add shadow stack test
  x86: Add PTRACE interface for shadow stack
  x86/shstk: Add ARCH_SHSTK_STATUS

Yu-cheng Yu (3):
  mm: Re-introduce vm_flags to do_mmap()
  mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  mm: Introduce VM_SHADOW_STACK for shadow stack memory

 Documentation/arch/x86/index.rst              |   1 +
 Documentation/arch/x86/shstk.rst              | 179 ++++
 Documentation/filesystems/proc.rst            |   1 +
 Documentation/mm/arch_pgtable_helpers.rst     |  12 +-
 arch/Kconfig                                  |   3 +
 arch/alpha/include/asm/pgtable.h              |   2 +-
 arch/arc/include/asm/hugepage.h               |   2 +-
 arch/arc/include/asm/pgtable-bits-arcv2.h     |   2 +-
 arch/arm/include/asm/pgtable-3level.h         |   2 +-
 arch/arm/include/asm/pgtable.h                |   2 +-
 arch/arm/kernel/signal.c                      |   2 +-
 arch/arm64/include/asm/pgtable.h              |   4 +-
 arch/arm64/kernel/signal.c                    |   2 +-
 arch/arm64/kernel/signal32.c                  |   2 +-
 arch/arm64/mm/trans_pgd.c                     |   4 +-
 arch/csky/include/asm/pgtable.h               |   2 +-
 arch/hexagon/include/asm/pgtable.h            |   2 +-
 arch/ia64/include/asm/pgtable.h               |   2 +-
 arch/loongarch/include/asm/pgtable.h          |   4 +-
 arch/m68k/include/asm/mcf_pgtable.h           |   2 +-
 arch/m68k/include/asm/motorola_pgtable.h      |   2 +-
 arch/m68k/include/asm/sun3_pgtable.h          |   2 +-
 arch/microblaze/include/asm/pgtable.h         |   2 +-
 arch/mips/include/asm/pgtable.h               |   6 +-
 arch/nios2/include/asm/pgtable.h              |   2 +-
 arch/openrisc/include/asm/pgtable.h           |   2 +-
 arch/parisc/include/asm/pgtable.h             |   2 +-
 arch/powerpc/include/asm/book3s/32/pgtable.h  |   2 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  |   4 +-
 arch/powerpc/include/asm/nohash/32/pgtable.h  |   4 +-
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  |   4 +-
 arch/powerpc/include/asm/nohash/64/pgtable.h  |   2 +-
 arch/riscv/include/asm/pgtable.h              |   6 +-
 arch/s390/include/asm/hugetlb.h               |   2 +-
 arch/s390/include/asm/pgtable.h               |   4 +-
 arch/s390/mm/pageattr.c                       |   4 +-
 arch/sh/include/asm/pgtable_32.h              |   4 +-
 arch/sparc/include/asm/pgtable_32.h           |   2 +-
 arch/sparc/include/asm/pgtable_64.h           |   6 +-
 arch/sparc/kernel/signal32.c                  |   2 +-
 arch/sparc/kernel/signal_64.c                 |   2 +-
 arch/um/include/asm/pgtable.h                 |   2 +-
 arch/x86/Kconfig                              |  24 +
 arch/x86/Kconfig.assembler                    |   5 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/x86/include/asm/cpufeatures.h            |   2 +
 arch/x86/include/asm/disabled-features.h      |  16 +-
 arch/x86/include/asm/fpu/api.h                |   9 +
 arch/x86/include/asm/fpu/regset.h             |   7 +-
 arch/x86/include/asm/fpu/sched.h              |   3 +-
 arch/x86/include/asm/fpu/types.h              |  16 +-
 arch/x86/include/asm/fpu/xstate.h             |   6 +-
 arch/x86/include/asm/idtentry.h               |   2 +-
 arch/x86/include/asm/mmu_context.h            |   2 +
 arch/x86/include/asm/pgtable.h                | 302 +++++-
 arch/x86/include/asm/pgtable_types.h          |  46 +-
 arch/x86/include/asm/processor.h              |   8 +
 arch/x86/include/asm/shstk.h                  |  38 +
 arch/x86/include/asm/special_insns.h          |  13 +
 arch/x86/include/asm/tlbflush.h               |   3 +-
 arch/x86/include/asm/trap_pf.h                |   2 +
 arch/x86/include/asm/traps.h                  |  12 +
 arch/x86/include/uapi/asm/mman.h              |   4 +
 arch/x86/include/uapi/asm/prctl.h             |  12 +
 arch/x86/kernel/Makefile                      |   4 +
 arch/x86/kernel/cet.c                         | 152 +++
 arch/x86/kernel/cpu/common.c                  |  35 +-
 arch/x86/kernel/cpu/cpuid-deps.c              |   1 +
 arch/x86/kernel/cpu/proc.c                    |  23 +
 arch/x86/kernel/fpu/core.c                    |  54 +-
 arch/x86/kernel/fpu/regset.c                  |  81 ++
 arch/x86/kernel/fpu/xstate.c                  |  90 +-
 arch/x86/kernel/idt.c                         |   2 +-
 arch/x86/kernel/process.c                     |  21 +-
 arch/x86/kernel/process_64.c                  |   8 +
 arch/x86/kernel/ptrace.c                      |  12 +
 arch/x86/kernel/shstk.c                       | 529 +++++++++++
 arch/x86/kernel/signal.c                      |   1 +
 arch/x86/kernel/signal_32.c                   |   2 +-
 arch/x86/kernel/signal_64.c                   |   8 +-
 arch/x86/kernel/sys_x86_64.c                  |   6 +-
 arch/x86/kernel/traps.c                       |  87 --
 arch/x86/mm/fault.c                           |  22 +
 arch/x86/mm/pat/set_memory.c                  |   4 +-
 arch/x86/mm/pgtable.c                         |  40 +
 arch/x86/xen/enlighten_pv.c                   |   2 +-
 arch/x86/xen/mmu_pv.c                         |   2 +-
 arch/x86/xen/xen-asm.S                        |   2 +-
 arch/xtensa/include/asm/pgtable.h             |   2 +-
 fs/aio.c                                      |   2 +-
 fs/proc/array.c                               |   6 +
 fs/proc/task_mmu.c                            |   3 +
 include/asm-generic/hugetlb.h                 |   2 +-
 include/linux/mm.h                            |  67 +-
 include/linux/mman.h                          |   4 +
 include/linux/pgtable.h                       |  28 +
 include/linux/proc_fs.h                       |   2 +
 include/linux/syscalls.h                      |   1 +
 include/uapi/asm-generic/siginfo.h            |   3 +-
 include/uapi/asm-generic/unistd.h             |   2 +-
 include/uapi/linux/elf.h                      |   2 +
 ipc/shm.c                                     |   2 +-
 kernel/sys_ni.c                               |   1 +
 mm/debug_vm_pgtable.c                         |  12 +-
 mm/gup.c                                      |   2 +-
 mm/huge_memory.c                              |  11 +-
 mm/internal.h                                 |   4 +-
 mm/memory.c                                   |   5 +-
 mm/migrate.c                                  |   2 +-
 mm/migrate_device.c                           |   2 +-
 mm/mmap.c                                     |  14 +-
 mm/mprotect.c                                 |   2 +-
 mm/nommu.c                                    |   4 +-
 mm/userfaultfd.c                              |   2 +-
 mm/util.c                                     |   2 +-
 tools/testing/selftests/x86/Makefile          |   2 +-
 .../testing/selftests/x86/test_shadow_stack.c | 884 ++++++++++++++++++
 117 files changed, 2789 insertions(+), 307 deletions(-)
 create mode 100644 Documentation/arch/x86/shstk.rst
 create mode 100644 arch/x86/include/asm/shstk.h
 create mode 100644 arch/x86/kernel/cet.c
 create mode 100644 arch/x86/kernel/shstk.c
 create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c
  

Comments

Linus Torvalds June 13, 2023, 1:34 a.m. UTC | #1
On Mon, Jun 12, 2023 at 5:14 PM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
>
> This series implements Shadow Stacks for userspace using x86's Control-flow
> Enforcement Technology (CET).

Do you have this in a git tree somewhere? For series with this many
patches, I find it easier to just do a "git fetch" and "gitk
..FETCH_HEAD" these days, and then reply by email on anything I find.

That's partly because it makes it really easy to zoom in on some
particular area (eg "let's look at just mm/ and the generic include
files")

                  Linus
  
Edgecombe, Rick P June 13, 2023, 3:12 a.m. UTC | #2
On Mon, 2023-06-12 at 18:34 -0700, Linus Torvalds wrote:
> On Mon, Jun 12, 2023 at 5:14 PM Rick Edgecombe
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > This series implements Shadow Stacks for userspace using x86's
> > Control-flow
> > Enforcement Technology (CET).
> 
> Do you have this in a git tree somewhere? For series with this many
> patches, I find it easier to just do a "git fetch" and "gitk
> ..FETCH_HEAD" these days, and then reply by email on anything I find.
> 
> That's partly because it makes it really easy to zoom in on some
> particular area (eg "let's look at just mm/ and the generic include
> files")

Sure. I probably should have included that upfront. Here is a github
repo:
https://github.com/rpedgeco/linux/tree/user_shstk_v9

I went ahead and included the tags[0] from last time in case that's
useful, but unfortunately the github web interface is not very
conducive to viewing the tag-based segmentation of the series. If
having it in a korg repo would be useful, please let me know.

[0]
https://lore.kernel.org/lkml/4433c3595db23f7c779b69b222958151b69ddd70.camel@intel.com/
  
Linus Torvalds June 13, 2023, 5:44 p.m. UTC | #3
On Mon, Jun 12, 2023 at 8:12 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> Sure. I probably should have included that upfront. Here is a github
> repo:
> https://github.com/rpedgeco/linux/tree/user_shstk_v9
>
> I went ahead and included the tags[0] from last time in case that's
> useful, but unfortunately the github web interface is not very
> conducive to viewing the tag-based segmentation of the series. If
> having it in a korg repo would be useful, please let me know.

Oh, kernel.org vs github doesn't matter. I'm not actually merging this
yet, I'm just doing a fetch to then easily be able to look at it
locally in different formats.

I tend to like seeing small things in my MUA just because then I don't
switch back-and-forth between reading email and some gitk workflow,
and it is easy to just scan through the series and reply all inthe
MUA.

But when it's some bigger piece, just doing a "git fetch" and then
being able to dissect it locally is really convenient.

Having worked with patches for three decades, I can read diffs in my
sleep - but it's still quite useful to say "give me the patches just
for *this* file" to just see how some specific area changed without
having to look at the other parts.

Or for example, that whole pte_mkwrite -> pte_mkwrite_novma patch is
much denser and more legible with color-coding and the --word-diff.

Anyway, I'm scanning through it right now. No comments yet, I only
just got started.

              Linus
  
Linus Torvalds June 13, 2023, 6:27 p.m. UTC | #4
On Tue, Jun 13, 2023 at 10:44 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, I'm scanning through it right now. No comments yet, I only
> just got started.

Well, it all looked fine from a quick scan. One small comment, and
even that was just a minor stylistic nit.

I didn't actually look through the x86 state infrastructure side -
I'll just trust that is fine, and it doesn't interact with anything
else, so I don't really worry about it. I mainly care about the VM
side not causing problems, and the changes on that side all looked
fine.

             Linus
  
Edgecombe, Rick P June 13, 2023, 7:38 p.m. UTC | #5
On Tue, 2023-06-13 at 11:27 -0700, Linus Torvalds wrote:
> On Tue, Jun 13, 2023 at 10:44 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > 
> > Anyway, I'm scanning through it right now. No comments yet, I only
> > just got started.
> 
> Well, it all looked fine from a quick scan. One small comment, and
> even that was just a minor stylistic nit.
> 
> I didn't actually look through the x86 state infrastructure side -
> I'll just trust that is fine, and it doesn't interact with anything
> else, so I don't really worry about it. I mainly care about the VM
> side not causing problems, and the changes on that side all looked
> fine.

Thanks!
  
Mark Brown June 14, 2023, 11:45 p.m. UTC | #6
On Mon, Jun 12, 2023 at 05:10:26PM -0700, Rick Edgecombe wrote:

> This series implements Shadow Stacks for userspace using x86's Control-flow 
> Enforcement Technology (CET). CET consists of two related security
> features: shadow stacks and indirect branch tracking. This series
> implements just the  shadow stack part of this feature, and just for
> userspace.

I've been using the generic changes here for the work I've been doing on
arm64's similar GCS feature, while that is still very much WIP and
hasn't been posted anywhere most of the common code here has been
exercised.  I've been through the patches that I've specifically checked
or used.  Thanks for all the work here.