[v7,01/41] Documentation/x86: Add CET shadow stack description

Message ID 20230227222957.24501-2-rick.p.edgecombe@intel.com
State New
Series: Shadow stacks for userspace

Commit Message

Edgecombe, Rick P Feb. 27, 2023, 10:29 p.m. UTC
  From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Introduce a new document on Control-flow Enforcement Technology (CET).

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>

---
v5:
 - Literal format tweaks (Bagas Sanjaya)
 - Update EOPNOTSUPP text due to unification after comment from (Kees)
 - Update 32 bit signal support with new behavior
 - Remove capitalization on shadow stack (Boris)
 - Fix typo

v4:
 - Drop clearcpuid piece (Boris)
 - Add some info about 32 bit

v3:
 - Clarify kernel IBT is supported by the kernel. (Kees, Andrew Cooper)
 - Clarify which arch_prctl's can take multiple bits. (Kees)
 - Describe ASLR characteristics of thread shadow stacks. (Kees)
 - Add exec section. (Andrew Cooper)
 - Fix some capitalization (Bagas Sanjaya)
 - Update new location of enablement status proc.
 - Add info about new user_shstk software capability.
 - Add more info about what the kernel pushes to the shadow stack on
   signal.

v2:
 - Updated to new arch_prctl() API
 - Add bit about new proc status
---
 Documentation/x86/index.rst |   1 +
 Documentation/x86/shstk.rst | 166 ++++++++++++++++++++++++++++++++++++
 2 files changed, 167 insertions(+)
 create mode 100644 Documentation/x86/shstk.rst
  

Comments

Szabolcs Nagy March 1, 2023, 2:21 p.m. UTC | #1
The 02/27/2023 14:29, Rick Edgecombe wrote:
> +Application Enabling
> +====================
> +
> +An application's CET capability is marked in its ELF note and can be verified
> +from readelf/llvm-readelf output::
> +
> +    readelf -n <application> | grep -a SHSTK
> +        properties: x86 feature: SHSTK
> +
> +The kernel does not process these application markers directly. Applications
> +or loaders must enable CET features using the interface described in section 4.
> +Typically this would be done in dynamic loader or static runtime objects, as is
> +the case in GLIBC.

Note that this has to be an early decision in libc (ld.so or static
exe start code), which will be difficult to hook into system wide
security policy settings. (e.g. to force shstk on marked binaries.)

From userspace POV I'd prefer if a static exe did not have to parse
its own ELF notes (i.e. kernel enabled shstk based on the marking).
But I realize if there is a need for complex shstk enable/disable
decision that is better in userspace and if the kernel decision can
be overridden then it might as well all be in userspace.

> +Enabling arch_prctl()'s
> +=======================
> +
> +Elf features should be enabled by the loader using the below arch_prctl's. They
> +are only supported in 64 bit user applications.
> +
> +arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
> +    Enable a single feature specified in 'feature'. Can only operate on
> +    one feature at a time.
> +
> +arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
> +    Disable a single feature specified in 'feature'. Can only operate on
> +    one feature at a time.
> +
> +arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
> +    Lock in features at their current enabled or disabled status. 'features'
> +    is a mask of all features to lock. All bits set are processed, unset bits
> +    are ignored. The mask is ORed with the existing value. So any feature bits
> +    set here cannot be enabled or disabled afterwards.

The multi-thread behaviour should be documented here: Only the
current thread is affected. So an application can only change the
setting while single-threaded which is only guaranteed before any
user code is executed. Later using the prctl is complicated and
most c runtimes would not want to do that (async signalling all
threads and prctl from the handler).

In particular these interfaces are not suitable to turn shstk off
at dlopen time when an unmarked binary is loaded. Or any other
late shstk policy change will not work, so as far as i can see
the "permissive" mode in glibc does not work.

Does the main thread have shadow stack allocated before shstk is
enabled? is the shadow stack freed when it is disabled? (e.g.
what would the instruction reading the SSP do in disabled state?)

> +Proc Status
> +===========
> +To check if an application is actually running with shadow stack, the
> +user can read the /proc/$PID/status. It will report "wrss" or "shstk"
> +depending on what is enabled. The lines look like this::
> +
> +    x86_Thread_features: shstk wrss
> +    x86_Thread_features_locked: shstk wrss

Presumably /proc/$TID/status and /proc/$PID/task/$TID/status also
show the setting, and it is only valid for the specific thread (not the
entire process). So i would note that this is for one thread only.

> +Implementation of the Shadow Stack
> +==================================
> +
> +Shadow Stack Size
> +-----------------
> +
> +A task's shadow stack is allocated from memory to a fixed size of
> +MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
> +the maximum size of the normal stack, but capped to 4 GB. However,
> +a compat-mode application's address space is smaller, each of its thread's
> +shadow stack size is MIN(1/4 RLIMIT_STACK, 4 GB).

This policy tries to handle all threads with the same shadow stack
size logic, which has limitations. I think it should be improved
(otherwise some applications will have to turn shstk off):

- RLIMIT_STACK is not an upper bound for the main thread stack size
  (rlimit can increase/decrease dynamically).
- RLIMIT_STACK only applies to the main thread, so it is not an upper
  bound for non-main thread stacks.
- i.e. stack size >> startup RLIMIT_STACK is possible and then shadow
  stack can overflow.
- stack size << startup RLIMIT_STACK is also possible and then VA
  space is wasted (can lead to OOM with strict memory overcommit).
- clone3 tells the kernel the thread stack size so that should be
  used instead of RLIMIT_STACK. (clone does not though.)
- I think it's better to have a new limit specifically for shadow
  stack size (which by default can be RLIMIT_STACK) so userspace
  can adjust it if needed (another reason is that stack size is
  not always a good indicator of max call depth).

> +Signal
> +------
> +
> +By default, the main program and its signal handlers use the same shadow
> +stack. Because the shadow stack stores only return addresses, a large
> +shadow stack covers the condition that both the program stack and the
> +signal alternate stack run out.

What does "by default" mean here? Is there a case when the signal handler
is not entered with SSP set to the handling thread's shadow stack?

> +When a signal happens, the old pre-signal state is pushed on the stack. When
> +shadow stack is enabled, the shadow stack specific state is pushed onto the
> +shadow stack. Today this is only the old SSP (shadow stack pointer), pushed
> +in a special format with bit 63 set. On sigreturn this old SSP token is
> +verified and restored by the kernel. The kernel will also push the normal
> +restorer address to the shadow stack to help userspace avoid a shadow stack
> +violation on the sigreturn path that goes through the restorer.

The kernel pushes on the shadow stack on signal entry so shadow stack
overflow cannot be handled. Please document this as non-recoverable
failure.

I think it can be made recoverable if signals with alternate stack run
on a different shadow stack. And the top of the thread shadow stack is
just corrupted instead of pushed in the overflow case. Then longjmp out
can be made to work (common in stack overflow handling cases), and
reliable crash report from the signal handler works (also common).

Does SSP get stored into the sigcontext struct somewhere?

> +Fork
> +----
> +
> +The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
> +to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
> +shadow access triggers a page fault with the shadow stack access bit set
> +in the page fault error code.
> +
> +When a task forks a child, its shadow stack PTEs are copied and both the
> +parent's and the child's shadow stack PTEs are cleared of the dirty bit.
> +Upon the next shadow stack access, the resulting shadow stack page fault
> +is handled by page copy/re-use.
> +
> +When a pthread child is created, the kernel allocates a new shadow stack
> +for the new thread. New shadow stacks behave like mmap() with respect to
> +ASLR behavior.

Please document the shadow stack lifetimes here:

I think thread exit unmaps shadow stack and vfork shares shadow stack
with parent so exit does not unmap.

I think the map_shadow_stack syscall should be mentioned in this
document too.

ABI for initial shadow stack entries:

If one wants to scan the shadow stack how to detect the end (e.g. fast
backtrace)? Is it useful to put an invalid value (-1) there?
(affects map_shadow_stack syscall too).

thanks.
  
Edgecombe, Rick P March 1, 2023, 6:07 p.m. UTC | #3
On Wed, 2023-03-01 at 14:21 +0000, Szabolcs Nagy wrote:
> The 02/27/2023 14:29, Rick Edgecombe wrote:
> > +Application Enabling
> > +====================
> > +
> > +An application's CET capability is marked in its ELF note and can
> > be verified
> > +from readelf/llvm-readelf output::
> > +
> > +    readelf -n <application> | grep -a SHSTK
> > +        properties: x86 feature: SHSTK
> > +
> > +The kernel does not process these applications markers directly.
> > Applications
> > +or loaders must enable CET features using the interface described
> > in section 4.
> > +Typically this would be done in dynamic loader or static runtime
> > objects, as is
> > +the case in GLIBC.
> 
> Note that this has to be an early decision in libc (ld.so or static
> exe start code), which will be difficult to hook into system wide
> security policy settings. (e.g. to force shstk on marked binaries.)

In the eager enabling (by the kernel) scenario, how is this improved?
The loader has to have the option to disable the shadow stack if
enabling conditions are not met, so it still has to trust userspace to
not do that. Did you have any more specifics on how the policy would
work?

> 
> From userspace POV I'd prefer if a static exe did not have to parse
> its own ELF notes (i.e. kernel enabled shstk based on the marking).

This is actually exactly what happens in the glibc patches. My
understanding was that it had already been discussed amongst glibc folks.

> But I realize if there is a need for complex shstk enable/disable
> decision that is better in userspace and if the kernel decision can
> be overridden then it might as well all be in userspace.

A complication with shadow stack in general is that it has to be
enabled very early. Otherwise when the program returns from main(), it
will get a shadow stack underflow. The old logic in this series would
enable shadow stack if the loader had the SHSTK bit (by parsing the
header in the kernel). Then later if the conditions were not met to use
shadow stack, the loader would call into the kernel again to disable
shadow stack.

One problem (there were several in this area) with this eager
enabling was that the kernel ended up mapping, briefly using, and then
unmapping the shadow stack in the case of an executable not supporting
shadow stack. What the glibc patches do today is pretty much the same
behavior as before, just with the header parsing moved into userspace.
I think letting the component with the most information make the
decision leaves open the best opportunity for making it efficient. I
wonder if it could be possible for glibc to enable it later than it
currently does in the patches and improve the dynamic loader case, but
I don't know enough of that code.

> 
> > +Enabling arch_prctl()'s
> > +=======================
> > +
> > +Elf features should be enabled by the loader using the below
> > arch_prctl's. They
> > +are only supported in 64 bit user applications.
> > +
> > +arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
> > +    Enable a single feature specified in 'feature'. Can only
> > operate on
> > +    one feature at a time.
> > +
> > +arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
> > +    Disable a single feature specified in 'feature'. Can only
> > operate on
> > +    one feature at a time.
> > +
> > +arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
> > +    Lock in features at their current enabled or disabled status.
> > 'features'
> > +    is a mask of all features to lock. All bits set are processed,
> > unset bits
> > +    are ignored. The mask is ORed with the existing value. So any
> > feature bits
> > +    set here cannot be enabled or disabled afterwards.
> 
> The multi-thread behaviour should be documented here: Only the
> current thread is affected. So an application can only change the
> setting while single-threaded which is only guaranteed before any
> user code is executed. Later using the prctl is complicated and
> most c runtimes would not want to do that (async signalling all
> threads and prctl from the handler).

It is kind of covered in the fork() docs, but yes there should probably
be a reference here too.

> 
> In particular these interfaces are not suitable to turn shstk off
> at dlopen time when an unmarked binary is loaded. Or any other
> late shstk policy change will not work, so as far as i can see
> the "permissive" mode in glibc does not work.

Yes, that is correct. Glibc permissive mode does not fully work. There
are some ongoing discussions on how to make it work. Some options don't
require kernel changes, and some do. Making it per-thread is
complicated for x86 because when shadow stack is off, some of the
special shadow stack instructions will cause a #UD exception. Glibc (and
probably other apps in the future) could be in the middle of executing
these instructions when dlopen() was called. So if there was a process
wide disable option it would have to be resilient to these #UDs. And
even then the code that used them could not be guaranteed to continue
to work. For example, if you call the gcc intrinsic _get_ssp() when
shadow stack is enabled it could be expected to point to the shadow
stack in most cases. If shadow stack gets disabled, rdssp will return
0, in which case reading the shadow stack would segfault. So the all-
process disabling solution can't be fully robust when there is any
shadow stack specific logic.

The other option discussed was creating trampolines between the
linked legacy objects that could know to tell the kernel to disable
shadow stack if needed. In this case, shadow stack is disabled for each
thread as it calls into the DSO. It's not clear if there can be enough
information gleaned from the legacy binaries to know when to generate
the trampolines in exotic cases.

A third option might be to have some synchronization between the kernel
and userspace around anything using the shadow stack instructions. But
there is not much detail filled in there.

So in summary, it's not as simple as making the disable per-process.

> 
> Does the main thread have shadow stack allocated before shstk is
> enabled?

No.

> is the shadow stack freed when it is disabled? (e.g.
> what would the instruction reading the SSP do in disabled state?)

Yes.

When shadow stack is disabled, rdssp is a NOP and the intrinsic returns
NULL.

> 
> > +Proc Status
> > +===========
> > +To check if an application is actually running with shadow stack,
> > the
> > +user can read the /proc/$PID/status. It will report "wrss" or
> > "shstk"
> > +depending on what is enabled. The lines look like this::
> > +
> > +    x86_Thread_features: shstk wrss
> > +    x86_Thread_features_locked: shstk wrss
> 
> Presumably /proc/$TID/status and /proc/$PID/task/$TID/status also
> show the setting, and it is only valid for the specific thread (not the
> entire process). So i would note that this is for one thread only.

Since enabling/disabling is per-thread, and the field is called
"x86_Thread_features" I thought it was clear. It's easy to add some
more detail though.

> 
> > +Implementation of the Shadow Stack
> > +==================================
> > +
> > +Shadow Stack Size
> > +-----------------
> > +
> > +A task's shadow stack is allocated from memory to a fixed size of
> > +MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is
> > allocated to
> > +the maximum size of the normal stack, but capped to 4 GB. However,
> > +a compat-mode application's address space is smaller, each of its
> > thread's
> > +shadow stack size is MIN(1/4 RLIMIT_STACK, 4 GB).
> 
> This policy tries to handle all threads with the same shadow stack
> size logic, which has limitations. I think it should be improved
> (otherwise some applications will have to turn shstk off):
> 
> - RLIMIT_STACK is not an upper bound for the main thread stack size
>   (rlimit can increase/decrease dynamically).
> - RLIMIT_STACK only applies to the main thread, so it is not an upper
>   bound for non-main thread stacks.
> - i.e. stack size >> startup RLIMIT_STACK is possible and then shadow
>   stack can overflow.
> - stack size << startup RLIMIT_STACK is also possible and then VA
>   space is wasted (can lead to OOM with strict memory overcommit).
> - clone3 tells the kernel the thread stack size so that should be
>   used instead of RLIMIT_STACK. (clone does not though.)

This actually happens already. I can update the docs.

> - I think it's better to have a new limit specifically for shadow
>   stack size (which by default can be RLIMIT_STACK) so userspace
>   can adjust it if needed (another reason is that stack size is
>   not always a good indicator of max call depth).

Hmm, yea. This seems like a good idea, but I don't see why it can't be
a follow on. The series is quite big just to get the basics. I have
tried to save some of the enhancements (like alt shadow stack) for the
future.

> 
> > +Signal
> > +------
> > +
> > +By default, the main program and its signal handlers use the same
> > shadow
> > +stack. Because the shadow stack stores only return addresses, a
> > large
> > +shadow stack covers the condition that both the program stack and
> > the
> > +signal alternate stack run out.
> 
> What does "by default" mean here? Is there a case when the signal
> handler
> is not entered with SSP set to the handling thread's shadow stack?

Ah, yea, that could be updated. It is in reference to an alt shadow
stack implementation that was held for later.

> 
> > +When a signal happens, the old pre-signal state is pushed on the
> > stack. When
> > +shadow stack is enabled, the shadow stack specific state is pushed
> > onto the
> > +shadow stack. Today this is only the old SSP (shadow stack
> > pointer), pushed
> > +in a special format with bit 63 set. On sigreturn this old SSP
> > token is
> > +verified and restored by the kernel. The kernel will also push the
> > normal
> > +restorer address to the shadow stack to help userspace avoid a
> > shadow stack
> > +violation on the sigreturn path that goes through the restorer.
> 
> The kernel pushes on the shadow stack on signal entry so shadow stack
> overflow cannot be handled. Please document this as non-recoverable
> failure.

It doesn't hurt to call it out. Please see the below link for future
plans to handle this scenario (alt shadow stack).

> 
> I think it can be made recoverable if signals with alternate stack
> run
> on a different shadow stack. And the top of the thread shadow stack
> is
> just corrupted instead of pushed in the overflow case. Then longjmp
> out
> can be made to work (common in stack overflow handling cases), and
> reliable crash report from the signal handler works (also common).
> 
> Does SSP get stored into the sigcontext struct somewhere?

No, it's pushed to the shadow stack only. See the v2 cover letter for the
discussion of the design and reasoning:

https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/

> 
> > +Fork
> > +----
> > +
> > +The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are
> > required
> > +to be read-only and dirty. When a shadow stack PTE is not RO and
> > dirty, a
> > +shadow access triggers a page fault with the shadow stack access
> > bit set
> > +in the page fault error code.
> > +
> > +When a task forks a child, its shadow stack PTEs are copied and
> > both the
> > +parent's and the child's shadow stack PTEs are cleared of the
> > dirty bit.
> > +Upon the next shadow stack access, the resulting shadow stack page
> > fault
> > +is handled by page copy/re-use.
> > +
> > +When a pthread child is created, the kernel allocates a new shadow
> > stack
> > +for the new thread. New shadow stack's behave like mmap() with
> > respect to
> > +ASLR behavior.
> 
> Please document the shadow stack lifetimes here:
> 
> I think thread exit unmaps shadow stack and vfork shares shadow stack
> with parent so exit does not unmap.

Sure, this can be updated.

> 
> I think the map_shadow_stack syscall should be mentioned in this
> document too.

There is a man page prepared for this. I plan to update the docs to
reference it when it exists and not duplicate the text. There can be a
blurb for the time being but it would be short lived.

> If one wants to scan the shadow stack how to detect the end (e.g.
> fast
> backtrace)? Is it useful to put an invalid value (-1) there?
> (affects map_shadow_stack syscall too).

Interesting idea. I think it's probably not a breaking ABI change if we
wanted to add it later.
  
Edgecombe, Rick P March 1, 2023, 6:32 p.m. UTC | #4
On Wed, 2023-03-01 at 10:07 -0800, Rick Edgecombe wrote:
> > If one wants to scan the shadow stack how to detect the end (e.g.
> > fast
> > backtrace)? Is it useful to put an invalid value (-1) there?
> > (affects map_shadow_stack syscall too).
> 
> Interesting idea. I think it's probably not a breaking ABI change if
> we
> wanted to add it later.

One complication could be how to handle shadow stacks created outside
of thread creation. map_shadow_stack would typically add a token at the
end so it could be pivoted to. So then the backtracing algorithm would
have to know to skip it or something to find a special start of stack
marker.

Alternatively, the thread shadow stacks could get an already used token
pushed at the end, to try to match what an in-use map_shadow_stack
shadow stack would look like. Then the backtracing algorithm could just
look for the same token in both cases. It might get confused in exotic
cases and mistake a token in the middle of the stack for the end of the
allocation though. Hmm...
  
Szabolcs Nagy March 2, 2023, 4:14 p.m. UTC | #5
The 03/01/2023 18:07, Edgecombe, Rick P wrote:
> On Wed, 2023-03-01 at 14:21 +0000, Szabolcs Nagy wrote:
> > The 02/27/2023 14:29, Rick Edgecombe wrote:
> > > +Application Enabling
> > > +====================
> > > +
> > > +An application's CET capability is marked in its ELF note and can
> > > be verified
> > > +from readelf/llvm-readelf output::
> > > +
> > > +    readelf -n <application> | grep -a SHSTK
> > > +        properties: x86 feature: SHSTK
> > > +
> > > +The kernel does not process these applications markers directly.
> > > Applications
> > > +or loaders must enable CET features using the interface described
> > > in section 4.
> > > +Typically this would be done in dynamic loader or static runtime
> > > objects, as is
> > > +the case in GLIBC.
> > 
> > Note that this has to be an early decision in libc (ld.so or static
> > exe start code), which will be difficult to hook into system wide
> > security policy settings. (e.g. to force shstk on marked binaries.)
> 
> In the eager enabling (by the kernel) scenario, how is this improved?
> The loader has to have the option to disable the shadow stack if
> enabling conditions are not met, so it still has to trust userspace to
> not do that. Did you have any more specifics on how the policy would
> work?

i guess my issue is that the arch prctls only allow self policing.
there is no kernel mechanism to set policy from outside the process
that is either inherited or asynchronously set. policy is completely
managed by libc (and done very early).

now i understand that async disable does not work (thanks for the
explanation), but some control for forced enable/locking inherited
across exec could work.

> > From userspace POV I'd prefer if a static exe did not have to parse
> > its own ELF notes (i.e. kernel enabled shstk based on the marking).
> 
> This is actually exactly what happens in the glibc patches. My
> understanding was that it had already been discussed amongst glibc folks.

there were many glibc patches, some of which are committed despite not
having an accepted linux abi, so i'm trying to review the linux abi
contracts and expect this patch to be authoritative, please bear with me.

> > - I think it's better to have a new limit specifically for shadow
> >   stack size (which by default can be RLIMIT_STACK) so userspace
> >   can adjust it if needed (another reason is that stack size is
> >   not always a good indicator of max call depth).
> 
> Hmm, yea. This seems like a good idea, but I don't see why it can't be
> a follow on. The series is quite big just to get the basics. I have
> tried to save some of the enhancements (like alt shadow stack) for the
> future.

it is actually not obvious how to introduce a limit so that it is
inherited or reset in a sensible way, so i think it is useful to discuss
it together with the other issues.

> > The kernel pushes on the shadow stack on signal entry so shadow stack
> > overflow cannot be handled. Please document this as non-recoverable
> > failure.
> 
> It doesn't hurt to call it out. Please see the below link for future
> plans to handle this scenario (alt shadow stack).
> 
> > 
> > I think it can be made recoverable if signals with alternate stack
> > run
> > on a different shadow stack. And the top of the thread shadow stack
> > is
> > just corrupted instead of pushed in the overflow case. Then longjmp
> > out
> > can be made to work (common in stack overflow handling cases), and
> > reliable crash report from the signal handler works (also common).
> > 
> > Does SSP get stored into the sigcontext struct somewhere?
> 
> No, it's pushed to the shadow stack only. See the v2 cover letter for the
> discussion of the design and reasoning:
> 
> https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/

i think this should be part of the initial design as it may be hard
to change later.

"sigaltshstk() is separate from sigaltstack(). You can have one
without the other, neither or both together. Because the shadow
stack specific state is pushed to the shadow stack, the two
features don’t need to know about each other."

this means they cannot be changed together atomically.

i'd expect most sigaltstack users to want to be resilient
against shadow stack overflow which means non-portable
code changes.

i don't see why automatic alt shadow stack allocation would
not work (kernel manages it transparently when an alt stack
is installed or disabled).

"Since shadow alt stacks are a new feature, longjmp()ing from an
alt shadow stack will simply not be supported. If a libc wants
to support this it will need to enable WRSS and write its own
restore token."

i think longjmp should work without enabling writes to the shadow
stack in the libc. this can also affect unwinding across signal
handlers (not for c++ but e.g. glibc thread cancellation).

i'd prefer overwriting the shadow stack top entry on overflow to
disallowing longjmp out of a shadow stack overflow handler.

> > I think the map_shadow_stack syscall should be mentioned in this
> > document too.
> 
> There is a man page prepared for this. I plan to update the docs to
> reference it when it exists and not duplicate the text. There can be a
> blurb for the time being but it would be short lived.

i wanted to comment on the syscall because i think it may be better
to have a magic mmap MAP_ flag that takes care of everything.

but i can go comment on the specific patch then.

thanks.
  
Szabolcs Nagy March 2, 2023, 4:34 p.m. UTC | #6
The 03/01/2023 18:32, Edgecombe, Rick P wrote:
> On Wed, 2023-03-01 at 10:07 -0800, Rick Edgecombe wrote:
> > > If one wants to scan the shadow stack how to detect the end (e.g.
> > > fast
> > > backtrace)? Is it useful to put an invalid value (-1) there?
> > > (affects map_shadow_stack syscall too).
> > 
> > Interesting idea. I think it's probably not a breaking ABI change if
> > we
> > wanted to add it later.
> 
> One complication could be how to handle shadow stacks created outside
> of thread creation. map_shadow_stack would typically add a token at the
> end so it could be pivoted to. So then the backtracing algorithm would
> have to know to skip it or something to find a special start of stack
> marker.

i'd expect the pivot token to disappear once you pivot to it
(and a pivot token to appear on the stack you pivoted away
from, so you can go back later); otherwise i don't see how
swapcontext works.

i'd push an end token and a pivot token on new shadow stacks.

> Alternatively, the thread shadow stacks could get an already used token
> pushed at the end, to try to match what an in-use map_shadow_stack
> shadow stack would look like. Then the backtracing algorithm could just
> look for the same token in both cases. It might get confused in exotic
> cases and mistake a token in the middle of the stack for the end of the
> allocation though. Hmm...

a backtracer would search for an end token on an active shadow
stack. it should be able to skip other tokens that don't seem
to be code addresses. the end token needs to be identifiable
and not break security properties. i think it's enough if the
backtrace is best effort correct, there can be corner-cases when
shadow stack is difficult to interpret, but e.g. a profiler can
still make good use of this feature.
  
Edgecombe, Rick P March 2, 2023, 9:17 p.m. UTC | #7
On Thu, 2023-03-02 at 16:14 +0000, szabolcs.nagy@arm.com wrote:
> The 03/01/2023 18:07, Edgecombe, Rick P wrote:
> > On Wed, 2023-03-01 at 14:21 +0000, Szabolcs Nagy wrote:
> > > The 02/27/2023 14:29, Rick Edgecombe wrote:
> > > > +Application Enabling
> > > > +====================
> > > > +
> > > > +An application's CET capability is marked in its ELF note and
> > > > can
> > > > be verified
> > > > +from readelf/llvm-readelf output::
> > > > +
> > > > +    readelf -n <application> | grep -a SHSTK
> > > > +        properties: x86 feature: SHSTK
> > > > +
> > > > +The kernel does not process these applications markers
> > > > directly.
> > > > Applications
> > > > +or loaders must enable CET features using the interface
> > > > described
> > > > in section 4.
> > > > +Typically this would be done in dynamic loader or static
> > > > runtime
> > > > objects, as is
> > > > +the case in GLIBC.
> > > 
> > > Note that this has to be an early decision in libc (ld.so or
> > > static
> > > exe start code), which will be difficult to hook into system wide
> > > security policy settings. (e.g. to force shstk on marked
> > > binaries.)
> > 
> > In the eager enabling (by the kernel) scenario, how is this
> > improved?
> > The loader has to have the option to disable the shadow stack if
> > enabling conditions are not met, so it still has to trust userspace
> > to
> > not do that. Did you have any more specifics on how the policy
> > would
> > work?
> 
> i guess my issue is that the arch prctls only allow self policing.
> there is no kernel mechanism to set policy from outside the process
> that is either inherited or asynchronously set. policy is completely
> managed by libc (and done very early).
> 
> now i understand that async disable does not work (thanks for the
> explanation), but some control for forced enable/locking inherited
> across exec could work.

Is the idea that shadow stack would be forced on regardless of if the
linked libraries support it? In which case it could be allowed to crash
if they do not?

I think the majority of users would prefer the other case where shadow
stack is only used if supported, so this sounds like a special case.
Rather than lose the flexibility for the typical case, I would think
something like this could be an additional enabling mode. glibc could
check if shadow stack is already enabled by the kernel using the
arch_prctl()s in this case.

We are having to work around the existing broken glibc binaries by not
triggering off the elf bits automatically in the kernel, but I suppose
if this was a special "I don't care if it crashes" feature, maybe it
would be ok. Otherwise we would need to change the elf header bit to
exclude the old binaries to even be able to do this, and there was
extreme resistance to this idea from the userspace side.

> 
> > > From userspace POV I'd prefer if a static exe did not have to
> > > parse
> > > its own ELF notes (i.e. kernel enabled shstk based on the
> > > marking).
> > 
> > This is actually exactly what happens in the glibc patches. My
> > understanding was that it had already been discussed amongst glibc folks.
> 
> there were many glibc patches some of which are committed despite not
> having an accepted linux abi, so i'm trying to review the linux abi
contracts and expect this patch to be authoritative, please bear with
> me.

H.J. has some recent ones that work against this kernel series that
might interest you. The existing upstream glibc support will not get
used due to the enabling interface change to arch_prctl() (this was one
of the inspirations of the change actually).

> 
> > > - I think it's better to have a new limit specifically for shadow
> > >   stack size (which by default can be RLIMIT_STACK) so userspace
> > >   can adjust it if needed (another reason is that stack size is
> > >   not always a good indicator of max call depth).
> > 
> > Hmm, yea. This seems like a good idea, but I don't see why it can't
> > be
> > a follow on. The series is quite big just to get the basics. I have
> > tried to save some of the enhancements (like alt shadow stack) for
> > the
> > future.
> 
> it is actually not obvious how to introduce a limit so it is
> inherited
> or reset in a sensible way so i think it is useful to discuss it
> together with other issues.

Looking at this again, I'm not sure why a new rlimit is needed. It
seems many of those points were just variations on the assumption that
the clone3 stack size is not used, but it actually is, just not
documented. If you disagree perhaps you could elaborate on what the
requirements are and we can see if it seems tricky to do in a follow up.

> 
> > > The kernel pushes on the shadow stack on signal entry so shadow
> > > stack
> > > overflow cannot be handled. Please document this as non-
> > > recoverable
> > > failure.
> > 
> > It doesn't hurt to call it out. Please see the below link for
> > future
> > plans to handle this scenario (alt shadow stack).
> > 
> > > 
> > > I think it can be made recoverable if signals with alternate
> > > stack
> > > run
> > > on a different shadow stack. And the top of the thread shadow
> > > stack
> > > is
> > > just corrupted instead of pushed in the overflow case. Then
> > > longjmp
> > > out
> > > can be made to work (common in stack overflow handling cases),
> > > and
> > > reliable crash report from the signal handler works (also
> > > common).
> > > 
> > > Does SSP get stored into the sigcontext struct somewhere?
> > 
> > No, it's pushed to the shadow stack only. See the v2 coverletter of
> > the
> > discussion on the design and reasoning:
> > 
> > 
https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/
> 
> i think this should be part of the initial design as it may be hard
> to change later.

This is actually how it came up. Andy Lutomirski said, paraphrasing,
"what if we want alt shadow stacks someday, does the signal frame ABI
support it?". So I created an ABI that supports it and an initial POC,
and said let's hold off on the implementation for the first version and
just use the sigframe ABI that will allow it for the future. So the
point was to make sure the signal format supported alt shadow stacks to
make it easier in the future.

> 
> "sigaltshstk() is separate from sigaltstack(). You can have one
> without the other, neither or both together. Because the shadow
> stack specific state is pushed to the shadow stack, the two
> features don’t need to know about each other."
> 
> this means they cannot be changed together atomically.

Not sure why this is needed since they can be used separately. So why
tie them together?

> 
> i'd expect most sigaltstack users to want to be resilient
> against shadow stack overflow which means non-portable
> code changes.

Portable between architectures? Or between shadow stack vs non-shadow
stack?

It does seem like it would not be uncommon for users to want both
together, but see below.

> 
> i don't see why automatic alt shadow stack allocation would
> not work (kernel manages it transparently when an alt stack
> is installed or disabled).

Ah, I think I see where maybe I can fill you in. Andy Lutomirski had
discounted this idea out of hand originally, but I didn't see it at
first. sigaltstack lets you set, retrieve, or disable the alt stack,
right... But this doesn't allocate anything, it just sets where the
next signal will be handled. This is different than things like threads
where there is a new resources being allocated and it makes coming up
with logic to guess when to de-allocate the alt shadow stack difficult.
You probably already know...

But because of this there can be some modes where the shadow stack is
changed while on it. For one example, SS_AUTODISARM will disable the
alt shadow stack while switching to it and restore when sigreturning.
At which point a new altstack can be set. In the non-shadow stack case
this is nice because future signals won't clobber the alt stack if you
switch away from it (swapcontext(), etc). But it also means you can
"change" the alt stack while on it ("change" sort of, the auto disarm
results in the kernel forgetting it temporarily).

I hear where you are coming from with the desire to have it "just work"
with existing code, but I think the resulting ABI around the alt shadow
stack allocation lifecycle would be way too complicated even if it
could be made to work. Hence making a new interface. But also, the idea
was that the x86 signal ABI should support handling alt shadow stacks,
which is what we have done with this series. If a different interface
for configuring it is better than the one from the POC, I'm not seeing
a problem jump out. Is there any specific concern about backwards
compatibility here?

> 
> "Since shadow alt stacks are a new feature, longjmp()ing from an
> alt shadow stack will simply not be supported. If a libc wants
> to support this it will need to enable WRSS and write its own
> restore token."
> 
> i think longjmp should work without enabling writes to the shadow
> stack in the libc. this can also affect unwinding across signal
> handlers (not for c++ but e.g. glibc thread cancellation).

glibc today does not support longjmp()ing from a different stack (for
example, after a swapcontext()) when shadow stack is used. If
glibc used wrss it could be supported maybe, but otherwise I don't see
how the HW can support it.

HJ and I were actually just discussing this the other day. Are you
looking at this series with respect to the arm shadow stack feature by
any chance? I would love if glibc/tools would document what the shadow
stack limitations are. If all the arches have the same or similar
limitations, perhaps this could be one developer guide. For the most
part though, the limitations I've encountered are in glibc and the
kernel is more the building blocks.

> 
> i'd prefer overwriting the shadow stack top entry on overflow to
> disallowing longjmp out of a shadow stack overflow handler.
> 
> > > I think the map_shadow_stack syscall should be mentioned in this
> > > document too.
> > 
> > There is a man page prepared for this. I plan to update the docs to
> > reference it when it exists and not duplicate the text. There can
> > be a
> > blurb for the time being but it would be short lived.
> 
> i wanted to comment on the syscall because i think it may be better
> to have a magic mmap MAP_ flag that takes care of everything.
> 
> but i can go comment on the specific patch then.
> 
> thanks.

A general comment. Not sure if you are aware, but this shadow stack
enabling effort is quite old at this point and there have been many
discussions on these topics stretching back years. The latest
conversation was around getting this series into linux-next soon to get
some testing on the MM pieces. I really appreciate getting this ABI
feedback as it is always tricky to get right, but at this stage I would
hope to be focusing mostly on concrete problems.

I also expect to have some amount of ABI growth going forward with all
the normal things that entails. Shadow stack is not so special that it
can arrive fully finalized, without the iterative feedback process of
real-world usage. At some point we need to move forward with
something, and we have quite a bit of initial changes at this point.

So I would like to minimize the initial implementation unless anyone
sees any likely problems with future growth. Can you be clear if you
see any concrete problems at this point or are more looking to evaluate
the design reasoning? I'm under the assumption there is nothing that
would prohibit linux-next testing while any ABI shakedown happens
concurrently at least?

Thanks,
Rick
  
Szabolcs Nagy March 3, 2023, 4:30 p.m. UTC | #8
The 03/02/2023 21:17, Edgecombe, Rick P wrote:
> Is the idea that shadow stack would be forced on regardless of if the
> linked libraries support it? In which case it could be allowed to crash
> if they do not?

execute a binary
- with shstk enabled and locked (only if marked?).
- with shstk disabled and locked.
could be managed in userspace, but it is libc dependent then.

> > > > - I think it's better to have a new limit specifically for shadow
> > > >   stack size (which by default can be RLIMIT_STACK) so userspace
> > > >   can adjust it if needed (another reason is that stack size is
> > > >   not always a good indicator of max call depth).
> 
> Looking at this again, I'm not sure why a new rlimit is needed. It
> seems many of those points were just formulations of that the clone3
> stack size was not used, but it actually is and just not documented. If
> you disagree perhaps you could elaborate on what the requirements are
> and we can see if it seems tricky to do in a follow up.

- tiny thread stack and deep signal stack.
(note that this does not really work with glibc because it has
implementation-internal signals that don't run on the alt stack,
cannot be masked and don't fit on a tiny thread stack, but
with other runtimes this can be a valid use-case, e.g. musl
allows tiny thread stacks, < pagesize.)

- thread runtimes with clone (glibc uses clone3 but some dont).

- huge stacks but small call depth (problem if some va limit
  is hit or memory overcommit is disabled).

> > "sigaltshstk() is separate from sigaltstack(). You can have one
> > without the other, neither or both together. Because the shadow
> > stack specific state is pushed to the shadow stack, the two
> > features don’t need to know about each other."
...
> > i don't see why automatic alt shadow stack allocation would
> > not work (kernel manages it transparently when an alt stack
> > is installed or disabled).
> 
> Ah, I think I see where maybe I can fill you in. Andy Luto had
> discounted this idea out of hand originally, but I didn't see it at
> first. sigaltstack lets you set, retrieve, or disable the shadow stack,
> right... But this doesn't allocate anything, it just sets where the
> next signal will be handled. This is different than things like threads
> where there is a new resources being allocated and it makes coming up
> with logic to guess when to de-allocate the alt shadow stack difficult.
> You probably already know...
> 
> But because of this there can be some modes where the shadow stack is
> changed while on it. For one example, SS_AUTODISARM will disable the
> alt shadow stack while switching to it and restore when sigreturning.
> At which point a new altstack can be set. In the non-shadow stack case
> this is nice because future signals won't clobber the alt stack if you
> switch away from it (swapcontext(), etc). But it also means you can
> "change" the alt stack while on it ("change" sort of, the auto disarm
> results in the kernel forgetting it temporarily).

the problem with swapcontext is that it may unmask signals
that run on the alt stack, which means the code cannot jump
back after another signal clobbered the alt stack.

the non-standard SS_AUTODISARM aims to solve this by disabling
alt stack settings on signal entry until the handler returns.

so this use case is not about supporting swapcontext out, but
about jumping back. however that does not work reliably with
this patchset: if swapcontext goes to the thread stack (and
not to another stack e.g. used by makecontext), then jump back
fails. (and if there is a sigaltshstk installed then even jump
out fails.)

assuming
- jump out from alt shadow stack can be made to work.
- alt shadow stack management can be automatic.
then this can be improved so jump back works reliably.

> I hear where you are coming from with the desire to have it "just work"
> with existing code, but I think the resulting ABI around the alt shadow
> stack allocation lifecycle would be way too complicated even if it
> could be made to work. Hence making a new interface. But also, the idea
> was that the x86 signal ABI should support handling alt shadow stacks,
> which is what we have done with this series. If a different interface
> for configuring it is better than the one from the POC, I'm not seeing
> a problem jump out. Is there any specific concern about backwards
> compatibility here?

sigaltstack syscall behaviour may be hard to change later
and currently
- shadow stack overflow cannot be recovered from.
- longjmp out of signal handler fails (with sigaltshstk).
- SS_AUTODISARM does not work (jump back can fail).

> > "Since shadow alt stacks are a new feature, longjmp()ing from an
> > alt shadow stack will simply not be supported. If a libc want’s
> > to support this it will need to enable WRSS and write it’s own
> > restore token."
> > 
> > i think longjmp should work without enabling writes to the shadow
> > stack in the libc. this can also affect unwinding across signal
> > handlers (not for c++ but e.g. glibc thread cancellation).
> 
> glibc today does not support longjmp()ing from a different stack (for
> example even today after a swapcontext()) when shadow stack is used. If
> glibc used wrss it could be supported maybe, but otherwise I don't see
> how the HW can support it.
> 
> HJ and I were actually just discussing this the other day. Are you
> looking at this series with respect to the arm shadow stack feature by
> any chance? I would love if glibc/tools would document what the shadow
> stack limitations are. If the all the arch's have the same or similar
> limitations perhaps this could be one developer guide. For the most
> part though, the limitations I've encountered are in glibc and the
> kernel is more the building blocks.

well we hope that shadow stack behaviour and limitations can
be similar across targets.

longjmp to different stack should work: it can do the same as
setcontext/swapcontext: scan for the pivot token. then only
longjmp out of alt shadow stack fails. (this is non-conforming
longjmp use, but e.g. qemu relies on it.)

for longjmp out of alt shadow stack, the target shadow stack
needs a pivot token, which implies the kernel needs to push that
on signal entry, which can overflow. but i suspect that can be
handled the same way as stackoverflow on signal entry is handled.

> A general comment. Not sure if you are aware, but this shadow stack
> enabling effort is quite old at this point and there have been many
> discussions on these topics stretching back years. The latest
> conversation was around getting this series into linux-next soon to get
> some testing on the MM pieces. I really appreciate getting this ABI
> feedback as it is always tricky to get right, but at this stage I would
> hope to be focusing mostly on concrete problems.
> 
> I also expect to have some amount of ABI growth going forward with all
> the normal things that entails. Shadow stack is not special in that it
> can come fully finalized without the need for the real world usage
> iterative feedback process. At some point we need to move forward with
> something, and we have quite a bit of initial changes at this point.
> 
> So I would like to minimize the initial implementation unless anyone
> sees any likely problems with future growth. Can you be clear if you
> see any concrete problems at this point or are more looking to evaluate
> the design reasoning? I'm under the assumption there is nothing that
> would prohibit linux-next testing while any ABI shakedown happens
> concurrently at least?

understood.

the points that i think are worth raising:

- shadow stack size logic may need to change later.
  (it can be too big, or too small in practice.)
- shadow stack overflow is not recoverable and the
  possible fix for that (sigaltshstk) breaks longjmp
  out of signal handlers.
- jump back after SS_AUTODISARM swapcontext cannot be
  reliable if alt signal uses thread shadow stack.
- the above two concerns may be mitigated by different
  sigaltstack behaviour which may be hard to add later.
- end token for backtrace may be useful, if added
  later it can be hard to check.

thanks.
  
H.J. Lu March 3, 2023, 4:57 p.m. UTC | #9
On Fri, Mar 3, 2023 at 8:31 AM szabolcs.nagy@arm.com
<szabolcs.nagy@arm.com> wrote:
>
> The 03/02/2023 21:17, Edgecombe, Rick P wrote:
> > Is the idea that shadow stack would be forced on regardless of if the
> > linked libraries support it? In which case it could be allowed to crash
> > if they do not?
>
> execute a binary
> - with shstk enabled and locked (only if marked?).
> - with shstk disabled and locked.
> could be managed in userspace, but it is libc dependent then.
>
> > > > > - I think it's better to have a new limit specifically for shadow
> > > > >   stack size (which by default can be RLIMIT_STACK) so userspace
> > > > >   can adjust it if needed (another reason is that stack size is
> > > > >   not always a good indicator of max call depth).
> >
> > Looking at this again, I'm not sure why a new rlimit is needed. It
> > seems many of those points were just formulations of that the clone3
> > stack size was not used, but it actually is and just not documented. If
> > you disagree perhaps you could elaborate on what the requirements are
> > and we can see if it seems tricky to do in a follow up.
>
> - tiny thread stack and deep signal stack.
> (note that this does not really work with glibc because it has
> implementation internal signals that don't run on alt stack,
> cannot be masked and don't fit on a tiny thread stack, but
> with other runtimes this can be a valid use-case, e.g. musl
> allows tiny thread stacks, < pagesize.)
>
> - thread runtimes with clone (glibc uses clone3 but some dont).
>
> - huge stacks but small call depth (problem if some va limit
>   is hit or memory overcommit is disabled).
>
> > > "sigaltshstk() is separate from sigaltstack(). You can have one
> > > without the other, neither or both together. Because the shadow
> > > stack specific state is pushed to the shadow stack, the two
> > > features don’t need to know about each other."
> ...
> > > i don't see why automatic alt shadow stack allocation would
> > > not work (kernel manages it transparently when an alt stack
> > > is installed or disabled).
> >
> > Ah, I think I see where maybe I can fill you in. Andy Luto had
> > discounted this idea out of hand originally, but I didn't see it at
> > first. sigaltstack lets you set, retrieve, or disable the shadow stack,
> > right... But this doesn't allocate anything, it just sets where the
> > next signal will be handled. This is different than things like threads
> > where there is a new resources being allocated and it makes coming up
> > with logic to guess when to de-allocate the alt shadow stack difficult.
> > You probably already know...
> >
> > But because of this there can be some modes where the shadow stack is
> > changed while on it. For one example, SS_AUTODISARM will disable the
> > alt shadow stack while switching to it and restore when sigreturning.
> > At which point a new altstack can be set. In the non-shadow stack case
> > this is nice because future signals won't clobber the alt stack if you
> > switch away from it (swapcontext(), etc). But it also means you can
> > "change" the alt stack while on it ("change" sort of, the auto disarm
> > results in the kernel forgetting it temporarily).
>
> the problem with swapcontext is that it may unmask signals
> that run on the alt stack, which means the code cannot jump
> back after another signal clobbered the alt stack.
>
> the non-standard SS_AUTODISARM aims to solve this by disabling
> alt stack settings on signal entry until the handler returns.
>
> so this use case is not about supporting swapcontext out, but
> about jumping back. however that does not work reliably with
> this patchset: if swapcontext goes to the thread stack (and
> not to another stack e.g. used by makecontext), then jump back
> fails. (and if there is a sigaltshstk installed then even jump
> out fails.)
>
> assuming
> - jump out from alt shadow stack can be made to work.
> - alt shadow stack management can be automatic.
> then this can be improved so jump back works reliably.
>
> > I hear where you are coming from with the desire to have it "just work"
> > with existing code, but I think the resulting ABI around the alt shadow
> > stack allocation lifecycle would be way too complicated even if it
> > could be made to work. Hence making a new interface. But also, the idea
> > was that the x86 signal ABI should support handling alt shadow stacks,
> > which is what we have done with this series. If a different interface
> > for configuring it is better than the one from the POC, I'm not seeing
> > a problem jump out. Is there any specific concern about backwards
> > compatibility here?
>
> sigaltstack syscall behaviour may be hard to change later
> and currently
> - shadow stack overflow cannot be recovered from.
> - longjmp out of signal handler fails (with sigaltshstk).
> - SS_AUTODISARM does not work (jump back can fail).
>
> > > "Since shadow alt stacks are a new feature, longjmp()ing from an
> > > alt shadow stack will simply not be supported. If a libc want’s
> > > to support this it will need to enable WRSS and write it’s own
> > > restore token."
> > >
> > > i think longjmp should work without enabling writes to the shadow
> > > stack in the libc. this can also affect unwinding across signal
> > > handlers (not for c++ but e.g. glibc thread cancellation).
> >
> > glibc today does not support longjmp()ing from a different stack (for
> > example even today after a swapcontext()) when shadow stack is used. If
> > glibc used wrss it could be supported maybe, but otherwise I don't see
> > how the HW can support it.
> >
> > HJ and I were actually just discussing this the other day. Are you
> > looking at this series with respect to the arm shadow stack feature by
> > any chance? I would love if glibc/tools would document what the shadow
> > stack limitations are. If the all the arch's have the same or similar
> > limitations perhaps this could be one developer guide. For the most
> > part though, the limitations I've encountered are in glibc and the
> > kernel is more the building blocks.
>
> well we hope that shadow stack behaviour and limitations can
> be similar across targets.
>
> longjmp to different stack should work: it can do the same as
> setcontext/swapcontext: scan for the pivot token. then only
> longjmp out of alt shadow stack fails. (this is non-conforming
> longjmp use, but e.g. qemu relies on it.)

A restore token may not be used with longjmp.  Unlike setcontext/swapcontext,
longjmp is optional.  If longjmp isn't called, there will be an extra
token on the shadow stack and RET will fail.

> for longjmp out of alt shadow stack, the target shadow stack
> needs a pivot token, which implies the kernel needs to push that
> on signal entry, which can overflow. but i suspect that can be
> handled the same way as stackoverflow on signal entry is handled.
>
> > A general comment. Not sure if you are aware, but this shadow stack
> > enabling effort is quite old at this point and there have been many
> > discussions on these topics stretching back years. The latest
> > conversation was around getting this series into linux-next soon to get
> > some testing on the MM pieces. I really appreciate getting this ABI
> > feedback as it is always tricky to get right, but at this stage I would
> > hope to be focusing mostly on concrete problems.
> >
> > I also expect to have some amount of ABI growth going forward with all
> > the normal things that entails. Shadow stack is not special in that it
> > can come fully finalized without the need for the real world usage
> > iterative feedback process. At some point we need to move forward with
> > something, and we have quite a bit of initial changes at this point.
> >
> > So I would like to minimize the initial implementation unless anyone
> > sees any likely problems with future growth. Can you be clear if you
> > see any concrete problems at this point or are more looking to evaluate
> > the design reasoning? I'm under the assumption there is nothing that
> > would prohibit linux-next testing while any ABI shakedown happens
> > concurrently at least?
>
> understood.
>
> the points that i think are worth raising:
>
> - shadow stack size logic may need to change later.
>   (it can be too big, or too small in practice.)
> - shadow stack overflow is not recoverable and the
>   possible fix for that (sigaltshstk) breaks longjmp
>   out of signal handlers.
> - jump back after SS_AUTODISARM swapcontext cannot be
>   reliable if alt signal uses thread shadow stack.
> - the above two concerns may be mitigated by different
>   sigaltstack behaviour which may be hard to add later.
> - end token for backtrace may be useful, if added
>   later it can be hard to check.
>
> thanks.
  
Szabolcs Nagy March 3, 2023, 5:39 p.m. UTC | #10
The 03/03/2023 08:57, H.J. Lu wrote:
> On Fri, Mar 3, 2023 at 8:31 AM szabolcs.nagy@arm.com
> <szabolcs.nagy@arm.com> wrote:
> > longjmp to different stack should work: it can do the same as
> > setcontext/swapcontext: scan for the pivot token. then only
> > longjmp out of alt shadow stack fails. (this is non-conforming
> > longjmp use, but e.g. qemu relies on it.)
> 
> Restore token may not be used with longjmp.  Unlike setcontext/swapcontext,
> longjmp is optional.  If longjmp isn't called, there will be an extra
> token on shadow
> stack and RET will fail.

what do you mean longjmp is optional?

it can scan the target shadow stack and decide if it's the
same as the current one or not and in the latter case there
should be a restore token to switch to. then it can INCSSP
to reach the target SSP state.

qemu does setjmp, then swapcontext, then longjmp back.
swapcontext can change the stack, but leaves a token behind
so longjmp can switch back.
  
Edgecombe, Rick P March 3, 2023, 5:41 p.m. UTC | #11
On Fri, 2023-03-03 at 16:30 +0000, szabolcs.nagy@arm.com wrote:
> the points that i think are worth raising:
> 
> - shadow stack size logic may need to change later.
>   (it can be too big, or too small in practice.)

Looking at making it more efficient in the future seems great. But
since we are not in the position of being able to make shadow stacks
completely seamless (see below)

> - shadow stack overflow is not recoverable and the
>   possible fix for that (sigaltshstk) breaks longjmp
>   out of signal handlers.
> - jump back after SS_AUTODISARM swapcontext cannot be
>   reliable if alt signal uses thread shadow stack.
> - the above two concerns may be mitigated by different
>   sigaltstack behaviour which may be hard to add later.

Are you aware that you can't simply emit a restore token on x86 without
first restoring to another restore token? This is why (I'm assuming)
glibc uses incssp to implement longjmp instead of just jumping back to
the setjmp point with a shadow stack restore. So of course then longjmp
can't jump between shadow stacks. So there are sort of two categories
of restrictions on binaries that mark the SHSTK elf bit. The first
category is that they have to take special steps when switching stacks
or jumping around on the stack. Once they handle this, they can work
with shadow stack.

The second category is that they can't do certain patterns of jumping
around on stacks, regardless of the steps they take. So certain
previously allowed software patterns are now impossible, including ones
implemented in glibc. (And the exact restrictions on the glibc APIs are
not documented and this should be fixed).

If applications will violate either type of these restrictions they
should not mark the SHSTK elf bit.

Now that said, there is an exception to these restrictions on x86,
which is the WRSS instruction, which can write to the shadow stack. The
arch_prctl() interface allows this to be optionally enabled and locked.
The v2 signal analysis I pointed earlier, mentions how this might be
used by glibc to support more of the currently restricted patterns.
Please take a look if you haven't (section "setjmp()/longjmp()"). It
also explains why in the non-WRSS scenarios the kernel can't easily
help improve the situation.

WRSS opens up writing to the shadow stack, and so a glibc-WRSS mode
would be making a security/compatibility tradeoff. I think starting
with the more restricted mode was ultimately good in creating a kernel
ABI that can support both. If userspace could paper over ABI gaps with
WRSS, we might not have realized the issues we did.

> - end token for backtrace may be useful, if added
>   later it can be hard to check.

Yes this seems like a good idea. Thanks for the suggestion. I'm not
sure it can't be added later though. I'll POC it and do some more
thinking.
  
H.J. Lu March 3, 2023, 5:50 p.m. UTC | #12
On Fri, Mar 3, 2023 at 9:40 AM szabolcs.nagy@arm.com
<szabolcs.nagy@arm.com> wrote:
>
> The 03/03/2023 08:57, H.J. Lu wrote:
> > On Fri, Mar 3, 2023 at 8:31 AM szabolcs.nagy@arm.com
> > <szabolcs.nagy@arm.com> wrote:
> > > longjmp to different stack should work: it can do the same as
> > > setcontext/swapcontext: scan for the pivot token. then only
> > > longjmp out of alt shadow stack fails. (this is non-conforming
> > > longjmp use, but e.g. qemu relies on it.)
> >
> > Restore token may not be used with longjmp.  Unlike setcontext/swapcontext,
> > longjmp is optional.  If longjmp isn't called, there will be an extra
> > token on the shadow stack and RET will fail.
>
> what do you mean longjmp is optional?

In some cases, longjmp is called to handle an error condition and
longjmp won't be called if there is no error.

> it can scan the target shadow stack and decide if it's the
> same as the current one or not and in the latter case there
> should be a restore token to switch to. then it can INCSSP
> to reach the target SSP state.
>
> qemu does setjmp, then swapcontext, then longjmp back.
> swapcontext can change the stack, but leaves a token behind
> so longjmp can switch back.

This needs changes to support shadow stack.  Replacing setjmp with
getcontext and longjmp with setcontext may work for shadow stack.

BTW, there is no testcase in glibc for this usage.
  
Edgecombe, Rick P March 3, 2023, 10:35 p.m. UTC | #13
On Thu, 2023-03-02 at 16:34 +0000, szabolcs.nagy@arm.com wrote:
> > Alternatively, the thread shadow stacks could get an already used
> > token
> > pushed at the end, to try to match what an in-use map_shadow_stack
> > shadow stack would look like. Then the backtracing algorithm could
> > just
> > look for the same token in both cases. It might get confused in
> > exotic
> > cases and mistake a token in the middle of the stack for the end of
> > the
> > allocation though. Hmm...
> 
> a backtracer would search for an end token on an active shadow
> stack. it should be able to skip other tokens that don't seem
> to be code addresses. the end token needs to be identifiable
> and not break security properties. i think it's enough if the
> backtrace is best effort correct, there can be corner-cases when
> shadow stack is difficult to interpret, but e.g. a profiler can
> still make good use of this feature.

So just taking a look at this and remembering we used to have an
arch_prctl() that returned the thread's shadow stack base and size.
Glibc needed it, but we found a way around and dropped it. If we added
something like that back, then it could be used for backtracing in the
typical thread case and also potentially similar things to what glibc
was doing. This also saves ~8 bytes per shadow stack over an end-of-
stack marker, so it's a tiny bit better on memory use.

For the end-of-stack-marker solution:
In the case of thread shadow stacks, I'm not seeing any issues testing
adding markers at the end. So adding this on top of the existing series
for just thread shadow stacks seems to be a lower regression risk,
especially if we do it in the near term.

For ucontext/map_shadow_stack, glibc expects a token to be at the size
passed in. So we would either have to create a larger allocation (to
include the marker) or create a new map_shadow_stack flag to do this
(it was expected that there might be new types of initial shadow stack
data that the kernel might need to create). It is also possible to pass
a non-page-aligned size and get zeros at the end of the allocation. In
fact glibc does this today in the common case. So that is also an
option.

I think I slightly prefer the former arch_prctl() based solution for a
few reasons:
 - When you need to find the start or end of the shadow stack, you
can just ask for it instead of searching. It can be faster and simpler.
 - It saves 8 bytes of memory per shadow stack.

If this turns out to be wrong and we want to do the marker solution
much later at some point, the safest option would probably be to create
new flags.

But just discussing this with HJ, can you share more on what the usage
is? Like which backtracing operation specifically needs the marker? How
much does it care about the ucontext case?
  
Szabolcs Nagy March 6, 2023, 4:20 p.m. UTC | #14
The 03/03/2023 22:35, Edgecombe, Rick P wrote:
> I think I slightly prefer the former arch_prctl() based solution for a
> few reasons:
>  - When you need to find the start or end of the shadow stack, you
> can just ask for it instead of searching. It can be faster and simpler.
>  - It saves 8 bytes of memory per shadow stack.
> 
> If this turns out to be wrong and we want to do the marker solution
> much later at some point, the safest option would probably be to create
> new flags.

i see two problems with a get bounds syscall:

- syscall overhead.

- discontinuous shadow stack (e.g. alt shadow stack ends with a
  pointer to the interrupted thread shadow stack, so stack trace
  can continue there, except you don't know the bounds of that).

> But just discussing this with HJ, can you share more on what the usage
> is? Like which backtracing operation specifically needs the marker? How
> much does it care about the ucontext case?

it could be an option for perf or ptracers to sample the stack trace.

in-process collection of stack trace for profiling or crash reporting
(e.g. when stack is corrupted) or cross checking stack integrity may
use it too.

sometimes parsing /proc/self/smaps may be enough, but the idea was to
enable light-weight backtrace collection in an async-signal-safe way.

syscall overhead in case of frequent stack trace collection can be
avoided by caching (in tls) when ssp falls within the thread shadow
stack bounds. otherwise caching does not work as the shadow stack may
be reused (alt shadow stack or ucontext case).

unfortunately i don't know if syscall overhead is actually a problem
(probably not) or if backtrace across signal handlers need to work
with alt shadow stack (i guess it should work for crash reporting).

thanks.
  
Florian Weimer March 6, 2023, 4:31 p.m. UTC | #15
* szabolcs:

> syscall overhead in case of frequent stack trace collection can be
> avoided by caching (in tls) when ssp falls within the thread shadow
> stack bounds. otherwise caching does not work as the shadow stack may
> be reused (alt shadow stack or ucontext case).

Do we need to perform the system call at each page boundary only?  That
should reduce overhead to the degree that it should not matter.

> unfortunately i don't know if syscall overhead is actually a problem
> (probably not) or if backtrace across signal handlers need to work
> with alt shadow stack (i guess it should work for crash reporting).

Ideally, we would implement the backtrace function (in glibc) as just a
shadow stack copy.  But this needs to follow the chain of alternate
stacks, and it may also need some form of markup for signal handler
frames (which need program counter adjustment to reflect that a
*non-signal* frame is conceptually nested within the previous
instruction, and not the function the return address points to).  But I
think we can add support for this incrementally.

I assume there is no desire at all on the kernel side that sigaltstack
transparently allocates the shadow stack?  Because there is no
deallocation function today for sigaltstack?

Thanks,
Florian
  
Edgecombe, Rick P March 6, 2023, 6:05 p.m. UTC | #16
+Kan for shadow stack perf discussion.

On Mon, 2023-03-06 at 16:20 +0000, szabolcs.nagy@arm.com wrote:
> The 03/03/2023 22:35, Edgecombe, Rick P wrote:
> > I think I slightly prefer the former arch_prctl() based solution
> > for a
> > few reasons:
> >   - When you need to find the start or end of the shadow stack, you
> > can just ask for it instead of searching. It can be faster and
> > simpler.
> >   - It saves 8 bytes of memory per shadow stack.
> > 
> > If this turns out to be wrong and we want to do the marker solution
> > much later at some point, the safest option would probably be to
> > create
> > new flags.
> 
> i see two problems with a get bounds syscall:
> 
> - syscall overhead.
> 
> - discontinuous shadow stack (e.g. alt shadow stack ends with a
>   pointer to the interrupted thread shadow stack, so stack trace
>   can continue there, except you don't know the bounds of that).
> 
> > But just discussing this with HJ, can you share more on what the
> > usage
> > is? Like which backtracing operation specifically needs the marker?
> > How
> > much does it care about the ucontext case?
> 
> it could be an option for perf or ptracers to sample the stack trace.
> 
> in-process collection of stack trace for profiling or crash reporting
> (e.g. when stack is corrupted) or cross checking stack integrity may
> use it too.
> 
> sometimes parsing /proc/self/smaps may be enough, but the idea was to
> enable light-weight backtrace collection in an async-signal-safe way.
> 
> syscall overhead in case of frequent stack trace collection can be
> avoided by caching (in tls) when ssp falls within the thread shadow
> stack bounds. otherwise caching does not work as the shadow stack may
> be reused (alt shadow stack or ucontext case).
> 
> unfortunately i don't know if syscall overhead is actually a problem
> (probably not) or if backtrace across signal handlers need to work
> with alt shadow stack (i guess it should work for crash reporting).

There was a POC done of perf integration. I'm not too knowledgeable on
perf, but the patch itself didn't need any new shadow stack bounds ABI.
Since it was implemented in the kernel, it could just refer to the
kernel's internal data for the thread's shadow stack bounds.

I asked about ucontext (similar to alt shadow stacks in regards to lack
of bounds ABI), and apparently perf usually focuses on the thread
stacks. Hopefully Kan can lend some more confidence to that assertion.
  
Edgecombe, Rick P March 6, 2023, 6:08 p.m. UTC | #17
On Mon, 2023-03-06 at 17:31 +0100, Florian Weimer wrote:
> Ideally, we would implement the backtrace function (in glibc) as just
> a
> shadow stack copy.  But this needs to follow the chain of alternate
> stacks, and it may also need some form of markup for signal handler
> frames (which need program counter adjustment to reflect that a
> *non-signal* frame is conceptually nested within the previous
> instruction, and not the function the return address points to).

In the alt shadow stack case, the shadow stack sigframe will have a
special shadow stack frame with a pointer to the shadow stack it
came from. This may be a thread stack, or some other stack. This
writeup in the v2 of the series has more details and analysis on the
signal piece:

https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@intel.com/

So in that design, you should be able to backtrace out of a chain of
alt stacks.

>   But I
> think we can add support for this incrementally.

Yea, I think so too.

> 
> I assume there is no desire at all on the kernel side that
> sigaltstack
> transparently allocates the shadow stack?  

It could have some nice benefit for some apps, so I did look into it.

> Because there is no
> deallocation function today for sigaltstack?

Yea, this is why we can't do it transparently. There was some
discussion up the thread on this.
  
Liang, Kan March 6, 2023, 8:31 p.m. UTC | #18
On 2023-03-06 1:05 p.m., Edgecombe, Rick P wrote:
> +Kan for shadow stack perf discussion.
> 
> On Mon, 2023-03-06 at 16:20 +0000, szabolcs.nagy@arm.com wrote:
>> The 03/03/2023 22:35, Edgecombe, Rick P wrote:
>>> I think I slightly prefer the former arch_prctl() based solution
>>> for a
>>> few reasons:
>>>   - When you need to find the start or end of the shadow stack, you
>>> can just ask for it instead of searching. It can be faster and
>>> simpler.
>>>   - It saves 8 bytes of memory per shadow stack.
>>>
>>> If this turns out to be wrong and we want to do the marker solution
>>> much later at some point, the safest option would probably be to
>>> create
>>> new flags.
>>
>> i see two problems with a get bounds syscall:
>>
>> - syscall overhead.
>>
>> - discontinuous shadow stack (e.g. alt shadow stack ends with a
>>   pointer to the interrupted thread shadow stack, so stack trace
>>   can continue there, except you don't know the bounds of that).
>>
>>> But just discussing this with HJ, can you share more on what the
>>> usage
>>> is? Like which backtracing operation specifically needs the marker?
>>> How
>>> much does it care about the ucontext case?
>>
>> it could be an option for perf or ptracers to sample the stack trace.
>>
>> in-process collection of stack trace for profiling or crash reporting
>> (e.g. when stack is corrupted) or cross checking stack integrity may
>> use it too.
>>
>> sometimes parsing /proc/self/smaps may be enough, but the idea was to
>> enable light-weight backtrace collection in an async-signal-safe way.
>>
>> syscall overhead in case of frequent stack trace collection can be
>> avoided by caching (in tls) when ssp falls within the thread shadow
>> stack bounds. otherwise caching does not work as the shadow stack may
>> be reused (alt shadow stack or ucontext case).
>>
>> unfortunately i don't know if syscall overhead is actually a problem
>> (probably not) or if backtrace across signal handlers need to work
>> with alt shadow stack (i guess it should work for crash reporting).
> 
> There was a POC done of perf integration. I'm not too knowledgeable on
> perf, but the patch itself didn't need any new shadow stack bounds ABI.
> Since it was implemented in the kernel, it could just refer to the
> kernel's internal data for the thread's shadow stack bounds.
> 
> I asked about ucontext (similar to alt shadow stacks in regards to lack
> of bounds ABI), and apparently perf usually focuses on the thread
> stacks. Hopefully Kan can lend some more confidence to that assertion.

The POC perf patch I implemented tries to use the shadow stack to
replace the frame pointer to construct a callchain of a user space
thread. Yes, it's in the kernel, perf_callchain_user(). I don't think
the current X86 perf implementation handles the alt stack either. So the
kernel internal data for the thread's shadow stack bounds should be good
enough for the perf case.

Thanks,
Kan
  
Szabolcs Nagy March 7, 2023, 1:03 p.m. UTC | #19
The 03/06/2023 18:08, Edgecombe, Rick P wrote:
> On Mon, 2023-03-06 at 17:31 +0100, Florian Weimer wrote:
> > I assume there is no desire at all on the kernel side that
> > sigaltstack
> > transparently allocates the shadow stack?  
> 
> It could have some nice benefit for some apps, so I did look into it.
> 
> > Because there is no
> > deallocation function today for sigaltstack?
> 
> Yea, this is why we can't do it transparently. There was some
> discussion up the thread on this.

changing/disabling the alt stack is not valid while a handler is
executing on it. if we don't allow jumping out and back to an
alt stack (swapcontext) then there can be only one alt stack
live per thread and change/disable can do the shadow stack free.

if jump back is allowed (linux even makes it race-free with
SS_AUTODISARM) then the life-time of alt stack is extended
beyond change/disable (jump back to an unregistered alt stack).

to support jump back to an alt stack the requirements are

1) user has to manage an alt shadow stack together with the alt
   stack (requires user code change, not just libc).

2) kernel has to push a restore token on the thread shadow stack
   on signal entry (at least in case of alt shadow stack, and
   deal with corner cases around shadow stack overflow).
  
Florian Weimer March 7, 2023, 2 p.m. UTC | #20
* szabolcs:

> changing/disabling the alt stack is not valid while a handler is
> executing on it. if we don't allow jumping out and back to an
> alt stack (swapcontext) then there can be only one alt stack
> live per thread and change/disable can do the shadow stack free.
>
> if jump back is allowed (linux even makes it race-free with
> SS_AUTODISARM) then the life-time of alt stack is extended
> beyond change/disable (jump back to an unregistered alt stack).
>
> to support jump back to an alt stack the requirements are
>
> 1) user has to manage an alt shadow stack together with the alt
> >    stack (requires user code change, not just libc).
>
> 2) kernel has to push a restore token on the thread shadow stack
>    on signal entry (at least in case of alt shadow stack, and
>    deal with corner cases around shadow stack overflow).

We need to have a story for stackful coroutine switching as well, not
just for sigaltstack.  I hope that we can use OpenJDK (Project Loom) and
QEMU as guinea pigs.  If we have something that works for both,
hopefully that covers a broad range of scenarios.  Userspace
coordination can eventually be handled by glibc; we can deallocate
alternate stacks on thread exit fairly easily (at least compared to the
current stack 8-).

Thanks,
Florian
  
Szabolcs Nagy March 7, 2023, 4:14 p.m. UTC | #21
The 03/07/2023 15:00, Florian Weimer wrote:
> * szabolcs:
> 
> > changing/disabling the alt stack is not valid while a handler is
> > executing on it. if we don't allow jumping out and back to an
> > alt stack (swapcontext) then there can be only one alt stack
> > live per thread and change/disable can do the shadow stack free.
> >
> > if jump back is allowed (linux even makes it race-free with
> > SS_AUTODISARM) then the life-time of alt stack is extended
> > beyond change/disable (jump back to an unregistered alt stack).
> >
> > to support jump back to an alt stack the requirements are
> >
> > 1) user has to manage an alt shadow stack together with the alt
> > >    stack (requires user code change, not just libc).
> >
> > 2) kernel has to push a restore token on the thread shadow stack
> >    on signal entry (at least in case of alt shadow stack, and
> >    deal with corner cases around shadow stack overflow).
> 
> We need to have a story for stackful coroutine switching as well, not
> just for sigaltstack.  I hope that we can use OpenJDK (Project Loom) and
> QEMU as guinea pigs.  If we have something that works for both,
> hopefully that covers a broad range of scenarios.  Userspace
> coordination can eventually be handled by glibc; we can deallocate
> alternate stacks on thread exit fairly easily (at least compared to the
> current stack 8-).

for stackful coroutines we just need a way to

- allocate a shadow stack with a restore token on it.

- switch to a target shadow stack with a restore token on it,
  while leaving behind a restore token on the old shadow stack.

this is supported via map_shadow_stack syscall and the
  rstorssp, saveprevssp instruction pair.

otoh there can be many alt shadow stacks per thread alive if
we allow jump back (only one of them registered at a time) in
fact they can be jumped to even from another thread, so their
life-time is not tied to the thread (at least if we allow
swapcontext across threads) so i think the libc cannot manage
the alt shadow stacks, only user code can in the general case.

and in case a signal runs on an alt shadow stack, the restore
token can only be placed by the kernel on the old shadow stack.

thanks.
  

Patch

diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index c73d133fd37c..8ac64d7de4dc 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -22,6 +22,7 @@  x86-specific Documentation
    mtrr
    pat
    intel-hfi
+   shstk
    iommu
    intel_txt
    amd-memory-encryption
diff --git a/Documentation/x86/shstk.rst b/Documentation/x86/shstk.rst
new file mode 100644
index 000000000000..f2e6f323cf68
--- /dev/null
+++ b/Documentation/x86/shstk.rst
@@ -0,0 +1,166 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Control-flow Enforcement Technology (CET) Shadow Stack
+======================================================
+
+CET Background
+==============
+
+Control-flow Enforcement Technology (CET) is a term referring to several
+related x86 processor features that provide protection against control-flow
+hijacking attacks. The HW feature itself can be set up to protect
+both applications and the kernel.
+
+CET introduces shadow stack and indirect branch tracking (IBT). A shadow stack
+is a secondary stack allocated from memory, which cannot be directly modified
+by applications. When executing a CALL instruction, the processor pushes the
+return address to both the normal stack and the shadow stack. Upon
+function return, the processor pops the shadow stack copy and compares it
+to the normal stack copy. If the two differ, the processor raises a
+control-protection fault. IBT verifies that indirect CALL/JMP targets are
+intended, as marked by the compiler with 'ENDBR' opcodes. Not all CPUs have
+both shadow stack and indirect branch tracking. Today in the 64-bit kernel,
+only userspace shadow stack and kernel IBT are supported.
+
+Requirements to use Shadow Stack
+================================
+
+To use userspace shadow stack you need HW that supports it, a kernel
+configured with it, and userspace libraries compiled with it.
+
+The kernel Kconfig option is X86_USER_SHADOW_STACK, and it can be disabled
+with the kernel parameter: nousershstk.
+
+To build a user shadow stack enabled kernel, Binutils v2.29 or LLVM v6 or later
+are required.
+
+At run time, /proc/cpuinfo shows CET features if the processor supports
+CET. "user_shstk" means that userspace shadow stack is supported on the current
+kernel and HW.
+
+Application Enabling
+====================
+
+An application's CET capability is marked in its ELF note and can be verified
+from readelf/llvm-readelf output::
+
+    readelf -n <application> | grep -a SHSTK
+        properties: x86 feature: SHSTK
+
+The kernel does not process these application markers directly. Applications
+or loaders must enable CET features using the interface described in the next
+section. Typically this would be done in the dynamic loader or static runtime
+objects, as is the case in glibc.
+
+Enabling arch_prctl()'s
+=======================
+
+ELF features should be enabled by the loader using the below arch_prctl()s.
+They are only supported in 64-bit user applications.
+
+arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
+    Enable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
+    Disable a single feature specified in 'feature'. Can only operate on
+    one feature at a time.
+
+arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
+    Lock in features at their current enabled or disabled status. 'features'
+    is a mask of all features to lock. All bits set are processed, unset bits
+    are ignored. The mask is ORed with the existing value. So any feature bits
+    set here cannot be enabled or disabled afterwards.
+
+The return values are as follows. On success, return 0. On error, errno can
+be::
+
+        -EPERM if any of the passed features are locked.
+        -ENOTSUPP if the feature is not supported by the hardware or
+         kernel.
+        -EINVAL invalid arguments (non-existing feature, etc.)
+
+The feature bits supported are::
+
+    ARCH_SHSTK_SHSTK - Shadow stack
+    ARCH_SHSTK_WRSS  - WRSS
+
+Currently shadow stack and WRSS are supported via this interface. WRSS
+can only be enabled with shadow stack, and is automatically disabled
+if shadow stack is disabled.
+
+Proc Status
+===========
+
+To check if an application is actually running with shadow stack, the
+user can read /proc/$PID/status. It will report "wrss" or "shstk"
+depending on what is enabled. The lines look like this::
+
+    x86_Thread_features: shstk wrss
+    x86_Thread_features_locked: shstk wrss
+
+Implementation of the Shadow Stack
+==================================
+
+Shadow Stack Size
+-----------------
+
+A task's shadow stack is allocated from memory with a fixed size of
+MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
+the maximum size of the normal stack, but capped to 4 GB. However, since
+a compat-mode application's address space is smaller, each of its threads'
+shadow stack size is MIN(1/4 RLIMIT_STACK, 4 GB).
+
+Signal
+------
+
+By default, the main program and its signal handlers use the same shadow
+stack. Because the shadow stack stores only return addresses, a large
+shadow stack covers the case where both the program stack and the
+alternate signal stack run out.
+
+When a signal happens, the old pre-signal state is pushed on the stack. When
+shadow stack is enabled, the shadow stack specific state is pushed onto the
+shadow stack. Today this is only the old SSP (shadow stack pointer), pushed
+in a special format with bit 63 set. On sigreturn this old SSP token is
+verified and restored by the kernel. The kernel will also push the normal
+restorer address to the shadow stack to help userspace avoid a shadow stack
+violation on the sigreturn path that goes through the restorer.
+
+So the shadow stack signal frame format is as follows::
+
+    |1...old SSP| - Pointer to old pre-signal ssp in sigframe token format
+                    (bit 63 set to 1)
+    |        ...| - Other state may be added in the future
+
+
+32-bit ABI signals are not supported in shadow stack processes. Linux prevents
+32-bit execution while shadow stack is enabled by allocating shadow stacks
+outside of the 32-bit address space. When execution enters 32-bit mode, either
+via far call or returning to userspace, a #GP is generated by the hardware,
+which will be delivered to the process as a segfault. When transitioning to
+userspace, the register state will be as if the userspace ip being returned to
+caused the segfault.
+
+Fork
+----
+
+The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
+to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
+shadow access triggers a page fault with the shadow stack access bit set
+in the page fault error code.
+
+When a task forks a child, its shadow stack PTEs are copied and both the
+parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+Upon the next shadow stack access, the resulting shadow stack page fault
+is handled by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new shadow stack
+for the new thread. New shadow stacks behave like mmap() with respect to
+ASLR behavior.
+
+Exec
+----
+
+On exec, shadow stack features are disabled by the kernel, at which point
+userspace can choose to re-enable or lock them.