[v8,00/38] arm64/gcs: Provide support for GCS in userspace

Message ID 20240203-arm64-gcs-v8-0-c9fec77673ef@kernel.org
Headers
Series arm64/gcs: Provide support for GCS in userspace |

Message

Mark Brown Feb. 3, 2024, 12:25 p.m. UTC
  The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling.

When GCS is active a secondary stack called the Guarded Control Stack is
maintained, protected with a memory attribute which means that it can
only be written with specific GCS operations.  The current GCS pointer
can not be directly written to by userspace.  When a BL is executed the
value stored in LR is also pushed onto the GCS, and when a RET is
executed the top of the GCS is popped and compared to LR with a fault
being raised if the values do not match.  GCS operations may only be
performed on GCS pages, a data abort is generated if they are not.

The combination of hardware enforcement and lack of extra instructions
in the function entry and exit paths should result in something which
has less overhead and is more difficult to attack than a purely software
implementation like clang's shadow stacks.

This series implements support for use of GCS by userspace, along with
support for use of GCS within KVM guests.  It does not enable use of GCS
by either EL1 or EL2, this will be implemented separately.  Executables
are started without GCS and must use a prctl() to enable it, it is
expected that this will be done very early in application execution by
the dynamic linker or other startup code.  For dynamic linking this will
be done by checking that everything in the executable is marked as GCS
compatible.

x86 has an equivalent feature called shadow stacks, this series depends
on the x86 patches for generic memory management support for the new
guarded/shadow stack page type and shares APIs as much as possible.  As
there has been extensive discussion with the wider community around the
ABI for shadow stacks I have as far as practical kept implementation
decisions close to those for x86, anticipating that review would lead to
similar conclusions in the absence of strong reasoning for divergence.

The main divergence I am concious of is that x86 allows shadow stack to
be enabled and disabled repeatedly, freeing the shadow stack for the
thread whenever disabled, while this implementation keeps the GCS
allocated after disable but refuses to reenable it.  This is to avoid
races with things actively walking the GCS during a disable, we do
anticipate that some systems will wish to disable GCS at runtime but are
not aware of any demand for subsequently reenabling it.

x86 uses an arch_prctl() to manage enable and disable, since only x86
and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a
patch set for the equivalent RISC-V Zicfiss feature which I initially
adopted fairly directly but following review feedback has been revised
quite a bit.

We currently maintain the x86 pattern of implicitly allocating a shadow
stack for threads started with shadow stack enabled, there has been some
discussion of removing this support and requiring the use of clone3()
with explicit allocation of shadow stacks instead.  I have no strong
feelings either way, implicit allocation is not really consistent with
anything else we do and creates the potential for errors around thread
exit but on the other hand it is existing ABI on x86 and minimises the
changes needed in userspace code.

There is an open issue with support for CRIU, on x86 this required the
ability to set the GCS mode via ptrace.  This series supports
configuring mode bits other than enable/disable via ptrace but it needs
to be confirmed if this is sufficient.

The series depends on support for shadow stacks in clone3(), that series
includes the addition of ARCH_HAS_USER_SHADOW_STACK.

   https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@kernel.org

It also depends on the addition of more waitpid() flags to nolibc:

   https://lore.kernel.org/r/20231023-nolibc-waitpid-flags-v2-1-b09d096f091f@kernel.org

You can see a branch with the full set of dependencies against Linus'
tree at:

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc.git arm64-gcs

[1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/

Signed-off-by: Mark Brown <broonie@kernel.org>
---
Changes in v8:
- Invalidate signal cap token on stack when consuming.
- Typo and other trivial fixes.
- Don't try to use process_vm_write() on GCS, it intentionally does not
  work.
- Fix leak of thread GCSs.
- Rebase onto latest clone3() series.
- Link to v7: https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org

Changes in v7:
- Rebase onto v6.7-rc2 via the clone3() patch series.
- Change the token used to cap the stack during signal handling to be
  compatible with GCSPOPM.
- Fix flags for new page types.
- Fold in support for clone3().
- Replace copy_to_user_gcs() with put_user_gcs().
- Link to v6: https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org

Changes in v6:
- Rebase onto v6.6-rc3.
- Add some more gcsb_dsync() barriers following spec clarifications.
- Due to ongoing discussion around clone()/clone3() I've not updated
  anything there, the behaviour is the same as on previous versions.
- Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org

Changes in v5:
- Don't map any permissions for user GCSs, we always use EL0 accessors
  or use a separate mapping of the page.
- Reduce the standard size of the GCS to RLIMIT_STACK/2.
- Enforce a PAGE_SIZE alignment requirement on map_shadow_stack().
- Clarifications and fixes to documentation.
- More tests.
- Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org

Changes in v4:
- Implement flags for map_shadow_stack() allowing the cap and end of
  stack marker to be enabled independently or not at all.
- Relax size and alignment requirements for map_shadow_stack().
- Add more blurb explaining the advantages of hardware enforcement.
- Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org

Changes in v3:
- Rebase onto v6.5-rc4.
- Add a GCS barrier on context switch.
- Add a GCS stress test.
- Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org

Changes in v2:
- Rebase onto v6.5-rc3.
- Rework prctl() interface to allow each bit to be locked independently.
- map_shadow_stack() now places the cap token based on the size
  requested by the caller not the actual space allocated.
- Mode changes other than enable via ptrace are now supported.
- Expand test coverage.
- Various smaller fixes and adjustments.
- Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org

---
Mark Brown (38):
      arm64/mm: Restructure arch_validate_flags() for extensibility
      prctl: arch-agnostic prctl for shadow stack
      mman: Add map_shadow_stack() flags
      arm64: Document boot requirements for Guarded Control Stacks
      arm64/gcs: Document the ABI for Guarded Control Stacks
      arm64/sysreg: Add definitions for architected GCS caps
      arm64/gcs: Add manual encodings of GCS instructions
      arm64/gcs: Provide put_user_gcs()
      arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS)
      arm64/mm: Allocate PIE slots for EL0 guarded control stack
      mm: Define VM_SHADOW_STACK for arm64 when we support GCS
      arm64/mm: Map pages for guarded control stack
      KVM: arm64: Manage GCS registers for guests
      arm64/gcs: Allow GCS usage at EL0 and EL1
      arm64/idreg: Add overrride for GCS
      arm64/hwcap: Add hwcap for GCS
      arm64/traps: Handle GCS exceptions
      arm64/mm: Handle GCS data aborts
      arm64/gcs: Context switch GCS state for EL0
      arm64/gcs: Ensure that new threads have a GCS
      arm64/gcs: Implement shadow stack prctl() interface
      arm64/mm: Implement map_shadow_stack()
      arm64/signal: Set up and restore the GCS context for signal handlers
      arm64/signal: Expose GCS state in signal frames
      arm64/ptrace: Expose GCS via ptrace and core files
      arm64: Add Kconfig for Guarded Control Stack (GCS)
      kselftest/arm64: Verify the GCS hwcap
      kselftest/arm64: Add GCS as a detected feature in the signal tests
      kselftest/arm64: Add framework support for GCS to signal handling tests
      kselftest/arm64: Allow signals tests to specify an expected si_code
      kselftest/arm64: Always run signals tests with GCS enabled
      kselftest/arm64: Add very basic GCS test program
      kselftest/arm64: Add a GCS test program built with the system libc
      kselftest/arm64: Add test coverage for GCS mode locking
      selftests/arm64: Add GCS signal tests
      kselftest/arm64: Add a GCS stress test
      kselftest/arm64: Enable GCS for the FP stress tests
      kselftest: Provide shadow stack enable helpers for arm64

 Documentation/admin-guide/kernel-parameters.txt    |   6 +
 Documentation/arch/arm64/booting.rst               |  22 +
 Documentation/arch/arm64/elf_hwcaps.rst            |   3 +
 Documentation/arch/arm64/gcs.rst                   | 233 +++++++
 Documentation/arch/arm64/index.rst                 |   1 +
 Documentation/filesystems/proc.rst                 |   2 +-
 arch/arm64/Kconfig                                 |  20 +
 arch/arm64/include/asm/cpufeature.h                |   6 +
 arch/arm64/include/asm/el2_setup.h                 |  17 +
 arch/arm64/include/asm/esr.h                       |  28 +-
 arch/arm64/include/asm/exception.h                 |   2 +
 arch/arm64/include/asm/gcs.h                       | 107 +++
 arch/arm64/include/asm/hwcap.h                     |   1 +
 arch/arm64/include/asm/kvm_arm.h                   |   4 +-
 arch/arm64/include/asm/kvm_host.h                  |  12 +
 arch/arm64/include/asm/mman.h                      |  23 +-
 arch/arm64/include/asm/pgtable-prot.h              |  14 +-
 arch/arm64/include/asm/processor.h                 |   7 +
 arch/arm64/include/asm/sysreg.h                    |  20 +
 arch/arm64/include/asm/uaccess.h                   |  40 ++
 arch/arm64/include/uapi/asm/hwcap.h                |   1 +
 arch/arm64/include/uapi/asm/ptrace.h               |   8 +
 arch/arm64/include/uapi/asm/sigcontext.h           |   9 +
 arch/arm64/kernel/cpufeature.c                     |  19 +
 arch/arm64/kernel/cpuinfo.c                        |   1 +
 arch/arm64/kernel/entry-common.c                   |  23 +
 arch/arm64/kernel/idreg-override.c                 |   2 +
 arch/arm64/kernel/process.c                        |  85 +++
 arch/arm64/kernel/ptrace.c                         |  59 ++
 arch/arm64/kernel/signal.c                         | 242 ++++++-
 arch/arm64/kernel/traps.c                          |  11 +
 arch/arm64/kvm/emulate-nested.c                    |   4 +
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h         |  17 +
 arch/arm64/kvm/sys_regs.c                          |  22 +
 arch/arm64/mm/Makefile                             |   1 +
 arch/arm64/mm/fault.c                              |  79 ++-
 arch/arm64/mm/gcs.c                                | 300 +++++++++
 arch/arm64/mm/mmap.c                               |  13 +-
 arch/arm64/tools/cpucaps                           |   1 +
 arch/x86/include/uapi/asm/mman.h                   |   3 -
 fs/proc/task_mmu.c                                 |   3 +
 include/linux/mm.h                                 |  16 +-
 include/uapi/asm-generic/mman.h                    |   4 +
 include/uapi/linux/elf.h                           |   1 +
 include/uapi/linux/prctl.h                         |  22 +
 kernel/sys.c                                       |  30 +
 tools/testing/selftests/arm64/Makefile             |   2 +-
 tools/testing/selftests/arm64/abi/hwcap.c          |  19 +
 tools/testing/selftests/arm64/fp/assembler.h       |  15 +
 tools/testing/selftests/arm64/fp/fpsimd-test.S     |   2 +
 tools/testing/selftests/arm64/fp/sve-test.S        |   2 +
 tools/testing/selftests/arm64/fp/za-test.S         |   2 +
 tools/testing/selftests/arm64/fp/zt-test.S         |   2 +
 tools/testing/selftests/arm64/gcs/.gitignore       |   5 +
 tools/testing/selftests/arm64/gcs/Makefile         |  24 +
 tools/testing/selftests/arm64/gcs/asm-offsets.h    |   0
 tools/testing/selftests/arm64/gcs/basic-gcs.c      | 428 ++++++++++++
 tools/testing/selftests/arm64/gcs/gcs-locking.c    | 200 ++++++
 .../selftests/arm64/gcs/gcs-stress-thread.S        | 311 +++++++++
 tools/testing/selftests/arm64/gcs/gcs-stress.c     | 532 +++++++++++++++
 tools/testing/selftests/arm64/gcs/gcs-util.h       | 100 +++
 tools/testing/selftests/arm64/gcs/libc-gcs.c       | 736 +++++++++++++++++++++
 tools/testing/selftests/arm64/signal/.gitignore    |   1 +
 .../testing/selftests/arm64/signal/test_signals.c  |  17 +-
 .../testing/selftests/arm64/signal/test_signals.h  |   6 +
 .../selftests/arm64/signal/test_signals_utils.c    |  32 +-
 .../selftests/arm64/signal/test_signals_utils.h    |  39 ++
 .../arm64/signal/testcases/gcs_exception_fault.c   |  62 ++
 .../selftests/arm64/signal/testcases/gcs_frame.c   |  88 +++
 .../arm64/signal/testcases/gcs_write_fault.c       |  67 ++
 .../selftests/arm64/signal/testcases/testcases.c   |   7 +
 .../selftests/arm64/signal/testcases/testcases.h   |   1 +
 tools/testing/selftests/ksft_shstk.h               |  37 ++
 73 files changed, 4241 insertions(+), 40 deletions(-)
---
base-commit: 50abefbf1bc07f5c4e403fd28f71dcee855100f7
change-id: 20230303-arm64-gcs-e311ab0d8729

Best regards,
  

Comments

Thiago Jung Bauermann Feb. 20, 2024, 2 a.m. UTC | #1
Hello,

Mark Brown <broonie@kernel.org> writes:

> Changes in v8:
> - Invalidate signal cap token on stack when consuming.
> - Typo and other trivial fixes.
> - Don't try to use process_vm_write() on GCS, it intentionally does not
>   work.
> - Fix leak of thread GCSs.
> - Rebase onto latest clone3() series.
> - Link to v7: https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org

Thank you for addressing my comments. I still have a few nets and
questions in a few patches, but regardless of them:

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
  
Stefan O'Rear Feb. 20, 2024, 4:36 p.m. UTC | #2
On Sat, Feb 3, 2024, at 7:25 AM, Mark Brown wrote:
> The arm64 Guarded Control Stack (GCS) feature provides support for
> hardware protected stacks of return addresses, intended to provide
> hardening against return oriented programming (ROP) attacks and to make
> it easier to gather call stacks for applications such as profiling.
>
> When GCS is active a secondary stack called the Guarded Control Stack is
> maintained, protected with a memory attribute which means that it can
> only be written with specific GCS operations.  The current GCS pointer
> can not be directly written to by userspace.  When a BL is executed the
> value stored in LR is also pushed onto the GCS, and when a RET is
> executed the top of the GCS is popped and compared to LR with a fault
> being raised if the values do not match.  GCS operations may only be
> performed on GCS pages, a data abort is generated if they are not.
>
> The combination of hardware enforcement and lack of extra instructions
> in the function entry and exit paths should result in something which
> has less overhead and is more difficult to attack than a purely software
> implementation like clang's shadow stacks.
>
> This series implements support for use of GCS by userspace, along with
> support for use of GCS within KVM guests.  It does not enable use of GCS
> by either EL1 or EL2, this will be implemented separately.  Executables
> are started without GCS and must use a prctl() to enable it, it is
> expected that this will be done very early in application execution by
> the dynamic linker or other startup code.  For dynamic linking this will
> be done by checking that everything in the executable is marked as GCS
> compatible.
>
> x86 has an equivalent feature called shadow stacks, this series depends
> on the x86 patches for generic memory management support for the new
> guarded/shadow stack page type and shares APIs as much as possible.  As
> there has been extensive discussion with the wider community around the
> ABI for shadow stacks I have as far as practical kept implementation
> decisions close to those for x86, anticipating that review would lead to
> similar conclusions in the absence of strong reasoning for divergence.
>
> The main divergence I am concious of is that x86 allows shadow stack to
> be enabled and disabled repeatedly, freeing the shadow stack for the
> thread whenever disabled, while this implementation keeps the GCS
> allocated after disable but refuses to reenable it.  This is to avoid
> races with things actively walking the GCS during a disable, we do
> anticipate that some systems will wish to disable GCS at runtime but are
> not aware of any demand for subsequently reenabling it.
>
> x86 uses an arch_prctl() to manage enable and disable, since only x86
> and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a
> patch set for the equivalent RISC-V Zicfiss feature which I initially
> adopted fairly directly but following review feedback has been revised
> quite a bit.

While discussing the ABI implications of shadow stacks in the context of
Zicfiss and musl a few days ago, I had the following idea for how to solve
the source compatibility problems with shadow stacks in POSIX.1-2004 and
POSIX.1-2017:

1. Introduce a "flexible shadow stack handling" option.  For what follows,
   it doesn't matter if this is system-wide, per-mm, or per-vma.

2. Shadow stack faults on non-shadow stack pages, if flexible shadow stack
   handling is in effect, cause the affected page to become a shadow stack
   page.  When this happens, the page filled with invalid address tokens.

   Faults from non-shadow-stack accesses to a shadow-stack page which was
   created by the previous paragraph will cause the page to revert to
   non-shadow-stack usage, with or without clearing.

   Important: a shadow stack operation can only load a valid address from
   a page if that page has been in continuous shadow stack use since the
   address was written by another shadow stack operation; the flexibility
   delays error reporting in cases of stray writes but it never allows for
   corruption of shadow stack operation.

3. Standards-defined operations which use a user-provided stack
   (makecontext, sigaltstack, pthread_attr_setstack) use a subrange of the
   provided stack for shadow stack storage.  I propose to use a shadow
   stack size of 1/32 of the provided stack size, rounded up to a positive
   integer number of pages, and place the shadow stack allocation at the
   lowest page-aligned address inside the provided stack region.

   Since page usage is flexible, no change in page permissions is
   immediately needed; this merely sets the initial shadow stack pointer for
   the new context.

   If the shadow stack grew in the opposite direction to the architectural
   stack, it would not be necessary to pick a fixed direction.

4. SIGSTKSZ and MINSIGSTKSZ are increased by 2 pages to provide sufficient
   space for a minimum-sized shadow stack region and worst case alignment.

_Without_ doing this, sigaltstack cannot be used to recover from stack
overflows if the shadow stack limit is reached first, and makecontext
cannot be supported without memory leaks and unreportable error conditions.

Kernel-allocated shadow stacks with a unique VM type are still useful since
they allows stray writes to crash at the time the stray write is performed,
rather than delaying the crash until the next shadow stack read.

The pthread and makecontext changes could be purely libc side, but we would
need kernel support for sigaltstack and page usage changes.

Luckily, there is no need to support stacks which are simultaneously used
from more than one process, so "is this a shadow stack page" can be tracked
purely at the vma/pte level without any need to involve the inode.  POSIX
explicitly allows using mmap to obtain stack memory and does not forbid
MAP_SHARED; I consider this sufficiently perverse application behavior that
it is not necessary to ensure exclusive use of the underlying pages while
a shadow stack pte exists.  (Applications that use MAP_SHARED for stacks
do not get the full benefit of the shadow stack but they keep POSIX.1-2004
conformance, applications that allocate stacks exclusively in MAP_PRIVATE
memory lose no security.)

The largest complication of this scheme is likely to be that the shadow
stack usage property of a page needs to survive the page being swapped out
and back in, which likely means that it must be present in the swap PTE.

I am substantially less familiar with GCS and SHSTK than with Zicfiss.
It is likely that a syscall or other mechanism is needed to initialize the
shadow stack in flexible memory for makecontext.

Is there interest on the kernel side on having mechanisms to fully support
POSIX.1-2004 with GCS or Zicfiss enabled?

-s

> We currently maintain the x86 pattern of implicitly allocating a shadow
> stack for threads started with shadow stack enabled, there has been some
> discussion of removing this support and requiring the use of clone3()
> with explicit allocation of shadow stacks instead.  I have no strong
> feelings either way, implicit allocation is not really consistent with
> anything else we do and creates the potential for errors around thread
> exit but on the other hand it is existing ABI on x86 and minimises the
> changes needed in userspace code.
>
> There is an open issue with support for CRIU, on x86 this required the
> ability to set the GCS mode via ptrace.  This series supports
> configuring mode bits other than enable/disable via ptrace but it needs
> to be confirmed if this is sufficient.
>
> The series depends on support for shadow stacks in clone3(), that series
> includes the addition of ARCH_HAS_USER_SHADOW_STACK.
>
>    
> https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@kernel.org
>
> It also depends on the addition of more waitpid() flags to nolibc:
>
>    
> https://lore.kernel.org/r/20231023-nolibc-waitpid-flags-v2-1-b09d096f091f@kernel.org
>
> You can see a branch with the full set of dependencies against Linus'
> tree at:
>
>    https://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc.git arm64-gcs
>
> [1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
>
> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---
> Changes in v8:
> - Invalidate signal cap token on stack when consuming.
> - Typo and other trivial fixes.
> - Don't try to use process_vm_write() on GCS, it intentionally does not
>   work.
> - Fix leak of thread GCSs.
> - Rebase onto latest clone3() series.
> - Link to v7: 
> https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org
>
> Changes in v7:
> - Rebase onto v6.7-rc2 via the clone3() patch series.
> - Change the token used to cap the stack during signal handling to be
>   compatible with GCSPOPM.
> - Fix flags for new page types.
> - Fold in support for clone3().
> - Replace copy_to_user_gcs() with put_user_gcs().
> - Link to v6: 
> https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org
>
> Changes in v6:
> - Rebase onto v6.6-rc3.
> - Add some more gcsb_dsync() barriers following spec clarifications.
> - Due to ongoing discussion around clone()/clone3() I've not updated
>   anything there, the behaviour is the same as on previous versions.
> - Link to v5: 
> https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org
>
> Changes in v5:
> - Don't map any permissions for user GCSs, we always use EL0 accessors
>   or use a separate mapping of the page.
> - Reduce the standard size of the GCS to RLIMIT_STACK/2.
> - Enforce a PAGE_SIZE alignment requirement on map_shadow_stack().
> - Clarifications and fixes to documentation.
> - More tests.
> - Link to v4: 
> https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org
>
> Changes in v4:
> - Implement flags for map_shadow_stack() allowing the cap and end of
>   stack marker to be enabled independently or not at all.
> - Relax size and alignment requirements for map_shadow_stack().
> - Add more blurb explaining the advantages of hardware enforcement.
> - Link to v3: 
> https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org
>
> Changes in v3:
> - Rebase onto v6.5-rc4.
> - Add a GCS barrier on context switch.
> - Add a GCS stress test.
> - Link to v2: 
> https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org
>
> Changes in v2:
> - Rebase onto v6.5-rc3.
> - Rework prctl() interface to allow each bit to be locked independently.
> - map_shadow_stack() now places the cap token based on the size
>   requested by the caller not the actual space allocated.
> - Mode changes other than enable via ptrace are now supported.
> - Expand test coverage.
> - Various smaller fixes and adjustments.
> - Link to v1: 
> https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org
>
> ---
> Mark Brown (38):
>       arm64/mm: Restructure arch_validate_flags() for extensibility
>       prctl: arch-agnostic prctl for shadow stack
>       mman: Add map_shadow_stack() flags
>       arm64: Document boot requirements for Guarded Control Stacks
>       arm64/gcs: Document the ABI for Guarded Control Stacks
>       arm64/sysreg: Add definitions for architected GCS caps
>       arm64/gcs: Add manual encodings of GCS instructions
>       arm64/gcs: Provide put_user_gcs()
>       arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS)
>       arm64/mm: Allocate PIE slots for EL0 guarded control stack
>       mm: Define VM_SHADOW_STACK for arm64 when we support GCS
>       arm64/mm: Map pages for guarded control stack
>       KVM: arm64: Manage GCS registers for guests
>       arm64/gcs: Allow GCS usage at EL0 and EL1
>       arm64/idreg: Add overrride for GCS
>       arm64/hwcap: Add hwcap for GCS
>       arm64/traps: Handle GCS exceptions
>       arm64/mm: Handle GCS data aborts
>       arm64/gcs: Context switch GCS state for EL0
>       arm64/gcs: Ensure that new threads have a GCS
>       arm64/gcs: Implement shadow stack prctl() interface
>       arm64/mm: Implement map_shadow_stack()
>       arm64/signal: Set up and restore the GCS context for signal handlers
>       arm64/signal: Expose GCS state in signal frames
>       arm64/ptrace: Expose GCS via ptrace and core files
>       arm64: Add Kconfig for Guarded Control Stack (GCS)
>       kselftest/arm64: Verify the GCS hwcap
>       kselftest/arm64: Add GCS as a detected feature in the signal tests
>       kselftest/arm64: Add framework support for GCS to signal handling tests
>       kselftest/arm64: Allow signals tests to specify an expected si_code
>       kselftest/arm64: Always run signals tests with GCS enabled
>       kselftest/arm64: Add very basic GCS test program
>       kselftest/arm64: Add a GCS test program built with the system libc
>       kselftest/arm64: Add test coverage for GCS mode locking
>       selftests/arm64: Add GCS signal tests
>       kselftest/arm64: Add a GCS stress test
>       kselftest/arm64: Enable GCS for the FP stress tests
>       kselftest: Provide shadow stack enable helpers for arm64
>
>  Documentation/admin-guide/kernel-parameters.txt    |   6 +
>  Documentation/arch/arm64/booting.rst               |  22 +
>  Documentation/arch/arm64/elf_hwcaps.rst            |   3 +
>  Documentation/arch/arm64/gcs.rst                   | 233 +++++++
>  Documentation/arch/arm64/index.rst                 |   1 +
>  Documentation/filesystems/proc.rst                 |   2 +-
>  arch/arm64/Kconfig                                 |  20 +
>  arch/arm64/include/asm/cpufeature.h                |   6 +
>  arch/arm64/include/asm/el2_setup.h                 |  17 +
>  arch/arm64/include/asm/esr.h                       |  28 +-
>  arch/arm64/include/asm/exception.h                 |   2 +
>  arch/arm64/include/asm/gcs.h                       | 107 +++
>  arch/arm64/include/asm/hwcap.h                     |   1 +
>  arch/arm64/include/asm/kvm_arm.h                   |   4 +-
>  arch/arm64/include/asm/kvm_host.h                  |  12 +
>  arch/arm64/include/asm/mman.h                      |  23 +-
>  arch/arm64/include/asm/pgtable-prot.h              |  14 +-
>  arch/arm64/include/asm/processor.h                 |   7 +
>  arch/arm64/include/asm/sysreg.h                    |  20 +
>  arch/arm64/include/asm/uaccess.h                   |  40 ++
>  arch/arm64/include/uapi/asm/hwcap.h                |   1 +
>  arch/arm64/include/uapi/asm/ptrace.h               |   8 +
>  arch/arm64/include/uapi/asm/sigcontext.h           |   9 +
>  arch/arm64/kernel/cpufeature.c                     |  19 +
>  arch/arm64/kernel/cpuinfo.c                        |   1 +
>  arch/arm64/kernel/entry-common.c                   |  23 +
>  arch/arm64/kernel/idreg-override.c                 |   2 +
>  arch/arm64/kernel/process.c                        |  85 +++
>  arch/arm64/kernel/ptrace.c                         |  59 ++
>  arch/arm64/kernel/signal.c                         | 242 ++++++-
>  arch/arm64/kernel/traps.c                          |  11 +
>  arch/arm64/kvm/emulate-nested.c                    |   4 +
>  arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h         |  17 +
>  arch/arm64/kvm/sys_regs.c                          |  22 +
>  arch/arm64/mm/Makefile                             |   1 +
>  arch/arm64/mm/fault.c                              |  79 ++-
>  arch/arm64/mm/gcs.c                                | 300 +++++++++
>  arch/arm64/mm/mmap.c                               |  13 +-
>  arch/arm64/tools/cpucaps                           |   1 +
>  arch/x86/include/uapi/asm/mman.h                   |   3 -
>  fs/proc/task_mmu.c                                 |   3 +
>  include/linux/mm.h                                 |  16 +-
>  include/uapi/asm-generic/mman.h                    |   4 +
>  include/uapi/linux/elf.h                           |   1 +
>  include/uapi/linux/prctl.h                         |  22 +
>  kernel/sys.c                                       |  30 +
>  tools/testing/selftests/arm64/Makefile             |   2 +-
>  tools/testing/selftests/arm64/abi/hwcap.c          |  19 +
>  tools/testing/selftests/arm64/fp/assembler.h       |  15 +
>  tools/testing/selftests/arm64/fp/fpsimd-test.S     |   2 +
>  tools/testing/selftests/arm64/fp/sve-test.S        |   2 +
>  tools/testing/selftests/arm64/fp/za-test.S         |   2 +
>  tools/testing/selftests/arm64/fp/zt-test.S         |   2 +
>  tools/testing/selftests/arm64/gcs/.gitignore       |   5 +
>  tools/testing/selftests/arm64/gcs/Makefile         |  24 +
>  tools/testing/selftests/arm64/gcs/asm-offsets.h    |   0
>  tools/testing/selftests/arm64/gcs/basic-gcs.c      | 428 ++++++++++++
>  tools/testing/selftests/arm64/gcs/gcs-locking.c    | 200 ++++++
>  .../selftests/arm64/gcs/gcs-stress-thread.S        | 311 +++++++++
>  tools/testing/selftests/arm64/gcs/gcs-stress.c     | 532 +++++++++++++++
>  tools/testing/selftests/arm64/gcs/gcs-util.h       | 100 +++
>  tools/testing/selftests/arm64/gcs/libc-gcs.c       | 736 +++++++++++++++++++++
>  tools/testing/selftests/arm64/signal/.gitignore    |   1 +
>  .../testing/selftests/arm64/signal/test_signals.c  |  17 +-
>  .../testing/selftests/arm64/signal/test_signals.h  |   6 +
>  .../selftests/arm64/signal/test_signals_utils.c    |  32 +-
>  .../selftests/arm64/signal/test_signals_utils.h    |  39 ++
>  .../arm64/signal/testcases/gcs_exception_fault.c   |  62 ++
>  .../selftests/arm64/signal/testcases/gcs_frame.c   |  88 +++
>  .../arm64/signal/testcases/gcs_write_fault.c       |  67 ++
>  .../selftests/arm64/signal/testcases/testcases.c   |   7 +
>  .../selftests/arm64/signal/testcases/testcases.h   |   1 +
>  tools/testing/selftests/ksft_shstk.h               |  37 ++
>  73 files changed, 4241 insertions(+), 40 deletions(-)
> ---
> base-commit: 50abefbf1bc07f5c4e403fd28f71dcee855100f7
> change-id: 20230303-arm64-gcs-e311ab0d8729
>
> Best regards,
> -- 
> Mark Brown <broonie@kernel.org>
>
>
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv
  
Edgecombe, Rick P Feb. 20, 2024, 6:41 p.m. UTC | #3
Hi,

I worked on the x86 kernel shadow stack support. I think it is an
interesting suggestion. Some questions below, and I will think more on
it.

On Tue, 2024-02-20 at 11:36 -0500, Stefan O'Rear wrote:
> While discussing the ABI implications of shadow stacks in the context
> of
> Zicfiss and musl a few days ago, I had the following idea for how to
> solve
> the source compatibility problems with shadow stacks in POSIX.1-2004
> and
> POSIX.1-2017:
> 
> 1. Introduce a "flexible shadow stack handling" option.  For what
> follows,
>    it doesn't matter if this is system-wide, per-mm, or per-vma.
> 
> 2. Shadow stack faults on non-shadow stack pages, if flexible shadow
> stack
>    handling is in effect, cause the affected page to become a shadow
> stack
>    page.  When this happens, the page filled with invalid address
> tokens.

Hmm, could the shadow stack underflow onto the real stack then? Not
sure how bad that is. INCSSP (incrementing the SSP register on x86)
loops are not rare so it seems like something that could happen.

> 
>    Faults from non-shadow-stack accesses to a shadow-stack page which
> was
>    created by the previous paragraph will cause the page to revert to
>    non-shadow-stack usage, with or without clearing.

Won't this prevent catching stack overflows when they happen? An
overflow will just turn the shadow stack into normal stack and only get
detected when the shadow stack unwinds?

A related question would be how to handle the expanding nature of the
initial stack. I guess the initial stack could be special and have a
separate shadow stack.

> 
>    Important: a shadow stack operation can only load a valid address
> from
>    a page if that page has been in continuous shadow stack use since
> the
>    address was written by another shadow stack operation; the
> flexibility
>    delays error reporting in cases of stray writes but it never
> allows for
>    corruption of shadow stack operation.

Shadow stacks currently have automatic guard gaps to try to prevent one
thread from overflowing onto another thread's shadow stack. This would
somewhat opens that up, as the stack guard gaps are usually maintained
by userspace for new threads. It would have to be thought through if
these could still be enforced with checking at additional spots.

> 
> 3. Standards-defined operations which use a user-provided stack
>    (makecontext, sigaltstack, pthread_attr_setstack) use a subrange
> of the
>    provided stack for shadow stack storage.  I propose to use a
> shadow
>    stack size of 1/32 of the provided stack size, rounded up to a
> positive
>    integer number of pages, and place the shadow stack allocation at
> the
>    lowest page-aligned address inside the provided stack region.
> 
>    Since page usage is flexible, no change in page permissions is
>    immediately needed; this merely sets the initial shadow stack
> pointer for
>    the new context.
> 
>    If the shadow stack grew in the opposite direction to the
> architectural
>    stack, it would not be necessary to pick a fixed direction.
> 
> 4. SIGSTKSZ and MINSIGSTKSZ are increased by 2 pages to provide
> sufficient
>    space for a minimum-sized shadow stack region and worst case
> alignment.

Do all makecontext() callers ensure the size is greater than this?

I guess glibc's makecontext() could do this scheme to prevent leaking
without any changes to the kernel. Basically steal a little of the
stack address range and overwrite it with a shadow stack mapping. But
only if the apps leave enough room. If they need to be updated, then
they could be updated to manage their own shadow stacks too I think.

> 
> _Without_ doing this, sigaltstack cannot be used to recover from
> stack
> overflows if the shadow stack limit is reached first, and makecontext
> cannot be supported without memory leaks and unreportable error
> conditions.

FWIW, I think the makecontext() shadow stack leaking is a bad idea. I
would prefer the existing makecontext() interface just didn't support
shadow stack, rather than the leaking solution glibc does today.

The situation (for arm and riscv too I think?) is that some
applications will just not work automatically due to custom stack
switching implementations. (user level threading libraries, JITs, etc).
So I think it should be ok to ask for apps to change to enable shadow
stack and we should avoid doing anything too awkward in pursuit of
getting it to work completely transparently.

For ucontext, there was some discussion about implementing changes to
the interface makecontext() interface that allows the app to allocate
and manage their own shadow stacks. So they would be responsible for
freeing and allocating the shadow stacks. It seems a little more
straightforward.


For x86, due to some existing GCC binaries that jumped ahead of the
kernel support, it will likely require an ABI opt-in to enable alt
shadow stacks. So alt shadow stack support design is still pretty open
on the x86 side. Very glad to get broader input on it.

> 
> Kernel-allocated shadow stacks with a unique VM type are still useful
> since
> they allows stray writes to crash at the time the stray write is
> performed,
> rather than delaying the crash until the next shadow stack read.
> 
> The pthread and makecontext changes could be purely libc side, but we
> would
> need kernel support for sigaltstack and page usage changes.
> 
> Luckily, there is no need to support stacks which are simultaneously
> used
> from more than one process, so "is this a shadow stack page" can be
> tracked
> purely at the vma/pte level without any need to involve the inode. 
> POSIX
> explicitly allows using mmap to obtain stack memory and does not
> forbid
> MAP_SHARED; I consider this sufficiently perverse application
> behavior that
> it is not necessary to ensure exclusive use of the underlying pages
> while
> a shadow stack pte exists.  (Applications that use MAP_SHARED for
> stacks
> do not get the full benefit of the shadow stack but they keep
> POSIX.1-2004
> conformance, applications that allocate stacks exclusively in
> MAP_PRIVATE
> memory lose no security.)

On x86 we don't support MAP_SHARED shadow stacks. There is a whole
snarl around the dirty bit in the PTE. I'm not sure it's impossible but
it was gladly avoided. There is also a benefit in avoiding having them
get mapped as writable in a different context.

> 
> The largest complication of this scheme is likely to be that the
> shadow
> stack usage property of a page needs to survive the page being
> swapped out
> and back in, which likely means that it must be present in the swap
> PTE.
> 
> I am substantially less familiar with GCS and SHSTK than with
> Zicfiss.
> It is likely that a syscall or other mechanism is needed to
> initialize the
> shadow stack in flexible memory for makecontext.

The ucontext stacks (and alt shadow stacks is the plan) need to have a
"restore token". So, yea, you would probably need some syscall to
"convert" the normal stack memory into shadow stack with a restore
token.

> 
> Is there interest on the kernel side on having mechanisms to fully
> support
> POSIX.1-2004 with GCS or Zicfiss enabled?

Can you clarify, is the goal to meet compatibility with the spec or try
to make more apps run with shadow stack automatically?
  
dalias@libc.org Feb. 20, 2024, 6:57 p.m. UTC | #4
On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P wrote:
> Hi,
> 
> I worked on the x86 kernel shadow stack support. I think it is an
> interesting suggestion. Some questions below, and I will think more on
> it.
> 
> On Tue, 2024-02-20 at 11:36 -0500, Stefan O'Rear wrote:
> > While discussing the ABI implications of shadow stacks in the context
> > of
> > Zicfiss and musl a few days ago, I had the following idea for how to
> > solve
> > the source compatibility problems with shadow stacks in POSIX.1-2004
> > and
> > POSIX.1-2017:
> > 
> > 1. Introduce a "flexible shadow stack handling" option.  For what
> > follows,
> >    it doesn't matter if this is system-wide, per-mm, or per-vma.
> > 
> > 2. Shadow stack faults on non-shadow stack pages, if flexible shadow
> > stack
> >    handling is in effect, cause the affected page to become a shadow
> > stack
> >    page.  When this happens, the page filled with invalid address
> > tokens.
> 
> Hmm, could the shadow stack underflow onto the real stack then? Not
> sure how bad that is. INCSSP (incrementing the SSP register on x86)
> loops are not rare so it seems like something that could happen.

Shadow stack underflow should fault on attempt to access
non-shadow-stack memory as shadow-stack, no?

> >    Faults from non-shadow-stack accesses to a shadow-stack page which
> > was
> >    created by the previous paragraph will cause the page to revert to
> >    non-shadow-stack usage, with or without clearing.
> 
> Won't this prevent catching stack overflows when they happen? An
> overflow will just turn the shadow stack into normal stack and only get
> detected when the shadow stack unwinds?

I don't think that's as big a problem as it sounds like. It might make
pinpointing the spot at which things went wrong take a little bit more
work, but it should not admit any wrong-execution.

> A related question would be how to handle the expanding nature of the
> initial stack. I guess the initial stack could be special and have a
> separate shadow stack.

That seems fine.

> >    Important: a shadow stack operation can only load a valid address
> > from
> >    a page if that page has been in continuous shadow stack use since
> > the
> >    address was written by another shadow stack operation; the
> > flexibility
> >    delays error reporting in cases of stray writes but it never
> > allows for
> >    corruption of shadow stack operation.
> 
> Shadow stacks currently have automatic guard gaps to try to prevent one
> thread from overflowing onto another thread's shadow stack. This would
> somewhat opens that up, as the stack guard gaps are usually maintained
> by userspace for new threads. It would have to be thought through if
> these could still be enforced with checking at additional spots.

I would think the existing guard pages would already do that if a
thread's shadow stack is contiguous with its own data stack.

> > 3. Standards-defined operations which use a user-provided stack
> >    (makecontext, sigaltstack, pthread_attr_setstack) use a subrange
> > of the
> >    provided stack for shadow stack storage.  I propose to use a
> > shadow
> >    stack size of 1/32 of the provided stack size, rounded up to a
> > positive
> >    integer number of pages, and place the shadow stack allocation at
> > the
> >    lowest page-aligned address inside the provided stack region.
> > 
> >    Since page usage is flexible, no change in page permissions is
> >    immediately needed; this merely sets the initial shadow stack
> > pointer for
> >    the new context.
> > 
> >    If the shadow stack grew in the opposite direction to the
> > architectural
> >    stack, it would not be necessary to pick a fixed direction.
> > 
> > 4. SIGSTKSZ and MINSIGSTKSZ are increased by 2 pages to provide
> > sufficient
> >    space for a minimum-sized shadow stack region and worst case
> > alignment.
> 
> Do all makecontext() callers ensure the size is greater than this?
> 
> I guess glibc's makecontext() could do this scheme to prevent leaking
> without any changes to the kernel. Basically steal a little of the
> stack address range and overwrite it with a shadow stack mapping. But
> only if the apps leave enough room. If they need to be updated, then
> they could be updated to manage their own shadow stacks too I think.

From the musl side, I have always looked at the entirely of shadow
stack stuff with very heavy skepticism, and anything that breaks
existing interface contracts, introduced places where apps can get
auto-killed because a late resource allocation fails, or requires
applications to code around the existence of something that should be
an implementation detail, is a non-starter. To even consider shadow
stack support, it must truely be fully non-breaking.

> > _Without_ doing this, sigaltstack cannot be used to recover from
> > stack
> > overflows if the shadow stack limit is reached first, and makecontext
> > cannot be supported without memory leaks and unreportable error
> > conditions.
> 
> FWIW, I think the makecontext() shadow stack leaking is a bad idea. I
> would prefer the existing makecontext() interface just didn't support
> shadow stack, rather than the leaking solution glibc does today.

AIUI the proposal by Stefan makes it non-leaking because it's just
using normal memory that reverts to normal usage on any
non-shadow-stack access.

Rich
  
Mark Brown Feb. 20, 2024, 8:14 p.m. UTC | #5
On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2024-02-20 at 11:36 -0500, Stefan O'Rear wrote:

> > 2. Shadow stack faults on non-shadow stack pages, if flexible shadow
> > stack
> >    handling is in effect, cause the affected page to become a shadow
> > stack
> >    page.  When this happens, the page filled with invalid address
> > tokens.

> Hmm, could the shadow stack underflow onto the real stack then? Not
> sure how bad that is. INCSSP (incrementing the SSP register on x86)
> loops are not rare so it seems like something that could happen.

Yes, they'd trash any pages of normal stack they touch as they do so but
otherwise seems similar to overflow.

> The situation (for arm and riscv too I think?) is that some
> applications will just not work automatically due to custom stack
> switching implementations. (user level threading libraries, JITs, etc).
> So I think it should be ok to ask for apps to change to enable shadow
> stack and we should avoid doing anything too awkward in pursuit of
> getting it to work completely transparently.

Yes, on arm64 anything that rewrites or is otherwise clever with the
stack is going to have to understand that the GCS exists on arm64 and do
matching rewrites/updates for the GCS.  This includes anything that
switches stacks, it will need to use GCS specific instructions to change
the current shadow stack pointer.

> > MAP_SHARED; I consider this sufficiently perverse application
> > behavior that
> > it is not necessary to ensure exclusive use of the underlying pages
> > while
> > a shadow stack pte exists.  (Applications that use MAP_SHARED for
> > stacks
> > do not get the full benefit of the shadow stack but they keep
> > POSIX.1-2004
> > conformance, applications that allocate stacks exclusively in
> > MAP_PRIVATE
> > memory lose no security.)

> On x86 we don't support MAP_SHARED shadow stacks. There is a whole
> snarl around the dirty bit in the PTE. I'm not sure it's impossible but
> it was gladly avoided. There is also a benefit in avoiding having them
> get mapped as writable in a different context.

Similarly for arm64, I think we can physically do it IIRC but between
having to map via map_shadow_stack() for security reasons and it just
generally not seeming like a clever idea the implementation shouldn't
actually let you get a MAP_SHARED GCS it's not something that's been
considered.

> > I am substantially less familiar with GCS and SHSTK than with
> > Zicfiss.
> > It is likely that a syscall or other mechanism is needed to
> > initialize the
> > shadow stack in flexible memory for makecontext.

> The ucontext stacks (and alt shadow stacks is the plan) need to have a
> "restore token". So, yea, you would probably need some syscall to
> "convert" the normal stack memory into shadow stack with a restore
> token.

Similar considerations for GCS, we need tokens and we don't want
userspace to be able to write by itself in the normal case.
  
Edgecombe, Rick P Feb. 20, 2024, 11:30 p.m. UTC | #6
On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote:
> On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P wrote:
> > Hmm, could the shadow stack underflow onto the real stack then? Not
> > sure how bad that is. INCSSP (incrementing the SSP register on x86)
> > loops are not rare so it seems like something that could happen.
> 
> Shadow stack underflow should fault on attempt to access
> non-shadow-stack memory as shadow-stack, no?

Maybe I'm misunderstanding. I thought the proposal included allowing
shadow stack access to convert normal address ranges to shadow stack,
and normal writes to convert shadow stack to normal.

> > 
> > Won't this prevent catching stack overflows when they happen? An
> > overflow will just turn the shadow stack into normal stack and only
> > get
> > detected when the shadow stack unwinds?
> 
> I don't think that's as big a problem as it sounds like. It might
> make
> pinpointing the spot at which things went wrong take a little bit
> more
> work, but it should not admit any wrong-execution.

Right, it's a point about debugging. I'm just trying to analyze the
pros and cons and not calling it a showstopper.

> > 
> > Shadow stacks currently have automatic guard gaps to try to prevent
> > one
> > thread from overflowing onto another thread's shadow stack. This
> > would
> > somewhat opens that up, as the stack guard gaps are usually
> > maintained
> > by userspace for new threads. It would have to be thought through
> > if
> > these could still be enforced with checking at additional spots.
> 
> I would think the existing guard pages would already do that if a
> thread's shadow stack is contiguous with its own data stack.

The difference is that the kernel provides the guard gaps, where this
would rely on userspace to do it. It is not a showstopper either.

I think my biggest question on this is how does it change the
capability for two threads to share a shadow stack. It might require
some special rules around the syscall that writes restore tokens. So
I'm not sure. It probably needs a POC.

> 
> From the musl side, I have always looked at the entirely of shadow
> stack stuff with very heavy skepticism, and anything that breaks
> existing interface contracts, introduced places where apps can get
> auto-killed because a late resource allocation fails, or requires
> applications to code around the existence of something that should be
> an implementation detail, is a non-starter. To even consider shadow
> stack support, it must truely be fully non-breaking.

The manual assembly stack switching and JIT code in the apps needs to
be updated. I don't think there is a way around it.

I agree though that the late allocation failures are not great. Mark is
working on clone3 support which should allow moving the shadow stack
allocation to happen in userspace with the normal stack. Even for riscv
though, doesn't it need to update a new register in stack switching?

BTW, x86 shadow stack has a mode where the shadow stack is writable
with a special instruction (WRSS). It enables the SSP to be set
arbitrarily by writing restore tokens. We discussed this as an option
to make the existing longjmp() and signal stuff work more transparently
for glibc.

> 
> > > _Without_ doing this, sigaltstack cannot be used to recover from
> > > stack
> > > overflows if the shadow stack limit is reached first, and
> > > makecontext
> > > cannot be supported without memory leaks and unreportable error
> > > conditions.
> > 
> > FWIW, I think the makecontext() shadow stack leaking is a bad idea.
> > I
> > would prefer the existing makecontext() interface just didn't
> > support
> > shadow stack, rather than the leaking solution glibc does today.
> 
> AIUI the proposal by Stefan makes it non-leaking because it's just
> using normal memory that reverts to normal usage on any
> non-shadow-stack access.
> 

Right, but does it break any existing apps anyway (because of small
ucontext stack sizes)?

BTW, when I talk about "not supporting" I don't mean the app should
crash. I mean it should instead run normally, just without shadow stack
enabled. Not sure if that was clear. Since shadow stack is not
essential for an application to function, it is only security hardening
on top.

Although determining if an application supports shadow stack has turned
out to be difficult in practice. Handling dlopen() is especially hard.
  
Edgecombe, Rick P Feb. 20, 2024, 11:30 p.m. UTC | #7
On Tue, 2024-02-20 at 20:14 +0000, Mark Brown wrote:
> > Hmm, could the shadow stack underflow onto the real stack then? Not
> > sure how bad that is. INCSSP (incrementing the SSP register on x86)
> > loops are not rare so it seems like something that could happen.
> 
> Yes, they'd trash any pages of normal stack they touch as they do so
> but
> otherwise seems similar to overflow.

I was thinking in the normal buffer overflow case there is a guard gap
at the end of the stack, but in this case the shadow stack is directly
adjacent to the regular stack. It's probably a minor point.
  
dalias@libc.org Feb. 20, 2024, 11:54 p.m. UTC | #8
On Tue, Feb 20, 2024 at 11:30:22PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote:
> > On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P wrote:
> > > Hmm, could the shadow stack underflow onto the real stack then? Not
> > > sure how bad that is. INCSSP (incrementing the SSP register on x86)
> > > loops are not rare so it seems like something that could happen.
> > 
> > Shadow stack underflow should fault on attempt to access
> > non-shadow-stack memory as shadow-stack, no?
> 
> Maybe I'm misunderstanding. I thought the proposal included allowing
> shadow stack access to convert normal address ranges to shadow stack,
> and normal writes to convert shadow stack to normal.

As I understood the original discussion of the proposal on IRC, it was
only one-way (from shadow to normal). Unless I'm missing something,
making it one-way is necessary to catch situations where the shadow
stack would become compromised.

> > > Shadow stacks currently have automatic guard gaps to try to prevent
> > > one
> > > thread from overflowing onto another thread's shadow stack. This
> > > would
> > > somewhat opens that up, as the stack guard gaps are usually
> > > maintained
> > > by userspace for new threads. It would have to be thought through
> > > if
> > > these could still be enforced with checking at additional spots.
> > 
> > I would think the existing guard pages would already do that if a
> > thread's shadow stack is contiguous with its own data stack.
> 
> The difference is that the kernel provides the guard gaps, where this
> would rely on userspace to do it. It is not a showstopper either.
> 
> I think my biggest question on this is how does it change the
> capability for two threads to share a shadow stack. It might require
> some special rules around the syscall that writes restore tokens. So
> I'm not sure. It probably needs a POC.

Why would they be sharing a shadow stack?

> > From the musl side, I have always looked at the entirely of shadow
> > stack stuff with very heavy skepticism, and anything that breaks
> > existing interface contracts, introduced places where apps can get
> > auto-killed because a late resource allocation fails, or requires
> > applications to code around the existence of something that should be
> > an implementation detail, is a non-starter. To even consider shadow
> > stack support, it must truely be fully non-breaking.
> 
> The manual assembly stack switching and JIT code in the apps needs to
> be updated. I don't think there is a way around it.

Indeed, I'm not talking about programs with JIT/manual stack-switching
asm, just anything using existing APIs for control of stack --
pthread_setstack, makecontext, sigaltstack, etc.

> I agree though that the late allocation failures are not great. Mark is
> working on clone3 support which should allow moving the shadow stack
> allocation to happen in userspace with the normal stack. Even for riscv
> though, doesn't it need to update a new register in stack switching?

If clone is called with signals masked, it's probably not necessary
for the kernel to set the shadow stack register as part of clone3.

> BTW, x86 shadow stack has a mode where the shadow stack is writable
> with a special instruction (WRSS). It enables the SSP to be set
> arbitrarily by writing restore tokens. We discussed this as an option
> to make the existing longjmp() and signal stuff work more transparently
> for glibc.
> 
> > 
> > > > _Without_ doing this, sigaltstack cannot be used to recover from
> > > > stack
> > > > overflows if the shadow stack limit is reached first, and
> > > > makecontext
> > > > cannot be supported without memory leaks and unreportable error
> > > > conditions.
> > > 
> > > FWIW, I think the makecontext() shadow stack leaking is a bad idea.
> > > I
> > > would prefer the existing makecontext() interface just didn't
> > > support
> > > shadow stack, rather than the leaking solution glibc does today.
> > 
> > AIUI the proposal by Stefan makes it non-leaking because it's just
> > using normal memory that reverts to normal usage on any
> > non-shadow-stack access.
> > 
> 
> Right, but does it break any existing apps anyway (because of small
> ucontext stack sizes)?
> 
> BTW, when I talk about "not supporting" I don't mean the app should
> crash. I mean it should instead run normally, just without shadow stack
> enabled. Not sure if that was clear. Since shadow stack is not
> essential for an application to function, it is only security hardening
> on top.
> 
> Although determining if an application supports shadow stack has turned
> out to be difficult in practice. Handling dlopen() is especially hard.

One reasonable thing to do, that might be preferable to overengineered
solutions, is to disable shadow-stack process-wide if an interface
incompatible with it is used (sigaltstack, pthread_create with an
attribute setup using pthread_attr_setstack, makecontext, etc.), as
well as if an incompatible library is is dlopened. This is much more
acceptable than continuing to run with shadow stacks managed sloppily
by the kernel and async killing the process on OOM, and is probably
*more compatible* with apps than changing the minimum stack size
requirements out from under them.

The place where it's really needed to be able to allocate the shadow
stack synchronously under userspace control, in order to harden normal
applications that aren't doing funny things, is in pthread_create
without a caller-provided stack.

Rich
  
Stefan O'Rear Feb. 20, 2024, 11:59 p.m. UTC | #9
On Tue, Feb 20, 2024, at 6:30 PM, Edgecombe, Rick P wrote:
> On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote:
>> On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P wrote:
>> > Hmm, could the shadow stack underflow onto the real stack then? Not
>> > sure how bad that is. INCSSP (incrementing the SSP register on x86)
>> > loops are not rare so it seems like something that could happen.
>> 
>> Shadow stack underflow should fault on attempt to access
>> non-shadow-stack memory as shadow-stack, no?
>
> Maybe I'm misunderstanding. I thought the proposal included allowing
> shadow stack access to convert normal address ranges to shadow stack,
> and normal writes to convert shadow stack to normal.

Ideally for riscv only writes would cause conversion, an incssp underflow
which performs shadow stack reads would be able to fault early.

For arm, since a syscall is needed anyway to set up the token in a new
shadow stack region, it would make sense for conversion from non-shadow
to shadow usage to never be automatic.

>> > 
>> > Won't this prevent catching stack overflows when they happen? An
>> > overflow will just turn the shadow stack into normal stack and only
>> > get
>> > detected when the shadow stack unwinds?
>> 
>> I don't think that's as big a problem as it sounds like. It might
>> make
>> pinpointing the spot at which things went wrong take a little bit
>> more
>> work, but it should not admit any wrong-execution.
>
> Right, it's a point about debugging. I'm just trying to analyze the
> pros and cons and not calling it a showstopper.

It's certainly undesirable, so I'd like to have both mechanisms available
(shadow stacks in ordinary memory to support several problematic APIs,
and in dedicated mappings with guard pages otherwise).

>> > 
>> > Shadow stacks currently have automatic guard gaps to try to prevent
>> > one
>> > thread from overflowing onto another thread's shadow stack. This
>> > would
>> > somewhat opens that up, as the stack guard gaps are usually
>> > maintained
>> > by userspace for new threads. It would have to be thought through
>> > if
>> > these could still be enforced with checking at additional spots.
>> 
>> I would think the existing guard pages would already do that if a
>> thread's shadow stack is contiguous with its own data stack.
>
> The difference is that the kernel provides the guard gaps, where this
> would rely on userspace to do it. It is not a showstopper either.
>
> I think my biggest question on this is how does it change the
> capability for two threads to share a shadow stack. It might require
> some special rules around the syscall that writes restore tokens. So
> I'm not sure. It probably needs a POC.

I'm not quite understanding what the property you're looking for here is.

>> From the musl side, I have always looked at the entirely of shadow
>> stack stuff with very heavy skepticism, and anything that breaks
>> existing interface contracts, introduced places where apps can get
>> auto-killed because a late resource allocation fails, or requires
>> applications to code around the existence of something that should be
>> an implementation detail, is a non-starter. To even consider shadow
>> stack support, it must truely be fully non-breaking.
>
> The manual assembly stack switching and JIT code in the apps needs to
> be updated. I don't think there is a way around it.

Naturally.  If an application uses nonportable functionality like JIT
and inline assembly, it's fine (within reason) for those nonportable
components to need changes for shadow stack support.

The objective of this proposal is to allow applications that do _not_
use inline assembly but rather only C APIs defined in POSIX.1-2004 to
execute correctly in an environment where shadow stacks are enabled
by default.

> I agree though that the late allocation failures are not great. Mark is
> working on clone3 support which should allow moving the shadow stack
> allocation to happen in userspace with the normal stack. Even for riscv
> though, doesn't it need to update a new register in stack switching?
>
> BTW, x86 shadow stack has a mode where the shadow stack is writable
> with a special instruction (WRSS). It enables the SSP to be set
> arbitrarily by writing restore tokens. We discussed this as an option
> to make the existing longjmp() and signal stuff work more transparently
> for glibc.
>
>> 
>> > > _Without_ doing this, sigaltstack cannot be used to recover from
>> > > stack
>> > > overflows if the shadow stack limit is reached first, and
>> > > makecontext
>> > > cannot be supported without memory leaks and unreportable error
>> > > conditions.
>> > 
>> > FWIW, I think the makecontext() shadow stack leaking is a bad idea.
>> > I
>> > would prefer the existing makecontext() interface just didn't
>> > support
>> > shadow stack, rather than the leaking solution glibc does today.
>> 
>> AIUI the proposal by Stefan makes it non-leaking because it's just
>> using normal memory that reverts to normal usage on any
>> non-shadow-stack access.
>> 
>
> Right, but does it break any existing apps anyway (because of small
> ucontext stack sizes)?

Possibly, but that's what SIGSTKSZ/MINSIGSTKSZ is for.  This is already
variable on several platforms due to variable-length vector extensions.

> BTW, when I talk about "not supporting" I don't mean the app should
> crash. I mean it should instead run normally, just without shadow stack
> enabled. Not sure if that was clear. Since shadow stack is not
> essential for an application to function, it is only security hardening
> on top.

I appreciate that.  How far can we go in that direction?  If we can
automatically disable shadow stacks on any call to makecontext, sigaltstack,
or pthread_attr_setstack without causing other threads to crash if they were
in the middle of shadow stack maintenance we can probably simplify this
proposal, although I need to think more about what's possible.

> Although determining if an application supports shadow stack has turned
> out to be difficult in practice. Handling dlopen() is especially hard.

How so?  Is the hard part figuring out if you need to do something, or
doing it?

-s
  
Edgecombe, Rick P Feb. 21, 2024, 12:35 a.m. UTC | #10
On Tue, 2024-02-20 at 18:54 -0500, dalias@libc.org wrote:
> On Tue, Feb 20, 2024 at 11:30:22PM +0000, Edgecombe, Rick P wrote:
> > On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote:
> > > On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P
> > > wrote:
> > > > Hmm, could the shadow stack underflow onto the real stack then?
> > > > Not
> > > > sure how bad that is. INCSSP (incrementing the SSP register on
> > > > x86)
> > > > loops are not rare so it seems like something that could
> > > > happen.
> > > 
> > > Shadow stack underflow should fault on attempt to access
> > > non-shadow-stack memory as shadow-stack, no?
> > 
> > Maybe I'm misunderstanding. I thought the proposal included
> > allowing
> > shadow stack access to convert normal address ranges to shadow
> > stack,
> > and normal writes to convert shadow stack to normal.
> 
> As I understood the original discussion of the proposal on IRC, it
> was
> only one-way (from shadow to normal). Unless I'm missing something,
> making it one-way is necessary to catch situations where the shadow
> stack would become compromised.

The original post here:
https://lore.kernel.org/lkml/22a53b78-10d7-4a5a-a01e-b2f3a8c22e94@app.fastmail.com/

...has:
"Shadow stack faults on non-shadow stack pages, if flexible shadow
stack handling is in effect, cause the affected page to become a shadow
stack page.  When this happens, the page filled with invalid address
tokens."

...and:
"Faults from non-shadow-stack accesses to a shadow-stack page which was
created by the previous paragraph will cause the page to revert to non-
shadow-stack usage, with or without clearing."

I see Stefan has clarified in another response. So I'll go try to
figure it out.

> 
> > > > Shadow stacks currently have automatic guard gaps to try to
> > > > prevent
> > > > one
> > > > thread from overflowing onto another thread's shadow stack.
> > > > This
> > > > would
> > > > somewhat opens that up, as the stack guard gaps are usually
> > > > maintained
> > > > by userspace for new threads. It would have to be thought
> > > > through
> > > > if
> > > > these could still be enforced with checking at additional
> > > > spots.
> > > 
> > > I would think the existing guard pages would already do that if a
> > > thread's shadow stack is contiguous with its own data stack.
> > 
> > The difference is that the kernel provides the guard gaps, where
> > this
> > would rely on userspace to do it. It is not a showstopper either.
> > 
> > I think my biggest question on this is how does it change the
> > capability for two threads to share a shadow stack. It might
> > require
> > some special rules around the syscall that writes restore tokens.
> > So
> > I'm not sure. It probably needs a POC.
> 
> Why would they be sharing a shadow stack?

The guard gap was introduced originally based on a suggestion that
overflowing a shadow stack onto an adjacent shadow stack could cause
corruption that could be used by an attacker to work around the
protection. There wasn't any concrete demonstrated attacks or
suggestion that all the protection was moot.

But when we talk about capabilities for converting memory to shadow
stack with simple memory accesses, and syscalls that can write restore
token to shadow stacks, it's not immediately clear to me that it
wouldn't open up something like that. Like if two restore tokens were
written to a shadow stack, or two shadow stacks were adjacent with
normal memory between them that later got converted to shadow stack.
Those sorts of scenarios, but I won't lean on those specific examples.
Sorry for being hand wavy. It's just where I'm at, at this point.

> 
> > > From the musl side, I have always looked at the entirely of
> > > shadow
> > > stack stuff with very heavy skepticism, and anything that breaks
> > > existing interface contracts, introduced places where apps can
> > > get
> > > auto-killed because a late resource allocation fails, or requires
> > > applications to code around the existence of something that
> > > should be
> > > an implementation detail, is a non-starter. To even consider
> > > shadow
> > > stack support, it must truely be fully non-breaking.
> > 
> > The manual assembly stack switching and JIT code in the apps needs
> > to
> > be updated. I don't think there is a way around it.
> 
> Indeed, I'm not talking about programs with JIT/manual stack-
> switching
> asm, just anything using existing APIs for control of stack --
> pthread_setstack, makecontext, sigaltstack, etc.

Then I think WRSS might fit your requirements better than what glibc
did. It was considered a reduced security mode that made libc's job
much easier and had better compatibility, but the last discussion was
to try to do it without WRSS.

> 
> > I agree though that the late allocation failures are not great.
> > Mark is
> > working on clone3 support which should allow moving the shadow
> > stack
> > allocation to happen in userspace with the normal stack. Even for
> > riscv
> > though, doesn't it need to update a new register in stack
> > switching?
> 
> If clone is called with signals masked, it's probably not necessary
> for the kernel to set the shadow stack register as part of clone3.

So you would want a mode of clone3 that basically leaves the shadow
stack bits alone? Mark was driving that effort, but it doesn't seem
horrible to me on first impression. If it would open up the possibility
of musl support.

> 
> > BTW, x86 shadow stack has a mode where the shadow stack is writable
> > with a special instruction (WRSS). It enables the SSP to be set
> > arbitrarily by writing restore tokens. We discussed this as an
> > option
> > to make the existing longjmp() and signal stuff work more
> > transparently
> > for glibc.
> > 
> > 
> > 
> > BTW, when I talk about "not supporting" I don't mean the app should
> > crash. I mean it should instead run normally, just without shadow
> > stack
> > enabled. Not sure if that was clear. Since shadow stack is not
> > essential for an application to function, it is only security
> > hardening
> > on top.
> > 
> > Although determining if an application supports shadow stack has
> > turned
> > out to be difficult in practice. Handling dlopen() is especially
> > hard.
> 
> One reasonable thing to do, that might be preferable to
> overengineered
> solutions, is to disable shadow-stack process-wide if an interface
> incompatible with it is used (sigaltstack, pthread_create with an
> attribute setup using pthread_attr_setstack, makecontext, etc.), as
> well as if an incompatible library is is dlopened.

I think it would be an interesting approach to determining
compatibility. On x86 there has been cases of binaries getting
mismarked as supporting shadow stack. So an automated way of filtering
some of those out would be very useful I think. I guess the dynamic
linker could determine this based on some list of functions?

The dlopen() bit gets complicated though. You need to disable shadow
stack for all threads, which presumably the kernel could be coaxed into
doing. But those threads might be using shadow stack instructions
(INCSSP, RSTORSSP, etc). These are a collection of instructions that
allow limited control of the SSP. When shadow stack gets disabled,
these suddenly turn into #UD generating instructions. So any other
threads executing those instructions when shadow stack got disabled
would be in for a nasty surprise.

Glibc's permissive mode (that disables shadow stack when dlopen()ing a
DSO that doesn't support shadow stack) is quite limited because of
this. There was a POC for working around it, but I'll stop there for
now, to not spam you with the details. I'm not sure of arm and risc-v
details on this specific corner, but for x86.

>  This is much more
> acceptable than continuing to run with shadow stacks managed sloppily
> by the kernel and async killing the process on OOM, and is probably
> *more compatible* with apps than changing the minimum stack size
> requirements out from under them.

Yep.

> 
> The place where it's really needed to be able to allocate the shadow
> stack synchronously under userspace control, in order to harden
> normal
> applications that aren't doing funny things, is in pthread_create
> without a caller-provided stack.

Yea most apps don't do anything too tricky. Mostly shadow stack "just
works". But it's no excuse to just crash for the others.
  
Mark Brown Feb. 21, 2024, 12:40 a.m. UTC | #11
On Tue, Feb 20, 2024 at 06:59:58PM -0500, Stefan O'Rear wrote:
> On Tue, Feb 20, 2024, at 6:30 PM, Edgecombe, Rick P wrote:

> > Maybe I'm misunderstanding. I thought the proposal included allowing
> > shadow stack access to convert normal address ranges to shadow stack,
> > and normal writes to convert shadow stack to normal.

> Ideally for riscv only writes would cause conversion, an incssp underflow
> which performs shadow stack reads would be able to fault early.

> For arm, since a syscall is needed anyway to set up the token in a new
> shadow stack region, it would make sense for conversion from non-shadow
> to shadow usage to never be automatic.

Well, we only need the token to pivot in userspace so we could
*potentially* work something out as part of the conversion process.
It's not filling me with enthusiasm though, and I've certainly not
actually thought it through yet.

> > I agree though that the late allocation failures are not great. Mark is
> > working on clone3 support which should allow moving the shadow stack
> > allocation to happen in userspace with the normal stack. Even for riscv
> > though, doesn't it need to update a new register in stack switching?

> > BTW, x86 shadow stack has a mode where the shadow stack is writable
> > with a special instruction (WRSS). It enables the SSP to be set
> > arbitrarily by writing restore tokens. We discussed this as an option
> > to make the existing longjmp() and signal stuff work more transparently
> > for glibc.

We have this feature on arm64 too, plus a separately controllable push
instruction (though that's less useful here).

> > BTW, when I talk about "not supporting" I don't mean the app should
> > crash. I mean it should instead run normally, just without shadow stack
> > enabled. Not sure if that was clear. Since shadow stack is not
> > essential for an application to function, it is only security hardening
> > on top.

> I appreciate that.  How far can we go in that direction?  If we can
> automatically disable shadow stacks on any call to makecontext, sigaltstack,
> or pthread_attr_setstack without causing other threads to crash if they were
> in the middle of shadow stack maintenance we can probably simplify this
> proposal, although I need to think more about what's possible.

Aside from concerns about disabling over all the threads in the process
(which should be solvable if annoying) this would be incompatible with
policies which prevent disabling of shadow stacks, and it feels like it
might end up being a gadget people could use which will concern some
people.  There's a tension here between compatibility and the security
applications of these features.
  
Mark Brown Feb. 21, 2024, 12:44 a.m. UTC | #12
On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote:

> doing. But those threads might be using shadow stack instructions
> (INCSSP, RSTORSSP, etc). These are a collection of instructions that
> allow limited control of the SSP. When shadow stack gets disabled,
> these suddenly turn into #UD generating instructions. So any other
> threads executing those instructions when shadow stack got disabled
> would be in for a nasty surprise.

> Glibc's permissive mode (that disables shadow stack when dlopen()ing a
> DSO that doesn't support shadow stack) is quite limited because of
> this. There was a POC for working around it, but I'll stop there for
> now, to not spam you with the details. I'm not sure of arm and risc-v
> details on this specific corner, but for x86.

We have the same issue with disabling GCS causing GCS instructions to
become undefined.
  
dalias@libc.org Feb. 21, 2024, 1:27 a.m. UTC | #13
On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote:
> On Tue, 2024-02-20 at 18:54 -0500, dalias@libc.org wrote:
> > On Tue, Feb 20, 2024 at 11:30:22PM +0000, Edgecombe, Rick P wrote:
> > > On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote:
> > > > On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P
> > > > > Shadow stacks currently have automatic guard gaps to try to
> > > > > prevent
> > > > > one
> > > > > thread from overflowing onto another thread's shadow stack.
> > > > > This
> > > > > would
> > > > > somewhat opens that up, as the stack guard gaps are usually
> > > > > maintained
> > > > > by userspace for new threads. It would have to be thought
> > > > > through
> > > > > if
> > > > > these could still be enforced with checking at additional
> > > > > spots.
> > > > 
> > > > I would think the existing guard pages would already do that if a
> > > > thread's shadow stack is contiguous with its own data stack.
> > > 
> > > The difference is that the kernel provides the guard gaps, where
> > > this
> > > would rely on userspace to do it. It is not a showstopper either.
> > > 
> > > I think my biggest question on this is how does it change the
> > > capability for two threads to share a shadow stack. It might
> > > require
> > > some special rules around the syscall that writes restore tokens.
> > > So
> > > I'm not sure. It probably needs a POC.
> > 
> > Why would they be sharing a shadow stack?
> 
> The guard gap was introduced originally based on a suggestion that
> overflowing a shadow stack onto an adjacent shadow stack could cause
> corruption that could be used by an attacker to work around the
> protection. There wasn't any concrete demonstrated attacks or
> suggestion that all the protection was moot.

OK, so not sharing, just happening to be adjacent.

I was thinking from a standpoint of allocating them as part of the
same range as the main stack, just with different protections, where
that would never happen; you'd always have intervening non-shadowstack
pages. But when they're kernel-allocated, yes, they need their own
guard pages.

> But when we talk about capabilities for converting memory to shadow
> stack with simple memory accesses, and syscalls that can write restore
> token to shadow stacks, it's not immediately clear to me that it
> wouldn't open up something like that. Like if two restore tokens were
> written to a shadow stack, or two shadow stacks were adjacent with
> normal memory between them that later got converted to shadow stack.
> Those sorts of scenarios, but I won't lean on those specific examples.
> Sorry for being hand wavy. It's just where I'm at, at this point.

I don't think it's safe to have automatic conversions back and forth,
only for normal accesses to convert shadowstack to normal memory (in
which case, any subsequent attempt to operate on it as shadow stack
indicates a critical bug and should be trapped to terminate the
process).

> > > > From the musl side, I have always looked at the entirely of
> > > > shadow
> > > > stack stuff with very heavy skepticism, and anything that breaks
> > > > existing interface contracts, introduced places where apps can
> > > > get
> > > > auto-killed because a late resource allocation fails, or requires
> > > > applications to code around the existence of something that
> > > > should be
> > > > an implementation detail, is a non-starter. To even consider
> > > > shadow
> > > > stack support, it must truely be fully non-breaking.
> > > 
> > > The manual assembly stack switching and JIT code in the apps needs
> > > to
> > > be updated. I don't think there is a way around it.
> > 
> > Indeed, I'm not talking about programs with JIT/manual stack-
> > switching
> > asm, just anything using existing APIs for control of stack --
> > pthread_setstack, makecontext, sigaltstack, etc.
> 
> Then I think WRSS might fit your requirements better than what glibc
> did. It was considered a reduced security mode that made libc's job
> much easier and had better compatibility, but the last discussion was
> to try to do it without WRSS.

Where can I read more about this? Some searches I tried didn't turn up
much useful information.

> > > I agree though that the late allocation failures are not great.
> > > Mark is
> > > working on clone3 support which should allow moving the shadow
> > > stack
> > > allocation to happen in userspace with the normal stack. Even for
> > > riscv
> > > though, doesn't it need to update a new register in stack
> > > switching?
> > 
> > If clone is called with signals masked, it's probably not necessary
> > for the kernel to set the shadow stack register as part of clone3.
> 
> So you would want a mode of clone3 that basically leaves the shadow
> stack bits alone? Mark was driving that effort, but it doesn't seem
> horrible to me on first impression. If it would open up the possibility
> of musl support.

Well I'm not sure. That's what we're trying to figure out. But I don't
think modifying it is a hard requirement, since it can be modified
from userspace if needed as long as signals are masked.

> > One reasonable thing to do, that might be preferable to
> > overengineered
> > solutions, is to disable shadow-stack process-wide if an interface
> > incompatible with it is used (sigaltstack, pthread_create with an
> > attribute setup using pthread_attr_setstack, makecontext, etc.), as
> > well as if an incompatible library is is dlopened.
> 
> I think it would be an interesting approach to determining
> compatibility. On x86 there has been cases of binaries getting
> mismarked as supporting shadow stack. So an automated way of filtering
> some of those out would be very useful I think. I guess the dynamic
> linker could determine this based on some list of functions?

I didn't follow this whole mess, but from our side (musl) it does not
seem relevant. There are no legacy binaries wrongly marked because we
have never supported shadow stacks so far.

> The dlopen() bit gets complicated though. You need to disable shadow
> stack for all threads, which presumably the kernel could be coaxed into
> doing. But those threads might be using shadow stack instructions
> (INCSSP, RSTORSSP, etc). These are a collection of instructions that
> allow limited control of the SSP. When shadow stack gets disabled,
> these suddenly turn into #UD generating instructions. So any other
> threads executing those instructions when shadow stack got disabled
> would be in for a nasty surprise.

This is the kernel's problem if that's happening. It should be
trapping these and returning immediately like a NOP if shadow stack
has been disabled, not generating SIGILL.

> > The place where it's really needed to be able to allocate the shadow
> > stack synchronously under userspace control, in order to harden
> > normal
> > applications that aren't doing funny things, is in pthread_create
> > without a caller-provided stack.
> 
> Yea most apps don't do anything too tricky. Mostly shadow stack "just
> works". But it's no excuse to just crash for the others.

One thing to note here is that, to enable this, we're going to need
some way to detect "new enough kernel that shadow stack semantics are
all right". If there are kernels that have shadow stack support but
with problems that make it unsafe to use (this sounds like the case),
we can't turn it on without a way to avoid trying to use it on those.

Rich
  
Edgecombe, Rick P Feb. 21, 2024, 2:11 a.m. UTC | #14
On Tue, 2024-02-20 at 20:27 -0500, dalias@libc.org wrote:
> > Then I think WRSS might fit your requirements better than what
> > glibc
> > did. It was considered a reduced security mode that made libc's job
> > much easier and had better compatibility, but the last discussion
> > was
> > to try to do it without WRSS.
> 
> Where can I read more about this? Some searches I tried didn't turn
> up
> much useful information.

There never was any proposal written down AFAIK. In the past we have
had a couple "shadow stack meetup" calls where folks who are working on
shadow stack got together to hash out some things. We discussed it
there.

But briefly, in the Intel SDM (and other places) there is documentation
on the special shadow stack instructions. The two key ones for this are
WRSS and RSTORSSP. WRSS is an instruction which can be enabled by the
kernel (and there is upstream support for this). The instruction can
write through shadow stack memory.

RSTORSSP can be used to consume a restore token, which is a special
value on the shadow stack. When this operations happens the SSP is
moved adjacent to the token that was just consumed. So between the two
of them the SSP can be adjusted to specific spots on the shadow stack
or another shadow stack.

Today when you longjmp() with shadow stack in glibc, INCSSP is used to
move the SSP back to the spot on the shadow stack where the setjmp()
was called. But this algorithm doesn't always work, for example,
longjmp()ing between stacks. To work around this glibc uses a scheme
where it searches from the target SSP for a shadow stack token and then
consumes it and INCSSPs back to the target SPP. It just barely
miraculously worked in most cases.

Some specific cases that were still open were longjmp()ing off of a
custom userspace threading library stack, which may not have left a
token behind when it jumped to a new stack. And also, potentially off
of an alt shadow stack in the future, depending on whether it leaves a
restore token when handling a signal. (the problem there, is if there
is no room to leave it).

So that is how x86 glic works, and I think arm was thinking along the
same lines. But if you have WRSS (and arm's version), you could just
write a restore token or anything else you need to fixup on the shadow
stack. Then you could longjmp() in one go without any high wire acts.
It's much simpler and more robust and would prevent needing to leave a
restore token when handling a signal to an alt shadow stack. Although,
nothing was ever prototyped. So "in theory".

But that is all about moving the SSP where you need it. It doesn't
resolve any of the allocation lifecycle issues. I think for those the
solutions are:
1. Not supporting ucontext/sigaltstack and shadow stack
2. Stefan's idea
3. A new interface that takes user allocated shadow stacks for those
operations

My preference has been a combination of 1 and 3. For threads, I think
Mark's clone3 enhancements will help.

Anyway, there is an attempt at a summary. I'd also point you to HJ for
more glibc context, as I mostly worked on the kernel side.
  
Edgecombe, Rick P Feb. 21, 2024, 4:18 a.m. UTC | #15
On Tue, 2024-02-20 at 18:11 -0800, Rick Edgecombe wrote:
> Some specific cases that were still open were longjmp()ing off of a
> custom userspace threading library stack, which may not have left a
> token behind when it jumped to a new stack. And also, potentially off
> of an alt shadow stack in the future, depending on whether it leaves
> a
> restore token when handling a signal. (the problem there, is if there
> is no room to leave it).

Ah, I remember the other one. If the token on the target shadow stack
is at the end of the shadow stack, it may not be able to handle pushing
a shadow stack signal frame if a signal hits while is unwinding through
the token. As in, where normal longjmp() is direct transition, in this
case the longjmp() operation can be temporarily in a place where a
signal cannot be handled.
  
Edgecombe, Rick P Feb. 21, 2024, 4:30 a.m. UTC | #16
On Tue, 2024-02-20 at 18:59 -0500, Stefan O'Rear wrote:
> 
> Ideally for riscv only writes would cause conversion, an incssp
> underflow
> which performs shadow stack reads would be able to fault early.

Why can't makecontext() just clobber part of the low address side of
the passed in stack with a shadow stack mapping? Like say it just
munmap()'s part of the passed stack, and map_shadow_stack() in it's
place.

Then you could still have the shadow stack->normal conversion process
triggered by normal writes. IIUC the concern there is to make sure the
caller can reuse it as normal memory when it is done with the
ucontext/sigaltstack stuff? So the normal->shadow stack part could be
explicit.

But the more I think about this, the more I think it is a hack, and a
proper fix is to use new interfaces. It also would be difficult to
sell, if the faulting conversion stuff is in any way complex.
  
Mark Brown Feb. 21, 2024, 1:53 p.m. UTC | #17
On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote:
> On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote:

> > (INCSSP, RSTORSSP, etc). These are a collection of instructions that
> > allow limited control of the SSP. When shadow stack gets disabled,
> > these suddenly turn into #UD generating instructions. So any other
> > threads executing those instructions when shadow stack got disabled
> > would be in for a nasty surprise.

> This is the kernel's problem if that's happening. It should be
> trapping these and returning immediately like a NOP if shadow stack
> has been disabled, not generating SIGILL.

I'm not sure that's going to work out well, all it takes is some code
that's looking at the shadow stack and expecting something to happen as
a result of the instructions it's executing and we run into trouble.  A
lot of things won't notice and will just happily carry on but I expect
there are going to be things that care.  We also end up with an
additional state for threads that have had shadow stacks transparently
disabled, that's managable but still.

> > > The place where it's really needed to be able to allocate the shadow
> > > stack synchronously under userspace control, in order to harden
> > > normal
> > > applications that aren't doing funny things, is in pthread_create
> > > without a caller-provided stack.

> > Yea most apps don't do anything too tricky. Mostly shadow stack "just
> > works". But it's no excuse to just crash for the others.

> One thing to note here is that, to enable this, we're going to need
> some way to detect "new enough kernel that shadow stack semantics are
> all right". If there are kernels that have shadow stack support but
> with problems that make it unsafe to use (this sounds like the case),
> we can't turn it on without a way to avoid trying to use it on those.

If we have this automatic conversion of pages to shadow stack then we
should have an API for enabling it, userspace should be able to use the
presence of that API to determine if the feature is there.
  
dalias@libc.org Feb. 21, 2024, 2:58 p.m. UTC | #18
On Wed, Feb 21, 2024 at 01:53:10PM +0000, Mark Brown wrote:
> On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote:
> > On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote:
> 
> > > (INCSSP, RSTORSSP, etc). These are a collection of instructions that
> > > allow limited control of the SSP. When shadow stack gets disabled,
> > > these suddenly turn into #UD generating instructions. So any other
> > > threads executing those instructions when shadow stack got disabled
> > > would be in for a nasty surprise.
> 
> > This is the kernel's problem if that's happening. It should be
> > trapping these and returning immediately like a NOP if shadow stack
> > has been disabled, not generating SIGILL.
> 
> I'm not sure that's going to work out well, all it takes is some code
> that's looking at the shadow stack and expecting something to happen as
> a result of the instructions it's executing and we run into trouble.  A
> lot of things won't notice and will just happily carry on but I expect
> there are going to be things that care.  We also end up with an
> additional state for threads that have had shadow stacks transparently
> disabled, that's managable but still.

I said NOP but there's no reason it strictly needs to be a NOP. It
could instead do something reasonable to convey the state of racing
with shadow stack being disabled.

> 
> > > > The place where it's really needed to be able to allocate the shadow
> > > > stack synchronously under userspace control, in order to harden
> > > > normal
> > > > applications that aren't doing funny things, is in pthread_create
> > > > without a caller-provided stack.
> 
> > > Yea most apps don't do anything too tricky. Mostly shadow stack "just
> > > works". But it's no excuse to just crash for the others.
> 
> > One thing to note here is that, to enable this, we're going to need
> > some way to detect "new enough kernel that shadow stack semantics are
> > all right". If there are kernels that have shadow stack support but
> > with problems that make it unsafe to use (this sounds like the case),
> > we can't turn it on without a way to avoid trying to use it on those.
> 
> If we have this automatic conversion of pages to shadow stack then we
> should have an API for enabling it, userspace should be able to use the
> presence of that API to determine if the feature is there.

Yes, or if a new prctl is needed to make disabling safe (see above)
that could probably be used.

Rich
  
Mark Brown Feb. 21, 2024, 5:36 p.m. UTC | #19
On Wed, Feb 21, 2024 at 09:58:01AM -0500, dalias@libc.org wrote:
> On Wed, Feb 21, 2024 at 01:53:10PM +0000, Mark Brown wrote:
> > On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote:
> > > On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote:

> > > > (INCSSP, RSTORSSP, etc). These are a collection of instructions that
> > > > allow limited control of the SSP. When shadow stack gets disabled,
> > > > these suddenly turn into #UD generating instructions. So any other
> > > > threads executing those instructions when shadow stack got disabled
> > > > would be in for a nasty surprise.

> > > This is the kernel's problem if that's happening. It should be
> > > trapping these and returning immediately like a NOP if shadow stack
> > > has been disabled, not generating SIGILL.

> > I'm not sure that's going to work out well, all it takes is some code
> > that's looking at the shadow stack and expecting something to happen as
> > a result of the instructions it's executing and we run into trouble.  A

> I said NOP but there's no reason it strictly needs to be a NOP. It
> could instead do something reasonable to convey the state of racing
> with shadow stack being disabled.

This feels like it's getting complicated and I fear it may be an uphill
struggle to get such code merged, at least for arm64.  My instinct is
that it's going to be much more robust and generally tractable to let
things run to some suitable synchronisation point and then disable
there, but if we're going to do that then userspace can hopefully
arrange to do the disabling itself through the standard disable
interface anyway.  Presumably it'll want to notice things being disabled
at some point anyway?  TBH that's been how all the prior proposals for
process wide disable I've seen were done.
  
dalias@libc.org Feb. 21, 2024, 5:57 p.m. UTC | #20
On Wed, Feb 21, 2024 at 05:36:12PM +0000, Mark Brown wrote:
> On Wed, Feb 21, 2024 at 09:58:01AM -0500, dalias@libc.org wrote:
> > On Wed, Feb 21, 2024 at 01:53:10PM +0000, Mark Brown wrote:
> > > On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote:
> > > > On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote:
> 
> > > > > (INCSSP, RSTORSSP, etc). These are a collection of instructions that
> > > > > allow limited control of the SSP. When shadow stack gets disabled,
> > > > > these suddenly turn into #UD generating instructions. So any other
> > > > > threads executing those instructions when shadow stack got disabled
> > > > > would be in for a nasty surprise.
> 
> > > > This is the kernel's problem if that's happening. It should be
> > > > trapping these and returning immediately like a NOP if shadow stack
> > > > has been disabled, not generating SIGILL.
> 
> > > I'm not sure that's going to work out well, all it takes is some code
> > > that's looking at the shadow stack and expecting something to happen as
> > > a result of the instructions it's executing and we run into trouble.  A
> 
> > I said NOP but there's no reason it strictly needs to be a NOP. It
> > could instead do something reasonable to convey the state of racing
> > with shadow stack being disabled.
> 
> This feels like it's getting complicated and I fear it may be an uphill
> struggle to get such code merged, at least for arm64.  My instinct is
> that it's going to be much more robust and generally tractable to let
> things run to some suitable synchronisation point and then disable
> there, but if we're going to do that then userspace can hopefully
> arrange to do the disabling itself through the standard disable
> interface anyway.  Presumably it'll want to notice things being disabled
> at some point anyway?  TBH that's been how all the prior proposals for
> process wide disable I've seen were done.

If it's possible to disable per-thread rather than per-process, some
things are easier. Disabling on account of using alt stacks only needs
to be done on the threads using those stacks. However, for dlopen
purposes you need a way to disable shadow stack for the whole process.
Initially this is only needed for the thread that called dlopen, but
it needs to have propagated to any thread that synchronizes with
completion of the call to dlopen by the time that synchronization
occurs, and since that synchronization can happen in lots of different
ways that are purely userspace (thanks to futexes being userspace in
the uncontended case), I don't see any way to make it work without
extremely invasive, high-cost checks.

If folks on the kernel side are not going to be amenable to doing the
things that are easy for the kernel to make it work without breaking
compatibility with existing interfaces, but that are impossible or
near-impossible for userspace to do, this seems like a dead-end. And I
suspect an operation to "disable shadow stack, but without making
threads still in SS-critical sections crash" is going to be
necessary..

Rich
  
Edgecombe, Rick P Feb. 21, 2024, 6:12 p.m. UTC | #21
On Wed, 2024-02-21 at 12:57 -0500, dalias@libc.org wrote:
> > This feels like it's getting complicated and I fear it may be an
> > uphill
> > struggle to get such code merged, at least for arm64.  My instinct
> > is
> > that it's going to be much more robust and generally tractable to
> > let
> > things run to some suitable synchronisation point and then disable
> > there, but if we're going to do that then userspace can hopefully
> > arrange to do the disabling itself through the standard disable
> > interface anyway.  Presumably it'll want to notice things being
> > disabled
> > at some point anyway?  TBH that's been how all the prior proposals
> > for
> > process wide disable I've seen were done.
> 
> If it's possible to disable per-thread rather than per-process, some
> things are easier. Disabling on account of using alt stacks only
> needs
> to be done on the threads using those stacks. However, for dlopen
> purposes you need a way to disable shadow stack for the whole
> process.
> Initially this is only needed for the thread that called dlopen, but
> it needs to have propagated to any thread that synchronizes with
> completion of the call to dlopen by the time that synchronization
> occurs, and since that synchronization can happen in lots of
> different
> ways that are purely userspace (thanks to futexes being userspace in
> the uncontended case), I don't see any way to make it work without
> extremely invasive, high-cost checks.

For glibc's use, we talked about a couple of options.
1. A mode to start suppressing the #UD's from the shadow stack
instructions
2. A mode to start suppressing #CPs (the exception that happens when
the shadow stack doesn't match). So the shadow stack instructions
continue to operate normally, but if the shadow stack gets mismatched
due to lack of support, the ret is emulated. It probably is safer (but
still not perfect), but the performance penalty of emulating every RET
after things get screwed up would be a significant down side. There
also needs to be clean handling of shadow stack #PFs.
3. Per-thread locking that is used around all shadow stack operations
that could be sensitive to disabling. This could be maybe exposed to
apps in case they want to use shadow stack instructions manually. Then
during dlopen() it waits until it can cleanly disable shadow stack for
each thread. In each critical sections there are checks for whether
shadow stack is still enabled.

3 is the cleanest and safest I think, and it was thought it might not
need kernel help, due to a scheme Florian had to direct signals to
specific threads. It's my preference at this point.

1 and 2 are POCed here, if you are interested:
https://github.com/rpedgeco/linux/commits/shstk_suppress_rfc/

> 
> If folks on the kernel side are not going to be amenable to doing the
> things that are easy for the kernel to make it work without breaking
> compatibility with existing interfaces, but that are impossible or
> near-impossible for userspace to do, this seems like a dead-end. And
> I
> suspect an operation to "disable shadow stack, but without making
> threads still in SS-critical sections crash" is going to be
> necessary..

I think we have to work through all the alternative before we can
accuse the kernel of not being amenable. Is there something that you
would like to see out of this conversation that is not happening?
  
dalias@libc.org Feb. 21, 2024, 6:30 p.m. UTC | #22
On Wed, Feb 21, 2024 at 06:12:30PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2024-02-21 at 12:57 -0500, dalias@libc.org wrote:
> > > This feels like it's getting complicated and I fear it may be an
> > > uphill
> > > struggle to get such code merged, at least for arm64.  My instinct
> > > is
> > > that it's going to be much more robust and generally tractable to
> > > let
> > > things run to some suitable synchronisation point and then disable
> > > there, but if we're going to do that then userspace can hopefully
> > > arrange to do the disabling itself through the standard disable
> > > interface anyway.  Presumably it'll want to notice things being
> > > disabled
> > > at some point anyway?  TBH that's been how all the prior proposals
> > > for
> > > process wide disable I've seen were done.
> > 
> > If it's possible to disable per-thread rather than per-process, some
> > things are easier. Disabling on account of using alt stacks only
> > needs
> > to be done on the threads using those stacks. However, for dlopen
> > purposes you need a way to disable shadow stack for the whole
> > process.
> > Initially this is only needed for the thread that called dlopen, but
> > it needs to have propagated to any thread that synchronizes with
> > completion of the call to dlopen by the time that synchronization
> > occurs, and since that synchronization can happen in lots of
> > different
> > ways that are purely userspace (thanks to futexes being userspace in
> > the uncontended case), I don't see any way to make it work without
> > extremely invasive, high-cost checks.
> 
> For glibc's use, we talked about a couple of options.
> 1. A mode to start suppressing the #UD's from the shadow stack
> instructions
> 2. A mode to start suppressing #CPs (the exception that happens when
> the shadow stack doesn't match). So the shadow stack instructions
> continue to operate normally, but if the shadow stack gets mismatched
> due to lack of support, the ret is emulated. It probably is safer (but
> still not perfect), but the performance penalty of emulating every RET
> after things get screwed up would be a significant down side. There
> also needs to be clean handling of shadow stack #PFs.
> 3. Per-thread locking that is used around all shadow stack operations
> that could be sensitive to disabling. This could be maybe exposed to
> apps in case they want to use shadow stack instructions manually. Then
> during dlopen() it waits until it can cleanly disable shadow stack for
> each thread. In each critical sections there are checks for whether
> shadow stack is still enabled.
> 
> 3 is the cleanest and safest I think, and it was thought it might not
> need kernel help, due to a scheme Florian had to direct signals to
> specific threads. It's my preference at this point.

The operations where the shadow stack has to be processed need to be
executable from async-signal context, so this imposes a requirement to
block all signals around the lock. This makes all longjmps a heavy,
multi-syscall operation rather than O(1) userspace operation. I do not
think this is an acceptable implementation, especially when there are
clearly superior alternatives without that cost or invasiveness.

> 1 and 2 are POCed here, if you are interested:
> https://github.com/rpedgeco/linux/commits/shstk_suppress_rfc/

I'm not clear why 2 (suppression of #CP) is desirable at all. If
shadow stack is being disabled, it should just be disabled, with
minimal fault handling to paper over any racing operations at the
moment it's disabled. Leaving it on with extreme slowness to make it
not actually do anything does not seem useful.

Is there some way folks have in mind to use option 2 to lazily disable
shadow stack once the first SS-incompatible code is executed, when
execution is then known not to be in the middle of a SS-critical
section, instead of doing it right away? I don't see how this could
work, since the SS-incompatible code could be running from a signal
handler that interrupted an SS-critical section.

> > If folks on the kernel side are not going to be amenable to doing the
> > things that are easy for the kernel to make it work without breaking
> > compatibility with existing interfaces, but that are impossible or
> > near-impossible for userspace to do, this seems like a dead-end. And
> > I
> > suspect an operation to "disable shadow stack, but without making
> > threads still in SS-critical sections crash" is going to be
> > necessary..
> 
> I think we have to work through all the alternative before we can
> accuse the kernel of not being amenable. Is there something that you
> would like to see out of this conversation that is not happening?

No, I was just interpreting "uphill battle". I really do not want to
engage in an uphill battle for the sake of making it practical to
support something that was never my goal to begin with. If I'm
misreading this, or if others are willing to put the effort into that
"battle", I'd be happy to be mistaken about "not amenable".

Rich
  
Mark Brown Feb. 21, 2024, 6:32 p.m. UTC | #23
On Wed, Feb 21, 2024 at 12:57:19PM -0500, dalias@libc.org wrote:
> On Wed, Feb 21, 2024 at 05:36:12PM +0000, Mark Brown wrote:

> > This feels like it's getting complicated and I fear it may be an uphill
> > struggle to get such code merged, at least for arm64.  My instinct is
> > that it's going to be much more robust and generally tractable to let
> > things run to some suitable synchronisation point and then disable
> > there, but if we're going to do that then userspace can hopefully
> > arrange to do the disabling itself through the standard disable
> > interface anyway.  Presumably it'll want to notice things being disabled
> > at some point anyway?  TBH that's been how all the prior proposals for
> > process wide disable I've seen were done.

> If it's possible to disable per-thread rather than per-process, some
> things are easier. Disabling on account of using alt stacks only needs

Both x86 and arm64 currently track shadow stack enablement per thread,
not per process, so it's not just possible to do per thread it's the
only thing we're currently implementing.  I think the same is true for
RISC-V but I didn't look as closely at that yet.

> to be done on the threads using those stacks. However, for dlopen
> purposes you need a way to disable shadow stack for the whole process.
> Initially this is only needed for the thread that called dlopen, but
> it needs to have propagated to any thread that synchronizes with
> completion of the call to dlopen by the time that synchronization
> occurs, and since that synchronization can happen in lots of different
> ways that are purely userspace (thanks to futexes being userspace in
> the uncontended case), I don't see any way to make it work without
> extremely invasive, high-cost checks.

Yeah, it's not particularly nice - any whole process disable is going to
have some nasty cases I think.  Rick's message about covered AFAIR the
discussion, there were also some proposals for more limited userspaces I
think.

> If folks on the kernel side are not going to be amenable to doing the
> things that are easy for the kernel to make it work without breaking
> compatibility with existing interfaces, but that are impossible or
> near-impossible for userspace to do, this seems like a dead-end. And I
> suspect an operation to "disable shadow stack, but without making
> threads still in SS-critical sections crash" is going to be
> necessary..

Could you be more specific as to the easy things that you're referencing
here?
  
Edgecombe, Rick P Feb. 21, 2024, 6:53 p.m. UTC | #24
On Wed, 2024-02-21 at 13:30 -0500, dalias@libc.org wrote:
> > 3 is the cleanest and safest I think, and it was thought it might
> > not
> > need kernel help, due to a scheme Florian had to direct signals to
> > specific threads. It's my preference at this point.
> 
> The operations where the shadow stack has to be processed need to be
> executable from async-signal context, so this imposes a requirement
> to
> block all signals around the lock. This makes all longjmps a heavy,
> multi-syscall operation rather than O(1) userspace operation. I do
> not
> think this is an acceptable implementation, especially when there are
> clearly superior alternatives without that cost or invasiveness.

That is a good point. Could the per-thread locks be nestable to address
this? We just need to know if a thread *might* be using shadow stacks.
So we really just need a per-thread count.

> 
> > 1 and 2 are POCed here, if you are interested:
> > https://github.com/rpedgeco/linux/commits/shstk_suppress_rfc/
> 
> I'm not clear why 2 (suppression of #CP) is desirable at all. If
> shadow stack is being disabled, it should just be disabled, with
> minimal fault handling to paper over any racing operations at the
> moment it's disabled. Leaving it on with extreme slowness to make it
> not actually do anything does not seem useful.

The benefit is that code that is using shadow stack instructions won't
crash if it relies on them working. For example RDSSP turns into a NOP
if shadow stack is disabled, and the intrinsic is written such that a
NULL pointer is returned if shadow stack is disabled. The shadow stack
is normally readable, and this happens in glibc sometimes. So if there
was code like:

   long foo = *(long *)_get_ssp();

...then it could suddenly read a NULL pointer if shadow stack got
disabled. (notice, it's not even a "shadow stack access" fault-wise. So
it was looked at as somewhat more robust. But neither 1 or 2 are
perfect for apps that are manually using shadow stack instructions.

> 
> Is there some way folks have in mind to use option 2 to lazily
> disable
> shadow stack once the first SS-incompatible code is executed, when
> execution is then known not to be in the middle of a SS-critical
> section, instead of doing it right away? I don't see how this could
> work, since the SS-incompatible code could be running from a signal
> handler that interrupted an SS-critical section.

The idea was to disable it without critical sections, and it could be
more robust, but not perfect. I was preferring option 1 between 1 and
2, which was closer to your original suggestion. But it has problems
like the example I gave above. I agree 1 is relatively simpler for the
kernel, between 1 and 2.

> 
> > > If folks on the kernel side are not going to be amenable to doing
> > > the
> > > things that are easy for the kernel to make it work without
> > > breaking
> > > compatibility with existing interfaces, but that are impossible
> > > or
> > > near-impossible for userspace to do, this seems like a dead-end.
> > > And
> > > I
> > > suspect an operation to "disable shadow stack, but without making
> > > threads still in SS-critical sections crash" is going to be
> > > necessary..
> > 
> > I think we have to work through all the alternative before we can
> > accuse the kernel of not being amenable. Is there something that
> > you
> > would like to see out of this conversation that is not happening?
> 
> No, I was just interpreting "uphill battle". I really do not want to
> engage in an uphill battle for the sake of making it practical to
> support something that was never my goal to begin with. If I'm
> misreading this, or if others are willing to put the effort into that
> "battle", I'd be happy to be mistaken about "not amenable".

I don't think x86 maintainers have put a foot down on anything around
this at least. They would normally have concerns about complexity and
maintainability. So if we have something that has lower value
(imperfect solution), and high complexity, it starts to look like less
promising path.
  
dalias@libc.org Feb. 21, 2024, 7:06 p.m. UTC | #25
On Wed, Feb 21, 2024 at 06:53:44PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2024-02-21 at 13:30 -0500, dalias@libc.org wrote:
> > > 3 is the cleanest and safest I think, and it was thought it might
> > > not
> > > need kernel help, due to a scheme Florian had to direct signals to
> > > specific threads. It's my preference at this point.
> > 
> > The operations where the shadow stack has to be processed need to be
> > executable from async-signal context, so this imposes a requirement
> > to
> > block all signals around the lock. This makes all longjmps a heavy,
> > multi-syscall operation rather than O(1) userspace operation. I do
> > not
> > think this is an acceptable implementation, especially when there are
> > clearly superior alternatives without that cost or invasiveness.
> 
> That is a good point. Could the per-thread locks be nestable to address
> this? We just need to know if a thread *might* be using shadow stacks.
> So we really just need a per-thread count.

Due to arbitrarily nestable signal frames, no, this does not suffice.
An interrupted operation using the lock could be arbitrarily delayed,
even never execute again, making any call to dlopen deadlock.

> > > 1 and 2 are POCed here, if you are interested:
> > > https://github.com/rpedgeco/linux/commits/shstk_suppress_rfc/
> > 
> > I'm not clear why 2 (suppression of #CP) is desirable at all. If
> > shadow stack is being disabled, it should just be disabled, with
> > minimal fault handling to paper over any racing operations at the
> > moment it's disabled. Leaving it on with extreme slowness to make it
> > not actually do anything does not seem useful.
> 
> The benefit is that code that is using shadow stack instructions won't
> crash if it relies on them working. For example RDSSP turns into a NOP
> if shadow stack is disabled, and the intrinsic is written such that a
> NULL pointer is returned if shadow stack is disabled. The shadow stack
> is normally readable, and this happens in glibc sometimes. So if there
> was code like:
> 
>    long foo = *(long *)_get_ssp();
> 
> ...then it could suddenly read a NULL pointer if shadow stack got
> disabled. (notice, it's not even a "shadow stack access" fault-wise. So
> it was looked at as somewhat more robust. But neither 1 or 2 are
> perfect for apps that are manually using shadow stack instructions.

It's fine to turn RDSSP into an actual emulated read of the SSP, or at
least an emulated load of zero so that uninitialized data is not left
in the target register. If doing the latter, code working with the
shadow stack just needs to be prepared for the possibility that it
could be async-disabled, and check the return value.

I have not looked at all the instructions that become #UD but I
suspect they all have reasonable trivial ways to implement a
"disabled" version of them that userspace can act upon reasonably.

Rich
  
dalias@libc.org Feb. 21, 2024, 7:10 p.m. UTC | #26
On Wed, Feb 21, 2024 at 06:32:20PM +0000, Mark Brown wrote:
> On Wed, Feb 21, 2024 at 12:57:19PM -0500, dalias@libc.org wrote:
> > On Wed, Feb 21, 2024 at 05:36:12PM +0000, Mark Brown wrote:
> 
> > > This feels like it's getting complicated and I fear it may be an uphill
> > > struggle to get such code merged, at least for arm64.  My instinct is
> > > that it's going to be much more robust and generally tractable to let
> > > things run to some suitable synchronisation point and then disable
> > > there, but if we're going to do that then userspace can hopefully
> > > arrange to do the disabling itself through the standard disable
> > > interface anyway.  Presumably it'll want to notice things being disabled
> > > at some point anyway?  TBH that's been how all the prior proposals for
> > > process wide disable I've seen were done.
> 
> > If it's possible to disable per-thread rather than per-process, some
> > things are easier. Disabling on account of using alt stacks only needs
> 
> Both x86 and arm64 currently track shadow stack enablement per thread,
> not per process, so it's not just possible to do per thread it's the
> only thing we're currently implementing.  I think the same is true for
> RISC-V but I didn't look as closely at that yet.

That's nice! It allows still keeping part of the benefit of SS in
programs which have some threads running with custom stacks. We do
however need a global-disable option for dlopen. In musl this could be
done via the same mechanism ("synccall") used for set*id -- it's
basically userspace IPI. But just having a native operation would be
nicer, and would probably help glibc where I don't think they
abstracted their set*id mechanism to do other things like this.

> > If folks on the kernel side are not going to be amenable to doing the
> > things that are easy for the kernel to make it work without breaking
> > compatibility with existing interfaces, but that are impossible or
> > near-impossible for userspace to do, this seems like a dead-end. And I
> > suspect an operation to "disable shadow stack, but without making
> > threads still in SS-critical sections crash" is going to be
> > necessary..
> 
> Could you be more specific as to the easy things that you're referencing
> here?

Basically the ARCH_SHSTK_SUPPRESS_UD proposal.

Rich
  
Edgecombe, Rick P Feb. 21, 2024, 7:22 p.m. UTC | #27
On Wed, 2024-02-21 at 14:06 -0500, dalias@libc.org wrote:
> Due to arbitrarily nestable signal frames, no, this does not suffice.
> An interrupted operation using the lock could be arbitrarily delayed,
> even never execute again, making any call to dlopen deadlock.

Doh! Yep, it is not robust to this. The only thing that could be done
would be a timeout in dlopen(). Which would make the whole thing just
better than nothing.

> 
> > 
> 
> It's fine to turn RDSSP into an actual emulated read of the SSP, or
> at
> least an emulated load of zero so that uninitialized data is not left
> in the target register.

We can't intercept RDSSP, but it becomes a NOP by default. (disclaimer
x86-only knowledge).

>  If doing the latter, code working with the
> shadow stack just needs to be prepared for the possibility that it
> could be async-disabled, and check the return value.
> 
> I have not looked at all the instructions that become #UD but I
> suspect they all have reasonable trivial ways to implement a
> "disabled" version of them that userspace can act upon reasonably.

This would have to be thought through functionally and performance
wise. I'm not opposed if can come up with a fully fleshed out plan. How
serious are you in pursuing musl support, if we had something like
this?

HJ, any thoughts on whether glibc would use this as well?

It is probably worth mentioning that from the security side (as Mark
mentioned there is always tension in the tradeoffs on these features),
permissive mode is seen by some as something that weakens security too
much. Apps could call dlopen() on a known unsupported DSO before doing
ROP. I don't know if you have any musl users with specific shadow stack
use cases to ask about this.
  
H.J. Lu Feb. 21, 2024, 8:18 p.m. UTC | #28
On Wed, Feb 21, 2024 at 11:22 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2024-02-21 at 14:06 -0500, dalias@libc.org wrote:
> > Due to arbitrarily nestable signal frames, no, this does not suffice.
> > An interrupted operation using the lock could be arbitrarily delayed,
> > even never execute again, making any call to dlopen deadlock.
>
> Doh! Yep, it is not robust to this. The only thing that could be done
> would be a timeout in dlopen(). Which would make the whole thing just
> better than nothing.
>
> >
> > >
> >
> > It's fine to turn RDSSP into an actual emulated read of the SSP, or
> > at
> > least an emulated load of zero so that uninitialized data is not left
> > in the target register.
>
> We can't intercept RDSSP, but it becomes a NOP by default. (disclaimer
> x86-only knowledge).
>
> >  If doing the latter, code working with the
> > shadow stack just needs to be prepared for the possibility that it
> > could be async-disabled, and check the return value.
> >
> > I have not looked at all the instructions that become #UD but I
> > suspect they all have reasonable trivial ways to implement a
> > "disabled" version of them that userspace can act upon reasonably.
>
> This would have to be thought through functionally and performance
> wise. I'm not opposed if can come up with a fully fleshed out plan. How
> serious are you in pursuing musl support, if we had something like
> this?
>
> HJ, any thoughts on whether glibc would use this as well?

Assuming that we are talking about permissive mode,  if kernel can
suppress UD, we don't need to disable SHSTK.   Glibc can enable
ARCH_SHSTK_SUPPRESS_UD instead.

> It is probably worth mentioning that from the security side (as Mark
> mentioned there is always tension in the tradeoffs on these features),
> permissive mode is seen by some as something that weakens security too
> much. Apps could call dlopen() on a known unsupported DSO before doing
> ROP. I don't know if you have any musl users with specific shadow stack
> use cases to ask about this.
  
dalias@libc.org Feb. 21, 2024, 8:18 p.m. UTC | #29
On Wed, Feb 21, 2024 at 07:22:21PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2024-02-21 at 14:06 -0500, dalias@libc.org wrote:
> > Due to arbitrarily nestable signal frames, no, this does not suffice.
> > An interrupted operation using the lock could be arbitrarily delayed,
> > even never execute again, making any call to dlopen deadlock.
> 
> Doh! Yep, it is not robust to this. The only thing that could be done
> would be a timeout in dlopen(). Which would make the whole thing just
> better than nothing.
> 
> > 
> > > 
> > 
> > It's fine to turn RDSSP into an actual emulated read of the SSP, or
> > at
> > least an emulated load of zero so that uninitialized data is not left
> > in the target register.
> 
> We can't intercept RDSSP, but it becomes a NOP by default. (disclaimer
> x86-only knowledge).

OK, then I think the contract just has to be that userspace, in a
process that might dynamically disable shadow stack, needs to do
something like xor %reg,%reg before rdssp so that the outcome is
deterministic in disabled case.

> >  If doing the latter, code working with the
> > shadow stack just needs to be prepared for the possibility that it
> > could be async-disabled, and check the return value.
> > 
> > I have not looked at all the instructions that become #UD but I
> > suspect they all have reasonable trivial ways to implement a
> > "disabled" version of them that userspace can act upon reasonably.
> 
> This would have to be thought through functionally and performance
> wise. I'm not opposed if can come up with a fully fleshed out plan. How
> serious are you in pursuing musl support, if we had something like
> this?

Up til this thread, my position was pretty much "nope" because it
looked like it could not be done in a way compatible with existing
interface requirements.

However, what's been discussed here, contingent on a dynamic-disable
(ideally allowing choice of per-thread or global, to minimize loss of
hardening properties),

Personally, I believe shadow stack has very low hardening value
relative to cost/complexity, and my leaning would be just to ignore
it. However, I also know it becomes marketing pressure, including
pressure on distros that use musl -- "Why doesn't [distro] do shadow
stack?? I thought you were security oriented!!!" -- and if it can be
done in a non-breaking and non-invasive way, I think it's reasonable
to pursue and make something work.

> HJ, any thoughts on whether glibc would use this as well?
> 
> It is probably worth mentioning that from the security side (as Mark
> mentioned there is always tension in the tradeoffs on these features),
> permissive mode is seen by some as something that weakens security too
> much. Apps could call dlopen() on a known unsupported DSO before doing
> ROP. I don't know if you have any musl users with specific shadow stack
> use cases to ask about this.

Yes, this is potentially an argument for something like the option 2,
if there's a way to leave SS enabled but then trap when something goes
wrong, detect if it went wrong via SS-incompatible library code, and
lazily disable SS, otherwise terminate.

But I just realized, I'm not even sure why shared libraries need to be
explicitly SS-compatible. Unless they're doing their own asm stack
switches, shouldn't they just work by default? And since I don't
understand this reason, I also don't understand what the failure mode
is when an incompatible library is loaded, and thus whether it would
be possible to detect and attribute the failure to the library, or
whether the library would induce failure somewhere else.

Anyway, a mechanism to allow the userspace implementation to disable
SS doesn't inherently expose a means to do that. A system integrator
doing maximum hardening might choose to build all libraries as
SS-compatible, or to patch the loader to refuse to load incompatible
libraries.

Rich
  
H.J. Lu Feb. 21, 2024, 8:25 p.m. UTC | #30
On Wed, Feb 21, 2024 at 12:18 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Feb 21, 2024 at 11:22 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> >
> > On Wed, 2024-02-21 at 14:06 -0500, dalias@libc.org wrote:
> > > Due to arbitrarily nestable signal frames, no, this does not suffice.
> > > An interrupted operation using the lock could be arbitrarily delayed,
> > > even never execute again, making any call to dlopen deadlock.
> >
> > Doh! Yep, it is not robust to this. The only thing that could be done
> > would be a timeout in dlopen(). Which would make the whole thing just
> > better than nothing.
> >
> > >
> > > >
> > >
> > > It's fine to turn RDSSP into an actual emulated read of the SSP, or
> > > at
> > > least an emulated load of zero so that uninitialized data is not left
> > > in the target register.
> >
> > We can't intercept RDSSP, but it becomes a NOP by default. (disclaimer
> > x86-only knowledge).
> >
> > >  If doing the latter, code working with the
> > > shadow stack just needs to be prepared for the possibility that it
> > > could be async-disabled, and check the return value.
> > >
> > > I have not looked at all the instructions that become #UD but I
> > > suspect they all have reasonable trivial ways to implement a
> > > "disabled" version of them that userspace can act upon reasonably.
> >
> > This would have to be thought through functionally and performance
> > wise. I'm not opposed if can come up with a fully fleshed out plan. How
> > serious are you in pursuing musl support, if we had something like
> > this?
> >
> > HJ, any thoughts on whether glibc would use this as well?
>
> Assuming that we are talking about permissive mode,  if kernel can
> suppress UD, we don't need to disable SHSTK.   Glibc can enable
> ARCH_SHSTK_SUPPRESS_UD instead.

Kernel must suppress all possible SHSTK UDs.

> > It is probably worth mentioning that from the security side (as Mark
> > mentioned there is always tension in the tradeoffs on these features),
> > permissive mode is seen by some as something that weakens security too
> > much. Apps could call dlopen() on a known unsupported DSO before doing
> > ROP. I don't know if you have any musl users with specific shadow stack
> > use cases to ask about this.
>
>
>
> --
> H.J.
  
H.J. Lu Feb. 21, 2024, 9:12 p.m. UTC | #31
On Wed, Feb 21, 2024 at 12:25 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Feb 21, 2024 at 12:18 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Wed, Feb 21, 2024 at 11:22 AM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > On Wed, 2024-02-21 at 14:06 -0500, dalias@libc.org wrote:
> > > > Due to arbitrarily nestable signal frames, no, this does not suffice.
> > > > An interrupted operation using the lock could be arbitrarily delayed,
> > > > even never execute again, making any call to dlopen deadlock.
> > >
> > > Doh! Yep, it is not robust to this. The only thing that could be done
> > > would be a timeout in dlopen(). Which would make the whole thing just
> > > better than nothing.
> > >
> > > >
> > > > >
> > > >
> > > > It's fine to turn RDSSP into an actual emulated read of the SSP, or
> > > > at
> > > > least an emulated load of zero so that uninitialized data is not left
> > > > in the target register.
> > >
> > > We can't intercept RDSSP, but it becomes a NOP by default. (disclaimer
> > > x86-only knowledge).
> > >
> > > >  If doing the latter, code working with the
> > > > shadow stack just needs to be prepared for the possibility that it
> > > > could be async-disabled, and check the return value.
> > > >
> > > > I have not looked at all the instructions that become #UD but I
> > > > suspect they all have reasonable trivial ways to implement a
> > > > "disabled" version of them that userspace can act upon reasonably.
> > >
> > > This would have to be thought through functionally and performance
> > > wise. I'm not opposed if can come up with a fully fleshed out plan. How
> > > serious are you in pursuing musl support, if we had something like
> > > this?
> > >
> > > HJ, any thoughts on whether glibc would use this as well?
> >
> > Assuming that we are talking about permissive mode,  if kernel can
> > suppress UD, we don't need to disable SHSTK.   Glibc can enable
> > ARCH_SHSTK_SUPPRESS_UD instead.
>
> Kernel must suppress all possible SHSTK UDs.

If SHSTK is disabled by kernel, not by glibc,  there can be 2 issues:

1.  Glibc and kernel may be out of sync on SHSTK.
2.  When kernel disables SHSTK, glibc may be in the middle of reading
shadow stack in longjmp, searching for a restore token.

> > > It is probably worth mentioning that from the security side (as Mark
> > > mentioned there is always tension in the tradeoffs on these features),
> > > permissive mode is seen by some as something that weakens security too
> > > much. Apps could call dlopen() on a known unsupported DSO before doing
> > > ROP. I don't know if you have any musl users with specific shadow stack
> > > use cases to ask about this.
> >
> >
> >
> > --
> > H.J.
>
>
>
> --
> H.J.
  
Mark Brown Feb. 22, 2024, 1:57 p.m. UTC | #32
On Wed, Feb 21, 2024 at 07:22:21PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2024-02-21 at 14:06 -0500, dalias@libc.org wrote:

> > It's fine to turn RDSSP into an actual emulated read of the SSP, or
> > at
> > least an emulated load of zero so that uninitialized data is not left
> > in the target register.

> We can't intercept RDSSP, but it becomes a NOP by default. (disclaimer
> x86-only knowledge).

For arm64 we have a separate control GCSCRE0_EL1.nTR for access to
GCSPR_EL0 (our SSP equivalent) we can use.

> > I have not looked at all the instructions that become #UD but I
> > suspect they all have reasonable trivial ways to implement a
> > "disabled" version of them that userspace can act upon reasonably.

> This would have to be thought through functionally and performance
> wise. I'm not opposed if can come up with a fully fleshed out plan. How
> serious are you in pursuing musl support, if we had something like
> this?

Same here, we have to be careful since it's defining ABI in a way that
we don't normally provide ABI but if there's a clear case for doing it
then...
  
Szabolcs Nagy March 2, 2024, 2:57 p.m. UTC | #33
* Mark Brown <broonie@kernel.org> [2024-02-21 17:36:12 +0000]:

> On Wed, Feb 21, 2024 at 09:58:01AM -0500, dalias@libc.org wrote:
> > On Wed, Feb 21, 2024 at 01:53:10PM +0000, Mark Brown wrote:
> > > On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote:
> > > > On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote:
> 
> > > > > (INCSSP, RSTORSSP, etc). These are a collection of instructions that
> > > > > allow limited control of the SSP. When shadow stack gets disabled,
> > > > > these suddenly turn into #UD generating instructions. So any other
> > > > > threads executing those instructions when shadow stack got disabled
> > > > > would be in for a nasty surprise.
> 
> > > > This is the kernel's problem if that's happening. It should be
> > > > trapping these and returning immediately like a NOP if shadow stack
> > > > has been disabled, not generating SIGILL.
> 
> > > I'm not sure that's going to work out well, all it takes is some code
> > > that's looking at the shadow stack and expecting something to happen as
> > > a result of the instructions it's executing and we run into trouble.  A
> 
> > I said NOP but there's no reason it strictly needs to be a NOP. It
> > could instead do something reasonable to convey the state of racing
> > with shadow stack being disabled.
> 
> This feels like it's getting complicated and I fear it may be an uphill
> struggle to get such code merged, at least for arm64.  My instinct is

the aarch64 behaviour is already nop
for gcs instructions when gcs is disabled.
the isa was designed so async disable is
possible.

only x86 linux would have to emulate this.

> that it's going to be much more robust and generally tractable to let
> things run to some suitable synchronisation point and then disable
> there, but if we're going to do that then userspace can hopefully
> arrange to do the disabling itself through the standard disable
> interface anyway.  Presumably it'll want to notice things being disabled
> at some point anyway?  TBH that's been how all the prior proposals for
> process wide disable I've seen were done.
  
H.J. Lu March 2, 2024, 3:05 p.m. UTC | #34
On Sat, Mar 2, 2024 at 6:57 AM Szabolcs Nagy <nsz@port70.net> wrote:
>
> * Mark Brown <broonie@kernel.org> [2024-02-21 17:36:12 +0000]:
>
> > On Wed, Feb 21, 2024 at 09:58:01AM -0500, dalias@libc.org wrote:
> > > On Wed, Feb 21, 2024 at 01:53:10PM +0000, Mark Brown wrote:
> > > > On Tue, Feb 20, 2024 at 08:27:37PM -0500, dalias@libc.org wrote:
> > > > > On Wed, Feb 21, 2024 at 12:35:48AM +0000, Edgecombe, Rick P wrote:
> >
> > > > > > (INCSSP, RSTORSSP, etc). These are a collection of instructions that
> > > > > > allow limited control of the SSP. When shadow stack gets disabled,
> > > > > > these suddenly turn into #UD generating instructions. So any other
> > > > > > threads executing those instructions when shadow stack got disabled
> > > > > > would be in for a nasty surprise.
> >
> > > > > This is the kernel's problem if that's happening. It should be
> > > > > trapping these and returning immediately like a NOP if shadow stack
> > > > > has been disabled, not generating SIGILL.
> >
> > > > I'm not sure that's going to work out well, all it takes is some code
> > > > that's looking at the shadow stack and expecting something to happen as
> > > > a result of the instructions it's executing and we run into trouble  A
> >
> > > I said NOP but there's no reason it strictly needs to be a NOP. It
> > > could instead do something reasonable to convey the state of racing
> > > with shadow stack being disabled.
> >
> > This feels like it's getting complicated and I fear it may be an uphill
> > struggle to get such code merged, at least for arm64.  My instinct is
>
> the aarch64 behaviour is already nop
> for gcs instructions when gcs is disabled.
> the isa was designed so async disable is
> possible.
>
> only x86 linux would have to emulate this.

On Linux/x86, normal instructions are used to update SSP after
checking SHSTK is enabled.   If SHSTK is disabled in between,
program behavior may be undefined.