[RFC,v4,4/4] mseal:add documentation

Message ID 20240104185138.169307-5-jeffxu@chromium.org
State New
Series Introduce mseal()

Commit Message

Jeff Xu Jan. 4, 2024, 6:51 p.m. UTC
  From: Jeff Xu <jeffxu@chromium.org>

Add documentation for mseal().

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 Documentation/userspace-api/mseal.rst | 181 ++++++++++++++++++++++++++
 1 file changed, 181 insertions(+)
 create mode 100644 Documentation/userspace-api/mseal.rst
  

Comments

Randy Dunlap Jan. 4, 2024, 11:47 p.m. UTC | #1
On 1/4/24 10:51, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
> 
> Add documentation for mseal().
> 
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  Documentation/userspace-api/mseal.rst | 181 ++++++++++++++++++++++++++
>  1 file changed, 181 insertions(+)
>  create mode 100644 Documentation/userspace-api/mseal.rst
> 
> diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
> new file mode 100644
> index 000000000000..1700ce5af218
> --- /dev/null
> +++ b/Documentation/userspace-api/mseal.rst
> @@ -0,0 +1,181 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Introduction of mseal
> +=====================
> +
> +:Author: Jeff Xu <jeffxu@chromium.org>
> +
> +Modern CPUs support memory permissions such as RW and NX bits. The memory
> +permission feature improves security stance on memory corruption bugs, i.e.
> +the attacker can’t just write to arbitrary memory and point the code to it,
> +the memory has to be marked with X bit, or else an exception will happen.
> +
> +Memory sealing additionally protects the mapping itself against
> +modifications. This is useful to mitigate memory corruption issues where a
> +corrupted pointer is passed to a memory management system. For example,
> +such an attacker primitive can break control-flow integrity guarantees
> +since read-only memory that is supposed to be trusted can become writable
> +or .text pages can get remapped. Memory sealing can automatically be
> +applied by the runtime loader to seal .text and .rodata pages and
> +applications can additionally seal security critical data at runtime.
> +
> +A similar feature already exists in the XNU kernel with the
> +VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
> +
> +User API
> +========
> +Two system calls are involved in virtual memory sealing, mseal() and mmap().
> +
> +mseal()
> +-----------
> +The mseal() syscall has following signature:
> +
> +``int mseal(void *addr, size_t len, unsigned long flags)``
> +
> +**addr/len**: virtual memory address range.
> +
> +The address range set by ``addr``/``len`` must meet:
> +   - The start address must be in an allocated VMA.
> +   - The start address must be page aligned.
> +   - The end address (``addr`` + ``len``) must be in an allocated VMA.
> +   - no gap (unallocated memory) between start and end address.
> +
> +The ``len`` will be paged aligned implicitly by the kernel.

Does that mean that the <len> will be extended to be page aligned
if it's not already page aligned?

> +
> +**flags**: reserved for future use.
> +
> +**return values**:
> +
> +- ``0``: Success.
> +
> +- ``-EINVAL``:
> +    - Invalid input ``flags``.
> +    - The start address (``addr``) is not page aligned.
> +    - Address range (``addr`` + ``len``) overflow.
> +
> +- ``-ENOMEM``:
> +    - The start address (``addr``) is not allocated.
> +    - The end address (``addr`` + ``len``) is not allocated.
> +    - A gap (unallocated memory) between start and end address.
> +
> +- ``-EACCES``:
> +    - ``MAP_SEALABLE`` is not set during mmap().
> +
> +- ``-EPERM``:
> +    - sealing is supported only on 64 bit CPUs, 32-bit is not supported.

                                      64-bit

> +
> +- For above error cases, users can expect the given memory range is
> +  unmodified, i.e. no partial update.
> +
> +- There might be other internal errors/cases not listed here, e.g.
> +  error during merging/splitting VMAs, or the process reaching the max
> +  number of supported VMAs. In those cases, partial updates to the given
> +  memory range could happen. However, those cases shall be rare.
> +
> +**Blocked operations after sealing**:
> +    Unmapping, moving to another location, and shrinking the size,
> +    via munmap() and mremap(), can leave an empty space, therefore
> +    can be replaced with a VMA with a new set of attributes.
> +
> +    Moving or expanding a different VMA into the current location,
> +    via mremap().
> +
> +    Modifying a VMA via mmap(MAP_FIXED).
> +
> +    Size expansion, via mremap(), does not appear to pose any
> +    specific risks to sealed VMAs. It is included anyway because
> +    the use case is unclear. In any case, users can rely on
> +    merging to expand a sealed VMA.
> +
> +    mprotect() and pkey_mprotect().
> +
> +    Some destructive madvice() behaviors (e.g. MADV_DONTNEED)
> +    for anonymous memory, when users don't have write permission to the
> +    memory. Those behaviors can alter region contents by discarding pages,
> +    effectively a memset(0) for anonymous memory.
> +
> +**Note**:
> +
> +- mseal() only works on 64-bit CPUs, not 32-bit CPU.
> +
> +- users can call mseal() multiple times, mseal() on an already sealed memory

                                     times;

> +  is a no-action (not error).
> +
> +- munseal() is not supported.
> +
> +mmap()
> +----------
> +``void *mmap(void* addr, size_t length, int prot, int flags, int fd,
> +off_t offset);``
> +
> +We add two changes in ``prot`` and ``flags`` of  mmap() related to
> +memory sealing.
> +
> +**prot**
> +
> +The ``PROT_SEAL`` bit in ``prot`` field of mmap().
> +
> +When present, it marks the memory is sealed since creation.
> +
> +This is useful as optimization because it avoids having to make two
> +system calls: one for mmap() and one for mseal().
> +
> +It's worth noting that even though the sealing is set via the
> +``prot`` field in mmap(), it can't be set in the ``prot``
> +field in later mprotect(). This is unlike the ``PROT_READ``,
> +``PROT_WRITE``, ``PROT_EXEC`` bits, e.g. if ``PROT_WRITE`` is not set in
> +mprotect(), it means that the region is not writable.
> +
> +Setting ``PROT_SEAL`` implies setting ``MAP_SEALABLE`` below.
> +
> +**flags**
> +
> +The ``MAP_SEALABLE`` bit in the ``flags`` field of mmap().
> +
> +When present, it marks the map as sealable. A map created
> +without ``MAP_SEALABLE`` will not support sealing; In other words,
> +mseal() will fail for such a map.
> +
> +
> +Applications that don't care about sealing will expect their
> +behavior unchanged. For those that need sealing support, opt-in
> +by adding ``MAP_SEALABLE`` in mmap().
> +
> +Note: for a map created without ``MAP_SEALABLE`` or a map created
> +with ``MAP_SEALABLE`` but not sealed yet, mmap(MAP_FIXED) can
> +change the sealable or sealing bit.
> +
> +Use Case:
> +=========
> +- glibc:
> +  The dynamic linker, during loading ELF executables, can apply sealing to

                         during loading of
or
                         while loading

> +  non-writable memory segments.
> +
> +- Chrome browser: protect some security sensitive data-structures.

                                  security-sensitive data structures.

> +
> +Additional notes:
> +=================
> +As Jann Horn pointed out in [3], there are still a few ways to write
> +to RO memory, which is, in a way, by design. Those cases are not covered
> +by mseal(). If applications want to block such cases, sandbox tools (such as
> +seccomp, LSM, etc) might be considered.
> +
> +Those cases are:
> +
> +- Write to read-only memory through /proc/self/mem interface.
> +- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> +- userfaultfd.
> +
> +The idea that inspired this patch comes from Stephen Röttger’s work in V8
> +CFI [4]. Chrome browser in ChromeOS will be the first user of this API.
> +
> +Reference:
> +==========
> +[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
> +
> +[2] https://man.openbsd.org/mimmutable.2
> +
> +[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
> +
> +[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
  
Jeff Xu Jan. 5, 2024, 7:37 p.m. UTC | #2
On Thu, Jan 4, 2024 at 3:47 PM Randy Dunlap <rdunlap@infradead.org> wrote:
>
>
>
> On 1/4/24 10:51, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Add documentation for mseal().
> >
> > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > ---
> >  Documentation/userspace-api/mseal.rst | 181 ++++++++++++++++++++++++++
> >  1 file changed, 181 insertions(+)
> >  create mode 100644 Documentation/userspace-api/mseal.rst
> >
> > diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
> > new file mode 100644
> > index 000000000000..1700ce5af218
> > --- /dev/null
> > +++ b/Documentation/userspace-api/mseal.rst
> > @@ -0,0 +1,181 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=====================
> > +Introduction of mseal
> > +=====================
> > +
> > +:Author: Jeff Xu <jeffxu@chromium.org>
> > +
> > +Modern CPUs support memory permissions such as RW and NX bits. The memory
> > +permission feature improves security stance on memory corruption bugs, i.e.
> > +the attacker can’t just write to arbitrary memory and point the code to it,
> > +the memory has to be marked with X bit, or else an exception will happen.
> > +
> > +Memory sealing additionally protects the mapping itself against
> > +modifications. This is useful to mitigate memory corruption issues where a
> > +corrupted pointer is passed to a memory management system. For example,
> > +such an attacker primitive can break control-flow integrity guarantees
> > +since read-only memory that is supposed to be trusted can become writable
> > +or .text pages can get remapped. Memory sealing can automatically be
> > +applied by the runtime loader to seal .text and .rodata pages and
> > +applications can additionally seal security critical data at runtime.
> > +
> > +A similar feature already exists in the XNU kernel with the
> > +VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
> > +
> > +User API
> > +========
> > +Two system calls are involved in virtual memory sealing, mseal() and mmap().
> > +
> > +mseal()
> > +-----------
> > +The mseal() syscall has following signature:
> > +
> > +``int mseal(void *addr, size_t len, unsigned long flags)``
> > +
> > +**addr/len**: virtual memory address range.
> > +
> > +The address range set by ``addr``/``len`` must meet:
> > +   - The start address must be in an allocated VMA.
> > +   - The start address must be page aligned.
> > +   - The end address (``addr`` + ``len``) must be in an allocated VMA.
> > +   - no gap (unallocated memory) between start and end address.
> > +
> > +The ``len`` will be paged aligned implicitly by the kernel.
>
> Does that mean that the <len> will be extended to be page aligned
> if it's not already page aligned?
>
Yes.
The code (do_mseal) calls PAGE_ALIGN(len), which rounds len up to a page boundary.
mprotect() does the same.

Two test cases cover this part.
test_seal_mprotect_unalign_len
test_seal_mprotect_unalign_len_variant_2
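
The round-up of len described above can be sketched in userspace roughly as
follows. This is only an illustration: page_align_up() is a made-up helper
name, not a kernel or libc interface, and it assumes the page size is a
power of two.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper for illustration only (not a kernel or libc API).
 * Rounds len up to the next multiple of page_size, which is the effect
 * the kernel applies to the len argument. page_size must be a non-zero
 * power of two.
 */
static size_t page_align_up(size_t len, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (len + page_size - 1) & ~(page_size - 1);
}
```

With 4096-byte pages, for example, a len of 4097 covers 8192 bytes, while a
len that is already a multiple of 4096 is left unchanged.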

-Jeff

> --
> #Randy
  
Jonathan Corbet Jan. 16, 2024, 8:13 p.m. UTC | #3
jeffxu@chromium.org writes:

> From: Jeff Xu <jeffxu@chromium.org>
>
> Add documentation for mseal().
>
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  Documentation/userspace-api/mseal.rst | 181 ++++++++++++++++++++++++++
>  1 file changed, 181 insertions(+)
>  create mode 100644 Documentation/userspace-api/mseal.rst

You need to add this file to index.rst or it won't be part of the docs
build.  Sphinx should have warned you about that when you did your test
build.

Thanks,

jon
  
Randy Dunlap Jan. 16, 2024, 10:19 p.m. UTC | #4
On 1/16/24 12:13, Jonathan Corbet wrote:
> jeffxu@chromium.org writes:
> 
>> From: Jeff Xu <jeffxu@chromium.org>
>>
>> Add documentation for mseal().
>>
>> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
>> ---
>>  Documentation/userspace-api/mseal.rst | 181 ++++++++++++++++++++++++++
>>  1 file changed, 181 insertions(+)
>>  create mode 100644 Documentation/userspace-api/mseal.rst
> 
> You need to add this file to index.rst or it won't be part of the docs
> build.  Sphinx should have warned you about that when you did your test
> build.

Yes, I have already asked Jeff to add this to his patch:

diff -- a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -23,6 +23,7 @@ place where this information is gathered
    ebpf/index
    ELF
    ioctl/index
+   mseal
    iommu
    iommufd
    media/index
  

Patch

diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
new file mode 100644
index 000000000000..1700ce5af218
--- /dev/null
+++ b/Documentation/userspace-api/mseal.rst
@@ -0,0 +1,181 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Introduction of mseal
+=====================
+
+:Author: Jeff Xu <jeffxu@chromium.org>
+
+Modern CPUs support memory permissions such as the RW and NX bits. Memory
+permissions improve the security stance against memory corruption bugs: an
+attacker can't simply write to arbitrary memory and point the code at it,
+because the memory must be marked with the X bit or an exception is raised.
+
+Memory sealing additionally protects the mapping itself against
+modifications. This is useful to mitigate memory corruption issues where a
+corrupted pointer is passed to a memory management system. For example,
+such an attacker primitive can break control-flow integrity guarantees
+since read-only memory that is supposed to be trusted can become writable
+or .text pages can get remapped. Memory sealing can automatically be
+applied by the runtime loader to seal .text and .rodata pages, and
+applications can additionally seal security-critical data at runtime.
+
+A similar feature already exists in the XNU kernel with the
+VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
+
+User API
+========
+Two system calls are involved in virtual memory sealing: mseal() and mmap().
+
+mseal()
+-----------
+The mseal() syscall has the following signature:
+
+``int mseal(void *addr, size_t len, unsigned long flags)``
+
+**addr/len**: virtual memory address range.
+
+The address range set by ``addr``/``len`` must meet these requirements:
+   - The start address must be in an allocated VMA.
+   - The start address must be page aligned.
+   - The end address (``addr`` + ``len``) must be in an allocated VMA.
+   - There must be no gap (unallocated memory) between start and end address.
+
+The kernel implicitly rounds ``len`` up to a page boundary.
+
+**flags**: reserved for future use.
+
+**return values**:
+
+- ``0``: Success.
+
+- ``-EINVAL``:
+    - Invalid input ``flags``.
+    - The start address (``addr``) is not page aligned.
+    - Address range (``addr`` + ``len``) overflow.
+
+- ``-ENOMEM``:
+    - The start address (``addr``) is not allocated.
+    - The end address (``addr`` + ``len``) is not allocated.
+    - A gap (unallocated memory) between start and end address.
+
+- ``-EACCES``:
+    - ``MAP_SEALABLE`` is not set during mmap().
+
+- ``-EPERM``:
+    - Sealing is supported only on 64-bit CPUs; 32-bit is not supported.
+
+- For the above error cases, users can expect the given memory range to be
+  unmodified, i.e. no partial update.
+
+- There might be other internal errors/cases not listed here, e.g.
+  an error during merging/splitting VMAs, or the process reaching the max
+  number of supported VMAs. In those cases, partial updates to the given
+  memory range could happen. However, those cases should be rare.
+
+**Blocked operations after sealing**:
+    Unmapping, moving to another location, and shrinking the size,
+    via munmap() and mremap(). These can leave an empty space that
+    could then be filled by a VMA with a new set of attributes.
+
+    Moving or expanding a different VMA into the current location,
+    via mremap().
+
+    Modifying a VMA via mmap(MAP_FIXED).
+
+    Size expansion, via mremap(), does not appear to pose any
+    specific risks to sealed VMAs. It is included anyway because
+    the use case is unclear. In any case, users can rely on
+    merging to expand a sealed VMA.
+
+    mprotect() and pkey_mprotect().
+
+    Some destructive madvise() behaviors (e.g. MADV_DONTNEED)
+    for anonymous memory, when users don't have write permission to the
+    memory. Those behaviors can alter region contents by discarding pages,
+    effectively a memset(0) for anonymous memory.
+
+**Note**:
+
+- mseal() only works on 64-bit CPUs, not 32-bit CPUs.
+
+- Users can call mseal() multiple times; calling mseal() on already sealed
+  memory is a no-op (not an error).
+
+- munseal() is not supported.
+
+mmap()
+----------
+``void *mmap(void *addr, size_t length, int prot, int flags, int fd,
+off_t offset);``
+
+We add two changes to the ``prot`` and ``flags`` fields of mmap() related to
+memory sealing.
+
+**prot**
+
+The ``PROT_SEAL`` bit in the ``prot`` field of mmap().
+
+When present, it marks the memory as sealed from creation.
+
+This is useful as an optimization because it avoids having to make two
+system calls: one for mmap() and one for mseal().
+
+It's worth noting that even though sealing is set via the
+``prot`` field in mmap(), it can't be set in the ``prot``
+field of a later mprotect(). This is unlike the ``PROT_READ``,
+``PROT_WRITE`` and ``PROT_EXEC`` bits; e.g. if ``PROT_WRITE`` is not set in
+mprotect(), the region is not writable.
+
+Setting ``PROT_SEAL`` implies setting ``MAP_SEALABLE`` below.
+
+**flags**
+
+The ``MAP_SEALABLE`` bit in the ``flags`` field of mmap().
+
+When present, it marks the map as sealable. A map created
+without ``MAP_SEALABLE`` will not support sealing; in other words,
+mseal() will fail for such a map.
+
+
+Applications that don't care about sealing can expect their
+behavior to be unchanged. Those that need sealing support opt in
+by adding ``MAP_SEALABLE`` in mmap().
+
+Note: for a map created without ``MAP_SEALABLE`` or a map created
+with ``MAP_SEALABLE`` but not sealed yet, mmap(MAP_FIXED) can
+change the sealable or sealing bit.
+
+Use Case:
+=========
+- glibc:
+  The dynamic linker, while loading ELF executables, can apply sealing to
+  non-writable memory segments.
+
+- Chrome browser: protect some security-sensitive data structures.
+
+Additional notes:
+=================
+As Jann Horn pointed out in [3], there are still a few ways to write
+to RO memory, which is, in a way, by design. Those cases are not covered
+by mseal(). If applications want to block such cases, sandbox tools (such as
+seccomp, LSM, etc.) might be considered.
+
+Those cases are:
+
+- Write to read-only memory through /proc/self/mem interface.
+- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
+- userfaultfd.
+
+The idea that inspired this patch comes from Stephen Röttger’s work in V8
+CFI [4]. Chrome browser in ChromeOS will be the first user of this API.
+
+Reference:
+==========
+[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
+
+[2] https://man.openbsd.org/mimmutable.2
+
+[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
+
+[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc