[RFC,v3,11/11] mseal:add documentation

Message ID 20231212231706.2680890-12-jeffxu@chromium.org
State New
Headers
Series Introduce mseal() |

Commit Message

Jeff Xu Dec. 12, 2023, 11:17 p.m. UTC
  From: Jeff Xu <jeffxu@chromium.org>

Add documentation for mseal().

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 Documentation/userspace-api/mseal.rst | 189 ++++++++++++++++++++++++++
 1 file changed, 189 insertions(+)
 create mode 100644 Documentation/userspace-api/mseal.rst
  

Comments

Linus Torvalds Dec. 13, 2023, 12:39 a.m. UTC | #1
On Tue, 12 Dec 2023 at 15:17, <jeffxu@chromium.org> wrote:
> +
> +**types**: bit mask to specify the sealing types, they are:

I really want a real-life use-case for more than one bit of "don't modify".

IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?

Or when would you *ever* say "seal this area, but mprotect()" is ok.

IOW, I want to know why we don't just do the BSD immutable thing, and
why we need this multi-level sealing thing.

               Linus
  
Jeff Xu Dec. 14, 2023, 12:35 a.m. UTC | #2
On Tue, Dec 12, 2023 at 4:39 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Dec 2023 at 15:17, <jeffxu@chromium.org> wrote:
> > +
> > +**types**: bit mask to specify the sealing types, they are:
>
> I really want a real-life use-case for more than one bit of "don't modify".
>
For the real-life use case question, Stephen Röttger and I put
description in the cover letter as well as the open discussion section
(mseal() vs immutable()) of patch 0/11.  Perhaps you are looking for more
details in chrome usage of the API, e.g. code-wise ?

> IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
>
The MADV_DONTNEED is OK for file-backed mapping.
As state in man page of madvise: [1]

"subsequent accesses of pages in the range will succeed,  but will
result in either repopulating the memory contents from the up-to-date
contents of the underlying mapped file"

> Or when would you *ever* say "seal this area, but mprotect()" is ok.
>
The fact  that openBSD allows RW=>RO transaction, as in its man page [2]

 "  At present, mprotect(2) may reduce permissions on immutable pages
  marked PROT_READ | PROT_WRITE to the less permissive PROT_READ."

suggests application might desire multiple ways to seal the "PROT" bits.

E.g.
Applications that wants a full lockdown of PROT and PKEY might use
SEAL_PROT_PKEY (Chrome case and implemented in this patch)

Application that desires RW=>RO transaction, might implement
SEAL_PROT_DOWNGRADEABLE, or specifically allow RW=>RO.
(not implemented but can be added in future as extension if  needed.)

> IOW, I want to know why we don't just do the BSD immutable thing, and
> why we need this multi-level sealing thing.
>
The details are discussed in mseal() vs immutable()) of the cover letter
(patch 0/11)

In short, BSD's immutable is designed specific for libc case, and Chrome
case is just different (e.g. the lifetime of those mappings and requirement of
free/discard unused memory).

Single bit vs multi-bits are still up for discussion.
If there are strong opinions on the multiple-bits approach, (and
no objection on applying MM_SEAL_DISCARD_RO_ANON to the .text part
during libc dynamic loading, which has no effect anyway because it is
file backed.), we could combine all three bits into one. A side note is that we
could not add something such as SEAL_PROT_DOWNGRADEABLE later,
since pkey_mprotect is sealed.

I'm open to one bit approach. If we took that approach,
We might consider the following:

mseal() or
mseal(flags), flags are reserved for future use.

I appreciate a direction on this.

 [1] https://man7.org/linux/man-pages/man2/madvise.2.html
 [2] https://man.openbsd.org/mimmutable.2

-Jeff



>                Linus
  
Theo de Raadt Dec. 14, 2023, 1:09 a.m. UTC | #3
Jeff Xu <jeffxu@google.com> wrote:

> > Or when would you *ever* say "seal this area, but mprotect()" is ok.
> >
> The fact  that openBSD allows RW=>RO transaction, as in its man page [2]
> 
>  "  At present, mprotect(2) may reduce permissions on immutable pages
>   marked PROT_READ | PROT_WRITE to the less permissive PROT_READ."

Let me explain this.

We encountered two places that needed this less-permission-transition.

Both of these problems were found in either .data or bss, which the
kernel makes immutable by default.  The OpenBSD kernel makes those
regions immutable BY DEFAULT, and there is no way to turn that off.

One was in our libc malloc, which after initialization, wants to protect
a control data structure from being written in the future.

The other was in chrome v8, for the v8_flags variable, this is
similarily mprotected to less permission after initialization to avoid
tampering (because it's an amazing relative-address located control
gadget).

We introduced a different mechanism to solve these problem.

So we added a new ELF section which annotates objects you need to be
MUTABLE.  If these are .data or .bss, they are placed in the MUTABLE
region annotated with the following Program Header:

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  OPENBSD_MUTABLE 0x0e9000 0x00000000000ec000 0x00000000000ec000 0x001000 0x001000 RW  0x1000

associated with this Section Header

  [20] .openbsd.mutable  PROGBITS        00000000000ec000 0e9000 001000 00  WA  0   0 4096

(It is vaguely similar to RELRO).

You can place objects there using the a compiler __attribute__((section
declaration, like this example from our libc/malloc.c code

static union {
        struct malloc_readonly mopts;
        u_char _pad[MALLOC_PAGESIZE];
} malloc_readonly __attribute__((aligned(MALLOC_PAGESIZE)))
                __attribute__((section(".openbsd.mutable")));

During startup the code can set the protection and then the immutability
of the object correctly.

Since we have no purpose left for this permission reduction semantic
upon immutable mappings, we may be deleting that behaviour in the
future.  I wrote that code, because I needed it to make progress with some
difficult pieces of code.  But we found a better way.
  
Linus Torvalds Dec. 14, 2023, 1:31 a.m. UTC | #4
On Wed, 13 Dec 2023 at 16:36, Jeff Xu <jeffxu@google.com> wrote:
>
>
> > IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
> >
> The MADV_DONTNEED is OK for file-backed mapping.

Right. It makes no semantic difference. So there's no point to it.

My point was that you added this magic flag for "not ok for RO anon mapping".

It's such a *completely* random flag, that I go "that's just crazy
random - make sealing _always_ disallow that case".

So what I object to in this series is basically random small details
that should just eb part of the basic act of sealing.

I think sealing should just mean "you can't do any operations that
have semantic meaning for the mapping, because it is SEALED".

So I think sealing should automatically mean "can't do MADV_DONTNEED
on anon memory", because that's basically equivalent to a munmap/remap
operation.

I also think that sealing should just automatically mean "can't do
mprotect any more".

And yes, the OpenBSD semantics of "immutable" apparently allowed
reducing permissions, but even the openbsd man-page seems to think
that was a bug, so we should just not allow it. And the openbsd case
seems to be because of how they made certain things immutable by
default, which is different from what this mseal() thing is.

End result: I'd really like to make the thing conceptually simpler,
rather than add all those random (*very* random in case of
MADV_DONTNEED) special cases.

Is there any actual practical example of why you'd want a half-sealed thing?

And no, I didn't read the pdf that was attached. If it can't just be
explained in plain language, it's not an explanation.

I'd love for "sealed" to be just a single bit in the vm_flags things
that we already have. Not a config option. Not some complicated thing
that is hard to explain. A simple "I have set up this mapping, you
can't change it any more".

And if it cannot be that kind of thing, I want to have clear and
obvious examples of why it can't be that simple thing.

Not a pdf file that describes some google-chrome design. Something
down-to-earth and practical (and not a "we might want this in the
future" thing either).

IOW, what is wrong with "THIS VMA SETUP CANNOT BE CHANGED ANY MORE"?

Nothing less, but also nothing more. No random odd bits that need explaining.

              Linus
  
Theo de Raadt Dec. 14, 2023, 3:04 p.m. UTC | #5
Jeff Xu <jeffxu@google.com> wrote:

> In short, BSD's immutable is designed specific for libc case, and Chrome
> case is just different (e.g. the lifetime of those mappings and requirement of
> free/discard unused memory).

That is not true.  During the mimmutable design I took the entire
software ecosystem into consideration.  Not just libc.  That is either
uncharitable or uninformed.

In OpenBSD, pretty much the only thing which calls mimmutable() is the
shared library linker, which does so on all possible regions of all DSO
objects, not just libc.

For example, chrome loads 96 libraries, and all their text/data/bss/etc
are immutable. All the static address space is immutable.  It's the same
for all other programs running in OpenBSD -- only transient heap and
mmap spaces remain permission mutable.

It is not just libc.

What you are trying to do here with chrome is bring some sort of
soft-immutable management to regions of memory, so that trusted parts of
chrome can still change the permissions, but untrusted / gadgetry parts
of chrome cannot change the permissions.  That's a very different thing
than what I set out to do with mimmutable().  I'm not aware of any other
piece of software that needs this.  I still can't wrap my head around
the assurance model of the design. 

Maybe it is time to stop comparing mseal() to mimmutable().

Also, maybe this proposal should be using the name chromesyscall()
instead -- then it could be extended indefinitely in the future...
  
Stephen Röttger Dec. 14, 2023, 6:06 p.m. UTC | #6
On Thu, Dec 14, 2023 at 2:31 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 13 Dec 2023 at 16:36, Jeff Xu <jeffxu@google.com> wrote:
> >
> >
> > > IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
> > >
> > The MADV_DONTNEED is OK for file-backed mapping.
>
> Right. It makes no semantic difference. So there's no point to it.
>
> My point was that you added this magic flag for "not ok for RO anon mapping".
>
> It's such a *completely* random flag, that I go "that's just crazy
> random - make sealing _always_ disallow that case".
>
> So what I object to in this series is basically random small details
> that should just eb part of the basic act of sealing.
>
> I think sealing should just mean "you can't do any operations that
> have semantic meaning for the mapping, because it is SEALED".
>
> So I think sealing should automatically mean "can't do MADV_DONTNEED
> on anon memory", because that's basically equivalent to a munmap/remap
> operation.

In Chrome, we have a use case to allow MADV_DONTNEED on sealed memory.
We have a pkey-tagged heap and code region for JIT code. The regions are
writable by page permissions, but we use the pkey to control write access.
These regions are mmapped at process startup and we want to seal them to ensure
that the pkey and page permissions can't change.
Since these regions are used for dynamic allocations, we still need a way to
release unneeded resources, i.e. madvise(DONTNEED) unused pages on free().

AIUI, the madvise(DONTNEED) should effectively only change the content of
anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we
added this special case: if you want to madvise(DONTNEED) an anonymous page,
you should have write permissions to the page.

In our allocator, on free we can then release resources via:
* allow pkey writes
* madvise(DONTNEED)
* disallow pkey writes
  
Pedro Falcato Dec. 14, 2023, 8:11 p.m. UTC | #7
On Thu, Dec 14, 2023 at 6:07 PM Stephen Röttger <sroettger@google.com> wrote:
>
> On Thu, Dec 14, 2023 at 2:31 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Wed, 13 Dec 2023 at 16:36, Jeff Xu <jeffxu@google.com> wrote:
> > >
> > >
> > > > IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
> > > >
> > > The MADV_DONTNEED is OK for file-backed mapping.
> >
> > Right. It makes no semantic difference. So there's no point to it.
> >
> > My point was that you added this magic flag for "not ok for RO anon mapping".
> >
> > It's such a *completely* random flag, that I go "that's just crazy
> > random - make sealing _always_ disallow that case".
> >
> > So what I object to in this series is basically random small details
> > that should just eb part of the basic act of sealing.
> >
> > I think sealing should just mean "you can't do any operations that
> > have semantic meaning for the mapping, because it is SEALED".
> >
> > So I think sealing should automatically mean "can't do MADV_DONTNEED
> > on anon memory", because that's basically equivalent to a munmap/remap
> > operation.
>
> In Chrome, we have a use case to allow MADV_DONTNEED on sealed memory.

I don't want to be that guy (*believe me*), but what if there was a
way to attach BPF programs to mm's? Such that you could handle 'seal
failures' in BPF, and thus allow for this sort of weird semantics?
e.g: madvise(MADV_DONTNEED) on a sealed region fails, kernel invokes
the BPF program (that chrome loaded), BPF program sees it was a
MADV_DONTNEED and allows it to proceed.

It requires BPF but sounds like a good compromise in order to not get
an ugly API?
  
Linus Torvalds Dec. 14, 2023, 8:14 p.m. UTC | #8
On Thu, 14 Dec 2023 at 10:07, Stephen Röttger <sroettger@google.com> wrote:
>
> AIUI, the madvise(DONTNEED) should effectively only change the content of
> anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we
> added this special case: if you want to madvise(DONTNEED) an anonymous page,
> you should have write permissions to the page.

Hmm. I actually would be happier if we just made that change in
general. Maybe even without sealing, but I agree that it *definitely*
makes sense in general as a sealing thing.

IOW, just saying

 "madvise(DONTNEED) needs write permissions to an anonymous mapping when sealed"

makes 100% sense to me. Having a separate _flag_ to give sensible
semantics is just odd.

IOW, what I really want is exactly that "sensible semantics, not random flags".

Particularly for new system calls with fairly specialized use, I think
it's very important that the semantics are sensible on a conceptual
level, and that we do not add system calls that are based on "random
implementation issue of the day".

Yes, yes, then as we have to maintain things long-term, and we hit
some compatibility issue, at *THAT* point we'll end up facing nasty
"we had an implementation that had these semantics in practice, so now
we're stuck with it", but when introducing a new system call, let's
try really hard to start off from those kinds of random things.

Wouldn't it be lovely if we can just come up with a sane set of "this
is what it means to seal a vma", and enumerate those, and make those
sane conceptual rules be the initial definition. By all means have a
"flags" argument for future cases when we figure out there was
something wrong or the notion needed to be extended, but if we already
*start* with random extensions, I feel there's something wrong with
the whole concept.

So I would really wish for the first version of

     mseal(start, len, flags);

to have "flags=0" be the one and only case we actually handle
initially, and only add a single PROT_SEAL flag to mmap() that says
"create this mapping already pre-sealed".

Strive very hard to make sealing be a single VM_SEALED flag in the
vma->vm_flags that we already have, just admit that none of this
matters on 32-bit architectures, so that VM_SEALED can just use one of
the high flags that we have several free of (and that pkeys already
depends on), and make this a standard feature with no #ifdef's.

Can chrome live with that? And what would the required semantics be?
I'll start the list:

 - you can't unmap or remap in any way (including over-mapping)

 - you can't change protections (but with architecture support like
pkey, you can obviously change the protections indirectly with PKRU
etc)

 - you can't do VM operations that change data without the area being
writable (so the DONTNEED case - maybe there are others)

 - anything else?

Wouldn't it be lovely to have just a single notion of sealing that is
well-documented and makes sense, and doesn't require people to worry
about odd special cases?

And yes, we'd have the 'flags' argument for future special cases, and
hope really hard that it's never needed.

           Linus
  
Jeff Xu Dec. 14, 2023, 10:52 p.m. UTC | #9
On Thu, Dec 14, 2023 at 12:14 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 14 Dec 2023 at 10:07, Stephen Röttger <sroettger@google.com> wrote:
> >
> > AIUI, the madvise(DONTNEED) should effectively only change the content of
> > anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we
> > added this special case: if you want to madvise(DONTNEED) an anonymous page,
> > you should have write permissions to the page.
>
> Hmm. I actually would be happier if we just made that change in
> general. Maybe even without sealing, but I agree that it *definitely*
> makes sense in general as a sealing thing.
>
> IOW, just saying
>
>  "madvise(DONTNEED) needs write permissions to an anonymous mapping when sealed"
>
> makes 100% sense to me. Having a separate _flag_ to give sensible
> semantics is just odd.
>
> IOW, what I really want is exactly that "sensible semantics, not random flags".
>
> Particularly for new system calls with fairly specialized use, I think
> it's very important that the semantics are sensible on a conceptual
> level, and that we do not add system calls that are based on "random
> implementation issue of the day".
>
> Yes, yes, then as we have to maintain things long-term, and we hit
> some compatibility issue, at *THAT* point we'll end up facing nasty
> "we had an implementation that had these semantics in practice, so now
> we're stuck with it", but when introducing a new system call, let's
> try really hard to start off from those kinds of random things.
>
> Wouldn't it be lovely if we can just come up with a sane set of "this
> is what it means to seal a vma", and enumerate those, and make those
> sane conceptual rules be the initial definition. By all means have a
> "flags" argument for future cases when we figure out there was
> something wrong or the notion needed to be extended, but if we already
> *start* with random extensions, I feel there's something wrong with
> the whole concept.
>
> So I would really wish for the first version of
>
>      mseal(start, len, flags);
>
> to have "flags=0" be the one and only case we actually handle
> initially, and only add a single PROT_SEAL flag to mmap() that says
> "create this mapping already pre-sealed".
>
> Strive very hard to make sealing be a single VM_SEALED flag in the
> vma->vm_flags that we already have, just admit that none of this
> matters on 32-bit architectures, so that VM_SEALED can just use one of
> the high flags that we have several free of (and that pkeys already
> depends on), and make this a standard feature with no #ifdef's.
>
> Can chrome live with that? And what would the required semantics be?
> I'll start the list:
>
>  - you can't unmap or remap in any way (including over-mapping)
>
>  - you can't change protections (but with architecture support like
> pkey, you can obviously change the protections indirectly with PKRU
> etc)
>
>  - you can't do VM operations that change data without the area being
> writable (so the DONTNEED case - maybe there are others)
>
>  - anything else?
>
> Wouldn't it be lovely to have just a single notion of sealing that is
> well-documented and makes sense, and doesn't require people to worry
> about odd special cases?
>
> And yes, we'd have the 'flags' argument for future special cases, and
> hope really hard that it's never needed.
>

Yes, those inputs make a lot of sense !
I will start the next version. In the meantime, if there are more
comments, please continue to post, I appreciate those to make the API
better and simple to use.

-Jeff

>            Linus
>
  
Theo de Raadt Jan. 20, 2024, 3:23 p.m. UTC | #10
Some notes about compatibility between mimmutable(2) and mseal().

This morning, the "RW -> R demotion" code in mimmutable(2) was removed.
As described previously, that was a development crutch to solved a problem
but we found a better way with a new ELF section which is available at
compile time with __attribute__((section(".openbsd.mutable"))).   Which
works great.

I am syncronizing the madvise / msync behaviour further, we will be compatible.
I have worried about madvise / msync for a long time, and audited vast amounts
of the software ecosystem to come to a conclusion we can be more strict, but
I never acted upon it.

BTW, on OpenBSD and probably other related BSD operating systems,
MADV_DONTNEED is non-destructive.  However we have a destructive
operation called MADV_FREE.  msync() MS_INVALIDATE is also destructive.
But all of these operations will now be prohibited, to syncronize the
error return value situation.

There is an one large difference remainig between mimmutable() and mseal(),
which is how other system calls behave.

We return EPERM for failures in all the system calls that fail upon
immutable memory (since Oct 2022).

You are returning EACESS.

Before it is too late, do you want to reconsider that return value, or
do you have a justification for the choice?

I think this remains the blocker which would prevent software from doing

#define mimmutable(addr, len)  mseal(addr, len, 0)
  
Linus Torvalds Jan. 20, 2024, 4:40 p.m. UTC | #11
On Sat, 20 Jan 2024 at 07:23, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> There is an one large difference remainig between mimmutable() and mseal(),
> which is how other system calls behave.
>
> We return EPERM for failures in all the system calls that fail upon
> immutable memory (since Oct 2022).
>
> You are returning EACESS.
>
> Before it is too late, do you want to reconsider that return value, or
> do you have a justification for the choice?

I don't think there's any real reason for the difference.

Jeff - mind changing the EACESS to EPERM, and we'll have something
that is more-or-less compatible between Linux and OpenBSD?

             Linus
  
Theo de Raadt Jan. 20, 2024, 4:59 p.m. UTC | #12
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 20 Jan 2024 at 07:23, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > There is an one large difference remainig between mimmutable() and mseal(),
> > which is how other system calls behave.
> >
> > We return EPERM for failures in all the system calls that fail upon
> > immutable memory (since Oct 2022).
> >
> > You are returning EACESS.
> >
> > Before it is too late, do you want to reconsider that return value, or
> > do you have a justification for the choice?
> 
> I don't think there's any real reason for the difference.
> 
> Jeff - mind changing the EACESS to EPERM, and we'll have something
> that is more-or-less compatible between Linux and OpenBSD?

(I tried to remember why I chose EPERM, replaying the view from the
German castle during kernel compiles...)

In mmap, EACCESS already means something.

     [EACCES]           The flag PROT_READ was specified as part of the prot
                        parameter and fd was not open for reading.  The flags
                        MAP_SHARED and PROT_WRITE were specified as part of
                        the flags and prot parameters and fd was not open for
                        writing.

In mprotect, the situation is similar

     [EACCES]           The process does not have sufficient access to the
                        underlying memory object to provide the requested
                        protection.

immutable isn't an aspect of the underlying object, but an aspect of the
mapping.

Anyways, it is common for one errno value to have multiple causes.

But this error-aliasing can make it harder to figure things out when
studying a "system call trace" a program, and I strongly believe in
keeping systems are simple as possible.

For all the memory mapping control operations, EPERM was available and
unambiguous.
  
Jeff Xu Jan. 21, 2024, 12:16 a.m. UTC | #13
On Sat, Jan 20, 2024 at 8:40 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, 20 Jan 2024 at 07:23, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > There is an one large difference remainig between mimmutable() and mseal(),
> > which is how other system calls behave.
> >
> > We return EPERM for failures in all the system calls that fail upon
> > immutable memory (since Oct 2022).
> >
> > You are returning EACESS.
> >
> > Before it is too late, do you want to reconsider that return value, or
> > do you have a justification for the choice?
>
> I don't think there's any real reason for the difference.
>
> Jeff - mind changing the EACESS to EPERM, and we'll have something
> that is more-or-less compatible between Linux and OpenBSD?
>
Sounds Good. I will make the necessary changes in the next version.

-Jeff

>              Linus
  
Theo de Raadt Jan. 21, 2024, 12:43 a.m. UTC | #14
Jeff Xu <jeffxu@chromium.org> wrote:

> > Jeff - mind changing the EACESS to EPERM, and we'll have something
> > that is more-or-less compatible between Linux and OpenBSD?
> >
> Sounds Good. I will make the necessary changes in the next version.

Thanks!  That is so awesome!

On the OpenBSD side, I am close to landing our madvise / msync changes.

Then we are mostly in sync.

It was on my radar for a year, but delayed because I was ponderingn
blocking the destructive madvise / msync ops on regular non-writeable
pages.  These ops remain a page-zero gadget against regular (mutable)
readonly pages, and it bothers me.  I've heard rumour this has been used
in a nasty way, and I think the sloppily defined semantics could use
a strict modernization.
  

Patch

diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
new file mode 100644
index 000000000000..651c618d0664
--- /dev/null
+++ b/Documentation/userspace-api/mseal.rst
@@ -0,0 +1,189 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Introduction of mseal
+=====================
+
+:Author: Jeff Xu <jeffxu@chromium.org>
+
+Modern CPUs support memory permissions such as RW and NX bits. The memory
+permission feature improves security stance on memory corruption bugs, i.e.
+the attacker can’t just write to arbitrary memory and point the code to it,
+the memory has to be marked with X bit, or else an exception will happen.
+
+Memory sealing additionally protects the mapping itself against
+modifications. This is useful to mitigate memory corruption issues where a
+corrupted pointer is passed to a memory management system. For example,
+such an attacker primitive can break control-flow integrity guarantees
+since read-only memory that is supposed to be trusted can become writable
+or .text pages can get remapped. Memory sealing can automatically be
+applied by the runtime loader to seal .text and .rodata pages and
+applications can additionally seal security critical data at runtime.
+
+A similar feature already exists in the XNU kernel with the
+VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
+
+User API
+========
+Two system calls are involved in virtual memory sealing, ``mseal()`` and ``mmap()``.
+
+``mseal()``
+-----------
+
+The ``mseal()`` is an architecture independent syscall, and with following
+signature:
+
+``int mseal(void addr, size_t len, unsigned long types, unsigned long flags)``
+
+**addr/len**: virtual memory address range.
+
+The address range set by ``addr``/``len`` must meet:
+   - start (addr) must be in a valid VMA.
+   - end (addr + len) must be in a valid VMA.
+   - no gap (unallocated memory) between start and end.
+   - start (addr) must be page aligned.
+
+The ``len`` will be paged aligned implicitly by kernel.
+
+**types**: bit mask to specify the sealing types, they are:
+
+- The ``MM_SEAL_BASE``: Prevent VMA from:
+
+    Unmapping, moving to another location, and shrinking the size,
+    via munmap() and mremap(), can leave an empty space, therefore
+    can be replaced with a VMA with a new set of attributes.
+
+    Move or expand a different vma into the current location,
+    via mremap().
+
+    Modifying sealed VMA via mmap(MAP_FIXED).
+
+    Size expansion, via mremap(), does not appear to pose any
+    specific risks to sealed VMAs. It is included anyway because
+    the use case is unclear. In any case, users can rely on
+    merging to expand a sealed VMA.
+
+    We consider the MM_SEAL_BASE feature, on which other sealing
+    features will depend. For instance, it probably does not make sense
+    to seal PROT_PKEY without sealing the BASE, and the kernel will
+    implicitly add SEAL_BASE for SEAL_PROT_PKEY. (If the application
+    wants to relax this in future, we could use the “flags” field in
+    mseal() to overwrite this the behavior.)
+
+- The ``MM_SEAL_PROT_PKEY``:
+
+    Seal PROT and PKEY of the address range, in other words,
+    mprotect() and pkey_mprotect() will be denied if the memory is
+    sealed with MM_SEAL_PROT_PKEY.
+
+- The ``MM_SEAL_DISCARD_RO_ANON``:
+
+    Certain types of madvise() operations are destructive [3], such
+    as MADV_DONTNEED, which can effectively alter region contents by
+    discarding pages, especially when memory is anonymous. This blocks
+    such operations for anonymous memory which is not writable to the
+    user.
+
+- The ``MM_SEAL_SEAL``
+    Denies adding a new seal.
+
+**flags**: reserved for future use.
+
+**return values**:
+
+- ``0``:
+    - Success.
+
+- ``-EINVAL``:
+    - Invalid seal type.
+    - Invalid input flags.
+    - Start address is not page aligned.
+    - Address range (``addr`` + ``len``) overflow.
+
+- ``-ENOMEM``:
+    - ``addr`` is not a valid address (not allocated).
+    - End address (``addr`` + ``len``) is not a valid address.
+    - A gap (unallocated memory) between start and end.
+
+- ``-EACCES``:
+    - ``MM_SEAL_SEAL`` is set, adding a new seal is not allowed.
+    - Address range is not sealable, e.g. ``MAP_SEALABLE`` is not
+      set during ``mmap()``.
+
+**Note**:
+
+- User can call mseal(2) multiple times to add new seal types.
+- Adding an already added seal type is a no-action (no error).
+- unseal() or removing a seal type is not supported.
+- In case of error return, one can expect the memory range is unchanged.
+
+``mmap()``
+----------
+``void *mmap(void* addr, size_t length, int prot, int flags, int fd,
+off_t offset);``
+
+We made two changes (``prot`` and ``flags``) to ``mmap()`` related to
+memory sealing.
+
+**prot**:
+
+- ``PROT_SEAL_SEAL``
+- ``PROT_SEAL_BASE``
+- ``PROT_SEAL_PROT_PKEY``
+- ``PROT_SEAL_DISCARD_RO_ANON``
+
+Allow ``mmap()`` to set the sealing type when creating a mapping. This is
+useful for optimization because it avoids having to make two system
+calls: one for ``mmap()`` and one for ``mseal()``.
+
+It's worth noting that even though the sealing type is set via the
+``prot`` field in ``mmap()``, we don't require it to be set in the ``prot``
+field in later ``mprotect()`` call. This is unlike the ``PROT_READ``,
+``PROT_WRITE``, ``PROT_EXEC`` bits, e.g. if ``PROT_WRITE`` is not set in
+``mprotect()``, it means that the region is not writable.
+
+**flags**
+The ``MAP_SEALABLE`` flag is added to the ``flags`` field of ``mmap()``.
+When present, it marks the map as sealable. A map created
+without ``MAP_SEALABLE`` will not support sealing; In other words,
+``mseal()`` will fail for such a map.
+
+Applications that don't care about sealing will expect their
+behavior unchanged. For those that need sealing support, opt-in
+by adding ``MAP_SEALABLE`` when creating the map.
+
+Use Case:
+=========
+- glibc:
+  The dymamic linker, during loading ELF executables, can apply sealing to
+  to non-writeable memory segments.
+
+- Chrome browser: protect some security sensitive data-structures.
+
+Additional notes:
+=================
+As Jann Horn pointed out in [3], there are still a few ways to write
+to RO memory, which is, in a way, by design. Those are not covered by
+``mseal()``. If applications want to block such cases, sandboxer
+(such as seccomp, LSM, etc) might be considered.
+
+Those cases are:
+
+- Write to read-only memory through ``/proc/self/mem`` interface.
+
+- Write to read-only memory through ``ptrace`` (such as ``PTRACE_POKETEXT``).
+
+- ``userfaultfd()``.
+
+The idea that inspired this patch comes from Stephen Röttger’s work in V8
+CFI [4].Chrome browser in ChromeOS will be the first user of this API.
+
+Reference:
+==========
+[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
+
+[2] https://man.openbsd.org/mimmutable.2
+
+[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
+
+[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc