[0/4] MDWE without inheritance

Message ID	20230504170942.822147-1-revest@chromium.org
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: Florent Revest <revest@chromium.org> To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: akpm@linux-foundation.org, catalin.marinas@arm.com, anshuman.khandual@arm.com, joey.gouly@arm.com, mhocko@suse.com, keescook@chromium.org, david@redhat.com, peterx@redhat.com, izbyshev@ispras.ru, nd@arm.com, broonie@kernel.org, szabolcs.nagy@arm.com, Florent Revest <revest@chromium.org> Subject: [PATCH 0/4] MDWE without inheritance Date: Thu, 4 May 2023 19:09:38 +0200 Message-ID: <20230504170942.822147-1-revest@chromium.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	MDWE without inheritance \| [0/4] MDWE without inheritance [1/4] kselftest: vm: Fix tabs/spaces inconsistency in the mdwe test [2/4] kselftest: vm: Fix mdwe's mmap_FIXED test case [3/4] mm: Add a NO_INHERIT flag to the PR_SET_MDWE prctl [4/4] kselftest: vm: Add tests for no-inherit memory-deny-write-execute

Message ID

20230504170942.822147-1-revest@chromium.org

Headers

Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
From: Florent Revest <revest@chromium.org>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: akpm@linux-foundation.org, catalin.marinas@arm.com,
        anshuman.khandual@arm.com, joey.gouly@arm.com, mhocko@suse.com,
        keescook@chromium.org, david@redhat.com, peterx@redhat.com,
        izbyshev@ispras.ru, nd@arm.com, broonie@kernel.org,
        szabolcs.nagy@arm.com, Florent Revest <revest@chromium.org>
Subject: [PATCH 0/4] MDWE without inheritance
Date: Thu,  4 May 2023 19:09:38 +0200
Message-ID: <20230504170942.822147-1-revest@chromium.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

MDWE without inheritance |

Message

Florent Revest May 4, 2023, 5:09 p.m. UTC

  Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags
current with a flag that prevents pages that were previously not executable from
becoming executable.
This tag always gets inherited by children tasks. (it's in MMF_INIT_MASK)

At Google, we've been using a somewhat similar downstream patch for a few years
now. To make the adoption of this feature easier, we've had it support a mode in
which the W^X flag does not propagate to children. For example, this is handy if
a C process which wants W^X protection suspects it could start children
processes that would use a JIT.

I'd like to align our features with the upstream prctl. This series proposes a
new NO_INHERIT flag to the MDWE prctl to make this kind of adoption easier. It
sets a different flag in current that is not in MMF_INIT_MASK and which does not
propagate.

As part of looking into MDWE, I also fixed a couple of things in the MDWE test.

Florent Revest (4):
  kselftest: vm: Fix tabs/spaces inconsistency in the mdwe test
  kselftest: vm: Fix mdwe's mmap_FIXED test case
  mm: Add a NO_INHERIT flag to the PR_SET_MDWE prctl
  kselftest: vm: Add tests for no-inherit memory-deny-write-execute

 include/linux/mman.h                   |   8 +-
 include/linux/sched/coredump.h         |   1 +
 include/uapi/linux/prctl.h             |   1 +
 kernel/sys.c                           |  29 +++++--
 tools/include/uapi/linux/prctl.h       |   1 +
 tools/testing/selftests/mm/mdwe_test.c | 110 +++++++++++++++++++++----
 6 files changed, 128 insertions(+), 22 deletions(-)

Comments

Peter Xu May 4, 2023, 8:06 p.m. UTC | #1

On Thu, May 04, 2023 at 07:09:38PM +0200, Florent Revest wrote:
> Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags
> current with a flag that prevents pages that were previously not executable from
> becoming executable.
> This tag always gets inherited by children tasks. (it's in MMF_INIT_MASK)
> 
> At Google, we've been using a somewhat similar downstream patch for a few years
> now. To make the adoption of this feature easier, we've had it support a mode in
> which the W^X flag does not propagate to children. For example, this is handy if
> a C process which wants W^X protection suspects it could start children
> processes that would use a JIT.
> 
> I'd like to align our features with the upstream prctl. This series proposes a
> new NO_INHERIT flag to the MDWE prctl to make this kind of adoption easier. It
> sets a different flag in current that is not in MMF_INIT_MASK and which does not
> propagate.

I don't think I have enough context, so sorry if I'm going to ask a naive
question..

I can understand how current MDWE helps on not allowing any modifi-able
content from becoming executable.  How could NO_INHERIT help if it won't
inherit and not in MMF_INIT_MASK?

IIUC it means the restriction will only apply to the current process.  Then
I assume the process can escape from this rule simply by a fork().  If so,
what's the point to protect at all?

And, what's the difference of this comparing to disabling MDWE after being
enabled (which seems to be forbidden for now, but it seems fork() can play
a similar role of disabling it)?

Thanks,

Florent Revest May 5, 2023, 4:42 p.m. UTC | #2

On Thu, May 4, 2023 at 10:06 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, May 04, 2023 at 07:09:38PM +0200, Florent Revest wrote:
> > Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags
> > current with a flag that prevents pages that were previously not executable from
> > becoming executable.
> > This tag always gets inherited by children tasks. (it's in MMF_INIT_MASK)
> >
> > At Google, we've been using a somewhat similar downstream patch for a few years
> > now. To make the adoption of this feature easier, we've had it support a mode in
> > which the W^X flag does not propagate to children. For example, this is handy if
> > a C process which wants W^X protection suspects it could start children
> > processes that would use a JIT.
> >
> > I'd like to align our features with the upstream prctl. This series proposes a
> > new NO_INHERIT flag to the MDWE prctl to make this kind of adoption easier. It
> > sets a different flag in current that is not in MMF_INIT_MASK and which does not
> > propagate.
>
> I don't think I have enough context, so sorry if I'm going to ask a naive
> question..

Not at all! :) You're absolutely right, it's important to address these points.

> I can understand how current MDWE helps on not allowing any modifi-able
> content from becoming executable.  How could NO_INHERIT help if it won't
> inherit and not in MMF_INIT_MASK?

The way I see it, enabling MDWE is just a small step towards hardening
a binary anyway. It can possibly make exploitation a bit harder in the
case where the attacker has _just_: a write primitive they can use to
write a shellcode somewhere and a primitive to make that page
executable later. It's a fairly narrow protection already and I think
it only really helps as part of a broader "defense in depth" strategy.

> IIUC it means the restriction will only apply to the current process.  Then
> I assume the process can escape from this rule simply by a fork().  If so,
> what's the point to protect at all?

If we assume enough control from the attacker, then MDWE is already
useless since it can be bypassed by writing to a file and then
mmapping that file with PROT_EXEC. I think that's a good example of
how "perfect can be the enemy of good" in security hardening. MDWE
isn't a silver-bullet but it's a cheap trick and it makes a small dent
in reducing the attack surface so it seems worth having anyway ?

But indeed, to address your question, if you choose to use this
NO_INHERIT flag: you're no longer protected if the attacker can fork()
as part of their exploitation. I think it's been a useful trade-off
for our internal users since, on the other hand, it also makes
adoption a lot easier: our C++ services developers can trivially opt
into a potpourri of hardening features without having to think too
much about how they work under-the-hood. The default behavior has been
to use a NO_INHERIT strategy so users don't get bad surprises the day
when they try to spawn a JITted subcommand. In the meantime, their C++
service still gets a little bit of extra protection.

> And, what's the difference of this comparing to disabling MDWE after being
> enabled (which seems to be forbidden for now, but it seems fork() can play
> a similar role of disabling it)?

That would be functionally somewhat similar, yes. I think it mostly
comes down to ease of adoption. I imagine that users who would opt
into NO_INHERIT are those who are interested in MDWE for the binary
they are writing but aren't 100% confident in what subprocesses they
will run and so they don't have to think about disabling it after
every fork.

Peter Xu May 8, 2023, 1:29 a.m. UTC | #3

On Fri, May 05, 2023 at 06:42:08PM +0200, Florent Revest wrote:
> On Thu, May 4, 2023 at 10:06 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, May 04, 2023 at 07:09:38PM +0200, Florent Revest wrote:
> > > Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags
> > > current with a flag that prevents pages that were previously not executable from
> > > becoming executable.
> > > This tag always gets inherited by children tasks. (it's in MMF_INIT_MASK)
> > >
> > > At Google, we've been using a somewhat similar downstream patch for a few years
> > > now. To make the adoption of this feature easier, we've had it support a mode in
> > > which the W^X flag does not propagate to children. For example, this is handy if
> > > a C process which wants W^X protection suspects it could start children
> > > processes that would use a JIT.
> > >
> > > I'd like to align our features with the upstream prctl. This series proposes a
> > > new NO_INHERIT flag to the MDWE prctl to make this kind of adoption easier. It
> > > sets a different flag in current that is not in MMF_INIT_MASK and which does not
> > > propagate.
> >
> > I don't think I have enough context, so sorry if I'm going to ask a naive
> > question..
> 
> Not at all! :) You're absolutely right, it's important to address these points.
> 
> > I can understand how current MDWE helps on not allowing any modifi-able
> > content from becoming executable.  How could NO_INHERIT help if it won't
> > inherit and not in MMF_INIT_MASK?
> 
> The way I see it, enabling MDWE is just a small step towards hardening
> a binary anyway. It can possibly make exploitation a bit harder in the
> case where the attacker has _just_: a write primitive they can use to
> write a shellcode somewhere and a primitive to make that page
> executable later. It's a fairly narrow protection already and I think
> it only really helps as part of a broader "defense in depth" strategy.
> 
> > IIUC it means the restriction will only apply to the current process.  Then
> > I assume the process can escape from this rule simply by a fork().  If so,
> > what's the point to protect at all?
> 
> If we assume enough control from the attacker, then MDWE is already
> useless since it can be bypassed by writing to a file and then
> mmapping that file with PROT_EXEC. I think that's a good example of
> how "perfect can be the enemy of good" in security hardening. MDWE
> isn't a silver-bullet but it's a cheap trick and it makes a small dent
> in reducing the attack surface so it seems worth having anyway ?
> 
> But indeed, to address your question, if you choose to use this
> NO_INHERIT flag: you're no longer protected if the attacker can fork()
> as part of their exploitation. I think it's been a useful trade-off
> for our internal users since, on the other hand, it also makes
> adoption a lot easier: our C++ services developers can trivially opt
> into a potpourri of hardening features without having to think too
> much about how they work under-the-hood. The default behavior has been
> to use a NO_INHERIT strategy so users don't get bad surprises the day
> when they try to spawn a JITted subcommand. In the meantime, their C++
> service still gets a little bit of extra protection.
> 
> > And, what's the difference of this comparing to disabling MDWE after being
> > enabled (which seems to be forbidden for now, but it seems fork() can play
> > a similar role of disabling it)?
> 
> That would be functionally somewhat similar, yes. I think it mostly
> comes down to ease of adoption. I imagine that users who would opt
> into NO_INHERIT are those who are interested in MDWE for the binary
> they are writing but aren't 100% confident in what subprocesses they
> will run and so they don't have to think about disabling it after
> every fork.

Okay, that makes sense to me.  Thanks.

Since the original MDWE was for systemd, I'm wondering what will happen if
some program like what you said is invoked by systemd and with MDWE enabled
already.

Currently in your patch IIUC MDWE_NO_INHERIT will fail directly on MDWE
enabled process, but then it makes me think whether it makes more sense to
allow converting MDWE->MDWE_NO_INHERIT in this case.  It seems to provide a
most broad coverage on system daemons using MDWE starting from systemd
initial process, meanwhile allows specific daemons to fork into anything
like a JIT process so it can make itself NO_INHERIT.  Attackers won't
leverage this because MDWE_NO_INHERIT also means MDWE enabled.

Florent Revest May 8, 2023, 12:12 p.m. UTC | #4

On Mon, May 8, 2023 at 3:29 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, May 05, 2023 at 06:42:08PM +0200, Florent Revest wrote:
> > On Thu, May 4, 2023 at 10:06 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, May 04, 2023 at 07:09:38PM +0200, Florent Revest wrote:
> > > > Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags
> > > > current with a flag that prevents pages that were previously not executable from
> > > > becoming executable.
> > > > This tag always gets inherited by children tasks. (it's in MMF_INIT_MASK)
> > > >
> > > > At Google, we've been using a somewhat similar downstream patch for a few years
> > > > now. To make the adoption of this feature easier, we've had it support a mode in
> > > > which the W^X flag does not propagate to children. For example, this is handy if
> > > > a C process which wants W^X protection suspects it could start children
> > > > processes that would use a JIT.
> > > >
> > > > I'd like to align our features with the upstream prctl. This series proposes a
> > > > new NO_INHERIT flag to the MDWE prctl to make this kind of adoption easier. It
> > > > sets a different flag in current that is not in MMF_INIT_MASK and which does not
> > > > propagate.
> > >
> > > I don't think I have enough context, so sorry if I'm going to ask a naive
> > > question..
> >
> > Not at all! :) You're absolutely right, it's important to address these points.
> >
> > > I can understand how current MDWE helps on not allowing any modifi-able
> > > content from becoming executable.  How could NO_INHERIT help if it won't
> > > inherit and not in MMF_INIT_MASK?
> >
> > The way I see it, enabling MDWE is just a small step towards hardening
> > a binary anyway. It can possibly make exploitation a bit harder in the
> > case where the attacker has _just_: a write primitive they can use to
> > write a shellcode somewhere and a primitive to make that page
> > executable later. It's a fairly narrow protection already and I think
> > it only really helps as part of a broader "defense in depth" strategy.
> >
> > > IIUC it means the restriction will only apply to the current process.  Then
> > > I assume the process can escape from this rule simply by a fork().  If so,
> > > what's the point to protect at all?
> >
> > If we assume enough control from the attacker, then MDWE is already
> > useless since it can be bypassed by writing to a file and then
> > mmapping that file with PROT_EXEC. I think that's a good example of
> > how "perfect can be the enemy of good" in security hardening. MDWE
> > isn't a silver-bullet but it's a cheap trick and it makes a small dent
> > in reducing the attack surface so it seems worth having anyway ?
> >
> > But indeed, to address your question, if you choose to use this
> > NO_INHERIT flag: you're no longer protected if the attacker can fork()
> > as part of their exploitation. I think it's been a useful trade-off
> > for our internal users since, on the other hand, it also makes
> > adoption a lot easier: our C++ services developers can trivially opt
> > into a potpourri of hardening features without having to think too
> > much about how they work under-the-hood. The default behavior has been
> > to use a NO_INHERIT strategy so users don't get bad surprises the day
> > when they try to spawn a JITted subcommand. In the meantime, their C++
> > service still gets a little bit of extra protection.
> >
> > > And, what's the difference of this comparing to disabling MDWE after being
> > > enabled (which seems to be forbidden for now, but it seems fork() can play
> > > a similar role of disabling it)?
> >
> > That would be functionally somewhat similar, yes. I think it mostly
> > comes down to ease of adoption. I imagine that users who would opt
> > into NO_INHERIT are those who are interested in MDWE for the binary
> > they are writing but aren't 100% confident in what subprocesses they
> > will run and so they don't have to think about disabling it after
> > every fork.
>
> Okay, that makes sense to me.  Thanks.
>
> Since the original MDWE was for systemd, I'm wondering what will happen if
> some program like what you said is invoked by systemd and with MDWE enabled
> already.

Good question

> Currently in your patch IIUC MDWE_NO_INHERIT will fail directly on MDWE
> enabled process,

Yes, I tried to stay close to the spirit of the existing logic (which
doesn't allow any sort of privilege gains) but this is not
particularly a requirement on our side so I'm quite flexible here.
Maybe Joey has an input here ?

> but then it makes me think whether it makes more sense to
> allow converting MDWE->MDWE_NO_INHERIT in this case.  It seems to provide a
> most broad coverage on system daemons using MDWE starting from systemd
> initial process, meanwhile allows specific daemons to fork into anything
> like a JIT process so it can make itself NO_INHERIT.  Attackers won't
> leverage this because MDWE_NO_INHERIT also means MDWE enabled.

I should have cc-ed systemd folks who could have opinions on this. I
will for v2.

+ cc Topi & Lennart

Catalin Marinas May 8, 2023, 2:10 p.m. UTC | #5

On Mon, May 08, 2023 at 02:12:21PM +0200, Florent Revest wrote:
> On Mon, May 8, 2023 at 3:29 AM Peter Xu <peterx@redhat.com> wrote:
> > On Fri, May 05, 2023 at 06:42:08PM +0200, Florent Revest wrote:
> > > On Thu, May 4, 2023 at 10:06 PM Peter Xu <peterx@redhat.com> wrote:
> > > > And, what's the difference of this comparing to disabling MDWE after being
> > > > enabled (which seems to be forbidden for now, but it seems fork() can play
> > > > a similar role of disabling it)?
> > >
> > > That would be functionally somewhat similar, yes. I think it mostly
> > > comes down to ease of adoption. I imagine that users who would opt
> > > into NO_INHERIT are those who are interested in MDWE for the binary
> > > they are writing but aren't 100% confident in what subprocesses they
> > > will run and so they don't have to think about disabling it after
> > > every fork.
> >
> > Okay, that makes sense to me.  Thanks.
> >
> > Since the original MDWE was for systemd, I'm wondering what will happen if
> > some program like what you said is invoked by systemd and with MDWE enabled
> > already.
> 
> Good question

I think JITs work around this by creating two separate mappings of the
same pages, one RX and the other RW (rather than toggling the permission
with mprotect()). I had the impression Chromium can use memfd to do
something similar but I never checked.

> > Currently in your patch IIUC MDWE_NO_INHERIT will fail directly on MDWE
> > enabled process,
> 
> Yes, I tried to stay close to the spirit of the existing logic (which
> doesn't allow any sort of privilege gains) but this is not
> particularly a requirement on our side so I'm quite flexible here.

I think we should keep the original behaviour of systemd here, otherwise
they won't transition to the new interface and keep using the SECCOMP
BPF approach (which, in addition, prevents glibc from setting PROT_BTI
on an already executable mapping).

To me MDWE is not about preventing JITs but rather ensuring buggy
programs don't end up with WX mappings. We ended up this way because of
the SECCOMP BPF limitations (just guessing, I haven't been involved in
its design). With a no-inherit MDWE, one can introduce an additional
policy for systemd. It would be a sysadmin decision which one to enable
and maybe current (inherit) MDWE will disappear in time.

x86 has protection keys and arm64 will soon have permission overlays
that allow user-space to toggle between RX and RW (Joey is looking at
the arm64 support). I'm not sure how we'll end up implemented this on
arm64 (and haven't looked at x86) but I have a suspicion MDWE will get
in the way as the base page table permission will probably need
PROT_WRITE|PROT_EXEC.

Topi Miettinen May 8, 2023, 5:21 p.m. UTC | #6

On 8.5.2023 17.10, Catalin Marinas wrote:
> I think we should keep the original behaviour of systemd here, otherwise
> they won't transition to the new interface and keep using the SECCOMP
> BPF approach (which, in addition, prevents glibc from setting PROT_BTI
> on an already executable mapping).

Systemd has transitioned to prctl(PR_SET_MDWE) method since release of 
v253, so the original behaviour definitely should be kept.

> To me MDWE is not about preventing JITs but rather ensuring buggy
> programs don't end up with WX mappings. We ended up this way because of
> the SECCOMP BPF limitations (just guessing, I haven't been involved in
> its design). With a no-inherit MDWE, one can introduce an additional
> policy for systemd. It would be a sysadmin decision which one to enable
> and maybe current (inherit) MDWE will disappear in time.

There could be a new setting for this, like 
MemoryDenyWriteExecute=no-inherit. I'd only use it for those special 
cases where MemoryDenyWriteExecute=yes can't be used.

> x86 has protection keys and arm64 will soon have permission overlays
> that allow user-space to toggle between RX and RW (Joey is looking at
> the arm64 support). I'm not sure how we'll end up implemented this on
> arm64 (and haven't looked at x86) but I have a suspicion MDWE will get
> in the way as the base page table permission will probably need
> PROT_WRITE|PROT_EXEC.

Wouldn't those features defeat any gains from MDWE? The features 
probably should be forbidden with MemoryDenyWriteExecute=yes.

-Topi

Catalin Marinas May 9, 2023, 10:04 a.m. UTC | #7

On Mon, May 08, 2023 at 08:21:16PM +0300, Topi Miettinen wrote:
> On 8.5.2023 17.10, Catalin Marinas wrote:
> > I think we should keep the original behaviour of systemd here, otherwise
> > they won't transition to the new interface and keep using the SECCOMP
> > BPF approach (which, in addition, prevents glibc from setting PROT_BTI
> > on an already executable mapping).
> 
> Systemd has transitioned to prctl(PR_SET_MDWE) method since release of v253,
> so the original behaviour definitely should be kept.

That's great. So yes, no ABI changes allowed anymore.

> > x86 has protection keys and arm64 will soon have permission overlays
> > that allow user-space to toggle between RX and RW (Joey is looking at
> > the arm64 support). I'm not sure how we'll end up implemented this on
> > arm64 (and haven't looked at x86) but I have a suspicion MDWE will get
> > in the way as the base page table permission will probably need
> > PROT_WRITE|PROT_EXEC.
> 
> Wouldn't those features defeat any gains from MDWE? The features probably
> should be forbidden with MemoryDenyWriteExecute=yes.

The permission overlays (controlled by the user) can only further
restrict the mmap() permissions. So MDWE would still work as expected.
If one wants to toggle between RW and RX with overlays, the overall
mmap() needs to be RWX and it won't work if MDWE=yes. No need to
explicitly disable the overlays feature.

On arm64 at least, with the introduction of permission overlays we also
have the notion of "Read, Execute if not Write". This permission
automatically disables Exec if the mapping becomes writable (overlays
can disable writable, allowing exec). We could have a new MDWE policy
which allows this, though I'm not that keen on using it in Linux since
background permission changes done by the kernel can lead to an
unexpected executable permission (e.g. marking a page read-only for
clean/dirty tracking or in preparation for CoW after fork()).