[v1,0/8] x86_64 SandBox Mode arch hooks

Message ID	20240214113516.2307-1-petrtesarik@huaweicloud.com
Headers	Received-SPF: pass (google.com: domain of linux-kernel+bounces-65137-ouuuleilei=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) client-ip=147.75.199.223; From: Petr Tesarik <petrtesarik@huaweicloud.com> To: Jonathan Corbet <corbet@lwn.net>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, x86@kernel.org (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)), "H. Peter Anvin" <hpa@zytor.com>, Andy Lutomirski <luto@kernel.org>, Oleg Nesterov <oleg@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Xin Li <xin3.li@intel.com>, Arnd Bergmann <arnd@arndb.de>, Andrew Morton <akpm@linux-foundation.org>, Rick Edgecombe <rick.p.edgecombe@intel.com>, Kees Cook <keescook@chromium.org>, "Masami Hiramatsu (Google)" <mhiramat@kernel.org>, Pengfei Xu <pengfei.xu@intel.com>, Josh Poimboeuf <jpoimboe@kernel.org>, Ze Gao <zegao2021@gmail.com>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Kai Huang <kai.huang@intel.com>, David Woodhouse <dwmw@amazon.co.uk>, Brian Gerst <brgerst@gmail.com>, Jason Gunthorpe <jgg@ziepe.ca>, Joerg Roedel <jroedel@suse.de>, "Mike Rapoport (IBM)" <rppt@kernel.org>, Tina Zhang <tina.zhang@intel.com>, Jacob Pan <jacob.jun.pan@linux.intel.com>, linux-doc@vger.kernel.org (open list:DOCUMENTATION), linux-kernel@vger.kernel.org (open list) Cc: Roberto Sassu <roberto.sassu@huaweicloud.com>, petr@tesarici.cz, Petr Tesarik <petr.tesarik1@huawei-partners.com> Subject: [PATCH v1 0/8] x86_64 SandBox Mode arch hooks Date: Wed, 14 Feb 2024 12:35:08 +0100 Message-Id: <20240214113516.2307-1-petrtesarik@huaweicloud.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	x86_64 SandBox Mode arch hooks \| [v1,0/8] x86_64 SandBox Mode arch hooks [v1,1/8] sbm: x86: page table arch hooks [v1,2/8] sbm: x86: execute target function on sandbox mode stack [v1,3/8] sbm: x86: map system data structures into the sandbox [v1,4/8] sbm: x86: allocate and map an exception stack [v1,5/8] sbm: x86: handle sandbox mode faults [v1,6/8] sbm: x86: switch to sandbox mode pages in arch_sbm_exec() [v1,7/8] sbm: documentation of the x86-64 SandBox Mode implementation [v1,8/8] sbm: x86: lazy TLB flushing

Message

Petr Tesarik Feb. 14, 2024, 11:35 a.m. UTC

  From: Petr Tesarik <petr.tesarik1@huawei-partners.com>

This patch series implements x86_64 arch hooks for the generic SandBox
Mode infrastructure.

SandBox Mode on x86_64 is implemented as follows:

* The target function runs with CPL 3 (same as user mode) within its
  own virtual address space.
* Interrupt entry/exit paths are modified to let the interrupt handlers
  always run with kernel CR3 and restore sandbox CR3 when returning to
  sandbox mode.
* To avoid undesirable user mode processing (FPU state, signals, etc.),
  the value of pt_regs->cs is temporarily adjusted to make it look like
  coming from kernel mode.
* On a CPU fault, execution stops immediately, returning -EFAULT to
  the caller.
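
The pt_regs->cs adjustment from the third bullet can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not code from the series: struct regs_model, sbm_mask_cs() and sbm_restore_cs() are hypothetical names, and USER_CS merely stands in for whatever CPL 3 selector the sandbox actually uses. The only part taken from the existing kernel is the check itself: on x86_64, user_mode() tests the two low (RPL) bits of the saved CS.

```c
#include <assert.h>
#include <stdbool.h>

#define KERNEL_CS 0x10UL	/* value of __KERNEL_CS on x86_64 */
#define USER_CS   0x33UL	/* value of __USER_CS on x86_64, used as an example CPL 3 selector */

/* Reduced stand-in for struct pt_regs: only the saved code segment. */
struct regs_model {
	unsigned long cs;
};

/* Same test as the kernel's user_mode() on x86_64: non-zero RPL bits. */
static bool user_mode(const struct regs_model *regs)
{
	return (regs->cs & 3) != 0;
}

/* Hypothetical: make the frame look like kernel mode, return the real CS. */
static unsigned long sbm_mask_cs(struct regs_model *regs)
{
	unsigned long saved = regs->cs;

	regs->cs = KERNEL_CS;
	return saved;
}

/* Hypothetical: put the real CPL 3 selector back before returning. */
static void sbm_restore_cs(struct regs_model *regs, unsigned long saved)
{
	assert((saved & 3) == 3);	/* the sandbox runs with CPL 3 */
	regs->cs = saved;
}
```

While the selector is masked, any code that keys off user_mode(regs) (FPU state handling, signal delivery and so on) takes the kernel-mode path, which appears to be the effect the cover letter describes.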

Petr Tesarik (8):
  sbm: x86: page table arch hooks
  sbm: x86: execute target function on sandbox mode stack
  sbm: x86: map system data structures into the sandbox
  sbm: x86: allocate and map an exception stack
  sbm: x86: handle sandbox mode faults
  sbm: x86: switch to sandbox mode pages in arch_sbm_exec()
  sbm: documentation of the x86-64 SandBox Mode implementation
  sbm: x86: lazy TLB flushing

 Documentation/security/sandbox-mode.rst |  25 ++
 arch/x86/Kconfig                        |   1 +
 arch/x86/entry/entry_64.S               | 123 ++++++
 arch/x86/include/asm/page_64_types.h    |   1 +
 arch/x86/include/asm/ptrace.h           |  21 +
 arch/x86/include/asm/sbm.h              |  83 ++++
 arch/x86/include/asm/segment.h          |   7 +
 arch/x86/include/asm/thread_info.h      |   3 +
 arch/x86/kernel/Makefile                |   2 +
 arch/x86/kernel/asm-offsets.c           |  10 +
 arch/x86/kernel/sbm/Makefile            |  16 +
 arch/x86/kernel/sbm/call_64.S           |  95 +++++
 arch/x86/kernel/sbm/core.c              | 499 ++++++++++++++++++++++++
 arch/x86/kernel/traps.c                 |  14 +-
 arch/x86/mm/fault.c                     |   6 +
 15 files changed, 905 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/sbm.h
 create mode 100644 arch/x86/kernel/sbm/Makefile
 create mode 100644 arch/x86/kernel/sbm/call_64.S
 create mode 100644 arch/x86/kernel/sbm/core.c

Comments

Dave Hansen Feb. 14, 2024, 2:52 p.m. UTC | #1

On 2/14/24 03:35, Petr Tesarik wrote:
> This patch series implements x86_64 arch hooks for the generic SandBox
> Mode infrastructure.

I think I'm missing a bit of context here.  What does one _do_ with
SandBox Mode?  Why is it useful?

H. Peter Anvin Feb. 14, 2024, 3:28 p.m. UTC | #2

On February 14, 2024 6:52:53 AM PST, Dave Hansen <dave.hansen@intel.com> wrote:
>On 2/14/24 03:35, Petr Tesarik wrote:
>> This patch series implements x86_64 arch hooks for the generic SandBox
>> Mode infrastructure.
>
>I think I'm missing a bit of context here.  What does one _do_ with
>SandBox Mode?  Why is it useful?

Seriously. On the surface it looks like a really bad idea – basically an ad hoc, *more* privileged version of user space.

Petr Tesařík Feb. 14, 2024, 4:41 p.m. UTC | #3

On Wed, 14 Feb 2024 07:28:35 -0800
"H. Peter Anvin" <hpa@zytor.com> wrote:

> On February 14, 2024 6:52:53 AM PST, Dave Hansen <dave.hansen@intel.com> wrote:
> >On 2/14/24 03:35, Petr Tesarik wrote:  
> >> This patch series implements x86_64 arch hooks for the generic SandBox
> >> Mode infrastructure.  
> >
> >I think I'm missing a bit of context here.  What does one _do_ with
> >SandBox Mode?  Why is it useful?  
> 
> Seriously. On the surface it looks like a really bad idea – basically an ad hoc, *more* privileged version of user space.

Hi hpa,

I agree that it kind of tries to do "user mode without user mode".
There are some differences from actual user mode:

First, from a process management POV, sandbox mode appears to be
running in kernel mode. So, there is no way to use ptrace(2), send
malicious signals or otherwise interact with the sandbox. In fact,
the process can have three independent contexts: user mode, kernel mode
and sandbox mode.

Second, a sandbox can run unmodified kernel code and interact directly
with other parts of the kernel. It's not really possible with this
initial patch series, but the plan is that sandbox mode can share locks
with the kernel.

Third, sandbox code can be trusted for operations like parsing keys for
the trusted keychain if the kernel is locked down, i.e. when even a
process with UID 0 is not on the same trust level as kernel mode.

HTH
Petr T

H. Peter Anvin Feb. 14, 2024, 5:29 p.m. UTC | #4

On February 14, 2024 8:41:43 AM PST, "Petr Tesařík" <petr@tesarici.cz> wrote:
>On Wed, 14 Feb 2024 07:28:35 -0800
>"H. Peter Anvin" <hpa@zytor.com> wrote:
>
>> On February 14, 2024 6:52:53 AM PST, Dave Hansen <dave.hansen@intel.com> wrote:
>> >On 2/14/24 03:35, Petr Tesarik wrote:  
>> >> This patch series implements x86_64 arch hooks for the generic SandBox
>> >> Mode infrastructure.  
>> >
>> >I think I'm missing a bit of context here.  What does one _do_ with
>> >SandBox Mode?  Why is it useful?  
>> 
>> Seriously. On the surface it looks like a really bad idea – basically an ad hoc, *more* privileged version of user space.
>
>Hi hpa,
>
>I agree that it kind of tries to do "user mode without user mode".
>There are some differences from actual user mode:
>
>First, from a process management POV, sandbox mode appears to be
>running in kernel mode. So, there is no way to use ptrace(2), send
>malicious signals or otherwise interact with the sandbox. In fact,
>the process can have three independent contexts: user mode, kernel mode
>and sandbox mode.
>
>Second, a sandbox can run unmodified kernel code and interact directly
>with other parts of the kernel. It's not really possible with this
>initial patch series, but the plan is that sandbox mode can share locks
>with the kernel.
>
>Third, sandbox code can be trusted for operations like parsing keys for
>the trusted keychain if the kernel is locked down, i.e. when even a
>process with UID 0 is not on the same trust level as kernel mode.
>
>HTH
>Petr T
>

This, to me, seems like "all the downsides of a microkernel without the upsides." Furthermore, it breaks security-hardening features like LASS and (to a lesser degree) SMAP. Not to mention dropping global pages?

All in all, I cannot see this as anything other than an enormous step in the wrong direction, and it isn't even in the sense of "it is harmless if no one uses it" – you are introducing architectural changes that are most definitely *very* harmful both to maintainers and users.

To me, this feels like paravirtualization all over again. 20 years later we still have not been able to undo all the damage that did.

Edgecombe, Rick P Feb. 14, 2024, 6:14 p.m. UTC | #5

On Wed, 2024-02-14 at 17:41 +0100, Petr Tesařík wrote:
> Second, a sandbox can run unmodified kernel code and interact
> directly
> with other parts of the kernel. It's not really possible with this
> initial patch series, but the plan is that sandbox mode can share
> locks
> with the kernel.
> 
> Third, sandbox code can be trusted for operations like parsing keys
> for
> the trusted keychain if the kernel is locked down, i.e. when even a
> process with UID 0 is not on the same trust level as kernel mode.

What use case needs to have the sandbox both protected from the kernel
(trusted operations) and non-privileged (the kernel protected from it
via CPL3)? It seems like opposite things.

Petr Tesařík Feb. 14, 2024, 6:22 p.m. UTC | #6

On Wed, 14 Feb 2024 06:52:53 -0800
Dave Hansen <dave.hansen@intel.com> wrote:

> On 2/14/24 03:35, Petr Tesarik wrote:
> > This patch series implements x86_64 arch hooks for the generic SandBox
> > Mode infrastructure.  
> 
> I think I'm missing a bit of context here.  What does one _do_ with
> SandBox Mode?  Why is it useful?

I see, I split the patch series into the base infrastructure and the
x86_64 implementation, but I forgot to merge the two recipient lists.
:-(

Anyway, in the long term I would like to work on gradual decomposition
of the kernel into a core part and many self-contained components.
Sandbox mode is a useful tool to enforce isolation.

In its current form, sandbox mode is too limited for that, but I'm
trying to find some balance between "publish early" and reaching a
feature level where some concrete examples can be shown. I'd rather
fail fast than maintain hundreds of patches in an out-of-tree branch
before submitting (and failing anyway).

Petr T

Petr Tesařík Feb. 14, 2024, 6:32 p.m. UTC | #7

(+Cc Kees)

On Wed, 14 Feb 2024 18:14:49 +0000
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Wed, 2024-02-14 at 17:41 +0100, Petr Tesařík wrote:
> > Second, a sandbox can run unmodified kernel code and interact
> > directly
> > with other parts of the kernel. It's not really possible with this
> > initial patch series, but the plan is that sandbox mode can share
> > locks
> > with the kernel.
> > 
> > Third, sandbox code can be trusted for operations like parsing keys
> > for
> > the trusted keychain if the kernel is locked down, i.e. when even a
> > process with UID 0 is not on the same trust level as kernel mode.  
> 
> What use case needs to have the sandbox both protected from the kernel
> (trusted operations) and non-privileged (the kernel protected from it
> via CPL3)? It seems like opposite things.

I think I have mentioned one: parsing keys for the trusted keyring. The
parser is complex enough to be potentially buggy, but the security
folks have already dismissed the idea to run it as a user mode helper.

Petr T

Dave Hansen Feb. 14, 2024, 6:42 p.m. UTC | #8

On 2/14/24 10:22, Petr Tesařík wrote:
> Anyway, in the long term I would like to work on gradual decomposition
> of the kernel into a core part and many self-contained components.
> Sandbox mode is a useful tool to enforce isolation.

I'd want to see at least a few examples of how this decomposition would
work and how much of a burden it is on each site that deployed it.

But I'm skeptical that this could ever work.  Ring-0 execution really is
special and it's _increasingly_ so.  Think of LASS or SMAP or SMEP.
We're even seeing hardware designers add hardware security defenses to
ring-0 that are not applied to ring-3.

In other words, ring-3 isn't just a deprivileged ring-0, it's more
exposed to attacks.

> I'd rather fail fast than maintain hundreds of patches in an
> out-of-tree branch before submitting (and failing anyway).

I don't see any remotely feasible path forward for this approach.

Xin Li (Intel) Feb. 14, 2024, 6:52 p.m. UTC | #9

On 2/14/2024 10:22 AM, Petr Tesařík wrote:
> On Wed, 14 Feb 2024 06:52:53 -0800
> Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 2/14/24 03:35, Petr Tesarik wrote:
>>> This patch series implements x86_64 arch hooks for the generic SandBox
>>> Mode infrastructure.
>>
>> I think I'm missing a bit of context here.  What does one _do_ with
>> SandBox Mode?  Why is it useful?
> 
> I see, I split the patch series into the base infrastructure and the
> x86_64 implementation, but I forgot to merge the two recipient lists.
> :-(
> 
> Anyway, in the long term I would like to work on gradual decomposition
> of the kernel into a core part and many self-contained components.
> Sandbox mode is a useful tool to enforce isolation.
> 
> In its current form, sandbox mode is too limited for that, but I'm
> trying to find some balance between "publish early" and reaching a
> feature level where some concrete examples can be shown. I'd rather
> fail fast than maintain hundreds of patches in an out-of-tree branch
> before submitting (and failing anyway).
> 
> Petr T
> 

What you're proposing sounds like a gigantic thing, which could potentially
impact all subsystems.  Unless you prove it has big advantages with real
world usages, I guess nobody even wants to look into the patches.

BTW, this seems another attempt to get the idea of micro-kernel into
Linux.

Petr Tesařík Feb. 14, 2024, 7:14 p.m. UTC | #10

On Wed, 14 Feb 2024 09:29:06 -0800
"H. Peter Anvin" <hpa@zytor.com> wrote:

> On February 14, 2024 8:41:43 AM PST, "Petr Tesařík" <petr@tesarici.cz> wrote:
> >On Wed, 14 Feb 2024 07:28:35 -0800
> >"H. Peter Anvin" <hpa@zytor.com> wrote:
> >  
> >> On February 14, 2024 6:52:53 AM PST, Dave Hansen <dave.hansen@intel.com> wrote:  
> >> >On 2/14/24 03:35, Petr Tesarik wrote:    
> >> >> This patch series implements x86_64 arch hooks for the generic SandBox
> >> >> Mode infrastructure.    
> >> >
> >> >I think I'm missing a bit of context here.  What does one _do_ with
> >> >SandBox Mode?  Why is it useful?    
> >> 
> >> Seriously. On the surface it looks like a really bad idea – basically an ad hoc, *more* privileged version of user space.  
> >
> >Hi hpa,
> >
> >I agree that it kind of tries to do "user mode without user mode".
> >There are some differences from actual user mode:
> >
> >First, from a process management POV, sandbox mode appears to be
> >running in kernel mode. So, there is no way to use ptrace(2), send
> >malicious signals or otherwise interact with the sandbox. In fact,
> >the process can have three independent contexts: user mode, kernel mode
> >and sandbox mode.
> >
> >Second, a sandbox can run unmodified kernel code and interact directly
> >with other parts of the kernel. It's not really possible with this
> >initial patch series, but the plan is that sandbox mode can share locks
> >with the kernel.
> >
> >Third, sandbox code can be trusted for operations like parsing keys for
> >the trusted keychain if the kernel is locked down, i.e. when even a
> >process with UID 0 is not on the same trust level as kernel mode.
> >
> >HTH
> >Petr T
> >  
> 
> This, to me, seems like "all the downsides of a microkernel without the upsides." Furthermore, it breaks security-hardening features like LASS and (to a lesser degree) SMAP. Not to mention dropping global pages?

I must be missing something... But I am always open to learn something new.

I don't see how it breaks SMAP. Sandbox mode runs in its own address
space which does not contain any user-mode pages. While running in
sandbox mode, user pages belong to the sandboxed code, kernel pages are
used to enter/exit kernel mode. Bottom half of the PGD is empty, all
user page translations are removed from TLB.
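
The layout claim above can be made concrete with a short check. This is a sketch under assumptions, not code from the series: it models a 4-level x86-64 page table, where the top-level table (PGD) has 512 entries and entries 0..255 cover the lower, traditionally user, half of the address space. lower_half_empty() is a hypothetical helper name, and the table is modelled as a plain array of 64-bit words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_PGD	512	/* entries in the top-level table */
#define PGDIR_SHIFT	39	/* 4-level paging: each entry maps 512 GiB */

/* Index of the PGD entry that translates a given virtual address. */
static size_t pgd_index(uint64_t addr)
{
	return (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
}

/* Hypothetical check: no entry in the lower half is populated. */
static bool lower_half_empty(const uint64_t *pgd)
{
	assert(pgd != NULL);
	for (size_t i = 0; i < PTRS_PER_PGD / 2; i++) {
		if (pgd[i] != 0)
			return false;
	}
	return true;
}
```

With such a table, an address like 0x00007f0000000000 (index 254) has no translation at all, while everything the sandbox can reach sits at index 256 or above.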

For a similar reason, I don't see right now how it breaks linear
address space separation. Even if it did, I believe I can take care of
it in the entry/exit path. Anyway, which branch contains the LASS
patches now, so I can test?

As for dropping global pages, that's only part of the story. Indeed,
patch 6/8 of the series sets CR4.PGE to zero to have a known-good
working state, but that code is removed again by patch 8/8. I wanted to
implement lazy TLB flushing separately, so it can be easily reverted if
it is suspected to cause an issue.

Plus, each sandbox mode can use PCID to reduce TLB flushing even more.
I haven't done it, because it would be a waste of time if the whole
concept is scratched.

I believe that only those global pages which are actually accessed by
the sandbox need to be flushed. Yes, some parts of the necessary logic
are missing in the current patch series. I can add them in a v2 series
if you wish.

> All in all, I cannot see this as anything other than an enormous step in the wrong direction, and it isn't even in the sense of "it is harmless if no one uses it" – you are introducing architectural changes that are most definitely *very* harmful both to maintainers and users.

I agree that it adds some burden. After all, that's why the ultimate
decision is up to you, the maintainers. To defend my cause, I hope you
have noticed that if CONFIG_SANDBOX_MODE is not set:

1. literally nothing changes in entry_64.S;
2. sandbox_mode() always evaluates to false, so the added conditionals in fault.c and traps.c are never executed;
3. top_of_instr_stack() always returns current_top_of_stack(), which is equivalent to the code it replaces, namely this_cpu_read(pcpu_hot.top_of_stack).
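
Points 2 and 3 can be pictured as the usual pattern of compile-time stubs. The following is a sketch under assumptions, not the actual header from the series: the signature of sandbox_mode() is a guess, struct pt_regs is left opaque, and current_top_of_stack() is replaced by a user-space stand-in that returns a fixed value so the fragment compiles outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct pt_regs;			/* opaque here; the real layout is not needed */

/* Stand-in for the kernel's per-CPU accessor (hypothetical value). */
static unsigned long current_top_of_stack(void)
{
	return 0x4000UL;
}

#ifndef CONFIG_SANDBOX_MODE
/* Assumed stub: constant false lets the compiler drop guarded branches. */
static inline bool sandbox_mode(const struct pt_regs *regs)
{
	(void)regs;
	return false;
}

/* Assumed stub: identical to the expression the series replaces. */
static inline unsigned long top_of_instr_stack(void)
{
	return current_top_of_stack();
}
#endif
```

Because sandbox_mode() is a constant in this configuration, a guarded block such as "if (sandbox_mode(regs)) ..." in fault.c or traps.c would normally be eliminated by the compiler.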

So, all the interesting stuff is under arch/x86/kernel/sbm/. Shall I
add a corresponding entry with my name to MAINTAINERS?

> To me, this feels like paravirtualization all over again. 20 years later we still have not been able to undo all the damage that did.

OK, I can follow you here. Indeed, there is some similarity with Xen PV
(running kernel code with CPL 3), but I don't think there's more than
this.

Petr T

Edgecombe, Rick P Feb. 14, 2024, 7:19 p.m. UTC | #11

On Wed, 2024-02-14 at 19:32 +0100, Petr Tesařík wrote:
> > What use case needs to have the sandbox both protected from the
> > kernel
> > (trusted operations) and non-privileged (the kernel protected from
> > it
> > via CPL3)? It seems like opposite things.
> 
> I think I have mentioned one: parsing keys for the trusted keyring.
> The
> parser is complex enough to be potentially buggy, but the security
> folks have already dismissed the idea to run it as a user mode
> helper.

Ah, I didn't realize the kernel needed to be protected from the key
parsing part because you called it out as a trusted operation. So on
the protect-the-kernel-side it's similar to the microkernel security
reasoning.

Did I get the other part wrong - that you want to protect the sandbox
from the rest of kernel as well?

Petr Tesařík Feb. 14, 2024, 7:33 p.m. UTC | #12

On Wed, 14 Feb 2024 10:42:57 -0800
Dave Hansen <dave.hansen@intel.com> wrote:

> On 2/14/24 10:22, Petr Tesařík wrote:
> > Anyway, in the long term I would like to work on gradual decomposition
> > of the kernel into a core part and many self-contained components.
> > Sandbox mode is a useful tool to enforce isolation.  
> 
> I'd want to see at least a few examples of how this decomposition would
> work and how much of a burden it is on each site that deployed it.

Got it. Are you okay with a couple of examples to illustrate the
concept? Because if you want patches that have been acked by the
respective maintainers, it somehow becomes a chicken-and-egg kind of
problem...

> But I'm skeptical that this could ever work.  Ring-0 execution really is
> special and it's _increasingly_ so.  Think of LASS or SMAP or SMEP.

I have just answered a similar concern by hpa. In short, I don't think
these features are relevant, because by definition sandbox mode does
not share anything with user mode address space.

> We're even seeing hardware designers add hardware security defenses to
> ring-0 that are not applied to ring-3.
> 
> In other words, ring-3 isn't just a deprivileged ring-0, it's more
> exposed to attacks.
> 
> > I'd rather fail fast than maintain hundreds of patches in an
> > out-of-tree branch before submitting (and failing anyway).  
> 
> I don't see any remotely feasible path forward for this approach.

I can live with such a decision. But first, I want to make sure that the
concept has been understood correctly. So far, at least some concerns
suggest an understanding that is not quite accurate.

Is this sandbox idea a bit too much out-of-the-box?

Petr T

Petr Tesařík Feb. 14, 2024, 7:35 p.m. UTC | #13

On Wed, 14 Feb 2024 19:19:27 +0000
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Wed, 2024-02-14 at 19:32 +0100, Petr Tesařík wrote:
> > > What use case needs to have the sandbox both protected from the
> > > kernel
> > > (trusted operations) and non-privileged (the kernel protected from
> > > it
> > > via CPL3)? It seems like opposite things.  
> > 
> > I think I have mentioned one: parsing keys for the trusted keyring.
> > The
> > parser is complex enough to be potentially buggy, but the security
> > folks have already dismissed the idea to run it as a user mode
> > helper.  
> 
> Ah, I didn't realize the kernel needed to be protected from the key
> parsing part because you called it out as a trusted operation. So on
> the protect-the-kernel-side it's similar to the microkernel security
> reasoning.
> 
> Did I get the other part wrong - that you want to protect the sandbox
> from the rest of kernel as well?

Protecting the sandbox from the rest of the kernel is out of scope.
However, different sandboxes should be protected from each other.

Petr T

Dave Hansen Feb. 14, 2024, 8:16 p.m. UTC | #14

On 2/14/24 11:33, Petr Tesařík wrote:
>> I'd want to see at least a few examples of how this decomposition would
>> work and how much of a burden it is on each site that deployed it.
> Got it. Are you okay with a couple of examples to illustrate the
> concept? Because if you want patches that have been acked by the
> respective maintainers, it somehow becomes a chicken-and-egg kind of
> problem...

I'd be happy to look at a patch or two that demonstrate the concept,
just to make sure I'm not missing something.  But I'm still quite skeptical.

Petr Tesařík Feb. 15, 2024, 6:59 a.m. UTC | #15

On Wed, 14 Feb 2024 10:52:47 -0800
Xin Li <xin@zytor.com> wrote:

> On 2/14/2024 10:22 AM, Petr Tesařík wrote:
> > On Wed, 14 Feb 2024 06:52:53 -0800
> > Dave Hansen <dave.hansen@intel.com> wrote:
> >   
> >> On 2/14/24 03:35, Petr Tesarik wrote:  
> >>> This patch series implements x86_64 arch hooks for the generic SandBox
> >>> Mode infrastructure.  
> >>
> >> I think I'm missing a bit of context here.  What does one _do_ with
> >> SandBox Mode?  Why is it useful?  
> > 
> > I see, I split the patch series into the base infrastructure and the
> > x86_64 implementation, but I forgot to merge the two recipient lists.
> > :-(
> > 
> > Anyway, in the long term I would like to work on gradual decomposition
> > of the kernel into a core part and many self-contained components.
> > Sandbox mode is a useful tool to enforce isolation.
> > 
> > In its current form, sandbox mode is too limited for that, but I'm
> > trying to find some balance between "publish early" and reaching a
> > feature level where some concrete examples can be shown. I'd rather
> > fail fast than maintain hundreds of patches in an out-of-tree branch
> > before submitting (and failing anyway).
> > 
> > Petr T
> >   
> 
> What you're proposing sounds like a gigantic thing, which could potentially
> impact all subsystems.

True. Luckily, sandbox mode allows me to move gradually, one component
at a time.

>  Unless you prove it has big advantages with real
> world usages, I guess nobody even wants to look into the patches.
> 
> BTW, this seems another attempt to get the idea of micro-kernel into
> Linux.

We know it's not feasible to convert Linux to a micro-kernel. AFAICS
that would require some kind of big switch, affecting all subsystems at
once.

But with a growing code base and more or less constant bug-per-LOC rate,
people will continue to come up with some ideas how to limit the
potential impact of each bug. Logically, one of the concepts that come
to mind is decomposition.

If my attempt helps to clarify how such decomposition should be done to
be acceptable, it is worthwhile. If nothing else, I can summarize the
situation and ask Jonathan if he would kindly accept it as a LWN
article...

Petr T

H. Peter Anvin Feb. 15, 2024, 8:16 a.m. UTC | #16

On February 14, 2024 10:59:32 PM PST, "Petr Tesařík" <petr@tesarici.cz> wrote:
>On Wed, 14 Feb 2024 10:52:47 -0800
>Xin Li <xin@zytor.com> wrote:
>
>> On 2/14/2024 10:22 AM, Petr Tesařík wrote:
>> > On Wed, 14 Feb 2024 06:52:53 -0800
>> > Dave Hansen <dave.hansen@intel.com> wrote:
>> >   
>> >> On 2/14/24 03:35, Petr Tesarik wrote:  
>> >>> This patch series implements x86_64 arch hooks for the generic SandBox
>> >>> Mode infrastructure.  
>> >>
>> >> I think I'm missing a bit of context here.  What does one _do_ with
>> >> SandBox Mode?  Why is it useful?  
>> > 
>> > I see, I split the patch series into the base infrastructure and the
>> > x86_64 implementation, but I forgot to merge the two recipient lists.
>> > :-(
>> > 
>> > Anyway, in the long term I would like to work on gradual decomposition
>> > of the kernel into a core part and many self-contained components.
>> > Sandbox mode is a useful tool to enforce isolation.
>> > 
>> > In its current form, sandbox mode is too limited for that, but I'm
>> > trying to find some balance between "publish early" and reaching a
>> > feature level where some concrete examples can be shown. I'd rather
>> > fail fast than maintain hundreds of patches in an out-of-tree branch
>> > before submitting (and failing anyway).
>> > 
>> > Petr T
>> >   
>> 
>> What you're proposing sounds like a gigantic thing, which could potentially
>> impact all subsystems.
>
>True. Luckily, sandbox mode allows me to move gradually, one component
>at a time.
>
>>  Unless you prove it has big advantages with real
>> world usages, I guess nobody even wants to look into the patches.
>> 
>> BTW, this seems another attempt to get the idea of micro-kernel into
>> Linux.
>
>We know it's not feasible to convert Linux to a micro-kernel. AFAICS
>that would require some kind of big switch, affecting all subsystems at
>once.
>
>But with a growing code base and more or less constant bug-per-LOC rate,
>people will continue to come up with some ideas how to limit the
>potential impact of each bug. Logically, one of the concepts that come
>to mind is decomposition.
>
>If my attempt helps to clarify how such decomposition should be done to
>be acceptable, it is worthwhile. If nothing else, I can summarize the
>situation and ask Jonathan if he would kindly accept it as a LWN
>article...
>
>Petr T
>

I have been thinking more about this, and I'm more than ever convinced that exposing kernel memory to *any* kind of user space is a really, really bad idea. It is not a door we ever want to open; once that line gets muddled, the attack surface opens up dramatically.

And, in fact, we already have a sandbox mode in the kernel – it is called eBPF.

Petr Tesařík Feb. 15, 2024, 9:30 a.m. UTC | #17

On Thu, 15 Feb 2024 00:16:13 -0800
"H. Peter Anvin" <hpa@zytor.com> wrote:

> On February 14, 2024 10:59:32 PM PST, "Petr Tesařík" <petr@tesarici.cz> wrote:
> >On Wed, 14 Feb 2024 10:52:47 -0800
> >Xin Li <xin@zytor.com> wrote:
> >  
> >> On 2/14/2024 10:22 AM, Petr Tesařík wrote:  
> >> > On Wed, 14 Feb 2024 06:52:53 -0800
> >> > Dave Hansen <dave.hansen@intel.com> wrote:
> >> >     
> >> >> On 2/14/24 03:35, Petr Tesarik wrote:    
> >> >>> This patch series implements x86_64 arch hooks for the generic SandBox
> >> >>> Mode infrastructure.    
> >> >>
> >> >> I think I'm missing a bit of context here.  What does one _do_ with
> >> >> SandBox Mode?  Why is it useful?    
> >> > 
> >> > I see, I split the patch series into the base infrastructure and the
> >> > x86_64 implementation, but I forgot to merge the two recipient lists.
> >> > :-(
> >> > 
> >> > Anyway, in the long term I would like to work on gradual decomposition
> >> > of the kernel into a core part and many self-contained components.
> >> > Sandbox mode is a useful tool to enforce isolation.
> >> > 
> >> > In its current form, sandbox mode is too limited for that, but I'm
> >> > trying to find some balance between "publish early" and reaching a
> >> > feature level where some concrete examples can be shown. I'd rather
> >> > fail fast than maintain hundreds of patches in an out-of-tree branch
> >> > before submitting (and failing anyway).
> >> > 
> >> > Petr T
> >> >     
> >> 
> >> What you're proposing sounds like a gigantic thing, which could potentially
> >> impact all subsystems.  
> >
> >True. Luckily, sandbox mode allows me to move gradually, one component
> >at a time.
> >  
> >>  Unless you prove it has big advantages with real
> >> world usages, I guess nobody even wants to look into the patches.
> >> 
> >> BTW, this seems another attempt to get the idea of micro-kernel into
> >> Linux.  
> >
> >We know it's not feasible to convert Linux to a micro-kernel. AFAICS
> >that would require some kind of big switch, affecting all subsystems at
> >once.
> >
> >But with a growing code base and more or less constant bug-per-LOC rate,
> >people will continue to come up with some ideas how to limit the
> >potential impact of each bug. Logically, one of the concepts that come
> >to mind is decomposition.
> >
> >If my attempt helps to clarify how such decomposition should be done to
> >be acceptable, it is worthwhile. If nothing else, I can summarize the
> >situation and ask Jonathan if he would kindly accept it as an LWN
> >article...
> >
> >Petr T
> >  
> 
> I have been thinking more about this, and I'm more than ever convinced that exposing kernel memory to *any* kind of user space is a really, really bad idea. It is not a door we ever want to open; once that line gets muddled, the attack surface opens up dramatically.

Would you mind elaborating on this a bit more?

For one thing, sandbox mode is *not* user mode. Sure, my proposed
x86-64 implementation runs with the same CPU privilege level as user
mode, but it is isolated from user mode with just as strong mechanisms
as any two user mode processes are isolated from each other. Are you
saying that process isolation in Linux is not all that strong after all?

Don't get me wrong. I'm honestly trying to understand what exactly
makes the idea so bad. I have apparently not considered something that
you have, and I would be glad if you could reveal it.

> And, in fact, we already have a sandbox mode in the kernel – it is called eBPF. 

Sure. The difference is that eBPF is a platform of its own (with its
own consistency model, machine code etc.). Rewriting code for eBPF may
need a bit more effort.

Besides, Roberto wrote a PGP key parser as an eBPF program at some
point, and I believe it was rejected for that reason. So, it seems
there are situations where eBPF is not an alternative.

Roberto, can you remember and share some details?

Petr T

Roberto Sassu Feb. 15, 2024, 9:37 a.m. UTC | #18

On Thu, 2024-02-15 at 10:30 +0100, Petr Tesařík wrote:
> On Thu, 15 Feb 2024 00:16:13 -0800
> "H. Peter Anvin" <hpa@zytor.com> wrote:
> 
> > On February 14, 2024 10:59:32 PM PST, "Petr Tesařík" <petr@tesarici.cz> wrote:
> > > On Wed, 14 Feb 2024 10:52:47 -0800
> > > Xin Li <xin@zytor.com> wrote:
> > >  
> > > > On 2/14/2024 10:22 AM, Petr Tesařík wrote:  
> > > > > On Wed, 14 Feb 2024 06:52:53 -0800
> > > > > Dave Hansen <dave.hansen@intel.com> wrote:
> > > > >     
> > > > > > On 2/14/24 03:35, Petr Tesarik wrote:    
> > > > > > > This patch series implements x86_64 arch hooks for the generic SandBox
> > > > > > > Mode infrastructure.    
> > > > > > 
> > > > > > I think I'm missing a bit of context here.  What does one _do_ with
> > > > > > SandBox Mode?  Why is it useful?    
> > > > > 
> > > > > I see, I split the patch series into the base infrastructure and the
> > > > > x86_64 implementation, but I forgot to merge the two recipient lists.
> > > > > :-(
> > > > > 
> > > > > Anyway, in the long term I would like to work on gradual decomposition
> > > > > of the kernel into a core part and many self-contained components.
> > > > > Sandbox mode is a useful tool to enforce isolation.
> > > > > 
> > > > > In its current form, sandbox mode is too limited for that, but I'm
> > > > > trying to find some balance between "publish early" and reaching a
> > > > > feature level where some concrete examples can be shown. I'd rather
> > > > > fail fast than maintain hundreds of patches in an out-of-tree branch
> > > > > before submitting (and failing anyway).
> > > > > 
> > > > > Petr T
> > > > >     
> > > > 
> > > > What you're proposing sounds like a gigantic thing, which could potentially
> > > > impact all subsystems.  
> > > 
> > > True. Luckily, sandbox mode allows me to move gradually, one component
> > > at a time.
> > >  
> > > >  Unless you prove it has big advantages with real
> > > > world usages, I guess nobody even wants to look into the patches.
> > > > 
> > > > BTW, this seems to be another attempt to get the idea of a micro-kernel into
> > > > Linux.  
> > > 
> > > We know it's not feasible to convert Linux to a micro-kernel. AFAICS
> > > that would require some kind of big switch, affecting all subsystems at
> > > once.
> > > 
> > > But with a growing code base and more or less constant bug-per-LOC rate,
> > > people will continue to come up with some ideas how to limit the
> > > potential impact of each bug. Logically, one of the concepts that come
> > > to mind is decomposition.
> > > 
> > > If my attempt helps to clarify how such decomposition should be done to
> > > > be acceptable, it is worthwhile. If nothing else, I can summarize the
> > > > situation and ask Jonathan if he would kindly accept it as an LWN
> > > article...
> > > 
> > > Petr T
> > >  
> > 
> > I have been thinking more about this, and I'm more than ever convinced that exposing kernel memory to *any* kind of user space is a really, really bad idea. It is not a door we ever want to open; once that line gets muddled, the attack surface opens up dramatically.
> 
> Would you mind elaborating on this a bit more?
> 
> For one thing, sandbox mode is *not* user mode. Sure, my proposed
> x86-64 implementation runs with the same CPU privilege level as user
> mode, but it is isolated from user mode with just as strong mechanisms
> as any two user mode processes are isolated from each other. Are you
> saying that process isolation in Linux is not all that strong after all?
> 
> Don't get me wrong. I'm honestly trying to understand what exactly
> makes the idea so bad. I have apparently not considered something that
> you have, and I would be glad if you could reveal it.
> 
> > And, in fact, we already have a sandbox mode in the kernel – it is called eBPF. 
> 
> Sure. The difference is that eBPF is a platform of its own (with its
> own consistency model, machine code etc.). Rewriting code for eBPF may
> need a bit more effort.
> 
> Besides, Roberto wrote a PGP key parser as an eBPF program at some
> point, and I believe it was rejected for that reason. So, it seems
> there are situations where eBPF is not an alternative.
> 
> Roberto, can you remember and share some details?

eBPF programs are not signed.

And I struggled to have some security bugs fixed, so I gave up.

Roberto