Message ID | 20240214113516.2307-1-petrtesarik@huaweicloud.com |
---|---|
Headers |
From: Petr Tesarik <petrtesarik@huaweicloud.com>
To: Jonathan Corbet <corbet@lwn.net>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, x86@kernel.org (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)), "H. Peter Anvin" <hpa@zytor.com>, Andy Lutomirski <luto@kernel.org>, Oleg Nesterov <oleg@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Xin Li <xin3.li@intel.com>, Arnd Bergmann <arnd@arndb.de>, Andrew Morton <akpm@linux-foundation.org>, Rick Edgecombe <rick.p.edgecombe@intel.com>, Kees Cook <keescook@chromium.org>, "Masami Hiramatsu (Google)" <mhiramat@kernel.org>, Pengfei Xu <pengfei.xu@intel.com>, Josh Poimboeuf <jpoimboe@kernel.org>, Ze Gao <zegao2021@gmail.com>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Kai Huang <kai.huang@intel.com>, David Woodhouse <dwmw@amazon.co.uk>, Brian Gerst <brgerst@gmail.com>, Jason Gunthorpe <jgg@ziepe.ca>, Joerg Roedel <jroedel@suse.de>, "Mike Rapoport (IBM)" <rppt@kernel.org>, Tina Zhang <tina.zhang@intel.com>, Jacob Pan <jacob.jun.pan@linux.intel.com>, linux-doc@vger.kernel.org (open list:DOCUMENTATION), linux-kernel@vger.kernel.org (open list)
Cc: Roberto Sassu <roberto.sassu@huaweicloud.com>, petr@tesarici.cz, Petr Tesarik <petr.tesarik1@huawei-partners.com>
Subject: [PATCH v1 0/8] x86_64 SandBox Mode arch hooks
Date: Wed, 14 Feb 2024 12:35:08 +0100
Message-Id: <20240214113516.2307-1-petrtesarik@huaweicloud.com>
X-Mailer: git-send-email 2.34.1
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit |
Series |
x86_64 SandBox Mode arch hooks
|
|
Message
Petr Tesarik
Feb. 14, 2024, 11:35 a.m. UTC
From: Petr Tesarik <petr.tesarik1@huawei-partners.com>
This patch series implements x86_64 arch hooks for the generic SandBox
Mode infrastructure.
SandBox Mode on x86_64 is implemented as follows:
* The target function runs with CPL 3 (same as user mode) within its
own virtual address space.
* Interrupt entry/exit paths are modified to let the interrupt handlers
always run with kernel CR3 and restore sandbox CR3 when returning to
sandbox mode.
* To avoid undesirable user mode processing (FPU state, signals, etc.),
the value of pt_regs->cs is temporarily adjusted so that the interrupted
context appears to come from kernel mode.
* On a CPU fault, execution stops immediately, returning -EFAULT to
the caller.
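
The pt_regs->cs adjustment described in the list above can be pictured with a small user-space model. This is only a sketch under stated assumptions, not code from the series: `struct regs_model` and the three helper functions are hypothetical names, and the selector values are the usual x86-64 Linux ones (0x10 for kernel code, 0x33 for user code); which selector the sandbox actually runs with is not stated in this cover letter. The check mirrors the common convention of testing the low two privilege bits of the saved CS.

```c
/* User-space model only; none of these names come from the patch series. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_KERNEL_CS 0x10u /* typical x86-64 Linux kernel code selector, RPL 0 */
#define MODEL_USER_CS   0x33u /* typical x86-64 Linux user code selector, RPL 3 */
#define MODEL_RPL_MASK  0x3u  /* low two bits of a selector: requested privilege level */

/* Hypothetical stand-in for the saved register frame (only cs is modelled). */
struct regs_model {
	uint64_t cs;
};

/* True if the saved frame looks like it was interrupted outside ring 0. */
static bool looks_like_user_mode(const struct regs_model *regs)
{
	assert(regs != NULL);
	return (regs->cs & MODEL_RPL_MASK) != 0;
}

/* Swap in a kernel code selector and hand back the original value. */
static uint64_t mask_sandbox_cs(struct regs_model *regs)
{
	uint64_t saved;

	assert(regs != NULL);
	saved = regs->cs;
	regs->cs = MODEL_KERNEL_CS;
	return saved;
}

/* Put the original selector back before returning to the sandbox. */
static void restore_sandbox_cs(struct regs_model *regs, uint64_t saved)
{
	assert(regs != NULL);
	regs->cs = saved;
}
```

In this model, any code that decides between "came from user mode" and "came from kernel mode" by looking at the saved CS takes the kernel path while the selector is masked, which is the effect the cover letter describes.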
Petr Tesarik (8):
sbm: x86: page table arch hooks
sbm: x86: execute target function on sandbox mode stack
sbm: x86: map system data structures into the sandbox
sbm: x86: allocate and map an exception stack
sbm: x86: handle sandbox mode faults
sbm: x86: switch to sandbox mode pages in arch_sbm_exec()
sbm: documentation of the x86-64 SandBox Mode implementation
sbm: x86: lazy TLB flushing
Documentation/security/sandbox-mode.rst | 25 ++
arch/x86/Kconfig | 1 +
arch/x86/entry/entry_64.S | 123 ++++++
arch/x86/include/asm/page_64_types.h | 1 +
arch/x86/include/asm/ptrace.h | 21 +
arch/x86/include/asm/sbm.h | 83 ++++
arch/x86/include/asm/segment.h | 7 +
arch/x86/include/asm/thread_info.h | 3 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/asm-offsets.c | 10 +
arch/x86/kernel/sbm/Makefile | 16 +
arch/x86/kernel/sbm/call_64.S | 95 +++++
arch/x86/kernel/sbm/core.c | 499 ++++++++++++++++++++++++
arch/x86/kernel/traps.c | 14 +-
arch/x86/mm/fault.c | 6 +
15 files changed, 905 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/asm/sbm.h
create mode 100644 arch/x86/kernel/sbm/Makefile
create mode 100644 arch/x86/kernel/sbm/call_64.S
create mode 100644 arch/x86/kernel/sbm/core.c
Comments
On 2/14/24 03:35, Petr Tesarik wrote:
> This patch series implements x86_64 arch hooks for the generic SandBox
> Mode infrastructure.

I think I'm missing a bit of context here. What does one _do_ with
SandBox Mode? Why is it useful?

On February 14, 2024 6:52:53 AM PST, Dave Hansen <dave.hansen@intel.com> wrote:
>On 2/14/24 03:35, Petr Tesarik wrote:
>> This patch series implements x86_64 arch hooks for the generic SandBox
>> Mode infrastructure.
>
>I think I'm missing a bit of context here. What does one _do_ with
>SandBox Mode? Why is it useful?

Seriously. On the surface it looks like a really bad idea – basically an ad hoc, *more* privileged version of user space.

On Wed, 14 Feb 2024 07:28:35 -0800
"H. Peter Anvin" <hpa@zytor.com> wrote:

> On February 14, 2024 6:52:53 AM PST, Dave Hansen <dave.hansen@intel.com> wrote:
> >On 2/14/24 03:35, Petr Tesarik wrote:
> >> This patch series implements x86_64 arch hooks for the generic SandBox
> >> Mode infrastructure.
> >
> >I think I'm missing a bit of context here. What does one _do_ with
> >SandBox Mode? Why is it useful?
>
> Seriously. On the surface it looks like a really bad idea – basically an ad hoc, *more* privileged version of user space.

Hi hpa,

I agree that it kind of tries to do "user mode without user mode".
There are some differences from actual user mode:

First, from a process management POV, sandbox mode appears to be
running in kernel mode. So, there is no way to use ptrace(2), send
malicious signals or otherwise interact with the sandbox. In fact,
the process can have three independent contexts: user mode, kernel mode
and sandbox mode.

Second, a sandbox can run unmodified kernel code and interact directly
with other parts of the kernel. It's not really possible with this
initial patch series, but the plan is that sandbox mode can share locks
with the kernel.

Third, sandbox code can be trusted for operations like parsing keys for
the trusted keychain if the kernel is locked down, i.e. when even a
process with UID 0 is not on the same trust level as kernel mode.

HTH
Petr T

On February 14, 2024 8:41:43 AM PST, "Petr Tesařík" <petr@tesarici.cz> wrote:
>On Wed, 14 Feb 2024 07:28:35 -0800
>"H. Peter Anvin" <hpa@zytor.com> wrote:
>
>> On February 14, 2024 6:52:53 AM PST, Dave Hansen <dave.hansen@intel.com> wrote:
>> >On 2/14/24 03:35, Petr Tesarik wrote:
>> >> This patch series implements x86_64 arch hooks for the generic SandBox
>> >> Mode infrastructure.
>> >
>> >I think I'm missing a bit of context here. What does one _do_ with
>> >SandBox Mode? Why is it useful?
>>
>> Seriously. On the surface it looks like a really bad idea – basically an ad hoc, *more* privileged version of user space.
>
>Hi hpa,
>
>I agree that it kind of tries to do "user mode without user mode".
>There are some differences from actual user mode:
>
>First, from a process management POV, sandbox mode appears to be
>running in kernel mode. So, there is no way to use ptrace(2), send
>malicious signals or otherwise interact with the sandbox. In fact,
>the process can have three independent contexts: user mode, kernel mode
>and sandbox mode.
>
>Second, a sandbox can run unmodified kernel code and interact directly
>with other parts of the kernel. It's not really possible with this
>initial patch series, but the plan is that sandbox mode can share locks
>with the kernel.
>
>Third, sandbox code can be trusted for operations like parsing keys for
>the trusted keychain if the kernel is locked down, i.e. when even a
>process with UID 0 is not on the same trust level as kernel mode.
>
>HTH
>Petr T
>

This, to me, seems like "all the downsides of a microkernel without the upsides." Furthermore, it breaks security-hardening features like LASS and (to a lesser degree) SMAP. Not to mention dropping global pages?

All in all, I cannot see this as anything other than an enormous step in the wrong direction, and it isn't even in the sense of "it is harmless if no one uses it" – you are introducing architectural changes that are most definitely *very* harmful both to maintainers and users.

To me, this feels like paravirtualization all over again. 20 years later we still have not been able to undo all the damage that did.

On Wed, 2024-02-14 at 17:41 +0100, Petr Tesařík wrote:
> Second, a sandbox can run unmodified kernel code and interact directly
> with other parts of the kernel. It's not really possible with this
> initial patch series, but the plan is that sandbox mode can share locks
> with the kernel.
>
> Third, sandbox code can be trusted for operations like parsing keys for
> the trusted keychain if the kernel is locked down, i.e. when even a
> process with UID 0 is not on the same trust level as kernel mode.

What use case needs to have the sandbox both protected from the kernel
(trusted operations) and non-privileged (the kernel protected from it
via CPL3)? It seems like opposite things.

On Wed, 14 Feb 2024 06:52:53 -0800
Dave Hansen <dave.hansen@intel.com> wrote:

> On 2/14/24 03:35, Petr Tesarik wrote:
> > This patch series implements x86_64 arch hooks for the generic SandBox
> > Mode infrastructure.
>
> I think I'm missing a bit of context here. What does one _do_ with
> SandBox Mode? Why is it useful?

I see, I split the patch series into the base infrastructure and the
x86_64 implementation, but I forgot to merge the two recipient lists.
:-(

Anyway, in the long term I would like to work on gradual decomposition
of the kernel into a core part and many self-contained components.
Sandbox mode is a useful tool to enforce isolation.

In its current form, sandbox mode is too limited for that, but I'm
trying to find some balance between "publish early" and reaching a
feature level where some concrete examples can be shown. I'd rather
fail fast than maintain hundreds of patches in an out-of-tree branch
before submitting (and failing anyway).

Petr T

(+Cc Kees)

On Wed, 14 Feb 2024 18:14:49 +0000
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Wed, 2024-02-14 at 17:41 +0100, Petr Tesařík wrote:
> > Second, a sandbox can run unmodified kernel code and interact directly
> > with other parts of the kernel. It's not really possible with this
> > initial patch series, but the plan is that sandbox mode can share locks
> > with the kernel.
> >
> > Third, sandbox code can be trusted for operations like parsing keys for
> > the trusted keychain if the kernel is locked down, i.e. when even a
> > process with UID 0 is not on the same trust level as kernel mode.
>
> What use case needs to have the sandbox both protected from the kernel
> (trusted operations) and non-privileged (the kernel protected from it
> via CPL3)? It seems like opposite things.

I think I have mentioned one: parsing keys for the trusted keyring. The
parser is complex enough to be potentially buggy, but the security
folks have already dismissed the idea to run it as a user mode helper.

Petr T

On 2/14/24 10:22, Petr Tesařík wrote:
> Anyway, in the long term I would like to work on gradual decomposition
> of the kernel into a core part and many self-contained components.
> Sandbox mode is a useful tool to enforce isolation.

I'd want to see at least a few examples of how this decomposition would
work and how much of a burden it is on each site that deployed it.

But I'm skeptical that this could ever work. Ring-0 execution really is
special and it's _increasingly_ so. Think of LASS or SMAP or SMEP.
We're even seeing hardware designers add hardware security defenses to
ring-0 that are not applied to ring-3.

In other words, ring-3 isn't just a deprivileged ring-0, it's more
exposed to attacks.

> I'd rather fail fast than maintain hundreds of patches in an
> out-of-tree branch before submitting (and failing anyway).

I don't see any remotely feasible path forward for this approach.

On 2/14/2024 10:22 AM, Petr Tesařík wrote:
> On Wed, 14 Feb 2024 06:52:53 -0800
> Dave Hansen <dave.hansen@intel.com> wrote:
>
>> On 2/14/24 03:35, Petr Tesarik wrote:
>>> This patch series implements x86_64 arch hooks for the generic SandBox
>>> Mode infrastructure.
>>
>> I think I'm missing a bit of context here. What does one _do_ with
>> SandBox Mode? Why is it useful?
>
> I see, I split the patch series into the base infrastructure and the
> x86_64 implementation, but I forgot to merge the two recipient lists.
> :-(
>
> Anyway, in the long term I would like to work on gradual decomposition
> of the kernel into a core part and many self-contained components.
> Sandbox mode is a useful tool to enforce isolation.
>
> In its current form, sandbox mode is too limited for that, but I'm
> trying to find some balance between "publish early" and reaching a
> feature level where some concrete examples can be shown. I'd rather
> fail fast than maintain hundreds of patches in an out-of-tree branch
> before submitting (and failing anyway).
>
> Petr T
>

What you're proposing sounds like a gigantic thing, which could potentially
impact all subsystems. Unless you prove it has big advantages with real
world usages, I guess nobody even wants to look into the patches.

BTW, this seems like another attempt to get the idea of micro-kernel into
Linux.

On Wed, 14 Feb 2024 09:29:06 -0800
"H. Peter Anvin" <hpa@zytor.com> wrote:

> On February 14, 2024 8:41:43 AM PST, "Petr Tesařík" <petr@tesarici.cz> wrote:
> >On Wed, 14 Feb 2024 07:28:35 -0800
> >"H. Peter Anvin" <hpa@zytor.com> wrote:
> >
> >> On February 14, 2024 6:52:53 AM PST, Dave Hansen <dave.hansen@intel.com> wrote:
> >> >On 2/14/24 03:35, Petr Tesarik wrote:
> >> >> This patch series implements x86_64 arch hooks for the generic SandBox
> >> >> Mode infrastructure.
> >> >
> >> >I think I'm missing a bit of context here. What does one _do_ with
> >> >SandBox Mode? Why is it useful?
> >>
> >> Seriously. On the surface it looks like a really bad idea – basically an ad hoc, *more* privileged version of user space.
> >
> >Hi hpa,
> >
> >I agree that it kind of tries to do "user mode without user mode".
> >There are some differences from actual user mode:
> >
> >First, from a process management POV, sandbox mode appears to be
> >running in kernel mode. So, there is no way to use ptrace(2), send
> >malicious signals or otherwise interact with the sandbox. In fact,
> >the process can have three independent contexts: user mode, kernel mode
> >and sandbox mode.
> >
> >Second, a sandbox can run unmodified kernel code and interact directly
> >with other parts of the kernel. It's not really possible with this
> >initial patch series, but the plan is that sandbox mode can share locks
> >with the kernel.
> >
> >Third, sandbox code can be trusted for operations like parsing keys for
> >the trusted keychain if the kernel is locked down, i.e. when even a
> >process with UID 0 is not on the same trust level as kernel mode.
> >
> >HTH
> >Petr T
> >
>
> This, to me, seems like "all the downsides of a microkernel without the upsides." Furthermore, it breaks security-hardening features like LASS and (to a lesser degree) SMAP. Not to mention dropping global pages?

I must be missing something... But I am always open to learn something
new.

I don't see how it breaks SMAP. Sandbox mode runs in its own address
space which does not contain any user-mode pages. While running in
sandbox mode, user pages belong to the sandboxed code, kernel pages are
used to enter/exit kernel mode. Bottom half of the PGD is empty, all
user page translations are removed from TLB.

For a similar reason, I don't see right now how it breaks linear
address space separation. Even if it did, I believe I can take care of
it in the entry/exit path. Anyway, which branch contains the LASS
patches now, so I can test?

As for dropping global pages, that's only part of the story. Indeed,
patch 6/8 of the series sets CR4.PGE to zero to have a known-good
working state, but that code is removed again by patch 8/8. I wanted to
implement lazy TLB flushing separately, so it can be easily reverted if
it is suspected to cause an issue.

Plus, each sandbox mode can use PCID to reduce TLB flushing even more.
I haven't done it, because it would be a waste of time if the whole
concept is scratched.

I believe that only those global pages which are actually accessed by
the sandbox need to be flushed. Yes, some parts of the necessary logic
are missing in the current patch series. I can add them in a v2 series
if you wish.

> All in all, I cannot see this as anything other than an enormous step in the wrong direction, and it isn't even in the sense of "it is harmless if no one uses it" – you are introducing architectural changes that are most definitely *very* harmful both to maintainers and users.

I agree that it adds some burden. After all, that's why the ultimate
decision is up to you, the maintainers. To defend my cause, I hope you
have noticed that if CONFIG_SANDBOX_MODE is not set:

1. literally nothing changes in entry_64.
2. sandbox_mode() always evaluates to false, so the added conditionals
   in fault.c and traps.c are never executed
3. top_of_instr_stack() always returns current_top_of_stack(), which is
   equivalent to the code it replaces, namely
   this_cpu_read(pcpu_hot.top_of_stack)

So, all the interesting stuff is under arch/x86/kernel/sbm/. Shall I
add a corresponding entry with my name to MAINTAINERS?

> To me, this feels like paravirtualization all over again. 20 years later we still have not been able to undo all the damage that did.

OK, I can follow you here. Indeed, there is some similarity with Xen PV
(running kernel code with CPL 3), but I don't think there's more than
this.

Petr T

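
The three points about a kernel built without CONFIG_SANDBOX_MODE describe the familiar pattern of compiling a feature down to inline stubs. A minimal user-space sketch of that pattern follows; it is not taken from the series. The function names `sandbox_mode()`, `top_of_instr_stack()` and `current_top_of_stack()` appear in the message above, but their signatures here are assumptions, and `struct pt_regs_model`, `model_top_of_stack` and `handle_fault_model()` are hypothetical stand-ins made up for illustration.

```c
/* User-space model of config-dependent stubs; signatures are assumptions. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* CONFIG_SANDBOX_MODE is deliberately left undefined to model "option off". */

/* Hypothetical stand-in for the saved register frame. */
struct pt_regs_model {
	unsigned long cs;
};

/* Stand-in for the per-CPU value read by this_cpu_read(pcpu_hot.top_of_stack). */
static unsigned long model_top_of_stack = 0x1000UL;

static inline unsigned long current_top_of_stack(void)
{
	return model_top_of_stack;
}

#ifdef CONFIG_SANDBOX_MODE
/* With the option on, real implementations would live elsewhere. */
bool sandbox_mode(const struct pt_regs_model *regs);
unsigned long top_of_instr_stack(void);
#else
/* With the option off, the predicate is constant false... */
static inline bool sandbox_mode(const struct pt_regs_model *regs)
{
	(void)regs;
	return false;
}

/* ...and the stack helper collapses to the code it replaces. */
static inline unsigned long top_of_instr_stack(void)
{
	return current_top_of_stack();
}
#endif

/* Caller shaped like an added conditional in a fault or trap handler. */
static int handle_fault_model(const struct pt_regs_model *regs)
{
	assert(regs != NULL);
	if (sandbox_mode(regs))
		return -1; /* sandbox-specific path; unreachable with the option off */
	return 0;          /* regular kernel path */
}
```

Because the stub is a constant, a compiler can typically drop the sandbox branch in callers entirely, which matches the claim that the added conditionals are never executed when the option is off.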
On Wed, 2024-02-14 at 19:32 +0100, Petr Tesařík wrote:
> > What use case needs to have the sandbox both protected from the kernel
> > (trusted operations) and non-privileged (the kernel protected from it
> > via CPL3)? It seems like opposite things.
>
> I think I have mentioned one: parsing keys for the trusted keyring. The
> parser is complex enough to be potentially buggy, but the security
> folks have already dismissed the idea to run it as a user mode helper.

Ah, I didn't realize the kernel needed to be protected from the key
parsing part because you called it out as a trusted operation. So on
the protect-the-kernel-side it's similar to the microkernel security
reasoning.

Did I get the other part wrong - that you want to protect the sandbox
from the rest of kernel as well?

On Wed, 14 Feb 2024 10:42:57 -0800
Dave Hansen <dave.hansen@intel.com> wrote:

> On 2/14/24 10:22, Petr Tesařík wrote:
> > Anyway, in the long term I would like to work on gradual decomposition
> > of the kernel into a core part and many self-contained components.
> > Sandbox mode is a useful tool to enforce isolation.
>
> I'd want to see at least a few examples of how this decomposition would
> work and how much of a burden it is on each site that deployed it.

Got it. Are you okay with a couple of examples to illustrate the
concept? Because if you want patches that have been acked by the
respective maintainers, it somehow becomes a chicken-and-egg kind of
problem...

> But I'm skeptical that this could ever work. Ring-0 execution really is
> special and it's _increasingly_ so. Think of LASS or SMAP or SMEP.

I have just answered a similar concern by hpa. In short, I don't think
these features are relevant, because by definition sandbox mode does
not share anything with user mode address space.

> We're even seeing hardware designers add hardware security defenses to
> ring-0 that are not applied to ring-3.
>
> In other words, ring-3 isn't just a deprivileged ring-0, it's more
> exposed to attacks.
>
> > I'd rather fail fast than maintain hundreds of patches in an
> > out-of-tree branch before submitting (and failing anyway).
>
> I don't see any remotely feasible path forward for this approach.

I can live with such a decision. But first, I want to make sure that the
concept has been understood correctly. So far, at least some concerns
suggest an understanding that is not quite accurate.

Is this sandbox idea a bit too much out-of-the-box?

Petr T

On Wed, 14 Feb 2024 19:19:27 +0000
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Wed, 2024-02-14 at 19:32 +0100, Petr Tesařík wrote:
> > > What use case needs to have the sandbox both protected from the kernel
> > > (trusted operations) and non-privileged (the kernel protected from it
> > > via CPL3)? It seems like opposite things.
> >
> > I think I have mentioned one: parsing keys for the trusted keyring. The
> > parser is complex enough to be potentially buggy, but the security
> > folks have already dismissed the idea to run it as a user mode helper.
>
> Ah, I didn't realize the kernel needed to be protected from the key
> parsing part because you called it out as a trusted operation. So on
> the protect-the-kernel-side it's similar to the microkernel security
> reasoning.
>
> Did I get the other part wrong - that you want to protect the sandbox
> from the rest of kernel as well?

Protecting the sandbox from the rest of the kernel is out of scope.
However, different sandboxes should be protected from each other.

Petr T

On 2/14/24 11:33, Petr Tesařík wrote:
>> I'd want to see at least a few examples of how this decomposition would
>> work and how much of a burden it is on each site that deployed it.
> Got it. Are you okay with a couple of examples to illustrate the
> concept? Because if you want patches that have been acked by the
> respective maintainers, it somehow becomes a chicken-and-egg kind of
> problem...

I'd be happy to look at a patch or two that demonstrate the concept,
just to make sure I'm not missing something. But I'm still quite
skeptical.

On Wed, 14 Feb 2024 10:52:47 -0800
Xin Li <xin@zytor.com> wrote:

> On 2/14/2024 10:22 AM, Petr Tesařík wrote:
> > On Wed, 14 Feb 2024 06:52:53 -0800
> > Dave Hansen <dave.hansen@intel.com> wrote:
> >
> >> On 2/14/24 03:35, Petr Tesarik wrote:
> >>> This patch series implements x86_64 arch hooks for the generic SandBox
> >>> Mode infrastructure.
> >>
> >> I think I'm missing a bit of context here. What does one _do_ with
> >> SandBox Mode? Why is it useful?
> >
> > I see, I split the patch series into the base infrastructure and the
> > x86_64 implementation, but I forgot to merge the two recipient lists.
> > :-(
> >
> > Anyway, in the long term I would like to work on gradual decomposition
> > of the kernel into a core part and many self-contained components.
> > Sandbox mode is a useful tool to enforce isolation.
> >
> > In its current form, sandbox mode is too limited for that, but I'm
> > trying to find some balance between "publish early" and reaching a
> > feature level where some concrete examples can be shown. I'd rather
> > fail fast than maintain hundreds of patches in an out-of-tree branch
> > before submitting (and failing anyway).
> >
> > Petr T
> >
>
> What you're proposing sounds like a gigantic thing, which could potentially
> impact all subsystems.

True. Luckily, sandbox mode allows me to move gradually, one component
at a time.

> Unless you prove it has big advantages with real
> world usages, I guess nobody even wants to look into the patches.
>
> BTW, this seems like another attempt to get the idea of micro-kernel into
> Linux.

We know it's not feasible to convert Linux to a micro-kernel. AFAICS
that would require some kind of big switch, affecting all subsystems at
once.

But with a growing code base and more or less constant bug-per-LOC rate,
people will continue to come up with some ideas how to limit the
potential impact of each bug. Logically, one of the concepts that come
to mind is decomposition.

If my attempt helps to clarify how such decomposition should be done to
be acceptable, it is worthwhile. If nothing else, I can summarize the
situation and ask Jonathan if he would kindly accept it as an LWN
article...

Petr T

On February 14, 2024 10:59:32 PM PST, "Petr Tesařík" <petr@tesarici.cz> wrote:
>On Wed, 14 Feb 2024 10:52:47 -0800
>Xin Li <xin@zytor.com> wrote:
>
>> On 2/14/2024 10:22 AM, Petr Tesařík wrote:
>> > On Wed, 14 Feb 2024 06:52:53 -0800
>> > Dave Hansen <dave.hansen@intel.com> wrote:
>> >
>> >> On 2/14/24 03:35, Petr Tesarik wrote:
>> >>> This patch series implements x86_64 arch hooks for the generic SandBox
>> >>> Mode infrastructure.
>> >>
>> >> I think I'm missing a bit of context here. What does one _do_ with
>> >> SandBox Mode? Why is it useful?
>> >
>> > I see, I split the patch series into the base infrastructure and the
>> > x86_64 implementation, but I forgot to merge the two recipient lists.
>> > :-(
>> >
>> > Anyway, in the long term I would like to work on gradual decomposition
>> > of the kernel into a core part and many self-contained components.
>> > Sandbox mode is a useful tool to enforce isolation.
>> >
>> > In its current form, sandbox mode is too limited for that, but I'm
>> > trying to find some balance between "publish early" and reaching a
>> > feature level where some concrete examples can be shown. I'd rather
>> > fail fast than maintain hundreds of patches in an out-of-tree branch
>> > before submitting (and failing anyway).
>> >
>> > Petr T
>> >
>>
>> What you're proposing sounds like a gigantic thing, which could potentially
>> impact all subsystems.
>
>True. Luckily, sandbox mode allows me to move gradually, one component
>at a time.
>
>> Unless you prove it has big advantages with real
>> world usages, I guess nobody even wants to look into the patches.
>>
>> BTW, this seems like another attempt to get the idea of micro-kernel into
>> Linux.
>
>We know it's not feasible to convert Linux to a micro-kernel. AFAICS
>that would require some kind of big switch, affecting all subsystems at
>once.
>
>But with a growing code base and more or less constant bug-per-LOC rate,
>people will continue to come up with some ideas how to limit the
>potential impact of each bug. Logically, one of the concepts that come
>to mind is decomposition.
>
>If my attempt helps to clarify how such decomposition should be done to
>be acceptable, it is worthwhile. If nothing else, I can summarize the
>situation and ask Jonathan if he would kindly accept it as an LWN
>article...
>
>Petr T
>

I have been thinking more about this, and I'm more than ever convinced that exposing kernel memory to *any* kind of user space is a really, really bad idea. It is not a door we ever want to open; once that line gets muddled, the attack surface opens up dramatically.

And, in fact, we already have a sandbox mode in the kernel – it is called eBPF.

On Thu, 15 Feb 2024 00:16:13 -0800
"H. Peter Anvin" <hpa@zytor.com> wrote:

> On February 14, 2024 10:59:32 PM PST, "Petr Tesařík" <petr@tesarici.cz> wrote:
> >On Wed, 14 Feb 2024 10:52:47 -0800
> >Xin Li <xin@zytor.com> wrote:
> >
> >> On 2/14/2024 10:22 AM, Petr Tesařík wrote:
> >> > On Wed, 14 Feb 2024 06:52:53 -0800
> >> > Dave Hansen <dave.hansen@intel.com> wrote:
> >> >
> >> >> On 2/14/24 03:35, Petr Tesarik wrote:
> >> >>> This patch series implements x86_64 arch hooks for the generic SandBox
> >> >>> Mode infrastructure.
> >> >>
> >> >> I think I'm missing a bit of context here. What does one _do_ with
> >> >> SandBox Mode? Why is it useful?
> >> >
> >> > I see, I split the patch series into the base infrastructure and the
> >> > x86_64 implementation, but I forgot to merge the two recipient lists.
> >> > :-(
> >> >
> >> > Anyway, in the long term I would like to work on gradual decomposition
> >> > of the kernel into a core part and many self-contained components.
> >> > Sandbox mode is a useful tool to enforce isolation.
> >> >
> >> > In its current form, sandbox mode is too limited for that, but I'm
> >> > trying to find some balance between "publish early" and reaching a
> >> > feature level where some concrete examples can be shown. I'd rather
> >> > fail fast than maintain hundreds of patches in an out-of-tree branch
> >> > before submitting (and failing anyway).
> >> >
> >> > Petr T
> >> >
> >>
> >> What you're proposing sounds a gigantic thing, which could potentially
> >> impact all subsystems.
> >
> >True. Luckily, sandbox mode allows me to move gradually, one component
> >at a time.
> >
> >> Unless you prove it has big advantages with real
> >> world usages, I guess nobody even wants to look into the patches.
> >>
> >> BTW, this seems another attempt to get the idea of micro-kernel into
> >> Linux.
> >
> >We know it's not feasible to convert Linux to a micro-kernel. AFAICS
> >that would require some kind of big switch, affecting all subsystems at
> >once.
> >
> >But with a growing code base and more or less constant bug-per-LOC rate,
> >people will continue to come up with some ideas how to limit the
> >potential impact of each bug. Logically, one of the concepts that come
> >to mind is decomposition.
> >
> >If my attempt helps to clarify how such decomposition should be done to
> >be acceptable, it is worthwhile. If nothing else, I can summarize the
> >situation and ask Jonathan if he would kindly accept it as a LWN
> >article...
> >
> >Petr T
> >
>
> I have been thinking more about this, and I'm more than ever convinced that exposing kernel memory to *any* kind of user space is a really, really bad idea. It is not a door we ever want to open; once that line gets muddled, the attack surface opens up dramatically.

Would you mind elaborating on this a bit more?

For one thing, sandbox mode is *not* user mode. Sure, my proposed
x86-64 implementation runs with the same CPU privilege level as user
mode, but it is isolated from user mode with just as strong mechanisms
as any two user mode processes are isolated from each other. Are you
saying that process isolation in Linux is not all that strong after all?

Don't get me wrong. I'm honestly trying to understand what exactly
makes the idea so bad. I have apparently not considered something that
you have, and I would be glad if you could reveal it.

> And, in fact, we already have a sandbox mode in the kernel – it is called eBPF.

Sure. The difference is that eBPF is a platform of its own (with its
own consistency model, machine code etc.). Rewriting code for eBPF may
need a bit more effort.

Besides, Roberto wrote a PGP key parser as an eBPF program at some
point, and I believe it was rejected for that reason. So, it seems
there are situations where eBPF is not an alternative.

Roberto, can you remember and share some details?

Petr T
On Thu, 2024-02-15 at 10:30 +0100, Petr Tesařík wrote:
> On Thu, 15 Feb 2024 00:16:13 -0800
> "H. Peter Anvin" <hpa@zytor.com> wrote:
>
> > On February 14, 2024 10:59:32 PM PST, "Petr Tesařík" <petr@tesarici.cz> wrote:
> > > On Wed, 14 Feb 2024 10:52:47 -0800
> > > Xin Li <xin@zytor.com> wrote:
> > >
> > > > On 2/14/2024 10:22 AM, Petr Tesařík wrote:
> > > > > On Wed, 14 Feb 2024 06:52:53 -0800
> > > > > Dave Hansen <dave.hansen@intel.com> wrote:
> > > > >
> > > > > > On 2/14/24 03:35, Petr Tesarik wrote:
> > > > > > > This patch series implements x86_64 arch hooks for the generic SandBox
> > > > > > > Mode infrastructure.
> > > > > >
> > > > > > I think I'm missing a bit of context here. What does one _do_ with
> > > > > > SandBox Mode? Why is it useful?
> > > > >
> > > > > I see, I split the patch series into the base infrastructure and the
> > > > > x86_64 implementation, but I forgot to merge the two recipient lists.
> > > > > :-(
> > > > >
> > > > > Anyway, in the long term I would like to work on gradual decomposition
> > > > > of the kernel into a core part and many self-contained components.
> > > > > Sandbox mode is a useful tool to enforce isolation.
> > > > >
> > > > > In its current form, sandbox mode is too limited for that, but I'm
> > > > > trying to find some balance between "publish early" and reaching a
> > > > > feature level where some concrete examples can be shown. I'd rather
> > > > > fail fast than maintain hundreds of patches in an out-of-tree branch
> > > > > before submitting (and failing anyway).
> > > > >
> > > > > Petr T
> > > > >
> > > >
> > > > What you're proposing sounds a gigantic thing, which could potentially
> > > > impact all subsystems.
> > >
> > > True. Luckily, sandbox mode allows me to move gradually, one component
> > > at a time.
> > >
> > > > Unless you prove it has big advantages with real
> > > > world usages, I guess nobody even wants to look into the patches.
> > > >
> > > > BTW, this seems another attempt to get the idea of micro-kernel into
> > > > Linux.
> > >
> > > We know it's not feasible to convert Linux to a micro-kernel. AFAICS
> > > that would require some kind of big switch, affecting all subsystems at
> > > once.
> > >
> > > But with a growing code base and more or less constant bug-per-LOC rate,
> > > people will continue to come up with some ideas how to limit the
> > > potential impact of each bug. Logically, one of the concepts that come
> > > to mind is decomposition.
> > >
> > > If my attempt helps to clarify how such decomposition should be done to
> > > be acceptable, it is worthwhile. If nothing else, I can summarize the
> > > situation and ask Jonathan if he would kindly accept it as a LWN
> > > article...
> > >
> > > Petr T
> > >
> >
> > I have been thinking more about this, and I'm more than ever convinced that exposing kernel memory to *any* kind of user space is a really, really bad idea. It is not a door we ever want to open; once that line gets muddled, the attack surface opens up dramatically.
>
> Would you mind elaborating on this a bit more?
>
> For one thing, sandbox mode is *not* user mode. Sure, my proposed
> x86-64 implementation runs with the same CPU privilege level as user
> mode, but it is isolated from user mode with just as strong mechanisms
> as any two user mode processes are isolated from each other. Are you
> saying that process isolation in Linux is not all that strong after all?
>
> Don't get me wrong. I'm honestly trying to understand what exactly
> makes the idea so bad. I have apparently not considered something that
> you have, and I would be glad if you could reveal it.
>
> > And, in fact, we already have a sandbox mode in the kernel – it is called eBPF.
>
> Sure. The difference is that eBPF is a platform of its own (with its
> own consistency model, machine code etc.). Rewriting code for eBPF may
> need a bit more effort.
>
> Besides, Roberto wrote a PGP key parser as an eBPF program at some
> point, and I believe it was rejected for that reason. So, it seems
> there are situations where eBPF is not an alternative.
>
> Roberto, can you remember and share some details?

eBPF programs are not signed. And I struggled to have some security bugs fixed, so I gave up.

Roberto