Message ID | 20221028141220.29217-3-kirill.shutemov@linux.intel.com |
---|---|
State | New |
Headers |
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@intel.com, luto@kernel.org, peterz@infradead.org
Cc: sathyanarayanan.kuppuswamy@linux.intel.com, ak@linux.intel.com, dan.j.williams@intel.com, david@redhat.com, hpa@zytor.com, seanjc@google.com, thomas.lendacky@amd.com, elena.reshetova@intel.com, x86@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory
Date: Fri, 28 Oct 2022 17:12:20 +0300
Message-Id: <20221028141220.29217-3-kirill.shutemov@linux.intel.com>
In-Reply-To: <20221028141220.29217-1-kirill.shutemov@linux.intel.com>
References: <20221028141220.29217-1-kirill.shutemov@linux.intel.com> |
Series | x86/tdx: Enforce no #VE on private memory accesses |
Commit Message
Kirill A. Shutemov
Oct. 28, 2022, 2:12 p.m. UTC
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions, such as using certain instructions or accessing
a specific MSR.

A notable reason for #VE is access to specific guest physical addresses.
This requires special security consideration because it is not fully
under the guest kernel's control: the VMM can remove a page from the EPT
page table and trigger a #VE on access.

The primary use case for #VE on a memory access is MMIO: the VMM removes
a page from the EPT to trigger an exception in the guest, which allows
the guest to emulate MMIO with hypercalls.

MMIO only happens on shared memory. All conventional kernel memory is
private. This includes everything from kernel stacks to kernel text.

Handling exceptions on arbitrary accesses to kernel memory is
essentially impossible, as handling a #VE may require access to memory
that also triggers the exception.

The TDX module provides a mechanism to disable #VE delivery on access
to private memory. If the SEPT_VE_DISABLE TD attribute is set, a
private EPT violation will not be reflected into the guest as a #VE,
but will instead trigger an exit to the VMM.

Make sure the VMM has set the attribute. Panic otherwise.

There is a small window during boot, before the check, when the kernel
has an early #VE handler. But that handler only serves port I/O and
panics as soon as it sees any other #VE reason.

SEPT_VE_DISABLE makes an SEPT violation unrecoverable, and terminating
the TD is the only option.

The kernel has no legitimate use case for #VE on private memory. It is
either a guest kernel bug (like access to unaccepted memory) or a
malicious/buggy VMM that removes a guest page that is still in use.

In both cases terminating the TD is the right thing to do.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: 9a22bf6debbf ("x86/traps: Add #VE support for TDX guest")
Cc: stable@vger.kernel.org # v5.19
---
arch/x86/coco/tdx/tdx.c | 49 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)
Comments
On 10/28/22 07:12, Kirill A. Shutemov wrote:
>  arch/x86/coco/tdx/tdx.c | 49 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 49 insertions(+)

The patch is good, but I'm not crazy about the changelog or the big ol' comment. Really, this would do:

	/*
	 * The kernel can not handle #VE's when accessing normal kernel
	 * memory. Ensure that no #VE will be delivered for accesses to
	 * TD-private memory. Only VMM-shared memory (MMIO) will #VE.
	 */
	if (!(td_attr & ATTR_SEPT_VE_DISABLE))
		panic("TD misconfiguration: SEPT_VE_DISABLE attribute must be set.\n");

I'll probably trim both of them down. If I chop out something that's critical, let me know; otherwise, let's follow up and stick all of those details in Documentation.
The core of this vulnerability is not directly related to ATTR_SEPT_VE_DISABLE, but to the MMIO processing logic in #VE. We have encountered similar problems on SEV-ES; here are the fixes in the kernel [1] and OVMF [2].

Instead of enforcing ATTR_SEPT_VE_DISABLE in the TDX guest kernel, I think the fix should also include the necessary checks on the MMIO path of the #VE routine:

	static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
	{
		unsigned long *reg, val, vaddr;
		char buffer[MAX_INSN_SIZE];
		struct insn insn = {};
		enum mmio_type mmio;
		int size, extend_size;
		u8 extend_val = 0;

		// Some additional security check on ve->gpa should be introduced here.

		/* Only in-kernel MMIO is supported */
		if (WARN_ON_ONCE(user_mode(regs)))
			return -EFAULT;

		// ...
	}

If we don't fix the problem at the point where we found it, but instead rely on complicated composite logic and long comments in the kernel, I'm confident we'll fall back into the same pit in the near future :).

[1] https://github.com/torvalds/linux/blob/1a2dcbdde82e3a5f1db9b2f4c48aa1aeba534fb2/arch/x86/kernel/sev.c#L503
[2] https://github.com/tianocore/edk2/blob/db2c22633f3c761975d8f469a0e195d8b79e1287/OvmfPkg/Library/VmgExitLib/VmgExitVcHandler.c#L670
On Mon, Oct 31, 2022 at 12:07:45PM +0800, Guorui Yu wrote:
> The core of this vulnerability is not directly related to
> ATTR_SEPT_VE_DISABLE, but to the MMIO processing logic in #VE.
>
> We have encountered similar problems on SEV-ES; here are the fixes in
> the kernel [1] and OVMF [2].
>
> Instead of enforcing ATTR_SEPT_VE_DISABLE in the TDX guest kernel, I
> think the fix should also include the necessary checks on the MMIO path
> of the #VE routine.

A missing SEPT_VE_DISABLE exposes the guest to more security problems than a confused handle_mmio(). A rightly timed rogue #VE can be used to escalate privileges and more. Adding a check there alone would stop only some potential attacks.

> If we don't fix the problem at the point where we found it, but instead
> rely on complicated composite logic and long comments in the kernel, I'm
> confident we'll fall back into the same pit in the near future :).

The plan is to add the check there along with relaxing SEPT_VE_DISABLE for debug TDs. That is required to debug a guest kernel effectively; otherwise, an access to unaccepted memory would terminate the TD with zero info on why. But it is not the urgent fix. It can be submitted separately.
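To make the plan concrete, here is a minimal sketch of the debug-TD relaxation Kirill describes, assuming the TD attributes DEBUG bit (bit 0 per the TDX module spec); the exact wording and placement are illustrative, not a posted patch:

	/* TD attributes; ATTR_DEBUG per the TDX module spec is an assumption here. */
	#define ATTR_DEBUG		BIT(0)
	#define ATTR_SEPT_VE_DISABLE	BIT(28)

	if (!(td_attr & ATTR_SEPT_VE_DISABLE)) {
		/*
		 * A debug TD may run without SEPT_VE_DISABLE so that an
		 * access to unaccepted memory produces a diagnosable #VE
		 * instead of silently terminating the TD.
		 */
		if (td_attr & ATTR_DEBUG)
			pr_warn("Attempting to run a debug TD without SEPT_VE_DISABLE\n");
		else
			panic("TD misconfiguration: SEPT_VE_DISABLE attribute must be set.\n");
	}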
On 10/30/22 21:07, Guorui Yu wrote:
> We have encountered similar problems on SEV-ES; here are the fixes in
> the kernel [1] and OVMF [2].

SEV-ES and TDX are very different beasts in this area.

> Instead of enforcing ATTR_SEPT_VE_DISABLE in the TDX guest kernel, I
> think the fix should also include the necessary checks on the MMIO path
> of the #VE routine.

Instead? Without ATTR_SEPT_VE_DISABLE, a #VE can occur on basically any instruction. We call those kinds of exceptions "paranoid entry" points. They need special handling like the NMI or #MC handlers.

I'd be happy to look at a patch that does the MMIO path check *and* turns the #VE handler into a robust entry point. Bonus points if you can do it in ~5 lines of C like the approach in this thread.
On Fri, Oct 28, 2022 at 7:12 AM Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:
> + *
> + * The kernel has no legitimate use case for #VE on private memory. It is
> + * either a guest kernel bug (like access to unaccepted memory) or a
> + * malicious/buggy VMM that removes a guest page that is still in use.
> + *

I think this statement is too strong, and I have a few concerns about this approach. I understand that there is an issue with handling #VEs on private pages, but it seems like we are just hiding the problem with this approach instead of fixing it -- I do not have any fix in mind.

First, there is a feature of injecting #VE to handle unaccepted pages at runtime and accept them on demand; now the statement is saying this was an unnecessary feature (why is it there at all, then?) as there is no legitimate use case. I wonder if this will limit how we can implement lazy TDACCEPT. There are multiple ideas floating around now. https://github.com/intel/tdx/commit/9b3ef9655b695d3c67a557ec016487fded8b0e2b has 3 implementation choices, where the "Accept a block of memory on the first use." option is implemented. Actually it says "Accept a block of memory on the first use.", but it is implemented as "Accept a block of memory on the first allocation". The comments in this code also raise concerns about the performance.

As of now, we do not know which one of those ideas will provide acceptable performance for booting large VMs. If the performance overhead is high, we can always implement lazy TDACCEPT such that the first time a guest accesses unaccepted memory, the #VE does the TDACCEPT. I am not trying to solve the lazy TDACCEPT problem here; all I am trying to say is that there might be legitimate use cases for #VE on private memory, and this patch limits any future improvement we might need to make to the lazy TDACCEPT implementation.

> + * In both cases terminating the TD is the right thing to do.
> + */
> +	if (!(td_attr & ATTR_SEPT_VE_DISABLE))
> +		panic("TD misconfiguration: SEPT_VE_DISABLE attribute must be set.\n");
> +
> 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>
> 	cc_set_vendor(CC_VENDOR_INTEL);
> --
> 2.38.0
On 11/4/22 15:36, Erdem Aktas wrote:
> On Fri, Oct 28, 2022 at 7:12 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
>> + *
>> + * The kernel has no legitimate use case for #VE on private memory. It is
>> + * either a guest kernel bug (like access to unaccepted memory) or a
>> + * malicious/buggy VMM that removes a guest page that is still in use.
>> + *
>
> I think this statement is too strong, and I have a few concerns about
> this approach. I understand that there is an issue with handling #VEs
> on private pages, but it seems like we are just hiding the problem with
> this approach instead of fixing it -- I do not have any fix in mind.
>
> First, there is a feature of injecting #VE to handle unaccepted pages
> at runtime and accept them on demand; now the statement is saying this
> was an unnecessary feature (why is it there at all, then?) as there is
> no legitimate use case.

We're doing on-demand page acceptance. We just don't need a #VE to drive it.

Why is it in the TDX module then? Inertia? Because it got too far along in the process before anyone asked me or some of the other x86 kernel folks to look at it hard.

> I wonder if this will limit how we can implement lazy TDACCEPT. There
> are multiple ideas floating around now.
> https://github.com/intel/tdx/commit/9b3ef9655b695d3c67a557ec016487fded8b0e2b
> has 3 implementation choices, where the "Accept a block of memory on
> the first use." option is implemented. Actually it says "Accept a block
> of memory on the first use.", but it is implemented as "Accept a block
> of memory on the first allocation". The comments in this code also
> raise concerns about the performance.
>
> As of now, we do not know which one of those ideas will provide
> acceptable performance for booting large VMs. If the performance
> overhead is high, we can always implement lazy TDACCEPT such that the
> first time a guest accesses unaccepted memory, the #VE does the TDACCEPT.

Could you please elaborate a bit on what you think the distinction is between:

 * Accept on first use
and
 * Accept on allocation

Surely, for the vast majority of memory, it's allocated and then used pretty quickly. As in, most allocations are __GFP_ZERO so they're allocated and "used" before they even leave the allocator. So, in practice, they're *VERY* close to equivalent.

Where do you see them diverging? Why does it matter?

> I am not trying to solve the lazy TDACCEPT problem here; all I am
> trying to say is that there might be legitimate use cases for #VE on
> private memory, and this patch limits any future improvement we might
> need to make to the lazy TDACCEPT implementation.

The kernel can't take exceptions on arbitrary memory accesses. I have *ZERO* idea how to handle page acceptance on an access to a per-cpu variable referenced in syscall entry, or the NMI stack when we've interrupted kernel code with a user GSBASE value.

So, we either find *ALL* the kernel memory that needs to be pre-accepted at allocation time (like kernel stacks), or we just say that all allocated memory needs to be accepted before we let it be allocated.

One of those is really easy. The other means a boatload of code audits. I know. I had to do that kind of exercise to get KPTI to work. I don't want to do it again. It was worth it for KPTI when the world was on fire. TDX isn't that important IMNHO. There's an easier way.
On 2022/10/31 22:22, Dave Hansen wrote:
> On 10/30/22 21:07, Guorui Yu wrote:
>> Instead of enforcing ATTR_SEPT_VE_DISABLE in the TDX guest kernel, I
>> think the fix should also include the necessary checks on the MMIO path
>> of the #VE routine.
>
> Instead?

Yes, "Instead of" should be "Additionally,".

> Without ATTR_SEPT_VE_DISABLE, a #VE can occur on basically any
> instruction. We call those kinds of exceptions "paranoid entry" points.
> They need special handling like the NMI or #MC handlers.
>
> I'd be happy to look at a patch that does the MMIO path check *and*
> turns the #VE handler into a robust entry point.
>
> Bonus points if you can do it in ~5 lines of C like the approach in this
> thread.

Yes, there is a fix that satisfies your requirement and gets the bonus points :)

Please refer to https://github.com/intel/tdx/commit/f045b0d52a5f7d8bf66cd4410307d05a90523f10

	case EXIT_REASON_EPT_VIOLATION:
	+	if (!(ve->gpa & tdx_shared_mask())) {
	+		panic("#VE due to access to unaccepted memory. "
	+		      "GPA: %#llx\n", ve->gpa);
	+	}
	+
		/* original from Kirill and Kuppuswamy */

It's already there; it just didn't make it into the main branch.

Sincerely,
Guorui
On 11/6/22 21:10, Guorui Yu wrote:
>> Without ATTR_SEPT_VE_DISABLE, a #VE can occur on basically any
>> instruction. We call those kinds of exceptions "paranoid entry" points.
>> They need special handling like the NMI or #MC handlers.
>>
>> I'd be happy to look at a patch that does the MMIO path check *and*
>> turns the #VE handler into a robust entry point.
>>
>> Bonus points if you can do it in ~5 lines of C like the approach in this
>> thread.
>
> Yes, there is a fix that satisfies your requirement and gets the bonus points 😄
>
> Please refer to
> https://github.com/intel/tdx/commit/f045b0d52a5f7d8bf66cd4410307d05a90523f10
>
> 	case EXIT_REASON_EPT_VIOLATION:
> 	+	if (!(ve->gpa & tdx_shared_mask())) {
> 	+		panic("#VE due to access to unaccepted memory. "
> 	+		      "GPA: %#llx\n", ve->gpa);
> 	+	}
> 	+
> 		/* original from Kirill and Kuppuswamy */
>
> It's already there; it just didn't make it into the main branch.

Could you explain how that prevents the #VE from occurring in the "syscall gap", or in a place where the kernel is running with the user GSBASE value?

It doesn't, as far as I can tell. You need the SEPT_VE_DISABLE check for that.
On 2022/11/7 21:31, Dave Hansen wrote:
> On 11/6/22 21:10, Guorui Yu wrote:
>> Yes, there is a fix that satisfies your requirement and gets the bonus
>> points 😄
>>
>> Please refer to
>> https://github.com/intel/tdx/commit/f045b0d52a5f7d8bf66cd4410307d05a90523f10
>>
>> It's already there; it just didn't make it into the main branch.
>
> Could you explain how that prevents the #VE from occurring in the
> "syscall gap", or in a place where the kernel is running with the user
> GSBASE value?

Thank you for explaining the "paranoid entry" points with these examples; now I understand why SEPT_VE_DISABLE is necessary for a TD.

> It doesn't, as far as I can tell. You need the SEPT_VE_DISABLE check
> for that.
On Fri, Nov 4, 2022 at 3:50 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 11/4/22 15:36, Erdem Aktas wrote:
> Could you please elaborate a bit on what you think the distinction is
> between:
>
>  * Accept on first use
> and
>  * Accept on allocation
>
> Surely, for the vast majority of memory, it's allocated and then used
> pretty quickly. As in, most allocations are __GFP_ZERO so they're
> allocated and "used" before they even leave the allocator. So, in
> practice, they're *VERY* close to equivalent.
>
> Where do you see them diverging? Why does it matter?

For a VM with a very large memory size, let's say close to 800G of memory, it might take a really long time to finish the initialization. If all allocations were __GFP_ZERO, then I agree it would not matter, but -- I need to run some benchmarks to validate -- what I remember is that this was not what we were observing. Let me run a few tests to provide more input on this; meanwhile, if you have already run some benchmarks, that would be great.

What I see in the code is that the accept_page() function will zero all the unaccepted pages even if the __GFP_ZERO flag is not set, and if __GFP_ZERO is set, we will zero all those pages again. I also see a lot of concerning comments like "Page acceptance can be very slow.".

What I mean by "accept on first use" is leaving the memory allocation as it is and using the #VE handler to accept pages the first time they are accessed. Let me come back with some numbers on this, which might take some time to collect.

> > I am not trying to solve the lazy TDACCEPT problem here; all I am
> > trying to say is that there might be legitimate use cases for #VE on
> > private memory, and this patch limits any future improvement we might
> > need to make to the lazy TDACCEPT implementation.
>
> The kernel can't take exceptions on arbitrary memory accesses. I have
> *ZERO* idea how to handle page acceptance on an access to a per-cpu
> variable referenced in syscall entry, or the NMI stack when we've
> interrupted kernel code with a user GSBASE value.
>
> So, we either find *ALL* the kernel memory that needs to be pre-accepted
> at allocation time (like kernel stacks), or we just say that all
> allocated memory needs to be accepted before we let it be allocated.
>
> One of those is really easy. The other means a boatload of code audits.
> I know. I had to do that kind of exercise to get KPTI to work. I
> don't want to do it again. It was worth it for KPTI when the world was
> on fire. TDX isn't that important IMNHO. There's an easier way.
On 11/7/22 14:53, Erdem Aktas wrote:
> For a VM with a very large memory size, let's say close to 800G of
> memory, it might take a really long time to finish the initialization.
> If all allocations were __GFP_ZERO, then I agree it would not matter,
> but -- I need to run some benchmarks to validate -- what I remember is
> that this was not what we were observing. Let me run a few tests to
> provide more input on this; meanwhile, if you have already run some
> benchmarks, that would be great.
>
> What I see in the code is that the accept_page() function will zero
> all the unaccepted pages even if the __GFP_ZERO flag is not set, and if
> __GFP_ZERO is set, we will zero all those pages again. I also see a lot
> of concerning comments like "Page acceptance can be very slow.".

I'm not following you at all here. Yeah, page acceptance is very slow. But the slowest part is probably the cache coherency dance that the TDX module has to do, flushing and zeroing all the memory to initialize the new integrity metadata. Second to that is the cost of the TDCALL. Third is the cost of the #VE.

Here's what Kirill is proposing, in some pseudocode:

	alloc_page(order=0, __GFP_ZERO) {
		TD.accept(size=4M) {
			// TDX module clflushes/zeroes 4M of memory
		}
		memset(4k);
		// leave 1023 accepted 4k pages in the allocator
	}

To accept 4M of memory, you do one TDCALL. You do zero #VE's.

Using the #VE handler, you do:

	alloc_page(order=0, __GFP_ZERO) {
		memset(4k) {
			-> #VE handler
			TD.accept(size=4k);	// flush/zero 4k
		}
		// only 4k was accepted
	}
	// ... take 1023 more #VE's later on, one for each 4k page

You do 1024 #VE's and 1024 TDCALLs.

So, let's summarize. To do 4M worth of 4k pages, here's how the two approaches break down if __GFP_ZERO is in play:

	            #VE          Accept-in-allocator
	#VE's:      1024         0
	TDCALLs:    1024         1
	clflushes:  4k x 1024    4k x 1024
	memset()s:  4k x 1024    4k x 1024

The *ONLY* downside of accept-at-allocate as implemented is that it does 4M at a time, so the TDCALL is long compared to a 4k one. But this is a classic bandwidth versus latency compromise. In this case, we choose bandwidth.

*Both* cases need to memset() the same amount of memory. Both cases only memset() 4k at a time.

The *ONLY* way the #VE approach is better is if you allocate 4k and then never touch the rest of the 4M page. That might happen, maybe *ONE* time per zone. But the rest of the time, the amortization of the TDCALL cost is going to win. I'll be shocked if any benchmarking turns up another result.
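For reference, a minimal sketch of the accept-in-allocator shape Dave describes; range_contains_unaccepted_memory() and accept_memory() are hypothetical helper names for unaccepted-range tracking, and the 4M accept unit is an assumption for illustration, not the posted implementation:

	/*
	 * Sketch only: accept memory at allocation time, in units large
	 * enough that one TDCALL covers many 4k pages.
	 */
	#define ACCEPT_UNIT	(4UL << 20)	/* accept 4M per TDCALL */

	static void accept_page(struct page *page, unsigned int order)
	{
		phys_addr_t start = round_down(page_to_phys(page), ACCEPT_UNIT);
		phys_addr_t end = round_up(page_to_phys(page) +
					   (PAGE_SIZE << order), ACCEPT_UNIT);

		/* Hypothetical helpers tracking not-yet-accepted ranges. */
		if (range_contains_unaccepted_memory(start, end))
			accept_memory(start, end);	/* TDG.MEM.PAGE.ACCEPT under the hood */
	}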
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 343d60853b71..a376a0c3fddc 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -34,6 +34,9 @@
 #define VE_GET_PORT_NUM(e)	((e) >> 16)
 #define VE_IS_IO_STRING(e)	((e) & BIT(4))
 
+/* TD Attributes */
+#define ATTR_SEPT_VE_DISABLE	BIT(28)
+
 /* Caches GPA width from TDG.VP.INFO TDCALL */
 static unsigned int gpa_width __ro_after_init;
 
@@ -770,6 +773,52 @@ void __init tdx_early_init(void)
 	 */
 	tdx_parse_tdinfo();
 
+	/*
+	 * Do not allow #VE due to EPT violation on private memory.
+	 *
+	 * Virtualization Exceptions (#VE) are delivered to TDX guests due to
+	 * specific guest actions, such as using certain instructions or
+	 * accessing a specific MSR.
+	 *
+	 * A notable reason for #VE is access to specific guest physical
+	 * addresses. This requires special security consideration because it
+	 * is not fully under the guest kernel's control: the VMM can remove a
+	 * page from the EPT page table and trigger a #VE on access.
+	 *
+	 * The primary use case for #VE on a memory access is MMIO: the VMM
+	 * removes a page from the EPT to trigger an exception in the guest,
+	 * which allows the guest to emulate MMIO with hypercalls.
+	 *
+	 * MMIO only happens on shared memory. All conventional kernel memory
+	 * is private. This includes everything from kernel stacks to kernel
+	 * text.
+	 *
+	 * Handling exceptions on arbitrary accesses to kernel memory is
+	 * essentially impossible, as handling a #VE may require access to
+	 * memory that also triggers the exception.
+	 *
+	 * The TDX module provides a mechanism to disable #VE delivery on
+	 * access to private memory. If the SEPT_VE_DISABLE TD attribute is
+	 * set, a private EPT violation will not be reflected into the guest
+	 * as a #VE, but will instead trigger an exit to the VMM.
+	 *
+	 * Make sure the VMM has set the attribute. Panic otherwise.
+	 *
+	 * There is a small window during boot, before the check, when the
+	 * kernel has an early #VE handler. But that handler only serves port
+	 * I/O and panics as soon as it sees any other #VE reason.
+	 *
+	 * SEPT_VE_DISABLE makes an SEPT violation unrecoverable, and
+	 * terminating the TD is the only option.
+	 *
+	 * The kernel has no legitimate use case for #VE on private memory. It
+	 * is either a guest kernel bug (like access to unaccepted memory) or
+	 * a malicious/buggy VMM that removes a guest page that is still in
+	 * use.
+	 *
+	 * In both cases terminating the TD is the right thing to do.
+	 */
+	if (!(td_attr & ATTR_SEPT_VE_DISABLE))
+		panic("TD misconfiguration: SEPT_VE_DISABLE attribute must be set.\n");
+
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
 	cc_set_vendor(CC_VENDOR_INTEL);
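The hunk above uses td_attr without showing where it comes from: it is cached from the TDG.VP.INFO TDCALL in tdx_parse_tdinfo(). A minimal sketch of that plumbing, assuming (per the TDX module ABI) that RCX carries the GPA width in its low 6 bits and RDX carries the TD attributes:

	static u64 td_attr __ro_after_init;

	static void tdx_parse_tdinfo(void)
	{
		struct tdx_module_output out;

		/*
		 * TDCALL[TDG.VP.INFO] reports the TD execution environment.
		 * Assumption per the TDX module ABI: RCX holds the GPA width
		 * in its low 6 bits, RDX holds the TD attributes.
		 */
		tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);

		gpa_width = out.rcx & GENMASK(5, 0);
		td_attr = out.rdx;
	}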