Message ID | 20231017202505.340906-1-rick.p.edgecombe@intel.com |
---|---|
Headers |
From: Rick Edgecombe <rick.p.edgecombe@intel.com> To: x86@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, luto@kernel.org, peterz@infradead.org, kirill.shutemov@linux.intel.com, elena.reshetova@intel.com, isaku.yamahata@intel.com, seanjc@google.com, Michael Kelley <mikelley@microsoft.com>, thomas.lendacky@amd.com, decui@microsoft.com, sathyanarayanan.kuppuswamy@linux.intel.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-s390@vger.kernel.org Cc: rick.p.edgecombe@intel.com Subject: [PATCH 00/10] Handle set_memory_XXcrypted() errors Date: Tue, 17 Oct 2023 13:24:55 -0700 Message-Id: <20231017202505.340906-1-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
Handle set_memory_XXcrypted() errors
|
|
Message
Edgecombe, Rick P
Oct. 17, 2023, 8:24 p.m. UTC
Shared pages should never return to the page allocator, or future usage of the pages may allow for the contents to be exposed to the host. They may also cause the guest to crash if the page is used in a way disallowed by HW (i.e. for executable code or as a page table).

Normally set_memory() call failures are rare. But on TDX set_memory_XXcrypted() involves calls to the untrusted VMM, and an attacker could fail these calls such that:
 1. set_memory_encrypted() returns an error and leaves the pages fully shared.
 2. set_memory_decrypted() returns an error, but the pages are actually fully converted to shared.

This means that patterns like the below can cause problems:

    void *addr = alloc();
    int fail = set_memory_decrypted(addr, 1);
    if (fail)
        free_pages(addr, 0);

And:

    void *addr = alloc();
    int fail = set_memory_decrypted(addr, 1);
    if (fail) {
        set_memory_encrypted(addr, 1);
        free_pages(addr, 0);
    }

Unfortunately these patterns are all over the place. And what the set_memory() callers should do in this situation is not clear either. They shouldn’t use the pages as shared because something clearly went wrong, but they also need to fully reset the pages to private to free them. But the kernel needs the VMM's help to do this, and the VMM is already being uncooperative around the needed operations. So this isn't guaranteed to succeed and the caller is kind of stuck with unusable pages.

Looking at QEMU/KVM as an example, these VMM conversion failures either indicate an attempt to attack the guest, or resource constraints on the host. Preventing a DOS attack is out of scope for the coco threat model. So this leaves the host resource constraint cause. When similar resource constraints are encountered in the host, KVM punts the problem to userspace and QEMU terminates the guest. When similar problems are detected inside set_memory(), SEV issues a command to terminate the guest.

This all makes it appealing to simply panic (via tdx_panic() call which informs the host what is happening) when observing troublesome VMM behavior around the memory conversion. It is:
 - Consistent with similar behavior on SEV side.
 - Generally more consistent with how host resource constraints are handled (at least in QEMU/KVM)
 - Would be a more foolproof defense against the attack scenario.

Nevertheless, doing so would be an instance of the “crash the kernel for security reasons” pattern. This is a big reason not to do it, and crashing is not strictly needed because the unusable pages could just be leaked (as they already are in some cases). So instead, this series does a tree-wide search and fixes the callers to handle the error by leaking the pages. Going forward callers will need to handle the set_memory() errors correctly in order to not reintroduce the issue.

I think there are some points for both sides, and we had some internal discussion on the right way to handle it. So I've tried to characterize both arguments. I'm interested to hear opinions on which is the best.

I’ve marked the hyperv guest parts in this as RFC, both because I can’t test them and I believe Linux TDs can’t run on hyperv yet due to some missing support. I would appreciate a correction on this if it’s wrong.

Rick Edgecombe (10):
  mm: Add helper for freeing decrypted memory
  x86/mm/cpa: Reject incorrect encryption change requests
  kvmclock: Use free_decrypted_pages()
  swiotlb: Use free_decrypted_pages()
  ptp: Use free_decrypted_pages()
  dma: Use free_decrypted_pages()
  hv: Use free_decrypted_pages()
  hv: Track decrypted status in vmbus_gpadl
  hv_nstvsc: Don't free decrypted memory
  uio_hv_generic: Don't free decrypted memory

 arch/s390/include/asm/set_memory.h |  1 +
 arch/x86/kernel/kvmclock.c         |  2 +-
 arch/x86/mm/pat/set_memory.c       | 41 +++++++++++++++++++++++++++++-
 drivers/hv/channel.c               | 18 ++++++++-----
 drivers/hv/connection.c            | 13 +++++++---
 drivers/net/hyperv/netvsc.c        |  7 +++--
 drivers/ptp/ptp_kvm_x86.c          |  2 +-
 drivers/uio/uio_hv_generic.c       | 12 ++++++---
 include/linux/dma-map-ops.h        |  3 ++-
 include/linux/hyperv.h             |  1 +
 include/linux/set_memory.h         | 13 ++++++++++
 kernel/dma/contiguous.c            |  2 +-
 kernel/dma/swiotlb.c               | 11 +++++---
 13 files changed, 101 insertions(+), 25 deletions(-)
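For readers who want to see the "leak instead of free" idea from the cover letter in isolation, here is a minimal userspace sketch. It is not the series' code: `struct model_page`, `vmm_refuses_conversion` and every `model_*` function are hypothetical stand-ins invented for illustration, and the real free_decrypted_pages() helper may well look different.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a page; not a kernel structure. */
struct model_page {
    bool shared; /* true while the page is shared with the host */
    bool freed;  /* true once handed back to the model allocator */
};

/* Test knob standing in for an uncooperative or resource-starved VMM. */
static bool vmm_refuses_conversion;

/* Model of set_memory_encrypted(): 0 on success, -1 on failure. */
static int model_set_memory_encrypted(struct model_page *p)
{
    if (vmm_refuses_conversion)
        return -1;      /* conversion failed, page stays shared */
    p->shared = false;
    return 0;
}

/* Model of returning a page to the allocator. */
static void model_free_page(struct model_page *p)
{
    /* Invariant from the cover letter: shared pages are never freed. */
    assert(!p->shared);
    p->freed = true;
}

/*
 * Try to make the page private again and free it only if that worked.
 * On failure the page is deliberately leaked. Returns true if freed.
 */
static bool model_free_decrypted_page(struct model_page *p)
{
    if (model_set_memory_encrypted(p) != 0)
        return false;   /* leak: never give a shared page back */
    model_free_page(p);
    return true;
}
```

The design choice being illustrated is that a failed conversion costs one leaked page rather than a guest crash or a shared page on the free list.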
Comments
From: Rick Edgecombe <rick.p.edgecombe@intel.com> Sent: Tuesday, October 17, 2023 1:25 PM
>
> Shared pages should never return to the page allocator, or future usage of the pages may allow for the contents to be exposed to the host. They may also cause the guest to crash if the page is used in a way disallowed by HW (i.e. for executable code or as a page table).
>
> Normally set_memory() call failures are rare. But on TDX set_memory_XXcrypted() involves calls to the untrusted VMM, and an attacker could fail these calls such that:
>  1. set_memory_encrypted() returns an error and leaves the pages fully shared.
>  2. set_memory_decrypted() returns an error, but the pages are actually fully converted to shared.
>
> This means that patterns like the below can cause problems:
>
>     void *addr = alloc();
>     int fail = set_memory_decrypted(addr, 1);
>     if (fail)
>         free_pages(addr, 0);
>
> And:
>
>     void *addr = alloc();
>     int fail = set_memory_decrypted(addr, 1);
>     if (fail) {
>         set_memory_encrypted(addr, 1);
>         free_pages(addr, 0);
>     }
>
> Unfortunately these patterns are all over the place. And what the set_memory() callers should do in this situation is not clear either. They shouldn’t use the pages as shared because something clearly went wrong, but they also need to fully reset the pages to private to free them. But the kernel needs the VMM's help to do this, and the VMM is already being uncooperative around the needed operations. So this isn't guaranteed to succeed and the caller is kind of stuck with unusable pages.
>
> Looking at QEMU/KVM as an example, these VMM conversion failures either indicate an attempt to attack the guest, or resource constraints on the host. Preventing a DOS attack is out of scope for the coco threat model. So this leaves the host resource constraint cause. When similar resource constraints are encountered in the host, KVM punts the problem to userspace and QEMU terminates the guest. When similar problems are detected inside set_memory(), SEV issues a command to terminate the guest.
>
> This all makes it appealing to simply panic (via tdx_panic() call which informs the host what is happening) when observing troublesome VMM behavior around the memory conversion. It is:
>  - Consistent with similar behavior on SEV side.
>  - Generally more consistent with how host resource constraints are handled (at least in QEMU/KVM)
>  - Would be a more foolproof defense against the attack scenario.
>
> Nevertheless, doing so would be an instance of the “crash the kernel for security reasons” pattern. This is a big reason not to do it, and crashing is not strictly needed because the unusable pages could just be leaked (as they already are in some cases). So instead, this series does a tree-wide search and fixes the callers to handle the error by leaking the pages. Going forward callers will need to handle the set_memory() errors correctly in order to not reintroduce the issue.
>
> I think there are some points for both sides, and we had some internal discussion on the right way to handle it. So I've tried to characterize both arguments. I'm interested to hear opinions on which is the best.

I'm more in favor of the "simply panic" approach. What you've done in your Patch 1 and Patch 2 is an intriguing way to try to get the memory back into a consistent state. But I'm concerned that there are failure modes that make it less than 100% foolproof (more on that below). If we can't be sure that the memory is back in a consistent state, then the original problem isn't fully solved. I'm also not sure of the value of investing effort to ensure that some error cases are handled without panic'ing. The upside benefit of not panic'ing seems small compared to the downside risk of leaking guest VM data to the host.

My concern about Patches 1 and 2 is that the encryption bit in the PTE is not a reliable indicator of the state that the host thinks the page is in. Changing the state requires two steps (in either order): 1) updating the guest VM PTEs, and 2) updating the host's view of the page state. Both steps may be done on a range of pages. If #2 fails, the guest doesn't know which pages in the batch were updated and which were not, so the guest PTEs may not match the host state. In such a case, set_memory_encrypted() could succeed based on checking the PTEs when in fact the host still thinks some of the pages are shared. Such a mismatch will produce a guest panic later on if the page is referenced.

As you pointed out, the SEV code, and specifically the SEV-SNP code, terminates the VM if there is a failure. That's in pvalidate_pages(). You've described your changes as being for TDX, but there's also the Hyper-V version of handling private <-> shared transitions, which makes use of a paravisor for both SEV-SNP and TDX. The Hyper-V versions could also fail due to resource constraints in the paravisor, and so have the same issues as TDX, even when running on SEV-SNP hardware.

In general, it's hard to anticipate all the failure modes that can occur currently. Additional failure modes could be added in the future, and taking into account malicious behavior by the host makes it even worse. That leads me back to my conclusion that just taking the panic is best.

>
> I’ve marked the hyperv guest parts in this as RFC, both because I can’t test them and I believe Linux TDs can’t run on hyperv yet due to some missing support. I would appreciate a correction on this if it’s wrong.

Linux TDs can run on Hyper-V today, though it may require a Hyper-V version that isn't officially released yet. We have it working internally at Microsoft and I think at Intel as well. There *are* still a couple of Linux patches waiting to be accepted upstream to run without a paravisor.

In any case, your concerns about testing are valid -- it's probably easier for one of us at Microsoft to test the Hyper-V guest parts if we continue down the path you have proposed.

I've looked through the other patches in the series, and have a few minor comments on the Hyper-V parts. But I'll hold those pending an overall conclusion on whether to pursue this approach.

If you send a new version of the series, please include the linux-hyperv@vger.kernel.org mailing list as well on all the patches.

Michael

>
> Rick Edgecombe (10):
>   mm: Add helper for freeing decrypted memory
>   x86/mm/cpa: Reject incorrect encryption change requests
>   kvmclock: Use free_decrypted_pages()
>   swiotlb: Use free_decrypted_pages()
>   ptp: Use free_decrypted_pages()
>   dma: Use free_decrypted_pages()
>   hv: Use free_decrypted_pages()
>   hv: Track decrypted status in vmbus_gpadl
>   hv_nstvsc: Don't free decrypted memory
>   uio_hv_generic: Don't free decrypted memory
>
>  arch/s390/include/asm/set_memory.h |  1 +
>  arch/x86/kernel/kvmclock.c         |  2 +-
>  arch/x86/mm/pat/set_memory.c       | 41 +++++++++++++++++++++++++++++-
>  drivers/hv/channel.c               | 18 ++++++++-----
>  drivers/hv/connection.c            | 13 +++++++---
>  drivers/net/hyperv/netvsc.c        |  7 +++--
>  drivers/ptp/ptp_kvm_x86.c          |  2 +-
>  drivers/uio/uio_hv_generic.c       | 12 ++++++---
>  include/linux/dma-map-ops.h        |  3 ++-
>  include/linux/hyperv.h             |  1 +
>  include/linux/set_memory.h         | 13 ++++++++++
>  kernel/dma/contiguous.c            |  2 +-
>  kernel/dma/swiotlb.c               | 11 +++++---
>  13 files changed, 101 insertions(+), 25 deletions(-)
>
> --
> 2.34.1
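The two-step concern raised in the reply above (guest PTEs updated for a whole batch, host update failing partway) can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the arrays, `NPAGES`, the `fail_at` parameter and all `model_*` functions are hypothetical and do not correspond to any kernel or hypervisor interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4

/* Hypothetical per-page state; both start as false, meaning "shared". */
static bool guest_pte_private[NPAGES]; /* what the guest PTEs say */
static bool host_private[NPAGES];      /* what the host believes */

/*
 * Step 2 of the conversion: the host converts pages [0, fail_at) and then
 * reports a failure. The guest only sees the error code, not the progress.
 */
static int model_host_convert_to_private(size_t n, size_t fail_at)
{
    for (size_t i = 0; i < n; i++) {
        if (i == fail_at)
            return -1;
        host_private[i] = true;
    }
    return 0;
}

/* Step 1 (guest PTEs for the whole range), then step 2 (host view). */
static int model_set_memory_encrypted(size_t n, size_t fail_at)
{
    assert(n <= NPAGES);
    for (size_t i = 0; i < n; i++)
        guest_pte_private[i] = true;
    return model_host_convert_to_private(n, fail_at);
}

/* A PTE-only check, which cannot see the host's view. */
static bool model_ptes_all_private(size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!guest_pte_private[i])
            return false;
    return true;
}

/* Number of pages where guest PTE and host state disagree. */
static size_t model_count_mismatches(size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (guest_pte_private[i] != host_private[i])
            count++;
    return count;
}
```

In this model a check that looks only at the PTEs reports a fully private range even though the host still treats the tail of the batch as shared, which is the inconsistency described above.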
On 10/19/23 10:05, Michael Kelley (LINUX) wrote:
> I'm more in favor of the "simply panic" approach. What you've done in your Patch 1 and Patch 2 is an intriguing way to try to get the memory back into a consistent state. But I'm concerned that there are failure modes that make it less than 100% foolproof (more on that below). If we can't be sure that the memory is back in a consistent state, then the original problem isn't fully solved. I'm also not sure of the value of investing effort to ensure that some error cases are handled without panic'ing. The upside benefit of not panic'ing seems small compared to the downside risk of leaking guest VM data to the host.

panic() should be a last resort. We *always* continue unless we know that something is so bad that we're going to make things worse by continuing to run.

We shouldn't panic() on the first little thing that goes wrong. If folks want *that*, then they can set panic_on_warn.

> My concern about Patches 1 and 2 is that the encryption bit in the PTE is not a reliable indicator of the state that the host thinks the page is in. Changing the state requires two steps (in either order): 1) updating the guest VM PTEs, and 2) updating the host's view of the page state. Both steps may be done on a range of pages. If #2 fails, the guest doesn't know which pages in the batch were updated and which were not, so the guest PTEs may not match the host state. In such a case, set_memory_encrypted() could succeed based on checking the PTEs when in fact the host still thinks some of the pages are shared. Such a mismatch will produce a guest panic later on if the page is referenced.

I think that's OK. In the end, the page state is controlled by the VMM. The guest has zero control. All it can do is make the PTEs consistent and hold on for dear life. That's a general statement and not specific to this problem.

In other words, it's fine for CoCo folks to be paranoid. It's fine for them to set panic_on_{warn,oops,whatever}=1. But it's *NOT* fine to say that every TDX guest will want to do that.
From: Dave Hansen <dave.hansen@intel.com> Sent: Thursday, October 19, 2023 12:13 PM
>
> On 10/19/23 10:05, Michael Kelley (LINUX) wrote:
> > I'm more in favor of the "simply panic" approach. What you've done in your Patch 1 and Patch 2 is an intriguing way to try to get the memory back into a consistent state. But I'm concerned that there are failure modes that make it less than 100% foolproof (more on that below). If we can't be sure that the memory is back in a consistent state, then the original problem isn't fully solved. I'm also not sure of the value of investing effort to ensure that some error cases are handled without panic'ing. The upside benefit of not panic'ing seems small compared to the downside risk of leaking guest VM data to the host.
>
> panic() should be a last resort. We *always* continue unless we know that something is so bad that we're going to make things worse by continuing to run.
>
> We shouldn't panic() on the first little thing that goes wrong. If folks want *that*, then they can set panic_on_warn.
>
> > My concern about Patches 1 and 2 is that the encryption bit in the PTE is not a reliable indicator of the state that the host thinks the page is in. Changing the state requires two steps (in either order): 1) updating the guest VM PTEs, and 2) updating the host's view of the page state. Both steps may be done on a range of pages. If #2 fails, the guest doesn't know which pages in the batch were updated and which were not, so the guest PTEs may not match the host state. In such a case, set_memory_encrypted() could succeed based on checking the PTEs when in fact the host still thinks some of the pages are shared. Such a mismatch will produce a guest panic later on if the page is referenced.
>
> I think that's OK. In the end, the page state is controlled by the VMM. The guest has zero control. All it can do is make the PTEs consistent and hold on for dear life. That's a general statement and not specific to this problem.
>
> In other words, it's fine for CoCo folks to be paranoid. It's fine for them to set panic_on_{warn,oops,whatever}=1. But it's *NOT* fine to say that every TDX guest will want to do that.

The premise of this patch set is to not put pages on the Linux guest free list that are shared. I agree with that premise. But more precisely, the best we can do is not put pages on the free list where the guest PTE indicates "shared". Even if the host is not acting maliciously, errors can cause the guest and host to be out-of-sync regarding a page's private/shared status. There's no way to find out for sure if the host status is "private" before returning such a page to the free list, though if set_memory_encrypted() succeeds and the host is not malicious, we should be reasonably safe.

For paranoid CoCo VM users, using panic_on_warn=1 seems workable. However, with current code and this patch series, it's possible to have set_memory_decrypted() return an error and have set_memory_encrypted() fix things up as best it can without generating any warnings. It seems like we need a WARN or some equivalent mechanism if either of these fails, so that CoCo VMs can panic if they don't want to run with any inconsistencies (again, assuming the host isn't malicious).

Also, from a troubleshooting standpoint, panic_on_warn=1 will make it easier to diagnose a failure of set_memory_encrypted()/decrypted() if it is caught immediately, versus putting a page with an inconsistent state on the free list and having things blow up later.

Michael
On 10/23/23 09:47, Michael Kelley (LINUX) wrote:
> For paranoid CoCo VM users, using panic_on_warn=1 seems workable. However, with current code and this patch series, it's possible to have set_memory_decrypted() return an error and have set_memory_encrypted() fix things up as best it can without generating any warnings. It seems like we need a WARN or some equivalent mechanism if either of these fails, so that CoCo VMs can panic if they don't want to run with any inconsistencies (again, assuming the host isn't malicious).

Adding a warning to the fixup path in set_memory_encrypted() would be totally fine with me.
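The compromise reached above (always warn on a failed conversion, and let paranoid users turn warnings into panics) can be sketched as a small userspace model. Everything here is an assumption made for illustration: `model_panic_on_warn` merely imitates the effect of the kernel's panic_on_warn setting, and `model_warn()` is a stand-in for WARN(), not the kernel macro.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the panic_on_warn policy knob. */
static bool model_panic_on_warn;

static int model_warn_count;  /* how many warnings were emitted */
static bool model_panicked;   /* set where a real kernel would panic() */

/* Stand-in for WARN(): always logs, escalates only if policy says so. */
static void model_warn(const char *msg)
{
    model_warn_count++;
    fprintf(stderr, "WARNING: %s\n", msg);
    if (model_panic_on_warn)
        model_panicked = true;
}

/*
 * Model of the fixup path: conversion_failed says whether converting the
 * page back to private failed. Returns true if the page may be freed,
 * false if it has to be leaked.
 */
static bool model_fixup_and_check(bool conversion_failed)
{
    if (conversion_failed) {
        model_warn("could not convert page back to private, leaking it");
        return false;
    }
    return true;
}
```

With the default policy the guest keeps running and leaks the page; with the policy knob set, the same failure stops the guest at the point of detection rather than later.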
On Mon, 2023-10-23 at 16:47 +0000, Michael Kelley (LINUX) wrote:
> From: Dave Hansen <dave.hansen@intel.com> Sent: Thursday, October 19, 2023 12:13 PM
> >
> > On 10/19/23 10:05, Michael Kelley (LINUX) wrote:
> > > I'm more in favor of the "simply panic" approach. What you've done in your Patch 1 and Patch 2 is an intriguing way to try to get the memory back into a consistent state. But I'm concerned that there are failure modes that make it less than 100% foolproof (more on that below). If we can't be sure that the memory is back in a consistent state, then the original problem isn't fully solved. I'm also not sure of the value of investing effort to ensure that some error cases are handled without panic'ing. The upside benefit of not panic'ing seems small compared to the downside risk of leaking guest VM data to the host.
> >
> > panic() should be a last resort. We *always* continue unless we know that something is so bad that we're going to make things worse by continuing to run.
> >
> > We shouldn't panic() on the first little thing that goes wrong. If folks want *that*, then they can set panic_on_warn.
> >
> > > My concern about Patches 1 and 2 is that the encryption bit in the PTE is not a reliable indicator of the state that the host thinks the page is in. Changing the state requires two steps (in either order): 1) updating the guest VM PTEs, and 2) updating the host's view of the page state. Both steps may be done on a range of pages. If #2 fails, the guest doesn't know which pages in the batch were updated and which were not, so the guest PTEs may not match the host state. In such a case, set_memory_encrypted() could succeed based on checking the PTEs when in fact the host still thinks some of the pages are shared. Such a mismatch will produce a guest panic later on if the page is referenced.
> >
> > I think that's OK. In the end, the page state is controlled by the VMM. The guest has zero control. All it can do is make the PTEs consistent and hold on for dear life. That's a general statement and not specific to this problem.
> >
> > In other words, it's fine for CoCo folks to be paranoid. It's fine for them to set panic_on_{warn,oops,whatever}=1. But it's *NOT* fine to say that every TDX guest will want to do that.
>
> The premise of this patch set is to not put pages on the Linux guest free list that are shared. I agree with that premise. But more precisely, the best we can do is not put pages on the free list where the guest PTE indicates "shared". Even if the host is not acting maliciously, errors can cause the guest and host to be out-of-sync regarding a page's private/shared status. There's no way to find out for sure if the host status is "private" before returning such a page to the free list, though if set_memory_encrypted() succeeds and the host is not malicious, we should be reasonably safe.
>
> For paranoid CoCo VM users, using panic_on_warn=1 seems workable. However, with current code and this patch series, it's possible to have set_memory_decrypted() return an error and have set_memory_encrypted() fix things up as best it can without generating any warnings. It seems like we need a WARN or some equivalent mechanism if either of these fails, so that CoCo VMs can panic if they don't want to run with any inconsistencies (again, assuming the host isn't malicious).

Always warning seems reasonable, given the added logic around retrying. The GHCI spec does say this is something that can happen, though. So the warning is kind of like "the host is being legal, but a bit unreasonable". This will also save having to add warnings in all of the callers, which are missing in some spots currently.

>
> Also, from a troubleshooting standpoint, panic_on_warn=1 will make it easier to diagnose a failure of set_memory_encrypted()/decrypted() if it is caught immediately, versus putting a page with an inconsistent state on the free list and having things blow up later.

If the guest doesn't notify the host of anything, I wonder if there will be scenarios where the host doesn't know that there is anything wrong in the TD. Maybe a naive question, but do you have any insight into how errors in guest logs might make their way to the host admin in typical usage?