Message ID | 20231024234829.1443125-1-rick.p.edgecombe@intel.com |
---|---|
State | New |
Headers | From: Rick Edgecombe <rick.p.edgecombe@intel.com> To: x86@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, luto@kernel.org, peterz@infradead.org, kirill.shutemov@linux.intel.com, elena.reshetova@intel.com, isaku.yamahata@intel.com, seanjc@google.com, Michael Kelley <mikelley@microsoft.com>, thomas.lendacky@amd.com, decui@microsoft.com, sathyanarayanan.kuppuswamy@linux.intel.com, linux-kernel@vger.kernel.org Cc: rick.p.edgecombe@intel.com Subject: [PATCH] x86/mm/cpa: Warn if set_memory_XXcrypted() fails Date: Tue, 24 Oct 2023 16:48:29 -0700 Message-Id: <20231024234829.1443125-1-rick.p.edgecombe@intel.com> X-Mailer: git-send-email 2.34.1 List-ID: <linux-kernel.vger.kernel.org> |
Series |
x86/mm/cpa: Warn if set_memory_XXcrypted() fails
|
|
Commit Message
Edgecombe, Rick P
Oct. 24, 2023, 11:48 p.m. UTC
On TDX it is possible for the untrusted host to cause
set_memory_encrypted() or set_memory_decrypted() to fail such that an
error is returned and the resulting memory is shared. Callers need to take
care to handle these errors to avoid returning decrypted (shared) memory to
the page allocator, which could lead to functional or security issues.

Such errors may herald future system instability, but are temporarily
survivable with proper handling in the caller. The kernel traditionally
makes every effort to keep running, but it is expected that some coco
guests may prefer to play it safe security-wise, and panic in this case.
To accommodate both cases, warn when the arch breakouts for converting
memory at the VMM layer return an error to CPA. Security focused users
can rely on panic_on_warn to defend against bugs in the callers.

Since the arch breakouts host the logic for handling coco implementation
specific errors, an error returned from them means that the set_memory()
call is out of options for handling the error internally. Make this the
condition to warn about.

It is possible that very rarely these functions could fail due to guest
memory pressure (in the case of failing to allocate a huge page when
splitting a page table). Don't warn in this case because it is a lot less
likely to indicate an attack by the host and it is not clear which
set_memory() calls should get the same treatment. That corner should be
addressed by future work that considers the more general problem and not
just papers over a single set_memory() variant.

Suggested-by: Michael Kelley (LINUX) <mikelley@microsoft.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
This is a followup to the "Handle set_memory_XXcrypted() errors"
series[0].

Previously[1] I attempted to create a useful helper to both simplify the
callers and provide an official example of how to handle conversion
errors. Dave pointed out that there wasn't actually any code savings in
the callers using it. It also required a whole additional patch to make
set_memory_XXcrypted() more robust.

I tried to create some more sensible helper, but in the end gave up. My
current plan is to just add a warning for VMM failures around this. And
then shortly after, pursue open coded fixes for the callers that are
problems for TDX. There are some SEV and SME specifics callers, that I am
not sure on. But I'm under the impression that as long as that side
terminates the guest on error, they should be harmless.

[0] https://lore.kernel.org/lkml/20231017202505.340906-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/lkml/20231017202505.340906-2-rick.p.edgecombe@intel.com/
---
arch/x86/mm/pat/set_memory.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
Comments
On 10/24/23 18:48, Rick Edgecombe wrote:
> Suggested-by: Michael Kelley (LINUX) <mikelley@microsoft.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>

> I tried to create some more sensible helper, but in the end gave up. My
> current plan is to just add a warning for VMM failures around this. And
> then shortly after, pursue open coded fixes for the callers that are
> problems for TDX. There are some SEV and SME specifics callers, that I am
> not sure on. But I'm under the impression that as long as that side
> terminates the guest on error, they should be harmless.

Under SEV, when making a page private/encrypted and the hypervisor does
not assign the page to the guest (encrypted), but says it did, then when
SEV tries to perform the PVALIDATE in the enc_status_change_finish() call,
a nested page fault (#NPF) will be generated and exit to the hypervisor.
Until the hypervisor assigns the page to the guest, the guest will not be
able to make forward progress in regards to updating or using that page.

And if the hypervisor returns an error when changing the page state, then,
yes, the guest will terminate.

Thanks,
Tom

> ---
>  arch/x86/mm/pat/set_memory.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index bda9f129835e..dade281f449b 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2153,7 +2153,7 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>
>  	/* Notify hypervisor that we are about to set/clr encryption attribute. */
>  	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
> -		return -EIO;
> +		goto vmm_fail;
>
>  	ret = __change_page_attr_set_clr(&cpa, 1);
>
> @@ -2167,12 +2167,20 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>  	cpa_flush(&cpa, 0);
>
>  	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
> -	if (!ret) {
> -		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> -			ret = -EIO;
> -	}
> +	if (ret)
> +		goto out;
>
> +	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> +		goto vmm_fail;
> +
> +out:
>  	return ret;
> +
> +vmm_fail:
> +	WARN_ONCE(1, "CPA VMM failure to convert memory (addr=%p, numpages=%d) to %s.\n",
> +		  (void *)addr, numpages, enc ? "private" : "shared");
> +
> +	return -EIO;
>  }
>
>  static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
On 10/24/2023 4:48 PM, Rick Edgecombe wrote:
> Suggested-by: Michael Kelley (LINUX) <mikelley@microsoft.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---

Looks good to me.

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>  	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
> -	if (!ret) {
> -		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> -			ret = -EIO;
> -	}
> +	if (ret)
> +		goto out;

IMO, you can avoid the "out" label with a (!ret && !x86_platform....)
check. But it is up to you.

> +	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> +		goto vmm_fail;
> +
> +out:
>  	return ret;
From: Rick Edgecombe <rick.p.edgecombe@intel.com> Sent: Tuesday, October 24, 2023 4:48 PM
>
> On TDX it is possible for the untrusted host to cause
> set_memory_encrypted() or set_memory_decrypted() to fail such that an
> error is returned and the resulting memory is shared. Callers need to take
> care to handle these errors to avoid returning decrypted (shared) memory to
> the page allocator, which could lead to functional or security issues.

I think you mean "shared" as indicated by the guest page tables (vs. "shared"
as the state of the page from the host standpoint). Some precision on
that distinction seems useful here and in follow-on patches to make callers'
error handling be correct. As I understand it, the premise is that if the
guest is accessing a page as private, and the host/VMM has messed
around with the page private/shared status, the confidentiality of the
VM is protected. The risk of leakage occurs when the guest is accessing
a page as shared, so kernel code must guard against putting memory
on the free list if the guest page tables are marked shared.

> Such errors may herald future system instability, but are temporarily
> survivable with proper handling in the caller. The kernel traditionally
> makes every effort to keep running, but it is expected that some coco
> guests may prefer to play it safe security-wise, and panic in this case.
> To accommodate both cases, warn when the arch breakouts for converting
> memory at the VMM layer return an error to CPA. Security focused users
> can rely on panic_on_warn to defend against bugs in the callers.

To me, this sentence doesn't fully characterize why panic_on_warn
would be used. You describe one reason, which is a caller that fails to
properly handle an error and incorrectly puts memory with a "shared"
guest PTE on the free list. But getting an error back also implies that
something unknown has gone wrong with the CoCo mechanism for
managing private vs. shared pages. Security focused users would not
take the risk of continuing to operate with that kind of unknown error
in the core mechanism of a CoCo VM.

> +vmm_fail:
> +	WARN_ONCE(1, "CPA VMM failure to convert memory (addr=%p, numpages=%d) to %s.\n",
> +		  (void *)addr, numpages, enc ? "private" : "shared");

I'm not sure about outputting the "addr" value. It could be
useful, but the %p format specifier hashes the value unless the
kernel is booted with "no_hash_pointers". Should %px be used
so the address is output unmodified?

> +
> +	return -EIO;
>  }
>
>  static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> --
> 2.34.1

My comments notwithstanding, I'm good with this overall change and
the additional level of protection it offers to CoCo VM users.

Michael
On Thu, 2023-10-26 at 00:35 +0000, Michael Kelley (LINUX) wrote:
> I think you mean "shared" as indicated by the guest page tables (vs.
> "shared" as the state of the page from the host standpoint). Some
> precision on that distinction seems useful here and in follow-on
> patches to make callers' error handling be correct. As I understand
> it, the premise is that if the guest is accessing a page as private,
> and the host/VMM has messed around with the page private/shared
> status, the confidentiality of the VM is protected. The risk of
> leakage occurs when the guest is accessing a page as shared, so
> kernel code must guard against putting memory on the free list if
> the guest page tables are marked shared.

For TDX, the scenario of concern in the VMM error case is if the page
is mapped as shared in the guest page tables *and* it is either also
marked as shared in the EPT, or the VMM supports automatically
converting it on access. In the attacker scenario, I think the problem
is just that it is marked shared in the guest.

I can clarify that it needs to be mapped shared in the guest for there
to be a problem, but I don't see how it will help the patches to fix
the callers. It seems like too many details for the callers to know
about. For example, I think some architectures don't change the PTEs at
all. The callers abstract shared and private at a higher level.

> To me, this sentence doesn't fully characterize why panic_on_warn
> would be used. You describe one reason, which is a caller that fails
> to properly handle an error and incorrectly puts memory with a
> "shared" guest PTE on the free list. But getting an error back also
> implies that something unknown has gone wrong with the CoCo mechanism
> for managing private vs. shared pages. Security focused users would
> not take the risk of continuing to operate with that kind of unknown
> error in the core mechanism of a CoCo VM.

Hmm, yea I could see that some users may want to take a hard line and
terminate if anything looks strange. The counter point is that the VMM
is actually returning a legal error here. It may be strange based on
the details of when Hyper-V and QEMU/KVM would return this error, but
not architecturally.

> > +vmm_fail:
> > +	WARN_ONCE(1, "CPA VMM failure to convert memory (addr=%p, numpages=%d) to %s.\n",
> > +		  (void *)addr, numpages, enc ? "private" : "shared");
>
> I'm not sure about outputting the "addr" value. It could be
> useful, but the %p format specifier hashes the value unless the
> kernel is booted with "no_hash_pointers". Should %px be used
> so the address is output unmodified?

Unfortunately, I don't think we can print the kernel virtual address
because those are supposed to be hidden for security reasons. Ideally,
I would prefer to print the PFN, but we won't have it here in the case
of vmalloc's. I thought it might be useful to still have some address
printed for debugging purposes.

> My comments notwithstanding, I'm good with this overall change and
> the additional level of protection it offers to CoCo VM users.

Thanks.
On Wed, 2023-10-25 at 13:03 -0500, Tom Lendacky wrote:
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>

Thanks!

> Under SEV, when making a page private/encrypted and the hypervisor
> does not assign the page to the guest (encrypted), but says it did,
> then when SEV tries to perform the PVALIDATE in the
> enc_status_change_finish() call, a nested page fault (#NPF) will be
> generated and exit to the hypervisor. Until the hypervisor assigns
> the page to the guest, the guest will not be able to make forward
> progress in regards to updating or using that page.

Yea, mismatches between guest page tables and EPT/NPT can be trouble
for TDX as well.

> And if the hypervisor returns an error when changing the page state,
> then, yes, the guest will terminate.

I guess those callbacks could be changed to return an error after all
these fixes then, if you want.
On Wed, 2023-10-25 at 11:10 -0700, Kuppuswamy Sathyanarayanan wrote:
> Looks good to me.
>
> Reviewed-by: Kuppuswamy Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com>

Thanks!

> IMO, you can avoid "out" label with (!ret && !x86_platform....)
> check. But it is up to you.

Hmm, yes it could. I think it's a little easier to read as is, but just
my opinion as well.
On 10/25/23 21:04, Edgecombe, Rick P wrote:
> On Wed, 2023-10-25 at 11:10 -0700, Kuppuswamy Sathyanarayanan wrote:
>> IMO, you can avoid "out" label with (!ret && !x86_platform....)
>> check. But it is upto you.
>
> Hmm, yes it could. I think it's a little easier to read as is, but just
> my opinion as well.

It might be even easier to read to just have:

	if (ret)
		return ret;

	if (!x86_platform...)
		goto vmm_fail

	return 0;

since jumping to the out: label just does a return anyway.

Thanks,
Tom
On 10/25/23 20:45, Edgecombe, Rick P wrote:
> On Wed, 2023-10-25 at 13:03 -0500, Tom Lendacky wrote:
>> And if the hypervisor returns an error when changing the page state,
>> then, yes, the guest will terminate.
>
> I guess those callbacks could be changed to return an error after all
> these fixes then, if you want.

Probably not necessary as we will want to terminate the guest in these
situations and having it here in this one area is easier than checking
all of the call sites.

Thanks,
Tom
On Thu, 2023-10-26 at 08:37 -0500, Tom Lendacky wrote:
> It might be even easier to read to just have:
>
> 	if (ret)
> 		return ret;
>
> 	if (!x86_platform...)
> 		goto vmm_fail
>
> 	return 0;
>
> since jumping to the out: label just does a return anyway.

Err, right. I'll change it.
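For readers following the control-flow discussion, the shape Tom suggests can be tried out in user space. The following is only a sketch under stated assumptions: `struct model_hooks`, `warn_vmm_failure_once()`, `vmm_failures`, and `set_memory_enc_pgtable_model()` are hypothetical stand-ins invented for illustration, not kernel symbols, and the real function also builds a `cpa_data` and flushes caches, which is omitted here.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-ins for the results of the x86_platform.guest hooks
 * and of __change_page_attr_set_clr(). None of these names are kernel API.
 */
struct model_hooks {
	bool prepare_ok;	/* result of the "prepare" hook */
	int cpa_ret;		/* result of the page table update, 0 or -errno */
	bool finish_ok;		/* result of the "finish" hook */
};

/* Counts every VMM-side failure, even after the message has been printed. */
static int vmm_failures;

/* Rough model of WARN_ONCE(): the message is printed only the first time. */
static void warn_vmm_failure_once(unsigned long addr, int numpages, bool enc)
{
	static bool warned;

	vmm_failures++;
	if (warned)
		return;
	warned = true;
	fprintf(stderr,
		"CPA VMM failure to convert memory (addr=%#lx, numpages=%d) to %s.\n",
		addr, numpages, enc ? "private" : "shared");
}

/* Early-return form: only failures reported by the hooks reach vmm_fail. */
static int set_memory_enc_pgtable_model(const struct model_hooks *h,
					unsigned long addr, int numpages,
					bool enc)
{
	int ret;

	if (!h->prepare_ok)
		goto vmm_fail;

	ret = h->cpa_ret;
	if (ret)
		return ret;	/* e.g. -ENOMEM from a page table split: no warning */

	if (!h->finish_ok)
		goto vmm_fail;

	return 0;

vmm_fail:
	warn_vmm_failure_once(addr, numpages, enc);
	return -EIO;
}
```

In this model a page table failure such as -ENOMEM is returned without a warning, while a failure of either hook warns and returns -EIO, which matches the distinction the commit message draws between guest memory pressure and VMM errors.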
From: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
Sent: Wednesday, October 25, 2023 6:41 PM
>
> On Thu, 2023-10-26 at 00:35 +0000, Michael Kelley (LINUX) wrote:
> > I think you mean "shared" as indicated by the guest page tables (vs. "shared"
> > as the state of the page from the host standpoint). Some precision on
> > that distinction seems useful here and in follow-on patches to make callers'
> > error handling be correct. As I understand it, the premise is that
> > if the guest is accessing a page as private, and the host/VMM has messed
> > around with the page private/shared status, the confidentiality of the
> > VM is protected. The risk of leakage occurs when the guest is accessing
> > a page as shared, so kernel code must guard against putting memory
> > on the free list if the guest page tables are marked shared.
> >
>
> For TDX, the scenario of concern in the VMM error case is if the page
> is mapped as shared in the guest page tables *and* it is either also
> marked as shared in the EPT, or the VMM supports automatically
> converting it on access. In the attacker scenario, I think the problem
> is just that it is marked shared in the guest.

Agreed.

>
> I can clarify that it needs to be mapped shared in the guest for there
> to be a problem, but I don't see how it will help the patches to fix
> the callers. It seems like too many details for the callers to know
> about. For example, I think some architectures don't change the PTEs at
> all. The callers abstract shared and private at a higher level.
>

When a caller gets an error from set_memory_decrypted(), it will
take steps to try to get the memory back into a "good" state so
that it can put the memory back on the free list. If it can't get
the memory back into a good state, then it will leak the memory.
I was thinking about how the caller will make that determination.
Is it based on whether set_memory_encrypted() succeeds? I think
that works, as long as (for x86 at least) set_memory_encrypted()
ensures that the guest PTEs are all marked "private" before it
returns success.

So maybe my comment applies to the caller in the sense of
understanding what steps the caller should take to recover from
an error, and the possible outcomes from the attempted recovery.

>
> > To me, this sentence doesn't fully characterize why panic_on_warn
> > would be used. You describe one reason, which is a caller that fails to
> > properly handle an error and incorrectly puts memory with a "shared"
> > guest PTE on the free list. But getting an error back also implies that
> > something unknown has gone wrong with the CoCo mechanism for
> > managing private vs. shared pages. Security focused users would not
> > take the risk of continuing to operate with that kind of unknown
> > error in the core mechanism of a CoCo VM.
>
> Hmm, yea I could see that some users may want to take a hard line and
> terminate if anything looks strange. The counter point is that the VMM
> is actually returning a legal error here. It may be strange based on
> the details of when HyperV and QEMU/KVM would return this error, but
> not architecturally.
>

Agreed, it may be a legal error. But even with legal errors, the
guest doesn't know whether the VMM has left the page in a private
or shared state. If the guest fixes up its PTEs to access the memory
as private and puts the memory back on the free list, that could be
a time bomb that will blow up later. More paranoid guests will prefer
to take the panic when the error is first reported.

> >
> > > +vmm_fail:
> > > +	WARN_ONCE(1, "CPA VMM failure to convert memory (addr=%p, numpages=%d) to %s.\n",
> > > +		  (void *)addr, numpages, enc ? "private" : "shared");
> > I'm not sure about outputting the "addr" value. It could be
> > useful, but the %p format specifier hashes the value unless the
> > kernel is booted with "no_hash_pointers". Should %px be used
> > so the address is output unmodified?
>
> Unfortunately, I don't think we can print the kernel virtual address
> because those are supposed to be hidden for security reasons. Ideally,
> I would prefer to print the PFN, but we won't have it here in the case
> of vmalloc's. I thought it might be useful to still have some address
> printed for debugging purposes.
>

I don't object to either approach. I was really just noting that
we won't see the actual kernel virtual address.

Michael
On Fri, 2023-10-27 at 16:37 +0000, Michael Kelley (LINUX) wrote:
> When a caller gets an error from set_memory_decrypted(), it will
> take steps to try to get the memory back into a "good" state so
> that it can put the memory back on the free list. If it can't get
> the memory back into a good state, then it will leak the memory.
> I was thinking about how the caller will make that determination.
> Is it based on whether set_memory_encrypted() succeeds? I think
> that works, as long as (for x86 at least) set_memory_encrypted()
> ensures that the guest PTEs are all marked "private" before it
> returns success.
>
> So maybe my comment applies to the caller in the sense of
> understanding what steps the caller should take to recover from
> an error, and the possible outcomes from the attempted recovery.

Since I was dropping the free_decrypted_pages() helper, I was thinking
to actually just leak the pages if set_memory_decrypted() fails. As in,
not try to recover them with set_memory_encrypted(). So the kernel will
do the 3 retries that the recent HyperV focused patch added, and then
walk away.

The kernel will already be warning about this situation, so we are not
expecting it to be common. For rare cases, it seems simpler to just
leak it, and then set_memory_encrypted() can be simpler as it doesn't
need to worry about handling mixed ranges returning success.

I'll update the log to clarify the importance of the PTE being marked
shared in the guest, and post a v2.
From: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
Sent: Friday, October 27, 2023 9:47 AM
>
> On Fri, 2023-10-27 at 16:37 +0000, Michael Kelley (LINUX) wrote:
> > When a caller gets an error from set_memory_decrypted(), it will
> > take steps to try to get the memory back into a "good" state so
> > that it can put the memory back on the free list. If it can't get
> > the memory back into a good state, then it will leak the memory.
> > I was thinking about how the caller will make that determination.
> > Is it based on whether set_memory_encrypted() succeeds? I think
> > that works, as long as (for x86 at least) set_memory_encrypted()
> > ensures that the guest PTEs are all marked "private" before it
> > returns success.
> >
> > So maybe my comment applies to the caller in the sense of
> > understanding what steps the caller should take to recover from
> > an error, and the possible outcomes from the attempted recovery.
>
> Since I was dropping the free_decrypted_pages() helper, I was thinking
> to actually just leak the pages if set_memory_decrypted() fails. As in,
> not try to recover them with set_memory_encrypted(). So the kernel will
> do the 3 retries that the recent HyperV focused patch added, and then
> walk away.
>
> The kernel will already be warning about this situation, so we are not
> expecting it to be common. For rare cases, it seems simpler to just
> leak it, and then set_memory_encrypted() can be simpler as it doesn't
> need to worry about handling mixed ranges returning success.
>

I like that approach even better than trying to fix things up and
get the memory back on the guest free list. I agree the error case
should be rare, and I'm generally leery of putting memory on the
free list when there's some doubt about the private/shared state
of the page from the host/VMM standpoint.

Michael

> I'll update the log to clarify the importance of the PTE being marked
> shared in the guest, and post a v2.
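For readers following along, the leak-on-failure idea discussed above can be sketched in plain userspace C. This is only an illustration under stated assumptions: fake_set_memory_decrypted(), alloc_shared_buffer() and the quarantine array are hypothetical names made up for the sketch, malloc() stands in for the page allocator, and none of it is the kernel API or the v2 patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_QUARANTINED 16

/* Hypothetical knob: 0 means the conversion works, negative errno means it fails. */
static int fake_decrypt_result;

/* Buffers whose private/shared state is unknown. They are never freed. */
static void *quarantine[MAX_QUARANTINED];
static size_t nr_quarantined;

/* Stand-in for set_memory_decrypted(); only returns the configured result. */
static int fake_set_memory_decrypted(void *addr, size_t len)
{
	(void)addr;
	(void)len;
	return fake_decrypt_result;
}

/*
 * Allocate a buffer and convert it to shared. On conversion failure the
 * buffer is intentionally NOT handed back to the allocator and no attempt
 * is made to convert it back to private: its state is in doubt, so it is
 * parked in the quarantine list (the "leak") and the caller sees NULL.
 */
static void *alloc_shared_buffer(size_t len)
{
	void *buf = malloc(len);

	if (!buf)
		return NULL;

	if (fake_set_memory_decrypted(buf, len)) {
		assert(nr_quarantined < MAX_QUARANTINED);
		quarantine[nr_quarantined++] = buf;
		return NULL;
	}

	return buf;
}
```

The point of the design is that the error path has exactly one outcome: the memory is gone for good, so no caller has to reason about whether a recovery attempt left a range half converted.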
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index bda9f129835e..dade281f449b 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2153,7 +2153,7 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 
 	/* Notify hypervisor that we are about to set/clr encryption attribute. */
 	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
-		return -EIO;
+		goto vmm_fail;
 
 	ret = __change_page_attr_set_clr(&cpa, 1);
 
@@ -2167,12 +2167,20 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 	cpa_flush(&cpa, 0);
 
 	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
-	if (!ret) {
-		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
-			ret = -EIO;
-	}
+	if (ret)
+		goto out;
 
+	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
+		goto vmm_fail;
+
+out:
 	return ret;
+
+vmm_fail:
+	WARN_ONCE(1, "CPA VMM failure to convert memory (addr=%p, numpages=%d) to %s.\n",
+		  (void *)addr, numpages, enc ? "private" : "shared");
+
+	return -EIO;
 }
 
 static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
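
As a rough illustration of the control flow in this patch, with the review suggestion to return directly instead of jumping to out: folded in, here is a userspace sketch. Everything in it is an assumption made for the example: struct fake_vmm, warn_once_sketch() and set_memory_enc_sketch() are hypothetical names, the hypervisor hooks and the page-table update are reduced to canned results, and the once-only warning is a simple imitation of what WARN_ONCE() provides, not the kernel macro.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the two hypervisor hooks and the CPA update. */
struct fake_vmm {
	bool prepare_ok;	/* result of the "prepare" notification */
	bool finish_ok;		/* result of the "finish" notification */
	int cpa_ret;		/* result of the page-table update, 0 or -errno */
};

static int warnings_printed;	/* number of times the warning was emitted */

/* Userspace imitation of a warn-once: only the first call prints. */
static void warn_once_sketch(unsigned long addr, int numpages, bool enc)
{
	static bool warned;

	if (warned)
		return;
	warned = true;
	warnings_printed++;
	fprintf(stderr,
		"VMM failure to convert memory (addr=%#lx, numpages=%d) to %s\n",
		addr, numpages, enc ? "private" : "shared");
}

static int set_memory_enc_sketch(const struct fake_vmm *vmm,
				 unsigned long addr, int numpages, bool enc)
{
	int ret;

	assert(numpages > 0);	/* sketch-only sanity check */

	/* A refusal from the VMM before the update goes to the common path. */
	if (!vmm->prepare_ok)
		goto vmm_fail;

	/* A page-table error is not a VMM failure: pass it through, no warning. */
	ret = vmm->cpa_ret;
	if (ret)
		return ret;

	/* A refusal from the VMM after the update goes to the same path. */
	if (!vmm->finish_ok)
		goto vmm_fail;

	return 0;

vmm_fail:
	warn_once_sketch(addr, numpages, enc);
	return -EIO;
}
```

Both hook failures share one label, so the warning text and the -EIO return live in a single place, while an error from the page-table update is returned unchanged and stays silent.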