diff mbox series

x86/mm/cpa: Warn if set_memory_XXcrypted() fails

Message ID	20231024234829.1443125-1-rick.p.edgecombe@intel.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.31 as permitted sender) client-ip=23.128.96.31;
	From: Rick Edgecombe <rick.p.edgecombe@intel.com>
	To: x86@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, luto@kernel.org, peterz@infradead.org, kirill.shutemov@linux.intel.com, elena.reshetova@intel.com, isaku.yamahata@intel.com, seanjc@google.com, Michael Kelley <mikelley@microsoft.com>, thomas.lendacky@amd.com, decui@microsoft.com, sathyanarayanan.kuppuswamy@linux.intel.com, linux-kernel@vger.kernel.org
	Cc: rick.p.edgecombe@intel.com
	Subject: [PATCH] x86/mm/cpa: Warn if set_memory_XXcrypted() fails
	Date: Tue, 24 Oct 2023 16:48:29 -0700
	Message-Id: <20231024234829.1443125-1-rick.p.edgecombe@intel.com>
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
	Precedence: bulk
Series	x86/mm/cpa: Warn if set_memory_XXcrypted() fails

Commit Message

Edgecombe, Rick P Oct. 24, 2023, 11:48 p.m. UTC

  On TDX it is possible for the untrusted host to cause
set_memory_encrypted() or set_memory_decrypted() to fail such that an
error is returned and the resulting memory is shared. Callers need to take
care to handle these errors to avoid returning decrypted (shared) memory to
the page allocator, which could lead to functional or security issues.

Such errors may herald future system instability, but are temporarily
survivable with proper handling in the caller. The kernel traditionally
makes every effort to keep running, but it is expected that some coco
guests may prefer to play it safe security-wise, and panic in this case.
To accommodate both cases, warn when the arch breakouts for converting
memory at the VMM layer return an error to CPA. Security focused users
can rely on panic_on_warn to defend against bugs in the callers.

Since the arch breakouts host the logic for handling coco implementation
specific errors, an error returned from them means that the set_memory()
call is out of options for handling the error internally. Make this the
condition to warn about.

It is possible that very rarely these functions could fail due to guest
memory pressure (in the case of failing to allocate a huge page when
splitting a page table). Don't warn in this case because it is a lot less
likely to indicate an attack by the host and it is not clear which
set_memory() calls should get the same treatment. That corner should be
addressed by future work that considers the more general problem and not
just papers over a single set_memory() variant.

Suggested-by: Michael Kelley (LINUX) <mikelley@microsoft.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
This is a followup to the "Handle set_memory_XXcrypted() errors"
series[0].

Previously[1] I attempted to create a useful helper to both simplify the
callers and provide an official example of how to handle conversion
errors. Dave pointed out that there wasn't actually any code savings in
the callers using it. It also required a whole additional patch to make
set_memory_XXcrypted() more robust.

I tried to create some more sensible helper, but in the end gave up. My
current plan is to just add a warning for VMM failures around this. And
then shortly after, pursue open coded fixes for the callers that are
problems for TDX. There are some SEV and SME specific callers that I am
not sure about. But I'm under the impression that as long as that side
terminates the guest on error, they should be harmless.

[0] https://lore.kernel.org/lkml/20231017202505.340906-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/lkml/20231017202505.340906-2-rick.p.edgecombe@intel.com/
---
 arch/x86/mm/pat/set_memory.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

Comments

Tom Lendacky Oct. 25, 2023, 6:03 p.m. UTC | #1

On 10/24/23 18:48, Rick Edgecombe wrote:
> On TDX it is possible for the untrusted host to cause
> set_memory_encrypted() or set_memory_decrypted() to fail such that an
> error is returned and the resulting memory is shared. Callers need to take
> care to handle these errors to avoid returning decrypted (shared) memory to
> the page allocator, which could lead to functional or security issues.
> 
> Such errors may herald future system instability, but are temporarily
> survivable with proper handling in the caller. The kernel traditionally
> makes every effort to keep running, but it is expected that some coco
> guests may prefer to play it safe security-wise, and panic in this case.
> To accommodate both cases, warn when the arch breakouts for converting
> memory at the VMM layer return an error to CPA. Security focused users
> can rely on panic_on_warn to defend against bugs in the callers.
> 
> Since the arch breakouts host the logic for handling coco implementation
> specific errors, an error returned from them means that the set_memory()
> call is out of options for handling the error internally. Make this the
> condition to warn about.
> 
> It is possible that very rarely these functions could fail due to guest
> memory pressure (in the case of failing to allocate a huge page when
> splitting a page table). Don't warn in this case because it is a lot less
> likely to indicate an attack by the host and it is not clear which
> set_memory() calls should get the same treatment. That corner should be
> addressed by future work that considers the more general problem and not
> just papers over a single set_memory() variant.
> 
> Suggested-by: Michael Kelley (LINUX) <mikelley@microsoft.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>

> ---
> This is a followup to the "Handle set_memory_XXcrypted() errors"
> series[0].
> 
> Previously[1] I attempted to create a useful helper to both simplify the
> callers and provide an official example of how to handle conversion
> errors. Dave pointed out that there wasn't actually any code savings in
> the callers using it. It also required a whole additional patch to make
> set_memory_XXcrypted() more robust.
> 
> I tried to create some more sensible helper, but in the end gave up. My
> current plan is to just add a warning for VMM failures around this. And
> then shortly after, pursue open coded fixes for the callers that are
> problems for TDX. There are some SEV and SME specific callers that I am
> not sure about. But I'm under the impression that as long as that side
> terminates the guest on error, they should be harmless.

Under SEV, when making a page private/encrypted and the hypervisor does 
not assign the page to the guest (encrypted), but says it did, then when 
SEV tries to perform the PVALIDATE in the enc_status_change_finish() call, 
a nested page fault (#NPF) will be generated and exit to the hypervisor. 
Until the hypervisor assigns the page to the guest, the guest will not be 
able to make forward progress in regards to updating or using that page.

And if the hypervisor returns an error when changing the page state, then, 
yes, the guest will terminate.

Thanks,
Tom

> 
> [0] https://lore.kernel.org/lkml/20231017202505.340906-1-rick.p.edgecombe@intel.com/
> [1] https://lore.kernel.org/lkml/20231017202505.340906-2-rick.p.edgecombe@intel.com/
> ---
>   arch/x86/mm/pat/set_memory.c | 18 +++++++++++++-----
>   1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index bda9f129835e..dade281f449b 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2153,7 +2153,7 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>   
>   	/* Notify hypervisor that we are about to set/clr encryption attribute. */
>   	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
> -		return -EIO;
> +		goto vmm_fail;
>   
>   	ret = __change_page_attr_set_clr(&cpa, 1);
>   
> @@ -2167,12 +2167,20 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>   	cpa_flush(&cpa, 0);
>   
>   	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
> -	if (!ret) {
> -		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> -			ret = -EIO;
> -	}
> +	if (ret)
> +		goto out;
>   
> +	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> +		goto vmm_fail;
> +
> +out:
>   	return ret;
> +
> +vmm_fail:
> +	WARN_ONCE(1, "CPA VMM failure to convert memory (addr=%p, numpages=%d) to %s.\n",
> +		  (void *)addr, numpages, enc ? "private" : "shared");
> +
> +	return -EIO;
>   }
>   
>   static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)

Kuppuswamy Sathyanarayanan Oct. 25, 2023, 6:10 p.m. UTC | #2

On 10/24/2023 4:48 PM, Rick Edgecombe wrote:
> On TDX it is possible for the untrusted host to cause
> set_memory_encrypted() or set_memory_decrypted() to fail such that an
> error is returned and the resulting memory is shared. Callers need to take
> care to handle these errors to avoid returning decrypted (shared) memory to
> the page allocator, which could lead to functional or security issues.
> 
> Such errors may herald future system instability, but are temporarily
> survivable with proper handling in the caller. The kernel traditionally
> makes every effort to keep running, but it is expected that some coco
> guests may prefer to play it safe security-wise, and panic in this case.
> To accommodate both cases, warn when the arch breakouts for converting
> memory at the VMM layer return an error to CPA. Security focused users
> can rely on panic_on_warn to defend against bugs in the callers.
> 
> Since the arch breakouts host the logic for handling coco implementation
> specific errors, an error returned from them means that the set_memory()
> call is out of options for handling the error internally. Make this the
> condition to warn about.
> 
> It is possible that very rarely these functions could fail due to guest
> memory pressure (in the case of failing to allocate a huge page when
> splitting a page table). Don't warn in this case because it is a lot less
> likely to indicate an attack by the host and it is not clear which
> set_memory() calls should get the same treatment. That corner should be
> addressed by future work that considers the more general problem and not
> just papers over a single set_memory() variant.
> 
> Suggested-by: Michael Kelley (LINUX) <mikelley@microsoft.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---

Looks good to me.

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>


> This is a followup to the "Handle set_memory_XXcrypted() errors"
> series[0].
> 
> Previously[1] I attempted to create a useful helper to both simplify the
> callers and provide an official example of how to handle conversion
> errors. Dave pointed out that there wasn't actually any code savings in
> the callers using it. It also required a whole additional patch to make
> set_memory_XXcrypted() more robust.
> 
> I tried to create some more sensible helper, but in the end gave up. My
> current plan is to just add a warning for VMM failures around this. And
> then shortly after, pursue open coded fixes for the callers that are
> problems for TDX. There are some SEV and SME specific callers that I am
> not sure about. But I'm under the impression that as long as that side
> terminates the guest on error, they should be harmless.
> 
> [0] https://lore.kernel.org/lkml/20231017202505.340906-1-rick.p.edgecombe@intel.com/
> [1] https://lore.kernel.org/lkml/20231017202505.340906-2-rick.p.edgecombe@intel.com/
> ---
>  arch/x86/mm/pat/set_memory.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index bda9f129835e..dade281f449b 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2153,7 +2153,7 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>  
>  	/* Notify hypervisor that we are about to set/clr encryption attribute. */
>  	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
> -		return -EIO;
> +		goto vmm_fail;
>  
>  	ret = __change_page_attr_set_clr(&cpa, 1);
>  
> @@ -2167,12 +2167,20 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>  	cpa_flush(&cpa, 0);
>  
>  	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
> -	if (!ret) {
> -		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> -			ret = -EIO;
> -	}
> +	if (ret)
> +		goto out;

IMO, you can avoid "out" label with (!ret && !x86_platform....) check. But it is up to
you.

>  
> +	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> +		goto vmm_fail;
> +
> +out:
>  	return ret;
> +
> +vmm_fail:
> +	WARN_ONCE(1, "CPA VMM failure to convert memory (addr=%p, numpages=%d) to %s.\n",
> +		  (void *)addr, numpages, enc ? "private" : "shared");
> +
> +	return -EIO;
>  }
>  
>  static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)

Michael Kelley (LINUX) Oct. 26, 2023, 12:35 a.m. UTC | #3

From: Rick Edgecombe <rick.p.edgecombe@intel.com> Sent: Tuesday, October 24, 2023 4:48 PM
> 
> On TDX it is possible for the untrusted host to cause
> set_memory_encrypted() or set_memory_decrypted() to fail such that an
> error is returned and the resulting memory is shared.  Callers need to take
> care to handle these errors to avoid returning decrypted (shared) memory to
> the page allocator, which could lead to functional or security issues.

I think you mean "shared" as indicated by the guest page tables (vs. "shared"
as the state of the page from the host standpoint).  Some precision on
that distinction seems useful here and in follow-on patches to make callers'
error handling be correct.   As I understand it, the premise is that if the
guest is accessing a page as private, and the host/VMM has messed
around with the page private/shared status, the confidentiality of the
VM is protected.  The risk of leakage occurs when the guest is accessing
a page as shared, so kernel code must guard against putting memory
on the free list if the guest page tables are marked shared.

> 
> Such errors may herald future system instability, but are temporarily
> survivable with proper handling in the caller. The kernel traditionally
> makes every effort to keep running, but it is expected that some coco
> guests may prefer to play it safe security-wise, and panic in this case.
> To accommodate both cases, warn when the arch breakouts for converting
> memory at the VMM layer return an error to CPA. Security focused users
> can rely on panic_on_warn to defend against bugs in the callers.

To me, this sentence doesn't fully characterize why panic_on_warn
would be used.  You describe one reason, which is a caller that fails to
properly handle an error and incorrectly puts memory with a "shared"
guest PTE on the free list.  But getting an error back also implies that
something unknown has gone wrong with the CoCo mechanism for
managing private vs. shared pages.  Security focused users would not
take the risk of continuing to operate with that kind of unknown error
in the core mechanism of a CoCo VM.

> 
> Since the arch breakouts host the logic for handling coco implementation
> specific errors, an error returned from them means that the set_memory()
> call is out of options for handling the error internally. Make this the
> condition to warn about.
> 
> It is possible that very rarely these functions could fail due to guest
> memory pressure (in the case of failing to allocate a huge page when
> splitting a page table). Don't warn in this case because it is a lot less
> likely to indicate an attack by the host and it is not clear which
> set_memory() calls should get the same treatment. That corner should be
> addressed by future work that considers the more general problem and not
> just papers over a single set_memory() variant.
> 
> Suggested-by: Michael Kelley (LINUX) <mikelley@microsoft.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> This is a followup to the "Handle set_memory_XXcrypted() errors"
> series[0].
> 
> Previously[1] I attempted to create a useful helper to both simplify the
> callers and provide an official example of how to handle conversion
> errors. Dave pointed out that there wasn't actually any code savings in
> the callers using it. It also required a whole additional patch to make
> set_memory_XXcrypted() more robust.
> 
> I tried to create some more sensible helper, but in the end gave up. My
> current plan is to just add a warning for VMM failures around this. And
> then shortly after, pursue open coded fixes for the callers that are
> problems for TDX. There are some SEV and SME specific callers that I am
> not sure about. But I'm under the impression that as long as that side
> terminates the guest on error, they should be harmless.
> 
> [0] https://lore.kernel.org/lkml/20231017202505.340906-1-rick.p.edgecombe@intel.com/
> [1] https://lore.kernel.org/lkml/20231017202505.340906-2-rick.p.edgecombe@intel.com/
> ---
>  arch/x86/mm/pat/set_memory.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index bda9f129835e..dade281f449b 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2153,7 +2153,7 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> 
>  	/* Notify hypervisor that we are about to set/clr encryption attribute. */
>  	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
> -		return -EIO;
> +		goto vmm_fail;
> 
>  	ret = __change_page_attr_set_clr(&cpa, 1);
> 
> @@ -2167,12 +2167,20 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
>  	cpa_flush(&cpa, 0);
> 
>  	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
> -	if (!ret) {
> -		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> -			ret = -EIO;
> -	}
> +	if (ret)
> +		goto out;
> 
> +	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> +		goto vmm_fail;
> +
> +out:
>  	return ret;
> +
> +vmm_fail:
> +	WARN_ONCE(1, "CPA VMM failure to convert memory (addr=%p, numpages=%d) to %s.\n",
> +		  (void *)addr, numpages, enc ? "private" : "shared");

I'm not sure about outputting the "addr" value.  It could be
useful, but the %p format specifier hashes the value unless the
kernel is booted with "no_hash_pointers".   Should %px be used
so the address is output unmodified?

> +
> +	return -EIO;
>  }
> 
>  static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> --
> 2.34.1

My comments notwithstanding, I'm good with this overall change and
the additional level of protection it offers to CoCo VM users.

Michael

Edgecombe, Rick P Oct. 26, 2023, 1:40 a.m. UTC | #4

On Thu, 2023-10-26 at 00:35 +0000, Michael Kelley (LINUX) wrote:
> I think you mean "shared" as indicated by the guest page tables (vs. "shared"
> as the state of the page from the host standpoint).  Some precision on
> that distinction seems useful here and in follow-on patches to make callers'
> error handling be correct.   As I understand it, the premise is that if the
> guest is accessing a page as private, and the host/VMM has messed
> around with the page private/shared status, the confidentiality of the
> VM is protected.  The risk of leakage occurs when the guest is accessing
> a page as shared, so kernel code must guard against putting memory
> on the free list if the guest page tables are marked shared.
> 

For TDX, the scenario of concern in the VMM error case is if the page
is mapped as shared in the guest page tables *and* it is either also
marked as shared in the EPT, or the VMM supports automatically
converting it on access. In the attacker scenario, I think the problem
is just that it is marked shared in the guest.

I can clarify that it needs to be mapped shared in the guest for there
to be a problem, but I don't see how it will help the patches to fix
the callers. It seems like too many details for the callers to know
about. For example, I think some architectures don't change the PTEs at
all. The callers abstract shared and private at a higher level.

> To me, this sentence doesn't fully characterize why panic_on_warn
> would be used.  You describe one reason, which is a caller that fails to
> properly handle an error and incorrectly puts memory with a "shared"
> guest PTE on the free list.  But getting an error back also implies that
> something unknown has gone wrong with the CoCo mechanism for
> managing private vs. shared pages.  Security focused users would not
> take the risk of continuing to operate with that kind of unknown error
> in the core mechanism of a CoCo VM.

Hmm, yea I could see that some users may want to take a hard line and
terminate if anything looks strange. The counter point is that the VMM
is actually returning a legal error here. It may be strange based on
the details of when HyperV and QEMU/KVM would return this error, but
not architecturally.

> 
> > +vmm_fail:
> > +       WARN_ONCE(1, "CPA VMM failure to convert memory (addr=%p, numpages=%d) to %s.\n",
> > +                 (void *)addr, numpages, enc ? "private" : "shared");
> 
> I'm not sure about outputting the "addr" value.  It could be
> useful, but the %p format specifier hashes the value unless the
> kernel is booted with "no_hash_pointers".   Should %px be used
> so the address is output unmodified?

Unfortunately, I don't think we can print the kernel virtual address
because those are supposed to be hidden for security reasons. Ideally,
I would prefer to print the PFN, but we won't have it here in the case
of vmalloc's. I thought it might be useful to still have some address
printed for debugging purposes.

> 
> > +
> > +       return -EIO;
> >   }
> > 
> >   static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> > --
> > 2.34.1
> 
> My comments notwithstanding, I'm good with this overall change and
> the additional level of protection it offers to CoCo VM users.

Thanks.

Edgecombe, Rick P Oct. 26, 2023, 1:45 a.m. UTC | #5

On Wed, 2023-10-25 at 13:03 -0500, Tom Lendacky wrote:
> 
> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>

Thanks!
> > 
> 
> Under SEV, when making a page private/encrypted and the hypervisor does
> not assign the page to the guest (encrypted), but says it did, then when
> SEV tries to perform the PVALIDATE in the enc_status_change_finish() call,
> a nested page fault (#NPF) will be generated and exit to the hypervisor.
> Until the hypervisor assigns the page to the guest, the guest will not be
> able to make forward progress in regards to updating or using that page.

Yea, mismatches between guest page tables and EPT/NPT can be trouble
for TDX as well.

> 
> And if the hypervisor returns an error when changing the page state, then,
> yes, the guest will terminate.

I guess those callbacks could be changed to return an error after all
these fixes then, if you want.

Edgecombe, Rick P Oct. 26, 2023, 2:04 a.m. UTC | #6

On Wed, 2023-10-25 at 11:10 -0700, Kuppuswamy Sathyanarayanan wrote:
> Looks good to me.
> 
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 

Thanks!

> 
> IMO, you can avoid "out" label with (!ret && !x86_platform....) check. But it is up to
> you.

Hmm, yes it could. I think it's a little easier to read as is, but just
my opinion as well.

Tom Lendacky Oct. 26, 2023, 1:37 p.m. UTC | #7

On 10/25/23 21:04, Edgecombe, Rick P wrote:
> On Wed, 2023-10-25 at 11:10 -0700, Kuppuswamy Sathyanarayanan wrote:
>> Looks good to me.
>>
>> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
> 
> Thanks!
> 
>>
>> IMO, you can avoid "out" label with (!ret && !x86_platform....) check. But it is up to
>> you.
> 
> Hmm, yes it could. I think it's a little easier to read as is, but just
> my opinion as well.

It might be even easier to read to just have:

	if (ret)
		return ret;

	if (!x86_platform...)
		goto vmm_fail;

	return 0;

since jumping to the out: label just does a return anyway.

Thanks,
Tom

Tom Lendacky Oct. 26, 2023, 1:43 p.m. UTC | #8

On 10/25/23 20:45, Edgecombe, Rick P wrote:
> On Wed, 2023-10-25 at 13:03 -0500, Tom Lendacky wrote:
>>
>> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> 
> Thanks!
>>>
>>
>> Under SEV, when making a page private/encrypted and the hypervisor does
>> not assign the page to the guest (encrypted), but says it did, then when
>> SEV tries to perform the PVALIDATE in the enc_status_change_finish() call,
>> a nested page fault (#NPF) will be generated and exit to the hypervisor.
>> Until the hypervisor assigns the page to the guest, the guest will not be
>> able to make forward progress in regards to updating or using that page.
> 
> Yea, mismatches between guest page tables and EPT/NPT can be trouble
> for TDX as well.
> 
>>
>> And if the hypervisor returns an error when changing the page state, then,
>> yes, the guest will terminate.
> 
> I guess those callbacks could be changed to return an error after all
> these fixes then, if you want.

Probably not necessary as we will want to terminate the guest in these 
situations and having it here in this one area is easier than checking all 
of the call sites.

Thanks,
Tom

Edgecombe, Rick P Oct. 26, 2023, 10:07 p.m. UTC | #9

On Thu, 2023-10-26 at 08:37 -0500, Tom Lendacky wrote:
> It might be even easier to read to just have:
> 
>         if (ret)
>                 return ret;
> 
>         if (!x86_platform...)
>                 goto vmm_fail;
> 
>         return 0;
> 
> since jumping to the out: label just does a return anyway.

Err, right. I'll change it.
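
For reference, the control flow agreed on in this exchange can be sketched as a small userspace model. This is only an illustration under stated assumptions: the struct cpa_model type and model_set_memory_enc_pgtable() are hypothetical stand-ins, the struct fields replace the results of the real enc_status_change_prepare(), __change_page_attr_set_clr() and enc_status_change_finish() calls, the cpa_flush() step is omitted, and a counter plus fprintf() stands in for WARN_ONCE().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Userspace model only. The fields stand in for the results of the real
 * kernel calls; the struct and function names here are hypothetical.
 */
struct cpa_model {
	bool prepare_ok;	/* enc_status_change_prepare() result */
	int cpa_ret;		/* __change_page_attr_set_clr() result */
	bool finish_ok;		/* enc_status_change_finish() result */
	int warnings;		/* number of VMM-failure warnings emitted */
};

static int model_set_memory_enc_pgtable(struct cpa_model *m, bool enc)
{
	int ret;

	assert(m != NULL);

	if (!m->prepare_ok)
		goto vmm_fail;

	ret = m->cpa_ret;
	if (ret)
		return ret;	/* guest-side CPA failure: no warning */

	if (!m->finish_ok)
		goto vmm_fail;

	return 0;

vmm_fail:
	/* The patch uses WARN_ONCE() here; the model just counts and logs. */
	m->warnings++;
	fprintf(stderr, "model: VMM failed to convert memory to %s\n",
		enc ? "private" : "shared");
	return -EIO;
}
```

With this shape only the two VMM callbacks reach the warning path, while a CPA error such as -ENOMEM is returned to the caller unchanged and without a warning, which matches the distinction drawn in the commit message.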

Michael Kelley (LINUX) Oct. 27, 2023, 4:37 p.m. UTC | #10

From: Edgecombe, Rick P <rick.p.edgecombe@intel.com> Sent: Wednesday, October 25, 2023 6:41 PM
> 
> On Thu, 2023-10-26 at 00:35 +0000, Michael Kelley (LINUX) wrote:
> > I think you mean "shared" as indicated by the guest page tables (vs. "shared"
> > as the state of the page from the host standpoint).  Some precision on
> > that distinction seems useful here and in follow-on patches to make callers'
> > error handling be correct.   As I understand it, the premise is that
> > if the guest is accessing a page as private, and the host/VMM has messed
> > around with the page private/shared status, the confidentiality of the
> > VM is protected.  The risk of leakage occurs when the guest is accessing
> > a page as shared, so kernel code must guard against putting memory
> > on the free list if the guest page tables are marked shared.
> >
> 
> For TDX, the scenario of concern in the VMM error case is if the page
> is mapped as shared in the guest page tables *and* it is either also
> marked as shared in the EPT, or the VMM supports automatically
> converting it on access. In the attacker scenario, I think the problem
> is just that it is marked shared in the guest.

Agreed.

> 
> I can clarify that it needs to be mapped shared in the guest for there
> to be a problem, but I don't see how it will help the patches to fix
> the callers. It seems like too many details for the callers to know
> about. For example, I think some architectures don't change the PTEs at
> all. The callers abstract shared and private at a higher level.
> 

When a caller gets an error from set_memory_decrypted(), it will
take steps to try to get the memory back into a "good" state so
that it can put the memory back on the free list.   If it can't get
the memory back into a good state, then it will leak the memory.
I was thinking about how the caller will make that determination.
Is it based on whether set_memory_encrypted() succeeds?  I think
that works, as long as (for x86 at least) set_memory_encrypted()
ensures that the guest PTEs are all marked "private" before it
returns success.

So maybe my comment applies to the caller in the sense of
understanding what steps the caller should take to recover from
an error, and the possible outcomes from the attempted recovery.

> 
> > To me, this sentence doesn't fully characterize why panic_on_warn
> > would be used.  You describe one reason, which is a caller that fails to
> > properly handle an error and incorrectly puts memory with a "shared"
> > guest PTE on the free list.  But getting an error back also implies that
> > something unknown has gone wrong with the CoCo mechanism for
> > managing private vs. shared pages.  Security focused users would not
> > take the risk of continuing to operate with that kind of unknown
> > error in the core mechanism of a CoCo VM.
> 
> Hmm, yea I could see that some users may want to take a hard line and
> terminate if anything looks strange. The counter point is that the VMM
> is actually returning a legal error here. It may be strange based on
> the details of when HyperV and QEMU/KVM would return this error, but
> not architecturally.
> 

Agreed, it may be a legal error.  But even with legal errors, the guest
doesn't know whether the VMM has left the page in a private or
shared state.   If the guest fixes up its PTEs to access the memory
as private and puts the memory back on the free list, that could
be a time bomb that will blow up later.  More paranoid guests
will prefer to take the panic when the error is first reported.

> >
> > > +vmm_fail:
> > > +       WARN_ONCE(1, "CPA VMM failure to convert memory (addr=%p, numpages=%d) to %s.\n",
> > > +                 (void *)addr, numpages, enc ? "private" : "shared");
> >
> > I'm not sure about outputting the "addr" value.  It could be
> > useful, but the %p format specifier hashes the value unless the
> > kernel is booted with "no_hash_pointers".   Should %px be used
> > so the address is output unmodified?
> 
> Unfortunately, I don't think we can print the kernel virtual address
> because those are supposed to be hidden for security reasons. Ideally,
> I would prefer to print the PFN, but we won't have it here in the case
> of vmalloc's. I thought it might be useful to still have some address
> printed for debugging purposes.
> 

I don't object to either approach.  I was really just noting that
we won't see the actual kernel virtual address.

Michael

Edgecombe, Rick P Oct. 27, 2023, 4:46 p.m. UTC | #11

On Fri, 2023-10-27 at 16:37 +0000, Michael Kelley (LINUX) wrote:
> When a caller gets an error from set_memory_decrypted(), it will
> take steps to try to get the memory back into a "good" state so
> that it can put the memory back on the free list.   If it can't get
> the memory back into a good state, then it will leak the memory.
> I was thinking about how the caller will make that determination.
> Is it based on whether set_memory_encrypted() succeeds?  I think
> that works, as long as (for x86 at least) set_memory_encrypted()
> ensures that the guest PTEs are all marked "private" before it
> returns success.
> 
> So maybe my comment applies to the caller in the sense of
> understanding what steps the caller should take to recover from
> an error, and the possible outcomes from the attempted recovery.

Since I was dropping the free_decrypted_pages() helper, I was thinking to
actually just leak the pages if set_memory_decrypted() fails. As in, not
try to recover them with set_memory_encrypted(). So the kernel will do
the 3 retries that the recent HyperV focused patch added, and then walk
away.

The kernel will already be warning about this situation, so we do not
expect it to be common. For rare cases, it seems simpler to just
leak it, and then set_memory_encrypted() can be simpler as it doesn't
need to worry about handling mixed ranges returning success.

I'll update the log to clarify the importance of the PTE being marked
shared in the guest, and post a v2.
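
As a rough illustration of the leak-on-failure idea described above, here is a userspace sketch. Everything in it is an assumption made for the example: fake_set_memory_decrypted() is a stub standing in for set_memory_decrypted(), alloc_shared_buffer() is a hypothetical caller, MODEL_PAGE_SIZE is an arbitrary constant, and calloc()/free() stand in for the page allocator. It is not taken from any of the callers that the follow-up patches would touch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096u	/* arbitrary page size for the model */

/* Test knob: when true, the stubbed conversion fails like a VMM error. */
static bool vmm_refuses_conversion;

/* Running total of pages intentionally leaked after a failed conversion. */
static size_t leaked_pages;

/* Hypothetical stub standing in for set_memory_decrypted(). */
static int fake_set_memory_decrypted(void *buf, int numpages)
{
	(void)buf;
	(void)numpages;
	return vmm_refuses_conversion ? -EIO : 0;
}

/*
 * Hypothetical caller: returns a buffer that was converted to shared, or
 * NULL on failure. On a conversion failure the buffer is deliberately NOT
 * freed, because its private/shared state is unknown and handing it back
 * to the allocator could expose later users of that memory.
 */
static void *alloc_shared_buffer(int numpages)
{
	void *buf;

	assert(numpages > 0);

	buf = calloc((size_t)numpages, MODEL_PAGE_SIZE);
	if (!buf)
		return NULL;

	if (fake_set_memory_decrypted(buf, numpages)) {
		leaked_pages += (size_t)numpages;	/* leak, do not free */
		return NULL;
	}

	return buf;
}
```

The trade-off in this sketch is that a rare failure costs some memory, but the caller never has to reason about a range that may be partly shared and partly private.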

Michael Kelley (LINUX) Oct. 27, 2023, 5:08 p.m. UTC | #12

From: Edgecombe, Rick P <rick.p.edgecombe@intel.com> Sent: Friday, October 27, 2023 9:47 AM
> 
> On Fri, 2023-10-27 at 16:37 +0000, Michael Kelley (LINUX) wrote:
> > When a caller gets an error from set_memory_decrypted(), it will
> > take steps to try to get the memory back into a "good" state so
> > that it can put the memory back on the free list.   If it can't get
> > the memory back into a good state, then it will leak the memory.
> > I was thinking about how the caller will make that determination.
> > Is it based on whether set_memory_encrypted() succeeds?  I think
> > that works, as long as (for x86 at least) set_memory_encrypted()
> > ensures that the guest PTEs are all marked "private" before it
> > returns success.
> >
> > So maybe my comment applies to the caller in the sense of
> > understanding what steps the caller should take to recover from
> > an error, and the possible outcomes from the attempted recovery.
> 
> Since I was dropping the free_decrypted_pages() helper, I was thinking to
> actually just leak the pages if set_memory_decrypted() fails. As in, not
> try to recover them with set_memory_encrypted(). So the kernel will do
> the 3 retries that the recent HyperV focused patch added, and then walk
> away.
> 
> The kernel will already be warning about this situation, so we do not
> expect it to be common. For rare cases, it seems simpler to just
> leak it, and then set_memory_encrypted() can be simpler as it doesn't
> need to worry about handling mixed ranges returning success.
> 

I like that approach even better than trying to fix things up and get
the memory back on the guest free list.  I agree the error case should
be rare, and I'm generally leery of putting memory on the free list
when there's some doubt about the private/shared state of the page
from the host/VMM standpoint.

Michael

> I'll update the log to clarify the importance of the PTE being marked
> shared in the guest, and post a v2.

Patch

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index bda9f129835e..dade281f449b 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2153,7 +2153,7 @@  static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 
 	/* Notify hypervisor that we are about to set/clr encryption attribute. */
 	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
-		return -EIO;
+		goto vmm_fail;
 
 	ret = __change_page_attr_set_clr(&cpa, 1);
 
@@ -2167,12 +2167,20 @@  static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 	cpa_flush(&cpa, 0);
 
 	/* Notify hypervisor that we have successfully set/clr encryption attribute. */
-	if (!ret) {
-		if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
-			ret = -EIO;
-	}
+	if (ret)
+		goto out;
 
+	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
+		goto vmm_fail;
+
+out:
 	return ret;
+
+vmm_fail:
+	WARN_ONCE(1, "CPA VMM failure to convert memory (addr=%p, numpages=%d) to %s.\n",
+		  (void *)addr, numpages, enc ? "private" : "shared");
+
+	return -EIO;
 }
 
 static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)