[v2,2/2] x86/coco: Require seeding RNG with RDRAND on CoCo systems

Message ID 20240214195744.8332-3-Jason@zx2c4.com
State New
Headers
Series CoCo/RDRAND brokenness fixes |

Commit Message

Jason A. Donenfeld Feb. 14, 2024, 7:56 p.m. UTC
  There are few uses of CoCo that don't rely on working cryptography and
hence a working RNG. Unfortunately, the CoCo threat model means that the
VM host cannot be trusted and may actively work against guests to
extract secrets or manipulate computation. Since a malicious host can
modify or observe nearly all inputs to guests, the only remaining source
of entropy for CoCo guests is RDRAND.

If RDRAND is broken -- due to CPU hardware fault -- the generic path
will WARN(), but probably CoCo systems shouldn't even continue
executing. This is mostly a concern at boot time when initially seeding
the RNG, as after that the consequences of a broken RDRAND are much more
theoretical.

So, try at boot to seed the RNG using 256 bits of RDRAND output. If this
fails, panic(). This will also trigger if the system is booted without
RDRAND, as RDRAND is essential for a safe CoCo boot.

This patch is deliberately written to be "just a CoCo x86 driver
feature" and not part of the RNG itself. Many device drivers and
platforms have some desire to contribute something to the RNG, and
add_device_randomness() is specifically meant for this purpose. Any
driver can call this with seed data of any quality, or even garbage
quality, and it can only possibly make the quality of the RNG better or
have no effect, but can never make it worse. Rather than trying to
build something into the core of the RNG, this patch interprets the
particular CoCo issue as just a CoCo issue, and therefore separates this
all out into driver (well, arch/platform) code.

Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel P. Berrangé <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
Changes v1->v2:
- panic() instead of BUG_ON(), as suggested by Andi Kleen.
- Update comments, now that we have info from AMD and Intel.

 arch/x86/coco/core.c        | 36 ++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/coco.h |  2 ++
 arch/x86/kernel/setup.c     |  2 ++
 3 files changed, 40 insertions(+)
  

Comments

Reshetova, Elena Feb. 15, 2024, 7:30 a.m. UTC | #1
> 
> There are few uses of CoCo that don't rely on working cryptography and
> hence a working RNG. Unfortunately, the CoCo threat model means that the
> VM host cannot be trusted and may actively work against guests to
> extract secrets or manipulate computation. Since a malicious host can
> modify or observe nearly all inputs to guests, the only remaining source
> of entropy for CoCo guests is RDRAND.
> 
> If RDRAND is broken -- due to CPU hardware fault -- the generic path
> will WARN(), but probably CoCo systems shouldn't even continue
> executing. This is mostly a concern at boot time when initially seeding
> the RNG, as after that the consequences of a broken RDRAND are much more
> theoretical.
> 
> So, try at boot to seed the RNG using 256 bits of RDRAND output. If this
> fails, panic(). This will also trigger if the system is booted without
> RDRAND, as RDRAND is essential for a safe CoCo boot.
> 
> This patch is deliberately written to be "just a CoCo x86 driver
> feature" and not part of the RNG itself. Many device drivers and
> platforms have some desire to contribute something to the RNG, and
> add_device_randomness() is specifically meant for this purpose. Any
> driver can call this with seed data of any quality, or even garbage
> quality, and it can only possibly make the quality of the RNG better or
> have no effect, but can never make it worse. Rather than trying to
> build something into the core of the RNG, this patch interprets the
> particular CoCo issue as just a CoCo issue, and therefore separates this
> all out into driver (well, arch/platform) code.
> 
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Elena Reshetova <elena.reshetova@intel.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Theodore Ts'o <tytso@mit.edu>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> Changes v1->v2:
> - panic() instead of BUG_ON(), as suggested by Andi Kleen.
> - Update comments, now that we have info from AMD and Intel.
> 
>  arch/x86/coco/core.c        | 36 ++++++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/coco.h |  2 ++
>  arch/x86/kernel/setup.c     |  2 ++
>  3 files changed, 40 insertions(+)
> 
> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> index eeec9986570e..34d7c6795e8c 100644
> --- a/arch/x86/coco/core.c
> +++ b/arch/x86/coco/core.c
> @@ -3,13 +3,16 @@
>   * Confidential Computing Platform Capability checks
>   *
>   * Copyright (C) 2021 Advanced Micro Devices, Inc.
> + * Copyright (C) 2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
>   *
>   * Author: Tom Lendacky <thomas.lendacky@amd.com>
>   */
> 
>  #include <linux/export.h>
>  #include <linux/cc_platform.h>
> +#include <linux/random.h>
> 
> +#include <asm/archrandom.h>
>  #include <asm/coco.h>
>  #include <asm/processor.h>
> 
> @@ -153,3 +156,36 @@ __init void cc_set_mask(u64 mask)
>  {
>  	cc_mask = mask;
>  }
> +
> +__init void cc_random_init(void)
> +{
> +	unsigned long rng_seed[32 / sizeof(long)];
> +	size_t i, longs;
> +
> +	if (cc_vendor == CC_VENDOR_NONE)
> +		return;
> +
> +	/*
> +	 * Since the CoCo threat model includes the host, the only reliable
> +	 * source of entropy that can be neither observed nor manipulated is
> +	 * RDRAND. Usually, RDRAND failure is considered tolerable, but since a
> +	 * host can possibly induce failures consistently, it's important to at
> +	 * least ensure the RNG gets some initial random seeds.
> +	 */
> +	for (i = 0; i < ARRAY_SIZE(rng_seed); i += longs) {
> +		longs = arch_get_random_longs(&rng_seed[i],
> ARRAY_SIZE(rng_seed) - i);
> +
> +		/*
> +		 * A zero return value means that the guest doesn't have RDRAND
> +		 * or the CPU is physically broken, and in both cases that
> +		 * means most crypto inside of the CoCo instance will be
> +		 * broken, defeating the purpose of CoCo in the first place. So
> +		 * just panic here because it's absolutely unsafe to continue
> +		 * executing.
> +		 */
> +		if (longs == 0)
> +			panic("RDRAND is defective.");
> +	}
> +	add_device_randomness(rng_seed, sizeof(rng_seed));


While this patch functionally does close to what we need, i.e. adds 256 bits to the pool
from rdrand and panics otherwise, I personally find it quite confusing in the overall
architecture of Linux RNG. You have done a lot of great work to clean the arch
back in 2022, and currently we have two straightforward paths where the entropy is
added from rdrand/rdseed into the pool and seed production with proper entropy
accounting. Yes, I think everyone would agree (and it has been pointed in numerous
papers/reports) that entropy accounting is somewhat doomed business, but this
is what you have to do at least initially in order to still maintain the overall
logic of why Linux RNG does what it does.  

Now with this patch, the logic becomes somewhat messy:

1. in early boot Linux RNG will attempt to get inputs from rdrand/rdseed anyhow
and add it to the pool and count entropy (if successful and if we trust cpu rng)
2. now somewhat later, we *again* try to mix in 256 bits from *rdrand only*
(ideally in non-attack cases we should use rdseed here if it gives us output) via
the interface that is by Linux RNG design intended for consuming entropy from
 *untrusted* sources (hence no entropy counting), and if we fail this action
(which architecturally should not matter for Linux RNG based on the used interface)
we are going to panic the machine, because in fact this is the most important
for CoCo. 
 
I find the logic above confusing and not very clean. 
Should we just go back to the approach to add one more check in random_init_early()
to panic in the CoCo case if both rdseed and rdrand fails to give us anything? 
This way one can still look at the Linux RNG code overall and understand what
is going to be the behavior in all conditions/cases? 

Best Regards,
Elena.
  
Jason A. Donenfeld Feb. 15, 2024, 1:22 p.m. UTC | #2
Hi Elena,

On Thu, Feb 15, 2024 at 07:30:32AM +0000, Reshetova, Elena wrote:
> Should we just go back to the approach to add one more check in random_init_early()
> to panic in the CoCo case if both rdseed and rdrand fails to give us anything? 

Yea, no, definitely not. That is, in my opinion, completely backwards
and leads to impossible maintainability. CoCo is not some special
snowflake that gets to sprinkle random conditionals in generic code.

First, consider the motivation for doing this:
- This is to abort on a physical defective CPU bug -- presumably a
  highly implausible thing to ever happen.
- This is for a threat model that few people are really compelled by
  anyway, e.g. it's whack-a-mole season with host->guest vectors.
- This is for a single somewhat obscure configuration of a single
  architecture with a feature only available on certain chipsets.
- This is not an "intrinsic" problem that necessitates plumbing complex
  policy logic all over the place, but for a very special
  driver-specific case.

Rather, what this patch does is...

> Now with this patch, the logic becomes

Your description actually wasn't quite accurate so I'll write it out
(and I'll clarify in the commit message/comments for v3 - my fault for
being vague):

1. At early boot, x86/CoCo is initialized. As part of that
   initialization, it makes sure it can get 256 bits of RDRAND output
   and adds it to the pool, in exactly the same way that the SD card
   driver adds inserted memory card serial numbers to the pool. If it
   can't get RDRAND output, it means CoCo loses one of its "Co"s, and so
   it panic()s.

2. Later, the generic RNG initializes in random_init_early() and
   random_init(), where it opportunistically tries to use everything it
   can to get initialized -- architectural seed, architectural rand,
   jitter, timers, boot seeds, *seeds passed from other drivers*, and
   whatever else it can.

Now what you're complaining about is that in this step 2, we wind up
adding *even more* rdrand (though, more probably rdseed), in addition to
what was already added in the platform-specific driver in step 1. Boo
hoo? I can't see how that's a bad thing. Step 1 was CoCo's policy driver
*ensuring* that it was able to push at least *something good* into the
RNG, and taking a CoCo-specific policy decision (panic()ing) if it
can't. Step 2 is just generic RNG stuff doing its generic RNG thing.

You might also want to needle on the fact that if RDRAND is somehow
intermittently physically defective, and so step 1 succeeds, but in step
2, the RNG doesn't manage to get seeded by RDRAND and so initializes
based on jitter or IRQs or something. Okay, fine, but who cares? First,
you'd be talking here about a hugely unlikely defective hardware case,
and second, the end state remains basically identical: there's a good
seed from RDRAND and the RNG itself is able to initialize.

So I really don't want to litter the generic code with a bunch of
platform-specific hacks. The function add_device_randomness()
specifically exists so that individual platforms and devices that have
some unique insight into an entropy source or entropy requirements or
policy or whatever else can do that in their own platform or device
driver code where it belongs.

Jason
  
Reshetova, Elena Feb. 16, 2024, 7:57 a.m. UTC | #3
> Hi Elena,
> 
> On Thu, Feb 15, 2024 at 07:30:32AM +0000, Reshetova, Elena wrote:
> > Should we just go back to the approach to add one more check in
> random_init_early()
> > to panic in the CoCo case if both rdseed and rdrand fails to give us anything?
> 
> Yea, no, definitely not. That is, in my opinion, completely backwards
> and leads to impossible maintainability. CoCo is not some special
> snowflake that gets to sprinkle random conditionals in generic code.
> 
> First, consider the motivation for doing this:
> - This is to abort on a physical defective CPU bug -- presumably a
>   highly implausible thing to ever happen.
> - This is for a threat model that few people are really compelled by
>   anyway, e.g. it's whack-a-mole season with host->guest vectors.
> - This is for a single somewhat obscure configuration of a single
>   architecture with a feature only available on certain chipsets.
> - This is not an "intrinsic" problem that necessitates plumbing complex
>   policy logic all over the place, but for a very special
>   driver-specific case.
> 
> Rather, what this patch does is...
> 
> > Now with this patch, the logic becomes
> 
> Your description actually wasn't quite accurate so I'll write it out
> (and I'll clarify in the commit message/comments for v3 - my fault for
> being vague):
> 
> 1. At early boot, x86/CoCo is initialized. As part of that
>    initialization, it makes sure it can get 256 bits of RDRAND output
>    and adds it to the pool, in exactly the same way that the SD card
>    driver adds inserted memory card serial numbers to the pool. 

Yes, my mental picture that random_init_early() is called before 
setup_arch() was obviously wrong, I should have checked it explicitly.
So, yes, coco_random_init() happens first, which actually now has a nice
side-effect that on coco platforms we drop HW CPU output even earlier
in the entropy pool (Yay!).
Which at this point would be almost perfect, *if* we could also
count this entropy drop and allow ChaCha seeding to benefit straight from
this early drop of entropy. 

How about changing this to use add_hwgenerator_randomness()? 
And adjust cc_random_init() to try rdseed first and only fallback to rdrand
if it fails? 
I envision that a counter argument to this would be "we only count
entropy from HW CPU RNG, if we trust CPU RNG", but in CoCo case
we *do trust CPU* and this is the output of true HW RNG that we are
mixing in the pool per definition. 

Best Regards,
Elena.
  
Jason A. Donenfeld Feb. 16, 2024, 6:17 p.m. UTC | #4
Hi Elena,

On Fri, Feb 16, 2024 at 8:57 AM Reshetova, Elena
<elena.reshetova@intel.com> wrote:
> So, yes, coco_random_init() happens first, which actually now has a nice
> side-effect that on coco platforms we drop HW CPU output even earlier
> in the entropy pool (Yay!).
> Which at this point would be almost perfect, *if* we could also
> count this entropy drop and allow ChaCha seeding to benefit straight from
> this early drop of entropy.

I addressed this already in my last reply. I wouldn't get too hung up
on the entropy counting stuff. The RNG is going to get initialized
just fine anyway no matter what, and whether or not it's counted,
it'll still be used and available basically immediately anyway.

> How about changing this to use add_hwgenerator_randomness()?

That function is only for the hwrng API. It handles sleep semantics
and that's specific to that interface boundary. It is not for random
drivers and platforms to call directory.

> And adjust cc_random_init() to try rdseed first and only fallback to rdrand
> if it fails?

I guess that's possible, but what even is the point? I don't think
that really more directly accomplishes the objective here anyway. The
whole idea is we want to ensure that RDRAND is at least working for 32
bytes and if not panic. That's *all* we care about. Later on the RNG
will eat rdseed opportunistically as it can; let it handle that as it
does, and leave this patch here to be simple and direct and accomplish
the one single solitary goal of panicking if it can't avoid the worst
case scenario.

Jason
  
Reshetova, Elena Feb. 21, 2024, 7:52 a.m. UTC | #5
> Hi Elena,
> 
> On Fri, Feb 16, 2024 at 8:57 AM Reshetova, Elena
> <elena.reshetova@intel.com> wrote:
> > So, yes, coco_random_init() happens first, which actually now has a nice
> > side-effect that on coco platforms we drop HW CPU output even earlier
> > in the entropy pool (Yay!).
> > Which at this point would be almost perfect, *if* we could also
> > count this entropy drop and allow ChaCha seeding to benefit straight from
> > this early drop of entropy.
> 
> I addressed this already in my last reply. I wouldn't get too hung up
> on the entropy counting stuff. The RNG is going to get initialized
> just fine anyway no matter what, and whether or not it's counted,
> it'll still be used and available basically immediately anyway.
> 
> > How about changing this to use add_hwgenerator_randomness()?
> 
> That function is only for the hwrng API. It handles sleep semantics
> and that's specific to that interface boundary. It is not for random
> drivers and platforms to call directory.
> 
> > And adjust cc_random_init() to try rdseed first and only fallback to rdrand
> > if it fails?
> 
> I guess that's possible, but what even is the point? I don't think
> that really more directly accomplishes the objective here anyway. The
> whole idea is we want to ensure that RDRAND is at least working for 32
> bytes and if not panic. That's *all* we care about. Later on the RNG
> will eat rdseed opportunistically as it can; let it handle that as it
> does, and leave this patch here to be simple and direct and accomplish
> the one single solitary goal of panicking if it can't avoid the worst
> case scenario.


Okay, I guess I won’t start fighting for a cleanliness of an overall construction
(and rest of people stay pretty quiet in this discussion thread). 
Functionally-wise we will get what we need in practice and I learned that
"disagree and commit" is a very useful principle to exercise for moving things forward,
so I am ok with this approach.

Would you submit the patch or how do we proceed on this? 

Best Regards,
Elena.
  
Jason A. Donenfeld Feb. 21, 2024, 12:24 p.m. UTC | #6
On Wed, Feb 21, 2024 at 8:52 AM Reshetova, Elena
<elena.reshetova@intel.com> wrote:
> Would you submit the patch or how do we proceed on this?

I need to send in a v3 anyway. So I'll do that, and then you can reply
with your `Reviewed-by: First Last <email>` line and then someone
handling tip@ will hopefully pull it in.

Jason
  

Patch

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index eeec9986570e..34d7c6795e8c 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -3,13 +3,16 @@ 
  * Confidential Computing Platform Capability checks
  *
  * Copyright (C) 2021 Advanced Micro Devices, Inc.
+ * Copyright (C) 2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
  *
  * Author: Tom Lendacky <thomas.lendacky@amd.com>
  */
 
 #include <linux/export.h>
 #include <linux/cc_platform.h>
+#include <linux/random.h>
 
+#include <asm/archrandom.h>
 #include <asm/coco.h>
 #include <asm/processor.h>
 
@@ -153,3 +156,36 @@  __init void cc_set_mask(u64 mask)
 {
 	cc_mask = mask;
 }
+
+__init void cc_random_init(void)
+{
+	unsigned long rng_seed[32 / sizeof(long)];
+	size_t i, longs;
+
+	if (cc_vendor == CC_VENDOR_NONE)
+		return;
+
+	/*
+	 * Since the CoCo threat model includes the host, the only reliable
+	 * source of entropy that can be neither observed nor manipulated is
+	 * RDRAND. Usually, RDRAND failure is considered tolerable, but since a
+	 * host can possibly induce failures consistently, it's important to at
+	 * least ensure the RNG gets some initial random seeds.
+	 */
+	for (i = 0; i < ARRAY_SIZE(rng_seed); i += longs) {
+		longs = arch_get_random_longs(&rng_seed[i], ARRAY_SIZE(rng_seed) - i);
+
+		/*
+		 * A zero return value means that the guest doesn't have RDRAND
+		 * or the CPU is physically broken, and in both cases that
+		 * means most crypto inside of the CoCo instance will be
+		 * broken, defeating the purpose of CoCo in the first place. So
+		 * just panic here because it's absolutely unsafe to continue
+		 * executing.
+		 */
+		if (longs == 0)
+			panic("RDRAND is defective.");
+	}
+	add_device_randomness(rng_seed, sizeof(rng_seed));
+	memzero_explicit(rng_seed, sizeof(rng_seed));
+}
diff --git a/arch/x86/include/asm/coco.h b/arch/x86/include/asm/coco.h
index 76c310b19b11..e9d059449885 100644
--- a/arch/x86/include/asm/coco.h
+++ b/arch/x86/include/asm/coco.h
@@ -15,6 +15,7 @@  extern enum cc_vendor cc_vendor;
 void cc_set_mask(u64 mask);
 u64 cc_mkenc(u64 val);
 u64 cc_mkdec(u64 val);
+void cc_random_init(void);
 #else
 #define cc_vendor (CC_VENDOR_NONE)
 
@@ -27,6 +28,7 @@  static inline u64 cc_mkdec(u64 val)
 {
 	return val;
 }
+static inline void cc_random_init(void) { }
 #endif
 
 #endif /* _ASM_X86_COCO_H */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 84201071dfac..30a653cfc7d2 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -36,6 +36,7 @@ 
 #include <asm/bios_ebda.h>
 #include <asm/bugs.h>
 #include <asm/cacheinfo.h>
+#include <asm/coco.h>
 #include <asm/cpu.h>
 #include <asm/efi.h>
 #include <asm/gart.h>
@@ -994,6 +995,7 @@  void __init setup_arch(char **cmdline_p)
 	 * memory size.
 	 */
 	mem_encrypt_setup_arch();
+	cc_random_init();
 
 	efi_fake_memmap();
 	efi_find_mirror();