diff mbox series

[1/2] x86/tdx: Extract GET_INFO call from get_cc_mask()

Message ID	20221028141220.29217-2-kirill.shutemov@linux.intel.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@intel.com, luto@kernel.org, peterz@infradead.org Cc: sathyanarayanan.kuppuswamy@linux.intel.com, ak@linux.intel.com, dan.j.williams@intel.com, david@redhat.com, hpa@zytor.com, seanjc@google.com, thomas.lendacky@amd.com, elena.reshetova@intel.com, x86@kernel.org, linux-kernel@vger.kernel.org, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Subject: [PATCH 1/2] x86/tdx: Extract GET_INFO call from get_cc_mask() Date: Fri, 28 Oct 2022 17:12:19 +0300 Message-Id: <20221028141220.29217-2-kirill.shutemov@linux.intel.com> In-Reply-To: <20221028141220.29217-1-kirill.shutemov@linux.intel.com> References: <20221028141220.29217-1-kirill.shutemov@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	x86/tdx: Enforce no #VE on private memory accesses \| [0/2] x86/tdx: Enforce no #VE on private memory accesses [1/2] x86/tdx: Extract GET_INFO call from get_cc_mask() [2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory

Commit Message

Kirill A. Shutemov Oct. 28, 2022, 2:12 p.m. UTC

  From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

TDCALL [TDG.VP.INFO] is used to get details like gpa_width,
TD attributes, etc.

So far only gpa_width was needed to enumerate the shared bit, but
upcoming changes will need TD attributes too.

Extract GET_INFO call out of get_cc_mask() into a separate helper and
save attributes.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

Comments

Dave Hansen Oct. 28, 2022, 3:43 p.m. UTC | #1

On 10/28/22 07:12, Kirill A. Shutemov wrote:
> +	 * information, TD attributes etc. More details about the ABI can be
> +	 * found in TDX Guest-Host-Communication Interface (GHCI), section
> +	 * 2.4.2 TDCALL [TDG.VP.INFO].

Folks, I thought we agreed long ago to stop putting section numbers in
these comments because they're not stable.  Am I remembering this wrong?

Dave Hansen Oct. 28, 2022, 11:27 p.m. UTC | #2

I looked at this a bit more closely.  The code is just bonkers now.  It
can't go in like this.

tdx_parse_tdinfo() stashes two global variables.  Then, about three
lines later in the function, it calls get_cc_mask() which just returns
one of those variables, modulo a little masking.

Ditto for the td_attr.  It's stashed in a global variable and then read
one time.

There is *ZERO* reason to store 'td_attr'.  There's also zero reason to
have 'gpa_width' as a global variable.  It's only used *ONE* *TIME* from
the scope of *ONE* *FUNCTION*.

Even the comment is bonkers:

        /*
         * Initializes gpa_width and td_attr. Must be called before
         * get_cc_mask() or attribute checks.
         */
        tdx_parse_tdinfo();

Comments are great.  But comments that are only there because the code
is obtuse are not.  I changed it to:

        tdx_parse_tdinfo(&cc_mask);

It doesn't even need a comment now.  Why?  Because it's obvious from the
naming and calling convention that it initializes cc_mask and what the
ordering dependency is.  Plus, even *if* I missed the call, or screwed
up the order, the compiler would tell me that I'm a dolt.

The whole global variable thing actually makes the code objectively
worse in almost every possible way.

Can you please take a look through this and make sure I didn't botch
anything:

> https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=tdxbadve

The end result is about 50 lines less than what was there before.  Most
of it is comment removal but the code is simpler too.

Acks and Tested-by's would be appreciated.

Kirill A. Shutemov Oct. 28, 2022, 11:59 p.m. UTC | #3

On Fri, Oct 28, 2022 at 04:27:10PM -0700, Dave Hansen wrote:
> I looked at this a bit more closely.  The code is just bonkers now.  It
> can't go in like this.
> 
> tdx_parse_tdinfo() stashes two global variables.  Then, about three
> lines later in the function, it calls get_cc_mask() which just returns
> one of those variables, modulo a little masking.
> 
> Ditto for the td_attr.  It's stashed in a global variable and then read
> one time.
> 
> There is *ZERO* reason to store 'td_attr'.  There's also zero reason to
> have 'gpa_width' as a global variable.  It's only used *ONE* *TIME* from
> the scope of *ONE* *FUNCTION*.
> 
> Even the comment is bonkers:
> 
>         /*
>          * Initializes gpa_width and td_attr. Must be called before
>          * get_cc_mask() or attribute checks.
>          */
>         tdx_parse_tdinfo();
> 
> Comments are great.  But comments that are only there because the code
> is obtuse are not.  I changed it to:
> 
>         tdx_parse_tdinfo(&cc_mask);
> 
> It doesn't even need a comment now.  Why?  Because it's obvious from the
> naming and calling convention that it initializes cc_mask and what the
> ordering dependency is.  Plus, even *if* I missed the call, or screwed
> up the order, the compiler would tell me that I'm a dolt.
> 
> The whole global variable thing actually makes the code objectively
> worse in almost every possible way.

I agree. Sorry about that.

We have more code in our tree that want to check attributes. And it is
after initialization, so the code structure derived from there.

But, yes, I should have rework it before sending upstream.

> Can you please take a look through this and make sure I didn't botch
> anything:
> 
> > https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=tdxbadve
> 
> The end result is about 50 lines less than what was there before.  Most
> of it is comment removal but the code is simpler too.
> 
> Acks and Tested-by's would be appreciated.

Looks good and works fine:

Acked-and-Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Kirill A. Shutemov Oct. 31, 2022, 4:12 a.m. UTC | #4

On Sat, Oct 29, 2022 at 02:59:51AM +0300, Kirill A. Shutemov wrote:
> > Can you please take a look through this and make sure I didn't botch
> > anything:
> > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=tdxbadve
> > 
> > The end result is about 50 lines less than what was there before.  Most
> > of it is comment removal but the code is simpler too.
> > 
> > Acks and Tested-by's would be appreciated.

One thing that I must bring up is that it seems that there's no way to get
the panic message to user. I tried to convinced myself that it is qemu
misconfiguration on my part or some race, but no: it is just too early for
earlyprintk.

We only get earlyprintk working after parse_early_options() which happens
well after tdx_early_init().

Moving panic() after earlyprintk working is not good idea as it exposes
kernel more: by the time we already have full #VE handler.

We can move it earlier into decompresser which has different earlyprintk
implementation. Not sure if it worth this. What do you think?

"tdx: Guest detected" printed from the same tdx_early_init() is visible if
boot goes far enough to initialize console (early or not) as printk adds
the message to the buffer, but no so luck for panic().

Dave Hansen Oct. 31, 2022, 4:42 p.m. UTC | #5

On 10/30/22 21:12, Kirill A. Shutemov wrote:
> On Sat, Oct 29, 2022 at 02:59:51AM +0300, Kirill A. Shutemov wrote:
>>> Can you please take a look through this and make sure I didn't botch
>>> anything:
>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=tdxbadve
>>>
>>> The end result is about 50 lines less than what was there before.  Most
>>> of it is comment removal but the code is simpler too.
>>>
>>> Acks and Tested-by's would be appreciated.
> 
> One thing that I must bring up is that it seems that there's no way to get
> the panic message to user. I tried to convinced myself that it is qemu
> misconfiguration on my part or some race, but no: it is just too early for
> earlyprintk.
> 
> We only get earlyprintk working after parse_early_options() which happens
> well after tdx_early_init().
> 
> Moving panic() after earlyprintk working is not good idea as it exposes
> kernel more: by the time we already have full #VE handler.

How about we soften the panic() to a pr_err() if it's a debug guest?

The first thing a user is going to do if they get an early boot failure
is flip the debug switch and try it again.  That gets us safe,
well-defined behavior when we need security and also lets us figure out
what went wrong.

Also, did anyone ever actually implement that TDX earlyprintk simple
console thing?  A TDCALL up to the host with some characters in a
register or two is as dirt simple of a console as you can get.  It would
be very easy to improve the user experience here if there were a:

	tdx_puts("uh oh");

interface.  It's a shame if it didn't get done by now.  I asked for it
years ago.

And, yeah, I know it wouldn't help us in this precise situation because
earlyprintk doesn't work yet.  But, it *would* be one of those really,
really early bitbanging-style consoles that _could_ be in use very, very
early if the printk() infrastructure could take advantage of it.

> We can move it earlier into decompresser which has different earlyprintk
> implementation. Not sure if it worth this. What do you think?

There's the puts()/printf() gunk that's really early like in
validate_cpu().  Is that what you were thinking of?

Kirill A. Shutemov Oct. 31, 2022, 7:19 p.m. UTC | #6

On Mon, Oct 31, 2022 at 09:42:15AM -0700, Dave Hansen wrote:
> On 10/30/22 21:12, Kirill A. Shutemov wrote:
> > On Sat, Oct 29, 2022 at 02:59:51AM +0300, Kirill A. Shutemov wrote:
> >>> Can you please take a look through this and make sure I didn't botch
> >>> anything:
> >>>
> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=tdxbadve
> >>>
> >>> The end result is about 50 lines less than what was there before.  Most
> >>> of it is comment removal but the code is simpler too.
> >>>
> >>> Acks and Tested-by's would be appreciated.
> > 
> > One thing that I must bring up is that it seems that there's no way to get
> > the panic message to user. I tried to convinced myself that it is qemu
> > misconfiguration on my part or some race, but no: it is just too early for
> > earlyprintk.
> > 
> > We only get earlyprintk working after parse_early_options() which happens
> > well after tdx_early_init().
> > 
> > Moving panic() after earlyprintk working is not good idea as it exposes
> > kernel more: by the time we already have full #VE handler.
> 
> How about we soften the panic() to a pr_err() if it's a debug guest?

The plan is to have pr_warn() + check in handle_mmio(), as I mentioned
before. But pr_err() also works.

> The first thing a user is going to do if they get an early boot failure
> is flip the debug switch and try it again.  That gets us safe,
> well-defined behavior when we need security and also lets us figure out
> what went wrong.
> 
> Also, did anyone ever actually implement that TDX earlyprintk simple
> console thing?  A TDCALL up to the host with some characters in a
> register or two is as dirt simple of a console as you can get.  It would
> be very easy to improve the user experience here if there were a:
> 
> 	tdx_puts("uh oh");
> 
> interface.  It's a shame if it didn't get done by now.  I asked for it
> years ago.

There's nothing like this, unfortunately.

There's ReportFatalError TDVMCALL that intended for the task, but it only
takes an error code as input which is useless here. Nobody will decode it.

> And, yeah, I know it wouldn't help us in this precise situation because
> earlyprintk doesn't work yet.  But, it *would* be one of those really,
> really early bitbanging-style consoles that _could_ be in use very, very
> early if the printk() infrastructure could take advantage of it.
> 
> > We can move it earlier into decompresser which has different earlyprintk
> > implementation. Not sure if it worth this. What do you think?
> 
> There's the puts()/printf() gunk that's really early like in
> validate_cpu().  Is that what you were thinking of?

More like error() in arch/x86/boot/compressed/kaslr.c.

Andi Kleen Oct. 31, 2022, 7:27 p.m. UTC | #7

sted-by's would be appreciated.
> One thing that I must bring up is that it seems that there's no way to get
> the panic message to user. I tried to convinced myself that it is qemu
> misconfiguration on my part or some race, but no: it is just too early for
> earlyprintk.
>
> We only get earlyprintk working after parse_early_options() which happens
> well after tdx_early_init().
>
> Moving panic() after earlyprintk working is not good idea as it exposes
> kernel more: by the time we already have full #VE handler.


It should be fine to move since there is no user land at this point (the 
attack requires user land)


>
> We can move it earlier into decompresser which has different earlyprintk
> implementation. Not sure if it worth this. What do you think?

That would make uncompressed kernels unsafe.

-Andi

Dave Hansen Oct. 31, 2022, 7:44 p.m. UTC | #8

On 10/31/22 12:27, Andi Kleen wrote:
>> Moving panic() after earlyprintk working is not good idea as it exposes
>> kernel more: by the time we already have full #VE handler.
> 
> It should be fine to move since there is no user land at this point (the
> attack requires user land)

Maybe I'm misunderstanding the exposure.  A normal MMIO #VE goes
something like this:

	1. %rax points to some MMIO
	2. Kernel executes: mov (%rax),%rbx, trying to read MMIO
	3. #VE handler is triggered
	4. Handler emulates the 'mov' with instruction decoding
	5. Handler asks the VMM what the value of %rax should be
	6. Handler puts VMM value in %rax
	7. Return from #VE

I think the attack scenario subverts a normal MMIO to the following
(changes from the normal flow are marked with *):

	*1. %rax points to some private kernel memory, VMM removes
	    Secure-EPT entry for that memory.
	 2. Kernel executes: mov (%rax),%rbx as part of normal kernel
	    execution, not an MMIO read.
	 3. #VE handler is triggered, assuming a MMIO read
	 4. Handler emulates the 'mov' with instruction decoding
	 5. Handler asks the VMM what the value of %rax should be
	*6. Handler puts (malicious) VMM value in %rax
	 7. Return from #VE
	*8. Now the guest kernel is running with an attacker-controlled
	    %rax

This effectively gives the attacker the ability to override the contents
of a memory read.

Am I misunderstanding the attack scenario?  I don't see guest userspace
needing to be involved at all.

Kirill A. Shutemov Oct. 31, 2022, 10:10 p.m. UTC | #9

On Mon, Oct 31, 2022 at 12:44:15PM -0700, Dave Hansen wrote:
> On 10/31/22 12:27, Andi Kleen wrote:
> >> Moving panic() after earlyprintk working is not good idea as it exposes
> >> kernel more: by the time we already have full #VE handler.
> > 
> > It should be fine to move since there is no user land at this point (the
> > attack requires user land)
> 
> Maybe I'm misunderstanding the exposure.  A normal MMIO #VE goes
> something like this:
> 
> 	1. %rax points to some MMIO
> 	2. Kernel executes: mov (%rax),%rbx, trying to read MMIO
> 	3. #VE handler is triggered
> 	4. Handler emulates the 'mov' with instruction decoding
> 	5. Handler asks the VMM what the value of %rax should be
> 	6. Handler puts VMM value in %rax
> 	7. Return from #VE
> 
> I think the attack scenario subverts a normal MMIO to the following
> (changes from the normal flow are marked with *):
> 
> 	*1. %rax points to some private kernel memory, VMM removes
> 	    Secure-EPT entry for that memory.
> 	 2. Kernel executes: mov (%rax),%rbx as part of normal kernel
> 	    execution, not an MMIO read.
> 	 3. #VE handler is triggered, assuming a MMIO read
> 	 4. Handler emulates the 'mov' with instruction decoding
> 	 5. Handler asks the VMM what the value of %rax should be
> 	*6. Handler puts (malicious) VMM value in %rax
> 	 7. Return from #VE
> 	*8. Now the guest kernel is running with an attacker-controlled
> 	    %rax
> 
> This effectively gives the attacker the ability to override the contents
> of a memory read.
> 
> Am I misunderstanding the attack scenario?  I don't see guest userspace
> needing to be involved at all.

Looks correct to me.

I think Andi refers to attack against syscall gap that also addressed by
the patch.

diff mbox series

Patch

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 928dcf7a20d9..343d60853b71 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -34,6 +34,12 @@ 
 #define VE_GET_PORT_NUM(e)	((e) >> 16)
 #define VE_IS_IO_STRING(e)	((e) & BIT(4))
 
+/* Caches GPA width from TDG.VP.INFO TDCALL */
+static unsigned int gpa_width __ro_after_init;
+
+/* Caches TD Attributes from TDG.VP.INFO TDCALL */
+static u64 td_attr __ro_after_init;
+
 /*
  * Wrapper for standard use of __tdx_hypercall with no output aside from
  * return code.
@@ -98,17 +104,16 @@  static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 		panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
 }
 
-static u64 get_cc_mask(void)
+static void tdx_parse_tdinfo(void)
 {
 	struct tdx_module_output out;
-	unsigned int gpa_width;
 
 	/*
 	 * TDINFO TDX module call is used to get the TD execution environment
 	 * information like GPA width, number of available vcpus, debug mode
-	 * information, etc. More details about the ABI can be found in TDX
-	 * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
-	 * [TDG.VP.INFO].
+	 * information, TD attributes etc. More details about the ABI can be
+	 * found in TDX Guest-Host-Communication Interface (GHCI), section
+	 * 2.4.2 TDCALL [TDG.VP.INFO].
 	 *
 	 * The GPA width that comes out of this call is critical. TDX guests
 	 * can not meaningfully run without it.
@@ -116,7 +121,11 @@  static u64 get_cc_mask(void)
 	tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
 
 	gpa_width = out.rcx & GENMASK(5, 0);
+	td_attr = out.rdx;
+}
 
+static u64 get_cc_mask(void)
+{
 	/*
 	 * The highest bit of a guest physical address is the "sharing" bit.
 	 * Set it for shared pages and clear it for private pages.
@@ -755,6 +764,12 @@  void __init tdx_early_init(void)
 	if (memcmp(TDX_IDENT, sig, sizeof(sig)))
 		return;
 
+	/*
+	 * Initializes gpa_width and td_attr. Must be called before
+	 * get_cc_mask() or attribute checks.
+	 */
+	tdx_parse_tdinfo();
+
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
 	cc_set_vendor(CC_VENDOR_INTEL);