[2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory

Message ID: 20221028141220.29217-3-kirill.shutemov@linux.intel.com
State: New
Series: x86/tdx: Enforce no #VE on private memory accesses

Commit Message

Kirill A. Shutemov Oct. 28, 2022, 2:12 p.m. UTC
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions, such as using certain instructions or accessing
certain MSRs.

A notable reason for #VE is access to specific guest physical addresses.
It requires special security consideration as it is not fully under the
guest kernel's control: the VMM can remove a page from the EPT page
table and trigger a #VE on access.

The primary use case for #VE on a memory access is MMIO: the VMM removes
a page from the EPT to trigger an exception in the guest, which allows
the guest to emulate MMIO with hypercalls.
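
A minimal sketch of that guest-side path, modeled on mmio_write() in
arch/x86/coco/tdx/tdx.c (helpers such as _tdx_hypercall() and
hcall_func() come from that file, not from this patch):

	/*
	 * Forward an MMIO write to the VMM: the #VE handler decodes the
	 * faulting instruction, then issues the EPT-violation hypercall
	 * with the access size, direction, address and value.
	 */
	static bool mmio_write(int size, unsigned long addr, unsigned long val)
	{
		return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION),
				       size, EPT_WRITE, addr, val);
	}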

MMIO only happens on shared memory. All conventional kernel memory is
private. This includes everything from kernel stacks to kernel text.

Handling exceptions on arbitrary accesses to kernel memory is
essentially impossible, as handling a #VE may itself require access to
memory that also triggers the exception.

The TDX module provides a mechanism to disable #VE delivery on access to
private memory. If the SEPT_VE_DISABLE TD attribute is set, a private
EPT violation is not reflected into the guest as a #VE, but instead
triggers an exit to the VMM.

Make sure the attribute is set by the VMM. Panic otherwise.
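
Concretely, the check amounts to a few lines in tdx_early_init() (shown
in full in the patch below); ATTR_SEPT_VE_DISABLE is bit 28 of the TD
attributes reported by the TDG.VP.INFO TDCALL:

	/* TD Attributes */
	#define ATTR_SEPT_VE_DISABLE	BIT(28)

	if (!(td_attr & ATTR_SEPT_VE_DISABLE))
		panic("TD misconfiguration: SEPT_VE_DISABLE attribute must be set.\n");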

There's a small window during boot, before the check, where the kernel
has an early #VE handler. But that handler only covers port I/O and
panics as soon as it sees any other #VE reason.

SEPT_VE_DISABLE makes a SEPT violation unrecoverable, and terminating
the TD is the only option.

The kernel has no legitimate use case for a #VE on private memory. It is
either a guest kernel bug (like an access to unaccepted memory) or a
malicious/buggy VMM that removed a guest page that is still in use.

In both cases, terminating the TD is the right thing to do.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: 9a22bf6debbf ("x86/traps: Add #VE support for TDX guest")
Cc: stable@vger.kernel.org # v5.19
---
 arch/x86/coco/tdx/tdx.c | 49 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)
  

Comments

Dave Hansen Oct. 28, 2022, 3:41 p.m. UTC | #1
On 10/28/22 07:12, Kirill A. Shutemov wrote:
>  arch/x86/coco/tdx/tdx.c | 49 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 49 insertions(+)

The patch is good, but I'm not crazy about the changelog or the big ol'
comment.

Really, this would do:

	/*
	 * The kernel can not handle #VE's when accessing normal kernel
	 * memory.  Ensure that no #VE will be delivered for accesses to
	 * TD-private memory.  Only VMM-shared memory (MMIO) will #VE.
	 */
	if (!(td_attr & ATTR_SEPT_VE_DISABLE))
		panic("TD misconfiguration: SEPT_VE_DISABLE attibute must be set.\n");

I'll probably trim both of them down.  If I chop out something that's
critical, let me know, otherwise let's follow up and stick all of those
details in Documentation.
  
Guorui Yu Oct. 31, 2022, 4:07 a.m. UTC | #2
The core of this vulnerability is not directly related to
ATTR_SEPT_VE_DISABLE, but to the MMIO processing logic in the #VE handler.

We have encountered similar problems on SEV-ES, here are their fixes on 
Kernel [1] and OVMF[2].

Instead of enforcing the ATTR_SEPT_VE_DISABLE in TDX guest kernel, I 
think the fix should also include necessary check on the MMIO path of 
the #VE routine.

static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
	unsigned long *reg, val, vaddr;
	char buffer[MAX_INSN_SIZE];
	struct insn insn = {};
	enum mmio_type mmio;
	int size, extend_size;
	u8 extend_val = 0;

	// Some additional security checks on ve->gpa should be introduced here.

	/* Only in-kernel MMIO is supported */
	if (WARN_ON_ONCE(user_mode(regs)))
		return -EFAULT;

	// ...
}

If we don't fix the problem at the point where we found it, but rely on
complicated composite logic and long comments in the kernel, I'm
confident we'll fall back into the same pit in the near future :).


[1] 
https://github.com/torvalds/linux/blob/1a2dcbdde82e3a5f1db9b2f4c48aa1aeba534fb2/arch/x86/kernel/sev.c#L503
[2] OVMF: 
https://github.com/tianocore/edk2/blob/db2c22633f3c761975d8f469a0e195d8b79e1287/OvmfPkg/Library/VmgExitLib/VmgExitVcHandler.c#L670
  
Kirill A. Shutemov Oct. 31, 2022, 4:33 a.m. UTC | #3
On Mon, Oct 31, 2022 at 12:07:45PM +0800, Guorui Yu wrote:
> The core of this vulnerability is not directly related to
> ATTR_SEPT_VE_DISABLE, but to the MMIO processing logic in the #VE handler.
> 
> We have encountered similar problems on SEV-ES, here are their fixes on
> Kernel [1] and OVMF[2].
> 
> Instead of enforcing the ATTR_SEPT_VE_DISABLE in TDX guest kernel, I think
> the fix should also include necessary check on the MMIO path of the #VE
> routine.

Missing SEPT_VE_DISABLE exposes us to more security problems than a
confused handle_mmio(). A rightly timed rogue #VE can be used to
escalate privileges and more. Just adding a check there would address
only some of the potential attacks.

> static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> {
> 	unsigned long *reg, val, vaddr;
> 	char buffer[MAX_INSN_SIZE];
> 	struct insn insn = {};
> 	enum mmio_type mmio;
> 	int size, extend_size;
> 	u8 extend_val = 0;
> 
> 	// Some additional security checks on ve->gpa should be introduced here.
> 
> 	/* Only in-kernel MMIO is supported */
> 	if (WARN_ON_ONCE(user_mode(regs)))
> 		return -EFAULT;
> 
> 	// ...
> }
> 
> If we don't fix the problem at the point where we found it, but rely on
> complicated composite logic and long comments in the kernel, I'm confident
> we'll fall back into the same pit in the near future :).

The plan is to add the check there along with relaxing SEPT_VE_DISABLE
for debug TDs. It is required to debug the guest kernel effectively.
Otherwise an access to unaccepted memory would terminate the TD with
zero info on why.

But it is not the urgent fix. It can be submitted separately.
  
Dave Hansen Oct. 31, 2022, 2:22 p.m. UTC | #4
On 10/30/22 21:07, Guorui Yu wrote:
> We have encountered similar problems on SEV-ES, here are their fixes
> on Kernel [1] and OVMF[2].

SEV-ES and TDX are very different beasts in this area.

> Instead of enforcing the ATTR_SEPT_VE_DISABLE in TDX guest kernel, I
> think the fix should also include necessary check on the MMIO path of
> the #VE routine.

Instead?

Without ATTR_SEPT_VE_DISABLE, a #VE can occur on basically any
instruction.  We call those kinds of exceptions "paranoid entry" points.
 They need special handling like the NMI or #MC handlers.

I'd be happy to look at a patch that does the MMIO path check *and*
turns the #VE handler into a robust entry point.

Bonus points if you can do ~5 lines of C like the approach in this thread.
  
Erdem Aktas Nov. 4, 2022, 10:36 p.m. UTC | #5
On Fri, Oct 28, 2022 at 7:12 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:

> +        *
> +        * Kernel has no legitimate use-cases for #VE on private memory. It is
> +        * either a guest kernel bug (like access of unaccepted memory) or
> +        * malicious/buggy VMM that removes guest page that is still in use.
> +        *

I think this statement is too strong and I have a few concerns about
this approach. I understand that there is an issue with handling #VEs
on private pages, but it seems like we are just hiding the problem with
this approach instead of fixing it - I do not have any fix in mind -.
First, there is a feature of injecting #VE to handle unaccepted pages at
runtime and accept them on demand; now the statement is saying this was
an unnecessary feature (why is it there at all then?), as there is no
legitimate use case.

I wonder if this will limit how we can implement lazy TDACCEPT.
There are multiple ideas floating around now.
https://github.com/intel/tdx/commit/9b3ef9655b695d3c67a557ec016487fded8b0e2b
has 3 implementation choices, of which the "Accept a block of memory on
the first use" option is implemented. Actually, it says "Accept a block
of memory on the first use" but it is implemented as "Accept a block of
memory on the first allocation". The comments in this code also raise
concerns about the performance.

As of now, we do not know which one of those ideas will provide
acceptable performance for booting large VMs. If the performance
overhead is high, we can always implement lazy TDACCEPT so that the
first time a guest accesses unaccepted memory, the #VE handler does the
TDACCEPT.

I am not trying to solve the lazy TDACCEPT problem here; all I am
trying to say is that there might be legitimate use cases for #VE on
private memory, and this patch limits any future improvements we might
need to make to the lazy TDACCEPT implementation.



> +        * In both cases terminating TD is the right thing to do.
> +        */
> +       if (!(td_attr & ATTR_SEPT_VE_DISABLE))
> +               panic("TD misconfiguration: SEPT_VE_DISABLE attribute must be set.\n");
> +
>         setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>
>         cc_set_vendor(CC_VENDOR_INTEL);
> --
> 2.38.0
>
  
Dave Hansen Nov. 4, 2022, 10:50 p.m. UTC | #6
On 11/4/22 15:36, Erdem Aktas wrote:
> On Fri, Oct 28, 2022 at 7:12 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
>> +        *
>> +        * Kernel has no legitimate use-cases for #VE on private memory. It is
>> +        * either a guest kernel bug (like access of unaccepted memory) or
>> +        * malicious/buggy VMM that removes guest page that is still in use.
>> +        *
> 
> I think this statement is too strong and I have a few concerns about
> this approach. I understand that there is an issue with handling #VEs
> on private pages, but it seems like we are just hiding the problem with
> this approach instead of fixing it - I do not have any fix in mind -.
> First, there is a feature of injecting #VE to handle unaccepted pages at
> runtime and accept them on demand; now the statement is saying this was
> an unnecessary feature (why is it there at all then?), as there is no
> legitimate use case.

We're doing on-demand page acceptance.  We just don't need a #VE to
drive it.  Why is it in the TDX module then?  Inertia?  Because it got
too far along in the process before anyone asked me or some of the other
x86 kernel folks to look at it hard.

> I wonder if this will limit how we can implement lazy TDACCEPT.
> There are multiple ideas floating around now.
> https://github.com/intel/tdx/commit/9b3ef9655b695d3c67a557ec016487fded8b0e2b
> has 3 implementation choices, of which the "Accept a block of memory on
> the first use" option is implemented. Actually, it says "Accept a block
> of memory on the first use" but it is implemented as "Accept a block of
> memory on the first allocation". The comments in this code also raise
> concerns about the performance.
> 
> As of now, we do not know which one of those ideas will provide
> acceptable performance for booting large VMs. If the performance
> overhead is high, we can always implement lazy TDACCEPT so that the
> first time a guest accesses unaccepted memory, the #VE handler does the
> TDACCEPT.

Could you please elaborate a bit on what you think the distinction is
between:

	* Accept on first use
and
	* Accept on allocation

Surely, for the vast majority of memory, it's allocated and then used
pretty quickly.  As in, most allocations are __GFP_ZERO so they're
allocated and "used" before they even leave the allocator.  So, in
practice, they're *VERY* close to equivalent.

Where do you see them diverging?  Why does it matter?

> I am not trying to solve the lazy TDACCEPT problem here; all I am
> trying to say is that there might be legitimate use cases for #VE on
> private memory, and this patch limits any future improvements we might
> need to make to the lazy TDACCEPT implementation.

The kernel can't take exceptions on arbitrary memory accesses.  I have
*ZERO* idea how to handle page acceptance on an access to a per-cpu
variable referenced in syscall entry, or the NMI stack when we've
interrupted kernel code with a user GSBASE value.

So, we either find *ALL* the kernel memory that needs to be pre-accepted
at allocation time (like kernel stacks) or we just say that all
allocated memory needs to be accepted before we let it be allocated.

One of those is really easy.  The other means a boatload of code audits.
 I know.  I had to do that kind of exercise to get KPTI to work.  I
don't want to do it again.  It was worth it for KPTI when the world was
on fire.  TDX isn't that important IMNHO.  There's an easier way.
  
Guorui Yu Nov. 7, 2022, 5:10 a.m. UTC | #7
On 2022/10/31 22:22, Dave Hansen wrote:
> On 10/30/22 21:07, Guorui Yu wrote:
>> We have encountered similar problems on SEV-ES, here are their fixes
>> on Kernel [1] and OVMF[2].
> 
> SEV-ES and TDX are very different beasts in this area.
> 
>> Instead of enforcing the ATTR_SEPT_VE_DISABLE in TDX guest kernel, I
>> think the fix should also include necessary check on the MMIO path of
>> the #VE routine.
> 
> Instead?
Yes, "Instead of" should be "Addtionally,".

> 
> Without ATTR_SEPT_VE_DISABLE, a #VE can occur on basically any
> instruction.  We call those kinds of exceptions "paranoid entry" points.
>   They need special handling like the NMI or #MC handlers.
> 
> I'd be happy to look at a patch that does the MMIO path check *and*
> turns the #VE handler into a robust entry point.
> 
> Bonus points if you can do ~5 lines of C like the approach in this thread.

Yes, there is a fix to satisfy your requirement and get the bonus points :)

Please refer to 
https://github.com/intel/tdx/commit/f045b0d52a5f7d8bf66cd4410307d05a90523f10 


case EXIT_REASON_EPT_VIOLATION:
+	if (!(ve->gpa & tdx_shared_mask())) {
+		panic("#VE due to access to unaccepted memory. "
+		      "GPA: %#llx\n", ve->gpa);
+	}
+
/* original from Kirill and Kuppuswamy */

It's already there, but it just didn't get into the main branch.


Sincerely,
Guorui
  
Dave Hansen Nov. 7, 2022, 1:31 p.m. UTC | #8
On 11/6/22 21:10, Guorui Yu wrote:
>> Without ATTR_SEPT_VE_DISABLE, a #VE can occur on basically any
>> instruction.  We call those kinds of exceptions "paranoid entry" points.
>>   They need special handling like the NMI or #MC handlers.
>>
>> I'd be happy to look at a patch that does the MMIO path check *and*
>> turns the #VE handler into a robust entry point.
>>
>> Bonus points if you can do ~5 lines of C like the approach in this
>> thread.
> 
> Yes, there is a fix to satisfy your requirement and get the bonus points 😄
> 
> Please refer to
> https://github.com/intel/tdx/commit/f045b0d52a5f7d8bf66cd4410307d05a90523f10
> 
> case EXIT_REASON_EPT_VIOLATION:
> +	if (!(ve->gpa & tdx_shared_mask())) {
> +		panic("#VE due to access to unaccepted memory. "
> +		      "GPA: %#llx\n", ve->gpa);
> +	}
> +
> /* original from Kirill and Kuppuswamy */
> 
> It's already there, but it just didn't get into the main branch.

Could you explain how that prevents the #VE from occurring in the
"syscall gap" or in a place where the kernel is running with the user
GSBASE value?

It doesn't as far as I can tell.  You need the SEPT_VE_DISABLE check for
that.
  
Guorui Yu Nov. 7, 2022, 1:43 p.m. UTC | #9
On 2022/11/7 21:31, Dave Hansen wrote:
> On 11/6/22 21:10, Guorui Yu wrote:
>>> Without ATTR_SEPT_VE_DISABLE, a #VE can occur on basically any
>>> instruction.  We call those kinds of exceptions "paranoid entry" points.
>>>    They need special handling like the NMI or #MC handlers.
>>>
>>> I'd be happy to look at a patch that does the MMIO path check *and*
>>> turns the #VE handler into a robust entry point.
>>>
>>> Bonus points if you can do ~5 lines of C like the approach in this
>>> thread.
>>
>> Yes, there is a fix to satisfy your requirement and get the bonus points 😄
>>
>> Please refer to
>> https://github.com/intel/tdx/commit/f045b0d52a5f7d8bf66cd4410307d05a90523f10
>>
>> case EXIT_REASON_EPT_VIOLATION:
>> +	if (!(ve->gpa & tdx_shared_mask())) {
>> +		panic("#VE due to access to unaccepted memory. "
>> +		      "GPA: %#llx\n", ve->gpa);
>> +	}
>> +
>> /* original from Kirill and Kuppuswamy */
>>
>> It's already there, but it just didn't get into the main branch.
> 
> Could you explain how that prevents the #VE from occurring in the
> "syscall gap" or in a place where the kernel is running with the user
> GSBASE value?
> 
Thank you for explaining the "paranoid entry" points with these examples;
now I understand why SEPT_VE_DISABLE is necessary for a TD.

> It doesn't as far as I can tell.  You need the SEPT_VE_DISABLE check for
> that.
  
Erdem Aktas Nov. 7, 2022, 10:53 p.m. UTC | #10
On Fri, Nov 4, 2022 at 3:50 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/4/22 15:36, Erdem Aktas wrote:
> > On Fri, Oct 28, 2022 at 7:12 AM Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> >> +        *
> >> +        * Kernel has no legitimate use-cases for #VE on private memory. It is
> >> +        * either a guest kernel bug (like access of unaccepted memory) or
> >> +        * malicious/buggy VMM that removes guest page that is still in use.
> >> +        *
> >
> > I think this statement is too strong and I have a few concerns about
> > this approach. I understand that there is an issue with handling #VEs
> > on private pages, but it seems like we are just hiding the problem with
> > this approach instead of fixing it - I do not have any fix in mind -.
> > First, there is a feature of injecting #VE to handle unaccepted pages at
> > runtime and accept them on demand; now the statement is saying this was
> > an unnecessary feature (why is it there at all then?), as there is no
> > legitimate use case.
>
> We're doing on-demand page acceptance.  We just don't need a #VE to
> drive it.  Why is it in the TDX module then?  Inertia?  Because it got
> too far along in the process before anyone asked me or some of the other
> x86 kernel folks to look at it hard.
>
> > I wonder if this will limit how we can implement lazy TDACCEPT.
> > There are multiple ideas floating around now.
> > https://github.com/intel/tdx/commit/9b3ef9655b695d3c67a557ec016487fded8b0e2b
> > has 3 implementation choices, of which the "Accept a block of memory on
> > the first use" option is implemented. Actually, it says "Accept a block
> > of memory on the first use" but it is implemented as "Accept a block of
> > memory on the first allocation". The comments in this code also raise
> > concerns about the performance.
> >
> > As of now, we do not know which one of those ideas will provide
> > acceptable performance for booting large VMs. If the performance
> > overhead is high, we can always implement lazy TDACCEPT so that the
> > first time a guest accesses unaccepted memory, the #VE handler does the
> > TDACCEPT.
>
> Could you please elaborate a bit on what you think the distinction is
> between:
>
>         * Accept on first use
> and
>         * Accept on allocation
>
> Surely, for the vast majority of memory, it's allocated and then used
> pretty quickly.  As in, most allocations are __GFP_ZERO so they're
> allocated and "used" before they even leave the allocator.  So, in
> practice, they're *VERY* close to equivalent.
>
> Where do you see them diverging?  Why does it matter?
>

For a VM with a very large memory size, let's say close to 800G of
memory, it might take a really long time to finish the initialization.
If all allocations are __GFP_ZERO, then I agree it would not matter,
but -- I need to run some benchmarks to validate -- what I remember is
that this was not what we were observing. Let me run a few tests to
provide more input on this, but meanwhile if you have already run some
benchmarks, that would be great.

What I see in the code is that the "accept_page" function will zero
all the unaccepted pages even if the __GFP_ZERO flag is not set, and if
__GFP_ZERO is set, we will zero all those pages again. I see a lot of
concerning comments like "Page acceptance can be very slow.".

What I mean by "Accept on allocation" is leaving the memory
allocation as it is and using the #VE handler to accept pages the
first time they are accessed.

Let me come back with some numbers on this, which might take some time
to collect.

> > I am not trying to solve the lazy TDACCEPT problem here but all I am
> > trying to say is that, there might be legitimate use cases for #VE on
> > private memory and this patch limits any future improvement we might
> > need to do on lazy TDACCEPT implementation.
>
> The kernel can't take exceptions on arbitrary memory accesses.  I have
> *ZERO* idea how to handle page acceptance on an access to a per-cpu
> variable referenced in syscall entry, or the NMI stack when we've
> interrupted kernel code with a user GSBASE value.
>
> So, we either find *ALL* the kernel memory that needs to be pre-accepted
> at allocation time (like kernel stacks) or we just say that all
> allocated memory needs to be accepted before we let it be allocated.
>
> One of those is really easy.  The other means a boatload of code audits.
>  I know.  I had to do that kind of exercise to get KPTI to work.  I
> don't want to do it again.  It was worth it for KPTI when the world was
> on fire.  TDX isn't that important IMNHO.  There's an easier way.
  
Dave Hansen Nov. 7, 2022, 11:30 p.m. UTC | #11
On 11/7/22 14:53, Erdem Aktas wrote:
> On Fri, Nov 4, 2022 at 3:50 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> Could you please elaborate a bit on what you think the distinction is
>> between:
>>
>>         * Accept on first use
>> and
>>         * Accept on allocation
>>
>> Surely, for the vast majority of memory, it's allocated and then used
>> pretty quickly.  As in, most allocations are __GFP_ZERO so they're
>> allocated and "used" before they even leave the allocator.  So, in
>> practice, they're *VERY* close to equivalent.
>>
>> Where do you see them diverging?  Why does it matter?
> 
> For a VM with a very large memory size, let's say close to 800G of
> memory, it might take a really long time to finish the initialization.
> If all allocations are __GFP_ZERO, then I agree it would not matter,
> but -- I need to run some benchmarks to validate -- what I remember is
> that this was not what we were observing. Let me run a few tests to
> provide more input on this, but meanwhile if you have already run some
> benchmarks, that would be great.
> 
> What I see in the code is that the "accept_page" function will zero
> all the unaccepted pages even if the __GFP_ZERO flag is not set, and if
> __GFP_ZERO is set, we will zero all those pages again. I see a lot of
> concerning comments like "Page acceptance can be very slow.".

I'm not following you at all here.  Yeah, page acceptance is very slow.
But, the slowest part is probably the cache coherency dance that the TDX
module has to do flushing and zeroing all the memory to initialize the
new integrity metadata.  Second to that is the cost of the TDCALL.
Third is the cost of the #VE.

Here's what Kirill is proposing, in some pseudocode:

	alloc_page(order=0, __GFP_ZERO) {
		TD.accept(size=4M) {
			// TDX Module clflushes/zeroes 4M of memory
		}
		memset(4k);
		// leave 1023 accepted 4k pages in the allocator
	}

To accept 4M of memory, you do one TDCALL.  You do zero #VE's.  Using
the #VE handler, you do:

	alloc_page(order=0, __GFP_ZERO) {
		memset(4k) {
			-> #VE handler
			TD.accept(size=4k); // flush/zero 4k
		}
		// only 4k was accepted
	}
	... Take 1023 more #VE's later on for each 4k page

You do 1024 #VE's and 1024 TDCALLs.  So, let's summarize.  To do 4M
worth of 4k pages, here's how the two approaches break down if
__GFP_ZERO is in play:

	      #VE	Accept-in-allocator
#VE's:	     1024			0
TDCALLS:     1024			1
clflushes: 4k x 1024		4k x 1024
memset()s: 4k x 1024		4k x 1024

The *ONLY* downside of accept-at-allocate as implemented is that it does
4M at a time, so the TDCALL is long compared to a 4k one.  But, this is
a classic bandwidth versus latency compromise.  In this case, we choose
bandwidth.

*Both* cases need to memset() the same amount of memory.  Both cases
only memset() 4k at a time.

The *ONLY* way the #VE approach is better is if you allocate 4k and then
never touch the rest of the 4M page.  That might happen, maybe *ONE*
time per zone.  But the rest of the time, the amortization of the TDCALL
cost is going to win.

I'll be shocked if any benchmarking turns up another result.
  

Patch

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 343d60853b71..a376a0c3fddc 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -34,6 +34,9 @@ 
 #define VE_GET_PORT_NUM(e)	((e) >> 16)
 #define VE_IS_IO_STRING(e)	((e) & BIT(4))
 
+/* TD Attributes */
+#define ATTR_SEPT_VE_DISABLE	BIT(28)
+
 /* Caches GPA width from TDG.VP.INFO TDCALL */
 static unsigned int gpa_width __ro_after_init;
 
@@ -770,6 +773,52 @@  void __init tdx_early_init(void)
 	 */
 	tdx_parse_tdinfo();
 
+	/*
+	 * Do not allow #VE due to EPT violation on the private memory
+	 *
+	 * Virtualization Exceptions (#VE) are delivered to TDX guests due to
+	 * specific guest actions such as using specific instructions or
+	 * accessing a specific MSR.
+	 *
+	 * Notable reason for #VE is access to specific guest physical
+	 * addresses. It requires special security considerations as it is not
+	 * fully in control of the guest kernel. VMM can remove a page from EPT
+	 * page table and trigger #VE on access.
+	 *
+	 * The primary use-case for #VE on a memory access is MMIO: VMM removes
+	 * page from EPT to trigger exception in the guest which allows guest to
+	 * emulate MMIO with hypercalls.
+	 *
+	 * MMIO only happens on shared memory. All conventional kernel memory is
+	 * private. This includes everything from kernel stacks to kernel text.
+	 *
+	 * Handling exceptions on arbitrary accesses to kernel memory is
+	 * essentially impossible as handling #VE may require access to memory
+	 * that also triggers the exception.
+	 *
+	 * TDX module provides mechanism to disable #VE delivery on access to
+	 * private memory. If SEPT_VE_DISABLE TD attribute is set, private EPT
+	 * violation will not be reflected to the guest as #VE, but will trigger
+	 * exit to VMM.
+	 *
+	 * Make sure the attribute is set by VMM. Panic otherwise.
+	 *
+	 * There's small window during the boot before the check where kernel has
+	 * early #VE handler. But the handler is only for port I/O and panic as
+	 * soon as it sees any other #VE reason.
+	 *
+	 * SEPT_VE_DISABLE makes SEPT violation unrecoverable and terminating
+	 * the TD is the only option.
+	 *
+	 * Kernel has no legitimate use-cases for #VE on private memory. It is
+	 * either a guest kernel bug (like access of unaccepted memory) or
+	 * malicious/buggy VMM that removes guest page that is still in use.
+	 *
+	 * In both cases terminating TD is the right thing to do.
+	 */
+	if (!(td_attr & ATTR_SEPT_VE_DISABLE))
+		panic("TD misconfiguration: SEPT_VE_DISABLE attribute must be set.\n");
+
 	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
 
 	cc_set_vendor(CC_VENDOR_INTEL);