[3/4] x86/kexec(): Reset TDX private memory on platforms with TDX erratum

Message ID b52ed259e0d487b3a968b98da91cb4f55a28ae82.1706698706.git.kai.huang@intel.com
State New
Headers
Series TDX host: kexec() support |

Commit Message

Kai Huang Jan. 31, 2024, 11:31 a.m. UTC
  From: Kai Huang <kai.huang@intel.com>

The first few generations of TDX hardware have an erratum.  A partial
write to a TDX private memory cacheline will silently "poison" the
line.  Subsequent reads will consume the poison and generate a machine
check.  According to the TDX hardware spec, neither of these things
should have happened.

== Background ==

Virtually all kernel memory accesses operations happen in full
cachelines.  In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.

This problem is triggered by "partial" writes where a write transaction
of less than cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.

== Problem ==

A fast warm reset doesn't reset TDX private memory.  Kexec() can also
boot into the new kernel directly.  Thus if the old kernel has enabled
TDX on the platform with this erratum, the new kernel may get unexpected
machine check.

Note that w/o this erratum any kernel read/write on TDX private memory
should never cause machine check, thus it's OK for the old kernel to
leave TDX private pages as is.

== Solution ==

In short, with this erratum, the kernel needs to explicitly convert all
TDX private pages back to normal to give the new kernel a clean slate
after kexec().  The BIOS is also expected to disable fast warm reset as
a workaround to this erratum, thus this implementation doesn't try to
reset TDX private memory for the reboot case in the kernel but depend on
the BIOS to enable the workaround.

Convert TDX private pages back to normal after all remote cpus has been
stopped and cache flush has been done on all cpus, when no more TDX
activity can happen further.  Do it in machine_kexec() to avoid the
additional overhead to the normal reboot/shutdown as the kernel depends
on the BIOS to disable fast warm reset for the reboot case.

For now TDX private memory can only be PAMT pages.  It would be ideal to
cover all types of TDX private memory here, but there are practical
problems to do so:

1) There's no existing infrastructure to track TDX private pages;
2) It's not feasible to query the TDX module about page type because VMX
   has already been stopped when KVM receives the reboot notifier, plus
   the result from the TDX module may not be accurate (e.g., the remote
   CPU could be stopped right before MOVDIR64B).

One temporary solution is to blindly convert all memory pages, but it's
problematic to do so too, because not all pages are mapped as writable
in the direct mapping.  It can be done by switching to the identical
mapping created for kexec() or a new page table, but the complexity
looks overkill.

Therefore, rather than doing something dramatic, only reset PAMT pages
here.  Other kernel components which use TDX need to do the conversion
on their own by intercepting the rebooting/shutdown notifier (KVM
already does that).

Note kexec() can happen at anytime, including when TDX module is being
initialized.  Register TDX reboot notifier callback to stop further TDX
module initialization.  If there's any ongoing module initialization,
wait until it finishes.  This makes sure the TDX module status is stable
after the reboot notifier callback, and the later kexec() code can read
module status to decide whether PAMTs are stable and available.

Also stop further TDX module initialization in case of machine shutdown
and halt, but not limited to kexec(), as there's no reason to do so in
these cases too.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/tdx.h         |  2 +
 arch/x86/kernel/machine_kexec_64.c | 16 +++++
 arch/x86/virt/vmx/tdx/tdx.c        | 98 ++++++++++++++++++++++++++++++
 3 files changed, 116 insertions(+)
  

Comments

Dave Hansen Jan. 31, 2024, 9:21 p.m. UTC | #1
On 1/31/24 03:31, Huang, Kai wrote:
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -28,6 +28,7 @@
>  #include <asm/setup.h>
>  #include <asm/set_memory.h>
>  #include <asm/cpu.h>
> +#include <asm/tdx.h>
>  
>  #ifdef CONFIG_ACPI
>  /*
> @@ -298,9 +299,24 @@ void machine_kexec(struct kimage *image)
>  	void *control_page;
>  	int save_ftrace_enabled;
>  
> +	/*
> +	 * For platforms with TDX "partial write machine check" erratum,
> +	 * all TDX private pages need to be converted back to normal
> +	 * before booting to the new kernel, otherwise the new kernel
> +	 * may get unexpected machine check.
> +	 *
> +	 * But skip this when preserve_context is on.  The second kernel
> +	 * shouldn't write to the first kernel's memory anyway.  Skipping
> +	 * this also avoids killing TDX in the first kernel, which would
> +	 * require more complicated handling.
> +	 */

This is waaaaaaaaaaaaaaay too much information to stick in a generic
function.  Just call tdx_reset_memory() and make the argument about

>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)
>  		save_processor_state();
> +	else
> +		tdx_reset_memory();
> +#else
> +	tdx_reset_memory();
>  #endif

Wow, that's awfully hard to read.  I really wish folks' gag reflex would
kick in when they see stuff like this to get them to spend an additional
15 seconds to turn this into:

	if (IS_ENABLED(CONFIG_KEXEC_JUMP) && image->preserve_context)
		save_processor_state();
	else
		tdx_reset_memory();

>  	save_ftrace_enabled = __ftrace_enabled_save();
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 9f1fed458a32..0537b1b76c2b 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -28,6 +28,7 @@
>  #include <linux/acpi.h>
>  #include <linux/suspend.h>
>  #include <linux/acpi.h>
> +#include <linux/reboot.h>
>  #include <asm/page.h>
>  #include <asm/special_insns.h>
>  #include <asm/msr-index.h>
> @@ -54,6 +55,8 @@ static DEFINE_MUTEX(tdx_module_lock);
>  /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
>  static LIST_HEAD(tdx_memlist);
>  
> +static bool tdx_rebooting;
> +
>  typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
>  
>  static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
> @@ -1187,6 +1190,9 @@ static int __tdx_enable(void)
>  {
>  	int ret;
>  
> +	if (tdx_rebooting)
> +		return -EAGAIN;

This whole 'tdx_rebooting' deserves to at least be in its own patch.
Also -EAGAIN seems to be a really odd return code for a permanent failure.

>  	ret = init_tdx_module();
>  	if (ret) {
>  		pr_err("module initialization failed (%d)\n", ret);
> @@ -1420,6 +1426,90 @@ static struct notifier_block tdx_memory_nb = {
>  	.notifier_call = tdx_memory_notifier,
>  };
>  
> +/*
> + * Convert TDX private pages back to normal on platforms with
> + * "partial write machine check" erratum.
> + *
> + * Called from machine_kexec() before booting to the new kernel.
> + */
> +void tdx_reset_memory(void)
> +{
> +	if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
> +		return;
> +
> +	/*
> +	 * Kernel read/write to TDX private memory doesn't
> +	 * cause machine check on hardware w/o this erratum.
> +	 */
> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +		return;
> +
> +	/* Called from kexec() when only rebooting cpu is alive */
> +	WARN_ON_ONCE(num_online_cpus() != 1);
> +
> +	/*
> +	 * tdx_reboot_notifier() waits until ongoing TDX module
> +	 * initialization to finish, and module initialization is
> +	 * rejected after that.  Therefore @tdx_module_status is
> +	 * stable here and can be read w/o holding lock.
> +	 */
> +	if (tdx_module_status != TDX_MODULE_INITIALIZED)
> +		return;

kexec() can't happen until after reboot notifiers are called.
tdx_reboot_notifier() makes 'tdx_module_status' stable, so no lock is
needed.

Right?

> +	/*
> +	 * Flush cache of all TDX private memory _before_ converting
> +	 * them back to avoid silent memory corruption.
> +	 */
> +	native_wbinvd();

Since this is single-threaded, it also needs to mention where all the
other CPU caches got flushed.

> +	/*
> +	 * Convert PAMTs back to normal.  All other cpus are already
> +	 * dead and TDMRs/PAMTs are stable.
> +	 *
> +	 * Ideally it's better to cover all types of TDX private pages
> +	 * here, but it's impractical:
> +	 *
> +	 *  - There's no existing infrastructure to tell whether a page
> +	 *    is TDX private memory or not.
> +	 *
> +	 *  - Using SEAMCALL to query TDX module isn't feasible either:
> +	 *    - VMX has been turned off by reaching here so SEAMCALL
> +	 *      cannot be made;
> +	 *    - Even SEAMCALL can be made the result from TDX module may

		    ^ if     ^ a          ^,

> +	 *      not be accurate (e.g., remote CPU can be stopped while
> +	 *      the kernel is in the middle of reclaiming TDX private
> +	 *      page and doing MOVDIR64B).
> +	 *
> +	 * One temporary solution could be just converting all memory
> +	 * pages, but it's problematic too, because not all pages are
> +	 * mapped as writable in direct mapping.  It can be done by
> +	 * switching to the identical mapping for kexec() or a new page
> +	 * table which maps all pages as writable, but the complexity is
> +	 * overkill.
> +	 *
> +	 * Thus instead of doing something dramatic to convert all pages,
> +	 * only convert PAMTs here.  Other kernel components which use
> +	 * TDX need to do the conversion on their own by intercepting the
> +	 * rebooting/shutdown notifier (KVM already does that).
> +	 */

I'd leave the extended alternatives discussion in the changelog, not here.

Focus on what _this_ is doing and why it is imperfect:

 1. Just reset the PAMDs
 2. This leaves A, B, and C undealt with
 3. The risk of leaving those is ...


> +	tdmrs_reset_pamt_all(&tdx_tdmr_list);
> +}
> +
> +static int tdx_reboot_notifier(struct notifier_block *nb, unsigned long mode,
> +			       void *unused)
> +{
> +	/* Wait ongoing TDX initialization to finish */
> +	mutex_lock(&tdx_module_lock);
> +	tdx_rebooting = true;
> +	mutex_unlock(&tdx_module_lock);
> +
> +	return NOTIFY_OK;
> +}

Why is 'tdx_rebooting' a new variable instead of a new state in
tdx_module_status?
  
Kirill A. Shutemov Jan. 31, 2024, 10:03 p.m. UTC | #2
On Wed, Jan 31, 2024 at 01:21:39PM -0800, Dave Hansen wrote:
> >  #ifdef CONFIG_KEXEC_JUMP
> >  	if (image->preserve_context)
> >  		save_processor_state();
> > +	else
> > +		tdx_reset_memory();
> > +#else
> > +	tdx_reset_memory();
> >  #endif
> 
> Wow, that's awfully hard to read.  I really wish folks' gag reflex would
> kick in when they see stuff like this to get them to spend an additional
> 15 seconds to turn this into:
> 
> 	if (IS_ENABLED(CONFIG_KEXEC_JUMP) && image->preserve_context)
> 		save_processor_state();
> 	else
> 		tdx_reset_memory();

save_processor_state() is declared under CONFIG_PM_SLEEP, so I guess your
variant might break build in some cases without updated suspend.h.
  
Kai Huang Feb. 1, 2024, 2:22 p.m. UTC | #3
On 1/02/2024 6:03 am, Kirill A. Shutemov wrote:
> On Wed, Jan 31, 2024 at 01:21:39PM -0800, Dave Hansen wrote:
>>>   #ifdef CONFIG_KEXEC_JUMP
>>>   	if (image->preserve_context)
>>>   		save_processor_state();
>>> +	else
>>> +		tdx_reset_memory();
>>> +#else
>>> +	tdx_reset_memory();
>>>   #endif
>>
>> Wow, that's awfully hard to read.  I really wish folks' gag reflex would
>> kick in when they see stuff like this to get them to spend an additional
>> 15 seconds to turn this into:
>>
>> 	if (IS_ENABLED(CONFIG_KEXEC_JUMP) && image->preserve_context)
>> 		save_processor_state();
>> 	else
>> 		tdx_reset_memory();
> 
> save_processor_state() is declared under CONFIG_PM_SLEEP, so I guess your
> variant might break build in some cases without updated suspend.h.

I tried.  If I turn off both SUSPEND and HIBERNATION in the Kconfig I 
got build error:

arch/x86/kernel/machine_kexec_64.c:325:17: error: implicit declaration 
of function ‘save_processor_state’ [-Werror=implicit-function-declaration]
   325 |                 save_processor_state();
       |                 ^~~~~~~~~~~~~~~~~~~~
  
Kai Huang Feb. 1, 2024, 2:35 p.m. UTC | #4
On 1/02/2024 5:21 am, Dave Hansen wrote:
> On 1/31/24 03:31, Huang, Kai wrote:
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -28,6 +28,7 @@
>>   #include <asm/setup.h>
>>   #include <asm/set_memory.h>
>>   #include <asm/cpu.h>
>> +#include <asm/tdx.h>
>>   
>>   #ifdef CONFIG_ACPI
>>   /*
>> @@ -298,9 +299,24 @@ void machine_kexec(struct kimage *image)
>>   	void *control_page;
>>   	int save_ftrace_enabled;
>>   
>> +	/*
>> +	 * For platforms with TDX "partial write machine check" erratum,
>> +	 * all TDX private pages need to be converted back to normal
>> +	 * before booting to the new kernel, otherwise the new kernel
>> +	 * may get unexpected machine check.
>> +	 *
>> +	 * But skip this when preserve_context is on.  The second kernel
>> +	 * shouldn't write to the first kernel's memory anyway.  Skipping
>> +	 * this also avoids killing TDX in the first kernel, which would
>> +	 * require more complicated handling.
>> +	 */
> 
> This is waaaaaaaaaaaaaaay too much information to stick in a generic
> function.  Just call tdx_reset_memory() and make the argument about

Hi Dave,

Thanks for review.

Agreed too much info here.  But I don't quite understand your second 
sentence here.  Can I have your suggestion again?

> 
>>   #ifdef CONFIG_KEXEC_JUMP
>>   	if (image->preserve_context)
>>   		save_processor_state();
>> +	else
>> +		tdx_reset_memory();
>> +#else
>> +	tdx_reset_memory();
>>   #endif
> 
> Wow, that's awfully hard to read.  I really wish folks' gag reflex would
> kick in when they see stuff like this to get them to spend an additional
> 15 seconds to turn this into:
> 
> 	if (IS_ENABLED(CONFIG_KEXEC_JUMP) && image->preserve_context)
> 		save_processor_state();
> 	else
> 		tdx_reset_memory();
> 

Yeah this is way better, but as Kirill pointed out we will have build 
error when both SUSPEND and HIBERNATION is off.

Maybe below?

#ifdef CONFIG_KEXEC_JUMP
	if (image->preserve_context)
		save_processor_state();
	else
#endif
	tdx_reset_memory();


>>   	save_ftrace_enabled = __ftrace_enabled_save();
>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>> index 9f1fed458a32..0537b1b76c2b 100644
>> --- a/arch/x86/virt/vmx/tdx/tdx.c
>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>> @@ -28,6 +28,7 @@
>>   #include <linux/acpi.h>
>>   #include <linux/suspend.h>
>>   #include <linux/acpi.h>
>> +#include <linux/reboot.h>
>>   #include <asm/page.h>
>>   #include <asm/special_insns.h>
>>   #include <asm/msr-index.h>
>> @@ -54,6 +55,8 @@ static DEFINE_MUTEX(tdx_module_lock);
>>   /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
>>   static LIST_HEAD(tdx_memlist);
>>   
>> +static bool tdx_rebooting;
>> +
>>   typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
>>   
>>   static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
>> @@ -1187,6 +1190,9 @@ static int __tdx_enable(void)
>>   {
>>   	int ret;
>>   
>> +	if (tdx_rebooting)
>> +		return -EAGAIN;
> 
> This whole 'tdx_rebooting' deserves to at least be in its own patch.
> Also -EAGAIN seems to be a really odd return code for a permanent failure.

Will move to a separate patch.

How about just -EINVAL?

> 
>>   	ret = init_tdx_module();
>>   	if (ret) {
>>   		pr_err("module initialization failed (%d)\n", ret);
>> @@ -1420,6 +1426,90 @@ static struct notifier_block tdx_memory_nb = {
>>   	.notifier_call = tdx_memory_notifier,
>>   };
>>   
>> +/*
>> + * Convert TDX private pages back to normal on platforms with
>> + * "partial write machine check" erratum.
>> + *
>> + * Called from machine_kexec() before booting to the new kernel.
>> + */
>> +void tdx_reset_memory(void)
>> +{
>> +	if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
>> +		return;
>> +
>> +	/*
>> +	 * Kernel read/write to TDX private memory doesn't
>> +	 * cause machine check on hardware w/o this erratum.
>> +	 */
>> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
>> +		return;
>> +
>> +	/* Called from kexec() when only rebooting cpu is alive */
>> +	WARN_ON_ONCE(num_online_cpus() != 1);
>> +
>> +	/*
>> +	 * tdx_reboot_notifier() waits until ongoing TDX module
>> +	 * initialization to finish, and module initialization is
>> +	 * rejected after that.  Therefore @tdx_module_status is
>> +	 * stable here and can be read w/o holding lock.
>> +	 */
>> +	if (tdx_module_status != TDX_MODULE_INITIALIZED)
>> +		return;
> 
> kexec() can't happen until after reboot notifiers are called.
> tdx_reboot_notifier() makes 'tdx_module_status' stable, so no lock is
> needed.
> 
> Right?

Yes.  I can replace the comment with your words.

> 
>> +	/*
>> +	 * Flush cache of all TDX private memory _before_ converting
>> +	 * them back to avoid silent memory corruption.
>> +	 */
>> +	native_wbinvd();
> 
> Since this is single-threaded, it also needs to mention where all the
> other CPU caches got flushed.

OK will point out.

> 
>> +	/*
>> +	 * Convert PAMTs back to normal.  All other cpus are already
>> +	 * dead and TDMRs/PAMTs are stable.
>> +	 *
>> +	 * Ideally it's better to cover all types of TDX private pages
>> +	 * here, but it's impractical:
>> +	 *
>> +	 *  - There's no existing infrastructure to tell whether a page
>> +	 *    is TDX private memory or not.
>> +	 *
>> +	 *  - Using SEAMCALL to query TDX module isn't feasible either:
>> +	 *    - VMX has been turned off by reaching here so SEAMCALL
>> +	 *      cannot be made;
>> +	 *    - Even SEAMCALL can be made the result from TDX module may
> 
> 		    ^ if     ^ a          ^,
> 

Thanks.  Will add.

>> +	 *      not be accurate (e.g., remote CPU can be stopped while
>> +	 *      the kernel is in the middle of reclaiming TDX private
>> +	 *      page and doing MOVDIR64B).
>> +	 *
>> +	 * One temporary solution could be just converting all memory
>> +	 * pages, but it's problematic too, because not all pages are
>> +	 * mapped as writable in direct mapping.  It can be done by
>> +	 * switching to the identical mapping for kexec() or a new page
>> +	 * table which maps all pages as writable, but the complexity is
>> +	 * overkill.
>> +	 *
>> +	 * Thus instead of doing something dramatic to convert all pages,
>> +	 * only convert PAMTs here.  Other kernel components which use
>> +	 * TDX need to do the conversion on their own by intercepting the
>> +	 * rebooting/shutdown notifier (KVM already does that).
>> +	 */
> 
> I'd leave the extended alternatives discussion in the changelog, not here.
> 
> Focus on what _this_ is doing and why it is imperfect:
> 
>   1. Just reset the PAMDs
>   2. This leaves A, B, and C undealt with
>   3. The risk of leaving those is ...
> 

Agreed. Will do.

> 
>> +	tdmrs_reset_pamt_all(&tdx_tdmr_list);
>> +}
>> +
>> +static int tdx_reboot_notifier(struct notifier_block *nb, unsigned long mode,
>> +			       void *unused)
>> +{
>> +	/* Wait ongoing TDX initialization to finish */
>> +	mutex_lock(&tdx_module_lock);
>> +	tdx_rebooting = true;
>> +	mutex_unlock(&tdx_module_lock);
>> +
>> +	return NOTIFY_OK;
>> +}
> 
> Why is 'tdx_rebooting' a new variable instead of a new state in
> tdx_module_status?
> 

I think we can but I believe using tdx_rebooting is more clear.

For instance, if we want to add a new state TDX_MODULE_ABORT for this 
case, when tdx_reboot_notifier() is called after module initialization, 
then the tdx_module_status (which is already TDX_MODULE_INITIALIZED) 
will be overwritten and we will lose the information that module 
initialization has been done successfully.

We may be able to avoid these using more complicated logic but I think 
using a separate tdx_rebooting is straightforward.
  
Kirill A. Shutemov Feb. 1, 2024, 2:39 p.m. UTC | #5
On Thu, Feb 01, 2024 at 10:22:18PM +0800, Huang, Kai wrote:
> 
> 
> On 1/02/2024 6:03 am, Kirill A. Shutemov wrote:
> > On Wed, Jan 31, 2024 at 01:21:39PM -0800, Dave Hansen wrote:
> > > >   #ifdef CONFIG_KEXEC_JUMP
> > > >   	if (image->preserve_context)
> > > >   		save_processor_state();
> > > > +	else
> > > > +		tdx_reset_memory();
> > > > +#else
> > > > +	tdx_reset_memory();
> > > >   #endif
> > > 
> > > Wow, that's awfully hard to read.  I really wish folks' gag reflex would
> > > kick in when they see stuff like this to get them to spend an additional
> > > 15 seconds to turn this into:
> > > 
> > > 	if (IS_ENABLED(CONFIG_KEXEC_JUMP) && image->preserve_context)
> > > 		save_processor_state();
> > > 	else
> > > 		tdx_reset_memory();
> > 
> > save_processor_state() is declared under CONFIG_PM_SLEEP, so I guess your
> > variant might break build in some cases without updated suspend.h.
> 
> I tried.  If I turn off both SUSPEND and HIBERNATION in the Kconfig I got
> build error:
> 
> arch/x86/kernel/machine_kexec_64.c:325:17: error: implicit declaration of
> function ‘save_processor_state’ [-Werror=implicit-function-declaration]
>   325 |                 save_processor_state();
>       |                 ^~~~~~~~~~~~~~~~~~~~

Moving save_processor_state() declaration outside all defines would do the
trick.

Something like the patch below.

But finding the right spot for the move can be tricky. I don't particular
like where I moved it.

diff --git a/include/linux/suspend.h b/include/linux/suspend.h
index ef503088942d..1c17059df039 100644
--- a/include/linux/suspend.h
+++ b/include/linux/suspend.h
@@ -201,6 +201,9 @@ struct platform_s2idle_ops {
 	void (*end)(void);
 };
 
+void save_processor_state(void);
+void restore_processor_state(void);
+
 #ifdef CONFIG_SUSPEND
 extern suspend_state_t pm_suspend_target_state;
 extern suspend_state_t mem_sleep_current;
@@ -491,8 +494,6 @@ static inline int is_hibernate_resume_dev(dev_t dev) { return 0; }
 extern struct mutex system_transition_mutex;
 
 #ifdef CONFIG_PM_SLEEP
-void save_processor_state(void);
-void restore_processor_state(void);
 
 /* kernel/power/main.c */
 extern int register_pm_notifier(struct notifier_block *nb);
  
Kai Huang Feb. 1, 2024, 2:47 p.m. UTC | #6
On 1/02/2024 10:39 pm, Kirill A. Shutemov wrote:
> On Thu, Feb 01, 2024 at 10:22:18PM +0800, Huang, Kai wrote:
>>
>>
>> On 1/02/2024 6:03 am, Kirill A. Shutemov wrote:
>>> On Wed, Jan 31, 2024 at 01:21:39PM -0800, Dave Hansen wrote:
>>>>>    #ifdef CONFIG_KEXEC_JUMP
>>>>>    	if (image->preserve_context)
>>>>>    		save_processor_state();
>>>>> +	else
>>>>> +		tdx_reset_memory();
>>>>> +#else
>>>>> +	tdx_reset_memory();
>>>>>    #endif
>>>>
>>>> Wow, that's awfully hard to read.  I really wish folks' gag reflex would
>>>> kick in when they see stuff like this to get them to spend an additional
>>>> 15 seconds to turn this into:
>>>>
>>>> 	if (IS_ENABLED(CONFIG_KEXEC_JUMP) && image->preserve_context)
>>>> 		save_processor_state();
>>>> 	else
>>>> 		tdx_reset_memory();
>>>
>>> save_processor_state() is declared under CONFIG_PM_SLEEP, so I guess your
>>> variant might break build in some cases without updated suspend.h.
>>
>> I tried.  If I turn off both SUSPEND and HIBERNATION in the Kconfig I got
>> build error:
>>
>> arch/x86/kernel/machine_kexec_64.c:325:17: error: implicit declaration of
>> function ‘save_processor_state’ [-Werror=implicit-function-declaration]
>>    325 |                 save_processor_state();
>>        |                 ^~~~~~~~~~~~~~~~~~~~
> 
> Moving save_processor_state() declaration outside all defines would do the
> trick.
> 
> Something like the patch below.
> 
> But finding the right spot for the move can be tricky. I don't particular
> like where I moved it.
> 

I don't feel I have enough justification to do such change.

As I replied in another email, how about:

#ifdef CONFIG_KEXEC_JUMP
     if (image->preserve_context)
         save_processor_state();
     else
#endif
     tdx_reset_memory();

Not ideal, but looks slightly better?

But I'll leave to Dave to comment.
  
Dave Hansen Feb. 1, 2024, 4:57 p.m. UTC | #7
On 2/1/24 06:39, Kirill A. Shutemov wrote:
>> On 1/02/2024 6:03 am, Kirill A. Shutemov wrote:
>>> On Wed, Jan 31, 2024 at 01:21:39PM -0800, Dave Hansen wrote:
>>>>>   #ifdef CONFIG_KEXEC_JUMP
>>>>>   	if (image->preserve_context)
>>>>>   		save_processor_state();
>>>>> +	else
>>>>> +		tdx_reset_memory();
>>>>> +#else
>>>>> +	tdx_reset_memory();
>>>>>   #endif
..
> +void save_processor_state(void);
> +void restore_processor_state(void);
> +
>  #ifdef CONFIG_SUSPEND
>  extern suspend_state_t pm_suspend_target_state;
>  extern suspend_state_t mem_sleep_current;
> @@ -491,8 +494,6 @@ static inline int is_hibernate_resume_dev(dev_t dev) { return 0; }
>  extern struct mutex system_transition_mutex;
>  
>  #ifdef CONFIG_PM_SLEEP
> -void save_processor_state(void);
> -void restore_processor_state(void);

It's a little funky that we've got a #ifdef CONFIG_KEXEC_JUMP in the .c
file and then we're dodging around an #ifdef CONFIG_PM_SLEEP in the
header.  This is one of the reasons we shouldn't be putting #ifdefs in
c files in the first place.  But I digress...

Either way, if you focus on getting the dang #ifdef out of the main code
flow, the rest will fall in place easily.  Heck, if you even do this in
the x86 kexec code:

static void kexec_save_processor_start(image)
{
#ifdef CONFIG_KEXEC_JUMP
	if (image->preserve_context)
		save_processor_state();
#endif
}

it'll leave you with:

	kexec_save_processor_start(image);

	/* Give a good reason here */
	if (!image->preserve_context)
		tdx_reset_memory();

which is *FINE*.  No funky #ifdefs, indentation or else's dangling about.
  
Edgecombe, Rick P Feb. 2, 2024, 12:54 a.m. UTC | #8
On Wed, 2024-01-31 at 11:31 +0000, Huang, Kai wrote:
> Note kexec() can happen at anytime, including when TDX module is
> being
> initialized.  Register TDX reboot notifier callback to stop further
> TDX
> module initialization.  If there's any ongoing module initialization,
> wait until it finishes.  This makes sure the TDX module status is
> stable
> after the reboot notifier callback, and the later kexec() code can
> read
> module status to decide whether PAMTs are stable and available.

I don't see how this works with the crash kernel flavor of kexec. Did
you look at that scenario?
  
Kai Huang Feb. 5, 2024, 6:44 a.m. UTC | #9
On 2/02/2024 8:54 am, Edgecombe, Rick P wrote:
> On Wed, 2024-01-31 at 11:31 +0000, Huang, Kai wrote:
>> Note kexec() can happen at anytime, including when TDX module is
>> being
>> initialized.  Register TDX reboot notifier callback to stop further
>> TDX
>> module initialization.  If there's any ongoing module initialization,
>> wait until it finishes.  This makes sure the TDX module status is
>> stable
>> after the reboot notifier callback, and the later kexec() code can
>> read
>> module status to decide whether PAMTs are stable and available.
> 
> I don't see how this works with the crash kernel flavor of kexec. Did
> you look at that scenario?
> 
> 
Hmm right this doesn't work for crash kexec.  Thanks for pointing out.

We need a way that doesn't depend on the reboot notifier.  Previously we 
used a variable to indicate the point where it's possible to have any 
TDX private pages.  I'll switch back to use that.
  
Kai Huang Feb. 5, 2024, 6:49 a.m. UTC | #10
On 2/02/2024 12:57 am, Dave Hansen wrote:
> On 2/1/24 06:39, Kirill A. Shutemov wrote:
>>> On 1/02/2024 6:03 am, Kirill A. Shutemov wrote:
>>>> On Wed, Jan 31, 2024 at 01:21:39PM -0800, Dave Hansen wrote:
>>>>>>    #ifdef CONFIG_KEXEC_JUMP
>>>>>>    	if (image->preserve_context)
>>>>>>    		save_processor_state();
>>>>>> +	else
>>>>>> +		tdx_reset_memory();
>>>>>> +#else
>>>>>> +	tdx_reset_memory();
>>>>>>    #endif
> ...
>> +void save_processor_state(void);
>> +void restore_processor_state(void);
>> +
>>   #ifdef CONFIG_SUSPEND
>>   extern suspend_state_t pm_suspend_target_state;
>>   extern suspend_state_t mem_sleep_current;
>> @@ -491,8 +494,6 @@ static inline int is_hibernate_resume_dev(dev_t dev) { return 0; }
>>   extern struct mutex system_transition_mutex;
>>   
>>   #ifdef CONFIG_PM_SLEEP
>> -void save_processor_state(void);
>> -void restore_processor_state(void);
> 
> It's a little funky that we've got a #ifdef CONFIG_KEXEC_JUMP in the .c
> file and then we're dodging around an #ifdef CONFIG_PM_SLEEP in the
> header.  This is one of the reasons we shouldn't be putting #ifdefs in
> .c files in the first place.  But I digress...
> 
> Either way, if you focus on getting the dang #ifdef out of the main code
> flow, the rest will fall in place easily.  Heck, if you even do this in
> the x86 kexec code:
> 
> static void kexec_save_processor_start(image)
> {
> #ifdef CONFIG_KEXEC_JUMP
> 	if (image->preserve_context)
> 		save_processor_state();
> #endif
> }
> 
> it'll leave you with:
> 
> 	kexec_save_processor_start(image);
> 
> 	/* Give a good reason here */
> 	if (!image->preserve_context)
> 		tdx_reset_memory();
> 
> which is *FINE*.  No funky #ifdefs, indentation or else's dangling about.
> 
> 

Thanks for the insight and explanation!  I'll use this in the new version.
  

Patch

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index eba178996d84..ed3ac9a8a079 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -116,11 +116,13 @@  static inline u64 sc_retry(sc_func_t func, u64 fn,
 int tdx_cpu_enable(void);
 int tdx_enable(void);
 const char *tdx_dump_mce_info(struct mce *m);
+void tdx_reset_memory(void);
 #else
 static inline void tdx_init(void) { }
 static inline int tdx_cpu_enable(void) { return -ENODEV; }
 static inline int tdx_enable(void)  { return -ENODEV; }
 static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
+static inline void tdx_reset_memory(void) { }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index c9c6974e2e9c..b2279a3f6976 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -28,6 +28,7 @@ 
 #include <asm/setup.h>
 #include <asm/set_memory.h>
 #include <asm/cpu.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_ACPI
 /*
@@ -298,9 +299,24 @@  void machine_kexec(struct kimage *image)
 	void *control_page;
 	int save_ftrace_enabled;
 
+	/*
+	 * For platforms with TDX "partial write machine check" erratum,
+	 * all TDX private pages need to be converted back to normal
+	 * before booting to the new kernel, otherwise the new kernel
+	 * may get unexpected machine check.
+	 *
+	 * But skip this when preserve_context is on.  The second kernel
+	 * shouldn't write to the first kernel's memory anyway.  Skipping
+	 * this also avoids killing TDX in the first kernel, which would
+	 * require more complicated handling.
+	 */
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
 		save_processor_state();
+	else
+		tdx_reset_memory();
+#else
+	tdx_reset_memory();
 #endif
 
 	save_ftrace_enabled = __ftrace_enabled_save();
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 9f1fed458a32..0537b1b76c2b 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -28,6 +28,7 @@ 
 #include <linux/acpi.h>
 #include <linux/suspend.h>
 #include <linux/acpi.h>
+#include <linux/reboot.h>
 #include <asm/page.h>
 #include <asm/special_insns.h>
 #include <asm/msr-index.h>
@@ -54,6 +55,8 @@  static DEFINE_MUTEX(tdx_module_lock);
 /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
 static LIST_HEAD(tdx_memlist);
 
+static bool tdx_rebooting;
+
 typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
 
 static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
@@ -1187,6 +1190,9 @@  static int __tdx_enable(void)
 {
 	int ret;
 
+	if (tdx_rebooting)
+		return -EAGAIN;
+
 	ret = init_tdx_module();
 	if (ret) {
 		pr_err("module initialization failed (%d)\n", ret);
@@ -1420,6 +1426,90 @@  static struct notifier_block tdx_memory_nb = {
 	.notifier_call = tdx_memory_notifier,
 };
 
+/*
+ * Convert TDX private pages back to normal on platforms with
+ * "partial write machine check" erratum.
+ *
+ * Called from machine_kexec() before booting to the new kernel.
+ */
+void tdx_reset_memory(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
+		return;
+
+	/*
+	 * Kernel read/write to TDX private memory doesn't
+	 * cause machine check on hardware w/o this erratum.
+	 */
+	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+		return;
+
+	/* Called from kexec() when only rebooting cpu is alive */
+	WARN_ON_ONCE(num_online_cpus() != 1);
+
+	/*
+	 * tdx_reboot_notifier() waits until ongoing TDX module
+	 * initialization to finish, and module initialization is
+	 * rejected after that.  Therefore @tdx_module_status is
+	 * stable here and can be read w/o holding lock.
+	 */
+	if (tdx_module_status != TDX_MODULE_INITIALIZED)
+		return;
+
+	/*
+	 * Flush cache of all TDX private memory _before_ converting
+	 * them back to avoid silent memory corruption.
+	 */
+	native_wbinvd();
+
+	/*
+	 * Convert PAMTs back to normal.  All other cpus are already
+	 * dead and TDMRs/PAMTs are stable.
+	 *
+	 * Ideally it's better to cover all types of TDX private pages
+	 * here, but it's impractical:
+	 *
+	 *  - There's no existing infrastructure to tell whether a page
+	 *    is TDX private memory or not.
+	 *
+	 *  - Using SEAMCALL to query TDX module isn't feasible either:
+	 *    - VMX has been turned off by reaching here so SEAMCALL
+	 *      cannot be made;
+	 *    - Even SEAMCALL can be made the result from TDX module may
+	 *      not be accurate (e.g., remote CPU can be stopped while
+	 *      the kernel is in the middle of reclaiming TDX private
+	 *      page and doing MOVDIR64B).
+	 *
+	 * One temporary solution could be just converting all memory
+	 * pages, but it's problematic too, because not all pages are
+	 * mapped as writable in direct mapping.  It can be done by
+	 * switching to the identical mapping for kexec() or a new page
+	 * table which maps all pages as writable, but the complexity is
+	 * overkill.
+	 *
+	 * Thus instead of doing something dramatic to convert all pages,
+	 * only convert PAMTs here.  Other kernel components which use
+	 * TDX need to do the conversion on their own by intercepting the
+	 * rebooting/shutdown notifier (KVM already does that).
+	 */
+	tdmrs_reset_pamt_all(&tdx_tdmr_list);
+}
+
+static int tdx_reboot_notifier(struct notifier_block *nb, unsigned long mode,
+			       void *unused)
+{
+	/* Wait ongoing TDX initialization to finish */
+	mutex_lock(&tdx_module_lock);
+	tdx_rebooting = true;
+	mutex_unlock(&tdx_module_lock);
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block tdx_reboot_nb = {
+	.notifier_call = tdx_reboot_notifier,
+};
+
 static void __init check_tdx_erratum(void)
 {
 	/*
@@ -1474,6 +1564,14 @@  void __init tdx_init(void)
 		return;
 	}
 
+	err = register_reboot_notifier(&tdx_reboot_nb);
+	if (err) {
+		pr_err("initialization failed: register_reboot_notifier() failed (%d)\n",
+				err);
+		unregister_memory_notifier(&tdx_memory_nb);
+		return;
+	}
+
 #if defined(CONFIG_ACPI) && defined(CONFIG_SUSPEND)
 	pr_info("Disable ACPI S3. Turn off TDX in the BIOS to use ACPI S3.\n");
 	acpi_suspend_lowlevel = NULL;