[1/4] x86/coco: Add a new CC attribute to unify cache flush during kexec

Message ID cd88e97e53c502f0a457d6a82a31d9e8e0f9fca7.1706698706.git.kai.huang@intel.com
State: New
Series: TDX host: kexec() support

Commit Message

Kai Huang Jan. 31, 2024, 11:31 a.m. UTC
  From: Kai Huang <kai.huang@intel.com>

Currently on AMD SME platforms, during kexec() caches are flushed
manually before jumping to the new kernel due to memory encryption.
Intel TDX needs to flush cachelines of TDX private memory before
jumping to the second kernel too, otherwise they may silently corrupt
the new kernel.

Instead of sprinkling both AMD and Intel's specific checks around,
introduce a new CC_ATTR_HOST_MEM_INCOHERENT attribute to unify both
Intel and AMD, and simplify the logic:

	Could the old kernel leave incoherent caches around?
	If so, do WBINVD.

Convert the AMD SME to use this new CC attribute.  A later patch will
utilize this new attribute for Intel TDX too.
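
For illustration, the flush sites then reduce to a single vendor-neutral
check like the below (a sketch; see the actual diff for the call sites):

	if (cc_platform_has(CC_ATTR_HOST_MEM_INCOHERENT))
		native_wbinvd();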

Specifically, AMD SME flushes caches at two places: 1) stop_this_cpu();
2) relocate_kernel().  stop_this_cpu() checks the CPUID directly to do
WBINVD for the reason that the current kernel's SME enabling status may
not match the new kernel's choice.  However the relocate_kernel() only
does the WBINVD when the current kernel has enabled SME for the reason
that the new kernel is always placed in an "unencrypted" area.

To simplify the logic, for AMD SME change to always use the way that is
done in stop_this_cpu().  This will cause an additional WBINVD in
relocate_kernel() when the current kernel hasn't enabled SME (e.g.,
disabled by kernel command line), but this is acceptable for the sake of
having less complicated code (see [1] for the relevant discussion).

Note currently the kernel only advertises CC vendor for AMD SME when SME
is actually enabled by the kernel.  To always advertise the new
CC_ATTR_HOST_MEM_INCOHERENT regardless of the kernel's SME enabling
status, change to set CC vendor as long as the hardware has enabled SME.

Note "advertising CC_ATTR_HOST_MEM_INCOHERENT when the hardware has
enabled SME" is still different from "checking the CPUID" (the way that
is done in stop_this_cpu()), but technically the former also serves the
purpose and is actually more accurate.

Such change allows sme_me_mask to be 0 while CC vendor reports as AMD.
But this doesn't impact other CC attributes on AMD platforms, nor does
it impact the cc_mkdec()/cc_mkenc().

[1] https://lore.kernel.org/lkml/cbc9c527-17e5-4a63-80fe-85451394cc7c@amd.com/

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/coco/core.c               | 13 +++++++++++++
 arch/x86/kernel/machine_kexec_64.c |  2 +-
 arch/x86/kernel/process.c          | 14 +++-----------
 arch/x86/mm/mem_encrypt_identity.c | 11 ++++++++++-
 include/linux/cc_platform.h        | 15 +++++++++++++++
 5 files changed, 42 insertions(+), 13 deletions(-)
  

Comments

Borislav Petkov Feb. 19, 2024, 4:16 p.m. UTC | #1
On Wed, Jan 31, 2024 at 11:31:53AM +0000, Huang, Kai wrote:
> From: Kai Huang <kai.huang@intel.com>
> 
> Currently on AMD SME platforms, during kexec() caches are flushed
> manually before jumping to the new kernel due to memory encryption.
> Intel TDX needs to flush cachelines of TDX private memory before
> jumping to the second kernel too, otherwise they may silently corrupt
> the new kernel.
> 
> Instead of sprinkling both AMD and Intel's specific checks around,
> introduce a new CC_ATTR_HOST_MEM_INCOHERENT attribute to unify both
> Intel and AMD, and simplify the logic:
> 
> 	Could the old kernel leave incoherent caches around?

	"Is it possible that the kernel could leave caches in incoherent state?"

> 	If so, do WBINVD.
> 
> Convert the AMD SME to use this new CC attribute.

> A later patch will
> utilize this new attribute for Intel TDX too.

Yeah, once those are all in git, the concept of "subsequent patch"
becomes ambiguous depending on how you're sorting them. So try to read
that commit message out of the context of all those "subsequent patches"
and see if it still makes sense.

IOW, you should strive for your commit messages to make sense on their
own, without referencing other patches.

In this particular case, that "later patch" can go.

> Specifically, AMD SME flushes caches at two places: 1) stop_this_cpu();
> 2) relocate_kernel().  stop_this_cpu() checks the CPUID directly to do
> WBINVD for the reason that the current kernel's SME enabling status may
> not match the new kernel's choice.  However the relocate_kernel() only
> does the WBINVD when the current kernel has enabled SME for the reason
> that the new kernel is always placed in an "unencrypted" area.
> 
> To simplify the logic, for AMD SME change to always use the way that is
> done in stop_this_cpu().  This will cause an additional WBINVD in
> relocate_kernel() when the current kernel hasn't enabled SME (e.g.,
> disabled by kernel command line), but this is acceptable for the sake of
> having less complicated code (see [1] for the relevant discussion).
> 
> Note currently the kernel only advertises CC vendor for AMD SME when SME
> is actually enabled by the kernel.  To always advertise the new
> CC_ATTR_HOST_MEM_INCOHERENT regardless of the kernel's SME enabling
> status, change to set CC vendor as long as the hardware has enabled SME.
> 
> Note "advertising CC_ATTR_HOST_MEM_INCOHERENT when the hardware has
> enabled SME" is still different from "checking the CPUID" (the way that
> is done in stop_this_cpu()), but technically the former also serves the
> purpose and is actually more accurate.
> 
> Such change allows sme_me_mask to be 0 while CC vendor reports as AMD.
> But this doesn't impact other CC attributes on AMD platforms, nor does
> it impact the cc_mkdec()/cc_mkenc().
> 
> [1] https://lore.kernel.org/lkml/cbc9c527-17e5-4a63-80fe-85451394cc7c@amd.com/
> 
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/coco/core.c               | 13 +++++++++++++
>  arch/x86/kernel/machine_kexec_64.c |  2 +-
>  arch/x86/kernel/process.c          | 14 +++-----------
>  arch/x86/mm/mem_encrypt_identity.c | 11 ++++++++++-
>  include/linux/cc_platform.h        | 15 +++++++++++++++
>  5 files changed, 42 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> index eeec9986570e..8d6d727e6e18 100644
> --- a/arch/x86/coco/core.c
> +++ b/arch/x86/coco/core.c
> @@ -72,6 +72,19 @@ static bool noinstr amd_cc_platform_has(enum cc_attr attr)
>  	case CC_ATTR_HOST_MEM_ENCRYPT:
>  		return sme_me_mask && !(sev_status & MSR_AMD64_SEV_ENABLED);
>  
> +	case CC_ATTR_HOST_MEM_INCOHERENT:
> +		/*
> +		 * CC_ATTR_HOST_MEM_INCOHERENT represents whether SME has
> +		 * enabled on the platform regardless whether the kernel
> +		 * has actually enabled the SME.
> +

"represents whether SME has [been] enabled ... regardless whether the
kernel has enabled SME"?!?

I think this needs to be:

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index d07be9d05cd0..e3653465e532 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -67,6 +67,13 @@ static bool noinstr amd_cc_platform_has(enum cc_attr attr)
 
        switch (attr) {
        case CC_ATTR_MEM_ENCRYPT:
+
+               /*
+                * CC_ATTR_HOST_MEM_INCOHERENT represents whether SME has
+                * enabled on the platform regardless whether the kernel
+                * has actually enabled the SME.
+                */
+       case CC_ATTR_HOST_MEM_INCOHERENT:
                return sme_me_mask;
 
        case CC_ATTR_HOST_MEM_ENCRYPT:


> +		return !(sev_status & MSR_AMD64_SEV_ENABLED);
> +
> +	/*
> +	 * For all CC_ATTR_GUEST_* there's no need to check sme_me_mask
> +	 * as it must be true when there's any SEV enable bit set in
> +	 * sev_status.
> +	 */

Superfluous comment.

>  	case CC_ATTR_GUEST_MEM_ENCRYPT:
>  		return sev_status & MSR_AMD64_SEV_ENABLED;
>  
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index bc0a5348b4a6..c9c6974e2e9c 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -358,7 +358,7 @@ void machine_kexec(struct kimage *image)
>  				       (unsigned long)page_list,
>  				       image->start,
>  				       image->preserve_context,
> -				       cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
> +				       cc_platform_has(CC_ATTR_HOST_MEM_INCOHERENT));
>  
>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index ab49ade31b0d..2c7e8d9889c0 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -813,18 +813,10 @@ void __noreturn stop_this_cpu(void *dummy)
>  	mcheck_cpu_clear(c);
>  
>  	/*
> -	 * Use wbinvd on processors that support SME. This provides support
> -	 * for performing a successful kexec when going from SME inactive
> -	 * to SME active (or vice-versa). The cache must be cleared so that
> -	 * if there are entries with the same physical address, both with and
> -	 * without the encryption bit, they don't race each other when flushed
> -	 * and potentially end up with the wrong entry being committed to
> -	 * memory.
> -	 *
> -	 * Test the CPUID bit directly because the machine might've cleared
> -	 * X86_FEATURE_SME due to cmdline options.
> +	 * Use wbinvd on processors that the first kernel *could*
> +	 * potentially leave incoherent cachelines.

No need for that comment anymore - people can grep for
CC_ATTR_HOST_MEM_INCOHERENT's definition simply.

>  	 */
> -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> +	if (cc_platform_has(CC_ATTR_HOST_MEM_INCOHERENT))
>  		native_wbinvd();
>  
>  	/*
> diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
> index 7f72472a34d6..87e4fddab770 100644
> --- a/arch/x86/mm/mem_encrypt_identity.c
> +++ b/arch/x86/mm/mem_encrypt_identity.c
> @@ -570,9 +570,19 @@ void __init sme_enable(struct boot_params *bp)
>  		msr = __rdmsr(MSR_AMD64_SYSCFG);
>  		if (!(msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT))
>  			return;
> +
> +		/*
> +		 * Always set CC vendor when the platform has SME enabled
> +		 * regardless whether the kernel will actually activates the
> +		 * SME or not.  This reports the CC_ATTR_HOST_MEM_INCOHERENT
> +		 * being true as long as the platform has SME enabled so that
> +		 * stop_this_cpu() can do necessary WBINVD during kexec().
> +		 */
> +		cc_vendor = CC_VENDOR_AMD;
>  	} else {
>  		/* SEV state cannot be controlled by a command line option */
>  		sme_me_mask = me_mask;
> +		cc_vendor = CC_VENDOR_AMD;
>  		goto out;
>  	}
>  

So you can't put it before the if - just slap it in both branches. Geez!

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 0166ab1780cc..1e4566cc233f 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -549,6 +549,8 @@ void __init sme_enable(struct boot_params *bp)
        if (snp && !(msr & MSR_AMD64_SEV_SNP_ENABLED))
                snp_abort();
 
+       cc_vendor = CC_VENDOR_AMD;
+
        /* Check if memory encryption is enabled */
        if (feature_mask == AMD_SME_BIT) {
                /*
@@ -597,6 +599,5 @@ void __init sme_enable(struct boot_params *bp)
 out:
        RIP_REL_REF(sme_me_mask) = me_mask;
        physical_mask &= ~me_mask;
-       cc_vendor = CC_VENDOR_AMD;
        cc_set_mask(me_mask);
 }

Btw, pls do your patches ontop of tip/master as other patches in there
are touching this file.

> @@ -608,7 +618,6 @@ void __init sme_enable(struct boot_params *bp)
>  out:
>  	if (sme_me_mask) {
>  		physical_mask &= ~sme_me_mask;
> -		cc_vendor = CC_VENDOR_AMD;
>  		cc_set_mask(sme_me_mask);
>  	}
>  }
> diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
> index cb0d6cd1c12f..2f7273596102 100644
> --- a/include/linux/cc_platform.h
> +++ b/include/linux/cc_platform.h
> @@ -42,6 +42,21 @@ enum cc_attr {
>  	 */
>  	CC_ATTR_HOST_MEM_ENCRYPT,
>  

This goes to the end of the enum.

> +	/**
> +	 * @CC_ATTR_HOST_MEM_INCOHERENT: Host memory encryption can be
> +	 * incoherent

"...can leave caches in an incoherent state."

> +	 *
> +	 * The platform/OS is running as a bare-metal system or a hypervisor.
> +	 * The memory encryption engine might have left non-cache-coherent
> +	 * data in the caches that needs to be flushed.
> +	 *
> +	 * Use this in places where the cache coherency of the memory matters
> +	 * but the encryption status does not.
> +	 *
> +	 * Includes all systems that set CC_ATTR_HOST_MEM_ENCRYPT.

If that is the case, why do you even need a new CC_ATTR define?

Might as well do:

	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
		native_wbinvd();

?

> +	 */
> +	CC_ATTR_HOST_MEM_INCOHERENT,
> +
  
Tom Lendacky Feb. 19, 2024, 7:45 p.m. UTC | #2
On 2/19/24 10:16, Borislav Petkov wrote:
> On Wed, Jan 31, 2024 at 11:31:53AM +0000, Huang, Kai wrote:
>> From: Kai Huang <kai.huang@intel.com>
>>
>> Currently on AMD SME platforms, during kexec() caches are flushed
>> manually before jumping to the new kernel due to memory encryption.
>> Intel TDX needs to flush cachelines of TDX private memory before
>> jumping to the second kernel too, otherwise they may silently corrupt
>> the new kernel.
>>
>> Instead of sprinkling both AMD and Intel's specific checks around,
>> introduce a new CC_ATTR_HOST_MEM_INCOHERENT attribute to unify both
>> Intel and AMD, and simplify the logic:
>>
>> 	Could the old kernel leave incoherent caches around?
> 
> 	"Is it possible that the kernel could leave caches in incoherent state?"
> 
>> 	If so, do WBINVD.
>>
>> Convert the AMD SME to use this new CC attribute.
> 
>> A later patch will
>> utilize this new attribute for Intel TDX too.
> 
> Yeah, once those are all in git, the concept of "subsequent patch"
> becomes ambiguous depending on how you're sorting them. So try to read
> that commit message out of the context of all those "subsequent patches"
> and see if it still makes sense.
> 
> IOW, you should strive for your commit messages to make sense on their
> own, without referencing other patches.
> 
> In this particular case, that "later patch" can go.
> 
>> Specifically, AMD SME flushes caches at two places: 1) stop_this_cpu();
>> 2) relocate_kernel().  stop_this_cpu() checks the CPUID directly to do
>> WBINVD for the reason that the current kernel's SME enabling status may
>> not match the new kernel's choice.  However the relocate_kernel() only
>> does the WBINVD when the current kernel has enabled SME for the reason
>> that the new kernel is always placed in an "unencrypted" area.
>>
>> To simplify the logic, for AMD SME change to always use the way that is
>> done in stop_this_cpu().  This will cause an additional WBINVD in
>> relocate_kernel() when the current kernel hasn't enabled SME (e.g.,
>> disabled by kernel command line), but this is acceptable for the sake of
>> having less complicated code (see [1] for the relevant discussion).
>>
>> Note currently the kernel only advertises CC vendor for AMD SME when SME
>> is actually enabled by the kernel.  To always advertise the new
>> CC_ATTR_HOST_MEM_INCOHERENT regardless of the kernel's SME enabling
>> status, change to set CC vendor as long as the hardware has enabled SME.
>>
>> Note "advertising CC_ATTR_HOST_MEM_INCOHERENT when the hardware has
>> enabled SME" is still different from "checking the CPUID" (the way that
>> is done in stop_this_cpu()), but technically the former also serves the
>> purpose and is actually more accurate.
>>
>> Such change allows sme_me_mask to be 0 while CC vendor reports as AMD.
>> But this doesn't impact other CC attributes on AMD platforms, nor does
>> it impact the cc_mkdec()/cc_mkenc().
>>
>> [1] https://lore.kernel.org/lkml/cbc9c527-17e5-4a63-80fe-85451394cc7c@amd.com/
>>
>> Suggested-by: Dave Hansen <dave.hansen@intel.com>
>> Signed-off-by: Kai Huang <kai.huang@intel.com>
>> ---
>>   arch/x86/coco/core.c               | 13 +++++++++++++
>>   arch/x86/kernel/machine_kexec_64.c |  2 +-
>>   arch/x86/kernel/process.c          | 14 +++-----------
>>   arch/x86/mm/mem_encrypt_identity.c | 11 ++++++++++-
>>   include/linux/cc_platform.h        | 15 +++++++++++++++
>>   5 files changed, 42 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
>> index eeec9986570e..8d6d727e6e18 100644
>> --- a/arch/x86/coco/core.c
>> +++ b/arch/x86/coco/core.c
>> @@ -72,6 +72,19 @@ static bool noinstr amd_cc_platform_has(enum cc_attr attr)
>>   	case CC_ATTR_HOST_MEM_ENCRYPT:
>>   		return sme_me_mask && !(sev_status & MSR_AMD64_SEV_ENABLED);
>>   
>> +	case CC_ATTR_HOST_MEM_INCOHERENT:
>> +		/*
>> +		 * CC_ATTR_HOST_MEM_INCOHERENT represents whether SME has
>> +		 * enabled on the platform regardless whether the kernel
>> +		 * has actually enabled the SME.
>> +
> 
> "represents whether SME has [been] enabled ... regardless whether the
> kernel has enabled SME"?!?
> 
> I think this needs to be:
> 
> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> index d07be9d05cd0..e3653465e532 100644
> --- a/arch/x86/coco/core.c
> +++ b/arch/x86/coco/core.c
> @@ -67,6 +67,13 @@ static bool noinstr amd_cc_platform_has(enum cc_attr attr)
>   
>          switch (attr) {
>          case CC_ATTR_MEM_ENCRYPT:
> +
> +               /*
> +                * CC_ATTR_HOST_MEM_INCOHERENT represents whether SME has
> +                * enabled on the platform regardless whether the kernel
> +                * has actually enabled the SME.
> +                */
> +       case CC_ATTR_HOST_MEM_INCOHERENT:

This change won't return the correct answer. The check needs to remain 
against the sev_status value.

>                  return sme_me_mask;
>   
>          case CC_ATTR_HOST_MEM_ENCRYPT:
> 
> 
>> +		return !(sev_status & MSR_AMD64_SEV_ENABLED);
>> +
>> +	/*
>> +	 * For all CC_ATTR_GUEST_* there's no need to check sme_me_mask
>> +	 * as it must be true when there's any SEV enable bit set in
>> +	 * sev_status.
>> +	 */
> 
> Superfluous comment.
> 
>>   	case CC_ATTR_GUEST_MEM_ENCRYPT:
>>   		return sev_status & MSR_AMD64_SEV_ENABLED;
>>   
>> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
>> index bc0a5348b4a6..c9c6974e2e9c 100644
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -358,7 +358,7 @@ void machine_kexec(struct kimage *image)
>>   				       (unsigned long)page_list,
>>   				       image->start,
>>   				       image->preserve_context,
>> -				       cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
>> +				       cc_platform_has(CC_ATTR_HOST_MEM_INCOHERENT));
>>   
>>   #ifdef CONFIG_KEXEC_JUMP
>>   	if (image->preserve_context)
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index ab49ade31b0d..2c7e8d9889c0 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -813,18 +813,10 @@ void __noreturn stop_this_cpu(void *dummy)
>>   	mcheck_cpu_clear(c);
>>   
>>   	/*
>> -	 * Use wbinvd on processors that support SME. This provides support
>> -	 * for performing a successful kexec when going from SME inactive
>> -	 * to SME active (or vice-versa). The cache must be cleared so that
>> -	 * if there are entries with the same physical address, both with and
>> -	 * without the encryption bit, they don't race each other when flushed
>> -	 * and potentially end up with the wrong entry being committed to
>> -	 * memory.
>> -	 *
>> -	 * Test the CPUID bit directly because the machine might've cleared
>> -	 * X86_FEATURE_SME due to cmdline options.
>> +	 * Use wbinvd on processors that the first kernel *could*
>> +	 * potentially leave incoherent cachelines.
> 
> No need for that comment anymore - people can grep for
> CC_ATTR_HOST_MEM_INCOHERENT's definition simply.
> 
>>   	 */
>> -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
>> +	if (cc_platform_has(CC_ATTR_HOST_MEM_INCOHERENT))
>>   		native_wbinvd();
>>   
>>   	/*
>> diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
>> index 7f72472a34d6..87e4fddab770 100644
>> --- a/arch/x86/mm/mem_encrypt_identity.c
>> +++ b/arch/x86/mm/mem_encrypt_identity.c
>> @@ -570,9 +570,19 @@ void __init sme_enable(struct boot_params *bp)
>>   		msr = __rdmsr(MSR_AMD64_SYSCFG);
>>   		if (!(msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT))
>>   			return;
>> +
>> +		/*
>> +		 * Always set CC vendor when the platform has SME enabled
>> +		 * regardless whether the kernel will actually activates the
>> +		 * SME or not.  This reports the CC_ATTR_HOST_MEM_INCOHERENT
>> +		 * being true as long as the platform has SME enabled so that
>> +		 * stop_this_cpu() can do necessary WBINVD during kexec().
>> +		 */
>> +		cc_vendor = CC_VENDOR_AMD;
>>   	} else {
>>   		/* SEV state cannot be controlled by a command line option */
>>   		sme_me_mask = me_mask;
>> +		cc_vendor = CC_VENDOR_AMD;
>>   		goto out;
>>   	}
>>   
> 
> So you can't put it before the if - just slap it in both branches. Geez!

I think that will still work because sme_me_mask and sev_status will both 
be 0 on bare-metal if 'msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT' doesn't 
evaluate to true. However, that will cause any platform that hasn't 
enabled memory encryption (see SYS_CFG MSR) to also perform the WBINVD.

The idea looks like it was to keep existing behavior where cc_vendor 
wasn't set if 'msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT' is false.

> 
> diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
> index 0166ab1780cc..1e4566cc233f 100644
> --- a/arch/x86/mm/mem_encrypt_identity.c
> +++ b/arch/x86/mm/mem_encrypt_identity.c
> @@ -549,6 +549,8 @@ void __init sme_enable(struct boot_params *bp)
>          if (snp && !(msr & MSR_AMD64_SEV_SNP_ENABLED))
>                  snp_abort();
>   
> +       cc_vendor = CC_VENDOR_AMD;
> +
>          /* Check if memory encryption is enabled */
>          if (feature_mask == AMD_SME_BIT) {
>                  /*
> @@ -597,6 +599,5 @@ void __init sme_enable(struct boot_params *bp)
>   out:
>          RIP_REL_REF(sme_me_mask) = me_mask;
>          physical_mask &= ~me_mask;
> -       cc_vendor = CC_VENDOR_AMD;
>          cc_set_mask(me_mask);
>   }
> 
> Btw, pls do your patches ontop of tip/master as other patches in there
> are touching this file.
> 
>> @@ -608,7 +618,6 @@ void __init sme_enable(struct boot_params *bp)
>>   out:
>>   	if (sme_me_mask) {
>>   		physical_mask &= ~sme_me_mask;
>> -		cc_vendor = CC_VENDOR_AMD;
>>   		cc_set_mask(sme_me_mask);
>>   	}
>>   }
>> diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
>> index cb0d6cd1c12f..2f7273596102 100644
>> --- a/include/linux/cc_platform.h
>> +++ b/include/linux/cc_platform.h
>> @@ -42,6 +42,21 @@ enum cc_attr {
>>   	 */
>>   	CC_ATTR_HOST_MEM_ENCRYPT,
>>   
> 
> This goes to the end of the enum.
> 
>> +	/**
>> +	 * @CC_ATTR_HOST_MEM_INCOHERENT: Host memory encryption can be
>> +	 * incoherent
> 
> "...can leave caches in an incoherent state."
> 
>> +	 *
>> +	 * The platform/OS is running as a bare-metal system or a hypervisor.
>> +	 * The memory encryption engine might have left non-cache-coherent
>> +	 * data in the caches that needs to be flushed.
>> +	 *
>> +	 * Use this in places where the cache coherency of the memory matters
>> +	 * but the encryption status does not.
>> +	 *
>> +	 * Includes all systems that set CC_ATTR_HOST_MEM_ENCRYPT.
> 
> If that is the case, why do you even need a new CC_ATTR define?
> 
> Might as well do:
> 
> 	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> 		native_wbinvd();

That won't work, because the current system may not have SME active. The 
cases that need to be caught are kexec'ing from a mem_encrypt=off to a 
mem_encrypt=on or from a mem_encrypt=on to a mem_encrypt=off.

For the first case, you need to determine if the system is SME capable, 
because cc_platform_has(CC_ATTR_MEM_ENCRYPT) will return false if SME is 
not active.
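
To spell out the transitions that matter (a sketch):

	first kernel          second kernel        WBINVD needed?
	mem_encrypt=off   ->  mem_encrypt=on       yes
	mem_encrypt=on    ->  mem_encrypt=off      yes
	SME not capable   ->  anything             no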

Thanks,
Tom

> 
> ?
> 
>> +	 */
>> +	CC_ATTR_HOST_MEM_INCOHERENT,
>> +
> 
>
  
Borislav Petkov Feb. 19, 2024, 8:32 p.m. UTC | #3
On Mon, Feb 19, 2024 at 01:45:37PM -0600, Tom Lendacky wrote:
> This change won't return the correct answer. The check needs to remain
> against the sev_status value.

Feel free to explain because this patch is confusing me.

> > So you can't put it before the if - just slap it in both branches. Geez!
> 
> I think that will still work because sme_me_mask and sev_status will both be
> 0 on bare-metal if 'msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT' doesn't evaluate to
> true. However, that will cause any platform that hasn't enabled memory
> encryption (see SYS_CFG MSR) to also perform the WBINVD.

If it keeps the code simpler I don't mind. That's so not a fast path.

> That won't work, because the current system may not have SME active. The
> cases that needs to be caught are kexec'ing from a mem_encrypt=off to a
> mem_encrypt=on or from a mem_encrypt=on to a mem_encrypt=off.

And I'm saying, we should keep it simple and simply WBINVD on SME
capable machines, regardless of the encryption setting.

Any downsides to that which are actually real and noticeable?

Thx.
  
Tom Lendacky Feb. 19, 2024, 10:09 p.m. UTC | #4
On 2/19/24 14:32, Borislav Petkov wrote:
> On Mon, Feb 19, 2024 at 01:45:37PM -0600, Tom Lendacky wrote:
>> This change won't return the correct answer. The check needs to remain
>> against the sev_status value.
> 
> Feel free to explain because this patch is confusing me.

In your previous email, you want to put the CC_ATTR_HOST_MEM_INCOHERENT 
case statement with the CC_ATTR_MEM_ENCRYPT case which is returning 
sme_me_mask. That will be zero/false if SME is not active, skipping the 
WBINVD. But, in reality you still need to perform the WBINVD in case the 
kexec target is doing mem_encrypt=on.

That's why the '!(sev_status & MSR_AMD64_SEV_ENABLED)' works here. 
Basically, if you are bare-metal, it will return true. And it will only 
return true for machines that support SME and have the 
MSR_AMD64_SYSCFG_MEM_ENCRYPT bit set in SYS_CFG MSR because of where the 
'cc_vendor = CC_VENDOR_AMD' assignment is. However, if you move the 
'cc_vendor = CC_VENDOR_AMD' to before the if statement, then you will have 
the WBINVD called for any machine that supports SME, even if SME is not 
possible because the proper bit in the SYS_CFG MSR hasn't been set.
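
In other words, the effective logic with the assignment placed as in the
original patch is (a sketch):

	case CC_ATTR_HOST_MEM_INCOHERENT:
		/*
		 * Only reached when cc_vendor == CC_VENDOR_AMD, i.e. when
		 * MSR_AMD64_SYSCFG_MEM_ENCRYPT was found set, so this is
		 * true on any bare-metal (non-SEV) system where SME is
		 * possible, regardless of sme_me_mask.
		 */
		return !(sev_status & MSR_AMD64_SEV_ENABLED);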

I know what I'm trying to say, let me know if it is making sense...

> 
>>> So you can't put it before the if - just slap it in both branches. Geez!
>>
>> I think that will still work because sme_me_mask and sev_status will both be
>> 0 on bare-metal if 'msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT' doesn't evaluate to
>> true. However, that will cause any platform that hasn't enabled memory
> > encryption (see SYS_CFG MSR) to also perform the WBINVD.
> 
> If it keeps the code simpler I don't mind. That's so not a fast path.
> 
>> That won't work, because the current system may not have SME active. The
> > > cases that need to be caught are kexec'ing from a mem_encrypt=off to a
>> mem_encrypt=on or from a mem_encrypt=on to a mem_encrypt=off.
> 
> And I'm saying, we should keep it simple and simply WBINVD on SME
> capable machines, regardless of the encryption setting.

In that case, CC_ATTR_HOST_MEM_INCOHERENT needs to be separate from 
CC_ATTR_MEM_ENCRYPT as the original patch has it. The comment might make 
more sense as:

	 * CC_ATTR_HOST_MEM_INCOHERENT represents whether SME is possible
	 * on the platform, regardless of whether mem_encrypt=on has been
	 * used to make SME active.

Thanks,
Tom

> 
> Any downsides to that which are actually real and noticeable?
> 
> Thx.
>
  
Kai Huang Feb. 20, 2024, 2:57 a.m. UTC | #5
On Mon, 2024-02-19 at 16:09 -0600, Tom Lendacky wrote:
> On 2/19/24 14:32, Borislav Petkov wrote:
> > On Mon, Feb 19, 2024 at 01:45:37PM -0600, Tom Lendacky wrote:
> > > This change won't return the correct answer. The check needs to remain
> > > against the sev_status value.
> > 
> > Feel free to explain because this patch is confusing me.
> 
> In your previous email, you want to put the CC_ATTR_HOST_MEM_INCOHERENT 
> case statement with the CC_ATTR_MEM_ENCRYPT case which is returning 
> sme_me_mask. That will be zero/false if SME is not active, skipping the 
> WBINVD. But, in reality you still need to perform the WBINVD in case the 
> kexec target is doing mem_encrypt=on.
> 
> That's why the '!(sev_status & MSR_AMD64_SEV_ENABLED)' works here. 
> Basically, if you are bare-metal, it will return true. And it will only 
> return true for machines that support SME and have the 
> MSR_AMD64_SYSCFG_MEM_ENCRYPT bit set in SYS_CFG MSR because of where the 
> 'cc_vendor = CC_VENDOR_AMD' assignment is. 
> 

[...]

> However, if you move the 
> 'cc_vendor = CC_VENDOR_AMD' to before the if statement, then you will have 
> the WBINVD called for any machine that supports SME, even if SME is not 
> possible because the proper bit in the SYS_CFG MSR hasn't been set.

Hi Tom,

Thanks for clarifying.  However it seems to me that this is the behaviour in the
current upstream code.  The stop_this_cpu() checks CPUID directly w/o checking
the SYS_CFG MSR:

	if (c->extended_cpuid_level >= 0x8000001f &&
	    (cpuid_eax(0x8000001f) & BIT(0)))
		native_wbinvd();

I believe the BIT(0) in CPUID, which is "Secure Memory Encryption Support", will
always be true regardless of whether the MSR_AMD64_SYSCFG_MEM_ENCRYPT is set in
SYS_CFG MSR?
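
I.e., the two gates differ roughly like this (pseudo-C sketch, variable
names made up):

	/* upstream stop_this_cpu(): raw CPUID capability */
	bool sme_capable = c->extended_cpuid_level >= 0x8000001f &&
			   (cpuid_eax(0x8000001f) & BIT(0));

	/* this patch: additionally requires the SYSCFG opt-in */
	bool sme_possible = sme_capable &&
			    (__rdmsr(MSR_AMD64_SYSCFG) &
			     MSR_AMD64_SYSCFG_MEM_ENCRYPT);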

If so, IIUC moving the 'cc_vendor = CC_VENDOR_AMD' to the place right before the
if statement as suggested by Boris seems to just follow the current behaviour in
the upstream code.

Of course we need to always return true for CC_ATTR_HOST_MEM_INCOHERENT rather
than querying sme_me_mask.

> 
> I know what I'm trying to say, let me know if it is making sense...
> 
> > 
> > > > So you can't put it before the if - just slap it in both branches. Geez!
> > > 
> > > I think that will still work because sme_me_mask and sev_status will both be
> > > 0 on bare-metal if 'msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT' doesn't evaluate to
> > > true. However, that will cause any platform that hasn't enabled memory
> > > encryption (see SYS_CFG MSR) to also perform the WBINVD.
> > 
> > If it keeps the code simpler I don't mind. That's so not a fast path.
> > 
> > > That won't work, because the current system may not have SME active. The
> > > cases that need to be caught are kexec'ing from a mem_encrypt=off to a
> > > mem_encrypt=on or from a mem_encrypt=on to a mem_encrypt=off.
> > 
> > And I'm saying, we should keep it simple and simply WBINVD on SME
> > capable machines, regardless of the encryption setting.
> 
> In that case, CC_ATTR_HOST_MEM_INCOHERENT needs to be separate from 
> CC_ATTR_MEM_ENCRYPT as the original patch has it. The comment might make 
> more sense as:
> 
> 	 * CC_ATTR_HOST_MEM_INCOHERENT represents whether SME is possible
> 	 * on the platform, regardless of whether mem_encrypt=on has been
> 	 * used to make SME active.
> 
> Thanks,
> Tom

This looks good to me.  Thanks!
  
Kai Huang Feb. 20, 2024, 3:12 a.m. UTC | #6
On Mon, 2024-02-19 at 17:16 +0100, Borislav Petkov wrote:
> On Wed, Jan 31, 2024 at 11:31:53AM +0000, Huang, Kai wrote:
> > From: Kai Huang <kai.huang@intel.com>
> > 
> > Currently on AMD SME platforms, during kexec() caches are flushed
> > manually before jumping to the new kernel due to memory encryption.
> > Intel TDX needs to flush cachelines of TDX private memory before
> > jumping to the second kernel too, otherwise they may silently corrupt
> > the new kernel.
> > 
> > Instead of sprinkling both AMD and Intel's specific checks around,
> > introduce a new CC_ATTR_HOST_MEM_INCOHERENT attribute to unify both
> > Intel and AMD, and simplify the logic:
> > 
> > 	Could the old kernel leave incoherent caches around?
> 
> 	"Is it possible that the kernel could leave caches in incoherent state?"

Will do.  Thanks.

> 
> > 	If so, do WBINVD.
> > 
> > Convert the AMD SME to use this new CC attribute.
> 
> > A later patch will
> > utilize this new attribute for Intel TDX too.
> 
> Yeah, once those are all in git, the concept of "subsequent patch"
> becomes ambiguous depending on how you're sorting them. So try to read
> that commit message out of the context of all those "subsequent patches"
> and see if it still makes sense.

Right.  Makes sense.

> 
> IOW, you should strive for your commit messages to make sense on their
> own, without referencing other patches.
> 
> In this particular case, that "later patch" can go.

Will remove the "later patch" sentence.  Thanks.

> 
> > Specifically, AMD SME flushes caches at two places: 1) stop_this_cpu();
> > 2) relocate_kernel().  stop_this_cpu() checks the CPUID directly to do
> > WBINVD for the reason that the current kernel's SME enabling status may
> > not match the new kernel's choice.  However the relocate_kernel() only
> > does the WBINVD when the current kernel has enabled SME for the reason
> > that the new kernel is always placed in an "unencrypted" area.
> > 
> > To simplify the logic, for AMD SME change to always use the way that is
> > done in stop_this_cpu().  This will cause an additional WBINVD in
> > relocate_kernel() when the current kernel hasn't enabled SME (e.g.,
> > disabled by kernel command line), but this is acceptable for the sake of
> > having less complicated code (see [1] for the relevant discussion).
> > 
> > Note currently the kernel only advertises CC vendor for AMD SME when SME
> > is actually enabled by the kernel.  To always advertise the new
> > CC_ATTR_HOST_MEM_INCOHERENT regardless of the kernel's SME enabling
> > status, change to set CC vendor as long as the hardware has enabled SME.
> > 
> > Note "advertising CC_ATTR_HOST_MEM_INCOHERENT when the hardware has
> > enabled SME" is still different from "checking the CPUID" (the way that
> > is done in stop_this_cpu()), but technically the former also serves the
> > purpose and is actually more accurate.
> > 
> > Such change allows sme_me_mask to be 0 while CC vendor reports as AMD.
> > But this doesn't impact other CC attributes on AMD platforms, nor does
> > it impact the cc_mkdec()/cc_mkenc().
> > 
> > [1] https://lore.kernel.org/lkml/cbc9c527-17e5-4a63-80fe-85451394cc7c@amd.com/
> > 
> > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > ---
> >  arch/x86/coco/core.c               | 13 +++++++++++++
> >  arch/x86/kernel/machine_kexec_64.c |  2 +-
> >  arch/x86/kernel/process.c          | 14 +++-----------
> >  arch/x86/mm/mem_encrypt_identity.c | 11 ++++++++++-
> >  include/linux/cc_platform.h        | 15 +++++++++++++++
> >  5 files changed, 42 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> > index eeec9986570e..8d6d727e6e18 100644
> > --- a/arch/x86/coco/core.c
> > +++ b/arch/x86/coco/core.c
> > @@ -72,6 +72,19 @@ static bool noinstr amd_cc_platform_has(enum cc_attr attr)
> >  	case CC_ATTR_HOST_MEM_ENCRYPT:
> >  		return sme_me_mask && !(sev_status & MSR_AMD64_SEV_ENABLED);
> >  
> > +	case CC_ATTR_HOST_MEM_INCOHERENT:
> > +		/*
> > +		 * CC_ATTR_HOST_MEM_INCOHERENT represents whether SME has
> > +		 * enabled on the platform regardless whether the kernel
> > +		 * has actually enabled the SME.
> > +
> 
> "represents whether SME has [been] enabled ... regardless whether the
> kernel has enabled SME"?!?
> 
> I think this needs to be:
> 
> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> index d07be9d05cd0..e3653465e532 100644
> --- a/arch/x86/coco/core.c
> +++ b/arch/x86/coco/core.c
> @@ -67,6 +67,13 @@ static bool noinstr amd_cc_platform_has(enum cc_attr attr)
>  
>         switch (attr) {
>         case CC_ATTR_MEM_ENCRYPT:
> +
> +               /*
> +                * CC_ATTR_HOST_MEM_INCOHERENT represents whether SME has
> +                * enabled on the platform regardless whether the kernel
> +                * has actually enabled the SME.
> +                */
> +       case CC_ATTR_HOST_MEM_INCOHERENT:
>                 return sme_me_mask;

As Tom pointed out, this will return false when the kernel doesn't activate SME
due to the command line, and the WBINVD won't be done.

>  
>         case CC_ATTR_HOST_MEM_ENCRYPT:
> 
> 
> > +		return !(sev_status & MSR_AMD64_SEV_ENABLED);
> > +
> > +	/*
> > +	 * For all CC_ATTR_GUEST_* there's no need to check sme_me_mask
> > +	 * as it must be true when there's any SEV enable bit set in
> > +	 * sev_status.
> > +	 */
> 
> Superfluous comment.

Will remove.

> 
> >  	case CC_ATTR_GUEST_MEM_ENCRYPT:
> >  		return sev_status & MSR_AMD64_SEV_ENABLED;
> >  
> > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > index bc0a5348b4a6..c9c6974e2e9c 100644
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -358,7 +358,7 @@ void machine_kexec(struct kimage *image)
> >  				       (unsigned long)page_list,
> >  				       image->start,
> >  				       image->preserve_context,
> > -				       cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
> > +				       cc_platform_has(CC_ATTR_HOST_MEM_INCOHERENT));
> >  
> >  #ifdef CONFIG_KEXEC_JUMP
> >  	if (image->preserve_context)
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index ab49ade31b0d..2c7e8d9889c0 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -813,18 +813,10 @@ void __noreturn stop_this_cpu(void *dummy)
> >  	mcheck_cpu_clear(c);
> >  
> >  	/*
> > -	 * Use wbinvd on processors that support SME. This provides support
> > -	 * for performing a successful kexec when going from SME inactive
> > -	 * to SME active (or vice-versa). The cache must be cleared so that
> > -	 * if there are entries with the same physical address, both with and
> > -	 * without the encryption bit, they don't race each other when flushed
> > -	 * and potentially end up with the wrong entry being committed to
> > -	 * memory.
> > -	 *
> > -	 * Test the CPUID bit directly because the machine might've cleared
> > -	 * X86_FEATURE_SME due to cmdline options.
> > +	 * Use wbinvd on processors that the first kernel *could*
> > +	 * potentially leave incoherent cachelines.
> 
> No need for that comment anymore - people can grep for
> CC_ATTR_HOST_MEM_INCOHERENT's definition simply.

Will remove.  Thanks.

> 
> >  	 */
> > -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> > +	if (cc_platform_has(CC_ATTR_HOST_MEM_INCOHERENT))
> >  		native_wbinvd();
> >  
> >  	/*
> > diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
> > index 7f72472a34d6..87e4fddab770 100644
> > --- a/arch/x86/mm/mem_encrypt_identity.c
> > +++ b/arch/x86/mm/mem_encrypt_identity.c
> > @@ -570,9 +570,19 @@ void __init sme_enable(struct boot_params *bp)
> >  		msr = __rdmsr(MSR_AMD64_SYSCFG);
> >  		if (!(msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT))
> >  			return;
> > +
> > +		/*
> > +		 * Always set CC vendor when the platform has SME enabled
> > +		 * regardless whether the kernel will actually activates the
> > +		 * SME or not.  This reports the CC_ATTR_HOST_MEM_INCOHERENT
> > +		 * being true as long as the platform has SME enabled so that
> > +		 * stop_this_cpu() can do necessary WBINVD during kexec().
> > +		 */
> > +		cc_vendor = CC_VENDOR_AMD;
> >  	} else {
> >  		/* SEV state cannot be controlled by a command line option */
> >  		sme_me_mask = me_mask;
> > +		cc_vendor = CC_VENDOR_AMD;
> >  		goto out;
> >  	}
> >  
> 
> So you can't put it before the if - just slap it in both branches. Geez!

I think we can.  Please see my reply to Tom's email.

> 
> diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
> index 0166ab1780cc..1e4566cc233f 100644
> --- a/arch/x86/mm/mem_encrypt_identity.c
> +++ b/arch/x86/mm/mem_encrypt_identity.c
> @@ -549,6 +549,8 @@ void __init sme_enable(struct boot_params *bp)
>         if (snp && !(msr & MSR_AMD64_SEV_SNP_ENABLED))
>                 snp_abort();
>  
> +       cc_vendor = CC_VENDOR_AMD;
> +
>         /* Check if memory encryption is enabled */
>         if (feature_mask == AMD_SME_BIT) {
>                 /*
> @@ -597,6 +599,5 @@ void __init sme_enable(struct boot_params *bp)
>  out:
>         RIP_REL_REF(sme_me_mask) = me_mask;
>         physical_mask &= ~me_mask;
> -       cc_vendor = CC_VENDOR_AMD;
>         cc_set_mask(me_mask);
>  }
> 
> Btw, pls do your patches ontop of tip/master as other patches in there
> are touching this file.

Yeah will do.  I believe this series was generated based on tip/master but this
was 3 weeks ago.

> 
> > @@ -608,7 +618,6 @@ void __init sme_enable(struct boot_params *bp)
> >  out:
> >  	if (sme_me_mask) {
> >  		physical_mask &= ~sme_me_mask;
> > -		cc_vendor = CC_VENDOR_AMD;
> >  		cc_set_mask(sme_me_mask);
> >  	}
> >  }
> > diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
> > index cb0d6cd1c12f..2f7273596102 100644
> > --- a/include/linux/cc_platform.h
> > +++ b/include/linux/cc_platform.h
> > @@ -42,6 +42,21 @@ enum cc_attr {
> >  	 */
> >  	CC_ATTR_HOST_MEM_ENCRYPT,
> >  
> 
> This goes to the end of the enum.
> 
> > +	/**
> > +	 * @CC_ATTR_HOST_MEM_INCOHERENT: Host memory encryption can be
> > +	 * incoherent
> 
> "...can leave caches in an incoherent state."

Will do.  And I'll move this CC_ATTR_HOST_MEM_INCOHERENT to the end of the enum
if I understand you correctly.

> 
> > +	 *
> > +	 * The platform/OS is running as a bare-metal system or a hypervisor.
> > +	 * The memory encryption engine might have left non-cache-coherent
> > +	 * data in the caches that needs to be flushed.
> > +	 *
> > +	 * Use this in places where the cache coherency of the memory matters
> > +	 * but the encryption status does not.
> > +	 *
> > +	 * Includes all systems that set CC_ATTR_HOST_MEM_ENCRYPT.
> 
> If that is the case, why do you even need a new CC_ATTR define?
> 
> Might as well do:
> 
> 	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> 		native_wbinvd();
> 
> ?

As Tom pointed out, using CC_ATTR_MEM_ENCRYPT will miss the WBINVD when the
kernel disables SME on the command line.

> 
> > +	 */
> > +	CC_ATTR_HOST_MEM_INCOHERENT,
> > +
> 
>
  
Borislav Petkov Feb. 20, 2024, 2:28 p.m. UTC | #7
On Mon, Feb 19, 2024 at 04:09:47PM -0600, Tom Lendacky wrote:
> That's why the '!(sev_status & MSR_AMD64_SEV_ENABLED)' works here.

I would've never figured that out just from staring at the test. :-\

> Basically, if you are bare-metal, it will return true. And it will only
> return true for machines that support SME and have the
> MSR_AMD64_SYSCFG_MEM_ENCRYPT bit set in SYS_CFG MSR because of where the
> 'cc_vendor = CC_VENDOR_AMD' assignment is. However, if you move the
> 'cc_vendor = CC_VENDOR_AMD' to before the if statement, then you will have
> the WBINVD called for any machine that supports SME, even if SME is not
> possible because the proper bit in the SYS_CFG MSR hasn't been set.
> 
> I know what I'm trying to say, let me know if it is making sense...

Yah, thanks for taking the time to explain.

Here's an even more radical idea:

Why not do WBINVD *unconditionally* on the CPU down path?

- it is the opposite of a fast path, i.e., no one cares

- it'll take care of every possible configuration without ugly logic

- it wouldn't hurt to have the caches nice and coherent before going
  down

Hmmm.
  
Tom Lendacky Feb. 20, 2024, 2:40 p.m. UTC | #8
On 2/19/24 20:57, Huang, Kai wrote:
> On Mon, 2024-02-19 at 16:09 -0600, Tom Lendacky wrote:
>> On 2/19/24 14:32, Borislav Petkov wrote:
>>> On Mon, Feb 19, 2024 at 01:45:37PM -0600, Tom Lendacky wrote:
>>>> This change won't return the correct answer. The check needs to remain
>>>> against the sev_status value.
>>>
>>> Feel free to explain because this patch is confusing me.
>>
>> In your previous email, you want to put the CC_ATTR_HOST_MEM_INCOHERENT
>> case statement with the CC_ATTR_MEM_ENCRYPT case which is returning
>> sme_me_mask. That will be zero/false if SME is not active, skipping the
>> WBINVD. But, in reality you still need to perform the WBINVD in case the
>> kexec target is doing mem_encrypt=on.
>>
>> That's why the '!(sev_status & MSR_AMD64_SEV_ENABLED)' works here.
>> Basically, if you are bare-metal, it will return true. And it will only
>> return true for machines that support SME and have the
>> MSR_AMD64_SYSCFG_MEM_ENCRYPT bit set in SYS_CFG MSR because of where the
>> 'cc_vendor = CC_VENDOR_AMD' assignment is.
>>
> 
> [...]
> 
>> However, if you move the
>> 'cc_vendor = CC_VENDOR_AMD' to before the if statement, then you will have
>> the WBINVD called for any machine that supports SME, even if SME is not
>> possible because the proper bit in the SYS_CFG MSR hasn't been set.
> 
> Hi Tom,
> 
> Thanks for clarifying.  However it seems to me that this is the behaviour in the
> current upstream code.  The stop_this_cpu() checks CPUID directly w/o checking
> the SYS_CFG MSR:

Correct, it will match the upstream behavior this way. It would have been 
improved slightly with your original patch by avoiding the WBINVD if the 
MSR_AMD64_SYSCFG_MEM_ENCRYPT wasn't set.

> 
> 	if (c->extended_cpuid_level >= 0x8000001f &&
> 	    (cpuid_eax(0x8000001f) & BIT(0)))
> 		native_wbinvd();
> 
> I believe the BIT(0) in CPUID, which is "Secure Memory Encryption Support", will
> always be true regardless of whether the MSR_AMD64_SYSCFG_MEM_ENCRYPT is set in
> SYS_CFG MSR?
> 
> If so, IIUC moving the 'cc_vendor = CC_VENDOR_AMD' to the place right before the
> if statement as suggested by Boris seems just follows the current behaviour in
> the upstream code.

Yep, that's how I see it, too.

Thanks,
Tom

> 
> Of course we need to always return true for CC_ATTR_HOST_MEM_INCOHERENT but not
> querying sme_me_mask.
> 
>>
>> I know what I'm trying to say, let me know if it is making sense...
>>
>>>
>>>>> So you can't put it before the if - just slap it in both branches. Geez!
>>>>
>>>> I think that will still work because sme_me_mask and sev_status will both be
>>>> 0 on bare-metal if 'msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT' doesn't evaluate to
>>>> true. However, that will cause any platform that hasn't enabled memory
>>>> encryption (see SYS_CFG MSR) to also perform the WBINVD.
>>>
>>> If it keeps the code simpler I don't mind. That's so not a fast path.
>>>
>>>> That won't work, because the current system may not have SME active. The
>>>> cases that need to be caught are kexec'ing from a mem_encrypt=off to a
>>>> mem_encrypt=on or from a mem_encrypt=on to a mem_encrypt=off.
>>>
>>> And I'm saying, we should keep it simple and simply WBINVD on SME
>>> capable machines, regardless of the encryption setting.
>>
>> In that case, CC_ATTR_HOST_MEM_INCOHERENT needs to be separate from
>> CC_ATTR_MEM_ENCRYPT as the original patch has it. The comment might make
>> more sense as:
>>
>> 	 * CC_ATTR_HOST_MEM_INCOHERENT represents whether SME is possible
>> 	 * on the platform, regardless of whether mem_encrypt=on has been
>> 	 * used to make SME active.
>>
>> Thanks,
>> Tom
> 
> This looks good to me.  Thanks!
> 
>
  
Tom Lendacky Feb. 20, 2024, 2:47 p.m. UTC | #9
On 2/20/24 08:28, Borislav Petkov wrote:
> On Mon, Feb 19, 2024 at 04:09:47PM -0600, Tom Lendacky wrote:
>> That's why the '!(sev_status & MSR_AMD64_SEV_ENABLED)' works here.
> 
> I would've never figured that out just from staring at the test. :-\
> 
>> Basically, if you are bare-metal, it will return true. And it will only
>> return true for machines that support SME and have the
>> MSR_AMD64_SYSCFG_MEM_ENCRYPT bit set in SYS_CFG MSR because of where the
>> 'cc_vendor = CC_VENDOR_AMD' assignment is. However, if you move the
>> 'cc_vendor = CC_VENDOR_AMD' to before the if statement, then you will have
>> the WBINVD called for any machine that supports SME, even if SME is not
>> possible because the proper bit in the SYS_CFG MSR hasn't been set.
>>
>> I know what I'm trying to say, let me know if it is making sense...
> 
> Yah, thanks for taking the time to explain.
> 
> Here's an even more radical idea:
> 
> Why not do WBINVD *unconditionally* on the CPU down path?
> 
> - it is the opposite of a fast path, i.e., no one cares
> 
> - it'll take care of every possible configuration without ugly logic
> 
> - it wouldn't hurt to have the caches nice and coherent before going
>    down
> 
> Hmmm.

That's what I initially did, but errors were reported, see commit:
   f23d74f6c66c ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")

But that may be because of the issue solved by commit:
   1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")

So...

Thanks,
Tom

>
  
Kai Huang Feb. 20, 2024, 8:07 p.m. UTC | #10
On Tue, 2024-02-20 at 08:47 -0600, Tom Lendacky wrote:
> On 2/20/24 08:28, Borislav Petkov wrote:
> > On Mon, Feb 19, 2024 at 04:09:47PM -0600, Tom Lendacky wrote:
> > > That's why the '!(sev_status & MSR_AMD64_SEV_ENABLED)' works here.
> > 
> > I would've never figured that out just from staring at the test. :-\
> > 
> > > Basically, if you are bare-metal, it will return true. And it will only
> > > return true for machines that support SME and have the
> > > MSR_AMD64_SYSCFG_MEM_ENCRYPT bit set in SYS_CFG MSR because of where the
> > > 'cc_vendor = CC_VENDOR_AMD' assignment is. However, if you move the
> > > 'cc_vendor = CC_VENDOR_AMD' to before the if statement, then you will have
> > > the WBINVD called for any machine that supports SME, even if SME is not
> > > possible because the proper bit in the SYS_CFG MSR hasn't been set.
> > > 
> > > I know what I'm trying to say, let me know if it is making sense...
> > 
> > Yah, thanks for taking the time to explain.
> > 
> > Here's an even more radical idea:
> > 
> > Why not do WBINVD *unconditionally* on the CPU down path?
> > 
> > - it is the opposite of a fast path, i.e., no one cares
> > 
> > - it'll take care of every possible configuration without ugly logic
> > 
> > - it wouldn't hurt to have the caches nice and coherent before going
> >    down
> > 
> > Hmmm.
> 
> That's what I initially did, but errors were reported, see commit:
>    f23d74f6c66c ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")

This changelog only mentions "Some issues".  Do you know exactly what kind of
issues you saw?  Did these issues appear only on SME-enabled systems, or on
non-SME-capable systems too?

Because ...

> 
> But that may be because of the issue solved by commit:
>    1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")

... the issue resolved in this commit seems to be a hang.
  
Tom Lendacky Feb. 20, 2024, 10:30 p.m. UTC | #11
On 2/20/24 14:07, Huang, Kai wrote:
> On Tue, 2024-02-20 at 08:47 -0600, Tom Lendacky wrote:
>> On 2/20/24 08:28, Borislav Petkov wrote:
>>> On Mon, Feb 19, 2024 at 04:09:47PM -0600, Tom Lendacky wrote:
>>>> That's why the '!(sev_status & MSR_AMD64_SEV_ENABLED)' works here.
>>>
>>> I would've never figured that out just from staring at the test. :-\
>>>
>>>> Basically, if you are bare-metal, it will return true. And it will only
>>>> return true for machines that support SME and have the
>>>> MSR_AMD64_SYSCFG_MEM_ENCRYPT bit set in SYS_CFG MSR because of where the
>>>> 'cc_vendor = CC_VENDOR_AMD' assignment is. However, if you move the
>>>> 'cc_vendor = CC_VENDOR_AMD' to before the if statement, then you will have
>>>> the WBINVD called for any machine that supports SME, even if SME is not
>>>> possible because the proper bit in the SYS_CFG MSR hasn't been set.
>>>>
>>>> I know what I'm trying to say, let me know if it is making sense...
>>>
>>> Yah, thanks for taking the time to explain.
>>>
>>> Here's an even more radical idea:
>>>
>>> Why not do WBINVD *unconditionally* on the CPU down path?
>>>
>>> - it is the opposite of a fast path, i.e., no one cares
>>>
>>> - it'll take care of every possible configuration without ugly logic
>>>
>>> - it wouldn't hurt to have the caches nice and coherent before going
>>>     down
>>>
>>> Hmmm.
>>
>> That's what I initially did, but errors were reported, see commit:
>>     f23d74f6c66c ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
> 
> This changelog only mentions "Some issues".  Do you know exactly what kind of
> issues you saw?  Did these issues appear only on SME-enabled systems, or on
> non-SME-capable systems too?

I believe the issues were that different Intel systems would hang or reset 
and it was bisected to that commit that added the WBINVD. It was a while 
ago, but I remember that they were similar to what the 1f5e7eb7868e commit 
ended up fixing, which was debugged because sometimes the WBINVD was still 
occasionally issued resulting in the following patch

   9b040453d444 ("x86/smp: Dont access non-existing CPUID leaf")

It just means that if we go to an unconditional WBINVD, then we need to be 
careful.

Thanks,
Tom

> 
> Because ...
> 
>>
>> But that may be because of the issue solved by commit:
>>     1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")
> 
> > ... the issue resolved in this commit seems to be a hang.
>
  
Kai Huang Feb. 21, 2024, 1:38 a.m. UTC | #12
On Tue, 2024-02-20 at 16:30 -0600, Tom Lendacky wrote:
> On 2/20/24 14:07, Huang, Kai wrote:
> > On Tue, 2024-02-20 at 08:47 -0600, Tom Lendacky wrote:
> > > On 2/20/24 08:28, Borislav Petkov wrote:
> > > > On Mon, Feb 19, 2024 at 04:09:47PM -0600, Tom Lendacky wrote:
> > > > > That's why the '!(sev_status & MSR_AMD64_SEV_ENABLED)' works here.
> > > > 
> > > > I would've never figured that out just from staring at the test. :-\
> > > > 
> > > > > Basically, if you are bare-metal, it will return true. And it will only
> > > > > return true for machines that support SME and have the
> > > > > MSR_AMD64_SYSCFG_MEM_ENCRYPT bit set in SYS_CFG MSR because of where the
> > > > > 'cc_vendor = CC_VENDOR_AMD' assignment is. However, if you move the
> > > > > 'cc_vendor = CC_VENDOR_AMD' to before the if statement, then you will have
> > > > > the WBINVD called for any machine that supports SME, even if SME is not
> > > > > possible because the proper bit in the SYS_CFG MSR hasn't been set.
> > > > > 
> > > > > I know what I'm trying to say, let me know if it is making sense...
> > > > 
> > > > Yah, thanks for taking the time to explain.
> > > > 
> > > > Here's an even more radical idea:
> > > > 
> > > > Why not do WBINVD *unconditionally* on the CPU down path?
> > > > 
> > > > - it is the opposite of a fast path, i.e., no one cares
> > > > 
> > > > - it'll take care of every possible configuration without ugly logic
> > > > 
> > > > - it wouldn't hurt to have the caches nice and coherent before going
> > > >     down
> > > > 
> > > > Hmmm.
> > > 
> > > That's what I initially did, but errors were reported, see commit:
> > >     f23d74f6c66c ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
> > 
> > This changelog only mentions "Some issues".  Do you know exactly what kind of
> > issues you saw?  Did these issues appear only on SME-enabled systems, or on
> > non-SME-capable systems too?
> 
> I believe the issues were that different Intel systems would hang or reset 
> and it was bisected to that commit that added the WBINVD. It was a while 
> ago, but I remember that they were similar to what the 1f5e7eb7868e commit 
> ended up fixing, which was debugged because sometimes the WBINVD was still 
> occasionally issued resulting in the following patch
> 
>    9b040453d444 ("x86/smp: Dont access non-existing CPUID leaf")
> 
> It just means that if we go to an unconditional WBINVD, then we need to be 
> careful.
> 
> 

Thanks Tom for the info.  That helps a lot.

Hi Boris, Dave,

I think I still prefer keeping the existing SME kexec behaviour, that is,
having the new CC_ATTR_HOST_MEM_INCOHERENT attribute, because that way there
is no risk.

However, based on the information above I believe the risk is small if we
switch to unconditional WBINVD, in which case we don't need the new attribute
and there's also no new code needed for TDX to do the cache flush.

Btw, I want to point out that stop_this_cpu() is not the only place that needs
to do WBINVD for SME/TDX; the relocate_kernel() assembly also needs to:

        image->start = relocate_kernel((unsigned long)image->head,
                                       (unsigned long)page_list,
                                       image->start,
                                       image->preserve_context,
                                       cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));

The last function argument, cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT), is for
SME.  The relocate_kernel() assembly checks that argument and does WBINVD if
it is true.  If we go with unconditional WBINVD, I think we can just change
the assembly to do WBINVD unconditionally and remove the last function
parameter.
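
To illustrate, the call site would then collapse to something like (just a
sketch, not the actual patch):

        image->start = relocate_kernel((unsigned long)image->head,
                                     (unsigned long)page_list,
                                     image->start,
                                     image->preserve_context);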

Please let me know your preference.
  
Borislav Petkov Feb. 21, 2024, 9:28 a.m. UTC | #13
On Tue, Feb 20, 2024 at 04:30:13PM -0600, Tom Lendacky wrote:
> I believe the issues were that different Intel systems would hang or reset
> and it was bisected to that commit that added the WBINVD. It was a while
> ago, but I remember that they were similar to what the 1f5e7eb7868e commit
> ended up fixing, which was debugged because sometimes the WBINVD was still
> occasionally issued resulting in the following patch
> 
>   9b040453d444 ("x86/smp: Dont access non-existing CPUID leaf")
> 
> It just means that if we go to an unconditional WBINVD, then we need to be
> careful.

Let's try it.

Dave, do you remember what issues

  f23d74f6c66c ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")

fixed?

If so, can you try the below diff on top of latest tip/master to see if
those issues would reappear?

Thx.

---

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index ab49ade31b0d..ec4dcc9f70ca 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -824,8 +824,7 @@ void __noreturn stop_this_cpu(void *dummy)
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
 	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
-		native_wbinvd();
+	native_wbinvd();
 
 	/*
 	 * This brings a cache line back and dirties it, but
  
Kai Huang Feb. 22, 2024, 11:49 a.m. UTC | #14
On Wed, 2024-02-21 at 10:28 +0100, Borislav Petkov wrote:
> On Tue, Feb 20, 2024 at 04:30:13PM -0600, Tom Lendacky wrote:
> > I believe the issues were that different Intel systems would hang or reset
> > and it was bisected to that commit that added the WBINVD. It was a while
> > ago, but I remember that they were similar to what the 1f5e7eb7868e commit
> > ended up fixing, which was debugged because sometimes the WBINVD was still
> > occasionally issued resulting in the following patch
> > 
> >   9b040453d444 ("x86/smp: Dont access non-existing CPUID leaf")
> > 
> > It just means that if we go to an unconditional WBINVD, then we need to be
> > careful.
> 
> Let's try it.
> 
> Dave, do you remember what issues
> 
>   f23d74f6c66c ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
> 
> fixed?
> 
> If so, can you try the below diff on top of latest tip/master to see if
> those issues would reappear?
> 
> Thx.
> 
> ---
> 
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index ab49ade31b0d..ec4dcc9f70ca 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -824,8 +824,7 @@ void __noreturn stop_this_cpu(void *dummy)
>  	 * Test the CPUID bit directly because the machine might've cleared
>  	 * X86_FEATURE_SME due to cmdline options.
>  	 */
> -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> -		native_wbinvd();
> +	native_wbinvd();
>  
>  	/*
>  	 * This brings a cache line back and dirties it, but
> 

I would really appreciate it if Dave could help here.

I will also reach out to see whether anyone at Intel has met this before.
  
Dave Young Feb. 23, 2024, 3:13 a.m. UTC | #15
Hi,

On Thu, 22 Feb 2024 at 19:50, Huang, Kai <kai.huang@intel.com> wrote:
>
> On Wed, 2024-02-21 at 10:28 +0100, Borislav Petkov wrote:
> > On Tue, Feb 20, 2024 at 04:30:13PM -0600, Tom Lendacky wrote:
> > > I believe the issues were that different Intel systems would hang or reset
> > > and it was bisected to that commit that added the WBINVD. It was a while
> > > ago, but I remember that they were similar to what the 1f5e7eb7868e commit
> > > ended up fixing, which was debugged because sometimes the WBINVD was still
> > > occasionally issued resulting in the following patch
> > >
> > >   9b040453d444 ("x86/smp: Dont access non-existing CPUID leaf")
> > >
> > > It just means that if we go to an unconditional WBINVD, then we need to be
> > > careful.
> >
> > Let's try it.
> >
> > Dave, do you remember what issues
> >
> >   f23d74f6c66c ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
> >
> > fixed?
> >
> > If so, can you try the below diff on top of latest tip/master to see if
> > those issues would reappear?
> >
> > Thx.
> >
> > ---
> >
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index ab49ade31b0d..ec4dcc9f70ca 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -824,8 +824,7 @@ void __noreturn stop_this_cpu(void *dummy)
> >        * Test the CPUID bit directly because the machine might've cleared
> >        * X86_FEATURE_SME due to cmdline options.
> >        */
> > -     if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> > -             native_wbinvd();
> > +     native_wbinvd();
> >
> >       /*
> >        * This brings a cache line back and dirties it, but
> >
>
> I would really appreciate it if Dave could help here.
>
> I will also reach out to see whether anyone at Intel has met this before.

I forgot the details now; let me recall and try the patches if possible.

Thanks
Dave
  
Dave Young Feb. 23, 2024, 10:41 a.m. UTC | #16
On Wed, 21 Feb 2024 at 17:33, Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Feb 20, 2024 at 04:30:13PM -0600, Tom Lendacky wrote:
> > I believe the issues were that different Intel systems would hang or reset
> > and it was bisected to that commit that added the WBINVD. It was a while
> > ago, but I remember that they were similar to what the 1f5e7eb7868e commit
> > ended up fixing, which was debugged because sometimes the WBINVD was still
> > occasionally issued resulting in the following patch
> >
> >   9b040453d444 ("x86/smp: Dont access non-existing CPUID leaf")
> >
> > It just means that if we go to an unconditional WBINVD, then we need to be
> > careful.
>
> Let's try it.
>
> Dave, do you remember what issues
>
>   f23d74f6c66c ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
>
> fixed?

It should be the kexec reboot failure described in the thread below:
https://lore.kernel.org/lkml/20180117072123.GA1866@dhcp-128-65.nay.redhat.com/

>
> If so, can you try the below diff on top of latest tip/master to see if
> those issues would reappear?

It was reproduced on an old laptop (a Thinkpad t440s or t480s, I cannot
remember which), but I have since replaced it with a different, newer one.  I
tried the latest tip-master with the if condition commented out, and kexec
reboot works fine.

Let me try to find an old laptop to see if I can do more tests; I will
get back later next week.

>
> Thx.
>
> ---
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index ab49ade31b0d..ec4dcc9f70ca 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -824,8 +824,7 @@ void __noreturn stop_this_cpu(void *dummy)
>          * Test the CPUID bit directly because the machine might've cleared
>          * X86_FEATURE_SME due to cmdline options.
>          */
> -       if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> -               native_wbinvd();
> +       native_wbinvd();
>
>         /*
>          * This brings a cache line back and dirties it, but
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
>
  
Dave Young Feb. 28, 2024, 2:54 a.m. UTC | #17
On Fri, 23 Feb 2024 at 18:41, Dave Young <dyoung@redhat.com> wrote:
>
> On Wed, 21 Feb 2024 at 17:33, Borislav Petkov <bp@alien8.de> wrote:
> >
> > On Tue, Feb 20, 2024 at 04:30:13PM -0600, Tom Lendacky wrote:
> > > I believe the issues were that different Intel systems would hang or reset
> > > and it was bisected to that commit that added the WBINVD. It was a while
> > > ago, but I remember that they were similar to what the 1f5e7eb7868e commit
> > > ended up fixing, which was debugged because sometimes the WBINVD was still
> > > occasionally issued resulting in the following patch
> > >
> > >   9b040453d444 ("x86/smp: Dont access non-existing CPUID leaf")
> > >
> > > It just means that if we go to an unconditional WBINVD, then we need to be
> > > careful.
> >
> > Let's try it.
> >
> > Dave, do you remember what issues
> >
> >   f23d74f6c66c ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
> >
> > fixed?
>
> It should be the kexec reboot failure described in the thread below:
> https://lore.kernel.org/lkml/20180117072123.GA1866@dhcp-128-65.nay.redhat.com/
>
> >
> > If so, can you try the below diff on top of latest tip/master to see if
> > those issues would reappear?
>
> It was reproduced on an old laptop (a Thinkpad t440s or t480s, I cannot
> remember which), but I have since replaced it with a different, newer one.  I
> tried the latest tip-master with the if condition commented out, and kexec
> reboot works fine.
>
> Let me try to find an old laptop to see if I can do more tests; I will
> get back later next week.

Update: tested on an old laptop as well.  I did not find any problems
with unconditional native_wbinvd(); kexec and kdump work fine.

>
> >
> > Thx.
> >
> > ---
> >
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index ab49ade31b0d..ec4dcc9f70ca 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -824,8 +824,7 @@ void __noreturn stop_this_cpu(void *dummy)
> >          * Test the CPUID bit directly because the machine might've cleared
> >          * X86_FEATURE_SME due to cmdline options.
> >          */
> > -       if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> > -               native_wbinvd();
> > +       native_wbinvd();
> >
> >         /*
> >          * This brings a cache line back and dirties it, but
> >
> > --
> > Regards/Gruss,
> >     Boris.
> >
> > https://people.kernel.org/tglx/notes-about-netiquette
> >
  
Kai Huang Feb. 28, 2024, 9:21 a.m. UTC | #18
On Wed, 2024-02-28 at 10:54 +0800, Dave Young wrote:
> On Fri, 23 Feb 2024 at 18:41, Dave Young <dyoung@redhat.com> wrote:
> > 
> > On Wed, 21 Feb 2024 at 17:33, Borislav Petkov <bp@alien8.de> wrote:
> > > 
> > > On Tue, Feb 20, 2024 at 04:30:13PM -0600, Tom Lendacky wrote:
> > > > I believe the issues were that different Intel systems would hang or reset
> > > > and it was bisected to that commit that added the WBINVD. It was a while
> > > > ago, but I remember that they were similar to what the 1f5e7eb7868e commit
> > > > ended up fixing, which was debugged because sometimes the WBINVD was still
> > > > occasionally issued resulting in the following patch
> > > > 
> > > >   9b040453d444 ("x86/smp: Dont access non-existing CPUID leaf")
> > > > 
> > > > It just means that if we go to an unconditional WBINVD, then we need to be
> > > > careful.
> > > 
> > > Let's try it.
> > > 
> > > Dave, do you remember what issues
> > > 
> > >   f23d74f6c66c ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
> > > 
> > > fixed?
> > 
> > It should be the kexec reboot failure described in the thread below:
> > https://lore.kernel.org/lkml/20180117072123.GA1866@dhcp-128-65.nay.redhat.com/
> > 
> > > 
> > > If so, can you try the below diff on top of latest tip/master to see if
> > > those issues would reappear?
> > 
> > It was reproduced on an old laptop (a Thinkpad t440s or t480s, I cannot
> > remember which), but I have since replaced it with a different, newer one.  I
> > tried the latest tip-master with the if condition commented out, and kexec
> > reboot works fine.
> > 
> > Let me try to find an old laptop to see if I can do more tests; I will
> > get back later next week.
> 
> Update: tested on an old laptop as well.  I did not find any problems
> with unconditional native_wbinvd(); kexec and kdump work fine.
> 
> 

Hi DaveY,

Thanks for getting back to me; I appreciate your help.

Hi Boris/DaveH,

Based on this I'll change to do unconditional WBINVD during kexec() in
stop_this_cpu() and relocate_kernel().
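
For the relocate_kernel() side, I expect the change to be roughly the below
(an illustrative sketch based on my reading of relocate_kernel_64.S; the
exact labels/registers may differ):

	/* Before: flush only when the SME flag passed from
	 * machine_kexec() (saved in %r12) is set: */
	testq	%r12, %r12
	jz	1f
	wbinvd
1:

	/* After: the old kernel could always have left incoherent
	 * cachelines (SME or TDX private memory), so flush
	 * unconditionally: */
	wbinvd

plus removing the now-unused last parameter from relocate_kernel().
Please let me know if you have any comments.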

Thanks all!
  
Borislav Petkov Feb. 28, 2024, 10:44 a.m. UTC | #19
On Wed, Feb 28, 2024 at 10:54:22AM +0800, Dave Young wrote:
> Update: tested on an old laptop as well.  I did not find any problems
> with unconditional native_wbinvd(); kexec and kdump work fine.

Thanks a lot for testing it!
  
Borislav Petkov Feb. 28, 2024, 11:02 a.m. UTC | #20
On Wed, Feb 28, 2024 at 09:21:00AM +0000, Huang, Kai wrote:
> Based on this I'll change to do unconditional WBINVD during kexec() in
> stop_this_cpu() and relocate_kernel().  Please let me know if you have any
> comments.

Yes, pls make sure to summarize in the commit message what this thread
figured out. But basically, we want to try the simple approach and
WBINVD unconditionally on the CPU stopping path so that caches are clean
and there are no potential issues...

Thx.
  
Kai Huang Feb. 28, 2024, 10:21 p.m. UTC | #21
On 29/02/2024 12:02 am, Borislav Petkov wrote:
> On Wed, Feb 28, 2024 at 09:21:00AM +0000, Huang, Kai wrote:
>> Based on this I'll change to do unconditional WBINVD during kexec() in
>> stop_this_cpu() and relocate_kernel().  Please let me know if you have any
>> comments.
> 
> Yes, pls make sure to summarize in the commit message what this thread
> figured out. But basically, we want to try the simple approach and
> WBINVD unconditionally on the CPU stopping path so that caches are clean
> and there are no potential issues...
> 

Yes will do.  Thanks for the feedback.
  

Patch

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index eeec9986570e..8d6d727e6e18 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -72,6 +72,19 @@  static bool noinstr amd_cc_platform_has(enum cc_attr attr)
 	case CC_ATTR_HOST_MEM_ENCRYPT:
 		return sme_me_mask && !(sev_status & MSR_AMD64_SEV_ENABLED);
 
+	case CC_ATTR_HOST_MEM_INCOHERENT:
+		/*
+	 * CC_ATTR_HOST_MEM_INCOHERENT represents whether SME has been
+	 * enabled on the platform, regardless of whether the kernel
+	 * has actually enabled SME.
+		 */
+		return !(sev_status & MSR_AMD64_SEV_ENABLED);
+
+	/*
+	 * For all CC_ATTR_GUEST_* there's no need to check sme_me_mask,
+	 * as it must be non-zero whenever any SEV enable bit is set in
+	 * sev_status.
+	 */
 	case CC_ATTR_GUEST_MEM_ENCRYPT:
 		return sev_status & MSR_AMD64_SEV_ENABLED;
 
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index bc0a5348b4a6..c9c6974e2e9c 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -358,7 +358,7 @@  void machine_kexec(struct kimage *image)
 				       (unsigned long)page_list,
 				       image->start,
 				       image->preserve_context,
-				       cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
+				       cc_platform_has(CC_ATTR_HOST_MEM_INCOHERENT));
 
 #ifdef CONFIG_KEXEC_JUMP
 	if (image->preserve_context)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index ab49ade31b0d..2c7e8d9889c0 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -813,18 +813,10 @@  void __noreturn stop_this_cpu(void *dummy)
 	mcheck_cpu_clear(c);
 
 	/*
-	 * Use wbinvd on processors that support SME. This provides support
-	 * for performing a successful kexec when going from SME inactive
-	 * to SME active (or vice-versa). The cache must be cleared so that
-	 * if there are entries with the same physical address, both with and
-	 * without the encryption bit, they don't race each other when flushed
-	 * and potentially end up with the wrong entry being committed to
-	 * memory.
-	 *
-	 * Test the CPUID bit directly because the machine might've cleared
-	 * X86_FEATURE_SME due to cmdline options.
+	 * Use wbinvd on processors where the first kernel *could*
+	 * have left incoherent cachelines behind.
 	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if (cc_platform_has(CC_ATTR_HOST_MEM_INCOHERENT))
 		native_wbinvd();
 
 	/*
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 7f72472a34d6..87e4fddab770 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -570,9 +570,19 @@  void __init sme_enable(struct boot_params *bp)
 		msr = __rdmsr(MSR_AMD64_SYSCFG);
 		if (!(msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT))
 			return;
+
+		/*
+		 * Always set the CC vendor when the platform has SME enabled,
+		 * regardless of whether the kernel actually activates
+		 * SME.  This reports CC_ATTR_HOST_MEM_INCOHERENT as
+		 * true as long as the platform has SME enabled, so that
+		 * stop_this_cpu() can do the necessary WBINVD during kexec().
+		 */
+		cc_vendor = CC_VENDOR_AMD;
 	} else {
 		/* SEV state cannot be controlled by a command line option */
 		sme_me_mask = me_mask;
+		cc_vendor = CC_VENDOR_AMD;
 		goto out;
 	}
 
@@ -608,7 +618,6 @@  void __init sme_enable(struct boot_params *bp)
 out:
 	if (sme_me_mask) {
 		physical_mask &= ~sme_me_mask;
-		cc_vendor = CC_VENDOR_AMD;
 		cc_set_mask(sme_me_mask);
 	}
 }
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index cb0d6cd1c12f..2f7273596102 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -42,6 +42,21 @@  enum cc_attr {
 	 */
 	CC_ATTR_HOST_MEM_ENCRYPT,
 
+	/**
+	 * @CC_ATTR_HOST_MEM_INCOHERENT: Host memory encryption can be
+	 * incoherent
+	 *
+	 * The platform/OS is running as a bare-metal system or a hypervisor.
+	 * The memory encryption engine might have left non-cache-coherent
+	 * data in the caches that needs to be flushed.
+	 *
+	 * Use this in places where the cache coherency of the memory matters
+	 * but the encryption status does not.
+	 *
+	 * Includes all systems that set CC_ATTR_HOST_MEM_ENCRYPT.
+	 */
+	CC_ATTR_HOST_MEM_INCOHERENT,
+
 	/**
 	 * @CC_ATTR_GUEST_MEM_ENCRYPT: Guest memory encryption is active
 	 *