[PATCHv6,10/16] x86/tdx: Convert shared memory back to private on kexec

Message ID 20240124125557.493675-11-kirill.shutemov@linux.intel.com
State New
Series x86/tdx: Add kexec support

Commit Message

Kirill A. Shutemov Jan. 24, 2024, 12:55 p.m. UTC
  TDX guests allocate shared buffers to perform I/O. This is done by
allocating pages normally from the buddy allocator and converting them
to shared with set_memory_decrypted().

The second kernel has no idea which memory has been converted this way. It
only sees E820_TYPE_RAM.

Accessing shared memory via a private mapping is fatal. It leads to an
unrecoverable TD exit.

On kexec, walk the direct mapping and convert all shared memory back to
private. This makes all RAM private again, so the second kernel may use it
normally.

The conversion occurs in two steps: stopping new conversions and
unsharing all memory. In the case of normal kexec, the stopping of
conversions takes place while scheduling is still functioning. This
allows for waiting until any ongoing conversions are finished. The
second step is carried out when all CPUs except one are inactive and
interrupts are disabled. This prevents any conflicts with code that may
access shared memory.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/coco/tdx/tdx.c | 124 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 122 insertions(+), 2 deletions(-)
  

Comments

Kalra, Ashish Jan. 29, 2024, 10:24 a.m. UTC | #1
Hello Kirill,

On 1/24/2024 6:55 AM, Kirill A. Shutemov wrote:
> TDX guests allocate shared buffers to perform I/O. It is done by
> allocating pages normally from the buddy allocator and converting them
> to shared with set_memory_decrypted().
>
> The second kernel has no idea what memory is converted this way. It only
> sees E820_TYPE_RAM.
>
> Accessing shared memory via private mapping is fatal. It leads to
> unrecoverable TD exit.
>
> On kexec walk direct mapping and convert all shared memory back to
> private. It makes all RAM private again and second kernel may use it
> normally.
>
> The conversion occurs in two steps: stopping new conversions and
> unsharing all memory. In the case of normal kexec, the stopping of
> conversions takes place while scheduling is still functioning. This
> allows for waiting until any ongoing conversions are finished. The
> second step is carried out when all CPUs except one are inactive and
> interrupts are disabled. This prevents any conflicts with code that may
> access shared memory.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>   arch/x86/coco/tdx/tdx.c | 124 +++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 122 insertions(+), 2 deletions(-)
<snip>
> +static void tdx_kexec_stop_conversion(bool crash)
> +{
> +	/* Stop new private<->shared conversions */
> +	conversion_allowed = false;
> +
> +	/*
> +	 * Make sure conversion_allowed is cleared before checking
> +	 * conversions_in_progress.
> +	 */
> +	barrier();
> +
> +	/*
> +	 * Crash kernel reaches here with interrupts disabled: can't wait for
> +	 * conversions to finish.
> +	 *
> +	 * If race happened, just report and proceed.
> +	 */
> +	if (!crash) {
> +		unsigned long timeout;
> +
> +		/*
> +		 * Wait for in-flight conversions to complete.
> +		 *
> +		 * Do not wait more than 30 seconds.
> +		 */
> +		timeout = 30 * USEC_PER_SEC;
> +		while (atomic_read(&conversions_in_progress) && timeout--)
> +			udelay(1);
> +	}
> +
> +	if (atomic_read(&conversions_in_progress))
> +		pr_warn("Failed to finish shared<->private conversions\n");
> +}
> +
> +static void tdx_kexec_unshare_mem(void)
> +{
> +	unsigned long addr, end;
> +	long found = 0, shared;
> +
> +	/*
> +	 * Walk direct mapping and convert all shared memory back to private,
> +	 */
> +
> +	addr = PAGE_OFFSET;
> +	end  = PAGE_OFFSET + get_max_mapped();
> +
> +	while (addr < end) {
> +		unsigned long size;
> +		unsigned int level;
> +		pte_t *pte;
> +
> +		pte = lookup_address(addr, &level);
> +		size = page_level_size(level);
> +
> +		if (pte && pte_decrypted(*pte)) {
> +			int pages = size / PAGE_SIZE;
> +
> +			/*
> +			 * Touching memory with shared bit set triggers implicit
> +			 * conversion to shared.
> +			 *
> +			 * Make sure nobody touches the shared range from
> +			 * now on.
> +			 */
> +			set_pte(pte, __pte(0));
> +
> +			if (!tdx_enc_status_changed(addr, pages, true)) {
> +				pr_err("Failed to unshare range %#lx-%#lx\n",
> +				       addr, addr + size);
> +			}
> +
> +			found += pages;
> +		}
> +
> +		addr += size;
> +	}
> +
> +	__flush_tlb_all();
> +
> +	shared = atomic_long_read(&nr_shared);
> +	if (shared != found) {
> +		pr_err("shared page accounting is off\n");
> +		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
> +	}
> +}
In the case of SNP and the crash/kdump case, we need to prevent the
boot_ghcb from being converted to shared (in snp_kexec_unshare_mem()), as
the boot_ghcb is required to handle all I/O for disabling the
IO-APIC/lapic, hpet, etc., because the enc_kexec_unshare_mem() callback is
invoked before the apics, hpet, etc. are disabled.

Is there any reason why the enc_kexec_unshare_mem() callback is invoked in
the crash case before the IO-APIC/lapic, hpet, etc. are shut down/disabled?

In the case of kexec, the enc_kexec_unshare_mem() callback is invoked after
the IO-APIC/lapic, hpet, iommu, etc. have already been disabled/shut down;
hence, this callback can transition all guest shared memory (including the
boot_ghcb) back to private.

Thanks, Ashish
  
Kirill A. Shutemov Jan. 29, 2024, 10:36 a.m. UTC | #2
On Mon, Jan 29, 2024 at 04:24:09AM -0600, Kalra, Ashish wrote:
> In case of SNP and crash/kdump case, we need to prevent the boot_ghcb being
> converted to shared (in snp_kexec_unshare_mem()) as the boot_ghcb is
> required to handle all I/O for disabling IO-APIC/lapic, hpet, etc., as the
> enc_kexec_unshare_mem() callback is invoked before the apics, hpet, etc. are
> disabled.
> 
> Is there any reason why enc_kexec_unshare_mem() callback is invoked in crash
> case before the IO-APIC/lapic, hpet, etc. are shutdown/disabled ?

Not really. Either way works for TDX. I've tested the patch below. Is it
what you want?

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 6585a5f2c2ba..3001f4857ed7 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -107,11 +107,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 
 	crash_smp_send_stop();
 
-	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
-		x86_platform.guest.enc_kexec_stop_conversion(true);
-		x86_platform.guest.enc_kexec_unshare_mem();
-	}
-
 	cpu_emergency_disable_virtualization();
 
 	/*
@@ -129,6 +124,12 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 #ifdef CONFIG_HPET_TIMER
 	hpet_disable();
 #endif
+
+	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
+		x86_platform.guest.enc_kexec_stop_conversion(true);
+		x86_platform.guest.enc_kexec_unshare_mem();
+	}
+
 	crash_save_cpu(regs, safe_smp_processor_id());
 }
  
Kalra, Ashish Jan. 29, 2024, 1:09 p.m. UTC | #3
Hello Kirill,

On 1/29/2024 4:36 AM, Kirill A. Shutemov wrote:
> On Mon, Jan 29, 2024 at 04:24:09AM -0600, Kalra, Ashish wrote:
>> In case of SNP and crash/kdump case, we need to prevent the boot_ghcb being
>> converted to shared (in snp_kexec_unshare_mem()) as the boot_ghcb is
>> required to handle all I/O for disabling IO-APIC/lapic, hpet, etc., as the
>> enc_kexec_unshare_mem() callback is invoked before the apics, hpet, etc. are
>> disabled.
>>
>> Is there any reason why enc_kexec_unshare_mem() callback is invoked in crash
>> case before the IO-APIC/lapic, hpet, etc. are shutdown/disabled ?
> Not really. Either way works for TDX. I've tested the patch below. Is it
> what you want?
>
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index 6585a5f2c2ba..3001f4857ed7 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -107,11 +107,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>   
>   	crash_smp_send_stop();
>   
> -	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
> -		x86_platform.guest.enc_kexec_stop_conversion(true);
> -		x86_platform.guest.enc_kexec_unshare_mem();
> -	}
> -
>   	cpu_emergency_disable_virtualization();
>   
>   	/*
> @@ -129,6 +124,12 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>   #ifdef CONFIG_HPET_TIMER
>   	hpet_disable();
>   #endif
> +
> +	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
> +		x86_platform.guest.enc_kexec_stop_conversion(true);
> +		x86_platform.guest.enc_kexec_unshare_mem();
> +	}
> +
>   	crash_save_cpu(regs, safe_smp_processor_id());
>   }
>   

Yes, this will work for SNP.

Thanks, Ashish
  
Kirill A. Shutemov Jan. 29, 2024, 1:34 p.m. UTC | #4
On Mon, Jan 29, 2024 at 07:09:37AM -0600, Kalra, Ashish wrote:
> Hello Kirill,
> 
> On 1/29/2024 4:36 AM, Kirill A. Shutemov wrote:
> > On Mon, Jan 29, 2024 at 04:24:09AM -0600, Kalra, Ashish wrote:
> > > In case of SNP and crash/kdump case, we need to prevent the boot_ghcb being
> > > converted to shared (in snp_kexec_unshare_mem()) as the boot_ghcb is
> > > required to handle all I/O for disabling IO-APIC/lapic, hpet, etc., as the
> > > enc_kexec_unshare_mem() callback is invoked before the apics, hpet, etc. are
> > > disabled.
> > > 
> > > Is there any reason why enc_kexec_unshare_mem() callback is invoked in crash
> > > case before the IO-APIC/lapic, hpet, etc. are shutdown/disabled ?
> > Not really. Either way works for TDX. I've tested the patch below. Is it
> > what you want?
> > 
> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> > index 6585a5f2c2ba..3001f4857ed7 100644
> > --- a/arch/x86/kernel/crash.c
> > +++ b/arch/x86/kernel/crash.c
> > @@ -107,11 +107,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> >   	crash_smp_send_stop();
> > -	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
> > -		x86_platform.guest.enc_kexec_stop_conversion(true);
> > -		x86_platform.guest.enc_kexec_unshare_mem();
> > -	}
> > -
> >   	cpu_emergency_disable_virtualization();
> >   	/*
> > @@ -129,6 +124,12 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> >   #ifdef CONFIG_HPET_TIMER
> >   	hpet_disable();
> >   #endif
> > +
> > +	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
> > +		x86_platform.guest.enc_kexec_stop_conversion(true);
> > +		x86_platform.guest.enc_kexec_unshare_mem();
> > +	}
> > +
> >   	crash_save_cpu(regs, safe_smp_processor_id());
> >   }
> 
> Yes, this will work for SNP.

Okay, good. Will include the change in the next version of the patchset.
  

Patch

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index fd212c9bad89..bb77a927a831 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -6,8 +6,10 @@ 
 
 #include <linux/cpufeature.h>
 #include <linux/debugfs.h>
+#include <linux/delay.h>
 #include <linux/export.h>
 #include <linux/io.h>
+#include <linux/kexec.h>
 #include <asm/coco.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
@@ -15,6 +17,7 @@ 
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
+#include <asm/set_memory.h>
 
 /* MMIO direction */
 #define EPT_READ	0
@@ -41,6 +44,9 @@ 
 
 static atomic_long_t nr_shared;
 
+static atomic_t conversions_in_progress;
+static bool conversion_allowed = true;
+
 static inline bool pte_decrypted(pte_t pte)
 {
 	return cc_mkdec(pte_val(pte)) == pte_val(pte);
@@ -726,6 +732,14 @@  static bool tdx_tlb_flush_required(bool private)
 
 static bool tdx_cache_flush_required(void)
 {
+	/*
+	 * Avoid issuing CLFLUSH on set_memory_decrypted() if conversions
+	 * stopped. Otherwise it can race with unshare_all_memory() and trigger
+	 * implicit conversion to shared.
+	 */
+	if (!conversion_allowed)
+		return false;
+
 	/*
 	 * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
 	 * TDX doesn't have such capability.
@@ -809,12 +823,25 @@  static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
 					 bool enc)
 {
+	atomic_inc(&conversions_in_progress);
+
+	/*
+	 * Check after bumping conversions_in_progress to serialize
+	 * against tdx_kexec_stop_conversion().
+	 */
+	if (!conversion_allowed) {
+		atomic_dec(&conversions_in_progress);
+		return -EBUSY;
+	}
+
 	/*
 	 * Only handle shared->private conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
+		atomic_dec(&conversions_in_progress);
 		return -EIO;
+	}
 
 	return 0;
 }
@@ -826,17 +853,107 @@  static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 	 * Only handle private->shared conversion here.
 	 * See the comment in tdx_early_init().
 	 */
-	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
+		atomic_dec(&conversions_in_progress);
 		return -EIO;
+	}
 
 	if (enc)
 		atomic_long_sub(numpages, &nr_shared);
 	else
 		atomic_long_add(numpages, &nr_shared);
 
+	atomic_dec(&conversions_in_progress);
+
 	return 0;
 }
 
+static void tdx_kexec_stop_conversion(bool crash)
+{
+	/* Stop new private<->shared conversions */
+	conversion_allowed = false;
+
+	/*
+	 * Make sure conversion_allowed is cleared before checking
+	 * conversions_in_progress.
+	 */
+	barrier();
+
+	/*
+	 * Crash kernel reaches here with interrupts disabled: can't wait for
+	 * conversions to finish.
+	 *
+	 * If race happened, just report and proceed.
+	 */
+	if (!crash) {
+		unsigned long timeout;
+
+		/*
+		 * Wait for in-flight conversions to complete.
+		 *
+		 * Do not wait more than 30 seconds.
+		 */
+		timeout = 30 * USEC_PER_SEC;
+		while (atomic_read(&conversions_in_progress) && timeout--)
+			udelay(1);
+	}
+
+	if (atomic_read(&conversions_in_progress))
+		pr_warn("Failed to finish shared<->private conversions\n");
+}
+
+static void tdx_kexec_unshare_mem(void)
+{
+	unsigned long addr, end;
+	long found = 0, shared;
+
+	/*
+	 * Walk direct mapping and convert all shared memory back to private,
+	 */
+
+	addr = PAGE_OFFSET;
+	end  = PAGE_OFFSET + get_max_mapped();
+
+	while (addr < end) {
+		unsigned long size;
+		unsigned int level;
+		pte_t *pte;
+
+		pte = lookup_address(addr, &level);
+		size = page_level_size(level);
+
+		if (pte && pte_decrypted(*pte)) {
+			int pages = size / PAGE_SIZE;
+
+			/*
+			 * Touching memory with shared bit set triggers implicit
+			 * conversion to shared.
+			 *
+			 * Make sure nobody touches the shared range from
+			 * now on.
+			 */
+			set_pte(pte, __pte(0));
+
+			if (!tdx_enc_status_changed(addr, pages, true)) {
+				pr_err("Failed to unshare range %#lx-%#lx\n",
+				       addr, addr + size);
+			}
+
+			found += pages;
+		}
+
+		addr += size;
+	}
+
+	__flush_tlb_all();
+
+	shared = atomic_long_read(&nr_shared);
+	if (shared != found) {
+		pr_err("shared page accounting is off\n");
+		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
+	}
+}
+
 void __init tdx_early_init(void)
 {
 	struct tdx_module_args args = {
@@ -896,6 +1013,9 @@  void __init tdx_early_init(void)
 	x86_platform.guest.enc_cache_flush_required  = tdx_cache_flush_required;
 	x86_platform.guest.enc_tlb_flush_required    = tdx_tlb_flush_required;
 
+	x86_platform.guest.enc_kexec_stop_conversion = tdx_kexec_stop_conversion;
+	x86_platform.guest.enc_kexec_unshare_mem     = tdx_kexec_unshare_mem;
+
 	/*
 	 * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
 	 * bringup low level code. That raises #VE which cannot be handled