[v7,10/20] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory

Message ID 9b545148275b14a8c7edef1157f8ec44dc8116ee.1668988357.git.kai.huang@intel.com
State New
Series TDX host kernel support

Commit Message

Kai Huang Nov. 21, 2022, 12:26 a.m. UTC
TDX reports a list of "Convertible Memory Regions" (CMRs) to indicate all
memory regions that can possibly be used by the TDX module, but they are
not automatically usable by the TDX module.  As a step of initializing
the TDX module, the kernel needs to choose a list of memory regions (out
of the convertible memory regions) that the TDX module can use and pass
those regions to the TDX module.  Once this is done, those "TDX-usable"
memory regions are fixed for the module's lifetime.  No more TDX-usable
memory can be added to the TDX module after that.

The initial support of TDX guests will only allocate TDX guest memory
from the global page allocator.  To keep things simple, this initial
implementation simply guarantees all pages in the page allocator are TDX
memory.  To achieve this, use all system memory in the core-mm at the
time of initializing the TDX module as TDX memory, and in the meantime,
refuse to add any non-TDX memory via memory hotplug.

Specifically, walk through all memory regions managed by memblock and
add them to a global list of "TDX-usable" memory regions, which is a
fixed list after the module initialization (or empty if initialization
fails).  To reject non-TDX memory in memory hotplug, add a check in
arch_add_memory() to verify whether the new region is covered by any
region in the "TDX-usable" memory region list.

Note this requires all memory regions in memblock to be TDX convertible
memory when initializing the TDX module.  This is true in practice if no
new memory has been hot-added before initializing the TDX module, since
in practice all boot-time present DIMMs are TDX convertible memory.  If
any new memory has been hot-added, then initializing the TDX module will
fail because that memory region is not covered by any CMR.

This can be enhanced in the future, e.g. by allowing non-TDX memory to
be added to a separate NUMA node.  In this case, the "TDX-capable" nodes
and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
needs to guarantee memory pages for TDX guests are always allocated from
the "TDX-capable" nodes.

Note TDX assumes convertible memory is always physically present during
the machine's runtime.  A non-buggy BIOS should never support
hot-removal of any convertible memory.  This implementation doesn't
handle ACPI memory removal but depends on the BIOS to behave correctly.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v6 -> v7:
 - Changed to use all system memory in memblock at the time of
   initializing the TDX module as TDX memory
 - Added memory hotplug support

---
 arch/x86/Kconfig            |   1 +
 arch/x86/include/asm/tdx.h  |   3 +
 arch/x86/mm/init_64.c       |  10 ++
 arch/x86/virt/vmx/tdx/tdx.c | 183 ++++++++++++++++++++++++++++++++++++
 4 files changed, 197 insertions(+)
  

Comments

Huang, Ying Nov. 21, 2022, 5:37 a.m. UTC | #1
Kai Huang <kai.huang@intel.com> writes:

> TDX reports a list of "Convertible Memory Regions" (CMRs) to indicate all
> memory regions that can possibly be used by the TDX module, but they are
> not automatically usable by the TDX module.  As a step of initializing
> the TDX module, the kernel needs to choose a list of memory regions (out
> of the convertible memory regions) that the TDX module can use and pass
> those regions to the TDX module.  Once this is done, those "TDX-usable"
> memory regions are fixed for the module's lifetime.  No more TDX-usable
> memory can be added to the TDX module after that.
>
> The initial support of TDX guests will only allocate TDX guest memory
> from the global page allocator.  To keep things simple, this initial
> implementation simply guarantees all pages in the page allocator are TDX
> memory.  To achieve this, use all system memory in the core-mm at the
> time of initializing the TDX module as TDX memory, and in the meantime,
> refuse to add any non-TDX memory via memory hotplug.
>
> Specifically, walk through all memory regions managed by memblock and
> add them to a global list of "TDX-usable" memory regions, which is a
> fixed list after the module initialization (or empty if initialization
> fails).  To reject non-TDX memory in memory hotplug, add a check in
> arch_add_memory() to verify whether the new region is covered by any
> region in the "TDX-usable" memory region list.
>
> Note this requires all memory regions in memblock to be TDX convertible
> memory when initializing the TDX module.  This is true in practice if no
> new memory has been hot-added before initializing the TDX module, since
> in practice all boot-time present DIMMs are TDX convertible memory.  If
> any new memory has been hot-added, then initializing the TDX module will
> fail because that memory region is not covered by any CMR.
>
> This can be enhanced in the future, e.g. by allowing non-TDX memory to
> be added to a separate NUMA node.  In this case, the "TDX-capable" nodes
> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> needs to guarantee memory pages for TDX guests are always allocated from
> the "TDX-capable" nodes.
>
> Note TDX assumes convertible memory is always physically present during
> the machine's runtime.  A non-buggy BIOS should never support
> hot-removal of any convertible memory.  This implementation doesn't
> handle ACPI memory removal but depends on the BIOS to behave correctly.
>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
>
> v6 -> v7:
>  - Changed to use all system memory in memblock at the time of
>    initializing the TDX module as TDX memory
>  - Added memory hotplug support
>
> ---
>  arch/x86/Kconfig            |   1 +
>  arch/x86/include/asm/tdx.h  |   3 +
>  arch/x86/mm/init_64.c       |  10 ++
>  arch/x86/virt/vmx/tdx/tdx.c | 183 ++++++++++++++++++++++++++++++++++++
>  4 files changed, 197 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index dd333b46fafb..b36129183035 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
>  	depends on X86_64
>  	depends on KVM_INTEL
>  	depends on X86_X2APIC
> +	select ARCH_KEEP_MEMBLOCK
>  	help
>  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>  	  host and certain physical attacks.  This option enables necessary TDX
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index d688228f3151..71169ecefabf 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -111,9 +111,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>  #ifdef CONFIG_INTEL_TDX_HOST
>  bool platform_tdx_enabled(void);
>  int tdx_enable(void);
> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn);
>  #else	/* !CONFIG_INTEL_TDX_HOST */
>  static inline bool platform_tdx_enabled(void) { return false; }
>  static inline int tdx_enable(void)  { return -ENODEV; }
> +static inline bool tdx_cc_memory_compatible(unsigned long start_pfn,
> +		unsigned long end_pfn) { return true; }
>  #endif	/* CONFIG_INTEL_TDX_HOST */
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3f040c6e5d13..900341333d7e 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -55,6 +55,7 @@
>  #include <asm/uv/uv.h>
>  #include <asm/setup.h>
>  #include <asm/ftrace.h>
> +#include <asm/tdx.h>
>  
>  #include "mm_internal.h"
>  
> @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  
> +	/*
> +	 * For now if TDX is enabled, all pages in the page allocator
> +	 * must be TDX memory, which is a fixed set of memory regions
> +	 * that are passed to the TDX module.  Reject the new region
> +	 * if it is not TDX memory to guarantee above is true.
> +	 */
> +	if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> +		return -EINVAL;
> +
>  	init_memory_mapping(start, start + size, params->pgprot);
>  
>  	return add_pages(nid, start_pfn, nr_pages, params);
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 43227af25e44..32af86e31c47 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -16,6 +16,11 @@
>  #include <linux/smp.h>
>  #include <linux/atomic.h>
>  #include <linux/align.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/memblock.h>
> +#include <linux/minmax.h>
> +#include <linux/sizes.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/apic.h>
> @@ -34,6 +39,13 @@ enum tdx_module_status_t {
>  	TDX_MODULE_SHUTDOWN,
>  };
>  
> +struct tdx_memblock {
> +	struct list_head list;
> +	unsigned long start_pfn;

Why not use 'phys_addr_t'?
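
I.e., something like (just a sketch):

	struct tdx_memblock {
		struct list_head list;
		phys_addr_t start;
		phys_addr_t end;
		int nid;
	};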

> +	unsigned long end_pfn;
> +	int nid;
> +};
> +
>  static u32 tdx_keyid_start __ro_after_init;
>  static u32 tdx_keyid_num __ro_after_init;
>  
> @@ -46,6 +58,9 @@ static struct tdsysinfo_struct tdx_sysinfo;
>  static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
>  static int tdx_cmr_num;
>  
> +/* All TDX-usable memory regions */
> +static LIST_HEAD(tdx_memlist);
> +
>  /*
>   * Detect TDX private KeyIDs to see whether TDX has been enabled by the
>   * BIOS.  Both initializing the TDX module and running TDX guest require
> @@ -329,6 +344,107 @@ static int tdx_get_sysinfo(void)
>  	return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
>  }
>  
> +/* Check whether the given pfn range is covered by any CMR or not. */
> +static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
> +				     unsigned long end_pfn)
> +{
> +	int i;
> +
> +	for (i = 0; i < tdx_cmr_num; i++) {
> +		struct cmr_info *cmr = &tdx_cmr_array[i];
> +		unsigned long cmr_start_pfn;
> +		unsigned long cmr_end_pfn;
> +
> +		cmr_start_pfn = cmr->base >> PAGE_SHIFT;
> +		cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;

Why not use PHYS_PFN() or PFN_DOWN()?  And PFN_PHYS() in reverse direction?
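
E.g., a sketch of the suggested conversion:

		cmr_start_pfn = PHYS_PFN(cmr->base);
		cmr_end_pfn = PHYS_PFN(cmr->base + cmr->size);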

> +
> +		if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +/*
> + * Add a memory region on a given node as a TDX memory block.  The caller
> + * to make sure all memory regions are added in address ascending order
> + * and don't overlap.
> + */
> +static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
> +			    int nid)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> +	if (!tmb)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&tmb->list);
> +	tmb->start_pfn = start_pfn;
> +	tmb->end_pfn = end_pfn;
> +	tmb->nid = nid;
> +
> +	list_add_tail(&tmb->list, &tdx_memlist);
> +	return 0;
> +}
> +
> +static void free_tdx_memory(void)
> +{
> +	while (!list_empty(&tdx_memlist)) {
> +		struct tdx_memblock *tmb = list_first_entry(&tdx_memlist,
> +				struct tdx_memblock, list);
> +
> +		list_del(&tmb->list);
> +		kfree(tmb);
> +	}
> +}
> +
> +/*
> + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
> + * Must be called when get_online_mems() is called by the caller.
> + */
> +static int build_tdx_memory(void)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int i, nid, ret;
> +
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> +		/*
> +		 * The first 1MB may not be reported as TDX convertible
> +		 * memory.  Manually exclude them as TDX memory.
> +		 *
> +		 * This is fine as the first 1MB is already reserved in
> +		 * reserve_real_mode() and won't end up to ZONE_DMA as
> +		 * free page anyway.
> +		 */
> +		start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> +		if (start_pfn >= end_pfn)
> +			continue;

How about checking whether the first 1MB is reserved instead of depending on
the corresponding code not being changed?  Via for_each_reserved_mem_range()?
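
E.g., something like the below (an untested sketch; it assumes the first
1MB, if reserved, is covered by a single reserved range):

	u64 i;
	phys_addr_t start, end;
	bool first_1mb_reserved = false;

	for_each_reserved_mem_range(i, &start, &end) {
		if (start == 0 && end >= SZ_1M) {
			first_1mb_reserved = true;
			break;
		}
	}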

> +
> +		/* Verify memory is truly TDX convertible memory */
> +		if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
> +			pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n",
> +					start_pfn << PAGE_SHIFT,
> +					end_pfn << PAGE_SHIFT);
> +			return -EINVAL;
> +		}
> +
> +		/*
> +		 * Add the memory regions as TDX memory.  The regions in
> +		 * memblock are already guaranteed to be in address
> +		 * ascending order and don't overlap.
> +		 */
> +		ret = add_tdx_memblock(start_pfn, end_pfn, nid);
> +		if (ret)
> +			goto err;
> +	}
> +
> +	return 0;
> +err:
> +	free_tdx_memory();
> +	return ret;
> +}
> +
>  /*
>   * Detect and initialize the TDX module.
>   *
> @@ -357,12 +473,56 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out;
>  
> +	/*
> +	 * All memory regions that can be used by the TDX module must be
> +	 * passed to the TDX module during the module initialization.
> +	 * Once this is done, all "TDX-usable" memory regions are fixed
> +	 * during module's runtime.
> +	 *
> +	 * The initial support of TDX guests only allocates memory from
> +	 * the global page allocator.  To keep things simple, for now
> +	 * just make sure all pages in the page allocator are TDX memory.
> +	 *
> +	 * To achieve this, use all system memory in the core-mm at the
> +	 * time of initializing the TDX module as TDX memory, and at the
> +	 * meantime, reject any new memory in memory hot-add.
> +	 *
> +	 * This works as in practice, all boot-time present DIMM is TDX
> +	 * convertible memory.  However if any new memory is hot-added
> +	 * before initializing the TDX module, the initialization will
> +	 * fail due to that memory is not covered by CMR.
> +	 *
> +	 * This can be enhanced in the future, i.e. by allowing adding or
> +	 * onlining non-TDX memory to a separate node, in which case the
> +	 * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
> +	 * together -- the userspace/kernel just needs to make sure pages
> +	 * for TDX guests must come from those "TDX-capable" nodes.
> +	 *
> +	 * Build the list of TDX memory regions as mentioned above so
> +	 * they can be passed to the TDX module later.
> +	 */
> +	get_online_mems();
> +
> +	ret = build_tdx_memory();
> +	if (ret)
> +		goto out;
>  	/*
>  	 * Return -EINVAL until all steps of TDX module initialization
>  	 * process are done.
>  	 */
>  	ret = -EINVAL;
>  out:
> +	/*
> +	 * Memory hotplug checks the hot-added memory region against the
> +	 * @tdx_memlist to see if the region is TDX memory.
> +	 *
> +	 * Do put_online_mems() here to make sure any modification to
> +	 * @tdx_memlist is done while holding the memory hotplug read
> +	 * lock, so that the memory hotplug path can just check the
> +	 * @tdx_memlist w/o holding the @tdx_module_lock which may cause
> +	 * deadlock.
> +	 */
> +	put_online_mems();
>  	return ret;
>  }
>  
> @@ -485,3 +645,26 @@ int tdx_enable(void)
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(tdx_enable);
> +
> +/*
> + * Check whether the given range is TDX memory.  Must be called between
> + * mem_hotplug_begin()/mem_hotplug_done().

Must be called with mem_hotplug_lock write-locked.

> + */
> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	/* Empty list means TDX isn't enabled successfully */
> +	if (list_empty(&tdx_memlist))
> +		return true;
> +
> +	list_for_each_entry(tmb, &tdx_memlist, list) {
> +		/*
> +		 * The new range is TDX memory if it is fully covered
> +		 * by any TDX memory block.
> +		 */
> +		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> +			return true;
> +	}
> +	return false;
> +}

Best Regards,
Huang, Ying
  
Kai Huang Nov. 21, 2022, 9:09 a.m. UTC | #2
> >  
> > +struct tdx_memblock {
> > +	struct list_head list;
> > +	unsigned long start_pfn;
> 
> Why not use 'phys_addr_t'?

TDX memory must be page aligned, so start_pfn/end_pfn would be easier.

> 
> > +	unsigned long end_pfn;
> > +	int nid;
> > +};
> > +
> > 

[...]

> >  
> > +/* Check whether the given pfn range is covered by any CMR or not. */
> > +static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
> > +				     unsigned long end_pfn)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < tdx_cmr_num; i++) {
> > +		struct cmr_info *cmr = &tdx_cmr_array[i];
> > +		unsigned long cmr_start_pfn;
> > +		unsigned long cmr_end_pfn;
> > +
> > +		cmr_start_pfn = cmr->base >> PAGE_SHIFT;
> > +		cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
> 
> Why not use PHYS_PFN() or PFN_DOWN()?  And PFN_PHYS() in reverse direction?

Didn't know them.  Will use them.


[...]

> > +
> > +/*
> > + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
> > + * Must be called when get_online_mems() is called by the caller.
> > + */
> > +static int build_tdx_memory(void)
> > +{
> > +	unsigned long start_pfn, end_pfn;
> > +	int i, nid, ret;
> > +
> > +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> > +		/*
> > +		 * The first 1MB may not be reported as TDX convertible
> > +		 * memory.  Manually exclude them as TDX memory.
> > +		 *
> > +		 * This is fine as the first 1MB is already reserved in
> > +		 * reserve_real_mode() and won't end up to ZONE_DMA as
> > +		 * free page anyway.
> > +		 */
> > +		start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> > +		if (start_pfn >= end_pfn)
> > +			continue;
> 
> How about checking whether the first 1MB is reserved instead of depending on
> the corresponding code not being changed?  Via for_each_reserved_mem_range()?

IIUC, some reserved memory can be freed to the page allocator directly, e.g.
kernel init code/data.  I feel it's not safe to just assume reserved memory
will never be in the page allocator.  Otherwise, we could simply use
for_each_free_mem_range().
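
For reference, that would look something like the below rough sketch, which
only walks truly free memblock ranges and thus would miss reserved pages that
are later freed to the page allocator:

	u64 i;
	phys_addr_t start, end;

	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
				&start, &end, NULL) {
		/* only ranges that are free (not reserved) show up here */
	}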

[...]

> 
> > +/*
> > + * Check whether the given range is TDX memory.  Must be called between
> > + * mem_hotplug_begin()/mem_hotplug_done().
> 
> Must be called with mem_hotplug_lock write-locked.
> 

Will do.
  
Huang, Ying Nov. 22, 2022, 1:54 a.m. UTC | #3
"Huang, Kai" <kai.huang@intel.com> writes:

[...]

>
>> > +
>> > +/*
>> > + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
>> > + * Must be called when get_online_mems() is called by the caller.
>> > + */
>> > +static int build_tdx_memory(void)
>> > +{
>> > +   unsigned long start_pfn, end_pfn;
>> > +   int i, nid, ret;
>> > +
>> > +   for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
>> > +           /*
>> > +            * The first 1MB may not be reported as TDX convertible
>> > +            * memory.  Manually exclude them as TDX memory.
>> > +            *
>> > +            * This is fine as the first 1MB is already reserved in
>> > +            * reserve_real_mode() and won't end up to ZONE_DMA as
>> > +            * free page anyway.
>> > +            */
>> > +           start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
>> > +           if (start_pfn >= end_pfn)
>> > +                   continue;
>>
>> How about checking whether the first 1MB is reserved instead of depending on
>> the corresponding code not being changed?  Via for_each_reserved_mem_range()?
>
> IIUC, some reserved memory can be freed to the page allocator directly, e.g.
> kernel init code/data.  I feel it's not safe to just assume reserved memory
> will never be in the page allocator.  Otherwise, we could simply use
> for_each_free_mem_range().

Yes.  memblock reserve information isn't perfect.  But I still think
that checking whether the first 1MB is reserved in memblock is better
than just assuming it.  Or, we can check whether the pages of the
first 1MB are reserved via checking struct page directly?
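
E.g., a rough sketch of such a check (assuming it runs after the memmap has
been initialized):

	unsigned long pfn;

	for (pfn = 0; pfn < PHYS_PFN(SZ_1M); pfn++)
		WARN_ON_ONCE(pfn_valid(pfn) &&
			     !PageReserved(pfn_to_page(pfn)));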

[...]

Best Regards,
Huang, Ying
  
Kai Huang Nov. 22, 2022, 9:16 a.m. UTC | #4
> > > > +/*
> > > > + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
> > > > + * Must be called when get_online_mems() is called by the caller.
> > > > + */
> > > > +static int build_tdx_memory(void)
> > > > +{
> > > > +   unsigned long start_pfn, end_pfn;
> > > > +   int i, nid, ret;
> > > > +
> > > > +   for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> > > > +           /*
> > > > +            * The first 1MB may not be reported as TDX convertible
> > > > +            * memory.  Manually exclude them as TDX memory.
> > > > +            *
> > > > +            * This is fine as the first 1MB is already reserved in
> > > > +            * reserve_real_mode() and won't end up to ZONE_DMA as
> > > > +            * free page anyway.
> > > > +            */
> > > > +           start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> > > > +           if (start_pfn >= end_pfn)
> > > > +                   continue;
> > > 
> > > How about checking whether the first 1MB is reserved instead of depending on
> > > the corresponding code not being changed?  Via for_each_reserved_mem_range()?
> > 
> > IIUC, some reserved memory can be freed to the page allocator directly, e.g.
> > kernel init code/data.  I feel it's not safe to just assume reserved memory
> > will never be in the page allocator.  Otherwise, we could simply use
> > for_each_free_mem_range().
> 
> Yes.  memblock reserve information isn't perfect.  But I still think
> that checking whether the first 1MB is reserved in memblock is better
> than just assuming it.  Or, we can check whether the pages of the
> first 1MB are reserved via checking struct page directly?
> 

Sorry, I am a little bit confused about what you want to achieve here.  Do you
want to add some sanity check to make sure the first 1MB is indeed not in the
page allocator?

IIUC, it is indeed true.  Please see the comment at the call to
reserve_real_mode() in setup_arch().  Also please see efi_free_boot_services(),
which doesn't free the boot services memory if it is below 1MB.

Also, my understanding is the kernel's intention is to always reserve the
first 1MB:

                /*
                 * Don't free memory under 1M for two reasons:                 
                 * - BIOS might clobber it                                     
                 * - Crash kernel needs it to be reserved                      
                 */        

So if any page in the first 1MB ended up in the page allocator, it would be a
kernel bug not related to TDX, correct?
  
Peter Zijlstra Nov. 22, 2022, 10:10 a.m. UTC | #5
On Mon, Nov 21, 2022 at 01:26:32PM +1300, Kai Huang wrote:

> +static int build_tdx_memory(void)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int i, nid, ret;
> +
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> +		/*
> +		 * The first 1MB may not be reported as TDX convertible
> +		 * memory.  Manually exclude them as TDX memory.
> +		 *
> +		 * This is fine as the first 1MB is already reserved in
> +		 * reserve_real_mode() and won't end up to ZONE_DMA as
> +		 * free page anyway.
> +		 */
> +		start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> +		if (start_pfn >= end_pfn)
> +			continue;
> +
> +		/* Verify memory is truly TDX convertible memory */
> +		if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
> +			pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n",
> +					start_pfn << PAGE_SHIFT,
> +					end_pfn << PAGE_SHIFT);
> +			return -EINVAL;

Given how tdx_cc_memory_compatible() below relies on tdx_memlist being
empty, this error path is wrong and should goto err.
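
I.e.:

		if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
			pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n",
					start_pfn << PAGE_SHIFT,
					end_pfn << PAGE_SHIFT);
			ret = -EINVAL;
			goto err;
		}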

> +		}
> +
> +		/*
> +		 * Add the memory regions as TDX memory.  The regions in
> +		 * memblock are already guaranteed to be in address
> +		 * ascending order and don't overlap.
> +		 */
> +		ret = add_tdx_memblock(start_pfn, end_pfn, nid);
> +		if (ret)
> +			goto err;
> +	}
> +
> +	return 0;
> +err:
> +	free_tdx_memory();
> +	return ret;
> +}

> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	/* Empty list means TDX isn't enabled successfully */
> +	if (list_empty(&tdx_memlist))
> +		return true;
> +
> +	list_for_each_entry(tmb, &tdx_memlist, list) {
> +		/*
> +		 * The new range is TDX memory if it is fully covered
> +		 * by any TDX memory block.
> +		 */
> +		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> +			return true;
> +	}
> +	return false;
> +}
  
Kai Huang Nov. 22, 2022, 11:40 a.m. UTC | #6
On Tue, 2022-11-22 at 11:10 +0100, Peter Zijlstra wrote:
> On Mon, Nov 21, 2022 at 01:26:32PM +1300, Kai Huang wrote:
> 
> > +static int build_tdx_memory(void)
> > +{
> > +	unsigned long start_pfn, end_pfn;
> > +	int i, nid, ret;
> > +
> > +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> > +		/*
> > +		 * The first 1MB may not be reported as TDX convertible
> > +		 * memory.  Manually exclude them as TDX memory.
> > +		 *
> > +		 * This is fine as the first 1MB is already reserved in
> > +		 * reserve_real_mode() and won't end up to ZONE_DMA as
> > +		 * free page anyway.
> > +		 */
> > +		start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> > +		if (start_pfn >= end_pfn)
> > +			continue;
> > +
> > +		/* Verify memory is truly TDX convertible memory */
> > +		if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
> > +			pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n",
> > +					start_pfn << PAGE_SHIFT,
> > +					end_pfn << PAGE_SHIFT);
> > +			return -EINVAL;
> 
> Given how tdx_cc_memory_compatible() below relies on tdx_memlist being
> empty, this error path is wrong and should goto err.

Oops.  Thanks for catching.

Also thanks for the review!  It's getting late here today, so I'll catch up
with the others tomorrow.

> 
> > +		}
> > +
> > +		/*
> > +		 * Add the memory regions as TDX memory.  The regions in
> > +		 * memblock has already guaranteed they are in address
> > +		 * ascending order and don't overlap.
> > +		 */
> > +		ret = add_tdx_memblock(start_pfn, end_pfn, nid);
> > +		if (ret)
> > +			goto err;
> > +	}
> > +
> > +	return 0;
> > +err:
> > +	free_tdx_memory();
> > +	return ret;
> > +}
> 
> > +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > +	struct tdx_memblock *tmb;
> > +
> > +	/* Empty list means TDX isn't enabled successfully */
> > +	if (list_empty(&tdx_memlist))
> > +		return true;
> > +
> > +	list_for_each_entry(tmb, &tdx_memlist, list) {
> > +		/*
> > +		 * The new range is TDX memory if it is fully covered
> > +		 * by any TDX memory block.
> > +		 */
> > +		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> > +			return true;
> > +	}
> > +	return false;
> > +}
>
  
Dave Hansen Nov. 23, 2022, 12:21 a.m. UTC | #7
On 11/20/22 16:26, Kai Huang wrote:
> TDX reports a list of "Convertible Memory Regions" (CMRs) to indicate all
> memory regions that can possibly be used by the TDX module, but they are
> not automatically usable by the TDX module.  As a step of initializing
> the TDX module, the kernel needs to choose a list of memory regions (out
> of the convertible memory regions) that the TDX module can use and pass
> those regions to the TDX module.  Once this is done, those "TDX-usable"
> memory regions are fixed for the module's lifetime.  No more TDX-usable
> memory can be added to the TDX module after that.
> 
> The initial support of TDX guests will only allocate TDX guest memory
> from the global page allocator.  To keep things simple, this initial
> implementation simply guarantees all pages in the page allocator are TDX
> memory.  To achieve this, use all system memory in the core-mm at the
> time of initializing the TDX module as TDX memory, and in the meantime,
> refuse to add any non-TDX memory via memory hotplug.
> 
> Specifically, walk through all memory regions managed by memblock and
> add them to a global list of "TDX-usable" memory regions, which is a
> fixed list after the module initialization (or empty if initialization
> fails).  To reject non-TDX memory in memory hotplug, add a check in
> arch_add_memory() to verify whether the new region is covered by any
> region in the "TDX-usable" memory region list.
> 
> Note this requires all memory regions in memblock to be TDX convertible
> memory when initializing the TDX module.  This is true in practice if no
> new memory has been hot-added before initializing the TDX module, since
> in practice all boot-time present DIMMs are TDX convertible memory.  If
> any new memory has been hot-added, then initializing the TDX module will
> fail because that memory region is not covered by any CMR.
> 
> This can be enhanced in the future, e.g. by allowing non-TDX memory to
> be added to a separate NUMA node.  In this case, the "TDX-capable" nodes
> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> needs to guarantee memory pages for TDX guests are always allocated from
> the "TDX-capable" nodes.
> 
> Note TDX assumes convertible memory is always physically present during
> the machine's runtime.  A non-buggy BIOS should never support
> hot-removal of any convertible memory.  This implementation doesn't
> handle ACPI memory removal but depends on the BIOS to behave correctly.

My eyes glazed over about halfway through that.  Can you try to trim it
down a bit, or at least try to summarize it better up front?

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index dd333b46fafb..b36129183035 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
>  	depends on X86_64
>  	depends on KVM_INTEL
>  	depends on X86_X2APIC
> +	select ARCH_KEEP_MEMBLOCK
>  	help
>  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>  	  host and certain physical attacks.  This option enables necessary TDX
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index d688228f3151..71169ecefabf 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -111,9 +111,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>  #ifdef CONFIG_INTEL_TDX_HOST
>  bool platform_tdx_enabled(void);
>  int tdx_enable(void);
> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn);
>  #else	/* !CONFIG_INTEL_TDX_HOST */
>  static inline bool platform_tdx_enabled(void) { return false; }
>  static inline int tdx_enable(void)  { return -ENODEV; }
> +static inline bool tdx_cc_memory_compatible(unsigned long start_pfn,
> +		unsigned long end_pfn) { return true; }
>  #endif	/* CONFIG_INTEL_TDX_HOST */
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3f040c6e5d13..900341333d7e 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -55,6 +55,7 @@
>  #include <asm/uv/uv.h>
>  #include <asm/setup.h>
>  #include <asm/ftrace.h>
> +#include <asm/tdx.h>
>  
>  #include "mm_internal.h"
>  
> @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  
> +	/*
> +	 * For now if TDX is enabled, all pages in the page allocator

s/For now//

> +	 * must be TDX memory, which is a fixed set of memory regions
> +	 * that are passed to the TDX module.  Reject the new region
> +	 * if it is not TDX memory to guarantee above is true.
> +	 */
> +	if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> +		return -EINVAL;

There's a real art to making a right-size comment.  I don't think this
needs to be any more than:

	/*
	 * Not all memory is compatible with TDX.  Reject
	 * the addition of any incompatible memory.
	 */

If you want to write a treatise, do it in Documentation or at the
tdx_cc_memory_compatible() definition.

>  	init_memory_mapping(start, start + size, params->pgprot);
>  
>  	return add_pages(nid, start_pfn, nr_pages, params);
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 43227af25e44..32af86e31c47 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -16,6 +16,11 @@
>  #include <linux/smp.h>
>  #include <linux/atomic.h>
>  #include <linux/align.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/memblock.h>
> +#include <linux/minmax.h>
> +#include <linux/sizes.h>
>  #include <asm/msr-index.h>
>  #include <asm/msr.h>
>  #include <asm/apic.h>
> @@ -34,6 +39,13 @@ enum tdx_module_status_t {
>  	TDX_MODULE_SHUTDOWN,
>  };
>  
> +struct tdx_memblock {
> +	struct list_head list;
> +	unsigned long start_pfn;
> +	unsigned long end_pfn;
> +	int nid;
> +};

Why does the nid matter?

>  static u32 tdx_keyid_start __ro_after_init;
>  static u32 tdx_keyid_num __ro_after_init;
>  
> @@ -46,6 +58,9 @@ static struct tdsysinfo_struct tdx_sysinfo;
>  static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
>  static int tdx_cmr_num;
>  
> +/* All TDX-usable memory regions */
> +static LIST_HEAD(tdx_memlist);
> +
>  /*
>   * Detect TDX private KeyIDs to see whether TDX has been enabled by the
>   * BIOS.  Both initializing the TDX module and running TDX guest require
> @@ -329,6 +344,107 @@ static int tdx_get_sysinfo(void)
>  	return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
>  }
>  
> +/* Check whether the given pfn range is covered by any CMR or not. */
> +static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
> +				     unsigned long end_pfn)
> +{
> +	int i;
> +
> +	for (i = 0; i < tdx_cmr_num; i++) {
> +		struct cmr_info *cmr = &tdx_cmr_array[i];
> +		unsigned long cmr_start_pfn;
> +		unsigned long cmr_end_pfn;
> +
> +		cmr_start_pfn = cmr->base >> PAGE_SHIFT;
> +		cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
> +
> +		if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
> +			return true;
> +	}

What if the pfn range overlaps two CMRs?  It will never pass any
individual overlap test and will return false.
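
One way to handle that (a sketch, assuming tdx_cmr_array is sorted by base
address) is to advance the start of the range through every CMR that covers
it, and succeed only if the whole range gets consumed:

	for (i = 0; i < tdx_cmr_num && start_pfn < end_pfn; i++) {
		struct cmr_info *cmr = &tdx_cmr_array[i];
		unsigned long cmr_start_pfn = PHYS_PFN(cmr->base);
		unsigned long cmr_end_pfn = PHYS_PFN(cmr->base + cmr->size);

		if (start_pfn >= cmr_start_pfn && start_pfn < cmr_end_pfn)
			start_pfn = cmr_end_pfn;
	}

	return start_pfn >= end_pfn;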

> +	return false;
> +}
> +
> +/*
> + * Add a memory region on a given node as a TDX memory block.  The caller
> + * to make sure all memory regions are added in address ascending order

s/to/must/

> + * and don't overlap.
> + */
> +static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
> +			    int nid)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> +	if (!tmb)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&tmb->list);
> +	tmb->start_pfn = start_pfn;
> +	tmb->end_pfn = end_pfn;
> +	tmb->nid = nid;
> +
> +	list_add_tail(&tmb->list, &tdx_memlist);
> +	return 0;
> +}
> +
> +static void free_tdx_memory(void)

This is named a bit too generically.  How about free_tdx_memlist() or
something?

> +{
> +	while (!list_empty(&tdx_memlist)) {
> +		struct tdx_memblock *tmb = list_first_entry(&tdx_memlist,
> +				struct tdx_memblock, list);
> +
> +		list_del(&tmb->list);
> +		kfree(tmb);
> +	}
> +}
> +
> +/*
> + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
> + * Must be called when get_online_mems() is called by the caller.
> + */

Again, this explains the "what", but not the "why".

/*
 * Ensure that all memblock memory regions are convertible to TDX
 * memory.  Once this has been established, stash the memblock
 * ranges off in a secondary structure because $REASONS.
 */

Which makes me wonder: Why do you even need a secondary structure here?
What's wrong with the memblocks themselves?

> +static int build_tdx_memory(void)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int i, nid, ret;
> +
> +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> +		/*
> +		 * The first 1MB may not be reported as TDX convertible
> +		 * memory.  Manually exclude them as TDX memory.

I don't like the "may not" here very much.

> +		 * This is fine as the first 1MB is already reserved in
> +		 * reserve_real_mode() and won't end up to ZONE_DMA as
> +		 * free page anyway.

			 ^ free pages

> +		 */

This is way too wishy-washy.  The TDX module may or may not...  Then, it
doesn't matter since reserve_real_mode() does it anyway...

Then it goes and adds code to skip it!

> +		start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> +		if (start_pfn >= end_pfn)
> +			continue;


Please just put a dang stake in the ground.  If the other code deals
with this, then explain *why* more is needed here.

> +		/* Verify memory is truly TDX convertible memory */
> +		if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
> +			pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n",
> +					start_pfn << PAGE_SHIFT,
> +					end_pfn << PAGE_SHIFT);
> +			return -EINVAL;

... no 'goto err'?  This leaks all the previous add_tdx_memblock()
structures, right?

> +		}
> +
> +		/*
> +		 * Add the memory regions as TDX memory.  The regions in
> +		 * memblock are already guaranteed to be in address
> +		 * ascending order and don't overlap.
> +		 */
> +		ret = add_tdx_memblock(start_pfn, end_pfn, nid);
> +		if (ret)
> +			goto err;
> +	}
> +
> +	return 0;
> +err:
> +	free_tdx_memory();
> +	return ret;
> +}
> +
>  /*
>   * Detect and initialize the TDX module.
>   *
> @@ -357,12 +473,56 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out;
>  
> +	/*
> +	 * All memory regions that can be used by the TDX module must be
> +	 * passed to the TDX module during the module initialization.
> +	 * Once this is done, all "TDX-usable" memory regions are fixed
> +	 * during module's runtime.
> +	 *
> +	 * The initial support of TDX guests only allocates memory from
> +	 * the global page allocator.  To keep things simple, for now
> +	 * just make sure all pages in the page allocator are TDX memory.
> +	 *
> +	 * To achieve this, use all system memory in the core-mm at the
> +	 * time of initializing the TDX module as TDX memory, and at the
> +	 * meantime, reject any new memory in memory hot-add.
> +	 *
> +	 * This works as in practice, all boot-time present DIMM is TDX
> +	 * convertible memory.  However if any new memory is hot-added
> +	 * before initializing the TDX module, the initialization will
> +	 * fail due to that memory is not covered by CMR.
> +	 *
> +	 * This can be enhanced in the future, i.e. by allowing adding or
> +	 * onlining non-TDX memory to a separate node, in which case the
> +	 * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
> +	 * together -- the userspace/kernel just needs to make sure pages
> +	 * for TDX guests must come from those "TDX-capable" nodes.
> +	 *
> +	 * Build the list of TDX memory regions as mentioned above so
> +	 * they can be passed to the TDX module later.
> +	 */

This is mostly Documentation/, not a code comment.  Please clean it up.

> +	get_online_mems();
> +
> +	ret = build_tdx_memory();
> +	if (ret)
> +		goto out;
>  	/*
>  	 * Return -EINVAL until all steps of TDX module initialization
>  	 * process are done.
>  	 */
>  	ret = -EINVAL;
>  out:
> +	/*
> +	 * Memory hotplug checks the hot-added memory region against the
> +	 * @tdx_memlist to see if the region is TDX memory.
> +	 *
> +	 * Do put_online_mems() here to make sure any modification to
> +	 * @tdx_memlist is done while holding the memory hotplug read
> +	 * lock, so that the memory hotplug path can just check the
> +	 * @tdx_memlist w/o holding the @tdx_module_lock which may cause
> +	 * deadlock.
> +	 */

I'm honestly not following any of that.

> +	put_online_mems();
>  	return ret;
>  }
>  
> @@ -485,3 +645,26 @@ int tdx_enable(void)
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(tdx_enable);
> +
> +/*
> + * Check whether the given range is TDX memory.  Must be called between
> + * mem_hotplug_begin()/mem_hotplug_done().
> + */
> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	struct tdx_memblock *tmb;
> +
> +	/* Empty list means TDX isn't enabled successfully */
> +	if (list_empty(&tdx_memlist))
> +		return true;
> +
> +	list_for_each_entry(tmb, &tdx_memlist, list) {
> +		/*
> +		 * The new range is TDX memory if it is fully covered
> +		 * by any TDX memory block.
> +		 */
> +		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> +			return true;

Same bug.  What if the start/end_pfn range is covered by more than one
tdx_memblock?
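
The same advance-the-start approach as for the CMR check would work here,
e.g. (sketch, relying on @tdx_memlist being sorted by address):

	list_for_each_entry(tmb, &tdx_memlist, list) {
		if (start_pfn >= tmb->start_pfn && start_pfn < tmb->end_pfn)
			start_pfn = tmb->end_pfn;
	}

	return start_pfn >= end_pfn;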

> +	}
> +	return false;
> +}
  
Peter Zijlstra Nov. 23, 2022, 9:29 a.m. UTC | #8
On Tue, Nov 22, 2022 at 04:21:38PM -0800, Dave Hansen wrote:
> > +	/*
> > +	 * All memory regions that can be used by the TDX module must be
> > +	 * passed to the TDX module during the module initialization.
> > +	 * Once this is done, all "TDX-usable" memory regions are fixed
> > +	 * during module's runtime.
> > +	 *
> > +	 * The initial support of TDX guests only allocates memory from
> > +	 * the global page allocator.  To keep things simple, for now
> > +	 * just make sure all pages in the page allocator are TDX memory.
> > +	 *
> > +	 * To achieve this, use all system memory in the core-mm at the
> > +	 * time of initializing the TDX module as TDX memory, and at the
> > +	 * meantime, reject any new memory in memory hot-add.
> > +	 *
> > +	 * This works as in practice, all boot-time present DIMM is TDX
> > +	 * convertible memory.  However if any new memory is hot-added
> > +	 * before initializing the TDX module, the initialization will
> > +	 * fail due to that memory is not covered by CMR.
> > +	 *
> > +	 * This can be enhanced in the future, i.e. by allowing adding or
> > +	 * onlining non-TDX memory to a separate node, in which case the
> > +	 * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
> > +	 * together -- the userspace/kernel just needs to make sure pages
> > +	 * for TDX guests must come from those "TDX-capable" nodes.
> > +	 *
> > +	 * Build the list of TDX memory regions as mentioned above so
> > +	 * they can be passed to the TDX module later.
> > +	 */
> 
> This is mostly Documentation/, not a code comment.  Please clean it up.

So personally, I *vastly* prefer code comments over this Documentation/
cesspit. Putting things in Documentation/ is a bit like an
old-folks-home, neatly out of the way to (bit)rot in peace.

And that whole .rst disease is making it unreadable for anybody that
still knows how to use a text editor :-(
  
Huang, Ying Nov. 24, 2022, 12:47 a.m. UTC | #9
"Huang, Kai" <kai.huang@intel.com> writes:

>> > > > +/*
>> > > > + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
>> > > > + * Must be called when get_online_mems() is called by the caller.
>> > > > + */
>> > > > +static int build_tdx_memory(void)
>> > > > +{
>> > > > +   unsigned long start_pfn, end_pfn;
>> > > > +   int i, nid, ret;
>> > > > +
>> > > > +   for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
>> > > > +           /*
>> > > > +            * The first 1MB may not be reported as TDX convertible
>> > > > +            * memory.  Manually exclude them as TDX memory.
>> > > > +            *
>> > > > +            * This is fine as the first 1MB is already reserved in
>> > > > +            * reserve_real_mode() and won't end up to ZONE_DMA as
>> > > > +            * free page anyway.
>> > > > +            */
>> > > > +           start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
>> > > > +           if (start_pfn >= end_pfn)
>> > > > +                   continue;
>> > >
>> > > How about checking whether the first 1MB is reserved instead of depending on
>> > > the corresponding code not being changed?  Via for_each_reserved_mem_range()?
>> >
>> > IIUC, some reserved memory can be freed to the page allocator directly, e.g.
>> > kernel init code/data.  I feel it's not safe to just assume reserved memory
>> > will never be in the page allocator.  Otherwise, we could simply use
>> > for_each_free_mem_range().
>>
>> Yes.  memblock reserve information isn't perfect.  But I still think
>> that checking whether the first 1MB is reserved in memblock is better
>> than just assuming it.  Or, we can check whether the pages of the
>> first 1MB are reserved via checking struct page directly?
>>
>
> Sorry, I am a little bit confused about what you want to achieve here.  Do you
> want to add some sanity check to make sure the first 1MB is indeed not in the
> page allocator?
>
> IIUC, it is indeed true.  Please see the comment at the call to
> reserve_real_mode() in setup_arch().  Also please see efi_free_boot_services(),
> which doesn't free the boot services memory if it is below 1MB.
>
> Also, my understanding is the kernel's intention is to always reserve the
> first 1MB:
>
>                 /*
>                  * Don't free memory under 1M for two reasons:
>                  * - BIOS might clobber it
>                  * - Crash kernel needs it to be reserved
>                  */
>
> So if any page in the first 1MB ended up in the page allocator, it would be a
> kernel bug not related to TDX, correct?

I suggest adding some code to verify this.  It's possible for the code
to be changed in the future (although the possibility is low), and TDX
may not be changed at the same time.  The verifying code here can catch
that, so we can make changes accordingly.

Best Regards,
Huang, Ying
  
Kai Huang Nov. 24, 2022, 1:04 a.m. UTC | #10
On Tue, 2022-11-22 at 16:21 -0800, Dave Hansen wrote:
> On 11/20/22 16:26, Kai Huang wrote:
> > TDX reports a list of "Convertible Memory Regions" (CMRs) to indicate all
> > memory regions that can possibly be used by the TDX module, but they are
> > not automatically usable by the TDX module.  As a step of initializing
> > the TDX module, the kernel needs to choose a list of memory regions (out
> > of the convertible memory regions) that the TDX module can use and pass
> > those regions to the TDX module.  Once this is done, those "TDX-usable"
> > memory regions are fixed for the module's lifetime.  No more TDX-usable
> > memory can be added to the TDX module after that.
> > 
> > The initial support of TDX guests will only allocate TDX guest memory
> > from the global page allocator.  To keep things simple, this initial
> > implementation simply guarantees all pages in the page allocator are TDX
> > memory.  To achieve this, use all system memory in the core-mm at the
> > time of initializing the TDX module as TDX memory, and in the meantime,
> > refuse to add any non-TDX memory via memory hotplug.
> > 
> > Specifically, walk through all memory regions managed by memblock and
> > add them to a global list of "TDX-usable" memory regions, which is a
> > fixed list after the module initialization (or empty if initialization
> > fails).  To reject non-TDX memory in memory hotplug, add a check in
> > arch_add_memory() to verify whether the new region is covered by any
> > region in the "TDX-usable" memory region list.
> > 
> > Note this requires all memory regions in memblock to be TDX convertible
> > memory when initializing the TDX module.  This is true in practice if no
> > new memory has been hot-added before initializing the TDX module, since
> > in practice all boot-time present DIMMs are TDX convertible memory.  If
> > any new memory has been hot-added, then initializing the TDX module will
> > fail because that memory region is not covered by any CMR.
> > 
> > This can be enhanced in the future, e.g. by allowing non-TDX memory to
> > be added to a separate NUMA node.  In this case, the "TDX-capable" nodes
> > and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> > needs to guarantee memory pages for TDX guests are always allocated from
> > the "TDX-capable" nodes.
> > 
> > Note TDX assumes convertible memory is always physically present during
> > the machine's runtime.  A non-buggy BIOS should never support
> > hot-removal of any convertible memory.  This implementation doesn't
> > handle ACPI memory removal but depends on the BIOS to behave correctly.
> 
> My eyes glazed over about halfway through that.  Can you try to trim it
> down a bit, or at least try to summarize it better up front?

Will do.

> 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index dd333b46fafb..b36129183035 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
> >  	depends on X86_64
> >  	depends on KVM_INTEL
> >  	depends on X86_X2APIC
> > +	select ARCH_KEEP_MEMBLOCK
> >  	help
> >  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> >  	  host and certain physical attacks.  This option enables necessary TDX
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index d688228f3151..71169ecefabf 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -111,9 +111,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> >  #ifdef CONFIG_INTEL_TDX_HOST
> >  bool platform_tdx_enabled(void);
> >  int tdx_enable(void);
> > +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn);
> >  #else	/* !CONFIG_INTEL_TDX_HOST */
> >  static inline bool platform_tdx_enabled(void) { return false; }
> >  static inline int tdx_enable(void)  { return -ENODEV; }
> > +static inline bool tdx_cc_memory_compatible(unsigned long start_pfn,
> > +		unsigned long end_pfn) { return true; }
> >  #endif	/* CONFIG_INTEL_TDX_HOST */
> >  
> >  #endif /* !__ASSEMBLY__ */
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index 3f040c6e5d13..900341333d7e 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -55,6 +55,7 @@
> >  #include <asm/uv/uv.h>
> >  #include <asm/setup.h>
> >  #include <asm/ftrace.h>
> > +#include <asm/tdx.h>
> >  
> >  #include "mm_internal.h"
> >  
> > @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >  	unsigned long start_pfn = start >> PAGE_SHIFT;
> >  	unsigned long nr_pages = size >> PAGE_SHIFT;
> >  
> > +	/*
> > +	 * For now if TDX is enabled, all pages in the page allocator
> 
> s/For now//

Will do.

> 
> > +	 * must be TDX memory, which is a fixed set of memory regions
> > +	 * that are passed to the TDX module.  Reject the new region
> > +	 * if it is not TDX memory to guarantee above is true.
> > +	 */
> > +	if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> > +		return -EINVAL;
> 
> There's a real art to making a right-size comment.  I don't think this
> needs to be any more than:
> 
> 	/*
> 	 * Not all memory is compatible with TDX.  Reject
> 	 * the addition of any incompatible memory.
> 	 */

Thanks.

> 
> If you want to write a treatise, do it in Documentation or at the
> tdx_cc_memory_compatible() definition.
> 
> >  	init_memory_mapping(start, start + size, params->pgprot);
> >  
> >  	return add_pages(nid, start_pfn, nr_pages, params);
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 43227af25e44..32af86e31c47 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -16,6 +16,11 @@
> >  #include <linux/smp.h>
> >  #include <linux/atomic.h>
> >  #include <linux/align.h>
> > +#include <linux/list.h>
> > +#include <linux/slab.h>
> > +#include <linux/memblock.h>
> > +#include <linux/minmax.h>
> > +#include <linux/sizes.h>
> >  #include <asm/msr-index.h>
> >  #include <asm/msr.h>
> >  #include <asm/apic.h>
> > @@ -34,6 +39,13 @@ enum tdx_module_status_t {
> >  	TDX_MODULE_SHUTDOWN,
> >  };
> >  
> > +struct tdx_memblock {
> > +	struct list_head list;
> > +	unsigned long start_pfn;
> > +	unsigned long end_pfn;
> > +	int nid;
> > +};
> 
> Why does the nid matter?

It is used to find the node for the PAMT allocation for a given TDMR.

> 
> >  static u32 tdx_keyid_start __ro_after_init;
> >  static u32 tdx_keyid_num __ro_after_init;
> >  
> > @@ -46,6 +58,9 @@ static struct tdsysinfo_struct tdx_sysinfo;
> >  static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
> >  static int tdx_cmr_num;
> >  
> > +/* All TDX-usable memory regions */
> > +static LIST_HEAD(tdx_memlist);
> > +
> >  /*
> >   * Detect TDX private KeyIDs to see whether TDX has been enabled by the
> >   * BIOS.  Both initializing the TDX module and running TDX guest require
> > @@ -329,6 +344,107 @@ static int tdx_get_sysinfo(void)
> >  	return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
> >  }
> >  
> > +/* Check whether the given pfn range is covered by any CMR or not. */
> > +static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
> > +				     unsigned long end_pfn)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < tdx_cmr_num; i++) {
> > +		struct cmr_info *cmr = &tdx_cmr_array[i];
> > +		unsigned long cmr_start_pfn;
> > +		unsigned long cmr_end_pfn;
> > +
> > +		cmr_start_pfn = cmr->base >> PAGE_SHIFT;
> > +		cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
> > +
> > +		if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
> > +			return true;
> > +	}
> 
> What if the pfn range overlaps two CMRs?  It will never pass any
> individual overlap test and will return false.

We can only return true if the two CMRs are contiguous.

I cannot think of a reason why a reasonable BIOS would generate contiguous
CMRs.  Perhaps one reason is two contiguous NUMA nodes?  For this case,
memblock has made sure no memory region can cross NUMA nodes, so the
start_pfn/end_pfn here should always be within one node.  Perhaps we can add
a comment for this case?

Anyway, I am not sure whether it is worth considering the "contiguous CMRs"
case.

> 
> > +	return false;
> > +}
> > +
> > +/*
> > + * Add a memory region on a given node as a TDX memory block.  The caller
> > + * to make sure all memory regions are added in address ascending order
> 
> s/to/must/

Thanks.

> 
> > + * and don't overlap.
> > + */
> > +static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
> > +			    int nid)
> > +{
> > +	struct tdx_memblock *tmb;
> > +
> > +	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> > +	if (!tmb)
> > +		return -ENOMEM;
> > +
> > +	INIT_LIST_HEAD(&tmb->list);
> > +	tmb->start_pfn = start_pfn;
> > +	tmb->end_pfn = end_pfn;
> > +	tmb->nid = nid;
> > +
> > +	list_add_tail(&tmb->list, &tdx_memlist);
> > +	return 0;
> > +}
> > +
> > +static void free_tdx_memory(void)
> 
> This is named a bit too generically.  How about free_tdx_memlist() or
> something?

Will use free_tdx_memlist().  Do you want to also change build_tdx_memory() to
build_tdx_memlist()?

> 
> > +{
> > +	while (!list_empty(&tdx_memlist)) {
> > +		struct tdx_memblock *tmb = list_first_entry(&tdx_memlist,
> > +				struct tdx_memblock, list);
> > +
> > +		list_del(&tmb->list);
> > +		kfree(tmb);
> > +	}
> > +}
> > +
> > +/*
> > + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
> > + * Must be called when get_online_mems() is called by the caller.
> > + */
> 
> Again, this explains the "what", but not the "why".
> 
> /*
>  * Ensure that all memblock memory regions are convertible to TDX
>  * memory.  Once this has been established, stash the memblock
>  * ranges off in a secondary structure because $REASONS.
>  */
> 
> Which makes me wonder: Why do you even need a secondary structure here?
> What's wrong with the memblocks themselves?

One reason is the new region has already been added to memblock before calling
arch_add_memory(), so we cannot compare the new region against memblock.

The other reason is memblock is updated during memory hotplug, so it really
isn't a set of *fixed* memory regions, which TDX requires.  Having TDX's own
tdx_memlist can support such a case: after module initialization, some memory
can be hot-removed and then hot-added again, because the hot-added range will
still be covered by @tdx_memlist.

> 
> > +static int build_tdx_memory(void)
> > +{
> > +	unsigned long start_pfn, end_pfn;
> > +	int i, nid, ret;
> > +
> > +	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> > +		/*
> > +		 * The first 1MB may not be reported as TDX convertible
> > +		 * memory.  Manually exclude them as TDX memory.
> 
> I don't like the "may not" here very much.
> 
> > +		 * This is fine as the first 1MB is already reserved in
> > +		 * reserve_real_mode() and won't end up to ZONE_DMA as
> > +		 * free page anyway.
> 
> 			 ^ free pages
> 
> > +		 */
> 
> This is way too wishy-washy.  The TDX module may or may not...  Then, it
> doesn't matter since reserve_real_mode() does it anyway...
> 
> Then it goes and adds code to skip it!
> 
> > +		start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
> > +		if (start_pfn >= end_pfn)
> > +			continue;
> 
> 
> Please just put a dang stake in the ground.  If the other code deals
> with this, then explain *why* more is needed here.

How about adding the below before the 'for_each_mem_pfn_range()' loop:

	/*
	 * Some reserved pages in memblock (i.e. kernel init code/data) are
	 * freed to the page allocator directly.  Use for_each_mem_pfn_range()
	 * instead of for_each_free_mem_range() to make sure all pages in the
	 * page allocator are covered as TDX memory.
	 */

It explains why to use for_each_mem_pfn_range().

And here, before skipping the first 1MB, we add below:

		/*
		 * The first 1MB is not reported as TDX covertible memory.
		 * Although the first 1MB is always reserved and won't end up
		 * to the page allocator, it is still in memblock's memory
		 * regions.  Skip them manually to exclude them as TDX memory.
		 */

> 
> > +		/* Verify memory is truly TDX convertible memory */
> > +		if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
> > +			pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n",
> > +					start_pfn << PAGE_SHIFT,
> > +					end_pfn << PAGE_SHIFT);
> > +			return -EINVAL;
> 
> ... no 'goto err'?  This leaks all the previous add_tdx_memblock()
> structures, right?

Right.  It's a leftover from the old code.  Will fix.

> 
> > +		}
> > +
> > +		/*
> > +		 * Add the memory regions as TDX memory.  The regions in
> > +		 * memblock has already guaranteed they are in address
> > +		 * ascending order and don't overlap.
> > +		 */
> > +		ret = add_tdx_memblock(start_pfn, end_pfn, nid);
> > +		if (ret)
> > +			goto err;
> > +	}
> > +
> > +	return 0;
> > +err:
> > +	free_tdx_memory();
> > +	return ret;
> > +}
> > +
> >  /*
> >   * Detect and initialize the TDX module.
> >   *
> > @@ -357,12 +473,56 @@ static int init_tdx_module(void)
> >  	if (ret)
> >  		goto out;
> >  
> > +	/*
> > +	 * All memory regions that can be used by the TDX module must be
> > +	 * passed to the TDX module during the module initialization.
> > +	 * Once this is done, all "TDX-usable" memory regions are fixed
> > +	 * during module's runtime.
> > +	 *
> > +	 * The initial support of TDX guests only allocates memory from
> > +	 * the global page allocator.  To keep things simple, for now
> > +	 * just make sure all pages in the page allocator are TDX memory.
> > +	 *
> > +	 * To achieve this, use all system memory in the core-mm at the
> > +	 * time of initializing the TDX module as TDX memory, and at the
> > +	 * meantime, reject any new memory in memory hot-add.
> > +	 *
> > +	 * This works as in practice, all boot-time present DIMM is TDX
> > +	 * convertible memory.  However if any new memory is hot-added
> > +	 * before initializing the TDX module, the initialization will
> > +	 * fail due to that memory is not covered by CMR.
> > +	 *
> > +	 * This can be enhanced in the future, i.e. by allowing adding or
> > +	 * onlining non-TDX memory to a separate node, in which case the
> > +	 * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
> > +	 * together -- the userspace/kernel just needs to make sure pages
> > +	 * for TDX guests must come from those "TDX-capable" nodes.
> > +	 *
> > +	 * Build the list of TDX memory regions as mentioned above so
> > +	 * they can be passed to the TDX module later.
> > +	 */
> 
> This is mostly Documentation/, not a code comment.  Please clean it up.

Will try to clean up.

> 
> > +	get_online_mems();
> > +
> > +	ret = build_tdx_memory();
> > +	if (ret)
> > +		goto out;
> >  	/*
> >  	 * Return -EINVAL until all steps of TDX module initialization
> >  	 * process are done.
> >  	 */
> >  	ret = -EINVAL;
> >  out:
> > +	/*
> > +	 * Memory hotplug checks the hot-added memory region against the
> > +	 * @tdx_memlist to see if the region is TDX memory.
> > +	 *
> > +	 * Do put_online_mems() here to make sure any modification to
> > +	 * @tdx_memlist is done while holding the memory hotplug read
> > +	 * lock, so that the memory hotplug path can just check the
> > +	 * @tdx_memlist w/o holding the @tdx_module_lock which may cause
> > +	 * deadlock.
> > +	 */
> 
> I'm honestly not following any of that.

How about:

	/*
	 * Make sure tdx_cc_memory_compatible() either sees a fixed set of
	 * memory regions in @tdx_memlist, or an empty list.
	 */
	
> 
> > +	put_online_mems();
> >  	return ret;
> >  }
> >  
> > @@ -485,3 +645,26 @@ int tdx_enable(void)
> >  	return ret;
> >  }
> >  EXPORT_SYMBOL_GPL(tdx_enable);
> > +
> > +/*
> > + * Check whether the given range is TDX memory.  Must be called between
> > + * mem_hotplug_begin()/mem_hotplug_done().
> > + */
> > +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > +	struct tdx_memblock *tmb;
> > +
> > +	/* Empty list means TDX isn't enabled successfully */
> > +	if (list_empty(&tdx_memlist))
> > +		return true;
> > +
> > +	list_for_each_entry(tmb, &tdx_memlist, list) {
> > +		/*
> > +		 * The new range is TDX memory if it is fully covered
> > +		 * by any TDX memory block.
> > +		 */
> > +		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> > +			return true;
> 
> Same bug.  What if the start/end_pfn range is covered by more than one
> tdx_memblock?

We may want to return true if tdx_memblocks are contiguous.

However I don't think this will happen?

tdx_memblock is from memblock, and when two memory regions in memblock are
contiguous, they must have a different node or different flags.

My understanding is the hot-added memory region here cannot cross NUMA nodes,
nor have different flags, correct?

> 
> > +	}
> > +	return false;
> > +}
>
  
Dave Hansen Nov. 24, 2022, 1:22 a.m. UTC | #11
On 11/23/22 17:04, Huang, Kai wrote:
> On Tue, 2022-11-22 at 16:21 -0800, Dave Hansen wrote:
>>> +struct tdx_memblock {
>>> +   struct list_head list;
>>> +   unsigned long start_pfn;
>>> +   unsigned long end_pfn;
>>> +   int nid;
>>> +};
>>
>> Why does the nid matter?
> 
> It is used to find the node for the PAMT allocation for a given TDMR.

... which is in this patch?

You can't just plop unused and unmentioned nuggets in the code.  Remove
it until it is needed.


>>> +/* Check whether the given pfn range is covered by any CMR or not. */
>>> +static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
>>> +                                unsigned long end_pfn)
>>> +{
>>> +   int i;
>>> +
>>> +   for (i = 0; i < tdx_cmr_num; i++) {
>>> +           struct cmr_info *cmr = &tdx_cmr_array[i];
>>> +           unsigned long cmr_start_pfn;
>>> +           unsigned long cmr_end_pfn;
>>> +
>>> +           cmr_start_pfn = cmr->base >> PAGE_SHIFT;
>>> +           cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
>>> +
>>> +           if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
>>> +                   return true;
>>> +   }
>>
>> What if the pfn range overlaps two CMRs?  It will never pass any
>> individual overlap test and will return false.
> 
> We can only return true if the two CMRs are contiguous.
> 
> I cannot think of a reason why a reasonable BIOS would generate contiguous
> CMRs.

Because it can?

We don't just try and randomly assign what we think is reasonable or
not.  First and foremost, we need to ask whether the configuration in
question is allowed by the spec.

Would it be a *valid* thing to have two adjacent CMRs?  Does the TDX
module spec disallow it?

> Perhaps one reason is two contiguous NUMA nodes?  For this case, memblock
> has made sure no memory region could cross NUMA nodes, so the start_pfn/end_pfn
> here should always be within one node.  Perhaps we can add a comment for this
> case?

<cough> numa=off <cough>

> Anyway I am not sure whether it is worth considering the "contiguous CMRs" case.

I am sure.  You need to consider it.

>>> + * and don't overlap.
>>> + */
>>> +static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
>>> +                       int nid)
>>> +{
>>> +   struct tdx_memblock *tmb;
>>> +
>>> +   tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
>>> +   if (!tmb)
>>> +           return -ENOMEM;
>>> +
>>> +   INIT_LIST_HEAD(&tmb->list);
>>> +   tmb->start_pfn = start_pfn;
>>> +   tmb->end_pfn = end_pfn;
>>> +   tmb->nid = nid;
>>> +
>>> +   list_add_tail(&tmb->list, &tdx_memlist);
>>> +   return 0;
>>> +}
>>> +
>>> +static void free_tdx_memory(void)
>>
>> This is named a bit too generically.  How about free_tdx_memlist() or
>> something?
> 
> Will use free_tdx_memlist().  Do you want to also change build_tdx_memory() to
> build_tdx_memlist()?

Does it build a memlist?

>>> +{
>>> +   while (!list_empty(&tdx_memlist)) {
>>> +           struct tdx_memblock *tmb = list_first_entry(&tdx_memlist,
>>> +                           struct tdx_memblock, list);
>>> +
>>> +           list_del(&tmb->list);
>>> +           kfree(tmb);
>>> +   }
>>> +}
>>> +
>>> +/*
>>> + * Add all memblock memory regions to the @tdx_memlist as TDX memory.
>>> + * Must be called when get_online_mems() is called by the caller.
>>> + */
>>
>> Again, this explains the "what", but not the "why".
>>
>> /*
>>  * Ensure that all memblock memory regions are convertible to TDX
>>  * memory.  Once this has been established, stash the memblock
>>  * ranges off in a secondary structure because $REASONS.
>>  */
>>
>> Which makes me wonder: Why do you even need a secondary structure here?
>> What's wrong with the memblocks themselves?
> 
> One reason is the new region has already been added to memblock before calling
> arch_add_memory(), so we cannot compare the new region against memblock.
> 
> The other reason is that memblock is updated during memory hotplug, so it really
> isn't a set of *fixed* memory regions, which TDX requires.  Having TDX's own
> tdx_memlist can support such a case: after module initialization, some memory
> can be hot-removed and then hot-added again, because the hot-added range will be
> covered by @tdx_memlist.

OK, that's fair enough.  Memblocks change and we don't lock out changes
to memblocks.  But, the TDX setup is based on memblocks at one point in
time, so we effectively need to know what their state was at that point
in time.

>>> +static int build_tdx_memory(void)
>>> +{
>>> +   unsigned long start_pfn, end_pfn;
>>> +   int i, nid, ret;
>>> +
>>> +   for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
>>> +           /*
>>> +            * The first 1MB may not be reported as TDX convertible
>>> +            * memory.  Manually exclude them as TDX memory.
>>
>> I don't like the "may not" here very much.
>>
>>> +            * This is fine as the first 1MB is already reserved in
>>> +            * reserve_real_mode() and won't end up to ZONE_DMA as
>>> +            * free page anyway.
>>
>>                        ^ free pages
>>
>>> +            */
>>
>> This is way too wishy washy.  The TDX module may or may not...  Then, it
>> doesn't matter since reserve_real_mode() does it anyway...
>>
>> Then it goes and adds code to skip it!
>>
>>> +           start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
>>> +           if (start_pfn >= end_pfn)
>>> +                   continue;
>>
>>
>> Please just put a dang stake in the ground.  If the other code deals
>> with this, then explain *why* more is needed here.
> 
> How about adding below before the 'for_each_mem_pfn_range()' loop:
> 
>         /*
>          * Some reserved pages in memblock (i.e. kernel init code/data) are
>          * freed to the page allocator directly.  Use for_each_mem_pfn_range()
>          * instead of for_each_free_mem_range() to make sure all pages in the
>          * page allocator are covered as TDX memory.
>          */
> 
> It explains why to use for_each_mem_pfn_range().

I actually wasn't asking about the for_each_mem_pfn_range() use.

> And here, before skipping the first 1MB, we add below:
> 
>                 /*
>                  * The first 1MB is not reported as TDX covertible memory.
>                  * Although the first 1MB is always reserved and won't end up
>                  * to the page allocator, it is still in memblock's memory
>                  * regions.  Skip them manually to exclude them as TDX memory.
>                  */

That looks OK, with the spelling fixed.

>>> +           /* Verify memory is truly TDX convertible memory */
>>> +           if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
>>> +                   pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n",
>>> +                                   start_pfn << PAGE_SHIFT,
>>> +                                   end_pfn << PAGE_SHIFT);
>>> +                   return -EINVAL;
>>
>> ... no 'goto err'?  This leaks all the previous add_tdx_memblock()
>> structures, right?
> 
> Right.  It's a leftover from the old code.  Will fix.
> 
>>
>>> +           }
>>> +
>>> +           /*
>>> +            * Add the memory regions as TDX memory.  The regions in
>>> +            * memblock has already guaranteed they are in address
>>> +            * ascending order and don't overlap.
>>> +            */
>>> +           ret = add_tdx_memblock(start_pfn, end_pfn, nid);
>>> +           if (ret)
>>> +                   goto err;
>>> +   }
>>> +
>>> +   return 0;
>>> +err:
>>> +   free_tdx_memory();
>>> +   return ret;
>>> +}
>>> +
>>>  /*
>>>   * Detect and initialize the TDX module.
>>>   *
>>> @@ -357,12 +473,56 @@ static int init_tdx_module(void)
>>>     if (ret)
>>>             goto out;
>>>
>>> +   /*
>>> +    * All memory regions that can be used by the TDX module must be
>>> +    * passed to the TDX module during the module initialization.
>>> +    * Once this is done, all "TDX-usable" memory regions are fixed
>>> +    * during module's runtime.
>>> +    *
>>> +    * The initial support of TDX guests only allocates memory from
>>> +    * the global page allocator.  To keep things simple, for now
>>> +    * just make sure all pages in the page allocator are TDX memory.
>>> +    *
>>> +    * To achieve this, use all system memory in the core-mm at the
>>> +    * time of initializing the TDX module as TDX memory, and at the
>>> +    * meantime, reject any new memory in memory hot-add.
>>> +    *
>>> +    * This works as in practice, all boot-time present DIMM is TDX
>>> +    * convertible memory.  However if any new memory is hot-added
>>> +    * before initializing the TDX module, the initialization will
>>> +    * fail due to that memory is not covered by CMR.
>>> +    *
>>> +    * This can be enhanced in the future, i.e. by allowing adding or
>>> +    * onlining non-TDX memory to a separate node, in which case the
>>> +    * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
>>> +    * together -- the userspace/kernel just needs to make sure pages
>>> +    * for TDX guests must come from those "TDX-capable" nodes.
>>> +    *
>>> +    * Build the list of TDX memory regions as mentioned above so
>>> +    * they can be passed to the TDX module later.
>>> +    */
>>
>> This is mostly Documentation/, not a code comment.  Please clean it up.
> 
> Will try to clean up.
> 
>>
>>> +   get_online_mems();
>>> +
>>> +   ret = build_tdx_memory();
>>> +   if (ret)
>>> +           goto out;
>>>     /*
>>>      * Return -EINVAL until all steps of TDX module initialization
>>>      * process are done.
>>>      */
>>>     ret = -EINVAL;
>>>  out:
>>> +   /*
>>> +    * Memory hotplug checks the hot-added memory region against the
>>> +    * @tdx_memlist to see if the region is TDX memory.
>>> +    *
>>> +    * Do put_online_mems() here to make sure any modification to
>>> +    * @tdx_memlist is done while holding the memory hotplug read
>>> +    * lock, so that the memory hotplug path can just check the
>>> +    * @tdx_memlist w/o holding the @tdx_module_lock which may cause
>>> +    * deadlock.
>>> +    */
>>
>> I'm honestly not following any of that.
> 
> How about:
> 
>         /*
>          * Make sure tdx_cc_memory_compatible() either sees a fixed set of
>          * memory regions in @tdx_memlist, or an empty list.
>          */

That's a comment for the lock side, not the unlock side.  It should be:

	/*
	 * @tdx_memlist is written here and read at memory hotplug time.
	 * Lock out memory hotplug code while building it.
	 */

>>> +   put_online_mems();
>>>     return ret;
>>>  }
>>>
>>> @@ -485,3 +645,26 @@ int tdx_enable(void)
>>>     return ret;
>>>  }
>>>  EXPORT_SYMBOL_GPL(tdx_enable);
>>> +
>>> +/*
>>> + * Check whether the given range is TDX memory.  Must be called between
>>> + * mem_hotplug_begin()/mem_hotplug_done().
>>> + */
>>> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
>>> +{
>>> +   struct tdx_memblock *tmb;
>>> +
>>> +   /* Empty list means TDX isn't enabled successfully */
>>> +   if (list_empty(&tdx_memlist))
>>> +           return true;
>>> +
>>> +   list_for_each_entry(tmb, &tdx_memlist, list) {
>>> +           /*
>>> +            * The new range is TDX memory if it is fully covered
>>> +            * by any TDX memory block.
>>> +            */
>>> +           if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
>>> +                   return true;
>>
>> Same bug.  What if the start/end_pfn range is covered by more than one
>> tdx_memblock?
> 
> We may want to return true if tdx_memblocks are contiguous.
> 
> However I don't think this will happen?
> 
> tdx_memblock is from memblock, and when two memory regions in memblock are
> contiguous, they must have a different node or different flags.
> 
> My understanding is the hot-added memory region here cannot cross NUMA nodes,
> nor have different flags, correct?

I'm not sure what flags are in this context.
  
Dan Williams Nov. 24, 2022, 1:50 a.m. UTC | #12
Kai Huang wrote:
> TDX reports a list of "Convertible Memory Region" (CMR) to indicate all
> memory regions that can possibly be used by the TDX module, but they are
> not automatically usable to the TDX module.  As a step of initializing
> the TDX module, the kernel needs to choose a list of memory regions (out
> from convertible memory regions) that the TDX module can use and pass
> those regions to the TDX module.  Once this is done, those "TDX-usable"
> memory regions are fixed during module's lifetime.  No more TDX-usable
> memory can be added to the TDX module after that.
> 
> The initial support of TDX guests will only allocate TDX guest memory
> from the global page allocator.  To keep things simple, this initial
> implementation simply guarantees all pages in the page allocator are TDX
> memory.  To achieve this, use all system memory in the core-mm at the
> time of initializing the TDX module as TDX memory, and at the meantime,
> refuse to add any non-TDX-memory in the memory hotplug.
> 
> Specifically, walk through all memory regions managed by memblock and
> add them to a global list of "TDX-usable" memory regions, which is a
> fixed list after the module initialization (or empty if initialization
> fails).  To reject non-TDX-memory in memory hotplug, add an additional
> check in arch_add_memory() to check whether the new region is covered by
> any region in the "TDX-usable" memory region list.
> 
> Note this requires all memory regions in memblock are TDX convertible
> memory when initializing the TDX module.  This is true in practice if no
> new memory has been hot-added before initializing the TDX module, since
> in practice all boot-time present DIMM is TDX convertible memory.  If
> any new memory has been hot-added, then initializing the TDX module will
> fail due to that memory region is not covered by CMR.
> 
> This can be enhanced in the future, i.e. by allowing adding non-TDX
> memory to a separate NUMA node.  In this case, the "TDX-capable" nodes
> and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace
> needs to guarantee memory pages for TDX guests are always allocated from
> the "TDX-capable" nodes.
> 
> Note TDX assumes convertible memory is always physically present during
> machine's runtime.  A non-buggy BIOS should never support hot-removal of
> any convertible memory.  This implementation doesn't handle ACPI memory
> removal but depends on the BIOS to behave correctly.
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> ---
> 
> v6 -> v7:
>  - Changed to use all system memory in memblock at the time of
>    initializing the TDX module as TDX memory
>  - Added memory hotplug support
> 
> ---
>  arch/x86/Kconfig            |   1 +
>  arch/x86/include/asm/tdx.h  |   3 +
>  arch/x86/mm/init_64.c       |  10 ++
>  arch/x86/virt/vmx/tdx/tdx.c | 183 ++++++++++++++++++++++++++++++++++++
>  4 files changed, 197 insertions(+)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index dd333b46fafb..b36129183035 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
>  	depends on X86_64
>  	depends on KVM_INTEL
>  	depends on X86_X2APIC
> +	select ARCH_KEEP_MEMBLOCK
>  	help
>  	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
>  	  host and certain physical attacks.  This option enables necessary TDX
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index d688228f3151..71169ecefabf 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -111,9 +111,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>  #ifdef CONFIG_INTEL_TDX_HOST
>  bool platform_tdx_enabled(void);
>  int tdx_enable(void);
> +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn);
>  #else	/* !CONFIG_INTEL_TDX_HOST */
>  static inline bool platform_tdx_enabled(void) { return false; }
>  static inline int tdx_enable(void)  { return -ENODEV; }
> +static inline bool tdx_cc_memory_compatible(unsigned long start_pfn,
> +		unsigned long end_pfn) { return true; }
>  #endif	/* CONFIG_INTEL_TDX_HOST */
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3f040c6e5d13..900341333d7e 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -55,6 +55,7 @@
>  #include <asm/uv/uv.h>
>  #include <asm/setup.h>
>  #include <asm/ftrace.h>
> +#include <asm/tdx.h>
>  
>  #include "mm_internal.h"
>  
> @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  
> +	/*
> +	 * For now if TDX is enabled, all pages in the page allocator
> +	 * must be TDX memory, which is a fixed set of memory regions
> +	 * that are passed to the TDX module.  Reject the new region
> +	 * if it is not TDX memory to guarantee above is true.
> +	 */
> +	if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> +		return -EINVAL;

arch_add_memory() does not add memory to the page allocator.  For
example, memremap_pages() uses arch_add_memory() and explicitly does not
release the memory to the page allocator. This check belongs in
add_memory_resource() to prevent new memory that violates TDX from being
onlined. Hopefully there is also an option to disable TDX from the
kernel boot command line to recover memory-hotplug without needing to
boot into the BIOS to toggle TDX.
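
For illustration, the check could sit in the generic path roughly like below
(a sketch only -- arch_cc_memory_compatible() is a hypothetical arch hook
defaulting to true, not an existing API):

--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
+	unsigned long start_pfn = PFN_DOWN(res->start);
+	unsigned long nr_pages = resource_size(res) >> PAGE_SHIFT;
+
+	/* Keep memory that the arch (here: TDX) can't accept out of hotplug */
+	if (!arch_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
+		return -EINVAL;
+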
  
Kai Huang Nov. 24, 2022, 2:27 a.m. UTC | #13
On Wed, 2022-11-23 at 17:22 -0800, Hansen, Dave wrote:
> On 11/23/22 17:04, Huang, Kai wrote:
> > On Tue, 2022-11-22 at 16:21 -0800, Dave Hansen wrote:
> > > > +struct tdx_memblock {
> > > > +   struct list_head list;
> > > > +   unsigned long start_pfn;
> > > > +   unsigned long end_pfn;
> > > > +   int nid;
> > > > +};
> > > 
> > > Why does the nid matter?
> > 
> > It is used to find the node for the PAMT allocation for a given TDMR.
> 
> ... which is in this patch?
> 
> You can't just plop unused and unmentioned nuggets in the code.  Remove
> it until it is needed.

OK. I'll move it to the PAMT allocation patch.

> 
> 
> > > > +/* Check whether the given pfn range is covered by any CMR or not. */
> > > > +static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
> > > > +                                unsigned long end_pfn)
> > > > +{
> > > > +   int i;
> > > > +
> > > > +   for (i = 0; i < tdx_cmr_num; i++) {
> > > > +           struct cmr_info *cmr = &tdx_cmr_array[i];
> > > > +           unsigned long cmr_start_pfn;
> > > > +           unsigned long cmr_end_pfn;
> > > > +
> > > > +           cmr_start_pfn = cmr->base >> PAGE_SHIFT;
> > > > +           cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
> > > > +
> > > > +           if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
> > > > +                   return true;
> > > > +   }
> > > 
> > > What if the pfn range overlaps two CMRs?  It will never pass any
> > > individual overlap test and will return false.
> > 
> > We can only return true if the two CMRs are contiguous.
> > 
> > I cannot think of a reason why a reasonable BIOS would generate contiguous
> > CMRs.
> 
> Because it can?
> 
> We don't just try and randomly assign what we think is reasonable or
> not.  First and foremost, we need to ask whether the configuration in
> question is allowed by the spec.
> 
> Would it be a *valid* thing to have two adjacent CMRs?  Does the TDX
> module spec disallow it?

No, the TDX module doesn't disallow it, IIUC.  The spec only says they don't
overlap.

> 
> > Perhaps one reason is two contiguous NUMA nodes?  For this case, memblock
> > has made sure no memory region could cross NUMA nodes, so the start_pfn/end_pfn
> > here should always be within one node.  Perhaps we can add a comment for this
> > case?
> 
> <cough> numa=off <cough>
> 
> > Anyway I am not sure whether it is worth considering the "contiguous CMRs" case.
> 
> I am sure.  You need to consider it.

OK.
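
For reference, a coverage check that tolerates a pfn range spanning multiple
adjacent CMRs could look like below (just a sketch, not tested; it assumes
tdx_cmr_array[] is sorted by base address and that CMRs don't overlap):

static bool pfn_range_covered_by_cmrs(unsigned long start_pfn,
				      unsigned long end_pfn)
{
	int i;

	for (i = 0; i < tdx_cmr_num; i++) {
		struct cmr_info *cmr = &tdx_cmr_array[i];
		unsigned long cmr_start_pfn = cmr->base >> PAGE_SHIFT;
		unsigned long cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;

		/* This CMR ends before the uncovered part: keep looking. */
		if (cmr_end_pfn <= start_pfn)
			continue;

		/* A gap in front of this CMR: the range isn't fully covered. */
		if (cmr_start_pfn > start_pfn)
			return false;

		/* [start_pfn, cmr_end_pfn) is covered: succeed or advance. */
		if (cmr_end_pfn >= end_pfn)
			return true;
		start_pfn = cmr_end_pfn;
	}

	return false;
}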

Also, as mentioned in another reply to patch "Get information about TDX module
and TDX-capable memory", we can depend on TDH.SYS.CONFIG to return failure but
don't necessarily need to sanity check all memory regions are CMR memory.  This
way we can just removing above sanity check code here.

What do you think?

> 
> > > > + * and don't overlap.
> > > > + */
> > > > +static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
> > > > +                       int nid)
> > > > +{
> > > > +   struct tdx_memblock *tmb;
> > > > +
> > > > +   tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
> > > > +   if (!tmb)
> > > > +           return -ENOMEM;
> > > > +
> > > > +   INIT_LIST_HEAD(&tmb->list);
> > > > +   tmb->start_pfn = start_pfn;
> > > > +   tmb->end_pfn = end_pfn;
> > > > +   tmb->nid = nid;
> > > > +
> > > > +   list_add_tail(&tmb->list, &tdx_memlist);
> > > > +   return 0;
> > > > +}
> > > > +
> > > > +static void free_tdx_memory(void)
> > > 
> > > This is named a bit too generically.  How about free_tdx_memlist() or
> > > something?
> > 
> > Will use free_tdx_memlist().  Do you want to also change build_tdx_memory() to
> > build_tdx_memlist()?
> 
> Does it build a memlist?

Yes.


[...]

> 
> I actually wasn't asking about the for_each_mem_pfn_range() use.
> 
> > And here before skipping first 1MB, we add below:
> > 
> >                 /*
> >                  * The first 1MB is not reported as TDX covertible memory.
> >                  * Although the first 1MB is always reserved and won't end up
> >                  * to the page allocator, it is still in memblock's memory
> >                  * regions.  Skip them manually to exclude them as TDX memory.
> >                  */
> 
> That looks OK, with the spelling fixed.

Yes "covertible" -> "convertible".


[...]

> > > >  out:
> > > > +   /*
> > > > +    * Memory hotplug checks the hot-added memory region against the
> > > > +    * @tdx_memlist to see if the region is TDX memory.
> > > > +    *
> > > > +    * Do put_online_mems() here to make sure any modification to
> > > > +    * @tdx_memlist is done while holding the memory hotplug read
> > > > +    * lock, so that the memory hotplug path can just check the
> > > > +    * @tdx_memlist w/o holding the @tdx_module_lock which may cause
> > > > +    * deadlock.
> > > > +    */
> > > 
> > > I'm honestly not following any of that.
> > 
> > How about:
> > 
> >         /*
> >          * Make sure tdx_cc_memory_compatible() either sees a fixed set of
> >          * memory regions in @tdx_memlist, or an empty list.
> >          */
> 
> That's a comment for the lock side, not the unlock side.  It should be:
> 
> 	/*
> 	 * @tdx_memlist is written here and read at memory hotplug time.
> 	 * Lock out memory hotplug code while building it.
> 	 */

Thanks.

> 
> > > > +   put_online_mems();
> > > >     return ret;
> > > >  }
> > > > 
> > > > @@ -485,3 +645,26 @@ int tdx_enable(void)
> > > >     return ret;
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(tdx_enable);
> > > > +
> > > > +/*
> > > > + * Check whether the given range is TDX memory.  Must be called between
> > > > + * mem_hotplug_begin()/mem_hotplug_done().
> > > > + */
> > > > +bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
> > > > +{
> > > > +   struct tdx_memblock *tmb;
> > > > +
> > > > +   /* Empty list means TDX isn't enabled successfully */
> > > > +   if (list_empty(&tdx_memlist))
> > > > +           return true;
> > > > +
> > > > +   list_for_each_entry(tmb, &tdx_memlist, list) {
> > > > +           /*
> > > > +            * The new range is TDX memory if it is fully covered
> > > > +            * by any TDX memory block.
> > > > +            */
> > > > +           if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
> > > > +                   return true;
> > > 
> > > Same bug.  What if the start/end_pfn range is covered by more than one
> > > tdx_memblock?
> > 
> > We may want to return true if tdx_memblocks are contiguous.
> > 
> > However I don't think this will happen?
> > 
> > tdx_memblock is from memblock, and when two memory regions in memblock are
> > contiguous, they must have different node, or flags.
> > 
> > My understanding is the hot-added memory region here cannot across NUMA nodes,
> > nor have different flags,  correct?
> 
> I'm not sure what flags are in this context.
> 

The flags in 'struct memblock_region':

enum memblock_flags {           
        MEMBLOCK_NONE           = 0x0,  /* No special request */
        MEMBLOCK_HOTPLUG        = 0x1,  /* hotpluggable region */              
        MEMBLOCK_MIRROR         = 0x2,  /* mirrored region */                  
        MEMBLOCK_NOMAP          = 0x4,  /* don't add to kernel direct mapping */
        MEMBLOCK_DRIVER_MANAGED = 0x8,  /* always detected via a driver */     
};      
        
/**                                                                            
 * struct memblock_region - represents a memory region                         
 * @base: base address of the region                                           
 * @size: size of the region                                                   
 * @flags: memory region attributes
 * @nid: NUMA node id
 */             
struct memblock_region {
        phys_addr_t base;
        phys_addr_t size;
        enum memblock_flags flags;
#ifdef CONFIG_NUMA
        int nid;
#endif  
};
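
Coming back to the contiguous-tdx_memblock question above: in case it does need
handling, the same walking approach would work for the list too (sketch only;
it relies on add_tdx_memblock() keeping the list in address ascending order
without overlaps):

bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
{
	struct tdx_memblock *tmb;

	/* Empty list means TDX isn't enabled successfully */
	if (list_empty(&tdx_memlist))
		return true;

	list_for_each_entry(tmb, &tdx_memlist, list) {
		if (tmb->end_pfn <= start_pfn)
			continue;
		/* A gap before this block: the range isn't covered. */
		if (tmb->start_pfn > start_pfn)
			return false;
		if (tmb->end_pfn >= end_pfn)
			return true;
		/* Covered up to tmb->end_pfn; check the remainder. */
		start_pfn = tmb->end_pfn;
	}
	return false;
}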
  
Kai Huang Nov. 24, 2022, 9:06 a.m. UTC | #14
On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
> >   
> > @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >   	unsigned long start_pfn = start >> PAGE_SHIFT;
> >   	unsigned long nr_pages = size >> PAGE_SHIFT;
> >   
> > +	/*
> > +	 * For now if TDX is enabled, all pages in the page allocator
> > +	 * must be TDX memory, which is a fixed set of memory regions
> > +	 * that are passed to the TDX module.  Reject the new region
> > +	 * if it is not TDX memory to guarantee above is true.
> > +	 */
> > +	if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> > +		return -EINVAL;
> 
> arch_add_memory() does not add memory to the page allocator.  For
> example, memremap_pages() uses arch_add_memory() and explicitly does not
> release the memory to the page allocator. 

Indeed.  Sorry I missed this.

> This check belongs in
> add_memory_resource() to prevent new memory that violates TDX from being
> onlined. 

This would require adding another 'arch_cc_memory_compatible()' to the common
add_memory_resource() (I actually had such a patch long ago to work with the
memremap_pages() you mentioned above).

How about adding a memory_notifier to the TDX code, and rejecting online of
TDX-incompatible memory (something like below)?  The benefit is this keeps the
TDX code self-contained and won't pollute the common mm code:

+static int tdx_memory_notifier(struct notifier_block *nb,
+                              unsigned long action, void *v)
+{
+       struct memory_notify *mn = v;
+
+       if (action != MEM_GOING_ONLINE)
+               return NOTIFY_OK;
+
+       /*
+        * Not all memory is compatible with TDX.  Reject
+        * online of any incompatible memory.
+        */
+       return tdx_cc_memory_compatible(mn->start_pfn,
+                       mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
+}
+
+static struct notifier_block tdx_memory_nb = {
+       .notifier_call = tdx_memory_notifier,
+};
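
The notifier would also need registering from the TDX host init path, e.g.
(sketch; the init function name is made up):

+static int __init tdx_memory_notifier_init(void)
+{
+	return register_memory_notifier(&tdx_memory_nb);
+}
+early_initcall(tdx_memory_notifier_init);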

> Hopefully there is also an option to disable TDX from the
> kernel boot command line to recover memory-hotplug without needing to
> boot into the BIOS to toggle TDX.

I am fine with that.

Hi Dave, is it OK to include such a command line option in this initial TDX submission?
  
Peter Zijlstra Nov. 24, 2022, 9:26 a.m. UTC | #15
On Wed, Nov 23, 2022 at 05:50:37PM -0800, Dan Williams wrote:

> arch_add_memory() does not add memory to the page allocator.  For
> example, memremap_pages() uses arch_add_memory() and explicitly does not
> release the memory to the page allocator. This check belongs in
> add_memory_resource() to prevent new memory that violates TDX from being
> onlined. Hopefully there is also an option to disable TDX from the
> kernel boot command line to recover memory-hotplug without needing to
> boot into the BIOS to toggle TDX.

So I've been pushing for all this to either require: tdx=force on the
cmdline to boot-time enable, or delay all the memory allocation to the
first KVM/TDX instance being created.

That is, by default, none of this crud should ever trigger and consume
memory if you're not using TDX (most of us really).

(every machine I have loads kvm.ko unconditionally -- even if I never
use KVM, so kvm.ko load time is not a valid point in time to do TDX
enablement).
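
For reference, the cmdline switch itself would only be a few lines, e.g.
(sketch; how the TDX init path consumes the flag is left out):

static bool tdx_force;

static int __init tdx_setup(char *str)
{
	if (str && !strcmp(str, "force"))
		tdx_force = true;
	return 0;
}
early_param("tdx", tdx_setup);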
  
Kai Huang Nov. 24, 2022, 10:02 a.m. UTC | #16
On Thu, 2022-11-24 at 10:26 +0100, Peter Zijlstra wrote:
> On Wed, Nov 23, 2022 at 05:50:37PM -0800, Dan Williams wrote:
> 
> > arch_add_memory() does not add memory to the page allocator.  For
> > example, memremap_pages() uses arch_add_memory() and explicitly does not
> > release the memory to the page allocator. This check belongs in
> > add_memory_resource() to prevent new memory that violates TDX from being
> > onlined. Hopefully there is also an option to disable TDX from the
> > kernel boot command line to recover memory-hotplug without needing to
> > boot into the BIOS to toggle TDX.
> 
> So I've been pushing for all this to either require: tdx=force on the
> cmdline to boot-time enable, or delay all the memory allocation to the
> first KVM/TDX instance being created.
> 
> That is, by default, none of this crud should ever trigger and consume
> memory if you're not using TDX (most of us really).
> 
> (every machine I have loads kvm.ko unconditionally -- even if I never
> use KVM, so kvm.ko load time is not a valid point in time to do TDX
> enablement).
> 

Thanks for the input.  I am fine with 'tdx=force'.

Although, I'd like to point out KVM will have a module parameter 'enable_tdx'.

Hi Dave, Sean, do you have any comments?
  
David Hildenbrand Nov. 25, 2022, 9:28 a.m. UTC | #17
On 24.11.22 10:06, Huang, Kai wrote:
> On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
>>>    
>>> @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>>    	unsigned long start_pfn = start >> PAGE_SHIFT;
>>>    	unsigned long nr_pages = size >> PAGE_SHIFT;
>>>    
>>> +	/*
>>> +	 * For now if TDX is enabled, all pages in the page allocator
>>> +	 * must be TDX memory, which is a fixed set of memory regions
>>> +	 * that are passed to the TDX module.  Reject the new region
>>> +	 * if it is not TDX memory to guarantee above is true.
>>> +	 */
>>> +	if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
>>> +		return -EINVAL;
>>
>> arch_add_memory() does not add memory to the page allocator.  For
>> example, memremap_pages() uses arch_add_memory() and explicitly does not
>> release the memory to the page allocator.
> 
> Indeed.  Sorry I missed this.
> 
>> This check belongs in
>> add_memory_resource() to prevent new memory that violates TDX from being
>> onlined.
> 
> This would require adding another 'arch_cc_memory_compatible()' to the common
> add_memory_resource() (I actually had such a patch long ago to work with the
> memremap_pages() you mentioned above).
> 
> How about adding a memory_notifier to the TDX code, and rejecting online of
> TDX-incompatible memory (something like below)?  The benefit is this keeps the
> TDX code self-contained and won't pollute the common mm code:
> 
> +static int tdx_memory_notifier(struct notifier_block *nb,
> +                              unsigned long action, void *v)
> +{
> +       struct memory_notify *mn = v;
> +
> +       if (action != MEM_GOING_ONLINE)
> +               return NOTIFY_OK;
> +
> +       /*
> +        * Not all memory is compatible with TDX.  Reject
> +        * online of any incompatible memory.
> +        */
> +       return tdx_cc_memory_compatible(mn->start_pfn,
> +                       mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
> +}
> +
> +static struct notifier_block tdx_memory_nb = {
> +       .notifier_call = tdx_memory_notifier,
> +};

With mhp_memmap_on_memory() some memory might already be touched during 
add_memory() (because part of the hotplug memory is used for holding the 
memmap), not when actually onlining memory. So in that case, this would 
be too late.

add_memory_resource() sounds better, even though I detest such TDX
special handling in common code.
  
Kai Huang Nov. 28, 2022, 8:38 a.m. UTC | #18
On Fri, 2022-11-25 at 10:28 +0100, David Hildenbrand wrote:
> On 24.11.22 10:06, Huang, Kai wrote:
> > On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
> > > >    
> > > > @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> > > >    	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > >    	unsigned long nr_pages = size >> PAGE_SHIFT;
> > > >    
> > > > +	/*
> > > > +	 * For now if TDX is enabled, all pages in the page allocator
> > > > +	 * must be TDX memory, which is a fixed set of memory regions
> > > > +	 * that are passed to the TDX module.  Reject the new region
> > > > +	 * if it is not TDX memory to guarantee above is true.
> > > > +	 */
> > > > +	if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> > > > +		return -EINVAL;
> > > 
> > > arch_add_memory() does not add memory to the page allocator.  For
> > > example, memremap_pages() uses arch_add_memory() and explicitly does not
> > > release the memory to the page allocator.
> > 
> > Indeed.  Sorry I missed this.
> > 
> > > This check belongs in
> > > add_memory_resource() to prevent new memory that violates TDX from being
> > > onlined.
> > 
> > This would require adding another 'arch_cc_memory_compatible()' to the common
> > add_memory_resource() (I actually had such a patch long ago to work with the
> > memremap_pages() you mentioned above).
> > 
> > How about adding a memory_notifier to the TDX code, and rejecting online of
> > TDX-incompatible memory (something like below)?  The benefit is this keeps the
> > TDX code self-contained and won't pollute the common mm code:
> > 
> > +static int tdx_memory_notifier(struct notifier_block *nb,
> > +                              unsigned long action, void *v)
> > +{
> > +       struct memory_notify *mn = v;
> > +
> > +       if (action != MEM_GOING_ONLINE)
> > +               return NOTIFY_OK;
> > +
> > +       /*
> > +        * Not all memory is compatible with TDX.  Reject
> > +        * online of any incompatible memory.
> > +        */
> > +       return tdx_cc_memory_compatible(mn->start_pfn,
> > +                       mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
> > +}
> > +
> > +static struct notifier_block tdx_memory_nb = {
> > +       .notifier_call = tdx_memory_notifier,
> > +};
> 
> With mhp_memmap_on_memory() some memory might already be touched during 
> add_memory() (because part of the hotplug memory is used for holding the 
> memmap), not when actually onlining memory. So in that case, this would 
> be too late.

Hi David,

Thanks for the review!

Right. The memmap pages are added to the zone before online_pages(), but IIUC
those memmap pages will never be free pages thus won't be allocated by the page
allocator, correct?  Therefore in practice there won't be a problem even if they are
not TDX compatible memory.

> 
> add_memory_resource() sounds better, even though I detest such TDX
> special handling in common code.
> 

Given the above, do you still prefer changing add_memory_resource()?  If so I'll
change the patch to modify add_memory_resource().
  
David Hildenbrand Nov. 28, 2022, 8:43 a.m. UTC | #19
On 28.11.22 09:38, Huang, Kai wrote:
> On Fri, 2022-11-25 at 10:28 +0100, David Hildenbrand wrote:
>> On 24.11.22 10:06, Huang, Kai wrote:
>>> On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
>>>>>     
>>>>> @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>>>>     	unsigned long start_pfn = start >> PAGE_SHIFT;
>>>>>     	unsigned long nr_pages = size >> PAGE_SHIFT;
>>>>>     
>>>>> +	/*
>>>>> +	 * For now if TDX is enabled, all pages in the page allocator
>>>>> +	 * must be TDX memory, which is a fixed set of memory regions
>>>>> +	 * that are passed to the TDX module.  Reject the new region
>>>>> +	 * if it is not TDX memory to guarantee above is true.
>>>>> +	 */
>>>>> +	if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
>>>>> +		return -EINVAL;
>>>>
>>>> arch_add_memory() does not add memory to the page allocator.  For
>>>> example, memremap_pages() uses arch_add_memory() and explicitly does not
>>>> release the memory to the page allocator.
>>>
>>> Indeed.  Sorry I missed this.
>>>
>>>> This check belongs in
>>>> add_memory_resource() to prevent new memory that violates TDX from being
>>>> onlined.
>>>
>>> This would require adding another 'arch_cc_memory_compatible()' to the common
>>> add_memory_resource() (I actually had such a patch long ago to work with the
>>> memremap_pages() you mentioned above).
>>>
>>> How about adding a memory_notifier to the TDX code, and rejecting online of
>>> TDX-incompatible memory (something like below)?  The benefit is this keeps the
>>> TDX code self-contained and won't pollute the common mm code:
>>>
>>> +static int tdx_memory_notifier(struct notifier_block *nb,
>>> +                              unsigned long action, void *v)
>>> +{
>>> +       struct memory_notify *mn = v;
>>> +
>>> +       if (action != MEM_GOING_ONLINE)
>>> +               return NOTIFY_OK;
>>> +
>>> +       /*
>>> +        * Not all memory is compatible with TDX.  Reject
>>> +        * online of any incompatible memory.
>>> +        */
>>> +       return tdx_cc_memory_compatible(mn->start_pfn,
>>> +                       mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
>>> +}
>>> +
>>> +static struct notifier_block tdx_memory_nb = {
>>> +       .notifier_call = tdx_memory_notifier,
>>> +};
>>
>> With mhp_memmap_on_memory() some memory might already be touched during
>> add_memory() (because part of the hotplug memory is used for holding the
>> memmap), not when actually onlining memory. So in that case, this would
>> be too late.
> 
> Hi David,
> 
> Thanks for the review!
> 
> Right. The memmap pages are added to the zone before online_pages(), but IIUC
> those memmap pages will never be free pages thus won't be allocated by the page
> allocator, correct?  Therefore in practice there won't be a problem even if they are
> not TDX compatible memory.

But that memory will be read/written. Isn't that an issue, for example, 
if memory doesn't get accepted etc?
  
Kai Huang Nov. 28, 2022, 9:21 a.m. UTC | #20
On Mon, 2022-11-28 at 09:43 +0100, David Hildenbrand wrote:
> On 28.11.22 09:38, Huang, Kai wrote:
> > On Fri, 2022-11-25 at 10:28 +0100, David Hildenbrand wrote:
> > > On 24.11.22 10:06, Huang, Kai wrote:
> > > > On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
> > > > > >     
> > > > > > @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> > > > > >     	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > > >     	unsigned long nr_pages = size >> PAGE_SHIFT;
> > > > > >     
> > > > > > +	/*
> > > > > > +	 * For now if TDX is enabled, all pages in the page allocator
> > > > > > +	 * must be TDX memory, which is a fixed set of memory regions
> > > > > > +	 * that are passed to the TDX module.  Reject the new region
> > > > > > +	 * if it is not TDX memory to guarantee above is true.
> > > > > > +	 */
> > > > > > +	if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> > > > > > +		return -EINVAL;
> > > > > 
> > > > > arch_add_memory() does not add memory to the page allocator.  For
> > > > > example, memremap_pages() uses arch_add_memory() and explicitly does not
> > > > > release the memory to the page allocator.
> > > > 
> > > > Indeed.  Sorry I missed this.
> > > > 
> > > > > This check belongs in
> > > > > add_memory_resource() to prevent new memory that violates TDX from being
> > > > > onlined.
> > > > 
> > > > This would require adding another 'arch_cc_memory_compatible()' to the common
> > > > add_memory_resource() (I actually had such a patch long ago to work with the
> > > > memremap_pages() you mentioned above).
> > > > 
> > > > How about adding a memory_notifier to the TDX code, and rejecting online of
> > > > TDX-incompatible memory (something like below)?  The benefit is this keeps the
> > > > TDX code self-contained and won't pollute the common mm code:
> > > > 
> > > > +static int tdx_memory_notifier(struct notifier_block *nb,
> > > > +                              unsigned long action, void *v)
> > > > +{
> > > > +       struct memory_notify *mn = v;
> > > > +
> > > > +       if (action != MEM_GOING_ONLINE)
> > > > +               return NOTIFY_OK;
> > > > +
> > > > +       /*
> > > > +        * Not all memory is compatible with TDX.  Reject
> > > > +        * online of any incompatible memory.
> > > > +        */
> > > > +       return tdx_cc_memory_compatible(mn->start_pfn,
> > > > +                       mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
> > > > +}
> > > > +
> > > > +static struct notifier_block tdx_memory_nb = {
> > > > +       .notifier_call = tdx_memory_notifier,
> > > > +};
> > > 
> > > With mhp_memmap_on_memory() some memory might already be touched during
> > > add_memory() (because part of the hotplug memory is used for holding the
> > > memmap), not when actually onlining memory. So in that case, this would
> > > be too late.
> > 
> > Hi David,
> > 
> > Thanks for the review!
> > 
> > Right. The memmap pages are added to the zone before online_pages(), but IIUC
> > those memmap pages will never be free pages thus won't be allocated by the page
> > allocator, correct?  Therefore in practice there won't be a problem even if they are
> > not TDX compatible memory.
> 
> But that memory will be read/written. Isn't that an issue, for example, 
> if memory doesn't get accepted etc?
> 

Sorry, I don't quite understand what "if memory doesn't get accepted" means.  Do you
mean accepted by the TDX module?

Only the host kernel will read/write those memmap pages.  The TDX module won't
(as they won't be allocated to be used as TDX guest memory or TDX module
metadata).  So it's fine.
  
David Hildenbrand Nov. 28, 2022, 9:26 a.m. UTC | #21
On 28.11.22 10:21, Huang, Kai wrote:
> On Mon, 2022-11-28 at 09:43 +0100, David Hildenbrand wrote:
>> On 28.11.22 09:38, Huang, Kai wrote:
>>> On Fri, 2022-11-25 at 10:28 +0100, David Hildenbrand wrote:
>>>> On 24.11.22 10:06, Huang, Kai wrote:
>>>>> On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
>>>>>>>      
>>>>>>> @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>>>>>>      	unsigned long start_pfn = start >> PAGE_SHIFT;
>>>>>>>      	unsigned long nr_pages = size >> PAGE_SHIFT;
>>>>>>>      
>>>>>>> +	/*
>>>>>>> +	 * For now if TDX is enabled, all pages in the page allocator
>>>>>>> +	 * must be TDX memory, which is a fixed set of memory regions
>>>>>>> +	 * that are passed to the TDX module.  Reject the new region
>>>>>>> +	 * if it is not TDX memory to guarantee above is true.
>>>>>>> +	 */
>>>>>>> +	if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
>>>>>>> +		return -EINVAL;
>>>>>>
>>>>>> arch_add_memory() does not add memory to the page allocator.  For
>>>>>> example, memremap_pages() uses arch_add_memory() and explicitly does not
>>>>>> release the memory to the page allocator.
>>>>>
>>>>> Indeed.  Sorry I missed this.
>>>>>
>>>>>> This check belongs in
>>>>>> add_memory_resource() to prevent new memory that violates TDX from being
>>>>>> onlined.
>>>>>
>>>>> This would require adding another 'arch_cc_memory_compatible()' to the common
>>>>> add_memory_resource() (I actually had such a patch long ago to work with the
>>>>> memremap_pages() you mentioned above).
>>>>>
>>>>> How about adding a memory_notifier to the TDX code, and rejecting online of
>>>>> TDX-incompatible memory (something like below)?  The benefit is this keeps the
>>>>> TDX code self-contained and won't pollute the common mm code:
>>>>>
>>>>> +static int tdx_memory_notifier(struct notifier_block *nb,
>>>>> +                              unsigned long action, void *v)
>>>>> +{
>>>>> +       struct memory_notify *mn = v;
>>>>> +
>>>>> +       if (action != MEM_GOING_ONLINE)
>>>>> +               return NOTIFY_OK;
>>>>> +
>>>>> +       /*
>>>>> +        * Not all memory is compatible with TDX.  Reject
>>>>> +        * online of any incompatible memory.
>>>>> +        */
>>>>> +       return tdx_cc_memory_compatible(mn->start_pfn,
>>>>> +                       mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
>>>>> +}
>>>>> +
>>>>> +static struct notifier_block tdx_memory_nb = {
>>>>> +       .notifier_call = tdx_memory_notifier,
>>>>> +};
>>>>
>>>> With mhp_memmap_on_memory() some memory might already be touched during
>>>> add_memory() (because part of the hotplug memory is used for holding the
>>>> memmap), not when actually onlining memory. So in that case, this would
>>>> be too late.
>>>
>>> Hi David,
>>>
>>> Thanks for the review!
>>>
>>> Right. The memmap pages are added to the zone before online_pages(), but IIUC
>>> those memmap pages will never be free pages thus won't be allocated by the page
>>> allocator, correct?  Therefore in practice there won't be a problem even if they are
>>> not TDX compatible memory.
>>
>> But that memory will be read/written. Isn't that an issue, for example,
>> if memory doesn't get accepted etc?
>>
> 
> Sorry, I don't quite understand what "if memory doesn't get accepted" means.  Do you
> mean accepted by the TDX module?
> 
> Only the host kernel will read/write those memmap pages.  The TDX module won't
> (as they won't be allocated to be used as TDX guest memory or TDX module
> metadata).  So it's fine.

Oh, so we're not also considering hotplugging memory to a TDX VM that 
might not be backed by TDX. Got it.

So what you want to prevent is getting !TDX memory exposed to the buddy 
such that it won't accidentally get allocated for a TDX guest, correct?

In that case, memory notifiers would indeed be fine.

Thanks!
  
Kai Huang Nov. 28, 2022, 9:50 a.m. UTC | #22
On Mon, 2022-11-28 at 10:26 +0100, David Hildenbrand wrote:
> On 28.11.22 10:21, Huang, Kai wrote:
> > On Mon, 2022-11-28 at 09:43 +0100, David Hildenbrand wrote:
> > > On 28.11.22 09:38, Huang, Kai wrote:
> > > > On Fri, 2022-11-25 at 10:28 +0100, David Hildenbrand wrote:
> > > > > On 24.11.22 10:06, Huang, Kai wrote:
> > > > > > On Wed, 2022-11-23 at 17:50 -0800, Dan Williams wrote:
> > > > > > > >      
> > > > > > > > @@ -968,6 +969,15 @@ int arch_add_memory(int nid, u64 start, u64 size,
> > > > > > > >      	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > > > > >      	unsigned long nr_pages = size >> PAGE_SHIFT;
> > > > > > > >      
> > > > > > > > +	/*
> > > > > > > > +	 * For now if TDX is enabled, all pages in the page allocator
> > > > > > > > +	 * must be TDX memory, which is a fixed set of memory regions
> > > > > > > > +	 * that are passed to the TDX module.  Reject the new region
> > > > > > > > +	 * if it is not TDX memory to guarantee above is true.
> > > > > > > > +	 */
> > > > > > > > +	if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
> > > > > > > > +		return -EINVAL;
> > > > > > > 
> > > > > > > arch_add_memory() does not add memory to the page allocator.  For
> > > > > > > example, memremap_pages() uses arch_add_memory() and explicitly does not
> > > > > > > release the memory to the page allocator.
> > > > > > 
> > > > > > Indeed.  Sorry I missed this.
> > > > > > 
> > > > > > > This check belongs in
> > > > > > > add_memory_resource() to prevent new memory that violates TDX from being
> > > > > > > onlined.
> > > > > > 
> > > > > > This would require adding another 'arch_cc_memory_compatible()' to the common
> > > > > > add_memory_resource() (I actually had such a patch long ago to work with the
> > > > > > memremap_pages() you mentioned above).
> > > > > > 
> > > > > > How about adding a memory_notifier to the TDX code, and rejecting online of
> > > > > > TDX-incompatible memory (something like below)?  The benefit is this keeps the
> > > > > > TDX code self-contained and won't pollute the common mm code:
> > > > > > 
> > > > > > +static int tdx_memory_notifier(struct notifier_block *nb,
> > > > > > +                              unsigned long action, void *v)
> > > > > > +{
> > > > > > +       struct memory_notify *mn = v;
> > > > > > +
> > > > > > +       if (action != MEM_GOING_ONLINE)
> > > > > > +               return NOTIFY_OK;
> > > > > > +
> > > > > > +       /*
> > > > > > +        * Not all memory is compatible with TDX.  Reject
> > > > > > +        * online of any incompatible memory.
> > > > > > +        */
> > > > > > +       return tdx_cc_memory_compatible(mn->start_pfn,
> > > > > > +                       mn->start_pfn + mn->nr_pages) ? NOTIFY_OK : NOTIFY_BAD;
> > > > > > +}
> > > > > > +
> > > > > > +static struct notifier_block tdx_memory_nb = {
> > > > > > +       .notifier_call = tdx_memory_notifier,
> > > > > > +};
> > > > > 
> > > > > With mhp_memmap_on_memory() some memory might already be touched during
> > > > > add_memory() (because part of the hotplug memory is used for holding the
> > > > > memmap), not when actually onlining memory. So in that case, this would
> > > > > be too late.
> > > > 
> > > > Hi David,
> > > > 
> > > > Thanks for the review!
> > > > 
> > > > Right. The memmap pages are added to the zone before online_pages(), but IIUC
> > > > those memmap pages will never be free pages thus won't be allocated by the page
> > > > allocator, correct?  Therefore in practice there won't be a problem even if they are
> > > > not TDX compatible memory.
> > > 
> > > But that memory will be read/written. Isn't that an issue, for example,
> > > if memory doesn't get accepted etc?
> > > 
> > 
> > Sorry, I don't quite understand what "if memory doesn't get accepted" means.  Do you
> > mean accepted by the TDX module?
> > 
> > Only the host kernel will read/write those memmap pages.  The TDX module won't
> > (as they won't be allocated to be used as TDX guest memory or TDX module
> > metadata).  So it's fine.
> 
> Oh, so we're not also considering hotplugging memory to a TDX VM that 
> might not be backed by TDX. Got it.
> 
> So what you want to prevent is getting !TDX memory exposed to the buddy 
> such that it won't accidentally get allocated for a TDX guest, correct?

Yes correct.

> 
> In that case, memory notifiers would indeed be fine.
> 
> Thanks!
> 

Thanks.
  
Dave Hansen Nov. 30, 2022, 10:26 p.m. UTC | #23
On 11/24/22 02:02, Huang, Kai wrote:
> Thanks for the input.  I am fine with 'tdx=force'.
> 
> Although, I'd like to point out KVM will have a module parameter 'enable_tdx'.
> 
> Hi Dave, Sean, do you have any comments?

That's fine.  Just keep it out of the initial implementation.  Ignore it
for now.
  

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index dd333b46fafb..b36129183035 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1959,6 +1959,7 @@  config INTEL_TDX_HOST
 	depends on X86_64
 	depends on KVM_INTEL
 	depends on X86_X2APIC
+	select ARCH_KEEP_MEMBLOCK
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d688228f3151..71169ecefabf 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -111,9 +111,12 @@  static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 #ifdef CONFIG_INTEL_TDX_HOST
 bool platform_tdx_enabled(void);
 int tdx_enable(void);
+bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn);
 #else	/* !CONFIG_INTEL_TDX_HOST */
 static inline bool platform_tdx_enabled(void) { return false; }
 static inline int tdx_enable(void)  { return -ENODEV; }
+static inline bool tdx_cc_memory_compatible(unsigned long start_pfn,
+		unsigned long end_pfn) { return true; }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3f040c6e5d13..900341333d7e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -55,6 +55,7 @@ 
 #include <asm/uv/uv.h>
 #include <asm/setup.h>
 #include <asm/ftrace.h>
+#include <asm/tdx.h>
 
 #include "mm_internal.h"
 
@@ -968,6 +969,15 @@  int arch_add_memory(int nid, u64 start, u64 size,
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
+	/*
+	 * For now if TDX is enabled, all pages in the page allocator
+	 * must be TDX memory, which is a fixed set of memory regions
+	 * that are passed to the TDX module.  Reject the new region
+	 * if it is not TDX memory to guarantee above is true.
+	 */
+	if (!tdx_cc_memory_compatible(start_pfn, start_pfn + nr_pages))
+		return -EINVAL;
+
 	init_memory_mapping(start, start + size, params->pgprot);
 
 	return add_pages(nid, start_pfn, nr_pages, params);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 43227af25e44..32af86e31c47 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -16,6 +16,11 @@ 
 #include <linux/smp.h>
 #include <linux/atomic.h>
 #include <linux/align.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/memblock.h>
+#include <linux/minmax.h>
+#include <linux/sizes.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/apic.h>
@@ -34,6 +39,13 @@  enum tdx_module_status_t {
 	TDX_MODULE_SHUTDOWN,
 };
 
+struct tdx_memblock {
+	struct list_head list;
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+	int nid;
+};
+
 static u32 tdx_keyid_start __ro_after_init;
 static u32 tdx_keyid_num __ro_after_init;
 
@@ -46,6 +58,9 @@  static struct tdsysinfo_struct tdx_sysinfo;
 static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
 static int tdx_cmr_num;
 
+/* All TDX-usable memory regions */
+static LIST_HEAD(tdx_memlist);
+
 /*
  * Detect TDX private KeyIDs to see whether TDX has been enabled by the
  * BIOS.  Both initializing the TDX module and running TDX guest require
@@ -329,6 +344,107 @@  static int tdx_get_sysinfo(void)
 	return trim_empty_cmrs(tdx_cmr_array, &tdx_cmr_num);
 }
 
+/* Check whether the given pfn range is covered by any CMR or not. */
+static bool pfn_range_covered_by_cmr(unsigned long start_pfn,
+				     unsigned long end_pfn)
+{
+	int i;
+
+	for (i = 0; i < tdx_cmr_num; i++) {
+		struct cmr_info *cmr = &tdx_cmr_array[i];
+		unsigned long cmr_start_pfn;
+		unsigned long cmr_end_pfn;
+
+		cmr_start_pfn = cmr->base >> PAGE_SHIFT;
+		cmr_end_pfn = (cmr->base + cmr->size) >> PAGE_SHIFT;
+
+		if (start_pfn >= cmr_start_pfn && end_pfn <= cmr_end_pfn)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Add a memory region on a given node as a TDX memory block.  The caller
+ * must make sure all memory regions are added in ascending address order
+ * and don't overlap.
+ */
+static int add_tdx_memblock(unsigned long start_pfn, unsigned long end_pfn,
+			    int nid)
+{
+	struct tdx_memblock *tmb;
+
+	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
+	if (!tmb)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&tmb->list);
+	tmb->start_pfn = start_pfn;
+	tmb->end_pfn = end_pfn;
+	tmb->nid = nid;
+
+	list_add_tail(&tmb->list, &tdx_memlist);
+	return 0;
+}
+
+static void free_tdx_memory(void)
+{
+	while (!list_empty(&tdx_memlist)) {
+		struct tdx_memblock *tmb = list_first_entry(&tdx_memlist,
+				struct tdx_memblock, list);
+
+		list_del(&tmb->list);
+		kfree(tmb);
+	}
+}
+
+/*
+ * Add all memblock memory regions to the @tdx_memlist as TDX memory.
+ * Must be called while the caller holds get_online_mems().
+ */
+static int build_tdx_memory(void)
+{
+	unsigned long start_pfn, end_pfn;
+	int i, nid, ret;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+		/*
+		 * The first 1MB may not be reported as TDX convertible
+		 * memory.  Manually exclude it from TDX memory.
+		 *
+		 * This is fine as the first 1MB is already reserved in
+		 * reserve_real_mode() and won't end up in ZONE_DMA as
+		 * a free page anyway.
+		 */
+		start_pfn = max(start_pfn, (unsigned long)SZ_1M >> PAGE_SHIFT);
+		if (start_pfn >= end_pfn)
+			continue;
+
+		/* Verify memory is truly TDX convertible memory */
+		if (!pfn_range_covered_by_cmr(start_pfn, end_pfn)) {
+			pr_info("Memory region [0x%lx, 0x%lx) is not TDX convertible memory.\n",
+					start_pfn << PAGE_SHIFT,
+					end_pfn << PAGE_SHIFT);
+			return -EINVAL;
+		}
+
+		/*
+		 * Add the memory regions as TDX memory.  The regions in
+		 * memblock are already guaranteed to be in ascending
+		 * address order and not to overlap.
+		 */
+		ret = add_tdx_memblock(start_pfn, end_pfn, nid);
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	free_tdx_memory();
+	return ret;
+}
+
 /*
  * Detect and initialize the TDX module.
  *
@@ -357,12 +473,56 @@  static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/*
+	 * All memory regions that can be used by the TDX module must be
+	 * passed to the TDX module during the module initialization.
+	 * Once this is done, all "TDX-usable" memory regions are fixed
+	 * during the module's runtime.
+	 *
+	 * The initial support of TDX guests only allocates memory from
+	 * the global page allocator.  To keep things simple, for now
+	 * just make sure all pages in the page allocator are TDX memory.
+	 *
+	 * To achieve this, use all system memory in the core-mm at the
+	 * time of initializing the TDX module as TDX memory, and in the
+	 * meantime, reject any new memory in memory hot-add.
+	 *
+	 * This works because, in practice, all boot-time present DIMMs
+	 * are TDX convertible memory.  However, if any new memory is
+	 * hot-added before initializing the TDX module, initialization
+	 * will fail because that memory is not covered by any CMR.
+	 *
+	 * This can be enhanced in the future, i.e. by allowing adding or
+	 * onlining non-TDX memory to a separate node, in which case the
+	 * "TDX-capable" nodes and the "non-TDX-capable" nodes can exist
+	 * together -- the userspace/kernel just needs to make sure pages
+	 * for TDX guests always come from those "TDX-capable" nodes.
+	 *
+	 * Build the list of TDX memory regions as mentioned above so
+	 * they can be passed to the TDX module later.
+	 */
+	get_online_mems();
+
+	ret = build_tdx_memory();
+	if (ret)
+		goto out;
 	/*
 	 * Return -EINVAL until all steps of TDX module initialization
 	 * process are done.
 	 */
 	ret = -EINVAL;
 out:
+	/*
+	 * Memory hotplug checks the hot-added memory region against the
+	 * @tdx_memlist to see if the region is TDX memory.
+	 *
+	 * Do put_online_mems() here to make sure any modification to
+	 * @tdx_memlist is done while holding the memory hotplug read
+	 * lock, so that the memory hotplug path can just check the
+	 * @tdx_memlist w/o holding the @tdx_module_lock which may cause
+	 * deadlock.
+	 */
+	put_online_mems();
 	return ret;
 }
 
@@ -485,3 +645,26 @@  int tdx_enable(void)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(tdx_enable);
+
+/*
+ * Check whether the given range is TDX memory.  Must be called between
+ * mem_hotplug_begin()/mem_hotplug_done().
+ */
+bool tdx_cc_memory_compatible(unsigned long start_pfn, unsigned long end_pfn)
+{
+	struct tdx_memblock *tmb;
+
+	/* An empty list means TDX wasn't enabled successfully; allow all memory */
+	if (list_empty(&tdx_memlist))
+		return true;
+
+	list_for_each_entry(tmb, &tdx_memlist, list) {
+		/*
+		 * The new range is TDX memory if it is fully covered
+		 * by any TDX memory block.
+		 */
+		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
+			return true;
+	}
+	return false;
+}
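
A standalone userspace illustration (not kernel code, and not part of
this patch) of the containment rule tdx_cc_memory_compatible()
implements: a pfn range counts as TDX memory only if a single TDX
memory block fully covers it, so a range straddling two blocks is
rejected even though both of its ends lie in some block.

#include <stdbool.h>
#include <stdio.h>

struct range { unsigned long start_pfn, end_pfn; };

/* Mirrors the list walk: true iff one block fully covers the range. */
static bool covered(struct range r, const struct range *blocks, int n)
{
	for (int i = 0; i < n; i++)
		if (r.start_pfn >= blocks[i].start_pfn &&
		    r.end_pfn <= blocks[i].end_pfn)
			return true;
	return false;
}

int main(void)
{
	/* Two hypothetical TDX memory blocks with a gap in between. */
	struct range tdx[] = { { 0x100, 0x80000 }, { 0x100000, 0x180000 } };

	/* Fully inside the first block: accepted (prints 1). */
	printf("%d\n", covered((struct range){ 0x200, 0x400 }, tdx, 2));
	/*
	 * Straddles the gap: rejected (prints 0), even though the start
	 * lies in the first block and the end in the second.
	 */
	printf("%d\n", covered((struct range){ 0x7ff00, 0x100100 }, tdx, 2));
	return 0;
}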