[v5,1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks

Message ID 20231005-vv-kmem_memmap-v5-1-a54d1981f0a3@intel.com
State New
Series mm: use memmap_on_memory semantics for dax/kmem

Commit Message

Verma, Vishal L Oct. 5, 2023, 6:31 p.m. UTC
The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
'memblock_size' chunks of memory being added. Adding a larger span of
memory precludes memmap_on_memory semantics.

For users of hotplug such as kmem, large amounts of memory might get
added from the CXL subsystem. In some cases, this amount may exceed the
available 'main memory' to store the memmap for the memory being added.
In this case, it is useful to have a way to place the memmap on the
memory being added, even if it means splitting the addition into
memblock-sized chunks.

Change add_memory_resource() to loop over memblock-sized chunks of
memory if caller requested memmap_on_memory, and if other conditions for
it are met. Teach try_remove_memory() to also expect that a memory
range being removed might have been split up into memblock sized chunks,
and to loop through those as needed.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 mm/memory_hotplug.c | 162 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 99 insertions(+), 63 deletions(-)
  

Comments

Dan Williams Oct. 5, 2023, 9:20 p.m. UTC | #1
Vishal Verma wrote:
> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
> 
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
> 
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  mm/memory_hotplug.c | 162 ++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 99 insertions(+), 63 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f8d3e7427e32..77ec6f15f943 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
>  	return arch_supports_memmap_on_memory(vmemmap_size);
>  }
>  
> +static int add_memory_create_devices(int nid, struct memory_group *group,
> +				     u64 start, u64 size, mhp_t mhp_flags)
> +{
> +	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	struct vmem_altmap mhp_altmap = {
> +		.base_pfn =  PHYS_PFN(start),
> +		.end_pfn  =  PHYS_PFN(start + size - 1),
> +	};
> +	int ret;
> +
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> +		mhp_altmap.free = memory_block_memmap_on_memory_pages();
> +		params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
> +		if (!params.altmap)
> +			return -ENOMEM;
> +
> +		memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));

Isn't this just open coded kmemdup()?

Other than that, I am not seeing anything else to comment on, you can add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  
David Hildenbrand Oct. 6, 2023, 12:52 p.m. UTC | #2
On 05.10.23 20:31, Vishal Verma wrote:
> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
> 
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
> 
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
> 

Maybe add that this implies that we're not making use of PUD mappings in
the direct map yet, and link to the proposal on how we could eventually
optimize that.
[...]

>   
> -static int __ref try_remove_memory(u64 start, u64 size)
> +static void __ref remove_memory_block_and_altmap(int nid, u64 start, u64 size)


You shouldn't need the nid, right?

>   {
> +	int rc = 0;
>   	struct memory_block *mem;
> -	int rc = 0, nid = NUMA_NO_NODE;
>   	struct vmem_altmap *altmap = NULL;
>   


> +	rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
> +	if (rc) {
> +		altmap = mem->altmap;
> +		/*
> +		 * Mark altmap NULL so that we can add a debug
> +		 * check on memblock free.
> +		 */
> +		mem->altmap = NULL;
> +	}
> +
> +	/*
> +	 * Memory block device removal under the device_hotplug_lock is
> +	 * a barrier against racing online attempts.
> +	 */
> +	remove_memory_block_devices(start, size);

We're now calling that under the memory hotplug lock. I assume this is
fine, but I remember some ugly lockdep details... should be alright, I guess.

> +
> +	arch_remove_memory(start, size, altmap);
> +
> +	/* Verify that all vmemmap pages have actually been freed. */
> +	if (altmap) {
> +		WARN(altmap->alloc, "Altmap not fully unmapped");
> +		kfree(altmap);
> +	}
> +}
> +
> +static int __ref try_remove_memory(u64 start, u64 size)
> +{
> +	int rc, nid = NUMA_NO_NODE;
> +
>   	BUG_ON(check_hotplug_memory_range(start, size));
>   
>   	/*
> @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
>   	if (rc)
>   		return rc;
>   
> +	mem_hotplug_begin();
> +
>   	/*
> -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> -	 * the same granularity it was added - a single memory block.
> +	 * For memmap_on_memory, the altmaps could have been added on
> +	 * a per-memblock basis. Loop through the entire range if so,
> +	 * and remove each memblock and its altmap.
>   	 */
>   	if (mhp_memmap_on_memory()) {
> -		rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
> -		if (rc) {
> -			if (size != memory_block_size_bytes()) {
> -				pr_warn("Refuse to remove %#llx - %#llx,"
> -					"wrong granularity\n",
> -					start, start + size);
> -				return -EINVAL;
> -			}
> -			altmap = mem->altmap;
> -			/*
> -			 * Mark altmap NULL so that we can add a debug
> -			 * check on memblock free.
> -			 */
> -			mem->altmap = NULL;
> -		}
> +		unsigned long memblock_size = memory_block_size_bytes();
> +		u64 cur_start;
> +
> +		for (cur_start = start; cur_start < start + size;
> +		     cur_start += memblock_size)
> +			remove_memory_block_and_altmap(nid, cur_start,
> +						       memblock_size);
> +	} else {
> +		remove_memory_block_and_altmap(nid, start, size);

Better call remove_memory_block_devices() and arch_remove_memory(start, 
size, altmap) here explicitly instead of using 
remove_memory_block_and_altmap() that really can only handle a single 
memory block with any inputs.


>   	}
>   
>   	/* remove memmap entry */
>   	firmware_map_remove(start, start + size, "System RAM");

Can we continue doing that in the old order? (IOW before taking the lock?).

>   
> -	/*
> -	 * Memory block device removal under the device_hotplug_lock is
> -	 * a barrier against racing online attempts.
> -	 */
> -	remove_memory_block_devices(start, size);
> -
> -	mem_hotplug_begin();
> -
> -	arch_remove_memory(start, size, altmap);
> -
> -	/* Verify that all vmemmap pages have actually been freed. */
> -	if (altmap) {
> -		WARN(altmap->alloc, "Altmap not fully unmapped");
> -		kfree(altmap);
> -	}
> -
>   	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
>   		memblock_phys_free(start, size);
>   		memblock_remove(start, size);
> @@ -2219,6 +2254,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
>   		try_offline_node(nid);
>   
>   	mem_hotplug_done();
> +

Unrelated change.

>   	return 0;
>   }
>   
>
  
Verma, Vishal L Oct. 6, 2023, 4:46 p.m. UTC | #3
On Thu, 2023-10-05 at 14:20 -0700, Dan Williams wrote:
> Vishal Verma wrote:
<..>
> > 
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
> >         return arch_supports_memmap_on_memory(vmemmap_size);
> >  }
> >  
> > +static int add_memory_create_devices(int nid, struct memory_group *group,
> > +                                    u64 start, u64 size, mhp_t mhp_flags)
> > +{
> > +       struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> > +       struct vmem_altmap mhp_altmap = {
> > +               .base_pfn =  PHYS_PFN(start),
> > +               .end_pfn  =  PHYS_PFN(start + size - 1),
> > +       };
> > +       int ret;
> > +
> > +       if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> > +               mhp_altmap.free = memory_block_memmap_on_memory_pages();
> > +               params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
> > +               if (!params.altmap)
> > +                       return -ENOMEM;
> > +
> > +               memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
> 
> Isn't this just open coded kmemdup()?

Ah yes - it was existing code that I just moved, but I can add a
precursor cleanup patch to change it.
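
The cleanup discussed here can be sketched in userspace. This is only an illustration, not the patch itself: `altmap_sketch` is a trimmed-down stand-in for `struct vmem_altmap`, and `memdup_sketch()` is a hypothetical analogue of the kernel's `kmemdup()` built on `malloc()`. It shows that the allocate-then-memcpy sequence and a single duplicate call produce the same copy.

```c
#include <stdlib.h>
#include <string.h>

/* Trimmed-down stand-in for struct vmem_altmap (illustrative fields only). */
struct altmap_sketch {
	unsigned long base_pfn;
	unsigned long end_pfn;
	unsigned long free;
};

/* Hypothetical userspace analogue of kmemdup(): allocate len bytes, copy src. */
static void *memdup_sketch(const void *src, size_t len)
{
	void *p = malloc(len);

	if (p)
		memcpy(p, src, len);
	return p;
}

/* Open-coded form, as in the hunk under review: allocate, check, then copy. */
static struct altmap_sketch *dup_open_coded(const struct altmap_sketch *src)
{
	struct altmap_sketch *copy = malloc(sizeof(*copy));

	if (!copy)
		return NULL;
	memcpy(copy, src, sizeof(*copy));
	return copy;
}

/* Same result through the helper; the caller still has to check for NULL. */
static struct altmap_sketch *dup_with_helper(const struct altmap_sketch *src)
{
	return memdup_sketch(src, sizeof(*src));
}
```

In the kernel, the equivalent would presumably be a single `kmemdup(&mhp_altmap, sizeof(struct vmem_altmap), GFP_KERNEL)` call followed by the existing NULL check.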

> 
> Other than that, I am not seeing anything else to comment on, you can add:
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>

Thanks Dan!
  
Verma, Vishal L Oct. 6, 2023, 10:01 p.m. UTC | #4
On Fri, 2023-10-06 at 14:52 +0200, David Hildenbrand wrote:
> On 05.10.23 20:31, Vishal Verma wrote:
> > 
<..>
> > @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
> >         if (rc)
> >                 return rc;
> >   
> > +       mem_hotplug_begin();
> > +
> >         /*
> > -        * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> > -        * the same granularity it was added - a single memory block.
> > +        * For memmap_on_memory, the altmaps could have been added on
> > +        * a per-memblock basis. Loop through the entire range if so,
> > +        * and remove each memblock and its altmap.
> >          */
> >         if (mhp_memmap_on_memory()) {
> > -               rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
> > -               if (rc) {
> > -                       if (size != memory_block_size_bytes()) {
> > -                               pr_warn("Refuse to remove %#llx - %#llx,"
> > -                                       "wrong granularity\n",
> > -                                       start, start + size);
> > -                               return -EINVAL;
> > -                       }
> > -                       altmap = mem->altmap;
> > -                       /*
> > -                        * Mark altmap NULL so that we can add a debug
> > -                        * check on memblock free.
> > -                        */
> > -                       mem->altmap = NULL;
> > -               }
> > +               unsigned long memblock_size = memory_block_size_bytes();
> > +               u64 cur_start;
> > +
> > +               for (cur_start = start; cur_start < start + size;
> > +                    cur_start += memblock_size)
> > +                       remove_memory_block_and_altmap(nid, cur_start,
> > +                                                      memblock_size);
> > +       } else {
> > +               remove_memory_block_and_altmap(nid, start, size);
> 
> Better call remove_memory_block_devices() and arch_remove_memory(start, 
> size, altmap) here explicitly instead of using 
> remove_memory_block_and_altmap() that really can only handle a single
> memory block with any inputs.
> 
I'm not sure I follow. Even in the non memmap_on_memory case, we'd have
to walk_memory_blocks() to get to the memory_block->altmap, right?

Or is there a more direct way? If we have to walk_memory_blocks, what's
the advantage of calling those directly instead of calling the helper
created above?

Agreed with and fixed up all the other comments.
  
Huang, Ying Oct. 7, 2023, 8:55 a.m. UTC | #5
Vishal Verma <vishal.l.verma@intel.com> writes:

> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  mm/memory_hotplug.c | 162 ++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 99 insertions(+), 63 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f8d3e7427e32..77ec6f15f943 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
>  	return arch_supports_memmap_on_memory(vmemmap_size);
>  }
>  
> +static int add_memory_create_devices(int nid, struct memory_group *group,
> +				     u64 start, u64 size, mhp_t mhp_flags)
> +{
> +	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	struct vmem_altmap mhp_altmap = {
> +		.base_pfn =  PHYS_PFN(start),
> +		.end_pfn  =  PHYS_PFN(start + size - 1),
> +	};
> +	int ret;
> +
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> +		mhp_altmap.free = memory_block_memmap_on_memory_pages();
> +		params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
> +		if (!params.altmap)
> +			return -ENOMEM;
> +
> +		memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
> +	}
> +
> +	/* call arch's memory hotadd */
> +	ret = arch_add_memory(nid, start, size, &params);
> +	if (ret < 0)
> +		goto error;
> +
> +	/* create memory block devices after memory was added */
> +	ret = create_memory_block_devices(start, size, params.altmap, group);
> +	if (ret)
> +		goto err_bdev;
> +
> +	return 0;
> +
> +err_bdev:
> +	arch_remove_memory(start, size, NULL);
> +error:
> +	kfree(params.altmap);
> +	return ret;
> +}
> +
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>   * and online/offline operations (triggered e.g. by sysfs).
> @@ -1388,14 +1426,10 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
>   */
>  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  {
> -	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	unsigned long memblock_size = memory_block_size_bytes();
>  	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> -	struct vmem_altmap mhp_altmap = {
> -		.base_pfn =  PHYS_PFN(res->start),
> -		.end_pfn  =  PHYS_PFN(res->end),
> -	};
>  	struct memory_group *group = NULL;
> -	u64 start, size;
> +	u64 start, size, cur_start;
>  	bool new_node = false;
>  	int ret;
>  
> @@ -1436,28 +1470,21 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  	/*
>  	 * Self hosted memmap array
>  	 */
> -	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> -		if (mhp_supports_memmap_on_memory(size)) {
> -			mhp_altmap.free = memory_block_memmap_on_memory_pages();
> -			params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
> -			if (!params.altmap)
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
> +	    mhp_supports_memmap_on_memory(memblock_size)) {
> +		for (cur_start = start; cur_start < start + size;
> +		     cur_start += memblock_size) {
> +			ret = add_memory_create_devices(nid, group, cur_start,
> +							memblock_size,
> +							mhp_flags);
> +			if (ret)
>  				goto error;
> -
> -			memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
>  		}
> -		/* fallback to not using altmap  */
> -	}
> -
> -	/* call arch's memory hotadd */
> -	ret = arch_add_memory(nid, start, size, &params);
> -	if (ret < 0)
> -		goto error_free;
> -
> -	/* create memory block devices after memory was added */
> -	ret = create_memory_block_devices(start, size, params.altmap, group);
> -	if (ret) {
> -		arch_remove_memory(start, size, NULL);
> -		goto error_free;
> +	} else {
> +		ret = add_memory_create_devices(nid, group, start, size,
> +						mhp_flags);
> +		if (ret)
> +			goto error;
>  	}
>  
>  	if (new_node) {
> @@ -1494,8 +1521,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  		walk_memory_blocks(start, size, NULL, online_memory_block);
>  
>  	return ret;
> -error_free:
> -	kfree(params.altmap);
>  error:
>  	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
>  		memblock_remove(start, size);
> @@ -2146,12 +2171,41 @@ void try_offline_node(int nid)
>  }
>  EXPORT_SYMBOL(try_offline_node);
>  
> -static int __ref try_remove_memory(u64 start, u64 size)
> +static void __ref remove_memory_block_and_altmap(int nid, u64 start, u64 size)
>  {
> +	int rc = 0;
>  	struct memory_block *mem;
> -	int rc = 0, nid = NUMA_NO_NODE;
>  	struct vmem_altmap *altmap = NULL;
>  
> +	rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
> +	if (rc) {
> +		altmap = mem->altmap;
> +		/*
> +		 * Mark altmap NULL so that we can add a debug
> +		 * check on memblock free.
> +		 */
> +		mem->altmap = NULL;
> +	}
> +
> +	/*
> +	 * Memory block device removal under the device_hotplug_lock is
> +	 * a barrier against racing online attempts.
> +	 */
> +	remove_memory_block_devices(start, size);
> +
> +	arch_remove_memory(start, size, altmap);
> +
> +	/* Verify that all vmemmap pages have actually been freed. */
> +	if (altmap) {
> +		WARN(altmap->alloc, "Altmap not fully unmapped");
> +		kfree(altmap);
> +	}
> +}
> +
> +static int __ref try_remove_memory(u64 start, u64 size)
> +{
> +	int rc, nid = NUMA_NO_NODE;
> +
>  	BUG_ON(check_hotplug_memory_range(start, size));
>  
>  	/*
> @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  	if (rc)
>  		return rc;
>  
> +	mem_hotplug_begin();
> +
>  	/*
> -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> -	 * the same granularity it was added - a single memory block.
> +	 * For memmap_on_memory, the altmaps could have been added on
> +	 * a per-memblock basis. Loop through the entire range if so,
> +	 * and remove each memblock and its altmap.
>  	 */
>  	if (mhp_memmap_on_memory()) {

IIUC, even if mhp_memmap_on_memory() returns true, it's still possible
that the memmap is put in DRAM after [2/2].  In that case,
arch_remove_memory() is called for each memory block unnecessarily.  Can
we detect this (via altmap?) and call remove_memory_block_and_altmap()
for the whole range?

> -		rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
> -		if (rc) {
> -			if (size != memory_block_size_bytes()) {
> -				pr_warn("Refuse to remove %#llx - %#llx,"
> -					"wrong granularity\n",
> -					start, start + size);
> -				return -EINVAL;
> -			}
> -			altmap = mem->altmap;
> -			/*
> -			 * Mark altmap NULL so that we can add a debug
> -			 * check on memblock free.
> -			 */
> -			mem->altmap = NULL;
> -		}
> +		unsigned long memblock_size = memory_block_size_bytes();
> +		u64 cur_start;
> +
> +		for (cur_start = start; cur_start < start + size;
> +		     cur_start += memblock_size)
> +			remove_memory_block_and_altmap(nid, cur_start,
> +						       memblock_size);
> +	} else {
> +		remove_memory_block_and_altmap(nid, start, size);
>  	}
>  
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
>  
> -	/*
> -	 * Memory block device removal under the device_hotplug_lock is
> -	 * a barrier against racing online attempts.
> -	 */
> -	remove_memory_block_devices(start, size);
> -
> -	mem_hotplug_begin();
> -
> -	arch_remove_memory(start, size, altmap);
> -
> -	/* Verify that all vmemmap pages have actually been freed. */
> -	if (altmap) {
> -		WARN(altmap->alloc, "Altmap not fully unmapped");
> -		kfree(altmap);
> -	}
> -
>  	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
>  		memblock_phys_free(start, size);
>  		memblock_remove(start, size);
> @@ -2219,6 +2254,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  		try_offline_node(nid);
>  
>  	mem_hotplug_done();
> +
>  	return 0;
>  }

--
Best Regards,
Huang, Ying
  
David Hildenbrand Oct. 9, 2023, 3:04 p.m. UTC | #6
On 07.10.23 10:55, Huang, Ying wrote:
> Vishal Verma <vishal.l.verma@intel.com> writes:
> 
>> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
>> 'memblock_size' chunks of memory being added. Adding a larger span of
>> memory precludes memmap_on_memory semantics.
>>
>> For users of hotplug such as kmem, large amounts of memory might get
>> added from the CXL subsystem. In some cases, this amount may exceed the
>> available 'main memory' to store the memmap for the memory being added.
>> In this case, it is useful to have a way to place the memmap on the
>> memory being added, even if it means splitting the addition into
>> memblock-sized chunks.
>>
>> Change add_memory_resource() to loop over memblock-sized chunks of
>> memory if caller requested memmap_on_memory, and if other conditions for
>> it are met. Teach try_remove_memory() to also expect that a memory
>> range being removed might have been split up into memblock sized chunks,
>> and to loop through those as needed.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Jiang <dave.jiang@intel.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
>> ---
>>   mm/memory_hotplug.c | 162 ++++++++++++++++++++++++++++++++--------------------
>>   1 file changed, 99 insertions(+), 63 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index f8d3e7427e32..77ec6f15f943 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
>>   	return arch_supports_memmap_on_memory(vmemmap_size);
>>   }
>>   
>> +static int add_memory_create_devices(int nid, struct memory_group *group,
>> +				     u64 start, u64 size, mhp_t mhp_flags)
>> +{
>> +	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>> +	struct vmem_altmap mhp_altmap = {
>> +		.base_pfn =  PHYS_PFN(start),
>> +		.end_pfn  =  PHYS_PFN(start + size - 1),
>> +	};
>> +	int ret;
>> +
>> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
>> +		mhp_altmap.free = memory_block_memmap_on_memory_pages();
>> +		params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
>> +		if (!params.altmap)
>> +			return -ENOMEM;
>> +
>> +		memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
>> +	}
>> +
>> +	/* call arch's memory hotadd */
>> +	ret = arch_add_memory(nid, start, size, &params);
>> +	if (ret < 0)
>> +		goto error;
>> +
>> +	/* create memory block devices after memory was added */
>> +	ret = create_memory_block_devices(start, size, params.altmap, group);
>> +	if (ret)
>> +		goto err_bdev;
>> +
>> +	return 0;
>> +
>> +err_bdev:
>> +	arch_remove_memory(start, size, NULL);
>> +error:
>> +	kfree(params.altmap);
>> +	return ret;
>> +}
>> +
>>   /*
>>    * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>>    * and online/offline operations (triggered e.g. by sysfs).
>> @@ -1388,14 +1426,10 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
>>    */
>>   int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>   {
>> -	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>> +	unsigned long memblock_size = memory_block_size_bytes();
>>   	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>> -	struct vmem_altmap mhp_altmap = {
>> -		.base_pfn =  PHYS_PFN(res->start),
>> -		.end_pfn  =  PHYS_PFN(res->end),
>> -	};
>>   	struct memory_group *group = NULL;
>> -	u64 start, size;
>> +	u64 start, size, cur_start;
>>   	bool new_node = false;
>>   	int ret;
>>   
>> @@ -1436,28 +1470,21 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>   	/*
>>   	 * Self hosted memmap array
>>   	 */
>> -	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>> -		if (mhp_supports_memmap_on_memory(size)) {
>> -			mhp_altmap.free = memory_block_memmap_on_memory_pages();
>> -			params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
>> -			if (!params.altmap)
>> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
>> +	    mhp_supports_memmap_on_memory(memblock_size)) {
>> +		for (cur_start = start; cur_start < start + size;
>> +		     cur_start += memblock_size) {
>> +			ret = add_memory_create_devices(nid, group, cur_start,
>> +							memblock_size,
>> +							mhp_flags);
>> +			if (ret)
>>   				goto error;
>> -
>> -			memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
>>   		}
>> -		/* fallback to not using altmap  */
>> -	}
>> -
>> -	/* call arch's memory hotadd */
>> -	ret = arch_add_memory(nid, start, size, &params);
>> -	if (ret < 0)
>> -		goto error_free;
>> -
>> -	/* create memory block devices after memory was added */
>> -	ret = create_memory_block_devices(start, size, params.altmap, group);
>> -	if (ret) {
>> -		arch_remove_memory(start, size, NULL);
>> -		goto error_free;
>> +	} else {
>> +		ret = add_memory_create_devices(nid, group, start, size,
>> +						mhp_flags);
>> +		if (ret)
>> +			goto error;
>>   	}
>>   
>>   	if (new_node) {
>> @@ -1494,8 +1521,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>   		walk_memory_blocks(start, size, NULL, online_memory_block);
>>   
>>   	return ret;
>> -error_free:
>> -	kfree(params.altmap);
>>   error:
>>   	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
>>   		memblock_remove(start, size);
>> @@ -2146,12 +2171,41 @@ void try_offline_node(int nid)
>>   }
>>   EXPORT_SYMBOL(try_offline_node);
>>   
>> -static int __ref try_remove_memory(u64 start, u64 size)
>> +static void __ref remove_memory_block_and_altmap(int nid, u64 start, u64 size)
>>   {
>> +	int rc = 0;
>>   	struct memory_block *mem;
>> -	int rc = 0, nid = NUMA_NO_NODE;
>>   	struct vmem_altmap *altmap = NULL;
>>   
>> +	rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
>> +	if (rc) {
>> +		altmap = mem->altmap;
>> +		/*
>> +		 * Mark altmap NULL so that we can add a debug
>> +		 * check on memblock free.
>> +		 */
>> +		mem->altmap = NULL;
>> +	}
>> +
>> +	/*
>> +	 * Memory block device removal under the device_hotplug_lock is
>> +	 * a barrier against racing online attempts.
>> +	 */
>> +	remove_memory_block_devices(start, size);
>> +
>> +	arch_remove_memory(start, size, altmap);
>> +
>> +	/* Verify that all vmemmap pages have actually been freed. */
>> +	if (altmap) {
>> +		WARN(altmap->alloc, "Altmap not fully unmapped");
>> +		kfree(altmap);
>> +	}
>> +}
>> +
>> +static int __ref try_remove_memory(u64 start, u64 size)
>> +{
>> +	int rc, nid = NUMA_NO_NODE;
>> +
>>   	BUG_ON(check_hotplug_memory_range(start, size));
>>   
>>   	/*
>> @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
>>   	if (rc)
>>   		return rc;
>>   
>> +	mem_hotplug_begin();
>> +
>>   	/*
>> -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
>> -	 * the same granularity it was added - a single memory block.
>> +	 * For memmap_on_memory, the altmaps could have been added on
>> +	 * a per-memblock basis. Loop through the entire range if so,
>> +	 * and remove each memblock and its altmap.
>>   	 */
>>   	if (mhp_memmap_on_memory()) {
> 
> IIUC, even if mhp_memmap_on_memory() returns true, it's still possible
> that the memmap is put in DRAM after [2/2].  In that case,
> arch_remove_memory() is called for each memory block unnecessarily.  Can
> we detect this (via altmap?) and call remove_memory_block_and_altmap()
> for the whole range?

Good point. We should handle this memblock-per-memblock only if we have to
handle the altmap. Otherwise, just call a separate function that doesn't
care about it, e.g., one called remove_memory_blocks_no_altmap().

We could simply walk all memory blocks and make sure either all have an 
altmap or none has an altmap. If there is a mix, we should bail out with 
WARN_ON_ONCE().
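
The upfront check described above could look roughly like the following userspace sketch. It is an assumption-laden illustration, not kernel code: `block_sketch`, `altmap_state` and `classify_range()` are hypothetical names, and a plain array stands in for what walk_memory_blocks() would visit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a memory block; only altmap presence matters here. */
struct block_sketch {
	bool has_altmap;
};

enum altmap_state {
	ALTMAP_NONE,	/* no block in the range carries an altmap */
	ALTMAP_ALL,	/* every block in the range carries an altmap */
	ALTMAP_MIXED,	/* inconsistent range: caller should warn and bail out */
};

/* Count blocks with an altmap and classify the range as all, none, or mixed. */
static enum altmap_state classify_range(const struct block_sketch *blocks,
					size_t nr)
{
	size_t i, with_altmap = 0;

	for (i = 0; i < nr; i++)
		if (blocks[i].has_altmap)
			with_altmap++;

	if (with_altmap == 0)
		return ALTMAP_NONE;
	if (with_altmap == nr)
		return ALTMAP_ALL;
	return ALTMAP_MIXED;
}
```

A caller would treat ALTMAP_MIXED as the WARN_ON_ONCE() case and refuse the removal, and otherwise pick the removal path that matches the result.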
  
David Hildenbrand Oct. 9, 2023, 3:15 p.m. UTC | #7
On 07.10.23 00:01, Verma, Vishal L wrote:
> On Fri, 2023-10-06 at 14:52 +0200, David Hildenbrand wrote:
>> On 05.10.23 20:31, Vishal Verma wrote:
>>>
> <..>
>>> @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
>>>          if (rc)
>>>                  return rc;
>>>    
>>> +       mem_hotplug_begin();
>>> +
>>>          /*
>>> -        * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
>>> -        * the same granularity it was added - a single memory block.
>>> +        * For memmap_on_memory, the altmaps could have been added on
>>> +        * a per-memblock basis. Loop through the entire range if so,
>>> +        * and remove each memblock and its altmap.
>>>           */
>>>          if (mhp_memmap_on_memory()) {
>>> -               rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
>>> -               if (rc) {
>>> -                       if (size != memory_block_size_bytes()) {
>>> -                               pr_warn("Refuse to remove %#llx - %#llx,"
>>> -                                       "wrong granularity\n",
>>> -                                       start, start + size);
>>> -                               return -EINVAL;
>>> -                       }
>>> -                       altmap = mem->altmap;
>>> -                       /*
>>> -                        * Mark altmap NULL so that we can add a debug
>>> -                        * check on memblock free.
>>> -                        */
>>> -                       mem->altmap = NULL;
>>> -               }
>>> +               unsigned long memblock_size = memory_block_size_bytes();
>>> +               u64 cur_start;
>>> +
>>> +               for (cur_start = start; cur_start < start + size;
>>> +                    cur_start += memblock_size)
>>> +                       remove_memory_block_and_altmap(nid, cur_start,
>>> +                                                      memblock_size);
>>> +       } else {
>>> +               remove_memory_block_and_altmap(nid, start, size);
>>
>> Better call remove_memory_block_devices() and arch_remove_memory(start,
>> size, altmap) here explicitly instead of using
>> remove_memory_block_and_altmap() that really can only handle a single
>> memory block with any inputs.
>>
> I'm not sure I follow. Even in the non memmap_on_memory case, we'd have
> to walk_memory_blocks() to get to the memory_block->altmap, right?

See my other reply; at least with mhp_memmap_on_memory()==false, we 
don't have to worry about the altmap.

> 
> Or is there a more direct way? If we have to walk_memory_blocks, what's
> the advantage of calling those directly instead of calling the helper
> created above?

I think we have two cases to handle

1) All have an altmap. Remove them block-by-block. Probably we should 
call a function remove_memory_blocks(altmap=true) [or alternatively 
remove_memory_blocks_and_altmaps()] and just handle iterating internally.

2) None have an altmap. We can remove them in one go. Probably we 
should call that remove_memory_blocks(altmap=false) [or alternatively 
remove_memory_blocks_no_altmaps()].

I guess it's best to do a walk upfront to make sure either all have an 
altmap or none has one. Then we can branch off to the right function 
knowing whether we have to process altmaps or not.

The existing

if (mhp_memmap_on_memory()) {
	...
}

can be extended for that case.

Please let me know if I failed to express what I mean, then I can 
briefly prototype it on top of your changes.
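
For illustration only, the two cases above could be modelled in
userspace roughly as follows. This is a sketch under assumptions, not
the kernel code: struct mock_block, classify_blocks(), mock_try_remove()
and the two mock_remove_* stubs are hypothetical names, and returning -1
for a mixed range is just one possible way to bail out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct memory_block: only the altmap flag. */
struct mock_block {
	bool has_altmap;
};

enum altmap_state { ALTMAP_NONE, ALTMAP_ALL, ALTMAP_MIXED };

/* Walk the range once up front and classify it. */
static enum altmap_state classify_blocks(const struct mock_block *blocks,
					 size_t n)
{
	size_t with = 0;

	for (size_t i = 0; i < n; i++)
		if (blocks[i].has_altmap)
			with++;

	if (with == 0)
		return ALTMAP_NONE;
	return with == n ? ALTMAP_ALL : ALTMAP_MIXED;
}

/* Counters standing in for the real removal work. */
static int per_block_removals;
static int whole_range_removals;

/* Case 1: every block has an altmap, so remove block by block. */
static void mock_remove_blocks_and_altmaps(size_t n)
{
	per_block_removals += (int)n;
}

/* Case 2: no block has an altmap, so remove the range in one go. */
static void mock_remove_blocks_no_altmaps(void)
{
	whole_range_removals++;
}

/* Returns 0 on success, -1 if the range mixes both kinds of blocks. */
static int mock_try_remove(const struct mock_block *blocks, size_t n)
{
	assert(n > 0);

	switch (classify_blocks(blocks, n)) {
	case ALTMAP_ALL:
		mock_remove_blocks_and_altmaps(n);
		return 0;
	case ALTMAP_NONE:
		mock_remove_blocks_no_altmaps();
		return 0;
	default:
		return -1; /* mixed: refuse to touch anything */
	}
}
```

The point of classifying first is that the removal path never has to
discover a missing altmap halfway through the range.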
  
Verma, Vishal L Oct. 12, 2023, 5:53 a.m. UTC | #8
On Mon, 2023-10-09 at 17:04 +0200, David Hildenbrand wrote:
> On 07.10.23 10:55, Huang, Ying wrote:
> > Vishal Verma <vishal.l.verma@intel.com> writes:
> > 
> > > @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
> > >         if (rc)
> > >                 return rc;
> > >   
> > > +       mem_hotplug_begin();
> > > +
> > >         /*
> > > -        * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> > > -        * the same granularity it was added - a single memory block.
> > > +        * For memmap_on_memory, the altmaps could have been added on
> > > +        * a per-memblock basis. Loop through the entire range if so,
> > > +        * and remove each memblock and its altmap.
> > >          */
> > >         if (mhp_memmap_on_memory()) {
> > 
> > IIUC, even if mhp_memmap_on_memory() returns true, it's still possible
> > that the memmap is put in DRAM after [2/2].  So that,
> > arch_remove_memory() are called for each memory block unnecessarily.  Can
> > we detect this (via altmap?) and call remove_memory_block_and_altmap()
> > for the whole range?
> 
> > Good point. We should handle memblock-per-memblock only if we have to
> > handle the altmap. Otherwise, just call a separate function that doesn't 
> > care about the altmap -- e.g., remove_memory_blocks_no_altmap().
> 
> We could simply walk all memory blocks and make sure either all have an 
> altmap or none has an altmap. If there is a mix, we should bail out with 
> WARN_ON_ONCE().
> 
Ok I think I follow - based on both of these threads, here's my
understanding in an incremental diff from the original patches (may not
apply directly as I've already committed changes from the other bits of
feedback - but this should provide an idea of the direction) - 

---

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 507291e44c0b..30addcb063b4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2201,6 +2201,40 @@ static void __ref remove_memory_block_and_altmap(u64 start, u64 size)
 	}
 }
 
+static bool memblocks_have_altmaps(u64 start, u64 size)
+{
+	unsigned long memblock_size = memory_block_size_bytes();
+	u64 num_altmaps = 0, num_no_altmaps = 0;
+	struct memory_block *mem;
+	u64 cur_start;
+	int rc = 0;
+
+	if (!mhp_memmap_on_memory())
+		return false;
+
+	for (cur_start = start; cur_start < start + size;
+	     cur_start += memblock_size) {
+		if (walk_memory_blocks(cur_start, memblock_size, &mem,
+				       test_has_altmap_cb))
+			num_altmaps++;
+		else
+			num_no_altmaps++;
+	}
+
+	if (!num_altmaps && num_no_altmaps > 0)
+		return false;
+
+	if (!num_no_altmaps && num_altmaps > 0)
+		return true;
+
+	/*
+	 * If there is a mix of memblocks with and without altmaps,
+	 * something has gone very wrong. WARN and bail.
+	 */
+	WARN_ONCE(1, "memblocks have a mix of missing and present altmaps");
+	return false;
+}
+
 static int __ref try_remove_memory(u64 start, u64 size)
 {
 	int rc, nid = NUMA_NO_NODE;
@@ -2230,7 +2264,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
 	 * a per-memblock basis. Loop through the entire range if so,
 	 * and remove each memblock and its altmap.
 	 */
-	if (mhp_memmap_on_memory()) {
+	if (mhp_memmap_on_memory() && memblocks_have_altmaps(start, size)) {
 		unsigned long memblock_size = memory_block_size_bytes();
 		u64 cur_start;
 
@@ -2239,7 +2273,8 @@ static int __ref try_remove_memory(u64 start, u64 size)
 			remove_memory_block_and_altmap(cur_start,
 						       memblock_size);
 	} else {
-		remove_memory_block_and_altmap(start, size);
+		remove_memory_block_devices(start, size);
+		arch_remove_memory(start, size, NULL);
 	}
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
  
David Hildenbrand Oct. 12, 2023, 8:40 a.m. UTC | #9
On 12.10.23 07:53, Verma, Vishal L wrote:
> On Mon, 2023-10-09 at 17:04 +0200, David Hildenbrand wrote:
>> On 07.10.23 10:55, Huang, Ying wrote:
>>> Vishal Verma <vishal.l.verma@intel.com> writes:
>>>
>>>> @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
>>>>          if (rc)
>>>>                  return rc;
>>>>    
>>>> +       mem_hotplug_begin();
>>>> +
>>>>          /*
>>>> -        * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
>>>> -        * the same granularity it was added - a single memory block.
>>>> +        * For memmap_on_memory, the altmaps could have been added on
>>>> +        * a per-memblock basis. Loop through the entire range if so,
>>>> +        * and remove each memblock and its altmap.
>>>>           */
>>>>          if (mhp_memmap_on_memory()) {
>>>
>>> IIUC, even if mhp_memmap_on_memory() returns true, it's still possible
>>> that the memmap is put in DRAM after [2/2].  So that,
>>> arch_remove_memory() are called for each memory block unnecessarily.  Can
>>> we detect this (via altmap?) and call remove_memory_block_and_altmap()
>>> for the whole range?
>>
>> Good point. We should handle memblock-per-memblock only if we have to
>> handle the altmap. Otherwise, just call a separate function that doesn't
>> care about the altmap -- e.g., remove_memory_blocks_no_altmap().
>>
>> We could simply walk all memory blocks and make sure either all have an
>> altmap or none has an altmap. If there is a mix, we should bail out with
>> WARN_ON_ONCE().
>>
> Ok I think I follow - based on both of these threads, here's my
> understanding in an incremental diff from the original patches (may not
> apply directly as I've already committed changes from the other bits of
> feedback - but this should provide an idea of the direction) -
> 
> ---
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 507291e44c0b..30addcb063b4 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -2201,6 +2201,40 @@ static void __ref remove_memory_block_and_altmap(u64 start, u64 size)
>   	}
>   }
>   
> +static bool memblocks_have_altmaps(u64 start, u64 size)
> +{
> +	unsigned long memblock_size = memory_block_size_bytes();
> +	u64 num_altmaps = 0, num_no_altmaps = 0;
> +	struct memory_block *mem;
> +	u64 cur_start;
> +	int rc = 0;
> +
> +	if (!mhp_memmap_on_memory())
> +		return false;

This check can probably be removed, since the caller already does it 
(or drop the one in the caller).

> +
> +	for (cur_start = start; cur_start < start + size;
> +	     cur_start += memblock_size) {
> +		if (walk_memory_blocks(cur_start, memblock_size, &mem,
> +				       test_has_altmap_cb))
> +			num_altmaps++;
> +		else
> +			num_no_altmaps++;
> +	}

You should do that without the outer loop, by doing the counting in the 
callback function instead.

> +
> +	if (!num_altmaps && num_no_altmaps > 0)
> +		return false;
> +
> +	if (!num_no_altmaps && num_altmaps > 0)
> +		return true;
> +
> +	/*
> +	 * If there is a mix of memblocks with and without altmaps,
> +	 * something has gone very wrong. WARN and bail.
> +	 */
> +	WARN_ONCE(1, "memblocks have a mix of missing and present altmaps");

It would be better if we could even make try_remove_memory() fail in 
this case.

> +	return false;
> +}
> +
>   static int __ref try_remove_memory(u64 start, u64 size)
>   {
>   	int rc, nid = NUMA_NO_NODE;
> @@ -2230,7 +2264,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
>   	 * a per-memblock basis. Loop through the entire range if so,
>   	 * and remove each memblock and its altmap.
>   	 */
> -	if (mhp_memmap_on_memory()) {
> +	if (mhp_memmap_on_memory() && memblocks_have_altmaps(start, size)) {
>   		unsigned long memblock_size = memory_block_size_bytes();
>   		u64 cur_start;
>   
> @@ -2239,7 +2273,8 @@ static int __ref try_remove_memory(u64 start, u64 size)
>   			remove_memory_block_and_altmap(cur_start,
>   						       memblock_size);

^ It is probably cleaner to move the loop into 
remove_memory_block_and_altmap() and call it 
remove_memory_blocks_and_altmaps(start, size) instead.
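
As a rough userspace sketch of that shape (the caller passes the whole
range and the helper iterates internally): MOCK_BLOCK_SIZE,
mock_remove_one_block() and the recording array are hypothetical
stand-ins, and the fixed 128 MiB block size is an assumption made only
so the example is self-contained.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_BLOCK_SIZE (128ULL << 20)	/* assumed 128 MiB memory block */
#define MAX_CALLS 16

/* Records which block start addresses were "removed". */
static uint64_t removed_starts[MAX_CALLS];
static int nr_removed;

/*
 * Stand-in for the single-block work: altmap lookup, block device
 * removal and arch-level removal for exactly one memory block.
 */
static void mock_remove_one_block(uint64_t start)
{
	assert(nr_removed < MAX_CALLS);
	removed_starts[nr_removed++] = start;
}

/* The loop lives in the helper, so callers just pass (start, size). */
static void mock_remove_memory_blocks_and_altmaps(uint64_t start,
						  uint64_t size)
{
	uint64_t cur_start;

	/* The range is expected to be a whole number of memory blocks. */
	assert(size % MOCK_BLOCK_SIZE == 0);

	for (cur_start = start; cur_start < start + size;
	     cur_start += MOCK_BLOCK_SIZE)
		mock_remove_one_block(cur_start);
}
```

With this, try_remove_memory() would no longer need its own
memblock_size and cur_start locals.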

>   	} else {
> -		remove_memory_block_and_altmap(start, size);
> +		remove_memory_block_devices(start, size);
> +		arch_remove_memory(start, size, NULL);
>   	}
>   
>   	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
>
  
Verma, Vishal L Oct. 16, 2023, 6:19 p.m. UTC | #10
On Thu, 2023-10-12 at 10:40 +0200, David Hildenbrand wrote:
> On 12.10.23 07:53, Verma, Vishal L wrote:
> > On Mon, 2023-10-09 at 17:04 +0200, David Hildenbrand wrote:
> > > On 07.10.23 10:55, Huang, Ying wrote:
> > > > Vishal Verma <vishal.l.verma@intel.com> writes:
> 
<..>
> > +
> > +       for (cur_start = start; cur_start < start + size;
> > +            cur_start += memblock_size) {
> > +               if (walk_memory_blocks(cur_start, memblock_size, &mem,
> > +                                      test_has_altmap_cb))
> > +                       num_altmaps++;
> > +               else
> > +                       num_no_altmaps++;
> > +       }
> 
> You should do that without the outer loop, by doing the counting in the 
> callback function instead.      
> 
> 
I made a new callback, since the existing callback that returns the
memory_block stops the walk the first time an altmap is encountered.
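
To illustrate the difference, here is a small userspace model. It is a
sketch only: mock_walk_blocks() imitates the walk_memory_blocks()
convention of stopping at the first non-zero callback return,
mock_first_altmap_cb() is modelled on test_has_altmap_cb(), and
mock_count_altmaps_cb() with struct altmap_counts is a hypothetical
counting callback, not necessarily what v6 will contain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct memory_block. */
struct mock_block {
	bool has_altmap;
};

typedef int (*mock_walk_fn)(struct mock_block *blk, void *arg);

/* Stop at the first non-zero callback return and propagate it. */
static int mock_walk_blocks(struct mock_block *blocks, size_t n,
			    mock_walk_fn fn, void *arg)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		ret = fn(&blocks[i], arg);
		if (ret)
			break;
	}
	return ret;
}

/* Returns the first block with an altmap and ends the walk there. */
static int mock_first_altmap_cb(struct mock_block *blk, void *arg)
{
	if (blk->has_altmap) {
		*(struct mock_block **)arg = blk;
		return 1;
	}
	return 0;
}

struct altmap_counts {
	size_t with;
	size_t without;
};

/* Always returns 0, so every block in the range gets counted. */
static int mock_count_altmaps_cb(struct mock_block *blk, void *arg)
{
	struct altmap_counts *counts = arg;

	if (blk->has_altmap)
		counts->with++;
	else
		counts->without++;
	return 0;
}
```

The first callback never sees the blocks after the first altmap, which
is why it cannot be reused for an all-or-none check over the range.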

Agreed on all the other comments - it looks much cleaner now!

Sending v6 shortly with all of this.
  

Patch

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f8d3e7427e32..77ec6f15f943 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1380,6 +1380,44 @@  static bool mhp_supports_memmap_on_memory(unsigned long size)
 	return arch_supports_memmap_on_memory(vmemmap_size);
 }
 
+static int add_memory_create_devices(int nid, struct memory_group *group,
+				     u64 start, u64 size, mhp_t mhp_flags)
+{
+	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+	struct vmem_altmap mhp_altmap = {
+		.base_pfn =  PHYS_PFN(start),
+		.end_pfn  =  PHYS_PFN(start + size - 1),
+	};
+	int ret;
+
+	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
+		mhp_altmap.free = memory_block_memmap_on_memory_pages();
+		params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
+		if (!params.altmap)
+			return -ENOMEM;
+
+		memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
+	}
+
+	/* call arch's memory hotadd */
+	ret = arch_add_memory(nid, start, size, &params);
+	if (ret < 0)
+		goto error;
+
+	/* create memory block devices after memory was added */
+	ret = create_memory_block_devices(start, size, params.altmap, group);
+	if (ret)
+		goto err_bdev;
+
+	return 0;
+
+err_bdev:
+	arch_remove_memory(start, size, NULL);
+error:
+	kfree(params.altmap);
+	return ret;
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1388,14 +1426,10 @@  static bool mhp_supports_memmap_on_memory(unsigned long size)
  */
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
-	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+	unsigned long memblock_size = memory_block_size_bytes();
 	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
-	struct vmem_altmap mhp_altmap = {
-		.base_pfn =  PHYS_PFN(res->start),
-		.end_pfn  =  PHYS_PFN(res->end),
-	};
 	struct memory_group *group = NULL;
-	u64 start, size;
+	u64 start, size, cur_start;
 	bool new_node = false;
 	int ret;
 
@@ -1436,28 +1470,21 @@  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	/*
 	 * Self hosted memmap array
 	 */
-	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
-		if (mhp_supports_memmap_on_memory(size)) {
-			mhp_altmap.free = memory_block_memmap_on_memory_pages();
-			params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
-			if (!params.altmap)
+	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
+	    mhp_supports_memmap_on_memory(memblock_size)) {
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size) {
+			ret = add_memory_create_devices(nid, group, cur_start,
+							memblock_size,
+							mhp_flags);
+			if (ret)
 				goto error;
-
-			memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
 		}
-		/* fallback to not using altmap  */
-	}
-
-	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, &params);
-	if (ret < 0)
-		goto error_free;
-
-	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size, params.altmap, group);
-	if (ret) {
-		arch_remove_memory(start, size, NULL);
-		goto error_free;
+	} else {
+		ret = add_memory_create_devices(nid, group, start, size,
+						mhp_flags);
+		if (ret)
+			goto error;
 	}
 
 	if (new_node) {
@@ -1494,8 +1521,6 @@  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		walk_memory_blocks(start, size, NULL, online_memory_block);
 
 	return ret;
-error_free:
-	kfree(params.altmap);
 error:
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
 		memblock_remove(start, size);
@@ -2146,12 +2171,41 @@  void try_offline_node(int nid)
 }
 EXPORT_SYMBOL(try_offline_node);
 
-static int __ref try_remove_memory(u64 start, u64 size)
+static void __ref remove_memory_block_and_altmap(int nid, u64 start, u64 size)
 {
+	int rc = 0;
 	struct memory_block *mem;
-	int rc = 0, nid = NUMA_NO_NODE;
 	struct vmem_altmap *altmap = NULL;
 
+	rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
+	if (rc) {
+		altmap = mem->altmap;
+		/*
+		 * Mark altmap NULL so that we can add a debug
+		 * check on memblock free.
+		 */
+		mem->altmap = NULL;
+	}
+
+	/*
+	 * Memory block device removal under the device_hotplug_lock is
+	 * a barrier against racing online attempts.
+	 */
+	remove_memory_block_devices(start, size);
+
+	arch_remove_memory(start, size, altmap);
+
+	/* Verify that all vmemmap pages have actually been freed. */
+	if (altmap) {
+		WARN(altmap->alloc, "Altmap not fully unmapped");
+		kfree(altmap);
+	}
+}
+
+static int __ref try_remove_memory(u64 start, u64 size)
+{
+	int rc, nid = NUMA_NO_NODE;
+
 	BUG_ON(check_hotplug_memory_range(start, size));
 
 	/*
@@ -2167,47 +2221,28 @@  static int __ref try_remove_memory(u64 start, u64 size)
 	if (rc)
 		return rc;
 
+	mem_hotplug_begin();
+
 	/*
-	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
-	 * the same granularity it was added - a single memory block.
+	 * For memmap_on_memory, the altmaps could have been added on
+	 * a per-memblock basis. Loop through the entire range if so,
+	 * and remove each memblock and its altmap.
 	 */
 	if (mhp_memmap_on_memory()) {
-		rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
-		if (rc) {
-			if (size != memory_block_size_bytes()) {
-				pr_warn("Refuse to remove %#llx - %#llx,"
-					"wrong granularity\n",
-					start, start + size);
-				return -EINVAL;
-			}
-			altmap = mem->altmap;
-			/*
-			 * Mark altmap NULL so that we can add a debug
-			 * check on memblock free.
-			 */
-			mem->altmap = NULL;
-		}
+		unsigned long memblock_size = memory_block_size_bytes();
+		u64 cur_start;
+
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size)
+			remove_memory_block_and_altmap(nid, cur_start,
+						       memblock_size);
+	} else {
+		remove_memory_block_and_altmap(nid, start, size);
 	}
 
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 
-	/*
-	 * Memory block device removal under the device_hotplug_lock is
-	 * a barrier against racing online attempts.
-	 */
-	remove_memory_block_devices(start, size);
-
-	mem_hotplug_begin();
-
-	arch_remove_memory(start, size, altmap);
-
-	/* Verify that all vmemmap pages have actually been freed. */
-	if (altmap) {
-		WARN(altmap->alloc, "Altmap not fully unmapped");
-		kfree(altmap);
-	}
-
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
 		memblock_phys_free(start, size);
 		memblock_remove(start, size);
@@ -2219,6 +2254,7 @@  static int __ref try_remove_memory(u64 start, u64 size)
 		try_offline_node(nid);
 
 	mem_hotplug_done();
+
 	return 0;
 }