[3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

Message ID 20230613-vv-kmem_memmap-v1-3-f6de9c6af2c6@intel.com
State New
Headers
Series mm: use memmap_on_memory semantics for dax/kmem |

Commit Message

Verma, Vishal L June 15, 2023, 10 p.m. UTC
  With DAX memory regions originating from CXL memory expanders or
NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
on a system without enough 'regular' main memory to support the memmap
for it. To avoid this, ensure that all kmem managed hotplugged memory is
added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
new memory region being hot added.

To do this, call add_memory() in chunks of memory_block_size_bytes() as
that is a requirement for memmap_on_memory. Additionally, Use the
mhp_flag to force the memmap_on_memory checks regardless of the
respective module parameter setting.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 36 insertions(+), 13 deletions(-)
  

Comments

Huang, Ying June 16, 2023, 6:42 a.m. UTC | #1
Vishal Verma <vishal.l.verma@intel.com> writes:

> With DAX memory regions originating from CXL memory expanders or
> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
> on a system without enough 'regular' main memory to support the memmap
> for it. To avoid this, ensure that all kmem managed hotplugged memory is
> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
> new memory region being hot added.
>
> To do this, call add_memory() in chunks of memory_block_size_bytes() as
> that is a requirement for memmap_on_memory. Additionally, Use the
> mhp_flag to force the memmap_on_memory checks regardless of the
> respective module parameter setting.
>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 36 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7b36db6f1cbd..0751346193ef 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -12,6 +12,7 @@
>  #include <linux/mm.h>
>  #include <linux/mman.h>
>  #include <linux/memory-tiers.h>
> +#include <linux/memory_hotplug.h>
>  #include "dax-private.h"
>  #include "bus.h"
>  
> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  	data->mgid = rc;
>  
>  	for (i = 0; i < dev_dax->nr_range; i++) {
> +		u64 cur_start, cur_len, remaining;
>  		struct resource *res;
>  		struct range range;
>  
> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  		res->flags = IORESOURCE_SYSTEM_RAM;
>  
>  		/*
> -		 * Ensure that future kexec'd kernels will not treat
> -		 * this as RAM automatically.
> +		 * Add memory in chunks of memory_block_size_bytes() so that
> +		 * it is considered for MHP_MEMMAP_ON_MEMORY
> +		 * @range has already been aligned to memory_block_size_bytes(),
> +		 * so the following loop will always break it down cleanly.
>  		 */
> -		rc = add_memory_driver_managed(data->mgid, range.start,
> -				range_len(&range), kmem_name, MHP_NID_IS_MGID);
> +		cur_start = range.start;
> +		cur_len = memory_block_size_bytes();
> +		remaining = range_len(&range);
> +		while (remaining) {
> +			mhp_t mhp_flags = MHP_NID_IS_MGID;
>  
> -		if (rc) {
> -			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> -					i, range.start, range.end);
> -			remove_resource(res);
> -			kfree(res);
> -			data->res[i] = NULL;
> -			if (mapped)
> -				continue;
> -			goto err_request_mem;
> +			if (mhp_supports_memmap_on_memory(cur_len,
> +							  MHP_MEMMAP_ON_MEMORY))
> +				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
> +			/*
> +			 * Ensure that future kexec'd kernels will not treat
> +			 * this as RAM automatically.
> +			 */
> +			rc = add_memory_driver_managed(data->mgid, cur_start,
> +						       cur_len, kmem_name,
> +						       mhp_flags);
> +
> +			if (rc) {
> +				dev_warn(dev,
> +					 "mapping%d: %#llx-%#llx memory add failed\n",
> +					 i, cur_start, cur_start + cur_len - 1);
> +				remove_resource(res);
> +				kfree(res);
> +				data->res[i] = NULL;
> +				if (mapped)
> +					continue;
> +				goto err_request_mem;
> +			}
> +
> +			cur_start += cur_len;
> +			remaining -= cur_len;
>  		}
>  		mapped++;
>  	}

It appears that we need to hot-remove memory in the granularity of
memory_block_size_bytes() too, according to try_remove_memory().  If so,
it seems better to allocate one dax_kmem_data.res[] element for each
memory block instead of dax region?

Best Regards,
Huang, Ying
  
David Hildenbrand June 16, 2023, 7:54 a.m. UTC | #2
On 16.06.23 00:00, Vishal Verma wrote:
> With DAX memory regions originating from CXL memory expanders or
> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
> on a system without enough 'regular' main memory to support the memmap
> for it. To avoid this, ensure that all kmem managed hotplugged memory is
> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
> new memory region being hot added.
> 
> To do this, call add_memory() in chunks of memory_block_size_bytes() as
> that is a requirement for memmap_on_memory. Additionally, Use the
> mhp_flag to force the memmap_on_memory checks regardless of the
> respective module parameter setting.
> 
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>   drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
>   1 file changed, 36 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7b36db6f1cbd..0751346193ef 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -12,6 +12,7 @@
>   #include <linux/mm.h>
>   #include <linux/mman.h>
>   #include <linux/memory-tiers.h>
> +#include <linux/memory_hotplug.h>
>   #include "dax-private.h"
>   #include "bus.h"
>   
> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>   	data->mgid = rc;
>   
>   	for (i = 0; i < dev_dax->nr_range; i++) {
> +		u64 cur_start, cur_len, remaining;
>   		struct resource *res;
>   		struct range range;
>   
> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>   		res->flags = IORESOURCE_SYSTEM_RAM;
>   
>   		/*
> -		 * Ensure that future kexec'd kernels will not treat
> -		 * this as RAM automatically.
> +		 * Add memory in chunks of memory_block_size_bytes() so that
> +		 * it is considered for MHP_MEMMAP_ON_MEMORY
> +		 * @range has already been aligned to memory_block_size_bytes(),
> +		 * so the following loop will always break it down cleanly.
>   		 */
> -		rc = add_memory_driver_managed(data->mgid, range.start,
> -				range_len(&range), kmem_name, MHP_NID_IS_MGID);
> +		cur_start = range.start;
> +		cur_len = memory_block_size_bytes();
> +		remaining = range_len(&range);
> +		while (remaining) {
> +			mhp_t mhp_flags = MHP_NID_IS_MGID;
>   
> -		if (rc) {
> -			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> -					i, range.start, range.end);
> -			remove_resource(res);
> -			kfree(res);
> -			data->res[i] = NULL;
> -			if (mapped)
> -				continue;
> -			goto err_request_mem;
> +			if (mhp_supports_memmap_on_memory(cur_len,
> +							  MHP_MEMMAP_ON_MEMORY))
> +				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
> +			/*
> +			 * Ensure that future kexec'd kernels will not treat
> +			 * this as RAM automatically.
> +			 */
> +			rc = add_memory_driver_managed(data->mgid, cur_start,
> +						       cur_len, kmem_name,
> +						       mhp_flags);
> +
> +			if (rc) {
> +				dev_warn(dev,
> +					 "mapping%d: %#llx-%#llx memory add failed\n",
> +					 i, cur_start, cur_start + cur_len - 1);
> +				remove_resource(res);
> +				kfree(res);
> +				data->res[i] = NULL;
> +				if (mapped)
> +					continue;
> +				goto err_request_mem;
> +			}
> +
> +			cur_start += cur_len;
> +			remaining -= cur_len;
>   		}
>   		mapped++;
>   	}
> 

Maybe the better alternative is teach 
add_memory_resource()/try_remove_memory() to do that internally.

In the add_memory_resource() case, it might be a loop around that 
memmap_on_memory + arch_add_memory code path (well, and the error path 
also needs adjustment):

	/*
	 * Self hosted memmap array
	 */
	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
		if (!mhp_supports_memmap_on_memory(size)) {
			ret = -EINVAL;
			goto error;
		}
		mhp_altmap.free = PHYS_PFN(size);
		mhp_altmap.base_pfn = PHYS_PFN(start);
		params.altmap = &mhp_altmap;
	}

	/* call arch's memory hotadd */
	ret = arch_add_memory(nid, start, size, &params);
	if (ret < 0)
		goto error;


Note that we want to handle that on a per memory-block basis, because we 
don't want the vmemmap of memory block #2 to end up on memory block #1. 
It all gets messy with memory onlining/offlining etc otherwise ...
  
Tarun Sahu June 20, 2023, 1:14 p.m. UTC | #3
Hi Vishal,

Vishal Verma <vishal.l.verma@intel.com> writes:

> With DAX memory regions originating from CXL memory expanders or
> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
> on a system without enough 'regular' main memory to support the memmap
> for it. To avoid this, ensure that all kmem managed hotplugged memory is
> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
> new memory region being hot added.
>

Some architectures doesn't have support for MEMMAP_ON_MEMORY, bypassing
the check mhp_memmap_on_memory() might cause problems on such
architectures (for e.g PPC64).

> To do this, call add_memory() in chunks of memory_block_size_bytes() as
> that is a requirement for memmap_on_memory. Additionally, Use the
> mhp_flag to force the memmap_on_memory checks regardless of the
> respective module parameter setting.
>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 36 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7b36db6f1cbd..0751346193ef 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -12,6 +12,7 @@
>  #include <linux/mm.h>
>  #include <linux/mman.h>
>  #include <linux/memory-tiers.h>
> +#include <linux/memory_hotplug.h>
>  #include "dax-private.h"
>  #include "bus.h"
>  
> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  	data->mgid = rc;
>  
>  	for (i = 0; i < dev_dax->nr_range; i++) {
> +		u64 cur_start, cur_len, remaining;
>  		struct resource *res;
>  		struct range range;
>  
> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  		res->flags = IORESOURCE_SYSTEM_RAM;
>  
>  		/*
> -		 * Ensure that future kexec'd kernels will not treat
> -		 * this as RAM automatically.
> +		 * Add memory in chunks of memory_block_size_bytes() so that
> +		 * it is considered for MHP_MEMMAP_ON_MEMORY
> +		 * @range has already been aligned to memory_block_size_bytes(),
> +		 * so the following loop will always break it down cleanly.
>  		 */
> -		rc = add_memory_driver_managed(data->mgid, range.start,
> -				range_len(&range), kmem_name, MHP_NID_IS_MGID);
> +		cur_start = range.start;
> +		cur_len = memory_block_size_bytes();
> +		remaining = range_len(&range);
> +		while (remaining) {
> +			mhp_t mhp_flags = MHP_NID_IS_MGID;
>  
> -		if (rc) {
> -			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> -					i, range.start, range.end);
> -			remove_resource(res);
> -			kfree(res);
> -			data->res[i] = NULL;
> -			if (mapped)
> -				continue;
> -			goto err_request_mem;
> +			if (mhp_supports_memmap_on_memory(cur_len,
> +							  MHP_MEMMAP_ON_MEMORY))
> +				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
> +			/*
> +			 * Ensure that future kexec'd kernels will not treat
> +			 * this as RAM automatically.
> +			 */
> +			rc = add_memory_driver_managed(data->mgid, cur_start,
> +						       cur_len, kmem_name,
> +						       mhp_flags);
> +
> +			if (rc) {
> +				dev_warn(dev,
> +					 "mapping%d: %#llx-%#llx memory add failed\n",
> +					 i, cur_start, cur_start + cur_len - 1);
> +				remove_resource(res);
> +				kfree(res);
> +				data->res[i] = NULL;
> +				if (mapped)
> +					continue;
> +				goto err_request_mem;
> +			}
> +
> +			cur_start += cur_len;
> +			remaining -= cur_len;
>  		}
>  		mapped++;
>  	}
>
> -- 
> 2.40.1
  
Aneesh Kumar K.V July 11, 2023, 2:30 p.m. UTC | #4
David Hildenbrand <david@redhat.com> writes:

> On 16.06.23 00:00, Vishal Verma wrote:
>> With DAX memory regions originating from CXL memory expanders or
>> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
>> on a system without enough 'regular' main memory to support the memmap
>> for it. To avoid this, ensure that all kmem managed hotplugged memory is
>> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
>> new memory region being hot added.
>>
>> To do this, call add_memory() in chunks of memory_block_size_bytes() as
>> that is a requirement for memmap_on_memory. Additionally, Use the
>> mhp_flag to force the memmap_on_memory checks regardless of the
>> respective module parameter setting.
>>
>> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>> Cc: Len Brown <lenb@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Jiang <dave.jiang@intel.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
>> ---
>>   drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
>>   1 file changed, 36 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index 7b36db6f1cbd..0751346193ef 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -12,6 +12,7 @@
>>   #include <linux/mm.h>
>>   #include <linux/mman.h>
>>   #include <linux/memory-tiers.h>
>> +#include <linux/memory_hotplug.h>
>>   #include "dax-private.h"
>>   #include "bus.h"
>>
>> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>   	data->mgid = rc;
>>
>>   	for (i = 0; i < dev_dax->nr_range; i++) {
>> +		u64 cur_start, cur_len, remaining;
>>   		struct resource *res;
>>   		struct range range;
>>
>> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>   		res->flags = IORESOURCE_SYSTEM_RAM;
>>
>>   		/*
>> -		 * Ensure that future kexec'd kernels will not treat
>> -		 * this as RAM automatically.
>> +		 * Add memory in chunks of memory_block_size_bytes() so that
>> +		 * it is considered for MHP_MEMMAP_ON_MEMORY
>> +		 * @range has already been aligned to memory_block_size_bytes(),
>> +		 * so the following loop will always break it down cleanly.
>>   		 */
>> -		rc = add_memory_driver_managed(data->mgid, range.start,
>> -				range_len(&range), kmem_name, MHP_NID_IS_MGID);
>> +		cur_start = range.start;
>> +		cur_len = memory_block_size_bytes();
>> +		remaining = range_len(&range);
>> +		while (remaining) {
>> +			mhp_t mhp_flags = MHP_NID_IS_MGID;
>>
>> -		if (rc) {
>> -			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>> -					i, range.start, range.end);
>> -			remove_resource(res);
>> -			kfree(res);
>> -			data->res[i] = NULL;
>> -			if (mapped)
>> -				continue;
>> -			goto err_request_mem;
>> +			if (mhp_supports_memmap_on_memory(cur_len,
>> +							  MHP_MEMMAP_ON_MEMORY))
>> +				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>> +			/*
>> +			 * Ensure that future kexec'd kernels will not treat
>> +			 * this as RAM automatically.
>> +			 */
>> +			rc = add_memory_driver_managed(data->mgid, cur_start,
>> +						       cur_len, kmem_name,
>> +						       mhp_flags);
>> +
>> +			if (rc) {
>> +				dev_warn(dev,
>> +					 "mapping%d: %#llx-%#llx memory add failed\n",
>> +					 i, cur_start, cur_start + cur_len - 1);
>> +				remove_resource(res);
>> +				kfree(res);
>> +				data->res[i] = NULL;
>> +				if (mapped)
>> +					continue;
>> +				goto err_request_mem;
>> +			}
>> +
>> +			cur_start += cur_len;
>> +			remaining -= cur_len;
>>   		}
>>   		mapped++;
>>   	}
>>
>
> Maybe the better alternative is teach
> add_memory_resource()/try_remove_memory() to do that internally.
>
> In the add_memory_resource() case, it might be a loop around that
> memmap_on_memory + arch_add_memory code path (well, and the error path
> also needs adjustment):
>
> 	/*
> 	 * Self hosted memmap array
> 	 */
> 	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> 		if (!mhp_supports_memmap_on_memory(size)) {
> 			ret = -EINVAL;
> 			goto error;
> 		}
> 		mhp_altmap.free = PHYS_PFN(size);
> 		mhp_altmap.base_pfn = PHYS_PFN(start);
> 		params.altmap = &mhp_altmap;
> 	}
>
> 	/* call arch's memory hotadd */
> 	ret = arch_add_memory(nid, start, size, &params);
> 	if (ret < 0)
> 		goto error;
>
>
> Note that we want to handle that on a per memory-block basis, because we
> don't want the vmemmap of memory block #2 to end up on memory block #1.
> It all gets messy with memory onlining/offlining etc otherwise ...
>

I tried to implement this inside add_memory_driver_managed() and also
within dax/kmem. IMHO doing the error handling inside dax/kmem is
better. Here is how it looks:

1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
2. For each succesful request_mem_regions if any blocks got added, we
keep the resource. If none got added, we will kfree the resource



	for (i = 0; i < dev_dax->nr_range; i++) {
		u64 cur_start, cur_len, remaining;
		struct resource *res;
		struct range range;
		bool block_added;

		rc = dax_kmem_range(dev_dax, i, &range);
		if (rc)
			continue;

		/* Region is permanently reserved if hotremove fails. */
		res = request_mem_region(range.start, range_len(&range), data->res_name);
		if (!res) {
			dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
					i, range.start, range.end);
			/*
			 * Once some memory has been onlined we can't
			 * assume that it can be un-onlined safely.
			 */
			if (mapped)
				continue;
			rc = -EBUSY;
			goto err_request_mem;
		}
		data->res[i] = res;

		/*
		 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
		 * so that add_memory() can add a child resource.  Do not
		 * inherit flags from the parent since it may set new flags
		 * unknown to us that will break add_memory() below.
		 */
		res->flags = IORESOURCE_SYSTEM_RAM;

		/*
		 * Add memory in chunks of memory_block_size_bytes() so that
		 * it is considered for MHP_MEMMAP_ON_MEMORY
		 * @range has already been aligned to memory_block_size_bytes(),
		 * so the following loop will always break it down cleanly.
		 */
		cur_start = range.start;
		cur_len = memory_block_size_bytes();
		remaining = range_len(&range);
		block_added = false;
		while (remaining) {
			/*
			 * If alignment rules are not satisified we will
			 * fallback normal memmap allocation.
			 */
			mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY;
			/*
			 * Ensure that future kexec'd kernels will not treat
			 * this as RAM automatically.
			 */
			rc = add_memory_driver_managed(data->mgid, cur_start,
						       cur_len, kmem_name,
						       mhp_flags);

			if (rc)
				dev_warn(dev,
					 "mapping%d: %#llx-%#llx memory add failed\n",
					 i, cur_start, cur_start + cur_len - 1);
			else
				block_added = true;

			cur_start += cur_len;
			remaining -= cur_len;
		}
		if (!block_added) {
			/*
			 * None of the blocks got added, remove the resource.
			 */
			remove_resource(res);
			kfree(res);
			data->res[i] = NULL;
		} else
			mapped++;
	}
	if (mapped) {
		dev_set_drvdata(dev, data);
		return 0;
	}

err_request_mem:
	/*
	 *  If none of the resources got mapped.
	 *  unregister the group.
	 */
	memory_group_unregister(data->mgid);
err_reg_mgid:
	kfree(data->res_name);
  
David Hildenbrand July 11, 2023, 3:21 p.m. UTC | #5
On 11.07.23 16:30, Aneesh Kumar K.V wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 16.06.23 00:00, Vishal Verma wrote:
>>> With DAX memory regions originating from CXL memory expanders or
>>> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
>>> on a system without enough 'regular' main memory to support the memmap
>>> for it. To avoid this, ensure that all kmem managed hotplugged memory is
>>> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
>>> new memory region being hot added.
>>>
>>> To do this, call add_memory() in chunks of memory_block_size_bytes() as
>>> that is a requirement for memmap_on_memory. Additionally, Use the
>>> mhp_flag to force the memmap_on_memory checks regardless of the
>>> respective module parameter setting.
>>>
>>> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>>> Cc: Len Brown <lenb@kernel.org>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Dave Jiang <dave.jiang@intel.com>
>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>> Cc: Huang Ying <ying.huang@intel.com>
>>> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
>>> ---
>>>    drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
>>>    1 file changed, 36 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>> index 7b36db6f1cbd..0751346193ef 100644
>>> --- a/drivers/dax/kmem.c
>>> +++ b/drivers/dax/kmem.c
>>> @@ -12,6 +12,7 @@
>>>    #include <linux/mm.h>
>>>    #include <linux/mman.h>
>>>    #include <linux/memory-tiers.h>
>>> +#include <linux/memory_hotplug.h>
>>>    #include "dax-private.h"
>>>    #include "bus.h"
>>>
>>> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>    	data->mgid = rc;
>>>
>>>    	for (i = 0; i < dev_dax->nr_range; i++) {
>>> +		u64 cur_start, cur_len, remaining;
>>>    		struct resource *res;
>>>    		struct range range;
>>>
>>> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>    		res->flags = IORESOURCE_SYSTEM_RAM;
>>>
>>>    		/*
>>> -		 * Ensure that future kexec'd kernels will not treat
>>> -		 * this as RAM automatically.
>>> +		 * Add memory in chunks of memory_block_size_bytes() so that
>>> +		 * it is considered for MHP_MEMMAP_ON_MEMORY
>>> +		 * @range has already been aligned to memory_block_size_bytes(),
>>> +		 * so the following loop will always break it down cleanly.
>>>    		 */
>>> -		rc = add_memory_driver_managed(data->mgid, range.start,
>>> -				range_len(&range), kmem_name, MHP_NID_IS_MGID);
>>> +		cur_start = range.start;
>>> +		cur_len = memory_block_size_bytes();
>>> +		remaining = range_len(&range);
>>> +		while (remaining) {
>>> +			mhp_t mhp_flags = MHP_NID_IS_MGID;
>>>
>>> -		if (rc) {
>>> -			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>>> -					i, range.start, range.end);
>>> -			remove_resource(res);
>>> -			kfree(res);
>>> -			data->res[i] = NULL;
>>> -			if (mapped)
>>> -				continue;
>>> -			goto err_request_mem;
>>> +			if (mhp_supports_memmap_on_memory(cur_len,
>>> +							  MHP_MEMMAP_ON_MEMORY))
>>> +				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>>> +			/*
>>> +			 * Ensure that future kexec'd kernels will not treat
>>> +			 * this as RAM automatically.
>>> +			 */
>>> +			rc = add_memory_driver_managed(data->mgid, cur_start,
>>> +						       cur_len, kmem_name,
>>> +						       mhp_flags);
>>> +
>>> +			if (rc) {
>>> +				dev_warn(dev,
>>> +					 "mapping%d: %#llx-%#llx memory add failed\n",
>>> +					 i, cur_start, cur_start + cur_len - 1);
>>> +				remove_resource(res);
>>> +				kfree(res);
>>> +				data->res[i] = NULL;
>>> +				if (mapped)
>>> +					continue;
>>> +				goto err_request_mem;
>>> +			}
>>> +
>>> +			cur_start += cur_len;
>>> +			remaining -= cur_len;
>>>    		}
>>>    		mapped++;
>>>    	}
>>>
>>
>> Maybe the better alternative is teach
>> add_memory_resource()/try_remove_memory() to do that internally.
>>
>> In the add_memory_resource() case, it might be a loop around that
>> memmap_on_memory + arch_add_memory code path (well, and the error path
>> also needs adjustment):
>>
>> 	/*
>> 	 * Self hosted memmap array
>> 	 */
>> 	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>> 		if (!mhp_supports_memmap_on_memory(size)) {
>> 			ret = -EINVAL;
>> 			goto error;
>> 		}
>> 		mhp_altmap.free = PHYS_PFN(size);
>> 		mhp_altmap.base_pfn = PHYS_PFN(start);
>> 		params.altmap = &mhp_altmap;
>> 	}
>>
>> 	/* call arch's memory hotadd */
>> 	ret = arch_add_memory(nid, start, size, &params);
>> 	if (ret < 0)
>> 		goto error;
>>
>>
>> Note that we want to handle that on a per memory-block basis, because we
>> don't want the vmemmap of memory block #2 to end up on memory block #1.
>> It all gets messy with memory onlining/offlining etc otherwise ...
>>
> 
> I tried to implement this inside add_memory_driver_managed() and also
> within dax/kmem. IMHO doing the error handling inside dax/kmem is
> better. Here is how it looks:
> 
> 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
> 2. For each succesful request_mem_regions if any blocks got added, we
> keep the resource. If none got added, we will kfree the resource
> 

Doing this unconditional splitting outside of 
add_memory_driver_managed() is undesirable for at least two reasons

1) You end up always creating individual entries in the resource tree
    (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
2) As we call arch_add_memory() in memory block granularity (e.g., 128
    MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the
    identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.

While you could sense for support and do the split based on that, it 
will be beneficial for other users (especially DIMMs) if we do that 
internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be 
effective or not.

In general, we avoid placing important kernel data-structures on slow 
memory. That's one of the reasons why PMEM decided to mostly always use 
ZONE_MOVABLE such that exactly what this patch does would not happen. So 
I'm wondering if there would be demand for an additional toggle.

Because even with memmap_on_memory enabled in general, you might not 
want to do that for dax/kmem.

IMHO, this patch should be dropped from your ppc64 series, as it's an 
independent change that might be valuable for other architectures as well.
  
Verma, Vishal L July 13, 2023, 6:45 a.m. UTC | #6
On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote:
> On 11.07.23 16:30, Aneesh Kumar K.V wrote:
> > David Hildenbrand <david@redhat.com> writes:
> > > 
> > > Maybe the better alternative is teach
> > > add_memory_resource()/try_remove_memory() to do that internally.
> > > 
> > > In the add_memory_resource() case, it might be a loop around that
> > > memmap_on_memory + arch_add_memory code path (well, and the error path
> > > also needs adjustment):
> > > 
> > >         /*
> > >          * Self hosted memmap array
> > >          */
> > >         if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> > >                 if (!mhp_supports_memmap_on_memory(size)) {
> > >                         ret = -EINVAL;
> > >                         goto error;
> > >                 }
> > >                 mhp_altmap.free = PHYS_PFN(size);
> > >                 mhp_altmap.base_pfn = PHYS_PFN(start);
> > >                 params.altmap = &mhp_altmap;
> > >         }
> > > 
> > >         /* call arch's memory hotadd */
> > >         ret = arch_add_memory(nid, start, size, &params);
> > >         if (ret < 0)
> > >                 goto error;
> > > 
> > > 
> > > Note that we want to handle that on a per memory-block basis, because we
> > > don't want the vmemmap of memory block #2 to end up on memory block #1.
> > > It all gets messy with memory onlining/offlining etc otherwise ...
> > > 
> > 
> > I tried to implement this inside add_memory_driver_managed() and also
> > within dax/kmem. IMHO doing the error handling inside dax/kmem is
> > better. Here is how it looks:
> > 
> > 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
> > 2. For each succesful request_mem_regions if any blocks got added, we
> > keep the resource. If none got added, we will kfree the resource
> > 
> 
> Doing this unconditional splitting outside of 
> add_memory_driver_managed() is undesirable for at least two reasons
> 
> 1) You end up always creating individual entries in the resource tree
>     (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
> 2) As we call arch_add_memory() in memory block granularity (e.g., 128
>     MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the
>     identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.
> 
> While you could sense for support and do the split based on that, it 
> will be beneficial for other users (especially DIMMs) if we do that 
> internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be 
> effective or not.

I'm taking a shot at implementing the splitting internally in
memory_hotplug.c. The caller (kmem) side does become trivial with this
approach, but there's a slight complication if I don't have the module
param override (patch 1 of this series).

The kmem diff now looks like:

   diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
   index 898ca9505754..8be932f63f90 100644
   --- a/drivers/dax/kmem.c
   +++ b/drivers/dax/kmem.c
   @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
           data->mgid = rc;
    
           for (i = 0; i < dev_dax->nr_range; i++) {
   +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
   +                                 MHP_SPLIT_MEMBLOCKS;
                   struct resource *res;
                   struct range range;
    
   @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
                    * this as RAM automatically.
                    */
                   rc = add_memory_driver_managed(data->mgid, range.start,
   -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
   +                               range_len(&range), kmem_name, mhp_flags);
    
                   if (rc) {
                           dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
   

However this begins to fail if the memmap_on_memory modparam is not
set, as add_memory_driver_managed EINVALs from the
mhp_supports_memmap_on_memory() check.

The way to work around this would probably include doing the
mhp_supports_memmap_on_memory() check in kmem, in a loop to check for
each memblock sized chunk, and that feels like a leak of the
implementation details into the caller.

Any suggestions on how to go about this?
> 
> In general, we avoid placing important kernel data-structures on slow
> memory. That's one of the reasons why PMEM decided to mostly always use 
> ZONE_MOVABLE such that exactly what this patch does would not happen. So 
> I'm wondering if there would be demand for an additional toggle.
> 
> Because even with memmap_on_memory enabled in general, you might not 
> want to do that for dax/kmem.
> 
> IMHO, this patch should be dropped from your ppc64 series, as it's an
> independent change that might be valuable for other architectures as well.
> 
Sure thing, I can pick this back up and Aneesh can drop this from his set.
  
David Hildenbrand July 13, 2023, 7:23 a.m. UTC | #7
On 13.07.23 08:45, Verma, Vishal L wrote:
> On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote:
>> On 11.07.23 16:30, Aneesh Kumar K.V wrote:
>>> David Hildenbrand <david@redhat.com> writes:
>>>>
>>>> Maybe the better alternative is teach
>>>> add_memory_resource()/try_remove_memory() to do that internally.
>>>>
>>>> In the add_memory_resource() case, it might be a loop around that
>>>> memmap_on_memory + arch_add_memory code path (well, and the error path
>>>> also needs adjustment):
>>>>
>>>>          /*
>>>>           * Self hosted memmap array
>>>>           */
>>>>          if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>>>>                  if (!mhp_supports_memmap_on_memory(size)) {
>>>>                          ret = -EINVAL;
>>>>                          goto error;
>>>>                  }
>>>>                  mhp_altmap.free = PHYS_PFN(size);
>>>>                  mhp_altmap.base_pfn = PHYS_PFN(start);
>>>>                  params.altmap = &mhp_altmap;
>>>>          }
>>>>
>>>>          /* call arch's memory hotadd */
>>>>          ret = arch_add_memory(nid, start, size, &params);
>>>>          if (ret < 0)
>>>>                  goto error;
>>>>
>>>>
>>>> Note that we want to handle that on a per memory-block basis, because we
>>>> don't want the vmemmap of memory block #2 to end up on memory block #1.
>>>> It all gets messy with memory onlining/offlining etc otherwise ...
>>>>
>>>
>>> I tried to implement this inside add_memory_driver_managed() and also
>>> within dax/kmem. IMHO doing the error handling inside dax/kmem is
>>> better. Here is how it looks:
>>>
>>> 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
>>> 2. For each succesful request_mem_regions if any blocks got added, we
>>> keep the resource. If none got added, we will kfree the resource
>>>
>>
>> Doing this unconditional splitting outside of
>> add_memory_driver_managed() is undesirable for at least two reasons
>>
>> 1) You end up always creating individual entries in the resource tree
>>      (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
>> 2) As we call arch_add_memory() in memory block granularity (e.g., 128
>>      MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the
>>      identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.
>>
>> While you could sense for support and do the split based on that, it
>> will be beneficial for other users (especially DIMMs) if we do that
>> internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be
>> effective or not.
> 
> I'm taking a shot at implementing the splitting internally in
> memory_hotplug.c. The caller (kmem) side does become trivial with this
> approach, but there's a slight complication if I don't have the module
> param override (patch 1 of this series).
> 
> The kmem diff now looks like:
> 
>     diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>     index 898ca9505754..8be932f63f90 100644
>     --- a/drivers/dax/kmem.c
>     +++ b/drivers/dax/kmem.c
>     @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>             data->mgid = rc;
>      
>             for (i = 0; i < dev_dax->nr_range; i++) {
>     +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
>     +                                 MHP_SPLIT_MEMBLOCKS;
>                     struct resource *res;
>                     struct range range;
>      
>     @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>                      * this as RAM automatically.
>                      */
>                     rc = add_memory_driver_managed(data->mgid, range.start,
>     -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
>     +                               range_len(&range), kmem_name, mhp_flags);
>      
>                     if (rc) {
>                             dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>     
> 

Why do we need the MHP_SPLIT_MEMBLOCKS?

In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a
single memory block, you can simply split up internally, no?

Essentially in add_memory_resource() something like

if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
     mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
	for (cur_start = start, cur_start < start + size;
	     cur_start += memory_block_size_bytes()) {
		mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
		mhp_altmap.base_pfn = PHYS_PFN(start);
		params.altmap = &mhp_altmap;

		ret = arch_add_memory(nid, start,
				      memory_block_size_bytes(), &params);
		if (ret < 0) ...

		ret = create_memory_block_devices(start, memory_block_size_bytes(),
						  mhp_altmap.alloc, group);
		if (ret) ...
		
	}
} else {
	/* old boring stuff */
}

Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into
a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes()
...

If any adding of memory failed, we remove what we already added. That works, because as
long as we're holding the relevant locks, memory cannot get onlined in the meantime.

Then we only have to teach remove_memory() to deal with individual blocks when finding
blocks that have an altmap.
  
Verma, Vishal L July 13, 2023, 3:15 p.m. UTC | #8
On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
> On 13.07.23 08:45, Verma, Vishal L wrote:
> > 
> > I'm taking a shot at implementing the splitting internally in
> > memory_hotplug.c. The caller (kmem) side does become trivial with this
> > approach, but there's a slight complication if I don't have the module
> > param override (patch 1 of this series).
> > 
> > The kmem diff now looks like:
> > 
> >     diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> >     index 898ca9505754..8be932f63f90 100644
> >     --- a/drivers/dax/kmem.c
> >     +++ b/drivers/dax/kmem.c
> >     @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> >             data->mgid = rc;
> >      
> >             for (i = 0; i < dev_dax->nr_range; i++) {
> >     +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
> >     +                                 MHP_SPLIT_MEMBLOCKS;
> >                     struct resource *res;
> >                     struct range range;
> >      
> >     @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> >                      * this as RAM automatically.
> >                      */
> >                     rc = add_memory_driver_managed(data->mgid, range.start,
> >     -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
> >     +                               range_len(&range), kmem_name, mhp_flags);
> >      
> >                     if (rc) {
> >                             dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> >     
> > 
> 
> Why do we need the MHP_SPLIT_MEMBLOCKS?

I thought we still wanted either an opt-in or opt-out for the kmem
driver to be able to do memmap_on_memory, in case there were
performance implications or the lack of 1GiB PUDs. I haven't
implemented that yet, but I was thinking along the lines of a sysfs
knob exposed by kmem, that controls setting of this new
MHP_SPLIT_MEMBLOCKS flag.

> 
> In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a
> single memory block, you can simply split up internally, no?
> 
> Essentially in add_memory_resource() something like
> 
> if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
>      mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
>         for (cur_start = start, cur_start < start + size;
>              cur_start += memory_block_size_bytes()) {
>                 mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
>                 mhp_altmap.base_pfn = PHYS_PFN(start);
>                 params.altmap = &mhp_altmap;
> 
>                 ret = arch_add_memory(nid, start,
>                                       memory_block_size_bytes(), &params);
>                 if (ret < 0) ...
> 
>                 ret = create_memory_block_devices(start, memory_block_size_bytes(),
>                                                   mhp_altmap.alloc, group);
>                 if (ret) ...
>                 
>         }
> } else {
>         /* old boring stuff */
> }
> 
> Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into
> a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes()
> ...

My current approach was looping a level higher, on the call to
add_memory_resource, but this looks reasonable too, and I can switch to
this. In fact it is better as in my case I had to loop twice, once for
the regular add_memory() path and once for the _driver_managed() path.
Yours should avoid that.

> 
> If any adding of memory failed, we remove what we already added. That works, because as
> long as we're holding the relevant locks, memory cannot get onlined in the meantime.
> 
> Then we only have to teach remove_memory() to deal with individual blocks when finding
> blocks that have an altmap.
>
  
David Hildenbrand July 13, 2023, 3:23 p.m. UTC | #9
On 13.07.23 17:15, Verma, Vishal L wrote:
> On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
>> On 13.07.23 08:45, Verma, Vishal L wrote:
>>>
>>> I'm taking a shot at implementing the splitting internally in
>>> memory_hotplug.c. The caller (kmem) side does become trivial with this
>>> approach, but there's a slight complication if I don't have the module
>>> param override (patch 1 of this series).
>>>
>>> The kmem diff now looks like:
>>>
>>>      diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>      index 898ca9505754..8be932f63f90 100644
>>>      --- a/drivers/dax/kmem.c
>>>      +++ b/drivers/dax/kmem.c
>>>      @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>              data->mgid = rc;
>>>       
>>>              for (i = 0; i < dev_dax->nr_range; i++) {
>>>      +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
>>>      +                                 MHP_SPLIT_MEMBLOCKS;
>>>                      struct resource *res;
>>>                      struct range range;
>>>       
>>>      @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>                       * this as RAM automatically.
>>>                       */
>>>                      rc = add_memory_driver_managed(data->mgid, range.start,
>>>      -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
>>>      +                               range_len(&range), kmem_name, mhp_flags);
>>>       
>>>                      if (rc) {
>>>                              dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>>>      
>>>
>>
>> Why do we need the MHP_SPLIT_MEMBLOCKS?
> 
> I thought we still wanted either an opt-in or opt-out for the kmem
> driver to be able to do memmap_on_memory, in case there were
> performance implications or the lack of 1GiB PUDs. I haven't
> implemented that yet, but I was thinking along the lines of a sysfs
> knob exposed by kmem, that controls setting of this new
> MHP_SPLIT_MEMBLOCKS flag.

Why is MHP_MEMMAP_ON_MEMORY not sufficient for that?

> 
>>
>> In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a
>> single memory block, you can simply split up internally, no?
>>
>> Essentially in add_memory_resource() something like
>>
>> if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
>>       mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
>>          for (cur_start = start, cur_start < start + size;
>>               cur_start += memory_block_size_bytes()) {
>>                  mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
>>                  mhp_altmap.base_pfn = PHYS_PFN(start);
>>                  params.altmap = &mhp_altmap;
>>
>>                  ret = arch_add_memory(nid, start,
>>                                        memory_block_size_bytes(), &params);
>>                  if (ret < 0) ...
>>
>>                  ret = create_memory_block_devices(start, memory_block_size_bytes(),
>>                                                    mhp_altmap.alloc, group);
>>                  if (ret) ...
>>                  
>>          }
>> } else {
>>          /* old boring stuff */
>> }
>>
>> Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into
>> a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes()
>> ...
> 
> My current approach was looping a level higher, on the call to
> add_memory_resource, but this looks reasonable too, and I can switch to
> this. In fact it is better as in my case I had to loop twice, once for
> the regular add_memory() path and once for the _driver_managed() path.
> Yours should avoid that.

As we only care about the altmap here, handling it for arch_add_memory() 
+ create_memory_block_devices() should be sufficient.
  
Verma, Vishal L July 13, 2023, 3:40 p.m. UTC | #10
On Thu, 2023-07-13 at 17:23 +0200, David Hildenbrand wrote:
> On 13.07.23 17:15, Verma, Vishal L wrote:
> > On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
> > > On 13.07.23 08:45, Verma, Vishal L wrote:
> > > > 
> > > > I'm taking a shot at implementing the splitting internally in
> > > > memory_hotplug.c. The caller (kmem) side does become trivial with this
> > > > approach, but there's a slight complication if I don't have the module
> > > > param override (patch 1 of this series).
> > > > 
> > > > The kmem diff now looks like:
> > > > 
> > > >      diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> > > >      index 898ca9505754..8be932f63f90 100644
> > > >      --- a/drivers/dax/kmem.c
> > > >      +++ b/drivers/dax/kmem.c
> > > >      @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > > >              data->mgid = rc;
> > > >       
> > > >              for (i = 0; i < dev_dax->nr_range; i++) {
> > > >      +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
> > > >      +                                 MHP_SPLIT_MEMBLOCKS;
> > > >                      struct resource *res;
> > > >                      struct range range;
> > > >       
> > > >      @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > > >                       * this as RAM automatically.
> > > >                       */
> > > >                      rc = add_memory_driver_managed(data->mgid, range.start,
> > > >      -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
> > > >      +                               range_len(&range), kmem_name, mhp_flags);
> > > >       
> > > >                      if (rc) {
> > > >                              dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> > > >      
> > > > 
> > > 
> > > Why do we need the MHP_SPLIT_MEMBLOCKS?
> > 
> > I thought we still wanted either an opt-in or opt-out for the kmem
> > driver to be able to do memmap_on_memory, in case there were
> > performance implications or the lack of 1GiB PUDs. I haven't
> > implemented that yet, but I was thinking along the lines of a sysfs
> > knob exposed by kmem, that controls setting of this new
> > MHP_SPLIT_MEMBLOCKS flag.
> 
> Why is MHP_MEMMAP_ON_MEMORY not sufficient for that?
> 
> 
Ah I see what you mean now - knob just controls MHP_MEMMAP_ON_MEMORY,
and memory_hotplug is free to split to memblocks if it needs to to
satisfy that.

That sounds reasonable. Let me give this a try and see if I run into
anything else. Thanks David!
  
David Hildenbrand July 13, 2023, 3:43 p.m. UTC | #11
On 13.07.23 17:40, Verma, Vishal L wrote:
> On Thu, 2023-07-13 at 17:23 +0200, David Hildenbrand wrote:
>> On 13.07.23 17:15, Verma, Vishal L wrote:
>>> On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
>>>> On 13.07.23 08:45, Verma, Vishal L wrote:
>>>>>
>>>>> I'm taking a shot at implementing the splitting internally in
>>>>> memory_hotplug.c. The caller (kmem) side does become trivial with this
>>>>> approach, but there's a slight complication if I don't have the module
>>>>> param override (patch 1 of this series).
>>>>>
>>>>> The kmem diff now looks like:
>>>>>
>>>>>       diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>>>       index 898ca9505754..8be932f63f90 100644
>>>>>       --- a/drivers/dax/kmem.c
>>>>>       +++ b/drivers/dax/kmem.c
>>>>>       @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>               data->mgid = rc;
>>>>>        
>>>>>               for (i = 0; i < dev_dax->nr_range; i++) {
>>>>>       +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
>>>>>       +                                 MHP_SPLIT_MEMBLOCKS;
>>>>>                       struct resource *res;
>>>>>                       struct range range;
>>>>>        
>>>>>       @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>                        * this as RAM automatically.
>>>>>                        */
>>>>>                       rc = add_memory_driver_managed(data->mgid, range.start,
>>>>>       -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
>>>>>       +                               range_len(&range), kmem_name, mhp_flags);
>>>>>        
>>>>>                       if (rc) {
>>>>>                               dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>>>>>       
>>>>>
>>>>
>>>> Why do we need the MHP_SPLIT_MEMBLOCKS?
>>>
>>> I thought we still wanted either an opt-in or opt-out for the kmem
>>> driver to be able to do memmap_on_memory, in case there were
>>> performance implications or the lack of 1GiB PUDs. I haven't
>>> implemented that yet, but I was thinking along the lines of a sysfs
>>> knob exposed by kmem, that controls setting of this new
>>> MHP_SPLIT_MEMBLOCKS flag.
>>
>> Why is MHP_MEMMAP_ON_MEMORY not sufficient for that?
>>
>>
> Ah I see what you mean now - knob just controls MHP_MEMMAP_ON_MEMORY,
> and memory_hotplug is free to split to memblocks if it needs to to
> satisfy that.

And if you don't want memmap holes in a larger area you're adding (for 
example to runtime-allocate 1 GiB pages), simply check the size your 
adding, and if it's, say, less than 1 G, don't set the flag.

But that's probably a corner case use case not worth considering for now.

> 
> That sounds reasonable. Let me give this a try and see if I run into
> anything else. Thanks David!

Sure!
  

Patch

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7b36db6f1cbd..0751346193ef 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -12,6 +12,7 @@ 
 #include <linux/mm.h>
 #include <linux/mman.h>
 #include <linux/memory-tiers.h>
+#include <linux/memory_hotplug.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -105,6 +106,7 @@  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	data->mgid = rc;
 
 	for (i = 0; i < dev_dax->nr_range; i++) {
+		u64 cur_start, cur_len, remaining;
 		struct resource *res;
 		struct range range;
 
@@ -137,21 +139,42 @@  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		res->flags = IORESOURCE_SYSTEM_RAM;
 
 		/*
-		 * Ensure that future kexec'd kernels will not treat
-		 * this as RAM automatically.
+		 * Add memory in chunks of memory_block_size_bytes() so that
+		 * it is considered for MHP_MEMMAP_ON_MEMORY
+		 * @range has already been aligned to memory_block_size_bytes(),
+		 * so the following loop will always break it down cleanly.
 		 */
-		rc = add_memory_driver_managed(data->mgid, range.start,
-				range_len(&range), kmem_name, MHP_NID_IS_MGID);
+		cur_start = range.start;
+		cur_len = memory_block_size_bytes();
+		remaining = range_len(&range);
+		while (remaining) {
+			mhp_t mhp_flags = MHP_NID_IS_MGID;
 
-		if (rc) {
-			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
-					i, range.start, range.end);
-			remove_resource(res);
-			kfree(res);
-			data->res[i] = NULL;
-			if (mapped)
-				continue;
-			goto err_request_mem;
+			if (mhp_supports_memmap_on_memory(cur_len,
+							  MHP_MEMMAP_ON_MEMORY))
+				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+			/*
+			 * Ensure that future kexec'd kernels will not treat
+			 * this as RAM automatically.
+			 */
+			rc = add_memory_driver_managed(data->mgid, cur_start,
+						       cur_len, kmem_name,
+						       mhp_flags);
+
+			if (rc) {
+				dev_warn(dev,
+					 "mapping%d: %#llx-%#llx memory add failed\n",
+					 i, cur_start, cur_start + cur_len - 1);
+				remove_resource(res);
+				kfree(res);
+				data->res[i] = NULL;
+				if (mapped)
+					continue;
+				goto err_request_mem;
+			}
+
+			cur_start += cur_len;
+			remaining -= cur_len;
 		}
 		mapped++;
 	}