[RFC] memory tiering: use small chunk size and more tiers

Message ID 20221027065925.476955-1-ying.huang@intel.com
State New
Series [RFC] memory tiering: use small chunk size and more tiers

Commit Message

Huang, Ying Oct. 27, 2022, 6:59 a.m. UTC
  We need some way to override the system default memory tiers.
Consider the following example system,

type		abstract distance
----		-----------------
HBM		300
DRAM		1000
CXL_MEM		5000
PMEM		5100

Given the memory tier chunk size is 100, the default memory tiers
could be,

tier		abstract distance	types
                range
----		-----------------       -----
3		300-400			HBM
10		1000-1100		DRAM
50		5000-5100		CXL_MEM
51		5100-5200		PMEM

If we want to group CXL_MEM and PMEM into one tier, we have 2 choices.

1) Override the abstract distance of CXL_MEM or PMEM.  For example, if
we change the abstract distance of PMEM to 5050, the memory tiers
become,

tier		abstract distance	types
                range
----		-----------------       -----
3		300-400			HBM
10		1000-1100		DRAM
50		5000-5100		CXL_MEM, PMEM

2) Override the memory tier chunk size.  For example, if we change the
memory tier chunk size to 200, the memory tiers become,

tier		abstract distance	types
                range
----		-----------------       -----
1		200-400			HBM
5		1000-1200		DRAM
25		5000-5200		CXL_MEM, PMEM

But after some thought, I think choice 2) may not be good.  The
problem is that even if 2 abstract distances are almost the same, they
may be put in 2 different tiers if they sit on different sides of a
tier boundary.  For example, suppose the abstract distance of CXL_MEM
is 4990, while the abstract distance of PMEM is 5010.  Although the
difference between the abstract distances is only 20, CXL_MEM and PMEM
will be put in different tiers if the tier chunk size is 50, 100, 200,
250, 500, ....  This makes choice 2) hard to use; it may become tricky
to find an appropriate tier chunk size that satisfies all requirements.

So I suggest that we abandon choice 2) and use choice 1) only.  This
makes the overall design and user space interface simpler and easier
to use.  The overall design of the abstract distance could be,

1. Use decimal for the abstract distance and its chunk size.  This
   makes them more user-friendly.

2. Make the tier chunk size as small as possible, for example, 10.
   By default, this will put different memory types in one memory tier
   only if their performance is almost the same.  And we will not
   provide an interface to override the chunk size.

3. Make the abstract distance of normal DRAM large enough, for
   example, 1000.  Then 100 tiers can be defined below DRAM, which is
   more than enough in practice.

4. If we want to override the default memory tiers, just override the
   abstract distances of some memory types with a per-memory-type
   interface.

This patch applies the design choices above to the existing code.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <shy828301@gmail.com>
---
 include/linux/memory-tiers.h | 7 +++----
 mm/memory-tiers.c            | 7 +++----
 2 files changed, 6 insertions(+), 8 deletions(-)
  

Comments

Aneesh Kumar K.V Oct. 27, 2022, 10:45 a.m. UTC | #1
On 10/27/22 12:29 PM, Huang Ying wrote:
> We need some way to override the system default memory tiers.  For
> the example system as follows,
> 
> type		abstract distance
> ----		-----------------
> HBM		300
> DRAM		1000
> CXL_MEM		5000
> PMEM		5100
> 
> Given the memory tier chunk size is 100, the default memory tiers
> could be,
> 
> tier		abstract distance	types
>                 range
> ----		-----------------       -----
> 3		300-400			HBM
> 10		1000-1100		DRAM
> 50		5000-5100		CXL_MEM
> 51		5100-5200		PMEM
> 
> If we want to group CXL MEM and PMEM into one tier, we have 2 choices.
> 
> 1) Override the abstract distance of CXL_MEM or PMEM.  For example, if
> we change the abstract distance of PMEM to 5050, the memory tiers
> become,
> 
> tier		abstract distance	types
>                 range
> ----		-----------------       -----
> 3		300-400			HBM
> 10		1000-1100		DRAM
> 50		5000-5100		CXL_MEM, PMEM
> 
> 2) Override the memory tier chunk size.  For example, if we change the
> memory tier chunk size to 200, the memory tiers become,
> 
> tier		abstract distance	types
>                 range
> ----		-----------------       -----
> 1		200-400			HBM
> 5		1000-1200		DRAM
> 25		5000-5200		CXL_MEM, PMEM
> 
> But after some thoughts, I think choice 2) may be not good.  The
> problem is that even if 2 abstract distances are almost same, they may
> be put in 2 tier if they sit in the different sides of the tier
> boundary.  For example, if the abstract distance of CXL_MEM is 4990,
> while the abstract distance of PMEM is 5010.  Although the difference
> of the abstract distances is only 20, CXL_MEM and PMEM will put in
> different tiers if the tier chunk size is 50, 100, 200, 250, 500, ....
> This makes choice 2) hard to be used, it may become tricky to find out
> the appropriate tier chunk size that satisfying all requirements.
> 

Shouldn't we wait until we gain experience w.r.t. how we end up
mapping devices with different latencies and bandwidths before tuning these values?

> So I suggest to abandon choice 2) and use choice 1) only.  This makes
> the overall design and user space interface to be simpler and easier
> to be used.  The overall design of the abstract distance could be,
> 
> 1. Use decimal for abstract distance and its chunk size.  This makes
>    them more user friendly.
> 
> 2. Make the tier chunk size as small as possible.  For example, 10.
>    This will put different memory types in one memory tier only if their
>    performance is almost same by default.  And we will not provide the
>    interface to override the chunk size.
> 

This could also mean we end up with lots of memory tiers with
relatively small performance differences between them.  Again, it
depends on how HMAT attributes will be mapped to abstract distance.



> 3. Make the abstract distance of normal DRAM large enough.  For
>    example, 1000, then 100 tiers can be defined below DRAM, this is
>    more than enough in practice.

Why 100? Will we really have that many tiers below/faster than DRAM? As of now 
I see only HBM below it.

> 
> 4. If we want to override the default memory tiers, just override the
>    abstract distances of some memory types with a per memory type
>    interface.
> 
> This patch is to apply the design choices above in the existing code.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Bharata B Rao <bharata@amd.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Hesham Almatary <hesham.almatary@huawei.com>
> Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yang Shi <shy828301@gmail.com>
> ---
>  include/linux/memory-tiers.h | 7 +++----
>  mm/memory-tiers.c            | 7 +++----
>  2 files changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 965009aa01d7..2e39d9a6c8ce 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -7,17 +7,16 @@
>  #include <linux/kref.h>
>  #include <linux/mmzone.h>
>  /*
> - * Each tier cover a abstrace distance chunk size of 128
> + * Each tier cover a abstrace distance chunk size of 10
>   */
> -#define MEMTIER_CHUNK_BITS	7
> -#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
> +#define MEMTIER_CHUNK_SIZE	10
>  /*
>   * Smaller abstract distance values imply faster (higher) memory tiers. Offset
>   * the DRAM adistance so that we can accommodate devices with a slightly lower
>   * adistance value (slightly faster) than default DRAM adistance to be part of
>   * the same memory tier.
>   */
> -#define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))
> +#define MEMTIER_ADISTANCE_DRAM	((100 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE / 2))
>  #define MEMTIER_HOTPLUG_PRIO	100
>  
>  struct memory_tier;
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index fa8c9d07f9ce..e03011428fa5 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -165,11 +165,10 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
>  	bool found_slot = false;
>  	struct memory_tier *memtier, *new_memtier;
>  	int adistance = memtype->adistance;
> -	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
>  
>  	lockdep_assert_held_once(&memory_tier_lock);
>  
> -	adistance = round_down(adistance, memtier_adistance_chunk_size);
> +	adistance = rounddown(adistance, MEMTIER_CHUNK_SIZE);
>  	/*
>  	 * If the memtype is already part of a memory tier,
>  	 * just return that.
> @@ -204,7 +203,7 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
>  	else
>  		list_add_tail(&new_memtier->list, &memory_tiers);
>  
> -	new_memtier->dev.id = adistance >> MEMTIER_CHUNK_BITS;
> +	new_memtier->dev.id = adistance / MEMTIER_CHUNK_SIZE;
>  	new_memtier->dev.bus = &memory_tier_subsys;
>  	new_memtier->dev.release = memory_tier_device_release;
>  	new_memtier->dev.groups = memtier_dev_groups;
> @@ -641,7 +640,7 @@ static int __init memory_tier_init(void)
>  #endif
>  	mutex_lock(&memory_tier_lock);
>  	/*
> -	 * For now we can have 4 faster memory tiers with smaller adistance
> +	 * For now we can have 100 faster memory tiers with smaller adistance
>  	 * than default DRAM tier.
>  	 */
>  	default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
  
Huang, Ying Oct. 28, 2022, 3:03 a.m. UTC | #2
Hi, Aneesh,

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 10/27/22 12:29 PM, Huang Ying wrote:
>> We need some way to override the system default memory tiers.  For
>> the example system as follows,
>> 
>> type		abstract distance
>> ----		-----------------
>> HBM		300
>> DRAM		1000
>> CXL_MEM		5000
>> PMEM		5100
>> 
>> Given the memory tier chunk size is 100, the default memory tiers
>> could be,
>> 
>> tier		abstract distance	types
>>                 range
>> ----		-----------------       -----
>> 3		300-400			HBM
>> 10		1000-1100		DRAM
>> 50		5000-5100		CXL_MEM
>> 51		5100-5200		PMEM
>> 
>> If we want to group CXL MEM and PMEM into one tier, we have 2 choices.
>> 
>> 1) Override the abstract distance of CXL_MEM or PMEM.  For example, if
>> we change the abstract distance of PMEM to 5050, the memory tiers
>> become,
>> 
>> tier		abstract distance	types
>>                 range
>> ----		-----------------       -----
>> 3		300-400			HBM
>> 10		1000-1100		DRAM
>> 50		5000-5100		CXL_MEM, PMEM
>> 
>> 2) Override the memory tier chunk size.  For example, if we change the
>> memory tier chunk size to 200, the memory tiers become,
>> 
>> tier		abstract distance	types
>>                 range
>> ----		-----------------       -----
>> 1		200-400			HBM
>> 5		1000-1200		DRAM
>> 25		5000-5200		CXL_MEM, PMEM
>> 
>> But after some thoughts, I think choice 2) may be not good.  The
>> problem is that even if 2 abstract distances are almost same, they may
>> be put in 2 tier if they sit in the different sides of the tier
>> boundary.  For example, if the abstract distance of CXL_MEM is 4990,
>> while the abstract distance of PMEM is 5010.  Although the difference
>> of the abstract distances is only 20, CXL_MEM and PMEM will put in
>> different tiers if the tier chunk size is 50, 100, 200, 250, 500, ....
>> This makes choice 2) hard to be used, it may become tricky to find out
>> the appropriate tier chunk size that satisfying all requirements.
>> 
>
> Shouldn't we wait for gaining experience w.r.t how we would end up
> mapping devices with different latencies and bandwidth before tuning these values? 

Just want to discuss the overall design.

>> So I suggest to abandon choice 2) and use choice 1) only.  This makes
>> the overall design and user space interface to be simpler and easier
>> to be used.  The overall design of the abstract distance could be,
>> 
>> 1. Use decimal for abstract distance and its chunk size.  This makes
>>    them more user friendly.
>> 
>> 2. Make the tier chunk size as small as possible.  For example, 10.
>>    This will put different memory types in one memory tier only if their
>>    performance is almost same by default.  And we will not provide the
>>    interface to override the chunk size.
>> 
>
> this could also mean we can end up with lots of memory tiers with relative
> smaller performance difference between them. Again it depends how HMAT
> attributes will be used to map to abstract distance.

Per my understanding, there will not be many memory types in a system.
So, there will not be many memory tiers either.  In most systems, there
are only 2 or 3 memory tiers, for example, HBM, DRAM, CXL, etc.  Do you
know of systems with many memory types?  The basic idea is to put
different memory types in different memory tiers by default.  If users
want to group them, they can do that by overriding the abstract
distance of some memory type.

>
>> 3. Make the abstract distance of normal DRAM large enough.  For
>>    example, 1000, then 100 tiers can be defined below DRAM, this is
>>    more than enough in practice.
>
> Why 100? Will we really have that many tiers below/faster than DRAM? As of now 
> I see only HBM below it.

Yes.  100 is more than enough.  We just want to avoid grouping
different memory types together by default.

Best Regards,
Huang, Ying

>> 
>> 4. If we want to override the default memory tiers, just override the
>>    abstract distances of some memory types with a per memory type
>>    interface.
>> 
>> This patch is to apply the design choices above in the existing code.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Bharata B Rao <bharata@amd.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Cc: Davidlohr Bueso <dave@stgolabs.net>
>> Cc: Hesham Almatary <hesham.almatary@huawei.com>
>> Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Tim Chen <tim.c.chen@intel.com>
>> Cc: Wei Xu <weixugc@google.com>
>> Cc: Yang Shi <shy828301@gmail.com>
>> ---
>>  include/linux/memory-tiers.h | 7 +++----
>>  mm/memory-tiers.c            | 7 +++----
>>  2 files changed, 6 insertions(+), 8 deletions(-)
>> 
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> index 965009aa01d7..2e39d9a6c8ce 100644
>> --- a/include/linux/memory-tiers.h
>> +++ b/include/linux/memory-tiers.h
>> @@ -7,17 +7,16 @@
>>  #include <linux/kref.h>
>>  #include <linux/mmzone.h>
>>  /*
>> - * Each tier cover a abstrace distance chunk size of 128
>> + * Each tier cover a abstrace distance chunk size of 10
>>   */
>> -#define MEMTIER_CHUNK_BITS	7
>> -#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
>> +#define MEMTIER_CHUNK_SIZE	10
>>  /*
>>   * Smaller abstract distance values imply faster (higher) memory tiers. Offset
>>   * the DRAM adistance so that we can accommodate devices with a slightly lower
>>   * adistance value (slightly faster) than default DRAM adistance to be part of
>>   * the same memory tier.
>>   */
>> -#define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))
>> +#define MEMTIER_ADISTANCE_DRAM	((100 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE / 2))
>>  #define MEMTIER_HOTPLUG_PRIO	100
>>  
>>  struct memory_tier;
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> index fa8c9d07f9ce..e03011428fa5 100644
>> --- a/mm/memory-tiers.c
>> +++ b/mm/memory-tiers.c
>> @@ -165,11 +165,10 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
>>  	bool found_slot = false;
>>  	struct memory_tier *memtier, *new_memtier;
>>  	int adistance = memtype->adistance;
>> -	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
>>  
>>  	lockdep_assert_held_once(&memory_tier_lock);
>>  
>> -	adistance = round_down(adistance, memtier_adistance_chunk_size);
>> +	adistance = rounddown(adistance, MEMTIER_CHUNK_SIZE);
>>  	/*
>>  	 * If the memtype is already part of a memory tier,
>>  	 * just return that.
>> @@ -204,7 +203,7 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
>>  	else
>>  		list_add_tail(&new_memtier->list, &memory_tiers);
>>  
>> -	new_memtier->dev.id = adistance >> MEMTIER_CHUNK_BITS;
>> +	new_memtier->dev.id = adistance / MEMTIER_CHUNK_SIZE;
>>  	new_memtier->dev.bus = &memory_tier_subsys;
>>  	new_memtier->dev.release = memory_tier_device_release;
>>  	new_memtier->dev.groups = memtier_dev_groups;
>> @@ -641,7 +640,7 @@ static int __init memory_tier_init(void)
>>  #endif
>>  	mutex_lock(&memory_tier_lock);
>>  	/*
>> -	 * For now we can have 4 faster memory tiers with smaller adistance
>> +	 * For now we can have 100 faster memory tiers with smaller adistance
>>  	 * than default DRAM tier.
>>  	 */
>>  	default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
  
Aneesh Kumar K.V Oct. 28, 2022, 5:05 a.m. UTC | #3
On 10/28/22 8:33 AM, Huang, Ying wrote:
> Hi, Aneesh,
> 
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> 
>> On 10/27/22 12:29 PM, Huang Ying wrote:
>>> We need some way to override the system default memory tiers.  For
>>> the example system as follows,
>>>
>>> type		abstract distance
>>> ----		-----------------
>>> HBM		300
>>> DRAM		1000
>>> CXL_MEM		5000
>>> PMEM		5100
>>>
>>> Given the memory tier chunk size is 100, the default memory tiers
>>> could be,
>>>
>>> tier		abstract distance	types
>>>                 range
>>> ----		-----------------       -----
>>> 3		300-400			HBM
>>> 10		1000-1100		DRAM
>>> 50		5000-5100		CXL_MEM
>>> 51		5100-5200		PMEM
>>>
>>> If we want to group CXL MEM and PMEM into one tier, we have 2 choices.
>>>
>>> 1) Override the abstract distance of CXL_MEM or PMEM.  For example, if
>>> we change the abstract distance of PMEM to 5050, the memory tiers
>>> become,
>>>
>>> tier		abstract distance	types
>>>                 range
>>> ----		-----------------       -----
>>> 3		300-400			HBM
>>> 10		1000-1100		DRAM
>>> 50		5000-5100		CXL_MEM, PMEM
>>>
>>> 2) Override the memory tier chunk size.  For example, if we change the
>>> memory tier chunk size to 200, the memory tiers become,
>>>
>>> tier		abstract distance	types
>>>                 range
>>> ----		-----------------       -----
>>> 1		200-400			HBM
>>> 5		1000-1200		DRAM
>>> 25		5000-5200		CXL_MEM, PMEM
>>>
>>> But after some thoughts, I think choice 2) may be not good.  The
>>> problem is that even if 2 abstract distances are almost same, they may
>>> be put in 2 tier if they sit in the different sides of the tier
>>> boundary.  For example, if the abstract distance of CXL_MEM is 4990,
>>> while the abstract distance of PMEM is 5010.  Although the difference
>>> of the abstract distances is only 20, CXL_MEM and PMEM will put in
>>> different tiers if the tier chunk size is 50, 100, 200, 250, 500, ....
>>> This makes choice 2) hard to be used, it may become tricky to find out
>>> the appropriate tier chunk size that satisfying all requirements.
>>>
>>
>> Shouldn't we wait for gaining experience w.r.t how we would end up
>> mapping devices with different latencies and bandwidth before tuning these values? 
> 
> Just want to discuss the overall design.
> 
>>> So I suggest to abandon choice 2) and use choice 1) only.  This makes
>>> the overall design and user space interface to be simpler and easier
>>> to be used.  The overall design of the abstract distance could be,
>>>
>>> 1. Use decimal for abstract distance and its chunk size.  This makes
>>>    them more user friendly.
>>>
>>> 2. Make the tier chunk size as small as possible.  For example, 10.
>>>    This will put different memory types in one memory tier only if their
>>>    performance is almost same by default.  And we will not provide the
>>>    interface to override the chunk size.
>>>
>>
>> this could also mean we can end up with lots of memory tiers with relative
>> smaller performance difference between them. Again it depends how HMAT
>> attributes will be used to map to abstract distance.
> 
> Per my understanding, there will not be many memory types in a system.
> So, there will not be many memory tiers too.  In most systems, there are
> only 2 or 3 memory tiers in the system, for example, HBM, DRAM, CXL,
> etc. 

So we don't need the chunk size to be 10, because we don't foresee us
needing to group devices into that many tiers.

> Do you know systems with many memory types?  The basic idea is to
> put different memory types in different memory tiers by default.  If
> users want to group them, they can do that via overriding the abstract
> distance of some memory type.
> 

With a small chunk size, and depending on how we are going to derive
abstract distance, I am wondering whether we would end up with lots of
memory tiers with no real value.  Hence my suggestion to wait on making
a change like this until we have code that maps HMAT/CDAT attributes to
abstract distance.




>>
>>> 3. Make the abstract distance of normal DRAM large enough.  For
>>>    example, 1000, then 100 tiers can be defined below DRAM, this is
>>>    more than enough in practice.
>>
>> Why 100? Will we really have that many tiers below/faster than DRAM? As of now 
>> I see only HBM below it.
> 
> Yes.  100 is more than enough.  We just want to avoid to group different
> memory types by default.
> 
> Best Regards,
> Huang, Ying
>
  
Huang, Ying Oct. 28, 2022, 5:46 a.m. UTC | #4
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 10/28/22 8:33 AM, Huang, Ying wrote:
>> Hi, Aneesh,
>> 
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> On 10/27/22 12:29 PM, Huang Ying wrote:
>>>> We need some way to override the system default memory tiers.  For
>>>> the example system as follows,
>>>>
>>>> type		abstract distance
>>>> ----		-----------------
>>>> HBM		300
>>>> DRAM		1000
>>>> CXL_MEM		5000
>>>> PMEM		5100
>>>>
>>>> Given the memory tier chunk size is 100, the default memory tiers
>>>> could be,
>>>>
>>>> tier		abstract distance	types
>>>>                 range
>>>> ----		-----------------       -----
>>>> 3		300-400			HBM
>>>> 10		1000-1100		DRAM
>>>> 50		5000-5100		CXL_MEM
>>>> 51		5100-5200		PMEM
>>>>
>>>> If we want to group CXL MEM and PMEM into one tier, we have 2 choices.
>>>>
>>>> 1) Override the abstract distance of CXL_MEM or PMEM.  For example, if
>>>> we change the abstract distance of PMEM to 5050, the memory tiers
>>>> become,
>>>>
>>>> tier		abstract distance	types
>>>>                 range
>>>> ----		-----------------       -----
>>>> 3		300-400			HBM
>>>> 10		1000-1100		DRAM
>>>> 50		5000-5100		CXL_MEM, PMEM
>>>>
>>>> 2) Override the memory tier chunk size.  For example, if we change the
>>>> memory tier chunk size to 200, the memory tiers become,
>>>>
>>>> tier		abstract distance	types
>>>>                 range
>>>> ----		-----------------       -----
>>>> 1		200-400			HBM
>>>> 5		1000-1200		DRAM
>>>> 25		5000-5200		CXL_MEM, PMEM
>>>>
>>>> But after some thoughts, I think choice 2) may be not good.  The
>>>> problem is that even if 2 abstract distances are almost same, they may
>>>> be put in 2 tier if they sit in the different sides of the tier
>>>> boundary.  For example, if the abstract distance of CXL_MEM is 4990,
>>>> while the abstract distance of PMEM is 5010.  Although the difference
>>>> of the abstract distances is only 20, CXL_MEM and PMEM will put in
>>>> different tiers if the tier chunk size is 50, 100, 200, 250, 500, ....
>>>> This makes choice 2) hard to be used, it may become tricky to find out
>>>> the appropriate tier chunk size that satisfying all requirements.
>>>>
>>>
>>> Shouldn't we wait for gaining experience w.r.t how we would end up
>>> mapping devices with different latencies and bandwidth before tuning these values? 
>> 
>> Just want to discuss the overall design.
>> 
>>>> So I suggest to abandon choice 2) and use choice 1) only.  This makes
>>>> the overall design and user space interface to be simpler and easier
>>>> to be used.  The overall design of the abstract distance could be,
>>>>
>>>> 1. Use decimal for abstract distance and its chunk size.  This makes
>>>>    them more user friendly.
>>>>
>>>> 2. Make the tier chunk size as small as possible.  For example, 10.
>>>>    This will put different memory types in one memory tier only if their
>>>>    performance is almost same by default.  And we will not provide the
>>>>    interface to override the chunk size.
>>>>
>>>
>>> this could also mean we can end up with lots of memory tiers with relative
>>> smaller performance difference between them. Again it depends how HMAT
>>> attributes will be used to map to abstract distance.
>> 
>> Per my understanding, there will not be many memory types in a system.
>> So, there will not be many memory tiers too.  In most systems, there are
>> only 2 or 3 memory tiers in the system, for example, HBM, DRAM, CXL,
>> etc. 
>
> So we don't need the chunk size to be 10 because we don't forsee us needing
> to group devices into that many tiers. 

I suggest using a small chunk size to avoid accidentally grouping 2
memory types into one memory tier.

>> Do you know systems with many memory types?  The basic idea is to
>> put different memory types in different memory tiers by default.  If
>> users want to group them, they can do that via overriding the abstract
>> distance of some memory type.
>> 
>
> with small chunk size and depending on how we are going to derive abstract distance,
> I am wondering whether we would end up with lots of memory tiers with no 
> real value. Hence my suggestion to wait making a change like this till we have
> code that map HMAT/CDAT attributes to abstract distance. 

Per my understanding, the NUMA nodes of the same memory type/tier will
have the exact same latency and bandwidth in HMAT/CDAT for the CPU in
the same socket.

If my understanding is correct, you think the latency / bandwidth of
these NUMA nodes will be near each other, but may be different.

Even if the latency / bandwidth of these NUMA nodes isn't exactly the
same, we should deal with that in memory types instead of memory tiers.
There's only one abstract distance for each memory type.

So, I still believe we will not have many memory tiers with my proposal.

I don't care too much about the exact number, but I want to discuss
some general design choices,

a) Avoid grouping multiple memory types into one memory tier by
   default most of the time.

b) Abandon customizing the abstract distance chunk size.

Best Regards,
Huang, Ying

>
>>>
>>>> 3. Make the abstract distance of normal DRAM large enough.  For
>>>>    example, 1000, then 100 tiers can be defined below DRAM, this is
>>>>    more than enough in practice.
>>>
>>> Why 100? Will we really have that many tiers below/faster than DRAM? As of now 
>>> I see only HBM below it.
>> 
>> Yes.  100 is more than enough.  We just want to avoid to group different
>> memory types by default.
>> 
>> Best Regards,
>> Huang, Ying
>>
  
Bharata B Rao Oct. 28, 2022, 8:04 a.m. UTC | #5
On 10/28/2022 11:16 AM, Huang, Ying wrote:
> If my understanding were correct, you think the latency / bandwidth of
> these NUMA nodes will near each other, but may be different.
> 
> Even if the latency / bandwidth of these NUMA nodes isn't exactly same,
> we should deal with that in memory types instead of memory tiers.
> There's only one abstract distance for each memory type.
> 
> So, I still believe we will not have many memory tiers with my proposal.
> 
> I don't care too much about the exact number, but want to discuss some
> general design choice,
> 
> a) Avoid to group multiple memory types into one memory tier by default
>    at most times.

Do you expect the abstract distances of two different types to be
close enough in real life (like you showed in your example with
CXL - 5000 and PMEM - 5100) that they will get assigned into the same
tier most of the time?

Are you foreseeing that abstract distances that get mapped from
sources like HMAT would run into this issue?

Regards,
Bharata.
  
Huang, Ying Oct. 28, 2022, 8:33 a.m. UTC | #6
Bharata B Rao <bharata@amd.com> writes:

> On 10/28/2022 11:16 AM, Huang, Ying wrote:
>> If my understanding were correct, you think the latency / bandwidth of
>> these NUMA nodes will near each other, but may be different.
>> 
>> Even if the latency / bandwidth of these NUMA nodes isn't exactly same,
>> we should deal with that in memory types instead of memory tiers.
>> There's only one abstract distance for each memory type.
>> 
>> So, I still believe we will not have many memory tiers with my proposal.
>> 
>> I don't care too much about the exact number, but want to discuss some
>> general design choice,
>> 
>> a) Avoid to group multiple memory types into one memory tier by default
>>    at most times.
>
> Do you expect the abstract distances of two different types to be
> close enough in real life (like you showed in your example with
> CXL - 5000 and PMEM - 5100) that they will get assigned into same tier
> most times?
>
> Are you foreseeing that abstract distance that get mapped by sources
> like HMAT would run into this issue?

Only if we set the abstract distance chunk size large.  So, I think
that it's better to set the chunk size as small as possible to avoid
potential issues.  What is the downside of setting the chunk size
small?

Best Regards,
Huang, Ying
  
Bharata B Rao Oct. 28, 2022, 1:53 p.m. UTC | #7
On 10/28/2022 2:03 PM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
> 
>> On 10/28/2022 11:16 AM, Huang, Ying wrote:
>>> If my understanding were correct, you think the latency / bandwidth of
>>> these NUMA nodes will near each other, but may be different.
>>>
>>> Even if the latency / bandwidth of these NUMA nodes isn't exactly same,
>>> we should deal with that in memory types instead of memory tiers.
>>> There's only one abstract distance for each memory type.
>>>
>>> So, I still believe we will not have many memory tiers with my proposal.
>>>
>>> I don't care too much about the exact number, but want to discuss some
>>> general design choice,
>>>
>>> a) Avoid to group multiple memory types into one memory tier by default
>>>    at most times.
>>
>> Do you expect the abstract distances of two different types to be
>> close enough in real life (like you showed in your example with
>> CXL - 5000 and PMEM - 5100) that they will get assigned into same tier
>> most times?
>>
>> Are you foreseeing that abstract distance that get mapped by sources
>> like HMAT would run into this issue?
> 
> Only if we set abstract distance chunk size large.  So, I think that
> it's better to set chunk size as small as possible to avoid potential
> issue.  What is the downside to set the chunk size small?

I don't see anything in particular.  However,

- With just two memory types (default_dram_type and dax_slowmem_type
with adistance values of 576 and 576*5 respectively) defined currently,
- With no interface yet to set/change the adistance value of a memory
type,
- With no defined way to convert the performance characteristics info
(bw and latency) from sources like HMAT into an adistance value,

I find it a bit difficult to see how a chunk size of 10, compared to
the existing 128, could be more useful.

Regards,
Bharata.
  
Huang, Ying Oct. 31, 2022, 1:33 a.m. UTC | #8
Bharata B Rao <bharata@amd.com> writes:

> On 10/28/2022 2:03 PM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>> 
>>> On 10/28/2022 11:16 AM, Huang, Ying wrote:
>>>> If my understanding were correct, you think the latency / bandwidth of
>>>> these NUMA nodes will near each other, but may be different.
>>>>
>>>> Even if the latency / bandwidth of these NUMA nodes isn't exactly same,
>>>> we should deal with that in memory types instead of memory tiers.
>>>> There's only one abstract distance for each memory type.
>>>>
>>>> So, I still believe we will not have many memory tiers with my proposal.
>>>>
>>>> I don't care too much about the exact number, but want to discuss some
>>>> general design choice,
>>>>
>>>> a) Avoid to group multiple memory types into one memory tier by default
>>>>    at most times.
>>>
>>> Do you expect the abstract distances of two different types to be
>>> close enough in real life (like you showed in your example with
>>> CXL - 5000 and PMEM - 5100) that they will get assigned into same tier
>>> most times?
>>>
>>> Are you foreseeing that abstract distance that get mapped by sources
>>> like HMAT would run into this issue?
>> 
>> Only if we set abstract distance chunk size large.  So, I think that
>> it's better to set chunk size as small as possible to avoid potential
>> issue.  What is the downside to set the chunk size small?
>
> I don't see anything in particular. However
>
> - With just two memory types (default_dram_type and dax_slowmem_type
> with adistance values of 576 and 576*5 respectively) defined currently,
> - With no interface yet to set/change adistance value of a memory type,
> - With no defined way to convert the performance characteristics info
> (bw and latency) from sources like HMAT into a adistance value,
>
> I find it a bit difficult to see how a chunk size of 10 against the
> existing 128 could be more useful.

OK.  Maybe we pay too much attention to the specific numbers.  My target
isn't to push this specific RFC into the kernel.  I just want to discuss
the design choices with the community.

My basic idea is NOT to group memory types into memory tiers by
customizing the abstract distance chunk size, because that's hard to use
and to implement.  So far, it appears that nobody objects to this.

Then, it's even better to avoid adjusting the abstract distance chunk
size in the kernel as much as possible.  This will make the life of
user space tools/scripts easier.  One solution is to define more than
enough possible tiers below DRAM (we have an unlimited number of tiers
above DRAM).

In the upstream implementation, 4 tiers are possible below DRAM.  That's
enough for now.  But in the long run, it may be better to define more.
100 possible tiers below DRAM may be too extreme.  How about defining the
abstract distance of DRAM to be 1050 and the chunk size to be 100?  Then
we will have 10 possible tiers below DRAM.  That may be more than enough
even in the long run.

Again, the specific number isn't so important to me.  So please suggest
your number if necessary.

Best Regards,
Huang, Ying
  
Michal Hocko Nov. 1, 2022, 2:34 p.m. UTC | #9
On Mon 31-10-22 09:33:49, Huang, Ying wrote:
[...]
> In the upstream implementation, 4 tiers are possible below DRAM.  That's
> enough for now.  But in the long run, it may be better to define more.
> 100 possible tiers below DRAM may be too extreme.

I am just curious. Are configurations with more than a couple of tiers
even manageable? I mean, applications have been struggling even with
regular NUMA systems for years and the vast majority of them are largely
NUMA unaware. How are they going to configure for a more complex system
when a) there is no resource access control, so whatever you aim for
might not be available, and b) in which situations is there going to be
a demand for only a subset of tiers (GPU memory?)?

Thanks!
  
Huang, Ying Nov. 2, 2022, 12:39 a.m. UTC | #10
Michal Hocko <mhocko@suse.com> writes:

> On Mon 31-10-22 09:33:49, Huang, Ying wrote:
> [...]
>> In the upstream implementation, 4 tiers are possible below DRAM.  That's
>> enough for now.  But in the long run, it may be better to define more.
>> 100 possible tiers below DRAM may be too extreme.
>
> I am just curious. Is any configurations with more than couple of tiers
> even manageable? I mean applications have been struggling even with
> regular NUMA systems for years and vast majority of them is largerly
> NUMA unaware. How are they going to configure for a more complex system
> when a) there is no resource access control so whatever you aim for
> might not be available and b) in which situations there is going to be a
> demand only for subset of tears (GPU memory?) ?

Sorry for the confusion.  I think that there will be only several (less
than 10) tiers in a system in practice.  Yes, here I suggested defining
100 (10 in the later text) POSSIBLE tiers below DRAM.  My intention isn't
to manage a system with tens of memory tiers.  Instead, my intention is
to avoid putting 2 memory types into one memory tier by accident, by
making the abstract distance range of each memory tier as small as
possible.  The more possible memory tiers, the smaller the abstract
distance range of each memory tier.

Best Regards,
Huang, Ying
  
Michal Hocko Nov. 2, 2022, 7:51 a.m. UTC | #11
On Wed 02-11-22 08:39:49, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Mon 31-10-22 09:33:49, Huang, Ying wrote:
> > [...]
> >> In the upstream implementation, 4 tiers are possible below DRAM.  That's
> >> enough for now.  But in the long run, it may be better to define more.
> >> 100 possible tiers below DRAM may be too extreme.
> >
> > I am just curious. Is any configurations with more than couple of tiers
> > even manageable? I mean applications have been struggling even with
> > regular NUMA systems for years and vast majority of them is largerly
> > NUMA unaware. How are they going to configure for a more complex system
> > when a) there is no resource access control so whatever you aim for
> > might not be available and b) in which situations there is going to be a
> > demand only for subset of tears (GPU memory?) ?
> 
> Sorry for confusing.  I think that there are only several (less than 10)
> tiers in a system in practice.  Yes, here, I suggested to define 100 (10
> in the later text) POSSIBLE tiers below DRAM.  My intention isn't to
> manage a system with tens memory tiers.  Instead, my intention is to
> avoid to put 2 memory types into one memory tier by accident via make
> the abstract distance range of each memory tier as small as possible.
> More possible memory tiers, smaller abstract distance range of each
> memory tier.

TBH I do not really understand how tweaking ranges helps anything.
IIUC drivers are free to assign any abstract distance so they will clash
without any higher level coordination.
  
Huang, Ying Nov. 2, 2022, 8:02 a.m. UTC | #12
Michal Hocko <mhocko@suse.com> writes:

> On Wed 02-11-22 08:39:49, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Mon 31-10-22 09:33:49, Huang, Ying wrote:
>> > [...]
>> >> In the upstream implementation, 4 tiers are possible below DRAM.  That's
>> >> enough for now.  But in the long run, it may be better to define more.
>> >> 100 possible tiers below DRAM may be too extreme.
>> >
>> > I am just curious. Is any configurations with more than couple of tiers
>> > even manageable? I mean applications have been struggling even with
>> > regular NUMA systems for years and vast majority of them is largerly
>> > NUMA unaware. How are they going to configure for a more complex system
>> > when a) there is no resource access control so whatever you aim for
>> > might not be available and b) in which situations there is going to be a
>> > demand only for subset of tears (GPU memory?) ?
>> 
>> Sorry for confusing.  I think that there are only several (less than 10)
>> tiers in a system in practice.  Yes, here, I suggested to define 100 (10
>> in the later text) POSSIBLE tiers below DRAM.  My intention isn't to
>> manage a system with tens memory tiers.  Instead, my intention is to
>> avoid to put 2 memory types into one memory tier by accident via make
>> the abstract distance range of each memory tier as small as possible.
>> More possible memory tiers, smaller abstract distance range of each
>> memory tier.
>
> TBH I do not really understand how tweaking ranges helps anything.
> IIUC drivers are free to assign any abstract distance so they will clash
> without any higher level coordination.

Yes, that's possible.  Each memory tier corresponds to one abstract
distance range.  The larger the range is, the higher the possibility of
clashing.  So I suggest making the abstract distance range smaller to
reduce the possibility of clashing.

Best Regards,
Huang, Ying
  
Michal Hocko Nov. 2, 2022, 8:17 a.m. UTC | #13
On Wed 02-11-22 16:02:54, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Wed 02-11-22 08:39:49, Huang, Ying wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> >> 
> >> > On Mon 31-10-22 09:33:49, Huang, Ying wrote:
> >> > [...]
> >> >> In the upstream implementation, 4 tiers are possible below DRAM.  That's
> >> >> enough for now.  But in the long run, it may be better to define more.
> >> >> 100 possible tiers below DRAM may be too extreme.
> >> >
> >> > I am just curious. Is any configurations with more than couple of tiers
> >> > even manageable? I mean applications have been struggling even with
> >> > regular NUMA systems for years and vast majority of them is largerly
> >> > NUMA unaware. How are they going to configure for a more complex system
> >> > when a) there is no resource access control so whatever you aim for
> >> > might not be available and b) in which situations there is going to be a
> >> > demand only for subset of tears (GPU memory?) ?
> >> 
> >> Sorry for confusing.  I think that there are only several (less than 10)
> >> tiers in a system in practice.  Yes, here, I suggested to define 100 (10
> >> in the later text) POSSIBLE tiers below DRAM.  My intention isn't to
> >> manage a system with tens memory tiers.  Instead, my intention is to
> >> avoid to put 2 memory types into one memory tier by accident via make
> >> the abstract distance range of each memory tier as small as possible.
> >> More possible memory tiers, smaller abstract distance range of each
> >> memory tier.
> >
> > TBH I do not really understand how tweaking ranges helps anything.
> > IIUC drivers are free to assign any abstract distance so they will clash
> > without any higher level coordination.
> 
> Yes.  That's possible.  Each memory tier corresponds to one abstract
> distance range.  The larger the range is, the higher the possibility of
> clashing is.  So I suggest to make the abstract distance range smaller
> to reduce the possibility of clashing.

I am sorry, but I really do not understand how the size of the range
actually addresses the fundamental issue that each driver simply picks
what it wants. Is there any enumeration defining the basic
characteristics of each tier? How does a driver developer know which
tier to assign their driver to?
  
Huang, Ying Nov. 2, 2022, 8:28 a.m. UTC | #14
Michal Hocko <mhocko@suse.com> writes:

> On Wed 02-11-22 16:02:54, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Wed 02-11-22 08:39:49, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@suse.com> writes:
>> >> 
>> >> > On Mon 31-10-22 09:33:49, Huang, Ying wrote:
>> >> > [...]
>> >> >> In the upstream implementation, 4 tiers are possible below DRAM.  That's
>> >> >> enough for now.  But in the long run, it may be better to define more.
>> >> >> 100 possible tiers below DRAM may be too extreme.
>> >> >
>> >> > I am just curious. Is any configurations with more than couple of tiers
>> >> > even manageable? I mean applications have been struggling even with
>> >> > regular NUMA systems for years and vast majority of them is largerly
>> >> > NUMA unaware. How are they going to configure for a more complex system
>> >> > when a) there is no resource access control so whatever you aim for
>> >> > might not be available and b) in which situations there is going to be a
>> >> > demand only for subset of tears (GPU memory?) ?
>> >> 
>> >> Sorry for confusing.  I think that there are only several (less than 10)
>> >> tiers in a system in practice.  Yes, here, I suggested to define 100 (10
>> >> in the later text) POSSIBLE tiers below DRAM.  My intention isn't to
>> >> manage a system with tens memory tiers.  Instead, my intention is to
>> >> avoid to put 2 memory types into one memory tier by accident via make
>> >> the abstract distance range of each memory tier as small as possible.
>> >> More possible memory tiers, smaller abstract distance range of each
>> >> memory tier.
>> >
>> > TBH I do not really understand how tweaking ranges helps anything.
>> > IIUC drivers are free to assign any abstract distance so they will clash
>> > without any higher level coordination.
>> 
>> Yes.  That's possible.  Each memory tier corresponds to one abstract
>> distance range.  The larger the range is, the higher the possibility of
>> clashing is.  So I suggest to make the abstract distance range smaller
>> to reduce the possibility of clashing.
>
> I am sorry but I really do not understand how the size of the range
> actually addresses a fundamental issue that each driver simply picks
> what it wants. Is there any enumeration defining basic characteristic of
> each tier? How does a driver developer knows which tear to assign its
> driver to?

The smaller range size will not guarantee anything.  It just tries to
help the default behavior.

The drivers are expected to assign the abstract distance based on the
memory latency/bandwidth, etc.  And the abstract distance range of a
memory tier corresponds to a memory latency/bandwidth range too.  So, if
the size of the abstract distance range is smaller, the possibility for
two types of memory with different latency/bandwidth to clash in the
same abstract distance range is lower.

Clashing isn't a total disaster.  We plan to provide a per-memory-type
knob to offset the abstract distance provided by the driver.  Then, we
can move clashing memory types apart if necessary.

Best Regards,
Huang, Ying
  
Michal Hocko Nov. 2, 2022, 8:39 a.m. UTC | #15
On Wed 02-11-22 16:28:08, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Wed 02-11-22 16:02:54, Huang, Ying wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> >> 
> >> > On Wed 02-11-22 08:39:49, Huang, Ying wrote:
> >> >> Michal Hocko <mhocko@suse.com> writes:
> >> >> 
> >> >> > On Mon 31-10-22 09:33:49, Huang, Ying wrote:
> >> >> > [...]
> >> >> >> In the upstream implementation, 4 tiers are possible below DRAM.  That's
> >> >> >> enough for now.  But in the long run, it may be better to define more.
> >> >> >> 100 possible tiers below DRAM may be too extreme.
> >> >> >
> >> >> > I am just curious. Is any configurations with more than couple of tiers
> >> >> > even manageable? I mean applications have been struggling even with
> >> >> > regular NUMA systems for years and vast majority of them is largerly
> >> >> > NUMA unaware. How are they going to configure for a more complex system
> >> >> > when a) there is no resource access control so whatever you aim for
> >> >> > might not be available and b) in which situations there is going to be a
> >> >> > demand only for subset of tears (GPU memory?) ?
> >> >> 
> >> >> Sorry for confusing.  I think that there are only several (less than 10)
> >> >> tiers in a system in practice.  Yes, here, I suggested to define 100 (10
> >> >> in the later text) POSSIBLE tiers below DRAM.  My intention isn't to
> >> >> manage a system with tens memory tiers.  Instead, my intention is to
> >> >> avoid to put 2 memory types into one memory tier by accident via make
> >> >> the abstract distance range of each memory tier as small as possible.
> >> >> More possible memory tiers, smaller abstract distance range of each
> >> >> memory tier.
> >> >
> >> > TBH I do not really understand how tweaking ranges helps anything.
> >> > IIUC drivers are free to assign any abstract distance so they will clash
> >> > without any higher level coordination.
> >> 
> >> Yes.  That's possible.  Each memory tier corresponds to one abstract
> >> distance range.  The larger the range is, the higher the possibility of
> >> clashing is.  So I suggest to make the abstract distance range smaller
> >> to reduce the possibility of clashing.
> >
> > I am sorry but I really do not understand how the size of the range
> > actually addresses a fundamental issue that each driver simply picks
> > what it wants. Is there any enumeration defining basic characteristic of
> > each tier? How does a driver developer knows which tear to assign its
> > driver to?
> 
> The smaller range size will not guarantee anything.  It just tries to
> help the default behavior.
> 
> The drivers are expected to assign the abstract distance based on the
> memory latency/bandwidth, etc.

Would it be possible/feasible to have a canonical way to calculate the
abstract distance from these characteristics in the core kernel so that
drivers do not even have to fall into that trap?
  
Huang, Ying Nov. 2, 2022, 8:45 a.m. UTC | #16
Michal Hocko <mhocko@suse.com> writes:

> On Wed 02-11-22 16:28:08, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Wed 02-11-22 16:02:54, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@suse.com> writes:
>> >> 
>> >> > On Wed 02-11-22 08:39:49, Huang, Ying wrote:
>> >> >> Michal Hocko <mhocko@suse.com> writes:
>> >> >> 
>> >> >> > On Mon 31-10-22 09:33:49, Huang, Ying wrote:
>> >> >> > [...]
>> >> >> >> In the upstream implementation, 4 tiers are possible below DRAM.  That's
>> >> >> >> enough for now.  But in the long run, it may be better to define more.
>> >> >> >> 100 possible tiers below DRAM may be too extreme.
>> >> >> >
>> >> >> > I am just curious. Is any configurations with more than couple of tiers
>> >> >> > even manageable? I mean applications have been struggling even with
>> >> >> > regular NUMA systems for years and vast majority of them is largerly
>> >> >> > NUMA unaware. How are they going to configure for a more complex system
>> >> >> > when a) there is no resource access control so whatever you aim for
>> >> >> > might not be available and b) in which situations there is going to be a
>> >> >> > demand only for subset of tears (GPU memory?) ?
>> >> >> 
>> >> >> Sorry for confusing.  I think that there are only several (less than 10)
>> >> >> tiers in a system in practice.  Yes, here, I suggested to define 100 (10
>> >> >> in the later text) POSSIBLE tiers below DRAM.  My intention isn't to
>> >> >> manage a system with tens memory tiers.  Instead, my intention is to
>> >> >> avoid to put 2 memory types into one memory tier by accident via make
>> >> >> the abstract distance range of each memory tier as small as possible.
>> >> >> More possible memory tiers, smaller abstract distance range of each
>> >> >> memory tier.
>> >> >
>> >> > TBH I do not really understand how tweaking ranges helps anything.
>> >> > IIUC drivers are free to assign any abstract distance so they will clash
>> >> > without any higher level coordination.
>> >> 
>> >> Yes.  That's possible.  Each memory tier corresponds to one abstract
>> >> distance range.  The larger the range is, the higher the possibility of
>> >> clashing is.  So I suggest to make the abstract distance range smaller
>> >> to reduce the possibility of clashing.
>> >
>> > I am sorry but I really do not understand how the size of the range
>> > actually addresses a fundamental issue that each driver simply picks
>> > what it wants. Is there any enumeration defining basic characteristic of
>> > each tier? How does a driver developer knows which tear to assign its
>> > driver to?
>> 
>> The smaller range size will not guarantee anything.  It just tries to
>> help the default behavior.
>> 
>> The drivers are expected to assign the abstract distance based on the
>> memory latency/bandwidth, etc.
>
> Would it be possible/feasible to have a canonical way to calculate the
> abstract distance from these characteristics by the core kernel so that
> drivers do not even have fall into that trap?

Yes.  That sounds like a good idea.  We can provide a function that maps
from the memory latency/bandwidth to the abstract distance for the
drivers.

Best Regards,
Huang, Ying
  

Patch

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 965009aa01d7..2e39d9a6c8ce 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -7,17 +7,16 @@ 
 #include <linux/kref.h>
 #include <linux/mmzone.h>
 /*
- * Each tier cover a abstrace distance chunk size of 128
+ * Each tier covers an abstract distance chunk size of 10
  */
-#define MEMTIER_CHUNK_BITS	7
-#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
+#define MEMTIER_CHUNK_SIZE	10
 /*
  * Smaller abstract distance values imply faster (higher) memory tiers. Offset
  * the DRAM adistance so that we can accommodate devices with a slightly lower
  * adistance value (slightly faster) than default DRAM adistance to be part of
  * the same memory tier.
  */
-#define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))
+#define MEMTIER_ADISTANCE_DRAM	((100 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE / 2))
 #define MEMTIER_HOTPLUG_PRIO	100
 
 struct memory_tier;
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index fa8c9d07f9ce..e03011428fa5 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -165,11 +165,10 @@  static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
 	bool found_slot = false;
 	struct memory_tier *memtier, *new_memtier;
 	int adistance = memtype->adistance;
-	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
 
 	lockdep_assert_held_once(&memory_tier_lock);
 
-	adistance = round_down(adistance, memtier_adistance_chunk_size);
+	adistance = rounddown(adistance, MEMTIER_CHUNK_SIZE);
 	/*
 	 * If the memtype is already part of a memory tier,
 	 * just return that.
@@ -204,7 +203,7 @@  static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
 	else
 		list_add_tail(&new_memtier->list, &memory_tiers);
 
-	new_memtier->dev.id = adistance >> MEMTIER_CHUNK_BITS;
+	new_memtier->dev.id = adistance / MEMTIER_CHUNK_SIZE;
 	new_memtier->dev.bus = &memory_tier_subsys;
 	new_memtier->dev.release = memory_tier_device_release;
 	new_memtier->dev.groups = memtier_dev_groups;
@@ -641,7 +640,7 @@  static int __init memory_tier_init(void)
 #endif
 	mutex_lock(&memory_tier_lock);
 	/*
-	 * For now we can have 4 faster memory tiers with smaller adistance
+	 * For now we can have 100 faster memory tiers with smaller adistance
 	 * than default DRAM tier.
 	 */
 	default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);