[RESEND,1/4] memory tiering: add abstract distance calculation algorithms management

Message ID 20230721012932.190742-2-ying.huang@intel.com
State New
Headers
Series memory tiering: calculate abstract distance based on ACPI HMAT |

Commit Message

Huang, Ying July 21, 2023, 1:29 a.m. UTC
  The abstract distance may be calculated by various drivers, such as
ACPI HMAT, CXL CDAT, etc.  While it may be used by various code which
hot-add memory node, such as dax/kmem etc.  To decouple the algorithm
users and the providers, the abstract distance calculation algorithms
management mechanism is implemented in this patch.  It provides
interface for the providers to register the implementation, and
interface for the users.

Multiple algorithm implementations can cooperate via calculating
abstract distance for different memory nodes.  The preference of
algorithm implementations can be specified via
priority (notifier_block.priority).

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
---
 include/linux/memory-tiers.h | 19 ++++++++++++
 mm/memory-tiers.c            | 59 ++++++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+)
  

Comments

Alistair Popple July 25, 2023, 2:13 a.m. UTC | #1
Huang Ying <ying.huang@intel.com> writes:

> The abstract distance may be calculated by various drivers, such as
> ACPI HMAT, CXL CDAT, etc.  While it may be used by various code which
> hot-add memory node, such as dax/kmem etc.  To decouple the algorithm
> users and the providers, the abstract distance calculation algorithms
> management mechanism is implemented in this patch.  It provides
> interface for the providers to register the implementation, and
> interface for the users.

I wonder if we need this level of decoupling though? It seems to me like
it would be simpler and better for drivers to calculate the abstract
distance directly themselves by calling the desired algorithm (eg. ACPI
HMAT) and pass this when creating the nodes rather than having a
notifier chain.

At the moment it seems we've only identified two possible algorithms
(ACPI HMAT and CXL CDAT) and I don't think it would make sense for one
of those to fallback to the other based on priority, so why not just
have drivers call the correct algorithm directly?

> Multiple algorithm implementations can cooperate via calculating
> abstract distance for different memory nodes.  The preference of
> algorithm implementations can be specified via
> priority (notifier_block.priority).

How/what decides the priority though? That seems like something better
decided by a device driver than the algorithm driver IMHO.

> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
> ---
>  include/linux/memory-tiers.h | 19 ++++++++++++
>  mm/memory-tiers.c            | 59 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 78 insertions(+)
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index fc9647b1b4f9..c6429e624244 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -6,6 +6,7 @@
>  #include <linux/nodemask.h>
>  #include <linux/kref.h>
>  #include <linux/mmzone.h>
> +#include <linux/notifier.h>
>  /*
>   * Each tier cover a abstrace distance chunk size of 128
>   */
> @@ -36,6 +37,9 @@ struct memory_dev_type *alloc_memory_type(int adistance);
>  void destroy_memory_type(struct memory_dev_type *memtype);
>  void init_node_memory_type(int node, struct memory_dev_type *default_type);
>  void clear_node_memory_type(int node, struct memory_dev_type *memtype);
> +int register_mt_adistance_algorithm(struct notifier_block *nb);
> +int unregister_mt_adistance_algorithm(struct notifier_block *nb);
> +int mt_calc_adistance(int node, int *adist);
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> @@ -97,5 +101,20 @@ static inline bool node_is_toptier(int node)
>  {
>  	return true;
>  }
> +
> +static inline int register_mt_adistance_algorithm(struct notifier_block *nb)
> +{
> +	return 0;
> +}
> +
> +static inline int unregister_mt_adistance_algorithm(struct notifier_block *nb)
> +{
> +	return 0;
> +}
> +
> +static inline int mt_calc_adistance(int node, int *adist)
> +{
> +	return NOTIFY_DONE;
> +}
>  #endif	/* CONFIG_NUMA */
>  #endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index a516e303e304..1e55fbe2ad51 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -5,6 +5,7 @@
>  #include <linux/kobject.h>
>  #include <linux/memory.h>
>  #include <linux/memory-tiers.h>
> +#include <linux/notifier.h>
>  
>  #include "internal.h"
>  
> @@ -105,6 +106,8 @@ static int top_tier_adistance;
>  static struct demotion_nodes *node_demotion __read_mostly;
>  #endif /* CONFIG_MIGRATION */
>  
> +static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
> +
>  static inline struct memory_tier *to_memory_tier(struct device *device)
>  {
>  	return container_of(device, struct memory_tier, dev);
> @@ -592,6 +595,62 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype)
>  }
>  EXPORT_SYMBOL_GPL(clear_node_memory_type);
>  
> +/**
> + * register_mt_adistance_algorithm() - Register memory tiering abstract distance algorithm
> + * @nb: The notifier block which describe the algorithm
> + *
> + * Return: 0 on success, errno on error.
> + *
> + * Every memory tiering abstract distance algorithm provider needs to
> + * register the algorithm with register_mt_adistance_algorithm().  To
> + * calculate the abstract distance for a specified memory node, the
> + * notifier function will be called unless some high priority
> + * algorithm has provided result.  The prototype of the notifier
> + * function is as follows,
> + *
> + *   int (*algorithm_notifier)(struct notifier_block *nb,
> + *                             unsigned long nid, void *data);
> + *
> + * Where "nid" specifies the memory node, "data" is the pointer to the
> + * returned abstract distance (that is, "int *adist").  If the
> + * algorithm provides the result, NOTIFY_STOP should be returned.
> + * Otherwise, return_value & %NOTIFY_STOP_MASK == 0 to allow the next
> + * algorithm in the chain to provide the result.
> + */
> +int register_mt_adistance_algorithm(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_register(&mt_adistance_algorithms, nb);
> +}
> +EXPORT_SYMBOL_GPL(register_mt_adistance_algorithm);
> +
> +/**
> + * unregister_mt_adistance_algorithm() - Unregister memory tiering abstract distance algorithm
> + * @nb: the notifier block which describe the algorithm
> + *
> + * Return: 0 on success, errno on error.
> + */
> +int unregister_mt_adistance_algorithm(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_unregister(&mt_adistance_algorithms, nb);
> +}
> +EXPORT_SYMBOL_GPL(unregister_mt_adistance_algorithm);
> +
> +/**
> + * mt_calc_adistance() - Calculate abstract distance with registered algorithms
> + * @node: the node to calculate abstract distance for
> + * @adist: the returned abstract distance
> + *
> + * Return: if return_value & %NOTIFY_STOP_MASK != 0, then some
> + * abstract distance algorithm provides the result, and return it via
> + * @adist.  Otherwise, no algorithm can provide the result and @adist
> + * will be kept as it is.
> + */
> +int mt_calc_adistance(int node, int *adist)
> +{
> +	return blocking_notifier_call_chain(&mt_adistance_algorithms, node, adist);
> +}
> +EXPORT_SYMBOL_GPL(mt_calc_adistance);
> +
>  static int __meminit memtier_hotplug_callback(struct notifier_block *self,
>  					      unsigned long action, void *_arg)
>  {
  
Huang, Ying July 25, 2023, 3:14 a.m. UTC | #2
Hi, Alistair,

Thanks a lot for comments!

Alistair Popple <apopple@nvidia.com> writes:

> Huang Ying <ying.huang@intel.com> writes:
>
>> The abstract distance may be calculated by various drivers, such as
>> ACPI HMAT, CXL CDAT, etc.  While it may be used by various code which
>> hot-add memory node, such as dax/kmem etc.  To decouple the algorithm
>> users and the providers, the abstract distance calculation algorithms
>> management mechanism is implemented in this patch.  It provides
>> interface for the providers to register the implementation, and
>> interface for the users.
>
> I wonder if we need this level of decoupling though? It seems to me like
> it would be simpler and better for drivers to calculate the abstract
> distance directly themselves by calling the desired algorithm (eg. ACPI
> HMAT) and pass this when creating the nodes rather than having a
> notifier chain.

Per my understanding, ACPI HMAT and memory device drivers (such as
dax/kmem) may belong to different subsystems (ACPI vs. dax).  It's not
good to call functions across subsystems directly.  So, I think it's
better to use a general subsystem: memory-tier.c to decouple them.  If
it turns out that a notifier chain is unnecessary, we can use some
function pointers instead.

> At the moment it seems we've only identified two possible algorithms
> (ACPI HMAT and CXL CDAT) and I don't think it would make sense for one
> of those to fallback to the other based on priority, so why not just
> have drivers call the correct algorithm directly?

For example, we have a system with PMEM (persistent memory, Optane
DCPMM, or AEP, or something else) in DIMM slots and CXL.mem connected
via CXL link to a remote memory pool.  We will need ACPI HMAT for PMEM
and CXL CDAT for CXL.mem.  One way is to make dax/kmem identify the
types of the device and call corresponding algorithms.  The other way
(suggested by this series) is to make dax/kmem call a notifier chain,
then CXL CDAT or ACPI HMAT can identify the type of device and calculate
the distance if the type is correct for them.  I don't think that it's
good to make dax/kem to know every possible types of memory devices.

>> Multiple algorithm implementations can cooperate via calculating
>> abstract distance for different memory nodes.  The preference of
>> algorithm implementations can be specified via
>> priority (notifier_block.priority).
>
> How/what decides the priority though? That seems like something better
> decided by a device driver than the algorithm driver IMHO.

Do we need the memory device driver specific priority?  Or we just share
a common priority?  For example, the priority of CXL CDAT is always
higher than that of ACPI HMAT?  Or architecture specific?

And, I don't think that we are forced to use the general notifier chain
interface in all memory device drivers.  If the memory device driver has
better understanding of the memory device, it can use other way to
determine abstract distance.  For example, a CXL memory device driver
can identify abstract distance by itself.  While other memory device drivers
can use the general notifier chain interface at the same time.

--
Best Regards,
Huang, Ying
  
Alistair Popple July 25, 2023, 8:26 a.m. UTC | #3
"Huang, Ying" <ying.huang@intel.com> writes:

> Hi, Alistair,
>
> Thanks a lot for comments!
>
> Alistair Popple <apopple@nvidia.com> writes:
>
>> Huang Ying <ying.huang@intel.com> writes:
>>
>>> The abstract distance may be calculated by various drivers, such as
>>> ACPI HMAT, CXL CDAT, etc.  While it may be used by various code which
>>> hot-add memory node, such as dax/kmem etc.  To decouple the algorithm
>>> users and the providers, the abstract distance calculation algorithms
>>> management mechanism is implemented in this patch.  It provides
>>> interface for the providers to register the implementation, and
>>> interface for the users.
>>
>> I wonder if we need this level of decoupling though? It seems to me like
>> it would be simpler and better for drivers to calculate the abstract
>> distance directly themselves by calling the desired algorithm (eg. ACPI
>> HMAT) and pass this when creating the nodes rather than having a
>> notifier chain.
>
> Per my understanding, ACPI HMAT and memory device drivers (such as
> dax/kmem) may belong to different subsystems (ACPI vs. dax).  It's not
> good to call functions across subsystems directly.  So, I think it's
> better to use a general subsystem: memory-tier.c to decouple them.  If
> it turns out that a notifier chain is unnecessary, we can use some
> function pointers instead.
>
>> At the moment it seems we've only identified two possible algorithms
>> (ACPI HMAT and CXL CDAT) and I don't think it would make sense for one
>> of those to fallback to the other based on priority, so why not just
>> have drivers call the correct algorithm directly?
>
> For example, we have a system with PMEM (persistent memory, Optane
> DCPMM, or AEP, or something else) in DIMM slots and CXL.mem connected
> via CXL link to a remote memory pool.  We will need ACPI HMAT for PMEM
> and CXL CDAT for CXL.mem.  One way is to make dax/kmem identify the
> types of the device and call corresponding algorithms.

Yes, that is what I was thinking.

> The other way (suggested by this series) is to make dax/kmem call a
> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of
> device and calculate the distance if the type is correct for them.  I
> don't think that it's good to make dax/kem to know every possible
> types of memory devices.

Do we expect there to be lots of different types of memory devices
sharing a common dax/kmem driver though? Must admit I'm coming from a
GPU background where we'd expect each type of device to have it's own
driver anyway so wasn't expecting different types of memory devices to
be handled by the same driver.

>>> Multiple algorithm implementations can cooperate via calculating
>>> abstract distance for different memory nodes.  The preference of
>>> algorithm implementations can be specified via
>>> priority (notifier_block.priority).
>>
>> How/what decides the priority though? That seems like something better
>> decided by a device driver than the algorithm driver IMHO.
>
> Do we need the memory device driver specific priority?  Or we just share
> a common priority?  For example, the priority of CXL CDAT is always
> higher than that of ACPI HMAT?  Or architecture specific?

Ok, thanks. Having read the above I think the priority is
unimportant. Algorithms can either decide to return a distance and
NOTIFY_STOP_MASK if they can calculate a distance or NOTIFY_DONE if they
can't for a specific device.

> And, I don't think that we are forced to use the general notifier
> chain interface in all memory device drivers.  If the memory device
> driver has better understanding of the memory device, it can use other
> way to determine abstract distance.  For example, a CXL memory device
> driver can identify abstract distance by itself.  While other memory
> device drivers can use the general notifier chain interface at the
> same time.

Whilst I think personally I would find that flexibility useful I am
concerned it means every driver will just end up divining it's own
distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
would kind of defeat the purpose of it all then.
  
Huang, Ying July 26, 2023, 7:33 a.m. UTC | #4
Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Hi, Alistair,
>>
>> Thanks a lot for comments!
>>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> Huang Ying <ying.huang@intel.com> writes:
>>>
>>>> The abstract distance may be calculated by various drivers, such as
>>>> ACPI HMAT, CXL CDAT, etc.  While it may be used by various code which
>>>> hot-add memory node, such as dax/kmem etc.  To decouple the algorithm
>>>> users and the providers, the abstract distance calculation algorithms
>>>> management mechanism is implemented in this patch.  It provides
>>>> interface for the providers to register the implementation, and
>>>> interface for the users.
>>>
>>> I wonder if we need this level of decoupling though? It seems to me like
>>> it would be simpler and better for drivers to calculate the abstract
>>> distance directly themselves by calling the desired algorithm (eg. ACPI
>>> HMAT) and pass this when creating the nodes rather than having a
>>> notifier chain.
>>
>> Per my understanding, ACPI HMAT and memory device drivers (such as
>> dax/kmem) may belong to different subsystems (ACPI vs. dax).  It's not
>> good to call functions across subsystems directly.  So, I think it's
>> better to use a general subsystem: memory-tier.c to decouple them.  If
>> it turns out that a notifier chain is unnecessary, we can use some
>> function pointers instead.
>>
>>> At the moment it seems we've only identified two possible algorithms
>>> (ACPI HMAT and CXL CDAT) and I don't think it would make sense for one
>>> of those to fallback to the other based on priority, so why not just
>>> have drivers call the correct algorithm directly?
>>
>> For example, we have a system with PMEM (persistent memory, Optane
>> DCPMM, or AEP, or something else) in DIMM slots and CXL.mem connected
>> via CXL link to a remote memory pool.  We will need ACPI HMAT for PMEM
>> and CXL CDAT for CXL.mem.  One way is to make dax/kmem identify the
>> types of the device and call corresponding algorithms.
>
> Yes, that is what I was thinking.
>
>> The other way (suggested by this series) is to make dax/kmem call a
>> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of
>> device and calculate the distance if the type is correct for them.  I
>> don't think that it's good to make dax/kem to know every possible
>> types of memory devices.
>
> Do we expect there to be lots of different types of memory devices
> sharing a common dax/kmem driver though? Must admit I'm coming from a
> GPU background where we'd expect each type of device to have it's own
> driver anyway so wasn't expecting different types of memory devices to
> be handled by the same driver.

Now, dax/kmem.c is used for

- PMEM (Optane DCPMM, or AEP)
- CXL.mem
- HBM (attached to CPU)

I understand that for a CXL GPU driver it's OK to call some CXL CDAT
helper to identify the abstract distance of memory attached to GPU.
Because there's no cross-subsystem function calls.  But it looks not
very clean to call from dax/kmem.c to CXL CDAT because it's a
cross-subsystem function call.

>>>> Multiple algorithm implementations can cooperate via calculating
>>>> abstract distance for different memory nodes.  The preference of
>>>> algorithm implementations can be specified via
>>>> priority (notifier_block.priority).
>>>
>>> How/what decides the priority though? That seems like something better
>>> decided by a device driver than the algorithm driver IMHO.
>>
>> Do we need the memory device driver specific priority?  Or we just share
>> a common priority?  For example, the priority of CXL CDAT is always
>> higher than that of ACPI HMAT?  Or architecture specific?
>
> Ok, thanks. Having read the above I think the priority is
> unimportant. Algorithms can either decide to return a distance and
> NOTIFY_STOP_MASK if they can calculate a distance or NOTIFY_DONE if they
> can't for a specific device.

Yes.  In most cases, there are no overlaps among algorithms.

>> And, I don't think that we are forced to use the general notifier
>> chain interface in all memory device drivers.  If the memory device
>> driver has better understanding of the memory device, it can use other
>> way to determine abstract distance.  For example, a CXL memory device
>> driver can identify abstract distance by itself.  While other memory
>> device drivers can use the general notifier chain interface at the
>> same time.
>
> Whilst I think personally I would find that flexibility useful I am
> concerned it means every driver will just end up divining it's own
> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
> would kind of defeat the purpose of it all then.

But we have no way to enforce that too.

--
Best Regards,
Huang, Ying
  
Alistair Popple July 27, 2023, 3:42 a.m. UTC | #5
"Huang, Ying" <ying.huang@intel.com> writes:

>>> The other way (suggested by this series) is to make dax/kmem call a
>>> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of
>>> device and calculate the distance if the type is correct for them.  I
>>> don't think that it's good to make dax/kem to know every possible
>>> types of memory devices.
>>
>> Do we expect there to be lots of different types of memory devices
>> sharing a common dax/kmem driver though? Must admit I'm coming from a
>> GPU background where we'd expect each type of device to have it's own
>> driver anyway so wasn't expecting different types of memory devices to
>> be handled by the same driver.
>
> Now, dax/kmem.c is used for
>
> - PMEM (Optane DCPMM, or AEP)
> - CXL.mem
> - HBM (attached to CPU)

Thanks a lot for the background! I will admit to having a faily narrow
focus here.

>>> And, I don't think that we are forced to use the general notifier
>>> chain interface in all memory device drivers.  If the memory device
>>> driver has better understanding of the memory device, it can use other
>>> way to determine abstract distance.  For example, a CXL memory device
>>> driver can identify abstract distance by itself.  While other memory
>>> device drivers can use the general notifier chain interface at the
>>> same time.
>>
>> Whilst I think personally I would find that flexibility useful I am
>> concerned it means every driver will just end up divining it's own
>> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
>> would kind of defeat the purpose of it all then.
>
> But we have no way to enforce that too.

Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but
we can influence it. If drivers can easily ignore the notifier chain and
do their own thing that's what will happen.

>>> While other memory device drivers can use the general notifier chain
>>> interface at the same time.

How would that work in practice though? The abstract distance as far as
I can tell doesn't have any meaning other than establishing preferences
for memory demotion order. Therefore all calculations are relative to
the rest of the calculations on the system. So if a driver does it's own
thing how does it choose a sensible distance? IHMO the value here is in
coordinating all that through a standard interface, whether that is HMAT
or something else.

 - Alistair
  
Huang, Ying July 27, 2023, 4:02 a.m. UTC | #6
Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>>>> The other way (suggested by this series) is to make dax/kmem call a
>>>> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of
>>>> device and calculate the distance if the type is correct for them.  I
>>>> don't think that it's good to make dax/kem to know every possible
>>>> types of memory devices.
>>>
>>> Do we expect there to be lots of different types of memory devices
>>> sharing a common dax/kmem driver though? Must admit I'm coming from a
>>> GPU background where we'd expect each type of device to have it's own
>>> driver anyway so wasn't expecting different types of memory devices to
>>> be handled by the same driver.
>>
>> Now, dax/kmem.c is used for
>>
>> - PMEM (Optane DCPMM, or AEP)
>> - CXL.mem
>> - HBM (attached to CPU)
>
> Thanks a lot for the background! I will admit to having a faily narrow
> focus here.
>
>>>> And, I don't think that we are forced to use the general notifier
>>>> chain interface in all memory device drivers.  If the memory device
>>>> driver has better understanding of the memory device, it can use other
>>>> way to determine abstract distance.  For example, a CXL memory device
>>>> driver can identify abstract distance by itself.  While other memory
>>>> device drivers can use the general notifier chain interface at the
>>>> same time.
>>>
>>> Whilst I think personally I would find that flexibility useful I am
>>> concerned it means every driver will just end up divining it's own
>>> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
>>> would kind of defeat the purpose of it all then.
>>
>> But we have no way to enforce that too.
>
> Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but
> we can influence it. If drivers can easily ignore the notifier chain and
> do their own thing that's what will happen.

IMHO, both enforce HMAT/CDAT/etc is correct and enforce drivers to use
general interface we provided.  Anyway, we should try to make HMAT/CDAT
works well, so drivers want to use them :-)

>>>> While other memory device drivers can use the general notifier chain
>>>> interface at the same time.
>
> How would that work in practice though? The abstract distance as far as
> I can tell doesn't have any meaning other than establishing preferences
> for memory demotion order. Therefore all calculations are relative to
> the rest of the calculations on the system. So if a driver does it's own
> thing how does it choose a sensible distance? IHMO the value here is in
> coordinating all that through a standard interface, whether that is HMAT
> or something else.

Only if different algorithms follow the same basic principle.  For
example, the abstract distance of default DRAM nodes are fixed
(MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
in linear direct proportion to the memory latency and inversely
proportional to the memory bandwidth.  Use the memory latency and
bandwidth of default DRAM nodes as base.

HMAT and CDAT report the raw memory latency and bandwidth.  If there are
some other methods to report the raw memory latency and bandwidth, we
can use them too.

--
Best Regards,
Huang, Ying
  
Alistair Popple July 27, 2023, 4:07 a.m. UTC | #7
"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>>>> And, I don't think that we are forced to use the general notifier
>>>>> chain interface in all memory device drivers.  If the memory device
>>>>> driver has better understanding of the memory device, it can use other
>>>>> way to determine abstract distance.  For example, a CXL memory device
>>>>> driver can identify abstract distance by itself.  While other memory
>>>>> device drivers can use the general notifier chain interface at the
>>>>> same time.
>>>>
>>>> Whilst I think personally I would find that flexibility useful I am
>>>> concerned it means every driver will just end up divining it's own
>>>> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
>>>> would kind of defeat the purpose of it all then.
>>>
>>> But we have no way to enforce that too.
>>
>> Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but
>> we can influence it. If drivers can easily ignore the notifier chain and
>> do their own thing that's what will happen.
>
> IMHO, both enforce HMAT/CDAT/etc is correct and enforce drivers to use
> general interface we provided.  Anyway, we should try to make HMAT/CDAT
> works well, so drivers want to use them :-)

Exactly :-)

>>>>> While other memory device drivers can use the general notifier chain
>>>>> interface at the same time.
>>
>> How would that work in practice though? The abstract distance as far as
>> I can tell doesn't have any meaning other than establishing preferences
>> for memory demotion order. Therefore all calculations are relative to
>> the rest of the calculations on the system. So if a driver does it's own
>> thing how does it choose a sensible distance? IHMO the value here is in
>> coordinating all that through a standard interface, whether that is HMAT
>> or something else.
>
> Only if different algorithms follow the same basic principle.  For
> example, the abstract distance of default DRAM nodes are fixed
> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
> in linear direct proportion to the memory latency and inversely
> proportional to the memory bandwidth.  Use the memory latency and
> bandwidth of default DRAM nodes as base.
>
> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
> some other methods to report the raw memory latency and bandwidth, we
> can use them too.

Argh! So we could address my concerns by having drivers feed
latency/bandwidth numbers into a standard calculation algorithm right?
Ie. Rather than having drivers calculate abstract distance themselves we
have the notifier chains return the raw performance data from which the
abstract distance is derived.
  
Huang, Ying July 27, 2023, 5:41 a.m. UTC | #8
Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>>>> And, I don't think that we are forced to use the general notifier
>>>>>> chain interface in all memory device drivers.  If the memory device
>>>>>> driver has better understanding of the memory device, it can use other
>>>>>> way to determine abstract distance.  For example, a CXL memory device
>>>>>> driver can identify abstract distance by itself.  While other memory
>>>>>> device drivers can use the general notifier chain interface at the
>>>>>> same time.
>>>>>
>>>>> Whilst I think personally I would find that flexibility useful I am
>>>>> concerned it means every driver will just end up divining it's own
>>>>> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
>>>>> would kind of defeat the purpose of it all then.
>>>>
>>>> But we have no way to enforce that too.
>>>
>>> Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but
>>> we can influence it. If drivers can easily ignore the notifier chain and
>>> do their own thing that's what will happen.
>>
>> IMHO, both enforce HMAT/CDAT/etc is correct and enforce drivers to use
>> general interface we provided.  Anyway, we should try to make HMAT/CDAT
>> works well, so drivers want to use them :-)
>
> Exactly :-)
>
>>>>>> While other memory device drivers can use the general notifier chain
>>>>>> interface at the same time.
>>>
>>> How would that work in practice though? The abstract distance as far as
>>> I can tell doesn't have any meaning other than establishing preferences
>>> for memory demotion order. Therefore all calculations are relative to
>>> the rest of the calculations on the system. So if a driver does it's own
>>> thing how does it choose a sensible distance? IHMO the value here is in
>>> coordinating all that through a standard interface, whether that is HMAT
>>> or something else.
>>
>> Only if different algorithms follow the same basic principle.  For
>> example, the abstract distance of default DRAM nodes are fixed
>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>> in linear direct proportion to the memory latency and inversely
>> proportional to the memory bandwidth.  Use the memory latency and
>> bandwidth of default DRAM nodes as base.
>>
>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>> some other methods to report the raw memory latency and bandwidth, we
>> can use them too.
>
> Argh! So we could address my concerns by having drivers feed
> latency/bandwidth numbers into a standard calculation algorithm right?
> Ie. Rather than having drivers calculate abstract distance themselves we
> have the notifier chains return the raw performance data from which the
> abstract distance is derived.

Now, memory device drivers only need a general interface to get the
abstract distance from the NUMA node ID.  In the future, if they need
more interfaces, we can add them.  For example, the interface you
suggested above.

--
Best Regards,
Huang, Ying
  
Alistair Popple July 28, 2023, 1:20 a.m. UTC | #9
"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> Alistair Popple <apopple@nvidia.com> writes:
>>>
>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>> interface at the same time.
>>>>
>>>> How would that work in practice though? The abstract distance as far as
>>>> I can tell doesn't have any meaning other than establishing preferences
>>>> for memory demotion order. Therefore all calculations are relative to
>>>> the rest of the calculations on the system. So if a driver does it's own
>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>> coordinating all that through a standard interface, whether that is HMAT
>>>> or something else.
>>>
>>> Only if different algorithms follow the same basic principle.  For
>>> example, the abstract distance of default DRAM nodes are fixed
>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>> in linear direct proportion to the memory latency and inversely
>>> proportional to the memory bandwidth.  Use the memory latency and
>>> bandwidth of default DRAM nodes as base.
>>>
>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>> some other methods to report the raw memory latency and bandwidth, we
>>> can use them too.
>>
>> Argh! So we could address my concerns by having drivers feed
>> latency/bandwidth numbers into a standard calculation algorithm right?
>> Ie. Rather than having drivers calculate abstract distance themselves we
>> have the notifier chains return the raw performance data from which the
>> abstract distance is derived.
>
> Now, memory device drivers only need a general interface to get the
> abstract distance from the NUMA node ID.  In the future, if they need
> more interfaces, we can add them.  For example, the interface you
> suggested above.

Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
distance, it's a meaningless number. The only reason they care about it
is so they can pass it to alloc_memory_type():

struct memory_dev_type *alloc_memory_type(int adistance)

Instead alloc_memory_type() should be taking bandwidth/latency numbers
and the calculation of abstract distance should be done there. That
resovles the issues about how drivers are supposed to devine adistance
and also means that when CDAT is added we don't have to duplicate the
calculation code.
  
Huang, Ying Aug. 11, 2023, 3:51 a.m. UTC | #10
Hi, Alistair,

Sorry for late response.  Just come back from vacation.

Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>
>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>> interface at the same time.
>>>>>
>>>>> How would that work in practice though? The abstract distance as far as
>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>> or something else.
>>>>
>>>> Only if different algorithms follow the same basic principle.  For
>>>> example, the abstract distance of default DRAM nodes are fixed
>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>>> in linear direct proportion to the memory latency and inversely
>>>> proportional to the memory bandwidth.  Use the memory latency and
>>>> bandwidth of default DRAM nodes as base.
>>>>
>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>>> some other methods to report the raw memory latency and bandwidth, we
>>>> can use them too.
>>>
>>> Argh! So we could address my concerns by having drivers feed
>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>> have the notifier chains return the raw performance data from which the
>>> abstract distance is derived.
>>
>> Now, memory device drivers only need a general interface to get the
>> abstract distance from the NUMA node ID.  In the future, if they need
>> more interfaces, we can add them.  For example, the interface you
>> suggested above.
>
> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
> distance, it's a meaningless number. The only reason they care about it
> is so they can pass it to alloc_memory_type():
>
> struct memory_dev_type *alloc_memory_type(int adistance)
>
> Instead alloc_memory_type() should be taking bandwidth/latency numbers
> and the calculation of abstract distance should be done there. That
> resovles the issues about how drivers are supposed to devine adistance
> and also means that when CDAT is added we don't have to duplicate the
> calculation code.

In the current design, the abstract distance is the key concept of
memory types and memory tiers.  And it is used as interface to allocate
memory types.  This provides more flexibility than some other interfaces
(e.g. read/write bandwidth/latency).  For example, in current
dax/kmem.c, if HMAT isn't available in the system, the default abstract
distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used.  This is still useful
to support some systems now.  On a system without HMAT/CDAT, it's
possible to calculate abstract distance from ACPI SLIT, although this is
quite limited.  I'm not sure whether all systems will provide read/write
bandwith/latency data for all memory devices.

HMAT and CDAT or some other mechanisms may provide the read/write
bandwidth/latency data to be used to calculate abstract distance.  For
them, we can provide a shared implementation in mm/memory-tiers.c to map
from read/write bandwith/latency to the abstract distance.  Can this
solve your concerns about the consistency among algorithms?  If so, we
can do that when we add the second algorithm that needs that.

--
Best Regards,
Huang, Ying
  
Alistair Popple Aug. 22, 2023, 7:11 a.m. UTC | #11
"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> Alistair Popple <apopple@nvidia.com> writes:
>>>
>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>
>>>>> Hi, Alistair,
>>>>>
>>>>> Sorry for late response.  Just come back from vacation.
>>>>
>>>> Ditto for this response :-)
>>>>
>>>> I see Andrew has taken this into mm-unstable though, so my bad for not
>>>> getting around to following all this up sooner.
>>>>
>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>
>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>
>>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>>
>>>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>>>
>>>>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>>>>
>>>>>>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>>>>>>> interface at the same time.
>>>>>>>>>>
>>>>>>>>>> How would that work in practice though? The abstract distance as far as
>>>>>>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>>>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>>>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>>>>>>> or something else.
>>>>>>>>>
>>>>>>>>> Only if different algorithms follow the same basic principle.  For
>>>>>>>>> example, the abstract distance of default DRAM nodes are fixed
>>>>>>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>>>>>>>> in linear direct proportion to the memory latency and inversely
>>>>>>>>> proportional to the memory bandwidth.  Use the memory latency and
>>>>>>>>> bandwidth of default DRAM nodes as base.
>>>>>>>>>
>>>>>>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>>>>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>>>>>> can use them too.
>>>>>>>>
>>>>>>>> Argh! So we could address my concerns by having drivers feed
>>>>>>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>>>>>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>>>>>>> have the notifier chains return the raw performance data from which the
>>>>>>>> abstract distance is derived.
>>>>>>>
>>>>>>> Now, memory device drivers only need a general interface to get the
>>>>>>> abstract distance from the NUMA node ID.  In the future, if they need
>>>>>>> more interfaces, we can add them.  For example, the interface you
>>>>>>> suggested above.
>>>>>>
>>>>>> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
>>>>>> distance, it's a meaningless number. The only reason they care about it
>>>>>> is so they can pass it to alloc_memory_type():
>>>>>>
>>>>>> struct memory_dev_type *alloc_memory_type(int adistance)
>>>>>>
>>>>>> Instead alloc_memory_type() should be taking bandwidth/latency numbers
>>>>>> and the calculation of abstract distance should be done there. That
>>>>>> resovles the issues about how drivers are supposed to devine adistance
>>>>>> and also means that when CDAT is added we don't have to duplicate the
>>>>>> calculation code.
>>>>>
>>>>> In the current design, the abstract distance is the key concept of
>>>>> memory types and memory tiers.  And it is used as interface to allocate
>>>>> memory types.  This provides more flexibility than some other interfaces
>>>>> (e.g. read/write bandwidth/latency).  For example, in current
>>>>> dax/kmem.c, if HMAT isn't available in the system, the default abstract
>>>>> distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used.  This is still useful
>>>>> to support some systems now.  On a system without HMAT/CDAT, it's
>>>>> possible to calculate abstract distance from ACPI SLIT, although this is
>>>>> quite limited.  I'm not sure whether all systems will provide read/write
>>>>> bandwith/latency data for all memory devices.
>>>>>
>>>>> HMAT and CDAT or some other mechanisms may provide the read/write
>>>>> bandwidth/latency data to be used to calculate abstract distance.  For
>>>>> them, we can provide a shared implementation in mm/memory-tiers.c to map
>>>>> from read/write bandwith/latency to the abstract distance.  Can this
>>>>> solve your concerns about the consistency among algorithms?  If so, we
>>>>> can do that when we add the second algorithm that needs that.
>>>>
>>>> I guess it would address my concerns if we did that now. I don't see why
>>>> we need to wait for a second implementation for that though - the whole
>>>> series seems to be built around adding a framework for supporting
>>>> multiple algorithms even though only one exists. So I think we should
>>>> support that fully, or simplfy the whole thing and just assume the only
>>>> thing that exists is HMAT and get rid of the general interface until a
>>>> second algorithm comes along.
>>>
>>> We will need a general interface even for one algorithm implementation.
>>> Because it's not good to make a dax subsystem driver (dax/kmem) to
>>> depend on a ACPI subsystem driver (acpi/hmat).  We need some general
>>> interface at subsystem level (memory tier here) between them.
>>
>> I don't understand this argument. For a single algorithm it would be
>> simpler to just define acpi_hmat_calculate_adistance() and a static
>> inline version of it that returns -ENOENT when !CONFIG_ACPI than adding
>> a layer of indirection through notifier blocks. That breaks any
>> dependency on ACPI and there's plenty of precedent for this approach in
>> the kernel already.
>
> ACPI is a subsystem, so it's OK for dax/kmem to depends on CONFIG_ACPI.
> But HMAT is a driver of ACPI subsystem (controlled via
> CONFIG_ACPI_HMAT).  It's not good for a driver of DAX subsystem
> (dax/kmem) to depend on a *driver* of ACPI subsystem.
>
> Yes.  Technically, there's no hard wall to prevent this.  But I think
> that a good design should make drivers depends on subsystems or drivers
> of the same subsystem, NOT drivers of other subsystems.

Thanks, I wasn't really thinking of HMAT as an ACPI driver. I understand
where you're coming from but I really don't see the problem with using a
static inline. It doesn't create dependencies (you could still use
dax/kmem without ACPI) and results in smaller and easier to follow code.

IMHO it's far more obvious that a call to acpi_hmat_calcaulte_adist()
returns either a default if ACPI HMAT isn't configured or a calculated
value than it is to figure out what notifiers may or may not be
registered at runtime and what priority they may be called in from
mt_calc_adistance().

It appears you think that is a bad design, but I don't understand
why. What does this approach give us that a simpler approach wouldn't?
  

Patch

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index fc9647b1b4f9..c6429e624244 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -6,6 +6,7 @@ 
 #include <linux/nodemask.h>
 #include <linux/kref.h>
 #include <linux/mmzone.h>
+#include <linux/notifier.h>
 /*
  * Each tier cover a abstrace distance chunk size of 128
  */
@@ -36,6 +37,9 @@  struct memory_dev_type *alloc_memory_type(int adistance);
 void destroy_memory_type(struct memory_dev_type *memtype);
 void init_node_memory_type(int node, struct memory_dev_type *default_type);
 void clear_node_memory_type(int node, struct memory_dev_type *memtype);
+int register_mt_adistance_algorithm(struct notifier_block *nb);
+int unregister_mt_adistance_algorithm(struct notifier_block *nb);
+int mt_calc_adistance(int node, int *adist);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -97,5 +101,20 @@  static inline bool node_is_toptier(int node)
 {
 	return true;
 }
+
+static inline int register_mt_adistance_algorithm(struct notifier_block *nb)
+{
+	return 0;
+}
+
+static inline int unregister_mt_adistance_algorithm(struct notifier_block *nb)
+{
+	return 0;
+}
+
+static inline int mt_calc_adistance(int node, int *adist)
+{
+	return NOTIFY_DONE;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index a516e303e304..1e55fbe2ad51 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -5,6 +5,7 @@ 
 #include <linux/kobject.h>
 #include <linux/memory.h>
 #include <linux/memory-tiers.h>
+#include <linux/notifier.h>
 
 #include "internal.h"
 
@@ -105,6 +106,8 @@  static int top_tier_adistance;
 static struct demotion_nodes *node_demotion __read_mostly;
 #endif /* CONFIG_MIGRATION */
 
+static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
+
 static inline struct memory_tier *to_memory_tier(struct device *device)
 {
 	return container_of(device, struct memory_tier, dev);
@@ -592,6 +595,62 @@  void clear_node_memory_type(int node, struct memory_dev_type *memtype)
 }
 EXPORT_SYMBOL_GPL(clear_node_memory_type);
 
+/**
+ * register_mt_adistance_algorithm() - Register memory tiering abstract distance algorithm
+ * @nb: The notifier block which describe the algorithm
+ *
+ * Return: 0 on success, errno on error.
+ *
+ * Every memory tiering abstract distance algorithm provider needs to
+ * register the algorithm with register_mt_adistance_algorithm().  To
+ * calculate the abstract distance for a specified memory node, the
+ * notifier function will be called unless some high priority
+ * algorithm has provided result.  The prototype of the notifier
+ * function is as follows,
+ *
+ *   int (*algorithm_notifier)(struct notifier_block *nb,
+ *                             unsigned long nid, void *data);
+ *
+ * Where "nid" specifies the memory node, "data" is the pointer to the
+ * returned abstract distance (that is, "int *adist").  If the
+ * algorithm provides the result, NOTIFY_STOP should be returned.
+ * Otherwise, return_value & %NOTIFY_STOP_MASK == 0 to allow the next
+ * algorithm in the chain to provide the result.
+ */
+int register_mt_adistance_algorithm(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&mt_adistance_algorithms, nb);
+}
+EXPORT_SYMBOL_GPL(register_mt_adistance_algorithm);
+
+/**
+ * unregister_mt_adistance_algorithm() - Unregister memory tiering abstract distance algorithm
+ * @nb: the notifier block which describe the algorithm
+ *
+ * Return: 0 on success, errno on error.
+ */
+int unregister_mt_adistance_algorithm(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&mt_adistance_algorithms, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_mt_adistance_algorithm);
+
+/**
+ * mt_calc_adistance() - Calculate abstract distance with registered algorithms
+ * @node: the node to calculate abstract distance for
+ * @adist: the returned abstract distance
+ *
+ * Return: if return_value & %NOTIFY_STOP_MASK != 0, then some
+ * abstract distance algorithm provides the result, and return it via
+ * @adist.  Otherwise, no algorithm can provide the result and @adist
+ * will be kept as it is.
+ */
+int mt_calc_adistance(int node, int *adist)
+{
+	return blocking_notifier_call_chain(&mt_adistance_algorithms, node, adist);
+}
+EXPORT_SYMBOL_GPL(mt_calc_adistance);
+
 static int __meminit memtier_hotplug_callback(struct notifier_block *self,
 					      unsigned long action, void *_arg)
 {