[RESEND,1/4] memory tiering: add abstract distance calculation algorithms management
Message ID: 20230721012932.190742-2-ying.huang@intel.com
State: New
Series: memory tiering: calculate abstract distance based on ACPI HMAT
Commit Message
Huang, Ying
July 21, 2023, 1:29 a.m. UTC
The abstract distance may be calculated by various drivers, such as
ACPI HMAT, CXL CDAT, etc., while it may be used by various code which
hot-adds memory nodes, such as dax/kmem.  To decouple the algorithm
users from the providers, this patch implements a management mechanism
for abstract distance calculation algorithms.  It provides an
interface for the providers to register their implementations, and an
interface for the users to calculate the abstract distance.
Multiple algorithm implementations can cooperate, each calculating the
abstract distance for different memory nodes.  The preference among
algorithm implementations can be specified via a
priority (notifier_block.priority).
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
---
 include/linux/memory-tiers.h | 19 ++++++++++++
 mm/memory-tiers.c            | 59 ++++++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+)
Comments
Huang Ying <ying.huang@intel.com> writes:

> The abstract distance may be calculated by various drivers, such as ACPI HMAT, CXL CDAT, etc. While it may be used by various code which hot-add memory node, such as dax/kmem etc. To decouple the algorithm users and the providers, the abstract distance calculation algorithms management mechanism is implemented in this patch. It provides interface for the providers to register the implementation, and interface for the users.

I wonder if we need this level of decoupling though? It seems to me like it would be simpler and better for drivers to calculate the abstract distance directly themselves by calling the desired algorithm (eg. ACPI HMAT) and pass this when creating the nodes rather than having a notifier chain.

At the moment it seems we've only identified two possible algorithms (ACPI HMAT and CXL CDAT) and I don't think it would make sense for one of those to fallback to the other based on priority, so why not just have drivers call the correct algorithm directly?

> Multiple algorithm implementations can cooperate via calculating abstract distance for different memory nodes. The preference of algorithm implementations can be specified via priority (notifier_block.priority).

How/what decides the priority though? That seems like something better decided by a device driver than the algorithm driver IMHO.
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
> ---
>  include/linux/memory-tiers.h | 19 ++++++++++++
>  mm/memory-tiers.c            | 59 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 78 insertions(+)
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index fc9647b1b4f9..c6429e624244 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -6,6 +6,7 @@
>  #include <linux/nodemask.h>
>  #include <linux/kref.h>
>  #include <linux/mmzone.h>
> +#include <linux/notifier.h>
>  /*
>   * Each tier cover a abstrace distance chunk size of 128
>   */
> @@ -36,6 +37,9 @@ struct memory_dev_type *alloc_memory_type(int adistance);
>  void destroy_memory_type(struct memory_dev_type *memtype);
>  void init_node_memory_type(int node, struct memory_dev_type *default_type);
>  void clear_node_memory_type(int node, struct memory_dev_type *memtype);
> +int register_mt_adistance_algorithm(struct notifier_block *nb);
> +int unregister_mt_adistance_algorithm(struct notifier_block *nb);
> +int mt_calc_adistance(int node, int *adist);
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> @@ -97,5 +101,20 @@ static inline bool node_is_toptier(int node)
>  {
>          return true;
>  }
> +
> +static inline int register_mt_adistance_algorithm(struct notifier_block *nb)
> +{
> +        return 0;
> +}
> +
> +static inline int unregister_mt_adistance_algorithm(struct notifier_block *nb)
> +{
> +        return 0;
> +}
> +
> +static inline int mt_calc_adistance(int node, int *adist)
> +{
> +        return NOTIFY_DONE;
> +}
>  #endif /* CONFIG_NUMA */
>  #endif /* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index a516e303e304..1e55fbe2ad51 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -5,6 +5,7 @@
>  #include <linux/kobject.h>
>  #include <linux/memory.h>
>  #include <linux/memory-tiers.h>
> +#include <linux/notifier.h>
>
>  #include "internal.h"
>
> @@ -105,6 +106,8 @@ static int top_tier_adistance;
>  static struct demotion_nodes *node_demotion __read_mostly;
>  #endif /* CONFIG_MIGRATION */
>
> +static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
> +
>  static inline struct memory_tier *to_memory_tier(struct device *device)
>  {
>          return container_of(device, struct memory_tier, dev);
> @@ -592,6 +595,62 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype)
>  }
>  EXPORT_SYMBOL_GPL(clear_node_memory_type);
>
> +/**
> + * register_mt_adistance_algorithm() - Register memory tiering abstract distance algorithm
> + * @nb: The notifier block which describe the algorithm
> + *
> + * Return: 0 on success, errno on error.
> + *
> + * Every memory tiering abstract distance algorithm provider needs to
> + * register the algorithm with register_mt_adistance_algorithm().  To
> + * calculate the abstract distance for a specified memory node, the
> + * notifier function will be called unless some high priority
> + * algorithm has provided result.  The prototype of the notifier
> + * function is as follows,
> + *
> + *   int (*algorithm_notifier)(struct notifier_block *nb,
> + *                             unsigned long nid, void *data);
> + *
> + * Where "nid" specifies the memory node, "data" is the pointer to the
> + * returned abstract distance (that is, "int *adist").  If the
> + * algorithm provides the result, NOTIFY_STOP should be returned.
> + * Otherwise, return_value & %NOTIFY_STOP_MASK == 0 to allow the next
> + * algorithm in the chain to provide the result.
> + */
> +int register_mt_adistance_algorithm(struct notifier_block *nb)
> +{
> +        return blocking_notifier_chain_register(&mt_adistance_algorithms, nb);
> +}
> +EXPORT_SYMBOL_GPL(register_mt_adistance_algorithm);
> +
> +/**
> + * unregister_mt_adistance_algorithm() - Unregister memory tiering abstract distance algorithm
> + * @nb: the notifier block which describe the algorithm
> + *
> + * Return: 0 on success, errno on error.
> + */
> +int unregister_mt_adistance_algorithm(struct notifier_block *nb)
> +{
> +        return blocking_notifier_chain_unregister(&mt_adistance_algorithms, nb);
> +}
> +EXPORT_SYMBOL_GPL(unregister_mt_adistance_algorithm);
> +
> +/**
> + * mt_calc_adistance() - Calculate abstract distance with registered algorithms
> + * @node: the node to calculate abstract distance for
> + * @adist: the returned abstract distance
> + *
> + * Return: if return_value & %NOTIFY_STOP_MASK != 0, then some
> + * abstract distance algorithm provides the result, and return it via
> + * @adist.  Otherwise, no algorithm can provide the result and @adist
> + * will be kept as it is.
> + */
> +int mt_calc_adistance(int node, int *adist)
> +{
> +        return blocking_notifier_call_chain(&mt_adistance_algorithms, node, adist);
> +}
> +EXPORT_SYMBOL_GPL(mt_calc_adistance);
> +
>  static int __meminit memtier_hotplug_callback(struct notifier_block *self,
>                                               unsigned long action, void *_arg)
>  {
Hi, Alistair,

Thanks a lot for comments!

Alistair Popple <apopple@nvidia.com> writes:

> Huang Ying <ying.huang@intel.com> writes:
>
>> The abstract distance may be calculated by various drivers, such as ACPI HMAT, CXL CDAT, etc. While it may be used by various code which hot-add memory node, such as dax/kmem etc. To decouple the algorithm users and the providers, the abstract distance calculation algorithms management mechanism is implemented in this patch. It provides interface for the providers to register the implementation, and interface for the users.
>
> I wonder if we need this level of decoupling though? It seems to me like it would be simpler and better for drivers to calculate the abstract distance directly themselves by calling the desired algorithm (eg. ACPI HMAT) and pass this when creating the nodes rather than having a notifier chain.

Per my understanding, ACPI HMAT and memory device drivers (such as dax/kmem) may belong to different subsystems (ACPI vs. dax). It's not good to call functions across subsystems directly. So, I think it's better to use a general subsystem: memory-tier.c to decouple them. If it turns out that a notifier chain is unnecessary, we can use some function pointers instead.

> At the moment it seems we've only identified two possible algorithms (ACPI HMAT and CXL CDAT) and I don't think it would make sense for one of those to fallback to the other based on priority, so why not just have drivers call the correct algorithm directly?

For example, we have a system with PMEM (persistent memory: Optane DCPMM, or AEP, or something else) in DIMM slots and CXL.mem connected via a CXL link to a remote memory pool. We will need ACPI HMAT for PMEM and CXL CDAT for CXL.mem. One way is to make dax/kmem identify the types of the devices and call the corresponding algorithms. The other way (suggested by this series) is to make dax/kmem call a notifier chain, so that CXL CDAT or ACPI HMAT can identify the type of device and calculate the distance if the type is correct for them. I don't think it's good to make dax/kmem know every possible type of memory device.

>> Multiple algorithm implementations can cooperate via calculating abstract distance for different memory nodes. The preference of algorithm implementations can be specified via priority (notifier_block.priority).
>
> How/what decides the priority though? That seems like something better decided by a device driver than the algorithm driver IMHO.

Do we need a memory device driver specific priority? Or do we just share a common priority? For example, the priority of CXL CDAT is always higher than that of ACPI HMAT? Or architecture specific?

And, I don't think that we are forced to use the general notifier chain interface in all memory device drivers. If the memory device driver has a better understanding of the memory device, it can use another way to determine the abstract distance. For example, a CXL memory device driver can identify the abstract distance by itself, while other memory device drivers can use the general notifier chain interface at the same time.

--
Best Regards,
Huang, Ying
"Huang, Ying" <ying.huang@intel.com> writes: > Hi, Alistair, > > Thanks a lot for comments! > > Alistair Popple <apopple@nvidia.com> writes: > >> Huang Ying <ying.huang@intel.com> writes: >> >>> The abstract distance may be calculated by various drivers, such as >>> ACPI HMAT, CXL CDAT, etc. While it may be used by various code which >>> hot-add memory node, such as dax/kmem etc. To decouple the algorithm >>> users and the providers, the abstract distance calculation algorithms >>> management mechanism is implemented in this patch. It provides >>> interface for the providers to register the implementation, and >>> interface for the users. >> >> I wonder if we need this level of decoupling though? It seems to me like >> it would be simpler and better for drivers to calculate the abstract >> distance directly themselves by calling the desired algorithm (eg. ACPI >> HMAT) and pass this when creating the nodes rather than having a >> notifier chain. > > Per my understanding, ACPI HMAT and memory device drivers (such as > dax/kmem) may belong to different subsystems (ACPI vs. dax). It's not > good to call functions across subsystems directly. So, I think it's > better to use a general subsystem: memory-tier.c to decouple them. If > it turns out that a notifier chain is unnecessary, we can use some > function pointers instead. > >> At the moment it seems we've only identified two possible algorithms >> (ACPI HMAT and CXL CDAT) and I don't think it would make sense for one >> of those to fallback to the other based on priority, so why not just >> have drivers call the correct algorithm directly? > > For example, we have a system with PMEM (persistent memory, Optane > DCPMM, or AEP, or something else) in DIMM slots and CXL.mem connected > via CXL link to a remote memory pool. We will need ACPI HMAT for PMEM > and CXL CDAT for CXL.mem. One way is to make dax/kmem identify the > types of the device and call corresponding algorithms. Yes, that is what I was thinking. 
> The other way (suggested by this series) is to make dax/kmem call a > notifier chain, then CXL CDAT or ACPI HMAT can identify the type of > device and calculate the distance if the type is correct for them. I > don't think that it's good to make dax/kem to know every possible > types of memory devices. Do we expect there to be lots of different types of memory devices sharing a common dax/kmem driver though? Must admit I'm coming from a GPU background where we'd expect each type of device to have it's own driver anyway so wasn't expecting different types of memory devices to be handled by the same driver. >>> Multiple algorithm implementations can cooperate via calculating >>> abstract distance for different memory nodes. The preference of >>> algorithm implementations can be specified via >>> priority (notifier_block.priority). >> >> How/what decides the priority though? That seems like something better >> decided by a device driver than the algorithm driver IMHO. > > Do we need the memory device driver specific priority? Or we just share > a common priority? For example, the priority of CXL CDAT is always > higher than that of ACPI HMAT? Or architecture specific? Ok, thanks. Having read the above I think the priority is unimportant. Algorithms can either decide to return a distance and NOTIFY_STOP_MASK if they can calculate a distance or NOTIFY_DONE if they can't for a specific device. > And, I don't think that we are forced to use the general notifier > chain interface in all memory device drivers. If the memory device > driver has better understanding of the memory device, it can use other > way to determine abstract distance. For example, a CXL memory device > driver can identify abstract distance by itself. While other memory > device drivers can use the general notifier chain interface at the > same time. 
Whilst I think personally I would find that flexibility useful I am concerned it means every driver will just end up divining it's own distance rather than ensuring data in HMAT/CDAT/etc. is correct. That would kind of defeat the purpose of it all then.
Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Hi, Alistair,
>>
>> Thanks a lot for comments!
>>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> Huang Ying <ying.huang@intel.com> writes:
>>>
>>>> The abstract distance may be calculated by various drivers, such as ACPI HMAT, CXL CDAT, etc. While it may be used by various code which hot-add memory node, such as dax/kmem etc. To decouple the algorithm users and the providers, the abstract distance calculation algorithms management mechanism is implemented in this patch. It provides interface for the providers to register the implementation, and interface for the users.
>>>
>>> I wonder if we need this level of decoupling though? It seems to me like it would be simpler and better for drivers to calculate the abstract distance directly themselves by calling the desired algorithm (eg. ACPI HMAT) and pass this when creating the nodes rather than having a notifier chain.
>>
>> Per my understanding, ACPI HMAT and memory device drivers (such as dax/kmem) may belong to different subsystems (ACPI vs. dax). It's not good to call functions across subsystems directly. So, I think it's better to use a general subsystem: memory-tier.c to decouple them. If it turns out that a notifier chain is unnecessary, we can use some function pointers instead.
>>
>>> At the moment it seems we've only identified two possible algorithms (ACPI HMAT and CXL CDAT) and I don't think it would make sense for one of those to fallback to the other based on priority, so why not just have drivers call the correct algorithm directly?
>>
>> For example, we have a system with PMEM (persistent memory, Optane DCPMM, or AEP, or something else) in DIMM slots and CXL.mem connected via CXL link to a remote memory pool. We will need ACPI HMAT for PMEM and CXL CDAT for CXL.mem. One way is to make dax/kmem identify the types of the device and call corresponding algorithms.
>
> Yes, that is what I was thinking.
>
>> The other way (suggested by this series) is to make dax/kmem call a notifier chain, then CXL CDAT or ACPI HMAT can identify the type of device and calculate the distance if the type is correct for them. I don't think that it's good to make dax/kem to know every possible types of memory devices.
>
> Do we expect there to be lots of different types of memory devices sharing a common dax/kmem driver though? Must admit I'm coming from a GPU background where we'd expect each type of device to have it's own driver anyway so wasn't expecting different types of memory devices to be handled by the same driver.

Now, dax/kmem.c is used for

- PMEM (Optane DCPMM, or AEP)
- CXL.mem
- HBM (attached to CPU)

I understand that for a CXL GPU driver it's OK to call some CXL CDAT helper to identify the abstract distance of memory attached to the GPU, because there are no cross-subsystem function calls. But it looks not very clean to call from dax/kmem.c into CXL CDAT, because that is a cross-subsystem function call.

>>>> Multiple algorithm implementations can cooperate via calculating abstract distance for different memory nodes. The preference of algorithm implementations can be specified via priority (notifier_block.priority).
>>>
>>> How/what decides the priority though? That seems like something better decided by a device driver than the algorithm driver IMHO.
>>
>> Do we need the memory device driver specific priority? Or we just share a common priority? For example, the priority of CXL CDAT is always higher than that of ACPI HMAT? Or architecture specific?
>
> Ok, thanks. Having read the above I think the priority is unimportant. Algorithms can either decide to return a distance and NOTIFY_STOP_MASK if they can calculate a distance or NOTIFY_DONE if they can't for a specific device.

Yes. In most cases, there are no overlaps among algorithms.

>> And, I don't think that we are forced to use the general notifier chain interface in all memory device drivers. If the memory device driver has better understanding of the memory device, it can use other way to determine abstract distance. For example, a CXL memory device driver can identify abstract distance by itself. While other memory device drivers can use the general notifier chain interface at the same time.
>
> Whilst I think personally I would find that flexibility useful I am concerned it means every driver will just end up divining it's own distance rather than ensuring data in HMAT/CDAT/etc. is correct. That would kind of defeat the purpose of it all then.

But we have no way to enforce that too.

--
Best Regards,
Huang, Ying
"Huang, Ying" <ying.huang@intel.com> writes: >>> The other way (suggested by this series) is to make dax/kmem call a >>> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of >>> device and calculate the distance if the type is correct for them. I >>> don't think that it's good to make dax/kem to know every possible >>> types of memory devices. >> >> Do we expect there to be lots of different types of memory devices >> sharing a common dax/kmem driver though? Must admit I'm coming from a >> GPU background where we'd expect each type of device to have it's own >> driver anyway so wasn't expecting different types of memory devices to >> be handled by the same driver. > > Now, dax/kmem.c is used for > > - PMEM (Optane DCPMM, or AEP) > - CXL.mem > - HBM (attached to CPU) Thanks a lot for the background! I will admit to having a faily narrow focus here. >>> And, I don't think that we are forced to use the general notifier >>> chain interface in all memory device drivers. If the memory device >>> driver has better understanding of the memory device, it can use other >>> way to determine abstract distance. For example, a CXL memory device >>> driver can identify abstract distance by itself. While other memory >>> device drivers can use the general notifier chain interface at the >>> same time. >> >> Whilst I think personally I would find that flexibility useful I am >> concerned it means every driver will just end up divining it's own >> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That >> would kind of defeat the purpose of it all then. > > But we have no way to enforce that too. Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but we can influence it. If drivers can easily ignore the notifier chain and do their own thing that's what will happen. >>> While other memory device drivers can use the general notifier chain >>> interface at the same time. How would that work in practice though? 
The abstract distance as far as I can tell doesn't have any meaning other than establishing preferences for memory demotion order. Therefore all calculations are relative to the rest of the calculations on the system. So if a driver does it's own thing how does it choose a sensible distance? IHMO the value here is in coordinating all that through a standard interface, whether that is HMAT or something else. - Alistair
Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>>>> The other way (suggested by this series) is to make dax/kmem call a notifier chain, then CXL CDAT or ACPI HMAT can identify the type of device and calculate the distance if the type is correct for them. I don't think that it's good to make dax/kem to know every possible types of memory devices.
>>>
>>> Do we expect there to be lots of different types of memory devices sharing a common dax/kmem driver though? Must admit I'm coming from a GPU background where we'd expect each type of device to have it's own driver anyway so wasn't expecting different types of memory devices to be handled by the same driver.
>>
>> Now, dax/kmem.c is used for
>>
>> - PMEM (Optane DCPMM, or AEP)
>> - CXL.mem
>> - HBM (attached to CPU)
>
> Thanks a lot for the background! I will admit to having a faily narrow focus here.
>
>>>> And, I don't think that we are forced to use the general notifier chain interface in all memory device drivers. If the memory device driver has better understanding of the memory device, it can use other way to determine abstract distance. For example, a CXL memory device driver can identify abstract distance by itself. While other memory device drivers can use the general notifier chain interface at the same time.
>>>
>>> Whilst I think personally I would find that flexibility useful I am concerned it means every driver will just end up divining it's own distance rather than ensuring data in HMAT/CDAT/etc. is correct. That would kind of defeat the purpose of it all then.
>>
>> But we have no way to enforce that too.
>
> Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but we can influence it. If drivers can easily ignore the notifier chain and do their own thing that's what will happen.

IMHO, we should both enforce that HMAT/CDAT/etc. is correct and encourage drivers to use the general interface we provide. Anyway, we should try to make HMAT/CDAT work well, so that drivers want to use them :-)

>>>> While other memory device drivers can use the general notifier chain interface at the same time.
>
> How would that work in practice though? The abstract distance as far as I can tell doesn't have any meaning other than establishing preferences for memory demotion order. Therefore all calculations are relative to the rest of the calculations on the system. So if a driver does it's own thing how does it choose a sensible distance? IHMO the value here is in coordinating all that through a standard interface, whether that is HMAT or something else.

Only if different algorithms follow the same basic principle. For example, the abstract distance of the default DRAM nodes is fixed (MEMTIER_ADISTANCE_DRAM). The abstract distance of a memory device is in linear direct proportion to the memory latency and inversely proportional to the memory bandwidth, using the memory latency and bandwidth of the default DRAM nodes as the base.

HMAT and CDAT report the raw memory latency and bandwidth. If there are some other methods to report the raw memory latency and bandwidth, we can use them too.

--
Best Regards,
Huang, Ying
"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>>>> And, I don't think that we are forced to use the general notifier
>>>>> chain interface in all memory device drivers. If the memory device
>>>>> driver has better understanding of the memory device, it can use other
>>>>> way to determine abstract distance. For example, a CXL memory device
>>>>> driver can identify abstract distance by itself. While other memory
>>>>> device drivers can use the general notifier chain interface at the
>>>>> same time.
>>>>
>>>> Whilst I think personally I would find that flexibility useful I am
>>>> concerned it means every driver will just end up divining it's own
>>>> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
>>>> would kind of defeat the purpose of it all then.
>>>
>>> But we have no way to enforce that too.
>>
>> Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but
>> we can influence it. If drivers can easily ignore the notifier chain and
>> do their own thing that's what will happen.
>
> IMHO, both enforce HMAT/CDAT/etc is correct and enforce drivers to use
> general interface we provided. Anyway, we should try to make HMAT/CDAT
> works well, so drivers want to use them :-)

Exactly :-)

>>>>> While other memory device drivers can use the general notifier chain
>>>>> interface at the same time.
>>
>> How would that work in practice though? The abstract distance as far as
>> I can tell doesn't have any meaning other than establishing preferences
>> for memory demotion order. Therefore all calculations are relative to
>> the rest of the calculations on the system. So if a driver does it's own
>> thing how does it choose a sensible distance? IHMO the value here is in
>> coordinating all that through a standard interface, whether that is HMAT
>> or something else.
>
> Only if different algorithms follow the same basic principle. For
> example, the abstract distance of default DRAM nodes are fixed
> (MEMTIER_ADISTANCE_DRAM). The abstract distance of the memory device is
> in linear direct proportion to the memory latency and inversely
> proportional to the memory bandwidth. Use the memory latency and
> bandwidth of default DRAM nodes as base.
>
> HMAT and CDAT report the raw memory latency and bandwidth. If there are
> some other methods to report the raw memory latency and bandwidth, we
> can use them too.

Argh! So we could address my concerns by having drivers feed
latency/bandwidth numbers into a standard calculation algorithm right?
Ie. Rather than having drivers calculate abstract distance themselves we
have the notifier chains return the raw performance data from which the
abstract distance is derived.
Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>>>> And, I don't think that we are forced to use the general notifier
>>>>>> chain interface in all memory device drivers. If the memory device
>>>>>> driver has better understanding of the memory device, it can use other
>>>>>> way to determine abstract distance. For example, a CXL memory device
>>>>>> driver can identify abstract distance by itself. While other memory
>>>>>> device drivers can use the general notifier chain interface at the
>>>>>> same time.
>>>>>
>>>>> Whilst I think personally I would find that flexibility useful I am
>>>>> concerned it means every driver will just end up divining it's own
>>>>> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
>>>>> would kind of defeat the purpose of it all then.
>>>>
>>>> But we have no way to enforce that too.
>>>
>>> Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but
>>> we can influence it. If drivers can easily ignore the notifier chain and
>>> do their own thing that's what will happen.
>>
>> IMHO, both enforce HMAT/CDAT/etc is correct and enforce drivers to use
>> general interface we provided. Anyway, we should try to make HMAT/CDAT
>> works well, so drivers want to use them :-)
>
> Exactly :-)
>
>>>>>> While other memory device drivers can use the general notifier chain
>>>>>> interface at the same time.
>>>
>>> How would that work in practice though? The abstract distance as far as
>>> I can tell doesn't have any meaning other than establishing preferences
>>> for memory demotion order. Therefore all calculations are relative to
>>> the rest of the calculations on the system. So if a driver does it's own
>>> thing how does it choose a sensible distance? IHMO the value here is in
>>> coordinating all that through a standard interface, whether that is HMAT
>>> or something else.
>>
>> Only if different algorithms follow the same basic principle. For
>> example, the abstract distance of default DRAM nodes are fixed
>> (MEMTIER_ADISTANCE_DRAM). The abstract distance of the memory device is
>> in linear direct proportion to the memory latency and inversely
>> proportional to the memory bandwidth. Use the memory latency and
>> bandwidth of default DRAM nodes as base.
>>
>> HMAT and CDAT report the raw memory latency and bandwidth. If there are
>> some other methods to report the raw memory latency and bandwidth, we
>> can use them too.
>
> Argh! So we could address my concerns by having drivers feed
> latency/bandwidth numbers into a standard calculation algorithm right?
> Ie. Rather than having drivers calculate abstract distance themselves we
> have the notifier chains return the raw performance data from which the
> abstract distance is derived.

Now, memory device drivers only need a general interface to get the
abstract distance from the NUMA node ID. In the future, if they need
more interfaces, we can add them. For example, the interface you
suggested above.

--
Best Regards,
Huang, Ying
"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> Alistair Popple <apopple@nvidia.com> writes:
>>>
>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>> interface at the same time.
>>>>
>>>> How would that work in practice though? The abstract distance as far as
>>>> I can tell doesn't have any meaning other than establishing preferences
>>>> for memory demotion order. Therefore all calculations are relative to
>>>> the rest of the calculations on the system. So if a driver does it's own
>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>> coordinating all that through a standard interface, whether that is HMAT
>>>> or something else.
>>>
>>> Only if different algorithms follow the same basic principle. For
>>> example, the abstract distance of default DRAM nodes are fixed
>>> (MEMTIER_ADISTANCE_DRAM). The abstract distance of the memory device is
>>> in linear direct proportion to the memory latency and inversely
>>> proportional to the memory bandwidth. Use the memory latency and
>>> bandwidth of default DRAM nodes as base.
>>>
>>> HMAT and CDAT report the raw memory latency and bandwidth. If there are
>>> some other methods to report the raw memory latency and bandwidth, we
>>> can use them too.
>>
>> Argh! So we could address my concerns by having drivers feed
>> latency/bandwidth numbers into a standard calculation algorithm right?
>> Ie. Rather than having drivers calculate abstract distance themselves we
>> have the notifier chains return the raw performance data from which the
>> abstract distance is derived.
>
> Now, memory device drivers only need a general interface to get the
> abstract distance from the NUMA node ID. In the future, if they need
> more interfaces, we can add them. For example, the interface you
> suggested above.

Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
distance, it's a meaningless number. The only reason they care about it
is so they can pass it to alloc_memory_type():

struct memory_dev_type *alloc_memory_type(int adistance)

Instead alloc_memory_type() should be taking bandwidth/latency numbers
and the calculation of abstract distance should be done there. That
resolves the issues about how drivers are supposed to divine adistance
and also means that when CDAT is added we don't have to duplicate the
calculation code.
Hi, Alistair,

Sorry for the late response. Just came back from vacation.

Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>
>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>> interface at the same time.
>>>>>
>>>>> How would that work in practice though? The abstract distance as far as
>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>> or something else.
>>>>
>>>> Only if different algorithms follow the same basic principle. For
>>>> example, the abstract distance of default DRAM nodes are fixed
>>>> (MEMTIER_ADISTANCE_DRAM). The abstract distance of the memory device is
>>>> in linear direct proportion to the memory latency and inversely
>>>> proportional to the memory bandwidth. Use the memory latency and
>>>> bandwidth of default DRAM nodes as base.
>>>>
>>>> HMAT and CDAT report the raw memory latency and bandwidth. If there are
>>>> some other methods to report the raw memory latency and bandwidth, we
>>>> can use them too.
>>>
>>> Argh! So we could address my concerns by having drivers feed
>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>> have the notifier chains return the raw performance data from which the
>>> abstract distance is derived.
>>
>> Now, memory device drivers only need a general interface to get the
>> abstract distance from the NUMA node ID. In the future, if they need
>> more interfaces, we can add them. For example, the interface you
>> suggested above.
>
> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
> distance, it's a meaningless number. The only reason they care about it
> is so they can pass it to alloc_memory_type():
>
> struct memory_dev_type *alloc_memory_type(int adistance)
>
> Instead alloc_memory_type() should be taking bandwidth/latency numbers
> and the calculation of abstract distance should be done there. That
> resovles the issues about how drivers are supposed to devine adistance
> and also means that when CDAT is added we don't have to duplicate the
> calculation code.

In the current design, the abstract distance is the key concept of
memory types and memory tiers. And it is used as the interface to
allocate memory types. This provides more flexibility than some other
interfaces (e.g. read/write bandwidth/latency). For example, in the
current dax/kmem.c, if HMAT isn't available in the system, the default
abstract distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used. This is still
useful to support some systems now. On a system without HMAT/CDAT, it's
possible to calculate abstract distance from ACPI SLIT, although this is
quite limited. I'm not sure whether all systems will provide read/write
bandwidth/latency data for all memory devices.

HMAT and CDAT or some other mechanisms may provide the read/write
bandwidth/latency data to be used to calculate abstract distance. For
them, we can provide a shared implementation in mm/memory-tiers.c to map
from read/write bandwidth/latency to the abstract distance. Can this
solve your concerns about the consistency among algorithms? If so, we
can do that when we add the second algorithm that needs that.

--
Best Regards,
Huang, Ying
"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> Alistair Popple <apopple@nvidia.com> writes:
>>>
>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>
>>>>> Hi, Alistair,
>>>>>
>>>>> Sorry for late response. Just come back from vacation.
>>>>
>>>> Ditto for this response :-)
>>>>
>>>> I see Andrew has taken this into mm-unstable though, so my bad for not
>>>> getting around to following all this up sooner.
>>>>
>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>
>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>
>>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>>
>>>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>>>
>>>>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>>>>
>>>>>>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>>>>>>> interface at the same time.
>>>>>>>>>>
>>>>>>>>>> How would that work in practice though? The abstract distance as far as
>>>>>>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>>>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>>>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>>>>>>> or something else.
>>>>>>>>>
>>>>>>>>> Only if different algorithms follow the same basic principle. For
>>>>>>>>> example, the abstract distance of default DRAM nodes are fixed
>>>>>>>>> (MEMTIER_ADISTANCE_DRAM). The abstract distance of the memory device is
>>>>>>>>> in linear direct proportion to the memory latency and inversely
>>>>>>>>> proportional to the memory bandwidth. Use the memory latency and
>>>>>>>>> bandwidth of default DRAM nodes as base.
>>>>>>>>>
>>>>>>>>> HMAT and CDAT report the raw memory latency and bandwidth. If there are
>>>>>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>>>>>> can use them too.
>>>>>>>>
>>>>>>>> Argh! So we could address my concerns by having drivers feed
>>>>>>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>>>>>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>>>>>>> have the notifier chains return the raw performance data from which the
>>>>>>>> abstract distance is derived.
>>>>>>>
>>>>>>> Now, memory device drivers only need a general interface to get the
>>>>>>> abstract distance from the NUMA node ID. In the future, if they need
>>>>>>> more interfaces, we can add them. For example, the interface you
>>>>>>> suggested above.
>>>>>>
>>>>>> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
>>>>>> distance, it's a meaningless number. The only reason they care about it
>>>>>> is so they can pass it to alloc_memory_type():
>>>>>>
>>>>>> struct memory_dev_type *alloc_memory_type(int adistance)
>>>>>>
>>>>>> Instead alloc_memory_type() should be taking bandwidth/latency numbers
>>>>>> and the calculation of abstract distance should be done there. That
>>>>>> resovles the issues about how drivers are supposed to devine adistance
>>>>>> and also means that when CDAT is added we don't have to duplicate the
>>>>>> calculation code.
>>>>>
>>>>> In the current design, the abstract distance is the key concept of
>>>>> memory types and memory tiers. And it is used as interface to allocate
>>>>> memory types. This provides more flexibility than some other interfaces
>>>>> (e.g. read/write bandwidth/latency). For example, in current
>>>>> dax/kmem.c, if HMAT isn't available in the system, the default abstract
>>>>> distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used. This is still useful
>>>>> to support some systems now. On a system without HMAT/CDAT, it's
>>>>> possible to calculate abstract distance from ACPI SLIT, although this is
>>>>> quite limited. I'm not sure whether all systems will provide read/write
>>>>> bandwith/latency data for all memory devices.
>>>>>
>>>>> HMAT and CDAT or some other mechanisms may provide the read/write
>>>>> bandwidth/latency data to be used to calculate abstract distance. For
>>>>> them, we can provide a shared implementation in mm/memory-tiers.c to map
>>>>> from read/write bandwith/latency to the abstract distance. Can this
>>>>> solve your concerns about the consistency among algorithms? If so, we
>>>>> can do that when we add the second algorithm that needs that.
>>>>
>>>> I guess it would address my concerns if we did that now. I don't see why
>>>> we need to wait for a second implementation for that though - the whole
>>>> series seems to be built around adding a framework for supporting
>>>> multiple algorithms even though only one exists. So I think we should
>>>> support that fully, or simplify the whole thing and just assume the only
>>>> thing that exists is HMAT and get rid of the general interface until a
>>>> second algorithm comes along.
>>>
>>> We will need a general interface even for one algorithm implementation.
>>> Because it's not good to make a dax subsystem driver (dax/kmem) to
>>> depend on a ACPI subsystem driver (acpi/hmat). We need some general
>>> interface at subsystem level (memory tier here) between them.
>>
>> I don't understand this argument. For a single algorithm it would be
>> simpler to just define acpi_hmat_calculate_adistance() and a static
>> inline version of it that returns -ENOENT when !CONFIG_ACPI than adding
>> a layer of indirection through notifier blocks. That breaks any
>> dependency on ACPI and there's plenty of precedent for this approach in
>> the kernel already.
>
> ACPI is a subsystem, so it's OK for dax/kmem to depends on CONFIG_ACPI.
> But HMAT is a driver of ACPI subsystem (controlled via
> CONFIG_ACPI_HMAT). It's not good for a driver of DAX subsystem
> (dax/kmem) to depend on a *driver* of ACPI subsystem.
>
> Yes. Technically, there's no hard wall to prevent this. But I think
> that a good design should make drivers depends on subsystems or drivers
> of the same subsystem, NOT drivers of other subsystems.

Thanks, I wasn't really thinking of HMAT as an ACPI driver. I understand
where you're coming from but I really don't see the problem with using a
static inline. It doesn't create dependencies (you could still use
dax/kmem without ACPI) and results in smaller and easier to follow code.

IMHO it's far more obvious that a call to acpi_hmat_calculate_adist()
returns either a default if ACPI HMAT isn't configured or a calculated
value than it is to figure out what notifiers may or may not be
registered at runtime and what priority they may be called in from
mt_calc_adistance().

It appears you think that is a bad design, but I don't understand why.
What does this approach give us that a simpler approach wouldn't?
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index fc9647b1b4f9..c6429e624244 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -6,6 +6,7 @@
 #include <linux/nodemask.h>
 #include <linux/kref.h>
 #include <linux/mmzone.h>
+#include <linux/notifier.h>
 /*
  * Each tier cover a abstrace distance chunk size of 128
  */
@@ -36,6 +37,9 @@ struct memory_dev_type *alloc_memory_type(int adistance);
 void destroy_memory_type(struct memory_dev_type *memtype);
 void init_node_memory_type(int node, struct memory_dev_type *default_type);
 void clear_node_memory_type(int node, struct memory_dev_type *memtype);
+int register_mt_adistance_algorithm(struct notifier_block *nb);
+int unregister_mt_adistance_algorithm(struct notifier_block *nb);
+int mt_calc_adistance(int node, int *adist);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -97,5 +101,20 @@ static inline bool node_is_toptier(int node)
 {
 	return true;
 }
+
+static inline int register_mt_adistance_algorithm(struct notifier_block *nb)
+{
+	return 0;
+}
+
+static inline int unregister_mt_adistance_algorithm(struct notifier_block *nb)
+{
+	return 0;
+}
+
+static inline int mt_calc_adistance(int node, int *adist)
+{
+	return NOTIFY_DONE;
+}
 #endif	/* CONFIG_NUMA */
 #endif	/* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index a516e303e304..1e55fbe2ad51 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -5,6 +5,7 @@
 #include <linux/kobject.h>
 #include <linux/memory.h>
 #include <linux/memory-tiers.h>
+#include <linux/notifier.h>
 
 #include "internal.h"
 
@@ -105,6 +106,8 @@ static int top_tier_adistance;
 static struct demotion_nodes *node_demotion __read_mostly;
 #endif /* CONFIG_MIGRATION */
 
+static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
+
 static inline struct memory_tier *to_memory_tier(struct device *device)
 {
 	return container_of(device, struct memory_tier, dev);
@@ -592,6 +595,62 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype)
 }
 EXPORT_SYMBOL_GPL(clear_node_memory_type);
 
+/**
+ * register_mt_adistance_algorithm() - Register memory tiering abstract distance algorithm
+ * @nb: The notifier block which describe the algorithm
+ *
+ * Return: 0 on success, errno on error.
+ *
+ * Every memory tiering abstract distance algorithm provider needs to
+ * register the algorithm with register_mt_adistance_algorithm().  To
+ * calculate the abstract distance for a specified memory node, the
+ * notifier function will be called unless some high priority
+ * algorithm has provided result.  The prototype of the notifier
+ * function is as follows,
+ *
+ *   int (*algorithm_notifier)(struct notifier_block *nb,
+ *                             unsigned long nid, void *data);
+ *
+ * Where "nid" specifies the memory node, "data" is the pointer to the
+ * returned abstract distance (that is, "int *adist").  If the
+ * algorithm provides the result, NOTIFY_STOP should be returned.
+ * Otherwise, return_value & %NOTIFY_STOP_MASK == 0 to allow the next
+ * algorithm in the chain to provide the result.
+ */
+int register_mt_adistance_algorithm(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&mt_adistance_algorithms, nb);
+}
+EXPORT_SYMBOL_GPL(register_mt_adistance_algorithm);
+
+/**
+ * unregister_mt_adistance_algorithm() - Unregister memory tiering abstract distance algorithm
+ * @nb: the notifier block which describe the algorithm
+ *
+ * Return: 0 on success, errno on error.
+ */
+int unregister_mt_adistance_algorithm(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&mt_adistance_algorithms, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_mt_adistance_algorithm);
+
+/**
+ * mt_calc_adistance() - Calculate abstract distance with registered algorithms
+ * @node: the node to calculate abstract distance for
+ * @adist: the returned abstract distance
+ *
+ * Return: if return_value & %NOTIFY_STOP_MASK != 0, then some
+ * abstract distance algorithm provides the result, and return it via
+ * @adist.  Otherwise, no algorithm can provide the result and @adist
+ * will be kept as it is.
+ */
+int mt_calc_adistance(int node, int *adist)
+{
+	return blocking_notifier_call_chain(&mt_adistance_algorithms, node, adist);
+}
+EXPORT_SYMBOL_GPL(mt_calc_adistance);
+
 static int __meminit memtier_hotplug_callback(struct notifier_block *self,
 					      unsigned long action, void *_arg)
 {