[v3,3/8] x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]

Message ID 20230713163207.219710-4-tony.luck@intel.com
State New
Series x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

Commit Message

Luck, Tony July 13, 2023, 4:32 p.m. UTC
  Add a placeholder in the array of struct rdt_hw_resource to be used
for event monitoring of systems with Sub-NUMA Cluster enabled.

Update get_domain_id() to handle SCOPE_NODE.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
---
 arch/x86/kernel/cpu/resctrl/internal.h |  4 +++-
 arch/x86/kernel/cpu/resctrl/core.c     | 12 ++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)
  

Comments

Reinette Chatre July 18, 2023, 8:40 p.m. UTC | #1
Hi Tony,

On 7/13/2023 9:32 AM, Tony Luck wrote:
> Add a placeholder in the array of struct rdt_hw_resource to be used
> for event monitoring of systems with Sub-NUMA Cluster enabled.

Could you please elaborate why a new resource is required?


> 
> Update get_domain_id() to handle SCOPE_NODE.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Peter Newman <peternewman@google.com>
> ---
>  arch/x86/kernel/cpu/resctrl/internal.h |  4 +++-
>  arch/x86/kernel/cpu/resctrl/core.c     | 12 ++++++++++++
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 8275b8a74f7e..243017096ddf 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -435,6 +435,7 @@ enum resctrl_res_level {
>  	RDT_RESOURCE_L2,
>  	RDT_RESOURCE_MBA,
>  	RDT_RESOURCE_SMBA,
> +	RDT_RESOURCE_NODE,
>  
>  	/* Must be the last */
>  	RDT_NUM_RESOURCES,
> @@ -442,7 +443,8 @@ enum resctrl_res_level {
>  
>  enum resctrl_scope {
>  	SCOPE_L2_CACHE = 2,
> -	SCOPE_L3_CACHE = 3
> +	SCOPE_L3_CACHE = 3,
> +	SCOPE_NODE,
>  };

A new resource _and_ a new scope is added. Could changelog please
explain why this is required?

>  
>  static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 6571514752f3..e4bd3072927c 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -112,6 +112,16 @@ struct rdt_hw_resource rdt_resources_all[] = {
>  			.fflags			= RFTYPE_RES_MB,
>  		},
>  	},
> +	[RDT_RESOURCE_NODE] =
> +	{
> +		.r_resctrl = {
> +			.rid			= RDT_RESOURCE_NODE,
> +			.name			= "L3",
> +			.scope			= SCOPE_NODE,
> +			.domains		= domain_init(RDT_RESOURCE_NODE),
> +			.fflags			= 0,
> +		},
> +	},
>  };

So the new resource has the same name, from user perspective,
as RDT_RESOURCE_L3. From this perspective it thus seems to be a
shadow of RDT_RESOURCE_L3 that is used as alternative for some properties
of the actual RDT_RESOURCE_L3? This is starting to look as though this
solution is wrenching itself into current architecture.

From what I can tell the monitoring in SNC environment needs a different
domain list because of the change in scope. What else is needed in the
resource that is different from the existing L3 resource? Could the
monitoring scope of a resource not instead be made distinct from its
allocation scope? By default monitoring and allocation scope will be
the same and thus use the same domain list but when SNC is enabled
then monitoring uses a different domain list.
  
>  /*
> @@ -489,6 +499,8 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
>  
>  static int get_domain_id(int cpu, enum resctrl_scope scope)
>  {
> +	if (scope == SCOPE_NODE)
> +		return cpu_to_node(cpu);
>  	return get_cpu_cacheinfo_id(cpu, scope);
>  }
>  

Reinette
  
Luck, Tony July 18, 2023, 10:57 p.m. UTC | #2
On Tue, Jul 18, 2023 at 01:40:32PM -0700, Reinette Chatre wrote:
> > +	[RDT_RESOURCE_NODE] =
> > +	{
> > +		.r_resctrl = {
> > +			.rid			= RDT_RESOURCE_NODE,
> > +			.name			= "L3",
> > +			.scope			= SCOPE_NODE,
> > +			.domains		= domain_init(RDT_RESOURCE_NODE),
> > +			.fflags			= 0,
> > +		},
> > +	},
> >  };
> 
> So the new resource has the same name, from user perspective,
> as RDT_RESOURCE_L3. From this perspective it thus seems to be a
> shadow of RDT_RESOURCE_L3 that is used as alternative for some properties
> of the actual RDT_RESOURCE_L3? This is starting to look as though this
> solution is wrenching itself into current architecture.
> 
> From what I can tell the monitoring in SNC environment needs a different
> domain list because of the change in scope. What else is needed in the
> resource that is different from the existing L3 resource? Could the
> monitoring scope of a resource not instead be made distinct from its
> allocation scope? By default monitoring and allocation scope will be
> the same and thus use the same domain list but when SNC is enabled
> then monitoring uses a different domain list.

Answering this part first, because my choice here affects a bunch
of the code that also raised comments from you.

The crux of the issue is that when SNC mode is enabled the scope
for L3 monitoring functions changes to "node" scope, while the
scope of L3 control functions (CAT, CDP) remains at L3 cache scope.

My solution was to just create a new resource. But you have an
interesting alternate solution. Add an extra domain list to the
resource structure to allow creation of distinct domain lists
for this case where the scope for control and monitor functions
differs.

So change the resource structure like this:

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..01590aa59a67 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -168,10 +168,12 @@ struct rdt_resource {
 	bool			alloc_capable;
 	bool			mon_capable;
 	int			num_rmid;
-	int			cache_level;
+	int			ctrl_scope;
+	int			mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;

and build/use separate domain lists for when this resource is
being referenced for allocation/monitoring. E.g. domain_add_cpu()
would check "r->alloc_capable" and add a cpu to the ctrl_domains
list based on the ctrl_scope value. It would do the same with
mon_capable / mon_domains / mon_scope.

If ctrl_scope == mon_scope, just build one list as you suggest above.

Maybe there are more places that walk the list of control domains than
walk the list of monitor domains. Need to audit this set:

$ git grep list_for_each.*domains -- arch/x86/kernel/cpu/resctrl
arch/x86/kernel/cpu/resctrl/core.c:     list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/core.c:     list_for_each(l, &r->domains) {
arch/x86/kernel/cpu/resctrl/ctrlmondata.c:      list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/ctrlmondata.c:      list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/ctrlmondata.c:      list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/monitor.c:  list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/pseudo_lock.c:              list_for_each_entry(d_i, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(dom, &r->domains, list)
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r_l->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(dom, &r->domains, list)
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &s->res->domains, list) {
arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {

Maybe "domains" can keep its name and make a "list_for_each_monitor_domain()" macro
to pick the right list to walk?
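
A self-contained userspace sketch (not kernel code; the structs below are
simplified stand-ins for struct rdt_resource and struct list_head, and the
helper/macro names are hypothetical) of how such a macro could pick the
right list:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Simplified stand-in for struct rdt_resource: keep "domains" for
 * control, add "mondomains" used only when monitor scope differs. */
struct rdt_resource {
	int mon_scope;			/* 0: monitor scope == control scope */
	struct list_head domains;	/* control (and default monitor) domains */
	struct list_head mondomains;	/* monitor domains when scopes differ */
};

/* Hypothetical helper: which list do monitoring paths walk? */
static struct list_head *mon_domain_list(struct rdt_resource *r)
{
	return r->mon_scope ? &r->mondomains : &r->domains;
}

/* Hypothetical list_for_each_monitor_domain(): walk the monitor list,
 * which collapses to the plain domains list when scopes are the same. */
#define list_for_each_monitor_domain(pos, r) \
	for ((pos) = mon_domain_list(r)->next; \
	     (pos) != mon_domain_list(r); (pos) = (pos)->next)
```

With this, monitoring call sites keep one spelling regardless of whether
SNC split the scopes, at the cost of hiding which list is being walked.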


I don't think this will reduce the amount of code change in a
significant way. But it may be conceptually easier to follow
what is going on.

-Tony
  
Reinette Chatre July 18, 2023, 11:45 p.m. UTC | #3
Hi Tony,

On 7/18/2023 3:57 PM, Tony Luck wrote:
> On Tue, Jul 18, 2023 at 01:40:32PM -0700, Reinette Chatre wrote:
>>> +	[RDT_RESOURCE_NODE] =
>>> +	{
>>> +		.r_resctrl = {
>>> +			.rid			= RDT_RESOURCE_NODE,
>>> +			.name			= "L3",
>>> +			.scope			= SCOPE_NODE,
>>> +			.domains		= domain_init(RDT_RESOURCE_NODE),
>>> +			.fflags			= 0,
>>> +		},
>>> +	},
>>>  };
>>
>> So the new resource has the same name, from user perspective,
>> as RDT_RESOURCE_L3. From this perspective it thus seems to be a
>> shadow of RDT_RESOURCE_L3 that is used as alternative for some properties
>> of the actual RDT_RESOURCE_L3? This is starting to look as though this
>> solution is wrenching itself into current architecture.
>>
>> From what I can tell the monitoring in SNC environment needs a different
>> domain list because of the change in scope. What else is needed in the
>> resource that is different from the existing L3 resource? Could the
>> monitoring scope of a resource not instead be made distinct from its
>> allocation scope? By default monitoring and allocation scope will be
>> the same and thus use the same domain list but when SNC is enabled
>> then monitoring uses a different domain list.
> 
> Answering this part first, because my choice here affects a bunch
> of the code that also raised comments from you.

Indeed.

> 
> The crux of the issue is that when SNC mode is enabled the scope
> for L3 monitoring functions changes to "node" scope, while the
> scope of L3 control functions (CAT, CDP) remains at L3 cache scope.
> 
> My solution was to just create a new resource. But you have an
> interesting alternate solution. Add an extra domain list to the
> resource structure to allow creation of distinct domain lists
> for this case where the scope for control and monitor functions
> differs.
> 
> So change the resource structure like this:
> 
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 8334eeacfec5..01590aa59a67 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -168,10 +168,12 @@ struct rdt_resource {
>  	bool			alloc_capable;
>  	bool			mon_capable;
>  	int			num_rmid;
> -	int			cache_level;
> +	int			ctrl_scope;
> +	int			mon_scope;

I am not sure about getting rid of cache_level so fast.
Regarding the current problem being solved I see that
ctrl_scope would have the same values as cache_level, but
I find that adding this level of indirection while keeping
the comparison with cacheinfo->level creates a trap
for future mistakes.


>  	struct resctrl_cache	cache;
>  	struct resctrl_membw	membw;
> -	struct list_head	domains;
> +	struct list_head	ctrl_domains;
> +	struct list_head	mon_domains;
>  	char			*name;
>  	int			data_width;
>  	u32			default_ctrl;
> 
> and build/use separate domain lists for when this resource is
> being referenced for allocation/monitoring. E.g. domain_add_cpu()
> would check "r->alloc_capable" and add a cpu to the ctrl_domains
> list based on the ctrl_scope value. It would do the same with
> mon_capable / mon_domains / mon_scope.
> 
> If ctrl_scope == mon_scope, just build one list as you suggest above.

Yes, this is the idea. Thank you for considering it. Something else
to consider that may make this even cleaner/simpler would be to review
struct rdt_domain and struct rdt_hw_domain members for "monitor" vs "control"
usage. These structs could potentially be split further into separate
"control" and "monitor" variants. For example, "struct rdt_domain" split into
"struct rdt_ctrl_domain" and "struct rdt_mon_domain". If there is a clean
split then resctrl can always create two lists with the unnecessary duplication
eliminated when two domain lists are created. This would also
eliminate the need to scatter ctrl_scope == mon_scope checks throughout. 
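
For illustration, the suggested split might look roughly like the sketch
below; only the shared header mirrors the discussion, and every field
beyond it (ctrl_val, mbm_total, array sizes) is a hypothetical
placeholder for whichever members the audit finds to be control-only or
monitor-only:

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Fields genuinely common to both kinds of domain. */
struct rdt_domain_hdr {
	struct list_head list;
	int id;
};

/* Allocation-only state; fields are illustrative. */
struct rdt_ctrl_domain {
	struct rdt_domain_hdr hdr;
	unsigned int ctrl_val[8];	/* per-CLOSID configuration */
};

/* Monitoring-only state; fields are illustrative. */
struct rdt_mon_domain {
	struct rdt_domain_hdr hdr;
	unsigned long long mbm_total[8];	/* per-RMID counters */
};
```

Keeping the header first in both structs lets generic list-walking code
operate on struct rdt_domain_hdr and lets each side recover its full
type from the header, container_of-style.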

> 
> Maybe there are more places that walk the list of control domains than
> walk the list of monitor domains. Need to audit this set:
> 
> $ git grep list_for_each.*domains -- arch/x86/kernel/cpu/resctrl
> arch/x86/kernel/cpu/resctrl/core.c:     list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/core.c:     list_for_each(l, &r->domains) {
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c:      list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c:      list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c:      list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/monitor.c:  list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c:              list_for_each_entry(d_i, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(dom, &r->domains, list)
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r_l->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c:         list_for_each_entry(dom, &r->domains, list)
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(dom, &r->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &s->res->domains, list) {
> arch/x86/kernel/cpu/resctrl/rdtgroup.c: list_for_each_entry(d, &r->domains, list) {
> 
> Maybe "domains" can keep its name and make a "list_for_each_monitor_domain()" macro
> to pick the right list to walk?

It is not clear to me how "domains" can keep its name. If I understand
correctly, the macro would be useful if scope always needs to be considered. I wonder
if the list walkers may not mostly just walk the appropriate list directly
if resctrl always creates separate "control domain" and "monitor domain"
lists.

> I don't think this will reduce the amount of code change in a
> significant way. But it may be conceptually easier to follow
> what is going on.

Reducing the amount of code changed is not a goal to me. If I understand
correctly I think that adapting resctrl to support different monitor and
control scope could create a foundation into which SNC can slot in smoothly. 

Reinette
  
Luck, Tony July 19, 2023, 12:11 a.m. UTC | #4
> Yes, this is the idea. Thank you for considering it. Something else
> to consider that may make this even cleaner/simpler would be to review
> struct rdt_domain and struct rdt_hw_domain members for "monitor" vs "control"
> usage. These structs could potentially be split further into separate
> "control" and "monitor" variants. For example, "struct rdt_domain" split into
> "struct rdt_ctrl_domain" and "struct rdt_mon_domain". If there is a clean
> split then resctrl can always create two lists with the unnecessary duplication
> eliminated when two domain lists are created. This would also
> eliminate the need to scatter ctrl_scope == mon_scope checks throughout.

You might like what I'm doing in the "resctrl2" re-write[1]. Arch independent code
that maintains the domain lists for a resource via a cpuhp notifier just has this
for the domain structure:

struct resctrl_domain {
        struct list_head        list;
        struct cpumask          cpu_mask;
        int                     id;
        int                     cache_size;
};

Each module managing a resource decides what extra information it wants to
carry in the domain. So the above structure is common to all, but it is followed
by whatever the resource module wants. E.g. the CBM masks for each CLOSid
for the CAT module. The module tells core code the size to allocate.
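
A userspace sketch of that "common header followed by module-private
data" layout, with plain calloc standing in for the kernel allocator;
struct resctrl_domain is trimmed to two fields from the struct quoted
above, and cat_priv, domain_alloc() and domain_priv() are hypothetical
names for illustration:

```c
#include <assert.h>
#include <stdlib.h>

/* Trimmed version of the common domain header shared by all modules. */
struct resctrl_domain {
	int id;
	int cache_size;
};

/* Example module-private tail, e.g. the CAT module's per-CLOSID CBMs. */
struct cat_priv {
	unsigned int cbm[4];	/* array size illustrative */
};

/* Core code allocates the header plus whatever size the module asked
 * for, in one contiguous block. */
static struct resctrl_domain *domain_alloc(size_t priv_size)
{
	return calloc(1, sizeof(struct resctrl_domain) + priv_size);
}

/* The module's data starts immediately after the common header. */
static void *domain_priv(struct resctrl_domain *d)
{
	return d + 1;
}
```

The core never needs to know the tail's type; each module casts the
result of domain_priv() to its own private struct.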

"cache_size" is only there because the cache topology bits needed to discover
sizes of caches aren't exported. Both the "size" file and pseudo-locking need
to know the size.

It's also possible that you may hate it. There is zero sharing of resource structures
even if they have the same scope. This is because all modules are independently
loadable.

-Tony

[1] WIP snapshot at git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git
branch resctrl2_v65rc1. That doesn't have pseudo-locking, but most of the rest
of existing resctrl functionality is there.
  
Luck, Tony July 20, 2023, 12:20 a.m. UTC | #5
Here's a quick hack to see how things might look with
separate domain lists in the "L3" resource.

For testing purposes on a non-SNC system I set ->mon_scope =
MON_SCOPE_NODE, but made domain_add_cpu() allocate the mondomains
list based on L3 scope ... just so I could check that I found all
the places where monitoring needs to use the mondomains list.
The kernel doesn't crash when running tools/testing/selftests/resctrl,
and the tests all pass. But that doesn't mean I didn't miss something.

Some restructuring of control vs. monitoring initialization might
avoid some of the code I duplicated in domain_add_cpu(). But this
is intended just as an "Is this what you meant?" before I dig deeper.

Overall, I think it is a cleaner approach than making a new
"L3" resource with a different scope just for SNC monitoring.

-Tony

---

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 8334eeacfec5..e4b653088a22 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -151,9 +151,11 @@ struct resctrl_schema;
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
  * @cache_level:	Which cache level defines scope of this resource
+ * @mon_scope:		Scope of this resource if different from cache_level
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
  * @domains:		All domains for this resource
+ * @mondomains:		Monitor domains for this resource (if mon_scope != 0)
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -169,9 +171,11 @@ struct rdt_resource {
 	bool			mon_capable;
 	int			num_rmid;
 	int			cache_level;
+	int			mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
 	struct list_head	domains;
+	struct list_head	mondomains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
@@ -184,6 +188,8 @@ struct rdt_resource {
 	bool			cdp_capable;
 };
 
+#define MON_SCOPE_NODE	1
+
 /**
  * struct resctrl_schema - configuration abilities of a resource presented to
  *			   user-space
@@ -217,8 +223,8 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_setup);
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_teardown);
 
 /**
  * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..c5e2ac2a60cf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -511,7 +511,7 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
 			     umode_t mask);
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
 				   struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..545d563ba956 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -57,7 +57,7 @@ static void
 mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
 	      struct rdt_resource *r);
 
-#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
+#define domain_init(id, field) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.field)
 
 struct rdt_hw_resource rdt_resources_all[] = {
 	[RDT_RESOURCE_L3] =
@@ -66,7 +66,9 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.rid			= RDT_RESOURCE_L3,
 			.name			= "L3",
 			.cache_level		= 3,
-			.domains		= domain_init(RDT_RESOURCE_L3),
+			.mon_scope		= MON_SCOPE_NODE, //FAKE
+			.domains		= domain_init(RDT_RESOURCE_L3, domains),
+			.mondomains		= domain_init(RDT_RESOURCE_L3, mondomains),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -80,7 +82,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.rid			= RDT_RESOURCE_L2,
 			.name			= "L2",
 			.cache_level		= 2,
-			.domains		= domain_init(RDT_RESOURCE_L2),
+			.domains		= domain_init(RDT_RESOURCE_L2, domains),
 			.parse_ctrlval		= parse_cbm,
 			.format_str		= "%d=%0*x",
 			.fflags			= RFTYPE_RES_CACHE,
@@ -94,7 +96,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.rid			= RDT_RESOURCE_MBA,
 			.name			= "MB",
 			.cache_level		= 3,
-			.domains		= domain_init(RDT_RESOURCE_MBA),
+			.domains		= domain_init(RDT_RESOURCE_MBA, domains),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -106,7 +108,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
 			.rid			= RDT_RESOURCE_SMBA,
 			.name			= "SMBA",
 			.cache_level		= 3,
-			.domains		= domain_init(RDT_RESOURCE_SMBA),
+			.domains		= domain_init(RDT_RESOURCE_SMBA, domains),
 			.parse_ctrlval		= parse_bw,
 			.format_str		= "%d=%*u",
 			.fflags			= RFTYPE_RES_MB,
@@ -384,14 +386,15 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Find a domain in a resource that matches input resource id
+ * rdt_find_domain - Find a domain in one of the lists for a resource that
+ * matches input resource id
  *
  * Search resource r's domain list to find the resource id. If the resource
  * id is found in a domain, return the domain. Otherwise, if requested by
  * caller, return the first domain whose id is bigger than the input id.
  * The domain list is sorted by id in ascending order.
  */
-struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+struct rdt_domain *rdt_find_domain(struct list_head *h, int id,
 				   struct list_head **pos)
 {
 	struct rdt_domain *d;
@@ -400,7 +403,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	if (id < 0)
 		return ERR_PTR(-ENODEV);
 
-	list_for_each(l, &r->domains) {
+	list_for_each(l, h) {
 		d = list_entry(l, struct rdt_domain, list);
 		/* When id is found, return its domain. */
 		if (id == d->id)
@@ -508,7 +511,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 	struct rdt_domain *d;
 	int err;
 
-	d = rdt_find_domain(r, id, &add_pos);
+	d = rdt_find_domain(&r->domains, id, &add_pos);
 	if (IS_ERR(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
 		return;
@@ -536,6 +539,44 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
+	if (!r->mon_scope && r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+		domain_free(hw_dom);
+		return;
+	}
+
+	list_add_tail(&d->list, add_pos);
+
+	err = resctrl_online_domain(r, d, r->mon_scope == 0);
+	if (err) {
+		list_del(&d->list);
+		domain_free(hw_dom);
+	}
+
+	if (r->mon_scope != MON_SCOPE_NODE)
+		return;
+
+	//id = cpu_to_node(cpu);
+	id = get_cpu_cacheinfo_id(cpu, r->cache_level); // FAKE
+	add_pos = NULL;
+	d = rdt_find_domain(&r->mondomains, id, &add_pos);
+	if (IS_ERR(d)) {
+		pr_warn("Couldn't find node id for CPU %d\n", cpu);
+		return;
+	}
+
+	if (d) {
+		cpumask_set_cpu(cpu, &d->cpu_mask);
+		return;
+	}
+
+	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!hw_dom)
+		return;
+
+	d = &hw_dom->d_resctrl;
+	d->id = id;
+	cpumask_set_cpu(cpu, &d->cpu_mask);
+
 	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		domain_free(hw_dom);
 		return;
@@ -543,7 +584,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 
 	list_add_tail(&d->list, add_pos);
 
-	err = resctrl_online_domain(r, d);
+	err = resctrl_online_domain(r, d, true);
 	if (err) {
 		list_del(&d->list);
 		domain_free(hw_dom);
@@ -556,7 +597,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain *d;
 
-	d = rdt_find_domain(r, id, NULL);
+	d = rdt_find_domain(&r->domains, id, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		pr_warn("Couldn't find cache id for CPU %d\n", cpu);
 		return;
@@ -565,7 +606,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 
 	cpumask_clear_cpu(cpu, &d->cpu_mask);
 	if (cpumask_empty(&d->cpu_mask)) {
-		resctrl_offline_domain(r, d);
+		resctrl_offline_domain(r, d, r->mon_scope == 0);
 		list_del(&d->list);
 
 		/*
@@ -579,7 +620,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 		return;
 	}
 
-	if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
+	if (r->mon_scope == 0 && r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
 		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
 			cancel_delayed_work(&d->mbm_over);
 			mbm_setup_overflow_handler(d, 0);
@@ -590,6 +631,23 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 			cqm_setup_limbo_handler(d, 0);
 		}
 	}
+
+	if (r->mon_scope != MON_SCOPE_NODE)
+		return;
+
+	id = cpu_to_node(cpu);
+	d = rdt_find_domain(&r->mondomains, id, NULL);
+	if (IS_ERR_OR_NULL(d)) {
+		pr_warn("Couldn't find node id for CPU %d\n", cpu);
+		return;
+	}
+
+	cpumask_clear_cpu(cpu, &d->cpu_mask);
+	if (cpumask_empty(&d->cpu_mask)) {
+		resctrl_offline_domain(r, d, true);
+		list_del(&d->list);
+		domain_free(hw_dom);
+	}
 }
 
 static void clear_closid_rmid(int cpu)
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b44c487727d4..80033cb698d0 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -545,6 +545,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	struct rdt_resource *r;
 	union mon_data_bits md;
 	struct rdt_domain *d;
+	struct list_head *h;
 	struct rmid_read rr;
 	int ret = 0;
 
@@ -560,7 +561,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	evtid = md.u.evtid;
 
 	r = &rdt_resources_all[resid].r_resctrl;
-	d = rdt_find_domain(r, domid, NULL);
+	h = r->mon_scope ? &r->mondomains : &r->domains;
+	d = rdt_find_domain(h, domid, NULL);
 	if (IS_ERR_OR_NULL(d)) {
 		ret = -ENOENT;
 		goto out;
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index ded1fc7cb7cb..08085202582a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -335,12 +335,14 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rdt_domain *d;
+	struct list_head *h;
 	int cpu, err;
 	u64 val = 0;
 
 	entry->busy = 0;
 	cpu = get_cpu();
-	list_for_each_entry(d, &r->domains, list) {
+	h = r->mon_scope ? &r->mondomains : &r->domains;
+	list_for_each_entry(d, h, list) {
 		if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
 			err = resctrl_arch_rmid_read(r, d, entry->rmid,
 						     QOS_L3_OCCUP_EVENT_ID,
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 725344048f85..fb5b23fcb6d4 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1492,11 +1492,13 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
 {
 	struct mon_config_info mon_info = {0};
 	struct rdt_domain *dom;
+	struct list_head *h;
 	bool sep = false;
 
 	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->domains, list) {
+	h = r->mon_scope ? &r->mondomains : &r->domains;
+	list_for_each_entry(dom, h, list) {
 		if (sep)
 			seq_puts(s, ";");
 
@@ -1599,6 +1601,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 	char *dom_str = NULL, *id_str;
 	unsigned long dom_id, val;
 	struct rdt_domain *d;
+	struct list_head *h;
 	int ret = 0;
 
 next:
@@ -1619,7 +1622,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
 		return -EINVAL;
 	}
 
-	list_for_each_entry(d, &r->domains, list) {
+	h = r->mon_scope ? &r->mondomains : &r->domains;
+	list_for_each_entry(d, h, list) {
 		if (d->id == dom_id) {
 			ret = mbm_config_write_domain(r, d, evtid, val);
 			if (ret)
@@ -2465,6 +2469,7 @@ static int rdt_get_tree(struct fs_context *fc)
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	struct rdt_domain *dom;
 	struct rdt_resource *r;
+	struct list_head *h;
 	int ret;
 
 	cpus_read_lock();
@@ -2525,7 +2530,8 @@ static int rdt_get_tree(struct fs_context *fc)
 
 	if (is_mbm_enabled()) {
 		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->domains, list)
+		h = r->mon_scope ? &r->mondomains : &r->domains;
+		list_for_each_entry(dom, h, list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
@@ -2917,9 +2923,11 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
 				       struct rdtgroup *prgrp)
 {
 	struct rdt_domain *dom;
+	struct list_head *h;
 	int ret;
 
-	list_for_each_entry(dom, &r->domains, list) {
+	h = r->mon_scope ? &r->mondomains : &r->domains;
+	list_for_each_entry(dom, h, list) {
 		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
 		if (ret)
 			return ret;
@@ -3708,14 +3716,14 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
 	kfree(d->mbm_local);
 }
 
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_teardown)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
 		mba_sc_domain_destroy(r, d);
 
-	if (!r->mon_capable)
+	if (!mon_teardown || !r->mon_capable)
 		return;
 
 	/*
@@ -3773,7 +3781,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 	return 0;
 }
 
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d, bool mon_setup)
 {
 	int err;
 
@@ -3783,7 +3791,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 		/* RDT_RESOURCE_MBA is never mon_capable */
 		return mba_sc_domain_allocate(r, d);
 
-	if (!r->mon_capable)
+	if (!mon_setup || !r->mon_capable)
 		return 0;
 
 	err = domain_setup_mon_state(r, d);
  
Reinette Chatre July 20, 2023, 5:27 p.m. UTC | #6
Hi Tony,

On 7/18/2023 5:11 PM, Luck, Tony wrote:
>> Yes, this is the idea. Thank you for considering it. Something else
>> to consider that may make this even cleaner/simpler would be to review
>> struct rdt_domain and struct rdt_hw_domain members for "monitor" vs "control"
>> usage. These structs could potentially be split further into separate
>> "control" and "monitor" variants. For example, "struct rdt_domain" split into
>> "struct rdt_ctrl_domain" and "struct rdt_mon_domain". If there is a clean
>> split then resctrl can always create two lists with the unnecessary duplication
>> eliminated when two domain lists are created. This would also
>> eliminate the need to scatter ctrl_scope == mon_scope checks throughout.
> 
> You might like what I'm doing in the "resctrl2" re-write[1]. Arch independent code
> that maintains the domain lists for a resource via a cpuhp notifier just has this
> for the domain structure:
> 
> struct resctrl_domain {
>         struct list_head        list;
>         struct cpumask          cpu_mask;
>         int                     id;
>         int                     cache_size;
> };
> 
> Each module managing a resource decides what extra information it wants to
> carry in the domain. So the above structure is common to all, but it is followed
> by whatever the resource module wants. E.g. the CBM masks for each CLOSid
> for the CAT module. The module tells core code the size to allocate.
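The layout described above — a common header with module-private data allocated right behind it, sized per the module's request — can be approximated in userspace like this (all names here are hypothetical, not actual resctrl2 code):

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace approximation of the common domain header; the kernel's
 * list_head and cpumask members are omitted for brevity. */
struct resctrl_domain {
	int id;
	int cache_size;
};

/* Hypothetical module-private data a CAT-like module might append:
 * one capacity bitmask per CLOSID. */
struct cat_domain_priv {
	unsigned int cbm[8];
};

/* Core code allocates the header plus whatever extra size the
 * resource module asked for, in one contiguous block. */
static struct resctrl_domain *domain_alloc(size_t priv_size, int id)
{
	struct resctrl_domain *d = calloc(1, sizeof(*d) + priv_size);

	if (d)
		d->id = id;
	return d;
}

/* The module's private area sits immediately after the header. */
static void *domain_priv(struct resctrl_domain *d)
{
	return d + 1;
}
```

Core code never needs to know the layout of the private area; it only carries the size through the allocation.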

hmmm ... what I am *hearing* you say is that the goodness from the
rewrite can be added to resctrl? :)

> "cache_size" is only there because the cache topology bits needed to discover
> sizes of caches aren't exported. Both the "size" file and pseudo-locking need
> to know the size.
> 
> It's also possible that you may hate it. There is zero sharing of resource structures
> even if they have the same scope. This is because all modules are independently
> loadable.

Apologies but I am still unable to understand the problem statement that
motivates the rewrite.

Reinette
  
Reinette Chatre July 20, 2023, 5:27 p.m. UTC | #7
Hi Tony,

On 7/19/2023 5:20 PM, Tony Luck wrote:
> Here's a quick hack to see how things might look with
> separate domain lists in the "L3" resource.
> 
> For testing purposes on a non-SNC system I set ->mon_scope =
> MON_SCOPE_NODE, but made domain_add_cpu() allocate the mondomains
> list based on L3 scope ... just so I could check that I found all
> the places where monitoring needs to use the mondomains list.
> The kernel doesn't crash when running tools/testing/selftests/resctrl,
> and the tests all pass. But that doesn't mean I didn't miss something.
> 
> Some restructuring of control vs. monitoring initialization might
> avoid some of the code I duplicated in domain_add_cpu(). But this
> is intended just as an "Is this what you meant?" before I dig deeper.

Thank you for considering the approach. I find that this sample moves
toward the idea while also highlighting what else can be considered.
I do not know if you already considered these ideas and found them
flawed, so I will try to make this explicit so that you can point out
to me where things will fall apart.

The sample code introduces a new list "mondomains" that is intended to
be used when the monitoring scope is different from the allocation scope.
This introduces duplication when the monitoring and allocation scope is
different. Each list, "domains" and "mondomains" will host structures
that can accommodate both monitoring and allocation data, with the data
not relevant to the list going unused as it is unnecessarily duplicated.
Additionally this forces significant portions of resctrl to now always
consider whether the monitoring and allocation scope is different ...
note how this sample now has code like below scattered throughout.
   h = r->mon_scope ? &r->mondomains : &r->domains;

I also find domain_add_cpu() becoming intricate as it needs to
navigate all the different scenarios.

This unnecessary duplication, new environment checks needed throughout,
and additional code complexities are red flags to me that this solution
is not well integrated.

To deal with these complexities I would like to consider if it may
make things simpler to always (irrespective of allocation and
monitoring scope) maintain allocation and monitoring domain lists.
Each list need only carry data appropriate to its use ... the allocation
list only has data relevant to allocation, the monitoring list only
has data relevant to monitoring. This is the struct rdt_domain related
split I mentioned previously.
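Such a split might look roughly like this (a hedged userspace sketch; the member names are illustrative, not the actual kernel fields, and the list helpers stand in for the kernel's list.h API):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Hypothetical split: each variant carries only the data relevant
 * to its use, so nothing is duplicated between the two lists. */
struct rdt_ctrl_domain {
	struct list_head list;
	int id;
	unsigned int *ctrl_val;		/* allocation state only */
};

struct rdt_mon_domain {
	struct list_head list;
	int id;
	unsigned long *mbm_total;	/* monitoring state only */
};

/* A resource then always carries both lists, each populated
 * independently according to its own scope. */
struct rdt_resource_sketch {
	struct list_head ctrl_domains;
	struct list_head mon_domains;
};

static int list_len(struct list_head *h)
{
	int n = 0;

	for (struct list_head *p = h->next; p != h; p = p->next)
		n++;
	return n;
}
```

On an SNC system with two nodes per L3 the monitor list would simply hold more entries than the control list, with no scope checks needed at the use sites.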

Code could become something like:

resctrl_online_cpu()
{
	...
	for_each_alloc_capable_rdt_resource(r)
		alloc_domain_add_cpu(...)
	for_each_mon_capable_rdt_resource(r)
		mon_domain_add_cpu(...)
	...
}

This would reduce complication in domain_add_cpu() since each domain list
need only concern itself with monitoring or allocation.

Even resctrl_online_domain() can be simplified significantly by
making it specific to allocation or monitoring. For example,
resctrl_online_mon_domain() would only and always just run
the monitoring related code.

With the separate allocation and monitoring domain lists there
may no longer be a need for scattering code with checks like:
   h = r->mon_scope ? &r->mondomains : &r->domains;
This would be because the code can directly pick the domain
list it is operating on.

What do you think? The above is just refactoring of existing
code, and from what I can tell this would make supporting
SNC straightforward.

Reinette
  
Luck, Tony July 20, 2023, 9:56 p.m. UTC | #8
> To deal with these complexities I would like to consider if it may
> make things simpler to always (irrespective of allocation and
> monitoring scope) maintain allocation and monitoring domain lists.
> Each list need only carry data appropriate to its use ... the allocation
> list only has data relevant to allocation, the monitoring list only
> has data relevant to monitoring. This is the struct rdt_domain related
> split I mentioned previously.
>
> Code could become something like:

resctrl_online_cpu()
{
	...
	for_each_alloc_capable_rdt_resource(r)
		alloc_domain_add_cpu(...)
	for_each_mon_capable_rdt_resource(r)
		mon_domain_add_cpu(...)
	...
}

> This would reduce complication in domain_add_cpu() since each domain list
> need only concern itself with monitoring or allocation.

This does seem a worthy target.

I started on a patch to do this ... but I'm not sure I have the stamina or the time
to see it through.

I split struct rdt_domain into rdt_ctrl_domain and rdt_mon_domain. But that
led to also splitting the rdt_hw_domain structure into two, and then splitting
the resctrl_to_arch_dom() function, and then another and another.

That process will eventually converge (there are a finite number of lines
of code) .... but it will be a big patch. I don't see how to stage it a piece
at a time.

-Tony
  
Luck, Tony July 22, 2023, 7:18 p.m. UTC | #9
On Thu, Jul 20, 2023 at 09:56:50PM +0000, Luck, Tony wrote:
> This does seem a worthy target.
> 
> I started on a patch to do this ... but I'm not sure I have the stamina or the time
> to see it through.

I was being a wuss. I came back at this on Friday from a slightly
different perspective, and it all came together fairly easily.

New series posted here:
	https://lore.kernel.org/all/20230722190740.326190-1-tony.luck@intel.com/

-Tony
  

Patch

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 8275b8a74f7e..243017096ddf 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -435,6 +435,7 @@  enum resctrl_res_level {
 	RDT_RESOURCE_L2,
 	RDT_RESOURCE_MBA,
 	RDT_RESOURCE_SMBA,
+	RDT_RESOURCE_NODE,
 
 	/* Must be the last */
 	RDT_NUM_RESOURCES,
@@ -442,7 +443,8 @@  enum resctrl_res_level {
 
 enum resctrl_scope {
 	SCOPE_L2_CACHE = 2,
-	SCOPE_L3_CACHE = 3
+	SCOPE_L3_CACHE = 3,
+	SCOPE_NODE,
 };
 
 static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 6571514752f3..e4bd3072927c 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -112,6 +112,16 @@  struct rdt_hw_resource rdt_resources_all[] = {
 			.fflags			= RFTYPE_RES_MB,
 		},
 	},
+	[RDT_RESOURCE_NODE] =
+	{
+		.r_resctrl = {
+			.rid			= RDT_RESOURCE_NODE,
+			.name			= "L3",
+			.scope			= SCOPE_NODE,
+			.domains		= domain_init(RDT_RESOURCE_NODE),
+			.fflags			= 0,
+		},
+	},
 };
 
 /*
@@ -489,6 +499,8 @@  static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 
 static int get_domain_id(int cpu, enum resctrl_scope scope)
 {
+	if (scope == SCOPE_NODE)
+		return cpu_to_node(cpu);
 	return get_cpu_cacheinfo_id(cpu, scope);
 }