[v15-RFC,0/8] Add support for Sub-NUMA cluster (SNC) systems

Message ID 20240130222034.37181-1-tony.luck@intel.com
Series: Add support for Sub-NUMA cluster (SNC) systems

Message

Luck, Tony Jan. 30, 2024, 10:20 p.m. UTC
  This is the re-worked version of this series that I promised to post
yesterday. Check that e-mail for the arguments for this alternate
approach.

https://lore.kernel.org/all/ZbhLRDvZrxBZDv2j@agluck-desk3/

Apologies to Drew Fustini who I'd somehow dropped from later versions
of this series. Drew: you had made a comment at one point that having
different scopes within a single resource may be useful on RISC-V.
Version 14 included that, but it's gone here. Maybe multiple resctrl
"struct resource" for a single h/w entity like L3 as I'm doing in this
version could work for you too?

Patches 1-5 are almost completely rewritten based around the new
idea to give CMT and MBM their own "resource" instead of sharing
one with L3 CAT. This removes the need for separate domain lists,
and thus most of the churn of the previous version of this series.
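
As a rough sketch of the idea (a standalone userspace illustration with
made-up field names, not the actual kernel structures from the patches):

  /* Illustrative model: allocation and monitoring as separate resources. */
  #include <stdbool.h>
  #include <stdio.h>

  struct model_resource {
          const char *name;
          bool alloc_capable;   /* provides schemata entries (CAT)       */
          bool mon_capable;     /* provides mon_data files (CMT/MBM)     */
          const char *scope;    /* what one domain of this resource spans */
  };

  /*
   * Before: a single L3 resource is both alloc and mon capable, always at
   * L3-cache scope.  After: allocation keeps L3-cache scope, while
   * monitoring lives in its own resource whose scope can shrink to a
   * NUMA node when SNC is enabled.
   */
  static struct model_resource resources[] = {
          { "L3",     true,  false, "L3 cache" },
          { "L3_MON", false, true,  "L3 cache, or NUMA node with SNC" },
  };

  int main(void)
  {
          for (unsigned int i = 0; i < 2; i++)
                  printf("%-7s alloc=%d mon=%d scope=%s\n",
                         resources[i].name, resources[i].alloc_capable,
                         resources[i].mon_capable, resources[i].scope);
          return 0;
  }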

Patches 6-8 are largely unchanged, but I removed all the Reviewed-by
and Tested-by tags since they are now built on a completely different
base.

Patches are against tip x86/cache:

  fc747eebef73 ("x86/resctrl: Remove redundant variable in mbm_config_write_domain()")

The re-work doesn't affect the v14 cover letter, so I'm pasting it here:

The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
patch series, Intel's advice has been that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters
in the same way. This allows the monitoring features to be used, with
the caveat that Linux may migrate tasks more frequently between SNC
nodes than between "regular" NUMA nodes, so reading counters from all
SNC nodes may be needed to get a complete picture of activity for tasks.
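
For example, with one mon_L3_* directory per SNC node (directory names
illustrative, they depend on the domain ids on a given system), the
total memory bandwidth count for a group is the sum across them:

  # cat /sys/fs/resctrl/mon_data/mon_L3_*/mbm_total_bytes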

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <tony.luck@intel.com>

Tony Luck (8):
  x86/resctrl: Split the RDT_RESOURCE_L3 resource
  x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON
  x86/resctrl: Prepare for non-cache-scoped resources
  x86/resctrl: Add helper function to look up domain_id from scope
  x86/resctrl: Add "NODE" as an option for resource scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  25 ++-
 include/linux/resctrl.h                   |  10 +-
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |   3 +
 arch/x86/kernel/cpu/resctrl/core.c        | 181 +++++++++++++++++++++-
 arch/x86/kernel/cpu/resctrl/monitor.c     |  28 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  12 +-
 8 files changed, 236 insertions(+), 30 deletions(-)


base-commit: fc747eebef734563cf68a512f57937c8f231834a
  

Comments

Reinette Chatre Feb. 1, 2024, 6:46 p.m. UTC | #1
Hi Tony,

On 1/30/2024 2:20 PM, Tony Luck wrote:
> This is the re-worked version of this series that I promised to post
> yesterday. Check that e-mail for the arguments for this alternate
> approach.
> 
> https://lore.kernel.org/all/ZbhLRDvZrxBZDv2j@agluck-desk3/
> 
> Apologies to Drew Fustini who I'd somehow dropped from later versions
> of this series. Drew: you had made a comment at one point that having
> different scopes within a single resource may be useful on RISC-V.
> Version 14 included that, but it's gone here. Maybe multiple resctrl
> "struct resource" for a single h/w entity like L3 as I'm doing in this
> version could work for you too?
> 
> Patches 1-5 are almost completely rewritten based around the new
> idea to give CMT and MBM their own "resource" instead of sharing
> one with L3 CAT. This removes the need for separate domain lists,
> and thus most of the churn of the previous version of this series.

I do not see it as removing the need for separate domain lists but
rather as keeping the idea of separate domain lists, in this case
duplicating the resource in order to host the second domain
list. This solution also keeps the same structures for control and
monitoring that the previous version cleaned up [1].
To me this thus seems like a similar solution to v14 but with
additional duplication due to an extra resource and without the cleanup.

Reinette

[1] https://lore.kernel.org/lkml/20240126223837.21835-5-tony.luck@intel.com/
  
Drew Fustini Feb. 8, 2024, 4:17 a.m. UTC | #2
On Tue, Jan 30, 2024 at 02:20:26PM -0800, Tony Luck wrote:
> This is the re-worked version of this series that I promised to post
> yesterday. Check that e-mail for the arguments for this alternate
> approach.
> 
> https://lore.kernel.org/all/ZbhLRDvZrxBZDv2j@agluck-desk3/
> 
> Apologies to Drew Fustini who I'd somehow dropped from later versions
> of this series. Drew: you had made a comment at one point that having
> different scopes within a single resource may be useful on RISC-V.
> Version 14 included that, but it's gone here. Maybe multiple resctrl
> "struct resource" for a single h/w entity like L3 as I'm doing in this
> version could work for you too?

Sorry for the latency.

The RISC-V CBQRI specification [1] describes a bandwidth controller
register interface [2]. It allows a controller to implement both
bandwidth allocation and bandwidth usage monitoring.

The proof-of-concept resctrl implementation [3] that I worked on created
two domains for each memory controller in the example SoC. One domain
would contain the MBA resource and the other would contain the L3
resource to represent MBM files like local_bytes:

  # cat /sys/fs/resctrl/schemata
  MB:4=  80;6=  80;8=  80
  L2:0=0fff;1=0fff
  L3:2=ffff;3=0000;5=0000;7=0000

Where:

  Domain 0 is L2 cache controller 0 capacity allocation
  Domain 1 is L2 cache controller 1 capacity allocation
  Domain 2 is L3 cache controller capacity allocation

  Domain 4 is Memory controller 0 bandwidth allocation
  Domain 6 is Memory controller 1 bandwidth allocation
  Domain 8 is Memory controller 2 bandwidth allocation

  Domain 3 is Memory controller 0 bandwidth monitoring
  Domain 5 is Memory controller 1 bandwidth monitoring
  Domain 7 is Memory controller 2 bandwidth monitoring

I think this scheme is confusing but I wasn't able to find a better
way to do it at the time.

> Patches 1-5 are almost completely rewritten based around the new
> idea to give CMT and MBM their own "resource" instead of sharing
> one with L3 CAT. This removes the need for separate domain lists,
> and thus most of the churn of the previous version of this series.

Very interesting. Do you think I would be able to create MBM files for
each memory controller without creating pointless L3 domains that show
up in schemata?

Thanks,
Drew

[1] https://github.com/riscv-non-isa/riscv-cbqri/releases/tag/v1.0-rc1
[2] https://github.com/riscv-non-isa/riscv-cbqri/blob/main/qos_bandwidth.adoc
[3] https://lore.kernel.org/linux-riscv/20230419111111.477118-1-dfustini@baylibre.com/
  
Luck, Tony Feb. 8, 2024, 7:26 p.m. UTC | #3
> > Patches 1-5 are almost completely rewritten based around the new
> > idea to give CMT and MBM their own "resource" instead of sharing
> > one with L3 CAT. This removes the need for separate domain lists,
> > and thus most of the churn of the previous version of this series.
>
> Very interesting. Do you think I would be able to create MBM files for
> each memory controller without creating pointless L3 domains that show
> up in schemata?

Entries only show up in the schemata file for resources that are "alloc_capable".
So you should be able to make as many rdt_hw_resource structures as you
need that are "mon_capable", but not "alloc_capable" ... though making more
than one such resource may explore untested areas of the code since there
has historically only been one mon_capable resource. It looks like the resource
id from the "rid" field is passed through to the "show" functions for MBM and CQM.
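
i.e. in your example, if the per-memory-controller monitors were hosted
by a resource that is only mon_capable, then (if I have this right) the
schemata file would be left with just the allocation lines, something
like this (domain numbering illustrative):

  # cat /sys/fs/resctrl/schemata
  MB:0=  80;1=  80;2=  80
  L2:0=0fff;1=0fff
  L3:0=ffff

while the monitoring domains would only show up under mon_data.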

This patch series splits the one resource that is marked as both mon_capable
and alloc_capable into two. Maybe that's a useful cleanup, but maybe not a
requirement for what you need.

-Tony
  
Moger, Babu Feb. 9, 2024, 3:27 p.m. UTC | #4
Hi Tony,

On 1/30/24 16:20, Tony Luck wrote:
> This is the re-worked version of this series that I promised to post
> yesterday. Check that e-mail for the arguments for this alternate
> approach.

To be honest, I like this series more than the previous series. I always
thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.

You need to separate the domain lists for RDT_RESOURCE_L3 and
RDT_RESOURCE_L3_MON if you are going this route. I didn't see that in this
series. Also, I have a few other comments as well.

Thanks
Babu
  
Luck, Tony Feb. 9, 2024, 6:31 p.m. UTC | #5
On Fri, Feb 09, 2024 at 09:27:56AM -0600, Moger, Babu wrote:
> Hi Tony,
> 
> On 1/30/24 16:20, Tony Luck wrote:
> > This is the re-worked version of this series that I promised to post
> > yesterday. Check that e-mail for the arguments for this alternate
> > approach.
> 
> To be honest, I like this series more than the previous series. I always
> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
> 
> You need to separate the domain lists for RDT_RESOURCE_L3 and
> RDT_RESOURCE_L3_MON if you are going this route. I didn't see that in this
> series. Also, I have a few other comments as well.

They are separated. Each "struct rdt_resource" has its own domain list.

Or do you mean break up the struct rdt_domain into the control and
monitor versions as was done in the previous series?
> 
> Thanks
> Babu
>
  
Moger, Babu Feb. 9, 2024, 7:38 p.m. UTC | #6
On 2/9/24 12:31, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:27:56AM -0600, Moger, Babu wrote:
>> Hi Tony,
>>
>> On 1/30/24 16:20, Tony Luck wrote:
>>> This is the re-worked version of this series that I promised to post
>>> yesterday. Check that e-mail for the arguments for this alternate
>>> approach.
>>
>> To be honest, I like this series more than the previous series. I always
>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>>
>> You need to separate the domain lists for RDT_RESOURCE_L3 and
>> RDT_RESOURCE_L3_MON if you are going this route. I didn't see that in this
>> series. Also, I have a few other comments as well.
> 
> They are separated. Each "struct rdt_resource" has its own domain list.

Yea. You are right.
> 
> Or do you mean break up the struct rdt_domain into the control and
> monitor versions as was done in the previous series?

No. Not required. Each resource has its own domain list. So, it is
separated already as far as I can see.

Reinette seems to have some concerns about this series. But I am fine with
both of these approaches. I feel this is the cleaner approach.
  
Reinette Chatre Feb. 9, 2024, 9:26 p.m. UTC | #7
On 2/9/2024 11:38 AM, Moger, Babu wrote:
> On 2/9/24 12:31, Tony Luck wrote:
>> On Fri, Feb 09, 2024 at 09:27:56AM -0600, Moger, Babu wrote:
>>> On 1/30/24 16:20, Tony Luck wrote:
> 
> Reinette seems to have some concerns about this series. But I am fine with
> both of these approaches. I feel this is the cleaner approach.

I questioned the motivation but never received a response. 

Reinette
  
Luck, Tony Feb. 9, 2024, 9:36 p.m. UTC | #8
>> Reinette seems to have some concerns about this series. But I am fine with
>> both of these approaches. I feel this is the cleaner approach.
>
> I questioned the motivation but never received a response. 

Reinette,

Sorry. My motivation was to reduce the amount of code churn that
was done by the previous incarnation.

   9 files changed, 629 insertions(+), 282 deletions(-)

Vast amounts of that just added "_mon" or "_ctrl" to structure
or variable names.

-Tony
  
Reinette Chatre Feb. 9, 2024, 10:16 p.m. UTC | #9
Hi Tony,

On 2/9/2024 1:36 PM, Luck, Tony wrote:
>>> Reinette seems to have some concerns about this series. But I am fine with
>>> both of these approaches. I feel this is the cleaner approach.
>>
>> I questioned the motivation but never received a response. 
> 
> Reinette,
> 
> Sorry. My motivation was to reduce the amount of code churn that
> was done by the previous incarnation.
> 
>    9 files changed, 629 insertions(+), 282 deletions(-)
> 
> Vast amounts of that just added "_mon" or "_ctrl" to structure
> or variable names.

I actually had specific points that this response also ignores. 
Let me repeat and highlight the same points:

1) You claim that this series "removes the need for separate domain
   lists" ... but then this series does just that (create a separate
   domain list), but in an obfuscated way (duplicate the resource to
   have the monitoring domain list in there).
2) You claim this series "reduces amount of code churn", but this is
   because this series keeps using the same original data structures
   for separate monitoring and control usages. The previous series made 
   an effort to separate the structures for the different usages
   but this series does not. What makes it ok in this series to
   use the same data structures for different usages? 

Additionally:

Regarding "Vast amounts of that just added "_mon" or "_ctrl" to structure
or variable names." ... that is because the structures are actually split,
no? It is not just renaming for unnecessary churn.

What is the benefit of keeping the data structures to be shared
between monitor and control usages?
If there is a benefit to keeping these data structures, why not just
address this aspect in previous solution?

Reinette
  
Luck, Tony Feb. 9, 2024, 11:44 p.m. UTC | #10
> I actually had specific points that this response also ignores.
> Let me repeat and highlight the same points:
>
> 1) You claim that this series "removes the need for separate domain
>    lists" ... but then this series does just that (create a separate
>    domain list), but in an obfuscated way (duplicate the resource to
>    have the monitoring domain list in there).

That was poorly worded on my part. I should have said "removes the
need for separate domain lists within a single rdt_resource".

Adding an extra domain list to a resource may be the start of a slippery
slope. What if there is some additional "L3"-like resctrl operation that
acts at the socket level (Intel has made products with multiple L3
instances per socket before). Would you be OK adding a third domain
list to every struct rdt_resource to handle this? Or would it be simpler
to just add a new rdt_resource structure with socket scoped domains?

> 2) You claim this series "reduces amount of code churn", but this is
>    because this series keeps using the same original data structures
>    for separate monitoring and control usages. The previous series made
>    an effort to separate the structures for the different usages
>    but this series does not. What makes it ok in this series to
>    use the same data structures for different usages?

Legacy resctrl has been using the same rdt_domain structure for both
usages since the dawn of time. So it has been OK up until now.

> Additionally:
>
> Regarding "Vast amounts of that just added "_mon" or "_ctrl" to structure
> or variable names." ... that is because the structures are actually split,
> no? It is not just renaming for unnecessary churn.

Perhaps not "unnecessary" churn. But certainly a lot of code change for
what I perceive as very little real gain. 

> What is the benefit of keeping the data structures to be shared
> between monitor and control usages?

Benefit is no code changes. Cost is continuing to waste memory with
structures that are slightly bigger than they need to be.

> If there is a benefit to keeping these data structures, why not just
> address this aspect in previous solution?

The previous solution evolved to splitting these structures. But this
happened incrementally (remember that at an early stage the monitor
structures all got the "_mon" addition to their names, but the control
structures kept the original names). Only when I got to the end of this
process did I look at the magnitude of the change.

-Tony
  
Reinette Chatre Feb. 10, 2024, 12:28 a.m. UTC | #11
Hi Tony,

On 2/9/2024 3:44 PM, Luck, Tony wrote:
>> I actually had specific points that this response also ignores.
>> Let me repeat and highlight the same points:
>>
>> 1) You claim that this series "removes the need for separate domain
>>    lists" ... but then this series does just that (create a separate
>>    domain list), but in an obfuscated way (duplicate the resource to
>>    have the monitoring domain list in there).
> 
> That was poorly worded on my part. I should have said "removes the
> need for separate domain lists within a single rdt_resource".
> 
> Adding an extra domain list to a resource may be the start of a slippery
> slope. What if there is some additional "L3"-like resctrl operation that
> acts at the socket level (Intel has made products with multiple L3
> instances per socket before). Would you be OK adding a third domain
> list to every struct rdt_resource to handle this? Or would it be simpler
> to just add a new rdt_resource structure with socket scoped domains?

This should not be about what is simplest to patch into current resctrl.

There is no need to support a new domain list for a new scope. The domain
lists support the functionality: control or monitoring. If control has
socket scope the existing implementation supports that.
If there is another operation supported by a resource apart from 
control or monitoring then we can consider how to support it when
we know what it is. That would also be a great point to decide if
the same data structure should just grow to support an operation that
not all resources may support. That may depend on the amount of data
needed to support this hypothetical operation.

> 
>> 2) You claim this series "reduces amount of code churn", but this is
>>    because this series keeps using the same original data structures
>>    for separate monitoring and control usages. The previous series made
>>    an effort to separate the structures for the different usages
>>    but this series does not. What makes it ok in this series to
>>    use the same data structures for different usages?
> 
> Legacy resctrl has been using the same rdt_domain structure for both
> usages since the dawn of time. So it has been OK up until now.

This is not the same.

Legacy resctrl uses the same data structure in the same list for both control
and monitoring usages so it is fine to have both monitoring and control data
in the data structure.

What you are doing in both solutions is to place the same data structure
in separate lists for control and monitoring usages. In the one list only the
control data is used, on the other only the monitoring data is used.

>> Additionally:
>>
>> Regarding "Vast amounts of that just added "_mon" or "_ctrl" to structure
>> or variable names." ... that is because the structures are actually split,
>> no? It is not just renaming for unnecessary churn.
> 
> Perhaps not "unnecessary" churn. But certainly a lot of code change for
> what I perceive as very little real gain. 

ok. There may be little gain wrt saving space. One complication with
this single data structure is that its content may only be decided based
on which list it is part of. It should be obvious to developers which
members are valid at any given time. Perhaps this can be addressed with clear
documentation of the data structures.

> 
>> What is the benefit of keeping the data structures to be shared
>> between monitor and control usages?
> 
> Benefit is no code changes. Cost is continuing to waste memory with
> structures that are slightly bigger than they need to be.
> 
>> If there is a benefit to keeping these data structures, why not just
>> address this aspect in previous solution?
> 
> The previous solution evolved to splitting these structures. But this
> happened incrementally (remember that at an early stage the monitor
> structures all got the "_mon" addition to their names, but the control
> structures kept the original names). Only when I got to the end of this
> process did I look at the magnitude of the change.

Not answering my question. 

Reinette
  
Luck, Tony Feb. 12, 2024, 4:52 p.m. UTC | #12
> >> I actually had specific points that this response also ignores.
> >> Let me repeat and highlight the same points:
> >>
> >> 1) You claim that this series "removes the need for separate domain
> >>    lists" ... but then this series does just that (create a separate
> >>    domain list), but in an obfuscated way (duplicate the resource to
> >>    have the monitoring domain list in there).
> >
> > That was poorly worded on my part. I should have said "removes the
> > need for separate domain lists within a single rdt_resource".
> >
> > Adding an extra domain list to a resource may be the start of a slippery
> > slope. What if there is some additional "L3"-like resctrl operation that
> > acts at the socket level (Intel has made products with multiple L3
> > instances per socket before). Would you be OK adding a third domain
> > list to every struct rdt_resource to handle this? Or would it be simpler
> > to just add a new rdt_resource structure with socket scoped domains?
>
> This should not be about what is simplest to patch into current resctrl.

I wanted to offer this in case Boris also thought that the previous version
was too much churn to support an obscure Intel-only (so far) feature.

But if you are going to Nack this new version on the grounds that it muddies
the water about usage of the rdt_domain structure, then I will abandon it.

> There is no need to support a new domain list for a new scope. The domain
> lists support the functionality: control or monitoring. If control has
> socket scope the existing implementation supports that.
> If there is another operation supported by a resource apart from
> control or monitoring then we can consider how to support it when
> we know what it is. That would also be a great point to decide if
> the same data structure should just grow to support an operation that
> not all resources may support. That may depend on the amount of data
> needed to support this hypothetical operation.
>
> >
> >> 2) You claim this series "reduces amount of code churn", but this is
> >>    because this series keeps using the same original data structures
> >>    for separate monitoring and control usages. The previous series made
> >>    an effort to separate the structures for the different usages
> >>    but this series does not. What makes it ok in this series to
> >>    use the same data structures for different usages?
> >
> > Legacy resctrl has been using the same rdt_domain structure for both
> > usages since the dawn of time. So it has been OK up until now.
>
> This is not the same.
>
> Legacy resctrl uses the same data structure in the same list for both control
> and monitoring usages so it is fine to have both monitoring and control data
> in the data structure.
>
> What you are doing in both solutions is to place the same data structure
> in separate lists for control and monitoring usages. In the one list only the
> control data is used, on the other only the monitoring data is used.
>
> >> Additionally:
> >>
> >> Regarding "Vast amounts of that just added "_mon" or "_ctrl" to structure
> >> or variable names." ... that is because the structures are actually split,
> >> no? It is not just renaming for unnecessary churn.
> >
> > Perhaps not "unnecessary" churn. But certainly a lot of code change for
> > what I perceive as very little real gain.
>
> ok. There may be little gain wrt saving space. One complication with
> this single data structure is that its content may only be decided based
> on which list it is part of. It should be obvious to developers which
> members are valid at any given time. Perhaps this can be addressed with clear
> documentation of the data structures.
>
> >
> >> What is the benefit of keeping the data structures to be shared
> >> between monitor and control usages?
> >
> > Benefit is no code changes. Cost is continuing to waste memory with
> > structures that are slightly bigger than they need to be.
> >
> >> If there is a benefit to keeping these data structures, why not just
> >> address this aspect in previous solution?
> >
> > The previous solution evolved to splitting these structures. But this
> > happened incrementally (remember that at an early stage the monitor
> > structures all got the "_mon" addition to their names, but the control
> > structures kept the original names). Only when I got to the end of this
> > process did I look at the magnitude of the change.
>
> Not answering my question.

I'm not exactly sure what "aspect" you thought could be addressed in the
previous series. But the point is moot now. This diversion from the
series has come to a dead end, and I hope that Boris will look at v14
(either before the next group of ARM patches, or after).

-Tony
  
Reinette Chatre Feb. 12, 2024, 7:44 p.m. UTC | #13
Hi Babu,

On 2/9/2024 7:27 AM, Moger, Babu wrote:

> To be honest, I like this series more than the previous series. I always
> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.

Would you prefer that your "Reviewed-by" tag be removed from the
previous series?

Reinette
  
Luck, Tony Feb. 12, 2024, 7:57 p.m. UTC | #14
>> To be honest, I like this series more than the previous series. I always
>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>
> Would you prefer that your "Reviewed-by" tag be removed from the
> previous series?

I'm thinking that I could continue splitting things and break "struct rdt_resource" into
separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.

Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.

Features found on a system would be added to a list of ctrl or list of mon resources.
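
Roughly like this (just a userspace sketch of the shape, with invented
type names rather than anything from the current code):

  /* Hypothetical split of resources into separate ctrl and mon lists. */
  #include <stdio.h>

  struct ctrl_resource { const char *name; struct ctrl_resource *next; };
  struct mon_resource  { const char *name; struct mon_resource *next; };

  static struct ctrl_resource *ctrl_list;   /* alloc features found   */
  static struct mon_resource  *mon_list;    /* monitor features found */

  /* One declaration per feature instead of slots in rdt_resources_all[]. */
  static struct ctrl_resource l3_cat = { "L3" };
  static struct ctrl_resource mba    = { "MB" };
  static struct mon_resource  l3_mon = { "L3_MON" };

  int main(void)
  {
          /* Features detected on this system get added to the right list. */
          l3_cat.next = ctrl_list; ctrl_list = &l3_cat;
          mba.next    = ctrl_list; ctrl_list = &mba;
          l3_mon.next = mon_list;  mon_list  = &l3_mon;

          for (struct ctrl_resource *r = ctrl_list; r; r = r->next)
                  printf("ctrl resource: %s\n", r->name);
          for (struct mon_resource *m = mon_list; m; m = m->next)
                  printf("mon resource:  %s\n", m->name);
          return 0;
  }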

-Tony
  
Moger, Babu Feb. 12, 2024, 8:46 p.m. UTC | #15
On 2/12/24 13:44, Reinette Chatre wrote:
> Hi Babu,
> 
> On 2/9/2024 7:27 AM, Moger, Babu wrote:
> 
>> To be honest, I like this series more than the previous series. I always
>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
> 
> Would you prefer that your "Reviewed-by" tag be removed from the
> previous series?
> 
Sure. I plan to review the new series again when Tony submits v16.
  
Reinette Chatre Feb. 12, 2024, 9:43 p.m. UTC | #16
Hi Tony,

On 2/12/2024 11:57 AM, Luck, Tony wrote:
>>> To be honest, I like this series more than the previous series. I always
>>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>>
>> Would you prefer that your "Reviewed-by" tag be removed from the
>> previous series?
> 
> I'm thinking that I could continue splitting things and break "struct rdt_resource" into
> separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.

It is not obvious what you mean with "continue splitting things". Are you
speaking about "continue splitting from v14" or "continue splitting from v15-RFC"?

I think that any solution needs to consider what makes sense for resctrl
as a whole instead of how to support SNC with smallest patch possible.

There should not be any changes that make resctrl harder to understand
and maintain, as exemplified by the confusion introduced by something as
simple as resource name choice [1].

> 
> Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
> rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.
>
> Features found on a system would be added to a list of ctrl or list of mon resources.

Could you please elaborate what is architecturally wrong with v14 and how this
new proposal addresses that?

Reinette

[1] https://lore.kernel.org/lkml/ZcZyqs5hnQqZ5ZV0@agluck-desk3/
  
Luck, Tony Feb. 12, 2024, 10:05 p.m. UTC | #17
On Mon, Feb 12, 2024 at 01:43:56PM -0800, Reinette Chatre wrote:
> Hi Tony,
> 
> On 2/12/2024 11:57 AM, Luck, Tony wrote:
> >>> To be honest, I like this series more than the previous series. I always
> >>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
> >>
> >> Would you prefer that your "Reviewed-by" tag be removed from the
> >> previous series?
> > 
> > I'm thinking that I could continue splitting things and break "struct rdt_resource" into
> > separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.
> 
> It is not obvious what you mean with "continue splitting things". Are you
> speaking about "continue splitting from v14" or "continue splitting from v15-RFC"?

I'm speaking of some future potential changes. Not proposing to
do this now.

> I think that any solution needs to consider what makes sense for resctrl
> as a whole instead of how to support SNC with smallest patch possible.

I am officially abandoning my v15-RFC patches. I wasn't clear enough in
my e-mail earlier today.

https://lore.kernel.org/all/SJ1PR11MB608378D1304224D9E8A9016FFC482@SJ1PR11MB6083.namprd11.prod.outlook.com/
> 
> There should not be any changes that make resctrl harder to understand
> and maintain, as exemplified by the confusion introduced by something as
> simple as resource name choice [1].
> 
> > 
> > Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
> > rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.
> >
> > Features found on a system would be added to a list of ctrl or list of mon resources.
> 
> Could you please elaborate what is architecturally wrong with v14 and how this
> new proposal addresses that?

There is nothing architecturally wrong with v14. I thought it was more
complex than it needed to be. You have convinced me that my v15-RFC
series, while simpler, is not a reasonable path for long-term resctrl
maintainability.
> 
> Reinette
> 
> [1] https://lore.kernel.org/lkml/ZcZyqs5hnQqZ5ZV0@agluck-desk3/

-Tony
  
James Morse Feb. 13, 2024, 6:11 p.m. UTC | #18
Hello,

On 12/02/2024 22:05, Tony Luck wrote:
> On Mon, Feb 12, 2024 at 01:43:56PM -0800, Reinette Chatre wrote:
>> On 2/12/2024 11:57 AM, Luck, Tony wrote:
>>>>> To be honest, I like this series more than the previous series. I always
>>>>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>>>>
>>>> Would you prefer that your "Reviewed-by" tag be removed from the
>>>> previous series?
>>>
>>> I'm thinking that I could continue splitting things and break "struct rdt_resource" into
>>> separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.
>>
>> It is not obvious what you mean with "continue splitting things". Are you
>> speaking about "continue splitting from v14" or "continue splitting from v15-RFC"?
> 
> I'm speaking of some future potential changes. Not proposing to
> do this now.
> 
>> I think that any solution needs to consider what makes sense for resctrl
>> as a whole instead of how to support SNC with smallest patch possible.

>> There should not be any changes that make resctrl harder to understand
>> and maintain, as exemplified by the confusion introduced by something as
>> simple as resource name choice [1].
>>
>>>
>>> Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
>>> rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.
>>>
>>> Features found on a system would be added to a list of ctrl or list of mon resources.
>>
>> Could you please elaborate what is architecturally wrong with v14 and how this
>> new proposal addresses that?
> 
> There is nothing architecturally wrong with v14. I thought it was more
> complex than it needed to be. You have convinced me that my v15-RFC
> series, while simpler, is not a reasonable path for long-term resctrl
> maintainability.

I'm not sure if it's helpful to describe a third approach at this point - but on the off
chance it's useful:
With SNC enabled, the L3 monitors are unaffected, but the controls behave as if they were
part of some other component in the system.

ACPI describes something called "memory side caches" [0] in the HMAT table, which are
outside the CPU cache hierarchy, and are associated with a Proximity-Domain. I've heard
that one of Arm's partners has built a system with MPAM controls on something like this.
How would we support this - and would this be a better fit for the way SNC behaves?

I think this would be a new resource and schema, 'MSC'(?) with domain-ids using the NUMA
nid. As these aren't CPU caches, they wouldn't appear in the same part of the sysfs
hierarchy, and wouldn't necessarily have a cache-id.
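
Something like this in the schemata file, I imagine (resource name,
domain-ids and values all illustrative):

  # cat /sys/fs/resctrl/schemata
  MSC:0=0fff;1=0fff;2=0fff;3=0fff

with one domain per NUMA nid rather than per cache-id.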

For SNC systems, I think this would look like CMT on the L3, and CAT on the 'MSC'.
Existing software wouldn't know to use the new schema, but equally wouldn't be surprised
by the domain-ids being something other than the cache-id, and the controls and monitors
not lining up.
Where it's not quite right for SNC is that sysfs may not describe a memory side cache, but one
would be present in resctrl. I don't think that's a problem - unless these systems do also
have a memory-side-cache that behaves differently. (Whether the controls are being applied
at the 'near' side of the link or not - I don't think the difference matters.)


I'm a little nervous that the SNC support looks strange if we ever add support for
something like the above. Given it's described in ACPI, I assume there are plenty of
machines out there that look like this.

(Why aren't memory-side-caches a CPU cache? They live near the memory controller and cache
based on the PA, not the CPU that issued the transaction)


Thanks,

James

[0]
https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#memory-side-cache-overview
  
Luck, Tony Feb. 13, 2024, 7:02 p.m. UTC | #19
> With SNC enabled, the L3 monitors are unaffected, but the controls behave as if they were
> part of some other component in the system.

I don't think of it like that. See the attached picture of a single socket divided in two by SNC.
[If the attachment is stripped off for those reading this via mailing lists and you want the
picture, just send me an e-mail.]

Everything in blue is node 0. Yellow for node 1.

The rectangles in the middle represent the L3 cache (12-way associative). When cores
in node 0 access memory in node 0, it will be cached using the "top" half of the cache
indices. Similarly for node 1 using the "bottom" half.

Here’s how each of the Intel L3 resctrl functions operates with SNC enabled:

CQM: Reports how much of your half of the L3 cache is occupied

MBM: Reports on memory traffic from your half of the cache to your memory controllers.

CAT: Still controls which ways of the cache are available for allocation (but each way
has half the capacity.)

MBA: The same throttling levels are applied to "blue" and "yellow" traffic (because there
are only socket level controls).
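
To make that concrete (ids and values made up): with two SNC nodes per
socket the schemata file still shows a single L3 domain and a single MB
domain per socket, e.g. on a one-socket system

  L3:0=0fff
  MB:0=100

while mon_data gets a counter directory per SNC node, and each of the
"blue" and "yellow" halves sees half the capacity of the ways named in
the L3 bitmask.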

> I'm a little nervous that the SNC support looks strange if we ever add support for
> something like the above. Given it's described in ACPI, I assume there are plenty of
> machines out there that look like this.

I'm also nervous as h/w designers find various ways to diverge from the old paradigm of

	socket scope == L3 cache scope == NUMA node scope

-Tony