[v14,0/8] Add support for Sub-NUMA cluster (SNC) systems

Message ID 20240126223837.21835-1-tony.luck@intel.com
Headers
Series Add support for Sub-NUMA cluster (SNC) systems |

Message

Luck, Tony Jan. 26, 2024, 10:38 p.m. UTC
  The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features.  Prior to this
patch Intel has advised that SNC and RDT are incompatible.

Some of these CPU support an MSR that can partition the RMID counters in
the same way. This allows monitoring features to be used. With the caveat
that users must be aware that Linux may migrate tasks more frequently
between SNC nodes than between "regular" NUMA nodes, so reading counters
from all SNC nodes may be needed to get a complete picture of activity
for tasks.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Changes since v13:

Rebase to tip x86/cache
	fc747eebef73 ("x86/resctrl: Remove redundant variable in mbm_config_write_domain()")

Applied all Reviewed and Tested tags

Tony Luck (8):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes

 Documentation/arch/x86/resctrl.rst        |  25 +-
 include/linux/resctrl.h                   |  85 +++--
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  66 ++--
 arch/x86/kernel/cpu/resctrl/core.c        | 433 +++++++++++++++++-----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  58 +--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  68 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  26 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 149 ++++----
 9 files changed, 629 insertions(+), 282 deletions(-)


base-commit: fc747eebef734563cf68a512f57937c8f231834a
  

Comments

Luck, Tony Jan. 30, 2024, 1:05 a.m. UTC | #1
I've been wondering whether the SNC patches were creating way too much
infrastructure that isn't generally useful. Specifically the capability
for a reactrl resource to have different scope for monitoring and
control functions.

That seems like a very strange thing.

History here is that Intel CMT, MBM and L3 CAT features arrived in
quick succession, closely followed by MBA and then L2 CAT. At the time
it made sense for the "L3" resource to cover all of CMT, MBM and L3 CAT.

That became slightly odd when MBA (another L3 scoped resource) was
added. It was given its own "struct rdt_resource". AMD later added
SMBA as yet another L3-scoped resource with its own "struct resource".

I wondered how the SNC series would look if I went back to an approach
from one of the earlier versions. In that one I created a new resource
for SNC monitoring that was NODE scoped. And then made all the CMT and
MBM code switch over to using that one when SNC was enabled. That was
a bit hacky, and I moved from there to the dual domain lists per
resource.

I just tried out an approach the split the RDT_RESOURCE_L3 resource
into separate resources, one for control, one for monitoring.

It makes the code much simpler:

 8 files changed, 235 insertions(+), 30 deletions(-) [1]

vs.

 9 files changed, 629 insertions(+), 282 deletions(-)


Tradeoffs:

The complex series (posted as v14) cleanly split the "rdt_domain"
structure into two. So it no longer carried all the fields needed
for both control and monitor, even though only one set was ever
used. But the cost was a lot of code churn that may never be useful
for anything other than SNC.

On non-SNC systems the new series produces separate linked lists
of domains for L3 control & monitor, even though the lists are the
same, and the domain structures still carry all fields for both
control and monitor functions.

Question: Does anyone think that single domain with different scope for
control and monitor is inherently more useful than two separate domains?


While I get this sorted out, maybe Boris should take James next set
of MPAM patches as they are, instead of on top of the complex SNC
series.

-Tony

[1] I'll post this set just as soon as I can get time on and SNC machine
to make sure they actually work.