[0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

Message ID 20230126184157.27626-1-tony.luck@intel.com

Message

Luck, Tony Jan. 26, 2023, 6:41 p.m. UTC
  Intel server systems starting with Skylake support a mode that logically
partitions each socket. E.g. when partitioned two ways, half the cores,
L3 cache, and memory controllers are allocated to each of the partitions.
This may reduce average latency to access L3 cache and memory, with the
tradeoff that only half the L3 cache is available for subnode-local memory
access.

The existing Linux resctrl system mishandles RDT monitoring on systems
with SNC mode enabled.

But, with some simple changes, this can be fixed. When SNC mode is
enabled, the RDT RMID counters are also partitioned with the low numbered
counters going to the first partition, and the high numbered counters
to the second partition[1]. The key is to adjust the RMID value written
to the IA32_PQR_ASSOC MSR on context switch, the value written to the
IA32_QM_EVTSEL MSR when reading out counters, and the scaling factor
read from CPUID(0xf,1).EBX.

E.g. in a 2-way Sub-NUMA Cluster configuration with 200 RMID counters,
only 100 counters are available to the resctrl code. When running on the first
SNC node RMID values 0..99 are used as before. But when running on the
second node, a task that is assigned resctrl rmid=10 must load 10+100
into IA32_PQR_ASSOC to use RMID counter 110.

There should be no changes to functionality on other architectures,
or on Intel systems with SNC disabled, where snc_ways == 1.

-Tony

[1] Some systems also support a 4-way split. All the above still
applies, just need to account for cores, cache, memory controllers
and RMID counters being divided four ways instead of two.

Tony Luck (7):
  x86/resctrl: Refactor in preparation for node-scoped resources
  x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c
  x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
  x86/resctrl: Add a new "snc_ways" file to the monitoring info
    directory.
  x86/resctrl: Update documentation with Sub-NUMA cluster changes
  x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

 Documentation/x86/resctrl.rst             | 15 +++-
 include/linux/resctrl.h                   |  4 +-
 arch/x86/include/asm/resctrl.h            |  4 +-
 arch/x86/kernel/cpu/resctrl/internal.h    |  9 +++
 arch/x86/kernel/cpu/resctrl/core.c        | 83 ++++++++++++++++++++---
 arch/x86/kernel/cpu/resctrl/monitor.c     | 24 ++++---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 22 +++++-
 8 files changed, 136 insertions(+), 27 deletions(-)
  

Comments

James Morse Feb. 28, 2023, 5:12 p.m. UTC | #1
Hi Tony,

On 26/01/2023 18:41, Tony Luck wrote:
> Intel server systems starting with Skylake support a mode that logically
> partitions each socket. E.g. when partitioned two ways, half the cores,
> L3 cache, and memory controllers are allocated to each of the partitions.
> This may reduce average latency to access L3 cache and memory, with the
> tradeoff that only half the L3 cache is available for subnode-local memory
> access.

I couldn't find a description of what happens to the CAT bitmaps or counters.

Presumably the CAT bitmaps are duplicated, so each cluster has its own set, and
the counters aren't - so software has to co-ordinate the use of RMID across the CPUs?


How come cacheinfo isn't modified to report the L3 partitions as separate caches?
Otherwise user-space would assume the full size of the cache is available on any of those
CPUs.
This would avoid an ABI change in resctrl (domain is now the numa node), leaving only the
RMID range code.


Thanks,

James
  
Luck, Tony Feb. 28, 2023, 6:04 p.m. UTC | #2
> > Intel server systems starting with Skylake support a mode that logically
> > partitions each socket. E.g. when partitioned two ways, half the cores,
> > L3 cache, and memory controllers are allocated to each of the partitions.
> > This may reduce average latency to access L3 cache and memory, with the
> > tradeoff that only half the L3 cache is available for subnode-local memory
> > access.
>
> I couldn't find a description of what happens to the CAT bitmaps or counters.

No changes to CAT. The cache is partitioned between sub-numa nodes based
on the index, not by dividing the ways. E.g. an 8-way associative 32MB cache is
still 8-way associative in each sub-node, but with 16MB available to each node.

This means users who want a specific amount of cache will need to allocate
more bits in the cache way mask (because each way is half as big).

> Presumably the CAT bitmaps are duplicated, so each cluster has its own set, and
> the counters aren't - so software has to co-ordinate the use of RMID across the CPUs?

Nope. Still one set of CAT bit maps per socket.

With "N" RMIDs available on a system with SNC disabled, there will be N/2 available
when there are 2 SNC nodes per socket. Processes use values [0 .. N/2).

> How come cacheinfo isn't modified to report the L3 partitions as separate caches?
> Otherwise user-space would assume the full size of the cache is available on any of those
> CPUs.

The size of the cache is perhaps poorly defined in the SNC enabled case. A well
behaved NUMA application that is only accessing memory from its local node will
see an effective cache half the size. But if a process accesses memory from the
other SNC node on the same socket, it will get allocations in that SNC node's
half of the cache.  Accessing memory across inter-socket links will end up
allocating across the whole cache.

Moral: SNC mode is intended for applications that have very well-behaved NUMA
characteristics.

-Tony