[v1,0/9] x86/resctrl: Use soft RMIDs for reliable MBM on AMD

Message ID 20230421141723.2405942-1-peternewman@google.com

Peter Newman April 21, 2023, 2:17 p.m. UTC
  Hi Reinette, Fenghua,

This series introduces a new mount option enabling an alternate mode for
MBM to work around an issue on present AMD implementations and any other
resctrl implementation where there are more RMIDs (or equivalent) than
hardware counters.

The L3 External Bandwidth Monitoring feature of the AMD PQoS
extension[1] only guarantees that RMIDs currently assigned to a
processor will be tracked by hardware. The counters of any other RMIDs
which are no longer being tracked will be reset to zero. The MBM event
counters return "Unavailable" to indicate when this has happened.

An interval for effectively measuring memory bandwidth typically needs
to be multiple seconds long. In Google's workloads, it is not feasible
to bound the number of jobs with different RMIDs which will run in a
cache domain over any period of time.  Consequently, on a
fully-committed system where all RMIDs are allocated, few groups'
counters return non-zero values.

To demonstrate the underlying issue, the first patch provides a test
case in tools/testing/selftests/resctrl/test_rmids.sh.

On an AMD EPYC 7B12 64-Core Processor with the default behavior:

 # ./test_rmids.sh
 Created 255 monitoring groups.
 g1: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
 g2: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
 g3: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
[..]
 g238: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
 g239: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
 g240: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
 g241: mbm_total_bytes: Unavailable -> 660497472
 g242: mbm_total_bytes: Unavailable -> 660793344
 g243: mbm_total_bytes: Unavailable -> 660477312
 g244: mbm_total_bytes: Unavailable -> 660495360
 g245: mbm_total_bytes: Unavailable -> 660775360
 g246: mbm_total_bytes: Unavailable -> 660645504
 g247: mbm_total_bytes: Unavailable -> 660696128
 g248: mbm_total_bytes: Unavailable -> 660605248
 g249: mbm_total_bytes: Unavailable -> 660681280
 g250: mbm_total_bytes: Unavailable -> 660834240
 g251: mbm_total_bytes: Unavailable -> 660440064
 g252: mbm_total_bytes: Unavailable -> 660501504
 g253: mbm_total_bytes: Unavailable -> 660590720
 g254: mbm_total_bytes: Unavailable -> 660548352
 g255: mbm_total_bytes: Unavailable -> 660607296
 255 groups, 0 returned counts in first pass, 15 in second
 successfully measured bandwidth from 15/255 groups

To compare, here is the output from an Intel(R) Xeon(R) Platinum 8173M
CPU:

 # ./test_rmids.sh
 Created 223 monitoring groups.
 g1: mbm_total_bytes: 0 -> 606126080
 g2: mbm_total_bytes: 0 -> 613236736
 g3: mbm_total_bytes: 0 -> 610254848
[..]
 g221: mbm_total_bytes: 0 -> 584679424
 g222: mbm_total_bytes: 0 -> 588808192
 g223: mbm_total_bytes: 0 -> 587317248
 223 groups, 223 returned counts in first pass, 223 in second
 successfully measured bandwidth from 223/223 groups

To make better use of the hardware in such a use case, this patchset
introduces a "soft" RMID implementation, where each CPU is permanently
assigned a "hard" RMID. On context switches that change the current
soft RMID, the difference between the CPU's current event counts and
its most recently saved counts is added to the totals for the current
or outgoing soft RMID.
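
As a rough illustration of the accounting described above, here is a
minimal userspace sketch. It is not the code from this series: the
hardware counter is simulated, and every name in it (cpu_mbm_state,
simulate_traffic(), flush_cpu(), context_switch(), read_soft_rmid(),
NR_SIM_CPUS, NR_SOFT_RMIDS) is hypothetical. It only shows the idea of
crediting the delta of a per-CPU counter to whichever soft RMID was
running since the last flush.

```c
#include <assert.h>
#include <stdint.h>

#define NR_SIM_CPUS   2  /* hypothetical: size of the simulated system */
#define NR_SOFT_RMIDS 4  /* hypothetical: number of soft RMIDs */

/* Per-CPU state. Each CPU owns one hardware counter (its "hard" RMID). */
struct cpu_mbm_state {
	uint64_t hw_count;      /* simulated free-running byte counter */
	uint64_t prev_count;    /* value of hw_count at the last flush */
	unsigned int soft_rmid; /* soft RMID currently running on this CPU */
};

struct cpu_mbm_state cpus[NR_SIM_CPUS];
uint64_t soft_rmid_total[NR_SOFT_RMIDS];

/* Stand-in for real memory traffic: advance the CPU's hardware counter. */
void simulate_traffic(unsigned int cpu, uint64_t bytes)
{
	assert(cpu < NR_SIM_CPUS);
	cpus[cpu].hw_count += bytes;
}

/* Credit the bytes counted since the last flush to the running soft RMID. */
void flush_cpu(unsigned int cpu)
{
	struct cpu_mbm_state *s;
	uint64_t now;

	assert(cpu < NR_SIM_CPUS);
	s = &cpus[cpu];
	now = s->hw_count;
	soft_rmid_total[s->soft_rmid] += now - s->prev_count;
	s->prev_count = now;
}

/* On a switch that changes the soft RMID, flush for the outgoing one. */
void context_switch(unsigned int cpu, unsigned int next_rmid)
{
	assert(cpu < NR_SIM_CPUS && next_rmid < NR_SOFT_RMIDS);
	if (cpus[cpu].soft_rmid == next_rmid)
		return;
	flush_cpu(cpu);
	cpus[cpu].soft_rmid = next_rmid;
}

/* Flush every CPU currently running this soft RMID, then report its total. */
uint64_t read_soft_rmid(unsigned int rmid)
{
	unsigned int cpu;

	assert(rmid < NR_SOFT_RMIDS);
	for (cpu = 0; cpu < NR_SIM_CPUS; cpu++)
		if (cpus[cpu].soft_rmid == rmid)
			flush_cpu(cpu);
	return soft_rmid_total[rmid];
}
```

In this sketch the number of soft RMIDs is independent of the number
of hardware counters, because only the per-CPU counters are ever read.
The flush in read_soft_rmid() is one possible way to account for bytes
counted since the last context switch; it is an assumption of the
sketch, not a description of how the patches do it.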

This technique does not work for cache occupancy counters, so this patch
series disables cache occupancy events when soft RMIDs are enabled.

This series adds the "mbm_soft_rmid" mount option to allow users to
opt in to the functionality when they deem it helpful.

When the same system from the earlier AMD example enables the
mbm_soft_rmid mount option:

 # ./test_rmids.sh
 Created 255 monitoring groups.
 g1: mbm_total_bytes: 0 -> 686560576
 g2: mbm_total_bytes: 0 -> 668204416
[..]
 g252: mbm_total_bytes: 0 -> 672651200
 g253: mbm_total_bytes: 0 -> 666956800
 g254: mbm_total_bytes: 0 -> 665917056
 g255: mbm_total_bytes: 0 -> 671049600
 255 groups, 255 returned counts in first pass, 255 in second
 successfully measured bandwidth from 255/255 groups

(patches are based on tip/master)

[1] https://www.amd.com/system/files/TechDocs/56375_1.03_PUB.pdf

Peter Newman (8):
  selftests/resctrl: Verify all RMIDs count together
  x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events
  x86/resctrl: Flush MBM event counts on soft RMID change
  x86/resctrl: Call mon_event_count() directly for soft RMIDs
  x86/resctrl: Create soft RMID version of __mon_event_count()
  x86/resctrl: Assign HW RMIDs to CPUs for soft RMID
  x86/resctrl: Use mbm_update() to push soft RMID counts
  x86/resctrl: Add mount option to enable soft RMID

Stephane Eranian (1):
  x86/resctrl: Hold a spinlock in __rmid_read() on AMD

 arch/x86/include/asm/resctrl.h                |  29 +++-
 arch/x86/kernel/cpu/resctrl/core.c            |  80 ++++++++-
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c     |   9 +-
 arch/x86/kernel/cpu/resctrl/internal.h        |  19 ++-
 arch/x86/kernel/cpu/resctrl/monitor.c         | 158 +++++++++++++++++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c        |  52 ++++++
 tools/testing/selftests/resctrl/test_rmids.sh |  93 +++++++++++
 7 files changed, 425 insertions(+), 15 deletions(-)
 create mode 100755 tools/testing/selftests/resctrl/test_rmids.sh


base-commit: dd806e2f030e57dd5bac973372aa252b6c175b73