[v9,2/3] x86/mce: Add per-bank CMCI storm mitigation

Message ID 20231004183623.17067-3-tony.luck@intel.com
State New
Headers
Series Handle corrected machine check interrupt storms |

Commit Message

Luck, Tony Oct. 4, 2023, 6:36 p.m. UTC
  This is the core functionality to track CMCI storms at the
machine check bank granularity. Subsequent patches will add
the vendor specific hooks to supply input to the storm
detection and take actions on the start/end of a storm.

Maintain a bitmap history for each bank showing whether the bank
logged an corrected error or not each time it is polled.

In normal operation the interval between polls of this banks
determines how far to shift the history. The 64 bit width corresponds
to about one second.

When a storm is observed a CPU vendor specific action is taken to reduce
or stop CMCI from the bank that is the source of the storm.  The bank
is added to the bitmap of banks for this CPU to poll. The polling rate
is increased to once per second.  During a storm each bit in the history
indicates the status of the bank each time it is polled. Thus the history
covers just over a minute.

Declare a storm for that bank if the number of corrected interrupts
seen in that history is above some threshold (defined as 5 in this
series, could be tuned later if there is data to suggest a better
value).

A storm on a bank ends if enough consecutive polls of the bank show
no corrected errors (defined as 30, may also change). That calls the
CPU vendor specific function to revert to normal operational mode,
and changes the polling rate back to the default.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
v8: Updated with review comments from Yazen.
    Link: https://lore.kernel.org/r/c76723df-f2f1-4888-9e05-61917145503c@amd.com
v9: Fix lkp randconfig build issue when neither INTEL nor AMD configured
---
 arch/x86/kernel/cpu/mce/internal.h  |  40 +++++++++-
 arch/x86/kernel/cpu/mce/core.c      |  25 +++++--
 arch/x86/kernel/cpu/mce/threshold.c | 112 ++++++++++++++++++++++++++++
 3 files changed, 168 insertions(+), 9 deletions(-)
  

Comments

kernel test robot Oct. 11, 2023, 9:11 a.m. UTC | #1
Hello,

kernel test robot noticed a -8.8% regression of stress-ng.clock.ops_per_sec on:


commit: 26bff7b04b829cccc6a97726d6398391a62e34ef ("[PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation")
url: https://github.com/intel-lab-lkp/linux/commits/Tony-Luck/x86-mce-Remove-old-CMCI-storm-mitigation-code/20231005-024047
patch link: https://lore.kernel.org/all/20231004183623.17067-3-tony.luck@intel.com/
patch subject: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

	nr_threads: 10%
	disk: 1HDD
	testtime: 60s
	fs: ext4
	class: os
	test: clock
	cpufreq_governor: performance




If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202310111637.dee70328-oliver.sang@intel.com


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231011/202310111637.dee70328-oliver.sang@intel.com

=========================================================================================
class/compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  os/gcc-12/performance/1HDD/ext4/x86_64-rhel-8.3/10%/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp7/clock/stress-ng/60s

commit: 
  cbb2ef4fc1 ("x86/mce: Remove old CMCI storm mitigation code")
  26bff7b04b ("x86/mce: Add per-bank CMCI storm mitigation")

cbb2ef4fc1f946b3 26bff7b04b829cccc6a97726d63 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
    118898 ±  7%     -18.5%      96883 ±  8%  meminfo.Mapped
      2971 ±  7%     +18.3%       3513 ±  9%  perf-c2c.HITM.local
      0.01           -20.0%       0.00        perf-sched.wait_time.avg.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep_timens.__x64_sys_clock_nanosleep
   3865971 ±  4%     -23.2%    2970833 ±  2%  turbostat.C1
      1.48 ±  6%      -0.3        1.22 ±  4%  turbostat.C1%
    292422            -9.4%     264956        vmstat.system.cs
    219398            -7.4%     203113        vmstat.system.in
   4715828            -8.8%    4301560        stress-ng.clock.ops
     78597            -8.8%      71692        stress-ng.clock.ops_per_sec
    305.30            +1.7%     310.35        stress-ng.time.system_time
   9425269            -8.8%    8596889        stress-ng.time.voluntary_context_switches
    157229 ±  2%      -5.1%     149224        proc-vmstat.nr_active_anon
    904759            -1.7%     889679        proc-vmstat.nr_file_pages
    123722 ±  2%      -6.1%     116162 ±  2%  proc-vmstat.nr_inactive_anon
     29778 ±  7%     -18.5%      24260 ±  8%  proc-vmstat.nr_mapped
    194663            -7.8%     179535        proc-vmstat.nr_shmem
    157229 ±  2%      -5.1%     149224        proc-vmstat.nr_zone_active_anon
    123722 ±  2%      -6.1%     116162 ±  2%  proc-vmstat.nr_zone_inactive_anon
    844968            -5.5%     798662        proc-vmstat.numa_hit
    778755            -5.9%     732457 ±  2%  proc-vmstat.numa_local
    259547            -5.7%     244756        proc-vmstat.pgactivate
   1060836            -7.1%     986009 ±  2%  proc-vmstat.pgalloc_normal
    755630 ±  2%      -7.0%     702996 ±  4%  proc-vmstat.pgfree
      2.70           -17.4%       2.23        perf-stat.i.MPKI
 1.428e+09            -5.8%  1.345e+09        perf-stat.i.branch-instructions
      0.99            +0.1        1.09 ±  2%  perf-stat.i.branch-miss-rate%
     35.43            -7.4       28.07        perf-stat.i.cache-miss-rate%
  19762268           -22.2%   15365818        perf-stat.i.cache-misses
  55484296            -1.5%   54670651        perf-stat.i.cache-references
    305445            -8.7%     278849        perf-stat.i.context-switches
      2.88            +6.9%       3.08        perf-stat.i.cpi
    400.01 ±  3%      -6.5%     373.89 ±  3%  perf-stat.i.cpu-migrations
      1090           +27.2%       1387        perf-stat.i.cycles-between-cache-misses
 1.989e+09            -6.6%  1.857e+09        perf-stat.i.dTLB-loads
 1.153e+09            -7.6%  1.066e+09        perf-stat.i.dTLB-stores
 7.703e+09            -6.3%  7.221e+09        perf-stat.i.instructions
      0.38            -5.8%       0.36        perf-stat.i.ipc
      1124            -6.3%       1053        perf-stat.i.metric.K/sec
     71.46            -6.6%      66.71        perf-stat.i.metric.M/sec
   3530238 ±  3%     +48.1%    5229217 ±  3%  perf-stat.i.node-load-misses
     90.35            -5.3       85.09        perf-stat.i.node-store-miss-rate%
  14871425 ±  2%     -40.5%    8846683        perf-stat.i.node-store-misses
   1393999 ±  2%     +10.0%    1533309        perf-stat.i.node-stores
      2.57           -17.0%       2.13        perf-stat.overall.MPKI
      1.03 ±  2%      +0.1        1.14 ±  3%  perf-stat.overall.branch-miss-rate%
     35.62            -7.5       28.11        perf-stat.overall.cache-miss-rate%
      2.73            +6.4%       2.90        perf-stat.overall.cpi
      1063           +28.2%       1363        perf-stat.overall.cycles-between-cache-misses
      0.37            -6.0%       0.34        perf-stat.overall.ipc
     91.42            -6.2       85.23        perf-stat.overall.node-store-miss-rate%
 1.405e+09            -5.8%  1.323e+09        perf-stat.ps.branch-instructions
  19444141           -22.2%   15118353        perf-stat.ps.cache-misses
  54591369            -1.5%   53789356        perf-stat.ps.cache-references
    300511            -8.7%     274335        perf-stat.ps.context-switches
    393.53 ±  3%      -6.5%     367.80 ±  3%  perf-stat.ps.cpu-migrations
 1.957e+09            -6.6%  1.827e+09        perf-stat.ps.dTLB-loads
 1.135e+09            -7.6%  1.049e+09        perf-stat.ps.dTLB-stores
  7.58e+09            -6.3%  7.105e+09        perf-stat.ps.instructions
   3473672 ±  3%     +48.1%    5144934 ±  3%  perf-stat.ps.node-load-misses
  14630992 ±  2%     -40.5%    8703485        perf-stat.ps.node-store-misses
   1371912 ±  2%     +10.0%    1508626        perf-stat.ps.node-stores
 4.721e+11            -5.7%   4.45e+11        perf-stat.total.instructions
     13.27            -1.8       11.46 ±  2%  perf-profile.calltrace.cycles-pp.ktime_get_update_offsets_now.clock_was_set.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime
      7.00 ±  3%      -1.3        5.74 ±  2%  perf-profile.calltrace.cycles-pp.syscall
      5.50 ±  2%      -1.2        4.26 ±  6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.syscall
      5.25 ±  2%      -1.2        4.02 ±  6%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
      4.27 ±  2%      -1.1        3.13 ±  9%  perf-profile.calltrace.cycles-pp.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
     14.93            -1.1       13.82 ±  2%  perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     17.13            -1.1       16.06 ±  2%  perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
     14.04            -1.0       13.00 ±  2%  perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
     14.22            -1.0       13.18 ±  2%  perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
     16.74            -1.0       15.77 ±  2%  perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     16.78            -1.0       15.80 ±  2%  perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
     16.77            -1.0       15.80 ±  2%  perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      3.29 ±  7%      -0.8        2.51 ± 13%  perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
      2.50 ±  7%      -0.6        1.94 ±  6%  perf-profile.calltrace.cycles-pp.intel_idle_irq.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      0.68 ±  3%      -0.4        0.26 ±100%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime
      2.09            -0.4        1.67 ±  2%  perf-profile.calltrace.cycles-pp.timekeeping_update.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64
      4.87            -0.4        4.49 ±  4%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      4.70            -0.4        4.31 ±  4%  perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
      1.99 ±  7%      -0.4        1.62 ±  6%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.intel_idle_irq.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
      1.78 ±  7%      -0.3        1.44 ±  6%  perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.intel_idle_irq.cpuidle_enter_state.cpuidle_enter
      1.40 ±  8%      -0.3        1.08 ±  6%  perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.intel_idle_irq.cpuidle_enter_state
      1.40 ±  8%      -0.3        1.08 ±  6%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.intel_idle_irq
      0.56 ±  4%      -0.3        0.26 ±100%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__close
      3.07 ±  3%      -0.3        2.80 ±  5%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state
      3.08 ±  3%      -0.3        2.81 ±  5%  perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
      1.74 ±  6%      -0.2        1.51 ±  6%  perf-profile.calltrace.cycles-pp.timer_settime
      2.12 ±  2%      -0.2        1.90 ±  5%  perf-profile.calltrace.cycles-pp.add_device_randomness.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.95            -0.2        0.73 ±  4%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64
      1.58 ±  6%      -0.2        1.36 ±  7%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.timer_settime
      0.71 ±  7%      -0.2        0.50 ± 45%  perf-profile.calltrace.cycles-pp.posix_get_monotonic_raw.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
      1.60 ±  6%      -0.2        1.39 ±  6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.timer_settime
      1.55 ±  7%      -0.2        1.33 ±  6%  perf-profile.calltrace.cycles-pp.__x64_sys_timer_settime.do_syscall_64.entry_SYSCALL_64_after_hwframe.timer_settime
      0.70 ±  7%      -0.2        0.49 ± 46%  perf-profile.calltrace.cycles-pp.ktime_get_raw_ts64.posix_get_monotonic_raw.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.34 ±  8%      -0.2        1.14 ±  7%  perf-profile.calltrace.cycles-pp.do_timer_settime.__x64_sys_timer_settime.do_syscall_64.entry_SYSCALL_64_after_hwframe.timer_settime
      2.84            -0.2        2.64 ±  4%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      2.79 ±  2%      -0.2        2.61 ±  4%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      2.30            -0.2        2.12        perf-profile.calltrace.cycles-pp.open64
      2.24            -0.2        2.07        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
      2.25            -0.2        2.08        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.open64
      2.23            -0.2        2.06        perf-profile.calltrace.cycles-pp.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
      2.22            -0.2        2.06        perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
      1.36 ±  4%      -0.2        1.20 ± 18%  perf-profile.calltrace.cycles-pp.__entry_text_start.syscall
      1.79 ±  4%      -0.2        1.63 ±  6%  perf-profile.calltrace.cycles-pp.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
      2.70 ±  2%      -0.2        2.54 ±  4%  perf-profile.calltrace.cycles-pp.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      1.12 ±  3%      -0.1        0.97 ±  6%  perf-profile.calltrace.cycles-pp.blake2s_update.add_device_randomness.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64
      0.60 ±  4%      -0.1        0.46 ± 44%  perf-profile.calltrace.cycles-pp.__close
      2.02 ±  2%      -0.1        1.88        perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
      2.00 ±  2%      -0.1        1.86        perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
      0.95 ±  3%      -0.1        0.82 ±  6%  perf-profile.calltrace.cycles-pp.blake2s_compress.blake2s_update.add_device_randomness.do_adjtimex.__do_sys_clock_adjtime
      1.26 ±  2%      -0.1        1.13 ±  3%  perf-profile.calltrace.cycles-pp.do_open.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat
      1.07 ±  3%      -0.1        0.98 ±  5%  perf-profile.calltrace.cycles-pp.hrtimer_nanosleep.common_nsleep_timens.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.08 ±  3%      -0.1        0.99 ±  5%  perf-profile.calltrace.cycles-pp.common_nsleep_timens.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      0.86 ±  2%      -0.1        0.78 ±  6%  perf-profile.calltrace.cycles-pp.hrtimer_start_range_ns.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
      1.01 ±  3%      -0.1        0.93 ±  6%  perf-profile.calltrace.cycles-pp.do_nanosleep.hrtimer_nanosleep.common_nsleep_timens.__x64_sys_clock_nanosleep.do_syscall_64
      0.96            -0.1        0.88 ±  2%  perf-profile.calltrace.cycles-pp.read_tsc.ktime_get_update_offsets_now.clock_was_set.timekeeping_inject_offset.do_adjtimex
      0.69 ±  4%      -0.1        0.62 ±  3%  perf-profile.calltrace.cycles-pp.chrdev_open.do_dentry_open.do_open.path_openat.do_filp_open
      0.98 ±  4%      -0.1        0.91        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state
      0.86 ±  4%      -0.1        0.80 ±  3%  perf-profile.calltrace.cycles-pp.do_dentry_open.do_open.path_openat.do_filp_open.do_sys_openat2
      0.96 ±  3%      +0.1        1.09 ±  5%  perf-profile.calltrace.cycles-pp.tick_nohz_idle_enter.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      0.87 ±  6%      +0.1        1.00 ±  5%  perf-profile.calltrace.cycles-pp.posix_timer_unhash_and_free.__x64_sys_timer_delete.do_syscall_64.entry_SYSCALL_64_after_hwframe.timer_delete
      6.08 ±  2%      +0.4        6.44 ±  2%  perf-profile.calltrace.cycles-pp.igb_ptp_gettimex_82580.pc_clock_gettime.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe
      6.27 ±  2%      +0.4        6.66 ±  2%  perf-profile.calltrace.cycles-pp.pc_clock_gettime.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_gettime
      6.39 ±  2%      +0.4        6.78 ±  2%  perf-profile.calltrace.cycles-pp.clock_gettime
      6.32 ±  2%      +0.4        6.72 ±  2%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.clock_gettime
      6.32 ±  2%      +0.4        6.71 ±  2%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_gettime
      6.30 ±  2%      +0.4        6.70 ±  2%  perf-profile.calltrace.cycles-pp.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_gettime
      0.08 ±223%      +0.5        0.55 ±  5%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.tick_nohz_idle_enter.do_idle.cpu_startup_entry.start_secondary
      2.58 ±  7%      +0.5        3.13 ±  2%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.igb_ptp_gettimex_82580.pc_clock_gettime.__x64_sys_clock_gettime.do_syscall_64
      2.30 ±  7%      +0.6        2.89 ±  2%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.igb_ptp_gettimex_82580.pc_clock_gettime.__x64_sys_clock_gettime
     53.43            +2.9       56.32        perf-profile.calltrace.cycles-pp.clock_adjtime
     53.05            +2.9       55.98        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.clock_adjtime
     52.98            +2.9       55.92        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_adjtime
     52.81            +2.9       55.75        perf-profile.calltrace.cycles-pp.__do_sys_clock_adjtime.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_adjtime
     51.37            +3.1       54.49        perf-profile.calltrace.cycles-pp.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_adjtime
     47.33            +3.4       50.75        perf-profile.calltrace.cycles-pp.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64.entry_SYSCALL_64_after_hwframe
     43.22            +4.3       47.50        perf-profile.calltrace.cycles-pp.clock_was_set.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64
      4.17 ± 13%      +4.7        8.85 ±  4%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.clock_was_set.timekeeping_inject_offset.do_adjtimex
     21.83            +5.8       27.67        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.clock_was_set.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime
     14.07            -1.8       12.26 ±  2%  perf-profile.children.cycles-pp.ktime_get_update_offsets_now
      7.34            -1.4        5.98 ±  4%  perf-profile.children.cycles-pp.syscall
     15.24            -1.2       14.06 ±  2%  perf-profile.children.cycles-pp.cpuidle_idle_call
     14.48            -1.1       13.37 ±  2%  perf-profile.children.cycles-pp.cpuidle_enter_state
     14.51            -1.1       13.40 ±  2%  perf-profile.children.cycles-pp.cpuidle_enter
     17.11            -1.1       16.04 ±  2%  perf-profile.children.cycles-pp.do_idle
     17.13            -1.1       16.06 ±  2%  perf-profile.children.cycles-pp.secondary_startup_64_no_verify
     17.13            -1.1       16.06 ±  2%  perf-profile.children.cycles-pp.cpu_startup_entry
     16.78            -1.0       15.80 ±  2%  perf-profile.children.cycles-pp.start_secondary
     10.61 ±  2%      -0.7        9.86 ±  3%  perf-profile.children.cycles-pp.__x64_sys_clock_gettime
      9.62            -0.7        8.90 ±  3%  perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
      8.93            -0.7        8.26 ±  3%  perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
      2.76 ±  6%      -0.6        2.14 ±  6%  perf-profile.children.cycles-pp.intel_idle_irq
      6.40            -0.6        5.81 ±  3%  perf-profile.children.cycles-pp.hrtimer_interrupt
      6.41            -0.6        5.83 ±  3%  perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
      4.11 ±  2%      -0.5        3.60 ±  4%  perf-profile.children.cycles-pp.__hrtimer_run_queues
      1.09 ±  4%      -0.5        0.62 ±  8%  perf-profile.children.cycles-pp.ktime_get_real_ts64
      0.88 ±  6%      -0.4        0.45 ± 14%  perf-profile.children.cycles-pp.posix_get_realtime_timespec
      2.09            -0.4        1.67 ±  2%  perf-profile.children.cycles-pp.timekeeping_update
      2.12 ±  2%      -0.2        1.90 ±  5%  perf-profile.children.cycles-pp.add_device_randomness
      1.78 ±  6%      -0.2        1.56 ±  5%  perf-profile.children.cycles-pp.timer_settime
      1.55 ±  7%      -0.2        1.34 ±  6%  perf-profile.children.cycles-pp.__x64_sys_timer_settime
      1.35 ±  8%      -0.2        1.14 ±  7%  perf-profile.children.cycles-pp.do_timer_settime
      1.19 ±  3%      -0.2        0.99 ±  5%  perf-profile.children.cycles-pp.try_to_wake_up
      1.19 ±  3%      -0.2        0.99 ±  4%  perf-profile.children.cycles-pp.hrtimer_wakeup
      0.77 ±  4%      -0.2        0.58 ±  8%  perf-profile.children.cycles-pp.ktime_get_with_offset
      0.52 ±  7%      -0.2        0.33 ± 17%  perf-profile.children.cycles-pp.posix_get_monotonic_timespec
      0.51 ±  6%      -0.2        0.32 ± 19%  perf-profile.children.cycles-pp.ktime_get_ts64
      2.25            -0.2        2.08        perf-profile.children.cycles-pp.do_sys_openat2
      2.31            -0.2        2.14        perf-profile.children.cycles-pp.open64
      2.26            -0.2        2.08        perf-profile.children.cycles-pp.__x64_sys_openat
      2.10 ±  3%      -0.2        1.92 ±  5%  perf-profile.children.cycles-pp.tick_sched_timer
      2.71 ±  2%      -0.2        2.54 ±  4%  perf-profile.children.cycles-pp.__x64_sys_clock_nanosleep
      2.55            -0.2        2.39 ±  4%  perf-profile.children.cycles-pp.hrtimer_nanosleep
      2.45 ±  2%      -0.2        2.30 ±  5%  perf-profile.children.cycles-pp.do_nanosleep
      1.12 ±  3%      -0.1        0.98 ±  6%  perf-profile.children.cycles-pp.blake2s_update
      0.36 ±  6%      -0.1        0.21 ± 17%  perf-profile.children.cycles-pp.smp_call_function_many_cond
      0.39 ±  3%      -0.1        0.24 ±  4%  perf-profile.children.cycles-pp.update_vsyscall
      0.72 ±  7%      -0.1        0.57 ± 13%  perf-profile.children.cycles-pp.posix_get_monotonic_raw
      0.53            -0.1        0.38 ±  8%  perf-profile.children.cycles-pp.posix_get_tai_timespec
      0.70 ±  7%      -0.1        0.56 ± 14%  perf-profile.children.cycles-pp.ktime_get_raw_ts64
      2.03 ±  2%      -0.1        1.89        perf-profile.children.cycles-pp.path_openat
      2.04 ±  2%      -0.1        1.91        perf-profile.children.cycles-pp.do_filp_open
      0.95 ±  3%      -0.1        0.82 ±  6%  perf-profile.children.cycles-pp.blake2s_compress
      1.50 ±  2%      -0.1        1.37 ±  2%  perf-profile.children.cycles-pp.read_tsc
      1.26 ±  3%      -0.1        1.14 ±  3%  perf-profile.children.cycles-pp.do_open
      1.38 ±  2%      -0.1        1.26 ±  6%  perf-profile.children.cycles-pp.hrtimer_start_range_ns
      0.54 ±  4%      -0.1        0.43 ±  3%  perf-profile.children.cycles-pp.ttwu_do_activate
      0.26 ±  8%      -0.1        0.16 ± 21%  perf-profile.children.cycles-pp.ktime_get_coarse_real_ts64
      0.25 ±  6%      -0.1        0.14 ± 21%  perf-profile.children.cycles-pp.posix_get_realtime_coarse
      1.47 ±  2%      -0.1        1.37 ±  3%  perf-profile.children.cycles-pp.__entry_text_start
      1.09 ±  3%      -0.1        0.99 ±  5%  perf-profile.children.cycles-pp.common_nsleep_timens
      0.56 ±  3%      -0.1        0.46 ±  7%  perf-profile.children.cycles-pp.update_fast_timekeeper
      1.53 ±  3%      -0.1        1.44 ±  3%  perf-profile.children.cycles-pp.tick_sched_handle
      1.14 ±  2%      -0.1        1.05 ±  2%  perf-profile.children.cycles-pp._copy_to_user
      0.51 ±  2%      -0.1        0.43 ±  4%  perf-profile.children.cycles-pp.timekeeping_forward_now
      0.16 ±  7%      -0.1        0.08 ± 18%  perf-profile.children.cycles-pp.__flush_smp_call_function_queue
      0.32 ±  3%      -0.1        0.25 ±  7%  perf-profile.children.cycles-pp.enqueue_entity
      0.69 ±  4%      -0.1        0.62 ±  4%  perf-profile.children.cycles-pp.chrdev_open
      0.36 ±  3%      -0.1        0.29 ±  7%  perf-profile.children.cycles-pp.activate_task
      0.15 ±  9%      -0.1        0.08 ± 23%  perf-profile.children.cycles-pp.asm_sysvec_call_function_single
      0.34 ±  4%      -0.1        0.28 ±  7%  perf-profile.children.cycles-pp.enqueue_task_fair
      0.30 ±  7%      -0.1        0.24 ±  6%  perf-profile.children.cycles-pp.capable
      0.54 ±  4%      -0.1        0.48 ±  3%  perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      0.40 ±  7%      -0.1        0.34 ± 11%  perf-profile.children.cycles-pp.timerfd_clock_was_set
      0.62 ±  4%      -0.1        0.56 ±  4%  perf-profile.children.cycles-pp.__close
      0.57 ±  4%      -0.1        0.51 ±  6%  perf-profile.children.cycles-pp.menu_select
      0.54 ±  3%      -0.1        0.48 ±  4%  perf-profile.children.cycles-pp.__x64_sys_close
      0.31 ±  9%      -0.1        0.25 ±  5%  perf-profile.children.cycles-pp.lockref_get_not_dead
      0.28 ±  5%      -0.1        0.22 ±  6%  perf-profile.children.cycles-pp.apparmor_capable
      0.14 ± 16%      -0.1        0.08 ± 27%  perf-profile.children.cycles-pp.ktime_mono_to_any
      0.13 ±  8%      -0.1        0.07 ± 20%  perf-profile.children.cycles-pp.sysvec_call_function_single
      0.32 ±  8%      -0.1        0.26 ±  5%  perf-profile.children.cycles-pp.__legitimize_path
      0.30 ±  5%      -0.1        0.24 ±  5%  perf-profile.children.cycles-pp.security_capable
      0.08 ± 17%      -0.1        0.02 ± 99%  perf-profile.children.cycles-pp.security_file_free
      0.52 ±  5%      -0.1        0.47 ±  4%  perf-profile.children.cycles-pp.alloc_empty_file
      0.11 ±  6%      -0.1        0.06 ± 21%  perf-profile.children.cycles-pp.__sysvec_call_function_single
      0.45 ±  4%      -0.1        0.40 ±  6%  perf-profile.children.cycles-pp.__fput
      0.40 ±  5%      -0.1        0.35 ±  4%  perf-profile.children.cycles-pp.init_file
      0.31 ±  5%      -0.1        0.26 ±  6%  perf-profile.children.cycles-pp.notifier_call_chain
      0.69 ±  3%      -0.0        0.64 ±  2%  perf-profile.children.cycles-pp.kmem_cache_alloc
      0.18 ±  9%      -0.0        0.14 ±  8%  perf-profile.children.cycles-pp.kobject_get
      0.18 ±  9%      -0.0        0.13 ± 14%  perf-profile.children.cycles-pp.check_preempt_curr
      0.23 ±  6%      -0.0        0.19 ±  8%  perf-profile.children.cycles-pp._raw_spin_lock_irq
      0.15 ± 11%      -0.0        0.11 ± 14%  perf-profile.children.cycles-pp.resched_curr
      0.29 ±  5%      -0.0        0.25 ±  5%  perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
      0.60 ±  3%      -0.0        0.56 ±  2%  perf-profile.children.cycles-pp.lapic_next_deadline
      0.22 ±  5%      -0.0        0.19 ±  8%  perf-profile.children.cycles-pp.security_file_alloc
      0.18 ±  6%      -0.0        0.15 ± 12%  perf-profile.children.cycles-pp.smpboot_thread_fn
      0.11 ±  8%      -0.0        0.08 ±  5%  perf-profile.children.cycles-pp.run_ksoftirqd
      0.26 ±  5%      -0.0        0.23 ±  4%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      0.14 ±  9%      -0.0        0.11 ±  9%  perf-profile.children.cycles-pp.getname_flags
      0.24 ±  3%      -0.0        0.22 ±  4%  perf-profile.children.cycles-pp.exit_to_user_mode_prepare
      0.10 ±  6%      +0.0        0.12 ±  6%  perf-profile.children.cycles-pp.posix_clock_poll
      0.12 ±  6%      +0.0        0.14 ±  4%  perf-profile.children.cycles-pp.do_poll
      0.27 ±  5%      +0.0        0.29 ±  3%  perf-profile.children.cycles-pp.down_read
      0.15 ±  9%      +0.0        0.18 ±  7%  perf-profile.children.cycles-pp.up_read
      0.12 ±  6%      +0.0        0.15 ±  6%  perf-profile.children.cycles-pp.link_path_walk
      0.18 ±  5%      +0.0        0.22 ±  7%  perf-profile.children.cycles-pp.__switch_to
      0.21 ±  4%      +0.1        0.26 ± 13%  perf-profile.children.cycles-pp.finish_task_switch
      0.00            +0.1        0.08 ± 21%  perf-profile.children.cycles-pp.step_into
      0.87 ±  6%      +0.1        1.00 ±  5%  perf-profile.children.cycles-pp.posix_timer_unhash_and_free
      6.08 ±  2%      +0.4        6.45 ±  2%  perf-profile.children.cycles-pp.igb_ptp_gettimex_82580
      6.27 ±  2%      +0.4        6.66 ±  2%  perf-profile.children.cycles-pp.pc_clock_gettime
      6.39 ±  2%      +0.4        6.78 ±  2%  perf-profile.children.cycles-pp.clock_gettime
     77.76            +1.4       79.16        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     77.38            +1.4       78.81        perf-profile.children.cycles-pp.do_syscall_64
     53.58            +2.9       56.46        perf-profile.children.cycles-pp.clock_adjtime
     52.83            +2.9       55.78        perf-profile.children.cycles-pp.__do_sys_clock_adjtime
     51.38            +3.1       54.50        perf-profile.children.cycles-pp.do_adjtimex
     47.34            +3.4       50.76        perf-profile.children.cycles-pp.timekeeping_inject_offset
     43.40            +4.3       47.66        perf-profile.children.cycles-pp.clock_was_set
      9.11 ±  8%      +5.2       14.27 ±  2%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
     29.22            +6.0       35.19        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
     13.04            -1.7       11.30 ±  2%  perf-profile.self.cycles-pp.ktime_get_update_offsets_now
      1.03 ±  5%      -0.5        0.57 ±  8%  perf-profile.self.cycles-pp.ktime_get_real_ts64
      4.24 ±  4%      -0.3        3.92 ±  5%  perf-profile.self.cycles-pp.ktime_get
      0.81 ±  4%      -0.2        0.58 ±  7%  perf-profile.self.cycles-pp.intel_idle_irq
      0.70 ±  5%      -0.2        0.51 ±  9%  perf-profile.self.cycles-pp.ktime_get_with_offset
      0.47 ±  7%      -0.2        0.29 ± 21%  perf-profile.self.cycles-pp.ktime_get_ts64
      0.39 ±  3%      -0.1        0.24 ±  4%  perf-profile.self.cycles-pp.update_vsyscall
      0.66 ±  8%      -0.1        0.52 ± 15%  perf-profile.self.cycles-pp.ktime_get_raw_ts64
      0.74 ±  3%      -0.1        0.61 ±  4%  perf-profile.self.cycles-pp.__do_sys_clock_adjtime
      0.31 ±  7%      -0.1        0.18 ± 16%  perf-profile.self.cycles-pp.smp_call_function_many_cond
      0.95 ±  3%      -0.1        0.82 ±  5%  perf-profile.self.cycles-pp.blake2s_compress
      0.80 ±  2%      -0.1        0.68 ±  3%  perf-profile.self.cycles-pp.timekeeping_update
      1.43 ±  2%      -0.1        1.32 ±  2%  perf-profile.self.cycles-pp.read_tsc
      0.26 ±  7%      -0.1        0.15 ± 22%  perf-profile.self.cycles-pp.ktime_get_coarse_real_ts64
      0.98 ±  4%      -0.1        0.87 ±  5%  perf-profile.self.cycles-pp.poll_idle
      0.56 ±  3%      -0.1        0.46 ±  7%  perf-profile.self.cycles-pp.update_fast_timekeeper
      0.32 ±  5%      -0.1        0.23 ±  5%  perf-profile.self.cycles-pp.timekeeping_inject_offset
      1.12 ±  2%      -0.1        1.03 ±  2%  perf-profile.self.cycles-pp._copy_to_user
      0.50 ±  2%      -0.1        0.41 ±  5%  perf-profile.self.cycles-pp.timekeeping_forward_now
      0.51 ±  7%      -0.1        0.45 ±  3%  perf-profile.self.cycles-pp.hrtimer_interrupt
      0.30 ±  8%      -0.1        0.24 ±  3%  perf-profile.self.cycles-pp.lockref_get_not_dead
      0.10 ±  9%      -0.1        0.04 ± 45%  perf-profile.self.cycles-pp.enqueue_entity
      0.13 ± 18%      -0.1        0.08 ± 25%  perf-profile.self.cycles-pp.ktime_mono_to_any
      0.27 ±  6%      -0.1        0.21 ±  5%  perf-profile.self.cycles-pp.apparmor_capable
      0.16 ±  8%      -0.1        0.10 ± 21%  perf-profile.self.cycles-pp.hrtimer_start_range_ns
      0.18 ±  9%      -0.0        0.14 ±  8%  perf-profile.self.cycles-pp.kobject_get
      0.15 ± 11%      -0.0        0.11 ± 14%  perf-profile.self.cycles-pp.resched_curr
      0.59 ±  2%      -0.0        0.56 ±  2%  perf-profile.self.cycles-pp.lapic_next_deadline
      0.41 ±  4%      -0.0        0.38 ±  3%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.33 ±  3%      -0.0        0.30 ±  5%  perf-profile.self.cycles-pp.do_syscall_64
      0.20 ±  7%      -0.0        0.17 ±  5%  perf-profile.self.cycles-pp.syscall_exit_to_user_mode
      0.26 ±  5%      -0.0        0.23 ±  4%  perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.19 ±  3%      -0.0        0.16 ±  8%  perf-profile.self.cycles-pp.apparmor_file_alloc_security
      0.10 ±  4%      -0.0        0.08 ±  8%  perf-profile.self.cycles-pp.update_curr
      0.14 ± 11%      +0.0        0.18 ±  7%  perf-profile.self.cycles-pp.up_read
      0.17 ±  5%      +0.0        0.21 ±  6%  perf-profile.self.cycles-pp.__switch_to
      0.01 ±223%      +0.1        0.08 ±  8%  perf-profile.self.cycles-pp.dequeue_task_fair
     20.56            +0.9       21.48        perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      9.08 ±  8%      +5.2       14.24 ±  2%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
  
Luck, Tony Oct. 11, 2023, 3:16 p.m. UTC | #2
> kernel test robot noticed a -8.8% regression of stress-ng.clock.ops_per_sec on:
>
>
> commit: 26bff7b04b829cccc6a97726d6398391a62e34ef ("[PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation")
> url: https://github.com/intel-lab-lkp/linux/commits/Tony-Luck/x86-mce-Remove-old-CMCI-storm-mitigation-code/20231005-024047
> patch link: https://lore.kernel.org/all/20231004183623.17067-3-tony.luck@intel.com/
> patch subject: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation
>
> testcase: stress-ng
> test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> parameters:
>
>       nr_threads: 10%
>       disk: 1HDD
>       testtime: 60s
>       fs: ext4
>       class: os
>       test: clock
>       cpufreq_governor: performance
>
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202310111637.dee70328-oliver.sang@intel.com

Is the test injecting massive numbers of corrected memory errors? The code in this patch
is only executed when handling CMCI interrupts, or polling machine check banks (at most
once per second).

I'm guessing this report is just because alignment of some hot path code changed.

-Tony
  
Feng Tang Oct. 11, 2023, 3:42 p.m. UTC | #3
On Wed, Oct 11, 2023 at 11:16:46PM +0800, Luck, Tony wrote:
> > kernel test robot noticed a -8.8% regression of stress-ng.clock.ops_per_sec on:
> >
> >
> > commit: 26bff7b04b829cccc6a97726d6398391a62e34ef ("[PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation")
> > url: https://github.com/intel-lab-lkp/linux/commits/Tony-Luck/x86-mce-Remove-old-CMCI-storm-mitigation-code/20231005-024047
> > patch link: https://lore.kernel.org/all/20231004183623.17067-3-tony.luck@intel.com/
> > patch subject: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation
> >
> > testcase: stress-ng
> > test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> > parameters:
> >
> >       nr_threads: 10%
> >       disk: 1HDD
> >       testtime: 60s
> >       fs: ext4
> >       class: os
> >       test: clock
> >       cpufreq_governor: performance
> >
> >
> >
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@intel.com>
> > | Closes: https://lore.kernel.org/oe-lkp/202310111637.dee70328-oliver.sang@intel.com
> 
> Is the test injecting massive numbers of corrected memory errors? The code in this patch
> is only executed when handling CMCI interrupts, or polling machine check banks (at most
> once per second).
> 
> I'm guessing this report is just because alignment of some hot path code changed.

IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
kernel config for quite a while, to force each funtion's start
address aligned on 64 bytes. Don't know if this has been changed
recently.

Also I noticed the patch introduce a new per-cpu variable 'storm_desc",
if the function address is 64B aligned, then per-cpu data alignment
may be related.

Thanks,
Feng

> 
> -Tony
  
Luck, Tony Oct. 11, 2023, 5:23 p.m. UTC | #4
> IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
> kernel config for quite a while, to force each funtion's start
> address aligned on 64 bytes. Don't know if this has been changed
> recently.
>
> Also I noticed the patch introduce a new per-cpu variable 'storm_desc",
> if the function address is 64B aligned, then per-cpu data alignment
> may be related.

If adding (removing, changing) per-cpu variables can cause 8% performance
changes, then maybe there needs to be some way to insulate the builds
from these side effects (as was done with the 64-byte function alignment).
I've no ideas on how to achieve that :-(

"storm_desc" is a largish per-cpu variable at 1568 bytes. But that doesn't
quite get it into the top 20.

In the config used for my kernel build, it does appear just before a bunch
of interrupt related per-cpu variables. Those might be sensitive to alignment?

000000000001ff40 d smca_misc_banks_map
000000000001ff48 d bank_map
000000000001ff50 d threshold_banks
000000000001ff60 d storm_desc
0000000000020580 d tsc_adjust
00000000000205c0 d lapic_events
00000000000206c0 d vector_cleanup
0000000000020720 D __uv_cpu_info
0000000000020738 d ipi_mask

-Tony
  
Philip Li Oct. 12, 2023, 2:35 a.m. UTC | #5
On Wed, Oct 11, 2023 at 11:42:26PM +0800, Feng Tang wrote:
> On Wed, Oct 11, 2023 at 11:16:46PM +0800, Luck, Tony wrote:
> > > kernel test robot noticed a -8.8% regression of stress-ng.clock.ops_per_sec on:
> > >
> > >
> > > commit: 26bff7b04b829cccc6a97726d6398391a62e34ef ("[PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation")
> > > url: https://github.com/intel-lab-lkp/linux/commits/Tony-Luck/x86-mce-Remove-old-CMCI-storm-mitigation-code/20231005-024047
> > > patch link: https://lore.kernel.org/all/20231004183623.17067-3-tony.luck@intel.com/
> > > patch subject: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation
> > >
> > > testcase: stress-ng
> > > test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> > > parameters:
> > >
> > >       nr_threads: 10%
> > >       disk: 1HDD
> > >       testtime: 60s
> > >       fs: ext4
> > >       class: os
> > >       test: clock
> > >       cpufreq_governor: performance
> > >
> > >
> > >
> > >
> > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > the same patch/commit), kindly add following tags
> > > | Reported-by: kernel test robot <oliver.sang@intel.com>
> > > | Closes: https://lore.kernel.org/oe-lkp/202310111637.dee70328-oliver.sang@intel.com
> > 
> > Is the test injecting massive numbers of corrected memory errors? The code in this patch
> > is only executed when handling CMCI interrupts, or polling machine check banks (at most
> > once per second).
> > 
> > I'm guessing this report is just because alignment of some hot path code changed.
> 
> IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
> kernel config for quite a while, to force each funtion's start
> address aligned on 64 bytes. Don't know if this has been changed
> recently.

yes, the kernel is always built with -falign-functions=64 to test
performance now.

> 
> Also I noticed the patch introduce a new per-cpu variable 'storm_desc",
> if the function address is 64B aligned, then per-cpu data alignment
> may be related.
> 
> Thanks,
> Feng
> 
> > 
> > -Tony
>
  
Feng Tang Oct. 12, 2023, 5:36 a.m. UTC | #6
On Thu, Oct 12, 2023 at 01:23:54AM +0800, Luck, Tony wrote:
> > IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
> > kernel config for quite a while, to force each funtion's start
> > address aligned on 64 bytes. Don't know if this has been changed
> > recently.
> >
> > Also I noticed the patch introduce a new per-cpu variable 'storm_desc",
> > if the function address is 64B aligned, then per-cpu data alignment
> > may be related.
> 
> If adding (removing, changing) per-cpu variables can cause 8% performance
> changes, then maybe there needs to be some way to insulate the builds
> from these side effects (as was done with the 64-byte function alignment).
> I've no ideas on how to achieve that :-(

Philip helped to clarify that 0Day kernel build does enable the 64 bytes
function address alignment.

> "storm_desc" is a largish per-cpu variable at 1568 bytes. But that doesn't
> quite get it into the top 20.
> 
> In the config used for my kernel build, it does appear just before a bunch
> of interrupt related per-cpu variables. Those might be sensitive to alignment?
> 
> 000000000001ff40 d smca_misc_banks_map
> 000000000001ff48 d bank_map
> 000000000001ff50 d threshold_banks
> 000000000001ff60 d storm_desc
> 0000000000020580 d tsc_adjust
> 00000000000205c0 d lapic_events
> 00000000000206c0 d vector_cleanup
> 0000000000020720 D __uv_cpu_info
> 0000000000020738 d ipi_mask

I did further check by putting 'storm_desc' into another section with
below patch, and the regression is restored, which indictes it is
related with these per-cpu data alignment change.

diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index ecdf13f1bb7d5..7f32126f216bc 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -41,7 +41,7 @@ struct mca_storm_desc {
 	bool			poll_mode;
 };
 
-static DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct mca_storm_desc, storm_desc);

Thanks,
Feng

> 
> -Tony
  
Feng Tang Oct. 12, 2023, 5:56 a.m. UTC | #7
On Thu, Oct 12, 2023 at 01:36:41PM +0800, Feng Tang wrote:
> On Thu, Oct 12, 2023 at 01:23:54AM +0800, Luck, Tony wrote:
> > > IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
> > > kernel config for quite a while, to force each funtion's start
> > > address aligned on 64 bytes. Don't know if this has been changed
> > > recently.
> > >
> > > Also I noticed the patch introduce a new per-cpu variable 'storm_desc",
> > > if the function address is 64B aligned, then per-cpu data alignment
> > > may be related.
> > 
> > If adding (removing, changing) per-cpu variables can cause 8% performance
> > changes, then maybe there needs to be some way to insulate the builds
> > from these side effects (as was done with the 64-byte function alignment).
> > I've no ideas on how to achieve that :-(

As for mitigation (how to reduce these kind of strange performance
changes), I have proposed a "selective isolation" in LPC 2021, in
page 15 of https://lpc.events/event/11/contributions/895/attachments/770/1603/Strange_kernel_performance_changes_lpc_2021.pdf

As kernel binary is compactly linked together, the alignment change
of text/data in earlier modules could affect other modules after them.
MCE module is much easier to hit these strange performance changes,
just because it sit in arch/ folder and get linked early in kernel
binary.

The idea of "selective isolation" is simple, by picking some modules
sparsely and enforce some alignment to function/data/per-cpu data,
so that they can act as fences/borders, separate kernel into multiple
capsules, and make the alignment changes only take effect inside
that specific capsule.

Any thoughts? thanks!

- Feng
 
> Philip helped to clarify that 0Day kernel build does enable the 64 bytes
> function address alignment.

[...]
  
Borislav Petkov Oct. 19, 2023, 3:12 p.m. UTC | #8
On Wed, Oct 04, 2023 at 11:36:22AM -0700, Tony Luck wrote:
> +/*
> + * history:	bitmask tracking whether errors were seen or not seen in
> + *		the most recent polls of a bank.

		each bit in that bitmask represents an error seen.

> + * timestamp:	last time (in jiffies) that the bank was polled
> + * storm:	Is this bank in storm mode?
> + */
> +struct storm_bank {
> +	u64 history;
> +	u64 timestamp;
> +	bool storm;

I guess "in_storm_mode" is even more descriptive:

	storm->banks[bank].in_storm_mode = false;

etc.

> +};
> +
> +/* How many errors within the history buffer mark the start of a storm. */
> +#define STORM_BEGIN_THRESHOLD	5
> +
> +/*
> + * How many polls of machine check bank without an error before declaring
> + * the storm is over. Since it is tracked by the bitmaks in the history
> + * field of struct storm_bank the mask is 30 bits [0 ... 29].
> + */
> +#define STORM_END_POLL_THRESHOLD	29
>  
>  #ifdef CONFIG_ACPI_APEI
>  int apei_write_mce(struct mce *m);
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index f6e87443b37a..7c931f0c9251 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -680,6 +680,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
>  		barrier();
>  		m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
>  
> +		mce_track_storm(&m);

Lemme see if I understand the idea here:

the default polling interval is 5 mins. So the storm tracking is called
every 5 mins once to see how are banks "doing", so to speak and to get
them in or out of storm mode. So far so good...

> +
>  		/* If this entry is not valid, ignore it */
>  		if (!(m.status & MCI_STATUS_VAL))
>  			continue;

Btw, you're tracking storm even if the error is not valid - conditional
above here. Why?

> @@ -1652,22 +1654,29 @@ static void mce_timer_fn(struct timer_list *t)
>  	else
>  		iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
>  
> -	__this_cpu_write(mce_next_interval, iv);
> -	__start_timer(t, iv);
> +	if (mce_is_storm()) {
> +		__start_timer(t, HZ);
> +	} else {
> +		__this_cpu_write(mce_next_interval, iv);
> +		__start_timer(t, iv);
> +	}

... this is where it becomes, hm, interesting: the check interval will
be halved if an error has been seen during this round but then if we're
in storm mode, that check interval doesn't matter - you'll run the timer
each second.

Then you need to restructure this to check the storm condition and not
do anything to iv if storm.

Or, am I missing something?

> diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
> index ef4e7bb5fd88..ecdf13f1bb7d 100644
> --- a/arch/x86/kernel/cpu/mce/threshold.c
> +++ b/arch/x86/kernel/cpu/mce/threshold.c
> @@ -29,3 +29,115 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_threshold)
>  	trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
>  	apic_eoi();
>  }
> +
> +/*
> + * banks:		per-cpu, per-bank details
> + * stormy_bank_count:	count of MC banks in storm state
> + * poll_mode:		CPU is in poll mode
> + */
> +struct mca_storm_desc {
> +	struct storm_bank	banks[MAX_NR_BANKS];
> +	u8			stormy_bank_count;
> +	bool			poll_mode;
> +};

Yeah, put the struct definition into internal.h pls.

> +static DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
> +
> +void mce_inherit_storm(unsigned int bank)
> +{
> +	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
> +
> +	storm->banks[bank].history = ~0ull;

So upon inheriting a bank, you set its history that it has seen errors
each time?

That's weird.

> +	storm->banks[bank].timestamp = jiffies;
> +}
> +
> +bool mce_is_storm(void)

That's a weird name. mce_get_storm_mode() perhaps?

> +{
> +	return __this_cpu_read(storm_desc.poll_mode);
> +}
> +
> +void mce_set_storm(bool storm)

mce_set_storm_mode()

> +{
> +	__this_cpu_write(storm_desc.poll_mode, storm);
> +}
> +
> +static void mce_handle_storm(unsigned int bank, bool on)
> +{
> +	switch (boot_cpu_data.x86_vendor) {
> +	}
> +}

...

> +void mce_track_storm(struct mce *mce)
> +{
> +	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
> +	unsigned long now = jiffies, delta;
> +	unsigned int shift = 1;
> +	u64 history = 0;
> +
> +	/*
> +	 * When a bank is in storm mode it is polled once per second and
> +	 * the history mask will record about the last minute of poll results.
> +	 * If it is not in storm mode, then the bank is only checked when
> +	 * there is a CMCI interrupt. Check how long it has been since
> +	 * this bank was last checked, and adjust the amount of "shift"
> +	 * to apply to history.
> +	 */
> +	if (!storm->banks[mce->bank].storm) {

Yeah, if this were

	if (!storm->banks[mce->bank].in_storm_mode)

it would've been perfectly clear what the condition tests.

> +		delta = now - storm->banks[mce->bank].timestamp;
> +		shift = (delta + HZ) / HZ;
> +	}
> +
> +	/* If it has been a long time since the last poll, clear history. */
> +	if (shift < 64)

Use a properly named define instead of a naked number.

...

Thx.
  
Luck, Tony Oct. 23, 2023, 6:14 p.m. UTC | #9
On Thu, Oct 19, 2023 at 05:12:11PM +0200, Borislav Petkov wrote:
> On Wed, Oct 04, 2023 at 11:36:22AM -0700, Tony Luck wrote:
> > +/*
> > + * history:	bitmask tracking whether errors were seen or not seen in
> > + *		the most recent polls of a bank.
> 
> 		each bit in that bitmask represents an error seen.

Yes. That reads better.

> 
> > + * timestamp:	last time (in jiffies) that the bank was polled
> > + * storm:	Is this bank in storm mode?
> > + */
> > +struct storm_bank {
> > +	u64 history;
> > +	u64 timestamp;
> > +	bool storm;
> 
> I guess "in_storm_mode" is even more descriptive:
> 
> 	storm->banks[bank].in_storm_mode = false;
> 
> etc.

Will fix.

> 
> > +};
> > +
> > +/* How many errors within the history buffer mark the start of a storm. */
> > +#define STORM_BEGIN_THRESHOLD	5
> > +
> > +/*
> > + * How many polls of machine check bank without an error before declaring
> > + * the storm is over. Since it is tracked by the bitmaks in the history
> > + * field of struct storm_bank the mask is 30 bits [0 ... 29].
> > + */
> > +#define STORM_END_POLL_THRESHOLD	29
> >  
> >  #ifdef CONFIG_ACPI_APEI
> >  int apei_write_mce(struct mce *m);
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index f6e87443b37a..7c931f0c9251 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -680,6 +680,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
> >  		barrier();
> >  		m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
> >  
> > +		mce_track_storm(&m);
> 
> Lemme see if I understand the idea here:
> 
> the default polling interval is 5 mins. So the storm tracking is called
> every 5 mins once to see how are banks "doing", so to speak and to get
> them in or out of storm mode. So far so good...
> 
> > +
> >  		/* If this entry is not valid, ignore it */
> >  		if (!(m.status & MCI_STATUS_VAL))
> >  			continue;
> 
> Btw, you're tracking storm even if the error is not valid - conditional
> above here. Why?

I want to track whether each bank is in storm mode, or not. But there's
no indication when a CMCI is delivered which bank was the source. Code
just has to scan all the banks, and might find more than one with an
error. While no bank is in polling mode, there isn't a set time interval
between scanning banks. A scan is just triggered when a CMCI happens.
So it's non-trivial to compute a rate. Would require saving a timestamp
for every logged error.

In a simple case there's just one bank responsible for a ton of CMCI.
No need for complexity here, the count of interrupts from that bank will
hit a threshold and a storm is declared.

But more complex scenarois are possible. Other banks may trigger small
numbers of CMCI. Not enough to call it a storm.  Or multiple banks may
be screaming together.

By tracking both the hits and misses in each bank, I end up with a
bitmap history for the past 64 polls. If there are enough "1" bits in
that bitmap to meet the threshold, then declare a storm for that bank.

> 
> > @@ -1652,22 +1654,29 @@ static void mce_timer_fn(struct timer_list *t)
> >  	else
> >  		iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
> >  
> > -	__this_cpu_write(mce_next_interval, iv);
> > -	__start_timer(t, iv);
> > +	if (mce_is_storm()) {
> > +		__start_timer(t, HZ);
> > +	} else {
> > +		__this_cpu_write(mce_next_interval, iv);
> > +		__start_timer(t, iv);
> > +	}
> 
> ... this is where it becomes, hm, interesting: the check interval will
> be halved if an error has been seen during this round but then if we're
> in storm mode, that check interval doesn't matter - you'll run the timer
> each second.
> 
> Then you need to restructure this to check the storm condition and not
> do anything to iv if storm.
> 
> Or, am I missing something?

I need to stare at this again to refresh my memory of what's going on
here. This code may need pulling apart into a routine that is used for
systems with no CMCI (or have CMCI disabled). Then the whole "divide the
poll interval by two" when you see an error and double the interval
when you don't see an error makes sense.

For systems with CMCI ... I think just polling a one second interval
until the storm is over makes sense.

> 
> > diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
> > index ef4e7bb5fd88..ecdf13f1bb7d 100644
> > --- a/arch/x86/kernel/cpu/mce/threshold.c
> > +++ b/arch/x86/kernel/cpu/mce/threshold.c
> > @@ -29,3 +29,115 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_threshold)
> >  	trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
> >  	apic_eoi();
> >  }
> > +
> > +/*
> > + * banks:		per-cpu, per-bank details
> > + * stormy_bank_count:	count of MC banks in storm state
> > + * poll_mode:		CPU is in poll mode
> > + */
> > +struct mca_storm_desc {
> > +	struct storm_bank	banks[MAX_NR_BANKS];
> > +	u8			stormy_bank_count;
> > +	bool			poll_mode;
> > +};
> 
> Yeah, put the struct definition into internal.h pls.

These are only used in threshold.c now. What's the point of them
being in internal.h. That's for defintiones shared by multiple
mcs/*.c files. Isn't it? But will move there if you still want this.

> 
> > +static DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
> > +
> > +void mce_inherit_storm(unsigned int bank)
> > +{
> > +	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
> > +
> > +	storm->banks[bank].history = ~0ull;
> 
> So upon inheriting a bank, you set its history that it has seen errors
> each time?
> 
> That's weird.

Ideally the new CPU would inherit the precise state of the previous
owner of this bank. But there's no easy way to track that as the bank
is abanoned by the CPU going offline, and there is a free-for-all with
remaining CPUs racing to claim ownership. It is known that this bank
was in storm mode (because the threshold in the CTL2 bank register is
set to CMCI_STORM_THRESHOLD).

I went with "worst case" to make sure the new CPU didn't prematurely
declare an end to the storm.

I'll add a comment in mce_inherit_storm() to explain this.

> 
> > +	storm->banks[bank].timestamp = jiffies;
> > +}
> > +
> > +bool mce_is_storm(void)
> 
> That's a weird name. mce_get_storm_mode() perhaps?

Sure.

> 
> > +{
> > +	return __this_cpu_read(storm_desc.poll_mode);
> > +}
> > +
> > +void mce_set_storm(bool storm)
> 
> mce_set_storm_mode()

Also sure.

> 
> > +{
> > +	__this_cpu_write(storm_desc.poll_mode, storm);
> > +}
> > +
> > +static void mce_handle_storm(unsigned int bank, bool on)
> > +{
> > +	switch (boot_cpu_data.x86_vendor) {
> > +	}
> > +}
> 
> ...
> 
> > +void mce_track_storm(struct mce *mce)
> > +{
> > +	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
> > +	unsigned long now = jiffies, delta;
> > +	unsigned int shift = 1;
> > +	u64 history = 0;
> > +
> > +	/*
> > +	 * When a bank is in storm mode it is polled once per second and
> > +	 * the history mask will record about the last minute of poll results.
> > +	 * If it is not in storm mode, then the bank is only checked when
> > +	 * there is a CMCI interrupt. Check how long it has been since
> > +	 * this bank was last checked, and adjust the amount of "shift"
> > +	 * to apply to history.
> > +	 */
> > +	if (!storm->banks[mce->bank].storm) {
> 
> Yeah, if this were
> 
> 	if (!storm->banks[mce->bank].in_storm_mode)
> 
> it would've been perfectly clear what the condition tests.

Yup.

> 
> > +		delta = now - storm->banks[mce->bank].timestamp;
> > +		shift = (delta + HZ) / HZ;
> > +	}
> > +
> > +	/* If it has been a long time since the last poll, clear history. */
> > +	if (shift < 64)
> 
> Use a properly named define instead of a naked number.

Like this?

#define NUM_HISTORY_BITS (sizeof(u64) * BITS_PER_BYTE)

	if (shift < NUM_HISTORY_BITS)
> 
> ...
> 
> Thx.
> 
> -- 
> Regards/Gruss,
>     Boris.

-Tony
  
Borislav Petkov Nov. 14, 2023, 7:23 p.m. UTC | #10
On Mon, Oct 23, 2023 at 11:14:04AM -0700, Tony Luck wrote:
> I want to track whether each bank is in storm mode, or not. But there's
> no indication when a CMCI is delivered which bank was the source. Code
> just has to scan all the banks, and might find more than one with an
> error. While no bank is in polling mode, there isn't a set time interval
> between scanning banks.

Well, if no bank is in *storm* polling mode - I presume you mean that
when you say "polling mode" - then we have the default polling interval
of machine_check_poll() of INITIAL_CHECK_INTERVAL, i.e., 5 mins, I'd
say.

> A scan is just triggered when a CMCI happens. So it's non-trivial to
> compute a rate. Would require saving a timestamp for every logged
> error.

So what I'm trying to establish first is, what our entry vectors into
the storm code are.

1. You can enter storm tracking when you poll normally. I.e.,
   machine_check_poll() each 5 mins once.

   mce_track_storm() tracks history only for MCI_STATUS_VAL=1b CEs, which
   is what I was wondering. It is hidden a bit down in that function.

2. As a result of a CMCI interrupt. That will call machine_check_poll()
   too and go down the same path.

So that flow should be called out in the commit message so that the big
picture is clear - how we're doing that storm tracking.

Now, what is protecting this against concurrent runs of
machine_check_poll()? Imagine the timer fires and a CMCI happens at the
same time and on the same core.

> In a simple case there's just one bank responsible for a ton of CMCI.
> No need for complexity here, the count of interrupts from that bank will
> hit a threshold and a storm is declared.

Right.

> But more complex scenarois are possible. Other banks may trigger small
> numbers of CMCI. Not enough to call it a storm.  Or multiple banks may
> be screaming together.
> 
> By tracking both the hits and misses in each bank, I end up with a
> bitmap history for the past 64 polls. If there are enough "1" bits in
> that bitmap to meet the threshold, then declare a storm for that
> bank.

Yap, I only want to be crystal clear on the flow and the entry points.

> I need to stare at this again to refresh my memory of what's going on
> here. This code may need pulling apart into a routine that is used for
> systems with no CMCI (or have CMCI disabled). Then the whole "divide the
> poll interval by two" when you see an error and double the interval
> when you don't see an error makes sense.
> 
> For systems with CMCI ... I think just polling a one second interval
> until the storm is over makes sense.

Ok.

> These are only used in threshold.c now. What's the point of them
> being in internal.h. That's for defintiones shared by multiple
> mcs/*.c files. Isn't it? But will move there if you still want this.

Structs hidden in .c files looks weird but ok.

> Ideally the new CPU would inherit the precise state of the previous
> owner of this bank. But there's no easy way to track that as the bank
> is abanoned by the CPU going offline, and there is a free-for-all with
> remaining CPUs racing to claim ownership. It is known that this bank
> was in storm mode (because the threshold in the CTL2 bank register is
> set to CMCI_STORM_THRESHOLD).
> 
> I went with "worst case" to make sure the new CPU didn't prematurely
> declare an end to the storm.
> 
> I'll add a comment in mce_inherit_storm() to explain this.

Yap, exactly.

> Like this?
> 
> #define NUM_HISTORY_BITS (sizeof(u64) * BITS_PER_BYTE)
> 
> 	if (shift < NUM_HISTORY_BITS)

Yap.

Thx.
  
Luck, Tony Nov. 14, 2023, 10:04 p.m. UTC | #11
On Tue, Nov 14, 2023 at 08:23:24PM +0100, Borislav Petkov wrote:
> On Mon, Oct 23, 2023 at 11:14:04AM -0700, Tony Luck wrote:
> > I want to track whether each bank is in storm mode, or not. But there's
> > no indication when a CMCI is delivered which bank was the source. Code
> > just has to scan all the banks, and might find more than one with an
> > error. While no bank is in polling mode, there isn't a set time interval
> > between scanning banks.
> 
> Well, if no bank is in *storm* polling mode - I presume you mean that
> when you say "polling mode" - then we have the default polling interval
> of machine_check_poll() of INITIAL_CHECK_INTERVAL, i.e., 5 mins, I'd
> say.

Initial state is that no banks are in storm mode. Errors logged in any
bank are signaled with a CMCI to all logical CPUs that share that bank.
But the "mce_banks_owned" per-cpu bitmask means that only one of those
CPUs will get past the test_bit() check at the top of the loop in
machine_check_poll():

        for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
                if (!mce_banks[i].ctl || !test_bit(i, *b))
                        continue;

Before any storm happens, machine_check_poll() may only be called once
a month, or less, when errors occur.

When a storm is detected for a bank, that bank (and any others in storm
mode) will be checked once per second.

> > A scan is just triggered when a CMCI happens. So it's non-trivial to
> > compute a rate. Would require saving a timestamp for every logged
> > error.
> 
> So what I'm trying to establish first is, what our entry vectors into
> the storm code are.
> 
> 1. You can enter storm tracking when you poll normally. I.e.,
>    machine_check_poll() each 5 mins once.

For a bank that doesn't support CMCI, then polling is the only way
to find errors. You are right, these will feed into the history
tracker, but while at 5-minute interval will not be able to trigger
a storm. Since that 5-minute interval is halved every time error is
found consecutively, it is possible at the 1-second poll interval to
fill out enough bits to indicate a storm. I think I need to add some
code to handle that case as it makes no sense to mess with the CMCI
threshold in IA32_MCi_CTL2 for a bank that doesn't support CMCI.
Probably will just skip tracking for any such banks.

>    mce_track_storm() tracks history only for MCI_STATUS_VAL=1b CEs, which
>    is what I was wondering. It is hidden a bit down in that function.
> 
> 2. As a result of a CMCI interrupt. That will call machine_check_poll()
>    too and go down the same path.
> 
> So that flow should be called out in the commit message so that the big
> picture is clear - how we're doing that storm tracking.

Yes. Those are both inputs to the storm tracker. I'll update the commit
message to describe that.

> Now, what is protecting this against concurrent runs of
> machine_check_poll()? Imagine the timer fires and a CMCI happens at the
> same time and on the same core.

Aren't interrupts disabled while running the code after the timer fires?
Whichever of the timer and the CMCI happens first will run. Second to
arrive will pend the interrupt and be handled when interrupts are
enabled as the first completes.

... rest of message trimmed, no open discussion items there.

-Tony
  
Borislav Petkov Nov. 21, 2023, 11:54 a.m. UTC | #12
On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
> Before any storm happens, machine_check_poll() may only be called once
> a month, or less, when errors occur.

Err:

[  317.825546] mce: mc_poll_banks_default: CPU2 irq ctxt level: 1
[  317.825585] mce: mc_poll_banks_default: CPU0 irq ctxt level: 1
[  317.825585] mce: mc_poll_banks_default: CPU1 irq ctxt level: 1
[  317.825586] mce: mc_poll_banks_default: CPU3 irq ctxt level: 1
[  317.825586] mce: mc_poll_banks_default: CPU4 irq ctxt level: 1
[  317.825586] mce: mc_poll_banks_default: CPU5 irq ctxt level: 1
[  629.121536] mce: mc_poll_banks_default: CPU1 irq ctxt level: 1
[  629.121536] mce: mc_poll_banks_default: CPU4 irq ctxt level: 1
[  629.121560] mce: mc_poll_banks_default: CPU2 irq ctxt level: 1
[  629.121561] mce: mc_poll_banks_default: CPU0 irq ctxt level: 1
[  629.121561] mce: mc_poll_banks_default: CPU5 irq ctxt level: 1
[  629.121569] mce: mc_poll_banks_default: CPU3 irq ctxt level: 1
[  940.417507] mce: mc_poll_banks_default: CPU2 irq ctxt level: 1
[  940.417508] mce: mc_poll_banks_default: CPU3 irq ctxt level: 1
[  940.417508] mce: mc_poll_banks_default: CPU1 irq ctxt level: 1
[  940.417508] mce: mc_poll_banks_default: CPU4 irq ctxt level: 1
[  940.417509] mce: mc_poll_banks_default: CPU5 irq ctxt level: 1
[  940.417508] mce: mc_poll_banks_default: CPU0 irq ctxt level: 1
...

That's from my coffeelake test box.

The irq context level thing says we're in softirq context when the
polling happens.

> When a storm is detected for a bank, that bank (and any others in storm
> mode) will be checked once per second.

Ok.

> For a bank that doesn't support CMCI, then polling is the only way
> to find errors. You are right, these will feed into the history
> tracker, but while at 5-minute interval will not be able to trigger
> a storm.

Yes. But you need to call into the storm handling code somehow. So you
do that from the polling code.

And if the machine supports CMCI, you do the same - call the polling
code which then does the storm check.

> Since that 5-minute interval is halved every time error is
> found consecutively, it is possible at the 1-second poll interval to
> fill out enough bits to indicate a storm. I think I need to add some
> code to handle that case as it makes no sense to mess with the CMCI
> threshold in IA32_MCi_CTL2 for a bank that doesn't support CMCI.
> Probably will just skip tracking for any such banks.

Ok.

> Aren't interrupts disabled while running the code after the timer fires?

No, see above.

> Whichever of the timer and the CMCI happens first will run. Second to
> arrive will pend the interrupt and be handled when interrupts are
> enabled as the first completes.

So I still don't like the timer calling machine_check_poll() and
cmci_mc_poll_banks() doing the same without any proper synchronization
between the two.

Yes, when you get a CMCI interrupt, you poll and do the call the storm
code. Now what happens if the polling runs from softirq context and you
get a CMCI interrupt at exactly the same time. I.e., is
machine_check_poll() reentrant and audited properly?

I hope I'm making more sense.

Thx.
  
Luck, Tony Nov. 27, 2023, 7:50 p.m. UTC | #13
On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote:
> On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
> > Whichever of the timer and the CMCI happens first will run. Second to
> > arrive will pend the interrupt and be handled when interrupts are
> > enabled as the first completes.
> 
> So I still don't like the timer calling machine_check_poll() and
> cmci_mc_poll_banks() doing the same without any proper synchronization
> between the two.

But it isn't doing the same thing.  The timer calls:

	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

and cmci_mc_poll_banks() calls:

	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

A bank is either in the bitmap of banks to poll from the timer,
or in one of the per-cpu bitmaps of banks "owned" by that CPU to
be checked when a CMCI occurs. But it can't be in both.

> Yes, when you get a CMCI interrupt, you poll and do the call the storm
> code. Now what happens if the polling runs from softirq context and you
> get a CMCI interrupt at exactly the same time. I.e., is
> machine_check_poll() reentrant and audited properly?

So nothing bad happens. If Linux was polling some set of banks from
the timer and is interrupted by CMCI, the interrupt will check some
disjoint set of banks. All the history tracking code is done per-bank,
so there is no overlap.

> I hope I'm making more sense.

Yes. Totally making sense. I was under the mistaken impression that
the mce timers used TIMER_IRQSAFE and the nested CMCI while processing
a timed poll couldn't happen. So I learned something here too.

I'll think of some comment to add to the history tracking code
to summarize this thread.

-Tony
  
Luck, Tony Nov. 27, 2023, 8:14 p.m. UTC | #14
On Mon, Nov 27, 2023 at 11:50:26AM -0800, Tony Luck wrote:
> On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote:
> > On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
> But it isn't doing the same thing.  The timer calls:
> 
> 	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
> 
> and cmci_mc_poll_banks() calls:
> 
> 	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

Bah ... I've cut & pasted the same thing ... but I think there
are separate bit maps ... maybe I'm wrong. Will go back and
look again.

-Tony
  
Luck, Tony Nov. 28, 2023, 12:42 a.m. UTC | #15
On Mon, Nov 27, 2023 at 12:14:28PM -0800, Tony Luck wrote:
> On Mon, Nov 27, 2023 at 11:50:26AM -0800, Tony Luck wrote:
> > On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote:
> > > On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
> > But it isn't doing the same thing.  The timer calls:
> > 
> > 	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
> > 
> > and cmci_mc_poll_banks() calls:
> > 
> > 	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

        machine_check_poll(0, this_cpu_ptr(&mce_banks_owned));

> 
> Bah ... I've cut & pasted the same thing ... but I think there
> are separate bit maps ... maybe I'm wrong. Will go back and
> look again.
> 
> -Tony
  
Yazen Ghannam Nov. 28, 2023, 3:32 p.m. UTC | #16
On 11/27/2023 7:42 PM, Tony Luck wrote:
> On Mon, Nov 27, 2023 at 12:14:28PM -0800, Tony Luck wrote:
>> On Mon, Nov 27, 2023 at 11:50:26AM -0800, Tony Luck wrote:
>>> On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote:
>>>> On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
>>> But it isn't doing the same thing.  The timer calls:
>>>
>>> 	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
>>>
>>> and cmci_mc_poll_banks() calls:
>>>
>>> 	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
> 
>          machine_check_poll(0, this_cpu_ptr(&mce_banks_owned));
> 
>>
>> Bah ... I've cut & pasted the same thing ... but I think there
>> are separate bit maps ... maybe I'm wrong. Will go back and
>> look again.
>>

This mutually exclusive behavior is not enforced on AMD systems. A bank 
can be polled and signaled using hardware interrupts.

I've been thinking to change this in order to save polling cycles. Now 
it seems there's another reason. :)

Thanks,
Yazen
  
Borislav Petkov Dec. 14, 2023, 4:58 p.m. UTC | #17
On Mon, Nov 27, 2023 at 04:42:02PM -0800, Tony Luck wrote:
> On Mon, Nov 27, 2023 at 12:14:28PM -0800, Tony Luck wrote:
> > On Mon, Nov 27, 2023 at 11:50:26AM -0800, Tony Luck wrote:
> > > On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote:
> > > > On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
> > > But it isn't doing the same thing.  The timer calls:
> > > 
> > > 	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
> > > 
> > > and cmci_mc_poll_banks() calls:
> > > 
> > > 	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
> 
>         machine_check_poll(0, this_cpu_ptr(&mce_banks_owned));

Hmm, so I applied your v10 and this call with mce_banks_owned is done in
cmci_recheck() only. Which is on some init path.

The thresholding interrupt calls it too.

The timer ends up calling mc_poll_banks_default() which does

	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

I presume we don't do:

        if (!cmci_supported(&banks)) {
                mc_poll_banks = cmci_mc_poll_banks;
                return;
        }

usually on Intel. And even if we did, cmci_mc_poll_banks() calls

	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

too.

So regardless what machine you have, you do call the mc_poll_banks
pointer which in both cases does

	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));

The *thresholding* interrupt does

	machine_check_poll(0, this_cpu_ptr(&mce_banks_owned));

and you're saying

	mce_poll_banks and mce_banks_owned

are disjoint.

That's what you mean, right?

Because if so, yes, that makes sense. If the sets of MCA banks polled
and handled in the thresholding interrupt are disjoint, we should be ok.

Thx.
  
Luck, Tony Dec. 14, 2023, 6:03 p.m. UTC | #18
> and you're saying
>
>	mce_poll_banks and mce_banks_owned
>
> are disjoint.
>
> That's what you mean, right?

Yes. A bank is one or the other bitmap, but not both.

> Because if so, yes, that makes sense. If the sets of MCA banks polled
> and handled in the thresholding interrupt are disjoint, we should be ok.

Yes.

-Tony
  

Patch

diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 616732ec16f7..4aa957442c8d 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -54,7 +54,45 @@  static inline void intel_clear_lmce(void) { }
 static inline bool intel_filter_mce(struct mce *m) { return false; }
 #endif
 
-void mce_timer_kick(unsigned long interval);
+void mce_timer_kick(bool storm);
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+void cmci_storm_begin(unsigned int bank);
+void cmci_storm_end(unsigned int bank);
+void mce_track_storm(struct mce *mce);
+void mce_inherit_storm(unsigned int bank);
+bool mce_is_storm(void);
+void mce_set_storm(bool storm);
+#else
+static inline void cmci_storm_begin(unsigned int bank) {}
+static inline void cmci_storm_end(unsigned int bank) {}
+static inline void mce_track_storm(struct mce *mce) {}
+static inline void mce_inherit_storm(unsigned int bank) {}
+static inline bool mce_is_storm(void) { return false; }
+static inline void mce_set_storm(bool storm) {}
+#endif
+
+/*
+ * history:	bitmask tracking whether errors were seen or not seen in
+ *		the most recent polls of a bank.
+ * timestamp:	last time (in jiffies) that the bank was polled
+ * storm:	Is this bank in storm mode?
+ */
+struct storm_bank {
+	u64 history;
+	u64 timestamp;
+	bool storm;
+};
+
+/* How many errors within the history buffer mark the start of a storm. */
+#define STORM_BEGIN_THRESHOLD	5
+
+/*
+ * How many polls of machine check bank without an error before declaring
+ * the storm is over. Since it is tracked by the bitmaks in the history
+ * field of struct storm_bank the mask is 30 bits [0 ... 29].
+ */
+#define STORM_END_POLL_THRESHOLD	29
 
 #ifdef CONFIG_ACPI_APEI
 int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index f6e87443b37a..7c931f0c9251 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -680,6 +680,8 @@  bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		barrier();
 		m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
 
+		mce_track_storm(&m);
+
 		/* If this entry is not valid, ignore it */
 		if (!(m.status & MCI_STATUS_VAL))
 			continue;
@@ -1652,22 +1654,29 @@  static void mce_timer_fn(struct timer_list *t)
 	else
 		iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
 
-	__this_cpu_write(mce_next_interval, iv);
-	__start_timer(t, iv);
+	if (mce_is_storm()) {
+		__start_timer(t, HZ);
+	} else {
+		__this_cpu_write(mce_next_interval, iv);
+		__start_timer(t, iv);
+	}
 }
 
 /*
- * Ensure that the timer is firing in @interval from now.
+ * When a storm starts on any bank on this CPU, switch to polling
+ * once per second. When the storm ends, revert to the default
+ * polling interval.
  */
-void mce_timer_kick(unsigned long interval)
+void mce_timer_kick(bool storm)
 {
 	struct timer_list *t = this_cpu_ptr(&mce_timer);
-	unsigned long iv = __this_cpu_read(mce_next_interval);
 
-	__start_timer(t, interval);
+	mce_set_storm(storm);
 
-	if (interval < iv)
-		__this_cpu_write(mce_next_interval, interval);
+	if (storm)
+		__start_timer(t, HZ);
+	else
+		__this_cpu_write(mce_next_interval, check_interval * HZ);
 }
 
 /* Must not be called in IRQ context where del_timer_sync() can deadlock */
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index ef4e7bb5fd88..ecdf13f1bb7d 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -29,3 +29,115 @@  DEFINE_IDTENTRY_SYSVEC(sysvec_threshold)
 	trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
 	apic_eoi();
 }
+
+/*
+ * banks:		per-cpu, per-bank details
+ * stormy_bank_count:	count of MC banks in storm state
+ * poll_mode:		CPU is in poll mode
+ */
+struct mca_storm_desc {
+	struct storm_bank	banks[MAX_NR_BANKS];
+	u8			stormy_bank_count;
+	bool			poll_mode;
+};
+
+static DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
+
+void mce_inherit_storm(unsigned int bank)
+{
+	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+	storm->banks[bank].history = ~0ull;
+	storm->banks[bank].timestamp = jiffies;
+}
+
+bool mce_is_storm(void)
+{
+	return __this_cpu_read(storm_desc.poll_mode);
+}
+
+void mce_set_storm(bool storm)
+{
+	__this_cpu_write(storm_desc.poll_mode, storm);
+}
+
+static void mce_handle_storm(unsigned int bank, bool on)
+{
+	switch (boot_cpu_data.x86_vendor) {
+	}
+}
+
+void cmci_storm_begin(unsigned int bank)
+{
+	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+	__set_bit(bank, this_cpu_ptr(mce_poll_banks));
+	storm->banks[bank].storm = true;
+
+	/*
+	 * If this is the first bank on this CPU to enter storm mode
+	 * start polling.
+	 */
+	if (++storm->stormy_bank_count == 1)
+		mce_timer_kick(true);
+}
+
+void cmci_storm_end(unsigned int bank)
+{
+	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+	__clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+	storm->banks[bank].history = 0;
+	storm->banks[bank].storm = false;
+
+	/* If no banks left in storm mode, stop polling. */
+	if (!this_cpu_dec_return(storm_desc.stormy_bank_count))
+		mce_timer_kick(false);
+}
+
+void mce_track_storm(struct mce *mce)
+{
+	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+	unsigned long now = jiffies, delta;
+	unsigned int shift = 1;
+	u64 history = 0;
+
+	/*
+	 * When a bank is in storm mode it is polled once per second and
+	 * the history mask will record about the last minute of poll results.
+	 * If it is not in storm mode, then the bank is only checked when
+	 * there is a CMCI interrupt. Check how long it has been since
+	 * this bank was last checked, and adjust the amount of "shift"
+	 * to apply to history.
+	 */
+	if (!storm->banks[mce->bank].storm) {
+		delta = now - storm->banks[mce->bank].timestamp;
+		shift = (delta + HZ) / HZ;
+	}
+
+	/* If it has been a long time since the last poll, clear history. */
+	if (shift < 64)
+		history = storm->banks[mce->bank].history << shift;
+
+	storm->banks[mce->bank].timestamp = now;
+
+	/* History keeps track of corrected errors. VAL=1 && UC=0 */
+	if ((mce->status & MCI_STATUS_VAL) && mce_is_correctable(mce))
+		history |= 1;
+
+	storm->banks[mce->bank].history = history;
+
+	if (storm->banks[mce->bank].storm) {
+		if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD, 0))
+			return;
+		printk_deferred(KERN_NOTICE "CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), mce->bank);
+		mce_handle_storm(mce->bank, false);
+		cmci_storm_end(mce->bank);
+	} else {
+		if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+			return;
+		printk_deferred(KERN_NOTICE "CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), mce->bank);
+		mce_handle_storm(mce->bank, true);
+		cmci_storm_begin(mce->bank);
+	}
+}