[v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform

Message ID 20221013131200.973649-1-feng.tang@intel.com
State New
Headers
Series [v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform |

Commit Message

Feng Tang Oct. 13, 2022, 1:12 p.m. UTC
  There is report again that the tsc clocksource on a 4 sockets x86
Skylake server was wrongly judged as 'unstable' by 'jiffies' watchdog,
and disabled [1].

Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC
on qualified platorms") was introduce to deal with these false
alarms of tsc unstable issues, covering qualified platforms for 2
sockets or smaller ones.

Extend the exemption to 4 sockets to fix the issue.

We also got similar reports on 8 sockets platform from internal test,
but as Peter pointed out, there was tsc sync issues for 8-sockets
platform, and it'd better be handled architecture by architecture,
instead of directly changing the threshold to 8 here.

Rui also proposed another way to disable 'jiffies' as clocksource
watchdog [2], which can also solve this specific problem in an
architecture independent way, with one limitation that some tsc false
alarms are reported by other watchdogs like HPET in post-boot time,
while 'jiffies' is mostly used in boot phase before hardware
clocksources are initialized.

[1]. https://lore.kernel.org/all/9d3bf570-3108-0336-9c52-9bee15767d29@huawei.com/
[2]. https://lore.kernel.org/all/bd5b97f89ab2887543fc262348d1c7cafcaae536.camel@intel.com/

Reported-by: Yu Liao <liaoyu15@huawei.com>
Tested-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---

Hi reviewers:

In the v1 review cycle, Dave raised the issue that 'nr_online_nodes'
is not accurate, and could have problem with fakenuma (numa=fake=4 in
cmdline) case and system with CPU-less HBM/PMEM nodes, which we have
discussed in https://lore.kernel.org/lkml/Y0UgeUIJSFNR4mQB@feng-clx/
and will post the solution as a separate RFC patch.

Thanks,
Feng

Changelog:
  
  Since v1:
  * Change the max socket number from 8 to 4, as Peter Zijlstra
    pointed the 8S machine could have tsc sync issue, and should
    not be exempted generally
  
 arch/x86/kernel/tsc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
  

Comments

Dave Hansen Oct. 13, 2022, 4:02 p.m. UTC | #1
On 10/13/22 06:12, Feng Tang wrote:
> @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
>  	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>  	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
>  	    boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> -	    nr_online_nodes <= 2)
> +	    nr_online_nodes <= 4)
>  		tsc_disable_clocksource_watchdog();

I still don't think we should perpetuate this hack.

This just plain doesn't work in numa=off numa=fake=... or presumably in
cases where NUMA is disabled in the firmware and memory is interleaved
across all sockets.

It also presumably doesn't work on two-socket systems that have
Cluster-on-Die or Sub-NUMA-Clustering where a single socket is chopped
up into multiple nodes.
  
Feng Tang Oct. 14, 2022, 12:37 a.m. UTC | #2
On Thu, Oct 13, 2022 at 09:02:43AM -0700, Dave Hansen wrote:
> On 10/13/22 06:12, Feng Tang wrote:
> > @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> >  	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> >  	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> >  	    boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> > -	    nr_online_nodes <= 2)
> > +	    nr_online_nodes <= 4)
> >  		tsc_disable_clocksource_watchdog();
> 
> I still don't think we should perpetuate this hack.
> 
> This just plain doesn't work in numa=off numa=fake=... or presumably in
> cases where NUMA is disabled in the firmware and memory is interleaved
> across all sockets.
> 
> It also presumably doesn't work on two-socket systems that have
> Cluster-on-Die or Sub-NUMA-Clustering where a single socket is chopped
> up into multiple nodes.

Yes, after you raised the 'nr_online_nodes' issue, Peter, Rui and I
have discussed the problem, and plan to post a RFC patch as in 
 https://lore.kernel.org/lkml/Y0UgeUIJSFNR4mQB@feng-clx/

Which can cover:
 - numa=fake=... case
 - platform has DRAM nodes and cpu-less HBM/PMEM nodes

and 'sub-numa-clustering' can't be covered, and the tsc will be
watchdoged as before.

For numa=off case, there is only one CPU up, and I think lifting this
watchdog for tsc is fine.

Thanks,
Feng
  
Feng Tang Oct. 14, 2022, 1:14 a.m. UTC | #3
On Fri, Oct 14, 2022 at 08:37:18AM +0800, Feng Tang wrote:
> On Thu, Oct 13, 2022 at 09:02:43AM -0700, Dave Hansen wrote:
> > On 10/13/22 06:12, Feng Tang wrote:
> > > @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> > >  	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > >  	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > >  	    boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> > > -	    nr_online_nodes <= 2)
> > > +	    nr_online_nodes <= 4)
> > >  		tsc_disable_clocksource_watchdog();
> > 
> > I still don't think we should perpetuate this hack.
> > 
> > This just plain doesn't work in numa=off numa=fake=... or presumably in
> > cases where NUMA is disabled in the firmware and memory is interleaved
> > across all sockets.
> > 
> > It also presumably doesn't work on two-socket systems that have
> > Cluster-on-Die or Sub-NUMA-Clustering where a single socket is chopped
> > up into multiple nodes.
> 
> Yes, after you raised the 'nr_online_nodes' issue, Peter, Rui and I
> have discussed the problem, and plan to post a RFC patch as in 
>  https://lore.kernel.org/lkml/Y0UgeUIJSFNR4mQB@feng-clx/
> 
> Which can cover:
>  - numa=fake=... case
>  - platform has DRAM nodes and cpu-less HBM/PMEM nodes
> 
> and 'sub-numa-clustering' can't be covered, and the tsc will be
> watchdoged as before.
[...] 


> For numa=off case, there is only one CPU up, and I think lifting this
> watchdog for tsc is fine.

Sorry, I was wrong about this. 'numa=off' will still boot all CPUs up,
but skip SRAT table init and only show one node. so this is another
case that the fix patch can't cover.

Thanks,
Feng
  
Thomas Gleixner Oct. 19, 2022, 9:18 a.m. UTC | #4
On Thu, Oct 13 2022 at 21:12, Feng Tang wrote:
> There is report again that the tsc clocksource on a 4 sockets x86
> Skylake server was wrongly judged as 'unstable' by 'jiffies' watchdog,
> and disabled [1].
>
> Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC
> on qualified platorms") was introduce to deal with these false
> alarms of tsc unstable issues, covering qualified platforms for 2
> sockets or smaller ones.
>
> Extend the exemption to 4 sockets to fix the issue.
>
> We also got similar reports on 8 sockets platform from internal test,
> but as Peter pointed out, there was tsc sync issues for 8-sockets
> platform, and it'd better be handled architecture by architecture,
> instead of directly changing the threshold to 8 here.
>
> Rui also proposed another way to disable 'jiffies' as clocksource
> watchdog [2], which can also solve this specific problem in an
> architecture independent way, with one limitation that some tsc false
> alarms are reported by other watchdogs like HPET in post-boot time,
> while 'jiffies' is mostly used in boot phase before hardware
> clocksources are initialized.

HPET is initialized early, but if HPET is disabled or not advertised
then the only other hardware clocksource is PMTIMER which is initialized
late via fs_initcall. PMTIMER is initialized late due to broken Pentium
era chipsets which are sorted with PCI quirks. For anything else we can
initialize it early. Something like the below.

I'm sure I said this more than once, but I'm happy to repeat myself
forever:

  Instead of proliferating lousy hacks, can the X86 vendors finaly get
  their act together and provide some architected information whether
  the TSC is trustworthy or not?

Thanks,

        tglx
---

--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -10,6 +10,7 @@
  *
  */
 
+#include <linux/acpi_pmtmr.h>
 #include <linux/clocksource.h>
 #include <linux/clockchips.h>
 #include <linux/interrupt.h>
@@ -75,6 +76,14 @@ static void __init setup_default_timer_i
 void __init hpet_time_init(void)
 {
 	if (!hpet_enable()) {
+		/*
+		 * Some Pentium chipsets have broken HPETs and need
+		 * PCI quirks to run before init.
+		 */
+		if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+		    boot_cpu_data.family != 5)
+			init_acpi_pm_clocksource();
+
 		if (!pit_timer_init())
 			return;
 	}
--- a/drivers/clocksource/acpi_pm.c
+++ b/drivers/clocksource/acpi_pm.c
@@ -30,6 +30,7 @@
  * in arch/i386/kernel/acpi/boot.c
  */
 u32 pmtmr_ioport __read_mostly;
+static bool pmtmr_initialized __init_data;
 
 static inline u32 read_pmtmr(void)
 {
@@ -142,7 +143,7 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_SE
  * Some boards have the PMTMR running way too fast. We check
  * the PMTMR rate against PIT channel 2 to catch these cases.
  */
-static int verify_pmtmr_rate(void)
+static int __init verify_pmtmr_rate(void)
 {
 	u64 value1, value2;
 	unsigned long count, delta;
@@ -172,14 +173,18 @@ static int verify_pmtmr_rate(void)
 /* Number of reads we try to get two different values */
 #define ACPI_PM_READ_CHECKS 10000
 
-static int __init init_acpi_pm_clocksource(void)
+int __init init_acpi_pm_clocksource(void)
 {
 	u64 value1, value2;
 	unsigned int i, j = 0;
+	int ret;
 
 	if (!pmtmr_ioport)
 		return -ENODEV;
 
+	if (pmtmr_initialized)
+		return 0;
+
 	/* "verify" this timing source: */
 	for (j = 0; j < ACPI_PM_MONOTONICITY_CHECKS; j++) {
 		udelay(100 * j);
@@ -210,10 +215,11 @@ static int __init init_acpi_pm_clocksour
 		return -ENODEV;
 	}
 
-	return clocksource_register_hz(&clocksource_acpi_pm,
-						PMTMR_TICKS_PER_SEC);
+	ret = clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
+	if (!ret)
+		pmtimer_initialized = true;
+	return ret;
 }
-
 /* We use fs_initcall because we want the PCI fixups to have run
  * but we still need to load before device_initcall
  */
--- a/include/linux/acpi_pmtmr.h
+++ b/include/linux/acpi_pmtmr.h
@@ -13,6 +13,8 @@
 /* Overrun value */
 #define ACPI_PM_OVRRUN	(1<<24)
 
+extern int __init init_acpi_pm_clocksource(void);
+
 #ifdef CONFIG_X86_PM_TIMER
 
 extern u32 acpi_pm_read_verified(void);
  
Feng Tang Oct. 20, 2022, 4:44 a.m. UTC | #5
On Wed, Oct 19, 2022 at 11:18:43AM +0200, Thomas Gleixner wrote:
> On Thu, Oct 13 2022 at 21:12, Feng Tang wrote:
> > There is report again that the tsc clocksource on a 4 sockets x86
> > Skylake server was wrongly judged as 'unstable' by 'jiffies' watchdog,
> > and disabled [1].
> >
> > Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC
> > on qualified platorms") was introduce to deal with these false
> > alarms of tsc unstable issues, covering qualified platforms for 2
> > sockets or smaller ones.
> >
> > Extend the exemption to 4 sockets to fix the issue.
> >
> > We also got similar reports on 8 sockets platform from internal test,
> > but as Peter pointed out, there was tsc sync issues for 8-sockets
> > platform, and it'd better be handled architecture by architecture,
> > instead of directly changing the threshold to 8 here.
> >
> > Rui also proposed another way to disable 'jiffies' as clocksource
> > watchdog [2], which can also solve this specific problem in an
> > architecture independent way, with one limitation that some tsc false
> > alarms are reported by other watchdogs like HPET in post-boot time,
> > while 'jiffies' is mostly used in boot phase before hardware
> > clocksources are initialized.
> 
> HPET is initialized early, but if HPET is disabled or not advertised
> then the only other hardware clocksource is PMTIMER which is initialized
> late via fs_initcall. PMTIMER is initialized late due to broken Pentium
> era chipsets which are sorted with PCI quirks. For anything else we can
> initialize it early. Something like the below.

Thanks for sharing the background and the code! It can reduce the
time of 'jiffies' being a watchdog on client platforms whose HPET
are disabled. And there were still false positive reports for
HPET/PMTIMER as watchdogs, so I still vote to your suggestion of
lifting the check for qualified platforms.

For that, Dave raised the accuracy issue of 'nr_online_nodes' and
we proposed new patch in https://lore.kernel.org/lkml/20221017132942.1646934-1-feng.tang@intel.com/
while the topology_max_packages() still has issue as providing socket
number, and I plan to use 'logical_packages' instead. Do you think
it's in the right direction?

> I'm sure I said this more than once, but I'm happy to repeat myself
> forever:
> 
>   Instead of proliferating lousy hacks, can the X86 vendors finaly get
>   their act together and provide some architected information whether
>   the TSC is trustworthy or not?
 
Yes it will save us a lot of trouble. Maybe better in CPUID info, as
if there is some bug in HW/BIOS, it may get fixed with microcode update.

Thanks,
Feng

> Thanks,
> 
>         tglx
> ---
> 
> --- a/arch/x86/kernel/time.c
> +++ b/arch/x86/kernel/time.c
> @@ -10,6 +10,7 @@
>   *
>   */
>  
> +#include <linux/acpi_pmtmr.h>
>  #include <linux/clocksource.h>
>  #include <linux/clockchips.h>
>  #include <linux/interrupt.h>
> @@ -75,6 +76,14 @@ static void __init setup_default_timer_i
>  void __init hpet_time_init(void)
>  {
>  	if (!hpet_enable()) {
> +		/*
> +		 * Some Pentium chipsets have broken HPETs and need
> +		 * PCI quirks to run before init.
> +		 */
> +		if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
> +		    boot_cpu_data.family != 5)
> +			init_acpi_pm_clocksource();
> +
>  		if (!pit_timer_init())
>  			return;
>  	}
> --- a/drivers/clocksource/acpi_pm.c
> +++ b/drivers/clocksource/acpi_pm.c
> @@ -30,6 +30,7 @@
>   * in arch/i386/kernel/acpi/boot.c
>   */
>  u32 pmtmr_ioport __read_mostly;
> +static bool pmtmr_initialized __init_data;
>  
>  static inline u32 read_pmtmr(void)
>  {
> @@ -142,7 +143,7 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_SE
>   * Some boards have the PMTMR running way too fast. We check
>   * the PMTMR rate against PIT channel 2 to catch these cases.
>   */
> -static int verify_pmtmr_rate(void)
> +static int __init verify_pmtmr_rate(void)
>  {
>  	u64 value1, value2;
>  	unsigned long count, delta;
> @@ -172,14 +173,18 @@ static int verify_pmtmr_rate(void)
>  /* Number of reads we try to get two different values */
>  #define ACPI_PM_READ_CHECKS 10000
>  
> -static int __init init_acpi_pm_clocksource(void)
> +int __init init_acpi_pm_clocksource(void)
>  {
>  	u64 value1, value2;
>  	unsigned int i, j = 0;
> +	int ret;
>  
>  	if (!pmtmr_ioport)
>  		return -ENODEV;
>  
> +	if (pmtmr_initialized)
> +		return 0;
> +
>  	/* "verify" this timing source: */
>  	for (j = 0; j < ACPI_PM_MONOTONICITY_CHECKS; j++) {
>  		udelay(100 * j);
> @@ -210,10 +215,11 @@ static int __init init_acpi_pm_clocksour
>  		return -ENODEV;
>  	}
>  
> -	return clocksource_register_hz(&clocksource_acpi_pm,
> -						PMTMR_TICKS_PER_SEC);
> +	ret = clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
> +	if (!ret)
> +		pmtimer_initialized = true;
> +	return ret;
>  }
> -
>  /* We use fs_initcall because we want the PCI fixups to have run
>   * but we still need to load before device_initcall
>   */
> --- a/include/linux/acpi_pmtmr.h
> +++ b/include/linux/acpi_pmtmr.h
> @@ -13,6 +13,8 @@
>  /* Overrun value */
>  #define ACPI_PM_OVRRUN	(1<<24)
>  
> +extern int __init init_acpi_pm_clocksource(void);
> +
>  #ifdef CONFIG_X86_PM_TIMER
>  
>  extern u32 acpi_pm_read_verified(void);
> 
>
  

Patch

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cafacb2e58cc..1fa3fdf43159 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1209,7 +1209,7 @@  static void __init check_system_tsc_reliable(void)
 	 *  - TSC which does not stop in C-States
 	 *  - the TSC_ADJUST register which allows to detect even minimal
 	 *    modifications
-	 *  - not more than two sockets. As the number of sockets cannot be
+	 *  - not more than four sockets. As the number of sockets cannot be
 	 *    evaluated at the early boot stage where this has to be
 	 *    invoked, check the number of online memory nodes as a
 	 *    fallback solution which is an reasonable estimate.
@@ -1217,7 +1217,7 @@  static void __init check_system_tsc_reliable(void)
 	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
 	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
 	    boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
-	    nr_online_nodes <= 2)
+	    nr_online_nodes <= 4)
 		tsc_disable_clocksource_watchdog();
 }