[clocksource,1/3] clocksource: Reject bogus watchdog clocksource measurements

Message ID 20221114232827.835599-1-paulmck@kernel.org
State New
Headers
Series Reject bogus watchdog clocksource measurements |

Commit Message

Paul E. McKenney Nov. 14, 2022, 11:28 p.m. UTC
  One remaining clocksource-skew issue involves extreme CPU overcommit,
which can cause the clocksource watchdog measurements to be delayed by
tens of seconds.  This in turn means that a clock-skew criterion that
is appropriate for a 500-millisecond interval will instead give lots of
false positives.

Therefore, check for the watchdog clocksource reporting much larger or
much less than the time specified by WATCHDOG_INTERVAL.  In these cases,
print a pr_warn() warning and refrain from marking the clocksource under
test as being unstable.

Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Waiman Long <longman@redhat.com>
---
 kernel/time/clocksource.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)
  

Comments

Thomas Gleixner Nov. 17, 2022, 9:57 p.m. UTC | #1
Paul!

On Mon, Nov 14 2022 at 15:28, Paul E. McKenney wrote:
>  
> +		/* Check for bogus measurements. */
> +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> +		if (wd_nsec < (wdi >> 2)) {
> +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> +			continue;
> +		}
> +		if (wd_nsec > (wdi << 2)) {
> +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> +			continue;
> +		}

This is really getting ridiculous.

The clocksource watchdog is supposed to run periodically with period =
WATCHDOG_INTERVAL.

That periodic schedule depends on the clocksource which is monitored. If
the clocksource runs fast the period is shortened and if it runs slow is
prolonged.

Now you add checks:

 1) If the period observed by the watchdog clocksource is less than 1/4
    of the expected period, everything is fine.

 2) If the period observed by the watchdog clocksource is greater than 4
    times the expected period, everything is fine.

IOW, you are preventing detection of one class of problems which caused
us to implement the watchdog clocksource in the first place.

You are preventing it by making the watchdog decision circular dependent
on the clocksource it is supposed to watch. IOW, you put a fox in charge
of the henhouse. That's a really brilliant plan.

But what's worse is the constant stream of heuristics which make the
clocksource watchdog "work" under workloads which are simply impossible
to be handled by its current implementation.

If I look at the full set of them by now, I'm pretty sure that a real
TSC fail would not be noticed anymore because there are more exceptions
and excuses why a particular measurement it bogus or invalid or
whatever.

I didn't do a full analysis yet, but I have a hard time to convince
myself that - assumed we add this gem - the watchdog will be anything
else than a useless waste of CPU cycles as there is always one of the
heuristics declaring that everything is fine along with incomprehensible
messages in dmesg which will create more confusion to most people than
being helpful.

This is hunting us for 20+ years now and why do we still need this? I'm
pretty sure that farcebook does not run their server farms on 20 years
old silicon.

That means even the largest customers have not been able to convince the
CPU manufactures to fix this idiocy by now, right? Either that or they
did not even try because it's simpler to "fix" it in software.

That's the only two explanations I have for the constant stream of
voodoo logic. Both are just a proof for my claim that this industry just
"works" by chance.

Can we stop this pretty please and either come up with something
fundamentally different or just admit defeat and remove the whole thing?

Thanks,

        tglx
  
Paul E. McKenney Nov. 17, 2022, 11:09 p.m. UTC | #2
On Thu, Nov 17, 2022 at 10:57:34PM +0100, Thomas Gleixner wrote:
> Paul!
> 
> On Mon, Nov 14 2022 at 15:28, Paul E. McKenney wrote:
> >  
> > +		/* Check for bogus measurements. */
> > +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > +		if (wd_nsec < (wdi >> 2)) {
> > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > +			continue;
> > +		}
> > +		if (wd_nsec > (wdi << 2)) {
> > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > +			continue;
> > +		}
> 
> This is really getting ridiculous.

I have absolutely no argument with this statement, and going back a
long time.  ;-)

But the set of systems that caused me to send this turned out to have
real divergence between HPET and TSC, and 40 milliseconds per second of
divergence at that.  So not only do you hate this series, but it is also
the case that this series doesn't help with the problem at hand.

I will therefore set this series aside.

And I sure hope that it is the TSC and not HPET that is correct on those
systems, because they have constant_tsc, nonstop_tsc, and tsc_adjust
all set in their lscpu output.  :-/

And yes, I backported Feng Tang's "b50db7095fe0 ("x86/tsc: Disable
clocksource watchdog for TSC on qualified platorms") commit for them.
Worked like a charm, so thank you, Feng Tang!

But as long as I started responding, I might as well continue.  ;-)

> The clocksource watchdog is supposed to run periodically with period =
> WATCHDOG_INTERVAL.
> 
> That periodic schedule depends on the clocksource which is monitored. If
> the clocksource runs fast the period is shortened and if it runs slow is
> prolonged.
> 
> Now you add checks:
> 
>  1) If the period observed by the watchdog clocksource is less than 1/4
>     of the expected period, everything is fine.

s/less than/more than/, then agreed, give or take that a console message
is printed.  Also, the third patch adds a check of the clocksource under
check, so that if either the watchdog clocksource or the clocksource
under test says that the period observed is less than 1/4 fo the expected
period, we print the message but don't turn off the clocksource under
test.

>  2) If the period observed by the watchdog clocksource is greater than 4
>     times the expected period, everything is fine.

s/greater than/less than/, then agreed, again give or take the console
message and the subsequent check of both clocksources.

> IOW, you are preventing detection of one class of problems which caused
> us to implement the watchdog clocksource in the first place.

The problem is still memorialized via those console-log message.

> You are preventing it by making the watchdog decision circular dependent
> on the clocksource it is supposed to watch. IOW, you put a fox in charge
> of the henhouse. That's a really brilliant plan.
> 
> But what's worse is the constant stream of heuristics which make the
> clocksource watchdog "work" under workloads which are simply impossible
> to be handled by its current implementation.
> 
> If I look at the full set of them by now, I'm pretty sure that a real
> TSC fail would not be noticed anymore because there are more exceptions
> and excuses why a particular measurement it bogus or invalid or
> whatever.

It really did spot the 40ms/s divergence between HPET and TSC.

> I didn't do a full analysis yet, but I have a hard time to convince
> myself that - assumed we add this gem - the watchdog will be anything
> else than a useless waste of CPU cycles as there is always one of the
> heuristics declaring that everything is fine along with incomprehensible
> messages in dmesg which will create more confusion to most people than
> being helpful.

The messages certainly could be improved.  I freely confess to having
focused solely on eliminating false positives.

And to good effect.  The overwhelming bulk of our fleet seems to have
good timers.  The issues are concentrated in systems with dying CPUs and
in a few of the quite-new systems.  Had this happened ten years ago,
we would both be engaged in activities that are far more productive.
Or, failing that, far more fun.

> This is hunting us for 20+ years now and why do we still need this? I'm
> pretty sure that farcebook does not run their server farms on 20 years
> old silicon.

The 40ms/s divergence between TSC and HPET was observed on hardware that
is quite recent.  Again, I sure hope that it is HPET that is unreliable
rather than TSC, but either way, we have a very modern piece of hardware
with at least one very iffy clocksource.  Needless to say, this does not
build confidence.  Yes, it might be a firmware or some sort of calibration
problem, but this system is definitely well beyond the simulation and
FPGA stage.

> That means even the largest customers have not been able to convince the
> CPU manufactures to fix this idiocy by now, right? Either that or they
> did not even try because it's simpler to "fix" it in software.

Worse yet, the manufacturers that did fix it were not helped by having
reliable per-CPU clocks.  The response was of the form "Nice, but given
that it does not work on x86, we won't be relying on it."

Can't win for losing.  ;-)

> That's the only two explanations I have for the constant stream of
> voodoo logic. Both are just a proof for my claim that this industry just
> "works" by chance.

I am sure that you meant to say "just additional proof", but yes.  :-/

My concern is that the next vendor will make all the same mistakes all
over again.  It would be better for the kernel (or something equally
ubiquitous) to yell at them right away rather than for hapless users to
have to figure it out.

> Can we stop this pretty please and either come up with something
> fundamentally different or just admit defeat and remove the whole thing?

I am not at all married to the current code, but I really would like
offending system manufacturers to get their hands slapped during hardware
bringup rather than such bugs being inflicted on their unwary users.

Do you have any thoughts on what you would want a fundamentally different
approach to look like?

Yes, I do hear the "null approach" as your current preferred option.  ;-)

But having just met up with a system with seriously divergent clocks,
I am a bit shy of that approach.

							Thanx, Paul
  
Feng Tang Nov. 21, 2022, 12:55 a.m. UTC | #3
On Thu, Nov 17, 2022 at 03:09:10PM -0800, Paul E. McKenney wrote:
> On Thu, Nov 17, 2022 at 10:57:34PM +0100, Thomas Gleixner wrote:
> > Paul!
> > 
> > On Mon, Nov 14 2022 at 15:28, Paul E. McKenney wrote:
> > >  
> > > +		/* Check for bogus measurements. */
> > > +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > > +		if (wd_nsec < (wdi >> 2)) {
> > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > +			continue;
> > > +		}
> > > +		if (wd_nsec > (wdi << 2)) {
> > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > +			continue;
> > > +		}
> > 
> > This is really getting ridiculous.
> 
> I have absolutely no argument with this statement, and going back a
> long time.  ;-)
> 
> But the set of systems that caused me to send this turned out to have
> real divergence between HPET and TSC, and 40 milliseconds per second of
> divergence at that.  So not only do you hate this series, but it is also
> the case that this series doesn't help with the problem at hand.

The drift is about 4% which is quite big. It seems that this is
either problem of HPET/TSC's hardware/firmware, or the problem of
frequency calibration for HPET/TSC. TSC calibration is complex,
as it could be done from different methods depending on hardware
and firmware, could you share the kernel boot log related with
tsc/hpet and clocksource? 

Also if your platform has acpi PM_TIMER, you may try "nohpet"
to use PM_TIMER instead of HPET and check if there is also big
drift between TSC and PM_TIMER.

Thanks,
Feng
  
Paul E. McKenney Nov. 21, 2022, 3:21 p.m. UTC | #4
On Mon, Nov 21, 2022 at 08:55:15AM +0800, Feng Tang wrote:
> On Thu, Nov 17, 2022 at 03:09:10PM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 17, 2022 at 10:57:34PM +0100, Thomas Gleixner wrote:
> > > Paul!
> > > 
> > > On Mon, Nov 14 2022 at 15:28, Paul E. McKenney wrote:
> > > >  
> > > > +		/* Check for bogus measurements. */
> > > > +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > > > +		if (wd_nsec < (wdi >> 2)) {
> > > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > +			continue;
> > > > +		}
> > > > +		if (wd_nsec > (wdi << 2)) {
> > > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > +			continue;
> > > > +		}
> > > 
> > > This is really getting ridiculous.
> > 
> > I have absolutely no argument with this statement, and going back a
> > long time.  ;-)
> > 
> > But the set of systems that caused me to send this turned out to have
> > real divergence between HPET and TSC, and 40 milliseconds per second of
> > divergence at that.  So not only do you hate this series, but it is also
> > the case that this series doesn't help with the problem at hand.
> 
> The drift is about 4% which is quite big. It seems that this is
> either problem of HPET/TSC's hardware/firmware, or the problem of
> frequency calibration for HPET/TSC. TSC calibration is complex,
> as it could be done from different methods depending on hardware
> and firmware, could you share the kernel boot log related with
> tsc/hpet and clocksource? 

Thank you for looking into this!

Knucklehead here failed to save away the dmesg.  Let me see what
I can drum up.

> Also if your platform has acpi PM_TIMER, you may try "nohpet"
> to use PM_TIMER instead of HPET and check if there is also big
> drift between TSC and PM_TIMER.

Let me see about giving this a go, and again, thank you!

						Thanx, Paul
  
Paul E. McKenney Nov. 21, 2022, 6:14 p.m. UTC | #5
On Mon, Nov 21, 2022 at 08:55:15AM +0800, Feng Tang wrote:
> On Thu, Nov 17, 2022 at 03:09:10PM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 17, 2022 at 10:57:34PM +0100, Thomas Gleixner wrote:
> > > Paul!
> > > 
> > > On Mon, Nov 14 2022 at 15:28, Paul E. McKenney wrote:
> > > >  
> > > > +		/* Check for bogus measurements. */
> > > > +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > > > +		if (wd_nsec < (wdi >> 2)) {
> > > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > +			continue;
> > > > +		}
> > > > +		if (wd_nsec > (wdi << 2)) {
> > > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > +			continue;
> > > > +		}
> > > 
> > > This is really getting ridiculous.
> > 
> > I have absolutely no argument with this statement, and going back a
> > long time.  ;-)
> > 
> > But the set of systems that caused me to send this turned out to have
> > real divergence between HPET and TSC, and 40 milliseconds per second of
> > divergence at that.  So not only do you hate this series, but it is also
> > the case that this series doesn't help with the problem at hand.
> 
> The drift is about 4% which is quite big. It seems that this is
> either problem of HPET/TSC's hardware/firmware, or the problem of
> frequency calibration for HPET/TSC. TSC calibration is complex,
> as it could be done from different methods depending on hardware
> and firmware, could you share the kernel boot log related with
> tsc/hpet and clocksource? 
> 
> Also if your platform has acpi PM_TIMER, you may try "nohpet"
> to use PM_TIMER instead of HPET and check if there is also big
> drift between TSC and PM_TIMER.

The kernel is built with CONFIG_X86_PM_TIMER=y, so I was guessing
that there is an ACPI PM_TIMER.  Except that when I booted
without your "Disable clocksource watchdog for TSC on qualified
platforms" patch, I get the following:

[   44.303035] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
[   44.347034] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
[   44.374034] clocksource:                       'tsc-early' cs_nsec: 588042081 cs_now: 66c486d157 cs_last: 6682125e5e mask: ffffffffffffffff
[   44.403034] clocksource:                       No current clocksource.
[   44.418034] tsc: Marking TSC unstable due to clocksource watchdog

If PM_TIMER was involved, I would expect 'acpi_pm' instead of
refined-jiffies.  Or am I misinterpreting the output and/or code?

Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
clocksource_hpet.flags?

I am sending the full console output off-list.  Hey, you asked for it!  ;-)

							Thanx, Paul
  
Feng Tang Nov. 22, 2022, 3:55 p.m. UTC | #6
On Mon, Nov 21, 2022 at 10:14:49AM -0800, Paul E. McKenney wrote:
> > > I have absolutely no argument with this statement, and going back a
> > > long time.  ;-)
> > > 
> > > But the set of systems that caused me to send this turned out to have
> > > real divergence between HPET and TSC, and 40 milliseconds per second of
> > > divergence at that.  So not only do you hate this series, but it is also
> > > the case that this series doesn't help with the problem at hand.
> > 
> > The drift is about 4% which is quite big. It seems that this is
> > either problem of HPET/TSC's hardware/firmware, or the problem of
> > frequency calibration for HPET/TSC. TSC calibration is complex,
> > as it could be done from different methods depending on hardware
> > and firmware, could you share the kernel boot log related with
> > tsc/hpet and clocksource? 
> > 
> > Also if your platform has acpi PM_TIMER, you may try "nohpet"
> > to use PM_TIMER instead of HPET and check if there is also big
> > drift between TSC and PM_TIMER.
> 
> The kernel is built with CONFIG_X86_PM_TIMER=y, so I was guessing
> that there is an ACPI PM_TIMER.  Except that when I booted
> without your "Disable clocksource watchdog for TSC on qualified
> platforms" patch, I get the following:
> 
> [   44.303035] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> [   44.347034] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
> [   44.374034] clocksource:                       'tsc-early' cs_nsec: 588042081 cs_now: 66c486d157 cs_last: 6682125e5e mask: ffffffffffffffff
> [   44.403034] clocksource:                       No current clocksource.
> [   44.418034] tsc: Marking TSC unstable due to clocksource watchdog
 
Aha, we've met similar error (TSC being judged 'unstable' by
'refined-jiffies') before, and our root cause is discussed in [1].
In our case, we had early serial console enabled, which made it
easier to be reproduced.

That was a trigger for us to proposed severl solutions before Thomas
suggested to disable tsc watchdog for all qualified platforms.

[1]. https://lore.kernel.org/lkml/20201126012421.GA92582@shbuild999.sh.intel.com/

> If PM_TIMER was involved, I would expect 'acpi_pm' instead of
> refined-jiffies.  Or am I misinterpreting the output and/or code?

It's about timing. On a typical server platform, the clocksources
init order could be:
  refined-jiffies --> hpet --> tsc-early --> acpi_pm --> tsc 

From your log, TSC('tsc-early') is disabled before 'acpi_pm' get
initialized, so 'acpi_pm' timer (if exist) had no chance to watchdog
the tsc.

> Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
> clocksource_hpet.flags?

Maybe try below patch, which will skip watchdog for 'tsc-early',
while giving 'acpi_pm' timer a chance to watchdog 'tsc'.

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cafacb2e58cc..9840f0131764 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1131,8 +1131,7 @@ static struct clocksource clocksource_tsc_early = {
 	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
 	.read			= read_tsc,
 	.mask			= CLOCKSOURCE_MASK(64),
-	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
-				  CLOCK_SOURCE_MUST_VERIFY,
+	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
 	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
 	.enable			= tsc_cs_enable,
 	.resume			= tsc_resume,


> I am sending the full console output off-list.  Hey, you asked for it!  ;-)

Thanks for sharing!

Thanks,
Feng

> 
> 							Thanx, Paul
  
Paul E. McKenney Nov. 22, 2022, 10:07 p.m. UTC | #7
On Tue, Nov 22, 2022 at 11:55:51PM +0800, Feng Tang wrote:
> On Mon, Nov 21, 2022 at 10:14:49AM -0800, Paul E. McKenney wrote:
> > > > I have absolutely no argument with this statement, and going back a
> > > > long time.  ;-)
> > > > 
> > > > But the set of systems that caused me to send this turned out to have
> > > > real divergence between HPET and TSC, and 40 milliseconds per second of
> > > > divergence at that.  So not only do you hate this series, but it is also
> > > > the case that this series doesn't help with the problem at hand.
> > > 
> > > The drift is about 4% which is quite big. It seems that this is
> > > either problem of HPET/TSC's hardware/firmware, or the problem of
> > > frequency calibration for HPET/TSC. TSC calibration is complex,
> > > as it could be done from different methods depending on hardware
> > > and firmware, could you share the kernel boot log related with
> > > tsc/hpet and clocksource? 
> > > 
> > > Also if your platform has acpi PM_TIMER, you may try "nohpet"
> > > to use PM_TIMER instead of HPET and check if there is also big
> > > drift between TSC and PM_TIMER.
> > 
> > The kernel is built with CONFIG_X86_PM_TIMER=y, so I was guessing
> > that there is an ACPI PM_TIMER.  Except that when I booted
> > without your "Disable clocksource watchdog for TSC on qualified
> > platforms" patch, I get the following:
> > 
> > [   44.303035] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> > [   44.347034] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
> > [   44.374034] clocksource:                       'tsc-early' cs_nsec: 588042081 cs_now: 66c486d157 cs_last: 6682125e5e mask: ffffffffffffffff
> > [   44.403034] clocksource:                       No current clocksource.
> > [   44.418034] tsc: Marking TSC unstable due to clocksource watchdog
>  
> Aha, we've met similar error (TSC being judged 'unstable' by
> 'refined-jiffies') before, and our root cause is discussed in [1].
> In our case, we had early serial console enabled, which made it
> easier to be reproduced.
> 
> That was a trigger for us to proposed severl solutions before Thomas
> suggested to disable tsc watchdog for all qualified platforms.
> 
> [1]. https://lore.kernel.org/lkml/20201126012421.GA92582@shbuild999.sh.intel.com/

Thank you for the analysis!

> > If PM_TIMER was involved, I would expect 'acpi_pm' instead of
> > refined-jiffies.  Or am I misinterpreting the output and/or code?
> 
> It's about timing. On a typical server platform, the clocksources
> init order could be:
>   refined-jiffies --> hpet --> tsc-early --> acpi_pm --> tsc 
> 
> >From your log, TSC('tsc-early') is disabled before 'acpi_pm' get
> initialized, so 'acpi_pm' timer (if exist) had no chance to watchdog
> the tsc.
> 
> > Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
> > clocksource_hpet.flags?
> 
> Maybe try below patch, which will skip watchdog for 'tsc-early',
> while giving 'acpi_pm' timer a chance to watchdog 'tsc'.
> 
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index cafacb2e58cc..9840f0131764 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1131,8 +1131,7 @@ static struct clocksource clocksource_tsc_early = {
>  	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
>  	.read			= read_tsc,
>  	.mask			= CLOCKSOURCE_MASK(64),
> -	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
> -				  CLOCK_SOURCE_MUST_VERIFY,
> +	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
>  	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
>  	.enable			= tsc_cs_enable,
>  	.resume			= tsc_resume,

Your mainline patch b50db7095fe0 ("x86/tsc: Disable clocksource watchdog
for TSC on qualified platorms") mitigates the issue so we are good for
the immediate future, at least assuming reliable TSC.

But it also disables checking against HPET, hence my question about
marking clocksource_hpet.flags with CLOCK_SOURCE_MUST_VERIFY at boot time
on systems whose CPUs have constant_tsc, nonstop_tsc, and tsc_adjust.

Would that make sense?

							Thanx, Paul

> > I am sending the full console output off-list.  Hey, you asked for it!  ;-)
> 
> Thanks for sharing!
> 
> Thanks,
> Feng
> 
> > 
> > 							Thanx, Paul
  
Feng Tang Nov. 23, 2022, 2:36 a.m. UTC | #8
On Tue, Nov 22, 2022 at 02:07:12PM -0800, Paul E. McKenney wrote:
[...] 
> > > If PM_TIMER was involved, I would expect 'acpi_pm' instead of
> > > refined-jiffies.  Or am I misinterpreting the output and/or code?
> > 
> > It's about timing. On a typical server platform, the clocksources
> > init order could be:
> >   refined-jiffies --> hpet --> tsc-early --> acpi_pm --> tsc 
> > 
> > >From your log, TSC('tsc-early') is disabled before 'acpi_pm' get
> > initialized, so 'acpi_pm' timer (if exist) had no chance to watchdog
> > the tsc.
> > 
> > > Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
> > > clocksource_hpet.flags?
> > 
> > Maybe try below patch, which will skip watchdog for 'tsc-early',
> > while giving 'acpi_pm' timer a chance to watchdog 'tsc'.
> > 
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index cafacb2e58cc..9840f0131764 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -1131,8 +1131,7 @@ static struct clocksource clocksource_tsc_early = {
> >  	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
> >  	.read			= read_tsc,
> >  	.mask			= CLOCKSOURCE_MASK(64),
> > -	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
> > -				  CLOCK_SOURCE_MUST_VERIFY,
> > +	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
> >  	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
> >  	.enable			= tsc_cs_enable,
> >  	.resume			= tsc_resume,
> 
> Your mainline patch b50db7095fe0 ("x86/tsc: Disable clocksource watchdog
> for TSC on qualified platorms") mitigates the issue so we are good for
> the immediate future, at least assuming reliable TSC.
> 
> But it also disables checking against HPET, hence my question about
> marking clocksource_hpet.flags with CLOCK_SOURCE_MUST_VERIFY at boot time
> on systems whose CPUs have constant_tsc, nonstop_tsc, and tsc_adjust.

IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
report that the 500ms watchdog timer had big impact on some parallel
workload on big servers, that was another factor for us to seek
stopping the timer.

Is this about the concern of possible TSC frequency calibration
issue, as the 40 ms per second drift between HPET and TSC? With 
b50db7095fe0 backported, we also have another patch to force TSC
calibration for those platforms which get the TSC freq directly
from CPUID or MSR and don't have such info in dmesg:
 "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 

https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/

We did met tsc calibration issue due to some firmware issue, and
this can help to catch it. You can try it if you think it's relevant.

Thanks,
Feng

> Would that make sense?
> 
> 							Thanx, Paul
> 
> > > I am sending the full console output off-list.  Hey, you asked for it!  ;-)
> > 
> > Thanks for sharing!
> > 
> > Thanks,
> > Feng
> > 
> > > 
> > > 							Thanx, Paul
  
Paul E. McKenney Nov. 23, 2022, 9:23 p.m. UTC | #9
On Wed, Nov 23, 2022 at 10:36:04AM +0800, Feng Tang wrote:
> On Tue, Nov 22, 2022 at 02:07:12PM -0800, Paul E. McKenney wrote:
> [...] 
> > > > If PM_TIMER was involved, I would expect 'acpi_pm' instead of
> > > > refined-jiffies.  Or am I misinterpreting the output and/or code?
> > > 
> > > It's about timing. On a typical server platform, the clocksources
> > > init order could be:
> > >   refined-jiffies --> hpet --> tsc-early --> acpi_pm --> tsc 
> > > 
> > > >From your log, TSC('tsc-early') is disabled before 'acpi_pm' get
> > > initialized, so 'acpi_pm' timer (if exist) had no chance to watchdog
> > > the tsc.
> > > 
> > > > Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
> > > > clocksource_hpet.flags?
> > > 
> > > Maybe try below patch, which will skip watchdog for 'tsc-early',
> > > while giving 'acpi_pm' timer a chance to watchdog 'tsc'.
> > > 
> > > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > > index cafacb2e58cc..9840f0131764 100644
> > > --- a/arch/x86/kernel/tsc.c
> > > +++ b/arch/x86/kernel/tsc.c
> > > @@ -1131,8 +1131,7 @@ static struct clocksource clocksource_tsc_early = {
> > >  	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
> > >  	.read			= read_tsc,
> > >  	.mask			= CLOCKSOURCE_MASK(64),
> > > -	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
> > > -				  CLOCK_SOURCE_MUST_VERIFY,
> > > +	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
> > >  	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
> > >  	.enable			= tsc_cs_enable,
> > >  	.resume			= tsc_resume,
> > 
> > Your mainline patch b50db7095fe0 ("x86/tsc: Disable clocksource watchdog
> > for TSC on qualified platorms") mitigates the issue so we are good for
> > the immediate future, at least assuming reliable TSC.
> > 
> > But it also disables checking against HPET, hence my question about
> > marking clocksource_hpet.flags with CLOCK_SOURCE_MUST_VERIFY at boot time
> > on systems whose CPUs have constant_tsc, nonstop_tsc, and tsc_adjust.
> 
> IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> report that the 500ms watchdog timer had big impact on some parallel
> workload on big servers, that was another factor for us to seek
> stopping the timer.

Another approach would be to slow it down.  Given the tighter bounds
on skew, it could be done every (say) 10 seconds while allowing
2 milliseconds skew instead of the current 100 microseconds.

> Is this about the concern of possible TSC frequency calibration
> issue, as the 40 ms per second drift between HPET and TSC? With 
> b50db7095fe0 backported, we also have another patch to force TSC
> calibration for those platforms which get the TSC freq directly
> from CPUID or MSR and don't have such info in dmesg:
>  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> 
> https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> 
> We did met tsc calibration issue due to some firmware issue, and
> this can help to catch it. You can try it if you think it's relevant.

I am giving this a go, thank you!

						Thanx, Paul

> Thanks,
> Feng
> 
> > Would that make sense?
> > 
> > 							Thanx, Paul
> > 
> > > > I am sending the full console output off-list.  Hey, you asked for it!  ;-)
> > > 
> > > Thanks for sharing!
> > > 
> > > Thanks,
> > > Feng
> > > 
> > > > 
> > > > 							Thanx, Paul
  
Feng Tang Nov. 28, 2022, 2:15 a.m. UTC | #10
On Wed, Nov 23, 2022 at 01:23:48PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 23, 2022 at 10:36:04AM +0800, Feng Tang wrote:
> > On Tue, Nov 22, 2022 at 02:07:12PM -0800, Paul E. McKenney wrote:
> > [...] 
> > > > > If PM_TIMER was involved, I would expect 'acpi_pm' instead of
> > > > > refined-jiffies.  Or am I misinterpreting the output and/or code?
> > > > 
> > > > It's about timing. On a typical server platform, the clocksources
> > > > init order could be:
> > > >   refined-jiffies --> hpet --> tsc-early --> acpi_pm --> tsc 
> > > > 
> > > > >From your log, TSC('tsc-early') is disabled before 'acpi_pm' get
> > > > initialized, so 'acpi_pm' timer (if exist) had no chance to watchdog
> > > > the tsc.
> > > > 
> > > > > Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
> > > > > clocksource_hpet.flags?
> > > > 
> > > > Maybe try below patch, which will skip watchdog for 'tsc-early',
> > > > while giving 'acpi_pm' timer a chance to watchdog 'tsc'.
> > > > 
> > > > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > > > index cafacb2e58cc..9840f0131764 100644
> > > > --- a/arch/x86/kernel/tsc.c
> > > > +++ b/arch/x86/kernel/tsc.c
> > > > @@ -1131,8 +1131,7 @@ static struct clocksource clocksource_tsc_early = {
> > > >  	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
> > > >  	.read			= read_tsc,
> > > >  	.mask			= CLOCKSOURCE_MASK(64),
> > > > -	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
> > > > -				  CLOCK_SOURCE_MUST_VERIFY,
> > > > +	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
> > > >  	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
> > > >  	.enable			= tsc_cs_enable,
> > > >  	.resume			= tsc_resume,
> > > 
> > > Your mainline patch b50db7095fe0 ("x86/tsc: Disable clocksource watchdog
> > > for TSC on qualified platorms") mitigates the issue so we are good for
> > > the immediate future, at least assuming reliable TSC.
> > > 
> > > But it also disables checking against HPET, hence my question about
> > > marking clocksource_hpet.flags with CLOCK_SOURCE_MUST_VERIFY at boot time
> > > on systems whose CPUs have constant_tsc, nonstop_tsc, and tsc_adjust.
> > 
> > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > report that the 500ms watchdog timer had big impact on some parallel
> > workload on big servers, that was another factor for us to seek
> > stopping the timer.
> 
> Another approach would be to slow it down.  Given the tighter bounds
> on skew, it could be done every (say) 10 seconds while allowing
> 2 milliseconds skew instead of the current 100 microseconds.

Yes, this can reduce the OS noise much. One problem is if we make it
a general interface, there is some clocksource whose warp time is
less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
know if other ARCHs have similar cases.

> 
> > Is this about the concern of possible TSC frequency calibration
> > issue, as the 40 ms per second drift between HPET and TSC? With 
> > b50db7095fe0 backported, we also have another patch to force TSC
> > calibration for those platforms which get the TSC freq directly
> > from CPUID or MSR and don't have such info in dmesg:
> >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > 
> > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > 
> > We did met tsc calibration issue due to some firmware issue, and
> > this can help to catch it. You can try it if you think it's relevant.
> 
> I am giving this a go, thank you!

Thanks for spending time testing it!

Thanks,
Feng
  
Paul E. McKenney Nov. 29, 2022, 7:29 p.m. UTC | #11
On Mon, Nov 28, 2022 at 10:15:43AM +0800, Feng Tang wrote:
> On Wed, Nov 23, 2022 at 01:23:48PM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 23, 2022 at 10:36:04AM +0800, Feng Tang wrote:
> > > On Tue, Nov 22, 2022 at 02:07:12PM -0800, Paul E. McKenney wrote:
> > > [...] 
> > > > > > If PM_TIMER was involved, I would expect 'acpi_pm' instead of
> > > > > > refined-jiffies.  Or am I misinterpreting the output and/or code?
> > > > > 
> > > > > It's about timing. On a typical server platform, the clocksources
> > > > > init order could be:
> > > > >   refined-jiffies --> hpet --> tsc-early --> acpi_pm --> tsc 
> > > > > 
> > > > > >From your log, TSC('tsc-early') is disabled before 'acpi_pm' get
> > > > > initialized, so 'acpi_pm' timer (if exist) had no chance to watchdog
> > > > > the tsc.
> > > > > 
> > > > > > Either way, would it make sense to add CLOCK_SOURCE_MUST_VERIFY to
> > > > > > clocksource_hpet.flags?
> > > > > 
> > > > > Maybe try below patch, which will skip watchdog for 'tsc-early',
> > > > > while giving 'acpi_pm' timer a chance to watchdog 'tsc'.
> > > > > 
> > > > > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > > > > index cafacb2e58cc..9840f0131764 100644
> > > > > --- a/arch/x86/kernel/tsc.c
> > > > > +++ b/arch/x86/kernel/tsc.c
> > > > > @@ -1131,8 +1131,7 @@ static struct clocksource clocksource_tsc_early = {
> > > > >  	.uncertainty_margin	= 32 * NSEC_PER_MSEC,
> > > > >  	.read			= read_tsc,
> > > > >  	.mask			= CLOCKSOURCE_MASK(64),
> > > > > -	.flags			= CLOCK_SOURCE_IS_CONTINUOUS |
> > > > > -				  CLOCK_SOURCE_MUST_VERIFY,
> > > > > +	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
> > > > >  	.vdso_clock_mode	= VDSO_CLOCKMODE_TSC,
> > > > >  	.enable			= tsc_cs_enable,
> > > > >  	.resume			= tsc_resume,
> > > > 
> > > > Your mainline patch b50db7095fe0 ("x86/tsc: Disable clocksource watchdog
> > > > for TSC on qualified platorms") mitigates the issue so we are good for
> > > > the immediate future, at least assuming reliable TSC.
> > > > 
> > > > But it also disables checking against HPET, hence my question about
> > > > marking clocksource_hpet.flags with CLOCK_SOURCE_MUST_VERIFY at boot time
> > > > on systems whose CPUs have constant_tsc, nonstop_tsc, and tsc_adjust.
> > > 
> > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > report that the 500ms watchdog timer had big impact on some parallel
> > > workload on big servers, that was another factor for us to seek
> > > stopping the timer.
> > 
> > Another approach would be to slow it down.  Given the tighter bounds
> > on skew, it could be done every (say) 10 seconds while allowing
> > 2 milliseconds skew instead of the current 100 microseconds.
> 
> Yes, this can reduce the OS noise much. One problem is if we make it
> a general interface, there is some clocksource whose warp time is
> less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> know if other ARCHs have similar cases.

Maybe a simpler approach is for systems with such high sensitivity to
OS noise to simply disable the clocksource watchdog.  ;-)

> > > Is this about the concern of possible TSC frequency calibration
> > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > b50db7095fe0 backported, we also have another patch to force TSC
> > > calibration for those platforms which get the TSC freq directly
> > > from CPUID or MSR and don't have such info in dmesg:
> > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > 
> > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > 
> > > We did met tsc calibration issue due to some firmware issue, and
> > > this can help to catch it. You can try it if you think it's relevant.
> > 
> > I am giving this a go, thank you!
> 
> Thanks for spending time testing it!

And here are the results from setting tsc_force_recalibrate to 1:

$ dmesg | grep -E 'calibrat|clocksource'
[    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
[   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[   49.762432] clocksource: Switched to clocksource tsc-early
[   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   55.640270] clocksource: Switched to clocksource tsc
[   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
[   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
[   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz

Apologies for the delay, but reconfigurations put the system off the
net for some time.

							Thanx, Paul
  
Feng Tang Nov. 30, 2022, 1:38 a.m. UTC | #12
On Tue, Nov 29, 2022 at 11:29:15AM -0800, Paul E. McKenney wrote:
[...]
> > > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > > report that the 500ms watchdog timer had big impact on some parallel
> > > > workload on big servers, that was another factor for us to seek
> > > > stopping the timer.
> > > 
> > > Another approach would be to slow it down.  Given the tighter bounds
> > > on skew, it could be done every (say) 10 seconds while allowing
> > > 2 milliseconds skew instead of the current 100 microseconds.
> > 
> > Yes, this can reduce the OS noise much. One problem is if we make it
> > a general interface, there is some clocksource whose warp time is
> > less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> > know if other ARCHs have similar cases.
> 
> Maybe a simpler approach is for systems with such high sensitivity to
> OS noise to simply disable the clocksource watchdog.  ;-)

That's what the reported did, test with and without "tsc=reliable"
parameter :)

And AFAIK, many customers with big server farms hate to add more
cmdline parameters when we suggested so.

> > > > Is this about the concern of possible TSC frequency calibration
> > > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > > b50db7095fe0 backported, we also have another patch to force TSC
> > > > calibration for those platforms which get the TSC freq directly
> > > > from CPUID or MSR and don't have such info in dmesg:
> > > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > > 
> > > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > > 
> > > > We did met tsc calibration issue due to some firmware issue, and
> > > > this can help to catch it. You can try it if you think it's relevant.
> > > 
> > > I am giving this a go, thank you!
> > 
> > Thanks for spending time testing it!
> 
> And here are the results from setting tsc_force_recalibrate to 1:
> 
> $ dmesg | grep -E 'calibrat|clocksource'
> [    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> [   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
> [   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> [   49.762432] clocksource: Switched to clocksource tsc-early
> [   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   55.640270] clocksource: Switched to clocksource tsc
> [   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> [   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
> [   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz

Looks like there is really something wrong here. I assume the first
number '1896 MHz' is got from CPUID(0x15)'s math calculation.

I thinks 2 more things could be try:

* add "nohpet" to the cmdline, so the tsc_force_recalibrate should use
  ACPI PM_TIMER to do the calibration, say a third-party check.
* If the system don't have auto-adjusted time setting like NTP, I
  guess the system time will have obvious drift comparing to a normal
  clock or a mobile phone time, as the deviation is about 4%, which
  is 2.4 minutes per hour.

> Apologies for the delay, but reconfigurations put the system off the
> net for some time.

No problem at all, it's your holiday time! Thanks for trying this!

Thanks,
Feng

> 							Thanx, Paul
  
Paul E. McKenney Nov. 30, 2022, 4:12 a.m. UTC | #13
On Wed, Nov 30, 2022 at 09:38:00AM +0800, Feng Tang wrote:
> On Tue, Nov 29, 2022 at 11:29:15AM -0800, Paul E. McKenney wrote:
> [...]
> > > > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > > > report that the 500ms watchdog timer had big impact on some parallel
> > > > > workload on big servers, that was another factor for us to seek
> > > > > stopping the timer.
> > > > 
> > > > Another approach would be to slow it down.  Given the tighter bounds
> > > > on skew, it could be done every (say) 10 seconds while allowing
> > > > 2 milliseconds skew instead of the current 100 microseconds.
> > > 
> > > Yes, this can reduce the OS noise much. One problem is if we make it
> > > a general interface, there is some clocksource whose warp time is
> > > less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> > > know if other ARCHs have similar cases.
> > 
> > Maybe a simpler approach is for systems with such high sensitivity to
> > OS noise to simply disable the clocksource watchdog.  ;-)
> 
> That's what the reported did, test with and without "tsc=reliable"
> parameter :)
> 
> And AFAIK, many customers with big server farms hate to add more
> cmdline parameters when we suggested so.

It can be surprisingly hard.  It is sometimes easier to patch the kernel
to change the default.  Part of the problem is getting the right set
of command-line parameters associated with the right versions of the
kernel in the not-uncommon case where different portions of the server
farm are running different versions of the kernel.

> > > > > Is this about the concern of possible TSC frequency calibration
> > > > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > > > b50db7095fe0 backported, we also have another patch to force TSC
> > > > > calibration for those platforms which get the TSC freq directly
> > > > > from CPUID or MSR and don't have such info in dmesg:
> > > > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > > > 
> > > > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > > > 
> > > > > We did met tsc calibration issue due to some firmware issue, and
> > > > > this can help to catch it. You can try it if you think it's relevant.
> > > > 
> > > > I am giving this a go, thank you!
> > > 
> > > Thanks for spending time testing it!
> > 
> > And here are the results from setting tsc_force_recalibrate to 1:
> > 
> > $ dmesg | grep -E 'calibrat|clocksource'
> > [    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > [   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
> > [   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > [   49.762432] clocksource: Switched to clocksource tsc-early
> > [   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > [   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   55.640270] clocksource: Switched to clocksource tsc
> > [   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > [   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > [   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz
> 
> Looks like there is really something wrong here. I assume the first
> number '1896 MHz' is got from CPUID(0x15)'s math calculation.

How about this from earlier in that same console?

[    0.000000] efi: EFI v2.80 by American Megatrends
[    0.000000] efi: ACPI=0x6f569000 ACPI 2.0=0x6f569014 TPMFinalLog=0x6f56b000 SMBIOS=0x753e1000 SMBIOS 3.0=0x753e0000 MEMATTR=0x62176018 ESRT=0x64bd1018 TPMEventLog=0x58737018
[    0.000000] [Firmware Bug]: TPM Final Events table missing or invalid
[    0.000000] SMBIOS 3.5.0 present.
[    0.000000] DMI: Quanta Grand Teton 1F0TUBZ0007/Grand Teton MB, BIOS F0T_1A15 08/25/2022
[    0.000000] tsc: Detected 1900.000 MHz processor
[    0.000000] tsc: Detected 1896.000 MHz TSC

> I thinks 2 more things could be try:
> 
> * add "nohpet" to the cmdline, so the tsc_force_recalibrate should use
>   ACPI PM_TIMER to do the calibration, say a third-party check.

OK, getting things teed up for TSC recalibration and nohpet.

> * If the system don't have auto-adjusted time setting like NTP, I
>   guess the system time will have obvious drift comparing to a normal
>   clock or a mobile phone time, as the deviation is about 4%, which
>   is 2.4 minutes per hour.

No ntpd, but there is a chronyd.

I will let you know what happens with HPET disabled and TSC recalibration
enabled.

							Thanx, Paul
  
Feng Tang Nov. 30, 2022, 4:49 a.m. UTC | #14
On Tue, Nov 29, 2022 at 08:12:06PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 30, 2022 at 09:38:00AM +0800, Feng Tang wrote:
> > On Tue, Nov 29, 2022 at 11:29:15AM -0800, Paul E. McKenney wrote:
> > [...]
> > > > > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > > > > report that the 500ms watchdog timer had big impact on some parallel
> > > > > > workload on big servers, that was another factor for us to seek
> > > > > > stopping the timer.
> > > > > 
> > > > > Another approach would be to slow it down.  Given the tighter bounds
> > > > > on skew, it could be done every (say) 10 seconds while allowing
> > > > > 2 milliseconds skew instead of the current 100 microseconds.
> > > > 
> > > > Yes, this can reduce the OS noise much. One problem is if we make it
> > > > a general interface, there is some clocksource whose warp time is
> > > > less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> > > > know if other ARCHs have similar cases.
> > > 
> > > Maybe a simpler approach is for systems with such high sensitivity to
> > > OS noise to simply disable the clocksource watchdog.  ;-)
> > 
> > That's what the reported did, test with and without "tsc=reliable"
> > parameter :)
> > 
> > And AFAIK, many customers with big server farms hate to add more
> > cmdline parameters when we suggested so.
> 
> It can be surprisingly hard.  It is sometimes easier to patch the kernel
> to change the default. 

Indeed, sometimes we were askd to provide patch than cmdline parameters :)

> Part of the problem is getting the right set
> of command-line parameters associated with the right versions of the
> kernel in the not-uncommon case where different portions of the server
> farm are running different versions of the kernel.
> 
> > > > > > Is this about the concern of possible TSC frequency calibration
> > > > > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > > > > b50db7095fe0 backported, we also have another patch to force TSC
> > > > > > calibration for those platforms which get the TSC freq directly
> > > > > > from CPUID or MSR and don't have such info in dmesg:
> > > > > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > > > > 
> > > > > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > > > > 
> > > > > > We did met tsc calibration issue due to some firmware issue, and
> > > > > > this can help to catch it. You can try it if you think it's relevant.
> > > > > 
> > > > > I am giving this a go, thank you!
> > > > 
> > > > Thanks for spending time testing it!
> > > 
> > > And here are the results from setting tsc_force_recalibrate to 1:
> > > 
> > > $ dmesg | grep -E 'calibrat|clocksource'
> > > [    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > [   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
> > > [   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > [   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > [   49.762432] clocksource: Switched to clocksource tsc-early
> > > [   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > [   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > [   55.640270] clocksource: Switched to clocksource tsc
> > > [   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > [   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > [   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz
> > 
> > Looks like there is really something wrong here. I assume the first
> > number '1896 MHz' is got from CPUID(0x15)'s math calculation.
> 
> How about this from earlier in that same console?
> 
> [    0.000000] efi: EFI v2.80 by American Megatrends
> [    0.000000] efi: ACPI=0x6f569000 ACPI 2.0=0x6f569014 TPMFinalLog=0x6f56b000 SMBIOS=0x753e1000 SMBIOS 3.0=0x753e0000 MEMATTR=0x62176018 ESRT=0x64bd1018 TPMEventLog=0x58737018
> [    0.000000] [Firmware Bug]: TPM Final Events table missing or invalid
> [    0.000000] SMBIOS 3.5.0 present.
> [    0.000000] DMI: Quanta Grand Teton 1F0TUBZ0007/Grand Teton MB, BIOS F0T_1A15 08/25/2022
> [    0.000000] tsc: Detected 1900.000 MHz processor
> [    0.000000] tsc: Detected 1896.000 MHz TSC

I'm still not sure, but it's likely from CPUID(0x15). I met cases
that even severs of same generations get their tsc frequence from
different sources.

I locally have debug patch to check it:

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cafacb2e58cc..82ddb4b0529a 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -654,8 +654,11 @@ unsigned long native_calibrate_tsc(void)
 	 * frequency and is the most accurate one so far we have. This
 	 * is considered a known frequency.
 	 */
-	if (crystal_khz != 0)
+	if (crystal_khz != 0) {
+		printk("tsc: using CPUID[0x15] crystal_khz= %d kHz ebx=%d eax=%d\n",
+		crystal_khz, ebx_numerator, eax_denominator);
 		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+	}
 
 	/*
 	 * Some Intel SoCs like Skylake and Kabylake don't report the crystal
@@ -668,6 +671,7 @@ unsigned long native_calibrate_tsc(void)
 		cpuid(0x16, &eax_base_mhz, &ebx, &ecx, &edx);
 		crystal_khz = eax_base_mhz * 1000 *
 			eax_denominator / ebx_numerator;
+		printk("tsc: using CPUID[0x16] base_khz=%d kHz\n", crystal_khz);
 	}
 
 	if (crystal_khz == 0)

This remind me that maybe we can add a line in dmesg telling people
which exact soure that the TSC frequency comes from (CPUID, MSR or
calibration from HW timers like HPET/PM_TIMER).

> > I thinks 2 more things could be try:
> > 
> > * add "nohpet" to the cmdline, so the tsc_force_recalibrate should use
> >   ACPI PM_TIMER to do the calibration, say a third-party check.
> 
> OK, getting things teed up for TSC recalibration and nohpet.
> 
> > * If the system don't have auto-adjusted time setting like NTP, I
> >   guess the system time will have obvious drift comparing to a normal
> >   clock or a mobile phone time, as the deviation is about 4%, which
> >   is 2.4 minutes per hour.
> 
> No ntpd, but there is a chronyd.
> 
> I will let you know what happens with HPET disabled and TSC recalibration
> enabled.

Thanks!

- Feng

> 							Thanx, Paul
  
Paul E. McKenney Nov. 30, 2022, 5:16 a.m. UTC | #15
On Wed, Nov 30, 2022 at 12:49:24PM +0800, Feng Tang wrote:
> On Tue, Nov 29, 2022 at 08:12:06PM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2022 at 09:38:00AM +0800, Feng Tang wrote:
> > > On Tue, Nov 29, 2022 at 11:29:15AM -0800, Paul E. McKenney wrote:
> > > [...]
> > > > > > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > > > > > report that the 500ms watchdog timer had big impact on some parallel
> > > > > > > workload on big servers, that was another factor for us to seek
> > > > > > > stopping the timer.
> > > > > > 
> > > > > > Another approach would be to slow it down.  Given the tighter bounds
> > > > > > on skew, it could be done every (say) 10 seconds while allowing
> > > > > > 2 milliseconds skew instead of the current 100 microseconds.
> > > > > 
> > > > > Yes, this can reduce the OS noise much. One problem is if we make it
> > > > > a general interface, there is some clocksource whose warp time is
> > > > > less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> > > > > know if other ARCHs have similar cases.
> > > > 
> > > > Maybe a simpler approach is for systems with such high sensitivity to
> > > > OS noise to simply disable the clocksource watchdog.  ;-)
> > > 
> > > That's what the reported did, test with and without "tsc=reliable"
> > > parameter :)
> > > 
> > > And AFAIK, many customers with big server farms hate to add more
> > > cmdline parameters when we suggested so.
> > 
> > It can be surprisingly hard.  It is sometimes easier to patch the kernel
> > to change the default. 
> 
> Indeed, sometimes we were askd to provide patch than cmdline parameters :)

Do any of them like CONFIG_BOOT_CONFIG_EMBED?  ;-)

> > Part of the problem is getting the right set
> > of command-line parameters associated with the right versions of the
> > kernel in the not-uncommon case where different portions of the server
> > farm are running different versions of the kernel.
> > 
> > > > > > > Is this about the concern of possible TSC frequency calibration
> > > > > > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > > > > > b50db7095fe0 backported, we also have another patch to force TSC
> > > > > > > calibration for those platforms which get the TSC freq directly
> > > > > > > from CPUID or MSR and don't have such info in dmesg:
> > > > > > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > > > > > 
> > > > > > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > > > > > 
> > > > > > > We did met tsc calibration issue due to some firmware issue, and
> > > > > > > this can help to catch it. You can try it if you think it's relevant.
> > > > > > 
> > > > > > I am giving this a go, thank you!
> > > > > 
> > > > > Thanks for spending time testing it!
> > > > 
> > > > And here are the results from setting tsc_force_recalibrate to 1:
> > > > 
> > > > $ dmesg | grep -E 'calibrat|clocksource'
> > > > [    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > > [   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
> > > > [   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > [   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > > [   49.762432] clocksource: Switched to clocksource tsc-early
> > > > [   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > > [   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > [   55.640270] clocksource: Switched to clocksource tsc
> > > > [   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > > [   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > > [   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz
> > > 
> > > Looks like there is really something wrong here. I assume the first
> > > number '1896 MHz' is got from CPUID(0x15)'s math calculation.
> > 
> > How about this from earlier in that same console?
> > 
> > [    0.000000] efi: EFI v2.80 by American Megatrends
> > [    0.000000] efi: ACPI=0x6f569000 ACPI 2.0=0x6f569014 TPMFinalLog=0x6f56b000 SMBIOS=0x753e1000 SMBIOS 3.0=0x753e0000 MEMATTR=0x62176018 ESRT=0x64bd1018 TPMEventLog=0x58737018
> > [    0.000000] [Firmware Bug]: TPM Final Events table missing or invalid
> > [    0.000000] SMBIOS 3.5.0 present.
> > [    0.000000] DMI: Quanta Grand Teton 1F0TUBZ0007/Grand Teton MB, BIOS F0T_1A15 08/25/2022
> > [    0.000000] tsc: Detected 1900.000 MHz processor
> > [    0.000000] tsc: Detected 1896.000 MHz TSC
> 
> I'm still not sure, but it's likely from CPUID(0x15). I met cases
> that even severs of same generations get their tsc frequence from
> different sources.
> 
> I locally have debug patch to check it:
> 
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index cafacb2e58cc..82ddb4b0529a 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -654,8 +654,11 @@ unsigned long native_calibrate_tsc(void)
>  	 * frequency and is the most accurate one so far we have. This
>  	 * is considered a known frequency.
>  	 */
> -	if (crystal_khz != 0)
> +	if (crystal_khz != 0) {
> +		printk("tsc: using CPUID[0x15] crystal_khz= %d kHz ebx=%d eax=%d\n",
> +		crystal_khz, ebx_numerator, eax_denominator);
>  		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> +	}
>  
>  	/*
>  	 * Some Intel SoCs like Skylake and Kabylake don't report the crystal
> @@ -668,6 +671,7 @@ unsigned long native_calibrate_tsc(void)
>  		cpuid(0x16, &eax_base_mhz, &ebx, &ecx, &edx);
>  		crystal_khz = eax_base_mhz * 1000 *
>  			eax_denominator / ebx_numerator;
> +		printk("tsc: using CPUID[0x16] base_khz=%d kHz\n", crystal_khz);
>  	}
>  
>  	if (crystal_khz == 0)

I will give this one a go, thank you!

> This remind me that maybe we can add a line in dmesg telling people
> which exact soure that the TSC frequency comes from (CPUID, MSR or
> calibration from HW timers like HPET/PM_TIMER).

Thus far, it seems that this information could be useful.

> > > I thinks 2 more things could be try:
> > > 
> > > * add "nohpet" to the cmdline, so the tsc_force_recalibrate should use
> > >   ACPI PM_TIMER to do the calibration, say a third-party check.
> > 
> > OK, getting things teed up for TSC recalibration and nohpet.
> > 
> > > * If the system don't have auto-adjusted time setting like NTP, I
> > >   guess the system time will have obvious drift comparing to a normal
> > >   clock or a mobile phone time, as the deviation is about 4%, which
> > >   is 2.4 minutes per hour.
> > 
> > No ntpd, but there is a chronyd.
> > 
> > I will let you know what happens with HPET disabled and TSC recalibration
> > enabled.

And here you go:

[    0.000000] tsc: Detected 1900.000 MHz processor
[    0.000000] tsc: Detected 1896.000 MHz TSC
[    5.254788] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[   25.669260] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   32.182958] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[   57.785756] clocksource: Switched to clocksource tsc-early
[   58.106661] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[   63.578433] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   63.602169] clocksource: Switched to clocksource tsc
[   64.636612] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
[   64.664615] tsc: Previous calibrated TSC freq:        1896.000 MHz
[   64.677714] tsc: TSC freq recalibrated by [PM_TIMER]:         1975.000 MHz

Same same, give or take, anyway.

							Thanx, Paul
  
Feng Tang Nov. 30, 2022, 5:35 a.m. UTC | #16
On Tue, Nov 29, 2022 at 09:16:00PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 30, 2022 at 12:49:24PM +0800, Feng Tang wrote:
> > On Tue, Nov 29, 2022 at 08:12:06PM -0800, Paul E. McKenney wrote:
> > > On Wed, Nov 30, 2022 at 09:38:00AM +0800, Feng Tang wrote:
> > > > On Tue, Nov 29, 2022 at 11:29:15AM -0800, Paul E. McKenney wrote:
> > > > [...]
> > > > > > > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > > > > > > report that the 500ms watchdog timer had big impact on some parallel
> > > > > > > > workload on big servers, that was another factor for us to seek
> > > > > > > > stopping the timer.
> > > > > > > 
> > > > > > > Another approach would be to slow it down.  Given the tighter bounds
> > > > > > > on skew, it could be done every (say) 10 seconds while allowing
> > > > > > > 2 milliseconds skew instead of the current 100 microseconds.
> > > > > > 
> > > > > > Yes, this can reduce the OS noise much. One problem is if we make it
> > > > > > a general interface, there is some clocksource whose warp time is
> > > > > > less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> > > > > > know if other ARCHs have similar cases.
> > > > > 
> > > > > Maybe a simpler approach is for systems with such high sensitivity to
> > > > > OS noise to simply disable the clocksource watchdog.  ;-)
> > > > 
> > > > That's what the reported did, test with and without "tsc=reliable"
> > > > parameter :)
> > > > 
> > > > And AFAIK, many customers with big server farms hate to add more
> > > > cmdline parameters when we suggested so.
> > > 
> > > It can be surprisingly hard.  It is sometimes easier to patch the kernel
> > > to change the default. 
> > 
> > Indeed, sometimes we were askd to provide patch than cmdline parameters :)
> 
> Do any of them like CONFIG_BOOT_CONFIG_EMBED?  ;-)
> 
> > > Part of the problem is getting the right set
> > > of command-line parameters associated with the right versions of the
> > > kernel in the not-uncommon case where different portions of the server
> > > farm are running different versions of the kernel.
> > > 
> > > > > > > > Is this about the concern of possible TSC frequency calibration
> > > > > > > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > > > > > > b50db7095fe0 backported, we also have another patch to force TSC
> > > > > > > > calibration for those platforms which get the TSC freq directly
> > > > > > > > from CPUID or MSR and don't have such info in dmesg:
> > > > > > > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > > > > > > 
> > > > > > > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > > > > > > 
> > > > > > > > We did met tsc calibration issue due to some firmware issue, and
> > > > > > > > this can help to catch it. You can try it if you think it's relevant.
> > > > > > > 
> > > > > > > I am giving this a go, thank you!
> > > > > > 
> > > > > > Thanks for spending time testing it!
> > > > > 
> > > > > And here are the results from setting tsc_force_recalibrate to 1:
> > > > > 
> > > > > $ dmesg | grep -E 'calibrat|clocksource'
> > > > > [    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > > > [   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
> > > > > [   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > [   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > > > [   49.762432] clocksource: Switched to clocksource tsc-early
> > > > > [   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > > > [   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > [   55.640270] clocksource: Switched to clocksource tsc
> > > > > [   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > > > [   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > > > [   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz
> > > > 
> > > > Looks like there is really something wrong here. I assume the first
> > > > number '1896 MHz' is got from CPUID(0x15)'s math calculation.
> > > 
> > > How about this from earlier in that same console?
> > > 
> > > [    0.000000] efi: EFI v2.80 by American Megatrends
> > > [    0.000000] efi: ACPI=0x6f569000 ACPI 2.0=0x6f569014 TPMFinalLog=0x6f56b000 SMBIOS=0x753e1000 SMBIOS 3.0=0x753e0000 MEMATTR=0x62176018 ESRT=0x64bd1018 TPMEventLog=0x58737018
> > > [    0.000000] [Firmware Bug]: TPM Final Events table missing or invalid
> > > [    0.000000] SMBIOS 3.5.0 present.
> > > [    0.000000] DMI: Quanta Grand Teton 1F0TUBZ0007/Grand Teton MB, BIOS F0T_1A15 08/25/2022
> > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > 
> > I'm still not sure, but it's likely from CPUID(0x15). I met cases
> > that even severs of same generations get their tsc frequence from
> > different sources.
> > 
> > I locally have debug patch to check it:
> > 
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index cafacb2e58cc..82ddb4b0529a 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -654,8 +654,11 @@ unsigned long native_calibrate_tsc(void)
> >  	 * frequency and is the most accurate one so far we have. This
> >  	 * is considered a known frequency.
> >  	 */
> > -	if (crystal_khz != 0)
> > +	if (crystal_khz != 0) {
> > +		printk("tsc: using CPUID[0x15] crystal_khz= %d kHz ebx=%d eax=%d\n",
> > +		crystal_khz, ebx_numerator, eax_denominator);
> >  		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > +	}
> >  
> >  	/*
> >  	 * Some Intel SoCs like Skylake and Kabylake don't report the crystal
> > @@ -668,6 +671,7 @@ unsigned long native_calibrate_tsc(void)
> >  		cpuid(0x16, &eax_base_mhz, &ebx, &ecx, &edx);
> >  		crystal_khz = eax_base_mhz * 1000 *
> >  			eax_denominator / ebx_numerator;
> > +		printk("tsc: using CPUID[0x16] base_khz=%d kHz\n", crystal_khz);
> >  	}
> >  
> >  	if (crystal_khz == 0)
> 
> I will give this one a go, thank you!
> 
> > This remind me that maybe we can add a line in dmesg telling people
> > which exact soure that the TSC frequency comes from (CPUID, MSR or
> > calibration from HW timers like HPET/PM_TIMER).
> 
> Thus far, it seems that this information could be useful.
> 
> > > > I thinks 2 more things could be try:
> > > > 
> > > > * add "nohpet" to the cmdline, so the tsc_force_recalibrate should use
> > > >   ACPI PM_TIMER to do the calibration, say a third-party check.
> > > 
> > > OK, getting things teed up for TSC recalibration and nohpet.
> > > 
> > > > * If the system don't have auto-adjusted time setting like NTP, I
> > > >   guess the system time will have obvious drift comparing to a normal
> > > >   clock or a mobile phone time, as the deviation is about 4%, which
> > > >   is 2.4 minutes per hour.
> > > 
> > > No ntpd, but there is a chronyd.
> > > 
> > > I will let you know what happens with HPET disabled and TSC recalibration
> > > enabled.
> 
> And here you go:
> 
> [    0.000000] tsc: Detected 1900.000 MHz processor
> [    0.000000] tsc: Detected 1896.000 MHz TSC
> [    5.254788] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> [   25.669260] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   32.182958] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> [   57.785756] clocksource: Switched to clocksource tsc-early
> [   58.106661] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [   63.578433] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   63.602169] clocksource: Switched to clocksource tsc
> [   64.636612] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> [   64.664615] tsc: Previous calibrated TSC freq:        1896.000 MHz
> [   64.677714] tsc: TSC freq recalibrated by [PM_TIMER]:         1975.000 MHz
 
Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
and it matches the 40ms drift per second you mentioned earlier, this
seems like the CPUID(0x15) gave the wrong frequence number.

Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
which has deviation.

Either way, I think the HW/FW could have some problem.

Thanks,
Feng

> Same same, give or take, anyway.
> 
> 							Thanx, Paul
  
Paul E. McKenney Nov. 30, 2022, 5:50 a.m. UTC | #17
On Wed, Nov 30, 2022 at 01:35:08PM +0800, Feng Tang wrote:
> On Tue, Nov 29, 2022 at 09:16:00PM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2022 at 12:49:24PM +0800, Feng Tang wrote:
> > > On Tue, Nov 29, 2022 at 08:12:06PM -0800, Paul E. McKenney wrote:
> > > > On Wed, Nov 30, 2022 at 09:38:00AM +0800, Feng Tang wrote:
> > > > > On Tue, Nov 29, 2022 at 11:29:15AM -0800, Paul E. McKenney wrote:
> > > > > [...]
> > > > > > > > > IIUC, this will make TSC to watchdog HPET every 500 ms. We have got
> > > > > > > > > report that the 500ms watchdog timer had big impact on some parallel
> > > > > > > > > workload on big servers, that was another factor for us to seek
> > > > > > > > > stopping the timer.
> > > > > > > > 
> > > > > > > > Another approach would be to slow it down.  Given the tighter bounds
> > > > > > > > on skew, it could be done every (say) 10 seconds while allowing
> > > > > > > > 2 milliseconds skew instead of the current 100 microseconds.
> > > > > > > 
> > > > > > > Yes, this can reduce the OS noise much. One problem is if we make it
> > > > > > > a general interface, there is some clocksource whose warp time is
> > > > > > > less than 10 seconds, like ACPI PM_TIMER (3-4 seconds), and I don't
> > > > > > > know if other ARCHs have similar cases.
> > > > > > 
> > > > > > Maybe a simpler approach is for systems with such high sensitivity to
> > > > > > OS noise to simply disable the clocksource watchdog.  ;-)
> > > > > 
> > > > > That's what the reported did, test with and without "tsc=reliable"
> > > > > parameter :)
> > > > > 
> > > > > And AFAIK, many customers with big server farms hate to add more
> > > > > cmdline parameters when we suggested so.
> > > > 
> > > > It can be surprisingly hard.  It is sometimes easier to patch the kernel
> > > > to change the default. 
> > > 
> > > Indeed, sometimes we were askd to provide patch than cmdline parameters :)
> > 
> > Do any of them like CONFIG_BOOT_CONFIG_EMBED?  ;-)
> > 
> > > > Part of the problem is getting the right set
> > > > of command-line parameters associated with the right versions of the
> > > > kernel in the not-uncommon case where different portions of the server
> > > > farm are running different versions of the kernel.
> > > > 
> > > > > > > > > Is this about the concern of possible TSC frequency calibration
> > > > > > > > > issue, as the 40 ms per second drift between HPET and TSC? With 
> > > > > > > > > b50db7095fe0 backported, we also have another patch to force TSC
> > > > > > > > > calibration for those platforms which get the TSC freq directly
> > > > > > > > > from CPUID or MSR and don't have such info in dmesg:
> > > > > > > > >  "tsc: Refined TSC clocksource calibration: 2693.509 MHz" 
> > > > > > > > > 
> > > > > > > > > https://lore.kernel.org/lkml/20220509144110.9242-1-feng.tang@intel.com/
> > > > > > > > > 
> > > > > > > > > We did met tsc calibration issue due to some firmware issue, and
> > > > > > > > > this can help to catch it. You can try it if you think it's relevant.
> > > > > > > > 
> > > > > > > > I am giving this a go, thank you!
> > > > > > > 
> > > > > > > Thanks for spending time testing it!
> > > > > > 
> > > > > > And here are the results from setting tsc_force_recalibrate to 1:
> > > > > > 
> > > > > > $ dmesg | grep -E 'calibrat|clocksource'
> > > > > > [    5.272939] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > > > > [   16.830644] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 76450417870 ns
> > > > > > [   17.938020] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > > [   24.548583] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > > > > [   49.762432] clocksource: Switched to clocksource tsc-early
> > > > > > [   50.076769] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > > > > [   55.615946] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > > [   55.640270] clocksource: Switched to clocksource tsc
> > > > > > [   56.694371] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > > > > [   56.724550] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > > > > [   56.737646] tsc: TSC freq recalibrated by [HPET]:     1975.000 MHz
> > > > > 
> > > > > Looks like there is really something wrong here. I assume the first
> > > > > number '1896 MHz' is got from CPUID(0x15)'s math calculation.
> > > > 
> > > > How about this from earlier in that same console?
> > > > 
> > > > [    0.000000] efi: EFI v2.80 by American Megatrends
> > > > [    0.000000] efi: ACPI=0x6f569000 ACPI 2.0=0x6f569014 TPMFinalLog=0x6f56b000 SMBIOS=0x753e1000 SMBIOS 3.0=0x753e0000 MEMATTR=0x62176018 ESRT=0x64bd1018 TPMEventLog=0x58737018
> > > > [    0.000000] [Firmware Bug]: TPM Final Events table missing or invalid
> > > > [    0.000000] SMBIOS 3.5.0 present.
> > > > [    0.000000] DMI: Quanta Grand Teton 1F0TUBZ0007/Grand Teton MB, BIOS F0T_1A15 08/25/2022
> > > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > > 
> > > I'm still not sure, but it's likely from CPUID(0x15). I met cases
> > > that even severs of same generations get their tsc frequence from
> > > different sources.
> > > 
> > > I locally have debug patch to check it:
> > > 
> > > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > > index cafacb2e58cc..82ddb4b0529a 100644
> > > --- a/arch/x86/kernel/tsc.c
> > > +++ b/arch/x86/kernel/tsc.c
> > > @@ -654,8 +654,11 @@ unsigned long native_calibrate_tsc(void)
> > >  	 * frequency and is the most accurate one so far we have. This
> > >  	 * is considered a known frequency.
> > >  	 */
> > > -	if (crystal_khz != 0)
> > > +	if (crystal_khz != 0) {
> > > +		printk("tsc: using CPUID[0x15] crystal_khz= %d kHz ebx=%d eax=%d\n",
> > > +		crystal_khz, ebx_numerator, eax_denominator);
> > >  		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > > +	}
> > >  
> > >  	/*
> > >  	 * Some Intel SoCs like Skylake and Kabylake don't report the crystal
> > > @@ -668,6 +671,7 @@ unsigned long native_calibrate_tsc(void)
> > >  		cpuid(0x16, &eax_base_mhz, &ebx, &ecx, &edx);
> > >  		crystal_khz = eax_base_mhz * 1000 *
> > >  			eax_denominator / ebx_numerator;
> > > +		printk("tsc: using CPUID[0x16] base_khz=%d kHz\n", crystal_khz);
> > >  	}
> > >  
> > >  	if (crystal_khz == 0)
> > 
> > I will give this one a go, thank you!
> > 
> > > This remind me that maybe we can add a line in dmesg telling people
> > > which exact soure that the TSC frequency comes from (CPUID, MSR or
> > > calibration from HW timers like HPET/PM_TIMER).
> > 
> > Thus far, it seems that this information could be useful.
> > 
> > > > > I thinks 2 more things could be try:
> > > > > 
> > > > > * add "nohpet" to the cmdline, so the tsc_force_recalibrate should use
> > > > >   ACPI PM_TIMER to do the calibration, say a third-party check.
> > > > 
> > > > OK, getting things teed up for TSC recalibration and nohpet.
> > > > 
> > > > > * If the system don't have auto-adjusted time setting like NTP, I
> > > > >   guess the system time will have obvious drift comparing to a normal
> > > > >   clock or a mobile phone time, as the deviation is about 4%, which
> > > > >   is 2.4 minutes per hour.
> > > > 
> > > > No ntpd, but there is a chronyd.
> > > > 
> > > > I will let you know what happens with HPET disabled and TSC recalibration
> > > > enabled.
> > 
> > And here you go:
> > 
> > [    0.000000] tsc: Detected 1900.000 MHz processor
> > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > [    5.254788] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > [   25.669260] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   32.182958] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > [   57.785756] clocksource: Switched to clocksource tsc-early
> > [   58.106661] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > [   63.578433] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   63.602169] clocksource: Switched to clocksource tsc
> > [   64.636612] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > [   64.664615] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > [   64.677714] tsc: TSC freq recalibrated by [PM_TIMER]:         1975.000 MHz
>  
> Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> and it matches the 40ms drift per second you mentioned earlier, this
> seems like the CPUID(0x15) gave the wrong frequence number.
> 
> Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> which has deviation.
> 
> Either way, I think the HW/FW could have some problem.

And this time with your printk()s:

[    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
[    0.000000] tsc: Detected 1900.000 MHz processor
[    0.000000] tsc: Detected 1896.000 MHz TSC
[    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[   57.823933] clocksource: Switched to clocksource tsc-early
[   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   63.637323] clocksource: Switched to clocksource tsc
[   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
[   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
[   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz

What would be good next steps to check up on the hardware and firmware?

(My next step involves a pillow, but will follow up tomorrow morning
Pacific Time.)

							Thanx, Paul
  
Feng Tang Nov. 30, 2022, 6 a.m. UTC | #18
On Tue, Nov 29, 2022 at 09:50:24PM -0800, Paul E. McKenney wrote:
[...]
> >  
> > Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> > and it matches the 40ms drift per second you mentioned earlier, this
> > seems like the CPUID(0x15) gave the wrong frequence number.
> > 
> > Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> > which has deviation.
> > 
> > Either way, I think the HW/FW could have some problem.
> 
> And this time with your printk()s:
> 
> [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> [    0.000000] tsc: Detected 1900.000 MHz processor
> [    0.000000] tsc: Detected 1896.000 MHz TSC
> [    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> [   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> [   57.823933] clocksource: Switched to clocksource tsc-early
> [   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> [   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   63.637323] clocksource: Switched to clocksource tsc
> [   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> [   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
> [   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz

This confirms the tsc frequency is calculated from CPUID(0x15).

> What would be good next steps to check up on the hardware and firmware?

Maybe raise it to vendor? I have no idea how to check thos black boxes :)


> (My next step involves a pillow, but will follow up tomorrow morning
> Pacific Time.)
 
Really thanks for checking this through late night!

Thanks,
Feng

> 							Thanx, Paul
  
Paul E. McKenney Dec. 1, 2022, 5:24 p.m. UTC | #19
On Wed, Nov 30, 2022 at 02:00:04PM +0800, Feng Tang wrote:
> On Tue, Nov 29, 2022 at 09:50:24PM -0800, Paul E. McKenney wrote:
> [...]
> > >  
> > > Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> > > and it matches the 40ms drift per second you mentioned earlier, this
> > > seems like the CPUID(0x15) gave the wrong frequence number.
> > > 
> > > Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> > > which has deviation.
> > > 
> > > Either way, I think the HW/FW could have some problem.
> > 
> > And this time with your printk()s:
> > 
> > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > [    0.000000] tsc: Detected 1900.000 MHz processor
> > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > [    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > [   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > [   57.823933] clocksource: Switched to clocksource tsc-early
> > [   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > [   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   63.637323] clocksource: Switched to clocksource tsc
> > [   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > [   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > [   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz
> 
> This confirms the tsc frequency is calculated from CPUID(0x15).
> 
> > What would be good next steps to check up on the hardware and firmware?
> 
> Maybe raise it to vendor? I have no idea how to check thos black boxes :)

Done, but no high hopes here.  (What, me cynical?  Better believe it!!!)

> > (My next step involves a pillow, but will follow up tomorrow morning
> > Pacific Time.)
>  
> Really thanks for checking this through late night!

No problem, and I guess it is instead the day after tomorrow, but
I thought you might be interested in chronyd's opinion:

[root@rtptest1029.snc8 ~]# cat /var/lib/chrony/drift
        40001.074911             0.002098

In contrast, on my Fedora laptop:

$ sudo cat /var/lib/chrony/drift
            2.074313             0.186606

I am (perhaps incorrectly) taking this to indicate that TSC is in fact
drifting with respect to standard time.

Thoughts?

							Thanx, Paul
  
Feng Tang Dec. 2, 2022, 1:10 a.m. UTC | #20
On Thu, Dec 01, 2022 at 09:24:05AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 30, 2022 at 02:00:04PM +0800, Feng Tang wrote:
> > On Tue, Nov 29, 2022 at 09:50:24PM -0800, Paul E. McKenney wrote:
> > [...]
> > > >  
> > > > Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> > > > and it matches the 40ms drift per second you mentioned earlier, this
> > > > seems like the CPUID(0x15) gave the wrong frequence number.
> > > > 
> > > > Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> > > > which has deviation.
> > > > 
> > > > Either way, I think the HW/FW could have some problem.
> > > 
> > > And this time with your printk()s:
> > > 
> > > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > > [    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > [   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > [   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > [   57.823933] clocksource: Switched to clocksource tsc-early
> > > [   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > [   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > [   63.637323] clocksource: Switched to clocksource tsc
> > > [   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > [   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > [   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz
> > 
> > This confirms the tsc frequency is calculated from CPUID(0x15).
> > 
> > > What would be good next steps to check up on the hardware and firmware?
> > 
> > Maybe raise it to vendor? I have no idea how to check thos black boxes :)
> 
> Done, but no high hopes here.  (What, me cynical?  Better believe it!!!)
> 
> > > (My next step involves a pillow, but will follow up tomorrow morning
> > > Pacific Time.)
> >  
> > Really thanks for checking this through late night!
> 
> No problem, and I guess it is instead the day after tomorrow, but
> I thought you might be interested in chronyd's opinion:
> 
> [root@rtptest1029.snc8 ~]# cat /var/lib/chrony/drift
>         40001.074911             0.002098
> 
> In contrast, on my Fedora laptop:
> 
> $ sudo cat /var/lib/chrony/drift
>             2.074313             0.186606
> 
> I am (perhaps incorrectly) taking this to indicate that TSC is in fact
> drifting with respect to standard time.
 
This info is very useful! It further confirms the CPUID(0x15) gave
the wrong frequency info. 

Also I don't think TSC itself is drifting, and the drift some from
the wrong match calculation(1896 MHz), if we give it the correct
number (likely 1975 MHz here), there shouldn't be big chrony drift
like your Fedora laptop.

Thanks,
Feng


> Thoughts?
> 
> 							Thanx, Paul
  
Paul E. McKenney Dec. 2, 2022, 1:44 a.m. UTC | #21
On Fri, Dec 02, 2022 at 09:10:12AM +0800, Feng Tang wrote:
> On Thu, Dec 01, 2022 at 09:24:05AM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2022 at 02:00:04PM +0800, Feng Tang wrote:
> > > On Tue, Nov 29, 2022 at 09:50:24PM -0800, Paul E. McKenney wrote:
> > > [...]
> > > > >  
> > > > > Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> > > > > and it matches the 40ms drift per second you mentioned earlier, this
> > > > > seems like the CPUID(0x15) gave the wrong frequence number.
> > > > > 
> > > > > Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> > > > > which has deviation.
> > > > > 
> > > > > Either way, I think the HW/FW could have some problem.
> > > > 
> > > > And this time with your printk()s:
> > > > 
> > > > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > > > [    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > > [   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > [   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > > [   57.823933] clocksource: Switched to clocksource tsc-early
> > > > [   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > > [   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > [   63.637323] clocksource: Switched to clocksource tsc
> > > > [   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > > [   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > > [   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz
> > > 
> > > This confirms the tsc frequency is calculated from CPUID(0x15).
> > > 
> > > > What would be good next steps to check up on the hardware and firmware?
> > > 
> > > Maybe raise it to vendor? I have no idea how to check thos black boxes :)
> > 
> > Done, but no high hopes here.  (What, me cynical?  Better believe it!!!)
> > 
> > > > (My next step involves a pillow, but will follow up tomorrow morning
> > > > Pacific Time.)
> > >  
> > > Really thanks for checking this through late night!
> > 
> > No problem, and I guess it is instead the day after tomorrow, but
> > I thought you might be interested in chronyd's opinion:
> > 
> > [root@rtptest1029.snc8 ~]# cat /var/lib/chrony/drift
> >         40001.074911             0.002098
> > 
> > In contrast, on my Fedora laptop:
> > 
> > $ sudo cat /var/lib/chrony/drift
> >             2.074313             0.186606
> > 
> > I am (perhaps incorrectly) taking this to indicate that TSC is in fact
> > drifting with respect to standard time.
>  
> This info is very useful! It further confirms the CPUID(0x15) gave
> the wrong frequency info. 

So the TSC is just doing what it is told.  ;-)

This indicates a firmware problem?

> Also I don't think TSC itself is drifting, and the drift some from
> the wrong match calculation(1896 MHz), if we give it the correct
> number (likely 1975 MHz here), there shouldn't be big chrony drift
> like your Fedora laptop.

Resetting so that the clocksource watchdog once again gets rid of TSC,
but leaving nohpet:

[    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
[    0.000000] tsc: Detected 1900.000 MHz processor
[    0.000000] tsc: Detected 1896.000 MHz TSC
[    5.287750] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[   17.963947] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
[   19.996287] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
[   20.040287] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
[   20.067286] clocksource:                       'tsc-early' cs_nsec: 588021368 cs_now: 581c1eb378 cs_last: 57d9aad9e8 mask: ffffffffffffffff
 [   20.096286] clocksource:                       No current clocksource.
 [   20.111286] tsc: Marking TSC unstable due to clocksource watchdog
 [   24.582541] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
 [   49.739301] clocksource: Switched to clocksource refined-jiffies
 [   50.046356] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
 [   50.066475] clocksource: Switched to clocksource acpi_pm

# cat /var/lib/chrony/drift 
            1.372570             0.020049

I interpret this to mean that acpi_pm (and thus from prior observations,
HPET as well) are counting at the correct rate.

							Thanx, Paul
  
Feng Tang Dec. 2, 2022, 2:02 a.m. UTC | #22
On Thu, Dec 01, 2022 at 05:44:28PM -0800, Paul E. McKenney wrote:
> On Fri, Dec 02, 2022 at 09:10:12AM +0800, Feng Tang wrote:
> > On Thu, Dec 01, 2022 at 09:24:05AM -0800, Paul E. McKenney wrote:
> > > On Wed, Nov 30, 2022 at 02:00:04PM +0800, Feng Tang wrote:
> > > > On Tue, Nov 29, 2022 at 09:50:24PM -0800, Paul E. McKenney wrote:
> > > > [...]
> > > > > >  
> > > > > > Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> > > > > > and it matches the 40ms drift per second you mentioned earlier, this
> > > > > > seems like the CPUID(0x15) gave the wrong frequence number.
> > > > > > 
> > > > > > Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> > > > > > which has deviation.
> > > > > > 
> > > > > > Either way, I think the HW/FW could have some problem.
> > > > > 
> > > > > And this time with your printk()s:
> > > > > 
> > > > > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > > > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > > > > [    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > > > [   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > [   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > > > [   57.823933] clocksource: Switched to clocksource tsc-early
> > > > > [   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > > > [   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > [   63.637323] clocksource: Switched to clocksource tsc
> > > > > [   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > > > [   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > > > [   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz
> > > > 
> > > > This confirms the tsc frequency is calculated from CPUID(0x15).
> > > > 
> > > > > What would be good next steps to check up on the hardware and firmware?
> > > > 
> > > > Maybe raise it to vendor? I have no idea how to check thos black boxes :)
> > > 
> > > Done, but no high hopes here.  (What, me cynical?  Better believe it!!!)
> > > 
> > > > > (My next step involves a pillow, but will follow up tomorrow morning
> > > > > Pacific Time.)
> > > >  
> > > > Really thanks for checking this through late night!
> > > 
> > > No problem, and I guess it is instead the day after tomorrow, but
> > > I thought you might be interested in chronyd's opinion:
> > > 
> > > [root@rtptest1029.snc8 ~]# cat /var/lib/chrony/drift
> > >         40001.074911             0.002098
> > > 
> > > In contrast, on my Fedora laptop:
> > > 
> > > $ sudo cat /var/lib/chrony/drift
> > >             2.074313             0.186606
> > > 
> > > I am (perhaps incorrectly) taking this to indicate that TSC is in fact
> > > drifting with respect to standard time.
> >  
> > This info is very useful! It further confirms the CPUID(0x15) gave
> > the wrong frequency info. 
> 
> So the TSC is just doing what it is told.  ;-)
> 
> This indicates a firmware problem?
> 
> > Also I don't think TSC itself is drifting, and the drift some from
> > the wrong match calculation(1896 MHz), if we give it the correct
> > number (likely 1975 MHz here), there shouldn't be big chrony drift
> > like your Fedora laptop.
> 
> Resetting so that the clocksource watchdog once again gets rid of TSC,
> but leaving nohpet:
> 
> [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> [    0.000000] tsc: Detected 1900.000 MHz processor
> [    0.000000] tsc: Detected 1896.000 MHz TSC
> [    5.287750] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> [   17.963947] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> [   19.996287] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> [   20.040287] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
> [   20.067286] clocksource:                       'tsc-early' cs_nsec: 588021368 cs_now: 581c1eb378 cs_last: 57d9aad9e8 mask: ffffffffffffffff
>  [   20.096286] clocksource:                       No current clocksource.
>  [   20.111286] tsc: Marking TSC unstable due to clocksource watchdog
>  [   24.582541] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
>  [   49.739301] clocksource: Switched to clocksource refined-jiffies
>  [   50.046356] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
>  [   50.066475] clocksource: Switched to clocksource acpi_pm
> 
> # cat /var/lib/chrony/drift 
>             1.372570             0.020049
> 
> I interpret this to mean that acpi_pm (and thus from prior observations,
> HPET as well) are counting at the correct rate.

Correct. And this is a good news! that 1975 MHz seems to be the right
number.

Could you try below patch, it should override the CPUID calculation
and forced to use HW timer calibrated number:

---
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a78e73da4a74..68a2fea4961d 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1417,7 +1417,8 @@ static int __init init_tsc_clocksource(void)
 	 * When TSC frequency is known (retrieved via MSR or CPUID), we skip
 	 * the refined calibration and directly register it as a clocksource.
 	 */
-	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
+//	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
+	if (false) {
 		if (boot_cpu_has(X86_FEATURE_ART))
 			art_related_clocksource = &clocksource_tsc;
 		clocksource_register_khz(&clocksource_tsc, tsc_khz);
---

And together with b50db7095fe0 "x86/tsc: Disable clocksource watchdog
for TSC on qualified platorms". I assume this will tell TSC to use
1975 MHZ as its frequency.


Thanks,
Feng
  
Paul E. McKenney Dec. 2, 2022, 10:24 p.m. UTC | #23
On Fri, Dec 02, 2022 at 10:02:59AM +0800, Feng Tang wrote:
> On Thu, Dec 01, 2022 at 05:44:28PM -0800, Paul E. McKenney wrote:
> > On Fri, Dec 02, 2022 at 09:10:12AM +0800, Feng Tang wrote:
> > > On Thu, Dec 01, 2022 at 09:24:05AM -0800, Paul E. McKenney wrote:
> > > > On Wed, Nov 30, 2022 at 02:00:04PM +0800, Feng Tang wrote:
> > > > > On Tue, Nov 29, 2022 at 09:50:24PM -0800, Paul E. McKenney wrote:
> > > > > [...]
> > > > > > >  
> > > > > > > Great! As both HPET and PM_TIMER get the same calibration 1975.000 MHz,
> > > > > > > and it matches the 40ms drift per second you mentioned earlier, this
> > > > > > > seems like the CPUID(0x15) gave the wrong frequence number.
> > > > > > > 
> > > > > > > Or unlikely, HPET and PM_TIMER are driven by the same circuit source,
> > > > > > > which has deviation.
> > > > > > > 
> > > > > > > Either way, I think the HW/FW could have some problem.
> > > > > > 
> > > > > > And this time with your printk()s:
> > > > > > 
> > > > > > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > > > > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > > > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > > > > > [    5.268858] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > > > > [   25.706231] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > > [   32.223011] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > > > > > [   57.823933] clocksource: Switched to clocksource tsc-early
> > > > > > [   58.144840] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > > > > > [   63.613713] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > > > > [   63.637323] clocksource: Switched to clocksource tsc
> > > > > > [   64.673579] tsc: Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!
> > > > > > [   64.703719] tsc: Previous calibrated TSC freq:        1896.000 MHz
> > > > > > [   64.716816] tsc: TSC freq recalibrated by [PM_TIMER]:         1974.999 MHz
> > > > > 
> > > > > This confirms the tsc frequency is calculated from CPUID(0x15).
> > > > > 
> > > > > > What would be good next steps to check up on the hardware and firmware?
> > > > > 
> > > > > Maybe raise it to vendor? I have no idea how to check thos black boxes :)
> > > > 
> > > > Done, but no high hopes here.  (What, me cynical?  Better believe it!!!)
> > > > 
> > > > > > (My next step involves a pillow, but will follow up tomorrow morning
> > > > > > Pacific Time.)
> > > > >  
> > > > > Really thanks for checking this through late night!
> > > > 
> > > > No problem, and I guess it is instead the day after tomorrow, but
> > > > I thought you might be interested in chronyd's opinion:
> > > > 
> > > > [root@rtptest1029.snc8 ~]# cat /var/lib/chrony/drift
> > > >         40001.074911             0.002098
> > > > 
> > > > In contrast, on my Fedora laptop:
> > > > 
> > > > $ sudo cat /var/lib/chrony/drift
> > > >             2.074313             0.186606
> > > > 
> > > > I am (perhaps incorrectly) taking this to indicate that TSC is in fact
> > > > drifting with respect to standard time.
> > >  
> > > This info is very useful! It further confirms the CPUID(0x15) gave
> > > the wrong frequency info. 
> > 
> > So the TSC is just doing what it is told.  ;-)
> > 
> > This indicates a firmware problem?
> > 
> > > Also I don't think TSC itself is drifting, and the drift some from
> > > the wrong match calculation(1896 MHz), if we give it the correct
> > > number (likely 1975 MHz here), there shouldn't be big chrony drift
> > > like your Fedora laptop.
> > 
> > Resetting so that the clocksource watchdog once again gets rid of TSC,
> > but leaving nohpet:
> > 
> > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > [    0.000000] tsc: Detected 1900.000 MHz processor
> > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > [    5.287750] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > [   17.963947] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > [   19.996287] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> > [   20.040287] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
> > [   20.067286] clocksource:                       'tsc-early' cs_nsec: 588021368 cs_now: 581c1eb378 cs_last: 57d9aad9e8 mask: ffffffffffffffff
> >  [   20.096286] clocksource:                       No current clocksource.
> >  [   20.111286] tsc: Marking TSC unstable due to clocksource watchdog
> >  [   24.582541] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> >  [   49.739301] clocksource: Switched to clocksource refined-jiffies
> >  [   50.046356] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> >  [   50.066475] clocksource: Switched to clocksource acpi_pm
> > 
> > # cat /var/lib/chrony/drift 
> >             1.372570             0.020049
> > 
> > I interpret this to mean that acpi_pm (and thus from prior observations,
> > HPET as well) are counting at the correct rate.
> 
> Correct. And this is a good news! that 1975 MHz seems to be the right
> number.
> 
> Could you try below patch, it should override the CPUID calculation
> and forced to use HW timer calibrated number:
> 
> ---
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index a78e73da4a74..68a2fea4961d 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1417,7 +1417,8 @@ static int __init init_tsc_clocksource(void)
>  	 * When TSC frequency is known (retrieved via MSR or CPUID), we skip
>  	 * the refined calibration and directly register it as a clocksource.
>  	 */
> -	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
> +//	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
> +	if (false) {
>  		if (boot_cpu_has(X86_FEATURE_ART))
>  			art_related_clocksource = &clocksource_tsc;
>  		clocksource_register_khz(&clocksource_tsc, tsc_khz);
> ---
> 
> And together with b50db7095fe0 "x86/tsc: Disable clocksource watchdog
> for TSC on qualified platorms". I assume this will tell TSC to use
> 1975 MHZ as its frequency.

This did not change things, but when I hardcoded tsc_khz to 1975, the
clocksource watchdog no longer disables TSC and chrony shows drifts of
less than 2.0.  (As opposed to about 40,000 without the hardcoding.)

So yes, forcing 1975 makes TSC work nicely.  Yay!  ;-)

							Thanx, Paul
  
Feng Tang Dec. 3, 2022, 2:51 a.m. UTC | #24
On Fri, Dec 02, 2022 at 02:24:02PM -0800, Paul E. McKenney wrote:
[...]
> > > > > > > (My next step involves a pillow, but will follow up tomorrow morning
> > > > > > > Pacific Time.)
> > > > > >  
> > > > > > Really thanks for checking this through late night!
> > > > > 
> > > > > No problem, and I guess it is instead the day after tomorrow, but
> > > > > I thought you might be interested in chronyd's opinion:
> > > > > 
> > > > > [root@rtptest1029.snc8 ~]# cat /var/lib/chrony/drift
> > > > >         40001.074911             0.002098
> > > > > 
> > > > > In contrast, on my Fedora laptop:
> > > > > 
> > > > > $ sudo cat /var/lib/chrony/drift
> > > > >             2.074313             0.186606
> > > > > 
> > > > > I am (perhaps incorrectly) taking this to indicate that TSC is in fact
> > > > > drifting with respect to standard time.
> > > >  
> > > > This info is very useful! It further confirms the CPUID(0x15) gave
> > > > the wrong frequency info. 
> > > 
> > > So the TSC is just doing what it is told.  ;-)
> > > 
> > > This indicates a firmware problem?
> > > 
> > > > Also I don't think TSC itself is drifting, and the drift some from
> > > > the wrong match calculation(1896 MHz), if we give it the correct
> > > > number (likely 1975 MHz here), there shouldn't be big chrony drift
> > > > like your Fedora laptop.
> > > 
> > > Resetting so that the clocksource watchdog once again gets rid of TSC,
> > > but leaving nohpet:
> > > 
> > > [    0.000000] tsc: using CPUID[0x15] crystal_khz= 24000 kHz ebx=158 eax=2
> > > [    0.000000] tsc: Detected 1900.000 MHz processor
> > > [    0.000000] tsc: Detected 1896.000 MHz TSC
> > > [    5.287750] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
> > > [   17.963947] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36a8d32ce31, max_idle_ns: 881590731004 ns
> > > [   19.996287] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc-early' as unstable because the skew is too large:
> > > [   20.040287] clocksource:                       'refined-jiffies' wd_nsec: 503923392 wd_now: fffb73f8 wd_last: fffb7200 mask: ffffffff
> > > [   20.067286] clocksource:                       'tsc-early' cs_nsec: 588021368 cs_now: 581c1eb378 cs_last: 57d9aad9e8 mask: ffffffffffffffff
> > >  [   20.096286] clocksource:                       No current clocksource.
> > >  [   20.111286] tsc: Marking TSC unstable due to clocksource watchdog
> > >  [   24.582541] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> > >  [   49.739301] clocksource: Switched to clocksource refined-jiffies
> > >  [   50.046356] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
> > >  [   50.066475] clocksource: Switched to clocksource acpi_pm
> > > 
> > > # cat /var/lib/chrony/drift 
> > >             1.372570             0.020049
> > > 
> > > I interpret this to mean that acpi_pm (and thus from prior observations,
> > > HPET as well) are counting at the correct rate.
> > 
> > Correct. And this is a good news! that 1975 MHz seems to be the right
> > number.
> > 
> > Could you try below patch, it should override the CPUID calculation
> > and forced to use HW timer calibrated number:
> > 
> > ---
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index a78e73da4a74..68a2fea4961d 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -1417,7 +1417,8 @@ static int __init init_tsc_clocksource(void)
> >  	 * When TSC frequency is known (retrieved via MSR or CPUID), we skip
> >  	 * the refined calibration and directly register it as a clocksource.
> >  	 */
> > -	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
> > +//	if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
> > +	if (false) {
> >  		if (boot_cpu_has(X86_FEATURE_ART))
> >  			art_related_clocksource = &clocksource_tsc;
> >  		clocksource_register_khz(&clocksource_tsc, tsc_khz);
> > ---
> > 
> > And together with b50db7095fe0 "x86/tsc: Disable clocksource watchdog
> > for TSC on qualified platorms". I assume this will tell TSC to use
> > 1975 MHZ as its frequency.
> 
> This did not change things,

Maybe the kernel is with the tsc reclibrate patch? it will not use
the reclibrated frequency for a KNOWN_FREQ case, but just print
out the info. Anyway this doesn't matter, as it's just for debug,
and your way already proved the 1975 MHz is the right number :)

> but when I hardcoded tsc_khz to 1975, the
> clocksource watchdog no longer disables TSC and chrony shows drifts of
> less than 2.0.  (As opposed to about 40,000 without the hardcoding.)
> 
> So yes, forcing 1975 makes TSC work nicely.  Yay!  ;-)

Great to know it works and we have some workaround before a final
solution :)

Thanks,
Feng

> 							Thanx, Paul
  

Patch

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 8058bec87acee..dcaf38c062161 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -386,7 +386,7 @@  EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
 
 static void clocksource_watchdog(struct timer_list *unused)
 {
-	u64 csnow, wdnow, cslast, wdlast, delta;
+	u64 csnow, wdnow, cslast, wdlast, delta, wdi;
 	int next_cpu, reset_pending;
 	int64_t wd_nsec, cs_nsec;
 	struct clocksource *cs;
@@ -440,6 +440,17 @@  static void clocksource_watchdog(struct timer_list *unused)
 		if (atomic_read(&watchdog_reset_pending))
 			continue;
 
+		/* Check for bogus measurements. */
+		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
+		if (wd_nsec < (wdi >> 2)) {
+			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
+			continue;
+		}
+		if (wd_nsec > (wdi << 2)) {
+			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
+			continue;
+		}
+
 		/* Check the deviation from the watchdog clocksource. */
 		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
 		if (abs(cs_nsec - wd_nsec) > md) {