[clocksource] Reject bogus watchdog clocksource measurements

Message ID 20221019230904.GA2502730@paulmck-ThinkPad-P17-Gen-1
State New
Series [clocksource] Reject bogus watchdog clocksource measurements

Commit Message

Paul E. McKenney Oct. 19, 2022, 11:09 p.m. UTC
  One remaining clocksource-skew issue involves extreme CPU overcommit,
which can cause the clocksource watchdog measurements to be delayed by
tens of seconds.  This in turn means that a clock-skew criterion that
is appropriate for a 500-millisecond interval will instead give lots of
false positives.

Therefore, check for the watchdog clocksource reporting much more or
much less time than specified by WATCHDOG_INTERVAL.  In these cases,
print a pr_warn() warning and refrain from marking the clocksource under
test as being unstable.
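
With the nominal 500-millisecond WATCHDOG_INTERVAL, wdi =
jiffies_to_nsecs(WATCHDOG_INTERVAL) is about 500,000,000 ns, so this
skips measurements shorter than wdi >> 2 (125 ms) or longer than
wdi << 2 (2 s).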

Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Waiman Long <longman@redhat.com>
  

Comments

Waiman Long Oct. 20, 2022, 3:01 a.m. UTC | #1
On 10/19/22 19:09, Paul E. McKenney wrote:
> One remaining clocksource-skew issue involves extreme CPU overcommit,
> which can cause the clocksource watchdog measurements to be delayed by
> tens of seconds.  This in turn means that a clock-skew criterion that
> is appropriate for a 500-millisecond interval will instead give lots of
> false positives.

CPU overcommit means it is running in a VM, right? Unfortunately, there
is no consistent cross-arch way to check for running under a
hypervisor, or we might want to add such a test where one is available.
However, CLOCKSOURCE_WATCHDOG is only enabled on x86 and mips, so maybe
we can add a helper function to do that.
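
Something along these lines, perhaps (just a sketch; the function name
is invented and only the x86 side is filled in):

	static inline bool clocksource_running_in_guest(void)
	{
	#ifdef CONFIG_X86
		/* CPUID "hypervisor present" bit set by most hypervisors. */
		return boot_cpu_has(X86_FEATURE_HYPERVISOR);
	#else
		return false;
	#endif
	}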


>
> Therefore, check for the watchdog clocksource reporting much larger or
> much less than the time specified by WATCHDOG_INTERVAL.  In these cases,
> print a pr_warn() warning and refrain from marking the clocksource under
> test as being unstable.
>
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Cc: John Stultz <jstultz@google.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Stephen Boyd <sboyd@kernel.org>
> Cc: Feng Tang <feng.tang@intel.com>
> Cc: Waiman Long <longman@redhat.com>
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 8058bec87acee..dcaf38c062161 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -386,7 +386,7 @@ EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
>   
>   static void clocksource_watchdog(struct timer_list *unused)
>   {
> -	u64 csnow, wdnow, cslast, wdlast, delta;
> +	u64 csnow, wdnow, cslast, wdlast, delta, wdi;
>   	int next_cpu, reset_pending;
>   	int64_t wd_nsec, cs_nsec;
>   	struct clocksource *cs;
> @@ -440,6 +440,17 @@ static void clocksource_watchdog(struct timer_list *unused)
>   		if (atomic_read(&watchdog_reset_pending))
>   			continue;
>   
> +		/* Check for bogus measurements. */
> +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> +		if (wd_nsec < (wdi >> 2)) {
> +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> +			continue;
> +		}
> +		if (wd_nsec > (wdi << 2)) {
> +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> +			continue;
> +		}
> +
>   		/* Check the deviation from the watchdog clocksource. */
>   		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>   		if (abs(cs_nsec - wd_nsec) > md) {

In the worst case, there will be a warning every half second or so. 
Should we rate limit the number of these warnings in some way?
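
One simple option might be the kernel's ratelimit machinery with a
dedicated state, something like the sketch below (interval and burst
values purely illustrative):

	static DEFINE_RATELIMIT_STATE(wd_bogus_ratelimit, 600 * HZ, 1);

	if (__ratelimit(&wd_bogus_ratelimit))
		pr_warn("timekeeping watchdog on CPU%d: ...\n", smp_processor_id());

though something adaptive may well be preferable.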

Cheers,
Longman
  
Feng Tang Oct. 20, 2022, 8:09 a.m. UTC | #2
On Wed, Oct 19, 2022 at 04:09:04PM -0700, Paul E. McKenney wrote:
> One remaining clocksource-skew issue involves extreme CPU overcommit,
> which can cause the clocksource watchdog measurements to be delayed by
> tens of seconds.  This in turn means that a clock-skew criterion that
> is appropriate for a 500-millisecond interval will instead give lots of
> false positives.

I remember seeing logs where the watchdog was delayed by dozens or even
hundreds of seconds.

Thanks for the fix, which makes sense to me! Some nits below.

> Therefore, check for the watchdog clocksource reporting much larger or
> much less than the time specified by WATCHDOG_INTERVAL.  In these cases,
> print a pr_warn() warning and refrain from marking the clocksource under
> test as being unstable.
> 
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Cc: John Stultz <jstultz@google.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Stephen Boyd <sboyd@kernel.org>
> Cc: Feng Tang <feng.tang@intel.com>
> Cc: Waiman Long <longman@redhat.com>
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 8058bec87acee..dcaf38c062161 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -386,7 +386,7 @@ EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
>  
>  static void clocksource_watchdog(struct timer_list *unused)
>  {
> -	u64 csnow, wdnow, cslast, wdlast, delta;
> +	u64 csnow, wdnow, cslast, wdlast, delta, wdi;
>  	int next_cpu, reset_pending;
>  	int64_t wd_nsec, cs_nsec;
>  	struct clocksource *cs;
> @@ -440,6 +440,17 @@ static void clocksource_watchdog(struct timer_list *unused)
>  		if (atomic_read(&watchdog_reset_pending))
>  			continue;
>  
> +		/* Check for bogus measurements. */
> +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> +		if (wd_nsec < (wdi >> 2)) {
> +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> +			continue;
> +		}

If this happens (the nominally 500 ms timer fires after the watchdog
clocksource has advanced by less than 125 ms), there is some severe
problem with the timer/interrupt system.

> +		if (wd_nsec > (wdi << 2)) {
> +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> +			continue;
> +		}

I agree with Waiman that some rate limiting may be needed. As there
have been reports of hundreds of seconds of delay, a 2-second delay
could easily happen if a system is busy or misbehaving enough to
trigger this problem.

Thanks,
Feng
  
Paul E. McKenney Oct. 20, 2022, 2:09 p.m. UTC | #3
On Thu, Oct 20, 2022 at 04:09:01PM +0800, Feng Tang wrote:
> On Wed, Oct 19, 2022 at 04:09:04PM -0700, Paul E. McKenney wrote:
> > One remaining clocksource-skew issue involves extreme CPU overcommit,
> > which can cause the clocksource watchdog measurements to be delayed by
> > tens of seconds.  This in turn means that a clock-skew criterion that
> > is appropriate for a 500-millisecond interval will instead give lots of
> > false positives.
> 
> I remembered I saw logs that the watchdog were delayed to dozens of
> or hundreds of seconds. 
> 
> Thanks for the fix which makes sense to me! with some nits below.
> 
> > Therefore, check for the watchdog clocksource reporting much larger or
> > much less than the time specified by WATCHDOG_INTERVAL.  In these cases,
> > print a pr_warn() warning and refrain from marking the clocksource under
> > test as being unstable.
> > 
> > Reported-by: Chris Mason <clm@meta.com>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > Cc: John Stultz <jstultz@google.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Stephen Boyd <sboyd@kernel.org>
> > Cc: Feng Tang <feng.tang@intel.com>
> > Cc: Waiman Long <longman@redhat.com>
> > 
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index 8058bec87acee..dcaf38c062161 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -386,7 +386,7 @@ EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
> >  
> >  static void clocksource_watchdog(struct timer_list *unused)
> >  {
> > -	u64 csnow, wdnow, cslast, wdlast, delta;
> > +	u64 csnow, wdnow, cslast, wdlast, delta, wdi;
> >  	int next_cpu, reset_pending;
> >  	int64_t wd_nsec, cs_nsec;
> >  	struct clocksource *cs;
> > @@ -440,6 +440,17 @@ static void clocksource_watchdog(struct timer_list *unused)
> >  		if (atomic_read(&watchdog_reset_pending))
> >  			continue;
> >  
> > +		/* Check for bogus measurements. */
> > +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > +		if (wd_nsec < (wdi >> 2)) {
> > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > +			continue;
> > +		}
> 
> If this happens (500ms timer happens only after less than 125ms),
> there is some severe problem with timer/interrupt system. 

Should I add ", suspect timer/interrupt bug" just after "jiffy time
interval"?  Or would a comment before that pr_warn() work better for you?

> > +		if (wd_nsec > (wdi << 2)) {
> > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > +			continue;
> > +		}
> 
> I agree with Waiman that some rate limiting may be needed. As there
> were reports of hundreds of seconds of delay, 2 seconds delay could
> easily happen if a system is too busy or misbehave to trigger this
> problem.

Good points, thank you both!

Left to myself, I would use a capped power-of-two backoff that was reset
any time that the interval was within bounds.  Maybe a cap of 10 minutes?

Or is there a better way to do this?

							Thanx, Paul
  
Waiman Long Oct. 20, 2022, 5:29 p.m. UTC | #4
On 10/20/22 10:09, Paul E. McKenney wrote:
> On Thu, Oct 20, 2022 at 04:09:01PM +0800, Feng Tang wrote:
>> On Wed, Oct 19, 2022 at 04:09:04PM -0700, Paul E. McKenney wrote:
>>> One remaining clocksource-skew issue involves extreme CPU overcommit,
>>> which can cause the clocksource watchdog measurements to be delayed by
>>> tens of seconds.  This in turn means that a clock-skew criterion that
>>> is appropriate for a 500-millisecond interval will instead give lots of
>>> false positives.
>> I remembered I saw logs that the watchdog were delayed to dozens of
>> or hundreds of seconds.
>>
>> Thanks for the fix which makes sense to me! with some nits below.
>>
>>> Therefore, check for the watchdog clocksource reporting much larger or
>>> much less than the time specified by WATCHDOG_INTERVAL.  In these cases,
>>> print a pr_warn() warning and refrain from marking the clocksource under
>>> test as being unstable.
>>>
>>> Reported-by: Chris Mason <clm@meta.com>
>>> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>>> Cc: John Stultz <jstultz@google.com>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Stephen Boyd <sboyd@kernel.org>
>>> Cc: Feng Tang <feng.tang@intel.com>
>>> Cc: Waiman Long <longman@redhat.com>
>>>
>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>>> index 8058bec87acee..dcaf38c062161 100644
>>> --- a/kernel/time/clocksource.c
>>> +++ b/kernel/time/clocksource.c
>>> @@ -386,7 +386,7 @@ EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
>>>   
>>>   static void clocksource_watchdog(struct timer_list *unused)
>>>   {
>>> -	u64 csnow, wdnow, cslast, wdlast, delta;
>>> +	u64 csnow, wdnow, cslast, wdlast, delta, wdi;
>>>   	int next_cpu, reset_pending;
>>>   	int64_t wd_nsec, cs_nsec;
>>>   	struct clocksource *cs;
>>> @@ -440,6 +440,17 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>   		if (atomic_read(&watchdog_reset_pending))
>>>   			continue;
>>>   
>>> +		/* Check for bogus measurements. */
>>> +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
>>> +		if (wd_nsec < (wdi >> 2)) {
>>> +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
>>> +			continue;
>>> +		}
>> If this happens (500ms timer happens only after less than 125ms),
>> there is some severe problem with timer/interrupt system.
> Should I add ", suspect timer/interrupt bug" just after "jiffy time
> interval"?  Or would a comment before that pr_warn() work better for you?
>
>>> +		if (wd_nsec > (wdi << 2)) {
>>> +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
>>> +			continue;
>>> +		}
>> I agree with Waiman that some rate limiting may be needed. As there
>> were reports of hundreds of seconds of delay, 2 seconds delay could
>> easily happen if a system is too busy or misbehave to trigger this
>> problem.
> Good points, thank you both!
>
> Left to myself, I would use a capped power-of-two backoff that was reset
> any time that the interval was within bounds.  Maybe a cap of 10 minutes?

That sounds good to me.

Thanks,
Longman
  
Feng Tang Oct. 21, 2022, 12:46 a.m. UTC | #5
On Thu, Oct 20, 2022 at 07:09:44AM -0700, Paul E. McKenney wrote:
[...]
> > >  static void clocksource_watchdog(struct timer_list *unused)
> > >  {
> > > -	u64 csnow, wdnow, cslast, wdlast, delta;
> > > +	u64 csnow, wdnow, cslast, wdlast, delta, wdi;
> > >  	int next_cpu, reset_pending;
> > >  	int64_t wd_nsec, cs_nsec;
> > >  	struct clocksource *cs;
> > > @@ -440,6 +440,17 @@ static void clocksource_watchdog(struct timer_list *unused)
> > >  		if (atomic_read(&watchdog_reset_pending))
> > >  			continue;
> > >  
> > > +		/* Check for bogus measurements. */
> > > +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > > +		if (wd_nsec < (wdi >> 2)) {
> > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > +			continue;
> > > +		}
> > 
> > If this happens (500ms timer happens only after less than 125ms),
> > there is some severe problem with timer/interrupt system. 
> 
> Should I add ", suspect timer/interrupt bug" just after "jiffy time
> interval"?  Or would a comment before that pr_warn() work better for you?

Both are fine for me.

> > > +		if (wd_nsec > (wdi << 2)) {
> > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > +			continue;
> > > +		}
> > 
> > I agree with Waiman that some rate limiting may be needed. As there
> > were reports of hundreds of seconds of delay, 2 seconds delay could
> > easily happen if a system is too busy or misbehave to trigger this
> > problem.
> 
> Good points, thank you both!
> 
> Left to myself, I would use a capped power-of-two backoff that was reset
> any time that the interval was within bounds.  Maybe a cap of 10 minutes?

Yep, that's more informative. Thanks!

- Feng

> Or is there a better way to do this?
> 
> 							Thanx, Paul
  
Paul E. McKenney Oct. 28, 2022, 5:52 p.m. UTC | #6
On Fri, Oct 21, 2022 at 08:46:32AM +0800, Feng Tang wrote:
> On Thu, Oct 20, 2022 at 07:09:44AM -0700, Paul E. McKenney wrote:
> [...]
> > > >  static void clocksource_watchdog(struct timer_list *unused)
> > > >  {
> > > > -	u64 csnow, wdnow, cslast, wdlast, delta;
> > > > +	u64 csnow, wdnow, cslast, wdlast, delta, wdi;
> > > >  	int next_cpu, reset_pending;
> > > >  	int64_t wd_nsec, cs_nsec;
> > > >  	struct clocksource *cs;
> > > > @@ -440,6 +440,17 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > >  		if (atomic_read(&watchdog_reset_pending))
> > > >  			continue;
> > > >  
> > > > +		/* Check for bogus measurements. */
> > > > +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > > > +		if (wd_nsec < (wdi >> 2)) {
> > > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > +			continue;
> > > > +		}
> > > 
> > > If this happens (500ms timer happens only after less than 125ms),
> > > there is some severe problem with timer/interrupt system. 
> > 
> > Should I add ", suspect timer/interrupt bug" just after "jiffy time
> > interval"?  Or would a comment before that pr_warn() work better for you?
> 
> Both are fine for me.

Here is the patch, which just adds comments.  (The exponential-backoff
patch is on its way.)

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit 59d9db36dc15b3b40a30d7a3d733dbd412c8557a
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu Oct 27 11:58:27 2022 -0700

    clocksource: Add comments to classify bogus measurements
    
    An extremely busy system can delay the clocksource watchdog, so that
    the corresponding too-long bogus-measurement error does not necessarily
    imply an error in the system.  However, a too-short bogus-measurement
    error likely indicates a bug in hardware, firmware or software.
    
    Therefore, add comments clarifying these bogus-measurement pr_warn()s.
    
    Reported-by: Feng Tang <feng.tang@intel.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: John Stultz <jstultz@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Stephen Boyd <sboyd@kernel.org>
    Cc: Feng Tang <feng.tang@intel.com>
    Cc: Waiman Long <longman@redhat.com>

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index dcaf38c062161..3f5317faf891f 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -443,10 +443,12 @@ static void clocksource_watchdog(struct timer_list *unused)
 		/* Check for bogus measurements. */
 		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
 		if (wd_nsec < (wdi >> 2)) {
+			/* This usually indicates broken timer code or hardware. */
 			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
 			continue;
 		}
 		if (wd_nsec > (wdi << 2)) {
+			/* This can happen on busy systems, which can delay the watchdog. */
 			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
 			continue;
 		}
  
Paul E. McKenney Oct. 28, 2022, 5:52 p.m. UTC | #7
On Fri, Oct 21, 2022 at 08:46:32AM +0800, Feng Tang wrote:
> On Thu, Oct 20, 2022 at 07:09:44AM -0700, Paul E. McKenney wrote:
> [...]
> > > >  static void clocksource_watchdog(struct timer_list *unused)
> > > >  {
> > > > -	u64 csnow, wdnow, cslast, wdlast, delta;
> > > > +	u64 csnow, wdnow, cslast, wdlast, delta, wdi;
> > > >  	int next_cpu, reset_pending;
> > > >  	int64_t wd_nsec, cs_nsec;
> > > >  	struct clocksource *cs;
> > > > @@ -440,6 +440,17 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > >  		if (atomic_read(&watchdog_reset_pending))
> > > >  			continue;
> > > >  
> > > > +		/* Check for bogus measurements. */
> > > > +		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > > > +		if (wd_nsec < (wdi >> 2)) {
> > > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > +			continue;
> > > > +		}
> > > 
> > > If this happens (500ms timer happens only after less than 125ms),
> > > there is some severe problem with timer/interrupt system. 
> > 
> > Should I add ", suspect timer/interrupt bug" just after "jiffy time
> > interval"?  Or would a comment before that pr_warn() work better for you?
> 
> Both are fine for me.
> 
> > > > +		if (wd_nsec > (wdi << 2)) {
> > > > +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > +			continue;
> > > > +		}
> > > 
> > > I agree with Waiman that some rate limiting may be needed. As there
> > > were reports of hundreds of seconds of delay, 2 seconds delay could
> > > easily happen if a system is too busy or misbehave to trigger this
> > > problem.
> > 
> > Good points, thank you both!
> > 
> > Left to myself, I would use a capped power-of-two backoff that was reset
> > any time that the interval was within bounds.  Maybe a cap of 10 minutes?
> 
> Yep, that's more informative. Thanks!

And here is the resulting patch.  Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit adc0512cf34fe7f0d73966d59644b826ee526742
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Fri Oct 28 10:38:58 2022 -0700

    clocksource: Exponential backoff for load-induced bogus watchdog reads
    
    The clocksource watchdog will reject measurements that are excessively
    delayed, that is, by more than 1.5 seconds beyond the intended 0.5-second
    watchdog interval.  On an extremely busy system, this can result in a
    console message being printed every two seconds.  This is excessively
    noisy for a non-error condition.
    
    Therefore, apply exponential backoff to these messages.  This exponential
    backoff is capped at 1024 times the watchdog interval, which comes to
    not quite one message per ten minutes.
    
    Please note that the bogus watchdog reads that occur when the watchdog
    interval is less than 0.125 seconds are still printed unconditionally
    because these likely correspond to a serious error condition in the
    timer code or hardware.
    
    Reported-by: Waiman Long <longman@redhat.com>
    Reported-by: Feng Tang <feng.tang@intel.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: John Stultz <jstultz@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Stephen Boyd <sboyd@kernel.org>
    Cc: Feng Tang <feng.tang@intel.com>
    Cc: Waiman Long <longman@redhat.com>

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 1d42d4b173271..daac05aedf56a 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -125,6 +125,10 @@ struct clocksource {
 	struct list_head	wd_list;
 	u64			cs_last;
 	u64			wd_last;
+	u64			wd_last_bogus;
+	int			wd_bogus_shift;
+	unsigned long		wd_bogus_count;
+	unsigned long		wd_bogus_count_last;
 #endif
 	struct module		*owner;
 };
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 3f5317faf891f..6537ffa02e445 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -448,8 +448,26 @@ static void clocksource_watchdog(struct timer_list *unused)
 			continue;
 		}
 		if (wd_nsec > (wdi << 2)) {
-			/* This can happen on busy systems, which can delay the watchdog. */
-			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
+			bool needwarn = false;
+			u64 wd_lb;
+
+			cs->wd_bogus_count++;
+			if (!cs->wd_bogus_shift) {
+				needwarn = true;
+			} else {
+				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
+				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
+				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
+					needwarn = true;
+			}
+			if (needwarn) {
+				/* This can happen on busy systems, which can delay the watchdog. */
+				pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
+				cs->wd_last_bogus = wdnow;
+				if (cs->wd_bogus_shift < 10)
+					cs->wd_bogus_shift++;
+				cs->wd_bogus_count_last = cs->wd_bogus_count;
+			}
 			continue;
 		}
  
Feng Tang Oct. 31, 2022, 5:59 a.m. UTC | #8
On Fri, Oct 28, 2022 at 10:52:45AM -0700, Paul E. McKenney wrote:
[...]
> > > Left to myself, I would use a capped power-of-two backoff that was reset
> > > any time that the interval was within bounds.  Maybe a cap of 10 minutes?
> > 
> > Yep, that's more informative. Thanks!
> 
> And here is the resulting patch.  Thoughts?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit adc0512cf34fe7f0d73966d59644b826ee526742
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Fri Oct 28 10:38:58 2022 -0700
> 
>     clocksource: Exponential backoff for load-induced bogus watchdog reads
>     
>     The clocksource watchdog will reject measurements that are excessively
>     delayed, that is, by more than 1.5 seconds beyond the intended 0.5-second
>     watchdog interval.  On an extremely busy system, this can result in a
>     console message being printed every two seconds.  This is excessively
>     noisy for a non-error condition.
>     
>     Therefore, apply exponential backoff to these messages.  This exponential
>     backoff is capped at 1024 times the watchdog interval, which comes to
>     not quite one message per ten minutes.
>     
>     Please note that the bogus watchdog reads that occur when the watchdog
>     interval is less than 0.125 seconds are still printed unconditionally
>     because these likely correspond to a serious error condition in the
>     timer code or hardware.
>     
>     Reported-by: Waiman Long <longman@redhat.com>
>     Reported-by: Feng Tang <feng.tang@intel.com>
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>     Cc: John Stultz <jstultz@google.com>
>     Cc: Thomas Gleixner <tglx@linutronix.de>
>     Cc: Stephen Boyd <sboyd@kernel.org>
>     Cc: Feng Tang <feng.tang@intel.com>
>     Cc: Waiman Long <longman@redhat.com>
> 
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index 1d42d4b173271..daac05aedf56a 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -125,6 +125,10 @@ struct clocksource {
>  	struct list_head	wd_list;
>  	u64			cs_last;
>  	u64			wd_last;
> +	u64			wd_last_bogus;
> +	int			wd_bogus_shift;
> +	unsigned long		wd_bogus_count;
> +	unsigned long		wd_bogus_count_last;
>  #endif
>  	struct module		*owner;
>  };
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 3f5317faf891f..6537ffa02e445 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -448,8 +448,26 @@ static void clocksource_watchdog(struct timer_list *unused)
>  			continue;
>  		}
>  		if (wd_nsec > (wdi << 2)) {

Just recalled one thing: it may be better to check 'cs_nsec' instead
of 'wd_nsec', as some watchdogs have a small wrap-around value. IIRC,
HPET's counter is 32 bits long and wraps at about 300 seconds, and
PMTIMER's counter is 24 bits, which wraps at about 3 ~ 4 seconds. So
when a long stall of the watchdog timer happens, the watchdog's value
could 'overflow' many times.

And usually the 'current' clocksource has a longer wrap time than the
watchdog.
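
(For instance, with a typical 14.318 MHz HPET and a 32-bit counter,
2^32 / 14.318e6 is roughly 300 seconds, and the ACPI PM timer runs at
3.579545 MHz with a 24-bit counter, so 2^24 / 3.58e6 is roughly 4.7
seconds, while a 64-bit TSC at GHz rates takes years to wrap.)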


> -			/* This can happen on busy systems, which can delay the watchdog. */
> -			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> +			bool needwarn = false;
> +			u64 wd_lb;
> +
> +			cs->wd_bogus_count++;
> +			if (!cs->wd_bogus_shift) {
> +				needwarn = true;
> +			} else {
> +				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> +				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> +				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> +					needwarn = true;

I'm not sure if we need to check the last_bogus counter, or whether
the current interval 'cs_nsec' is all we care about. Maybe some code
like this?

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index daac05aedf56..3910dbb9b960 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -125,7 +125,6 @@ struct clocksource {
 	struct list_head	wd_list;
 	u64			cs_last;
 	u64			wd_last;
-	u64			wd_last_bogus;
 	int			wd_bogus_shift;
 	unsigned long		wd_bogus_count;
 	unsigned long		wd_bogus_count_last;
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 6537ffa02e44..8e6d498b1492 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -442,28 +442,18 @@ static void clocksource_watchdog(struct timer_list *unused)
 
 		/* Check for bogus measurements. */
 		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
-		if (wd_nsec < (wdi >> 2)) {
+		if (cs_nsec < (wdi >> 2)) {
 			/* This usually indicates broken timer code or hardware. */
-			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
+			pr_warn("timekeeping watchdog on CPU%d: clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), cs->name, wd_nsec, WATCHDOG_INTERVAL);
 			continue;
 		}
-		if (wd_nsec > (wdi << 2)) {
-			bool needwarn = false;
-			u64 wd_lb;
-
+		if (cs_nsec > (wdi << 2)) {
 			cs->wd_bogus_count++;
-			if (!cs->wd_bogus_shift) {
-				needwarn = true;
-			} else {
-				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
-				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
-				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
-					needwarn = true;
-			}
-			if (needwarn) {
+			if (!cs->wd_bogus_shift ||
+			    (1 << cs->wd_bogus_shift) * wdi <= cs_nsec) {
 				/* This can happen on busy systems, which can delay the watchdog. */
-				pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
-				cs->wd_last_bogus = wdnow;
+				pr_warn("timekeeping watchdog on CPU%d: clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), cs->name, cs_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
+
 				if (cs->wd_bogus_shift < 10)
 					cs->wd_bogus_shift++;
 				cs->wd_bogus_count_last = cs->wd_bogus_count;

Thanks,
Feng


> +			}
> +			if (needwarn) {
> +				/* This can happen on busy systems, which can delay the watchdog. */
> +				pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
> +				cs->wd_last_bogus = wdnow;
> +				if (cs->wd_bogus_shift < 10)
> +					cs->wd_bogus_shift++;
> +				cs->wd_bogus_count_last = cs->wd_bogus_count;
> +			}
>  			continue;
>  		}
>
  
Paul E. McKenney Oct. 31, 2022, 5:42 p.m. UTC | #9
On Mon, Oct 31, 2022 at 01:59:38PM +0800, Feng Tang wrote:
> On Fri, Oct 28, 2022 at 10:52:45AM -0700, Paul E. McKenney wrote:
> [...]
> > > > Left to myself, I would use a capped power-of-two backoff that was reset
> > > > any time that the interval was within bounds.  Maybe a cap of 10 minutes?
> > > 
> > > Yep, that's more informative. Thanks!
> > 
> > And here is the resulting patch.  Thoughts?
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > commit adc0512cf34fe7f0d73966d59644b826ee526742
> > Author: Paul E. McKenney <paulmck@kernel.org>
> > Date:   Fri Oct 28 10:38:58 2022 -0700
> > 
> >     clocksource: Exponential backoff for load-induced bogus watchdog reads
> >     
> >     The clocksource watchdog will reject measurements that are excessively
> >     delayed, that is, by more than 1.5 seconds beyond the intended 0.5-second
> >     watchdog interval.  On an extremely busy system, this can result in a
> >     console message being printed every two seconds.  This is excessively
> >     noisy for a non-error condition.
> >     
> >     Therefore, apply exponential backoff to these messages.  This exponential
> >     backoff is capped at 1024 times the watchdog interval, which comes to
> >     not quite one message per ten minutes.
> >     
> >     Please note that the bogus watchdog reads that occur when the watchdog
> >     interval is less than 0.125 seconds are still printed unconditionally
> >     because these likely correspond to a serious error condition in the
> >     timer code or hardware.
> >     
> >     Reported-by: Waiman Long <longman@redhat.com>
> >     Reported-by: Feng Tang <feng.tang@intel.com>
> >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> >     Cc: John Stultz <jstultz@google.com>
> >     Cc: Thomas Gleixner <tglx@linutronix.de>
> >     Cc: Stephen Boyd <sboyd@kernel.org>
> >     Cc: Feng Tang <feng.tang@intel.com>
> >     Cc: Waiman Long <longman@redhat.com>
> > 
> > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> > index 1d42d4b173271..daac05aedf56a 100644
> > --- a/include/linux/clocksource.h
> > +++ b/include/linux/clocksource.h
> > @@ -125,6 +125,10 @@ struct clocksource {
> >  	struct list_head	wd_list;
> >  	u64			cs_last;
> >  	u64			wd_last;
> > +	u64			wd_last_bogus;
> > +	int			wd_bogus_shift;
> > +	unsigned long		wd_bogus_count;
> > +	unsigned long		wd_bogus_count_last;
> >  #endif
> >  	struct module		*owner;
> >  };
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index 3f5317faf891f..6537ffa02e445 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -448,8 +448,26 @@ static void clocksource_watchdog(struct timer_list *unused)
> >  			continue;
> >  		}
> >  		if (wd_nsec > (wdi << 2)) {
> 
> Just recalled one thing, that it may be better to check 'cs_nsec' 
> instead of 'wd_nsec', as some watchdog may have small wrap-around
> value. IIRC, HPET's counter is 32 bits long and wraps at about
> 300 seconds, and PMTIMER's counter is 24 bits which wraps at about
> 3 ~ 4 seconds. So when a long stall of the watchdog timer happens,
> the watchdog's value could 'overflow' many times.
> 
> And usually the 'current' closcksource has longer wrap time than
> the watchdog.

Why not both?

 		if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {

> > -			/* This can happen on busy systems, which can delay the watchdog. */
> > -			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > +			bool needwarn = false;
> > +			u64 wd_lb;
> > +
> > +			cs->wd_bogus_count++;
> > +			if (!cs->wd_bogus_shift) {
> > +				needwarn = true;
> > +			} else {
> > +				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> > +				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> > +				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> > +					needwarn = true;
> 
> I'm not sure if we need to check the last_bogus counter, or just
> the current interval 'cs_nsec' is what we care, and some code
> like this ?

I thought we wanted exponential backoff?  Do you really get that from
the changes below?

And should we be using something like the jiffies counter to measure the
exponential backoff?
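
For example, something like this sketch, where wd_last_warn is an
invented per-clocksource field holding a jiffies-based timestamp:

	if (!cs->wd_bogus_shift ||
	    time_after64(get_jiffies_64(),
			 cs->wd_last_warn +
			 ((u64)WATCHDOG_INTERVAL << cs->wd_bogus_shift))) {
		needwarn = true;
		cs->wd_last_warn = get_jiffies_64();
	}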

							Thanx, Paul

> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index daac05aedf56..3910dbb9b960 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -125,7 +125,6 @@ struct clocksource {
>  	struct list_head	wd_list;
>  	u64			cs_last;
>  	u64			wd_last;
> -	u64			wd_last_bogus;
>  	int			wd_bogus_shift;
>  	unsigned long		wd_bogus_count;
>  	unsigned long		wd_bogus_count_last;
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 6537ffa02e44..8e6d498b1492 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -442,28 +442,18 @@ static void clocksource_watchdog(struct timer_list *unused)
>  
>  		/* Check for bogus measurements. */
>  		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> -		if (wd_nsec < (wdi >> 2)) {
> +		if (cs_nsec < (wdi >> 2)) {
>  			/* This usually indicates broken timer code or hardware. */
> -			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> +			pr_warn("timekeeping watchdog on CPU%d: clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), cs->name, wd_nsec, WATCHDOG_INTERVAL);
>  			continue;
>  		}
> -		if (wd_nsec > (wdi << 2)) {
> -			bool needwarn = false;
> -			u64 wd_lb;
> -
> +		if (cs_nsec > (wdi << 2)) {
>  			cs->wd_bogus_count++;
> -			if (!cs->wd_bogus_shift) {
> -				needwarn = true;
> -			} else {
> -				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> -				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> -				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> -					needwarn = true;
> -			}
> -			if (needwarn) {
> +			if (!cs->wd_bogus_shift ||
> +			    (1 << cs->wd_bogus_shift) * wdi <= cs_nsec) {
>  				/* This can happen on busy systems, which can delay the watchdog. */
> -				pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
> -				cs->wd_last_bogus = wdnow;
> +				pr_warn("timekeeping watchdog on CPU%d: clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), cs->name, cs_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
> +
>  				if (cs->wd_bogus_shift < 10)
>  					cs->wd_bogus_shift++;
>  				cs->wd_bogus_count_last = cs->wd_bogus_count;
> 
> Thanks,
> Feng
> 
> 
> > +			}
> > +			if (needwarn) {
> > +				/* This can happen on busy systems, which can delay the watchdog. */
> > +				pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
> > +				cs->wd_last_bogus = wdnow;
> > +				if (cs->wd_bogus_shift < 10)
> > +					cs->wd_bogus_shift++;
> > +				cs->wd_bogus_count_last = cs->wd_bogus_count;
> > +			}
> >  			continue;
> >  		}
> >
  
Feng Tang Nov. 1, 2022, 5:43 a.m. UTC | #10
On Mon, Oct 31, 2022 at 10:42:12AM -0700, Paul E. McKenney wrote:

[...]
> > > @@ -448,8 +448,26 @@ static void clocksource_watchdog(struct timer_list *unused)
> > >  			continue;
> > >  		}
> > >  		if (wd_nsec > (wdi << 2)) {
> > 
> > Just recalled one thing, that it may be better to check 'cs_nsec' 
> > instead of 'wd_nsec', as some watchdog may have small wrap-around
> > value. IIRC, HPET's counter is 32 bits long and wraps at about
> > 300 seconds, and PMTIMER's counter is 24 bits which wraps at about
> > 3 ~ 4 seconds. So when a long stall of the watchdog timer happens,
> > the watchdog's value could 'overflow' many times.
> > 
> > And usually the 'current' closcksource has longer wrap time than
> > the watchdog.
> 
> Why not both?

You mean checking both the clocksource and the watchdog? That's fine
with me, though I still trust the clocksource more.

I checked some old emails and found some long stall logs for reference.

* one stall of 471 seconds

 [ 2410.694068] clocksource: timekeeping watchdog on CPU262: Marking clocksource 'tsc' as unstable because the skew is too large:
 [ 2410.706920] clocksource:                       'hpet' wd_nsec: 0 wd_now: ffd70be2 wd_last: 40da633b mask: ffffffff
 [ 2410.718583] clocksource:                       'tsc' cs_nsec: 471766594285 cs_now: 44f62c184e9 cs_last: 394a7a43771 mask: ffffffffffffffff
 [ 2410.732568] clocksource:                       'tsc' is current clocksource.
 [ 2410.740553] tsc: Marking TSC unstable due to clocksource watchdog
 [ 2410.747611] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
 [ 2410.757321] sched_clock: Marking unstable (2398804490960, 11943006672)<-(2419023952548, -8276474713)
 [ 2410.767741] clocksource: Checking clocksource tsc synchronization from CPU 233 to CPUs 0,73,93-94,226,454,602,821.
 [ 2410.784045] clocksource: Switched to clocksource hpet


* another one of 5 seconds

 [ 3302.211708] clocksource: timekeeping watchdog on CPU9: Marking clocksource 'tsc' as unstable because the skew is too large:
 [ 3302.211710] clocksource:                       'acpi_pm' wd_nsec: 312227950 wd_now: 92367f wd_last: 8128bd mask: ffffff
 [ 3302.211712] clocksource:                       'tsc' cs_nsec: 4999196389 cs_now: 9e811223a9754 cs_last: 9e80e767df194 mask: ffffffffffffffff
 [ 3302.211714] clocksource:                       'tsc' is current clocksource.
 [ 3302.211716] tsc: Marking TSC unstable due to clocksource watchdog


> 
>  		if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
> 
> > > -			/* This can happen on busy systems, which can delay the watchdog. */
> > > -			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > +			bool needwarn = false;
> > > +			u64 wd_lb;
> > > +
> > > +			cs->wd_bogus_count++;
> > > +			if (!cs->wd_bogus_shift) {
> > > +				needwarn = true;
> > > +			} else {
> > > +				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> > > +				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> > > +				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> > > +					needwarn = true;
> > 
> > I'm not sure if we need to check the last_bogus counter, or just
> > the current interval 'cs_nsec' is what we care, and some code
> > like this ?
> 
> I thought we wanted exponential backoff?  Do you really get that from
> the changes below?

Aha, I misunderstood your words. I thought we would report only once
for each 2-, 4-, 8-, ... 256-second stall, and after that report only
stalls of 512+ seconds. So your approach looks good to me, as our
intention is to avoid a flood of warning messages.

Thanks,
Feng

> And should we be using something like the jiffies counter to measure the
> exponential backoff?
> 
> 							Thanx, Paul
> 
> > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> > index daac05aedf56..3910dbb9b960 100644
> > --- a/include/linux/clocksource.h
> > +++ b/include/linux/clocksource.h
> > @@ -125,7 +125,6 @@ struct clocksource {
> >  	struct list_head	wd_list;
> >  	u64			cs_last;
> >  	u64			wd_last;
> > -	u64			wd_last_bogus;
> >  	int			wd_bogus_shift;
> >  	unsigned long		wd_bogus_count;
> >  	unsigned long		wd_bogus_count_last;
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index 6537ffa02e44..8e6d498b1492 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -442,28 +442,18 @@ static void clocksource_watchdog(struct timer_list *unused)
> >  
> >  		/* Check for bogus measurements. */
> >  		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> > -		if (wd_nsec < (wdi >> 2)) {
> > +		if (cs_nsec < (wdi >> 2)) {
> >  			/* This usually indicates broken timer code or hardware. */
> > -			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > +			pr_warn("timekeeping watchdog on CPU%d: clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), cs->name, wd_nsec, WATCHDOG_INTERVAL);
> >  			continue;
> >  		}
> > -		if (wd_nsec > (wdi << 2)) {
> > -			bool needwarn = false;
> > -			u64 wd_lb;
> > -
> > +		if (cs_nsec > (wdi << 2)) {
> >  			cs->wd_bogus_count++;
> > -			if (!cs->wd_bogus_shift) {
> > -				needwarn = true;
> > -			} else {
> > -				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> > -				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> > -				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> > -					needwarn = true;
> > -			}
> > -			if (needwarn) {
> > +			if (!cs->wd_bogus_shift ||
> > +			    (1 << cs->wd_bogus_shift) * wdi <= cs_nsec) {
> >  				/* This can happen on busy systems, which can delay the watchdog. */
> > -				pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
> > -				cs->wd_last_bogus = wdnow;
> > +				pr_warn("timekeeping watchdog on CPU%d: clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), cs->name, cs_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
> > +
> >  				if (cs->wd_bogus_shift < 10)
> >  					cs->wd_bogus_shift++;
> >  				cs->wd_bogus_count_last = cs->wd_bogus_count;
> > 
> > Thanks,
> > Feng
> > 
> > 
> > > +			}
> > > +			if (needwarn) {
> > > +				/* This can happen on busy systems, which can delay the watchdog. */
> > > +				pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval (%lu additional), probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL, cs->wd_bogus_count - cs->wd_bogus_count_last);
> > > +				cs->wd_last_bogus = wdnow;
> > > +				if (cs->wd_bogus_shift < 10)
> > > +					cs->wd_bogus_shift++;
> > > +				cs->wd_bogus_count_last = cs->wd_bogus_count;
> > > +			}
> > >  			continue;
> > >  		}
> > >
  
Paul E. McKenney Nov. 1, 2022, 7:06 p.m. UTC | #11
On Tue, Nov 01, 2022 at 01:43:32PM +0800, Feng Tang wrote:
> On Mon, Oct 31, 2022 at 10:42:12AM -0700, Paul E. McKenney wrote:
> 
> [...]
> > > > @@ -448,8 +448,26 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > >  			continue;
> > > >  		}
> > > >  		if (wd_nsec > (wdi << 2)) {
> > > 
> > > Just recalled one thing, that it may be better to check 'cs_nsec' 
> > > instead of 'wd_nsec', as some watchdog may have small wrap-around
> > > value. IIRC, HPET's counter is 32 bits long and wraps at about
> > > 300 seconds, and PMTIMER's counter is 24 bits which wraps at about
> > > 3 ~ 4 seconds. So when a long stall of the watchdog timer happens,
> > > the watchdog's value could 'overflow' many times.
> > > 
> > > And usually the 'current' closcksource has longer wrap time than
> > > the watchdog.
> > 
> > Why not both?
> 
> You mean checking both clocksource and the watchdog? It's fine for
> me, though I still trust clocksource more.

OK, good, I will check both.  You never know what future hardware
might bring.

I also reversed the order of the checks, so that it now checks for heavy
load before too-short interval.  The purpose is to automatically avoid
being fooled by clock wrap.

> I checked some old emails and found some long stall logs for reference.
> 
> * one stall of 471 seconds
> 
>  [ 2410.694068] clocksource: timekeeping watchdog on CPU262: Marking clocksource 'tsc' as unstable because the skew is too large:
>  [ 2410.706920] clocksource:                       'hpet' wd_nsec: 0 wd_now: ffd70be2 wd_last: 40da633b mask: ffffffff
>  [ 2410.718583] clocksource:                       'tsc' cs_nsec: 471766594285 cs_now: 44f62c184e9 cs_last: 394a7a43771 mask: ffffffffffffffff
>  [ 2410.732568] clocksource:                       'tsc' is current clocksource.
>  [ 2410.740553] tsc: Marking TSC unstable due to clocksource watchdog
>  [ 2410.747611] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
>  [ 2410.757321] sched_clock: Marking unstable (2398804490960, 11943006672)<-(2419023952548, -8276474713)
>  [ 2410.767741] clocksource: Checking clocksource tsc synchronization from CPU 233 to CPUs 0,73,93-94,226,454,602,821.
>  [ 2410.784045] clocksource: Switched to clocksource hpet
> 
> 
> * another one of 5 seconds
> 
>  [ 3302.211708] clocksource: timekeeping watchdog on CPU9: Marking clocksource 'tsc' as unstable because the skew is too large:
>  [ 3302.211710] clocksource:                       'acpi_pm' wd_nsec: 312227950 wd_now: 92367f wd_last: 8128bd mask: ffffff
>  [ 3302.211712] clocksource:                       'tsc' cs_nsec: 4999196389 cs_now: 9e811223a9754 cs_last: 9e80e767df194 mask: ffffffffffffffff
>  [ 3302.211714] clocksource:                       'tsc' is current clocksource.
>  [ 3302.211716] tsc: Marking TSC unstable due to clocksource watchdog

Very good, thank you!  I believe that both of these would be handled
by the updated commit (see below for the update).

> >  		if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
> > 
> > > > -			/* This can happen on busy systems, which can delay the watchdog. */
> > > > -			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > +			bool needwarn = false;
> > > > +			u64 wd_lb;
> > > > +
> > > > +			cs->wd_bogus_count++;
> > > > +			if (!cs->wd_bogus_shift) {
> > > > +				needwarn = true;
> > > > +			} else {
> > > > +				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> > > > +				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> > > > +				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> > > > +					needwarn = true;
> > > 
> > > I'm not sure if we need to check the last_bogus counter, or just
> > > the current interval 'cs_nsec' is what we care, and some code
> > > like this ?
> > 
> > I thought we wanted exponential backoff?  Do you really get that from
> > the changes below?
> 
> Aha, I misunderstood your words. I thought to only report one time for
> each 2, 4, 8, ... 256 seconds stall, and after that only report stall
> of 512+ seconds. So your approach looks good to me, as our intention is
> to avoid the flood of warning message.

Sounds good, thank you!

Please see below for a patch to be squashed into the original.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit eaee921daa7091f0eb731c9217ccc638ed5f8baf
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Tue Nov 1 12:02:18 2022 -0700

    squash! clocksource: Exponential backoff for load-induced bogus watchdog reads
    
    [ paulmck: Apply Feng Tang feedback. ]
    
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 6537ffa02e445..de8047b6720f5 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -442,12 +442,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 
 		/* Check for bogus measurements. */
 		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
-		if (wd_nsec < (wdi >> 2)) {
-			/* This usually indicates broken timer code or hardware. */
-			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
-			continue;
-		}
-		if (wd_nsec > (wdi << 2)) {
+		if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
 			bool needwarn = false;
 			u64 wd_lb;
 
@@ -470,6 +465,12 @@ static void clocksource_watchdog(struct timer_list *unused)
 			}
 			continue;
 		}
+		/* Check too-short measurements second to handle wrap. */
+		if (wd_nsec < (wdi >> 2) || cs_nsec < (wdi >> 2)) {
+			/* This usually indicates broken timer code or hardware. */
+			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
+			continue;
+		}
 
 		/* Check the deviation from the watchdog clocksource. */
 		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
  
Feng Tang Nov. 2, 2022, 2:58 a.m. UTC | #12
On Tue, Nov 01, 2022 at 12:06:27PM -0700, Paul E. McKenney wrote:
> On Tue, Nov 01, 2022 at 01:43:32PM +0800, Feng Tang wrote:
> > On Mon, Oct 31, 2022 at 10:42:12AM -0700, Paul E. McKenney wrote:
> > 
> > [...]
> > > > > @@ -448,8 +448,26 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > >  			continue;
> > > > >  		}
> > > > >  		if (wd_nsec > (wdi << 2)) {
> > > > 
> > > > Just recalled one thing, that it may be better to check 'cs_nsec' 
> > > > instead of 'wd_nsec', as some watchdog may have small wrap-around
> > > > value. IIRC, HPET's counter is 32 bits long and wraps at about
> > > > 300 seconds, and PMTIMER's counter is 24 bits which wraps at about
> > > > 3 ~ 4 seconds. So when a long stall of the watchdog timer happens,
> > > > the watchdog's value could 'overflow' many times.
> > > > 
> > > > And usually the 'current' closcksource has longer wrap time than
> > > > the watchdog.
> > > 
> > > Why not both?
> > 
> > You mean checking both clocksource and the watchdog? It's fine for
> > me, though I still trust clocksource more.
> 
> OK, good, I will check both.  You never know what future hardware
> might bring.

Makes sense to me.

> I also reversed the order of the checks, so that it now checks for heavy
> load before too-short interval.  The purpose is to automatically avoid
> being fooled by clock wrap.
> 
> > I checked some old emails and found some long stall logs for reference.
> > 
> > * one stall of 471 seconds
> > 
> >  [ 2410.694068] clocksource: timekeeping watchdog on CPU262: Marking clocksource 'tsc' as unstable because the skew is too large:
> >  [ 2410.706920] clocksource:                       'hpet' wd_nsec: 0 wd_now: ffd70be2 wd_last: 40da633b mask: ffffffff
> >  [ 2410.718583] clocksource:                       'tsc' cs_nsec: 471766594285 cs_now: 44f62c184e9 cs_last: 394a7a43771 mask: ffffffffffffffff
> >  [ 2410.732568] clocksource:                       'tsc' is current clocksource.
> >  [ 2410.740553] tsc: Marking TSC unstable due to clocksource watchdog
> >  [ 2410.747611] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> >  [ 2410.757321] sched_clock: Marking unstable (2398804490960, 11943006672)<-(2419023952548, -8276474713)
> >  [ 2410.767741] clocksource: Checking clocksource tsc synchronization from CPU 233 to CPUs 0,73,93-94,226,454,602,821.
> >  [ 2410.784045] clocksource: Switched to clocksource hpet
> > 
> > 
> > * another one of 5 seconds
> > 
> >  [ 3302.211708] clocksource: timekeeping watchdog on CPU9: Marking clocksource 'tsc' as unstable because the skew is too large:
> >  [ 3302.211710] clocksource:                       'acpi_pm' wd_nsec: 312227950 wd_now: 92367f wd_last: 8128bd mask: ffffff
> >  [ 3302.211712] clocksource:                       'tsc' cs_nsec: 4999196389 cs_now: 9e811223a9754 cs_last: 9e80e767df194 mask: ffffffffffffffff
> >  [ 3302.211714] clocksource:                       'tsc' is current clocksource.
> >  [ 3302.211716] tsc: Marking TSC unstable due to clocksource watchdog
> 
> Very good, thank you!  I believe that both of these would be handled
> by the updated commit (see below for the update).

Yes, I think so too.

> 
> > >  		if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
> > > 
> > > > > -			/* This can happen on busy systems, which can delay the watchdog. */
> > > > > -			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> > > > > +			bool needwarn = false;
> > > > > +			u64 wd_lb;
> > > > > +
> > > > > +			cs->wd_bogus_count++;
> > > > > +			if (!cs->wd_bogus_shift) {
> > > > > +				needwarn = true;
> > > > > +			} else {
> > > > > +				delta = clocksource_delta(wdnow, cs->wd_last_bogus, watchdog->mask);
> > > > > +				wd_lb = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift);
> > > > > +				if ((1 << cs->wd_bogus_shift) * wdi <= wd_lb)
> > > > > +					needwarn = true;
> > > > 
> > > > I'm not sure if we need to check the last_bogus counter, or just
> > > > the current interval 'cs_nsec' is what we care, and some code
> > > > like this ?
> > > 
> > > I thought we wanted exponential backoff?  Do you really get that from
> > > the changes below?
> > 
> > Aha, I misunderstood your words. I thought to only report one time for
> > each 2, 4, 8, ... 256 seconds stall, and after that only report stall
> > of 512+ seconds. So your approach looks good to me, as our intention is
> > to avoid the flood of warning message.
> 
> Sounds good, thank you!
> 
> Please see below for a patch to be squashed into the original.
> 
> Thoughts?

It looks good to me, thanks!

- Feng

>
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit eaee921daa7091f0eb731c9217ccc638ed5f8baf
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Tue Nov 1 12:02:18 2022 -0700
> 
>     squash! clocksource: Exponential backoff for load-induced bogus watchdog reads
>     
>     [ paulmck: Apply Feng Tang feedback. ]
>     
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 6537ffa02e445..de8047b6720f5 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -442,12 +442,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>  
>  		/* Check for bogus measurements. */
>  		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
> -		if (wd_nsec < (wdi >> 2)) {
> -			/* This usually indicates broken timer code or hardware. */
> -			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> -			continue;
> -		}
> -		if (wd_nsec > (wdi << 2)) {
> +		if (wd_nsec > (wdi << 2) || cs_nsec > (wdi << 2)) {
>  			bool needwarn = false;
>  			u64 wd_lb;
>  
> @@ -470,6 +465,12 @@ static void clocksource_watchdog(struct timer_list *unused)
>  			}
>  			continue;
>  		}
> +		/* Check too-short measurements second to handle wrap. */
> +		if (wd_nsec < (wdi >> 2) || cs_nsec < (wdi >> 2)) {
> +			/* This usually indicates broken timer code or hardware. */
> +			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
> +			continue;
> +		}
>  
>  		/* Check the deviation from the watchdog clocksource. */
>  		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
  

Patch

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 8058bec87acee..dcaf38c062161 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -386,7 +386,7 @@  EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
 
 static void clocksource_watchdog(struct timer_list *unused)
 {
-	u64 csnow, wdnow, cslast, wdlast, delta;
+	u64 csnow, wdnow, cslast, wdlast, delta, wdi;
 	int next_cpu, reset_pending;
 	int64_t wd_nsec, cs_nsec;
 	struct clocksource *cs;
@@ -440,6 +440,17 @@  static void clocksource_watchdog(struct timer_list *unused)
 		if (atomic_read(&watchdog_reset_pending))
 			continue;
 
+		/* Check for bogus measurements. */
+		wdi = jiffies_to_nsecs(WATCHDOG_INTERVAL);
+		if (wd_nsec < (wdi >> 2)) {
+			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced only %lld ns during %d-jiffy time interval, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
+			continue;
+		}
+		if (wd_nsec > (wdi << 2)) {
+			pr_warn("timekeeping watchdog on CPU%d: Watchdog clocksource '%s' advanced an excessive %lld ns during %d-jiffy time interval, probable CPU overutilization, skipping watchdog check.\n", smp_processor_id(), watchdog->name, wd_nsec, WATCHDOG_INTERVAL);
+			continue;
+		}
+
 		/* Check the deviation from the watchdog clocksource. */
 		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
 		if (abs(cs_nsec - wd_nsec) > md) {