clocksource: Skip watchdog check for large watchdog intervals

Message ID: 20240103112113.GA6108@incl
Series: clocksource: Skip watchdog check for large watchdog intervals
Commit Message

Jiri Wiesner Jan. 3, 2024, 11:21 a.m. UTC
  There have been reports of the watchdog marking clocksources unstable on
machines with 8 NUMA nodes:
> clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large:
> clocksource:   'hpet' wd_nsec: 14523447520 wd_now: 5a749706 wd_last: 45adf1e0 mask: ffffffff
> clocksource:   'tsc' cs_nsec: 14524115132 cs_now: 515ce2c5a96caa cs_last: 515cd9a9d83918 mask: ffffffffffffffff
> clocksource:   'tsc' is current clocksource.
> tsc: Marking TSC unstable due to clocksource watchdog
> TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> sched_clock: Marking unstable (1950347883333462, 79649632569)<-(1950428279338308, -745776594)
> clocksource: Checking clocksource tsc synchronization from CPU 400 to CPUs 0,46,52,54,138,208,392,397.
> clocksource: Switched to clocksource hpet

The measured clocksource skew - the absolute difference between cs_nsec
and wd_nsec - was 668 microseconds:
> cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612

The kernel (based on 5.14.21) used 200 microseconds for the
uncertainty_margin of both the clocksource and watchdog, resulting in a
threshold of 400 microseconds.  The discrepancy is that the measured
clocksource skew was evaluated against a threshold suited for watchdog
intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
Both the cs_nsec and the wd_nsec value indicate that the actual watchdog
interval was circa 14.5 seconds. Since the watchdog is executed in softirq
context the expiration of the watchdog timer can get severely delayed on
account of a ksoftirqd thread not getting to run in a timely manner.
Surely, a system with such belated softirq execution is not working well
and the scheduling issue should be looked into, but the clocksource
watchdog should be able to deal with it accordingly.

The solution in this patch skips the current watchdog check if the
watchdog interval exceeds 1.5 * WATCHDOG_INTERVAL. Considering the maximum
watchdog interval of 1.5 * WATCHDOG_INTERVAL, the current default
uncertainty margin (of the TSC and HPET) corresponds to a limit on
clocksource skew of 333 ppm (microseconds of skew per second). To keep the
limit imposed by NTP (500 microseconds of skew per second) for all
possible watchdog intervals, the margins would have to be scaled so that
the threshold value is proportional to the length of the actual watchdog
interval.

Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
Suggested-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
---
 kernel/time/clocksource.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)
  

Comments

Paul E. McKenney Jan. 3, 2024, 10:08 p.m. UTC | #1
On Wed, Jan 03, 2024 at 12:21:13PM +0100, Jiri Wiesner wrote:
> There have been reports of the watchdog marking clocksources unstable on
> machines with 8 NUMA nodes:
> > clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large:
> > clocksource:   'hpet' wd_nsec: 14523447520 wd_now: 5a749706 wd_last: 45adf1e0 mask: ffffffff
> > clocksource:   'tsc' cs_nsec: 14524115132 cs_now: 515ce2c5a96caa cs_last: 515cd9a9d83918 mask: ffffffffffffffff
> > clocksource:   'tsc' is current clocksource.
> > tsc: Marking TSC unstable due to clocksource watchdog
> > TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > sched_clock: Marking unstable (1950347883333462, 79649632569)<-(1950428279338308, -745776594)
> > clocksource: Checking clocksource tsc synchronization from CPU 400 to CPUs 0,46,52,54,138,208,392,397.
> > clocksource: Switched to clocksource hpet
> 
> The measured clocksource skew - the absolute difference between cs_nsec
> and wd_nsec - was 668 microseconds:
> > cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
> 
> The kernel (based on 5.14.21) used 200 microseconds for the
> uncertainty_margin of both the clocksource and watchdog, resulting in a
> threshold of 400 microseconds.  The discrepancy is that the measured
> clocksource skew was evaluated against a threshold suited for watchdog
> intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
> Both the cs_nsec and the wd_nsec value indicate that the actual watchdog
> interval was circa 14.5 seconds. Since the watchdog is executed in softirq
> context the expiration of the watchdog timer can get severely delayed on
> account of a ksoftirqd thread not getting to run in a timely manner.
> Surely, a system with such belated softirq execution is not working well
> and the scheduling issue should be looked into but the clocksource
> watchdog should be able to deal with it accordingly.
> 
> The solution in this patch skips the current watchdog check if the
> watchdog interval exceeds 1.5 * WATCHDOG_INTERVAL. Considering the maximum
> watchdog interval of 1.5 * WATCHDOG_INTERVAL, the current default
> uncertainty margin (of the TSC and HPET) corresponds to a limit on
> clocksource skew of 333 ppm (microseconds of skew per second). To keep the
> limit imposed by NTP (500 microseconds of skew per second) for all
> possible watchdog intervals, the margins would have to be scaled so that
> the threshold value is proportional to the length of the actual watchdog
> interval.
> 
> Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
> Suggested-by: Feng Tang <feng.tang@intel.com>
> Signed-off-by: Jiri Wiesner <jwiesner@suse.de>

I believe that there were concerns about a similar approach in the case
where the jiffies counter is the clocksource, but in the meantime:

Tested-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  kernel/time/clocksource.c | 28 ++++++++++++++++++++++++++--
>  1 file changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c108ed8a9804..ac5cb0ff278b 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -98,7 +98,9 @@ static u64 suspend_start;
>  /*
>   * Interval: 0.5sec.
>   */
> -#define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_INTERVAL	(HZ >> 1)
> +#define WATCHDOG_INTR_MAX_NS	((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
> +				 * NSEC_PER_SEC / HZ)
>  
>  /*
>   * Threshold: 0.0312s, when doubled: 0.0625s.
> @@ -134,6 +136,7 @@ static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
>  static DEFINE_SPINLOCK(watchdog_lock);
>  static int watchdog_running;
>  static atomic_t watchdog_reset_pending;
> +static int64_t watchdog_max_intr;
>  
>  static inline void clocksource_watchdog_lock(unsigned long *flags)
>  {
> @@ -400,7 +403,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>  {
>  	u64 csnow, wdnow, cslast, wdlast, delta;
>  	int next_cpu, reset_pending;
> -	int64_t wd_nsec, cs_nsec;
> +	int64_t wd_nsec, cs_nsec, interval;
>  	struct clocksource *cs;
>  	enum wd_read_status read_ret;
>  	unsigned long extra_wait = 0;
> @@ -470,6 +473,27 @@ static void clocksource_watchdog(struct timer_list *unused)
>  		if (atomic_read(&watchdog_reset_pending))
>  			continue;
>  
> +		/*
> +		 * The processing of timer softirqs can get delayed (usually
> +		 * on account of ksoftirqd not getting to run in a timely
> +		 * manner), which causes the watchdog interval to stretch.
> +		 * Some clocksources, e.g. acpi_pm, cannot tolerate
> +		 * watchdog intervals longer than a few seconds.
> +		 * Skew detection may fail for longer watchdog intervals
> +		 * on account of fixed margins being used.
> +		 */
> +		interval = max(cs_nsec, wd_nsec);
> +		if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
> +			if (system_state > SYSTEM_SCHEDULING &&
> +			    interval > 2 * watchdog_max_intr) {
> +				watchdog_max_intr = interval;
> +				pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
> +					cs_nsec, wd_nsec);
> +			}
> +			watchdog_timer.expires = jiffies;
> +			continue;
> +		}
> +
>  		/* Check the deviation from the watchdog clocksource. */
>  		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>  		if (abs(cs_nsec - wd_nsec) > md) {
> -- 
> 2.35.3
> 
> 
> -- 
> Jiri Wiesner
> SUSE Labs
  
kernel test robot Jan. 4, 2024, 12:46 a.m. UTC | #2
Hi Jiri,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/timers/core]
[also build test WARNING on linus/master v6.7-rc8 next-20240103]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jiri-Wiesner/clocksource-Skip-watchdog-check-for-large-watchdog-intervals/20240103-192257
base:   tip/timers/core
patch link:    https://lore.kernel.org/r/20240103112113.GA6108%40incl
patch subject: [PATCH] clocksource: Skip watchdog check for large watchdog intervals
config: i386-buildonly-randconfig-001-20240104 (https://download.01.org/0day-ci/archive/20240104/202401040845.lHig8Ei0-lkp@intel.com/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240104/202401040845.lHig8Ei0-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401040845.lHig8Ei0-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from include/linux/dev_printk.h:14:0,
                    from include/linux/device.h:15,
                    from kernel/time/clocksource.c:10:
   kernel/time/clocksource.c: In function 'clocksource_watchdog':
>> kernel/time/clocksource.c:103:6: warning: integer overflow in expression [-Woverflow]
         * NSEC_PER_SEC / HZ)
         ^
   include/linux/compiler.h:77:42: note: in definition of macro 'unlikely'
    # define unlikely(x) __builtin_expect(!!(x), 0)
                                             ^
   kernel/time/clocksource.c:486:27: note: in expansion of macro 'WATCHDOG_INTR_MAX_NS'
      if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
                              ^~~~~~~~~~~~~~~~~~~~


vim +103 kernel/time/clocksource.c

    97	
    98	/*
    99	 * Interval: 0.5sec.
   100	 */
   101	#define WATCHDOG_INTERVAL	(HZ >> 1)
   102	#define WATCHDOG_INTR_MAX_NS	((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
 > 103					 * NSEC_PER_SEC / HZ)
   104
  
kernel test robot Jan. 4, 2024, 12:57 a.m. UTC | #3
Hi Jiri,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/timers/core]
[also build test WARNING on linus/master v6.7-rc8 next-20240103]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jiri-Wiesner/clocksource-Skip-watchdog-check-for-large-watchdog-intervals/20240103-192257
base:   tip/timers/core
patch link:    https://lore.kernel.org/r/20240103112113.GA6108%40incl
patch subject: [PATCH] clocksource: Skip watchdog check for large watchdog intervals
config: i386-randconfig-r071-20240104 (https://download.01.org/0day-ci/archive/20240104/202401040837.eAbo53gs-lkp@intel.com/config)
compiler: ClangBuiltLinux clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240104/202401040837.eAbo53gs-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401040837.eAbo53gs-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/time/clocksource.c:486:27: warning: overflow in expression; result is -1978561024 with type 'long' [-Winteger-overflow]
     486 |                 if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
         |                                         ^
   kernel/time/clocksource.c:103:6: note: expanded from macro 'WATCHDOG_INTR_MAX_NS'
     103 |                                  * NSEC_PER_SEC / HZ)
         |                                  ^
   1 warning generated.


vim +/long +486 kernel/time/clocksource.c

   400	
   401	
   402	static void clocksource_watchdog(struct timer_list *unused)
   403	{
   404		u64 csnow, wdnow, cslast, wdlast, delta;
   405		int next_cpu, reset_pending;
   406		int64_t wd_nsec, cs_nsec, interval;
   407		struct clocksource *cs;
   408		enum wd_read_status read_ret;
   409		unsigned long extra_wait = 0;
   410		u32 md;
   411	
   412		spin_lock(&watchdog_lock);
   413		if (!watchdog_running)
   414			goto out;
   415	
   416		reset_pending = atomic_read(&watchdog_reset_pending);
   417	
   418		list_for_each_entry(cs, &watchdog_list, wd_list) {
   419	
   420			/* Clocksource already marked unstable? */
   421			if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
   422				if (finished_booting)
   423					schedule_work(&watchdog_work);
   424				continue;
   425			}
   426	
   427			read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
   428	
   429			if (read_ret == WD_READ_UNSTABLE) {
   430				/* Clock readout unreliable, so give it up. */
   431				__clocksource_unstable(cs);
   432				continue;
   433			}
   434	
   435			/*
   436			 * When WD_READ_SKIP is returned, it means the system is likely
   437			 * under very heavy load, where the latency of reading
   438			 * watchdog/clocksource is very big, and affect the accuracy of
   439			 * watchdog check. So give system some space and suspend the
   440			 * watchdog check for 5 minutes.
   441			 */
   442			if (read_ret == WD_READ_SKIP) {
   443				/*
   444				 * As the watchdog timer will be suspended, and
   445				 * cs->last could keep unchanged for 5 minutes, reset
   446				 * the counters.
   447				 */
   448				clocksource_reset_watchdog();
   449				extra_wait = HZ * 300;
   450				break;
   451			}
   452	
   453			/* Clocksource initialized ? */
   454			if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
   455			    atomic_read(&watchdog_reset_pending)) {
   456				cs->flags |= CLOCK_SOURCE_WATCHDOG;
   457				cs->wd_last = wdnow;
   458				cs->cs_last = csnow;
   459				continue;
   460			}
   461	
   462			delta = clocksource_delta(wdnow, cs->wd_last, watchdog->mask);
   463			wd_nsec = clocksource_cyc2ns(delta, watchdog->mult,
   464						     watchdog->shift);
   465	
   466			delta = clocksource_delta(csnow, cs->cs_last, cs->mask);
   467			cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
   468			wdlast = cs->wd_last; /* save these in case we print them */
   469			cslast = cs->cs_last;
   470			cs->cs_last = csnow;
   471			cs->wd_last = wdnow;
   472	
   473			if (atomic_read(&watchdog_reset_pending))
   474				continue;
   475	
   476			/*
   477			 * The processing of timer softirqs can get delayed (usually
   478			 * on account of ksoftirqd not getting to run in a timely
   479			 * manner), which causes the watchdog interval to stretch.
   480			 * Some clocksources, e.g. acpi_pm, cannot tolerate
   481			 * watchdog intervals longer than a few seconds.
   482			 * Skew detection may fail for longer watchdog intervals
   483			 * on account of fixed margins being used.
   484			 */
   485			interval = max(cs_nsec, wd_nsec);
 > 486			if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
   487				if (system_state > SYSTEM_SCHEDULING &&
   488				    interval > 2 * watchdog_max_intr) {
   489					watchdog_max_intr = interval;
   490					pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
   491						cs_nsec, wd_nsec);
   492				}
   493				watchdog_timer.expires = jiffies;
   494				continue;
   495			}
   496	
   497			/* Check the deviation from the watchdog clocksource. */
   498			md = cs->uncertainty_margin + watchdog->uncertainty_margin;
   499			if (abs(cs_nsec - wd_nsec) > md) {
   500				s64 cs_wd_msec;
   501				s64 wd_msec;
   502				u32 wd_rem;
   503	
   504				pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
   505					smp_processor_id(), cs->name);
   506				pr_warn("                      '%s' wd_nsec: %lld wd_now: %llx wd_last: %llx mask: %llx\n",
   507					watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask);
   508				pr_warn("                      '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n",
   509					cs->name, cs_nsec, csnow, cslast, cs->mask);
   510				cs_wd_msec = div_s64_rem(cs_nsec - wd_nsec, 1000 * 1000, &wd_rem);
   511				wd_msec = div_s64_rem(wd_nsec, 1000 * 1000, &wd_rem);
   512				pr_warn("                      Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n",
   513					cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec);
   514				if (curr_clocksource == cs)
   515					pr_warn("                      '%s' is current clocksource.\n", cs->name);
   516				else if (curr_clocksource)
   517					pr_warn("                      '%s' (not '%s') is current clocksource.\n", curr_clocksource->name, cs->name);
   518				else
   519					pr_warn("                      No current clocksource.\n");
   520				__clocksource_unstable(cs);
   521				continue;
   522			}
   523	
   524			if (cs == curr_clocksource && cs->tick_stable)
   525				cs->tick_stable(cs);
   526	
   527			if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) &&
   528			    (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) &&
   529			    (watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) {
   530				/* Mark it valid for high-res. */
   531				cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
   532	
   533				/*
   534				 * clocksource_done_booting() will sort it if
   535				 * finished_booting is not set yet.
   536				 */
   537				if (!finished_booting)
   538					continue;
   539	
   540				/*
   541				 * If this is not the current clocksource let
   542				 * the watchdog thread reselect it. Due to the
   543				 * change to high res this clocksource might
   544				 * be preferred now. If it is the current
   545				 * clocksource let the tick code know about
   546				 * that change.
   547				 */
   548				if (cs != curr_clocksource) {
   549					cs->flags |= CLOCK_SOURCE_RESELECT;
   550					schedule_work(&watchdog_work);
   551				} else {
   552					tick_clock_notify();
   553				}
   554			}
   555		}
   556	
   557		/*
   558		 * We only clear the watchdog_reset_pending, when we did a
   559		 * full cycle through all clocksources.
   560		 */
   561		if (reset_pending)
   562			atomic_dec(&watchdog_reset_pending);
   563	
   564		/*
   565		 * Cycle through CPUs to check if the CPUs stay synchronized
   566		 * to each other.
   567		 */
   568		next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
   569		if (next_cpu >= nr_cpu_ids)
   570			next_cpu = cpumask_first(cpu_online_mask);
   571	
   572		/*
   573		 * Arm timer if not already pending: could race with concurrent
   574		 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
   575		 */
   576		if (!timer_pending(&watchdog_timer)) {
   577			watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
   578			add_timer_on(&watchdog_timer, next_cpu);
   579		}
   580	out:
   581		spin_unlock(&watchdog_lock);
   582	}
   583
  
Feng Tang Jan. 4, 2024, 5:55 a.m. UTC | #4
On Wed, Jan 03, 2024 at 12:21:13PM +0100, Jiri Wiesner wrote:
> There have been reports of the watchdog marking clocksources unstable on
> machines with 8 NUMA nodes:
> > clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large:
> > clocksource:   'hpet' wd_nsec: 14523447520 wd_now: 5a749706 wd_last: 45adf1e0 mask: ffffffff
> > clocksource:   'tsc' cs_nsec: 14524115132 cs_now: 515ce2c5a96caa cs_last: 515cd9a9d83918 mask: ffffffffffffffff
> > clocksource:   'tsc' is current clocksource.
> > tsc: Marking TSC unstable due to clocksource watchdog
> > TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > sched_clock: Marking unstable (1950347883333462, 79649632569)<-(1950428279338308, -745776594)
> > clocksource: Checking clocksource tsc synchronization from CPU 400 to CPUs 0,46,52,54,138,208,392,397.
> > clocksource: Switched to clocksource hpet
> 
> The measured clocksource skew - the absolute difference between cs_nsec
> and wd_nsec - was 668 microseconds:
> > cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
> 
> The kernel (based on 5.14.21) used 200 microseconds for the
> uncertainty_margin of both the clocksource and watchdog, resulting in a
> threshold of 400 microseconds.  The discrepancy is that the measured
> clocksource skew was evaluated against a threshold suited for watchdog
> intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
> Both the cs_nsec and the wd_nsec value indicate that the actual watchdog
> interval was circa 14.5 seconds. Since the watchdog is executed in softirq
> context the expiration of the watchdog timer can get severely delayed on
> account of a ksoftirqd thread not getting to run in a timely manner.
> Surely, a system with such belated softirq execution is not working well
> and the scheduling issue should be looked into but the clocksource
> watchdog should be able to deal with it accordingly.
> 
> The solution in this patch skips the current watchdog check if the
> watchdog interval exceeds 1.5 * WATCHDOG_INTERVAL. Considering the maximum
> watchdog interval of 1.5 * WATCHDOG_INTERVAL, the current default
> uncertainty margin (of the TSC and HPET) corresponds to a limit on
> clocksource skew of 333 ppm (microseconds of skew per second). To keep the
> limit imposed by NTP (500 microseconds of skew per second) for all
> possible watchdog intervals, the margins would have to be scaled so that
> the threshold value is proportional to the length of the actual watchdog
> interval.
> 
> Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")

IIUC, the issue could happen even before this commit? Though I do think 
this is good stuff for stable kernel.

> Suggested-by: Feng Tang <feng.tang@intel.com>
> Signed-off-by: Jiri Wiesner <jwiesner@suse.de>

0Day robot reported some compile issues. Other than that, it looks
good to me, thanks!

Reviewed-by: Feng Tang <feng.tang@intel.com>

>
> ---
>  kernel/time/clocksource.c | 28 ++++++++++++++++++++++++++--
>  1 file changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c108ed8a9804..ac5cb0ff278b 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -98,7 +98,9 @@ static u64 suspend_start;
>  /*
>   * Interval: 0.5sec.
>   */
> -#define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_INTERVAL	(HZ >> 1)
> +#define WATCHDOG_INTR_MAX_NS	((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
> +				 * NSEC_PER_SEC / HZ)
>  
>  /*
>   * Threshold: 0.0312s, when doubled: 0.0625s.
> @@ -134,6 +136,7 @@ static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
>  static DEFINE_SPINLOCK(watchdog_lock);
>  static int watchdog_running;
>  static atomic_t watchdog_reset_pending;
> +static int64_t watchdog_max_intr;
>  
>  static inline void clocksource_watchdog_lock(unsigned long *flags)
>  {
> @@ -400,7 +403,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>  {
>  	u64 csnow, wdnow, cslast, wdlast, delta;
>  	int next_cpu, reset_pending;
> -	int64_t wd_nsec, cs_nsec;
> +	int64_t wd_nsec, cs_nsec, interval;
>  	struct clocksource *cs;
>  	enum wd_read_status read_ret;
>  	unsigned long extra_wait = 0;
> @@ -470,6 +473,27 @@ static void clocksource_watchdog(struct timer_list *unused)
>  		if (atomic_read(&watchdog_reset_pending))
>  			continue;
>  
> +		/*
> +		 * The processing of timer softirqs can get delayed (usually
> +		 * on account of ksoftirqd not getting to run in a timely
> +		 * manner), which causes the watchdog interval to stretch.
> +		 * Some clocksources, e.g. acpi_pm, cannot tolerate
> +		 * watchdog intervals longer than a few seconds.
> +		 * Skew detection may fail for longer watchdog intervals
> +		 * on account of fixed margins being used.
> +		 */
> +		interval = max(cs_nsec, wd_nsec);
> +		if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
> +			if (system_state > SYSTEM_SCHEDULING &&
> +			    interval > 2 * watchdog_max_intr) {
> +				watchdog_max_intr = interval;
> +				pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
> +					cs_nsec, wd_nsec);
> +			}
> +			watchdog_timer.expires = jiffies;
> +			continue;
> +		}
> +
>  		/* Check the deviation from the watchdog clocksource. */
>  		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>  		if (abs(cs_nsec - wd_nsec) > md) {
> -- 
> 2.35.3
> 
> 
> -- 
> Jiri Wiesner
> SUSE Labs
  
Jiri Wiesner Jan. 4, 2024, 4:30 p.m. UTC | #5
On Wed, Jan 03, 2024 at 02:08:08PM -0800, Paul E. McKenney wrote:
> I believe that there were concerns about a similar approach in the case
> where the jiffies counter is the clocksource

I ran a few simple tests on a 2 NUMA node Intel machine and found nothing 
so far. I tried booting with clocksource=jiffies and I changed the 
"nr_online_nodes <= 4" check in tsc_clocksource_as_watchdog() to enable 
the watchdog on my machine. I have a debugging module that monitors 
clocksource and watchdog reads in clocksource_watchdog() with kprobes. I 
see the cs/wd reads executed roughly every 0.5 second, as expected. When 
the machine is idle the average watchdog interval is 501.61 milliseconds 
(+-15.57 ms, with a minimum of 477.07 ms and a maximum of 517.93 ms). The 
result is similar when the CPUs of the machine are fully saturated with 
netperf processes. I also tried booting with clocksource=jiffies and 
tsc=watchdog. The watchdog interval was similar to the previous test.

AFAIK, the jiffies clocksource does get checked by the watchdog itself. 
And with that, I have run out of ideas.
  
Jiri Wiesner Jan. 4, 2024, 4:48 p.m. UTC | #6
On Thu, Jan 04, 2024 at 01:55:21PM +0800, Feng Tang wrote:
> On Wed, Jan 03, 2024 at 12:21:13PM +0100, Jiri Wiesner wrote:
> > Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
> 
> IIUC, the issue could happen even before this commit? Though I do think 
> this is good stuff for stable kernel.

Yes, it could happen before 2e27e793e280 but the threshold was so loose 
that it probably was an extremely rare event. Commit 2e27e793e280 lowered 
the threshold and this significantly increased the likelihood of skew 
detection failing for large watchdog intervals. We got a report of a 
kernel with 2e27e793e280 switching to the HPET but there were no failures 
with a kernel without 2e27e793e280 on the same fleet of 8 NUMA node machines.

> 0Day robot reported some compiling issue.

I am going to submit a v2 of the patch fixing the error after we clarify 
the jiffies concern.
  
Paul E. McKenney Jan. 4, 2024, 7:19 p.m. UTC | #7
On Thu, Jan 04, 2024 at 05:30:50PM +0100, Jiri Wiesner wrote:
> On Wed, Jan 03, 2024 at 02:08:08PM -0800, Paul E. McKenney wrote:
> > I believe that there were concerns about a similar approach in the case
> > where the jiffies counter is the clocksource
> 
> I ran a few simple tests on a 2 NUMA node Intel machine and found nothing 
> so far. I tried booting with clocksource=jiffies and I changed the 
> "nr_online_nodes <= 4" check in tsc_clocksource_as_watchdog() to enable 
> the watchdog on my machine. I have a debugging module that monitors 
> clocksource and watchdog reads in clocksource_watchdog() with kprobes. I 
> see the cs/wd reads executed roughly every 0.5 second, as expected. When 
> the machine is idle the average watchdog interval is 501.61 milliseconds 
> (+-15.57 ms, with a minimum of 477.07 ms and a maximum of 517.93 ms). The 
> result is similar when the CPUs of the machine are fully saturated with 
> netperf processes. I also tried booting with clocksource=jiffies and 
> tsc=watchdog. The watchdog interval was similar to the previous test.
> 
> AFAIK, the jiffies clocksource does get checked by the watchdog itself. 
> And with that, I have run out of ideas.

If I recall correctly (ha!), the concern was that with the jiffies as
clocksource, we would be using jiffies (via timers) to check jiffies
(the clocksource), and that this could cause issues if the jiffies got
behind, then suddenly updated while the clocksource watchdog was running.

Thoughts?

							Thanx, Paul
  
Feng Tang Jan. 6, 2024, 2:55 a.m. UTC | #8
On Thu, Jan 04, 2024 at 11:19:56AM -0800, Paul E. McKenney wrote:
> On Thu, Jan 04, 2024 at 05:30:50PM +0100, Jiri Wiesner wrote:
> > On Wed, Jan 03, 2024 at 02:08:08PM -0800, Paul E. McKenney wrote:
> > > I believe that there were concerns about a similar approach in the case
> > > where the jiffies counter is the clocksource
> > 
> > I ran a few simple tests on a 2 NUMA node Intel machine and found nothing 
> > so far. I tried booting with clocksource=jiffies and I changed the 
> > "nr_online_nodes <= 4" check in tsc_clocksource_as_watchdog() to enable 
> > the watchdog on my machine. I have a debugging module that monitors 
> > clocksource and watchdog reads in clocksource_watchdog() with kprobes. I 
> > see the cs/wd reads executed roughly every 0.5 second, as expected. When 
> > the machine is idle the average watchdog interval is 501.61 milliseconds 
> > (+-15.57 ms, with a minimum of 477.07 ms and a maximum of 517.93 ms). The 
> > result is similar when the CPUs of the machine are fully saturated with 
> > netperf processes. I also tried booting with clocksource=jiffies and 
> > tsc=watchdog. The watchdog interval was similar to the previous test.
> > 
> > AFAIK, the jiffies clocksource does get checked by the watchdog itself. 
> > And with that, I have run out of ideas.
> 
> If I recall correctly (ha!), the concern was that with the jiffies as
> clocksource, we would be using jiffies (via timers) to check jiffies
> (the clocksource), and that this could cause issues if the jiffies got
> behind, then suddenly updated while the clocksource watchdog was running.

Yes, we also met a problem when 'jiffies' was used as clocksource/watchdog,
but I don't know if it's the same problem you mentioned. Our problem
('jiffies' as watchdog marking clocksource TSC as unstable) only happens
in the early boot phase with serial earlyprintk enabled: the updating
of 'jiffies' relies on the HW timer's periodic interrupt, but early printk
disables interrupts during printing and causes some timer interrupts to be
lost, and hence a big lag in 'jiffies'. Rui once proposed a patch to
prevent 'jiffies' from being a watchdog due to its unreliability [1].

And I think skipping the watchdog check one time when detecting some
abnormal condition won't hurt the overall check much.

[1]. https://lore.kernel.org/lkml/bd5b97f89ab2887543fc262348d1c7cafcaae536.camel@intel.com/

Thanks,
Feng

> Thoughts?
> 
> 							Thanx, Paul
  
Paul E. McKenney Jan. 6, 2024, 12:04 p.m. UTC | #9
On Sat, Jan 06, 2024 at 10:55:09AM +0800, Feng Tang wrote:
> On Thu, Jan 04, 2024 at 11:19:56AM -0800, Paul E. McKenney wrote:
> > On Thu, Jan 04, 2024 at 05:30:50PM +0100, Jiri Wiesner wrote:
> > > On Wed, Jan 03, 2024 at 02:08:08PM -0800, Paul E. McKenney wrote:
> > > > I believe that there were concerns about a similar approach in the case
> > > > where the jiffies counter is the clocksource
> > > 
> > > I ran a few simple tests on a 2 NUMA node Intel machine and found nothing 
> > > so far. I tried booting with clocksource=jiffies and I changed the 
> > > "nr_online_nodes <= 4" check in tsc_clocksource_as_watchdog() to enable 
> > > the watchdog on my machine. I have a debugging module that monitors 
> > > clocksource and watchdog reads in clocksource_watchdog() with kprobes. I 
> > > see the cs/wd reads executed roughly every 0.5 second, as expected. When 
> > > the machine is idle the average watchdog interval is 501.61 milliseconds 
> > > (+-15.57 ms, with a minimum of 477.07 ms and a maximum of 517.93 ms). The 
> > > result is similar when the CPUs of the machine are fully saturated with 
> > > netperf processes. I also tried booting with clocksource=jiffies and 
> > > tsc=watchdog. The watchdog interval was similar to the previous test.
> > > 
> > > AFAIK, the jiffies clocksource does get checked by the watchdog itself. 
> > > And with that, I have run out of ideas.
> > 
> > If I recall correctly (ha!), the concern was that with the jiffies as
> > clocksource, we would be using jiffies (via timers) to check jiffies
> > (the clocksource), and that this could cause issues if the jiffies got
> > behind, then suddenly updated while the clocksource watchdog was running.
> 
> Yes, we also hit a problem when 'jiffies' was used as the
> clocksource/watchdog, but I don't know whether it is the same problem
> you mentioned. Our problem ('jiffies' as the watchdog marking the TSC
> clocksource unstable) only happens in the early boot phase with serial
> earlyprintk enabled: the updating of 'jiffies' relies on the HW timer's
> periodic interrupt, but early printk disables interrupts during
> printing, causing some timer interrupts to be lost and hence a big lag
> in 'jiffies'. Rui once proposed a patch to prevent 'jiffies' from being
> used as a watchdog due to its unreliability [1].
> 
> And I think skipping the watchdog check once when an abnormal condition
> is detected won't hurt the overall checking much.

Works for me!

							Thanx, Paul

> [1]. https://lore.kernel.org/lkml/bd5b97f89ab2887543fc262348d1c7cafcaae536.camel@intel.com/
> 
> Thanks,
> Feng
> 
> > Thoughts?
> > 
> > 							Thanx, Paul
  
kernel test robot Jan. 8, 2024, 1:44 p.m. UTC | #10
Hello,

kernel test robot noticed "WARNING:at_kernel/time/clocksource-wdtest.c:#wdtest_func" on:

commit: ceacf0b27b5d9cdc574f8d0a0bcb11d3272e7b9f ("[PATCH] clocksource: Skip watchdog check for large watchdog intervals")
url: https://github.com/intel-lab-lkp/linux/commits/Jiri-Wiesner/clocksource-Skip-watchdog-check-for-large-watchdog-intervals/20240103-192257
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git da65f29dada7f7cbbf0d6375b88a0316f5f7d6f5
patch link: https://lore.kernel.org/all/20240103112113.GA6108@incl/
patch subject: [PATCH] clocksource: Skip watchdog check for large watchdog intervals

in testcase: boot

compiler: gcc-9
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+--------------------------------------------------------------------+------------+------------+
|                                                                    | da65f29dad | ceacf0b27b |
+--------------------------------------------------------------------+------------+------------+
| WARNING:at_kernel/time/clocksource-wdtest.c:#wdtest_func           | 0          | 14         |
| EIP:wdtest_func                                                    | 0          | 14         |
+--------------------------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202401082125.4ec42f71-oliver.sang@intel.com


[   54.778895][   T41] ------------[ cut here ]------------
[ 54.781279][ T41] WARNING: CPU: 0 PID: 41 at kernel/time/clocksource-wdtest.c:162 wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) 
[   54.785405][   T41] Modules linked in:
[   54.787152][   T41] CPU: 0 PID: 41 Comm: wdtest Tainted: G                 N 6.7.0-rc2-00014-gceacf0b27b5d #1 30e4ebd22e7da702dfdab1313ae74e5a246df970
[   54.791633][   T41] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 54.795124][ T41] EIP: wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) 
[ 54.796800][ T41] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b e9 ee fd ff ff 8d b4 26 00 00 00 00 66 90 0f 0b e9 08 fe ff ff 8d b4 26 00 00 00 00 66 90 <0f> 0b e9 76 ff ff ff cc cc cc cc cc cc cc cc cc 3e 8d 74 26 00 55
All code
========
   0:	ff 8d b4 26 00 00    	decl   0x26b4(%rbp)
   6:	00 00                	add    %al,(%rax)
   8:	66 90                	xchg   %ax,%ax
   a:	0f 0b                	ud2    
   c:	e9 ee fd ff ff       	jmpq   0xfffffffffffffdff
  11:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi
  18:	66 90                	xchg   %ax,%ax
  1a:	0f 0b                	ud2    
  1c:	e9 08 fe ff ff       	jmpq   0xfffffffffffffe29
  21:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi
  28:	66 90                	xchg   %ax,%ax
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	e9 76 ff ff ff       	jmpq   0xffffffffffffffa7
  31:	cc                   	int3   
  32:	cc                   	int3   
  33:	cc                   	int3   
  34:	cc                   	int3   
  35:	cc                   	int3   
  36:	cc                   	int3   
  37:	cc                   	int3   
  38:	cc                   	int3   
  39:	cc                   	int3   
  3a:	3e 8d 74 26 00       	lea    %ds:0x0(%rsi,%riz,1),%esi
  3f:	55                   	push   %rbp

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	e9 76 ff ff ff       	jmpq   0xffffffffffffff7d
   7:	cc                   	int3   
   8:	cc                   	int3   
   9:	cc                   	int3   
   a:	cc                   	int3   
   b:	cc                   	int3   
   c:	cc                   	int3   
   d:	cc                   	int3   
   e:	cc                   	int3   
   f:	cc                   	int3   
  10:	3e 8d 74 26 00       	lea    %ds:0x0(%rsi,%riz,1),%esi
  15:	55                   	push   %rbp
[   54.801794][   T41] EAX: c4a1fe8c EBX: 00000004 ECX: 00000000 EDX: 00000000
[   54.803569][   T41] ESI: c4a1ff2e EDI: 00000000 EBP: c74aff70 ESP: c74aff60
[   54.805314][   T41] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010246
[   54.807278][   T41] CR0: 80050033 CR2: ffda9000 CR3: 05979000 CR4: 000406d0
[   54.808856][   T41] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[   54.810216][   T41] DR6: fffe0ff0 DR7: 00000400
[   54.811218][   T41] Call Trace:
[ 54.811950][ T41] ? show_regs (arch/x86/kernel/dumpstack.c:479 arch/x86/kernel/dumpstack.c:465) 
[ 54.812840][ T41] ? wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) 
[ 54.813801][ T41] ? __warn (kernel/panic.c:677) 
[ 54.814643][ T41] ? wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) 
[ 54.815628][ T41] ? report_bug (lib/bug.c:201 lib/bug.c:219) 
[ 54.816581][ T41] ? exc_overflow (arch/x86/kernel/traps.c:250) 
[ 54.817506][ T41] ? handle_bug (arch/x86/kernel/traps.c:216) 
[ 54.818420][ T41] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1)) 
[ 54.819425][ T41] ? handle_exception (arch/x86/entry/entry_32.S:1056) 
[ 54.820438][ T41] ? exc_overflow (arch/x86/kernel/traps.c:250) 
[ 54.821374][ T41] ? wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) 
[ 54.822318][ T41] ? exc_overflow (arch/x86/kernel/traps.c:250) 
[ 54.823284][ T41] ? wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) 
[ 54.824248][ T41] kthread (kernel/kthread.c:388) 
[ 54.825093][ T41] ? wdtest_ktime_read (kernel/time/clocksource-wdtest.c:105) 
[ 54.826132][ T41] ? kthread_complete_and_exit (kernel/kthread.c:341) 
[ 54.827307][ T41] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 54.828219][ T41] ? kthread_complete_and_exit (kernel/kthread.c:341) 
[ 54.829331][ T41] ret_from_fork_asm (arch/x86/entry/entry_32.S:741) 
[ 54.830310][ T41] entry_INT80_32 (arch/x86/entry/entry_32.S:947) 
[   54.831287][   T41] irq event stamp: 879
[ 54.832154][ T41] hardirqs last enabled at (889): console_unlock (kernel/printk/printk.c:341 kernel/printk/printk.c:2706 kernel/printk/printk.c:3038) 
[ 54.833838][ T41] hardirqs last disabled at (898): console_unlock (kernel/printk/printk.c:339 kernel/printk/printk.c:2706 kernel/printk/printk.c:3038) 
[ 54.835567][ T41] softirqs last enabled at (918): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582) 
[ 54.837246][ T41] softirqs last disabled at (907): do_softirq_own_stack (arch/x86/kernel/irq_32.c:57 arch/x86/kernel/irq_32.c:147) 
[   54.839049][   T41] ---[ end trace 0000000000000000 ]---
[   54.840178][   T41] clocksource: Not enough CPUs to check clocksource 'wdtest-ktime'.
[   54.841774][   T41] clocksource_wdtest: --- Done with test.



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240108/202401082125.4ec42f71-oliver.sang@intel.com
  
Jiri Wiesner Jan. 10, 2024, 6:36 p.m. UTC | #11
On Mon, Jan 08, 2024 at 09:44:31PM +0800, kernel test robot wrote:
> commit: ceacf0b27b5d9cdc574f8d0a0bcb11d3272e7b9f ("[PATCH] clocksource: Skip watchdog check for large watchdog intervals")
> url: https://github.com/intel-lab-lkp/linux/commits/Jiri-Wiesner/clocksource-Skip-watchdog-check-for-large-watchdog-intervals/20240103-192257
> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git da65f29dada7f7cbbf0d6375b88a0316f5f7d6f5
> patch link: https://lore.kernel.org/all/20240103112113.GA6108@incl/
> patch subject: [PATCH] clocksource: Skip watchdog check for large watchdog intervals
> 
> [   54.778895][   T41] ------------[ cut here ]------------
> [ 54.781279][ T41] WARNING: CPU: 0 PID: 41 at kernel/time/clocksource-wdtest.c:162 wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) 
> [   54.785405][   T41] Modules linked in:
> [   54.787152][   T41] CPU: 0 PID: 41 Comm: wdtest Tainted: G                 N 6.7.0-rc2-00014-gceacf0b27b5d #1 30e4ebd22e7da702dfdab1313ae74e5a246df970
> [   54.791633][   T41] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 54.795124][ T41] EIP: wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) 

The code:
        pr_info("--- Watchdog clock-value-fuzz error injection, expect clock skew and per-CPU mismatches.\n");
        WRITE_ONCE(wdtest_ktime_read_fuzz, true);
        schedule_timeout_uninterruptible(2 * HZ);
        WARN_ON_ONCE(!(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));

The warning is a consequence of the watchdog not running at all on i386
builds of the kernel. The watchdog does not run because the
"interval > WATCHDOG_INTR_MAX_NS" check is always true: on i386, long is
only 32 bits wide, so the multiplication in WATCHDOG_INTR_MAX_NS
overflows and the constant evaluates to a negative number.

> [   54.801794][   T41] EAX: c4a1fe8c EBX: 00000004 ECX: 00000000 EDX: 00000000
> [   54.803569][   T41] ESI: c4a1ff2e EDI: 00000000 EBP: c74aff70 ESP: c74aff60
> [   54.805314][   T41] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010246
> [   54.807278][   T41] CR0: 80050033 CR2: ffda9000 CR3: 05979000 CR4: 000406d0
> [   54.808856][   T41] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> [   54.810216][   T41] DR6: fffe0ff0 DR7: 00000400
> [   54.811218][   T41] Call Trace:
  

Patch

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c108ed8a9804..ac5cb0ff278b 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -98,7 +98,9 @@  static u64 suspend_start;
 /*
  * Interval: 0.5sec.
  */
-#define WATCHDOG_INTERVAL (HZ >> 1)
+#define WATCHDOG_INTERVAL	(HZ >> 1)
+#define WATCHDOG_INTR_MAX_NS	((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
+				 * NSEC_PER_SEC / HZ)
 
 /*
  * Threshold: 0.0312s, when doubled: 0.0625s.
@@ -134,6 +136,7 @@  static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
 static DEFINE_SPINLOCK(watchdog_lock);
 static int watchdog_running;
 static atomic_t watchdog_reset_pending;
+static int64_t watchdog_max_intr;
 
 static inline void clocksource_watchdog_lock(unsigned long *flags)
 {
@@ -400,7 +403,7 @@  static void clocksource_watchdog(struct timer_list *unused)
 {
 	u64 csnow, wdnow, cslast, wdlast, delta;
 	int next_cpu, reset_pending;
-	int64_t wd_nsec, cs_nsec;
+	int64_t wd_nsec, cs_nsec, interval;
 	struct clocksource *cs;
 	enum wd_read_status read_ret;
 	unsigned long extra_wait = 0;
@@ -470,6 +473,27 @@  static void clocksource_watchdog(struct timer_list *unused)
 		if (atomic_read(&watchdog_reset_pending))
 			continue;
 
+		/*
+		 * The processing of timer softirqs can get delayed (usually
+		 * on account of ksoftirqd not getting to run in a timely
+		 * manner), which causes the watchdog interval to stretch.
+		 * Some clocksources, e.g. acpi_pm, cannot tolerate
+		 * watchdog intervals longer than a few seconds.
+		 * Skew detection may fail for longer watchdog intervals
+		 * on account of fixed margins being used.
+		 */
+		interval = max(cs_nsec, wd_nsec);
+		if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
+			if (system_state > SYSTEM_SCHEDULING &&
+			    interval > 2 * watchdog_max_intr) {
+				watchdog_max_intr = interval;
+				pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
+					cs_nsec, wd_nsec);
+			}
+			watchdog_timer.expires = jiffies;
+			continue;
+		}
+
 		/* Check the deviation from the watchdog clocksource. */
 		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
 		if (abs(cs_nsec - wd_nsec) > md) {