Message ID | 20240103112113.GA6108@incl |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel+bounces-15439-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7301:6f82:b0:100:9c79:88ff with SMTP id tb2csp4956183dyb; Wed, 3 Jan 2024 03:21:36 -0800 (PST) X-Google-Smtp-Source: AGHT+IHPKDvVC1g1sH02Mzotnrxv75Ks+4aNMXwSzdHta9AOPPkEwQGD3QMwBAiHtlCXVMlbyKwy X-Received: by 2002:aa7:8dd7:0:b0:6d9:8ab2:c853 with SMTP id j23-20020aa78dd7000000b006d98ab2c853mr18639042pfr.12.1704280895971; Wed, 03 Jan 2024 03:21:35 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1704280895; cv=none; d=google.com; s=arc-20160816; b=qMrm/qDoqvh9nyXWPeEopTVf2ttoo7ZssUP1EWhF2mCvwZTCyU7YVctK3m3UYmF13j /4TVC+mrVHXPb/eU9You926uXnAXDi2gOS7sUnLhChHfh4qTVG9ntouVABMFiKVEGQFh lTTAvw9ZNu3KMJdFV8yGOAmfElkEJELnXwmLx/i37pzKaieaF0AJagtvoqwQa+6fDG8P bSq9ii4c37lcgZP7M7fhe9rYCfbsEwh9t66obXsFsJuzqZl3fJdobjTNQMDVv7hzuKeI CXqeBHVHS2S4RBRRkZ0e+5BhPojLSqoL50+aT8eBrlQH+VOr8oFrCV8Bo10ew42Uhh4g 4BDw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=user-agent:content-disposition:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:message-id:subject:cc:to:from :date:dkim-signature:dkim-signature:dkim-signature:dkim-signature; bh=NHB4j9wlHDrYyqT2ikxPmdHQfGYnNUKv98b9mA6DHZM=; fh=K9eOPowSNCHNEltiQEZKVlmTFxutM7whsYPWsjDosEc=; b=0qt4O6KBJzzHqNCjCbnmEZdKWHeB4zMdsXZc74V3WPk2WHJRYbxMPdNtfp0p9Fu/Dg jnpPjFrwTTMyuPLnGy2D12H/YENHfN1BfnprEiN/T8Rc4e+lpCxzgnHlp9KD/1Cv1wXi hYiXjZO7cu4NkxfNRbGIpZ/uFgfT+SHv/6r28bgAucf2zzIVqxKzZrx5VraPg1/rQ1qz FoYd+KuVzqm79kvJyBqPUQm1gimfK4qX6nN6vOJax461pppVqFKlfx4kjwchV6jckvlI JZ99vJOP+rMuudPhw0DFdQVxYO0x7zirLYl/2DRnt9gQPttMvMmK4fGYM2bP1FmOPj6g P9gQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@suse.de header.s=susede2_rsa header.b="Xe/QT+2J"; dkim=neutral (no key) header.i=@suse.de header.s=susede2_ed25519; dkim=pass header.i=@suse.de header.s=susede2_rsa header.b=tMWEzmJS; dkim=neutral (no key) header.i=@suse.de header.s=susede2_ed25519; spf=pass (google.com: domain of linux-kernel+bounces-15439-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-15439-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=suse.de Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [2604:1380:45e3:2400::1]) by mx.google.com with ESMTPS id k1-20020aa788c1000000b006d9b9ca7ffasi15266861pff.79.2024.01.03.03.21.35 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 03 Jan 2024 03:21:35 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-15439-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) client-ip=2604:1380:45e3:2400::1; Authentication-Results: mx.google.com; dkim=pass header.i=@suse.de header.s=susede2_rsa header.b="Xe/QT+2J"; dkim=neutral (no key) header.i=@suse.de header.s=susede2_ed25519; dkim=pass header.i=@suse.de header.s=susede2_rsa header.b=tMWEzmJS; dkim=neutral (no key) header.i=@suse.de header.s=susede2_ed25519; spf=pass (google.com: domain of linux-kernel+bounces-15439-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-15439-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=suse.de Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id B9B51282585 for <ouuuleilei@gmail.com>; Wed, 3 Jan 2024 11:21:35 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 5D0BD18E36; Wed, 3 Jan 2024 11:21:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=suse.de header.i=@suse.de header.b="Xe/QT+2J"; dkim=permerror (0-bit key) header.d=suse.de header.i=@suse.de header.b="HExykre3"; dkim=pass (1024-bit key) header.d=suse.de header.i=@suse.de header.b="tMWEzmJS"; dkim=permerror (0-bit key) header.d=suse.de header.i=@suse.de header.b="TJqKF1cV" X-Original-To: linux-kernel@vger.kernel.org Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.223.130]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E923418E15 for <linux-kernel@vger.kernel.org>; Wed, 3 Jan 2024 11:21:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=suse.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=suse.de Received: from imap1.dmz-prg2.suse.org (imap1.dmz-prg2.suse.org [IPv6:2a07:de40:b281:104:10:150:64:97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id C36CE21F5E; Wed, 3 Jan 2024 11:21:14 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1704280875; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type; bh=NHB4j9wlHDrYyqT2ikxPmdHQfGYnNUKv98b9mA6DHZM=; b=Xe/QT+2JCCbI8yPbqXzgunQq2X3E0dtypQMaD5cVm8O8Ai5eQoSfApEWmxdnywqJvuykVo IwkutEsjaBwyQHHPDhcMTXf4REgEtY9RdqXHOwaYjQxaoEK+YfUfHfD441xGfnHcDSOFxi 2lJecVINk8UwOOFBNmdOGX5uTOzosmw= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1704280875; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type; bh=NHB4j9wlHDrYyqT2ikxPmdHQfGYnNUKv98b9mA6DHZM=; b=HExykre3r2nf6w/oGKa01rzU9P8W5hj2+dWHlXMRnbopvWXgUagO60EWxfaTJxshV40R1X DRN91xHx8qlLrsBA== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1704280874; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type; bh=NHB4j9wlHDrYyqT2ikxPmdHQfGYnNUKv98b9mA6DHZM=; b=tMWEzmJSfW52E5llmjoOygMAXyU09LuvgTtVegHo0XlcNZLQ6/3lhF/Mi1i4t50Lab3gg+ 42lYQX55m19IfbfuQdfW9Gqd57wkkCwxWQqfPEt8GwMm2+idxnIpzi6+dYIIceeWSThIhx ALj/VI8oDs4YzrDghcZ5/79LPGYDXV4= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1704280874; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type; bh=NHB4j9wlHDrYyqT2ikxPmdHQfGYnNUKv98b9mA6DHZM=; b=TJqKF1cVc4ZSpxLReDXq13dQPiEwPH0nQeZyd843RMNo3L7roG4SHtSiVPm3yR++TwEdRM 7f6GiA4pJgnLjYDw== Received: from imap1.dmz-prg2.suse.org (localhost [127.0.0.1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by imap1.dmz-prg2.suse.org (Postfix) with ESMTPS id AB5DE13AA6; Wed, 3 Jan 2024 11:21:14 +0000 (UTC) Received: from dovecot-director2.suse.de ([2a07:de40:b281:106:10:150:64:167]) by imap1.dmz-prg2.suse.org with ESMTPSA id GXIXKipDlWV8EAAAD6G6ig (envelope-from <jwiesner@suse.de>); Wed, 03 Jan 2024 11:21:14 +0000 Received: by incl.suse.cz (Postfix, from userid 1000) id 0819A9BCA0; Wed, 3 Jan 2024 12:21:14 +0100 (CET) Date: Wed, 3 Jan 2024 12:21:13 +0100 From: Jiri Wiesner <jwiesner@suse.de> To: linux-kernel@vger.kernel.org Cc: John Stultz <jstultz@google.com>, Thomas Gleixner <tglx@linutronix.de>, Stephen Boyd <sboyd@kernel.org>, "Paul E. McKenney" <paulmck@kernel.org>, Feng Tang <feng.tang@intel.com> Subject: [PATCH] clocksource: Skip watchdog check for large watchdog intervals Message-ID: <20240103112113.GA6108@incl> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.10.1 (2018-07-13) X-Spam-Level: ***** X-Spam-Level: Authentication-Results: smtp-out1.suse.de; dkim=pass header.d=suse.de header.s=susede2_rsa header.b=tMWEzmJS; dkim=pass header.d=suse.de header.s=susede2_ed25519 header.b=TJqKF1cV X-Rspamd-Server: rspamd2.dmz-prg2.suse.org X-Spamd-Result: default: False [-2.51 / 50.00]; ARC_NA(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; R_DKIM_ALLOW(-0.20)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519]; SPAMHAUS_XBL(0.00)[2a07:de40:b281:104:10:150:64:97:from]; FROM_HAS_DN(0.00)[]; TO_DN_SOME(0.00)[]; TO_MATCH_ENVRCPT_ALL(0.00)[]; RECEIVED_SPAMHAUS_BLOCKED_OPENRESOLVER(0.00)[2a07:de40:b281:106:10:150:64:167:received]; MIME_GOOD(-0.10)[text/plain]; RCVD_DKIM_ARC_DNSWL_HI(-1.00)[]; NEURAL_HAM_LONG(-1.00)[-1.000]; RCPT_COUNT_FIVE(0.00)[6]; NEURAL_HAM_SHORT(-0.20)[-1.000]; RCVD_COUNT_THREE(0.00)[3]; DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519]; DKIM_TRACE(0.00)[suse.de:+]; MX_GOOD(-0.01)[]; DBL_BLOCKED_OPENRESOLVER(0.00)[intel.com:email,suse.de:dkim,suse.de:email]; FUZZY_BLOCKED(0.00)[rspamd.com]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; MID_RHS_NOT_FQDN(0.50)[]; RCVD_IN_DNSWL_HI(-0.50)[2a07:de40:b281:104:10:150:64:97:from]; RCVD_TLS_ALL(0.00)[]; RBL_SPAMHAUS_BLOCKED_OPENRESOLVER(0.00)[2a07:de40:b281:104:10:150:64:97:from] X-Spam-Score: -2.51 X-Rspamd-Queue-Id: C36CE21F5E X-Spam-Flag: NO X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1787068044492614507 X-GMAIL-MSGID: 1787068044492614507 |
Series |
clocksource: Skip watchdog check for large watchdog intervals
|
|
Commit Message
Jiri Wiesner
Jan. 3, 2024, 11:21 a.m. UTC
There have been reports of the watchdog marking clocksources unstable on machines with 8 NUMA nodes: > clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large: > clocksource: 'hpet' wd_nsec: 14523447520 wd_now: 5a749706 wd_last: 45adf1e0 mask: ffffffff > clocksource: 'tsc' cs_nsec: 14524115132 cs_now: 515ce2c5a96caa cs_last: 515cd9a9d83918 mask: ffffffffffffffff > clocksource: 'tsc' is current clocksource. > tsc: Marking TSC unstable due to clocksource watchdog > TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. > sched_clock: Marking unstable (1950347883333462, 79649632569)<-(1950428279338308, -745776594) > clocksource: Checking clocksource tsc synchronization from CPU 400 to CPUs 0,46,52,54,138,208,392,397. > clocksource: Switched to clocksource hpet The measured clocksource skew - the absolute difference between cs_nsec and wd_nsec - was 668 microseconds: > cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612 The kernel (based on 5.14.21) used 200 microseconds for the uncertainty_margin of both the clocksource and watchdog, resulting in a threshold of 400 microseconds. The discrepancy is that the measured clocksource skew was evaluated against a threshold suited for watchdog intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second. Both the cs_nsec and the wd_nsec value indicate that the actual watchdog interval was circa 14.5 seconds. Since the watchdog is executed in softirq context the expiration of the watchdog timer can get severely delayed on account of a ksoftirqd thread not getting to run in a timely manner. Surely, a system with such belated softirq execution is not working well and the scheduling issue should be looked into but the clocksource watchdog should be able to deal with it accordingly. The solution in this patch skips the current watchdog check if the watchdog interval exceeds 1.5 * WATCHDOG_INTERVAL. Considering the maximum watchdog interval of 1.5 * WATCHDOG_INTERVAL, the current default uncertainty margin (of the TSC and HPET) corresponds to a limit on clocksource skew of 333 ppm (microseconds of skew per second). To keep the limit imposed by NTP (500 microseconds of skew per second) for all possible watchdog intervals, the margins would have to be scaled so that the threshold value is proportional to the length of the actual watchdog interval. Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold") Suggested-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Jiri Wiesner <jwiesner@suse.de> --- kernel/time/clocksource.c | 28 ++++++++++++++++++++++++++-- 1 file changed, 26 insertions(+), 2 deletions(-)
Comments
On Wed, Jan 03, 2024 at 12:21:13PM +0100, Jiri Wiesner wrote: > There have been reports of the watchdog marking clocksources unstable on > machines with 8 NUMA nodes: > > clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large: > > clocksource: 'hpet' wd_nsec: 14523447520 wd_now: 5a749706 wd_last: 45adf1e0 mask: ffffffff > > clocksource: 'tsc' cs_nsec: 14524115132 cs_now: 515ce2c5a96caa cs_last: 515cd9a9d83918 mask: ffffffffffffffff > > clocksource: 'tsc' is current clocksource. > > tsc: Marking TSC unstable due to clocksource watchdog > > TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. > > sched_clock: Marking unstable (1950347883333462, 79649632569)<-(1950428279338308, -745776594) > > clocksource: Checking clocksource tsc synchronization from CPU 400 to CPUs 0,46,52,54,138,208,392,397. > > clocksource: Switched to clocksource hpet > > The measured clocksource skew - the absolute difference between cs_nsec > and wd_nsec - was 668 microseconds: > > cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612 > > The kernel (based on 5.14.21) used 200 microseconds for the > uncertainty_margin of both the clocksource and watchdog, resulting in a > threshold of 400 microseconds. The discrepancy is that the measured > clocksource skew was evaluated against a threshold suited for watchdog > intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second. > Both the cs_nsec and the wd_nsec value indicate that the actual watchdog > interval was circa 14.5 seconds. Since the watchdog is executed in softirq > context the expiration of the watchdog timer can get severely delayed on > account of a ksoftirqd thread not getting to run in a timely manner. > Surely, a system with such belated softirq execution is not working well > and the scheduling issue should be looked into but the clocksource > watchdog should be able to deal with it accordingly. > > The solution in this patch skips the current watchdog check if the > watchdog interval exceeds 1.5 * WATCHDOG_INTERVAL. Considering the maximum > watchdog interval of 1.5 * WATCHDOG_INTERVAL, the current default > uncertainty margin (of the TSC and HPET) corresponds to a limit on > clocksource skew of 333 ppm (microseconds of skew per second). To keep the > limit imposed by NTP (500 microseconds of skew per second) for all > possible watchdog intervals, the margins would have to be scaled so that > the threshold value is proportional to the length of the actual watchdog > interval. > > Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold") > Suggested-by: Feng Tang <feng.tang@intel.com> > Signed-off-by: Jiri Wiesner <jwiesner@suse.de> I believe that there were concerns about a similar approach in the case where the jiffies counter is the clocksource, but in the meantime: Tested-by: Paul E. McKenney <paulmck@kernel.org> > --- > kernel/time/clocksource.c | 28 ++++++++++++++++++++++++++-- > 1 file changed, 26 insertions(+), 2 deletions(-) > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > index c108ed8a9804..ac5cb0ff278b 100644 > --- a/kernel/time/clocksource.c > +++ b/kernel/time/clocksource.c > @@ -98,7 +98,9 @@ static u64 suspend_start; > /* > * Interval: 0.5sec. > */ > -#define WATCHDOG_INTERVAL (HZ >> 1) > +#define WATCHDOG_INTERVAL (HZ >> 1) > +#define WATCHDOG_INTR_MAX_NS ((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\ > + * NSEC_PER_SEC / HZ) > > /* > * Threshold: 0.0312s, when doubled: 0.0625s. > @@ -134,6 +136,7 @@ static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); > static DEFINE_SPINLOCK(watchdog_lock); > static int watchdog_running; > static atomic_t watchdog_reset_pending; > +static int64_t watchdog_max_intr; > > static inline void clocksource_watchdog_lock(unsigned long *flags) > { > @@ -400,7 +403,7 @@ static void clocksource_watchdog(struct timer_list *unused) > { > u64 csnow, wdnow, cslast, wdlast, delta; > int next_cpu, reset_pending; > - int64_t wd_nsec, cs_nsec; > + int64_t wd_nsec, cs_nsec, interval; > struct clocksource *cs; > enum wd_read_status read_ret; > unsigned long extra_wait = 0; > @@ -470,6 +473,27 @@ static void clocksource_watchdog(struct timer_list *unused) > if (atomic_read(&watchdog_reset_pending)) > continue; > > + /* > + * The processing of timer softirqs can get delayed (usually > + * on account of ksoftirqd not getting to run in a timely > + * manner), which causes the watchdog interval to stretch. > + * Some clocksources, e.g. acpi_pm, cannot tolerate > + * watchdog intervals longer than a few seconds. > + * Skew detection may fail for longer watchdog intervals > + * on account of fixed margins being used. > + */ > + interval = max(cs_nsec, wd_nsec); > + if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) { > + if (system_state > SYSTEM_SCHEDULING && > + interval > 2 * watchdog_max_intr) { > + watchdog_max_intr = interval; > + pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n", > + cs_nsec, wd_nsec); > + } > + watchdog_timer.expires = jiffies; > + continue; > + } > + > /* Check the deviation from the watchdog clocksource. */ > md = cs->uncertainty_margin + watchdog->uncertainty_margin; > if (abs(cs_nsec - wd_nsec) > md) { > -- > 2.35.3 > > > -- > Jiri Wiesner > SUSE Labs
Hi Jiri, kernel test robot noticed the following build warnings: [auto build test WARNING on tip/timers/core] [also build test WARNING on linus/master v6.7-rc8 next-20240103] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Jiri-Wiesner/clocksource-Skip-watchdog-check-for-large-watchdog-intervals/20240103-192257 base: tip/timers/core patch link: https://lore.kernel.org/r/20240103112113.GA6108%40incl patch subject: [PATCH] clocksource: Skip watchdog check for large watchdog intervals config: i386-buildonly-randconfig-001-20240104 (https://download.01.org/0day-ci/archive/20240104/202401040845.lHig8Ei0-lkp@intel.com/config) compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240104/202401040845.lHig8Ei0-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202401040845.lHig8Ei0-lkp@intel.com/ All warnings (new ones prefixed by >>): In file included from include/linux/dev_printk.h:14:0, from include/linux/device.h:15, from kernel/time/clocksource.c:10: kernel/time/clocksource.c: In function 'clocksource_watchdog': >> kernel/time/clocksource.c:103:6: warning: integer overflow in expression [-Woverflow] * NSEC_PER_SEC / HZ) ^ include/linux/compiler.h:77:42: note: in definition of macro 'unlikely' # define unlikely(x) __builtin_expect(!!(x), 0) ^ kernel/time/clocksource.c:486:27: note: in expansion of macro 'WATCHDOG_INTR_MAX_NS' if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) { ^~~~~~~~~~~~~~~~~~~~ vim +103 kernel/time/clocksource.c 97 98 /* 99 * Interval: 0.5sec. 100 */ 101 #define WATCHDOG_INTERVAL (HZ >> 1) 102 #define WATCHDOG_INTR_MAX_NS ((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\ > 103 * NSEC_PER_SEC / HZ) 104
Hi Jiri, kernel test robot noticed the following build warnings: [auto build test WARNING on tip/timers/core] [also build test WARNING on linus/master v6.7-rc8 next-20240103] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Jiri-Wiesner/clocksource-Skip-watchdog-check-for-large-watchdog-intervals/20240103-192257 base: tip/timers/core patch link: https://lore.kernel.org/r/20240103112113.GA6108%40incl patch subject: [PATCH] clocksource: Skip watchdog check for large watchdog intervals config: i386-randconfig-r071-20240104 (https://download.01.org/0day-ci/archive/20240104/202401040837.eAbo53gs-lkp@intel.com/config) compiler: ClangBuiltLinux clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240104/202401040837.eAbo53gs-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202401040837.eAbo53gs-lkp@intel.com/ All warnings (new ones prefixed by >>): >> kernel/time/clocksource.c:486:27: warning: overflow in expression; result is -1978561024 with type 'long' [-Winteger-overflow] 486 | if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) { | ^ kernel/time/clocksource.c:103:6: note: expanded from macro 'WATCHDOG_INTR_MAX_NS' 103 | * NSEC_PER_SEC / HZ) | ^ 1 warning generated. vim +/long +486 kernel/time/clocksource.c 400 401 402 static void clocksource_watchdog(struct timer_list *unused) 403 { 404 u64 csnow, wdnow, cslast, wdlast, delta; 405 int next_cpu, reset_pending; 406 int64_t wd_nsec, cs_nsec, interval; 407 struct clocksource *cs; 408 enum wd_read_status read_ret; 409 unsigned long extra_wait = 0; 410 u32 md; 411 412 spin_lock(&watchdog_lock); 413 if (!watchdog_running) 414 goto out; 415 416 reset_pending = atomic_read(&watchdog_reset_pending); 417 418 list_for_each_entry(cs, &watchdog_list, wd_list) { 419 420 /* Clocksource already marked unstable? */ 421 if (cs->flags & CLOCK_SOURCE_UNSTABLE) { 422 if (finished_booting) 423 schedule_work(&watchdog_work); 424 continue; 425 } 426 427 read_ret = cs_watchdog_read(cs, &csnow, &wdnow); 428 429 if (read_ret == WD_READ_UNSTABLE) { 430 /* Clock readout unreliable, so give it up. */ 431 __clocksource_unstable(cs); 432 continue; 433 } 434 435 /* 436 * When WD_READ_SKIP is returned, it means the system is likely 437 * under very heavy load, where the latency of reading 438 * watchdog/clocksource is very big, and affect the accuracy of 439 * watchdog check. So give system some space and suspend the 440 * watchdog check for 5 minutes. 441 */ 442 if (read_ret == WD_READ_SKIP) { 443 /* 444 * As the watchdog timer will be suspended, and 445 * cs->last could keep unchanged for 5 minutes, reset 446 * the counters. 447 */ 448 clocksource_reset_watchdog(); 449 extra_wait = HZ * 300; 450 break; 451 } 452 453 /* Clocksource initialized ? */ 454 if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) || 455 atomic_read(&watchdog_reset_pending)) { 456 cs->flags |= CLOCK_SOURCE_WATCHDOG; 457 cs->wd_last = wdnow; 458 cs->cs_last = csnow; 459 continue; 460 } 461 462 delta = clocksource_delta(wdnow, cs->wd_last, watchdog->mask); 463 wd_nsec = clocksource_cyc2ns(delta, watchdog->mult, 464 watchdog->shift); 465 466 delta = clocksource_delta(csnow, cs->cs_last, cs->mask); 467 cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift); 468 wdlast = cs->wd_last; /* save these in case we print them */ 469 cslast = cs->cs_last; 470 cs->cs_last = csnow; 471 cs->wd_last = wdnow; 472 473 if (atomic_read(&watchdog_reset_pending)) 474 continue; 475 476 /* 477 * The processing of timer softirqs can get delayed (usually 478 * on account of ksoftirqd not getting to run in a timely 479 * manner), which causes the watchdog interval to stretch. 480 * Some clocksources, e.g. acpi_pm, cannot tolerate 481 * watchdog intervals longer than a few seconds. 482 * Skew detection may fail for longer watchdog intervals 483 * on account of fixed margins being used. 484 */ 485 interval = max(cs_nsec, wd_nsec); > 486 if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) { 487 if (system_state > SYSTEM_SCHEDULING && 488 interval > 2 * watchdog_max_intr) { 489 watchdog_max_intr = interval; 490 pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n", 491 cs_nsec, wd_nsec); 492 } 493 watchdog_timer.expires = jiffies; 494 continue; 495 } 496 497 /* Check the deviation from the watchdog clocksource. */ 498 md = cs->uncertainty_margin + watchdog->uncertainty_margin; 499 if (abs(cs_nsec - wd_nsec) > md) { 500 s64 cs_wd_msec; 501 s64 wd_msec; 502 u32 wd_rem; 503 504 pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n", 505 smp_processor_id(), cs->name); 506 pr_warn(" '%s' wd_nsec: %lld wd_now: %llx wd_last: %llx mask: %llx\n", 507 watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask); 508 pr_warn(" '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n", 509 cs->name, cs_nsec, csnow, cslast, cs->mask); 510 cs_wd_msec = div_s64_rem(cs_nsec - wd_nsec, 1000 * 1000, &wd_rem); 511 wd_msec = div_s64_rem(wd_nsec, 1000 * 1000, &wd_rem); 512 pr_warn(" Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n", 513 cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec); 514 if (curr_clocksource == cs) 515 pr_warn(" '%s' is current clocksource.\n", cs->name); 516 else if (curr_clocksource) 517 pr_warn(" '%s' (not '%s') is current clocksource.\n", curr_clocksource->name, cs->name); 518 else 519 pr_warn(" No current clocksource.\n"); 520 __clocksource_unstable(cs); 521 continue; 522 } 523 524 if (cs == curr_clocksource && cs->tick_stable) 525 cs->tick_stable(cs); 526 527 if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) && 528 (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) && 529 (watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) { 530 /* Mark it valid for high-res. */ 531 cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES; 532 533 /* 534 * clocksource_done_booting() will sort it if 535 * finished_booting is not set yet. 536 */ 537 if (!finished_booting) 538 continue; 539 540 /* 541 * If this is not the current clocksource let 542 * the watchdog thread reselect it. Due to the 543 * change to high res this clocksource might 544 * be preferred now. If it is the current 545 * clocksource let the tick code know about 546 * that change. 547 */ 548 if (cs != curr_clocksource) { 549 cs->flags |= CLOCK_SOURCE_RESELECT; 550 schedule_work(&watchdog_work); 551 } else { 552 tick_clock_notify(); 553 } 554 } 555 } 556 557 /* 558 * We only clear the watchdog_reset_pending, when we did a 559 * full cycle through all clocksources. 560 */ 561 if (reset_pending) 562 atomic_dec(&watchdog_reset_pending); 563 564 /* 565 * Cycle through CPUs to check if the CPUs stay synchronized 566 * to each other. 567 */ 568 next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask); 569 if (next_cpu >= nr_cpu_ids) 570 next_cpu = cpumask_first(cpu_online_mask); 571 572 /* 573 * Arm timer if not already pending: could race with concurrent 574 * pair clocksource_stop_watchdog() clocksource_start_watchdog(). 575 */ 576 if (!timer_pending(&watchdog_timer)) { 577 watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait; 578 add_timer_on(&watchdog_timer, next_cpu); 579 } 580 out: 581 spin_unlock(&watchdog_lock); 582 } 583
On Wed, Jan 03, 2024 at 12:21:13PM +0100, Jiri Wiesner wrote: > There have been reports of the watchdog marking clocksources unstable on > machines with 8 NUMA nodes: > > clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large: > > clocksource: 'hpet' wd_nsec: 14523447520 wd_now: 5a749706 wd_last: 45adf1e0 mask: ffffffff > > clocksource: 'tsc' cs_nsec: 14524115132 cs_now: 515ce2c5a96caa cs_last: 515cd9a9d83918 mask: ffffffffffffffff > > clocksource: 'tsc' is current clocksource. > > tsc: Marking TSC unstable due to clocksource watchdog > > TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. > > sched_clock: Marking unstable (1950347883333462, 79649632569)<-(1950428279338308, -745776594) > > clocksource: Checking clocksource tsc synchronization from CPU 400 to CPUs 0,46,52,54,138,208,392,397. > > clocksource: Switched to clocksource hpet > > The measured clocksource skew - the absolute difference between cs_nsec > and wd_nsec - was 668 microseconds: > > cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612 > > The kernel (based on 5.14.21) used 200 microseconds for the > uncertainty_margin of both the clocksource and watchdog, resulting in a > threshold of 400 microseconds. The discrepancy is that the measured > clocksource skew was evaluated against a threshold suited for watchdog > intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second. > Both the cs_nsec and the wd_nsec value indicate that the actual watchdog > interval was circa 14.5 seconds. Since the watchdog is executed in softirq > context the expiration of the watchdog timer can get severely delayed on > account of a ksoftirqd thread not getting to run in a timely manner. > Surely, a system with such belated softirq execution is not working well > and the scheduling issue should be looked into but the clocksource > watchdog should be able to deal with it accordingly. > > The solution in this patch skips the current watchdog check if the > watchdog interval exceeds 1.5 * WATCHDOG_INTERVAL. Considering the maximum > watchdog interval of 1.5 * WATCHDOG_INTERVAL, the current default > uncertainty margin (of the TSC and HPET) corresponds to a limit on > clocksource skew of 333 ppm (microseconds of skew per second). To keep the > limit imposed by NTP (500 microseconds of skew per second) for all > possible watchdog intervals, the margins would have to be scaled so that > the threshold value is proportional to the length of the actual watchdog > interval. > > Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold") IIUC, the issue could happen even before this commit? Though I do think this is good stuff for stable kernel. > Suggested-by: Feng Tang <feng.tang@intel.com> > Signed-off-by: Jiri Wiesner <jwiesner@suse.de> 0Day robot reported some compiling issue. Other than that, it looks good to me, thanks! Reviewed-by: Feng Tang <feng.tang@intel.com> > > --- > kernel/time/clocksource.c | 28 ++++++++++++++++++++++++++-- > 1 file changed, 26 insertions(+), 2 deletions(-) > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c > index c108ed8a9804..ac5cb0ff278b 100644 > --- a/kernel/time/clocksource.c > +++ b/kernel/time/clocksource.c > @@ -98,7 +98,9 @@ static u64 suspend_start; > /* > * Interval: 0.5sec. > */ > -#define WATCHDOG_INTERVAL (HZ >> 1) > +#define WATCHDOG_INTERVAL (HZ >> 1) > +#define WATCHDOG_INTR_MAX_NS ((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\ > + * NSEC_PER_SEC / HZ) > > /* > * Threshold: 0.0312s, when doubled: 0.0625s. > @@ -134,6 +136,7 @@ static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); > static DEFINE_SPINLOCK(watchdog_lock); > static int watchdog_running; > static atomic_t watchdog_reset_pending; > +static int64_t watchdog_max_intr; > > static inline void clocksource_watchdog_lock(unsigned long *flags) > { > @@ -400,7 +403,7 @@ static void clocksource_watchdog(struct timer_list *unused) > { > u64 csnow, wdnow, cslast, wdlast, delta; > int next_cpu, reset_pending; > - int64_t wd_nsec, cs_nsec; > + int64_t wd_nsec, cs_nsec, interval; > struct clocksource *cs; > enum wd_read_status read_ret; > unsigned long extra_wait = 0; > @@ -470,6 +473,27 @@ static void clocksource_watchdog(struct timer_list *unused) > if (atomic_read(&watchdog_reset_pending)) > continue; > > + /* > + * The processing of timer softirqs can get delayed (usually > + * on account of ksoftirqd not getting to run in a timely > + * manner), which causes the watchdog interval to stretch. > + * Some clocksources, e.g. acpi_pm, cannot tolerate > + * watchdog intervals longer than a few seconds. > + * Skew detection may fail for longer watchdog intervals > + * on account of fixed margins being used. > + */ > + interval = max(cs_nsec, wd_nsec); > + if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) { > + if (system_state > SYSTEM_SCHEDULING && > + interval > 2 * watchdog_max_intr) { > + watchdog_max_intr = interval; > + pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n", > + cs_nsec, wd_nsec); > + } > + watchdog_timer.expires = jiffies; > + continue; > + } > + > /* Check the deviation from the watchdog clocksource. */ > md = cs->uncertainty_margin + watchdog->uncertainty_margin; > if (abs(cs_nsec - wd_nsec) > md) { > -- > 2.35.3 > > > -- > Jiri Wiesner > SUSE Labs
On Wed, Jan 03, 2024 at 02:08:08PM -0800, Paul E. McKenney wrote: > I believe that there were concerns about a similar approach in the case > where the jiffies counter is the clocksource I ran a few simple tests on a 2 NUMA node Intel machine and found nothing so far. I tried booting with clocksource=jiffies and I changed the "nr_online_nodes <= 4" check in tsc_clocksource_as_watchdog() to enable the watchdog on my machine. I have a debugging module that monitors clocksource and watchdog reads in clocksource_watchdog() with kprobes. I see the cs/wd reads executed roughly every 0.5 second, as expected. When the machine is idle the average watchdog interval is 501.61 milliseconds (+-15.57 ms, with a minimum of 477.07 ms and a maximum of 517.93 ms). The result is similar when the CPUs of the machine are fully saturated with netperf processes. I also tried booting with clocksource=jiffies and tsc=watchdog. The watchdog interval was similar to the previous test. AFAIK, the jiffies clocksource does get checked by the watchdog itself. And with that, I have run out of ideas.
On Thu, Jan 04, 2024 at 01:55:21PM +0800, Feng Tang wrote: > On Wed, Jan 03, 2024 at 12:21:13PM +0100, Jiri Wiesner wrote: > > Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold") > > IIUC, the issue could happen even before this commit? Though I do think > this is good stuff for stable kernel. Yes, it could happen before 2e27e793e280 but the threshold was so loose that it probably was an extremely rare event. Commit 2e27e793e280 lowered the threshold and this significantly increased the likelihood of skew detection failing for large watchdog intervals. We got a report of a kernel with 2e27e793e280 switching to the HPET but there were no failures with a kernel without 2e27e793e280 on the same fleet of 8 NUMA node machines. > 0Day robot reported some compiling issue. I am going to submit a v2 of the patch fixing the error after we clarify the jiffies concern.
On Thu, Jan 04, 2024 at 05:30:50PM +0100, Jiri Wiesner wrote: > On Wed, Jan 03, 2024 at 02:08:08PM -0800, Paul E. McKenney wrote: > > I believe that there were concerns about a similar approach in the case > > where the jiffies counter is the clocksource > > I ran a few simple tests on a 2 NUMA node Intel machine and found nothing > so far. I tried booting with clocksource=jiffies and I changed the > "nr_online_nodes <= 4" check in tsc_clocksource_as_watchdog() to enable > the watchdog on my machine. I have a debugging module that monitors > clocksource and watchdog reads in clocksource_watchdog() with kprobes. I > see the cs/wd reads executed roughly every 0.5 second, as expected. When > the machine is idle the average watchdog interval is 501.61 milliseconds > (+-15.57 ms, with a minimum of 477.07 ms and a maximum of 517.93 ms). The > result is similar when the CPUs of the machine are fully saturated with > netperf processes. I also tried booting with clocksource=jiffies and > tsc=watchdog. The watchdog interval was similar to the previous test. > > AFAIK, the jiffies clocksource does get checked by the watchdog itself. > And with that, I have run out of ideas. If I recall correctly (ha!), the concern was that with the jiffies as clocksource, we would be using jiffies (via timers) to check jiffies (the clocksource), and that this could cause issues if the jiffies got behind, then suddenly updated while the clocksource watchdog was running. Thoughts? Thanx, Paul
On Thu, Jan 04, 2024 at 11:19:56AM -0800, Paul E. McKenney wrote: > On Thu, Jan 04, 2024 at 05:30:50PM +0100, Jiri Wiesner wrote: > > On Wed, Jan 03, 2024 at 02:08:08PM -0800, Paul E. McKenney wrote: > > > I believe that there were concerns about a similar approach in the case > > > where the jiffies counter is the clocksource > > > > I ran a few simple tests on a 2 NUMA node Intel machine and found nothing > > so far. I tried booting with clocksource=jiffies and I changed the > > "nr_online_nodes <= 4" check in tsc_clocksource_as_watchdog() to enable > > the watchdog on my machine. I have a debugging module that monitors > > clocksource and watchdog reads in clocksource_watchdog() with kprobes. I > > see the cs/wd reads executed roughly every 0.5 second, as expected. When > > the machine is idle the average watchdog interval is 501.61 milliseconds > > (+-15.57 ms, with a minimum of 477.07 ms and a maximum of 517.93 ms). The > > result is similar when the CPUs of the machine are fully saturated with > > netperf processes. I also tried booting with clocksource=jiffies and > > tsc=watchdog. The watchdog interval was similar to the previous test. > > > > AFAIK, the jiffies clocksource does get checked by the watchdog itself. > > And with that, I have run out of ideas. > > If I recall correctly (ha!), the concern was that with the jiffies as > clocksource, we would be using jiffies (via timers) to check jiffies > (the clocksource), and that this could cause issues if the jiffies got > behind, then suddenly updated while the clocksource watchdog was running. Yes, we also met problem when 'jiffies' was used as clocksource/watchdog, but don't know if it's the same problem you mentioned. Our problem ('jiffies' as watchdog marks clocksource TSC as unstable) only happens in early boot phase with serial earlyprintk enabled, that the updating of 'jiffies' relies on HW timer's periodic interrupt, but early printk will disable interrupt during printing and cause some timer interrupts lost, and hence big lagging in 'jiffies'. Rui once proposed a patch to prevent 'jiffies' from being a watchdog due to it unreliability [1]. And I think skipping the watchdog check one time when detecting some abnormal condition won't hurt the overall check much. [1]. https://lore.kernel.org/lkml/bd5b97f89ab2887543fc262348d1c7cafcaae536.camel@intel.com/ Thanks, Feng > Thoughts? > > Thanx, Paul
On Sat, Jan 06, 2024 at 10:55:09AM +0800, Feng Tang wrote: > On Thu, Jan 04, 2024 at 11:19:56AM -0800, Paul E. McKenney wrote: > > On Thu, Jan 04, 2024 at 05:30:50PM +0100, Jiri Wiesner wrote: > > > On Wed, Jan 03, 2024 at 02:08:08PM -0800, Paul E. McKenney wrote: > > > > I believe that there were concerns about a similar approach in the case > > > > where the jiffies counter is the clocksource > > > > > > I ran a few simple tests on a 2 NUMA node Intel machine and found nothing > > > so far. I tried booting with clocksource=jiffies and I changed the > > > "nr_online_nodes <= 4" check in tsc_clocksource_as_watchdog() to enable > > > the watchdog on my machine. I have a debugging module that monitors > > > clocksource and watchdog reads in clocksource_watchdog() with kprobes. I > > > see the cs/wd reads executed roughly every 0.5 second, as expected. When > > > the machine is idle the average watchdog interval is 501.61 milliseconds > > > (+-15.57 ms, with a minimum of 477.07 ms and a maximum of 517.93 ms). The > > > result is similar when the CPUs of the machine are fully saturated with > > > netperf processes. I also tried booting with clocksource=jiffies and > > > tsc=watchdog. The watchdog interval was similar to the previous test. > > > > > > AFAIK, the jiffies clocksource does get checked by the watchdog itself. > > > And with that, I have run out of ideas. > > > > If I recall correctly (ha!), the concern was that with the jiffies as > > clocksource, we would be using jiffies (via timers) to check jiffies > > (the clocksource), and that this could cause issues if the jiffies got > > behind, then suddenly updated while the clocksource watchdog was running. > > Yes, we also met problem when 'jiffies' was used as clocksource/watchdog, > but don't know if it's the same problem you mentioned. Our problem > ('jiffies' as watchdog marks clocksource TSC as unstable) only happens > in early boot phase with serial earlyprintk enabled, that the updating > of 'jiffies' relies on HW timer's periodic interrupt, but early printk > will disable interrupt during printing and cause some timer interrupts > lost, and hence big lagging in 'jiffies'. Rui once proposed a patch to > prevent 'jiffies' from being a watchdog due to it unreliability [1]. > > And I think skipping the watchdog check one time when detecting some > abnormal condition won't hurt the overall check much. Works for me! Thanx, Paul > [1]. https://lore.kernel.org/lkml/bd5b97f89ab2887543fc262348d1c7cafcaae536.camel@intel.com/ > > Thanks, > Feng > > > Thoughts? > > > > Thanx, Paul
Hello, kernel test robot noticed "WARNING:at_kernel/time/clocksource-wdtest.c:#wdtest_func" on: commit: ceacf0b27b5d9cdc574f8d0a0bcb11d3272e7b9f ("[PATCH] clocksource: Skip watchdog check for large watchdog intervals") url: https://github.com/intel-lab-lkp/linux/commits/Jiri-Wiesner/clocksource-Skip-watchdog-check-for-large-watchdog-intervals/20240103-192257 base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git da65f29dada7f7cbbf0d6375b88a0316f5f7d6f5 patch link: https://lore.kernel.org/all/20240103112113.GA6108@incl/ patch subject: [PATCH] clocksource: Skip watchdog check for large watchdog intervals in testcase: boot compiler: gcc-9 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G (please refer to attached dmesg/kmsg for entire log/backtrace) +--------------------------------------------------------------------+------------+------------+ | | da65f29dad | ceacf0b27b | +--------------------------------------------------------------------+------------+------------+ | WARNING:at_kernel/time/clocksource-wdtest.c:#wdtest_func | 0 | 14 | | EIP:wdtest_func | 0 | 14 | +--------------------------------------------------------------------+------------+------------+ If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <oliver.sang@intel.com> | Closes: https://lore.kernel.org/oe-lkp/202401082125.4ec42f71-oliver.sang@intel.com [ 54.778895][ T41] ------------[ cut here ]------------ [ 54.781279][ T41] WARNING: CPU: 0 PID: 41 at kernel/time/clocksource-wdtest.c:162 wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) [ 54.785405][ T41] Modules linked in: [ 54.787152][ T41] CPU: 0 PID: 41 Comm: wdtest Tainted: G N 6.7.0-rc2-00014-gceacf0b27b5d #1 30e4ebd22e7da702dfdab1313ae74e5a246df970 [ 54.791633][ T41] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 54.795124][ T41] EIP: wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) [ 54.796800][ T41] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b e9 ee fd ff ff 8d b4 26 00 00 00 00 66 90 0f 0b e9 08 fe ff ff 8d b4 26 00 00 00 00 66 90 <0f> 0b e9 76 ff ff ff cc cc cc cc cc cc cc cc cc 3e 8d 74 26 00 55 All code ======== 0: ff 8d b4 26 00 00 decl 0x26b4(%rbp) 6: 00 00 add %al,(%rax) 8: 66 90 xchg %ax,%ax a: 0f 0b ud2 c: e9 ee fd ff ff jmpq 0xfffffffffffffdff 11: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi 18: 66 90 xchg %ax,%ax 1a: 0f 0b ud2 1c: e9 08 fe ff ff jmpq 0xfffffffffffffe29 21: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi 28: 66 90 xchg %ax,%ax 2a:* 0f 0b ud2 <-- trapping instruction 2c: e9 76 ff ff ff jmpq 0xffffffffffffffa7 31: cc int3 32: cc int3 33: cc int3 34: cc int3 35: cc int3 36: cc int3 37: cc int3 38: cc int3 39: cc int3 3a: 3e 8d 74 26 00 lea %ds:0x0(%rsi,%riz,1),%esi 3f: 55 push %rbp Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 2: e9 76 ff ff ff jmpq 0xffffffffffffff7d 7: cc int3 8: cc int3 9: cc int3 a: cc int3 b: cc int3 c: cc int3 d: cc int3 e: cc int3 f: cc int3 10: 3e 8d 74 26 00 lea %ds:0x0(%rsi,%riz,1),%esi 15: 55 push %rbp [ 54.801794][ T41] EAX: c4a1fe8c EBX: 00000004 ECX: 00000000 EDX: 00000000 [ 54.803569][ T41] ESI: c4a1ff2e EDI: 00000000 EBP: c74aff70 ESP: c74aff60 [ 54.805314][ T41] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010246 [ 54.807278][ T41] CR0: 80050033 CR2: ffda9000 CR3: 05979000 CR4: 000406d0 [ 54.808856][ T41] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 54.810216][ T41] DR6: fffe0ff0 DR7: 00000400 [ 54.811218][ T41] Call Trace: [ 54.811950][ T41] ? show_regs (arch/x86/kernel/dumpstack.c:479 arch/x86/kernel/dumpstack.c:465) [ 54.812840][ T41] ? wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) [ 54.813801][ T41] ? __warn (kernel/panic.c:677) [ 54.814643][ T41] ? wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) [ 54.815628][ T41] ? report_bug (lib/bug.c:201 lib/bug.c:219) [ 54.816581][ T41] ? exc_overflow (arch/x86/kernel/traps.c:250) [ 54.817506][ T41] ? handle_bug (arch/x86/kernel/traps.c:216) [ 54.818420][ T41] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1)) [ 54.819425][ T41] ? handle_exception (arch/x86/entry/entry_32.S:1056) [ 54.820438][ T41] ? exc_overflow (arch/x86/kernel/traps.c:250) [ 54.821374][ T41] ? wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) [ 54.822318][ T41] ? exc_overflow (arch/x86/kernel/traps.c:250) [ 54.823284][ T41] ? wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) [ 54.824248][ T41] kthread (kernel/kthread.c:388) [ 54.825093][ T41] ? wdtest_ktime_read (kernel/time/clocksource-wdtest.c:105) [ 54.826132][ T41] ? kthread_complete_and_exit (kernel/kthread.c:341) [ 54.827307][ T41] ret_from_fork (arch/x86/kernel/process.c:153) [ 54.828219][ T41] ? kthread_complete_and_exit (kernel/kthread.c:341) [ 54.829331][ T41] ret_from_fork_asm (arch/x86/entry/entry_32.S:741) [ 54.830310][ T41] entry_INT80_32 (arch/x86/entry/entry_32.S:947) [ 54.831287][ T41] irq event stamp: 879 [ 54.832154][ T41] hardirqs last enabled at (889): console_unlock (kernel/printk/printk.c:341 kernel/printk/printk.c:2706 kernel/printk/printk.c:3038) [ 54.833838][ T41] hardirqs last disabled at (898): console_unlock (kernel/printk/printk.c:339 kernel/printk/printk.c:2706 kernel/printk/printk.c:3038) [ 54.835567][ T41] softirqs last enabled at (918): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582) [ 54.837246][ T41] softirqs last disabled at (907): do_softirq_own_stack (arch/x86/kernel/irq_32.c:57 arch/x86/kernel/irq_32.c:147) [ 54.839049][ T41] ---[ end trace 0000000000000000 ]--- [ 54.840178][ T41] clocksource: Not enough CPUs to check clocksource 'wdtest-ktime'. [ 54.841774][ T41] clocksource_wdtest: --- Done with test. The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20240108/202401082125.4ec42f71-oliver.sang@intel.com
On Mon, Jan 08, 2024 at 09:44:31PM +0800, kernel test robot wrote: > commit: ceacf0b27b5d9cdc574f8d0a0bcb11d3272e7b9f ("[PATCH] clocksource: Skip watchdog check for large watchdog intervals") > url: https://github.com/intel-lab-lkp/linux/commits/Jiri-Wiesner/clocksource-Skip-watchdog-check-for-large-watchdog-intervals/20240103-192257 > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git da65f29dada7f7cbbf0d6375b88a0316f5f7d6f5 > patch link: https://lore.kernel.org/all/20240103112113.GA6108@incl/ > patch subject: [PATCH] clocksource: Skip watchdog check for large watchdog intervals > > [ 54.778895][ T41] ------------[ cut here ]------------ > [ 54.781279][ T41] WARNING: CPU: 0 PID: 41 at kernel/time/clocksource-wdtest.c:162 wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) > [ 54.785405][ T41] Modules linked in: > [ 54.787152][ T41] CPU: 0 PID: 41 Comm: wdtest Tainted: G N 6.7.0-rc2-00014-gceacf0b27b5d #1 30e4ebd22e7da702dfdab1313ae74e5a246df970 > [ 54.791633][ T41] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 > [ 54.795124][ T41] EIP: wdtest_func (kernel/time/clocksource-wdtest.c:162 (discriminator 1)) The code: pr_info("--- Watchdog clock-value-fuzz error injection, expect clock skew and per-CPU mismatches.\n"); WRITE_ONCE(wdtest_ktime_read_fuzz, true); schedule_timeout_uninterruptible(2 * HZ); WARN_ON_ONCE(!(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE)); The warning is the consequence of the watchdog not running on i386 builds of the kernel. The watchdog does not run because the "interval > WATCHDOG_INTR_MAX_NS" check is always true on account of the WATCHDOG_INTR_MAX_NS constant evaluating to a negative number on account of integer overflow. > [ 54.801794][ T41] EAX: c4a1fe8c EBX: 00000004 ECX: 00000000 EDX: 00000000 > [ 54.803569][ T41] ESI: c4a1ff2e EDI: 00000000 EBP: c74aff70 ESP: c74aff60 > [ 54.805314][ T41] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010246 > [ 54.807278][ T41] CR0: 80050033 CR2: ffda9000 CR3: 05979000 CR4: 000406d0 > [ 54.808856][ T41] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 > [ 54.810216][ T41] DR6: fffe0ff0 DR7: 00000400 > [ 54.811218][ T41] Call Trace:
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index c108ed8a9804..ac5cb0ff278b 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -98,7 +98,9 @@ static u64 suspend_start; /* * Interval: 0.5sec. */ -#define WATCHDOG_INTERVAL (HZ >> 1) +#define WATCHDOG_INTERVAL (HZ >> 1) +#define WATCHDOG_INTR_MAX_NS ((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\ + * NSEC_PER_SEC / HZ) /* * Threshold: 0.0312s, when doubled: 0.0625s. @@ -134,6 +136,7 @@ static DECLARE_WORK(watchdog_work, clocksource_watchdog_work); static DEFINE_SPINLOCK(watchdog_lock); static int watchdog_running; static atomic_t watchdog_reset_pending; +static int64_t watchdog_max_intr; static inline void clocksource_watchdog_lock(unsigned long *flags) { @@ -400,7 +403,7 @@ static void clocksource_watchdog(struct timer_list *unused) { u64 csnow, wdnow, cslast, wdlast, delta; int next_cpu, reset_pending; - int64_t wd_nsec, cs_nsec; + int64_t wd_nsec, cs_nsec, interval; struct clocksource *cs; enum wd_read_status read_ret; unsigned long extra_wait = 0; @@ -470,6 +473,27 @@ static void clocksource_watchdog(struct timer_list *unused) if (atomic_read(&watchdog_reset_pending)) continue; + /* + * The processing of timer softirqs can get delayed (usually + * on account of ksoftirqd not getting to run in a timely + * manner), which causes the watchdog interval to stretch. + * Some clocksources, e.g. acpi_pm, cannot tolerate + * watchdog intervals longer than a few seconds. + * Skew detection may fail for longer watchdog intervals + * on account of fixed margins being used. + */ + interval = max(cs_nsec, wd_nsec); + if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) { + if (system_state > SYSTEM_SCHEDULING && + interval > 2 * watchdog_max_intr) { + watchdog_max_intr = interval; + pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n", + cs_nsec, wd_nsec); + } + watchdog_timer.expires = jiffies; + continue; + } + /* Check the deviation from the watchdog clocksource. */ md = cs->uncertainty_margin + watchdog->uncertainty_margin; if (abs(cs_nsec - wd_nsec) > md) {