Message ID | 20240126091250.79985-1-feng.tang@intel.com |
---|---|
State | New |
Headers |
From: Feng Tang <feng.tang@intel.com>
To: John Stultz <jstultz@google.com>, Thomas Gleixner <tglx@linutronix.de>, Stephen Boyd <sboyd@kernel.org>, linux-kernel@vger.kernel.org, paulmck@kernel.org, Peter Zijlstra <peterz@infradead.org>, Waiman Long <longman@redhat.com>
Cc: Feng Tang <feng.tang@intel.com>, Jin Wang <jin1.wang@intel.com>
Subject: [PATCH] clocksource: Scale the max retry number of watchdog read according to CPU numbers
Date: Fri, 26 Jan 2024 17:12:50 +0800
Message-Id: <20240126091250.79985-1-feng.tang@intel.com>
X-Mailer: git-send-email 2.34.1
List-Id: <linux-kernel.vger.kernel.org>
Series | clocksource: Scale the max retry number of watchdog read according to CPU numbers |
Commit Message
Feng Tang
Jan. 26, 2024, 9:12 a.m. UTC
There was a bug on one 8-socket server where the TSC was wrongly marked as 'unstable' and disabled during boot (it reproduces roughly once every 120 rounds of reboot tests), with this log:

  clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns,
  wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable
  tsc: Marking TSC unstable due to clocksource watchdog
  TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
  sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152)
  clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896.
  clocksource: Switched to clocksource hpet

The reason is that on platforms with many CPUs there is sporadic large, even huge, latency when reading the watchdog/clocksource during boot or under a stressful workload, and both the frequency and the maximum value of that latency grow with the number of CPUs. The current code already has logic to detect and filter such high-latency cases by reading the watchdog three times and checking the two deltas. Due to the randomness of the latency, there is a low-probability situation where the first delta (latency) is big but the second delta is small and looks valid, so it escapes the check; 'max_cswd_read_retries' retries the check to cover this case, but its default value is only 2, which may not be enough for machines with a huge number of CPUs.

So scale and enlarge the max retry number according to the CPU count to better filter this latency noise on large systems. This has been verified over 4 days and 670 rounds of reboot testing on the 8-socket machine.

Tested-by: Jin Wang <jin1.wang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 kernel/time/clocksource.c | 9 +++++++++
 1 file changed, 9 insertions(+)
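The scaling idea above can be modeled outside the kernel. Below is a minimal userspace C sketch (the helper names are hypothetical, and `ilog2_u32` stands in for the kernel's `ilog2()`), showing how a `max(2, ilog2(ncpus))` retry limit keeps the old default of 2 on small machines and grows logarithmically on large ones:

```c
/* Floor of log2(n); returns -1 for n == 0, mirroring what the kernel's
 * ilog2() computes for the cases used here. */
static int ilog2_u32(unsigned int n)
{
    int log = -1;

    while (n) {
        n >>= 1;
        log++;
    }
    return log;
}

/* Hypothetical model of the patch's scaling: keep the historical
 * default of 2 retries unless log2 of the CPU count is larger. */
static int scaled_max_retries(unsigned int nr_cpus)
{
    int scaled = ilog2_u32(nr_cpus);

    return scaled > 2 ? scaled : 2;
}
```

For example, a 4-CPU box keeps the default of 2 retries, while a 1024-CPU system gets 10, widening the window for filtering out sporadic long read latencies without changing behavior on small machines.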
Comments
On Fri, Jan 26, 2024 at 05:12:50PM +0800, Feng Tang wrote:
> There was a bug on one 8-socket server that the TSC is wrongly marked as
> 'unstable' and disabled during boot time. (reproduce rate is every 120
> rounds of reboot tests), with log:
>
> [...]
>
> So scale and enlarge the max retry number according to CPU number to better
> filter those latency noise on large system, which has been verified fine in
> 4 days and 670 rounds of reboot test on the 8-socket machine.
>
> Tested-by: Jin Wang <jin1.wang@intel.com>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  kernel/time/clocksource.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c108ed8a9804..f15283101906 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -226,6 +226,15 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
>  	u64 wd_end, wd_end2, wd_delta;
>  	int64_t wd_delay, wd_seq_delay;
>
> +	/*
> +	 * If no user changes the default value, scale the retry threshold
> +	 * according to CPU numbers. As per test, the more CPU a platform has,
> +	 * the bigger read latency is found during boot time or under stress
> +	 * work load. Increase the try number to reduce false alarms.
> +	 */
> +	if (max_cswd_read_retries == 2)

I like the self-adjusting behavior!

But why not make max_cswd_read_retries be a long (instead of a ulong)
defaulting to -1? Then when someone sets it to a specific positive
value, they get exactly that value, while leaving it at -1 gets the
default CPU-scaling behavior. Zero and other negative values should get
a warning, in order to reserve them for possible future use.

I also suggest doing the adjustment at boot time, for example, using
an early_initcall(). That way the test code also sees the scaled value.

							Thanx, Paul

> +		max_cswd_read_retries = max(2, ilog2(num_online_cpus()));
> +
>  	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
>  		local_irq_disable();
>  		*wdnow = watchdog->read(watchdog);
> --
> 2.34.1
>
Hi Paul,

On Fri, Jan 26, 2024 at 04:27:19AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 26, 2024 at 05:12:50PM +0800, Feng Tang wrote:
> > [...]
>
> I like the self-adjusting behavior!

Thanks!

> But why not make max_cswd_read_retries be a long (instead of a ulong)
> defaulting to -1? Then when someone sets it to a specific positive
> value, they get exactly that value, while leaving it at -1 gets the
> default CPU-scaling behavior. Zero and other negative values should get
> a warning, in order to reserve them for possible future use.

Good suggestions!

> I also suggest doing the adjustment at boot time, for example, using
> an early_initcall(). That way the test code also sees the scaled value.

I also thought about doing the adjustment once in the early boot phase
using num_possible_cpus(), but gave up on that because the parameter can
be changed at runtime through the sysfs module-parameter interface, and
CPUs can be hotplugged at runtime.

Since the watchdog timer only fires (if not skipped) every 500 ms, how
about doing the ilog2 math every time, like below:

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 1d42d4b17327..9104bdecf34e 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -291,7 +291,7 @@ static inline void timer_probe(void) {}
 #define TIMER_ACPI_DECLARE(name, table_id, fn)	\
 	ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
 
-extern ulong max_cswd_read_retries;
+extern long max_cswd_read_retries;
 void clocksource_verify_percpu(struct clocksource *cs);
 
 #endif /* _LINUX_CLOCKSOURCE_H */
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c108ed8a9804..867bb36e6dad 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -208,8 +208,8 @@ void clocksource_mark_unstable(struct clocksource *cs)
 	spin_unlock_irqrestore(&watchdog_lock, flags);
 }
 
-ulong max_cswd_read_retries = 2;
-module_param(max_cswd_read_retries, ulong, 0644);
+long max_cswd_read_retries = -1;
+module_param(max_cswd_read_retries, long, 0644);
 EXPORT_SYMBOL_GPL(max_cswd_read_retries);
 static int verify_n_cpus = 8;
 module_param(verify_n_cpus, int, 0644);
@@ -225,8 +225,17 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
 	unsigned int nretries;
 	u64 wd_end, wd_end2, wd_delta;
 	int64_t wd_delay, wd_seq_delay;
+	long max_retries = max_cswd_read_retries;
+
+	if (max_cswd_read_retries <= 0) {
+		if (max_cswd_read_retries != -1)
+			pr_warn_once("max_cswd_read_retries has been set an invalid number: %ld\n",
+				     max_cswd_read_retries);
 
-	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
+		max_retries = ilog2(num_online_cpus()) + 1;
+	}
+
+	for (nretries = 0; nretries <= max_retries; nretries++) {
 		local_irq_disable();
 		*wdnow = watchdog->read(watchdog);
 		*csnow = cs->read(cs);
@@ -238,7 +247,7 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
 		wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
 					      watchdog->shift);
 		if (wd_delay <= WATCHDOG_MAX_SKEW) {
-			if (nretries > 1 || nretries >= max_cswd_read_retries) {
+			if (nretries > 1 || nretries >= max_retries) {
 				pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
 					smp_processor_id(), watchdog->name, nretries);
 			}

Thanks,
Feng
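The -1/positive/invalid semantics being discussed can be captured in a tiny userspace model (hypothetical names; the real kernel code would also issue pr_warn_once() for invalid values and use num_online_cpus()):

```c
/* Userspace sketch of the proposed parameter handling:
 *   param > 0   : use the user's explicit retry limit verbatim
 *   param == -1 : default, scale as ilog2(online_cpus) + 1
 *   otherwise   : invalid (reserved), fall back to scaling
 *                 (the kernel version would warn once here) */
static long effective_max_retries(long param, unsigned int online_cpus)
{
    int log = -1;

    if (param > 0)
        return param;

    /* floor(log2(online_cpus)), computed by shifting */
    while (online_cpus) {
        online_cpus >>= 1;
        log++;
    }
    return (long)log + 1;
}
```

With this shape, `effective_max_retries(-1, 8)` yields 4 retries on an 8-CPU system, while an explicit setting such as 5 is honored regardless of CPU count.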
On 1/26/24 08:45, Feng Tang wrote:
> Hi Paul,
>
> On Fri, Jan 26, 2024 at 04:27:19AM -0800, Paul E. McKenney wrote:
>> [...]
>>
>> I also suggest doing the adjustment at boot time, for example, using
>> an early_initcall(). That way the test code also sees the scaled value.
>
> [...]
>
> Since the watchdog timer only fires (if not skipped) every 500 ms, how
> about doing the ilog2 math every time, like below:
>
> [...]
> -	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
> +		max_retries = ilog2(num_online_cpus()) + 1;
> +	}
> +
> +	for (nretries = 0; nretries <= max_retries; nretries++) {
> [...]

The max_cswd_read_retries value is also used in
kernel/time/clocksource-wdtest.c. You will have to apply similar logic to
clocksource-wdtest.c if it is not done once in early_init.

Cheers,
Longman
On Fri, Jan 26, 2024 at 11:19:50AM -0500, Waiman Long wrote:
> On 1/26/24 08:45, Feng Tang wrote:
> > [...]
>
> The max_cswd_read_retries value is also used in
> kernel/time/clocksource-wdtest.c. You will have to apply similar logic to
> clocksource-wdtest.c if it is not done once in early_init.

Good point! If it is not done in an early_init(), could we please
have a function for the common code?

							Thanx, Paul
On Fri, Jan 26, 2024 at 11:28:36AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 26, 2024 at 11:19:50AM -0500, Waiman Long wrote:
[...]
> > > > I also suggest doing the adjustment at boot time, for example, using
> > > > an early_initcall(). That way the test code also sees the scaled value.
> > > I also thought about doing the adjustment once in the early boot phase
> > > using num_possible_cpus(), but gave up because that parameter can be
> > > changed at runtime through sysfs's module parameter interface, and
> > > CPUs can be hotplugged at runtime.
> > >
> > > Since the watchdog timer only fires (if not skipped) every 500 ms,
> > > how about doing the ilog2 math every time, like below:
> > >
> > > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> > > index 1d42d4b17327..9104bdecf34e 100644
> > > --- a/include/linux/clocksource.h
> > > +++ b/include/linux/clocksource.h
> > > @@ -291,7 +291,7 @@ static inline void timer_probe(void) {}
> > >  #define TIMER_ACPI_DECLARE(name, table_id, fn)		\
> > >  	ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
> > >
> > > -extern ulong max_cswd_read_retries;
> > > +extern long max_cswd_read_retries;
> > >  void clocksource_verify_percpu(struct clocksource *cs);
> > >
> > >  #endif /* _LINUX_CLOCKSOURCE_H */
> > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > index c108ed8a9804..867bb36e6dad 100644
> > > --- a/kernel/time/clocksource.c
> > > +++ b/kernel/time/clocksource.c
> > > @@ -208,8 +208,8 @@ void clocksource_mark_unstable(struct clocksource *cs)
> > >  	spin_unlock_irqrestore(&watchdog_lock, flags);
> > >  }
> > >
> > > -ulong max_cswd_read_retries = 2;
> > > -module_param(max_cswd_read_retries, ulong, 0644);
> > > +long max_cswd_read_retries = -1;
> > > +module_param(max_cswd_read_retries, long, 0644);
> > >  EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> > >  static int verify_n_cpus = 8;
> > >  module_param(verify_n_cpus, int, 0644);
> > > @@ -225,8 +225,17 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
> > >  	unsigned int nretries;
> > >  	u64 wd_end, wd_end2, wd_delta;
> > >  	int64_t wd_delay, wd_seq_delay;
> > > +	long max_retries = max_cswd_read_retries;
> > > +
> > > +	if (max_cswd_read_retries <= 0) {
> > > +		if (max_cswd_read_retries != -1)
> > > +			pr_warn_once("max_cswd_read_retries has been set to an invalid value: %ld\n",
> > > +				     max_cswd_read_retries);
> > >
> > > -	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
> > > +		max_retries = ilog2(num_online_cpus()) + 1;
> > > +	}
> > > +
> > > +	for (nretries = 0; nretries <= max_retries; nretries++) {
> > >  		local_irq_disable();
> > >  		*wdnow = watchdog->read(watchdog);
> > >  		*csnow = cs->read(cs);
> > > @@ -238,7 +247,7 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
> > >  		wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
> > >  					      watchdog->shift);
> > >  		if (wd_delay <= WATCHDOG_MAX_SKEW) {
> > > -			if (nretries > 1 || nretries >= max_cswd_read_retries) {
> > > +			if (nretries > 1 || nretries >= max_retries) {
> > >  				pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> > >  					smp_processor_id(), watchdog->name, nretries);
> > >  			}
> >
> > The max_cswd_read_retries value is also used in
> > kernel/time/clocksource-wdtest.c. You will have to apply similar logic to
> > clocksource-wdtest.c if it is not done once in early_init.
>
> Good point!  If it is not done in an early_init(), could we please
> have a function for the common code?

Thanks Waiman for the catch! And sure, I will add a new helper function
for that.

Thanks,
Feng

>
> 							Thanx, Paul
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c108ed8a9804..f15283101906 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -226,6 +226,15 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
 	u64 wd_end, wd_end2, wd_delta;
 	int64_t wd_delay, wd_seq_delay;
 
+	/*
+	 * If no user changes the default value, scale the retry threshold
+	 * with the number of CPUs. Testing shows that the more CPUs a
+	 * platform has, the longer the read latencies seen during boot or
+	 * under a stressful workload. Increase the retry count to reduce
+	 * false alarms.
+	 */
+	if (max_cswd_read_retries == 2)
+		max_cswd_read_retries = max(2, ilog2(num_online_cpus()));
+
 	for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
 		local_irq_disable();
 		*wdnow = watchdog->read(watchdog);