Message ID | 20230418122639.ikgfvu3f@linutronix.de |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp2831668vqo; Tue, 18 Apr 2023 06:06:14 -0700 (PDT) X-Google-Smtp-Source: AKy350ZxlLg2TqxjFGasedq+e+8R+BcmddK9SP7ZoHKnkizMBuCXLxANDKrrElrEcREHK6UA6Kb+ X-Received: by 2002:a05:6a00:1903:b0:625:efa4:4c01 with SMTP id y3-20020a056a00190300b00625efa44c01mr24308121pfi.3.1681823174040; Tue, 18 Apr 2023 06:06:14 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1681823174; cv=none; d=google.com; s=arc-20160816; b=zzlDGG2ufrPmwwW3NxJYMeolYTE+pNLuLOZ3+JuenOY/jP3H3Q6/M5mfnBDrhg9Sms p+IVcTLJqB6SznmV4kjTLSK5viER9Y3v2GY3K96+hyf+hPCm2zxbAw8g2wy2R+kA7cdf qJyAu0F+3Hxbltb5+OOXgVJkMLIUGsMo9Nw72l6MPpptd1+FscyOzrYihSyFeCkAzTMo vf5ipdOlywO0qN/Opq+Gq7x3iISPR93PzwAuutri4k9P83oLT8Pz3L+C+7YxkU51OXaW ydZf/QdIJwybpJdpm21wV9wW3kNWh10ZsVvTMsrMD5jRqw++cIxTi0ujcVwyWGSoXIpx llqQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:dkim-signature :dkim-signature:date; bh=FqUkCjI9Mr4UILiIZm8CkQghAtfwUwan4upLdsoWU00=; b=GTjf0YapFvpKSxT/WQAPR+qnnCef3jNVYaljIn9W3/CtsZJl3oQY0wKIgQkgTjNPOz G5zu4NhQuZ+qNHt+RPq5oobD3DY09vXEBJmuIBZeUhef/nfFaPdT80voEQx8kwxB8gEU 5r+5QcOBK3+mgNkV1A85kJ6d+M2p/03YNUhXboOlalZKF0U74xO0jC+r3aUZZmP9JK91 qd9HgQcHRQfetEssgVbiUWvPpubuB8T+0SE2ilM3UZH1bQOyikkk/JyAEW45JrTUlD7H OMG4uAs3h//KLte2jyNApWsdTcJCX5dRaoaymzxTI7Tb0FKt0ULNHQtyoEoNOQGyIMmP RC2Q== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=XZCTq6LE; dkim=neutral (no key) header.i=@linutronix.de header.b=D5xGMZhI; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id a4-20020a656544000000b0051b54dcd001si13371179pgw.780.2023.04.18.06.06.00; Tue, 18 Apr 2023 06:06:14 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=XZCTq6LE; dkim=neutral (no key) header.i=@linutronix.de header.b=D5xGMZhI; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230384AbjDRM1Y (ORCPT <rfc822;leviz.kernel.dev@gmail.com> + 99 others); Tue, 18 Apr 2023 08:27:24 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36020 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230518AbjDRM1O (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 18 Apr 2023 08:27:14 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [IPv6:2a0a:51c0:0:12e:550::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A4A669020 for <linux-kernel@vger.kernel.org>; Tue, 18 Apr 2023 05:26:54 -0700 (PDT) Date: Tue, 18 Apr 2023 14:26:39 +0200 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1681820801; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=FqUkCjI9Mr4UILiIZm8CkQghAtfwUwan4upLdsoWU00=; b=XZCTq6LE1G+yciJO4kkfRVEAfFAeZE7nrI8z+Dh+O/lMDTr0YoepfNWfDxQ6Fqm7Z+tStJ 96mQTG7/wu3/uKC512CHUpWq25rGf5h79ujTEUjCj6dSTr6mCCp/MJxzZiRak+WxXZQ/ku 7Xa493KJmgCnOu6l6fs5DhY8Slk3lbULWfjgNl6nEOXyrhCcTFFDQxhT87nl/v1cmoW8yv DcrfVVX9hyGyFhxhzcblp+6etyRd7nKKHAPPWdOENefzJ21kCuHMYZ4W0vuOa0HXsTcjzn hA+yEAJ6wSbeLukxaQtqIgPnu8zoB0wbOYkMBarar0jwKWxVe9MG14eOwFa0Mg== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1681820801; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=FqUkCjI9Mr4UILiIZm8CkQghAtfwUwan4upLdsoWU00=; b=D5xGMZhIC3TRcsju2YKfLuBS0K+Vt7c+TQfwcD6r/DFJUSdxKYRJ/sOYUBEfxmHB15FZOK LhZxFuEP3rAulMBg== From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> To: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>, linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>, Gusenleitner Klaus <gus@keba.com>, Frederic Weisbecker <fweisbec@gmail.com> Subject: [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick. Message-ID: <20230418122639.ikgfvu3f@linutronix.de> References: <20230406095735.0_14edn3@linutronix.de> <20230406105529.GB392176@hirez.programming.kicks-ass.net> <87mt3lqnde.ffs@tglx> <20230406150254.ZrawA2Y-@linutronix.de> <87jzypq6gq.ffs@tglx> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <87jzypq6gq.ffs@tglx> X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE, SPF_PASS,T_SCC_BODY_TEXT_LINE,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1763519416336346508?= X-GMAIL-MSGID: =?utf-8?q?1763519416336346508?= |
Series |
timekeeping: Align tick_sched_timer() with the HZ tick.
|
|
Commit Message
Sebastian Andrzej Siewior
April 18, 2023, 12:26 p.m. UTC
With HIGHRES enabled tick_sched_timer() is programmed every jiffy to
expire the timer_list timers. This timer is programmed accurate in
respect to CLOCK_MONOTONIC so that 0 seconds and nanoseconds is the
first tick and the next one is 1000/CONFIG_HZ ms later. For HZ=250 it is
every 4 ms and so based on the current time the next tick can be
computed.
This accuracy broke since the commit mentioned below because the jiffy
based clocksource is initialized with higher accuracy in
read_persistent_wall_and_boot_offset(). This higher accuracy is
inherited during the setup in tick_setup_device(). The timer still fires
every 4ms with HZ=250 but timer is no longer aligned with
CLOCK_MONOTONIC with 0 as it origin but has an offset in the us/ns part
of the timestamp. The offset differs with every boot and makes it
impossible for user land to align with the tick.
Align the tick timer with CLOCK_MONOTONIC ensuring that it is always a
multiple of 1000/CONFIG_HZ ms.
Reported-by: Gusenleitner Klaus <gus@keba.com>
Link: https://lore.kernel.org/20230406095735.0_14edn3@linutronix.de
Fixes: 857baa87b6422 ("sched/clock: Enable sched clock early")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/time/tick-common.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
Comments
Hi Sebastian, On 18.04.23 14:26, Sebastian Andrzej Siewior wrote: > [...]. The timer still fires > every 4ms with HZ=250 but timer is no longer aligned with > CLOCK_MONOTONIC with 0 as it origin but has an offset in the us/ns part > of the timestamp. The offset differs with every boot and makes it > impossible for user land to align with the tick. I can observe these per-boot offsets too, but... > Align the tick timer with CLOCK_MONOTONIC ensuring that it is always a > multiple of 1000/CONFIG_HZ ms. this change doesn't seem to achieve that goal, unfortunately. Quite the opposite. It makes the (boot) clock run faster and, because of the per- boot different offset, differently fast for each boot. Up to the point where it's running too fast to make any progress at all. This patch causes VM boot hangs for us. It took a while to identify as the boot hangs were only ~1 out of 30 but it's clearly it. Reverting the commit got me 100 boots in a row without any issue. Instrumenting the kernel a little gave me a clue what the bug is. When switching from the boot timer tick device (which is 'hpet' in my setup) to 'lapic-deadline', the mode of the timer isn't changed and kept at TICKDEV_MODE_PERIODIC. As that device doesn't support this mode, tick_setup_periodic() will switch over to CLOCK_EVT_STATE_ONESHOT mode and program the next expire event based on tick_next_period. clockevents_program_event() will calculate the delta of that timestamp and ktime_get() and pass that value on to dev->set_next_event() (which is lapic_next_deadline()) which will write it to the IA32_TSC_DEADLINE MSR. That delta value -- which is still the per-boot different offset to ktime_get() your patch introduces -- now gets stuck and is taken as the new *jiffies tick time*. That's because tick_handle_periodic() -> tick_periodic() will advance tick_next_period by TICK_NSEC, make do_timer() increment jiffies_64 by one and then program the next event to be in TICK_NSEC ns based on the device's old expiry time, i.e. keep the offset intact. This is followed by re-arming the event by a call to clockevents_program_event() which does the already-know delta calculation and writes, again, the too little value to IA32_TSC_DEADLINE. This effectively makes the jiffies based clock go too fast as the timer IRQ comes too early (less than TICK_NSEC ns). Sometimes it's barely noticeable, but sometimes it's so fast that the kernel is overloaded with only handling the local timer IRQ without making any further progress, especially in (nested) VM setups. Without commit e9523a0d8189 ("tick/common: Align tick period with the HZ tick."), which was backported to many stable and LTS kernels (v6.3.2 (571c3b46c9b3), v6.2.15 (f0cb827199ec), v6.1.28 (290e26ec0d01), v5.15.111 (a55050c7989c), v5.10.180 (c4013689269d) and v5.4.243 (a3e7a3d472c2)) this clock drift is gone and my VMs boot again. Before that commit, the delta between tick_next_period and ktime_get() was initially zero, so tick_handle_period() had to loop, as clockevents_program_event() will return with -ETIME. The next attempt would be done with a delta of TICK_NSEC which will make clockevents_program_event() succeed and ensure that future events don't need the additional loop iteration, as the delta got stuck at TICK_NSEC -- exactly where it should be. We observed the bug first on the v6.3, v6.1 and v5.15 stable branch updates from May 11th and then, a week later, on v5.4 too. All first occurrences were coinciding with the bad commit going into the corresponding stable and LTS kernel releases. The issue manifests itself as a fast running clock only during boot, when the clock source is still jiffies based. That'll eventually lead to a boot hang as the timer IRQs are firing too fast. To reproduce this you can either boot loop a VM and try to get "lucky" to hit a big enough 'rem' value or just apply this little diff instead: ---8<--- diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c index 65b8658da829..b01cf18a5d42 100644 --- a/kernel/time/tick-common.c +++ b/kernel/time/tick-common.c @@ -225,6 +225,7 @@ static void tick_setup_device(struct tick_device *td, next_p = ktime_get(); div_u64_rem(next_p, TICK_NSEC, &rem); + rem = TICK_NSEC - 123; if (rem) { next_p -= rem; next_p += TICK_NSEC; --->8--- This should make the kernel get stuck with only handling timer ticks but not making any further progress. Change the subtrahend to 1234 to get a system that boots but has an unrealistically fast clock during kernel initialization. As reverting that commit fixes the issue for us but it seemingly fixes another bug for Klaus (or at least attempted to), the now uncovered bug should be fixed instead. The fundamental issue is that the jiffies based clock source cannot be trusted and shouldn't be used to calculate offsets to timestamps in the future when tick_next_period mod ktime_get() != 0. Can we defer the offset adjustment of tick_next_period to a later point in time when a stable clock source gets used, like 'tsc'? Thanks, Mathias
On Wed, Jun 14, 2023 at 12:59:46AM +0200, Mathias Krause wrote: > This patch causes VM boot hangs for us. It took a while to identify as > the boot hangs were only ~1 out of 30 but it's clearly it. Reverting > the commit got me 100 boots in a row without any issue. FWIW I have quite a nice test program for catching these sorts of boot hangs, see attached. You need to change the VMLINUX define to point to your vmlinux or vmlinuz file. Rich.
On 2023-06-14 00:59:46 [+0200], Mathias Krause wrote: > Hi Sebastian, Hi Mathias, > this change doesn't seem to achieve that goal, unfortunately. Quite the > opposite. It makes the (boot) clock run faster and, because of the per- > boot different offset, differently fast for each boot. Up to the point > where it's running too fast to make any progress at all. I meant to reply to this thread but took the other by accident. Sorry. This should be addressed by https://lore.kernel.org/20230615091830.RxMV2xf_@linutronix.de Thank you for the report. > Thanks, > Mathias Sebastian
On 15.06.23 11:03, Richard W.M. Jones wrote: > On Wed, Jun 14, 2023 at 12:59:46AM +0200, Mathias Krause wrote: >> This patch causes VM boot hangs for us. It took a while to identify as >> the boot hangs were only ~1 out of 30 but it's clearly it. Reverting >> the commit got me 100 boots in a row without any issue. > > FWIW I have quite a nice test program for catching these sorts of boot > hangs, see attached. You need to change the VMLINUX define to point > to your vmlinux or vmlinuz file. Heh, nice. For this specific case, though, a simple boot loop in a single VM was sufficient: $ kvm -cpu host -smp 8 -m 8G -display none -serial stdio -s \ -kernel path/to/bzImage -append 'console=ttyS0 panic=1' This will reboot the VM in an endless loop until it hangs which is visually noticeable as there's no more output generated to the terminal. At that time one can attach to it via gdb and investigate further: $ gdb -ex 'target remote :1234' path/to/vmlinux Yet another advantage is that one can filter the output, e.g. via: $ kvm ... | grep clock Thanks, Mathias > > Rich. >
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c index 46789356f856e..65b8658da829e 100644 --- a/kernel/time/tick-common.c +++ b/kernel/time/tick-common.c @@ -218,9 +218,19 @@ static void tick_setup_device(struct tick_device *td, * this cpu: */ if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) { + ktime_t next_p; + u32 rem; + tick_do_timer_cpu = cpu; - tick_next_period = ktime_get(); + next_p = ktime_get(); + div_u64_rem(next_p, TICK_NSEC, &rem); + if (rem) { + next_p -= rem; + next_p += TICK_NSEC; + } + + tick_next_period = next_p; #ifdef CONFIG_NO_HZ_FULL /* * The boot CPU may be nohz_full, in which case set