Message ID: 80956e8f-761e-b74-1c7a-3966f9e8d934@linutronix.de
State: New
Headers:
From: Anna-Maria Behnsen <anna-maria@linutronix.de>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, Thomas Gleixner <tglx@linutronix.de>, Frederic Weisbecker <frederic@kernel.org>, "Gautham R. Shenoy" <gautham.shenoy@amd.com>, Ingo Molnar <mingo@redhat.com>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Valentin Schneider <vschneid@redhat.com>
Subject: Stopping the tick on a fully loaded system
Date: Thu, 20 Jul 2023 08:51:01 +0200 (CEST)
Message-ID: <80956e8f-761e-b74-1c7a-3966f9e8d934@linutronix.de>
Series: Stopping the tick on a fully loaded system
Commit Message
Anna-Maria Behnsen
July 20, 2023, 6:51 a.m. UTC
Hi,

during tests of the timer pull model, Gautham observed regressions under load. With the timer pull model in place, going idle is more expensive. My naive assumption, that a fully loaded system will not go idle, was simply wrong: on a fully loaded system (top shows ~1% idle), some CPUs go idle, stop the tick for several us, come back to work, and this repeats heavily.

Peter and tglx helped me track down the reason: the governor which decides whether the tick will be stopped only looks at the next timer, but does not take into account how busy the system is. Here Peter pointed to the scheduler's avg_idle value.

Besides the existing avg_idle, I introduced really_avg_idle, which is not limited to twice max_idle_balance_cost but is otherwise updated in ttwu_do_activate() when avg_idle is updated. With tracing, I was able to see that in the fully loaded case, 75%-80% of the idle periods were shorter than the really_avg_idle value. (trace_printk() of the really_avg_idle values directly at the beginning of tick_nohz_next_event(); sched_wakeup tracepoint enabled; the delta between the timestamps of the former and the latter is taken as the idle time.)

A generalized approach to prevent going idle when the system is loaded would be to add a check for how busy the system is to tick_nohz_next_event(). My PoC (find it at the end) simply checks whether the really_avg_idle value is smaller than TICK_NSEC. It's not possible to use the existing avg_idle value, as it is always smaller than TICK_NSEC on 250HZ systems. But even so, regressions occurred under load, and the standard deviation of the test results was in the same range as the regression (between 5 and 10%).

So I wanted to understand the avg_idle values and examined their distribution under different load scenarios. There my next naive assumption was that under load, mainly short values would be seen. This is true when the system is halfway loaded (top shows ~50% idle). But when the system is fully loaded, the avg_idle values are no longer 'focused' on small values.

Here I stopped and started to write this mail. I don't know the reason for the distribution under load, and I also don't know whether the idea of checking the system load in tick_nohz_next_event() is good or not. And last but not least, the scheduler is a closed book for me...

Thanks,

Anna-Maria

---8<---
Comments
Hi Vincent,

On Thu, 20 Jul 2023, Vincent Guittot wrote:

> On Thu, 20 Jul 2023 at 08:51, Anna-Maria Behnsen
> <anna-maria@linutronix.de> wrote:
> >
> > [...]
> >
> > A generalized approach to prevent going idle, when the system is loaded,
> > would be to add a check how busy the system is to tick_nohz_next_event().
>
> Would it be better to let the cpuidle governor decide whether to stop
> or not the tick instead ?
> With your change, tick_nohz_next_event() becomes a random value which
> might return something else than the real next event
>
> You might be interested in this:
> https://lore.kernel.org/all/20230105145159.1089531-1-kajetan.puchalski@arm.com/
>
> They use cpu utilization to stay in shallow c-states, some of which
> don't stop the tick. Maybe you could extend this to make sure not to
> stop the tick under high load.

I also had a look at teo. It makes things better, but does not solve the underlying problem that I see here - please correct me if I missed something or if I'm simply wrong:

Yes, the governors have to decide in the end whether it makes sense to stop the tick or not. For this decision, the governors require information about the current state of the core and about how long nothing will probably have to be done. At the moment the governors therefore call tick_nohz_get_sleep_length(). This first checks whether the tick can be stopped at all. Then it takes into account whether rcu, irq_work or arch_work needs the CPU, or whether a timer softirq is pending. Only if none of this is true are the timers checked. So tick_nohz_get_sleep_length() isn't based only on timers already.

The information about the sleep length from the scheduler's perspective is completely missing from the currently existing check for the probable sleep length. Sure, teo takes scheduler utilization into account directly in the governor. But for me it is not comprehensible why the CPU utilization check is done after asking for the possible sleep length, where timers are taken into account. If the CPU is busy anyway, the information generated by tick_nohz_next_event() is irrelevant. And when the CPU is not busy, then it makes sense to ask for the sleep length also from a timer perspective.

When this CPU utilization check is implemented directly inside the governor, every governor has to implement it on its own. So wouldn't it make sense to implement a "how utilized is the CPU from the scheduler's perspective" check in one place and use this as the first check in tick_nohz_get_sleep_length()/tick_nohz_next_event()?

> > In my PoC (find it at the end) it's simply checked whether the
> > really_avg_idle value is smaller than TICK_NSEC. [...]
> >
> > So I wanted to understand the avg_idle values and examined the distribution
> > with different load scenarios. There my next naive assumption was, that
> > under load mainly short values will be seen. This is true, when the system
> > is halfway loaded (top shows ~50% idle). But when the system is fully
> > loaded, the avg_idle values are no longer 'focused' on small values.
>
> avg_idle is broken for what you want to do. It is updated only when
> you leave an idle state, which means that it stays stuck at the old avg
> idle time when your system is fully busy.

As I said, I'm not familiar with scheduler internals. If avg_idle is not the right thing, there might be another possibility to generate the information about a possible sleep length from the already existing scheduler data.

Thanks,

Anna-Maria
On Thu, 20 Jul 2023 at 15:00, Anna-Maria Behnsen <anna-maria@linutronix.de> wrote:
>
> Hi Vincent,
>
> [...]
>
> Yes, the governors have to decide in the end whether it makes sense to
> stop the tick or not. For this decision, the governors require
> information about the current state of the core and about how long
> nothing will probably have to be done. At the moment the governors
> therefore call tick_nohz_get_sleep_length(). This first checks whether
> the tick can be stopped at all. Then it takes into account whether rcu,
> irq_work or arch_work needs the CPU, or whether a timer softirq is
> pending. Only if none of this is true are the timers checked. So
> tick_nohz_get_sleep_length() isn't based only on timers already.

Yes, but all of these are booleans saying that they rely on the tick to work correctly.

> The information about the sleep length from the scheduler's perspective
> is completely missing from the currently existing check for the probable
> sleep length.

The scheduler is not able to estimate the sleep length, and its avg_idle is just a rough and wrong estimate which already fails to do its job in several cases. It's quite difficult to predict a sleep duration, if not impossible. Even the cpuidle governors, which are the experts for this, often fail. That's partly why teo was added: the menu governor was failing to correctly predict the sleep duration on some systems.

> Sure, teo takes scheduler utilization into account directly in the
> governor. But for me it is not comprehensible why the CPU utilization
> check is done after asking for the possible sleep length, where timers
> are taken into account.

You should ask teo's maintainer and contributors for a detailed explanation of the policy and of why they check cpu utilization after getting the next timer event.

> When this CPU utilization check is implemented directly inside the
> governor, every governor has to implement it on its own. So wouldn't it
> make sense to implement a "how utilized is the CPU from the scheduler's
> perspective" check in one place and use this as the first check in
> tick_nohz_get_sleep_length()/tick_nohz_next_event()?

tick_nohz_get_sleep_length() is not the probable sleep length but the next timer event, whereas it's not clear what a utilized cpu means. The next timer event and the cpu's utilization are two different parameters, and the governors might want to apply different "weights" to these parameters in their policy.

> > avg_idle is broken for what you want to do. It is updated only when
> > you leave an idle state, which means that it stays stuck at the old avg
> > idle time when your system is fully busy.

Trying to predict a sleep duration is quite complex and needs other inputs than just the scheduler's. As an example, predicting irqs is another source of uncertainty in the sleep duration. As a side note, I have already seen some test results where randomly selecting an idle state did as well as the current governors.

Vincent

> As I said, I'm not familiar with scheduler internals. If avg_idle is not
> the right thing, there might be another possibility to generate the
> information about a possible sleep length from the already existing
> scheduler data.
>
> Thanks,
>
> Anna-Maria
(Adding Rafael in Cc)

On Thu, Jul 20, 2023 at 03:00:37PM +0200, Anna-Maria Behnsen wrote:
> I had also a look at teo. It makes things better but does not solve the
> underlying problem that I see here [...]
>
> At the moment the governors therefore call tick_nohz_get_sleep_length().
> This first checks whether the tick can be stopped at all. Then it takes
> into account whether rcu, irq_work or arch_work needs the CPU, or
> whether a timer softirq is pending. Only if none of this is true are the
> timers checked. So tick_nohz_get_sleep_length() isn't based only on
> timers already.

Right, but those things (rcu/irq work, etc...) act kind of like timers here, and they should be considered as exceptions. The timer infrastructure shouldn't take the idle activity into account; this is really a job for the cpuidle governors.

> The information about the sleep length from the scheduler's perspective
> is completely missing from the currently existing check for the probable
> sleep length.
>
> [...]
>
> When this CPU utilization check is implemented directly inside the
> governor, every governor has to implement it on its own. So wouldn't it
> make sense to implement a "how utilized is the CPU from the scheduler's
> perspective" check in one place and use this as the first check in
> tick_nohz_get_sleep_length()/tick_nohz_next_event()?

Well, beyond that, there might be other situations where the governor may decide not to stop the tick even if tick_nohz_next_event() says it's possible to do so. That's the purpose of having that next event as an input among many others for the cpuidle governors.

As such, calling tmigr_cpu_deactivate() at next tick _evaluation_ time instead of tick _stop_ time is always going to be problematic.

Can we fix that and call tmigr_cpu_deactivate() from tick_nohz_stop_tick() instead? This will change the locking scenario a bit, because tick_nohz_stop_tick() doesn't hold the base lock. Is that a problem though? In the worst case a remote tick happens and handles the earliest timer for the current CPU while it's between tick_nohz_next_event() and tick_nohz_stop_tick(), but then the current CPU would just propagate an earlier deadline than needed. No big deal.

Though I could be overlooking some race or something else making that not possible, of course...

Thanks.
On Sun, Jul 23, 2023 at 11:21 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> (Adding Rafael in Cc)
>
> [...]
>
> Right, but those things (rcu/irq work, etc...) act kind of like timers
> here, and they should be considered as exceptions.
>
> The timer infrastructure shouldn't take the idle activity into account;
> this is really a job for the cpuidle governors.
>
> > The information about the sleep length from the scheduler's perspective
> > is completely missing from the currently existing check for the
> > probable sleep length.

Well, not exactly. From the governor's perspective, tick_nohz_get_sleep_length() is supposed to return a deterministic upper bound on the upcoming idle duration (or at least it is used in the governors this way). IOW, it is expected that the CPU will not be idle longer than the tick_nohz_get_sleep_length() return value.

There are other factors that may cause the governor to predict a shorter idle duration, and the information coming from the scheduler may be regarded as one of them, but they are not deterministic.

> > Sure, teo takes scheduler utilization into account directly in the
> > governor. But for me it is not comprehensible why the CPU utilization
> > check is done after asking for the possible sleep length, where timers
> > are taken into account. If the CPU is busy anyway, the information
> > generated by tick_nohz_next_event() is irrelevant.

Why isn't it?

The CPU is idle at that point and it has gone idle for a reason. Surely, there was nothing to run on it.

The scheduler only knows that the CPU has been busy recently, which may not imply anything about whether or not, and when, it is going to be busy again.

> > And when the CPU is not busy, then it makes sense to ask for the sleep
> > length also from a timer perspective.
> >
> > When this CPU utilization check is implemented directly inside the
> > governor, every governor has to implement it on its own. So wouldn't it
> > make sense to implement a "how utilized is the CPU from the scheduler's

Yes, it would make sense to do that, but I thought that PELT was that thing. Wasn't it?

> > perspective" check in one place and use this as the first check in
> > tick_nohz_get_sleep_length()/tick_nohz_next_event()?

No, IMO it should be a separate check, because tick_nohz_get_sleep_length() is expected to return a deterministic upper bound on the idle duration, whereas the scheduler information is not deterministic. It is the governor's role to decide how to take it into account.

> Well, beyond that, there might be other situations where the governor may
> decide not to stop the tick even if tick_nohz_next_event() says it's
> possible to do so. That's the purpose of having that next event as an
> input among many others for the cpuidle governors.

Exactly.

> As such, calling tmigr_cpu_deactivate() at next tick _evaluation_ time
> instead of tick _stop_ time is always going to be problematic.
>
> Can we fix that and call tmigr_cpu_deactivate() from tick_nohz_stop_tick()
> instead? This will change the locking scenario a bit, because
> tick_nohz_stop_tick() doesn't hold the base lock. Is that a problem
> though? In the worst case a remote tick happens and handles the earliest
> timer for the current CPU while it's between tick_nohz_next_event() and
> tick_nohz_stop_tick(), but then the current CPU would just propagate
> an earlier deadline than needed. No big deal.

FWIW, this sounds reasonable to me.
Hi,

On Mon, 24 Jul 2023, Rafael J. Wysocki wrote:

> From the governor's perspective, tick_nohz_get_sleep_length() is
> supposed to return a deterministic upper bound on the upcoming idle
> duration (or at least it is used in the governors this way). IOW, it
> is expected that the CPU will not be idle longer than the
> tick_nohz_get_sleep_length() return value.
>
> There are other factors that may cause the governor to predict a
> shorter idle duration, and the information coming from the scheduler
> may be regarded as one of them, but they are not deterministic.

Ok. Thanks for this explanation of why two separate checks are required.

> > Sure, teo takes scheduler utilization into account directly in the
> > governor. But for me it is not comprehensible why the CPU utilization
> > check is done after asking for the possible sleep length [...]
>
> Why isn't it?
>
> The CPU is idle at that point and it has gone idle for a reason.
> Surely, there was nothing to run on it.
>
> The scheduler only knows that the CPU has been busy recently, which
> may not imply anything about whether or not, and when, it is going to
> be busy again.

I went back one step and simply had a look at current upstream to understand the behavior when going idle under load in more detail. I wanted to see the distribution of idle time durations when the tick is stopped. I used dbench tests with a script provided by Gautham to generate three different loads: 100% load, 50% load and 25% load. The kernel is configured with HZ=250. The system has 2 sockets, 64 cores per socket, 2 threads each -> 256 CPUs. Three minutes of trace data; idle periods larger than three minutes are not tracked. The governor is teo.

I added tracepoints at the point where the tick is stopped and where the tick is started again. I calculated the delta between the timestamps of those two trace points and had a look at the distribution:

                      100% load         50% load          25% load
                   (top: ~2% idle)  (top: ~49% idle)  (top: ~74% idle;
                                                       33 CPUs completely idle)
                   ---------------  ----------------  -------------------------
Idle Total         1658703  100%    3150522  100%      2377035  100%
x >= 4ms              2504  0.15%         2  0.00%          53  0.00%
4ms > x >= 2ms         390  0.02%         0  0.00%        4563  0.19%
2ms > x >= 1ms          62  0.00%         1  0.00%          54  0.00%
1ms > x >= 500us        67  0.00%         6  0.00%           2  0.00%
500us > x >= 250us      93  0.01%        39  0.00%          11  0.00%
250us > x >= 100us     280  0.02%      1145  0.04%         633  0.03%
100us > x >= 50us      942  0.06%     30722  0.98%       13347  0.56%
50us > x >= 25us     26728  1.61%    310932  9.87%      106083  4.46%
25us > x >= 10us    825920 49.79%   2320683 73.66%     1722505 72.46%
10us > x > 5us      795197 47.94%    442991 14.06%      506008 21.29%
5us > x               6520  0.39%     43994  1.40%       23645  0.99%

99% of the tick stops have an idle period shorter than 50us (50us is 1.25% of a tick length). This is also the reason for my opinion that the return value of tick_nohz_next_event() is completely irrelevant in a (fully) loaded case: the return value is in the range of ticks, and a tick is ~100 times longer than the duration of the majority of idle periods.

> > And when the CPU is not busy, then it makes sense to ask for the sleep
> > length also from a timer perspective.
[...]

> > As such, calling tmigr_cpu_deactivate() at next tick _evaluation_ time
> > instead of tick _stop_ time is always going to be problematic.
> >
> > Can we fix that and call tmigr_cpu_deactivate() from
> > tick_nohz_stop_tick() instead? [...] In the worst case a remote tick
> > happens and handles the earliest timer for the current CPU while it's
> > between tick_nohz_next_event() and tick_nohz_stop_tick(), but then the
> > current CPU would just propagate an earlier deadline than needed. No
> > big deal.
>
> FWIW, this sounds reasonable to me.

The worst case scenario will not happen, because remote timer expiry only happens when the CPU is not active in the hierarchy, and with your proposal this only holds after tick_nohz_stop_tick(). Nevertheless, I see some problems with this. But this also depends on whether there is a need to change the current idle behavior or not. Right now, these are my concerns:

- The determinism of tick_nohz_next_event() will break: its return value will not take into account whether this is the last CPU going idle, which then has to take care of remote timers. So the first timer of the CPU (regardless of global or local) has to be handed back, even if it could be handled by the hierarchy.

- When moving tmigr_cpu_deactivate() to tick_nohz_stop_tick(), and the return value of tmigr_cpu_deactivate() is before ts->next_tick, the expiry has to be modified in tick_nohz_stop_tick().

- The load is simply moved to a later place - tick_nohz_stop_tick() is never called without a preceding tick_nohz_next_event() call. Yes, tick_nohz_next_event() is called under load ~8% more often than tick_nohz_stop_tick(), but the 'quality' of the return value of tick_nohz_next_event() gets worse.

- The timer migration hierarchy is not a standalone timer infrastructure. It only makes sense to handle it in combination with the existing timer wheel. When the timer base is idle, the timer migration hierarchy with the migrators will do the job for global timers.

So, I'm not sure about the impact of the changed locking - but I'm pretty sure changing that increases the probability of ugly races hidden somewhere between the lines.

Thanks,

Anna-Maria
On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
<anna-maria@linutronix.de> wrote:
>
> Hi,
>
> On Mon, 24 Jul 2023, Rafael J. Wysocki wrote:
>
> > On Sun, Jul 23, 2023 at 11:21 PM Frederic Weisbecker
> > <frederic@kernel.org> wrote:
> >
> > From the governor's perspective, tick_nohz_get_sleep_length() is
> > supposed to return a deterministic upper bound on the upcoming idle
> > duration (or at least it is used in the governors this way). IOW, it
> > is expected that the CPU will not be idle longer than the
> > tick_nohz_get_sleep_length() return value.
> >
> > There are other factors that may cause the governor to predict a
> > shorter idle duration and the information coming from the scheduler
> > may be regarded as one of them, but they are not deterministic.
>
> Ok. Thanks for this explanation why two separate checks are required.
>
> > > > Sure, teo takes scheduler utilization into account directly in the
> > > > governor. But for me it is not comprehensible why the CPU utilization
> > > > check is done after asking for the possible sleep length where timers
> > > > are taken into account. If the CPU is busy anyway, the information
> > > > generated by tick_nohz_next_event() is irrelevant.
> >
> > Why isn't it?
> >
> > The CPU is idle at that point and it has gone idle for a reason.
> > Surely, there was nothing to run on it at that point.
> >
> > The scheduler only knows that the CPU has been busy recently, which
> > may not imply anything on whether or not and when it is going to be
> > busy again.
>
> I went back one step and simply had a look at current upstream to
> understand the behavior when going idle under load in more detail. I
> wanted to see the distribution of idle time durations when the tick is
> stopped. I used dbench tests with a script provided by Gautham to
> generate three different loads: 100% load, 50% load and 25% load. The
> kernel is configured with HZ=250. The system has 2 sockets, 64 cores per
> socket, 2 threads each -> 256 CPUs.
> Three minutes of trace data - idle periods larger than three minutes are
> not tracked. The governor is teo.
>
> I added tracepoints to the point where the tick is stopped and where the
> tick is started again. I calculated the delta between the timestamps of
> those two trace points and had a look at the distribution:
>
> [idle duration distribution table snipped, see above]
>
> 99% of the tick stops only have an idle period shorter than 50us (50us is
> 1.25% of a tick length).

Well, this just means that the governor predicts overly long idle
durations quite often under this workload.

The governor's decision on whether or not to stop the tick is based on
its idle duration prediction. If it overshoots, that's how it goes.

> This is also the reason for my opinion, that the return of
> tick_nohz_next_event() is completely irrelevant in a (fully) loaded case:

It is an upper bound and in a fully loaded case it may be way off.

> The return is in the range of ticks, and a tick is ~100 times longer than
> the duration of the majority of idle periods.

Overall, this means that there is room for improvement in the idle
governor.

> > > > And when the CPU is not busy, then it
> > > > makes sense to ask for the sleep length also from a timer perspective.
> > > > When this CPU utilization check is implemented directly inside the
> > > > governor, every governor has to implement it on its own. So wouldn't it
> > > > make sense to implement a "how utilized is the CPU out of a scheduler
>
> > Yes, it would make sense to do that, but I thought that PELT was that
> > thing. Wasn't it?
>
> > > > perspective" in one place and use this as the first check in
> > > > tick_nohz_get_sleep_length()/tick_nohz_next_event()?
>
> [...]
>
> > > As such, calling tmigr_cpu_deactivate() on next tick _evaluation_ time
> > > instead of tick _stop_ time is always going to be problematic.
> > > [...]
>
> > FWIW, this sounds reasonable to me.
>
> The worst case scenario will not happen, because remote timer expiry only
> happens when the CPU is not active in the hierarchy. And with your
> proposal this is valid after tick_nohz_stop_tick().
>
> Nevertheless, I see some problems with this. But this also depends on
> whether there is a need to change the current idle behavior or not. Right
> now, these are my concerns:
>
> - The determinism of tick_nohz_next_event() will break:

That would be bad.

>   The return of tick_nohz_next_event() will not take into account whether
>   it is the last CPU going idle and therefore has to take care of remote
>   timers. So the first timer of the CPU (regardless of global or local)
>   has to be handed back even if it could be handled by the hierarchy.
>
> - When moving tmigr_cpu_deactivate() to tick_nohz_stop_tick() and the
>   return value of tmigr_cpu_deactivate() is before ts->next_tick, the
>   expiry has to be modified in tick_nohz_stop_tick().
>
> - The load is simply moved to a later place - tick_nohz_stop_tick() is
>   never called without a preceding tick_nohz_next_event() call. Yes,
>   tick_nohz_next_event() is called under load ~8% more often than
>   tick_nohz_stop_tick(), but the 'quality' of the return value of
>   tick_nohz_next_event() gets worse.
>
> - The timer migration hierarchy is not a standalone timer infrastructure.
>   It only makes sense to handle it in combination with the existing timer
>   wheel. When the timer base is idle, the timer migration hierarchy with
>   the migrators will do the job for global timers. So I'm not sure about
>   the impact of the changed locking - but I'm pretty sure changing that
>   increases the probability of ugly races hidden somewhere between the
>   lines.

I'll let Frederic respond to the above, but from my perspective it all
just means that the idle governors in use today are not perfect.

However, they will never be perfect, because they only have a little
time to make a decision, so it's a matter of balancing that with
precision.
On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
>
> > [idle duration distribution table snipped, see above]
> >
> > 99% of the tick stops only have an idle period shorter than 50us (50us
> > is 1.25% of a tick length).
>
> Well, this just means that the governor predicts overly long idle
> durations quite often under this workload.
>
> The governor's decision on whether or not to stop the tick is based on
> its idle duration prediction. If it overshoots, that's how it goes.

This is abysmal; IIRC TEO tracks a density function in C-state buckets
and if it finds it's more likely to be shorter than 'predicted' by the
timer, it should pick something shallower.

Given we have this density function, picking something that's <1% likely
is insane. In fact, it seems to suggest the whole pick-alternative thing
is utterly broken.

> > This is also the reason for my opinion, that the return of
> > tick_nohz_next_event() is completely irrelevant in a (fully) loaded case:
>
> It is an upper bound and in a fully loaded case it may be way off.

But given we have our density function, we should be able to do much
better.

Oooh,... I think I see the problem.
Our bins are strictly the available C-states, but if you run this on a
Zen3 that has ACPI-idle, then you end up with something that only has 3
C-states, like:

  $ for i in state*/residency ; do echo -n "${i}: "; cat $i; done
  state0/residency: 0
  state1/residency: 2
  state2/residency: 36

Which means we only have buckets: (0,0], (0,2000], (2000,36000] or
somesuch. All of them very much smaller than TICK_NSEC.

That means we don't track nearly enough data to reliably tell anything
about disabling the tick or not. We should have at least one bucket
beyond TICK_NSEC for this.

Hmm.. it is getting very late, but how about I get the cpuidle framework
to pad the drv states with a few 'disabled' C-states so that we have at
least enough data to cross the TICK_NSEC boundary and say something
usable about things.

Because as things stand, it's very likely we determine @stop_tick purely
based on what tick_nohz_get_sleep_length() tells us, not on what we've
learnt from recent history.

(FWIW intel_idle seems to not have an entry for Tigerlake !?! -- my poor
laptop, it feels neglected)
Le Tue, Jul 25, 2023 at 03:07:05PM +0200, Anna-Maria Behnsen a écrit :
> The worst case scenario will not happen, because remote timer expiry only
> happens when the CPU is not active in the hierarchy. And with your
> proposal this is valid after tick_nohz_stop_tick().
>
> Nevertheless, I see some problems with this. But this also depends if
> there is the need to change current idle behavior or not. Right now,
> these are my concerns:
>
> - The determinism of tick_nohz_next_event() will break: The return of
>   tick_nohz_next_event() will not take into account, if it is the last
>   CPU going idle and then has to take care of remote timers. So the first
>   timer of the CPU (regardless of global or local) has to be handed back
>   even if it could be handled by the hierarchy.

Bah, of course...

> - When moving the tmigr_cpu_deactivate() to tick_nohz_stop_tick() and the
>   return value of tmigr_cpu_deactivate() is before the ts->next_tick, the
>   expiry has to be modified in tick_nohz_stop_tick().
>
> - The load is simply moved to a later place - tick_nohz_stop_tick() is
>   never called without a preceding tick_nohz_next_event() call. Yes,
>   tick_nohz_next_event() is called under load ~8% more than
>   tick_nohz_stop_tick(), but the 'quality' of the return value of
>   tick_nohz_next_event() is getting worse.
>
> - timer migration hierarchy is not a standalone timer infrastructure. It
>   only makes sense to handle it in combination with the existing timer
>   wheel. When the timer base is idle, the timer migration hierarchy with
>   the migrators will do the job for global timers. So, I'm not sure about
>   the impact of the changed locking - but I'm pretty sure changing that
>   increases the probability for ugly races hidden somewhere between the
>   lines.

Sure thing, and this won't be pretty.

> Thanks,
>
> Anna-Maria
Le Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki a écrit :
> On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
>
> I'll let Frederic respond to the above, but from my perspective it all
> just means that the idle governors in use today are not perfect.
>
> However, they will never be perfect, because they only have a little
> time to make a decision, so it's a matter of balancing that with
> precision.

So, even without considering Anna-Maria's patchset, sparing the next
timer lookup if we already know we won't stop the tick would show quite
a benefit, because that lookup involves locking and quite some overhead
walking through all levels.

Thanks.
On Wed, Jul 26, 2023 at 12:59 PM Frederic Weisbecker
<frederic@kernel.org> wrote:
>
> Le Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki a écrit :
> > On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
> > I'll let Frederic respond to the above, but from my perspective it all
> > just means that the idle governors in use today are not perfect.
> >
> > However, they will never be perfect, because they only have a little
> > time to make a decision, so it's a matter of balancing that with
> > precision.
>
> So, even without considering Anna-Maria's patchset, sparing
> the next timer lookup if we already know we won't stop it would show
> quite a benefit because that lookup involves locking and quite some
> overhead walking through all levels.

You are right.

In principle, this can be done by using TICK_NSEC as the initial
estimation of the idle duration and only calling
tick_nohz_get_sleep_length() if the candidate state ends up in the
higher bins.

Interesting.
On Wed, Jul 26, 2023 at 12:29 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> > On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
> >
> > > [idle duration distribution table snipped, see above]
> > >
> > > 99% of the tick stops only have an idle period shorter than 50us
> > > (50us is 1.25% of a tick length).
> >
> > Well, this just means that the governor predicts overly long idle
> > durations quite often under this workload.
> >
> > The governor's decision on whether or not to stop the tick is based on
> > its idle duration prediction. If it overshoots, that's how it goes.
>
> This is abysmal; IIRC TEO tracks a density function in C-state buckets
> and if it finds it's more likely to be shorter than 'predicted' by the
> timer, it should pick something shallower.
>
> Given we have this density function, picking something that's <1% likely
> is insane. In fact, it seems to suggest the whole pick-alternative thing
> is utterly broken.
> > > This is also the reason for my opinion, that the return of
> > > tick_nohz_next_event() is completely irrelevant in a (fully) loaded
> > > case:
> >
> > It is an upper bound and in a fully loaded case it may be way off.
>
> But given we have our density function, we should be able to do much
> better.
>
> Oooh,... I think I see the problem. Our bins are strictly the available
> C-states, but if you run this on a Zen3 that has ACPI-idle, then you end
> up with something that only has 3 C-states, like:
>
>   $ for i in state*/residency ; do echo -n "${i}: "; cat $i; done
>   state0/residency: 0
>   state1/residency: 2
>   state2/residency: 36
>
> Which means we only have buckets: (0,0], (0,2000], (2000,36000] or
> somesuch. All of them very much smaller than TICK_NSEC.
>
> That means we don't track nearly enough data to reliably tell anything
> about disabling the tick or not. We should have at least one bucket
> beyond TICK_NSEC for this.

Quite likely.

> Hmm.. it is getting very late, but how about I get the cpuidle framework
> to pad the drv states with a few 'disabled' C-states so that we have at
> least enough data to cross the TICK_NSEC boundary and say something
> usable about things.
>
> Because as things stand, it's very likely we determine @stop_tick purely
> based on what tick_nohz_get_sleep_length() tells us, not on what we've
> learnt from recent history.
>
> (FWIW intel_idle seems to not have an entry for Tigerlake !?! -- my poor
> laptop, it feels neglected)

It should then use ACPI _CST idle states.
On Wed, Jul 26, 2023 at 5:10 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Jul 26, 2023 at 12:29 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
> > >
> > > > [idle duration distribution table snipped, see above]
> > > >
> > > > 99% of the tick stops only have an idle period shorter than 50us
> > > > (50us is 1.25% of a tick length).
> > >
> > > Well, this just means that the governor predicts overly long idle
> > > durations quite often under this workload.
> > >
> > > The governor's decision on whether or not to stop the tick is based
> > > on its idle duration prediction. If it overshoots, that's how it
> > > goes.
> >
> > This is abysmal; IIRC TEO tracks a density function in C-state buckets
> > and if it finds it's more likely to be shorter than 'predicted' by the
> > timer, it should pick something shallower.
> >
> > Given we have this density function, picking something that's <1%
> > likely is insane. In fact, it seems to suggest the whole
> > pick-alternative thing is utterly broken.
> > > > This is also the reason for my opinion, that the return of
> > > > tick_nohz_next_event() is completely irrelevant in a (fully) loaded
> > > > case:
> > >
> > > It is an upper bound and in a fully loaded case it may be way off.
> >
> > But given we have our density function, we should be able to do much
> > better.
> >
> > Oooh,... I think I see the problem. Our bins are strictly the available
> > C-states, but if you run this on a Zen3 that has ACPI-idle, then you
> > end up with something that only has 3 C-states, like:
> >
> >   $ for i in state*/residency ; do echo -n "${i}: "; cat $i; done
> >   state0/residency: 0
> >   state1/residency: 2
> >   state2/residency: 36
> >
> > Which means we only have buckets: (0,0], (0,2000], (2000,36000] or
> > somesuch. All of them very much smaller than TICK_NSEC.
> >
> > That means we don't track nearly enough data to reliably tell anything
> > about disabling the tick or not. We should have at least one bucket
> > beyond TICK_NSEC for this.
>
> Quite likely.

So the reasoning here was that those additional bins would not be
necessary for idle state selection, but the problem of whether or not to
stop the tick is kind of separate from the idle state selection problem
if the target residency values for all of the idle states are relatively
short. And so it should be addressed separately, which currently it is
not. Admittedly, this is a mistake.
On Wed, Jul 26, 2023 at 05:53:46PM +0200, Rafael J. Wysocki wrote:
> > > That means we don't track nearly enough data to reliably tell anything
> > > about disabling the tick or not. We should have at least one bucket
> > > beyond TICK_NSEC for this.
> >
> > Quite likely.
>
> So the reasoning here was that those additional bins would not be
> necessary for idle state selection, but the problem of whether or not
> to stop the tick is kind of separate from the idle state selection
> problem if the target residency values for all of the idle states are
> relatively short. And so it should be addressed separately, which
> currently it is not. Admittedly, this is a mistake.

Right, the C-state buckets are enough to pick a state, but not to handle
the tick thing.

The below hack boots on my ivb-ep with extra (disabled) states. Now let
me go hack up teo to make use of that.

	name		residency

	POLL		0
	C1		1
	C1E		80
	C3		156
	C6		300
	TICK		1000
	POST-TICK	2000

---
diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
index d9cda7f6ccb9..5f435fb8b89f 100644
--- a/drivers/cpuidle/driver.c
+++ b/drivers/cpuidle/driver.c
@@ -147,13 +147,37 @@ static void cpuidle_setup_broadcast_timer(void *arg)
 	tick_broadcast_disable();
 }
 
+static int tick_enter(struct cpuidle_device *dev,
+		      struct cpuidle_driver *drv,
+		      int index)
+{
+	return -ENODEV;
+}
+
+static void __cpuidle_state_init_tick(struct cpuidle_state *s)
+{
+	strcpy(s->name, "TICK");
+	strcpy(s->desc, "(no-op)");
+
+	s->target_residency_ns = TICK_NSEC;
+	s->target_residency = div_u64(TICK_NSEC, NSEC_PER_USEC);
+
+	s->exit_latency_ns = 0;
+	s->exit_latency = 0;
+
+	s->flags |= CPUIDLE_FLAG_UNUSABLE;
+
+	s->enter = tick_enter;
+	s->enter_s2idle = tick_enter;
+}
+
 /**
  * __cpuidle_driver_init - initialize the driver's internal data
  * @drv: a valid pointer to a struct cpuidle_driver
  */
 static void __cpuidle_driver_init(struct cpuidle_driver *drv)
 {
-	int i;
+	int tick = 0, post_tick = 0, i;
 
 	/*
 	 * Use all possible CPUs as the default, because if the kernel boots
@@ -192,6 +216,39 @@ static void __cpuidle_driver_init(struct cpuidle_driver *drv)
 			s->exit_latency_ns =  0;
 		else
 			s->exit_latency = div_u64(s->exit_latency_ns, NSEC_PER_USEC);
+
+		if (!tick && s->target_residency_ns >= TICK_NSEC) {
+			tick = 1;
+
+			if (s->target_residency_ns == TICK_NSEC)
+				continue;
+
+			post_tick = 1;
+
+			memmove(&drv->states[i+1], &drv->states[i],
+				sizeof(struct cpuidle_state) * (CPUIDLE_STATE_MAX - i - 1));
+			drv->state_count++;
+
+			__cpuidle_state_init_tick(s);
+			i++;
+		}
+	}
+
+	if (!tick) {
+		struct cpuidle_state *s = &drv->states[i];
+		__cpuidle_state_init_tick(s);
+		drv->state_count++;
+		i++;
+	}
+
+	if (!post_tick) {
+		struct cpuidle_state *s = &drv->states[i];
+		__cpuidle_state_init_tick(s);
+		strcpy(s->name, "POST-TICK");
+		s->target_residency_ns *= 2;
+		s->target_residency *= 2;
+		drv->state_count++;
+		i++;
 	}
 }
 
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 3183aeb7f5b4..a642ee9e916c 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -16,7 +16,7 @@
 #include <linux/hrtimer.h>
 #include <linux/context_tracking.h>
 
-#define CPUIDLE_STATE_MAX	10
+#define CPUIDLE_STATE_MAX	16
 #define CPUIDLE_NAME_LEN	16
 #define CPUIDLE_DESC_LEN	32
Hi,

On Wed, 26 Jul 2023, Peter Zijlstra wrote:

> On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> > On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
> >
> > > [idle duration distribution table snipped, see above]
> > >
> > > 99% of the tick stops only have an idle period shorter than 50us
> > > (50us is 1.25% of a tick length).
> >
> > Well, this just means that the governor predicts overly long idle
> > durations quite often under this workload.
> >
> > The governor's decision on whether or not to stop the tick is based on
> > its idle duration prediction. If it overshoots, that's how it goes.
>
> This is abysmal; IIRC TEO tracks a density function in C-state buckets
> and if it finds it's more likely to be shorter than 'predicted' by the
> timer, it should pick something shallower.
>
> Given we have this density function, picking something that's <1% likely
> is insane. In fact, it seems to suggest the whole pick-alternative thing
> is utterly broken.

When I tried to understand the C-states, I noticed that C-states had been
disabled on the Zen3 machine I used for testing - I'm sorry, pilot error.
So the numbers above are caused by calling tick_nohz_idle_stop_tick()
unconditionally in cpuidle_idle_call() when cpuidle_not_available() is
true. The regression Gautham observed was then caused by tons of
tick_nohz_next_event() calls, which are more expensive with the current
implementation of the timer migration hierarchy (if he tested with
C-states enabled...).

Nevertheless, I re-ran the tests on current upstream with C-states
enabled on the Zen3 machine and on a SKL-X with the governors teo, menu
and ladder and generated the following numbers (100% load):

Zen3:
                          teo                 menu               ladder
                   -----------------   -----------------  ------------------
Idle Total           2533   100.00%      5123   100.00%   1333746   100.00%
x >= 4ms             1458    57.56%      2764    53.95%      2304     0.17%
4ms > x >= 2ms         91     3.59%        95     1.85%        98     0.01%
2ms > x >= 1ms         56     2.21%        66     1.29%        57     0.00%
1ms > x >= 500us       64     2.53%        74     1.44%        61     0.00%
500us > x >= 250us     73     2.88%        39     0.76%        69     0.01%
250us > x >= 100us     76     3.00%        88     1.72%       502     0.04%
100us > x >= 50us      33     1.30%       104     2.03%      3976     0.30%
50us > x >= 25us       39     1.54%       289     5.64%     64463     4.83%
25us > x >= 10us      199     7.86%       830    16.20%   1245946    93.42%
10us > x > 5us        156     6.16%       231     4.51%      9452     0.71%
5us > x               288    11.37%       543    10.60%      6818     0.51%

tick_nohz_next_event()
total count        8839790              2113357            1363896

SKL-X:
                          teo                 menu               ladder
                   -----------------   -----------------  ------------------
Idle Total           2388   100.00%      2047   100.00%    693514   100.00%
x >= 4ms             2047    85.72%      1347    65.80%      1141     0.16%
4ms > x >= 2ms         29     1.21%        47     2.30%        18     0.00%
2ms > x >= 1ms         20     0.84%         9     0.44%        10     0.00%
1ms > x >= 500us       21     0.88%        17     0.83%        10     0.00%
500us > x >= 250us     15     0.63%        26     1.27%         9     0.00%
250us > x >= 100us     67     2.81%        39     1.91%        24     0.00%
100us > x >= 50us      18     0.75%        26     1.27%        17     0.00%
50us > x >= 25us       15     0.63%        28     1.37%      2141     0.31%
25us > x >= 10us       31     1.30%        61     2.98%    108208    15.60%
10us > x > 5us         37     1.55%       195     9.53%    242809    35.01%
5us > x                88     3.69%       252    12.31%    339127    48.90%

tick_nohz_next_event()
total count        2317973              2481724             701069

With this (and hopefully without another pilot error), I see the
following 'open points' where improvement or more thought might be good:

- Without C-states enabled, it is possible to change the cpuidle
  governors even though they have no impact on idle behavior, but at
  first glance it looks like the cpuidle governors are in use. Is this
  behavior intended?

- When there is no cpuidle driver, tick_nohz_idle_stop_tick() is called
  unconditionally - is there the possibility to make an easy check
  whether the CPU is loaded?

- The governors teo and menu do the tick_nohz_next_event() check even if
  the CPU is fully loaded, but the check is not for free.

- Timer bases are marked idle in tick_nohz_next_event() when the next
  expiry is more than a tick away. But when the tick cannot be stopped,
  because the CPU is loaded and the timer base is already marked idle, a
  remote timer enqueue before clearing the timer base idle information
  will lead to an IPI, which is also expensive. It might be worth a try
  to do only a (maybe leaner) check for the next timer in
  tick_nohz_next_event() and do the actual idle dance in
  tick_nohz_stop_tick(). When a timer is enqueued remotely between the
  tick_nohz_next_event() and tick_nohz_stop_tick() calls, there is no
  need for an IPI - the CPU might simply be prevented from stopping the
  tick. This is also the case at the moment and is only solved by an IPI
  after the tick is already stopped. With regard to the timer migration
  hierarchy, there might be the possibility to also do a quick check in
  tick_nohz_next_event() and do the final tmigr_cpu_deactivate() thing
  when stopping the tick and marking the timer bases idle. So no lock
  ordering change would be required here...

- Side note: When testing 'ladder' on the SKL-X machine there was a
  strange pattern: all CPUs on the second socket stopped the tick quite
  often (~12000 times) and all of the idle durations were below 50us.
  All CPUs on the first socket stopped the tick at most ~100 times and
  most of the idle durations were longer than 4ms (HZ=250).

Thanks,

	Anna-Maria
On Wed, Jul 26, 2023 at 06:14:32PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 26, 2023 at 05:53:46PM +0200, Rafael J. Wysocki wrote:
>
> > > > That means we don't track nearly enough data to reliably tell
> > > > anything about disabling the tick or not. We should have at least
> > > > one bucket beyond TICK_NSEC for this.
> > >
> > > Quite likely.
> >
> > So the reasoning here was that those additional bins would not be
> > necessary for idle state selection, but the problem of whether or not
> > to stop the tick is kind of separate from the idle state selection
> > problem if the target residency values for all of the idle states are
> > relatively short. And so it should be addressed separately, which
> > currently it is not. Admittedly, this is a mistake.
>
> Right, the C-state buckets are enough to pick a state, but not to handle
> the tick thing.
>
> The below hack boots on my ivb-ep with extra (disabled) states. Now let
> me go hack up teo to make use of that.
>
>	name		residency
>
>	POLL		0
>	C1		1
>	C1E		80
>	C3		156
>	C6		300
>	TICK		1000
>	POST-TICK	2000

builds and boots, further untested -- I need to see to dinner.

The idea is that teo_update() should account to the highest state if
measured_ns is 'large'. Then, when the tick is on, see if the majority
(50%) of the hits+intercepts are below the TICK threshold; if so, don't
stop the tick and assume duration_ns = TICK_NSEC -- iow. don't call
tick_nohz_get_sleep_length().

I'll check if the below code actually does as intended once the loonies
are in bed.
---
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index 987fc5f3997d..362470c8c273 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -197,7 +197,6 @@ struct teo_cpu {
 	int next_recent_idx;
 	int recent_idx[NR_RECENT];
 	unsigned long util_threshold;
-	bool utilized;
 };

 static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
@@ -227,7 +226,7 @@ static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data)
 static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 {
 	struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
-	int i, idx_timer = 0, idx_duration = 0;
+	int i, idx_timer = 0, idx_duration = drv->state_count-1;
 	u64 measured_ns;

 	if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns) {
@@ -362,11 +361,12 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	unsigned int recent_sum = 0;
 	unsigned int idx_hit_sum = 0;
 	unsigned int hit_sum = 0;
+	unsigned int tick_sum = 0;
 	int constraint_idx = 0;
 	int idx0 = 0, idx = -1;
 	bool alt_intercepts, alt_recent;
 	ktime_t delta_tick;
-	s64 duration_ns;
+	s64 duration_ns = TICK_NSEC;
 	int i;

 	if (dev->last_state_idx >= 0) {
@@ -376,36 +376,26 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,

 	cpu_data->time_span_ns = local_clock();

-	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
-	cpu_data->sleep_length_ns = duration_ns;
+	if (!tick_nohz_tick_stopped()) {
+		for (i = 1; i < drv->state_count; i++) {
+			struct teo_bin *prev_bin = &cpu_data->state_bins[i-1];
+			struct cpuidle_state *s = &drv->states[i];

-	/* Check if there is any choice in the first place. */
-	if (drv->state_count < 2) {
-		idx = 0;
-		goto end;
-	}
-	if (!dev->states_usage[0].disable) {
-		idx = 0;
-		if (drv->states[1].target_residency_ns > duration_ns)
-			goto end;
-	}
+			tick_sum += prev_bin->intercepts;
+			tick_sum += prev_bin->hits;

-	cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data);
-	/*
-	 * If the CPU is being utilized over the threshold and there are only 2
-	 * states to choose from, the metrics need not be considered, so choose
-	 * the shallowest non-polling state and exit.
-	 */
-	if (drv->state_count < 3 && cpu_data->utilized) {
-		for (i = 0; i < drv->state_count; ++i) {
-			if (!dev->states_usage[i].disable &&
-			    !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) {
-				idx = i;
-				goto end;
-			}
+			if (s->target_residency_ns >= TICK_NSEC)
+				break;
 		}
+
+		if (2*tick_sum > cpu_data->total)
+			*stop_tick = false;
 	}

+	if (*stop_tick)
+		duration_ns = tick_nohz_get_sleep_length(&delta_tick);
+	cpu_data->sleep_length_ns = duration_ns;
+
 	/*
 	 * Find the deepest idle state whose target residency does not exceed
 	 * the current sleep length and the deepest idle state not deeper than
@@ -541,7 +531,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	 * If the CPU is being utilized over the threshold, choose a shallower
 	 * non-polling state to improve latency
 	 */
-	if (cpu_data->utilized)
+	if (teo_cpu_is_utilized(dev->cpu, cpu_data))
 		idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true);

 end:
@@ -549,8 +539,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	 * Don't stop the tick if the selected state is a polling one or if the
 	 * expected idle duration is shorter than the tick period length.
 	 */
-	if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
-	     duration_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
+	if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) || !*stop_tick) {
 		*stop_tick = false;

 		/*
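The majority check that the patch adds to teo_select() can be sketched as a
stand-alone function. The bin layout, counts, and threshold below are
illustrative stand-ins, not teo's real data structures:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for teo's per-state statistics bins. */
struct bin {
	unsigned int intercepts;
	unsigned int hits;
};

/*
 * Return true when at least half of all recorded wakeups landed in bins
 * whose state target residency is below the tick period -- in that case
 * it is not worth stopping the tick (mirrors the 2*tick_sum > total test).
 */
static bool keep_tick(const struct bin *bins,
		      const unsigned long long *residency_ns,
		      int nbins, unsigned long long tick_ns,
		      unsigned int total)
{
	unsigned int tick_sum = 0;
	int i;

	for (i = 0; i < nbins; i++) {
		if (residency_ns[i] >= tick_ns)
			break;	/* bins at or beyond the tick don't count */
		tick_sum += bins[i].intercepts + bins[i].hits;
	}
	return 2 * tick_sum > total;	/* strict majority below the tick */
}
```

With 60 of 70 wakeups recorded below the tick cutoff the function returns
true (keep the tick running); with only 10 of 70 it returns false.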
On Wed, Jul 26, 2023 at 6:40 PM Anna-Maria Behnsen
<anna-maria@linutronix.de> wrote:
>
> Hi,
>
> On Wed, 26 Jul 2023, Peter Zijlstra wrote:
>
> > On Tue, Jul 25, 2023 at 04:27:56PM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Jul 25, 2023 at 3:07 PM Anna-Maria Behnsen
> > >
> > > >                      100% load         50% load          25% load
> > > >                      (top: ~2% idle)   (top: ~49% idle)  (top: ~74% idle;
> > > >                                                          33 CPUs are completely idle)
> > > >                      ---------------   ----------------  ----------------------------
> > > > Idle Total           1658703 100%      3150522 100%      2377035 100%
> > > > x >= 4ms                2504 0.15%           2 0.00%          53 0.00%
> > > > 4ms > x >= 2ms           390 0.02%           0 0.00%        4563 0.19%
> > > > 2ms > x >= 1ms            62 0.00%           1 0.00%          54 0.00%
> > > > 1ms > x >= 500us          67 0.00%           6 0.00%           2 0.00%
> > > > 500us > x >= 250us        93 0.01%          39 0.00%          11 0.00%
> > > > 250us > x >= 100us       280 0.02%        1145 0.04%         633 0.03%
> > > > 100us > x >= 50us        942 0.06%       30722 0.98%       13347 0.56%
> > > > 50us > x >= 25us       26728 1.61%      310932 9.87%      106083 4.46%
> > > > 25us > x >= 10us      825920 49.79%    2320683 73.66%    1722505 72.46%
> > > > 10us > x > 5us        795197 47.94%     442991 14.06%     506008 21.29%
> > > > 5us > x                 6520 0.39%       43994 1.40%       23645 0.99%
> > > >
> > > >
> > > > 99% of the tick stops only have an idle period shorter than 50us (50us is
> > > > 1.25% of a tick length).
> > >
> > > Well, this just means that the governor predicts overly long idle
> > > durations quite often under this workload.
> > >
> > > The governor's decision on whether or not to stop the tick is based on
> > > its idle duration prediction. If it overshoots, that's how it goes.
> >
> > This is abysmal; IIRC TEO tracks a density function in C state buckets
> > and if it finds it's more likely to be shorter than 'predicted' by the
> > timer it should pick something shallower.
> >
> > Given we have this density function, picking something that's <1% likely
> > is insane. In fact, it seems to suggest the whole pick-alternative thing
> > is utterly broken.
> >
>
> When I tried to understand the cstates, I noticed cstates have been
> disabled on the zen3 machine I used for testing - I'm sorry, pilot error.
>
> So the numbers above are caused by calling tick_nohz_idle_stop_tick()
> unconditionally in cpuidle_idle_call() when cpuidle_not_available() is
> true.
>
> The regression Gautham observed was then caused by tons of
> tick_nohz_next_event() calls, which are more expensive with the current
> implementation of the timer migration hierarchy (if he tested with cstates
> enabled...).
>
> Nevertheless, I reran the tests on current upstream with cstates enabled on
> the zen3 machine and on a SKL-X with the governors teo, menu and ladder, and
> generated the following numbers (100% load):
>
> Zen3:
>                      teo                 menu                ladder
>                      ------------------  ------------------  ------------------
> Idle Total              2533 100.00%        5123 100.00%    1333746 100.00%
> x >= 4ms                1458  57.56%        2764  53.95%       2304   0.17%
> 4ms > x >= 2ms            91   3.59%          95   1.85%         98   0.01%
> 2ms > x >= 1ms            56   2.21%          66   1.29%         57   0.00%
> 1ms > x >= 500us          64   2.53%          74   1.44%         61   0.00%
> 500us > x >= 250us        73   2.88%          39   0.76%         69   0.01%
> 250us > x >= 100us        76   3.00%          88   1.72%        502   0.04%
> 100us > x >= 50us         33   1.30%         104   2.03%       3976   0.30%
> 50us > x >= 25us          39   1.54%         289   5.64%      64463   4.83%
> 25us > x >= 10us         199   7.86%         830  16.20%    1245946  93.42%
> 10us > x > 5us           156   6.16%         231   4.51%       9452   0.71%
> 5us > x                  288  11.37%         543  10.60%       6818   0.51%
>
> tick_nohz_next_event()
> total count          8839790             2113357             1363896
>
>
> SKL-X:
>                      teo                 menu                ladder
>                      ------------------  ------------------  ------------------
> Idle Total              2388 100.00%        2047 100.00%     693514 100.00%
> x >= 4ms                2047  85.72%        1347  65.80%       1141   0.16%
> 4ms > x >= 2ms            29   1.21%          47   2.30%         18   0.00%
> 2ms > x >= 1ms            20   0.84%           9   0.44%         10   0.00%
> 1ms > x >= 500us          21   0.88%          17   0.83%         10   0.00%
> 500us > x >= 250us        15   0.63%          26   1.27%          9   0.00%
> 250us > x >= 100us        67   2.81%          39   1.91%         24   0.00%
> 100us > x >= 50us         18   0.75%          26   1.27%         17   0.00%
> 50us > x >= 25us          15   0.63%          28   1.37%       2141   0.31%
> 25us > x >= 10us          31   1.30%          61   2.98%     108208  15.60%
> 10us > x > 5us            37   1.55%         195   9.53%     242809  35.01%
> 5us > x                   88   3.69%         252  12.31%     339127  48.90%
>
> tick_nohz_next_event()
> total count          2317973             2481724              701069

This looks better indeed.

What's the HZ value?

> With this (and hopefully without another pilot error), I see the following
> 'open points' where improvement or more thoughts might be good:
>
> - Without cstates enabled, it is possible to change the cpuidle governors
>   even if they do not have an impact on idle behavior, but at first
>   glance it looks like cpuidle governors are used. Is this behavior
>   intended?

Not really intended and it can be improved.

I guess it's this way, because there were no complaints. :-)

> - When there is no cpuidle driver, tick_nohz_idle_stop_tick() is called
>   unconditionally - is there the possibility to make an easy check whether
>   the CPU is loaded? PELT could be asked like in the teo governor.

Or it may be a better idea to simply never stop the tick in that case
(the default idle state is going to be shallow anyway, so running the
tick shouldn't matter from the energy POV).

> - The governors teo and menu do the tick_nohz_next_event() check even if
>   the CPU is fully loaded, but the check is not for free.

Let me have a look at teo in that respect.

The problem is when tick_nohz_get_sleep_length() should not be called.
The easy case is when the governor would select the shallowest idle
state without taking it into account, but what about the deeper ones?
I guess this depends on the exit latency of the current candidate idle
state, but what exit latency would be low enough? I guess 2 us would
be fine, but what about 10 us, or even 20 us for that matter?

> - timer bases are marked idle in tick_nohz_next_event() when the next
>   expiry is more than a tick away. But when the tick can not be stopped,
>   because the CPU is loaded and the timer base is already marked idle, a
>   remote timer enqueue before clearing the timer base idle information will
>   lead to an IPI, which is also expensive.
>
>   It might be worth a try to do only a (maybe leaner) check for the next
>   timer in tick_nohz_next_event() and do the actual idle dance in
>   tick_nohz_stop_tick(). When a timer is enqueued remotely between the
>   tick_nohz_next_event() and tick_nohz_stop_tick() calls, there is no need
>   for an IPI - the CPU might be prevented from stopping the tick. This is
>   also the case at the moment and only solved by an IPI after the tick is
>   already stopped.
>
>   With regard to the timer migration hierarchy, there might be the
>   possibility to also do a quick check in tick_nohz_next_event() and do the
>   final tmigr_cpu_deactivate() thing when stopping the tick and marking the
>   timer bases idle. So no lock ordering change would be required here...
>
> - Side note: When testing 'ladder' on the SKL-X machine there was a strange
>   pattern: All CPUs on the second socket stopped the tick quite often
>   (~12000) and all of the idle durations were below 50us. All CPUs on the
>   first socket stopped the tick max ~100 times and most of the idle
>   durations were longer than 4ms (HZ=250).

I don't really have a quick explanation for that. It's probably due to
the work distribution pattern.

Not that I think that ladder is interesting any more, TBH.
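The idle-duration histograms in this thread bucket each observed sleep length
against a fixed set of thresholds. A minimal sketch of that bucketing (the
thresholds are copied from the tables above; the function and its layout are
illustrative, not the actual measurement code):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bucket thresholds in microseconds, matching the table rows above:
 * x >= 4ms, 4ms > x >= 2ms, ..., 10us > x > 5us, and a catch-all
 * for anything shorter than 5us.
 */
static const unsigned long long thresh_us[] = {
	4000, 2000, 1000, 500, 250, 100, 50, 25, 10, 5
};
#define NBUCKETS (sizeof(thresh_us) / sizeof(thresh_us[0]) + 1)

/* Return the histogram row index for one observed idle duration. */
static size_t idle_bucket(unsigned long long dur_us)
{
	size_t i;

	for (i = 0; i < NBUCKETS - 1; i++) {
		if (dur_us >= thresh_us[i])
			return i;
	}
	return NBUCKETS - 1;	/* shorter than 5us */
}
```

Counting one increment per wakeup into `counts[idle_bucket(dur)]` and dividing
by the total reproduces the percentage columns shown in the tables.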
On Wed, Jul 26, 2023 at 08:30:01PM +0200, Rafael J. Wysocki wrote:

> > - The governors teo and menu do the tick_nohz_next_event() check even if
> >   the CPU is fully loaded, but the check is not for free.
>
> Let me have a look at teo in that respect.
>
> The problem is when tick_nohz_get_sleep_length() should not be called.
> The easy case is when the governor would select the shallowest idle
> state without taking it into account, but what about the deeper ones?
> I guess this depends on the exit latency of the current candidate idle
> state, but what exit latency would be low enough? I guess 2 us would
> be fine, but what about 10 us, or even 20 us for that matter?

The patch I sent here:

  https://lkml.kernel.org/r/20230726164958.GV38236@hirez.programming.kicks-ass.net

(which was stuck in a mailqueue :/) tries to address that.

Additionally, I think we can do something like this on top of all that:
stop going deeper when 66% of wakeups is at or below the current state.

--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -362,6 +362,7 @@ static int teo_select(struct cpuidle_dri
 	unsigned int idx_hit_sum = 0;
 	unsigned int hit_sum = 0;
 	unsigned int tick_sum = 0;
+	unsigned int thresh_sum = 0;
 	int constraint_idx = 0;
 	int idx0 = 0, idx = -1;
 	bool alt_intercepts, alt_recent;
@@ -396,6 +397,8 @@ static int teo_select(struct cpuidle_dri
 	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
 	cpu_data->sleep_length_ns = duration_ns;

+	thresh_sum = 2 * cpu_data->total / 3; /* 66% */
+
 	/*
 	 * Find the deepest idle state whose target residency does not exceed
 	 * the current sleep length and the deepest idle state not deeper than
@@ -426,6 +429,9 @@ static int teo_select(struct cpuidle_dri
 		if (s->target_residency_ns > duration_ns)
 			break;

+		if (intercept_sum + hit_sum > thresh_sum)
+			break;
+
 		idx = i;

 		if (s->exit_latency_ns <= latency_req)
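The 66% cutoff can be sketched outside the kernel as a small selection loop.
The per-state wakeup counts and the function shape are illustrative; teo's
real loop also checks residency, latency, and disabled states:

```c
#include <assert.h>

/*
 * Pick the deepest state index, but stop going deeper once more than
 * ~66% of past wakeups already fell at states shallower than the
 * candidate (mirrors the thresh_sum break in the patch above).
 */
static int pick_state(const unsigned int *wakeups, int nstates,
		      unsigned int total)
{
	unsigned int thresh_sum = 2 * total / 3;	/* 66% cutoff */
	unsigned int sum = 0;	/* wakeups seen below state i */
	int idx = 0;
	int i;

	for (i = 0; i < nstates; i++) {
		if (sum > thresh_sum)
			break;	/* most wakeups were shallower; stop here */
		idx = i;
		sum += wakeups[i];
	}
	return idx;
}
```

With counts {10, 1, 1} the shallow state dominates and index 0 is picked;
with {1, 1, 10} the loop is free to descend to the deepest state.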
On Wed, Jul 26, 2023 at 06:49:58PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 26, 2023 at 06:14:32PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 26, 2023 at 05:53:46PM +0200, Rafael J. Wysocki wrote:
> >
> > > > > That means we don't track nearly enough data to reliably tell anything
> > > > > about disabling the tick or not. We should have at least one bucket
> > > > > beyond TICK_NSEC for this.
> > > >
> > > > Quite likely.
> > >
> > > So the reasoning here was that those additional bins would not be
> > > necessary for idle state selection, but the problem of whether or not
> > > to stop the tick is kind of separate from the idle state selection
> > > problem if the target residency values for all of the idle states are
> > > relatively short. And so it should be addressed separately, which
> > > currently it is not. Admittedly, this is a mistake.
> >
> > Right, the C state buckets are enough to pick a state, but not to handle
> > the tick thing.
> >
> > The below hack boots on my ivb-ep with extra (disabled) states. Now let
> > me go hack up teo to make use of that.
> >
> >   name       residency
> >
> >   POLL       0
> >   C1         1
> >   C1E        80
> >   C3         156
> >   C6         300
> >   TICK       1000
> >   POST-TICK  2000
> >
> > builds and boots, further untested -- I need to see to dinner.
>
> The idea is that teo_update() should account to the highest state if
> measured_ns is 'large'.
>
> Then, when the tick is on, see if the majority (50%) of the
> hit+intercepts are below the TICK threshold; if so, don't stop the tick
> and assume duration_ns = TICK_NSEC -- iow. don't call
> tick_nohz_get_sleep_length().
>
> I'll check if the below code actually does as intended once the loonies
> are in bed.

D'oh, it suffers the 'obvious' problem. Since the tick is not disabled,
we'll never log sleeps longer than the tick and hence never decide to
disable the tick. Clever of me...

This needs a wee bit of refining it does :-)
On Wed, Jul 26, 2023 at 06:14:32PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 26, 2023 at 05:53:46PM +0200, Rafael J. Wysocki wrote:
>
> > > > That means we don't track nearly enough data to reliably tell anything
> > > > about disabling the tick or not. We should have at least one bucket
> > > > beyond TICK_NSEC for this.
> > >
> > > Quite likely.
> >
> > So the reasoning here was that those additional bins would not be
> > necessary for idle state selection, but the problem of whether or not
> > to stop the tick is kind of separate from the idle state selection
> > problem if the target residency values for all of the idle states are
> > relatively short. And so it should be addressed separately, which
> > currently it is not. Admittedly, this is a mistake.
>
> Right, the C state buckets are enough to pick a state, but not to handle
> the tick thing.
>
> The below hack boots on my ivb-ep with extra (disabled) states. Now let
> me go hack up teo to make use of that.
>
>   name       residency
>
>   POLL       0
>   C1         1
>   C1E        80
>   C3         156
>   C6         300
>   TICK       1000
>   POST-TICK  2000
>

Ah, so last night (or rather, somewhat early today) I realized I had the
buckets wrong.

We don't have buckets to the left, but buckets to the right, so the
above would give:

  0: [0,1)
  1: [1,80)
  2: [80,156)
  3: [156,300)
  4: [300,1000)
  5: [1000,2000)
  6: [2000,...)

Which also means I can ditch the whole POST-TICK bucket. Let me get
breakfast and try all this again.
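The right-open bucket layout can be expressed as a small lookup. The residency
values are taken from the table in this thread (treated as microseconds here
purely for readability; the real code works in nanoseconds):

```c
#include <assert.h>
#include <stddef.h>

/*
 * State target residencies, from the table above:
 * POLL, C1, C1E, C3, C6, TICK, POST-TICK.
 */
static const unsigned long long residency_us[] = {
	0, 1, 80, 156, 300, 1000, 2000
};
#define NSTATES (sizeof(residency_us) / sizeof(residency_us[0]))

/*
 * Map a measured idle duration to its bucket: bucket i covers
 * [residency[i], residency[i+1]); the last bucket is open-ended.
 */
static size_t duration_bucket(unsigned long long dur_us)
{
	size_t i;

	for (i = 0; i < NSTATES - 1; i++) {
		if (dur_us < residency_us[i + 1])
			return i;
	}
	return NSTATES - 1;	/* [2000,...) */
}
```

For instance a 90us sleep lands in bucket 2 ([80,156)) and exactly 1000us
lands in bucket 5 ([1000,2000)), matching the right-open intervals listed
above.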
On Thu, Jul 27, 2023 at 9:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jul 26, 2023 at 06:14:32PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 26, 2023 at 05:53:46PM +0200, Rafael J. Wysocki wrote:
> >
> > > > > That means we don't track nearly enough data to reliably tell anything
> > > > > about disabling the tick or not. We should have at least one bucket
> > > > > beyond TICK_NSEC for this.
> > > >
> > > > Quite likely.
> > >
> > > So the reasoning here was that those additional bins would not be
> > > necessary for idle state selection, but the problem of whether or not
> > > to stop the tick is kind of separate from the idle state selection
> > > problem if the target residency values for all of the idle states are
> > > relatively short. And so it should be addressed separately, which
> > > currently it is not. Admittedly, this is a mistake.
> >
> > Right, the C state buckets are enough to pick a state, but not to handle
> > the tick thing.
> >
> > The below hack boots on my ivb-ep with extra (disabled) states. Now let
> > me go hack up teo to make use of that.
> >
> >   name       residency
> >
> >   POLL       0
> >   C1         1
> >   C1E        80
> >   C3         156
> >   C6         300
> >   TICK       1000
> >   POST-TICK  2000
> >
>
> Ah, so last night (or rather, somewhat early today) I realized I had the
> buckets wrong.
>
> We don't have buckets to the left, but buckets to the right, so the
> above would give:
>
>   0: [0,1)
>   1: [1,80)
>   2: [80,156)
>   3: [156,300)
>   4: [300,1000)
>   5: [1000,2000)
>   6: [2000,...)
>
> Which also means I can ditch the whole POST-TICK bucket. Let me get
> breakfast and try all this again.

Note that this is about the cases between the target residency of the
deepest enabled idle state and the tick period, and from Anna-Maria's
results it looks like that's about 0.1% of the total number.
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2307,6 +2307,7 @@ static inline bool owner_on_cpu(struct t

 /* Returns effective CPU energy utilization, as seen by the scheduler */
 unsigned long sched_cpu_util(int cpu);
+u64 sched_cpu_really_avg_idle(int cpu);

 #endif /* CONFIG_SMP */

 #ifdef CONFIG_RSEQ
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3753,6 +3753,7 @@ ttwu_do_activate(struct rq *rq, struct t
 		u64 max = 2*rq->max_idle_balance_cost;

 		update_avg(&rq->avg_idle, delta);
+		update_avg(&rq->really_avg_idle, delta);

 		if (rq->avg_idle > max)
 			rq->avg_idle = max;
@@ -7455,6 +7456,12 @@ unsigned long sched_cpu_util(int cpu)
 {
 	return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
 }
+
+u64 sched_cpu_really_avg_idle(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	return rq->really_avg_idle;
+}
 #endif /* CONFIG_SMP */

 /**
@@ -9988,6 +9995,7 @@ void __init sched_init(void)
 		rq->online = 0;
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
+		rq->really_avg_idle = 2*sysctl_sched_migration_cost;
 		rq->wake_stamp = jiffies;
 		rq->wake_avg_idle = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1073,6 +1073,7 @@ struct rq {
 #endif
 	u64			idle_stamp;
 	u64			avg_idle;
+	u64			really_avg_idle;

 	unsigned long		wake_stamp;
 	u64			wake_avg_idle;
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -800,7 +800,7 @@ static inline bool local_timer_softirq_p
 static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
 {
-	u64 basemono, next_tick, delta, expires;
+	u64 basemono, next_tick, delta, expires, sched_avg_idle;
 	unsigned long basejiff;
 	unsigned int seq;

@@ -823,8 +823,11 @@ static ktime_t tick_nohz_next_event(stru
 	 * minimal delta which brings us back to this place
 	 * immediately. Lather, rinse and repeat...
 	 */
-	if (rcu_needs_cpu() || arch_needs_cpu() ||
-	    irq_work_needs_cpu() || local_timer_softirq_pending()) {
+	sched_avg_idle = sched_cpu_really_avg_idle(cpu);
+	if (sched_avg_idle <= (u64)TICK_NSEC) {
+		next_tick = basemono + sched_avg_idle;
+	} else if (rcu_needs_cpu() || arch_needs_cpu() ||
+		   irq_work_needs_cpu() || local_timer_softirq_pending()) {
 		next_tick = basemono + TICK_NSEC;
 	} else {
 		/*
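The patch above keeps two running averages: rq->avg_idle, which stays clamped
to 2*max_idle_balance_cost for load-balance decisions, and the new unclamped
really_avg_idle, so tick_nohz_next_event() can see genuinely long idle
periods. A user-space sketch of that averaging (the 1/8 weight is what
mainline's update_avg() uses for rq->avg_idle; treat the rest as an
approximation):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Running average in the style of the scheduler's update_avg():
 * move 1/8th of the way toward each new sample.
 */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += diff / 8;
}
```

Because the step is proportional to the distance from the sample, a burst of
short idle periods quickly drags the average below TICK_NSEC, which is
exactly the condition the tick-sched hunk uses to skip the expensive
next-timer search.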