Message ID: 20221104145737.71236-1-anna-maria@linutronix.de
Headers:
From: Anna-Maria Behnsen <anna-maria@linutronix.de>
To: linux-kernel@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>, John Stultz <jstultz@google.com>, Thomas Gleixner <tglx@linutronix.de>, Eric Dumazet <edumazet@google.com>, "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>, Arjan van de Ven <arjan@infradead.org>, "Paul E. McKenney" <paulmck@kernel.org>, Frederic Weisbecker <fweisbec@gmail.com>, Rik van Riel <riel@surriel.com>, Anna-Maria Behnsen <anna-maria@linutronix.de>
Subject: [PATCH v4 00/16] timer: Move from a push remote at enqueue to a pull at expiry model
Date: Fri, 4 Nov 2022 15:57:21 +0100
Message-Id: <20221104145737.71236-1-anna-maria@linutronix.de>
Series: timer: Move from a push remote at enqueue to a pull at expiry model
Message
Anna-Maria Behnsen
Nov. 4, 2022, 2:57 p.m. UTC
Placing timers at enqueue time on a target CPU based on dubious heuristics does not make any sense:

1) Most timer wheel timers are canceled or rearmed before they expire.

2) The heuristics to predict which CPU will be busy when the timer expires are wrong by definition.

So placing the timers at enqueue wastes precious cycles.

The proper solution to this problem is to always queue the timers on the local CPU and allow the non-pinned timers to be pulled onto a busy CPU at expiry time.

Therefore split the timer storage into local pinned and global timers: Local pinned timers are always expired on the CPU on which they have been queued. Global timers can be expired on any CPU.

As long as a CPU is busy it expires both local and global timers. When a CPU goes idle it arms for the first expiring local timer. If the first expiring pinned (local) timer is before the first expiring movable timer, then no action is required because the CPU will wake up before the first movable timer expires. If the first expiring movable timer is before the first expiring pinned (local) timer, then this timer is queued into an idle timerqueue and eventually expired by some other active CPU.

To avoid global locking the timerqueues are implemented as a hierarchy. The lowest level of the hierarchy holds the CPUs. The CPUs are associated in groups of 8, which are separated per node. If more than one CPU group exists, then a second level in the hierarchy collects the groups. Depending on the size of the system more than two levels are required. Each group has a "migrator" which checks the timerqueue during the tick for remotely expirable timers.

If the last CPU in a group goes idle it reports the first expiring event in the group up to the next group(s) in the hierarchy. If the last CPU in the system goes idle it arms its timer for the first system-wide expiring timer to ensure that no timer event is missed.
Testing
~~~~~~~

The impact of wasting cycles during enqueue by using the heuristic, in contrast to always queueing the timer on the local CPU, was measured with a micro benchmark: a timer is enqueued and dequeued in a loop with 1000 repetitions on an isolated CPU and the time the loop takes is measured. A quarter of the remaining CPUs was kept busy. This measurement was repeated several times. With the patch queue the average duration was reduced by approximately 25%.

	145ns	plain v6
	109ns	v6 with patch queue

Furthermore the impact on residency in deep idle states of an idle system was investigated. The patch queue does not degrade this behavior.

During testing on a mostly idle machine a ping pong game could be observed: a process_timeout timer is expired remotely on a non-idle CPU. Then the CPU where schedule_timeout() was executed comes out of idle, restarts the timer using schedule_timeout() and goes back to idle again. This is due to the fair scheduler, which tries to keep the task on the CPU it previously executed on.

Next Steps
~~~~~~~~~~

Simple deferrable timers are no longer required as they can be converted to global timers. If a CPU goes idle, a formerly deferrable timer will not prevent the CPU from sleeping as long as possible. Only the last migrator CPU has to take care of them. Deferrable timers with the timer pinned flag need to be expired on the specified CPU but must not prevent the CPU from going idle. They require their own timer base which is never taken into account when calculating the next expiry time. This conversion and the required cleanup will be done in a follow-up series.
v3..v4:
  - Address review feedback of Frederic Weisbecker
  - Address kernel test robot fallout
  - Move patch 16 "add_timer_on(): Make sure callers have TIMER_PINNED flag"
    to the beginning of the queue to prevent timers from ending up in the
    global timer base when they were queued using add_timer_on()
  - Fix some comments and typos

v2..v3: https://lore.kernel.org/r/20170418111102.490432548@linutronix.de/
  - Minimize usage of locks by storing data using atomic_cmpxchg() for
    migrator information and information about active cpus.

Thanks,

	Anna-Maria

Anna-Maria Behnsen (13):
  tick-sched: Warn when next tick seems to be in the past
  timer: Move store of next event into __next_timer_interrupt()
  timer: Split next timer interrupt logic
  add_timer_on(): Make sure callers have TIMER_PINNED flag
  timer: Keep the pinned timers separate from the others
  timer: Retrieve next expiry of pinned/non-pinned timers seperately
  timer: Rename get_next_timer_interrupt()
  timer: Split out "get next timer interrupt" functionality
  timer: Add get next timer interrupt functionality for remote CPUs
  timer: Check if timers base is handled already
  timer: Implement the hierarchical pull model
  timer_migration: Add tracepoints
  timer: Always queue timers on the local CPU

Richard Cochran (linutronix GmbH) (2):
  timer: Restructure internal locking
  tick/sched: Split out jiffies update helper function

Thomas Gleixner (1):
  timer: Rework idle logic

 arch/x86/kernel/tsc_sync.c             |    3 +-
 drivers/char/random.c                  |    2 +-
 include/linux/cpuhotplug.h             |    1 +
 include/linux/timer.h                  |    5 +-
 include/trace/events/timer_migration.h |  277 ++++++
 kernel/time/Makefile                   |    3 +
 kernel/time/clocksource.c              |    2 +-
 kernel/time/tick-internal.h            |   12 +-
 kernel/time/tick-sched.c               |   50 +-
 kernel/time/timer.c                    |  372 +++++--
 kernel/time/timer_migration.c          | 1263 ++++++++++++++++++++++++
 kernel/time/timer_migration.h          |  123 +++
 kernel/workqueue.c                     |    7 +-
 13 files changed, 2011 insertions(+), 109 deletions(-)
 create mode 100644 include/trace/events/timer_migration.h
 create mode 100644 kernel/time/timer_migration.c
 create mode 100644 kernel/time/timer_migration.h
Comments
Hi Anna-Maria,

On Fri, Nov 04, 2022 at 03:57:21PM +0100, Anna-Maria Behnsen wrote:

[...]

> Next Steps
> ~~~~~~~~~~
>
> Simple deferrable timers are no longer required as they can be converted to global timers. If a CPU goes idle, a formerly deferrable timer will not prevent the CPU from sleeping as long as possible. Only the last migrator CPU has to take care of them. Deferrable timers with the timer pinned flag need to be expired on the specified CPU but must not prevent the CPU from going idle. They require their own timer base which is never taken into account when calculating the next expiry time. This conversion and the required cleanup will be done in a follow-up series.

Taking the non-pinned deferrable timers case, they are queued on their own base and their expiry is not taken into account while programming the next timer event during idle.

Can you elaborate on the "Simple deferrable timers are no longer required as they can be converted to global timers" statement? Though they can be on the global base, we still need to find a way to distinguish them from the normal global timers so that the last migrator can program the next timer event without taking these deferrable timer expiries into account? IOW, a deferrable timer should not bring a completely idle system out of idle to serve the deferrable timer.

When the deferrable timers are queued on the global base, once a CPU comes out of idle and serves the timers on the global base, the deferrable timers would also be served. This is a welcome change. We would see a truly deferrable global timer - something we would be interested in. [1] has some background on this.

[1] https://lore.kernel.org/lkml/1430188744-24737-1-git-send-email-joonwoop@codeaurora.org/

Thanks,
Pavan
On Tue, 8 Nov 2022, Pavan Kondeti wrote:
> Hi Anna-Maria,
>
> On Fri, Nov 04, 2022 at 03:57:21PM +0100, Anna-Maria Behnsen wrote:
> > Next Steps
> > ~~~~~~~~~~
> >
> > Simple deferrable timers are no longer required as they can be converted to global timers. [...] This conversion and the required cleanup will be done in a follow-up series.
>
> Taking the non-pinned deferrable timers case, they are queued on their own base and their expiry is not taken into account while programming the next timer event during idle.

If the CPU is not the last CPU going idle, then yes.

> Can you elaborate on the "Simple deferrable timers are no longer required as they can be converted to global timers" statement?

Global timers do not prevent a CPU from going idle - the same thing deferrable timers do right now. Global timers are queued into the hierarchy and the migrator takes care of expiry when a CPU goes idle. The main change of behavior with global timers compared to deferrable timers is that they will expire on time and not necessarily on the CPU they were enqueued on. Deferrable timers were only expired when the CPU was awake, and always on the CPU they had been enqueued on.

> Though they can be on the global base, we still need to find a way to distinguish them from the normal global timers so that the last migrator can program the next timer event without taking these deferrable timer expiries into account? IOW, a deferrable timer should not bring a completely idle system out of idle to serve the deferrable timer.

This behavior will change a little. If the system is completely idle, the last migrator CPU has to handle the first global timer even if it is a formerly deferrable and non-pinned timer.

> When the deferrable timers are queued on the global base, once a CPU comes out of idle and serves the timers on the global base, the deferrable timers would also be served. This is a welcome change. We would see a truly deferrable global timer - something we would be interested in. [1] has some background on this.

Serving the deferrable timers once a CPU comes out of idle is already the case even without the timer migration hierarchy. See the upstream version of run_local_timers().

> [1] https://lore.kernel.org/lkml/1430188744-24737-1-git-send-email-joonwoop@codeaurora.org/

As far as I understand the problem you are linking to, you want a real "unbound" solution for deferrable or delayed work. This is what you get with the timer migration hierarchy when enqueueing deferrable timers into the global timer base. Timers are executed on the migrator CPU because this CPU is not idle - it doesn't matter where they were queued before.

It might be possible that a formerly deferrable timer forces the last CPU going idle to come back out of idle. But the question is, how often does this occur compared to a wakeup caused by a non-deferrable timer? If you look at the timers in the kernel, there are 64 deferrable timers (this number also contains the deferrable and pinned timers). There are 7 timers with only the TIMER_PINNED flag and some additional ones using add_timer_on() to be enqueued on a dedicated CPU. But in total there are more than 1000 timers. Sure - in the end, the numbers depend heavily on the selected kernel config...

Side note: One big problem of deferrable timers disappears with this approach. All deferrable timers _WILL_ expire, even if the CPU where they were enqueued does not come back out of idle. Only deferrable and pinned timers will still have this problem.

Thanks,
Anna-Maria
Hi Anna-Maria,

On Tue, Nov 08, 2022 at 04:06:15PM +0100, Anna-Maria Behnsen wrote:
> On Tue, 8 Nov 2022, Pavan Kondeti wrote:
> > Taking the non-pinned deferrable timers case, they are queued on their own base and their expiry is not taken into account while programming the next timer event during idle.
>
> If the CPU is not the last CPU going idle, then yes.

What is special about the last CPU that is going idle? Sorry, it is not clear where the deferrable timer expiry is taken into account while programming the next wakeup event? forward_and_idle_timer_bases()->tmigr_cpu_deactivate() is only taking the global timer expiry (deferrable timers are NOT queued on the global base) and comparing it against the local base expiry. This makes me think that we are not taking deferrable timer expiry into account, which is correct IMO.

> > Can you elaborate on the "Simple deferrable timers are no longer required as they can be converted to global timers" statement?
>
> Global timers do not prevent a CPU from going idle. [...] The main change of behavior with global timers compared to deferrable timers is that they will expire on time and not necessarily on the CPU they were enqueued on. Deferrable timers were only expired when the CPU was awake, and always on the CPU they had been enqueued on.

Thanks. This is very clear. A deferrable timer (upstream or with your patches) only expires on a busy and local CPU. A CPU will not come out of idle just to serve a deferrable timer.

> > Though they can be on the global base, we still need to find a way to distinguish them from the normal global timers [...] IOW, a deferrable timer should not bring a completely idle system out of idle to serve the deferrable timer.
>
> This behavior will change a little. If the system is completely idle, the last migrator CPU has to handle the first global timer even if it is a formerly deferrable and non-pinned timer.
>
> > When the deferrable timers are queued on the global base, once a CPU comes out of idle and serves the timers on the global base, the deferrable timers would also be served. [...] [1] has some background on this.
>
> Serving the deferrable timers once a CPU comes out of idle is already the case even without the timer migration hierarchy. See the upstream version of run_local_timers().

However, the upstream version does not wake a CPU just to serve a deferrable timer. But it seems that if we consider a deferrable timer just as another global timer, it will indeed not prevent the local CPU from going idle, but there would be one CPU (and thus the system) that pays the penalty.

> As far as I understand the problem you are linking to, you want a real "unbound" solution for deferrable or delayed work. [...]
>
> It might be possible that a formerly deferrable timer forces the last CPU going idle to come back out of idle. But the question is, how often does this occur compared to a wakeup caused by a non-deferrable timer? [...]

I will give an example here. Let's say we have 4 CPUs in a system. There is a devfreq governor driver that schedules a delayed work every 20 msec.

#1 When the system is busy, this *deferrable* timer expires at the 20 msec boundary. However, when the system is idle (i.e. no use case is running but the system does not enter global suspend because of other reasons like the display being ON), we don't expect this deferrable timer to expire every 20 msec.

With your proposal, we end up seeing the system (the last CPU that enters idle) coming out of idle every 20 msec, which is not desirable.

#2 Today, deferrable is local to the CPU. Irrespective of the activity on the other CPUs, this deferrable timer does not expire as long as the local CPU is idle for whatever reason. That is definitely not the devfreq governor's expectation. The intention is to save power when the system is idle but to serve the purpose when it is relatively busy.
On Tue, 8 Nov 2022, Pavan Kondeti wrote:
> Hi Anna-Maria,
>
> On Tue, Nov 08, 2022 at 04:06:15PM +0100, Anna-Maria Behnsen wrote:
> [...]
>
> What is special about the last CPU that is going idle? Sorry, it is not clear where the deferrable timer expiry is taken into account while programming the next wakeup event?

The last CPU has to make sure the global timers are handled. At the moment the deferrable timer expiry is not taken into account for the next wakeup.

> forward_and_idle_timer_bases()->tmigr_cpu_deactivate() is only taking the global timer expiry (deferrable timers are NOT queued on the global base) and comparing it against the local base expiry. This makes me think that we are not taking deferrable timer expiry into account, which is correct IMO.

The information "deferrable timers [...] can be converted to global timers" is below the heading "Next Steps". It is _NOT_ part of this series; it will be part of a follow-up patch series. The posted series only introduces the timer migration hierarchy and then removes the heuristic on which CPU a timer will be enqueued. The only change for deferrable timers after this series is: they are always enqueued on the local CPU. The rest stays the same.

[...]

> > > When the deferrable timers are queued on the global base, once a CPU comes out of idle and serves the timers on the global base, the deferrable timers would also be served. This is a welcome change. We would see a truly deferrable global timer - something we would be interested in. [1] has some background on this.
> >
> > Serving the deferrable timers once a CPU comes out of idle is already the case even without the timer migration hierarchy. See the upstream version of run_local_timers().
>
> However, the upstream version does not wake a CPU just to serve a deferrable timer. But it seems that if we consider a deferrable timer just as another global timer, it will not prevent the local CPU from going idle, but there would be one CPU (and thus the system) that pays the penalty.

Right.

> > > [1] https://lore.kernel.org/lkml/1430188744-24737-1-git-send-email-joonwoop@codeaurora.org/
> >
> > As far as I understand the problem you are linking to, you want a real "unbound" solution for deferrable or delayed work. [...]
> >
> > It might be possible that a formerly deferrable timer forces the last CPU going idle to come back out of idle. [...]
>
> I will give an example here. Let's say we have 4 CPUs in a system. There is a devfreq governor driver that schedules a delayed work every 20 msec.

s/delayed work/deferrable work ?

> #1 When the system is busy, this *deferrable* timer expires at the 20 msec boundary. However, when the system is idle (i.e. no use case is running but the system does not enter global suspend because of other reasons like the display being ON), we don't expect this deferrable timer to expire every 20 msec.
>
> With your proposal, we end up seeing the system (the last CPU that enters idle) coming out of idle every 20 msec, which is not desirable.

With my proposal for the next steps, only timers with both the pinned and deferrable flags set would keep the old behavior.

> #2 Today, deferrable is local to the CPU. Irrespective of the activity on the other CPUs, this deferrable timer does not expire as long as the local CPU is idle for whatever reason. That is definitely not the devfreq governor's expectation. The intention is to save power when the system is idle but to serve the purpose when it is relatively busy.
>
> > Side note: One big problem of deferrable timers disappears with this approach. All deferrable timers _WILL_ expire, even if the CPU where they were enqueued does not come back out of idle. Only deferrable and pinned timers will still have this problem.
>
> Yes, this is a welcome change. It solves the #2 problem mentioned above.

But this welcome change only materializes when deferrable timers are enqueued into the global base. And be aware, the problem still exists for pinned deferrable timers.

Thanks,
Anna-Maria
On Tue, Nov 08, 2022 at 06:39:22PM +0100, Anna-Maria Behnsen wrote:
> On Tue, 8 Nov 2022, Pavan Kondeti wrote:
>
> > Hi Anna-Maria,
> >
> > On Tue, Nov 08, 2022 at 04:06:15PM +0100, Anna-Maria Behnsen wrote:
> > > On Tue, 8 Nov 2022, Pavan Kondeti wrote:
> > >
> > > > Hi Anna-Maria,
> > > >
> > > > On Fri, Nov 04, 2022 at 03:57:21PM +0100, Anna-Maria Behnsen wrote:
> > > > > Next Steps
> > > > > ~~~~~~~~~~
> > > > >
> > > > > Simple deferrable timers are no longer required as they can be
> > > > > converted to global timers. If a CPU goes idle, a formerly
> > > > > deferrable timer will not prevent the CPU from sleeping as long
> > > > > as possible. Only the last migrator CPU has to take care of
> > > > > them. Deferrable timers with the timer pinned flag need to be
> > > > > expired on the specified CPU but must not prevent the CPU from
> > > > > going idle. They require their own timer base which is never
> > > > > taken into account when calculating the next expiry time. This
> > > > > conversion and the required cleanup will be done in a follow-up
> > > > > series.
> > > >
> > > > Taking the non-pinned deferrable timers case, they are queued on
> > > > their own base and their expiry is not taken into account while
> > > > programming the next timer event during idle.
> > >
> > > If the CPU is not the last CPU going idle, then yes.
> >
> > What is special about the last CPU that is going idle? Sorry, it is
> > not clear where the deferrable timer expiry is taken into account
> > while programming the next wakeup event?
>
> The last CPU has to make sure the global timers are handled. At the
> moment the deferrable timer expiry is not taken into account for the
> next wakeup.

Right. Nothing changes wrt deferrable timers with this series.

> > forward_and_idle_timer_bases()->tmigr_cpu_deactivate() is only taking
> > the global timer expiry (deferrable timers are NOT queued on the
> > global base) and comparing it against the local base expiry. This
> > makes me think that we are not taking the deferrable timers' expiry
> > into account, which is correct IMO.
>
> The information "deferrable timers [...] can be converted to global
> timers" is below the heading "Next Steps". It is _NOT_ part of this
> series; it will be part of a follow-up patch series.
>
> The posted series only introduces the timer migration hierarchy and
> then removes the heuristic on which CPU a timer will be enqueued. The
> only change for deferrable timers after this series is: they are always
> enqueued on the local CPU. The rest stays the same.

Understood.

> > > > When the deferrable timers will be queued on the global base,
> > > > once a CPU comes out of idle and serves the timers on the global
> > > > base, the deferrable timers would also be served. This is a
> > > > welcome change. We would see a truly deferrable global timer,
> > > > something we would be interested in. [1] has some background on
> > > > this.
> > >
> > > Serving the deferrable timers once a CPU comes out of idle is
> > > already the case even without the timer migration hierarchy. See
> > > the upstream version of run_local_timers().
> >
> > However, the upstream version does not wake a CPU just to serve a
> > deferrable timer. But if we consider a deferrable timer just as
> > another global timer, it will not prevent the local CPU from going
> > idle, but there would be one CPU (and thus the system) that pays the
> > penalty.
>
> Right.
>
> > > > [1] https://lore.kernel.org/lkml/1430188744-24737-1-git-send-email-joonwoop@codeaurora.org/
> > >
> > > If I understand the problem you are linking to correctly, you want
> > > to have a real "unbound" solution for deferrable or delayed work.
> > > This is what you get with the timer migration hierarchy when
> > > enqueuing deferrable timers into the global timer base. Timers are
> > > executed on the migrator CPU because this CPU is not idle - it
> > > doesn't matter where they have been queued before.
> > >
> > > It might be possible that a formerly deferrable timer forces the
> > > last CPU going idle to come back out of idle. But the question is:
> > > how often does this occur in contrast to a wakeup caused by a
> > > non-deferrable timer? If you have a look at the timers in the
> > > kernel, you have 64 deferrable timers (this number also contains
> > > the timers which are both deferrable and pinned). There are 7
> > > timers with only the TIMER_PINNED flag and some additional ones
> > > using add_timer_on() to be enqueued on a dedicated CPU. But in
> > > total we have more than 1000 timers. Sure - in the end, the numbers
> > > heavily depend on the selected kernel config...
> >
> > I will give an example here. Let's say we have 4 CPUs in a system.
> > There is a devfreq governor driver that configures a delayed work for
> > every 20 msec.
>
> s/delayed work/deferrable work ?

Yeah, deferrable work.

> > #1 When the system is busy, this *deferrable* timer expires at the
> > 20 msec boundary. However, when the system is idle (i.e. no use case
> > is running but the system does not enter global suspend because of
> > other reasons like display ON etc.), we don't expect this deferrable
> > timer to expire every 20 msec.
> >
> > With your proposal, we end up seeing the system (the last CPU that
> > enters idle) coming out of idle every 20 msec, which is not
> > desirable.
>
> With my proposal for the next steps, only timers with both the pinned
> and deferrable flags set would keep the old behavior.

Pinned timers/work are meant for collecting/doing per-CPU things. Those
who don't care about this, like devfreq, would generally want the
scheduler to select the best CPU for their work and probably don't use
pinned timers.

> > #2 Today, deferrable is local to a CPU. Irrespective of the activity
> > on the other CPUs, this deferrable timer does not expire as long as
> > the local CPU is idle for whatever reason. That is definitely not the
> > devfreq governor expectation. The intention is to save power when the
> > system is idle but to serve the purpose when it is relatively busy.
> >
> > > Side note: One big problem of deferrable timers disappears with
> > > this approach. All deferrable timers _WILL_ expire, even if the CPU
> > > where they have been enqueued does not come back out of idle. Only
> > > timers which are both deferrable and pinned will still have this
> > > problem.
> >
> > Yes, this is a welcome change. Solves the #2 problem as mentioned
> > above.
>
> But this welcome change is only accessible when enqueuing deferrable
> timers into the global base. But be aware, the problem still exists for
> pinned deferrable timers.

Yes. I will look forward to your series that implements the next steps,
and we can discuss more about this then. Please do keep the use
case/example I have mentioned above in mind. We really don't want folks
to use pinned deferrable timers as a workaround just because deferrable
timers would otherwise wake up CPUs.

Thanks,
Pavan