Message ID | 20231004123454.15691-1-anna-maria@linutronix.de |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org>
From: Anna-Maria Behnsen <anna-maria@linutronix.de>
To: linux-kernel@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>, John Stultz <jstultz@google.com>, Thomas Gleixner <tglx@linutronix.de>, Eric Dumazet <edumazet@google.com>, "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>, Arjan van de Ven <arjan@infradead.org>, "Paul E. McKenney" <paulmck@kernel.org>, Frederic Weisbecker <frederic@kernel.org>, Rik van Riel <riel@surriel.com>, Steven Rostedt <rostedt@goodmis.org>, Sebastian Siewior <bigeasy@linutronix.de>, Giovanni Gherdovich <ggherdovich@suse.cz>, Lukasz Luba <lukasz.luba@arm.com>, "Gautham R. Shenoy" <gautham.shenoy@amd.com>, Srinivas Pandruvada <srinivas.pandruvada@intel.com>, K Prateek Nayak <kprateek.nayak@amd.com>, Anna-Maria Behnsen <anna-maria@linutronix.de>
Subject: [PATCH v8 00/25] timer: Move from a push remote at enqueue to a pull at expiry model
Date: Wed, 4 Oct 2023 14:34:29 +0200
Message-Id: <20231004123454.15691-1-anna-maria@linutronix.de>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
timer: Move from a push remote at enqueue to a pull at expiry model
|
|
Message
Anna-Maria Behnsen
Oct. 4, 2023, 12:34 p.m. UTC
Hi,

The updated queue for moving from a push-remote-at-enqueue-time to a pull-at-expiry-time model now also contains changes to the timer base idle marking. The queue is split into three parts:

- Patches 1 - 7: Cleanups and minor fixes
- Patches 8 - 10: Timer base idle marking rework with two preparatory changes. See the section below for more details.
- Patches 11 - 25: Updated timer pull model on top of the timer idle rework

The queue is available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel timers/pushpull

Move marking timer bases as idle into tick_nohz_stop_tick()
-----------------------------------------------------------

The idle marking of timer bases is done in get_next_timer_interrupt() whenever possible. The timer bases are marked idle even if the tick will not be stopped, which leads to an IPI when a new first timer is enqueued remotely. To prevent this, setting the timer_base->in_idle flag is postponed to tick_nohz_stop_tick().

Timer pull model
----------------

Placing timers at enqueue time on a target CPU based on dubious heuristics does not make any sense:

1) Most timer wheel timers are canceled or rearmed before they expire.

2) The heuristics to predict which CPU will be busy when the timer expires are wrong by definition.

So placing the timers at enqueue time wastes precious cycles. The proper solution to this problem is to always queue the timers on the local CPU and allow the non-pinned timers to be pulled onto a busy CPU at expiry time.

Therefore the timer storage is split into local pinned and global timers: local pinned timers are always expired on the CPU on which they have been queued, while global timers can be expired on any CPU.

As long as a CPU is busy, it expires both local and global timers. When a CPU goes idle, it arms for the first expiring local timer. If the first expiring pinned (local) timer is before the first expiring movable timer, then no action is required because the CPU will wake up before the first movable timer expires.
If the first expiring movable timer is before the first expiring pinned (local) timer, then this timer is queued into an idle timerqueue and eventually expired by some other active CPU.

To avoid global locking, the timerqueues are implemented as a hierarchy. The lowest level of the hierarchy holds the CPUs, which are organized into groups of 8, separated per node. If more than one CPU group exists, then a second level in the hierarchy collects the groups. Depending on the size of the system, more than two levels may be required. Each group has a "migrator" which checks the timerqueue during the tick for remote timers to be expired.

If the last CPU in a group goes idle, it reports the first expiring event in the group up to the next group(s) in the hierarchy. If the last CPU in the system goes idle, it arms its timer for the first system-wide expiring timer to ensure that no timer event is missed.

Testing
~~~~~~~

Enqueue
^^^^^^^

The impact of wasting cycles during enqueue by using the heuristic, in contrast to always queuing the timer on the local CPU, was measured with a micro benchmark: a timer is enqueued and dequeued in a loop with 1000 repetitions on an isolated CPU, and the time the loop takes is measured. A quarter of the remaining CPUs was kept busy. This measurement was repeated several times. With the patch queue, the average duration was reduced by approximately 25%.

  145ns plain v6
  109ns v6 with patch queue

Furthermore, the impact on residency in deep idle states of an idle system was investigated. The patch queue does not degrade this behavior.

dbench test
^^^^^^^^^^^

A dbench test starting X pairs of client/server processes is used to create load on the system. The measured value is the throughput. The tests were executed on a zen3 machine. The base is the tip tree branch timers/core, which is based on v6.6-rc1.
governor menu

X pairs    timers/core       pull-model        impact
-----------------------------------------------------
   1        353.19 (0.19)     353.45 (0.30)     0.07%
   2        700.10 (0.96)     687.00 (0.20)    -1.87%
   4       1329.37 (0.63)    1282.91 (0.64)    -3.49%
   8       2561.16 (1.28)    2493.56 (1.76)    -2.64%
  16       4959.96 (0.80)    4914.59 (0.64)    -0.91%
  32       9741.92 (3.44)    8979.83 (1.13)    -7.82%
  64      16535.40 (2.84)   16388.47 (4.02)    -0.89%
 128      22136.83 (2.42)   23174.50 (1.43)     4.69%
 256      39256.77 (4.48)   38994.00 (0.39)    -0.67%
 512      36799.03 (1.83)   38091.10 (0.63)     3.51%
1024      32903.03 (0.86)   35370.70 (0.89)     7.50%

governor teo

X pairs    timers/core       pull-model        impact
-----------------------------------------------------
   1        350.83 (1.27)     352.45 (0.96)     0.46%
   2        699.52 (0.85)     690.10 (0.54)    -1.35%
   4       1339.53 (1.99)    1294.71 (2.71)    -3.35%
   8       2574.10 (0.76)    2495.46 (1.97)    -3.06%
  16       4898.50 (1.74)    4783.06 (1.64)    -2.36%
  32       9115.50 (4.63)    9037.83 (1.58)    -0.85%
  64      16663.90 (3.80)   16042.00 (1.72)    -3.73%
 128      25044.93 (1.11)   23250.03 (1.08)    -7.17%
 256      38059.53 (1.70)   39658.57 (2.98)     4.20%
 512      36369.30 (0.39)   38890.13 (0.36)     6.93%
1024      33956.83 (1.14)   35514.83 (0.29)     4.59%

Ping Pong Observation
^^^^^^^^^^^^^^^^^^^^^

During testing on a mostly idle machine, a ping pong game could be observed: a process_timeout timer is expired remotely on a non-idle CPU. Then the CPU where the schedule_timeout() was executed to enqueue the timer comes out of idle, restarts the timer using schedule_timeout() and goes back to idle again. This is due to the fair scheduler, which tries to keep the task on the CPU on which it previously executed.

Possible Next Steps
~~~~~~~~~~~~~~~~~~~

Simple deferrable timers are no longer required as they can be converted to global timers. If a CPU goes idle, a formerly deferrable timer will not prevent the CPU from sleeping as long as possible. Only the last migrator CPU has to take care of them. Deferrable timers with the timer pinned flag need to be expired on the specified CPU but must not prevent the CPU from going idle.
They require their own timer base which is never taken into account when calculating the next expiry time. This conversion and the required cleanup will be done in a follow-up series.

v7..v8: https://lore.kernel.org/r/20230524070629.6377-1-anna-maria@linutronix.de
  - Address review feedback
  - Move marking timer base idle into tick_nohz_stop_tick()
  - Look-ahead function to determine the possible sleep length

v6..v7:
  - Address review feedback of Frederic and bigeasy
  - Change lock/unlock and fetch next timer interrupt logic after remote expiry
  - Move timer_expire_remote() into tick-internal.h
  - Add documentation section about "Required event and timerqueue update
    after remote expiry"
  - Fix fallout of kernel test robot

v5..v6:
  - Address review of Frederic Weisbecker and Peter Zijlstra (spelling,
    locking, race in tmigr_handle_remote_cpu())
  - Unconditionally set the TIMER_PINNED flag in add_timer_on(); introduce
    add_timer() variants which set/unset the TIMER_PINNED flag; drop fixing
    add_timer_on() call sites, as the TIMER_PINNED flag is set implicitly;
    fix workqueue to use add_timer_global() instead of simply add_timer()
    for unbound work
  - Drop support for siblings ending up in the same level-0 group (could be
    added again in a better way as an improvement later on)
  - Do not send an IPI for new first deferrable timers

v4..v5:
  - Address review feedback of Frederic Weisbecker
  - Fix issue with group timer update after remote expiry

v3..v4:
  - Address review feedback of Frederic Weisbecker
  - Address kernel test robot fallout
  - Move patch 16 "add_timer_on(): Make sure callers have TIMER_PINNED
    flag" to the beginning of the queue to prevent timers from ending up
    in the global timer base when they were queued using add_timer_on()
  - Fix some comments and typos

v2..v3: https://lore.kernel.org/r/20170418111102.490432548@linutronix.de/
  - Minimize usage of locks by storing data using atomic_cmpxchg() for
    migrator information and information about active CPUs.
Thanks,

   Anna-Maria

Anna-Maria Behnsen (22):
  tick/sched: Cleanup confusing variables
  tick-sched: Warn when next tick seems to be in the past
  timer: Do not IPI for deferrable timers
  timer: Move store of next event into __next_timer_interrupt()
  timers: Clarify check in forward_timer_base()
  timers: Split out forward timer base functionality
  timers: Use already existing function for forwarding timer base
  timer: Split out get next timer functionality
  timers: Move marking timer bases idle into tick_nohz_stop_tick()
  timers: Introduce add_timer() variants which modify timer flags
  workqueue: Use global variant for add_timer()
  timer: add_timer_on(): Make sure TIMER_PINNED flag is set
  timers: Ease code in run_local_timers()
  timer: Split next timer interrupt logic
  timer: Keep the pinned timers separate from the others
  timer: Retrieve next expiry of pinned/non-pinned timers separately
  timer: Split out "get next timer interrupt" functionality
  timer: Add get next timer interrupt functionality for remote CPUs
  timer: Check if timers base is handled already
  timer: Implement the hierarchical pull model
  timer_migration: Add tracepoints
  timer: Always queue timers on the local CPU

Richard Cochran (linutronix GmbH) (2):
  timer: Restructure internal locking
  tick/sched: Split out jiffies update helper function

Thomas Gleixner (1):
  timer: Rework idle logic

 include/linux/cpuhotplug.h             |    1 +
 include/linux/timer.h                  |   16 +-
 include/trace/events/timer_migration.h |  283 ++++
 kernel/time/Makefile                   |    3 +
 kernel/time/tick-internal.h            |   13 +
 kernel/time/tick-sched.c               |   69 +-
 kernel/time/timer.c                    |  514 ++++++--
 kernel/time/timer_migration.c          | 1636 ++++++++++++++++++++++++
 kernel/time/timer_migration.h          |  144 +++
 kernel/workqueue.c                     |    2 +-
 10 files changed, 2549 insertions(+), 132 deletions(-)
 create mode 100644 include/trace/events/timer_migration.h
 create mode 100644 kernel/time/timer_migration.c
 create mode 100644 kernel/time/timer_migration.h
Comments
Hello Anna-Maria,

On 10/4/2023 6:04 PM, Anna-Maria Behnsen wrote:
> [..snip..]
>
> Ping Pong Oberservation
> ^^^^^^^^^^^^^^^^^^^^^^^
>
> During testing on a mostly idle machine a ping pong game could be observed:
> a process_timeout timer is expired remotely on a non idle CPU. Then the CPU
> where the schedule_timeout() was executed to enqueue the timer comes out of
> idle and restarts the timer using schedule_timeout() and goes back to idle
> again. This is due to the fair scheduler which tries to keep the task on
> the CPU which it previously executed on.

Regarding the above, are you referring to "wake_up_process(timeout->task)" in "process_timeout()" ending up waking the task on an idle CPU instead of the CPU where process_timeout() ran? In which case, have you tried using the "WF_CURRENT_CPU" flag for the wakeup? (landed upstream in v6.6-rc1) It is only used by wait queues in kernel/sched/wait.c currently, but perhaps we can have a "wake_up_process_on_current_cpu()" that process_timeout() can call. Something along the lines of:

    int wake_up_process_on_current_cpu(struct task_struct *p)
    {
            return try_to_wake_up(p, TASK_NORMAL, WF_CURRENT_CPU);
    }
    EXPORT_SYMBOL(wake_up_process_on_current_cpu);

Thoughts?

> [..snip..]

--
Thanks and Regards,
Prateek
Hi Maria,

On Wed, 2023-10-04 at 14:34 +0200, Anna-Maria Behnsen wrote:
> Hi,
>
> [...]
>
> The proper solution to this problem is to always queue the timers on the
> local CPU and allow the non pinned timers to be pulled onto a busy CPU at
> expiry time.

Thanks for these patches. I am looking into saving power during video playback with our low power daemon. I use cgroup v2 isolation to keep some CPUs idle (CPUs 0-11) while video is played on a single module (CPUs 12-15). I have some kernelshark pictures at the link below; the traces were collected with sched, timer and irq events. With 6.6-rc5, you can see that some timers still expire on CPUs which I want to keep idle. With the timer patches added, they are mostly pulled to a busy CPU.

https://imgur.com/a/8nF5OoP

I can share the .dat files, but they are too big to attach here.

Thanks,
Srinivas

> [...]
Hello Anna-Maria, Happy to report I don't see any regression with this version of series. I'll leave the detailed report below. On 10/4/2023 6:04 PM, Anna-Maria Behnsen wrote: > [..snip..] > > dbench test > ^^^^^^^^^^^ > > A dbench test starting X pairs of client servers are used to create load on > the system. The measurable value is the throughput. The tests were executed > on a zen3 machine. The base is the tip tree branch timers/core which is > based on a v6.6-rc1. > > governor menu > > X pairs timers/core pull-model impact > ---------------------------------------------- > 1 353.19 (0.19) 353.45 (0.30) 0.07% > 2 700.10 (0.96) 687.00 (0.20) -1.87% > 4 1329.37 (0.63) 1282.91 (0.64) -3.49% > 8 2561.16 (1.28) 2493.56 (1.76) -2.64% > 16 4959.96 (0.80) 4914.59 (0.64) -0.91% > 32 9741.92 (3.44) 8979.83 (1.13) -7.82% > 64 16535.40 (2.84) 16388.47 (4.02) -0.89% > 128 22136.83 (2.42) 23174.50 (1.43) 4.69% > 256 39256.77 (4.48) 38994.00 (0.39) -0.67% > 512 36799.03 (1.83) 38091.10 (0.63) 3.51% > 1024 32903.03 (0.86) 35370.70 (0.89) 7.50% > > > governor teo > > X pairs timers/core pull-model impact > ---------------------------------------------- > 1 350.83 (1.27) 352.45 (0.96) 0.46% > 2 699.52 (0.85) 690.10 (0.54) -1.35% > 4 1339.53 (1.99) 1294.71 (2.71) -3.35% > 8 2574.10 (0.76) 2495.46 (1.97) -3.06% > 16 4898.50 (1.74) 4783.06 (1.64) -2.36% > 32 9115.50 (4.63) 9037.83 (1.58) -0.85% > 64 16663.90 (3.80) 16042.00 (1.72) -3.73% > 128 25044.93 (1.11) 23250.03 (1.08) -7.17% > 256 38059.53 (1.70) 39658.57 (2.98) 4.20% > 512 36369.30 (0.39) 38890.13 (0.36) 6.93% > 1024 33956.83 (1.14) 35514.83 (0.29) 4.59% o Machine details - 3rd Generation EPYC System - 2 sockets each with 64C/128T - NPS1 (Each socket is a NUMA node) - C2 Disabled (POLL and C1(MWAIT) remained enabled) o Kernel Details - tip: tip:sched/core at commit 238437d88cea ("intel_idle: Add ibrs_off module parameter to force-disable IBRS") + min_deadline fix commit 8dafa9d0eb1a ("sched/eevdf: Fix min_deadline heap 
integrity") from tip:sched/urgent - timer-pull: tip + this series as is o Benchmark Results ================================================================== Test : hackbench Units : Normalized time in seconds Interpretation: Lower is better Statistic : AMean ================================================================== Case: tip[pct imp](CV) timer-pull[pct imp](CV) 1-groups 1.00 [ -0.00]( 2.11) 0.99 [ 1.44]( 3.34) 2-groups 1.00 [ -0.00]( 1.31) 1.01 [ -0.93]( 1.57) 4-groups 1.00 [ -0.00]( 1.04) 1.00 [ 0.44]( 1.11) 8-groups 1.00 [ -0.00]( 1.34) 0.99 [ 1.29]( 1.34) 16-groups 1.00 [ -0.00]( 2.45) 1.00 [ -0.40]( 2.78) ================================================================== Test : tbench Units : Normalized throughput Interpretation: Higher is better Statistic : AMean ================================================================== Clients: tip[pct imp](CV) timer-pull[pct imp](CV) 1 1.00 [ 0.00]( 0.46) 1.01 [ 0.52]( 0.66) 2 1.00 [ 0.00]( 0.64) 0.99 [ -0.60]( 0.88) 4 1.00 [ 0.00]( 0.59) 0.99 [ -0.92]( 1.82) 8 1.00 [ 0.00]( 0.34) 1.00 [ -0.06]( 0.33) 16 1.00 [ 0.00]( 0.72) 0.99 [ -1.25]( 1.52) 32 1.00 [ 0.00]( 0.65) 0.98 [ -1.59]( 1.29) 64 1.00 [ 0.00]( 0.59) 0.99 [ -0.84]( 3.87) 128 1.00 [ 0.00]( 1.19) 1.00 [ 0.11]( 0.33) 256 1.00 [ 0.00]( 0.16) 1.01 [ 0.61]( 0.52) 512 1.00 [ 0.00]( 0.20) 1.01 [ 0.80]( 0.29) 1024 1.00 [ 0.00]( 0.06) 1.01 [ 1.06]( 0.59) ================================================================== Test : stream-10 Units : Normalized Bandwidth, MB/s Interpretation: Higher is better Statistic : HMean ================================================================== Test: tip[pct imp](CV) timer-pull[pct imp](CV) Copy 1.00 [ 0.00]( 6.04) 1.04 [ 4.31]( 3.71) Scale 1.00 [ 0.00]( 5.44) 1.01 [ 0.57]( 5.63) Add 1.00 [ 0.00]( 5.44) 1.01 [ 0.99]( 5.46) Triad 1.00 [ 0.00]( 7.82) 1.04 [ 4.14]( 5.68) ================================================================== Test : stream-100 Units : Normalized Bandwidth, MB/s Interpretation: Higher is 
better Statistic : HMean ================================================================== Test: tip[pct imp](CV) timer-pull[pct imp](CV) Copy 1.00 [ 0.00]( 1.14) 1.00 [ 0.29]( 0.49) Scale 1.00 [ 0.00]( 4.60) 1.03 [ 2.87]( 0.62) Add 1.00 [ 0.00]( 4.91) 1.01 [ 1.36]( 1.34) Triad 1.00 [ 0.00]( 0.60) 0.98 [ -1.50]( 4.24) ================================================================== Test : netperf Units : Normalized Througput Interpretation: Higher is better Statistic : AMean ================================================================== Clients: tip[pct imp](CV) timer-pull[pct imp](CV) 1-clients 1.00 [ 0.00]( 0.61) 1.01 [ 1.25]( 0.48) 2-clients 1.00 [ 0.00]( 0.44) 1.00 [ 0.34]( 0.65) 4-clients 1.00 [ 0.00]( 0.75) 1.01 [ 0.98]( 1.26) 8-clients 1.00 [ 0.00]( 0.65) 1.01 [ 0.82]( 0.73) 16-clients 1.00 [ 0.00]( 0.49) 1.00 [ 0.37]( 0.99) 32-clients 1.00 [ 0.00]( 0.57) 0.98 [ -2.05]( 3.44) 64-clients 1.00 [ 0.00]( 1.67) 1.00 [ 0.00]( 1.74) 128-clients 1.00 [ 0.00]( 1.11) 1.01 [ 0.69]( 1.11) 256-clients 1.00 [ 0.00]( 2.64) 1.00 [ 0.00]( 3.79) 512-clients 1.00 [ 0.00](52.49) 1.00 [ 0.26](54.13) ================================================================== Test : schbench Units : Normalized 99th percentile latency in us Interpretation: Lower is better Statistic : Median ================================================================== #workers: tip[pct imp](CV) timer-pull[pct imp](CV) 1 1.00 [ -0.00]( 8.41) 0.59 [ 40.54](40.25) 2 1.00 [ -0.00]( 5.29) 0.93 [ 7.50]( 9.01) 4 1.00 [ -0.00]( 1.32) 0.91 [ 9.09](12.33) 8 1.00 [ -0.00]( 9.52) 1.00 [ -0.00](15.02) 16 1.00 [ -0.00]( 1.61) 1.03 [ -3.23]( 2.37) 32 1.00 [ -0.00]( 7.27) 0.92 [ 7.69]( 1.59) 64 1.00 [ -0.00]( 6.96) 1.12 [-11.56]( 1.20) 128 1.00 [ -0.00]( 3.41) 1.06 [ -6.49]( 3.73) 256 1.00 [ -0.00](32.95) 1.02 [ -2.48](28.66) 512 1.00 [ -0.00]( 3.20) 0.99 [ 0.71]( 3.22) ================================================================== Test : ycsb-cassandra Units : Normalized throughput Interpretation: Higher 
is better Statistic : Mean ================================================================== metric tip timer-pull (%diff) throughput 1.00 1.01 (%diff: 0.75%) ================================================================== Test : ycsb-mondodb Units : Normalized throughput Interpretation: Higher is better Statistic : Mean ================================================================== metric tip timer-pull (%diff) throughput 1.00 1.00 (%diff: -0.49%) ================================================================== Test : DeathStarBench Units : Normalized throughput Interpretation: Higher is better Statistic : Mean ================================================================== Pinning scaling tip timer-pull (%diff) 1CCD 1 1.00 1.01 (%diff: 0.75%) 2CCD 2 1.00 1.03 (%diff: 2.72%) 4CCD 4 1.00 1.00 (%diff: -0.28%) 8CCD 8 1.00 1.00 (%diff: 0.20%) -- Thank you for debugging and helping fix the tbench regression. If the series does not change drastically, feel free to add: Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> > > > > Ping Pong Oberservation > ^^^^^^^^^^^^^^^^^^^^^^^ > > During testing on a mostly idle machine a ping pong game could be observed: > a process_timeout timer is expired remotely on a non idle CPU. Then the CPU > where the schedule_timeout() was executed to enqueue the timer comes out of > idle and restarts the timer using schedule_timeout() and goes back to idle > again. This is due to the fair scheduler which tries to keep the task on > the CPU which it previously executed on. > > > > > Possible Next Steps > ~~~~~~~~~~~~~~~~~~~ > > Simple deferrable timers are no longer required as they can be converted to > global timers. If a CPU goes idle, a formerly deferrable timer will not > prevent the CPU to sleep as long as possible. Only the last migrator CPU > has to take care of them. Deferrable timers with timer pinned flags needs > to be expired on the specified CPU but must not prevent CPU from going > idle. 
They require their own timer base which is never taken into account > when calculating the next expiry time. This conversation and required > cleanup will be done in a follow up series. > I'll keep an eye out for future versions for testing. > > [..snip..] > -- Thanks and Regards, Prateek
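[Editorial note: the per-row figures in the tables above follow a common
reporting convention (mean normalized to the tip baseline, percent
improvement in square brackets, coefficient of variation in parentheses).
For readers reproducing the format, a minimal sketch of that arithmetic
follows; the sample values are hypothetical, not taken from the runs above.]

```python
from statistics import mean, stdev

def summarize(base_samples, test_samples, lower_is_better=True):
    """Derive the per-row figures used in the tables above:
    mean normalized to the baseline, percent improvement, and
    coefficient of variation (stdev/mean, in percent)."""
    b, t = mean(base_samples), mean(test_samples)
    norm = t / b
    # For "lower is better" metrics (hackbench, schbench) an
    # improvement is a reduction relative to the baseline.
    sign = -1.0 if lower_is_better else 1.0
    pct_imp = sign * (t - b) / b * 100.0
    cv = stdev(test_samples) / t * 100.0
    return round(norm, 2), round(pct_imp, 2), round(cv, 2)

# Hypothetical raw hackbench-style samples (seconds, lower is better)
print(summarize([10.0, 10.2, 9.8], [9.9, 9.7, 10.1]))
```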
Hi Anna-Maria

On 10/4/23 13:34, Anna-Maria Behnsen wrote:
> Hi,
>

[snip]

>
> Testing
> ~~~~~~~
>
> Enqueue
> ^^^^^^^
>
> The impact of wasting cycles during enqueue by using the heuristic in
> contrast to always queuing the timer on the local CPU was measured with a
> micro benchmark. For this, a timer is enqueued and dequeued in a loop with
> 1000 repetitions on an isolated CPU. The time the loop takes is measured. A
> quarter of the remaining CPUs was kept busy. This measurement was repeated
> several times. With the patch queue the average duration was reduced by
> approximately 25%.
>
>	145ns	plain v6
>	109ns	v6 with patch queue
>
> Furthermore the impact of residence in deep idle states of an idle system
> was investigated. The patch queue doesn't degrade this behavior.
>
> dbench test
> ^^^^^^^^^^^
>
> A dbench test starting X pairs of client servers is used to create load on
> the system. The measurable value is the throughput. The tests were executed
> on a zen3 machine. The base is the tip tree branch timers/core which is
> based on v6.6-rc1.
>
> governor menu
>
> X pairs    timers/core      pull-model      impact
> ----------------------------------------------
>    1        353.19 (0.19)    353.45 (0.30)   0.07%
>    2        700.10 (0.96)    687.00 (0.20)  -1.87%
>    4       1329.37 (0.63)   1282.91 (0.64)  -3.49%
>    8       2561.16 (1.28)   2493.56 (1.76)  -2.64%
>   16       4959.96 (0.80)   4914.59 (0.64)  -0.91%
>   32       9741.92 (3.44)   8979.83 (1.13)  -7.82%
>   64      16535.40 (2.84)  16388.47 (4.02)  -0.89%
>  128      22136.83 (2.42)  23174.50 (1.43)   4.69%
>  256      39256.77 (4.48)  38994.00 (0.39)  -0.67%
>  512      36799.03 (1.83)  38091.10 (0.63)   3.51%
> 1024      32903.03 (0.86)  35370.70 (0.89)   7.50%
>
> governor teo
>
> X pairs    timers/core      pull-model      impact
> ----------------------------------------------
>    1        350.83 (1.27)    352.45 (0.96)   0.46%
>    2        699.52 (0.85)    690.10 (0.54)  -1.35%
>    4       1339.53 (1.99)   1294.71 (2.71)  -3.35%
>    8       2574.10 (0.76)   2495.46 (1.97)  -3.06%
>   16       4898.50 (1.74)   4783.06 (1.64)  -2.36%
>   32       9115.50 (4.63)   9037.83 (1.58)  -0.85%
>   64      16663.90 (3.80)  16042.00 (1.72)  -3.73%
>  128      25044.93 (1.11)  23250.03 (1.08)  -7.17%
>  256      38059.53 (1.70)  39658.57 (2.98)   4.20%
>  512      36369.30 (0.39)  38890.13 (0.36)   6.93%
> 1024      33956.83 (1.14)  35514.83 (0.29)   4.59%
>
>
> Ping Pong Observation
> ^^^^^^^^^^^^^^^^^^^^^
>
> During testing on a mostly idle machine a ping pong game could be observed:
> a process_timeout timer is expired remotely on a non idle CPU. Then the CPU
> where the schedule_timeout() was executed to enqueue the timer comes out of
> idle and restarts the timer using schedule_timeout() and goes back to idle
> again. This is due to the fair scheduler which tries to keep the task on
> the CPU which it previously executed on.
>

I have tested this on my 2 Arm boards with a mainline kernel and an
almost-mainline one. On both platforms it looks stable. The results w/
your patchset look better.

1. rockpi4b - mainline kernel (but no UI)

Limiting the cpumask to only the 4 little CPUs and setting the
performance governor for cpufreq and the menu governor for cpuidle.

1.1. perf bench sched pipe

w/o patchset vs. w/ patchset
avg [ops/sec]: (more is better)
23012.33 vs. 23154.33 (+0.6%)
avg [usecs/op]: (less is better)
43.453 vs. 43.187 (-0.6%)

1.2. perf bench sched messaging (less is better)

w/o patchset vs. w/ patchset
avg total time [s]:
2.7855 vs. 2.7005 (-3.1%)

2. pixel6 (kernel v5.18 with backported patchset)

2.1 Speedometer 2.0 (JS test running in Chrome browser)

w/o patchset vs. w/ patchset
149 vs. 146 (-2%)

2.2 Geekbench 5 (more is better)

Single core, w/o patchset vs. w/ patchset
1025 vs. 1017 (-0.7%)

Multi core, w/o patchset vs. w/ patchset
2756 vs. 2813 (+2%)

The performance looks good. Only one test, 'Speedometer', has a somewhat
lower score.

Feel free to add:

Tested-by: Lukasz Luba <lukasz.luba@arm.com>

Regards,
Lukasz
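[Editorial note: the percentages quoted in the runs above are plain
relative deltas against the "w/o patchset" baseline. As a sanity check,
the arithmetic reduces to the following sketch, fed with the sched-pipe
and sched-messaging figures reported in this mail.]

```python
def pct_delta(before, after):
    """Relative change of 'after' vs. the 'w/o patchset' baseline, in percent."""
    return round((after - before) / before * 100.0, 1)

# Figures from the rockpi4b runs above:
print(pct_delta(23012.33, 23154.33))  # sched pipe, ops/sec (more is better)
print(pct_delta(2.7855, 2.7005))      # sched messaging, total time (less is better)
```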
"Pandruvada, Srinivas" <srinivas.pandruvada@intel.com> writes: > Hi Maria, > > On Wed, 2023-10-04 at 14:34 +0200, Anna-Maria Behnsen wrote: >> Hi, >> >> > > [...] > >> >> The proper solution to this problem is to always queue the timers on >> the >> local CPU and allow the non pinned timers to be pulled onto a busy >> CPU at >> expiry time. > > Thanks for these patches. I am looking for saving power during video > playback with our low power daemon. I use cgroup v2 isolation to keep > some CPUs idle (CPU 0-11) and video is played on a single module (CPU > 12-15). > > I have some kernelshark pictures at below link. The traces are > collected with sched, timer and irq. With 6.6-rc5, you can see some > timers still expires on CPUs which I want to keep idle. With timer > patches added, they are mostly pulled to busy CPU. > > https://imgur.com/a/8nF5OoP > > I can share the .dat files, but they are too big to attach here. Thanks a lot for testing! The images are totally fine (at least for me). As there are still some issues in v8, I'll have to post a new version... Thanks, Anna-Maria
Hello Prateek,

K Prateek Nayak <kprateek.nayak@amd.com> writes:

> Hello Anna-Maria,
>
> Happy to report I don't see any regression with this version of the series.
> I'll leave the detailed report below.

[...]

> Thank you for debugging and helping fix the tbench regression.
> If the series does not change drastically, feel free to add:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thanks a lot for all the testing you did! When posting v9, I'll
summarize the changes and, if that is ok with you, ask again for testing
support.

>> Possible Next Steps
>> ~~~~~~~~~~~~~~~~~~~
>>
>> Simple deferrable timers are no longer required as they can be converted to
>> global timers. If a CPU goes idle, a formerly deferrable timer will not
>> prevent the CPU from sleeping as long as possible. Only the last migrator
>> CPU has to take care of them. Deferrable timers with the timer pinned flag
>> need to be expired on the specified CPU but must not prevent the CPU from
>> going idle. They require their own timer base which is never taken into
>> account when calculating the next expiry time. This conversion and the
>> required cleanup will be done in a follow up series.
>
> I'll keep an eye out for future versions for testing.

I'll keep you in the loop.

Thanks,

	Anna-Maria
Hi Lukasz,

Lukasz Luba <lukasz.luba@arm.com> writes:

[...]

> I have tested this on my 2 Arm boards with a mainline kernel
> and an almost-mainline one. On both platforms it looks stable.
> The results w/ your patchset look better.

Thanks for testing!

[...]

> The performance looks good. Only one test, 'Speedometer',
> has a somewhat lower score.

Is it required to look into this in more detail, or is the regression in
an acceptable range for you?

>
> Feel free to add:
>
> Tested-by: Lukasz Luba <lukasz.luba@arm.com>

Thanks,

	Anna-Maria
Hello Prateek,

I'm sorry for the late reply!

K Prateek Nayak <kprateek.nayak@amd.com> writes:

> Hello Anna-Maria,
>
> On 10/4/2023 6:04 PM, Anna-Maria Behnsen wrote:
>> [..snip..]
>>
>> Ping Pong Observation
>> ^^^^^^^^^^^^^^^^^^^^^
>>
>> During testing on a mostly idle machine a ping pong game could be observed:
>> a process_timeout timer is expired remotely on a non idle CPU. Then the CPU
>> where the schedule_timeout() was executed to enqueue the timer comes out of
>> idle and restarts the timer using schedule_timeout() and goes back to idle
>> again. This is due to the fair scheduler which tries to keep the task on
>> the CPU which it previously executed on.
>
> Regarding the above: are you referring to "wake_up_process(timeout->task)"
> in "process_timeout()" ending up waking the task on an idle CPU instead of
> the CPU where process_timeout() ran?

Yes.

> In which case, have you tried using the "WF_CURRENT_CPU" flag for the
> wakeup? (landed upstream in v6.6-rc1) It is only used by wait queues in
> kernel/sched/wait.c currently but perhaps we can have a
> "wake_up_process_on_current_cpu()" that process_timeout() can call.
>
> Something along the lines of:
>
> int wake_up_process_on_current_cpu(struct task_struct *p)
> {
>	return try_to_wake_up(p, TASK_NORMAL, WF_CURRENT_CPU);
> }
> EXPORT_SYMBOL(wake_up_process_on_current_cpu);
>
> Thoughts?

I didn't look into this again. Back then, I reported the observation to
the scheduler people (others had already observed this behavior as
well). I'm not so familiar with scheduling, so I will ping the scheduler
people to give you feedback.

Thanks,

	Anna-Maria
On 10/19/23 15:04, Anna-Maria Behnsen wrote:
> Hi Lukasz,
>
> Lukasz Luba <lukasz.luba@arm.com> writes:
>
> [...]
>
>> I have tested this on my 2 Arm boards with a mainline kernel
>> and an almost-mainline one. On both platforms it looks stable.
>> The results w/ your patchset look better.
>
> Thanks for testing!

You're welcome.

> [...]
>
>> The performance looks good. Only one test, 'Speedometer',
>> has a somewhat lower score.
>
> Is it required to look into this in more detail, or is the regression in
> an acceptable range for you?

That's something we can ignore. I have tested with a different cpuidle
governor and it goes away. So I suspect that the governor heuristic was
just confused. I will spend some time on tuning the different idle
governors. Feel free to go forward with this patch set.

>
>> Feel free to add:
>>
>> Tested-by: Lukasz Luba <lukasz.luba@arm.com>
>
> Thanks,
>
> Anna-Maria
>
On Fri, Oct 06, 2023 at 10:35:43AM +0530, K Prateek Nayak wrote:
> Hello Anna-Maria,
>
> On 10/4/2023 6:04 PM, Anna-Maria Behnsen wrote:
> > [..snip..]
> >
> > Ping Pong Observation
> > ^^^^^^^^^^^^^^^^^^^^^
> >
> > During testing on a mostly idle machine a ping pong game could be observed:
> > a process_timeout timer is expired remotely on a non idle CPU. Then the CPU
> > where the schedule_timeout() was executed to enqueue the timer comes out of
> > idle and restarts the timer using schedule_timeout() and goes back to idle
> > again. This is due to the fair scheduler which tries to keep the task on
> > the CPU which it previously executed on.
>
> Regarding the above: are you referring to "wake_up_process(timeout->task)"
> in "process_timeout()" ending up waking the task on an idle CPU instead of
> the CPU where process_timeout() ran?
>
> In which case, have you tried using the "WF_CURRENT_CPU" flag for the
> wakeup? (landed upstream in v6.6-rc1) It is only used by wait queues in
> kernel/sched/wait.c currently but perhaps we can have a
> "wake_up_process_on_current_cpu()" that process_timeout() can call.
>
> Something along the lines of:
>
> int wake_up_process_on_current_cpu(struct task_struct *p)
> {
>	return try_to_wake_up(p, TASK_NORMAL, WF_CURRENT_CPU);
> }
> EXPORT_SYMBOL(wake_up_process_on_current_cpu);
>
> Thoughts?

Yeah, we should definitely not export such a function. Also,
WF_CURRENT_CPU should be used sparingly.

Restarting the task on its previous CPU is done because of cache
affinity and is typically the right thing to do if you care about
performance.

The first question to ask is where this schedule_timeout() is coming
from. Is this some daft userspace that is polling on something that
eventually ends up in schedule_timeout()? Can we fix the userspace to
not do silly things like that?

The alternative is adding heuristics, and we all know where that
ends :/
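[Editorial note: narrowing down where such a schedule_timeout() comes
from can start from the kernel's timer tracepoints (timer:timer_start,
timer:timer_expire_entry), as in the kernelshark traces mentioned
earlier in the thread. Below is a sketch of post-processing such a trace
to flag timers that expired on a different CPU than the one that armed
them; the tuple format is a hypothetical simplification, not the format
any trace tool emits.]

```python
def find_remote_expiries(trace):
    """Flag timers that fired on a different CPU than the one that armed them.

    'trace' is a list of (event, cpu, timer_id) tuples, a simplified
    stand-in for timer:timer_start / timer:timer_expire_entry tracepoint
    data. Returns (timer_id, armed_cpu, expiry_cpu) per remote expiry.
    """
    armed = {}
    remote = []
    for event, cpu, timer_id in trace:
        if event == "timer_start":
            armed[timer_id] = cpu
        elif event == "timer_expire_entry" and armed.get(timer_id, cpu) != cpu:
            remote.append((timer_id, armed[timer_id], cpu))
    return remote

# Hypothetical trace: timer armed on CPU 0, pulled to and expired on busy
# CPU 3, then re-armed on CPU 0 and expired locally (the ping-pong pattern).
trace = [
    ("timer_start", 0, "t1"),
    ("timer_expire_entry", 3, "t1"),
    ("timer_start", 0, "t1"),
    ("timer_expire_entry", 0, "t1"),
]
print(find_remote_expiries(trace))
```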