Message ID | 20230112162426.217522-1-bristot@kernel.org |
---|---|
State | New |
Headers |
From: Daniel Bristot de Oliveira <bristot@kernel.org> To: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org> Cc: Daniel Bristot de Oliveira <bristot@kernel.org>, Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Valentin Schneider <vschneid@redhat.com>, Joe Mario <jmario@redhat.com> Subject: [PATCH] sched/idle: Make idle poll dynamic per-cpu Date: Thu, 12 Jan 2023 17:24:26 +0100 Message-Id: <20230112162426.217522-1-bristot@kernel.org> X-Mailer: git-send-email 2.39.0 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
sched/idle: Make idle poll dynamic per-cpu
|
|
Commit Message
Daniel Bristot de Oliveira
Jan. 12, 2023, 4:24 p.m. UTC
idle=poll is frequently used on ultra-low-latency systems. Examples of
such systems are high-performance trading and 5G NVRAM. The performance
gain comes from avoiding the idle driver machinery and from keeping the
CPU always in an active state, which avoids (odd) hardware heuristics that
are out of the control of the OS.
Currently, idle=poll is an all-or-nothing static option defined at
boot time. There are two motivations for making this option dynamic
and per-CPU:
1) Reduce the power usage/heat by allowing only selected CPUs to
do idle polling;
2) Allow multi-tenant systems (e.g., Kubernetes) to enable idle
polling only when ultra-low-latency applications are present
on specific CPUs.
Joe Mario did some experiments with this option enabled, and the results
were significant. For example, by using dynamic idle polling on
selected CPUs, cyclictest performance is optimal (like when using
idle=poll), but CPU power consumption drops from 381 to 233 watts.
Also, limiting idle=poll to the set of CPUs that benefit from
it allows other CPUs to benefit from frequency boosts. Joe also
showed that the round-trip improvement can be on the order of 80 nsec
when system-wide idle=poll is not used.
The user can enable idle polling with this command:
# echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
And disable it via:
# echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
By default, all CPUs have idle polling disabled (the current behavior).
A static key avoids the CPU mask check overhead when no idle polling
is enabled.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Joe Mario <jmario@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
kernel/sched/idle.c | 97 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 93 insertions(+), 4 deletions(-)
Comments
* Daniel Bristot de Oliveira <bristot@kernel.org> wrote:

> [...]
> The user can enable idle polling with this command:
>   # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>
> And disable it via:
>   # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>
> By default, all CPUs have idle polling disabled (the current behavior).
> A static key avoids the CPU mask check overhead when no idle polling
> is enabled.

Sounds useful in general.

A couple of observations:

ABI: how about putting the new file into the existing
/sys/devices/system/cpu/cpuidle/ directory - the sysfs space of cpuidle?
Arguably this flag is an extension of it.

>  extern char __cpuidle_text_start[], __cpuidle_text_end[];
>
> +/*
> + * per-cpu idle polling selector.
> + */
> +static struct cpumask cpu_poll_mask;
> +DEFINE_STATIC_KEY_FALSE(cpu_poll_enabled);
> +
> +/*
> + * Protects the mask/static key relation.
> + */
> +DEFINE_MUTEX(cpu_poll_mutex);
> +
> +static ssize_t idle_poll_store(struct device *dev, struct device_attribute *attr,
> +			       const char *buf, size_t count)
> +{
> +	int cpu = dev->id;
> +	int retval, set;
> +	bool val;
> +
> +	retval = kstrtobool(buf, &val);
> +	if (retval)
> +		return retval;
> +
> +	mutex_lock(&cpu_poll_mutex);
> +
> +	if (val) {
> +		set = cpumask_test_and_set_cpu(cpu, &cpu_poll_mask);
> +
> +		/*
> +		 * If the CPU was already on, do not increase the static key usage.
> +		 */
> +		if (!set)
> +			static_branch_inc(&cpu_poll_enabled);
> +	} else {
> +		set = cpumask_test_and_clear_cpu(cpu, &cpu_poll_mask);
> +
> +		/*
> +		 * If the CPU was already off, do not decrease the static key usage.
> +		 */
> +		if (set)
> +			static_branch_dec(&cpu_poll_enabled);
> +	}

Nit: I think 'old_bit' or so is easier to read than a generic 'set'?

> +
> +	mutex_unlock(&cpu_poll_mutex);

Also, is cpu_poll_mutex locking really necessary, given that these bitops
methods are atomic, and CPUs observe cpu_poll_enabled without taking any
locks?

> +static int is_cpu_idle_poll(int cpu)
> +{
> +	if (static_branch_unlikely(&cpu_poll_enabled))
> +		return cpumask_test_cpu(cpu, &cpu_poll_mask);
> +
> +	return 0;
> +}

static inline might be justified in this case I guess.

> @@ -51,18 +136,21 @@ __setup("hlt", cpu_idle_nopoll_setup);
>
>  static noinline int __cpuidle cpu_idle_poll(void)
>  {
> -	trace_cpu_idle(0, smp_processor_id());
> +	int cpu = smp_processor_id();
> +
> +	trace_cpu_idle(0, cpu);
>  	stop_critical_timings();
>  	ct_idle_enter();
>  	local_irq_enable();
>
>  	while (!tif_need_resched() &&
> -	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
> +	       (cpu_idle_force_poll || tick_check_broadcast_expired()
> +		|| is_cpu_idle_poll(cpu)))
>  		cpu_relax();
>
>  	ct_idle_exit();
>  	start_critical_timings();
> -	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> +	trace_cpu_idle(PWR_EVENT_EXIT, cpu);
>
>  	return 1;

So I think the introduction of the 'cpu' local variable to clean up the
flow of cpu_idle_poll() should be a separate preparatory patch, which will
make the addition of the is_cpu_idle_poll() call a bit easier to read in
the second patch.

>  }
> @@ -296,7 +384,8 @@ static void do_idle(void)
>  	 * broadcast device expired for us, we don't want to go deep
>  	 * idle as we know that the IPI is going to arrive right away.
>  	 */
> -	if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> +	if (cpu_idle_force_poll || tick_check_broadcast_expired()
> +	    || is_cpu_idle_poll(cpu)) {
>  		tick_nohz_idle_restart_tick();
>  		cpu_idle_poll();

Shouldn't we check is_cpu_idle_poll() right after the cpu_idle_force_poll
check, and before the tick_check_broadcast_expired() check?

Shouldn't matter to the outcome, but for consistency's sake.

Plus, if we are doing this anyway, maybe cpu_idle_force_poll could now be
implemented as 0/all setting of cpu_poll_mask, eliminating the
cpu_idle_force_poll flag? As a third patch on top.

Thanks,

	Ingo

Hi Daniel,

On 2023-01-12 at 17:24:26 +0100, Daniel Bristot de Oliveira wrote:
> [...]
> The user can enable idle polling with this command:
>   # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>
> And disable it via:
>   # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>

Maybe I misunderstood, but is the above command intended to put only a
specific CPU in poll mode? Can the C-state sysfs do this?

grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C1E
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C6

grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/disable
/sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
/sys/devices/system/cpu/cpu0/cpuidle/state1/disable:1
/sys/devices/system/cpu/cpu0/cpuidle/state2/disable:1
/sys/devices/system/cpu/cpu0/cpuidle/state3/disable:1

thanks,
Chenyu

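The alternative Chenyu asks about, leaving only the POLL state enabled through the existing cpuidle sysfs files, could be driven from a small C helper along these lines. This is only a sketch under assumptions: the helper names (`cpuidle_disable_path`, `sysfs_write_str`, `cpu_keep_only_poll`) are hypothetical, state0 is assumed to be POLL as in the listing above, and the writes need root on a kernel that exposes these files.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Build the path of the per-state "disable" file for one CPU. */
static int cpuidle_disable_path(char *buf, size_t len,
				unsigned int cpu, unsigned int state)
{
	return snprintf(buf, len,
			"/sys/devices/system/cpu/cpu%u/cpuidle/state%u/disable",
			cpu, state);
}

/* Write a short string to a sysfs file; returns 0 on success, -1 on error. */
static int sysfs_write_str(const char *path, const char *val)
{
	FILE *f;
	int ok;

	assert(path && val);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(val, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Disable every idle state except state0 for one CPU.
 * nr_states is supplied by the caller (4 in the listing above).
 */
static int cpu_keep_only_poll(unsigned int cpu, unsigned int nr_states)
{
	char path[128];
	unsigned int state;

	for (state = 1; state < nr_states; state++) {
		int n = cpuidle_disable_path(path, sizeof(path), cpu, state);

		if (n < 0 || (size_t)n >= sizeof(path))
			return -1;
		if (sysfs_write_str(path, "1") != 0)
			return -1;
	}
	return 0;
}
```

Calling `cpu_keep_only_poll(0, 4)` would reproduce the `disable` values shown in the listing for cpu0. Unlike the proposed `idle_poll` file, this route still goes through the cpuidle framework on every idle entry.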
On Thu, Jan 12, 2023 at 05:24:26PM +0100, Daniel Bristot de Oliveira wrote:
> [...]
> The user can enable idle polling with this command:
>   # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>
> And disable it via:
>   # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>
> By default, all CPUs have idle polling disabled (the current behavior).
> A static key avoids the CPU mask check overhead when no idle polling
> is enabled.

Urgh, can we please make this a cpuidle governor thing or so? So that we
don't need to invent new interfaces and such.

* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jan 12, 2023 at 05:24:26PM +0100, Daniel Bristot de Oliveira wrote:
> > [...]
> > By default, all CPUs have idle polling disabled (the current behavior).
> > A static key avoids the CPU mask check overhead when no idle polling
> > is enabled.
>
> Urgh, can we please make this a cpuidle governor thing or so? So that we
> don't need to invent new interfaces and such.

I think the desired property here would be to make this interface on top of
pretty much any governor. Ie. have a governor, but also a way to drop any CPU
into idle-poll, overriding that.

Thanks,

	Ingo

* Ingo Molnar <mingo@kernel.org> wrote:

> > Urgh, can we please make this a cpuidle governor thing or so? So that
> > we don't need to invent new interfaces and such.
>
> I think the desired property here would be to make this interface on top
> of pretty much any governor. Ie. have a governor, but also a way to drop
> any CPU into idle-poll, overriding that.

... with the goal of having the best governor for power efficiency by
default - but also the ability to drop a handful of CPUs into the highest
performance / lowest latency idle mode.

It's a special kind of nested policy, for workload exceptions.

Thanks,

	Ingo

On Mon, 16 Jan 2023 at 10:28, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
> > > Urgh, can we please make this a cpuidle governor thing or so? So that
> > > we don't need to invent new interfaces and such.
> >
> > I think the desired property here would be to make this interface on top
> > of pretty much any governor. Ie. have a governor, but also a way to drop
> > any CPU into idle-poll, overriding that.
>
> ... with the goal of having the best governor for power efficiency by
> default - but also the ability to drop a handful of CPUs into the highest
> performance / lowest latency idle mode.
>
> It's a special kind of nested policy, for workload exceptions.

Users can set a per-CPU latency constraint with
/sys/devices/system/cpu/cpu*/power/pm_qos_resume_latency_us
which is then used by the cpuidle governor when selecting an idle state.

The cpuidle governor should then select an idle state that matches
the wakeup latency for those CPUs, but select the most power-efficient
state for the others. Setting a low value should filter out all idle
states except the polling one.

Regards
Vincent

>
> Thanks,
>
>	Ingo

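The filtering Vincent describes can be pictured with a toy model. It is a sketch rather than the governor's actual algorithm: the `idle_state` struct, `pick_state()`, and the exit latencies in `example_states` are invented for illustration, the limit is treated as a plain upper bound in microseconds (any special values the sysfs file accepts are not modeled), and real governors also weigh predicted idle duration and target residency.

```c
#include <assert.h>
#include <stddef.h>

struct idle_state {
	const char *name;
	unsigned int exit_latency_us;
};

/* Hypothetical table, ordered from shallowest to deepest state. */
static const struct idle_state example_states[] = {
	{ "POLL", 0 },
	{ "C1",   2 },
	{ "C1E", 10 },
	{ "C6", 133 },
};

/*
 * Return the index of the deepest state whose exit latency fits the
 * per-CPU limit. State 0 (polling) is always allowed as a fallback.
 */
static size_t pick_state(const struct idle_state *states, size_t nr,
			 unsigned int limit_us)
{
	size_t i, chosen = 0;

	assert(nr > 0);
	for (i = 1; i < nr; i++) {
		if (states[i].exit_latency_us <= limit_us)
			chosen = i;
	}
	return chosen;
}
```

With this table, a limit of 1 µs leaves only POLL, 20 µs allows up to C1E, and a generous limit allows C6. The CPU still enters idle through the governor, which is the overhead the patch author wants to skip.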
On 1/16/23 10:28, Ingo Molnar wrote:
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
>>> Urgh, can we please make this a cpuidle governor thing or so? So that
>>> we don't need to invent new interfaces and such.
>>
>> I think the desired property here would be to make this interface on top
>> of pretty much any governor. Ie. have a governor, but also a way to drop
>> any CPU into idle-poll, overriding that.
>
> ... with the goal of having the best governor for power efficiency by
> default - but also the ability to drop a handful of CPUs into the highest
> performance / lowest latency idle mode.
>
> It's a special kind of nested policy, for workload exceptions.

Yep, it is for the (extreme, but existing) case in which the user wants to
skip the idle driver machinery (and the overheads involved). People use idle
poll for high-frequency trading or to avoid scheduling out a vCPU, but as
systems become more dynamic (and shared), having this option dynamic
and per-CPU is useful...

-- Daniel

> Thanks,
>
>	Ingo
>

On 1/16/23 10:51, Vincent Guittot wrote:
> On Mon, 16 Jan 2023 at 10:28, Ingo Molnar <mingo@kernel.org> wrote:
>>
>>
>> * Ingo Molnar <mingo@kernel.org> wrote:
>>
>>>> Urgh, can we please make this a cpuidle governor thing or so? So that
>>>> we don't need to invent new interfaces and such.
>>>
>>> I think the desired property here would be to make this interface on top
>>> of pretty much any governor. Ie. have a governor, but also a way to drop
>>> any CPU into idle-poll, overriding that.
>>
>> ... with the goal of having the best governor for power efficiency by
>> default - but also the ability to drop a handful of CPUs into the highest
>> performance / lowest latency idle mode.
>>
>> It's a special kind of nested policy, for workload exceptions.
>
> User can set per cpu latency constraint with
> /sys/devices/system/cpu/cpu*/power/pm_qos_resume_latency_us
> Which is then used by cpuidle governor when selecting an idle state.
> The cpuidle governor should then select the idle state that matches
> with the wakeup latency for those CPUs but select the most power
> efficient for others. Setting a low value should filter all idle
> states except the polling one

Yep, that is a possibility, but it does not always work as expected. For
example, on virtual machines the vCPU gets scheduled out, even with this
option set :-/.

-- Daniel

> Regards
> Vincent
>>
>> Thanks,
>>
>>	Ingo

On Mon, Jan 16, 2023 at 10:28:20AM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
> > > Urgh, can we please make this a cpuidle governor thing or so? So that
> > > we don't need to invent new interfaces and such.
> >
> > I think the desired property here would be to make this interface on top
> > of pretty much any governor. Ie. have a governor, but also a way to drop
> > any CPU into idle-poll, overriding that.
>
> ... with the goal of having the best governor for power efficiency by
> default - but also the ability to drop a handful of CPUs into the highest
> performance / lowest latency idle mode.

Bah, so while you can set a cpufreq gov (say performance) per cpu, you
can't do the same with cpuidle.

On 1/15/23 10:15, Ingo Molnar wrote:
>
> * Daniel Bristot de Oliveira <bristot@kernel.org> wrote:
>
>> [...]
>> The user can enable idle polling with this command:
>>   # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>>
>> And disable it via:
>>   # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>>
>> By default, all CPUs have idle polling disabled (the current behavior).
>> A static key avoids the CPU mask check overhead when no idle polling
>> is enabled.
>
> Sounds useful in general.
>
> A couple of observations:
>
> ABI: how about putting the new file into the existing
> /sys/devices/system/cpu/cpuidle/ directory - the sysfs space of cpuidle?
> Arguably this flag is an extension of it.

I tried that, but then this option would depend on CONFIG_CPU_IDLE, which...
is not always set, and idle_poll does not depend on it now... so I am not
sure if it is the best option... or am I missing something? Suggestions?

>>  extern char __cpuidle_text_start[], __cpuidle_text_end[];
>>
>> +/*
>> + * per-cpu idle polling selector.
>> + */
>> +static struct cpumask cpu_poll_mask;
>> +DEFINE_STATIC_KEY_FALSE(cpu_poll_enabled);
>> +
>> +/*
>> + * Protects the mask/static key relation.
>> + */
>> +DEFINE_MUTEX(cpu_poll_mutex);
>> +
>> +static ssize_t idle_poll_store(struct device *dev, struct device_attribute *attr,
>> +			       const char *buf, size_t count)
>> +{
>> +	int cpu = dev->id;
>> +	int retval, set;
>> +	bool val;
>> +
>> +	retval = kstrtobool(buf, &val);
>> +	if (retval)
>> +		return retval;
>> +
>> +	mutex_lock(&cpu_poll_mutex);
>> +
>> +	if (val) {
>> +		set = cpumask_test_and_set_cpu(cpu, &cpu_poll_mask);
>> +
>> +		/*
>> +		 * If the CPU was already on, do not increase the static key usage.
>> +		 */
>> +		if (!set)
>> +			static_branch_inc(&cpu_poll_enabled);
>> +	} else {
>> +		set = cpumask_test_and_clear_cpu(cpu, &cpu_poll_mask);
>> +
>> +		/*
>> +		 * If the CPU was already off, do not decrease the static key usage.
>> +		 */
>> +		if (set)
>> +			static_branch_dec(&cpu_poll_enabled);
>> +	}
>
> Nit: I think 'old_bit' or so is easier to read than a generic 'set'?

ack

>
>> +
>> +	mutex_unlock(&cpu_poll_mutex);
>
> Also, is cpu_poll_mutex locking really necessary, given that these bitops
> methods are atomic, and CPUs observe cpu_poll_enabled without taking any
> locks?

You are right, it is not needed. I will remove it.

>> +static int is_cpu_idle_poll(int cpu)
>> +{
>> +	if (static_branch_unlikely(&cpu_poll_enabled))
>> +		return cpumask_test_cpu(cpu, &cpu_poll_mask);
>> +
>> +	return 0;
>> +}
>
> static inline might be justified in this case I guess.

ack

>> @@ -51,18 +136,21 @@ __setup("hlt", cpu_idle_nopoll_setup);
>>
>>  static noinline int __cpuidle cpu_idle_poll(void)
>>  {
>> -	trace_cpu_idle(0, smp_processor_id());
>> +	int cpu = smp_processor_id();
>> +
>> +	trace_cpu_idle(0, cpu);
>>  	stop_critical_timings();
>>  	ct_idle_enter();
>>  	local_irq_enable();
>>
>>  	while (!tif_need_resched() &&
>> -	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
>> +	       (cpu_idle_force_poll || tick_check_broadcast_expired()
>> +		|| is_cpu_idle_poll(cpu)))
>>  		cpu_relax();
>>
>>  	ct_idle_exit();
>>  	start_critical_timings();
>> -	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
>> +	trace_cpu_idle(PWR_EVENT_EXIT, cpu);
>>
>>  	return 1;
>
> So I think the introduction of the 'cpu' local variable to clean up the
> flow of cpu_idle_poll() should be a separate preparatory patch, which will
> make the addition of the is_cpu_idle_poll() call a bit easier to read in
> the second patch.

Makes sense.

>>  }
>> @@ -296,7 +384,8 @@ static void do_idle(void)
>>  	 * broadcast device expired for us, we don't want to go deep
>>  	 * idle as we know that the IPI is going to arrive right away.
>>  	 */
>> -	if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
>> +	if (cpu_idle_force_poll || tick_check_broadcast_expired()
>> +	    || is_cpu_idle_poll(cpu)) {
>>  		tick_nohz_idle_restart_tick();
>>  		cpu_idle_poll();
>
> Shouldn't we check is_cpu_idle_poll() right after the cpu_idle_force_poll
> check, and before the tick_check_broadcast_expired() check?

Right.

> Shouldn't matter to the outcome, but for consistency's sake.

Maybe we can move the cpu_idle_force_poll check inside is_cpu_idle_poll()?
Because...

> Plus, if we are doing this anyway, maybe cpu_idle_force_poll could now be
> implemented as 0/all setting of cpu_poll_mask, eliminating the
> cpu_idle_force_poll flag? As a third patch on top.

I started doing it, but then I noticed some points:

 - cpu_idle_force_poll can stack, as platforms can call
   cpu_idle_poll_ctrl(true) on top of idle=poll. So we would still need
   an integer to count how many times cpu_idle_poll_ctrl(true) was called.

 - a call to cpu_idle_poll_ctrl(false) when cpu_idle_force_poll reaches 0
   cannot unset all bits from the cpu_poll_mask, because the user setup
   would be lost.

So I think that cpu_idle_force_poll is being used for two purposes:
1) the user setting via idle=poll, and 2) as a kernel facility via
cpu_idle_poll_ctrl(true/false), independent of idle=poll.

So, maybe we can make idle=poll change the initial value of the bitmask
to all 1s (with the addition that the user can now undo it), and keep
cpu_idle_force_poll for internal use?

Thanks!
-- Daniel

>
> Thanks,
>
>	Ingo

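The split Daniel proposes at the end of his reply can be sketched as a small userspace model. It is an illustration only: `idle_poll_model` and the `model_*` helpers are hypothetical names, a single 64-bit word stands in for `cpu_poll_mask`, and nothing here is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct idle_poll_model {
	int force_poll;      /* models cpu_idle_force_poll: stackable, kernel-internal */
	uint64_t user_mask;  /* models cpu_poll_mask: one bit per CPU, user-controlled */
};

/* Boot: idle=poll would only pre-set every bit of the user mask. */
static void model_boot(struct idle_poll_model *m, bool idle_eq_poll)
{
	m->force_poll = 0;
	m->user_mask = idle_eq_poll ? UINT64_MAX : 0;
}

/* Models cpu_idle_poll_ctrl(): nested enable/disable from kernel code. */
static void model_poll_ctrl(struct idle_poll_model *m, bool enable)
{
	if (enable) {
		m->force_poll++;
	} else {
		assert(m->force_poll > 0);
		m->force_poll--;
	}
}

/* Models a user write of 0/1 to one CPU's idle_poll file. */
static void model_user_store(struct idle_poll_model *m, unsigned int cpu, bool val)
{
	assert(cpu < 64);
	if (val)
		m->user_mask |= UINT64_C(1) << cpu;
	else
		m->user_mask &= ~(UINT64_C(1) << cpu);
}

/* A CPU polls if the kernel forces it or the user selected it. */
static bool model_should_poll(const struct idle_poll_model *m, unsigned int cpu)
{
	assert(cpu < 64);
	return m->force_poll > 0 || ((m->user_mask >> cpu) & 1);
}
```

Keeping the counter separate from the mask means that a balanced pair of `model_poll_ctrl(true)` and `model_poll_ctrl(false)` calls leaves the user's per-CPU choices untouched, which is the concern raised in the second bullet above.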
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f26ab2675f7d..c6ef1322d549 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -10,6 +10,91 @@
 /* Linker adds these: start and end of __cpuidle functions */
 extern char __cpuidle_text_start[], __cpuidle_text_end[];
 
+/*
+ * per-cpu idle polling selector.
+ */
+static struct cpumask cpu_poll_mask;
+DEFINE_STATIC_KEY_FALSE(cpu_poll_enabled);
+
+/*
+ * Protects the mask/static key relation.
+ */
+DEFINE_MUTEX(cpu_poll_mutex);
+
+static ssize_t idle_poll_store(struct device *dev, struct device_attribute *attr,
+			       const char *buf, size_t count)
+{
+	int cpu = dev->id;
+	int retval, set;
+	bool val;
+
+	retval = kstrtobool(buf, &val);
+	if (retval)
+		return retval;
+
+	mutex_lock(&cpu_poll_mutex);
+
+	if (val) {
+		set = cpumask_test_and_set_cpu(cpu, &cpu_poll_mask);
+
+		/*
+		 * If the CPU was already on, do not increase the static key usage.
+		 */
+		if (!set)
+			static_branch_inc(&cpu_poll_enabled);
+	} else {
+		set = cpumask_test_and_clear_cpu(cpu, &cpu_poll_mask);
+
+		/*
+		 * If the CPU was already off, do not decrease the static key usage.
+		 */
+		if (set)
+			static_branch_dec(&cpu_poll_enabled);
+	}
+
+	mutex_unlock(&cpu_poll_mutex);
+
+	return count;
+}
+
+static ssize_t idle_poll_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", cpumask_test_cpu(dev->id, &cpu_poll_mask));
+}
+
+static DEVICE_ATTR_RW(idle_poll);
+
+static const struct attribute *idle_poll_attrs[] = {
+	&dev_attr_idle_poll.attr,
+	NULL
+};
+
+static int __init idle_poll_sysfs_init(void)
+{
+	int cpu, retval;
+
+	for_each_possible_cpu(cpu) {
+		struct device *dev = get_cpu_device(cpu);
+
+		if (!dev)
+			continue;
+		retval = sysfs_create_files(&dev->kobj, idle_poll_attrs);
+		if (retval)
+			return retval;
+	}
+
+	return 0;
+}
+device_initcall(idle_poll_sysfs_init);
+
+static int is_cpu_idle_poll(int cpu)
+{
+	if (static_branch_unlikely(&cpu_poll_enabled))
+		return cpumask_test_cpu(cpu, &cpu_poll_mask);
+
+	return 0;
+}
+
 /**
  * sched_idle_set_state - Record idle state for the current CPU.
  * @idle_state: State to record.
@@ -51,18 +136,21 @@ __setup("hlt", cpu_idle_nopoll_setup);
 
 static noinline int __cpuidle cpu_idle_poll(void)
 {
-	trace_cpu_idle(0, smp_processor_id());
+	int cpu = smp_processor_id();
+
+	trace_cpu_idle(0, cpu);
 	stop_critical_timings();
 	ct_idle_enter();
 	local_irq_enable();
 
 	while (!tif_need_resched() &&
-	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
+	       (cpu_idle_force_poll || tick_check_broadcast_expired()
+		|| is_cpu_idle_poll(cpu)))
 		cpu_relax();
 
 	ct_idle_exit();
 	start_critical_timings();
-	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, cpu);
 
 	return 1;
 }
@@ -296,7 +384,8 @@ static void do_idle(void)
 	 * broadcast device expired for us, we don't want to go deep
 	 * idle as we know that the IPI is going to arrive right away.
 	 */
-	if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+	if (cpu_idle_force_poll || tick_check_broadcast_expired()
+	    || is_cpu_idle_poll(cpu)) {
 		tick_nohz_idle_restart_tick();
 		cpu_idle_poll();
 	} else {
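
The relation between `cpu_poll_mask` and the `cpu_poll_enabled` static key in the patch above can be modeled in plain userspace C. This is a simplified sketch, not the kernel implementation: `poll_model`, `poll_model_store()` and `poll_model_is_poll()` are hypothetical names, a 64-bit word replaces the cpumask, an integer replaces the static key's enable count, and there is no concurrency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 64	/* hypothetical: one bit per CPU in a 64-bit word */

/* Userspace stand-ins for cpu_poll_mask and the static key's enable count. */
struct poll_model {
	uint64_t mask;	/* bit N set => CPU N polls when idle */
	int key_count;	/* models static_branch_inc()/static_branch_dec() */
};

/* Models a write of 0 or 1 to the per-CPU idle_poll file. */
static void poll_model_store(struct poll_model *m, unsigned int cpu, bool val)
{
	uint64_t bit;
	bool was_set;

	assert(cpu < MODEL_NR_CPUS);
	bit = UINT64_C(1) << cpu;
	was_set = (m->mask & bit) != 0;

	if (val && !was_set) {
		/* 0 -> 1 transition: take one reference on the key. */
		m->mask |= bit;
		m->key_count++;
	} else if (!val && was_set) {
		/* 1 -> 0 transition: drop the reference again. */
		m->mask &= ~bit;
		m->key_count--;
	}
	/* Writing the value a CPU already has changes nothing. */
}

/* Models is_cpu_idle_poll(): the mask is consulted only while the key is on. */
static bool poll_model_is_poll(const struct poll_model *m, unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	if (m->key_count > 0)
		return (m->mask >> cpu) & 1;
	return false;
}
```

The invariant the model keeps is that `key_count` equals the number of bits set in `mask`, so the key is off exactly when no CPU has idle polling enabled. In the kernel the disabled key is a patched-out branch, which is why the idle loop pays nothing for the feature until someone writes 1 to an `idle_poll` file.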