sched/idle: Make idle poll dynamic per-cpu

Message ID 20230112162426.217522-1-bristot@kernel.org
State New
Headers
Series sched/idle: Make idle poll dynamic per-cpu |

Commit Message

Daniel Bristot de Oliveira Jan. 12, 2023, 4:24 p.m. UTC
  idle=poll is frequently used on ultra-low-latency systems. Examples of
such systems are high-performance trading and 5G NVRAM. The performance
gain is given by avoiding the idle driver machinery and by keeping the
CPU is always in an active state - avoiding (odd) hardware heuristics that
are out of the control of the OS.

Currently, idle=poll is an all-or-nothing static option defined at
boot time. The motivation for creating this option dynamic and per-cpu
are two:

  1) Reduce the power usage/heat by allowing only selected CPUs to
     do idle polling;
  2) Allow multi-tenant systems (e.g., Kubernetes) to enable idle
     polling only when ultra-low-latency applications are present
     on specific CPUs.

Joe Mario did some experiments with this option enabled, and the results
were significant. For example, by using dynamic idle polling on
selected CPUs, cyclictest performance is optimal (like when using
idle=poll), but cpu power consumption drops from 381 to 233 watts.

Also, limiting idle=poll to the set of CPUs that benefits from
it allows other CPUs to benefit from frequency boosts. Joe also
shows that the results can be in the order of 80nsec round trip
improvement when system-wide idle=poll was not used.

The user can enable idle polling with this command:
  # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll

And disable it via:
  # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll

By default, all CPUs have idle polling disabled (the current behavior).
A static key avoids the CPU mask check overhead when no idle polling
is enabled.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Joe Mario <jmario@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/idle.c | 97 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 93 insertions(+), 4 deletions(-)
  

Comments

Ingo Molnar Jan. 15, 2023, 9:15 a.m. UTC | #1
* Daniel Bristot de Oliveira <bristot@kernel.org> wrote:

> idle=poll is frequently used on ultra-low-latency systems. Examples of
> such systems are high-performance trading and 5G NVRAM. The performance
> gain is given by avoiding the idle driver machinery and by keeping the
> CPU is always in an active state - avoiding (odd) hardware heuristics that
> are out of the control of the OS.
> 
> Currently, idle=poll is an all-or-nothing static option defined at
> boot time. The motivation for creating this option dynamic and per-cpu
> are two:
> 
>   1) Reduce the power usage/heat by allowing only selected CPUs to
>      do idle polling;
>   2) Allow multi-tenant systems (e.g., Kubernetes) to enable idle
>      polling only when ultra-low-latency applications are present
>      on specific CPUs.
> 
> Joe Mario did some experiments with this option enabled, and the results
> were significant. For example, by using dynamic idle polling on
> selected CPUs, cyclictest performance is optimal (like when using
> idle=poll), but cpu power consumption drops from 381 to 233 watts.
> 
> Also, limiting idle=poll to the set of CPUs that benefits from
> it allows other CPUs to benefit from frequency boosts. Joe also
> shows that the results can be in the order of 80nsec round trip
> improvement when system-wide idle=poll was not used.
> 
> The user can enable idle polling with this command:
>   # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
> 
> And disable it via:
>   # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
> 
> By default, all CPUs have idle polling disabled (the current behavior).
> A static key avoids the CPU mask check overhead when no idle polling
> is enabled.

Sounds useful in general.

A couple of observations:

ABI: how about putting the new file into the existing 
/sys/devices/system/cpu/cpuidle/ directory - the sysfs space of cpuidle? 
Arguably this flag is an extension of it.


>  extern char __cpuidle_text_start[], __cpuidle_text_end[];
>  
> +/*
> + * per-cpu idle polling selector.
> + */
> +static struct cpumask cpu_poll_mask;
> +DEFINE_STATIC_KEY_FALSE(cpu_poll_enabled);
> +
> +/*
> + * Protects the mask/static key relation.
> + */
> +DEFINE_MUTEX(cpu_poll_mutex);
> +
> +static ssize_t idle_poll_store(struct device *dev, struct device_attribute *attr,
> +			       const char *buf, size_t count)
> +{
> +	int cpu = dev->id;
> +	int retval, set;
> +	bool val;
> +
> +	retval = kstrtobool(buf, &val);
> +	if (retval)
> +		return retval;
> +
> +	mutex_lock(&cpu_poll_mutex);
> +
> +	if (val) {
> +		set = cpumask_test_and_set_cpu(cpu, &cpu_poll_mask);
> +
> +		/*
> +		 * If the CPU was already on, do not increase the static key usage.
> +		 */
> +		if (!set)
> +			static_branch_inc(&cpu_poll_enabled);
> +	} else {
> +		set = cpumask_test_and_clear_cpu(cpu, &cpu_poll_mask);
> +
> +		/*
> +		 * If the CPU was already off, do not decrease the static key usage.
> +		 */
> +		if (set)
> +			static_branch_dec(&cpu_poll_enabled);
> +	}

Nit: I think 'old_bit' or so is easier to read than a generic 'set'?


> +
> +	mutex_unlock(&cpu_poll_mutex);

Also, is cpu_poll_mutex locking really necessary, given that these bitops 
methods are atomic, and CPUs observe cpu_poll_enabled without taking any 
locks?

> +static int is_cpu_idle_poll(int cpu)
> +{
> +	if (static_branch_unlikely(&cpu_poll_enabled))
> +		return cpumask_test_cpu(cpu, &cpu_poll_mask);
> +
> +	return 0;
> +}

static inline might be justified in this case I guess.

> @@ -51,18 +136,21 @@ __setup("hlt", cpu_idle_nopoll_setup);
>  
>  static noinline int __cpuidle cpu_idle_poll(void)
>  {
> -	trace_cpu_idle(0, smp_processor_id());
> +	int cpu = smp_processor_id();
> +
> +	trace_cpu_idle(0, cpu);
>  	stop_critical_timings();
>  	ct_idle_enter();
>  	local_irq_enable();
>  
>  	while (!tif_need_resched() &&
> -	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
> +	       (cpu_idle_force_poll || tick_check_broadcast_expired()
> +		|| is_cpu_idle_poll(cpu)))
>  		cpu_relax();
>  
>  	ct_idle_exit();
>  	start_critical_timings();
> -	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
> +	trace_cpu_idle(PWR_EVENT_EXIT, cpu);
>  
>  	return 1;

So I think the introduction of the 'cpu' local variable to clean up the 
flow of cpu_idle_poll() should be a separate preparatory patch, which will 
make the addition of the is_cpu_idle_poll() call a bit easier to read in 
the second patch.

>  }
> @@ -296,7 +384,8 @@ static void do_idle(void)
>  		 * broadcast device expired for us, we don't want to go deep
>  		 * idle as we know that the IPI is going to arrive right away.
>  		 */
> -		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> +		if (cpu_idle_force_poll || tick_check_broadcast_expired()
> +		    || is_cpu_idle_poll(cpu)) {
>  			tick_nohz_idle_restart_tick();
>  			cpu_idle_poll();

Shouldn't we check is_cpu_idle_poll() right after the cpu_idle_force_poll 
check, and before the tick_check_broadcast_expired() check?

Shouldn't matter to the outcome, but for consistency's sake.

Plus, if we are doing this anyway, maybe cpu_idle_force_poll could now be 
implemented as 0/all setting of cpu_poll_mask, eliminating the 
cpu_idle_force_poll flag? As a third patch on top.

Thanks,

	Ingo
  
Chen Yu Jan. 16, 2023, 1:43 a.m. UTC | #2
Hi Daniel,
On 2023-01-12 at 17:24:26 +0100, Daniel Bristot de Oliveira wrote:
> idle=poll is frequently used on ultra-low-latency systems. Examples of
> such systems are high-performance trading and 5G NVRAM. The performance
> gain is given by avoiding the idle driver machinery and by keeping the
> CPU is always in an active state - avoiding (odd) hardware heuristics that
> are out of the control of the OS.
> 
> Currently, idle=poll is an all-or-nothing static option defined at
> boot time. The motivation for creating this option dynamic and per-cpu
> are two:
> 
>   1) Reduce the power usage/heat by allowing only selected CPUs to
>      do idle polling;
>   2) Allow multi-tenant systems (e.g., Kubernetes) to enable idle
>      polling only when ultra-low-latency applications are present
>      on specific CPUs.
> 
> Joe Mario did some experiments with this option enabled, and the results
> were significant. For example, by using dynamic idle polling on
> selected CPUs, cyclictest performance is optimal (like when using
> idle=poll), but cpu power consumption drops from 381 to 233 watts.
> 
> Also, limiting idle=poll to the set of CPUs that benefits from
> it allows other CPUs to benefit from frequency boosts. Joe also
> shows that the results can be in the order of 80nsec round trip
> improvement when system-wide idle=poll was not used.
> 
> The user can enable idle polling with this command:
>   # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
> 
> And disable it via:
>   # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>
Maybe I understood it incorrectly, is above command intended to
put specific CPU only in poll mode? Can the c-state sysfs
do this?

grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name 
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C1E
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C6

grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/disable
/sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
/sys/devices/system/cpu/cpu0/cpuidle/state1/disable:1
/sys/devices/system/cpu/cpu0/cpuidle/state2/disable:1
/sys/devices/system/cpu/cpu0/cpuidle/state3/disable:1
 
thanks,
Chenyu
  
Peter Zijlstra Jan. 16, 2023, 8:53 a.m. UTC | #3
On Thu, Jan 12, 2023 at 05:24:26PM +0100, Daniel Bristot de Oliveira wrote:
> idle=poll is frequently used on ultra-low-latency systems. Examples of
> such systems are high-performance trading and 5G NVRAM. The performance
> gain is given by avoiding the idle driver machinery and by keeping the
> CPU is always in an active state - avoiding (odd) hardware heuristics that
> are out of the control of the OS.
> 
> Currently, idle=poll is an all-or-nothing static option defined at
> boot time. The motivation for creating this option dynamic and per-cpu
> are two:
> 
>   1) Reduce the power usage/heat by allowing only selected CPUs to
>      do idle polling;
>   2) Allow multi-tenant systems (e.g., Kubernetes) to enable idle
>      polling only when ultra-low-latency applications are present
>      on specific CPUs.
> 
> Joe Mario did some experiments with this option enabled, and the results
> were significant. For example, by using dynamic idle polling on
> selected CPUs, cyclictest performance is optimal (like when using
> idle=poll), but cpu power consumption drops from 381 to 233 watts.
> 
> Also, limiting idle=poll to the set of CPUs that benefits from
> it allows other CPUs to benefit from frequency boosts. Joe also
> shows that the results can be in the order of 80nsec round trip
> improvement when system-wide idle=poll was not used.
> 
> The user can enable idle polling with this command:
>   # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
> 
> And disable it via:
>   # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
> 
> By default, all CPUs have idle polling disabled (the current behavior).
> A static key avoids the CPU mask check overhead when no idle polling
> is enabled.

Urgh, can we please make this a cpuidle governor thing or so? So that we
don't need to invent new interfaces and such.
  
Ingo Molnar Jan. 16, 2023, 9:02 a.m. UTC | #4
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jan 12, 2023 at 05:24:26PM +0100, Daniel Bristot de Oliveira wrote:
> > idle=poll is frequently used on ultra-low-latency systems. Examples of
> > such systems are high-performance trading and 5G NVRAM. The performance
> > gain is given by avoiding the idle driver machinery and by keeping the
> > CPU is always in an active state - avoiding (odd) hardware heuristics that
> > are out of the control of the OS.
> > 
> > Currently, idle=poll is an all-or-nothing static option defined at
> > boot time. The motivation for creating this option dynamic and per-cpu
> > are two:
> > 
> >   1) Reduce the power usage/heat by allowing only selected CPUs to
> >      do idle polling;
> >   2) Allow multi-tenant systems (e.g., Kubernetes) to enable idle
> >      polling only when ultra-low-latency applications are present
> >      on specific CPUs.
> > 
> > Joe Mario did some experiments with this option enabled, and the results
> > were significant. For example, by using dynamic idle polling on
> > selected CPUs, cyclictest performance is optimal (like when using
> > idle=poll), but cpu power consumption drops from 381 to 233 watts.
> > 
> > Also, limiting idle=poll to the set of CPUs that benefits from
> > it allows other CPUs to benefit from frequency boosts. Joe also
> > shows that the results can be in the order of 80nsec round trip
> > improvement when system-wide idle=poll was not used.
> > 
> > The user can enable idle polling with this command:
> >   # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
> > 
> > And disable it via:
> >   # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
> > 
> > By default, all CPUs have idle polling disabled (the current behavior).
> > A static key avoids the CPU mask check overhead when no idle polling
> > is enabled.
> 
> Urgh, can we please make this a cpuidle governor thing or so? So that we
> don't need to invent new interfaces and such.

I think the desired property here would be to make this interface on top of 
pretty much any governor. Ie. have a governor, but also a way to drop any 
CPU into idle-poll, overriding that.

Thanks,

	Ingo
  
Ingo Molnar Jan. 16, 2023, 9:28 a.m. UTC | #5
* Ingo Molnar <mingo@kernel.org> wrote:

> > Urgh, can we please make this a cpuidle governor thing or so? So that 
> > we don't need to invent new interfaces and such.
> 
> I think the desired property here would be to make this interface on top 
> of pretty much any governor. Ie. have a governor, but also a way to drop 
> any CPU into idle-poll, overriding that.

... with the goal of having the best governor for power efficiency by 
default - but also the ability to drop a handful of CPUs into the highest 
performance / lowest latency idle mode.

It's a special kind of nested policy, for workload exceptions.

Thanks,

	Ingo
  
Vincent Guittot Jan. 16, 2023, 9:51 a.m. UTC | #6
On Mon, 16 Jan 2023 at 10:28, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
> > > Urgh, can we please make this a cpuidle governor thing or so? So that
> > > we don't need to invent new interfaces and such.
> >
> > I think the desired property here would be to make this interface on top
> > of pretty much any governor. Ie. have a governor, but also a way to drop
> > any CPU into idle-poll, overriding that.
>
> ... with the goal of having the best governor for power efficiency by
> default - but also the ability to drop a handful of CPUs into the highest
> performance / lowest latency idle mode.
>
> It's a special kind of nested policy, for workload exceptions.

User can set per cpu latency constraint with
/sys/devices/system/cpu/cpu*/power/pm_qos_resume_latency_us
Which is then used by cpuidle governor when selecting an idle state.
The cpuidle governor should then select the idle state that matches
with the wakeup latency for those CPUs but select the most power
efficient for others. Setting a low value should filter all idle
states except the polling one

Regards
Vincent
>
> Thanks,
>
>         Ingo
  
Daniel Bristot de Oliveira Jan. 16, 2023, 10:06 a.m. UTC | #7
On 1/16/23 10:28, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
>>> Urgh, can we please make this a cpuidle governor thing or so? So that 
>>> we don't need to invent new interfaces and such.
>>
>> I think the desired property here would be to make this interface on top 
>> of pretty much any governor. Ie. have a governor, but also a way to drop 
>> any CPU into idle-poll, overriding that.
> 
> ... with the goal of having the best governor for power efficiency by 
> default - but also the ability to drop a handful of CPUs into the highest 
> performance / lowest latency idle mode.
> 
> It's a special kind of nested policy, for workload exceptions.

Yep, it is for the (extreme, but existing) case in which the user wants to skip idle driver
machinery (and overheads involved).

People use idle poll on high-frequency trading or to avoid scheduling out a vCPU,
but as the systems are becoming more dynamic (and shared), having this option dynamic
and per-cpu is useful...

-- Daniel
> Thanks,
> 
> 	Ingo
>
  
Daniel Bristot de Oliveira Jan. 16, 2023, 10:11 a.m. UTC | #8
On 1/16/23 10:51, Vincent Guittot wrote:
> On Mon, 16 Jan 2023 at 10:28, Ingo Molnar <mingo@kernel.org> wrote:
>>
>>
>> * Ingo Molnar <mingo@kernel.org> wrote:
>>
>>>> Urgh, can we please make this a cpuidle governor thing or so? So that
>>>> we don't need to invent new interfaces and such.
>>>
>>> I think the desired property here would be to make this interface on top
>>> of pretty much any governor. Ie. have a governor, but also a way to drop
>>> any CPU into idle-poll, overriding that.
>>
>> ... with the goal of having the best governor for power efficiency by
>> default - but also the ability to drop a handful of CPUs into the highest
>> performance / lowest latency idle mode.
>>
>> It's a special kind of nested policy, for workload exceptions.
> 
> User can set per cpu latency constraint with
> /sys/devices/system/cpu/cpu*/power/pm_qos_resume_latency_us
> Which is then used by cpuidle governor when selecting an idle state.
> The cpuidle governor should then select the idle state that matches
> with the wakeup latency for those CPUs but select the most power
> efficient for others. Setting a low value should filter all idle
> states except the polling one

Yep, that is a possibility, but it does not always work as expected. For example,
on virtual machines the vCPU gets scheduled out, even with this option set :-/.

-- Daniel

> Regards
> Vincent
>>
>> Thanks,
>>
>>         Ingo
  
Peter Zijlstra Jan. 16, 2023, 11:54 a.m. UTC | #9
On Mon, Jan 16, 2023 at 10:28:20AM +0100, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
> > > Urgh, can we please make this a cpuidle governor thing or so? So that 
> > > we don't need to invent new interfaces and such.
> > 
> > I think the desired property here would be to make this interface on top 
> > of pretty much any governor. Ie. have a governor, but also a way to drop 
> > any CPU into idle-poll, overriding that.
> 
> ... with the goal of having the best governor for power efficiency by 
> default - but also the ability to drop a handful of CPUs into the highest 
> performance / lowest latency idle mode.

Bah, so while you can set a cpufreq gov (say performance) per cpu,
you can't do the same with cpuidle.
  
Daniel Bristot de Oliveira Jan. 17, 2023, 11:20 a.m. UTC | #10
On 1/15/23 10:15, Ingo Molnar wrote:
> 
> * Daniel Bristot de Oliveira <bristot@kernel.org> wrote:
> 
>> idle=poll is frequently used on ultra-low-latency systems. Examples of
>> such systems are high-performance trading and 5G NVRAM. The performance
>> gain is given by avoiding the idle driver machinery and by keeping the
>> CPU is always in an active state - avoiding (odd) hardware heuristics that
>> are out of the control of the OS.
>>
>> Currently, idle=poll is an all-or-nothing static option defined at
>> boot time. The motivation for creating this option dynamic and per-cpu
>> are two:
>>
>>   1) Reduce the power usage/heat by allowing only selected CPUs to
>>      do idle polling;
>>   2) Allow multi-tenant systems (e.g., Kubernetes) to enable idle
>>      polling only when ultra-low-latency applications are present
>>      on specific CPUs.
>>
>> Joe Mario did some experiments with this option enabled, and the results
>> were significant. For example, by using dynamic idle polling on
>> selected CPUs, cyclictest performance is optimal (like when using
>> idle=poll), but cpu power consumption drops from 381 to 233 watts.
>>
>> Also, limiting idle=poll to the set of CPUs that benefits from
>> it allows other CPUs to benefit from frequency boosts. Joe also
>> shows that the results can be in the order of 80nsec round trip
>> improvement when system-wide idle=poll was not used.
>>
>> The user can enable idle polling with this command:
>>   # echo 1 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>>
>> And disable it via:
>>   # echo 0 > /sys/devices/system/cpu/cpu{CPU_ID}/idle_poll
>>
>> By default, all CPUs have idle polling disabled (the current behavior).
>> A static key avoids the CPU mask check overhead when no idle polling
>> is enabled.
> 
> Sounds useful in general.
> 
> A couple of observations:
> 
> ABI: how about putting the new file into the existing 
> /sys/devices/system/cpu/cpuidle/ directory - the sysfs space of cpuidle? 
> Arguably this flag is an extension of it.
> 

I tried that, but then this option will depend on CONFIG_CPU_IDLE, which... is not
away set, and idle_poll does not depend on now... so I am not sure if it is
the best option... or am I missing something? suggestions?

>>  extern char __cpuidle_text_start[], __cpuidle_text_end[];
>>  
>> +/*
>> + * per-cpu idle polling selector.
>> + */
>> +static struct cpumask cpu_poll_mask;
>> +DEFINE_STATIC_KEY_FALSE(cpu_poll_enabled);
>> +
>> +/*
>> + * Protects the mask/static key relation.
>> + */
>> +DEFINE_MUTEX(cpu_poll_mutex);
>> +
>> +static ssize_t idle_poll_store(struct device *dev, struct device_attribute *attr,
>> +			       const char *buf, size_t count)
>> +{
>> +	int cpu = dev->id;
>> +	int retval, set;
>> +	bool val;
>> +
>> +	retval = kstrtobool(buf, &val);
>> +	if (retval)
>> +		return retval;
>> +
>> +	mutex_lock(&cpu_poll_mutex);
>> +
>> +	if (val) {
>> +		set = cpumask_test_and_set_cpu(cpu, &cpu_poll_mask);
>> +
>> +		/*
>> +		 * If the CPU was already on, do not increase the static key usage.
>> +		 */
>> +		if (!set)
>> +			static_branch_inc(&cpu_poll_enabled);
>> +	} else {
>> +		set = cpumask_test_and_clear_cpu(cpu, &cpu_poll_mask);
>> +
>> +		/*
>> +		 * If the CPU was already off, do not decrease the static key usage.
>> +		 */
>> +		if (set)
>> +			static_branch_dec(&cpu_poll_enabled);
>> +	}
> 
> Nit: I think 'old_bit' or so is easier to read than a generic 'set'?

ack

> 
>> +
>> +	mutex_unlock(&cpu_poll_mutex);
> 
> Also, is cpu_poll_mutex locking really necessary, given that these bitops 
> methods are atomic, and CPUs observe cpu_poll_enabled without taking any 
> locks?

you are right, it is not needed. I will remove it.

>> +static int is_cpu_idle_poll(int cpu)
>> +{
>> +	if (static_branch_unlikely(&cpu_poll_enabled))
>> +		return cpumask_test_cpu(cpu, &cpu_poll_mask);
>> +
>> +	return 0;
>> +}
> 
> static inline might be justified in this case I guess.

ack

>> @@ -51,18 +136,21 @@ __setup("hlt", cpu_idle_nopoll_setup);
>>  
>>  static noinline int __cpuidle cpu_idle_poll(void)
>>  {
>> -	trace_cpu_idle(0, smp_processor_id());
>> +	int cpu = smp_processor_id();
>> +
>> +	trace_cpu_idle(0, cpu);
>>  	stop_critical_timings();
>>  	ct_idle_enter();
>>  	local_irq_enable();
>>  
>>  	while (!tif_need_resched() &&
>> -	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
>> +	       (cpu_idle_force_poll || tick_check_broadcast_expired()
>> +		|| is_cpu_idle_poll(cpu)))
>>  		cpu_relax();
>>  
>>  	ct_idle_exit();
>>  	start_critical_timings();
>> -	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
>> +	trace_cpu_idle(PWR_EVENT_EXIT, cpu);
>>  
>>  	return 1;
> 
> So I think the introduction of the 'cpu' local variable to clean up the 
> flow of cpu_idle_poll() should be a separate preparatory patch, which will 
> make the addition of the is_cpu_idle_poll() call a bit easier to read in 
> the second patch.

Makes sense.

>>  }
>> @@ -296,7 +384,8 @@ static void do_idle(void)
>>  		 * broadcast device expired for us, we don't want to go deep
>>  		 * idle as we know that the IPI is going to arrive right away.
>>  		 */
>> -		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
>> +		if (cpu_idle_force_poll || tick_check_broadcast_expired()
>> +		    || is_cpu_idle_poll(cpu)) {
>>  			tick_nohz_idle_restart_tick();
>>  			cpu_idle_poll();
> 
> Shouldn't we check is_cpu_idle_poll() right after the cpu_idle_force_poll 
> check, and before the tick_check_broadcast_expired() check?

Right.

> Shouldn't matter to the outcome, but for consistency's sake.

Maybe, we can move the cpu_idle_force_poll check inside cpu_idle_force_poll()?

because...

> Plus, if we are doing this anyway, maybe cpu_idle_force_poll could now be 
> implemented as 0/all setting of cpu_poll_mask, eliminating the 
> cpu_idle_force_poll flag? As a third patch on top.

I started doing it, but then I noticed some points:

- the cpu_idle_force_poll can stack, as platforms can call cpu_idle_poll_ctrl(true)
on top of idle=poll. So we would still need an integer to count how many times the
cpu_idle_force_poll was called.

- call to cpu_idle_poll_ctrl(false) when cpu_idle_force_poll reaches 0 cannot
unset all bits from the cpu_poll_mask because the user setup would be lost.

So I think that cpu_idle_force_poll is being used for two purposes: 1) user setting
via idle=poll, and 2) as a kernel facility via cpu_idle_poll_ctrl(true/false) other
than idle=poll.

So, maybe we can make idle=poll to change the initial value of the bitmask to
all 1 (with the addition that the user can now undo it), and keep cpu_idle_force_poll
for internal use?

Thanks!
-- Daniel
> 
> Thanks,
> 
> 	Ingo
  

Patch

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f26ab2675f7d..c6ef1322d549 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -10,6 +10,91 @@ 
 /* Linker adds these: start and end of __cpuidle functions */
 extern char __cpuidle_text_start[], __cpuidle_text_end[];
 
+/*
+ * per-cpu idle polling selector.
+ */
+static struct cpumask cpu_poll_mask;
+DEFINE_STATIC_KEY_FALSE(cpu_poll_enabled);
+
+/*
+ * Protects the mask/static key relation.
+ */
+DEFINE_MUTEX(cpu_poll_mutex);
+
+static ssize_t idle_poll_store(struct device *dev, struct device_attribute *attr,
+			       const char *buf, size_t count)
+{
+	int cpu = dev->id;
+	int retval, set;
+	bool val;
+
+	retval = kstrtobool(buf, &val);
+	if (retval)
+		return retval;
+
+	mutex_lock(&cpu_poll_mutex);
+
+	if (val) {
+		set = cpumask_test_and_set_cpu(cpu, &cpu_poll_mask);
+
+		/*
+		 * If the CPU was already on, do not increase the static key usage.
+		 */
+		if (!set)
+			static_branch_inc(&cpu_poll_enabled);
+	} else {
+		set = cpumask_test_and_clear_cpu(cpu, &cpu_poll_mask);
+
+		/*
+		 * If the CPU was already off, do not decrease the static key usage.
+		 */
+		if (set)
+			static_branch_dec(&cpu_poll_enabled);
+	}
+
+	mutex_unlock(&cpu_poll_mutex);
+
+	return count;
+}
+
+static ssize_t idle_poll_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", cpumask_test_cpu(dev->id, &cpu_poll_mask));
+}
+
+static DEVICE_ATTR_RW(idle_poll);
+
+static const struct attribute *idle_poll_attrs[] = {
+	&dev_attr_idle_poll.attr,
+	NULL
+};
+
+static int __init idle_poll_sysfs_init(void)
+{
+	int cpu, retval;
+
+	for_each_possible_cpu(cpu) {
+		struct device *dev = get_cpu_device(cpu);
+
+		if (!dev)
+			continue;
+		retval = sysfs_create_files(&dev->kobj, idle_poll_attrs);
+		if (retval)
+			return retval;
+	}
+
+	return 0;
+}
+device_initcall(idle_poll_sysfs_init);
+
+static int is_cpu_idle_poll(int cpu)
+{
+	if (static_branch_unlikely(&cpu_poll_enabled))
+		return cpumask_test_cpu(cpu, &cpu_poll_mask);
+
+	return 0;
+}
+
 /**
  * sched_idle_set_state - Record idle state for the current CPU.
  * @idle_state: State to record.
@@ -51,18 +136,21 @@  __setup("hlt", cpu_idle_nopoll_setup);
 
 static noinline int __cpuidle cpu_idle_poll(void)
 {
-	trace_cpu_idle(0, smp_processor_id());
+	int cpu = smp_processor_id();
+
+	trace_cpu_idle(0, cpu);
 	stop_critical_timings();
 	ct_idle_enter();
 	local_irq_enable();
 
 	while (!tif_need_resched() &&
-	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
+	       (cpu_idle_force_poll || tick_check_broadcast_expired()
+		|| is_cpu_idle_poll(cpu)))
 		cpu_relax();
 
 	ct_idle_exit();
 	start_critical_timings();
-	trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+	trace_cpu_idle(PWR_EVENT_EXIT, cpu);
 
 	return 1;
 }
@@ -296,7 +384,8 @@  static void do_idle(void)
 		 * broadcast device expired for us, we don't want to go deep
 		 * idle as we know that the IPI is going to arrive right away.
 		 */
-		if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+		if (cpu_idle_force_poll || tick_check_broadcast_expired()
+		    || is_cpu_idle_poll(cpu)) {
 			tick_nohz_idle_restart_tick();
 			cpu_idle_poll();
 		} else {