[v2,2/4] powercap: idle_inject: Add prepare/complete callbacks

Message ID 20221129233419.4022830-3-srinivas.pandruvada@linux.intel.com
State New
Headers
Series Per CPU idle injection |

Commit Message

srinivas pandruvada Nov. 29, 2022, 11:34 p.m. UTC
  The actual idle percentage can be less than the desired because of
interrupts. Since the objective for CPU Idle injection is for thermal
control, there should be some way to compensate for lost idle percentage.
Some architectures provide interface to get actual idle percent observed
by the hardware. So, the idle percent can be adjusted using the hardware
feedback. For example, Intel CPUs provides package idle counters, which
is currently used by intel powerclamp driver to adjust idle time.

The only way this can be done currently is by monitoring hardware idle
percent from a different software thread. This can be avoided by adding
callbacks.

Add a capability to register a prepare and complete callback during idle
inject registry. Add a new register function idle_inject_register_full()
which also allows to register callbacks.

If they are not NULL, then prepare callback is called before calling
play_idle_precise() and complete callback is called after calling
play_idle_precise().

If prepare callback is present and returns non 0 value then
play_idle_precise() is not called to avoid over compensation.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
v2
- Replace begin/end with prepare/complete
- Add new interface idle_inject_register_full with callbacks
- Update kernel doc
- Update commit description

 drivers/powercap/idle_inject.c | 62 +++++++++++++++++++++++++++++++---
 include/linux/idle_inject.h    |  4 +++
 2 files changed, 62 insertions(+), 4 deletions(-)
  

Comments

Daniel Lezcano Dec. 21, 2022, 2:52 p.m. UTC | #1
Hi Srinivas,

On 30/11/2022 00:34, Srinivas Pandruvada wrote:
> The actual idle percentage can be less than the desired because of
> interrupts. Since the objective for CPU Idle injection is for thermal
> control, there should be some way to compensate for lost idle percentage.
> Some architectures provide interface to get actual idle percent observed
> by the hardware. So, the idle percent can be adjusted using the hardware
> feedback. For example, Intel CPUs provides package idle counters, which
> is currently used by intel powerclamp driver to adjust idle time.
Can you provide an example in terms of timings?

I'm not getting how 'prepare' would do by returning a positive value to 
skip the play_idle_precise() and what will do 'complete' ?


> The only way this can be done currently is by monitoring hardware idle
> percent from a different software thread. This can be avoided by adding
> callbacks.
> 
> Add a capability to register a prepare and complete callback during idle
> inject registry. Add a new register function idle_inject_register_full()
> which also allows to register callbacks.
> 
> If they are not NULL, then prepare callback is called before calling
> play_idle_precise() and complete callback is called after calling
> play_idle_precise().
> 
> If prepare callback is present and returns non 0 value then
> play_idle_precise() is not called to avoid over compensation.
> 
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
> v2
> - Replace begin/end with prepare/complete
> - Add new interface idle_inject_register_full with callbacks
> - Update kernel doc
> - Update commit description
> 
>   drivers/powercap/idle_inject.c | 62 +++++++++++++++++++++++++++++++---
>   include/linux/idle_inject.h    |  4 +++
>   2 files changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/powercap/idle_inject.c b/drivers/powercap/idle_inject.c
> index dfa989182e71..f48e71501429 100644
> --- a/drivers/powercap/idle_inject.c
> +++ b/drivers/powercap/idle_inject.c
> @@ -63,13 +63,31 @@ struct idle_inject_thread {
>    * @idle_duration_us: duration of CPU idle time to inject
>    * @run_duration_us: duration of CPU run time to allow
>    * @latency_us: max allowed latency
> + * @prepare: Callback function which is called before calling
> + *		play_idle_precise()
> + * @complete: Callback function which is called after calling
> + *		play_idle_precise()
>    * @cpumask: mask of CPUs affected by idle injection
> + *
> + * This structure is used to define per instance idle inject device data. Each
> + * instance has an idle duration, a run duration and mask of CPUs to inject
> + * idle.
> + * Actual idle is injected by calling kernel scheduler interface
> + * play_idle_precise(). There are two optional callbacks which the caller can
> + * register by calling idle_inject_register_full():
> + * prepare() - This callback is called just before calling play_idle_precise()
> + *		If this callback returns non zero value then
> + *		play_idle_precise() is not called. This means skip injecting
> + *		idle during this period.
> + * complete() - This callback is called after calling play_idle_precise().
>    */
>   struct idle_inject_device {
>   	struct hrtimer timer;
>   	unsigned int idle_duration_us;
>   	unsigned int run_duration_us;
>   	unsigned int latency_us;
> +	int (*prepare)(unsigned int cpu);
> +	void (*complete)(unsigned int cpu);
>   	unsigned long cpumask[];
>   };
>   
> @@ -132,6 +150,7 @@ static void idle_inject_fn(unsigned int cpu)
>   {
>   	struct idle_inject_device *ii_dev;
>   	struct idle_inject_thread *iit;
> +	int ret;
>   
>   	ii_dev = per_cpu(idle_inject_device, cpu);
>   	iit = per_cpu_ptr(&idle_inject_thread, cpu);
> @@ -141,8 +160,18 @@ static void idle_inject_fn(unsigned int cpu)
>   	 */
>   	iit->should_run = 0;
>   
> +	if (ii_dev->prepare) {
> +		ret = ii_dev->prepare(cpu);
> +		if (ret)
> +			goto skip;
> +	}
> +
>   	play_idle_precise(READ_ONCE(ii_dev->idle_duration_us) * NSEC_PER_USEC,
>   			  READ_ONCE(ii_dev->latency_us) * NSEC_PER_USEC);
> +
> +skip:
> +	if (ii_dev->complete)
> +		ii_dev->complete(cpu);
>   }
>   
>   /**
> @@ -295,17 +324,23 @@ static int idle_inject_should_run(unsigned int cpu)
>   }
>   
>   /**
> - * idle_inject_register - initialize idle injection on a set of CPUs
> + * idle_inject_register_full - initialize idle injection on a set of CPUs
>    * @cpumask: CPUs to be affected by idle injection
> + * @prepare: callback called before calling play_idle_precise()
> + * @complete: callback called after calling play_idle_precise()
>    *
>    * This function creates an idle injection control device structure for the
> - * given set of CPUs and initializes the timer associated with it.  It does not
> - * start any injection cycles.
> + * given set of CPUs and initializes the timer associated with it. This
> + * function also allows to register prepare() and complete() callbacks.
> + * It does not start any injection cycles.
>    *
>    * Return: NULL if memory allocation fails, idle injection control device
>    * pointer on success.
>    */
> -struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
> +
> +struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
> +						     int (*prepare)(unsigned int cpu),
> +						     void (*complete)(unsigned int cpu))
>   {
>   	struct idle_inject_device *ii_dev;
>   	int cpu, cpu_rb;
> @@ -318,6 +353,8 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
>   	hrtimer_init(&ii_dev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>   	ii_dev->timer.function = idle_inject_timer_fn;
>   	ii_dev->latency_us = UINT_MAX;
> +	ii_dev->prepare = prepare;
> +	ii_dev->complete = complete;
>   
>   	for_each_cpu(cpu, to_cpumask(ii_dev->cpumask)) {
>   
> @@ -342,6 +379,23 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
>   
>   	return NULL;
>   }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_register_full, IDLE_INJECT);
> +
> +/**
> + * idle_inject_register - initialize idle injection on a set of CPUs
> + * @cpumask: CPUs to be affected by idle injection
> + *
> + * This function creates an idle injection control device structure for the
> + * given set of CPUs and initializes the timer associated with it.  It does not
> + * start any injection cycles.
> + *
> + * Return: NULL if memory allocation fails, idle injection control device
> + * pointer on success.
> + */
> +struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
> +{
> +	return idle_inject_register_full(cpumask, NULL, NULL);
> +}
>   EXPORT_SYMBOL_NS_GPL(idle_inject_register, IDLE_INJECT);
>   
>   /**
> diff --git a/include/linux/idle_inject.h b/include/linux/idle_inject.h
> index fb88e23a99d3..e18d89793490 100644
> --- a/include/linux/idle_inject.h
> +++ b/include/linux/idle_inject.h
> @@ -13,6 +13,10 @@ struct idle_inject_device;
>   
>   struct idle_inject_device *idle_inject_register(struct cpumask *cpumask);
>   
> +struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
> +						     int (*prepare)(unsigned int cpu),
> +						     void (*complete)(unsigned int cpu));
> +
>   void idle_inject_unregister(struct idle_inject_device *ii_dev);
>   
>   int idle_inject_start(struct idle_inject_device *ii_dev);
  
srinivas pandruvada Dec. 21, 2022, 8:58 p.m. UTC | #2
Hi Daniel,

On Wed, 2022-12-21 at 15:52 +0100, Daniel Lezcano wrote:
> 
> Hi Srinivas,
> 
> On 30/11/2022 00:34, Srinivas Pandruvada wrote:
> > The actual idle percentage can be less than the desired because of
> > interrupts. Since the objective for CPU Idle injection is for
> > thermal
> > control, there should be some way to compensate for lost idle
> > percentage.
> > Some architectures provide interface to get actual idle percent
> > observed
> > by the hardware. So, the idle percent can be adjusted using the
> > hardware
> > feedback. For example, Intel CPUs provides package idle counters,
> > which
> > is currently used by intel powerclamp driver to adjust idle time.
> Can you provide an example in terms of timings?
> 
> I'm not getting how 'prepare' would do by returning a positive value
> to 
> skip the play_idle_precise() and what will do 'complete' ?
> 
intel_powerclamp has a logic where if the current idle percentage
observed from hardware is more than the desired target inject percent,
it skips calling play_idle().

For example if you want to inject 50% idle and system is naturally idle
for 60%, there is no use of calling play_idle in the idle injection
framework to induce more idle. In this way a workload can run
immediately.

So trying to emulate the same logic by using powercap/idle_inject
framework. So prepare() callback in the intel_powerclamp driver calls
the existing function to check if idle-inject should skip for this time
or not.

The complete() callback is to do just to adjust run duration. I don't
have a use case, but to add complementary callback to prepare() if
required.

Thanks,
Srinivas

> 
> > The only way this can be done currently is by monitoring hardware
> > idle
> > percent from a different software thread. This can be avoided by
> > adding
> > callbacks.
> > 
> > Add a capability to register a prepare and complete callback during
> > idle
> > inject registry. Add a new register function
> > idle_inject_register_full()
> > which also allows to register callbacks.
> > 
> > If they are not NULL, then prepare callback is called before
> > calling
> > play_idle_precise() and complete callback is called after calling
> > play_idle_precise().
> > 
> > If prepare callback is present and returns non 0 value then
> > play_idle_precise() is not called to avoid over compensation.
> > 
> > Signed-off-by: Srinivas Pandruvada
> > <srinivas.pandruvada@linux.intel.com>
> > ---
> > v2
> > - Replace begin/end with prepare/complete
> > - Add new interface idle_inject_register_full with callbacks
> > - Update kernel doc
> > - Update commit description
> > 
> >   drivers/powercap/idle_inject.c | 62
> > +++++++++++++++++++++++++++++++---
> >   include/linux/idle_inject.h    |  4 +++
> >   2 files changed, 62 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/powercap/idle_inject.c
> > b/drivers/powercap/idle_inject.c
> > index dfa989182e71..f48e71501429 100644
> > --- a/drivers/powercap/idle_inject.c
> > +++ b/drivers/powercap/idle_inject.c
> > @@ -63,13 +63,31 @@ struct idle_inject_thread {
> >    * @idle_duration_us: duration of CPU idle time to inject
> >    * @run_duration_us: duration of CPU run time to allow
> >    * @latency_us: max allowed latency
> > + * @prepare: Callback function which is called before calling
> > + *             play_idle_precise()
> > + * @complete: Callback function which is called after calling
> > + *             play_idle_precise()
> >    * @cpumask: mask of CPUs affected by idle injection
> > + *
> > + * This structure is used to define per instance idle inject
> > device data. Each
> > + * instance has an idle duration, a run duration and mask of CPUs
> > to inject
> > + * idle.
> > + * Actual idle is injected by calling kernel scheduler interface
> > + * play_idle_precise(). There are two optional callbacks which the
> > caller can
> > + * register by calling idle_inject_register_full():
> > + * prepare() - This callback is called just before calling
> > play_idle_precise()
> > + *             If this callback returns non zero value then
> > + *             play_idle_precise() is not called. This means skip
> > injecting
> > + *             idle during this period.
> > + * complete() - This callback is called after calling
> > play_idle_precise().
> >    */
> >   struct idle_inject_device {
> >         struct hrtimer timer;
> >         unsigned int idle_duration_us;
> >         unsigned int run_duration_us;
> >         unsigned int latency_us;
> > +       int (*prepare)(unsigned int cpu);
> > +       void (*complete)(unsigned int cpu);
> >         unsigned long cpumask[];
> >   };
> >   
> > @@ -132,6 +150,7 @@ static void idle_inject_fn(unsigned int cpu)
> >   {
> >         struct idle_inject_device *ii_dev;
> >         struct idle_inject_thread *iit;
> > +       int ret;
> >   
> >         ii_dev = per_cpu(idle_inject_device, cpu);
> >         iit = per_cpu_ptr(&idle_inject_thread, cpu);
> > @@ -141,8 +160,18 @@ static void idle_inject_fn(unsigned int cpu)
> >          */
> >         iit->should_run = 0;
> >   
> > +       if (ii_dev->prepare) {
> > +               ret = ii_dev->prepare(cpu);
> > +               if (ret)
> > +                       goto skip;
> > +       }
> > +
> >         play_idle_precise(READ_ONCE(ii_dev->idle_duration_us) *
> > NSEC_PER_USEC,
> >                           READ_ONCE(ii_dev->latency_us) *
> > NSEC_PER_USEC);
> > +
> > +skip:
> > +       if (ii_dev->complete)
> > +               ii_dev->complete(cpu);
> >   }
> >   
> >   /**
> > @@ -295,17 +324,23 @@ static int idle_inject_should_run(unsigned
> > int cpu)
> >   }
> >   
> >   /**
> > - * idle_inject_register - initialize idle injection on a set of
> > CPUs
> > + * idle_inject_register_full - initialize idle injection on a set
> > of CPUs
> >    * @cpumask: CPUs to be affected by idle injection
> > + * @prepare: callback called before calling play_idle_precise()
> > + * @complete: callback called after calling play_idle_precise()
> >    *
> >    * This function creates an idle injection control device
> > structure for the
> > - * given set of CPUs and initializes the timer associated with
> > it.  It does not
> > - * start any injection cycles.
> > + * given set of CPUs and initializes the timer associated with it.
> > This
> > + * function also allows to register prepare() and complete()
> > callbacks.
> > + * It does not start any injection cycles.
> >    *
> >    * Return: NULL if memory allocation fails, idle injection
> > control device
> >    * pointer on success.
> >    */
> > -struct idle_inject_device *idle_inject_register(struct cpumask
> > *cpumask)
> > +
> > +struct idle_inject_device *idle_inject_register_full(struct
> > cpumask *cpumask,
> > +                                                    int
> > (*prepare)(unsigned int cpu),
> > +                                                    void
> > (*complete)(unsigned int cpu))
> >   {
> >         struct idle_inject_device *ii_dev;
> >         int cpu, cpu_rb;
> > @@ -318,6 +353,8 @@ struct idle_inject_device
> > *idle_inject_register(struct cpumask *cpumask)
> >         hrtimer_init(&ii_dev->timer, CLOCK_MONOTONIC,
> > HRTIMER_MODE_REL);
> >         ii_dev->timer.function = idle_inject_timer_fn;
> >         ii_dev->latency_us = UINT_MAX;
> > +       ii_dev->prepare = prepare;
> > +       ii_dev->complete = complete;
> >   
> >         for_each_cpu(cpu, to_cpumask(ii_dev->cpumask)) {
> >   
> > @@ -342,6 +379,23 @@ struct idle_inject_device
> > *idle_inject_register(struct cpumask *cpumask)
> >   
> >         return NULL;
> >   }
> > +EXPORT_SYMBOL_NS_GPL(idle_inject_register_full, IDLE_INJECT);
> > +
> > +/**
> > + * idle_inject_register - initialize idle injection on a set of
> > CPUs
> > + * @cpumask: CPUs to be affected by idle injection
> > + *
> > + * This function creates an idle injection control device
> > structure for the
> > + * given set of CPUs and initializes the timer associated with
> > it.  It does not
> > + * start any injection cycles.
> > + *
> > + * Return: NULL if memory allocation fails, idle injection control
> > device
> > + * pointer on success.
> > + */
> > +struct idle_inject_device *idle_inject_register(struct cpumask
> > *cpumask)
> > +{
> > +       return idle_inject_register_full(cpumask, NULL, NULL);
> > +}
> >   EXPORT_SYMBOL_NS_GPL(idle_inject_register, IDLE_INJECT);
> >   
> >   /**
> > diff --git a/include/linux/idle_inject.h
> > b/include/linux/idle_inject.h
> > index fb88e23a99d3..e18d89793490 100644
> > --- a/include/linux/idle_inject.h
> > +++ b/include/linux/idle_inject.h
> > @@ -13,6 +13,10 @@ struct idle_inject_device;
> >   
> >   struct idle_inject_device *idle_inject_register(struct cpumask
> > *cpumask);
> >   
> > +struct idle_inject_device *idle_inject_register_full(struct
> > cpumask *cpumask,
> > +                                                    int
> > (*prepare)(unsigned int cpu),
> > +                                                    void
> > (*complete)(unsigned int cpu));
> > +
> >   void idle_inject_unregister(struct idle_inject_device *ii_dev);
> >   
> >   int idle_inject_start(struct idle_inject_device *ii_dev);
>
  
Daniel Lezcano Dec. 22, 2022, 9:50 a.m. UTC | #3
Hi Srinivas,


On 21/12/2022 21:58, srinivas pandruvada wrote:
> Hi Daniel,
> 
> On Wed, 2022-12-21 at 15:52 +0100, Daniel Lezcano wrote:
>>
>> Hi Srinivas,
>>
>> On 30/11/2022 00:34, Srinivas Pandruvada wrote:
>>> The actual idle percentage can be less than the desired because of
>>> interrupts. Since the objective for CPU Idle injection is for
>>> thermal
>>> control, there should be some way to compensate for lost idle
>>> percentage.
>>> Some architectures provide interface to get actual idle percent
>>> observed
>>> by the hardware. So, the idle percent can be adjusted using the
>>> hardware
>>> feedback. For example, Intel CPUs provides package idle counters,
>>> which
>>> is currently used by intel powerclamp driver to adjust idle time.
>> Can you provide an example in terms of timings?
>>
>> I'm not getting how 'prepare' would do by returning a positive value
>> to
>> skip the play_idle_precise() and what will do 'complete' ?
>>
> intel_powerclamp has a logic where if the current idle percentage
> observed from hardware is more than the desired target inject percent,
> it skips calling play_idle().
> 
> For example if you want to inject 50% idle and system is naturally idle
> for 60%, there is no use of calling play_idle in the idle injection
> framework to induce more idle. In this way a workload can run
> immediately.
> 
> So trying to emulate the same logic by using powercap/idle_inject
> framework. So prepare() callback in the intel_powerclamp driver calls
> the existing function to check if idle-inject should skip for this time
> or not.

The function 'prepare' has the 'cpu' parameter. How can it compare with 
the desired idle duration as this information is not passed to the 
callback ?
  
srinivas pandruvada Dec. 22, 2022, 5:36 p.m. UTC | #4
Hi Daniel,

On Thu, 2022-12-22 at 10:50 +0100, Daniel Lezcano wrote:
> 
> Hi Srinivas,
> 
> 
> On 21/12/2022 21:58, srinivas pandruvada wrote:
> > Hi Daniel,
> > 
> > On Wed, 2022-12-21 at 15:52 +0100, Daniel Lezcano wrote:
> > > 
> > > Hi Srinivas,
> > > 
> > > On 30/11/2022 00:34, Srinivas Pandruvada wrote:
> > > > The actual idle percentage can be less than the desired because
> > > > of
> > > > interrupts. Since the objective for CPU Idle injection is for
> > > > thermal
> > > > control, there should be some way to compensate for lost idle
> > > > percentage.
> > > > Some architectures provide interface to get actual idle percent
> > > > observed
> > > > by the hardware. So, the idle percent can be adjusted using the
> > > > hardware
> > > > feedback. For example, Intel CPUs provides package idle
> > > > counters,
> > > > which
> > > > is currently used by intel powerclamp driver to adjust idle
> > > > time.
> > > Can you provide an example in terms of timings?
> > > 
> > > I'm not getting how 'prepare' would do by returning a positive
> > > value
> > > to
> > > skip the play_idle_precise() and what will do 'complete' ?
> > > 
> > intel_powerclamp has a logic where if the current idle percentage
> > observed from hardware is more than the desired target inject
> > percent,
> > it skips calling play_idle().
> > 
> > For example if you want to inject 50% idle and system is naturally
> > idle
> > for 60%, there is no use of calling play_idle in the idle injection
> > framework to induce more idle. In this way a workload can run
> > immediately.
> > 
> > So trying to emulate the same logic by using powercap/idle_inject
> > framework. So prepare() callback in the intel_powerclamp driver
> > calls
> > the existing function to check if idle-inject should skip for this
> > time
> > or not.
> 
> The function 'prepare' has the 'cpu' parameter. How can it compare
> with 
> the desired idle duration as this information is not passed to the 
> callback ?
Good question.

Calling driver knows what idle_duration he set.
In my first version, I passed *idle_duration (with current
idle_duration set), so the caller can change this for the current
play_idle call if required for one time.

But in powerclamp case we either skip the whole play_idle or not. It
doesn't change idle duration. So didn't add.

But we can add this back.

Thanks,
Srinivas



> 
>
  
Rafael J. Wysocki Jan. 12, 2023, 2:37 p.m. UTC | #5
On Thu, Dec 22, 2022 at 6:36 PM srinivas pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> Hi Daniel,
>
> On Thu, 2022-12-22 at 10:50 +0100, Daniel Lezcano wrote:
> >
> > Hi Srinivas,
> >
> >
> > On 21/12/2022 21:58, srinivas pandruvada wrote:
> > > Hi Daniel,
> > >
> > > On Wed, 2022-12-21 at 15:52 +0100, Daniel Lezcano wrote:
> > > >
> > > > Hi Srinivas,
> > > >
> > > > On 30/11/2022 00:34, Srinivas Pandruvada wrote:
> > > > > The actual idle percentage can be less than the desired because
> > > > > of
> > > > > interrupts. Since the objective for CPU Idle injection is for
> > > > > thermal
> > > > > control, there should be some way to compensate for lost idle
> > > > > percentage.
> > > > > Some architectures provide interface to get actual idle percent
> > > > > observed
> > > > > by the hardware. So, the idle percent can be adjusted using the
> > > > > hardware
> > > > > feedback. For example, Intel CPUs provides package idle
> > > > > counters,
> > > > > which
> > > > > is currently used by intel powerclamp driver to adjust idle
> > > > > time.
> > > > Can you provide an example in terms of timings?
> > > >
> > > > I'm not getting how 'prepare' would do by returning a positive
> > > > value
> > > > to
> > > > skip the play_idle_precise() and what will do 'complete' ?
> > > >
> > > intel_powerclamp has a logic where if the current idle percentage
> > > observed from hardware is more than the desired target inject
> > > percent,
> > > it skips calling play_idle().
> > >
> > > For example if you want to inject 50% idle and system is naturally
> > > idle
> > > for 60%, there is no use of calling play_idle in the idle injection
> > > framework to induce more idle. In this way a workload can run
> > > immediately.
> > >
> > > So trying to emulate the same logic by using powercap/idle_inject
> > > framework. So prepare() callback in the intel_powerclamp driver
> > > calls
> > > the existing function to check if idle-inject should skip for this
> > > time
> > > or not.
> >
> > The function 'prepare' has the 'cpu' parameter. How can it compare
> > with
> > the desired idle duration as this information is not passed to the
> > callback ?
> Good question.
>
> Calling driver knows what idle_duration he set.
> In my first version, I passed *idle_duration (with current
> idle_duration set), so the caller can change this for the current
> play_idle call if required for one time.
>
> But in powerclamp case we either skip the whole play_idle or not. It
> doesn't change idle duration. So didn't add.
>
> But we can add this back.

I don't think that it is necessary at this point.

Since powerclamp is the only user and it doesn't need idle_duration, I
would just not add it ATM.

I have a couple of other comments to the patch, but let me send them separately.
  
Rafael J. Wysocki Jan. 12, 2023, 2:53 p.m. UTC | #6
On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> The actual idle percentage can be less than the desired because of
> interrupts.

This is somewhat unclear.

> Since the objective for CPU Idle injection is for thermal
> control, there should be some way to compensate for lost idle percentage.

What does "lost idle percentage" mean here?

> Some architectures provide interface to get actual idle percent observed
> by the hardware. So, the idle percent can be adjusted using the hardware
> feedback. For example, Intel CPUs provides package idle counters, which
> is currently used by intel powerclamp driver to adjust idle time.
>
> The only way this can be done currently is by monitoring hardware idle
> percent from a different software thread. This can be avoided by adding
> callbacks.
>
> Add a capability to register a prepare and complete callback during idle
> inject registry. Add a new register function idle_inject_register_full()
> which also allows to register callbacks.
>
> If they are not NULL, then prepare callback is called before calling
> play_idle_precise() and complete callback is called after calling
> play_idle_precise().
>
> If prepare callback is present and returns non 0 value then
> play_idle_precise() is not called to avoid over compensation.

This mechanism isn't particularly straightforward, but maybe there's
no better way.

> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
> v2
> - Replace begin/end with prepare/complete
> - Add new interface idle_inject_register_full with callbacks
> - Update kernel doc
> - Update commit description
>
>  drivers/powercap/idle_inject.c | 62 +++++++++++++++++++++++++++++++---
>  include/linux/idle_inject.h    |  4 +++
>  2 files changed, 62 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/powercap/idle_inject.c b/drivers/powercap/idle_inject.c
> index dfa989182e71..f48e71501429 100644
> --- a/drivers/powercap/idle_inject.c
> +++ b/drivers/powercap/idle_inject.c
> @@ -63,13 +63,31 @@ struct idle_inject_thread {
>   * @idle_duration_us: duration of CPU idle time to inject
>   * @run_duration_us: duration of CPU run time to allow
>   * @latency_us: max allowed latency
> + * @prepare: Callback function which is called before calling
> + *             play_idle_precise()
> + * @complete: Callback function which is called after calling
> + *             play_idle_precise()

What about:

@prepare: Optional callback deciding whether or not to skip idle
injection in the given cycle.
@complete: Optional callback updating the state after idle injection.

>   * @cpumask: mask of CPUs affected by idle injection
> + *
> + * This structure is used to define per instance idle inject device data. Each
> + * instance has an idle duration, a run duration and mask of CPUs to inject
> + * idle.
> + * Actual idle is injected by calling kernel scheduler interface
> + * play_idle_precise(). There are two optional callbacks which the caller can
> + * register by calling idle_inject_register_full():
> + * prepare() - This callback is called just before calling play_idle_precise()
> + *             If this callback returns non zero value then
> + *             play_idle_precise() is not called. This means skip injecting
> + *             idle during this period.
> + * complete() - This callback is called after calling play_idle_precise().

I would keep the format of the comment more consistent with the
general information at the top and the member descriptions at the
bottom.

>   */
>  struct idle_inject_device {
>         struct hrtimer timer;
>         unsigned int idle_duration_us;
>         unsigned int run_duration_us;
>         unsigned int latency_us;
> +       int (*prepare)(unsigned int cpu);

Can it be bool?

> +       void (*complete)(unsigned int cpu);
>         unsigned long cpumask[];
>  };
>
> @@ -132,6 +150,7 @@ static void idle_inject_fn(unsigned int cpu)
>  {
>         struct idle_inject_device *ii_dev;
>         struct idle_inject_thread *iit;
> +       int ret;

This is redundant.

>
>         ii_dev = per_cpu(idle_inject_device, cpu);
>         iit = per_cpu_ptr(&idle_inject_thread, cpu);
> @@ -141,8 +160,18 @@ static void idle_inject_fn(unsigned int cpu)
>          */
>         iit->should_run = 0;
>
> +       if (ii_dev->prepare) {
> +               ret = ii_dev->prepare(cpu);
> +               if (ret)
> +                       goto skip;
> +       }

Because the above can be written as

if (ii_dev->prepare && ii_dev->prepare(cpu))
         goto skip;

> +
>         play_idle_precise(READ_ONCE(ii_dev->idle_duration_us) * NSEC_PER_USEC,
>                           READ_ONCE(ii_dev->latency_us) * NSEC_PER_USEC);
> +
> +skip:
> +       if (ii_dev->complete)
> +               ii_dev->complete(cpu);
>  }
>
>  /**
> @@ -295,17 +324,23 @@ static int idle_inject_should_run(unsigned int cpu)
>  }
>
>  /**
> - * idle_inject_register - initialize idle injection on a set of CPUs
> + * idle_inject_register_full - initialize idle injection on a set of CPUs
>   * @cpumask: CPUs to be affected by idle injection
> + * @prepare: callback called before calling play_idle_precise()
> + * @complete: callback called after calling play_idle_precise()

IMO it would be slightly cleaner to say "invoked" instead of "called".

>   *
>   * This function creates an idle injection control device structure for the
> - * given set of CPUs and initializes the timer associated with it.  It does not
> - * start any injection cycles.
> + * given set of CPUs and initializes the timer associated with it. This
> + * function also allows to register prepare() and complete() callbacks.
> + * It does not start any injection cycles.
>   *
>   * Return: NULL if memory allocation fails, idle injection control device
>   * pointer on success.
>   */
> -struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
> +
> +struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
> +                                                    int (*prepare)(unsigned int cpu),
> +                                                    void (*complete)(unsigned int cpu))
>  {
>         struct idle_inject_device *ii_dev;
>         int cpu, cpu_rb;
> @@ -318,6 +353,8 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
>         hrtimer_init(&ii_dev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>         ii_dev->timer.function = idle_inject_timer_fn;
>         ii_dev->latency_us = UINT_MAX;
> +       ii_dev->prepare = prepare;
> +       ii_dev->complete = complete;
>
>         for_each_cpu(cpu, to_cpumask(ii_dev->cpumask)) {
>
> @@ -342,6 +379,23 @@ struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
>
>         return NULL;
>  }
> +EXPORT_SYMBOL_NS_GPL(idle_inject_register_full, IDLE_INJECT);
> +
> +/**
> + * idle_inject_register - initialize idle injection on a set of CPUs
> + * @cpumask: CPUs to be affected by idle injection
> + *
> + * This function creates an idle injection control device structure for the
> + * given set of CPUs and initializes the timer associated with it.  It does not
> + * start any injection cycles.
> + *
> + * Return: NULL if memory allocation fails, idle injection control device
> + * pointer on success.

It would be sufficient to say "Pass @cpumask to
idle_inject_register_full() to initialize idle injection on a set of
CPUs".

> + */
> +struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
> +{
> +       return idle_inject_register_full(cpumask, NULL, NULL);
> +}
>  EXPORT_SYMBOL_NS_GPL(idle_inject_register, IDLE_INJECT);
>
>  /**
> diff --git a/include/linux/idle_inject.h b/include/linux/idle_inject.h
> index fb88e23a99d3..e18d89793490 100644
> --- a/include/linux/idle_inject.h
> +++ b/include/linux/idle_inject.h
> @@ -13,6 +13,10 @@ struct idle_inject_device;
>
>  struct idle_inject_device *idle_inject_register(struct cpumask *cpumask);
>
> +struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
> +                                                    int (*prepare)(unsigned int cpu),
> +                                                    void (*complete)(unsigned int cpu));
> +
>  void idle_inject_unregister(struct idle_inject_device *ii_dev);
>
>  int idle_inject_start(struct idle_inject_device *ii_dev);
> --
  
Rafael J. Wysocki Jan. 12, 2023, 5:25 p.m. UTC | #7
On Thu, Jan 12, 2023 at 3:53 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
> <srinivas.pandruvada@linux.intel.com> wrote:
> >
> > The actual idle percentage can be less than the desired because of
> > interrupts.
>
> This is somewhat unclear.
>
> > Since the objective for CPU Idle injection is for thermal
> > control, there should be some way to compensate for lost idle percentage.
>
> What does "lost idle percentage" mean here?
>
> > Some architectures provide interface to get actual idle percent observed
> > by the hardware. So, the idle percent can be adjusted using the hardware
> > feedback. For example, Intel CPUs provides package idle counters, which
> > is currently used by intel powerclamp driver to adjust idle time.
> >
> > The only way this can be done currently is by monitoring hardware idle
> > percent from a different software thread. This can be avoided by adding
> > callbacks.
> >
> > Add a capability to register a prepare and complete callback during idle
> > inject registry. Add a new register function idle_inject_register_full()
> > which also allows to register callbacks.
> >
> > If they are not NULL, then prepare callback is called before calling
> > play_idle_precise() and complete callback is called after calling
> > play_idle_precise().
> >
> > If prepare callback is present and returns non 0 value then
> > play_idle_precise() is not called to avoid over compensation.
>
> This mechanism isn't particularly straightforward, but maybe there's
> no better way.
>
> > Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > ---
> > v2
> > - Replace begin/end with prepare/complete
> > - Add new interface idle_inject_register_full with callbacks
> > - Update kernel doc
> > - Update commit description
> >
> >  drivers/powercap/idle_inject.c | 62 +++++++++++++++++++++++++++++++---
> >  include/linux/idle_inject.h    |  4 +++
> >  2 files changed, 62 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/powercap/idle_inject.c b/drivers/powercap/idle_inject.c
> > index dfa989182e71..f48e71501429 100644
> > --- a/drivers/powercap/idle_inject.c
> > +++ b/drivers/powercap/idle_inject.c
> > @@ -63,13 +63,31 @@ struct idle_inject_thread {
> >   * @idle_duration_us: duration of CPU idle time to inject
> >   * @run_duration_us: duration of CPU run time to allow
> >   * @latency_us: max allowed latency
> > + * @prepare: Callback function which is called before calling
> > + *             play_idle_precise()
> > + * @complete: Callback function which is called after calling
> > + *             play_idle_precise()
>
> What about:
>
> @prepare: Optional callback deciding whether or not to skip idle
> injection in the given cycle.
> @complete: Optional callback updating the state after idle injection.

One more thing: ->complete() is not even used by powerclamp AFAICS, so
I wouldn't add it at this time, because it isn't clear if it's going
to be useful at all.
  
srinivas pandruvada Jan. 12, 2023, 6:13 p.m. UTC | #8
On Thu, 2023-01-12 at 18:25 +0100, Rafael J. Wysocki wrote:
> On Thu, Jan 12, 2023 at 3:53 PM Rafael J. Wysocki <rafael@kernel.org>
> wrote:
> > 
> > On Wed, Nov 30, 2022 at 12:34 AM Srinivas Pandruvada
> > <srinivas.pandruvada@linux.intel.com> wrote:
> > > 
> > > The actual idle percentage can be less than the desired because
> > > of
> > > interrupts.
> > 
> > This is somewhat unclear.
> > 
Will try to make it better.

> > > Since the objective for CPU Idle injection is for thermal
> > > control, there should be some way to compensate for lost idle
> > > percentage.
> > 
> > What does "lost idle percentage" mean here?
User space wants 50% idle, but because of interrupts on target cpus we
were less than 50% idle. This may be because of too many interrupt
wakes. If we measure from the hardware counters, this may say we were
45% idle. So we were 5% less than user desired setting.

> > 
> > > Some architectures provide interface to get actual idle percent
> > > observed
> > > by the hardware. So, the idle percent can be adjusted using the
> > > hardware
> > > feedback. For example, Intel CPUs provides package idle counters,
> > > which
> > > is currently used by intel powerclamp driver to adjust idle time.
> > > 
> > > The only way this can be done currently is by monitoring hardware
> > > idle
> > > percent from a different software thread. This can be avoided by
> > > adding
> > > callbacks.
> > > 
> > > Add a capability to register a prepare and complete callback
> > > during idle
> > > inject registry. Add a new register function
> > > idle_inject_register_full()
> > > which also allows to register callbacks.
> > > 
> > > If they are not NULL, then prepare callback is called before
> > > calling
> > > play_idle_precise() and complete callback is called after calling
> > > play_idle_precise().
> > > 
> > > If prepare callback is present and returns non 0 value then
> > > play_idle_precise() is not called to avoid over compensation.
> > 
> > This mechanism isn't particularly straightforward, but maybe
> > there's
> > no better way.
> > 
> > > Signed-off-by: Srinivas Pandruvada
> > > <srinivas.pandruvada@linux.intel.com>
> > > ---
> > > v2
> > > - Replace begin/end with prepare/complete
> > > - Add new interface idle_inject_register_full with callbacks
> > > - Update kernel doc
> > > - Update commit description
> > > 
> > >  drivers/powercap/idle_inject.c | 62
> > > +++++++++++++++++++++++++++++++---
> > >  include/linux/idle_inject.h    |  4 +++
> > >  2 files changed, 62 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/drivers/powercap/idle_inject.c
> > > b/drivers/powercap/idle_inject.c
> > > index dfa989182e71..f48e71501429 100644
> > > --- a/drivers/powercap/idle_inject.c
> > > +++ b/drivers/powercap/idle_inject.c
> > > @@ -63,13 +63,31 @@ struct idle_inject_thread {
> > >   * @idle_duration_us: duration of CPU idle time to inject
> > >   * @run_duration_us: duration of CPU run time to allow
> > >   * @latency_us: max allowed latency
> > > + * @prepare: Callback function which is called before calling
> > > + *             play_idle_precise()
> > > + * @complete: Callback function which is called after calling
> > > + *             play_idle_precise()
> > 
> > What about:
> > 
> > @prepare: Optional callback deciding whether or not to skip idle
> > injection in the given cycle.
> > @complete: Optional callback updating the state after idle
> > injection.
Looks good.

> 
> One more thing: ->complete() is not even used by powerclamp AFAICS,
> so
> I wouldn't add it at this time, because it isn't clear if it's going
> to be useful at all.
Sure.

Thanks,
Srinivas
  

Patch

diff --git a/drivers/powercap/idle_inject.c b/drivers/powercap/idle_inject.c
index dfa989182e71..f48e71501429 100644
--- a/drivers/powercap/idle_inject.c
+++ b/drivers/powercap/idle_inject.c
@@ -63,13 +63,31 @@  struct idle_inject_thread {
  * @idle_duration_us: duration of CPU idle time to inject
  * @run_duration_us: duration of CPU run time to allow
  * @latency_us: max allowed latency
+ * @prepare: Callback function which is called before calling
+ *		play_idle_precise()
+ * @complete: Callback function which is called after calling
+ *		play_idle_precise()
  * @cpumask: mask of CPUs affected by idle injection
+ *
+ * This structure is used to define per instance idle inject device data. Each
+ * instance has an idle duration, a run duration and mask of CPUs to inject
+ * idle.
+ * Actual idle is injected by calling kernel scheduler interface
+ * play_idle_precise(). There are two optional callbacks which the caller can
+ * register by calling idle_inject_register_full():
+ * prepare() - This callback is called just before calling play_idle_precise()
+ *		If this callback returns non zero value then
+ *		play_idle_precise() is not called. This means skip injecting
+ *		idle during this period.
+ * complete() - This callback is called after calling play_idle_precise().
  */
 struct idle_inject_device {
 	struct hrtimer timer;
 	unsigned int idle_duration_us;
 	unsigned int run_duration_us;
 	unsigned int latency_us;
+	int (*prepare)(unsigned int cpu);
+	void (*complete)(unsigned int cpu);
 	unsigned long cpumask[];
 };
 
@@ -132,6 +150,7 @@  static void idle_inject_fn(unsigned int cpu)
 {
 	struct idle_inject_device *ii_dev;
 	struct idle_inject_thread *iit;
+	int ret;
 
 	ii_dev = per_cpu(idle_inject_device, cpu);
 	iit = per_cpu_ptr(&idle_inject_thread, cpu);
@@ -141,8 +160,18 @@  static void idle_inject_fn(unsigned int cpu)
 	 */
 	iit->should_run = 0;
 
+	if (ii_dev->prepare) {
+		ret = ii_dev->prepare(cpu);
+		if (ret)
+			goto skip;
+	}
+
 	play_idle_precise(READ_ONCE(ii_dev->idle_duration_us) * NSEC_PER_USEC,
 			  READ_ONCE(ii_dev->latency_us) * NSEC_PER_USEC);
+
+skip:
+	if (ii_dev->complete)
+		ii_dev->complete(cpu);
 }
 
 /**
@@ -295,17 +324,23 @@  static int idle_inject_should_run(unsigned int cpu)
 }
 
 /**
- * idle_inject_register - initialize idle injection on a set of CPUs
+ * idle_inject_register_full - initialize idle injection on a set of CPUs
  * @cpumask: CPUs to be affected by idle injection
+ * @prepare: callback called before calling play_idle_precise()
+ * @complete: callback called after calling play_idle_precise()
  *
  * This function creates an idle injection control device structure for the
- * given set of CPUs and initializes the timer associated with it.  It does not
- * start any injection cycles.
+ * given set of CPUs and initializes the timer associated with it. This
+ * function also allows to register prepare() and complete() callbacks.
+ * It does not start any injection cycles.
  *
  * Return: NULL if memory allocation fails, idle injection control device
  * pointer on success.
  */
-struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
+
+struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
+						     int (*prepare)(unsigned int cpu),
+						     void (*complete)(unsigned int cpu))
 {
 	struct idle_inject_device *ii_dev;
 	int cpu, cpu_rb;
@@ -318,6 +353,8 @@  struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
 	hrtimer_init(&ii_dev->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	ii_dev->timer.function = idle_inject_timer_fn;
 	ii_dev->latency_us = UINT_MAX;
+	ii_dev->prepare = prepare;
+	ii_dev->complete = complete;
 
 	for_each_cpu(cpu, to_cpumask(ii_dev->cpumask)) {
 
@@ -342,6 +379,23 @@  struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
 
 	return NULL;
 }
+EXPORT_SYMBOL_NS_GPL(idle_inject_register_full, IDLE_INJECT);
+
+/**
+ * idle_inject_register - initialize idle injection on a set of CPUs
+ * @cpumask: CPUs to be affected by idle injection
+ *
+ * This function creates an idle injection control device structure for the
+ * given set of CPUs and initializes the timer associated with it.  It does not
+ * start any injection cycles.
+ *
+ * Return: NULL if memory allocation fails, idle injection control device
+ * pointer on success.
+ */
+struct idle_inject_device *idle_inject_register(struct cpumask *cpumask)
+{
+	return idle_inject_register_full(cpumask, NULL, NULL);
+}
 EXPORT_SYMBOL_NS_GPL(idle_inject_register, IDLE_INJECT);
 
 /**
diff --git a/include/linux/idle_inject.h b/include/linux/idle_inject.h
index fb88e23a99d3..e18d89793490 100644
--- a/include/linux/idle_inject.h
+++ b/include/linux/idle_inject.h
@@ -13,6 +13,10 @@  struct idle_inject_device;
 
 struct idle_inject_device *idle_inject_register(struct cpumask *cpumask);
 
+struct idle_inject_device *idle_inject_register_full(struct cpumask *cpumask,
+						     int (*prepare)(unsigned int cpu),
+						     void (*complete)(unsigned int cpu));
+
 void idle_inject_unregister(struct idle_inject_device *ii_dev);
 
 int idle_inject_start(struct idle_inject_device *ii_dev);