[v4,11/18] PM: EM: Add runtime update interface to modify EM power

Message ID 20230925081139.1305766-12-lukasz.luba@arm.com
State New
Series: Introduce runtime modifiable Energy Model

Commit Message

Lukasz Luba Sept. 25, 2023, 8:11 a.m. UTC
  Add an interface which allows to modify EM power data at runtime.
The new power information is populated by the provided callback, which
is called for each performance state. The CPU frequencies' efficiency is
re-calculated since that might be affected as well. The old EM memory
is going to be freed later using RCU mechanism.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h |   8 +++
 kernel/power/energy_model.c  | 111 +++++++++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+)
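
For illustration, a hypothetical driver-side use of the new interface could look like the sketch below (names here are invented; it assumes the update_power() callback member added earlier in this series and a driver-private context handed in via @priv):

static int my_update_power(struct device *dev, unsigned long freq,
			   unsigned long *power, void *priv)
{
	struct my_thermal_ctx *ctx = priv;	/* hypothetical driver context */

	/* Report the new power for this frequency, e.g. from a lookup table. */
	*power = my_lookup_power(ctx, freq);	/* hypothetical helper */
	return 0;
}

static struct em_data_callback my_em_cb = {
	.update_power	= my_update_power,
};

static void my_thermal_notify(struct device *dev, struct my_thermal_ctx *ctx)
{
	int ret;

	/* Called e.g. when the thermal conditions tracked by 'ctx' change. */
	ret = em_dev_update_perf_domain(dev, &my_em_cb, ctx);
	if (ret)
		dev_warn(dev, "EM runtime update failed: %d\n", ret);
}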
  

Comments

kernel test robot Sept. 26, 2023, 5:21 p.m. UTC | #1
Hi Lukasz,

kernel test robot noticed the following build warnings:

[auto build test WARNING on rafael-pm/linux-next]
[also build test WARNING on rafael-pm/thermal linus/master v6.6-rc3 next-20230926]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lukasz-Luba/PM-EM-Add-missing-newline-for-the-message-log/20230925-181243
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
patch link:    https://lore.kernel.org/r/20230925081139.1305766-12-lukasz.luba%40arm.com
patch subject: [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power
config: i386-randconfig-063-20230926 (https://download.01.org/0day-ci/archive/20230927/202309270106.c56Z2Tci-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230927/202309270106.c56Z2Tci-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309270106.c56Z2Tci-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   kernel/power/energy_model.c:125:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct em_perf_table *tmp @@     got struct em_perf_table [noderef] __rcu *runtime_table @@
   kernel/power/energy_model.c:125:13: sparse:     expected struct em_perf_table *tmp
   kernel/power/energy_model.c:125:13: sparse:     got struct em_perf_table [noderef] __rcu *runtime_table
>> kernel/power/energy_model.c:613:27: sparse: sparse: incompatible types in comparison expression (different address spaces):
>> kernel/power/energy_model.c:613:27: sparse:    struct em_perf_table [noderef] __rcu *
>> kernel/power/energy_model.c:613:27: sparse:    struct em_perf_table *

vim +613 kernel/power/energy_model.c

   569	
   570	/**
   571	 * em_dev_unregister_perf_domain() - Unregister Energy Model (EM) for a device
   572	 * @dev		: Device for which the EM is registered
   573	 *
   574	 * Unregister the EM for the specified @dev (but not a CPU device).
   575	 */
   576	void em_dev_unregister_perf_domain(struct device *dev)
   577	{
   578		struct em_perf_table __rcu *runtime_table;
   579		struct em_perf_domain *pd;
   580	
   581		if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
   582			return;
   583	
   584		if (_is_cpu_device(dev))
   585			return;
   586	
   587		pd = dev->em_pd;
   588		/*
   589		 * The mutex separates all register/unregister requests and protects
   590		 * from potential clean-up/setup issues in the debugfs directories.
   591		 * The debugfs directory name is the same as device's name.
   592		 * The lock also protects the updater of the runtime modifiable
   593		 * EM and this remover.
   594		 */
   595		mutex_lock(&em_pd_mutex);
   596	
   597		em_debug_remove_pd(dev);
   598	
   599		runtime_table = pd->runtime_table;
   600	
   601		/*
   602		 * Safely destroy runtime modifiable EM. By using the call
   603		 * synchronize_rcu() we make sure we don't progress till last user
   604		 * finished the RCU section and our update got applied.
   605		 */
   606		rcu_assign_pointer(pd->runtime_table, NULL);
   607		synchronize_rcu();
   608	
   609		/*
   610		 * After the sync no updates will be in-flight, so free the
   611		 * memory allocated for runtime table (if there was such).
   612		 */
 > 613		if (runtime_table != pd->default_table) {
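
The sparse warnings above come from mixing __rcu-annotated and plain pointers. A minimal sketch of the usual way to quiet them in this unregister path, where em_pd_mutex is already held (illustrative only, not necessarily how a next version will handle it):

	struct em_perf_table *tmp;

	/* Read the RCU-protected pointer into a plain pointer under the lock. */
	tmp = rcu_dereference_protected(pd->runtime_table,
					lockdep_is_held(&em_pd_mutex));

	rcu_assign_pointer(pd->runtime_table, NULL);
	synchronize_rcu();

	/* Both sides of the comparison are now plain pointers. */
	if (tmp != pd->default_table) {
		kfree(tmp->state);
		kfree(tmp);
	}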
  
Rafael J. Wysocki Sept. 26, 2023, 7:48 p.m. UTC | #2
On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:

First off, I would merge this with the previous patch, as the changes
would be much clearer then IMO.

> Add an interface which allows to modify EM power data at runtime.
> The new power information is populated by the provided callback, which
> is called for each performance state.

But it all starts with copying the frequencies from the default table.

> The CPU frequencies' efficiency is
> re-calculated since that might be affected as well. The old EM memory
> is going to be freed later using RCU mechanism.

Not all of it, but the old runtime table that is not going to be used any more.

> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/linux/energy_model.h |   8 +++
>  kernel/power/energy_model.c  | 111 +++++++++++++++++++++++++++++++++++
>  2 files changed, 119 insertions(+)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 740e7c25cfff..8f055ab356ed 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -201,6 +201,8 @@ struct em_data_callback {
>
>  struct em_perf_domain *em_cpu_get(int cpu);
>  struct em_perf_domain *em_pd_get(struct device *dev);
> +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
> +                             void *priv);
>  int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
>                                 struct em_data_callback *cb, cpumask_t *span,
>                                 bool microwatts);
> @@ -384,6 +386,12 @@ static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
>  {
>         return 0;
>  }
> +static inline
> +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
> +                             void *priv)
> +{
> +       return -EINVAL;
> +}
>  #endif
>
>  #endif
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 2345837bfd2c..78e1495dc87e 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -172,6 +172,101 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>         return 0;
>  }
>
> +/**
> + * em_dev_update_perf_domain() - Update runtime EM table for a device
> + * @dev                : Device for which the EM is to be updated
> + * @cb         : Callback function providing the power data for the EM
> + * @priv       : Pointer to private data useful for passing context
> + *             which might be required while calling @cb

It is still unclear to me who is going to use this priv pointer and how.

> + *
> + * Update EM runtime modifiable table for a @dev using the callback
> + * defined in @cb. The EM new power values are then used for calculating
> + * the em_perf_state::cost for associated performance state.

It actually allocates a new runtime table and populates it from
scratch, using the frequencies from the default table and the
callback.

> + *
> + * This function uses mutex to serialize writers, so it must not be called

"a mutex"

> + * from non-sleeping context.
> + *
> + * Return 0 on success or a proper error in case of failure.
> + */
> +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
> +                             void *priv)
> +{
> +       struct em_perf_table *runtime_table;
> +       unsigned long power, freq;
> +       struct em_perf_domain *pd;
> +       int ret, i;
> +
> +       if (!cb || !cb->update_power)
> +               return -EINVAL;
> +
> +       /*
> +        * The lock serializes update and unregister code paths. When the
> +        * EM has been unregistered in the meantime, we should capture that
> +        * when entering this critical section. It also makes sure that
> +        * two concurrent updates will be serialized.
> +        */
> +       mutex_lock(&em_pd_mutex);
> +
> +       if (!dev || !dev->em_pd) {

Checking dev against NULL under the mutex is pointless (either it is
NULL or it isn't, so check it earlier).

> +               ret = -EINVAL;
> +               goto unlock_em;
> +       }
> +
> +       pd = dev->em_pd;

And I would check pd against NULL here.

> +
> +       runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
> +       if (!runtime_table) {
> +               ret = -ENOMEM;
> +               goto unlock_em;
> +       }
> +
> +       runtime_table->state = kcalloc(pd->nr_perf_states,
> +                                      sizeof(struct em_perf_state),
> +                                      GFP_KERNEL);
> +       if (!runtime_table->state) {
> +               ret = -ENOMEM;
> +               goto free_runtime_table;
> +       }

The above allocations can be merged into one and allocating memory
under the mutex is questionable.

> +
> +       /* Populate runtime table with updated values using driver callback */
> +       for (i = 0; i < pd->nr_perf_states; i++) {
> +               freq = pd->default_table->state[i].frequency;
> +               runtime_table->state[i].frequency = freq;
> +
> +               /*
> +                * Call driver callback to get a new power value for
> +                * a given frequency.
> +                */
> +               ret = cb->update_power(dev, freq, &power, priv);
> +               if (ret) {
> +                       dev_dbg(dev, "EM: runtime update error: %d\n", ret);
> +                       goto free_runtime_state_table;
> +               }
> +
> +               runtime_table->state[i].power = power;
> +       }
> +
> +       ret = em_compute_costs(dev, runtime_table->state, cb,
> +                              pd->nr_perf_states, pd->flags);
> +       if (ret)
> +               goto free_runtime_state_table;
> +
> +       em_perf_runtime_table_set(dev, runtime_table);
> +
> +       mutex_unlock(&em_pd_mutex);
> +       return 0;
> +
> +free_runtime_state_table:
> +       kfree(runtime_table->state);
> +free_runtime_table:
> +       kfree(runtime_table);
> +unlock_em:
> +       mutex_unlock(&em_pd_mutex);
> +
> +       return -EINVAL;
> +}
> +EXPORT_SYMBOL_GPL(em_dev_update_perf_domain);
> +
>  static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
>                                 int nr_states, struct em_data_callback *cb,
>                                 unsigned long flags)
> @@ -494,6 +589,8 @@ void em_dev_unregister_perf_domain(struct device *dev)
>          * The mutex separates all register/unregister requests and protects
>          * from potential clean-up/setup issues in the debugfs directories.
>          * The debugfs directory name is the same as device's name.
> +        * The lock also protects the updater of the runtime modifiable
> +        * EM and this remover.
>          */
>         mutex_lock(&em_pd_mutex);
>
> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
>
>         runtime_table = pd->runtime_table;
>
> +       /*
> +        * Safely destroy runtime modifiable EM. By using the call
> +        * synchronize_rcu() we make sure we don't progress till last user
> +        * finished the RCU section and our update got applied.
> +        */
>         rcu_assign_pointer(pd->runtime_table, NULL);
>         synchronize_rcu();
>
> +       /*
> +        * After the sync no updates will be in-flight, so free the
> +        * memory allocated for runtime table (if there was such).
> +        */
> +       if (runtime_table != pd->default_table) {
> +               kfree(runtime_table->state);
> +               kfree(runtime_table);
> +       }

Can't this race with the RCU callback freeing the runtime table?

> +
>         kfree(pd->default_table->state);
>         kfree(pd->default_table);
>         kfree(dev->em_pd);
> --
  
Lukasz Luba Sept. 29, 2023, 10 a.m. UTC | #3
On 9/26/23 20:48, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> 
> First off, I would merge this with the previous patch, as the changes
> would be much clearer then IMO.

I was trying to avoid a big patch ~150 lines. I will do that merge.

> 
>> Add an interface which allows to modify EM power data at runtime.
>> The new power information is populated by the provided callback, which
>> is called for each performance state.
> 
> But it all starts with copying the frequencies from the default table.

Yes, I can add that to the description.

> 
>> The CPU frequencies' efficiency is
>> re-calculated since that might be affected as well. The old EM memory
>> is going to be freed later using RCU mechanism.
> 
> Not all of it, but the old runtime table that is not going to be used any more.

True, I will rephrase that to make it more precise.

> 
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>

[snip]

>>
>> +/**
>> + * em_dev_update_perf_domain() - Update runtime EM table for a device
>> + * @dev                : Device for which the EM is to be updated
>> + * @cb         : Callback function providing the power data for the EM
>> + * @priv       : Pointer to private data useful for passing context
>> + *             which might be required while calling @cb
> 
> It is still unclear to me who is going to use this priv pointer and how.

I have explained that in some previous patch response. A driver or
kernel module which monitors the thermal situation and has context.

> 
>> + *
>> + * Update EM runtime modifiable table for a @dev using the callback
>> + * defined in @cb. The EM new power values are then used for calculating
>> + * the em_perf_state::cost for associated performance state.
> 
> It actually allocates a new runtime table and populates it from
> scratch, using the frequencies from the default table and the
> callback.

Yes, it allocates a new table and puts the updated power values there.
I can add that to the comment.

> 
>> + *
>> + * This function uses mutex to serialize writers, so it must not be called
> 
> "a mutex"

ACK

> 
>> + * from non-sleeping context.

[snip]

>> +
>> +       if (!dev || !dev->em_pd) {
> 
> Checking dev against NULL under the mutex is pointless (either it is
> NULL or it isn't, so check it earlier).

ACK

> 
>> +               ret = -EINVAL;
>> +               goto unlock_em;
>> +       }
>> +
>> +       pd = dev->em_pd;
> 
> And I would check pd against NULL here.

It's done above, next to '!dev || !dev->em_pd'

> 
>> +
>> +       runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
>> +       if (!runtime_table) {
>> +               ret = -ENOMEM;
>> +               goto unlock_em;
>> +       }
>> +
>> +       runtime_table->state = kcalloc(pd->nr_perf_states,
>> +                                      sizeof(struct em_perf_state),
>> +                                      GFP_KERNEL);
>> +       if (!runtime_table->state) {
>> +               ret = -ENOMEM;
>> +               goto free_runtime_table;
>> +       }
> 
> The above allocations can be merged into one and allocating memory
> under the mutex is questionable.

So how do we make sure that there are no two callers trying to update the
same EM, or that unregistration is not running in the background?

[snip]

>>
>> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
>>
>>          runtime_table = pd->runtime_table;
>>
>> +       /*
>> +        * Safely destroy runtime modifiable EM. By using the call
>> +        * synchronize_rcu() we make sure we don't progress till last user
>> +        * finished the RCU section and our update got applied.
>> +        */
>>          rcu_assign_pointer(pd->runtime_table, NULL);
>>          synchronize_rcu();
>>
>> +       /*
>> +        * After the sync no updates will be in-flight, so free the
>> +        * memory allocated for runtime table (if there was such).
>> +        */
>> +       if (runtime_table != pd->default_table) {
>> +               kfree(runtime_table->state);
>> +               kfree(runtime_table);
>> +       }
> 
> Can't this race with the RCU callback freeing the runtime table?

That's why there is this 'synchronize_rcu()' above and the mutex. The
updating caller, once it has finished the update, unlocks the mutex and
this unregister code can proceed. Here we call synchronize_rcu() so we
are sure the callback for the update path has finished, and then we
explicitly free the saved 'runtime_table' here. So all the data should
be freed and the code serialized in those two paths.
  
Rafael J. Wysocki Sept. 29, 2023, 1:18 p.m. UTC | #4
On Fri, Sep 29, 2023 at 11:59 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 9/26/23 20:48, Rafael J. Wysocki wrote:
> > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >
> > First off, I would merge this with the previous patch, as the changes
> > would be much clearer then IMO.
>
> I was trying to avoid a big patch ~150 lines. I will do that merge.
>
> >
> >> Add an interface which allows to modify EM power data at runtime.
> >> The new power information is populated by the provided callback, which
> >> is called for each performance state.
> >
> > But it all starts with copying the frequencies from the default table.
>
> Yes, I can add that to the description.
>
> >
> >> The CPU frequencies' efficiency is
> >> re-calculated since that might be affected as well. The old EM memory
> >> is going to be freed later using RCU mechanism.
> >
> > Not all of it, but the old runtime table that is not going to be used any more.
>
> True, I will rephrase that to make it more precise.
>
> >
> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>
> [snip]
>
> >>
> >> +/**
> >> + * em_dev_update_perf_domain() - Update runtime EM table for a device
> >> + * @dev                : Device for which the EM is to be updated
> >> + * @cb         : Callback function providing the power data for the EM
> >> + * @priv       : Pointer to private data useful for passing context
> >> + *             which might be required while calling @cb
> >
> > It is still unclear to me who is going to use this priv pointer and how.
>
> I have explained that in some previous patch response. A driver or
> kernel module which monitors the thermal situation and has context.
>
> >
> >> + *
> >> + * Update EM runtime modifiable table for a @dev using the callback
> >> + * defined in @cb. The EM new power values are then used for calculating
> >> + * the em_perf_state::cost for associated performance state.
> >
> > It actually allocates a new runtime table and populates it from
> > scratch, using the frequencies from the default table and the
> > callback.
>
> Yes, it allocates a new table and puts the updated power values there.
> I can add that to the comment.
>
> >
> >> + *
> >> + * This function uses mutex to serialize writers, so it must not be called
> >
> > "a mutex"
>
> ACK
>
> >
> >> + * from non-sleeping context.
>
> [snip]
>
> >> +
> >> +       if (!dev || !dev->em_pd) {
> >
> > Checking dev against NULL under the mutex is pointless (either it is
> > NULL or it isn't, so check it earlier).
>
> ACK
>
> >
> >> +               ret = -EINVAL;
> >> +               goto unlock_em;
> >> +       }
> >> +
> >> +       pd = dev->em_pd;
> >
> > And I would check pd against NULL here.
>
> It's done above, next to '!dev || !dev->em_pd'

Yes, it is, I meant something like this:

    if (!cb || !cb->update_power || !dev)
        return -EINVAL;

    mutex_lock(&em_pd_mutex);

    pd = dev->em_pd;
    if (!pd) {
        ret = -EINVAL; /* or perhaps -ENODATA */
        goto unlock_em;
    }


> >
> >> +
> >> +       runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
> >> +       if (!runtime_table) {
> >> +               ret = -ENOMEM;
> >> +               goto unlock_em;
> >> +       }
> >> +
> >> +       runtime_table->state = kcalloc(pd->nr_perf_states,
> >> +                                      sizeof(struct em_perf_state),
> >> +                                      GFP_KERNEL);
> >> +       if (!runtime_table->state) {
> >> +               ret = -ENOMEM;
> >> +               goto free_runtime_table;
> >> +       }
> >
> > The above allocations can be merged into one and allocating memory
> > under the mutex is questionable.
>
> So how do we make sure that there are no two callers trying to update the
> same EM, or that unregistration is not running in the background?

You can allocate memory upfront and take the mutex before accessing
the shared data structures.  If there's an error in the code running
under the mutex, release it and then free the memory.

Allocating memory is one operation, updating the shared data
structures to use it is another one.  The former doesn't affect the
shared state in any way, so why do it under the mutex?
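
A minimal sketch of that split (illustrative only, not the posted patch; it reuses the helpers from this series and, for simplicity, reads dev->em_pd before taking the lock, which glosses over the unregister race discussed elsewhere in this thread):

int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
			      void *priv)
{
	struct em_perf_table *runtime_table;
	struct em_perf_domain *pd;
	unsigned long power, freq;
	int ret, i;

	if (!cb || !cb->update_power || !dev || !dev->em_pd)
		return -EINVAL;

	pd = dev->em_pd;

	/* Plain allocations do not touch shared EM state, so do them unlocked. */
	runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
	if (!runtime_table)
		return -ENOMEM;

	runtime_table->state = kcalloc(pd->nr_perf_states,
				       sizeof(struct em_perf_state),
				       GFP_KERNEL);
	if (!runtime_table->state) {
		kfree(runtime_table);
		return -ENOMEM;
	}

	/* Serialize against concurrent updaters and the unregister path. */
	mutex_lock(&em_pd_mutex);

	/* The EM may have been unregistered while we were allocating. */
	if (!dev->em_pd) {
		ret = -EINVAL;
		goto free_unlock;
	}

	for (i = 0; i < pd->nr_perf_states; i++) {
		freq = pd->default_table->state[i].frequency;
		runtime_table->state[i].frequency = freq;

		ret = cb->update_power(dev, freq, &power, priv);
		if (ret)
			goto free_unlock;

		runtime_table->state[i].power = power;
	}

	ret = em_compute_costs(dev, runtime_table->state, cb,
			       pd->nr_perf_states, pd->flags);
	if (ret)
		goto free_unlock;

	em_perf_runtime_table_set(dev, runtime_table);
	mutex_unlock(&em_pd_mutex);

	return 0;

free_unlock:
	mutex_unlock(&em_pd_mutex);
	kfree(runtime_table->state);
	kfree(runtime_table);
	return ret;
}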

> [snip]
>
> >>
> >> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
> >>
> >>          runtime_table = pd->runtime_table;
> >>
> >> +       /*
> >> +        * Safely destroy runtime modifiable EM. By using the call
> >> +        * synchronize_rcu() we make sure we don't progress till last user
> >> +        * finished the RCU section and our update got applied.
> >> +        */
> >>          rcu_assign_pointer(pd->runtime_table, NULL);
> >>          synchronize_rcu();
> >>
> >> +       /*
> >> +        * After the sync no updates will be in-flight, so free the
> >> +        * memory allocated for runtime table (if there was such).
> >> +        */
> >> +       if (runtime_table != pd->default_table) {
> >> +               kfree(runtime_table->state);
> >> +               kfree(runtime_table);
> >> +       }
> >
> > Can't this race with the RCU callback freeing the runtime table?
>
> That's why there is this 'synchronize_rcu()' above and the mutex. The
> updating caller, once it has finished the update, unlocks the mutex and
> this unregister code can proceed. Here we call synchronize_rcu() so we
> are sure the callback for the update path has finished, and then we
> explicitly free the saved 'runtime_table' here. So all the data should
> be freed and the code serialized in those two paths.

This doesn't quite agree with my understanding of what synchronize_rcu() does.

IIUC, RCU callbacks can run as soon as the grace period has elapsed
and they need not wait for synchronize_rcu() to return.  Conversely,
synchronize_rcu() doesn't wait for all of the RCU callbacks to
complete.

Now, em_destroy_rt_table_rcu() doesn't actually use the mutex, so how
exactly is it protected against racing with this code?
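
For reference, that distinction in a stand-alone sketch (not EM code): synchronize_rcu() waits for a grace period, while rcu_barrier() is the primitive that waits for previously queued call_rcu() callbacks to finish.

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	struct rcu_head rcu;
};

static void foo_free_cb(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

static void foo_teardown(struct foo *f)
{
	call_rcu(&f->rcu, foo_free_cb);	/* runs asynchronously after a grace period */

	synchronize_rcu();		/* waits for a grace period, not for foo_free_cb() */

	rcu_barrier();			/* waits for pending call_rcu() callbacks to run */
}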
  
Lukasz Luba Oct. 2, 2023, 2:09 p.m. UTC | #5
On 9/29/23 14:18, Rafael J. Wysocki wrote:
> On Fri, Sep 29, 2023 at 11:59 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>

[snip]

>>
>> It's done above, next to '!dev || !dev->em_pd'
> 
> Yes, it is, I meant something like this:
> 
>      if (!cb || !cb->update_power || !dev)
>          return -EINVAL;
> 
>      mutex_lock(&em_pd_mutex);
> 
>      pd = dev->em_pd;
>      if (!pd) {
>          ret = -EINVAL; /* or perhaps -ENODATA */
>          goto unlock_em;
>      }
> 
> 

OK, I see what you mean. Let me change that.

>>>
>>>> +
>>>> +       runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
>>>> +       if (!runtime_table) {
>>>> +               ret = -ENOMEM;
>>>> +               goto unlock_em;
>>>> +       }
>>>> +
>>>> +       runtime_table->state = kcalloc(pd->nr_perf_states,
>>>> +                                      sizeof(struct em_perf_state),
>>>> +                                      GFP_KERNEL);
>>>> +       if (!runtime_table->state) {
>>>> +               ret = -ENOMEM;
>>>> +               goto free_runtime_table;
>>>> +       }
>>>
>>> The above allocations can be merged into one and allocating memory
>>> under the mutex is questionable.
>>
>> So how do we make sure that there are no two callers trying to update the
>> same EM, or that unregistration is not running in the background?
> 
> You can allocate memory upfront and take the mutex before accessing
> the shared data structures.  If there's an error in the code running
> under the mutex, release it and then free the memory.
> 
> Allocating memory is one operation, updating the shared data
> structures to use it is another one.  The former doesn't affect the
> shared state in any way, so why do it under the mutex?

Yes, makes sense. I will shrink that critical section. Good catch,
thanks!

> 
>> [snip]
>>
>>>>
>>>> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
>>>>
>>>>           runtime_table = pd->runtime_table;
>>>>
>>>> +       /*
>>>> +        * Safely destroy runtime modifiable EM. By using the call
>>>> +        * synchronize_rcu() we make sure we don't progress till last user
>>>> +        * finished the RCU section and our update got applied.
>>>> +        */
>>>>           rcu_assign_pointer(pd->runtime_table, NULL);
>>>>           synchronize_rcu();
>>>>
>>>> +       /*
>>>> +        * After the sync no updates will be in-flight, so free the
>>>> +        * memory allocated for runtime table (if there was such).
>>>> +        */
>>>> +       if (runtime_table != pd->default_table) {
>>>> +               kfree(runtime_table->state);
>>>> +               kfree(runtime_table);
>>>> +       }
>>>
>>> Can't this race with the RCU callback freeing the runtime table?
>>
>> That's why there is this 'synchronize_rcu()' above and the mutex. The
>> updating caller, once it has finished the update, unlocks the mutex and
>> this unregister code can proceed. Here we call synchronize_rcu() so we
>> are sure the callback for the update path has finished, and then we
>> explicitly free the saved 'runtime_table' here. So all the data should
>> be freed and the code serialized in those two paths.
> 
> This doesn't quite agree with my understanding of what synchronize_rcu() does.
> 
> IIUC, RCU callbacks can run as soon as the grace period has elapsed
> and they need not wait for synchronize_rcu() to return.  Conversely,
> synchronize_rcu() doesn't wait for all of the RCU callbacks to
> complete.
> 
> Now, em_destroy_rt_table_rcu() doesn't actually use the mutex, so how
> exactly is it protected against racing with this code?


I'll try to draw it in some pictures...

(previous instance )
+---------------------+
|                     |
| runtime table    #1 |
|                     |
+---------------------+


(next instance )
+---------------------+
|                     |
| runtime table    #2 |
|                     |
+---------------------+


(not possible new instance)
+.....................+
.                     .
. runtime table    #3 .
.                     .
+.....................+



    cpu A - "updater"          |    cpu B - "remover"
                               |
------------------------------|------------------------------
    lock em_pd_mutex           |
                               |
       alloc runtime table #2  |   lock em_pd_mutex
                               |   (waiting)
       async free instance #1  |    .
                               |    .
    unlock em_pd_mutex         |    .
                               |   (enter critical section)
                               |
    lock em_pd_mutex           |   set NULL to runtime table ptr
    (waiting)                  |
    (wanted to create #3 inst) |   synchronize rcu to make sure it is visible
    .                          |
    .                          |   implicit free instance #2
    .                          |
    .                          |   free the rest of EM and EM
    .                          |
    .                          |   unlock em_pd_mutex
    (enter critical section)   |
    !dev->em_pd so             |
    unlock & exit              |
                               |
                               |


This should clean up all the involved memory and also prevent
allocating a new instance when we unregister the EM.
  

Patch

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 740e7c25cfff..8f055ab356ed 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -201,6 +201,8 @@  struct em_data_callback {
 
 struct em_perf_domain *em_cpu_get(int cpu);
 struct em_perf_domain *em_pd_get(struct device *dev);
+int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
+			      void *priv);
 int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 				struct em_data_callback *cb, cpumask_t *span,
 				bool microwatts);
@@ -384,6 +386,12 @@  static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
 {
 	return 0;
 }
+static inline
+int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
+			      void *priv)
+{
+	return -EINVAL;
+}
 #endif
 
 #endif
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 2345837bfd2c..78e1495dc87e 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -172,6 +172,101 @@  static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 	return 0;
 }
 
+/**
+ * em_dev_update_perf_domain() - Update runtime EM table for a device
+ * @dev		: Device for which the EM is to be updated
+ * @cb		: Callback function providing the power data for the EM
+ * @priv	: Pointer to private data useful for passing context
+ *		which might be required while calling @cb
+ *
+ * Update EM runtime modifiable table for a @dev using the callback
+ * defined in @cb. The EM new power values are then used for calculating
+ * the em_perf_state::cost for associated performance state.
+ *
+ * This function uses mutex to serialize writers, so it must not be called
+ * from non-sleeping context.
+ *
+ * Return 0 on success or a proper error in case of failure.
+ */
+int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
+			      void *priv)
+{
+	struct em_perf_table *runtime_table;
+	unsigned long power, freq;
+	struct em_perf_domain *pd;
+	int ret, i;
+
+	if (!cb || !cb->update_power)
+		return -EINVAL;
+
+	/*
+	 * The lock serializes update and unregister code paths. When the
+	 * EM has been unregistered in the meantime, we should capture that
+	 * when entering this critical section. It also makes sure that
+	 * two concurrent updates will be serialized.
+	 */
+	mutex_lock(&em_pd_mutex);
+
+	if (!dev || !dev->em_pd) {
+		ret = -EINVAL;
+		goto unlock_em;
+	}
+
+	pd = dev->em_pd;
+
+	runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
+	if (!runtime_table) {
+		ret = -ENOMEM;
+		goto unlock_em;
+	}
+
+	runtime_table->state = kcalloc(pd->nr_perf_states,
+				       sizeof(struct em_perf_state),
+				       GFP_KERNEL);
+	if (!runtime_table->state) {
+		ret = -ENOMEM;
+		goto free_runtime_table;
+	}
+
+	/* Populate runtime table with updated values using driver callback */
+	for (i = 0; i < pd->nr_perf_states; i++) {
+		freq = pd->default_table->state[i].frequency;
+		runtime_table->state[i].frequency = freq;
+
+		/*
+		 * Call driver callback to get a new power value for
+		 * a given frequency.
+		 */
+		ret = cb->update_power(dev, freq, &power, priv);
+		if (ret) {
+			dev_dbg(dev, "EM: runtime update error: %d\n", ret);
+			goto free_runtime_state_table;
+		}
+
+		runtime_table->state[i].power = power;
+	}
+
+	ret = em_compute_costs(dev, runtime_table->state, cb,
+			       pd->nr_perf_states, pd->flags);
+	if (ret)
+		goto free_runtime_state_table;
+
+	em_perf_runtime_table_set(dev, runtime_table);
+
+	mutex_unlock(&em_pd_mutex);
+	return 0;
+
+free_runtime_state_table:
+	kfree(runtime_table->state);
+free_runtime_table:
+	kfree(runtime_table);
+unlock_em:
+	mutex_unlock(&em_pd_mutex);
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(em_dev_update_perf_domain);
+
 static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 				int nr_states, struct em_data_callback *cb,
 				unsigned long flags)
@@ -494,6 +589,8 @@  void em_dev_unregister_perf_domain(struct device *dev)
 	 * The mutex separates all register/unregister requests and protects
 	 * from potential clean-up/setup issues in the debugfs directories.
 	 * The debugfs directory name is the same as device's name.
+	 * The lock also protects the updater of the runtime modifiable
+	 * EM and this remover.
 	 */
 	mutex_lock(&em_pd_mutex);
 
@@ -501,9 +598,23 @@  void em_dev_unregister_perf_domain(struct device *dev)
 
 	runtime_table = pd->runtime_table;
 
+	/*
+	 * Safely destroy runtime modifiable EM. By using the call
+	 * synchronize_rcu() we make sure we don't progress till last user
+	 * finished the RCU section and our update got applied.
+	 */
 	rcu_assign_pointer(pd->runtime_table, NULL);
 	synchronize_rcu();
 
+	/*
+	 * After the sync no updates will be in-flight, so free the
+	 * memory allocated for runtime table (if there was such).
+	 */
+	if (runtime_table != pd->default_table) {
+		kfree(runtime_table->state);
+		kfree(runtime_table);
+	}
+
 	kfree(pd->default_table->state);
 	kfree(pd->default_table);
 	kfree(dev->em_pd);