Message ID | 20230925081139.1305766-12-lukasz.luba@arm.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org>
From: Lukasz Luba <lukasz.luba@arm.com>
To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org, rafael@kernel.org
Cc: lukasz.luba@arm.com, dietmar.eggemann@arm.com, rui.zhang@intel.com, amit.kucheria@verdurent.com, amit.kachhap@gmail.com, daniel.lezcano@linaro.org, viresh.kumar@linaro.org, len.brown@intel.com, pavel@ucw.cz, mhiramat@kernel.org, qyousef@layalina.io, wvw@google.com
Subject: [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power
Date: Mon, 25 Sep 2023 09:11:32 +0100
Message-Id: <20230925081139.1305766-12-lukasz.luba@arm.com>
In-Reply-To: <20230925081139.1305766-1-lukasz.luba@arm.com>
References: <20230925081139.1305766-1-lukasz.luba@arm.com> |
Series | Introduce runtime modifiable Energy Model |
Commit Message
Lukasz Luba
Sept. 25, 2023, 8:11 a.m. UTC
Add an interface which allows modifying the EM power data at runtime.
The new power information is populated by the provided callback, which
is called for each performance state. The efficiency of the CPU
frequencies is re-calculated, since it might be affected as well. The
old runtime table memory is freed later using the RCU mechanism.
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
include/linux/energy_model.h | 8 +++
kernel/power/energy_model.c | 111 +++++++++++++++++++++++++++++++++++
2 files changed, 119 insertions(+)
Comments
Hi Lukasz,

kernel test robot noticed the following build warnings:

[auto build test WARNING on rafael-pm/linux-next]
[also build test WARNING on rafael-pm/thermal linus/master v6.6-rc3 next-20230926]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lukasz-Luba/PM-EM-Add-missing-newline-for-the-message-log/20230925-181243
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
patch link:    https://lore.kernel.org/r/20230925081139.1305766-12-lukasz.luba%40arm.com
patch subject: [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power
config: i386-randconfig-063-20230926 (https://download.01.org/0day-ci/archive/20230927/202309270106.c56Z2Tci-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230927/202309270106.c56Z2Tci-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309270106.c56Z2Tci-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   kernel/power/energy_model.c:125:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct em_perf_table *tmp @@     got struct em_perf_table [noderef] __rcu *runtime_table @@
   kernel/power/energy_model.c:125:13: sparse:     expected struct em_perf_table *tmp
   kernel/power/energy_model.c:125:13: sparse:     got struct em_perf_table [noderef] __rcu *runtime_table
>> kernel/power/energy_model.c:613:27: sparse: sparse: incompatible types in comparison expression (different address spaces):
>> kernel/power/energy_model.c:613:27: sparse:    struct em_perf_table [noderef] __rcu *
>> kernel/power/energy_model.c:613:27: sparse:    struct em_perf_table *

vim +613 kernel/power/energy_model.c

   569
   570	/**
   571	 * em_dev_unregister_perf_domain() - Unregister Energy Model (EM) for a device
   572	 * @dev : Device for which the EM is registered
   573	 *
   574	 * Unregister the EM for the specified @dev (but not a CPU device).
   575	 */
   576	void em_dev_unregister_perf_domain(struct device *dev)
   577	{
   578		struct em_perf_table __rcu *runtime_table;
   579		struct em_perf_domain *pd;
   580
   581		if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
   582			return;
   583
   584		if (_is_cpu_device(dev))
   585			return;
   586
   587		pd = dev->em_pd;
   588		/*
   589		 * The mutex separates all register/unregister requests and protects
   590		 * from potential clean-up/setup issues in the debugfs directories.
   591		 * The debugfs directory name is the same as device's name.
   592		 * The lock also protects the updater of the runtime modifiable
   593		 * EM and this remover.
   594		 */
   595		mutex_lock(&em_pd_mutex);
   596
   597		em_debug_remove_pd(dev);
   598
   599		runtime_table = pd->runtime_table;
   600
   601		/*
   602		 * Safely destroy runtime modifiable EM. By using the call
   603		 * synchronize_rcu() we make sure we don't progress till last user
   604		 * finished the RCU section and our update got applied.
   605		 */
   606		rcu_assign_pointer(pd->runtime_table, NULL);
   607		synchronize_rcu();
   608
   609		/*
   610		 * After the sync no updates will be in-flight, so free the
   611		 * memory allocated for runtime table (if there was such).
   612		 */
 > 613		if (runtime_table != pd->default_table) {
On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:

First off, I would merge this with the previous patch, as the changes
would be much clearer then IMO.

> Add an interface which allows to modify EM power data at runtime.
> The new power information is populated by the provided callback, which
> is called for each performance state.

But it all starts with copying the frequencies from the default table.

> The CPU frequencies' efficiency is
> re-calculated since that might be affected as well. The old EM memory
> is going to be freed later using RCU mechanism.

Not all of it, but the old runtime table that is not going to be used any more.

> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/linux/energy_model.h |   8 +++
>  kernel/power/energy_model.c  | 111 +++++++++++++++++++++++++++++++++++
>  2 files changed, 119 insertions(+)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 740e7c25cfff..8f055ab356ed 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -201,6 +201,8 @@ struct em_data_callback {
>
>  struct em_perf_domain *em_cpu_get(int cpu);
>  struct em_perf_domain *em_pd_get(struct device *dev);
> +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
> +			      void *priv);
>  int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
>  				struct em_data_callback *cb, cpumask_t *span,
>  				bool microwatts);
> @@ -384,6 +386,12 @@ static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
>  {
>  	return 0;
>  }
> +static inline
> +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
> +			      void *priv)
> +{
> +	return -EINVAL;
> +}
>  #endif
>
>  #endif
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 2345837bfd2c..78e1495dc87e 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -172,6 +172,101 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>  	return 0;
>  }
>
> +/**
> + * em_dev_update_perf_domain() - Update runtime EM table for a device
> + * @dev	: Device for which the EM is to be updated
> + * @cb	: Callback function providing the power data for the EM
> + * @priv	: Pointer to private data useful for passing context
> + *		  which might be required while calling @cb

It is still unclear to me who is going to use this priv pointer and how.

> + *
> + * Update EM runtime modifiable table for a @dev using the callback
> + * defined in @cb. The EM new power values are then used for calculating
> + * the em_perf_state::cost for associated performance state.

It actually allocates a new runtime table and populates it from
scratch, using the frequencies from the default table and the
callback.

> + *
> + * This function uses mutex to serialize writers, so it must not be called

"a mutex"

> + * from non-sleeping context.
> + *
> + * Return 0 on success or a proper error in case of failure.
> + */
> +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
> +			      void *priv)
> +{
> +	struct em_perf_table *runtime_table;
> +	unsigned long power, freq;
> +	struct em_perf_domain *pd;
> +	int ret, i;
> +
> +	if (!cb || !cb->update_power)
> +		return -EINVAL;
> +
> +	/*
> +	 * The lock serializes update and unregister code paths. When the
> +	 * EM has been unregistered in the meantime, we should capture that
> +	 * when entering this critical section. It also makes sure that
> +	 * two concurrent updates will be serialized.
> +	 */
> +	mutex_lock(&em_pd_mutex);
> +
> +	if (!dev || !dev->em_pd) {

Checking dev against NULL under the mutex is pointless (either it is
NULL or it isn't, so check it earlier).

> +		ret = -EINVAL;
> +		goto unlock_em;
> +	}
> +
> +	pd = dev->em_pd;

And I would check pd against NULL here.

> +
> +	runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
> +	if (!runtime_table) {
> +		ret = -ENOMEM;
> +		goto unlock_em;
> +	}
> +
> +	runtime_table->state = kcalloc(pd->nr_perf_states,
> +				       sizeof(struct em_perf_state),
> +				       GFP_KERNEL);
> +	if (!runtime_table->state) {
> +		ret = -ENOMEM;
> +		goto free_runtime_table;
> +	}

The above allocations can be merged into one and allocating memory
under the mutex is questionable.

> +
> +	/* Populate runtime table with updated values using driver callback */
> +	for (i = 0; i < pd->nr_perf_states; i++) {
> +		freq = pd->default_table->state[i].frequency;
> +		runtime_table->state[i].frequency = freq;
> +
> +		/*
> +		 * Call driver callback to get a new power value for
> +		 * a given frequency.
> +		 */
> +		ret = cb->update_power(dev, freq, &power, priv);
> +		if (ret) {
> +			dev_dbg(dev, "EM: runtime update error: %d\n", ret);
> +			goto free_runtime_state_table;
> +		}
> +
> +		runtime_table->state[i].power = power;
> +	}
> +
> +	ret = em_compute_costs(dev, runtime_table->state, cb,
> +			       pd->nr_perf_states, pd->flags);
> +	if (ret)
> +		goto free_runtime_state_table;
> +
> +	em_perf_runtime_table_set(dev, runtime_table);
> +
> +	mutex_unlock(&em_pd_mutex);
> +	return 0;
> +
> +free_runtime_state_table:
> +	kfree(runtime_table->state);
> +free_runtime_table:
> +	kfree(runtime_table);
> +unlock_em:
> +	mutex_unlock(&em_pd_mutex);
> +
> +	return -EINVAL;
> +}
> +EXPORT_SYMBOL_GPL(em_dev_update_perf_domain);
> +
>  static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
>  				int nr_states, struct em_data_callback *cb,
>  				unsigned long flags)
> @@ -494,6 +589,8 @@ void em_dev_unregister_perf_domain(struct device *dev)
>  	 * The mutex separates all register/unregister requests and protects
>  	 * from potential clean-up/setup issues in the debugfs directories.
>  	 * The debugfs directory name is the same as device's name.
> +	 * The lock also protects the updater of the runtime modifiable
> +	 * EM and this remover.
>  	 */
>  	mutex_lock(&em_pd_mutex);
>
> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
>
>  	runtime_table = pd->runtime_table;
>
> +	/*
> +	 * Safely destroy runtime modifiable EM. By using the call
> +	 * synchronize_rcu() we make sure we don't progress till last user
> +	 * finished the RCU section and our update got applied.
> +	 */
>  	rcu_assign_pointer(pd->runtime_table, NULL);
>  	synchronize_rcu();
>
> +	/*
> +	 * After the sync no updates will be in-flight, so free the
> +	 * memory allocated for runtime table (if there was such).
> +	 */
> +	if (runtime_table != pd->default_table) {
> +		kfree(runtime_table->state);
> +		kfree(runtime_table);
> +	}

Can't this race with the RCU callback freeing the runtime table?

> +
>  	kfree(pd->default_table->state);
>  	kfree(pd->default_table);
>  	kfree(dev->em_pd);
> --
On 9/26/23 20:48, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> First off, I would merge this with the previous patch, as the changes
> would be much clearer then IMO.

I was trying to avoid a big patch ~150 lines. I will do that merge.

>
>> Add an interface which allows to modify EM power data at runtime.
>> The new power information is populated by the provided callback, which
>> is called for each performance state.
>
> But it all starts with copying the frequencies from the default table.

Yes, I can add that to the description.

>
>> The CPU frequencies' efficiency is
>> re-calculated since that might be affected as well. The old EM memory
>> is going to be freed later using RCU mechanism.
>
> Not all of it, but the old runtime table that is not going to be used any more.

True, I will rephrase that, to make it more precised.

>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>

[snip]

>>
>> +/**
>> + * em_dev_update_perf_domain() - Update runtime EM table for a device
>> + * @dev	: Device for which the EM is to be updated
>> + * @cb	: Callback function providing the power data for the EM
>> + * @priv	: Pointer to private data useful for passing context
>> + *		  which might be required while calling @cb
>
> It is still unclear to me who is going to use this priv pointer and how.

I have explained that in some previous patch response. A driver or
kernel module which monitors the thermal situation and has context.

>
>> + *
>> + * Update EM runtime modifiable table for a @dev using the callback
>> + * defined in @cb. The EM new power values are then used for calculating
>> + * the em_perf_state::cost for associated performance state.
>
> It actually allocates a new runtime table and populates it from
> scratch, using the frequencies from the default table and the
> callback.

Yes, it allocated new table and put the updated power values there.
I can add that to the comment.

>
>> + *
>> + * This function uses mutex to serialize writers, so it must not be called
>
> "a mutex"

ACK

>
>> + * from non-sleeping context.

[snip]

>> +
>> +	if (!dev || !dev->em_pd) {
>
> Checking dev against NULL under the mutex is pointless (either it is
> NULL or it isn't, so check it earlier).

ACK

>
>> +		ret = -EINVAL;
>> +		goto unlock_em;
>> +	}
>> +
>> +	pd = dev->em_pd;
>
> And I would check pd against NULL here.

It's done above, next to '!dev || !dev->em_pd'

>
>> +
>> +	runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
>> +	if (!runtime_table) {
>> +		ret = -ENOMEM;
>> +		goto unlock_em;
>> +	}
>> +
>> +	runtime_table->state = kcalloc(pd->nr_perf_states,
>> +				       sizeof(struct em_perf_state),
>> +				       GFP_KERNEL);
>> +	if (!runtime_table->state) {
>> +		ret = -ENOMEM;
>> +		goto free_runtime_table;
>> +	}
>
> The above allocations can be merged into one and allocating memory
> under the mutex is questionable.

So how to make sure that there is no 2 callers trying to update the
same EM or unregistration is not in the background?

[snip]

>>
>> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
>>
>>  	runtime_table = pd->runtime_table;
>>
>> +	/*
>> +	 * Safely destroy runtime modifiable EM. By using the call
>> +	 * synchronize_rcu() we make sure we don't progress till last user
>> +	 * finished the RCU section and our update got applied.
>> +	 */
>>  	rcu_assign_pointer(pd->runtime_table, NULL);
>>  	synchronize_rcu();
>>
>> +	/*
>> +	 * After the sync no updates will be in-flight, so free the
>> +	 * memory allocated for runtime table (if there was such).
>> +	 */
>> +	if (runtime_table != pd->default_table) {
>> +		kfree(runtime_table->state);
>> +		kfree(runtime_table);
>> +	}
>
> Can't this race with the RCU callback freeing the runtime table?

That's why there is this 'synchronize_rcu()' above and the mutex. The
updating caller if finished the update, would unlock the mutex and this
unregister code can go. Here we call the synchronize_rcu() so we assure
the callback has finished for the update path and than we explicitly
free the saved 'runtime_table' here. So all data should be freed and
code serialized in those two paths.
On Fri, Sep 29, 2023 at 11:59 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> On 9/26/23 20:48, Rafael J. Wysocki wrote:
> > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >
> > First off, I would merge this with the previous patch, as the changes
> > would be much clearer then IMO.
>
> I was trying to avoid a big patch ~150 lines. I will do that merge.
>
> >
> >> Add an interface which allows to modify EM power data at runtime.
> >> The new power information is populated by the provided callback, which
> >> is called for each performance state.
> >
> > But it all starts with copying the frequencies from the default table.
>
> Yes, I can add that to the description.
>
> >
> >> The CPU frequencies' efficiency is
> >> re-calculated since that might be affected as well. The old EM memory
> >> is going to be freed later using RCU mechanism.
> >
> > Not all of it, but the old runtime table that is not going to be used any more.
>
> True, I will rephrase that, to make it more precised.
>
> >
> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>
> [snip]
>
> >>
> >> +/**
> >> + * em_dev_update_perf_domain() - Update runtime EM table for a device
> >> + * @dev	: Device for which the EM is to be updated
> >> + * @cb	: Callback function providing the power data for the EM
> >> + * @priv	: Pointer to private data useful for passing context
> >> + *		  which might be required while calling @cb
> >
> > It is still unclear to me who is going to use this priv pointer and how.
>
> I have explained that in some previous patch response. A driver or
> kernel module which monitors the thermal situation and has context.
>
> >
> >> + *
> >> + * Update EM runtime modifiable table for a @dev using the callback
> >> + * defined in @cb. The EM new power values are then used for calculating
> >> + * the em_perf_state::cost for associated performance state.
> >
> > It actually allocates a new runtime table and populates it from
> > scratch, using the frequencies from the default table and the
> > callback.
>
> Yes, it allocated new table and put the updated power values there.
> I can add that to the comment.
>
> >
> >> + *
> >> + * This function uses mutex to serialize writers, so it must not be called
> >
> > "a mutex"
>
> ACK
>
> >
> >> + * from non-sleeping context.
>
> [snip]
>
> >> +
> >> +	if (!dev || !dev->em_pd) {
> >
> > Checking dev against NULL under the mutex is pointless (either it is
> > NULL or it isn't, so check it earlier).
>
> ACK
>
> >
> >> +		ret = -EINVAL;
> >> +		goto unlock_em;
> >> +	}
> >> +
> >> +	pd = dev->em_pd;
> >
> > And I would check pd against NULL here.
>
> It's done above, next to '!dev || !dev->em_pd'

Yes, it is, I meant something like this:

	if (!cb || !cb->update_power || !dev)
		return -EINVAL;

	mutex_lock(&em_pd_mutex);

	pd = dev->em_pd;
	if (!pd) {
		ret = -EINVAL;  /* or perhaps -ENODATA */
		goto unlock_em;
	}

> >
> >> +
> >> +	runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
> >> +	if (!runtime_table) {
> >> +		ret = -ENOMEM;
> >> +		goto unlock_em;
> >> +	}
> >> +
> >> +	runtime_table->state = kcalloc(pd->nr_perf_states,
> >> +				       sizeof(struct em_perf_state),
> >> +				       GFP_KERNEL);
> >> +	if (!runtime_table->state) {
> >> +		ret = -ENOMEM;
> >> +		goto free_runtime_table;
> >> +	}
> >
> > The above allocations can be merged into one and allocating memory
> > under the mutex is questionable.
>
> So how to make sure that there is no 2 callers trying to update the
> same EM or unregistration is not in the background?

You can allocate memory upfront and take the mutex before accessing
the shared data structures. If there's an error in the code running
under the mutex, release it and then free the memory.

Allocating memory is one operation, updating the shared data
structures to use it is another one. The former doesn't affect the
shared state in any way, so why do it under the mutex?

> [snip]
>
> >>
> >> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
> >>
> >>  	runtime_table = pd->runtime_table;
> >>
> >> +	/*
> >> +	 * Safely destroy runtime modifiable EM. By using the call
> >> +	 * synchronize_rcu() we make sure we don't progress till last user
> >> +	 * finished the RCU section and our update got applied.
> >> +	 */
> >>  	rcu_assign_pointer(pd->runtime_table, NULL);
> >>  	synchronize_rcu();
> >>
> >> +	/*
> >> +	 * After the sync no updates will be in-flight, so free the
> >> +	 * memory allocated for runtime table (if there was such).
> >> +	 */
> >> +	if (runtime_table != pd->default_table) {
> >> +		kfree(runtime_table->state);
> >> +		kfree(runtime_table);
> >> +	}
> >
> > Can't this race with the RCU callback freeing the runtime table?
>
> That's why there is this 'synchronize_rcu()' above and the mutex. The
> updating caller if finished the update, would unlock the mutex and this
> unregister code can go. Here we call the synchronize_rcu() so we assure
> the callback has finished for the update path and than we explicitly
> free the saved 'runtime_table' here. So all data should be freed and
> code serialized in those two paths.

This doesn't quite agree with my understanding of what synchronize_rcu() does.

IIUC, RCU callbacks can run as soon as the grace period has elapsed
and they need not wait for synchronize_rcu() to return. Conversely,
synchronize_rcu() doesn't wait for all of the RCU callbacks to
complete.

Now, em_destroy_rt_table_rcu() doesn't actually use the mutex, so how
exactly is it protected against racing with this code?
On 9/29/23 14:18, Rafael J. Wysocki wrote:
> On Fri, Sep 29, 2023 at 11:59 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>

[snip]

>>
>> It's done above, next to '!dev || !dev->em_pd'
>
> Yes, it is, I meant something like this:
>
>	if (!cb || !cb->update_power || !dev)
>		return -EINVAL;
>
>	mutex_lock(&em_pd_mutex);
>
>	pd = dev->em_pd;
>	if (!pd) {
>		ret = -EINVAL;  /* or perhaps -ENODATA */
>		goto unlock_em;
>	}
>

OK, I see what you mean. Let me change that.

>>>
>>>> +
>>>> +	runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
>>>> +	if (!runtime_table) {
>>>> +		ret = -ENOMEM;
>>>> +		goto unlock_em;
>>>> +	}
>>>> +
>>>> +	runtime_table->state = kcalloc(pd->nr_perf_states,
>>>> +				       sizeof(struct em_perf_state),
>>>> +				       GFP_KERNEL);
>>>> +	if (!runtime_table->state) {
>>>> +		ret = -ENOMEM;
>>>> +		goto free_runtime_table;
>>>> +	}
>>>
>>> The above allocations can be merged into one and allocating memory
>>> under the mutex is questionable.
>>
>> So how to make sure that there is no 2 callers trying to update the
>> same EM or unregistration is not in the background?
>
> You can allocate memory upfront and take the mutex before accessing
> the shared data structures. If there's an error in the code running
> under the mutex, release it and then free the memory.
>
> Allocating memory is one operation, updating the shared data
> structures to use it is another one. The former doesn't affect the
> shared state in any way, so why do it under the mutex?

Yes, make sense. I will shrink that critical section. Good catch,
thanks!

>
>> [snip]
>>
>>>>
>>>> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
>>>>
>>>>  	runtime_table = pd->runtime_table;
>>>>
>>>> +	/*
>>>> +	 * Safely destroy runtime modifiable EM. By using the call
>>>> +	 * synchronize_rcu() we make sure we don't progress till last user
>>>> +	 * finished the RCU section and our update got applied.
>>>> +	 */
>>>>  	rcu_assign_pointer(pd->runtime_table, NULL);
>>>>  	synchronize_rcu();
>>>>
>>>> +	/*
>>>> +	 * After the sync no updates will be in-flight, so free the
>>>> +	 * memory allocated for runtime table (if there was such).
>>>> +	 */
>>>> +	if (runtime_table != pd->default_table) {
>>>> +		kfree(runtime_table->state);
>>>> +		kfree(runtime_table);
>>>> +	}
>>>
>>> Can't this race with the RCU callback freeing the runtime table?
>>
>> That's why there is this 'synchronize_rcu()' above and the mutex. The
>> updating caller if finished the update, would unlock the mutex and this
>> unregister code can go. Here we call the synchronize_rcu() so we assure
>> the callback has finished for the update path and than we explicitly
>> free the saved 'runtime_table' here. So all data should be freed and
>> code serialized in those two paths.
>
> This doesn't quite agree with my understanding of what synchronize_rcu() does.
>
> IIUC, RCU callbacks can run as soon as the grace period has elapsed
> and they need not wait for synchronize_rcu() to return. Conversely,
> synchronize_rcu() doesn't wait for all of the RCU callbacks to
> complete.
>
> Now, em_destroy_rt_table_rcu() doesn't actually use the mutex, so how
> exactly is it protected against racing with this code?

I'll try to draw in on some pictures...

(previous instance)
+---------------------+
|                     |
|  runtime table #1   |
|                     |
+---------------------+

(next instance)
+---------------------+
|                     |
|  runtime table #2   |
|                     |
+---------------------+

(not possible new instance)
+.....................+
.                     .
.  runtime table #3   .
.                     .
+.....................+

 cpu A - "updater"              |  cpu B - "remover"
 -------------------------------|------------------------------
 lock em_pd_mutex               |
                                |
 alloc runtime table #2         |  lock em_pd_mutex
                                |  (waiting)
 async free instance #1         |        .
                                |        .
 unlock em_pd_mutex             |        .
                                |  (enter critical section)
 lock em_pd_mutex               |  set NULL to runtime table ptr
 (waiting)                      |
 (wanted to create #3 inst)     |  synchronize rcu to make it is visible
        .                       |        .
                                |  implicit free instance #2
        .                       |        .
                                |  free the rest of EM and EM
        .                       |        .
                                |  unlock em_pd_mutex
 (enter critical section)       |
 !dev->em_pd so                 |
 unlock & exit                  |

This should clean all involved memory and also prevent of allocating
new instance, when we unregister EM.
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 740e7c25cfff..8f055ab356ed 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -201,6 +201,8 @@ struct em_data_callback {
 
 struct em_perf_domain *em_cpu_get(int cpu);
 struct em_perf_domain *em_pd_get(struct device *dev);
+int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
+			      void *priv);
 int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 				struct em_data_callback *cb, cpumask_t *span,
 				bool microwatts);
@@ -384,6 +386,12 @@ static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
 {
 	return 0;
 }
+static inline
+int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
+			      void *priv)
+{
+	return -EINVAL;
+}
 #endif
 
 #endif
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 2345837bfd2c..78e1495dc87e 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -172,6 +172,101 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 	return 0;
 }
 
+/**
+ * em_dev_update_perf_domain() - Update runtime EM table for a device
+ * @dev	: Device for which the EM is to be updated
+ * @cb	: Callback function providing the power data for the EM
+ * @priv	: Pointer to private data useful for passing context
+ *		  which might be required while calling @cb
+ *
+ * Update EM runtime modifiable table for a @dev using the callback
+ * defined in @cb. The EM new power values are then used for calculating
+ * the em_perf_state::cost for associated performance state.
+ *
+ * This function uses mutex to serialize writers, so it must not be called
+ * from non-sleeping context.
+ *
+ * Return 0 on success or a proper error in case of failure.
+ */
+int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
+			      void *priv)
+{
+	struct em_perf_table *runtime_table;
+	unsigned long power, freq;
+	struct em_perf_domain *pd;
+	int ret, i;
+
+	if (!cb || !cb->update_power)
+		return -EINVAL;
+
+	/*
+	 * The lock serializes update and unregister code paths. When the
+	 * EM has been unregistered in the meantime, we should capture that
+	 * when entering this critical section. It also makes sure that
+	 * two concurrent updates will be serialized.
+	 */
+	mutex_lock(&em_pd_mutex);
+
+	if (!dev || !dev->em_pd) {
+		ret = -EINVAL;
+		goto unlock_em;
+	}
+
+	pd = dev->em_pd;
+
+	runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
+	if (!runtime_table) {
+		ret = -ENOMEM;
+		goto unlock_em;
+	}
+
+	runtime_table->state = kcalloc(pd->nr_perf_states,
+				       sizeof(struct em_perf_state),
+				       GFP_KERNEL);
+	if (!runtime_table->state) {
+		ret = -ENOMEM;
+		goto free_runtime_table;
+	}
+
+	/* Populate runtime table with updated values using driver callback */
+	for (i = 0; i < pd->nr_perf_states; i++) {
+		freq = pd->default_table->state[i].frequency;
+		runtime_table->state[i].frequency = freq;
+
+		/*
+		 * Call driver callback to get a new power value for
+		 * a given frequency.
+		 */
+		ret = cb->update_power(dev, freq, &power, priv);
+		if (ret) {
+			dev_dbg(dev, "EM: runtime update error: %d\n", ret);
+			goto free_runtime_state_table;
+		}
+
+		runtime_table->state[i].power = power;
+	}
+
+	ret = em_compute_costs(dev, runtime_table->state, cb,
+			       pd->nr_perf_states, pd->flags);
+	if (ret)
+		goto free_runtime_state_table;
+
+	em_perf_runtime_table_set(dev, runtime_table);
+
+	mutex_unlock(&em_pd_mutex);
+	return 0;
+
+free_runtime_state_table:
+	kfree(runtime_table->state);
+free_runtime_table:
+	kfree(runtime_table);
+unlock_em:
+	mutex_unlock(&em_pd_mutex);
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(em_dev_update_perf_domain);
+
 static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 				int nr_states, struct em_data_callback *cb,
 				unsigned long flags)
@@ -494,6 +589,8 @@ void em_dev_unregister_perf_domain(struct device *dev)
 	 * The mutex separates all register/unregister requests and protects
 	 * from potential clean-up/setup issues in the debugfs directories.
 	 * The debugfs directory name is the same as device's name.
+	 * The lock also protects the updater of the runtime modifiable
+	 * EM and this remover.
 	 */
 	mutex_lock(&em_pd_mutex);
 
@@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
 
 	runtime_table = pd->runtime_table;
 
+	/*
+	 * Safely destroy runtime modifiable EM. By using the call
+	 * synchronize_rcu() we make sure we don't progress till last user
+	 * finished the RCU section and our update got applied.
+	 */
 	rcu_assign_pointer(pd->runtime_table, NULL);
 	synchronize_rcu();
 
+	/*
+	 * After the sync no updates will be in-flight, so free the
+	 * memory allocated for runtime table (if there was such).
+	 */
+	if (runtime_table != pd->default_table) {
+		kfree(runtime_table->state);
+		kfree(runtime_table);
+	}
+
 	kfree(pd->default_table->state);
 	kfree(pd->default_table);
 	kfree(dev->em_pd);