[v1,08/10] cpufreq: tegra194: add OPP support and set bandwidth

Message ID 20221220160240.27494-9-sumitg@nvidia.com
State New
Series Tegra234 Memory interconnect support

Commit Message

Sumit Gupta Dec. 20, 2022, 4:02 p.m. UTC
  Add support to use the OPP table from DT in the Tegra194 cpufreq driver.
Tegra SoCs receive the frequency lookup table (LUT) from BPMP-FW.
Cross-check the OPPs present in DT against the LUT from BPMP-FW and
enable only those DT OPPs which are also present in the LUT.

The OPP table in DT maps CPU frequency to bandwidth, where the
bandwidth value is per MC channel. DRAM bandwidth depends on the
number of MC channels, which can vary with the boot configuration.
The MC driver later converts this per-channel bandwidth from the OPP
table into the final bandwidth value by multiplying it with the number
of channels before sending the request to BPMP-FW.

If the OPP table is not present in DT, then use the LUT from BPMP-FW
directly as the frequency table and do not do DRAM frequency scaling,
which is the same as the current behavior.

Now that the CPU frequency table is controlled through the OPP table
in DT, keeping fewer entries in the table creates fewer frequency
steps and lets the CPU scale quickly to high frequencies when required.

Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
---
 drivers/cpufreq/tegra194-cpufreq.c | 152 ++++++++++++++++++++++++++---
 1 file changed, 139 insertions(+), 13 deletions(-)
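
The per-channel bandwidth handling described in the commit message can be
summarised with a minimal sketch. This is illustrative only: the helper and
structure names below are assumptions, not code from this series; the real
conversion is done on the Tegra MC driver side.

/* Hypothetical sketch: scale the per-channel bandwidth requested through
 * the CPU OPP table to the total DRAM bandwidth sent to BPMP-FW.
 */
struct mc_channel_info {
	unsigned int num_channels;	/* taken from the boot configuration */
};

static unsigned long mc_scale_percpu_bw(const struct mc_channel_info *mc,
					unsigned long peak_kbps_per_channel)
{
	/* final bandwidth = per-channel bandwidth from the OPP * channel count */
	return peak_kbps_per_channel * mc->num_channels;
}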
  

Comments

Dmitry Osipenko Dec. 22, 2022, 3:46 p.m. UTC | #1
On 20.12.2022 19:02, Sumit Gupta wrote:
> Add support to use OPP table from DT in Tegra194 cpufreq driver.
> Tegra SoC's receive the frequency lookup table (LUT) from BPMP-FW.
> Cross check the OPP's present in DT against the LUT from BPMP-FW
> and enable only those DT OPP's which are present in LUT also.
> 
> The OPP table in DT has CPU Frequency to bandwidth mapping where
> the bandwidth value is per MC channel. DRAM bandwidth depends on the
> number of MC channels which can vary as per the boot configuration.
> This per channel bandwidth from OPP table will be later converted by
> MC driver to final bandwidth value by multiplying with number of
> channels before sending the request to BPMP-FW.
> 
> If OPP table is not present in DT, then use the LUT from BPMP-FW directy
> as the frequency table and not do the DRAM frequency scaling which is
> same as the current behavior.
> 
> Now, as the CPU Frequency table is being controlling through OPP table
> in DT. Keeping fewer entries in the table will create less frequency
> steps and scale fast to high frequencies if required.

It's not exactly clear what you're doing here. Are you going to scale
memory BW based on CPU freq? If yes, then this is wrong because CPU freq
is independent of the memory subsystem.

All Tegra30+ SoCs have an ACTMON hardware unit that monitors CPU memory
activity, and CPU memory BW should be scaled based on the CPU memory
events counter. We have an ACTMON devfreq driver for the older SoCs. I
have no clue how ACTMON can be accessed on T186+; perhaps there should
be a BPMP FW API for that.
  
Sumit Gupta Jan. 13, 2023, 1:50 p.m. UTC | #2
On 22/12/22 21:16, Dmitry Osipenko wrote:
> External email: Use caution opening links or attachments
> 
> 
> On 20.12.2022 19:02, Sumit Gupta wrote:
>> Add support to use OPP table from DT in Tegra194 cpufreq driver.
>> Tegra SoC's receive the frequency lookup table (LUT) from BPMP-FW.
>> Cross check the OPP's present in DT against the LUT from BPMP-FW
>> and enable only those DT OPP's which are present in LUT also.
>>
>> The OPP table in DT has CPU Frequency to bandwidth mapping where
>> the bandwidth value is per MC channel. DRAM bandwidth depends on the
>> number of MC channels which can vary as per the boot configuration.
>> This per channel bandwidth from OPP table will be later converted by
>> MC driver to final bandwidth value by multiplying with number of
>> channels before sending the request to BPMP-FW.
>>
>> If OPP table is not present in DT, then use the LUT from BPMP-FW directy
>> as the frequency table and not do the DRAM frequency scaling which is
>> same as the current behavior.
>>
>> Now, as the CPU Frequency table is being controlling through OPP table
>> in DT. Keeping fewer entries in the table will create less frequency
>> steps and scale fast to high frequencies if required.
> 
> It's not exactly clear what you're doing here. Are you going to scale
> memory BW based on CPU freq? If yes, then this is wrong because CPU freq
> is independent from the memory subsystem.
> 
> All Tegra30+ SoCs have ACTMON hardware unit that monitors CPU memory
> activity and CPU memory BW should be scaled based on CPU memory events
> counter. We have ACTMON devfreq driver for older SoCs. I have no clue
> how ACTMON can be accessed on T186+, perhaps there should be a BPMP FW
> API for that.
> 

Yes, we are scaling the memory BW based on CPU freq.
I referred to the patch set below for the previous generation of Tegra
SoCs that you mentioned and tried to trace the history.

https://patchwork.ozlabs.org/project/linux-tegra/patch/1418719298-25314-3-git-send-email-tomeu.vizoso@collabora.com/

In the new Tegra SoCs, ACTMON counter control and usage has moved into
BPMP-FW, where only the 'MCALL' counter is used and 'MCCPU' is not.
Using the ACTMON counter was a reactive way to scale the frequency,
which is less effective because it averages over a time period.
We are now using a proactive approach where clients state their
bandwidth needs, which helps achieve better performance.
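
For reference, the proactive request model mentioned above maps onto the
upstream interconnect framework roughly as in the sketch below. The
devm_of_icc_get()/icc_set_bw() calls are the standard kernel ICC API; the
"cpu" path name and the helper are placeholders, and in this series the
request is actually issued indirectly through the OPP core rather than by
hand.

#include <linux/device.h>
#include <linux/interconnect.h>

/* Minimal sketch of a client proactively stating its bandwidth need.
 * Bandwidth values are in kBps; the "cpu" path name is a placeholder.
 */
static int request_dram_bw(struct device *dev, u32 peak_kbps)
{
	struct icc_path *path;
	int ret;

	path = devm_of_icc_get(dev, "cpu");
	if (IS_ERR(path))
		return PTR_ERR(path);

	/* Forward the request to the provider (ultimately BPMP-FW) */
	ret = icc_set_bw(path, 0, peak_kbps);

	return ret;
}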
  
Dmitry Osipenko Jan. 16, 2023, 12:16 p.m. UTC | #3
On 1/13/23 16:50, Sumit Gupta wrote:
> 
> 
> On 22/12/22 21:16, Dmitry Osipenko wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 20.12.2022 19:02, Sumit Gupta wrote:
>>> Add support to use OPP table from DT in Tegra194 cpufreq driver.
>>> Tegra SoC's receive the frequency lookup table (LUT) from BPMP-FW.
>>> Cross check the OPP's present in DT against the LUT from BPMP-FW
>>> and enable only those DT OPP's which are present in LUT also.
>>>
>>> The OPP table in DT has CPU Frequency to bandwidth mapping where
>>> the bandwidth value is per MC channel. DRAM bandwidth depends on the
>>> number of MC channels which can vary as per the boot configuration.
>>> This per channel bandwidth from OPP table will be later converted by
>>> MC driver to final bandwidth value by multiplying with number of
>>> channels before sending the request to BPMP-FW.
>>>
>>> If OPP table is not present in DT, then use the LUT from BPMP-FW directy
>>> as the frequency table and not do the DRAM frequency scaling which is
>>> same as the current behavior.
>>>
>>> Now, as the CPU Frequency table is being controlling through OPP table
>>> in DT. Keeping fewer entries in the table will create less frequency
>>> steps and scale fast to high frequencies if required.
>>
>> It's not exactly clear what you're doing here. Are you going to scale
>> memory BW based on CPU freq? If yes, then this is wrong because CPU freq
>> is independent from the memory subsystem.
>>
>> All Tegra30+ SoCs have ACTMON hardware unit that monitors CPU memory
>> activity and CPU memory BW should be scaled based on CPU memory events
>> counter. We have ACTMON devfreq driver for older SoCs. I have no clue
>> how ACTMON can be accessed on T186+, perhaps there should be a BPMP FW
>> API for that.
>>
> 
> Yes, scaling the memory BW based on CPU freq.
> Referred below patch set for previous generation of Tegra Soc's which
> you mentioned and tried to trace the history.
> 
> https://patchwork.ozlabs.org/project/linux-tegra/patch/1418719298-25314-3-git-send-email-tomeu.vizoso@collabora.com/
> 
> In new Tegra Soc's, actmon counter control and usage has been moved to
> BPMP-FW where only 'MCALL' counter is used and 'MCCPU is not being used.
> Using the actmon counter was a reactive way to scale the frequency which
> is less effective due to averaging over a time period.
> We are now using the proactive way where clients tell their bandwidth
> needs to help achieve better performance.

You don't know what bandwidth the CPU needs; you are trying to guess it.

Using cpufreq for memory bandwidth scaling is a bad decision. You'll be
wasting memory power 90% of the time because cpufreq has no direct
relation to the DRAM; your heuristics will be wrong and won't do
anything good compared to using ACTMON. The L2 CPU cache plus memory
prefetching hides memory from the CPU. And cpufreq should be less
reactive than ACTMON in general.

Scaling memory freq based on cpufreq is what the downstream NV kernel
did 10+ years ago for the oldest Tegra generations. Today upstream has
all the necessary infrastructure for doing memory bandwidth scaling
properly, and we even use h/w memory counters on T20. It's strange that
you want to bring this downstream archaism to the modern upstream for
the latest Tegra generations.

If you can skip the BPMP-FW and use ACTMON directly from the kernel,
then that's what I suggest doing.
  
Thierry Reding Jan. 19, 2023, 10:26 a.m. UTC | #4
On Mon, Jan 16, 2023 at 03:16:48PM +0300, Dmitry Osipenko wrote:
> On 1/13/23 16:50, Sumit Gupta wrote:
> > 
> > 
> > On 22/12/22 21:16, Dmitry Osipenko wrote:
> >> External email: Use caution opening links or attachments
> >>
> >>
> >> On 20.12.2022 19:02, Sumit Gupta wrote:
> >>> Add support to use OPP table from DT in Tegra194 cpufreq driver.
> >>> Tegra SoC's receive the frequency lookup table (LUT) from BPMP-FW.
> >>> Cross check the OPP's present in DT against the LUT from BPMP-FW
> >>> and enable only those DT OPP's which are present in LUT also.
> >>>
> >>> The OPP table in DT has CPU Frequency to bandwidth mapping where
> >>> the bandwidth value is per MC channel. DRAM bandwidth depends on the
> >>> number of MC channels which can vary as per the boot configuration.
> >>> This per channel bandwidth from OPP table will be later converted by
> >>> MC driver to final bandwidth value by multiplying with number of
> >>> channels before sending the request to BPMP-FW.
> >>>
> >>> If OPP table is not present in DT, then use the LUT from BPMP-FW directy
> >>> as the frequency table and not do the DRAM frequency scaling which is
> >>> same as the current behavior.
> >>>
> >>> Now, as the CPU Frequency table is being controlling through OPP table
> >>> in DT. Keeping fewer entries in the table will create less frequency
> >>> steps and scale fast to high frequencies if required.
> >>
> >> It's not exactly clear what you're doing here. Are you going to scale
> >> memory BW based on CPU freq? If yes, then this is wrong because CPU freq
> >> is independent from the memory subsystem.
> >>
> >> All Tegra30+ SoCs have ACTMON hardware unit that monitors CPU memory
> >> activity and CPU memory BW should be scaled based on CPU memory events
> >> counter. We have ACTMON devfreq driver for older SoCs. I have no clue
> >> how ACTMON can be accessed on T186+, perhaps there should be a BPMP FW
> >> API for that.
> >>
> > 
> > Yes, scaling the memory BW based on CPU freq.
> > Referred below patch set for previous generation of Tegra Soc's which
> > you mentioned and tried to trace the history.
> > 
> > https://patchwork.ozlabs.org/project/linux-tegra/patch/1418719298-25314-3-git-send-email-tomeu.vizoso@collabora.com/
> > 
> > In new Tegra Soc's, actmon counter control and usage has been moved to
> > BPMP-FW where only 'MCALL' counter is used and 'MCCPU is not being used.
> > Using the actmon counter was a reactive way to scale the frequency which
> > is less effective due to averaging over a time period.
> > We are now using the proactive way where clients tell their bandwidth
> > needs to help achieve better performance.
> 
> You don't know what bandwidth CPU needs, you trying to guess it.
> 
> It should be a bad decision to use cpufreq for memory bandwidth scaling.
> You'll be wasting memory power 90% of time because cpufreq doesn't have
> relation to the DRAM, your heuristics will be wrong and won't do
> anything good compared to using ACTMON. The L2 CPU cache + memory
> prefetching hides memory from CPU. And cpufreq should be less reactive
> than ACTMON in general.
> 
> Scaling memory freq based on cpufreq is what downstream NV kernel did
> 10+ years ago for the oldest Tegra generations. Today upstream has all
> the necessary infrastructure for doing memory bandwidth scaling properly
> and we even using h/w memory counters on T20. It's strange that you want
> to bring the downstream archaity to the modern upstream for the latest
> Tegra generations.
> 
> If you can skip the BPMP-FW and use ACTMON directly from kernel, then
> that's what I suggest to do.

After talking to a few people, it turns out that BPMP is already using
ACTMON internally to do the actual scaling of the EMC frequency (or the
CPU's contribution to that). So BPMP will use ACTMON counters to monitor
the effective memory load of the CPU and adjust the EMC frequency. The
bandwidth request that we generate from the cpufreq driver is more of a
guideline for the maximum bandwidth we might consume.

Thierry
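
To make the split of responsibilities Thierry describes concrete, here is a
simplified sketch of what the cpufreq driver in the patch below does on a
frequency change. The helper name is illustrative; the OPP calls are the
ones actually used in the patch. The kernel only selects an OPP whose
opp-peak-kBps value is forwarded over the interconnect path as an
upper-bound hint, while BPMP-FW still uses its internal ACTMON measurements
to choose the effective EMC frequency.

#include <linux/pm_opp.h>

/* Illustrative sketch of the bandwidth-hint path on a CPU frequency change */
static int cpu_dram_bw_hint(struct device *cpu_dev, unsigned long freq_hz)
{
	struct dev_pm_opp *opp;
	int ret;

	/* Pick the OPP that matches the new CPU frequency */
	opp = dev_pm_opp_find_freq_exact(cpu_dev, freq_hz, true);
	if (IS_ERR(opp))
		return PTR_ERR(opp);

	/*
	 * The OPP core applies this OPP's opp-peak-kBps value to the
	 * interconnect path found at probe time; BPMP-FW treats it as a
	 * guideline for the maximum bandwidth the CPU cluster may need.
	 */
	ret = dev_pm_opp_set_opp(cpu_dev, opp);
	dev_pm_opp_put(opp);

	return ret;
}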
  
Dmitry Osipenko Jan. 19, 2023, 1:01 p.m. UTC | #5
On 1/19/23 13:26, Thierry Reding wrote:
> On Mon, Jan 16, 2023 at 03:16:48PM +0300, Dmitry Osipenko wrote:
>> On 1/13/23 16:50, Sumit Gupta wrote:
>>>
>>>
>>> On 22/12/22 21:16, Dmitry Osipenko wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> On 20.12.2022 19:02, Sumit Gupta wrote:
>>>>> Add support to use OPP table from DT in Tegra194 cpufreq driver.
>>>>> Tegra SoC's receive the frequency lookup table (LUT) from BPMP-FW.
>>>>> Cross check the OPP's present in DT against the LUT from BPMP-FW
>>>>> and enable only those DT OPP's which are present in LUT also.
>>>>>
>>>>> The OPP table in DT has CPU Frequency to bandwidth mapping where
>>>>> the bandwidth value is per MC channel. DRAM bandwidth depends on the
>>>>> number of MC channels which can vary as per the boot configuration.
>>>>> This per channel bandwidth from OPP table will be later converted by
>>>>> MC driver to final bandwidth value by multiplying with number of
>>>>> channels before sending the request to BPMP-FW.
>>>>>
>>>>> If OPP table is not present in DT, then use the LUT from BPMP-FW directy
>>>>> as the frequency table and not do the DRAM frequency scaling which is
>>>>> same as the current behavior.
>>>>>
>>>>> Now, as the CPU Frequency table is being controlling through OPP table
>>>>> in DT. Keeping fewer entries in the table will create less frequency
>>>>> steps and scale fast to high frequencies if required.
>>>>
>>>> It's not exactly clear what you're doing here. Are you going to scale
>>>> memory BW based on CPU freq? If yes, then this is wrong because CPU freq
>>>> is independent from the memory subsystem.
>>>>
>>>> All Tegra30+ SoCs have ACTMON hardware unit that monitors CPU memory
>>>> activity and CPU memory BW should be scaled based on CPU memory events
>>>> counter. We have ACTMON devfreq driver for older SoCs. I have no clue
>>>> how ACTMON can be accessed on T186+, perhaps there should be a BPMP FW
>>>> API for that.
>>>>
>>>
>>> Yes, scaling the memory BW based on CPU freq.
>>> Referred below patch set for previous generation of Tegra Soc's which
>>> you mentioned and tried to trace the history.
>>>
>>> https://patchwork.ozlabs.org/project/linux-tegra/patch/1418719298-25314-3-git-send-email-tomeu.vizoso@collabora.com/
>>>
>>> In new Tegra Soc's, actmon counter control and usage has been moved to
>>> BPMP-FW where only 'MCALL' counter is used and 'MCCPU is not being used.
>>> Using the actmon counter was a reactive way to scale the frequency which
>>> is less effective due to averaging over a time period.
>>> We are now using the proactive way where clients tell their bandwidth
>>> needs to help achieve better performance.
>>
>> You don't know what bandwidth CPU needs, you trying to guess it.
>>
>> It should be a bad decision to use cpufreq for memory bandwidth scaling.
>> You'll be wasting memory power 90% of time because cpufreq doesn't have
>> relation to the DRAM, your heuristics will be wrong and won't do
>> anything good compared to using ACTMON. The L2 CPU cache + memory
>> prefetching hides memory from CPU. And cpufreq should be less reactive
>> than ACTMON in general.
>>
>> Scaling memory freq based on cpufreq is what downstream NV kernel did
>> 10+ years ago for the oldest Tegra generations. Today upstream has all
>> the necessary infrastructure for doing memory bandwidth scaling properly
>> and we even using h/w memory counters on T20. It's strange that you want
>> to bring the downstream archaity to the modern upstream for the latest
>> Tegra generations.
>>
>> If you can skip the BPMP-FW and use ACTMON directly from kernel, then
>> that's what I suggest to do.
> 
> After talking to a few people, it turns out that BPMP is already using
> ACTMON internally to do the actual scaling of the EMC frequency (or the
> CPUs contribution to that). So BPMP will use ACTMON counters to monitor
> the effective memory load of the CPU and adjust the EMC frequency. The
> bandwidth request that we generate from the cpufreq driver is more of a
> guideline for the maximum bandwidth we might consume.

Our kernel ACTMON driver uses cpufreq for guiding the EMC freq. Driving
the EMC rate solely based on cpufreq would not be a good decision. So
does this mean you're now going to extend the ACTMON driver with BPMP
support?
  
Sumit Gupta Feb. 6, 2023, 1:31 p.m. UTC | #6
On 19/01/23 18:31, Dmitry Osipenko wrote:
> External email: Use caution opening links or attachments
> 
> 
> On 1/19/23 13:26, Thierry Reding wrote:
>> On Mon, Jan 16, 2023 at 03:16:48PM +0300, Dmitry Osipenko wrote:
>>> On 1/13/23 16:50, Sumit Gupta wrote:
>>>>
>>>>
>>>> On 22/12/22 21:16, Dmitry Osipenko wrote:
>>>>> External email: Use caution opening links or attachments
>>>>>
>>>>>
>>>>> On 20.12.2022 19:02, Sumit Gupta wrote:
>>>>>> Add support to use OPP table from DT in Tegra194 cpufreq driver.
>>>>>> Tegra SoC's receive the frequency lookup table (LUT) from BPMP-FW.
>>>>>> Cross check the OPP's present in DT against the LUT from BPMP-FW
>>>>>> and enable only those DT OPP's which are present in LUT also.
>>>>>>
>>>>>> The OPP table in DT has CPU Frequency to bandwidth mapping where
>>>>>> the bandwidth value is per MC channel. DRAM bandwidth depends on the
>>>>>> number of MC channels which can vary as per the boot configuration.
>>>>>> This per channel bandwidth from OPP table will be later converted by
>>>>>> MC driver to final bandwidth value by multiplying with number of
>>>>>> channels before sending the request to BPMP-FW.
>>>>>>
>>>>>> If OPP table is not present in DT, then use the LUT from BPMP-FW directy
>>>>>> as the frequency table and not do the DRAM frequency scaling which is
>>>>>> same as the current behavior.
>>>>>>
>>>>>> Now, as the CPU Frequency table is being controlling through OPP table
>>>>>> in DT. Keeping fewer entries in the table will create less frequency
>>>>>> steps and scale fast to high frequencies if required.
>>>>>
>>>>> It's not exactly clear what you're doing here. Are you going to scale
>>>>> memory BW based on CPU freq? If yes, then this is wrong because CPU freq
>>>>> is independent from the memory subsystem.
>>>>>
>>>>> All Tegra30+ SoCs have ACTMON hardware unit that monitors CPU memory
>>>>> activity and CPU memory BW should be scaled based on CPU memory events
>>>>> counter. We have ACTMON devfreq driver for older SoCs. I have no clue
>>>>> how ACTMON can be accessed on T186+, perhaps there should be a BPMP FW
>>>>> API for that.
>>>>>
>>>>
>>>> Yes, scaling the memory BW based on CPU freq.
>>>> Referred below patch set for previous generation of Tegra Soc's which
>>>> you mentioned and tried to trace the history.
>>>>
>>>> https://patchwork.ozlabs.org/project/linux-tegra/patch/1418719298-25314-3-git-send-email-tomeu.vizoso@collabora.com/
>>>>
>>>> In new Tegra Soc's, actmon counter control and usage has been moved to
>>>> BPMP-FW where only 'MCALL' counter is used and 'MCCPU is not being used.
>>>> Using the actmon counter was a reactive way to scale the frequency which
>>>> is less effective due to averaging over a time period.
>>>> We are now using the proactive way where clients tell their bandwidth
>>>> needs to help achieve better performance.
>>>
>>> You don't know what bandwidth CPU needs, you trying to guess it.
>>>
>>> It should be a bad decision to use cpufreq for memory bandwidth scaling.
>>> You'll be wasting memory power 90% of time because cpufreq doesn't have
>>> relation to the DRAM, your heuristics will be wrong and won't do
>>> anything good compared to using ACTMON. The L2 CPU cache + memory
>>> prefetching hides memory from CPU. And cpufreq should be less reactive
>>> than ACTMON in general.
>>>
>>> Scaling memory freq based on cpufreq is what downstream NV kernel did
>>> 10+ years ago for the oldest Tegra generations. Today upstream has all
>>> the necessary infrastructure for doing memory bandwidth scaling properly
>>> and we even using h/w memory counters on T20. It's strange that you want
>>> to bring the downstream archaity to the modern upstream for the latest
>>> Tegra generations.
>>>
>>> If you can skip the BPMP-FW and use ACTMON directly from kernel, then
>>> that's what I suggest to do.
>>
>> After talking to a few people, it turns out that BPMP is already using
>> ACTMON internally to do the actual scaling of the EMC frequency (or the
>> CPUs contribution to that). So BPMP will use ACTMON counters to monitor
>> the effective memory load of the CPU and adjust the EMC frequency. The
>> bandwidth request that we generate from the cpufreq driver is more of a
>> guideline for the maximum bandwidth we might consume.
> 
> Our kernel ACTMON driver uses cpufreq for guiding the EMC freq. Driving
> EMC rate solely based on cpufreq would be a not good decision. So does
> it mean you're now going to extend the ACTMON driver with the BPMP support?
> 
> --
> Best regards,
> Dmitry
> 

We are using ACTMON in BPMP-FW now, and there is no plan to move it
back to the kernel for future SoCs, so the ACTMON kernel driver won't
be used on T234 and later SoCs.
The current patches, which scale the DRAM frequency with the CPU
frequency, apply to T234.
For future SoCs, we are planning a new solution in BPMP-FW which will
provide a better metric, in addition to ACTMON's, for scaling the DRAM
frequency for the CPU clusters.

Thanks,
Sumit
  

Patch

diff --git a/drivers/cpufreq/tegra194-cpufreq.c b/drivers/cpufreq/tegra194-cpufreq.c
index 4596c3e323aa..8155a9ea0815 100644
--- a/drivers/cpufreq/tegra194-cpufreq.c
+++ b/drivers/cpufreq/tegra194-cpufreq.c
@@ -12,6 +12,7 @@ 
 #include <linux/of_platform.h>
 #include <linux/platform_device.h>
 #include <linux/slab.h>
+#include <linux/units.h>
 
 #include <asm/smp_plat.h>
 
@@ -65,12 +66,32 @@  struct tegra_cpufreq_soc {
 
 struct tegra194_cpufreq_data {
 	void __iomem *regs;
-	struct cpufreq_frequency_table **tables;
+	struct cpufreq_frequency_table **bpmp_luts;
 	const struct tegra_cpufreq_soc *soc;
+	bool icc_dram_bw_scaling;
 };
 
 static struct workqueue_struct *read_counters_wq;
 
+static int tegra_cpufreq_set_bw(struct cpufreq_policy *policy, unsigned long freq_khz)
+{
+	struct dev_pm_opp *opp;
+	struct device *dev;
+	int ret;
+
+	dev = get_cpu_device(policy->cpu);
+	if (!dev)
+		return -ENODEV;
+
+	opp = dev_pm_opp_find_freq_exact(dev, freq_khz * KHZ, true);
+	if (IS_ERR(opp))
+		return PTR_ERR(opp);
+
+	ret = dev_pm_opp_set_opp(dev, opp);
+	dev_pm_opp_put(opp);
+	return ret;
+}
+
 static void tegra_get_cpu_mpidr(void *mpidr)
 {
 	*((u64 *)mpidr) = read_cpuid_mpidr() & MPIDR_HWID_BITMASK;
@@ -354,7 +375,7 @@  static unsigned int tegra194_get_speed(u32 cpu)
 	 * to the last written ndiv value from freq_table. This is
 	 * done to return consistent value.
 	 */
-	cpufreq_for_each_valid_entry(pos, data->tables[clusterid]) {
+	cpufreq_for_each_valid_entry(pos, data->bpmp_luts[clusterid]) {
 		if (pos->driver_data != ndiv)
 			continue;
 
@@ -369,16 +390,93 @@  static unsigned int tegra194_get_speed(u32 cpu)
 	return rate;
 }
 
+int tegra_cpufreq_init_cpufreq_table(struct cpufreq_policy *policy,
+				     struct cpufreq_frequency_table *bpmp_lut,
+				     struct cpufreq_frequency_table **opp_table)
+{
+	struct tegra194_cpufreq_data *data = cpufreq_get_driver_data();
+	struct cpufreq_frequency_table *freq_table = NULL;
+	struct cpufreq_frequency_table *pos;
+	int ret = 0, max_opps = 0;
+	struct device *cpu_dev;
+	struct dev_pm_opp *opp;
+	unsigned long rate;
+	int j = 0;
+
+	cpu_dev = get_cpu_device(policy->cpu);
+	if (!cpu_dev) {
+		pr_err("%s: failed to get cpu%d device\n", __func__, policy->cpu);
+		return -ENODEV;
+	}
+
+	/* Initialize OPP table mentioned in operating-points-v2 property in DT */
+	ret = dev_pm_opp_of_add_table_indexed(cpu_dev, 0);
+	if (!ret) {
+		max_opps = dev_pm_opp_get_opp_count(cpu_dev);
+		if (max_opps <= 0) {
+			dev_err(cpu_dev, "Failed to add OPPs\n");
+			return max_opps;
+		}
+
+		/* Disable all opps and cross-validate against LUT later */
+		for (rate = 0; ; rate++) {
+			opp = dev_pm_opp_find_freq_ceil(cpu_dev, &rate);
+			if (IS_ERR(opp))
+				break;
+
+			dev_pm_opp_put(opp);
+			dev_pm_opp_disable(cpu_dev, rate);
+		}
+	} else {
+		dev_err(cpu_dev, "Invalid or empty opp table in device tree\n");
+		data->icc_dram_bw_scaling = false;
+		return ret;
+	}
+
+	freq_table = kcalloc((max_opps + 1), sizeof(*freq_table), GFP_KERNEL);
+	if (!freq_table)
+		return -ENOMEM;
+
+	/*
+	 * Cross check the frequencies from BPMP-FW LUT against the OPP's present in DT.
+	 * Enable only those DT OPP's which are present in LUT also.
+	 */
+	cpufreq_for_each_valid_entry(pos, bpmp_lut) {
+		opp = dev_pm_opp_find_freq_exact(cpu_dev, pos->frequency * KHZ, false);
+		if (IS_ERR(opp))
+			continue;
+
+		ret = dev_pm_opp_enable(cpu_dev, pos->frequency * KHZ);
+		if (ret < 0)
+			return ret;
+
+		freq_table[j].driver_data = pos->driver_data;
+		freq_table[j].frequency = pos->frequency;
+		j++;
+	}
+
+	freq_table[j].driver_data = pos->driver_data;
+	freq_table[j].frequency = CPUFREQ_TABLE_END;
+
+	*opp_table = &freq_table[0];
+
+	dev_pm_opp_set_sharing_cpus(cpu_dev, policy->cpus);
+
+	return ret;
+}
+
 static int tegra194_cpufreq_init(struct cpufreq_policy *policy)
 {
 	struct tegra194_cpufreq_data *data = cpufreq_get_driver_data();
 	int maxcpus_per_cluster = data->soc->maxcpus_per_cluster;
+	struct cpufreq_frequency_table *freq_table;
+	struct cpufreq_frequency_table *bpmp_lut;
 	u32 start_cpu, cpu;
 	u32 clusterid;
+	int ret;
 
 	data->soc->ops->get_cpu_cluster_id(policy->cpu, NULL, &clusterid);
-
-	if (clusterid >= data->soc->num_clusters || !data->tables[clusterid])
+	if (clusterid >= data->soc->num_clusters || !data->bpmp_luts[clusterid])
 		return -EINVAL;
 
 	start_cpu = rounddown(policy->cpu, maxcpus_per_cluster);
@@ -387,9 +485,22 @@  static int tegra194_cpufreq_init(struct cpufreq_policy *policy)
 		if (cpu_possible(cpu))
 			cpumask_set_cpu(cpu, policy->cpus);
 	}
-	policy->freq_table = data->tables[clusterid];
 	policy->cpuinfo.transition_latency = TEGRA_CPUFREQ_TRANSITION_LATENCY;
 
+	bpmp_lut = data->bpmp_luts[clusterid];
+
+	if (data->icc_dram_bw_scaling) {
+		ret = tegra_cpufreq_init_cpufreq_table(policy, bpmp_lut, &freq_table);
+		if (!ret) {
+			policy->freq_table = freq_table;
+			return 0;
+		}
+	}
+
+	data->icc_dram_bw_scaling = false;
+	policy->freq_table = bpmp_lut;
+	pr_info("OPP tables missing from DT, EMC frequency scaling disabled\n");
+
 	return 0;
 }
 
@@ -406,6 +517,9 @@  static int tegra194_cpufreq_set_target(struct cpufreq_policy *policy,
 	 */
 	data->soc->ops->set_cpu_ndiv(policy, (u64)tbl->driver_data);
 
+	if (data->icc_dram_bw_scaling)
+		tegra_cpufreq_set_bw(policy, tbl->frequency);
+
 	return 0;
 }
 
@@ -438,8 +552,8 @@  static void tegra194_cpufreq_free_resources(void)
 }
 
 static struct cpufreq_frequency_table *
-init_freq_table(struct platform_device *pdev, struct tegra_bpmp *bpmp,
-		unsigned int cluster_id)
+tegra_cpufreq_bpmp_read_lut(struct platform_device *pdev, struct tegra_bpmp *bpmp,
+			    unsigned int cluster_id)
 {
 	struct cpufreq_frequency_table *freq_table;
 	struct mrq_cpu_ndiv_limits_response resp;
@@ -514,6 +628,7 @@  static int tegra194_cpufreq_probe(struct platform_device *pdev)
 	const struct tegra_cpufreq_soc *soc;
 	struct tegra194_cpufreq_data *data;
 	struct tegra_bpmp *bpmp;
+	struct device *cpu_dev;
 	int err, i;
 
 	data = devm_kzalloc(&pdev->dev, sizeof(*data), GFP_KERNEL);
@@ -529,9 +644,9 @@  static int tegra194_cpufreq_probe(struct platform_device *pdev)
 		return -EINVAL;
 	}
 
-	data->tables = devm_kcalloc(&pdev->dev, data->soc->num_clusters,
-				    sizeof(*data->tables), GFP_KERNEL);
-	if (!data->tables)
+	data->bpmp_luts = devm_kcalloc(&pdev->dev, data->soc->num_clusters,
+				       sizeof(*data->bpmp_luts), GFP_KERNEL);
+	if (!data->bpmp_luts)
 		return -ENOMEM;
 
 	if (soc->actmon_cntr_base) {
@@ -555,15 +670,26 @@  static int tegra194_cpufreq_probe(struct platform_device *pdev)
 	}
 
 	for (i = 0; i < data->soc->num_clusters; i++) {
-		data->tables[i] = init_freq_table(pdev, bpmp, i);
-		if (IS_ERR(data->tables[i])) {
-			err = PTR_ERR(data->tables[i]);
+		data->bpmp_luts[i] = tegra_cpufreq_bpmp_read_lut(pdev, bpmp, i);
+		if (IS_ERR(data->bpmp_luts[i])) {
+			err = PTR_ERR(data->bpmp_luts[i]);
 			goto err_free_res;
 		}
 	}
 
 	tegra194_cpufreq_driver.driver_data = data;
 
+	/* Check for optional OPPv2 and interconnect paths on CPU0 to enable ICC scaling */
+	cpu_dev = get_cpu_device(0);
+	if (!cpu_dev)
+		return -EPROBE_DEFER;
+
+	if (dev_pm_opp_of_get_opp_desc_node(cpu_dev)) {
+		err = dev_pm_opp_of_find_icc_paths(cpu_dev, NULL);
+		if (!err)
+			data->icc_dram_bw_scaling = true;
+	}
+
 	err = cpufreq_register_driver(&tegra194_cpufreq_driver);
 	if (!err)
 		goto put_bpmp;