[2/4] arm64: dts: rockchip: enable built-in thermal monitoring on rk3588

Message ID 20240125-rk-dts-additions-v1-2-5879275db36f@gmail.com
State New
Headers
Series [1/4] arm64: dts: rockchip: add rfkill node for M.2 Key E WiFi on rock-5b |

Commit Message

Alexey Charkov Jan. 24, 2024, 8:30 p.m. UTC
  Include thermal zones information in device tree for rk3588 variants

Signed-off-by: Alexey Charkov <alchark@gmail.com>
---
 arch/arm64/boot/dts/rockchip/rk3588s.dtsi | 165 ++++++++++++++++++++++++++++++
 1 file changed, 165 insertions(+)
  

Comments

Daniel Lezcano Jan. 24, 2024, 9:56 p.m. UTC | #1
On 24/01/2024 21:30, Alexey Charkov wrote:
> Include thermal zones information in device tree for rk3588 variants

There is an energy model for the CPUs. But finding out the sustainable 
power may be a bit tricky. So I suggest to remove everything related to 
the power allocator in this change and propose a dedicated change with 
all the power configuration (which includes proper k_p* coefficients to 
be set from userspace to have a flat mitigation figure).

That implies removing the "contribution" properties in this description.

Some comments below but definitively this version is close to be ok.


> Signed-off-by: Alexey Charkov <alchark@gmail.com>
> ---
>   arch/arm64/boot/dts/rockchip/rk3588s.dtsi | 165 ++++++++++++++++++++++++++++++
>   1 file changed, 165 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> index 36b1b7acfe6a..131b9eb21398 100644
> --- a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> +++ b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> @@ -10,6 +10,7 @@
>   #include <dt-bindings/reset/rockchip,rk3588-cru.h>
>   #include <dt-bindings/phy/phy.h>
>   #include <dt-bindings/ata/ahci.h>
> +#include <dt-bindings/thermal/thermal.h>
>   
>   / {
>   	compatible = "rockchip,rk3588";
> @@ -2228,6 +2229,170 @@ tsadc: tsadc@fec00000 {
>   		status = "disabled";
>   	};
>   
> +	thermal_zones: thermal-zones {
> +		/* sensor near the center of the whole chip */
> +		package_thermal: package-thermal {
> +			polling-delay-passive = <0>;
> +			polling-delay = <0>;
> +			thermal-sensors = <&tsadc 0>;
> +
> +			trips {
> +				package_crit: package-crit {
> +					temperature = <115000>;
> +					hysteresis = <0>;
> +					type = "critical";
> +				};
> +			};
> +		};
> +
> +		/* sensor between A76 cores 0 and 1 */
> +		bigcore0_thermal: bigcore0-thermal {
> +			polling-delay-passive = <20>;

20ms seems very short, is this value on purpose? Or just picked up 
arbitrarily?

If it is possible, perhaps you should profile the temperature of these 
thermal zones (CPUs ones). There is a tool in 
<linuxdir>/tools/thermal/thermometer to do that.

You can measure with 10ms sampling rate when running for instance 
dhrystone pinned on b0 and b1, then on b2 and b3. And finally on the 
small cluster.

But if you don't have spare time and 20 is ok for you. Then it is fine 
for me too.

Some nits below.

> +			polling-delay = <0>;
> +			thermal-sensors = <&tsadc 1>;
> +
> +			trips {
> +				bigcore0_alert0: bigcore0-alert0 {
> +					temperature = <75000>;
> +					hysteresis = <2000>;
> +					type = "passive";
> +				};
> +				bigcore0_alert1: bigcore0-alert1 {
> +					temperature = <85000>;
> +					hysteresis = <2000>;
> +					type = "passive";
> +				};
> +				bigcore0_crit: bigcore0-crit {
> +					temperature = <115000>;
> +					hysteresis = <0>;
> +					type = "critical";
> +				};
> +			};
> +			cooling-maps {
> +				map0 {
> +					trip = <&bigcore0_alert1>;
> +					cooling-device =
> +						<&cpu_b0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
> +						<&cpu_b1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
> +					contribution = <1024>;
> +				};
> +			};
> +		};
> +
> +		/* sensor between A76 cores 2 and 3 */
> +		bigcore2_thermal: bigcore2-thermal {
> +			polling-delay-passive = <20>;
> +			polling-delay = <0>;
> +			thermal-sensors = <&tsadc 2>;
> +
> +			trips {
> +				bigcore2_alert0: bigcore2-alert0 {
> +					temperature = <75000>;
> +					hysteresis = <2000>;
> +					type = "passive";
> +				};
> +				bigcore2_alert1: bigcore2-alert1 {
> +					temperature = <85000>;
> +					hysteresis = <2000>;
> +					type = "passive";
> +				};
> +				bigcore2_crit: bigcore2-crit {
> +					temperature = <115000>;
> +					hysteresis = <0>;
> +					type = "critical";
> +				};
> +			};
> +			cooling-maps {
> +				map1 {

s/map1/mpa0/

> +					trip = <&bigcore2_alert1>;
> +					cooling-device =
> +						<&cpu_b2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
> +						<&cpu_b3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
> +					contribution = <1024>;
> +				};
> +			};
> +		};
> +
> +		/* sensor between the four A55 cores */
> +		little_core_thermal: littlecore-thermal {
> +			polling-delay-passive = <20>;
> +			polling-delay = <0>;
> +			thermal-sensors = <&tsadc 3>;
> +
> +			trips {
> +				littlecore_alert0: littlecore-alert0 {
> +					temperature = <75000>;
> +					hysteresis = <2000>;
> +					type = "passive";
> +				};
> +				littlecore_alert1: littlecore-alert1 {
> +					temperature = <85000>;
> +					hysteresis = <2000>;
> +					type = "passive";
> +				};
> +				littlecore_crit: littlecore-crit {
> +					temperature = <115000>;
> +					hysteresis = <0>;
> +					type = "critical";
> +				};
> +			};
> +			cooling-maps {
> +				map2 {

s/map2/map0/

> +					trip = <&littlecore_alert1>;
> +					cooling-device =
> +						<&cpu_l0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
> +						<&cpu_l1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
> +						<&cpu_l2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
> +						<&cpu_l3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
> +					contribution = <1024>;
> +				};
> +			};
> +		};
> +
> +		/* sensor near the PD_CENTER power domain */
> +		center_thermal: center-thermal {
> +			polling-delay-passive = <0>;
> +			polling-delay = <0>;
> +			thermal-sensors = <&tsadc 4>;
> +
> +			trips {
> +				center_crit: center-crit {
> +					temperature = <115000>;
> +					hysteresis = <0>;
> +					type = "critical";
> +				};
> +			};
> +		};
> +
> +		gpu_thermal: gpu-thermal {
> +			polling-delay-passive = <0>;
> +			polling-delay = <0>;
> +			thermal-sensors = <&tsadc 5>;
> +
> +			trips {
> +				gpu_crit: gpu-crit {
> +					temperature = <115000>;
> +					hysteresis = <0>;
> +					type = "critical";
> +				};
> +			};
> +		};
> +
> +		npu_thermal: npu-thermal {
> +			polling-delay-passive = <0>;
> +			polling-delay = <0>;
> +			thermal-sensors = <&tsadc 6>;
> +
> +			trips {
> +				npu_crit: npu-crit {
> +					temperature = <115000>;
> +					hysteresis = <0>;
> +					type = "critical";
> +				};
> +			};
> +		};
> +	};
> +
>   	saradc: adc@fec10000 {
>   		compatible = "rockchip,rk3588-saradc";
>   		reg = <0x0 0xfec10000 0x0 0x10000>;
>
  
Alexey Charkov Jan. 25, 2024, 8:26 a.m. UTC | #2
On Thu, Jan 25, 2024 at 1:56 AM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
> On 24/01/2024 21:30, Alexey Charkov wrote:
> > Include thermal zones information in device tree for rk3588 variants
>
> There is an energy model for the CPUs. But finding out the sustainable
> power may be a bit tricky. So I suggest to remove everything related to
> the power allocator in this change and propose a dedicated change with
> all the power configuration (which includes proper k_p* coefficients to
> be set from userspace to have a flat mitigation figure).
>
> That implies removing the "contribution" properties in this description.

Alright, I'll just drop those "contribution" properties, thanks!

> Some comments below but definitively this version is close to be ok.

Yay! :)

> > Signed-off-by: Alexey Charkov <alchark@gmail.com>
> > ---
> >   arch/arm64/boot/dts/rockchip/rk3588s.dtsi | 165 ++++++++++++++++++++++++++++++
> >   1 file changed, 165 insertions(+)
> >
> > diff --git a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> > index 36b1b7acfe6a..131b9eb21398 100644
> > --- a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> > +++ b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> > @@ -10,6 +10,7 @@
> >   #include <dt-bindings/reset/rockchip,rk3588-cru.h>
> >   #include <dt-bindings/phy/phy.h>
> >   #include <dt-bindings/ata/ahci.h>
> > +#include <dt-bindings/thermal/thermal.h>
> >
> >   / {
> >       compatible = "rockchip,rk3588";
> > @@ -2228,6 +2229,170 @@ tsadc: tsadc@fec00000 {
> >               status = "disabled";
> >       };
> >
> > +     thermal_zones: thermal-zones {
> > +             /* sensor near the center of the whole chip */
> > +             package_thermal: package-thermal {
> > +                     polling-delay-passive = <0>;
> > +                     polling-delay = <0>;
> > +                     thermal-sensors = <&tsadc 0>;
> > +
> > +                     trips {
> > +                             package_crit: package-crit {
> > +                                     temperature = <115000>;
> > +                                     hysteresis = <0>;
> > +                                     type = "critical";
> > +                             };
> > +                     };
> > +             };
> > +
> > +             /* sensor between A76 cores 0 and 1 */
> > +             bigcore0_thermal: bigcore0-thermal {
> > +                     polling-delay-passive = <20>;
>
> 20ms seems very short, is this value on purpose? Or just picked up
> arbitrarily?

Frankly, I simply used the value that Radxa's downstream DTS sets for
my board. 100ms seem to work just as well.

> If it is possible, perhaps you should profile the temperature of these
> thermal zones (CPUs ones). There is a tool in
> <linuxdir>/tools/thermal/thermometer to do that.
>
> You can measure with 10ms sampling rate when running for instance
> dhrystone pinned on b0 and b1, then on b2 and b3. And finally on the
> small cluster.

It seems tricky to isolate the effects from just one of the CPU
clusters, as their individual thermal outputs are not that high.

For my testing I disabled the fan (but didn't remove the heatsink to
avoid wasting the thermal interface tape), and tried loading CPUs with
stress-ng. Here are the observations:
 - Little cores alone stressed with 4 threads pegged to them with
taskset never reach the throttling temperature (85C), and balance out
around 60C
 - Either of the big core clusters stressed individually with 2
threads pegged to them with taskset never reach the throttling
temperature either
 - Four big cores with 4 threads heat up very slowly (>30 minutes to
reach throttling temperature, I didn't have enough patience to let
them actually reach it - maybe they never do)
 - Eight cores with 8 threads heat up to the throttling temperature
within ~5 minutes (again, with the fan off), and then, as soon as just
one of the big core clusters gets throttled, the temperature of all
cores balances out just below the throttling threshold. In my
observation cores 6,7 go from 2.4GHz down to 1.8GHz while the rest
stay at their respective top performance states (2.4GHz for big cores
4,5 and 1.8GHz for little cores 0-3)

Adding to it the fact that the temperature measurement resolution is
not very granular (almost 1C) it's somewhat difficult to estimate how
fast throttling action on a single cluster really brings its
temperature within bounds, as they all affect each other at relevant
temperature-load combinations. Perhaps it means that too granular
polling doesn't add much value.

> But if you don't have spare time and 20 is ok for you. Then it is fine
> for me too.

I guess I'll go for 100 as other upstream Rockchip .dtsi's do, given
all of the above. Thanks for pointing this out!

> Some nits below.
>
> > +                     polling-delay = <0>;
> > +                     thermal-sensors = <&tsadc 1>;
> > +
> > +                     trips {
> > +                             bigcore0_alert0: bigcore0-alert0 {
> > +                                     temperature = <75000>;
> > +                                     hysteresis = <2000>;
> > +                                     type = "passive";
> > +                             };
> > +                             bigcore0_alert1: bigcore0-alert1 {
> > +                                     temperature = <85000>;
> > +                                     hysteresis = <2000>;
> > +                                     type = "passive";
> > +                             };
> > +                             bigcore0_crit: bigcore0-crit {
> > +                                     temperature = <115000>;
> > +                                     hysteresis = <0>;
> > +                                     type = "critical";
> > +                             };
> > +                     };
> > +                     cooling-maps {
> > +                             map0 {
> > +                                     trip = <&bigcore0_alert1>;
> > +                                     cooling-device =
> > +                                             <&cpu_b0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
> > +                                             <&cpu_b1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
> > +                                     contribution = <1024>;
> > +                             };
> > +                     };
> > +             };
> > +
> > +             /* sensor between A76 cores 2 and 3 */
> > +             bigcore2_thermal: bigcore2-thermal {
> > +                     polling-delay-passive = <20>;
> > +                     polling-delay = <0>;
> > +                     thermal-sensors = <&tsadc 2>;
> > +
> > +                     trips {
> > +                             bigcore2_alert0: bigcore2-alert0 {
> > +                                     temperature = <75000>;
> > +                                     hysteresis = <2000>;
> > +                                     type = "passive";
> > +                             };
> > +                             bigcore2_alert1: bigcore2-alert1 {
> > +                                     temperature = <85000>;
> > +                                     hysteresis = <2000>;
> > +                                     type = "passive";
> > +                             };
> > +                             bigcore2_crit: bigcore2-crit {
> > +                                     temperature = <115000>;
> > +                                     hysteresis = <0>;
> > +                                     type = "critical";
> > +                             };
> > +                     };
> > +                     cooling-maps {
> > +                             map1 {
>
> s/map1/mpa0/

Noted, thanks!

> > +                                     trip = <&bigcore2_alert1>;
> > +                                     cooling-device =
> > +                                             <&cpu_b2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
> > +                                             <&cpu_b3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
> > +                                     contribution = <1024>;
> > +                             };
> > +                     };
> > +             };
> > +
> > +             /* sensor between the four A55 cores */
> > +             little_core_thermal: littlecore-thermal {
> > +                     polling-delay-passive = <20>;
> > +                     polling-delay = <0>;
> > +                     thermal-sensors = <&tsadc 3>;
> > +
> > +                     trips {
> > +                             littlecore_alert0: littlecore-alert0 {
> > +                                     temperature = <75000>;
> > +                                     hysteresis = <2000>;
> > +                                     type = "passive";
> > +                             };
> > +                             littlecore_alert1: littlecore-alert1 {
> > +                                     temperature = <85000>;
> > +                                     hysteresis = <2000>;
> > +                                     type = "passive";
> > +                             };
> > +                             littlecore_crit: littlecore-crit {
> > +                                     temperature = <115000>;
> > +                                     hysteresis = <0>;
> > +                                     type = "critical";
> > +                             };
> > +                     };
> > +                     cooling-maps {
> > +                             map2 {
>
> s/map2/map0/

Noted, thanks!

Best regards,
Alexey
  
Daniel Lezcano Jan. 25, 2024, 10:02 a.m. UTC | #3
On 25/01/2024 09:26, Alexey Charkov wrote:
> On Thu, Jan 25, 2024 at 1:56 AM Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
>>
>> On 24/01/2024 21:30, Alexey Charkov wrote:
>>> Include thermal zones information in device tree for rk3588 variants
>>
>> There is an energy model for the CPUs. But finding out the sustainable
>> power may be a bit tricky. So I suggest to remove everything related to
>> the power allocator in this change and propose a dedicated change with
>> all the power configuration (which includes proper k_p* coefficients to
>> be set from userspace to have a flat mitigation figure).
>>
>> That implies removing the "contribution" properties in this description.
> 
> Alright, I'll just drop those "contribution" properties, thanks!
> 
>> Some comments below but definitively this version is close to be ok.
> 
> Yay! :)
> 
>>> Signed-off-by: Alexey Charkov <alchark@gmail.com>
>>> ---
>>>    arch/arm64/boot/dts/rockchip/rk3588s.dtsi | 165 ++++++++++++++++++++++++++++++
>>>    1 file changed, 165 insertions(+)
>>>
>>> diff --git a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
>>> index 36b1b7acfe6a..131b9eb21398 100644
>>> --- a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
>>> +++ b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
>>> @@ -10,6 +10,7 @@
>>>    #include <dt-bindings/reset/rockchip,rk3588-cru.h>
>>>    #include <dt-bindings/phy/phy.h>
>>>    #include <dt-bindings/ata/ahci.h>
>>> +#include <dt-bindings/thermal/thermal.h>
>>>
>>>    / {
>>>        compatible = "rockchip,rk3588";
>>> @@ -2228,6 +2229,170 @@ tsadc: tsadc@fec00000 {
>>>                status = "disabled";
>>>        };
>>>
>>> +     thermal_zones: thermal-zones {
>>> +             /* sensor near the center of the whole chip */
>>> +             package_thermal: package-thermal {
>>> +                     polling-delay-passive = <0>;
>>> +                     polling-delay = <0>;
>>> +                     thermal-sensors = <&tsadc 0>;
>>> +
>>> +                     trips {
>>> +                             package_crit: package-crit {
>>> +                                     temperature = <115000>;
>>> +                                     hysteresis = <0>;
>>> +                                     type = "critical";
>>> +                             };
>>> +                     };
>>> +             };
>>> +
>>> +             /* sensor between A76 cores 0 and 1 */
>>> +             bigcore0_thermal: bigcore0-thermal {
>>> +                     polling-delay-passive = <20>;
>>
>> 20ms seems very short, is this value on purpose? Or just picked up
>> arbitrarily?
> 
> Frankly, I simply used the value that Radxa's downstream DTS sets for
> my board. 100ms seem to work just as well.
> 
>> If it is possible, perhaps you should profile the temperature of these
>> thermal zones (CPUs ones). There is a tool in
>> <linuxdir>/tools/thermal/thermometer to do that.
>>
>> You can measure with 10ms sampling rate when running for instance
>> dhrystone pinned on b0 and b1, then on b2 and b3. And finally on the
>> small cluster.
> 
> It seems tricky to isolate the effects from just one of the CPU
> clusters, as their individual thermal outputs are not that high.
> 
> For my testing I disabled the fan (but didn't remove the heatsink to
> avoid wasting the thermal interface tape),

It is ok but the system will have more heat capacity and it will be 
necessary to saturate it before running the tests. IOW warm up the 
system by running thermal stress tests several times.

> and tried loading CPUs with
> stress-ng. Here are the observations:

Usually I use drhystone to thermal stress the cores (e. one minute).

>   - Little cores alone stressed with 4 threads pegged to them with
> taskset never reach the throttling temperature (85C), and balance out
> around 60C
>   - Either of the big core clusters stressed individually with 2
> threads pegged to them with taskset never reach the throttling
> temperature either

Not sure what does stress-ng but you may want to test with dhrystone. 
I'm pretty sure it will make the expected thermal stress.

On the rk3388, the temperature on the big cores raises to the threshold 
in a few seconds (after a warmup).

>   - Four big cores with 4 threads heat up very slowly (>30 minutes to
> reach throttling temperature, I didn't have enough patience to let
> them actually reach it - maybe they never do)
>   - Eight cores with 8 threads heat up to the throttling temperature
> within ~5 minutes (again, with the fan off), and then, as soon as just
> one of the big core clusters gets throttled, the temperature of all
> cores balances out just below the throttling threshold. In my
> observation cores 6,7 go from 2.4GHz down to 1.8GHz while the rest
> stay at their respective top performance states (2.4GHz for big cores
> 4,5 and 1.8GHz for little cores 0-3)

Yeah, definitively it is probable stress-ng is not the right tool for 
that. I gave a try and it does not thermal stress my rk3388 board.

Just try "dhrystone -t 2 -l 10000" pinned on a big cluster


> Adding to it the fact that the temperature measurement resolution is
> not very granular (almost 1C) it's somewhat difficult to estimate how
> fast throttling action on a single cluster really brings its
> temperature within bounds, as they all affect each other at relevant
> temperature-load combinations. Perhaps it means that too granular
> polling doesn't add much value.
> 
>> But if you don't have spare time and 20 is ok for you. Then it is fine
>> for me too.
> 
> I guess I'll go for 100 as other upstream Rockchip .dtsi's do, given
> all of the above. Thanks for pointing this out!
> 
>> Some nits below.
>>
>>> +                     polling-delay = <0>;
>>> +                     thermal-sensors = <&tsadc 1>;
>>> +
>>> +                     trips {
>>> +                             bigcore0_alert0: bigcore0-alert0 {
>>> +                                     temperature = <75000>;
>>> +                                     hysteresis = <2000>;
>>> +                                     type = "passive";
>>> +                             };
>>> +                             bigcore0_alert1: bigcore0-alert1 {
>>> +                                     temperature = <85000>;
>>> +                                     hysteresis = <2000>;
>>> +                                     type = "passive";
>>> +                             };
>>> +                             bigcore0_crit: bigcore0-crit {
>>> +                                     temperature = <115000>;
>>> +                                     hysteresis = <0>;
>>> +                                     type = "critical";
>>> +                             };
>>> +                     };
>>> +                     cooling-maps {
>>> +                             map0 {
>>> +                                     trip = <&bigcore0_alert1>;
>>> +                                     cooling-device =
>>> +                                             <&cpu_b0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
>>> +                                             <&cpu_b1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
>>> +                                     contribution = <1024>;
>>> +                             };
>>> +                     };
>>> +             };
>>> +
>>> +             /* sensor between A76 cores 2 and 3 */
>>> +             bigcore2_thermal: bigcore2-thermal {
>>> +                     polling-delay-passive = <20>;
>>> +                     polling-delay = <0>;
>>> +                     thermal-sensors = <&tsadc 2>;
>>> +
>>> +                     trips {
>>> +                             bigcore2_alert0: bigcore2-alert0 {
>>> +                                     temperature = <75000>;
>>> +                                     hysteresis = <2000>;
>>> +                                     type = "passive";
>>> +                             };
>>> +                             bigcore2_alert1: bigcore2-alert1 {
>>> +                                     temperature = <85000>;
>>> +                                     hysteresis = <2000>;
>>> +                                     type = "passive";
>>> +                             };
>>> +                             bigcore2_crit: bigcore2-crit {
>>> +                                     temperature = <115000>;
>>> +                                     hysteresis = <0>;
>>> +                                     type = "critical";
>>> +                             };
>>> +                     };
>>> +                     cooling-maps {
>>> +                             map1 {
>>
>> s/map1/mpa0/
> 
> Noted, thanks!
> 
>>> +                                     trip = <&bigcore2_alert1>;
>>> +                                     cooling-device =
>>> +                                             <&cpu_b2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
>>> +                                             <&cpu_b3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
>>> +                                     contribution = <1024>;
>>> +                             };
>>> +                     };
>>> +             };
>>> +
>>> +             /* sensor between the four A55 cores */
>>> +             little_core_thermal: littlecore-thermal {
>>> +                     polling-delay-passive = <20>;
>>> +                     polling-delay = <0>;
>>> +                     thermal-sensors = <&tsadc 3>;
>>> +
>>> +                     trips {
>>> +                             littlecore_alert0: littlecore-alert0 {
>>> +                                     temperature = <75000>;
>>> +                                     hysteresis = <2000>;
>>> +                                     type = "passive";
>>> +                             };
>>> +                             littlecore_alert1: littlecore-alert1 {
>>> +                                     temperature = <85000>;
>>> +                                     hysteresis = <2000>;
>>> +                                     type = "passive";
>>> +                             };
>>> +                             littlecore_crit: littlecore-crit {
>>> +                                     temperature = <115000>;
>>> +                                     hysteresis = <0>;
>>> +                                     type = "critical";
>>> +                             };
>>> +                     };
>>> +                     cooling-maps {
>>> +                             map2 {
>>
>> s/map2/map0/
> 
> Noted, thanks!
> 
> Best regards,
> Alexey
  
Diederik de Haas Jan. 25, 2024, 1:34 p.m. UTC | #4
On Wednesday, 24 January 2024 22:56:21 CET Daniel Lezcano wrote:
> > +                       cooling-maps {
> > +                               map1 {
> 
> s/map1/mpa0/

FTR: that should be ``s/map1/map0/``
(verified with Daniel Lezcano)
  
Alexey Charkov Jan. 25, 2024, 2:46 p.m. UTC | #5
On Thu, Jan 25, 2024 at 2:02 PM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
> On 25/01/2024 09:26, Alexey Charkov wrote:
> > On Thu, Jan 25, 2024 at 1:56 AM Daniel Lezcano
> > <daniel.lezcano@linaro.org> wrote:
> >>
> >> On 24/01/2024 21:30, Alexey Charkov wrote:
> >>> Include thermal zones information in device tree for rk3588 variants
> >>
> >> There is an energy model for the CPUs. But finding out the sustainable
> >> power may be a bit tricky. So I suggest to remove everything related to
> >> the power allocator in this change and propose a dedicated change with
> >> all the power configuration (which includes proper k_p* coefficients to
> >> be set from userspace to have a flat mitigation figure).
> >>
> >> That implies removing the "contribution" properties in this description.
> >
> > Alright, I'll just drop those "contribution" properties, thanks!
> >
> >> Some comments below but definitively this version is close to be ok.
> >
> > Yay! :)
> >
> >>> Signed-off-by: Alexey Charkov <alchark@gmail.com>
> >>> ---
> >>>    arch/arm64/boot/dts/rockchip/rk3588s.dtsi | 165 ++++++++++++++++++++++++++++++
> >>>    1 file changed, 165 insertions(+)
> >>>
> >>> diff --git a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> >>> index 36b1b7acfe6a..131b9eb21398 100644
> >>> --- a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> >>> +++ b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
> >>> @@ -10,6 +10,7 @@
> >>>    #include <dt-bindings/reset/rockchip,rk3588-cru.h>
> >>>    #include <dt-bindings/phy/phy.h>
> >>>    #include <dt-bindings/ata/ahci.h>
> >>> +#include <dt-bindings/thermal/thermal.h>
> >>>
> >>>    / {
> >>>        compatible = "rockchip,rk3588";
> >>> @@ -2228,6 +2229,170 @@ tsadc: tsadc@fec00000 {
> >>>                status = "disabled";
> >>>        };
> >>>
> >>> +     thermal_zones: thermal-zones {
> >>> +             /* sensor near the center of the whole chip */
> >>> +             package_thermal: package-thermal {
> >>> +                     polling-delay-passive = <0>;
> >>> +                     polling-delay = <0>;
> >>> +                     thermal-sensors = <&tsadc 0>;
> >>> +
> >>> +                     trips {
> >>> +                             package_crit: package-crit {
> >>> +                                     temperature = <115000>;
> >>> +                                     hysteresis = <0>;
> >>> +                                     type = "critical";
> >>> +                             };
> >>> +                     };
> >>> +             };
> >>> +
> >>> +             /* sensor between A76 cores 0 and 1 */
> >>> +             bigcore0_thermal: bigcore0-thermal {
> >>> +                     polling-delay-passive = <20>;
> >>
> >> 20ms seems very short, is this value on purpose? Or just picked up
> >> arbitrarily?
> >
> > Frankly, I simply used the value that Radxa's downstream DTS sets for
> > my board. 100ms seem to work just as well.
> >
> >> If it is possible, perhaps you should profile the temperature of these
> >> thermal zones (CPUs ones). There is a tool in
> >> <linuxdir>/tools/thermal/thermometer to do that.
> >>
> >> You can measure with 10ms sampling rate when running for instance
> >> dhrystone pinned on b0 and b1, then on b2 and b3. And finally on the
> >> small cluster.
> >
> > It seems tricky to isolate the effects from just one of the CPU
> > clusters, as their individual thermal outputs are not that high.
> >
> > For my testing I disabled the fan (but didn't remove the heatsink to
> > avoid wasting the thermal interface tape),
>
> It is ok but the system will have more heat capacity and it will be
> necessary to saturate it before running the tests. IOW warm up the
> system by running thermal stress tests several times.
>
> > and tried loading CPUs with
> > stress-ng. Here are the observations:
>
> Usually I use drhystone to thermal stress the cores (e. one minute).

Hmm, could you please point to the source package or repo to get the
version of dhrystone you use? I could only find the old shar [1] and a
Debian version with added Makefile [2], but neither seems to produce
multiple threads.

It doesn't seem to be packaged for either Gentoo or Fedora unfortunately.

Indeed, I'm getting higher thermal load (vs. stress-ng --cpu) even by
simply compiling kernel sources, although I'd expect it to wait for
memory and/or IO quite a lot.

[1] https://www.netlib.org/benchmark/dhry-c
[2] https://github.com/qris/dhrystone-deb

Thanks a lot,
Alexey
  
Dragan Simic Jan. 25, 2024, 10:19 p.m. UTC | #6
On 2024-01-25 09:26, Alexey Charkov wrote:
> On Thu, Jan 25, 2024 at 1:56 AM Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
>> 
>> On 24/01/2024 21:30, Alexey Charkov wrote:
>> > Include thermal zones information in device tree for rk3588 variants
>> 
>> There is an energy model for the CPUs. But finding out the sustainable
>> power may be a bit tricky. So I suggest to remove everything related 
>> to
>> the power allocator in this change and propose a dedicated change with
>> all the power configuration (which includes proper k_p* coefficients 
>> to
>> be set from userspace to have a flat mitigation figure).
>> 
>> That implies removing the "contribution" properties in this 
>> description.
> 
> Alright, I'll just drop those "contribution" properties, thanks!
> 
>> Some comments below but definitively this version is close to be ok.
> 
> Yay! :)
> 
>> > Signed-off-by: Alexey Charkov <alchark@gmail.com>
>> > ---
>> >   arch/arm64/boot/dts/rockchip/rk3588s.dtsi | 165 ++++++++++++++++++++++++++++++
>> >   1 file changed, 165 insertions(+)
>> >
>> > diff --git a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
>> > index 36b1b7acfe6a..131b9eb21398 100644
>> > --- a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
>> > +++ b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
>> > @@ -10,6 +10,7 @@
>> >   #include <dt-bindings/reset/rockchip,rk3588-cru.h>
>> >   #include <dt-bindings/phy/phy.h>
>> >   #include <dt-bindings/ata/ahci.h>
>> > +#include <dt-bindings/thermal/thermal.h>
>> >
>> >   / {
>> >       compatible = "rockchip,rk3588";
>> > @@ -2228,6 +2229,170 @@ tsadc: tsadc@fec00000 {
>> >               status = "disabled";
>> >       };
>> >
>> > +     thermal_zones: thermal-zones {
>> > +             /* sensor near the center of the whole chip */
>> > +             package_thermal: package-thermal {
>> > +                     polling-delay-passive = <0>;
>> > +                     polling-delay = <0>;
>> > +                     thermal-sensors = <&tsadc 0>;
>> > +
>> > +                     trips {
>> > +                             package_crit: package-crit {
>> > +                                     temperature = <115000>;
>> > +                                     hysteresis = <0>;
>> > +                                     type = "critical";
>> > +                             };
>> > +                     };
>> > +             };
>> > +
>> > +             /* sensor between A76 cores 0 and 1 */
>> > +             bigcore0_thermal: bigcore0-thermal {
>> > +                     polling-delay-passive = <20>;
>> 
>> 20ms seems very short, is this value on purpose? Or just picked up
>> arbitrarily?
> 
> Frankly, I simply used the value that Radxa's downstream DTS sets for
> my board. 100ms seem to work just as well.

I think that 100 ms should be just fine (and should lower the resulting
CPU usage a bit), especially because the resolution of the TSADCs is
quite low at 1 oC, according to the RK3588 TRM.  If the resolution were
higher, sampling the temperatures more often might make more sense.

>> If it is possible, perhaps you should profile the temperature of these
>> thermal zones (CPUs ones). There is a tool in
>> <linuxdir>/tools/thermal/thermometer to do that.
>> 
>> You can measure with 10ms sampling rate when running for instance
>> dhrystone pinned on b0 and b1, then on b2 and b3. And finally on the
>> small cluster.
> 
> It seems tricky to isolate the effects from just one of the CPU
> clusters, as their individual thermal outputs are not that high.
> 
> For my testing I disabled the fan (but didn't remove the heatsink to
> avoid wasting the thermal interface tape), and tried loading CPUs with
> stress-ng. Here are the observations:
>  - Little cores alone stressed with 4 threads pegged to them with
> taskset never reach the throttling temperature (85C), and balance out
> around 60C
>  - Either of the big core clusters stressed individually with 2
> threads pegged to them with taskset never reach the throttling
> temperature either
>  - Four big cores with 4 threads heat up very slowly (>30 minutes to
> reach throttling temperature, I didn't have enough patience to let
> them actually reach it - maybe they never do)
>  - Eight cores with 8 threads heat up to the throttling temperature
> within ~5 minutes (again, with the fan off), and then, as soon as just
> one of the big core clusters gets throttled, the temperature of all
> cores balances out just below the throttling threshold. In my
> observation cores 6,7 go from 2.4GHz down to 1.8GHz while the rest
> stay at their respective top performance states (2.4GHz for big cores
> 4,5 and 1.8GHz for little cores 0-3)
> 
> Adding to it the fact that the temperature measurement resolution is
> not very granular (almost 1C) it's somewhat difficult to estimate how
> fast throttling action on a single cluster really brings its
> temperature within bounds, as they all affect each other at relevant
> temperature-load combinations. Perhaps it means that too granular
> polling doesn't add much value.
> 
>> But if you don't have spare time and 20 is ok for you. Then it is fine
>> for me too.
> 
> I guess I'll go for 100 as other upstream Rockchip .dtsi's do, given
> all of the above. Thanks for pointing this out!
> 
>> Some nits below.
>> 
>> > +                     polling-delay = <0>;
>> > +                     thermal-sensors = <&tsadc 1>;
>> > +
>> > +                     trips {
>> > +                             bigcore0_alert0: bigcore0-alert0 {
>> > +                                     temperature = <75000>;
>> > +                                     hysteresis = <2000>;
>> > +                                     type = "passive";
>> > +                             };
>> > +                             bigcore0_alert1: bigcore0-alert1 {
>> > +                                     temperature = <85000>;
>> > +                                     hysteresis = <2000>;
>> > +                                     type = "passive";
>> > +                             };
>> > +                             bigcore0_crit: bigcore0-crit {
>> > +                                     temperature = <115000>;
>> > +                                     hysteresis = <0>;
>> > +                                     type = "critical";
>> > +                             };
>> > +                     };
>> > +                     cooling-maps {
>> > +                             map0 {
>> > +                                     trip = <&bigcore0_alert1>;
>> > +                                     cooling-device =
>> > +                                             <&cpu_b0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
>> > +                                             <&cpu_b1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
>> > +                                     contribution = <1024>;
>> > +                             };
>> > +                     };
>> > +             };
>> > +
>> > +             /* sensor between A76 cores 2 and 3 */
>> > +             bigcore2_thermal: bigcore2-thermal {
>> > +                     polling-delay-passive = <20>;
>> > +                     polling-delay = <0>;
>> > +                     thermal-sensors = <&tsadc 2>;
>> > +
>> > +                     trips {
>> > +                             bigcore2_alert0: bigcore2-alert0 {
>> > +                                     temperature = <75000>;
>> > +                                     hysteresis = <2000>;
>> > +                                     type = "passive";
>> > +                             };
>> > +                             bigcore2_alert1: bigcore2-alert1 {
>> > +                                     temperature = <85000>;
>> > +                                     hysteresis = <2000>;
>> > +                                     type = "passive";
>> > +                             };
>> > +                             bigcore2_crit: bigcore2-crit {
>> > +                                     temperature = <115000>;
>> > +                                     hysteresis = <0>;
>> > +                                     type = "critical";
>> > +                             };
>> > +                     };
>> > +                     cooling-maps {
>> > +                             map1 {
>> 
>> s/map1/mpa0/
> 
> Noted, thanks!
> 
>> > +                                     trip = <&bigcore2_alert1>;
>> > +                                     cooling-device =
>> > +                                             <&cpu_b2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
>> > +                                             <&cpu_b3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
>> > +                                     contribution = <1024>;
>> > +                             };
>> > +                     };
>> > +             };
>> > +
>> > +             /* sensor between the four A55 cores */
>> > +             little_core_thermal: littlecore-thermal {
>> > +                     polling-delay-passive = <20>;
>> > +                     polling-delay = <0>;
>> > +                     thermal-sensors = <&tsadc 3>;
>> > +
>> > +                     trips {
>> > +                             littlecore_alert0: littlecore-alert0 {
>> > +                                     temperature = <75000>;
>> > +                                     hysteresis = <2000>;
>> > +                                     type = "passive";
>> > +                             };
>> > +                             littlecore_alert1: littlecore-alert1 {
>> > +                                     temperature = <85000>;
>> > +                                     hysteresis = <2000>;
>> > +                                     type = "passive";
>> > +                             };
>> > +                             littlecore_crit: littlecore-crit {
>> > +                                     temperature = <115000>;
>> > +                                     hysteresis = <0>;
>> > +                                     type = "critical";
>> > +                             };
>> > +                     };
>> > +                     cooling-maps {
>> > +                             map2 {
>> 
>> s/map2/map0/
> 
> Noted, thanks!
  

Patch

diff --git a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
index 36b1b7acfe6a..131b9eb21398 100644
--- a/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
+++ b/arch/arm64/boot/dts/rockchip/rk3588s.dtsi
@@ -10,6 +10,7 @@ 
 #include <dt-bindings/reset/rockchip,rk3588-cru.h>
 #include <dt-bindings/phy/phy.h>
 #include <dt-bindings/ata/ahci.h>
+#include <dt-bindings/thermal/thermal.h>
 
 / {
 	compatible = "rockchip,rk3588";
@@ -2228,6 +2229,170 @@  tsadc: tsadc@fec00000 {
 		status = "disabled";
 	};
 
+	thermal_zones: thermal-zones {
+		/* sensor near the center of the whole chip */
+		package_thermal: package-thermal {
+			polling-delay-passive = <0>;
+			polling-delay = <0>;
+			thermal-sensors = <&tsadc 0>;
+
+			trips {
+				package_crit: package-crit {
+					temperature = <115000>;
+					hysteresis = <0>;
+					type = "critical";
+				};
+			};
+		};
+
+		/* sensor between A76 cores 0 and 1 */
+		bigcore0_thermal: bigcore0-thermal {
+			polling-delay-passive = <20>;
+			polling-delay = <0>;
+			thermal-sensors = <&tsadc 1>;
+
+			trips {
+				bigcore0_alert0: bigcore0-alert0 {
+					temperature = <75000>;
+					hysteresis = <2000>;
+					type = "passive";
+				};
+				bigcore0_alert1: bigcore0-alert1 {
+					temperature = <85000>;
+					hysteresis = <2000>;
+					type = "passive";
+				};
+				bigcore0_crit: bigcore0-crit {
+					temperature = <115000>;
+					hysteresis = <0>;
+					type = "critical";
+				};
+			};
+			cooling-maps {
+				map0 {
+					trip = <&bigcore0_alert1>;
+					cooling-device =
+						<&cpu_b0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
+						<&cpu_b1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
+					contribution = <1024>;
+				};
+			};
+		};
+
+		/* sensor between A76 cores 2 and 3 */
+		bigcore2_thermal: bigcore2-thermal {
+			polling-delay-passive = <20>;
+			polling-delay = <0>;
+			thermal-sensors = <&tsadc 2>;
+
+			trips {
+				bigcore2_alert0: bigcore2-alert0 {
+					temperature = <75000>;
+					hysteresis = <2000>;
+					type = "passive";
+				};
+				bigcore2_alert1: bigcore2-alert1 {
+					temperature = <85000>;
+					hysteresis = <2000>;
+					type = "passive";
+				};
+				bigcore2_crit: bigcore2-crit {
+					temperature = <115000>;
+					hysteresis = <0>;
+					type = "critical";
+				};
+			};
+			cooling-maps {
+				map1 {
+					trip = <&bigcore2_alert1>;
+					cooling-device =
+						<&cpu_b2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
+						<&cpu_b3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
+					contribution = <1024>;
+				};
+			};
+		};
+
+		/* sensor between the four A55 cores */
+		little_core_thermal: littlecore-thermal {
+			polling-delay-passive = <20>;
+			polling-delay = <0>;
+			thermal-sensors = <&tsadc 3>;
+
+			trips {
+				littlecore_alert0: littlecore-alert0 {
+					temperature = <75000>;
+					hysteresis = <2000>;
+					type = "passive";
+				};
+				littlecore_alert1: littlecore-alert1 {
+					temperature = <85000>;
+					hysteresis = <2000>;
+					type = "passive";
+				};
+				littlecore_crit: littlecore-crit {
+					temperature = <115000>;
+					hysteresis = <0>;
+					type = "critical";
+				};
+			};
+			cooling-maps {
+				map2 {
+					trip = <&littlecore_alert1>;
+					cooling-device =
+						<&cpu_l0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
+						<&cpu_l1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
+						<&cpu_l2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
+						<&cpu_l3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
+					contribution = <1024>;
+				};
+			};
+		};
+
+		/* sensor near the PD_CENTER power domain */
+		center_thermal: center-thermal {
+			polling-delay-passive = <0>;
+			polling-delay = <0>;
+			thermal-sensors = <&tsadc 4>;
+
+			trips {
+				center_crit: center-crit {
+					temperature = <115000>;
+					hysteresis = <0>;
+					type = "critical";
+				};
+			};
+		};
+
+		gpu_thermal: gpu-thermal {
+			polling-delay-passive = <0>;
+			polling-delay = <0>;
+			thermal-sensors = <&tsadc 5>;
+
+			trips {
+				gpu_crit: gpu-crit {
+					temperature = <115000>;
+					hysteresis = <0>;
+					type = "critical";
+				};
+			};
+		};
+
+		npu_thermal: npu-thermal {
+			polling-delay-passive = <0>;
+			polling-delay = <0>;
+			thermal-sensors = <&tsadc 6>;
+
+			trips {
+				npu_crit: npu-crit {
+					temperature = <115000>;
+					hysteresis = <0>;
+					type = "critical";
+				};
+			};
+		};
+	};
+
 	saradc: adc@fec10000 {
 		compatible = "rockchip,rk3588-saradc";
 		reg = <0x0 0xfec10000 0x0 0x10000>;