[v1,07/17] thermal/hwmon: Use the thermal API instead of tampering with the internals

Message ID 20230219143657.241542-8-daniel.lezcano@linaro.org
State New
Series Self-encapsulate the thermal zone device structure

Commit Message

Daniel Lezcano Feb. 19, 2023, 2:36 p.m. UTC
  In this function, there is a guarantee the thermal zone is registered.

The sysfs hwmon unregistering will be blocked until we exit the
function. The thermal zone is unregistered after the sysfs hwmon is
unregistered.

When we are in this function, the thermal zone is registered.

We can call the thermal_zone_get_crit_temp() function safely and let
the function use the lock which is private to the thermal core code.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 drivers/thermal/thermal_hwmon.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)
  

Comments

Daniel Lezcano Feb. 20, 2023, 1:34 p.m. UTC | #1
Hi Guenter,

my script should have Cc'ed you but it didn't, so just a heads up about
this patch ;)

  
Guenter Roeck Feb. 20, 2023, 2:11 p.m. UTC | #2
On Mon, Feb 20, 2023 at 02:34:08PM +0100, Daniel Lezcano wrote:
> On 19/02/2023 15:36, Daniel Lezcano wrote:
> > In this function, there is a guarantee the thermal zone is registered.
> > 
> > The sysfs hwmon unregistering will be blocked until we exit the
> > function. The thermal zone is unregistered after the sysfs hwmon is
> > unregistered.
> > 
> > When we are in this function, the thermal zone is registered.
> > 
> > We can call the thermal_zone_get_crit_temp() function safely and let
> > the function use the lock which is private to the thermal core code.
> > 

Hmm, if you say so. That very same call used to cause a crash in
Chromebooks, which is why I had added the locking.

Guenter

  
Daniel Lezcano Feb. 20, 2023, 3:39 p.m. UTC | #3
On 20/02/2023 15:11, Guenter Roeck wrote:
> Hmm, if you say so. That very same call used to cause a crash in
> Chromebooks, which is why I had added the locking.

Mmh, I see. I guess we can assume thermal_hwmon is part of the core code 
and remove this change.
  
Guenter Roeck Feb. 20, 2023, 5:12 p.m. UTC | #4
On Mon, Feb 20, 2023 at 04:39:48PM +0100, Daniel Lezcano wrote:
> On 20/02/2023 15:11, Guenter Roeck wrote:
> > Hmm, if you say so. That very same call used to cause a crash in
> > Chromebooks, which is why I had added the locking.
> 
> Mmh, I see. I guess we can assume thermal_hwmon is part of the core code and
> remove this change.
> 

Yes. Anyway, the sequence of events was roughly as follows.

- thermal zone device is registered
- hwmon device is registered
  - userspace is triggered and starts reading device attributes
- while userspace has a hwmon attribute open, thermal device is unregistered
- hwmon device is unregistered (sysfs attribute is still open)
- hwmon device attribute function is called
- Since thermal device ops have been released after the thermal device
  was unregistered, trying to call an ops callback fails.

That doesn't normally happen, but the Intel wireless driver has the habit
of registering a thermal zone early in its probe function, only to unregister
it immediately afterwards if the probe function fails. If some userspace
activity is triggered by the hwmon device registration, the thermal and
hwmon device removal may be timed such that the hwmon device is removed
while one (or more) of its attribute files is still open. Normally that
doesn't matter, but it is fatal here since the ops callbacks are not owned
by the hwmon device but by the thermal device.
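
The sequence above can be modelled in userspace. This is only a sketch,
not kernel code: struct model_zone, model_zone_register(),
model_zone_unregister() and model_temp_crit_show() are hypothetical
stand-ins, a pthread mutex plays the role of tz->lock, and the
"registered" flag plays the role of device_is_registered(). It mirrors
the locked check that temp_crit_show() performed before this patch.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the thermal zone ops table. */
struct model_zone_ops {
	int (*get_crit_temp)(int *temp);
};

/* Hypothetical stand-in for struct thermal_zone_device. */
struct model_zone {
	pthread_mutex_t lock;             /* models tz->lock */
	bool registered;                  /* models device_is_registered() */
	const struct model_zone_ops *ops; /* owned by the thermal driver */
};

/* Arbitrary callback returning a fixed value in millidegrees Celsius. */
static int fixed_crit_temp(int *temp)
{
	*temp = 95000;
	return 0;
}

static const struct model_zone_ops fixed_ops = {
	.get_crit_temp = fixed_crit_temp,
};

static void model_zone_register(struct model_zone *z)
{
	pthread_mutex_init(&z->lock, NULL);
	z->ops = &fixed_ops;
	z->registered = true;
}

/* Models unregistration: the ops go away together with the zone. */
static void model_zone_unregister(struct model_zone *z)
{
	pthread_mutex_lock(&z->lock);
	z->registered = false;
	z->ops = NULL;
	pthread_mutex_unlock(&z->lock);
}

/*
 * Models the hwmon attribute read with the locked registration check.
 * A late reader gets -ENODEV instead of calling through released ops.
 */
static int model_temp_crit_show(struct model_zone *z, int *temp)
{
	int ret;

	pthread_mutex_lock(&z->lock);
	if (z->registered)
		ret = z->ops->get_crit_temp(temp);
	else
		ret = -ENODEV;
	pthread_mutex_unlock(&z->lock);

	return ret;
}
```

In this model, a read issued after model_zone_unregister() takes the
-ENODEV branch and never dereferences the ops pointer.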

Essentially every ops callback has this problem.
thermal_zone_get_temp() had it as well, also associated with
a hwmon sysfs attribute read operation. See commit 1c6b30060777
("thermal/core: Ensure that thermal device is registered in
thermal_zone_get_temp").

If you don't want non-thermal code to access ->ops directly, the thermal
code would have to provide protected accessor functions, similar to
thermal_zone_get_temp().
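
One possible shape for such accessors, again as a userspace sketch
rather than the actual thermal core API: every name here (struct
guarded_zone, GUARDED_ACCESSOR, guarded_zone_get_temp() and so on) is
hypothetical, and returning -EOPNOTSUPP for a missing callback is an
assumption made for the example. The macro generates one locked wrapper
per callback, so callers outside the core never touch ->ops.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ops table; any callback may be left NULL. */
struct guarded_ops {
	int (*get_temp)(int *value);
	int (*get_crit_temp)(int *value);
};

/* Hypothetical zone object; the lock and ops are meant to stay private. */
struct guarded_zone {
	pthread_mutex_t lock;
	bool registered;
	const struct guarded_ops *ops;
};

/*
 * Generate guarded_zone_<op>(): take the lock, check registration,
 * then call the callback if the driver provided one.
 */
#define GUARDED_ACCESSOR(op)						\
static int guarded_zone_##op(struct guarded_zone *z, int *value)	\
{									\
	int ret = -ENODEV;						\
									\
	pthread_mutex_lock(&z->lock);					\
	if (z->registered)						\
		ret = z->ops->op ? z->ops->op(value) : -EOPNOTSUPP;	\
	pthread_mutex_unlock(&z->lock);					\
	return ret;							\
}

GUARDED_ACCESSOR(get_temp)
GUARDED_ACCESSOR(get_crit_temp)

/* Models unregistration: the flag flips under the same lock. */
static void guarded_zone_unregister(struct guarded_zone *z)
{
	pthread_mutex_lock(&z->lock);
	z->registered = false;
	z->ops = NULL;
	pthread_mutex_unlock(&z->lock);
}

/* Sample driver that only implements get_temp. */
static int sample_get_temp(int *value)
{
	*value = 42000;
	return 0;
}

static const struct guarded_ops sample_ops = {
	.get_temp = sample_get_temp,
};
```

With this shape, each new callback needs only one more
GUARDED_ACCESSOR() line, and the registration check cannot be forgotten
at an individual call site.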

Thanks,
Guenter

  
Daniel Lezcano Feb. 21, 2023, 4:08 p.m. UTC | #5
On 20/02/2023 18:12, Guenter Roeck wrote:
> If you don't want non-thermal code to access ->ops directly, the thermal
> code would have to provide protected accessor functions, similar to
> thermal_zone_get_temp().

Hopefully we are getting rid of most of the ops soon ... :/
  

Patch

diff --git a/drivers/thermal/thermal_hwmon.c b/drivers/thermal/thermal_hwmon.c
index bc02095b314c..15158715b967 100644
--- a/drivers/thermal/thermal_hwmon.c
+++ b/drivers/thermal/thermal_hwmon.c
@@ -77,15 +77,7 @@  temp_crit_show(struct device *dev, struct device_attribute *attr, char *buf)
 	int temperature;
 	int ret;
 
-	mutex_lock(&tz->lock);
-
-	if (device_is_registered(&tz->device))
-		ret = tz->ops->get_crit_temp(tz, &temperature);
-	else
-		ret = -ENODEV;
-
-	mutex_unlock(&tz->lock);
-
+	ret = thermal_zone_get_crit_temp(tz, &temperature);
 	if (ret)
 		return ret;