[2/5] thermal/core: Reset cooling state during cooling device unregistration

Message ID 20230324070807.6342-2-rui.zhang@intel.com
State New
Series [v2,1/5] thermal/core: Update cooling device during thermal zone unregistration

Commit Message

Zhang, Rui March 24, 2023, 7:08 a.m. UTC
When unregistering a cooling device, it is possible that the cooling
device has been activated. And once the cooling device is unregistered,
no one will deactivate it anymore.

Reset cooling state during cooling device unregistration.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
---
In theory, the problem that this patch fixes can be triggered on a
platform with ACPI active cooling by:
1. overheating the system to trigger ACPI active cooling
2. unloading the ACPI fan driver
3. checking whether the fan is still spinning
But I don't have such a system, so I could not trigger the problem and
only did a build and boot test.
---
 drivers/thermal/thermal_core.c | 4 ++++
 1 file changed, 4 insertions(+)
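
For step 1 of the reproduction above, one way to confirm that active cooling
has engaged before unloading the driver is to read the cooling device's
cur_state attribute from sysfs. The helper below is only a sketch: the name
read_cooling_state is hypothetical, and the path a caller would pass (for
example /sys/class/thermal/cooling_device0/cur_state) assumes the fan is
cooling device 0 on the test machine.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one unsigned integer from a sysfs-style attribute file.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed.
 * (Hypothetical helper, not part of the patch.)
 */
int read_cooling_state(const char *path, unsigned long *state)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;

	/* cur_state is exposed as a single decimal number. */
	ok = (fscanf(f, "%lu", state) == 1);
	fclose(f);

	return ok ? 0 : -1;
}
```

Note that the sysfs node is removed together with the cooling device, so this
can only verify the state before step 2. Step 3 still requires observing the
hardware itself.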
  

Comments

Rafael J. Wysocki March 24, 2023, 1:19 p.m. UTC | #1
On Fri, Mar 24, 2023 at 8:08 AM Zhang Rui <rui.zhang@intel.com> wrote:
>
> When unregistering a cooling device, it is possible that the cooling
> device has been activated. And once the cooling device is unregistered,
> no one will deactivate it anymore.
>
> Reset cooling state during cooling device unregistration.
>
> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> ---
> In theory, this problem that this patch fixes can be triggered on a
> platform with ACPI Active cooling, by
> 1. overheat the system to trigger ACPI active cooling
> 2. unload ACPI fan driver
> 3. check if the fan is still spinning
> But I don't have such a system so I didn't trigger the problem and I
> only did build & boot test.

So I'm not sure if this change is actually safe.

In the example above, the system will still need the fan to spin after
the ACPI fan driver is unloaded in order to cool down, won't it?

> ---
>  drivers/thermal/thermal_core.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 30ff39154598..fd54e6c10b60 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -1192,6 +1192,10 @@ void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev)
>                 }
>         }
>
> +       mutex_lock(&cdev->lock);
> +       cdev->ops->set_cur_state(cdev, 0);
> +       mutex_unlock(&cdev->lock);
> +
>         mutex_unlock(&thermal_list_lock);
>
>         device_unregister(&cdev->device);
> --
> 2.25.1
>
  
Zhang, Rui March 27, 2023, 2:50 p.m. UTC | #2
On Fri, 2023-03-24 at 14:19 +0100, Rafael J. Wysocki wrote:
> On Fri, Mar 24, 2023 at 8:08 AM Zhang Rui <rui.zhang@intel.com>
> wrote:
> > When unregistering a cooling device, it is possible that the
> > cooling
> > device has been activated. And once the cooling device is
> > unregistered,
> > no one will deactivate it anymore.
> > 
> > Reset cooling state during cooling device unregistration.
> > 
> > Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> > ---
> > In theory, this problem that this patch fixes can be triggered on a
> > platform with ACPI Active cooling, by
> > 1. overheat the system to trigger ACPI active cooling
> > 2. unload ACPI fan driver
> > 3. check if the fan is still spinning
> > But I don't have such a system so I didn't trigger the problem and
> > I
> > only did build & boot test.
> 
> So I'm not sure if this change is actually safe.
> 
> In the example above, the system will still need the fan to spin
> after
> the ACPI fan driver is unloaded in order to cool down, won't it?

Then we can argue that the ACPI fan driver should not be unloaded in
this case.

Actually, this is the same situation as patch 1/5.
Patch 1/5 fixes the problem that the cooling state is not restored to 0
when unloading the thermal driver, and this patch fixes the same problem
when unloading the cooling device driver.

thanks,
rui

> 
> > ---
> >  drivers/thermal/thermal_core.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/drivers/thermal/thermal_core.c
> > b/drivers/thermal/thermal_core.c
> > index 30ff39154598..fd54e6c10b60 100644
> > --- a/drivers/thermal/thermal_core.c
> > +++ b/drivers/thermal/thermal_core.c
> > @@ -1192,6 +1192,10 @@ void
> > thermal_cooling_device_unregister(struct thermal_cooling_device
> > *cdev)
> >                 }
> >         }
> > 
> > +       mutex_lock(&cdev->lock);
> > +       cdev->ops->set_cur_state(cdev, 0);
> > +       mutex_unlock(&cdev->lock);
> > +
> >         mutex_unlock(&thermal_list_lock);
> > 
> >         device_unregister(&cdev->device);
> > --
> > 2.25.1
> >
  
Rafael J. Wysocki March 27, 2023, 3:13 p.m. UTC | #3
On Mon, Mar 27, 2023 at 4:50 PM Zhang, Rui <rui.zhang@intel.com> wrote:
>
> On Fri, 2023-03-24 at 14:19 +0100, Rafael J. Wysocki wrote:
> > On Fri, Mar 24, 2023 at 8:08 AM Zhang Rui <rui.zhang@intel.com>
> > wrote:
> > > When unregistering a cooling device, it is possible that the
> > > cooling
> > > device has been activated. And once the cooling device is
> > > unregistered,
> > > no one will deactivate it anymore.
> > >
> > > Reset cooling state during cooling device unregistration.
> > >
> > > Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> > > ---
> > > In theory, this problem that this patch fixes can be triggered on a
> > > platform with ACPI Active cooling, by
> > > 1. overheat the system to trigger ACPI active cooling
> > > 2. unload ACPI fan driver
> > > 3. check if the fan is still spinning
> > > But I don't have such a system so I didn't trigger the problem and
> > > I
> > > only did build & boot test.
> >
> > So I'm not sure if this change is actually safe.
> >
> > In the example above, the system will still need the fan to spin
> > after
> > the ACPI fan driver is unloaded in order to cool down, won't it?
>
> Then we can argue that the ACPI fan driver should not be unloaded in
> this case.

I don't think that whether or not the driver is expected to be
unloaded at a given time has any bearing on how it should behave when
actually unloaded.

Leaving the cooling device in its current state is "safe" from the
thermal control perspective, but it may affect the general user
experience (which may include performance too) going forward, so there
is a tradeoff.

You can argue that even if the cooling device is reset on the driver
removal, there should be another thermal control mechanism in place
that will take care of the overheat condition instead of it, but that
mechanism may be an emergency system shutdown.

What do the other cooling device drivers do in general when they get removed?

> Actually, this is the same situation as patch 1/5.
> Patch 1/5 fixes the problem that the cooling state is not restored to 0 when
> unloading the thermal driver, and this fixes the same problem when
> unloading the cooling device driver.

Right, it is analogous.
  
Zhang, Rui March 28, 2023, 2:46 a.m. UTC | #4
On Mon, 2023-03-27 at 17:13 +0200, Rafael J. Wysocki wrote:
> On Mon, Mar 27, 2023 at 4:50 PM Zhang, Rui <rui.zhang@intel.com>
> wrote:
> > On Fri, 2023-03-24 at 14:19 +0100, Rafael J. Wysocki wrote:
> > > On Fri, Mar 24, 2023 at 8:08 AM Zhang Rui <rui.zhang@intel.com>
> > > wrote:
> > > > When unregistering a cooling device, it is possible that the
> > > > cooling
> > > > device has been activated. And once the cooling device is
> > > > unregistered,
> > > > no one will deactivate it anymore.
> > > > 
> > > > Reset cooling state during cooling device unregistration.
> > > > 
> > > > Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> > > > ---
> > > > In theory, this problem that this patch fixes can be triggered
> > > > on a
> > > > platform with ACPI Active cooling, by
> > > > 1. overheat the system to trigger ACPI active cooling
> > > > 2. unload ACPI fan driver
> > > > 3. check if the fan is still spinning
> > > > But I don't have such a system so I didn't trigger the problem
> > > > and
> > > > I
> > > > only did build & boot test.
> > > 
> > > So I'm not sure if this change is actually safe.
> > > 
> > > In the example above, the system will still need the fan to spin
> > > after
> > > the ACPI fan driver is unloaded in order to cool down, won't it?
> > 
> > Then we can argue that the ACPI fan driver should not be unloaded
> > in
> > this case.
> 
> I don't think that whether or not the driver is expected to be
> unloaded at a given time has any bearing on how it should behave when
> actually unloaded.
> 
> Leaving the cooling device in its current state is "safe" from the
> thermal control perspective, but it may affect the general user
> experience (which may include performance too) going forward, so
> there
> is a tradeoff.

Right.
If we don't have a third choice, then the question is simple.
"thermal safety" vs. "user experience"?

I'd vote for "thermal safety" and drop this patch series.
> 
> What do the other cooling device drivers do in general when they get
> removed?

No cooling device driver has extra handling after cdev unregistration.

thanks,
rui
  
Rafael J. Wysocki March 28, 2023, 5:54 p.m. UTC | #5
On Tue, Mar 28, 2023 at 4:46 AM Zhang, Rui <rui.zhang@intel.com> wrote:
>
> On Mon, 2023-03-27 at 17:13 +0200, Rafael J. Wysocki wrote:
> > On Mon, Mar 27, 2023 at 4:50 PM Zhang, Rui <rui.zhang@intel.com>
> > wrote:
> > > On Fri, 2023-03-24 at 14:19 +0100, Rafael J. Wysocki wrote:
> > > > On Fri, Mar 24, 2023 at 8:08 AM Zhang Rui <rui.zhang@intel.com>
> > > > wrote:
> > > > > When unregistering a cooling device, it is possible that the
> > > > > cooling
> > > > > device has been activated. And once the cooling device is
> > > > > unregistered,
> > > > > no one will deactivate it anymore.
> > > > >
> > > > > Reset cooling state during cooling device unregistration.
> > > > >
> > > > > Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> > > > > ---
> > > > > In theory, this problem that this patch fixes can be triggered
> > > > > on a
> > > > > platform with ACPI Active cooling, by
> > > > > 1. overheat the system to trigger ACPI active cooling
> > > > > 2. unload ACPI fan driver
> > > > > 3. check if the fan is still spinning
> > > > > But I don't have such a system so I didn't trigger the problem
> > > > > and
> > > > > I
> > > > > only did build & boot test.
> > > >
> > > > So I'm not sure if this change is actually safe.
> > > >
> > > > In the example above, the system will still need the fan to spin
> > > > after
> > > > the ACPI fan driver is unloaded in order to cool down, won't it?
> > >
> > > Then we can argue that the ACPI fan driver should not be unloaded
> > > in
> > > this case.
> >
> > I don't think that whether or not the driver is expected to be
> > unloaded at a given time has any bearing on how it should behave when
> > actually unloaded.
> >
> > Leaving the cooling device in its current state is "safe" from the
> > thermal control perspective, but it may affect the general user
> > experience (which may include performance too) going forward, so
> > there
> > is a tradeoff.
>
> Right.
> If we don't have a third choice, then the question is simple.
> "thermal safety" vs. "user experience"?
>
> I'd vote for "thermal safety" and drop this patch series.

Works for me.

> > What do the other cooling device drivers do in general when they get
> > removed?
>
> No cooling device driver has extra handling after cdev unregistration.

However, the question regarding what to do when the driver of a
cooling device in use is being removed is a valid one.

One possible approach that comes to mind could be to defer the driver
removal until the overheat condition goes away, but anyway it would be
better to do that in the core IMV.
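
The deferral idea above could look roughly like the userspace model below.
This is only a sketch of one possible reading of the suggestion, not kernel
code: struct deferred_cdev and both functions are hypothetical names, and the
model assumes that something keeps updating the cooling state after the
removal request so that the state eventually returns to 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct deferred_cdev {
	unsigned long cur_state;  /* 0 means the device is idle */
	bool removal_pending;     /* removal was requested while still in use */
	bool registered;
};

/* Remove at once when idle; otherwise record the request and report busy. */
int deferred_cdev_request_unregister(struct deferred_cdev *cdev)
{
	if (cdev->cur_state != 0) {
		cdev->removal_pending = true;
		return -EBUSY;
	}

	cdev->registered = false;
	return 0;
}

/* Called by the (mock) core whenever a new cooling state is chosen. */
void deferred_cdev_set_state(struct deferred_cdev *cdev, unsigned long state)
{
	cdev->cur_state = state;

	/* The overheat condition is gone: finish the deferred removal. */
	if (state == 0 && cdev->removal_pending) {
		cdev->removal_pending = false;
		cdev->registered = false;
	}
}
```

In this model a fan running at state 2 stays registered after the removal
request and is only torn down once the state drops back to 0, so the thermal
safety of the current behavior is kept while the device is in use.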
  
Zhang, Rui March 29, 2023, 6:28 a.m. UTC | #6
On Tue, 2023-03-28 at 19:54 +0200, Rafael J. Wysocki wrote:
> > > What do the other cooling device drivers do in general when they
> > > get
> > > removed?
> > 
> > No cooling device driver has extra handling after cdev
> > unregistration.
> 
> However, the question regarding what to do when the driver of a
> cooling device in use is being removed is a valid one.
> 
> One possible approach that comes to mind could be to defer the driver
> removal until the overheat condition goes away, but anyway it would
> be
> better to do that in the core IMV.

In this case, we should guarantee that the thermal zone driver is still
functional, i.e. that it can still get temperature change notifications
and update the thermal zone. I doubt that current thermal zone drivers
can guarantee this.

Given that this is a rare case, and the current behavior is not perfect
but still acceptable, maybe we can treat this as low priority for now.

thanks,
rui
  

Patch

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 30ff39154598..fd54e6c10b60 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1192,6 +1192,10 @@  void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev)
 		}
 	}
 
+	mutex_lock(&cdev->lock);
+	cdev->ops->set_cur_state(cdev, 0);
+	mutex_unlock(&cdev->lock);
+
 	mutex_unlock(&thermal_list_lock);
 
 	device_unregister(&cdev->device);
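
To make the effect of the hunk above easier to follow outside the kernel tree,
here is a minimal userspace model of the same pattern: take the device lock,
ask the driver callback for state 0, release the lock, then tear the device
down. It is a sketch under stated assumptions. All mock_* names are
hypothetical stand-ins, and a pthread mutex plays the role of cdev->lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; all mock_* names are hypothetical. */
struct mock_cdev;

struct mock_cdev_ops {
	int (*set_cur_state)(struct mock_cdev *cdev, unsigned long state);
};

struct mock_cdev {
	pthread_mutex_t lock;             /* plays the role of cdev->lock */
	const struct mock_cdev_ops *ops;
	unsigned long hw_state;           /* what the "hardware" is doing now */
	int registered;
};

/* Driver callback: program the hardware to the requested cooling state. */
static int mock_fan_set_cur_state(struct mock_cdev *cdev, unsigned long state)
{
	cdev->hw_state = state;
	return 0;
}

static const struct mock_cdev_ops mock_fan_ops = {
	.set_cur_state = mock_fan_set_cur_state,
};

void mock_cdev_register(struct mock_cdev *cdev)
{
	pthread_mutex_init(&cdev->lock, NULL);
	cdev->ops = &mock_fan_ops;
	cdev->hw_state = 0;
	cdev->registered = 1;
}

/*
 * Mirrors the proposed hunk: force state 0 under the device lock, then
 * tear the device down. The patch ignores the callback's return value;
 * it is returned here only so that a caller can inspect it.
 */
int mock_cdev_unregister(struct mock_cdev *cdev)
{
	int ret;

	pthread_mutex_lock(&cdev->lock);
	ret = cdev->ops->set_cur_state(cdev, 0);
	pthread_mutex_unlock(&cdev->lock);

	cdev->registered = 0;
	pthread_mutex_destroy(&cdev->lock);
	return ret;
}
```

With this model, a fan that was driven to state 3 ends up at state 0 once
mock_cdev_unregister() returns, which is exactly the behavior the review
questions for a system that is still hot.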