[v2,2/2] of: overlay: Synchronize of_overlay_remove() with the devlink removals

Message ID 20240229083953.607569-3-herve.codina@bootlin.com
State New
Headers
Series Synchronize DT overlay removal with devlink removals |

Commit Message

Herve Codina Feb. 29, 2024, 8:39 a.m. UTC
  In the following sequence:
  1) of_platform_depopulate()
  2) of_overlay_remove()

During the step 1, devices are destroyed and devlinks are removed.
During the step 2, OF nodes are destroyed but
__of_changeset_entry_destroy() can raise warnings related to missing
of_node_put():
  ERROR: memory leak, expected refcount 1 instead of 2 ...

Indeed, during the devlink removals performed at step 1, the removal
itself releasing the device (and the attached of_node) is done by a job
queued in a workqueue and so, it is done asynchronously with respect to
function calls.
When the warning is present, of_node_put() will be called but wrongly
too late from the workqueue job.

In order to be sure that any ongoing devlink removals are done before
the of_node destruction, synchronize the of_overlay_remove() with the
devlink removals.

Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
Cc: stable@vger.kernel.org
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
---
 drivers/of/overlay.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
  

Comments

Nuno Sá Feb. 29, 2024, 9:47 a.m. UTC | #1
On Thu, 2024-02-29 at 09:39 +0100, Herve Codina wrote:
> In the following sequence:
>   1) of_platform_depopulate()
>   2) of_overlay_remove()
> 
> During the step 1, devices are destroyed and devlinks are removed.
> During the step 2, OF nodes are destroyed but
> __of_changeset_entry_destroy() can raise warnings related to missing
> of_node_put():
>   ERROR: memory leak, expected refcount 1 instead of 2 ...
> 
> Indeed, during the devlink removals performed at step 1, the removal
> itself releasing the device (and the attached of_node) is done by a job
> queued in a workqueue and so, it is done asynchronously with respect to
> function calls.
> When the warning is present, of_node_put() will be called but wrongly
> too late from the workqueue job.
> 
> In order to be sure that any ongoing devlink removals are done before
> the of_node destruction, synchronize the of_overlay_remove() with the
> devlink removals.
> 
> Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
> Cc: stable@vger.kernel.org
> Signed-off-by: Herve Codina <herve.codina@bootlin.com>
> ---
>  drivers/of/overlay.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> index 2ae7e9d24a64..99659ae9fb28 100644
> --- a/drivers/of/overlay.c
> +++ b/drivers/of/overlay.c
> @@ -853,6 +853,14 @@ static void free_overlay_changeset(struct
> overlay_changeset *ovcs)
>  {
>  	int i;
>  
> +	/*
> +	 * Wait for any ongoing device link removals before removing some of
> +	 * nodes. Drop the global lock while waiting
> +	 */
> +	mutex_unlock(&of_mutex);
> +	device_link_wait_removal();
> +	mutex_lock(&of_mutex);

I'm still not convinced we need to drop the lock. What happens if someone else
grabs the lock while we are in device_link_wait_removal()? Can we guarantee that
we can't screw things badly?

The question is, do you have a system/use case where you can really see the
deadlock happening? Until I see one, I'm very skeptical about this. And if we
have one, I'm not really sure this is also the right solution for it.

- Nuno Sá
  
Nuno Sá Feb. 29, 2024, 9:50 a.m. UTC | #2
On Thu, 2024-02-29 at 09:39 +0100, Herve Codina wrote:
> In the following sequence:
>   1) of_platform_depopulate()
>   2) of_overlay_remove()
> 
> During the step 1, devices are destroyed and devlinks are removed.
> During the step 2, OF nodes are destroyed but
> __of_changeset_entry_destroy() can raise warnings related to missing
> of_node_put():
>   ERROR: memory leak, expected refcount 1 instead of 2 ...
> 
> Indeed, during the devlink removals performed at step 1, the removal
> itself releasing the device (and the attached of_node) is done by a job
> queued in a workqueue and so, it is done asynchronously with respect to
> function calls.
> When the warning is present, of_node_put() will be called but wrongly
> too late from the workqueue job.
> 
> In order to be sure that any ongoing devlink removals are done before
> the of_node destruction, synchronize the of_overlay_remove() with the
> devlink removals.
> 
> Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
> Cc: stable@vger.kernel.org
> Signed-off-by: Herve Codina <herve.codina@bootlin.com>
> ---
>  drivers/of/overlay.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> index 2ae7e9d24a64..99659ae9fb28 100644
> --- a/drivers/of/overlay.c
> +++ b/drivers/of/overlay.c

In the cover, you mention device.h inclusion but I'm not seeing it? This is
clearly up to the DT maintainers to decide but, IMHO, I would very much prefer
to see fwnode.h included in here rather than directly device.h (so yeah,
renaming the function to fwnode_*). But yeah, I might be biased by own series :)

- Nuno Sá
  
Herve Codina Feb. 29, 2024, 10:14 a.m. UTC | #3
On Thu, 29 Feb 2024 10:50:21 +0100
Nuno Sá <noname.nuno@gmail.com> wrote:

> On Thu, 2024-02-29 at 09:39 +0100, Herve Codina wrote:
> > In the following sequence:
> >   1) of_platform_depopulate()
> >   2) of_overlay_remove()
> > 
> > During the step 1, devices are destroyed and devlinks are removed.
> > During the step 2, OF nodes are destroyed but
> > __of_changeset_entry_destroy() can raise warnings related to missing
> > of_node_put():
> >   ERROR: memory leak, expected refcount 1 instead of 2 ...
> > 
> > Indeed, during the devlink removals performed at step 1, the removal
> > itself releasing the device (and the attached of_node) is done by a job
> > queued in a workqueue and so, it is done asynchronously with respect to
> > function calls.
> > When the warning is present, of_node_put() will be called but wrongly
> > too late from the workqueue job.
> > 
> > In order to be sure that any ongoing devlink removals are done before
> > the of_node destruction, synchronize the of_overlay_remove() with the
> > devlink removals.
> > 
> > Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Herve Codina <herve.codina@bootlin.com>
> > ---
> >  drivers/of/overlay.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> > index 2ae7e9d24a64..99659ae9fb28 100644
> > --- a/drivers/of/overlay.c
> > +++ b/drivers/of/overlay.c  
> 
> In the cover, you mention device.h inclusion but I'm not seeing it? This is
> clearly up to the DT maintainers to decide but, IMHO, I would very much prefer
> to see fwnode.h included in here rather than directly device.h (so yeah,
> renaming the function to fwnode_*). But yeah, I might be biased by own series :)
> 

Damned. I missed device.h in this patch.
Without this one, the patch do not compile :(

A fixup commit I missed to squash before sending.

A v3 is planned to add this device.h.

Nuno, do you prefer I wait few days before sending this v3 waiting for more replies
or I send it right now and you re-do your comment on the v3 ?

I would really prefer to send it now :)

Sorry about my mistake.
Best regards,
Hervé
  
Nuno Sá Feb. 29, 2024, 10:25 a.m. UTC | #4
On Thu, 2024-02-29 at 11:14 +0100, Herve Codina wrote:
> On Thu, 29 Feb 2024 10:50:21 +0100
> Nuno Sá <noname.nuno@gmail.com> wrote:
> 
> > On Thu, 2024-02-29 at 09:39 +0100, Herve Codina wrote:
> > > In the following sequence:
> > >   1) of_platform_depopulate()
> > >   2) of_overlay_remove()
> > > 
> > > During the step 1, devices are destroyed and devlinks are removed.
> > > During the step 2, OF nodes are destroyed but
> > > __of_changeset_entry_destroy() can raise warnings related to missing
> > > of_node_put():
> > >   ERROR: memory leak, expected refcount 1 instead of 2 ...
> > > 
> > > Indeed, during the devlink removals performed at step 1, the removal
> > > itself releasing the device (and the attached of_node) is done by a job
> > > queued in a workqueue and so, it is done asynchronously with respect to
> > > function calls.
> > > When the warning is present, of_node_put() will be called but wrongly
> > > too late from the workqueue job.
> > > 
> > > In order to be sure that any ongoing devlink removals are done before
> > > the of_node destruction, synchronize the of_overlay_remove() with the
> > > devlink removals.
> > > 
> > > Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Herve Codina <herve.codina@bootlin.com>
> > > ---
> > >  drivers/of/overlay.c | 9 ++++++++-
> > >  1 file changed, 8 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> > > index 2ae7e9d24a64..99659ae9fb28 100644
> > > --- a/drivers/of/overlay.c
> > > +++ b/drivers/of/overlay.c  
> > 
> > In the cover, you mention device.h inclusion but I'm not seeing it? This is
> > clearly up to the DT maintainers to decide but, IMHO, I would very much
> > prefer
> > to see fwnode.h included in here rather than directly device.h (so yeah,
> > renaming the function to fwnode_*). But yeah, I might be biased by own
> > series :)
> > 
> 
> Damned. I missed device.h in this patch.
> Without this one, the patch do not compile :(
> 
> A fixup commit I missed to squash before sending.
> 
> A v3 is planned to add this device.h.
> 
> Nuno, do you prefer I wait few days before sending this v3 waiting for more
> replies
> or I send it right now and you re-do your comment on the v3 ?
> 
> I would really prefer to send it now :)
> 

Typically maintainers don't like much of re-spinning versions too fast. That
said, up to you :). I can copy paste my comments in v3.

- Nuno Sá
  

Patch

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 2ae7e9d24a64..99659ae9fb28 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -853,6 +853,14 @@  static void free_overlay_changeset(struct overlay_changeset *ovcs)
 {
 	int i;
 
+	/*
+	 * Wait for any ongoing device link removals before removing some of
+	 * nodes. Drop the global lock while waiting
+	 */
+	mutex_unlock(&of_mutex);
+	device_link_wait_removal();
+	mutex_lock(&of_mutex);
+
 	if (ovcs->cset.entries.next)
 		of_changeset_destroy(&ovcs->cset);
 
@@ -862,7 +870,6 @@  static void free_overlay_changeset(struct overlay_changeset *ovcs)
 		ovcs->id = 0;
 	}
 
-
 	for (i = 0; i < ovcs->count; i++) {
 		of_node_put(ovcs->fragments[i].target);
 		of_node_put(ovcs->fragments[i].overlay);