[v1] PM: sleep: Restore asynchronous device resume optimization

Message ID 10423008.nUPlyArG6x@kreacher
State New
Headers
Series [v1] PM: sleep: Restore asynchronous device resume optimization |

Commit Message

Rafael J. Wysocki Jan. 9, 2024, 4:59 p.m. UTC
  From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core
system-wide PM code"), the resume of devices that were allowed to resume
asynchronously was scheduled before starting the resume of the other
devices, so the former did not have to wait for the latter unless
functional dependencies were present.

Commit 7839d0078e0d removed that optimization in order to address a
correctness issue, but it can be restored with the help of a new device
power management flag, so do that now.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

I said I'd probably do this in 6.9, but then I thought more about it
and now I think it would be nice to have 6.8-rc1 without a suspend
performance regression and the change is relatively straightforward,
so here it goes.

---
 drivers/base/power/main.c |  117 +++++++++++++++++++++++++---------------------
 include/linux/pm.h        |    1 
 2 files changed, 65 insertions(+), 53 deletions(-)
  

Comments

Stanislaw Gruszka Jan. 10, 2024, 10:37 a.m. UTC | #1
On Tue, Jan 09, 2024 at 05:59:22PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core
> system-wide PM code"), the resume of devices that were allowed to resume
> asynchronously was scheduled before starting the resume of the other
> devices, so the former did not have to wait for the latter unless
> functional dependencies were present.
> 
> Commit 7839d0078e0d removed that optimization in order to address a
> correctness issue, but it can be restored with the help of a new device
> power management flag, so do that now.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> I said I'd probably do this in 6.9, but then I thought more about it
> and now I think it would be nice to have 6.8-rc1 without a suspend
> performance regression and the change is relatively straightforward,
> so here it goes.
> 
> ---
>  drivers/base/power/main.c |  117 +++++++++++++++++++++++++---------------------
>  include/linux/pm.h        |    1 
>  2 files changed, 65 insertions(+), 53 deletions(-)
> 
> Index: linux-pm/include/linux/pm.h
> ===================================================================
> --- linux-pm.orig/include/linux/pm.h
> +++ linux-pm/include/linux/pm.h
> @@ -681,6 +681,7 @@ struct dev_pm_info {
>  	bool			wakeup_path:1;
>  	bool			syscore:1;
>  	bool			no_pm_callbacks:1;	/* Owned by the PM core */
> +	bool			in_progress:1;	/* Owned by the PM core */
>  	unsigned int		must_resume:1;	/* Owned by the PM core */
>  	unsigned int		may_skip_resume:1;	/* Set by subsystems */

Not related to the patch, just question: why some types here are
unsigned int :1 others bool :1 ?

>   * dpm_resume_early - Execute "early resume" callbacks for all devices.
>   * @state: PM transition of the system being carried out.
> @@ -845,18 +845,28 @@ void dpm_resume_early(pm_message_t state
>  	mutex_lock(&dpm_list_mtx);
>  	pm_transition = state;
>  
> +	/*
> +	 * Trigger the resume of "async" devices upfront so they don't have to
> +	 * wait for the "non-async" ones they don't depend on.
> +	 */
> +	list_for_each_entry(dev, &dpm_late_early_list, power.entry)
> +		dpm_async_fn(dev, async_resume_early);
> +
>  	while (!list_empty(&dpm_late_early_list)) {
>  		dev = to_device(dpm_late_early_list.next);
> -		get_device(dev);
>  		list_move_tail(&dev->power.entry, &dpm_suspended_list);
>  
> -		mutex_unlock(&dpm_list_mtx);
> +		if (!dev->power.in_progress) {

I would consider different naming just to make clear this
is regarding async call, in_progress looks too generic for me.
Fine if you think otherwise, in general patch LGTM:

Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
  
Rafael J. Wysocki Jan. 10, 2024, 12:33 p.m. UTC | #2
On Wed, Jan 10, 2024 at 11:37 AM Stanislaw Gruszka
<stanislaw.gruszka@linux.intel.com> wrote:
>
> On Tue, Jan 09, 2024 at 05:59:22PM +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core
> > system-wide PM code"), the resume of devices that were allowed to resume
> > asynchronously was scheduled before starting the resume of the other
> > devices, so the former did not have to wait for the latter unless
> > functional dependencies were present.
> >
> > Commit 7839d0078e0d removed that optimization in order to address a
> > correctness issue, but it can be restored with the help of a new device
> > power management flag, so do that now.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > I said I'd probably do this in 6.9, but then I thought more about it
> > and now I think it would be nice to have 6.8-rc1 without a suspend
> > performance regression and the change is relatively straightforward,
> > so here it goes.
> >
> > ---
> >  drivers/base/power/main.c |  117 +++++++++++++++++++++++++---------------------
> >  include/linux/pm.h        |    1
> >  2 files changed, 65 insertions(+), 53 deletions(-)
> >
> > Index: linux-pm/include/linux/pm.h
> > ===================================================================
> > --- linux-pm.orig/include/linux/pm.h
> > +++ linux-pm/include/linux/pm.h
> > @@ -681,6 +681,7 @@ struct dev_pm_info {
> >       bool                    wakeup_path:1;
> >       bool                    syscore:1;
> >       bool                    no_pm_callbacks:1;      /* Owned by the PM core */
> > +     bool                    in_progress:1;  /* Owned by the PM core */
> >       unsigned int            must_resume:1;  /* Owned by the PM core */
> >       unsigned int            may_skip_resume:1;      /* Set by subsystems */
>
> Not related to the patch, just question: why some types here are
> unsigned int :1 others bool :1 ?

No particular reason.

I think I will change them all to bool in the future.

> >   * dpm_resume_early - Execute "early resume" callbacks for all devices.
> >   * @state: PM transition of the system being carried out.
> > @@ -845,18 +845,28 @@ void dpm_resume_early(pm_message_t state
> >       mutex_lock(&dpm_list_mtx);
> >       pm_transition = state;
> >
> > +     /*
> > +      * Trigger the resume of "async" devices upfront so they don't have to
> > +      * wait for the "non-async" ones they don't depend on.
> > +      */
> > +     list_for_each_entry(dev, &dpm_late_early_list, power.entry)
> > +             dpm_async_fn(dev, async_resume_early);
> > +
> >       while (!list_empty(&dpm_late_early_list)) {
> >               dev = to_device(dpm_late_early_list.next);
> > -             get_device(dev);
> >               list_move_tail(&dev->power.entry, &dpm_suspended_list);
> >
> > -             mutex_unlock(&dpm_list_mtx);
> > +             if (!dev->power.in_progress) {
>
> I would consider different naming just to make clear this
> is regarding async call, in_progress looks too generic for me.

OK, what about async_in_progress?

> Fine if you think otherwise, in general patch LGTM:
>
> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>

Thanks!
  
Stanislaw Gruszka Jan. 10, 2024, 2:05 p.m. UTC | #3
On Wed, Jan 10, 2024 at 01:33:07PM +0100, Rafael J. Wysocki wrote:
> > I would consider different naming just to make clear this
> > is regarding async call, in_progress looks too generic for me.
> 
> OK, what about async_in_progress?
Sure, that better.

Regards
Stanislaw
  
Stanislaw Gruszka Jan. 11, 2024, 7:58 a.m. UTC | #4
On Wed, Jan 10, 2024 at 03:05:07PM +0100, Stanislaw Gruszka wrote:
> On Wed, Jan 10, 2024 at 01:33:07PM +0100, Rafael J. Wysocki wrote:
> > > I would consider different naming just to make clear this
> > > is regarding async call, in_progress looks too generic for me.
> > 
> > OK, what about async_in_progress?
> Sure, that better.

Even better would be using_async IMO, because we don't know if
async call is in progress or finish or before start.

Regards
Stanislaw
  
Rafael J. Wysocki Jan. 11, 2024, 12:01 p.m. UTC | #5
On Thu, Jan 11, 2024 at 8:58 AM Stanislaw Gruszka
<stanislaw.gruszka@linux.intel.com> wrote:
>
> On Wed, Jan 10, 2024 at 03:05:07PM +0100, Stanislaw Gruszka wrote:
> > On Wed, Jan 10, 2024 at 01:33:07PM +0100, Rafael J. Wysocki wrote:
> > > > I would consider different naming just to make clear this
> > > > is regarding async call, in_progress looks too generic for me.
> > >
> > > OK, what about async_in_progress?
> > Sure, that better.
>
> Even better would be using_async IMO, because we don't know if
> async call is in progress or finish or before start.

Well, "in progress" applies to all of the processing of the async call
and I regard it as "in progress" once it is known that it will run
asynchronously eventually.

In any case, I've already applied the async_in_progress version, thanks!
  
Marek Szyprowski Feb. 7, 2024, 10:31 a.m. UTC | #6
Dear All,

On 09.01.2024 17:59, Rafael J. Wysocki wrote:
>   From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core
> system-wide PM code"), the resume of devices that were allowed to resume
> asynchronously was scheduled before starting the resume of the other
> devices, so the former did not have to wait for the latter unless
> functional dependencies were present.
>
> Commit 7839d0078e0d removed that optimization in order to address a
> correctness issue, but it can be restored with the help of a new device
> power management flag, so do that now.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---

This patch finally landed in linux-next some time ago as 3e999770ac1c 
("PM: sleep: Restore asynchronous device resume optimization"). Recently 
I found that it causes a non-trivial interaction with commit 
5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement 
for unbound workqueues"). Since merge commit 954350a5f8db in linux-next 
system suspend/resume fails (board doesn't wake up) on my old Samsung 
Exynos4412-based Odroid-U3 board (ARM 32bit based), which was rock 
stable for last years.

My further investigations confirmed that the mentioned commits are 
responsible for this issue. Each of them separately (3e999770ac1c and 
5797b1c18919) doesn't trigger any problems. Reverting any of them on top 
of linux-next (with some additional commit due to code dependencies) 
also fixes/hides the problem.

Let me know if You need more information or tests on the hardware. I'm 
open to help debugging this issue.

> I said I'd probably do this in 6.9, but then I thought more about it
> and now I think it would be nice to have 6.8-rc1 without a suspend
> performance regression and the change is relatively straightforward,
> so here it goes.
>
> ---
>   drivers/base/power/main.c |  117 +++++++++++++++++++++++++---------------------
>   include/linux/pm.h        |    1
>   2 files changed, 65 insertions(+), 53 deletions(-)
>
> Index: linux-pm/include/linux/pm.h
> ===================================================================
> --- linux-pm.orig/include/linux/pm.h
> +++ linux-pm/include/linux/pm.h
> @@ -681,6 +681,7 @@ struct dev_pm_info {
>   	bool			wakeup_path:1;
>   	bool			syscore:1;
>   	bool			no_pm_callbacks:1;	/* Owned by the PM core */
> +	bool			in_progress:1;	/* Owned by the PM core */
>   	unsigned int		must_resume:1;	/* Owned by the PM core */
>   	unsigned int		may_skip_resume:1;	/* Set by subsystems */
>   #else
> Index: linux-pm/drivers/base/power/main.c
> ===================================================================
> --- linux-pm.orig/drivers/base/power/main.c
> +++ linux-pm/drivers/base/power/main.c
> @@ -579,7 +579,7 @@ bool dev_pm_skip_resume(struct device *d
>   }
>   
>   /**
> - * __device_resume_noirq - Execute a "noirq resume" callback for given device.
> + * device_resume_noirq - Execute a "noirq resume" callback for given device.
>    * @dev: Device to handle.
>    * @state: PM transition of the system being carried out.
>    * @async: If true, the device is being resumed asynchronously.
> @@ -587,7 +587,7 @@ bool dev_pm_skip_resume(struct device *d
>    * The driver of @dev will not receive interrupts while this function is being
>    * executed.
>    */
> -static void __device_resume_noirq(struct device *dev, pm_message_t state, bool async)
> +static void device_resume_noirq(struct device *dev, pm_message_t state, bool async)
>   {
>   	pm_callback_t callback = NULL;
>   	const char *info = NULL;
> @@ -674,16 +674,22 @@ static bool dpm_async_fn(struct device *
>   {
>   	reinit_completion(&dev->power.completion);
>   
> -	if (!is_async(dev))
> -		return false;
> +	if (is_async(dev)) {
> +		dev->power.in_progress = true;
>   
> -	get_device(dev);
> -
> -	if (async_schedule_dev_nocall(func, dev))
> -		return true;
> +		get_device(dev);
>   
> -	put_device(dev);
> +		if (async_schedule_dev_nocall(func, dev))
> +			return true;
>   
> +		put_device(dev);
> +	}
> +	/*
> +	 * Because async_schedule_dev_nocall() above has returned false or it
> +	 * has not been called at all, func() is not running and it safe to
> +	 * update the in_progress flag without additional synchronization.
> +	 */
> +	dev->power.in_progress = false;
>   	return false;
>   }
>   
> @@ -691,18 +697,10 @@ static void async_resume_noirq(void *dat
>   {
>   	struct device *dev = data;
>   
> -	__device_resume_noirq(dev, pm_transition, true);
> +	device_resume_noirq(dev, pm_transition, true);
>   	put_device(dev);
>   }
>   
> -static void device_resume_noirq(struct device *dev)
> -{
> -	if (dpm_async_fn(dev, async_resume_noirq))
> -		return;
> -
> -	__device_resume_noirq(dev, pm_transition, false);
> -}
> -
>   static void dpm_noirq_resume_devices(pm_message_t state)
>   {
>   	struct device *dev;
> @@ -712,18 +710,28 @@ static void dpm_noirq_resume_devices(pm_
>   	mutex_lock(&dpm_list_mtx);
>   	pm_transition = state;
>   
> +	/*
> +	 * Trigger the resume of "async" devices upfront so they don't have to
> +	 * wait for the "non-async" ones they don't depend on.
> +	 */
> +	list_for_each_entry(dev, &dpm_noirq_list, power.entry)
> +		dpm_async_fn(dev, async_resume_noirq);
> +
>   	while (!list_empty(&dpm_noirq_list)) {
>   		dev = to_device(dpm_noirq_list.next);
> -		get_device(dev);
>   		list_move_tail(&dev->power.entry, &dpm_late_early_list);
>   
> -		mutex_unlock(&dpm_list_mtx);
> +		if (!dev->power.in_progress) {
> +			get_device(dev);
>   
> -		device_resume_noirq(dev);
> +			mutex_unlock(&dpm_list_mtx);
>   
> -		put_device(dev);
> +			device_resume_noirq(dev, state, false);
> +
> +			put_device(dev);
>   
> -		mutex_lock(&dpm_list_mtx);
> +			mutex_lock(&dpm_list_mtx);
> +		}
>   	}
>   	mutex_unlock(&dpm_list_mtx);
>   	async_synchronize_full();
> @@ -747,14 +755,14 @@ void dpm_resume_noirq(pm_message_t state
>   }
>   
>   /**
> - * __device_resume_early - Execute an "early resume" callback for given device.
> + * device_resume_early - Execute an "early resume" callback for given device.
>    * @dev: Device to handle.
>    * @state: PM transition of the system being carried out.
>    * @async: If true, the device is being resumed asynchronously.
>    *
>    * Runtime PM is disabled for @dev while this function is being executed.
>    */
> -static void __device_resume_early(struct device *dev, pm_message_t state, bool async)
> +static void device_resume_early(struct device *dev, pm_message_t state, bool async)
>   {
>   	pm_callback_t callback = NULL;
>   	const char *info = NULL;
> @@ -820,18 +828,10 @@ static void async_resume_early(void *dat
>   {
>   	struct device *dev = data;
>   
> -	__device_resume_early(dev, pm_transition, true);
> +	device_resume_early(dev, pm_transition, true);
>   	put_device(dev);
>   }
>   
> -static void device_resume_early(struct device *dev)
> -{
> -	if (dpm_async_fn(dev, async_resume_early))
> -		return;
> -
> -	__device_resume_early(dev, pm_transition, false);
> -}
> -
>   /**
>    * dpm_resume_early - Execute "early resume" callbacks for all devices.
>    * @state: PM transition of the system being carried out.
> @@ -845,18 +845,28 @@ void dpm_resume_early(pm_message_t state
>   	mutex_lock(&dpm_list_mtx);
>   	pm_transition = state;
>   
> +	/*
> +	 * Trigger the resume of "async" devices upfront so they don't have to
> +	 * wait for the "non-async" ones they don't depend on.
> +	 */
> +	list_for_each_entry(dev, &dpm_late_early_list, power.entry)
> +		dpm_async_fn(dev, async_resume_early);
> +
>   	while (!list_empty(&dpm_late_early_list)) {
>   		dev = to_device(dpm_late_early_list.next);
> -		get_device(dev);
>   		list_move_tail(&dev->power.entry, &dpm_suspended_list);
>   
> -		mutex_unlock(&dpm_list_mtx);
> +		if (!dev->power.in_progress) {
> +			get_device(dev);
>   
> -		device_resume_early(dev);
> +			mutex_unlock(&dpm_list_mtx);
>   
> -		put_device(dev);
> +			device_resume_early(dev, state, false);
> +
> +			put_device(dev);
>   
> -		mutex_lock(&dpm_list_mtx);
> +			mutex_lock(&dpm_list_mtx);
> +		}
>   	}
>   	mutex_unlock(&dpm_list_mtx);
>   	async_synchronize_full();
> @@ -876,12 +886,12 @@ void dpm_resume_start(pm_message_t state
>   EXPORT_SYMBOL_GPL(dpm_resume_start);
>   
>   /**
> - * __device_resume - Execute "resume" callbacks for given device.
> + * device_resume - Execute "resume" callbacks for given device.
>    * @dev: Device to handle.
>    * @state: PM transition of the system being carried out.
>    * @async: If true, the device is being resumed asynchronously.
>    */
> -static void __device_resume(struct device *dev, pm_message_t state, bool async)
> +static void device_resume(struct device *dev, pm_message_t state, bool async)
>   {
>   	pm_callback_t callback = NULL;
>   	const char *info = NULL;
> @@ -975,18 +985,10 @@ static void async_resume(void *data, asy
>   {
>   	struct device *dev = data;
>   
> -	__device_resume(dev, pm_transition, true);
> +	device_resume(dev, pm_transition, true);
>   	put_device(dev);
>   }
>   
> -static void device_resume(struct device *dev)
> -{
> -	if (dpm_async_fn(dev, async_resume))
> -		return;
> -
> -	__device_resume(dev, pm_transition, false);
> -}
> -
>   /**
>    * dpm_resume - Execute "resume" callbacks for non-sysdev devices.
>    * @state: PM transition of the system being carried out.
> @@ -1006,16 +1008,25 @@ void dpm_resume(pm_message_t state)
>   	pm_transition = state;
>   	async_error = 0;
>   
> +	/*
> +	 * Trigger the resume of "async" devices upfront so they don't have to
> +	 * wait for the "non-async" ones they don't depend on.
> +	 */
> +	list_for_each_entry(dev, &dpm_suspended_list, power.entry)
> +		dpm_async_fn(dev, async_resume);
> +
>   	while (!list_empty(&dpm_suspended_list)) {
>   		dev = to_device(dpm_suspended_list.next);
>   
>   		get_device(dev);
>   
> -		mutex_unlock(&dpm_list_mtx);
> +		if (!dev->power.in_progress) {
> +			mutex_unlock(&dpm_list_mtx);
>   
> -		device_resume(dev);
> +			device_resume(dev, state, false);
>   
> -		mutex_lock(&dpm_list_mtx);
> +			mutex_lock(&dpm_list_mtx);
> +		}
>   
>   		if (!list_empty(&dev->power.entry))
>   			list_move_tail(&dev->power.entry, &dpm_prepared_list);
>
>
>
Best regards
  
Rafael J. Wysocki Feb. 7, 2024, 10:38 a.m. UTC | #7
On Wed, Feb 7, 2024 at 11:31 AM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> Dear All,
>
> On 09.01.2024 17:59, Rafael J. Wysocki wrote:
> >   From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core
> > system-wide PM code"), the resume of devices that were allowed to resume
> > asynchronously was scheduled before starting the resume of the other
> > devices, so the former did not have to wait for the latter unless
> > functional dependencies were present.
> >
> > Commit 7839d0078e0d removed that optimization in order to address a
> > correctness issue, but it can be restored with the help of a new device
> > power management flag, so do that now.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
>
> This patch finally landed in linux-next some time ago as 3e999770ac1c
> ("PM: sleep: Restore asynchronous device resume optimization"). Recently
> I found that it causes a non-trivial interaction with commit
> 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement
> for unbound workqueues"). Since merge commit 954350a5f8db in linux-next
> system suspend/resume fails (board doesn't wake up) on my old Samsung
> Exynos4412-based Odroid-U3 board (ARM 32bit based), which was rock
> stable for last years.
>
> My further investigations confirmed that the mentioned commits are
> responsible for this issue. Each of them separately (3e999770ac1c and
> 5797b1c18919) doesn't trigger any problems. Reverting any of them on top
> of linux-next (with some additional commit due to code dependencies)
> also fixes/hides the problem.
>
> Let me know if You need more information or tests on the hardware. I'm
> open to help debugging this issue.

If you echo 0 to /sys/power/pm_async before suspending the system,
does it still fail?
  
Rafael J. Wysocki Feb. 7, 2024, 10:59 a.m. UTC | #8
On Wed, Feb 7, 2024 at 11:31 AM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> Dear All,
>
> On 09.01.2024 17:59, Rafael J. Wysocki wrote:
> >   From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core
> > system-wide PM code"), the resume of devices that were allowed to resume
> > asynchronously was scheduled before starting the resume of the other
> > devices, so the former did not have to wait for the latter unless
> > functional dependencies were present.
> >
> > Commit 7839d0078e0d removed that optimization in order to address a
> > correctness issue, but it can be restored with the help of a new device
> > power management flag, so do that now.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
>
> This patch finally landed in linux-next some time ago as 3e999770ac1c
> ("PM: sleep: Restore asynchronous device resume optimization"). Recently
> I found that it causes a non-trivial interaction with commit
> 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement
> for unbound workqueues"). Since merge commit 954350a5f8db in linux-next
> system suspend/resume fails (board doesn't wake up) on my old Samsung
> Exynos4412-based Odroid-U3 board (ARM 32bit based), which was rock
> stable for last years.
>
> My further investigations confirmed that the mentioned commits are
> responsible for this issue. Each of them separately (3e999770ac1c and
> 5797b1c18919) doesn't trigger any problems. Reverting any of them on top
> of linux-next (with some additional commit due to code dependencies)
> also fixes/hides the problem.
>
> Let me know if You need more information or tests on the hardware. I'm
> open to help debugging this issue.

So I found this report:

https://lore.kernel.org/lkml/b3d08cd8-d77f-45dd-a2c3-4a4db5a98dfa@kernel.org/

which appears to be about the same issue.

Strictly speaking, the regression is introduced by 5797b1c18919,
because it is not a mainline commit yet, but the information regarding
the interaction of it with commit 3e999770ac1c is valuable.

Essentially, what commit 3e999770ac1c does is to schedule the
execution of all of the async resume callbacks in a given phase
upfront, so they can run without waiting for the sync ones to complete
(except when there is a parent-child or supplier-consumer dependency -
the latter represented by a device link).

What seems to be happening after commit 5797b1c18919 is that one (or
more) of the async callbacks get blocked in the workqueue for some
reason.

I would try to replace system_unbound_wq in
__async_schedule_node_domain() with a dedicated workqueue that is not
unbound and see what happens.
  
Marek Szyprowski Feb. 7, 2024, 11:16 a.m. UTC | #9
On 07.02.2024 11:38, Rafael J. Wysocki wrote:
> On Wed, Feb 7, 2024 at 11:31 AM Marek Szyprowski
> <m.szyprowski@samsung.com>  wrote:
>> On 09.01.2024 17:59, Rafael J. Wysocki wrote:
>>>    From: Rafael J. Wysocki<rafael.j.wysocki@intel.com>
>>>
>>> Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core
>>> system-wide PM code"), the resume of devices that were allowed to resume
>>> asynchronously was scheduled before starting the resume of the other
>>> devices, so the former did not have to wait for the latter unless
>>> functional dependencies were present.
>>>
>>> Commit 7839d0078e0d removed that optimization in order to address a
>>> correctness issue, but it can be restored with the help of a new device
>>> power management flag, so do that now.
>>>
>>> Signed-off-by: Rafael J. Wysocki<rafael.j.wysocki@intel.com>
>>> ---
>> This patch finally landed in linux-next some time ago as 3e999770ac1c
>> ("PM: sleep: Restore asynchronous device resume optimization"). Recently
>> I found that it causes a non-trivial interaction with commit
>> 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement
>> for unbound workqueues"). Since merge commit 954350a5f8db in linux-next
>> system suspend/resume fails (board doesn't wake up) on my old Samsung
>> Exynos4412-based Odroid-U3 board (ARM 32bit based), which was rock
>> stable for last years.
>>
>> My further investigations confirmed that the mentioned commits are
>> responsible for this issue. Each of them separately (3e999770ac1c and
>> 5797b1c18919) doesn't trigger any problems. Reverting any of them on top
>> of linux-next (with some additional commit due to code dependencies)
>> also fixes/hides the problem.
>>
>> Let me know if You need more information or tests on the hardware. I'm
>> open to help debugging this issue.
> If you echo 0 to /sys/power/pm_async before suspending the system,
> does it still fail?

In such case it works fine.

Best regards
  
Rafael J. Wysocki Feb. 7, 2024, 11:25 a.m. UTC | #10
On Wed, Feb 7, 2024 at 12:16 PM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> On 07.02.2024 11:38, Rafael J. Wysocki wrote:
> > On Wed, Feb 7, 2024 at 11:31 AM Marek Szyprowski
> > <m.szyprowski@samsung.com>  wrote:
> >> On 09.01.2024 17:59, Rafael J. Wysocki wrote:
> >>>    From: Rafael J. Wysocki<rafael.j.wysocki@intel.com>
> >>>
> >>> Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core
> >>> system-wide PM code"), the resume of devices that were allowed to resume
> >>> asynchronously was scheduled before starting the resume of the other
> >>> devices, so the former did not have to wait for the latter unless
> >>> functional dependencies were present.
> >>>
> >>> Commit 7839d0078e0d removed that optimization in order to address a
> >>> correctness issue, but it can be restored with the help of a new device
> >>> power management flag, so do that now.
> >>>
> >>> Signed-off-by: Rafael J. Wysocki<rafael.j.wysocki@intel.com>
> >>> ---
> >> This patch finally landed in linux-next some time ago as 3e999770ac1c
> >> ("PM: sleep: Restore asynchronous device resume optimization"). Recently
> >> I found that it causes a non-trivial interaction with commit
> >> 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement
> >> for unbound workqueues"). Since merge commit 954350a5f8db in linux-next
> >> system suspend/resume fails (board doesn't wake up) on my old Samsung
> >> Exynos4412-based Odroid-U3 board (ARM 32bit based), which was rock
> >> stable for last years.
> >>
> >> My further investigations confirmed that the mentioned commits are
> >> responsible for this issue. Each of them separately (3e999770ac1c and
> >> 5797b1c18919) doesn't trigger any problems. Reverting any of them on top
> >> of linux-next (with some additional commit due to code dependencies)
> >> also fixes/hides the problem.
> >>
> >> Let me know if You need more information or tests on the hardware. I'm
> >> open to help debugging this issue.
> > If you echo 0 to /sys/power/pm_async before suspending the system,
> > does it still fail?
>
> In such case it works fine.

Thanks for the confirmation.

 It doesn't rely on unbound workqueues then, so that's expected.

Now, I think that there are two possibilities.

One is that commit 3e999770ac1c is generally overoptimistic for your
board and there is a dependency between devices which is not
represented by a device link, and it causes things to go south when
they are not done in a specific order.  If that is the case and commit
5797b1c18919 changes that order, breakage ensues.

The other one is that what happens during async resume does not meet
the assumptions of commit 5797b1c18919 (for example, it can easily
produce a chain of interdependent work items longer than 8) and so it
breaks things.

I would still try to use a non-unbound workqueue for the async thing,
because if it works reliably then, the second possibility will be more
likely.
  
Tejun Heo Feb. 7, 2024, 4:39 p.m. UTC | #11
Hello,

On Wed, Feb 07, 2024 at 12:25:46PM +0100, Rafael J. Wysocki wrote:
> The other one is that what happens during async resume does not meet
> the assumptions of commit 5797b1c18919 (for example, it can easily
> produce a chain of interdependent work items longer than 8) and so it
> breaks things.

Ah, that's fascinating. But aren't CPUs all brought up online before devices
are resumed? If so, the max_active should already be way higher than the
WQ_DFL_MIN_ACTIVE. Also, are these multi node NUMA machines? Otherwise, it
really shouldn't affect anything. One easy way to verify would be just
bumping up WQ_DFL_MIN_ACTIVE and see what happens.

Thanks.
  
Rafael J. Wysocki Feb. 7, 2024, 6:41 p.m. UTC | #12
On Wed, Feb 7, 2024 at 5:40 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Feb 07, 2024 at 12:25:46PM +0100, Rafael J. Wysocki wrote:
> > The other one is that what happens during async resume does not meet
> > the assumptions of commit 5797b1c18919 (for example, it can easily
> > produce a chain of interdependent work items longer than 8) and so it
> > breaks things.
>
> Ah, that's fascinating. But aren't CPUs all brought up online before devices
> are resumed?

They are.

> If so, the max_active should already be way higher than the
> WQ_DFL_MIN_ACTIVE. Also, are these multi node NUMA machines?

They aren't.

> Otherwise, it really shouldn't affect anything.

Well, it evidently makes a difference.  I'm wondering what difference
it can make in that case.

> One easy way to verify would be just bumping up WQ_DFL_MIN_ACTIVE and see what happens.

Sure.
  
Marek Szyprowski Feb. 7, 2024, 6:55 p.m. UTC | #13
On 07.02.2024 17:39, Tejun Heo wrote:
> On Wed, Feb 07, 2024 at 12:25:46PM +0100, Rafael J. Wysocki wrote:
>> The other one is that what happens during async resume does not meet
>> the assumptions of commit 5797b1c18919 (for example, it can easily
>> produce a chain of interdependent work items longer than 8) and so it
>> breaks things.
> Ah, that's fascinating. But aren't CPUs all brought up online before devices
> are resumed? If so, the max_active should already be way higher than the
> WQ_DFL_MIN_ACTIVE. Also, are these multi node NUMA machines? Otherwise, it
> really shouldn't affect anything. One easy way to verify would be just
> bumping up WQ_DFL_MIN_ACTIVE and see what happens.

I've increased WQ_DFL_MIN_ACTIVE from 8 to 32 and all the system 
suspend/resume issues went away. :)

Best regards
  
Tejun Heo Feb. 7, 2024, 6:58 p.m. UTC | #14
On Wed, Feb 07, 2024 at 07:55:51PM +0100, Marek Szyprowski wrote:
> On 07.02.2024 17:39, Tejun Heo wrote:
> > On Wed, Feb 07, 2024 at 12:25:46PM +0100, Rafael J. Wysocki wrote:
> >> The other one is that what happens during async resume does not meet
> >> the assumptions of commit 5797b1c18919 (for example, it can easily
> >> produce a chain of interdependent work items longer than 8) and so it
> >> breaks things.
> > Ah, that's fascinating. But aren't CPUs all brought up online before devices
> > are resumed? If so, the max_active should already be way higher than the
> > WQ_DFL_MIN_ACTIVE. Also, are these multi node NUMA machines? Otherwise, it
> > really shouldn't affect anything. One easy way to verify would be just
> > bumping up WQ_DFL_MIN_ACTIVE and see what happens.
> 
> I've increased WQ_DFL_MIN_ACTIVE from 8 to 32 and all the system 
> suspend/resume issues went away. :)

Ah, okay, that's surprising. Lemme look at the code again. I gotta be
missing something.

Thanks.
  
Tejun Heo Feb. 7, 2024, 7:48 p.m. UTC | #15
Hello,

I couldn't reproduce effective max_active being pushed down to min_active
across suspend/resume cycles on x86. There gotta be something different.

- Can you please apply the following patch along with the WQ_DFL_MIN_ACTIVE
  bump, go through suspend/resume once and report the dmesg?

- Regardless of the root cause, I think async should switch to a dedicated
  workqueue with explicitly raised min_active (will add an interface for
  it).

Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index cf514ba0dfc3..93e5b837f2b4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -368,6 +368,8 @@ struct workqueue_struct {
 	 */
 	struct rcu_head		rcu;
 
+	bool			dbg;
+
 	/* hot fields used during command issue, aligned to cacheline */
 	unsigned int		flags ____cacheline_aligned; /* WQ: WQ_* flags */
 	struct pool_workqueue __percpu __rcu **cpu_pwq; /* I: per-cpu pwqs */
@@ -1557,6 +1559,10 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)
 	if (off_cpu >= 0)
 		total_cpus--;
 
+	if (wq->dbg)
+		printk("XXX wq_update_node_max_active: wq=%s off_cpu=%d total=%d range=[%d, %d]",
+		       wq->name, off_cpu, total_cpus, min_active, max_active);
+
 	for_each_node(node) {
 		int node_cpus;
 
@@ -1567,7 +1573,12 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)
 		wq_node_nr_active(wq, node)->max =
 			clamp(DIV_ROUND_UP(max_active * node_cpus, total_cpus),
 			      min_active, max_active);
+		if (wq->dbg)
+			pr_cont(" node[%d] node_cpus=%d max=%d",
+				node, node_cpus, wq_node_nr_active(wq, node)->max);
 	}
+	if (wq->dbg)
+		pr_cont("\n");
 
 	wq_node_nr_active(wq, NUMA_NO_NODE)->max = min_active;
 }
@@ -4886,6 +4897,9 @@ static struct pool_workqueue *install_unbound_pwq(struct workqueue_struct *wq,
 
 	old_pwq = rcu_access_pointer(*slot);
 	rcu_assign_pointer(*slot, pwq);
+	if (wq->dbg)
+		printk("XXX install_unbound_pwq: wq=%s cpu=%d pool=%d\n",
+		       wq->name, cpu, pwq->pool->id);
 	return old_pwq;
 }
 
@@ -7418,6 +7432,8 @@ void __init workqueue_init_early(void)
 	       !system_power_efficient_wq ||
 	       !system_freezable_power_efficient_wq ||
 	       !system_bh_wq || !system_bh_highpri_wq);
+
+	system_unbound_wq->dbg = true;
 }
 
 static void __init wq_cpu_intensive_thresh_init(void)
  
Rafael J. Wysocki Feb. 7, 2024, 7:54 p.m. UTC | #16
On Wed, Feb 7, 2024 at 8:48 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> I couldn't reproduce effective max_active being pushed down to min_active
> across suspend/resume cycles on x86. There gotta be something different.
>
> - Can you please apply the following patch along with the WQ_DFL_MIN_ACTIVE
>   bump, go through suspend/resume once and report the dmesg?
>
> - Regardless of the root cause, I think async should switch to a dedicated
>   workqueue with explicitly raised min_active (will add an interface for
>   it).

Agreed.
  
Marek Szyprowski Feb. 7, 2024, 9:30 p.m. UTC | #17
On 07.02.2024 20:48, Tejun Heo wrote:
> Hello,
>
> I couldn't reproduce effective max_active being pushed down to min_active
> across suspend/resume cycles on x86. There gotta be something different.
>
> - Can you please apply the following patch along with the WQ_DFL_MIN_ACTIVE
>    bump, go through suspend/resume once and report the dmesg?

Here is the relevant part of the log:

PM: suspend entry (deep)
Filesystems sync: 0.004 seconds
Freezing user space processes
Freezing user space processes completed (elapsed 0.002 seconds)
OOM killer disabled.
Freezing remaining freezable tasks
Freezing remaining freezable tasks completed (elapsed 0.002 seconds)
dwc2 12480000.usb: suspending usb gadget g_ether
dwc2 12480000.usb: new device is full-speed
smsc95xx 1-2:1.0 eth0: entering SUSPEND2 mode
wake enabled for irq 97 (gpx3-2)
usb3503 0-0008: switched to STANDBY mode
wake enabled for irq 124 (gpx1-3)
samsung-pinctrl 11000000.pinctrl: Setting external wakeup interrupt 
mask: 0xfbfff7ff
Disabling non-boot CPUs ...
XXX wq_update_node_max_active: wq=events_unbound off_cpu=1 total=3 
range=[32, 512] node[0] node_cpus=3 max=512
XXX wq_update_node_max_active: wq=events_unbound off_cpu=2 total=2 
range=[32, 512] node[0] node_cpus=2 max=512
XXX wq_update_node_max_active: wq=events_unbound off_cpu=3 total=1 
range=[32, 512] node[0] node_cpus=1 max=512
Enabling non-boot CPUs ...
XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=2 
range=[32, 512] node[0] node_cpus=2 max=512
CPU1 is up
XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=3 
range=[32, 512] node[0] node_cpus=3 max=512
CPU2 is up
XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=4 
range=[32, 512] node[0] node_cpus=4 max=512
CPU3 is up
s3c-i2c 138e0000.i2c: slave address 0x00
s3c-i2c 138e0000.i2c: bus frequency set to 97 KHz
s3c-i2c 13870000.i2c: slave address 0x00
s3c-i2c 13870000.i2c: bus frequency set to 97 KHz
s3c-i2c 13860000.i2c: slave address 0x00
s3c-i2c 13860000.i2c: bus frequency set to 390 KHz
s3c-i2c 13880000.i2c: slave address 0x00
s3c-i2c 13880000.i2c: bus frequency set to 97 KHz
s3c2410-wdt 10060000.watchdog: watchdog disabled
wake disabled for irq 124 (gpx1-3)
s3c-rtc 10070000.rtc: rtc disabled, re-enabling
usb3503 0-0008: switched to HUB mode
wake disabled for irq 97 (gpx3-2)
usb usb1: root hub lost power or was reset
usb 1-2: reset high-speed USB device number 2 using exynos-ehci
smsc95xx 1-2:1.0 eth0: Link is Down
dwc2 12480000.usb: resuming usb gadget g_ether
usb 1-3: reset high-speed USB device number 3 using exynos-ehci
usb 1-3.1: reset high-speed USB device number 4 using exynos-ehci
OOM killer enabled.
Restarting tasks ... done.
random: crng reseeded on system resumption
PM: suspend exit


> ...

Best regards
  
Tejun Heo Feb. 7, 2024, 9:35 p.m. UTC | #18
Hello,

On Wed, Feb 07, 2024 at 10:30:58PM +0100, Marek Szyprowski wrote:
..
> Disabling non-boot CPUs ...
> XXX wq_update_node_max_active: wq=events_unbound off_cpu=1 total=3 range=[32, 512] node[0] node_cpus=3 max=512
> XXX wq_update_node_max_active: wq=events_unbound off_cpu=2 total=2 range=[32, 512] node[0] node_cpus=2 max=512
> XXX wq_update_node_max_active: wq=events_unbound off_cpu=3 total=1 range=[32, 512] node[0] node_cpus=1 max=512
> Enabling non-boot CPUs ...
> XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=2 range=[32, 512] node[0] node_cpus=2 max=512
> CPU1 is up
> XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=3 range=[32, 512] node[0] node_cpus=3 max=512
> CPU2 is up
> XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=4 range=[32, 512] node[0] node_cpus=4 max=512
> CPU3 is up

So, the node max_active does stay at 512. The only pwq which uses min_active
would be the dfl_pwq but I'm not sure why that'd be being used. Can you
please post the output of `drgn -v vmlinux tools/workqueue/wq_dump.py`?

Thanks.
  
Marek Szyprowski Feb. 8, 2024, 7:47 a.m. UTC | #19
On 07.02.2024 22:35, Tejun Heo wrote:
> Hello,
>
> On Wed, Feb 07, 2024 at 10:30:58PM +0100, Marek Szyprowski wrote:
> ...
>> Disabling non-boot CPUs ...
>> XXX wq_update_node_max_active: wq=events_unbound off_cpu=1 total=3 range=[32, 512] node[0] node_cpus=3 max=512
>> XXX wq_update_node_max_active: wq=events_unbound off_cpu=2 total=2 range=[32, 512] node[0] node_cpus=2 max=512
>> XXX wq_update_node_max_active: wq=events_unbound off_cpu=3 total=1 range=[32, 512] node[0] node_cpus=1 max=512
>> Enabling non-boot CPUs ...
>> XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=2 range=[32, 512] node[0] node_cpus=2 max=512
>> CPU1 is up
>> XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=3 range=[32, 512] node[0] node_cpus=3 max=512
>> CPU2 is up
>> XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=4 range=[32, 512] node[0] node_cpus=4 max=512
>> CPU3 is up
> So, the node max_active does stay at 512. The only pwq which uses min_active
> would be the dfl_pwq but I'm not sure why that'd be being used. Can you
> please post the output of `drgn -v vmlinux tools/workqueue/wq_dump.py`?

I've tried to get drgn running on the test target, but then I've noticed 
that it is not possible to run it on ARM 32bit target, as it requires 
PROC_KCORE support, which is 'Visible if: PROC_FS [=y] && MMU [=y] && 
!ARM [=y]' for some reasons.

Best regards
  
Tejun Heo Feb. 9, 2024, 12:20 a.m. UTC | #20
Hello,

On Thu, Feb 08, 2024 at 08:47:12AM +0100, Marek Szyprowski wrote:
> I've tried to get drgn running on the test target, but then I've noticed 
> that it is not possible to run it on ARM 32bit target, as it requires 
> PROC_KCORE support, which is 'Visible if: PROC_FS [=y] && MMU [=y] && 
> !ARM [=y]' for some reasons.

Bummer. I instrumented code on my test setup (x86) and couldn't repro usage
of the dfl_pwq, unfortuantely.

Independent of understanding what's going on with the system_unbound_wq, the
correct solution seems like using a dedicated workqueue with raised
min_active. Just posted the patchset to do that:

  http://lkml.kernel.org/r/ZcVtzJvJCRV5OLM-@slm.duckdns.org

Thanks.
  
Konrad Dybcio Feb. 10, 2024, 2:04 p.m. UTC | #21
On 2/7/24 11:59, Rafael J. Wysocki wrote:
> On Wed, Feb 7, 2024 at 11:31 AM Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>>
>> Dear All,
>>
>> On 09.01.2024 17:59, Rafael J. Wysocki wrote:
>>>    From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core
>>> system-wide PM code"), the resume of devices that were allowed to resume
>>> asynchronously was scheduled before starting the resume of the other
>>> devices, so the former did not have to wait for the latter unless
>>> functional dependencies were present.
>>>
>>> Commit 7839d0078e0d removed that optimization in order to address a
>>> correctness issue, but it can be restored with the help of a new device
>>> power management flag, so do that now.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>
>> This patch finally landed in linux-next some time ago as 3e999770ac1c
>> ("PM: sleep: Restore asynchronous device resume optimization"). Recently
>> I found that it causes a non-trivial interaction with commit
>> 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement
>> for unbound workqueues"). Since merge commit 954350a5f8db in linux-next
>> system suspend/resume fails (board doesn't wake up) on my old Samsung
>> Exynos4412-based Odroid-U3 board (ARM 32bit based), which was rock
>> stable for last years.
>>
>> My further investigations confirmed that the mentioned commits are
>> responsible for this issue. Each of them separately (3e999770ac1c and
>> 5797b1c18919) doesn't trigger any problems. Reverting any of them on top
>> of linux-next (with some additional commit due to code dependencies)
>> also fixes/hides the problem.
>>
>> Let me know if You need more information or tests on the hardware. I'm
>> open to help debugging this issue.
> 
> So I found this report:
> 
> https://lore.kernel.org/lkml/b3d08cd8-d77f-45dd-a2c3-4a4db5a98dfa@kernel.org/
> 
> which appears to be about the same issue.
> 
> Strictly speaking, the regression is introduced by 5797b1c18919,
> because it is not a mainline commit yet, but the information regarding
> the interaction of it with commit 3e999770ac1c is valuable.
> 
> Essentially, what commit 3e999770ac1c does is to schedule the
> execution of all of the async resume callbacks in a given phase
> upfront, so they can run without waiting for the sync ones to complete
> (except when there is a parent-child or supplier-consumer dependency -
> the latter represented by a device link).
> 
> What seems to be happening after commit 5797b1c18919 is that one (or
> more) of the async callbacks get blocked in the workqueue for some
> reason.
> 
> I would try to replace system_unbound_wq in
> __async_schedule_node_domain() with a dedicated workqueue that is not
> unbound and see what happens.

Thanks Rafael for helping connect the dots!

I did the laziest things imaginable to switch to a WQ that's
not unbound:

diff --git a/kernel/async.c b/kernel/async.c
index 97f224a5257b..37f1204ab4e9 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -174,7 +174,7 @@ static async_cookie_t __async_schedule_node_domain(async_func_t func,
         spin_unlock_irqrestore(&async_lock, flags);
  
         /* schedule for execution */
-       queue_work_node(node, system_unbound_wq, &entry->work);
+       queue_work_node(node, system_wq, &entry->work);
  
         
.and the system can now suspend/resume all day long again!

Konrad
  
Rafael J. Wysocki Feb. 12, 2024, 2:57 p.m. UTC | #22
Hi,

On Fri, Feb 9, 2024 at 1:20 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Feb 08, 2024 at 08:47:12AM +0100, Marek Szyprowski wrote:
> > I've tried to get drgn running on the test target, but then I've noticed
> > that it is not possible to run it on ARM 32bit target, as it requires
> > PROC_KCORE support, which is 'Visible if: PROC_FS [=y] && MMU [=y] &&
> > !ARM [=y]' for some reasons.
>
> Bummer. I instrumented code on my test setup (x86) and couldn't repro usage
> of the dfl_pwq, unfortuantely.

On x86 there are fewer device links representing dependencies between
devices, because it doesn't use fw_devlink which is used on ARM
platforms (and other DT-based), so on x86 it is less likely to
reproduce this.

> Independent of understanding what's going on with the system_unbound_wq, the
> correct solution seems like using a dedicated workqueue with raised
> min_active. Just posted the patchset to do that:
>
>   http://lkml.kernel.org/r/ZcVtzJvJCRV5OLM-@slm.duckdns.org

LGTM, thanks!
  

Patch

Index: linux-pm/include/linux/pm.h
===================================================================
--- linux-pm.orig/include/linux/pm.h
+++ linux-pm/include/linux/pm.h
@@ -681,6 +681,7 @@  struct dev_pm_info {
 	bool			wakeup_path:1;
 	bool			syscore:1;
 	bool			no_pm_callbacks:1;	/* Owned by the PM core */
+	bool			in_progress:1;	/* Owned by the PM core */
 	unsigned int		must_resume:1;	/* Owned by the PM core */
 	unsigned int		may_skip_resume:1;	/* Set by subsystems */
 #else
Index: linux-pm/drivers/base/power/main.c
===================================================================
--- linux-pm.orig/drivers/base/power/main.c
+++ linux-pm/drivers/base/power/main.c
@@ -579,7 +579,7 @@  bool dev_pm_skip_resume(struct device *d
 }
 
 /**
- * __device_resume_noirq - Execute a "noirq resume" callback for given device.
+ * device_resume_noirq - Execute a "noirq resume" callback for given device.
  * @dev: Device to handle.
  * @state: PM transition of the system being carried out.
  * @async: If true, the device is being resumed asynchronously.
@@ -587,7 +587,7 @@  bool dev_pm_skip_resume(struct device *d
  * The driver of @dev will not receive interrupts while this function is being
  * executed.
  */
-static void __device_resume_noirq(struct device *dev, pm_message_t state, bool async)
+static void device_resume_noirq(struct device *dev, pm_message_t state, bool async)
 {
 	pm_callback_t callback = NULL;
 	const char *info = NULL;
@@ -674,16 +674,22 @@  static bool dpm_async_fn(struct device *
 {
 	reinit_completion(&dev->power.completion);
 
-	if (!is_async(dev))
-		return false;
+	if (is_async(dev)) {
+		dev->power.in_progress = true;
 
-	get_device(dev);
-
-	if (async_schedule_dev_nocall(func, dev))
-		return true;
+		get_device(dev);
 
-	put_device(dev);
+		if (async_schedule_dev_nocall(func, dev))
+			return true;
 
+		put_device(dev);
+	}
+	/*
+	 * Because async_schedule_dev_nocall() above has returned false or it
+	 * has not been called at all, func() is not running and it safe to
+	 * update the in_progress flag without additional synchronization.
+	 */
+	dev->power.in_progress = false;
 	return false;
 }
 
@@ -691,18 +697,10 @@  static void async_resume_noirq(void *dat
 {
 	struct device *dev = data;
 
-	__device_resume_noirq(dev, pm_transition, true);
+	device_resume_noirq(dev, pm_transition, true);
 	put_device(dev);
 }
 
-static void device_resume_noirq(struct device *dev)
-{
-	if (dpm_async_fn(dev, async_resume_noirq))
-		return;
-
-	__device_resume_noirq(dev, pm_transition, false);
-}
-
 static void dpm_noirq_resume_devices(pm_message_t state)
 {
 	struct device *dev;
@@ -712,18 +710,28 @@  static void dpm_noirq_resume_devices(pm_
 	mutex_lock(&dpm_list_mtx);
 	pm_transition = state;
 
+	/*
+	 * Trigger the resume of "async" devices upfront so they don't have to
+	 * wait for the "non-async" ones they don't depend on.
+	 */
+	list_for_each_entry(dev, &dpm_noirq_list, power.entry)
+		dpm_async_fn(dev, async_resume_noirq);
+
 	while (!list_empty(&dpm_noirq_list)) {
 		dev = to_device(dpm_noirq_list.next);
-		get_device(dev);
 		list_move_tail(&dev->power.entry, &dpm_late_early_list);
 
-		mutex_unlock(&dpm_list_mtx);
+		if (!dev->power.in_progress) {
+			get_device(dev);
 
-		device_resume_noirq(dev);
+			mutex_unlock(&dpm_list_mtx);
 
-		put_device(dev);
+			device_resume_noirq(dev, state, false);
+
+			put_device(dev);
 
-		mutex_lock(&dpm_list_mtx);
+			mutex_lock(&dpm_list_mtx);
+		}
 	}
 	mutex_unlock(&dpm_list_mtx);
 	async_synchronize_full();
@@ -747,14 +755,14 @@  void dpm_resume_noirq(pm_message_t state
 }
 
 /**
- * __device_resume_early - Execute an "early resume" callback for given device.
+ * device_resume_early - Execute an "early resume" callback for given device.
  * @dev: Device to handle.
  * @state: PM transition of the system being carried out.
  * @async: If true, the device is being resumed asynchronously.
  *
  * Runtime PM is disabled for @dev while this function is being executed.
  */
-static void __device_resume_early(struct device *dev, pm_message_t state, bool async)
+static void device_resume_early(struct device *dev, pm_message_t state, bool async)
 {
 	pm_callback_t callback = NULL;
 	const char *info = NULL;
@@ -820,18 +828,10 @@  static void async_resume_early(void *dat
 {
 	struct device *dev = data;
 
-	__device_resume_early(dev, pm_transition, true);
+	device_resume_early(dev, pm_transition, true);
 	put_device(dev);
 }
 
-static void device_resume_early(struct device *dev)
-{
-	if (dpm_async_fn(dev, async_resume_early))
-		return;
-
-	__device_resume_early(dev, pm_transition, false);
-}
-
 /**
  * dpm_resume_early - Execute "early resume" callbacks for all devices.
  * @state: PM transition of the system being carried out.
@@ -845,18 +845,28 @@  void dpm_resume_early(pm_message_t state
 	mutex_lock(&dpm_list_mtx);
 	pm_transition = state;
 
+	/*
+	 * Trigger the resume of "async" devices upfront so they don't have to
+	 * wait for the "non-async" ones they don't depend on.
+	 */
+	list_for_each_entry(dev, &dpm_late_early_list, power.entry)
+		dpm_async_fn(dev, async_resume_early);
+
 	while (!list_empty(&dpm_late_early_list)) {
 		dev = to_device(dpm_late_early_list.next);
-		get_device(dev);
 		list_move_tail(&dev->power.entry, &dpm_suspended_list);
 
-		mutex_unlock(&dpm_list_mtx);
+		if (!dev->power.in_progress) {
+			get_device(dev);
 
-		device_resume_early(dev);
+			mutex_unlock(&dpm_list_mtx);
 
-		put_device(dev);
+			device_resume_early(dev, state, false);
+
+			put_device(dev);
 
-		mutex_lock(&dpm_list_mtx);
+			mutex_lock(&dpm_list_mtx);
+		}
 	}
 	mutex_unlock(&dpm_list_mtx);
 	async_synchronize_full();
@@ -876,12 +886,12 @@  void dpm_resume_start(pm_message_t state
 EXPORT_SYMBOL_GPL(dpm_resume_start);
 
 /**
- * __device_resume - Execute "resume" callbacks for given device.
+ * device_resume - Execute "resume" callbacks for given device.
  * @dev: Device to handle.
  * @state: PM transition of the system being carried out.
  * @async: If true, the device is being resumed asynchronously.
  */
-static void __device_resume(struct device *dev, pm_message_t state, bool async)
+static void device_resume(struct device *dev, pm_message_t state, bool async)
 {
 	pm_callback_t callback = NULL;
 	const char *info = NULL;
@@ -975,18 +985,10 @@  static void async_resume(void *data, asy
 {
 	struct device *dev = data;
 
-	__device_resume(dev, pm_transition, true);
+	device_resume(dev, pm_transition, true);
 	put_device(dev);
 }
 
-static void device_resume(struct device *dev)
-{
-	if (dpm_async_fn(dev, async_resume))
-		return;
-
-	__device_resume(dev, pm_transition, false);
-}
-
 /**
  * dpm_resume - Execute "resume" callbacks for non-sysdev devices.
  * @state: PM transition of the system being carried out.
@@ -1006,16 +1008,25 @@  void dpm_resume(pm_message_t state)
 	pm_transition = state;
 	async_error = 0;
 
+	/*
+	 * Trigger the resume of "async" devices upfront so they don't have to
+	 * wait for the "non-async" ones they don't depend on.
+	 */
+	list_for_each_entry(dev, &dpm_suspended_list, power.entry)
+		dpm_async_fn(dev, async_resume);
+
 	while (!list_empty(&dpm_suspended_list)) {
 		dev = to_device(dpm_suspended_list.next);
 
 		get_device(dev);
 
-		mutex_unlock(&dpm_list_mtx);
+		if (!dev->power.in_progress) {
+			mutex_unlock(&dpm_list_mtx);
 
-		device_resume(dev);
+			device_resume(dev, state, false);
 
-		mutex_lock(&dpm_list_mtx);
+			mutex_lock(&dpm_list_mtx);
+		}
 
 		if (!list_empty(&dev->power.entry))
 			list_move_tail(&dev->power.entry, &dpm_prepared_list);