Message ID | 10423008.nUPlyArG6x@kreacher
State | New
From | "Rafael J. Wysocki" <rjw@rjwysocki.net>
To | Linux PM <linux-pm@vger.kernel.org>, Ulf Hansson <ulf.hansson@linaro.org>
Cc | LKML <linux-kernel@vger.kernel.org>, Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Subject | [PATCH v1] PM: sleep: Restore asynchronous device resume optimization
Date | Tue, 09 Jan 2024 17:59:22 +0100
Series | [v1] PM: sleep: Restore asynchronous device resume optimization
Commit Message
Rafael J. Wysocki
Jan. 9, 2024, 4:59 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core system-wide PM code"), the resume of devices that were allowed to resume asynchronously was scheduled before starting the resume of the other devices, so the former did not have to wait for the latter unless functional dependencies were present.

Commit 7839d0078e0d removed that optimization in order to address a correctness issue, but it can be restored with the help of a new device power management flag, so do that now.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

---

I said I'd probably do this in 6.9, but then I thought more about it and now I think it would be nice to have 6.8-rc1 without a suspend performance regression and the change is relatively straightforward, so here it goes.

---
 drivers/base/power/main.c | 117 +++++++++++++++++++++++++---------------------
 include/linux/pm.h        |   1 
 2 files changed, 65 insertions(+), 53 deletions(-)
Comments
On Tue, Jan 09, 2024 at 05:59:22PM +0100, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core > system-wide PM code"), the resume of devices that were allowed to resume > asynchronously was scheduled before starting the resume of the other > devices, so the former did not have to wait for the latter unless > functional dependencies were present. > > Commit 7839d0078e0d removed that optimization in order to address a > correctness issue, but it can be restored with the help of a new device > power management flag, so do that now. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > > I said I'd probably do this in 6.9, but then I thought more about it > and now I think it would be nice to have 6.8-rc1 without a suspend > performance regression and the change is relatively straightforward, > so here it goes. > > --- > drivers/base/power/main.c | 117 +++++++++++++++++++++++++--------------------- > include/linux/pm.h | 1 > 2 files changed, 65 insertions(+), 53 deletions(-) > > Index: linux-pm/include/linux/pm.h > =================================================================== > --- linux-pm.orig/include/linux/pm.h > +++ linux-pm/include/linux/pm.h > @@ -681,6 +681,7 @@ struct dev_pm_info { > bool wakeup_path:1; > bool syscore:1; > bool no_pm_callbacks:1; /* Owned by the PM core */ > + bool in_progress:1; /* Owned by the PM core */ > unsigned int must_resume:1; /* Owned by the PM core */ > unsigned int may_skip_resume:1; /* Set by subsystems */ Not related to the patch, just question: why some types here are unsigned int :1 others bool :1 ? > * dpm_resume_early - Execute "early resume" callbacks for all devices. > * @state: PM transition of the system being carried out. 
> @@ -845,18 +845,28 @@ void dpm_resume_early(pm_message_t state > mutex_lock(&dpm_list_mtx); > pm_transition = state; > > + /* > + * Trigger the resume of "async" devices upfront so they don't have to > + * wait for the "non-async" ones they don't depend on. > + */ > + list_for_each_entry(dev, &dpm_late_early_list, power.entry) > + dpm_async_fn(dev, async_resume_early); > + > while (!list_empty(&dpm_late_early_list)) { > dev = to_device(dpm_late_early_list.next); > - get_device(dev); > list_move_tail(&dev->power.entry, &dpm_suspended_list); > > - mutex_unlock(&dpm_list_mtx); > + if (!dev->power.in_progress) { I would consider different naming just to make clear this is regarding async call, in_progress looks too generic for me. Fine if you think otherwise, in general patch LGTM: Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
On Wed, Jan 10, 2024 at 11:37 AM Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> wrote: > > On Tue, Jan 09, 2024 at 05:59:22PM +0100, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core > > system-wide PM code"), the resume of devices that were allowed to resume > > asynchronously was scheduled before starting the resume of the other > > devices, so the former did not have to wait for the latter unless > > functional dependencies were present. > > > > Commit 7839d0078e0d removed that optimization in order to address a > > correctness issue, but it can be restored with the help of a new device > > power management flag, so do that now. > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > --- > > > > I said I'd probably do this in 6.9, but then I thought more about it > > and now I think it would be nice to have 6.8-rc1 without a suspend > > performance regression and the change is relatively straightforward, > > so here it goes. > > > > --- > > drivers/base/power/main.c | 117 +++++++++++++++++++++++++--------------------- > > include/linux/pm.h | 1 > > 2 files changed, 65 insertions(+), 53 deletions(-) > > > > Index: linux-pm/include/linux/pm.h > > =================================================================== > > --- linux-pm.orig/include/linux/pm.h > > +++ linux-pm/include/linux/pm.h > > @@ -681,6 +681,7 @@ struct dev_pm_info { > > bool wakeup_path:1; > > bool syscore:1; > > bool no_pm_callbacks:1; /* Owned by the PM core */ > > + bool in_progress:1; /* Owned by the PM core */ > > unsigned int must_resume:1; /* Owned by the PM core */ > > unsigned int may_skip_resume:1; /* Set by subsystems */ > > Not related to the patch, just question: why some types here are > unsigned int :1 others bool :1 ? No particular reason. I think I will change them all to bool in the future. 
> > * dpm_resume_early - Execute "early resume" callbacks for all devices. > > * @state: PM transition of the system being carried out. > > @@ -845,18 +845,28 @@ void dpm_resume_early(pm_message_t state > > mutex_lock(&dpm_list_mtx); > > pm_transition = state; > > > > + /* > > + * Trigger the resume of "async" devices upfront so they don't have to > > + * wait for the "non-async" ones they don't depend on. > > + */ > > + list_for_each_entry(dev, &dpm_late_early_list, power.entry) > > + dpm_async_fn(dev, async_resume_early); > > + > > while (!list_empty(&dpm_late_early_list)) { > > dev = to_device(dpm_late_early_list.next); > > - get_device(dev); > > list_move_tail(&dev->power.entry, &dpm_suspended_list); > > > > - mutex_unlock(&dpm_list_mtx); > > + if (!dev->power.in_progress) { > > I would consider different naming just to make clear this > is regarding async call, in_progress looks too generic for me. OK, what about async_in_progress? > Fine if you think otherwise, in general patch LGTM: > > Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Thanks!
On Wed, Jan 10, 2024 at 01:33:07PM +0100, Rafael J. Wysocki wrote: > > I would consider different naming just to make clear this > > is regarding async call, in_progress looks too generic for me. > > OK, what about async_in_progress? Sure, that's better. Regards Stanislaw
On Wed, Jan 10, 2024 at 03:05:07PM +0100, Stanislaw Gruszka wrote: > On Wed, Jan 10, 2024 at 01:33:07PM +0100, Rafael J. Wysocki wrote: > > > I would consider different naming just to make clear this > > > is regarding async call, in_progress looks too generic for me. > > > > OK, what about async_in_progress? > Sure, that better. Even better would be using_async IMO, because we don't know if the async call is in progress, finished, or not yet started. Regards Stanislaw
On Thu, Jan 11, 2024 at 8:58 AM Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> wrote: > > On Wed, Jan 10, 2024 at 03:05:07PM +0100, Stanislaw Gruszka wrote: > > On Wed, Jan 10, 2024 at 01:33:07PM +0100, Rafael J. Wysocki wrote: > > > > I would consider different naming just to make clear this > > > > is regarding async call, in_progress looks too generic for me. > > > > > > OK, what about async_in_progress? > > Sure, that better. > > Even better would be using_async IMO, because we don't know if > async call is in progress or finish or before start. Well, "in progress" applies to all of the processing of the async call and I regard it as "in progress" once it is known that it will run asynchronously eventually. In any case, I've already applied the async_in_progress version, thanks!
Dear All, On 09.01.2024 17:59, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core > system-wide PM code"), the resume of devices that were allowed to resume > asynchronously was scheduled before starting the resume of the other > devices, so the former did not have to wait for the latter unless > functional dependencies were present. > > Commit 7839d0078e0d removed that optimization in order to address a > correctness issue, but it can be restored with the help of a new device > power management flag, so do that now. > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- This patch finally landed in linux-next some time ago as 3e999770ac1c ("PM: sleep: Restore asynchronous device resume optimization"). Recently I found that it causes a non-trivial interaction with commit 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues"). Since merge commit 954350a5f8db in linux-next, system suspend/resume fails (the board doesn't wake up) on my old Samsung Exynos4412-based Odroid-U3 board (ARM 32-bit based), which had been rock stable for years. My further investigations confirmed that the mentioned commits are responsible for this issue. Each of them separately (3e999770ac1c and 5797b1c18919) doesn't trigger any problems. Reverting any of them on top of linux-next (with some additional commits due to code dependencies) also fixes/hides the problem. Let me know if you need more information or tests on the hardware. I'm open to helping debug this issue. > I said I'd probably do this in 6.9, but then I thought more about it > and now I think it would be nice to have 6.8-rc1 without a suspend > performance regression and the change is relatively straightforward, > so here it goes. 
> > --- > drivers/base/power/main.c | 117 +++++++++++++++++++++++++--------------------- > include/linux/pm.h | 1 > 2 files changed, 65 insertions(+), 53 deletions(-) > > Index: linux-pm/include/linux/pm.h > =================================================================== > --- linux-pm.orig/include/linux/pm.h > +++ linux-pm/include/linux/pm.h > @@ -681,6 +681,7 @@ struct dev_pm_info { > bool wakeup_path:1; > bool syscore:1; > bool no_pm_callbacks:1; /* Owned by the PM core */ > + bool in_progress:1; /* Owned by the PM core */ > unsigned int must_resume:1; /* Owned by the PM core */ > unsigned int may_skip_resume:1; /* Set by subsystems */ > #else > Index: linux-pm/drivers/base/power/main.c > =================================================================== > --- linux-pm.orig/drivers/base/power/main.c > +++ linux-pm/drivers/base/power/main.c > @@ -579,7 +579,7 @@ bool dev_pm_skip_resume(struct device *d > } > > /** > - * __device_resume_noirq - Execute a "noirq resume" callback for given device. > + * device_resume_noirq - Execute a "noirq resume" callback for given device. > * @dev: Device to handle. > * @state: PM transition of the system being carried out. > * @async: If true, the device is being resumed asynchronously. > @@ -587,7 +587,7 @@ bool dev_pm_skip_resume(struct device *d > * The driver of @dev will not receive interrupts while this function is being > * executed. 
> */ > -static void __device_resume_noirq(struct device *dev, pm_message_t state, bool async) > +static void device_resume_noirq(struct device *dev, pm_message_t state, bool async) > { > pm_callback_t callback = NULL; > const char *info = NULL; > @@ -674,16 +674,22 @@ static bool dpm_async_fn(struct device * > { > reinit_completion(&dev->power.completion); > > - if (!is_async(dev)) > - return false; > + if (is_async(dev)) { > + dev->power.in_progress = true; > > - get_device(dev); > - > - if (async_schedule_dev_nocall(func, dev)) > - return true; > + get_device(dev); > > - put_device(dev); > + if (async_schedule_dev_nocall(func, dev)) > + return true; > > + put_device(dev); > + } > + /* > + * Because async_schedule_dev_nocall() above has returned false or it > + * has not been called at all, func() is not running and it safe to > + * update the in_progress flag without additional synchronization. > + */ > + dev->power.in_progress = false; > return false; > } > > @@ -691,18 +697,10 @@ static void async_resume_noirq(void *dat > { > struct device *dev = data; > > - __device_resume_noirq(dev, pm_transition, true); > + device_resume_noirq(dev, pm_transition, true); > put_device(dev); > } > > -static void device_resume_noirq(struct device *dev) > -{ > - if (dpm_async_fn(dev, async_resume_noirq)) > - return; > - > - __device_resume_noirq(dev, pm_transition, false); > -} > - > static void dpm_noirq_resume_devices(pm_message_t state) > { > struct device *dev; > @@ -712,18 +710,28 @@ static void dpm_noirq_resume_devices(pm_ > mutex_lock(&dpm_list_mtx); > pm_transition = state; > > + /* > + * Trigger the resume of "async" devices upfront so they don't have to > + * wait for the "non-async" ones they don't depend on. 
> + */ > + list_for_each_entry(dev, &dpm_noirq_list, power.entry) > + dpm_async_fn(dev, async_resume_noirq); > + > while (!list_empty(&dpm_noirq_list)) { > dev = to_device(dpm_noirq_list.next); > - get_device(dev); > list_move_tail(&dev->power.entry, &dpm_late_early_list); > > - mutex_unlock(&dpm_list_mtx); > + if (!dev->power.in_progress) { > + get_device(dev); > > - device_resume_noirq(dev); > + mutex_unlock(&dpm_list_mtx); > > - put_device(dev); > + device_resume_noirq(dev, state, false); > + > + put_device(dev); > > - mutex_lock(&dpm_list_mtx); > + mutex_lock(&dpm_list_mtx); > + } > } > mutex_unlock(&dpm_list_mtx); > async_synchronize_full(); > @@ -747,14 +755,14 @@ void dpm_resume_noirq(pm_message_t state > } > > /** > - * __device_resume_early - Execute an "early resume" callback for given device. > + * device_resume_early - Execute an "early resume" callback for given device. > * @dev: Device to handle. > * @state: PM transition of the system being carried out. > * @async: If true, the device is being resumed asynchronously. > * > * Runtime PM is disabled for @dev while this function is being executed. > */ > -static void __device_resume_early(struct device *dev, pm_message_t state, bool async) > +static void device_resume_early(struct device *dev, pm_message_t state, bool async) > { > pm_callback_t callback = NULL; > const char *info = NULL; > @@ -820,18 +828,10 @@ static void async_resume_early(void *dat > { > struct device *dev = data; > > - __device_resume_early(dev, pm_transition, true); > + device_resume_early(dev, pm_transition, true); > put_device(dev); > } > > -static void device_resume_early(struct device *dev) > -{ > - if (dpm_async_fn(dev, async_resume_early)) > - return; > - > - __device_resume_early(dev, pm_transition, false); > -} > - > /** > * dpm_resume_early - Execute "early resume" callbacks for all devices. > * @state: PM transition of the system being carried out. 
> @@ -845,18 +845,28 @@ void dpm_resume_early(pm_message_t state > mutex_lock(&dpm_list_mtx); > pm_transition = state; > > + /* > + * Trigger the resume of "async" devices upfront so they don't have to > + * wait for the "non-async" ones they don't depend on. > + */ > + list_for_each_entry(dev, &dpm_late_early_list, power.entry) > + dpm_async_fn(dev, async_resume_early); > + > while (!list_empty(&dpm_late_early_list)) { > dev = to_device(dpm_late_early_list.next); > - get_device(dev); > list_move_tail(&dev->power.entry, &dpm_suspended_list); > > - mutex_unlock(&dpm_list_mtx); > + if (!dev->power.in_progress) { > + get_device(dev); > > - device_resume_early(dev); > + mutex_unlock(&dpm_list_mtx); > > - put_device(dev); > + device_resume_early(dev, state, false); > + > + put_device(dev); > > - mutex_lock(&dpm_list_mtx); > + mutex_lock(&dpm_list_mtx); > + } > } > mutex_unlock(&dpm_list_mtx); > async_synchronize_full(); > @@ -876,12 +886,12 @@ void dpm_resume_start(pm_message_t state > EXPORT_SYMBOL_GPL(dpm_resume_start); > > /** > - * __device_resume - Execute "resume" callbacks for given device. > + * device_resume - Execute "resume" callbacks for given device. > * @dev: Device to handle. > * @state: PM transition of the system being carried out. > * @async: If true, the device is being resumed asynchronously. 
> */ > -static void __device_resume(struct device *dev, pm_message_t state, bool async) > +static void device_resume(struct device *dev, pm_message_t state, bool async) > { > pm_callback_t callback = NULL; > const char *info = NULL; > @@ -975,18 +985,10 @@ static void async_resume(void *data, asy > { > struct device *dev = data; > > - __device_resume(dev, pm_transition, true); > + device_resume(dev, pm_transition, true); > put_device(dev); > } > > -static void device_resume(struct device *dev) > -{ > - if (dpm_async_fn(dev, async_resume)) > - return; > - > - __device_resume(dev, pm_transition, false); > -} > - > /** > * dpm_resume - Execute "resume" callbacks for non-sysdev devices. > * @state: PM transition of the system being carried out. > @@ -1006,16 +1008,25 @@ void dpm_resume(pm_message_t state) > pm_transition = state; > async_error = 0; > > + /* > + * Trigger the resume of "async" devices upfront so they don't have to > + * wait for the "non-async" ones they don't depend on. > + */ > + list_for_each_entry(dev, &dpm_suspended_list, power.entry) > + dpm_async_fn(dev, async_resume); > + > while (!list_empty(&dpm_suspended_list)) { > dev = to_device(dpm_suspended_list.next); > > get_device(dev); > > - mutex_unlock(&dpm_list_mtx); > + if (!dev->power.in_progress) { > + mutex_unlock(&dpm_list_mtx); > > - device_resume(dev); > + device_resume(dev, state, false); > > - mutex_lock(&dpm_list_mtx); > + mutex_lock(&dpm_list_mtx); > + } > > if (!list_empty(&dev->power.entry)) > list_move_tail(&dev->power.entry, &dpm_prepared_list); > > > Best regards
On Wed, Feb 7, 2024 at 11:31 AM Marek Szyprowski <m.szyprowski@samsung.com> wrote: > > Dear All, > > On 09.01.2024 17:59, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core > > system-wide PM code"), the resume of devices that were allowed to resume > > asynchronously was scheduled before starting the resume of the other > > devices, so the former did not have to wait for the latter unless > > functional dependencies were present. > > > > Commit 7839d0078e0d removed that optimization in order to address a > > correctness issue, but it can be restored with the help of a new device > > power management flag, so do that now. > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > --- > > This patch finally landed in linux-next some time ago as 3e999770ac1c > ("PM: sleep: Restore asynchronous device resume optimization"). Recently > I found that it causes a non-trivial interaction with commit > 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement > for unbound workqueues"). Since merge commit 954350a5f8db in linux-next > system suspend/resume fails (board doesn't wake up) on my old Samsung > Exynos4412-based Odroid-U3 board (ARM 32bit based), which was rock > stable for last years. > > My further investigations confirmed that the mentioned commits are > responsible for this issue. Each of them separately (3e999770ac1c and > 5797b1c18919) doesn't trigger any problems. Reverting any of them on top > of linux-next (with some additional commit due to code dependencies) > also fixes/hides the problem. > > Let me know if You need more information or tests on the hardware. I'm > open to help debugging this issue. If you echo 0 to /sys/power/pm_async before suspending the system, does it still fail?
On Wed, Feb 7, 2024 at 11:31 AM Marek Szyprowski <m.szyprowski@samsung.com> wrote: > > Dear All, > > On 09.01.2024 17:59, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core > > system-wide PM code"), the resume of devices that were allowed to resume > > asynchronously was scheduled before starting the resume of the other > > devices, so the former did not have to wait for the latter unless > > functional dependencies were present. > > > > Commit 7839d0078e0d removed that optimization in order to address a > > correctness issue, but it can be restored with the help of a new device > > power management flag, so do that now. > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > --- > > This patch finally landed in linux-next some time ago as 3e999770ac1c > ("PM: sleep: Restore asynchronous device resume optimization"). Recently > I found that it causes a non-trivial interaction with commit > 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement > for unbound workqueues"). Since merge commit 954350a5f8db in linux-next > system suspend/resume fails (board doesn't wake up) on my old Samsung > Exynos4412-based Odroid-U3 board (ARM 32bit based), which was rock > stable for last years. > > My further investigations confirmed that the mentioned commits are > responsible for this issue. Each of them separately (3e999770ac1c and > 5797b1c18919) doesn't trigger any problems. Reverting any of them on top > of linux-next (with some additional commit due to code dependencies) > also fixes/hides the problem. > > Let me know if You need more information or tests on the hardware. I'm > open to help debugging this issue. So I found this report: https://lore.kernel.org/lkml/b3d08cd8-d77f-45dd-a2c3-4a4db5a98dfa@kernel.org/ which appears to be about the same issue. 
Strictly speaking, the regression is introduced by 5797b1c18919, because it is not a mainline commit yet, but the information regarding the interaction of it with commit 3e999770ac1c is valuable. Essentially, what commit 3e999770ac1c does is to schedule the execution of all of the async resume callbacks in a given phase upfront, so they can run without waiting for the sync ones to complete (except when there is a parent-child or supplier-consumer dependency - the latter represented by a device link). What seems to be happening after commit 5797b1c18919 is that one (or more) of the async callbacks get blocked in the workqueue for some reason. I would try to replace system_unbound_wq in __async_schedule_node_domain() with a dedicated workqueue that is not unbound and see what happens.
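For reference, the experiment suggested here would amount to something like the sketch below against kernel/async.c, which queues async work to system_unbound_wq via queue_work_node(). This is a debugging sketch only, with a made-up workqueue name and approximate hunk context, not a proposed fix:

```diff
--- a/kernel/async.c
+++ b/kernel/async.c
@@ (near the top of the file; context approximate) @@
+/* debug only: a bound (per-CPU) workqueue, created at init with
+ * alloc_workqueue("async_debug", 0, 0), so async work items are not
+ * subject to the system-wide nr_active enforcement that commit
+ * 5797b1c18919 applies to unbound workqueues */
+static struct workqueue_struct *async_debug_wq;
@@ __async_schedule_node_domain() (context approximate) @@
-	queue_work_node(node, system_unbound_wq, &entry->work);
+	queue_work_node(node, async_debug_wq, &entry->work);
```

If resume works reliably with this change, the unbound-workqueue enforcement becomes the prime suspect; if it still hangs, a missing device dependency is more likely.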
On 07.02.2024 11:38, Rafael J. Wysocki wrote: > On Wed, Feb 7, 2024 at 11:31 AM Marek Szyprowski > <m.szyprowski@samsung.com> wrote: >> On 09.01.2024 17:59, Rafael J. Wysocki wrote: >>> From: Rafael J. Wysocki<rafael.j.wysocki@intel.com> >>> >>> Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core >>> system-wide PM code"), the resume of devices that were allowed to resume >>> asynchronously was scheduled before starting the resume of the other >>> devices, so the former did not have to wait for the latter unless >>> functional dependencies were present. >>> >>> Commit 7839d0078e0d removed that optimization in order to address a >>> correctness issue, but it can be restored with the help of a new device >>> power management flag, so do that now. >>> >>> Signed-off-by: Rafael J. Wysocki<rafael.j.wysocki@intel.com> >>> --- >> This patch finally landed in linux-next some time ago as 3e999770ac1c >> ("PM: sleep: Restore asynchronous device resume optimization"). Recently >> I found that it causes a non-trivial interaction with commit >> 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement >> for unbound workqueues"). Since merge commit 954350a5f8db in linux-next >> system suspend/resume fails (board doesn't wake up) on my old Samsung >> Exynos4412-based Odroid-U3 board (ARM 32bit based), which was rock >> stable for last years. >> >> My further investigations confirmed that the mentioned commits are >> responsible for this issue. Each of them separately (3e999770ac1c and >> 5797b1c18919) doesn't trigger any problems. Reverting any of them on top >> of linux-next (with some additional commit due to code dependencies) >> also fixes/hides the problem. >> >> Let me know if You need more information or tests on the hardware. I'm >> open to help debugging this issue. > If you echo 0 to /sys/power/pm_async before suspending the system, > does it still fail? In such case it works fine. Best regards
On Wed, Feb 7, 2024 at 12:16 PM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> On 07.02.2024 11:38, Rafael J. Wysocki wrote:
> > [...]
> > If you echo 0 to /sys/power/pm_async before suspending the system,
> > does it still fail?
>
> In that case it works fine.

Thanks for the confirmation. It doesn't rely on unbound workqueues then,
so that's expected.

Now, I think that there are two possibilities.

One is that commit 3e999770ac1c is generally overoptimistic for your
board and there is a dependency between devices which is not represented
by a device link, which causes things to go south when they are not done
in a specific order. If that is the case and commit 5797b1c18919 changes
that order, breakage ensues.

The other one is that what happens during async resume does not meet the
assumptions of commit 5797b1c18919 (for example, it can easily produce a
chain of interdependent work items longer than 8) and so it breaks
things.

I would still try to use a non-unbound workqueue for the async work,
because if that works reliably, the second possibility becomes more
likely.
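Rafael's second hypothesis (a chain of interdependent work items longer than the concurrency floor) can be illustrated outside the kernel. The sketch below is a rough, purely illustrative model, not the actual workqueue code: items are admitted to at most `max_active` slots in strict FIFO order, and each item can only finish after the next item in the chain finishes.

```c
#include <stdbool.h>

/*
 * Toy model (supports len <= 64): `len` work items are queued in order;
 * item i cannot finish until item i+1 has finished (the last item has no
 * dependency).  At most `max_active` items may run concurrently, and
 * queued items become active strictly in FIFO order.
 * Returns true if every item eventually completes.
 */
static bool chain_completes(int len, int max_active)
{
	bool done[64] = { false };
	int next = 0, active = 0, completed = 0;
	bool progress = true;

	while (progress) {
		progress = false;

		/* Admit queued items while concurrency slots are free. */
		while (next < len && active < max_active) {
			next++;
			active++;
			progress = true;
		}

		/* An active item finishes once the item it waits on is done. */
		for (int i = 0; i < next; i++) {
			if (!done[i] && (i == len - 1 || done[i + 1])) {
				done[i] = true;
				active--;
				completed++;
				progress = true;
			}
		}
	}
	return completed == len;
}
```

In this model a 9-item chain never completes under a limit of 8 (the WQ_DFL_MIN_ACTIVE value introduced by 5797b1c18919), while a limit of 32 handles chains up to 32 items, consistent with the effect of bumping the constant reported later in the thread.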
Hello,

On Wed, Feb 07, 2024 at 12:25:46PM +0100, Rafael J. Wysocki wrote:
> The other one is that what happens during async resume does not meet
> the assumptions of commit 5797b1c18919 (for example, it can easily
> produce a chain of interdependent work items longer than 8) and so it
> breaks things.

Ah, that's fascinating. But aren't all CPUs brought back online before
devices are resumed? If so, max_active should already be way higher than
WQ_DFL_MIN_ACTIVE. Also, are these multi-node NUMA machines? Otherwise,
it really shouldn't affect anything. One easy way to verify would be
just bumping up WQ_DFL_MIN_ACTIVE and seeing what happens.

Thanks.
On Wed, Feb 7, 2024 at 5:40 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Feb 07, 2024 at 12:25:46PM +0100, Rafael J. Wysocki wrote:
> > The other one is that what happens during async resume does not meet
> > the assumptions of commit 5797b1c18919 (for example, it can easily
> > produce a chain of interdependent work items longer than 8) and so it
> > breaks things.
>
> Ah, that's fascinating. But aren't CPUs all brought up online before
> devices are resumed?

They are.

> If so, the max_active should already be way higher than the
> WQ_DFL_MIN_ACTIVE. Also, are these multi node NUMA machines?

They aren't.

> Otherwise, it really shouldn't affect anything.

Well, it evidently makes a difference. I'm wondering what difference it
can make in that case.

> One easy way to verify would be just bumping up WQ_DFL_MIN_ACTIVE and
> see what happens.

Sure.
On 07.02.2024 17:39, Tejun Heo wrote:
> On Wed, Feb 07, 2024 at 12:25:46PM +0100, Rafael J. Wysocki wrote:
> > [...]
> Ah, that's fascinating. But aren't CPUs all brought up online before
> devices are resumed? If so, the max_active should already be way higher
> than the WQ_DFL_MIN_ACTIVE. Also, are these multi node NUMA machines?
> Otherwise, it really shouldn't affect anything. One easy way to verify
> would be just bumping up WQ_DFL_MIN_ACTIVE and see what happens.

I've increased WQ_DFL_MIN_ACTIVE from 8 to 32 and all the system
suspend/resume issues went away. :)

Best regards
On Wed, Feb 07, 2024 at 07:55:51PM +0100, Marek Szyprowski wrote:
> On 07.02.2024 17:39, Tejun Heo wrote:
> > [...]
>
> I've increased WQ_DFL_MIN_ACTIVE from 8 to 32 and all the system
> suspend/resume issues went away. :)

Ah, okay, that's surprising. Lemme look at the code again. I gotta be
missing something.

Thanks.
Hello,

I couldn't reproduce effective max_active being pushed down to min_active
across suspend/resume cycles on x86. There's gotta be something different.

- Can you please apply the following patch along with the
  WQ_DFL_MIN_ACTIVE bump, go through suspend/resume once and report the
  dmesg?

- Regardless of the root cause, I think async should switch to a
  dedicated workqueue with explicitly raised min_active (will add an
  interface for it).

Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index cf514ba0dfc3..93e5b837f2b4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -368,6 +368,8 @@ struct workqueue_struct {
 	 */
 	struct rcu_head		rcu;
 
+	bool			dbg;
+
 	/* hot fields used during command issue, aligned to cacheline */
 	unsigned int		flags ____cacheline_aligned; /* WQ: WQ_* flags */
 	struct pool_workqueue __percpu __rcu **cpu_pwq; /* I: per-cpu pwqs */
@@ -1557,6 +1559,10 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)
 	if (off_cpu >= 0)
 		total_cpus--;
 
+	if (wq->dbg)
+		printk("XXX wq_update_node_max_active: wq=%s off_cpu=%d total=%d range=[%d, %d]",
+		       wq->name, off_cpu, total_cpus, min_active, max_active);
+
 	for_each_node(node) {
 		int node_cpus;
 
@@ -1567,7 +1573,12 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)
 		wq_node_nr_active(wq, node)->max =
 			clamp(DIV_ROUND_UP(max_active * node_cpus, total_cpus),
 			      min_active, max_active);
+		if (wq->dbg)
+			pr_cont(" node[%d] node_cpus=%d max=%d",
+				node, node_cpus, wq_node_nr_active(wq, node)->max);
 	}
+	if (wq->dbg)
+		pr_cont("\n");
 
 	wq_node_nr_active(wq, NUMA_NO_NODE)->max = min_active;
 }
@@ -4886,6 +4897,9 @@ static struct pool_workqueue *install_unbound_pwq(struct workqueue_struct *wq,
 	old_pwq = rcu_access_pointer(*slot);
 	rcu_assign_pointer(*slot, pwq);
+	if (wq->dbg)
+		printk("XXX install_unbound_pwq: wq=%s cpu=%d pool=%d\n",
+		       wq->name, cpu, pwq->pool->id);
 	return old_pwq;
 }
@@ -7418,6 +7432,8 @@ void __init workqueue_init_early(void)
 	       !system_power_efficient_wq ||
 	       !system_freezable_power_efficient_wq ||
 	       !system_bh_wq || !system_bh_highpri_wq);
+
+	system_unbound_wq->dbg = true;
 }
 
 static void __init wq_cpu_intensive_thresh_init(void)
On Wed, Feb 7, 2024 at 8:48 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> I couldn't reproduce effective max_active being pushed down to min_active
> across suspend/resume cycles on x86. There's gotta be something different.
>
> - Can you please apply the following patch along with the
>   WQ_DFL_MIN_ACTIVE bump, go through suspend/resume once and report the
>   dmesg?
>
> - Regardless of the root cause, I think async should switch to a
>   dedicated workqueue with explicitly raised min_active (will add an
>   interface for it).

Agreed.
On 07.02.2024 20:48, Tejun Heo wrote:
> Hello,
>
> I couldn't reproduce effective max_active being pushed down to min_active
> across suspend/resume cycles on x86. There's gotta be something different.
>
> - Can you please apply the following patch along with the
>   WQ_DFL_MIN_ACTIVE bump, go through suspend/resume once and report the
>   dmesg?

Here is the relevant part of the log:

PM: suspend entry (deep)
Filesystems sync: 0.004 seconds
Freezing user space processes
Freezing user space processes completed (elapsed 0.002 seconds)
OOM killer disabled.
Freezing remaining freezable tasks
Freezing remaining freezable tasks completed (elapsed 0.002 seconds)
dwc2 12480000.usb: suspending usb gadget g_ether
dwc2 12480000.usb: new device is full-speed
smsc95xx 1-2:1.0 eth0: entering SUSPEND2 mode
wake enabled for irq 97 (gpx3-2)
usb3503 0-0008: switched to STANDBY mode
wake enabled for irq 124 (gpx1-3)
samsung-pinctrl 11000000.pinctrl: Setting external wakeup interrupt mask: 0xfbfff7ff
Disabling non-boot CPUs ...
XXX wq_update_node_max_active: wq=events_unbound off_cpu=1 total=3 range=[32, 512] node[0] node_cpus=3 max=512
XXX wq_update_node_max_active: wq=events_unbound off_cpu=2 total=2 range=[32, 512] node[0] node_cpus=2 max=512
XXX wq_update_node_max_active: wq=events_unbound off_cpu=3 total=1 range=[32, 512] node[0] node_cpus=1 max=512
Enabling non-boot CPUs ...
XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=2 range=[32, 512] node[0] node_cpus=2 max=512
CPU1 is up
XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=3 range=[32, 512] node[0] node_cpus=3 max=512
CPU2 is up
XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=4 range=[32, 512] node[0] node_cpus=4 max=512
CPU3 is up
s3c-i2c 138e0000.i2c: slave address 0x00
s3c-i2c 138e0000.i2c: bus frequency set to 97 KHz
s3c-i2c 13870000.i2c: slave address 0x00
s3c-i2c 13870000.i2c: bus frequency set to 97 KHz
s3c-i2c 13860000.i2c: slave address 0x00
s3c-i2c 13860000.i2c: bus frequency set to 390 KHz
s3c-i2c 13880000.i2c: slave address 0x00
s3c-i2c 13880000.i2c: bus frequency set to 97 KHz
s3c2410-wdt 10060000.watchdog: watchdog disabled
wake disabled for irq 124 (gpx1-3)
s3c-rtc 10070000.rtc: rtc disabled, re-enabling
usb3503 0-0008: switched to HUB mode
wake disabled for irq 97 (gpx3-2)
usb usb1: root hub lost power or was reset
usb 1-2: reset high-speed USB device number 2 using exynos-ehci
smsc95xx 1-2:1.0 eth0: Link is Down
dwc2 12480000.usb: resuming usb gadget g_ether
usb 1-3: reset high-speed USB device number 3 using exynos-ehci
usb 1-3.1: reset high-speed USB device number 4 using exynos-ehci
OOM killer enabled.
Restarting tasks ... done.
random: crng reseeded on system resumption
PM: suspend exit

> ...

Best regards
Hello,

On Wed, Feb 07, 2024 at 10:30:58PM +0100, Marek Szyprowski wrote:
...
> Disabling non-boot CPUs ...
> XXX wq_update_node_max_active: wq=events_unbound off_cpu=1 total=3 range=[32, 512] node[0] node_cpus=3 max=512
> XXX wq_update_node_max_active: wq=events_unbound off_cpu=2 total=2 range=[32, 512] node[0] node_cpus=2 max=512
> XXX wq_update_node_max_active: wq=events_unbound off_cpu=3 total=1 range=[32, 512] node[0] node_cpus=1 max=512
> Enabling non-boot CPUs ...
> XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=2 range=[32, 512] node[0] node_cpus=2 max=512
> CPU1 is up
> XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=3 range=[32, 512] node[0] node_cpus=3 max=512
> CPU2 is up
> XXX wq_update_node_max_active: wq=events_unbound off_cpu=-1 total=4 range=[32, 512] node[0] node_cpus=4 max=512
> CPU3 is up

So, the node max_active does stay at 512. The only pwq which uses
min_active would be the dfl_pwq, but I'm not sure why that'd be in use.
Can you please post the output of
`drgn -v vmlinux tools/workqueue/wq_dump.py`?

Thanks.
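The numbers in a trace like this can be recomputed directly from the clamp expression in the debug patch above. Here is a stand-alone recomputation with the kernel macros (`DIV_ROUND_UP`, `clamp`) expanded by hand; `node_max_active` is a hypothetical helper name used only for illustration:

```c
/*
 * Mirrors clamp(DIV_ROUND_UP(max_active * node_cpus, total_cpus),
 * min_active, max_active) from wq_update_node_max_active() in the
 * debug patch.
 */
static int node_max_active(int max_active, int min_active,
			   int node_cpus, int total_cpus)
{
	/* DIV_ROUND_UP(max_active * node_cpus, total_cpus) */
	int v = (max_active * node_cpus + total_cpus - 1) / total_cpus;

	/* clamp(v, min_active, max_active) */
	if (v < min_active)
		v = min_active;
	if (v > max_active)
		v = max_active;
	return v;
}
```

For the single-node board in the trace (range=[32, 512]), every node_cpus/total_cpus combination that appears yields 512: the per-node maximum never drops toward min_active, which is why the dfl_pwq (pinned at min_active) being used at all is the surprising part.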
On 07.02.2024 22:35, Tejun Heo wrote:
> Hello,
>
> On Wed, Feb 07, 2024 at 10:30:58PM +0100, Marek Szyprowski wrote:
> [...]
> So, the node max_active does stay at 512. The only pwq which uses
> min_active would be the dfl_pwq, but I'm not sure why that'd be in use.
> Can you please post the output of
> `drgn -v vmlinux tools/workqueue/wq_dump.py`?

I tried to get drgn running on the test target, but then noticed that it
is not possible to run it on a 32-bit ARM target, as it requires
PROC_KCORE support, which is 'Visible if: PROC_FS [=y] && MMU [=y] &&
!ARM [=y]' for some reason.

Best regards
Hello,

On Thu, Feb 08, 2024 at 08:47:12AM +0100, Marek Szyprowski wrote:
> I tried to get drgn running on the test target, but then noticed that
> it is not possible to run it on a 32-bit ARM target, as it requires
> PROC_KCORE support, which is 'Visible if: PROC_FS [=y] && MMU [=y] &&
> !ARM [=y]' for some reason.

Bummer. I instrumented the code on my test setup (x86) and couldn't
reproduce usage of the dfl_pwq, unfortunately.

Independent of understanding what's going on with the system_unbound_wq,
the correct solution seems to be using a dedicated workqueue with raised
min_active. Just posted the patchset to do that:

http://lkml.kernel.org/r/ZcVtzJvJCRV5OLM-@slm.duckdns.org

Thanks.
On 2/7/24 11:59, Rafael J. Wysocki wrote:
> On Wed, Feb 7, 2024 at 11:31 AM Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>>
>> [...]
>>
>> Let me know if you need more information or tests on the hardware. I'm
>> open to help debugging this issue.
>
> So I found this report:
>
> https://lore.kernel.org/lkml/b3d08cd8-d77f-45dd-a2c3-4a4db5a98dfa@kernel.org/
>
> which appears to be about the same issue.
>
> Strictly speaking, the regression is introduced by 5797b1c18919,
> because it is not a mainline commit yet, but the information regarding
> its interaction with commit 3e999770ac1c is valuable.
>
> Essentially, what commit 3e999770ac1c does is to schedule the
> execution of all of the async resume callbacks in a given phase
> upfront, so they can run without waiting for the sync ones to complete
> (except when there is a parent-child or supplier-consumer dependency -
> the latter represented by a device link).
>
> What seems to be happening after commit 5797b1c18919 is that one (or
> more) of the async callbacks gets blocked in the workqueue for some
> reason.
>
> I would try to replace system_unbound_wq in
> __async_schedule_node_domain() with a dedicated workqueue that is not
> unbound and see what happens.

Thanks Rafael for helping connect the dots! I did the laziest thing
imaginable to switch to a WQ that's not unbound:

diff --git a/kernel/async.c b/kernel/async.c
index 97f224a5257b..37f1204ab4e9 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -174,7 +174,7 @@ static async_cookie_t __async_schedule_node_domain(async_func_t func,
 	spin_unlock_irqrestore(&async_lock, flags);
 
 	/* schedule for execution */
-	queue_work_node(node, system_unbound_wq, &entry->work);
+	queue_work_node(node, system_wq, &entry->work);

...and the system can now suspend/resume all day long again!

Konrad
Hi,

On Fri, Feb 9, 2024 at 1:20 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Feb 08, 2024 at 08:47:12AM +0100, Marek Szyprowski wrote:
> > I've tried to get drgn running on the test target, but then I've
> > noticed that it is not possible to run it on ARM 32bit target, as it
> > requires PROC_KCORE support, which is 'Visible if: PROC_FS [=y] &&
> > MMU [=y] && !ARM [=y]' for some reasons.
>
> Bummer. I instrumented code on my test setup (x86) and couldn't repro
> usage of the dfl_pwq, unfortunately.

On x86 there are fewer device links representing dependencies between
devices, because it doesn't use fw_devlink, which is used on ARM
platforms (and other DT-based ones), so this is less likely to reproduce
on x86.

> Independent of understanding what's going on with the
> system_unbound_wq, the correct solution seems like using a dedicated
> workqueue with raised min_active. Just posted the patchset to do that:
>
> http://lkml.kernel.org/r/ZcVtzJvJCRV5OLM-@slm.duckdns.org

LGTM, thanks!
Index: linux-pm/include/linux/pm.h
===================================================================
--- linux-pm.orig/include/linux/pm.h
+++ linux-pm/include/linux/pm.h
@@ -681,6 +681,7 @@ struct dev_pm_info {
 	bool			wakeup_path:1;
 	bool			syscore:1;
 	bool			no_pm_callbacks:1;	/* Owned by the PM core */
+	bool			in_progress:1;	/* Owned by the PM core */
 	unsigned int		must_resume:1;	/* Owned by the PM core */
 	unsigned int		may_skip_resume:1;	/* Set by subsystems */
 #else
Index: linux-pm/drivers/base/power/main.c
===================================================================
--- linux-pm.orig/drivers/base/power/main.c
+++ linux-pm/drivers/base/power/main.c
@@ -579,7 +579,7 @@ bool dev_pm_skip_resume(struct device *d
 }
 
 /**
- * __device_resume_noirq - Execute a "noirq resume" callback for given device.
+ * device_resume_noirq - Execute a "noirq resume" callback for given device.
  * @dev: Device to handle.
  * @state: PM transition of the system being carried out.
  * @async: If true, the device is being resumed asynchronously.
@@ -587,7 +587,7 @@ bool dev_pm_skip_resume(struct device *d
  * The driver of @dev will not receive interrupts while this function is being
  * executed.
  */
-static void __device_resume_noirq(struct device *dev, pm_message_t state, bool async)
+static void device_resume_noirq(struct device *dev, pm_message_t state, bool async)
 {
 	pm_callback_t callback = NULL;
 	const char *info = NULL;
@@ -674,16 +674,22 @@ static bool dpm_async_fn(struct device *
 {
 	reinit_completion(&dev->power.completion);
 
-	if (!is_async(dev))
-		return false;
+	if (is_async(dev)) {
+		dev->power.in_progress = true;
 
-	get_device(dev);
-
-	if (async_schedule_dev_nocall(func, dev))
-		return true;
+		get_device(dev);
 
-	put_device(dev);
+		if (async_schedule_dev_nocall(func, dev))
+			return true;
 
+		put_device(dev);
+	}
+	/*
+	 * Because async_schedule_dev_nocall() above has returned false or it
+	 * has not been called at all, func() is not running and it is safe to
+	 * update the in_progress flag without additional synchronization.
+	 */
+	dev->power.in_progress = false;
 	return false;
 }
@@ -691,18 +697,10 @@ static void async_resume_noirq(void *dat
 {
 	struct device *dev = data;
 
-	__device_resume_noirq(dev, pm_transition, true);
+	device_resume_noirq(dev, pm_transition, true);
 	put_device(dev);
 }
 
-static void device_resume_noirq(struct device *dev)
-{
-	if (dpm_async_fn(dev, async_resume_noirq))
-		return;
-
-	__device_resume_noirq(dev, pm_transition, false);
-}
-
 static void dpm_noirq_resume_devices(pm_message_t state)
 {
 	struct device *dev;
@@ -712,18 +710,28 @@ static void dpm_noirq_resume_devices(pm_
 	mutex_lock(&dpm_list_mtx);
 	pm_transition = state;
 
+	/*
+	 * Trigger the resume of "async" devices upfront so they don't have to
+	 * wait for the "non-async" ones they don't depend on.
+	 */
+	list_for_each_entry(dev, &dpm_noirq_list, power.entry)
+		dpm_async_fn(dev, async_resume_noirq);
+
 	while (!list_empty(&dpm_noirq_list)) {
 		dev = to_device(dpm_noirq_list.next);
-		get_device(dev);
 		list_move_tail(&dev->power.entry, &dpm_late_early_list);
 
-		mutex_unlock(&dpm_list_mtx);
+		if (!dev->power.in_progress) {
+			get_device(dev);
 
-		device_resume_noirq(dev);
+			mutex_unlock(&dpm_list_mtx);
 
-		put_device(dev);
+			device_resume_noirq(dev, state, false);
+
+			put_device(dev);
 
-		mutex_lock(&dpm_list_mtx);
+			mutex_lock(&dpm_list_mtx);
+		}
 	}
 	mutex_unlock(&dpm_list_mtx);
 	async_synchronize_full();
@@ -747,14 +755,14 @@ void dpm_resume_noirq(pm_message_t state
 }
 
 /**
- * __device_resume_early - Execute an "early resume" callback for given device.
+ * device_resume_early - Execute an "early resume" callback for given device.
  * @dev: Device to handle.
  * @state: PM transition of the system being carried out.
  * @async: If true, the device is being resumed asynchronously.
  *
  * Runtime PM is disabled for @dev while this function is being executed.
  */
-static void __device_resume_early(struct device *dev, pm_message_t state, bool async)
+static void device_resume_early(struct device *dev, pm_message_t state, bool async)
 {
 	pm_callback_t callback = NULL;
 	const char *info = NULL;
@@ -820,18 +828,10 @@ static void async_resume_early(void *dat
 {
 	struct device *dev = data;
 
-	__device_resume_early(dev, pm_transition, true);
+	device_resume_early(dev, pm_transition, true);
 	put_device(dev);
 }
 
-static void device_resume_early(struct device *dev)
-{
-	if (dpm_async_fn(dev, async_resume_early))
-		return;
-
-	__device_resume_early(dev, pm_transition, false);
-}
-
 /**
  * dpm_resume_early - Execute "early resume" callbacks for all devices.
  * @state: PM transition of the system being carried out.
@@ -845,18 +845,28 @@ void dpm_resume_early(pm_message_t state
 	mutex_lock(&dpm_list_mtx);
 	pm_transition = state;
 
+	/*
+	 * Trigger the resume of "async" devices upfront so they don't have to
+	 * wait for the "non-async" ones they don't depend on.
+	 */
+	list_for_each_entry(dev, &dpm_late_early_list, power.entry)
+		dpm_async_fn(dev, async_resume_early);
+
 	while (!list_empty(&dpm_late_early_list)) {
 		dev = to_device(dpm_late_early_list.next);
-		get_device(dev);
 		list_move_tail(&dev->power.entry, &dpm_suspended_list);
 
-		mutex_unlock(&dpm_list_mtx);
+		if (!dev->power.in_progress) {
+			get_device(dev);
 
-		device_resume_early(dev);
+			mutex_unlock(&dpm_list_mtx);
 
-		put_device(dev);
+			device_resume_early(dev, state, false);
+
+			put_device(dev);
 
-		mutex_lock(&dpm_list_mtx);
+			mutex_lock(&dpm_list_mtx);
+		}
 	}
 	mutex_unlock(&dpm_list_mtx);
 	async_synchronize_full();
@@ -876,12 +886,12 @@ void dpm_resume_start(pm_message_t state
 EXPORT_SYMBOL_GPL(dpm_resume_start);
 
 /**
- * __device_resume - Execute "resume" callbacks for given device.
+ * device_resume - Execute "resume" callbacks for given device.
  * @dev: Device to handle.
  * @state: PM transition of the system being carried out.
  * @async: If true, the device is being resumed asynchronously.
  */
-static void __device_resume(struct device *dev, pm_message_t state, bool async)
+static void device_resume(struct device *dev, pm_message_t state, bool async)
 {
 	pm_callback_t callback = NULL;
 	const char *info = NULL;
@@ -975,18 +985,10 @@ static void async_resume(void *data, asy
 {
 	struct device *dev = data;
 
-	__device_resume(dev, pm_transition, true);
+	device_resume(dev, pm_transition, true);
 	put_device(dev);
 }
 
-static void device_resume(struct device *dev)
-{
-	if (dpm_async_fn(dev, async_resume))
-		return;
-
-	__device_resume(dev, pm_transition, false);
-}
-
 /**
 * dpm_resume - Execute "resume" callbacks for non-sysdev devices.
 * @state: PM transition of the system being carried out.
@@ -1006,16 +1008,25 @@ void dpm_resume(pm_message_t state)
 	pm_transition = state;
 	async_error = 0;
 
+	/*
+	 * Trigger the resume of "async" devices upfront so they don't have to
+	 * wait for the "non-async" ones they don't depend on.
+	 */
+	list_for_each_entry(dev, &dpm_suspended_list, power.entry)
+		dpm_async_fn(dev, async_resume);
+
 	while (!list_empty(&dpm_suspended_list)) {
 		dev = to_device(dpm_suspended_list.next);
 		get_device(dev);
 
-		mutex_unlock(&dpm_list_mtx);
+		if (!dev->power.in_progress) {
+			mutex_unlock(&dpm_list_mtx);
 
-		device_resume(dev);
+			device_resume(dev, state, false);
 
-		mutex_lock(&dpm_list_mtx);
+			mutex_lock(&dpm_list_mtx);
+		}
 
 		if (!list_empty(&dev->power.entry))
 			list_move_tail(&dev->power.entry, &dpm_prepared_list);
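The restructuring in the patch follows one pattern in all three resume phases: first walk the device list once and kick off every async-capable device, then run the synchronous loop, skipping devices already in flight. The user-space sketch below models just that control flow with hypothetical stand-in types and names; the real kernel code additionally takes device references, drops the list mutex around callbacks, and runs the async work on a workqueue rather than inline.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct device + its dev_pm_info bits. */
struct dev {
	const char *name;
	bool async_ok;     /* device may resume asynchronously */
	bool in_progress;  /* set when handed to the async machinery */
	bool resumed;
};

/*
 * Stand-in for dpm_async_fn()/async_schedule_dev_nocall(): here the
 * "async" callback runs immediately instead of being queued, since we
 * only model the bookkeeping of the in_progress flag.
 */
static bool fake_dpm_async_fn(struct dev *d)
{
	if (d->async_ok) {
		d->in_progress = true;
		d->resumed = true;  /* pretend the async worker ran */
		return true;
	}
	d->in_progress = false;
	return false;
}

/* Returns the number of devices resumed synchronously. */
static int resume_all(struct dev *devs, int n)
{
	int sync_calls = 0;

	/* Pass 1: trigger all async resumes upfront. */
	for (int i = 0; i < n; i++)
		fake_dpm_async_fn(&devs[i]);

	/* Pass 2: resume the rest synchronously, skipping in-flight ones. */
	for (int i = 0; i < n; i++) {
		if (devs[i].in_progress)
			continue;
		devs[i].resumed = true;
		sync_calls++;
	}
	return sync_calls;
}
```

The two-pass structure is what restores the optimization: async devices no longer wait behind unrelated synchronous ones, and the in_progress flag keeps the synchronous loop from resuming the same device twice.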