Message ID | 20231129113120.4907-1-eugen.hristev@collabora.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:a5a7:0:b0:403:3b70:6f57 with SMTP id d7csp272301vqn; Wed, 29 Nov 2023 03:33:24 -0800 (PST) X-Google-Smtp-Source: AGHT+IHj3qRv0xmOpQUwlcJGBQ2cmkL/bXsSTglHtiMRmJbLFgciDT4fZneDDHMwZOts/LHOZFsJ X-Received: by 2002:a17:902:dacf:b0:1cf:edd5:f786 with SMTP id q15-20020a170902dacf00b001cfedd5f786mr7612765plx.21.1701257604090; Wed, 29 Nov 2023 03:33:24 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1701257604; cv=none; d=google.com; s=arc-20160816; b=AdSWhlv6LyEIUS/W4XIXlfPcCf3g5uUh96rOVQUeTuPgJBds6y2IdnISYwJB49RdgE /eJf3W02L6K73EzgiSVs90Incwv7n44opr5oesZWNPw3h/fBtIHbjkgJa8z0VKP+EHam 653fDcHIAfigsmfUwauK5dAV9Ac72I/iamNs4kwDMdXYSIC2xY0CKOMwEmFP14iv/vBT e1AeJWPST4GjqpeyGwzR+tIh9wECJheJNtZiWH0Ml7gCTwqNxZI/lKzXAIpeSPkNhmGK B43f0ZEjYQQX1ATgIcp8gRv5dQhv4khY4dh7BCJSi2A7NozKh6PJ/7hC9jP+XkAJiteR neHA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=FMAtC5RlO7Jbgcz7ACyP3z2tGyaNf1QQubxTNs/VVCs=; fh=/FlLlbJIyaSMb/z//GYuwqkY3Pecj8kRqDZsisUVoII=; b=jqkN5yAfrc9XYiszXGrdMZ7/vEHAq8qA4tBAaYsep2HIBLjtHP4o0SEDcubmxU+QAN Ebz7cSZtb76IVR3zUq9fMUfs22zDn6YI+I0xzNLmk/XZkYQJ05mBbP5xzaA/JN+BY0p0 I41KHu4w5fAvWvwLRDJ8RoQkYU2IoPBClV6nYjti8lbDc/twhmvKIb216UBXTwb6LwmU 9tqnd9uX3deTr0jT/SxaOWqAIp6gZx4O0cXrPNAvtHCIrOYP0GVrwbWYRHADEv5MDkUR ptBrQh1Wi/XTdNzLZCA+BU0Q8RfhiY0pXGEWOUK6s+4qYmTH3NX4KIrA6rK9m8pxn545 jQ4A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@collabora.com header.s=mail header.b=NWrmaJ+G; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=collabora.com Received: from howler.vger.email (howler.vger.email. [2620:137:e000::3:4]) by mx.google.com with ESMTPS id i18-20020a170902cf1200b001ce5b92f430si15319647plg.112.2023.11.29.03.33.23 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Nov 2023 03:33:24 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) client-ip=2620:137:e000::3:4; Authentication-Results: mx.google.com; dkim=pass header.i=@collabora.com header.s=mail header.b=NWrmaJ+G; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=collabora.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by howler.vger.email (Postfix) with ESMTP id 2D32D80AF82C; Wed, 29 Nov 2023 03:33:16 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at howler.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232328AbjK2LdH (ORCPT <rfc822;toshivichauhan@gmail.com> + 99 others); Wed, 29 Nov 2023 06:33:07 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51988 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231922AbjK2LdG (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 29 Nov 2023 06:33:06 -0500 Received: from madras.collabora.co.uk (madras.collabora.co.uk [IPv6:2a00:1098:0:82:1000:25:2eeb:e5ab]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CD43084; Wed, 29 Nov 2023 03:33:12 -0800 (PST) Received: from eugen-station.. (cola.collaboradmins.com [195.201.22.229]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) (Authenticated sender: ehristev) by madras.collabora.co.uk (Postfix) with ESMTPSA id E0A006601F02; Wed, 29 Nov 2023 11:33:10 +0000 (GMT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=collabora.com; s=mail; t=1701257591; bh=x16Rnq7CZEpFbM9s04wUl+znCtY6zt8f98/hPwnLU3c=; h=From:To:Cc:Subject:Date:From; b=NWrmaJ+GCcglxq208tXL5bSQmTfPx7i69hCbPXnCUxeWxkjyUeJC0uO/ZZ49Oig5q D1vd0wXcKPpaG8YHmzygzS8viumQuoOrtLZ4HlGhj26Ng4Bjtzt5A4ntoeAsxblAt6 Xw/z1zN5Gh98+BaWWdPt6vF+esqBVJSw0vZVx2QXrBg86+oFsdfHtP7iGog8zbtT6x HV63cXFbi8Pu58QlB9wIvpHYHXZZPf8i2S4Uf+fdWCm9zI8nSSx+IzPa7/fsy07Qh3 yfJ7ZCHeukygDPOEOD/k0+KOurbGnlyT4I8QGBSCnHUemBpwN0tLRjNt/3IOMyFKnG Mqjzf0Nhp/m2A== From: Eugen Hristev <eugen.hristev@collabora.com> To: linux-arm-kernel@lists.infradead.org, linux-mediatek@lists.infradead.org, linux-pm@vger.kernel.org Cc: eballetbo@kernel.org, angelogioacchino.delregno@collabora.com, ulf.hansson@linaro.org, linux-kernel@vger.kernel.org, kernel@collabora.com, Eugen Hristev <eugen.hristev@collabora.com> Subject: [PATCH] pmdomain: mediatek: fix race condition in power on/power off sequences Date: Wed, 29 Nov 2023 13:31:20 +0200 Message-Id: <20231129113120.4907-1-eugen.hristev@collabora.com> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on howler.vger.email Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (howler.vger.email [0.0.0.0]); Wed, 29 Nov 2023 03:33:16 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783897893014320397 X-GMAIL-MSGID: 1783897893014320397 |
Series |
pmdomain: mediatek: fix race condition in power on/power off sequences
|
|
Commit Message
Eugen Hristev
Nov. 29, 2023, 11:31 a.m. UTC
It can happen that during the power off sequence for a power domain
another power on sequence is started, and it can lead to powering on and
off in the same time for the similar power domain.
This can happen if parallel probing occurs: one device starts probing, and
one power domain is probe deferred, this leads to all power domains being
rolled back and powered off, while in the same time another device starts
probing and requests powering on the same power domains or similar.
This was encountered on MT8186, when the sequence is :
Power on SSUSB
Power on SSUSB_P1
Power on DIS
-> probe deferred
Power off DIS
Power off SSUSB_P1
Power off SSUSB
During the sequence of powering off SSUSB, some new similar sequence starts,
and during the power on of SSUSB, clocks are enabled.
In this case, powering off SSUSB fails from the first sequence, because
power off ACK bit check times out (as clocks are powered back on by the second
sequence). In consequence, powering it on also times out, and it leads to
the whole power domain in a bad state.
To solve this issue, added a mutex that locks the whole power off/power on
sequence such that it would never happen that multiple sequences try to
enable or disable the same power domain in parallel.
Fixes: 59b644b01cf4 ("soc: mediatek: Add MediaTek SCPSYS power domains")
Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com>
---
drivers/pmdomain/mediatek/mtk-pm-domains.c | 24 +++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
Comments
Il 29/11/23 12:31, Eugen Hristev ha scritto: > It can happen that during the power off sequence for a power domain > another power on sequence is started, and it can lead to powering on and > off in the same time for the similar power domain. > This can happen if parallel probing occurs: one device starts probing, and > one power domain is probe deferred, this leads to all power domains being > rolled back and powered off, while in the same time another device starts > probing and requests powering on the same power domains or similar. > > This was encountered on MT8186, when the sequence is : > Power on SSUSB > Power on SSUSB_P1 > Power on DIS > -> probe deferred > Power off DIS > Power off SSUSB_P1 > Power off SSUSB > > During the sequence of powering off SSUSB, some new similar sequence starts, > and during the power on of SSUSB, clocks are enabled. > In this case, powering off SSUSB fails from the first sequence, because > power off ACK bit check times out (as clocks are powered back on by the second > sequence). In consequence, powering it on also times out, and it leads to > the whole power domain in a bad state. > > To solve this issue, added a mutex that locks the whole power off/power on > sequence such that it would never happen that multiple sequences try to > enable or disable the same power domain in parallel. > > Fixes: 59b644b01cf4 ("soc: mediatek: Add MediaTek SCPSYS power domains") > Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> I don't think that it's a race between genpd_power_on() and genpd_power_off() calls at all, because genpd *does* have locking after all... at least for probe and for parents of a power domain (and more anyway). As far as I remember, what happens when you start .probe()'ing a device is: platform_probe() -> dev_pm_domain_attach() -> genpd_dev_pm_attach() There, you end up with if (power_on) { genpd_lock(pd); ret = genpd_power_on(pd, 0); genpd_unlock(pd); } ...but when you fail probing, you go with genpd_dev_pm_detach(), which then calls /* Check if PM domain can be powered off after removing this device. */ genpd_queue_power_off_work(pd); but even then, you end up being in a worker doing genpd_lock(genpd); genpd_power_off(genpd, false, 0); genpd_unlock(genpd); ...so I don't understand why this mutex can resolve the situation here (also: are you really sure that the race is solved like that?) I'd say that this probably needs more justification and a trace of the actual situation here. Besides, if this really resolves the issue, I would prefer seeing variants of scpsys_power_{on,off}() functions, because we anyway don't need to lock mutexes during this driver's probe (add_subdomain calls scpsys_power_on()). In that case, `scpsys_power_on_unlocked()` would be an idea... but still, please analyze why your solution works, if it does, because I'm not convinced. Cheers, Angelo > --- > drivers/pmdomain/mediatek/mtk-pm-domains.c | 24 +++++++++++++++++----- > 1 file changed, 19 insertions(+), 5 deletions(-) > > diff --git a/drivers/pmdomain/mediatek/mtk-pm-domains.c b/drivers/pmdomain/mediatek/mtk-pm-domains.c > index d5f0ee05c794..4f136b47e539 100644 > --- a/drivers/pmdomain/mediatek/mtk-pm-domains.c > +++ b/drivers/pmdomain/mediatek/mtk-pm-domains.c > @@ -9,6 +9,7 @@ > #include <linux/io.h> > #include <linux/iopoll.h> > #include <linux/mfd/syscon.h> > +#include <linux/mutex.h> > #include <linux/of.h> > #include <linux/of_clk.h> > #include <linux/platform_device.h> > @@ -56,6 +57,7 @@ struct scpsys { > struct device *dev; > struct regmap *base; > const struct scpsys_soc_data *soc_data; > + struct mutex mutex; > struct genpd_onecell_data pd_data; > struct generic_pm_domain *domains[]; > }; > @@ -238,9 +240,13 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) > bool tmp; > int ret; > > + mutex_lock(&scpsys->mutex); > + > ret = scpsys_regulator_enable(pd->supply); > - if (ret) > + if (ret) { > + mutex_unlock(&scpsys->mutex); > return ret; > + } > > ret = clk_bulk_prepare_enable(pd->num_clks, pd->clks); > if (ret) > @@ -291,6 +297,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) > goto err_enable_bus_protect; > } > > + mutex_unlock(&scpsys->mutex); > return 0; > > err_enable_bus_protect: > @@ -305,6 +312,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) > clk_bulk_disable_unprepare(pd->num_clks, pd->clks); > err_reg: > scpsys_regulator_disable(pd->supply); > + mutex_unlock(&scpsys->mutex); > return ret; > } > > @@ -315,13 +323,15 @@ static int scpsys_power_off(struct generic_pm_domain *genpd) > bool tmp; > int ret; > > + mutex_lock(&scpsys->mutex); > + > ret = scpsys_bus_protect_enable(pd); > if (ret < 0) > - return ret; > + goto err_mutex_unlock; > > ret = scpsys_sram_disable(pd); > if (ret < 0) > - return ret; > + goto err_mutex_unlock; > > if (pd->data->ext_buck_iso_offs && MTK_SCPD_CAPS(pd, MTK_SCPD_EXT_BUCK_ISO)) > regmap_set_bits(scpsys->base, pd->data->ext_buck_iso_offs, > @@ -340,13 +350,15 @@ static int scpsys_power_off(struct generic_pm_domain *genpd) > ret = readx_poll_timeout(scpsys_domain_is_on, pd, tmp, !tmp, MTK_POLL_DELAY_US, > MTK_POLL_TIMEOUT); > if (ret < 0) > - return ret; > + goto err_mutex_unlock; > > clk_bulk_disable_unprepare(pd->num_clks, pd->clks); > > scpsys_regulator_disable(pd->supply); > > - return 0; > +err_mutex_unlock: > + mutex_unlock(&scpsys->mutex); > + return ret; > } > > static struct > @@ -700,6 +712,8 @@ static int scpsys_probe(struct platform_device *pdev) > return PTR_ERR(scpsys->base); > } > > + mutex_init(&scpsys->mutex); > + > ret = -ENODEV; > for_each_available_child_of_node(np, node) { > struct generic_pm_domain *domain;
On 11/29/23 14:52, AngeloGioacchino Del Regno wrote: > Il 29/11/23 12:31, Eugen Hristev ha scritto: >> It can happen that during the power off sequence for a power domain >> another power on sequence is started, and it can lead to powering on and >> off in the same time for the similar power domain. >> This can happen if parallel probing occurs: one device starts probing, >> and >> one power domain is probe deferred, this leads to all power domains being >> rolled back and powered off, while in the same time another device starts >> probing and requests powering on the same power domains or similar. >> >> This was encountered on MT8186, when the sequence is : >> Power on SSUSB >> Power on SSUSB_P1 >> Power on DIS >> -> probe deferred >> Power off DIS >> Power off SSUSB_P1 >> Power off SSUSB >> >> During the sequence of powering off SSUSB, some new similar sequence >> starts, >> and during the power on of SSUSB, clocks are enabled. >> In this case, powering off SSUSB fails from the first sequence, because >> power off ACK bit check times out (as clocks are powered back on by >> the second >> sequence). In consequence, powering it on also times out, and it leads to >> the whole power domain in a bad state. >> >> To solve this issue, added a mutex that locks the whole power >> off/power on >> sequence such that it would never happen that multiple sequences try to >> enable or disable the same power domain in parallel. >> >> Fixes: 59b644b01cf4 ("soc: mediatek: Add MediaTek SCPSYS power domains") >> Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> > > I don't think that it's a race between genpd_power_on() and > genpd_power_off() calls > at all, because genpd *does* have locking after all... at least for > probe and for > parents of a power domain (and more anyway). > > As far as I remember, what happens when you start .probe()'ing a device is: > platform_probe() -> dev_pm_domain_attach() -> genpd_dev_pm_attach() > > There, you end up with > > if (power_on) { > genpd_lock(pd); > ret = genpd_power_on(pd, 0); > genpd_unlock(pd); > } > > ...but when you fail probing, you go with genpd_dev_pm_detach(), which > then calls > > /* Check if PM domain can be powered off after removing this > device. */ > genpd_queue_power_off_work(pd); > > but even then, you end up being in a worker doing > > genpd_lock(genpd); > genpd_power_off(genpd, false, 0); > genpd_unlock(genpd); > > ...so I don't understand why this mutex can resolve the situation here > (also: are > you really sure that the race is solved like that?) > > I'd say that this probably needs more justification and a trace of the > actual > situation here. > > Besides, if this really resolves the issue, I would prefer seeing > variants of > scpsys_power_{on,off}() functions, because we anyway don't need to lock > mutexes > during this driver's probe (add_subdomain calls scpsys_power_on()). > In that case, `scpsys_power_on_unlocked()` would be an idea... but > still, please > analyze why your solution works, if it does, because I'm not convinced. What I see in my tests, is that a power on call for SSUSB domain happens while the previous power off sequence did not yet complete, most likely while it's waiting in readx_poll_timeout . This leads to inconsistency of the power domain, not getting the ACKs next time a power on attempt occurs. I understand what you say about locks, but in this case the powering off is not called by the genpd itself, but rather it's called by the rollback probe failed mechanism : when the probing fails, scpsys_domain_cleanup() is called during the same probing session. Then it happens that probing begins again and previous cleanup is not yet completed. I am not sure whether the lock is still held from the previous run, but it's clearly not waiting for a lock to be released to be called again. > > Cheers, > Angelo > >> --- >> drivers/pmdomain/mediatek/mtk-pm-domains.c | 24 +++++++++++++++++----- >> 1 file changed, 19 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/pmdomain/mediatek/mtk-pm-domains.c >> b/drivers/pmdomain/mediatek/mtk-pm-domains.c >> index d5f0ee05c794..4f136b47e539 100644 >> --- a/drivers/pmdomain/mediatek/mtk-pm-domains.c >> +++ b/drivers/pmdomain/mediatek/mtk-pm-domains.c >> @@ -9,6 +9,7 @@ >> #include <linux/io.h> >> #include <linux/iopoll.h> >> #include <linux/mfd/syscon.h> >> +#include <linux/mutex.h> >> #include <linux/of.h> >> #include <linux/of_clk.h> >> #include <linux/platform_device.h> >> @@ -56,6 +57,7 @@ struct scpsys { >> struct device *dev; >> struct regmap *base; >> const struct scpsys_soc_data *soc_data; >> + struct mutex mutex; >> struct genpd_onecell_data pd_data; >> struct generic_pm_domain *domains[]; >> }; >> @@ -238,9 +240,13 @@ static int scpsys_power_on(struct >> generic_pm_domain *genpd) >> bool tmp; >> int ret; >> + mutex_lock(&scpsys->mutex); >> + >> ret = scpsys_regulator_enable(pd->supply); >> - if (ret) >> + if (ret) { >> + mutex_unlock(&scpsys->mutex); >> return ret; >> + } >> ret = clk_bulk_prepare_enable(pd->num_clks, pd->clks); >> if (ret) >> @@ -291,6 +297,7 @@ static int scpsys_power_on(struct >> generic_pm_domain *genpd) >> goto err_enable_bus_protect; >> } >> + mutex_unlock(&scpsys->mutex); >> return 0; >> err_enable_bus_protect: >> @@ -305,6 +312,7 @@ static int scpsys_power_on(struct >> generic_pm_domain *genpd) >> clk_bulk_disable_unprepare(pd->num_clks, pd->clks); >> err_reg: >> scpsys_regulator_disable(pd->supply); >> + mutex_unlock(&scpsys->mutex); >> return ret; >> } >> @@ -315,13 +323,15 @@ static int scpsys_power_off(struct >> generic_pm_domain *genpd) >> bool tmp; >> int ret; >> + mutex_lock(&scpsys->mutex); >> + >> ret = scpsys_bus_protect_enable(pd); >> if (ret < 0) >> - return ret; >> + goto err_mutex_unlock; >> ret = scpsys_sram_disable(pd); >> if (ret < 0) >> - return ret; >> + goto err_mutex_unlock; >> if (pd->data->ext_buck_iso_offs && MTK_SCPD_CAPS(pd, >> MTK_SCPD_EXT_BUCK_ISO)) >> regmap_set_bits(scpsys->base, pd->data->ext_buck_iso_offs, >> @@ -340,13 +350,15 @@ static int scpsys_power_off(struct >> generic_pm_domain *genpd) >> ret = readx_poll_timeout(scpsys_domain_is_on, pd, tmp, !tmp, >> MTK_POLL_DELAY_US, >> MTK_POLL_TIMEOUT); >> if (ret < 0) >> - return ret; >> + goto err_mutex_unlock; >> clk_bulk_disable_unprepare(pd->num_clks, pd->clks); >> scpsys_regulator_disable(pd->supply); >> - return 0; >> +err_mutex_unlock: >> + mutex_unlock(&scpsys->mutex); >> + return ret; >> } >> static struct >> @@ -700,6 +712,8 @@ static int scpsys_probe(struct platform_device *pdev) >> return PTR_ERR(scpsys->base); >> } >> + mutex_init(&scpsys->mutex); >> + >> ret = -ENODEV; >> for_each_available_child_of_node(np, node) { >> struct generic_pm_domain *domain; > >
Il 29/11/23 14:28, Eugen Hristev ha scritto: > On 11/29/23 14:52, AngeloGioacchino Del Regno wrote: >> Il 29/11/23 12:31, Eugen Hristev ha scritto: >>> It can happen that during the power off sequence for a power domain >>> another power on sequence is started, and it can lead to powering on and >>> off in the same time for the similar power domain. >>> This can happen if parallel probing occurs: one device starts probing, and >>> one power domain is probe deferred, this leads to all power domains being >>> rolled back and powered off, while in the same time another device starts >>> probing and requests powering on the same power domains or similar. >>> >>> This was encountered on MT8186, when the sequence is : >>> Power on SSUSB >>> Power on SSUSB_P1 >>> Power on DIS >>> -> probe deferred >>> Power off DIS >>> Power off SSUSB_P1 >>> Power off SSUSB >>> >>> During the sequence of powering off SSUSB, some new similar sequence starts, >>> and during the power on of SSUSB, clocks are enabled. >>> In this case, powering off SSUSB fails from the first sequence, because >>> power off ACK bit check times out (as clocks are powered back on by the second >>> sequence). In consequence, powering it on also times out, and it leads to >>> the whole power domain in a bad state. >>> >>> To solve this issue, added a mutex that locks the whole power off/power on >>> sequence such that it would never happen that multiple sequences try to >>> enable or disable the same power domain in parallel. >>> >>> Fixes: 59b644b01cf4 ("soc: mediatek: Add MediaTek SCPSYS power domains") >>> Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> >> >> I don't think that it's a race between genpd_power_on() and genpd_power_off() calls >> at all, because genpd *does* have locking after all... at least for probe and for >> parents of a power domain (and more anyway). >> >> As far as I remember, what happens when you start .probe()'ing a device is: >> platform_probe() -> dev_pm_domain_attach() -> genpd_dev_pm_attach() >> >> There, you end up with >> >> if (power_on) { >> genpd_lock(pd); >> ret = genpd_power_on(pd, 0); >> genpd_unlock(pd); >> } >> >> ...but when you fail probing, you go with genpd_dev_pm_detach(), which then calls >> >> /* Check if PM domain can be powered off after removing this device. */ >> genpd_queue_power_off_work(pd); >> >> but even then, you end up being in a worker doing >> >> genpd_lock(genpd); >> genpd_power_off(genpd, false, 0); >> genpd_unlock(genpd); >> >> ...so I don't understand why this mutex can resolve the situation here (also: are >> you really sure that the race is solved like that?) >> >> I'd say that this probably needs more justification and a trace of the actual >> situation here. >> >> Besides, if this really resolves the issue, I would prefer seeing variants of >> scpsys_power_{on,off}() functions, because we anyway don't need to lock mutexes >> during this driver's probe (add_subdomain calls scpsys_power_on()). >> In that case, `scpsys_power_on_unlocked()` would be an idea... but still, please >> analyze why your solution works, if it does, because I'm not convinced. > > What I see in my tests, is that a power on call for SSUSB domain happens while the > previous power off sequence did not yet complete, most likely while it's waiting in > readx_poll_timeout . This leads to inconsistency of the power domain, not getting > the ACKs next time a power on attempt occurs. > > I understand what you say about locks, but in this case the powering off is not > called by the genpd itself, but rather it's called by the rollback probe failed > mechanism : when the probing fails, scpsys_domain_cleanup() is called during the > same probing session. > Then it happens that probing begins again and previous cleanup is not yet > completed. I am not sure whether the lock is still held from the previous run, but > it's clearly not waiting for a lock to be released to be called again. > Sorry but I'm a bit lost now: is the problem about probe deferrals of the USB driver, or about probe deferrals of the mtk-pm-domains driver? scpsys_domain_cleanup() is only called upon scpsys_probe() failure. >> >>> --- >>> drivers/pmdomain/mediatek/mtk-pm-domains.c | 24 +++++++++++++++++----- >>> 1 file changed, 19 insertions(+), 5 deletions(-) >>> >>> diff --git a/drivers/pmdomain/mediatek/mtk-pm-domains.c >>> b/drivers/pmdomain/mediatek/mtk-pm-domains.c >>> index d5f0ee05c794..4f136b47e539 100644 >>> --- a/drivers/pmdomain/mediatek/mtk-pm-domains.c >>> +++ b/drivers/pmdomain/mediatek/mtk-pm-domains.c >>> @@ -9,6 +9,7 @@ >>> #include <linux/io.h> >>> #include <linux/iopoll.h> >>> #include <linux/mfd/syscon.h> >>> +#include <linux/mutex.h> >>> #include <linux/of.h> >>> #include <linux/of_clk.h> >>> #include <linux/platform_device.h> >>> @@ -56,6 +57,7 @@ struct scpsys { >>> struct device *dev; >>> struct regmap *base; >>> const struct scpsys_soc_data *soc_data; >>> + struct mutex mutex; >>> struct genpd_onecell_data pd_data; >>> struct generic_pm_domain *domains[]; >>> }; >>> @@ -238,9 +240,13 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) >>> bool tmp; >>> int ret; >>> + mutex_lock(&scpsys->mutex); >>> + >>> ret = scpsys_regulator_enable(pd->supply); >>> - if (ret) >>> + if (ret) { >>> + mutex_unlock(&scpsys->mutex); >>> return ret; >>> + } >>> ret = clk_bulk_prepare_enable(pd->num_clks, pd->clks); >>> if (ret) >>> @@ -291,6 +297,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) >>> goto err_enable_bus_protect; >>> } >>> + mutex_unlock(&scpsys->mutex); >>> return 0; >>> err_enable_bus_protect: >>> @@ -305,6 +312,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) >>> clk_bulk_disable_unprepare(pd->num_clks, pd->clks); >>> err_reg: >>> scpsys_regulator_disable(pd->supply); >>> + mutex_unlock(&scpsys->mutex); >>> return ret; >>> } >>> @@ -315,13 +323,15 @@ static int scpsys_power_off(struct generic_pm_domain *genpd) >>> bool tmp; >>> int ret; >>> + mutex_lock(&scpsys->mutex); >>> + >>> ret = scpsys_bus_protect_enable(pd); >>> if (ret < 0) >>> - return ret; >>> + goto err_mutex_unlock; >>> ret = scpsys_sram_disable(pd); >>> if (ret < 0) >>> - return ret; >>> + goto err_mutex_unlock; >>> if (pd->data->ext_buck_iso_offs && MTK_SCPD_CAPS(pd, MTK_SCPD_EXT_BUCK_ISO)) >>> regmap_set_bits(scpsys->base, pd->data->ext_buck_iso_offs, >>> @@ -340,13 +350,15 @@ static int scpsys_power_off(struct generic_pm_domain *genpd) >>> ret = readx_poll_timeout(scpsys_domain_is_on, pd, tmp, !tmp, >>> MTK_POLL_DELAY_US, >>> MTK_POLL_TIMEOUT); >>> if (ret < 0) >>> - return ret; >>> + goto err_mutex_unlock; >>> clk_bulk_disable_unprepare(pd->num_clks, pd->clks); >>> scpsys_regulator_disable(pd->supply); >>> - return 0; >>> +err_mutex_unlock: >>> + mutex_unlock(&scpsys->mutex); >>> + return ret; >>> } >>> static struct >>> @@ -700,6 +712,8 @@ static int scpsys_probe(struct platform_device *pdev) >>> return PTR_ERR(scpsys->base); >>> } >>> + mutex_init(&scpsys->mutex); >>> + >>> ret = -ENODEV; >>> for_each_available_child_of_node(np, node) { >>> struct generic_pm_domain *domain; >>
On 11/29/23 15:37, AngeloGioacchino Del Regno wrote: > Il 29/11/23 14:28, Eugen Hristev ha scritto: >> On 11/29/23 14:52, AngeloGioacchino Del Regno wrote: >>> Il 29/11/23 12:31, Eugen Hristev ha scritto: >>>> It can happen that during the power off sequence for a power domain >>>> another power on sequence is started, and it can lead to powering on >>>> and >>>> off in the same time for the similar power domain. >>>> This can happen if parallel probing occurs: one device starts >>>> probing, and >>>> one power domain is probe deferred, this leads to all power domains >>>> being >>>> rolled back and powered off, while in the same time another device >>>> starts >>>> probing and requests powering on the same power domains or similar. >>>> >>>> This was encountered on MT8186, when the sequence is : >>>> Power on SSUSB >>>> Power on SSUSB_P1 >>>> Power on DIS >>>> -> probe deferred >>>> Power off DIS >>>> Power off SSUSB_P1 >>>> Power off SSUSB >>>> >>>> During the sequence of powering off SSUSB, some new similar sequence >>>> starts, >>>> and during the power on of SSUSB, clocks are enabled. >>>> In this case, powering off SSUSB fails from the first sequence, because >>>> power off ACK bit check times out (as clocks are powered back on by >>>> the second >>>> sequence). In consequence, powering it on also times out, and it >>>> leads to >>>> the whole power domain in a bad state. >>>> >>>> To solve this issue, added a mutex that locks the whole power >>>> off/power on >>>> sequence such that it would never happen that multiple sequences try to >>>> enable or disable the same power domain in parallel. >>>> >>>> Fixes: 59b644b01cf4 ("soc: mediatek: Add MediaTek SCPSYS power >>>> domains") >>>> Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> >>> >>> I don't think that it's a race between genpd_power_on() and >>> genpd_power_off() calls >>> at all, because genpd *does* have locking after all... at least for >>> probe and for >>> parents of a power domain (and more anyway). >>> >>> As far as I remember, what happens when you start .probe()'ing a >>> device is: >>> platform_probe() -> dev_pm_domain_attach() -> genpd_dev_pm_attach() >>> >>> There, you end up with >>> >>> if (power_on) { >>> genpd_lock(pd); >>> ret = genpd_power_on(pd, 0); >>> genpd_unlock(pd); >>> } >>> >>> ...but when you fail probing, you go with genpd_dev_pm_detach(), >>> which then calls >>> >>> /* Check if PM domain can be powered off after removing this >>> device. */ >>> genpd_queue_power_off_work(pd); >>> >>> but even then, you end up being in a worker doing >>> >>> genpd_lock(genpd); >>> genpd_power_off(genpd, false, 0); >>> genpd_unlock(genpd); >>> >>> ...so I don't understand why this mutex can resolve the situation >>> here (also: are >>> you really sure that the race is solved like that?) >>> >>> I'd say that this probably needs more justification and a trace of >>> the actual >>> situation here. >>> >>> Besides, if this really resolves the issue, I would prefer seeing >>> variants of >>> scpsys_power_{on,off}() functions, because we anyway don't need to >>> lock mutexes >>> during this driver's probe (add_subdomain calls scpsys_power_on()). >>> In that case, `scpsys_power_on_unlocked()` would be an idea... but >>> still, please >>> analyze why your solution works, if it does, because I'm not convinced. >> >> What I see in my tests, is that a power on call for SSUSB domain >> happens while the previous power off sequence did not yet complete, >> most likely while it's waiting in readx_poll_timeout . This leads to >> inconsistency of the power domain, not getting the ACKs next time a >> power on attempt occurs. >> >> I understand what you say about locks, but in this case the powering >> off is not called by the genpd itself, but rather it's called by the >> rollback probe failed mechanism : when the probing fails, >> scpsys_domain_cleanup() is called during the same probing session. >> Then it happens that probing begins again and previous cleanup is not >> yet completed. I am not sure whether the lock is still held from the >> previous run, but it's clearly not waiting for a lock to be released >> to be called again. >> > > Sorry but I'm a bit lost now: is the problem about probe deferrals of > the USB > driver, or about probe deferrals of the mtk-pm-domains driver? > > scpsys_domain_cleanup() is only called upon scpsys_probe() failure. You are right, my explanation was bad. It happens during the mtk-pm-domains driver probe. Not all domains can power up, then everything is rolled back. and this happens multiple times On rare occasions, it happens that another probing sequence starts while the previous one was not finished . I mentioned devices because I had in mind the fact that each device requires a power domain, and parallel probing of these devices causes a call to mtk-pm-domains driver probe to be called from two different places. e.g. device 1 probes -> call mtk-pm-domains probe because it requires X power domain device 2 probes -> call mtk-pm-domains probe because it requires Y power domain. First attempt fails but not completed while second attempt starts. Maybe this is a better explanation of the situation ? > >>> >>>> --- >>>> drivers/pmdomain/mediatek/mtk-pm-domains.c | 24 >>>> +++++++++++++++++----- >>>> 1 file changed, 19 insertions(+), 5 deletions(-) >>>> >>>> diff --git a/drivers/pmdomain/mediatek/mtk-pm-domains.c >>>> b/drivers/pmdomain/mediatek/mtk-pm-domains.c >>>> index d5f0ee05c794..4f136b47e539 100644 >>>> --- a/drivers/pmdomain/mediatek/mtk-pm-domains.c >>>> +++ b/drivers/pmdomain/mediatek/mtk-pm-domains.c >>>> @@ -9,6 +9,7 @@ >>>> #include <linux/io.h> >>>> #include <linux/iopoll.h> >>>> #include <linux/mfd/syscon.h> >>>> +#include <linux/mutex.h> >>>> #include <linux/of.h> >>>> #include <linux/of_clk.h> >>>> #include <linux/platform_device.h> >>>> @@ -56,6 +57,7 @@ struct scpsys { >>>> struct device *dev; >>>> struct regmap *base; >>>> const struct scpsys_soc_data *soc_data; >>>> + struct mutex mutex; >>>> struct genpd_onecell_data pd_data; >>>> struct generic_pm_domain *domains[]; >>>> }; >>>> @@ -238,9 +240,13 @@ static int scpsys_power_on(struct >>>> generic_pm_domain *genpd) >>>> bool tmp; >>>> int ret; >>>> + mutex_lock(&scpsys->mutex); >>>> + >>>> ret = scpsys_regulator_enable(pd->supply); >>>> - if (ret) >>>> + if (ret) { >>>> + mutex_unlock(&scpsys->mutex); >>>> return ret; >>>> + } >>>> ret = clk_bulk_prepare_enable(pd->num_clks, pd->clks); >>>> if (ret) >>>> @@ -291,6 +297,7 @@ static int scpsys_power_on(struct >>>> generic_pm_domain *genpd) >>>> goto err_enable_bus_protect; >>>> } >>>> + mutex_unlock(&scpsys->mutex); >>>> return 0; >>>> err_enable_bus_protect: >>>> @@ -305,6 +312,7 @@ static int scpsys_power_on(struct >>>> generic_pm_domain *genpd) >>>> clk_bulk_disable_unprepare(pd->num_clks, pd->clks); >>>> err_reg: >>>> scpsys_regulator_disable(pd->supply); >>>> + mutex_unlock(&scpsys->mutex); >>>> return ret; >>>> } >>>> @@ -315,13 +323,15 @@ static int scpsys_power_off(struct >>>> generic_pm_domain *genpd) >>>> bool tmp; >>>> int ret; >>>> + mutex_lock(&scpsys->mutex); >>>> + >>>> ret = scpsys_bus_protect_enable(pd); >>>> if (ret < 0) >>>> - return ret; >>>> + goto err_mutex_unlock; >>>> ret = scpsys_sram_disable(pd); >>>> if (ret < 0) >>>> - return ret; >>>> + goto err_mutex_unlock; >>>> if (pd->data->ext_buck_iso_offs && MTK_SCPD_CAPS(pd, >>>> MTK_SCPD_EXT_BUCK_ISO)) >>>> regmap_set_bits(scpsys->base, pd->data->ext_buck_iso_offs, >>>> @@ -340,13 +350,15 @@ static int scpsys_power_off(struct >>>> generic_pm_domain *genpd) >>>> ret = readx_poll_timeout(scpsys_domain_is_on, pd, tmp, !tmp, >>>> MTK_POLL_DELAY_US, >>>> MTK_POLL_TIMEOUT); >>>> if (ret < 0) >>>> - return ret; >>>> + goto err_mutex_unlock; >>>> clk_bulk_disable_unprepare(pd->num_clks, pd->clks); >>>> scpsys_regulator_disable(pd->supply); >>>> - return 0; >>>> +err_mutex_unlock: >>>> + mutex_unlock(&scpsys->mutex); >>>> + return ret; >>>> } >>>> static struct >>>> @@ -700,6 +712,8 @@ static int scpsys_probe(struct platform_device >>>> *pdev) >>>> return PTR_ERR(scpsys->base); >>>> } >>>> + mutex_init(&scpsys->mutex); >>>> + >>>> ret = -ENODEV; >>>> for_each_available_child_of_node(np, node) { >>>> struct generic_pm_domain *domain; >>> >
Il 29/11/23 14:48, Eugen Hristev ha scritto: > On 11/29/23 15:37, AngeloGioacchino Del Regno wrote: >> Il 29/11/23 14:28, Eugen Hristev ha scritto: >>> On 11/29/23 14:52, AngeloGioacchino Del Regno wrote: >>>> Il 29/11/23 12:31, Eugen Hristev ha scritto: >>>>> It can happen that during the power off sequence for a power domain >>>>> another power on sequence is started, and it can lead to powering on and >>>>> off in the same time for the similar power domain. >>>>> This can happen if parallel probing occurs: one device starts probing, and >>>>> one power domain is probe deferred, this leads to all power domains being >>>>> rolled back and powered off, while in the same time another device starts >>>>> probing and requests powering on the same power domains or similar. >>>>> >>>>> This was encountered on MT8186, when the sequence is : >>>>> Power on SSUSB >>>>> Power on SSUSB_P1 >>>>> Power on DIS >>>>> -> probe deferred >>>>> Power off DIS >>>>> Power off SSUSB_P1 >>>>> Power off SSUSB >>>>> >>>>> During the sequence of powering off SSUSB, some new similar sequence starts, >>>>> and during the power on of SSUSB, clocks are enabled. >>>>> In this case, powering off SSUSB fails from the first sequence, because >>>>> power off ACK bit check times out (as clocks are powered back on by the second >>>>> sequence). In consequence, powering it on also times out, and it leads to >>>>> the whole power domain in a bad state. >>>>> >>>>> To solve this issue, added a mutex that locks the whole power off/power on >>>>> sequence such that it would never happen that multiple sequences try to >>>>> enable or disable the same power domain in parallel. >>>>> >>>>> Fixes: 59b644b01cf4 ("soc: mediatek: Add MediaTek SCPSYS power domains") >>>>> Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> >>>> >>>> I don't think that it's a race between genpd_power_on() and genpd_power_off() >>>> calls >>>> at all, because genpd *does* have locking after all... at least for probe and for >>>> parents of a power domain (and more anyway). >>>> >>>> As far as I remember, what happens when you start .probe()'ing a device is: >>>> platform_probe() -> dev_pm_domain_attach() -> genpd_dev_pm_attach() >>>> >>>> There, you end up with >>>> >>>> if (power_on) { >>>> genpd_lock(pd); >>>> ret = genpd_power_on(pd, 0); >>>> genpd_unlock(pd); >>>> } >>>> >>>> ...but when you fail probing, you go with genpd_dev_pm_detach(), which then calls >>>> >>>> /* Check if PM domain can be powered off after removing this device. */ >>>> genpd_queue_power_off_work(pd); >>>> >>>> but even then, you end up being in a worker doing >>>> >>>> genpd_lock(genpd); >>>> genpd_power_off(genpd, false, 0); >>>> genpd_unlock(genpd); >>>> >>>> ...so I don't understand why this mutex can resolve the situation here (also: are >>>> you really sure that the race is solved like that?) >>>> >>>> I'd say that this probably needs more justification and a trace of the actual >>>> situation here. >>>> >>>> Besides, if this really resolves the issue, I would prefer seeing variants of >>>> scpsys_power_{on,off}() functions, because we anyway don't need to lock mutexes >>>> during this driver's probe (add_subdomain calls scpsys_power_on()). >>>> In that case, `scpsys_power_on_unlocked()` would be an idea... but still, please >>>> analyze why your solution works, if it does, because I'm not convinced. >>> >>> What I see in my tests, is that a power on call for SSUSB domain happens while >>> the previous power off sequence did not yet complete, most likely while it's >>> waiting in readx_poll_timeout . This leads to inconsistency of the power domain, >>> not getting the ACKs next time a power on attempt occurs. >>> >>> I understand what you say about locks, but in this case the powering off is not >>> called by the genpd itself, but rather it's called by the rollback probe failed >>> mechanism : when the probing fails, scpsys_domain_cleanup() is called during the >>> same probing session. >>> Then it happens that probing begins again and previous cleanup is not yet >>> completed. I am not sure whether the lock is still held from the previous run, >>> but it's clearly not waiting for a lock to be released to be called again. >>> >> >> Sorry but I'm a bit lost now: is the problem about probe deferrals of the USB >> driver, or about probe deferrals of the mtk-pm-domains driver? >> >> scpsys_domain_cleanup() is only called upon scpsys_probe() failure. > > You are right, my explanation was bad. > > It happens during the mtk-pm-domains driver probe. > > Not all domains can power up, then everything is rolled back. and this happens > multiple times > On rare occasions, it happens that another probing sequence starts while the > previous one was not finished . > I mentioned devices because I had in mind the fact that each device requires a > power domain, and parallel probing of these devices causes a call to mtk-pm-domains > driver probe to be called from two different places. > > e.g. device 1 probes -> call mtk-pm-domains probe because it requires X power domain > > device 2 probes -> call mtk-pm-domains probe because it requires Y power domain. > > First attempt fails but not completed while second attempt starts. > > Maybe this is a better explanation of the situation ? Yeah, now it's a bit clearer! At this point, I think that you can get away with locking just one path (or two): /* This is the one giving me lots of suspects */ static void scpsys_remove_one_domain(struct scpsys_domain *pd) { int ret; ***lock*** if (scpsys_domain_is_on(pd)) scpsys_power_off(&pd->genpd); ***unlock*** ..... } /* This one as well eventually */ static struct generic_pm_domain *scpsys_add_one_domain(struct scpsys *scpsys, struct device_node *node) { ............... if (MTK_SCPD_CAPS(pd, MTK_SCPD_KEEP_DEFAULT_OFF)) { if (scpsys_domain_is_on(pd)) /* Maybe LOCK this one too? */ dev_warn(scpsys->dev, "%pOF: A default off power domain has been ON\n", node); } else { ** *** lock *** ret = scpsys_power_on(&pd->genpd); ** *** unlock *** if (ret < 0) { dev_err(scpsys->dev, "%pOF: failed to power on domain: %d\n", node, ret); goto err_put_subsys_clocks; } if (MTK_SCPD_CAPS(pd, MTK_SCPD_ALWAYS_ON)) pd->genpd.flags |= GENPD_FLAG_ALWAYS_ON; } .......... } Can you please try locking only the remove_one_domain() poweroff call before trying both? Reason is that in the add_one_domain() case, we haven't registered the power domain yet, so locking may not be required there to make things ticking right. Cheers! >> >>>> >>>>> --- >>>>> drivers/pmdomain/mediatek/mtk-pm-domains.c | 24 +++++++++++++++++----- >>>>> 1 file changed, 19 insertions(+), 5 deletions(-) >>>>> >>>>> diff --git a/drivers/pmdomain/mediatek/mtk-pm-domains.c >>>>> b/drivers/pmdomain/mediatek/mtk-pm-domains.c >>>>> index d5f0ee05c794..4f136b47e539 100644 >>>>> --- a/drivers/pmdomain/mediatek/mtk-pm-domains.c >>>>> +++ b/drivers/pmdomain/mediatek/mtk-pm-domains.c >>>>> @@ -9,6 +9,7 @@ >>>>> #include <linux/io.h> >>>>> #include <linux/iopoll.h> >>>>> #include <linux/mfd/syscon.h> >>>>> +#include <linux/mutex.h> >>>>> #include <linux/of.h> >>>>> #include <linux/of_clk.h> >>>>> #include <linux/platform_device.h> >>>>> @@ -56,6 +57,7 @@ struct scpsys { >>>>> struct device *dev; >>>>> struct regmap *base; >>>>> const struct scpsys_soc_data *soc_data; >>>>> + struct mutex mutex; >>>>> struct genpd_onecell_data pd_data; >>>>> struct generic_pm_domain *domains[]; >>>>> }; >>>>> @@ -238,9 +240,13 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) >>>>> bool tmp; >>>>> int ret; >>>>> + mutex_lock(&scpsys->mutex); >>>>> + >>>>> ret = scpsys_regulator_enable(pd->supply); >>>>> - if (ret) >>>>> + if (ret) { >>>>> + mutex_unlock(&scpsys->mutex); >>>>> return ret; >>>>> + } >>>>> ret = clk_bulk_prepare_enable(pd->num_clks, pd->clks); >>>>> if (ret) >>>>> @@ -291,6 +297,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) >>>>> goto err_enable_bus_protect; >>>>> } >>>>> + mutex_unlock(&scpsys->mutex); >>>>> return 0; >>>>> err_enable_bus_protect: >>>>> @@ -305,6 +312,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) >>>>> clk_bulk_disable_unprepare(pd->num_clks, pd->clks); >>>>> err_reg: >>>>> scpsys_regulator_disable(pd->supply); >>>>> + mutex_unlock(&scpsys->mutex); >>>>> return ret; >>>>> } >>>>> @@ -315,13 +323,15 @@ static int scpsys_power_off(struct generic_pm_domain >>>>> *genpd) >>>>> bool tmp; >>>>> int ret; >>>>> + mutex_lock(&scpsys->mutex); >>>>> + >>>>> ret = scpsys_bus_protect_enable(pd); >>>>> if (ret < 0) >>>>> - return ret; >>>>> + goto err_mutex_unlock; >>>>> ret = scpsys_sram_disable(pd); >>>>> if (ret < 0) >>>>> - return ret; >>>>> + goto err_mutex_unlock; >>>>> if (pd->data->ext_buck_iso_offs && MTK_SCPD_CAPS(pd, >>>>> MTK_SCPD_EXT_BUCK_ISO)) >>>>> regmap_set_bits(scpsys->base, pd->data->ext_buck_iso_offs, >>>>> @@ -340,13 +350,15 @@ static int scpsys_power_off(struct generic_pm_domain >>>>> *genpd) >>>>> ret = readx_poll_timeout(scpsys_domain_is_on, pd, tmp, !tmp, >>>>> MTK_POLL_DELAY_US, >>>>> MTK_POLL_TIMEOUT); >>>>> if (ret < 0) >>>>> - return ret; >>>>> + goto err_mutex_unlock; >>>>> clk_bulk_disable_unprepare(pd->num_clks, pd->clks); >>>>> scpsys_regulator_disable(pd->supply); >>>>> - return 0; >>>>> +err_mutex_unlock: >>>>> + mutex_unlock(&scpsys->mutex); >>>>> + return ret; >>>>> } >>>>> static struct >>>>> @@ -700,6 +712,8 @@ static int scpsys_probe(struct platform_device *pdev) >>>>> return PTR_ERR(scpsys->base); >>>>> } >>>>> + mutex_init(&scpsys->mutex); >>>>> + >>>>> ret = -ENODEV; >>>>> for_each_available_child_of_node(np, node) { >>>>> struct generic_pm_domain *domain; >>>> >>
On 11/29/23 16:41, AngeloGioacchino Del Regno wrote: > Il 29/11/23 14:48, Eugen Hristev ha scritto: >> On 11/29/23 15:37, AngeloGioacchino Del Regno wrote: >>> Il 29/11/23 14:28, Eugen Hristev ha scritto: >>>> On 11/29/23 14:52, AngeloGioacchino Del Regno wrote: >>>>> Il 29/11/23 12:31, Eugen Hristev ha scritto: >>>>>> It can happen that during the power off sequence for a power domain >>>>>> another power on sequence is started, and it can lead to powering on and >>>>>> off in the same time for the similar power domain. >>>>>> This can happen if parallel probing occurs: one device starts probing, and >>>>>> one power domain is probe deferred, this leads to all power domains being >>>>>> rolled back and powered off, while in the same time another device starts >>>>>> probing and requests powering on the same power domains or similar. >>>>>> >>>>>> This was encountered on MT8186, when the sequence is : >>>>>> Power on SSUSB >>>>>> Power on SSUSB_P1 >>>>>> Power on DIS >>>>>> -> probe deferred >>>>>> Power off DIS >>>>>> Power off SSUSB_P1 >>>>>> Power off SSUSB >>>>>> >>>>>> During the sequence of powering off SSUSB, some new similar sequence starts, >>>>>> and during the power on of SSUSB, clocks are enabled. >>>>>> In this case, powering off SSUSB fails from the first sequence, because >>>>>> power off ACK bit check times out (as clocks are powered back on by the second >>>>>> sequence). In consequence, powering it on also times out, and it leads to >>>>>> the whole power domain in a bad state. >>>>>> >>>>>> To solve this issue, added a mutex that locks the whole power off/power on >>>>>> sequence such that it would never happen that multiple sequences try to >>>>>> enable or disable the same power domain in parallel. >>>>>> >>>>>> Fixes: 59b644b01cf4 ("soc: mediatek: Add MediaTek SCPSYS power domains") >>>>>> Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> >>>>> >>>>> I don't think that it's a race between genpd_power_on() and genpd_power_off() >>>>> calls >>>>> at all, because genpd *does* have locking after all... at least for probe and for >>>>> parents of a power domain (and more anyway). >>>>> >>>>> As far as I remember, what happens when you start .probe()'ing a device is: >>>>> platform_probe() -> dev_pm_domain_attach() -> genpd_dev_pm_attach() >>>>> >>>>> There, you end up with >>>>> >>>>> if (power_on) { >>>>> genpd_lock(pd); >>>>> ret = genpd_power_on(pd, 0); >>>>> genpd_unlock(pd); >>>>> } >>>>> >>>>> ...but when you fail probing, you go with genpd_dev_pm_detach(), which then calls >>>>> >>>>> /* Check if PM domain can be powered off after removing this device. */ >>>>> genpd_queue_power_off_work(pd); >>>>> >>>>> but even then, you end up being in a worker doing >>>>> >>>>> genpd_lock(genpd); >>>>> genpd_power_off(genpd, false, 0); >>>>> genpd_unlock(genpd); >>>>> >>>>> ...so I don't understand why this mutex can resolve the situation here (also: are >>>>> you really sure that the race is solved like that?) >>>>> >>>>> I'd say that this probably needs more justification and a trace of the actual >>>>> situation here. >>>>> >>>>> Besides, if this really resolves the issue, I would prefer seeing variants of >>>>> scpsys_power_{on,off}() functions, because we anyway don't need to lock mutexes >>>>> during this driver's probe (add_subdomain calls scpsys_power_on()). >>>>> In that case, `scpsys_power_on_unlocked()` would be an idea... but still, please >>>>> analyze why your solution works, if it does, because I'm not convinced. >>>> >>>> What I see in my tests, is that a power on call for SSUSB domain happens while >>>> the previous power off sequence did not yet complete, most likely while it's >>>> waiting in readx_poll_timeout . This leads to inconsistency of the power domain, >>>> not getting the ACKs next time a power on attempt occurs. >>>> >>>> I understand what you say about locks, but in this case the powering off is not >>>> called by the genpd itself, but rather it's called by the rollback probe failed >>>> mechanism : when the probing fails, scpsys_domain_cleanup() is called during the >>>> same probing session. >>>> Then it happens that probing begins again and previous cleanup is not yet >>>> completed. I am not sure whether the lock is still held from the previous run, >>>> but it's clearly not waiting for a lock to be released to be called again. >>>> >>> >>> Sorry but I'm a bit lost now: is the problem about probe deferrals of the USB >>> driver, or about probe deferrals of the mtk-pm-domains driver? >>> >>> scpsys_domain_cleanup() is only called upon scpsys_probe() failure. >> >> You are right, my explanation was bad. >> >> It happens during the mtk-pm-domains driver probe. >> >> Not all domains can power up, then everything is rolled back. and this happens >> multiple times >> On rare occasions, it happens that another probing sequence starts while the >> previous one was not finished . >> I mentioned devices because I had in mind the fact that each device requires a >> power domain, and parallel probing of these devices causes a call to mtk-pm-domains >> driver probe to be called from two different places. >> >> e.g. device 1 probes -> call mtk-pm-domains probe because it requires X power domain >> >> device 2 probes -> call mtk-pm-domains probe because it requires Y power domain. >> >> First attempt fails but not completed while second attempt starts. >> >> Maybe this is a better explanation of the situation ? > > Yeah, now it's a bit clearer! > > At this point, I think that you can get away with locking just one path (or two): > > /* This is the one giving me lots of suspects */ > static void scpsys_remove_one_domain(struct scpsys_domain *pd) > { > int ret; > > ***lock*** > if (scpsys_domain_is_on(pd)) > scpsys_power_off(&pd->genpd); > ***unlock*** > > ..... > } > > /* This one as well eventually */ > static struct > generic_pm_domain *scpsys_add_one_domain(struct scpsys *scpsys, struct device_node > *node) > { > ............... > > if (MTK_SCPD_CAPS(pd, MTK_SCPD_KEEP_DEFAULT_OFF)) { > > if (scpsys_domain_is_on(pd)) /* Maybe LOCK this one too? */ > > dev_warn(scpsys->dev, > "%pOF: A default off power domain has been ON\n", node); > } else { > ** *** lock *** > ret = scpsys_power_on(&pd->genpd); > ** *** unlock *** > if (ret < 0) { > dev_err(scpsys->dev, "%pOF: failed to power on domain: %d\n", node, ret); > goto err_put_subsys_clocks; > } > > if (MTK_SCPD_CAPS(pd, MTK_SCPD_ALWAYS_ON)) > pd->genpd.flags |= GENPD_FLAG_ALWAYS_ON; > } > > .......... > } > > Can you please try locking only the remove_one_domain() poweroff call before > trying both? > > Reason is that in the add_one_domain() case, we haven't registered the power > domain yet, so locking may not be required there to make things ticking right. We need to have 'critical section' all the access to clocks/registers, so both paths (power on and power off) must be serialized. So a mutex must be taken before entering either of those, otherwise it makes no sense. There must be no path left that leads to power on/ power off that is taken without the mutex. > > Cheers! > >>> >>>>> >>>>>> --- >>>>>> drivers/pmdomain/mediatek/mtk-pm-domains.c | 24 +++++++++++++++++----- >>>>>> 1 file changed, 19 insertions(+), 5 deletions(-) >>>>>> >>>>>> diff --git a/drivers/pmdomain/mediatek/mtk-pm-domains.c >>>>>> b/drivers/pmdomain/mediatek/mtk-pm-domains.c >>>>>> index d5f0ee05c794..4f136b47e539 100644 >>>>>> --- a/drivers/pmdomain/mediatek/mtk-pm-domains.c >>>>>> +++ b/drivers/pmdomain/mediatek/mtk-pm-domains.c >>>>>> @@ -9,6 +9,7 @@ >>>>>> #include <linux/io.h> >>>>>> #include <linux/iopoll.h> >>>>>> #include <linux/mfd/syscon.h> >>>>>> +#include <linux/mutex.h> >>>>>> #include <linux/of.h> >>>>>> #include <linux/of_clk.h> >>>>>> #include <linux/platform_device.h> >>>>>> @@ -56,6 +57,7 @@ struct scpsys { >>>>>> struct device *dev; >>>>>> struct regmap *base; >>>>>> const struct scpsys_soc_data *soc_data; >>>>>> + struct mutex mutex; >>>>>> struct genpd_onecell_data pd_data; >>>>>> struct generic_pm_domain *domains[]; >>>>>> }; >>>>>> @@ -238,9 +240,13 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) >>>>>> bool tmp; >>>>>> int ret; >>>>>> + mutex_lock(&scpsys->mutex); >>>>>> + >>>>>> ret = scpsys_regulator_enable(pd->supply); >>>>>> - if (ret) >>>>>> + if (ret) { >>>>>> + mutex_unlock(&scpsys->mutex); >>>>>> return ret; >>>>>> + } >>>>>> ret = clk_bulk_prepare_enable(pd->num_clks, pd->clks); >>>>>> if (ret) >>>>>> @@ -291,6 +297,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) >>>>>> goto err_enable_bus_protect; >>>>>> } >>>>>> + mutex_unlock(&scpsys->mutex); >>>>>> return 0; >>>>>> err_enable_bus_protect: >>>>>> @@ -305,6 +312,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) >>>>>> clk_bulk_disable_unprepare(pd->num_clks, pd->clks); >>>>>> err_reg: >>>>>> scpsys_regulator_disable(pd->supply); >>>>>> + mutex_unlock(&scpsys->mutex); >>>>>> return ret; >>>>>> } >>>>>> @@ -315,13 +323,15 @@ static int scpsys_power_off(struct generic_pm_domain >>>>>> *genpd) >>>>>> bool tmp; >>>>>> int ret; >>>>>> + mutex_lock(&scpsys->mutex); >>>>>> + >>>>>> ret = scpsys_bus_protect_enable(pd); >>>>>> if (ret < 0) >>>>>> - return ret; >>>>>> + goto err_mutex_unlock; >>>>>> ret = scpsys_sram_disable(pd); >>>>>> if (ret < 0) >>>>>> - return ret; >>>>>> + goto err_mutex_unlock; >>>>>> if (pd->data->ext_buck_iso_offs && MTK_SCPD_CAPS(pd, >>>>>> MTK_SCPD_EXT_BUCK_ISO)) >>>>>> regmap_set_bits(scpsys->base, pd->data->ext_buck_iso_offs, >>>>>> @@ -340,13 +350,15 @@ static int scpsys_power_off(struct generic_pm_domain >>>>>> *genpd) >>>>>> ret = readx_poll_timeout(scpsys_domain_is_on, pd, tmp, !tmp, >>>>>> MTK_POLL_DELAY_US, >>>>>> MTK_POLL_TIMEOUT); >>>>>> if (ret < 0) >>>>>> - return ret; >>>>>> + goto err_mutex_unlock; >>>>>> clk_bulk_disable_unprepare(pd->num_clks, pd->clks); >>>>>> scpsys_regulator_disable(pd->supply); >>>>>> - return 0; >>>>>> +err_mutex_unlock: >>>>>> + mutex_unlock(&scpsys->mutex); >>>>>> + return ret; >>>>>> } >>>>>> static struct >>>>>> @@ -700,6 +712,8 @@ static int scpsys_probe(struct platform_device *pdev) >>>>>> return PTR_ERR(scpsys->base); >>>>>> } >>>>>> + mutex_init(&scpsys->mutex); >>>>>> + >>>>>> ret = -ENODEV; >>>>>> for_each_available_child_of_node(np, node) { >>>>>> struct generic_pm_domain *domain; >>>>> >>> >
diff --git a/drivers/pmdomain/mediatek/mtk-pm-domains.c b/drivers/pmdomain/mediatek/mtk-pm-domains.c index d5f0ee05c794..4f136b47e539 100644 --- a/drivers/pmdomain/mediatek/mtk-pm-domains.c +++ b/drivers/pmdomain/mediatek/mtk-pm-domains.c @@ -9,6 +9,7 @@ #include <linux/io.h> #include <linux/iopoll.h> #include <linux/mfd/syscon.h> +#include <linux/mutex.h> #include <linux/of.h> #include <linux/of_clk.h> #include <linux/platform_device.h> @@ -56,6 +57,7 @@ struct scpsys { struct device *dev; struct regmap *base; const struct scpsys_soc_data *soc_data; + struct mutex mutex; struct genpd_onecell_data pd_data; struct generic_pm_domain *domains[]; }; @@ -238,9 +240,13 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) bool tmp; int ret; + mutex_lock(&scpsys->mutex); + ret = scpsys_regulator_enable(pd->supply); - if (ret) + if (ret) { + mutex_unlock(&scpsys->mutex); return ret; + } ret = clk_bulk_prepare_enable(pd->num_clks, pd->clks); if (ret) @@ -291,6 +297,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) goto err_enable_bus_protect; } + mutex_unlock(&scpsys->mutex); return 0; err_enable_bus_protect: @@ -305,6 +312,7 @@ static int scpsys_power_on(struct generic_pm_domain *genpd) clk_bulk_disable_unprepare(pd->num_clks, pd->clks); err_reg: scpsys_regulator_disable(pd->supply); + mutex_unlock(&scpsys->mutex); return ret; } @@ -315,13 +323,15 @@ static int scpsys_power_off(struct generic_pm_domain *genpd) bool tmp; int ret; + mutex_lock(&scpsys->mutex); + ret = scpsys_bus_protect_enable(pd); if (ret < 0) - return ret; + goto err_mutex_unlock; ret = scpsys_sram_disable(pd); if (ret < 0) - return ret; + goto err_mutex_unlock; if (pd->data->ext_buck_iso_offs && MTK_SCPD_CAPS(pd, MTK_SCPD_EXT_BUCK_ISO)) regmap_set_bits(scpsys->base, pd->data->ext_buck_iso_offs, @@ -340,13 +350,15 @@ static int scpsys_power_off(struct generic_pm_domain *genpd) ret = readx_poll_timeout(scpsys_domain_is_on, pd, tmp, !tmp, MTK_POLL_DELAY_US, MTK_POLL_TIMEOUT); if (ret < 0) - return ret; + goto err_mutex_unlock; clk_bulk_disable_unprepare(pd->num_clks, pd->clks); scpsys_regulator_disable(pd->supply); - return 0; +err_mutex_unlock: + mutex_unlock(&scpsys->mutex); + return ret; } static struct @@ -700,6 +712,8 @@ static int scpsys_probe(struct platform_device *pdev) return PTR_ERR(scpsys->base); } + mutex_init(&scpsys->mutex); + ret = -ENODEV; for_each_available_child_of_node(np, node) { struct generic_pm_domain *domain;