Message ID | 20230329140445.2180662-1-konrad.dybcio@linaro.org |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:b0ea:0:b0:3b6:4342:cba0 with SMTP id b10csp449255vqo; Wed, 29 Mar 2023 07:19:15 -0700 (PDT) X-Google-Smtp-Source: AKy350bEdvw2nxEouBBdVwHoGaCiIFSJ8x7VCv9BScVkyqjT8oJ7n+IbxV9EIB+HjAWI/NOmVZwr X-Received: by 2002:aa7:d313:0:b0:4f9:e6f1:5c7c with SMTP id p19-20020aa7d313000000b004f9e6f15c7cmr19913065edq.32.1680099555194; Wed, 29 Mar 2023 07:19:15 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1680099555; cv=none; d=google.com; s=arc-20160816; b=xJjw0GAVhI123/NSqZ7WgMfldmBjD8+WpkNoLyeo3m5qK9UXYWuEREH7eUqn1BRGSq 1t3tkk4Bgain5vC7W9qt5/elmjM3sT8pOwY6swhlnwq3J86GEmxLsz3pZe+mnIjKCQ8u sdIjEY5XtT5SEw92sRFB+UZImkk9rNlysSbxbeKEO4fQvkNQPMIcUKIbEVFOjbPGq7pO dbs9ZOS8DzWT4BjV6B35VBwK7Tfhp0EG86+HMveHliXNCx8y6HgbCDhtZjlAP71iyYK7 Q3QNkNVdcOyVqoJkO3q/owb/LHaHqkp8kYPqWoMTJFdvJJFljwggOWROUEg0cG5+UsTW ZLCA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=RHJ0WvwFhiwkNTXEVFAcPy6OADOzAA2uvcs+k52iD3M=; b=WeEo00XVSQKsRyRMFttsPOBYYBZuubbVZfK8XGKzryz5t1A61u5r0wwjWAt1r47eSd ZNkCZKVE7tmVM3NwAWx8/pqXhcSyOTHJju00Qz1LBxA8nQy7sqcE4CSLvinejH2I6mdW UJ0d0K1WGp6XOkgOTtxgFT013vch9jv3Ze4WT3s0j9806sI1VWGvhBs7HWStrFqmmpPP paIe/j7piOD+aA2DRQEWoH3062J3VZ+lVDEkzNd93pNIe3tS8TkSRLmW67uSIY/b4FAJ paoa741DM5SmsIxEcnO3AMtTui7jzXdX/S3BcMEqkkjEVeIunXCi7wW7nF0TzHFF6NXX S7tg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=BVWxz02B; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id x4-20020a170906710400b009217e23a19esi36225448ejj.599.2023.03.29.07.18.26; Wed, 29 Mar 2023 07:19:15 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=BVWxz02B; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229742AbjC2OGD (ORCPT <rfc822;rua109.linux@gmail.com> + 99 others); Wed, 29 Mar 2023 10:06:03 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40496 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229553AbjC2OGB (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 29 Mar 2023 10:06:01 -0400 Received: from mail-lf1-x131.google.com (mail-lf1-x131.google.com [IPv6:2a00:1450:4864:20::131]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EA55440F8 for <linux-kernel@vger.kernel.org>; Wed, 29 Mar 2023 07:04:52 -0700 (PDT) Received: by mail-lf1-x131.google.com with SMTP id bi9so20338752lfb.12 for <linux-kernel@vger.kernel.org>; Wed, 29 Mar 2023 07:04:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; t=1680098688; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=RHJ0WvwFhiwkNTXEVFAcPy6OADOzAA2uvcs+k52iD3M=; b=BVWxz02BP62xHNvQqVawNsr8/34+J8NWKHuzN6MPcgZdTfCDdxWgtB/bRHjQHDLYpa 8y4U3Ktra2AcImTgh7g9x+808EFuitW3L2kKfw8kqmdWNc58ySzVoa83d6wEY8VHio6E AqxwvtzXiD9jvDp4f59y8HlrK2tnynYGSvhH0TfOndIdBF4oX1yloBovV8Ml5qk4tyNm Juq82SBIFpKv0zlGiuT5fCdOiA8689mtzuGQOrFi+gzImO6l0ZJA8c4IoQQyUb72dPG0 M97dP58+YmQcnKNItNAho3iXUafJ6//KvH0pFgRgFa0nxfPM/ysTmv9Y2HZ05jm2RsoN C+oA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1680098688; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=RHJ0WvwFhiwkNTXEVFAcPy6OADOzAA2uvcs+k52iD3M=; b=uDd/Wq+y5nIYOox8SMaw0uKvf89lU3+pmnwWKkefF8fT6s3wwxwRFhHvuxvGpVLGTn 8HpkmgFb6kwQotnR0QutTUg/J8nDAaHY1gXBYR8fTcCP630HNolqn/zJMrZASHA0oZJi NAjALbCv4ILI1Fsu5vseGDShcMpHXkgyMTMmFJtOp0tD89BDZe3pI/97gOl6nTaOtReE kMXeT7muc4JxnUZlrI2rU/Z7+2iABVty6FuND6l1ukSlNaQqUUbtwGAhV0P7LUxi2iaZ h3iCLAURmG9ZtopzSgcUjURz9SscJkQcWU0twz2C0B+SP3EnE+oVUmBOkcDSo/MO0GjT N/2w== X-Gm-Message-State: AAQBX9ciCGp6S8hjwfZQl/hAMmQsYpUtQzseKkAQQVXjSq9lg/Q8BupX KLt7xXCiGD+z66mDaFbXDkbx8A== X-Received: by 2002:ac2:4889:0:b0:4e0:61a6:c158 with SMTP id x9-20020ac24889000000b004e061a6c158mr5775860lfc.36.1680098688469; Wed, 29 Mar 2023 07:04:48 -0700 (PDT) Received: from localhost.localdomain (abxj225.neoplus.adsl.tpnet.pl. [83.9.3.225]) by smtp.gmail.com with ESMTPSA id f21-20020ac251b5000000b004eaf2207a33sm3501083lfk.223.2023.03.29.07.04.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Mar 2023 07:04:48 -0700 (PDT) From: Konrad Dybcio <konrad.dybcio@linaro.org> To: linux-arm-msm@vger.kernel.org, andersson@kernel.org, agross@kernel.org Cc: marijn.suijten@somainline.org, Konrad Dybcio <konrad.dybcio@linaro.org>, Rob Clark <robdclark@gmail.com>, Abhinav Kumar <quic_abhinavk@quicinc.com>, Dmitry Baryshkov <dmitry.baryshkov@linaro.org>, Sean Paul <sean@poorly.run>, David Airlie <airlied@gmail.com>, Daniel Vetter <daniel@ffwll.ch>, Johan Hovold <johan+linaro@kernel.org>, Akhil P Oommen <quic_akhilpo@quicinc.com>, "Joel Fernandes (Google)" <joel@joelfernandes.org>, Nathan Chancellor <nathan@kernel.org>, dri-devel@lists.freedesktop.org, freedreno@lists.freedesktop.org, linux-kernel@vger.kernel.org Subject: [PATCH] drm/msm/adreno: adreno_gpu: Use suspend() instead of idle() on load error Date: Wed, 29 Mar 2023 16:04:44 +0200 Message-Id: <20230329140445.2180662-1-konrad.dybcio@linaro.org> X-Mailer: git-send-email 2.40.0 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.2 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1761712070965049878?= X-GMAIL-MSGID: =?utf-8?q?1761712070965049878?= |
Series |
drm/msm/adreno: adreno_gpu: Use suspend() instead of idle() on load error
|
|
Commit Message
Konrad Dybcio
March 29, 2023, 2:04 p.m. UTC
If we fail to initialize the GPU for whatever reason (say we don't
embed the GPU firmware files in the initrd), the error path involves
pm_runtime_put_sync() which then calls idle() instead of suspend().
This is suboptimal, as it means that we're not going through the
clean shutdown sequence. With at least A619_holi, this makes the GPU
not wake up until it goes through at least one more start-fail-stop
cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean
shutdown.
Test cases:
1. firmware baked into kernel
2. error loading fw in initrd -> load from rootfs at DE start
Both succeed on A619_holi (SM6375) and A630 (SDM845).
Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load")
Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
---
drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Comments
On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote: > If we fail to initialize the GPU for whatever reason (say we don't > embed the GPU firmware files in the initrd), the error path involves > pm_runtime_put_sync() which then calls idle() instead of suspend(). > > This is suboptimal, as it means that we're not going through the > clean shutdown sequence. With at least A619_holi, this makes the GPU > not wake up until it goes through at least one more start-fail-stop > cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean > shutdown. This does not sound right. If pm_runtime_put_sync() fails to suspend the device when the usage count drops to zero, then you have a bug somewhere else. Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware before bringing up the hardware") the firmware is loaded before even hitting these paths so the above description does not sound right in that respect either (or is missing some details). > Test cases: > 1. firmware baked into kernel > 2. error loading fw in initrd -> load from rootfs at DE start > > Both succeed on A619_holi (SM6375) and A630 (SDM845). > > Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load") > Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org> > --- > drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c > index f61896629be6..59f3302e8167 100644 > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c > @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev) > return gpu; > > err_put_rpm: > - pm_runtime_put_sync(&pdev->dev); > + pm_runtime_put_sync_suspend(&pdev->dev); > err_disable_rpm: > pm_runtime_disable(&pdev->dev); Johan
On 29.03.2023 16:37, Johan Hovold wrote: > On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote: >> If we fail to initialize the GPU for whatever reason (say we don't >> embed the GPU firmware files in the initrd), the error path involves >> pm_runtime_put_sync() which then calls idle() instead of suspend(). >> >> This is suboptimal, as it means that we're not going through the >> clean shutdown sequence. With at least A619_holi, this makes the GPU >> not wake up until it goes through at least one more start-fail-stop >> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean >> shutdown. > > This does not sound right. If pm_runtime_put_sync() fails to suspend the > device when the usage count drops to zero, then you have a bug somewhere > else. I was surprised to see that it was not called as well, but I wasn't able to track it down before.. > > Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware > before bringing up the hardware") the firmware is loaded before even > hitting these paths so the above description does not sound right in > that respect either (or is missing some details). ..but I did some more digging and I found that the precise "firmware" that fails is the ZAP blob, which is not checked like SQE in the commit you mentioned! Now I don't think that we can easily check for it as-is since zap_shader_load_mdt() does the entire find-load-authenticate dance which is required with secure assets, but it's obviously possible to rip out the find-load part of that and go on from there. Do you think that would be a better solution? Konrad > >> Test cases: >> 1. firmware baked into kernel >> 2. error loading fw in initrd -> load from rootfs at DE start >> >> Both succeed on A619_holi (SM6375) and A630 (SDM845). >> >> Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load") >> Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org> >> --- >> drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c >> index f61896629be6..59f3302e8167 100644 >> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c >> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c >> @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev) >> return gpu; >> >> err_put_rpm: >> - pm_runtime_put_sync(&pdev->dev); >> + pm_runtime_put_sync_suspend(&pdev->dev); >> err_disable_rpm: >> pm_runtime_disable(&pdev->dev); > > Johan
On Wed, Mar 29, 2023 at 8:48 AM Konrad Dybcio <konrad.dybcio@linaro.org> wrote: > > > > On 29.03.2023 16:37, Johan Hovold wrote: > > On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote: > >> If we fail to initialize the GPU for whatever reason (say we don't > >> embed the GPU firmware files in the initrd), the error path involves > >> pm_runtime_put_sync() which then calls idle() instead of suspend(). > >> > >> This is suboptimal, as it means that we're not going through the > >> clean shutdown sequence. With at least A619_holi, this makes the GPU > >> not wake up until it goes through at least one more start-fail-stop > >> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean > >> shutdown. > > > > This does not sound right. If pm_runtime_put_sync() fails to suspend the > > device when the usage count drops to zero, then you have a bug somewhere > > else. > I was surprised to see that it was not called as well, but I wasn't able > to track it down before.. > > > > > Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware > > before bringing up the hardware") the firmware is loaded before even > > hitting these paths so the above description does not sound right in > > that respect either (or is missing some details). > ..but I did some more digging and I found that the precise "firmware" > that fails is the ZAP blob, which is not checked like SQE in the > commit you mentioned! > > Now I don't think that we can easily check for it as-is since > zap_shader_load_mdt() does the entire find-load-authenticate > dance which is required with secure assets, but it's obviously > possible to rip out the find-load part of that and go on from > there. > > Do you think that would be a better solution? Hmm, to hit this it sounds like you'd need all the fw _except_ the zap in the initrd? BR, -R > Konrad > > > > >> Test cases: > >> 1. firmware baked into kernel > >> 2. error loading fw in initrd -> load from rootfs at DE start > >> > >> Both succeed on A619_holi (SM6375) and A630 (SDM845). > >> > >> Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load") > >> Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org> > >> --- > >> drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c > >> index f61896629be6..59f3302e8167 100644 > >> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c > >> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c > >> @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev) > >> return gpu; > >> > >> err_put_rpm: > >> - pm_runtime_put_sync(&pdev->dev); > >> + pm_runtime_put_sync_suspend(&pdev->dev); > >> err_disable_rpm: > >> pm_runtime_disable(&pdev->dev); > > > > Johan
On 29.03.2023 19:30, Rob Clark wrote: > On Wed, Mar 29, 2023 at 8:48 AM Konrad Dybcio <konrad.dybcio@linaro.org> wrote: >> >> >> >> On 29.03.2023 16:37, Johan Hovold wrote: >>> On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote: >>>> If we fail to initialize the GPU for whatever reason (say we don't >>>> embed the GPU firmware files in the initrd), the error path involves >>>> pm_runtime_put_sync() which then calls idle() instead of suspend(). >>>> >>>> This is suboptimal, as it means that we're not going through the >>>> clean shutdown sequence. With at least A619_holi, this makes the GPU >>>> not wake up until it goes through at least one more start-fail-stop >>>> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean >>>> shutdown. >>> >>> This does not sound right. If pm_runtime_put_sync() fails to suspend the >>> device when the usage count drops to zero, then you have a bug somewhere >>> else. >> I was surprised to see that it was not called as well, but I wasn't able >> to track it down before.. >> >>> >>> Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware >>> before bringing up the hardware") the firmware is loaded before even >>> hitting these paths so the above description does not sound right in >>> that respect either (or is missing some details). >> ..but I did some more digging and I found that the precise "firmware" >> that fails is the ZAP blob, which is not checked like SQE in the >> commit you mentioned! >> >> Now I don't think that we can easily check for it as-is since >> zap_shader_load_mdt() does the entire find-load-authenticate >> dance which is required with secure assets, but it's obviously >> possible to rip out the find-load part of that and go on from >> there. >> >> Do you think that would be a better solution? > > Hmm, to hit this it sounds like you'd need all the fw _except_ the zap > in the initrd? Correct. Konrad > > BR, > -R > >> Konrad >> >>> >>>> Test cases: >>>> 1. firmware baked into kernel >>>> 2. error loading fw in initrd -> load from rootfs at DE start >>>> >>>> Both succeed on A619_holi (SM6375) and A630 (SDM845). >>>> >>>> Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load") >>>> Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org> >>>> --- >>>> drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +- >>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c >>>> index f61896629be6..59f3302e8167 100644 >>>> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c >>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c >>>> @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev) >>>> return gpu; >>>> >>>> err_put_rpm: >>>> - pm_runtime_put_sync(&pdev->dev); >>>> + pm_runtime_put_sync_suspend(&pdev->dev); >>>> err_disable_rpm: >>>> pm_runtime_disable(&pdev->dev); >>> >>> Johan
On Wed, 29 Mar 2023 at 18:48, Konrad Dybcio <konrad.dybcio@linaro.org> wrote: > > > > On 29.03.2023 16:37, Johan Hovold wrote: > > On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote: > >> If we fail to initialize the GPU for whatever reason (say we don't > >> embed the GPU firmware files in the initrd), the error path involves > >> pm_runtime_put_sync() which then calls idle() instead of suspend(). > >> > >> This is suboptimal, as it means that we're not going through the > >> clean shutdown sequence. With at least A619_holi, this makes the GPU > >> not wake up until it goes through at least one more start-fail-stop > >> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean > >> shutdown. > > > > This does not sound right. If pm_runtime_put_sync() fails to suspend the > > device when the usage count drops to zero, then you have a bug somewhere > > else. > I was surprised to see that it was not called as well, but I wasn't able > to track it down before.. Could you please check that it's autosuspend who kicks in? In other words, if we disable autosuspend, the pm_runtime_put_sync is enough()? That would probably mean that we lack some kind of reset in the hw_init path. On the other hand, I do not know how the device will react to the error-in-the-middle state. Modems for example, can enter the state where you can not properly turn it off once it starts the boot process. And if we remember the efforts that Akhil has put into making sure that the GPU is properly reset in case of an _error_, it might be nearly impossible to shut it down in a proper way. Thus said, I think that unless there is an obvious way to restart the init process, Korad's pm_runtime_put_sync_suspend() looks like a correct fix to me. > > Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware > > before bringing up the hardware") the firmware is loaded before even > > hitting these paths so the above description does not sound right in > > that respect either (or is missing some details). > ..but I did some more digging and I found that the precise "firmware" > that fails is the ZAP blob, which is not checked like SQE in the > commit you mentioned! > > Now I don't think that we can easily check for it as-is since > zap_shader_load_mdt() does the entire find-load-authenticate > dance which is required with secure assets, but it's obviously > possible to rip out the find-load part of that and go on from > there. Yes, I think we should load all firmware early. ZAP shader is a bit unique since the DT can override the name, but it might be nice to check for its presence earlier. At the same time it probably should not stop us from fixing the idle() vs suspend() bug. > > Do you think that would be a better solution? > > Konrad > > > > >> Test cases: > >> 1. firmware baked into kernel > >> 2. error loading fw in initrd -> load from rootfs at DE start > >> > >> Both succeed on A619_holi (SM6375) and A630 (SDM845). > >> > >> Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load") > >> Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org> > >> --- > >> drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c > >> index f61896629be6..59f3302e8167 100644 > >> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c > >> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c > >> @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev) > >> return gpu; > >> > >> err_put_rpm: > >> - pm_runtime_put_sync(&pdev->dev); > >> + pm_runtime_put_sync_suspend(&pdev->dev); > >> err_disable_rpm: > >> pm_runtime_disable(&pdev->dev); > > > > Johan
On 29.03.2023 21:45, Dmitry Baryshkov wrote: > On Wed, 29 Mar 2023 at 18:48, Konrad Dybcio <konrad.dybcio@linaro.org> wrote: >> >> >> >> On 29.03.2023 16:37, Johan Hovold wrote: >>> On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote: >>>> If we fail to initialize the GPU for whatever reason (say we don't >>>> embed the GPU firmware files in the initrd), the error path involves >>>> pm_runtime_put_sync() which then calls idle() instead of suspend(). >>>> >>>> This is suboptimal, as it means that we're not going through the >>>> clean shutdown sequence. With at least A619_holi, this makes the GPU >>>> not wake up until it goes through at least one more start-fail-stop >>>> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean >>>> shutdown. >>> >>> This does not sound right. If pm_runtime_put_sync() fails to suspend the >>> device when the usage count drops to zero, then you have a bug somewhere >>> else. >> I was surprised to see that it was not called as well, but I wasn't able >> to track it down before.. > > Could you please check that it's autosuspend who kicks in? In other > words, if we disable autosuspend, the pm_runtime_put_sync is enough()? > > That would probably mean that we lack some kind of reset in the hw_init path. > > On the other hand, I do not know how the device will react to the > error-in-the-middle state. Modems for example, can enter the state > where you can not properly turn it off once it starts the boot > process. > > And if we remember the efforts that Akhil has put into making sure > that the GPU is properly reset in case of an _error_, it might be > nearly impossible to shut it down in a proper way. > > Thus said, I think that unless there is an obvious way to restart the > init process, Korad's pm_runtime_put_sync_suspend() looks like a > correct fix to me. On the GPU side, when you cut GX and CX GDSCs, the hardware is off. Some clock / gdsc logic may be retained, but the GPU itself gets cut off. Parking the clocks and shuttting down VDD_GX (if exists) only makes that stronger. > >>> Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware >>> before bringing up the hardware") the firmware is loaded before even >>> hitting these paths so the above description does not sound right in >>> that respect either (or is missing some details). >> ..but I did some more digging and I found that the precise "firmware" >> that fails is the ZAP blob, which is not checked like SQE in the >> commit you mentioned! >> >> Now I don't think that we can easily check for it as-is since >> zap_shader_load_mdt() does the entire find-load-authenticate >> dance which is required with secure assets, but it's obviously >> possible to rip out the find-load part of that and go on from >> there. > > Yes, I think we should load all firmware early. ZAP shader is a bit > unique since the DT can override the name, but it might be nice to > check for its presence earlier. > > At the same time it probably should not stop us from fixing the idle() > vs suspend() bug. I'm open to both solutions, as long as it can unblock me from resubmitting the (hopefully) final version of GMU wrapper! Konrad > >> >> Do you think that would be a better solution? >> >> Konrad >> >>> >>>> Test cases: >>>> 1. firmware baked into kernel >>>> 2. error loading fw in initrd -> load from rootfs at DE start >>>> >>>> Both succeed on A619_holi (SM6375) and A630 (SDM845). >>>> >>>> Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load") >>>> Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org> >>>> --- >>>> drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +- >>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c >>>> index f61896629be6..59f3302e8167 100644 >>>> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c >>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c >>>> @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev) >>>> return gpu; >>>> >>>> err_put_rpm: >>>> - pm_runtime_put_sync(&pdev->dev); >>>> + pm_runtime_put_sync_suspend(&pdev->dev); >>>> err_disable_rpm: >>>> pm_runtime_disable(&pdev->dev); >>> >>> Johan > > >
On Wed, Mar 29, 2023 at 10:45:52PM +0300, Dmitry Baryshkov wrote: > On Wed, 29 Mar 2023 at 18:48, Konrad Dybcio <konrad.dybcio@linaro.org> wrote: > > On 29.03.2023 16:37, Johan Hovold wrote: > > > On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote: > > >> If we fail to initialize the GPU for whatever reason (say we don't > > >> embed the GPU firmware files in the initrd), the error path involves > > >> pm_runtime_put_sync() which then calls idle() instead of suspend(). > > >> > > >> This is suboptimal, as it means that we're not going through the > > >> clean shutdown sequence. With at least A619_holi, this makes the GPU > > >> not wake up until it goes through at least one more start-fail-stop > > >> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean > > >> shutdown. > > > > > > This does not sound right. If pm_runtime_put_sync() fails to suspend the > > > device when the usage count drops to zero, then you have a bug somewhere > > > else. > > I was surprised to see that it was not called as well, but I wasn't able > > to track it down before.. > > Could you please check that it's autosuspend who kicks in? In other > words, if we disable autosuspend, the pm_runtime_put_sync is enough()? Yes, that's it. The runtime PM implementation changed at one point and since you need to disable autosuspend first to actually get synchronous behaviour. My bad. > That would probably mean that we lack some kind of reset in the hw_init path. > > On the other hand, I do not know how the device will react to the > error-in-the-middle state. Modems for example, can enter the state > where you can not properly turn it off once it starts the boot > process. > > And if we remember the efforts that Akhil has put into making sure > that the GPU is properly reset in case of an _error_, it might be > nearly impossible to shut it down in a proper way. > > Thus said, I think that unless there is an obvious way to restart the > init process, Korad's pm_runtime_put_sync_suspend() looks like a > correct fix to me. I'd prefer to fix this by disabling autosuspend, but as that would involve also moving the call to enable autosuspend to this function (and add the missing disable on unbind), Konrad's patch using pm_runtime_put_sync_suspend() is probably the best option for now. I can send a patch to move the autosuspend handling on top. Perhaps you can just amend the commit message to clarify that not all fw is apparently preloaded and also mention that you need to use pm_runtime_put_sync_suspend() due to autosuspend being enabled. Johan
On 30/03/2023 17:34, Konrad Dybcio wrote: > > > On 29.03.2023 21:45, Dmitry Baryshkov wrote: >> On Wed, 29 Mar 2023 at 18:48, Konrad Dybcio <konrad.dybcio@linaro.org> wrote: >>> >>> >>> >>> On 29.03.2023 16:37, Johan Hovold wrote: >>>> On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote: >>>>> If we fail to initialize the GPU for whatever reason (say we don't >>>>> embed the GPU firmware files in the initrd), the error path involves >>>>> pm_runtime_put_sync() which then calls idle() instead of suspend(). >>>>> >>>>> This is suboptimal, as it means that we're not going through the >>>>> clean shutdown sequence. With at least A619_holi, this makes the GPU >>>>> not wake up until it goes through at least one more start-fail-stop >>>>> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean >>>>> shutdown. >>>> >>>> This does not sound right. If pm_runtime_put_sync() fails to suspend the >>>> device when the usage count drops to zero, then you have a bug somewhere >>>> else. >>> I was surprised to see that it was not called as well, but I wasn't able >>> to track it down before.. >> >> Could you please check that it's autosuspend who kicks in? In other >> words, if we disable autosuspend, the pm_runtime_put_sync is enough()? >> >> That would probably mean that we lack some kind of reset in the hw_init path. >> >> On the other hand, I do not know how the device will react to the >> error-in-the-middle state. Modems for example, can enter the state >> where you can not properly turn it off once it starts the boot >> process. >> >> And if we remember the efforts that Akhil has put into making sure >> that the GPU is properly reset in case of an _error_, it might be >> nearly impossible to shut it down in a proper way. >> >> Thus said, I think that unless there is an obvious way to restart the >> init process, Korad's pm_runtime_put_sync_suspend() looks like a >> correct fix to me. > On the GPU side, when you cut GX and CX GDSCs, the hardware is off. > Some clock / gdsc logic may be retained, but the GPU itself gets > cut off. Parking the clocks and shuttting down VDD_GX (if exists) > only makes that stronger. If I remember correctly, GX and CX GPU GDSCs might be voted by other users. Again, I'd direct you here to the series at [1] [1]: https://patchwork.freedesktop.org/series/111966/
diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c index f61896629be6..59f3302e8167 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_device.c +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev) return gpu; err_put_rpm: - pm_runtime_put_sync(&pdev->dev); + pm_runtime_put_sync_suspend(&pdev->dev); err_disable_rpm: pm_runtime_disable(&pdev->dev);