regulator: qcom-rpmh: Revert "regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"

Message ID 20230324063357.1.Ifdf3625a3c5c9467bd87bfcdf726c884ad220a35@changeid
State New
Headers
Series regulator: qcom-rpmh: Revert "regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS" |

Commit Message

Doug Anderson March 24, 2023, 1:34 p.m. UTC
  This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
prompted the us to switch to synchronous probe showed that the root
cause was a missing "rootwait" in the kernel command line
arguments. Let's reinstate asynchronous probe.

Fixes: 58973046c1bf ("regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS")
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 drivers/regulator/qcom-rpmh-regulator.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Andrew Halaney March 24, 2023, 2:20 p.m. UTC | #1
On Fri, Mar 24, 2023 at 06:34:06AM -0700, Douglas Anderson wrote:
> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
> prompted the us to switch to synchronous probe showed that the root
> cause was a missing "rootwait" in the kernel command line
> arguments. Let's reinstate asynchronous probe.
> 
> Fixes: 58973046c1bf ("regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS")
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>

Reviewed-by: Andrew Halaney <ahalaney@redhat.com>

Thanks!
  
Mark Brown March 24, 2023, 4:37 p.m. UTC | #2
On Fri, 24 Mar 2023 06:34:06 -0700, Douglas Anderson wrote:
> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
> prompted the us to switch to synchronous probe showed that the root
> cause was a missing "rootwait" in the kernel command line
> arguments. Let's reinstate asynchronous probe.
> 
> 
> [...]

Applied to

   broonie/regulator.git for-next

Thanks!

[1/1] regulator: qcom-rpmh: Revert "regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"
      commit: ad44ac082fdff7ee57fe125432f7d9d7cb610a23

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark
  
Amit Pundir May 13, 2023, 6:08 p.m. UTC | #3
On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote:
>
> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
> prompted the us to switch to synchronous probe showed that the root
> cause was a missing "rootwait" in the kernel command line
> arguments. Let's reinstate asynchronous probe.

Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
running AOSP (Android Open Source Project) with v6.4-rc1
https://bugs.linaro.org/show_bug.cgi?id=5975.
Can we please go back to synchronous probe.

AOSP do not make use of rootwait, IIRC, but it is added by the
bootloader anyway. And the device fails to boot AOSP regardless of
"rootwait" bootarg being present or not.

FWIW I do not see this regression on RB5 (QRB5165/SM8250) running the
same set of AOSP images.

Regards,
Amit Pundir




>
> Fixes: 58973046c1bf ("regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS")
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
>
>  drivers/regulator/qcom-rpmh-regulator.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/regulator/qcom-rpmh-regulator.c b/drivers/regulator/qcom-rpmh-regulator.c
> index 903032b2875f..4826d60e5d95 100644
> --- a/drivers/regulator/qcom-rpmh-regulator.c
> +++ b/drivers/regulator/qcom-rpmh-regulator.c
> @@ -1462,7 +1462,7 @@ MODULE_DEVICE_TABLE(of, rpmh_regulator_match_table);
>  static struct platform_driver rpmh_regulator_driver = {
>         .driver = {
>                 .name = "qcom-rpmh-regulator",
> -               .probe_type = PROBE_FORCE_SYNCHRONOUS,
> +               .probe_type = PROBE_PREFER_ASYNCHRONOUS,
>                 .of_match_table = of_match_ptr(rpmh_regulator_match_table),
>         },
>         .probe = rpmh_regulator_probe,
> --
> 2.40.0.348.gf938b09366-goog
>
  
Linux regression tracking (Thorsten Leemhuis) May 14, 2023, 11:54 a.m. UTC | #4
[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 13.05.23 20:08, Amit Pundir wrote:
> On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote:
>>
>> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
>> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
>> prompted the us to switch to synchronous probe showed that the root
>> cause was a missing "rootwait" in the kernel command line
>> arguments. Let's reinstate asynchronous probe.
> 
> Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
> running AOSP (Android Open Source Project) with v6.4-rc1
> https://bugs.linaro.org/show_bug.cgi?id=5975.
> Can we please go back to synchronous probe.
> 
> AOSP do not make use of rootwait, IIRC, but it is added by the
> bootloader anyway. And the device fails to boot AOSP regardless of
> "rootwait" bootarg being present or not.
> 
> FWIW I do not see this regression on RB5 (QRB5165/SM8250) running the
> same set of AOSP images.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced ad44ac082fd
#regzbot title regulator: qcom-rpmh: Dragonboard 845c broken due to
asynchronous probe
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

>> Fixes: 58973046c1bf ("regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS")
>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
>> ---
>>
>>  drivers/regulator/qcom-rpmh-regulator.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/regulator/qcom-rpmh-regulator.c b/drivers/regulator/qcom-rpmh-regulator.c
>> index 903032b2875f..4826d60e5d95 100644
>> --- a/drivers/regulator/qcom-rpmh-regulator.c
>> +++ b/drivers/regulator/qcom-rpmh-regulator.c
>> @@ -1462,7 +1462,7 @@ MODULE_DEVICE_TABLE(of, rpmh_regulator_match_table);
>>  static struct platform_driver rpmh_regulator_driver = {
>>         .driver = {
>>                 .name = "qcom-rpmh-regulator",
>> -               .probe_type = PROBE_FORCE_SYNCHRONOUS,
>> +               .probe_type = PROBE_PREFER_ASYNCHRONOUS,
>>                 .of_match_table = of_match_ptr(rpmh_regulator_match_table),
>>         },
>>         .probe = rpmh_regulator_probe,
>> --
>> 2.40.0.348.gf938b09366-goog
>>
  
Caleb Connolly May 14, 2023, 12:41 p.m. UTC | #5
On 13/05/2023 18:08, Amit Pundir wrote:
> On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote:
>>
>> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
>> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
>> prompted the us to switch to synchronous probe showed that the root
>> cause was a missing "rootwait" in the kernel command line
>> arguments. Let's reinstate asynchronous probe.
> 
> Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
> running AOSP (Android Open Source Project) with v6.4-rc1
> https://bugs.linaro.org/show_bug.cgi?id=5975.
> Can we please go back to synchronous probe.
> 
> AOSP do not make use of rootwait, IIRC, but it is added by the
> bootloader anyway. And the device fails to boot AOSP regardless of
> "rootwait" bootarg being present or not.

Could you try applying this diff to enable some log spam and let me know
what you get? I'm keen to try and figure this one out. My mail client
might crunch this a bit so I have pasted it here too
https://p.calebs.dev/ab74b7@raw

diff --git a/drivers/soc/qcom/rpmh-rsc.c b/drivers/soc/qcom/rpmh-rsc.c
index f93544f6d796..67859f1bdb28 100644
--- a/drivers/soc/qcom/rpmh-rsc.c
+++ b/drivers/soc/qcom/rpmh-rsc.c
@@ -653,11 +653,23 @@ int rpmh_rsc_send_data(struct rsc_drv *drv, const
struct tcs_request *msg)

        spin_lock_irqsave(&drv->lock, flags);

+       dev_info(drv->dev, "%s: %p tcs->type=%d state=%d, "
+               "wait_for_compl=%d, num_cmds=%d\n",
+               __func__, msg, tcs->type, msg->state,
+               msg->wait_for_compl, msg->num_cmds);
+       for (int i = 0; i < msg->num_cmds; i++)
+               dev_info(drv->dev, "%s: %p cmd[%d] "
+                       "addr=0x%x data=0x%x\n", __func__,
+                       msg, i, msg->cmds[i].addr, msg->cmds[i].data);
+
        /* Wait forever for a free tcs. It better be there eventually! */
        wait_event_lock_irq(drv->tcs_wait,
                            (tcs_id = claim_tcs_for_req(drv, tcs, msg))
>= 0,
                            drv->lock);

+       dev_info(drv->dev, "%s: %px GOT TCS! %d\n",
+               __func__, msg, tcs_id);
+
        tcs->req[tcs_id - tcs->offset] = msg;
        set_bit(tcs_id, drv->tcs_in_use);
        if (msg->state == RPMH_ACTIVE_ONLY_STATE && tcs->type !=
ACTIVE_TCS) {

> 
> FWIW I do not see this regression on RB5 (QRB5165/SM8250) running the
> same set of AOSP images.
> 
> Regards,
> Amit Pundir
> 
> 
> 
> 
>>
>> Fixes: 58973046c1bf ("regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS")
>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
>> ---
>>
>>  drivers/regulator/qcom-rpmh-regulator.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/regulator/qcom-rpmh-regulator.c b/drivers/regulator/qcom-rpmh-regulator.c
>> index 903032b2875f..4826d60e5d95 100644
>> --- a/drivers/regulator/qcom-rpmh-regulator.c
>> +++ b/drivers/regulator/qcom-rpmh-regulator.c
>> @@ -1462,7 +1462,7 @@ MODULE_DEVICE_TABLE(of, rpmh_regulator_match_table);
>>  static struct platform_driver rpmh_regulator_driver = {
>>         .driver = {
>>                 .name = "qcom-rpmh-regulator",
>> -               .probe_type = PROBE_FORCE_SYNCHRONOUS,
>> +               .probe_type = PROBE_PREFER_ASYNCHRONOUS,
>>                 .of_match_table = of_match_ptr(rpmh_regulator_match_table),
>>         },
>>         .probe = rpmh_regulator_probe,
>> --
>> 2.40.0.348.gf938b09366-goog
>>
  
Mark Brown May 15, 2023, 1:12 a.m. UTC | #6
On Sat, May 13, 2023 at 11:38:22PM +0530, Amit Pundir wrote:

> Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
> running AOSP (Android Open Source Project) with v6.4-rc1
> https://bugs.linaro.org/show_bug.cgi?id=5975.
> Can we please go back to synchronous probe.

Please submit a patch...
  
Amit Pundir May 15, 2023, 2:41 p.m. UTC | #7
On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote:
>
>
>
> On 13/05/2023 18:08, Amit Pundir wrote:
> > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote:
> >>
> >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
> >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
> >> prompted the us to switch to synchronous probe showed that the root
> >> cause was a missing "rootwait" in the kernel command line
> >> arguments. Let's reinstate asynchronous probe.
> >
> > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
> > running AOSP (Android Open Source Project) with v6.4-rc1
> > https://bugs.linaro.org/show_bug.cgi?id=5975.
> > Can we please go back to synchronous probe.
> >
> > AOSP do not make use of rootwait, IIRC, but it is added by the
> > bootloader anyway. And the device fails to boot AOSP regardless of
> > "rootwait" bootarg being present or not.
>
> Could you try applying this diff to enable some log spam and let me know
> what you get? I'm keen to try and figure this one out. My mail client
> might crunch this a bit so I have pasted it here too
> https://p.calebs.dev/ab74b7@raw

These prints add just enough delay for the UFS probe to succeed that I
can't reproduce the failure anymore.

>
> diff --git a/drivers/soc/qcom/rpmh-rsc.c b/drivers/soc/qcom/rpmh-rsc.c
> index f93544f6d796..67859f1bdb28 100644
> --- a/drivers/soc/qcom/rpmh-rsc.c
> +++ b/drivers/soc/qcom/rpmh-rsc.c
> @@ -653,11 +653,23 @@ int rpmh_rsc_send_data(struct rsc_drv *drv, const
> struct tcs_request *msg)
>
>         spin_lock_irqsave(&drv->lock, flags);
>
> +       dev_info(drv->dev, "%s: %p tcs->type=%d state=%d, "
> +               "wait_for_compl=%d, num_cmds=%d\n",
> +               __func__, msg, tcs->type, msg->state,
> +               msg->wait_for_compl, msg->num_cmds);
> +       for (int i = 0; i < msg->num_cmds; i++)
> +               dev_info(drv->dev, "%s: %p cmd[%d] "
> +                       "addr=0x%x data=0x%x\n", __func__,
> +                       msg, i, msg->cmds[i].addr, msg->cmds[i].data);
> +
>         /* Wait forever for a free tcs. It better be there eventually! */
>         wait_event_lock_irq(drv->tcs_wait,
>                             (tcs_id = claim_tcs_for_req(drv, tcs, msg))
> >= 0,
>                             drv->lock);
>
> +       dev_info(drv->dev, "%s: %px GOT TCS! %d\n",
> +               __func__, msg, tcs_id);
> +
>         tcs->req[tcs_id - tcs->offset] = msg;
>         set_bit(tcs_id, drv->tcs_in_use);
>         if (msg->state == RPMH_ACTIVE_ONLY_STATE && tcs->type !=
> ACTIVE_TCS) {
>
> >
> > FWIW I do not see this regression on RB5 (QRB5165/SM8250) running the
> > same set of AOSP images.
> >
> > Regards,
> > Amit Pundir
> >
> >
> >
> >
> >>
> >> Fixes: 58973046c1bf ("regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS")
> >> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> >> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> >> ---
> >>
> >>  drivers/regulator/qcom-rpmh-regulator.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/regulator/qcom-rpmh-regulator.c b/drivers/regulator/qcom-rpmh-regulator.c
> >> index 903032b2875f..4826d60e5d95 100644
> >> --- a/drivers/regulator/qcom-rpmh-regulator.c
> >> +++ b/drivers/regulator/qcom-rpmh-regulator.c
> >> @@ -1462,7 +1462,7 @@ MODULE_DEVICE_TABLE(of, rpmh_regulator_match_table);
> >>  static struct platform_driver rpmh_regulator_driver = {
> >>         .driver = {
> >>                 .name = "qcom-rpmh-regulator",
> >> -               .probe_type = PROBE_FORCE_SYNCHRONOUS,
> >> +               .probe_type = PROBE_PREFER_ASYNCHRONOUS,
> >>                 .of_match_table = of_match_ptr(rpmh_regulator_match_table),
> >>         },
> >>         .probe = rpmh_regulator_probe,
> >> --
> >> 2.40.0.348.gf938b09366-goog
> >>
>
> --
> Kind Regards,
> Caleb (they/them)
  
Doug Anderson May 15, 2023, 3:02 p.m. UTC | #8
Hi,

On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote:
>
> On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote:
> >
> >
> >
> > On 13/05/2023 18:08, Amit Pundir wrote:
> > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote:
> > >>
> > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
> > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
> > >> prompted the us to switch to synchronous probe showed that the root
> > >> cause was a missing "rootwait" in the kernel command line
> > >> arguments. Let's reinstate asynchronous probe.
> > >
> > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
> > > running AOSP (Android Open Source Project) with v6.4-rc1
> > > https://bugs.linaro.org/show_bug.cgi?id=5975.
> > > Can we please go back to synchronous probe.
> > >
> > > AOSP do not make use of rootwait, IIRC, but it is added by the
> > > bootloader anyway. And the device fails to boot AOSP regardless of
> > > "rootwait" bootarg being present or not.
> >
> > Could you try applying this diff to enable some log spam and let me know
> > what you get? I'm keen to try and figure this one out. My mail client
> > might crunch this a bit so I have pasted it here too
> > https://p.calebs.dev/ab74b7@raw
>
> These prints add just enough delay for the UFS probe to succeed that I
> can't reproduce the failure anymore.

I'd prefer doing at least a little debugging before jumping to a
revert. From looking at your dmesg [1], it looks as if the async probe
is allowing RPMH to probe at the same time as "qcom-vadc-common".
That's something that talks on the SPMI bus and is (potentially)
talking to the same PMICs that RPMH-regulator is, right? I'm by no
means an expert on how Qualcomm's PMICs work, but it seems plausible
that the "qcom-vadc-common" is somehow causing problems and screwing
up RPMH. Does that seem plausible to you?

If so, one interesting way to track it down would be to move around
delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that
_won't_ fix the problem. Now put a ~500ms sleep at the beginning of
vadc_probe(). Maybe that will fix the problem? If so, you can move the
delay around to narrow down the conflict. My wild guess would be that
vadc_reset() could be throwing things for a loop?

If the above doesn't work, maybe we could add more tracing / printouts
to see what is probing at the same time as RPMH?


[1] https://bugs.linaro.org/attachment.cgi?id=1135
  
Amit Pundir May 16, 2023, 6:11 p.m. UTC | #9
On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote:
> >
> > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote:
> > >
> > >
> > >
> > > On 13/05/2023 18:08, Amit Pundir wrote:
> > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote:
> > > >>
> > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
> > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
> > > >> prompted the us to switch to synchronous probe showed that the root
> > > >> cause was a missing "rootwait" in the kernel command line
> > > >> arguments. Let's reinstate asynchronous probe.
> > > >
> > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
> > > > running AOSP (Android Open Source Project) with v6.4-rc1
> > > > https://bugs.linaro.org/show_bug.cgi?id=5975.
> > > > Can we please go back to synchronous probe.
> > > >
> > > > AOSP do not make use of rootwait, IIRC, but it is added by the
> > > > bootloader anyway. And the device fails to boot AOSP regardless of
> > > > "rootwait" bootarg being present or not.
> > >
> > > Could you try applying this diff to enable some log spam and let me know
> > > what you get? I'm keen to try and figure this one out. My mail client
> > > might crunch this a bit so I have pasted it here too
> > > https://p.calebs.dev/ab74b7@raw
> >
> > These prints add just enough delay for the UFS probe to succeed that I
> > can't reproduce the failure anymore.
>
> I'd prefer doing at least a little debugging before jumping to a
> revert. From looking at your dmesg [1], it looks as if the async probe
> is allowing RPMH to probe at the same time as "qcom-vadc-common".
> That's something that talks on the SPMI bus and is (potentially)
> talking to the same PMICs that RPMH-regulator is, right? I'm by no
> means an expert on how Qualcomm's PMICs work, but it seems plausible
> that the "qcom-vadc-common" is somehow causing problems and screwing
> up RPMH. Does that seem plausible to you?
>
> If so, one interesting way to track it down would be to move around
> delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that
> _won't_ fix the problem. Now put a ~500ms sleep at the beginning of
> vadc_probe(). Maybe that will fix the problem? If so, you can move the
> delay around to narrow down the conflict. My wild guess would be that
> vadc_reset() could be throwing things for a loop?
>
> If the above doesn't work, maybe we could add more tracing / printouts
> to see what is probing at the same time as RPMH?

Tried out a few changes today but none of them worked or were
effective enough to debug this crash further, other than setting
fw_devlink=permissive.

Adding more tracing / prints (BOOTTIME_TRACING and
FUNCTION_GRAPH_TRACER) didn't work and didn't help in reproducing the
crash either. They added just enough delay to boot the device
successfully everytime.

I tried to reason with the kernel modules which gets loaded before and
after the qcom-rpmh-regulator (QCOM_REBOOT_MODE, QCOM_PON, IIO/VADC,
SPMI_PMIC* etc) as suggested, but I run into the same crash even if I
disable those driver modules. So I don't think that the other driver
modules which gets loaded at around the same time as
qcom-rpmh-regulator by default have any impact on this failure.

The only way I can boot successfully everytime is if I boot with
fw_devlink=permissive bootarg. So I'll have to check if there is any
new dependency which got added recently in DT or somewhere else that
is causing this breakage.

Regards,
Amit Pundir

>
>
> [1] https://bugs.linaro.org/attachment.cgi?id=1135
  
Doug Anderson May 16, 2023, 9:24 p.m. UTC | #10
Hi,

On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote:
>
> On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote:
> > >
> > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote:
> > > >
> > > >
> > > >
> > > > On 13/05/2023 18:08, Amit Pundir wrote:
> > > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote:
> > > > >>
> > > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
> > > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
> > > > >> prompted the us to switch to synchronous probe showed that the root
> > > > >> cause was a missing "rootwait" in the kernel command line
> > > > >> arguments. Let's reinstate asynchronous probe.
> > > > >
> > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
> > > > > running AOSP (Android Open Source Project) with v6.4-rc1
> > > > > https://bugs.linaro.org/show_bug.cgi?id=5975.
> > > > > Can we please go back to synchronous probe.
> > > > >
> > > > > AOSP do not make use of rootwait, IIRC, but it is added by the
> > > > > bootloader anyway. And the device fails to boot AOSP regardless of
> > > > > "rootwait" bootarg being present or not.
> > > >
> > > > Could you try applying this diff to enable some log spam and let me know
> > > > what you get? I'm keen to try and figure this one out. My mail client
> > > > might crunch this a bit so I have pasted it here too
> > > > https://p.calebs.dev/ab74b7@raw
> > >
> > > These prints add just enough delay for the UFS probe to succeed that I
> > > can't reproduce the failure anymore.
> >
> > I'd prefer doing at least a little debugging before jumping to a
> > revert. From looking at your dmesg [1], it looks as if the async probe
> > is allowing RPMH to probe at the same time as "qcom-vadc-common".
> > That's something that talks on the SPMI bus and is (potentially)
> > talking to the same PMICs that RPMH-regulator is, right? I'm by no
> > means an expert on how Qualcomm's PMICs work, but it seems plausible
> > that the "qcom-vadc-common" is somehow causing problems and screwing
> > up RPMH. Does that seem plausible to you?
> >
> > If so, one interesting way to track it down would be to move around
> > delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that
> > _won't_ fix the problem. Now put a ~500ms sleep at the beginning of
> > vadc_probe(). Maybe that will fix the problem? If so, you can move the
> > delay around to narrow down the conflict. My wild guess would be that
> > vadc_reset() could be throwing things for a loop?
> >
> > If the above doesn't work, maybe we could add more tracing / printouts
> > to see what is probing at the same time as RPMH?
>
> Tried out a few changes today but none of them worked or were
> effective enough to debug this crash further, other than setting
> fw_devlink=permissive.
>
> Adding more tracing / prints (BOOTTIME_TRACING and
> FUNCTION_GRAPH_TRACER) didn't work and didn't help in reproducing the
> crash either. They added just enough delay to boot the device
> successfully everytime.
>
> I tried to reason with the kernel modules which gets loaded before and
> after the qcom-rpmh-regulator (QCOM_REBOOT_MODE, QCOM_PON, IIO/VADC,
> SPMI_PMIC* etc) as suggested, but I run into the same crash even if I
> disable those driver modules. So I don't think that the other driver
> modules which gets loaded at around the same time as
> qcom-rpmh-regulator by default have any impact on this failure.

Ugh, Heisenbugs are no fun to debug. :( It sorta sounds as if pretty
much anything you can do to change the timing fixes you. That does
make reverting the async probe of the regulator less appealing. If, as
you said, it's not just some other driver loading at the same time
that's interfering then the revert "fixes" you in the same way that a
"msleep" would fix you. That doesn't seem like enough of a
justification for the revert to me.

It still feels to me like _something_ is happening at the same time as
the RPMH regulator driver is loading, though, I'm just not sure how to
suggest debugging it. I guess other thoughts:

* When RPMH complains, is it always with the same regulator (lvs1), or
sometimes different ones? Any clue there?

* How much can you control module loading order? If rpmh-regulator
module loads first, does it "fix" things? If it does, maybe you could
bisect to find the place where problems start cropping up. Does that
give any clues?


> The only way I can boot successfully everytime is if I boot with
> fw_devlink=permissive bootarg. So I'll have to check if there is any
> new dependency which got added recently in DT or somewhere else that
> is causing this breakage.

I guess I'll assume that `fw_devlink=permissive` only fixes you
because it changes the timing...
  
Mark Brown May 31, 2023, 12:30 p.m. UTC | #11
On Tue, May 16, 2023 at 02:24:06PM -0700, Doug Anderson wrote:
> On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote:

> > Tried out a few changes today but none of them worked or were
> > effective enough to debug this crash further, other than setting
> > fw_devlink=permissive.

> It still feels to me like _something_ is happening at the same time as
> the RPMH regulator driver is loading, though, I'm just not sure how to
> suggest debugging it. I guess other thoughts:

This discussion seems to have ground to a halt with no resolution so it
looks like the best option here is to apply the revert unless there's
some progress happened off list?
  
Amit Pundir May 31, 2023, 12:52 p.m. UTC | #12
On Wed, 31 May 2023 at 18:00, Mark Brown <broonie@kernel.org> wrote:
>
> On Tue, May 16, 2023 at 02:24:06PM -0700, Doug Anderson wrote:
> > On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote:
>
> > > Tried out a few changes today but none of them worked or were
> > > effective enough to debug this crash further, other than setting
> > > fw_devlink=permissive.
>
> > It still feels to me like _something_ is happening at the same time as
> > the RPMH regulator driver is loading, though, I'm just not sure how to
> > suggest debugging it. I guess other thoughts:
>
> This discussion seems to have ground to a halt with no resolution so it
> looks like the best option here is to apply the revert unless there's
> some progress happened off list?

Sorry about that. I got stuck at other things. I'll get back to it
this week. I'll try to change the module loading order and test run it
to check if that helps.

Regards,
Amit Pundir
  
Amit Pundir June 1, 2023, 6:10 a.m. UTC | #13
On Wed, 17 May 2023 at 02:54, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote:
> >
> > On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@chromium.org> wrote:
> > >
> > > Hi,
> > >
> > > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote:
> > > >
> > > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On 13/05/2023 18:08, Amit Pundir wrote:
> > > > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote:
> > > > > >>
> > > > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
> > > > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
> > > > > >> prompted the us to switch to synchronous probe showed that the root
> > > > > >> cause was a missing "rootwait" in the kernel command line
> > > > > >> arguments. Let's reinstate asynchronous probe.
> > > > > >
> > > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
> > > > > > running AOSP (Android Open Source Project) with v6.4-rc1
> > > > > > https://bugs.linaro.org/show_bug.cgi?id=5975.
> > > > > > Can we please go back to synchronous probe.
> > > > > >
> > > > > > AOSP do not make use of rootwait, IIRC, but it is added by the
> > > > > > bootloader anyway. And the device fails to boot AOSP regardless of
> > > > > > "rootwait" bootarg being present or not.
> > > > >
> > > > > Could you try applying this diff to enable some log spam and let me know
> > > > > what you get? I'm keen to try and figure this one out. My mail client
> > > > > might crunch this a bit so I have pasted it here too
> > > > > https://p.calebs.dev/ab74b7@raw
> > > >
> > > > These prints add just enough delay for the UFS probe to succeed that I
> > > > can't reproduce the failure anymore.
> > >
> > > I'd prefer doing at least a little debugging before jumping to a
> > > revert. From looking at your dmesg [1], it looks as if the async probe
> > > is allowing RPMH to probe at the same time as "qcom-vadc-common".
> > > That's something that talks on the SPMI bus and is (potentially)
> > > talking to the same PMICs that RPMH-regulator is, right? I'm by no
> > > means an expert on how Qualcomm's PMICs work, but it seems plausible
> > > that the "qcom-vadc-common" is somehow causing problems and screwing
> > > up RPMH. Does that seem plausible to you?
> > >
> > > If so, one interesting way to track it down would be to move around
> > > delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that
> > > _won't_ fix the problem. Now put a ~500ms sleep at the beginning of
> > > vadc_probe(). Maybe that will fix the problem? If so, you can move the
> > > delay around to narrow down the conflict. My wild guess would be that
> > > vadc_reset() could be throwing things for a loop?
> > >
> > > If the above doesn't work, maybe we could add more tracing / printouts
> > > to see what is probing at the same time as RPMH?
> >
> > Tried out a few changes today but none of them worked or were
> > effective enough to debug this crash further, other than setting
> > fw_devlink=permissive.
> >
> > Adding more tracing / prints (BOOTTIME_TRACING and
> > FUNCTION_GRAPH_TRACER) didn't work and didn't help in reproducing the
> > crash either. They added just enough delay to boot the device
> > successfully everytime.
> >
> > I tried to reason with the kernel modules which gets loaded before and
> > after the qcom-rpmh-regulator (QCOM_REBOOT_MODE, QCOM_PON, IIO/VADC,
> > SPMI_PMIC* etc) as suggested, but I run into the same crash even if I
> > disable those driver modules. So I don't think that the other driver
> > modules which gets loaded at around the same time as
> > qcom-rpmh-regulator by default have any impact on this failure.
>
> Ugh, Heisenbugs are no fun to debug. :( It sorta sounds as if pretty
> much anything you can do to change the timing fixes you. That does
> make reverting the async probe of the regulator less appealing. If, as
> you said, it's not just some other driver loading at the same time
> that's interfering then the revert "fixes" you in the same way that a
> "msleep" would fix you. That doesn't seem like enough of a
> justification for the revert to me.
>
> It still feels to me like _something_ is happening at the same time as
> the RPMH regulator driver is loading, though, I'm just not sure how to
> suggest debugging it. I guess other thoughts:
>
> * When RPMH complains, is it always with the same regulator (lvs1), or
> sometimes different ones? Any clue there?

It is always either lvs1 or lvs2.

>
> * How much can you control module loading order? If rpmh-regulator
> module loads first, does it "fix" things? If it does, maybe you could
> bisect to find the place where problems start cropping up. Does that
> give any clues?

Loading qcom-rpmh-regulator first does make it difficult to reproduce
the crash. I tried a few combinations to narrow down the issue further
and in one case, I managed to reproduce the crash (1 in 20+ reboots)
while loading the qcom-rpmh-regulator (and the dependent cmd-db,
qcom_rpmh) module alone
https://bugs.linaro.org/attachment.cgi?id=1140.

Regards,
Amit Pundir

>
>
> > The only way I can boot successfully everytime is if I boot with
> > fw_devlink=permissive bootarg. So I'll have to check if there is any
> > new dependency which got added recently in DT or somewhere else that
> > is causing this breakage.
>
> I guess I'll assume that `fw_devlink=permissive` only fixes you
> because it changes the timing...
  
Doug Anderson June 1, 2023, 2:04 p.m. UTC | #14
Hi,

On Wed, May 31, 2023 at 11:11 PM Amit Pundir <amit.pundir@linaro.org> wrote:
>
> On Wed, 17 May 2023 at 02:54, Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote:
> > >
> > > On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@chromium.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote:
> > > > >
> > > > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 13/05/2023 18:08, Amit Pundir wrote:
> > > > > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote:
> > > > > > >>
> > > > > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
> > > > > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
> > > > > > >> prompted the us to switch to synchronous probe showed that the root
> > > > > > >> cause was a missing "rootwait" in the kernel command line
> > > > > > >> arguments. Let's reinstate asynchronous probe.
> > > > > > >
> > > > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
> > > > > > > running AOSP (Android Open Source Project) with v6.4-rc1
> > > > > > > https://bugs.linaro.org/show_bug.cgi?id=5975.
> > > > > > > Can we please go back to synchronous probe.
> > > > > > >
> > > > > > > AOSP do not make use of rootwait, IIRC, but it is added by the
> > > > > > > bootloader anyway. And the device fails to boot AOSP regardless of
> > > > > > > "rootwait" bootarg being present or not.
> > > > > >
> > > > > > Could you try applying this diff to enable some log spam and let me know
> > > > > > what you get? I'm keen to try and figure this one out. My mail client
> > > > > > might crunch this a bit so I have pasted it here too
> > > > > > https://p.calebs.dev/ab74b7@raw
> > > > >
> > > > > These prints add just enough delay for the UFS probe to succeed that I
> > > > > can't reproduce the failure anymore.
> > > >
> > > > I'd prefer doing at least a little debugging before jumping to a
> > > > revert. From looking at your dmesg [1], it looks as if the async probe
> > > > is allowing RPMH to probe at the same time as "qcom-vadc-common".
> > > > That's something that talks on the SPMI bus and is (potentially)
> > > > talking to the same PMICs that RPMH-regulator is, right? I'm by no
> > > > means an expert on how Qualcomm's PMICs work, but it seems plausible
> > > > that the "qcom-vadc-common" is somehow causing problems and screwing
> > > > up RPMH. Does that seem plausible to you?
> > > >
> > > > If so, one interesting way to track it down would be to move around
> > > > delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that
> > > > _won't_ fix the problem. Now put a ~500ms sleep at the beginning of
> > > > vadc_probe(). Maybe that will fix the problem? If so, you can move the
> > > > delay around to narrow down the conflict. My wild guess would be that
> > > > vadc_reset() could be throwing things for a loop?
> > > >
> > > > If the above doesn't work, maybe we could add more tracing / printouts
> > > > to see what is probing at the same time as RPMH?
> > >
> > > Tried out a few changes today but none of them worked or were
> > > effective enough to debug this crash further, other than setting
> > > fw_devlink=permissive.
> > >
> > > Adding more tracing / prints (BOOTTIME_TRACING and
> > > FUNCTION_GRAPH_TRACER) didn't work and didn't help in reproducing the
> > > crash either. They added just enough delay to boot the device
> > > successfully everytime.
> > >
> > > I tried to reason with the kernel modules which gets loaded before and
> > > after the qcom-rpmh-regulator (QCOM_REBOOT_MODE, QCOM_PON, IIO/VADC,
> > > SPMI_PMIC* etc) as suggested, but I run into the same crash even if I
> > > disable those driver modules. So I don't think that the other driver
> > > modules which gets loaded at around the same time as
> > > qcom-rpmh-regulator by default have any impact on this failure.
> >
> > Ugh, Heisenbugs are no fun to debug. :( It sorta sounds as if pretty
> > much anything you can do to change the timing fixes you. That does
> > make reverting the async probe of the regulator less appealing. If, as
> > you said, it's not just some other driver loading at the same time
> > that's interfering then the revert "fixes" you in the same way that a
> > "msleep" would fix you. That doesn't seem like enough of a
> > justification for the revert to me.
> >
> > It still feels to me like _something_ is happening at the same time as
> > the RPMH regulator driver is loading, though, I'm just not sure how to
> > suggest debugging it. I guess other thoughts:
> >
> > * When RPMH complains, is it always with the same regulator (lvs1), or
> > sometimes different ones? Any clue there?
>
> It is always either lvs1 or lvs2.

If you reorder the nodes in the device tree, I think it'll change the
probe order. Does that affect anything? I'm wondering if there's some
sort of delayed reaction from a previous regulator.


> > * How much can you control module loading order? If rpmh-regulator
> > module loads first, does it "fix" things? If it does, maybe you could
> > bisect to find the place where problems start cropping up. Does that
> > give any clues?
>
> Loading qcom-rpmh-regulator first does make it difficult to reproduce
> the crash. I tried a few combinations to narrow down the issue further
> and in one case, I managed to reproduce the crash (1 in 20+ reboots)
> while loading the qcom-rpmh-regulator (and the dependent cmd-db,
> qcom_rpmh) module alone
> https://bugs.linaro.org/attachment.cgi?id=1140.
>
> Regards,
> Amit Pundir

OK, now that's even weirder. If loading the module alone reproduces
the problem, I can't imagine why this would be different without
"async" probe. In other words, If it reproduces 5% of the time when
loading the module alone with async probe, I'd expect it to reproduce
5% of the time when loading the module alone _without_ async probe
too.

Note: make sure you're careful to collect more than one reproduction
before asserting odds. If it reproduced once in 20 reboots, it might
be 5%, it might be 20%, or it might be .01%. Ideally you could script
this and let it run for a few hours to get a good reproduction rate?

-Doug
  
Amit Pundir June 2, 2023, 7:30 a.m. UTC | #15
On Thu, 1 Jun 2023 at 19:35, Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Wed, May 31, 2023 at 11:11 PM Amit Pundir <amit.pundir@linaro.org> wrote:
> >
> > On Wed, 17 May 2023 at 02:54, Doug Anderson <dianders@chromium.org> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote:
> > > >
> > > > On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@chromium.org> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote:
> > > > > >
> > > > > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 13/05/2023 18:08, Amit Pundir wrote:
> > > > > > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote:
> > > > > > > >>
> > > > > > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
> > > > > > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
> > > > > > > >> prompted the us to switch to synchronous probe showed that the root
> > > > > > > >> cause was a missing "rootwait" in the kernel command line
> > > > > > > >> arguments. Let's reinstate asynchronous probe.
> > > > > > > >
> > > > > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
> > > > > > > > running AOSP (Android Open Source Project) with v6.4-rc1
> > > > > > > > https://bugs.linaro.org/show_bug.cgi?id=5975.
> > > > > > > > Can we please go back to synchronous probe.
> > > > > > > >
> > > > > > > > AOSP do not make use of rootwait, IIRC, but it is added by the
> > > > > > > > bootloader anyway. And the device fails to boot AOSP regardless of
> > > > > > > > "rootwait" bootarg being present or not.
> > > > > > >
> > > > > > > Could you try applying this diff to enable some log spam and let me know
> > > > > > > what you get? I'm keen to try and figure this one out. My mail client
> > > > > > > might crunch this a bit so I have pasted it here too
> > > > > > > https://p.calebs.dev/ab74b7@raw
> > > > > >
> > > > > > These prints add just enough delay for the UFS probe to succeed that I
> > > > > > can't reproduce the failure anymore.
> > > > >
> > > > > I'd prefer doing at least a little debugging before jumping to a
> > > > > revert. From looking at your dmesg [1], it looks as if the async probe
> > > > > is allowing RPMH to probe at the same time as "qcom-vadc-common".
> > > > > That's something that talks on the SPMI bus and is (potentially)
> > > > > talking to the same PMICs that RPMH-regulator is, right? I'm by no
> > > > > means an expert on how Qualcomm's PMICs work, but it seems plausible
> > > > > that the "qcom-vadc-common" is somehow causing problems and screwing
> > > > > up RPMH. Does that seem plausible to you?
> > > > >
> > > > > If so, one interesting way to track it down would be to move around
> > > > > delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that
> > > > > _won't_ fix the problem. Now put a ~500ms sleep at the beginning of
> > > > > vadc_probe(). Maybe that will fix the problem? If so, you can move the
> > > > > delay around to narrow down the conflict. My wild guess would be that
> > > > > vadc_reset() could be throwing things for a loop?
> > > > >
> > > > > If the above doesn't work, maybe we could add more tracing / printouts
> > > > > to see what is probing at the same time as RPMH?
> > > >
> > > > Tried out a few changes today but none of them worked or were
> > > > effective enough to debug this crash further, other than setting
> > > > fw_devlink=permissive.
> > > >
> > > > Adding more tracing / prints (BOOTTIME_TRACING and
> > > > FUNCTION_GRAPH_TRACER) didn't work and didn't help in reproducing the
> > > > crash either. They added just enough delay to boot the device
> > > > successfully everytime.
> > > >
> > > > I tried to reason with the kernel modules which gets loaded before and
> > > > after the qcom-rpmh-regulator (QCOM_REBOOT_MODE, QCOM_PON, IIO/VADC,
> > > > SPMI_PMIC* etc) as suggested, but I run into the same crash even if I
> > > > disable those driver modules. So I don't think that the other driver
> > > > modules which gets loaded at around the same time as
> > > > qcom-rpmh-regulator by default have any impact on this failure.
> > >
> > > Ugh, Heisenbugs are no fun to debug. :( It sorta sounds as if pretty
> > > much anything you can do to change the timing fixes you. That does
> > > make reverting the async probe of the regulator less appealing. If, as
> > > you said, it's not just some other driver loading at the same time
> > > that's interfering then the revert "fixes" you in the same way that a
> > > "msleep" would fix you. That doesn't seem like enough of a
> > > justification for the revert to me.
> > >
> > > It still feels to me like _something_ is happening at the same time as
> > > the RPMH regulator driver is loading, though, I'm just not sure how to
> > > suggest debugging it. I guess other thoughts:
> > >
> > > * When RPMH complains, is it always with the same regulator (lvs1), or
> > > sometimes different ones? Any clue there?
> >
> > It is always either lvs1 or lvs2.
>
> If you reorder the nodes in the device tree, I think it'll change the
> probe order. Does that affect anything? I'm wondering if there's some
> sort of delayed reaction from a previous regulator.

Hi, Bumping lvs1 and lvs2 regulators up to the top of the list in the
DTS https://bugs.linaro.org/show_bug.cgi?id=5975#c4 does seem to work.
I can't reproduce the crash in 125 reboots so far, while I'm still
testing with only qcom-rpmh-regulator kernel module. I'll do some more
testing with full system running and send this re-ordering fix I can't
reproduce the crash further.

>
>
> > > * How much can you control module loading order? If rpmh-regulator
> > > module loads first, does it "fix" things? If it does, maybe you could
> > > bisect to find the place where problems start cropping up. Does that
> > > give any clues?
> >
> > Loading qcom-rpmh-regulator first does make it difficult to reproduce
> > the crash. I tried a few combinations to narrow down the issue further
> > and in one case, I managed to reproduce the crash (1 in 20+ reboots)
> > while loading the qcom-rpmh-regulator (and the dependent cmd-db,
> > qcom_rpmh) module alone
> > https://bugs.linaro.org/attachment.cgi?id=1140.
> >
> > Regards,
> > Amit Pundir
>
> OK, now that's even weirder. If loading the module alone reproduces
> the problem, I can't imagine why this would be different without
> "async" probe. In other words, If it reproduces 5% of the time when
> loading the module alone with async probe, I'd expect it to reproduce
> 5% of the time when loading the module alone _without_ async probe
> too.

I don't know the internal workings of forcing an asynchronous or
synchronous probe, but loading this module alone crashed 60 times in
350 reboots with this async patch, while it didn't crash at all in
about 250 reboots if I do PROBE_FORCE_SYNCHRONOUS.

Regards,
Amit Pundir

>
> Note: make sure you're careful to collect more than one reproduction
> before asserting odds. If it reproduced once in 20 reboots, it might
> be 5%, it might be 20%, or it might be .01%. Ideally you could script
> this and let it run for a few hours to get a good reproduction rate?
>
> -Doug
  
Mark Brown June 2, 2023, 12:37 p.m. UTC | #16
On Fri, Jun 02, 2023 at 01:00:52PM +0530, Amit Pundir wrote:
> On Thu, 1 Jun 2023 at 19:35, Doug Anderson <dianders@chromium.org> wrote:
> > On Wed, May 31, 2023 at 11:11 PM Amit Pundir <amit.pundir@linaro.org> wrote:
> > > On Wed, 17 May 2023 at 02:54, Doug Anderson <dianders@chromium.org> wrote:
> > > > On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote:
> > > > > On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@chromium.org> wrote:
> > > > > > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote:
> > > > > > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote:

Please delete unneeded context from mails when replying.  Doing this
makes it much easier to find your reply in the message, helping ensure
it won't be missed by people scrolling through the irrelevant quoted
material.

> > If you reorder the nodes in the device tree, I think it'll change the
> > probe order. Does that affect anything? I'm wondering if there's some
> > sort of delayed reaction from a previous regulator.

> Hi, Bumping lvs1 and lvs2 regulators up to the top of the list in the
> DTS https://bugs.linaro.org/show_bug.cgi?id=5975#c4 does seem to work.
> I can't reproduce the crash in 125 reboots so far, while I'm still
> testing with only qcom-rpmh-regulator kernel module. I'll do some more
> testing with full system running and send this re-ordering fix I can't
> reproduce the crash further.

So whatever the issue is here it's a timing/race condition - this seems
like a workaround which works just now but it's not getting to whatever
the actual issue is and that could come back.
  
Amit Pundir June 2, 2023, 1:06 p.m. UTC | #17
On Fri, 2 Jun 2023 at 13:00, Amit Pundir <amit.pundir@linaro.org> wrote:
>
> On Thu, 1 Jun 2023 at 19:35, Doug Anderson <dianders@chromium.org> wrote:
> >
> > If you reorder the nodes in the device tree, I think it'll change the
> > probe order. Does that affect anything? I'm wondering if there's some
> > sort of delayed reaction from a previous regulator.
>
> Hi, Bumping lvs1 and lvs2 regulators up to the top of the list in the
> DTS https://bugs.linaro.org/show_bug.cgi?id=5975#c4 does seem to work.
> I can't reproduce the crash in 125 reboots so far, while I'm still
> testing with only qcom-rpmh-regulator kernel module. I'll do some more
> testing with full system running and send this re-ordering fix I can't
> reproduce the crash further.

Hi, successfully rebooted AOSP with v6.4-rc4 on DB845c about 100+
times with this above mentioned lvs nodes reordering in the device
tree. I don't see any obvious functionality breakage in my limited
smoke testing so far either. I'll post this workaround/fix for review
on the lkml.

Regards,
Amit Pundir
  
Amit Pundir June 2, 2023, 1:12 p.m. UTC | #18
On Fri, 2 Jun 2023 at 18:07, Mark Brown <broonie@kernel.org> wrote:
>
> > > If you reorder the nodes in the device tree, I think it'll change the
> > > probe order. Does that affect anything? I'm wondering if there's some
> > > sort of delayed reaction from a previous regulator.
>
> > Hi, Bumping lvs1 and lvs2 regulators up to the top of the list in the
> > DTS https://bugs.linaro.org/show_bug.cgi?id=5975#c4 does seem to work.
> > I can't reproduce the crash in 125 reboots so far, while I'm still
> > testing with only qcom-rpmh-regulator kernel module. I'll do some more
> > testing with full system running and send this re-ordering fix I can't
> > reproduce the crash further.
>
> So whatever the issue is here it's a timing/race condition - this seems
> like a workaround which works just now but it's not getting to whatever
> the actual issue is and that could come back.

Hi, I'm happy to debug this issue further or test run any
patches/ideas if that helps.

Regards,
Amit Pundir
  
Doug Anderson June 6, 2023, 11:29 p.m. UTC | #19
Hi,

On Fri, Jun 2, 2023 at 6:13 AM Amit Pundir <amit.pundir@linaro.org> wrote:
>
> On Fri, 2 Jun 2023 at 18:07, Mark Brown <broonie@kernel.org> wrote:
> >
> > > > If you reorder the nodes in the device tree, I think it'll change the
> > > > probe order. Does that affect anything? I'm wondering if there's some
> > > > sort of delayed reaction from a previous regulator.
> >
> > > Hi, Bumping lvs1 and lvs2 regulators up to the top of the list in the
> > > DTS https://bugs.linaro.org/show_bug.cgi?id=5975#c4 does seem to work.
> > > I can't reproduce the crash in 125 reboots so far, while I'm still
> > > testing with only qcom-rpmh-regulator kernel module. I'll do some more
> > > testing with full system running and send this re-ordering fix I can't
> > > reproduce the crash further.
> >
> > So whatever the issue is here it's a timing/race condition - this seems
> > like a workaround which works just now but it's not getting to whatever
> > the actual issue is and that could come back.
>
> Hi, I'm happy to debug this issue further or test run any
> patches/ideas if that helps.

I guess it's a better workaround than reverting the async patch. ...at
least it has a chance of having a real effect. That being said, it's
still a bit of a hack.

Ideally you'd be able to somehow get information about RPMH's state
when things fail. Maybe this might be possible with ramdumps or maybe
with JTAG. Unfortunately, I'm not super familiar with either of them.
I assume you aren't either or you would have already tried them...

From a black box perspective, I guess the things I could think of
would be to keep poking around with things that you control. Best
ideas I have:

1. Use "bisect" style techniques to figure out how much you really
need to move the "lvs" regulators. Try moving it halfway up the list.
If that works, move it closer to the bottom. If that doesn't work,
move it closer to the top. Eventually you'd find out which regulator
it's important to be before.

2. Try adding some delays to some of the regulators with
"regulator-enable-ramp-delay" and/or "regulator-settling-time-us".
Without a scope, it'll be tricky to figure out exactly which
regulators might need delays, but you could at least confirm if the
"overkill" approach of having all the regulators have some delay
helps... I guess you could also try putting a big delay for "ldo26".
If that works, you could try moving it up (again using a bisect style
approach) to see where the delay matters?


Currently, I guess my mental model of what might be going wrong is
that regulators are all turning on / adjusting really quickly. Maybe
they aren't switching into "high power mode" quickly enough, maybe
they are busy ramping up or down, or maybe there's simply some other
issue, but I suppose that something happening could be causing the
voltage to droop down (or be too high) and then that's making RPMH
upset. Changing the order could be helping avoid this droop, but the
more proper fix would be to actually account for it with regulator
constraints.

My mental model still doesn't really explain what async probe has to
do with anything, at least if you're loading _just_ rpmh-regulator all
on its own. Assuming you're loading rpmh-regulator all on its own then
async probe should do almost nothing. Unless I'm mistaken, async probe
won't cause all the RPMH regulators to initialize in parallel. They
still initialize synchronously in the rpmh-regulator probe routine.
Async probe would _just_ allow some other driver to run its probe at
the same time. ...but if no other drivers being loaded at the same
time then I'm perplexed about how it could make any difference.


-Doug
  
Mark Brown June 7, 2023, 1:18 p.m. UTC | #20
On Tue, Jun 06, 2023 at 04:29:29PM -0700, Doug Anderson wrote:

> 2. Try adding some delays to some of the regulators with
> "regulator-enable-ramp-delay" and/or "regulator-settling-time-us".
> Without a scope, it'll be tricky to figure out exactly which
> regulators might need delays, but you could at least confirm if the
> "overkill" approach of having all the regulators have some delay
> helps... I guess you could also try putting a big delay for "ldo26".
> If that works, you could try moving it up (again using a bisect style
> approach) to see where the delay matters?

This is information which should be in the datasheets for the part.

> Currently, I guess my mental model of what might be going wrong is
> that regulators are all turning on / adjusting really quickly. Maybe
> they aren't switching into "high power mode" quickly enough, maybe
> they are busy ramping up or down, or maybe there's simply some other
> issue, but I suppose that something happening could be causing the
> voltage to droop down (or be too high) and then that's making RPMH
> upset. Changing the order could be helping avoid this droop, but the
> more proper fix would be to actually account for it with regulator
> constraints.

There could potentially be inrush issues, though I'd not expect
reordering to help much there.
  
Doug Anderson June 7, 2023, 1:47 p.m. UTC | #21
Hi,

On Wed, Jun 7, 2023 at 6:18 AM Mark Brown <broonie@kernel.org> wrote:
>
> On Tue, Jun 06, 2023 at 04:29:29PM -0700, Doug Anderson wrote:
>
> > 2. Try adding some delays to some of the regulators with
> > "regulator-enable-ramp-delay" and/or "regulator-settling-time-us".
> > Without a scope, it'll be tricky to figure out exactly which
> > regulators might need delays, but you could at least confirm if the
> > "overkill" approach of having all the regulators have some delay
> > helps... I guess you could also try putting a big delay for "ldo26".
> > If that works, you could try moving it up (again using a bisect style
> > approach) to see where the delay matters?
>
> This is information which should be in the datasheets for the part.

I was thinking more of something board-specific, not part specific. In
theory with RPMH this is also all supposed to be abstracted out into
the firmware code that sets up RPMH which magically takes care of
things like this, but it certainly could be wrong.
  
Linux regression tracking (Thorsten Leemhuis) June 14, 2023, 3:37 p.m. UTC | #22
Hi, Thorsten here, the Linux kernel's regression tracker.

On 07.06.23 15:47, Doug Anderson wrote:
> 
> On Wed, Jun 7, 2023 at 6:18 AM Mark Brown <broonie@kernel.org> wrote:
>>
>> On Tue, Jun 06, 2023 at 04:29:29PM -0700, Doug Anderson wrote:
>>
>>> 2. Try adding some delays to some of the regulators with
>>> "regulator-enable-ramp-delay" and/or "regulator-settling-time-us".
>>> Without a scope, it'll be tricky to figure out exactly which
>>> regulators might need delays, but you could at least confirm if the
>>> "overkill" approach of having all the regulators have some delay
>>> helps... I guess you could also try putting a big delay for "ldo26".
>>> If that works, you could try moving it up (again using a bisect style
>>> approach) to see where the delay matters?
>>
>> This is information which should be in the datasheets for the part.
> 
> I was thinking more of something board-specific, not part specific. In
> theory with RPMH this is also all supposed to be abstracted out into
> the firmware code that sets up RPMH which magically takes care of
> things like this, but it certainly could be wrong.

/me waves friendly

That afaics was the last mail in this thread about a regression caused
by ad44ac082fd ("regulator: qcom-rpmh: Revert "regulator: qcom-rpmh: Use
PROBE_FORCE_SYNCHRONOUS"") from Doug; Amit's attempt to patch it (
https://lore.kernel.org/lkml/20230602161246.1855448-1-amit.pundir@linaro.org/
) also wasn't welcomed. Just like his earlier revert attempt
(https://lore.kernel.org/lkml/20230515145323.1693044-1-amit.pundir@linaro.org/
).

Does this mean this regression won't be addressed before 6.4 is
released? Or was there some progress and I just missed it? What should I
tell Linus in my next report?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke
  
Doug Anderson June 14, 2023, 3:50 p.m. UTC | #23
Hi,

On Wed, Jun 14, 2023 at 8:37 AM Linux regression tracking (Thorsten
Leemhuis) <regressions@leemhuis.info> wrote:
>
> Hi, Thorsten here, the Linux kernel's regression tracker.
>
> On 07.06.23 15:47, Doug Anderson wrote:
> >
> > On Wed, Jun 7, 2023 at 6:18 AM Mark Brown <broonie@kernel.org> wrote:
> >>
> >> On Tue, Jun 06, 2023 at 04:29:29PM -0700, Doug Anderson wrote:
> >>
> >>> 2. Try adding some delays to some of the regulators with
> >>> "regulator-enable-ramp-delay" and/or "regulator-settling-time-us".
> >>> Without a scope, it'll be tricky to figure out exactly which
> >>> regulators might need delays, but you could at least confirm if the
> >>> "overkill" approach of having all the regulators have some delay
> >>> helps... I guess you could also try putting a big delay for "ldo26".
> >>> If that works, you could try moving it up (again using a bisect style
> >>> approach) to see where the delay matters?
> >>
> >> This is information which should be in the datasheets for the part.
> >
> > I was thinking more of something board-specific, not part specific. In
> > theory with RPMH this is also all supposed to be abstracted out into
> > the firmware code that sets up RPMH which magically takes care of
> > things like this, but it certainly could be wrong.
>
> /me waves friendly
>
> That afaics was the last mail in this thread about a regression caused
> by ad44ac082fd ("regulator: qcom-rpmh: Revert "regulator: qcom-rpmh: Use
> PROBE_FORCE_SYNCHRONOUS"") from Doug; Amit's attempt to patch it (
> https://lore.kernel.org/lkml/20230602161246.1855448-1-amit.pundir@linaro.org/
> ) also wasn't welcomed. Just like his earlier revert attempt
> (https://lore.kernel.org/lkml/20230515145323.1693044-1-amit.pundir@linaro.org/
> ).
>
> Does this mean this regression won't be addressed before 6.4 is
> released? Or was there some progress and I just missed it? What should I
> tell Linus in my next report?

Of the two proposals made (the revert vs. the reordering of the dts),
the reordering of the dts seems better. It only affects the one buggy
board (rather than preventing us to move to async probe for everyone)
and it also has a chance of actually fixing something (changing the
order that regulators probe in rpmh-regulator might legitimately work
around the problem). That being said, just like the revert the dts
reordering is still just papering over the problem and is fragile /
not guaranteed to work forever.

-Doug
  
Amit Pundir June 14, 2023, 7:02 p.m. UTC | #24
On Wed, 7 Jun 2023 at 04:59, Doug Anderson <dianders@chromium.org> wrote:
>
> From a black box perspective, I guess the things I could think of
> would be to keep poking around with things that you control. Best
> ideas I have:
>
> 1. Use "bisect" style techniques to figure out how much you really
> need to move the "lvs" regulators. Try moving it halfway up the list.
> If that works, move it closer to the bottom. If that doesn't work,
> move it closer to the top. Eventually you'd find out which regulator
> it's important to be before.

Hi, I tried this bisect style technique to move lvs regulators up in
the list gradually and I found that they need to be enabled atleast
before ldo12 and the ldo regulators which follow the ldo12 in the
list.

>
> 2. Try adding some delays to some of the regulators with
> "regulator-enable-ramp-delay" and/or "regulator-settling-time-us".
> Without a scope, it'll be tricky to figure out exactly which
> regulators might need delays, but you could at least confirm if the
> "overkill" approach of having all the regulators have some delay
> helps... I guess you could also try putting a big delay for "ldo26".
> If that works, you could try moving it up (again using a bisect style
> approach) to see where the delay matters?

I tried this approach as well earlier today but I don't know how big
"the big" delay can be. The device fails to boot if I add a settling
time of as much as 2sec per each ldo and lvs regulator too. I didn't
try increasing the delay further.

Regards,
Amit Pundir
  
Doug Anderson June 14, 2023, 7:35 p.m. UTC | #25
Hi,

On Wed, Jun 14, 2023 at 12:03 PM Amit Pundir <amit.pundir@linaro.org> wrote:
>
> On Wed, 7 Jun 2023 at 04:59, Doug Anderson <dianders@chromium.org> wrote:
> >
> > From a black box perspective, I guess the things I could think of
> > would be to keep poking around with things that you control. Best
> > ideas I have:
> >
> > 1. Use "bisect" style techniques to figure out how much you really
> > need to move the "lvs" regulators. Try moving it halfway up the list.
> > If that works, move it closer to the bottom. If that doesn't work,
> > move it closer to the top. Eventually you'd find out which regulator
> > it's important to be before.
>
> Hi, I tried this bisect style technique to move lvs regulators up in
> the list gradually and I found that they need to be enabled atleast
> before ldo12 and the ldo regulators which follow the ldo12 in the
> list.

Super weird. I was hoping that something would jump out, but nothing
does. :( I don't understand how lvs1 / lvs2 could have any impact on
ldo12. :(


> > 2. Try adding some delays to some of the regulators with
> > "regulator-enable-ramp-delay" and/or "regulator-settling-time-us".
> > Without a scope, it'll be tricky to figure out exactly which
> > regulators might need delays, but you could at least confirm if the
> > "overkill" approach of having all the regulators have some delay
> > helps... I guess you could also try putting a big delay for "ldo26".
> > If that works, you could try moving it up (again using a bisect style
> > approach) to see where the delay matters?
>
> I tried this approach as well earlier today but I don't know how big
> "the big" delay can be. The device fails to boot if I add a settling
> time of as much as 2sec per each ldo and lvs regulator too. I didn't
> try increasing the delay further.

Yeah, 2 seconds is plenty big. If that doesn't fix it then it's not a
timing issue.

I guess with the above results, I'm still super confused about why the
async probe has any impact at all on this. It sounds like the
_ordering_ of the rpmh-regulators init matters but not the timing, and
I'd expect the ordering to be the same between normal probe and async
probe. Specifically, I think:

a) There is exactly one rpmh-regulator driver instance in your system, right?

b) Regulator initialization happens in rpmh_regulator_probe().

c) The rpmh_regulator_probe() function is itself synchronous. That is,
it sets up one regulator at a time and, I believe, nothing about the
behavior of rpmh_regulator_probe() changes for async vs. sync probe.

...so I'm left baffled...
  
Mark Brown June 14, 2023, 10:21 p.m. UTC | #26
On Wed, Jun 14, 2023 at 12:35:08PM -0700, Doug Anderson wrote:

> Super weird. I was hoping that something would jump out, but nothing
> does. :( I don't understand how lvs1 / lvs2 could have any impact on
> ldo12. :(

Does a device connected to some combination of these regulators have
power sequencing issues?
  

Patch

diff --git a/drivers/regulator/qcom-rpmh-regulator.c b/drivers/regulator/qcom-rpmh-regulator.c
index 903032b2875f..4826d60e5d95 100644
--- a/drivers/regulator/qcom-rpmh-regulator.c
+++ b/drivers/regulator/qcom-rpmh-regulator.c
@@ -1462,7 +1462,7 @@  MODULE_DEVICE_TABLE(of, rpmh_regulator_match_table);
 static struct platform_driver rpmh_regulator_driver = {
 	.driver = {
 		.name = "qcom-rpmh-regulator",
-		.probe_type = PROBE_FORCE_SYNCHRONOUS,
+		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
 		.of_match_table	= of_match_ptr(rpmh_regulator_match_table),
 	},
 	.probe = rpmh_regulator_probe,