arm64: dts: qcom: sdm845-db845c: Move LVS regulator nodes up

Message ID 20230602161246.1855448-1-amit.pundir@linaro.org
State New
Headers
Series arm64: dts: qcom: sdm845-db845c: Move LVS regulator nodes up |

Commit Message

Amit Pundir June 2, 2023, 4:12 p.m. UTC
  Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
list to workaround a boot regression uncovered by the upstream
commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").

Without this fix DB845c fail to boot at times because one of the
lvs1 or lvs2 regulators fail to turn ON in time.

Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
---
 arch/arm64/boot/dts/qcom/sdm845-db845c.dts | 24 +++++++++++-----------
 1 file changed, 12 insertions(+), 12 deletions(-)
  

Comments

Doug Anderson June 6, 2023, 11:34 p.m. UTC | #1
Hi,

On Fri, Jun 2, 2023 at 9:12 AM Amit Pundir <amit.pundir@linaro.org> wrote:
>
> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
> list to workaround a boot regression uncovered by the upstream
> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
>
> Without this fix DB845c fail to boot at times because one of the
> lvs1 or lvs2 regulators fail to turn ON in time.
>
> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
> Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
> ---
>  arch/arm64/boot/dts/qcom/sdm845-db845c.dts | 24 +++++++++++-----------
>  1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/arch/arm64/boot/dts/qcom/sdm845-db845c.dts b/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
> index e14fe9bbb386..df2fde9063dc 100644
> --- a/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
> +++ b/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
> @@ -301,6 +301,18 @@ regulators-0 {
>                 vdd-l26-supply = <&vreg_s3a_1p35>;
>                 vin-lvs-1-2-supply = <&vreg_s4a_1p8>;
>
> +               vreg_lvs1a_1p8: lvs1 {
> +                       regulator-min-microvolt = <1800000>;
> +                       regulator-max-microvolt = <1800000>;
> +                       regulator-always-on;
> +               };
> +
> +               vreg_lvs2a_1p8: lvs2 {
> +                       regulator-min-microvolt = <1800000>;
> +                       regulator-max-microvolt = <1800000>;
> +                       regulator-always-on;
> +               };
> +
>                 vreg_s3a_1p35: smps3 {
>                         regulator-min-microvolt = <1352000>;
>                         regulator-max-microvolt = <1352000>;
> @@ -381,18 +393,6 @@ vreg_l26a_1p2: ldo26 {
>                         regulator-max-microvolt = <1200000>;
>                         regulator-initial-mode = <RPMH_REGULATOR_MODE_HPM>;
>                 };
> -
> -               vreg_lvs1a_1p8: lvs1 {
> -                       regulator-min-microvolt = <1800000>;
> -                       regulator-max-microvolt = <1800000>;
> -                       regulator-always-on;
> -               };
> -
> -               vreg_lvs2a_1p8: lvs2 {
> -                       regulator-min-microvolt = <1800000>;
> -                       regulator-max-microvolt = <1800000>;
> -                       regulator-always-on;
> -               };

This is a hack, but it at least feels less bad than reverting the
async probe patch. I'll leave it to Bjorn to decide if he's OK with
it. Personally, it feels like this would deserve a comment in the dts
to document that these regulators need to be listed first.

Ideally, we could still work towards a root cause. I added a few more
ideas to help with root causing in reply to the original thread about
this.

https://lore.kernel.org/r/CAD=FV=UKyjRNZG-ED2meUAR9aXdco+AbUTHiKixTzjCkaJbjTg@mail.gmail.com/

-Doug
  
Krzysztof Kozlowski June 7, 2023, 7:46 a.m. UTC | #2
On 02/06/2023 18:12, Amit Pundir wrote:
> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
> list to workaround a boot regression uncovered by the upstream
> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
> 
> Without this fix DB845c fail to boot at times because one of the
> lvs1 or lvs2 regulators fail to turn ON in time.

Why regulator would fail to turn on time? If it has ramp-up delay, add
it to DT. Otherwise how is it possible? You have a consumer, right?

This is not the correct solution and it is hiding real issue.

Best regards,
Krzysztof
  
Krzysztof Kozlowski June 7, 2023, 7:49 a.m. UTC | #3
On 07/06/2023 01:34, Doug Anderson wrote:
> Hi,
> 
> On Fri, Jun 2, 2023 at 9:12 AM Amit Pundir <amit.pundir@linaro.org> wrote:
>>
>> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
>> list to workaround a boot regression uncovered by the upstream
>> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
>> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
>>
>> Without this fix DB845c fail to boot at times because one of the
>> lvs1 or lvs2 regulators fail to turn ON in time.
>>
>> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
>> Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
>> ---
>>  arch/arm64/boot/dts/qcom/sdm845-db845c.dts | 24 +++++++++++-----------
>>  1 file changed, 12 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/arm64/boot/dts/qcom/sdm845-db845c.dts b/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
>> index e14fe9bbb386..df2fde9063dc 100644
>> --- a/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
>> +++ b/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
>> @@ -301,6 +301,18 @@ regulators-0 {
>>                 vdd-l26-supply = <&vreg_s3a_1p35>;
>>                 vin-lvs-1-2-supply = <&vreg_s4a_1p8>;
>>
>> +               vreg_lvs1a_1p8: lvs1 {
>> +                       regulator-min-microvolt = <1800000>;
>> +                       regulator-max-microvolt = <1800000>;
>> +                       regulator-always-on;
>> +               };
>> +
>> +               vreg_lvs2a_1p8: lvs2 {
>> +                       regulator-min-microvolt = <1800000>;
>> +                       regulator-max-microvolt = <1800000>;
>> +                       regulator-always-on;
>> +               };
>> +
>>                 vreg_s3a_1p35: smps3 {
>>                         regulator-min-microvolt = <1352000>;
>>                         regulator-max-microvolt = <1352000>;
>> @@ -381,18 +393,6 @@ vreg_l26a_1p2: ldo26 {
>>                         regulator-max-microvolt = <1200000>;
>>                         regulator-initial-mode = <RPMH_REGULATOR_MODE_HPM>;
>>                 };
>> -
>> -               vreg_lvs1a_1p8: lvs1 {
>> -                       regulator-min-microvolt = <1800000>;
>> -                       regulator-max-microvolt = <1800000>;
>> -                       regulator-always-on;
>> -               };
>> -
>> -               vreg_lvs2a_1p8: lvs2 {
>> -                       regulator-min-microvolt = <1800000>;
>> -                       regulator-max-microvolt = <1800000>;
>> -                       regulator-always-on;
>> -               };
> 
> This is a hack, but it at least feels less bad than reverting the
> async probe patch. I'll leave it to Bjorn to decide if he's OK with
> it. Personally, it feels like this would deserve a comment in the dts
> to document that these regulators need to be listed first.
> 
> Ideally, we could still work towards a root cause. I added a few more
> ideas to help with root causing in reply to the original thread about
> this.
> 
> https://lore.kernel.org/r/CAD=FV=UKyjRNZG-ED2meUAR9aXdco+AbUTHiKixTzjCkaJbjTg@mail.gmail.com/

We do not shape DTS based on given OS behavior. AOSP needs this, BSD
needs that and Linux needs something else. Next time someone will move
these regulators down because on his system probing is from end of list,
not beginning and he has the same problem.

No, really, are we going to reshuffle nodes because AOSP needs it?

Best regards,
Krzysztof
  
Amit Pundir June 7, 2023, 9:17 a.m. UTC | #4
On Wed, 7 Jun 2023 at 13:19, Krzysztof Kozlowski
<krzysztof.kozlowski@linaro.org> wrote:
>
> On 07/06/2023 01:34, Doug Anderson wrote:
> > Hi,
> >
> > On Fri, Jun 2, 2023 at 9:12 AM Amit Pundir <amit.pundir@linaro.org> wrote:
> >>
> >> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
> >> list to workaround a boot regression uncovered by the upstream
> >> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
> >> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
> >>
> >> Without this fix DB845c fail to boot at times because one of the
> >> lvs1 or lvs2 regulators fail to turn ON in time.
> >>
> >> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
> >> Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
> >> ---
> >>  arch/arm64/boot/dts/qcom/sdm845-db845c.dts | 24 +++++++++++-----------
> >>  1 file changed, 12 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/arch/arm64/boot/dts/qcom/sdm845-db845c.dts b/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
> >> index e14fe9bbb386..df2fde9063dc 100644
> >> --- a/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
> >> +++ b/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
> >> @@ -301,6 +301,18 @@ regulators-0 {
> >>                 vdd-l26-supply = <&vreg_s3a_1p35>;
> >>                 vin-lvs-1-2-supply = <&vreg_s4a_1p8>;
> >>
> >> +               vreg_lvs1a_1p8: lvs1 {
> >> +                       regulator-min-microvolt = <1800000>;
> >> +                       regulator-max-microvolt = <1800000>;
> >> +                       regulator-always-on;
> >> +               };
> >> +
> >> +               vreg_lvs2a_1p8: lvs2 {
> >> +                       regulator-min-microvolt = <1800000>;
> >> +                       regulator-max-microvolt = <1800000>;
> >> +                       regulator-always-on;
> >> +               };
> >> +
> >>                 vreg_s3a_1p35: smps3 {
> >>                         regulator-min-microvolt = <1352000>;
> >>                         regulator-max-microvolt = <1352000>;
> >> @@ -381,18 +393,6 @@ vreg_l26a_1p2: ldo26 {
> >>                         regulator-max-microvolt = <1200000>;
> >>                         regulator-initial-mode = <RPMH_REGULATOR_MODE_HPM>;
> >>                 };
> >> -
> >> -               vreg_lvs1a_1p8: lvs1 {
> >> -                       regulator-min-microvolt = <1800000>;
> >> -                       regulator-max-microvolt = <1800000>;
> >> -                       regulator-always-on;
> >> -               };
> >> -
> >> -               vreg_lvs2a_1p8: lvs2 {
> >> -                       regulator-min-microvolt = <1800000>;
> >> -                       regulator-max-microvolt = <1800000>;
> >> -                       regulator-always-on;
> >> -               };
> >
> > This is a hack, but it at least feels less bad than reverting the
> > async probe patch. I'll leave it to Bjorn to decide if he's OK with
> > it. Personally, it feels like this would deserve a comment in the dts
> > to document that these regulators need to be listed first.
> >
> > Ideally, we could still work towards a root cause. I added a few more
> > ideas to help with root causing in reply to the original thread about
> > this.
> >
> > https://lore.kernel.org/r/CAD=FV=UKyjRNZG-ED2meUAR9aXdco+AbUTHiKixTzjCkaJbjTg@mail.gmail.com/
>
> We do not shape DTS based on given OS behavior. AOSP needs this, BSD
> needs that and Linux needs something else. Next time someone will move
> these regulators down because on his system probing is from end of list,
> not beginning and he has the same problem.
>
> No, really, are we going to reshuffle nodes because AOSP needs it?

Hi, other than the fact that I reproduced it on AOSP, there is nothing
AOSP specific in this patch. I'm sure there may be another
platforms/OS (which load kernel modules from a ramdisk) that may trip
on this bug. But I can try reproducing it on an OS of your choice if
it helps.

Regards,
Amit Pundir

>
> Best regards,
> Krzysztof
>
  
Krzysztof Kozlowski June 7, 2023, 10:16 a.m. UTC | #5
On 07/06/2023 11:17, Amit Pundir wrote:
> On Wed, 7 Jun 2023 at 13:19, Krzysztof Kozlowski
> <krzysztof.kozlowski@linaro.org> wrote:
>>
>> On 07/06/2023 01:34, Doug Anderson wrote:
>>> Hi,
>>>
>>> On Fri, Jun 2, 2023 at 9:12 AM Amit Pundir <amit.pundir@linaro.org> wrote:
>>>>
>>>> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
>>>> list to workaround a boot regression uncovered by the upstream
>>>> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
>>>> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
>>>>
>>>> Without this fix DB845c fail to boot at times because one of the
>>>> lvs1 or lvs2 regulators fail to turn ON in time.
>>>>
>>>> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
>>>> Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
>>>> ---
>>>>  arch/arm64/boot/dts/qcom/sdm845-db845c.dts | 24 +++++++++++-----------
>>>>  1 file changed, 12 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845-db845c.dts b/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
>>>> index e14fe9bbb386..df2fde9063dc 100644
>>>> --- a/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
>>>> @@ -301,6 +301,18 @@ regulators-0 {
>>>>                 vdd-l26-supply = <&vreg_s3a_1p35>;
>>>>                 vin-lvs-1-2-supply = <&vreg_s4a_1p8>;
>>>>
>>>> +               vreg_lvs1a_1p8: lvs1 {
>>>> +                       regulator-min-microvolt = <1800000>;
>>>> +                       regulator-max-microvolt = <1800000>;
>>>> +                       regulator-always-on;
>>>> +               };
>>>> +
>>>> +               vreg_lvs2a_1p8: lvs2 {
>>>> +                       regulator-min-microvolt = <1800000>;
>>>> +                       regulator-max-microvolt = <1800000>;
>>>> +                       regulator-always-on;
>>>> +               };
>>>> +
>>>>                 vreg_s3a_1p35: smps3 {
>>>>                         regulator-min-microvolt = <1352000>;
>>>>                         regulator-max-microvolt = <1352000>;
>>>> @@ -381,18 +393,6 @@ vreg_l26a_1p2: ldo26 {
>>>>                         regulator-max-microvolt = <1200000>;
>>>>                         regulator-initial-mode = <RPMH_REGULATOR_MODE_HPM>;
>>>>                 };
>>>> -
>>>> -               vreg_lvs1a_1p8: lvs1 {
>>>> -                       regulator-min-microvolt = <1800000>;
>>>> -                       regulator-max-microvolt = <1800000>;
>>>> -                       regulator-always-on;
>>>> -               };
>>>> -
>>>> -               vreg_lvs2a_1p8: lvs2 {
>>>> -                       regulator-min-microvolt = <1800000>;
>>>> -                       regulator-max-microvolt = <1800000>;
>>>> -                       regulator-always-on;
>>>> -               };
>>>
>>> This is a hack, but it at least feels less bad than reverting the
>>> async probe patch. I'll leave it to Bjorn to decide if he's OK with
>>> it. Personally, it feels like this would deserve a comment in the dts
>>> to document that these regulators need to be listed first.
>>>
>>> Ideally, we could still work towards a root cause. I added a few more
>>> ideas to help with root causing in reply to the original thread about
>>> this.
>>>
>>> https://lore.kernel.org/r/CAD=FV=UKyjRNZG-ED2meUAR9aXdco+AbUTHiKixTzjCkaJbjTg@mail.gmail.com/
>>
>> We do not shape DTS based on given OS behavior. AOSP needs this, BSD
>> needs that and Linux needs something else. Next time someone will move
>> these regulators down because on his system probing is from end of list,
>> not beginning and he has the same problem.
>>
>> No, really, are we going to reshuffle nodes because AOSP needs it?
> 
> Hi, other than the fact that I reproduced it on AOSP, there is nothing
> AOSP specific in this patch. I'm sure there may be another
> platforms/OS (which load kernel modules from a ramdisk) that may trip
> on this bug. But I can try reproducing it on an OS of your choice if
> it helps.

I wrote earlier imaginary system where RPM driver loads the regulators
from the end. It would require re-shuffling to previous order of the
nodes. Feel free to change the RPM drivers to simulate it and you should
see that your patch stops helping.

The problem looks like in missing consumers, missing probe dependencies
or something in the driver how it handles these.

DTS should not be used for solving OS related problems.

Best regards,
Krzysztof
  
Amit Pundir June 8, 2023, 5:26 p.m. UTC | #6
On Wed, 7 Jun 2023 at 15:46, Krzysztof Kozlowski
<krzysztof.kozlowski@linaro.org> wrote:
>
>
> The problem looks like in missing consumers, missing probe dependencies
> or something in the driver how it handles these.

Missing consumers seem to be the case here, if I'm reading the
$debugfs/regulator/regulator_summary correctly(?)
https://www.irccloud.com/pastebin/raw/2jwn0EQw.
lvs1 and lvs2 sysfs entries in /sys/class/regulator/ do not list any
consumers explicitly either.

Regards,
Amit Pundir


>
> DTS should not be used for solving OS related problems.
>
> Best regards,
> Krzysztof
>
  
Doug Anderson June 8, 2023, 5:44 p.m. UTC | #7
Hi,

On Thu, Jun 8, 2023 at 10:27 AM Amit Pundir <amit.pundir@linaro.org> wrote:
>
> On Wed, 7 Jun 2023 at 15:46, Krzysztof Kozlowski
> <krzysztof.kozlowski@linaro.org> wrote:
> >
> >
> > The problem looks like in missing consumers, missing probe dependencies
> > or something in the driver how it handles these.
>
> Missing consumers seem to be the case here, if I'm reading the
> $debugfs/regulator/regulator_summary correctly(?)
> https://www.irccloud.com/pastebin/raw/2jwn0EQw.
> lvs1 and lvs2 sysfs entries in /sys/class/regulator/ do not list any
> consumers explicitly either.

They are marked as always-on regulators, though. The lack of an
explicit consumer in device tree shouldn't really matter. I don't
think this is the source of your problem.

-Doug
  
Linux regression tracking (Thorsten Leemhuis) June 14, 2023, 6:18 p.m. UTC | #8
On 02.06.23 18:12, Amit Pundir wrote:
> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
> list to workaround a boot regression uncovered by the upstream
> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
> 
> Without this fix DB845c fail to boot at times because one of the
> lvs1 or lvs2 regulators fail to turn ON in time.

/me waves friendly

FWIW, as it's not obvious: this...

> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/

...is a report about a regression. One that we could still solve before
6.4 is out. One I'll likely will point Linus to, unless a fix comes into
sight.

When I noticed the reluctant replies to this patch I earlier today asked
in the thread with the report what the plan forward was:
https://lore.kernel.org/all/CAD%3DFV%3DV-h4EUKHCM9UivsFHRsJPY5sAiwXV3a1hUX9DUMkkxdg@mail.gmail.com/

Dough there replied:

```
Of the two proposals made (the revert vs. the reordering of the dts),
the reordering of the dts seems better. It only affects the one buggy
board (rather than preventing us to move to async probe for everyone)
and it also has a chance of actually fixing something (changing the
order that regulators probe in rpmh-regulator might legitimately work
around the problem). That being said, just like the revert the dts
reordering is still just papering over the problem and is fragile /
not guaranteed to work forever.
```

Papering over obviously is not good, but has anyone a better idea to fix
this? Or is "not fixing" for some reason an viable option here?

Ciao, Thorsten

> Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
> ---
>  arch/arm64/boot/dts/qcom/sdm845-db845c.dts | 24 +++++++++++-----------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm64/boot/dts/qcom/sdm845-db845c.dts b/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
> index e14fe9bbb386..df2fde9063dc 100644
> --- a/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
> +++ b/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
> @@ -301,6 +301,18 @@ regulators-0 {
>  		vdd-l26-supply = <&vreg_s3a_1p35>;
>  		vin-lvs-1-2-supply = <&vreg_s4a_1p8>;
>  
> +		vreg_lvs1a_1p8: lvs1 {
> +			regulator-min-microvolt = <1800000>;
> +			regulator-max-microvolt = <1800000>;
> +			regulator-always-on;
> +		};
> +
> +		vreg_lvs2a_1p8: lvs2 {
> +			regulator-min-microvolt = <1800000>;
> +			regulator-max-microvolt = <1800000>;
> +			regulator-always-on;
> +		};
> +
>  		vreg_s3a_1p35: smps3 {
>  			regulator-min-microvolt = <1352000>;
>  			regulator-max-microvolt = <1352000>;
> @@ -381,18 +393,6 @@ vreg_l26a_1p2: ldo26 {
>  			regulator-max-microvolt = <1200000>;
>  			regulator-initial-mode = <RPMH_REGULATOR_MODE_HPM>;
>  		};
> -
> -		vreg_lvs1a_1p8: lvs1 {
> -			regulator-min-microvolt = <1800000>;
> -			regulator-max-microvolt = <1800000>;
> -			regulator-always-on;
> -		};
> -
> -		vreg_lvs2a_1p8: lvs2 {
> -			regulator-min-microvolt = <1800000>;
> -			regulator-max-microvolt = <1800000>;
> -			regulator-always-on;
> -		};
>  	};
>  
>  	regulators-1 {
  
Krzysztof Kozlowski June 14, 2023, 6:47 p.m. UTC | #9
On 14/06/2023 20:18, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 02.06.23 18:12, Amit Pundir wrote:
>> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
>> list to workaround a boot regression uncovered by the upstream
>> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
>> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
>>
>> Without this fix DB845c fail to boot at times because one of the
>> lvs1 or lvs2 regulators fail to turn ON in time.
> 
> /me waves friendly
> 
> FWIW, as it's not obvious: this...
> 
>> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
> 
> ...is a report about a regression. One that we could still solve before
> 6.4 is out. One I'll likely will point Linus to, unless a fix comes into
> sight.
> 
> When I noticed the reluctant replies to this patch I earlier today asked
> in the thread with the report what the plan forward was:
> https://lore.kernel.org/all/CAD%3DFV%3DV-h4EUKHCM9UivsFHRsJPY5sAiwXV3a1hUX9DUMkkxdg@mail.gmail.com/
> 
> Dough there replied:
> 
> ```
> Of the two proposals made (the revert vs. the reordering of the dts),
> the reordering of the dts seems better. It only affects the one buggy
> board (rather than preventing us to move to async probe for everyone)
> and it also has a chance of actually fixing something (changing the
> order that regulators probe in rpmh-regulator might legitimately work
> around the problem). That being said, just like the revert the dts
> reordering is still just papering over the problem and is fragile /
> not guaranteed to work forever.
> ```
> 
> Papering over obviously is not good, but has anyone a better idea to fix
> this? Or is "not fixing" for some reason an viable option here?
> 

I understand there is a regression, although kernel is not mainline
(hash df7443a96851 is unknown) and the only solutions were papering the
problem. Reverting commit is a temporary workaround. Moving nodes in DTS
is not acceptable because it hides actual problem and only solves this
one particular observed problem, while actual issue is still there. It
would be nice to be able to reproduce it on real mainline with normal
operating system (not AOSP) - with ramdiks/without/whatever. So far no
one did it, right?

Best regards,
Krzysztof
  
Amit Pundir June 14, 2023, 7:08 p.m. UTC | #10
On Thu, 15 Jun 2023 at 00:17, Krzysztof Kozlowski
<krzysztof.kozlowski@linaro.org> wrote:
>
> On 14/06/2023 20:18, Linux regression tracking (Thorsten Leemhuis) wrote:
> > On 02.06.23 18:12, Amit Pundir wrote:
> >> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
> >> list to workaround a boot regression uncovered by the upstream
> >> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
> >> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
> >>
> >> Without this fix DB845c fail to boot at times because one of the
> >> lvs1 or lvs2 regulators fail to turn ON in time.
> >
> > /me waves friendly
> >
> > FWIW, as it's not obvious: this...
> >
> >> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
> >
> > ...is a report about a regression. One that we could still solve before
> > 6.4 is out. One I'll likely will point Linus to, unless a fix comes into
> > sight.
> >
> > When I noticed the reluctant replies to this patch I earlier today asked
> > in the thread with the report what the plan forward was:
> > https://lore.kernel.org/all/CAD%3DFV%3DV-h4EUKHCM9UivsFHRsJPY5sAiwXV3a1hUX9DUMkkxdg@mail.gmail.com/
> >
> > Dough there replied:
> >
> > ```
> > Of the two proposals made (the revert vs. the reordering of the dts),
> > the reordering of the dts seems better. It only affects the one buggy
> > board (rather than preventing us to move to async probe for everyone)
> > and it also has a chance of actually fixing something (changing the
> > order that regulators probe in rpmh-regulator might legitimately work
> > around the problem). That being said, just like the revert the dts
> > reordering is still just papering over the problem and is fragile /
> > not guaranteed to work forever.
> > ```
> >
> > Papering over obviously is not good, but has anyone a better idea to fix
> > this? Or is "not fixing" for some reason an viable option here?
> >
>
> I understand there is a regression, although kernel is not mainline
> (hash df7443a96851 is unknown) and the only solutions were papering the
> problem. Reverting commit is a temporary workaround. Moving nodes in DTS
> is not acceptable because it hides actual problem and only solves this
> one particular observed problem, while actual issue is still there. It
> would be nice to be able to reproduce it on real mainline with normal
> operating system (not AOSP) - with ramdiks/without/whatever. So far no
> one did it, right?

No, I did not try non-AOSP system yet. I'll try it tomorrow, if that
helps. With mainline hash.

Regards,
Amit Pundir
  
Doug Anderson June 14, 2023, 7:44 p.m. UTC | #11
Hi,

On Wed, Jun 14, 2023 at 11:47 AM Krzysztof Kozlowski
<krzysztof.kozlowski@linaro.org> wrote:
>
> On 14/06/2023 20:18, Linux regression tracking (Thorsten Leemhuis) wrote:
> > On 02.06.23 18:12, Amit Pundir wrote:
> >> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
> >> list to workaround a boot regression uncovered by the upstream
> >> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
> >> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
> >>
> >> Without this fix DB845c fail to boot at times because one of the
> >> lvs1 or lvs2 regulators fail to turn ON in time.
> >
> > /me waves friendly
> >
> > FWIW, as it's not obvious: this...
> >
> >> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
> >
> > ...is a report about a regression. One that we could still solve before
> > 6.4 is out. One I'll likely will point Linus to, unless a fix comes into
> > sight.
> >
> > When I noticed the reluctant replies to this patch I earlier today asked
> > in the thread with the report what the plan forward was:
> > https://lore.kernel.org/all/CAD%3DFV%3DV-h4EUKHCM9UivsFHRsJPY5sAiwXV3a1hUX9DUMkkxdg@mail.gmail.com/
> >
> > Dough there replied:
> >
> > ```
> > Of the two proposals made (the revert vs. the reordering of the dts),
> > the reordering of the dts seems better. It only affects the one buggy
> > board (rather than preventing us to move to async probe for everyone)
> > and it also has a chance of actually fixing something (changing the
> > order that regulators probe in rpmh-regulator might legitimately work
> > around the problem). That being said, just like the revert the dts
> > reordering is still just papering over the problem and is fragile /
> > not guaranteed to work forever.
> > ```
> >
> > Papering over obviously is not good, but has anyone a better idea to fix
> > this? Or is "not fixing" for some reason an viable option here?
> >
>
> I understand there is a regression, although kernel is not mainline
> (hash df7443a96851 is unknown) and the only solutions were papering the
> problem. Reverting commit is a temporary workaround. Moving nodes in DTS
> is not acceptable because it hides actual problem and only solves this
> one particular observed problem, while actual issue is still there. It
> would be nice to be able to reproduce it on real mainline with normal
> operating system (not AOSP) - with ramdiks/without/whatever. So far no
> one did it, right?

The worry I have about the revert here is that it will never be able
to be undone and that doesn't seem great long term. I'm all for a
temporary revert to fix a problem while the root cause is understood,
but in this case I have a hard time believing that we'll make more
progress towards a root cause once the revert lands. All the
investigation we've done so far seems to indicate that the revert only
fixes the problem by luck...

I completely agree that moving the nodes in the DTS is a hack and just
hides the problem. However, it also at least limits the workaround to
the one board showing the problem and doesn't mean we're stuck with
synchronous probe for rpmh-regulator for all eternity because nobody
can understand this timing issue on db845c.

-Doug
  
Amit Pundir June 15, 2023, 1:47 p.m. UTC | #12
On Thu, 15 Jun 2023 at 00:38, Amit Pundir <amit.pundir@linaro.org> wrote:
>
> On Thu, 15 Jun 2023 at 00:17, Krzysztof Kozlowski
> <krzysztof.kozlowski@linaro.org> wrote:
> >
> > On 14/06/2023 20:18, Linux regression tracking (Thorsten Leemhuis) wrote:
> > > On 02.06.23 18:12, Amit Pundir wrote:
> > >> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
> > >> list to workaround a boot regression uncovered by the upstream
> > >> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
> > >> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
> > >>
> > >> Without this fix DB845c fail to boot at times because one of the
> > >> lvs1 or lvs2 regulators fail to turn ON in time.
> > >
> > > /me waves friendly
> > >
> > > FWIW, as it's not obvious: this...
> > >
> > >> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
> > >
> > > ...is a report about a regression. One that we could still solve before
> > > 6.4 is out. One I'll likely will point Linus to, unless a fix comes into
> > > sight.
> > >
> > > When I noticed the reluctant replies to this patch I earlier today asked
> > > in the thread with the report what the plan forward was:
> > > https://lore.kernel.org/all/CAD%3DFV%3DV-h4EUKHCM9UivsFHRsJPY5sAiwXV3a1hUX9DUMkkxdg@mail.gmail.com/
> > >
> > > Dough there replied:
> > >
> > > ```
> > > Of the two proposals made (the revert vs. the reordering of the dts),
> > > the reordering of the dts seems better. It only affects the one buggy
> > > board (rather than preventing us to move to async probe for everyone)
> > > and it also has a chance of actually fixing something (changing the
> > > order that regulators probe in rpmh-regulator might legitimately work
> > > around the problem). That being said, just like the revert the dts
> > > reordering is still just papering over the problem and is fragile /
> > > not guaranteed to work forever.
> > > ```
> > >
> > > Papering over obviously is not good, but has anyone a better idea to fix
> > > this? Or is "not fixing" for some reason an viable option here?
> > >
> >
> > I understand there is a regression, although kernel is not mainline
> > (hash df7443a96851 is unknown) and the only solutions were papering the
> > problem. Reverting commit is a temporary workaround. Moving nodes in DTS
> > is not acceptable because it hides actual problem and only solves this
> > one particular observed problem, while actual issue is still there. It
> > would be nice to be able to reproduce it on real mainline with normal
> > operating system (not AOSP) - with ramdiks/without/whatever. So far no
> > one did it, right?
>
> No, I did not try non-AOSP system yet. I'll try it tomorrow, if that
> helps. With mainline hash.

Hi, here is the crash report on db845c running vanilla v6.4-rc6 with a
debian build https://bugs.linaro.org/attachment.cgi?id=1142

And fwiw here is the db845c crash log with AOSP running vanilla
v6.4-rc6 https://bugs.linaro.org/attachment.cgi?id=1141

Regards,
Amit Pundir

PS: rootfs in this bug report doesn't matter much because I'm loading
all the kernel modules from a ramdisk and in the case of a crash the
UFS doesn't probe anyway.
  
Krzysztof Kozlowski June 15, 2023, 3:03 p.m. UTC | #13
On 15/06/2023 15:47, Amit Pundir wrote:
> On Thu, 15 Jun 2023 at 00:38, Amit Pundir <amit.pundir@linaro.org> wrote:
>>
>> On Thu, 15 Jun 2023 at 00:17, Krzysztof Kozlowski
>> <krzysztof.kozlowski@linaro.org> wrote:
>>>
>>> On 14/06/2023 20:18, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>> On 02.06.23 18:12, Amit Pundir wrote:
>>>>> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
>>>>> list to workaround a boot regression uncovered by the upstream
>>>>> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
>>>>> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
>>>>>
>>>>> Without this fix DB845c fail to boot at times because one of the
>>>>> lvs1 or lvs2 regulators fail to turn ON in time.
>>>>
>>>> /me waves friendly
>>>>
>>>> FWIW, as it's not obvious: this...
>>>>
>>>>> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
>>>>
>>>> ...is a report about a regression. One that we could still solve before
>>>> 6.4 is out. One I'll likely will point Linus to, unless a fix comes into
>>>> sight.
>>>>
>>>> When I noticed the reluctant replies to this patch I earlier today asked
>>>> in the thread with the report what the plan forward was:
>>>> https://lore.kernel.org/all/CAD%3DFV%3DV-h4EUKHCM9UivsFHRsJPY5sAiwXV3a1hUX9DUMkkxdg@mail.gmail.com/
>>>>
>>>> Dough there replied:
>>>>
>>>> ```
>>>> Of the two proposals made (the revert vs. the reordering of the dts),
>>>> the reordering of the dts seems better. It only affects the one buggy
>>>> board (rather than preventing us to move to async probe for everyone)
>>>> and it also has a chance of actually fixing something (changing the
>>>> order that regulators probe in rpmh-regulator might legitimately work
>>>> around the problem). That being said, just like the revert the dts
>>>> reordering is still just papering over the problem and is fragile /
>>>> not guaranteed to work forever.
>>>> ```
>>>>
>>>> Papering over obviously is not good, but has anyone a better idea to fix
>>>> this? Or is "not fixing" for some reason an viable option here?
>>>>
>>>
>>> I understand there is a regression, although kernel is not mainline
>>> (hash df7443a96851 is unknown) and the only solutions were papering the
>>> problem. Reverting commit is a temporary workaround. Moving nodes in DTS
>>> is not acceptable because it hides actual problem and only solves this
>>> one particular observed problem, while actual issue is still there. It
>>> would be nice to be able to reproduce it on real mainline with normal
>>> operating system (not AOSP) - with ramdiks/without/whatever. So far no
>>> one did it, right?
>>
>> No, I did not try non-AOSP system yet. I'll try it tomorrow, if that
>> helps. With mainline hash.
> 
> Hi, here is the crash report on db845c running vanilla v6.4-rc6 with a
> debian build https://bugs.linaro.org/attachment.cgi?id=1142
> 
> And fwiw here is the db845c crash log with AOSP running vanilla
> v6.4-rc6 https://bugs.linaro.org/attachment.cgi?id=1141
> 
> Regards,
> Amit Pundir
> 
> PS: rootfs in this bug report doesn't matter much because I'm loading
> all the kernel modules from a ramdisk and in the case of a crash the
> UFS doesn't probe anyway.

I just tried current next with defconfig (I could not find your config,
neither here, nor in your previous mail thread nor in bugzilla). Also
with REGULATOR_QCOM_RPMH as module.

I tried also v6.4-rc6 - also defconfig with default and module
REGULATOR_QCOM_RPMH.

All the cases work on my RB3 - no warnings reported.

If you do not use defconfig, then in all reports please mention the
differences (the best) or at least attach it.



Best regards,
Krzysztof
  
Amit Pundir June 15, 2023, 4:09 p.m. UTC | #14
On Thu, 15 Jun 2023 at 20:33, Krzysztof Kozlowski
<krzysztof.kozlowski@linaro.org> wrote:
>
> On 15/06/2023 15:47, Amit Pundir wrote:
> > On Thu, 15 Jun 2023 at 00:38, Amit Pundir <amit.pundir@linaro.org> wrote:
> >>
> >> On Thu, 15 Jun 2023 at 00:17, Krzysztof Kozlowski
> >> <krzysztof.kozlowski@linaro.org> wrote:
> >>>
> >>> On 14/06/2023 20:18, Linux regression tracking (Thorsten Leemhuis) wrote:
> >>>> On 02.06.23 18:12, Amit Pundir wrote:
> >>>>> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
> >>>>> list to workaround a boot regression uncovered by the upstream
> >>>>> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
> >>>>> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
> >>>>>
> >>>>> Without this fix DB845c fail to boot at times because one of the
> >>>>> lvs1 or lvs2 regulators fail to turn ON in time.
> >>>>
> >>>> /me waves friendly
> >>>>
> >>>> FWIW, as it's not obvious: this...
> >>>>
> >>>>> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
> >>>>
> >>>> ...is a report about a regression. One that we could still solve before
> >>>> 6.4 is out. One I'll likely will point Linus to, unless a fix comes into
> >>>> sight.
> >>>>
> >>>> When I noticed the reluctant replies to this patch I earlier today asked
> >>>> in the thread with the report what the plan forward was:
> >>>> https://lore.kernel.org/all/CAD%3DFV%3DV-h4EUKHCM9UivsFHRsJPY5sAiwXV3a1hUX9DUMkkxdg@mail.gmail.com/
> >>>>
> >>>> Dough there replied:
> >>>>
> >>>> ```
> >>>> Of the two proposals made (the revert vs. the reordering of the dts),
> >>>> the reordering of the dts seems better. It only affects the one buggy
> >>>> board (rather than preventing us to move to async probe for everyone)
> >>>> and it also has a chance of actually fixing something (changing the
> >>>> order that regulators probe in rpmh-regulator might legitimately work
> >>>> around the problem). That being said, just like the revert the dts
> >>>> reordering is still just papering over the problem and is fragile /
> >>>> not guaranteed to work forever.
> >>>> ```
> >>>>
> >>>> Papering over obviously is not good, but has anyone a better idea to fix
> >>>> this? Or is "not fixing" for some reason an viable option here?
> >>>>
> >>>
> >>> I understand there is a regression, although kernel is not mainline
> >>> (hash df7443a96851 is unknown) and the only solutions were papering the
> >>> problem. Reverting commit is a temporary workaround. Moving nodes in DTS
> >>> is not acceptable because it hides actual problem and only solves this
> >>> one particular observed problem, while actual issue is still there. It
> >>> would be nice to be able to reproduce it on real mainline with normal
> >>> operating system (not AOSP) - with ramdiks/without/whatever. So far no
> >>> one did it, right?
> >>
> >> No, I did not try non-AOSP system yet. I'll try it tomorrow, if that
> >> helps. With mainline hash.
> >
> > Hi, here is the crash report on db845c running vanilla v6.4-rc6 with a
> > debian build https://bugs.linaro.org/attachment.cgi?id=1142
> >
> > And fwiw here is the db845c crash log with AOSP running vanilla
> > v6.4-rc6 https://bugs.linaro.org/attachment.cgi?id=1141
> >
> > Regards,
> > Amit Pundir
> >
> > PS: rootfs in this bug report doesn't matter much because I'm loading
> > all the kernel modules from a ramdisk and in the case of a crash the
> > UFS doesn't probe anyway.
>
> I just tried current next with defconfig (I could not find your config,
> neither here, nor in your previous mail thread nor in bugzilla). Also
> with REGULATOR_QCOM_RPMH as module.
>
> I tried also v6.4-rc6 - also defconfig with default and module
> REGULATOR_QCOM_RPMH.
>
> All the cases work on my RB3 - no warnings reported.
>
> If you do not use defconfig, then in all reports please mention the
> differences (the best) or at least attach it.

Argh.. Sorry about that. Big mistake from my side. I did want to
upload my defconfig but forgot. Defconfig plays a key role because, as
I mentioned in one of my previous email, it is a timing/race bug and
if I do any much changes in my defconfig (i.e. enable ftrace for
example or as little as add printk in qcom_rpmh_regulator code) then I
can't reproduce this bug. So needless to say that I can't reproduce
this bug with default arm64 defconfig.

Please find my custom (but upstream) defconfig here
https://bugs.linaro.org/attachment.cgi?id=1143 and prebuilt binaries
here https://people.linaro.org/~amit.pundir/db845c-userdebug/rpmh_bug/.
"fastboot flash boot ./boot.img-6.4-rc6 reboot" and/or a few (<5)
reboots should be enough to trigger the crash.

I have downloaded the initrd from here
https://snapshots.linaro.org/96boards/dragonboard845c/linaro/debian/569/initrd.img-5.15.0-qcomlt-arm64
but edited ramdisk/init to run "load_module" function early in the
boot and ramdisk/conf/initramfs.conf has "MODULES=list" instead of
"MODULES=most", where all the kernel modules are listed at
/etc/initramfs-tools/modules.

Regards,
Amit Pundir
  
Amit Pundir June 15, 2023, 4:15 p.m. UTC | #15
On Thu, 15 Jun 2023 at 21:39, Amit Pundir <amit.pundir@linaro.org> wrote:
>
> On Thu, 15 Jun 2023 at 20:33, Krzysztof Kozlowski
> <krzysztof.kozlowski@linaro.org> wrote:
> >
> > On 15/06/2023 15:47, Amit Pundir wrote:
> > > On Thu, 15 Jun 2023 at 00:38, Amit Pundir <amit.pundir@linaro.org> wrote:
> > >>
> > >> On Thu, 15 Jun 2023 at 00:17, Krzysztof Kozlowski
> > >> <krzysztof.kozlowski@linaro.org> wrote:
> > >>>
> > >>> On 14/06/2023 20:18, Linux regression tracking (Thorsten Leemhuis) wrote:
> > >>>> On 02.06.23 18:12, Amit Pundir wrote:
> > >>>>> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
> > >>>>> list to workaround a boot regression uncovered by the upstream
> > >>>>> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
> > >>>>> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
> > >>>>>
> > >>>>> Without this fix DB845c fail to boot at times because one of the
> > >>>>> lvs1 or lvs2 regulators fail to turn ON in time.
> > >>>>
> > >>>> /me waves friendly
> > >>>>
> > >>>> FWIW, as it's not obvious: this...
> > >>>>
> > >>>>> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
> > >>>>
> > >>>> ...is a report about a regression. One that we could still solve before
> > >>>> 6.4 is out. One I'll likely will point Linus to, unless a fix comes into
> > >>>> sight.
> > >>>>
> > >>>> When I noticed the reluctant replies to this patch I earlier today asked
> > >>>> in the thread with the report what the plan forward was:
> > >>>> https://lore.kernel.org/all/CAD%3DFV%3DV-h4EUKHCM9UivsFHRsJPY5sAiwXV3a1hUX9DUMkkxdg@mail.gmail.com/
> > >>>>
> > >>>> Dough there replied:
> > >>>>
> > >>>> ```
> > >>>> Of the two proposals made (the revert vs. the reordering of the dts),
> > >>>> the reordering of the dts seems better. It only affects the one buggy
> > >>>> board (rather than preventing us to move to async probe for everyone)
> > >>>> and it also has a chance of actually fixing something (changing the
> > >>>> order that regulators probe in rpmh-regulator might legitimately work
> > >>>> around the problem). That being said, just like the revert the dts
> > >>>> reordering is still just papering over the problem and is fragile /
> > >>>> not guaranteed to work forever.
> > >>>> ```
> > >>>>
> > >>>> Papering over obviously is not good, but has anyone a better idea to fix
> > >>>> this? Or is "not fixing" for some reason an viable option here?
> > >>>>
> > >>>
> > >>> I understand there is a regression, although kernel is not mainline
> > >>> (hash df7443a96851 is unknown) and the only solutions were papering the
> > >>> problem. Reverting commit is a temporary workaround. Moving nodes in DTS
> > >>> is not acceptable because it hides actual problem and only solves this
> > >>> one particular observed problem, while actual issue is still there. It
> > >>> would be nice to be able to reproduce it on real mainline with normal
> > >>> operating system (not AOSP) - with ramdiks/without/whatever. So far no
> > >>> one did it, right?
> > >>
> > >> No, I did not try non-AOSP system yet. I'll try it tomorrow, if that
> > >> helps. With mainline hash.
> > >
> > > Hi, here is the crash report on db845c running vanilla v6.4-rc6 with a
> > > debian build https://bugs.linaro.org/attachment.cgi?id=1142
> > >
> > > And fwiw here is the db845c crash log with AOSP running vanilla
> > > v6.4-rc6 https://bugs.linaro.org/attachment.cgi?id=1141
> > >
> > > Regards,
> > > Amit Pundir
> > >
> > > PS: rootfs in this bug report doesn't matter much because I'm loading
> > > all the kernel modules from a ramdisk and in the case of a crash the
> > > UFS doesn't probe anyway.
> >
> > I just tried current next with defconfig (I could not find your config,
> > neither here, nor in your previous mail thread nor in bugzilla). Also
> > with REGULATOR_QCOM_RPMH as module.
> >
> > I tried also v6.4-rc6 - also defconfig with default and module
> > REGULATOR_QCOM_RPMH.
> >
> > All the cases work on my RB3 - no warnings reported.
> >
> > If you do not use defconfig, then in all reports please mention the
> > differences (the best) or at least attach it.
>
> Argh.. Sorry about that. Big mistake from my side. I did want to
> upload my defconfig but forgot. Defconfig plays a key role because, as
> I mentioned in one of my previous email, it is a timing/race bug and
> if I do any much changes in my defconfig (i.e. enable ftrace for
> example or as little as add printk in qcom_rpmh_regulator code) then I
> can't reproduce this bug. So needless to say that I can't reproduce
> this bug with default arm64 defconfig.
>
> Please find my custom (but upstream) defconfig here
> https://bugs.linaro.org/attachment.cgi?id=1143 and prebuilt binaries
> here https://people.linaro.org/~amit.pundir/db845c-userdebug/rpmh_bug/.
> "fastboot flash boot ./boot.img-6.4-rc6 reboot" and/or a few (<5)
> reboots should be enough to trigger the crash.
>
> I have downloaded the initrd from here
> https://snapshots.linaro.org/96boards/dragonboard845c/linaro/debian/569/initrd.img-5.15.0-qcomlt-arm64
> but edited ramdisk/init to run "load_module" function early in the
> boot and ramdisk/conf/initramfs.conf has "MODULES=list" instead of
> "MODULES=most", where all the kernel modules are listed at
> /etc/initramfs-tools/modules.

Sorry it is ramdisk/conf/modules not ramdisk/etc/initramfs-tools/modules.

>
> Regards,
> Amit Pundir
  
Krzysztof Kozlowski June 16, 2023, 8:27 a.m. UTC | #16
On 15/06/2023 18:09, Amit Pundir wrote:
> On Thu, 15 Jun 2023 at 20:33, Krzysztof Kozlowski
> <krzysztof.kozlowski@linaro.org> wrote:
>>
>> On 15/06/2023 15:47, Amit Pundir wrote:
>>> On Thu, 15 Jun 2023 at 00:38, Amit Pundir <amit.pundir@linaro.org> wrote:
>>>>
>>>> On Thu, 15 Jun 2023 at 00:17, Krzysztof Kozlowski
>>>> <krzysztof.kozlowski@linaro.org> wrote:
>>>>>
>>>>> On 14/06/2023 20:18, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>>>> On 02.06.23 18:12, Amit Pundir wrote:
>>>>>>> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
>>>>>>> list to workaround a boot regression uncovered by the upstream
>>>>>>> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
>>>>>>> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
>>>>>>>
>>>>>>> Without this fix DB845c fail to boot at times because one of the
>>>>>>> lvs1 or lvs2 regulators fail to turn ON in time.
>>>>>>
>>>>>> /me waves friendly
>>>>>>
>>>>>> FWIW, as it's not obvious: this...
>>>>>>
>>>>>>> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
>>>>>>
>>>>>> ...is a report about a regression. One that we could still solve before
>>>>>> 6.4 is out. One I'll likely will point Linus to, unless a fix comes into
>>>>>> sight.
>>>>>>
>>>>>> When I noticed the reluctant replies to this patch I earlier today asked
>>>>>> in the thread with the report what the plan forward was:
>>>>>> https://lore.kernel.org/all/CAD%3DFV%3DV-h4EUKHCM9UivsFHRsJPY5sAiwXV3a1hUX9DUMkkxdg@mail.gmail.com/
>>>>>>
>>>>>> Dough there replied:
>>>>>>
>>>>>> ```
>>>>>> Of the two proposals made (the revert vs. the reordering of the dts),
>>>>>> the reordering of the dts seems better. It only affects the one buggy
>>>>>> board (rather than preventing us to move to async probe for everyone)
>>>>>> and it also has a chance of actually fixing something (changing the
>>>>>> order that regulators probe in rpmh-regulator might legitimately work
>>>>>> around the problem). That being said, just like the revert the dts
>>>>>> reordering is still just papering over the problem and is fragile /
>>>>>> not guaranteed to work forever.
>>>>>> ```
>>>>>>
>>>>>> Papering over obviously is not good, but has anyone a better idea to fix
>>>>>> this? Or is "not fixing" for some reason an viable option here?
>>>>>>
>>>>>
>>>>> I understand there is a regression, although kernel is not mainline
>>>>> (hash df7443a96851 is unknown) and the only solutions were papering the
>>>>> problem. Reverting commit is a temporary workaround. Moving nodes in DTS
>>>>> is not acceptable because it hides actual problem and only solves this
>>>>> one particular observed problem, while actual issue is still there. It
>>>>> would be nice to be able to reproduce it on real mainline with normal
>>>>> operating system (not AOSP) - with ramdiks/without/whatever. So far no
>>>>> one did it, right?
>>>>
>>>> No, I did not try non-AOSP system yet. I'll try it tomorrow, if that
>>>> helps. With mainline hash.
>>>
>>> Hi, here is the crash report on db845c running vanilla v6.4-rc6 with a
>>> debian build https://bugs.linaro.org/attachment.cgi?id=1142
>>>
>>> And fwiw here is the db845c crash log with AOSP running vanilla
>>> v6.4-rc6 https://bugs.linaro.org/attachment.cgi?id=1141
>>>
>>> Regards,
>>> Amit Pundir
>>>
>>> PS: rootfs in this bug report doesn't matter much because I'm loading
>>> all the kernel modules from a ramdisk and in the case of a crash the
>>> UFS doesn't probe anyway.
>>
>> I just tried current next with defconfig (I could not find your config,
>> neither here, nor in your previous mail thread nor in bugzilla). Also
>> with REGULATOR_QCOM_RPMH as module.
>>
>> I tried also v6.4-rc6 - also defconfig with default and module
>> REGULATOR_QCOM_RPMH.
>>
>> All the cases work on my RB3 - no warnings reported.
>>
>> If you do not use defconfig, then in all reports please mention the
>> differences (the best) or at least attach it.
> 
> Argh.. Sorry about that. Big mistake from my side. I did want to
> upload my defconfig but forgot. Defconfig plays a key role because, as
> I mentioned in one of my previous email, it is a timing/race bug and
> if I do any much changes in my defconfig (i.e. enable ftrace for
> example or as little as add printk in qcom_rpmh_regulator code) then I
> can't reproduce this bug. So needless to say that I can't reproduce
> this bug with default arm64 defconfig.
> 
> Please find my custom (but upstream) defconfig here
> https://bugs.linaro.org/attachment.cgi?id=1143 and prebuilt binaries
> here https://people.linaro.org/~amit.pundir/db845c-userdebug/rpmh_bug/.
> "fastboot flash boot ./boot.img-6.4-rc6 reboot" and/or a few (<5)
> reboots should be enough to trigger the crash.
> 
> I have downloaded the initrd from here
> https://snapshots.linaro.org/96boards/dragonboard845c/linaro/debian/569/initrd.img-5.15.0-qcomlt-arm64
> but edited ramdisk/init to run "load_module" function early in the
> boot and ramdisk/conf/initramfs.conf has "MODULES=list" instead of
> "MODULES=most", where all the kernel modules are listed at
> /etc/initramfs-tools/modules.

So you have interconnect as module - this is not a supported setup. It
might work with if all the modules are loaded very early or might not.
Pinctrl is another driver which should be built-in.

With your defconfig I see regular issue - console and system dies
because of lack of interconnects, most likely. I don't see your WARNs -
I just see usual hang.

See:
https://lore.kernel.org/all/20221021032702.1340963-1-krzysztof.kozlowski@linaro.org/

If you want them to really be modules, then you need to fix all the
dependencies (SOFTDEP?), probe ordering glitches. It's not a problem of
DTS. Just because something can be built as module, does not mean it
will work. We don't test it, we don't work with them as modules.

It's kind of the same as here:
https://lore.kernel.org/all/ac328b6a-a8e2-873d-4015-814cb4f5588e@canonical.com/

I understand that we might have here regression, if these were working
as modules, but I don't think we ever really committed to it. We can as
well make it non-module to solve the regression.

Best regards,
Krzysztof
  
Amit Pundir June 16, 2023, 5:09 p.m. UTC | #17
Hi,

On Fri, 16 Jun 2023 at 13:57, Krzysztof Kozlowski
<krzysztof.kozlowski@linaro.org> wrote:
>
>
> So you have interconnect as module - this is not a supported setup. It
> might work with if all the modules are loaded very early or might not.
> Pinctrl is another driver which should be built-in.
>
> With your defconfig I see regular issue - console and system dies
> because of lack of interconnects, most likely. I don't see your WARNs -
> I just see usual hang.
>
> See:
> https://lore.kernel.org/all/20221021032702.1340963-1-krzysztof.kozlowski@linaro.org/
>
> If you want them to really be modules, then you need to fix all the
> dependencies (SOFTDEP?), probe ordering glitches. It's not a problem of
> DTS. Just because something can be built as module, does not mean it
> will work. We don't test it, we don't work with them as modules.

I do somewhat agree with most of your arguments but not this one. If a
driver doesn't work as a module then it shouldn't be allowed to build
as a module. I took a quick look at the history of the interconnect
driver and it is tristate from the beginning. And not converted to a
modular build later-on like some of the other drivers to support GKI.

>
> It's kind of the same as here:
> https://lore.kernel.org/all/ac328b6a-a8e2-873d-4015-814cb4f5588e@canonical.com/
>
> I understand that we might have here regression, if these were working
> as modules, but I don't think we ever really committed to it. We can as
> well make it non-module to solve the regression.

Sure. But since v6.4 is around the corner, can we merge this
workaround for now, while a proper fix is being worked upon.

Regards,
Amit Pundir
  
Krzysztof Kozlowski June 17, 2023, 7:21 a.m. UTC | #18
On 16/06/2023 19:09, Amit Pundir wrote:
> Hi,
> 
> On Fri, 16 Jun 2023 at 13:57, Krzysztof Kozlowski
> <krzysztof.kozlowski@linaro.org> wrote:
>>
>>
>> So you have interconnect as module - this is not a supported setup. It
>> might work with if all the modules are loaded very early or might not.
>> Pinctrl is another driver which should be built-in.
>>
>> With your defconfig I see regular issue - console and system dies
>> because of lack of interconnects, most likely. I don't see your WARNs -
>> I just see usual hang.
>>
>> See:
>> https://lore.kernel.org/all/20221021032702.1340963-1-krzysztof.kozlowski@linaro.org/
>>
>> If you want them to really be modules, then you need to fix all the
>> dependencies (SOFTDEP?), probe ordering glitches. It's not a problem of
>> DTS. Just because something can be built as module, does not mean it
>> will work. We don't test it, we don't work with them as modules.
> 
> I do somewhat agree with most of your arguments but not this one. If a
> driver doesn't work as a module then it shouldn't be allowed to build
> as a module. 

Of course you are right. That's why I am pushing against blindly adding
"tristate" by everyone working on GKI. Because such folks like to make
them tristate, but not actually test it or work on issues later.

That's exactly the case from Google and Samsung patches here:
https://lore.kernel.org/all/ac328b6a-a8e2-873d-4015-814cb4f5588e@canonical.com/
and in previous submissions.

> I took a quick look at the history of the interconnect
> driver and it is tristate from the beginning. And not converted to a
> modular build later-on like some of the other drivers to support GKI.

OK, maybe it was never actually tested. Or maybe some versions were
working on boards where debug serial does not have interconnect, but new
chips just followed the pattern without testing?
> 
>>
>> It's kind of the same as here:
>> https://lore.kernel.org/all/ac328b6a-a8e2-873d-4015-814cb4f5588e@canonical.com/
>>
>> I understand that we might have here regression, if these were working
>> as modules, but I don't think we ever really committed to it. We can as
>> well make it non-module to solve the regression.
> 
> Sure. But since v6.4 is around the corner, can we merge this
> workaround for now, while a proper fix is being worked upon.

DTS workaround? No. I don't agree. Once it is merged it will not be fixed.

I am perfectly fine though with making the interconnect or even rpmh
regulator bool instead of tristate.

Best regards,
Krzysztof
  
Amit Pundir June 19, 2023, 7:06 a.m. UTC | #19
On Sat, 17 Jun 2023 at 12:51, Krzysztof Kozlowski
<krzysztof.kozlowski@linaro.org> wrote:
>
> On 16/06/2023 19:09, Amit Pundir wrote:
> > Hi,
> >
> > On Fri, 16 Jun 2023 at 13:57, Krzysztof Kozlowski
> > <krzysztof.kozlowski@linaro.org> wrote:
> >>
> >>
> >> So you have interconnect as module - this is not a supported setup. It
> >> might work with if all the modules are loaded very early or might not.
> >> Pinctrl is another driver which should be built-in.
> >>
> >> With your defconfig I see regular issue - console and system dies
> >> because of lack of interconnects, most likely. I don't see your WARNs -
> >> I just see usual hang.
> >>
> >> See:
> >> https://lore.kernel.org/all/20221021032702.1340963-1-krzysztof.kozlowski@linaro.org/
> >>
> >> If you want them to really be modules, then you need to fix all the
> >> dependencies (SOFTDEP?), probe ordering glitches. It's not a problem of
> >> DTS. Just because something can be built as module, does not mean it
> >> will work. We don't test it, we don't work with them as modules.
> >
> > I do somewhat agree with most of your arguments but not this one. If a
> > driver doesn't work as a module then it shouldn't be allowed to build
> > as a module.
>
> Of course you are right. That's why I am pushing against blindly adding
> "tristate" by everyone working on GKI. Because such folks like to make
> them tristate, but not actually test it or work on issues later.
>
> That's exactly the case from Google and Samsung patches here:
> https://lore.kernel.org/all/ac328b6a-a8e2-873d-4015-814cb4f5588e@canonical.com/
> and in previous submissions.
>
> > I took a quick look at the history of the interconnect
> > driver and it is tristate from the beginning. And not converted to a
> > modular build later-on like some of the other drivers to support GKI.
>
> OK, maybe it was never actually tested. Or maybe some versions were
> working on boards where debug serial does not have interconnect, but new
> chips just followed the pattern without testing?
> >
> >>
> >> It's kind of the same as here:
> >> https://lore.kernel.org/all/ac328b6a-a8e2-873d-4015-814cb4f5588e@canonical.com/
> >>
> >> I understand that we might have here regression, if these were working
> >> as modules, but I don't think we ever really committed to it. We can as
> >> well make it non-module to solve the regression.
> >
> > Sure. But since v6.4 is around the corner, can we merge this
> > workaround for now, while a proper fix is being worked upon.
>
> DTS workaround? No. I don't agree. Once it is merged it will not be fixed.
>
> I am perfectly fine though with making the interconnect or even rpmh
> regulator bool instead of tristate.

As Doug also mentioned in one of his earlier emails, this workaround
is only limited to one particular board. If I try to change the common
interconnect and/or rpmh driver then it will need ack from other stake
holders as well and I'll most likely get more pushback from that side.

Regards,
Amit Pundir
  
Bjorn Andersson June 20, 2023, 3:59 p.m. UTC | #20
On Wed, Jun 14, 2023 at 12:44:15PM -0700, Doug Anderson wrote:
> Hi,
> 
> On Wed, Jun 14, 2023 at 11:47 AM Krzysztof Kozlowski
> <krzysztof.kozlowski@linaro.org> wrote:
> >
> > On 14/06/2023 20:18, Linux regression tracking (Thorsten Leemhuis) wrote:
> > > On 02.06.23 18:12, Amit Pundir wrote:
> > >> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
> > >> list to workaround a boot regression uncovered by the upstream
> > >> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
> > >> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
> > >>
> > >> Without this fix DB845c fail to boot at times because one of the
> > >> lvs1 or lvs2 regulators fail to turn ON in time.
> > >
> > > /me waves friendly
> > >
> > > FWIW, as it's not obvious: this...
> > >
> > >> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
> > >
> > > ...is a report about a regression. One that we could still solve before
> > > 6.4 is out. One I'll likely will point Linus to, unless a fix comes into
> > > sight.
> > >
> > > When I noticed the reluctant replies to this patch I earlier today asked
> > > in the thread with the report what the plan forward was:
> > > https://lore.kernel.org/all/CAD%3DFV%3DV-h4EUKHCM9UivsFHRsJPY5sAiwXV3a1hUX9DUMkkxdg@mail.gmail.com/
> > >
> > > Dough there replied:
> > >
> > > ```
> > > Of the two proposals made (the revert vs. the reordering of the dts),
> > > the reordering of the dts seems better. It only affects the one buggy
> > > board (rather than preventing us to move to async probe for everyone)
> > > and it also has a chance of actually fixing something (changing the
> > > order that regulators probe in rpmh-regulator might legitimately work
> > > around the problem). That being said, just like the revert the dts
> > > reordering is still just papering over the problem and is fragile /
> > > not guaranteed to work forever.
> > > ```
> > >
> > > Papering over obviously is not good, but has anyone a better idea to fix
> > > this? Or is "not fixing" for some reason an viable option here?
> > >
> >
> > I understand there is a regression, although kernel is not mainline
> > (hash df7443a96851 is unknown) and the only solutions were papering the
> > problem. Reverting commit is a temporary workaround. Moving nodes in DTS
> > is not acceptable because it hides actual problem and only solves this
> > one particular observed problem, while actual issue is still there. It
> > would be nice to be able to reproduce it on real mainline with normal
> > operating system (not AOSP) - with ramdiks/without/whatever. So far no
> > one did it, right?
> 
> The worry I have about the revert here is that it will never be able
> to be undone and that doesn't seem great long term. I'm all for a
> temporary revert to fix a problem while the root cause is understood,
> but in this case I have a hard time believing that we'll make more
> progress towards a root cause once the revert lands. All the
> investigation we've done so far seems to indicate that the revert only
> fixes the problem by luck...
> 
> I completely agree that moving the nodes in the DTS is a hack and just
> hides the problem. However, it also at least limits the workaround to
> the one board showing the problem and doesn't mean we're stuck with
> synchronous probe for rpmh-regulator for all eternity because nobody
> can understand this timing issue on db845c.
> 

I agree that we shouldn't hide this by reverting the regulator change.


And as has been stated a few times already, the symptom indicates that
we have a misconfigured system.

Before accepting a patch just shuffling the bricks, I'd like to see some
more analysis of what happens wrt the rpmh right before the timeout.
Perhaps the landing team can assist here?

Regards,
Bjorn
  
Linux regression tracking (Thorsten Leemhuis) June 22, 2023, 7:47 a.m. UTC | #21
Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

As Linus will likely release 6.4 on this or the following Sunday a quick
status inquiry so I can brief him appropriately: is there any hope the
regression this patch tried to fix will be resolved any time soon?
Doesn't look like it from below message and this thread, but maybe I
missed something.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

On 20.06.23 17:59, Bjorn Andersson wrote:
> On Wed, Jun 14, 2023 at 12:44:15PM -0700, Doug Anderson wrote:
>> Hi,
>>
>> On Wed, Jun 14, 2023 at 11:47 AM Krzysztof Kozlowski
>> <krzysztof.kozlowski@linaro.org> wrote:
>>>
>>> On 14/06/2023 20:18, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>> On 02.06.23 18:12, Amit Pundir wrote:
>>>>> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
>>>>> list to workaround a boot regression uncovered by the upstream
>>>>> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
>>>>> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
>>>>>
>>>>> Without this fix DB845c fail to boot at times because one of the
>>>>> lvs1 or lvs2 regulators fail to turn ON in time.
>>>>
>>>> /me waves friendly
>>>>
>>>> FWIW, as it's not obvious: this...
>>>>
>>>>> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
>>>>
>>>> ...is a report about a regression. One that we could still solve before
>>>> 6.4 is out. One I'll likely will point Linus to, unless a fix comes into
>>>> sight.
>>>>
>>>> When I noticed the reluctant replies to this patch I earlier today asked
>>>> in the thread with the report what the plan forward was:
>>>> https://lore.kernel.org/all/CAD%3DFV%3DV-h4EUKHCM9UivsFHRsJPY5sAiwXV3a1hUX9DUMkkxdg@mail.gmail.com/
>>>>
>>>> Dough there replied:
>>>>
>>>> ```
>>>> Of the two proposals made (the revert vs. the reordering of the dts),
>>>> the reordering of the dts seems better. It only affects the one buggy
>>>> board (rather than preventing us to move to async probe for everyone)
>>>> and it also has a chance of actually fixing something (changing the
>>>> order that regulators probe in rpmh-regulator might legitimately work
>>>> around the problem). That being said, just like the revert the dts
>>>> reordering is still just papering over the problem and is fragile /
>>>> not guaranteed to work forever.
>>>> ```
>>>>
>>>> Papering over obviously is not good, but has anyone a better idea to fix
>>>> this? Or is "not fixing" for some reason an viable option here?
>>>>
>>>
>>> I understand there is a regression, although kernel is not mainline
>>> (hash df7443a96851 is unknown) and the only solutions were papering the
>>> problem. Reverting commit is a temporary workaround. Moving nodes in DTS
>>> is not acceptable because it hides actual problem and only solves this
>>> one particular observed problem, while actual issue is still there. It
>>> would be nice to be able to reproduce it on real mainline with normal
>>> operating system (not AOSP) - with ramdiks/without/whatever. So far no
>>> one did it, right?
>>
>> The worry I have about the revert here is that it will never be able
>> to be undone and that doesn't seem great long term. I'm all for a
>> temporary revert to fix a problem while the root cause is understood,
>> but in this case I have a hard time believing that we'll make more
>> progress towards a root cause once the revert lands. All the
>> investigation we've done so far seems to indicate that the revert only
>> fixes the problem by luck...
>>
>> I completely agree that moving the nodes in the DTS is a hack and just
>> hides the problem. However, it also at least limits the workaround to
>> the one board showing the problem and doesn't mean we're stuck with
>> synchronous probe for rpmh-regulator for all eternity because nobody
>> can understand this timing issue on db845c.
>>
> 
> I agree that we shouldn't hide this by reverting the regulator change.
> 
> 
> And as has been stated a few times already, the symptom indicates that
> we have a misconfigured system.
> 
> Before accepting a patch just shuffling the bricks, I'd like to see some
> more analysis of what happens wrt the rpmh right before the timeout.
> Perhaps the landing team can assist here?
> 
> Regards,
> Bjorn
> 
>
  
Amit Pundir June 22, 2023, 11:48 a.m. UTC | #22
On Thu, 22 Jun 2023 at 13:17, Linux regression tracking (Thorsten
Leemhuis) <regressions@leemhuis.info> wrote:
>
> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> for once, to make this easily accessible to everyone.
>
> As Linus will likely release 6.4 on this or the following Sunday a quick
> status inquiry so I can brief him appropriately: is there any hope the
> regression this patch tried to fix will be resolved any time soon?

We are most likely to miss v6.4. I'm trying to reproduce the crash
with tracing enabled, to share some more debug information.

Regards,
Amit Pundir

> Doesn't look like it from below message and this thread, but maybe I
> missed something.
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot poke
>
> On 20.06.23 17:59, Bjorn Andersson wrote:
> > On Wed, Jun 14, 2023 at 12:44:15PM -0700, Doug Anderson wrote:
> >> Hi,
> >>
> >> On Wed, Jun 14, 2023 at 11:47 AM Krzysztof Kozlowski
> >> <krzysztof.kozlowski@linaro.org> wrote:
> >>>
> >>> On 14/06/2023 20:18, Linux regression tracking (Thorsten Leemhuis) wrote:
> >>>> On 02.06.23 18:12, Amit Pundir wrote:
> >>>>> Move lvs1 and lvs2 regulator nodes up in the rpmh-regulators
> >>>>> list to workaround a boot regression uncovered by the upstream
> >>>>> commit ad44ac082fdf ("regulator: qcom-rpmh: Revert "regulator:
> >>>>> qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"").
> >>>>>
> >>>>> Without this fix DB845c fail to boot at times because one of the
> >>>>> lvs1 or lvs2 regulators fail to turn ON in time.
> >>>>
> >>>> /me waves friendly
> >>>>
> >>>> FWIW, as it's not obvious: this...
> >>>>
> >>>>> Link: https://lore.kernel.org/all/CAMi1Hd1avQDcDQf137m2auz2znov4XL8YGrLZsw5edb-NtRJRw@mail.gmail.com/
> >>>>
> >>>> ...is a report about a regression. One that we could still solve before
> >>>> 6.4 is out. One I'll likely will point Linus to, unless a fix comes into
> >>>> sight.
> >>>>
> >>>> When I noticed the reluctant replies to this patch I earlier today asked
> >>>> in the thread with the report what the plan forward was:
> >>>> https://lore.kernel.org/all/CAD%3DFV%3DV-h4EUKHCM9UivsFHRsJPY5sAiwXV3a1hUX9DUMkkxdg@mail.gmail.com/
> >>>>
> >>>> Dough there replied:
> >>>>
> >>>> ```
> >>>> Of the two proposals made (the revert vs. the reordering of the dts),
> >>>> the reordering of the dts seems better. It only affects the one buggy
> >>>> board (rather than preventing us to move to async probe for everyone)
> >>>> and it also has a chance of actually fixing something (changing the
> >>>> order that regulators probe in rpmh-regulator might legitimately work
> >>>> around the problem). That being said, just like the revert the dts
> >>>> reordering is still just papering over the problem and is fragile /
> >>>> not guaranteed to work forever.
> >>>> ```
> >>>>
> >>>> Papering over obviously is not good, but has anyone a better idea to fix
> >>>> this? Or is "not fixing" for some reason an viable option here?
> >>>>
> >>>
> >>> I understand there is a regression, although kernel is not mainline
> >>> (hash df7443a96851 is unknown) and the only solutions were papering the
> >>> problem. Reverting commit is a temporary workaround. Moving nodes in DTS
> >>> is not acceptable because it hides actual problem and only solves this
> >>> one particular observed problem, while actual issue is still there. It
> >>> would be nice to be able to reproduce it on real mainline with normal
> >>> operating system (not AOSP) - with ramdiks/without/whatever. So far no
> >>> one did it, right?
> >>
> >> The worry I have about the revert here is that it will never be able
> >> to be undone and that doesn't seem great long term. I'm all for a
> >> temporary revert to fix a problem while the root cause is understood,
> >> but in this case I have a hard time believing that we'll make more
> >> progress towards a root cause once the revert lands. All the
> >> investigation we've done so far seems to indicate that the revert only
> >> fixes the problem by luck...
> >>
> >> I completely agree that moving the nodes in the DTS is a hack and just
> >> hides the problem. However, it also at least limits the workaround to
> >> the one board showing the problem and doesn't mean we're stuck with
> >> synchronous probe for rpmh-regulator for all eternity because nobody
> >> can understand this timing issue on db845c.
> >>
> >
> > I agree that we shouldn't hide this by reverting the regulator change.
> >
> >
> > And as has been stated a few times already, the symptom indicates that
> > we have a misconfigured system.
> >
> > Before accepting a patch just shuffling the bricks, I'd like to see some
> > more analysis of what happens wrt the rpmh right before the timeout.
> > Perhaps the landing team can assist here?
> >
> > Regards,
> > Bjorn
> >
> >
  
Amit Pundir July 7, 2023, 5:08 a.m. UTC | #23
On Thu, 22 Jun 2023 at 17:18, Amit Pundir <amit.pundir@linaro.org> wrote:
>
> On Thu, 22 Jun 2023 at 13:17, Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
> >
> > Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> > for once, to make this easily accessible to everyone.
> >
> > As Linus will likely release 6.4 on this or the following Sunday a quick
> > status inquiry so I can brief him appropriately: is there any hope the
> > regression this patch tried to fix will be resolved any time soon?
>
> We are most likely to miss v6.4. I'm trying to reproduce the crash
> with tracing enabled, to share some more debug information.

FWIW, I couldn't reproduce this bug on v6.5 merge window commit
d528014517f2 (Revert ".gitignore: ignore *.cover and *.mbx")
on 100+ boot tests last night.
For the time being I'm keeping an eye on it and will trigger the boot
tests occasionally in the v6.5 development cycle.

>
> Regards,
> Amit Pundir
>
  
Linux regression tracking (Thorsten Leemhuis) July 14, 2023, 11:04 a.m. UTC | #24
On 07.07.23 07:08, Amit Pundir wrote:
> On Thu, 22 Jun 2023 at 17:18, Amit Pundir <amit.pundir@linaro.org> wrote:
>> On Thu, 22 Jun 2023 at 13:17, Linux regression tracking (Thorsten
>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>
>>> As Linus will likely release 6.4 on this or the following Sunday a quick
>>> status inquiry so I can brief him appropriately: is there any hope the
>>> regression this patch tried to fix will be resolved any time soon?
>>
>> We are most likely to miss v6.4. I'm trying to reproduce the crash
>> with tracing enabled, to share some more debug information.
> 
> FWIW, I couldn't reproduce this bug on v6.5 merge window commit
> d528014517f2 (Revert ".gitignore: ignore *.cover and *.mbx")
> on 100+ boot tests last night.
> For the time being I'm keeping an eye on it and will trigger the boot
> tests occasionally in the v6.5 development cycle.

No update since then, so I assume this remains fixed. I'll thus remove
this from the tracking; please holler if you think it might make sense
to continue tracking this.

#regzbot resolved: apparently solved with during the 6.5 merge window
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
  

Patch

diff --git a/arch/arm64/boot/dts/qcom/sdm845-db845c.dts b/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
index e14fe9bbb386..df2fde9063dc 100644
--- a/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
+++ b/arch/arm64/boot/dts/qcom/sdm845-db845c.dts
@@ -301,6 +301,18 @@  regulators-0 {
 		vdd-l26-supply = <&vreg_s3a_1p35>;
 		vin-lvs-1-2-supply = <&vreg_s4a_1p8>;
 
+		vreg_lvs1a_1p8: lvs1 {
+			regulator-min-microvolt = <1800000>;
+			regulator-max-microvolt = <1800000>;
+			regulator-always-on;
+		};
+
+		vreg_lvs2a_1p8: lvs2 {
+			regulator-min-microvolt = <1800000>;
+			regulator-max-microvolt = <1800000>;
+			regulator-always-on;
+		};
+
 		vreg_s3a_1p35: smps3 {
 			regulator-min-microvolt = <1352000>;
 			regulator-max-microvolt = <1352000>;
@@ -381,18 +393,6 @@  vreg_l26a_1p2: ldo26 {
 			regulator-max-microvolt = <1200000>;
 			regulator-initial-mode = <RPMH_REGULATOR_MODE_HPM>;
 		};
-
-		vreg_lvs1a_1p8: lvs1 {
-			regulator-min-microvolt = <1800000>;
-			regulator-max-microvolt = <1800000>;
-			regulator-always-on;
-		};
-
-		vreg_lvs2a_1p8: lvs2 {
-			regulator-min-microvolt = <1800000>;
-			regulator-max-microvolt = <1800000>;
-			regulator-always-on;
-		};
 	};
 
 	regulators-1 {