drm/amdgpu: add mb for si

Message ID 20221118074810.380368-1-lizhenneng@kylinos.cn
State New
Headers
Series drm/amdgpu: add mb for si |

Commit Message

李真能 Nov. 18, 2022, 7:48 a.m. UTC
  During reboot test on arm64 platform, it may failure on boot,
so add this mb in smc.

The error message are as follows:
[    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR*
			    late_init of IP block <si_dpm> failed -22
[    7.006919][ 7] [  T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
[    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init

Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
---
 drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
 1 file changed, 2 insertions(+)
  

Comments

Christian König Nov. 18, 2022, 8:01 a.m. UTC | #1
Am 18.11.22 um 08:48 schrieb Zhenneng Li:
> During reboot test on arm64 platform, it may failure on boot,
> so add this mb in smc.
>
> The error message are as follows:
> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR*
> 			    late_init of IP block <si_dpm> failed -22
> [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
> [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init

Memory barries are not supposed to be sprinkled around like this, you 
need to give a detailed explanation why this is necessary.

Regards,
Christian.

>
> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
> ---
>   drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> index 8f994ffa9cd1..c7656f22278d 100644
> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev)
>   	u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>   	u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>   
> +	mb();
> +
>   	if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>   		return true;
>
  
Michel Dänzer Nov. 18, 2022, 9:18 a.m. UTC | #2
On 11/18/22 09:01, Christian König wrote:
> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
>> During reboot test on arm64 platform, it may failure on boot,
>> so add this mb in smc.
>>
>> The error message are as follows:
>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR*
>>                 late_init of IP block <si_dpm> failed -22
>> [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
>> [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
> 
> Memory barries are not supposed to be sprinkled around like this, you need to give a detailed explanation why this is necessary.
> 
> Regards,
> Christian.
> 
>>
>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
>> ---
>>   drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>> index 8f994ffa9cd1..c7656f22278d 100644
>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev)
>>       u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>       u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>   +    mb();
>> +
>>       if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>           return true;

In particular, it makes no sense in this specific place, since it cannot directly affect the values of rst & clk.
  
李真能 Nov. 18, 2022, 9:24 a.m. UTC | #3
在 2022/11/18 17:18, Michel Dänzer 写道:
> On 11/18/22 09:01, Christian König wrote:
>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
>>> During reboot test on arm64 platform, it may failure on boot,
>>> so add this mb in smc.
>>>
>>> The error message are as follows:
>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR*
>>>                  late_init of IP block <si_dpm> failed -22
>>> [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
>>> [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
>> Memory barries are not supposed to be sprinkled around like this, you need to give a detailed explanation why this is necessary.
>>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
>>> ---
>>>    drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>>    1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>> index 8f994ffa9cd1..c7656f22278d 100644
>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev)
>>>        u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>>        u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>>    +    mb();
>>> +
>>>        if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>>            return true;
> In particular, it makes no sense in this specific place, since it cannot directly affect the values of rst & clk.

I thinks so too.

But when I do reboot test using nine desktop machines,  there maybe 
report this error on one or two machines after Hundreds of times or 
Thousands of times reboot test, at the beginning, I use msleep() instead 
of mb(), these two methods are all works, but I don't know what is the 
root case.

I use this method on other verdor's oland card, this error message are 
reported again.

What could be the root reason?

test environmen:

graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87

driver: amdgpu

os: ubuntu 2004

platform: arm64

kernel: 5.4.18

>
  
Evan Quan Nov. 24, 2022, 10:04 a.m. UTC | #4
[AMD Official Use Only - General]

Could the attached patch help?

Evan
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of ???
> Sent: Friday, November 18, 2022 5:25 PM
> To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian
> <Christian.Koenig@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>;
> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: add mb for si
> 
> 
> 在 2022/11/18 17:18, Michel Dänzer 写道:
> > On 11/18/22 09:01, Christian König wrote:
> >> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
> >>> During reboot test on arm64 platform, it may failure on boot, so add
> >>> this mb in smc.
> >>>
> >>> The error message are as follows:
> >>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
> >>> [amdgpu]] *ERROR*
> >>>                  late_init of IP block <si_dpm> failed -22 [
> >>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
> >>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295] amdgpu
> >>> 0000:04:00.0: Fatal error during GPU init
> >> Memory barries are not supposed to be sprinkled around like this, you
> need to give a detailed explanation why this is necessary.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
> >>> ---
> >>>    drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
> >>>    1 file changed, 2 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>> index 8f994ffa9cd1..c7656f22278d 100644
> >>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
> >>> amdgpu_device *adev)
> >>>        u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
> >>>        u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
> >>>    +    mb();
> >>> +
> >>>        if (!(rst & RST_REG) && !(clk & CK_DISABLE))
> >>>            return true;
> > In particular, it makes no sense in this specific place, since it cannot directly
> affect the values of rst & clk.
> 
> I thinks so too.
> 
> But when I do reboot test using nine desktop machines,  there maybe report
> this error on one or two machines after Hundreds of times or Thousands of
> times reboot test, at the beginning, I use msleep() instead of mb(), these
> two methods are all works, but I don't know what is the root case.
> 
> I use this method on other verdor's oland card, this error message are
> reported again.
> 
> What could be the root reason?
> 
> test environmen:
> 
> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
> 
> driver: amdgpu
> 
> os: ubuntu 2004
> 
> platform: arm64
> 
> kernel: 5.4.18
> 
> >
  
Christian König Nov. 24, 2022, 10:06 a.m. UTC | #5
That's not a patch but some binary file?

Christian.

Am 24.11.22 um 11:04 schrieb Quan, Evan:
> [AMD Official Use Only - General]
>
> Could the attached patch help?
>
> Evan
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of ???
>> Sent: Friday, November 18, 2022 5:25 PM
>> To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian
>> <Christian.Koenig@amd.com>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>
>> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>;
>> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>
>>
>> 在 2022/11/18 17:18, Michel Dänzer 写道:
>>> On 11/18/22 09:01, Christian König wrote:
>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
>>>>> During reboot test on arm64 platform, it may failure on boot, so add
>>>>> this mb in smc.
>>>>>
>>>>> The error message are as follows:
>>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
>>>>> [amdgpu]] *ERROR*
>>>>>                   late_init of IP block <si_dpm> failed -22 [
>>>>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
>>>>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295] amdgpu
>>>>> 0000:04:00.0: Fatal error during GPU init
>>>> Memory barries are not supposed to be sprinkled around like this, you
>> need to give a detailed explanation why this is necessary.
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>>>>     1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> index 8f994ffa9cd1..c7656f22278d 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
>>>>> amdgpu_device *adev)
>>>>>         u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>>>>         u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>>>>     +    mb();
>>>>> +
>>>>>         if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>>>>             return true;
>>> In particular, it makes no sense in this specific place, since it cannot directly
>> affect the values of rst & clk.
>>
>> I thinks so too.
>>
>> But when I do reboot test using nine desktop machines,  there maybe report
>> this error on one or two machines after Hundreds of times or Thousands of
>> times reboot test, at the beginning, I use msleep() instead of mb(), these
>> two methods are all works, but I don't know what is the root case.
>>
>> I use this method on other verdor's oland card, this error message are
>> reported again.
>>
>> What could be the root reason?
>>
>> test environmen:
>>
>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
>>
>> driver: amdgpu
>>
>> os: ubuntu 2004
>>
>> platform: arm64
>>
>> kernel: 5.4.18
>>
  
Lazar, Lijo Nov. 24, 2022, 10:41 a.m. UTC | #6
On 11/24/2022 3:34 PM, Quan, Evan wrote:
> [AMD Official Use Only - General]
> 
> Could the attached patch help?
> 
> Evan
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of ???
>> Sent: Friday, November 18, 2022 5:25 PM
>> To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian
>> <Christian.Koenig@amd.com>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>
>> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>;
>> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>
>>
>> 在 2022/11/18 17:18, Michel Dänzer 写道:
>>> On 11/18/22 09:01, Christian König wrote:
>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
>>>>> During reboot test on arm64 platform, it may failure on boot, so add
>>>>> this mb in smc.
>>>>>
>>>>> The error message are as follows:
>>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
>>>>> [amdgpu]] *ERROR*
>>>>>                   late_init of IP block <si_dpm> failed -22 [
>>>>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:

The issue is happening in late_init() which eventually does

	ret = si_thermal_enable_alert(adev, false);

Just before this, si_thermal_start_thermal_controller is called in 
hw_init and that enables thermal alert.

Maybe the issue is with enable/disable of thermal alerts in quick 
succession. Adding a delay inside si_thermal_start_thermal_controller 
might help.

Thanks,
Lijo

>>>>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295] amdgpu
>>>>> 0000:04:00.0: Fatal error during GPU init
>>>> Memory barries are not supposed to be sprinkled around like this, you
>> need to give a detailed explanation why this is necessary.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>>>>     1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> index 8f994ffa9cd1..c7656f22278d 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
>>>>> amdgpu_device *adev)
>>>>>         u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>>>>         u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>>>>     +    mb();
>>>>> +
>>>>>         if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>>>>             return true;
>>> In particular, it makes no sense in this specific place, since it cannot directly
>> affect the values of rst & clk.
>>
>> I thinks so too.
>>
>> But when I do reboot test using nine desktop machines,  there maybe report
>> this error on one or two machines after Hundreds of times or Thousands of
>> times reboot test, at the beginning, I use msleep() instead of mb(), these
>> two methods are all works, but I don't know what is the root case.
>>
>> I use this method on other verdor's oland card, this error message are
>> reported again.
>>
>> What could be the root reason?
>>
>> test environmen:
>>
>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
>>
>> driver: amdgpu
>>
>> os: ubuntu 2004
>>
>> platform: arm64
>>
>> kernel: 5.4.18
>>
>>>
  
Lazar, Lijo Nov. 24, 2022, 10:49 a.m. UTC | #7
On 11/24/2022 4:11 PM, Lazar, Lijo wrote:
> 
> 
> On 11/24/2022 3:34 PM, Quan, Evan wrote:
>> [AMD Official Use Only - General]
>>
>> Could the attached patch help?
>>
>> Evan
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of ???
>>> Sent: Friday, November 18, 2022 5:25 PM
>>> To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian
>>> <Christian.Koenig@amd.com>; Deucher, Alexander
>>> <Alexander.Deucher@amd.com>
>>> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>;
>>> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>>
>>>
>>> 在 2022/11/18 17:18, Michel Dänzer 写道:
>>>> On 11/18/22 09:01, Christian König wrote:
>>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
>>>>>> During reboot test on arm64 platform, it may failure on boot, so add
>>>>>> this mb in smc.
>>>>>>
>>>>>> The error message are as follows:
>>>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
>>>>>> [amdgpu]] *ERROR*
>>>>>>                   late_init of IP block <si_dpm> failed -22 [
>>>>>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
> 
> The issue is happening in late_init() which eventually does
> 
>      ret = si_thermal_enable_alert(adev, false);
> 
> Just before this, si_thermal_start_thermal_controller is called in 
> hw_init and that enables thermal alert.
> 
> Maybe the issue is with enable/disable of thermal alerts in quick 
> succession. Adding a delay inside si_thermal_start_thermal_controller 
> might help.
> 

On a second look, temperature range is already set as part of 
si_thermal_start_thermal_controller in hw_init
https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780

There is no need to set it again here -

https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635

I think it is safe to remove the call from late_init altogether. Alex/Evan?

Thanks,
Lijo

> Thanks,
> Lijo
> 
>>>>>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295] amdgpu
>>>>>> 0000:04:00.0: Fatal error during GPU init
>>>>> Memory barries are not supposed to be sprinkled around like this, you
>>> need to give a detailed explanation why this is necessary.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>>>>>     1 file changed, 2 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>> index 8f994ffa9cd1..c7656f22278d 100644
>>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
>>>>>> amdgpu_device *adev)
>>>>>>         u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>>>>>         u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>>>>>     +    mb();
>>>>>> +
>>>>>>         if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>>>>>             return true;
>>>> In particular, it makes no sense in this specific place, since it 
>>>> cannot directly
>>> affect the values of rst & clk.
>>>
>>> I thinks so too.
>>>
>>> But when I do reboot test using nine desktop machines,  there maybe 
>>> report
>>> this error on one or two machines after Hundreds of times or 
>>> Thousands of
>>> times reboot test, at the beginning, I use msleep() instead of mb(), 
>>> these
>>> two methods are all works, but I don't know what is the root case.
>>>
>>> I use this method on other verdor's oland card, this error message are
>>> reported again.
>>>
>>> What could be the root reason?
>>>
>>> test environmen:
>>>
>>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
>>>
>>> driver: amdgpu
>>>
>>> os: ubuntu 2004
>>>
>>> platform: arm64
>>>
>>> kernel: 5.4.18
>>>
>>>>
  
Evan Quan Nov. 25, 2022, 2:06 a.m. UTC | #8
[AMD Official Use Only - General]

Did you see that? It's a patch which I created by git-format-patch.
Anyway I will paste the changes below. I was suspecting maybe we need some waits for smu running.

diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
index 49c398ec0aaf..9f308a021b2d 100644
--- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
+++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
@@ -6814,6 +6814,7 @@ static int si_dpm_enable(struct amdgpu_device *adev)
        struct si_power_info *si_pi = si_get_pi(adev);
        struct amdgpu_ps *boot_ps = adev->pm.dpm.boot_ps;
        int ret;
+       int i;

        if (amdgpu_si_is_smc_running(adev))
                return -EINVAL;
@@ -6909,6 +6910,17 @@ static int si_dpm_enable(struct amdgpu_device *adev)
        si_program_response_times(adev);
        si_program_ds_registers(adev);
        si_dpm_start_smc(adev);
+       /* Waiting for smc alive */
+       for (i = 0; i < adev->usec_timeout; i++) {
+               if (amdgpu_si_is_smc_running(adev))
+                       break;
+               udelay(1);
+       }
+       if (i >= adev->usec_timeout) {
+               DRM_ERROR("Timedout on waiting for smu running\n");
+               return -EINVAL;
+       }
+
        ret = si_notify_smc_display_change(adev, false);
        if (ret) {
                DRM_ERROR("si_notify_smc_display_change failed\n");


BR
Evan
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Thursday, November 24, 2022 6:06 PM
> To: Quan, Evan <Evan.Quan@amd.com>; 李真能 <lizhenneng@kylinos.cn>;
> Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian
> <Christian.Koenig@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>
> Cc: dri-devel@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>;
> linux-kernel@vger.kernel.org; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: add mb for si
> 
> That's not a patch but some binary file?
> 
> Christian.
> 
> Am 24.11.22 um 11:04 schrieb Quan, Evan:
> > [AMD Official Use Only - General]
> >
> > Could the attached patch help?
> >
> > Evan
> >> -----Original Message-----
> >> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf
> Of ???
> >> Sent: Friday, November 18, 2022 5:25 PM
> >> To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian
> >> <Christian.Koenig@amd.com>; Deucher, Alexander
> >> <Alexander.Deucher@amd.com>
> >> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>;
> >> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org
> >> Subject: Re: [PATCH] drm/amdgpu: add mb for si
> >>
> >>
> >> 在 2022/11/18 17:18, Michel Dänzer 写道:
> >>> On 11/18/22 09:01, Christian König wrote:
> >>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
> >>>>> During reboot test on arm64 platform, it may failure on boot, so
> >>>>> add this mb in smc.
> >>>>>
> >>>>> The error message are as follows:
> >>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
> >>>>> [amdgpu]] *ERROR*
> >>>>>                   late_init of IP block <si_dpm> failed -22 [
> >>>>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
> >>>>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295]
> >>>>> amdgpu
> >>>>> 0000:04:00.0: Fatal error during GPU init
> >>>> Memory barries are not supposed to be sprinkled around like this,
> >>>> you
> >> need to give a detailed explanation why this is necessary.
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
> >>>>> ---
> >>>>>     drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
> >>>>>     1 file changed, 2 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>> index 8f994ffa9cd1..c7656f22278d 100644
> >>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
> >>>>> amdgpu_device *adev)
> >>>>>         u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
> >>>>>         u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
> >>>>>     +    mb();
> >>>>> +
> >>>>>         if (!(rst & RST_REG) && !(clk & CK_DISABLE))
> >>>>>             return true;
> >>> In particular, it makes no sense in this specific place, since it
> >>> cannot directly
> >> affect the values of rst & clk.
> >>
> >> I thinks so too.
> >>
> >> But when I do reboot test using nine desktop machines,  there maybe
> >> report this error on one or two machines after Hundreds of times or
> >> Thousands of times reboot test, at the beginning, I use msleep()
> >> instead of mb(), these two methods are all works, but I don't know what
> is the root case.
> >>
> >> I use this method on other verdor's oland card, this error message
> >> are reported again.
> >>
> >> What could be the root reason?
> >>
> >> test environmen:
> >>
> >> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
> >>
> >> driver: amdgpu
> >>
> >> os: ubuntu 2004
> >>
> >> platform: arm64
> >>
> >> kernel: 5.4.18
> >>
  
Evan Quan Nov. 25, 2022, 2:13 a.m. UTC | #9
[AMD Official Use Only - General]



> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> Sent: Thursday, November 24, 2022 6:49 PM
> To: Quan, Evan <Evan.Quan@amd.com>; 李真能 <lizhenneng@kylinos.cn>;
> Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian
> <Christian.Koenig@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>;
> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: add mb for si
> 
> 
> 
> On 11/24/2022 4:11 PM, Lazar, Lijo wrote:
> >
> >
> > On 11/24/2022 3:34 PM, Quan, Evan wrote:
> >> [AMD Official Use Only - General]
> >>
> >> Could the attached patch help?
> >>
> >> Evan
> >>> -----Original Message-----
> >>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf
> Of ???
> >>> Sent: Friday, November 18, 2022 5:25 PM
> >>> To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian
> >>> <Christian.Koenig@amd.com>; Deucher, Alexander
> >>> <Alexander.Deucher@amd.com>
> >>> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>;
> >>> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org
> >>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
> >>>
> >>>
> >>> 在 2022/11/18 17:18, Michel Dänzer 写道:
> >>>> On 11/18/22 09:01, Christian König wrote:
> >>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
> >>>>>> During reboot test on arm64 platform, it may failure on boot, so
> >>>>>> add this mb in smc.
> >>>>>>
> >>>>>> The error message are as follows:
> >>>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
> >>>>>> [amdgpu]] *ERROR*
> >>>>>>                   late_init of IP block <si_dpm> failed -22 [
> >>>>>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
> >
> > The issue is happening in late_init() which eventually does
> >
> >      ret = si_thermal_enable_alert(adev, false);
> >
> > Just before this, si_thermal_start_thermal_controller is called in
> > hw_init and that enables thermal alert.
> >
> > Maybe the issue is with enable/disable of thermal alerts in quick
> > succession. Adding a delay inside si_thermal_start_thermal_controller
> > might help.
> >
> 
> On a second look, temperature range is already set as part of
> si_thermal_start_thermal_controller in hw_init
> https://elixir.bootlin.com/linux/v6.1-
> rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780
> 
> There is no need to set it again here -
> 
> https://elixir.bootlin.com/linux/v6.1-
> rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635
> 
> I think it is safe to remove the call from late_init altogether. Alex/Evan?
> 
[Quan, Evan] Yes, it makes sense to me. But I'm not sure whether that’s related with the issue here.
Since per my understandings, if the issue is caused by double calling of thermal_alert enablement, it will fail every time.
That cannot explain why adding some delays or a mb() calling can help.

BR
Evan
> Thanks,
> Lijo
> 
> > Thanks,
> > Lijo
> >
> >>>>>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295] amdgpu
> >>>>>> 0000:04:00.0: Fatal error during GPU init
> >>>>> Memory barries are not supposed to be sprinkled around like this,
> you
> >>> need to give a detailed explanation why this is necessary.
> >>>>>
> >>>>> Regards,
> >>>>> Christian.
> >>>>>
> >>>>>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
> >>>>>>     1 file changed, 2 insertions(+)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> index 8f994ffa9cd1..c7656f22278d 100644
> >>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
> >>>>>> amdgpu_device *adev)
> >>>>>>         u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
> >>>>>>         u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
> >>>>>>     +    mb();
> >>>>>> +
> >>>>>>         if (!(rst & RST_REG) && !(clk & CK_DISABLE))
> >>>>>>             return true;
> >>>> In particular, it makes no sense in this specific place, since it
> >>>> cannot directly
> >>> affect the values of rst & clk.
> >>>
> >>> I thinks so too.
> >>>
> >>> But when I do reboot test using nine desktop machines,  there maybe
> >>> report
> >>> this error on one or two machines after Hundreds of times or
> >>> Thousands of
> >>> times reboot test, at the beginning, I use msleep() instead of mb(),
> >>> these
> >>> two methods are all works, but I don't know what is the root case.
> >>>
> >>> I use this method on other verdor's oland card, this error message are
> >>> reported again.
> >>>
> >>> What could be the root reason?
> >>>
> >>> test environmen:
> >>>
> >>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
> >>>
> >>> driver: amdgpu
> >>>
> >>> os: ubuntu 2004
> >>>
> >>> platform: arm64
> >>>
> >>> kernel: 5.4.18
> >>>
> >>>>
  
Lazar, Lijo Nov. 25, 2022, 6:12 a.m. UTC | #10
On 11/25/2022 7:43 AM, Quan, Evan wrote:
> [AMD Official Use Only - General]
>
>
>
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>> Sent: Thursday, November 24, 2022 6:49 PM
>> To: Quan, Evan <Evan.Quan@amd.com>; 李真能 <lizhenneng@kylinos.cn>;
>> Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian
>> <Christian.Koenig@amd.com>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>
>> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>;
>> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>
>>
>>
>> On 11/24/2022 4:11 PM, Lazar, Lijo wrote:
>>>
>>> On 11/24/2022 3:34 PM, Quan, Evan wrote:
>>>> [AMD Official Use Only - General]
>>>>
>>>> Could the attached patch help?
>>>>
>>>> Evan
>>>>> -----Original Message-----
>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf
>> Of ???
>>>>> Sent: Friday, November 18, 2022 5:25 PM
>>>>> To: Michel Dänzer <michel.daenzer@mailbox.org>; Koenig, Christian
>>>>> <Christian.Koenig@amd.com>; Deucher, Alexander
>>>>> <Alexander.Deucher@amd.com>
>>>>> Cc: amd-gfx@lists.freedesktop.org; Pan, Xinhui <Xinhui.Pan@amd.com>;
>>>>> linux-kernel@vger.kernel.org; dri-devel@lists.freedesktop.org
>>>>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>>>>
>>>>>
>>>>> 在 2022/11/18 17:18, Michel Dänzer 写道:
>>>>>> On 11/18/22 09:01, Christian König wrote:
>>>>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
>>>>>>>> During reboot test on arm64 platform, it may failure on boot, so
>>>>>>>> add this mb in smc.
>>>>>>>>
>>>>>>>> The error message are as follows:
>>>>>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
>>>>>>>> [amdgpu]] *ERROR*
>>>>>>>>                    late_init of IP block <si_dpm> failed -22 [
>>>>>>>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
>>> The issue is happening in late_init() which eventually does
>>>
>>>       ret = si_thermal_enable_alert(adev, false);
>>>
>>> Just before this, si_thermal_start_thermal_controller is called in
>>> hw_init and that enables thermal alert.
>>>
>>> Maybe the issue is with enable/disable of thermal alerts in quick
>>> succession. Adding a delay inside si_thermal_start_thermal_controller
>>> might help.
>>>
>> On a second look, temperature range is already set as part of
>> si_thermal_start_thermal_controller in hw_init
>> https://elixir.bootlin.com/linux/v6.1-
>> rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780
>>
>> There is no need to set it again here -
>>
>> https://elixir.bootlin.com/linux/v6.1-
>> rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635
>>
>> I think it is safe to remove the call from late_init altogether. Alex/Evan?
>>
> [Quan, Evan] Yes, it makes sense to me. But I'm not sure whether that’s related with the issue here.
> Since per my understandings, if the issue is caused by double calling of thermal_alert enablement, it will fail every time.
> That cannot explain why adding some delays or a mb() calling can help.

The side effect of the patch is just some random delay introduced for 
every SMC message

The issue happens in late_init(). Between late_init() and dpm 
enablement, there are many smc messages sent which don't have this 
issue. So I think the issue is not with FW not running.

Thus the only case I see is enable/disable of thermal alert in random 
succession.

Thanks,

Lijo

> BR
> Evan
>> Thanks,
>> Lijo
>>
>>> Thanks,
>>> Lijo
>>>
>>>>>>>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295] amdgpu
>>>>>>>> 0000:04:00.0: Fatal error during GPU init
>>>>>>> Memory barries are not supposed to be sprinkled around like this,
>> you
>>>>> need to give a detailed explanation why this is necessary.
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Signed-off-by: Zhenneng Li <lizhenneng@kylinos.cn>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>>>>>>>      1 file changed, 2 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>>>> index 8f994ffa9cd1..c7656f22278d 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
>>>>>>>> amdgpu_device *adev)
>>>>>>>>          u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>>>>>>>          u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>>>>>>>      +    mb();
>>>>>>>> +
>>>>>>>>          if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>>>>>>>              return true;
>>>>>> In particular, it makes no sense in this specific place, since it
>>>>>> cannot directly
>>>>> affect the values of rst & clk.
>>>>>
>>>>> I thinks so too.
>>>>>
>>>>> But when I do reboot test using nine desktop machines,  there maybe
>>>>> report
>>>>> this error on one or two machines after Hundreds of times or
>>>>> Thousands of
>>>>> times reboot test, at the beginning, I use msleep() instead of mb(),
>>>>> these
>>>>> two methods are all works, but I don't know what is the root case.
>>>>>
>>>>> I use this method on other verdor's oland card, this error message are
>>>>> reported again.
>>>>>
>>>>> What could be the root reason?
>>>>>
>>>>> test environmen:
>>>>>
>>>>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
>>>>>
>>>>> driver: amdgpu
>>>>>
>>>>> os: ubuntu 2004
>>>>>
>>>>> platform: arm64
>>>>>
>>>>> kernel: 5.4.18
>>>>>
  

Patch

diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
index 8f994ffa9cd1..c7656f22278d 100644
--- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
+++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
@@ -155,6 +155,8 @@  bool amdgpu_si_is_smc_running(struct amdgpu_device *adev)
 	u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
 	u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
 
+	mb();
+
 	if (!(rst & RST_REG) && !(clk & CK_DISABLE))
 		return true;