[1/2] adreno: Shutdown the GPU properly

Message ID 20221111194957.4046771-1-joel@joelfernandes.org
State New
Headers
Series [1/2] adreno: Shutdown the GPU properly |

Commit Message

Joel Fernandes Nov. 11, 2022, 7:49 p.m. UTC
  During kexec on ARM device, we notice that device_shutdown() only calls
pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
kthread is still running and further, there maybe active submits.

This causes all kinds of issues during a kexec reboot:

Warning from shutdown path:

[  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
[  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
[  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
[  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
[  292.509905] sp : ffffffc014473bf0
[...]
[  292.510043] Call trace:
[  292.510051]  adreno_runtime_suspend+0x3c/0x44
[  292.510061]  pm_generic_runtime_suspend+0x30/0x44
[  292.510071]  pm_runtime_force_suspend+0x54/0xc8
[  292.510081]  adreno_shutdown+0x1c/0x28
[  292.510090]  platform_shutdown+0x2c/0x38
[  292.510104]  device_shutdown+0x158/0x210
[  292.510119]  kernel_restart_prepare+0x40/0x4c

And here from GPU kthread, an SError OOPs:

[  192.648789]  el1h_64_error+0x7c/0x80
[  192.648812]  el1_interrupt+0x20/0x58
[  192.648833]  el1h_64_irq_handler+0x18/0x24
[  192.648854]  el1h_64_irq+0x7c/0x80
[  192.648873]  local_daif_inherit+0x10/0x18
[  192.648900]  el1h_64_sync_handler+0x48/0xb4
[  192.648921]  el1h_64_sync+0x7c/0x80
[  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
[  192.648968]  a6xx_hw_init+0x44/0xe38
[  192.648991]  msm_gpu_hw_init+0x48/0x80
[  192.649013]  msm_gpu_submit+0x5c/0x1a8
[  192.649034]  msm_job_run+0xb0/0x11c
[  192.649058]  drm_sched_main+0x170/0x434
[  192.649086]  kthread+0x134/0x300
[  192.649114]  ret_from_fork+0x10/0x20

Fix by calling adreno_system_suspend() in the device_shutdown() path.

Cc: Rob Clark <robdclark@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ricardo Ribalda <ribalda@chromium.org>
Cc: Ross Zwisler <zwisler@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
  

Comments

Joel Fernandes Nov. 11, 2022, 9:08 p.m. UTC | #1
> On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> 
> During kexec on ARM device, we notice that device_shutdown() only calls
> pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> kthread is still running and further, there maybe active submits.
> 
> This causes all kinds of issues during a kexec reboot:
> 
> Warning from shutdown path:
> 
> [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> [  292.509905] sp : ffffffc014473bf0
> [...]
> [  292.510043] Call trace:
> [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> [  292.510081]  adreno_shutdown+0x1c/0x28
> [  292.510090]  platform_shutdown+0x2c/0x38
> [  292.510104]  device_shutdown+0x158/0x210
> [  292.510119]  kernel_restart_prepare+0x40/0x4c
> 
> And here from GPU kthread, an SError OOPs:
> 
> [  192.648789]  el1h_64_error+0x7c/0x80
> [  192.648812]  el1_interrupt+0x20/0x58
> [  192.648833]  el1h_64_irq_handler+0x18/0x24
> [  192.648854]  el1h_64_irq+0x7c/0x80
> [  192.648873]  local_daif_inherit+0x10/0x18
> [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> [  192.648921]  el1h_64_sync+0x7c/0x80
> [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> [  192.648968]  a6xx_hw_init+0x44/0xe38
> [  192.648991]  msm_gpu_hw_init+0x48/0x80
> [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> [  192.649034]  msm_job_run+0xb0/0x11c
> [  192.649058]  drm_sched_main+0x170/0x434
> [  192.649086]  kthread+0x134/0x300
> [  192.649114]  ret_from_fork+0x10/0x20
> 
> Fix by calling adreno_system_suspend() in the device_shutdown() path.
> 
> Cc: Rob Clark <robdclark@chromium.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ricardo Ribalda <ribalda@chromium.org>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> index 24b489b6129a..f0cff62812c3 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
>    return 0;
> }
> 
> +static int adreno_system_suspend(struct device *dev);
> static void adreno_shutdown(struct platform_device *pdev)
> {
> -    pm_runtime_force_suspend(&pdev->dev);
> +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> +

This local variable definition should go to patch 2/2. Will fix in v2.

Thanks,

 - Joel


> +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> }
> 
> static const struct of_device_id dt_match[] = {
> -- 
> 2.38.1.493.g58b659f92b-goog
>
  
Rob Clark Nov. 12, 2022, 6:44 p.m. UTC | #2
On Fri, Nov 11, 2022 at 1:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
>
> > On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> >
> > During kexec on ARM device, we notice that device_shutdown() only calls
> > pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> > kthread is still running and further, there maybe active submits.
> >
> > This causes all kinds of issues during a kexec reboot:
> >
> > Warning from shutdown path:
> >
> > [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> > [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> > [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> > [  292.509905] sp : ffffffc014473bf0
> > [...]
> > [  292.510043] Call trace:
> > [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> > [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> > [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> > [  292.510081]  adreno_shutdown+0x1c/0x28
> > [  292.510090]  platform_shutdown+0x2c/0x38
> > [  292.510104]  device_shutdown+0x158/0x210
> > [  292.510119]  kernel_restart_prepare+0x40/0x4c
> >
> > And here from GPU kthread, an SError OOPs:
> >
> > [  192.648789]  el1h_64_error+0x7c/0x80
> > [  192.648812]  el1_interrupt+0x20/0x58
> > [  192.648833]  el1h_64_irq_handler+0x18/0x24
> > [  192.648854]  el1h_64_irq+0x7c/0x80
> > [  192.648873]  local_daif_inherit+0x10/0x18
> > [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> > [  192.648921]  el1h_64_sync+0x7c/0x80
> > [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> > [  192.648968]  a6xx_hw_init+0x44/0xe38
> > [  192.648991]  msm_gpu_hw_init+0x48/0x80
> > [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> > [  192.649034]  msm_job_run+0xb0/0x11c
> > [  192.649058]  drm_sched_main+0x170/0x434
> > [  192.649086]  kthread+0x134/0x300
> > [  192.649114]  ret_from_fork+0x10/0x20
> >
> > Fix by calling adreno_system_suspend() in the device_shutdown() path.
> >
> > Cc: Rob Clark <robdclark@chromium.org>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > Cc: Ross Zwisler <zwisler@kernel.org>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> > drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > index 24b489b6129a..f0cff62812c3 100644
> > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
> >    return 0;
> > }
> >
> > +static int adreno_system_suspend(struct device *dev);
> > static void adreno_shutdown(struct platform_device *pdev)
> > {
> > -    pm_runtime_force_suspend(&pdev->dev);
> > +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > +
>
> This local variable definition should go to patch 2/2. Will fix in v2.
>
> Thanks,
>
>  - Joel
>
>
> > +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));

I think maybe adreno_unbind() needs the same treatment?  Any path
where we yank out the power cord without ensuring the scheduler is
parked means we'd be racing with jobs in the scheduler queue.  Ie.
userspace could queue a job before it is frozen, but the drm/scheduler
kthread hasn't yet called the msm_job_run() callback (which does
various touching of the now powered off hw).  So I think we need to
ensure that the scheduler is parked in all paths that call
pm_runtime_force_suspend() (as that bypasses the runpm reference that
would otherwise unsure the hw is powered before msm_job_run pokes at
registers)

BR,
-R

> > }
> >
> > static const struct of_device_id dt_match[] = {
> > --
> > 2.38.1.493.g58b659f92b-goog
> >
  
Joel Fernandes Dec. 1, 2022, 8:08 p.m. UTC | #3
On Sat, Nov 12, 2022 at 6:44 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, Nov 11, 2022 at 1:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> >
> >
> > > On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> > >
> > > During kexec on ARM device, we notice that device_shutdown() only calls
> > > pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> > > kthread is still running and further, there maybe active submits.
> > >
> > > This causes all kinds of issues during a kexec reboot:
> > >
> > > Warning from shutdown path:
> > >
> > > [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> > > [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> > > [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> > > [  292.509905] sp : ffffffc014473bf0
> > > [...]
> > > [  292.510043] Call trace:
> > > [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> > > [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> > > [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> > > [  292.510081]  adreno_shutdown+0x1c/0x28
> > > [  292.510090]  platform_shutdown+0x2c/0x38
> > > [  292.510104]  device_shutdown+0x158/0x210
> > > [  292.510119]  kernel_restart_prepare+0x40/0x4c
> > >
> > > And here from GPU kthread, an SError OOPs:
> > >
> > > [  192.648789]  el1h_64_error+0x7c/0x80
> > > [  192.648812]  el1_interrupt+0x20/0x58
> > > [  192.648833]  el1h_64_irq_handler+0x18/0x24
> > > [  192.648854]  el1h_64_irq+0x7c/0x80
> > > [  192.648873]  local_daif_inherit+0x10/0x18
> > > [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> > > [  192.648921]  el1h_64_sync+0x7c/0x80
> > > [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > [  192.648968]  a6xx_hw_init+0x44/0xe38
> > > [  192.648991]  msm_gpu_hw_init+0x48/0x80
> > > [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> > > [  192.649034]  msm_job_run+0xb0/0x11c
> > > [  192.649058]  drm_sched_main+0x170/0x434
> > > [  192.649086]  kthread+0x134/0x300
> > > [  192.649114]  ret_from_fork+0x10/0x20
> > >
> > > Fix by calling adreno_system_suspend() in the device_shutdown() path.
> > >
> > > Cc: Rob Clark <robdclark@chromium.org>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > ---
> > > drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > index 24b489b6129a..f0cff62812c3 100644
> > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
> > >    return 0;
> > > }
> > >
> > > +static int adreno_system_suspend(struct device *dev);
> > > static void adreno_shutdown(struct platform_device *pdev)
> > > {
> > > -    pm_runtime_force_suspend(&pdev->dev);
> > > +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > > +
> >
> > This local variable definition should go to patch 2/2. Will fix in v2.
> >
> > Thanks,
> >
> >  - Joel
> >
> >
> > > +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
>
> I think maybe adreno_unbind() needs the same treatment?  Any path
> where we yank out the power cord without ensuring the scheduler is
> parked means we'd be racing with jobs in the scheduler queue.  Ie.
> userspace could queue a job before it is frozen, but the drm/scheduler
> kthread hasn't yet called the msm_job_run() callback (which does
> various touching of the now powered off hw).  So I think we need to
> ensure that the scheduler is parked in all paths that call
> pm_runtime_force_suspend() (as that bypasses the runpm reference that
> would otherwise unsure the hw is powered before msm_job_run pokes at
> registers)

a6xx_gmu_remove() calls pm_runtime_force_suspend() , would that need a
treatment too?

Though, adreno_system_suspend() is a static function in adreno_device.cc

Thanks.
  
Rob Clark Dec. 1, 2022, 10:06 p.m. UTC | #4
On Thu, Dec 1, 2022 at 12:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Sat, Nov 12, 2022 at 6:44 PM Rob Clark <robdclark@gmail.com> wrote:
> >
> > On Fri, Nov 11, 2022 at 1:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > >
> > >
> > > > On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> > > >
> > > > During kexec on ARM device, we notice that device_shutdown() only calls
> > > > pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> > > > kthread is still running and further, there maybe active submits.
> > > >
> > > > This causes all kinds of issues during a kexec reboot:
> > > >
> > > > Warning from shutdown path:
> > > >
> > > > [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> > > > [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > > [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> > > > [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> > > > [  292.509905] sp : ffffffc014473bf0
> > > > [...]
> > > > [  292.510043] Call trace:
> > > > [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> > > > [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> > > > [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> > > > [  292.510081]  adreno_shutdown+0x1c/0x28
> > > > [  292.510090]  platform_shutdown+0x2c/0x38
> > > > [  292.510104]  device_shutdown+0x158/0x210
> > > > [  292.510119]  kernel_restart_prepare+0x40/0x4c
> > > >
> > > > And here from GPU kthread, an SError OOPs:
> > > >
> > > > [  192.648789]  el1h_64_error+0x7c/0x80
> > > > [  192.648812]  el1_interrupt+0x20/0x58
> > > > [  192.648833]  el1h_64_irq_handler+0x18/0x24
> > > > [  192.648854]  el1h_64_irq+0x7c/0x80
> > > > [  192.648873]  local_daif_inherit+0x10/0x18
> > > > [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> > > > [  192.648921]  el1h_64_sync+0x7c/0x80
> > > > [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > > [  192.648968]  a6xx_hw_init+0x44/0xe38
> > > > [  192.648991]  msm_gpu_hw_init+0x48/0x80
> > > > [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> > > > [  192.649034]  msm_job_run+0xb0/0x11c
> > > > [  192.649058]  drm_sched_main+0x170/0x434
> > > > [  192.649086]  kthread+0x134/0x300
> > > > [  192.649114]  ret_from_fork+0x10/0x20
> > > >
> > > > Fix by calling adreno_system_suspend() in the device_shutdown() path.
> > > >
> > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > ---
> > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > index 24b489b6129a..f0cff62812c3 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
> > > >    return 0;
> > > > }
> > > >
> > > > +static int adreno_system_suspend(struct device *dev);
> > > > static void adreno_shutdown(struct platform_device *pdev)
> > > > {
> > > > -    pm_runtime_force_suspend(&pdev->dev);
> > > > +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > > > +
> > >
> > > This local variable definition should go to patch 2/2. Will fix in v2.
> > >
> > > Thanks,
> > >
> > >  - Joel
> > >
> > >
> > > > +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> >
> > I think maybe adreno_unbind() needs the same treatment?  Any path
> > where we yank out the power cord without ensuring the scheduler is
> > parked means we'd be racing with jobs in the scheduler queue.  Ie.
> > userspace could queue a job before it is frozen, but the drm/scheduler
> > kthread hasn't yet called the msm_job_run() callback (which does
> > various touching of the now powered off hw).  So I think we need to
> > ensure that the scheduler is parked in all paths that call
> > pm_runtime_force_suspend() (as that bypasses the runpm reference that
> > would otherwise unsure the hw is powered before msm_job_run pokes at
> > registers)
>
> a6xx_gmu_remove() calls pm_runtime_force_suspend() , would that need a
> treatment too?
>
> Though, adreno_system_suspend() is a static function in adreno_device.cc

Naw, you get there indirectly from adreno_unbind()

BR,
-R

> Thanks.
  
Joel Fernandes Dec. 1, 2022, 10:13 p.m. UTC | #5
On Thu, Dec 1, 2022 at 10:06 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Thu, Dec 1, 2022 at 12:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Sat, Nov 12, 2022 at 6:44 PM Rob Clark <robdclark@gmail.com> wrote:
> > >
> > > On Fri, Nov 11, 2022 at 1:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > >
> > > >
> > > > > On Nov 11, 2022, at 2:50 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> > > > >
> > > > > During kexec on ARM device, we notice that device_shutdown() only calls
> > > > > pm_runtime_force_suspend() while shutting down the GPU. This means the GPU
> > > > > kthread is still running and further, there maybe active submits.
> > > > >
> > > > > This causes all kinds of issues during a kexec reboot:
> > > > >
> > > > > Warning from shutdown path:
> > > > >
> > > > > [  292.509662] WARNING: CPU: 0 PID: 6304 at [...] adreno_runtime_suspend+0x3c/0x44
> > > > > [  292.509863] Hardware name: Google Lazor (rev3 - 8) with LTE (DT)
> > > > > [  292.509872] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > [  292.509881] pc : adreno_runtime_suspend+0x3c/0x44
> > > > > [  292.509891] lr : pm_generic_runtime_suspend+0x30/0x44
> > > > > [  292.509905] sp : ffffffc014473bf0
> > > > > [...]
> > > > > [  292.510043] Call trace:
> > > > > [  292.510051]  adreno_runtime_suspend+0x3c/0x44
> > > > > [  292.510061]  pm_generic_runtime_suspend+0x30/0x44
> > > > > [  292.510071]  pm_runtime_force_suspend+0x54/0xc8
> > > > > [  292.510081]  adreno_shutdown+0x1c/0x28
> > > > > [  292.510090]  platform_shutdown+0x2c/0x38
> > > > > [  292.510104]  device_shutdown+0x158/0x210
> > > > > [  292.510119]  kernel_restart_prepare+0x40/0x4c
> > > > >
> > > > > And here from GPU kthread, an SError OOPs:
> > > > >
> > > > > [  192.648789]  el1h_64_error+0x7c/0x80
> > > > > [  192.648812]  el1_interrupt+0x20/0x58
> > > > > [  192.648833]  el1h_64_irq_handler+0x18/0x24
> > > > > [  192.648854]  el1h_64_irq+0x7c/0x80
> > > > > [  192.648873]  local_daif_inherit+0x10/0x18
> > > > > [  192.648900]  el1h_64_sync_handler+0x48/0xb4
> > > > > [  192.648921]  el1h_64_sync+0x7c/0x80
> > > > > [  192.648941]  a6xx_gmu_set_oob+0xbc/0x1fc
> > > > > [  192.648968]  a6xx_hw_init+0x44/0xe38
> > > > > [  192.648991]  msm_gpu_hw_init+0x48/0x80
> > > > > [  192.649013]  msm_gpu_submit+0x5c/0x1a8
> > > > > [  192.649034]  msm_job_run+0xb0/0x11c
> > > > > [  192.649058]  drm_sched_main+0x170/0x434
> > > > > [  192.649086]  kthread+0x134/0x300
> > > > > [  192.649114]  ret_from_fork+0x10/0x20
> > > > >
> > > > > Fix by calling adreno_system_suspend() in the device_shutdown() path.
> > > > >
> > > > > Cc: Rob Clark <robdclark@chromium.org>
> > > > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > > > Cc: Ricardo Ribalda <ribalda@chromium.org>
> > > > > Cc: Ross Zwisler <zwisler@kernel.org>
> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > ---
> > > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 5 ++++-
> > > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > index 24b489b6129a..f0cff62812c3 100644
> > > > > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> > > > > @@ -607,9 +607,12 @@ static int adreno_remove(struct platform_device *pdev)
> > > > >    return 0;
> > > > > }
> > > > >
> > > > > +static int adreno_system_suspend(struct device *dev);
> > > > > static void adreno_shutdown(struct platform_device *pdev)
> > > > > {
> > > > > -    pm_runtime_force_suspend(&pdev->dev);
> > > > > +    struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
> > > > > +
> > > >
> > > > This local variable definition should go to patch 2/2. Will fix in v2.
> > > >
> > > > Thanks,
> > > >
> > > >  - Joel
> > > >
> > > >
> > > > > +    WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
> > >
> > > I think maybe adreno_unbind() needs the same treatment?  Any path
> > > where we yank out the power cord without ensuring the scheduler is
> > > parked means we'd be racing with jobs in the scheduler queue.  Ie.
> > > userspace could queue a job before it is frozen, but the drm/scheduler
> > > kthread hasn't yet called the msm_job_run() callback (which does
> > > various touching of the now powered off hw).  So I think we need to
> > > ensure that the scheduler is parked in all paths that call
> > > pm_runtime_force_suspend() (as that bypasses the runpm reference that
> > > would otherwise unsure the hw is powered before msm_job_run pokes at
> > > registers)
> >
> > a6xx_gmu_remove() calls pm_runtime_force_suspend() , would that need a
> > treatment too?
> >
> > Though, adreno_system_suspend() is a static function in adreno_device.cc
>
> Naw, you get there indirectly from adreno_unbind()

Ah gotcha, thanks.

 - Joel
  

Patch

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index 24b489b6129a..f0cff62812c3 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -607,9 +607,12 @@  static int adreno_remove(struct platform_device *pdev)
 	return 0;
 }
 
+static int adreno_system_suspend(struct device *dev);
 static void adreno_shutdown(struct platform_device *pdev)
 {
-	pm_runtime_force_suspend(&pdev->dev);
+	struct msm_gpu *gpu = dev_to_gpu(&pdev->dev);
+
+	WARN_ON_ONCE(adreno_system_suspend(&pdev->dev));
 }
 
 static const struct of_device_id dt_match[] = {