media: chips-media: wave5: Call v4l2_m2m_job_finish() in job_abort()

Message ID 20240228-wave5-fix-abort-v1-1-7482b9316867@baylibre.com
State New
Headers
Series media: chips-media: wave5: Call v4l2_m2m_job_finish() in job_abort() |

Commit Message

Mattijs Korpershoek Feb. 28, 2024, 4:12 p.m. UTC
  When aborting a stream, it's possible that v4l2_m2m_cancel_job()
remains stuck:

[ 3498.490038][    T1] sysrq: Show Blocked State
[ 3498.511754][    T1] task:V4L2DecodeCompo state:D stack:12480 pid:2387  ppid:1      flags:0x04000809
[ 3498.521153][    T1] Call trace:
[ 3498.524333][    T1]  __switch_to+0x174/0x338
[ 3498.528611][    T1]  __schedule+0x5ec/0x9cc
[ 3498.532795][    T1]  schedule+0x84/0xf0
[ 3498.536630][    T1]  v4l2_m2m_cancel_job+0x118/0x194
[ 3498.541595][    T1]  v4l2_m2m_streamoff+0x34/0x13c
[ 3498.546384][    T1]  v4l2_m2m_ioctl_streamoff+0x20/0x30
[ 3498.551607][    T1]  v4l_streamoff+0x44/0x54
[ 3498.555877][    T1]  __video_do_ioctl+0x388/0x4dc
[ 3498.560580][    T1]  video_usercopy+0x618/0xa0c
[ 3498.565109][    T1]  video_ioctl2+0x20/0x30
[ 3498.569292][    T1]  v4l2_ioctl+0x74/0x8c
[ 3498.573300][    T1]  __arm64_sys_ioctl+0xb0/0xec
[ 3498.577918][    T1]  invoke_syscall+0x60/0x124
[ 3498.582368][    T1]  el0_svc_common+0x90/0xfc
[ 3498.586724][    T1]  do_el0_svc+0x34/0xb8
[ 3498.590733][    T1]  el0_svc+0x2c/0xa4
[ 3498.594480][    T1]  el0t_64_sync_handler+0x68/0xb4
[ 3498.599354][    T1]  el0t_64_sync+0x1a4/0x1a8
[ 3498.603832][    T1] sysrq: Kill All Tasks

According to job_abort() documentation from v4l2_m2m_ops:

  After the driver performs the necessary steps, it has to call
  v4l2_m2m_job_finish() or v4l2_m2m_buf_done_and_job_finish() as
  if the transaction ended normally.

This is not done in wave5_vpu_dec_job_abort(). Neither switch_state() nor
wave5_vpu_dec_set_eos_on_firmware() does this.

Add the missing call to fix the v4l2_m2m_cancel_job() hangs.

Fixes: 9707a6254a8a ("media: chips-media: wave5: Add the v4l2 layer")
Signed-off-by: Mattijs Korpershoek <mkorpershoek@baylibre.com>
---
This has been tested on the AM62Px SK EVM using Android 14.
See:
    https://www.ti.com/tool/PROCESSOR-SDK-AM62P

Note: while this is has not been tested on an upstream linux tree, I
believe the fix is still valid for mainline and I would like to get
some feedback on it.

Thank you in advance for your time.
---
 drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c | 3 +++
 1 file changed, 3 insertions(+)


---
base-commit: 8c64f4cdf4e6cc5682c52523713af8c39c94e6d5
change-id: 20240228-wave5-fix-abort-f72d25881cbd

Best regards,
  

Comments

Nicolas Dufresne March 1, 2024, 2:01 p.m. UTC | #1
Hi,

Le mercredi 28 février 2024 à 17:12 +0100, Mattijs Korpershoek a écrit :
> When aborting a stream, it's possible that v4l2_m2m_cancel_job()
> remains stuck:
> 
> [ 3498.490038][    T1] sysrq: Show Blocked State
> [ 3498.511754][    T1] task:V4L2DecodeCompo state:D stack:12480 pid:2387  ppid:1      flags:0x04000809
> [ 3498.521153][    T1] Call trace:
> [ 3498.524333][    T1]  __switch_to+0x174/0x338
> [ 3498.528611][    T1]  __schedule+0x5ec/0x9cc
> [ 3498.532795][    T1]  schedule+0x84/0xf0
> [ 3498.536630][    T1]  v4l2_m2m_cancel_job+0x118/0x194
> [ 3498.541595][    T1]  v4l2_m2m_streamoff+0x34/0x13c
> [ 3498.546384][    T1]  v4l2_m2m_ioctl_streamoff+0x20/0x30
> [ 3498.551607][    T1]  v4l_streamoff+0x44/0x54
> [ 3498.555877][    T1]  __video_do_ioctl+0x388/0x4dc
> [ 3498.560580][    T1]  video_usercopy+0x618/0xa0c
> [ 3498.565109][    T1]  video_ioctl2+0x20/0x30
> [ 3498.569292][    T1]  v4l2_ioctl+0x74/0x8c
> [ 3498.573300][    T1]  __arm64_sys_ioctl+0xb0/0xec
> [ 3498.577918][    T1]  invoke_syscall+0x60/0x124
> [ 3498.582368][    T1]  el0_svc_common+0x90/0xfc
> [ 3498.586724][    T1]  do_el0_svc+0x34/0xb8
> [ 3498.590733][    T1]  el0_svc+0x2c/0xa4
> [ 3498.594480][    T1]  el0t_64_sync_handler+0x68/0xb4
> [ 3498.599354][    T1]  el0t_64_sync+0x1a4/0x1a8
> [ 3498.603832][    T1] sysrq: Kill All Tasks
> 
> According to job_abort() documentation from v4l2_m2m_ops:
> 
>   After the driver performs the necessary steps, it has to call
>   v4l2_m2m_job_finish() or v4l2_m2m_buf_done_and_job_finish() as
>   if the transaction ended normally.
> 
> This is not done in wave5_vpu_dec_job_abort(). Neither switch_state() nor
> wave5_vpu_dec_set_eos_on_firmware() does this.

The doc said "the driver", not job_abort() specifically ...

> 
> Add the missing call to fix the v4l2_m2m_cancel_job() hangs.
> 
> Fixes: 9707a6254a8a ("media: chips-media: wave5: Add the v4l2 layer")
> Signed-off-by: Mattijs Korpershoek <mkorpershoek@baylibre.com>
> ---
> This has been tested on the AM62Px SK EVM using Android 14.
> See:
>     https://www.ti.com/tool/PROCESSOR-SDK-AM62P
> 
> Note: while this is has not been tested on an upstream linux tree, I
> believe the fix is still valid for mainline and I would like to get
> some feedback on it.
> 
> Thank you in advance for your time.
> ---
>  drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
> index ef227af72348..b03e3633a1bc 100644
> --- a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
> +++ b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
> @@ -1715,6 +1715,7 @@ static void wave5_vpu_dec_device_run(void *priv)
>  static void wave5_vpu_dec_job_abort(void *priv)
>  {
>  	struct vpu_instance *inst = priv;
> +	struct v4l2_m2m_ctx *m2m_ctx = inst->v4l2_fh.m2m_ctx;
>  	int ret;
>  
>  	ret = switch_state(inst, VPU_INST_STATE_STOP);
> @@ -1725,6 +1726,8 @@ static void wave5_vpu_dec_job_abort(void *priv)
>  	if (ret)
>  		dev_warn(inst->dev->dev,
>  			 "Setting EOS for the bitstream, fail: %d\n", ret);
> +
> +	v4l2_m2m_job_finish(inst->v4l2_m2m_dev, m2m_ctx);

Wave5 firmware have no function to cancel pending jobs. By finishing the job
without caring about the firmware state, wave5_vpu_dec_finish_decode() will be
called concurrently to another job. This change will effectively breaks seeking,
and will cause warning and possibly memory corruption.

In principle, setting the EOS flag should be enough to ensure that the drain
sequence is initiated, and that should allow finish_decoder() to be called,
which is the only clean way to get finish_job() to be called.

This driver implementation uses PIC_END operating mode of the IP, that ensure
that each submitted job is assumed to be complete, which means each RUN will
lead to a matching IRQ. We had a similar stall during development which was
fixed with a firmware update, are you sure you have the very latest firmware ?
Any chance you can use v4l2-tracer to share what your software have been doing ?

What you can though though, to fortify this a little, is to introduce a watchdog
here. You can trigger it from abort, and on timeout, you will have to fully
reset and reboot the chip (which is not very fast, you don't want to do this at
all time). When the reset have completed, you will have to carefully reset the
driver state before you can safely continue. You'll need to add thread safe
protection against spurious finish_decode() call, in case the watchdog timeout
raced with the firmware finally coming back to life.

regards,
Nicolas

p.s. you should perhaps trace the firmware job counters to try and understand
more about your specific hang.

>  }
>  
>  static int wave5_vpu_dec_job_ready(void *priv)
> 
> ---
> base-commit: 8c64f4cdf4e6cc5682c52523713af8c39c94e6d5
> change-id: 20240228-wave5-fix-abort-f72d25881cbd
> 
> Best regards,
  
Mattijs Korpershoek March 4, 2024, 4:22 p.m. UTC | #2
Hi Nicolas,

Thank you for your prompt reply.

On ven., mars 01, 2024 at 09:01, Nicolas Dufresne <nicolas.dufresne@collabora.com> wrote:

> Hi,
>
> Le mercredi 28 février 2024 à 17:12 +0100, Mattijs Korpershoek a écrit :
>> When aborting a stream, it's possible that v4l2_m2m_cancel_job()
>> remains stuck:
>> 
>> [ 3498.490038][    T1] sysrq: Show Blocked State
>> [ 3498.511754][    T1] task:V4L2DecodeCompo state:D stack:12480 pid:2387  ppid:1      flags:0x04000809
>> [ 3498.521153][    T1] Call trace:
>> [ 3498.524333][    T1]  __switch_to+0x174/0x338
>> [ 3498.528611][    T1]  __schedule+0x5ec/0x9cc
>> [ 3498.532795][    T1]  schedule+0x84/0xf0
>> [ 3498.536630][    T1]  v4l2_m2m_cancel_job+0x118/0x194
>> [ 3498.541595][    T1]  v4l2_m2m_streamoff+0x34/0x13c
>> [ 3498.546384][    T1]  v4l2_m2m_ioctl_streamoff+0x20/0x30
>> [ 3498.551607][    T1]  v4l_streamoff+0x44/0x54
>> [ 3498.555877][    T1]  __video_do_ioctl+0x388/0x4dc
>> [ 3498.560580][    T1]  video_usercopy+0x618/0xa0c
>> [ 3498.565109][    T1]  video_ioctl2+0x20/0x30
>> [ 3498.569292][    T1]  v4l2_ioctl+0x74/0x8c
>> [ 3498.573300][    T1]  __arm64_sys_ioctl+0xb0/0xec
>> [ 3498.577918][    T1]  invoke_syscall+0x60/0x124
>> [ 3498.582368][    T1]  el0_svc_common+0x90/0xfc
>> [ 3498.586724][    T1]  do_el0_svc+0x34/0xb8
>> [ 3498.590733][    T1]  el0_svc+0x2c/0xa4
>> [ 3498.594480][    T1]  el0t_64_sync_handler+0x68/0xb4
>> [ 3498.599354][    T1]  el0t_64_sync+0x1a4/0x1a8
>> [ 3498.603832][    T1] sysrq: Kill All Tasks
>> 
>> According to job_abort() documentation from v4l2_m2m_ops:
>> 
>>   After the driver performs the necessary steps, it has to call
>>   v4l2_m2m_job_finish() or v4l2_m2m_buf_done_and_job_finish() as
>>   if the transaction ended normally.
>> 
>> This is not done in wave5_vpu_dec_job_abort(). Neither switch_state() nor
>> wave5_vpu_dec_set_eos_on_firmware() does this.
>
> The doc said "the driver", not job_abort() specifically ...

Indeed. Seems I wanted to convince myself that this was job_abort()
specific. Sorry about that.

>
>> 
>> Add the missing call to fix the v4l2_m2m_cancel_job() hangs.
>> 
>> Fixes: 9707a6254a8a ("media: chips-media: wave5: Add the v4l2 layer")
>> Signed-off-by: Mattijs Korpershoek <mkorpershoek@baylibre.com>
>> ---
>> This has been tested on the AM62Px SK EVM using Android 14.
>> See:
>>     https://www.ti.com/tool/PROCESSOR-SDK-AM62P
>> 
>> Note: while this is has not been tested on an upstream linux tree, I
>> believe the fix is still valid for mainline and I would like to get
>> some feedback on it.
>> 
>> Thank you in advance for your time.
>> ---
>>  drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c | 3 +++
>>  1 file changed, 3 insertions(+)
>> 
>> diff --git a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
>> index ef227af72348..b03e3633a1bc 100644
>> --- a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
>> +++ b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
>> @@ -1715,6 +1715,7 @@ static void wave5_vpu_dec_device_run(void *priv)
>>  static void wave5_vpu_dec_job_abort(void *priv)
>>  {
>>  	struct vpu_instance *inst = priv;
>> +	struct v4l2_m2m_ctx *m2m_ctx = inst->v4l2_fh.m2m_ctx;
>>  	int ret;
>>  
>>  	ret = switch_state(inst, VPU_INST_STATE_STOP);
>> @@ -1725,6 +1726,8 @@ static void wave5_vpu_dec_job_abort(void *priv)
>>  	if (ret)
>>  		dev_warn(inst->dev->dev,
>>  			 "Setting EOS for the bitstream, fail: %d\n", ret);
>> +
>> +	v4l2_m2m_job_finish(inst->v4l2_m2m_dev, m2m_ctx);
>
> Wave5 firmware have no function to cancel pending jobs. By finishing the job
> without caring about the firmware state, wave5_vpu_dec_finish_decode() will be
> called concurrently to another job. This change will effectively breaks seeking,
> and will cause warning and possibly memory corruption.

Ah. That's not what I intended. This patch would completely break the
driver.
Thank you for explaining that to me.

>
> In principle, setting the EOS flag should be enough to ensure that the drain
> sequence is initiated, and that should allow finish_decoder() to be called,
> which is the only clean way to get finish_job() to be called.

>
> This driver implementation uses PIC_END operating mode of the IP, that ensure
> that each submitted job is assumed to be complete, which means each RUN will
> lead to a matching IRQ. We had a similar stall during development which was
> fixed with a firmware update, are you sure you have the very latest firmware ?

Interesting.

I double checked the firmware from linux-firmware:

$ ~/work/upstream/linux-firmware/ main md5sum cnm/wave521c_k3_codec_fw.bin
02c5faa5405559bd59a7361a32c2695a  cnm/wave521c_k3_codec_fw.bin

$ ~/ adb shell md5sum /vendor/firmware/cnm/wave521c_k3_codec_fw.bin
02c5faa5405559bd59a7361a32c2695a  /vendor/firmware/cnm/wave521c_k3_codec_fwbin

Which should be: "FW version : 1.0.3"

> Any chance you can use v4l2-tracer to share what your software have been doing ?

I did not know v4l2-tracer.
I tried to run it on android but it seems it segfaults when overriding
mmap().
I will continue to try to get some traces.

>
> What you can though though, to fortify this a little, is to introduce a watchdog
> here. You can trigger it from abort, and on timeout, you will have to fully
> reset and reboot the chip (which is not very fast, you don't want to do this at
> all time). When the reset have completed, you will have to carefully reset the
> driver state before you can safely continue. You'll need to add thread safe
> protection against spurious finish_decode() call, in case the watchdog timeout
> raced with the firmware finally coming back to life.

Ok, I can look into that as well.

Given that i'm not super familiar with multimedia, this has helped me a
lot.

Thanks!

>
> regards,
> Nicolas
>
> p.s. you should perhaps trace the firmware job counters to try and understand
> more about your specific hang.

I will look into it.

>
>>  }
>>  
>>  static int wave5_vpu_dec_job_ready(void *priv)
>> 
>> ---
>> base-commit: 8c64f4cdf4e6cc5682c52523713af8c39c94e6d5
>> change-id: 20240228-wave5-fix-abort-f72d25881cbd
>> 
>> Best regards,
  

Patch

diff --git a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
index ef227af72348..b03e3633a1bc 100644
--- a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
+++ b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
@@ -1715,6 +1715,7 @@  static void wave5_vpu_dec_device_run(void *priv)
 static void wave5_vpu_dec_job_abort(void *priv)
 {
 	struct vpu_instance *inst = priv;
+	struct v4l2_m2m_ctx *m2m_ctx = inst->v4l2_fh.m2m_ctx;
 	int ret;
 
 	ret = switch_state(inst, VPU_INST_STATE_STOP);
@@ -1725,6 +1726,8 @@  static void wave5_vpu_dec_job_abort(void *priv)
 	if (ret)
 		dev_warn(inst->dev->dev,
 			 "Setting EOS for the bitstream, fail: %d\n", ret);
+
+	v4l2_m2m_job_finish(inst->v4l2_m2m_dev, m2m_ctx);
 }
 
 static int wave5_vpu_dec_job_ready(void *priv)