drm/msm: Initialize mode_config earlier

Message ID 20230113041051.4189063-1-quic_bjorande@quicinc.com
State New
Headers
Series drm/msm: Initialize mode_config earlier |

Commit Message

Bjorn Andersson Jan. 13, 2023, 4:10 a.m. UTC
  Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled
bridge_connector ends up in drm_bridge_connector_hpd_cb() calling
drm_kms_helper_hotplug_event(), which assumes that the associated
drm_device's mode_config.funcs is a valid pointer.

But in the MSM DisplayPort driver the HPD enablement happens at bind
time and mode_config.funcs is initialized late in msm_drm_init(). This
means that there's a window for hot plug events to dereference a NULL
mode_config.funcs.

Move the assignment of mode_config.funcs before the bind, to avoid this
scenario.

Reported-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>
---
 drivers/gpu/drm/msm/msm_drv.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
  

Comments

Dmitry Baryshkov Jan. 13, 2023, 4:23 a.m. UTC | #1
On 13/01/2023 06:10, Bjorn Andersson wrote:
> Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled
> bridge_connector ends up in drm_bridge_connector_hpd_cb() calling
> drm_kms_helper_hotplug_event(), which assumes that the associated
> drm_device's mode_config.funcs is a valid pointer.
> 
> But in the MSM DisplayPort driver the HPD enablement happens at bind
> time and mode_config.funcs is initialized late in msm_drm_init(). This
> means that there's a window for hot plug events to dereference a NULL
> mode_config.funcs.
> 
> Move the assignment of mode_config.funcs before the bind, to avoid this
> scenario.

Cam we make DP driver not to report HPD events until the enable_hpd() 
was called? I think this is what was fixed by your internal_hpd patchset.

> 
> Reported-by: Johan Hovold <johan@kernel.org>
> Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>
> ---
>   drivers/gpu/drm/msm/msm_drv.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> index ee2f60b6f09b..7ac670f3e6ab 100644
> --- a/drivers/gpu/drm/msm/msm_drv.c
> +++ b/drivers/gpu/drm/msm/msm_drv.c
> @@ -438,6 +438,9 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv)
>   
>   	drm_mode_config_init(ddev);
>   
> +	ddev->mode_config.funcs = &mode_config_funcs;
> +	ddev->mode_config.helper_private = &mode_config_helper_funcs;
> +
>   	ret = msm_init_vram(ddev);
>   	if (ret)
>   		return ret;
> @@ -479,9 +482,6 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv)
>   
>   	drm_helper_move_panel_connectors_to_head(ddev);
>   
> -	ddev->mode_config.funcs = &mode_config_funcs;
> -	ddev->mode_config.helper_private = &mode_config_helper_funcs;
> -
>   	for (i = 0; i < priv->num_crtcs; i++) {
>   		/* initialize event thread */
>   		priv->event_thread[i].crtc_id = priv->crtcs[i]->base.id;
  
Dmitry Baryshkov Jan. 13, 2023, 8:57 a.m. UTC | #2
On 13/01/2023 06:23, Dmitry Baryshkov wrote:
> On 13/01/2023 06:10, Bjorn Andersson wrote:
>> Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled
>> bridge_connector ends up in drm_bridge_connector_hpd_cb() calling
>> drm_kms_helper_hotplug_event(), which assumes that the associated
>> drm_device's mode_config.funcs is a valid pointer.
>>
>> But in the MSM DisplayPort driver the HPD enablement happens at bind
>> time and mode_config.funcs is initialized late in msm_drm_init(). This
>> means that there's a window for hot plug events to dereference a NULL
>> mode_config.funcs.
>>
>> Move the assignment of mode_config.funcs before the bind, to avoid this
>> scenario.
> 
> Cam we make DP driver not to report HPD events until the enable_hpd() 
> was called? I think this is what was fixed by your internal_hpd patchset.

Or to express this in another words: I thought that internal_hpd already 
deferred enabling hpd event reporting till the time when we need it, 
didn't it?
  
Bjorn Andersson Jan. 17, 2023, 2:51 a.m. UTC | #3
On Fri, Jan 13, 2023 at 10:57:18AM +0200, Dmitry Baryshkov wrote:
> On 13/01/2023 06:23, Dmitry Baryshkov wrote:
> > On 13/01/2023 06:10, Bjorn Andersson wrote:
> > > Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled
> > > bridge_connector ends up in drm_bridge_connector_hpd_cb() calling
> > > drm_kms_helper_hotplug_event(), which assumes that the associated
> > > drm_device's mode_config.funcs is a valid pointer.
> > > 
> > > But in the MSM DisplayPort driver the HPD enablement happens at bind
> > > time and mode_config.funcs is initialized late in msm_drm_init(). This
> > > means that there's a window for hot plug events to dereference a NULL
> > > mode_config.funcs.
> > > 
> > > Move the assignment of mode_config.funcs before the bind, to avoid this
> > > scenario.
> > 
> > Cam we make DP driver not to report HPD events until the enable_hpd()
> > was called? I think this is what was fixed by your internal_hpd
> > patchset.
> 
> Or to express this in another words: I thought that internal_hpd already
> deferred enabling hpd event reporting till the time when we need it, didn't
> it?
> 

I added a WARN_ON(1) in drm_bridge_hpd_enable() to get a sense of when
this window of "opportunity" opens up, and here's the callstack:

------------[ cut here ]------------
WARNING: CPU: 6 PID: 99 at drivers/gpu/drm/drm_bridge.c:1260 drm_bridge_hpd_enable+0x48/0x94 [drm]
...
Call trace:
 drm_bridge_hpd_enable+0x48/0x94 [drm]
 drm_bridge_connector_enable_hpd+0x30/0x3c [drm_kms_helper]
 drm_kms_helper_poll_enable+0xa4/0x114 [drm_kms_helper]
 drm_kms_helper_poll_init+0x6c/0x7c [drm_kms_helper]
 msm_drm_bind+0x370/0x628 [msm]
 try_to_bring_up_aggregate_device+0x170/0x1bc
 __component_add+0xb0/0x168
 component_add+0x20/0x2c
 dp_display_probe+0x40c/0x468 [msm]
 platform_probe+0xb4/0xdc
 really_probe+0x13c/0x300
 __driver_probe_device+0xc0/0xec
 driver_probe_device+0x48/0x204
 __device_attach_driver+0x124/0x14c
 bus_for_each_drv+0x90/0xdc
 __device_attach+0xdc/0x1a8
 device_initial_probe+0x20/0x2c
 bus_probe_device+0x40/0xa4
 deferred_probe_work_func+0x94/0xd0
 process_one_work+0x1a8/0x3c0
 worker_thread+0x254/0x47c
 kthread+0xf8/0x1b8
 ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---

As drm_kms_helper_poll_init() is the last thing being called in
msm_drm_init() shifting around the mode_config.func assignment would not
have any impact.

Perhaps we have shuffled other things around to avoid this bug?  Either
way, let's this on hold  until further proof that it's still
reproducible.

Sorry for the noise,
Bjorn
  
Johan Hovold Jan. 17, 2023, 8:04 a.m. UTC | #4
On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote:
> On Fri, Jan 13, 2023 at 10:57:18AM +0200, Dmitry Baryshkov wrote:
> > On 13/01/2023 06:23, Dmitry Baryshkov wrote:
> > > On 13/01/2023 06:10, Bjorn Andersson wrote:
> > > > Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled
> > > > bridge_connector ends up in drm_bridge_connector_hpd_cb() calling
> > > > drm_kms_helper_hotplug_event(), which assumes that the associated
> > > > drm_device's mode_config.funcs is a valid pointer.
> > > > 
> > > > But in the MSM DisplayPort driver the HPD enablement happens at bind
> > > > time and mode_config.funcs is initialized late in msm_drm_init(). This
> > > > means that there's a window for hot plug events to dereference a NULL
> > > > mode_config.funcs.
> > > > 
> > > > Move the assignment of mode_config.funcs before the bind, to avoid this
> > > > scenario.
> > > 
> > > Cam we make DP driver not to report HPD events until the enable_hpd()
> > > was called? I think this is what was fixed by your internal_hpd
> > > patchset.
> > 
> > Or to express this in another words: I thought that internal_hpd already
> > deferred enabling hpd event reporting till the time when we need it, didn't
> > it?
> > 
> 
> I added a WARN_ON(1) in drm_bridge_hpd_enable() to get a sense of when
> this window of "opportunity" opens up, and here's the callstack:
> 
> ------------[ cut here ]------------
> WARNING: CPU: 6 PID: 99 at drivers/gpu/drm/drm_bridge.c:1260 drm_bridge_hpd_enable+0x48/0x94 [drm]
> ...
> Call trace:
>  drm_bridge_hpd_enable+0x48/0x94 [drm]
>  drm_bridge_connector_enable_hpd+0x30/0x3c [drm_kms_helper]
>  drm_kms_helper_poll_enable+0xa4/0x114 [drm_kms_helper]
>  drm_kms_helper_poll_init+0x6c/0x7c [drm_kms_helper]
>  msm_drm_bind+0x370/0x628 [msm]
>  try_to_bring_up_aggregate_device+0x170/0x1bc
>  __component_add+0xb0/0x168
>  component_add+0x20/0x2c
>  dp_display_probe+0x40c/0x468 [msm]
>  platform_probe+0xb4/0xdc
>  really_probe+0x13c/0x300
>  __driver_probe_device+0xc0/0xec
>  driver_probe_device+0x48/0x204
>  __device_attach_driver+0x124/0x14c
>  bus_for_each_drv+0x90/0xdc
>  __device_attach+0xdc/0x1a8
>  device_initial_probe+0x20/0x2c
>  bus_probe_device+0x40/0xa4
>  deferred_probe_work_func+0x94/0xd0
>  process_one_work+0x1a8/0x3c0
>  worker_thread+0x254/0x47c
>  kthread+0xf8/0x1b8
>  ret_from_fork+0x10/0x20
> ---[ end trace 0000000000000000 ]---
> 
> As drm_kms_helper_poll_init() is the last thing being called in
> msm_drm_init() shifting around the mode_config.func assignment would not
> have any impact.
> 
> Perhaps we have shuffled other things around to avoid this bug?  Either
> way, let's this on hold  until further proof that it's still
> reproducible.

As I've mentioned off list, I haven't hit the apparent race I reported
here:

	https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/

since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it
could very well be that something has changes that fixes (or hides) the
issue since.

Johan
  
Johan Hovold Jan. 23, 2023, 4:01 p.m. UTC | #5
On Tue, Jan 17, 2023 at 09:04:39AM +0100, Johan Hovold wrote:
> On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote:
> > On Fri, Jan 13, 2023 at 10:57:18AM +0200, Dmitry Baryshkov wrote:
> > > On 13/01/2023 06:23, Dmitry Baryshkov wrote:
> > > > On 13/01/2023 06:10, Bjorn Andersson wrote:
> > > > > Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled
> > > > > bridge_connector ends up in drm_bridge_connector_hpd_cb() calling
> > > > > drm_kms_helper_hotplug_event(), which assumes that the associated
> > > > > drm_device's mode_config.funcs is a valid pointer.
> > > > > 
> > > > > But in the MSM DisplayPort driver the HPD enablement happens at bind
> > > > > time and mode_config.funcs is initialized late in msm_drm_init(). This
> > > > > means that there's a window for hot plug events to dereference a NULL
> > > > > mode_config.funcs.
> > > > > 
> > > > > Move the assignment of mode_config.funcs before the bind, to avoid this
> > > > > scenario.
> > > > 
> > > > Cam we make DP driver not to report HPD events until the enable_hpd()
> > > > was called? I think this is what was fixed by your internal_hpd
> > > > patchset.
> > > 
> > > Or to express this in another words: I thought that internal_hpd already
> > > deferred enabling hpd event reporting till the time when we need it, didn't
> > > it?
> > > 
> > 
> > I added a WARN_ON(1) in drm_bridge_hpd_enable() to get a sense of when
> > this window of "opportunity" opens up, and here's the callstack:
> > 
> > ------------[ cut here ]------------
> > WARNING: CPU: 6 PID: 99 at drivers/gpu/drm/drm_bridge.c:1260 drm_bridge_hpd_enable+0x48/0x94 [drm]
> > ...
> > Call trace:
> >  drm_bridge_hpd_enable+0x48/0x94 [drm]
> >  drm_bridge_connector_enable_hpd+0x30/0x3c [drm_kms_helper]
> >  drm_kms_helper_poll_enable+0xa4/0x114 [drm_kms_helper]
> >  drm_kms_helper_poll_init+0x6c/0x7c [drm_kms_helper]
> >  msm_drm_bind+0x370/0x628 [msm]
> >  try_to_bring_up_aggregate_device+0x170/0x1bc
> >  __component_add+0xb0/0x168
> >  component_add+0x20/0x2c
> >  dp_display_probe+0x40c/0x468 [msm]
> >  platform_probe+0xb4/0xdc
> >  really_probe+0x13c/0x300
> >  __driver_probe_device+0xc0/0xec
> >  driver_probe_device+0x48/0x204
> >  __device_attach_driver+0x124/0x14c
> >  bus_for_each_drv+0x90/0xdc
> >  __device_attach+0xdc/0x1a8
> >  device_initial_probe+0x20/0x2c
> >  bus_probe_device+0x40/0xa4
> >  deferred_probe_work_func+0x94/0xd0
> >  process_one_work+0x1a8/0x3c0
> >  worker_thread+0x254/0x47c
> >  kthread+0xf8/0x1b8
> >  ret_from_fork+0x10/0x20
> > ---[ end trace 0000000000000000 ]---
> > 
> > As drm_kms_helper_poll_init() is the last thing being called in
> > msm_drm_init() shifting around the mode_config.func assignment would not
> > have any impact.
> > 
> > Perhaps we have shuffled other things around to avoid this bug?  Either
> > way, let's this on hold  until further proof that it's still
> > reproducible.
> 
> As I've mentioned off list, I haven't hit the apparent race I reported
> here:
> 
> 	https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/
> 
> since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it
> could very well be that something has changes that fixes (or hides) the
> issue since.

For unrelated reasons, I tried enabling async probing, and apart from
apparently causing the panel driver to probe defer indefinitely, I also
again hit the WARN_ON() I had added to catch this:

[   13.593235] WARNING: CPU: 0 PID: 125 at drivers/gpu/drm/drm_probe_helper.c:664 drm_kms_helper_hotplug_event+0x48/0x7
0 [drm_kms_helper]
...
[   13.679429] CPU: 0 PID: 125 Comm: kworker/0:3 Not tainted 6.2.0-rc4 #110
[   13.687159] Hardware name: Qualcomm QRD, BIOS 6.0.220110.BOOT.MXF.1.1-00470-MAKENA-1 01/10/2022
[   13.696947] Workqueue: events pmic_glink_altmode_worker [pmic_glink_altmode]
[   13.705044] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   13.713054] pc : drm_kms_helper_hotplug_event+0x48/0x70 [drm_kms_helper]
[   13.720812] lr : drm_kms_helper_hotplug_event+0x18/0x70 [drm_kms_helper]
[   13.728557] sp : ffff800009e33c90
[   13.732779] x29: ffff800009e33c90 x28: ffffad90862eb000 x27: ffff62d2362ee305
[   13.740956] x26: ffffad90862f1ea0 x25: ffffad9086309b50 x24: 0000000000000000
[   13.749125] x23: 0000000000000003 x22: ffff62d0c5dad000 x21: 0000000000000002
[   13.757291] x20: ffff62d0c6d24000 x19: ffff62d0c5dad000 x18: 0000000000000038
[   13.765443] x17: 0000000000000004 x16: 000000000000d323 x15: 0000000000000004
[   13.773585] x14: ffffad9086594208 x13: ffffad90865f50e8 x12: 0000000000000000
[   13.781723] x11: 00000000000400d7 x10: 0000000000000008 x9 : 0000000000000002
[   13.789867] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000097d00
[   13.798013] x5 : ffff62d0c3395000 x4 : ffff62d2362ed750 x3 : 0000000000097e00
[   13.806161] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[   13.814304] Call trace:
[   13.817604]  drm_kms_helper_hotplug_event+0x48/0x70 [drm_kms_helper]
[   13.824959]  drm_bridge_connector_hpd_cb+0xa0/0xc0 [drm_kms_helper]
[   13.832227]  drm_bridge_hpd_notify+0x40/0x60 [drm]
[   13.837993]  pmic_glink_altmode_worker+0xc0/0x150 [pmic_glink_altmode]
[   13.845505]  process_one_work+0x288/0x6c0
[   13.850404]  worker_thread+0x74/0x450
[   13.854948]  kthread+0x118/0x120
[   13.859032]  ret_from_fork+0x10/0x20
[   13.863473] irq event stamp: 7440
[   13.867631] hardirqs last  enabled at (7439): [<ffffad9085b00450>] _raw_spin_unlock_irqrestore+0x80/0x90
[   13.878157] hardirqs last disabled at (7440): [<ffffad9085af33e4>] el1_dbg+0x24/0x90
[   13.886885] softirqs last  enabled at (7308): [<ffffad908514046c>] _stext+0x46c/0x5d8
[   13.895697] softirqs last disabled at (7303): [<ffffad90851467b0>] ____do_softirq+0x10/0x20
[   13.905038] ---[ end trace 0000000000000000 ]---

So the bug still appears to be there (and the MSM DRM driver is fragile
and broken, but we knew that).

Johan
  
Bjorn Andersson Jan. 23, 2023, 5:17 p.m. UTC | #6
On Mon, Jan 23, 2023 at 05:01:45PM +0100, Johan Hovold wrote:
> On Tue, Jan 17, 2023 at 09:04:39AM +0100, Johan Hovold wrote:
> > On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote:
> > > On Fri, Jan 13, 2023 at 10:57:18AM +0200, Dmitry Baryshkov wrote:
> > > > On 13/01/2023 06:23, Dmitry Baryshkov wrote:
> > > > > On 13/01/2023 06:10, Bjorn Andersson wrote:
> > > > > > Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled
> > > > > > bridge_connector ends up in drm_bridge_connector_hpd_cb() calling
> > > > > > drm_kms_helper_hotplug_event(), which assumes that the associated
> > > > > > drm_device's mode_config.funcs is a valid pointer.
> > > > > > 
> > > > > > But in the MSM DisplayPort driver the HPD enablement happens at bind
> > > > > > time and mode_config.funcs is initialized late in msm_drm_init(). This
> > > > > > means that there's a window for hot plug events to dereference a NULL
> > > > > > mode_config.funcs.
> > > > > > 
> > > > > > Move the assignment of mode_config.funcs before the bind, to avoid this
> > > > > > scenario.
> > > > > 
> > > > > Cam we make DP driver not to report HPD events until the enable_hpd()
> > > > > was called? I think this is what was fixed by your internal_hpd
> > > > > patchset.
> > > > 
> > > > Or to express this in another words: I thought that internal_hpd already
> > > > deferred enabling hpd event reporting till the time when we need it, didn't
> > > > it?
> > > > 
> > > 
> > > I added a WARN_ON(1) in drm_bridge_hpd_enable() to get a sense of when
> > > this window of "opportunity" opens up, and here's the callstack:
> > > 
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 6 PID: 99 at drivers/gpu/drm/drm_bridge.c:1260 drm_bridge_hpd_enable+0x48/0x94 [drm]
> > > ...
> > > Call trace:
> > >  drm_bridge_hpd_enable+0x48/0x94 [drm]
> > >  drm_bridge_connector_enable_hpd+0x30/0x3c [drm_kms_helper]
> > >  drm_kms_helper_poll_enable+0xa4/0x114 [drm_kms_helper]
> > >  drm_kms_helper_poll_init+0x6c/0x7c [drm_kms_helper]
> > >  msm_drm_bind+0x370/0x628 [msm]
> > >  try_to_bring_up_aggregate_device+0x170/0x1bc
> > >  __component_add+0xb0/0x168
> > >  component_add+0x20/0x2c
> > >  dp_display_probe+0x40c/0x468 [msm]
> > >  platform_probe+0xb4/0xdc
> > >  really_probe+0x13c/0x300
> > >  __driver_probe_device+0xc0/0xec
> > >  driver_probe_device+0x48/0x204
> > >  __device_attach_driver+0x124/0x14c
> > >  bus_for_each_drv+0x90/0xdc
> > >  __device_attach+0xdc/0x1a8
> > >  device_initial_probe+0x20/0x2c
> > >  bus_probe_device+0x40/0xa4
> > >  deferred_probe_work_func+0x94/0xd0
> > >  process_one_work+0x1a8/0x3c0
> > >  worker_thread+0x254/0x47c
> > >  kthread+0xf8/0x1b8
> > >  ret_from_fork+0x10/0x20
> > > ---[ end trace 0000000000000000 ]---
> > > 
> > > As drm_kms_helper_poll_init() is the last thing being called in
> > > msm_drm_init() shifting around the mode_config.func assignment would not
> > > have any impact.
> > > 
> > > Perhaps we have shuffled other things around to avoid this bug?  Either
> > > way, let's this on hold  until further proof that it's still
> > > reproducible.
> > 
> > As I've mentioned off list, I haven't hit the apparent race I reported
> > here:
> > 
> > 	https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/
> > 
> > since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it
> > could very well be that something has changes that fixes (or hides) the
> > issue since.
> 
> For unrelated reasons, I tried enabling async probing, and apart from
> apparently causing the panel driver to probe defer indefinitely, I also
> again hit the WARN_ON() I had added to catch this:
> 
> [   13.593235] WARNING: CPU: 0 PID: 125 at drivers/gpu/drm/drm_probe_helper.c:664 drm_kms_helper_hotplug_event+0x48/0x7
> 0 [drm_kms_helper]
> ...
> [   13.679429] CPU: 0 PID: 125 Comm: kworker/0:3 Not tainted 6.2.0-rc4 #110
> [   13.687159] Hardware name: Qualcomm QRD, BIOS 6.0.220110.BOOT.MXF.1.1-00470-MAKENA-1 01/10/2022
> [   13.696947] Workqueue: events pmic_glink_altmode_worker [pmic_glink_altmode]
> [   13.705044] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [   13.713054] pc : drm_kms_helper_hotplug_event+0x48/0x70 [drm_kms_helper]
> [   13.720812] lr : drm_kms_helper_hotplug_event+0x18/0x70 [drm_kms_helper]
> [   13.728557] sp : ffff800009e33c90
> [   13.732779] x29: ffff800009e33c90 x28: ffffad90862eb000 x27: ffff62d2362ee305
> [   13.740956] x26: ffffad90862f1ea0 x25: ffffad9086309b50 x24: 0000000000000000
> [   13.749125] x23: 0000000000000003 x22: ffff62d0c5dad000 x21: 0000000000000002
> [   13.757291] x20: ffff62d0c6d24000 x19: ffff62d0c5dad000 x18: 0000000000000038
> [   13.765443] x17: 0000000000000004 x16: 000000000000d323 x15: 0000000000000004
> [   13.773585] x14: ffffad9086594208 x13: ffffad90865f50e8 x12: 0000000000000000
> [   13.781723] x11: 00000000000400d7 x10: 0000000000000008 x9 : 0000000000000002
> [   13.789867] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000097d00
> [   13.798013] x5 : ffff62d0c3395000 x4 : ffff62d2362ed750 x3 : 0000000000097e00
> [   13.806161] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
> [   13.814304] Call trace:
> [   13.817604]  drm_kms_helper_hotplug_event+0x48/0x70 [drm_kms_helper]
> [   13.824959]  drm_bridge_connector_hpd_cb+0xa0/0xc0 [drm_kms_helper]
> [   13.832227]  drm_bridge_hpd_notify+0x40/0x60 [drm]
> [   13.837993]  pmic_glink_altmode_worker+0xc0/0x150 [pmic_glink_altmode]
> [   13.845505]  process_one_work+0x288/0x6c0
> [   13.850404]  worker_thread+0x74/0x450
> [   13.854948]  kthread+0x118/0x120
> [   13.859032]  ret_from_fork+0x10/0x20
> [   13.863473] irq event stamp: 7440
> [   13.867631] hardirqs last  enabled at (7439): [<ffffad9085b00450>] _raw_spin_unlock_irqrestore+0x80/0x90
> [   13.878157] hardirqs last disabled at (7440): [<ffffad9085af33e4>] el1_dbg+0x24/0x90
> [   13.886885] softirqs last  enabled at (7308): [<ffffad908514046c>] _stext+0x46c/0x5d8
> [   13.895697] softirqs last disabled at (7303): [<ffffad90851467b0>] ____do_softirq+0x10/0x20
> [   13.905038] ---[ end trace 0000000000000000 ]---
> 
> So the bug still appears to be there (and the MSM DRM driver is fragile
> and broken, but we knew that).
> 

But the ordering between mode_config.funcs = !NULL and
drm_kms_helper_poll_init() in msm_drm_init() seems pretty clear.

And my testing shows that drm_kms_helper_poll_init() is the cause for
getting bridge->hpd_cb != NULL.

So the ordering seems legit, unless there's something else causing the
assignment of bridge->hpd_cb to happen earlier in this scenario.

Regards,
Bjorn
  
Johan Hovold Jan. 24, 2023, 8:09 a.m. UTC | #7
On Mon, Jan 23, 2023 at 09:17:49AM -0800, Bjorn Andersson wrote:
> On Mon, Jan 23, 2023 at 05:01:45PM +0100, Johan Hovold wrote:
> > On Tue, Jan 17, 2023 at 09:04:39AM +0100, Johan Hovold wrote:
> > > On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote:

> > > > Perhaps we have shuffled other things around to avoid this bug?  Either
> > > > way, let's this on hold  until further proof that it's still
> > > > reproducible.
> > > 
> > > As I've mentioned off list, I haven't hit the apparent race I reported
> > > here:
> > > 
> > > 	https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/
> > > 
> > > since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it
> > > could very well be that something has changes that fixes (or hides) the
> > > issue since.
> > 
> > For unrelated reasons, I tried enabling async probing, and apart from
> > apparently causing the panel driver to probe defer indefinitely, I also
> > again hit the WARN_ON() I had added to catch this:
> > 
> > [   13.593235] WARNING: CPU: 0 PID: 125 at drivers/gpu/drm/drm_probe_helper.c:664 drm_kms_helper_hotplug_event+0x48/0x7
> > 0 [drm_kms_helper]

> > So the bug still appears to be there (and the MSM DRM driver is fragile
> > and broken, but we knew that).
> > 
> 
> But the ordering between mode_config.funcs = !NULL and
> drm_kms_helper_poll_init() in msm_drm_init() seems pretty clear.
> 
> And my testing shows that drm_kms_helper_poll_init() is the cause for
> getting bridge->hpd_cb != NULL.
> 
> So the ordering seems legit, unless there's something else causing the
> assignment of bridge->hpd_cb to happen earlier in this scenario.

I'm not saying that this patch is correct (indeed it doesn't seem to
be), but only that the bug I reported still appears to be present in
6.2.

Now that I actually looked at this again, I realise that the reason that
haven't seen it with 6.2 is more likely due to the fact that I'm now
making sure to load the panel driver before the drm driver to avoid that
unnecessary probe deferral.

With async probing, I get the probe deferral again, and boom, I hit the
same old NULL deref.

I see there's a call to drm_kms_helper_poll_fini() in msm_drm_uninit()
which should stop the polling, but perhaps there's still some corner
case due to the unexpected probe (or rather component bind) deferral
which we're hitting.

Johan
  
Johan Hovold Jan. 24, 2023, 8:24 a.m. UTC | #8
On Tue, Jan 24, 2023 at 09:09:02AM +0100, Johan Hovold wrote:
> On Mon, Jan 23, 2023 at 09:17:49AM -0800, Bjorn Andersson wrote:
> > On Mon, Jan 23, 2023 at 05:01:45PM +0100, Johan Hovold wrote:
> > > On Tue, Jan 17, 2023 at 09:04:39AM +0100, Johan Hovold wrote:
> > > > On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote:
> 
> > > > > Perhaps we have shuffled other things around to avoid this bug?  Either
> > > > > way, let's this on hold  until further proof that it's still
> > > > > reproducible.
> > > > 
> > > > As I've mentioned off list, I haven't hit the apparent race I reported
> > > > here:
> > > > 
> > > > 	https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/
> > > > 
> > > > since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it
> > > > could very well be that something has changes that fixes (or hides) the
> > > > issue since.
> > > 
> > > For unrelated reasons, I tried enabling async probing, and apart from
> > > apparently causing the panel driver to probe defer indefinitely, I also
> > > again hit the WARN_ON() I had added to catch this:
> > > 
> > > [   13.593235] WARNING: CPU: 0 PID: 125 at drivers/gpu/drm/drm_probe_helper.c:664 drm_kms_helper_hotplug_event+0x48/0x7
> > > 0 [drm_kms_helper]
> 
> > > So the bug still appears to be there (and the MSM DRM driver is fragile
> > > and broken, but we knew that).
> > > 
> > 
> > But the ordering between mode_config.funcs = !NULL and
> > drm_kms_helper_poll_init() in msm_drm_init() seems pretty clear.
> > 
> > And my testing shows that drm_kms_helper_poll_init() is the cause for
> > getting bridge->hpd_cb != NULL.
> > 
> > So the ordering seems legit, unless there's something else causing the
> > assignment of bridge->hpd_cb to happen earlier in this scenario.
> 
> I'm not saying that this patch is correct (indeed it doesn't seem to
> be), but only that the bug I reported still appears to be present in
> 6.2.
> 
> Now that I actually looked at this again, I realise that the reason that
> haven't seen it with 6.2 is more likely due to the fact that I'm now
> making sure to load the panel driver before the drm driver to avoid that
> unnecessary probe deferral.
> 
> With async probing, I get the probe deferral again, and boom, I hit the
> same old NULL deref.
> 
> I see there's a call to drm_kms_helper_poll_fini() in msm_drm_uninit()
> which should stop the polling, but perhaps there's still some corner
> case due to the unexpected probe (or rather component bind) deferral
> which we're hitting.

I guess the drm_kms_helper_poll_fini() bit is irrelevant here as the
call comes from the pmic_glink_altmode_worker() and
drm_bridge_hpd_notify().

Perhaps the pmic_glink altmode driver simply isn't notified that the
drm device is gone again due to the late "probe" deferral or similar?

Johan
  
Johan Hovold March 1, 2023, 1:58 p.m. UTC | #9
On Tue, Jan 24, 2023 at 09:09:02AM +0100, Johan Hovold wrote:
> On Mon, Jan 23, 2023 at 09:17:49AM -0800, Bjorn Andersson wrote:
> > On Mon, Jan 23, 2023 at 05:01:45PM +0100, Johan Hovold wrote:
> > > On Tue, Jan 17, 2023 at 09:04:39AM +0100, Johan Hovold wrote:
> > > > On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote:
> 
> > > > > Perhaps we have shuffled other things around to avoid this bug?  Either
> > > > > way, let's this on hold  until further proof that it's still
> > > > > reproducible.
> > > > 
> > > > As I've mentioned off list, I haven't hit the apparent race I reported
> > > > here:
> > > > 
> > > > 	https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/
> > > > 
> > > > since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it
> > > > could very well be that something has changes that fixes (or hides) the
> > > > issue since.
> > > 
> > > For unrelated reasons, I tried enabling async probing, and apart from
> > > apparently causing the panel driver to probe defer indefinitely, I also
> > > again hit the WARN_ON() I had added to catch this:
> > > 
> > > [   13.593235] WARNING: CPU: 0 PID: 125 at drivers/gpu/drm/drm_probe_helper.c:664 drm_kms_helper_hotplug_event+0x48/0x7
> > > 0 [drm_kms_helper]
> 
> > > So the bug still appears to be there (and the MSM DRM driver is fragile
> > > and broken, but we knew that).
> > > 
> > 
> > But the ordering between mode_config.funcs = !NULL and
> > drm_kms_helper_poll_init() in msm_drm_init() seems pretty clear.
> > 
> > And my testing shows that drm_kms_helper_poll_init() is the cause for
> > getting bridge->hpd_cb != NULL.
> > 
> > So the ordering seems legit, unless there's something else causing the
> > assignment of bridge->hpd_cb to happen earlier in this scenario.
> 
> I'm not saying that this patch is correct (indeed it doesn't seem to
> be), but only that the bug I reported still appears to be present in
> 6.2.

So after debugging this issue a third time, I can conclude that it is
still very much present in 6.2.

It appears you looked at the linux-next tree when you concluded that
this patch was not needed. In 6.2 the bridge->hpd_cb callback is set
before mode_config.funcs is initialised as part of
kms->funcs->hw_init(kms).

The hpd DRM changes heading into 6.3 do appear to avoid the NULL-pointer
dereference by moving the bridge->hpd_cb initialisation to
drm_kms_helper_poll_init() as you mention above.

The PMIC GLINK altmode driver still happily forwards notifications
regardless of the DRM driver state though, which can lead to missed
hotplug events. It seems you need to implement the
hpd_enable()/disable() callbacks and either cache or not enable events
in fw until the DRM driver is ready.

Johan
  
Bjorn Andersson March 2, 2023, 11:17 p.m. UTC | #10
On Wed, Mar 01, 2023 at 02:58:50PM +0100, Johan Hovold wrote:
> On Tue, Jan 24, 2023 at 09:09:02AM +0100, Johan Hovold wrote:
> > On Mon, Jan 23, 2023 at 09:17:49AM -0800, Bjorn Andersson wrote:
> > > On Mon, Jan 23, 2023 at 05:01:45PM +0100, Johan Hovold wrote:
> > > > On Tue, Jan 17, 2023 at 09:04:39AM +0100, Johan Hovold wrote:
> > > > > On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote:
> > 
> > > > > > Perhaps we have shuffled other things around to avoid this bug?  Either
> > > > > > way, let's this on hold  until further proof that it's still
> > > > > > reproducible.
> > > > > 
> > > > > As I've mentioned off list, I haven't hit the apparent race I reported
> > > > > here:
> > > > > 
> > > > > 	https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/
> > > > > 
> > > > > since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it
> > > > > could very well be that something has changes that fixes (or hides) the
> > > > > issue since.
> > > > 
> > > > For unrelated reasons, I tried enabling async probing, and apart from
> > > > apparently causing the panel driver to probe defer indefinitely, I also
> > > > again hit the WARN_ON() I had added to catch this:
> > > > 
> > > > [   13.593235] WARNING: CPU: 0 PID: 125 at drivers/gpu/drm/drm_probe_helper.c:664 drm_kms_helper_hotplug_event+0x48/0x7
> > > > 0 [drm_kms_helper]
> > 
> > > > So the bug still appears to be there (and the MSM DRM driver is fragile
> > > > and broken, but we knew that).
> > > > 
> > > 
> > > But the ordering between mode_config.funcs = !NULL and
> > > drm_kms_helper_poll_init() in msm_drm_init() seems pretty clear.
> > > 
> > > And my testing shows that drm_kms_helper_poll_init() is the cause for
> > > getting bridge->hpd_cb != NULL.
> > > 
> > > So the ordering seems legit, unless there's something else causing the
> > > assignment of bridge->hpd_cb to happen earlier in this scenario.
> > 
> > I'm not saying that this patch is correct (indeed it doesn't seem to
> > be), but only that the bug I reported still appears to be present in
> > 6.2.
> 
> So after debugging this issue a third time, I can conclude that it is
> still very much present in 6.2.
> 
> It appears you looked at the linux-next tree when you concluded that
> this patch was not needed. In 6.2 the bridge->hpd_cb callback is set
> before mode_config.funcs is initialised as part of
> kms->funcs->hw_init(kms).
> 
> The hpd DRM changes heading into 6.3 do appear to avoid the NULL-pointer
> dereference by moving the bridge->hpd_cb initialisation to
> drm_kms_helper_poll_init() as you mention above.
> 
> The PMIC GLINK altmode driver still happily forwards notifications
> regardless of the DRM driver state though, which can lead to missed
> hotplug events. It seems you need to implement the
> hpd_enable()/disable() callbacks and either cache or not enable events
> in fw until the DRM driver is ready.
> 

It's not clear to me what the expectation from the DRM framework is on
this point. We register a drm_bridge which is only capable of signaling
HPD events (DRM_BRIDGE_OP_HPD), not querying HPD state (DRM_BRIDGE_OP_DETECT).

Does this imply that any such bridge must ensure that hpd events are
re-delivered once hpd_enable() has been invoked (we can't invoke it from
hpd_enable...)?

Is it reasonable to do this retriggering in the altmode driver? Or is it
the job of the TCPM (it seems reasonable to not send the PAN_EN message
until we get hpd_enable()...)?

Regards,
Bjorn
  
Johan Hovold March 8, 2023, 9:58 a.m. UTC | #11
On Thu, Mar 02, 2023 at 03:17:04PM -0800, Bjorn Andersson wrote:
> On Wed, Mar 01, 2023 at 02:58:50PM +0100, Johan Hovold wrote:

> > So after debugging this issue a third time, I can conclude that it is
> > still very much present in 6.2.
> > 
> > It appears you looked at the linux-next tree when you concluded that
> > this patch was not needed. In 6.2 the bridge->hpd_cb callback is set
> > before mode_config.funcs is initialised as part of
> > kms->funcs->hw_init(kms).
> > 
> > The hpd DRM changes heading into 6.3 do appear to avoid the NULL-pointer
> > dereference by moving the bridge->hpd_cb initialisation to
> > drm_kms_helper_poll_init() as you mention above.

I can confirm that as expected my reproducer no longer triggers with
6.3-rc1.
 
> > The PMIC GLINK altmode driver still happily forwards notifications
> > regardless of the DRM driver state though, which can lead to missed
> > hotplug events. It seems you need to implement the
> > hpd_enable()/disable() callbacks and either cache or not enable events
> > in fw until the DRM driver is ready.
> > 
> 
> It's not clear to me what the expectation from the DRM framework is on
> this point. We register a drm_bridge which is only capable of signaling
> HPD events (DRM_BRIDGE_OP_HPD), not querying HPD state (DRM_BRIDGE_OP_DETECT).

I think the assumption is that any bridge that can generate hotplug
events also has a way of detecting whether it is connected (i.e.
DRM_BRIDGE_OP_HPD => DRM_BRIDGE_OP_DETECT).

The pmic_glink_altmode driver appears to be the only driver that sets
DRM_BRIDGE_OP_HPD but not DRM_BRIDGE_OP_DETECT.

> Does this imply that any such bridge must ensure that hpd events are
> re-delivered once hpd_enable() has been invoked (we can't invoke it from
> hpd_enable...)?
> 
> Is it reasonable to do this retriggering in the altmode driver? Or is it
> the job of the TCPM (it seems reasonable to not send the PAN_EN message
> until we get hpd_enable()...)?

Are you sure there is no way to query the firmware about the connected
state?

Otherwise, enabling the notification messages when hpd_enable() is
called looks like it should work as the fw currently appears to always
send a disconnected event followed by a connect event if connected.

But that's not going to be enough unless you can also disable events in
fw on hpd_disable() so that the state can again be updated on the next
hpd_enable().

If that's not possible, it seems you need to cache the state in the
driver and hope you get a notification after a suspend cycle if the
state has changed.

But in any case, the DRM documentation is pretty clear on that a bridge
driver should not be calling drm_bridge_hpd_notify() until hpd_enable()
is called (and also not after hpd_disable()) as the pmic_glink_altmode
driver currently do.

	hpd_enable

	Enable hot plug detection. From now on the bridge shall call
	drm_bridge_hpd_notify() each time a change is detected in the
	output connection status, until hot plug detection gets disabled
	with hpd_disable.

	This callback is optional and shall only be implemented by
	bridges that support hot-plug notification without polling.
	Bridges that implement it shall also implement the hpd_disable
	callback and set the DRM_BRIDGE_OP_HPD flag in their
	drm_bridge->ops.

	https://dri.freedesktop.org/docs/drm/gpu/drm-kms-helpers.html#c.drm_bridge_funcs

Johan
  

Patch

diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index ee2f60b6f09b..7ac670f3e6ab 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -438,6 +438,9 @@  static int msm_drm_init(struct device *dev, const struct drm_driver *drv)
 
 	drm_mode_config_init(ddev);
 
+	ddev->mode_config.funcs = &mode_config_funcs;
+	ddev->mode_config.helper_private = &mode_config_helper_funcs;
+
 	ret = msm_init_vram(ddev);
 	if (ret)
 		return ret;
@@ -479,9 +482,6 @@  static int msm_drm_init(struct device *dev, const struct drm_driver *drv)
 
 	drm_helper_move_panel_connectors_to_head(ddev);
 
-	ddev->mode_config.funcs = &mode_config_funcs;
-	ddev->mode_config.helper_private = &mode_config_helper_funcs;
-
 	for (i = 0; i < priv->num_crtcs; i++) {
 		/* initialize event thread */
 		priv->event_thread[i].crtc_id = priv->crtcs[i]->base.id;