[v4] net: ravb: Fix possible UAF bug in ravb_remove

Message ID 20230725030026.1664873-1-zyytlz.wz@163.com
State New
Headers
Series [v4] net: ravb: Fix possible UAF bug in ravb_remove |

Commit Message

Zheng Wang July 25, 2023, 3 a.m. UTC
  In ravb_probe, priv->work was bound with ravb_tx_timeout_work.
If timeout occurs, it will start the work. And if we call
ravb_remove without finishing the work, there may be a
use-after-free bug on ndev.

Fix it by finishing the job before cleanup in ravb_remove.

Note that this bug is found by static analysis, it might be
false positive.

Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
---
v4:
- add information about the bug was found suggested by Yunsheng Lin
v3:
- fix typo in commit message
v2:
- stop dev_watchdog so that handle no more timeout work suggested by Yunsheng Lin,
add an empty line to make code clear suggested by Sergey Shtylyov
---
 drivers/net/ethernet/renesas/ravb_main.c | 3 +++
 1 file changed, 3 insertions(+)
  

Comments

Jakub Kicinski July 26, 2023, 3:19 a.m. UTC | #1
On Tue, 25 Jul 2023 11:00:26 +0800 Zheng Wang wrote:
> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> index 4d6b3b7d6abb..ce2da5101e51 100644
> --- a/drivers/net/ethernet/renesas/ravb_main.c
> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev)
>  	struct ravb_private *priv = netdev_priv(ndev);
>  	const struct ravb_hw_info *info = priv->info;
>  
> +	netif_carrier_off(ndev);
> +	netif_tx_disable(ndev);
> +	cancel_work_sync(&priv->work);

Still racy, the carrier can come back up after canceling the work.
But whatever, this is a non-issue in the first place.
The fact that ravb_tx_timeout_work doesn't take any locks seems much
more suspicious.
  
Paolo Abeni July 27, 2023, 8:21 a.m. UTC | #2
On Tue, 2023-07-25 at 20:19 -0700, Jakub Kicinski wrote:
> On Tue, 25 Jul 2023 11:00:26 +0800 Zheng Wang wrote:
> > diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> > index 4d6b3b7d6abb..ce2da5101e51 100644
> > --- a/drivers/net/ethernet/renesas/ravb_main.c
> > +++ b/drivers/net/ethernet/renesas/ravb_main.c
> > @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev)
> >  	struct ravb_private *priv = netdev_priv(ndev);
> >  	const struct ravb_hw_info *info = priv->info;
> >  
> > +	netif_carrier_off(ndev);
> > +	netif_tx_disable(ndev);
> > +	cancel_work_sync(&priv->work);
> 
> Still racy, the carrier can come back up after canceling the work.

I must admit I don't see how/when this driver sets the carrier on ?!?

> But whatever, this is a non-issue in the first place.

Do you mean the UaF can't happen? I think that is real. 

> The fact that ravb_tx_timeout_work doesn't take any locks seems much
> more suspicious.

Indeed! But that should be a different patch, right?

Waiting a little more for feedback from renesas.

/P
  
Sergey Shtylyov July 27, 2023, 6:48 p.m. UTC | #3
Hello!

On 7/27/23 11:21 AM, Paolo Abeni wrote:
[...]
>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>>> index 4d6b3b7d6abb..ce2da5101e51 100644
>>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
>>> @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev)
>>>  	struct ravb_private *priv = netdev_priv(ndev);
>>>  	const struct ravb_hw_info *info = priv->info;
>>>  
>>> +	netif_carrier_off(ndev);
>>> +	netif_tx_disable(ndev);
>>> +	cancel_work_sync(&priv->work);
>>
>> Still racy, the carrier can come back up after canceling the work.
> 
> I must admit I don't see how/when this driver sets the carrier on ?!?

   The phylib code does it for this MAC driver, see the call tree of
phy_link_change(), on e.g. https://elixir.bootlin.com/linux/v6.5-rc3/source/...

>> But whatever, this is a non-issue in the first place.
> 
> Do you mean the UaF can't happen? I think that is real. 

   Looks possible to me, at least now... and anyway, shouldn't we clean up
after ourselves if we call schedule_work()?However my current impression is
that cancel_work_sync() should be called from ravb_close(), after calling
phy_{stop|disconnect}()...

>> The fact that ravb_tx_timeout_work doesn't take any locks seems much
>> more suspicious.
> 
> Indeed! But that should be a different patch, right?

   Yes.

> Waiting a little more for feedback from renesas.

   Renesas historically hasn't shown much interest to reviewing the sh_eth/ravb
driver patches, so I took that task upon myself. I also happen to be a nominal
author of this driver... :-)

> /P

MBR, Sergey
  
Jakub Kicinski July 27, 2023, 11:48 p.m. UTC | #4
On Thu, 27 Jul 2023 21:48:41 +0300 Sergey Shtylyov wrote:
> >> Still racy, the carrier can come back up after canceling the work.  
> > 
> > I must admit I don't see how/when this driver sets the carrier on ?!?  
> 
>    The phylib code does it for this MAC driver, see the call tree of
> phy_link_change(), on e.g. https://elixir.bootlin.com/linux/v6.5-rc3/source/...
> 
> >> But whatever, this is a non-issue in the first place.  
> > 
> > Do you mean the UaF can't happen? I think that is real.   
> 
>    Looks possible to me, at least now... and anyway, shouldn't we clean up
> after ourselves if we call schedule_work()?However my current impression is
> that cancel_work_sync() should be called from ravb_close(), after calling
> phy_{stop|disconnect}()...
>
> >> The fact that ravb_tx_timeout_work doesn't take any locks seems much
> >> more suspicious.  
> > 
> > Indeed! But that should be a different patch, right?  
> 
>    Yes.
> 
> > Waiting a little more for feedback from renesas.  
> 
>    Renesas historically hasn't shown much interest to reviewing the sh_eth/ravb
> driver patches, so I took that task upon myself. I also happen to be a nominal
> author of this driver... :-)

Simplest fix I can think of is to take a reference on the netdev before
scheduling the work, and then check if it's still registered in the work
itself. Wrap the timeout work in rtnl_lock() to avoid any races there.
  
Simon Horman July 28, 2023, 10:32 a.m. UTC | #5
On Thu, Jul 27, 2023 at 09:48:41PM +0300, Sergey Shtylyov wrote:
> Hello!
> 
> On 7/27/23 11:21 AM, Paolo Abeni wrote:
> [...]
> >>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> >>> index 4d6b3b7d6abb..ce2da5101e51 100644
> >>> --- a/drivers/net/ethernet/renesas/ravb_main.c
> >>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> >>> @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev)
> >>>  	struct ravb_private *priv = netdev_priv(ndev);
> >>>  	const struct ravb_hw_info *info = priv->info;
> >>>  
> >>> +	netif_carrier_off(ndev);
> >>> +	netif_tx_disable(ndev);
> >>> +	cancel_work_sync(&priv->work);
> >>
> >> Still racy, the carrier can come back up after canceling the work.
> > 
> > I must admit I don't see how/when this driver sets the carrier on ?!?
> 
>    The phylib code does it for this MAC driver, see the call tree of
> phy_link_change(), on e.g. https://elixir.bootlin.com/linux/v6.5-rc3/source/...
> 
> >> But whatever, this is a non-issue in the first place.
> > 
> > Do you mean the UaF can't happen? I think that is real. 
> 
>    Looks possible to me, at least now... and anyway, shouldn't we clean up
> after ourselves if we call schedule_work()?However my current impression is
> that cancel_work_sync() should be called from ravb_close(), after calling
> phy_{stop|disconnect}()...
> 
> >> The fact that ravb_tx_timeout_work doesn't take any locks seems much
> >> more suspicious.
> > 
> > Indeed! But that should be a different patch, right?
> 
>    Yes.
> 
> > Waiting a little more for feedback from renesas.
> 
>    Renesas historically hasn't shown much interest to reviewing the sh_eth/ravb
> driver patches, so I took that task upon myself. I also happen to be a nominal
> author of this driver... :-)

FWIIW, that matches my recollection.
Although it may be out of date by now.
  
Lee Jones Aug. 15, 2023, 10:08 a.m. UTC | #6
On Tue, 25 Jul 2023, Zheng Wang wrote:

> In ravb_probe, priv->work was bound with ravb_tx_timeout_work.
> If timeout occurs, it will start the work. And if we call
> ravb_remove without finishing the work, there may be a
> use-after-free bug on ndev.
> 
> Fix it by finishing the job before cleanup in ravb_remove.
> 
> Note that this bug is found by static analysis, it might be
> false positive.
> 
> Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
> Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
> ---
> v4:
> - add information about the bug was found suggested by Yunsheng Lin
> v3:
> - fix typo in commit message
> v2:
> - stop dev_watchdog so that handle no more timeout work suggested by Yunsheng Lin,
> add an empty line to make code clear suggested by Sergey Shtylyov
> ---
>  drivers/net/ethernet/renesas/ravb_main.c | 3 +++
>  1 file changed, 3 insertions(+)

Trying my best not to sound like a broken record, but ...

What's the latest with this fix?  Is a v5 en route?
  
Yoshihiro Shimoda Sept. 20, 2023, 2:37 a.m. UTC | #7
Hello Sergey!

> From: Sergey Shtylyov, Sent: Friday, July 28, 2023 3:49 AM
> 
> Hello!
> 
> On 7/27/23 11:21 AM, Paolo Abeni wrote:
> [...]
> >>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> >>> index 4d6b3b7d6abb..ce2da5101e51 100644
> >>> --- a/drivers/net/ethernet/renesas/ravb_main.c
> >>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> >>> @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev)
> >>>  	struct ravb_private *priv = netdev_priv(ndev);
> >>>  	const struct ravb_hw_info *info = priv->info;
> >>>
> >>> +	netif_carrier_off(ndev);
> >>> +	netif_tx_disable(ndev);
> >>> +	cancel_work_sync(&priv->work);
> >>
> >> Still racy, the carrier can come back up after canceling the work.
> >
> > I must admit I don't see how/when this driver sets the carrier on ?!?
> 
>    The phylib code does it for this MAC driver, see the call tree of
> phy_link_change(), on e.g.
> https://elixir.bootlin.com/linux/v6.5-rc3/source
> 
> >> But whatever, this is a non-issue in the first place.
> >
> > Do you mean the UaF can't happen? I think that is real.
> 
>    Looks possible to me, at least now... and anyway, shouldn't we clean up
> after ourselves if we call schedule_work()?However my current impression is
> that cancel_work_sync() should be called from ravb_close(), after calling
> phy_{stop|disconnect}()...

I also think so.

ravb_remove() calls unregister_netdev().
 -> unregister_netdev() calls rtnl_lock() and unregister_netdevice().
 --> unregiter_netdevice_queue()
 ---> unregiter_netdevice_many()
 ----> unregiter_netdevice_many_notify().
 -----> dev_close_many()
 ------> __dev_close_many()
 -------> ops->ndo_stop()

ravb_close() calls phy_stop()
 -> phy_state_machine() with PHY_HALTED
 --> phy_link_down()
 ---> phy_link_change()
 ----> netif_carrier_off()

The patch will be the following:
---
 drivers/net/ethernet/renesas/ravb_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index 7df9f9f8e134..e452d90de7c2 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -2167,6 +2167,8 @@ static int ravb_close(struct net_device *ndev)
 			of_phy_deregister_fixed_link(np);
 	}
 
+	cancel_work_sync(&priv->work);
+
 	if (info->multi_irqs) {
 		free_irq(priv->tx_irqs[RAVB_NC], ndev);
 		free_irq(priv->rx_irqs[RAVB_NC], ndev);
---

If this patch is acceptable, I'll submit it. But, what do you think?

Best regards,
Yoshihiro Shimoda

> >> The fact that ravb_tx_timeout_work doesn't take any locks seems much
> >> more suspicious.
> >
> > Indeed! But that should be a different patch, right?
> 
>    Yes.
> 
> > Waiting a little more for feedback from renesas.
> 
>    Renesas historically hasn't shown much interest to reviewing the sh_eth/ravb
> driver patches, so I took that task upon myself. I also happen to be a nominal
> author of this driver... :-)
> 
> > /P
> 
> MBR, Sergey
  
Sergey Shtylyov Sept. 29, 2023, 8:23 p.m. UTC | #8
Hello!

On 9/20/23 5:37 AM, Yoshihiro Shimoda wrote:

   Sorry, I got ill that same day and still have subfebrile temperature,
and I forgot about your mail. I'll try replying to it on this weekend...

[...]

MBR, Sergey
  
Yoshihiro Shimoda Oct. 2, 2023, 12:11 a.m. UTC | #9
Hello Sergey,

> From: Sergey Shtylyov, Sent: Saturday, September 30, 2023 5:23 AM
> 
> Hello!
> 
> On 9/20/23 5:37 AM, Yoshihiro Shimoda wrote:
> 
>    Sorry, I got ill that same day and still have subfebrile temperature,
> and I forgot about your mail. I'll try replying to it on this weekend...

Thank you for your reply! I understood it. Please take care of yourself.
I hope you will get well soon.

Best regards,
Yoshihiro Shimoda

> [...]
> 
> MBR, Sergey
  
Sergey Shtylyov Oct. 3, 2023, 7:39 p.m. UTC | #10
Hello!

   Concerning the subject: I doubt that UAF acronym is known to
everybody (e.g. it wasn't known to me), I think we should be able
to afford spelling out use-after-free there...

On 9/20/23 5:37 AM, Yoshihiro Shimoda wrote:
[...]

>>>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>>>>> index 4d6b3b7d6abb..ce2da5101e51 100644
>>>>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>>>>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
>>>>> @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev)
>>>>>  	struct ravb_private *priv = netdev_priv(ndev);
>>>>>  	const struct ravb_hw_info *info = priv->info;
>>>>>
>>>>> +	netif_carrier_off(ndev);
>>>>> +	netif_tx_disable(ndev);
>>>>> +	cancel_work_sync(&priv->work);
>>>>
>>>> Still racy, the carrier can come back up after canceling the work.
>>>
>>> I must admit I don't see how/when this driver sets the carrier on ?!?
>>
>>    The phylib code does it for this MAC driver, see the call tree of
>> phy_link_change(), on e.g.
>> https://elixir.bootlin.com/linux/v6.5-rc3/source
>>
>>>> But whatever, this is a non-issue in the first place.
>>>
>>> Do you mean the UaF can't happen? I think that is real.
>>
>>    Looks possible to me, at least now... and anyway, shouldn't we clean up
>> after ourselves if we call schedule_work()?However my current impression is
>> that cancel_work_sync() should be called from ravb_close(), after calling
>> phy_{stop|disconnect}()...
> 
> I also think so.
> 
> ravb_remove() calls unregister_netdev().
>  -> unregister_netdev() calls rtnl_lock() and unregister_netdevice().
>  --> unregiter_netdevice_queue()
>  ---> unregiter_netdevice_many()
>  ----> unregiter_netdevice_many_notify().
>  -----> dev_close_many()
>  ------> __dev_close_many()
>  -------> ops->ndo_stop()
> 
> ravb_close() calls phy_stop()
>  -> phy_state_machine() with PHY_HALTED
>  --> phy_link_down()
>  ---> phy_link_change()
>  ----> netif_carrier_off()

   Thanks for sharing the call chain, I've followed it once again... :-)

> The patch will be the following:
> ---
>  drivers/net/ethernet/renesas/ravb_main.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> index 7df9f9f8e134..e452d90de7c2 100644
> --- a/drivers/net/ethernet/renesas/ravb_main.c
> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> @@ -2167,6 +2167,8 @@ static int ravb_close(struct net_device *ndev)
>  			of_phy_deregister_fixed_link(np);
>  	}
>  
> +	cancel_work_sync(&priv->work);
> +
>  	if (info->multi_irqs) {
>  		free_irq(priv->tx_irqs[RAVB_NC], ndev);
>  		free_irq(priv->rx_irqs[RAVB_NC], ndev);
> ---
> 
> If this patch is acceptable, I'll submit it. But, what do you think?

   I think it should do the job. And I suspect you can even test it... :-)

> Best regards,
> Yoshihiro Shimoda

[...]

MBR, Sergey
  
Yoshihiro Shimoda Oct. 4, 2023, 5:05 a.m. UTC | #11
Hello Sergey,

> From: Sergey Shtylyov, Sent: Wednesday, October 4, 2023 4:39 AM
> 
> Hello!
> 
>    Concerning the subject: I doubt that UAF acronym is known to
> everybody (e.g. it wasn't known to me), I think we should be able
> to afford spelling out use-after-free there...

I got it. I'll change the subject.

> On 9/20/23 5:37 AM, Yoshihiro Shimoda wrote:
> [...]
> 
> >>>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> >>>>> index 4d6b3b7d6abb..ce2da5101e51 100644
> >>>>> --- a/drivers/net/ethernet/renesas/ravb_main.c
> >>>>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> >>>>> @@ -2885,6 +2885,9 @@ static int ravb_remove(struct platform_device *pdev)
> >>>>>  	struct ravb_private *priv = netdev_priv(ndev);
> >>>>>  	const struct ravb_hw_info *info = priv->info;
> >>>>>
> >>>>> +	netif_carrier_off(ndev);
> >>>>> +	netif_tx_disable(ndev);
> >>>>> +	cancel_work_sync(&priv->work);
> >>>>
> >>>> Still racy, the carrier can come back up after canceling the work.
> >>>
> >>> I must admit I don't see how/when this driver sets the carrier on ?!?
> >>
> >>    The phylib code does it for this MAC driver, see the call tree of
> >> phy_link_change(), on e.g.
> >>
<snip URL>
> >>
> >>>> But whatever, this is a non-issue in the first place.
> >>>
> >>> Do you mean the UaF can't happen? I think that is real.
> >>
> >>    Looks possible to me, at least now... and anyway, shouldn't we clean up
> >> after ourselves if we call schedule_work()?However my current impression is
> >> that cancel_work_sync() should be called from ravb_close(), after calling
> >> phy_{stop|disconnect}()...
> >
> > I also think so.
> >
> > ravb_remove() calls unregister_netdev().
> >  -> unregister_netdev() calls rtnl_lock() and unregister_netdevice().
> >  --> unregiter_netdevice_queue()
> >  ---> unregiter_netdevice_many()
> >  ----> unregiter_netdevice_many_notify().
> >  -----> dev_close_many()
> >  ------> __dev_close_many()
> >  -------> ops->ndo_stop()
> >
> > ravb_close() calls phy_stop()
> >  -> phy_state_machine() with PHY_HALTED
> >  --> phy_link_down()
> >  ---> phy_link_change()
> >  ----> netif_carrier_off()
> 
>    Thanks for sharing the call chain, I've followed it once again... :-)

Thank you :)

> > The patch will be the following:
> > ---
> >  drivers/net/ethernet/renesas/ravb_main.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
> > index 7df9f9f8e134..e452d90de7c2 100644
> > --- a/drivers/net/ethernet/renesas/ravb_main.c
> > +++ b/drivers/net/ethernet/renesas/ravb_main.c
> > @@ -2167,6 +2167,8 @@ static int ravb_close(struct net_device *ndev)
> >  			of_phy_deregister_fixed_link(np);
> >  	}
> >
> > +	cancel_work_sync(&priv->work);
> > +
> >  	if (info->multi_irqs) {
> >  		free_irq(priv->tx_irqs[RAVB_NC], ndev);
> >  		free_irq(priv->rx_irqs[RAVB_NC], ndev);
> > ---
> >
> > If this patch is acceptable, I'll submit it. But, what do you think?
> 
>    I think it should do the job.

Thank you for your comment! I'll make such a patch.

>  And I suspect you can even test it... :-)

IIUC, causing tx timeout is difficult. But, I guess
we can add a fault injection code somehow.

Best regards,
Yoshihiro Shimoda

> > Best regards,
> > Yoshihiro Shimoda
> 
> [...]
> 
> MBR, Sergey
  
Dirk Behme Oct. 10, 2023, 12:59 p.m. UTC | #12
On 26.07.2023 05:19, Jakub Kicinski wrote:
...
> The fact that ravb_tx_timeout_work doesn't take any locks seems much
> more suspicious.
Does anybody plan to look into this, too?

Best regards

Dirk
  
Yoshihiro Shimoda Oct. 12, 2023, 8:39 a.m. UTC | #13
Hello Behme,

> From: Behme Dirk (CM/ESO2), Sent: Tuesday, October 10, 2023 9:59 PM
> 
> On 26.07.2023 05:19, Jakub Kicinski wrote:
> ...
> > The fact that ravb_tx_timeout_work doesn't take any locks seems much
> > more suspicious.
> Does anybody plan to look into this, too?

I believe my fixed patch [1] resolved this issue too. Let me explain it in detail below.

In the thread, Jakub also mentioned [2] like below:
---
Simplest fix I can think of is to take a reference on the netdev before
scheduling the work, and then check if it's still registered in the work
itself. Wrap the timeout work in rtnl_lock() to avoid any races there.
---

Sergey suggested to add cancel_work_sync() into the ravb_close () [3].
And I investigated calltrace, and then the ravb_close() is under rtnl_lock() [4]
like below:
-----------------------------------------------------------------------
ravb_remove() calls unregister_netdev().
 -> unregister_netdev() calls rtnl_lock() and unregister_netdevice().
 --> unregiter_netdevice_queue()
 ---> unregiter_netdevice_many()
 ----> unregiter_netdevice_many_notify().
 -----> dev_close_many()
 ------> __dev_close_many()
 -------> ops->ndo_stop()

ravb_close() calls phy_stop()
 -> phy_state_machine() with PHY_HALTED
 --> phy_link_down()
 ---> phy_link_change()
 ----> netif_carrier_off()
-----------------------------------------------------------------------

So, during cancel_work_sync() is waiting for canceling the workqueue in ravb_close(),
it's under rtnl_lock() so that no additional locks are needed in ravb_tx_timeout_work().

[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3971442870713de527684398416970cf025b4f89
[2] https://lore.kernel.org/netdev/20230727164820.48c9e685@kernel.org/
[3] https://lore.kernel.org/netdev/607f4fe4-5a59-39dd-71c2-0cf769b48187@omp.ru/
[4] https://lore.kernel.org/netdev/OSYPR01MB53341CFDBB49A3BA41A6752CD8F9A@OSYPR01MB5334.jpnprd01.prod.outlook.com/

Best regards,
Yoshihiro Shimoda

> Best regards
> 
> Dirk
  
Dirk Behme Oct. 13, 2023, 6:04 a.m. UTC | #14
Hi,

On 12.10.2023 10:39, Yoshihiro Shimoda wrote:
> Hello Behme,
> 
>> From: Behme Dirk (CM/ESO2), Sent: Tuesday, October 10, 2023 9:59 PM
>>
>> On 26.07.2023 05:19, Jakub Kicinski wrote:
>> ...
>>> The fact that ravb_tx_timeout_work doesn't take any locks seems much
>>> more suspicious.
>> Does anybody plan to look into this, too?
> 
> I believe my fixed patch [1] resolved this issue too. 


I'm not an expert of this driver nor the network stack, so sorry if I'm 
totally wrong here ;) But somehow this answer confuses me. Let me explain:

What you did with [1] is to stop/cancel the workqueue in ravb_close(). 
That's fine. But that is at driver's close time.

What's about driver's normal runtime? What I understood is that 
ravb_tx_timeout_work() might run totally asynchronously to the rest of 
the driver. And therefore needs locking/protection/synchronization if it 
uses resources of the driver which are used elsewhere in the driver, too.

I think this is exactly what is described with:

> ---
> Simplest fix I can think of is to take a reference on the netdev before
> scheduling the work, and then check if it's still registered in the work
> itself. Wrap the timeout work in rtnl_lock() to avoid any races there.
> ---

So, where is above done? Not at close time, but at normal run time of 
the driver?

Best regards

Dirk

> Sergey suggested to add cancel_work_sync() into the ravb_close () [3].
> And I investigated calltrace, and then the ravb_close() is under rtnl_lock() [4]
> like below:
> -----------------------------------------------------------------------
> ravb_remove() calls unregister_netdev().
>   -> unregister_netdev() calls rtnl_lock() and unregister_netdevice().
>   --> unregiter_netdevice_queue()
>   ---> unregiter_netdevice_many()
>   ----> unregiter_netdevice_many_notify().
>   -----> dev_close_many()
>   ------> __dev_close_many()
>   -------> ops->ndo_stop()
> 
> ravb_close() calls phy_stop()
>   -> phy_state_machine() with PHY_HALTED
>   --> phy_link_down()
>   ---> phy_link_change()
>   ----> netif_carrier_off()
> -----------------------------------------------------------------------
> 
> So, during cancel_work_sync() is waiting for canceling the workqueue in ravb_close(),
> it's under rtnl_lock() so that no additional locks are needed in ravb_tx_timeout_work().
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3971442870713de527684398416970cf025b4f89
> [2] https://lore.kernel.org/netdev/20230727164820.48c9e685@kernel.org/
> [3] https://lore.kernel.org/netdev/607f4fe4-5a59-39dd-71c2-0cf769b48187@omp.ru/
> [4] https://lore.kernel.org/netdev/OSYPR01MB53341CFDBB49A3BA41A6752CD8F9A@OSYPR01MB5334.jpnprd01.prod.outlook.com/
> 
> Best regards,
> Yoshihiro Shimoda
> 
>> Best regards
>>
>> Dirk
  
Yoshihiro Shimoda Oct. 13, 2023, 8:32 a.m. UTC | #15
Hi,

> From: Behme Dirk (CM/ESO2), Sent: Friday, October 13, 2023 3:05 PM
>
> Hi,
>
> On 12.10.2023 10:39, Yoshihiro Shimoda wrote:
> > Hello Behme,
> >
> >> From: Behme Dirk (CM/ESO2), Sent: Tuesday, October 10, 2023 9:59 PM
> >>
> >> On 26.07.2023 05:19, Jakub Kicinski wrote:
> >> ...
> >>> The fact that ravb_tx_timeout_work doesn't take any locks seems much
> >>> more suspicious.
> >> Does anybody plan to look into this, too?
> >
> > I believe my fixed patch [1] resolved this issue too.
>
>
> I'm not an expert of this driver nor the network stack, so sorry if I'm
> totally wrong here ;) But somehow this answer confuses me. Let me explain:
>
> What you did with [1] is to stop/cancel the workqueue in ravb_close().
> That's fine. But that is at driver's close time.
>
> What's about driver's normal runtime? What I understood is that
> ravb_tx_timeout_work() might run totally asynchronously to the rest of
> the driver. And therefore needs locking/protection/synchronization if it
> uses resources of the driver which are used elsewhere in the driver, too.
>
> I think this is exactly what is described with:
>
> > ---
> > Simplest fix I can think of is to take a reference on the netdev before
> > scheduling the work, and then check if it's still registered in the work
> > itself. Wrap the timeout work in rtnl_lock() to avoid any races there.
> > ---
>
> So, where is above done? Not at close time, but at normal run time of
> the driver?

Thank you very much for your detailed explanation. I understood it.
ravb_tx_timeout_work() still has races between ethtool ops for instance.
So, I'll make a patch for it by early next week. However, IIUC, using
rtnl_lock() in ravb_tx_timeout_work() is possible to cause deadlock from
cancel_work_sync() in ravb_close(). So, I'll use rtnl_trylock() instead.

Best regards,
Yoshihiro Shimoda

> Best regards
>
> Dirk
>
> > Sergey suggested to add cancel_work_sync() into the ravb_close () [3].
> > And I investigated calltrace, and then the ravb_close() is under rtnl_lock() [4]
> > like below:
> > -----------------------------------------------------------------------
> > ravb_remove() calls unregister_netdev().
> >   -> unregister_netdev() calls rtnl_lock() and unregister_netdevice().
> >   --> unregiter_netdevice_queue()
> >   ---> unregiter_netdevice_many()
> >   ----> unregiter_netdevice_many_notify().
> >   -----> dev_close_many()
> >   ------> __dev_close_many()
> >   -------> ops->ndo_stop()
> >
> > ravb_close() calls phy_stop()
> >   -> phy_state_machine() with PHY_HALTED
> >   --> phy_link_down()
> >   ---> phy_link_change()
> >   ----> netif_carrier_off()
> > -----------------------------------------------------------------------
> >
> > So, during cancel_work_sync() is waiting for canceling the workqueue in ravb_close(),
> > it's under rtnl_lock() so that no additional locks are needed in ravb_tx_timeout_work().
> >
> > [1]
> https://git.kernel.org/pub/scm/linux/kernel/git%25
> 2Fnetdev%2Fnet.git%2Fcommit%2F%3Fid%3D3971442870713de527684398416970cf025b4f89&data=05%7C01%7Cyoshihiro.shimoda.uh%4
> 0renesas.com%7C466e046b20b548b264f808dbcbb255f6%7C53d82571da1947e49cb4625a166a4a2a%7C0%7C0%7C638327739033548199%7CUn
> known%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=HkA8f5a
> gawjXMvAGkaE6tELaSpjpbIn7M3mU5xbDTD0%3D&reserved=0
> > [2]
> https://lore.kernel.org/netdev/20230727164820.48c9e685
> %40kernel.org%2F&data=05%7C01%7Cyoshihiro.shimoda.uh%40renesas.com%7C466e046b20b548b264f808dbcbb255f6%7C53d82571da19
> 47e49cb4625a166a4a2a%7C0%7C0%7C638327739033548199%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTi
> I6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=cGvnA8WqxM%2FUDa%2FNS2OBztr1IWgjCX4IzBYXe1LGkZU%3D&reserved=0
> > [3]
> https://lore.kernel.org/netdev/607f4fe4-5a59-39dd-71c2
> -0cf769b48187%40omp.ru%2F&data=05%7C01%7Cyoshihiro.shimoda.uh%40renesas.com%7C466e046b20b548b264f808dbcbb255f6%7C53d
> 82571da1947e49cb4625a166a4a2a%7C0%7C0%7C638327739033548199%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luM
> zIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=OWwBKy%2Fdckgo3clPPfn2hxE4H6ToyqdcbhPhGoqoo30%3D&reserved=0
> > [4]
> https://lore.kernel.org/netdev/OSYPR01MB53341CFDBB49A3
> BA41A6752CD8F9A%40OSYPR01MB5334.jpnprd01.prod.outlook.com%2F&data=05%7C01%7Cyoshihiro.shimoda.uh%40renesas.com%7C466
> e046b20b548b264f808dbcbb255f6%7C53d82571da1947e49cb4625a166a4a2a%7C0%7C0%7C638327739033548199%7CUnknown%7CTWFpbGZsb3
> d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Jfypf10jiUfTqWUAukjnPzIQp
> urx7m0ETF5N2Toq8wE%3D&reserved=0
> >
> > Best regards,
> > Yoshihiro Shimoda
> >
> >> Best regards
> >>
> >> Dirk
>
> --
> ======================================================================
> Dirk Behme                      Robert Bosch Car Multimedia GmbH
>                                  CM/ESO2
> Phone: +49 5121 49-3274         Dirk Behme
> Fax:   +49 711 811 5053274      PO Box 77 77 77
> mailto:dirk.behme@de.bosch.com  D-31132 Hildesheim - Germany
>
> Bosch Group, Car Multimedia (CM)
>               Engineering SW Operating Systems 2 (ESO2)
>
> Robert Bosch Car Multimedia GmbH - Ein Unternehmen der Bosch Gruppe
> Sitz: Hildesheim
> Registergericht: Amtsgericht Hildesheim HRB 201334
> Aufsichtsratsvorsitzender: Dr. Dirk Hoheisel
> Geschäftsführung: Dr. Steffen Berns;
>                    Dr. Sven Ost, Jörg Pollak, Dr. Walter Schirm
> ======================================================================
  

Patch

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index 4d6b3b7d6abb..ce2da5101e51 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -2885,6 +2885,9 @@  static int ravb_remove(struct platform_device *pdev)
 	struct ravb_private *priv = netdev_priv(ndev);
 	const struct ravb_hw_info *info = priv->info;
 
+	netif_carrier_off(ndev);
+	netif_tx_disable(ndev);
+	cancel_work_sync(&priv->work);
 	/* Stop PTP Clock driver */
 	if (info->ccc_gac)
 		ravb_ptp_stop(ndev);