[RESEND] scsi: qedi: Fix use after free bug in qedi_remove due to race condition

Message ID 20230318081303.792969-1-zyytlz.wz@163.com
State New
Headers
Series [RESEND] scsi: qedi: Fix use after free bug in qedi_remove due to race condition |

Commit Message

Zheng Wang March 18, 2023, 8:13 a.m. UTC
  In qedi_probe, it calls __qedi_probe, which bound &qedi->recovery_work
with qedi_recovery_handler and bound &qedi->board_disable_work
with qedi_board_disable_work.

When it calls qedi_schedule_recovery_handler, it will finally
call schedule_delayed_work to start the work.

When we call qedi_remove to remove the driver, there
may be a sequence as follows:

Fix it by finishing the work before cleanup in qedi_remove.

CPU0                  CPU1

                     |qedi_recovery_handler
qedi_remove          |
  __qedi_remove      |
iscsi_host_free      |
scsi_host_put        |
//free shost         |
                     |iscsi_host_for_each_session
                     |//use qedi->shost

Fixes: 4b1068f5d74b ("scsi: qedi: Add MFW error recovery process")
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
---
 drivers/scsi/qedi/qedi_main.c | 4 ++++
 1 file changed, 4 insertions(+)
  

Comments

Mike Christie March 20, 2023, 4:11 p.m. UTC | #1
On 3/18/23 3:13 AM, Zheng Wang wrote:
> In qedi_probe, it calls __qedi_probe, which bound &qedi->recovery_work
> with qedi_recovery_handler and bound &qedi->board_disable_work
> with qedi_board_disable_work.
> 
> When it calls qedi_schedule_recovery_handler, it will finally
> call schedule_delayed_work to start the work.
> 
> When we call qedi_remove to remove the driver, there
> may be a sequence as follows:
> 
> Fix it by finishing the work before cleanup in qedi_remove.
> 
> CPU0                  CPU1
> 
>                      |qedi_recovery_handler
> qedi_remove          |
>   __qedi_remove      |
> iscsi_host_free      |
> scsi_host_put        |
> //free shost         |
>                      |iscsi_host_for_each_session
>                      |//use qedi->shost
> 
> Fixes: 4b1068f5d74b ("scsi: qedi: Add MFW error recovery process")
> Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
> ---
>  drivers/scsi/qedi/qedi_main.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/scsi/qedi/qedi_main.c b/drivers/scsi/qedi/qedi_main.c
> index f2ee49756df8..25223f6f5344 100644
> --- a/drivers/scsi/qedi/qedi_main.c
> +++ b/drivers/scsi/qedi/qedi_main.c
> @@ -2414,6 +2414,10 @@ static void __qedi_remove(struct pci_dev *pdev, int mode)
>  	int rval;
>  	u16 retry = 10;
>  
> +	/*cancel work*/

This comment is not needed. The name of the functions you are calling have
"cancel" and "work" in them so we know. If you want to add a comment explain
why the cancel calls are needed here.


> +	cancel_delayed_work_sync(&qedi->recovery_work);
> +	cancel_delayed_work_sync(&qedi->board_disable_work);


How do you know after you have called cancel_delayed_work_sync that
schedule_recovery_handler or schedule_hw_err_handler can't be called?
I don't know the qed driver well, but it looks like you could have
operations still running, so after you cancel here one of those ops
could lead to them scheduling the work again.


> +
>  	if (mode == QEDI_MODE_NORMAL)
>  		iscsi_host_remove(qedi->shost, false);
>  	else if (mode == QEDI_MODE_SHUTDOWN)
  
Zheng Hacker March 23, 2023, 3:44 a.m. UTC | #2
Mike Christie <michael.christie@oracle.com> 于2023年3月21日周二 00:11写道:
>
> On 3/18/23 3:13 AM, Zheng Wang wrote:
> > In qedi_probe, it calls __qedi_probe, which bound &qedi->recovery_work
> > with qedi_recovery_handler and bound &qedi->board_disable_work
> > with qedi_board_disable_work.
> >
> > When it calls qedi_schedule_recovery_handler, it will finally
> > call schedule_delayed_work to start the work.
> >
> > When we call qedi_remove to remove the driver, there
> > may be a sequence as follows:
> >
> > Fix it by finishing the work before cleanup in qedi_remove.
> >
> > CPU0                  CPU1
> >
> >                      |qedi_recovery_handler
> > qedi_remove          |
> >   __qedi_remove      |
> > iscsi_host_free      |
> > scsi_host_put        |
> > //free shost         |
> >                      |iscsi_host_for_each_session
> >                      |//use qedi->shost
> >
> > Fixes: 4b1068f5d74b ("scsi: qedi: Add MFW error recovery process")
> > Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
> > ---
> >  drivers/scsi/qedi/qedi_main.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/scsi/qedi/qedi_main.c b/drivers/scsi/qedi/qedi_main.c
> > index f2ee49756df8..25223f6f5344 100644
> > --- a/drivers/scsi/qedi/qedi_main.c
> > +++ b/drivers/scsi/qedi/qedi_main.c
> > @@ -2414,6 +2414,10 @@ static void __qedi_remove(struct pci_dev *pdev, int mode)
> >       int rval;
> >       u16 retry = 10;
> >
> > +     /*cancel work*/
>
> This comment is not needed. The name of the functions you are calling have
> "cancel" and "work" in them so we know. If you want to add a comment explain
> why the cancel calls are needed here.
>

Hi,

Sorry for my late reply and thanks for your advice. Will remove it in
the next version of patch.

>
> > +     cancel_delayed_work_sync(&qedi->recovery_work);
> > +     cancel_delayed_work_sync(&qedi->board_disable_work);
>
>
> How do you know after you have called cancel_delayed_work_sync that
> schedule_recovery_handler or schedule_hw_err_handler can't be called?
> I don't know the qed driver well, but it looks like you could have
> operations still running, so after you cancel here one of those ops
> could lead to them scheduling the work again.
>

Sorry I didn't know how to make sure there's no more schedule. But I do think
this is important. Maybe there're someone else who can give us advice.

Best regards,
Zheng
>
> > +
> >       if (mode == QEDI_MODE_NORMAL)
> >               iscsi_host_remove(qedi->shost, false);
> >       else if (mode == QEDI_MODE_SHUTDOWN)
>
  
Manish Rangankar March 23, 2023, 10:17 a.m. UTC | #3
> -----Original Message-----
> From: Zheng Hacker <hackerzheng666@gmail.com>
> Sent: Thursday, March 23, 2023 9:15 AM
> To: Mike Christie <michael.christie@oracle.com>
> Cc: Zheng Wang <zyytlz.wz@163.com>; Nilesh Javali <njavali@marvell.com>;
> Manish Rangankar <mrangankar@marvell.com>; GR-QLogic-Storage-
> Upstream <GR-QLogic-Storage-Upstream@marvell.com>;
> jejb@linux.ibm.com; martin.petersen@oracle.com; linux-
> scsi@vger.kernel.org; linux-kernel@vger.kernel.org;
> 1395428693sheep@gmail.com; alex000young@gmail.com
> Subject: [EXT] Re: [PATCH RESEND] scsi: qedi: Fix use after free bug in
> qedi_remove due to race condition
> 
> External Email
> 
> ----------------------------------------------------------------------
> Mike Christie <michael.christie@oracle.com> 于2023年3月21日周二 00:11写
> 道:
> >
> > On 3/18/23 3:13 AM, Zheng Wang wrote:
> > > In qedi_probe, it calls __qedi_probe, which bound
> > > &qedi->recovery_work with qedi_recovery_handler and bound
> > > &qedi->board_disable_work with qedi_board_disable_work.
> > >
> > > When it calls qedi_schedule_recovery_handler, it will finally call
> > > schedule_delayed_work to start the work.
> > >
> > > When we call qedi_remove to remove the driver, there may be a
> > > sequence as follows:
> > >
> > > Fix it by finishing the work before cleanup in qedi_remove.
> > >
> > > CPU0                  CPU1
> > >
> > >                      |qedi_recovery_handler
> > > qedi_remove          |
> > >   __qedi_remove      |
> > > iscsi_host_free      |
> > > scsi_host_put        |
> > > //free shost         |
> > >                      |iscsi_host_for_each_session
> > >                      |//use qedi->shost
> > >
> > > Fixes: 4b1068f5d74b ("scsi: qedi: Add MFW error recovery process")
> > > Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
> > > ---
> > >  drivers/scsi/qedi/qedi_main.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/drivers/scsi/qedi/qedi_main.c
> > > b/drivers/scsi/qedi/qedi_main.c index f2ee49756df8..25223f6f5344
> > > 100644
> > > --- a/drivers/scsi/qedi/qedi_main.c
> > > +++ b/drivers/scsi/qedi/qedi_main.c
> > > @@ -2414,6 +2414,10 @@ static void __qedi_remove(struct pci_dev
> *pdev, int mode)
> > >       int rval;
> > >       u16 retry = 10;
> > >
> > > +     /*cancel work*/
> >
> > This comment is not needed. The name of the functions you are calling
> > have "cancel" and "work" in them so we know. If you want to add a
> > comment explain why the cancel calls are needed here.
> >
> 
> Hi,
> 
> Sorry for my late reply and thanks for your advice. Will remove it in the next
> version of patch.
> 
> >
> > > +     cancel_delayed_work_sync(&qedi->recovery_work);
> > > +     cancel_delayed_work_sync(&qedi->board_disable_work);
> >
> >
> > How do you know after you have called cancel_delayed_work_sync that
> > schedule_recovery_handler or schedule_hw_err_handler can't be called?
> > I don't know the qed driver well, but it looks like you could have
> > operations still running, so after you cancel here one of those ops
> > could lead to them scheduling the work again.
> >
> 
> Sorry I didn't know how to make sure there's no more schedule. But I do
> think this is important. Maybe there're someone else who can give us advice.
> 
> Best regards,
> Zheng
> >

Best place to call cancel_delayed_work_sync is after qedi_ops->stop(qedi->cdev) and 
qedi_ops->ll2->stop(qedi->cdev);, after these qed calls firmware will not post any events to qedi driver.

Thanks,
Manish
  
Zheng Hacker March 24, 2023, 3:21 p.m. UTC | #4
Manish Rangankar <mrangankar@marvell.com> 于2023年3月23日周四 18:17写道:
>
>
>
> > -----Original Message-----
> > From: Zheng Hacker <hackerzheng666@gmail.com>
> > Sent: Thursday, March 23, 2023 9:15 AM
> > To: Mike Christie <michael.christie@oracle.com>
> > Cc: Zheng Wang <zyytlz.wz@163.com>; Nilesh Javali <njavali@marvell.com>;
> > Manish Rangankar <mrangankar@marvell.com>; GR-QLogic-Storage-
> > Upstream <GR-QLogic-Storage-Upstream@marvell.com>;
> > jejb@linux.ibm.com; martin.petersen@oracle.com; linux-
> > scsi@vger.kernel.org; linux-kernel@vger.kernel.org;
> > 1395428693sheep@gmail.com; alex000young@gmail.com
> > Subject: [EXT] Re: [PATCH RESEND] scsi: qedi: Fix use after free bug in
> > qedi_remove due to race condition
> >
> > External Email
> >
> > ----------------------------------------------------------------------
> > Mike Christie <michael.christie@oracle.com> 于2023年3月21日周二 00:11写
> > 道:
> > >
> > > On 3/18/23 3:13 AM, Zheng Wang wrote:
> > > > In qedi_probe, it calls __qedi_probe, which bound
> > > > &qedi->recovery_work with qedi_recovery_handler and bound
> > > > &qedi->board_disable_work with qedi_board_disable_work.
> > > >
> > > > When it calls qedi_schedule_recovery_handler, it will finally call
> > > > schedule_delayed_work to start the work.
> > > >
> > > > When we call qedi_remove to remove the driver, there may be a
> > > > sequence as follows:
> > > >
> > > > Fix it by finishing the work before cleanup in qedi_remove.
> > > >
> > > > CPU0                  CPU1
> > > >
> > > >                      |qedi_recovery_handler
> > > > qedi_remove          |
> > > >   __qedi_remove      |
> > > > iscsi_host_free      |
> > > > scsi_host_put        |
> > > > //free shost         |
> > > >                      |iscsi_host_for_each_session
> > > >                      |//use qedi->shost
> > > >
> > > > Fixes: 4b1068f5d74b ("scsi: qedi: Add MFW error recovery process")
> > > > Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
> > > > ---
> > > >  drivers/scsi/qedi/qedi_main.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > >
> > > > diff --git a/drivers/scsi/qedi/qedi_main.c
> > > > b/drivers/scsi/qedi/qedi_main.c index f2ee49756df8..25223f6f5344
> > > > 100644
> > > > --- a/drivers/scsi/qedi/qedi_main.c
> > > > +++ b/drivers/scsi/qedi/qedi_main.c
> > > > @@ -2414,6 +2414,10 @@ static void __qedi_remove(struct pci_dev
> > *pdev, int mode)
> > > >       int rval;
> > > >       u16 retry = 10;
> > > >
> > > > +     /*cancel work*/
> > >
> > > This comment is not needed. The name of the functions you are calling
> > > have "cancel" and "work" in them so we know. If you want to add a
> > > comment explain why the cancel calls are needed here.
> > >
> >
> > Hi,
> >
> > Sorry for my late reply and thanks for your advice. Will remove it in the next
> > version of patch.
> >
> > >
> > > > +     cancel_delayed_work_sync(&qedi->recovery_work);
> > > > +     cancel_delayed_work_sync(&qedi->board_disable_work);
> > >
> > >
> > > How do you know after you have called cancel_delayed_work_sync that
> > > schedule_recovery_handler or schedule_hw_err_handler can't be called?
> > > I don't know the qed driver well, but it looks like you could have
> > > operations still running, so after you cancel here one of those ops
> > > could lead to them scheduling the work again.
> > >
> >
> > Sorry I didn't know how to make sure there's no more schedule. But I do
> > think this is important. Maybe there're someone else who can give us advice.
> >
> > Best regards,
> > Zheng
> > >
>
> Best place to call cancel_delayed_work_sync is after qedi_ops->stop(qedi->cdev) and
> qedi_ops->ll2->stop(qedi->cdev);, after these qed calls firmware will not post any events to qedi driver.
>

Sorry for my late reply. Will apply that in next version.

Best reagrds,
Zheng

> Thanks,
> Manish
>
  

Patch

diff --git a/drivers/scsi/qedi/qedi_main.c b/drivers/scsi/qedi/qedi_main.c
index f2ee49756df8..25223f6f5344 100644
--- a/drivers/scsi/qedi/qedi_main.c
+++ b/drivers/scsi/qedi/qedi_main.c
@@ -2414,6 +2414,10 @@  static void __qedi_remove(struct pci_dev *pdev, int mode)
 	int rval;
 	u16 retry = 10;
 
+	/*cancel work*/
+	cancel_delayed_work_sync(&qedi->recovery_work);
+	cancel_delayed_work_sync(&qedi->board_disable_work);
+
 	if (mode == QEDI_MODE_NORMAL)
 		iscsi_host_remove(qedi->shost, false);
 	else if (mode == QEDI_MODE_SHUTDOWN)