scsi: libsas: Grab the host lock in sas_ata_device_link_abort()

Message ID 20221220125349.45091-1-yangxingui@huawei.com
State New
Headers
Series scsi: libsas: Grab the host lock in sas_ata_device_link_abort() |

Commit Message

yangxingui Dec. 20, 2022, 12:53 p.m. UTC
  Grab the host lock in sas_ata_device_link_abort() before calling
ata_link_abort(), as the comment in ata_link_abort() mentions.

Signed-off-by: Xingui Yang <yangxingui@huawei.com>
---
 drivers/scsi/libsas/sas_ata.c | 3 +++
 1 file changed, 3 insertions(+)
  

Comments

John Garry Dec. 20, 2022, 2:59 p.m. UTC | #1
On 20/12/2022 12:53, Xingui Yang wrote:
> Grab the host lock in sas_ata_device_link_abort() before calling

This is should be the ata port lock, right? I know that the ata comments 
say differently.

> ata_link_abort(), as the comment in ata_link_abort() mentions.
> 

Can you please add a fixes tag?

> Signed-off-by: Xingui Yang <yangxingui@huawei.com>

Reviewed-by: John Garry <john.g.garry@oracle.com>

> ---
>   drivers/scsi/libsas/sas_ata.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
> index f7439bf9cdc6..4f2017b21e6d 100644
> --- a/drivers/scsi/libsas/sas_ata.c
> +++ b/drivers/scsi/libsas/sas_ata.c
> @@ -889,7 +889,9 @@ void sas_ata_device_link_abort(struct domain_device *device, bool force_reset)
>   {
>   	struct ata_port *ap = device->sata_dev.ap;
>   	struct ata_link *link = &ap->link;
> +	unsigned long flags;
>   
> +	spin_lock_irqsave(ap->lock, flags);
>   	device->sata_dev.fis[2] = ATA_ERR | ATA_DRDY; /* tf status */
>   	device->sata_dev.fis[3] = ATA_ABORTED; /* tf error */
>   
> @@ -897,6 +899,7 @@ void sas_ata_device_link_abort(struct domain_device *device, bool force_reset)
>   	if (force_reset)
>   		link->eh_info.action |= ATA_EH_RESET;
>   	ata_link_abort(link);
> +	spin_unlock_irqrestore(ap->lock, flags);
>   }
>   EXPORT_SYMBOL_GPL(sas_ata_device_link_abort);
>
  
Damien Le Moal Dec. 21, 2022, 12:36 a.m. UTC | #2
On 2022/12/20 23:59, John Garry wrote:
> On 20/12/2022 12:53, Xingui Yang wrote:
>> Grab the host lock in sas_ata_device_link_abort() before calling
> 
> This is should be the ata port lock, right? I know that the ata comments 
> say differently.
> 
>> ata_link_abort(), as the comment in ata_link_abort() mentions.
>>
> 
> Can you please add a fixes tag?
> 
>> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
> 
> Reviewed-by: John Garry <john.g.garry@oracle.com>
> 
>> ---
>>   drivers/scsi/libsas/sas_ata.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
>> index f7439bf9cdc6..4f2017b21e6d 100644
>> --- a/drivers/scsi/libsas/sas_ata.c
>> +++ b/drivers/scsi/libsas/sas_ata.c
>> @@ -889,7 +889,9 @@ void sas_ata_device_link_abort(struct domain_device *device, bool force_reset)
>>   {
>>   	struct ata_port *ap = device->sata_dev.ap;
>>   	struct ata_link *link = &ap->link;
>> +	unsigned long flags;
>>   
>> +	spin_lock_irqsave(ap->lock, flags);
>>   	device->sata_dev.fis[2] = ATA_ERR | ATA_DRDY; /* tf status */
>>   	device->sata_dev.fis[3] = ATA_ABORTED; /* tf error */
>>   
>> @@ -897,6 +899,7 @@ void sas_ata_device_link_abort(struct domain_device *device, bool force_reset)
>>   	if (force_reset)
>>   		link->eh_info.action |= ATA_EH_RESET;
>>   	ata_link_abort(link);

Really need to add lockdep annotations in libata to avoid/catch such bugs...
Will work on that.

>> +	spin_unlock_irqrestore(ap->lock, flags);
>>   }
>>   EXPORT_SYMBOL_GPL(sas_ata_device_link_abort);
>>   
>
  
yangxingui Dec. 21, 2022, 1:47 a.m. UTC | #3
On 2022/12/20 22:59, John Garry wrote:
> On 20/12/2022 12:53, Xingui Yang wrote:
>> Grab the host lock in sas_ata_device_link_abort() before calling
> 
> This is should be the ata port lock, right? I know that the ata comments 
> say differently.
ok, I will update the commit message and use ata port lock instead.
> 
>> ata_link_abort(), as the comment in ata_link_abort() mentions.
>>
> 
> Can you please add a fixes tag?
ok, I will update and add a fix tag.

Thanks,
Xingui
> 
>> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
> 
> Reviewed-by: John Garry <john.g.garry@oracle.com>
> 
>> ---
>>   drivers/scsi/libsas/sas_ata.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/scsi/libsas/sas_ata.c 
>> b/drivers/scsi/libsas/sas_ata.c
>> index f7439bf9cdc6..4f2017b21e6d 100644
>> --- a/drivers/scsi/libsas/sas_ata.c
>> +++ b/drivers/scsi/libsas/sas_ata.c
>> @@ -889,7 +889,9 @@ void sas_ata_device_link_abort(struct 
>> domain_device *device, bool force_reset)
>>   {
>>       struct ata_port *ap = device->sata_dev.ap;
>>       struct ata_link *link = &ap->link;
>> +    unsigned long flags;
>> +    spin_lock_irqsave(ap->lock, flags);
>>       device->sata_dev.fis[2] = ATA_ERR | ATA_DRDY; /* tf status */
>>       device->sata_dev.fis[3] = ATA_ABORTED; /* tf error */
>> @@ -897,6 +899,7 @@ void sas_ata_device_link_abort(struct 
>> domain_device *device, bool force_reset)
>>       if (force_reset)
>>           link->eh_info.action |= ATA_EH_RESET;
>>       ata_link_abort(link);
>> +    spin_unlock_irqrestore(ap->lock, flags);
>>   }
>>   EXPORT_SYMBOL_GPL(sas_ata_device_link_abort);
> 
> .
  
Jason Yan Dec. 21, 2022, 2:42 a.m. UTC | #4
On 2022/12/21 8:36, Damien Le Moal wrote:
> On 2022/12/20 23:59, John Garry wrote:
>> On 20/12/2022 12:53, Xingui Yang wrote:
>>> Grab the host lock in sas_ata_device_link_abort() before calling
>>
>> This is should be the ata port lock, right? I know that the ata comments
>> say differently.
>>
>>> ata_link_abort(), as the comment in ata_link_abort() mentions.
>>>
>>
>> Can you please add a fixes tag?
>>
>>> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
>>
>> Reviewed-by: John Garry <john.g.garry@oracle.com>
>>
>>> ---
>>>    drivers/scsi/libsas/sas_ata.c | 3 +++
>>>    1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
>>> index f7439bf9cdc6..4f2017b21e6d 100644
>>> --- a/drivers/scsi/libsas/sas_ata.c
>>> +++ b/drivers/scsi/libsas/sas_ata.c
>>> @@ -889,7 +889,9 @@ void sas_ata_device_link_abort(struct domain_device *device, bool force_reset)
>>>    {
>>>    	struct ata_port *ap = device->sata_dev.ap;
>>>    	struct ata_link *link = &ap->link;
>>> +	unsigned long flags;
>>>    
>>> +	spin_lock_irqsave(ap->lock, flags);
>>>    	device->sata_dev.fis[2] = ATA_ERR | ATA_DRDY; /* tf status */
>>>    	device->sata_dev.fis[3] = ATA_ABORTED; /* tf error */
>>>    
>>> @@ -897,6 +899,7 @@ void sas_ata_device_link_abort(struct domain_device *device, bool force_reset)
>>>    	if (force_reset)
>>>    		link->eh_info.action |= ATA_EH_RESET;
>>>    	ata_link_abort(link);
> 
> Really need to add lockdep annotations in libata to avoid/catch such bugs...
> Will work on that.

Actually in libata there are many places calling ata_link_abort() not 
holding port lock. And some places are holding the real host 
lock(ata_host->lock) while calling ata_link_abort(). So if you add the 
lockdep annotations, there may be too many warnings. If these are real 
issues, we should fix them first.

Thanks,
Jason
  
Damien Le Moal Dec. 21, 2022, 3:59 a.m. UTC | #5
On 2022/12/21 11:42, Jason Yan wrote:
> On 2022/12/21 8:36, Damien Le Moal wrote:
>> On 2022/12/20 23:59, John Garry wrote:
>>> On 20/12/2022 12:53, Xingui Yang wrote:
>>>> Grab the host lock in sas_ata_device_link_abort() before calling
>>>
>>> This is should be the ata port lock, right? I know that the ata comments
>>> say differently.
>>>
>>>> ata_link_abort(), as the comment in ata_link_abort() mentions.
>>>>
>>>
>>> Can you please add a fixes tag?
>>>
>>>> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
>>>
>>> Reviewed-by: John Garry <john.g.garry@oracle.com>
>>>
>>>> ---
>>>>    drivers/scsi/libsas/sas_ata.c | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
>>>> index f7439bf9cdc6..4f2017b21e6d 100644
>>>> --- a/drivers/scsi/libsas/sas_ata.c
>>>> +++ b/drivers/scsi/libsas/sas_ata.c
>>>> @@ -889,7 +889,9 @@ void sas_ata_device_link_abort(struct domain_device *device, bool force_reset)
>>>>    {
>>>>    	struct ata_port *ap = device->sata_dev.ap;
>>>>    	struct ata_link *link = &ap->link;
>>>> +	unsigned long flags;
>>>>    
>>>> +	spin_lock_irqsave(ap->lock, flags);
>>>>    	device->sata_dev.fis[2] = ATA_ERR | ATA_DRDY; /* tf status */
>>>>    	device->sata_dev.fis[3] = ATA_ABORTED; /* tf error */
>>>>    
>>>> @@ -897,6 +899,7 @@ void sas_ata_device_link_abort(struct domain_device *device, bool force_reset)
>>>>    	if (force_reset)
>>>>    		link->eh_info.action |= ATA_EH_RESET;
>>>>    	ata_link_abort(link);
>>
>> Really need to add lockdep annotations in libata to avoid/catch such bugs...
>> Will work on that.
> 
> Actually in libata there are many places calling ata_link_abort() not 
> holding port lock. And some places are holding the real host 
> lock(ata_host->lock) while calling ata_link_abort(). So if you add the 
> lockdep annotations, there may be too many warnings. If these are real 
> issues, we should fix them first.

libata-EH does most of its work without the port lock held because by the time
we get EH started, we are guaranteed to be idle with no commands in flight. That
is why the calls you mention look like "bugs" but are not.

lockdep annotation will be a little tricky, but not too hard to do either.

> 
> Thanks,
> Jason
  
Jason Yan Dec. 21, 2022, 6:35 a.m. UTC | #6
On 2022/12/21 11:59, Damien Le Moal wrote:
> On 2022/12/21 11:42, Jason Yan wrote:
>> On 2022/12/21 8:36, Damien Le Moal wrote:
>>> On 2022/12/20 23:59, John Garry wrote:
>>>> On 20/12/2022 12:53, Xingui Yang wrote:
>>>>> Grab the host lock in sas_ata_device_link_abort() before calling
>>>>
>>>> This is should be the ata port lock, right? I know that the ata comments
>>>> say differently.
>>>>
>>>>> ata_link_abort(), as the comment in ata_link_abort() mentions.
>>>>>
>>>>
>>>> Can you please add a fixes tag?
>>>>
>>>>> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
>>>>
>>>> Reviewed-by: John Garry <john.g.garry@oracle.com>
>>>>
>>>>> ---
>>>>>     drivers/scsi/libsas/sas_ata.c | 3 +++
>>>>>     1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
>>>>> index f7439bf9cdc6..4f2017b21e6d 100644
>>>>> --- a/drivers/scsi/libsas/sas_ata.c
>>>>> +++ b/drivers/scsi/libsas/sas_ata.c
>>>>> @@ -889,7 +889,9 @@ void sas_ata_device_link_abort(struct domain_device *device, bool force_reset)
>>>>>     {
>>>>>     	struct ata_port *ap = device->sata_dev.ap;
>>>>>     	struct ata_link *link = &ap->link;
>>>>> +	unsigned long flags;
>>>>>     
>>>>> +	spin_lock_irqsave(ap->lock, flags);
>>>>>     	device->sata_dev.fis[2] = ATA_ERR | ATA_DRDY; /* tf status */
>>>>>     	device->sata_dev.fis[3] = ATA_ABORTED; /* tf error */
>>>>>     
>>>>> @@ -897,6 +899,7 @@ void sas_ata_device_link_abort(struct domain_device *device, bool force_reset)
>>>>>     	if (force_reset)
>>>>>     		link->eh_info.action |= ATA_EH_RESET;
>>>>>     	ata_link_abort(link);
>>>
>>> Really need to add lockdep annotations in libata to avoid/catch such bugs...
>>> Will work on that.
>>
>> Actually in libata there are many places calling ata_link_abort() not
>> holding port lock. And some places are holding the real host
>> lock(ata_host->lock) while calling ata_link_abort(). So if you add the
>> lockdep annotations, there may be too many warnings. If these are real
>> issues, we should fix them first.
> 
> libata-EH does most of its work without the port lock held because by the time
> we get EH started, we are guaranteed to be idle with no commands in flight. That
> is why the calls you mention look like "bugs" but are not.

What about the interrupt handler such as ahci_error_intr()? I didn't see 
the callers hold the port lock too. Do they need the port lock?
  
Damien Le Moal Dec. 21, 2022, 8:31 a.m. UTC | #7
On 12/21/22 15:35, Jason Yan wrote:
> On 2022/12/21 11:59, Damien Le Moal wrote:
>> On 2022/12/21 11:42, Jason Yan wrote:
>>> On 2022/12/21 8:36, Damien Le Moal wrote:
>>>> On 2022/12/20 23:59, John Garry wrote:
>>>>> On 20/12/2022 12:53, Xingui Yang wrote:
>>>>>> Grab the host lock in sas_ata_device_link_abort() before calling
>>>>>
>>>>> This is should be the ata port lock, right? I know that the ata
>>>>> comments
>>>>> say differently.
>>>>>
>>>>>> ata_link_abort(), as the comment in ata_link_abort() mentions.
>>>>>>
>>>>>
>>>>> Can you please add a fixes tag?
>>>>>
>>>>>> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
>>>>>
>>>>> Reviewed-by: John Garry <john.g.garry@oracle.com>
>>>>>
>>>>>> ---
>>>>>>     drivers/scsi/libsas/sas_ata.c | 3 +++
>>>>>>     1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/scsi/libsas/sas_ata.c
>>>>>> b/drivers/scsi/libsas/sas_ata.c
>>>>>> index f7439bf9cdc6..4f2017b21e6d 100644
>>>>>> --- a/drivers/scsi/libsas/sas_ata.c
>>>>>> +++ b/drivers/scsi/libsas/sas_ata.c
>>>>>> @@ -889,7 +889,9 @@ void sas_ata_device_link_abort(struct
>>>>>> domain_device *device, bool force_reset)
>>>>>>     {
>>>>>>         struct ata_port *ap = device->sata_dev.ap;
>>>>>>         struct ata_link *link = &ap->link;
>>>>>> +    unsigned long flags;
>>>>>>     +    spin_lock_irqsave(ap->lock, flags);
>>>>>>         device->sata_dev.fis[2] = ATA_ERR | ATA_DRDY; /* tf status */
>>>>>>         device->sata_dev.fis[3] = ATA_ABORTED; /* tf error */
>>>>>>     @@ -897,6 +899,7 @@ void sas_ata_device_link_abort(struct
>>>>>> domain_device *device, bool force_reset)
>>>>>>         if (force_reset)
>>>>>>             link->eh_info.action |= ATA_EH_RESET;
>>>>>>         ata_link_abort(link);
>>>>
>>>> Really need to add lockdep annotations in libata to avoid/catch such
>>>> bugs...
>>>> Will work on that.
>>>
>>> Actually in libata there are many places calling ata_link_abort() not
>>> holding port lock. And some places are holding the real host
>>> lock(ata_host->lock) while calling ata_link_abort(). So if you add the
>>> lockdep annotations, there may be too many warnings. If these are real
>>> issues, we should fix them first.
>>
>> libata-EH does most of its work without the port lock held because by
>> the time
>> we get EH started, we are guaranteed to be idle with no commands in
>> flight. That
>> is why the calls you mention look like "bugs" but are not.
> 
> What about the interrupt handler such as ahci_error_intr()? I didn't see
> the callers hold the port lock too. Do they need the port lock?

It looks like it is missing for ahci_thunderx_irq_handler() but that one
takes the host lock. Same for xgene_ahci_irq_intr(), again no port lock
but host lock taken. And again for ahci_single_level_irq_intr() for the
non MSI case. For modern MSI adapters, the port lock is taken in

For other cases, ahci_multi_irqs_intr_hard) takes the port lock.

So it looks like ahci_port_intr() needs to take the lock and some
cleanups overall (the host lock should not be necessary in the command
path. But nobody seems to have issues with the "bad" cases... Probably
because they are not mainstream adapters.

Definitely some work needed here.
  
Niklas Cassel Dec. 21, 2022, 9:44 a.m. UTC | #8
On Wed, Dec 21, 2022 at 05:31:59PM +0900, Damien Le Moal wrote:
> > 
> > What about the interrupt handler such as ahci_error_intr()? I didn't see
> > the callers hold the port lock too. Do they need the port lock?
> 
> It looks like it is missing for ahci_thunderx_irq_handler() but that one
> takes the host lock. Same for xgene_ahci_irq_intr(), again no port lock
> but host lock taken. And again for ahci_single_level_irq_intr() for the
> non MSI case. For modern MSI adapters, the port lock is taken in
> 
> For other cases, ahci_multi_irqs_intr_hard) takes the port lock.
> 
> So it looks like ahci_port_intr() needs to take the lock and some
> cleanups overall (the host lock should not be necessary in the command
> path. But nobody seems to have issues with the "bad" cases... Probably
> because they are not mainstream adapters.
> 
> Definitely some work needed here.

ahci_multi_irqs_intr_hard() takes the ap->lock before calling
ahci_handle_port_interrupt(), which calls ahci_port_intr(),
so I don't think there is any work needed for multi IRQ AHCI.

For ahci_single_level_irq_intr() the host lock is taken before
calling ahci_handle_port_intr(), so I don't see why we need any
extra work for single IRQ AHCI.


Remember, while the default is that:
	ap->lock = &host->lock;
see:
https://github.com/torvalds/linux/blob/v6.1/drivers/ata/libata-core.c#L5305

In case of MULTI MSI, the ap->lock is using its own lock:
https://github.com/torvalds/linux/blob/v6.1/drivers/ata/libahci.c#L2460


So what is it that needs to be fixed for AHCI?

I haven't looked at ahci_thunderx_irq_handler() and xgene_ahci_irq_intr()
so I can't speak for these.


Kind regards,
Niklas
  
Damien Le Moal Dec. 21, 2022, 12:24 p.m. UTC | #9
On 2022/12/21 18:44, Niklas Cassel wrote:
> On Wed, Dec 21, 2022 at 05:31:59PM +0900, Damien Le Moal wrote:
>>>
>>> What about the interrupt handler such as ahci_error_intr()? I didn't see
>>> the callers hold the port lock too. Do they need the port lock?
>>
>> It looks like it is missing for ahci_thunderx_irq_handler() but that one
>> takes the host lock. Same for xgene_ahci_irq_intr(), again no port lock
>> but host lock taken. And again for ahci_single_level_irq_intr() for the
>> non MSI case. For modern MSI adapters, the port lock is taken in
>>
>> For other cases, ahci_multi_irqs_intr_hard) takes the port lock.
>>
>> So it looks like ahci_port_intr() needs to take the lock and some
>> cleanups overall (the host lock should not be necessary in the command
>> path. But nobody seems to have issues with the "bad" cases... Probably
>> because they are not mainstream adapters.
>>
>> Definitely some work needed here.
> 
> ahci_multi_irqs_intr_hard() takes the ap->lock before calling
> ahci_handle_port_interrupt(), which calls ahci_port_intr(),
> so I don't think there is any work needed for multi IRQ AHCI.

Yes. I tried to say that above.

> 
> For ahci_single_level_irq_intr() the host lock is taken before
> calling ahci_handle_port_intr(), so I don't see why we need any
> extra work for single IRQ AHCI.
> 
> 
> Remember, while the default is that:
> 	ap->lock = &host->lock;

Ah ! Yes ! good point ! So there are no issues anywhere. This is only confusing
because of the comments I think.

> see:
> https://github.com/torvalds/linux/blob/v6.1/drivers/ata/libata-core.c#L5305
> 
> In case of MULTI MSI, the ap->lock is using its own lock:
> https://github.com/torvalds/linux/blob/v6.1/drivers/ata/libahci.c#L2460
> 
> 
> So what is it that needs to be fixed for AHCI?
> 
> I haven't looked at ahci_thunderx_irq_handler() and xgene_ahci_irq_intr()
> so I can't speak for these.

These are not multi-irq and use the &host->lock directly, which is the same as
taking the ap->lock. We could clean that up for consistency and always use
ap->lock. But not a big deal. No bugs, just a little confusing with a simple
reading of the code.

> 
> 
> Kind regards,
> Niklas
  

Patch

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index f7439bf9cdc6..4f2017b21e6d 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -889,7 +889,9 @@  void sas_ata_device_link_abort(struct domain_device *device, bool force_reset)
 {
 	struct ata_port *ap = device->sata_dev.ap;
 	struct ata_link *link = &ap->link;
+	unsigned long flags;
 
+	spin_lock_irqsave(ap->lock, flags);
 	device->sata_dev.fis[2] = ATA_ERR | ATA_DRDY; /* tf status */
 	device->sata_dev.fis[3] = ATA_ABORTED; /* tf error */
 
@@ -897,6 +899,7 @@  void sas_ata_device_link_abort(struct domain_device *device, bool force_reset)
 	if (force_reset)
 		link->eh_info.action |= ATA_EH_RESET;
 	ata_link_abort(link);
+	spin_unlock_irqrestore(ap->lock, flags);
 }
 EXPORT_SYMBOL_GPL(sas_ata_device_link_abort);