[V2] scsi: libsas: Directly kick-off EH when ATA device fell off

Message ID 20221216100327.7386-1-yangxingui@huawei.com
State New
Series [V2] scsi: libsas: Directly kick-off EH when ATA device fell off |

Commit Message

yangxingui Dec. 16, 2022, 10:03 a.m. UTC
If an ATA device has fallen off, call sas_ata_device_link_abort() directly
to mark all outstanding QCs as failed and kick off EH immediately. This avoids
having to wait for block layer timeouts.

Signed-off-by: Xingui Yang <yangxingui@huawei.com>
---
Changes to v1:
- Use dev_is_sata() to check ATA device type 
 drivers/scsi/libsas/sas_discover.c | 3 +++
 1 file changed, 3 insertions(+)
  

Comments

Jason Yan Dec. 19, 2022, 2:19 a.m. UTC | #1
On 2022/12/16 18:03, Xingui Yang wrote:
> If the ATA device fell off, call sas_ata_device_link_abort() directly and
> mark all outstanding QCs as failed and kick-off EH Immediately. This avoids
> having to wait for block layer timeouts.
> 
> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
> ---
> Changes to v1:
> - Use dev_is_sata() to check ATA device type
>   drivers/scsi/libsas/sas_discover.c | 3 +++
>   1 file changed, 3 insertions(+)

Looks good,
Reviewed-by: Jason Yan <yanaijie@huawei.com>
  
John Garry Dec. 19, 2022, 9:23 a.m. UTC | #2
On 16/12/2022 10:03, Xingui Yang wrote:
> If the ATA device fell off, call sas_ata_device_link_abort() directly and
> mark all outstanding QCs as failed and kick-off EH Immediately. This avoids
> having to wait for block layer timeouts.
> 
> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
> ---
> Changes to v1:
> - Use dev_is_sata() to check ATA device type
>   drivers/scsi/libsas/sas_discover.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
> index d5bc1314c341..a12b65eb4a2a 100644
> --- a/drivers/scsi/libsas/sas_discover.c
> +++ b/drivers/scsi/libsas/sas_discover.c
> @@ -362,6 +362,9 @@ static void sas_destruct_ports(struct asd_sas_port *port)
>   
>   void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev)
>   {
> +	if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
> +		sas_ata_device_link_abort(dev, false);

Firstly, I think that there is a bug in the sas_ata_device_link_abort() -> 
ata_link_abort() code in that the host lock is not grabbed, as the 
comment in ata_port_abort() mentions. Having said that, libsas already 
had some dodgy host locking usage - specifically dropping the lock 
for the queuing path (that's something else to be fixed up ... I think 
that is due to the queue command CB calling task_done() in some cases), but 
I still think that sas_ata_device_link_abort() should be fixed (to grab 
the host lock).
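
The locking contract described above can be modelled in plain userspace C. This is only a sketch under stated assumptions: struct toy_lock, struct toy_port and all toy_* functions are hypothetical stand-ins invented for illustration, not libsas or libata code. toy_link_abort() plays the role of an abort routine that expects its caller to hold the port lock, and toy_device_link_abort() plays the role of a wrapper that takes the lock first.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a spinlock: it only records whether it is held. */
struct toy_lock {
	bool held;
};

/* Hypothetical, heavily reduced model of an ATA port. */
struct toy_port {
	struct toy_lock lock;	/* models ap->lock */
	unsigned int active;	/* bitmap of outstanding commands */
	unsigned int failed;	/* bitmap of commands marked as failed */
	bool eh_scheduled;	/* set once error handling is kicked off */
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/*
 * Models the abort step: the caller must already hold the port lock.
 * Returns the number of commands failed, or -1 if the lock is not held.
 */
static int toy_link_abort(struct toy_port *ap)
{
	unsigned int bits;
	int n = 0;

	if (!ap->lock.held)
		return -1;

	/* Count the outstanding commands before failing them all. */
	for (bits = ap->active; bits; bits >>= 1)
		n += bits & 1;

	ap->failed |= ap->active;
	ap->active = 0;
	ap->eh_scheduled = true;
	return n;
}

/* Models a wrapper that takes the lock around the abort. */
static int toy_device_link_abort(struct toy_port *ap)
{
	int n;

	toy_lock_acquire(&ap->lock);
	n = toy_link_abort(ap);
	toy_lock_release(&ap->lock);
	return n;
}
```

In this model, calling toy_link_abort() without the lock is rejected, while going through toy_device_link_abort() fails every outstanding command and leaves the lock released afterwards.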

Secondly, this just seems like a half solution to the age-old problem - 
that is, EH eventually kicking in only after 30 seconds when a disk is 
removed with active IO. I say half solution as SAS disks still have this 
issue for libsas. Can we instead push to try to solve both of them now?

There was a broad previous discussion on this:
https://lore.kernel.org/linux-scsi/Ykqg0kr0F%2Fyzk2XW@infradead.org/

From that discussion, Hannes was doing some related prep work series, 
but I don't think it got completed.

Thanks,
John

> +
>   	if (!test_bit(SAS_DEV_DESTROY, &dev->state) &&
>   	    !list_empty(&dev->disco_list_node)) {
>   		/* this rphy never saw sas_rphy_add */
  
yangxingui Dec. 19, 2022, 12:59 p.m. UTC | #3
On 2022/12/19 17:23, John Garry wrote:
> On 16/12/2022 10:03, Xingui Yang wrote:
>> If the ATA device fell off, call sas_ata_device_link_abort() directly and
>> mark all outstanding QCs as failed and kick-off EH Immediately. This 
>> avoids
>> having to wait for block layer timeouts.
>>
>> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
>> ---
>> Changes to v1:
>> - Use dev_is_sata() to check ATA device type
>>   drivers/scsi/libsas/sas_discover.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/scsi/libsas/sas_discover.c 
>> b/drivers/scsi/libsas/sas_discover.c
>> index d5bc1314c341..a12b65eb4a2a 100644
>> --- a/drivers/scsi/libsas/sas_discover.c
>> +++ b/drivers/scsi/libsas/sas_discover.c
>> @@ -362,6 +362,9 @@ static void sas_destruct_ports(struct asd_sas_port 
>> *port)
>>   void sas_unregister_dev(struct asd_sas_port *port, struct 
>> domain_device *dev)
>>   {
>> +    if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
>> +        sas_ata_device_link_abort(dev, false);
> 
Hi, John
> Firstly, I think that there is a bug in sas_ata_device_link_abort() -> 
> ata_link_abort() code in that the host lock in not grabbed, as the 
> comment in ata_port_abort() mentions. Having said that, libsas had 
> already some dodgy host locking usage - specifically dropping the lock 
> for the queuing path (that's something else to be fixed up ... I think 
> that is due to queue command CB calling task_done() in some cases), but 
> I still think that sas_ata_device_link_abort() should be fixed (to grab 
> the host lock).
OK, I very much agree with you on this. I had doubts earlier about whether 
we needed to grab the lock.
> 
> Secondly, this just seems like a half solution to the age-old problem - 
> that is, EH eventually kicking in only after 30 seconds when a disk is 
> removed with active IO. I say half solution as SAS disks still have this 
> issue for libsas. Can we instead push to try to solve both of them now?

Jason said you would call this "a half solution". libsas currently has no 
interface to mark all outstanding commands as failed for SAS disks, and 
SAS disks support I/O resumable transmission after intermittent 
disconnections, so I want to optimize SATA disks first.
If we want a complete solution, perhaps we need to define such an 
interface in libsas and have the LLDD implement it. My current idea is 
to call sas_abort_task() for all outstanding commands in the LLDD. Would 
you approve of this?

Thanks,
Xingui
> 
> There was a broad previous discussion on this:
> https://lore.kernel.org/linux-scsi/Ykqg0kr0F%2Fyzk2XW@infradead.org/
> 
> 
>  From that discussion, Hannes was doing some related prep work series, 
> but I don't think it got completed.
> 
> Thanks,
> John
> 
>> +
>>       if (!test_bit(SAS_DEV_DESTROY, &dev->state) &&
>>           !list_empty(&dev->disco_list_node)) {
>>           /* this rphy never saw sas_rphy_add */
> 
> .
  
John Garry Dec. 19, 2022, 2:53 p.m. UTC | #4
On 19/12/2022 12:59, yangxingui wrote:
>> Firstly, I think that there is a bug in sas_ata_device_link_abort() -> 
>> ata_link_abort() code in that the host lock in not grabbed, as the 
>> comment in ata_port_abort() mentions. Having said that, libsas had 
>> already some dodgy host locking usage - specifically dropping the lock 
>> for the queuing path (that's something else to be fixed up ... I think 
>> that is due to queue command CB calling task_done() in some cases), 
>> but I still think that sas_ata_device_link_abort() should be fixed (to 
>> grab the host lock).
> ok, I agree with you very much for this, I had doubts about whether we 
> needed to grab lock before.

ok, I hope that you can fix this up separately.

>>
>> Secondly, this just seems like a half solution to the age-old problem 
>> - that is, EH eventually kicking in only after 30 seconds when a disk 
>> is removed with active IO. I say half solution as SAS disks still have 
>> this issue for libsas. Can we instead push to try to solve both of 
>> them now?
> 
> Jason said you must have such an opinion "a half solution". As libsas 
> does not have any interface to mark all outstanding commands as failed 
> for SAS disk currently and SAS disk support I/O resumable transmission 
> after intermittent disconnections

I don't know what you mean by "resumable transmission after intermittent 
disconnections".

> , so I want to optimize sata disk first.
> If we want to achieve a complete solution, perhaps we need to define 
> such an interface in libsas and implement it by lldd. My current idea is 
> to call sas_abort_task() for all outstanding commands in lldd. I wonder 
> if you approve of this?

Are you sure you mean sas_abort_task()? That is for the LLDD to issue an 
abort TMF. I assume that you mean sas_task_abort(). If so, I am not too 
keen on the idea of libsas calling into the LLDD to inform of such an 
event. Note that maybe a tagset iter function could be used by libsas to 
abort each active IO, but I don't like libsas messing with such a thing; 
in addition, there may be some conflict between libsas aborting the IO 
and the IO completing with error in the LLDD.

Please note that I need to refresh my memory on this whole EH topic...

Thanks,
John
  
Jason Yan Dec. 19, 2022, 3:28 p.m. UTC | #5
On 2022/12/19 17:23, John Garry wrote:
> On 16/12/2022 10:03, Xingui Yang wrote:
>> If the ATA device fell off, call sas_ata_device_link_abort() directly and
>> mark all outstanding QCs as failed and kick-off EH Immediately. This 
>> avoids
>> having to wait for block layer timeouts.
>>
>> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
>> ---
>> Changes to v1:
>> - Use dev_is_sata() to check ATA device type
>>   drivers/scsi/libsas/sas_discover.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/scsi/libsas/sas_discover.c 
>> b/drivers/scsi/libsas/sas_discover.c
>> index d5bc1314c341..a12b65eb4a2a 100644
>> --- a/drivers/scsi/libsas/sas_discover.c
>> +++ b/drivers/scsi/libsas/sas_discover.c
>> @@ -362,6 +362,9 @@ static void sas_destruct_ports(struct asd_sas_port 
>> *port)
>>   void sas_unregister_dev(struct asd_sas_port *port, struct 
>> domain_device *dev)
>>   {
>> +    if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
>> +        sas_ata_device_link_abort(dev, false);
> 
> Firstly, I think that there is a bug in sas_ata_device_link_abort() -> 
> ata_link_abort() code in that the host lock in not grabbed, as the 
> comment in ata_port_abort() mentions. Having said that, libsas had 
> already some dodgy host locking usage - specifically dropping the lock 
> for the queuing path (that's something else to be fixed up ... I think 

Taking big locks in the queuing path is not a good idea. It will hurt 
performance.


> that is due to queue command CB calling task_done() in some cases), but 
> I still think that sas_ata_device_link_abort() should be fixed (to grab 
> the host lock).

For sas_ata_device_link_abort(), it should grab ap->lock.

Thanks,
Jason
  
John Garry Dec. 19, 2022, 3:55 p.m. UTC | #6
On 19/12/2022 15:28, Jason Yan wrote:
>>> +    if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
>>> +        sas_ata_device_link_abort(dev, false);
>>
>> Firstly, I think that there is a bug in sas_ata_device_link_abort() -> 
>> ata_link_abort() code in that the host lock in not grabbed, as the 
>> comment in ata_port_abort() mentions. Having said that, libsas had 
>> already some dodgy host locking usage - specifically dropping the lock 
>> for the queuing path (that's something else to be fixed up ... I think 
> 
> Taking big locks in queuing path is not a good idea. This will bring 
> down performance.

But ata_qc_issue() is expected to be called with the host lock held 
(and to keep it held).

I think that the reason libsas drops the lock is that some LLDD 
queuecommand CBs call task_done() in some error paths. If we kept the 
lock held, then we could have a deadlock, for example:

sas_ata_qc_issue (has lock) -> lldd_execute_task() = 
pm8001_queue_command() -> task_done() = sas_ata_task_done() -> grab host 
lock => deadlock.
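
The call chain above can be replayed in a small userspace C model. This is a sketch under stated assumptions: every toy_* name is hypothetical, and the toy lock reports failure at the point where a real non-recursive spinlock would spin forever, so the example terminates instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a non-recursive spinlock. */
struct toy_lock {
	bool held;
};

/* Returns false where a real spinlock would spin forever. */
static bool toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void toy_lock_release(struct toy_lock *l)
{
	l->held = false;
}

/* Models a completion handler that needs the host lock. */
static bool toy_task_done(struct toy_lock *host_lock)
{
	if (!toy_lock_acquire(host_lock))
		return false;	/* self-deadlock in the real code */
	toy_lock_release(host_lock);
	return true;
}

/* Models an LLDD callback that completes the task on an error path. */
static bool toy_execute_task(struct toy_lock *host_lock, bool error_path)
{
	if (error_path)
		return toy_task_done(host_lock);
	return true;
}

/*
 * Models the issue path, which is entered with the host lock held.
 * drop_lock selects whether the lock is released around the callback.
 * Returns false if the callback would have deadlocked.
 */
static bool toy_qc_issue(struct toy_lock *host_lock, bool drop_lock,
			 bool error_path)
{
	bool ok;

	if (drop_lock)
		toy_lock_release(host_lock);
	ok = toy_execute_task(host_lock, error_path);
	if (drop_lock)
		toy_lock_acquire(host_lock);
	return ok;
}
```

With the lock kept held, only the error path trips over the lock; dropping the lock around the callback avoids that, at the cost of the locking oddity discussed in this thread.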

Thanks,
John
  
Damien Le Moal Dec. 19, 2022, 10:59 p.m. UTC | #7
On 12/20/22 00:28, Jason Yan wrote:
> On 2022/12/19 17:23, John Garry wrote:
>> On 16/12/2022 10:03, Xingui Yang wrote:
>>> If the ATA device fell off, call sas_ata_device_link_abort() directly and
>>> mark all outstanding QCs as failed and kick-off EH Immediately. This 
>>> avoids
>>> having to wait for block layer timeouts.
>>>
>>> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
>>> ---
>>> Changes to v1:
>>> - Use dev_is_sata() to check ATA device type
>>>   drivers/scsi/libsas/sas_discover.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/scsi/libsas/sas_discover.c 
>>> b/drivers/scsi/libsas/sas_discover.c
>>> index d5bc1314c341..a12b65eb4a2a 100644
>>> --- a/drivers/scsi/libsas/sas_discover.c
>>> +++ b/drivers/scsi/libsas/sas_discover.c
>>> @@ -362,6 +362,9 @@ static void sas_destruct_ports(struct asd_sas_port 
>>> *port)
>>>   void sas_unregister_dev(struct asd_sas_port *port, struct 
>>> domain_device *dev)
>>>   {
>>> +    if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
>>> +        sas_ata_device_link_abort(dev, false);
>>
>> Firstly, I think that there is a bug in sas_ata_device_link_abort() -> 
>> ata_link_abort() code in that the host lock in not grabbed, as the 
>> comment in ata_port_abort() mentions. Having said that, libsas had 
>> already some dodgy host locking usage - specifically dropping the lock 
>> for the queuing path (that's something else to be fixed up ... I think 
> 
> Taking big locks in queuing path is not a good idea. This will bring 
> down performance.

With HDDs? You will not see any difference (and SATA SSDs are no longer
common enough that we should worry too much; NVMe took over). And that
"big lock" in libata is really an integral part of the design. To remove
it, you would need to rewrite libata entirely...

> 
> 
>> that is due to queue command CB calling task_done() in some cases), but 
>> I still think that sas_ata_device_link_abort() should be fixed (to grab 
>> the host lock).
> 
> For sas_ata_device_link_abort(), it should grab ap->lock.

Which is what libata code comments (mistakenly, in many places) always
refer to as the host lock.

> 
> Thanks,
> Jason
  
Damien Le Moal Dec. 19, 2022, 11 p.m. UTC | #8
On 12/20/22 00:55, John Garry wrote:
> On 19/12/2022 15:28, Jason Yan wrote:
>>>> +    if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
>>>> +        sas_ata_device_link_abort(dev, false);
>>>
>>> Firstly, I think that there is a bug in sas_ata_device_link_abort() -> 
>>> ata_link_abort() code in that the host lock in not grabbed, as the 
>>> comment in ata_port_abort() mentions. Having said that, libsas had 
>>> already some dodgy host locking usage - specifically dropping the lock 
>>> for the queuing path (that's something else to be fixed up ... I think 
>>
>> Taking big locks in queuing path is not a good idea. This will bring 
>> down performance.
> 
> But it is expected that ata_qc_issue() should be called with that the 
> host lock grabbed (and keep it).
> 
> I think that the reason libsas drops the lock is because some LLDD 
> queuecommand CBs calls task_done() in some error paths. If we kept the 
> lock held, then we could have a deadlock, for example:
> 
> sas_ata_qc_issue (has lock) -> lldd_execute_task() = 
> pm8001_queue_command() -> task_done() = sas_ata_task_done() -> grab host 
> lock => deadlock.

That should be easily solvable using a workqueue for doing task_done(), no?
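
A rough userspace C model of that deferral idea, as a sketch only: struct toy_host and the toy_* functions are hypothetical and do not use the kernel workqueue API. The issue path only records that a completion is due, and a separate worker step performs the completions later, when the issue path no longer holds the lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_WORK 8

/* Hypothetical model of a host with a lock and a deferred-work list. */
struct toy_host {
	bool lock_held;			/* models the host lock */
	int pending[TOY_MAX_WORK];	/* task ids awaiting completion */
	size_t npending;
	int completed;			/* number of completions run */
};

/*
 * Called from the issue path with the lock held: queue the completion
 * instead of running it. Returns false if the list is full.
 */
static bool toy_defer_task_done(struct toy_host *h, int task_id)
{
	if (h->npending == TOY_MAX_WORK)
		return false;
	h->pending[h->npending++] = task_id;
	return true;
}

/*
 * Models the worker: it runs outside the issue path, so it can take
 * the lock for each completion. Returns the number of completions run.
 */
static int toy_run_deferred(struct toy_host *h)
{
	int ran = 0;
	size_t i;

	for (i = 0; i < h->npending; i++) {
		assert(!h->lock_held);	/* would self-deadlock otherwise */
		h->lock_held = true;
		h->completed++;
		h->lock_held = false;
		ran++;
	}
	h->npending = 0;
	return ran;
}
```

The design point is that nothing in the issue path ever tries to take the lock a second time; the cost is that completions on these error paths become asynchronous.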

> 
> Thanks,
> John
  
yangxingui Dec. 20, 2022, 2:34 a.m. UTC | #9
On 2022/12/19 22:53, John Garry wrote:
> On 19/12/2022 12:59, yangxingui wrote:
>>> Firstly, I think that there is a bug in sas_ata_device_link_abort() 
>>> -> ata_link_abort() code in that the host lock in not grabbed, as the 
>>> comment in ata_port_abort() mentions. Having said that, libsas had 
>>> already some dodgy host locking usage - specifically dropping the 
>>> lock for the queuing path (that's something else to be fixed up ... I 
>>> think that is due to queue command CB calling task_done() in some 
>>> cases), but I still think that sas_ata_device_link_abort() should be 
>>> fixed (to grab the host lock).
>> ok, I agree with you very much for this, I had doubts about whether we 
>> needed to grab lock before.
> 
> ok, I hope that you can fix this up separately.
> 
>>>
>>> Secondly, this just seems like a half solution to the age-old problem 
>>> - that is, EH eventually kicking in only after 30 seconds when a disk 
>>> is removed with active IO. I say half solution as SAS disks still 
>>> have this issue for libsas. Can we instead push to try to solve both 
>>> of them now?
>>
>> Jason said you must have such an opinion "a half solution". As libsas 
>> does not have any interface to mark all outstanding commands as failed 
>> for SAS disk currently and SAS disk support I/O resumable transmission 
>> after intermittent disconnections
> 
> I don't know what you mean by "resumable transmission after intermittent 
> disconnections".
I mean that if a SAS disk is plugged back in within 2 seconds of being 
unplugged, while it still has power, it can continue responding to the 
active IO. For example, the disk's phy comes up within 2 seconds of phy down.
> 
>> , so I want to optimize sata disk first.
>> If we want to achieve a complete solution, perhaps we need to define 
>> such an interface in libsas and implement it by lldd. My current idea 
>> is to call sas_abort_task() for all outstanding commands in lldd. I 
>> wonder if you approve of this?
> 
> Are you sure you mean sas_abort_task()? That is for the LLDD to issue an 
> abort TMF. I assume that you mean sas_task_abort(). If so, I am not too 

Yes, I mean sas_task_abort(), the two function names are confusing to 
me. ^_^
> keen on the idea of libsas calling into the LLDD to inform of such an 
> event. Note that maybe a tagset iter function could be used by libsas to 
> abort each active IO, but I don't like libsas messing with such a thing; 
> in addition, there may be some conflict between libsas aborting the IO 
> and the IO completing with error in the LLDD.

I agree with you. Since we have a ready-made interface for marking all 
active IO as failed for SATA disks, it may be easier to optimize SATA 
disks first. If we don't implement similar interfaces in libsas or the 
LLDD, what would you suggest?

Thanks,
Xingui
> 
> Please note that I need to refresh my memory on this whole EH topic...
> 
> Thanks,
> John
> 
> .
  
Jason Yan Dec. 20, 2022, 2:39 a.m. UTC | #10
On 2022/12/19 17:23, John Garry wrote:
> On 16/12/2022 10:03, Xingui Yang wrote:
>> If the ATA device fell off, call sas_ata_device_link_abort() directly and
>> mark all outstanding QCs as failed and kick-off EH Immediately. This 
>> avoids
>> having to wait for block layer timeouts.
>>
>> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
>> ---
>> Changes to v1:
>> - Use dev_is_sata() to check ATA device type
>>   drivers/scsi/libsas/sas_discover.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/scsi/libsas/sas_discover.c 
>> b/drivers/scsi/libsas/sas_discover.c
>> index d5bc1314c341..a12b65eb4a2a 100644
>> --- a/drivers/scsi/libsas/sas_discover.c
>> +++ b/drivers/scsi/libsas/sas_discover.c
>> @@ -362,6 +362,9 @@ static void sas_destruct_ports(struct asd_sas_port 
>> *port)
>>   void sas_unregister_dev(struct asd_sas_port *port, struct 
>> domain_device *dev)
>>   {
>> +    if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
>> +        sas_ata_device_link_abort(dev, false);
> 
> Firstly, I think that there is a bug in sas_ata_device_link_abort() -> 
> ata_link_abort() code in that the host lock in not grabbed, as the 
> comment in ata_port_abort() mentions. Having said that, libsas had 
> already some dodgy host locking usage - specifically dropping the lock 
> for the queuing path (that's something else to be fixed up ... I think 
> that is due to queue command CB calling task_done() in some cases), but 
> I still think that sas_ata_device_link_abort() should be fixed (to grab 
> the host lock).
> 
> Secondly, this just seems like a half solution to the age-old problem - 
> that is, EH eventually kicking in only after 30 seconds when a disk is 
> removed with active IO. I say half solution as SAS disks still have this 
> issue for libsas. Can we instead push to try to solve both of them now?
> 
> There was a broad previous discussion on this:
> https://lore.kernel.org/linux-scsi/Ykqg0kr0F%2Fyzk2XW@infradead.org/
> 
> 
>  From that discussion, Hannes was doing some related prep work series, 
> but I don't think it got completed.

That discussion is not exactly the same as our issue. It focused on 
whether one device's error handling could avoid suspending IO dispatch 
for the other devices on the same host, that is, something like 
parallelizing the error handling for different devices.

However, what we are trying to resolve here is shortening the timeout 
handling for an unplugged device. The SCSI midlayer doesn't know the 
device is gone and keeps waiting for the IO until the timeout kicks in 
and starts the error handling. This leaves applications stuck for a 
significantly long time. But libsas knows, because it receives the phy 
down event, that the device will not come back and that there is no need 
to wait for the timeout.
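
The difference between the two behaviours can be illustrated with a toy userspace C model. It is a sketch under stated assumptions: struct toy_cmd and both functions are hypothetical, time is counted in whole seconds, and the 30 second value is the timeout figure mentioned earlier in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_TIMEOUT_S 30	/* timeout figure quoted in the thread */

/* Hypothetical model of one outstanding command. */
struct toy_cmd {
	int issued_at;		/* issue time, in seconds */
	bool outstanding;
	int failed_at;		/* time of failure, -1 while not failed */
};

/*
 * Baseline: a command fails only once its timeout has expired.
 * Returns the number of commands newly failed at time 'now'.
 */
static int toy_check_timeouts(struct toy_cmd *cmds, size_t n, int now)
{
	int failed = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!cmds[i].outstanding)
			continue;
		if (now - cmds[i].issued_at < TOY_TIMEOUT_S)
			continue;
		cmds[i].outstanding = false;
		cmds[i].failed_at = now;
		failed++;
	}
	return failed;
}

/*
 * Behaviour argued for here: when the device is known to be gone,
 * fail everything outstanding right away.
 */
static int toy_device_gone(struct toy_cmd *cmds, size_t n, int now)
{
	int failed = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!cmds[i].outstanding)
			continue;
		cmds[i].outstanding = false;
		cmds[i].failed_at = now;
		failed++;
	}
	return failed;
}
```

For two commands issued at t=0 and t=5 and a device removed at t=6, the baseline fails them at t=30 and t=35, whereas toy_device_gone() fails both at t=6.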

It's true that this is a half solution. I'd like to have a complete 
solution too. So we will try to solve both of them.

Thanks,
Jason
  
John Garry Dec. 20, 2022, 8:43 a.m. UTC | #11
On 19/12/2022 23:00, Damien Le Moal wrote:
>> But it is expected that ata_qc_issue() should be called with that the
>> host lock grabbed (and keep it).
>>
>> I think that the reason libsas drops the lock is because some LLDD
>> queuecommand CBs calls task_done() in some error paths. If we kept the
>> lock held, then we could have a deadlock, for example:
>>
>> sas_ata_qc_issue (has lock) -> lldd_execute_task() =
>> pm8001_queue_command() -> task_done() = sas_ata_task_done() -> grab host
>> lock => deadlock.
> That should be easily solvable using a workqueue for doing task_done(), no ?
> 

I don't see why we cannot just return an error code directly from the 
lldd_execute_task CB always - we end up calling scsi_done() directly 
then. But I am suspicious why it is not already done this way.

Looking at the code history, this fiddling with the ap->lock actually 
looks related to commit 312d3e56119a4bc5c36a96818f87f650c069ddc2 
("[SCSI] libsas: remove ata_port.lock management duties from lldds"). I 
will check that further.

Thanks,
John
  
Jason Yan Dec. 20, 2022, 9:49 a.m. UTC | #12
On 2022/12/19 22:53, John Garry wrote:
> Are you sure you mean sas_abort_task()? That is for the LLDD to issue an 
> abort TMF. I assume that you mean sas_task_abort(). If so, I am not too 
> keen on the idea of libsas calling into the LLDD to inform of such an 
> event. Note that maybe a tagset iter function could be used by libsas to 
> abort each active IO, but I don't like libsas messing with such a thing; 
> in addition, there may be some conflict between libsas aborting the IO 
> and the IO completing with error in the LLDD.

Iterating over the tagset in libsas is odd.

The question is: shall we implement the aborting from the driver side, 
as sas_ata_device_link_abort() does, or from the upper side (SCSI 
midlayer or block layer), for example by triggering the block layer 
timeout handler immediately after we find that the device is gone?

Thanks,
Jason
  
yangxingui Dec. 21, 2022, 9:28 a.m. UTC | #13
On 2022/12/19 22:53, John Garry wrote:
> On 19/12/2022 12:59, yangxingui wrote:
>>> Firstly, I think that there is a bug in sas_ata_device_link_abort() 
>>> -> ata_link_abort() code in that the host lock in not grabbed, as the 
>>> comment in ata_port_abort() mentions. Having said that, libsas had 
>>> already some dodgy host locking usage - specifically dropping the 
>>> lock for the queuing path (that's something else to be fixed up ... I 
>>> think that is due to queue command CB calling task_done() in some 
>>> cases), but I still think that sas_ata_device_link_abort() should be 
>>> fixed (to grab the host lock).
>> ok, I agree with you very much for this, I had doubts about whether we 
>> needed to grab lock before.
> 
> ok, I hope that you can fix this up separately.
> 
>>>
>>> Secondly, this just seems like a half solution to the age-old problem 
>>> - that is, EH eventually kicking in only after 30 seconds when a disk 
>>> is removed with active IO. I say half solution as SAS disks still 
>>> have this issue for libsas. Can we instead push to try to solve both 
>>> of them now?
>>
>> Jason said you must have such an opinion "a half solution". As libsas 
>> does not have any interface to mark all outstanding commands as failed 
>> for SAS disk currently and SAS disk support I/O resumable transmission 
>> after intermittent disconnections
> 
> I don't know what you mean by "resumable transmission after intermittent 
> disconnections".
> 
>> , so I want to optimize sata disk first.
>> If we want to achieve a complete solution, perhaps we need to define 
>> such an interface in libsas and implement it by lldd. My current idea 
>> is to call sas_abort_task() for all outstanding commands in lldd. I 
>> wonder if you approve of this?
> 
> Are you sure you mean sas_abort_task()? That is for the LLDD to issue an 
> abort TMF. I assume that you mean sas_task_abort(). If so, I am not too 
> keen on the idea of libsas calling into the LLDD to inform of such an 
> event.
I've implemented this solution, and verification seems OK for both SAS 
and SATA devices. I'll send an updated version. Could you please have a look?

Thanks,
Xingui
> Note that maybe a tagset iter function could be used by libsas to
> abort each active IO, but I don't like libsas messing with such a thing; 
> in addition, there may be some conflict between libsas aborting the IO 
> and the IO completing with error in the LLDD.
> 
> Please note that I need to refresh my memory on this whole EH topic...
> 
> Thanks,
> John
> 
> .
  
John Garry Dec. 21, 2022, 9:40 a.m. UTC | #14
On 20/12/2022 09:49, Jason Yan wrote:
> 
> Itering tagset in libsas is odd.

Iterating with block layer APIs is just a method to deal with each active 
IO. However, libsas should not be aborting IO directly. It may provide 
helper routines, but the LLDD should be dealing with aborting IO.
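
The "iterate over each active IO" pattern can be sketched in userspace C as a walk over a bitmap of busy tags that invokes a callback per tag. All names here (toy_busy_iter, toy_abort_one, struct toy_abort_ctx) are hypothetical and this is not the block layer API; in the split described above, the callback would belong to the LLDD and the iteration would only be a helper.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Callback type: return false to stop the iteration early. */
typedef bool (*toy_busy_iter_fn)(unsigned int tag, void *priv);

/* Call fn once for every tag whose bit is set in 'busy'. */
static void toy_busy_iter(unsigned long busy, toy_busy_iter_fn fn,
			  void *priv)
{
	unsigned int tag;

	for (tag = 0; tag < sizeof(busy) * CHAR_BIT; tag++) {
		if (!(busy & (1UL << tag)))
			continue;
		if (!fn(tag, priv))
			break;
	}
}

/* Per-iteration state owned by the (hypothetical) driver. */
struct toy_abort_ctx {
	unsigned long aborted;	/* bitmap of tags the driver aborted */
	int count;
};

/* Driver-side callback: record the abort of one active IO. */
static bool toy_abort_one(unsigned int tag, void *priv)
{
	struct toy_abort_ctx *ctx = priv;

	ctx->aborted |= 1UL << tag;
	ctx->count++;
	return true;
}
```

A driver would pass its own callback and context, for example toy_busy_iter(busy_tags, toy_abort_one, &ctx), and inspect ctx afterwards.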

>
> The question is, shall we implement the aborting from the driver side,
> such as what sas_ata_device_link_abort() do. Or shall we implement the
> aborting from the upper side(scsi middle layer or block layer), such as
> trigger block layer time out handler immediately after we found device
> is gone?

As mentioned, aborting each IO should be the job of the LLDD. However, 
just making the IO time out will lead to EH kicking in earlier, and EH 
will do the usual per-IO handling in sas_eh_handle_sas_errors() that would 
happen when the IO times out normally - so what are we really gaining 
here? Only that EH kicks in earlier. We still have the problem of all 
other per-host IO being blocked while EH is active.

Thanks,
John
  
Jason Yan Dec. 21, 2022, 10:29 a.m. UTC | #15
On 2022/12/21 17:40, John Garry wrote:
> On 20/12/2022 09:49, Jason Yan wrote:
>>
>> Itering tagset in libsas is odd.
> 
> Itering with block layer APIs is just a method to deal with each active 
> IO. However, libsas should not be aborting IO directly. It may provide 
> helper routines, but the LLDD should be dealing with aborting IO.
> 
>  >
>  > The question is, shall we implement the aborting from the driver side,
>  > such as what sas_ata_device_link_abort() do. Or shall we implement the
>  > aborting from the upper side(scsi middle layer or block layer), such as
>  > trigger block layer time out handler immediately after we found device
>  > is gone?
> 
> As mentioned, aborting each IO should be the job of the LLDD. However, 
> just making the IO timeout will lead to EH kicking in earlier, and EH 
> will do usual per-IO handling in sas_eh_handle_sas_errors() that would 
> happen when the IO timesout normally - so what are we really gaining 
> here? Just EH kicks in earlier. But we still have the problem of all 
> other per-host IO being blocked while EH is active.

As I replied yesterday, this is not the same issue:
https://lkml.org/lkml/2022/12/19/1034

Thanks,
Jason
  

Patch

diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
index d5bc1314c341..a12b65eb4a2a 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -362,6 +362,9 @@  static void sas_destruct_ports(struct asd_sas_port *port)
 
 void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev)
 {
+	if (test_bit(SAS_DEV_GONE, &dev->state) && dev_is_sata(dev))
+		sas_ata_device_link_abort(dev, false);
+
 	if (!test_bit(SAS_DEV_DESTROY, &dev->state) &&
 	    !list_empty(&dev->disco_list_node)) {
 		/* this rphy never saw sas_rphy_add */