diff mbox series

blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

Message ID	1666454846-11749-1-git-send-email-john.garry@huawei.com
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: John Garry <john.garry@huawei.com> To: <axboe@kernel.dk> CC: <linux-kernel@vger.kernel.org>, <linux-block@vger.kernel.org>, <hch@lst.de>, John Garry <john.garry@huawei.com> Subject: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx() Date: Sun, 23 Oct 2022 00:07:26 +0800 Message-ID: <1666454846-11749-1-git-send-email-john.garry@huawei.com> MIME-Version: 1.0 Content-Type: text/plain Precedence: bulk
Series	blk-mq: Properly init bios from blk_mq_alloc_request_hctx() \| blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

Commit Message

John Garry Oct. 22, 2022, 4:07 p.m. UTC

  Function blk_mq_alloc_request_hctx() is missing zeroing/init of rq->bio,
biotail, __sector, and __data_len members, which blk_mq_alloc_request()
has.

Move init'ing of those members to common blk_mq_rq_ctx_init().

Fixes: 1f5bd336b9150 ("blk-mq: add blk_mq_alloc_request_hctx")
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: John Garry <john.garry@huawei.com>

Comments

Ming Lei Oct. 23, 2022, 1:12 p.m. UTC | #1

On Sun, Oct 23, 2022 at 12:07:26AM +0800, John Garry wrote:
> Function blk_mq_alloc_request_hctx() is missing zeroing/init of rq->bio,
> biotail, __sector, and __data_len members, which blk_mq_alloc_request()
> has.
> 
> Move init'ing of those members to common blk_mq_rq_ctx_init().
> 
> Fixes: 1f5bd336b9150 ("blk-mq: add blk_mq_alloc_request_hctx")
> Suggested-by: Bart Van Assche <bvanassche@acm.org>
> Signed-off-by: John Garry <john.garry@huawei.com>
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8070b6c10e8d..260adeb2e455 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -402,6 +402,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>  		}
>  	}
>  
> +	rq->__data_len = 0;
> +	rq->__sector = (sector_t) -1;
> +	rq->bio = rq->biotail = NULL;
> +
>  	return rq;
>  }
>  
> @@ -591,9 +595,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
>  		if (!rq)
>  			goto out_queue_exit;
>  	}
> -	rq->__data_len = 0;
> -	rq->__sector = (sector_t) -1;
> -	rq->bio = rq->biotail = NULL;

This patch looks not good, why do you switch to initialize the three fields
twice in fast path?

BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please
avoid to extend it to other use cases.

Thanks,
Ming

John Garry Oct. 24, 2022, 10:56 a.m. UTC | #2

On 23/10/2022 14:12, Ming Lei wrote:
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 8070b6c10e8d..260adeb2e455 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -402,6 +402,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>>   		}
>>   	}
>>   
>> +	rq->__data_len = 0;
>> +	rq->__sector = (sector_t) -1;
>> +	rq->bio = rq->biotail = NULL;
>> +
>>   	return rq;
>>   }
>>   
>> @@ -591,9 +595,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
>>   		if (!rq)
>>   			goto out_queue_exit;
>>   	}
>> -	rq->__data_len = 0;
>> -	rq->__sector = (sector_t) -1;
>> -	rq->bio = rq->biotail = NULL;
> This patch looks not good, why do you switch to initialize the three fields
> twice in fast path?

Can you please show me how these are initialized twice?

If there is a real concern with this then we go with my original idea, 
which was to copy the init method of blk_mq_alloc_request() (in 
blk_mq_alloc_request_hctx())

> 
> BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please
> avoid to extend it to other use cases.

Yeah, I know this, but sometimes we just need to allocate for a specific 
HW queue...

For my usecase of interest, it should not impact if the cpumask of the 
HW queue goes offline after selecting the cpu in 
blk_mq_alloc_request_hctx(), so any race is ok ... I think.

However it should be still possible to make blk_mq_alloc_request_hctx() 
more robust. How about using something like work_on_cpu_safe() to 
allocate and execute the request with blk_mq_alloc_request() on a cpu 
associated with the HW queue, such that we know the cpu is online and 
stays online until we execute it? Or also extent to 
work_on_cpumask_safe() variant, so that we don't need to try all cpus in 
the mask (to see if online)?

Thanks,
John

Ming Lei Oct. 24, 2022, 1:27 p.m. UTC | #3

On Mon, Oct 24, 2022 at 11:56:21AM +0100, John Garry wrote:
> On 23/10/2022 14:12, Ming Lei wrote:
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index 8070b6c10e8d..260adeb2e455 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -402,6 +402,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
> > >   		}
> > >   	}
> > > +	rq->__data_len = 0;
> > > +	rq->__sector = (sector_t) -1;
> > > +	rq->bio = rq->biotail = NULL;
> > > +
> > >   	return rq;
> > >   }
> > > @@ -591,9 +595,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
> > >   		if (!rq)
> > >   			goto out_queue_exit;
> > >   	}
> > > -	rq->__data_len = 0;
> > > -	rq->__sector = (sector_t) -1;
> > > -	rq->bio = rq->biotail = NULL;
> > This patch looks not good, why do you switch to initialize the three fields
> > twice in fast path?
> 
> Can you please show me how these are initialized twice?

blk_mq_bio_to_request() is one which setup these fields, then you add
another one in blk_mq_rq_ctx_init().

> 
> If there is a real concern with this then we go with my original idea, which
> was to copy the init method of blk_mq_alloc_request() (in
> blk_mq_alloc_request_hctx())
> 
> > 
> > BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please
> > avoid to extend it to other use cases.
> 
> Yeah, I know this,

Did you know the exact issue on nvme-tcp, nvme-rdma or nvme-fc maybe
with blk_mq_alloc_request_hctx()?

> but sometimes we just need to allocate for a specific HW
> queue...
> 
> For my usecase of interest, it should not impact if the cpumask of the HW
> queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(),
> so any race is ok ... I think.
> 
> However it should be still possible to make blk_mq_alloc_request_hctx() more
> robust. How about using something like work_on_cpu_safe() to allocate and
> execute the request with blk_mq_alloc_request() on a cpu associated with the
> HW queue, such that we know the cpu is online and stays online until we
> execute it? Or also extent to work_on_cpumask_safe() variant, so that we
> don't need to try all cpus in the mask (to see if online)?

But all cpus on this hctx->cpumask could become offline.


Thanks,
Ming

John Garry Oct. 24, 2022, 4:56 p.m. UTC | #4

On 24/10/2022 14:27, Ming Lei wrote:
>>>> -	rq->bio = rq->biotail = NULL;
>>> This patch looks not good, why do you switch to initialize the three fields
>>> twice in fast path?
>> Can you please show me how these are initialized twice?
> blk_mq_bio_to_request() is one which setup these fields, then you add
> another one in blk_mq_rq_ctx_init().

ok, understood.

> 
>> If there is a real concern with this then we go with my original idea, which
>> was to copy the init method of blk_mq_alloc_request() (in
>> blk_mq_alloc_request_hctx())
>>
>>> BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please
>>> avoid to extend it to other use cases.
>> Yeah, I know this,
> Did you know the exact issue on nvme-tcp, nvme-rdma or nvme-fc maybe
> with blk_mq_alloc_request_hctx()?

I thought that the original issue was an OoO bounds issue, fixed in 
14dc7a18. Now there is still some issue in the following link, which is 
still unresolved as I understand:
https://lore.kernel.org/linux-block/5bd886f1-a7c6-b765-da29-777be0328bc2@grimberg.me/#t

But I think that 14dc7a18 may still leave undesirable scenario:
- all cpus in HW queue cpumask may go offline after cpu_online_mask read 
in blk_mq_alloc_request_hctx() and before we get the driver tag and set 
rq->hctx

> 
>> but sometimes we just need to allocate for a specific HW
>> queue...
>>
>> For my usecase of interest, it should not impact if the cpumask of the HW
>> queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(),
>> so any race is ok ... I think.
>>
>> However it should be still possible to make blk_mq_alloc_request_hctx() more
>> robust. How about using something like work_on_cpu_safe() to allocate and
>> execute the request with blk_mq_alloc_request() on a cpu associated with the
>> HW queue, such that we know the cpu is online and stays online until we
>> execute it? Or also extent to work_on_cpumask_safe() variant, so that we
>> don't need to try all cpus in the mask (to see if online)?
> But all cpus on this hctx->cpumask could become offline.

If all hctx->cpumask are offline then we should not allocate a request 
and this is acceptable. Maybe I am missing your point.

Thanks,
John

Ming Lei Oct. 25, 2022, 12:34 a.m. UTC | #5

On Mon, Oct 24, 2022 at 05:56:15PM +0100, John Garry wrote:
> On 24/10/2022 14:27, Ming Lei wrote:
> > > > > -	rq->bio = rq->biotail = NULL;
> > > > This patch looks not good, why do you switch to initialize the three fields
> > > > twice in fast path?
> > > Can you please show me how these are initialized twice?
> > blk_mq_bio_to_request() is one which setup these fields, then you add
> > another one in blk_mq_rq_ctx_init().
> 
> ok, understood.
> 
> > 
> > > If there is a real concern with this then we go with my original idea, which
> > > was to copy the init method of blk_mq_alloc_request() (in
> > > blk_mq_alloc_request_hctx())
> > > 
> > > > BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please
> > > > avoid to extend it to other use cases.
> > > Yeah, I know this,
> > Did you know the exact issue on nvme-tcp, nvme-rdma or nvme-fc maybe
> > with blk_mq_alloc_request_hctx()?
> 
> I thought that the original issue was an OoO bounds issue, fixed in
> 14dc7a18. Now there is still some issue in the following link, which is
> still unresolved as I understand:
> https://lore.kernel.org/linux-block/5bd886f1-a7c6-b765-da29-777be0328bc2@grimberg.me/#t
> 
> But I think that 14dc7a18 may still leave undesirable scenario:
> - all cpus in HW queue cpumask may go offline after cpu_online_mask read in
> blk_mq_alloc_request_hctx() and before we get the driver tag and set
> rq->hctx

Yeah.

> 
> > 
> > > but sometimes we just need to allocate for a specific HW
> > > queue...
> > > 
> > > For my usecase of interest, it should not impact if the cpumask of the HW
> > > queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(),
> > > so any race is ok ... I think.
> > > 
> > > However it should be still possible to make blk_mq_alloc_request_hctx() more
> > > robust. How about using something like work_on_cpu_safe() to allocate and
> > > execute the request with blk_mq_alloc_request() on a cpu associated with the
> > > HW queue, such that we know the cpu is online and stays online until we
> > > execute it? Or also extent to work_on_cpumask_safe() variant, so that we
> > > don't need to try all cpus in the mask (to see if online)?
> > But all cpus on this hctx->cpumask could become offline.
> 
> If all hctx->cpumask are offline then we should not allocate a request and
> this is acceptable. Maybe I am missing your point.

As you saw, this API has the above problem too, but any one of CPUs
may become online later, maybe just during blk_mq_alloc_request_hctx(),
and it is easy to cause inconsistence.

You didn't share your use case, but for nvme connection request, if it
is 1:1 mapping, if any one of CPU becomes offline, the controller
initialization could be failed, that isn't good from user viewpoint at
all.


Thanks,
Ming

John Garry Oct. 25, 2022, 7:40 a.m. UTC | #6

On 25/10/2022 01:34, Ming Lei wrote:
>>>> but sometimes we just need to allocate for a specific HW
>>>> queue...
>>>>
>>>> For my usecase of interest, it should not impact if the cpumask of the HW
>>>> queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(),
>>>> so any race is ok ... I think.
>>>>
>>>> However it should be still possible to make blk_mq_alloc_request_hctx() more
>>>> robust. How about using something like work_on_cpu_safe() to allocate and
>>>> execute the request with blk_mq_alloc_request() on a cpu associated with the
>>>> HW queue, such that we know the cpu is online and stays online until we
>>>> execute it? Or also extent to work_on_cpumask_safe() variant, so that we
>>>> don't need to try all cpus in the mask (to see if online)?
>>> But all cpus on this hctx->cpumask could become offline.
>> If all hctx->cpumask are offline then we should not allocate a request and
>> this is acceptable. Maybe I am missing your point.
> As you saw, this API has the above problem too, but any one of CPUs
> may become online later, maybe just during blk_mq_alloc_request_hctx(),
> and it is easy to cause inconsistence.
> 
> You didn't share your use case, but for nvme connection request, if it
> is 1:1 mapping, if any one of CPU becomes offline, the controller
> initialization could be failed, that isn't good from user viewpoint at
> all.

My use case is in SCSI EH domain. For my HBA controller of interest, to 
abort an erroneous IO we must send a controller proprietary abort 
command on same HW queue as original command. So we would need to 
allocate this abort request for a specific HW queue.

I mentioned before that if no hctx->cpumask is online then we don't need 
to allocate a request. That is because if no hctx->cpumask is online, 
this means that original erroneous IO must be completed due to nature of 
how blk-mq cpu hotplug handler works, i.e. drained, and then we don't 
actually need to abort it any longer, so ok to not get a request.

I have an RFC series for this work in which I am using 
blk_mq_alloc_request_hctx(). However, as I mentioned before, I can 
experiment with using something like work_on_cpu_safe() to alloc and 
execute the abort request to safeguard against cpu hotplug events.

Thanks,
John

Ming Lei Oct. 25, 2022, 9 a.m. UTC | #7

On Tue, Oct 25, 2022 at 3:40 PM John Garry <john.garry@huawei.com> wrote:
>
> On 25/10/2022 01:34, Ming Lei wrote:
> >>>> but sometimes we just need to allocate for a specific HW
> >>>> queue...
> >>>>
> >>>> For my usecase of interest, it should not impact if the cpumask of the HW
> >>>> queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(),
> >>>> so any race is ok ... I think.
> >>>>
> >>>> However it should be still possible to make blk_mq_alloc_request_hctx() more
> >>>> robust. How about using something like work_on_cpu_safe() to allocate and
> >>>> execute the request with blk_mq_alloc_request() on a cpu associated with the
> >>>> HW queue, such that we know the cpu is online and stays online until we
> >>>> execute it? Or also extent to work_on_cpumask_safe() variant, so that we
> >>>> don't need to try all cpus in the mask (to see if online)?
> >>> But all cpus on this hctx->cpumask could become offline.
> >> If all hctx->cpumask are offline then we should not allocate a request and
> >> this is acceptable. Maybe I am missing your point.
> > As you saw, this API has the above problem too, but any one of CPUs
> > may become online later, maybe just during blk_mq_alloc_request_hctx(),
> > and it is easy to cause inconsistence.
> >
> > You didn't share your use case, but for nvme connection request, if it
> > is 1:1 mapping, if any one of CPU becomes offline, the controller
> > initialization could be failed, that isn't good from user viewpoint at
> > all.
>
> My use case is in SCSI EH domain. For my HBA controller of interest, to
> abort an erroneous IO we must send a controller proprietary abort
> command on same HW queue as original command. So we would need to
> allocate this abort request for a specific HW queue.

IMO, it is one bad hw/sw interface.

First such request has to be reserved, since all inflight IOs can be in error.

Second error handling needs to provide forward-progress, and it is supposed
to not require external dependency, otherwise easy to cause deadlock, but
here request from specific HW queue just depends on this queue's cpumask.

Also if it has to be reserved, it can be done as one device/driver private
command, so why bother blk-mq for this special use case?

>
> I mentioned before that if no hctx->cpumask is online then we don't need
> to allocate a request. That is because if no hctx->cpumask is online,
> this means that original erroneous IO must be completed due to nature of
> how blk-mq cpu hotplug handler works, i.e. drained, and then we don't
> actually need to abort it any longer, so ok to not get a request.

No, it is really not OK, if all cpus in hctx->cpumask are offline, you
can't allocate
request on the specified hw queue, then the erroneous IO can't be handled,
then cpu hotplug handler may hang for ever.

Thanks,
Ming

John Garry Oct. 25, 2022, 9:08 a.m. UTC | #8

On 25/10/2022 10:00, Ming Lei wrote:
>> My use case is in SCSI EH domain. For my HBA controller of interest, to
>> abort an erroneous IO we must send a controller proprietary abort
>> command on same HW queue as original command. So we would need to
>> allocate this abort request for a specific HW queue.
> IMO, it is one bad hw/sw interface.
> 
> First such request has to be reserved, since all inflight IOs can be in error.

Right

> 
> Second error handling needs to provide forward-progress, and it is supposed
> to not require external dependency, otherwise easy to cause deadlock, but
> here request from specific HW queue just depends on this queue's cpumask.
> 
> Also if it has to be reserved, it can be done as one device/driver private
> command, so why bother blk-mq for this special use case?

I have a series for reserved request support, which I will send later. 
Please have a look. And as I mentioned, I would prob not end up using 
blk_mq_alloc_request_hctx() anyway.

> 
>> I mentioned before that if no hctx->cpumask is online then we don't need
>> to allocate a request. That is because if no hctx->cpumask is online,
>> this means that original erroneous IO must be completed due to nature of
>> how blk-mq cpu hotplug handler works, i.e. drained, and then we don't
>> actually need to abort it any longer, so ok to not get a request.
> No, it is really not OK, if all cpus in hctx->cpumask are offline, you
> can't allocate
> request on the specified hw queue, then the erroneous IO can't be handled,
> then cpu hotplug handler may hang for ever.

If the erroneous IO is still in-flight from blk-mq perspective, then how 
can hctx->cpumask still be offline? I thought that we guarantee that 
hctx->cpumask cannot go offline until drained.

Thanks,
John

Ming Lei Oct. 25, 2022, 9:16 a.m. UTC | #9

On Tue, Oct 25, 2022 at 10:08:10AM +0100, John Garry wrote:
> On 25/10/2022 10:00, Ming Lei wrote:
> > > My use case is in SCSI EH domain. For my HBA controller of interest, to
> > > abort an erroneous IO we must send a controller proprietary abort
> > > command on same HW queue as original command. So we would need to
> > > allocate this abort request for a specific HW queue.
> > IMO, it is one bad hw/sw interface.
> > 
> > First such request has to be reserved, since all inflight IOs can be in error.
> 
> Right
> 
> > 
> > Second error handling needs to provide forward-progress, and it is supposed
> > to not require external dependency, otherwise easy to cause deadlock, but
> > here request from specific HW queue just depends on this queue's cpumask.
> > 
> > Also if it has to be reserved, it can be done as one device/driver private
> > command, so why bother blk-mq for this special use case?
> 
> I have a series for reserved request support, which I will send later.
> Please have a look. And as I mentioned, I would prob not end up using
> blk_mq_alloc_request_hctx() anyway.
> 
> > 
> > > I mentioned before that if no hctx->cpumask is online then we don't need
> > > to allocate a request. That is because if no hctx->cpumask is online,
> > > this means that original erroneous IO must be completed due to nature of
> > > how blk-mq cpu hotplug handler works, i.e. drained, and then we don't
> > > actually need to abort it any longer, so ok to not get a request.
> > No, it is really not OK, if all cpus in hctx->cpumask are offline, you
> > can't allocate
> > request on the specified hw queue, then the erroneous IO can't be handled,
> > then cpu hotplug handler may hang for ever.
> 
> If the erroneous IO is still in-flight from blk-mq perspective, then how can
> hctx->cpumask still be offline? I thought that we guarantee that
> hctx->cpumask cannot go offline until drained.

Yeah, the draining is done before the cpu is offline. But the drain is
simply waiting for the inflight IO to be completed. If the IO is failed
during the waiting, you can't allocate such reserved request for error
handling, then hang ever in blk_mq_hctx_notify_offline().

If you just make it one driver private command, there can't be such
issue. Block layer is supposed for handling common case(normal io and pt io),
I'd suggest to not put such special cases into block layer.

thanks,
Ming

John Garry Oct. 25, 2022, 9:32 a.m. UTC | #10

On 25/10/2022 10:16, Ming Lei wrote:
>>>> I mentioned before that if no hctx->cpumask is online then we don't need
>>>> to allocate a request. That is because if no hctx->cpumask is online,
>>>> this means that original erroneous IO must be completed due to nature of
>>>> how blk-mq cpu hotplug handler works, i.e. drained, and then we don't
>>>> actually need to abort it any longer, so ok to not get a request.
>>> No, it is really not OK, if all cpus in hctx->cpumask are offline, you
>>> can't allocate
>>> request on the specified hw queue, then the erroneous IO can't be handled,
>>> then cpu hotplug handler may hang for ever.
>> If the erroneous IO is still in-flight from blk-mq perspective, then how can
>> hctx->cpumask still be offline? I thought that we guarantee that
>> hctx->cpumask cannot go offline until drained.
> Yeah, the draining is done before the cpu is offline. But the drain is
> simply waiting for the inflight IO to be completed. If the IO is failed
> during the waiting, you can't allocate such reserved request for error
> handling, then hang ever in blk_mq_hctx_notify_offline().

Actually if final cpu in hctx->cpumask is going offline, then hctx won't 
queue any more requests, right? In this case I don't think we can queue 
on that hctx anyway. I need to think about this more.

> 
> If you just make it one driver private command, there can't be such
> issue. 

Well we're trying to use reserved requests for EH commands, which that 
goes against.

> Block layer is supposed for handling common case(normal io and pt io),
> I'd suggest to not put such special cases into block layer.

It also supports reserved commands, which I would assume would be 
suitable for EH scenarios.

Thanks,
John

Ming Lei Oct. 25, 2022, 11:21 a.m. UTC | #11

On Tue, Oct 25, 2022 at 10:32:28AM +0100, John Garry wrote:
> On 25/10/2022 10:16, Ming Lei wrote:
> > > > > I mentioned before that if no hctx->cpumask is online then we don't need
> > > > > to allocate a request. That is because if no hctx->cpumask is online,
> > > > > this means that original erroneous IO must be completed due to nature of
> > > > > how blk-mq cpu hotplug handler works, i.e. drained, and then we don't
> > > > > actually need to abort it any longer, so ok to not get a request.
> > > > No, it is really not OK, if all cpus in hctx->cpumask are offline, you
> > > > can't allocate
> > > > request on the specified hw queue, then the erroneous IO can't be handled,
> > > > then cpu hotplug handler may hang for ever.
> > > If the erroneous IO is still in-flight from blk-mq perspective, then how can
> > > hctx->cpumask still be offline? I thought that we guarantee that
> > > hctx->cpumask cannot go offline until drained.
> > Yeah, the draining is done before the cpu is offline. But the drain is
> > simply waiting for the inflight IO to be completed. If the IO is failed
> > during the waiting, you can't allocate such reserved request for error
> > handling, then hang ever in blk_mq_hctx_notify_offline().
> 
> Actually if final cpu in hctx->cpumask is going offline, then hctx won't
> queue any more requests, right? In this case I don't think we can queue on
> that hctx anyway. I need to think about this more.

It can be queued actually, but interrupt may not be delivered if managed
irq is used.

> 
> > 
> > If you just make it one driver private command, there can't be such
> > issue.
> 
> Well we're trying to use reserved requests for EH commands, which that goes
> against.
> 
> > Block layer is supposed for handling common case(normal io and pt io),
> > I'd suggest to not put such special cases into block layer.
> 
> It also supports reserved commands, which I would assume would be suitable
> for EH scenarios.

Then you have to be careful, as I mentioned, EH has to provide forward
progress, if you let blk-mq allocate & submit EH request, the implied
dependency from blk-mq has to be payed attention.


Thanks,
Ming

John Garry Oct. 25, 2022, 11:36 a.m. UTC | #12

On 25/10/2022 12:21, Ming Lei wrote:
>> Actually if final cpu in hctx->cpumask is going offline, then hctx won't
>> queue any more requests, right? In this case I don't think we can queue on
>> that hctx anyway. I need to think about this more.
> It can be queued actually, but interrupt may not be delivered if managed
> irq is used.

Yes, I think it will be queued elsewhere. I would need to check the code 
again.

> 
>>> If you just make it one driver private command, there can't be such
>>> issue.
>> Well we're trying to use reserved requests for EH commands, which that goes
>> against.
>>
>>> Block layer is supposed for handling common case(normal io and pt io),
>>> I'd suggest to not put such special cases into block layer.
>> It also supports reserved commands, which I would assume would be suitable
>> for EH scenarios.
> Then you have to be careful, as I mentioned, EH has to provide forward
> progress, if you let blk-mq allocate & submit EH request, the implied
> dependency from blk-mq has to be payed attention.

OK, thanks, I know that this carries risk, but it seems right approach.

I have been thinking about my HW queue allocation requirement and maybe 
we can solve in low-level driver instead.

The requirement is to send this abort command on same queue as erroneous 
command to ensure that they do not race in HW submission, even though 
chance of this is really tiny. Maybe we can make low-level driver wait 
until erroneous command is really submitted to HW by checking HW 
register, etc. before issuing abort on any HW queue (and so would not 
need blk_mq_alloc_request_hctx() or similar).

BTW, I would still like to fix blk_mq_alloc_request_hctx() to properly 
init ->bio and other fields - ok?

Thanks,
John

Christoph Hellwig Oct. 25, 2022, 12:33 p.m. UTC | #13

On Tue, Oct 25, 2022 at 12:36:10PM +0100, John Garry wrote:
> The requirement is to send this abort command on same queue as erroneous 
> command to ensure that they do not race in HW submission, even though 
> chance of this is really tiny. Maybe we can make low-level driver wait 
> until erroneous command is really submitted to HW by checking HW register, 
> etc. before issuing abort on any HW queue (and so would not need 
> blk_mq_alloc_request_hctx() or similar).

I'm not sure this is a good idea.  I can think of all kinds of interfaces
that could have similar requirements that absolutely do make sense from
the hardware / firmware side.  So despite Ming not liking
blk_mq_alloc_request_hctx there is very little chance of it going away
and thus also very little need to avoid users as more will eventually
pop up if we want it or not.

> BTW, I would still like to fix blk_mq_alloc_request_hctx() to properly init 
> ->bio and other fields - ok?

Yes, it should behave the same blk_mq_alloc_request in that respect,
and we should just copy the assignments to bio, biotail, __sector
and __data_len from it as you did in your RFC patch.

>
> Thanks,
> John
---end quoted text---

diff mbox series

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8070b6c10e8d..260adeb2e455 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -402,6 +402,10 @@  static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		}
 	}
 
+	rq->__data_len = 0;
+	rq->__sector = (sector_t) -1;
+	rq->bio = rq->biotail = NULL;
+
 	return rq;
 }
 
@@ -591,9 +595,6 @@  struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
 		if (!rq)
 			goto out_queue_exit;
 	}
-	rq->__data_len = 0;
-	rq->__sector = (sector_t) -1;
-	rq->bio = rq->biotail = NULL;
 	return rq;
 out_queue_exit:
 	blk_queue_exit(q);