Message ID | 1666454846-11749-1-git-send-email-john.garry@huawei.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:4242:0:0:0:0:0 with SMTP id s2csp1257674wrr; Sat, 22 Oct 2022 08:47:16 -0700 (PDT) X-Google-Smtp-Source: AMsMyM6+NUZoK1v3hQA3k2TYMYgPepQi9jLpuh8zXddi8Jd6Z5Ko+Q4HykACyV3obS59q4A+qnIa X-Received: by 2002:a17:907:1b22:b0:741:8809:b4e6 with SMTP id mp34-20020a1709071b2200b007418809b4e6mr20614843ejc.84.1666453636254; Sat, 22 Oct 2022 08:47:16 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1666453636; cv=none; d=google.com; s=arc-20160816; b=NX2g6Q4A0TS7DbKaceS+9fRqrIXLCU8xmi4MeW9ca5sfLxBw/KUErc+fX5JTQjXI/0 U5CWLQxSRv4Aefh9BeWbXQcLRBGK57Uo0C3rasb4V+jsjm/hbpejb5tc7CrogaLTXXY/ w+cokO4pidV/cir53nXJ9oygeSsrm3t9BszhDCYmhw1DIyIal2CNSBtE3PuEYjxxAatP MgQk2HhA1PrimDLkvPM3VVfcVWEu77NCLREfAqJuH68gCGaLpgUEUqYHXARZAWo1cOSU mQs2q0I90LqxfuoGl34L0L+jj5N7E5Tazn51y3dKaBj7kvUXE0IUffmmRf45pzbsjSsx On4w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:message-id:date:subject:cc:to:from; bh=W2US2sanTHCvEvekg5xOsEe4hlPklqoA7akEgbfkwXk=; b=TkvvKJfJARPoPHNoLXWemlDn3f1BAofTSztIzvvlV0rt6VPzfXoXuKM8jYTgh1Mfuk Fr3qApmif8GYZ0gTv4bOV22iptYLc4/Zj7lmk03eVFDsDVEdWxxyK9SQYPRA9CZHGyJO +iyBXBfWjIr4Ob1BWHPUdfJ8d86L7tQbyYROuA6+z4tjtQRSBnVVYbF36EWgp8ImCGa7 yCfwX3PYI68U193YzCgbxo/f0P2++5lICwxyjBQ4GcpCO2cwz7v0vrkC+PP9g1LzrqOb CSlfMd74Xf1vqCttIGZfPFd2z4azH1Ooo1zCctGcpK07pLgB78q4hcsNFa9/0bdeimdI +QDQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id xf13-20020a17090731cd00b00788361f96a2si25721138ejb.776.2022.10.22.08.46.51; Sat, 22 Oct 2022 08:47:16 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229770AbiJVPgy (ORCPT <rfc822;pwkd43@gmail.com> + 99 others); Sat, 22 Oct 2022 11:36:54 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34282 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229571AbiJVPgx (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Sat, 22 Oct 2022 11:36:53 -0400 Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0D33927173; Sat, 22 Oct 2022 08:36:52 -0700 (PDT) Received: from fraeml736-chm.china.huawei.com (unknown [172.18.147.201]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4MvljN1C7qz67Q9H; Sat, 22 Oct 2022 23:33:28 +0800 (CST) Received: from lhrpeml500003.china.huawei.com (7.191.162.67) by fraeml736-chm.china.huawei.com (10.206.15.217) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.31; Sat, 22 Oct 2022 17:36:49 +0200 Received: from localhost.localdomain (10.69.192.58) by lhrpeml500003.china.huawei.com (7.191.162.67) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.31; Sat, 22 Oct 2022 16:36:47 +0100 From: John Garry <john.garry@huawei.com> To: <axboe@kernel.dk> CC: <linux-kernel@vger.kernel.org>, <linux-block@vger.kernel.org>, <hch@lst.de>, John Garry <john.garry@huawei.com> Subject: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx() Date: Sun, 23 Oct 2022 00:07:26 +0800 Message-ID: <1666454846-11749-1-git-send-email-john.garry@huawei.com> X-Mailer: git-send-email 2.8.1 MIME-Version: 1.0 Content-Type: text/plain X-Originating-IP: [10.69.192.58] X-ClientProxiedBy: dggems702-chm.china.huawei.com (10.3.19.179) To lhrpeml500003.china.huawei.com (7.191.162.67) X-CFilter-Loop: Reflected X-Spam-Status: No, score=-4.2 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_MED, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1747403288326704216?= X-GMAIL-MSGID: =?utf-8?q?1747403288326704216?= |
Series |
blk-mq: Properly init bios from blk_mq_alloc_request_hctx()
|
|
Commit Message
John Garry
Oct. 22, 2022, 4:07 p.m. UTC
Function blk_mq_alloc_request_hctx() is missing zeroing/init of rq->bio,
biotail, __sector, and __data_len members, which blk_mq_alloc_request()
has.
Move init'ing of those members to common blk_mq_rq_ctx_init().
Fixes: 1f5bd336b9150 ("blk-mq: add blk_mq_alloc_request_hctx")
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: John Garry <john.garry@huawei.com>
Comments
On Sun, Oct 23, 2022 at 12:07:26AM +0800, John Garry wrote: > Function blk_mq_alloc_request_hctx() is missing zeroing/init of rq->bio, > biotail, __sector, and __data_len members, which blk_mq_alloc_request() > has. > > Move init'ing of those members to common blk_mq_rq_ctx_init(). > > Fixes: 1f5bd336b9150 ("blk-mq: add blk_mq_alloc_request_hctx") > Suggested-by: Bart Van Assche <bvanassche@acm.org> > Signed-off-by: John Garry <john.garry@huawei.com> > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 8070b6c10e8d..260adeb2e455 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -402,6 +402,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, > } > } > > + rq->__data_len = 0; > + rq->__sector = (sector_t) -1; > + rq->bio = rq->biotail = NULL; > + > return rq; > } > > @@ -591,9 +595,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf, > if (!rq) > goto out_queue_exit; > } > - rq->__data_len = 0; > - rq->__sector = (sector_t) -1; > - rq->bio = rq->biotail = NULL; This patch looks not good, why do you switch to initialize the three fields twice in fast path? BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please avoid to extend it to other use cases. Thanks, Ming
On 23/10/2022 14:12, Ming Lei wrote: >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index 8070b6c10e8d..260adeb2e455 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -402,6 +402,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, >> } >> } >> >> + rq->__data_len = 0; >> + rq->__sector = (sector_t) -1; >> + rq->bio = rq->biotail = NULL; >> + >> return rq; >> } >> >> @@ -591,9 +595,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf, >> if (!rq) >> goto out_queue_exit; >> } >> - rq->__data_len = 0; >> - rq->__sector = (sector_t) -1; >> - rq->bio = rq->biotail = NULL; > This patch looks not good, why do you switch to initialize the three fields > twice in fast path? Can you please show me how these are initialized twice? If there is a real concern with this then we go with my original idea, which was to copy the init method of blk_mq_alloc_request() (in blk_mq_alloc_request_hctx()) > > BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please > avoid to extend it to other use cases. Yeah, I know this, but sometimes we just need to allocate for a specific HW queue... For my usecase of interest, it should not impact if the cpumask of the HW queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(), so any race is ok ... I think. However it should be still possible to make blk_mq_alloc_request_hctx() more robust. How about using something like work_on_cpu_safe() to allocate and execute the request with blk_mq_alloc_request() on a cpu associated with the HW queue, such that we know the cpu is online and stays online until we execute it? Or also extent to work_on_cpumask_safe() variant, so that we don't need to try all cpus in the mask (to see if online)? Thanks, John
On Mon, Oct 24, 2022 at 11:56:21AM +0100, John Garry wrote: > On 23/10/2022 14:12, Ming Lei wrote: > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > > index 8070b6c10e8d..260adeb2e455 100644 > > > --- a/block/blk-mq.c > > > +++ b/block/blk-mq.c > > > @@ -402,6 +402,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, > > > } > > > } > > > + rq->__data_len = 0; > > > + rq->__sector = (sector_t) -1; > > > + rq->bio = rq->biotail = NULL; > > > + > > > return rq; > > > } > > > @@ -591,9 +595,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf, > > > if (!rq) > > > goto out_queue_exit; > > > } > > > - rq->__data_len = 0; > > > - rq->__sector = (sector_t) -1; > > > - rq->bio = rq->biotail = NULL; > > This patch looks not good, why do you switch to initialize the three fields > > twice in fast path? > > Can you please show me how these are initialized twice? blk_mq_bio_to_request() is one which setup these fields, then you add another one in blk_mq_rq_ctx_init(). > > If there is a real concern with this then we go with my original idea, which > was to copy the init method of blk_mq_alloc_request() (in > blk_mq_alloc_request_hctx()) > > > > > BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please > > avoid to extend it to other use cases. > > Yeah, I know this, Did you know the exact issue on nvme-tcp, nvme-rdma or nvme-fc maybe with blk_mq_alloc_request_hctx()? > but sometimes we just need to allocate for a specific HW > queue... > > For my usecase of interest, it should not impact if the cpumask of the HW > queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(), > so any race is ok ... I think. > > However it should be still possible to make blk_mq_alloc_request_hctx() more > robust. How about using something like work_on_cpu_safe() to allocate and > execute the request with blk_mq_alloc_request() on a cpu associated with the > HW queue, such that we know the cpu is online and stays online until we > execute it? Or also extent to work_on_cpumask_safe() variant, so that we > don't need to try all cpus in the mask (to see if online)? But all cpus on this hctx->cpumask could become offline. Thanks, Ming
On 24/10/2022 14:27, Ming Lei wrote: >>>> - rq->bio = rq->biotail = NULL; >>> This patch looks not good, why do you switch to initialize the three fields >>> twice in fast path? >> Can you please show me how these are initialized twice? > blk_mq_bio_to_request() is one which setup these fields, then you add > another one in blk_mq_rq_ctx_init(). ok, understood. > >> If there is a real concern with this then we go with my original idea, which >> was to copy the init method of blk_mq_alloc_request() (in >> blk_mq_alloc_request_hctx()) >> >>> BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please >>> avoid to extend it to other use cases. >> Yeah, I know this, > Did you know the exact issue on nvme-tcp, nvme-rdma or nvme-fc maybe > with blk_mq_alloc_request_hctx()? I thought that the original issue was an OoO bounds issue, fixed in 14dc7a18. Now there is still some issue in the following link, which is still unresolved as I understand: https://lore.kernel.org/linux-block/5bd886f1-a7c6-b765-da29-777be0328bc2@grimberg.me/#t But I think that 14dc7a18 may still leave undesirable scenario: - all cpus in HW queue cpumask may go offline after cpu_online_mask read in blk_mq_alloc_request_hctx() and before we get the driver tag and set rq->hctx > >> but sometimes we just need to allocate for a specific HW >> queue... >> >> For my usecase of interest, it should not impact if the cpumask of the HW >> queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(), >> so any race is ok ... I think. >> >> However it should be still possible to make blk_mq_alloc_request_hctx() more >> robust. How about using something like work_on_cpu_safe() to allocate and >> execute the request with blk_mq_alloc_request() on a cpu associated with the >> HW queue, such that we know the cpu is online and stays online until we >> execute it? Or also extent to work_on_cpumask_safe() variant, so that we >> don't need to try all cpus in the mask (to see if online)? > But all cpus on this hctx->cpumask could become offline. If all hctx->cpumask are offline then we should not allocate a request and this is acceptable. Maybe I am missing your point. Thanks, John
On Mon, Oct 24, 2022 at 05:56:15PM +0100, John Garry wrote: > On 24/10/2022 14:27, Ming Lei wrote: > > > > > - rq->bio = rq->biotail = NULL; > > > > This patch looks not good, why do you switch to initialize the three fields > > > > twice in fast path? > > > Can you please show me how these are initialized twice? > > blk_mq_bio_to_request() is one which setup these fields, then you add > > another one in blk_mq_rq_ctx_init(). > > ok, understood. > > > > > > If there is a real concern with this then we go with my original idea, which > > > was to copy the init method of blk_mq_alloc_request() (in > > > blk_mq_alloc_request_hctx()) > > > > > > > BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please > > > > avoid to extend it to other use cases. > > > Yeah, I know this, > > Did you know the exact issue on nvme-tcp, nvme-rdma or nvme-fc maybe > > with blk_mq_alloc_request_hctx()? > > I thought that the original issue was an OoO bounds issue, fixed in > 14dc7a18. Now there is still some issue in the following link, which is > still unresolved as I understand: > https://lore.kernel.org/linux-block/5bd886f1-a7c6-b765-da29-777be0328bc2@grimberg.me/#t > > But I think that 14dc7a18 may still leave undesirable scenario: > - all cpus in HW queue cpumask may go offline after cpu_online_mask read in > blk_mq_alloc_request_hctx() and before we get the driver tag and set > rq->hctx Yeah. > > > > > > but sometimes we just need to allocate for a specific HW > > > queue... > > > > > > For my usecase of interest, it should not impact if the cpumask of the HW > > > queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(), > > > so any race is ok ... I think. > > > > > > However it should be still possible to make blk_mq_alloc_request_hctx() more > > > robust. How about using something like work_on_cpu_safe() to allocate and > > > execute the request with blk_mq_alloc_request() on a cpu associated with the > > > HW queue, such that we know the cpu is online and stays online until we > > > execute it? Or also extent to work_on_cpumask_safe() variant, so that we > > > don't need to try all cpus in the mask (to see if online)? > > But all cpus on this hctx->cpumask could become offline. > > If all hctx->cpumask are offline then we should not allocate a request and > this is acceptable. Maybe I am missing your point. As you saw, this API has the above problem too, but any one of CPUs may become online later, maybe just during blk_mq_alloc_request_hctx(), and it is easy to cause inconsistence. You didn't share your use case, but for nvme connection request, if it is 1:1 mapping, if any one of CPU becomes offline, the controller initialization could be failed, that isn't good from user viewpoint at all. Thanks, Ming
On 25/10/2022 01:34, Ming Lei wrote: >>>> but sometimes we just need to allocate for a specific HW >>>> queue... >>>> >>>> For my usecase of interest, it should not impact if the cpumask of the HW >>>> queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(), >>>> so any race is ok ... I think. >>>> >>>> However it should be still possible to make blk_mq_alloc_request_hctx() more >>>> robust. How about using something like work_on_cpu_safe() to allocate and >>>> execute the request with blk_mq_alloc_request() on a cpu associated with the >>>> HW queue, such that we know the cpu is online and stays online until we >>>> execute it? Or also extent to work_on_cpumask_safe() variant, so that we >>>> don't need to try all cpus in the mask (to see if online)? >>> But all cpus on this hctx->cpumask could become offline. >> If all hctx->cpumask are offline then we should not allocate a request and >> this is acceptable. Maybe I am missing your point. > As you saw, this API has the above problem too, but any one of CPUs > may become online later, maybe just during blk_mq_alloc_request_hctx(), > and it is easy to cause inconsistence. > > You didn't share your use case, but for nvme connection request, if it > is 1:1 mapping, if any one of CPU becomes offline, the controller > initialization could be failed, that isn't good from user viewpoint at > all. My use case is in SCSI EH domain. For my HBA controller of interest, to abort an erroneous IO we must send a controller proprietary abort command on same HW queue as original command. So we would need to allocate this abort request for a specific HW queue. I mentioned before that if no hctx->cpumask is online then we don't need to allocate a request. That is because if no hctx->cpumask is online, this means that original erroneous IO must be completed due to nature of how blk-mq cpu hotplug handler works, i.e. drained, and then we don't actually need to abort it any longer, so ok to not get a request. I have an RFC series for this work in which I am using blk_mq_alloc_request_hctx(). However, as I mentioned before, I can experiment with using something like work_on_cpu_safe() to alloc and execute the abort request to safeguard against cpu hotplug events. Thanks, John
On Tue, Oct 25, 2022 at 3:40 PM John Garry <john.garry@huawei.com> wrote: > > On 25/10/2022 01:34, Ming Lei wrote: > >>>> but sometimes we just need to allocate for a specific HW > >>>> queue... > >>>> > >>>> For my usecase of interest, it should not impact if the cpumask of the HW > >>>> queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(), > >>>> so any race is ok ... I think. > >>>> > >>>> However it should be still possible to make blk_mq_alloc_request_hctx() more > >>>> robust. How about using something like work_on_cpu_safe() to allocate and > >>>> execute the request with blk_mq_alloc_request() on a cpu associated with the > >>>> HW queue, such that we know the cpu is online and stays online until we > >>>> execute it? Or also extent to work_on_cpumask_safe() variant, so that we > >>>> don't need to try all cpus in the mask (to see if online)? > >>> But all cpus on this hctx->cpumask could become offline. > >> If all hctx->cpumask are offline then we should not allocate a request and > >> this is acceptable. Maybe I am missing your point. > > As you saw, this API has the above problem too, but any one of CPUs > > may become online later, maybe just during blk_mq_alloc_request_hctx(), > > and it is easy to cause inconsistence. > > > > You didn't share your use case, but for nvme connection request, if it > > is 1:1 mapping, if any one of CPU becomes offline, the controller > > initialization could be failed, that isn't good from user viewpoint at > > all. > > My use case is in SCSI EH domain. For my HBA controller of interest, to > abort an erroneous IO we must send a controller proprietary abort > command on same HW queue as original command. So we would need to > allocate this abort request for a specific HW queue. IMO, it is one bad hw/sw interface. First such request has to be reserved, since all inflight IOs can be in error. Second error handling needs to provide forward-progress, and it is supposed to not require external dependency, otherwise easy to cause deadlock, but here request from specific HW queue just depends on this queue's cpumask. Also if it has to be reserved, it can be done as one device/driver private command, so why bother blk-mq for this special use case? > > I mentioned before that if no hctx->cpumask is online then we don't need > to allocate a request. That is because if no hctx->cpumask is online, > this means that original erroneous IO must be completed due to nature of > how blk-mq cpu hotplug handler works, i.e. drained, and then we don't > actually need to abort it any longer, so ok to not get a request. No, it is really not OK, if all cpus in hctx->cpumask are offline, you can't allocate request on the specified hw queue, then the erroneous IO can't be handled, then cpu hotplug handler may hang for ever. Thanks, Ming
On 25/10/2022 10:00, Ming Lei wrote: >> My use case is in SCSI EH domain. For my HBA controller of interest, to >> abort an erroneous IO we must send a controller proprietary abort >> command on same HW queue as original command. So we would need to >> allocate this abort request for a specific HW queue. > IMO, it is one bad hw/sw interface. > > First such request has to be reserved, since all inflight IOs can be in error. Right > > Second error handling needs to provide forward-progress, and it is supposed > to not require external dependency, otherwise easy to cause deadlock, but > here request from specific HW queue just depends on this queue's cpumask. > > Also if it has to be reserved, it can be done as one device/driver private > command, so why bother blk-mq for this special use case? I have a series for reserved request support, which I will send later. Please have a look. And as I mentioned, I would prob not end up using blk_mq_alloc_request_hctx() anyway. > >> I mentioned before that if no hctx->cpumask is online then we don't need >> to allocate a request. That is because if no hctx->cpumask is online, >> this means that original erroneous IO must be completed due to nature of >> how blk-mq cpu hotplug handler works, i.e. drained, and then we don't >> actually need to abort it any longer, so ok to not get a request. > No, it is really not OK, if all cpus in hctx->cpumask are offline, you > can't allocate > request on the specified hw queue, then the erroneous IO can't be handled, > then cpu hotplug handler may hang for ever. If the erroneous IO is still in-flight from blk-mq perspective, then how can hctx->cpumask still be offline? I thought that we guarantee that hctx->cpumask cannot go offline until drained. Thanks, John
On Tue, Oct 25, 2022 at 10:08:10AM +0100, John Garry wrote: > On 25/10/2022 10:00, Ming Lei wrote: > > > My use case is in SCSI EH domain. For my HBA controller of interest, to > > > abort an erroneous IO we must send a controller proprietary abort > > > command on same HW queue as original command. So we would need to > > > allocate this abort request for a specific HW queue. > > IMO, it is one bad hw/sw interface. > > > > First such request has to be reserved, since all inflight IOs can be in error. > > Right > > > > > Second error handling needs to provide forward-progress, and it is supposed > > to not require external dependency, otherwise easy to cause deadlock, but > > here request from specific HW queue just depends on this queue's cpumask. > > > > Also if it has to be reserved, it can be done as one device/driver private > > command, so why bother blk-mq for this special use case? > > I have a series for reserved request support, which I will send later. > Please have a look. And as I mentioned, I would prob not end up using > blk_mq_alloc_request_hctx() anyway. > > > > > > I mentioned before that if no hctx->cpumask is online then we don't need > > > to allocate a request. That is because if no hctx->cpumask is online, > > > this means that original erroneous IO must be completed due to nature of > > > how blk-mq cpu hotplug handler works, i.e. drained, and then we don't > > > actually need to abort it any longer, so ok to not get a request. > > No, it is really not OK, if all cpus in hctx->cpumask are offline, you > > can't allocate > > request on the specified hw queue, then the erroneous IO can't be handled, > > then cpu hotplug handler may hang for ever. > > If the erroneous IO is still in-flight from blk-mq perspective, then how can > hctx->cpumask still be offline? I thought that we guarantee that > hctx->cpumask cannot go offline until drained. Yeah, the draining is done before the cpu is offline. But the drain is simply waiting for the inflight IO to be completed. If the IO is failed during the waiting, you can't allocate such reserved request for error handling, then hang ever in blk_mq_hctx_notify_offline(). If you just make it one driver private command, there can't be such issue. Block layer is supposed for handling common case(normal io and pt io), I'd suggest to not put such special cases into block layer. thanks, Ming
On 25/10/2022 10:16, Ming Lei wrote: >>>> I mentioned before that if no hctx->cpumask is online then we don't need >>>> to allocate a request. That is because if no hctx->cpumask is online, >>>> this means that original erroneous IO must be completed due to nature of >>>> how blk-mq cpu hotplug handler works, i.e. drained, and then we don't >>>> actually need to abort it any longer, so ok to not get a request. >>> No, it is really not OK, if all cpus in hctx->cpumask are offline, you >>> can't allocate >>> request on the specified hw queue, then the erroneous IO can't be handled, >>> then cpu hotplug handler may hang for ever. >> If the erroneous IO is still in-flight from blk-mq perspective, then how can >> hctx->cpumask still be offline? I thought that we guarantee that >> hctx->cpumask cannot go offline until drained. > Yeah, the draining is done before the cpu is offline. But the drain is > simply waiting for the inflight IO to be completed. If the IO is failed > during the waiting, you can't allocate such reserved request for error > handling, then hang ever in blk_mq_hctx_notify_offline(). Actually if final cpu in hctx->cpumask is going offline, then hctx won't queue any more requests, right? In this case I don't think we can queue on that hctx anyway. I need to think about this more. > > If you just make it one driver private command, there can't be such > issue. Well we're trying to use reserved requests for EH commands, which that goes against. > Block layer is supposed for handling common case(normal io and pt io), > I'd suggest to not put such special cases into block layer. It also supports reserved commands, which I would assume would be suitable for EH scenarios. Thanks, John
On Tue, Oct 25, 2022 at 10:32:28AM +0100, John Garry wrote: > On 25/10/2022 10:16, Ming Lei wrote: > > > > > I mentioned before that if no hctx->cpumask is online then we don't need > > > > > to allocate a request. That is because if no hctx->cpumask is online, > > > > > this means that original erroneous IO must be completed due to nature of > > > > > how blk-mq cpu hotplug handler works, i.e. drained, and then we don't > > > > > actually need to abort it any longer, so ok to not get a request. > > > > No, it is really not OK, if all cpus in hctx->cpumask are offline, you > > > > can't allocate > > > > request on the specified hw queue, then the erroneous IO can't be handled, > > > > then cpu hotplug handler may hang for ever. > > > If the erroneous IO is still in-flight from blk-mq perspective, then how can > > > hctx->cpumask still be offline? I thought that we guarantee that > > > hctx->cpumask cannot go offline until drained. > > Yeah, the draining is done before the cpu is offline. But the drain is > > simply waiting for the inflight IO to be completed. If the IO is failed > > during the waiting, you can't allocate such reserved request for error > > handling, then hang ever in blk_mq_hctx_notify_offline(). > > Actually if final cpu in hctx->cpumask is going offline, then hctx won't > queue any more requests, right? In this case I don't think we can queue on > that hctx anyway. I need to think about this more. It can be queued actually, but interrupt may not be delivered if managed irq is used. > > > > > If you just make it one driver private command, there can't be such > > issue. > > Well we're trying to use reserved requests for EH commands, which that goes > against. > > > Block layer is supposed for handling common case(normal io and pt io), > > I'd suggest to not put such special cases into block layer. > > It also supports reserved commands, which I would assume would be suitable > for EH scenarios. Then you have to be careful, as I mentioned, EH has to provide forward progress, if you let blk-mq allocate & submit EH request, the implied dependency from blk-mq has to be payed attention. Thanks, Ming
On 25/10/2022 12:21, Ming Lei wrote: >> Actually if final cpu in hctx->cpumask is going offline, then hctx won't >> queue any more requests, right? In this case I don't think we can queue on >> that hctx anyway. I need to think about this more. > It can be queued actually, but interrupt may not be delivered if managed > irq is used. Yes, I think it will be queued elsewhere. I would need to check the code again. > >>> If you just make it one driver private command, there can't be such >>> issue. >> Well we're trying to use reserved requests for EH commands, which that goes >> against. >> >>> Block layer is supposed for handling common case(normal io and pt io), >>> I'd suggest to not put such special cases into block layer. >> It also supports reserved commands, which I would assume would be suitable >> for EH scenarios. > Then you have to be careful, as I mentioned, EH has to provide forward > progress, if you let blk-mq allocate & submit EH request, the implied > dependency from blk-mq has to be payed attention. OK, thanks, I know that this carries risk, but it seems right approach. I have been thinking about my HW queue allocation requirement and maybe we can solve in low-level driver instead. The requirement is to send this abort command on same queue as erroneous command to ensure that they do not race in HW submission, even though chance of this is really tiny. Maybe we can make low-level driver wait until erroneous command is really submitted to HW by checking HW register, etc. before issuing abort on any HW queue (and so would not need blk_mq_alloc_request_hctx() or similar). BTW, I would still like to fix blk_mq_alloc_request_hctx() to properly init ->bio and other fields - ok? Thanks, John
On Tue, Oct 25, 2022 at 12:36:10PM +0100, John Garry wrote: > The requirement is to send this abort command on same queue as erroneous > command to ensure that they do not race in HW submission, even though > chance of this is really tiny. Maybe we can make low-level driver wait > until erroneous command is really submitted to HW by checking HW register, > etc. before issuing abort on any HW queue (and so would not need > blk_mq_alloc_request_hctx() or similar). I'm not sure this is a good idea. I can think of all kinds of interfaces that could have similar requirements that absolutely do make sense from the hardware / firmware side. So despite Ming not liking blk_mq_alloc_request_hctx there is very little chance of it going away and thus also very little need to avoid users as more will eventually pop up if we want it or not. > BTW, I would still like to fix blk_mq_alloc_request_hctx() to properly init > ->bio and other fields - ok? Yes, it should behave the same blk_mq_alloc_request in that respect, and we should just copy the assignments to bio, biotail, __sector and __data_len from it as you did in your RFC patch. > > Thanks, > John ---end quoted text---
diff --git a/block/blk-mq.c b/block/blk-mq.c index 8070b6c10e8d..260adeb2e455 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -402,6 +402,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, } } + rq->__data_len = 0; + rq->__sector = (sector_t) -1; + rq->bio = rq->biotail = NULL; + return rq; } @@ -591,9 +595,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf, if (!rq) goto out_queue_exit; } - rq->__data_len = 0; - rq->__sector = (sector_t) -1; - rq->bio = rq->biotail = NULL; return rq; out_queue_exit: blk_queue_exit(q);