Message ID | 20221130132156.2836184-9-linan122@huawei.com |
---|---|
State | New |
Headers | From: Li Nan <linan122@huawei.com> To: tj@kernel.org, josef@toxicpanda.com, axboe@kernel.dk Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linan122@huawei.com, yukuai3@huawei.com, yi.zhang@huawei.com Subject: [PATCH -next v2 8/9] block: fix null-pointer dereference in ioc_pd_init Date: Wed, 30 Nov 2022 21:21:55 +0800 |
Series | iocost bugfix |
Commit Message
Li Nan
Nov. 30, 2022, 1:21 p.m. UTC
Removing a block device while iocost is initializing may cause a
null-pointer dereference:
CPU1                               CPU2
ioc_qos_write
 blkcg_conf_open_bdev
  blkdev_get_no_open
   kobject_get_unless_zero
 blk_iocost_init
  rq_qos_add
                                   del_gendisk
                                    rq_qos_exit
                                     q->rq_qos = rqos->next
                                     // iocost is removed from q->rq_qos
  blkcg_activate_policy
   pd_init_fn
    ioc_pd_init
     ioc = q_to_ioc(blkg->q)
     // cannot find iocost, returns NULL
Fix the problem by moving rq_qos_exit() to disk_release(). ioc_qos_write()
gets a reference on bd_device.kobj in blkcg_conf_open_bdev(), so
disk_release() will not be invoked until iocost initialization is complete.
Signed-off-by: Li Nan <linan122@huawei.com>
---
block/genhd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Comments
On Wed, Nov 30, 2022 at 09:21:55PM +0800, Li Nan wrote:
> Remove block device when iocost is initializing may cause
> null-pointer dereference:
>
> [...]

I think it'd be better to make del_gendisk wait for these in-flight
operations, because this might fix the above particular issue, but now all
the policies are exposed to a request_queue in a state they never expected
before.

del_gendisk() is quiescing the queue around rq_qos_exit(), so maybe we can
piggyback on that and update blkcg_conf_open_bdev() to provide such
exclusion?

Thanks.
Hi,

On 2022/12/01 4:50, Tejun Heo wrote:
> I think it'd be better to make del_gendisk wait for these in-flight
> operations because this might fix the above particular issue but now all
> the policies are exposed to request_queue in a state it never expected
> before.
>
> del_gendisk() is quiescing the queue around rq_qos_exit(), so maybe we
> can piggyback on that and update blkcg_conf_open_bdev() to provide such
> exclusion?

Letting del_gendisk wait for that sounds good, but I'm a little confused
about how to do that. The following is what I'm thinking about:

1) By mentioning that "del_gendisk() is quiescing the queue", do you
   suggest adding rcu_read_lock()? This seems wrong because
   blk_iocost_init requires memory allocation.
2) Hold the gendisk open mutex.
3) Use q_usage_counter; in the meantime, rq_qos_add() and
   blkcg_activate_policy() will need refactoring to factor out freezing
   the queue.
4) Define a new mutex.

Thanks,
Kuai

> Thanks.
On Thu, Dec 01, 2022 at 10:12:13AM +0800, Yu Kuai wrote:
> 1) By mentioning that "del_gendisk() is quiescing the queue", do you
>    suggest adding rcu_read_lock()? This seems wrong because
>    blk_iocost_init requires memory allocation.

Quiescing uses SRCU so that should be fine, but I'm not sure whether this
is the right one to piggyback on. Jens should have a better idea.

Thanks.
Hi,

On 2022/12/01 18:11, Tejun Heo wrote:
> Quiescing uses SRCU so that should be fine but I'm not sure whether this
> is the right one to piggyback on. Jens should have a better idea.
>
> Thanks.

Currently SRCU is used if BLK_MQ_F_BLOCKING is set, otherwise RCU is used.

dispatch:

#define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
do {								\
	if ((q)->tag_set->flags & BLK_MQ_F_BLOCKING) {		\
		int srcu_idx;					\
								\
		might_sleep_if(check_sleep);			\
		srcu_idx = srcu_read_lock((q)->tag_set->srcu);	\
		(dispatch_ops);					\
		srcu_read_unlock((q)->tag_set->srcu, srcu_idx);	\
	} else {						\
		rcu_read_lock();				\
		(dispatch_ops);					\
		rcu_read_unlock();				\
	}							\
} while (0)

quiesce:

void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set)
{
	if (set->flags & BLK_MQ_F_BLOCKING)
		synchronize_srcu(set->srcu);
	else
		synchronize_rcu();
}

Thanks,
Kuai
On Thu, Dec 01, 2022 at 06:23:16PM +0800, Yu Kuai wrote:
> Currently SRCU is used if BLK_MQ_F_BLOCKING is set, otherwise RCU is
> used.
>
> [...]

Oh I see. Unfortunately, I don't know what to do off the top of my head.

Thanks.
Hi, Tejun!

While reviewing the rq_qos code, I found some other possible defects:

1) queue_lock is held to protect rq_qos_add() and rq_qos_del(), while it
is not held to protect rq_qos_exit(), which is not safe because they can
be called concurrently by configuring iocost and removing the device.
I'm thinking about holding the lock to fetch the list and reset q->rq_qos
first:

diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 88f0fe7dcf54..271ad65eebd9 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -288,9 +288,15 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data,
 
 void rq_qos_exit(struct request_queue *q)
 {
-	while (q->rq_qos) {
-		struct rq_qos *rqos = q->rq_qos;
-		q->rq_qos = rqos->next;
+	struct rq_qos *rqos;
+
+	spin_lock_irq(&q->queue_lock);
+	rqos = q->rq_qos;
+	q->rq_qos = NULL;
+	spin_unlock_irq(&q->queue_lock);
+
+	while (rqos) {
 		rqos->ops->exit(rqos);
+		rqos = rqos->next;
 	}
 }

2) rq_qos_add() can still succeed after rq_qos_exit() is done, which will
cause a memory leak. Hence a check is required before adding to q->rq_qos.
I thought about the QUEUE_FLAG_DYING flag first, but that flag will not be
set if the disk state is not marked GD_OWNS_QUEUE. Since
blk_unregister_queue() is called before rq_qos_exit(), using the queue
flag QUEUE_FLAG_REGISTERED should be OK.

For the current problem that the device can be removed while initializing,
I'm thinking about some possible solutions.

Since bfq is initialized during elevator initialization, and the others
during queue initialization, this problem is only possible in iocost,
hence it makes sense to fix it in iocost:

1) Use the open mutex to prevent concurrency; however, configuring iocost
will then block other operations that rely on open_mutex.

@@ -2889,7 +2889,15 @@ static int blk_iocost_init(struct gendisk *disk)
 	if (ret)
 		goto err_free_ioc;
 
+	mutex_lock(&disk->open_mutex);
+	if (!disk_live(disk)) {
+		mutex_unlock(&disk->open_mutex);
+		ret = -ENODEV;
+		goto err_del_qos;
+	}
+
 	ret = blkcg_activate_policy(q, &blkcg_policy_iocost);
+	mutex_unlock(&disk->open_mutex);
 	if (ret)
 		goto err_del_qos;

2) Like 1), but define a new mutex just in iocost.

3) Or is it better to fix it at a higher level? For example: add a new
restriction that blkcg_deactivate_policy() should be called in pairs with
blkcg_activate_policy(), and blkcg_deactivate_policy() will wait for
blkcg_activate_policy() to finish. Something like:

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ef4fef1af909..6266f702157f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1410,7 +1410,7 @@ int blkcg_activate_policy(struct request_queue *q,
 	struct blkcg_gq *blkg, *pinned_blkg = NULL;
 	int ret;
 
-	if (blkcg_policy_enabled(q, pol))
+	if (WARN_ON_ONCE(blkcg_policy_enabled(q, pol)))
 		return 0;
 
 	if (queue_is_mq(q))
@@ -1477,6 +1477,8 @@ int blkcg_activate_policy(struct request_queue *q,
 	blkg_put(pinned_blkg);
 	if (pd_prealloc)
 		pol->pd_free_fn(pd_prealloc);
+	if (!ret)
+		wake_up(q->policy_waitq);
 	return ret;
 
 enomem:
@@ -1512,7 +1514,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
 	struct blkcg_gq *blkg;
 
 	if (!blkcg_policy_enabled(q, pol))
-		return;
+		wait_event(q->policy_waitq, blkcg_policy_enabled(q, pol));
Hello,

On Mon, Dec 05, 2022 at 05:32:17PM +0800, Yu Kuai wrote:
> 1) queue_lock is held to protect rq_qos_add() and rq_qos_del(), while
> it is not held to protect rq_qos_exit(), which is not safe because they
> can be called concurrently by configuring iocost and removing the
> device.
>
> [patch snipped]
>
> 2) rq_qos_add() can still succeed after rq_qos_exit() is done, which
> will cause a memory leak. Hence a check is required before adding to
> q->rq_qos.
>
> [...]
>
> Since bfq is initialized during elevator initialization, and the others
> during queue initialization, this problem is only possible in iocost,
> hence it makes sense to fix it in iocost:

So, iolatency is likely to switch to a similar lazy init scheme, so we'd
better fix it in rq_qos / the core block layer.

...

> 3) Or is it better to fix it at a higher level? For example: add a new
> restriction that blkcg_deactivate_policy() should be called in pairs
> with blkcg_activate_policy(), and blkcg_deactivate_policy() will wait
> for blkcg_activate_policy() to finish.
>
> [patch snipped]

Yeah, along this line, but hopefully something simpler like a mutex.

Thanks.
diff --git a/block/genhd.c b/block/genhd.c
index dcf200bcbd3e..0db440bbfefb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -656,7 +656,6 @@ void del_gendisk(struct gendisk *disk)
 		elevator_exit(q);
 		mutex_unlock(&q->sysfs_lock);
 	}
-	rq_qos_exit(q);
 	blk_mq_unquiesce_queue(q);
 
 	/*
@@ -1168,6 +1167,7 @@ static void disk_release(struct device *dev)
 	    !test_bit(GD_ADDED, &disk->state))
 		blk_mq_exit_queue(disk->queue);
 
+	rq_qos_exit(disk->queue);
 	blkcg_exit_disk(disk);
 
 	bioset_exit(&disk->bio_split);