Message ID: 20221217030908.1261787-1-yukuai1@huaweicloud.com
Headers:
From: Yu Kuai <yukuai1@huaweicloud.com>
To: tj@kernel.org, hch@infradead.org, josef@toxicpanda.com, axboe@kernel.dk
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, yukuai3@huawei.com, yukuai1@huaweicloud.com, yi.zhang@huawei.com
Subject: [PATCH -next 0/4] blk-cgroup: synchronize del_gendisk() with configuring cgroup policy
Date: Sat, 17 Dec 2022 11:09:04 +0800
Message-Id: <20221217030908.1261787-1-yukuai1@huaweicloud.com>
Series: blk-cgroup: synchronize del_gendisk() with configuring cgroup policy
Message
Yu Kuai
Dec. 17, 2022, 3:09 a.m. UTC
From: Yu Kuai <yukuai3@huawei.com>
iocost is initialized the first time it is configured, and that
initialization can race with del_gendisk(), causing a null pointer
dereference:
t1                              t2
ioc_qos_write
 blk_iocost_init
  rq_qos_add
                                del_gendisk
                                 rq_qos_exit
                                 // iocost is removed from q->rq_qos
  blkcg_activate_policy
   pd_init_fn
    ioc_pd_init
     ioc = q_to_ioc(blkg->q)
     // can't find iocost and return null
And iolatency is about to switch to the same lazy initialization.

This patchset fixes the problem by synchronizing rq_qos_add() and
blkcg_activate_policy() with rq_qos_exit().
Yu Kuai (4):
block/rq_qos: protect 'q->rq_qos' with queue_lock in rq_qos_exit()
block/rq_qos: fail rq_qos_add() after del_gendisk()
blk-cgroup: add a new interface blkcg_conf_close_bdev()
blk-cgroup: synchronize del_gendisk() with configuring cgroup policy
block/blk-cgroup.c | 12 ++++++++++--
block/blk-cgroup.h | 1 +
block/blk-iocost.c | 8 ++++----
block/blk-rq-qos.c | 25 ++++++++++++++++++++-----
block/blk-rq-qos.h | 17 +++++++++++++----
include/linux/blkdev.h | 1 +
6 files changed, 49 insertions(+), 15 deletions(-)
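The approach can be sketched as a minimal user-space model (a pthread mutex stands in for queue_lock; the structs are simplified stand-ins, not the real block-layer types): rq_qos_exit() marks the queue dying under the lock, so a concurrent rq_qos_add() fails instead of re-adding a policy to a dead queue.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Toy model of q->rq_qos: a singly linked list guarded by a lock. */
struct rq_qos { struct rq_qos *next; };

struct queue {
    pthread_mutex_t lock;   /* stands in for q->queue_lock */
    struct rq_qos *rq_qos;  /* head of the policy list */
    int dying;              /* set by del_gendisk()/rq_qos_exit() */
};

/* Fails once the disk is gone, closing the race from the cover letter. */
static int rq_qos_add(struct queue *q, struct rq_qos *rqos)
{
    int ret = -1;

    pthread_mutex_lock(&q->lock);
    if (!q->dying) {
        rqos->next = q->rq_qos;
        q->rq_qos = rqos;
        ret = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return ret;
}

static void rq_qos_exit(struct queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->dying = 1;       /* any future add must fail */
    q->rq_qos = NULL;   /* tear down the list */
    pthread_mutex_unlock(&q->lock);
}
```

In the real series the check-and-insert and the teardown both run under queue_lock, which is what makes the "removed from q->rq_qos, then ioc_pd_init() finds nothing" window disappear.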
Comments
Hello,

On Sat, Dec 17, 2022 at 11:09:04AM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> iocost is initialized when it's configured the first time, and iocost
> initializing can race with del_gendisk(), which will cause null pointer
> dereference:
>
> t1                              t2
> ioc_qos_write
>  blk_iocost_init
>   rq_qos_add
>                                 del_gendisk
>                                  rq_qos_exit
>                                  // iocost is removed from q->rq_qos
>   blkcg_activate_policy
>    pd_init_fn
>     ioc_pd_init
>      ioc = q_to_ioc(blkg->q)
>      // can't find iocost and return null
>
> And iolatency is about to switch to the same lazy initialization.
>
> This patchset fix this problem by synchronize rq_qos_add() and
> blkcg_activate_policy() with rq_qos_exit().

So, the patchset seems a bit overly complicated to me. What do you think
about the following?

* These init/exit paths are super cold paths; just protecting them with a
  global mutex is probably enough. If we encounter a scalability problem,
  it's easy to fix down the line.

* If we're synchronizing this with a mutex anyway, no need to grab the
  queue_lock, right? rq_qos_add/del/exit() can all just hold the mutex.

* And we can keep the state tracking within rq_qos. When rq_qos_exit() is
  called, mark it so that future adds will fail - be that a special ->next
  value, a queue flag, or whatever.

Thanks.
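The three points above could be modeled roughly like this (a hedged user-space sketch, not kernel code: a pthread mutex plays the proposed global mutex, and a hypothetical RQ_QOS_DEAD sentinel plays the "special ->next value" — stored in the list head here for simplicity):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct rq_qos { struct rq_qos *next; };

/* Sentinel stored in the list head after rq_qos_exit(); any non-NULL,
 * never-dereferenced value works. */
#define RQ_QOS_DEAD ((struct rq_qos *)-1)

/* Init/exit are cold paths, so one global mutex is enough; no queue_lock
 * needed at all in this scheme. */
static pthread_mutex_t rq_qos_mutex = PTHREAD_MUTEX_INITIALIZER;

struct queue { struct rq_qos *rq_qos; };

static int rq_qos_add(struct queue *q, struct rq_qos *rqos)
{
    int ret = -1;

    pthread_mutex_lock(&rq_qos_mutex);
    if (q->rq_qos != RQ_QOS_DEAD) {   /* queue still alive */
        rqos->next = q->rq_qos;
        q->rq_qos = rqos;
        ret = 0;
    }
    pthread_mutex_unlock(&rq_qos_mutex);
    return ret;
}

static void rq_qos_exit(struct queue *q)
{
    pthread_mutex_lock(&rq_qos_mutex);
    q->rq_qos = RQ_QOS_DEAD;          /* all future adds fail */
    pthread_mutex_unlock(&rq_qos_mutex);
}
```

The state lives entirely inside the rq_qos list itself, so no extra queue flag is required — which is the simplification being argued for.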
Hi,

On 2022/12/20 4:55, Tejun Heo wrote:
> Hello,
>
> On Sat, Dec 17, 2022 at 11:09:04AM +0800, Yu Kuai wrote:
>> [...]
>>
>> This patchset fix this problem by synchronize rq_qos_add() and
>> blkcg_activate_policy() with rq_qos_exit().
>
> So, the patchset seems a bit overly complicated to me. What do you think
> about the following?
>
> * These init/exit paths are super cold paths; just protecting them with a
>   global mutex is probably enough. If we encounter a scalability problem,
>   it's easy to fix down the line.
>
> * If we're synchronizing this with a mutex anyway, no need to grab the
>   queue_lock, right? rq_qos_add/del/exit() can all just hold the mutex.
>
> * And we can keep the state tracking within rq_qos. When rq_qos_exit() is
>   called, mark it so that future adds will fail - be that a special ->next
>   value, a queue flag, or whatever.

Yes, that sounds good. BTW, queue_lock is also used to protect
pd_alloc_fn/pd_init_fn, and we found that blkcg_activate_policy() is
problematic:

blkcg_activate_policy
 spin_lock_irq(&q->queue_lock);
 list_for_each_entry_reverse(blkg, &q->blkg_list)
  pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, ...) -> failed

  spin_unlock_irq(&q->queue_lock);
  // Releasing queue_lock here is problematic: it can cause
  // pd_offline_fn to be called without pd_init_fn having run.
  pd_alloc_fn(__GFP_NOWARN, ...)

If we are using a mutex to protect the rq_qos ops, it seems the right
thing to also use the mutex to protect the blkcg_policy ops, and this
problem can be fixed because the mutex can be held while allocating
memory with GFP_KERNEL. What do you think?

Thanks,
Kuai
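The GFP_KERNEL point can be illustrated with a small model (hypothetical names throughout; malloc stands in for pd_alloc_fn): with a sleeping-capable lock held, the whole blkg list is allocated in one critical section, so the GFP_NOWAIT-fails-then-drop-the-lock window never opens.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct blkg { struct blkg *next; void *pd; };

/* A mutex instead of q->queue_lock (a spinlock): the allocation may
 * sleep, so the whole blkg list can be walked and populated in a single
 * critical section -- there is no drop/re-lock window during which a
 * blkg could be torn down half-initialized. */
static pthread_mutex_t blkcg_mutex = PTHREAD_MUTEX_INITIALIZER;

static int blkcg_activate_policy(struct blkg *head)
{
    struct blkg *blkg;

    pthread_mutex_lock(&blkcg_mutex);
    for (blkg = head; blkg; blkg = blkg->next) {
        /* stands in for pd_alloc_fn(GFP_KERNEL, ...) */
        blkg->pd = malloc(16);
        if (!blkg->pd) {
            pthread_mutex_unlock(&blkcg_mutex);
            return -1;
        }
    }
    pthread_mutex_unlock(&blkcg_mutex);
    return 0;
}
```

Contrast with the quoted trace: under a spinlock the code must first try GFP_NOWAIT, and on failure drop the lock to retry with a sleeping allocation, which is exactly where pd_offline_fn can sneak in ahead of pd_init_fn.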
Hello,

On Tue, Dec 20, 2022 at 05:19:12PM +0800, Yu Kuai wrote:
> Yes, that sounds good. BTW, queue_lock is also used to protect
> pd_alloc_fn/pd_init_fn, and we found that blkcg_activate_policy() is
> problematic:
>
> blkcg_activate_policy
>  spin_lock_irq(&q->queue_lock);
>  list_for_each_entry_reverse(blkg, &q->blkg_list)
>   pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, ...) -> failed
>
>   spin_unlock_irq(&q->queue_lock);
>   // Releasing queue_lock here is problematic: it can cause
>   // pd_offline_fn to be called without pd_init_fn having run.
>   pd_alloc_fn(__GFP_NOWARN, ...)

So, the problem is if a blkg is destroyed while a policy is being
activated, right?

> If we are using a mutex to protect the rq_qos ops, it seems the right
> thing to also use the mutex to protect the blkcg_policy ops, and this
> problem can be fixed because the mutex can be held while allocating
> memory with GFP_KERNEL. What do you think?

One worry is that switching to a mutex can be more of a headache due to
destroy path synchronization. Another approach would be using a per-blkg
flag to track whether a blkg has been initialized.

Thanks.
Hi,

On 2022/12/21 0:01, Tejun Heo wrote:
> Hello,
>
> On Tue, Dec 20, 2022 at 05:19:12PM +0800, Yu Kuai wrote:
>> [...]
>
> So, the problem is if a blkg is destroyed while a policy is being
> activated, right?

Yes, removing a cgroup can race with this; for bfq, a null pointer
dereference will be triggered in bfq_pd_offline().

> One worry is that switching to a mutex can be more of a headache due to
> destroy path synchronization. Another approach would be using a per-blkg
> flag to track whether a blkg has been initialized.

I think perhaps you mean a per-blkg_policy_data flag? A per-blkg flag
would not work in this case.

Thanks,
Kuai
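The per-blkg_policy_data flag being converged on could look like this in miniature (a user-space sketch with hypothetical field names): pd_offline() becomes a no-op unless pd_init() actually ran, which covers exactly the bfq_pd_offline() crash scenario.

```c
#include <assert.h>

/* Toy pd with the proposed per-blkg_policy_data tracking bit. */
struct blkg_policy_data {
    int online;   /* set by pd_init, checked before pd_offline */
    int offlined; /* records whether pd_offline's body actually ran */
};

static void pd_init(struct blkg_policy_data *pd)
{
    pd->online = 1;
}

static void pd_offline(struct blkg_policy_data *pd)
{
    /* Without this check, offline could run on a pd that was allocated
     * by blkcg_activate_policy() but never initialized -- the window
     * opened by dropping queue_lock for the sleeping allocation. */
    if (!pd->online)
        return;
    pd->online = 0;
    pd->offlined = 1;
}
```

A flag on the blkg itself would not help here because activation can fail after some policies' pds are initialized and others are not — the state is per policy per blkg, hence per blkg_policy_data.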
Hi,

On 2022/12/21 9:10, Yu Kuai wrote:
>> So, the problem is if a blkg is destroyed while a policy is being
>> activated, right?
>
> Yes, removing a cgroup can race with this; for bfq, a null pointer
> dereference will be triggered in bfq_pd_offline().

BTW, we just found that pd_online_fn() is missed in
blkcg_activate_policy()... Currently this is not a problem because only
blk-throttle implements it, and blk-throttle is activated while creating
the blkg.

Thanks,
Kuai
On Tue, Dec 20, 2022 at 05:19:12PM +0800, Yu Kuai wrote:
> If we are using a mutex to protect the rq_qos ops, it seems the right
> thing to also use the mutex to protect the blkcg_policy ops, and this
> problem can be fixed because the mutex can be held while allocating
> memory with GFP_KERNEL. What do you think?

Getting rid of the atomic allocations would be awesome. FYI, I'm also in
favor of everything that moves things out of queue_lock into more
dedicated locks. queue_lock is such an undocumented mess of untargeted
things that aren't related to each other right now that splitting and
removing it is becoming more and more important.