Message ID: <20221227125502.541931-2-yukuai1@huaweicloud.com>
State: New
From: Yu Kuai <yukuai1@huaweicloud.com>
To: tj@kernel.org, hch@infradead.org, josef@toxicpanda.com, axboe@kernel.dk
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, yukuai3@huawei.com, yukuai1@huaweicloud.com, yi.zhang@huawei.com
Subject: [PATCH v2 1/2] blk-iocost: add refcounting for iocg
Date: Tue, 27 Dec 2022 20:55:01 +0800
In-Reply-To: <20221227125502.541931-1-yukuai1@huaweicloud.com>
References: <20221227125502.541931-1-yukuai1@huaweicloud.com>
Series: blk-iocost: add refcounting for iocg and ioc
Commit Message
Yu Kuai
Dec. 27, 2022, 12:55 p.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

iocost requires that a child iocg must exit before its parent iocg,
otherwise the kernel might crash in ioc_timer_fn(). However, currently
iocg is exited in pd_free_fn(), which can't guarantee such order:

1) removing a cgroup can be concurrent with deactivating the policy;
2) blkg_free() triggered by removing a cgroup is asynchronous, so
   removing a child cgroup can be concurrent with removing the parent
   cgroup.

Fix the problem by adding refcounting for iocg: a child iocg grabs a
reference on its parent iocg, so the parent iocg waits for all child
iocgs to exit.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-iocost.c | 74 +++++++++++++++++++++++++++++++---------------
 1 file changed, 50 insertions(+), 24 deletions(-)
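The lifetime rule the patch enforces can be sketched outside the kernel as a toy reference count: each child pins its parent, so the parent's teardown can only run after the last child's. This is a minimal user-space sketch; `node`, `node_new()` and `node_put()` are illustrative names, not kernel APIs.

```c
#include <stdlib.h>

struct node {
	int refs;
	struct node *parent;
	int exited;                 /* stand-in for "teardown has run" */
};

static struct node *node_new(struct node *parent)
{
	struct node *n = calloc(1, sizeof(*n));

	n->refs = 1;
	n->parent = parent;
	if (parent)
		parent->refs++;     /* child grabs a reference on its parent */
	return n;
}

static void node_put(struct node *n)
{
	if (--n->refs)
		return;
	n->exited = 1;              /* real code would free n here */
	if (n->parent)
		node_put(n->parent); /* parent exits only after last child */
}
```

Dropping the parent's own reference while a child is still alive leaves the parent pinned; only the final child's `node_put()` cascades the parent's teardown, mirroring how `iocg_put()` in the patch defers the parent's exit.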
Comments
On Tue, Dec 27, 2022 at 08:55:01PM +0800, Yu Kuai wrote:
> iocost requires that child iocg must exit before parent iocg, otherwise
> kernel might crash in ioc_timer_fn(). However, currently iocg is exited
> in pd_free_fn(), which can't guarantee such order:
>
> 1) remove cgroup can concurrent with deactivate policy;
> 2) blkg_free() triggered by remove cgroup is asynchronously, remove
>    child cgroup can concurrent with remove parent cgroup;
>
> Fix the problem by add refcounting for iocg, and child iocg will grab
> reference of parent iocg, so that parent iocg will wait for all child
> iocg to be exited.

Wouldn't it be better to do this refcnting in the blk-cgroup core code
rather than in blk-iocost?

Thanks.
Hi,

On 2023/01/05 5:44, Tejun Heo wrote:
> Wouldn't it be better to do this refcnting in the blk-cgroup core code
> rather than in blk-iocost?

The problem is that I can't find a proper way to fix the race where
pd_free_fn() can be called from different contexts:

1) from blkg_free(), which is called asynchronously from removing a cgroup;
2) from blkcg_deactivate_policy(), which is called from removing a device.

1) is related to blkg, while 2) is not, hence refcnting from blkg can't
fix the problem. Refcnting from blkcg_policy_data should be ok, but I see
that bfq already has similar refcnting, while other policies don't
require such refcnting.

Any suggestions?

Thanks,
Kuai
On Thu, Jan 05, 2023 at 09:14:07AM +0800, Yu Kuai wrote:
> 1) is related to blkg, while 2) is not, hence refcnting from blkg can't
> fix the problem. refcnting from blkcg_policy_data should be ok, but I
> see that bfq already has the similar refcnting, while other policy
> doesn't require such refcnting.

Hmm... taking a step back, wouldn't this be solved by moving the first
part of ioc_pd_free() to pd_offline_fn()? The ordering is strictly
defined there, right?

Thanks.

--
tejun
Hi,

On 2023/01/06 2:32, Tejun Heo wrote:
> Hmm... taking a step back, wouldn't this be solved by moving the first part
> of ioc_pd_free() to pd_offline_fn()? The ordering is strictly defined there,
> right?

Moving the first part to pd_offline_fn() has some requirements, like what
I did in the other thread:

An iocg can be activated again after pd_offline_fn(), which is possible
because bio can be dispatched while the cgroup is being removed. I tried
to avoid that by:

1) dispatching all throttled bios in ioc_pd_offline();
2) not throttling bios after ioc_pd_offline().

However, you already disagreed with that. 😔

Thanks,
Kuai
On Fri, Jan 06, 2023 at 09:08:45AM +0800, Yu Kuai wrote:
> Moving first part to pd_offline_fn() has some requirements, like what I
> did in the other thread:
>
> iocg can be activated again after pd_offline_fn(), which is possible
> because bio can be dispatched when cgroup is removed. I tried to avoid
> that by:
>
> 1) dispatch all throttled bio io ioc_pd_offline()
> 2) don't throttle bio after ioc_pd_offline()
>
> However, you already disagreed with that. 😔

Okay, I was completely wrong while I was replying to your original patch.
Should have looked at the code closer, my apologies.

What I missed is that pd_offline doesn't happen when the cgroup goes
offline. Please take a look at the following two commits:

59b57717fff8 ("blkcg: delay blkg destruction until after writeback has finished")
d866dbf61787 ("blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it")

After the above two commits, ->pd_offline_fn() is called only after all
possible writebacks are complete, so it shouldn't allow mass escapes to
root. With writebacks out of the picture, it might be that there can be no
further IOs once ->pd_offline_fn() is called too as there can be no tasks
left in it and no dirty pages, but best to confirm that.

So, yeah, the original approach you took should work although I'm not sure
the patches that you added to make offline blkg bypass are necessary
(that also contributed to my assumption that there will be more IOs on
those blkgs). Have you seen more IOs coming down the pipeline after
offline? If so, can you dump some backtraces and see where they're coming
from?

Thanks.
Hi,

On 2023/01/07 4:18, Tejun Heo wrote:
> What I missed is that pd_offline doesn't happen when the cgroup goes
> offline. Please take a look at the following two commits:
>
> 59b57717fff8 ("blkcg: delay blkg destruction until after writeback has finished")
> d866dbf61787 ("blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it")

These two commits have been applied for three years. I haven't checked the
details yet, but they don't seem to guarantee that no io will be handled
by rq_qos_throttle() after pd_offline_fn(), because I just reproduced this
in another problem:

f02be9002c48 ("block, bfq: fix null pointer dereference in bfq_bio_bfqg()")

A user thread can issue async io, and the io can be throttled by
blk-throttle (not writeback); then the user thread can exit and the cgroup
can be removed before such io is dispatched to rq_qos_throttle().

> So, yeah, the original approach you took should work although I'm not sure
> the patches that you added to make offline blkg to bypass are necessary
> (that also contributed to my assumption that there will be more IOs on those
> blkg's). Have you seen more IOs coming down the pipeline after offline? If
> so, can you dump some backtraces and see where they're coming from?

Currently I'm sure such IOs can come from blk-throttle, and I'm not sure
yet, but I also suspect io_uring can do this.

Thanks,
Kuai
Hello,

On Mon, Jan 09, 2023 at 09:32:46AM +0800, Yu Kuai wrote:
> These two commits are applied for three years, I don't check the details
> yet but they seem can't guarantee that no io will be handled by
> rq_qos_throttle() after pd_offline_fn(), because I just reproduced this
> in another problem:
>
> f02be9002c48 ("block, bfq: fix null pointer dereference in bfq_bio_bfqg()")

I see.

> Currently I'm sure such IOs can come from blk-throttle, and I'm not sure
> yet but I also suspect io_uring can do this.

Yeah, that's unfortunate. There are several options here:

1. Do what you originally suggested - bypass to root after offline. I feel
   uneasy about this. Both iolatency and throtl clear their configs on
   offline but that's punting to the parent. For iocost it'd be bypassing
   all controls, which can actually be exploited.

2. Make all possible IO issuers use blkcg_[un]pin_online() and shift the
   iocost shutdown to pd_offline_fn(). This likely is the most canonical
   solution given the current situation but it's kinda nasty to add
   another layer of refcnting all over the place.

3. Order blkg free so that parents are never freed before children. You
   did this by adding refcnts in iocost but shouldn't it be possible to
   simply shift blkg_put(blkg->parent) in __blkg_release() to
   blkg_free_workfn()?

#3 seems the most logical to me. What do you think?

Thanks.
Hi,

On 2023/01/10 2:23, Tejun Heo wrote:
> 3. Order blkg free so that parents are never freed before children. You did
>    this by adding refcnts in iocost but shouldn't it be possible to simply
>    shift blkg_put(blkg->parent) in __blkg_release() to blkg_free_workfn()?

As I tried to explain before, we can make sure blkg_free() is called in
order, but blkg_free() from removing a cgroup can be concurrent with
deactivating a policy, and we can't guarantee the order of ioc_pd_free(),
which is called both from blkg_free() and blkcg_deactivate_policy().
Hence I don't think #3 is possible.

I personally prefer #1. I don't see any real use case affected by the
defect that you described, and actually in cgroup v1 blk-throtl is
bypassed to no limit as well.

I'm not sure about #2; that sounds like a possible solution, but I'm not
quite familiar with the implementation here.

Considering that bfq already has such refcounting for bfqg, perhaps
similar refcounting is acceptable?

Thanks,
Kuai
Hello,

On Tue, Jan 10, 2023 at 09:39:44AM +0800, Yu Kuai wrote:
> As I tried to explain before, we can make sure blkg_free() is called
> in order, but blkg_free() from remove cgroup can concurrent with
> deactivate policy, and we can't guarantee the order of ioc_pd_free()
> that is called both from blkg_free() and blkcg_deactivate_policy().
> Hence I don't think #3 is possible.

Hahaha, sorry that I keep forgetting that. This doesn't really feel like
that important or difficult part of the problem tho. Can't it be solved by
synchronizing the blkg free work item against the deactivate path with a
mutex?

Thanks.
Hi,

On 2023/01/11 2:36, Tejun Heo wrote:
> Hahaha, sorry that I keep forgetting that. This doesn't really feel like
> that important or difficult part of the problem tho. Can't it be solved by
> synchronizing blkg free work item against the deactivate path with a mutex?

I'm not sure. Of course this can fix the problem, but two spinlocks,
'blkcg->lock' and 'q->queue_lock', are used to protect blkg_destroy()
currently; adding a mutex (disk level?) requires a refactor, which seems
complex to me.

Thanks,
Kuai
Hello,

On Wed, Jan 11, 2023 at 09:36:25AM +0800, Yu Kuai wrote:
> I'm not sure, of course this can fix the problem, but two spinlock
> 'blkcg->lock' and 'q->queue_lock' are used to protect blkg_destroy()
> currently, add a mutex(disk level?) requires a refactor, which seems
> complex to me.

The fact that the two paths can race each other already seems buggy. e.g.
What prevents them from running pd_free on the same pd twice? So, it needs
to be fixed anyway and the intention always has been that these callbacks
are called in the correct traversal order.

Thanks.
Hi,

On 2023/01/12 1:07, Tejun Heo wrote:
> The fact that the two paths can race each other already seems buggy. e.g.
> What prevents them from running pd_free on the same pd twice? So, it needs
> to be fixed anyway and the intention always has been that these callbacks
> are called in the correct traversal order.

I think the root cause is that blkg is tracked from two different lists:
blkcg->blkg_list at the cgroup level and q->blkg_list at the disk level.
And pd_free_fn() is also called both from blkg_destroy() and from
deactivating a policy for a disk.

I just thought about another solution:

Remove blkcg_deactivate_policy() from rq_qos_exit() when deleting the
device, and delay the policy cleanup and free to blkg_destroy_all(). Then
the policies (other than bfq) can only call pd_free_fn() from
blkg_destroy(), and it's easy to guarantee the order. For bfq, it can stay
the same, since bfq has refcounting itself.

Then for the problem that ioc can be freed in pd_free_fn(), we can fix it
by freeing ioc in ioc_pd_free() for the root blkg instead of
rq_qos_exit().

What do you think?

Thanks,
Kuai
Hello,

On Thu, Jan 12, 2023 at 02:18:15PM +0800, Yu Kuai wrote:
> remove the blkcg_deactivate_policy() from rq_qos_exit() from deleting
> the device, and delay the policy cleanup and free to blkg_destroy_all().
> Then the policies(other than bfq) can only call pd_free_fn() from
> blkg_destroy(), and it's easy to guarantee the order. For bfq, it can
> stay the same since bfq has refcounting itself.

That would remove the ability to dynamically remove an rq_qos policy,
right? We don't currently do it but given that having an rq_qos registered
comes with perf overhead, it's something we might want to do in the
future - e.g. only activate the policy when the controller is actually
enabled. So, idk. What's wrong with synchronizing the two removal paths?
blkcg policies are combinations of cgroups and block device
configurations, so having exit paths from both sides is kinda natural.

Thanks.
Hi,

On 2023/01/13 8:53, Tejun Heo wrote:
> That would remove the ability to dynamically remove an rq_qos policy, right?
> We don't currently do it but given that having an rq_qos registered comes
> with perf overhead, it's something we might want to do in the future - e.g.
> only activate the policy when the controller is actually enabled.

Yes, that makes sense - removing ioc and other policies dynamically.

> So, idk.
> What's wrong with synchronizing the two removal paths? blkcg policies are
> combinations of cgroups and block device configurations, so having exit
> paths from both sides is kinda natural.

I still can't figure out how to synchronize them with a mutex. Maybe I'm
being foolish...

Thanks,
Kuai
Hello,

On Fri, Jan 13, 2023 at 09:10:25AM +0800, Yu Kuai wrote:
> I still can't figure out how to synchronizing them will a mutex. Maybe
> I'm being foolish...

Hmm... can't you just use e.g. a per-bdev mutex which is grabbed by both
blkg_free_workfn() and blkcg_deactivate_policy()?

Thanks.
Hi,

On 2023/01/13 9:15, Tejun Heo wrote:
> Hmm... can't you just use e.g. per-bdev mutex which is grabbed by both
> blkg_free_workfn() and blkcg_deactivate_policy()?

I think holding the lock in blkg_free_workfn() is too late: pd_free_fn()
for the parent can be called first from blkcg_deactivate_policy().

t1: remove cgroup t1/t2
  blkcg_destroy_blkgs
    blkg_destroy
      percpu_ref_kill(&blkg->refcnt)
        blkg_release
          blkg_free
            schedule_work(&blkg->free_work)
            // t1 is done

t2: handle t1 from removing device
  blkcg_deactivate_policy
    pd_free_fn
    // free parent

t3: from t1
  blkg_free_workfn
    pd_free_fn
    // free child

Thanks,
Kuai
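The t1/t2/t3 interleaving above can be replayed deterministically as a user-space sketch (all names illustrative, not the kernel's): if the lock is taken only inside the deferred work item, the deactivate path still frees the parent's pd before the work item frees the child's, reproducing the ordering violation.

```c
static const char *freed[2];    /* records the order teardown ran in */
static int nfreed;

static void pd_free(const char *who)
{
	freed[nfreed++] = who;  /* stand-in for the real pd_free_fn() */
}

/* Replay of the interleaving: t1 only *schedules* the child's free
 * work, t2 (deactivate) frees the parent immediately, and t3 runs the
 * deferred work afterwards. */
static void replay(void)
{
	int child_work_pending = 1; /* t1: schedule_work(&blkg->free_work) */

	pd_free("parent");          /* t2: blkcg_deactivate_policy() */
	if (child_work_pending)
		pd_free("child");   /* t3: blkg_free_workfn(), too late */
}
```

Running the replay shows the parent torn down before its child - exactly the order iocost cannot tolerate - which is why a lock taken only inside the work item cannot help.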
Hello,

On Fri, Jan 13, 2023 at 09:25:11AM +0800, Yu Kuai wrote:
> I think hold the lock in blkg_free_workfn() is too late, pd_free_fn()
> for parent from blkcg_deactivate_policy() can be called first.

As we discussed before, you'd have to order the actual freeing by shifting
the ref puts into the free_work. If you move `blkg_put(blkg->parent)` and
`list_del_init(&blkg->q_node)` to blkg_free_workfn() (this will require
adjustments as these things are used from other places too), the free work
items will be ordered and the blkg would remain iterable - IOW,
deactivate_policy would be able to see it, allowing the two paths to
synchronize, right?

Thanks.
Hi,

On 2023/01/14 1:16, Tejun Heo wrote:
> As we discussed before, you'd have to order the actual freeing by shifting
> the ref puts into the free_work. If you move `blkg_put(blkg->parent)` and
> `list_del_init(&blkg->q_node)` to blkg_free_workfn() (this will require
> adjustments as these things are used from other places too), the free work
> items will be ordered and the blkg would remain iterable - IOW,
> deactivate_policy would be able to see it allowing the two paths to
> synchronize, right?

That sounds reasonable: only remove blkg from the queue list once
pd_free_fn() is done. This way deactivate_policy will be able to see it,
and if deactivate_policy is called first, pd_free_fn() can be called
there, and later blkg_free_workfn() should skip pd_free_fn().

I'm glad that we finally came up with a suitable solution. 😃

BTW, it might take some time before I send a new patchset, because Spring
Festival is coming.

Thanks,
Kuai
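The agreed scheme can be sketched as a per-device mutex plus a per-pd flag (all names illustrative, not the eventual kernel implementation): whichever of blkcg_deactivate_policy() or blkg_free_workfn() gets there first runs the teardown, and the other sees the flag and skips it, so pd_free_fn() runs exactly once.

```c
#include <pthread.h>
#include <stdbool.h>

struct pd {
	bool freed;                 /* set once teardown has run */
};

static pthread_mutex_t disk_mutex = PTHREAD_MUTEX_INITIALIZER;
static int pd_free_calls;           /* counts real teardown invocations */

/* Called from both removal paths; the mutex serializes them and the
 * per-pd flag makes the teardown idempotent. */
static void pd_free_once(struct pd *pd)
{
	pthread_mutex_lock(&disk_mutex);
	if (!pd->freed) {
		pd->freed = true;
		pd_free_calls++;    /* stand-in for pd_free_fn(pd) */
	}
	pthread_mutex_unlock(&disk_mutex);
}
```

Under this sketch, calling `pd_free_once()` from both paths - in either order - performs the real teardown exactly once, which answers the "what prevents them from running pd_free on the same pd twice?" concern raised earlier in the thread.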
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 7a0d754b9eb2..525e93e1175a 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -461,6 +461,8 @@ struct ioc_gq {
 	struct blkg_policy_data pd;
 	struct ioc *ioc;
 
+	refcount_t ref;
+
 	/*
 	 * A iocg can get its weight from two sources - an explicit
 	 * per-device-cgroup configuration or the default weight of the
@@ -2943,9 +2945,53 @@ static struct blkg_policy_data *ioc_pd_alloc(gfp_t gfp, struct request_queue *q,
 		return NULL;
 	}
 
+	refcount_set(&iocg->ref, 1);
 	return &iocg->pd;
 }
 
+static void iocg_get(struct ioc_gq *iocg)
+{
+	refcount_inc(&iocg->ref);
+}
+
+static void iocg_put(struct ioc_gq *iocg)
+{
+	struct ioc *ioc = iocg->ioc;
+	unsigned long flags;
+	struct ioc_gq *parent = NULL;
+
+	if (!refcount_dec_and_test(&iocg->ref))
+		return;
+
+	if (iocg->level > 0)
+		parent = iocg->ancestors[iocg->level - 1];
+
+	if (ioc) {
+		spin_lock_irqsave(&ioc->lock, flags);
+
+		if (!list_empty(&iocg->active_list)) {
+			struct ioc_now now;
+
+			ioc_now(ioc, &now);
+			propagate_weights(iocg, 0, 0, false, &now);
+			list_del_init(&iocg->active_list);
+		}
+
+		WARN_ON_ONCE(!list_empty(&iocg->walk_list));
+		WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
+
+		spin_unlock_irqrestore(&ioc->lock, flags);
+
+		hrtimer_cancel(&iocg->waitq_timer);
+	}
+
+	free_percpu(iocg->pcpu_stat);
+	kfree(iocg);
+
+	if (parent)
+		iocg_put(parent);
+}
+
 static void ioc_pd_init(struct blkg_policy_data *pd)
 {
 	struct ioc_gq *iocg = pd_to_iocg(pd);
@@ -2973,6 +3019,9 @@ static void ioc_pd_init(struct blkg_policy_data *pd)
 
 	iocg->level = blkg->blkcg->css.cgroup->level;
 
+	if (blkg->parent)
+		iocg_get(blkg_to_iocg(blkg->parent));
+
 	for (tblkg = blkg; tblkg; tblkg = tblkg->parent) {
 		struct ioc_gq *tiocg = blkg_to_iocg(tblkg);
 		iocg->ancestors[tiocg->level] = tiocg;
@@ -2985,30 +3034,7 @@ static void ioc_pd_init(struct blkg_policy_data *pd)
 
 static void ioc_pd_free(struct blkg_policy_data *pd)
 {
-	struct ioc_gq *iocg = pd_to_iocg(pd);
-	struct ioc *ioc = iocg->ioc;
-	unsigned long flags;
-
-	if (ioc) {
-		spin_lock_irqsave(&ioc->lock, flags);
-
-		if (!list_empty(&iocg->active_list)) {
-			struct ioc_now now;
-
-			ioc_now(ioc, &now);
-			propagate_weights(iocg, 0, 0, false, &now);
-			list_del_init(&iocg->active_list);
-		}
-
-		WARN_ON_ONCE(!list_empty(&iocg->walk_list));
-		WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
-
-		spin_unlock_irqrestore(&ioc->lock, flags);
-
-		hrtimer_cancel(&iocg->waitq_timer);
-	}
-
-	free_percpu(iocg->pcpu_stat);
-	kfree(iocg);
+	iocg_put(pd_to_iocg(pd));
 }
 
 static void ioc_pd_stat(struct blkg_policy_data *pd, struct seq_file *s)