Message ID | 20230104085354.2343590-4-yukuai1@huaweicloud.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:4e01:0:0:0:0:0 with SMTP id p1csp5031978wrt; Wed, 4 Jan 2023 00:32:29 -0800 (PST) X-Google-Smtp-Source: AMrXdXst6/oWk5q/FIj7SNzG5ZCO5JNrFDVmWJoJncTw84J8A8h9ca5Tgq3zQfpkgYU8vIQIpHMl X-Received: by 2002:a17:90a:6aca:b0:226:646e:597 with SMTP id b10-20020a17090a6aca00b00226646e0597mr11464068pjm.2.1672821148859; Wed, 04 Jan 2023 00:32:28 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1672821148; cv=none; d=google.com; s=arc-20160816; b=UtKP/dy0i8YPLvgklWclxRuSnNpyy7AIhoqOVX9arktt2puNqdesJ2ObE9U8LU16a9 GyTMQYIQhPQ3IsiUAzw9bRPj1e/y0IHV6GjodTZ7nmOhYgV3YypU1+89RNuGqJwrV/Ye M5GIttPYHmNrZTx23zio7jX4u1q+jYKX+B/KQR9gkKXNvM6Y6x8RpniSh0NSfEyx88T+ +WKFdY4Y179vYLZeFTH8Vm1M2sYiuCMDxeolvFfomxXSXJjUguuJ7Ba6qEAk8XPLrQbU miWyGJeYn2QkMLv2IL1MvMK7aIURf9J5eqXWQV9lllO5lvxIF/XWaj+BLSsJOihg82AD HDRA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=BwEcAQpHvxc5yt6vAsTK74I9anTVeJoiibF7einDG8Y=; b=z9twcnjvOWR8VAUEsjuq3EL2kPdYk8Xn+C2QHXbbCBz0EfV+rAHUJmJKHI6a6pDKEo eHlByy5JXVE2scKSCPc3+kBZkYMuwkq4D+OFauEd1O0u5/HObMrto2PB28C/jXp4T7Xh PTf9gLwe2v1FGaCadr8lWAmQzNr32AsaRgYYKBhMGI8hDYTC+v9AwjZWh7L+0dx1jk3o O+Cd9mp/YdiV94zaeq3pMe1llkEYus6DafXOBAFv9dy+VCRMIC4EezxrWcUqG3axoNnY /bn0WnA2EbyZkEbJ8QA5jcYsSoCaQM8mYugUY+tNEaxOJq/69V0wkFbHtVrtvF9xesaa HwHw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id lk6-20020a17090b33c600b00212def0cbedsi13830230pjb.8.2023.01.04.00.32.17; Wed, 04 Jan 2023 00:32:28 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234103AbjADI3u (ORCPT <rfc822;tmhikaru@gmail.com> + 99 others); Wed, 4 Jan 2023 03:29:50 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59508 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234041AbjADI3Z (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 4 Jan 2023 03:29:25 -0500 Received: from dggsgout11.his.huawei.com (dggsgout11.his.huawei.com [45.249.212.51]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CE96E192BA; Wed, 4 Jan 2023 00:29:24 -0800 (PST) Received: from mail02.huawei.com (unknown [172.30.67.169]) by dggsgout11.his.huawei.com (SkyGuard) with ESMTP id 4Nn2nq0Qy6z4f42p3; Wed, 4 Jan 2023 16:29:19 +0800 (CST) Received: from huaweicloud.com (unknown [10.175.127.227]) by APP2 (Coremail) with SMTP id Syh0CgBnW+ndOLVjwys+BA--.58806S7; Wed, 04 Jan 2023 16:29:21 +0800 (CST) From: Yu Kuai <yukuai1@huaweicloud.com> To: tj@kernel.org, hch@infradead.org, josef@toxicpanda.com, axboe@kernel.dk Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, yukuai3@huawei.com, yukuai1@huaweicloud.com, yi.zhang@huawei.com, yangerkun@huawei.com Subject: [PATCH -next 3/4] block/rq_qos: use a global mutex to protect rq_qos apis Date: Wed, 4 Jan 2023 16:53:53 +0800 Message-Id: <20230104085354.2343590-4-yukuai1@huaweicloud.com> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20230104085354.2343590-1-yukuai1@huaweicloud.com> References: <20230104085354.2343590-1-yukuai1@huaweicloud.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-CM-TRANSID: Syh0CgBnW+ndOLVjwys+BA--.58806S7 X-Coremail-Antispam: 1UD129KBjvJXoWxAryftw1DGrW7ZFy8Cw1Utrb_yoWrGr4Dpa 1fKrWay3yqgrs7Was8Gw48ZasxKw1rKryUJF4xJrWfArWvvF1jvF10yFyxWFWrZF9rJrs3 Ary5trs5Ar1UXwUanT9S1TB71UUUUUUqnTZGkaVYY2UrUUUUjbIjqfuFe4nvWSU5nxnvy2 9KBjDU0xBIdaVrnRJUUUBE14x267AKxVWrJVCq3wAFc2x0x2IEx4CE42xK8VAvwI8IcIk0 rVWrJVCq3wAFIxvE14AKwVWUJVWUGwA2048vs2IY020E87I2jVAFwI0_JrWl82xGYIkIc2 x26xkF7I0E14v26ryj6s0DM28lY4IEw2IIxxk0rwA2F7IY1VAKz4vEj48ve4kI8wA2z4x0 Y4vE2Ix0cI8IcVAFwI0_tr0E3s1l84ACjcxK6xIIjxv20xvEc7CjxVAFwI0_Gr1j6F4UJw A2z4x0Y4vEx4A2jsIE14v26rxl6s0DM28EF7xvwVC2z280aVCY1x0267AKxVW0oVCq3wAS 0I0E0xvYzxvE52x082IY62kv0487Mc02F40EFcxC0VAKzVAqx4xG6I80ewAv7VC0I7IYx2 IY67AKxVWUJVWUGwAv7VC2z280aVAFwI0_Jr0_Gr1lOx8S6xCaFVCjc4AY6r1j6r4UM4x0 Y48IcxkI7VAKI48JM4x0x7Aq67IIx4CEVc8vx2IErcIFxwACI402YVCY1x02628vn2kIc2 xKxwCF04k20xvY0x0EwIxGrwCFx2IqxVCFs4IE7xkEbVWUJVW8JwC20s026c02F40E14v2 6r1j6r18MI8I3I0E7480Y4vE14v26r106r1rMI8E67AF67kF1VAFwI0_Jw0_GFylIxkGc2 Ij64vIr41lIxAIcVC0I7IYx2IY67AKxVWUJVWUCwCI42IY6xIIjxv20xvEc7CjxVAFwI0_ Gr0_Cr1lIxAIcVCF04k26cxKx2IYs7xG6r1j6r1xMIIF0xvEx4A2jsIE14v26r1j6r4UMI IF0xvEx4A2jsIEc7CjxVAFwI0_Gr0_Gr1UYxBIdaVFxhVjvjDU0xZFpf9x0JUd8n5UUUUU = X-CM-SenderInfo: 51xn3trlr6x35dzhxuhorxvhhfrp/ X-CFilter-Loop: Reflected X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,SPF_HELO_NONE, SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1754080109267219897?= X-GMAIL-MSGID: =?utf-8?q?1754080109267219897?= |
Series |
block/rq_qos: protect rq_qos apis with global mutex
|
|
Commit Message
Yu Kuai
Jan. 4, 2023, 8:53 a.m. UTC
From: Yu Kuai <yukuai3@huawei.com> This patch fix following problems: 1) rq_qos_add() and rq_qos_del() is protected, while rq_qos_exit() is not. 2) rq_qos_add() and blkcg_activate_policy() is not atomic, if rq_qos_exit() is done before blkcg_activate_policy(), null-ptr-deference can be triggered. rq_qos_add(), rq_qos_del() and rq_qos_exit() are super cold path, hence fix the problems by using a global mutex to protect them. Signed-off-by: Yu Kuai <yukuai3@huawei.com> --- block/blk-rq-qos.c | 50 ++++++++++++++++++++++++++++++---------------- 1 file changed, 33 insertions(+), 17 deletions(-)
Comments
On Wed, Jan 04, 2023 at 04:53:53PM +0800, Yu Kuai wrote: > From: Yu Kuai <yukuai3@huawei.com> > > This patch fix following problems: > > 1) rq_qos_add() and rq_qos_del() is protected, while rq_qos_exit() is > not. This part makes sense. > 2) rq_qos_add() and blkcg_activate_policy() is not atomic, if > rq_qos_exit() is done before blkcg_activate_policy(), > null-ptr-deference can be triggered. I'm not sure this part does. I think it'd be better to guarantee that device destruction is blocked while these configuration operations are in progress which can be built into blkg_conf helpers. Thanks.
Hello, again. On Wed, Jan 04, 2023 at 11:39:47AM -1000, Tejun Heo wrote: > > 2) rq_qos_add() and blkcg_activate_policy() is not atomic, if > > rq_qos_exit() is done before blkcg_activate_policy(), > > null-ptr-deference can be triggered. > > I'm not sure this part does. I think it'd be better to guarantee that device > destruction is blocked while these configuration operations are in progress > which can be built into blkg_conf helpers. A bit more explanation: Usually, this would be handled in the core - when a device goes away, its sysfs files get shut down before stuff gets freed and the sysfs file removal waits for in-flight operations to finish and prevents new ones from starting, so we don't have to worry about in-flight config file operations racing against device removal. Here, the problem isn't solved by that because the config files live on cgroupfs and their lifetimes are not coupled with the block devices'. So, we need to synchronize manually. And, given that, the right place to do is the blkg config helpers cuz they're the ones which establish the connection between cgroup and block layer. Can you please take a look at the following patchset I just posted: https://lkml.kernel.org/r/20230105002007.157497-1-tj@kernel.org After that, all these configuration operations are wrapped between blkg_conf_init() and blkg_conf_exit() which probably are the right place to implement the synchronization. Thanks.
Hi, 在 2023/01/05 8:28, Tejun Heo 写道: > Hello, again. > > On Wed, Jan 04, 2023 at 11:39:47AM -1000, Tejun Heo wrote: >>> 2) rq_qos_add() and blkcg_activate_policy() is not atomic, if >>> rq_qos_exit() is done before blkcg_activate_policy(), >>> null-ptr-deference can be triggered. >> >> I'm not sure this part does. I think it'd be better to guarantee that device >> destruction is blocked while these configuration operations are in progress >> which can be built into blkg_conf helpers. > > A bit more explanation: > > Usually, this would be handled in the core - when a device goes away, its > sysfs files get shut down before stuff gets freed and the sysfs file removal > waits for in-flight operations to finish and prevents new ones from > starting, so we don't have to worry about in-flight config file operations > racing against device removal. > > Here, the problem isn't solved by that because the config files live on > cgroupfs and their lifetimes are not coupled with the block devices'. So, we > need to synchronize manually. And, given that, the right place to do is the > blkg config helpers cuz they're the ones which establish the connection > between cgroup and block layer. Thanks for the explanation, I agree with that. > > Can you please take a look at the following patchset I just posted: > > https://lkml.kernel.org/r/20230105002007.157497-1-tj@kernel.org > > After that, all these configuration operations are wrapped between > blkg_conf_init() and blkg_conf_exit() which probably are the right place to > implement the synchronization. I see that, blkg_conf_init() and blkg_conf_exit() is good, however there are some details I want to confirm: 1) rq_qos_add() can be called from iocost/iolatency, where blkg_conf_init() will be called first, while rq_qos_add() can also be called from wbt, where there is no blkg_conf_init(). Hence it seems to me we need two locks here, one to protect rq_qos apis; one to synchronize policy configuration and device removal. 2) If you agree with 1), it seems better to use the other lock in device level, consider that there is no need to synchronize confituration for different devices. Thanks, Kuai > > Thanks. >
Hello, On Thu, Jan 05, 2023 at 09:35:21AM +0800, Yu Kuai wrote: > > Can you please take a look at the following patchset I just posted: > > > > https://lkml.kernel.org/r/20230105002007.157497-1-tj@kernel.org > > > > After that, all these configuration operations are wrapped between > > blkg_conf_init() and blkg_conf_exit() which probably are the right place to > > implement the synchronization. > > I see that, blkg_conf_init() and blkg_conf_exit() is good, however there > are some details I want to confirm: > > 1) rq_qos_add() can be called from iocost/iolatency, where > blkg_conf_init() will be called first, while rq_qos_add() can also be > called from wbt, where there is no blkg_conf_init(). Hence it seems to > me we need two locks here, one to protect rq_qos apis; one to > synchronize policy configuration and device removal. wbt's lazy init is tied to one of the block device sysfs files, right? So, it *should* already be protected against device removal. Thanks.
Hi, 在 2023/01/06 2:34, Tejun Heo 写道: > Hello, > > On Thu, Jan 05, 2023 at 09:35:21AM +0800, Yu Kuai wrote: >>> Can you please take a look at the following patchset I just posted: >>> >>> https://lkml.kernel.org/r/20230105002007.157497-1-tj@kernel.org >>> >>> After that, all these configuration operations are wrapped between >>> blkg_conf_init() and blkg_conf_exit() which probably are the right place to >>> implement the synchronization. >> >> I see that, blkg_conf_init() and blkg_conf_exit() is good, however there >> are some details I want to confirm: >> >> 1) rq_qos_add() can be called from iocost/iolatency, where >> blkg_conf_init() will be called first, while rq_qos_add() can also be >> called from wbt, where there is no blkg_conf_init(). Hence it seems to >> me we need two locks here, one to protect rq_qos apis; one to >> synchronize policy configuration and device removal. > > wbt's lazy init is tied to one of the block device sysfs files, right? So, > it *should* already be protected against device removal. That seems not true, I don't think q->sysfs_lock can protect that, consider that queue_wb_lat_store() doesn't check if del_gendisk() is called or not: t1: wbt lazy init t2: remove device queue_attr_store del_gendisk blk_unregister_queue mutex_lock(&q->sysfs_lock) ... mutex_unlock(&q->sysfs_lock); rq_qos_exit mutex_lock(&q->sysfs_lock); queue_wb_lat_store wbt_init rq_qos_add mutex_unlock(&q->sysfs_lock); I tried to comfirm that by adding following delay: diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 93d9e9c9a6ea..101c33cb0a2b 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -11,6 +11,7 @@ #include <linux/blktrace_api.h> #include <linux/blk-mq.h> #include <linux/debugfs.h> +#include <linux/delay.h> #include "blk.h" #include "blk-mq.h" @@ -734,6 +735,8 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr, if (!entry->store) return -EIO; + msleep(10000); + mutex_lock(&q->sysfs_lock); res = entry->store(q, page, length); mutex_unlock(&q->sysfs_lock); And then do the following test: 1) echo 10000 > /sys/block/sdb/queue/wbt_lat_usec & 2) echo 1 > /sys/block/sda/device/delete Then, following bug is triggered: [ 51.923642] BUG: unable to handle page fault for address: ffffffffffffffed [ 51.924294] #PF: supervisor read access in kernel mode [ 51.924773] #PF: error_code(0x0000) - not-present page [ 51.925252] PGD 1820b067 P4D 1820b067 PUD 1820d067 PMD 0 [ 51.925754] Oops: 0000 [#1] PREEMPT SMP [ 51.926123] CPU: 1 PID: 539 Comm: bash Tainted: G W 6.2.0-rc1-next-202212267 [ 51.927124] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-b4 [ 51.928334] RIP: 0010:__rq_qos_issue+0x30/0x60 [ 51.928761] Code: 48 89 f5 53 48 89 fb 48 83 05 eb eb c9 0b 01 eb 19 48 89 df 48 83 05 e6 e9 [ 51.930426] RSP: 0018:ffffc90000c3b9b0 EFLAGS: 00010206 [ 51.930905] RAX: 0000000000000000 RBX: ffffffffffffffed RCX: 0000000000000017 [ 51.931554] RDX: 000007c329800000 RSI: ffff8881022c0380 RDI: ffffffffffffffed [ 51.932197] RBP: ffff8881022c0380 R08: 0000000c385056e3 R09: ffff8881022c05c8 [ 51.932841] R10: 0000000000000000 R11: ffff888100a94000 R12: ffff888102145000 [ 51.933488] R13: 0000000000000000 R14: ffff888100a94000 R15: ffff8881022c04a0 [ 51.934140] FS: 00007fd23def9700(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 [ 51.934856] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 51.935379] CR2: ffffffffffffffed CR3: 0000000106fff000 CR4: 00000000000006e0 [ 51.936036] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 51.936675] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 51.937315] Call Trace: [ 51.937545] <TASK> [ 51.937749] blk_mq_start_request+0x1d1/0x240 [ 51.938151] scsi_queue_rq+0x347/0x1190 [ 51.938513] blk_mq_dispatch_rq_list+0x366/0xef0 [ 51.938938] ? tick_nohz_tick_stopped+0x1a/0x40 [ 51.939356] ? __irq_work_queue_local+0x59/0xd0 [ 51.939769] ? __sbitmap_get_word+0x3b/0xb0 [ 51.940153] __blk_mq_sched_dispatch_requests+0xdd/0x210 [ 51.940633] blk_mq_sched_dispatch_requests+0x38/0xa0 [ 51.941089] __blk_mq_run_hw_queue+0xca/0x110 [ 51.941483] __blk_mq_delay_run_hw_queue+0x1fc/0x210 [ 51.941931] blk_mq_run_hw_queue+0x15c/0x1d0 [ 51.942327] blk_mq_sched_insert_request+0x9c/0x210 [ 51.942769] blk_execute_rq+0xec/0x290 [ 51.943121] __scsi_execute+0x131/0x310 [ 51.943492] sd_sync_cache+0xc6/0x280 [ 51.943831] sd_shutdown+0x7f/0x180 [ 51.944155] sd_remove+0x53/0x80 [ 51.944457] device_remove+0x80/0xa0 [ 51.944785] device_release_driver_internal+0x131/0x270 [ 51.945257] device_release_driver+0x16/0x20 [ 51.945643] bus_remove_device+0x135/0x200 Thanks, Kuai > > Thanks. >
Hello, On Fri, Jan 06, 2023 at 09:33:26AM +0800, Yu Kuai wrote: > > wbt's lazy init is tied to one of the block device sysfs files, right? So, > > it *should* already be protected against device removal. > > That seems not true, I don't think q->sysfs_lock can protect that, > consider that queue_wb_lat_store() doesn't check if del_gendisk() is > called or not: > > t1: wbt lazy init t2: remove device > queue_attr_store > del_gendisk > blk_unregister_queue > mutex_lock(&q->sysfs_lock) > ... > mutex_unlock(&q->sysfs_lock); > rq_qos_exit > mutex_lock(&q->sysfs_lock); > queue_wb_lat_store > wbt_init > rq_qos_add > mutex_unlock(&q->sysfs_lock); So, it's not sysfs_lock but sysfs file deletion. When a kernfs, which backs sysfs, file is removed, it disables future operations and drains all inflight ones before returning, so you remove the interface files before cleaning up the object that it interacts with, you don't have to worry about racing against file operations as none can be in flight at that point. > I tried to comfirm that by adding following delay: > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index 93d9e9c9a6ea..101c33cb0a2b 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -11,6 +11,7 @@ > #include <linux/blktrace_api.h> > #include <linux/blk-mq.h> > #include <linux/debugfs.h> > +#include <linux/delay.h> > > #include "blk.h" > #include "blk-mq.h" > @@ -734,6 +735,8 @@ queue_attr_store(struct kobject *kobj, struct attribute > *attr, > if (!entry->store) > return -EIO; > > + msleep(10000); > + > mutex_lock(&q->sysfs_lock); > res = entry->store(q, page, length); > mutex_unlock(&q->sysfs_lock); > > And then do the following test: > > 1) echo 10000 > /sys/block/sdb/queue/wbt_lat_usec & > 2) echo 1 > /sys/block/sda/device/delete > > Then, following bug is triggered: > > [ 51.923642] BUG: unable to handle page fault for address: > ffffffffffffffed > [ 51.924294] #PF: supervisor read access in kernel mode > [ 51.924773] #PF: error_code(0x0000) - not-present page > [ 51.925252] PGD 1820b067 P4D 1820b067 PUD 1820d067 PMD 0 > [ 51.925754] Oops: 0000 [#1] PREEMPT SMP > [ 51.926123] CPU: 1 PID: 539 Comm: bash Tainted: G W > 6.2.0-rc1-next-202212267 > [ 51.927124] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > ?-20190727_073836-b4 > [ 51.928334] RIP: 0010:__rq_qos_issue+0x30/0x60 This indicates that we aren't getting the destruction order right. It could be that there are other reasons why the ordering is like this and we might have to synchronize separately. Sorry that I've been asking you to go round and round but block device add/remove paths have always been really tricky and we wanna avoid adding more complications if at all possible. Can you see why the device is being destroyed before the queue attr is removed? Thanks.
Hi, 在 2023/01/07 2:23, Tejun Heo 写道: > Hello, > > On Fri, Jan 06, 2023 at 09:33:26AM +0800, Yu Kuai wrote: >>> wbt's lazy init is tied to one of the block device sysfs files, right? So, >>> it *should* already be protected against device removal. >> >> That seems not true, I don't think q->sysfs_lock can protect that, >> consider that queue_wb_lat_store() doesn't check if del_gendisk() is >> called or not: >> >> t1: wbt lazy init t2: remove device >> queue_attr_store >> del_gendisk >> blk_unregister_queue >> mutex_lock(&q->sysfs_lock) >> ... >> mutex_unlock(&q->sysfs_lock); >> rq_qos_exit >> mutex_lock(&q->sysfs_lock); >> queue_wb_lat_store >> wbt_init >> rq_qos_add >> mutex_unlock(&q->sysfs_lock); > > So, it's not sysfs_lock but sysfs file deletion. When a kernfs, which backs > sysfs, file is removed, it disables future operations and drains all > inflight ones before returning, so you remove the interface files before > cleaning up the object that it interacts with, you don't have to worry about > racing against file operations as none can be in flight at that point. Ok, thanks for explanation, I'll look into this and try to find out how this works. > >> I tried to comfirm that by adding following delay: >> >> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >> index 93d9e9c9a6ea..101c33cb0a2b 100644 >> --- a/block/blk-sysfs.c >> +++ b/block/blk-sysfs.c >> @@ -11,6 +11,7 @@ >> #include <linux/blktrace_api.h> >> #include <linux/blk-mq.h> >> #include <linux/debugfs.h> >> +#include <linux/delay.h> >> >> #include "blk.h" >> #include "blk-mq.h" >> @@ -734,6 +735,8 @@ queue_attr_store(struct kobject *kobj, struct attribute >> *attr, >> if (!entry->store) >> return -EIO; >> >> + msleep(10000); >> + >> mutex_lock(&q->sysfs_lock); >> res = entry->store(q, page, length); >> mutex_unlock(&q->sysfs_lock); >> >> And then do the following test: >> >> 1) echo 10000 > /sys/block/sdb/queue/wbt_lat_usec & >> 2) echo 1 > /sys/block/sda/device/delete >> >> Then, following bug is triggered: >> >> [ 51.923642] BUG: unable to handle page fault for address: >> ffffffffffffffed >> [ 51.924294] #PF: supervisor read access in kernel mode >> [ 51.924773] #PF: error_code(0x0000) - not-present page >> [ 51.925252] PGD 1820b067 P4D 1820b067 PUD 1820d067 PMD 0 >> [ 51.925754] Oops: 0000 [#1] PREEMPT SMP >> [ 51.926123] CPU: 1 PID: 539 Comm: bash Tainted: G W >> 6.2.0-rc1-next-202212267 >> [ 51.927124] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> ?-20190727_073836-b4 >> [ 51.928334] RIP: 0010:__rq_qos_issue+0x30/0x60 > > This indicates that we aren't getting the destruction order right. It could > be that there are other reasons why the ordering is like this and we might > have to synchronize separately. > > Sorry that I've been asking you to go round and round but block device > add/remove paths have always been really tricky and we wanna avoid adding > more complications if at all possible. Can you see why the device is being > destroyed before the queue attr is removed? Of course, I'll glad to help, I'll let you know if I have any progress. Thanks, Kuai
Hi, 在 2023/01/09 9:38, Yu Kuai 写道: >> So, it's not sysfs_lock but sysfs file deletion. When a kernfs, which >> backs >> sysfs, file is removed, it disables future operations and drains all >> inflight ones before returning, so you remove the interface files before >> cleaning up the object that it interacts with, you don't have to worry >> about >> racing against file operations as none can be in flight at that point. > I understand this know, kernfs_fop_write_iter() will grab kernfs_node->active, and kobject_del() will wait for active to be dropped in kernfs_drain(). >> Sorry that I've been asking you to go round and round but block device >> add/remove paths have always been really tricky and we wanna avoid adding >> more complications if at all possible. Can you see why the device is >> being >> destroyed before the queue attr is removed? > Sorry that I actually tested with patch 4 applied, and this is a bug introduced by patch 4, my apologies. It set rqos to ERR_PTR() in rq_qos_exit, and follow up rq_qos_issue() just check if rqos is NULL. I'll wait for your patchset to be apllied, and then send a new version. Just one thing to confirm, do you think it's better to use a global mutex rather than a disk level mutex? I'm not sure because this will cause different cgroup configurations from different disk can't concurrent. Thanks, Kuai
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c index 50544bfb12f1..5f7ccc249c11 100644 --- a/block/blk-rq-qos.c +++ b/block/blk-rq-qos.c @@ -2,6 +2,8 @@ #include "blk-rq-qos.h" +static DEFINE_MUTEX(rq_qos_lock); + /* * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded, * false if 'v' + 1 would be bigger than 'below'. @@ -286,23 +288,18 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data, finish_wait(&rqw->wait, &data.wq); } -int rq_qos_add(struct request_queue *q, struct rq_qos *rqos) +static int __rq_qos_add(struct request_queue *q, struct rq_qos *rqos) { /* * No IO can be in-flight when adding rqos, so freeze queue, which * is fine since we only support rq_qos for blk-mq queue. - * - * Reuse ->queue_lock for protecting against other concurrent - * rq_qos adding/deleting */ blk_mq_freeze_queue(q); - spin_lock_irq(&q->queue_lock); if (rq_qos_id(q, rqos->id)) goto ebusy; rqos->next = q->rq_qos; q->rq_qos = rqos; - spin_unlock_irq(&q->queue_lock); blk_mq_unfreeze_queue(q); @@ -314,29 +311,23 @@ int rq_qos_add(struct request_queue *q, struct rq_qos *rqos) return 0; ebusy: - spin_unlock_irq(&q->queue_lock); blk_mq_unfreeze_queue(q); return -EBUSY; } -void rq_qos_del(struct request_queue *q, struct rq_qos *rqos) +static void __rq_qos_del(struct request_queue *q, struct rq_qos *rqos) { struct rq_qos **cur; - /* - * See comment in rq_qos_add() about freezing queue & using - * ->queue_lock. - */ + /* See comment in __rq_qos_add() about freezing queue */ blk_mq_freeze_queue(q); - spin_lock_irq(&q->queue_lock); for (cur = &q->rq_qos; *cur; cur = &(*cur)->next) { if (*cur == rqos) { *cur = rqos->next; break; } } - spin_unlock_irq(&q->queue_lock); blk_mq_unfreeze_queue(q); @@ -345,13 +336,33 @@ void rq_qos_del(struct request_queue *q, struct rq_qos *rqos) mutex_unlock(&q->debugfs_mutex); } +int rq_qos_add(struct request_queue *q, struct rq_qos *rqos) +{ + int ret; + + mutex_lock(&rq_qos_lock); + ret = __rq_qos_add(q, rqos); + mutex_unlock(&rq_qos_lock); + + return ret; +} + +void rq_qos_del(struct request_queue *q, struct rq_qos *rqos) +{ + mutex_lock(&rq_qos_lock); + __rq_qos_del(q, rqos); + mutex_unlock(&rq_qos_lock); +} + void rq_qos_exit(struct request_queue *q) { + mutex_lock(&rq_qos_lock); while (q->rq_qos) { struct rq_qos *rqos = q->rq_qos; q->rq_qos = rqos->next; rqos->ops->exit(rqos); } + mutex_unlock(&rq_qos_lock); } #ifdef CONFIG_BLK_CGROUP @@ -364,15 +375,20 @@ int rq_qos_add_and_activate_policy(struct request_queue *q, struct rq_qos *rqos, * may get called before policy activation completion, can't assume that * the target bio has an pd associated and need to test for NULL. */ - int ret = rq_qos_add(q, rqos); + int ret; - if (ret) + mutex_lock(&rq_qos_lock); + ret = __rq_qos_add(q, rqos); + if (ret) { + mutex_unlock(&rq_qos_lock); return ret; + } ret = blkcg_activate_policy(q, pol); if (ret) - rq_qos_del(q, rqos); + __rq_qos_del(q, rqos); + mutex_unlock(&rq_qos_lock); return ret; } #endif