Message ID | 20240207092756.2087888-1-linan666@huaweicloud.com |
---|---|
State | New |
Headers |
From: linan666@huaweicloud.com To: axboe@kernel.dk Cc: linux-raid@vger.kernel.org, song@kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linan666@huaweicloud.com, yukuai3@huawei.com, yi.zhang@huawei.com, houtao1@huawei.com, yangerkun@huawei.com Subject: [PATCH] block: fix deadlock between bd_link_disk_holder and partition scan Date: Wed, 7 Feb 2024 17:27:56 +0800 Message-Id: <20240207092756.2087888-1-linan666@huaweicloud.com> X-Mailer: git-send-email 2.39.2 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org MIME-Version: 1.0 Content-Transfer-Encoding: 8bit |
Series |
block: fix deadlock between bd_link_disk_holder and partition scan
|
|
Commit Message
Li Nan
Feb. 7, 2024, 9:27 a.m. UTC
From: Li Nan <linan122@huawei.com>

'open_mutex' of gendisk is used to protect open/close block devices. But
in bd_link_disk_holder(), it is used to protect the creation of symlink
between holding disk and slave bdev, which introduces some issues.

When bd_link_disk_holder() is called, the driver is usually in the process
of initialization/modification and may suspend submitting io. At this
time, any io hold 'open_mutex', such as scanning partitions, can cause
deadlocks. For example, in raid:

T1                              T2
bdev_open_by_dev
 lock open_mutex [1]
 ...
  efi_partition
  ...
   md_submit_bio
                                md_ioctl mddev_suspend
                                  -> suspend all io
                                 md_add_new_disk
                                  bind_rdev_to_array
                                   bd_link_disk_holder
                                    try lock open_mutex [2]
    md_handle_request
     -> wait mddev_resume

T1 scan partition, T2 add a new device to raid. T1 waits for T2 to resume
mddev, but T2 waits for open_mutex held by T1. Deadlock occurs.

Fix it by introducing a local mutex 'holder_mutex' to replace 'open_mutex'.

Signed-off-by: Li Nan <linan122@huawei.com>
---
 block/holder.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)
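The dependency cycle in the T1/T2 trace can be checked mechanically with a small wait-for model. The following is a userspace C sketch, not kernel code. It assumes each task blocks on at most one resource and treats "array suspended" as a resource held by T2. Every name in it (`struct model`, `scenario_deadlocks`, the enum values) is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: tasks T1/T2 and the resources they hold or wait on. */
enum { T1, T2, NTASKS };
enum { OPEN_MUTEX, HOLDER_MUTEX, MD_SUSPEND, NRES };
#define NONE (-1)

struct model {
	int holder[NRES];	/* task holding each resource, or NONE */
	int waits_for[NTASKS];	/* resource each task is blocked on, or NONE */
};

static void model_init(struct model *m)
{
	for (int r = 0; r < NRES; r++)
		m->holder[r] = NONE;
	for (int t = 0; t < NTASKS; t++)
		m->waits_for[t] = NONE;
}

/* Walk task -> wanted resource -> holder; coming back to `start` is a cycle. */
static bool deadlocked(const struct model *m, int start)
{
	int t = start;

	for (int steps = 0; steps < NTASKS; steps++) {
		int r = m->waits_for[t];

		if (r == NONE)
			return false;
		t = m->holder[r];
		if (t == NONE)
			return false;
		if (t == start)
			return true;
	}
	return false;
}

/* T1 scans partitions; T2 adds a disk and links the holder under link_lock. */
static bool scenario_deadlocks(int link_lock)
{
	struct model m;

	model_init(&m);
	m.holder[OPEN_MUTEX] = T1;	/* T1: bdev_open_by_dev took open_mutex */
	m.holder[MD_SUSPEND] = T2;	/* T2: md_ioctl suspended the array */
	m.waits_for[T1] = MD_SUSPEND;	/* T1: its I/O waits for mddev_resume */
	if (m.holder[link_lock] != NONE && m.holder[link_lock] != T2)
		m.waits_for[T2] = link_lock;	/* T2 blocks in bd_link_disk_holder */
	return deadlocked(&m, T1);
}
```

With `OPEN_MUTEX` as the link lock the walk returns to T1 and reports a cycle. With `HOLDER_MUTEX`, T2 is never blocked, so it can go on to resume the array and release T1.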
Comments
On Wed, Feb 7, 2024 at 1:32 AM <linan666@huaweicloud.com> wrote:
>
> From: Li Nan <linan122@huawei.com>
>
> 'open_mutex' of gendisk is used to protect open/close block devices. But
> in bd_link_disk_holder(), it is used to protect the creation of symlink
> between holding disk and slave bdev, which introduces some issues.
>
> When bd_link_disk_holder() is called, the driver is usually in the process
> of initialization/modification and may suspend submitting io. At this
> time, any io hold 'open_mutex', such as scanning partitions, can cause
> deadlocks. For example, in raid:
>
> T1                              T2
> bdev_open_by_dev
>  lock open_mutex [1]
>  ...
>   efi_partition
>   ...
>    md_submit_bio
>                                 md_ioctl mddev_suspend
>                                   -> suspend all io
>                                  md_add_new_disk
>                                   bind_rdev_to_array
>                                    bd_link_disk_holder
>                                     try lock open_mutex [2]
>     md_handle_request
>      -> wait mddev_resume
>
> T1 scan partition, T2 add a new device to raid. T1 waits for T2 to resume
> mddev, but T2 waits for open_mutex held by T1. Deadlock occurs.
>
> Fix it by introducing a local mutex 'holder_mutex' to replace 'open_mutex'.

Is this to fix [1]? Do we need some Fixes and/or Closes tags?

Could you please add steps to reproduce this issue?

Thanks,
Song

[1] https://bugzilla.kernel.org/show_bug.cgi?id=218459

>
> Signed-off-by: Li Nan <linan122@huawei.com>
> ---
>  block/holder.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/block/holder.c b/block/holder.c
> index 37d18c13d958..5bfb0a674cc7 100644
> --- a/block/holder.c
> +++ b/block/holder.c
> @@ -8,6 +8,8 @@ struct bd_holder_disk {
>  	int refcnt;
>  };
>
> +static DEFINE_MUTEX(holder_mutex);
> +
>  static struct bd_holder_disk *bd_find_holder_disk(struct block_device *bdev,
>  						  struct gendisk *disk)
>  {
> @@ -80,7 +82,7 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
>  	kobject_get(bdev->bd_holder_dir);
>  	mutex_unlock(&bdev->bd_disk->open_mutex);
>
> -	mutex_lock(&disk->open_mutex);
> +	mutex_lock(&holder_mutex);
>  	WARN_ON_ONCE(!bdev->bd_holder);
>
>  	holder = bd_find_holder_disk(bdev, disk);
> @@ -108,7 +110,7 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
>  		goto out_del_symlink;
>  	list_add(&holder->list, &disk->slave_bdevs);
>
> -	mutex_unlock(&disk->open_mutex);
> +	mutex_unlock(&holder_mutex);
>  	return 0;
>
>  out_del_symlink:
> @@ -116,7 +118,7 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
>  out_free_holder:
>  	kfree(holder);
>  out_unlock:
> -	mutex_unlock(&disk->open_mutex);
> +	mutex_unlock(&holder_mutex);
>  	if (ret)
>  		kobject_put(bdev->bd_holder_dir);
>  	return ret;
> @@ -140,7 +142,7 @@ void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk)
>  	if (WARN_ON_ONCE(!disk->slave_dir))
>  		return;
>
> -	mutex_lock(&disk->open_mutex);
> +	mutex_lock(&holder_mutex);
>  	holder = bd_find_holder_disk(bdev, disk);
>  	if (!WARN_ON_ONCE(holder == NULL) && !--holder->refcnt) {
>  		del_symlink(disk->slave_dir, bdev_kobj(bdev));
> @@ -149,6 +151,6 @@ void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk)
>  		list_del_init(&holder->list);
>  		kfree(holder);
>  	}
> -	mutex_unlock(&disk->open_mutex);
> +	mutex_unlock(&holder_mutex);
>  }
>  EXPORT_SYMBOL_GPL(bd_unlink_disk_holder);
> --
> 2.39.2
>
On 2024/2/8 14:50, Song Liu wrote:
> On Wed, Feb 7, 2024 at 1:32 AM <linan666@huaweicloud.com> wrote:
>>
>> From: Li Nan <linan122@huawei.com>
>>
>> [...]
>>
>> Fix it by introducing a local mutex 'holder_mutex' to replace 'open_mutex'.
>
> Is this to fix [1]? Do we need some Fixes and/or Closes tags?
>

No. This is just another way to fix the issue addressed by [2]; either [2]
or this patch can fix it. I am not sure about the root cause of [1] yet.

[2] https://patchwork.kernel.org/project/linux-raid/list/?series=812045

> Could you please add steps to reproduce this issue?

We need to modify the kernel, add sleep in md_submit_bio() and md_ioctl()
as below, and then:
  1. mdadm -CR /dev/md0 -l1 -n2 /dev/sd[bc]  #create a raid
  2. echo 1 > /sys/module/md_mod/parameters/error_inject  #enable sleep
  3. 'mdadm --add /dev/md0 /dev/sda'  #add a disk to raid
  4. submit ioctl BLKRRPART to raid within 10s.

Changes of kernel:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 350f5b22ba6f..ce16d319edf2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -76,6 +76,8 @@ static DEFINE_SPINLOCK(pers_lock);

 static const struct kobj_type md_ktype;

+static bool error_inject = false;
+
 struct md_cluster_operations *md_cluster_ops;
 EXPORT_SYMBOL(md_cluster_ops);
 static struct module *md_cluster_mod;
@@ -372,6 +374,8 @@ static bool is_suspended(struct mddev *mddev, struct bio *bio)

 void md_handle_request(struct mddev *mddev, struct bio *bio)
 {
+	if (error_inject)
+		ssleep(10);
 check_suspended:
 	if (is_suspended(mddev, bio)) {
 		DEFINE_WAIT(__wait);
@@ -7752,6 +7756,8 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 		 */
 		if (mddev->pers) {
 			mdu_disk_info_t info;
+			if (error_inject)
+				ssleep(10);
 			if (copy_from_user(&info, argp, sizeof(info)))
 				err = -EFAULT;
 			else if (!(info.state & (1<<MD_DISK_SYNC)))
@@ -10120,6 +10126,7 @@ module_param_call(start_ro, set_ro, get_ro, NULL, S_IRUSR|S_IWUSR);
 module_param(start_dirty_degraded, int, S_IRUGO|S_IWUSR);
 module_param_call(new_array, add_named_array, NULL, NULL, S_IWUSR);
 module_param(create_on_open, bool, S_IRUSR|S_IWUSR);
+module_param(error_inject, bool, S_IRUSR|S_IWUSR);

 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("MD RAID framework");
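Step 4 of the reproducer (submitting BLKRRPART to the array) can be done from a small userspace helper. This is only a sketch: it assumes the array node is /dev/md0 and that the caller has sufficient privileges. `reread_partitions` is a hypothetical name; `BLKRRPART` is the ioctl request defined in `<linux/fs.h>`.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>	/* BLKRRPART */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Ask the kernel to re-read the partition table of the block device at
 * `path`. Returns 0 on success and -1 on failure, with errno left as set
 * by open() or ioctl().
 */
static int reread_partitions(const char *path)
{
	int fd = open(path, O_RDONLY);
	int ret, saved_errno;

	if (fd < 0)
		return -1;

	ret = ioctl(fd, BLKRRPART);
	saved_errno = errno;	/* close() may overwrite errno */
	close(fd);
	errno = saved_errno;

	return ret == 0 ? 0 : -1;
}
```

Calling `reread_partitions("/dev/md0")` inside the 10 second window opened by `error_inject` should trigger the partition scan that takes `open_mutex`. `blockdev --rereadpt /dev/md0` issues the same ioctl from the shell.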
On Thu, Feb 8, 2024 at 12:44 AM Li Nan <linan666@huaweicloud.com> wrote:
>
> On 2024/2/8 14:50, Song Liu wrote:
> > On Wed, Feb 7, 2024 at 1:32 AM <linan666@huaweicloud.com> wrote:
> >>
> >> [...]
> >>
> >> Fix it by introducing a local mutex 'holder_mutex' to replace 'open_mutex'.
> >
> > Is this to fix [1]? Do we need some Fixes and/or Closes tags?
> >
>
> No. This is just another way to fix the issue addressed by [2]; either [2]
> or this patch can fix it. I am not sure about the root cause of [1] yet.
>
> [2] https://patchwork.kernel.org/project/linux-raid/list/?series=812045
>
> > Could you please add steps to reproduce this issue?
>
> We need to modify the kernel, add sleep in md_submit_bio() and md_ioctl()
> as below, and then:
>   1. mdadm -CR /dev/md0 -l1 -n2 /dev/sd[bc]  #create a raid
>   2. echo 1 > /sys/module/md_mod/parameters/error_inject  #enable sleep
>   3. 'mdadm --add /dev/md0 /dev/sda'  #add a disk to raid
>   4. submit ioctl BLKRRPART to raid within 10s.

The analysis makes sense. I also hit the issue a couple times without adding
extra delays. But I am not sure whether this is the best fix (I didn't find real
issues with it either).

Maybe we don't need to suspend the array for ADD_NEW_DISK? So that something
like the following might just work?

Thanks,
Song

@@ -7573,7 +7577,6 @@ static inline bool md_ioctl_valid(unsigned int cmd)
 static bool md_ioctl_need_suspend(unsigned int cmd)
 {
 	switch (cmd) {
-	case ADD_NEW_DISK:
 	case HOT_ADD_DISK:
 	case HOT_REMOVE_DISK:
 	case SET_BITMAP_FILE:
On Thu, Feb 8, 2024 at 4:49 PM Song Liu <song@kernel.org> wrote:
>
> On Thu, Feb 8, 2024 at 12:44 AM Li Nan <linan666@huaweicloud.com> wrote:
> >
> > [...]
> >
> > We need to modify the kernel, add sleep in md_submit_bio() and md_ioctl()
> > as below, and then:
> >   1. mdadm -CR /dev/md0 -l1 -n2 /dev/sd[bc]  #create a raid
> >   2. echo 1 > /sys/module/md_mod/parameters/error_inject  #enable sleep
> >   3. 'mdadm --add /dev/md0 /dev/sda'  #add a disk to raid
> >   4. submit ioctl BLKRRPART to raid within 10s.
>
> The analysis makes sense. I also hit the issue a couple times without adding
> extra delays. But I am not sure whether this is the best fix (I didn't find real
> issues with it either).

To be extra safe and future proof, we can do something like the following to
suspend the array for ADD_NEW_DISK only on running arrays.

This appears to solve the problem reported in

https://bugzilla.kernel.org/show_bug.cgi?id=218459

Thanks,
Song

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9e41a9aaba8b..395911d5f4d6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7570,10 +7570,11 @@ static inline bool md_ioctl_valid(unsigned int cmd)
 	}
 }

-static bool md_ioctl_need_suspend(unsigned int cmd)
+static bool md_ioctl_need_suspend(struct mddev *mddev, unsigned int cmd)
 {
 	switch (cmd) {
 	case ADD_NEW_DISK:
+		return mddev->pers != NULL;
 	case HOT_ADD_DISK:
 	case HOT_REMOVE_DISK:
 	case SET_BITMAP_FILE:
@@ -7625,6 +7626,7 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 	void __user *argp = (void __user *)arg;
 	struct mddev *mddev = NULL;
 	bool did_set_md_closing = false;
+	bool need_suspend;

 	if (!md_ioctl_valid(cmd))
 		return -ENOTTY;
@@ -7716,8 +7718,11 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 	if (!md_is_rdwr(mddev))
 		flush_work(&mddev->sync_work);

-	err = md_ioctl_need_suspend(cmd) ? mddev_suspend_and_lock(mddev) :
-					   mddev_lock(mddev);
+	need_suspend = md_ioctl_need_suspend(mddev, cmd);
+	if (need_suspend)
+		err = mddev_suspend_and_lock(mddev);
+	else
+		err = mddev_lock(mddev);
 	if (err) {
 		pr_debug("md: ioctl lock interrupted, reason %d, cmd %d\n",
 			 err, cmd);
@@ -7846,8 +7851,10 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 	    err != -EINVAL)
 		mddev->hold_active = 0;

-	md_ioctl_need_suspend(cmd) ? mddev_unlock_and_resume(mddev) :
-				     mddev_unlock(mddev);
+	if (need_suspend)
+		mddev_unlock_and_resume(mddev);
+	else
+		mddev_unlock(mddev);

 out:
 	if(did_set_md_closing)
Hi,

On 2024/02/17 3:03, Song Liu wrote:
> On Thu, Feb 8, 2024 at 4:49 PM Song Liu <song@kernel.org> wrote:
>>
>> [...]
>>
>> The analysis makes sense. I also hit the issue a couple times without adding
>> extra delays. But I am not sure whether this is the best fix (I didn't find real
>> issues with it either).
>
> To be extra safe and future proof, we can do something like the
> following to suspend the array for ADD_NEW_DISK only on running arrays.
>
> This appears to solve the problem reported in
>
> https://bugzilla.kernel.org/show_bug.cgi?id=218459
>
> Thanks,
> Song
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 9e41a9aaba8b..395911d5f4d6 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7570,10 +7570,11 @@ static inline bool md_ioctl_valid(unsigned int cmd)
>  	}
>  }
>
> -static bool md_ioctl_need_suspend(unsigned int cmd)
> +static bool md_ioctl_need_suspend(struct mddev *mddev, unsigned int cmd)
>  {
>  	switch (cmd) {
>  	case ADD_NEW_DISK:
> +		return mddev->pers != NULL;

Did you check already that this problem is not related to 'active_io'
being leaked for flush IO?

I don't understand the problem reported yet. If 'mddev->pers' is not set
yet, md_submit_bio() will return directly, and 'active_io' should not be
grabbed in the first place.

md_run() is the only place to convert 'mddev->pers' from NULL to a real
personality, and it's protected by 'reconfig_mutex', however,
md_ioctl_need_suspend() is called without 'reconfig_mutex', hence there
is a race condition:

md_ioctl_need_suspend           array_state_store
 // mddev->pers is NULL, return false
                                 mddev_lock
                                 do_md_run
                                  mddev->pers = xxx
                                 mddev_unlock

 // mddev_suspend is not called
 mddev_lock
 md_add_new_disk
  if (mddev->pers)
   md_import_device
   bind_rdev_to_array
   add_bound_rdev
    mddev->pers->hot_add_disk
    -> hot add disk without suspending

Thanks,
Kuai

>  	case HOT_ADD_DISK:
>  	case HOT_REMOVE_DISK:
>  	case SET_BITMAP_FILE:
> [...]
On Sat, Feb 17, 2024 at 11:47 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2024/02/17 3:03, Song Liu wrote:
> > [...]
> >
> > To be extra safe and future proof, we can do something like the
> > following to suspend the array for ADD_NEW_DISK only on running arrays.
> >
> > This appears to solve the problem reported in
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=218459
> >
> > Thanks,
> > Song
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 9e41a9aaba8b..395911d5f4d6 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -7570,10 +7570,11 @@ static inline bool md_ioctl_valid(unsigned int cmd)
> >  	}
> >  }
> >
> > -static bool md_ioctl_need_suspend(unsigned int cmd)
> > +static bool md_ioctl_need_suspend(struct mddev *mddev, unsigned int cmd)
> >  {
> >  	switch (cmd) {
> >  	case ADD_NEW_DISK:
> > +		return mddev->pers != NULL;
>
> Did you check already that this problem is not related to 'active_io'
> being leaked for flush IO?
>
> I don't understand the problem reported yet. If 'mddev->pers' is not set
> yet, md_submit_bio() will return directly, and 'active_io' should not be
> grabbed in the first place.

AFAICT, this is not related to the active_io issue.

>
> md_run() is the only place to convert 'mddev->pers' from NULL to a real
> personality, and it's protected by 'reconfig_mutex', however,
> md_ioctl_need_suspend() is called without 'reconfig_mutex', hence there
> is a race condition:
>
> md_ioctl_need_suspend           array_state_store
>  // mddev->pers is NULL, return false
>                                  mddev_lock
>                                  do_md_run
>                                   mddev->pers = xxx
>                                  mddev_unlock
>
>  // mddev_suspend is not called
>  mddev_lock
>  md_add_new_disk
>   if (mddev->pers)
>    md_import_device
>    bind_rdev_to_array
>    add_bound_rdev
>     mddev->pers->hot_add_disk
>     -> hot add disk without suspending

Yeah, this race condition exists. We probably need some trick with suspend
and lock here.

Thanks,
Song
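The check-then-lock race in this exchange can be modelled deterministically in userspace. The sketch below is not md code and is not a fix proposed in this thread. It simulates the window between the unlocked `pers` check and the lock with a callback, and shows one possible shape of a re-check after locking. `struct fake_mddev`, `racer_fn` and both `add_disk_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few mddev fields that matter here. */
struct fake_mddev {
	const void *pers;	/* NULL until the array is running */
	bool suspended;		/* true while I/O is suspended */
};

/* Work done by "another task" between the check and the lock. */
typedef void (*racer_fn)(struct fake_mddev *);

static const int fake_personality = 1;

static void start_array(struct fake_mddev *m) { m->pers = &fake_personality; }
static void no_racer(struct fake_mddev *m) { (void)m; }

/*
 * Decide about suspending before taking the lock. Returns true if a disk
 * ends up being hot-added to a running array without suspending it.
 */
static bool add_disk_check_outside_lock(struct fake_mddev *m, racer_fn racer)
{
	bool need_suspend = (m->pers != NULL);	/* decided without the lock */
	bool unsafe;

	racer(m);			/* window before the lock is taken */
	m->suspended = need_suspend;	/* "suspend and lock" or plain "lock" */
	unsafe = (m->pers != NULL) && !m->suspended;
	m->suspended = false;		/* unlock (and resume) */
	return unsafe;
}

/* Same flow, but the decision is validated again once the lock is held. */
static bool add_disk_recheck_under_lock(struct fake_mddev *m, racer_fn racer)
{
	bool need_suspend = (m->pers != NULL);
	bool unsafe;

	racer(m);
	for (;;) {
		m->suspended = need_suspend;	/* take the lock */
		if (need_suspend || m->pers == NULL)
			break;			/* decision is still valid */
		need_suspend = true;		/* stale: drop lock, retry suspended */
	}
	unsafe = (m->pers != NULL) && !m->suspended;
	m->suspended = false;			/* unlock (and resume) */
	return unsafe;
}
```

The model only illustrates why a decision sampled outside `reconfig_mutex` can be stale by the time the lock is held. Whether a retry loop like this is acceptable in md_ioctl() is a separate question that the thread leaves open.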
Hi, Christoph 在 2024/02/07 17:27, linan666@huaweicloud.com 写道: > From: Li Nan <linan122@huawei.com> > > 'open_mutex' of gendisk is used to protect open/close block devices. But > in bd_link_disk_holder(), it is used to protect the creation of symlink > between holding disk and slave bdev, which introduces some issues. > > When bd_link_disk_holder() is called, the driver is usually in the process > of initialization/modification and may suspend submitting io. At this > time, any io hold 'open_mutex', such as scanning partitions, can cause > deadlocks. For example, in raid: > > T1 T2 > bdev_open_by_dev > lock open_mutex [1] > ... > efi_partition > ... > md_submit_bio > md_ioctl mddev_syspend > -> suspend all io > md_add_new_disk > bind_rdev_to_array > bd_link_disk_holder > try lock open_mutex [2] > md_handle_request > -> wait mddev_resume > > T1 scan partition, T2 add a new device to raid. T1 waits for T2 to resume > mddev, but T2 waits for open_mutex held by T1. Deadlock occurs. > > Fix it by introducing a local mutex 'holder_mutex' to replace 'open_mutex'. Can you take a look at this patch? I think for raid(perhaps and dm and other drivers), it's reasonable to suspend IO while hot adding new underlying disks. And I think add new slaves to holder is not related to open the holder disk, because caller should already open the holder disk to hot add slaves, hence 'open_mutex' for holder is not necessary here. Actually bd_link_disk_holder() is protected by 'reconfig_mutex' for raid, and 'table_devices_lock' for dm(I'm not sure yet if other drivers have similiar lock). For raid, we do can fix this problem in raid by delay bd_link_disk_holder() while the array is not suspended, however, we'll consider this fix later if you think this patch is not acceptable. 
Thanks,
Kuai

>
> Signed-off-by: Li Nan <linan122@huawei.com>
> ---
>  block/holder.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/block/holder.c b/block/holder.c
> index 37d18c13d958..5bfb0a674cc7 100644
> --- a/block/holder.c
> +++ b/block/holder.c
> @@ -8,6 +8,8 @@ struct bd_holder_disk {
>  	int refcnt;
>  };
>  
> +static DEFINE_MUTEX(holder_mutex);
> +
>  static struct bd_holder_disk *bd_find_holder_disk(struct block_device *bdev,
>  						  struct gendisk *disk)
>  {
> @@ -80,7 +82,7 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
>  	kobject_get(bdev->bd_holder_dir);
>  	mutex_unlock(&bdev->bd_disk->open_mutex);
>  
> -	mutex_lock(&disk->open_mutex);
> +	mutex_lock(&holder_mutex);
>  	WARN_ON_ONCE(!bdev->bd_holder);
>  
>  	holder = bd_find_holder_disk(bdev, disk);
> @@ -108,7 +110,7 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
>  		goto out_del_symlink;
>  	list_add(&holder->list, &disk->slave_bdevs);
>  
> -	mutex_unlock(&disk->open_mutex);
> +	mutex_unlock(&holder_mutex);
>  	return 0;
>  
>  out_del_symlink:
> @@ -116,7 +118,7 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
>  out_free_holder:
>  	kfree(holder);
>  out_unlock:
> -	mutex_unlock(&disk->open_mutex);
> +	mutex_unlock(&holder_mutex);
>  	if (ret)
>  		kobject_put(bdev->bd_holder_dir);
>  	return ret;
> @@ -140,7 +142,7 @@ void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk)
>  	if (WARN_ON_ONCE(!disk->slave_dir))
>  		return;
>  
> -	mutex_lock(&disk->open_mutex);
> +	mutex_lock(&holder_mutex);
>  	holder = bd_find_holder_disk(bdev, disk);
>  	if (!WARN_ON_ONCE(holder == NULL) && !--holder->refcnt) {
>  		del_symlink(disk->slave_dir, bdev_kobj(bdev));
> @@ -149,6 +151,6 @@ void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk)
>  		list_del_init(&holder->list);
>  		kfree(holder);
>  	}
> -	mutex_unlock(&disk->open_mutex);
> +	mutex_unlock(&holder_mutex);
>  }
>  EXPORT_SYMBOL_GPL(bd_unlink_disk_holder);
>
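To illustrate why a dedicated lock removes the dependency described in the
commit message, here is a minimal single-threaded model. It is a sketch under
stated assumptions, not kernel code: model_mutex, model_disk,
model_link_holder and model_link_during_scan are hypothetical names, and a
failed trylock stands in for the point where the real code would block
forever.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a mutex; only "try lock" semantics are modeled. */
struct model_mutex {
	bool held;
};

static bool model_trylock(struct model_mutex *m)
{
	if (m->held)
		return false;
	m->held = true;
	return true;
}

static void model_unlock(struct model_mutex *m)
{
	m->held = false;
}

/* Hypothetical stand-in for the holder disk. */
struct model_disk {
	struct model_mutex open_mutex;
	int nr_slaves;
};

/* One lock for all holder links, mirroring the static 'holder_mutex'. */
static struct model_mutex model_holder_mutex;

/* Links a slave; returns false where the real code would have to wait. */
bool model_link_holder(struct model_disk *disk, bool use_holder_mutex)
{
	struct model_mutex *lock = use_holder_mutex ?
		&model_holder_mutex : &disk->open_mutex;

	if (!model_trylock(lock))
		return false;
	disk->nr_slaves++;
	model_unlock(lock);
	return true;
}

/*
 * T1 holds 'open_mutex' (partition scan stuck waiting for resume) while
 * T2 tries to link a new slave. Returns true if T2 can make progress.
 */
bool model_link_during_scan(bool use_holder_mutex)
{
	struct model_disk disk = { .open_mutex = { false }, .nr_slaves = 0 };
	bool t1_locked = model_trylock(&disk.open_mutex);
	bool linked;

	assert(t1_locked);
	linked = model_link_holder(&disk, use_holder_mutex);
	model_unlock(&disk.open_mutex);
	return linked;
}
```

With use_holder_mutex set to false, T2 cannot take the lock T1 already holds,
which corresponds to step [2] in the diagram; with it set to true, the link
proceeds and T2 could go on to resume the array.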
On Mon, Feb 19, 2024 at 04:53:36PM +0800, Yu Kuai wrote:
> Can you take a look at this patch? I think for raid (and perhaps dm and
> other drivers), it's reasonable to suspend IO while hot adding new
> underlying disks. I also think adding new slaves to a holder is not related
> to opening the holder disk, because the caller must already have opened the
> holder disk in order to hot add slaves, hence 'open_mutex' of the holder is
> not necessary here.
>
> Actually, bd_link_disk_holder() is protected by 'reconfig_mutex' for
> raid, and by 'table_devices_lock' for dm (I'm not sure yet whether other
> drivers have a similar lock).
>
> For raid, we could instead fix this problem in raid itself by delaying
> bd_link_disk_holder() until the array is no longer suspended; however, we'll
> consider that fix later, if you think this patch is not acceptable.

Yes, not taking open_mutex here seems reasonable. open_mutex, or its
previous name, has always been a bit of a catchall without very well
defined semantics.

I'd give the symbol a blk_ prefix, though.
diff --git a/block/holder.c b/block/holder.c
index 37d18c13d958..5bfb0a674cc7 100644
--- a/block/holder.c
+++ b/block/holder.c
@@ -8,6 +8,8 @@ struct bd_holder_disk {
 	int refcnt;
 };
 
+static DEFINE_MUTEX(holder_mutex);
+
 static struct bd_holder_disk *bd_find_holder_disk(struct block_device *bdev,
 						  struct gendisk *disk)
 {
@@ -80,7 +82,7 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
 	kobject_get(bdev->bd_holder_dir);
 	mutex_unlock(&bdev->bd_disk->open_mutex);
 
-	mutex_lock(&disk->open_mutex);
+	mutex_lock(&holder_mutex);
 	WARN_ON_ONCE(!bdev->bd_holder);
 
 	holder = bd_find_holder_disk(bdev, disk);
@@ -108,7 +110,7 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
 		goto out_del_symlink;
 	list_add(&holder->list, &disk->slave_bdevs);
 
-	mutex_unlock(&disk->open_mutex);
+	mutex_unlock(&holder_mutex);
 	return 0;
 
 out_del_symlink:
@@ -116,7 +118,7 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk)
 out_free_holder:
 	kfree(holder);
 out_unlock:
-	mutex_unlock(&disk->open_mutex);
+	mutex_unlock(&holder_mutex);
 	if (ret)
 		kobject_put(bdev->bd_holder_dir);
 	return ret;
@@ -140,7 +142,7 @@ void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk)
 	if (WARN_ON_ONCE(!disk->slave_dir))
 		return;
 
-	mutex_lock(&disk->open_mutex);
+	mutex_lock(&holder_mutex);
 	holder = bd_find_holder_disk(bdev, disk);
 	if (!WARN_ON_ONCE(holder == NULL) && !--holder->refcnt) {
 		del_symlink(disk->slave_dir, bdev_kobj(bdev));
@@ -149,6 +151,6 @@ void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk)
 		list_del_init(&holder->list);
 		kfree(holder);
 	}
-	mutex_unlock(&disk->open_mutex);
+	mutex_unlock(&holder_mutex);
 }
 EXPORT_SYMBOL_GPL(bd_unlink_disk_holder);