block: don't set GD_NEED_PART_SCAN if scan partition failed

Message ID 20230322035926.1791317-1-yukuai1@huaweicloud.com
State New
Series block: don't set GD_NEED_PART_SCAN if scan partition failed

Commit Message

Yu Kuai March 22, 2023, 3:59 a.m. UTC
  From: Yu Kuai <yukuai3@huawei.com>

Currently, if disk_scan_partitions() fails, GD_NEED_PART_SCAN is still
set, and the partition scan will proceed again when blkdev_get_by_dev()
is called. This causes a problem: re-assembling a partitioned raid
device will create partitions for the underlying disks.

Test procedure:

mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
sgdisk -n 0:0:+100MiB /dev/md0
blockdev --rereadpt /dev/sda
blockdev --rereadpt /dev/sdb
mdadm -S /dev/md0
mdadm -A /dev/md0 /dev/sda /dev/sdb

Test result: the underlying disk partitions and the raid partition can
be observed at the same time

Note that this can still happen in some corner cases where
GD_NEED_PART_SCAN is set for the underlying disk while the raid device
is being re-assembled.

Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/genhd.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
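
For context, here is a simplified sketch of the failure path the commit
message describes (illustrative only, not the exact kernel source; it is
based on the flag usage summarized in the comments below, where
GD_NEED_PART_SCAN is consumed when the whole disk is opened and cleared by
bdev_disk_changed()):

    /*
     * Illustrative sketch, not the exact kernel source: before this fix,
     * disk_scan_partitions() set GD_NEED_PART_SCAN up front, so if the scan
     * then failed (e.g. the disk was already claimed exclusively) the flag
     * stayed set.  The next open of the whole disk -- for example mdadm
     * re-assembling the array from /dev/sda -- would then rescan the raw
     * member disk and re-create /dev/sda1.
     */
    if (test_bit(GD_NEED_PART_SCAN, &disk->state))
        bdev_disk_changed(disk, false);  /* rescan partitions, clear flag */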
  

Comments

Ming Lei March 22, 2023, 7:58 a.m. UTC | #1
On Wed, Mar 22, 2023 at 11:59:26AM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
> set, and partition scan will be proceed again when blkdev_get_by_dev()
> is called. However, this will cause a problem that re-assemble partitioned
> raid device will creat partition for underlying disk.
> 
> Test procedure:
> 
> mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
> sgdisk -n 0:0:+100MiB /dev/md0
> blockdev --rereadpt /dev/sda
> blockdev --rereadpt /dev/sdb
> mdadm -S /dev/md0
> mdadm -A /dev/md0 /dev/sda /dev/sdb
> 
> Test result: underlying disk partition and raid partition can be
> observed at the same time
> 
> Note that this can still happen in come corner cases that
> GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
> device.
> 
> Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>

The issue still can't be avoided completely; for example, after rebooting,
/dev/sda1 & /dev/md0p1 can be observed at the same time. In that case the
underlying partitions were scanned before the raid was re-assembled, and I
guess that may not be easy to avoid.

It also seems the following change added in e5cfefa97bcc isn't necessary:

                /* Make sure the first partition scan will be proceed */
                if (get_capacity(disk) && !(disk->flags & GENHD_FL_NO_PART) &&
                    !test_bit(GD_SUPPRESS_PART_SCAN, &disk->state))
                        set_bit(GD_NEED_PART_SCAN, &disk->state);

since the subsequent disk_scan_partitions() call in device_add_disk() should
cover the partition scan.

> ---
>  block/genhd.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 08bb1a9ec22c..a72e27d6779d 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -368,7 +368,6 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
>  	if (disk->open_partitions)
>  		return -EBUSY;
>  
> -	set_bit(GD_NEED_PART_SCAN, &disk->state);
>  	/*
>  	 * If the device is opened exclusively by current thread already, it's
>  	 * safe to scan partitons, otherwise, use bd_prepare_to_claim() to
> @@ -381,12 +380,19 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
>  			return ret;
>  	}
>  
> +	set_bit(GD_NEED_PART_SCAN, &disk->state);
>  	bdev = blkdev_get_by_dev(disk_devt(disk), mode & ~FMODE_EXCL, NULL);
>  	if (IS_ERR(bdev))
>  		ret =  PTR_ERR(bdev);
>  	else
>  		blkdev_put(bdev, mode & ~FMODE_EXCL);
>  
> +	/*
> +	 * If blkdev_get_by_dev() failed early, GD_NEED_PART_SCAN is still set,
> +	 * and this will cause that re-assemble partitioned raid device will
> +	 * creat partition for underlying disk.
> +	 */
> +	clear_bit(GD_NEED_PART_SCAN, &disk->state);

I feel GD_NEED_PART_SCAN is becoming a bit hard to follow.

So far, it is only consumed by blkdev_get_whole(), and cleared in
bdev_disk_changed(). That means the partition scan can be retried
if bdev_disk_changed() fails.

Another mess is that more drivers are starting to touch this flag, such as
nbd/sd; it is probably better to wrap it in a single API, for example
blk_disk_need_partition_scan(), and hide the implementation detail
from drivers.
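
A rough sketch of what such a wrapper could look like (the name comes from
Ming's suggestion above; the body is purely illustrative, reusing the checks
quoted earlier in this mail, and is not an existing kernel API):

    /*
     * Hypothetical helper: drivers (nbd, sd, ...) would call this instead of
     * setting GD_NEED_PART_SCAN directly, keeping the flag an implementation
     * detail of the block core.
     */
    static inline void blk_disk_need_partition_scan(struct gendisk *disk)
    {
        if (get_capacity(disk) && !(disk->flags & GENHD_FL_NO_PART) &&
            !test_bit(GD_SUPPRESS_PART_SCAN, &disk->state))
            set_bit(GD_NEED_PART_SCAN, &disk->state);
    }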


thanks,
Ming
  
Yu Kuai March 22, 2023, 9:12 a.m. UTC | #2
Hi, Ming

On 2023/03/22 15:58, Ming Lei wrote:
> On Wed, Mar 22, 2023 at 11:59:26AM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
>> set, and partition scan will be proceed again when blkdev_get_by_dev()
>> is called. However, this will cause a problem that re-assemble partitioned
>> raid device will creat partition for underlying disk.
>>
>> Test procedure:
>>
>> mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
>> sgdisk -n 0:0:+100MiB /dev/md0
>> blockdev --rereadpt /dev/sda
>> blockdev --rereadpt /dev/sdb
>> mdadm -S /dev/md0
>> mdadm -A /dev/md0 /dev/sda /dev/sdb
>>
>> Test result: underlying disk partition and raid partition can be
>> observed at the same time
>>
>> Note that this can still happen in come corner cases that
>> GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
>> device.
>>
>> Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> 
> The issue still can't be avoided completely, such as, after rebooting,
> /dev/sda1 & /dev/md0p1 can be observed at the same time. And this one
> should be underlying partitions scanned before re-assembling raid, I
> guess it may not be easy to avoid.

Yes, this is possible and I don't know how to fix this yet...
> 
> Also seems the following change added in e5cfefa97bcc isn't necessary:
> 
>                  /* Make sure the first partition scan will be proceed */
>                  if (get_capacity(disk) && !(disk->flags & GENHD_FL_NO_PART) &&
>                      !test_bit(GD_SUPPRESS_PART_SCAN, &disk->state))
>                          set_bit(GD_NEED_PART_SCAN, &disk->state);
> 
> since the following disk_scan_partitions() in device_add_disk() should cover
> partitions scan.

This can't be guaranteed if someone else opens the device exclusively after
bdev_add() and before disk_scan_partitions():

t1: 			t2:
device_add_disk
  bdev_add
   insert_inode_hash
			// open device excl
  disk_scan_partitions
  // will fail

However, this is just in theory, and it's unlikely to happen in
practice.

Thanks,
Kuai
> 
>> ---
>>   block/genhd.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/genhd.c b/block/genhd.c
>> index 08bb1a9ec22c..a72e27d6779d 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -368,7 +368,6 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
>>   	if (disk->open_partitions)
>>   		return -EBUSY;
>>   
>> -	set_bit(GD_NEED_PART_SCAN, &disk->state);
>>   	/*
>>   	 * If the device is opened exclusively by current thread already, it's
>>   	 * safe to scan partitons, otherwise, use bd_prepare_to_claim() to
>> @@ -381,12 +380,19 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
>>   			return ret;
>>   	}
>>   
>> +	set_bit(GD_NEED_PART_SCAN, &disk->state);
>>   	bdev = blkdev_get_by_dev(disk_devt(disk), mode & ~FMODE_EXCL, NULL);
>>   	if (IS_ERR(bdev))
>>   		ret =  PTR_ERR(bdev);
>>   	else
>>   		blkdev_put(bdev, mode & ~FMODE_EXCL);
>>   
>> +	/*
>> +	 * If blkdev_get_by_dev() failed early, GD_NEED_PART_SCAN is still set,
>> +	 * and this will cause that re-assemble partitioned raid device will
>> +	 * creat partition for underlying disk.
>> +	 */
>> +	clear_bit(GD_NEED_PART_SCAN, &disk->state);
> 
> I feel GD_NEED_PART_SCAN becomes a bit hard to follow.
> 
> So far, it is only consumed by blkdev_get_whole(), and cleared in
> bdev_disk_changed(). That means partition scan can be retried
> if bdev_disk_changed() fails.
> 
> Another mess is that more drivers start to touch this flag, such as
> nbd/sd, probably it is better to change them into one API of
> blk_disk_need_partition_scan(), and hide implementation detail
> to drivers.
> 
> 
> thanks,
> Ming
> 
> .
>
  
Jan Kara March 22, 2023, 9:47 a.m. UTC | #3
On Wed 22-03-23 15:58:35, Ming Lei wrote:
> On Wed, Mar 22, 2023 at 11:59:26AM +0800, Yu Kuai wrote:
> > From: Yu Kuai <yukuai3@huawei.com>
> > 
> > Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
> > set, and partition scan will be proceed again when blkdev_get_by_dev()
> > is called. However, this will cause a problem that re-assemble partitioned
> > raid device will creat partition for underlying disk.
> > 
> > Test procedure:
> > 
> > mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
> > sgdisk -n 0:0:+100MiB /dev/md0
> > blockdev --rereadpt /dev/sda
> > blockdev --rereadpt /dev/sdb
> > mdadm -S /dev/md0
> > mdadm -A /dev/md0 /dev/sda /dev/sdb
> > 
> > Test result: underlying disk partition and raid partition can be
> > observed at the same time
> > 
> > Note that this can still happen in come corner cases that
> > GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
> > device.
> > 
> > Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
> > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> 
> The issue still can't be avoided completely, such as, after rebooting,
> /dev/sda1 & /dev/md0p1 can be observed at the same time. And this one
> should be underlying partitions scanned before re-assembling raid, I
> guess it may not be easy to avoid.

So this was always happening (before my patches, after my patches, and now
after Yu's patches) and the kernel does not have enough information to know
that sda will become part of the md0 device in the future. But mdadm actually
deals with this as far as I remember and deletes partitions for all devices
it is assembling the array from (and a quick tracing experiment I did
supports this).

								Honza
  
Jan Kara March 22, 2023, 9:52 a.m. UTC | #4
On Wed 22-03-23 11:59:26, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
> set, and partition scan will be proceed again when blkdev_get_by_dev()
> is called. However, this will cause a problem that re-assemble partitioned
> raid device will creat partition for underlying disk.
> 
> Test procedure:
> 
> mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
> sgdisk -n 0:0:+100MiB /dev/md0
> blockdev --rereadpt /dev/sda
> blockdev --rereadpt /dev/sdb
> mdadm -S /dev/md0
> mdadm -A /dev/md0 /dev/sda /dev/sdb
> 
> Test result: underlying disk partition and raid partition can be
> observed at the same time
> 
> Note that this can still happen in come corner cases that
> GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
> device.
> 
> Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>

This looks good to me. I had actually noticed this problem already when
looking at the patch that resulted in commit e5cfefa97bcc, but Jens merged it
before I got to checking it, and then I convinced myself it's not serious
enough to redo the patch. Anyway, feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza 

> ---
>  block/genhd.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 08bb1a9ec22c..a72e27d6779d 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -368,7 +368,6 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
>  	if (disk->open_partitions)
>  		return -EBUSY;
>  
> -	set_bit(GD_NEED_PART_SCAN, &disk->state);
>  	/*
>  	 * If the device is opened exclusively by current thread already, it's
>  	 * safe to scan partitons, otherwise, use bd_prepare_to_claim() to
> @@ -381,12 +380,19 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
>  			return ret;
>  	}
>  
> +	set_bit(GD_NEED_PART_SCAN, &disk->state);
>  	bdev = blkdev_get_by_dev(disk_devt(disk), mode & ~FMODE_EXCL, NULL);
>  	if (IS_ERR(bdev))
>  		ret =  PTR_ERR(bdev);
>  	else
>  		blkdev_put(bdev, mode & ~FMODE_EXCL);
>  
> +	/*
> +	 * If blkdev_get_by_dev() failed early, GD_NEED_PART_SCAN is still set,
> +	 * and this will cause that re-assemble partitioned raid device will
> +	 * creat partition for underlying disk.
> +	 */
> +	clear_bit(GD_NEED_PART_SCAN, &disk->state);
>  	if (!(mode & FMODE_EXCL))
>  		bd_abort_claiming(disk->part0, disk_scan_partitions);
>  	return ret;
> -- 
> 2.31.1
>
  
Ming Lei March 22, 2023, 11:34 a.m. UTC | #5
On Wed, Mar 22, 2023 at 10:47:07AM +0100, Jan Kara wrote:
> On Wed 22-03-23 15:58:35, Ming Lei wrote:
> > On Wed, Mar 22, 2023 at 11:59:26AM +0800, Yu Kuai wrote:
> > > From: Yu Kuai <yukuai3@huawei.com>
> > > 
> > > Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
> > > set, and partition scan will be proceed again when blkdev_get_by_dev()
> > > is called. However, this will cause a problem that re-assemble partitioned
> > > raid device will creat partition for underlying disk.
> > > 
> > > Test procedure:
> > > 
> > > mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
> > > sgdisk -n 0:0:+100MiB /dev/md0
> > > blockdev --rereadpt /dev/sda
> > > blockdev --rereadpt /dev/sdb
> > > mdadm -S /dev/md0
> > > mdadm -A /dev/md0 /dev/sda /dev/sdb
> > > 
> > > Test result: underlying disk partition and raid partition can be
> > > observed at the same time
> > > 
> > > Note that this can still happen in come corner cases that
> > > GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
> > > device.
> > > 
> > > Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
> > > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > 
> > The issue still can't be avoided completely, such as, after rebooting,
> > /dev/sda1 & /dev/md0p1 can be observed at the same time. And this one
> > should be underlying partitions scanned before re-assembling raid, I
> > guess it may not be easy to avoid.
> 
> So this was always happening (before my patches, after my patches, and now
> after Yu's patches) and kernel does not have enough information to know
> that sda will become part of md0 device in the future. But mdadm actually
> deals with this as far as I remember and deletes partitions for all devices
> it is assembling the array from (and quick tracing experiment I did
> supports this).

I am testing on Fedora 37, so mdadm v4.2 doesn't delete the underlying
partitions before re-assembling.

Also, given that mdadm or related userspace has to change to avoid scanning
the underlying partitions, I'm just wondering why not let userspace tell the
kernel explicitly not to do it?

Thanks,
Ming
  
Jan Kara March 22, 2023, 1:07 p.m. UTC | #6
On Wed 22-03-23 19:34:30, Ming Lei wrote:
> On Wed, Mar 22, 2023 at 10:47:07AM +0100, Jan Kara wrote:
> > On Wed 22-03-23 15:58:35, Ming Lei wrote:
> > > On Wed, Mar 22, 2023 at 11:59:26AM +0800, Yu Kuai wrote:
> > > > From: Yu Kuai <yukuai3@huawei.com>
> > > > 
> > > > Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
> > > > set, and partition scan will be proceed again when blkdev_get_by_dev()
> > > > is called. However, this will cause a problem that re-assemble partitioned
> > > > raid device will creat partition for underlying disk.
> > > > 
> > > > Test procedure:
> > > > 
> > > > mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
> > > > sgdisk -n 0:0:+100MiB /dev/md0
> > > > blockdev --rereadpt /dev/sda
> > > > blockdev --rereadpt /dev/sdb
> > > > mdadm -S /dev/md0
> > > > mdadm -A /dev/md0 /dev/sda /dev/sdb
> > > > 
> > > > Test result: underlying disk partition and raid partition can be
> > > > observed at the same time
> > > > 
> > > > Note that this can still happen in come corner cases that
> > > > GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
> > > > device.
> > > > 
> > > > Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
> > > > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > > 
> > > The issue still can't be avoided completely, such as, after rebooting,
> > > /dev/sda1 & /dev/md0p1 can be observed at the same time. And this one
> > > should be underlying partitions scanned before re-assembling raid, I
> > > guess it may not be easy to avoid.
> > 
> > So this was always happening (before my patches, after my patches, and now
> > after Yu's patches) and kernel does not have enough information to know
> > that sda will become part of md0 device in the future. But mdadm actually
> > deals with this as far as I remember and deletes partitions for all devices
> > it is assembling the array from (and quick tracing experiment I did
> > supports this).
> 
> I am testing on Fedora 37, so mdadm v4.2 doesn't delete underlying
> partitions before re-assemble.

Strange, I'm on openSUSE Leap 15.4 and mdadm v4.1 deletes these partitions
(at least I can see mdadm issue BLKPG_DEL_PARTITION ioctls). And checking the
mdadm sources I can see calls to remove_partitions() from the start_array()
function in Assemble.c, so I'm not sure why this is not working for you...
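
For readers unfamiliar with that ioctl, here is a minimal standalone sketch
of deleting a partition via BLKPG_DEL_PARTITION (an illustration of the ioctl
Jan mentions, not mdadm's actual remove_partitions() code; the device path
and partition number are placeholders, and the program needs CAP_SYS_ADMIN):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/blkpg.h>
    #include <unistd.h>

    int main(void)
    {
        struct blkpg_partition part = { .pno = 1 };   /* delete partition 1 */
        struct blkpg_ioctl_arg arg = {
            .op = BLKPG_DEL_PARTITION,
            .datalen = sizeof(part),
            .data = &part,
        };
        int fd = open("/dev/sda", O_RDONLY);          /* placeholder disk */

        if (fd < 0 || ioctl(fd, BLKPG, &arg) < 0)
            perror("BLKPG_DEL_PARTITION");
        if (fd >= 0)
            close(fd);
        return 0;
    }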

> Also given mdadm or related userspace has to change for avoiding
> to scan underlying partitions, just wondering why not let userspace
> to tell kernel not do it explicitly?

Well, those userspace changes have long been deployed; now you would
introduce a new API that needs to proliferate again. Not very nice. Also, how
exactly would that work? I mean, once mdadm has the underlying device open,
the current logic makes sure we do not create partitions anymore. But there's
no way mdadm could possibly prevent creation of partitions for devices it
doesn't know about yet, so it still has to delete existing partitions...

								Honza
  
Ming Lei March 22, 2023, 4:08 p.m. UTC | #7
On Wed, Mar 22, 2023 at 02:07:09PM +0100, Jan Kara wrote:
> On Wed 22-03-23 19:34:30, Ming Lei wrote:
> > On Wed, Mar 22, 2023 at 10:47:07AM +0100, Jan Kara wrote:
> > > On Wed 22-03-23 15:58:35, Ming Lei wrote:
> > > > On Wed, Mar 22, 2023 at 11:59:26AM +0800, Yu Kuai wrote:
> > > > > From: Yu Kuai <yukuai3@huawei.com>
> > > > > 
> > > > > Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
> > > > > set, and partition scan will be proceed again when blkdev_get_by_dev()
> > > > > is called. However, this will cause a problem that re-assemble partitioned
> > > > > raid device will creat partition for underlying disk.
> > > > > 
> > > > > Test procedure:
> > > > > 
> > > > > mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
> > > > > sgdisk -n 0:0:+100MiB /dev/md0
> > > > > blockdev --rereadpt /dev/sda
> > > > > blockdev --rereadpt /dev/sdb
> > > > > mdadm -S /dev/md0
> > > > > mdadm -A /dev/md0 /dev/sda /dev/sdb
> > > > > 
> > > > > Test result: underlying disk partition and raid partition can be
> > > > > observed at the same time
> > > > > 
> > > > > Note that this can still happen in come corner cases that
> > > > > GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
> > > > > device.
> > > > > 
> > > > > Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
> > > > > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > > > 
> > > > The issue still can't be avoided completely, such as, after rebooting,
> > > > /dev/sda1 & /dev/md0p1 can be observed at the same time. And this one
> > > > should be underlying partitions scanned before re-assembling raid, I
> > > > guess it may not be easy to avoid.
> > > 
> > > So this was always happening (before my patches, after my patches, and now
> > > after Yu's patches) and kernel does not have enough information to know
> > > that sda will become part of md0 device in the future. But mdadm actually
> > > deals with this as far as I remember and deletes partitions for all devices
> > > it is assembling the array from (and quick tracing experiment I did
> > > supports this).
> > 
> > I am testing on Fedora 37, so mdadm v4.2 doesn't delete underlying
> > partitions before re-assemble.
> 
> Strange, I'm on openSUSE Leap 15.4 and mdadm v4.1 deletes these partitions
> (at least I can see mdadm do BLKPG_DEL_PARTITION ioctls). And checking
> mdadm sources I can see calls to remove_partitions() from start_array()
> function in Assemble.c so I'm not sure why this is not working for you...

I added dump_stack() in delete_partition() for partition 1 and did not
observe a stack trace during boot.

> 
> > Also given mdadm or related userspace has to change for avoiding
> > to scan underlying partitions, just wondering why not let userspace
> > to tell kernel not do it explicitly?
> 
> Well, those userspace changes are long deployed, now you would introduce
> new API that needs to proliferate again. Not very nice. Also how would that
> exactly work? I mean once mdadm has underlying device open, the current
> logic makes sure we do not create partitions anymore. But there's no way
> how mdadm could possibly prevent creation of partitions for devices it
> doesn't know about yet so it still has to delete existing partitions...

I meant that if mdadm has to change to delete existing partitions, why not
add one ioctl to disable partition scanning for the disk when deleting
partitions / re-assembling, and re-enable scanning after stopping the array.

But it looks like that isn't the case, since you mentioned that
remove_partitions() is supposed to be called before starting the array;
however, I didn't observe this behavior.

I am worried that the current approach may cause a regression; one concern
is that ioctl(BLKRRPART) now needs an exclusive open, for example (a sketch
of the call follows):

1) mount /dev/vdb1 /mnt

2) ioctl(BLKRRPART) may fail after removing /dev/vdb3
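
As an aside, here is a minimal sketch of what such an ioctl(BLKRRPART) call
looks like from userspace (illustrative only; /dev/vdb is the placeholder
disk from the example above, and the program needs CAP_SYS_ADMIN). With any
partition still in use, the scan is refused with EBUSY because of the
disk->open_partitions check Jan quotes below:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>      /* BLKRRPART */
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/vdb", O_RDONLY);   /* whole-disk device */

        if (fd < 0) {
            perror("open");
            return 1;
        }
        /*
         * Re-read the partition table.  This fails with EBUSY while any
         * partition (e.g. a mounted /dev/vdb1) is still open, or while
         * another process holds an exclusive claim on the disk.
         */
        if (ioctl(fd, BLKRRPART) < 0)
            perror("BLKRRPART");
        close(fd);
        return 0;
    }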


thanks,
Ming
  
Jan Kara March 23, 2023, 10:51 a.m. UTC | #8
On Thu 23-03-23 00:08:51, Ming Lei wrote:
> On Wed, Mar 22, 2023 at 02:07:09PM +0100, Jan Kara wrote:
> > On Wed 22-03-23 19:34:30, Ming Lei wrote:
> > > On Wed, Mar 22, 2023 at 10:47:07AM +0100, Jan Kara wrote:
> > > > On Wed 22-03-23 15:58:35, Ming Lei wrote:
> > > > > On Wed, Mar 22, 2023 at 11:59:26AM +0800, Yu Kuai wrote:
> > > > > > From: Yu Kuai <yukuai3@huawei.com>
> > > > > > 
> > > > > > Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
> > > > > > set, and partition scan will be proceed again when blkdev_get_by_dev()
> > > > > > is called. However, this will cause a problem that re-assemble partitioned
> > > > > > raid device will creat partition for underlying disk.
> > > > > > 
> > > > > > Test procedure:
> > > > > > 
> > > > > > mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
> > > > > > sgdisk -n 0:0:+100MiB /dev/md0
> > > > > > blockdev --rereadpt /dev/sda
> > > > > > blockdev --rereadpt /dev/sdb
> > > > > > mdadm -S /dev/md0
> > > > > > mdadm -A /dev/md0 /dev/sda /dev/sdb
> > > > > > 
> > > > > > Test result: underlying disk partition and raid partition can be
> > > > > > observed at the same time
> > > > > > 
> > > > > > Note that this can still happen in come corner cases that
> > > > > > GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
> > > > > > device.
> > > > > > 
> > > > > > Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
> > > > > > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > > > > 
> > > > > The issue still can't be avoided completely, such as, after rebooting,
> > > > > /dev/sda1 & /dev/md0p1 can be observed at the same time. And this one
> > > > > should be underlying partitions scanned before re-assembling raid, I
> > > > > guess it may not be easy to avoid.
> > > > 
> > > > So this was always happening (before my patches, after my patches, and now
> > > > after Yu's patches) and kernel does not have enough information to know
> > > > that sda will become part of md0 device in the future. But mdadm actually
> > > > deals with this as far as I remember and deletes partitions for all devices
> > > > it is assembling the array from (and quick tracing experiment I did
> > > > supports this).
> > > 
> > > I am testing on Fedora 37, so mdadm v4.2 doesn't delete underlying
> > > partitions before re-assemble.
> > 
> > Strange, I'm on openSUSE Leap 15.4 and mdadm v4.1 deletes these partitions
> > (at least I can see mdadm do BLKPG_DEL_PARTITION ioctls). And checking
> > mdadm sources I can see calls to remove_partitions() from start_array()
> > function in Assemble.c so I'm not sure why this is not working for you...
> 
> I added dump_stack() in delete_partition() for partition 1, not observe
> stack trace during booting.
> 
> > 
> > > Also given mdadm or related userspace has to change for avoiding
> > > to scan underlying partitions, just wondering why not let userspace
> > > to tell kernel not do it explicitly?
> > 
> > Well, those userspace changes are long deployed, now you would introduce
> > new API that needs to proliferate again. Not very nice. Also how would that
> > exactly work? I mean once mdadm has underlying device open, the current
> > logic makes sure we do not create partitions anymore. But there's no way
> > how mdadm could possibly prevent creation of partitions for devices it
> > doesn't know about yet so it still has to delete existing partitions...
> 
> I meant if mdadm has to change to delete existed partitions, why not add
> one ioctl to disable partition scan for this disk when deleting
> partitions/re-assemble, and re-enable scan after stopping array.
> 
> But looks it isn't so, since you mentioned that remove_partitions is
> supposed to be called before starting array, however I didn't observe this
> behavior.

Yeah, not sure what's happening on your system.

> I am worrying if the current approach may cause regression, one concern is
> that ioctl(BLKRRPART) needs exclusive open now, such as:
> 
> 1) mount /dev/vdb1 /mnt
> 
> 2) ioctl(BLKRRPART) may fail after removing /dev/vdb3

Well, but we always had some variant of:

        if (disk->open_partitions)
                return -EBUSY;

in disk_scan_partitions(). So as long as any partition on the disk is used,
EBUSY is the correct return value from BLKRRPART.

								Honza
  
Ming Lei March 23, 2023, 12:03 p.m. UTC | #9
On Thu, Mar 23, 2023 at 11:51:20AM +0100, Jan Kara wrote:
> On Thu 23-03-23 00:08:51, Ming Lei wrote:
> > On Wed, Mar 22, 2023 at 02:07:09PM +0100, Jan Kara wrote:
> > > On Wed 22-03-23 19:34:30, Ming Lei wrote:
> > > > On Wed, Mar 22, 2023 at 10:47:07AM +0100, Jan Kara wrote:
> > > > > On Wed 22-03-23 15:58:35, Ming Lei wrote:
> > > > > > On Wed, Mar 22, 2023 at 11:59:26AM +0800, Yu Kuai wrote:
> > > > > > > From: Yu Kuai <yukuai3@huawei.com>
> > > > > > > 
> > > > > > > Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
> > > > > > > set, and partition scan will be proceed again when blkdev_get_by_dev()
> > > > > > > is called. However, this will cause a problem that re-assemble partitioned
> > > > > > > raid device will creat partition for underlying disk.
> > > > > > > 
> > > > > > > Test procedure:
> > > > > > > 
> > > > > > > mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
> > > > > > > sgdisk -n 0:0:+100MiB /dev/md0
> > > > > > > blockdev --rereadpt /dev/sda
> > > > > > > blockdev --rereadpt /dev/sdb
> > > > > > > mdadm -S /dev/md0
> > > > > > > mdadm -A /dev/md0 /dev/sda /dev/sdb
> > > > > > > 
> > > > > > > Test result: underlying disk partition and raid partition can be
> > > > > > > observed at the same time
> > > > > > > 
> > > > > > > Note that this can still happen in come corner cases that
> > > > > > > GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
> > > > > > > device.
> > > > > > > 
> > > > > > > Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
> > > > > > > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > > > > > 
> > > > > > The issue still can't be avoided completely, such as, after rebooting,
> > > > > > /dev/sda1 & /dev/md0p1 can be observed at the same time. And this one
> > > > > > should be underlying partitions scanned before re-assembling raid, I
> > > > > > guess it may not be easy to avoid.
> > > > > 
> > > > > So this was always happening (before my patches, after my patches, and now
> > > > > after Yu's patches) and kernel does not have enough information to know
> > > > > that sda will become part of md0 device in the future. But mdadm actually
> > > > > deals with this as far as I remember and deletes partitions for all devices
> > > > > it is assembling the array from (and quick tracing experiment I did
> > > > > supports this).
> > > > 
> > > > I am testing on Fedora 37, so mdadm v4.2 doesn't delete underlying
> > > > partitions before re-assemble.
> > > 
> > > Strange, I'm on openSUSE Leap 15.4 and mdadm v4.1 deletes these partitions
> > > (at least I can see mdadm do BLKPG_DEL_PARTITION ioctls). And checking
> > > mdadm sources I can see calls to remove_partitions() from start_array()
> > > function in Assemble.c so I'm not sure why this is not working for you...
> > 
> > I added dump_stack() in delete_partition() for partition 1, not observe
> > stack trace during booting.
> > 
> > > 
> > > > Also given mdadm or related userspace has to change for avoiding
> > > > to scan underlying partitions, just wondering why not let userspace
> > > > to tell kernel not do it explicitly?
> > > 
> > > Well, those userspace changes are long deployed, now you would introduce
> > > new API that needs to proliferate again. Not very nice. Also how would that
> > > exactly work? I mean once mdadm has underlying device open, the current
> > > logic makes sure we do not create partitions anymore. But there's no way
> > > how mdadm could possibly prevent creation of partitions for devices it
> > > doesn't know about yet so it still has to delete existing partitions...
> > 
> > I meant if mdadm has to change to delete existed partitions, why not add
> > one ioctl to disable partition scan for this disk when deleting
> > partitions/re-assemble, and re-enable scan after stopping array.
> > 
> > But looks it isn't so, since you mentioned that remove_partitions is
> > supposed to be called before starting array, however I didn't observe this
> > behavior.
> 
> Yeah, not sure what's happening on your system.

It looks like the issue doesn't occur on Fedora 38, but it does happen on Fedora 37.

> 
> > I am worrying if the current approach may cause regression, one concern is
> > that ioctl(BLKRRPART) needs exclusive open now, such as:
> > 
> > 1) mount /dev/vdb1 /mnt
> > 
> > 2) ioctl(BLKRRPART) may fail after removing /dev/vdb3
> 
> Well, but we always had some variant of:
> 
>         if (disk->open_partitions)
>                 return -EBUSY;
> 
> in disk_scan_partitions(). So as long as any partition on the disk is used,
> EBUSY is the correct return value from BLKRRPART.

OK, I missed that check.

Then the change can basically be thought of as ioctl(BLKRRPART) requiring
O_EXCL: if one app just holds the disk open with O_EXCL for a while, then
ioctl(BLKRRPART) can't be done from another process. Hopefully there isn't
such a case in practice.


Thanks,
Ming
  
Ming Lei March 23, 2023, 11:59 p.m. UTC | #10
On Wed, Mar 22, 2023 at 11:59:26AM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
> set, and partition scan will be proceed again when blkdev_get_by_dev()
> is called. However, this will cause a problem that re-assemble partitioned
> raid device will creat partition for underlying disk.
> 
> Test procedure:
> 
> mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
> sgdisk -n 0:0:+100MiB /dev/md0
> blockdev --rereadpt /dev/sda
> blockdev --rereadpt /dev/sdb
> mdadm -S /dev/md0
> mdadm -A /dev/md0 /dev/sda /dev/sdb
> 
> Test result: underlying disk partition and raid partition can be
> observed at the same time
> 
> Note that this can still happen in come corner cases that
> GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
> device.

That is why I suggest touching this flag as little as possible, and maybe
replacing it with a function parameter in the future.

> 
> Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>

For now, let's move on with the fix:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming
  
Yu Kuai April 6, 2023, 3:42 a.m. UTC | #11
Hi, Jens!

On 2023/03/22 11:59, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
> set, and partition scan will be proceed again when blkdev_get_by_dev()
> is called. However, this will cause a problem that re-assemble partitioned
> raid device will creat partition for underlying disk.
> 
> Test procedure:
> 
> mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
> sgdisk -n 0:0:+100MiB /dev/md0
> blockdev --rereadpt /dev/sda
> blockdev --rereadpt /dev/sdb
> mdadm -S /dev/md0
> mdadm -A /dev/md0 /dev/sda /dev/sdb
> 
> Test result: underlying disk partition and raid partition can be
> observed at the same time
> 
> Note that this can still happen in come corner cases that
> GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
> device.
> 

Can you apply this patch?

Thanks,
Kuai
> Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   block/genhd.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 08bb1a9ec22c..a72e27d6779d 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -368,7 +368,6 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
>   	if (disk->open_partitions)
>   		return -EBUSY;
>   
> -	set_bit(GD_NEED_PART_SCAN, &disk->state);
>   	/*
>   	 * If the device is opened exclusively by current thread already, it's
>   	 * safe to scan partitons, otherwise, use bd_prepare_to_claim() to
> @@ -381,12 +380,19 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
>   			return ret;
>   	}
>   
> +	set_bit(GD_NEED_PART_SCAN, &disk->state);
>   	bdev = blkdev_get_by_dev(disk_devt(disk), mode & ~FMODE_EXCL, NULL);
>   	if (IS_ERR(bdev))
>   		ret =  PTR_ERR(bdev);
>   	else
>   		blkdev_put(bdev, mode & ~FMODE_EXCL);
>   
> +	/*
> +	 * If blkdev_get_by_dev() failed early, GD_NEED_PART_SCAN is still set,
> +	 * and this will cause that re-assemble partitioned raid device will
> +	 * creat partition for underlying disk.
> +	 */
> +	clear_bit(GD_NEED_PART_SCAN, &disk->state);
>   	if (!(mode & FMODE_EXCL))
>   		bd_abort_claiming(disk->part0, disk_scan_partitions);
>   	return ret;
>
  
Jens Axboe April 6, 2023, 10:29 p.m. UTC | #12
On 4/5/23 9:42 PM, Yu Kuai wrote:
> Hi, Jens!
> 
> On 2023/03/22 11:59, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
>> set, and partition scan will be proceed again when blkdev_get_by_dev()
>> is called. However, this will cause a problem that re-assemble partitioned
>> raid device will creat partition for underlying disk.
>>
>> Test procedure:
>>
>> mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
>> sgdisk -n 0:0:+100MiB /dev/md0
>> blockdev --rereadpt /dev/sda
>> blockdev --rereadpt /dev/sdb
>> mdadm -S /dev/md0
>> mdadm -A /dev/md0 /dev/sda /dev/sdb
>>
>> Test result: underlying disk partition and raid partition can be
>> observed at the same time
>>
>> Note that this can still happen in come corner cases that
>> GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
>> device.
>>
> 
> Can you apply this patch?

None of them apply to my for-6.4/block branch...
  
Ming Lei April 7, 2023, 2:01 a.m. UTC | #13
On Thu, Apr 06, 2023 at 04:29:43PM -0600, Jens Axboe wrote:
> On 4/5/23 9:42 PM, Yu Kuai wrote:
> > Hi, Jens!
> > 
> > On 2023/03/22 11:59, Yu Kuai wrote:
> >> From: Yu Kuai <yukuai3@huawei.com>
> >>
> >> Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
> >> set, and partition scan will be proceed again when blkdev_get_by_dev()
> >> is called. However, this will cause a problem that re-assemble partitioned
> >> raid device will creat partition for underlying disk.
> >>
> >> Test procedure:
> >>
> >> mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
> >> sgdisk -n 0:0:+100MiB /dev/md0
> >> blockdev --rereadpt /dev/sda
> >> blockdev --rereadpt /dev/sdb
> >> mdadm -S /dev/md0
> >> mdadm -A /dev/md0 /dev/sda /dev/sdb
> >>
> >> Test result: underlying disk partition and raid partition can be
> >> observed at the same time
> >>
> >> Note that this can still happen in come corner cases that
> >> GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
> >> device.
> >>
> > 
> > Can you apply this patch?
> 
> None of them apply to my for-6.4/block branch...

This patch is a bug fix, and should probably target 6.3.

Thanks,
Ming
  
Jens Axboe April 7, 2023, 2:42 a.m. UTC | #14
On 4/6/23 8:01 PM, Ming Lei wrote:
> On Thu, Apr 06, 2023 at 04:29:43PM -0600, Jens Axboe wrote:
>> On 4/5/23 9:42 PM, Yu Kuai wrote:
>>> Hi, Jens!
>>>
>>> On 2023/03/22 11:59, Yu Kuai wrote:
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> Currently if disk_scan_partitions() failed, GD_NEED_PART_SCAN will still
>>>> set, and partition scan will be proceed again when blkdev_get_by_dev()
>>>> is called. However, this will cause a problem that re-assemble partitioned
>>>> raid device will creat partition for underlying disk.
>>>>
>>>> Test procedure:
>>>>
>>>> mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
>>>> sgdisk -n 0:0:+100MiB /dev/md0
>>>> blockdev --rereadpt /dev/sda
>>>> blockdev --rereadpt /dev/sdb
>>>> mdadm -S /dev/md0
>>>> mdadm -A /dev/md0 /dev/sda /dev/sdb
>>>>
>>>> Test result: underlying disk partition and raid partition can be
>>>> observed at the same time
>>>>
>>>> Note that this can still happen in come corner cases that
>>>> GD_NEED_PART_SCAN can be set for underlying disk while re-assemble raid
>>>> device.
>>>>
>>>
>>> Can you apply this patch?
>>
>> None of them apply to my for-6.4/block branch...
> 
> This patch is bug fix, and probably should aim at 6.3.

Yeah I see now, but it's a bit of a mashup of 2 patches, and
then a separate one. I've applied the single fixup for 6.3.
  

Patch

diff --git a/block/genhd.c b/block/genhd.c
index 08bb1a9ec22c..a72e27d6779d 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -368,7 +368,6 @@  int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
 	if (disk->open_partitions)
 		return -EBUSY;
 
-	set_bit(GD_NEED_PART_SCAN, &disk->state);
 	/*
 	 * If the device is opened exclusively by current thread already, it's
 	 * safe to scan partitons, otherwise, use bd_prepare_to_claim() to
@@ -381,12 +380,19 @@  int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
 			return ret;
 	}
 
+	set_bit(GD_NEED_PART_SCAN, &disk->state);
 	bdev = blkdev_get_by_dev(disk_devt(disk), mode & ~FMODE_EXCL, NULL);
 	if (IS_ERR(bdev))
 		ret =  PTR_ERR(bdev);
 	else
 		blkdev_put(bdev, mode & ~FMODE_EXCL);
 
+	/*
+	 * If blkdev_get_by_dev() failed early, GD_NEED_PART_SCAN is still set,
+	 * and this will cause re-assembling a partitioned raid device to
+	 * create partitions for the underlying disk.
+	 */
+	clear_bit(GD_NEED_PART_SCAN, &disk->state);
 	if (!(mode & FMODE_EXCL))
 		bd_abort_claiming(disk->part0, disk_scan_partitions);
 	return ret;