[RFC,1/7] block: Support creating a struct file from a block device

Message ID 20230126033358.1880-2-demi@invisiblethingslab.com
State New
Series Allow race-free block device handling

Commit Message

Demi Marie Obenour Jan. 26, 2023, 3:33 a.m. UTC
  The newly added blkdev_get_file() function allows kernel code to create
a struct file for any block device.  The main use-case is for the
struct file to be exposed to userspace as a file descriptor.  A future
patch will modify the DM_DEV_CREATE ioctl to allow userspace to
get a file descriptor to the newly created block device, avoiding nasty
race conditions.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 block/bdev.c           | 77 +++++++++++++++++++++++++++++++++++-------
 include/linux/blkdev.h |  5 +++
 2 files changed, 70 insertions(+), 12 deletions(-)
  

Comments

Christoph Hellwig Jan. 30, 2023, 8:08 a.m. UTC | #1
On Wed, Jan 25, 2023 at 10:33:53PM -0500, Demi Marie Obenour wrote:
> The newly added blkdev_get_file() function allows kernel code to create
> a struct file for any block device.  The main use-case is for the
> struct file to be exposed to userspace as a file descriptor.  A future
> patch will modify the DM_DEV_CREATE ioctl to allow userspace to
> get a file descriptor to the newly created block device, avoiding nasty
> race conditions.

NAK.  Do not add weird sideways interfaces to the block layer.
  
Demi Marie Obenour Jan. 30, 2023, 7:22 p.m. UTC | #2
On Mon, Jan 30, 2023 at 12:08:23AM -0800, Christoph Hellwig wrote:
> On Wed, Jan 25, 2023 at 10:33:53PM -0500, Demi Marie Obenour wrote:
> > The newly added blkdev_get_file() function allows kernel code to create
> > a struct file for any block device.  The main use-case is for the
> > struct file to be exposed to userspace as a file descriptor.  A future
> > patch will modify the DM_DEV_CREATE ioctl to allow userspace to
> > get a file descriptor to the newly created block device, avoiding nasty
> > race conditions.
> 
> NAK.  Do not add weird sideways interfaces to the block layer.

What do you recommend instead?  This solves a real problem for
device-mapper users and I am not aware of a better solution.
  
Christoph Hellwig Jan. 31, 2023, 8:53 a.m. UTC | #3
On Mon, Jan 30, 2023 at 02:22:39PM -0500, Demi Marie Obenour wrote:
> What do you recommend instead?  This solves a real problem for
> device-mapper users and I am not aware of a better solution.

You could start with explaining the problem and what other methods
you tried that failed.  In the end it's not my job to fix your problem.
I generally gladly help, but this kind of attitude doesn't get very
far.
  
Demi Marie Obenour Jan. 31, 2023, 4:27 p.m. UTC | #4
On Tue, Jan 31, 2023 at 12:53:03AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 30, 2023 at 02:22:39PM -0500, Demi Marie Obenour wrote:
> > What do you recommend instead?  This solves a real problem for
> > device-mapper users and I am not aware of a better solution.
> 
> You could start with explaining the problem and what other methods
> you tried that failed.  In the end it's not my job to fix your problem.

I’m working on a “block not-script” (Xen block device hotplug script
written in C) for Qubes OS.  The current hotplug script is a shell
script that takes a global lock, which serializes all invocations and
significantly slows down VM creation and destruction.  My C program
avoids this problem.

One of the goals of the not-script is to never leak resources, even if
it dies with SIGKILL or is never called with the “remove” argument to
destroy the devices it created.  Therefore, whenever possible, it relies
on automatic destruction of devices that are no longer used.  I have
managed to make this work for loop devices, provided that the Xen
blkback driver is patched to accept a diskseq in the physical-device
Xenstore node.  I have *not* managed to make this work for device-mapper
devices, however.  One of the problems is that there is no way to
atomically create a device-mapper device and obtain a file descriptor to
it such that the device will be destroyed when no longer used.  To solve
this problem, I added a new flag (DM_FILE_DESCRIPTOR_FLAG) that asks the
device-mapper driver to provide userspace a file descriptor for the
device that was just created.  The uAPI will likely change in future
versions of the patch, but the general idea will not.

While it is easy to provide userspace with an FD to any struct file, it
is *not* easy to obtain a struct file for a given struct block_device.
I could have had device-mapper implement everything itself, but that
would have duplicated a large amount of code already in the block layer.
Instead, I decided to refactor the block layer to provide a function
that does exactly what was needed.  The result was this patch.  In the
future, I would like to add an ioctl for /dev/loop-control that creates
a loop device and returns a file descriptor to the loop device.  I could
also see iSCSI supporting this, with the socket file descriptor being
passed in from userspace.

blkdev_do_open() does not solve any problem for me at this time.
Instead, it represents the code shared by blkdev_get_by_dev() and
blkdev_get_file().  I decided to export it because it could be of
independent use to others.  In particular, it could potentially
simplify disk_scan_partitions() in block/genhd.c, pkt_new_dev() in
pktcdvd, backing_dev_store() in zram, and f2fs_scan_devices() in f2fs.

I hope this is enough information.  If it is not, feel free to ask for
more.
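
The lifetime property described above (a resource that disappears once the
last reference to it is dropped, even if its creator is killed) can be
illustrated with a userspace analogy.  This is only a sketch under stated
assumptions: an unlinked regular file stands in for the device, and the
helper names create_anonymous_file() and link_count() are hypothetical.  It
does not use device-mapper or any interface from this patch series.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>	/* for callers that want to assert on the results */
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Analogy only: create a scratch file, remove its only name, and return
 * the still-open descriptor (or -1 on error).  From this point on the
 * file exists only for as long as some descriptor refers to it.
 */
int create_anonymous_file(void)
{
	char path[] = "/tmp/fd-lifetime-XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	if (unlink(path) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Return the number of names the file behind fd still has, or -1. */
long link_count(int fd)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return -1;
	return (long)st.st_nlink;
}
```

Once create_anonymous_file() returns, link_count() is expected to report
zero names while reads and writes through the descriptor keep working.  The
kernel frees the file when the last descriptor is closed, including when the
owning process dies with SIGKILL, so no cleanup step has to run.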
  
Christoph Hellwig Feb. 1, 2023, 7:45 a.m. UTC | #5
On Tue, Jan 31, 2023 at 11:27:59AM -0500, Demi Marie Obenour wrote:
> While it is easy to provide userspace with an FD to any struct file, it
> is *not* easy to obtain a struct file for a given struct block_device.
> I could have had device-mapper implement everything itself, but that
> would have duplicated a large amount of code already in the block layer.
> Instead, I decided to refactor the block layer to provide a function
> that does exactly what was needed.  The result was this patch.  In the
> future, I would like to add an ioctl for /dev/loop-control that creates
> a loop device and returns a file descriptor to the loop device.  I could
> also see iSCSI supporting this, with the socket file descriptor being
> passed in from userspace.

And it is somewhat intentional that you can't.  Block device inodes
have interesting life times and are never directly exposed to userspace
at all.  They are internal, and only f_mapping of a file system inode
delegates to them for I/O.  Your patch now magically exposes them to
userspace.  And it then bypasses all pathname and inode permission
based access checks and auditing.  So we can't just do it.

> blkdev_do_open() does not solve any problem for me at this time.
> Instead, it represents the code shared by blkdev_get_by_dev() and
> blkdev_get_file().  I decided to export it because it could be of
> independent use to others.  In particular, it could potentially
> simplify disk_scan_partitions() in block/genhd.c, pkt_new_dev() in
> pktcdvd, backing_dev_store() in zram, and f2fs_scan_devices() in f2fs.

All these need to actually open the underlying device as they do I/O.
Doing I/O without opening the device is a no-go.
  
Demi Marie Obenour Feb. 1, 2023, 4:18 p.m. UTC | #6
On Tue, Jan 31, 2023 at 11:45:55PM -0800, Christoph Hellwig wrote:
> On Tue, Jan 31, 2023 at 11:27:59AM -0500, Demi Marie Obenour wrote:
> > While it is easy to provide userspace with an FD to any struct file, it
> > is *not* easy to obtain a struct file for a given struct block_device.
> > I could have had device-mapper implement everything itself, but that
> > would have duplicated a large amount of code already in the block layer.
> > Instead, I decided to refactor the block layer to provide a function
> > that does exactly what was needed.  The result was this patch.  In the
> > future, I would like to add an ioctl for /dev/loop-control that creates
> > a loop device and returns a file descriptor to the loop device.  I could
> > also see iSCSI supporting this, with the socket file descriptor being
> > passed in from userspace.
> 
> And it is somewhat intentional that you can't.  Block device inodes
> have interesting life times and are never directly exposed to userspace
> at all.  They are internal, and only f_mapping of a file system inode
> delegates to them for I/O.  Your patch now magically exposes them to
> userspace.

The intention is that the file descriptor is equivalent to what one would
get by first creating the device and then opening it.  If it is not,
that is a bug in one of my patches.

> And it then bypasses all pathname and inode permission
> based access checks and auditing.  So we can't just do it.

Accessing /dev/mapper/control is already enough to panic the kernel, so
presumably only fully trusted userspace can make the ioctl to begin
with.  Furthermore, this only allows a userspace process to get a file
descriptor to the device-mapper device it itself created.

> > blkdev_do_open() does not solve any problem for me at this time.
> > Instead, it represents the code shared by blkdev_get_by_dev() and
> > blkdev_get_file().  I decided to export it because it could be of
> > independent use to others.  In particular, it could potentially
> > simplify disk_scan_partitions() in block/genhd.c, pkt_new_dev() in
> > pktcdvd, backing_dev_store() in zram, and f2fs_scan_devices() in f2fs.
> 
> All these need to actually open the underlying device as they do I/O.
> Doing I/O without opening the device is a no-go.

blkdev_do_open() *does* open the device.  If it doesn’t, that’s a bug.
In v2 I will add the same access control checks that blkdev_get_by_dev()
does.  Is this sufficient?
  
Ming Lei Feb. 2, 2023, 8:49 a.m. UTC | #7
On Tue, Jan 31, 2023 at 11:27:59AM -0500, Demi Marie Obenour wrote:
> On Tue, Jan 31, 2023 at 12:53:03AM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 30, 2023 at 02:22:39PM -0500, Demi Marie Obenour wrote:
> > > What do you recommend instead?  This solves a real problem for
> > > device-mapper users and I am not aware of a better solution.
> > 
> > You could start with explaining the problem and what other methods
> > you tried that failed.  In the end it's not my job to fix your problem.
> 
> I’m working on a “block not-script” (Xen block device hotplug script
> written in C) for Qubes OS.  The current hotplug script is a shell
> script that takes a global lock, which serializes all invocations and
> significantly slows down VM creation and destruction.  My C program
> avoids this problem.
> 
> One of the goals of the not-script is to never leak resources, even if
> it dies with SIGKILL or is never called with the “remove” argument to

If it dies, you can still start a new instance to handle the device
leak, by running a simple daemon that monitors whether the not-script
is alive.

> destroy the devices it created.  Therefore, whenever possible, it relies
> on automatic destruction of devices that are no longer used.  I have

This automatic destruction of devices is supposed to be done in
userspace, because only userspace knows when a device is needed and
when it is not.

So I am not sure this kind of work belongs in the kernel.


Thanks, 
Ming
  
Demi Marie Obenour Feb. 2, 2023, 5:24 p.m. UTC | #8
On Thu, Feb 02, 2023 at 04:49:54PM +0800, Ming Lei wrote:
> On Tue, Jan 31, 2023 at 11:27:59AM -0500, Demi Marie Obenour wrote:
> > On Tue, Jan 31, 2023 at 12:53:03AM -0800, Christoph Hellwig wrote:
> > > On Mon, Jan 30, 2023 at 02:22:39PM -0500, Demi Marie Obenour wrote:
> > > > What do you recommend instead?  This solves a real problem for
> > > > device-mapper users and I am not aware of a better solution.
> > > 
> > > You could start with explaining the problem and what other methods
> > > you tried that failed.  In the end it's not my job to fix your problem.
> > 
> > I’m working on a “block not-script” (Xen block device hotplug script
> > written in C) for Qubes OS.  The current hotplug script is a shell
> > script that takes a global lock, which serializes all invocations and
> > significantly slows down VM creation and destruction.  My C program
> > avoids this problem.
> > 
> > One of the goals of the not-script is to never leak resources, even if
> > it dies with SIGKILL or is never called with the “remove” argument to
> 
> If it dies, you can still start a new instance to handle the device
> leak, by running a simple daemon that monitors whether the not-script
> is alive.

This requires userspace to maintain state that persists across process
restarts, and is also non-compositional.  If there was a userspace
daemon that was responsible for all block device management in the
system, this would be more reasonable, but no such daemon exists.
Furthermore, the amount of code required in userspace dwarfs the amount
of code my patches add to the kernel, both in size and complexity.

> > destroy the devices it created.  Therefore, whenever possible, it relies
> > on automatic destruction of devices that are no longer used.  I have
> 
> This automatic destruction of devices is supposed to be done in
> userspace, because only userspace knows when a device is needed and
> when it is not.

In my use-case, the last reference to the device is held by the blkback
driver in the kernel.  More generally, any case where a device is
created for a single purpose and should be destroyed when no longer
used will benefit from this.  Encrypted swap devices are a simple
example, as they can be destroyed with a single “swapoff” command.
  

Patch

diff --git a/block/bdev.c b/block/bdev.c
index edc110d90df4041e7d337976951bd0d17525f1f7..09cb5ef900ca9ad5b21250bb63e64cc2a79f9289 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -459,10 +459,33 @@  static struct file_system_type bd_type = {
 struct super_block *blockdev_superblock __read_mostly;
 EXPORT_SYMBOL_GPL(blockdev_superblock);
 
+static struct vfsmount *bd_mnt __read_mostly;
+
+struct file *
+blkdev_get_file(struct block_device *bdev, fmode_t flags, void *holder)
+{
+	struct inode *inode;
+	struct file *filp;
+	int ret;
+
+	ret = blkdev_do_open(bdev, flags, holder);
+	if (ret)
+		return ERR_PTR(ret);
+	inode = bdev->bd_inode;
+	filp = alloc_file_pseudo(inode, bd_mnt, "[block]", flags | O_CLOEXEC, &def_blk_fops);
+	if (IS_ERR(filp)) {
+		blkdev_put(bdev, flags);
+	} else {
+		filp->f_mapping = inode->i_mapping;
+		filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
+	}
+	return filp;
+}
+EXPORT_SYMBOL(blkdev_get_file);
+
 void __init bdev_cache_init(void)
 {
 	int err;
-	static struct vfsmount *bd_mnt;
 
 	bdev_cachep = kmem_cache_create("bdev_cache", sizeof(struct bdev_inode),
 			0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
@@ -775,7 +798,7 @@  void blkdev_put_no_open(struct block_device *bdev)
  *
  * Use this interface ONLY if you really do not have anything better - i.e. when
  * you are behind a truly sucky interface and all you are given is a device
- * number.  Everything else should use blkdev_get_by_path().
+ * number.  Everything else should use blkdev_get_by_path() or blkdev_do_open().
  *
  * CONTEXT:
  * Might sleep.
@@ -785,9 +808,7 @@  void blkdev_put_no_open(struct block_device *bdev)
  */
 struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
 {
-	bool unblock_events = true;
 	struct block_device *bdev;
-	struct gendisk *disk;
 	int ret;
 
 	ret = devcgroup_check_permission(DEVCG_DEV_BLOCK,
@@ -800,18 +821,52 @@  struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
 	bdev = blkdev_get_no_open(dev);
 	if (!bdev)
 		return ERR_PTR(-ENXIO);
-	disk = bdev->bd_disk;
+
+	ret = blkdev_do_open(bdev, mode, holder);
+	if (ret) {
+		blkdev_put_no_open(bdev);
+		return ERR_PTR(ret);
+	}
+
+	return bdev;
+}
+EXPORT_SYMBOL(blkdev_get_by_dev);
+
+/**
+ * blkdev_do_open - open a block device by device pointer
+ * @bdev: pointer to the device to open
+ * @mode: FMODE_* mask
+ * @holder: exclusive holder identifier
+ *
+ * Open the block device pointed to by @bdev. If @mode includes
+ * %FMODE_EXCL, the block device is opened with exclusive access.  Specifying
+ * %FMODE_EXCL with a %NULL @holder is invalid.  Exclusive opens may nest for
+ * the same @holder.
+ *
+ * Unlike blkdev_get_by_dev() and blkdev_get_by_path(), this function does not
+ * do any permission checks.  The most common use-case is where the device
+ * was freshly created by userspace.
+ *
+ * CONTEXT:
+ * Might sleep.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int blkdev_do_open(struct block_device *bdev, fmode_t mode, void *holder) {
+	struct gendisk *disk = bdev->bd_disk;
+	int ret = -ENXIO;
+	bool unblock_events = true;
 
 	if (mode & FMODE_EXCL) {
 		ret = bd_prepare_to_claim(bdev, holder);
 		if (ret)
-			goto put_blkdev;
+			return ret;
 	}
 
 	disk_block_events(disk);
 
 	mutex_lock(&disk->open_mutex);
-	ret = -ENXIO;
 	if (!disk_live(disk))
 		goto abort_claiming;
 	if (!try_module_get(disk->fops->owner))
@@ -842,7 +897,7 @@  struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
 
 	if (unblock_events)
 		disk_unblock_events(disk);
-	return bdev;
+	return 0;
 put_module:
 	module_put(disk->fops->owner);
 abort_claiming:
@@ -850,11 +905,9 @@  struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
 		bd_abort_claiming(bdev, holder);
 	mutex_unlock(&disk->open_mutex);
 	disk_unblock_events(disk);
-put_blkdev:
-	blkdev_put_no_open(bdev);
-	return ERR_PTR(ret);
+	return ret;
 }
-EXPORT_SYMBOL(blkdev_get_by_dev);
+EXPORT_SYMBOL(blkdev_do_open);
 
 /**
  * blkdev_get_by_path - open a block device by name
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 43d4e073b1115e4628a001081fbf08b296d342df..04635cb5ee29d22394a34c65eb34bea4e7847d8d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -325,6 +325,11 @@  typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
 
 void disk_set_zoned(struct gendisk *disk, enum blk_zoned_model model);
 
+struct file *
+blkdev_get_file(struct block_device *bdev, fmode_t flags, void *holder);
+
+int blkdev_do_open(struct block_device *bdev, fmode_t flags, void *holder);
+
 #ifdef CONFIG_BLK_DEV_ZONED
 
 #define BLK_ALL_ZONES  ((unsigned int)-1)