[RFC,1/4] init: move block device helpers from init/do_mounts.c
Commit Message
In init/do_mounts.c there are helper functions devt_from_partuuid,
devt_from_partlabel and devt_from_devname. In order to make these
functions available to other users, move them to block/bdev.c.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
---
block/bdev.c | 166 +++++++++++++++++++++++++++++++++++++++++
include/linux/blkdev.h | 15 ++++
init/do_mounts.c | 166 +----------------------------------------
3 files changed, 183 insertions(+), 164 deletions(-)
Comments
On Thu, Nov 17, 2022 at 12:42:43AM +0000, Daniel Golle wrote:
> In init/do_mounts.c there are helper functions devt_from_partuuid,
> devt_from_partlabel and devt_from_devname. In order to make these
> functions available to other users, move them to block/bdev.c.
Yes, they should be moved and I have an old patch to do this a bit
smarter. But that doesn't mean anyone else should call them.
Hi Christoph,
On Wed, Nov 16, 2022 at 09:55:55PM -0800, Christoph Hellwig wrote:
> On Thu, Nov 17, 2022 at 12:42:43AM +0000, Daniel Golle wrote:
> > In init/do_mounts.c there are helper functions devt_from_partuuid,
> > devt_from_partlabel and devt_from_devname. In order to make these
> > functions available to other users, move them to block/bdev.c.
>
> Yes, they should be moved and I have an old patch to do this a bit
> smarter. But that doesn't mean anyone else should call them.
I have followed your advise and implemented a "real" block driver
stacking on top of an existing block device to expose uImage.FIT
filesystem subimages as block devices. It mangles any bio submitted to
it and re-submits the bio with changed bi_bdev and sector offset added
to bi_iter.bi_sector for all list elements.
That works, but has slightly less utility value than the partition
parser approach as in this way I cannot easily populate the PARTNAME
uevent which can later help userspace to identify a device by the FIT
subimage name -- I'd have to either invent a new bus_type or
device_type, both seem inappropriate and have unwanted side effects.
Or am I missing something and there is a way to use add_uevent_var()
for a disk_type device?
In exchange, instead of using the PARTNAME= uevent var, I can now
freely name the device node by setting the disk_name string in
struct gendisk. While this doesn't offer the same amount of freedom, it
is sufficient for our use-case.
However, I don't see a way to avoid using (or duplicating)
devt_from_devname() to select the lower device somehow without having
to probe and parse *all* block devices present (which is, from what I
understood, what you want to avoid, and I agree that it is more safe to
not do that...)
Can you or anyone give some advise on how this should be done?
As block partitions are not present in device tree, using a cmdline
argument (with similar semantics to 'root=') specifying the lower block
device seems unavoidable.
Also note that due to libfdt not exporting symbols, the driver
currently cannot be built as a module. Hence, I would either need to
export most of libfdt API or only allow the driver to be built-into the
kernel. I don't see a problem with having it always built-in.
Yet another (imho not terrible) problem is removal of the lower device.
Many of the supported SBC use a micro SD card to boot, which can be
removed by the user while the system is running (which is generally not
a good idea, but anyway). For partitions this is handled automatically
by blk_drop_partitions() called directly from genhd.c.
I'm currently playing with doing something similar using the bus device
removal notification, but it doesn't seem to work for all cases, e.g.
mmcblk device do not seem to have the ->bus pointer populated at all
(ie. disk_to_dev(disk)->bus == NULL for mmcblk devices).
Any ideas regarding that?
Thank you for your advise!
Daniel
On Sat, Nov 19, 2022 at 04:03:11PM +0000, Daniel Golle wrote:
> That works, but has slightly less utility value than the partition
> parser approach as in this way I cannot easily populate the PARTNAME
> uevent which can later help userspace to identify a device by the FIT
> subimage name -- I'd have to either invent a new bus_type or
> device_type, both seem inappropriate and have unwanted side effects.
> Or am I missing something and there is a way to use add_uevent_var()
> for a disk_type device?
You're not exposing a partition here - this is an image format that
sits in a partition and we should not pretend that is a partition.
> However, I don't see a way to avoid using (or duplicating)
> devt_from_devname() to select the lower device somehow without having
> to probe and parse *all* block devices present (which is, from what I
> understood, what you want to avoid, and I agree that it is more safe to
> not do that...)
>
> Can you or anyone give some advise on how this should be done?
Just set the block driver up from an initramfs, like we do for all
modern stackable drivers.
> Yet another (imho not terrible) problem is removal of the lower device.
> Many of the supported SBC use a micro SD card to boot, which can be
> removed by the user while the system is running (which is generally not
> a good idea, but anyway). For partitions this is handled automatically
> by blk_drop_partitions() called directly from genhd.c.
> I'm currently playing with doing something similar using the bus device
> removal notification, but it doesn't seem to work for all cases, e.g.
> mmcblk device do not seem to have the ->bus pointer populated at all
> (ie. disk_to_dev(disk)->bus == NULL for mmcblk devices).
I have WIP patches that allow the claimer of a block device get
resize and removal notification. It's not going to land for 6.2,
but I hope I have it ready in time for the next merge window.
Hi Christoph,
On Tue, Nov 22, 2022 at 04:37:08AM -0800, Christoph Hellwig wrote:
> On Sat, Nov 19, 2022 at 04:03:11PM +0000, Daniel Golle wrote:
> > That works, but has slightly less utility value than the partition
> > parser approach as in this way I cannot easily populate the PARTNAME
> > uevent which can later help userspace to identify a device by the FIT
> > subimage name -- I'd have to either invent a new bus_type or
> > device_type, both seem inappropriate and have unwanted side effects.
> > Or am I missing something and there is a way to use add_uevent_var()
> > for a disk_type device?
>
> You're not exposing a partition here - this is an image format that
> sits in a partition and we should not pretend that is a partition.
It doesn't need to be literally the PARTNAME uevent, just any way to
communicate the names of mapped subimages to userspace.
My understanding by now is that there is no way around introducing a
new device_type and then mitigate the unwanted side effects by
follow-up changes, ie. make it possible to use that new device_type
when specifying the rootfs= cmdline variable (currently only disks and
partitions are considered there).
Or give up on the idea that uImage.FIT subimages mapped by the new
driver can be identified by userspace by poking uevent from sysfs and
just rely on convention and ordering.
>
> > However, I don't see a way to avoid using (or duplicating)
> > devt_from_devname() to select the lower device somehow without having
> > to probe and parse *all* block devices present (which is, from what I
> > understood, what you want to avoid, and I agree that it is more safe to
> > not do that...)
> >
> > Can you or anyone give some advise on how this should be done?
>
> Just set the block driver up from an initramfs, like we do for all
> modern stackable drivers.
Instead of using a kernel cmdline parameter we could also have the
bootloader embed that information as string in the 'chosen' section in
the device tree blob, right next to the cmdline. However, as there is
no representation of block partitions in device tree, also in that case
the lower device will have to be referenced by a string somehow, ie.
devt_from_devname() or the like will be needed.
Needing an initramfs, even if it boils down to just one statically
compile executable, is a massive bloat and complication when building
embedded device firmware and none of the over 1580 devices currently
supported by OpenWrt need an intermediate initramfs to mount their
on-flash squashfs rootfs (some, however, already use this uImage.FIT
partition parser, and not needing a downstream patch for that would be
nice).
uImage.FIT typically contains the complete firmware used on an embedded
device, ie. at least a Linux kernel, device tree blob and a filesystem.
The main use of this whole uImage.FIT-parsing-in-Linux approach I'm
trying to get across here is to expose one or more 'filesystem'-type
subimages of such an image as block devices, also so that one of them
can directly be mounted as rootfs by the kernel.
This *replaces* the use of 'ramdisk' type sub-images which need to
remain allocated at runtime, while using a squashfs 'filesystem' type
sub-image allows freeing the filesystem cache if ram is becomes scarce.
As both, storage and memory, are often very limited on small embedded
devices, OpenWrt has always been using a squashfs as rootfs with a
storage-type specific filesytem used as r/w overlay on top. Up to now,
the rootfs is often stored in platform-specific ways, ie. an additional
partition on block devices, MTD partition on NOR flash or UBI volume on
NAND flash.
Carrying the read-only squashfs filesystem inside the uImage.FIT
structure has the advantage of being agnostic regarding the
storage-type (NOR/mtdblockX, NAND/ubiblockX, MMC/mmcblkXpY) and allows
the bootloader to validate the filesystem hash before starting the
kernel, ie. ensuring integrity of the firmware as-a-whole which
includes the root filesystem.
>
> > Yet another (imho not terrible) problem is removal of the lower device.
> > Many of the supported SBC use a micro SD card to boot, which can be
> > removed by the user while the system is running (which is generally not
> > a good idea, but anyway). For partitions this is handled automatically
> > by blk_drop_partitions() called directly from genhd.c.
> > I'm currently playing with doing something similar using the bus device
> > removal notification, but it doesn't seem to work for all cases, e.g.
> > mmcblk device do not seem to have the ->bus pointer populated at all
> > (ie. disk_to_dev(disk)->bus == NULL for mmcblk devices).
>
> I have WIP patches that allow the claimer of a block device get
> resize and removal notification. It's not going to land for 6.2,
> but I hope I have it ready in time for the next merge window.
I'm looking forward to integrate that in the uImage.FIT block driver
I've been working on. In the meantime, should I already post my current
draft so we can start discussing if that solution could be acceptable?
Best regards
Daniel
On Fri, Dec 09, 2022 at 05:00:34PM +0000, Daniel Golle wrote:
> It doesn't need to be literally the PARTNAME uevent, just any way to
> communicate the names of mapped subimages to userspace.
>
> My understanding by now is that there is no way around introducing a
> new device_type and then mitigate the unwanted side effects by
> follow-up changes, ie. make it possible to use that new device_type
> when specifying the rootfs= cmdline variable (currently only disks and
> partitions are considered there).
>
> Or give up on the idea that uImage.FIT subimages mapped by the new
> driver can be identified by userspace by poking uevent from sysfs and
> just rely on convention and ordering.
To be honest I'm still really confused by the patch set. What is
the problem with simply using an initramfs, which is the way to
deal with any non-trivial init issue for about 20 years now, predated
by the somewhat more awkware initrd...
> Needing an initramfs, even if it boils down to just one statically
> compile executable, is a massive bloat and complication when building
> embedded device firmware and none of the over 1580 devices currently
> supported by OpenWrt need an intermediate initramfs to mount their
> on-flash squashfs rootfs (some, however, already use this uImage.FIT
> partition parser, and not needing a downstream patch for that would be
> nice).
You can always build the initrams into the kernel image, I don't
see how that is bloat.
> Carrying the read-only squashfs filesystem inside the uImage.FIT
> structure has the advantage of being agnostic regarding the
> storage-type (NOR/mtdblockX, NAND/ubiblockX, MMC/mmcblkXpY) and allows
> the bootloader to validate the filesystem hash before starting the
> kernel, ie. ensuring integrity of the firmware as-a-whole which
> includes the root filesystem.
What is the point of the uImage.FIT? Why can't you use a sane
format? Or at least not use it on top of partitions, which is just
braindead. Who actually came up with this amazingly convoluted and
annoying scheme and why?
On Mon, Dec 12, 2022 at 01:03:09AM -0800, Christoph Hellwig wrote:
> On Fri, Dec 09, 2022 at 05:00:34PM +0000, Daniel Golle wrote:
> > It doesn't need to be literally the PARTNAME uevent, just any way to
> > communicate the names of mapped subimages to userspace.
> >
> > My understanding by now is that there is no way around introducing a
> > new device_type and then mitigate the unwanted side effects by
> > follow-up changes, ie. make it possible to use that new device_type
> > when specifying the rootfs= cmdline variable (currently only disks and
> > partitions are considered there).
> >
> > Or give up on the idea that uImage.FIT subimages mapped by the new
> > driver can be identified by userspace by poking uevent from sysfs and
> > just rely on convention and ordering.
>
> To be honest I'm still really confused by the patch set. What is
> the problem with simply using an initramfs, which is the way to
> deal with any non-trivial init issue for about 20 years now, predated
> by the somewhat more awkware initrd...
The thing is that there isn't anything extraordinarily complex here,
just dynamically partitioning a block device based on a well-known
format.
>
> > Needing an initramfs, even if it boils down to just one statically
> > compile executable, is a massive bloat and complication when building
> > embedded device firmware and none of the over 1580 devices currently
> > supported by OpenWrt need an intermediate initramfs to mount their
> > on-flash squashfs rootfs (some, however, already use this uImage.FIT
> > partition parser, and not needing a downstream patch for that would be
> > nice).
>
> You can always build the initrams into the kernel image, I don't
> see how that is bloat.
Using initramfs implies that we would need a 2nd copy of the standard C
library and libfdt, both alone will already occupy more than just a
single 64kB block of flash. I understand that from the point of view of
classic x86 servers or even embedded devices with eMMC this seems
negligible. However, keep in mind that a huge number of existing
devices and also new designs of embedded devices often boot from just a
few megabytes of NOR flash, and there every byte counts.
>
> > Carrying the read-only squashfs filesystem inside the uImage.FIT
> > structure has the advantage of being agnostic regarding the
> > storage-type (NOR/mtdblockX, NAND/ubiblockX, MMC/mmcblkXpY) and allows
> > the bootloader to validate the filesystem hash before starting the
> > kernel, ie. ensuring integrity of the firmware as-a-whole which
> > includes the root filesystem.
>
> What is the point of the uImage.FIT?
It is the format used by Das U-Boot, which is by far the most common
bootloader found on small embedded devices running Linux.
Is is already used by Das U-Boot to validate and load kernel,
devicetree, initramfs, ... to RAM before launching Linux.
I've included a link to the documentation[1] which gives insights
regarding the motivation to create such a format.
> Why can't you use a sane format?
Define "sane format". :) And tell that to the people over at Das U-Boot
please. I'm just trying to make best use of the status-quo which is
that uImage.FIT is the de-facto standard to boot Linux on small
embedded devices.
> Or at least not use it on top of partitions, which is just
> braindead.
In fact, that's only one out of three possible uses in which parsing
the contained sub-image boundaries can be useful:
* On devices with NOR flash uImage.FIT is stored in an MTD partition,
hence the uImage.FIT partition parser (or small stackable block
driver) would then operate on top of /dev/mtdblockX.
* On devices with NAND flash uImage.FIT is stored in a UBI volume,
hence in this case /dev/ubiblockX needs to be processed.
* On devices with block storage (like eMMC or SD card) the image is
stored on a (GPT) partition (/dev/mmcblkXpY).
Generally speaking, when it comes to storing the rootfs (typically
squashfs), this can be solved in two ways:
a) keep the rootfs inside the uImage.FIT structure (to be then dealt
with by this patch series)
b) store the rootfs on an additional partition or volume:
- additional GPT partition on block devices
- additional MTD partition on devices with NOR flash
- additional UBI volume on devices with NAND flash
Now lets look at the consequences of each approach:
I.
It is often a requirement that the bootloader shall decide about the
integrity of the firmware before starting it (ie. verify hashes of
kernel, dtb, **and rootfs**). In case of approch b) this will have to
be implemented in a custom, platform-specific way; many vendors do
things like this, resulting in huge variety of different, customs ways
which let the bootloader validate the rootfs. As a huge majority of
devices actually use Das U-Boot which already supports validating
uImage.FIT content, this becomes a no-brainer when using approach a).
II.
Updating the firmware in case of approach a) being used is very simple:
Just write the whole image to a storage volume. In case of approach b)
an archive or tarball has to be used, allowing kernel.uImage and rootfs
to each be written to different storage volumes. And for the special
case of NOR flash, developers came up with a complete out-of-tree
subsystem handling only the splitting of MTD partitions in the various
ways used by vendor-modified bootloaders [2].
III.
Approach a) allows the very same uImage.FIT binary to be read by the
bootloader, used as firmware-upgrade format within Linux userland and
also serve as a recovery image format which can be loaded e.g. via TFTP
and stored by the bootloader. Approach b) will require different
formats for firmware upgrades (eg. a tarball) and storage on-disk,
depending on the platform and type of storage media. And when it comes
to bootloader-level recovery via TFTP, one would have to implement
extraction of that tarball **by the bootloader** or use yet another
format.
IV.
When it comes to devices allowing to boot from different media,
approach a) allows using the exact same firmware image independently of
the storage media used for booting. When using approach b), different
firmware images have to be provided **for the same device**, depending
on whether a block device, NAND/UBI or NOR/MTD is used for booting.
The BananaPi BPi-R64 and more recent BananaPi BPi-R3 boards are good
examples for boards which allow booting of different media, and are
already using uImage.FIT as a unified format.
I hope this explains my motivation. Please ask should there by any
doubts or if any of my explainations above are not clear.
> Who actually came up with this amazingly convoluted and
> annoying scheme and why?
This may answer your question:
[1]: https://source.denx.de/u-boot/u-boot/-/blob/master/doc/uImage.FIT/howto.txt
[2]: https://git.openwrt.org/?p=openwrt/openwrt.git;a=tree;f=target/linux/generic/files/drivers/mtd/mtdsplit
On Mon, Dec 12, 2022 at 11:02:33AM +0000, Daniel Golle wrote:
> The thing is that there isn't anything extraordinarily complex here,
> just dynamically partitioning a block device based on a well-known
> format.
Yes, but a completely non-standard format that nests inside an
partition.
> Using initramfs implies that we would need a 2nd copy of the standard C
> library and libfdt, both alone will already occupy more than just a
> single 64kB block of flash.
Why do you need libfdt? And with a simple statically linked kpartx
you won't pull in much of libc either.
> I understand that from the point of view of
> classic x86 servers or even embedded devices with eMMC this seems
> negligible. However, keep in mind that a huge number of existing
> devices and also new designs of embedded devices often boot from just a
> few megabytes of NOR flash, and there every byte counts.
So I've worked quite a bit on really small deeply embedded systems,
and for those I wouldn't even think of using strange image formats
or the rather wasteful GPT partition format.
There we wouldn't dare to use paritions or weird image formats, but
> > What is the point of the uImage.FIT?
>
> It is the format used by Das U-Boot, which is by far the most common
> bootloader found on small embedded devices running Linux.
> Is is already used by Das U-Boot to validate and load kernel,
> devicetree, initramfs, ... to RAM before launching Linux.
> I've included a link to the documentation[1] which gives insights
> regarding the motivation to create such a format.
That doesn't explain why you'd want to use it. Nor how people
came up with it.
> In fact, that's only one out of three possible uses in which parsing
> the contained sub-image boundaries can be useful:
> * On devices with NOR flash uImage.FIT is stored in an MTD partition,
> hence the uImage.FIT partition parser (or small stackable block
> driver) would then operate on top of /dev/mtdblockX.
>
> * On devices with NAND flash uImage.FIT is stored in a UBI volume,
> hence in this case /dev/ubiblockX needs to be processed.
And all the mtdblock / ubiblock is due to the lack of a native
mtd/ubi backend for squashfs? Why can't we take the block layer
out of the loop entirely?
> I hope this explains my motivation. Please ask should there by any
> doubts or if any of my explainations above are not clear.
None of this explains the silly nesting inside the GPT partition.
It is not needed for the any use cases and the root probem here.
Without that you could simply implement a parition format, with that
you get into crazy nesting behavior. Note that it would have
any benefit over just not doing this silly image.
Maybe someone just needs to go back and come up wit ha scheme that
actually works and implement that in uboot as well.
On Mon, Dec 12, 2022 at 10:48:12PM -0800, Christoph Hellwig wrote:
> On Mon, Dec 12, 2022 at 11:02:33AM +0000, Daniel Golle wrote:
> > The thing is that there isn't anything extraordinarily complex here,
> > just dynamically partitioning a block device based on a well-known
> > format.
>
> Yes, but a completely non-standard format that nests inside an
> partition.
The reason for this current discussion (see subject line) is exactly
that you didn't like the newly introduced partition type GUID which
then calls the newly introduced partition parser taking care of the
uImage.FIT content of a partition.
Instead you suggested to implement it as a small stackable block
driver, which I did (and will post as RFC in a moment from now).
This block driver (if built-into the kernel and relied upon to expose
the block device used as root filesystem) will need to identify the
lower device it should work on. And for that the helper functions such
as devt_from_devname() need to be available for that driver.
>
> > Using initramfs implies that we would need a 2nd copy of the standard C
> > library and libfdt, both alone will already occupy more than just a
> > single 64kB block of flash.
>
> Why do you need libfdt?
uImage.FIT is based on FDT, so the easiest way to parse this format
without reinventing the wheel is using libfdt.
> And with a simple statically linked kpartx you won't pull in much of
> libc either.
Well, enough to have kpartx and libfdt covered. I still assume that's
more than 64KiB of compressed bytecode in total.
>
> > I understand that from the point of view of
> > classic x86 servers or even embedded devices with eMMC this seems
> > negligible. However, keep in mind that a huge number of existing
> > devices and also new designs of embedded devices often boot from just a
> > few megabytes of NOR flash, and there every byte counts.
>
> So I've worked quite a bit on really small deeply embedded systems,
> and for those I wouldn't even think of using strange image formats
> or the rather wasteful GPT partition format.
> There we wouldn't dare to use paritions or weird image formats, but
Well, the reality as of today is that even the smallest of all embedded
devices will need updated firmware at some point.
Using an image format which allows to dynamically partition a firmware
image into kernel and rootfs parts is needed **especially** on devices
with very little storage space -- it's there that we cannot afford to
use hard-coded MTD partition sizes either limiting or overestimating
the future kernel size.
And having a unified format which serves a whole range of devices does
make that easier -- surely, on NOR flash, there would not be a GPT
partition; mtdblock on top of an MTD partition is used in that case.
>
> > > What is the point of the uImage.FIT?
> >
> > It is the format used by Das U-Boot, which is by far the most common
> > bootloader found on small embedded devices running Linux.
> > Is is already used by Das U-Boot to validate and load kernel,
> > devicetree, initramfs, ... to RAM before launching Linux.
> > I've included a link to the documentation[1] which gives insights
> > regarding the motivation to create such a format.
>
> That doesn't explain why you'd want to use it. Nor how people
> came up with it.
I didn't want to quote all the history here, but it is explained rather
well in their documentation I had linked there.
>
> > In fact, that's only one out of three possible uses in which parsing
> > the contained sub-image boundaries can be useful:
> > * On devices with NOR flash uImage.FIT is stored in an MTD partition,
> > hence the uImage.FIT partition parser (or small stackable block
> > driver) would then operate on top of /dev/mtdblockX.
> >
> > * On devices with NAND flash uImage.FIT is stored in a UBI volume,
> > hence in this case /dev/ubiblockX needs to be processed.
>
> And all the mtdblock / ubiblock is due to the lack of a native
> mtd/ubi backend for squashfs? Why can't we take the block layer
> out of the loop entirely?
A block representation is the common denominator of all the
above. Sure, I could implement splitting MTD devices according to
uImage.FIT and then add MTD support to squashfs. Then implement
splitting of UBI volumes and add UBI support to squashfs.
However, simply using mtdblock on NOR and ubiblock on NAND seem
well-suited and hence there wasn't ever a need to implement
MTD or UBI support for any **read-only** filesystem.
Afaik that is what mtdblock and ubiblock are intended to be used
for.
> > I hope this explains my motivation. Please ask should there by any
> > doubts or if any of my explainations above are not clear.
>
> None of this explains the silly nesting inside the GPT partition.
> It is not needed for the any use cases and the root probem here.
So where would you store the uImage (which will have to exist
even to just load kernel and DTB in U-Boot, even without containing
the root filesystem) on devices with eMMC then?
I did consider block2mtd on eMMC and gluebi (which is deprecated) on
UBI, but in order to mount the filesystem I'd then use mtdblock further
down the road, and that'd actually be silly.
> Without that you could simply implement a parition format, with that
> you get into crazy nesting behavior. Note that it would have
> any benefit over just not doing this silly image.
Are you suggesting to come up with an entirely new type of partition
table only for that purpose? Which will require its own tools and
implementation in both, U-Boot and Linux? What would be the benefit
over just using GPT partitioning?
I'm not saying that this cannot be done (and I will do it if you
convince me that it'd be needed).
>
> Maybe someone just needs to go back and come up wit ha scheme that
> actually works and implement that in uboot as well.
In terms of "works" both are working very well, the uImage.FIT
partition parser as well as the tiny stackable block driver you had
asked me to write instead. I've tested both on multiple devices by now,
the former is even used in production on one of the currently most
popular devices running OpenWrt (according to update backend stats).
Thank you for your patience!
Daniel
On Tue, Dec 13, 2022 at 12:45:27PM +0000, Daniel Golle wrote:
> > Yes, but a completely non-standard format that nests inside an
> > partition.
>
> The reason for this current discussion (see subject line) is exactly
> that you didn't like the newly introduced partition type GUID which
> then calls the newly introduced partition parser taking care of the
> uImage.FIT content of a partition.
Which is the exact nesting I'm complaining about. Why do you need
to use your format inside a GPT partition table? What you're doing
is bascially nesting a partition table format inside another one,
which doesn't make any sense at all.
> This block driver (if built-into the kernel and relied upon to expose
> the block device used as root filesystem) will need to identify the
> lower device it should work on. And for that the helper functions such
> as devt_from_devname() need to be available for that driver.
And devt_from_devname must not be used by more non-init code. It is
bad it got exposed at all, but new users are not acceptable.
> A block representation is the common denominator of all the
> above. Sure, I could implement splitting MTD devices according to
> uImage.FIT and then add MTD support to squashfs. Then implement
> splitting of UBI volumes and add UBI support to squashfs.
Implementing MTD and/or UBI support would allow you to build a
kernel without CONFIG_BLOCK, which will save you a lot more than
the 64k you were whining about above.
> > None of this explains the silly nesting inside the GPT partition.
> > It is not needed for the any use cases and the root probem here.
>
> So where would you store the uImage (which will have to exist
> even to just load kernel and DTB in U-Boot, even without containing
> the root filesystem) on devices with eMMC then?
Straight on the block device, where else?
> Are you suggesting to come up with an entirely new type of partition
> table only for that purpose? Which will require its own tools and
> implementation in both, U-Boot and Linux? What would be the benefit
> over just using GPT partitioning?
Why do you need another layer of partitioning instead of storing
all your information either in the uImage, or in some other
partition format of your choice?
See, if you have GPT, DOS or whatever partitions, you just use
partitions and store all the bits your care about in them.
If you want a fancy not invented here syndrome image format you use that.
But don't use both.
On Wed, Dec 14, 2022 at 08:43:29AM -0800, Christoph Hellwig wrote:
> On Tue, Dec 13, 2022 at 12:45:27PM +0000, Daniel Golle wrote:
> > > Yes, but a completely non-standard format that nests inside an
> > > partition.
> >
> > The reason for this current discussion (see subject line) is exactly
> > that you didn't like the newly introduced partition type GUID which
> > then calls the newly introduced partition parser taking care of the
> > uImage.FIT content of a partition.
>
> Which is the exact nesting I'm complaining about. Why do you need
> to use your format inside a GPT partition table?
The GPT partition table is typically written only once to an eMMC-
based device in factory. Firmware images (typically uImage.FIT) are
stored in partitions because there are sometimes two of them (for
A/B dual-boot, or recovery/production dual-boot).
As a working device firmware consists of kernel, dtb and rootfs, all
these three things have to be written and used together, typically
they also come together in one file for firmware upgrade (ie.
rootfs appended to kernel, tarballs, or uImage.FIT containing all three
of them).
As the size of kernel and rootfs cannot be determined accurately at
the time the device is made, having individual GPT partitions for
kernel and rootfs ends up to either being a limit to future groth of
the kernel image or wastes space by overestimating the kernel size.
Changing the GPT partitioning when updating the device to match the
exact sizes is also not an option as a damage to the GPT would then
present a single point of failure (backup GPT also wouldn't help
much here), so for dual-boot to actually be meaningful, we shouldn't
ever write to any parts of the disk/flash which affect more than one
of the dual-boot options.
> What you're doing is bascially nesting a partition table format
> inside another one, which doesn't make any sense at all.
See the last paragraph of this message for good reasons why one would
want to do exactly that.
>
> > This block driver (if built-into the kernel and relied upon to expose
> > the block device used as root filesystem) will need to identify the
> > lower device it should work on. And for that the helper functions such
> > as devt_from_devname() need to be available for that driver.
>
> And devt_from_devname must not be used by more non-init code. It is
> bad it got exposed at all, but new users are not acceptable.
I assume that implementing anything similar using blk_lookup_devt in
the driver itself is then also not acceptable, right?
Yet another option would be to implement a way to acquire this
information from device tree. Ie. have a reference to the disk device
as well as an unsigned integer in the 'chosen' node which the
bootloader can use to communicate this to the kernel. Example:
chosen {
bootdev = <&mmc0 3>;
};
It's a bit more tricky for ubiblock or mtdblock devices because they
don't have *any* representation in device tree at all at this point.
In case of an MTD partition (for mtdblockX) we would just reference
the mtd partition in the same way.
To do this cleanly with NAND/UBI, I'd start with adding
device-tree-based attaching of an MTD partition to UBI using a
device-tree attribute 'compatible = "linux,ubi"' on the MTD partition.
We could then have sub-nodes referencing specific UBI volumes, to
select them for use with ubiblock but also for those nodes then
being a valid reference for use with the to-be-introduced 'bootdev'
attribute in 'chosen'.
Does that sound acceptable from your perspective?
>
> > A block representation is the common denominator of all the
> > above. Sure, I could implement splitting MTD devices according to
> > uImage.FIT and then add MTD support to squashfs. Then implement
> > splitting of UBI volumes and add UBI support to squashfs.
>
> Implementing MTD and/or UBI support would allow you to build a
> kernel without CONFIG_BLOCK, which will save you a lot more than
> the 64k you were whining about above.
Even devices with NOR flash may still want support for removable
block devices like USB pendrives or SD cards... Many home-routers
got only 8MiB of NOR flash and yet come with USB 2.0 ports intended
for a pendrive which is then shared via Samba.
Also, heavily customzied per-device kernel builds would never
scale up to support thousands of devices -- hence OpenWrt uses the
exact same kernel build for many devices, which makes both, the
build process and also debugging kernel bugs, much much easier (or
even doable at all).
>
> > > None of this explains the silly nesting inside the GPT partition.
> > > It is not needed for the any use cases and the root probem here.
> >
> > So where would you store the uImage (which will have to exist
> > even to just load kernel and DTB in U-Boot, even without containing
> > the root filesystem) on devices with eMMC then?
>
> Straight on the block device, where else?
As the first few blocks are typically used for bootloader code and
bootloader environment, we would then need to hard-code the offset(s)
of the uImage.FIT on the block device. Imho this becomes messy and just
using partitions seemed like a straight forward solution.
And what about dual-boot systems where you have more than one firmware
image? Hard-code more offsets? For each device?
In a way, I was considering this by using blkdevparts= cmdline option
instead of GPT, but
>
> > Are you suggesting to come up with an entirely new type of partition
> > table only for that purpose? Which will require its own tools and
> > implementation in both, U-Boot and Linux? What would be the benefit
> > over just using GPT partitioning?
>
> Why do you need another layer of partitioning instead of storing
> all your information either in the uImage, or in some other
> partition format of your choice?
The reason is the different life-cycle of the device main partition
table, bootloader, bootloader environment, ... on one hand and each
firmware image on a dual boot system on the other hand.
Hence there is more than just one uImage: typically bootloader,
bootloader environment, firmware A (uImage.FIT) and firmware B.
Relace "A" and "B" with "recovery" and "production", depending on the
dual-boot style implemented.
Therefore re-writing the whole disk during firmware upgrades is not an
option because it is risky, eg. in case of a sudden power failure we
could end up with a hard-bricked system. So to me it makes sense that
for a firmware upgrade, we write only to one partition and don't touch
GPT or anything else on the device. So in case something goes wrong,
the device will still boot, the bootloader will realize that the
uImage.FIT in one partition is broken (uImage.FIT also comes with
hashes to ensure image integrity) and it will load something else (from
another partition) instead.
On Wed, Dec 14, 2022 at 08:01:48PM +0000, Daniel Golle wrote:
> I assume that implementing anything similar using blk_lookup_devt in
> the driver itself is then also not acceptable, right?
No. blk_lookup_devt is a workaround for the early init code where
the file system is not available yet. The only reasonable use case
is the early init code, but even that is better handled by an initramfs.
> Yet another option would be to implement a way to acquire this
> information from device tree. Ie. have a reference to the disk device
> as well as an unsigned integer in the 'chosen' node which the
> bootloader can use to communicate this to the kernel. Example:
> chosen {
> bootdev = <&mmc0 3>;
> };
>
> It's a bit more tricky for ubiblock or mtdblock devices because they
> don't have *any* representation in device tree at all at this point.
>
> In case of an MTD partition (for mtdblockX) we would just reference
> the mtd partition in the same way.
>
> To do this cleanly with NAND/UBI, I'd start with adding
> device-tree-based attaching of an MTD partition to UBI using a
> device-tree attribute 'compatible = "linux,ubi"' on the MTD partition.
> We could then have sub-nodes referencing specific UBI volumes, to
> select them for use with ubiblock but also for those nodes then
> being a valid reference for use with the to-be-introduced 'bootdev'
> attribute in 'chosen'.
>
> Does that sound acceptable from your perspective?
Device tree questions are really out of my knowledge and domain, the
right people to talk to would be the device-tree and relevant driver
maintainers.
> Even devices with NOR flash may still want support for removable
> block devices like USB pendrives or SD cards... Many home-routers
> got only 8MiB of NOR flash and yet come with USB 2.0 ports intended
> for a pendrive which is then shared via Samba.
Ok, so we're not _that_ resource contrained at all, because for the
really small devices with just a few MB of ram, not having the block
layer makes a huge difference.
> As the first few blocks are typically used for bootloader code and
> bootloader environment, we would then need to hard-code the offset(s)
> of the uImage.FIT on the block device. Imho this becomes messy and just
> using partitions seemed like a straight forward solution.
Ok, so the whole DOS-World boot load nightmare has spread to these
devices as well.
> And what about dual-boot systems where you have more than one firmware
> image? Hard-code more offsets? For each device?
Is this for a fail save use case where on image is update while the
other on is in use? I've usually seen people use two different NOR
chips for that to have full error isolation.
> The reason is the different life-cycle of the device main partition
> table, bootloader, bootloader environment, ... on one hand and each
> firmware image on a dual boot system on the other hand.
Oh, so the firmware image doesn't even include the bootloader, but the
bootload is on the same device? That just seems like a very odd
setup, but I'll take your word for it being used.
> Therefore re-writing the whole disk during firmware upgrades is not an
> option because it is risky, eg. in case of a sudden power failure we
> could end up with a hard-bricked system. So to me it makes sense that
> for a firmware upgrade, we write only to one partition and don't touch
> GPT or anything else on the device. So in case something goes wrong,
> the device will still boot, the bootloader will realize that the
> uImage.FIT in one partition is broken (uImage.FIT also comes with
> hashes to ensure image integrity) and it will load something else (from
> another partition) instead.
I'm just really confused about this whole setup. Maybe it's just
because what I've worked with, which is either really deeply embededd
devices where the everything is one image, that you might in some case
have twice, or a full blown UEFI-like setup where the boot loader is
separate, but can simply load a kernel from the actual root file system.
I have to say even after all these round trip I'm still not sure
what problem where really solving. In basically all other environments
that have a powerful enough bootloader to read file systems your rootfs
would simply store the kernel image and dtb if they are interdependent.
Is the root cause that uboot doesn't support reading files from squashfs
despite supporting half a dozen other file systems?
Hi Christoph,
On Tue, Nov 22, 2022 at 04:37:08AM -0800, Christoph Hellwig wrote:
> On Sat, Nov 19, 2022 at 04:03:11PM +0000, Daniel Golle wrote:
> > [...]
> > Yet another (imho not terrible) problem is removal of the lower device.
> > Many of the supported SBC use a micro SD card to boot, which can be
> > removed by the user while the system is running (which is generally not
> > a good idea, but anyway). For partitions this is handled automatically
> > by blk_drop_partitions() called directly from genhd.c.
> > I'm currently playing with doing something similar using the bus device
> > removal notification, but it doesn't seem to work for all cases, e.g.
> > mmcblk device do not seem to have the ->bus pointer populated at all
> > (ie. disk_to_dev(disk)->bus == NULL for mmcblk devices).
>
> I have WIP patches that allow the claimer of a block device get
> resize and removal notification. It's not going to land for 6.2,
> but I hope I have it ready in time for the next merge window.
Any news about that patchset? I'd happily review, test and use it ;)
@@ -1092,3 +1092,169 @@ void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
blkdev_put_no_open(bdev);
}
+
+struct uuidcmp {
+ const char *uuid;
+ int len;
+};
+
+/**
+ * match_dev_by_uuid - callback for finding a partition using its uuid
+ * @dev: device passed in by the caller
+ * @data: opaque pointer to the desired struct uuidcmp to match
+ *
+ * Returns 1 if the device matches, and 0 otherwise.
+ */
+static int match_dev_by_uuid(struct device *dev, const void *data)
+{
+ struct block_device *bdev = dev_to_bdev(dev);
+ const struct uuidcmp *cmp = data;
+
+ if (!bdev->bd_meta_info ||
+ strncasecmp(cmp->uuid, bdev->bd_meta_info->uuid, cmp->len))
+ return 0;
+ return 1;
+}
+
+/**
+ * devt_from_partuuid - looks up the dev_t of a partition by its UUID
+ * @uuid_str: char array containing ascii UUID
+ *
+ * The function will return the first partition which contains a matching
+ * UUID value in its partition_meta_info struct. This does not search
+ * by filesystem UUIDs.
+ *
+ * If @uuid_str is followed by a "/PARTNROFF=%d", then the number will be
+ * extracted and used as an offset from the partition identified by the UUID.
+ *
+ * Returns the matching dev_t on success or 0 on failure.
+ */
+dev_t devt_from_partuuid(const char *uuid_str, int *root_wait)
+{
+ struct uuidcmp cmp;
+ struct device *dev = NULL;
+ dev_t devt = 0;
+ int offset = 0;
+ char *slash;
+
+ cmp.uuid = uuid_str;
+
+ slash = strchr(uuid_str, '/');
+ /* Check for optional partition number offset attributes. */
+ if (slash) {
+ char c = 0;
+
+ /* Explicitly fail on poor PARTUUID syntax. */
+ if (sscanf(slash + 1, "PARTNROFF=%d%c", &offset, &c) != 1)
+ goto clear_root_wait;
+ cmp.len = slash - uuid_str;
+ } else {
+ cmp.len = strlen(uuid_str);
+ }
+
+ if (!cmp.len)
+ goto clear_root_wait;
+
+ dev = class_find_device(&block_class, NULL, &cmp, &match_dev_by_uuid);
+ if (!dev)
+ return 0;
+
+ if (offset) {
+ /*
+ * Attempt to find the requested partition by adding an offset
+ * to the partition number found by UUID.
+ */
+ devt = part_devt(dev_to_disk(dev),
+ dev_to_bdev(dev)->bd_partno + offset);
+ } else {
+ devt = dev->devt;
+ }
+
+ put_device(dev);
+ return devt;
+
+clear_root_wait:
+ pr_err("VFS: PARTUUID= is invalid.\n"
+ "Expected PARTUUID=<valid-uuid-id>[/PARTNROFF=%%d]\n");
+ if (root_wait && *root_wait) {
+ pr_err("Disabling rootwait; root= is invalid.\n");
+ *root_wait = 0;
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(devt_from_partuuid);
+
+/**
+ * match_dev_by_label - callback for finding a partition using its label
+ * @dev: device passed in by the caller
+ * @data: opaque pointer to the label to match
+ *
+ * Returns 1 if the device matches, and 0 otherwise.
+ */
+static int match_dev_by_label(struct device *dev, const void *data)
+{
+ struct block_device *bdev = dev_to_bdev(dev);
+ const char *label = data;
+
+ if (!bdev->bd_meta_info || strcmp(label, bdev->bd_meta_info->volname))
+ return 0;
+ return 1;
+}
+
+dev_t devt_from_partlabel(const char *label)
+{
+ struct device *dev;
+ dev_t devt = 0;
+
+ dev = class_find_device(&block_class, NULL, label, &match_dev_by_label);
+ if (dev) {
+ devt = dev->devt;
+ put_device(dev);
+ }
+
+ return devt;
+}
+EXPORT_SYMBOL_GPL(devt_from_partlabel);
+
+dev_t devt_from_devname(const char *name)
+{
+ dev_t devt = 0;
+ int part;
+ char s[32];
+ char *p;
+
+ if (strlen(name) > 31)
+ return 0;
+ strcpy(s, name);
+ for (p = s; *p; p++) {
+ if (*p == '/')
+ *p = '!';
+ }
+
+ devt = blk_lookup_devt(s, 0);
+ if (devt)
+ return devt;
+
+ /*
+ * Try non-existent, but valid partition, which may only exist after
+ * opening the device, like partitioned md devices.
+ */
+ while (p > s && isdigit(p[-1]))
+ p--;
+ if (p == s || !*p || *p == '0')
+ return 0;
+
+ /* try disk name without <part number> */
+ part = simple_strtoul(p, NULL, 10);
+ *p = '\0';
+ devt = blk_lookup_devt(s, part);
+ if (devt)
+ return devt;
+
+ /* try disk name without p<part number> */
+ if (p < s + 2 || !isdigit(p[-2]) || p[-1] != 'p')
+ return 0;
+ p[-1] = '\0';
+ return blk_lookup_devt(s, part);
+}
+EXPORT_SYMBOL_GPL(devt_from_devname);
@@ -1494,6 +1494,9 @@ int truncate_bdev_range(struct block_device *bdev, fmode_t mode, loff_t lstart,
loff_t lend);
#ifdef CONFIG_BLOCK
+dev_t devt_from_partuuid(const char *uuid_str, int *root_wait);
+dev_t devt_from_partlabel(const char *label);
+dev_t devt_from_devname(const char *name);
void invalidate_bdev(struct block_device *bdev);
int sync_blockdev(struct block_device *bdev);
int sync_blockdev_range(struct block_device *bdev, loff_t lstart, loff_t lend);
@@ -1502,6 +1505,18 @@ void sync_bdevs(bool wait);
void bdev_statx_dioalign(struct inode *inode, struct kstat *stat);
void printk_all_partitions(void);
#else
+static inline dev_t devt_from_partuuid(const char *uuid_str, int *root_wait)
+{
+ return MKDEV(0, 0);
+}
+static inline dev_t devt_from_partlabel(const char *label);
+{
+ return MKDEV(0, 0);
+}
+static inline dev_t devt_from_devname(const char *name);
+{
+ return MKDEV(0, 0);
+}
static inline void invalidate_bdev(struct block_device *bdev)
{
}
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0-only
#include <linux/module.h>
#include <linux/sched.h>
+#include <linux/blkdev.h>
#include <linux/ctype.h>
#include <linux/fd.h>
#include <linux/tty.h>
@@ -60,169 +61,6 @@ static int __init readwrite(char *str)
__setup("ro", readonly);
__setup("rw", readwrite);
-#ifdef CONFIG_BLOCK
-struct uuidcmp {
- const char *uuid;
- int len;
-};
-
-/**
- * match_dev_by_uuid - callback for finding a partition using its uuid
- * @dev: device passed in by the caller
- * @data: opaque pointer to the desired struct uuidcmp to match
- *
- * Returns 1 if the device matches, and 0 otherwise.
- */
-static int match_dev_by_uuid(struct device *dev, const void *data)
-{
- struct block_device *bdev = dev_to_bdev(dev);
- const struct uuidcmp *cmp = data;
-
- if (!bdev->bd_meta_info ||
- strncasecmp(cmp->uuid, bdev->bd_meta_info->uuid, cmp->len))
- return 0;
- return 1;
-}
-
-/**
- * devt_from_partuuid - looks up the dev_t of a partition by its UUID
- * @uuid_str: char array containing ascii UUID
- *
- * The function will return the first partition which contains a matching
- * UUID value in its partition_meta_info struct. This does not search
- * by filesystem UUIDs.
- *
- * If @uuid_str is followed by a "/PARTNROFF=%d", then the number will be
- * extracted and used as an offset from the partition identified by the UUID.
- *
- * Returns the matching dev_t on success or 0 on failure.
- */
-static dev_t devt_from_partuuid(const char *uuid_str)
-{
- struct uuidcmp cmp;
- struct device *dev = NULL;
- dev_t devt = 0;
- int offset = 0;
- char *slash;
-
- cmp.uuid = uuid_str;
-
- slash = strchr(uuid_str, '/');
- /* Check for optional partition number offset attributes. */
- if (slash) {
- char c = 0;
-
- /* Explicitly fail on poor PARTUUID syntax. */
- if (sscanf(slash + 1, "PARTNROFF=%d%c", &offset, &c) != 1)
- goto clear_root_wait;
- cmp.len = slash - uuid_str;
- } else {
- cmp.len = strlen(uuid_str);
- }
-
- if (!cmp.len)
- goto clear_root_wait;
-
- dev = class_find_device(&block_class, NULL, &cmp, &match_dev_by_uuid);
- if (!dev)
- return 0;
-
- if (offset) {
- /*
- * Attempt to find the requested partition by adding an offset
- * to the partition number found by UUID.
- */
- devt = part_devt(dev_to_disk(dev),
- dev_to_bdev(dev)->bd_partno + offset);
- } else {
- devt = dev->devt;
- }
-
- put_device(dev);
- return devt;
-
-clear_root_wait:
- pr_err("VFS: PARTUUID= is invalid.\n"
- "Expected PARTUUID=<valid-uuid-id>[/PARTNROFF=%%d]\n");
- if (root_wait)
- pr_err("Disabling rootwait; root= is invalid.\n");
- root_wait = 0;
- return 0;
-}
-
-/**
- * match_dev_by_label - callback for finding a partition using its label
- * @dev: device passed in by the caller
- * @data: opaque pointer to the label to match
- *
- * Returns 1 if the device matches, and 0 otherwise.
- */
-static int match_dev_by_label(struct device *dev, const void *data)
-{
- struct block_device *bdev = dev_to_bdev(dev);
- const char *label = data;
-
- if (!bdev->bd_meta_info || strcmp(label, bdev->bd_meta_info->volname))
- return 0;
- return 1;
-}
-
-static dev_t devt_from_partlabel(const char *label)
-{
- struct device *dev;
- dev_t devt = 0;
-
- dev = class_find_device(&block_class, NULL, label, &match_dev_by_label);
- if (dev) {
- devt = dev->devt;
- put_device(dev);
- }
-
- return devt;
-}
-
-static dev_t devt_from_devname(const char *name)
-{
- dev_t devt = 0;
- int part;
- char s[32];
- char *p;
-
- if (strlen(name) > 31)
- return 0;
- strcpy(s, name);
- for (p = s; *p; p++) {
- if (*p == '/')
- *p = '!';
- }
-
- devt = blk_lookup_devt(s, 0);
- if (devt)
- return devt;
-
- /*
- * Try non-existent, but valid partition, which may only exist after
- * opening the device, like partitioned md devices.
- */
- while (p > s && isdigit(p[-1]))
- p--;
- if (p == s || !*p || *p == '0')
- return 0;
-
- /* try disk name without <part number> */
- part = simple_strtoul(p, NULL, 10);
- *p = '\0';
- devt = blk_lookup_devt(s, part);
- if (devt)
- return devt;
-
- /* try disk name without p<part number> */
- if (p < s + 2 || !isdigit(p[-2]) || p[-1] != 'p')
- return 0;
- p[-1] = '\0';
- return blk_lookup_devt(s, part);
-}
-#endif /* CONFIG_BLOCK */
static dev_t devt_from_devnum(const char *name)
{
@@ -284,7 +122,7 @@ dev_t name_to_dev_t(const char *name)
return Root_RAM0;
#ifdef CONFIG_BLOCK
if (strncmp(name, "PARTUUID=", 9) == 0)
- return devt_from_partuuid(name + 9);
+ return devt_from_partuuid(name + 9, &root_wait);
if (strncmp(name, "PARTLABEL=", 10) == 0)
return devt_from_partlabel(name + 10);
if (strncmp(name, "/dev/", 5) == 0)