[v9,00/11] btrfs: introduce RAID stripe tree

Message ID 20230914-raid-stripe-tree-v9-0-15d423829637@wdc.com
Series btrfs: introduce RAID stripe tree

Message

Johannes Thumshirn Sept. 14, 2023, 4:06 p.m. UTC
Updates of the raid-stripe-tree are done at ordered extent write time to save
on bandwidth, while for reading we do the stripe-tree lookup at bio mapping
time, i.e. when the logical to physical translation happens for regular btrfs
RAID as well.

The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
its contents are the respective physical device id and position.
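
The keying described above can be modeled roughly as follows. This is a
minimal Python sketch, not the kernel implementation: the names `Stride`,
`StripeTree`, `insert` and `lookup` are hypothetical, and it assumes a mirrored
(RAID1-style) layout in which every stride holds a full copy of the extent, so
the physical address is the stride's start plus the offset into the extent.
The sample values are taken from the RAID1 dump later in this message.

```python
import bisect
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Stride:
    devid: int      # device id the copy lives on
    physical: int   # physical start offset on that device


@dataclass
class StripeTree:
    """Toy model: maps (disk_bytenr, disk_num_bytes) to per-device positions."""
    _starts: list = field(default_factory=list)  # sorted disk_bytenr values
    _items: dict = field(default_factory=dict)   # disk_bytenr -> (length, strides)

    def insert(self, disk_bytenr, disk_num_bytes, strides):
        # Assumption: extents never overlap and each start is inserted once.
        bisect.insort(self._starts, disk_bytenr)
        self._items[disk_bytenr] = (disk_num_bytes, tuple(strides))

    def lookup(self, logical):
        """Return [(devid, physical)] for a logical address, or [] if unmapped."""
        idx = bisect.bisect_right(self._starts, logical) - 1
        if idx < 0:
            return []
        start = self._starts[idx]
        length, strides = self._items[start]
        if logical >= start + length:
            return []
        delta = logical - start
        return [(s.devid, s.physical + delta) for s in strides]


tree = StripeTree()
tree.insert(939524096, 131072, [Stride(1, 939524096), Stride(2, 536870912)])
print(tree.lookup(939524096 + 4096))
```

A striped (RAID0) layout would need to select a single stride from the offset
instead of returning all of them; that case is deliberately left out here.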

For an example 1M write (split into 126K segments due to zone-append):
rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)

The tree will look as follows (both 128k buffered writes to a ZNS drive):

RAID0 case:
bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
btrfs-progs v6.3
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0) 
leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
leaf 805535744 flags 0x1(WRITTEN) backref revision 1
checksum stored 2d2d2262
checksum calced 2d2d2262
fs uuid ab05cfc6-9859-404e-970d-3999b1cb5438
chunk uuid c9470ba2-49ac-4d46-8856-438a18e6bd23
        item 0 key (1073741824 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 56
                        encoding: RAID0
                        stripe 0 devid 1 offset 805306368 length 131072
                        stripe 1 devid 2 offset 536870912 length 131072
total bytes 42949672960
bytes used 294912
uuid ab05cfc6-9859-404e-970d-3999b1cb5438

RAID1 case:
bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
btrfs-progs v6.3
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0) 
leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
leaf 805535744 flags 0x1(WRITTEN) backref revision 1
checksum stored 56199539
checksum calced 56199539
fs uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
chunk uuid 691874fc-1b9c-469b-bd7f-05e0e6ba88c4
        item 0 key (939524096 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 offset 939524096 length 65536
                        stripe 1 devid 2 offset 536870912 length 65536
total bytes 42949672960
bytes used 294912
uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
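
For testing, dumps like the two above can be turned into structured data with
a small script. The following Python sketch is only an illustration: the
function name `parse_stripe_items` and the dictionary layout are made up here,
and the regular expressions assume exactly the line format shown in this
message, which may differ in other btrfs-progs versions.

```python
import re

# Matches lines such as:
#   item 0 key (939524096 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 56
ITEM_RE = re.compile(r"item \d+ key \((\d+) RAID_STRIPE_KEY (\d+)\)")
ENCODING_RE = re.compile(r"encoding: (\w+)")
STRIPE_RE = re.compile(r"stripe (\d+) devid (\d+) offset (\d+) length (\d+)")


def parse_stripe_items(dump_text):
    """Collect RAID_STRIPE_KEY items and their stripes from dump-tree text."""
    items = []
    for line in dump_text.splitlines():
        m = ITEM_RE.search(line)
        if m:
            items.append({
                "logical": int(m.group(1)),
                "length": int(m.group(2)),
                "encoding": None,
                "stripes": [],
            })
            continue
        if not items:
            continue  # skip header lines that precede the first item
        m = ENCODING_RE.search(line)
        if m:
            items[-1]["encoding"] = m.group(1)
            continue
        m = STRIPE_RE.search(line)
        if m:
            index, devid, offset, length = map(int, m.groups())
            items[-1]["stripes"].append(
                {"index": index, "devid": devid, "offset": offset, "length": length}
            )
    return items


# Item lines copied from the RAID1 dump above.
SAMPLE = """\
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
        item 0 key (939524096 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 56
                        encoding: RAID1
                        stripe 0 devid 1 offset 939524096 length 65536
                        stripe 1 devid 2 offset 536870912 length 65536
"""

items = parse_stripe_items(SAMPLE)
for item in items:
    print(item["logical"], item["encoding"], item["stripes"])
```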

A design document can be found here:
https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true

The user-space part of this series can be found here:
https://lore.kernel.org/linux-btrfs/20230215143109.2721722-1-johannes.thumshirn@wdc.com

Changes to v8:
- Changed tracepoints according to David's comments
- Mark on-disk structures as packed
- Got rid of __DECLARE_FLEX_ARRAY
- Rebase onto misc-next
- Split out helpers for new btrfs_load_block_group_zone_info RAID cases
- Constify declarations where possible
- Initialise variables before use
- Lower scope of variables
- Remove btrfs_stripe_root() helper
- Pick different BTRFS_RAID_STRIPE_KEY constant
- Reorder on-disk encoding types to match the raid_index
- And possibly more, please git range-diff the versions
- Link to v8: https://lore.kernel.org/r/20230911-raid-stripe-tree-v8-0-647676fa852c@wdc.com

Changes to v7:
- Huge rewrite

v7 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1677750131.git.johannes.thumshirn@wdc.com/

Changes to v6:
- Fix degraded RAID1 mounts
- Fix RAID0/10 mounts

v6 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1676470614.git.johannes.thumshirn@wdc.com

Changes to v5:
- Incorporated review comments from Josef and Christoph
- Rebased onto misc-next

v5 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1675853489.git.johannes.thumshirn@wdc.com

Changes to v4:
- Added patch to check for RST feature in sysfs
- Added RST lookups for scrubbing 
- Fixed the error handling bug Josef pointed out
- Only check if we need to write out a RST once per delayed_ref head
- Added support for zoned data DUP with RST

Changes to v3:
- Rebased onto 20221120124734.18634-1-hch@lst.de
- Incorporated Josef's review
- Merged related patches

v3 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1666007330.git.johannes.thumshirn@wdc.com

Changes to v2:
- Bug fixes
- Rebased onto 20220901074216.1849941-1-hch@lst.de
- Added tracepoints
- Added leak checker
- Added RAID0 and RAID10

v2 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com

Changes to v1:
- Write the stripe-tree at delayed-ref time (Qu)
- Add a different write path for preallocation

v1 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
Johannes Thumshirn (11):
      btrfs: add raid stripe tree definitions
      btrfs: read raid-stripe-tree from disk
      btrfs: add support for inserting raid stripe extents
      btrfs: delete stripe extent on extent deletion
      btrfs: lookup physical address from stripe extent
      btrfs: implement RST version of scrub
      btrfs: zoned: allow zoned RAID
      btrfs: add raid stripe tree pretty printer
      btrfs: announce presence of raid-stripe-tree in sysfs
      btrfs: add trace events for RST
      btrfs: add raid-stripe-tree to features enabled with debug

 fs/btrfs/Makefile               |   2 +-
 fs/btrfs/accessors.h            |  10 +
 fs/btrfs/bio.c                  |  25 +++
 fs/btrfs/block-rsv.c            |   6 +
 fs/btrfs/disk-io.c              |  18 ++
 fs/btrfs/extent-tree.c          |   7 +
 fs/btrfs/fs.h                   |   4 +-
 fs/btrfs/inode.c                |   8 +-
 fs/btrfs/locking.c              |   1 +
 fs/btrfs/ordered-data.c         |   1 +
 fs/btrfs/ordered-data.h         |   2 +
 fs/btrfs/print-tree.c           |  26 +++
 fs/btrfs/raid-stripe-tree.c     | 449 ++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h     |  52 +++++
 fs/btrfs/scrub.c                |  53 +++++
 fs/btrfs/sysfs.c                |   3 +
 fs/btrfs/volumes.c              |  43 +++-
 fs/btrfs/volumes.h              |  16 +-
 fs/btrfs/zoned.c                | 144 ++++++++++++-
 include/trace/events/btrfs.h    |  75 +++++++
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  31 +++
 22 files changed, 954 insertions(+), 23 deletions(-)
---
base-commit: 1d73023d96965a5c8fb76a39aec88d840ebe5b21
change-id: 20230613-raid-stripe-tree-e330c9a45cc3

Best regards,
  

Comments

David Sterba Sept. 14, 2023, 6:25 p.m. UTC | #1
On Thu, Sep 14, 2023 at 09:06:55AM -0700, Johannes Thumshirn wrote:
> A design document can be found here:
> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true

Please also turn it into a developer documentation file (in
btrfs-progs/Documentation/dev); it can follow the same structure.

v9 will be added as a topic branch to for-next. I did several style
changes, so please send any updates as incrementals if needed.
  
David Sterba Sept. 20, 2023, 4:23 p.m. UTC | #2
On Thu, Sep 14, 2023 at 08:25:34PM +0200, David Sterba wrote:
> v9 will be added as topic branch to for-next, I did several style
> changes so please send any updates as incrementals if needed.

Moved to misc-next. I'll do a minor release of btrfs-progs soon so we
get the tool support for testing.