[v8,00/11] btrfs: introduce RAID stripe tree

Message ID 20230911-raid-stripe-tree-v8-0-647676fa852c@wdc.com

Message

Johannes Thumshirn Sept. 11, 2023, 12:52 p.m. UTC
  Updates of the raid-stripe-tree are done at ordered extent write time to save
on bandwidth, while for reading we do the stripe-tree lookup at bio mapping
time, i.e. when the logical to physical translation happens for regular btrfs
RAID as well.

The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
its contents are the respective physical device id and position.
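
As a rough illustration of that keying, here is a minimal user-space sketch.
The struct and function names (rst_key, rst_item, rst_lookup) are hypothetical
and are not the on-disk definitions from this series. It also ignores how
RAID0 distributes an extent across devices and only shows the simple
"physical start plus offset into the extent" case:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical in-memory model of one stripe tree item. */
struct rst_key {
	uint64_t disk_bytenr;    /* logical start of the extent */
	uint64_t disk_num_bytes; /* length of the extent */
};

struct rst_stripe {
	uint64_t devid;    /* device the copy lives on */
	uint64_t physical; /* physical start on that device */
};

struct rst_item {
	struct rst_key key;
	int num_stripes;
	struct rst_stripe stripes[2];
};

/*
 * Return the physical address on @devid that backs @logical, or UINT64_MAX
 * if the item does not cover @logical or has no stripe on @devid.
 */
static uint64_t rst_lookup(const struct rst_item *item, uint64_t logical,
			   uint64_t devid)
{
	uint64_t start = item->key.disk_bytenr;
	uint64_t end = start + item->key.disk_num_bytes;

	if (logical < start || logical >= end)
		return UINT64_MAX;

	for (int i = 0; i < item->num_stripes; i++) {
		if (item->stripes[i].devid == devid)
			return item->stripes[i].physical + (logical - start);
	}

	return UINT64_MAX;
}
```

Filled with the values of the RAID1 item shown in the dump below (key
939524096/131072, devid 2 at 536870912), a lookup 4096 bytes into the extent
on devid 2 would yield 536870912 + 4096 in this model.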

For an example 1M write (split into 128K segments due to zone-append):
rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)

The tree will look as follows (both 128k buffered writes to a ZNS drive):

RAID0 case:
bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
btrfs-progs v6.3
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0) 
leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
leaf 805535744 flags 0x1(WRITTEN) backref revision 1
checksum stored 2d2d2262
checksum calced 2d2d2262
fs uuid ab05cfc6-9859-404e-970d-3999b1cb5438
chunk uuid c9470ba2-49ac-4d46-8856-438a18e6bd23
        item 0 key (1073741824 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 40
                        encoding: RAID0
                        stripe 0 devid 1 offset 805306368
                        stripe 1 devid 2 offset 536870912
total bytes 42949672960
bytes used 294912
uuid ab05cfc6-9859-404e-970d-3999b1cb5438

RAID1 case:
bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
btrfs-progs v6.3
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0) 
leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
leaf 805535744 flags 0x1(WRITTEN) backref revision 1
checksum stored 56199539
checksum calced 56199539
fs uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
chunk uuid 691874fc-1b9c-469b-bd7f-05e0e6ba88c4
        item 0 key (939524096 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 40
                        encoding: RAID1
                        stripe 0 devid 1 offset 939524096
                        stripe 1 devid 2 offset 536870912
total bytes 42949672960
bytes used 294912
uuid 9e693a37-fbd1-4891-aed2-e7fe64605045

A design document can be found here:
https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true

The user-space part of this series can be found here:
https://lore.kernel.org/linux-btrfs/20230215143109.2721722-1-johannes.thumshirn@wdc.com

Changes to v7:
- Huge rewrite

v7 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1677750131.git.johannes.thumshirn@wdc.com/

Changes to v6:
- Fix degraded RAID1 mounts
- Fix RAID0/10 mounts

v6 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1676470614.git.johannes.thumshirn@wdc.com

Changes to v5:
- Incorporated review comments from Josef and Christoph
- Rebased onto misc-next

v5 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1675853489.git.johannes.thumshirn@wdc.com

Changes to v4:
- Added patch to check for RST feature in sysfs
- Added RST lookups for scrubbing 
- Fixed the error handling bug Josef pointed out
- Only check if we need to write out a RST once per delayed_ref head
- Added support for zoned data DUP with RST

Changes to v3:
- Rebased onto 20221120124734.18634-1-hch@lst.de
- Incorporated Josef's review
- Merged related patches

v3 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1666007330.git.johannes.thumshirn@wdc.com

Changes to v2:
- Bug fixes
- Rebased onto 20220901074216.1849941-1-hch@lst.de
- Added tracepoints
- Added leak checker
- Added RAID0 and RAID10

v2 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com

Changes to v1:
- Write the stripe-tree at delayed-ref time (Qu)
- Add a different write path for preallocation

v1 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
Johannes Thumshirn (11):
      btrfs: add raid stripe tree definitions
      btrfs: read raid-stripe-tree from disk
      btrfs: add support for inserting raid stripe extents
      btrfs: delete stripe extent on extent deletion
      btrfs: lookup physical address from stripe extent
      btrfs: implement RST version of scrub
      btrfs: zoned: allow zoned RAID
      btrfs: add raid stripe tree pretty printer
      btrfs: announce presence of raid-stripe-tree in sysfs
      btrfs: add trace events for RST
      btrfs: add raid-stripe-tree to features enabled with debug

 fs/btrfs/Makefile               |   2 +-
 fs/btrfs/accessors.h            |  10 +
 fs/btrfs/bio.c                  |  23 ++
 fs/btrfs/block-rsv.c            |   6 +
 fs/btrfs/disk-io.c              |  18 ++
 fs/btrfs/disk-io.h              |   5 +
 fs/btrfs/extent-tree.c          |   7 +
 fs/btrfs/fs.h                   |   4 +-
 fs/btrfs/inode.c                |   8 +-
 fs/btrfs/locking.c              |   5 +-
 fs/btrfs/ordered-data.c         |   1 +
 fs/btrfs/ordered-data.h         |   2 +
 fs/btrfs/print-tree.c           |  49 ++++
 fs/btrfs/raid-stripe-tree.c     | 493 ++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h     |  52 +++++
 fs/btrfs/scrub.c                |  56 +++++
 fs/btrfs/sysfs.c                |   3 +
 fs/btrfs/volumes.c              |  43 +++-
 fs/btrfs/volumes.h              |  15 +-
 fs/btrfs/zoned.c                | 113 ++++++++-
 include/trace/events/btrfs.h    |  75 ++++++
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  33 ++-
 23 files changed, 999 insertions(+), 25 deletions(-)
---
base-commit: 133da717263112d81bb95b5535ceb2c1eeddd4e7
change-id: 20230613-raid-stripe-tree-e330c9a45cc3

Best regards,
  

Comments

David Sterba Sept. 12, 2023, 8:46 p.m. UTC | #1
On Mon, Sep 11, 2023 at 05:52:11AM -0700, Johannes Thumshirn wrote:
> Add trace events for raid-stripe-tree operations.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/raid-stripe-tree.c  |  8 +++++
>  include/trace/events/btrfs.h | 75 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 83 insertions(+)
> 
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index 7ed02e4b79ec..5a9952cf557c 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -62,6 +62,9 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
>  		if (found_end <= start)
>  			break;
>  
> +		trace_btrfs_raid_extent_delete(fs_info, start, end,
> +					       found_start, found_end);
> +
>  		ASSERT(found_start >= start && found_end <= end);
>  		ret = btrfs_del_item(trans, stripe_root, path);
>  		if (ret)
> @@ -120,6 +123,8 @@ static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
>  		return -ENOMEM;
>  	}
>  
> +	trace_btrfs_insert_one_raid_extent(fs_info, bioc->logical, bioc->size,
> +					   num_stripes);
>  	btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
>  	for (int i = 0; i < num_stripes; i++) {
>  		u64 devid = bioc->stripes[i].dev->devid;
> @@ -445,6 +450,9 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
>  
>  		stripe->physical = physical + offset;
>  
> +		trace_btrfs_get_raid_extent_offset(fs_info, logical, *length,
> +						   stripe->physical, devid);
> +
>  		ret = 0;
>  		goto free_path;
>  	}
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index b2db2c2f1c57..e2c6f1199212 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -2497,6 +2497,81 @@ DEFINE_EVENT(btrfs_raid56_bio, raid56_write,
>  	TP_ARGS(rbio, bio, trace_info)
>  );
>  
> +TRACE_EVENT(btrfs_insert_one_raid_extent,
> +
> +	TP_PROTO(struct btrfs_fs_info *fs_info, u64 logical, u64 length,

This should be const struct btrfs_fs_info *fs_info.

> +		 int num_stripes),
> +
> +	TP_ARGS(fs_info, logical, length, num_stripes),
> +
> +	TP_STRUCT__entry_btrfs(
> +		__field(	u64,	logical		)
> +		__field(	u64,	length		)
> +		__field(	int,	num_stripes	)
> +	),
> +
> +	TP_fast_assign_btrfs(fs_info,
> +		__entry->logical	= logical;
> +		__entry->length		= length;
> +		__entry->num_stripes	= num_stripes;
> +	),
> +
> +	TP_printk_btrfs("logical=%llu, length=%llu, num_stripes=%d",
> +			__entry->logical, __entry->length,
> +			__entry->num_stripes)

Tracepoint messages should follow the formatting guidelines
https://btrfs.readthedocs.io/en/latest/dev/Development-notes.html#tracepoints
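
For illustration only, and assuming the guideline meant here is the one asking
for space-separated key=value pairs without commas (please check the linked
page, this is an assumption), the message could be shaped as in the following
user-space stand-in. format_insert_event() is a hypothetical helper that only
mimics the format string, it is not a kernel API:

```c
#include <stdio.h>
#include <stddef.h>

/*
 * User-space stand-in for the TP_printk_btrfs() format string of
 * btrfs_insert_one_raid_extent: key=value pairs separated by spaces.
 */
static int format_insert_event(char *buf, size_t len,
			       unsigned long long logical,
			       unsigned long long length, int num_stripes)
{
	return snprintf(buf, len, "logical=%llu length=%llu num_stripes=%d",
			logical, length, num_stripes);
}
```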

> +);
> +
> +TRACE_EVENT(btrfs_raid_extent_delete,
> +
> +	TP_PROTO(struct btrfs_fs_info *fs_info, u64 start, u64 end,
> +		 u64 found_start, u64 found_end),
> +
> +	TP_ARGS(fs_info, start, end, found_start, found_end),
> +
> +	TP_STRUCT__entry_btrfs(
> +		__field(	u64,	start		)
> +		__field(	u64,	end		)
> +		__field(	u64,	found_start	)
> +		__field(	u64,	found_end	)
> +	),
> +
> +	TP_fast_assign_btrfs(fs_info,
> +		__entry->start	=	start;
> +		__entry->end	=	end;
> +		__entry->found_start =	found_start;
> +		__entry->found_end =	found_end;

Tracepoints follow the fancy spacing and alignment in the assign blocks.

> +	),
> +
> +	TP_printk_btrfs("start=%llu, end=%llu, found_start=%llu, found_end=%llu",
> +			__entry->start, __entry->end, __entry->found_start,
> +			__entry->found_end)
> +);
  
Qu Wenruo Sept. 13, 2023, 9:51 a.m. UTC | #2
On 2023/9/11 22:22, Johannes Thumshirn wrote:
> A filesystem that uses the RAID stripe tree for logical to physical
> address translation can't use the regular scrub path, that reads all
> stripes and then checks if a sector is unused afterwards.
>
> When using the RAID stripe tree, this will result in lookup errors, as the
> stripe tree doesn't know the requested logical addresses.
>
> Instead, look up stripes that are backed by the extent bitmap.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   fs/btrfs/scrub.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 56 insertions(+)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index f16220ce5fba..5101e0a3f83e 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -23,6 +23,7 @@
>   #include "accessors.h"
>   #include "file-item.h"
>   #include "scrub.h"
> +#include "raid-stripe-tree.h"
>
>   /*
>    * This is only the first step towards a full-features scrub. It reads all
> @@ -1634,6 +1635,56 @@ static void scrub_reset_stripe(struct scrub_stripe *stripe)
>   	}
>   }
>
> +static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
> +					    struct scrub_stripe *stripe)
> +{
> +	struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
> +	struct btrfs_bio *bbio = NULL;
> +	int mirror = stripe->mirror_num;
> +	int i;
> +
> +	atomic_inc(&stripe->pending_io);
> +
> +	for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
> +		struct page *page;
> +		int pgoff;
> +
> +		page = scrub_stripe_get_page(stripe, i);
> +		pgoff = scrub_stripe_get_page_offset(stripe, i);
> +
> +		/* The current sector cannot be merged, submit the bio. */
> +		if (bbio &&
> +		    ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
> +		     bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) {
> +			ASSERT(bbio->bio.bi_iter.bi_size);
> +			atomic_inc(&stripe->pending_io);
> +			btrfs_submit_bio(bbio, mirror);
> +			bbio = NULL;
> +		}
> +
> +		if (!bbio) {
> +			bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
> +				fs_info, scrub_read_endio, stripe);
> +			bbio->bio.bi_iter.bi_sector = (stripe->logical +
> +				(i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
> +		}
> +
> +		__bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
> +	}
> +
> +	if (bbio) {
> +		ASSERT(bbio->bio.bi_iter.bi_size);
> +		atomic_inc(&stripe->pending_io);
> +		btrfs_submit_bio(bbio, mirror);

RST is looked up during btrfs_submit_bio() (more precisely, in
set_io_stripe()), and I just checked: there is no special handling that
makes btrfs do this lookup using the commit root.

This means the extent items and the RST can be out of sync.

For scrub, all the extent items are searched using the commit root, but
btrfs_get_raid_extent_offset() only uses the current root.
Thus you would get problems when running fsstress and scrub.

We need some way to distinguish scrub bbios from regular ones (which is a
completely new requirement).
Currently only scrub doesn't initialize bbio->inode, so that can be used
to tell them apart (at least for now).
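
A minimal sketch of that idea, modelled in user space. All types and names
here (model_bbio, model_path, rst_setup_path) are hypothetical stand-ins and
not the kernel structures. The only assumption carried over from the comment
above is that a bbio without an inode belongs to scrub:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types, reduced to the fields used. */
struct model_inode {
	int unused;
};

struct model_bbio {
	struct model_inode *inode; /* NULL for scrub reads (assumption) */
};

struct model_path {
	bool search_commit_root;
	bool skip_locking;
};

/* Treat a bbio without an inode as coming from scrub. */
static bool bbio_is_scrub(const struct model_bbio *bbio)
{
	return bbio->inode == NULL;
}

/*
 * Scrub searched the extent items in the commit root, so its stripe tree
 * lookup uses the commit root as well; regular I/O keeps the current root.
 */
static void rst_setup_path(struct model_path *path,
			   const struct model_bbio *bbio)
{
	bool scrub = bbio_is_scrub(bbio);

	path->search_commit_root = scrub;
	path->skip_locking = scrub;
}
```

In this sketch both lookups agree on which root they read, which is the
property the out-of-sync concern is about.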

Thanks,
Qu
> +	}
> +
> +	if (atomic_dec_and_test(&stripe->pending_io)) {
> +		wake_up(&stripe->io_wait);
> +		INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
> +		queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
> +	}
> +}
> +
>   static void scrub_submit_initial_read(struct scrub_ctx *sctx,
>   				      struct scrub_stripe *stripe)
>   {
> @@ -1645,6 +1696,11 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx,
>   	ASSERT(stripe->mirror_num > 0);
>   	ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
>
> +	if (btrfs_need_stripe_tree_update(fs_info, stripe->bg->flags)) {
> +		scrub_submit_extent_sector_read(sctx, stripe);
> +		return;
> +	}
> +
>   	bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info,
>   			       scrub_read_endio, stripe);
>
>
  
David Sterba Sept. 13, 2023, 4:59 p.m. UTC | #3
On Mon, Sep 11, 2023 at 05:52:07AM -0700, Johannes Thumshirn wrote:
> A filesystem that uses the RAID stripe tree for logical to physical
> address translation can't use the regular scrub path, that reads all
> stripes and then checks if a sector is unused afterwards.
> 
> When using the RAID stripe tree, this will result in lookup errors, as the
> stripe tree doesn't know the requested logical addresses.
> 
> Instead, look up stripes that are backed by the extent bitmap.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/scrub.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index f16220ce5fba..5101e0a3f83e 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -23,6 +23,7 @@
>  #include "accessors.h"
>  #include "file-item.h"
>  #include "scrub.h"
> +#include "raid-stripe-tree.h"
>  
>  /*
>   * This is only the first step towards a full-features scrub. It reads all
> @@ -1634,6 +1635,56 @@ static void scrub_reset_stripe(struct scrub_stripe *stripe)
>  	}
>  }
>  
> +static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
> +					    struct scrub_stripe *stripe)
> +{
> +	struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
> +	struct btrfs_bio *bbio = NULL;
> +	int mirror = stripe->mirror_num;
> +	int i;
> +
> +	atomic_inc(&stripe->pending_io);
> +
> +	for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
> +		struct page *page;
> +		int pgoff;

This should be unsigned int.

> +
> +		page = scrub_stripe_get_page(stripe, i);
> +		pgoff = scrub_stripe_get_page_offset(stripe, i);

You can probably move the initializations right to the declarations, I
think we have that elsewhere too.

> +		/* The current sector cannot be merged, submit the bio. */
> +		if (bbio &&
> +		    ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
> +		     bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) {
> +			ASSERT(bbio->bio.bi_iter.bi_size);
> +			atomic_inc(&stripe->pending_io);
> +			btrfs_submit_bio(bbio, mirror);
> +			bbio = NULL;
> +		}
> +
> +		if (!bbio) {
> +			bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
> +				fs_info, scrub_read_endio, stripe);
> +			bbio->bio.bi_iter.bi_sector = (stripe->logical +
> +				(i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
> +		}
> +
> +		__bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
> +	}
> +
> +	if (bbio) {
> +		ASSERT(bbio->bio.bi_iter.bi_size);
> +		atomic_inc(&stripe->pending_io);
> +		btrfs_submit_bio(bbio, mirror);
> +	}
> +
> +	if (atomic_dec_and_test(&stripe->pending_io)) {
> +		wake_up(&stripe->io_wait);
> +		INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
> +		queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
> +	}
> +}
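
For readers following the loop in the quoted function: it walks the set bits
of the extent sector bitmap and starts a new bio whenever the previous sector
was not set or the current bio has reached the stripe length. The user-space
sketch below models only that splitting step. The names (model_run,
model_split_runs) and the constants are hypothetical; 4K sectors and a 64K
stripe length are assumptions:

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SECTORSIZE  4096u  /* assumed sector size */
#define MODEL_STRIPE_LEN 65536u  /* assumed stripe length (64K) */

/* One run of consecutive sectors, i.e. what would become one read bio. */
struct model_run {
	unsigned int first_sector;
	unsigned int nr_sectors;
};

/* @nr must be smaller than the number of bits in unsigned long. */
static bool model_test_bit(unsigned long bitmap, unsigned int nr)
{
	return (bitmap >> nr) & 1ul;
}

/*
 * Split the set bits of @bitmap into runs of consecutive sectors. A new run
 * starts after every gap and once a run has reached MODEL_STRIPE_LEN bytes.
 * Returns the number of runs stored in @runs (at most @max_runs).
 */
static size_t model_split_runs(unsigned long bitmap, unsigned int nr_sectors,
			       struct model_run *runs, size_t max_runs)
{
	size_t nr_runs = 0;
	bool open = false;

	for (unsigned int i = 0; i < nr_sectors; i++) {
		if (!model_test_bit(bitmap, i)) {
			/* Gap: the next set bit cannot be merged. */
			open = false;
			continue;
		}

		/* Close the current run when it is already full. */
		if (open && runs[nr_runs - 1].nr_sectors * MODEL_SECTORSIZE >=
			    MODEL_STRIPE_LEN)
			open = false;

		if (!open) {
			if (nr_runs == max_runs)
				break;
			runs[nr_runs].first_sector = i;
			runs[nr_runs].nr_sectors = 0;
			nr_runs++;
			open = true;
		}

		runs[nr_runs - 1].nr_sectors++;
	}

	return nr_runs;
}
```

With a bitmap of 0xE7 (sectors 0-2 and 5-7 in use) and 16 sectors per stripe,
this model produces two runs, so only the sectors that are backed by extents
would be read and looked up in the stripe tree.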
  
Qu Wenruo Sept. 14, 2023, 9:27 a.m. UTC | #4
On 2023/9/11 22:22, Johannes Thumshirn wrote:
> If we find a raid-stripe-tree on mount, read it from disk.
>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Reviewed-by: Anand Jain <anand.jain@oracle.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   fs/btrfs/block-rsv.c       |  6 ++++++
>   fs/btrfs/disk-io.c         | 18 ++++++++++++++++++
>   fs/btrfs/disk-io.h         |  5 +++++
>   fs/btrfs/fs.h              |  1 +
>   include/uapi/linux/btrfs.h |  1 +
>   5 files changed, 31 insertions(+)
>
> diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
> index 77684c5e0c8b..4e55e5f30f7f 100644
> --- a/fs/btrfs/block-rsv.c
> +++ b/fs/btrfs/block-rsv.c
> @@ -354,6 +354,11 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
>   		min_items++;
>   	}
>
> +	if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE)) {
> +		num_bytes += btrfs_root_used(&fs_info->stripe_root->root_item);
> +		min_items++;
> +	}
> +
>   	/*
>   	 * But we also want to reserve enough space so we can do the fallback
>   	 * global reserve for an unlink, which is an additional
> @@ -405,6 +410,7 @@ void btrfs_init_root_block_rsv(struct btrfs_root *root)
>   	case BTRFS_EXTENT_TREE_OBJECTID:
>   	case BTRFS_FREE_SPACE_TREE_OBJECTID:
>   	case BTRFS_BLOCK_GROUP_TREE_OBJECTID:
> +	case BTRFS_RAID_STRIPE_TREE_OBJECTID:
>   		root->block_rsv = &fs_info->delayed_refs_rsv;
>   		break;
>   	case BTRFS_ROOT_TREE_OBJECTID:
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 4c5d71065ea8..1ecebcfc1c17 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1179,6 +1179,8 @@ static struct btrfs_root *btrfs_get_global_root(struct btrfs_fs_info *fs_info,
>   		return btrfs_grab_root(fs_info->block_group_root);
>   	case BTRFS_FREE_SPACE_TREE_OBJECTID:
>   		return btrfs_grab_root(btrfs_global_root(fs_info, &key));
> +	case BTRFS_RAID_STRIPE_TREE_OBJECTID:
> +		return btrfs_grab_root(fs_info->stripe_root);
>   	default:
>   		return NULL;
>   	}
> @@ -1259,6 +1261,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
>   	btrfs_put_root(fs_info->fs_root);
>   	btrfs_put_root(fs_info->data_reloc_root);
>   	btrfs_put_root(fs_info->block_group_root);
> +	btrfs_put_root(fs_info->stripe_root);
>   	btrfs_check_leaked_roots(fs_info);
>   	btrfs_extent_buffer_leak_debug_check(fs_info);
>   	kfree(fs_info->super_copy);
> @@ -1804,6 +1807,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, bool free_chunk_root)
>   	free_root_extent_buffers(info->fs_root);
>   	free_root_extent_buffers(info->data_reloc_root);
>   	free_root_extent_buffers(info->block_group_root);
> +	free_root_extent_buffers(info->stripe_root);
>   	if (free_chunk_root)
>   		free_root_extent_buffers(info->chunk_root);
>   }
> @@ -2280,6 +2284,20 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>   		fs_info->uuid_root = root;
>   	}
>
> +	if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE)) {
> +		location.objectid = BTRFS_RAID_STRIPE_TREE_OBJECTID;
> +		root = btrfs_read_tree_root(tree_root, &location);
> +		if (IS_ERR(root)) {
> +			if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
> +				ret = PTR_ERR(root);
> +				goto out;
> +			}
> +		} else {
> +			set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> +			fs_info->stripe_root = root;
> +		}
> +	}
> +
>   	return 0;
>   out:
>   	btrfs_warn(fs_info, "failed to read root (objectid=%llu): %d",
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index 02b645744a82..8b7f01a01c44 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -103,6 +103,11 @@ static inline struct btrfs_root *btrfs_grab_root(struct btrfs_root *root)
>   	return NULL;
>   }
>
> +static inline struct btrfs_root *btrfs_stripe_tree_root(struct btrfs_fs_info *fs_info)
> +{
> +	return fs_info->stripe_root;
> +}
> +

Do we really need this? IIRC we never have a wrapper for fs_info->fs_root.

Thanks,
Qu
>   void btrfs_put_root(struct btrfs_root *root);
>   void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
>   int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index d84a390336fc..5c7778e8b5ed 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -367,6 +367,7 @@ struct btrfs_fs_info {
>   	struct btrfs_root *uuid_root;
>   	struct btrfs_root *data_reloc_root;
>   	struct btrfs_root *block_group_root;
> +	struct btrfs_root *stripe_root;
>
>   	/* The log root tree is a directory of all the other log roots */
>   	struct btrfs_root *log_root_tree;
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index dbb8b96da50d..b9a1d9af8ae8 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -333,6 +333,7 @@ struct btrfs_ioctl_fs_info_args {
>   #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
>   #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
>   #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	(1ULL << 13)
> +#define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE	(1ULL << 14)
>
>   struct btrfs_ioctl_feature_flags {
>   	__u64 compat_flags;
>
  
Johannes Thumshirn Sept. 14, 2023, 9:33 a.m. UTC | #5
On 14.09.23 11:27, Qu Wenruo wrote:
>> +static inline struct btrfs_root *btrfs_stripe_tree_root(struct btrfs_fs_info *fs_info)
>> +{
>> +	return fs_info->stripe_root;
>> +}
>> +
> 
> Do we really need this? IIRC we never have a wrapper for fs_info->fs_root.

This was requested by Josef a while ago, to make the conversion to 
per-block-group stripe trees easier. But hch also wanted me to remove it 
(and I thought I already did), so lemme get rid of it if Josef doesn't 
speak up.