Message ID | 20230914-raid-stripe-tree-v9-1-15d423829637@wdc.com |
---|---|
State | New |
Headers |
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Date: Thu, 14 Sep 2023 09:06:56 -0700
Subject: [PATCH v9 01/11] btrfs: add raid stripe tree definitions
Message-Id: <20230914-raid-stripe-tree-v9-1-15d423829637@wdc.com>
References: <20230914-raid-stripe-tree-v9-0-15d423829637@wdc.com>
In-Reply-To: <20230914-raid-stripe-tree-v9-0-15d423829637@wdc.com>
To: Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>, David Sterba <dsterba@suse.com>
Cc: Christoph Hellwig <hch@lst.de>, Naohiro Aota <naohiro.aota@wdc.com>, Qu Wenruo <wqu@suse.com>, Damien Le Moal <dlemoal@kernel.org>, linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org, Johannes Thumshirn <johannes.thumshirn@wdc.com>
X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
btrfs: introduce RAID stripe tree
|
|
Commit Message
Johannes Thumshirn
Sept. 14, 2023, 4:06 p.m. UTC
Add definitions for the raid stripe tree. This tree will hold information
about the on-disk layout of the stripes in a RAID set.
Each stripe extent has a 1:1 relationship with an on-disk extent item and
performs the logical to per-drive physical address translation for the
extent item in question.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/accessors.h | 10 ++++++++++
fs/btrfs/locking.c | 1 +
include/uapi/linux/btrfs_tree.h | 31 +++++++++++++++++++++++++++++++
3 files changed, 42 insertions(+)
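As a quick sanity check of the on-disk layout this patch introduces, the two new structures can be modeled in userspace. This is an illustrative sketch, not the kernel code: `__le64`/`__u8` are replaced by host fixed-width types, and `stripe_extent_size()` is a helper made up here to compute the btrfs item size for a given stride count (its results match the itemsize 32/80 values seen later in the thread).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of the on-disk items added by this patch.
 * The real definitions live in include/uapi/linux/btrfs_tree.h. */
struct btrfs_raid_stride {
	uint64_t devid;     /* btrfs device-id this stride lives on */
	uint64_t physical;  /* physical location on disk */
	uint64_t length;    /* length of the stride on this disk */
} __attribute__((__packed__));

struct btrfs_stripe_extent {
	uint8_t encoding;      /* one of the BTRFS_STRIPE_* values */
	uint8_t reserved[7];   /* pads the header to 8 bytes */
	struct btrfs_raid_stride strides[]; /* flexible array of strides */
} __attribute__((__packed__));

/* Hypothetical helper: item size for a stripe extent with n strides.
 * sizeof() of a struct with a flexible array member counts only the
 * fixed header, so this is 8 + n * 24 bytes. */
static inline size_t stripe_extent_size(size_t n)
{
	return sizeof(struct btrfs_stripe_extent) +
	       n * sizeof(struct btrfs_raid_stride);
}
```

A one-stride item is thus 32 bytes and a three-stride item 80 bytes, which lines up with the `itemsize` values in the dump further down the thread.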
Comments
On 2023/9/15 01:36, Johannes Thumshirn wrote:
> Add definitions for the raid stripe tree. This tree will hold information
> about the on-disk layout of the stripes in a RAID set.
>
> Each stripe extent has a 1:1 relationship with an on-disk extent item and
> is doing the logical to per-drive physical address translation for the
> extent item in question.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/accessors.h            | 10 ++++++++++
>  fs/btrfs/locking.c              |  1 +
>  include/uapi/linux/btrfs_tree.h | 31 +++++++++++++++++++++++++++++++
>  3 files changed, 42 insertions(+)

[snip]

> +struct btrfs_raid_stride {
> +	/* The btrfs device-id this raid extent lives on */
> +	__le64 devid;
> +	/* The physical location on disk */
> +	__le64 physical;
> +	/* The length of stride on this disk */
> +	__le64 length;
> +} __attribute__ ((__packed__));
> +
> +/* The stripe_extent::encoding, 1:1 mapping of enum btrfs_raid_types */
> +#define BTRFS_STRIPE_RAID0	1
> +#define BTRFS_STRIPE_RAID1	2
> +#define BTRFS_STRIPE_DUP	3
> +#define BTRFS_STRIPE_RAID10	4
> +#define BTRFS_STRIPE_RAID5	5
> +#define BTRFS_STRIPE_RAID6	6
> +#define BTRFS_STRIPE_RAID1C3	7
> +#define BTRFS_STRIPE_RAID1C4	8
> +
> +struct btrfs_stripe_extent {
> +	__u8 encoding;

Considering the encoding is for now a 1:1 map of btrfs_raid_types, and we normally use variables like @raid_index for such usage, have you considered renaming it to "raid_index" or "profile_index" instead?

Another thing is, you may want to add extra tree-checker code to verify the btrfs_stripe_extent members.

For encoding, all values should be known numbers, and the item size should be checked for alignment. The same for physical/length alignment checks.

Thanks,
Qu

> +	__u8 reserved[7];
> +	/* An array of raid strides this stripe is composed of */
> +	struct btrfs_raid_stride strides[];
> +} __attribute__ ((__packed__));
On 2023/9/15 09:52, Qu Wenruo wrote:
> On 2023/9/15 01:36, Johannes Thumshirn wrote:
>> Add definitions for the raid stripe tree. This tree will hold information
>> about the on-disk layout of the stripes in a RAID set.
>>
>> Each stripe extent has a 1:1 relationship with an on-disk extent item and
>> is doing the logical to per-drive physical address translation for the
>> extent item in question.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

[snip]

>> +struct btrfs_raid_stride {
>> +	/* The btrfs device-id this raid extent lives on */
>> +	__le64 devid;
>> +	/* The physical location on disk */
>> +	__le64 physical;
>> +	/* The length of stride on this disk */
>> +	__le64 length;

Forgot to mention: for the btrfs_stripe_extent structure, its key is (PHYSICAL, RAID_STRIPE_KEY, LENGTH), right?

So is the length in the btrfs_raid_stride duplicated, and can we save 8 bytes?

Thanks,
Qu

>> +} __attribute__ ((__packed__));

[snip]
On 15.09.23 02:27, Qu Wenruo wrote:
>>> /*
>>>  * Records the overall state of the qgroups.
>>>  * There's only one instance of this key present,
>>> @@ -719,6 +724,32 @@ struct btrfs_free_space_header {
>>>  	__le64 num_bitmaps;
>>>  } __attribute__ ((__packed__));
>>> +struct btrfs_raid_stride {
>>> +	/* The btrfs device-id this raid extent lives on */
>>> +	__le64 devid;
>>> +	/* The physical location on disk */
>>> +	__le64 physical;
>>> +	/* The length of stride on this disk */
>>> +	__le64 length;
>
> Forgot to mention, for btrfs_stripe_extent structure, its key is
> (PHYSICAL, RAID_STRIPE_KEY, LENGTH) right?
>
> So is the length in the btrfs_raid_stride duplicated and we can save 8
> bytes?

Nope. The length in the key is the stripe length. The length in the stride is the stride length.

Here's an example for why this is needed:

wrote 32768/32768 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 131072/131072 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 8192/8192 bytes at offset 65536
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)

[snip]

item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
	encoding: RAID0
	stripe 0 devid 1 physical XXXXXXXXX length 32768
item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 80
	encoding: RAID0
	stripe 0 devid 1 physical XXXXXXXXX length 32768
	stripe 1 devid 2 physical XXXXXXXXX length 65536
	stripe 2 devid 1 physical XXXXXXXXX length 32768
item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
	encoding: RAID0
	stripe 0 devid 1 physical XXXXXXXXX length 8192

Without the length in the stride, we don't know when to select the next stride in item 1 above.
On 2023/9/15 19:25, Johannes Thumshirn wrote:
> On 15.09.23 02:27, Qu Wenruo wrote:
>> Forgot to mention, for btrfs_stripe_extent structure, its key is
>> (PHYSICAL, RAID_STRIPE_KEY, LENGTH) right?
>>
>> So is the length in the btrfs_raid_stride duplicated and we can save 8
>> bytes?
>
> Nope. The length in the key is the stripe length. The length in the
> stride is the stride length.
>
> Here's an example for why this is needed:
>
[snip]
>
> item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
> 	encoding: RAID0
> 	stripe 0 devid 1 physical XXXXXXXXX length 32768
> item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 80

Maybe you want to put the whole RAID_STRIPE_KEY definition into the headers.

In fact, my initial assumption for such a case would be something like this:

item 0 key (X+0 RAID_STRIPE 32K)
	stripe 0 devid 1 physical XXXXX len 32K
item 1 key (X+32K RAID_STRIPE 32K)
	stripe 0 devid 1 physical XXXXX + 32K len 32K
item 2 key (X+64K RAID_STRIPE 64K)
	stripe 0 devid 2 physical YYYYY len 64K
item 3 key (X+128K RAID_STRIPE 32K)
	stripe 0 devid 1 physical XXXXX + 64K len 32K
...

AKA, each RAID_STRIPE_KEY would only contain a contiguous physical stripe. And in the above case, item 0 and item 1 can easily be merged, and the length can be removed.

And this explains why the lookup code is more complex than I initially thought.

BTW, would the above layout make the code a little easier? Or is there any special reason for the existing layout?

Thanks,
Qu

> 	encoding: RAID0
> 	stripe 0 devid 1 physical XXXXXXXXX length 32768
> 	stripe 1 devid 2 physical XXXXXXXXX length 65536
> 	stripe 2 devid 1 physical XXXXXXXXX length 32768

This current layout has another problem: for RAID10 the interpretation of the RAID_STRIPE item can be very complex. While

> item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
> 	encoding: RAID0
> 	stripe 0 devid 1 physical XXXXXXXXX length 8192
>
> Without the length in the stride, we don't know when to select the next
> stride in item 1 above.
On 15.09.23 12:34, Qu Wenruo wrote:
>> [snip]
>>
>> item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
>> 	encoding: RAID0
>> 	stripe 0 devid 1 physical XXXXXXXXX length 32768
>> item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 80
>
> Maybe you want to put the whole RAID_STRIPE_KEY definition into the headers.
>
> In fact my initial assumption for such a case would be something like this:
>
> item 0 key (X+0 RAID_STRIPE 32K)
> 	stripe 0 devid 1 physical XXXXX len 32K
> item 1 key (X+32K RAID_STRIPE 32K)
> 	stripe 0 devid 1 physical XXXXX + 32K len 32K
> item 2 key (X+64K RAID_STRIPE 64K)
> 	stripe 0 devid 2 physical YYYYY len 64K
> item 3 key (X+128K RAID_STRIPE 32K)
> 	stripe 0 devid 1 physical XXXXX + 64K len 32K
> ...
>
> AKA, each RAID_STRIPE_KEY would only contain a contiguous physical stripe.
> And in the above case, item 0 and item 1 can easily be merged, and the
> length can be removed.
>
> And this explains why the lookup code is more complex than I initially
> thought.
>
> BTW, would the above layout make the code a little easier?
> Or is there any special reason for the existing layout?

It would definitely make the code easier, at the cost of more items. But of course smaller items, as we can get rid of the stride length.

Let me think about it.
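Qu's proposed one-contiguous-stride-per-item layout above would make merging adjacent items straightforward. A hedged sketch of that merge condition follows; all names here (`rst_item`, `rst_mergeable`) are invented for illustration and are not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the proposed layout: one contiguous physical stride per
 * RAID_STRIPE item, keyed by (logical, RAID_STRIPE, length). */
struct rst_item {
	uint64_t logical;   /* key objectid */
	uint64_t length;    /* key offset */
	uint64_t devid;
	uint64_t physical;
};

/* Two key-adjacent items can be merged iff they are contiguous both
 * logically and physically on the same device — the property that lets
 * item 0 and item 1 in Qu's example collapse into one item. */
static int rst_mergeable(const struct rst_item *a, const struct rst_item *b)
{
	return a->devid == b->devid &&
	       a->logical + a->length == b->logical &&
	       a->physical + a->length == b->physical;
}
```

The trade-off Johannes notes is visible here: each item is smaller (no per-stride length array), but a multi-stride write now produces several items instead of one.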
On 15.09.23 12:34, Qu Wenruo wrote:
[snip]
> BTW, would the above layout make the code a little easier?
> Or is there any special reason for the existing layout?

JFYI, preliminary tests of your suggestion look reasonably good. I'll give it some more testing and code cleanup, but it actually seems sensible to do.
diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index f958eccff477..977ff160a024 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -306,6 +306,16 @@ BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_timespec_nsec, struct btrfs_timespec, nsec, 32);
 
+BTRFS_SETGET_FUNCS(stripe_extent_encoding, struct btrfs_stripe_extent, encoding, 8);
+BTRFS_SETGET_FUNCS(raid_stride_devid, struct btrfs_raid_stride, devid, 64);
+BTRFS_SETGET_FUNCS(raid_stride_physical, struct btrfs_raid_stride, physical, 64);
+BTRFS_SETGET_FUNCS(raid_stride_length, struct btrfs_raid_stride, length, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_stripe_extent_encoding,
+			 struct btrfs_stripe_extent, encoding, 8);
+BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_devid, struct btrfs_raid_stride, devid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_physical, struct btrfs_raid_stride, physical, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_raid_stride_length, struct btrfs_raid_stride, length, 64);
+
 /* struct btrfs_dev_extent */
 BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent, chunk_tree, 64);
 BTRFS_SETGET_FUNCS(dev_extent_chunk_objectid, struct btrfs_dev_extent,
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index 6ac4fd8cc8dc..74d8e2003f58 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -74,6 +74,7 @@ static struct btrfs_lockdep_keyset {
 	{ .id = BTRFS_UUID_TREE_OBJECTID,	DEFINE_NAME("uuid")	},
 	{ .id = BTRFS_FREE_SPACE_TREE_OBJECTID,	DEFINE_NAME("free-space") },
 	{ .id = BTRFS_BLOCK_GROUP_TREE_OBJECTID, DEFINE_NAME("block-group") },
+	{ .id = BTRFS_RAID_STRIPE_TREE_OBJECTID, DEFINE_NAME("raid-stripe") },
 	{ .id = 0,				DEFINE_NAME("tree")	},
 };
 
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index fc3c32186d7e..6d9c43416b6e 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -73,6 +73,9 @@
 /* Holds the block group items for extent tree v2. */
 #define BTRFS_BLOCK_GROUP_TREE_OBJECTID 11ULL
 
+/* Tracks RAID stripes in block groups. */
+#define BTRFS_RAID_STRIPE_TREE_OBJECTID 12ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -261,6 +264,8 @@
 #define BTRFS_DEV_ITEM_KEY	216
 #define BTRFS_CHUNK_ITEM_KEY	228
 
+#define BTRFS_RAID_STRIPE_KEY	230
+
 /*
  * Records the overall state of the qgroups.
  * There's only one instance of this key present,
@@ -719,6 +724,32 @@ struct btrfs_free_space_header {
 	__le64 num_bitmaps;
 } __attribute__ ((__packed__));
 
+struct btrfs_raid_stride {
+	/* The btrfs device-id this raid extent lives on */
+	__le64 devid;
+	/* The physical location on disk */
+	__le64 physical;
+	/* The length of stride on this disk */
+	__le64 length;
+} __attribute__ ((__packed__));
+
+/* The stripe_extent::encoding, 1:1 mapping of enum btrfs_raid_types */
+#define BTRFS_STRIPE_RAID0	1
+#define BTRFS_STRIPE_RAID1	2
+#define BTRFS_STRIPE_DUP	3
+#define BTRFS_STRIPE_RAID10	4
+#define BTRFS_STRIPE_RAID5	5
+#define BTRFS_STRIPE_RAID6	6
+#define BTRFS_STRIPE_RAID1C3	7
+#define BTRFS_STRIPE_RAID1C4	8
+
+struct btrfs_stripe_extent {
+	__u8 encoding;
+	__u8 reserved[7];
+	/* An array of raid strides this stripe is composed of */
+	struct btrfs_raid_stride strides[];
+} __attribute__ ((__packed__));
+
 #define BTRFS_HEADER_FLAG_WRITTEN	(1ULL << 0)
 #define BTRFS_HEADER_FLAG_RELOC		(1ULL << 1)
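One of the tree-checker validations Qu suggests earlier in the thread (item-size alignment) follows directly from these definitions: a stripe extent item must be the 8-byte header plus a whole, positive number of 24-byte strides. A hypothetical userspace sketch of that check — the constants and helper are invented here, not kernel API:

```c
#include <assert.h>
#include <stdint.h>

#define STRIPE_EXTENT_HDR 8u   /* encoding + reserved[7] */
#define RAID_STRIDE_SIZE  24u  /* devid + physical + length, packed */

/* Return the number of strides encoded in an item of @item_size bytes,
 * or -1 if the size cannot correspond to a valid stripe extent. */
static int stripe_extent_nr_strides(uint32_t item_size)
{
	if (item_size <= STRIPE_EXTENT_HDR)
		return -1;  /* must hold at least one stride */
	if ((item_size - STRIPE_EXTENT_HDR) % RAID_STRIDE_SIZE)
		return -1;  /* trailing partial stride: corrupt item */
	return (int)((item_size - STRIPE_EXTENT_HDR) / RAID_STRIDE_SIZE);
}
```

The itemsize values from the example output earlier in the thread decode cleanly: 32 bytes is one stride, 80 bytes is three.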