Message ID | 20230911-raid-stripe-tree-v8-0-647676fa852c@wdc.com |
---|---|
Headers |
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
To: Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>, David Sterba <dsterba@suse.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>, Christoph Hellwig <hch@lst.de>, Naohiro Aota <naohiro.aota@wdc.com>, Qu Wenruo <wqu@suse.com>, Damien Le Moal <dlemoal@kernel.org>, linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org, Anand Jain <anand.jain@oracle.com>
Subject: [PATCH v8 00/11] btrfs: introduce RAID stripe tree
Date: Mon, 11 Sep 2023 05:52:01 -0700
Message-ID: <20230911-raid-stripe-tree-v8-0-647676fa852c@wdc.com>
X-Mailer: b4 0.12.3 |
Series |
btrfs: introduce RAID stripe tree
|
|
Message
Johannes Thumshirn
Sept. 11, 2023, 12:52 p.m. UTC
Updates of the raid-stripe-tree are done at ordered extent write time to save
on bandwidth, while for reading we do the stripe-tree lookup at bio mapping
time, i.e. when the logical to physical translation happens for regular btrfs
RAID as well.
The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
its contents are the respective physical device id and position.
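To make the keying concrete, here is a minimal user-space sketch of such a lookup. It is not the kernel implementation: the struct names, the linear search, and the fixed two-entry stride array are hypothetical, and it assumes a mirror-like layout in which every stride covers the whole extent (striping arithmetic is left out).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory model of one stripe-tree entry (not the on-disk format). */
struct rst_stride {
	uint64_t devid;    /* device holding this copy */
	uint64_t physical; /* byte offset of the extent on that device */
};

struct rst_entry {
	uint64_t logical;  /* key: the extent's disk_bytenr */
	uint64_t length;   /* key: the extent's disk_num_bytes */
	size_t num_strides;
	struct rst_stride strides[2];
};

/*
 * Translate a logical address into (devid, physical) for stride `idx`.
 * Returns 0 on success, -1 if no entry covers `logical` or `idx` is out of range.
 */
static int rst_lookup(const struct rst_entry *tree, size_t nr,
		      uint64_t logical, size_t idx,
		      uint64_t *devid, uint64_t *physical)
{
	assert(tree && devid && physical);

	for (size_t i = 0; i < nr; i++) {
		const struct rst_entry *e = &tree[i];

		/* Skip entries whose [logical, logical + length) range misses the address. */
		if (logical < e->logical || logical >= e->logical + e->length)
			continue;
		if (idx >= e->num_strides)
			return -1;

		/* The offset into the extent is the same on every mirror. */
		*devid = e->strides[idx].devid;
		*physical = e->strides[idx].physical + (logical - e->logical);
		return 0;
	}
	return -1;
}
```

A real b-tree search would replace the linear scan; the point here is only that the key identifies the extent and the item payload supplies one device/offset pair per stride.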
For an example 1M write (split into 126K segments due to zone-append):
rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)
The tree will look as follows (both 128k buffered writes to a ZNS drive):
RAID0 case:
bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
btrfs-progs v6.3
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
leaf 805535744 flags 0x1(WRITTEN) backref revision 1
checksum stored 2d2d2262
checksum calced 2d2d2262
fs uuid ab05cfc6-9859-404e-970d-3999b1cb5438
chunk uuid c9470ba2-49ac-4d46-8856-438a18e6bd23
item 0 key (1073741824 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 40
encoding: RAID0
stripe 0 devid 1 offset 805306368
stripe 1 devid 2 offset 536870912
total bytes 42949672960
bytes used 294912
uuid ab05cfc6-9859-404e-970d-3999b1cb5438
RAID1 case:
bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
btrfs-progs v6.3
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
leaf 805535744 flags 0x1(WRITTEN) backref revision 1
checksum stored 56199539
checksum calced 56199539
fs uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
chunk uuid 691874fc-1b9c-469b-bd7f-05e0e6ba88c4
item 0 key (939524096 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 40
encoding: RAID1
stripe 0 devid 1 offset 939524096
stripe 1 devid 2 offset 536870912
total bytes 42949672960
bytes used 294912
uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
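The sizes in the dumps above can be cross-checked with a little arithmetic. The sketch below assumes an item payload of an 8-byte header (one encoding byte plus padding) followed by one 16-byte devid/offset pair per stripe, a 16 KiB node size, a 101-byte leaf header and a 25-byte per-item header. These layout numbers are assumptions chosen for illustration, not taken from the cover letter, and the struct names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed payload layout: 8-byte header, then one 16-byte stride per device. */
struct stride_model {
	uint64_t devid;
	uint64_t physical;
};

struct stripe_extent_model {
	uint8_t encoding;
	uint8_t reserved[7];
	struct stride_model strides[]; /* flexible array, not counted by sizeof */
};

/* Payload size of one stripe-tree item with `num_stripes` strides. */
static size_t stripe_item_size(size_t num_stripes)
{
	assert(num_stripes > 0);
	return sizeof(struct stripe_extent_model) +
	       num_stripes * sizeof(struct stride_model);
}

/* Free space left in a leaf holding `nritems` items with `data_bytes` of payload. */
static size_t leaf_free_space(size_t nodesize, size_t nritems, size_t data_bytes)
{
	const size_t leaf_header = 101; /* assumed leaf header size */
	const size_t item_header = 25;  /* assumed per-item header size */

	return nodesize - leaf_header - nritems * item_header - data_bytes;
}
```

Under these assumptions, two stripes give 8 + 2 * 16 = 40 bytes, which matches `itemsize 40`, and a 16 KiB leaf with that single item has 16384 - 101 - 25 - 40 = 16218 bytes free, which matches `free space 16218`.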
A design document can be found here:
https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true
The user-space part of this series can be found here:
https://lore.kernel.org/linux-btrfs/20230215143109.2721722-1-johannes.thumshirn@wdc.com
Changes to v7:
- Huge rewrite
v7 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1677750131.git.johannes.thumshirn@wdc.com/
Changes to v6:
- Fix degraded RAID1 mounts
- Fix RAID0/10 mounts
v6 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1676470614.git.johannes.thumshirn@wdc.com
Changes to v5:
- Incorporated review comments from Josef and Christoph
- Rebased onto misc-next
v5 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1675853489.git.johannes.thumshirn@wdc.com
Changes to v4:
- Added patch to check for RST feature in sysfs
- Added RST lookups for scrubbing
- Fixed the error handling bug Josef pointed out
- Only check if we need to write out a RST once per delayed_ref head
- Added support for zoned data DUP with RST
Changes to v3:
- Rebased onto 20221120124734.18634-1-hch@lst.de
- Incorporated Josef's review
- Merged related patches
v3 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1666007330.git.johannes.thumshirn@wdc.com
Changes to v2:
- Bug fixes
- Rebased onto 20220901074216.1849941-1-hch@lst.de
- Added tracepoints
- Added leak checker
- Added RAID0 and RAID10
v2 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com
Changes to v1:
- Write the stripe-tree at delayed-ref time (Qu)
- Add a different write path for preallocation
v1 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
Johannes Thumshirn (11):
btrfs: add raid stripe tree definitions
btrfs: read raid-stripe-tree from disk
btrfs: add support for inserting raid stripe extents
btrfs: delete stripe extent on extent deletion
btrfs: lookup physical address from stripe extent
btrfs: implement RST version of scrub
btrfs: zoned: allow zoned RAID
btrfs: add raid stripe tree pretty printer
btrfs: announce presence of raid-stripe-tree in sysfs
btrfs: add trace events for RST
btrfs: add raid-stripe-tree to features enabled with debug
fs/btrfs/Makefile | 2 +-
fs/btrfs/accessors.h | 10 +
fs/btrfs/bio.c | 23 ++
fs/btrfs/block-rsv.c | 6 +
fs/btrfs/disk-io.c | 18 ++
fs/btrfs/disk-io.h | 5 +
fs/btrfs/extent-tree.c | 7 +
fs/btrfs/fs.h | 4 +-
fs/btrfs/inode.c | 8 +-
fs/btrfs/locking.c | 5 +-
fs/btrfs/ordered-data.c | 1 +
fs/btrfs/ordered-data.h | 2 +
fs/btrfs/print-tree.c | 49 ++++
fs/btrfs/raid-stripe-tree.c | 493 ++++++++++++++++++++++++++++++++++++++++
fs/btrfs/raid-stripe-tree.h | 52 +++++
fs/btrfs/scrub.c | 56 +++++
fs/btrfs/sysfs.c | 3 +
fs/btrfs/volumes.c | 43 +++-
fs/btrfs/volumes.h | 15 +-
fs/btrfs/zoned.c | 113 ++++++++-
include/trace/events/btrfs.h | 75 ++++++
include/uapi/linux/btrfs.h | 1 +
include/uapi/linux/btrfs_tree.h | 33 ++-
23 files changed, 999 insertions(+), 25 deletions(-)
---
base-commit: 133da717263112d81bb95b5535ceb2c1eeddd4e7
change-id: 20230613-raid-stripe-tree-e330c9a45cc3
Best regards,
Comments
On Mon, Sep 11, 2023 at 05:52:11AM -0700, Johannes Thumshirn wrote:
> Add trace events for raid-stripe-tree operations.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/raid-stripe-tree.c  |  8 +++++
>  include/trace/events/btrfs.h | 75 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 83 insertions(+)
>
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index 7ed02e4b79ec..5a9952cf557c 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -62,6 +62,9 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start,
>  		if (found_end <= start)
>  			break;
>
> +		trace_btrfs_raid_extent_delete(fs_info, start, end,
> +					       found_start, found_end);
> +
>  		ASSERT(found_start >= start && found_end <= end);
>  		ret = btrfs_del_item(trans, stripe_root, path);
>  		if (ret)
> @@ -120,6 +123,8 @@ static int btrfs_insert_one_raid_extent(struct btrfs_trans_handle *trans,
>  		return -ENOMEM;
>  	}
>
> +	trace_btrfs_insert_one_raid_extent(fs_info, bioc->logical, bioc->size,
> +					   num_stripes);
>  	btrfs_set_stack_stripe_extent_encoding(stripe_extent, encoding);
>  	for (int i = 0; i < num_stripes; i++) {
>  		u64 devid = bioc->stripes[i].dev->devid;
> @@ -445,6 +450,9 @@ int btrfs_get_raid_extent_offset(struct btrfs_fs_info *fs_info,
>
>  		stripe->physical = physical + offset;
>
> +		trace_btrfs_get_raid_extent_offset(fs_info, logical, *length,
> +						   stripe->physical, devid);
> +
>  		ret = 0;
>  		goto free_path;
>  	}
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index b2db2c2f1c57..e2c6f1199212 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -2497,6 +2497,81 @@ DEFINE_EVENT(btrfs_raid56_bio, raid56_write,
>  	TP_ARGS(rbio, bio, trace_info)
>  );
>
> +TRACE_EVENT(btrfs_insert_one_raid_extent,
> +
> +	TP_PROTO(struct btrfs_fs_info *fs_info, u64 logical, u64 length,
> +		 int num_stripes),
> +
> +	TP_ARGS(fs_info, logical, length, num_stripes),
> +
> +	TP_STRUCT__entry_btrfs(
> +		__field(	u64,	logical		)
> +		__field(	u64,	length		)
> +		__field(	int,	num_stripes	)
> +	),
> +
> +	TP_fast_assign_btrfs(fs_info,
> +		__entry->logical = logical;
> +		__entry->length = length;
> +		__entry->num_stripes = num_stripes;
> +	),
> +
> +	TP_printk_btrfs("logical=%llu, length=%llu, num_stripes=%d",
> +			__entry->logical, __entry->length,
> +			__entry->num_stripes)

Tracepoint messages should follow the formatting guidelines:
https://btrfs.readthedocs.io/en/latest/dev/Development-notes.html#tracepoints

> +);
> +
> +TRACE_EVENT(btrfs_raid_extent_delete,
> +
> +	TP_PROTO(struct btrfs_fs_info *fs_info, u64 start, u64 end,
> +		 u64 found_start, u64 found_end),
> +
> +	TP_ARGS(fs_info, start, end, found_start, found_end),
> +
> +	TP_STRUCT__entry_btrfs(
> +		__field(	u64,	start		)
> +		__field(	u64,	end		)
> +		__field(	u64,	found_start	)
> +		__field(	u64,	found_end	)
> +	),
> +
> +	TP_fast_assign_btrfs(fs_info,
> +		__entry->start = start;
> +		__entry->end = end;
> +		__entry->found_start = found_start;
> +		__entry->found_end = found_end;

Tracepoints follow the fancy spacing and alignment in the assign blocks.

> +	),
> +
> +	TP_printk_btrfs("start=%llu, end=%llu, found_start=%llu, found_end=%llu",
> +			__entry->start, __entry->end, __entry->found_start,
> +			__entry->found_end)
> +);
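As an illustration of the first review remark, the following user-space sketch formats the insert event the way other btrfs tracepoints appear to, assuming the linked guideline asks for space-separated `key=value` pairs without commas. That reading of the guideline is an assumption, and `format_insert_event()` is a hypothetical stand-in for the `TP_printk_btrfs()` format string, not kernel code.

```c
#include <assert.h>
#include <stdio.h>

/*
 * User-space stand-in for the tracepoint format string.
 * Assumed convention: key=value pairs separated by single spaces, no commas.
 * Returns the number of characters written, as snprintf() does.
 */
static int format_insert_event(char *buf, size_t len,
			       unsigned long long logical,
			       unsigned long long length, int num_stripes)
{
	assert(buf && len > 0);
	return snprintf(buf, len, "logical=%llu length=%llu num_stripes=%d",
			logical, length, num_stripes);
}
```

Calling it with the RAID0 example from the cover letter (logical 1073741824, length 131072, two stripes) yields a single line of three space-separated pairs, which is easier to split in trace post-processing than a comma-separated message.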
On 2023/9/11 22:22, Johannes Thumshirn wrote:
> A filesystem that uses the RAID stripe tree for logical to physical
> address translation can't use the regular scrub path, that reads all
> stripes and then checks if a sector is unused afterwards.
>
> When using the RAID stripe tree, this will result in lookup errors, as the
> stripe tree doesn't know the requested logical addresses.
>
> Instead, look up stripes that are backed by the extent bitmap.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/scrub.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index f16220ce5fba..5101e0a3f83e 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -23,6 +23,7 @@
>  #include "accessors.h"
>  #include "file-item.h"
>  #include "scrub.h"
> +#include "raid-stripe-tree.h"
>
>  /*
>   * This is only the first step towards a full-features scrub. It reads all
> @@ -1634,6 +1635,56 @@ static void scrub_reset_stripe(struct scrub_stripe *stripe)
>  	}
>  }
>
> +static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
> +					    struct scrub_stripe *stripe)
> +{
> +	struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
> +	struct btrfs_bio *bbio = NULL;
> +	int mirror = stripe->mirror_num;
> +	int i;
> +
> +	atomic_inc(&stripe->pending_io);
> +
> +	for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
> +		struct page *page;
> +		int pgoff;
> +
> +		page = scrub_stripe_get_page(stripe, i);
> +		pgoff = scrub_stripe_get_page_offset(stripe, i);
> +
> +		/* The current sector cannot be merged, submit the bio. */
> +		if (bbio &&
> +		    ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
> +		     bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) {
> +			ASSERT(bbio->bio.bi_iter.bi_size);
> +			atomic_inc(&stripe->pending_io);
> +			btrfs_submit_bio(bbio, mirror);
> +			bbio = NULL;
> +		}
> +
> +		if (!bbio) {
> +			bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
> +					       fs_info, scrub_read_endio, stripe);
> +			bbio->bio.bi_iter.bi_sector = (stripe->logical +
> +				(i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
> +		}
> +
> +		__bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
> +	}
> +
> +	if (bbio) {
> +		ASSERT(bbio->bio.bi_iter.bi_size);
> +		atomic_inc(&stripe->pending_io);
> +		btrfs_submit_bio(bbio, mirror);

Since RST is looked up during btrfs_submit_bio() (to be more accurate, in
set_io_stripe()), and as far as I just checked there is no special
requirement that makes btrfs do the lookup using the commit root, we can
have a problem where extent items and RST are out of sync.

For scrub, all the extent items are searched using the commit root, but
btrfs_get_raid_extent_offset() only uses the current root.
Thus you would get some problems during fsstress and scrub.

We need some way to distinguish scrub bbios from regular ones (which is a
completely new requirement).

For now only scrub doesn't initialize bbio->inode, so that can be used to
make the distinction (at least for now).

Thanks,
Qu

> +	}
> +
> +	if (atomic_dec_and_test(&stripe->pending_io)) {
> +		wake_up(&stripe->io_wait);
> +		INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
> +		queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
> +	}
> +}
> +
>  static void scrub_submit_initial_read(struct scrub_ctx *sctx,
>  				      struct scrub_stripe *stripe)
>  {
> @@ -1645,6 +1696,11 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx,
>  	ASSERT(stripe->mirror_num > 0);
>  	ASSERT(test_bit(SCRUB_STRIPE_FLAG_INITIALIZED, &stripe->state));
>
> +	if (btrfs_need_stripe_tree_update(fs_info, stripe->bg->flags)) {
> +		scrub_submit_extent_sector_read(sctx, stripe);
> +		return;
> +	}
> +
>  	bbio = btrfs_bio_alloc(SCRUB_STRIPE_PAGES, REQ_OP_READ, fs_info,
>  			       scrub_read_endio, stripe);
>
>
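One possible shape for the distinction suggested in this reply, reduced to a toy user-space model. The types and the helper name are hypothetical stand-ins (the real `struct btrfs_bio` has many more fields), and this only illustrates the idea of keying the choice of root on whether `bbio->inode` is set; it is not what the series implements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures. */
struct inode_model {
	int unused;
};

struct bbio_model {
	struct inode_model *inode; /* left NULL by scrub, per the review comment */
};

/*
 * Decide which version of the stripe tree a lookup should search.
 * Scrub finds its extent items via the commit root, so its stripe-tree
 * lookups would use the commit root too; regular I/O keeps the current root.
 */
static bool rst_lookup_uses_commit_root(const struct bbio_model *bbio)
{
	assert(bbio);
	return bbio->inode == NULL;
}
```

Tying the decision to a NULL inode is fragile, as the reply itself notes with "at least for now"; an explicit flag on the bio would make the intent clearer if more inode-less submitters appear.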
On Mon, Sep 11, 2023 at 05:52:07AM -0700, Johannes Thumshirn wrote:
> A filesystem that uses the RAID stripe tree for logical to physical
> address translation can't use the regular scrub path, that reads all
> stripes and then checks if a sector is unused afterwards.
>
> When using the RAID stripe tree, this will result in lookup errors, as the
> stripe tree doesn't know the requested logical addresses.
>
> Instead, look up stripes that are backed by the extent bitmap.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/scrub.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index f16220ce5fba..5101e0a3f83e 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -23,6 +23,7 @@
>  #include "accessors.h"
>  #include "file-item.h"
>  #include "scrub.h"
> +#include "raid-stripe-tree.h"
>
>  /*
>   * This is only the first step towards a full-features scrub. It reads all
> @@ -1634,6 +1635,56 @@ static void scrub_reset_stripe(struct scrub_stripe *stripe)
>  	}
>  }
>
> +static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
> +					    struct scrub_stripe *stripe)
> +{
> +	struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
> +	struct btrfs_bio *bbio = NULL;
> +	int mirror = stripe->mirror_num;
> +	int i;
> +
> +	atomic_inc(&stripe->pending_io);
> +
> +	for_each_set_bit(i, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
> +		struct page *page;
> +		int pgoff;

This should be unsigned int.

> +
> +		page = scrub_stripe_get_page(stripe, i);
> +		pgoff = scrub_stripe_get_page_offset(stripe, i);

You can probably move the initializations right to the declarations, I
think we have that elsewhere too.

> +		/* The current sector cannot be merged, submit the bio. */
> +		if (bbio &&
> +		    ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
> +		     bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) {
> +			ASSERT(bbio->bio.bi_iter.bi_size);
> +			atomic_inc(&stripe->pending_io);
> +			btrfs_submit_bio(bbio, mirror);
> +			bbio = NULL;
> +		}
> +
> +		if (!bbio) {
> +			bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
> +					       fs_info, scrub_read_endio, stripe);
> +			bbio->bio.bi_iter.bi_sector = (stripe->logical +
> +				(i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
> +		}
> +
> +		__bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
> +	}
> +
> +	if (bbio) {
> +		ASSERT(bbio->bio.bi_iter.bi_size);
> +		atomic_inc(&stripe->pending_io);
> +		btrfs_submit_bio(bbio, mirror);
> +	}
> +
> +	if (atomic_dec_and_test(&stripe->pending_io)) {
> +		wake_up(&stripe->io_wait);
> +		INIT_WORK(&stripe->work, scrub_stripe_read_repair_worker);
> +		queue_work(stripe->bg->fs_info->scrub_workers, &stripe->work);
> +	}
> +}
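The merge rule in the quoted loop (start a new bio when the previous sector is not set in the bitmap, or when the current bio has reached the stripe length) can be modelled in user space as follows. This is a sketch only: `build_read_runs()` and `struct read_run` are hypothetical, the bitmap is limited to 64 sectors, and the 4 KiB sector size and 64 KiB stripe length are assumed values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTORSIZE 4096u          /* assumed sector size */
#define STRIPE_LEN (64u * 1024u)  /* assumed stripe length */

/* One contiguous read, standing in for one submitted bio. */
struct read_run {
	unsigned int first_sector;
	unsigned int bytes;
};

/* Split the set bits of `bitmap` into runs; returns the number of runs written. */
static size_t build_read_runs(uint64_t bitmap, unsigned int nr_sectors,
			      struct read_run *runs, size_t max_runs)
{
	size_t nr = 0;
	int open = 0; /* is there a run that can still grow? */

	assert(runs && nr_sectors <= 64);

	for (unsigned int i = 0; i < nr_sectors; i++) {
		if (!(bitmap & (UINT64_C(1) << i)))
			continue;

		/* The current sector cannot be merged: close the open run. */
		if (open &&
		    ((i > 0 && !(bitmap & (UINT64_C(1) << (i - 1)))) ||
		     runs[nr - 1].bytes >= STRIPE_LEN))
			open = 0;

		if (!open) {
			if (nr == max_runs)
				return nr;
			runs[nr].first_sector = i;
			runs[nr].bytes = 0;
			nr++;
			open = 1;
		}
		runs[nr - 1].bytes += SECTORSIZE;
	}
	return nr;
}
```

For a bitmap with sectors 0, 1, 4, 5 and 6 set, this produces two runs (two sectors starting at 0, three sectors starting at 4), so only sectors backed by extents are ever read and the stripe tree is never asked about unused addresses.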
On 2023/9/11 22:22, Johannes Thumshirn wrote:
> If we find a raid-stripe-tree on mount, read it from disk.
>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Reviewed-by: Anand Jain <anand.jain@oracle.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/block-rsv.c       |  6 ++++++
>  fs/btrfs/disk-io.c         | 18 ++++++++++++++++++
>  fs/btrfs/disk-io.h         |  5 +++++
>  fs/btrfs/fs.h              |  1 +
>  include/uapi/linux/btrfs.h |  1 +
>  5 files changed, 31 insertions(+)
>
> diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
> index 77684c5e0c8b..4e55e5f30f7f 100644
> --- a/fs/btrfs/block-rsv.c
> +++ b/fs/btrfs/block-rsv.c
> @@ -354,6 +354,11 @@ void btrfs_update_global_block_rsv(struct btrfs_fs_info *fs_info)
>  		min_items++;
>  	}
>
> +	if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE)) {
> +		num_bytes += btrfs_root_used(&fs_info->stripe_root->root_item);
> +		min_items++;
> +	}
> +
>  	/*
>  	 * But we also want to reserve enough space so we can do the fallback
>  	 * global reserve for an unlink, which is an additional
> @@ -405,6 +410,7 @@ void btrfs_init_root_block_rsv(struct btrfs_root *root)
>  	case BTRFS_EXTENT_TREE_OBJECTID:
>  	case BTRFS_FREE_SPACE_TREE_OBJECTID:
>  	case BTRFS_BLOCK_GROUP_TREE_OBJECTID:
> +	case BTRFS_RAID_STRIPE_TREE_OBJECTID:
>  		root->block_rsv = &fs_info->delayed_refs_rsv;
>  		break;
>  	case BTRFS_ROOT_TREE_OBJECTID:
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 4c5d71065ea8..1ecebcfc1c17 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1179,6 +1179,8 @@ static struct btrfs_root *btrfs_get_global_root(struct btrfs_fs_info *fs_info,
>  		return btrfs_grab_root(fs_info->block_group_root);
>  	case BTRFS_FREE_SPACE_TREE_OBJECTID:
>  		return btrfs_grab_root(btrfs_global_root(fs_info, &key));
> +	case BTRFS_RAID_STRIPE_TREE_OBJECTID:
> +		return btrfs_grab_root(fs_info->stripe_root);
>  	default:
>  		return NULL;
>  	}
> @@ -1259,6 +1261,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
>  	btrfs_put_root(fs_info->fs_root);
>  	btrfs_put_root(fs_info->data_reloc_root);
>  	btrfs_put_root(fs_info->block_group_root);
> +	btrfs_put_root(fs_info->stripe_root);
>  	btrfs_check_leaked_roots(fs_info);
>  	btrfs_extent_buffer_leak_debug_check(fs_info);
>  	kfree(fs_info->super_copy);
> @@ -1804,6 +1807,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, bool free_chunk_root)
>  	free_root_extent_buffers(info->fs_root);
>  	free_root_extent_buffers(info->data_reloc_root);
>  	free_root_extent_buffers(info->block_group_root);
> +	free_root_extent_buffers(info->stripe_root);
>  	if (free_chunk_root)
>  		free_root_extent_buffers(info->chunk_root);
>  }
> @@ -2280,6 +2284,20 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>  		fs_info->uuid_root = root;
>  	}
>
> +	if (btrfs_fs_incompat(fs_info, RAID_STRIPE_TREE)) {
> +		location.objectid = BTRFS_RAID_STRIPE_TREE_OBJECTID;
> +		root = btrfs_read_tree_root(tree_root, &location);
> +		if (IS_ERR(root)) {
> +			if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) {
> +				ret = PTR_ERR(root);
> +				goto out;
> +			}
> +		} else {
> +			set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> +			fs_info->stripe_root = root;
> +		}
> +	}
> +
>  	return 0;
>  out:
>  	btrfs_warn(fs_info, "failed to read root (objectid=%llu): %d",
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index 02b645744a82..8b7f01a01c44 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -103,6 +103,11 @@ static inline struct btrfs_root *btrfs_grab_root(struct btrfs_root *root)
>  	return NULL;
>  }
>
> +static inline struct btrfs_root *btrfs_stripe_tree_root(struct btrfs_fs_info *fs_info)
> +{
> +	return fs_info->stripe_root;
> +}
> +

Do we really need this? IIRC we never have a wrapper for fs_info->fs_root.

Thanks,
Qu

>  void btrfs_put_root(struct btrfs_root *root);
>  void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
>  int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index d84a390336fc..5c7778e8b5ed 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -367,6 +367,7 @@ struct btrfs_fs_info {
>  	struct btrfs_root *uuid_root;
>  	struct btrfs_root *data_reloc_root;
>  	struct btrfs_root *block_group_root;
> +	struct btrfs_root *stripe_root;
>
>  	/* The log root tree is a directory of all the other log roots */
>  	struct btrfs_root *log_root_tree;
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index dbb8b96da50d..b9a1d9af8ae8 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -333,6 +333,7 @@ struct btrfs_ioctl_fs_info_args {
>  #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
>  #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
>  #define BTRFS_FEATURE_INCOMPAT_EXTENT_TREE_V2	(1ULL << 13)
> +#define BTRFS_FEATURE_INCOMPAT_RAID_STRIPE_TREE	(1ULL << 14)
>
>  struct btrfs_ioctl_feature_flags {
>  	__u64 compat_flags;
>
On 14.09.23 11:27, Qu Wenruo wrote:
>> +static inline struct btrfs_root *btrfs_stripe_tree_root(struct btrfs_fs_info *fs_info)
>> +{
>> +	return fs_info->stripe_root;
>> +}
>> +
>
> Do we really need this? IIRC we never have a wrapper for fs_info->fs_root.

This was requested by Josef a while ago, to make the conversion to
per-block-group stripe trees easier. But hch also wanted me to remove it
(and I thought I already did), so lemme get rid of it if Josef doesn't
speak up.