Message ID | 20240225023826.2413565-1-kent.overstreet@linux.dev |
---|---|
Headers |
From: Kent Overstreet <kent.overstreet@linux.dev> To: linux-bcachefs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Cc: Kent Overstreet <kent.overstreet@linux.dev>, djwong@kernel.org, bfoster@redhat.com Subject: [PATCH 00/21] bcachefs disk accounting rewrite Date: Sat, 24 Feb 2024 21:38:02 -0500 Message-ID: <20240225023826.2413565-1-kent.overstreet@linux.dev> List-Id: <linux-kernel.vger.kernel.org> |
Series |
bcachefs disk accounting rewrite
|
|
Message
Kent Overstreet
Feb. 25, 2024, 2:38 a.m. UTC
here it is; the disk accounting rewrite I've been talking about since
forever.

git link:
https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-disk-accounting-rewrite

test dashboard (just rebased, results are regenerating as of this
writing but there shouldn't be any regressions left):
https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs-disk-accounting-rewrite

The old disk accounting scheme was fast, but had some limitations:

 - lack of scalability: it was based on percpu counters additionally
   sharded by outstanding journal buffer, and then just prior to journal
   write we'd roll up the counters and add them to the journal entry.

   But this meant that all counters were added to every journal write,
   which meant it'd never be able to support per-snapshot counters.

 - it was a pain to extend

   this was why, until now, we didn't have proper compressed accounting,
   and getting compression ratio required a full btree scan

In the new scheme:

 - every set of counters is a bkey, a key in a btree
   (BTREE_ID_accounting).

   this means they aren't pinned in the journal

 - the key has structure, and is extensible

   disk_accounting_key is a tagged union, and it's just union'd over bpos

 - counters are deltas, until flushed to the underlying btree

   this means counter updates are normal btree updates; the btree write
   buffer makes counter updates efficient.

Since reading counters from the btree would be expensive - it'd require
a write buffer flush to get up-to-date counters - we also maintain a
parallel set of accounting in memory, a bit like the old scheme but
without the per-journal-buffer sharding. The in-memory counters are
indexed in an eytzinger tree by disk_accounting_key/bpos, with the
counters themselves being percpu u64s.

Reviewers: do a "is this adequately documented, can I find my way
around, do things make sense" review, not a line-by-line "does this have
bugs" review.
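For readers who haven't met the eytzinger layout mentioned above, here is a minimal, generic sketch of the idea: a sorted set of keys is stored in breadth-first order in a 1-based array, so the children of slot k are slots 2k and 2k+1, and a lookup walks down from slot 1. This is not the kernel's eytzinger helpers; `acct_key_t`, `eytzinger_build()` and `eytzinger_find()` are hypothetical names, and a plain u64 stands in for disk_accounting_key/bpos.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened key; stands in for disk_accounting_key/bpos. */
typedef uint64_t acct_key_t;

/*
 * Fill out[1..n] (1-based) from the sorted keys in[0..n-1] in eytzinger
 * (breadth-first) order. An in-order walk of the implicit tree visits the
 * slots in sorted key order, so we hand out sorted keys during that walk.
 * Returns the number of input keys consumed so far.
 */
static size_t eytzinger_build(const acct_key_t *in, size_t pos,
			      acct_key_t *out, size_t k, size_t n)
{
	if (k <= n) {
		pos = eytzinger_build(in, pos, out, 2 * k, n);
		out[k] = in[pos++];
		pos = eytzinger_build(in, pos, out, 2 * k + 1, n);
	}
	return pos;
}

/* Return the slot holding key in tree[1..n], or 0 if it is not present. */
static size_t eytzinger_find(const acct_key_t *tree, size_t n, acct_key_t key)
{
	size_t k = 1;

	assert(tree != NULL);
	while (k <= n) {
		if (tree[k] == key)
			return k;
		/* go left (2k) if key is smaller, right (2k + 1) otherwise */
		k = 2 * k + (tree[k] < key);
	}
	return 0;
}
```

The slot returned by the lookup would then index a parallel array of counters; in the scheme described above those counters are percpu u64s, which this sketch leaves out.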
Compatibility: this is in no way compatible with the old disk accounting
on disk format, and it's not feasible to write out accounting in the old
format - that means we have to regenerate accounting when upgrading or
downgrading past this version.

That should work more or less seamlessly with the most recent compat
bits (bch_sb_field downgrade, so we can tell older versions what
recovery passes to run and what to fix); additionally, userspace fsck
now checks if the kernel bcachefs version better matches the on disk
version than itself and if so uses the kernel fsck implementation with
the OFFLINE_FSCK ioctl - so we shouldn't be bouncing back and forth
between versions if your tools and kernel don't match.

upgrade/downgrade still need a bit more testing, but transparently using
kernel fsck is well tested as of latest versions.

but: 6.7 users (& possibly 6.8) beware, the sb_downgrade section is in
6.7 but BCH_IOCTL_OFFLINE_FSCK is not, and backporting that doesn't look
likely given current -stable process fiasco.

merge ETA - this stuff may make the next merge window; I'd like to get
per-snapshot-id accounting done with it, that should be the biggest item
left.
Cheers,
Kent

Kent Overstreet (21):
  bcachefs: KEY_TYPE_accounting
  bcachefs: Accumulate accounting keys in journal replay
  bcachefs: btree write buffer knows how to accumulate bch_accounting keys
  bcachefs: Disk space accounting rewrite
  bcachefs: dev_usage updated by new accounting
  bcachefs: Kill bch2_fs_usage_initialize()
  bcachefs: Convert bch2_ioctl_fs_usage() to new accounting
  bcachefs: kill bch2_fs_usage_read()
  bcachefs: Kill writing old accounting to journal
  bcachefs: Delete journal-buf-sharded old style accounting
  bcachefs: Kill bch2_fs_usage_to_text()
  bcachefs: Kill fs_usage_online
  bcachefs: Kill replicas_journal_res
  bcachefs: Convert gc to new accounting
  bcachefs: Convert bch2_replicas_gc2() to new accounting
  bcachefs: bch2_verify_accounting_clean()
  bcachefs: Eytzinger accumulation for accounting keys
  bcachefs: bch_acct_compression
  bcachefs: Convert bch2_compression_stats_to_text() to new accounting
  bcachefs: bch2_fs_accounting_to_text()
  bcachefs: bch2_fs_usage_base_to_text()

 fs/bcachefs/Makefile                   |   3 +-
 fs/bcachefs/alloc_background.c         | 137 +++--
 fs/bcachefs/alloc_background.h         |   2 +
 fs/bcachefs/bcachefs.h                 |  22 +-
 fs/bcachefs/bcachefs_format.h          |  81 +--
 fs/bcachefs/bcachefs_ioctl.h           |   7 +-
 fs/bcachefs/bkey_methods.c             |   1 +
 fs/bcachefs/btree_gc.c                 | 259 ++++------
 fs/bcachefs/btree_iter.c               |   9 -
 fs/bcachefs/btree_journal_iter.c       |  23 +-
 fs/bcachefs/btree_journal_iter.h       |  15 +
 fs/bcachefs/btree_trans_commit.c       |  71 ++-
 fs/bcachefs/btree_types.h              |   1 -
 fs/bcachefs/btree_update.h             |  22 +-
 fs/bcachefs/btree_write_buffer.c       | 120 ++++-
 fs/bcachefs/btree_write_buffer.h       |  50 +-
 fs/bcachefs/btree_write_buffer_types.h |   2 +
 fs/bcachefs/buckets.c                  | 663 ++++---------------------
 fs/bcachefs/buckets.h                  |  70 +--
 fs/bcachefs/buckets_types.h            |  14 +-
 fs/bcachefs/chardev.c                  |  75 +--
 fs/bcachefs/disk_accounting.c          | 584 ++++++++++++++++++++++
 fs/bcachefs/disk_accounting.h          | 203 ++++++++
 fs/bcachefs/disk_accounting_format.h   | 145 ++++++
 fs/bcachefs/disk_accounting_types.h    |  20 +
 fs/bcachefs/ec.c                       | 166 ++++---
 fs/bcachefs/inode.c                    |  42 +-
 fs/bcachefs/journal_io.c               |  13 +-
 fs/bcachefs/recovery.c                 | 126 +++--
 fs/bcachefs/recovery_types.h           |   1 +
 fs/bcachefs/replicas.c                 | 242 ++-------
 fs/bcachefs/replicas.h                 |  16 +-
 fs/bcachefs/replicas_format.h          |  21 +
 fs/bcachefs/replicas_types.h           |  16 -
 fs/bcachefs/sb-clean.c                 |  62 ---
 fs/bcachefs/sb-downgrade.c             |  12 +-
 fs/bcachefs/sb-errors_types.h          |   4 +-
 fs/bcachefs/super.c                    |  74 ++-
 fs/bcachefs/sysfs.c                    | 109 ++--
 39 files changed, 1873 insertions(+), 1630 deletions(-)
 create mode 100644 fs/bcachefs/disk_accounting.c
 create mode 100644 fs/bcachefs/disk_accounting.h
 create mode 100644 fs/bcachefs/disk_accounting_format.h
 create mode 100644 fs/bcachefs/disk_accounting_types.h
 create mode 100644 fs/bcachefs/replicas_format.h
Comments
On Thu, Feb 29, 2024 at 04:16:00PM -0500, Kent Overstreet wrote:
> On Thu, Feb 29, 2024 at 01:44:27PM -0500, Brian Foster wrote:
> > On Wed, Feb 28, 2024 at 11:10:12PM -0500, Kent Overstreet wrote:
> > > I think it ended up not needing to be moved, and I just forgot to drop
> > > it - originally I disallowed accounting entries that referenced
> > > nonexistent devices, but that wasn't workable so now it's only nonzero
> > > accounting keys that aren't allowed to reference nonexistent devices.
> > >
> > > I'll see if I can delete it.
> > >
> >
> > Do you mean to delete the change that moves the call, or the flush call
> > entirely?
>
> Delete the change, I think there's further cleanup (& probably bugs to
> fix) possible with that flush call but I'm not going to get into it
> right now.
>

Ok, just trying to determine whether I need to look back and make sure
this doesn't regress the problem this originally fixed.

> > > +/*
> > > + * Notes on disk accounting:
> > > + *
> > > + * We have two parallel sets of counters to be concerned with, and both must be
> > > + * kept in sync.
> > > + *
> > > + * - Persistent/on disk accounting, stored in the accounting btree and updated
> > > + *   via btree write buffer updates that treat new accounting keys as deltas to
> > > + *   apply to existing values. But reading from a write buffer btree is
> > > + *   expensive, so we also have
> > > + *
> >
> > I find the wording a little odd here, and I also think it would be
> > helpful to explain how/from where the deltas originate. For example,
> > something along the lines of:
> >
> > "Persistent/on disk accounting, stored in the accounting btree and
> > updated via btree write buffer updates. Accounting updates are
> > represented as deltas that originate from <somewhere? trans triggers?>.
> > Accounting keys represent these deltas through commit into the write
> > buffer. The accounting/delta keys in the write buffer are then
> > accumulated into the appropriate accounting btree key at write buffer
> > flush time."
>
> yeah, that's worth including.
>
> There's an interesting point that you're touching on; btree write buffer
> updates are always dependent state changes from some other (non write
> buffer) btree; we never look at a write buffer btree and generate an
> update there - we can't, reading from a write buffer btree doesn't get
> you anything consistent or up to date.
>
> So in normal operation it really only makes sense to do write buffer
> updates from a transactional trigger - that's the only way to use them
> and have them be consistent with the rest of the filesystem.
>
> And since triggers work by comparing old and new, they naturally
> generate updates that are deltas.
>

Hm that is interesting, I hadn't made that connection. Thanks.

Brian

> > > + * - In memory accounting, where accounting is stored as an array of percpu
> > > + *   counters, indexed by an eytzinger array of disk accounting keys/bpos (which
> > > + *   are the same thing, excepting byte swabbing on big endian).
> > > + *
> >
> > Not really sure about the keys vs. bpos thing, kind of related to my
> > comments on the earlier patch. It might be more clear to just elide the
> > implementation details here, i.e.:
> >
> > "In memory accounting, where accounting is stored as an array of percpu
> > counters that are cheap to read, but not persistent. Updates to in
> > memory accounting are propagated from the transaction commit path."
> >
> > ... but NBD, and feel free to reword, drop and/or correct any of that
> > text.
>
> It's there because bch2_accounting_mem_read() takes a bpos when it
> should be a disk_accounting_key. I'll fix that if I can...
>
> > > + *   Cheap to read, but non persistent.
> > > + *
> > > + * To do a disk accounting update:
> > > + * - initialize a disk_accounting_key, to specify which counter is being updated
> > > + * - initialize counter deltas, as an array of 1-3 s64s
> > > + * - call bch2_disk_accounting_mod()
> > > + *
> > > + * This queues up the accounting update to be done at transaction commit time.
> > > + * Underneath, it's a normal btree write buffer update.
> > > + *
> > > + * The transaction commit path is responsible for propagating updates to the in
> > > + * memory counters, with bch2_accounting_mem_mod().
> > > + *
> > > + * The commit path also assigns every disk accounting update a unique version
> > > + * number, based on the journal sequence number and offset within that journal
> > > + * buffer; this is used by journal replay to determine which updates have been
> > > + * done.
> > > + *
> > > + * The transaction commit path also ensures that replicas entry accounting
> > > + * updates are properly marked in the superblock (so that we know whether we can
> > > + * mount without data being unavailable); it will update the superblock if
> > > + * bch2_accounting_mem_mod() tells it to.
> >
> > I'm not really sure what this last paragraph is telling me, but granted
> > I've not got that far into the code yet either.
>
> yeah that's for a whole different subsystem that happens to be slaved to
> the accounting - replicas.c, which also used to help out quite a bit
> with the accounting but now it's pretty much just for managing the
> superblock replicas section.
>
> The superblock replicas section is just a list of entries, where each
> entry is a list of devices - "there is replicated data present on this
> set of devices". We also have full counters of how much data is present
> replicated across each set of devices, so the superblock section is just
> a truncated version of the accounting - "data exists on these devices",
> instead of saying how much.
>
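The delta-and-version scheme discussed in the quoted comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the series: `struct acct_key`, `acct_version()` and `acct_accumulate()` are hypothetical names, the packing of journal sequence number and offset into one u64 is an assumed encoding, and the rule "skip a delta whose version is not newer than the accumulated key" is an assumed reading of how replay decides which updates have already been done.

```c
#include <assert.h>
#include <stdint.h>

#define ACCT_MAX_COUNTERS 3

/* Hypothetical stand-in for an accounting key; not the on-disk format. */
struct acct_key {
	uint64_t pos;			/* stands in for disk_accounting_key/bpos */
	uint64_t version;		/* assigned at transaction commit time */
	unsigned nr;			/* counters in use, 1..3 */
	int64_t  d[ACCT_MAX_COUNTERS];	/* deltas, or totals once accumulated */
};

/*
 * Assumed encoding: journal sequence number in the high 32 bits, offset
 * within the journal buffer in the low 32 bits, so versions order the
 * same way the updates were committed.
 */
static uint64_t acct_version(uint64_t journal_seq, uint32_t journal_offset)
{
	return (journal_seq << 32) | journal_offset;
}

/*
 * Accumulate a delta key into the key already stored at the same position.
 * Returns 1 if the delta was applied, 0 if its version shows it was already
 * applied (the journal replay case).
 */
static int acct_accumulate(struct acct_key *dst, const struct acct_key *delta)
{
	assert(dst->pos == delta->pos);
	assert(delta->nr >= 1 && delta->nr <= ACCT_MAX_COUNTERS);

	if (delta->version <= dst->version)
		return 0;

	for (unsigned i = 0; i < delta->nr; i++)
		dst->d[i] += delta->d[i];
	if (dst->nr < delta->nr)
		dst->nr = delta->nr;
	dst->version = delta->version;
	return 1;
}
```

Applying the same delta twice leaves the totals unchanged the second time, which is the property replay needs when it cannot otherwise tell whether a journalled update already reached the btree.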