[0/2] btrfs: zoned: kick reclaim earlier on fast zoned devices

Message ID 20240122-reclaim-fix-v1-0-761234a6d005@wdc.com

Message

Johannes Thumshirn Jan. 22, 2024, 10:51 a.m. UTC
We had a report from the field where filling a zoned drive with a single
file sized at 60% of the drive's capacity and then overwriting that file
results in ENOSPC.

If said drive is fast and small enough, the problem can be easily
triggered, as both reclaim of dirty block-groups and deletion of unused
block-groups only happen at transaction commit time. But if the whole test
completes faster than transaction commits happen, we unnecessarily run
out of usable space on a zoned drive.

This can easily be reproduced by the following fio snippet:
fio --name=foo --filename=$TEST/foo --size=$60_PERCENT_OF_DRIVE --rw=write\
	   --loops=2
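
Note that $60_PERCENT_OF_DRIVE above is a placeholder and not a valid shell
variable name. A minimal sketch of how the size could be derived, assuming
the zoned device is passed as a block device path and that `blockdev` from
util-linux is available. The helper names `sixty_percent_of` and
`run_reproducer` are hypothetical and not part of this series:

```shell
#!/bin/sh
# Sketch only: helper names and arguments are assumptions for illustration.

# Print 60% of a byte count (integer arithmetic, rounded down).
sixty_percent_of() {
    spo_total_bytes=$1
    echo $(( spo_total_bytes * 60 / 100 ))
}

# Write a file of 60% of the device size twice, as in the fio snippet above.
#   $1: zoned block device backing the filesystem (e.g. /dev/nullb0)
#   $2: mount point of the btrfs filesystem under test
run_reproducer() {
    rr_dev=$1
    rr_test_dir=$2
    # blockdev --getsize64 reports the device size in bytes.
    rr_size=$(sixty_percent_of "$(blockdev --getsize64 "$rr_dev")")
    fio --name=foo --filename="$rr_test_dir/foo" --size="$rr_size" \
        --rw=write --loops=2
}
```

It would then be invoked as `run_reproducer <device> <mountpoint>` on an
already mounted zoned btrfs filesystem.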

A fstests testcase for this issue will be sent as well.

---
Johannes Thumshirn (2):
      btrfs: zoned: use rcu list for iterating devices to collect stats
      btrfs: zoned: wake up cleaner sooner if needed

 fs/btrfs/free-space-cache.c | 6 ++++++
 fs/btrfs/zoned.c            | 6 +++---
 2 files changed, 9 insertions(+), 3 deletions(-)
---
base-commit: d9796b728dcbf25e0190e542be33902222098fac
change-id: 20240122-reclaim-fix-1fcae9c27fc8

Best regards,
  

Comments

Johannes Thumshirn Jan. 22, 2024, 4:04 p.m. UTC | #1
On 22.01.24 11:51, Johannes Thumshirn wrote:
> We had a report from the field where filling a zoned drive with a single
> file sized at 60% of the drive's capacity and then overwriting that file
> results in ENOSPC.
> 
> If said drive is fast and small enough, the problem can be easily
> triggered, as both reclaim of dirty block-groups and deletion of unused
> block-groups only happen at transaction commit time. But if the whole test
> completes faster than transaction commits happen, we unnecessarily run
> out of usable space on a zoned drive.
> 
> This can easily be reproduced by the following fio snippet:
> fio --name=foo --filename=$TEST/foo --size=$60_PERCENT_OF_DRIVE --rw=write\
> 	   --loops=2
> 
> A fstests testcase for this issue will be sent as well.

Please disregard this series. I'm stupid and had lockdep active during 
testing, so the timing was messed up and reclaim had a chance to kick in.