[17/21] fs: xfs: iomap atomic write support

Message ID 20230929102726.2985188-18-john.g.garry@oracle.com
State New
Series: block atomic writes

Commit Message

John Garry Sept. 29, 2023, 10:27 a.m. UTC
  Ensure that when creating a mapping we adhere to all the atomic
write rules.

We check that the mapping covers the complete range of the write to ensure
that we'll be creating just a single mapping.

Currently the minimum granularity is the FS block size, but it should be
possible to support a smaller granularity in future.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iomap.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)
  

Comments

Christoph Hellwig Nov. 9, 2023, 3:26 p.m. UTC | #1
On Fri, Sep 29, 2023 at 10:27:22AM +0000, John Garry wrote:
> Ensure that when creating a mapping that we adhere to all the atomic
> write rules.
> 
> We check that the mapping covers the complete range of the write to ensure
> that we'll be just creating a single mapping.
> 
> Currently the minimum granularity is the FS block size, but it should be
> possible to support a smaller granularity in future.

I really dislike how this forces aligned allocations.  Aligned
allocations are a nice optimization to offload some of the work
to the storage hardware/firmware, but we need to support it in general.
And I think with out of place writes into the COW fork, and atomic
transactions to swap it in, we can do that pretty easily.

That should also allow us to get rid of the horrible forcealign mode,
as we can still try to align if possible and just fall back to the
out of place writes.
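
As an illustration of the idea, here is a minimal userspace analogue of
out-of-place writes plus an atomic swap (a sketch only; it assumes nothing
about the actual COW fork or transaction code): the new data is written to
a freshly allocated block first, and a single pointer swap stands in for
the extent swap, so a reader sees either the old or the new block, never a
mix.

/*
 * Userspace analogue (not XFS code) of out-of-place write + atomic swap:
 * write the new data somewhere else first, then make it visible with a
 * single atomic swap of the "mapping".
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLKSZ	4096

static _Atomic(char *) mapping;			/* the "extent" readers see */

static void atomic_overwrite(const char *data, size_t len)
{
	char *newblk = calloc(1, BLKSZ);	/* allocate out of place */
	char *oldblk;

	memcpy(newblk, data, len < BLKSZ ? len : BLKSZ);
	/* commit point: one swap stands in for the extent swap */
	oldblk = atomic_exchange(&mapping, newblk);
	free(oldblk);				/* old data freed only after the swap */
}

int main(void)
{
	atomic_store(&mapping, calloc(1, BLKSZ));
	atomic_overwrite("new contents", sizeof("new contents"));
	printf("%s\n", atomic_load(&mapping));
	return 0;
}

The real thing would presumably allocate an extent in the COW fork and
remap it in a transaction, but the ordering is the same: the new data
exists in full before the mapping flips.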
  
John Garry Nov. 10, 2023, 10:42 a.m. UTC | #2
On 09/11/2023 15:26, Christoph Hellwig wrote:
> On Fri, Sep 29, 2023 at 10:27:22AM +0000, John Garry wrote:
>> Ensure that when creating a mapping that we adhere to all the atomic
>> write rules.
>>
>> We check that the mapping covers the complete range of the write to ensure
>> that we'll be just creating a single mapping.
>>
>> Currently the minimum granularity is the FS block size, but it should be
>> possible to support a smaller granularity in future.
> I really dislike how this forces aligned allocations.  Aligned
> allocations are a nice optimization to offload some of the work
> to the storage hardware/firmware, but we need to support it in general.
> And I think with out of place writes into the COW fork, and atomic
> transactions to swap it in we can do that pretty easily.
> 
> That should also allow us to get rid of the horrible forcealign mode,
> as we can still try to align if possible and just fall back to the
> out of place writes.
> 
> 

How could we try to align? Do you mean that we try to align up to some 
stage in the block allocator search? That seems like some middle ground 
between no alignment and forcealign.

And what would we be aligning to?

Thanks,
John
  
John Garry Nov. 28, 2023, 8:56 a.m. UTC | #3
Hi Christoph,

>>>
>>> Currently the minimum granularity is the FS block size, but it should be
>>> possible to support a smaller granularity in future.
>> I really dislike how this forces aligned allocations.  Aligned
>> allocations are a nice optimization to offload some of the work
>> to the storage hardware/firmware, but we need to support it in general.
>> And I think with out of place writes into the COW fork, and atomic
>> transactions to swap it in we can do that pretty easily.
>>
>> That should also allow us to get rid of the horrible forcealign mode,
>> as we can still try to align if possible and just fall back to the
>> out of place writes.
>>

Can you try to explain your idea a bit more? This is blocking us.

Are you suggesting some sort of hybrid between the atomic write series 
you had a few years ago and this solution?

To me that would be continuing with the following:
- per-IO RWF_ATOMIC (and not the O_ATOMIC semantics where nothing is written 
until some data sync)
- writes must be a power-of-two and at a naturally-aligned offset
- relying on atomic write HW support always

But for extents which are misaligned, we CoW to a new extent? I suppose 
we would align that extent to the alignment of the write (i.e. the length 
of the write).
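
To make that per-IO model concrete, here is a minimal sketch of a caller.
It assumes the RWF_ATOMIC flag added earlier in this series; the fallback
value below is only a placeholder for headers that do not yet define it.

/*
 * Sketch of the per-IO model: a single power-of-two sized, naturally
 * aligned direct write issued with RWF_ATOMIC.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* assumption: take the real value from uapi */
#endif

int main(int argc, char **argv)
{
	struct iovec iov;
	void *buf;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDWR | O_DIRECT);
	if (fd < 0 || posix_memalign(&buf, 4096, 16384))
		return 1;

	memset(buf, 0xab, 16384);
	iov.iov_base = buf;
	iov.iov_len = 16384;		/* power-of-two length */

	/* naturally aligned offset; all-or-nothing if the write completes */
	if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) != 16384) {
		perror("pwritev2(RWF_ATOMIC)");
		return 1;
	}
	close(fd);
	return 0;
}

On a kernel without the flag the call should simply be rejected rather
than silently falling back to a tearable write.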

BTW, we also have rtvol support which does not use forcealign as it 
already can guarantee alignment, but still does rely on the same 
principle of requiring alignment - would you want CoW support there also?

Thanks,
John
  
Christoph Hellwig Nov. 28, 2023, 1:56 p.m. UTC | #4
On Tue, Nov 28, 2023 at 08:56:37AM +0000, John Garry wrote:
> Are you suggesting some sort of hybrid between the atomic write series you 
> had a few years ago and this solution?

Very roughly, yes.

> To me that would be continuing with the following:
> - per-IO RWF_ATOMIC (and not O_ATOMIC semantics of nothing is written until 
> some data sync)

Yes.

> - writes must be a power-of-two and at a naturally-aligned offset

Where offset is the offset in the file?  It would not require it.  You
probably want to do it for optimal performance, but requiring it
feels rather limited.

> - relying on atomic write HW support always

And I think that's where we have different opinions.  I think the hw
offload is a nice optimization and we should use it wherever we can.
But building the entire userspace API around it feels like a mistake.

> BTW, we also have rtvol support which does not use forcealign as it already 
> can guarantee alignment, but still does rely on the same principle of 
> requiring alignment - would you want CoW support there also?

Upstream doesn't have out of place write support for the RT subvolume
yet.  But Darrick has a series for it and we're actively working on
upstreaming it.
  
John Garry Nov. 28, 2023, 5:42 p.m. UTC | #5
On 28/11/2023 13:56, Christoph Hellwig wrote:
> On Tue, Nov 28, 2023 at 08:56:37AM +0000, John Garry wrote:
>> Are you suggesting some sort of hybrid between the atomic write series you
>> had a few years ago and this solution?
> Very roughly, yes.
> 
>> To me that would be continuing with the following:
>> - per-IO RWF_ATOMIC (and not O_ATOMIC semantics of nothing is written until
>> some data sync)
> Yes.
> 
>> - writes must be a power-of-two and at a naturally-aligned offset
> Where offset is offset in the file? 

ok, fine, it would not be required for XFS with CoW. Some concerns still:
a. device atomic write boundary, if any
b. other FSes which do not have CoW support. ext4 is already being used 
for "atomic writes" in the field - see dubious amazon torn-write prevention.

About b., we could add the pow-of-2 and file offset alignment 
requirement for other FSes, but then we would need to add some method to 
advertise that restriction.

> It would not require it.  You
> probably want to do it for optimal performance, but requiring it
> feels rather limited.
> 
>> - relying on atomic write HW support always
> And I think that's where we have different opinions.

I'm just trying to understand your idea and that is not necessarily my 
final opinion.

>  I think the hw
> offload is a nice optimization and we should use it wherever we can.

Sure, but to me it is a concern that we have 2x paths to make robust: 
a. offload via HW, which may involve CoW; b. no HW support, i.e. CoW always.

And for no HW support, if we don't follow the O_ATOMIC model of 
committing nothing until a SYNC is issued, we would allocate, write, and 
later free a new extent for each write, right?

> But building the entire userspace API around it feels like a mistake.
> 

ok, but FWIW it works for the use cases which we know of.

>> BTW, we also have rtvol support which does not use forcealign as it already
>> can guarantee alignment, but still does rely on the same principle of
>> requiring alignment - would you want CoW support there also?
> Upstream doesn't have out of place write support for the RT subvolume
> yet.  But Darrick has a series for it and we're actively working on
> upstreaming it.
Yeah, I thought that I heard this.

Thanks,
John
  
Martin K. Petersen Nov. 29, 2023, 2:45 a.m. UTC | #6
> b. other FSes which do not have CoW support. ext4 is already being
> used for "atomic writes" in the field

We also need raw block device access to work within the constraints
required by the hardware.

>> probably want to do it for optimal performance, but requiring it
>> feels rather limited.

The application developers we are working with generally prefer an error
when things are not aligned properly. Predictable performance is key.
Removing the performance variability of doing double writes is the
reason for supporting atomics in the first place.

I think there is value in providing a more generic (file-centric) atomic
user API. And I think the I/O stack plumbing we provide would be useful
in supporting such an endeavor. But I am not convinced that atomic
operations in general should be limited to the couple of filesystems
that can do CoW.
  
Christoph Hellwig Dec. 4, 2023, 1:45 p.m. UTC | #7
On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
> ok, fine, it would not be required for XFS with CoW. Some concerns still:
> a. device atomic write boundary, if any
> b. other FSes which do not have CoW support. ext4 is already being used for 
> "atomic writes" in the field - see dubious amazon torn-write prevention.

What is the 'dubious amazon torn-write prevention'?

> About b., we could add the pow-of-2 and file offset alignment requirement 
> for other FSes, but then need to add some method to advertise that 
> restriction.

We really need a better way to communicate I/O limitations anyway.
Something like XFS_IOC_DIOINFO on steroids.
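
For reference, today's XFS_IOC_DIOINFO reports only direct-I/O alignment,
so something similar but richer would be needed. A sketch of the existing
ioctl, assuming the xfs/xfs_fs.h header from xfsprogs is installed:

/* What XFS_IOC_DIOINFO reports today: direct-I/O alignment only. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs_fs.h>		/* struct dioattr, XFS_IOC_DIOINFO */

int main(int argc, char **argv)
{
	struct dioattr da;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, XFS_IOC_DIOINFO, &da) < 0) {
		perror("XFS_IOC_DIOINFO");
		return 1;
	}
	printf("mem align %u, min io %u, max io %u\n",
	       da.d_mem, da.d_miniosz, da.d_maxiosz);
	close(fd);
	return 0;
}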

> Sure, but to me it is a concern that we have 2x paths to make robust a. 
> offload via hw, which may involve CoW b. no HW support, i.e. CoW always

Relying just on the hardware seems very limited, especially as there is
plenty of hardware that won't guarantee anything larger than 4k, and
plenty of NVMe hardware that has some other small limit like 32k
because it doesn't support multiple atomicity modes.

> And for no HW support, if we don't follow the O_ATOMIC model of committing 
> nothing until a SYNC is issued, would we allocate, write, and later free a 
> new extent for each write, right?

Yes. Then again if you do data journalling you do that anyway, and as
one little project I'm doing right now shows, data journalling is
often the fastest thing we can do for very small writes.
  
John Garry Dec. 4, 2023, 3:19 p.m. UTC | #8
On 04/12/2023 13:45, Christoph Hellwig wrote:
> On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
>> ok, fine, it would not be required for XFS with CoW. Some concerns still:
>> a. device atomic write boundary, if any
>> b. other FSes which do not have CoW support. ext4 is already being used for
>> "atomic writes" in the field - see dubious amazon torn-write prevention.
> 
> What is the 'dubious amazon torn-write prevention'?

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html

AFAICS, this is without any kernel changes, so no guarantee against unwanted 
splitting or merging of bios.

Anyway, there will still be !CoW FSes which people want to support.

> 
>> About b., we could add the pow-of-2 and file offset alignment requirement
>> for other FSes, but then need to add some method to advertise that
>> restriction.
> 
> We really need a better way to communicate I/O limitations anyway.
> Something like XFS_IOC_DIOINFO on steroids.
> 
>> Sure, but to me it is a concern that we have 2x paths to make robust a.
>> offload via hw, which may involve CoW b. no HW support, i.e. CoW always
> 
> Relying just on the hardware seems very limited, especially as there is
> plenty of hardware that won't guarantee anything larger than 4k, and
> plenty of NVMe hardware that has some other small limit like 32k
> because it doesn't support multiple atomicity modes.

So what would you propose as the next step? Would it be to first achieve 
atomic write support for XFS with HW support + CoW to ensure contiguous 
extents (and without XFS forcealign)?

> 
>> And for no HW support, if we don't follow the O_ATOMIC model of committing
>> nothing until a SYNC is issued, would we allocate, write, and later free a
>> new extent for each write, right?
> 
> Yes. Then again if you do data journalling you do that anyway, and as
> one little project I'm doing right now shows that data journalling is
> often the fastest thing we can do for very small writes.

Ignoring FSes, then how is this supposed to work for block devices? We 
just always need HW support, right?

Thanks,
John
  
Christoph Hellwig Dec. 4, 2023, 3:39 p.m. UTC | #9
On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> On 04/12/2023 13:45, Christoph Hellwig wrote:
>> On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
>>> ok, fine, it would not be required for XFS with CoW. Some concerns still:
>>> a. device atomic write boundary, if any
>>> b. other FSes which do not have CoW support. ext4 is already being used for
>>> "atomic writes" in the field - see dubious amazon torn-write prevention.
>>
>> What is the 'dubious amazon torn-write prevention'?
>
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
>
> AFAICS, this is without any kernel changes, so no guarantee against unwanted 
> splitting or merging of bios.
>
> Anyway, there will still be !CoW FSes which people want to support.

Ugh, so they badly reimplement NVMe atomic write support and use it
without software stack enablement.  Calling it dubious is way too
gentle.

>> Relying just on the hardware seems very limited, especially as there is
>> plenty of hardware that won't guarantee anything larger than 4k, and
>> plenty of NVMe hardware that has some other small limit like 32k
>> because it doesn't support multiple atomicity modes.
>
> So what would you propose as the next step? Would it be to first achieve 
> atomic write support for XFS with HW support + CoW to ensure contiguous 
> extents (and without XFS forcealign)?

I think the very first priority is just block device support without
any fs enablement.  We just need to make sure the API isn't too limited
for additional use cases.

> Ignoring FSes, then how is this supposed to work for block devices? We just 
> always need HW support, right?

Yes.
  
John Garry Dec. 4, 2023, 6:06 p.m. UTC | #10
On 04/12/2023 15:39, Christoph Hellwig wrote:
>> So what would you propose as the next step? Would it be to first achieve
>> atomic write support for XFS with HW support + CoW to ensure contiguous
>> extents (and without XFS forcealign)?
> I think the very first priority is just block device support without
> any fs enablement.  We just need to make sure the API isn't too limited
> for additional use cases.

Sounds ok
  
Theodore Ts'o Dec. 5, 2023, 4:55 a.m. UTC | #11
On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> > 
> > What is the 'dubious amazon torn-write prevention'?
> 
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
> 
> AFAICS, this is without any kernel changes, so no guarantee against unwanted
> splitting or merging of bios.

Well, more than one company has audited the kernel paths, and it turns
out that for selected kernel versions, after doing desk-check
verification of the relevant kernel paths, as well as experimental
verification via testing to try to find torn writes in the kernel, we
can make it safe for specific kernel versions which might be used in
hosted MySQL instances where we control the kernel, the mysql server,
and the emulated block device (and we know the database is doing
Direct I/O writes --- this won't work for PostgreSQL).  I gave a talk
about this at Google I/O Next '18, five years ago[1].

[1] https://www.youtube.com/watch?v=gIeuiGg-_iw

Given the performance gains (see the comparison in the talk at time
19:31 and at 29:57), it's quite compelling. 

Of course, I wouldn't recommend this approach for a naive sysadmin,
since most database administrators won't know how to audit kernel code
(see the discussion at time 35:10 of the video), and reverify the
entire software stack before every kernel upgrade.  The challenge is
how to do this safely.

The fact remains that both Amazon's EBS and Google's Persistent Disk
products are implemented in such a way that writes will not be torn
below the virtual machine, and the guarantees are in fact quite a bit
stronger than what we will probably end up advertising via NVMe and/or
SCSI.  It wouldn't surprise me if this is the case (or could be made
to be the case) for Oracle Cloud as well.

The question is how to make this guarantee so that the kernel knows
when various cloud-provided block devices do provide these greater
guarantees, and then how to make it an architected feature, as
opposed to a happy implementation detail that has to be verified at
every kernel upgrade.

Cheers,

						- Ted
  
John Garry Dec. 5, 2023, 11:09 a.m. UTC | #12
On 05/12/2023 04:55, Theodore Ts'o wrote:
>> AFAICS, this is without any kernel changes, so no guarantee against unwanted
>> splitting or merging of bios.
> Well, more than one company has audited the kernel paths, and it turns
> out that for selected Kernel versions, after doing desk-check
> verification of the relevant kernel paths, as well as experimental
> verification via testing to try to find torn writes in the kernel, we
> can make it safe for specific kernel versions which might be used in
> hosted MySQL instances where we control the kernel, the mysql server,
> and the emulated block device (and we know the database is doing
> Direct I/O writes --- this won't work for PostgreSQL).  I gave a talk
> about this at Google I/O Next '18, five years ago[1].
> 
> [1] https://www.youtube.com/watch?v=gIeuiGg-_iw
> 
> Given the performance gains (see the comparison in the talk at time
> 19:31 and at 29:57), it's quite compelling.
> 
> Of course, I wouldn't recommend this approach for a naive sysadmin,
> since most database administrators won't know how to audit kernel code
> (see the discussion at time 35:10 of the video), and reverify the
> entire software stack before every kernel upgrade.

Sure

>  The challenge is
> how to do this safely.

Right, and that is why I would be concerned about advertising torn-write 
protection support when someone has not gone through the auditing and 
verification effort to ensure that this never happens in their software 
stack.

> 
> The fact remains that both Amazon's EBS and Google's Persistent Disk
> products are implemented in such a way that writes will not be torn
> below the virtual machine, and the guarantees are in fact quite a bit
> stronger than what we will probably end up advertising via NVMe and/or
> SCSI.  It wouldn't surprise me if this is the case (or could be made
> to be the case) For Oracle Cloud as well.
> 
> The question is how to make this guarantee so that the kernel knows
>> when various cloud-provided block devices do provide these greater
>> guarantees, and then how to make it an architected feature, as
> opposed to a happy implementation detail that has to be verified at
> every kernel upgrade.

The kernel can only judge atomic write support from what the HW product 
data tells us, so cloud-provided block devices need to provide that 
information as best as possible if emulating some storage technology.

Thanks,
John
  
Ming Lei Dec. 5, 2023, 1:59 p.m. UTC | #13
On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> On 04/12/2023 13:45, Christoph Hellwig wrote:
> > On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
> > > ok, fine, it would not be required for XFS with CoW. Some concerns still:
> > > a. device atomic write boundary, if any
> > > b. other FSes which do not have CoW support. ext4 is already being used for
> > > "atomic writes" in the field - see dubious amazon torn-write prevention.
> > 
> > What is the 'dubious amazon torn-write prevention'?
> 
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
> 
> AFAICS, this is without any kernel changes, so no guarantee against unwanted
> splitting or merging of bios.
> 
> Anyway, there will still be !CoW FSes which people want to support.
> 
> > 
> > > About b., we could add the pow-of-2 and file offset alignment requirement
> > > for other FSes, but then need to add some method to advertise that
> > > restriction.
> > 
> > We really need a better way to communicate I/O limitations anyway.
> > Something like XFS_IOC_DIOINFO on steroids.
> > 
> > > Sure, but to me it is a concern that we have 2x paths to make robust a.
> > > offload via hw, which may involve CoW b. no HW support, i.e. CoW always
> > 
> > Relying just on the hardware seems very limited, especially as there is
> > plenty of hardware that won't guarantee anything larger than 4k, and
> > plenty of NVMe hardware that has some other small limit like 32k
> > because it doesn't support multiple atomicity modes.
> 
> So what would you propose as the next step? Would it be to first achieve
> atomic write support for XFS with HW support + CoW to ensure contiguous
> extents (and without XFS forcealign)?
> 
> > 
> > > And for no HW support, if we don't follow the O_ATOMIC model of committing
> > > nothing until a SYNC is issued, would we allocate, write, and later free a
> > > new extent for each write, right?
> > 
> > Yes. Then again if you do data journalling you do that anyway, and as
> > one little project I'm doing right now shows that data journalling is
> > often the fastest thing we can do for very small writes.
> 
> Ignoring FSes, then how is this supposed to work for block devices? We just
> always need HW support, right?

It looks like the HW support could be minimized, just like what Google and
Amazon did: a 16KB physical block size with proper queue limit settings.

Now it seems easy to make such a device with ublk-loop by:

- use one backing disk with 16KB/32KB/.. physical block size
- expose proper physical bs & chunk_sectors & max sectors queue limit

Then any 16KB-aligned direct WRITE with N*16KB length (N in [1, 8] with 256
chunk_sectors) can be an atomic write.
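
To make the arithmetic concrete, a sketch of that constraint (assuming the
16KB physical block size and 256 chunk_sectors, i.e. 128KB chunks, from the
example above; the write also must not straddle a chunk boundary, or the
bio gets split):

/*
 * Sketch of the constraint: 16KB physical block size, chunk_sectors = 256
 * (128KB chunks).  A direct write qualifies only if it is 16KB aligned,
 * a multiple of 16KB, and falls entirely within one chunk.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PHYS_BS		(16 * 1024)	/* 16KB physical block size */
#define CHUNK_BYTES	(256 * 512)	/* chunk_sectors = 256 -> 128KB */

static bool untearable(uint64_t off, uint64_t len)
{
	if (!len || off % PHYS_BS || len % PHYS_BS || len > CHUNK_BYTES)
		return false;
	/* first and last byte must fall in the same 128KB chunk */
	return off / CHUNK_BYTES == (off + len - 1) / CHUNK_BYTES;
}

int main(void)
{
	printf("%d\n", untearable(0, 8 * PHYS_BS));		/* 1: N = 8 */
	printf("%d\n", untearable(112 * 1024, 16 * 1024));	/* 1: fits in chunk 0 */
	printf("%d\n", untearable(112 * 1024, 32 * 1024));	/* 0: crosses 128KB */
	return 0;
}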



Thanks,
Ming
  

Patch

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 70fe873951f3..3424fcfc04f5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -783,6 +783,7 @@  xfs_direct_write_iomap_begin(
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_sb		*m_sb = &mp->m_sb;
 	struct xfs_bmbt_irec	imap, cmap;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
@@ -814,6 +815,41 @@  xfs_direct_write_iomap_begin(
 	if (error)
 		goto out_unlock;
 
+	if (flags & IOMAP_ATOMIC_WRITE) {
+		xfs_filblks_t unit_min_fsb, unit_max_fsb;
+
+		xfs_ip_atomic_write_attr(ip, &unit_min_fsb, &unit_max_fsb);
+
+		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
+			error = -EIO;
+			goto out_unlock;
+		}
+
+		if (offset % m_sb->sb_blocksize ||
+		    length % m_sb->sb_blocksize) {
+			error = -EIO;
+			goto out_unlock;
+		}
+
+		if (imap.br_blockcount == unit_min_fsb ||
+		    imap.br_blockcount == unit_max_fsb) {
+			/* min and max must be a power-of-2 */
+		} else if (imap.br_blockcount < unit_min_fsb ||
+			   imap.br_blockcount > unit_max_fsb) {
+			error = -EIO;
+			goto out_unlock;
+		} else if (!is_power_of_2(imap.br_blockcount)) {
+			error = -EIO;
+			goto out_unlock;
+		}
+
+		if (imap.br_startoff &&
+		    imap.br_startoff % imap.br_blockcount) {
+			error =  -EIO;
+			goto out_unlock;
+		}
+	}
+
 	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
 		error = -EAGAIN;
 		if (flags & IOMAP_NOWAIT)