[10/21] block: Add fops atomic write support

Message ID 20230929102726.2985188-11-john.g.garry@oracle.com
State New
Series block atomic writes

Commit Message

John Garry Sept. 29, 2023, 10:27 a.m. UTC
  Add support for atomic writes, as follows:
- Ensure that the IO follows all the atomic writes rules, like must be
  naturally aligned
- Set REQ_ATOMIC

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/fops.c | 42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)
  

Comments

Bart Van Assche Sept. 29, 2023, 5:51 p.m. UTC | #1
On 9/29/23 03:27, John Garry wrote:
> +	if (pos % atomic_write_unit_min_bytes)
> +		return false;
> +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> +		return false;
> +	if (!is_power_of_2(iov_iter_count(iter)))
> +		return false;
[ ... ]
> +	if (pos % iov_iter_count(iter))
> +		return false;

Where do these rules come from? Is there any standard that requires
any of the above?

Thanks,

Bart.
  
John Garry Oct. 2, 2023, 10:10 a.m. UTC | #2
On 29/09/2023 18:51, Bart Van Assche wrote:
> On 9/29/23 03:27, John Garry wrote:
>> +    if (pos % atomic_write_unit_min_bytes)
>> +        return false;
>> +    if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
>> +        return false;
>> +    if (!is_power_of_2(iov_iter_count(iter)))
>> +        return false;
> [ ... ]
>> +    if (pos % iov_iter_count(iter))
>> +        return false;
> 
> Where do these rules come from? Is there any standard that requires
> any of the above?

SCSI and NVMe have slightly different atomic write semantics, and the 
rules are created to work for both.

In addition, the rules are related to FS extent alignment.

Note that for simplicity and consistency we use the same rules for 
regular files as for bdevs.

These are the rules as coded and where each of them comes from:

 > +	if (!atomic_write_unit_min_bytes)
 > +		return false;

If atomic_write_unit_min_bytes == 0, then we just don't support atomic 
writes.

 > +	if (pos % atomic_write_unit_min_bytes)
 > +		return false;

See later rules.

 > +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
 > +		return false;

For SCSI, there is an atomic write granularity, which dictates 
atomic_write_unit_min_bytes. So here we need to ensure that the length 
is a multiple of this value.

 > +	if (!is_power_of_2(iov_iter_count(iter)))
 > +		return false;

This rule comes from FS block alignment and NVMe atomic boundary.

FSes (XFS) have discontiguous extents. We need to ensure that an atomic 
write does not cross discontiguous extents. To do this we ensure extent 
length and alignment and limit atomic_write_unit_max_bytes to that.

For NVMe, an atomic write boundary is a boundary in LBA space which an 
atomic write should not cross. We limit atomic_write_unit_max_bytes such 
that it is evenly divisible into this atomic write boundary.

To ensure that the write does not cross these alignment boundaries we 
say that it must be naturally aligned and a power-of-2 in length.

We may be able to relax this rule but I am not sure it buys us anything 
- typically we want to be writing a 64KB block aligned to 64KB, for example.

 > +	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
 > +		return false;

We just can't exceed this length.

 > +	if (pos % iov_iter_count(iter))
 > +		return false;

As above, ensure naturally aligned.
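
To make these checks concrete, here is a small userspace-style sketch (not 
part of the patch; the 4KB/64KB limits are made-up example values) which 
mirrors the alignment/length checks and shows a few accepted/rejected 
combinations:

#include <stdbool.h>
#include <stdint.h>

/* Example limits only: unit_min = 4KB, unit_max = 64KB. */
#define EX_UNIT_MIN	(4U << 10)
#define EX_UNIT_MAX	(64U << 10)

static bool ex_is_power_of_2(uint64_t n)
{
	return n && !(n & (n - 1));
}

/* Mirrors the blkdev_atomic_write_valid() checks for the example limits. */
static bool ex_atomic_write_valid(uint64_t pos, uint64_t len)
{
	if (pos % EX_UNIT_MIN)		/* offset is a multiple of unit_min */
		return false;
	if (len % EX_UNIT_MIN)		/* length is a multiple of unit_min */
		return false;
	if (!ex_is_power_of_2(len))	/* length is a power-of-2           */
		return false;
	if (len > EX_UNIT_MAX)		/* length does not exceed unit_max  */
		return false;
	if (pos % len)			/* write is naturally aligned       */
		return false;
	return true;
}

/*
 * ex_atomic_write_valid(64 << 10, 64 << 10) -> true  (64KB at offset 64KB)
 * ex_atomic_write_valid(32 << 10, 64 << 10) -> false (not naturally aligned)
 * ex_atomic_write_valid(0,        12 << 10) -> false (12KB not a power-of-2)
 */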

Thanks,
John
  
Bart Van Assche Oct. 2, 2023, 7:12 p.m. UTC | #3
On 10/2/23 03:10, John Garry wrote:
> On 29/09/2023 18:51, Bart Van Assche wrote:
>> On 9/29/23 03:27, John Garry wrote:
>  > +    if (pos % atomic_write_unit_min_bytes)
>  > +        return false;
> 
> See later rules.

Is atomic_write_unit_min_bytes always equal to the logical block size?
If so, can the above test be left out?

>  > +    if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
>  > +        return false;
> 
> For SCSI, there is an atomic write granularity, which dictates 
> atomic_write_unit_min_bytes. So here we need to ensure that the length 
> is a multiple of this value.

Are there any SCSI devices that we care about that report an ATOMIC 
TRANSFER LENGTH GRANULARITY that is larger than a single logical block?
I'm wondering whether we really have to support such devices.

>  > +    if (!is_power_of_2(iov_iter_count(iter)))
>  > +        return false;
> 
> This rule comes from FS block alignment and NVMe atomic boundary.
> 
> FSes (XFS) have discontiguous extents. We need to ensure that an atomic 
> write does not cross discontiguous extents. To do this we ensure extent 
> length and alignment and limit atomic_write_unit_max_bytes to that.
> 
> For NVMe, an atomic write boundary is a boundary in LBA space which an 
> atomic write should not cross. We limit atomic_write_unit_max_bytes such 
> that it is evenly divisible into this atomic write boundary.
> 
> To ensure that the write does not cross these alignment boundaries we 
> say that it must be naturally aligned and a power-of-2 in length.
> 
> We may be able to relax this rule but I am not sure it buys us anything 
> - typically we want to be writing a 64KB block aligned to 64KB, for 
> example.

It seems to me that the requirement is_power_of_2(iov_iter_count(iter))
is necessary for some filesystems but not for all filesystems. 
Restrictions that are specific to a single filesystem (XFS) should not 
occur in code that is intended to be used by all filesystems 
(blkdev_atomic_write_valid()).

Thanks,

Bart.
  
Martin K. Petersen Oct. 3, 2023, 12:48 a.m. UTC | #4
Bart,

> Are there any SCSI devices that we care about that report an ATOMIC
> TRANSFER LENGTH GRANULARITY that is larger than a single logical
> block?

Yes.

Note that code path used inside a storage device to guarantee atomicity
of an entire I/O may be substantially different from the code path which
only offers an incremental guarantee at a single logical or physical
block level (to the extent that those guarantees are offered at all but
that's a different kettle of fish).

> I'm wondering whether we really have to support such devices.

Yes.
  
John Garry Oct. 3, 2023, 8:37 a.m. UTC | #5
On 02/10/2023 20:12, Bart Van Assche wrote:
>>  > +    if (!is_power_of_2(iov_iter_count(iter)))
>>  > +        return false;
>>
>> This rule comes from FS block alignment and NVMe atomic boundary.
>>
>> FSes (XFS) have discontiguous extents. We need to ensure that an 
>> atomic write does not cross discontiguous extents. To do this we 
>> ensure extent length and alignment and limit 
>> atomic_write_unit_max_bytes to that.
>>
>> For NVMe, an atomic write boundary is a boundary in LBA space which an 
>> atomic write should not cross. We limit atomic_write_unit_max_bytes 
>> such that it is evenly divisible into this atomic write boundary.
>>
>> To ensure that the write does not cross these alignment boundaries we 
>> say that it must be naturally aligned and a power-of-2 in length.
>>
>> We may be able to relax this rule but I am not sure it buys us 
>> anything - typically we want to be writing a 64KB block aligned to 
>> 64KB, for example.
> 
> It seems to me that the requirement is_power_of_2(iov_iter_count(iter))
> is necessary for some filesystems but not for all filesystems. 
> Restrictions that are specific to a single filesystem (XFS) should not 
> occur in code that is intended to be used by all filesystems 
> (blkdev_atomic_write_valid()).

I don't think that is_power_of_2(write length) is specific to XFS. It is 
just a simple mathematical method to ensure we always obey the length and 
alignment requirements.

Furthermore, if ext4 wants to support atomic writes, for example, then 
it will probably base that on bigalloc. And bigalloc is power-of-2 based.

As for the rules, current proposal is:
- atomic_write_unit_min and atomic_write_unit_max are power-of-2
- write needs to be at a naturally aligned file offset
- write length needs to be a power-of-2 between atomic_write_unit_min 
and atomic_write_unit_max, inclusive

Those could be relaxed to:
- atomic_write_unit_min and atomic_write_unit_max are power-of-2
- write length needs to be a multiple of atomic_write_unit_min and at most 
atomic_write_unit_max
- write needs to be at an offset aligned to atomic_write_unit_min
- write cannot cross atomic_write_unit_max boundary within the file

Are the relaxed rules better? I don't think so, and I don't like "write 
cannot cross atomic_write_unit_max boundary" in terms of wording.
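
As a concrete illustration (example numbers only, assuming 
atomic_write_unit_min = 4KB and atomic_write_unit_max = 64KB): a 12KB write 
at file offset 4KB would be rejected by the current rules (12KB is not a 
power-of-2 and the offset is not aligned to the length), but accepted by the 
relaxed rules (a multiple of 4KB, no larger than 64KB, 4KB-aligned, and the 
range [4KB, 16KB) does not cross a 64KB boundary).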

Thanks,
John
  
Bart Van Assche Oct. 3, 2023, 4:45 p.m. UTC | #6
On 10/3/23 01:37, John Garry wrote:
> I don't think that is_power_of_2(write length) is specific to XFS.

I think this is specific to XFS. Can you show me the F2FS code that 
restricts the length of an atomic write to a power of two? I haven't 
found it. The only power-of-two check that I found in F2FS is the 
following (maybe I overlooked something):

$ git grep -nH is_power fs/f2fs
fs/f2fs/super.c:3914:	if (!is_power_of_2(zone_sectors)) {

Thanks,

Bart.
  
Bart Van Assche Oct. 3, 2023, 4:55 p.m. UTC | #7
On 10/2/23 17:48, Martin K. Petersen wrote:
> 
> Bart,
> 
>> Are there any SCSI devices that we care about that report an ATOMIC
>> TRANSFER LENGTH GRANULARITY that is larger than a single logical
>> block?
> 
> Yes.
> 
> Note that code path used inside a storage device to guarantee atomicity
> of an entire I/O may be substantially different from the code path which
> only offers an incremental guarantee at a single logical or physical
> block level (to the extent that those guarantees are offered at all but
> that's a different kettle of fish).
> 
>> I'm wondering whether we really have to support such devices.
> 
> Yes.

Hi Martin,

I'm still wondering whether we really should support storage devices
that report an ATOMIC TRANSFER LENGTH GRANULARITY that is larger than
the logical block size. Is my understanding correct that the NVMe
specification makes it mandatory to support single logical block atomic
writes since the smallest value that can be reported as the AWUN 
parameter is one logical block because this parameter is a 0's based
value? Is my understanding correct that SCSI devices that report an 
ATOMIC TRANSFER LENGTH GRANULARITY that is larger than the logical block
size are not able to support the NVMe protocol?

 From the NVMe specification section about the identify controller 
response: "Atomic Write Unit Normal (AWUN): This field indicates the 
size of the write operation guaranteed to be written atomically to the 
NVM across all namespaces with any supported namespace format during 
normal operation. This field is specified in logical blocks and is a 0’s 
based value."

Thanks,

Bart.
  
Martin K. Petersen Oct. 4, 2023, 2:53 a.m. UTC | #8
Bart,

> I'm still wondering whether we really should support storage devices
> that report an ATOMIC TRANSFER LENGTH GRANULARITY that is larger than
> the logical block size.

We should. The common case is that the device reports an ATOMIC TRANSFER
LENGTH GRANULARITY matching the reported physical block size. I.e. a
logical block size of 512 bytes and a physical block size of 4KB. In
that scenario a write of a single logical block would require
read-modify-write of a physical block.

> Is my understanding correct that the NVMe specification makes it
> mandatory to support single logical block atomic writes since the
> smallest value that can be reported as the AWUN parameter is one
> logical block because this parameter is a 0's based value? Is my
> understanding correct that SCSI devices that report an ATOMIC TRANSFER
> LENGTH GRANULARITY that is larger than the logical block size are not
> able to support the NVMe protocol?

That's correct. There are obviously things you can express in SCSI that
you can't in NVMe. And the other way around. Our intent is to support
both protocols.
  
John Garry Oct. 4, 2023, 9:14 a.m. UTC | #9
On 03/10/2023 17:45, Bart Van Assche wrote:
> On 10/3/23 01:37, John Garry wrote:
>> I don't think that is_power_of_2(write length) is specific to XFS.
> 
> I think this is specific to XFS. Can you show me the F2FS code that 
> restricts the length of an atomic write to a power of two? I haven't 
> found it. The only power-of-two check that I found in F2FS is the 
> following (maybe I overlooked something):
> 
> $ git grep -nH is_power fs/f2fs
> fs/f2fs/super.c:3914:    if (!is_power_of_2(zone_sectors)) {

Any use case which we know of requires a power-of-2 block size.

Do you know of a requirement for other sizes? Or are you concerned that 
it is unnecessarily restrictive?

We have to transparently deal with HW features like atomic write boundary 
and FS restrictions like extent and stripe alignment, which are almost 
always powers-of-2, so naturally we would want to work with powers-of-2 
for atomic write sizes.

The power-of-2 stuff could be dropped if that is what people want. 
However we still want to provide a set of rules to the user to make 
those HW and FS features mentioned transparent to the user.

Thanks,
John
  
Bart Van Assche Oct. 4, 2023, 5:22 p.m. UTC | #10
On 10/3/23 19:53, Martin K. Petersen wrote:
> 
> Bart,
> 
>> I'm still wondering whether we really should support storage
>> devices that report an ATOMIC TRANSFER LENGTH GRANULARITY that is
>> larger than the logical block size.
> 
> We should. The common case is that the device reports an ATOMIC
> TRANSFER LENGTH GRANULARITY matching the reported physical block
> size. I.e. a logical block size of 512 bytes and a physical block
> size of 4KB. In that scenario a write of a single logical block would
> require read-modify-write of a physical block.

Block devices must serialize read-modify-write operations internally
that happen when there are multiple logical blocks per physical block.
Otherwise it is not guaranteed that a READ command returns the most
recently written data to the same LBA. I think we can ignore concurrent
and overlapping writes in this discussion since these can be considered
as bugs in host software.

In other words, also for the above example it is guaranteed that writes
of a single logical block (512 bytes) are atomic, no matter what value
is reported as the ATOMIC TRANSFER LENGTH GRANULARITY.

>> Is my understanding correct that the NVMe specification makes it 
>> mandatory to support single logical block atomic writes since the 
>> smallest value that can be reported as the AWUN parameter is one 
>> logical block because this parameter is a 0's based value? Is my 
>> understanding correct that SCSI devices that report an ATOMIC
>> TRANSFER LENGTH GRANULARITY that is larger than the logical block
>> size are not able to support the NVMe protocol?
> 
> That's correct. There are obviously things you can express in SCSI
> that you can't in NVMe. And the other way around. Our intent is to
> support both protocols.

How about aligning the features of the two protocols as much as
possible? My understanding is that all long-term T10 contributors are
in favor of this.

Thanks,

Bart.
  
Bart Van Assche Oct. 4, 2023, 5:34 p.m. UTC | #11
On 10/4/23 02:14, John Garry wrote:
> On 03/10/2023 17:45, Bart Van Assche wrote:
>> On 10/3/23 01:37, John Garry wrote:
>>> I don't think that is_power_of_2(write length) is specific to XFS.
>>
>> I think this is specific to XFS. Can you show me the F2FS code that 
>> restricts the length of an atomic write to a power of two? I haven't 
>> found it. The only power-of-two check that I found in F2FS is the 
>> following (maybe I overlooked something):
>>
>> $ git grep -nH is_power fs/f2fs
>> fs/f2fs/super.c:3914:    if (!is_power_of_2(zone_sectors)) {
> 
> Any usecases which we know of requires a power-of-2 block size.
> 
> Do you know of a requirement for other sizes? Or are you concerned that 
> it is unnecessarily restrictive?
> 
> We have to deal with HW features like atomic write boundary and FS 
> restrictions like extent and stripe alignment transparent, which are 
> almost always powers-of-2, so naturally we would want to work with 
> powers-of-2 for atomic write sizes.
> 
> The power-of-2 stuff could be dropped if that is what people want. 
> However we still want to provide a set of rules to the user to make 
> those HW and FS features mentioned transparent to the user.

Hi John,

My concern is that the power-of-2 requirements are only needed for
traditional filesystems and not for log-structured filesystems (BTRFS,
F2FS, BCACHEFS).

What I'd like to see is that each filesystem declares its atomic write
requirements (in struct address_space_operations?) and that
blkdev_atomic_write_valid() checks the filesystem-specific atomic write
requirements.

Thanks,

Bart.
  
Martin K. Petersen Oct. 4, 2023, 6:17 p.m. UTC | #12
Hi Bart!

> In other words, also for the above example it is guaranteed that
> writes of a single logical block (512 bytes) are atomic, no matter
> what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY.

There is no formal guarantee that a disk drive sector read-modify-write
operation results in a readable sector after a power failure. We have
definitely seen blocks being mangled in the field.

Wrt. supporting SCSI atomic block operations, the device rejects the
WRITE ATOMIC(16) command if you attempt to use a transfer length smaller
than the reported granularity. If we want to support WRITE ATOMIC(16) we
have to abide by the values reported by the device. It is not optional.

Besides, the whole point of this patch set is to increase the
"observable atomic block size" beyond the physical block size to
facilitate applications that prefer to use blocks in the 8-64KB range.
IOW, using the logical block size is not particularly interesting. The
objective is to prevent tearing of much larger blocks.

> How about aligning the features of the two protocols as much as
> possible? My understanding is that all long-term T10 contributors are
> all in favor of this.

That is exactly what this patch set does. Out of the 5-6 different
"atomic" modes of operation permitted by SCSI and NVMe, our exposed
semantics are carefully chosen to permit all compliant devices to be
used. Based on only two reported queue limits (FWIW, we started with way
more than that. I believe that complexity was part of the first RFC we
posted). Whereas this series hides most of the complexity in the various
unfortunate protocol quirks behind a simple interface: Your tear-proof
writes can't be smaller than X bytes and larger than Y bytes and they
must be naturally aligned. This simplified things substantially from an
application perspective.
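
As a rough sketch of that application-side view (assuming the RWF_ATOMIC 
pwritev2() flag proposed by this series, with a made-up 16KB block size that 
is taken to lie within the advertised min/max range for the file):

#define _GNU_SOURCE
#include <sys/uio.h>

/*
 * Sketch only: write one 16KB database block tear-free. The fd is
 * assumed to be opened with O_DIRECT and buf to meet the usual
 * O_DIRECT alignment requirements.
 */
static ssize_t write_block_atomically(int fd, void *buf, off_t off)
{
	struct iovec iov = {
		.iov_base = buf,
		.iov_len  = 16 * 1024,	/* power-of-2 length ...            */
	};

	/* ... at a naturally aligned offset (off % (16 * 1024) == 0). */
	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}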
  
Dave Chinner Oct. 4, 2023, 9:59 p.m. UTC | #13
On Wed, Oct 04, 2023 at 10:34:13AM -0700, Bart Van Assche wrote:
> On 10/4/23 02:14, John Garry wrote:
> > On 03/10/2023 17:45, Bart Van Assche wrote:
> > > On 10/3/23 01:37, John Garry wrote:
> > > > I don't think that is_power_of_2(write length) is specific to XFS.
> > > 
> > > I think this is specific to XFS. Can you show me the F2FS code that
> > > restricts the length of an atomic write to a power of two? I haven't
> > > found it. The only power-of-two check that I found in F2FS is the
> > > following (maybe I overlooked something):
> > > 
> > > $ git grep -nH is_power fs/f2fs
> > > fs/f2fs/super.c:3914:    if (!is_power_of_2(zone_sectors)) {
> > 
> > Any usecases which we know of requires a power-of-2 block size.
> > 
> > Do you know of a requirement for other sizes? Or are you concerned that
> > it is unnecessarily restrictive?
> > 
> > We have to deal with HW features like atomic write boundary and FS
> > restrictions like extent and stripe alignment transparent, which are
> > almost always powers-of-2, so naturally we would want to work with
> > powers-of-2 for atomic write sizes.
> > 
> > The power-of-2 stuff could be dropped if that is what people want.
> > However we still want to provide a set of rules to the user to make
> > those HW and FS features mentioned transparent to the user.
> 
> Hi John,
> 
> My concern is that the power-of-2 requirements are only needed for
> traditional filesystems and not for log-structured filesystems (BTRFS,
> F2FS, BCACHEFS).

Filesystems that support copy-on-write data (needed for arbitrary
filesystem block aligned RWF_ATOMIC support) are not necessarily log
structured. For example: XFS.

All three of the filesystems you list above still use power-of-2
block sizes for most of their metadata structures and for large data
extents. Hence once you go above a certain file size they are going
to be doing full power-of-2 block size aligned IO anyway. Hence the
constraint of atomic writes needing to be power-of-2 block size
aligned to avoid RMW cycles doesn't really change for these
filesystems.

In which case, they can just set their minimum atomic IO size to be
the same as their block size (e.g. 4kB) and set the maximum to
something they can guarantee gets COW'd in a single atomic
transaction. What the hardware can do with REQ_ATOMIC IO is
completely irrelevant at this point....

> What I'd like to see is that each filesystem declares its atomic write
> requirements (in struct address_space_operations?) and that
> blkdev_atomic_write_valid() checks the filesystem-specific atomic write
> requirements.

That seems unworkable to me - IO constraints propagate from the
bottom up, not from the top down.

Consider multi-device filesystems (btrfs and XFS), where different
devices might have different atomic write parameters.  Which
set of bdev parameters does the filesystem report to the querying
bdev?  (And doesn't that question just sound completely wrong?)

It also doesn't work for filesystems that can configure extent
allocation alignment at an individual inode level (like XFS) - what
does the filesystem report to the device when it doesn't know what
alignment constraints individual on-disk inodes might be using?

That's why statx() vectors through filesystems to allow them to set
their own parameters based on the inode statx() is being called on.
If the filesystem has a native RWF_ATOMIC implementation, it can put
its own parameters in the statx min/max atomic write size fields.
If the fs doesn't have its own native support, but can do physical
file offset/LBA alignment, then it publishes the block device atomic
support parameters or overrides them with its internal allocation
alignment constraints. If the bdev doesn't support REQ_ATOMIC, the
filesystem says "atomic writes are not supported".
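
A rough userspace sketch of that query step (the statx mask and field names 
here are assumed from the statx support proposed alongside this work and may 
differ; treat them as illustrative only):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Sketch only: ask the filesystem what atomic write sizes it supports
 * for this open file. Mask/field names are assumptions, not final API.
 */
static void print_atomic_write_limits(int fd)
{
	struct statx stx;

	if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx) == 0 &&
	    (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC))
		printf("atomic write unit: min %u, max %u bytes\n",
		       stx.stx_atomic_write_unit_min,
		       stx.stx_atomic_write_unit_max);
	else
		printf("atomic writes not supported\n");
}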

-Dave.
  
Bart Van Assche Oct. 5, 2023, 5:10 p.m. UTC | #14
On 10/4/23 11:17, Martin K. Petersen wrote:
> 
> Hi Bart!
> 
>> In other words, also for the above example it is guaranteed that 
>> writes of a single logical block (512 bytes) are atomic, no matter
>> what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY.
> 
> There is no formal guarantee that a disk drive sector 
> read-modify-write operation results in a readable sector after a 
> power failure. We have definitely seen blocks being mangled in the 
> field.

Aren't block devices expected to use a capacitor that provides enough
power to handle power failures cleanly?

How about blacklisting block devices that mangle blocks if a power
failure occurs? I think such block devices are not compatible with
journaling filesystems nor with log-structured filesystems.

Thanks,

Bart.
  
Dave Chinner Oct. 5, 2023, 10:36 p.m. UTC | #15
On Thu, Oct 05, 2023 at 10:10:45AM -0700, Bart Van Assche wrote:
> On 10/4/23 11:17, Martin K. Petersen wrote:
> > 
> > Hi Bart!
> > 
> > > In other words, also for the above example it is guaranteed that
> > > writes of a single logical block (512 bytes) are atomic, no matter
> > > what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY.
> > 
> > There is no formal guarantee that a disk drive sector read-modify-write
> > operation results in a readable sector after a power failure. We have
> > definitely seen blocks being mangled in the field.
> 
> Aren't block devices expected to use a capacitor that provides enough
> power to handle power failures cleanly?

Nope.

Any block device that says it operates in writeback cache mode (i.e.
almost every single consumer SATA and NVMe drive ever made) has a
volatile write back cache and so does not provide any power fail
data integrity guarantees. Simple to check, my less-than-1-yr-old
workstation tells me:

$ lspci |grep -i nvme
03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
$ cat /sys/block/nvme*n1/queue/write_cache
write back
write back
$

That they have volatile writeback caches....

> How about blacklisting block devices that mangle blocks if a power
> failure occurs? I think such block devices are not compatible with
> journaling filesystems nor with log-structured filesystems.

Statements like this from people working on storage hardware really
worry me. It demonstrates a lack of understanding of how filesystems
actually work, not to mention the fact that this architectural
problem (i.e. handling volatile device write caches correctly) was
solved in the Linux IO stack a couple of decades ago. This isn't
even 'state of the art' knowledge - this is foundational knowledge
that everyone working on storage should know.

The tl;dr summary is that filesystems will issue a cache flush
request (REQ_PREFLUSH) and/or write-through to stable storage
semantics (REQ_FUA) for any data, metadata or journal IO that has
data integrity and/or ordering requirements associated with it. The
block layer will then do the most optimal correct thing with that
request (e.g. ignore them for IO being directed at WC disabled
devices), but it guarantees the flush/fua semantics for those IOs
will be provided by all layers in the stack right down to the
persistent storage media itself. Hence all the filesystem has to do
is get its IO and cache flush ordering correct, and everything
just works regardless of the underlying storage capabilities.
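
For reference, a minimal kernel-style sketch (not from this series) of the 
mechanism described above, as a filesystem might tag its journal commit 
write:

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch only: REQ_PREFLUSH asks the device to flush its volatile
 * write cache before this write; REQ_FUA asks that this write itself
 * reach stable media before completion. The block layer drops the
 * flags for devices that have no volatile cache.
 */
static void submit_commit_block(struct block_device *bdev,
				struct page *page, sector_t sector)
{
	struct bio *bio = bio_alloc(bdev, 1,
			REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_FUA,
			GFP_NOFS);

	bio->bi_iter.bi_sector = sector;
	__bio_add_page(bio, page, PAGE_SIZE, 0);
	submit_bio(bio);
}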

And, yes, any storage device with volatile caches that doesn't
implement cache flushes correctly is considered broken and will get
black listed....

-Dave.
  
Bart Van Assche Oct. 5, 2023, 10:58 p.m. UTC | #16
On 10/5/23 15:36, Dave Chinner wrote:
> $ lspci |grep -i nvme
> 03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> 06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> $ cat /sys/block/nvme*n1/queue/write_cache
> write back
> write back
> $
> 
> That they have volatile writeback caches....

It seems like what I wrote has been misunderstood completely. With
"handling a power failure cleanly" I meant that power cycling a block device
does not result in read errors nor in reading data that has never been written.
Although it is hard to find information about this topic, here is what I found
online:
* About certain SSDs with power loss protection:
   https://us.transcend-info.com/embedded/technology/power-loss-protection-plp
* About another class of SSDs with power loss protection:
   https://www.kingston.com/en/blog/servers-and-data-centers/ssd-power-loss-protection
* About yet another class of SSDs with power loss protection:
   https://phisonblog.com/avoiding-ssd-data-loss-with-phisons-power-loss-protection-2/

So far I haven't found any information about hard disks and power failure
handling. What I found is that most current hard disks protect data with ECC.
The ECC mechanism should provide good protection against reading data that
has never been written. If a power failure occurs while a hard disk is writing
a physical block, can this result in a read error after power is restored? If
so, is this behavior allowed by storage standards?

Bart.
  
Dave Chinner Oct. 6, 2023, 4:31 a.m. UTC | #17
On Thu, Oct 05, 2023 at 03:58:38PM -0700, Bart Van Assche wrote:
> On 10/5/23 15:36, Dave Chinner wrote:
> > $ lspci |grep -i nvme
> > 03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> > 06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> > $ cat /sys/block/nvme*n1/queue/write_cache
> > write back
> > write back
> > $
> > 
> > That they have volatile writeback caches....
> 
> It seems like what I wrote has been misunderstood completely. With
> "handling a power failure cleanly" I meant that power cycling a block device
> does not result in read errors nor in reading data that has never been written.

Then I don't see what your concern is. 

Single sector writes are guaranteed atomic and have been for as long
as I've worked in this game. OTOH, multi-sector writes are not
guaranteed to be atomic - they can get torn on sector boundaries,
but the individual sectors within that write are guaranteed to be
all-or-nothing. 

Any hardware device that does not guarantee single sector write
atomicity (i.e. tears in the middle of a sector) is, by definition,
broken. And we all know that broken hardware means nothing in the
storage stack works as it should, so I just don't see what point you
are trying to make...

> Although it is hard to find information about this topic, here is what I found
> online:
> * About certain SSDs with power loss protection:
>   https://us.transcend-info.com/embedded/technology/power-loss-protection-plp
> * About another class of SSDs with power loss protection:
>   https://www.kingston.com/en/blog/servers-and-data-centers/ssd-power-loss-protection
> * About yet another class of SSDs with power loss protection:
>   https://phisonblog.com/avoiding-ssd-data-loss-with-phisons-power-loss-protection-2/

Yup, devices that behave as if they have non-volatile write caches.
Such devices have been around for more than 30 years, they operate
the same as devices without caches at all.

> So far I haven't found any information about hard disks and power failure
> handling. What I found is that most current hard disks protect data with ECC.
> The ECC mechanism should provide good protection against reading data that
> has never been written.  If a power failure occurs while a hard disk is writing
> a physical block, can this result in a read error after power is restored? If
> so, is this behavior allowed by storage standards?

If a power fail results in read errors from the storage media being
reported to the OS instead of the data that was present in the
sector before the power failure, then the device is broken. If there
is no data in the region being read because it has never been
written, then it should return zeros (no data) rather than stale
data or a media read error.

-Dave.
  
Bart Van Assche Oct. 6, 2023, 5:22 p.m. UTC | #18
On 10/5/23 21:31, Dave Chinner wrote:
> Then I don't see what your concern is.
> 
> Single sector writes are guaranteed atomic and have been for as long 
> as I've worked in this game. OTOH, multi-sector writes are not 
> guaranteed to be atomic - they can get torn on sector boundaries, but
> the individual sectors within that write are guaranteed to be 
> all-or-nothing.
> 
> Any hardware device that does not guarantee single sector write 
> atomicity (i.e. tears in the middle of a sector) is, by definition, 
> broken. And we all know that broken hardware means nothing in the 
> storage stack works as it should, so I just don't see what point you 
> are trying to make...

Do you agree that the above implies that it is not useful in patch 01/21
of this series to track atomic_write_unit_min_bytes in the block layer
nor to export this information to user space? The above implies that
this parameter will always be equal to the logical block size. Writes to
a single physical block happen atomically. If there are multiple logical
blocks per physical block, the block device must serialize 
read/modify/write cycles internally.

Thanks,

Bart.
  
Martin K. Petersen Oct. 7, 2023, 1:21 a.m. UTC | #19
Bart,

> The above implies that this parameter will always be equal to the
> logical block size.

It does not. Being able to write each individual block in an I/O without
tearing does not imply that a device can write two blocks as a single
atomic operation.

> Writes to a single physical block happen atomically. If there are
> multiple logical blocks per physical block, the block device must
> serialize read/modify/write cycles internally.

This is what SBC has to say:

"If any write command that is not an atomic write command, does not
complete successfully (e.g., the command completed with CHECK CONDITION
status, or the command was being processed at the time of a power loss
or an incorrect demount of a removable medium), then any data in the
logical blocks referenced by the LBAs specified by that command is
indeterminate."

SBC defines "atomic write command" like this:

"An atomic write command performs one or more atomic write operations.
 The following write commands are atomic write commands:

 a) WRITE ATOMIC (16) (see 5.48); and
 b) WRITE ATOMIC (32) (see 5.49)."

You will note that none of the regular WRITE commands appear in that
list.

Now, in practice we obviously rely heavily on the fact that most devices
are implemented in a sane fashion which doesn't mess up individual
logical blocks on power fail. But the spec does not guarantee this; it
is device implementation dependent. And again, we have seen both hard
disk drives and SSDs that cause collateral damage to an entire physical
block when power is lost at the wrong time.
  
Ming Lei Dec. 4, 2023, 2:30 a.m. UTC | #20
On Fri, Sep 29, 2023 at 10:27:15AM +0000, John Garry wrote:
> Add support for atomic writes, as follows:
> - Ensure that the IO follows all the atomic writes rules, like must be
>   naturally aligned
> - Set REQ_ATOMIC
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  block/fops.c | 42 +++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/block/fops.c b/block/fops.c
> index acff3d5d22d4..516669ad69e5 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -41,6 +41,29 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
>  		!bdev_iter_is_aligned(bdev, iter);
>  }
>  
> +static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
> +			      struct iov_iter *iter)
> +{
> +	unsigned int atomic_write_unit_min_bytes =
> +			queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
> +	unsigned int atomic_write_unit_max_bytes =
> +			queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
> +
> +	if (!atomic_write_unit_min_bytes)
> +		return false;

The above check should be moved to the limit setting code path.

> +	if (pos % atomic_write_unit_min_bytes)
> +		return false;
> +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> +		return false;
> +	if (!is_power_of_2(iov_iter_count(iter)))
> +		return false;
> +	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
> +		return false;
> +	if (pos % iov_iter_count(iter))
> +		return false;

I am a bit confused about the relation between atomic_write_unit_max_bytes
and atomic_write_max_bytes.

Here the max IO length is limited to be <= atomic_write_unit_max_bytes,
so it looks like userspace can only submit write-atomic-unit naturally
aligned IO (such as 4k, 8k, 16k, 32k, ...), but these user IOs are
allowed to be merged into a bigger one if natural alignment is respected
and the merged IO size is <= atomic_write_max_bytes.

Is my understanding right? If yes, I'd suggest documenting the point,
and the last two checks could be changed to:

	/* naturally aligned */
	if (pos % iov_iter_count(iter))
		return false;

	if (iov_iter_count(iter) > atomic_write_max_bytes)
		return false;

Thanks, 
Ming
  
John Garry Dec. 4, 2023, 9:27 a.m. UTC | #21
On 04/12/2023 02:30, Ming Lei wrote:

Hi Ming,

>> +static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
>> +			      struct iov_iter *iter)
>> +{
>> +	unsigned int atomic_write_unit_min_bytes =
>> +			queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
>> +	unsigned int atomic_write_unit_max_bytes =
>> +			queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
>> +
>> +	if (!atomic_write_unit_min_bytes)
>> +		return false;
> The above check should have be moved to limit setting code path.

Sorry, I didn't fully understand your point.

I added this here (as opposed to the caller), as I was not really 
worried about speeding up the failure path. Are you saying to call even 
earlier in submission path?

> 
>> +	if (pos % atomic_write_unit_min_bytes)
>> +		return false;
>> +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
>> +		return false;
>> +	if (!is_power_of_2(iov_iter_count(iter)))
>> +		return false;
>> +	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
>> +		return false;
>> +	if (pos % iov_iter_count(iter))
>> +		return false;
> I am a bit confused about relation between atomic_write_unit_max_bytes and
> atomic_write_max_bytes.

I think that naming could be improved. Or even just drop merging (and the 
atomic_write_max_bytes concept) until we show that it improves performance.

So generally atomic_write_unit_max_bytes will be the same as 
atomic_write_max_bytes, however it could be different if:
a. request queue nr hw segments or other request queue limits need to 
restrict atomic_write_unit_max_bytes
b. atomic_write_unit_max_bytes needs to be a power-of-2 and 
atomic_write_max_bytes does not. So essentially:
atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes)
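
For example (illustrative numbers): if the hardware reported 
atomic_write_max_bytes = 96KB, then atomic_write_unit_max_bytes would be 
rounded down to 64KB, while merged IO could still go up to 96KB.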

> 
> Here the max IO length is limited to be <= atomic_write_unit_max_bytes,
> so looks userspace can only submit IO with write-atomic-unit naturally
> aligned IO(such as, 4k, 8k, 16k, 32k, ...), 

correct

> but these user IOs are
> allowed to be merged to big one if naturally alignment is respected and
> the merged IO size is <= atomic_write_max_bytes.

correct, but the resultant merged IO does not have to be naturally 
aligned.

> 
> Is my understanding right? 

Yes, but...

> If yes, I'd suggest to document the point,
> and the last two checks could be change to:
> 
> 	/* naturally aligned */
> 	if (pos % iov_iter_count(iter))
> 		return false;
> 
> 	if (iov_iter_count(iter) > atomic_write_max_bytes)
> 		return false;

.. we would not be merging at this point as this is just IO submission 
to the block layer, so atomic_write_max_bytes does not come into play 
yet. If you check patch 7/21, you will see that we limit IO size to 
atomic_write_max_bytes, which is relevant to merging.

Thanks,
John
  
Ming Lei Dec. 4, 2023, 12:18 p.m. UTC | #22
On Mon, Dec 04, 2023 at 09:27:00AM +0000, John Garry wrote:
> On 04/12/2023 02:30, Ming Lei wrote:
> 
> Hi Ming,
> 
> > > +static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
> > > +			      struct iov_iter *iter)
> > > +{
> > > +	unsigned int atomic_write_unit_min_bytes =
> > > +			queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
> > > +	unsigned int atomic_write_unit_max_bytes =
> > > +			queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
> > > +
> > > +	if (!atomic_write_unit_min_bytes)
> > > +		return false;
> > The above check should have be moved to limit setting code path.
> 
> Sorry, I didn't fully understand your point.
> 
> I added this here (as opposed to the caller), as I was not really worried
> about speeding up the failure path. Are you saying to call even earlier in
> submission path?

atomic_write_unit_min is one hardware property, and it should be checked
in blk_queue_atomic_write_unit_min_sectors() from the beginning, then you
can avoid this check everywhere else.

> 
> > 
> > > +	if (pos % atomic_write_unit_min_bytes)
> > > +		return false;
> > > +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> > > +		return false;
> > > +	if (!is_power_of_2(iov_iter_count(iter)))
> > > +		return false;
> > > +	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
> > > +		return false;
> > > +	if (pos % iov_iter_count(iter))
> > > +		return false;
> > I am a bit confused about relation between atomic_write_unit_max_bytes and
> > atomic_write_max_bytes.
> 
> I think that naming could be improved. Or even just drop merging (and
> atomic_write_max_bytes concept) until we show it to improve performance.
> 
> So generally atomic_write_unit_max_bytes will be same as
> atomic_write_max_bytes, however it could be different if:
> a. request queue nr hw segments or other request queue limits needs to
> restrict atomic_write_unit_max_bytes
> b. atomic_write_unit_max_bytes does not need to be a power-of-2 and
> atomic_write_max_bytes does. So essentially:
> atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes)
> 

plug merge often improves sequential IO perf, so if the hardware supports
this, I think 'atomic_write_max_bytes' should be supported from the
beginning, such as:

- user space submits sequential N * (4k, 8k, 16k, ...) atomic writes, all can
be merged to single IO request, which is issued to driver.

Or 

- user space submits sequential 4k, 4k, 8k, 16K, 32k, 64k atomic writes, all can
be merged to single IO request, which is issued to driver.

The hardware should recognize unit size by start LBA, and check if length is
valid, so probably the interface might be relaxed to:

1) start lba is unit aligned, and this unit is in the supported unit
range(power_2 in [unit_min, unit_max])

2) length needs to be:

- N * this_unit_size
- <= atomic_write_max_bytes


> > 
> > Here the max IO length is limited to be <= atomic_write_unit_max_bytes,
> > so looks userspace can only submit IO with write-atomic-unit naturally
> > aligned IO(such as, 4k, 8k, 16k, 32k, ...),
> 
> correct
> 
> > but these user IOs are
> > allowed to be merged to big one if naturally alignment is respected and
> > the merged IO size is <= atomic_write_max_bytes.
> 
> correct, but the resultant merged IO does not have have to be naturally
> aligned.
> 
> > 
> > Is my understanding right?
> 
> Yes, but...
> 
> > If yes, I'd suggest to document the point,
> > and the last two checks could be change to:
> > 
> > 	/* naturally aligned */
> > 	if (pos % iov_iter_count(iter))
> > 		return false;
> > 
> > 	if (iov_iter_count(iter) > atomic_write_max_bytes)
> > 		return false;
> 
> .. we would not be merging at this point as this is just IO submission to
> the block layer, so atomic_write_max_bytes does not come into play yet. If
> you check patch 7/21, you will see that we limit IO size to
> atomic_write_max_bytes, which is relevant merging.

I know the motivation of atomic_write_max_bytes, and now I am wondering
whether atomic_write_max_bytes should be exported to userspace for the sake
of atomic write performance.


Thanks,
Ming
  
John Garry Dec. 4, 2023, 1:13 p.m. UTC | #23
>>
>> I added this here (as opposed to the caller), as I was not really worried
>> about speeding up the failure path. Are you saying to call even earlier in
>> submission path?
> atomic_write_unit_min is one hardware property, and it should be checked
> in blk_queue_atomic_write_unit_min_sectors() from beginning, then you
> can avoid this check every other where.

ok, but we still need to ensure in the submission path that the block 
device actually supports atomic writes - this was the initial check.

> 
>>>> +	if (pos % atomic_write_unit_min_bytes)
>>>> +		return false;
>>>> +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
>>>> +		return false;
>>>> +	if (!is_power_of_2(iov_iter_count(iter)))
>>>> +		return false;
>>>> +	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
>>>> +		return false;
>>>> +	if (pos % iov_iter_count(iter))
>>>> +		return false;
>>> I am a bit confused about relation between atomic_write_unit_max_bytes and
>>> atomic_write_max_bytes.
>> I think that naming could be improved. Or even just drop merging (and
>> atomic_write_max_bytes concept) until we show it to improve performance.
>>
>> So generally atomic_write_unit_max_bytes will be same as
>> atomic_write_max_bytes, however it could be different if:
>> a. request queue nr hw segments or other request queue limits needs to
>> restrict atomic_write_unit_max_bytes
>> b. atomic_write_unit_max_bytes needs to be a power-of-2 and
>> atomic_write_max_bytes does not. So essentially:
>> atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes)
>>
> plug merge often improves sequential IO perf, so if the hardware supports
> this way, I think 'atomic_write_max_bytes' should be supported from the
> beginning, such as:
> 
> - user space submits sequential N * (4k, 8k, 16k, ...) atomic writes, all can
> be merged to single IO request, which is issued to driver.
> 
> Or
> 
> - user space submits sequential 4k, 4k, 8k, 16K, 32k, 64k atomic writes, all can
> be merged to single IO request, which is issued to driver.

Right, we do expect userspace to use a fixed block size, but we give 
scope in the API to use variable size.

> 
> The hardware should recognize unit size by start LBA, and check if length is
> valid, so probably the interface might be relaxed to:
> 
> 1) start lba is unit aligned, and this unit is in the supported unit
> range(power_2 in [unit_min, unit_max])
> 
> 2) length needs to be:
> 
> - N * this_unit_size
> - <= atomic_write_max_bytes

Please note that we also need to consider:
- any atomic write boundary (from NVMe)
- virt boundary (from NVMe)

And, as I mentioned elsewhere, I am still not 100% comfortable that we 
don't pay attention to regular max_sectors_kb...

> 
> 
>>> Here the max IO length is limited to be <= atomic_write_unit_max_bytes,
>>> so looks userspace can only submit IO with write-atomic-unit naturally
>>> aligned IO(such as, 4k, 8k, 16k, 32k, ...),
>> correct
>>
>>> but these user IOs are
>>> allowed to be merged to big one if naturally alignment is respected and
>>> the merged IO size is <= atomic_write_max_bytes.
>> correct, but the resultant merged IO does not have to be naturally
>> aligned.
>>
>>> Is my understanding right?
>> Yes, but...
>>
>>> If yes, I'd suggest to document the point,
>>> and the last two checks could be change to:
>>>
>>> 	/* naturally aligned */
>>> 	if (pos % iov_iter_count(iter))
>>> 		return false;
>>>
>>> 	if (iov_iter_count(iter) > atomic_write_max_bytes)
>>> 		return false;
>> .. we would not be merging at this point as this is just IO submission to
>> the block layer, so atomic_write_max_bytes does not come into play yet. If
>> you check patch 7/21, you will see that we limit IO size to
>> atomic_write_max_bytes, which is relevant merging.
> I know the motivation of atomic_write_max_bytes, and now I am wondering
> atomic_write_max_bytes may be exported to userspace for the sake of
> atomic write performance.

It is available from sysfs for the request queue, but in an earlier 
series Dave Chinner suggested doing more to expose it to the application 
programmer. So here that would mean a statx member. I'm still not 
sure... it just didn't seem like a detail which the user would need to 
know or be able to do much with.

Thanks,
John
  
Ming Lei Dec. 5, 2023, 1:45 a.m. UTC | #24
On Mon, Dec 04, 2023 at 01:13:55PM +0000, John Garry wrote:
> 
> > > 
> > > I added this here (as opposed to the caller), as I was not really worried
> > > about speeding up the failure path. Are you saying to call even earlier in
> > > submission path?
> > atomic_write_unit_min is one hardware property, and it should be checked
> > in blk_queue_atomic_write_unit_min_sectors() from beginning, then you
> > can avoid this check every other where.
> 
> ok, but we still need to ensure in the submission path that the block device
> actually supports atomic writes - this was the initial check.

Then you may add one helper bdev_support_atomic_write().

> 
> > 
> > > > > +	if (pos % atomic_write_unit_min_bytes)
> > > > > +		return false;
> > > > > +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> > > > > +		return false;
> > > > > +	if (!is_power_of_2(iov_iter_count(iter)))
> > > > > +		return false;
> > > > > +	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
> > > > > +		return false;
> > > > > +	if (pos % iov_iter_count(iter))
> > > > > +		return false;
> > > > I am a bit confused about relation between atomic_write_unit_max_bytes and
> > > > atomic_write_max_bytes.
> > > I think that naming could be improved. Or even just drop merging (and
> > > atomic_write_max_bytes concept) until we show it to improve performance.
> > > 
> > > So generally atomic_write_unit_max_bytes will be same as
> > > atomic_write_max_bytes, however it could be different if:
> > > a. request queue nr hw segments or other request queue limits needs to
> > > restrict atomic_write_unit_max_bytes
> > > b. atomic_write_unit_max_bytes needs to be a power-of-2 and
> > > atomic_write_max_bytes does not. So essentially:
> > > atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes)
> > > 
> > plug merge often improves sequential IO perf, so if the hardware supports
> > this way, I think 'atomic_write_max_bytes' should be supported from the
> > beginning, such as:
> > 
> > - user space submits sequential N * (4k, 8k, 16k, ...) atomic writes, all can
> > be merged to single IO request, which is issued to driver.
> > 
> > Or
> > 
> > - user space submits sequential 4k, 4k, 8k, 16K, 32k, 64k atomic writes, all can
> > be merged to single IO request, which is issued to driver.
> 
> Right, we do expect userspace to use a fixed block size, but we give scope
> in the API to use variable size.

Maybe it is enough to just take atomic_write_unit_min_bytes
only, and allow length to be N * atomic_write_unit_min_bytes.

But it may violate atomic write boundary?

> 
> > 
> > The hardware should recognize unit size by start LBA, and check if length is
> > valid, so probably the interface might be relaxed to:
> > 
> > 1) start lba is unit aligned, and this unit is in the supported unit
> > range(power_2 in [unit_min, unit_max])
> > 
> > 2) length needs to be:
> > 
> > - N * this_unit_size
> > - <= atomic_write_max_bytes
> 
> Please note that we also need to consider:
> - any atomic write boundary (from NVMe)

Can you provide an actual NVMe boundary value?

Firstly, a naturally aligned write won't cross the boundary, so the boundary
should be >= write_unit_max; see the below code from patch 10/21:

+static bool bio_straddles_atomic_write_boundary(loff_t bi_sector,
+				unsigned int bi_size,
+				unsigned int boundary)
+{
+	loff_t start = bi_sector << SECTOR_SHIFT;
+	loff_t end = start + bi_size;
+	loff_t start_mod = start % boundary;
+	loff_t end_mod = end % boundary;
+
+	if (end - start > boundary)
+		return true;
+	if ((start_mod > end_mod) && (start_mod && end_mod))
+		return true;
+
+	return false;
+}
+

Then if the WRITE size is <= boundary, the above function should return
false, right? Looks like it is power_of(2) & aligned atomic_write_max_bytes?

> - virt boundary (from NVMe)

virt boundary is applied on bv_offset and bv_len, and NVMe's virt
boundary is (4k - 1), so it shouldn't be an issue in reality.

> 
> And, as I mentioned elsewhere, I am still not 100% comfortable that we don't
> pay attention to regular max_sectors_kb...

max_sectors_kb should be bigger than atomic_write_max_bytes actually,
then what is your concern?



Thanks,
Ming
  
John Garry Dec. 5, 2023, 10:49 a.m. UTC | #25
On 05/12/2023 01:45, Ming Lei wrote:
>> Right, we do expect userspace to use a fixed block size, but we give scope
>> in the API to use variable size.
> Maybe it is enough to just take atomic_write_unit_min_bytes
> only, and allow length to be N * atomic_write_unit_min_bytes.
> 
> But it may violate atomic write boundary?

About atomic boundary, we just don't allow a merge which will result in 
a write which will straddle a boundary as there are no guarantees of 
atomicity then.

Having said this, atomic write boundary is just relevant to NVMe, so if 
we don't have merges there, then we could just omit this code.

> 
>>> The hardware should recognize unit size by start LBA, and check if length is
>>> valid, so probably the interface might be relaxed to:
>>>
>>> 1) start lba is unit aligned, and this unit is in the supported unit
>>> range(power_2 in [unit_min, unit_max])
>>>
>>> 2) length needs to be:
>>>
>>> - N * this_unit_size
>>> - <= atomic_write_max_bytes
>> Please note that we also need to consider:
>> - any atomic write boundary (from NVMe)
> Can you provide actual NVMe boundary value?
> 
> Firstly natural aligned write won't cross boundary, so boundary should
> be >= write_unit_max,

Correct

> see blow code from patch 10/21:
> 
> +static bool bio_straddles_atomic_write_boundary(loff_t bi_sector,
> +				unsigned int bi_size,
> +				unsigned int boundary)
> +{
> +	loff_t start = bi_sector << SECTOR_SHIFT;
> +	loff_t end = start + bi_size;
> +	loff_t start_mod = start % boundary;
> +	loff_t end_mod = end % boundary;
> +
> +	if (end - start > boundary)
> +		return true;
> +	if ((start_mod > end_mod) && (start_mod && end_mod))
> +		return true;
> +
> +	return false;
> +}
> +
> 
> Then if the WRITE size is <= boundary, the above function should return
> false, right? 

Actually if WRITE > boundary then we must be crossing a boundary and 
should return true, which is what the first condition checks.

However 2x naturally-aligned atomic writes could be less than 
atomic_write_max_bytes but still straddle if merged.
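
For example (illustrative numbers): with a 32KB atomic write boundary and 
atomic_write_max_bytes = 32KB, two naturally aligned 8KB atomic writes at 
24KB and 32KB are each fine on their own, but the merged result is a 16KB 
write spanning [24KB, 40KB), which crosses the 32KB boundary even though 
16KB < atomic_write_max_bytes, so that merge must be rejected.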

> Looks like it is power_of(2) & aligned atomic_write_max_bytes?
> 
>> - virt boundary (from NVMe)
> virt boundary is applied on bv_offset and bv_len, and NVMe's virt
> bounary is (4k - 1), it shouldn't be one issue in reality.

On a related topic, as I understand, for NVMe we just split bios 
according to virt boundary for PRP, but we won't always use PRP. So is 
there value in not splitting bio according to PRP if SGL will actually 
be used?

> 
>> And, as I mentioned elsewhere, I am still not 100% comfortable that we don't
>> pay attention to regular max_sectors_kb...
> max_sectors_kb should be bigger than atomic_write_max_bytes actually,
> then what is your concern?

My concern is that we don't enforce that, so we may issue atomic writes 
which exceed max_sectors_kb.

If we did enforce it, then atomic_write_unit_min_bytes, 
atomic_write_unit_max_bytes, and atomic_write_max_bytes all need to be 
limited according to max_sectors_kb.

Thanks,
John
  

Patch

diff --git a/block/fops.c b/block/fops.c
index acff3d5d22d4..516669ad69e5 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -41,6 +41,29 @@  static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
 		!bdev_iter_is_aligned(bdev, iter);
 }
 
+static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
+			      struct iov_iter *iter)
+{
+	unsigned int atomic_write_unit_min_bytes =
+			queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
+	unsigned int atomic_write_unit_max_bytes =
+			queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
+
+	if (!atomic_write_unit_min_bytes)
+		return false;
+	if (pos % atomic_write_unit_min_bytes)
+		return false;
+	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
+		return false;
+	if (!is_power_of_2(iov_iter_count(iter)))
+		return false;
+	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
+		return false;
+	if (pos % iov_iter_count(iter))
+		return false;
+	return true;
+}
+
 #define DIO_INLINE_BIO_VECS 4
 
 static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
@@ -48,6 +71,8 @@  static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 {
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
 	struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs;
+	bool is_read = iov_iter_rw(iter) == READ;
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
 	loff_t pos = iocb->ki_pos;
 	bool should_dirty = false;
 	struct bio bio;
@@ -56,6 +81,9 @@  static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	if (blkdev_dio_unaligned(bdev, pos, iter))
 		return -EINVAL;
 
+	if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
+		return -EINVAL;
+
 	if (nr_pages <= DIO_INLINE_BIO_VECS)
 		vecs = inline_vecs;
 	else {
@@ -65,7 +93,7 @@  static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 			return -ENOMEM;
 	}
 
-	if (iov_iter_rw(iter) == READ) {
+	if (is_read) {
 		bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ);
 		if (user_backed_iter(iter))
 			should_dirty = true;
@@ -74,6 +102,8 @@  static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	}
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
 	bio.bi_ioprio = iocb->ki_ioprio;
+	if (atomic_write)
+		bio.bi_opf |= REQ_ATOMIC;
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
@@ -167,10 +197,14 @@  static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	struct blkdev_dio *dio;
 	struct bio *bio;
 	bool is_read = (iov_iter_rw(iter) == READ), is_sync;
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
 	loff_t pos = iocb->ki_pos;
 	int ret = 0;
 
+	if (atomic_write)
+		return -EINVAL;
+
 	if (blkdev_dio_unaligned(bdev, pos, iter))
 		return -EINVAL;
 
@@ -305,6 +339,7 @@  static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
 	bool is_read = iov_iter_rw(iter) == READ;
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
 	struct blkdev_dio *dio;
 	struct bio *bio;
 	loff_t pos = iocb->ki_pos;
@@ -313,6 +348,9 @@  static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	if (blkdev_dio_unaligned(bdev, pos, iter))
 		return -EINVAL;
 
+	if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
+		return -EINVAL;
+
 	if (iocb->ki_flags & IOCB_ALLOC_CACHE)
 		opf |= REQ_ALLOC_CACHE;
 	bio = bio_alloc_bioset(bdev, nr_pages, opf, GFP_KERNEL,
@@ -347,6 +385,8 @@  static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 			bio_set_pages_dirty(bio);
 		}
 	} else {
+		if (atomic_write)
+			bio->bi_opf |= REQ_ATOMIC;
 		task_io_account_write(bio->bi_iter.bi_size);
 	}