[1/1] block: Check the queue limit before bio submitting

Message ID 20231025092255.27930-1-ed.tsai@mediatek.com
State New
Series [1/1] block: Check the queue limit before bio submitting

Commit Message

Ed Tsai (蔡宗軒) Oct. 25, 2023, 9:22 a.m. UTC
  From: Ed Tsai <ed.tsai@mediatek.com>

Since commit 07173c3ec276 ("block: enable multipage bvecs"), a
bio_vec can hold more than one page, so a full bio may exceed 1MB in
size and become misaligned with the queue limit.

In a sequential read/write scenario, the file system maximizes the
bio's capacity before submitting. However, misalignment with the
queue limit can result in the bio being split into smaller I/O
operations.

For instance, assume the maximum I/O size is 512KB and memory is
highly fragmented, so that a full bio ends up with 255 single-page
bio_vecs plus one 2-page bio_vec (i.e., bi_size = 1028KB). The bio
would then be split into two 512KB portions and one 4KB portion. As a
result, the expected stream of large, contiguous I/O operations is
interspersed with many small I/O operations.

To address this issue, this patch adds a check for the max_sectors
before submitting the bio. This allows the upper layers to proactively
detect and handle alignment issues.
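
The core of the change is a single comparison made while pages are
being added to the bio; a condensed sketch of that check (the full hunk
is quoted in the discussion below, and the helper name here is only
illustrative) is:

	/*
	 * Sketch only: stop filling the bio once adding "len" more bytes
	 * would push bi_size past the queue's max_sectors limit; the caller
	 * then submits this bio and starts filling a new, aligned one.
	 */
	static bool bio_fill_hits_queue_limit(struct bio *bio, size_t len)
	{
		struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)->limits;

		return bio->bi_iter.bi_size + len >
		       ((sector_t)lim->max_sectors << SECTOR_SHIFT);
	}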

I performed the Antutu V10 Storage Test on a UFS 4.0 device, which
resulted in a significant improvement in the Sequential test:

Sequential Read (average of 5 rounds):
Original: 3033.7 MB/sec
Patched: 3520.9 MB/sec

Sequential Write (average of 5 rounds):
Original: 2225.4 MB/sec
Patched: 2800.3 MB/sec

Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
---
 block/bio.c | 6 ++++++
 1 file changed, 6 insertions(+)
  

Comments

Ed Tsai (蔡宗軒) Nov. 1, 2023, 2:23 a.m. UTC | #1
On Wed, 2023-10-25 at 17:22 +0800, ed.tsai@mediatek.com wrote:
> From: Ed Tsai <ed.tsai@mediatek.com>
> 
> Referring to commit 07173c3ec276 ("block: enable multipage bvecs"),
> each bio_vec now holds more than one page, potentially exceeding
> 1MB in size and causing alignment issues with the queue limit.
> 
> In a sequential read/write scenario, the file system maximizes the
> bio's capacity before submitting. However, misalignment with the
> queue limit can result in the bio being split into smaller I/O
> operations.
> 
> For instance, assuming the maximum I/O size is set to 512KB and the
> memory is highly fragmented, resulting in each bio containing only
> one 2-pages bio_vec (i.e., bi_size = 1028KB). This would cause the
> bio to be split into two 512KB portions and one 4KB portion. As a
> result, the originally expected continuous large I/O operations are
> interspersed with many small I/O operations.
> 
> To address this issue, this patch adds a check for the max_sectors
> before submitting the bio. This allows the upper layers to
> proactively
> detect and handle alignment issues.
> 
> I performed the Antutu V10 Storage Test on a UFS 4.0 device, which
> resulted in a significant improvement in the Sequential test:
> 
> Sequential Read (average of 5 rounds):
> Original: 3033.7 MB/sec
> Patched: 3520.9 MB/sec
> 
> Sequential Write (average of 5 rounds):
> Original: 2225.4 MB/sec
> Patched: 2800.3 MB/sec
> 
> Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
> ---
>  block/bio.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 816d412c06e9..a4a1f775b9ea 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1227,6 +1227,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	iov_iter_extraction_t extraction_flags = 0;
>  	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
>  	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
> +	struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)->limits;
>  	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
>  	struct page **pages = (struct page **)bv;
>  	ssize_t size, left;
> @@ -1275,6 +1276,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  		struct page *page = pages[i];
>  
>  		len = min_t(size_t, PAGE_SIZE - offset, left);
> +		if (bio->bi_iter.bi_size + len >
> +		    lim->max_sectors << SECTOR_SHIFT) {
> +			ret = left;
> +			break;
> +		}
>  		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
>  			ret = bio_iov_add_zone_append_page(bio, page, len,
>  					offset);
> -- 
> 2.18.0
> 


Hi Jens,

Just to clarify any potential confusion, I would like to provide
further details based on the assumed scenario mentioned above.

When the upper layer continuously sends 1028KB full-sized bios for
sequential reads, the Block Layer sees the following sequence:
	submit bio: size = 1028KB, start LBA = n
	submit bio: size = 1028KB, start LBA = n + 1028KB 
	submit bio: size = 1028KB, start LBA = n + 2056KB
	...

However, due to the queue limit restricting the I/O size to a maximum
of 512KB, the Block Layer splits into the following sequence:
	submit bio: size = 512KB, start LBA = n
	submit bio: size = 512KB, start LBA = n +  512KB
	submit bio: size =   4KB, start LBA = n + 1024KB	
	submit bio: size = 512KB, start LBA = n + 1028KB
	submit bio: size = 512KB, start LBA = n + 1540KB
	submit bio: size =   4KB, start LBA = n + 2052KB
	submit bio: size = 512KB, start LBA = n + 2056KB
	submit bio: size = 512KB, start LBA = n + 2568KB
	submit bio: size =   4KB, start LBA = n + 3080KB
	...
	
The original expectation was for the storage device to receive large,
contiguous requests. However, because of the misalignment, many small
I/O requests are generated. This problem is easy to observe because the
user pages passed in are usually order-0 pages allocated by the buddy
system during page faults, resulting in highly non-contiguous memory.
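
To make the "physical layout" point concrete, the condition under which
a newly pinned user page extends the previous bio_vec instead of
consuming a new one boils down to physical contiguity; a simplified
sketch (the helper name is made up, not the kernel's) is:

	/*
	 * Sketch: a page only grows the previous bvec when it starts exactly
	 * where that bvec ends in physical memory. Order-0 pages from a
	 * fragmented buddy allocator rarely satisfy this, so most bvecs stay
	 * at 4KB; an occasional contiguous run grows one bvec to 8KB or more,
	 * which is what pushes a "full" bio to odd sizes like 1028KB.
	 */
	static bool page_extends_prev_bvec(const struct bio_vec *bv,
					   struct page *page, unsigned int off)
	{
		phys_addr_t bv_end = page_to_phys(bv->bv_page) +
				     bv->bv_offset + bv->bv_len;

		return page_to_phys(page) + off == bv_end;
	}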

As observed in the Antutu Sequential Read trace below, just as in the
description above, the splitting caused by the queue limit leaves small
requests sandwiched in between:

block_bio_queue: 8,32 R 86925864 + 2144 [Thread-51]
block_split: 8,32 R 86925864 / 86926888 [Thread-51]
block_split: 8,32 R 86926888 / 86927912 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86925864 + 1024 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86926888 + 1024 [Thread-51]
block_bio_queue: 8,32 R 86928008 + 2144 [Thread-51]
block_split: 8,32 R 86928008 / 86929032 [Thread-51]
block_split: 8,32 R 86929032 / 86930056 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86928008 + 1024 [Thread-51]
block_rq_issue: 8,32 R 49152 () 86927912 + 96 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86929032 + 1024 [Thread-51]
block_bio_queue: 8,32 R 86930152 + 2112 [Thread-51]
block_split: 8,32 R 86930152 / 86931176 [Thread-51]
block_split: 8,32 R 86931176 / 86932200 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86930152 + 1024 [Thread-51]
block_rq_issue: 8,32 R 49152 () 86930056 + 96 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86931176 + 1024 [Thread-51]
block_bio_queue: 8,32 R 86932264 + 2096 [Thread-51]
block_split: 8,32 R 86932264 / 86933288 [Thread-51]
block_split: 8,32 R 86933288 / 86934312 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86932264 + 1024 [Thread-51]
block_rq_issue: 8,32 R 32768 () 86932200 + 64 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86933288 + 1024 [Thread-51]

This patch simply prevents the misaligned case in
bio_iov_iter_get_pages(). Other than making the upper-layer application
aware of the queue limit, I would appreciate any other directions or
suggestions you may have. Thank you!
  
Christoph Hellwig Nov. 3, 2023, 8:15 a.m. UTC | #2
You need to look into the splitting code to see why the splits are
suboptimal for your device.  We can't limit the upper bio size as
we have code relying on it not having arbitrary limits.
  
Ed Tsai (蔡宗軒) Nov. 3, 2023, 9:05 a.m. UTC | #3
On Fri, 2023-11-03 at 01:15 -0700, Christoph Hellwig wrote:
> You need to look into the splitting code to see why the splits are
> suboptimal for your device.  We can't limit the upper bio size as
> we have code relying on it not having arbitrary limits.
> 

The splitting code is working correctly based on the queue limit.
However, since kernel 5.1, full bios no longer have a fixed size and
often do not align with the queue limit. As a result, the block layer
splits off small fragments when consecutive full bios are submitted. I
don't think this is specific to a particular device.

While I understand your point about not limiting the bio size, in the
past the 1MB cap meant that a full bio never left such fragments behind
when it was split.
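
To see the arithmetic behind that, here is a small stand-alone
userspace program (the 512KB limit and bio sizes are assumed example
values, not read from any device) contrasting how a 1MB-capped bio and
a 1028KB multipage-bvec bio get chopped against the queue limit:

	#include <stdio.h>

	int main(void)
	{
		unsigned int max_io = 512u << 10;	/* assumed 512KB max_sectors limit */
		unsigned int sizes[] = { 1024u << 10, 1028u << 10 };

		for (int i = 0; i < 2; i++) {
			unsigned int left = sizes[i];

			printf("%4uKB bio ->", sizes[i] >> 10);
			while (left) {
				unsigned int chunk = left < max_io ? left : max_io;

				printf(" %uKB", chunk >> 10);
				left -= chunk;
			}
			/* prints: 1024KB -> 512 512; 1028KB -> 512 512 4 */
			printf("\n");
		}
		return 0;
	}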
  
Ming Lei Nov. 3, 2023, 4:20 p.m. UTC | #4
On Wed, Nov 01, 2023 at 02:23:26AM +0000, Ed Tsai (蔡宗軒) wrote:
> On Wed, 2023-10-25 at 17:22 +0800, ed.tsai@mediatek.com wrote:
> > From: Ed Tsai <ed.tsai@mediatek.com>
> > 
> > Referring to commit 07173c3ec276 ("block: enable multipage bvecs"),
> > each bio_vec now holds more than one page, potentially exceeding
> > 1MB in size and causing alignment issues with the queue limit.
> > 
> > In a sequential read/write scenario, the file system maximizes the
> > bio's capacity before submitting. However, misalignment with the
> > queue limit can result in the bio being split into smaller I/O
> > operations.
> > 
> > For instance, assuming the maximum I/O size is set to 512KB and the
> > memory is highly fragmented, resulting in each bio containing only
> > one 2-pages bio_vec (i.e., bi_size = 1028KB). This would cause the
> > bio to be split into two 512KB portions and one 4KB portion. As a
> > result, the originally expected continuous large I/O operations are
> > interspersed with many small I/O operations.
> > 
> > To address this issue, this patch adds a check for the max_sectors
> > before submitting the bio. This allows the upper layers to
> > proactively
> > detect and handle alignment issues.
> > 
> > I performed the Antutu V10 Storage Test on a UFS 4.0 device, which
> > resulted in a significant improvement in the Sequential test:
> > 
> > Sequential Read (average of 5 rounds):
> > Original: 3033.7 MB/sec
> > Patched: 3520.9 MB/sec
> > 
> > Sequential Write (average of 5 rounds):
> > Original: 2225.4 MB/sec
> > Patched: 2800.3 MB/sec
> > 
> > Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
> > ---
> >  block/bio.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 816d412c06e9..a4a1f775b9ea 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1227,6 +1227,7 @@ static int __bio_iov_iter_get_pages(struct bio
> > *bio, struct iov_iter *iter)
> >  	iov_iter_extraction_t extraction_flags = 0;
> >  	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> >  	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
> > +	struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)-
> > >limits;
> >  	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> >  	struct page **pages = (struct page **)bv;
> >  	ssize_t size, left;
> > @@ -1275,6 +1276,11 @@ static int __bio_iov_iter_get_pages(struct bio
> > *bio, struct iov_iter *iter)
> >  		struct page *page = pages[i];
> >  
> >  		len = min_t(size_t, PAGE_SIZE - offset, left);
> > +		if (bio->bi_iter.bi_size + len >
> > +		    lim->max_sectors << SECTOR_SHIFT) {
> > +			ret = left;
> > +			break;
> > +		}
> >  		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> >  			ret = bio_iov_add_zone_append_page(bio, page,
> > len,
> >  					offset);
> > -- 
> > 2.18.0
> > 
> 
> 
> Hi Jens,
> 
> Just to clarify any potential confusion, I would like to provide
> further details based on the assumed scenario mentioned above.
> 
> When the upper layer continuously sends 1028KB full-sized bios for
> sequential reads, the Block Layer sees the following sequence:
> 	submit bio: size = 1028KB, start LBA = n
> 	submit bio: size = 1028KB, start LBA = n + 1028KB 
> 	submit bio: size = 1028KB, start LBA = n + 2056KB
> 	...
> 
> However, due to the queue limit restricting the I/O size to a maximum
> of 512KB, the Block Layer splits into the following sequence:
> 	submit bio: size = 512KB, start LBA = n
> 	submit bio: size = 512KB, start LBA = n +  512KB
> 	submit bio: size =   4KB, start LBA = n + 1024KB	
> 	submit bio: size = 512KB, start LBA = n + 1028KB
> 	submit bio: size = 512KB, start LBA = n + 1540KB
> 	submit bio: size =   4KB, start LBA = n + 2052KB
> 	submit bio: size = 512KB, start LBA = n + 2056KB
> 	submit bio: size = 512KB, start LBA = n + 2568KB
> 	submit bio: size =   4KB, start LBA = n + 3080KB
> 	...
> 	
> The original expectation was for the storage to receive large,
> contiguous requests. However, due to non-alignment, many small I/O
> requests are generated. This problem is easily visible because the
> user pages passed in are often allocated by the buddy system as order 0
> pages during page faults, resulting in highly non-contiguous memory.

If order-0 pages are added to the bio, the multipage bvec is basically
a no-op (256 bvecs hold 256 pages), so how does it make a difference
for you?

> 
> As observed in the Antutu Sequential Read test below, it is similar to
> the description above where the splitting caused by the queue limit
> leaves small requests sandwiched in between:
> 
> block_bio_queue: 8,32 R 86925864 + 2144 [Thread-51]
> block_split: 8,32 R 86925864 / 86926888 [Thread-51]
> block_split: 8,32 R 86926888 / 86927912 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86925864 + 1024 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86926888 + 1024 [Thread-51]
> block_bio_queue: 8,32 R 86928008 + 2144 [Thread-51]
> block_split: 8,32 R 86928008 / 86929032 [Thread-51]
> block_split: 8,32 R 86929032 / 86930056 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86928008 + 1024 [Thread-51]
> block_rq_issue: 8,32 R 49152 () 86927912 + 96 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86929032 + 1024 [Thread-51]
> block_bio_queue: 8,32 R 86930152 + 2112 [Thread-51]
> block_split: 8,32 R 86930152 / 86931176 [Thread-51]
> block_split: 8,32 R 86931176 / 86932200 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86930152 + 1024 [Thread-51]
> block_rq_issue: 8,32 R 49152 () 86930056 + 96 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86931176 + 1024 [Thread-51]
> block_bio_queue: 8,32 R 86932264 + 2096 [Thread-51]
> block_split: 8,32 R 86932264 / 86933288 [Thread-51]
> block_split: 8,32 R 86933288 / 86934312 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86932264 + 1024 [Thread-51]
> block_rq_issue: 8,32 R 32768 () 86932200 + 64 [Thread-51]
> block_rq_issue: 8,32 R 524288 () 86933288 + 1024 [Thread-51]
> 
> I simply prevents non-aligned situations in bio_iov_iter_get_pages.

But there is still a 4KB I/O left over if you cap the bio size at
512KB, so how does that 4KB I/O finally get issued in the 1028KB case?

> Besides making the upper layer application aware of the queue limit, I
> would appreciate any other directions or suggestions you may have.

The problem is related to the I/O size issued by the application.

If you send misaligned I/O, you cannot avoid a final small I/O, whether
or not the block layer's bio split is involved. Your patch just lets
__bio_iov_iter_get_pages() do the splitting, and you still end up with
4KB left over when the application submits 1028KB, right?

Then I don't understand why your patch improves sequential I/O
performance.

Thanks,
Ming
  
Ed Tsai (蔡宗軒) Nov. 4, 2023, 1:11 a.m. UTC | #5
On Sat, 2023-11-04 at 00:20 +0800, Ming Lei wrote:
>  On Wed, Nov 01, 2023 at 02:23:26AM +0000, Ed Tsai (蔡宗軒) wrote:
> > On Wed, 2023-10-25 at 17:22 +0800, ed.tsai@mediatek.com wrote:
> > > From: Ed Tsai <ed.tsai@mediatek.com>
> > > 
> > > Referring to commit 07173c3ec276 ("block: enable multipage
> bvecs"),
> > > each bio_vec now holds more than one page, potentially exceeding
> > > 1MB in size and causing alignment issues with the queue limit.
> > > 
> > > In a sequential read/write scenario, the file system maximizes
> the
> > > bio's capacity before submitting. However, misalignment with the
> > > queue limit can result in the bio being split into smaller I/O
> > > operations.
> > > 
> > > For instance, assuming the maximum I/O size is set to 512KB and
> the
> > > memory is highly fragmented, resulting in each bio containing
> only
> > > one 2-pages bio_vec (i.e., bi_size = 1028KB). This would cause
> the
> > > bio to be split into two 512KB portions and one 4KB portion. As a
> > > result, the originally expected continuous large I/O operations
> are
> > > interspersed with many small I/O operations.
> > > 
> > > To address this issue, this patch adds a check for the
> max_sectors
> > > before submitting the bio. This allows the upper layers to
> > > proactively
> > > detect and handle alignment issues.
> > > 
> > > I performed the Antutu V10 Storage Test on a UFS 4.0 device,
> which
> > > resulted in a significant improvement in the Sequential test:
> > > 
> > > Sequential Read (average of 5 rounds):
> > > Original: 3033.7 MB/sec
> > > Patched: 3520.9 MB/sec
> > > 
> > > Sequential Write (average of 5 rounds):
> > > Original: 2225.4 MB/sec
> > > Patched: 2800.3 MB/sec
> > > 
> > > Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
> > > ---
> > >  block/bio.c | 6 ++++++
> > >  1 file changed, 6 insertions(+)
> > > 
> > > diff --git a/block/bio.c b/block/bio.c
> > > index 816d412c06e9..a4a1f775b9ea 100644
> > > --- a/block/bio.c
> > > +++ b/block/bio.c
> > > @@ -1227,6 +1227,7 @@ static int __bio_iov_iter_get_pages(struct
> bio
> > > *bio, struct iov_iter *iter)
> > >  iov_iter_extraction_t extraction_flags = 0;
> > >  unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> > >  unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
> > > +struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)-
> > > >limits;
> > >  struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> > >  struct page **pages = (struct page **)bv;
> > >  ssize_t size, left;
> > > @@ -1275,6 +1276,11 @@ static int __bio_iov_iter_get_pages(struct
> bio
> > > *bio, struct iov_iter *iter)
> > >  struct page *page = pages[i];
> > >  
> > >  len = min_t(size_t, PAGE_SIZE - offset, left);
> > > +if (bio->bi_iter.bi_size + len >
> > > +    lim->max_sectors << SECTOR_SHIFT) {
> > > +ret = left;
> > > +break;
> > > +}
> > >  if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> > >  ret = bio_iov_add_zone_append_page(bio, page,
> > > len,
> > >  offset);
> > > -- 
> > > 2.18.0
> > > 
> > 
> > 
> > Hi Jens,
> > 
> > Just to clarify any potential confusion, I would like to provide
> > further details based on the assumed scenario mentioned above.
> > 
> > When the upper layer continuously sends 1028KB full-sized bios for
> > sequential reads, the Block Layer sees the following sequence:
> > submit bio: size = 1028KB, start LBA = n
> > submit bio: size = 1028KB, start LBA = n + 1028KB 
> > submit bio: size = 1028KB, start LBA = n + 2056KB
> > ...
> > 
> > However, due to the queue limit restricting the I/O size to a
> maximum
> > of 512KB, the Block Layer splits into the following sequence:
> > submit bio: size = 512KB, start LBA = n
> > submit bio: size = 512KB, start LBA = n +  512KB
> > submit bio: size =   4KB, start LBA = n + 1024KB
> > submit bio: size = 512KB, start LBA = n + 1028KB
> > submit bio: size = 512KB, start LBA = n + 1540KB
> > submit bio: size =   4KB, start LBA = n + 2052KB
> > submit bio: size = 512KB, start LBA = n + 2056KB
> > submit bio: size = 512KB, start LBA = n + 2568KB
> > submit bio: size =   4KB, start LBA = n + 3080KB
> > ...
> > 
> > The original expectation was for the storage to receive large,
> > contiguous requests. However, due to non-alignment, many small I/O
> > requests are generated. This problem is easily visible because the
> > user pages passed in are often allocated by the buddy system as
> order 0
> > pages during page faults, resulting in highly non-contiguous
> memory.
> 
> If order 0 page is added to bio, the multipage bvec becomes nop
> basically(256bvec holds 256 pages), then how can it make a difference
> for you?



> 
> > 
> > As observed in the Antutu Sequential Read test below, it is similar
> to
> > the description above where the splitting caused by the queue limit
> > leaves small requests sandwiched in between:
> > 
> > block_bio_queue: 8,32 R 86925864 + 2144 [Thread-51]
> > block_split: 8,32 R 86925864 / 86926888 [Thread-51]
> > block_split: 8,32 R 86926888 / 86927912 [Thread-51]
> > block_rq_issue: 8,32 R 524288 () 86925864 + 1024 [Thread-51]
> > block_rq_issue: 8,32 R 524288 () 86926888 + 1024 [Thread-51]
> > block_bio_queue: 8,32 R 86928008 + 2144 [Thread-51]
> > block_split: 8,32 R 86928008 / 86929032 [Thread-51]
> > block_split: 8,32 R 86929032 / 86930056 [Thread-51]
> > block_rq_issue: 8,32 R 524288 () 86928008 + 1024 [Thread-51]
> > block_rq_issue: 8,32 R 49152 () 86927912 + 96 [Thread-51]
> > block_rq_issue: 8,32 R 524288 () 86929032 + 1024 [Thread-51]
> > block_bio_queue: 8,32 R 86930152 + 2112 [Thread-51]
> > block_split: 8,32 R 86930152 / 86931176 [Thread-51]
> > block_split: 8,32 R 86931176 / 86932200 [Thread-51]
> > block_rq_issue: 8,32 R 524288 () 86930152 + 1024 [Thread-51]
> > block_rq_issue: 8,32 R 49152 () 86930056 + 96 [Thread-51]
> > block_rq_issue: 8,32 R 524288 () 86931176 + 1024 [Thread-51]
> > block_bio_queue: 8,32 R 86932264 + 2096 [Thread-51]
> > block_split: 8,32 R 86932264 / 86933288 [Thread-51]
> > block_split: 8,32 R 86933288 / 86934312 [Thread-51]
> > block_rq_issue: 8,32 R 524288 () 86932264 + 1024 [Thread-51]
> > block_rq_issue: 8,32 R 32768 () 86932200 + 64 [Thread-51]
> > block_rq_issue: 8,32 R 524288 () 86933288 + 1024 [Thread-51]
> > 
> > I simply prevents non-aligned situations in bio_iov_iter_get_pages.
> 
> But there is still 4KB IO left if you limit max bio size is 512KB,
> then how does this 4KB IO finally go in case of 1028KB IO?
> 
> > Besides making the upper layer application aware of the queue
> limit, I
> > would appreciate any other directions or suggestions you may have.
> 
> The problem is related with IO size from application.
> 
> If you send unaligned IO, you can't avoid the last IO with small
> size, no
> matter if block layer bio split is involved or not. Your patch just
> lets
> __bio_iov_iter_get_pages split the bio, and you still have 4KB left
> finally when application submits 1028KB, right?
> 
> Then I don't understand why your patch improves sequential IO
> performance.
> 
> Thanks,
> Ming
> 

The application performs I/O with a sufficiently large I/O size,
causing it to constantly fill up and submit full bios. However, in the
iomap direct I/O path, pages are added to the bio one by one from the
user buffer. This typically triggers page faults, resulting in the
allocation of order-0 pages from the buddy system.

The number of free pages in each buddy order varies over time. If a
particular order runs short, pages are split down from higher orders,
and when pages come from a higher order the user buffer may contain
some short physically contiguous runs.

In summary, the physical layout of the user buffer is unpredictable,
and when it contains such contiguous runs, the size of a full bio ends
up randomly misaligned as it is filled.

This patch limits how far a bio is filled to max_sectors. Submission
is asynchronous, so once a bio is queued, control returns immediately
and the next bio is filled and submitted (roughly as in the sketch
below).
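
As a rough sketch of that loop (simplified pseudo-C, not the actual
iomap code; new_dio_bio() is a placeholder name), the direct I/O path
keeps filling and submitting bios back to back:

	while (iov_iter_count(iter)) {
		struct bio *bio = new_dio_bio(dio);	/* placeholder helper */

		/* with the patch, filling stops at a max_sectors boundary */
		if (bio_iov_iter_get_pages(bio, iter))
			break;

		submit_bio(bio);	/* asynchronous: returns before completion */
	}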

Best,
Ed
  
Yu Kuai Nov. 4, 2023, 2:12 a.m. UTC | #6
Hi,

On 2023/11/04 9:11, Ed Tsai (蔡宗軒) wrote:
> On Sat, 2023-11-04 at 00:20 +0800, Ming Lei wrote:
>>   On Wed, Nov 01, 2023 at 02:23:26AM +0000, Ed Tsai (蔡宗軒) wrote:
>>> On Wed, 2023-10-25 at 17:22 +0800, ed.tsai@mediatek.com wrote:
>>>> From: Ed Tsai <ed.tsai@mediatek.com>
>>>>
>>>> Referring to commit 07173c3ec276 ("block: enable multipage
>> bvecs"),
>>>> each bio_vec now holds more than one page, potentially exceeding
>>>> 1MB in size and causing alignment issues with the queue limit.
>>>>
>>>> In a sequential read/write scenario, the file system maximizes
>> the
>>>> bio's capacity before submitting. However, misalignment with the
>>>> queue limit can result in the bio being split into smaller I/O
>>>> operations.
>>>>
>>>> For instance, assuming the maximum I/O size is set to 512KB and
>> the
>>>> memory is highly fragmented, resulting in each bio containing
>> only
>>>> one 2-pages bio_vec (i.e., bi_size = 1028KB). This would cause
>> the
>>>> bio to be split into two 512KB portions and one 4KB portion. As a
>>>> result, the originally expected continuous large I/O operations
>> are
>>>> interspersed with many small I/O operations.
>>>>
>>>> To address this issue, this patch adds a check for the
>> max_sectors
>>>> before submitting the bio. This allows the upper layers to
>>>> proactively
>>>> detect and handle alignment issues.
>>>>
>>>> I performed the Antutu V10 Storage Test on a UFS 4.0 device,
>> which
>>>> resulted in a significant improvement in the Sequential test:
>>>>
>>>> Sequential Read (average of 5 rounds):
>>>> Original: 3033.7 MB/sec
>>>> Patched: 3520.9 MB/sec
>>>>
>>>> Sequential Write (average of 5 rounds):
>>>> Original: 2225.4 MB/sec
>>>> Patched: 2800.3 MB/sec
>>>>
>>>> Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
>>>> ---
>>>>   block/bio.c | 6 ++++++
>>>>   1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/block/bio.c b/block/bio.c
>>>> index 816d412c06e9..a4a1f775b9ea 100644
>>>> --- a/block/bio.c
>>>> +++ b/block/bio.c
>>>> @@ -1227,6 +1227,7 @@ static int __bio_iov_iter_get_pages(struct
>> bio
>>>> *bio, struct iov_iter *iter)
>>>>   iov_iter_extraction_t extraction_flags = 0;
>>>>   unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
>>>>   unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
>>>> +struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)-
>>>>> limits;
>>>>   struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
>>>>   struct page **pages = (struct page **)bv;
>>>>   ssize_t size, left;
>>>> @@ -1275,6 +1276,11 @@ static int __bio_iov_iter_get_pages(struct
>> bio
>>>> *bio, struct iov_iter *iter)
>>>>   struct page *page = pages[i];
>>>>   
>>>>   len = min_t(size_t, PAGE_SIZE - offset, left);
>>>> +if (bio->bi_iter.bi_size + len >
>>>> +    lim->max_sectors << SECTOR_SHIFT) {
>>>> +ret = left;
>>>> +break;
>>>> +}
>>>>   if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
>>>>   ret = bio_iov_add_zone_append_page(bio, page,
>>>> len,
>>>>   offset);
>>>> -- 
>>>> 2.18.0
>>>>
>>>
>>>
>>> Hi Jens,
>>>
>>> Just to clarify any potential confusion, I would like to provide
>>> further details based on the assumed scenario mentioned above.
>>>
>>> When the upper layer continuously sends 1028KB full-sized bios for
>>> sequential reads, the Block Layer sees the following sequence:
>>> submit bio: size = 1028KB, start LBA = n
>>> submit bio: size = 1028KB, start LBA = n + 1028KB
>>> submit bio: size = 1028KB, start LBA = n + 2056KB
>>> ...
>>>
>>> However, due to the queue limit restricting the I/O size to a
>> maximum
>>> of 512KB, the Block Layer splits into the following sequence:
>>> submit bio: size = 512KB, start LBA = n
>>> submit bio: size = 512KB, start LBA = n +  512KB
>>> submit bio: size =   4KB, start LBA = n + 1024KB
>>> submit bio: size = 512KB, start LBA = n + 1028KB
>>> submit bio: size = 512KB, start LBA = n + 1540KB
>>> submit bio: size =   4KB, start LBA = n + 2052KB
>>> submit bio: size = 512KB, start LBA = n + 2056KB
>>> submit bio: size = 512KB, start LBA = n + 2568KB
>>> submit bio: size =   4KB, start LBA = n + 3080KB
>>> ...
>>>
>>> The original expectation was for the storage to receive large,
>>> contiguous requests. However, due to non-alignment, many small I/O
>>> requests are generated. This problem is easily visible because the
>>> user pages passed in are often allocated by the buddy system as
>> order 0
>>> pages during page faults, resulting in highly non-contiguous
>> memory.
>>
>> If order 0 page is added to bio, the multipage bvec becomes nop
>> basically(256bvec holds 256 pages), then how can it make a difference
>> for you?
> 
> 
> 
>>
>>>
>>> As observed in the Antutu Sequential Read test below, it is similar
>> to
>>> the description above where the splitting caused by the queue limit
>>> leaves small requests sandwiched in between:
>>>
>>> block_bio_queue: 8,32 R 86925864 + 2144 [Thread-51]
>>> block_split: 8,32 R 86925864 / 86926888 [Thread-51]
>>> block_split: 8,32 R 86926888 / 86927912 [Thread-51]
>>> block_rq_issue: 8,32 R 524288 () 86925864 + 1024 [Thread-51]
>>> block_rq_issue: 8,32 R 524288 () 86926888 + 1024 [Thread-51]
>>> block_bio_queue: 8,32 R 86928008 + 2144 [Thread-51]
>>> block_split: 8,32 R 86928008 / 86929032 [Thread-51]
>>> block_split: 8,32 R 86929032 / 86930056 [Thread-51]
>>> block_rq_issue: 8,32 R 524288 () 86928008 + 1024 [Thread-51]
>>> block_rq_issue: 8,32 R 49152 () 86927912 + 96 [Thread-51]
>>> block_rq_issue: 8,32 R 524288 () 86929032 + 1024 [Thread-51]
>>> block_bio_queue: 8,32 R 86930152 + 2112 [Thread-51]
>>> block_split: 8,32 R 86930152 / 86931176 [Thread-51]
>>> block_split: 8,32 R 86931176 / 86932200 [Thread-51]
>>> block_rq_issue: 8,32 R 524288 () 86930152 + 1024 [Thread-51]
>>> block_rq_issue: 8,32 R 49152 () 86930056 + 96 [Thread-51]
>>> block_rq_issue: 8,32 R 524288 () 86931176 + 1024 [Thread-51]
>>> block_bio_queue: 8,32 R 86932264 + 2096 [Thread-51]
>>> block_split: 8,32 R 86932264 / 86933288 [Thread-51]
>>> block_split: 8,32 R 86933288 / 86934312 [Thread-51]
>>> block_rq_issue: 8,32 R 524288 () 86932264 + 1024 [Thread-51]
>>> block_rq_issue: 8,32 R 32768 () 86932200 + 64 [Thread-51]
>>> block_rq_issue: 8,32 R 524288 () 86933288 + 1024 [Thread-51]
>>>
>>> I simply prevents non-aligned situations in bio_iov_iter_get_pages.
>>
>> But there is still 4KB IO left if you limit max bio size is 512KB,
>> then how does this 4KB IO finally go in case of 1028KB IO?
>>
>>> Besides making the upper layer application aware of the queue
>> limit, I
>>> would appreciate any other directions or suggestions you may have.
>>
>> The problem is related with IO size from application.
>>
>> If you send unaligned IO, you can't avoid the last IO with small
>> size, no
>> matter if block layer bio split is involved or not. Your patch just
>> lets
>> __bio_iov_iter_get_pages split the bio, and you still have 4KB left
>> finally when application submits 1028KB, right?
>>
>> Then I don't understand why your patch improves sequential IO
>> performance.
>>
>> Thanks,
>> Ming
>>
> 
> The application performs I/O with a sufficitenly large I/O size,
> causing it to constantly fill up and submit full bios. However, in the
> iomap direct I/O scenario, pages are added to the bio one by one from
> the user buffer. This typically triggers page faults, resulting in the
> allocation of order 0 pages from the buddy system.
> 
> The remaining amount of each order in the buddy system varies over
> time. If there are not enough pages available in a particular order,
> pages are split from higher orders. When pages are obtained from the
> higher order, the user buffer may contain some small consecutive
> patterns.
> 
> In summary, the physical layout of the user buffer is unpredictable,
> and when it contains some small consecutive patterns, the size of the
> bio becomes randomly unaligned during filling.
> 
> This patch limits the bio to be filled up to the max_sectors. The
> submission is an async operation, so once the bio is queued, it will
> immediately return and continue filled and submit the next bio.
> 

Same as Ming, I still don't quite understand why your patch improves
sequential I/O performance. Are you trying to say that the bio gets
filled to 1028K because memory is highly fragmented, and that the user
issues more than 1028K to the kernel at a time?

Thanks,
Kuai

> Best,
> Ed
>
  
Ming Lei Nov. 4, 2023, 3:43 a.m. UTC | #7
On Sat, Nov 04, 2023 at 01:11:02AM +0000, Ed Tsai (蔡宗軒) wrote:
> On Sat, 2023-11-04 at 00:20 +0800, Ming Lei wrote:
> >  On Wed, Nov 01, 2023 at 02:23:26AM +0000, Ed Tsai (蔡宗軒) wrote:
> > > On Wed, 2023-10-25 at 17:22 +0800, ed.tsai@mediatek.com wrote:
> > > > From: Ed Tsai <ed.tsai@mediatek.com>
> > > > 
> > > > Referring to commit 07173c3ec276 ("block: enable multipage
> > bvecs"),
> > > > each bio_vec now holds more than one page, potentially exceeding
> > > > 1MB in size and causing alignment issues with the queue limit.
> > > > 
> > > > In a sequential read/write scenario, the file system maximizes
> > the
> > > > bio's capacity before submitting. However, misalignment with the
> > > > queue limit can result in the bio being split into smaller I/O
> > > > operations.
> > > > 
> > > > For instance, assuming the maximum I/O size is set to 512KB and
> > the
> > > > memory is highly fragmented, resulting in each bio containing
> > only
> > > > one 2-pages bio_vec (i.e., bi_size = 1028KB). This would cause
> > the
> > > > bio to be split into two 512KB portions and one 4KB portion. As a
> > > > result, the originally expected continuous large I/O operations
> > are
> > > > interspersed with many small I/O operations.
> > > > 
> > > > To address this issue, this patch adds a check for the
> > max_sectors
> > > > before submitting the bio. This allows the upper layers to
> > > > proactively
> > > > detect and handle alignment issues.
> > > > 
> > > > I performed the Antutu V10 Storage Test on a UFS 4.0 device,
> > which
> > > > resulted in a significant improvement in the Sequential test:
> > > > 
> > > > Sequential Read (average of 5 rounds):
> > > > Original: 3033.7 MB/sec
> > > > Patched: 3520.9 MB/sec
> > > > 
> > > > Sequential Write (average of 5 rounds):
> > > > Original: 2225.4 MB/sec
> > > > Patched: 2800.3 MB/sec
> > > > 
> > > > Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
> > > > ---
> > > >  block/bio.c | 6 ++++++
> > > >  1 file changed, 6 insertions(+)
> > > > 
> > > > diff --git a/block/bio.c b/block/bio.c
> > > > index 816d412c06e9..a4a1f775b9ea 100644
> > > > --- a/block/bio.c
> > > > +++ b/block/bio.c
> > > > @@ -1227,6 +1227,7 @@ static int __bio_iov_iter_get_pages(struct
> > bio
> > > > *bio, struct iov_iter *iter)
> > > >  iov_iter_extraction_t extraction_flags = 0;
> > > >  unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> > > >  unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
> > > > +struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)-
> > > > >limits;
> > > >  struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> > > >  struct page **pages = (struct page **)bv;
> > > >  ssize_t size, left;
> > > > @@ -1275,6 +1276,11 @@ static int __bio_iov_iter_get_pages(struct
> > bio
> > > > *bio, struct iov_iter *iter)
> > > >  struct page *page = pages[i];
> > > >  
> > > >  len = min_t(size_t, PAGE_SIZE - offset, left);
> > > > +if (bio->bi_iter.bi_size + len >
> > > > +    lim->max_sectors << SECTOR_SHIFT) {
> > > > +ret = left;
> > > > +break;
> > > > +}
> > > >  if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> > > >  ret = bio_iov_add_zone_append_page(bio, page,
> > > > len,
> > > >  offset);
> > > > -- 
> > > > 2.18.0
> > > > 
> > > 
> > > 
> > > Hi Jens,
> > > 
> > > Just to clarify any potential confusion, I would like to provide
> > > further details based on the assumed scenario mentioned above.
> > > 
> > > When the upper layer continuously sends 1028KB full-sized bios for
> > > sequential reads, the Block Layer sees the following sequence:
> > > submit bio: size = 1028KB, start LBA = n
> > > submit bio: size = 1028KB, start LBA = n + 1028KB 
> > > submit bio: size = 1028KB, start LBA = n + 2056KB
> > > ...
> > > 
> > > However, due to the queue limit restricting the I/O size to a
> > maximum
> > > of 512KB, the Block Layer splits into the following sequence:
> > > submit bio: size = 512KB, start LBA = n
> > > submit bio: size = 512KB, start LBA = n +  512KB
> > > submit bio: size =   4KB, start LBA = n + 1024KB
> > > submit bio: size = 512KB, start LBA = n + 1028KB
> > > submit bio: size = 512KB, start LBA = n + 1540KB
> > > submit bio: size =   4KB, start LBA = n + 2052KB
> > > submit bio: size = 512KB, start LBA = n + 2056KB
> > > submit bio: size = 512KB, start LBA = n + 2568KB
> > > submit bio: size =   4KB, start LBA = n + 3080KB
> > > ...
> > > 
> > > The original expectation was for the storage to receive large,
> > > contiguous requests. However, due to non-alignment, many small I/O
> > > requests are generated. This problem is easily visible because the
> > > user pages passed in are often allocated by the buddy system as
> > order 0
> > > pages during page faults, resulting in highly non-contiguous
> > memory.
> > 
> > If order 0 page is added to bio, the multipage bvec becomes nop
> > basically(256bvec holds 256 pages), then how can it make a difference
> > for you?
> 
> 
> 
> > 
> > > 
> > > As observed in the Antutu Sequential Read test below, it is similar
> > to
> > > the description above where the splitting caused by the queue limit
> > > leaves small requests sandwiched in between:
> > > 
> > > block_bio_queue: 8,32 R 86925864 + 2144 [Thread-51]
> > > block_split: 8,32 R 86925864 / 86926888 [Thread-51]
> > > block_split: 8,32 R 86926888 / 86927912 [Thread-51]
> > > block_rq_issue: 8,32 R 524288 () 86925864 + 1024 [Thread-51]
> > > block_rq_issue: 8,32 R 524288 () 86926888 + 1024 [Thread-51]
> > > block_bio_queue: 8,32 R 86928008 + 2144 [Thread-51]
> > > block_split: 8,32 R 86928008 / 86929032 [Thread-51]
> > > block_split: 8,32 R 86929032 / 86930056 [Thread-51]
> > > block_rq_issue: 8,32 R 524288 () 86928008 + 1024 [Thread-51]
> > > block_rq_issue: 8,32 R 49152 () 86927912 + 96 [Thread-51]
> > > block_rq_issue: 8,32 R 524288 () 86929032 + 1024 [Thread-51]
> > > block_bio_queue: 8,32 R 86930152 + 2112 [Thread-51]
> > > block_split: 8,32 R 86930152 / 86931176 [Thread-51]
> > > block_split: 8,32 R 86931176 / 86932200 [Thread-51]
> > > block_rq_issue: 8,32 R 524288 () 86930152 + 1024 [Thread-51]
> > > block_rq_issue: 8,32 R 49152 () 86930056 + 96 [Thread-51]
> > > block_rq_issue: 8,32 R 524288 () 86931176 + 1024 [Thread-51]
> > > block_bio_queue: 8,32 R 86932264 + 2096 [Thread-51]
> > > block_split: 8,32 R 86932264 / 86933288 [Thread-51]
> > > block_split: 8,32 R 86933288 / 86934312 [Thread-51]
> > > block_rq_issue: 8,32 R 524288 () 86932264 + 1024 [Thread-51]
> > > block_rq_issue: 8,32 R 32768 () 86932200 + 64 [Thread-51]
> > > block_rq_issue: 8,32 R 524288 () 86933288 + 1024 [Thread-51]
> > > 
> > > I simply prevents non-aligned situations in bio_iov_iter_get_pages.
> > 
> > But there is still 4KB IO left if you limit max bio size is 512KB,
> > then how does this 4KB IO finally go in case of 1028KB IO?
> > 
> > > Besides making the upper layer application aware of the queue
> > limit, I
> > > would appreciate any other directions or suggestions you may have.
> > 
> > The problem is related with IO size from application.
> > 
> > If you send unaligned IO, you can't avoid the last IO with small
> > size, no
> > matter if block layer bio split is involved or not. Your patch just
> > lets
> > __bio_iov_iter_get_pages split the bio, and you still have 4KB left
> > finally when application submits 1028KB, right?
> > 
> > Then I don't understand why your patch improves sequential IO
> > performance.
> > 
> > Thanks,
> > Ming
> > 
> 
> The application performs I/O with a sufficitenly large I/O size,
> causing it to constantly fill up and submit full bios. However, in the
> iomap direct I/O scenario, pages are added to the bio one by one from
> the user buffer. This typically triggers page faults, resulting in the
> allocation of order 0 pages from the buddy system.
> 
> The remaining amount of each order in the buddy system varies over
> time. If there are not enough pages available in a particular order,
> pages are split from higher orders. When pages are obtained from the
> higher order, the user buffer may contain some small consecutive
> patterns.
> 
> In summary, the physical layout of the user buffer is unpredictable,
> and when it contains some small consecutive patterns, the size of the
> bio becomes randomly unaligned during filling.

Yes, the multipage bvec depends on the physical layout of the user
buffer, but it doesn't affect the bio size, which is decided by
userspace, and the bvec page layout doesn't affect the I/O pattern.

If userspace submits a 1028K I/O, it is split into 512K, 512K and 4K,
no matter how many pages each bvec holds.

If userspace submits a very large I/O, such as 512M, it will be split
into 1K bios of 512K each.

Your patch doesn't actually make any difference from the block layer's
viewpoint, for example:

1) dd if=/dev/ublkb0 of=/dev/null bs=1028k count=1 iflag=direct

2) observe IO pattern by the following bpftrace:

kprobe:blk_mq_start_request
{
	$rq = (struct request *)arg0;

	printf("%lu %16s %d %d: %s %s bio %lx-%lu\n",
		nsecs / 1000,  comm, pid, cpu, ksym(reg("ip")),
		$rq->q->disk->disk_name, $rq->__sector, $rq->__data_len);
}

3) no matter if your patch is applied or not, the following pattern
is always observed:

117828811               dd 1475 0: blk_mq_start_request ublkb0 bio 0-524288
117828823               dd 1475 0: blk_mq_start_request ublkb0 bio 400-524288
117828825               dd 1475 0: blk_mq_start_request ublkb0 bio 800-4096

Similar pattern can be observed when running dd inside FS(xfs).

> 
> This patch limits the bio to be filled up to the max_sectors. The
> submission is an async operation, so once the bio is queued, it will
> immediately return and continue filled and submit the next bio.

The bio split doesn't change this behavior either; the difference is
just how many bios the upper layer (iomap) observes.

Your patch just moves the split into the upper layer: with your patch
iomap sees 3 bios for `dd bs=1028K`, while in the vanilla tree iomap
submits a single 1028K bio and the block layer splits it into 512K,
512K and 4K. So the UFS driver ultimately observes the same I/O
pattern.

In short, please root-cause why your patch improves performance, or
share your workload, so we can observe the I/O pattern and the related
mm/fs behavior and try to root-cause it.


Thanks,
Ming
  
Ed Tsai (蔡宗軒) Nov. 6, 2023, 1:33 a.m. UTC | #8
On Sat, 2023-11-04 at 11:43 +0800, Ming Lei wrote:
>  On Sat, Nov 04, 2023 at 01:11:02AM +0000, Ed Tsai (蔡宗軒) wrote:
> > On Sat, 2023-11-04 at 00:20 +0800, Ming Lei wrote:
> > >  On Wed, Nov 01, 2023 at 02:23:26AM +0000, Ed Tsai (蔡宗軒) wrote:
> > > > On Wed, 2023-10-25 at 17:22 +0800, ed.tsai@mediatek.com wrote:
> > > > > From: Ed Tsai <ed.tsai@mediatek.com>
> > > > > 
> > > > > Referring to commit 07173c3ec276 ("block: enable multipage
> > > bvecs"),
> > > > > each bio_vec now holds more than one page, potentially
> exceeding
> > > > > 1MB in size and causing alignment issues with the queue
> limit.
> > > > > 
> > > > > In a sequential read/write scenario, the file system
> maximizes
> > > the
> > > > > bio's capacity before submitting. However, misalignment with
> the
> > > > > queue limit can result in the bio being split into smaller
> I/O
> > > > > operations.
> > > > > 
> > > > > For instance, assuming the maximum I/O size is set to 512KB
> and
> > > the
> > > > > memory is highly fragmented, resulting in each bio containing
> > > only
> > > > > one 2-pages bio_vec (i.e., bi_size = 1028KB). This would
> cause
> > > the
> > > > > bio to be split into two 512KB portions and one 4KB portion.
> As a
> > > > > result, the originally expected continuous large I/O
> operations
> > > are
> > > > > interspersed with many small I/O operations.
> > > > > 
> > > > > To address this issue, this patch adds a check for the
> > > max_sectors
> > > > > before submitting the bio. This allows the upper layers to
> > > > > proactively
> > > > > detect and handle alignment issues.
> > > > > 
> > > > > I performed the Antutu V10 Storage Test on a UFS 4.0 device,
> > > which
> > > > > resulted in a significant improvement in the Sequential test:
> > > > > 
> > > > > Sequential Read (average of 5 rounds):
> > > > > Original: 3033.7 MB/sec
> > > > > Patched: 3520.9 MB/sec
> > > > > 
> > > > > Sequential Write (average of 5 rounds):
> > > > > Original: 2225.4 MB/sec
> > > > > Patched: 2800.3 MB/sec
> > > > > 
> > > > > Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
> > > > > ---
> > > > >  block/bio.c | 6 ++++++
> > > > >  1 file changed, 6 insertions(+)
> > > > > 
> > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > index 816d412c06e9..a4a1f775b9ea 100644
> > > > > --- a/block/bio.c
> > > > > +++ b/block/bio.c
> > > > > @@ -1227,6 +1227,7 @@ static int
> __bio_iov_iter_get_pages(struct
> > > bio
> > > > > *bio, struct iov_iter *iter)
> > > > >  iov_iter_extraction_t extraction_flags = 0;
> > > > >  unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> > > > >  unsigned short entries_left = bio->bi_max_vecs - bio-
> >bi_vcnt;
> > > > > +struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)-
> > > > > >limits;
> > > > >  struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> > > > >  struct page **pages = (struct page **)bv;
> > > > >  ssize_t size, left;
> > > > > @@ -1275,6 +1276,11 @@ static int
> __bio_iov_iter_get_pages(struct
> > > bio
> > > > > *bio, struct iov_iter *iter)
> > > > >  struct page *page = pages[i];
> > > > >  
> > > > >  len = min_t(size_t, PAGE_SIZE - offset, left);
> > > > > +if (bio->bi_iter.bi_size + len >
> > > > > +    lim->max_sectors << SECTOR_SHIFT) {
> > > > > +ret = left;
> > > > > +break;
> > > > > +}
> > > > >  if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> > > > >  ret = bio_iov_add_zone_append_page(bio, page,
> > > > > len,
> > > > >  offset);
> > > > > -- 
> > > > > 2.18.0
> > > > > 
> > > > 
> > > > 
> > > > Hi Jens,
> > > > 
> > > > Just to clarify any potential confusion, I would like to
> provide
> > > > further details based on the assumed scenario mentioned above.
> > > > 
> > > > When the upper layer continuously sends 1028KB full-sized bios
> for
> > > > sequential reads, the Block Layer sees the following sequence:
> > > > submit bio: size = 1028KB, start LBA = n
> > > > submit bio: size = 1028KB, start LBA = n + 1028KB 
> > > > submit bio: size = 1028KB, start LBA = n + 2056KB
> > > > ...
> > > > 
> > > > However, due to the queue limit restricting the I/O size to a
> > > maximum
> > > > of 512KB, the Block Layer splits into the following sequence:
> > > > submit bio: size = 512KB, start LBA = n
> > > > submit bio: size = 512KB, start LBA = n +  512KB
> > > > submit bio: size =   4KB, start LBA = n + 1024KB
> > > > submit bio: size = 512KB, start LBA = n + 1028KB
> > > > submit bio: size = 512KB, start LBA = n + 1540KB
> > > > submit bio: size =   4KB, start LBA = n + 2052KB
> > > > submit bio: size = 512KB, start LBA = n + 2056KB
> > > > submit bio: size = 512KB, start LBA = n + 2568KB
> > > > submit bio: size =   4KB, start LBA = n + 3080KB
> > > > ...
> > > > 
> > > > The original expectation was for the storage to receive large,
> > > > contiguous requests. However, due to non-alignment, many small
> I/O
> > > > requests are generated. This problem is easily visible because
> the
> > > > user pages passed in are often allocated by the buddy system as
> > > order 0
> > > > pages during page faults, resulting in highly non-contiguous
> > > memory.
> > > 
> > > If order 0 page is added to bio, the multipage bvec becomes nop
> > > basically(256bvec holds 256 pages), then how can it make a
> difference
> > > for you?
> > 
> > 
> > 
> > > 
> > > > 
> > > > As observed in the Antutu Sequential Read test below, it is
> similar
> > > to
> > > > the description above where the splitting caused by the queue
> limit
> > > > leaves small requests sandwiched in between:
> > > > 
> > > > block_bio_queue: 8,32 R 86925864 + 2144 [Thread-51]
> > > > block_split: 8,32 R 86925864 / 86926888 [Thread-51]
> > > > block_split: 8,32 R 86926888 / 86927912 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86925864 + 1024 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86926888 + 1024 [Thread-51]
> > > > block_bio_queue: 8,32 R 86928008 + 2144 [Thread-51]
> > > > block_split: 8,32 R 86928008 / 86929032 [Thread-51]
> > > > block_split: 8,32 R 86929032 / 86930056 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86928008 + 1024 [Thread-51]
> > > > block_rq_issue: 8,32 R 49152 () 86927912 + 96 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86929032 + 1024 [Thread-51]
> > > > block_bio_queue: 8,32 R 86930152 + 2112 [Thread-51]
> > > > block_split: 8,32 R 86930152 / 86931176 [Thread-51]
> > > > block_split: 8,32 R 86931176 / 86932200 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86930152 + 1024 [Thread-51]
> > > > block_rq_issue: 8,32 R 49152 () 86930056 + 96 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86931176 + 1024 [Thread-51]
> > > > block_bio_queue: 8,32 R 86932264 + 2096 [Thread-51]
> > > > block_split: 8,32 R 86932264 / 86933288 [Thread-51]
> > > > block_split: 8,32 R 86933288 / 86934312 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86932264 + 1024 [Thread-51]
> > > > block_rq_issue: 8,32 R 32768 () 86932200 + 64 [Thread-51]
> > > > block_rq_issue: 8,32 R 524288 () 86933288 + 1024 [Thread-51]
> > > > 
> > > > I simply prevents non-aligned situations in
> bio_iov_iter_get_pages.
> > > 
> > > But there is still 4KB IO left if you limit max bio size is
> 512KB,
> > > then how does this 4KB IO finally go in case of 1028KB IO?
> > > 
> > > > Besides making the upper layer application aware of the queue
> > > limit, I
> > > > would appreciate any other directions or suggestions you may
> have.
> > > 
> > > The problem is related with IO size from application.
> > > 
> > > If you send unaligned IO, you can't avoid the last IO with small
> > > size, no
> > > matter if block layer bio split is involved or not. Your patch
> just
> > > lets
> > > __bio_iov_iter_get_pages split the bio, and you still have 4KB
> left
> > > finally when application submits 1028KB, right?
> > > 
> > > Then I don't understand why your patch improves sequential IO
> > > performance.
> > > 
> > > Thanks,
> > > Ming
> > > 
> > 
> > The application performs I/O with a sufficitenly large I/O size,
> > causing it to constantly fill up and submit full bios. However, in
> the
> > iomap direct I/O scenario, pages are added to the bio one by one
> from
> > the user buffer. This typically triggers page faults, resulting in
> the
> > allocation of order 0 pages from the buddy system.
> > 
> > The remaining amount of each order in the buddy system varies over
> > time. If there are not enough pages available in a particular
> order,
> > pages are split from higher orders. When pages are obtained from
> the
> > higher order, the user buffer may contain some small consecutive
> > patterns.
> > 
> > In summary, the physical layout of the user buffer is
> unpredictable,
> > and when it contains some small consecutive patterns, the size of
> the
> > bio becomes randomly unaligned during filling.
> 
> Yes, multipage bvec depends on physical layout of user buffer, but it
> doesn't affect bio size, which is decided by userspace, and the bvec
> page layout doesn't affect IO pattern.
> 
This results in full bios of varying sizes as they are filled, because
the size of each bio_vec depends on the physical layout of the user
buffer.

> If userspace submits 1028K IO, the IO is split into 512K, 512K and
> 4K,
> no matter if each bvec includes how many pages.
> 
Sorry for the confusion caused by my description of 1028KB. It was
meant to illustrate the scenario of submitting an unaligned bio. In my
opinion, the worst-case size for a full bio due to physical layout is
1028KB, since it leaves an I/O of only 4KB after splitting.


> If userspace submits very large IO, such as 512M, which will be split
> into 1K bios with 512K size.
> 
No. The block layer never receives a single 512MB bio to split into
512K pieces. It sees full bios of various sizes, because the size of
each bio_vec depends on the physical layout.


> You patch doesn't makes any difference actually from block layer
> viewpoint, such as:
> 
> 1) dd if=/dev/ublkb0 of=/dev/null bs=1028k count=1 iflag=direct
> 
> 2) observe IO pattern by the following bpftrace:
> 
> kprobe:blk_mq_start_request
> {
> $rq = (struct request *)arg0;
> 
> printf("%lu %16s %d %d: %s %s bio %lx-%lu\n",
> nsecs / 1000,  comm, pid, cpu, ksym(reg("ip")),
> $rq->q->disk->disk_name, $rq->__sector, $rq->__data_len);
> }
> 
> 3) no matter if your patch is applied or not, the following pattern
> is always observed:
> 
> 117828811               dd 1475 0: blk_mq_start_request ublkb0 bio 0-
> 524288
> 117828823               dd 1475 0: blk_mq_start_request ublkb0 bio
> 400-524288
> 117828825               dd 1475 0: blk_mq_start_request ublkb0 bio
> 800-4096
> 
> Similar pattern can be observed when running dd inside FS(xfs).
> 
> > 
> > This patch limits the bio to be filled up to the max_sectors. The
> > submission is an async operation, so once the bio is queued, it
> will
> > immediately return and continue filled and submit the next bio.
> 
> bio split doesn't change this behavior too, the difference is just
> how
> many bios the upper layer(iomap) observed.
> 
> Your patch just moves the split into upper layer, and iomap can see
> 3 bios with your patch when `dd bs=1028K`, and in vanilla tree, iomap
> just submits single bio with 1028K, block layer splits it into
> 512k, 512k, and 4k. So finally UFS driver observes same IO pattern.
> 
> In short, please root cause why your patch improves performance, or
> please share your workloads, so we can observe the IO pattern and
> related mm/fs behavior and try to root cause it.
> 
> 
> Thanks,
> Ming
> 

Antutu Sequential Read performs 72 reads of 64MB using aio. I
simulated a 64MB read with dd and observed that the actual queued bio
sizes were slightly larger than 1MB (e.g., 2136 sectors = 1068KB): not
a single 64MB bio, but multiple bios of various sizes.

dd-5970[001] ..... 758.485446: block_bio_queue: 254,52 R 2933336 + 2136
dd-5970[001] ..... 758.487145: block_bio_queue: 254,52 R 2935472 + 2152
dd-5970[001] ..... 758.488822: block_bio_queue: 254,52 R 2937624 + 2128
dd-5970[001] ..... 758.490478: block_bio_queue: 254,52 R 2939752 + 2160
dd-5970[001] ..... 758.492154: block_bio_queue: 254,52 R 2941912 + 2096
dd-5970[001] ..... 758.493874: block_bio_queue: 254,52 R 2944008 + 2160
  
Ed Tsai (蔡宗軒) Nov. 6, 2023, 1:40 a.m. UTC | #9
On Mon, 2023-11-06 at 09:33 +0800, Ed Tsai wrote:
> On Sat, 2023-11-04 at 11:43 +0800, Ming Lei wrote:
> >  On Sat, Nov 04, 2023 at 01:11:02AM +0000, Ed Tsai (蔡宗軒) wrote:
> > > On Sat, 2023-11-04 at 00:20 +0800, Ming Lei wrote:
> > > >  On Wed, Nov 01, 2023 at 02:23:26AM +0000, Ed Tsai (蔡宗軒) wrote:
> > > > > On Wed, 2023-10-25 at 17:22 +0800, ed.tsai@mediatek.com
> > > > > wrote:
> > > > > > From: Ed Tsai <ed.tsai@mediatek.com>
> > > > > > 
> > > > > > Referring to commit 07173c3ec276 ("block: enable multipage
> > > > 
> > > > bvecs"),
> > > > > > each bio_vec now holds more than one page, potentially
> > 
> > exceeding
> > > > > > 1MB in size and causing alignment issues with the queue
> > 
> > limit.
> > > > > > 
> > > > > > In a sequential read/write scenario, the file system
> > 
> > maximizes
> > > > the
> > > > > > bio's capacity before submitting. However, misalignment
> > > > > > with
> > 
> > the
> > > > > > queue limit can result in the bio being split into smaller
> > 
> > I/O
> > > > > > operations.
> > > > > > 
> > > > > > For instance, assuming the maximum I/O size is set to 512KB
> > 
> > and
> > > > the
> > > > > > memory is highly fragmented, resulting in each bio
> > > > > > containing
> > > > 
> > > > only
> > > > > > one 2-pages bio_vec (i.e., bi_size = 1028KB). This would
> > 
> > cause
> > > > the
> > > > > > bio to be split into two 512KB portions and one 4KB
> > > > > > portion.
> > 
> > As a
> > > > > > result, the originally expected continuous large I/O
> > 
> > operations
> > > > are
> > > > > > interspersed with many small I/O operations.
> > > > > > 
> > > > > > To address this issue, this patch adds a check for the
> > > > 
> > > > max_sectors
> > > > > > before submitting the bio. This allows the upper layers to
> > > > > > proactively
> > > > > > detect and handle alignment issues.
> > > > > > 
> > > > > > I performed the Antutu V10 Storage Test on a UFS 4.0
> > > > > > device,
> > > > 
> > > > which
> > > > > > resulted in a significant improvement in the Sequential
> > > > > > test:
> > > > > > 
> > > > > > Sequential Read (average of 5 rounds):
> > > > > > Original: 3033.7 MB/sec
> > > > > > Patched: 3520.9 MB/sec
> > > > > > 
> > > > > > Sequential Write (average of 5 rounds):
> > > > > > Original: 2225.4 MB/sec
> > > > > > Patched: 2800.3 MB/sec
> > > > > > 
> > > > > > Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
> > > > > > ---
> > > > > >  block/bio.c | 6 ++++++
> > > > > >  1 file changed, 6 insertions(+)
> > > > > > 
> > > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > > index 816d412c06e9..a4a1f775b9ea 100644
> > > > > > --- a/block/bio.c
> > > > > > +++ b/block/bio.c
> > > > > > @@ -1227,6 +1227,7 @@ static int
> > 
> > __bio_iov_iter_get_pages(struct
> > > > bio
> > > > > > *bio, struct iov_iter *iter)
> > > > > >  iov_iter_extraction_t extraction_flags = 0;
> > > > > >  unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> > > > > >  unsigned short entries_left = bio->bi_max_vecs - bio-
> > > 
> > > bi_vcnt;
> > > > > > +struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)-
> > > > > > > limits;
> > > > > > 
> > > > > >  struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> > > > > >  struct page **pages = (struct page **)bv;
> > > > > >  ssize_t size, left;
> > > > > > @@ -1275,6 +1276,11 @@ static int
> > 
> > __bio_iov_iter_get_pages(struct
> > > > bio
> > > > > > *bio, struct iov_iter *iter)
> > > > > >  struct page *page = pages[i];
> > > > > >  
> > > > > >  len = min_t(size_t, PAGE_SIZE - offset, left);
> > > > > > +if (bio->bi_iter.bi_size + len >
> > > > > > +    lim->max_sectors << SECTOR_SHIFT) {
> > > > > > +ret = left;
> > > > > > +break;
> > > > > > +}
> > > > > >  if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> > > > > >  ret = bio_iov_add_zone_append_page(bio, page,
> > > > > > len,
> > > > > >  offset);
> > > > > > -- 
> > > > > > 2.18.0
> > > > > > 
> > > > > 
> > > > > 
> > > > > Hi Jens,
> > > > > 
> > > > > Just to clarify any potential confusion, I would like to
> > 
> > provide
> > > > > further details based on the assumed scenario mentioned
> > > > > above.
> > > > > 
> > > > > When the upper layer continuously sends 1028KB full-sized
> > > > > bios
> > 
> > for
> > > > > sequential reads, the Block Layer sees the following
> > > > > sequence:
> > > > > submit bio: size = 1028KB, start LBA = n
> > > > > submit bio: size = 1028KB, start LBA = n + 1028KB 
> > > > > submit bio: size = 1028KB, start LBA = n + 2056KB
> > > > > ...
> > > > > 
> > > > > However, due to the queue limit restricting the I/O size to a
> > > > 
> > > > maximum
> > > > > of 512KB, the Block Layer splits into the following sequence:
> > > > > submit bio: size = 512KB, start LBA = n
> > > > > submit bio: size = 512KB, start LBA = n +  512KB
> > > > > submit bio: size =   4KB, start LBA = n + 1024KB
> > > > > submit bio: size = 512KB, start LBA = n + 1028KB
> > > > > submit bio: size = 512KB, start LBA = n + 1540KB
> > > > > submit bio: size =   4KB, start LBA = n + 2052KB
> > > > > submit bio: size = 512KB, start LBA = n + 2056KB
> > > > > submit bio: size = 512KB, start LBA = n + 2568KB
> > > > > submit bio: size =   4KB, start LBA = n + 3080KB
> > > > > ...
> > > > > 
> > > > > The original expectation was for the storage to receive
> > > > > large,
> > > > > contiguous requests. However, due to non-alignment, many
> > > > > small
> > 
> > I/O
> > > > > requests are generated. This problem is easily visible
> > > > > because
> > 
> > the
> > > > > user pages passed in are often allocated by the buddy system
> > > > > as
> > > > 
> > > > order 0
> > > > > pages during page faults, resulting in highly non-contiguous
> > > > 
> > > > memory.
> > > > 
> > > > If order 0 page is added to bio, the multipage bvec becomes nop
> > > > basically(256bvec holds 256 pages), then how can it make a
> > 
> > difference
> > > > for you?
> > > 
> > > 
> > > 
> > > > 
> > > > > 
> > > > > As observed in the Antutu Sequential Read test below, it is
> > 
> > similar
> > > > to
> > > > > the description above where the splitting caused by the queue
> > 
> > limit
> > > > > leaves small requests sandwiched in between:
> > > > > 
> > > > > block_bio_queue: 8,32 R 86925864 + 2144 [Thread-51]
> > > > > block_split: 8,32 R 86925864 / 86926888 [Thread-51]
> > > > > block_split: 8,32 R 86926888 / 86927912 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86925864 + 1024 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86926888 + 1024 [Thread-51]
> > > > > block_bio_queue: 8,32 R 86928008 + 2144 [Thread-51]
> > > > > block_split: 8,32 R 86928008 / 86929032 [Thread-51]
> > > > > block_split: 8,32 R 86929032 / 86930056 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86928008 + 1024 [Thread-51]
> > > > > block_rq_issue: 8,32 R 49152 () 86927912 + 96 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86929032 + 1024 [Thread-51]
> > > > > block_bio_queue: 8,32 R 86930152 + 2112 [Thread-51]
> > > > > block_split: 8,32 R 86930152 / 86931176 [Thread-51]
> > > > > block_split: 8,32 R 86931176 / 86932200 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86930152 + 1024 [Thread-51]
> > > > > block_rq_issue: 8,32 R 49152 () 86930056 + 96 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86931176 + 1024 [Thread-51]
> > > > > block_bio_queue: 8,32 R 86932264 + 2096 [Thread-51]
> > > > > block_split: 8,32 R 86932264 / 86933288 [Thread-51]
> > > > > block_split: 8,32 R 86933288 / 86934312 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86932264 + 1024 [Thread-51]
> > > > > block_rq_issue: 8,32 R 32768 () 86932200 + 64 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86933288 + 1024 [Thread-51]
> > > > > 
> > > > > I simply prevent non-aligned situations in
> > 
> > bio_iov_iter_get_pages.
> > > > 
> > > > But there is still 4KB IO left if you limit max bio size is
> > 
> > 512KB,
> > > > then how does this 4KB IO finally go in case of 1028KB IO?
> > > > 
> > > > > Besides making the upper layer application aware of the queue
> > > > 
> > > > limit, I
> > > > > would appreciate any other directions or suggestions you may
> > 
> > have.
> > > > 
> > > > The problem is related with IO size from application.
> > > > 
> > > > If you send unaligned IO, you can't avoid the last IO with
> > > > small
> > > > size, no
> > > > matter if block layer bio split is involved or not. Your patch
> > 
> > just
> > > > lets
> > > > __bio_iov_iter_get_pages split the bio, and you still have 4KB
> > 
> > left
> > > > finally when application submits 1028KB, right?
> > > > 
> > > > Then I don't understand why your patch improves sequential IO
> > > > performance.
> > > > 
> > > > Thanks,
> > > > Ming
> > > > 
> > > 
> > > The application performs I/O with a sufficiently large I/O size,
> > > causing it to constantly fill up and submit full bios. However,
> > > in
> > 
> > the
> > > iomap direct I/O scenario, pages are added to the bio one by one
> > 
> > from
> > > the user buffer. This typically triggers page faults, resulting
> > > in
> > 
> > the
> > > allocation of order 0 pages from the buddy system.
> > > 
> > > The remaining amount of each order in the buddy system varies
> > > over
> > > time. If there are not enough pages available in a particular
> > 
> > order,
> > > pages are split from higher orders. When pages are obtained from
> > 
> > the
> > > higher order, the user buffer may contain some small consecutive
> > > patterns.
> > > 
> > > In summary, the physical layout of the user buffer is
> > 
> > unpredictable,
> > > and when it contains some small consecutive patterns, the size of
> > 
> > the
> > > bio becomes randomly unaligned during filling.
> > 
> > Yes, multipage bvec depends on physical layout of user buffer, but
> > it
> > doesn't affect bio size, which is decided by userspace, and the
> > bvec
> > page layout doesn't affect IO pattern.
> > 
> 
> This will result in variable sizes of full bios when filling them, as
> the size of the bio_vec depends on the physical layout of the user
> buffer.
> 
> > If userspace submits 1028K IO, the IO is split into 512K, 512K and
> > 4K,
> > no matter if each bvec includes how many pages.
> > 
> 
> Sorry for the confusion caused by my description of 1028KB. It was
> meant to illustrate the scenario of submitting an unaligned bio. In
> my
> opinion, the least ideal size for a full bio due to physical layout
> would be 1028KB, as it would result in an I/O with only 4KB.
> 
> 
> > If userspace submits very large IO, such as 512M, which will be
> > split
> > into 1K bios with 512K size.
> > 
> 
> No. The block layer cannot receive a 512MB bio to split into 512K
> size.
> It will encounter full bios of various size, because the size of each
> bio_vec is based on the physical layout.
> 
> 
> > You patch doesn't makes any difference actually from block layer
> > viewpoint, such as:
> > 
> > 1) dd if=/dev/ublkb0 of=/dev/null bs=1028k count=1 iflag=direct
> > 
> > 2) observe IO pattern by the following bpftrace:
> > 
> > kprobe:blk_mq_start_request
> > {
> > $rq = (struct request *)arg0;
> > 
> > printf("%lu %16s %d %d: %s %s bio %lx-%lu\n",
> > nsecs / 1000,  comm, pid, cpu, ksym(reg("ip")),
> > $rq->q->disk->disk_name, $rq->__sector, $rq->__data_len);
> > }
> > 
> > 3) no matter if your patch is applied or not, the following pattern
> > is always observed:
> > 
> > 117828811               dd 1475 0: blk_mq_start_request ublkb0 bio
> > 0-
> > 524288
> > 117828823               dd 1475 0: blk_mq_start_request ublkb0 bio
> > 400-524288
> > 117828825               dd 1475 0: blk_mq_start_request ublkb0 bio
> > 800-4096
> > 
> > Similar pattern can be observed when running dd inside FS(xfs).
> > 
> > > 
> > > This patch limits the bio to be filled up to the max_sectors. The
> > > submission is an async operation, so once the bio is queued, it
> > 
> > will
> > > immediately return and continue filled and submit the next bio.
> > 
> > bio split doesn't change this behavior too, the difference is just
> > how
> > many bios the upper layer(iomap) observed.
> > 
> > Your patch just moves the split into upper layer, and iomap can see
> > 3 bios with your patch when `dd bs=1028K`, and in vanilla tree,
> > iomap
> > just submits single bio with 1028K, block layer splits it into
> > 512k, 512k, and 4k. So finally UFS driver observes same IO pattern.
> > 
> > In short, please root cause why your patch improves performance, or
> > please share your workloads, so we can observe the IO pattern and
> > related mm/fs behavior and try to root cause it.
> > 
> > 
> > Thanks,
> > Ming
> > 
> 
> Antutu Sequential Read performs 72 reads of 64MB using aio. I
> simulated
> a 64MB read using dd and observed that the actual bio queue sizes
> were 
> slightly larger than 1MB. Not a single bio with 64MB, but rather
> multiple various size of bios.
> 
> dd-5970[001] ..... 758.485446: block_bio_queue: 254,52 R 2933336 +
> 2136
> dd-5970[001] ..... 758.487145: block_bio_queue: 254,52 R 2935472 +
> 2152
> dd-5970[001] ..... 758.488822: block_bio_queue: 254,52 R 2937624 +
> 2128
> dd-5970[001] ..... 758.490478: block_bio_queue: 254,52 R 2939752 +
> 2160
> dd-5970[001] ..... 758.492154: block_bio_queue: 254,52 R 2941912 +
> 2096
> dd-5970[001] ..... 758.493874: block_bio_queue: 254,52 R 2944008 +
> 2160
> 

Sorry for leaving out my dd command. Here it is:
dd if=/data/test_file of=/dev/null bs=64m count=1 iflag=direct

Best,
Ed
  
Ming Lei Nov. 6, 2023, 4:53 a.m. UTC | #10
On Mon, Nov 06, 2023 at 01:40:12AM +0000, Ed Tsai (蔡宗軒) wrote:
> On Mon, 2023-11-06 at 09:33 +0800, Ed Tsai wrote:
> > On Sat, 2023-11-04 at 11:43 +0800, Ming Lei wrote:

...

> Sorry for missing out on my dd command. Here it is:
> dd if=/data/test_file of=/dev/null bs=64m count=1 iflag=direct

OK, thanks for sharing.

I understand the issue now, but I am not sure it is a good idea to check
the queue limit in __bio_iov_iter_get_pages():

1) bio->bi_bdev may not be set

2) what matters is actually the bio's alignment, and the bio size can
still be big enough

So I cooked up a patch, and it should address your issue:


diff --git a/block/bio.c b/block/bio.c
index 816d412c06e9..7d982e74c65d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1266,6 +1266,24 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		size -= trim;
 	}
 
+	/*
+	 * Try to make bio aligned with 128KB if there are pages left so we
+	 * can avoid small bio in case of big chunk sequential IO
+	 */
+	if (iov_iter_count(iter)) {
+		unsigned curr_size = (bio->bi_iter.bi_size + size) &
+			~((128U << 10) - 1);
+		if (curr_size <= bio->bi_iter.bi_size) {
+			ret = left = size;
+			goto revert;
+		} else {
+			curr_size -= bio->bi_iter.bi_size;
+			ret = size - curr_size;
+			iov_iter_revert(iter, ret);
+			size = curr_size;
+		}
+	}
+
 	if (unlikely(!size)) {
 		ret = -EFAULT;
 		goto out;
@@ -1285,7 +1303,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 		offset = 0;
 	}
-
+revert:
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
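
As a quick illustration of the arithmetic in the hunk above (user-space
sketch only; the 900KB already-in-bio and 300KB just-pinned figures are
assumptions, not numbers from the thread):

#include <stdio.h>

#define CHUNK (128U << 10)	/* 128KB alignment target, as in the patch */

int main(void)
{
	unsigned int bi_size = 900U << 10;	/* bytes already in the bio (assumed) */
	unsigned int size = 300U << 10;		/* bytes just pinned (assumed) */
	unsigned int curr_size = (bi_size + size) & ~(CHUNK - 1);

	if (curr_size <= bi_size) {
		/* the new pages do not reach the next 128KB boundary: revert them all */
		printf("revert all %u KB\n", size >> 10);
	} else {
		/* keep just enough so the bio ends exactly on a 128KB boundary */
		curr_size -= bi_size;
		printf("keep %u KB, revert %u KB\n",
		       curr_size >> 10, (size - curr_size) >> 10);
	}
	return 0;
}

This prints "keep 252 KB, revert 48 KB": the bio is submitted at 1152KB,
a multiple of 128KB, and the reverted 48KB is picked up by the next bio.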




Thanks,
Ming
  
Ming Lei Nov. 6, 2023, 11:54 a.m. UTC | #11
On Mon, Nov 06, 2023 at 12:53:31PM +0800, Ming Lei wrote:
> On Mon, Nov 06, 2023 at 01:40:12AM +0000, Ed Tsai (蔡宗軒) wrote:
> > On Mon, 2023-11-06 at 09:33 +0800, Ed Tsai wrote:
> > > On Sat, 2023-11-04 at 11:43 +0800, Ming Lei wrote:
> 
> ...
> 
> > Sorry for missing out on my dd command. Here it is:
> > dd if=/data/test_file of=/dev/null bs=64m count=1 iflag=direct
> 
> OK, thanks for the sharing.
> 
> I understand the issue now, but not sure if it is one good idea to check
> queue limit in __bio_iov_iter_get_pages():
> 
> 1) bio->bi_bdev may not be set
> 
> 2) what matters is actually bio's alignment, and bio size still can
> be big enough
> 
> So I cooked one patch, and it should address your issue:

The following one fixes several bugs and is verified to produce big,
aligned bios; feel free to run your test against this one:

 block/bio.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 816d412c06e9..80b36ce57510 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1211,6 +1211,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
 }
 
 #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))
+#define BIO_CHUNK_SIZE	(256U << 10)
 
 /**
  * __bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
@@ -1266,6 +1267,31 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		size -= trim;
 	}
 
+	/*
+	 * Try to make the bio aligned to BIO_CHUNK_SIZE (256KB) if it isn't
+	 * the last one, so we can avoid small bios in case of big chunk
+	 * sequential IO because of bio split and multipage bvec.
+	 *
+	 * If nothing is added to this bio, simply allow unaligned since we
+	 * have chance to add more bytes
+	 */
+	if (iov_iter_count(iter) && bio->bi_iter.bi_size) {
+		unsigned int aligned_size = (bio->bi_iter.bi_size + size) &
+			~(BIO_CHUNK_SIZE - 1);
+
+		if (aligned_size <= bio->bi_iter.bi_size) {
+			/* stop to add page if this bio can't keep aligned */
+			if (!(bio->bi_iter.bi_size & (BIO_CHUNK_SIZE - 1))) {
+				ret = left = size;
+				goto revert;
+			}
+		} else {
+			aligned_size -= bio->bi_iter.bi_size;
+			iov_iter_revert(iter, size - aligned_size);
+			size = aligned_size;
+		}
+	}
+
 	if (unlikely(!size)) {
 		ret = -EFAULT;
 		goto out;
@@ -1285,7 +1311,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 		offset = 0;
 	}
-
+revert:
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
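
A short worked example of the new early-stop branch (user-space sketch;
the 512KB/60KB figures are assumptions for illustration): once a bio is
already a multiple of BIO_CHUNK_SIZE and the freshly pinned pages cannot
reach the next 256KB boundary, everything just pinned is reverted and the
bio is submitted as-is.

#include <stdio.h>

#define BIO_CHUNK_SIZE (256U << 10)	/* as defined in the patch */

int main(void)
{
	unsigned int bi_size = 512U << 10;	/* bio already 256KB-aligned (assumed) */
	unsigned int size = 60U << 10;		/* freshly pinned bytes (assumed) */
	unsigned int aligned_size = (bi_size + size) & ~(BIO_CHUNK_SIZE - 1);

	if (aligned_size <= bi_size) {
		if (!(bi_size & (BIO_CHUNK_SIZE - 1)))
			/* bio is already aligned: revert everything, submit as-is */
			printf("revert %u KB, submit %u KB bio\n",
			       size >> 10, bi_size >> 10);
	} else {
		/* keep only what ends on the next 256KB boundary */
		aligned_size -= bi_size;
		printf("keep %u KB of the new pages\n", aligned_size >> 10);
	}
	return 0;
}

With these numbers it prints "revert 60 KB, submit 512 KB bio".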
  
Ed Tsai (蔡宗軒) Nov. 7, 2023, 2:53 a.m. UTC | #12
On Mon, 2023-11-06 at 19:54 +0800, Ming Lei wrote:
>  On Mon, Nov 06, 2023 at 12:53:31PM +0800, Ming Lei wrote:
> > On Mon, Nov 06, 2023 at 01:40:12AM +0000, Ed Tsai (蔡宗軒) wrote:
> > > On Mon, 2023-11-06 at 09:33 +0800, Ed Tsai wrote:
> > > > On Sat, 2023-11-04 at 11:43 +0800, Ming Lei wrote:
> > 
> > ...
> > 
> > > Sorry for missing out on my dd command. Here it is:
> > > dd if=/data/test_file of=/dev/null bs=64m count=1 iflag=direct
> > 
> > OK, thanks for the sharing.
> > 
> > I understand the issue now, but not sure if it is one good idea to
> check
> > queue limit in __bio_iov_iter_get_pages():
> > 
> > 1) bio->bi_bdev may not be set
> > 
> > 2) what matters is actually bio's alignment, and bio size still can
> > be big enough
> > 
> > So I cooked one patch, and it should address your issue:
> 
> The following one fixes several bugs, and is verified to be capable
> of
> making big & aligned bios, feel free to run your test against this
> one:
> 
>  block/bio.c | 28 +++++++++++++++++++++++++++-
>  1 file changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 816d412c06e9..80b36ce57510 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1211,6 +1211,7 @@ static int bio_iov_add_zone_append_page(struct
> bio *bio, struct page *page,
>  }
>  
>  #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) /
> sizeof(struct page *))
> +#define BIO_CHUNK_SIZE(256U << 10)
>  
>  /**
>   * __bio_iov_iter_get_pages - pin user or kernel pages and add them
> to a bio
> @@ -1266,6 +1267,31 @@ static int __bio_iov_iter_get_pages(struct bio
> *bio, struct iov_iter *iter)
>  size -= trim;
>  }
>  
> +/*
> + * Try to make bio aligned with 128KB if it isn't the last one, so
> + * we can avoid small bio in case of big chunk sequential IO because
> + * of bio split and multipage bvec.
> + *
> + * If nothing is added to this bio, simply allow unaligned since we
> + * have chance to add more bytes
> + */
> +if (iov_iter_count(iter) && bio->bi_iter.bi_size) {
> +unsigned int aligned_size = (bio->bi_iter.bi_size + size) &
> +~(BIO_CHUNK_SIZE - 1);
> +
> +if (aligned_size <= bio->bi_iter.bi_size) {
> +/* stop to add page if this bio can't keep aligned */
> +if (!(bio->bi_iter.bi_size & (BIO_CHUNK_SIZE - 1))) {
> +ret = left = size;
> +goto revert;
> +}
> +} else {
> +aligned_size -= bio->bi_iter.bi_size;
> +iov_iter_revert(iter, size - aligned_size);
> +size = aligned_size;
> +}
> +}
> +
>  if (unlikely(!size)) {
>  ret = -EFAULT;
>  goto out;
> @@ -1285,7 +1311,7 @@ static int __bio_iov_iter_get_pages(struct bio
> *bio, struct iov_iter *iter)
>  
>  offset = 0;
>  }
> -
> +revert:
>  iov_iter_revert(iter, left);
>  out:
>  while (i < nr_pages)
> -- 
> 2.41.0
> 
> 
> 
> Thanks, 
> Ming
> 

The latest patch you provided, with 256KB alignment, does help alleviate
the severity of the fragmentation. However, the ideal alignment size
varies from device to device, so a fixed, universal size of 128KB or
256KB only provides partial relief.

I performed a dd direct I/O read of 64MB with your patch, and although
most of the bios were aligned, there were still cases of misalignment
with the device limit (e.g., 512MB for my device), as shown below:

dd [000] ..... 392.976830: block_bio_queue: 254,52 R 2997760 + 3584
dd [000] ..... 392.979940: block_bio_queue: 254,52 R 3001344 + 3584
dd [000] ..... 392.983235: block_bio_queue: 254,52 R 3004928 + 3584
dd [000] ..... 392.986468: block_bio_queue: 254,52 R 3008512 + 3584
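
For illustration (my own arithmetic, assuming the 512KB device limit that
is clarified later in the thread): each 3584-sector (1792KB) bio above is
a multiple of 256KB but not of 512KB, so splitting still leaves a trailing
half-sized request.

#include <stdio.h>

int main(void)
{
	const unsigned int sectors = 3584;	/* 1792KB bio, as traced above */
	const unsigned int limit = 1024;	/* assumed 512KB device limit */

	printf("full 512KB requests: %u, trailing request: %uKB\n",
	       sectors / limit, (sectors % limit) / 2);
	return 0;
}

This prints "full 512KB requests: 3, trailing request: 256KB".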

Comparing the results of the Antutu Sequential test with the previous
data, it is indeed an improvement, but there is still a slight gap
compared to limiting the bio size to max_sectors:

Sequential Read (average of 5 rounds):
Original: 3033.7 MB/sec
Limited to max_sectors: 3520.9 MB/sec
Aligned 256KB: 3471.5 MB/sec

Sequential Write (average of 5 rounds):
Original: 2225.4 MB/sec
Limited to max_sectors: 2800.3 MB/sec
Aligned 256KB: 2618.1 MB/sec

What if we limit the bio size only for devices that have set
max_sectors?
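
Something along the lines of the sketch below, purely as an illustration
of the idea (the bi_bdev check and the use of queue_max_sectors() are my
assumptions about how it could be wired into the original patch, not a
tested change):

	/*
	 * Illustration only: apply the cap only when a block device (and
	 * thus a queue limit) is already attached to the bio.
	 */
	if (bio->bi_bdev) {
		unsigned int max_bytes =
			queue_max_sectors(bdev_get_queue(bio->bi_bdev)) << SECTOR_SHIFT;

		if (bio->bi_iter.bi_size + len > max_bytes) {
			ret = left;
			break;
		}
	}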

Best,
Ed
  
Ming Lei Nov. 7, 2023, 3:48 a.m. UTC | #13
On Tue, Nov 07, 2023 at 02:53:20AM +0000, Ed Tsai (蔡宗軒) wrote:
> On Mon, 2023-11-06 at 19:54 +0800, Ming Lei wrote:
> >  On Mon, Nov 06, 2023 at 12:53:31PM +0800, Ming Lei wrote:
> > > On Mon, Nov 06, 2023 at 01:40:12AM +0000, Ed Tsai (蔡宗軒) wrote:
> > > > On Mon, 2023-11-06 at 09:33 +0800, Ed Tsai wrote:
> > > > > On Sat, 2023-11-04 at 11:43 +0800, Ming Lei wrote:
> > > 
> > > ...
> > > 
> > > > Sorry for missing out on my dd command. Here it is:
> > > > dd if=/data/test_file of=/dev/null bs=64m count=1 iflag=direct
> > > 
> > > OK, thanks for the sharing.
> > > 
> > > I understand the issue now, but not sure if it is one good idea to
> > check
> > > queue limit in __bio_iov_iter_get_pages():
> > > 
> > > 1) bio->bi_bdev may not be set
> > > 
> > > 2) what matters is actually bio's alignment, and bio size still can
> > > be big enough
> > > 
> > > So I cooked one patch, and it should address your issue:
> > 
> > The following one fixes several bugs, and is verified to be capable
> > of
> > making big & aligned bios, feel free to run your test against this
> > one:
> > 
> >  block/bio.c | 28 +++++++++++++++++++++++++++-
> >  1 file changed, 27 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 816d412c06e9..80b36ce57510 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1211,6 +1211,7 @@ static int bio_iov_add_zone_append_page(struct
> > bio *bio, struct page *page,
> >  }
> >  
> >  #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) /
> > sizeof(struct page *))
> > +#define BIO_CHUNK_SIZE(256U << 10)
> >  
> >  /**
> >   * __bio_iov_iter_get_pages - pin user or kernel pages and add them
> > to a bio
> > @@ -1266,6 +1267,31 @@ static int __bio_iov_iter_get_pages(struct bio
> > *bio, struct iov_iter *iter)
> >  size -= trim;
> >  }
> >  
> > +/*
> > + * Try to make bio aligned with 128KB if it isn't the last one, so
> > + * we can avoid small bio in case of big chunk sequential IO because
> > + * of bio split and multipage bvec.
> > + *
> > + * If nothing is added to this bio, simply allow unaligned since we
> > + * have chance to add more bytes
> > + */
> > +if (iov_iter_count(iter) && bio->bi_iter.bi_size) {
> > +unsigned int aligned_size = (bio->bi_iter.bi_size + size) &
> > +~(BIO_CHUNK_SIZE - 1);
> > +
> > +if (aligned_size <= bio->bi_iter.bi_size) {
> > +/* stop to add page if this bio can't keep aligned */
> > +if (!(bio->bi_iter.bi_size & (BIO_CHUNK_SIZE - 1))) {
> > +ret = left = size;
> > +goto revert;
> > +}
> > +} else {
> > +aligned_size -= bio->bi_iter.bi_size;
> > +iov_iter_revert(iter, size - aligned_size);
> > +size = aligned_size;
> > +}
> > +}
> > +
> >  if (unlikely(!size)) {
> >  ret = -EFAULT;
> >  goto out;
> > @@ -1285,7 +1311,7 @@ static int __bio_iov_iter_get_pages(struct bio
> > *bio, struct iov_iter *iter)
> >  
> >  offset = 0;
> >  }
> > -
> > +revert:
> >  iov_iter_revert(iter, left);
> >  out:
> >  while (i < nr_pages)
> > -- 
> > 2.41.0
> > 
> > 
> > 
> > Thanks, 
> > Ming
> > 
> 
> The latest patch you provided with 256 alignment does help alleviate
> the severity of fragmentation. However, the actual aligned size may
> vary depending on the device. Using a fixed and universal size of 128
> or 256KB only provides partial relief from fragmentation.
> 
> I performed a dd direct I/O read of 64MB with your patch, and although
> most of the bios were aligned, there were still cases of missalignment
> to the device limit (e.g., 512MB for my device), as shown below:

512MB is really big, and actually you have already reached 3520MB/sec in
READ by limiting the max bio size to 1MB in your original patch.

Just curious: what is the data if you change my last patch to align with
max_sectors instead? That way it still tries to maximize and align the bio.

> 
> dd [000] ..... 392.976830: block_bio_queue: 254,52 R 2997760 + 3584
> dd [000] ..... 392.979940: block_bio_queue: 254,52 R 3001344 + 3584
> dd [000] ..... 392.983235: block_bio_queue: 254,52 R 3004928 + 3584
> dd [000] ..... 392.986468: block_bio_queue: 254,52 R 3008512 + 3584

Yeah, I thought that 128KB should be fine for usual hardware, but it
looks like it is not good enough.

> 
> Comparing the results of the Antutu Sequential test to the previous
> data, it is indeed an improvement, but still a slight difference with
> limiting the bio size to max_sectors:
> 
> Sequential Read (average of 5 rounds):
> Original: 3033.7 MB/sec
> Limited to max_sectors: 3520.9 MB/sec
> Aligned 256KB: 3471.5 MB/sec
> 
> Sequential Write (average of 5 rounds):
> Original: 2225.4 MB/sec
> Limited to max_sectors: 2800.3 MB/sec
> Aligned 256KB: 2618.1 MB/sec

Thanks for sharing the data.

> 
> What if we limit the bio size only for those who have set the
> max_sectors?

I think it may be doable, but it needs a smarter approach to avoid the
extra cost of iov_iter_revert(); one way is to add bio_shrink()
(or bio_revert()) so the alignment is done just once.

I will think about it further and write a new patch if it is doable.



Thanks,
Ming
  
Ed Tsai (蔡宗軒) Nov. 7, 2023, 4:35 a.m. UTC | #14
On Tue, 2023-11-07 at 11:48 +0800, Ming Lei wrote:
>  On Tue, Nov 07, 2023 at 02:53:20AM +0000, Ed Tsai (蔡宗軒) wrote:
> > On Mon, 2023-11-06 at 19:54 +0800, Ming Lei wrote:
> > >  On Mon, Nov 06, 2023 at 12:53:31PM +0800, Ming Lei wrote:
> > > > On Mon, Nov 06, 2023 at 01:40:12AM +0000, Ed Tsai (蔡宗軒) wrote:
> > > > > On Mon, 2023-11-06 at 09:33 +0800, Ed Tsai wrote:
> > > > > > On Sat, 2023-11-04 at 11:43 +0800, Ming Lei wrote:
> > > > 
> > > > ...
> > > > 
> > > > > Sorry for missing out on my dd command. Here it is:
> > > > > dd if=/data/test_file of=/dev/null bs=64m count=1
> iflag=direct
> > > > 
> > > > OK, thanks for the sharing.
> > > > 
> > > > I understand the issue now, but not sure if it is one good idea
> to
> > > check
> > > > queue limit in __bio_iov_iter_get_pages():
> > > > 
> > > > 1) bio->bi_bdev may not be set
> > > > 
> > > > 2) what matters is actually bio's alignment, and bio size still
> can
> > > > be big enough
> > > > 
> > > > So I cooked one patch, and it should address your issue:
> > > 
> > > The following one fixes several bugs, and is verified to be
> capable
> > > of
> > > making big & aligned bios, feel free to run your test against
> this
> > > one:
> > > 
> > >  block/bio.c | 28 +++++++++++++++++++++++++++-
> > >  1 file changed, 27 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/bio.c b/block/bio.c
> > > index 816d412c06e9..80b36ce57510 100644
> > > --- a/block/bio.c
> > > +++ b/block/bio.c
> > > @@ -1211,6 +1211,7 @@ static int
> bio_iov_add_zone_append_page(struct
> > > bio *bio, struct page *page,
> > >  }
> > >  
> > >  #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) /
> > > sizeof(struct page *))
> > > +#define BIO_CHUNK_SIZE(256U << 10)
> > >  
> > >  /**
> > >   * __bio_iov_iter_get_pages - pin user or kernel pages and add
> them
> > > to a bio
> > > @@ -1266,6 +1267,31 @@ static int __bio_iov_iter_get_pages(struct
> bio
> > > *bio, struct iov_iter *iter)
> > >  size -= trim;
> > >  }
> > >  
> > > +/*
> > > + * Try to make bio aligned with 128KB if it isn't the last one,
> so
> > > + * we can avoid small bio in case of big chunk sequential IO
> because
> > > + * of bio split and multipage bvec.
> > > + *
> > > + * If nothing is added to this bio, simply allow unaligned since
> we
> > > + * have chance to add more bytes
> > > + */
> > > +if (iov_iter_count(iter) && bio->bi_iter.bi_size) {
> > > +unsigned int aligned_size = (bio->bi_iter.bi_size + size) &
> > > +~(BIO_CHUNK_SIZE - 1);
> > > +
> > > +if (aligned_size <= bio->bi_iter.bi_size) {
> > > +/* stop to add page if this bio can't keep aligned */
> > > +if (!(bio->bi_iter.bi_size & (BIO_CHUNK_SIZE - 1))) {
> > > +ret = left = size;
> > > +goto revert;
> > > +}
> > > +} else {
> > > +aligned_size -= bio->bi_iter.bi_size;
> > > +iov_iter_revert(iter, size - aligned_size);
> > > +size = aligned_size;
> > > +}
> > > +}
> > > +
> > >  if (unlikely(!size)) {
> > >  ret = -EFAULT;
> > >  goto out;
> > > @@ -1285,7 +1311,7 @@ static int __bio_iov_iter_get_pages(struct
> bio
> > > *bio, struct iov_iter *iter)
> > >  
> > >  offset = 0;
> > >  }
> > > -
> > > +revert:
> > >  iov_iter_revert(iter, left);
> > >  out:
> > >  while (i < nr_pages)
> > > -- 
> > > 2.41.0
> > > 
> > > 
> > > 
> > > Thanks, 
> > > Ming
> > > 
> > 
> > The latest patch you provided with 256 alignment does help
> alleviate
> > the severity of fragmentation. However, the actual aligned size may
> > vary depending on the device. Using a fixed and universal size of
> 128
> > or 256KB only provides partial relief from fragmentation.
> > 
> > I performed a dd direct I/O read of 64MB with your patch, and
> although
> > most of the bios were aligned, there were still cases of
> missalignment
> > to the device limit (e.g., 512MB for my device), as shown below:
> 
> 512MB is really big, and actually you have reached 3520MB in READ by
> limiting max bio size to 1MB in your original patch.
> 
> Just be curious what is the data if you change to align with max
> sectors
> against my last patch? which can try to maximize & align bio.

Sorry, it is a typo. Please disregard it. It should be 512KB instead.

> 
> > 
> > dd [000] ..... 392.976830: block_bio_queue: 254,52 R 2997760 + 3584
> > dd [000] ..... 392.979940: block_bio_queue: 254,52 R 3001344 + 3584
> > dd [000] ..... 392.983235: block_bio_queue: 254,52 R 3004928 + 3584
> > dd [000] ..... 392.986468: block_bio_queue: 254,52 R 3008512 + 3584
> 
> Yeah, I thought that 128KB should be fine for usual hardware, but
> looks not good enough.
> 
> > 
> > Comparing the results of the Antutu Sequential test to the previous
> > data, it is indeed an improvement, but still a slight difference
> with
> > limiting the bio size to max_sectors:
> > 
> > Sequential Read (average of 5 rounds):
> > Original: 3033.7 MB/sec
> > Limited to max_sectors: 3520.9 MB/sec
> > Aligned 256KB: 3471.5 MB/sec
> > 
> > Sequential Write (average of 5 rounds):
> > Original: 2225.4 MB/sec
> > Limited to max_sectors: 2800.3 MB/sec
> > Aligned 256KB: 2618.1 MB/sec
> 
> Thanks for sharing the data.
> 
> > 
> > What if we limit the bio size only for those who have set the
> > max_sectors?
> 
> I think it may be doable, but need more smart approach for avoiding
> extra cost of iov_iter_revert(), and one way is to add bio_shrink()
> (or bio_revert()) to run the alignment just once.
> 
> I will think further and write a new patch if it is doable.
> 
> 
> 
> Thanks,
> Ming
> 

Thank you very much. I will keep following this issue to see whether any
difficulties or alternative directions arise.

Best,
Ed
  

Patch

diff --git a/block/bio.c b/block/bio.c
index 816d412c06e9..a4a1f775b9ea 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1227,6 +1227,7 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_extraction_t extraction_flags = 0;
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
 	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
+	struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)->limits;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
 	ssize_t size, left;
@@ -1275,6 +1276,11 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		struct page *page = pages[i];
 
 		len = min_t(size_t, PAGE_SIZE - offset, left);
+		if (bio->bi_iter.bi_size + len >
+		    lim->max_sectors << SECTOR_SHIFT) {
+			ret = left;
+			break;
+		}
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
 			ret = bio_iov_add_zone_append_page(bio, page, len,
 					offset);