[v2] block: limit the extract size to align queue limit

Message ID 20231110051950.21972-1-ed.tsai@mediatek.com
State New
Headers
Series [v2] block: limit the extract size to align queue limit |

Commit Message

Ed Tsai (蔡宗軒) Nov. 10, 2023, 5:19 a.m. UTC
  From: Ed Tsai <ed.tsai@mediatek.com>

When an application performs a large IO, it fills and submits multiple
full bios to the block layer. Referring to commit 07173c3ec276
("block: enable multipage bvecs"), the full bio size is no longer fixed
at 1MB but can vary based on the physical memory layout.

The size of the full bio no longer aligns with the maximum IO size of
the queue. Therefore, in a 64MB read, you may see many unaligned bios
being submitted.

Executing the command to perform a 64MB read:

	dd if=/data/test_file of=/dev/null bs=64m count=1 iflag=direct

It demonstrates the submission of numerous unaligned bios:

	block_bio_queue: 254,52 R 2933336 + 2136
	block_bio_queue: 254,52 R 2935472 + 2152
	block_bio_queue: 254,52 R 2937624 + 2128
	block_bio_queue: 254,52 R 2939752 + 2160

This patch limits the number of extract pages to ensure that we submit
the bio once we fill enough pages, preventing the block layer from
spliting small I/Os in between.

I performed the Antutu V10 Storage Test on a UFS 4.0 device, which
resulted in a significant improvement in the Sequential test:

Sequential Read (average of 5 rounds):
Original: 3033.7 MB/sec
Patched: 3520.9 MB/sec

Sequential Write (average of 5 rounds):
Original: 2225.4 MB/sec
Patched: 2800.3 MB/sec

Link: https://lore.kernel.org/linux-arm-kernel/20231025092255.27930-1-ed.tsai@mediatek.com/
Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>

---
 block/bio.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)
  

Comments

Christoph Hellwig Nov. 10, 2023, 6:36 a.m. UTC | #1
On Fri, Nov 10, 2023 at 01:19:49PM +0800, ed.tsai@mediatek.com wrote:
> +	if (bdev && blk_queue_pci_p2pdma(bdev->bd_disk->queue))
>  		extraction_flags |= ITER_ALLOW_P2PDMA;

As pointed out in reply to Ming, you really need to first figure out
if we can assume we have a valid bdev or not, and if not pass all the
relevant information separately.

> +	if (bdev && bio_op(bio) != REQ_OP_ZONE_APPEND) {
> +		unsigned int max = queue_max_bytes(bdev_get_queue(bdev));

The higher level code must not look at queue_max_bytes, that is only
used for splitting and might not even be initialized.
  
kernel test robot Nov. 10, 2023, 10:59 a.m. UTC | #2
Hi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on hch-configfs/for-next linus/master v6.6 next-20231110]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/ed-tsai-mediatek-com/block-limit-the-extract-size-to-align-queue-limit/20231110-142205
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link:    https://lore.kernel.org/r/20231110051950.21972-1-ed.tsai%40mediatek.com
patch subject: [PATCH v2] block: limit the extract size to align queue limit
config: arc-randconfig-002-20231110 (https://download.01.org/0day-ci/archive/20231110/202311101853.9N398fyj-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231110/202311101853.9N398fyj-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311101853.9N398fyj-lkp@intel.com/

All warnings (new ones prefixed by >>):

   block/bio.c: In function '__bio_iov_iter_get_pages':
>> block/bio.c:1261:29: warning: suggest parentheses around '-' in operand of '&' [-Wparentheses]
    1261 |                         max - bio->bi_iter.bi_size & (max - 1) : max;
         |                         ~~~~^~~~~~~~~~~~~~~~~~~~~~


vim +1261 block/bio.c

  1214	
  1215	/**
  1216	 * __bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
  1217	 * @bio: bio to add pages to
  1218	 * @iter: iov iterator describing the region to be mapped
  1219	 *
  1220	 * Extracts pages from *iter and appends them to @bio's bvec array.  The pages
  1221	 * will have to be cleaned up in the way indicated by the BIO_PAGE_PINNED flag.
  1222	 * For a multi-segment *iter, this function only adds pages from the next
  1223	 * non-empty segment of the iov iterator.
  1224	 */
  1225	static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
  1226	{
  1227		iov_iter_extraction_t extraction_flags = 0;
  1228		unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
  1229		unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
  1230		struct block_device *bdev = bio->bi_bdev;
  1231		struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
  1232		struct page **pages = (struct page **)bv;
  1233		ssize_t max_extract = UINT_MAX - bio->bi_iter.bi_size;
  1234		ssize_t size, left;
  1235		unsigned len, i = 0;
  1236		size_t offset;
  1237		int ret = 0;
  1238	
  1239		/*
  1240		 * Move page array up in the allocated memory for the bio vecs as far as
  1241		 * possible so that we can start filling biovecs from the beginning
  1242		 * without overwriting the temporary page array.
  1243		 */
  1244		BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
  1245		pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
  1246	
  1247		if (bdev && blk_queue_pci_p2pdma(bdev->bd_disk->queue))
  1248			extraction_flags |= ITER_ALLOW_P2PDMA;
  1249	
  1250		/*
  1251		 * Each segment in the iov is required to be a block size multiple.
  1252		 * However, we may not be able to get the entire segment if it spans
  1253		 * more pages than bi_max_vecs allows, so we have to ALIGN_DOWN the
  1254		 * result to ensure the bio's total size is correct. The remainder of
  1255		 * the iov data will be picked up in the next bio iteration.
  1256		 */
  1257		if (bdev && bio_op(bio) != REQ_OP_ZONE_APPEND) {
  1258			unsigned int max = queue_max_bytes(bdev_get_queue(bdev));
  1259	
  1260			max_extract = bio->bi_iter.bi_size ?
> 1261				max - bio->bi_iter.bi_size & (max - 1) : max;
  1262		}
  1263		size = iov_iter_extract_pages(iter, &pages, max_extract,
  1264					      nr_pages, extraction_flags, &offset);
  1265		if (unlikely(size <= 0))
  1266			return size ? size : -EFAULT;
  1267	
  1268		nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
  1269	
  1270		if (bdev) {
  1271			size_t trim = size & (bdev_logical_block_size(bdev) - 1);
  1272			iov_iter_revert(iter, trim);
  1273			size -= trim;
  1274		}
  1275	
  1276		if (unlikely(!size)) {
  1277			ret = -EFAULT;
  1278			goto out;
  1279		}
  1280	
  1281		for (left = size, i = 0; left > 0; left -= len, i++) {
  1282			struct page *page = pages[i];
  1283	
  1284			len = min_t(size_t, PAGE_SIZE - offset, left);
  1285			if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
  1286				ret = bio_iov_add_zone_append_page(bio, page, len,
  1287						offset);
  1288				if (ret)
  1289					break;
  1290			} else
  1291				bio_iov_add_page(bio, page, len, offset);
  1292	
  1293			offset = 0;
  1294		}
  1295	
  1296		iov_iter_revert(iter, left);
  1297	out:
  1298		while (i < nr_pages)
  1299			bio_release_page(bio, pages[i++]);
  1300	
  1301		return ret;
  1302	}
  1303
  
Ming Lei Nov. 10, 2023, 4:03 p.m. UTC | #3
On Fri, Nov 10, 2023 at 01:19:49PM +0800, ed.tsai@mediatek.com wrote:
> From: Ed Tsai <ed.tsai@mediatek.com>
> 
> When an application performs a large IO, it fills and submits multiple
> full bios to the block layer. Referring to commit 07173c3ec276
> ("block: enable multipage bvecs"), the full bio size is no longer fixed
> at 1MB but can vary based on the physical memory layout.
> 
> The size of the full bio no longer aligns with the maximum IO size of
> the queue. Therefore, in a 64MB read, you may see many unaligned bios
> being submitted.
> 
> Executing the command to perform a 64MB read:
> 
> 	dd if=/data/test_file of=/dev/null bs=64m count=1 iflag=direct
> 
> It demonstrates the submission of numerous unaligned bios:
> 
> 	block_bio_queue: 254,52 R 2933336 + 2136
> 	block_bio_queue: 254,52 R 2935472 + 2152
> 	block_bio_queue: 254,52 R 2937624 + 2128
> 	block_bio_queue: 254,52 R 2939752 + 2160
> 
> This patch limits the number of extract pages to ensure that we submit
> the bio once we fill enough pages, preventing the block layer from
> spliting small I/Os in between.
> 
> I performed the Antutu V10 Storage Test on a UFS 4.0 device, which
> resulted in a significant improvement in the Sequential test:
> 
> Sequential Read (average of 5 rounds):
> Original: 3033.7 MB/sec
> Patched: 3520.9 MB/sec
> 
> Sequential Write (average of 5 rounds):
> Original: 2225.4 MB/sec
> Patched: 2800.3 MB/sec
> 
> Link: https://lore.kernel.org/linux-arm-kernel/20231025092255.27930-1-ed.tsai@mediatek.com/
> Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
> 
> ---
>  block/bio.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 816d412c06e9..8d3a112e68da 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1227,8 +1227,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	iov_iter_extraction_t extraction_flags = 0;
>  	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
>  	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
> +	struct block_device *bdev = bio->bi_bdev;
>  	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
>  	struct page **pages = (struct page **)bv;
> +	ssize_t max_extract = UINT_MAX - bio->bi_iter.bi_size;
>  	ssize_t size, left;
>  	unsigned len, i = 0;
>  	size_t offset;
> @@ -1242,7 +1244,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
>  	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>  
> -	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
> +	if (bdev && blk_queue_pci_p2pdma(bdev->bd_disk->queue))
>  		extraction_flags |= ITER_ALLOW_P2PDMA;
>  
>  	/*
> @@ -1252,16 +1254,21 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	 * result to ensure the bio's total size is correct. The remainder of
>  	 * the iov data will be picked up in the next bio iteration.
>  	 */
> -	size = iov_iter_extract_pages(iter, &pages,
> -				      UINT_MAX - bio->bi_iter.bi_size,
> +	if (bdev && bio_op(bio) != REQ_OP_ZONE_APPEND) {
> +		unsigned int max = queue_max_bytes(bdev_get_queue(bdev));
> +
> +		max_extract = bio->bi_iter.bi_size ?
> +			max - bio->bi_iter.bi_size & (max - 1) : max;
> +	}
> +	size = iov_iter_extract_pages(iter, &pages, max_extract,
>  				      nr_pages, extraction_flags, &offset);

The above is just what I did in the 'slow path' of patch v2[1], and it can't work
well for every extracting pages which is usually slow, and batching
extracting pages should be done always, such as:

1) build one ublk disk(suppose it is /dev/ublkb0) with max sectors of 32k:

- rublk add null --io-buf-size=16384 -q 2	[2]

2) run 64KB IO

fio --direct=1 --size=230G --bsrange=64k-64k --runtime=20 --numjobs=2 --ioengine=libaio \
	--iodepth=64 --iodepth_batch_submit=64 --iodepth_batch_complete_min=64 --group_reporting=1 \
	--filename=/dev/ublkb0 --name=/dev/ublkb0-test-randread --rw=randread

In my local VM, read BW is dropped to 3709MB/s from 20GB/s in the above
fio test with this patch.

The point is that:

1) bio size alignment is only needed in case of multiple bios

2) bio size alignment is needed only when the current bio is approaching
to become FULL

3) with multiple bvec, it is hard to know how many pages can be held
in bvecs beforehand

In short, running every alignment is much less efficient.


[1] https://lore.kernel.org/linux-block/202311100354.HYfqOQ7o-lkp@intel.com/T/#u
[2] install rublk via `cargo install --version=^0.1 rublk` and
CONFIG_BLK_DEV_UBLK is required

Thanks, 
Ming
  

Patch

diff --git a/block/bio.c b/block/bio.c
index 816d412c06e9..8d3a112e68da 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1227,8 +1227,10 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_extraction_t extraction_flags = 0;
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
 	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
+	struct block_device *bdev = bio->bi_bdev;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
+	ssize_t max_extract = UINT_MAX - bio->bi_iter.bi_size;
 	ssize_t size, left;
 	unsigned len, i = 0;
 	size_t offset;
@@ -1242,7 +1244,7 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
+	if (bdev && blk_queue_pci_p2pdma(bdev->bd_disk->queue))
 		extraction_flags |= ITER_ALLOW_P2PDMA;
 
 	/*
@@ -1252,16 +1254,21 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_extract_pages(iter, &pages,
-				      UINT_MAX - bio->bi_iter.bi_size,
+	if (bdev && bio_op(bio) != REQ_OP_ZONE_APPEND) {
+		unsigned int max = queue_max_bytes(bdev_get_queue(bdev));
+
+		max_extract = bio->bi_iter.bi_size ?
+			max - bio->bi_iter.bi_size & (max - 1) : max;
+	}
+	size = iov_iter_extract_pages(iter, &pages, max_extract,
 				      nr_pages, extraction_flags, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
-	if (bio->bi_bdev) {
-		size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
+	if (bdev) {
+		size_t trim = size & (bdev_logical_block_size(bdev) - 1);
 		iov_iter_revert(iter, trim);
 		size -= trim;
 	}