[md-6.9,v2,00/10] md/raid1: refactor read_balance() and some minor fix

Message ID 20240227120327.1432511-1-yukuai1@huaweicloud.com

Yu Kuai Feb. 27, 2024, 12:03 p.m. UTC
  From: Yu Kuai <yukuai3@huawei.com>

Changes in v2:
 - add a new counter in conf for patch 2;
 - fix the 'choose next idle' case when there is no other idle disk in
 patch 3;
 - add review tags from Xiao Ni for patches 1 and 4-8

The original idea is that Paul wanted to optimize raid1 read
performance ([1]); however, the original code for read_balance() is
quite complex, and we don't want to add more complexity. Hence we
decided to refactor read_balance() first, to make the code cleaner and
easier for follow-up work.

Before this patchset, read_balance() had many local variables and many
branches; it tried to consider all the scenarios in one iteration. The
idea of this patchset is to divide it into 4 different steps:

1) If resync is in progress, find the first usable disk, patch 5;
Otherwise:
2) Loop through all disks, skipping slow disks and disks with bad
blocks, and choose the best disk, patch 10. If no disk is found:
3) Look for disks with bad blocks and choose the one covering the most
sectors, patch 8. If no disk is found:
4) Choose the first found slow disk with no bad blocks, or the slow disk
covering the most sectors, patch 7.

Note that steps 3) and 4) are corner-case code paths, and performance
is not a concern there.

After this patchset, we'll continue to optimize read_balance() for
step 2), specifically how to choose the best rdev to read from.

[1] https://lore.kernel.org/all/20240102125115.129261-1-paul.e.luse@linux.intel.com/

Yu Kuai (10):
  md: add a new helper rdev_has_badblock()
  md/raid1: record nonrot rdevs while adding/removing rdevs to conf
  md/raid1: fix choose next idle in read_balance()
  md/raid1-10: add a helper raid1_check_read_range()
  md/raid1-10: factor out a new helper raid1_should_read_first()
  md/raid1: factor out read_first_rdev() from read_balance()
  md/raid1: factor out choose_slow_rdev() from read_balance()
  md/raid1: factor out choose_bb_rdev() from read_balance()
  md/raid1: factor out the code to manage sequential IO
  md/raid1: factor out helpers to choose the best rdev from
    read_balance()

 drivers/md/md.h       |  11 +
 drivers/md/raid1-10.c |  69 +++++++
 drivers/md/raid1.c    | 465 +++++++++++++++++++++++++-----------------
 drivers/md/raid1.h    |   1 +
 drivers/md/raid10.c   |  58 ++----
 drivers/md/raid5.c    |  35 ++--
 6 files changed, 391 insertions(+), 248 deletions(-)
  

Comments

Song Liu Feb. 27, 2024, 9:11 p.m. UTC | #1
On Tue, Feb 27, 2024 at 4:09 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Changes in v2:
>  - add a new counter in conf for patch 2;
>  - fix the 'choose next idle' case when there is no other idle disk in
>  patch 3;
>  - add review tags from Xiao Ni for patches 1 and 4-8
>
> The original idea is that Paul wanted to optimize raid1 read
> performance ([1]); however, the original code for read_balance() is
> quite complex, and we don't want to add more complexity. Hence we
> decided to refactor read_balance() first, to make the code cleaner and
> easier for follow-up work.
>
> Before this patchset, read_balance() had many local variables and many
> branches; it tried to consider all the scenarios in one iteration. The
> idea of this patchset is to divide it into 4 different steps:
>
> 1) If resync is in progress, find the first usable disk, patch 5;
> Otherwise:
> 2) Loop through all disks, skipping slow disks and disks with bad
> blocks, and choose the best disk, patch 10. If no disk is found:
> 3) Look for disks with bad blocks and choose the one covering the most
> sectors, patch 8. If no disk is found:
> 4) Choose the first found slow disk with no bad blocks, or the slow disk
> covering the most sectors, patch 7.
>
> Note that steps 3) and 4) are corner-case code paths, and performance
> is not a concern there.
>
> After this patchset, we'll continue to optimize read_balance() for
> step 2), specifically how to choose the best rdev to read from.
>
> [1] https://lore.kernel.org/all/20240102125115.129261-1-paul.e.luse@linux.intel.com/

v2 looks good to me. Thanks! I will give Xiao some more time to review
it one more time before pushing it to md-6.9.

Song
  
Xiao Ni Feb. 28, 2024, 2:16 a.m. UTC | #2
On Tue, Feb 27, 2024 at 8:09 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") added
> the 'choose next idle' case to read_balance():
>
> read_balance:
>  for_each_rdev
>   if(next_seq_sect == this_sector || dist == 0)
>   -> sequential reads
>    best_disk = disk;
>    if (...)
>     choose_next_idle = 1
>     continue;
>
>  for_each_rdev
>  -> iterate next rdev
>   if (pending == 0)
>    best_disk = disk;
>    -> choose the next idle disk
>    break;
>
>   if (choose_next_idle)
>    -> keep using this rdev if there are no other idle disk
>    continue
>
> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> removed the code:
>
> -               /* If device is idle, use it */
> -               if (pending == 0) {
> -                       best_disk = disk;
> -                       break;
> -               }
>
> Hence 'choose next idle' will never work now. Fix this problem as
> follows:
>
> 1) don't set best_disk in this case; read_balance() will choose the best
>    disk after iterating all the disks;
> 2) add to 'pending' so that another idle disk can be chosen;
> 3) add a new local variable 'sequential_disk' to record the disk; if
>    there is no other idle disk, 'sequential_disk' will be chosen;
>
> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/raid1.c | 32 ++++++++++++++++++++++----------
>  1 file changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 0fed01b06de9..fc5899fb08c1 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -598,13 +598,12 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>         const sector_t this_sector = r1_bio->sector;
>         int sectors;
>         int best_good_sectors;
> -       int best_disk, best_dist_disk, best_pending_disk;
> +       int best_disk, best_dist_disk, best_pending_disk, sequential_disk;
>         int disk;
>         sector_t best_dist;
>         unsigned int min_pending;
>         struct md_rdev *rdev;
>         int choose_first;
> -       int choose_next_idle;
>
>         /*
>          * Check if we can balance. We can balance on the whole
> @@ -615,11 +614,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>         sectors = r1_bio->sectors;
>         best_disk = -1;
>         best_dist_disk = -1;
> +       sequential_disk = -1;
>         best_dist = MaxSector;
>         best_pending_disk = -1;
>         min_pending = UINT_MAX;
>         best_good_sectors = 0;
> -       choose_next_idle = 0;
>         clear_bit(R1BIO_FailFast, &r1_bio->state);
>
>         if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> @@ -712,7 +711,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                         int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>                         struct raid1_info *mirror = &conf->mirrors[disk];
>
> -                       best_disk = disk;
>                         /*
>                          * If buffered sequential IO size exceeds optimal
>                          * iosize, check if there is idle disk. If yes, choose
> @@ -731,15 +729,22 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                             mirror->next_seq_sect > opt_iosize &&
>                             mirror->next_seq_sect - opt_iosize >=
>                             mirror->seq_start) {
> -                               choose_next_idle = 1;
> -                               continue;
> +                               /*
> +                                * Add 'pending' to avoid choosing this disk if
> +                                * there is other idle disk.
> +                                */
> +                               pending++;
> +                               /*
> +                                * If there is no other idle disk, this disk
> +                                * will be chosen.
> +                                */
> +                               sequential_disk = disk;
> +                       } else {
> +                               best_disk = disk;
> +                               break;
>                         }
> -                       break;
>                 }
>
> -               if (choose_next_idle)
> -                       continue;
> -
>                 if (min_pending > pending) {
>                         min_pending = pending;
>                         best_pending_disk = disk;
> @@ -751,6 +756,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                 }
>         }
>
> +       /*
> +        * sequential IO size exceeds optimal iosize, however, there is no other
> +        * idle disk, so choose the sequential disk.
> +        */
> +       if (best_disk == -1 && min_pending != 0)
> +               best_disk = sequential_disk;
> +
>         /*
>          * If all disks are rotational, choose the closest disk. If any disk is
>          * non-rotational, choose the disk with less pending request even the
> --
> 2.39.2
>
Hi all,
This patch looks good to me.
Reviewed-by: Xiao Ni <xni@redhat.com>