[md-6.9,00/10] md/raid1: refactor read_balance() and some minor fix

Message ID 20240222075806.1816400-1-yukuai1@huaweicloud.com

Yu Kuai Feb. 22, 2024, 7:57 a.m. UTC
  From: Yu Kuai <yukuai3@huawei.com>

The original idea is that Paul wants to optimize raid1 read
performance ([1]); however, we think that the original code for
read_balance() is quite complex, and we don't want to add more
complexity. Hence we decided to refactor read_balance() first, to make
the code cleaner and easier to build on.

Before this patchset, read_balance() has many local variables and many
branches; it wants to consider all the scenarios in one iteration. The
idea of this patchset is to divide them into 4 different steps:

1) If resync is in progress, find the first usable disk, patch 5;
Otherwise:
2) Loop through all disks, skipping slow disks and disks with bad
blocks, and choose the best disk, patch 10. If no disk is found:
3) Look for disks with bad blocks and choose the one with the most
sectors, patch 8. If no disk is found:
4) Choose the first found slow disk with no bad blocks, or the slow
disk with the most sectors, patch 7.

Note that step 3) and step 4) are rarely taken code paths, so performance
does not need to be considered there. The resulting top-level flow is
sketched below.
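
This is a rough sketch only: raid1_should_read_first(), read_first_rdev(),
choose_bb_rdev() and choose_slow_rdev() are the helper names from the patch
titles, while choose_best_rdev() and update_read_sectors() are placeholder
names used here for the step 2) helper and the factored-out sequential IO
bookkeeping:

static int read_balance(struct r1conf *conf, struct r1bio *r1_bio,
			int *max_sectors)
{
	int disk;

	clear_bit(R1BIO_FailFast, &r1_bio->state);

	/* 1) resync/recovery in progress: read from the first usable disk */
	if (raid1_should_read_first(conf->mddev, r1_bio->sector,
				    r1_bio->sectors))
		return read_first_rdev(conf, r1_bio, max_sectors);

	/* 2) common path: the best disk, skipping slow disks and bad blocks */
	disk = choose_best_rdev(conf, r1_bio);
	if (disk >= 0) {
		*max_sectors = r1_bio->sectors;
		update_read_sectors(conf, disk, r1_bio->sector, *max_sectors);
		return disk;
	}

	/* 3) no good disk: pick the one whose bad blocks leave the most
	 * readable sectors
	 */
	disk = choose_bb_rdev(conf, r1_bio, max_sectors);
	if (disk >= 0)
		return disk;

	/* 4) last resort: a slow disk */
	return choose_slow_rdev(conf, r1_bio, max_sectors);
}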

And after this patchset, we'll continue to optimize read_balance() for
step 2), specifically how to choose the best rdev to read from.

[1] https://lore.kernel.org/all/20240102125115.129261-1-paul.e.luse@linux.intel.com/

Yu Kuai (10):
  md: add a new helper rdev_has_badblock()
  md: record nonrot rdevs while adding/removing rdevs to conf
  md/raid1: fix choose next idle in read_balance()
  md/raid1-10: add a helper raid1_check_read_range()
  md/raid1-10: factor out a new helper raid1_should_read_first()
  md/raid1: factor out read_first_rdev() from read_balance()
  md/raid1: factor out choose_slow_rdev() from read_balance()
  md/raid1: factor out choose_bb_rdev() from read_balance()
  md/raid1: factor out the code to manage sequential IO
  md/raid1: factor out helpers to choose the best rdev from
    read_balance()

 drivers/md/md.c       |  28 ++-
 drivers/md/md.h       |  12 ++
 drivers/md/raid1-10.c |  69 +++++++
 drivers/md/raid1.c    | 454 ++++++++++++++++++++++++------------------
 drivers/md/raid10.c   |  66 ++----
 drivers/md/raid5.c    |  35 ++--
 6 files changed, 402 insertions(+), 262 deletions(-)
  

Comments

Paul Menzel Feb. 22, 2024, 8:40 a.m. UTC | #1
Dear Kuai, dear Paul,


Thank you for your work. Some nits and a request for benchmarks below.


On 22.02.24 at 08:57, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> The orignial idea is that Paul want to optimize raid1 read

original

> performance([1]), however, we think that the orignial code for

original

> read_balance() is quite complex, and we don't want to add more
> complexity. Hence we decide to refactor read_balance() first, to make
> code cleaner and easier for follow up.
> 
> Before this patchset, read_balance() has many local variables and many
> braches, it want to consider all the scenarios in one iteration. The

branches

> idea of this patch is to devide them into 4 different steps:

divide

> 1) If resync is in progress, find the first usable disk, patch 5;
> Otherwise:
> 2) Loop through all disks and skipping slow disks and disks with bad
> blocks, choose the best disk, patch 10. If no disk is found:
> 3) Look for disks with bad blocks and choose the one with most number of
> sectors, patch 8. If no disk is found:
> 4) Choose first found slow disk with no bad blocks, or slow disk with
> most number of sectors, patch 7.
> 
> Note that step 3) and step 4) are super code path, and performance
> should not be considered.
> 
> And after this patchset, we'll continue to optimize read_balance for
> step 2), specifically how to choose the best rdev to read.

Is there a change in performance with the current patch set? Is raid1
well enough covered by the test suite?


Kind regards,

Paul


> [1] https://lore.kernel.org/all/20240102125115.129261-1-paul.e.luse@linux.intel.com/
> 
> Yu Kuai (10):
>    md: add a new helper rdev_has_badblock()
>    md: record nonrot rdevs while adding/removing rdevs to conf
>    md/raid1: fix choose next idle in read_balance()
>    md/raid1-10: add a helper raid1_check_read_range()
>    md/raid1-10: factor out a new helper raid1_should_read_first()
>    md/raid1: factor out read_first_rdev() from read_balance()
>    md/raid1: factor out choose_slow_rdev() from read_balance()
>    md/raid1: factor out choose_bb_rdev() from read_balance()
>    md/raid1: factor out the code to manage sequential IO
>    md/raid1: factor out helpers to choose the best rdev from
>      read_balance()
> 
>   drivers/md/md.c       |  28 ++-
>   drivers/md/md.h       |  12 ++
>   drivers/md/raid1-10.c |  69 +++++++
>   drivers/md/raid1.c    | 454 ++++++++++++++++++++++++------------------
>   drivers/md/raid10.c   |  66 ++----
>   drivers/md/raid5.c    |  35 ++--
>   6 files changed, 402 insertions(+), 262 deletions(-)
  
Yu Kuai Feb. 22, 2024, 9:08 a.m. UTC | #2
Hi,

On 2024/02/22 16:40, Paul Menzel wrote:
> Is there a change in performance with the current patch set? Is raid1
> well enough covered by the test suite?

Yes, there is no performance degradation, and the mdadm tests passed. And
Paul Luse also ran a fio mixed workload w/data integrity and it passed.

Thanks,
Kuai
  
Luse, Paul E Feb. 22, 2024, 1:04 p.m. UTC | #3
> On Feb 22, 2024, at 2:08 AM, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> Hi,
> 
> On 2024/02/22 16:40, Paul Menzel wrote:
>> Is there a change in performance with the current patch set? Is raid1 well enough covered by the test suite?
> 
> Yes, there is no performance degradation, and the mdadm tests passed. And
> Paul Luse also ran a fio mixed workload w/data integrity and it passed.
> 
> Thanks,
> Kuai
> 

Kuai is correct; in my original perf improvement patch I included lots of results.  For this set, where we just refactored, I checked performance to assure we didn't go downhill, but didn't save the results as the deltas were in the noise.  After this series lands we will look at introducing performance improvements again, and at that time results from a full performance sweep will be included.

For data integrity, 1 and 2 disk mirrors were run overnight w/fio and a crc32 check - no issues.

To assure other code paths execute as they did before was a little trickier without a unit test framework, but for those cases I did modify/un-modify the code several times to follow various code paths and assure they're working as expected (i.e. bad blocks, etc.).

-Paul
  
Paul Menzel Feb. 22, 2024, 3:30 p.m. UTC | #4
Dear Paul, dear Kuai


On 22.02.24 at 14:04, Luse, Paul E wrote:

>> On Feb 22, 2024, at 2:08 AM, Yu Kuai <yukuai1@huaweicloud.com> wrote:

>> On 2024/02/22 16:40, Paul Menzel wrote:
>>> Is there a change in performance with the current patch set? Is
>>> raid1 well enough covered by the test suite?
>> 
>> Yes, there is no performance degradation, and the mdadm tests passed.
>> And Paul Luse also ran a fio mixed workload w/data integrity and it
>> passed.
> 
> Kuai is correct, in my original perf improvement patch I included
> lots of results.  For this set where we just refactored I checked
> performance to assure we didn't go downhill but didn't save the
> results as deltas were in the noise.  After this series lands we will
> look at introducing performance improvements again and at that time
> results from a full performance sweep will be included.
> 
> For data integrity, 1 and 2 disk mirrors were run overnight w/fio and
> a crc32 check - no issues.
> 
> To assure other code paths execute as they did before was a little
> trickier without a unit test framework but for those cases I did
> modify/un-modify the code several times to follow various code paths
> and assure they're working as expected (ie bad blocks, etc)
Thank you very much for the elaborate response.

In our infrastructure, we often notice things improve, but we sometimes
also have the “feeling” that things get worse. As IO is so complex, I
always find it helpful to note down the exact test setup and the tests
that were run. So thank you for responding.


Kind regards,

Paul
  
Xiao Ni Feb. 26, 2024, 8:55 a.m. UTC | #5
Hi Kuai

Thanks for the effort!

On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> the case choose next idle in read_balance():
>
> read_balance:
>  for_each_rdev
>   if(next_seq_sect == this_sector || disk == 0)

typo error: s/disk/dist/g

>   -> sequential reads
>    best_disk = disk;
>    if (...)
>     choose_next_idle = 1
>     continue;
>
>  for_each_rdev
>  -> iterate next rdev
>   if (pending == 0)
>    best_disk = disk;
>    -> choose the next idle disk
>    break;
>
>   if (choose_next_idle)
>    -> keep using this rdev if there are no other idle disk
>    continue
>
> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> remove the code:
>
> -               /* If device is idle, use it */
> -               if (pending == 0) {
> -                       best_disk = disk;
> -                       break;
> -               }
>
> Hence choose next idle will never work now, fix this problem by
> following:
>
> 1) don't set best_disk in this case, read_balance() will choose the best
>    disk after iterating all the disks;
> 2) add 'pending' so that other idle disk will be chosen;
> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>    are rotational, this disk will still be chosen;
>
> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/raid1.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index c60ea58ae8c5..d0bc67e6d068 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>         unsigned int min_pending;
>         struct md_rdev *rdev;
>         int choose_first;
> -       int choose_next_idle;
>
>         /*
>          * Check if we can balance. We can balance on the whole
> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>         best_pending_disk = -1;
>         min_pending = UINT_MAX;
>         best_good_sectors = 0;
> -       choose_next_idle = 0;
>         clear_bit(R1BIO_FailFast, &r1_bio->state);
>
>         if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                         int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>                         struct raid1_info *mirror = &conf->mirrors[disk];
>
> -                       best_disk = disk;
>                         /*
>                          * If buffered sequential IO size exceeds optimal
>                          * iosize, check if there is idle disk. If yes, choose
> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                             mirror->next_seq_sect > opt_iosize &&
>                             mirror->next_seq_sect - opt_iosize >=
>                             mirror->seq_start) {
> -                               choose_next_idle = 1;
> -                               continue;
> +                               /*
> +                                * Add 'pending' to avoid choosing this disk if
> +                                * there is other idle disk.
> +                                * Set 'dist' to 0, so that if there is no other
> +                                * idle disk and all disks are rotational, this
> +                                * disk will still be chosen.
> +                                */
> +                               pending++;
> +                               dist = 0;

There is a problem. If all disks are not idle and there is a disk with
dist=0 before the seq disk, it can't read from the seq disk. It will
read from the first disk with dist=0. Maybe we can only add the codes
which are removed from 2e52d449bcec?

Best Regards
Xiao

> +                       } else {
> +                               best_disk = disk;
> +                               break;
>                         }
> -                       break;
>                 }
>
> -               if (choose_next_idle)
> -                       continue;
> -
>                 if (min_pending > pending) {
>                         min_pending = pending;
>                         best_pending_disk = disk;
> --
> 2.39.2
>
>
  
Xiao Ni Feb. 26, 2024, 9:24 a.m. UTC | #6
On Mon, Feb 26, 2024 at 5:12 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2024/02/26 16:55, Xiao Ni wrote:
> > Hi Kuai
> >
> > Thanks for the effort!
> >
> > On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> From: Yu Kuai <yukuai3@huawei.com>
> >>
> >> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> >> the case choose next idle in read_balance():
> >>
> >> read_balance:
> >>   for_each_rdev
> >>    if(next_seq_sect == this_sector || disk == 0)
> >
> > typo error: s/disk/dist/g
> >
> >>    -> sequential reads
> >>     best_disk = disk;
> >>     if (...)
> >>      choose_next_idle = 1
> >>      continue;
> >>
> >>   for_each_rdev
> >>   -> iterate next rdev
> >>    if (pending == 0)
> >>     best_disk = disk;
> >>     -> choose the next idle disk
> >>     break;
> >>
> >>    if (choose_next_idle)
> >>     -> keep using this rdev if there are no other idle disk
> >>     continue
> >>
> >> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> remove the code:
> >>
> >> -               /* If device is idle, use it */
> >> -               if (pending == 0) {
> >> -                       best_disk = disk;
> >> -                       break;
> >> -               }
> >>
> >> Hence choose next idle will never work now, fix this problem by
> >> following:
> >>
> >> 1) don't set best_disk in this case, read_balance() will choose the best
> >>     disk after iterating all the disks;
> >> 2) add 'pending' so that other idle disk will be chosen;
> >> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
> >>     are rotational, this disk will still be chosen;
> >>
> >> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> >> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> >> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> >> ---
> >>   drivers/md/raid1.c | 21 ++++++++++++---------
> >>   1 file changed, 12 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index c60ea58ae8c5..d0bc67e6d068 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>          unsigned int min_pending;
> >>          struct md_rdev *rdev;
> >>          int choose_first;
> >> -       int choose_next_idle;
> >>
> >>          /*
> >>           * Check if we can balance. We can balance on the whole
> >> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>          best_pending_disk = -1;
> >>          min_pending = UINT_MAX;
> >>          best_good_sectors = 0;
> >> -       choose_next_idle = 0;
> >>          clear_bit(R1BIO_FailFast, &r1_bio->state);
> >>
> >>          if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> >> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>                          int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> >>                          struct raid1_info *mirror = &conf->mirrors[disk];
> >>
> >> -                       best_disk = disk;
> >>                          /*
> >>                           * If buffered sequential IO size exceeds optimal
> >>                           * iosize, check if there is idle disk. If yes, choose
> >> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>                              mirror->next_seq_sect > opt_iosize &&
> >>                              mirror->next_seq_sect - opt_iosize >=
> >>                              mirror->seq_start) {
> >> -                               choose_next_idle = 1;
> >> -                               continue;
> >> +                               /*
> >> +                                * Add 'pending' to avoid choosing this disk if
> >> +                                * there is other idle disk.
> >> +                                * Set 'dist' to 0, so that if there is no other
> >> +                                * idle disk and all disks are rotational, this
> >> +                                * disk will still be chosen.
> >> +                                */
> >> +                               pending++;
> >> +                               dist = 0;
> >
> > There is a problem. If all disks are not idle and there is a disk with
> > dist=0 before the seq disk, it can't read from the seq disk. It will
> > read from the first disk with dist=0. Maybe we can only add the codes
> > which are removed from 2e52d449bcec?
>
> If there is a disk with disk=0, then best_dist_disk will be updated to
> the disk, and best_dist will be updated to 0 already:
>
> // in each iteration
> if (dist < best_dist) {
>         best_dist = dist;
>         btest_disk_disk = disk;
> }
>
> In this case, best_dist will be set to the first disk with dist=0, and
> at last, the disk will be chosen:
>
> if (best_disk == -1) {
>          if (has_nonrot_disk || min_pending == 0)
>                  best_disk = best_pending_disk;
>          else
>                  best_disk = best_dist_disk;
>                 -> the first disk with dist=0;
> }
>
> So, the problem that you concerned should not exist.

Hi Kuai

Thanks for the explanation. You're right. It chooses the first disk
which has dist==0. In the above, you made the same typo, disk=0,
as in the comment. I guess you want to use dist=0, right? Besides this,
this patch is good to me.

Best Regards
Xiao
>
> Thanks,
> Kuai
> >
> > Best Regards
> > Xiao
> >
> >> +                       } else {
> >> +                               best_disk = disk;
> >> +                               break;
> >>                          }
> >> -                       break;
> >>                  }
> >>
> >> -               if (choose_next_idle)
> >> -                       continue;
> >> -
> >>                  if (min_pending > pending) {
> >>                          min_pending = pending;
> >>                          best_pending_disk = disk;
> >> --
> >> 2.39.2
> >>
> >>
> >
> > .
> >
>
  
Xiao Ni Feb. 26, 2024, 1:20 p.m. UTC | #7
On Mon, Feb 26, 2024 at 5:40 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2024/02/26 17:24, Xiao Ni wrote:
> > On Mon, Feb 26, 2024 at 5:12 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> Hi,
> >>
>> On 2024/02/26 16:55, Xiao Ni wrote:
> >>> Hi Kuai
> >>>
> >>> Thanks for the effort!
> >>>
> >>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>>>
> >>>> From: Yu Kuai <yukuai3@huawei.com>
> >>>>
> >>>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> >>>> the case choose next idle in read_balance():
> >>>>
> >>>> read_balance:
> >>>>    for_each_rdev
> >>>>     if(next_seq_sect == this_sector || disk == 0)
> >>>
> >>> typo error: s/disk/dist/g
> >>>
> >>>>     -> sequential reads
> >>>>      best_disk = disk;
> >>>>      if (...)
> >>>>       choose_next_idle = 1
> >>>>       continue;
> >>>>
> >>>>    for_each_rdev
> >>>>    -> iterate next rdev
> >>>>     if (pending == 0)
> >>>>      best_disk = disk;
> >>>>      -> choose the next idle disk
> >>>>      break;
> >>>>
> >>>>     if (choose_next_idle)
> >>>>      -> keep using this rdev if there are no other idle disk
> >>>>      continue
> >>>>
> >>>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >>>> remove the code:
> >>>>
> >>>> -               /* If device is idle, use it */
> >>>> -               if (pending == 0) {
> >>>> -                       best_disk = disk;
> >>>> -                       break;
> >>>> -               }
> >>>>
> >>>> Hence choose next idle will never work now, fix this problem by
> >>>> following:
> >>>>
> >>>> 1) don't set best_disk in this case, read_balance() will choose the best
> >>>>      disk after iterating all the disks;
> >>>> 2) add 'pending' so that other idle disk will be chosen;
> >>>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
> >>>>      are rotational, this disk will still be chosen;
> >>>>
> >>>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >>>> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> >>>> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> >>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> >>>> ---
> >>>>    drivers/md/raid1.c | 21 ++++++++++++---------
> >>>>    1 file changed, 12 insertions(+), 9 deletions(-)
> >>>>
> >>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >>>> index c60ea58ae8c5..d0bc67e6d068 100644
> >>>> --- a/drivers/md/raid1.c
> >>>> +++ b/drivers/md/raid1.c
> >>>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>>           unsigned int min_pending;
> >>>>           struct md_rdev *rdev;
> >>>>           int choose_first;
> >>>> -       int choose_next_idle;
> >>>>
> >>>>           /*
> >>>>            * Check if we can balance. We can balance on the whole
> >>>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>>           best_pending_disk = -1;
> >>>>           min_pending = UINT_MAX;
> >>>>           best_good_sectors = 0;
> >>>> -       choose_next_idle = 0;
> >>>>           clear_bit(R1BIO_FailFast, &r1_bio->state);
> >>>>
> >>>>           if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> >>>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>>                           int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> >>>>                           struct raid1_info *mirror = &conf->mirrors[disk];
> >>>>
> >>>> -                       best_disk = disk;
> >>>>                           /*
> >>>>                            * If buffered sequential IO size exceeds optimal
> >>>>                            * iosize, check if there is idle disk. If yes, choose
> >>>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>>                               mirror->next_seq_sect > opt_iosize &&
> >>>>                               mirror->next_seq_sect - opt_iosize >=
> >>>>                               mirror->seq_start) {
> >>>> -                               choose_next_idle = 1;
> >>>> -                               continue;
> >>>> +                               /*
> >>>> +                                * Add 'pending' to avoid choosing this disk if
> >>>> +                                * there is other idle disk.
> >>>> +                                * Set 'dist' to 0, so that if there is no other
> >>>> +                                * idle disk and all disks are rotational, this
> >>>> +                                * disk will still be chosen.
> >>>> +                                */
> >>>> +                               pending++;
> >>>> +                               dist = 0;
> >>>
> >>> There is a problem. If all disks are not idle and there is a disk with
> >>> dist=0 before the seq disk, it can't read from the seq disk. It will
> >>> read from the first disk with dist=0. Maybe we can only add the codes
> >>> which are removed from 2e52d449bcec?
> >>
> >> If there is a disk with disk=0, then best_dist_disk will be updated to
> >> the disk, and best_dist will be updated to 0 already:
> >>
> >> // in each iteration
> >> if (dist < best_dist) {
> >>          best_dist = dist;
> >>          btest_disk_disk = disk;
> >> }
> >>
> >> In this case, best_dist will be set to the first disk with dist=0, and
> >> at last, the disk will be chosen:
> >>
> >> if (best_disk == -1) {
> >>           if (has_nonrot_disk || min_pending == 0)
> >>                   best_disk = best_pending_disk;
> >>           else
> >>                   best_disk = best_dist_disk;
> >>                  -> the first disk with dist=0;
> >> }
> >>
> >> So, the problem that you concerned should not exist.
> >
> > Hi Kuai
> >
> > Thanks for the explanation. You're right. It chooses the first disk
> > which has dist==0. In the above, you made the same typo, disk=0,
> > as in the comment. I guess you want to use dist=0, right? Besides this,
> > this patch is good to me.
>
> Yes, and Paul changed the name 'best_dist' to 'closest_dist', and
> 'best_dist_disk' to 'closest_dist_disk' in the last patch to avoid
> typos like this. :)

Ah, thanks :)  I hadn't gotten there yet.

Regards
Xiao
>
> Thanks,
> Kuai
>
>
> >
> > Best Regards
> > Xiao
> >>
> >> Thanks,
> >> Kuai
> >>>
> >>> Best Regards
> >>> Xiao
> >>>
> >>>> +                       } else {
> >>>> +                               best_disk = disk;
> >>>> +                               break;
> >>>>                           }
> >>>> -                       break;
> >>>>                   }
> >>>>
> >>>> -               if (choose_next_idle)
> >>>> -                       continue;
> >>>> -
> >>>>                   if (min_pending > pending) {
> >>>>                           min_pending = pending;
> >>>>                           best_pending_disk = disk;
> >>>> --
> >>>> 2.39.2
> >>>>
> >>>>
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>
  
Song Liu Feb. 27, 2024, 12:27 a.m. UTC | #8
Hi Kuai and Paul,

On Thu, Feb 22, 2024 at 12:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> The orignial idea is that Paul want to optimize raid1 read
> performance([1]), however, we think that the orignial code for
> read_balance() is quite complex, and we don't want to add more
> complexity. Hence we decide to refactor read_balance() first, to make
> code cleaner and easier for follow up.
>
> Before this patchset, read_balance() has many local variables and many
> braches, it want to consider all the scenarios in one iteration. The
> idea of this patch is to devide them into 4 different steps:
>
> 1) If resync is in progress, find the first usable disk, patch 5;
> Otherwise:
> 2) Loop through all disks and skipping slow disks and disks with bad
> blocks, choose the best disk, patch 10. If no disk is found:
> 3) Look for disks with bad blocks and choose the one with most number of
> sectors, patch 8. If no disk is found:
> 4) Choose first found slow disk with no bad blocks, or slow disk with
> most number of sectors, patch 7.

Thanks for your great work in this set. It looks great.

Please address the feedback from folks and send v2. We can still get this
into the 6.9 merge window.

Thanks,
Song
  
Xiao Ni Feb. 27, 2024, 2:23 a.m. UTC | #9
On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> the case choose next idle in read_balance():
>
> read_balance:
>  for_each_rdev
>   if(next_seq_sect == this_sector || disk == 0)
>   -> sequential reads
>    best_disk = disk;
>    if (...)
>     choose_next_idle = 1
>     continue;
>
>  for_each_rdev
>  -> iterate next rdev
>   if (pending == 0)
>    best_disk = disk;
>    -> choose the next idle disk
>    break;
>
>   if (choose_next_idle)
>    -> keep using this rdev if there are no other idle disk
>    continue
>
> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> remove the code:
>
> -               /* If device is idle, use it */
> -               if (pending == 0) {
> -                       best_disk = disk;
> -                       break;
> -               }
>
> Hence choose next idle will never work now, fix this problem by
> following:
>
> 1) don't set best_disk in this case, read_balance() will choose the best
>    disk after iterating all the disks;
> 2) add 'pending' so that other idle disk will be chosen;
> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>    are rotational, this disk will still be chosen;
>
> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/raid1.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index c60ea58ae8c5..d0bc67e6d068 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>         unsigned int min_pending;
>         struct md_rdev *rdev;
>         int choose_first;
> -       int choose_next_idle;
>
>         /*
>          * Check if we can balance. We can balance on the whole
> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>         best_pending_disk = -1;
>         min_pending = UINT_MAX;
>         best_good_sectors = 0;
> -       choose_next_idle = 0;
>         clear_bit(R1BIO_FailFast, &r1_bio->state);
>
>         if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                         int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>                         struct raid1_info *mirror = &conf->mirrors[disk];
>
> -                       best_disk = disk;
>                         /*
>                          * If buffered sequential IO size exceeds optimal
>                          * iosize, check if there is idle disk. If yes, choose
> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                             mirror->next_seq_sect > opt_iosize &&
>                             mirror->next_seq_sect - opt_iosize >=
>                             mirror->seq_start) {
> -                               choose_next_idle = 1;
> -                               continue;
> +                               /*
> +                                * Add 'pending' to avoid choosing this disk if
> +                                * there is other idle disk.
> +                                * Set 'dist' to 0, so that if there is no other
> +                                * idle disk and all disks are rotational, this
> +                                * disk will still be chosen.
> +                                */
> +                               pending++;
> +                               dist = 0;
> +                       } else {
> +                               best_disk = disk;
> +                               break;
>                         }
> -                       break;
>                 }

Hi Kuai

I noticed something. In patch 12cee5a8a29e, it sets best_disk if it's
a sequential read. If there are no other idle disks, it will read from
the sequential disk. With this patch, it reads from the
best_pending_disk even when min_pending is not 0. It looks like wrong
behaviour?

Best Regards
Xiao
>
> -               if (choose_next_idle)
> -                       continue;
> -
>                 if (min_pending > pending) {
>                         min_pending = pending;
>                         best_pending_disk = disk;
> --
> 2.39.2
>
>
  
Xiao Ni Feb. 27, 2024, 4:49 a.m. UTC | #10
On Tue, Feb 27, 2024 at 10:38 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2024/02/27 10:23, Xiao Ni wrote:
> > On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> From: Yu Kuai <yukuai3@huawei.com>
> >>
> >> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> >> the case choose next idle in read_balance():
> >>
> >> read_balance:
> >>   for_each_rdev
> >>    if(next_seq_sect == this_sector || disk == 0)
> >>    -> sequential reads
> >>     best_disk = disk;
> >>     if (...)
> >>      choose_next_idle = 1
> >>      continue;
> >>
> >>   for_each_rdev
> >>   -> iterate next rdev
> >>    if (pending == 0)
> >>     best_disk = disk;
> >>     -> choose the next idle disk
> >>     break;
> >>
> >>    if (choose_next_idle)
> >>     -> keep using this rdev if there are no other idle disk
> >>     continue
> >>
> >> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> remove the code:
> >>
> >> -               /* If device is idle, use it */
> >> -               if (pending == 0) {
> >> -                       best_disk = disk;
> >> -                       break;
> >> -               }
> >>
> >> Hence choose next idle will never work now, fix this problem by
> >> following:
> >>
> >> 1) don't set best_disk in this case, read_balance() will choose the best
> >>     disk after iterating all the disks;
> >> 2) add 'pending' so that other idle disk will be chosen;
> >> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
> >>     are rotational, this disk will still be chosen;
> >>
> >> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> >> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> >> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> >> ---
> >>   drivers/md/raid1.c | 21 ++++++++++++---------
> >>   1 file changed, 12 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index c60ea58ae8c5..d0bc67e6d068 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>          unsigned int min_pending;
> >>          struct md_rdev *rdev;
> >>          int choose_first;
> >> -       int choose_next_idle;
> >>
> >>          /*
> >>           * Check if we can balance. We can balance on the whole
> >> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>          best_pending_disk = -1;
> >>          min_pending = UINT_MAX;
> >>          best_good_sectors = 0;
> >> -       choose_next_idle = 0;
> >>          clear_bit(R1BIO_FailFast, &r1_bio->state);
> >>
> >>          if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> >> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>                          int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> >>                          struct raid1_info *mirror = &conf->mirrors[disk];
> >>
> >> -                       best_disk = disk;
> >>                          /*
> >>                           * If buffered sequential IO size exceeds optimal
> >>                           * iosize, check if there is idle disk. If yes, choose
> >> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>                              mirror->next_seq_sect > opt_iosize &&
> >>                              mirror->next_seq_sect - opt_iosize >=
> >>                              mirror->seq_start) {
> >> -                               choose_next_idle = 1;
> >> -                               continue;
> >> +                               /*
> >> +                                * Add 'pending' to avoid choosing this disk if
> >> +                                * there is other idle disk.
> >> +                                * Set 'dist' to 0, so that if there is no other
> >> +                                * idle disk and all disks are rotational, this
> >> +                                * disk will still be chosen.
> >> +                                */
> >> +                               pending++;
> >> +                               dist = 0;
> >> +                       } else {
> >> +                               best_disk = disk;
> >> +                               break;
> >>                          }
> >> -                       break;
> >>                  }
> >
> > Hi Kuai
> >
> > I noticed something. In patch 12cee5a8a29e, it sets best_disk if it's
> > a sequential read. If there are no other idle disks, it will read from
> > the sequential disk. With this patch, it reads from the
> > best_pending_disk even when min_pending is not 0. It looks like wrong
> > behaviour?
>
> Yes, nice catch, I hadn't noticed this... So there is a hidden logic:
> sequential IO priority is higher than the minimal 'pending' selection;
> it is lower only than 'choose_next_idle', which applies when an idle
> disk exists.

Yes.


>
> Looks like if we want to keep this behaviour, we can add a 'sequential
> disk':
>
> if (is_sequential())
>   if (!should_choose_next())
>    return disk;
>   ctl.sequential_disk = disk;
>
> ...
>
> if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
>   return ctl.sequential_disk;

Agree with this, thanks :)

Best Regards
Xiao
>
> Thanks,
> Kuai
>
> >
> > Best Regards
> > Xiao
> >>
> >> -               if (choose_next_idle)
> >> -                       continue;
> >> -
> >>                  if (min_pending > pending) {
> >>                          min_pending = pending;
> >>                          best_pending_disk = disk;
> >> --
> >> 2.39.2
> >>
> >>
> >
> > .
> >
>