Message ID | 20240222075806.1816400-1-yukuai1@huaweicloud.com |
Headers
From: Yu Kuai <yukuai1@huaweicloud.com>
To: paul.e.luse@linux.intel.com, song@kernel.org, neilb@suse.com, shli@fb.com
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, yukuai3@huawei.com, yukuai1@huaweicloud.com, yi.zhang@huawei.com, yangerkun@huawei.com
Subject: [PATCH md-6.9 00/10] md/raid1: refactor read_balance() and some minor fix
Date: Thu, 22 Feb 2024 15:57:56 +0800
Message-Id: <20240222075806.1816400-1-yukuai1@huaweicloud.com>
Series
md/raid1: refactor read_balance() and some minor fix
Message
Yu Kuai
Feb. 22, 2024, 7:57 a.m. UTC
From: Yu Kuai <yukuai3@huawei.com>
The original idea is that Paul wanted to optimize raid1 read
performance([1]); however, we think that the original code for
read_balance() is quite complex, and we don't want to add more
complexity. Hence we decided to refactor read_balance() first, to make
the code cleaner and easier to follow up on.
Before this patchset, read_balance() has many local variables and many
branches; it tries to consider all the scenarios in one iteration. The
idea of this patchset is to divide them into 4 different steps:
1) If resync is in progress, find the first usable disk, patch 5;
Otherwise:
2) Loop through all disks, skipping slow disks and disks with bad
blocks, and choose the best disk, patch 10. If no disk is found:
3) Look for disks with bad blocks and choose the one with the most
sectors, patch 8. If no disk is found:
4) Choose the first found slow disk with no bad blocks, or the slow disk
with the most sectors, patch 7.
Note that step 3) and step 4) are corner-case code paths, and performance
is not a concern there.
And after this patchset, we'll continue to optimize read_balance() for
step 2), specifically how to choose the best rdev to read.
[1] https://lore.kernel.org/all/20240102125115.129261-1-paul.e.luse@linux.intel.com/
Yu Kuai (10):
md: add a new helper rdev_has_badblock()
md: record nonrot rdevs while adding/removing rdevs to conf
md/raid1: fix choose next idle in read_balance()
md/raid1-10: add a helper raid1_check_read_range()
md/raid1-10: factor out a new helper raid1_should_read_first()
md/raid1: factor out read_first_rdev() from read_balance()
md/raid1: factor out choose_slow_rdev() from read_balance()
md/raid1: factor out choose_bb_rdev() from read_balance()
md/raid1: factor out the code to manage sequential IO
md/raid1: factor out helpers to choose the best rdev from
read_balance()
drivers/md/md.c | 28 ++-
drivers/md/md.h | 12 ++
drivers/md/raid1-10.c | 69 +++++++
drivers/md/raid1.c | 454 ++++++++++++++++++++++++------------------
drivers/md/raid10.c | 66 ++----
drivers/md/raid5.c | 35 ++--
6 files changed, 402 insertions(+), 262 deletions(-)
Comments
Dear Kuai, dear Paul,

Thank you for your work. Some nits and a request for benchmarks below.

Am 22.02.24 um 08:57 schrieb Yu Kuai:
> From: Yu Kuai <yukuai3@huawei.com>
>
> The orignial idea is that Paul want to optimize raid1 read

original

> performance([1]), however, we think that the orignial code for

original

> read_balance() is quite complex, and we don't want to add more
> complexity. Hence we decide to refactor read_balance() first, to make
> code cleaner and easier for follow up.
>
> Before this patchset, read_balance() has many local variables and many
> braches, it want to consider all the scenarios in one iteration. The

branches

> idea of this patch is to devide them into 4 different steps:

divide

> 1) If resync is in progress, find the first usable disk, patch 5;
> Otherwise:
> 2) Loop through all disks and skipping slow disks and disks with bad
> blocks, choose the best disk, patch 10. If no disk is found:
> 3) Look for disks with bad blocks and choose the one with most number of
> sectors, patch 8. If no disk is found:
> 4) Choose first found slow disk with no bad blocks, or slow disk with
> most number of sectors, patch 7.
>
> Note that step 3) and step 4) are super code path, and performance
> should not be considered.
>
> And after this patchset, we'll continue to optimize read_balance for
> step 2), specifically how to choose the best rdev to read.

Is there a change in performance with the current patch set? Is raid1 well enough covered by the test suite?

Kind regards,

Paul

> [1] https://lore.kernel.org/all/20240102125115.129261-1-paul.e.luse@linux.intel.com/
[...]
Hi,

在 2024/02/22 16:40, Paul Menzel 写道:
> Is there a change in performance with the current patch set? Is radi1
> well enough covered by the test suite?

Yes, there is no performance degradation, and the mdadm tests passed. And Paul Luse also ran an fio mixed workload with data integrity checks and it passed.

Thanks,
Kuai
> On Feb 22, 2024, at 2:08 AM, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2024/02/22 16:40, Paul Menzel 写道:
>> Is there a change in performance with the current patch set? Is radi1 well enough covered by the test suite?
>
> Yes, there are no performance degradation, and mdadm tests passed. And
> Paul Luse also ran fio mixed workload w/data integrity and it passes.
>
> Thanks,
> Kuai
>

Kuai is correct. In my original perf improvement patch I included lots of results. For this set, where we just refactored, I checked performance to assure we didn't go downhill, but didn't save the results as the deltas were in the noise. After this series lands we will look at introducing performance improvements again, and at that time results from a full performance sweep will be included.

For data integrity, 1- and 2-disk mirrors were run overnight with fio and a crc32 check - no issues.

Assuring that other code paths execute as they did before was a little trickier without a unit test framework, but for those cases I did modify/un-modify the code several times to follow various code paths and assure they're working as expected (i.e. bad blocks, etc.)

-Paul
Dear Paul, dear Kuai,

Am 22.02.24 um 14:04 schrieb Luse, Paul E:
>> On Feb 22, 2024, at 2:08 AM, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>> 在 2024/02/22 16:40, Paul Menzel 写道:
>>> Is there a change in performance with the current patch set? Is
>>> radi1 well enough covered by the test suite?
>>
>> Yes, there are no performance degradation, and mdadm tests passed.
>> And Paul Luse also ran fio mixed workload w/data integrity and it
>> passes.
>
> Kuai is correct, in my original perf improvement patch I included
> lots of results. For this set where we just refactored I checked
> performance to assure we didn't go downhill but didn't save the
> results as deltas were in the noise. After this series lands we will
> look at introducing performance improvements again and at that time
> results from a full performance sweep will be included.
>
> For data integrity, 1 and 2 disk mirrors were ran overnight w/fio and
> crcr32 check - no issues.
>
> To assure other code paths execute as they did before was a little
> trickier without a unit test framework but for those cases I did
> modify/un-modify the code several times to follow various code paths
> and assure they're working as expected (ie bad blocks, etc)

Thank you very much for the elaborate response. In our infrastructure, we often notice things improve, but we sometimes also have the “feeling” that things get worse. As IO is so complex, I find it always helpful to exactly note down the test setup and the run tests. So thank you for responding.

Kind regards,

Paul
Hi Kuai

Thanks for the effort!

On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> the case choose next idle in read_balance():
>
> read_balance:
>  for_each_rdev
>   if(next_seq_sect == this_sector || disk == 0)

typo error: s/disk/dist/g

>   -> sequential reads
>    best_disk = disk;
>    if (...)
>     choose_next_idle = 1
>     continue;
>
>  for_each_rdev
>  -> iterate next rdev
>   if (pending == 0)
>    best_disk = disk;
>    -> choose the next idle disk
>    break;
>
>   if (choose_next_idle)
>    -> keep using this rdev if there are no other idle disk
>    continue
>
> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> remove the code:
>
> -       /* If device is idle, use it */
> -       if (pending == 0) {
> -               best_disk = disk;
> -               break;
> -       }
>
> Hence choose next idle will never work now, fix this problem by
> following:
>
> 1) don't set best_disk in this case, read_balance() will choose the best
> disk after iterating all the disks;
> 2) add 'pending' so that other idle disk will be chosen;
> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
> are rotational, this disk will still be chosen;
>
> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/raid1.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index c60ea58ae8c5..d0bc67e6d068 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>  	unsigned int min_pending;
>  	struct md_rdev *rdev;
>  	int choose_first;
> -	int choose_next_idle;
>
>  	/*
>  	 * Check if we can balance. We can balance on the whole
> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>  	best_pending_disk = -1;
>  	min_pending = UINT_MAX;
>  	best_good_sectors = 0;
> -	choose_next_idle = 0;
>  	clear_bit(R1BIO_FailFast, &r1_bio->state);
>
>  	if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>  			int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>  			struct raid1_info *mirror = &conf->mirrors[disk];
>
> -			best_disk = disk;
>  			/*
>  			 * If buffered sequential IO size exceeds optimal
>  			 * iosize, check if there is idle disk. If yes, choose
> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>  			    mirror->next_seq_sect > opt_iosize &&
>  			    mirror->next_seq_sect - opt_iosize >=
>  			    mirror->seq_start) {
> -				choose_next_idle = 1;
> -				continue;
> +				/*
> +				 * Add 'pending' to avoid choosing this disk if
> +				 * there is other idle disk.
> +				 * Set 'dist' to 0, so that if there is no other
> +				 * idle disk and all disks are rotational, this
> +				 * disk will still be chosen.
> +				 */
> +				pending++;
> +				dist = 0;

There is a problem. If all disks are not idle and there is a disk with
dist=0 before the seq disk, it can't read from the seq disk. It will
read from the first disk with dist=0. Maybe we can only add the codes
which are removed from 2e52d449bcec?

Best Regards
Xiao

> +			} else {
> +				best_disk = disk;
> +				break;
> 			}
> -			break;
> 		}
>
> -		if (choose_next_idle)
> -			continue;
> -
> 		if (min_pending > pending) {
> 			min_pending = pending;
> 			best_pending_disk = disk;
> --
> 2.39.2
>
>
On Mon, Feb 26, 2024 at 5:12 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2024/02/26 16:55, Xiao Ni 写道:
> > Hi Kuai
> >
> > Thanks for the effort!
[...]
> >> +				pending++;
> >> +				dist = 0;
> >
> > There is a problem. If all disks are not idle and there is a disk with
> > dist=0 before the seq disk, it can't read from the seq disk. It will
> > read from the first disk with dist=0. Maybe we can only add the codes
> > which are removed from 2e52d449bcec?
>
> If there is a disk with disk=0, then best_dist_disk will be updated to
> the disk, and best_dist will be updated to 0 already:
>
> // in each iteration
> if (dist < best_dist) {
>         best_dist = dist;
>         btest_disk_disk = disk;
> }
>
> In this case, best_dist will be set to the first disk with dist=0, and
> at last, the disk will be chosen:
>
> if (best_disk == -1) {
>         if (has_nonrot_disk || min_pending == 0)
>                 best_disk = best_pending_disk;
>         else
>                 best_disk = best_dist_disk;
>                 -> the first disk with dist=0;
> }
>
> So, the problem that you concerned should not exist.

Hi Kuai

Thanks for the explanation. You're right. It chooses the first disk
which has dist==0. In the above, you made the same typo error disk=0
as the comment. I guess you want to use dist=0, right? Beside this,
this patch is good to me.

Best Regards
Xiao

> Thanks,
> Kuai
[...]
On Mon, Feb 26, 2024 at 5:40 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2024/02/26 17:24, Xiao Ni 写道:
[...]
> > Thanks for the explanation. You're right. It chooses the first disk
> > which has dist==0. In the above, you made the same typo error disk=0
> > as the comment. I guess you want to use dist=0, right? Beside this,
> > this patch is good to me.
>
> Yes, and Paul change the name 'best_dist' to 'closest_dist_disk',
> and 'btest_disk_disk' to 'closest_dist' in the last patch to avoid typo
> like this. :)

Ah, thanks :) I haven't been there.

Regards
Xiao

> Thanks,
> Kuai
[...]
Hi Kuai and Paul,

On Thu, Feb 22, 2024 at 12:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> The orignial idea is that Paul want to optimize raid1 read
> performance([1]), however, we think that the orignial code for
> read_balance() is quite complex, and we don't want to add more
> complexity. Hence we decide to refactor read_balance() first, to make
> code cleaner and easier for follow up.
>
> Before this patchset, read_balance() has many local variables and many
> braches, it want to consider all the scenarios in one iteration. The
> idea of this patch is to devide them into 4 different steps:
>
> 1) If resync is in progress, find the first usable disk, patch 5;
> Otherwise:
> 2) Loop through all disks and skipping slow disks and disks with bad
> blocks, choose the best disk, patch 10. If no disk is found:
> 3) Look for disks with bad blocks and choose the one with most number of
> sectors, patch 8. If no disk is found:
> 4) Choose first found slow disk with no bad blocks, or slow disk with
> most number of sectors, patch 7.

Thanks for your great work in this set. It looks great. Please address feedback from folks and send v2. We can still get this in the 6.9 merge window.

Thanks,
Song
On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> the case choose next idle in read_balance():
[...]
> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>  			    mirror->next_seq_sect > opt_iosize &&
>  			    mirror->next_seq_sect - opt_iosize >=
>  			    mirror->seq_start) {
> -				choose_next_idle = 1;
> -				continue;
> +				/*
> +				 * Add 'pending' to avoid choosing this disk if
> +				 * there is other idle disk.
> +				 * Set 'dist' to 0, so that if there is no other
> +				 * idle disk and all disks are rotational, this
> +				 * disk will still be chosen.
> +				 */
> +				pending++;
> +				dist = 0;
> +			} else {
> +				best_disk = disk;
> +				break;
> 			}
> -			break;
> 		}

Hi Kuai

I noticed something. In patch 12cee5a8a29e, it sets best_disk if it's a
sequential read. If there are no other idle disks, it will read from the
sequential disk. With this patch, it reads from the best_pending_disk
even when min_pending is not 0. It looks like a wrong behaviour?

Best Regards
Xiao

> -		if (choose_next_idle)
> -			continue;
> -
> 		if (min_pending > pending) {
> 			min_pending = pending;
> 			best_pending_disk = disk;
> --
> 2.39.2
>
>
On Tue, Feb 27, 2024 at 10:38 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2024/02/27 10:23, Xiao Ni wrote:
> > On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> From: Yu Kuai <yukuai3@huawei.com>
> >>
> >> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") added
> >> the 'choose next idle' case in read_balance():
> >>
> >> [...]
> >>
> >> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> removed the code:
> >>
> >> -               /* If device is idle, use it */
> >> -               if (pending == 0) {
> >> -                       best_disk = disk;
> >> -                       break;
> >> -               }
> >>
> >> Hence 'choose next idle' never works now; fix this problem as follows:
> >>
> >> 1) don't set best_disk in this case; read_balance() will choose the best
> >>    disk after iterating over all the disks;
> >> 2) increment 'pending' so that another idle disk will be chosen;
> >> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
> >>    are rotational, this disk will still be chosen;
> >>
> >> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> >> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> >> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> >> ---
> >> [...]
> >
> > Hi Kuai
> >
> > I noticed something. Patch 12cee5a8a29e sets best_disk for a
> > sequential read. If there are no other idle disks, it will read from
> > the sequential disk. With this patch, it reads from the
> > best_pending_disk even when min_pending is not 0. That looks like
> > wrong behaviour?
>
> Yes, nice catch, I hadn't noticed this... So there is a hidden rule:
> sequential IO priority is higher than the minimal-'pending' selection;
> it only ranks below 'choose_next_idle' when an idle disk exists.

Yes.

>
> Looks like if we want to keep this behaviour, we can add a 'sequential
> disk':
>
>         if (is_sequential())
>                 if (!should_choose_next())
>                         return disk;
>                 ctl.sequential_disk = disk;
>
>         ...
>
>         if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
>                 return ctl.sequential_disk;

Agree with this, thanks :)

Best Regards
Xiao

> Thanks,
> Kuai
>
> >
> > Best Regards
> > Xiao
> >>
> >> [...]