Message ID: <20240227120327.1432511-1-yukuai1@huaweicloud.com>
From: Yu Kuai <yukuai1@huaweicloud.com>
To: xni@redhat.com, paul.e.luse@linux.intel.com, song@kernel.org, shli@fb.com, neilb@suse.com
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org, yukuai3@huawei.com, yukuai1@huaweicloud.com, yi.zhang@huawei.com, yangerkun@huawei.com
Subject: [PATCH md-6.9 v2 00/10] md/raid1: refactor read_balance() and some minor fix
Date: Tue, 27 Feb 2024 20:03:17 +0800
Series: md/raid1: refactor read_balance() and some minor fix
Message
Yu Kuai
Feb. 27, 2024, 12:03 p.m. UTC
From: Yu Kuai <yukuai3@huawei.com>
Changes in v2:
- add a new counter in conf for patch 2;
- fix the 'choose next idle' case when there is no other idle disk, in
  patch 3;
- add review tags from Xiao Ni for patches 1 and 4-8
The original idea came from Paul, who wanted to optimize raid1 read
performance ([1]). However, the existing read_balance() code is quite
complex, and we don't want to add more complexity on top of it. Hence we
decided to refactor read_balance() first, to make the code cleaner and
easier to build on.
Before this patchset, read_balance() had many local variables and many
branches; it tried to consider all the scenarios in one iteration. The
idea of this patchset is to divide the work into 4 different steps:
1) If resync is in progress, find the first usable disk (patch 5);
   otherwise:
2) Loop through all disks, skipping slow disks and disks with bad
   blocks, and choose the best disk (patch 10). If no disk is found:
3) Look among disks with bad blocks and choose the one with the most
   readable sectors (patch 8). If no disk is found:
4) Choose the first slow disk with no bad blocks, or the slow disk with
   the most readable sectors (patch 7).
Note that steps 3) and 4) are corner-case code paths, so performance is
not a concern there.
After this patchset, we'll continue to optimize read_balance() for
step 2), specifically how to choose the best rdev to read from.
[1] https://lore.kernel.org/all/20240102125115.129261-1-paul.e.luse@linux.intel.com/
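The four steps above can be condensed into a small standalone model. This is illustrative C only — the struct, its fields, and the helper names merely echo the patch titles and are invented for this sketch; the real code operates on struct md_rdev inside the kernel and also weighs distance and pending IO:

```c
#include <assert.h>
#include <stdbool.h>

#define NDISKS 4

/* Toy per-disk state; the real code works on struct md_rdev. */
struct disk {
    bool usable;        /* in-sync and not faulty */
    bool slow;          /* e.g. a write-mostly device */
    bool has_badblocks; /* bad blocks inside the requested range */
    int  good_sectors;  /* readable sectors starting at the request */
};

/* Step 1: resync in progress -> just take the first usable disk. */
static int read_first_disk(const struct disk *d)
{
    for (int i = 0; i < NDISKS; i++)
        if (d[i].usable)
            return i;
    return -1;
}

/* Step 2: best disk among fast disks without bad blocks. */
static int choose_best_disk(const struct disk *d)
{
    for (int i = 0; i < NDISKS; i++)
        if (d[i].usable && !d[i].slow && !d[i].has_badblocks)
            return i; /* real code also balances distance/pending IO */
    return -1;
}

/* Step 3: among disks with bad blocks, the one with most readable sectors. */
static int choose_bb_disk(const struct disk *d)
{
    int best = -1, most = 0;
    for (int i = 0; i < NDISKS; i++)
        if (d[i].usable && !d[i].slow && d[i].has_badblocks &&
            d[i].good_sectors > most) {
            most = d[i].good_sectors;
            best = i;
        }
    return best;
}

/* Step 4: fall back to a slow disk. */
static int choose_slow_disk(const struct disk *d)
{
    for (int i = 0; i < NDISKS; i++)
        if (d[i].usable && d[i].slow)
            return i; /* real code prefers one without bad blocks */
    return -1;
}

static int read_balance_sketch(const struct disk *d, bool resync)
{
    if (resync)
        return read_first_disk(d);  /* step 1 */
    int idx = choose_best_disk(d);  /* step 2 */
    if (idx < 0)
        idx = choose_bb_disk(d);    /* step 3 */
    if (idx < 0)
        idx = choose_slow_disk(d);  /* step 4 */
    return idx;
}
```

The point of the structure is that each later step only runs when every earlier step has failed to find a disk, so the corner cases never burden the common path.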
Yu Kuai (10):
md: add a new helper rdev_has_badblock()
md/raid1: record nonrot rdevs while adding/removing rdevs to conf
md/raid1: fix choose next idle in read_balance()
md/raid1-10: add a helper raid1_check_read_range()
md/raid1-10: factor out a new helper raid1_should_read_first()
md/raid1: factor out read_first_rdev() from read_balance()
md/raid1: factor out choose_slow_rdev() from read_balance()
md/raid1: factor out choose_bb_rdev() from read_balance()
md/raid1: factor out the code to manage sequential IO
md/raid1: factor out helpers to choose the best rdev from
read_balance()
drivers/md/md.h | 11 +
drivers/md/raid1-10.c | 69 +++++++
drivers/md/raid1.c | 465 +++++++++++++++++++++++++-----------------
drivers/md/raid1.h | 1 +
drivers/md/raid10.c | 58 ++----
drivers/md/raid5.c | 35 ++--
6 files changed, 391 insertions(+), 248 deletions(-)
Comments
On Tue, Feb 27, 2024 at 4:09 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> [...]

v2 looks good to me. Thanks! I will give Xiao some more time to review
it one more time before pushing it to md-6.9.

Song
On Tue, Feb 27, 2024 at 8:09 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> the case choose next idle in read_balance():
>
> read_balance:
>  for_each_rdev
>   if(next_seq_sect == this_sector || dist == 0)
>   -> sequential reads
>    best_disk = disk;
>    if (...)
>     choose_next_idle = 1
>     continue;
>
>  for_each_rdev
>  -> iterate next rdev
>   if (pending == 0)
>    best_disk = disk;
>    -> choose the next idle disk
>    break;
>
>   if (choose_next_idle)
>    -> keep using this rdev if there are no other idle disk
>    contine
>
> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> remove the code:
>
> -               /* If device is idle, use it */
> -               if (pending == 0) {
> -                       best_disk = disk;
> -                       break;
> -               }
>
> Hence choose next idle will never work now, fix this problem by
> following:
>
> 1) don't set best_disk in this case, read_balance() will choose the best
>    disk after iterating all the disks;
> 2) add 'pending' so that other idle disk will be chosen;
> 3) add a new local variable 'sequential_disk' to record the disk, and if
>    there is no other idle disk, 'sequential_disk' will be chosen;
>
> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/raid1.c | 32 ++++++++++++++++++++++----------
>  1 file changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 0fed01b06de9..fc5899fb08c1 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -598,13 +598,12 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>         const sector_t this_sector = r1_bio->sector;
>         int sectors;
>         int best_good_sectors;
> -       int best_disk, best_dist_disk, best_pending_disk;
> +       int best_disk, best_dist_disk, best_pending_disk, sequential_disk;
>         int disk;
>         sector_t best_dist;
>         unsigned int min_pending;
>         struct md_rdev *rdev;
>         int choose_first;
> -       int choose_next_idle;
>
>         /*
>          * Check if we can balance. We can balance on the whole
> @@ -615,11 +614,11 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>         sectors = r1_bio->sectors;
>         best_disk = -1;
>         best_dist_disk = -1;
> +       sequential_disk = -1;
>         best_dist = MaxSector;
>         best_pending_disk = -1;
>         min_pending = UINT_MAX;
>         best_good_sectors = 0;
> -       choose_next_idle = 0;
>         clear_bit(R1BIO_FailFast, &r1_bio->state);
>
>         if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> @@ -712,7 +711,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                         int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>                         struct raid1_info *mirror = &conf->mirrors[disk];
>
> -                       best_disk = disk;
>                         /*
>                          * If buffered sequential IO size exceeds optimal
>                          * iosize, check if there is idle disk. If yes, choose
> @@ -731,15 +729,22 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                             mirror->next_seq_sect > opt_iosize &&
>                             mirror->next_seq_sect - opt_iosize >=
>                             mirror->seq_start) {
> -                               choose_next_idle = 1;
> -                               continue;
> +                               /*
> +                                * Add 'pending' to avoid choosing this disk if
> +                                * there is other idle disk.
> +                                */
> +                               pending++;
> +                               /*
> +                                * If there is no other idle disk, this disk
> +                                * will be chosen.
> +                                */
> +                               sequential_disk = disk;
> +                       } else {
> +                               best_disk = disk;
> +                               break;
>                         }
> -                       break;
>                 }
>
> -               if (choose_next_idle)
> -                       continue;
> -
>                 if (min_pending > pending) {
>                         min_pending = pending;
>                         best_pending_disk = disk;
> @@ -751,6 +756,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                 }
>         }
>
> +       /*
> +        * sequential IO size exceeds optimal iosize, however, there is no other
> +        * idle disk, so choose the sequential disk.
> +        */
> +       if (best_disk == -1 && min_pending != 0)
> +               best_disk = sequential_disk;
> +
>         /*
>          * If all disks are rotational, choose the closest disk. If any disk is
>          * non-rotational, choose the disk with less pending request even the
> --
> 2.39.2
>

Hi all,

This patch looks good to me.

Reviewed-by: Xiao Ni <xni@redhat.com>
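The behavior of the fix quoted above can be modeled in a few lines of standalone C. This is a sketch only — `pick()` and its parameters are invented for the illustration; the real read_balance() additionally weighs distance, rotational state, and bad blocks:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Toy model of the fixed 'choose next idle' logic. Each disk has a
 * pending-IO count; seq_disk is the disk whose last read ended at this
 * sector, and seq_exceeds_opt says whether its buffered sequential IO
 * already exceeds the optimal iosize.
 */
static int pick(const int pending_in[], int n, int seq_disk,
                bool seq_exceeds_opt)
{
    int best_disk = -1, best_pending_disk = -1, sequential_disk = -1;
    unsigned int min_pending = UINT_MAX;

    for (int i = 0; i < n; i++) {
        int pending = pending_in[i];

        if (i == seq_disk) {
            if (!seq_exceeds_opt)
                return i;          /* keep streaming from this disk */
            pending++;             /* bias against it if another disk is idle */
            sequential_disk = i;   /* but remember it as the fallback */
        }
        if ((unsigned int)pending < min_pending) {
            min_pending = pending;
            best_pending_disk = i;
        }
    }

    /* No idle disk anywhere: stay with the sequential disk. */
    if (best_disk == -1 && min_pending != 0)
        best_disk = sequential_disk;
    /* Otherwise fall back to the least-loaded disk. */
    if (best_disk == -1)
        best_disk = best_pending_disk;
    return best_disk;
}
```

This captures the two points of the fix: an idle disk elsewhere wins over an over-long sequential stream, but when nothing is idle the sequential disk is kept instead of being abandoned.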