Message ID | 20230606051217.2064-1-iecedge@gmail.com |
---|---|
State | New |
Headers |
From: Jianlin Lv <iecedge@gmail.com>
To: jejb@linux.ibm.com, martin.petersen@oracle.com, paulmck@kernel.org, bp@suse.de, peterz@infradead.org, will@kernel.org, rdunlap@infradead.org, kim.phillips@amd.com, rostedt@goodmis.org, wyes.karny@amd.com
Cc: iecedge@gmail.com, jianlv@ebay.com, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-scsi@vger.kernel.org
Subject: [PATCH] scsi: sd: support specify probe type of build-in driver
Date: Tue, 6 Jun 2023 05:12:17 +0000
Message-Id: <20230606051217.2064-1-iecedge@gmail.com>
X-Mailer: git-send-email 2.25.1 |
Series |
scsi: sd: support specify probe type of build-in driver
|
|
Commit Message
Jianlin Lv
June 6, 2023, 5:12 a.m. UTC
When SCSI disks are located on different SCSI hosts within a system,
asynchronous detection can lead to non-deterministic SCSI disk names.
This patch introduces the 'sd_probe_type=' kernel boot parameter.
In scenarios where SCSI disk name sensitivity is crucial, the probe type
of the built-in sd driver can be set to synchronous. As a result,
the SCSI disk names are deterministic.
Signed-off-by: Jianlin Lv <iecedge@gmail.com>
---
.../admin-guide/kernel-parameters.txt | 9 ++++++++
drivers/scsi/sd.c | 23 +++++++++++++++++++
2 files changed, 32 insertions(+)
Comments
On 6/5/23 22:12, Jianlin Lv wrote:
> In scenarios where SCSI disk name sensitivity is crucial, the probe type
> of the built-in sd driver can be set to synchronous. As a result,
> the SCSI disk names are deterministic.

Which are these scenarios?

Additionally, how can synchronous scanning of sd devices make a
difference if there are multiple host bus adapters that use an interface
type that is scanned asynchronously?

Bart.
On Tue, Jun 06, 2023 at 05:12:17AM +0000, Jianlin Lv wrote:
> When SCSI disks are located on different SCSI hosts within a system,
> asynchronous detection can lead to non-deterministic SCSI disk names.

Yes, as can various other conditions. Your code better be able to deal
with that.
On Wed, Jun 7, 2023 at 1:38 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/5/23 22:12, Jianlin Lv wrote:
> > In scenarios where SCSI disk name sensitivity is crucial, the probe type
> > of the built-in sd driver can be set to synchronous. As a result,
> > the SCSI disk names are deterministic.
>
> Which are these scenarios?
>
> Additionally, how can synchronous scanning of sd devices make a
> difference if there are multiple host bus adapters that use an interface
> type that is scanned asynchronously?
>
> Bart.

The change was prompted by an issue with SCSI devices being probed
non-deterministically.

On the affected node, there are two types of SCSI hosts:
1. MegaRAID adapters associated with 24 local disks. The disks are named
   sequentially as "sda," "sdb," and so on, up to "sdx."
2. SATA controllers associated with the root disk, named "sdy."

Both the MegaRAID adapters and the SATA controller (PCH) are accessed via
the PCIe bus. In theory, depending on their PCIe bus ID in ascending order,
the devices should be initialized in ascending order as well.

However, the SCSI driver currently probes devices asynchronously to allow
for more parallelism:

__driver_attach
  ->if (driver_allows_async_probing(drv))
        async_schedule_dev(__driver_attach_async_helper, dev);

While the SCSI disks attached to the MegaRAID adapters are being probed,
the root disk may be probed as well, which makes the disk naming
inconsistent. For example, if the root disk is probed in the middle, it is
named "sdq". The SCSI disks probed after it then have their names drift,
from "sdr" up to "sdy".

For cloud deployment, the local volume provisioner detects and creates PVs
for each local disk (from sda to sdx) on the host, and it cleans up the
disks when they are released.
This requires the logical names of the disks to be deterministic.

Therefore, I have submitted this patch to allow users to configure the SCSI
disk probe type. If synchronous probing is configured, the SCSI disk probing
order is deterministic and will follow the ascending order of the PCIe bus
ID.

Jianlin
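The name drift described above follows from how an sd index turns into a device name. The following is a minimal userspace sketch of that mapping, not the kernel's implementation: the function name `sd_name_from_index` is hypothetical, and the bijective base-26 scheme is an assumption based on the familiar sda ... sdz, sdaa sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace sketch (not the kernel's code): map a zero-based probe index
 * to an "sd" name using bijective base 26, so 0 -> "sda", 25 -> "sdz",
 * 26 -> "sdaa". Returns 0 on success, -1 on a negative index or when buf
 * is too small.
 */
int sd_name_from_index(int index, char *buf, size_t buflen)
{
	char tmp[16];			/* suffix letters, built right to left */
	size_t pos = sizeof(tmp) - 1;
	size_t len;

	if (index < 0)
		return -1;

	tmp[pos] = '\0';
	do {
		assert(pos > 0);	/* a 32-bit index needs at most 7 letters */
		tmp[--pos] = (char)('a' + index % 26);
		index = index / 26 - 1;
	} while (index >= 0);

	len = sizeof(tmp) - pos;	/* suffix length including the NUL */
	if (buflen < 2 + len)
		return -1;

	memcpy(buf, "sd", 2);
	memcpy(buf + 2, tmp + pos, len);
	return 0;
}
```

Under this mapping, the 25 disks on the node take indexes 0 to 24. A root disk that happens to be probed seventeenth gets index 16 and therefore "sdq", and the last MegaRAID disk ends up at index 24, "sdy", which matches the example in the message.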
On 6/7/23 08:55, Jianlin Lv wrote:
> 1. MegaRAID adapters associated with 24 local disks. The disks are named
>    sequentially as "sda," "sdb," and so on, up to "sdx."
> 2. SATA controllers associated with the root disk, named "sdy."
>
> Both the MegaRAID adapters and the SATA controller (PCH) are accessed via
> the PCIe bus. In theory, depending on their PCIe bus ID in ascending order,
> the devices should be initialized in ascending order as well.

Hmm ... I don't think there is anything that prevents the PCIe maintainer
from changing the PCIe probing behavior from synchronous to asynchronous?
In other words, I don't think it is safe to assume that PCIe devices are
always scanned in the same order.

> For cloud deployment, the local volume provisioner detects and creates PVs
> for each local disk (from sda to sdx) on the host, and it cleans up the
> disks when they are released.
> This requires the logical names of the disks to be deterministic.

I see two possible solutions:
- Change the volume provisioner such that it uses disk references that do
  not depend on the probing order, e.g. /dev/disk/by-id/...
- Implement an algorithm in systemd that makes disk names predictable.
  An explanation of how predictable names work for network interfaces is
  available here: https://wiki.debian.org/NetworkInterfaceNames. The
  systemd documentation about predictable network names is available here:
  https://www.freedesktop.org/software/systemd/man/systemd.net-naming-scheme.html

These alternatives have the advantage that disk scanning remains asynchronous.

Thanks,

Bart.
On Thu, Jun 8, 2023 at 1:07 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/7/23 08:55, Jianlin Lv wrote:
> > 1. MegaRAID adapters associated with 24 local disks. The disks are named
> >    sequentially as "sda," "sdb," and so on, up to "sdx."
> > 2. SATA controllers associated with the root disk, named "sdy."
> >
> > Both the MegaRAID adapters and the SATA controller (PCH) are accessed via
> > the PCIe bus. In theory, depending on their PCIe bus ID in ascending order,
> > the devices should be initialized in ascending order as well.
>
> Hmm ... I don't think there is anything that prevents the PCIe maintainer
> from changing the PCIe probing behavior from synchronous to asynchronous?
> In other words, I don't think it is safe to assume that PCIe devices are
> always scanned in the same order.
>
> > For cloud deployment, the local volume provisioner detects and creates PVs
> > for each local disk (from sda to sdx) on the host, and it cleans up the
> > disks when they are released.
> > This requires the logical names of the disks to be deterministic.
>
> I see two possible solutions:
> - Change the volume provisioner such that it uses disk references that do
>   not depend on the probing order, e.g. /dev/disk/by-id/...

Yes, "/dev/disk/by-id/" can uniquely identify SCSI devices. However,
I don't think it is suitable for the volume provisioner workflow.
For nodes of the same SKU, a unified YAML file will be defined to instruct
the volume provisioner on how to manage the local disks.
If WWIDs are used, it would mean that a unique YAML file needs to be defined
for each node. This approach becomes impractical when dealing with a large
number of worker nodes.

Jianlin

> - Implement an algorithm in systemd that makes disk names predictable.
>   An explanation of how predictable names work for network interfaces is
>   available here: https://wiki.debian.org/NetworkInterfaceNames. The
>   systemd documentation about predictable network names is available here:
>   https://www.freedesktop.org/software/systemd/man/systemd.net-naming-scheme.html
>
> These alternatives have the advantage that disk scanning remains asynchronous.
>
> Thanks,
>
> Bart.
>
On Wed, Jun 7, 2023 at 10:10 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, Jun 06, 2023 at 05:12:17AM +0000, Jianlin Lv wrote:
> > When SCSI disks are located on different SCSI hosts within a system,
> > asynchronous detection can lead to non-deterministic SCSI disk names.
>
> Yes, as can various other conditions. Your code better be able to deal
> with that.

Could you give an example of such "other conditions"?

Jianlin
On 6/7/23 19:51, Jianlin Lv wrote:
> On Thu, Jun 8, 2023 at 1:07 AM Bart Van Assche <bvanassche@acm.org> wrote:
>> On 6/7/23 08:55, Jianlin Lv wrote:
>> I see two possible solutions:
>> - Change the volume provisioner such that it uses disk references that do
>>   not depend on the probing order, e.g. /dev/disk/by-id/...
>
> Yes, "/dev/disk/by-id/" can uniquely identify SCSI devices. However,
> I don't think it is suitable for the volume provisioner workflow.
> For nodes of the same SKU, a unified YAML file will be defined to instruct
> the volume provisioner on how to manage the local disks.
> If WWIDs are used, it would mean that a unique YAML file needs to be defined
> for each node. This approach becomes impractical when dealing with a large
> number of worker nodes.

Please consider using the paths available in /dev/disk/by-path.

Thanks,

Bart.
On Fri, Jun 9, 2023 at 12:23 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/7/23 19:51, Jianlin Lv wrote:
> > On Thu, Jun 8, 2023 at 1:07 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >> On 6/7/23 08:55, Jianlin Lv wrote:
> >> I see two possible solutions:
> >> - Change the volume provisioner such that it uses disk references that do
> >>   not depend on the probing order, e.g. /dev/disk/by-id/...
> >
> > Yes, "/dev/disk/by-id/" can uniquely identify SCSI devices. However,
> > I don't think it is suitable for the volume provisioner workflow.
> > For nodes of the same SKU, a unified YAML file will be defined to instruct
> > the volume provisioner on how to manage the local disks.
> > If WWIDs are used, it would mean that a unique YAML file needs to be defined
> > for each node. This approach becomes impractical when dealing with a large
> > number of worker nodes.
>
> Please consider using the paths available in /dev/disk/by-path.

Sorry for the late reply.
I carefully checked the servers in the production environment and found some
corner cases where the /dev/disk/by-path/ entries differ between nodes with
the same SKU. These differences are caused by inconsistent target numbers.

For example:

diff -y aa-by-path bb-by-path
pci-0000:86:00.0-scsi-0:3:86:0 -> ../../sda | pci-0000:86:00.0-scsi-0:3:88:0 -> ../../sda
pci-0000:86:00.0-scsi-0:3:87:0 -> ../../sdb | pci-0000:86:00.0-scsi-0:3:89:0 -> ../../sdb
pci-0000:86:00.0-scsi-0:3:88:0 -> ../../sdc | pci-0000:86:00.0-scsi-0:3:90:0 -> ../../sdc
pci-0000:86:00.0-scsi-0:3:89:0 -> ../../sdd | pci-0000:86:00.0-scsi-0:3:91:0 -> ../../sdd
pci-0000:86:00.0-scsi-0:3:90:0 -> ../../sde | pci-0000:86:00.0-scsi-0:3:92:0 -> ../../sde
pci-0000:86:00.0-scsi-0:3:92:0 -> ../../sdf | pci-0000:86:00.0-scsi-0:3:93:0 -> ../../sdf
pci-0000:86:00.0-scsi-0:3:93:0 -> ../../sdg | pci-0000:86:00.0-scsi-0:3:94:0 -> ../../sdg
pci-0000:86:00.0-scsi-0:3:94:0 -> ../../sdh | pci-0000:86:00.0-scsi-0:3:95:0 -> ../../sdh

I'm still not sure what causes the target numbers to be different. However,
the existence of such corner cases makes /dev/disk/by-path unusable for the
volume provisioner, similar to /dev/disk/by-id/.

So, if it's not possible to configure serialized disk probing, then it
seems that implementing predictable disk names is the only option.

Jianlin

>
> Thanks,
>
> Bart.
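To make the difference in the listing concrete, a by-path name can be split into its fields. The sketch below is only an illustration under stated assumptions: the struct and function names are hypothetical, and it assumes that the four numbers after "scsi-" are host, channel, target and LUN, which agrees with the observation in the message that the third field is the target number.

```c
#include <stdio.h>
#include <string.h>

/* Fields of a name like "pci-0000:86:00.0-scsi-0:3:86:0" (hypothetical type). */
struct scsi_by_path {
	unsigned int domain, bus, dev, fn;		/* PCI address, hexadecimal */
	unsigned int host, channel, target, lun;	/* assumed H:C:T:L, decimal */
};

/* Returns 0 when the whole name matches the expected layout, -1 otherwise. */
int parse_scsi_by_path(const char *name, struct scsi_by_path *out)
{
	int consumed = 0;
	int n;

	n = sscanf(name, "pci-%x:%x:%x.%x-scsi-%u:%u:%u:%u%n",
		   &out->domain, &out->bus, &out->dev, &out->fn,
		   &out->host, &out->channel, &out->target, &out->lun,
		   &consumed);

	/* %n is not counted in the return value, so 8 means all fields parsed. */
	if (n != 8 || name[consumed] != '\0')
		return -1;
	return 0;
}
```

Parsing the first entry from each node gives the same PCI address, host and channel, but target 86 on one node and 88 on the other, so a configuration file keyed on the full by-path name cannot be shared between the two nodes.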
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9e5bab29685f..083f741d63bb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5611,6 +5611,15 @@
 			non-zero "wait" parameter.  See weight_single
 			and weight_many.
 
+	sd_probe_type=	[HW,SCSI] Manual setup probe type of built-in scsi disk driver
+			Format: <int>
+			Default: 1
+			<int> -- device driver probe type to try
+				0 - PROBE_DEFAULT_STRATEGY
+				1 - PROBE_PREFER_ASYNCHRONOUS
+				2 - PROBE_FORCE_SYNCHRONOUS
+			Example: sd_probe_type=1
+
 	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
 			xtime_lock contention on larger systems, and/or RCU lock
 			contention on all systems with CONFIG_MAXSMP set.
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 1624d528aa1f..78b80b9e5618 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -121,6 +121,9 @@ static void scsi_disk_release(struct device *cdev);
 
 static DEFINE_IDA(sd_index_ida);
 
+/* Probe type of SCSI Disk Driver */
+static int sd_probe_type = PROBE_PREFER_ASYNCHRONOUS;
+
 static mempool_t *sd_page_pool;
 static struct lock_class_key sd_bio_compl_lkclass;
 
@@ -3826,6 +3829,25 @@ static int sd_resume_runtime(struct device *dev)
 	return sd_resume(dev);
 }
 
+#ifndef MODULE
+
+/* Set the boot options to sd driver.
+ * Syntax is defined in Documentation/admin-guide/kernel-parameters.txt.
+ */
+static int __init sd_probe_setup(char *str)
+{
+	int probe_type = -1;
+
+	if (get_option(&str, &probe_type) && probe_type >= 0 && probe_type < 3)
+		sd_probe_type = probe_type;
+
+	return 1;
+}
+
+__setup("sd_probe_type=", sd_probe_setup);
+
+#endif
+
 /**
  * init_sd - entry point for this driver (both when built in or when
  * a module).
@@ -3858,6 +3880,7 @@ static int __init init_sd(void)
 		goto err_out_class;
 	}
 
+	sd_template.gendrv.probe_type = sd_probe_type;
 	err = scsi_register_driver(&sd_template.gendrv);
 	if (err)
 		goto err_out_driver;
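The range check in sd_probe_setup() can be tried outside the kernel with a small stand-in. This is a sketch under assumptions, not part of the patch: `sd_probe_type_from_arg` is a hypothetical name, strtol() replaces the kernel's get_option(), and the enum is redeclared locally with the values listed in the documentation hunk. Unlike get_option(), this version rejects any trailing characters.

```c
#include <stdlib.h>

/* Local copy of the three values documented in the patch (userspace only). */
enum probe_type {
	PROBE_DEFAULT_STRATEGY = 0,
	PROBE_PREFER_ASYNCHRONOUS = 1,
	PROBE_FORCE_SYNCHRONOUS = 2,
};

/*
 * Userspace stand-in for the validation in sd_probe_setup().
 * Returns the parsed probe type, or `current` unchanged when the value is
 * missing, malformed, or outside the range 0..2.
 */
int sd_probe_type_from_arg(const char *value, int current)
{
	char *end;
	long v;

	if (value == NULL || *value == '\0')
		return current;

	v = strtol(value, &end, 0);
	if (end == value || *end != '\0')
		return current;		/* not a clean integer */

	if (v < PROBE_DEFAULT_STRATEGY || v > PROBE_FORCE_SYNCHRONOUS)
		return current;		/* same bounds as the patch: 0 <= v < 3 */

	return (int)v;
}
```

With the patch's default of PROBE_PREFER_ASYNCHRONOUS passed as `current`, the argument "2" selects synchronous probing, while "3", "-1" or "abc" leave the default in place, mirroring how the patch ignores out-of-range values.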