| Message ID | 4653f009c3dacae8ebf3a4865aaa944aa9c7cc7e.1675802050.git.nicolinc@nvidia.com |
|---|---|
| State | New |
| Series | Add IO page table replacement support |
Commit Message
Nicolin Chen
Feb. 7, 2023, 9:18 p.m. UTC
iommu_group_replace_domain() is introduced to support use cases where an
iommu_group can be attached to a new domain without getting detached from
the old one. This replacement feature will be useful for cases such as:
1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
table with a larger table (PASID=N)
2) Nesting mode, when switching the attaching device from an S2 domain
to an S1 domain, or when switching between relevant S1 domains.
as it allows these cases to switch seamlessly without a DMA disruption.
So, call iommu_group_replace_domain() in the iommufd_device_do_attach().
And add a __iommmufd_device_detach helper to allow the replace routine to
do a partial detach on the current hwpt that's being replaced. Though the
updated locking logic is overcomplicated, it will be eased once those
iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's
allocation/destroy() functions in the coming nesting series, as that'll
depend on a new ->domain_alloc_user op in the iommu core.
Also, block replace operations that are from/to auto_domains, i.e. only
user-allocated hw_pagetables can be replaced or replaced with.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/device.c | 101 +++++++++++++++++-------
drivers/iommu/iommufd/iommufd_private.h | 2 +
2 files changed, 76 insertions(+), 27 deletions(-)
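For orientation, here is a condensed sketch of what this commit does to iommufd_device_do_attach(); the complete diff is quoted in the review replies below, and reserved-IOVA enforcement, MSI setup, and the error unwind are elided:

```c
/*
 * Condensed sketch of the attach-path change in this commit (the full
 * diff follows in the replies). Error handling and IOVA/MSI work are
 * elided; names come from the quoted driver code.
 */
static int iommufd_device_do_attach(struct iommufd_device *idev,
				    struct iommufd_hw_pagetable *hwpt)
{
	struct iommufd_hw_pagetable *cur_hwpt = idev->hwpt;
	int rc;

	/* Only the first device in the group switches the domain */
	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
		/* Replace instead of attach: no blocking-domain gap */
		rc = iommu_group_replace_domain(idev->group, hwpt->domain);
		if (rc)
			return rc;
	}

	/* Partial detach from cur_hwpt: skips iommu_detach_group() */
	if (cur_hwpt)
		__iommmufd_device_detach(idev, hwpt, false);

	idev->hwpt = hwpt;
	refcount_inc(&hwpt->obj.users);
	list_add(&idev->devices_item, &hwpt->devices);
	return 0;
}
```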
Comments
Hi Nic, > From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Wednesday, February 8, 2023 5:18 AM > > iommu_group_replace_domain() is introduced to support use cases where > an > iommu_group can be attached to a new domain without getting detached > from > the old one. This replacement feature will be useful, for cases such as: > 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0) > table with a larger table (PASID=N) > 2) Nesting mode, when switching the attaching device from an S2 domain > to an S1 domain, or when switching between relevant S1 domains. > as it allows these cases to switch seamlessly without a DMA disruption. > > So, call iommu_group_replace_domain() in the > iommufd_device_do_attach(). > And add a __iommmufd_device_detach helper to allow the replace routine > to > do a partial detach on the current hwpt that's being replaced. Though the > updated locking logic is overcomplicated, it will be eased, once those > iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's > allocation/destroy() functions in the coming nesting series, as that'll > depend on a new ->domain_alloc_user op in the iommu core. > > Also, block replace operations that are from/to auto_domains, i.e. only > user-allocated hw_pagetables can be replaced or replaced with. > > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> > --- > drivers/iommu/iommufd/device.c | 101 +++++++++++++++++------- > drivers/iommu/iommufd/iommufd_private.h | 2 + > 2 files changed, 76 insertions(+), 27 deletions(-) > > diff --git a/drivers/iommu/iommufd/device.c > b/drivers/iommu/iommufd/device.c > index b8c3e3baccb5..8a9834fc129a 100644 > --- a/drivers/iommu/iommufd/device.c > +++ b/drivers/iommu/iommufd/device.c > @@ -9,6 +9,8 @@ > #include "io_pagetable.h" > #include "iommufd_private.h" > > +MODULE_IMPORT_NS(IOMMUFD_INTERNAL); > + > static bool allow_unsafe_interrupts; > module_param(allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR); > MODULE_PARM_DESC( > @@ -194,9 +196,61 @@ static bool > iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt, > return false; > } > > +/** > + * __iommmufd_device_detach - Detach a device from idev->hwpt to > new_hwpt This function doesn't do anything to make this device attached to new_hwpt. It is done in the iommufd_device_attach_ioas(). New_hwpt here indicates if this detach requires to do some extra thing. E.g. remove reserved iova from the idev->hwpt->ioas. So may just say " Detach a device from idev->hwpt", and explain the usage of new_hwpt in the below. > + * @idev: device to detach > + * @new_hwpt: new hw_pagetable to attach (pass in NULL for a simple > detach) The new hw_pagetable to be attached. > + * @detach_group: flag to call iommu_detach_group > + * > + * This is a cleanup helper shared by the replace and detach routines. > Comparing > + * to a detach routine, a replace routine only needs a partial detach > procedure: > + * it does not need the iommu_detach_group(); it will attach the device to > a new > + * hw_pagetable after a partial detach from the currently attached > hw_pagetable, > + * so certain steps can be skipped if two hw_pagetables have the same > IOAS. 
> + */ > +static void __iommmufd_device_detach(struct iommufd_device *idev, > + struct iommufd_hw_pagetable > *new_hwpt, > + bool detach_group) > +{ > + struct iommufd_hw_pagetable *hwpt = idev->hwpt; > + struct iommufd_ioas *new_ioas = NULL; > + > + if (new_hwpt) > + new_ioas = new_hwpt->ioas; > + > + mutex_lock(&hwpt->devices_lock); > + list_del(&idev->devices_item); > + if (hwpt->ioas != new_ioas) > + mutex_lock(&hwpt->ioas->mutex); The lock order is mostly hwpt->ioas->mutex and then hwpt->devices_lock. See the iommufd_device_auto_get_domain(). If possible, may switch the order sequence here. Also, rename hwpt to be cur_hwpt, this may help reviewers to distinguish it from the hwpt in the caller of this function. It looks to be a deadlock at first look, but not after closer reading. > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) { > + if (list_empty(&hwpt->devices)) { > + iopt_table_remove_domain(&hwpt->ioas->iopt, > + hwpt->domain); > + list_del(&hwpt->hwpt_item); > + } > + if (detach_group) > + iommu_detach_group(hwpt->domain, idev->group); > + } > + if (hwpt->ioas != new_ioas) { > + iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev- > >dev); > + mutex_unlock(&hwpt->ioas->mutex); > + } > + mutex_unlock(&hwpt->devices_lock); > + > + if (hwpt->auto_domain) > + iommufd_object_destroy_user(idev->ictx, &hwpt->obj); > + else > + refcount_dec(&hwpt->obj.users); > + > + idev->hwpt = NULL; > + > + refcount_dec(&idev->obj.users); > +} > + > static int iommufd_device_do_attach(struct iommufd_device *idev, > struct iommufd_hw_pagetable *hwpt) > { > + struct iommufd_hw_pagetable *cur_hwpt = idev->hwpt; > phys_addr_t sw_msi_start = PHYS_ADDR_MAX; > int rc; > > @@ -236,7 +290,7 @@ static int iommufd_device_do_attach(struct > iommufd_device *idev, > * the group once for the first device that is in the group. > */ > if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) { > - rc = iommu_attach_group(hwpt->domain, idev->group); > + rc = iommu_group_replace_domain(idev->group, hwpt- > >domain); > if (rc) > goto out_iova; > > @@ -249,6 +303,10 @@ static int iommufd_device_do_attach(struct > iommufd_device *idev, > } > } > > + /* Replace the cur_hwpt without iommu_detach_group() */ > + if (cur_hwpt) > + __iommmufd_device_detach(idev, hwpt, false); > + > idev->hwpt = hwpt; > refcount_inc(&hwpt->obj.users); > list_add(&idev->devices_item, &hwpt->devices); > @@ -256,7 +314,10 @@ static int iommufd_device_do_attach(struct > iommufd_device *idev, > return 0; > > out_detach: > - iommu_detach_group(hwpt->domain, idev->group); > + if (cur_hwpt) > + iommu_group_replace_domain(idev->group, cur_hwpt- > >domain); > + else > + iommu_detach_group(hwpt->domain, idev->group); > out_iova: > iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev); > out_unlock: > @@ -345,6 +406,13 @@ int iommufd_device_attach(struct iommufd_device > *idev, u32 *pt_id) > struct iommufd_hw_pagetable *hwpt = > container_of(pt_obj, struct > iommufd_hw_pagetable, obj); > > + if (idev->hwpt == hwpt) > + goto out_done; > + if (idev->hwpt && idev->hwpt->auto_domain) { > + rc = -EBUSY; This means if device was attached to an auto_created hwpt, then we cannot replace it with a user allocated hwpt? If yes, this means the replace is not available until user hwpt support, which is part of nesting. 
> + goto out_put_pt_obj; > + } > + > mutex_lock(&hwpt->ioas->mutex); > rc = iommufd_device_do_attach(idev, hwpt); > mutex_unlock(&hwpt->ioas->mutex); > @@ -356,6 +424,8 @@ int iommufd_device_attach(struct iommufd_device > *idev, u32 *pt_id) > struct iommufd_ioas *ioas = > container_of(pt_obj, struct iommufd_ioas, obj); > > + if (idev->hwpt) > + return -EBUSY; So we don't allow ioas replacement for physical devices. Is it? Looks like emulated devices allows it. > rc = iommufd_device_auto_get_domain(idev, ioas); > if (rc) > goto out_put_pt_obj; > @@ -367,6 +437,7 @@ int iommufd_device_attach(struct iommufd_device > *idev, u32 *pt_id) > } > > refcount_inc(&idev->obj.users); > +out_done: > *pt_id = idev->hwpt->obj.id; > rc = 0; > > @@ -385,31 +456,7 @@ > EXPORT_SYMBOL_NS_GPL(iommufd_device_attach, IOMMUFD); > */ > void iommufd_device_detach(struct iommufd_device *idev) > { > - struct iommufd_hw_pagetable *hwpt = idev->hwpt; > - > - mutex_lock(&hwpt->ioas->mutex); > - mutex_lock(&hwpt->devices_lock); > - list_del(&idev->devices_item); > - if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) { > - if (list_empty(&hwpt->devices)) { > - iopt_table_remove_domain(&hwpt->ioas->iopt, > - hwpt->domain); > - list_del(&hwpt->hwpt_item); > - } > - iommu_detach_group(hwpt->domain, idev->group); > - } > - iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev); > - mutex_unlock(&hwpt->devices_lock); > - mutex_unlock(&hwpt->ioas->mutex); > - > - if (hwpt->auto_domain) > - iommufd_object_destroy_user(idev->ictx, &hwpt->obj); > - else > - refcount_dec(&hwpt->obj.users); > - > - idev->hwpt = NULL; > - > - refcount_dec(&idev->obj.users); > + __iommmufd_device_detach(idev, NULL, true); > } > EXPORT_SYMBOL_NS_GPL(iommufd_device_detach, IOMMUFD); > > diff --git a/drivers/iommu/iommufd/iommufd_private.h > b/drivers/iommu/iommufd/iommufd_private.h > index 593138bb37b8..200c783800ad 100644 > --- a/drivers/iommu/iommufd/iommufd_private.h > +++ b/drivers/iommu/iommufd/iommufd_private.h > @@ -9,6 +9,8 @@ > #include <linux/refcount.h> > #include <linux/uaccess.h> > > +#include "../iommu-priv.h" > + > struct iommu_domain; > struct iommu_group; > struct iommu_option; > -- > 2.39.1 Regards, Yi Liu
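The lock-order question Yi raises comes down to two nestings, sketched side by side below. This is an illustration distilled from the quoted code, with stub bodies standing in for the real attach/detach work:

```c
/* The prevailing order, as in iommufd_device_auto_get_domain():
 * the IOAS mutex is the outer lock. */
static void attach_lock_order(struct iommufd_ioas *ioas,
			      struct iommufd_hw_pagetable *hwpt)
{
	mutex_lock(&ioas->mutex);		/* outer */
	mutex_lock(&hwpt->devices_lock);	/* inner */
	/* ... attach work ... */
	mutex_unlock(&hwpt->devices_lock);
	mutex_unlock(&ioas->mutex);
}

/* The new detach helper nests the other way, which is why Yi notes it
 * reads like a deadlock at first glance. The inner mutex is the *old*
 * IOAS's, taken only when the device is actually leaving that IOAS;
 * in the replace path the caller already holds the new IOAS's mutex. */
static void detach_lock_order(struct iommufd_hw_pagetable *hwpt,
			      struct iommufd_ioas *new_ioas)
{
	mutex_lock(&hwpt->devices_lock);	/* outer here */
	if (hwpt->ioas != new_ioas)
		mutex_lock(&hwpt->ioas->mutex);	/* inner, old IOAS only */
	/* ... detach work ... */
	if (hwpt->ioas != new_ioas)
		mutex_unlock(&hwpt->ioas->mutex);
	mutex_unlock(&hwpt->devices_lock);
}
```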
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, February 8, 2023 5:18 AM
>
> +/**
> + * __iommmufd_device_detach - Detach a device from idev->hwpt to

Nit: s/_iommmufd/__iommufd

Regards,
Yi Liu
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Wednesday, February 8, 2023 5:18 AM > > iommu_group_replace_domain() is introduced to support use cases where > an > iommu_group can be attached to a new domain without getting detached > from > the old one. This replacement feature will be useful, for cases such as: > 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0) > table with a larger table (PASID=N) > 2) Nesting mode, when switching the attaching device from an S2 domain > to an S1 domain, or when switching between relevant S1 domains. > as it allows these cases to switch seamlessly without a DMA disruption. > > So, call iommu_group_replace_domain() in the iommufd_device_do_attach(). > And add a __iommmufd_device_detach helper to allow the replace routine > to > do a partial detach on the current hwpt that's being replaced. Though the > updated locking logic is overcomplicated, it will be eased, once those > iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's > allocation/destroy() functions in the coming nesting series, as that'll > depend on a new ->domain_alloc_user op in the iommu core. then why not moving those changes into this series to make it simple? > > Also, block replace operations that are from/to auto_domains, i.e. only > user-allocated hw_pagetables can be replaced or replaced with. where does this restriction come from? iommu_group_replace_domain() can switch between any two UNMANAGED domains. What is the extra problem in iommufd to support from/to auto domains? > > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> > --- > drivers/iommu/iommufd/device.c | 101 +++++++++++++++++------- > drivers/iommu/iommufd/iommufd_private.h | 2 + > 2 files changed, 76 insertions(+), 27 deletions(-) > > diff --git a/drivers/iommu/iommufd/device.c > b/drivers/iommu/iommufd/device.c > index b8c3e3baccb5..8a9834fc129a 100644 > --- a/drivers/iommu/iommufd/device.c > +++ b/drivers/iommu/iommufd/device.c > @@ -9,6 +9,8 @@ > #include "io_pagetable.h" > #include "iommufd_private.h" > > +MODULE_IMPORT_NS(IOMMUFD_INTERNAL); > + > static bool allow_unsafe_interrupts; > module_param(allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR); > MODULE_PARM_DESC( > @@ -194,9 +196,61 @@ static bool > iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt, > return false; > } > > +/** > + * __iommmufd_device_detach - Detach a device from idev->hwpt to > new_hwpt 'from ... to ...' means a replace semantics. then this should be called iommufd_device_replace_hwpt(). > + * @idev: device to detach > + * @new_hwpt: new hw_pagetable to attach (pass in NULL for a simple > detach) > + * @detach_group: flag to call iommu_detach_group > + * > + * This is a cleanup helper shared by the replace and detach routines. > Comparing > + * to a detach routine, a replace routine only needs a partial detach > procedure: > + * it does not need the iommu_detach_group(); it will attach the device to a > new > + * hw_pagetable after a partial detach from the currently attached > hw_pagetable, > + * so certain steps can be skipped if two hw_pagetables have the same IOAS. 
> + */ > +static void __iommmufd_device_detach(struct iommufd_device *idev, > + struct iommufd_hw_pagetable > *new_hwpt, > + bool detach_group) > +{ > + struct iommufd_hw_pagetable *hwpt = idev->hwpt; > + struct iommufd_ioas *new_ioas = NULL; > + > + if (new_hwpt) > + new_ioas = new_hwpt->ioas; > + > + mutex_lock(&hwpt->devices_lock); > + list_del(&idev->devices_item); > + if (hwpt->ioas != new_ioas) > + mutex_lock(&hwpt->ioas->mutex); I think new_ioas->mutext was meant here. > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) { > + if (list_empty(&hwpt->devices)) { > + iopt_table_remove_domain(&hwpt->ioas->iopt, > + hwpt->domain); > + list_del(&hwpt->hwpt_item); > + } I'm not sure how this can be fully shared between detach and replace. Here some work e.g. above needs to be done before calling iommu_group_replace_domain() while others can be done afterwards.
On Wed, Feb 08, 2023 at 08:08:42AM +0000, Liu, Yi L wrote: > > From: Nicolin Chen <nicolinc@nvidia.com> > > Sent: Wednesday, February 8, 2023 5:18 AM > > > > iommu_group_replace_domain() is introduced to support use cases where > > an > > iommu_group can be attached to a new domain without getting detached > > from > > the old one. This replacement feature will be useful, for cases such as: > > 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0) > > table with a larger table (PASID=N) > > 2) Nesting mode, when switching the attaching device from an S2 domain > > to an S1 domain, or when switching between relevant S1 domains. > > as it allows these cases to switch seamlessly without a DMA disruption. > > > > So, call iommu_group_replace_domain() in the > > iommufd_device_do_attach(). > > And add a __iommmufd_device_detach helper to allow the replace routine > > to > > do a partial detach on the current hwpt that's being replaced. Though the > > updated locking logic is overcomplicated, it will be eased, once those > > iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's > > allocation/destroy() functions in the coming nesting series, as that'll > > depend on a new ->domain_alloc_user op in the iommu core. > > > > Also, block replace operations that are from/to auto_domains, i.e. only > > user-allocated hw_pagetables can be replaced or replaced with. > > > > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> > > --- > > drivers/iommu/iommufd/device.c | 101 +++++++++++++++++------- > > drivers/iommu/iommufd/iommufd_private.h | 2 + > > 2 files changed, 76 insertions(+), 27 deletions(-) > > > > diff --git a/drivers/iommu/iommufd/device.c > > b/drivers/iommu/iommufd/device.c > > index b8c3e3baccb5..8a9834fc129a 100644 > > --- a/drivers/iommu/iommufd/device.c > > +++ b/drivers/iommu/iommufd/device.c > > @@ -9,6 +9,8 @@ > > #include "io_pagetable.h" > > #include "iommufd_private.h" > > > > +MODULE_IMPORT_NS(IOMMUFD_INTERNAL); > > + > > static bool allow_unsafe_interrupts; > > module_param(allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR); > > MODULE_PARM_DESC( > > @@ -194,9 +196,61 @@ static bool > > iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt, > > return false; > > } > > > > +/** > > + * __iommmufd_device_detach - Detach a device from idev->hwpt to > > new_hwpt > > This function doesn't do anything to make this device attached to new_hwpt. > It is done in the iommufd_device_attach_ioas(). New_hwpt here indicates if > this detach requires to do some extra thing. E.g. remove reserved iova from > the idev->hwpt->ioas. So may just say " Detach a device from idev->hwpt", > and explain the usage of new_hwpt in the below. Yea. You are right. > > + * @idev: device to detach > > + * @new_hwpt: new hw_pagetable to attach (pass in NULL for a simple > > detach) > > The new hw_pagetable to be attached. OK. > > + * @detach_group: flag to call iommu_detach_group > > + * > > + * This is a cleanup helper shared by the replace and detach routines. > > Comparing > > + * to a detach routine, a replace routine only needs a partial detach > > procedure: > > + * it does not need the iommu_detach_group(); it will attach the device to > > a new > > + * hw_pagetable after a partial detach from the currently attached > > hw_pagetable, > > + * so certain steps can be skipped if two hw_pagetables have the same > > IOAS. 
> > + */ > > +static void __iommmufd_device_detach(struct iommufd_device *idev, > > + struct iommufd_hw_pagetable > > *new_hwpt, > > + bool detach_group) > > +{ > > + struct iommufd_hw_pagetable *hwpt = idev->hwpt; > > + struct iommufd_ioas *new_ioas = NULL; > > + > > + if (new_hwpt) > > + new_ioas = new_hwpt->ioas; > > + > > + mutex_lock(&hwpt->devices_lock); > > + list_del(&idev->devices_item); > > + if (hwpt->ioas != new_ioas) > > + mutex_lock(&hwpt->ioas->mutex); > > The lock order is mostly hwpt->ioas->mutex and then hwpt->devices_lock. > See the iommufd_device_auto_get_domain(). If possible, may switch the > order sequence here. Yea, I know it's a bit strange. Yet... Our nesting series simplifies this part to: if (cur_ioas != new_ioas) { mutex_lock(&hwpt->ioas->mutex); iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev); mutex_unlock(&hwpt->ioas->mutex); } So, here is trying to avoid something like: if (cur_ioas != new_ioas) mutex_lock(&hwpt->ioas->mutex); // doing something if (cur_ioas != new_ioas) iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev); // doing something if (cur_ioas != new_ioas) mutex_unlock(&hwpt->ioas->mutex); > Also, rename hwpt to be cur_hwpt, this may help > reviewers to distinguish it from the hwpt in the caller of this function. It > looks to be a deadlock at first look, but not after closer reading. Sure. > > @@ -345,6 +406,13 @@ int iommufd_device_attach(struct iommufd_device > > *idev, u32 *pt_id) > > struct iommufd_hw_pagetable *hwpt = > > container_of(pt_obj, struct > > iommufd_hw_pagetable, obj); > > > > + if (idev->hwpt == hwpt) > > + goto out_done; > > + if (idev->hwpt && idev->hwpt->auto_domain) { > > + rc = -EBUSY; > > This means if device was attached to an auto_created hwpt, then we > cannot replace it with a user allocated hwpt? If yes, this means the > replace is not available until user hwpt support, which is part of nesting. After aligning with Jason, this limit here might be wrong, as we should be able to support replacing an IOAS. I'd need to take a closer look and fix it in v3. > > + if (idev->hwpt) > > + return -EBUSY; > > So we don't allow ioas replacement for physical devices. Is it? > Looks like emulated devices allows it. This was to avoid an replace with an auto_domain. Similarly, it's likely wrong, as I replied above. Thanks Nic
On Wed, Feb 08, 2023 at 08:12:55AM +0000, Liu, Yi L wrote:

> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> >
> > +/**
> > + * __iommmufd_device_detach - Detach a device from idev->hwpt to
>
> Nit: s/_iommmufd/__iommufd

Will fix it. Thanks!
On Thu, Feb 09, 2023 at 04:00:52AM +0000, Tian, Kevin wrote: > > From: Nicolin Chen <nicolinc@nvidia.com> > > Sent: Wednesday, February 8, 2023 5:18 AM > > > > iommu_group_replace_domain() is introduced to support use cases where > > an > > iommu_group can be attached to a new domain without getting detached > > from > > the old one. This replacement feature will be useful, for cases such as: > > 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0) > > table with a larger table (PASID=N) > > 2) Nesting mode, when switching the attaching device from an S2 domain > > to an S1 domain, or when switching between relevant S1 domains. > > as it allows these cases to switch seamlessly without a DMA disruption. > > > > So, call iommu_group_replace_domain() in the iommufd_device_do_attach(). > > And add a __iommmufd_device_detach helper to allow the replace routine > > to > > do a partial detach on the current hwpt that's being replaced. Though the > > updated locking logic is overcomplicated, it will be eased, once those > > iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's > > allocation/destroy() functions in the coming nesting series, as that'll > > depend on a new ->domain_alloc_user op in the iommu core. > > then why not moving those changes into this series to make it simple? The simplification depends on the new ->domain_alloc_user op and its implementation in SMMU driver, which would be introduced by the nesting series of VT-d and SMMU respectively. At this point, it's hard to decide the best sequence of our three series. Putting this replace series first simply because it seems to be closer to get merged than the other two bigger series. > > Also, block replace operations that are from/to auto_domains, i.e. only > > user-allocated hw_pagetables can be replaced or replaced with. > > where does this restriction come from? iommu_group_replace_domain() > can switch between any two UNMANAGED domains. What is the extra > problem in iommufd to support from/to auto domains? It was my misunderstanding. We should have supported that. Will fix in v3 and add the corresponding support. > > +/** > > + * __iommmufd_device_detach - Detach a device from idev->hwpt to > > new_hwpt > > 'from ... to ...' means a replace semantics. then this should be called > iommufd_device_replace_hwpt(). Had a hard time to think about the naming, it's indeed a detach- only routine, but it takes a parameter named new_hwpt... Perhaps I should just follow Yi's suggestion by rephrasing the narrative of this function. > > +static void __iommmufd_device_detach(struct iommufd_device *idev, > > + struct iommufd_hw_pagetable > > *new_hwpt, > > + bool detach_group) > > +{ > > + struct iommufd_hw_pagetable *hwpt = idev->hwpt; > > + struct iommufd_ioas *new_ioas = NULL; > > + > > + if (new_hwpt) > > + new_ioas = new_hwpt->ioas; > > + > > + mutex_lock(&hwpt->devices_lock); > > + list_del(&idev->devices_item); > > + if (hwpt->ioas != new_ioas) > > + mutex_lock(&hwpt->ioas->mutex); > > I think new_ioas->mutext was meant here. new_hwpt is an input from an replace routine, where it holds the new_ioas->mutex already. Yi's right that the code here is a bit confusing. I will try to change it a bit for readability. > > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) { > > + if (list_empty(&hwpt->devices)) { > > + iopt_table_remove_domain(&hwpt->ioas->iopt, > > + hwpt->domain); > > + list_del(&hwpt->hwpt_item); > > + } > > I'm not sure how this can be fully shared between detach and replace. 
> Here some work e.g. above needs to be done before calling > iommu_group_replace_domain() while others can be done afterwards. This iopt_table_remove_domain/list_del is supposed to be done in the hwpt's destroy() actually. We couldn't move it because it'd need the new domain_alloc_user op and its implementation in ARM driver. Overall, I think it should be safe to put it behind the iommu_group_replace_domain(). Thanks Nic
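To make the "supposed to be done in the hwpt's destroy()" point concrete, a rough sketch of where that teardown would live is below. The callback shape follows the iommufd object model and the names come from the quoted driver code, but moving the two calls here is the thread's proposal, not what this patch implements:

```c
/*
 * Sketch only: teardown moved out of the detach path into the
 * hw_pagetable object's destroy callback, as discussed above.
 * The exact body is an assumption for illustration.
 */
void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
{
	struct iommufd_hw_pagetable *hwpt =
		container_of(obj, struct iommufd_hw_pagetable, obj);

	WARN_ON(!list_empty(&hwpt->devices));

	/* The two calls the thread wants out of __iommmufd_device_detach() */
	iopt_table_remove_domain(&hwpt->ioas->iopt, hwpt->domain);
	list_del(&hwpt->hwpt_item);

	iommu_domain_free(hwpt->domain);
	refcount_dec(&hwpt->ioas->obj.users);
}
```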
On Thu, Feb 09, 2023 at 01:13:07PM -0800, Nicolin Chen wrote: > On Thu, Feb 09, 2023 at 04:00:52AM +0000, Tian, Kevin wrote: > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > Sent: Wednesday, February 8, 2023 5:18 AM > > > > > > iommu_group_replace_domain() is introduced to support use cases where > > > an > > > iommu_group can be attached to a new domain without getting detached > > > from > > > the old one. This replacement feature will be useful, for cases such as: > > > 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0) > > > table with a larger table (PASID=N) > > > 2) Nesting mode, when switching the attaching device from an S2 domain > > > to an S1 domain, or when switching between relevant S1 domains. > > > as it allows these cases to switch seamlessly without a DMA disruption. > > > > > > So, call iommu_group_replace_domain() in the iommufd_device_do_attach(). > > > And add a __iommmufd_device_detach helper to allow the replace routine > > > to > > > do a partial detach on the current hwpt that's being replaced. Though the > > > updated locking logic is overcomplicated, it will be eased, once those > > > iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's > > > allocation/destroy() functions in the coming nesting series, as that'll > > > depend on a new ->domain_alloc_user op in the iommu core. > > > > then why not moving those changes into this series to make it simple? > > The simplification depends on the new ->domain_alloc_user op and > its implementation in SMMU driver, which would be introduced by > the nesting series of VT-d and SMMU respectively. Since we are not fixing all the drivers at this point, this argument doesn't hold water. It is as I said in the other email, there should be no changes to the normal attach/replace path when adding manual HWPT creation - once those are removed there should be minimal connection between these two series. Jason
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Friday, February 10, 2023 5:13 AM > > > > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) { > > > + if (list_empty(&hwpt->devices)) { > > > + iopt_table_remove_domain(&hwpt->ioas->iopt, > > > + hwpt->domain); > > > + list_del(&hwpt->hwpt_item); > > > + } > > > > I'm not sure how this can be fully shared between detach and replace. > > Here some work e.g. above needs to be done before calling > > iommu_group_replace_domain() while others can be done afterwards. > > This iopt_table_remove_domain/list_del is supposed to be done in > the hwpt's destroy() actually. We couldn't move it because it'd > need the new domain_alloc_user op and its implementation in ARM > driver. Overall, I think it should be safe to put it behind the > iommu_group_replace_domain(). > My confusion is that we have different flows between detach/attach and replace. today with separate detach+attach we have following flow: Remove device from current hwpt; if (last_device in hwpt) { Remove hwpt domain from current iopt; if (last_device in group) detach group from hwpt domain; } if (first device in group) { attach group to new hwpt domain; if (first_device in hwpt) Add hwpt domain to new iopt; Add device to new hwpt; but replace flow is different on the detach part: if (first device in group) { replace group's domain from current hwpt to new hwpt; if (first_device in hwpt) Add hwpt domain to new iopt; } Remove device from old hwpt; if (last_device in old hwpt) Remove hwpt domain from old iopt; Add device to new hwpt; I'm yet to figure out whether we have sufficient lock protection to prevent other paths from using old iopt/hwpt to find the device which is already attached to a different domain.
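Kevin's two orderings are easier to compare laid out as code. The sketch below restates them using the helpers quoted in the patch; the first/last-device predicates are reduced to the list_empty()/iommufd_hw_pagetable_has_group() idioms the driver already uses, and locking plus error handling are elided, so this illustrates control flow only:

```c
/* Today's separate detach + attach: old-side teardown happens first. */
static void flow_detach_then_attach(struct iommufd_device *idev,
				    struct iommufd_hw_pagetable *old,
				    struct iommufd_hw_pagetable *new)
{
	/* detach side */
	list_del(&idev->devices_item);		/* remove from old hwpt */
	if (list_empty(&old->devices)) {	/* last device in hwpt */
		iopt_table_remove_domain(&old->ioas->iopt, old->domain);
		if (!iommufd_hw_pagetable_has_group(old, idev->group))
			iommu_detach_group(old->domain, idev->group);
	}

	/* attach side */
	if (!iommufd_hw_pagetable_has_group(new, idev->group)) {
		iommu_attach_group(new->domain, idev->group);
		if (list_empty(&new->devices))	/* first device in hwpt */
			iopt_table_add_domain(&new->ioas->iopt, new->domain);
	}
	list_add(&idev->devices_item, &new->devices);
}

/* This patch's replace: the domain switch leads, old-side teardown trails. */
static void flow_replace(struct iommufd_device *idev,
			 struct iommufd_hw_pagetable *old,
			 struct iommufd_hw_pagetable *new)
{
	if (!iommufd_hw_pagetable_has_group(new, idev->group)) {
		iommu_group_replace_domain(idev->group, new->domain);
		if (list_empty(&new->devices))
			iopt_table_add_domain(&new->ioas->iopt, new->domain);
	}

	list_del(&idev->devices_item);
	if (list_empty(&old->devices))
		iopt_table_remove_domain(&old->ioas->iopt, old->domain);
	list_add(&idev->devices_item, &new->devices);
}
```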
On Thu, Feb 09, 2023 at 08:01:00PM -0400, Jason Gunthorpe wrote: > On Thu, Feb 09, 2023 at 01:13:07PM -0800, Nicolin Chen wrote: > > On Thu, Feb 09, 2023 at 04:00:52AM +0000, Tian, Kevin wrote: > > > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > > Sent: Wednesday, February 8, 2023 5:18 AM > > > > > > > > iommu_group_replace_domain() is introduced to support use cases where > > > > an > > > > iommu_group can be attached to a new domain without getting detached > > > > from > > > > the old one. This replacement feature will be useful, for cases such as: > > > > 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0) > > > > table with a larger table (PASID=N) > > > > 2) Nesting mode, when switching the attaching device from an S2 domain > > > > to an S1 domain, or when switching between relevant S1 domains. > > > > as it allows these cases to switch seamlessly without a DMA disruption. > > > > > > > > So, call iommu_group_replace_domain() in the iommufd_device_do_attach(). > > > > And add a __iommmufd_device_detach helper to allow the replace routine > > > > to > > > > do a partial detach on the current hwpt that's being replaced. Though the > > > > updated locking logic is overcomplicated, it will be eased, once those > > > > iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's > > > > allocation/destroy() functions in the coming nesting series, as that'll > > > > depend on a new ->domain_alloc_user op in the iommu core. > > > > > > then why not moving those changes into this series to make it simple? > > > > The simplification depends on the new ->domain_alloc_user op and > > its implementation in SMMU driver, which would be introduced by > > the nesting series of VT-d and SMMU respectively. > > Since we are not fixing all the drivers at this point, this argument > doesn't hold water. > > It is as I said in the other email, there should be no changes to the > normal attach/replace path when adding manual HWPT creation - once > those are removed there should be minimal connection between these two > series. Yes. I replied here earlier than the discussion with you in that thread. I think I should just drop this line of commit messages. Thanks Nic
On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote: > > > > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) { > > > > + if (list_empty(&hwpt->devices)) { > > > > + iopt_table_remove_domain(&hwpt->ioas->iopt, > > > > + hwpt->domain); > > > > + list_del(&hwpt->hwpt_item); > > > > + } > > > > > > I'm not sure how this can be fully shared between detach and replace. > > > Here some work e.g. above needs to be done before calling > > > iommu_group_replace_domain() while others can be done afterwards. > > > > This iopt_table_remove_domain/list_del is supposed to be done in > > the hwpt's destroy() actually. We couldn't move it because it'd > > need the new domain_alloc_user op and its implementation in ARM > > driver. Overall, I think it should be safe to put it behind the > > iommu_group_replace_domain(). > > > > My confusion is that we have different flows between detach/attach > and replace. > > today with separate detach+attach we have following flow: > > Remove device from current hwpt; > if (last_device in hwpt) { > Remove hwpt domain from current iopt; > if (last_device in group) > detach group from hwpt domain; > } > > if (first device in group) { > attach group to new hwpt domain; > if (first_device in hwpt) > Add hwpt domain to new iopt; > Add device to new hwpt; > > but replace flow is different on the detach part: > > if (first device in group) { > replace group's domain from current hwpt to new hwpt; > if (first_device in hwpt) > Add hwpt domain to new iopt; > } > > Remove device from old hwpt; > if (last_device in old hwpt) > Remove hwpt domain from old iopt; > > Add device to new hwpt; Oh... thinking it carefully, I see the flow does look a bit off. Perhaps it's better to have a similar flow for replace. However, I think something would be still different due to its tricky nature, especially for a multi-device iommu_group. An iommu_group_detach happens only when a device is the last one in its group to go through the routine via a DETACH ioctl, while an iommu_group_replace_domain() happens only when the device is the first one in its group to go through the routine via another ATTACH ioctl. However, when the first device does a replace, the cleanup routine of the old hwpt is a NOP, since there are still other devices (same group) in the old hwpt. And two implications here: 1) Any other device in the same group has to forcibly switch to the new domain, when the first device does a replace. 2) The actual hwpt cleanup can only happen at the last device's replace call. This also means that kernel has to rely on the integrity of the user space that it must replace all active devices in the group: For a three-device iommu_group, [scenario 1] a. ATTACH dev1 to hwpt1; b. ATTACH dev2 to hwpt1; c. ATTACH dev3 to hwpt1; d. ATTACH (REPLACE) dev1 to hwpt2; (no hwpt1 cleanup; replace dev2&3 too; do hwpt2 init) e. ATTACH (REPLACE) dev2 to hwpt2; // user space must do (no hwpt1 cleanup; no dev2 replace; no hwpt2 init) f. ATTACH (REPLACE) dev3 to hwpt2; // user space must do (do hwpt1 cleanup; no dev3 replace; no hwpt2 init) [scenario 2] a. ATTACH dev1 to hwpt1; b. ATTACH dev2 to hwpt1; c. ATTACH dev3 to hwpt1; d. DETACH dev3 from hwpt1; (detach dev3; no hwpt1 cleanup) f. ATTACH (REPLACE) dev1 to hwpt2; (no hwpt1 cleanup; replace dev2&3 too; do hwpt2 init) g. ATTACH (REPLACE) dev2 to hwpt2; // user space must do (do hwpt1 cleanup; no dev2 replace; no hwpt2 init) h. 
(optional) ATTACH dev3 to hwpt2; // clean ATTACH, not a REPLACE (no hwpt1 cleanup; no dev3 replace; no hwpt2 init) [scenario 3] a. ATTACH dev1 to hwpt1; b. ATTACH dev2 to hwpt1; c. ATTACH dev3 to hwpt1; d. DETACH dev3 from hwpt1; (detach dev3; no hwpt1 cleanup) e. DETACH dev2 from hwpt1; (detach dev2; no hwpt1 cleanup) f. ATTACH (REPLACE) dev1 to hwpt2; (do hwpt1 cleanup; replace dev2&3 too; do hwpt2 init) g. (optional) ATTACH dev2 to hwpt2; // clean ATTACH, not a REPLACE (no hwpt1 cleanup; no dev2 replace; no hwpt2 init) h. (optional) ATTACH dev3 to hwpt2; // clean ATTACH, not a REPLACE (no hwpt1 cleanup; no dev3 replace; no hwpt2 init) Following the narratives above, [current detach+attach flow] // DETACH dev1 from hwpt1; Log dev1 out of the hwpt1's device list; NOP; // hwpt1 has its group; iopt_remove_reserved_iova(hwpt1->iopt, dev1); idev1->hwpt = NULL; refcount_dec(); // DETACH dev2 from hwpt1; Log dev2 out of the hwpt1's device list; if (hwpt1 does not have its group) { // last device to detach if (hwpt1's device list is empty) iopt_table_remove_domain/list_del(hwpt1); iommu_detach_group(); } iopt_remove_reserved_iova(hwpt1->iopt, dev2); idev2->hwpt = NULL; refcount_dec(); ... // ATTACH dev1 to hwpt2; iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev1); if (hwpt2 does not have its group) { // first device to attach iommu_attach_group(); if (hwpt2's device list is empty) iopt_table_add_domain/list_add(hwpt2); } idev1->hwpt = hwpt2; refcount_inc(); Log dev1 in the hwpt2's device list; // ATTACH dev2 to hwpt2; iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev2); NOP; // hwpt2 has its group; idev2->hwpt = hwpt2; refcount_inc(); Log dev2 in to the hwpt2's device list; [correct (?) replace flow - scenario 1 above] // 1.d Switch (REPLACE) dev1 from hwpt1 to hwpt2; partial detach (dev1) { Log dev1 out of the hwpt1's device list; NOP // hwpt1 has its group, and hwpt1's device list isn't empty iopt_remove_reserved_iova(hwpt1->iopt, dev1); refcount_dec(); } iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev1); if (hwpt2 does not have its group) { // first device to replace iommu_group_replace_domain(); if (hwpt2's device list is empty) iopt_table_add_domain/list_add(hwpt2); } idev1->hwpt = hwpt2; refcount_int(); Log dev1 in the hwpt2's device list; // 1.e Switch (REPLACE) dev2 from hwpt1 to hwpt2; partial detach (dev2) { Log dev2 out of the hwpt1's device list; NOP // hwpt1 has its group, and hwpt1's device list isn't empty iopt_remove_reserved_iova(hwpt1->iopt, dev2); refcount_dec(); } iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev2); NOP; // hwpt2 has its group, and hwpt2's device list isn't empty idev2->hwpt = hwpt2; refcount_int(); Log dev2 in the hwpt2's device list; // 1.f Switch (REPLACE) dev3 from hwpt1 to hwpt2; partial detach (dev3) { Log dev3 out of the hwpt1's device list; if (hwpt1 does not have its group) { // last device to detach if (hwpt1's device list is empty) iopt_table_remove_domain/list_del(hwpt1); } iopt_remove_reserved_iova(hwpt1->iopt, dev3); refcount_dec(); } iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev3); NOP; // hwpt2 has its group, and hwpt2's device list isn't empty idev3->hwpt = hwpt2; refcount_int(); Log dev3 in the hwpt2's device list; And, this would also complicate the error-out routines too... > I'm yet to figure out whether we have sufficient lock protection to > prevent other paths from using old iopt/hwpt to find the device > which is already attached to a different domain. With the correct (?) 
flow, I think it'd be safer for one-device group. But it's gets tricker for the multi-device case above: the dev2 and dev3 are already in the new domain, but all their iommufd objects (idev->hwpt and iopt) are lagging. Do we need a lock across the three IOCTLs? One way to avoid it is to force-update idev2 and idev3 too when idev1 does a replace -- by iterating all same-group devices out of the old hwpt. This, however, feels a violation against being device-centric... i.e. scenario 1: a. ATTACH dev1 to hwpt1; b. ATTACH dev2 to hwpt1; c. ATTACH dev3 to hwpt1; d. ATTACH (REPLACE) dev1 to hwpt2; (do hwpt1 cleanup; replace dev2&3 and update idev2&3 too; do hwpt2 init) e. ATTACH (REPLACE) dev2 to hwpt2; // user space must do, or be aware of 1.d (kernel does dummy) f. ATTACH (REPLACE) dev3 to hwpt2; // user space must do, or be aware of 1.d (kernel does dummy) Thanks Nic
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Saturday, February 11, 2023 8:10 AM > > On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote: > > > > > > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) { > > > > > + if (list_empty(&hwpt->devices)) { > > > > > + iopt_table_remove_domain(&hwpt->ioas->iopt, > > > > > + hwpt->domain); > > > > > + list_del(&hwpt->hwpt_item); > > > > > + } > > > > > > > > I'm not sure how this can be fully shared between detach and replace. > > > > Here some work e.g. above needs to be done before calling > > > > iommu_group_replace_domain() while others can be done afterwards. > > > > > > This iopt_table_remove_domain/list_del is supposed to be done in > > > the hwpt's destroy() actually. We couldn't move it because it'd > > > need the new domain_alloc_user op and its implementation in ARM > > > driver. Overall, I think it should be safe to put it behind the > > > iommu_group_replace_domain(). > > > > > > > My confusion is that we have different flows between detach/attach > > and replace. > > > > today with separate detach+attach we have following flow: > > > > Remove device from current hwpt; > > if (last_device in hwpt) { > > Remove hwpt domain from current iopt; > > if (last_device in group) > > detach group from hwpt domain; > > } > > > > if (first device in group) { > > attach group to new hwpt domain; > > if (first_device in hwpt) > > Add hwpt domain to new iopt; > > Add device to new hwpt; > > > > but replace flow is different on the detach part: > > > > if (first device in group) { > > replace group's domain from current hwpt to new hwpt; > > if (first_device in hwpt) > > Add hwpt domain to new iopt; > > } > > > > Remove device from old hwpt; > > if (last_device in old hwpt) > > Remove hwpt domain from old iopt; > > > > Add device to new hwpt; > > Oh... thinking it carefully, I see the flow does look a bit off. > Perhaps it's better to have a similar flow for replace. > > However, I think something would be still different due to its > tricky nature, especially for a multi-device iommu_group. > > An iommu_group_detach happens only when a device is the last one > in its group to go through the routine via a DETACH ioctl, while > an iommu_group_replace_domain() happens only when the device is > the first one in its group to go through the routine via another > ATTACH ioctl. However, when the first device does a replace, the > cleanup routine of the old hwpt is a NOP, since there are still > other devices (same group) in the old hwpt. And two implications > here: > 1) Any other device in the same group has to forcibly switch to > the new domain, when the first device does a replace. > 2) The actual hwpt cleanup can only happen at the last device's > replace call. > > This also means that kernel has to rely on the integrity of the > user space that it must replace all active devices in the group: > Jason suggested to move hwpt cleanup out of the detach path in his reply to patch7. Presumably with that fix the major tricky point about hwpt in following scenarios would disappear. Let's see how it will work out then. 😊
On Mon, Feb 13, 2023 at 02:34:18AM +0000, Tian, Kevin wrote: > External email: Use caution opening links or attachments > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > Sent: Saturday, February 11, 2023 8:10 AM > > > > On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote: > > > > > > > > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) { > > > > > > + if (list_empty(&hwpt->devices)) { > > > > > > + iopt_table_remove_domain(&hwpt->ioas->iopt, > > > > > > + hwpt->domain); > > > > > > + list_del(&hwpt->hwpt_item); > > > > > > + } > > > > > > > > > > I'm not sure how this can be fully shared between detach and replace. > > > > > Here some work e.g. above needs to be done before calling > > > > > iommu_group_replace_domain() while others can be done afterwards. > > > > > > > > This iopt_table_remove_domain/list_del is supposed to be done in > > > > the hwpt's destroy() actually. We couldn't move it because it'd > > > > need the new domain_alloc_user op and its implementation in ARM > > > > driver. Overall, I think it should be safe to put it behind the > > > > iommu_group_replace_domain(). > > > > > > > > > > My confusion is that we have different flows between detach/attach > > > and replace. > > > > > > today with separate detach+attach we have following flow: > > > > > > Remove device from current hwpt; > > > if (last_device in hwpt) { > > > Remove hwpt domain from current iopt; > > > if (last_device in group) > > > detach group from hwpt domain; > > > } > > > > > > if (first device in group) { > > > attach group to new hwpt domain; > > > if (first_device in hwpt) > > > Add hwpt domain to new iopt; > > > Add device to new hwpt; > > > > > > but replace flow is different on the detach part: > > > > > > if (first device in group) { > > > replace group's domain from current hwpt to new hwpt; > > > if (first_device in hwpt) > > > Add hwpt domain to new iopt; > > > } > > > > > > Remove device from old hwpt; > > > if (last_device in old hwpt) > > > Remove hwpt domain from old iopt; > > > > > > Add device to new hwpt; > > > > Oh... thinking it carefully, I see the flow does look a bit off. > > Perhaps it's better to have a similar flow for replace. > > > > However, I think something would be still different due to its > > tricky nature, especially for a multi-device iommu_group. > > > > An iommu_group_detach happens only when a device is the last one > > in its group to go through the routine via a DETACH ioctl, while > > an iommu_group_replace_domain() happens only when the device is > > the first one in its group to go through the routine via another > > ATTACH ioctl. However, when the first device does a replace, the > > cleanup routine of the old hwpt is a NOP, since there are still > > other devices (same group) in the old hwpt. And two implications > > here: > > 1) Any other device in the same group has to forcibly switch to > > the new domain, when the first device does a replace. > > 2) The actual hwpt cleanup can only happen at the last device's > > replace call. > > > > This also means that kernel has to rely on the integrity of the > > user space that it must replace all active devices in the group: > > > > Jason suggested to move hwpt cleanup out of the detach path > in his reply to patch7. Presumably with that fix the major tricky > point about hwpt in following scenarios would disappear. Let's > see how it will work out then. 😊 What about point 1? 
If dev2 and dev3 are already replaced when doing iommu_group_replace_domain() on dev1, their idev objects still have the old hwpt/iopt until user space does another two IOCTLs on them, right? Should we only call iommu_group_replace_domain() when the last device in the group gets a replace IOCTL? Thanks Nic
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Monday, February 13, 2023 3:49 PM
>
> On Mon, Feb 13, 2023 at 02:34:18AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Saturday, February 11, 2023 8:10 AM
> > >
> > > On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
> > > > > > > +	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> > > > > > > +		if (list_empty(&hwpt->devices)) {
> > > > > > > +			iopt_table_remove_domain(&hwpt->ioas->iopt,
> > > > > > > +						 hwpt->domain);
> > > > > > > +			list_del(&hwpt->hwpt_item);
> > > > > > > +		}
> > > > > >
> > > > > > I'm not sure how this can be fully shared between detach and replace.
> > > > > > Here some work e.g. above needs to be done before calling
> > > > > > iommu_group_replace_domain() while others can be done afterwards.
> > > > >
> > > > > This iopt_table_remove_domain/list_del is supposed to be done in
> > > > > the hwpt's destroy() actually. We couldn't move it because it'd
> > > > > need the new domain_alloc_user op and its implementation in ARM
> > > > > driver. Overall, I think it should be safe to put it behind the
> > > > > iommu_group_replace_domain().
> > > >
> > > > My confusion is that we have different flows between detach/attach
> > > > and replace.
> > > >
> > > > today with separate detach+attach we have following flow:
> > > >
> > > > 	Remove device from current hwpt;
> > > > 	if (last_device in hwpt) {
> > > > 		Remove hwpt domain from current iopt;
> > > > 		if (last_device in group)
> > > > 			detach group from hwpt domain;
> > > > 	}
> > > >
> > > > 	if (first device in group) {
> > > > 		attach group to new hwpt domain;
> > > > 		if (first_device in hwpt)
> > > > 			Add hwpt domain to new iopt;
> > > > 	Add device to new hwpt;
> > > >
> > > > but replace flow is different on the detach part:
> > > >
> > > > 	if (first device in group) {
> > > > 		replace group's domain from current hwpt to new hwpt;
> > > > 		if (first_device in hwpt)
> > > > 			Add hwpt domain to new iopt;
> > > > 	}
> > > >
> > > > 	Remove device from old hwpt;
> > > > 	if (last_device in old hwpt)
> > > > 		Remove hwpt domain from old iopt;
> > > >
> > > > 	Add device to new hwpt;
> > >
> > > Oh... thinking it carefully, I see the flow does look a bit off.
> > > Perhaps it's better to have a similar flow for replace.
> > >
> > > However, I think something would be still different due to its
> > > tricky nature, especially for a multi-device iommu_group.
> > >
> > > An iommu_group_detach happens only when a device is the last one
> > > in its group to go through the routine via a DETACH ioctl, while
> > > an iommu_group_replace_domain() happens only when the device is
> > > the first one in its group to go through the routine via another
> > > ATTACH ioctl. However, when the first device does a replace, the
> > > cleanup routine of the old hwpt is a NOP, since there are still
> > > other devices (same group) in the old hwpt. And two implications
> > > here:
> > > 1) Any other device in the same group has to forcibly switch to
> > >    the new domain, when the first device does a replace.
> > > 2) The actual hwpt cleanup can only happen at the last device's
> > >    replace call.
> > >
> > > This also means that kernel has to rely on the integrity of the
> > > user space that it must replace all active devices in the group:
> >
> > Jason suggested to move hwpt cleanup out of the detach path
> > in his reply to patch7. Presumably with that fix the major tricky
> > point about hwpt in following scenarios would disappear. Let's
> > see how it will work out then. 😊
>
> What about point 1? If dev2 and dev3 are already replaced when
> doing iommu_group_replace_domain() on dev1, their idev objects
> still have the old hwpt/iopt until user space does another two
> IOCTLs on them, right?
>
> Should we only call iommu_group_replace_domain() when the last
> device in the group gets a replace IOCTL?

With Jason's proposal, the attach/detach paths now only handle the
connection between a device and a hwpt. I don't see a problem with
devices in a group being attached to different hwpt's in this short
transition window. There is no impact on ioas/iopt operations (though
the user is not expected to change them before the group transition is
fully completed). Just that the old hwpt cannot be destroyed until the
last dev detaches from it.
On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:

> What about point 1? If dev2 and dev3 are already replaced when
> doing iommu_group_replace_domain() on dev1, their idev objects
> still have the old hwpt/iopt until user space does another two
> IOCTLs on them, right?

We have a complicated model for multi-device groups...

The first device in the group to change domains must move all the
devices in the group.

But userspace is still expected to run through and change all the
other devices.

So replace should be a NOP if the group is already linked to the right
domain.

This patch needs to make sure that inconsistency in the view between
the iommu_group and the iommufd model doesn't cause a functional
problem.

Jason
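As a sketch of that NOP (not from the posted patch;
iommu_get_domain_for_dev() is the existing iommu core helper), the
attach path could short-circuit when a sibling device has already
moved the group:

	/* A sibling idev already replaced the group's domain: skip the
	 * iommu-layer work and only catch up the iommufd bookkeeping.
	 */
	if (iommu_get_domain_for_dev(idev->dev) == hwpt->domain) {
		idev->hwpt = hwpt;
		list_add(&idev->devices_item, &hwpt->devices);
		return 0;
	}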
On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
> On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
>
> > What about point 1? If dev2 and dev3 are already replaced when
> > doing iommu_group_replace_domain() on dev1, their idev objects
> > still have the old hwpt/iopt until user space does another two
> > IOCTLs on them, right?
>
> We have a complicated model for multi-device groups...
>
> The first device in the group to change domains must move all the
> devices in the group.
>
> But userspace is still expected to run through and change all the
> other devices.
>
> So replace should be a NOP if the group is already linked to the right
> domain.
>
> This patch needs to make sure that inconsistency in the view between
> the iommu_group and the iommufd model doesn't cause a functional
> problem.

Yea, I was thinking that we'd need to block any access to the
idev->hwpt of a pending device, before the kernel finishes the "NOP"
IOCTL from userspace, maybe with a helper:
	(iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)

Thanks
Nic
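Spelled out, such a helper might look like the following (the helper
name is hypothetical; the fields are the existing ones on struct
iommufd_device):

	/* Hypothetical: true when a sibling device's replace already
	 * moved the group's domain but this idev has not yet received
	 * its own ATTACH ioctl to catch its bookkeeping up.
	 */
	static bool iommufd_device_replace_pending(struct iommufd_device *idev)
	{
		return idev->hwpt &&
		       iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain;
	}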
On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:

> My confusion is that we have different flows between detach/attach
> and replace.
>
> today with separate detach+attach we have following flow:
>
> 	Remove device from current hwpt;
> 	if (last_device in hwpt) {
> 		Remove hwpt domain from current iopt;
> 		if (last_device in group)
> 			detach group from hwpt domain;
> 	}
>
> 	if (first device in group) {
> 		attach group to new hwpt domain;
> 		if (first_device in hwpt)
> 			Add hwpt domain to new iopt;
> 	Add device to new hwpt;
>
> but replace flow is different on the detach part:
>
> 	if (first device in group) {
> 		replace group's domain from current hwpt to new hwpt;
> 		if (first_device in hwpt)
> 			Add hwpt domain to new iopt;
> 	}
>
> 	Remove device from old hwpt;
> 	if (last_device in old hwpt)
> 		Remove hwpt domain from old iopt;
>
> 	Add device to new hwpt;
>
> I'm yet to figure out whether we have sufficient lock protection to
> prevent other paths from using old iopt/hwpt to find the device
> which is already attached to a different domain.

With Jason's new series, the detach() routine is lighter now.

I wonder if it'd be safer now to do the detach() call after
iommu_group_replace_domain()?

Thanks
Nic

@@ -196,10 +198,41 @@ static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
 	return false;
 }
 
+/**
+ * __iommufd_device_detach - Detach a device from idev->hwpt
+ * @idev: device to detach
+ * @detach_group: flag to call iommu_detach_group
+ *
+ * This is a cleanup helper shared by the replace and detach routines. Compared
+ * to a detach routine, a replace call does not need the iommu_detach_group().
+ */
+static void __iommufd_device_detach(struct iommufd_device *idev,
+				    bool detach_group)
+{
+	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
+
+	mutex_lock(&hwpt->devices_lock);
+	list_del(&idev->devices_item);
+	if (detach_group && !iommufd_hw_pagetable_has_group(hwpt, idev->group))
+		iommu_detach_group(hwpt->domain, idev->group);
+	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
+	mutex_unlock(&hwpt->devices_lock);
+
+	if (hwpt->auto_domain)
+		iommufd_object_destroy_user(idev->ictx, &hwpt->obj);
+	else
+		refcount_dec(&hwpt->obj.users);
+
+	idev->hwpt = NULL;
+
+	refcount_dec(&idev->obj.users);
+}
+
 /* On success this consumes a hwpt reference from the caller */
 static int iommufd_device_do_attach(struct iommufd_device *idev,
 				    struct iommufd_hw_pagetable *hwpt)
 {
+	struct iommufd_hw_pagetable *cur_hwpt = idev->hwpt;
 	phys_addr_t sw_msi_start = PHYS_ADDR_MAX;
 	int rc;
 
@@ -237,7 +270,7 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 	 * the group once for the first device that is in the group.
 	 */
 	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
-		rc = iommu_attach_group(hwpt->domain, idev->group);
+		rc = iommu_group_replace_domain(idev->group, hwpt->domain);
 		if (rc)
 			goto out_iova;
 
@@ -249,6 +282,10 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 		}
 	}
 
+	/* Detach from the cur_hwpt without iommu_detach_group() */
+	if (cur_hwpt)
+		__iommufd_device_detach(idev, false);
+
 	idev->hwpt = hwpt;
 	/* The HWPT reference from the caller is moved to this list */
 	list_add(&idev->devices_item, &hwpt->devices);
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, February 14, 2023 6:54 PM
>
> On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
> > On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
> >
> > > What about point 1? If dev2 and dev3 are already replaced when
> > > doing iommu_group_replace_domain() on dev1, their idev objects
> > > still have the old hwpt/iopt until user space does another two
> > > IOCTLs on them, right?
> >
> > We have a complicated model for multi-device groups...
> >
> > The first device in the group to change domains must move all the
> > devices in the group.
> >
> > But userspace is still expected to run through and change all the
> > other devices.
> >
> > So replace should be a NOP if the group is already linked to the right
> > domain.
> >
> > This patch needs to make sure that inconsistency in the view between
> > the iommu_group and the iommufd model doesn't cause a functional
> > problem.
>
> Yea, I was thinking that we'd need to block any access to the
> idev->hwpt of a pending device, before the kernel finishes the "NOP"
> IOCTL from userspace, maybe with a helper:
> 	(iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)

I didn't see what would be broken w/o such a blocking measure.

Can you elaborate?
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, February 14, 2023 7:00 PM
>
> On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
>
> > My confusion is that we have different flows between detach/attach
> > and replace.
> >
> > today with separate detach+attach we have following flow:
> >
> > 	Remove device from current hwpt;
> > 	if (last_device in hwpt) {
> > 		Remove hwpt domain from current iopt;
> > 		if (last_device in group)
> > 			detach group from hwpt domain;
> > 	}
> >
> > 	if (first device in group) {
> > 		attach group to new hwpt domain;
> > 		if (first_device in hwpt)
> > 			Add hwpt domain to new iopt;
> > 	Add device to new hwpt;
> >
> > but replace flow is different on the detach part:
> >
> > 	if (first device in group) {
> > 		replace group's domain from current hwpt to new hwpt;
> > 		if (first_device in hwpt)
> > 			Add hwpt domain to new iopt;
> > 	}
> >
> > 	Remove device from old hwpt;
> > 	if (last_device in old hwpt)
> > 		Remove hwpt domain from old iopt;
> >
> > 	Add device to new hwpt;
> >
> > I'm yet to figure out whether we have sufficient lock protection to
> > prevent other paths from using old iopt/hwpt to find the device
> > which is already attached to a different domain.
>
> With Jason's new series, the detach() routine is lighter now.
>
> I wonder if it'd be safer now to do the detach() call after
> iommu_group_replace_domain()?

yes, looks so.
On Wed, Feb 15, 2023 at 01:37:19AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, February 14, 2023 6:54 PM
> >
> > On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
> > > On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
> > >
> > > > What about point 1? If dev2 and dev3 are already replaced when
> > > > doing iommu_group_replace_domain() on dev1, their idev objects
> > > > still have the old hwpt/iopt until user space does another two
> > > > IOCTLs on them, right?
> > >
> > > We have a complicated model for multi-device groups...
> > >
> > > The first device in the group to change domains must move all the
> > > devices in the group.
> > >
> > > But userspace is still expected to run through and change all the
> > > other devices.
> > >
> > > So replace should be a NOP if the group is already linked to the right
> > > domain.
> > >
> > > This patch needs to make sure that inconsistency in the view between
> > > the iommu_group and the iommufd model doesn't cause a functional
> > > problem.
> >
> > Yea, I was thinking that we'd need to block any access to the
> > idev->hwpt of a pending device, before the kernel finishes the "NOP"
> > IOCTL from userspace, maybe with a helper:
> > 	(iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)
>
> I didn't see what would be broken w/o such a blocking measure.
>
> Can you elaborate?

iommu_group { idev1, idev2 }

(1) Attach all devices to domain1 {
	group->domain = domain1;
	idev1->hwpt = hwpt1;	// domain1
	idev2->hwpt = hwpt1;	// domain1
    }

(2) Attach (replace) idev1 only to domain2 {
	group->domain = domain2;
	idev1->hwpt = hwpt2;	// domain2
	idev2->hwpt == hwpt1	// domain1 != iommu_get_domain_for_dev()
    }

Then if user space isn't aware of these and continues to do
IOMMU_IOAS_MAP for idev2, IOVA mappings may be added onto domain2
correctly, yet domain2's iopt itree won't reflect that until
idev2->hwpt is updated too, right?

And the tricky thing is that, though we advocate a device-centric
uAPI, we'd still have to ask user space to align the devices in the
same iommu_group, which is also not present in QEMU's RFC v3 series.

The traditional detach+attach flow doesn't seem to have this issue,
since there's no re-entry, so the flow is always to detach all the
devices first before attaching to another domain.

Thanks
Nic
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, February 15, 2023 9:59 AM
>
> On Wed, Feb 15, 2023 at 01:37:19AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Tuesday, February 14, 2023 6:54 PM
> > >
> > > On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
> > > > On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
> > > >
> > > > > What about point 1? If dev2 and dev3 are already replaced when
> > > > > doing iommu_group_replace_domain() on dev1, their idev objects
> > > > > still have the old hwpt/iopt until user space does another two
> > > > > IOCTLs on them, right?
> > > >
> > > > We have a complicated model for multi-device groups...
> > > >
> > > > The first device in the group to change domains must move all the
> > > > devices in the group.
> > > >
> > > > But userspace is still expected to run through and change all the
> > > > other devices.
> > > >
> > > > So replace should be a NOP if the group is already linked to the right
> > > > domain.
> > > >
> > > > This patch needs to make sure that inconsistency in the view between
> > > > the iommu_group and the iommufd model doesn't cause a functional
> > > > problem.
> > >
> > > Yea, I was thinking that we'd need to block any access to the
> > > idev->hwpt of a pending device, before the kernel finishes the "NOP"
> > > IOCTL from userspace, maybe with a helper:
> > > 	(iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)
> >
> > I didn't see what would be broken w/o such a blocking measure.
> >
> > Can you elaborate?
>
> iommu_group { idev1, idev2 }
>
> (1) Attach all devices to domain1 {
> 	group->domain = domain1;
> 	idev1->hwpt = hwpt1;	// domain1
> 	idev2->hwpt = hwpt1;	// domain1
>     }
>
> (2) Attach (replace) idev1 only to domain2 {
> 	group->domain = domain2;
> 	idev1->hwpt = hwpt2;	// domain2
> 	idev2->hwpt == hwpt1	// domain1 != iommu_get_domain_for_dev()
>     }
>
> Then if user space isn't aware of these and continues to do
> IOMMU_IOAS_MAP for idev2, IOVA mappings may be added onto domain2
> correctly, yet domain2's iopt itree won't reflect that until
> idev2->hwpt is updated too, right?

IOMMU_IOAS_MAP is for an ioas instead of for a device.

Even w/o any device attached we still allow ioas map.

Also note that in case domain1 is still actively attached to idev3 in
another group, banning IOAS_MAP due to idev2 being in transition is
also incorrect for idev3.

> And the tricky thing is that, though we advocate a device-centric
> uAPI, we'd still have to ask user space to align the devices in the
> same iommu_group, which is also not present in QEMU's RFC v3 series.

IMHO this requirement has been stated clearly from the start when
designing this device-centric interface. QEMU should be improved to
take care of it.
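For illustration, an IOMMU_IOAS_MAP call names only an IOAS, never a
device (the struct and flags are from the iommufd uAPI header,
include/uapi/linux/iommufd.h; the variable values here are made up):

	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE |
			 IOMMU_IOAS_MAP_FIXED_IOVA,
		.ioas_id = ioas_id,		/* the IOAS, not an idev */
		.user_va = (__u64)(uintptr_t)buffer,
		.length = length,
		.iova = iova,
	};
	rc = ioctl(iommufd, IOMMU_IOAS_MAP, &map);

Every domain attached to that IOAS picks the mapping up through its
iopt, regardless of which devices sit behind each domain.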
On Wed, Feb 15, 2023 at 02:15:59AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, February 15, 2023 9:59 AM
> >
> > On Wed, Feb 15, 2023 at 01:37:19AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Tuesday, February 14, 2023 6:54 PM
> > > >
> > > > On Mon, Feb 13, 2023 at 10:49:34AM -0400, Jason Gunthorpe wrote:
> > > > > On Sun, Feb 12, 2023 at 11:48:30PM -0800, Nicolin Chen wrote:
> > > > >
> > > > > > What about point 1? If dev2 and dev3 are already replaced when
> > > > > > doing iommu_group_replace_domain() on dev1, their idev objects
> > > > > > still have the old hwpt/iopt until user space does another two
> > > > > > IOCTLs on them, right?
> > > > >
> > > > > We have a complicated model for multi-device groups...
> > > > >
> > > > > The first device in the group to change domains must move all the
> > > > > devices in the group.
> > > > >
> > > > > But userspace is still expected to run through and change all the
> > > > > other devices.
> > > > >
> > > > > So replace should be a NOP if the group is already linked to the right
> > > > > domain.
> > > > >
> > > > > This patch needs to make sure that inconsistency in the view between
> > > > > the iommu_group and the iommufd model doesn't cause a functional
> > > > > problem.
> > > >
> > > > Yea, I was thinking that we'd need to block any access to the
> > > > idev->hwpt of a pending device, before the kernel finishes the "NOP"
> > > > IOCTL from userspace, maybe with a helper:
> > > > 	(iommu_get_domain_for_dev(idev->dev) != idev->hwpt->domain)
> > >
> > > I didn't see what would be broken w/o such a blocking measure.
> > >
> > > Can you elaborate?
> >
> > iommu_group { idev1, idev2 }
> >
> > (1) Attach all devices to domain1 {
> > 	group->domain = domain1;
> > 	idev1->hwpt = hwpt1;	// domain1
> > 	idev2->hwpt = hwpt1;	// domain1
> >     }
> >
> > (2) Attach (replace) idev1 only to domain2 {
> > 	group->domain = domain2;
> > 	idev1->hwpt = hwpt2;	// domain2
> > 	idev2->hwpt == hwpt1	// domain1 != iommu_get_domain_for_dev()
> >     }
> >
> > Then if user space isn't aware of these and continues to do
> > IOMMU_IOAS_MAP for idev2, IOVA mappings may be added onto domain2
> > correctly, yet domain2's iopt itree won't reflect that until
> > idev2->hwpt is updated too, right?
>
> IOMMU_IOAS_MAP is for an ioas instead of for a device.
>
> Even w/o any device attached we still allow ioas map.
>
> Also note that in case domain1 is still actively attached to idev3 in
> another group, banning IOAS_MAP due to idev2 being in transition is
> also incorrect for idev3.

OK. That's likely true. It doesn't seem to be correct to block an
IOMMU_IOAS_MAP.

But things will be out of control if user space continues mapping
something onto domain1's iopt for idev2, even after it is attached
covertly to domain2's iopt by the replace routine. I wonder how the
kernel should handle this and keep the consistency between IOMMUFD
objects and the iommu_group.

> > And the tricky thing is that, though we advocate a device-centric
> > uAPI, we'd still have to ask user space to align the devices in the
> > same iommu_group, which is also not present in QEMU's RFC v3 series.
>
> IMHO this requirement has been stated clearly from the start when
> designing this device-centric interface. QEMU should be improved to
> take care of it.

OK. It has to be so...

Thanks
Nic
On Wed, Feb 15, 2023 at 01:38:32AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, February 14, 2023 7:00 PM
> >
> > On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:
> >
> > > My confusion is that we have different flows between detach/attach
> > > and replace.
> > >
> > > today with separate detach+attach we have following flow:
> > >
> > > 	Remove device from current hwpt;
> > > 	if (last_device in hwpt) {
> > > 		Remove hwpt domain from current iopt;
> > > 		if (last_device in group)
> > > 			detach group from hwpt domain;
> > > 	}
> > >
> > > 	if (first device in group) {
> > > 		attach group to new hwpt domain;
> > > 		if (first_device in hwpt)
> > > 			Add hwpt domain to new iopt;
> > > 	Add device to new hwpt;
> > >
> > > but replace flow is different on the detach part:
> > >
> > > 	if (first device in group) {
> > > 		replace group's domain from current hwpt to new hwpt;
> > > 		if (first_device in hwpt)
> > > 			Add hwpt domain to new iopt;
> > > 	}
> > >
> > > 	Remove device from old hwpt;
> > > 	if (last_device in old hwpt)
> > > 		Remove hwpt domain from old iopt;
> > >
> > > 	Add device to new hwpt;
> > >
> > > I'm yet to figure out whether we have sufficient lock protection to
> > > prevent other paths from using old iopt/hwpt to find the device
> > > which is already attached to a different domain.
> >
> > With Jason's new series, the detach() routine is lighter now.
> >
> > I wonder if it'd be safer now to do the detach() call after
> > iommu_group_replace_domain()?
>
> yes, looks so.

Will rebase a v3 once Jason updates his series.

Thanks!
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, February 15, 2023 3:15 PM
>
> But things will be out of control if user space continues mapping
> something onto domain1's iopt for idev2, even after it is attached
> covertly to domain2's iopt by the replace routine. I wonder how the
> kernel should handle this and keep the consistency between IOMMUFD
> objects and the iommu_group.

This is where I don't understand.

Domain mappings just reflect what an address space has.

Take Qemu for example. W/o vIOMMU, domain mappings are added/removed
according to changes in the GPA address space. W/ vIOMMU, they are
synced with the guest page table.

Why would userspace construct mappings for a specific device? When a
device is moved from one domain to another domain, it just bears with
whatever the new domain allows it to access.
On Tue, Feb 14, 2023 at 11:15:09PM -0800, Nicolin Chen wrote:

> But things will be out of control if user space continues mapping
> something onto domain1's iopt for idev2, even after it is attached
> covertly to domain2's iopt by the replace routine. I wonder how the
> kernel should handle this and keep the consistency between IOMMUFD
> objects and the iommu_group.

I've been looking at this; the reason the locking is such a PITA is
because we are still trying to use the device list shortcut.

We need to have an iommu group object instead, then everything will
work smoothly.

Jason
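Purely as a sketch of that direction (all names hypothetical; this
predates any posted patch), a per-group object would own the group's
single attachment and make the first/last-device logic explicit:

	struct iommufd_group {
		struct kref ref;
		struct mutex lock;
		struct iommu_group *group;
		struct iommufd_hw_pagetable *hwpt;	/* current attachment */
		struct list_head device_list;		/* idevs in the group */
	};

With something like this, attach/replace/detach reason about
igroup->hwpt under igroup->lock: replace is naturally a NOP when
igroup->hwpt already matches, and the old hwpt is released once
device_list empties.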
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index b8c3e3baccb5..8a9834fc129a 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -9,6 +9,8 @@
 #include "io_pagetable.h"
 #include "iommufd_private.h"
 
+MODULE_IMPORT_NS(IOMMUFD_INTERNAL);
+
 static bool allow_unsafe_interrupts;
 module_param(allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(
@@ -194,9 +196,61 @@ static bool iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
 	return false;
 }
 
+/**
+ * __iommufd_device_detach - Detach a device from idev->hwpt to new_hwpt
+ * @idev: device to detach
+ * @new_hwpt: new hw_pagetable to attach (pass in NULL for a simple detach)
+ * @detach_group: flag to call iommu_detach_group
+ *
+ * This is a cleanup helper shared by the replace and detach routines. Compared
+ * to a detach routine, a replace routine only needs a partial detach procedure:
+ * it does not need the iommu_detach_group(); it will attach the device to a new
+ * hw_pagetable after a partial detach from the currently attached hw_pagetable,
+ * so certain steps can be skipped if two hw_pagetables have the same IOAS.
+ */
+static void __iommufd_device_detach(struct iommufd_device *idev,
+				    struct iommufd_hw_pagetable *new_hwpt,
+				    bool detach_group)
+{
+	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
+	struct iommufd_ioas *new_ioas = NULL;
+
+	if (new_hwpt)
+		new_ioas = new_hwpt->ioas;
+
+	mutex_lock(&hwpt->devices_lock);
+	list_del(&idev->devices_item);
+	if (hwpt->ioas != new_ioas)
+		mutex_lock(&hwpt->ioas->mutex);
+	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+		if (list_empty(&hwpt->devices)) {
+			iopt_table_remove_domain(&hwpt->ioas->iopt,
+						 hwpt->domain);
+			list_del(&hwpt->hwpt_item);
+		}
+		if (detach_group)
+			iommu_detach_group(hwpt->domain, idev->group);
+	}
+	if (hwpt->ioas != new_ioas) {
+		iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
+		mutex_unlock(&hwpt->ioas->mutex);
+	}
+	mutex_unlock(&hwpt->devices_lock);
+
+	if (hwpt->auto_domain)
+		iommufd_object_destroy_user(idev->ictx, &hwpt->obj);
+	else
+		refcount_dec(&hwpt->obj.users);
+
+	idev->hwpt = NULL;
+
+	refcount_dec(&idev->obj.users);
+}
+
 static int iommufd_device_do_attach(struct iommufd_device *idev,
 				    struct iommufd_hw_pagetable *hwpt)
 {
+	struct iommufd_hw_pagetable *cur_hwpt = idev->hwpt;
 	phys_addr_t sw_msi_start = PHYS_ADDR_MAX;
 	int rc;
 
@@ -236,7 +290,7 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 	 * the group once for the first device that is in the group.
 	 */
 	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
-		rc = iommu_attach_group(hwpt->domain, idev->group);
+		rc = iommu_group_replace_domain(idev->group, hwpt->domain);
 		if (rc)
 			goto out_iova;
 
@@ -249,6 +303,10 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 		}
 	}
 
+	/* Replace the cur_hwpt without iommu_detach_group() */
+	if (cur_hwpt)
+		__iommufd_device_detach(idev, hwpt, false);
+
 	idev->hwpt = hwpt;
 	refcount_inc(&hwpt->obj.users);
 	list_add(&idev->devices_item, &hwpt->devices);
@@ -256,7 +314,10 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 	return 0;
 
 out_detach:
-	iommu_detach_group(hwpt->domain, idev->group);
+	if (cur_hwpt)
+		iommu_group_replace_domain(idev->group, cur_hwpt->domain);
+	else
+		iommu_detach_group(hwpt->domain, idev->group);
 out_iova:
 	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
 out_unlock:
@@ -345,6 +406,13 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id)
 		struct iommufd_hw_pagetable *hwpt =
 			container_of(pt_obj, struct iommufd_hw_pagetable, obj);
 
+		if (idev->hwpt == hwpt)
+			goto out_done;
+		if (idev->hwpt && idev->hwpt->auto_domain) {
+			rc = -EBUSY;
+			goto out_put_pt_obj;
+		}
+
 		mutex_lock(&hwpt->ioas->mutex);
 		rc = iommufd_device_do_attach(idev, hwpt);
 		mutex_unlock(&hwpt->ioas->mutex);
@@ -356,6 +424,8 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id)
 		struct iommufd_ioas *ioas =
 			container_of(pt_obj, struct iommufd_ioas, obj);
 
+		if (idev->hwpt)
+			return -EBUSY;
 		rc = iommufd_device_auto_get_domain(idev, ioas);
 		if (rc)
 			goto out_put_pt_obj;
@@ -367,6 +437,7 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id)
 	}
 
 	refcount_inc(&idev->obj.users);
+out_done:
 	*pt_id = idev->hwpt->obj.id;
 	rc = 0;
 
@@ -385,31 +456,7 @@ EXPORT_SYMBOL_NS_GPL(iommufd_device_attach, IOMMUFD);
  */
 void iommufd_device_detach(struct iommufd_device *idev)
 {
-	struct iommufd_hw_pagetable *hwpt = idev->hwpt;
-
-	mutex_lock(&hwpt->ioas->mutex);
-	mutex_lock(&hwpt->devices_lock);
-	list_del(&idev->devices_item);
-	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
-		if (list_empty(&hwpt->devices)) {
-			iopt_table_remove_domain(&hwpt->ioas->iopt,
-						 hwpt->domain);
-			list_del(&hwpt->hwpt_item);
-		}
-		iommu_detach_group(hwpt->domain, idev->group);
-	}
-	iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
-	mutex_unlock(&hwpt->devices_lock);
-	mutex_unlock(&hwpt->ioas->mutex);
-
-	if (hwpt->auto_domain)
-		iommufd_object_destroy_user(idev->ictx, &hwpt->obj);
-	else
-		refcount_dec(&hwpt->obj.users);
-
-	idev->hwpt = NULL;
-
-	refcount_dec(&idev->obj.users);
+	__iommufd_device_detach(idev, NULL, true);
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_device_detach, IOMMUFD);
 
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 593138bb37b8..200c783800ad 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -9,6 +9,8 @@
 #include <linux/refcount.h>
 #include <linux/uaccess.h>
 
+#include "../iommu-priv.h"
+
 struct iommu_domain;
 struct iommu_group;
 struct iommu_option;