Message ID | c1c65ce093a3b585546dd17a77949efbe03a81d9.1674939002.git.nicolinc@nvidia.com |
---|---|
State | New |
Subject | [PATCH v2 1/3] iommufd: Add devices_users to track the hw_pagetable usage by device |
From | Nicolin Chen <nicolinc@nvidia.com> |
To | jgg@nvidia.com, kevin.tian@intel.com |
Cc | yi.l.liu@intel.com, iommu@lists.linux.dev, linux-kernel@vger.kernel.org |
Date | Sat, 28 Jan 2023 13:18:09 -0800 |
Series | iommufd: Remove iommufd_hw_pagetable_has_group |
Commit Message
Nicolin Chen
Jan. 28, 2023, 9:18 p.m. UTC
From: Yi Liu <yi.l.liu@intel.com>

Currently, hw_pagetable tracks the attached devices using a device list. When attaching the first device to a kernel-managed hw_pagetable, the hwpt should be linked to the IOAS; when detaching the last device from this hwpt, the link with the IOAS should be removed too. This first-or-last device check is done with list_empty(hwpt->devices).

However, with a nested configuration, when a device is attached to the user-managed stage-1 hw_pagetable, it is added to this user-managed hwpt's device list instead of the kernel-managed stage-2 hwpt's one. This breaks the logic that links/disconnects a kernel-managed hw_pagetable to/from the IOAS/IOPT: e.g. the stage-2 hw_pagetable would be linked to the IOAS multiple times if multiple devices are attached, yet its device list becomes empty as soon as one device is detached.

Add a devices_users counter in struct iommufd_hw_pagetable to track the usage of the hw_pagetable by the attached devices. Make this field a pointer and only allocate it for a stage-2 hw_pagetable. A stage-1 hw_pagetable should reuse its stage-2 hw_pagetable's devices_users, because when a device attaches to a stage-1 hw_pagetable, linking the stage-2 hwpt to the IOAS is still required. So, with a nested configuration, increase the devices_users on the stage-2 (parent) hwpt no matter whether a device is attached to the stage-1 or the stage-2 hwpt.

Adding this devices_users also reduces the dependency on the device list, which helps the following patch remove the device list completely.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/device.c          |  8 +++++---
 drivers/iommu/iommufd/hw_pagetable.c    | 11 +++++++++++
 drivers/iommu/iommufd/iommufd_private.h |  1 +
 3 files changed, 17 insertions(+), 3 deletions(-)
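The idea can be summarized with a small standalone sketch (stand-in types and helpers only, not the actual kernel diff): the first/last-device check becomes a shared counter that a stage-1 hwpt borrows from its stage-2 parent.

```c
/*
 * Sketch only: the struct, field names and link/unlink helpers below are
 * simplified stand-ins for the real iommufd code.
 */
#include <stdbool.h>

struct hwpt {
	struct hwpt *parent;		/* NULL for a stage-2 hwpt */
	unsigned int *devices_users;	/* stage-1 points at its parent's counter */
	unsigned int users_storage;	/* only used by a stage-2 hwpt */
	bool linked_to_ioas;
};

static void hwpt_init(struct hwpt *hwpt, struct hwpt *parent)
{
	hwpt->parent = parent;
	/* A stage-2 hwpt owns the counter; a stage-1 hwpt reuses its parent's */
	hwpt->devices_users = parent ? parent->devices_users
				     : &hwpt->users_storage;
}

static void hwpt_attach_device(struct hwpt *hwpt)
{
	struct hwpt *s2 = hwpt->parent ? hwpt->parent : hwpt;

	if ((*hwpt->devices_users)++ == 0)	/* first device overall */
		s2->linked_to_ioas = true;	/* iopt_table_add_domain() here */
}

static void hwpt_detach_device(struct hwpt *hwpt)
{
	struct hwpt *s2 = hwpt->parent ? hwpt->parent : hwpt;

	if (--(*hwpt->devices_users) == 0)	/* last device overall */
		s2->linked_to_ioas = false;	/* iopt_table_remove_domain() here */
}
```

Whether that counter should live behind a pointer at this point in the series is exactly what the review comments below push back on.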
Comments
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Sunday, January 29, 2023 5:18 AM
>
> From: Yi Liu <yi.l.liu@intel.com>
>
> Currently, hw_pagetable tracks the attached devices using a device list.
> When attaching the first device to the kernel-managed hw_pagetable, it
> should be linked to IOAS. When detaching the last device from this hwpt,
> the link with IOAS should be removed too. And this first-or-last device
> check is done with list_empty(hwpt->devices).
>
> However, with a nested configuration, when a device is attached to the
> user-managed stage-1 hw_pagetable, it will be added to this user-managed
> hwpt's device list instead of the kernel-managed stage-2 hwpt's one. And
> this breaks the logic for a kernel-managed hw_pagetable link/disconnect
> to/from IOAS/IOPT. e.g. the stage-2 hw_pagetable would be linked to IOAS
> multiple times if multiple device is attached, but it will become empty
> as soon as one device detached.
>
> Add a devices_users in struct iommufd_hw_pagetable to track the users of
device_users
> hw_pagetable by the attached devices. Make this field as a pointer, only
> allocate for a stage-2 hw_pagetable. A stage-1 hw_pagetable should reuse
> the stage-2 hw_pagetable's devices_users, because when a device attaches
> to a stage-1 hw_pagetable, linking the stage-2 hwpt to the IOAS is still
> required. So, with a nested configuration, increase the devices_users on
> the stage-2 (parent) hwpt, no matter a device is attached to the stage-1
> or the stage-2 hwpt.

Above is very confusing w/o seeing the full series of nesting support.

As a preparatory step this should focus on existing code and what this
series tries to achieve. e.g. I'd not make device_users a pointer here.
Do that incrementally when the nesting support comes.
On Sun, Jan 29, 2023 at 09:23:05AM +0000, Tian, Kevin wrote: > > From: Nicolin Chen <nicolinc@nvidia.com> > > Sent: Sunday, January 29, 2023 5:18 AM > > > > From: Yi Liu <yi.l.liu@intel.com> > > > > Currently, hw_pagetable tracks the attached devices using a device list. > > When attaching the first device to the kernel-managed hw_pagetable, it > > should be linked to IOAS. When detaching the last device from this hwpt, > > the link with IOAS should be removed too. And this first-or-last device > > check is done with list_empty(hwpt->devices). > > > > However, with a nested configuration, when a device is attached to the > > user-managed stage-1 hw_pagetable, it will be added to this user-managed > > hwpt's device list instead of the kernel-managed stage-2 hwpt's one. And > > this breaks the logic for a kernel-managed hw_pagetable link/disconnect > > to/from IOAS/IOPT. e.g. the stage-2 hw_pagetable would be linked to IOAS > > multiple times if multiple device is attached, but it will become empty > > as soon as one device detached. > > > > Add a devices_users in struct iommufd_hw_pagetable to track the users of > > device_users I assume you are suggesting a rename right? I can do that. > > hw_pagetable by the attached devices. Make this field as a pointer, only > > allocate for a stage-2 hw_pagetable. A stage-1 hw_pagetable should reuse > > the stage-2 hw_pagetable's devices_users, because when a device attaches > > to a stage-1 hw_pagetable, linking the stage-2 hwpt to the IOAS is still > > required. So, with a nested configuration, increase the devices_users on > > the stage-2 (parent) hwpt, no matter a device is attached to the stage-1 > > or the stage-2 hwpt. > > Above is very confusing w/o seeing the full series of nesting support. > > As a preparatory step this should focus on existing code and what this > series tries to achieve. e.g. I'd not make device_users a pointer here. > Do that incrementally when the nesting support comes. OK. I will shift that part to the nesting series. Thanks Nic
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Sunday, January 29, 2023 5:31 PM
>
> > > Add a devices_users in struct iommufd_hw_pagetable to track the users
> of
> >
> > device_users
>
> I assume you are suggesting a rename right? I can do that.
>

yes
> From: Tian, Kevin <kevin.tian@intel.com> > Sent: Sunday, January 29, 2023 5:23 PM > > > hw_pagetable by the attached devices. Make this field as a pointer, only > > allocate for a stage-2 hw_pagetable. A stage-1 hw_pagetable should > reuse > > the stage-2 hw_pagetable's devices_users, because when a device > attaches > > to a stage-1 hw_pagetable, linking the stage-2 hwpt to the IOAS is still > > required. So, with a nested configuration, increase the devices_users on > > the stage-2 (parent) hwpt, no matter a device is attached to the stage-1 > > or the stage-2 hwpt. > > Above is very confusing w/o seeing the full series of nesting support. > > As a preparatory step this should focus on existing code and what this > series tries to achieve. e.g. I'd not make device_users a pointer here. > Do that incrementally when the nesting support comes. Yes, in the below branch, I've moved this patch to be together with the nesting commits. Maybe I can send out the nesting RFC. https://github.com/yiliu1765/iommufd/commits/wip/iommufd-v6.2-rc4-nesting Regards, Yi Liu
On Sat, Jan 28, 2023 at 01:18:09PM -0800, Nicolin Chen wrote: > From: Yi Liu <yi.l.liu@intel.com> > > Currently, hw_pagetable tracks the attached devices using a device list. > When attaching the first device to the kernel-managed hw_pagetable, it > should be linked to IOAS. When detaching the last device from this hwpt, > the link with IOAS should be removed too. And this first-or-last device > check is done with list_empty(hwpt->devices). > > However, with a nested configuration, when a device is attached to the > user-managed stage-1 hw_pagetable, it will be added to this user-managed > hwpt's device list instead of the kernel-managed stage-2 hwpt's one. And > this breaks the logic for a kernel-managed hw_pagetable link/disconnect > to/from IOAS/IOPT. e.g. the stage-2 hw_pagetable would be linked to IOAS > multiple times if multiple device is attached, but it will become empty > as soon as one device detached. Why this seems really weird to say. The stage 2 is linked explicitly to the IOAS that drives it's map/unmap Why is there any implicit activity here? There should be no implicit attach of the S2 to an IOAS ever. Jason
On Mon, Jan 30, 2023 at 11:02:25AM -0400, Jason Gunthorpe wrote: > On Sat, Jan 28, 2023 at 01:18:09PM -0800, Nicolin Chen wrote: > > From: Yi Liu <yi.l.liu@intel.com> > > > > Currently, hw_pagetable tracks the attached devices using a device list. > > When attaching the first device to the kernel-managed hw_pagetable, it > > should be linked to IOAS. When detaching the last device from this hwpt, > > the link with IOAS should be removed too. And this first-or-last device > > check is done with list_empty(hwpt->devices). > > > > However, with a nested configuration, when a device is attached to the > > user-managed stage-1 hw_pagetable, it will be added to this user-managed > > hwpt's device list instead of the kernel-managed stage-2 hwpt's one. And > > this breaks the logic for a kernel-managed hw_pagetable link/disconnect > > to/from IOAS/IOPT. e.g. the stage-2 hw_pagetable would be linked to IOAS > > multiple times if multiple device is attached, but it will become empty > > as soon as one device detached. > > Why this seems really weird to say. > > The stage 2 is linked explicitly to the IOAS that drives it's > map/unmap > > Why is there any implicit activity here? There should be no implicit > attach of the S2 to an IOAS ever. I think this is supposed to say the following use case: Two stage-1 hwpts share the same parent s2_hwpt: attach device1 to stage-1 hwpt1: ... if (list_empty(s1_hwpt1->devices)) // empty; true iopt_table_add_domain(s2_hwpt->domain); // do once s1_hwpt1 device list cnt++; ... attach device2 to stage-1 hwpt2: ... if (list_empty(s1_hwpt2->devices)) // empty; true iopt_table_add_domain(s2_hwpt->domain); // do again s1_hwpt2 device list cnt++; ... This is because each hwpt has its own device list. To prevent the duplicated iopt_table_add_domain call, we need to check all the device list. So this patch adds a shared list among all relevant hwpts. I can revise the commit message to make a better sense. Thanks Nicolin
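The failure mode described above boils down to a small runnable illustration (purely schematic; the struct names are stand-ins, not iommufd code): because each stage-1 hwpt keeps its own device list, the shared stage-2 domain gets linked to the IOAS once per sibling hwpt.

```c
#include <stdio.h>

struct hwpt { int nr_devices; struct hwpt *s2_parent; };
struct ioas { int times_s2_linked; };

static void do_attach(struct ioas *ioas, struct hwpt *s1)
{
	if (s1->nr_devices == 0)	/* list_empty(hwpt->devices) */
		ioas->times_s2_linked++;	/* iopt_table_add_domain(s1->s2_parent) */
	s1->nr_devices++;
}

int main(void)
{
	struct ioas ioas = { 0 };
	struct hwpt s2 = { 0, NULL };
	struct hwpt s1_a = { 0, &s2 }, s1_b = { 0, &s2 };

	do_attach(&ioas, &s1_a);	/* first device: links the shared S2, fine */
	do_attach(&ioas, &s1_b);	/* looks like "first" again on the sibling hwpt */
	printf("S2 linked to the IOAS %d times\n", ioas.times_s2_linked);	/* prints 2 */
	return 0;
}
```

A counter shared across the sibling hwpts (the devices_users of this patch) is what removes the double count.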
On Mon, Jan 30, 2023 at 11:27:37AM -0800, Nicolin Chen wrote: > On Mon, Jan 30, 2023 at 11:02:25AM -0400, Jason Gunthorpe wrote: > > On Sat, Jan 28, 2023 at 01:18:09PM -0800, Nicolin Chen wrote: > > > From: Yi Liu <yi.l.liu@intel.com> > > > > > > Currently, hw_pagetable tracks the attached devices using a device list. > > > When attaching the first device to the kernel-managed hw_pagetable, it > > > should be linked to IOAS. When detaching the last device from this hwpt, > > > the link with IOAS should be removed too. And this first-or-last device > > > check is done with list_empty(hwpt->devices). > > > > > > However, with a nested configuration, when a device is attached to the > > > user-managed stage-1 hw_pagetable, it will be added to this user-managed > > > hwpt's device list instead of the kernel-managed stage-2 hwpt's one. And > > > this breaks the logic for a kernel-managed hw_pagetable link/disconnect > > > to/from IOAS/IOPT. e.g. the stage-2 hw_pagetable would be linked to IOAS > > > multiple times if multiple device is attached, but it will become empty > > > as soon as one device detached. > > > > Why this seems really weird to say. > > > > The stage 2 is linked explicitly to the IOAS that drives it's > > map/unmap > > > > Why is there any implicit activity here? There should be no implicit > > attach of the S2 to an IOAS ever. > > I think this is supposed to say the following use case: > > Two stage-1 hwpts share the same parent s2_hwpt: > > attach device1 to stage-1 hwpt1: > ... > if (list_empty(s1_hwpt1->devices)) // empty; true > iopt_table_add_domain(s2_hwpt->domain); // do once > s1_hwpt1 device list cnt++; > ... No, this doesn't make sense. The s2_hwpt should be created explicitly, not using autodomains When it is created it should be linked to a single IOAS and that is when iopt_table_add_domain() should have been called. The S1 attach should do *nothing* to a S2. Jason
On Mon, Jan 30, 2023 at 03:50:07PM -0400, Jason Gunthorpe wrote: > On Mon, Jan 30, 2023 at 11:27:37AM -0800, Nicolin Chen wrote: > > On Mon, Jan 30, 2023 at 11:02:25AM -0400, Jason Gunthorpe wrote: > > > On Sat, Jan 28, 2023 at 01:18:09PM -0800, Nicolin Chen wrote: > > > > From: Yi Liu <yi.l.liu@intel.com> > > > > > > > > Currently, hw_pagetable tracks the attached devices using a device list. > > > > When attaching the first device to the kernel-managed hw_pagetable, it > > > > should be linked to IOAS. When detaching the last device from this hwpt, > > > > the link with IOAS should be removed too. And this first-or-last device > > > > check is done with list_empty(hwpt->devices). > > > > > > > > However, with a nested configuration, when a device is attached to the > > > > user-managed stage-1 hw_pagetable, it will be added to this user-managed > > > > hwpt's device list instead of the kernel-managed stage-2 hwpt's one. And > > > > this breaks the logic for a kernel-managed hw_pagetable link/disconnect > > > > to/from IOAS/IOPT. e.g. the stage-2 hw_pagetable would be linked to IOAS > > > > multiple times if multiple device is attached, but it will become empty > > > > as soon as one device detached. > > > > > > Why this seems really weird to say. > > > > > > The stage 2 is linked explicitly to the IOAS that drives it's > > > map/unmap > > > > > > Why is there any implicit activity here? There should be no implicit > > > attach of the S2 to an IOAS ever. > > > > I think this is supposed to say the following use case: > > > > Two stage-1 hwpts share the same parent s2_hwpt: > > > > attach device1 to stage-1 hwpt1: > > ... > > if (list_empty(s1_hwpt1->devices)) // empty; true > > iopt_table_add_domain(s2_hwpt->domain); // do once > > s1_hwpt1 device list cnt++; > > ... > > No, this doesn't make sense. > > The s2_hwpt should be created explicitly, not using autodomains iopt_table_add_domain() is called in iommufd_device_do_attach(), so it's shared by both a created hwpt or autodomain. > When it is created it should be linked to a single IOAS and that is > when iopt_table_add_domain() should have been called. I recall we've discussed this that SMMU sets up domain when it attaches the device to, so we made a compromise here... > The S1 attach should do *nothing* to a S2. With that compromise, a nested attach flow may be 1) create an s2 hwpt 2) create an s1 hwpt 3) attach dev to s1 calls iopt_table_add_domain() Thanks Nicolin
On Mon, Jan 30, 2023 at 12:04:33PM -0800, Nicolin Chen wrote: > I recall we've discussed this that SMMU sets up domain when it > attaches the device to, so we made a compromise here... The ARM driver has a problem that it doesn't know what SMMU instance will host the domain when it is allocated so it doesn't know if it should select a S1 or S2 page table format - which is determined by the capabilities of the specific SMMU HW block. However, we don't have this problem when creating the S2. The S2 should be created by a special alloc_domain_iommufd() asking for the S2 format. Not only does the new alloc_domain_iommufd API directly request a S2 format table, but it also specifies the struct device so any residual details can be determined directly. Thus there is no need to do the two stage process when working with the S2. So fixup the driver to create fully configured iommu_domain's immediately and get rid of this problem. IMHO I would structure the smmu driver so that all the different iommu_domain formats have their own ops pointer. The special "undecided" format would have a special ops with only attach_dev and at first attach it would switch the ops to whatever format it selected. I think this could get rid of a lot of the 'if undecided/S1/S2/CD' complexity all over the place. You know what type it is because you were called on a op that is only called on its type. Jason
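A minimal sketch of that per-format ops idea, assuming illustrative names throughout (these are not the real SMMU driver symbols): the "undecided" ops implement only attach_dev and switch the domain to the format-specific ops on first attach.

```c
#include <stdbool.h>

struct my_device;			/* opaque stand-in for struct device */

struct my_domain;
struct my_domain_ops {
	int (*attach_dev)(struct my_domain *dom, struct my_device *dev);
};
struct my_domain {
	const struct my_domain_ops *ops;
};

static int s1_attach_dev(struct my_domain *dom, struct my_device *dev)
{
	/* stage-1 specific attach path; no "if (stage == S1)" tests needed */
	return 0;
}

static int s2_attach_dev(struct my_domain *dom, struct my_device *dev)
{
	/* stage-2 specific attach path */
	return 0;
}

static const struct my_domain_ops s1_ops = { .attach_dev = s1_attach_dev };
static const struct my_domain_ops s2_ops = { .attach_dev = s2_attach_dev };

/*
 * The "undecided" ops implement only attach_dev: on the first attach the
 * format is picked from the SMMU instance's capabilities and the domain's
 * ops pointer is switched over, so every later call lands directly in the
 * format-specific path.
 */
static int undecided_attach_dev(struct my_domain *dom, struct my_device *dev)
{
	bool want_s2 = false;		/* would be derived from smmu->features */

	dom->ops = want_s2 ? &s2_ops : &s1_ops;
	return dom->ops->attach_dev(dom, dev);
}

static const struct my_domain_ops undecided_ops = {
	.attach_dev = undecided_attach_dev,
};
```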
On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote: > On Mon, Jan 30, 2023 at 12:04:33PM -0800, Nicolin Chen wrote: > > > I recall we've discussed this that SMMU sets up domain when it > > attaches the device to, so we made a compromise here... > > The ARM driver has a problem that it doesn't know what SMMU instance > will host the domain when it is allocated so it doesn't know if it > should select a S1 or S2 page table format - which is determined by > the capabilities of the specific SMMU HW block. > > However, we don't have this problem when creating the S2. The S2 > should be created by a special alloc_domain_iommufd() asking for the > S2 format. Not only does the new alloc_domain_iommufd API directly > request a S2 format table, but it also specifies the struct device so > any residual details can be determined directly. > > Thus there is no need to do the two stage process when working with > the S2. Ah, right! Taking a quick look, we should be able to call that arm_smmu_domain_finalise when handling alloc_domain_iommufd(). > So fixup the driver to create fully configured iommu_domain's > immediately and get rid of this problem. OK. I will draft a patch today. > IMHO I would structure the smmu driver so that all the different > iommu_domain formats have their own ops pointer. The special > "undecided" format would have a special ops with only attach_dev and > at first attach it would switch the ops to whatever format it > selected. > > I think this could get rid of a lot of the 'if undecided/S1/S2/CD' > complexity all over the place. You know what type it is because you > were called on a op that is only called on its type. I see. I can try that as well. Hopefully it won't touch too many places that cam raise a potential big concern/objection.. Thanks Nicolin
On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote: > IMHO I would structure the smmu driver so that all the different > iommu_domain formats have their own ops pointer. The special > "undecided" format would have a special ops with only attach_dev and > at first attach it would switch the ops to whatever format it > selected. > > I think this could get rid of a lot of the 'if undecided/S1/S2/CD' > complexity all over the place. You know what type it is because you > were called on a op that is only called on its type. An auto/unmanaged domain allocation via iommu_domain_alloc() would be S1, while an allocation via ops->domain_alloc_user can be S1 or S2 with a given parameter/flag. So, actually the format is always decided. The trouble we have with the iommu_domain_alloc() path is that we don't pass the dev pointer down to ops->domain_alloc. So, the SMMU driver can't know which SMMU device the device is behind, resulting in being unable to finalizing the domain. Robin mentioned that he has a patch "iommu: Pass device through ops->domain_alloc". Perhaps that is required for us to entirely fix the add_domain() problem? Thanks Nic
On Mon, Jan 30, 2023 at 12:54:01PM -0800, Nicolin Chen wrote: > On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote: > > On Mon, Jan 30, 2023 at 12:04:33PM -0800, Nicolin Chen wrote: > > > > > I recall we've discussed this that SMMU sets up domain when it > > > attaches the device to, so we made a compromise here... > > > > The ARM driver has a problem that it doesn't know what SMMU instance > > will host the domain when it is allocated so it doesn't know if it > > should select a S1 or S2 page table format - which is determined by > > the capabilities of the specific SMMU HW block. > > > > However, we don't have this problem when creating the S2. The S2 > > should be created by a special alloc_domain_iommufd() asking for the > > S2 format. Not only does the new alloc_domain_iommufd API directly > > request a S2 format table, but it also specifies the struct device so > > any residual details can be determined directly. > > > > Thus there is no need to do the two stage process when working with > > the S2. > > Ah, right! Taking a quick look, we should be able to call that > arm_smmu_domain_finalise when handling alloc_domain_iommufd(). > > > So fixup the driver to create fully configured iommu_domain's > > immediately and get rid of this problem. > > OK. I will draft a patch today. @Yi Do you recall doing iopt_table_add_domain() in hwpt_alloc()? Jason has a great point above. So even SMMU should be able to call the iopt_table_add_domain() after a kernel-manged hwpt allocation rather than after an iommu_attach_group(), except an auto_domain or a selftest mock_domain that still needs to attach the device first, otherwise the SMMU driver (currently) cannot finalize the domain aperture. I made a draft today, and ran some sanity with SMMUv3: "iommufd: Attach/detach hwpt to IOAS at alloc/destroy" https://github.com/nicolinc/iommufd/commit/5ae54f360495aae35b5967d1eb00149912145639 The change basically: moves iopt_table_add/remove_domain() next to hwpt_alloc/destroy(); an auto_domain or a mock_domain needs to attach_dev first, before calling the add_domain(). With this change, following patches of attach_ioas() and new selftest should be also updated. I have them in a wip branch: https://github.com/nicolinc/iommufd/commits/wip/iommufd-v6.2-rc4-nesting-01312023 Can you check if there's anything wrong with this approach? And would it be possible for you to integrate this into the nesting series? I can also help cleanup and polish the changes with a proper commit message. Thanks Nic
On Tue, Jan 31, 2023 at 10:57:19PM -0800, Nicolin Chen wrote: > On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote: > > > IMHO I would structure the smmu driver so that all the different > > iommu_domain formats have their own ops pointer. The special > > "undecided" format would have a special ops with only attach_dev and > > at first attach it would switch the ops to whatever format it > > selected. > > > > I think this could get rid of a lot of the 'if undecided/S1/S2/CD' > > complexity all over the place. You know what type it is because you > > were called on a op that is only called on its type. > > An auto/unmanaged domain allocation via iommu_domain_alloc() would > be S1, while an allocation via ops->domain_alloc_user can be S1 or > S2 with a given parameter/flag. So, actually the format is always > decided. The trouble we have with the iommu_domain_alloc() path is > that we don't pass the dev pointer down to ops->domain_alloc. So, Precisely, the undecided part is not the format but the ias/ios and SMMU feature bits of the SMMU HW device's, in order to set up the domain->geometry. And the regular domain_alloc() taking a "type" augment alone can't get the SMMU pointer. Thanks Nic > the SMMU driver can't know which SMMU device the device is behind, > resulting in being unable to finalizing the domain. Robin mentioned > that he has a patch "iommu: Pass device through ops->domain_alloc". > Perhaps that is required for us to entirely fix the add_domain() > problem? > > Thanks > Nic
On Tue, Jan 31, 2023 at 10:57:13PM -0800, Nicolin Chen wrote: > On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote: > > > IMHO I would structure the smmu driver so that all the different > > iommu_domain formats have their own ops pointer. The special > > "undecided" format would have a special ops with only attach_dev and > > at first attach it would switch the ops to whatever format it > > selected. > > > > I think this could get rid of a lot of the 'if undecided/S1/S2/CD' > > complexity all over the place. You know what type it is because you > > were called on a op that is only called on its type. > > An auto/unmanaged domain allocation via iommu_domain_alloc() would > be S1, while an allocation via ops->domain_alloc_user can be S1 or > S2 with a given parameter/flag. So, actually the format is always > decided. No, it can't decide the S1/S2 format until it knows the smmu because of this: /* Restrict the stage to what we can actually support */ if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1)) smmu_domain->stage = ARM_SMMU_DOMAIN_S2; if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S2)) smmu_domain->stage = ARM_SMMU_DOMAIN_S1; So the format is never decided. > that we don't pass the dev pointer down to ops->domain_alloc. So, > the SMMU driver can't know which SMMU device the device is behind, > resulting in being unable to finalizing the domain. Robin mentioned > that he has a patch "iommu: Pass device through ops->domain_alloc". > Perhaps that is required for us to entirely fix the add_domain() > problem? Robin is making progress, hopefully soon So the issue is with replace you need to have the domain populated before we can call replace but you can't populate the domain until it is bound because of the above issue? That seems unsovlable without fixing up the driver. I'd say replace can go ahead ingoring that issue and that for now replace will only work on ARM with domains created by domain_alloc_user that are fully configured. It will start working correctly for auto domains once Robin's changes get finished. Is there another issue? Jason
On Wed, Feb 01, 2023 at 11:53:02AM -0400, Jason Gunthorpe wrote: > On Tue, Jan 31, 2023 at 10:57:13PM -0800, Nicolin Chen wrote: > > On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote: > > > > > IMHO I would structure the smmu driver so that all the different > > > iommu_domain formats have their own ops pointer. The special > > > "undecided" format would have a special ops with only attach_dev and > > > at first attach it would switch the ops to whatever format it > > > selected. > > > > > > I think this could get rid of a lot of the 'if undecided/S1/S2/CD' > > > complexity all over the place. You know what type it is because you > > > were called on a op that is only called on its type. > > > > An auto/unmanaged domain allocation via iommu_domain_alloc() would > > be S1, while an allocation via ops->domain_alloc_user can be S1 or > > S2 with a given parameter/flag. So, actually the format is always > > decided. > > No, it can't decide the S1/S2 format until it knows the smmu because > of this: > > /* Restrict the stage to what we can actually support */ > if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1)) > smmu_domain->stage = ARM_SMMU_DOMAIN_S2; > if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S2)) > smmu_domain->stage = ARM_SMMU_DOMAIN_S1; > > So the format is never decided. OK. That's right. And the solution to that is also passing a dev pointer in regular ->domain_alloc() op. > > that we don't pass the dev pointer down to ops->domain_alloc. So, > > the SMMU driver can't know which SMMU device the device is behind, > > resulting in being unable to finalizing the domain. Robin mentioned > > that he has a patch "iommu: Pass device through ops->domain_alloc". > > Perhaps that is required for us to entirely fix the add_domain() > > problem? > > Robin is making progress, hopefully soon > > So the issue is with replace you need to have the domain populated > before we can call replace but you can't populate the domain until it > is bound because of the above issue? That seems unsovlable without > fixing up the driver. Not really. A REPLACE ioctl is just an ATTACH, if the device just gets BIND-ed. So the SMMU driver will initialize ("finalise") the domain during the replace() call, then iopt_table_add_domain() can be done. So, not a blocker here. > I'd say replace can go ahead ingoring that issue and that for now > replace will only work on ARM with domains created by > domain_alloc_user that are fully configured. > > It will start working correctly for auto domains once Robin's changes > get finished. > > Is there another issue? Oh. I think we mixed the topics here. These three patches were not to unblock but to clean up a way for the replace series and the nesting series, for the device locking issue: if (cur_hwpt != hwpt) mutex_lock(&cur_hwpt->device_lock); mutex_lock(&hwpt->device_lock); ... if (iommufd_hw_pagetabe_has_group()) { // touching device list ... iommu_group_replace_domain(); ... } if (cur_hwpt && hwpt) list_del(&idev->devices_item); list_add(&idev->devices_item, &cur_hwpt->devices); ... mutex_unlock(&hwpt->device_lock); if (cur_hwpt != hwpt) mutex_unlock(&cur_hwpt->device_lock); I just gave another thought about it. Since we have the patch-2 from this series moving the ioas->mutex, it already serializes attach/detach routines. And I see that all the places touching idev->device_item and hwpt->devices are protected by ioas->mutex. So, perhaps we can simply remove the device_lock? do_attach(): mutex_lock(&ioas->mutex); // protect both devices_item and hwpt_item ... 
    if (iommufd_hw_pagetable_has_group()) { // touching device list
        ...
        iommu_group_replace_domain();
        ...
    }
    if (cur_hwpt && hwpt)
        list_del(&idev->devices_item);
    list_add(&idev->devices_item, &cur_hwpt->devices);
    ...
    mutex_unlock(&ioas->mutex);

do_detach():
    mutex_lock(&ioas->mutex); // protect both devices_item and hwpt_item
    ...
    if (iommufd_hw_pagetable_has_group()) { // touching device list
        ...
        iommu_detach_group();
        ...
    }
    list_del(&idev->devices_item);
    ...
    mutex_unlock(&ioas->mutex);

If this is correct, I think I can prepare the replace series and send it
by the end of the day.

Thanks
Nic
On Wed, Feb 01, 2023 at 09:46:23AM -0800, Nicolin Chen wrote: > > So the issue is with replace you need to have the domain populated > > before we can call replace but you can't populate the domain until it > > is bound because of the above issue? That seems unsovlable without > > fixing up the driver. > > Not really. A REPLACE ioctl is just an ATTACH, if the device just > gets BIND-ed. So the SMMU driver will initialize ("finalise") the > domain during the replace() call, then iopt_table_add_domain() can > be done. > > So, not a blocker here. Well, yes, there sort of is because the whole flow becomes nonsensical - we are supposed to have the iommu_domain populated by the time we do replace. Otherwise replace is extra-pointless.. > > Is there another issue? > > Oh. I think we mixed the topics here. These three patches were > not to unblock but to clean up a way for the replace series and > the nesting series, for the device locking issue: > > if (cur_hwpt != hwpt) > mutex_lock(&cur_hwpt->device_lock); > mutex_lock(&hwpt->device_lock); > ... > if (iommufd_hw_pagetabe_has_group()) { // touching device list > ... > iommu_group_replace_domain(); > ... > } > if (cur_hwpt && hwpt) > list_del(&idev->devices_item); > list_add(&idev->devices_item, &cur_hwpt->devices); > ... > mutex_unlock(&hwpt->device_lock); > if (cur_hwpt != hwpt) > mutex_unlock(&cur_hwpt->device_lock); What is the issue? That isn't quite right, but the basic bit is fine If you want to do replace then you have to hold both devices_lock and you write that super ugly thing like this lock_both: if (hwpt_a < hwpt_b) { mutex_lock(&hwpt_a->devices_lock); mutex_lock_nested(&hwpt_b->devices_lock); } else if (hwpt_a > hwpt_b) { mutex_lock(&hwpt_b->devices_lock); mutex_lock_nested(&hwpt_a->devices_lock); } else mutex_lock(&hwpt_a->devices_lock); And then it is trivial, yes? Using the group_lock in the iommu core is the right way to fix this.. Maybe someday we can do that. (also document that replace causes all the devices in the group to change iommu_domains at once) > I just gave another thought about it. Since we have the patch-2 > from this series moving the ioas->mutex, it already serializes > attach/detach routines. And I see that all the places touching > idev->device_item and hwpt->devices are protected by ioas->mutex. > So, perhaps we can simply remove the device_lock? The two hwpts are not required to have the same ioas, so this doesn't really help.. Jason
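Spelled out, the address-ordered double lock sketched above would pair up like this (the hwpt struct is a stub; mutex_lock(), mutex_lock_nested() and SINGLE_DEPTH_NESTING are the real kernel primitives, the rest is illustrative):

```c
#include <linux/mutex.h>

struct hwpt_stub {
	struct mutex devices_lock;
};

/*
 * Take both hwpts' devices_lock in a stable (pointer) order so that two
 * concurrent replace() calls on the same pair can never deadlock; the
 * _nested variant is only a lockdep annotation for the second lock.
 */
static void hwpt_lock_pair(struct hwpt_stub *a, struct hwpt_stub *b)
{
	if (a == b) {
		mutex_lock(&a->devices_lock);
	} else if (a < b) {
		mutex_lock(&a->devices_lock);
		mutex_lock_nested(&b->devices_lock, SINGLE_DEPTH_NESTING);
	} else {
		mutex_lock(&b->devices_lock);
		mutex_lock_nested(&a->devices_lock, SINGLE_DEPTH_NESTING);
	}
}

static void hwpt_unlock_pair(struct hwpt_stub *a, struct hwpt_stub *b)
{
	mutex_unlock(&a->devices_lock);
	if (a != b)
		mutex_unlock(&b->devices_lock);
}
```

Taking the two locks in a fixed pointer order means two concurrent replace() calls on the same pair of hwpts always acquire them in the same order, which is what rules out the deadlock.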
On Wed, Feb 01, 2023 at 02:37:42PM -0400, Jason Gunthorpe wrote: > On Wed, Feb 01, 2023 at 09:46:23AM -0800, Nicolin Chen wrote: > > > So the issue is with replace you need to have the domain populated > > > before we can call replace but you can't populate the domain until it > > > is bound because of the above issue? That seems unsovlable without > > > fixing up the driver. > > > > Not really. A REPLACE ioctl is just an ATTACH, if the device just > > gets BIND-ed. So the SMMU driver will initialize ("finalise") the > > domain during the replace() call, then iopt_table_add_domain() can > > be done. > > > > So, not a blocker here. > > Well, yes, there sort of is because the whole flow becomes nonsensical > - we are supposed to have the iommu_domain populated by the time we do > replace. Otherwise replace is extra-pointless.. The "finalise" is one of the very first lines of the attach_dev() callback function in SMMU driver, though it might still undesirably fail the replace(). https://github.com/nicolinc/iommufd/commit/5ae54f360495aae35b5967d1eb00149912145639 Btw, this is a draft that I made to move iopt_table_add_domain(). I think we can have this with the nesting series. Later, once we pass in the dev pointer to the ->domain_alloc op using Robin's change, all the iopt_table_add_domain() can be done within the hwpt_alloc(), prior to an attach()/replace(). > > > Is there another issue? > > > > Oh. I think we mixed the topics here. These three patches were > > not to unblock but to clean up a way for the replace series and > > the nesting series, for the device locking issue: > > > > if (cur_hwpt != hwpt) > > mutex_lock(&cur_hwpt->device_lock); > > mutex_lock(&hwpt->device_lock); > > ... > > if (iommufd_hw_pagetabe_has_group()) { // touching device list > > ... > > iommu_group_replace_domain(); > > ... > > } > > if (cur_hwpt && hwpt) > > list_del(&idev->devices_item); > > list_add(&idev->devices_item, &cur_hwpt->devices); > > ... > > mutex_unlock(&hwpt->device_lock); > > if (cur_hwpt != hwpt) > > mutex_unlock(&cur_hwpt->device_lock); > > What is the issue? That isn't quite right, but the basic bit is fine > > If you want to do replace then you have to hold both devices_lock and > you write that super ugly thing like this > > lock_both: > if (hwpt_a < hwpt_b) { > mutex_lock(&hwpt_a->devices_lock); > mutex_lock_nested(&hwpt_b->devices_lock); > } else if (hwpt_a > hwpt_b) { > mutex_lock(&hwpt_b->devices_lock); > mutex_lock_nested(&hwpt_a->devices_lock); > } else > mutex_lock(&hwpt_a->devices_lock); > > And then it is trivial, yes? Yea. That's your previous remark. > Using the group_lock in the iommu core is the right way to fix > this.. Maybe someday we can do that. > > (also document that replace causes all the devices in the group to > change iommu_domains at once) Yes. There's a discussion in PATCH-3 of this series. I drafted a patch changing iommu_attach/detach_dev(): https://github.com/nicolinc/iommufd/commit/124f7804ef38d50490b606fd56c1e27ce551a839 Baolu had a similar patch series a year ago. So we might continue that effort in parallel, and eventually drop the device list/lock. > > I just gave another thought about it. Since we have the patch-2 > > from this series moving the ioas->mutex, it already serializes > > attach/detach routines. And I see that all the places touching > > idev->device_item and hwpt->devices are protected by ioas->mutex. > > So, perhaps we can simply remove the device_lock? > > The two hwpts are not required to have the same ioas, so this doesn't > really help.. 
Hmm...in that case, we should hold two ioas->mutex locks in addition to two device locks? Thanks Nic
On Wed, Feb 01, 2023 at 11:25:10AM -0800, Nicolin Chen wrote:

> The "finalise" is one of the very first lines of the attach_dev()
> callback function in SMMU driver, though it might still undesirably
> fail the replace().

It won't get that far.

Remember how this all works - only autodomains have the special path
that allocates a domain, attaches the empty domain, and then populates
it with the ioas. We made this special path specifically to accommodate
the current ARM drivers, otherwise they wouldn't work at all.

replace can't do this - replace must always start out with a
pre-existing hwpt that was allocated with a dedicated hwpt allocation
ioctl.

When the hwpt was allocated it must be linked to the IOAS at that
time, because we definitely don't do deferred IOAS linkage.

So on ARM when you create an unfinalized iommu_domain it cannot be
added to the IOAS (because it has an empty aperture) and creation will
fail, or worse, it will get added to an empty IOAS and make the IOAS
permanently unusable.

> Hmm...in that case, we should hold two ioas->mutex locks in
> addition to two device locks?

No, the device lock is the thing that protects the data you are
touching; no reason to make it any different.

Jason
On Wed, Feb 01, 2023 at 04:00:40PM -0400, Jason Gunthorpe wrote: > On Wed, Feb 01, 2023 at 11:25:10AM -0800, Nicolin Chen wrote: > > > The "finalise" is one of the very first lines of the attach_dev() > > callback function in SMMU driver, though it might still undesirably > > fail the replace(). > > It won't get that far. > > Remember how this all works - only autodomains have the special path > that allocates a domain, attaches the empty domain, and then populates > it with the ioas. We made this special path specifically to accomodate > the current ARM drivers, otherwise they wouldn't work at all. Yes. > replace can't do this - replace must always start out with a > pre-existing hwpt that was allocated with a dedicated hwpt allocation > ioctl. > > Wwhen the hwpt was allocated it must be linked to the IOAS at that > time, because we definately don't do defered IOAS linkage. > > So on ARM when you create an unfinalizes iommu_domain it cannot be > added to the IOAS (because it has an empty aperture) and creation will > fail, or worse, it will get added to an empty IOAS and make the IOAS > permanently unusable. IIUIC, user space might add some IOVA mappings to the hwpt/iopt, before doing a replace(). If we do a deferred IOAS linkage to this hwpt/iopt, it will cause a problem because we are adding the reserved region for the MSI window later than IOMMU_IOAS_MAP calls. Thus, "we definately don't do defered IOAS linkage". With this justification, I think I should include my patch of moving iopt_table_add/remove_domain(), in the replace series. > > Hmm...in that case, we should hold two ioas->mutex locks in > > addition to two device locks? > > No, the device lock is the thing that protects the data you are > touching no reason to make it any different. Thinking it deeper: we don't really touch the old_hwpt->ioas or its iopt, but only the new_hwpt->ioas/iopt to reserve the MSI window, so there isn't an actual need to hold old_hwpt->ioas's mutex. I will prepare the replace series and send it by the end of the day, upon certain level of sanity/verification. Thanks! Nic
On Wed, Feb 01, 2023 at 01:18:08PM -0800, Nicolin Chen wrote: > On Wed, Feb 01, 2023 at 04:00:40PM -0400, Jason Gunthorpe wrote: > > On Wed, Feb 01, 2023 at 11:25:10AM -0800, Nicolin Chen wrote: > > > > > The "finalise" is one of the very first lines of the attach_dev() > > > callback function in SMMU driver, though it might still undesirably > > > fail the replace(). > > > > It won't get that far. > > > > Remember how this all works - only autodomains have the special path > > that allocates a domain, attaches the empty domain, and then populates > > it with the ioas. We made this special path specifically to accomodate > > the current ARM drivers, otherwise they wouldn't work at all. > > Yes. > > > replace can't do this - replace must always start out with a > > pre-existing hwpt that was allocated with a dedicated hwpt allocation > > ioctl. > > > > Wwhen the hwpt was allocated it must be linked to the IOAS at that > > time, because we definately don't do defered IOAS linkage. > > > > So on ARM when you create an unfinalizes iommu_domain it cannot be > > added to the IOAS (because it has an empty aperture) and creation will > > fail, or worse, it will get added to an empty IOAS and make the IOAS > > permanently unusable. > > IIUIC, user space might add some IOVA mappings to the hwpt/iopt, > before doing a replace(). If we do a deferred IOAS linkage to > this hwpt/iopt, it will cause a problem because we are adding > the reserved region for the MSI window later than IOMMU_IOAS_MAP > calls. Thus, "we definately don't do defered IOAS linkage". > > With this justification, I think I should include my patch of > moving iopt_table_add/remove_domain(), in the replace series. I just posted the replace series. But I found that the base line (without the nesting series) allocates iommu_domains always with the ->domain_alloc() op. So, we cannot actually move the iopt_table_add_domain() in the replace series, as I intended to. Yet, a great news is that our nesting series replaces the domain_alloc() op entirely with ->domain_alloc_user() for all the iommu_domain allocations, including for auto_domains. So, we can completely move iopt_table_add_domain() to the hwpt allocation function. And we don't really need a big change in the SMMU driver nor Robin's patch that passes in dev ptr to domain_alloc() op. And even this device_users refcount in this patch is no longer needed. It also simplifies the shared device locking situation, if I am not missing anything. So, in short, we'll have to wait for ->domain_alloc_user() patch (in the nesting series) to unblock the problem that we discussed above regarding the iopt_table_add_domain(). Thanks Nic
On Tue, Jan 31, 2023 at 11:48:04PM -0800, Nicolin Chen wrote: > On Mon, Jan 30, 2023 at 12:54:01PM -0800, Nicolin Chen wrote: > > On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote: > > > On Mon, Jan 30, 2023 at 12:04:33PM -0800, Nicolin Chen wrote: > > > > > > > I recall we've discussed this that SMMU sets up domain when it > > > > attaches the device to, so we made a compromise here... > > > > > > The ARM driver has a problem that it doesn't know what SMMU instance > > > will host the domain when it is allocated so it doesn't know if it > > > should select a S1 or S2 page table format - which is determined by > > > the capabilities of the specific SMMU HW block. > > > > > > However, we don't have this problem when creating the S2. The S2 > > > should be created by a special alloc_domain_iommufd() asking for the > > > S2 format. Not only does the new alloc_domain_iommufd API directly > > > request a S2 format table, but it also specifies the struct device so > > > any residual details can be determined directly. > > > > > > Thus there is no need to do the two stage process when working with > > > the S2. > > > > Ah, right! Taking a quick look, we should be able to call that > > arm_smmu_domain_finalise when handling alloc_domain_iommufd(). > > > > > So fixup the driver to create fully configured iommu_domain's > > > immediately and get rid of this problem. > > > > OK. I will draft a patch today. > > @Yi > Do you recall doing iopt_table_add_domain() in hwpt_alloc()? > > Jason has a great point above. So even SMMU should be able to > call the iopt_table_add_domain() after a kernel-manged hwpt > allocation rather than after an iommu_attach_group(), except > an auto_domain or a selftest mock_domain that still needs to > attach the device first, otherwise the SMMU driver (currently) > cannot finalize the domain aperture. Some update today: I found ops->domain_alloc_user() is used for all domain allocations inside IOMMUFD. So, without any special case, we could entirely do iopt_table_add_domain() in the __iommufd_hw_pagetable_alloc() and accordingly do iopt_table_remove_domain() in the hw_pagetable_destroy(): https://github.com/nicolinc/iommufd/commit/85248e1c5f645e1eb701562eb078cf586af617fe (We can also skip that "symmetric" patch for the list_add, moving the list_add/del calls directly to alloc/destroy.) Without the complication of the add/remove_domain() calls in the do_attach/detach() functions, there is no need for the device_users counter any more. I am not 100% sure if we still need the shared device lock, though so far the sanity that I run doesn't show a problem. We may discuss about it later when we converge our branches. As before, I am also okay to do in the way with incremental changes on top of your tree and to ask you to integrate, once you have your branch ready. My full wip branch: https://github.com/nicolinc/iommufd/commits/wip/iommufd-v6.2-rc5-nesting Thanks Nic
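Stripped of the iommufd details, the direction described above amounts to tying the IOAS linkage to the hwpt's lifetime rather than to first-attach/last-detach (a pseudo-C outline with stand-in types; the real calls would be iopt_table_add_domain()/iopt_table_remove_domain()):

```c
#include <stdbool.h>
#include <stdlib.h>

struct ioas { int nr_linked_domains; };
struct hwpt { struct ioas *ioas; bool linked; };

static struct hwpt *hwpt_alloc(struct ioas *ioas)
{
	struct hwpt *hwpt = calloc(1, sizeof(*hwpt));

	if (!hwpt)
		return NULL;
	hwpt->ioas = ioas;
	/* was: deferred until the first device attach (do_attach) */
	ioas->nr_linked_domains++;		/* iopt_table_add_domain() */
	hwpt->linked = true;
	return hwpt;
}

static void hwpt_destroy(struct hwpt *hwpt)
{
	/* was: deferred until the last device detach (do_detach) */
	if (hwpt->linked)
		hwpt->ioas->nr_linked_domains--;	/* iopt_table_remove_domain() */
	free(hwpt);
}
```

With the linkage done at alloc/destroy time, do_attach()/do_detach() no longer need any first/last-device accounting, which is why the devices_users counter from this patch becomes unnecessary.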
On Wed, Feb 01, 2023 at 11:28:05PM -0800, Nicolin Chen wrote: > So, in short, we'll have to wait for ->domain_alloc_user() > patch (in the nesting series) to unblock the problem that we > discussed above regarding the iopt_table_add_domain(). I think that is probably OK It would be good to prepare a SMMUv3 patch assuming Robin's stuff lands to move the domain iopt determination to allocate so we are ready. Jason
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Thursday, February 2, 2023 5:12 PM > > On Tue, Jan 31, 2023 at 11:48:04PM -0800, Nicolin Chen wrote: > > On Mon, Jan 30, 2023 at 12:54:01PM -0800, Nicolin Chen wrote: > > > On Mon, Jan 30, 2023 at 04:35:35PM -0400, Jason Gunthorpe wrote: > > > > On Mon, Jan 30, 2023 at 12:04:33PM -0800, Nicolin Chen wrote: > > > > > > > > > I recall we've discussed this that SMMU sets up domain when it > > > > > attaches the device to, so we made a compromise here... > > > > > > > > The ARM driver has a problem that it doesn't know what SMMU > instance > > > > will host the domain when it is allocated so it doesn't know if it > > > > should select a S1 or S2 page table format - which is determined by > > > > the capabilities of the specific SMMU HW block. > > > > > > > > However, we don't have this problem when creating the S2. The S2 > > > > should be created by a special alloc_domain_iommufd() asking for the > > > > S2 format. Not only does the new alloc_domain_iommufd API directly > > > > request a S2 format table, but it also specifies the struct device so > > > > any residual details can be determined directly. > > > > > > > > Thus there is no need to do the two stage process when working with > > > > the S2. > > > > > > Ah, right! Taking a quick look, we should be able to call that > > > arm_smmu_domain_finalise when handling alloc_domain_iommufd(). > > > > > > > So fixup the driver to create fully configured iommu_domain's > > > > immediately and get rid of this problem. > > > > > > OK. I will draft a patch today. > > > > @Yi > > Do you recall doing iopt_table_add_domain() in hwpt_alloc()? Yeah, doing iopt_table_add_domain() in hwpt_alloc suits well. The only reason for current code is SMMU has drawback with it. Great to see it is solved. > > Jason has a great point above. So even SMMU should be able to > > call the iopt_table_add_domain() after a kernel-manged hwpt > > allocation rather than after an iommu_attach_group(), except > > an auto_domain or a selftest mock_domain that still needs to > > attach the device first, otherwise the SMMU driver (currently) > > cannot finalize the domain aperture. > > Some update today: I found ops->domain_alloc_user() is used > for all domain allocations inside IOMMUFD. So, without any > special case, we could entirely do iopt_table_add_domain() > in the __iommufd_hw_pagetable_alloc() and accordingly do > iopt_table_remove_domain() in the hw_pagetable_destroy(): > https://github.com/nicolinc/iommufd/commit/85248e1c5f645e1eb701562e > b078cf586af617fe > (We can also skip that "symmetric" patch for the list_add, > moving the list_add/del calls directly to alloc/destroy.) > > Without the complication of the add/remove_domain() calls > in the do_attach/detach() functions, there is no need for > the device_users counter any more. Yes. > I am not 100% sure if we still need the shared device lock, > though so far the sanity that I run doesn't show a problem. > We may discuss about it later when we converge our branches. > As before, I am also okay to do in the way with incremental > changes on top of your tree and to ask you to integrate, > once you have your branch ready. I think reusing the device lock can simplify things to avoid bad readability. However, I'll do a double-check. > My full wip branch: > https://github.com/nicolinc/iommufd/commits/wip/iommufd-v6.2-rc5- > nesting Thanks, I'm now re-integrating the nesting patches. Regards, Yi Liu
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, February 2, 2023 3:25 AM
>
> On Wed, Feb 01, 2023 at 02:37:42PM -0400, Jason Gunthorpe wrote:
> > On Wed, Feb 01, 2023 at 09:46:23AM -0800, Nicolin Chen wrote:
> > > > So the issue is with replace you need to have the domain populated
> > > > before we can call replace but you can't populate the domain until it
> > > > is bound because of the above issue? That seems unsolvable without
> > > > fixing up the driver.
> > >
> > > Not really. A REPLACE ioctl is just an ATTACH, if the device just
> > > gets BIND-ed. So the SMMU driver will initialize ("finalise") the
> > > domain during the replace() call, then iopt_table_add_domain() can
> > > be done.
> > >
> > > So, not a blocker here.
> >
> > Well, yes, there sort of is because the whole flow becomes nonsensical
> > - we are supposed to have the iommu_domain populated by the time we do
> > replace. Otherwise replace is extra pointless..
>
> The "finalise" is one of the very first lines of the attach_dev()
> callback function in the SMMU driver, though it might still undesirably
> fail the replace().
>
> https://github.com/nicolinc/iommufd/commit/5ae54f360495aae35b5967d1eb00149912145639
> Btw, this is a draft that I made to move iopt_table_add_domain(). I
> think we can have this with the nesting series.
>
> Later, once we pass in the dev pointer to the ->domain_alloc op
> using Robin's change, all the iopt_table_add_domain() can be done
> within the hwpt_alloc(), prior to an attach()/replace().
>
> > > > Is there another issue?
> > >
> > > Oh. I think we mixed the topics here. These three patches were
> > > not to unblock but to clean up a way for the replace series and
> > > the nesting series, for the device locking issue:
> > >
> > >     if (cur_hwpt != hwpt)
> > >             mutex_lock(&cur_hwpt->device_lock);
> > >     mutex_lock(&hwpt->device_lock);
> > >     ...
> > >     if (iommufd_hw_pagetable_has_group()) { // touching device list
> > >             ...
> > >             iommu_group_replace_domain();
> > >             ...
> > >     }
> > >     if (cur_hwpt && hwpt)
> > >             list_del(&idev->devices_item);
> > >     list_add(&idev->devices_item, &cur_hwpt->devices);
> > >     ...
> > >     mutex_unlock(&hwpt->device_lock);
> > >     if (cur_hwpt != hwpt)
> > >             mutex_unlock(&cur_hwpt->device_lock);
> >
> > What is the issue? That isn't quite right, but the basic bit is fine.
> >
> > If you want to do replace then you have to hold both devices_lock and
> > you write that super ugly thing like this:
> >
> > lock_both:
> >     if (hwpt_a < hwpt_b) {
> >             mutex_lock(&hwpt_a->devices_lock);
> >             mutex_lock_nested(&hwpt_b->devices_lock);
> >     } else if (hwpt_a > hwpt_b) {
> >             mutex_lock(&hwpt_b->devices_lock);
> >             mutex_lock_nested(&hwpt_a->devices_lock);
> >     } else
> >             mutex_lock(&hwpt_a->devices_lock);
> >
> > And then it is trivial, yes?
>
> Yea. That's your previous remark.
>
> > Using the group_lock in the iommu core is the right way to fix
> > this.. Maybe someday we can do that.
> >
> > (also document that replace causes all the devices in the group to
> > change iommu_domains at once)
>
> Yes. There's a discussion in PATCH-3 of this series. I drafted a
> patch changing iommu_attach/detach_dev():
> https://github.com/nicolinc/iommufd/commit/124f7804ef38d50490b606fd56c1e27ce551a839
>
> Baolu had a similar patch series a year ago. So we might continue
> that effort in parallel, and eventually drop the device list/lock.
>
> > > I just gave another thought about it. Since we have the patch-2
> > > from this series moving the ioas->mutex, it already serializes
> > > attach/detach routines. And I see that all the places touching
> > > idev->devices_item and hwpt->devices are protected by ioas->mutex.
> > > So, perhaps we can simply remove the device_lock?
> >
> > The two hwpts are not required to have the same ioas, so this doesn't
> > really help..
>
> Hmm... in that case, we should hold two ioas->mutex locks in
> addition to two device locks?

Aha, seems so. You can replace an S1 hwpt with another S1 hwpt which
has a different ioas. 😊 Maybe this is something incremental to the
nesting series. In the nesting series, the ioas and the S1 hwpt can
share the device_lock, can't they?

Regards,
Yi Liu
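To make the locking rule discussed above concrete, here is a hedged
sketch of pointer-ordered locking for two hwpts; the helper names and
the SINGLE_DEPTH_NESTING subclass are assumptions rather than anything
from the series, and the same ordering trick would extend to the two
ioas->mutex locks if both need to be held.

/*
 * Illustrative helpers, not from the actual series: take both hwpts'
 * devices_lock in a stable (address) order so two concurrent replace
 * operations going in opposite directions cannot deadlock.  lockdep
 * needs the second acquisition annotated with a nesting subclass.
 * swap() is the kernel macro from <linux/minmax.h>.
 */
static void iommufd_hwpt_lock_pair(struct iommufd_hw_pagetable *a,
                                   struct iommufd_hw_pagetable *b)
{
        if (a == b) {
                mutex_lock(&a->devices_lock);
                return;
        }
        if (a > b)
                swap(a, b);
        mutex_lock(&a->devices_lock);
        mutex_lock_nested(&b->devices_lock, SINGLE_DEPTH_NESTING);
}

static void iommufd_hwpt_unlock_pair(struct iommufd_hw_pagetable *a,
                                     struct iommufd_hw_pagetable *b)
{
        mutex_unlock(&a->devices_lock);
        if (a != b)
                mutex_unlock(&b->devices_lock);
}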
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 9f3b9674d72e..208757c39c90 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -212,7 +212,7 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 				hwpt->domain->ops->enforce_cache_coherency(
 					hwpt->domain);
 		if (!hwpt->enforce_cache_coherency) {
-			WARN_ON(list_empty(&hwpt->devices));
+			WARN_ON(refcount_read(hwpt->devices_users) == 1);
 			rc = -EINVAL;
 			goto out_unlock;
 		}
@@ -236,7 +236,7 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 		if (rc)
 			goto out_iova;
 
-		if (list_empty(&hwpt->devices)) {
+		if (refcount_read(hwpt->devices_users) == 1) {
 			rc = iopt_table_add_domain(&hwpt->ioas->iopt,
 						   hwpt->domain);
 			if (rc)
@@ -246,6 +246,7 @@ static int iommufd_device_do_attach(struct iommufd_device *idev,
 
 	idev->hwpt = hwpt;
 	refcount_inc(&hwpt->obj.users);
+	refcount_inc(hwpt->devices_users);
 	list_add(&idev->devices_item, &hwpt->devices);
 	mutex_unlock(&hwpt->devices_lock);
 	return 0;
@@ -387,9 +388,10 @@ void iommufd_device_detach(struct iommufd_device *idev)
 
 	mutex_lock(&hwpt->ioas->mutex);
 	mutex_lock(&hwpt->devices_lock);
+	refcount_dec(hwpt->devices_users);
 	list_del(&idev->devices_item);
 	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
-		if (list_empty(&hwpt->devices)) {
+		if (refcount_read(hwpt->devices_users) == 1) {
 			iopt_table_remove_domain(&hwpt->ioas->iopt,
 						 hwpt->domain);
 			list_del(&hwpt->hwpt_item);
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 43d473989a06..910e759ffeac 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -16,6 +16,8 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
 	iommu_domain_free(hwpt->domain);
 	refcount_dec(&hwpt->ioas->obj.users);
 	mutex_destroy(&hwpt->devices_lock);
+	WARN_ON(!refcount_dec_if_one(hwpt->devices_users));
+	kfree(hwpt->devices_users);
 }
 
 /**
@@ -46,11 +48,20 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 	INIT_LIST_HEAD(&hwpt->devices);
 	INIT_LIST_HEAD(&hwpt->hwpt_item);
 	mutex_init(&hwpt->devices_lock);
+	hwpt->devices_users = kzalloc(sizeof(*hwpt->devices_users), GFP_KERNEL);
+	if (!hwpt->devices_users) {
+		rc = -ENOMEM;
+		goto out_free_domain;
+	}
+	refcount_set(hwpt->devices_users, 1);
+
 	/* Pairs with iommufd_hw_pagetable_destroy() */
 	refcount_inc(&ioas->obj.users);
 	hwpt->ioas = ioas;
 	return hwpt;
 
+out_free_domain:
+	iommu_domain_free(hwpt->domain);
 out_abort:
 	iommufd_object_abort(ictx, &hwpt->obj);
 	return ERR_PTR(rc);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 222e86591f8a..f128d77fb076 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -247,6 +247,7 @@ struct iommufd_hw_pagetable {
 	/* Head at iommufd_ioas::hwpt_list */
 	struct list_head hwpt_item;
 	struct mutex devices_lock;
+	refcount_t *devices_users;
 	struct list_head devices;
 };
 
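A note on the counting convention in the diff above: devices_users is
set to 1 at allocation, each successful attach takes another reference
and each detach drops one, so the counter never reaches zero while the
hwpt exists and a reading of 1 means "no device currently attached" -
the same condition the old list_empty(&hwpt->devices) test expressed,
while refcount_dec_if_one() in destroy doubles as an assertion that the
last device really has detached. A small illustrative helper (not part
of the patch, name hypothetical) spells that out:

/* Illustration only: 1 is the "empty" value because of the initial ref */
static bool iommufd_hw_pagetable_no_devices(struct iommufd_hw_pagetable *hwpt)
{
        return refcount_read(hwpt->devices_users) == 1;
}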